Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board

David Hutchison (Lancaster University, UK)
Takeo Kanade (Carnegie Mellon University, Pittsburgh, PA, USA)
Josef Kittler (University of Surrey, Guildford, UK)
Jon M. Kleinberg (Cornell University, Ithaca, NY, USA)
Alfred Kobsa (University of California, Irvine, CA, USA)
Friedemann Mattern (ETH Zurich, Switzerland)
John C. Mitchell (Stanford University, CA, USA)
Moni Naor (Weizmann Institute of Science, Rehovot, Israel)
Oscar Nierstrasz (University of Bern, Switzerland)
C. Pandu Rangan (Indian Institute of Technology, Madras, India)
Bernhard Steffen (TU Dortmund University, Germany)
Madhu Sudan (Microsoft Research, Cambridge, MA, USA)
Demetri Terzopoulos (University of California, Los Angeles, CA, USA)
Doug Tygar (University of California, Berkeley, CA, USA)
Gerhard Weikum (Max Planck Institute for Informatics, Saarbruecken, Germany)
Yang Xiang, Alfredo Cuzzocrea, Michael Hobbs, Wanlei Zhou (Eds.)

Algorithms and Architectures for Parallel Processing
11th International Conference, ICA3PP 2011
Melbourne, Australia, October 24-26, 2011
Proceedings, Part II
Volume Editors

Yang Xiang, Wanlei Zhou
Deakin University, School of Information Technology
Melbourne Burwood Campus, 221 Burwood Highway
Burwood, VIC 3125, Australia
E-mail: {yang, wanlei}@deakin.edu.au

Alfredo Cuzzocrea
ICAR-CNR and University of Calabria
Via P. Bucci 41 C, 87036 Rende (CS), Italy
E-mail: [email protected]

Michael Hobbs
Deakin University, School of Information Technology
Geelong Waurn Ponds Campus, Pigdons Road
Geelong, VIC 3217, Australia
E-mail: [email protected]
ISSN 0302-9743, e-ISSN 1611-3349
ISBN 978-3-642-24668-5, e-ISBN 978-3-642-24669-2
DOI 10.1007/978-3-642-24669-2
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2011937820
CR Subject Classification (1998): F.2, H.4, D.2, I.2, G.2, H.3
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
© Springer-Verlag Berlin Heidelberg 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Message from the ADCN 2011 Chairs
We are happy to welcome you to the 2011 International Symposium on Advances of Distributed Computing and Networking (ADCN 2011). ADCN 2011 is held in conjunction with the 11th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2011), Melbourne, Australia, October 24-26, 2011. ADCN 2011 contains 16 full papers selected from those submitted to the ICA3PP 2011 main track. All the papers were peer reviewed by members of the ICA3PP 2011 Program Committee. The symposium covers a broad range of topics in the field of parallel and distributed computing, such as cluster, distributed, and parallel operating systems and middleware; cloud, grid, and services computing; reliability and fault-tolerant computing; multi-core programming and software tools; distributed scheduling and load balancing; high-performance scientific computing; parallel algorithms; parallel architectures; parallel and distributed databases; parallel I/O systems and storage systems; parallel programming paradigms; performance of parallel and distributed computing systems; resource management and scheduling; tools and environments for parallel and distributed software development; software and hardware reliability, testing, verification and validation; security, privacy, and trusted computing; self-healing, self-protecting and fault-tolerant systems; information security on the Internet; multimedia in parallel computing; parallel computing in bioinformatics; dependability issues in computer networks and communications; dependability issues in distributed and parallel systems; dependability issues in embedded parallel systems; industrial applications; and scientific applications. We thank the authors for submitting their work and the members of the ICA3PP 2011 Program Committee for managing the reviews of the ADCN 2011 symposium papers in such a short time. We firmly believe this symposium complements perfectly the topics covered by ICA3PP 2011, and provides additional breadth and depth to the main conference. Finally, we hope you enjoy the symposium and have a fruitful meeting in Melbourne, Australia.

August 2011
Wanlei Zhou
Alfredo Cuzzocrea
Michael Hobbs
Message from the IDCS 2011 Chairs
It is our great pleasure that the accepted papers of the 4th International Workshop on Internet and Distributed Computing Systems (IDCS 2011) are included in the proceedings of the 11th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2011), held in Melbourne, Australia, during October 24-26, 2011. Following the previous three successful IDCS workshops – IDCS 2008 in Dhaka, Bangladesh; IDCS 2009 on Jeju Island, Korea; and IDCS 2010 in Melbourne, Australia – IDCS 2011 is the fourth in its series to promote research in diverse fields related to Internet and distributed computing systems. In this workshop, we are interested in presenting innovative papers on emerging technologies related to Internet and distributed systems to support the effective design and efficient implementation of high-performance computer networks. The areas of interest for this year's event are the following:

– Internet architectures and protocols
– modeling and evaluation of Internet-based systems
– Internet quality of service
– grid, cloud, and P2P computing
– middleware for wireless sensor networks
– security of network-based systems
– network-based applications (VoIP, streaming)
– network management and traffic engineering
– tools and techniques for network measurements
The target audience of this event includes researchers and industry practitioners interested in different aspects of the Internet and distributed systems, with a particular focus on practical experience with the design and implementation of related technologies as well as their theoretical perspectives. We received 23 submissions from 7 different countries. Each submission was reviewed by three members of the international Program Committee. After a rigorous review process, we selected 10 papers for inclusion in the workshop program. We plan to invite extended and enhanced versions of top-quality selected papers for submission on a fast-track basis to the Springer Journal of Internet Services and Applications (JISA) and the International Journal of Internet and Distributed Computing Systems (IJIDCS). In addition, selected papers in the information security area will be recommended for publication in the International Journal of Risk and Contingency Management. The organization of IDCS 2011 included direct or indirect contributions from many individuals, including program chairs, Program Committee members, external reviewers, logistics personnel, and student volunteers. We would like to thank Dr Wen Tao Zhu and Dr Muhammad Khurram Khan for accepting the
IDCS 2011 workshop proposal within ICA3PP. Special thanks to ICA3PP general chairs Andrzej Goscinski and Peter Brezany, as well as program chairs Yang Xiang, Alfredo Cuzzocrea, and Michael Hobbs for their continuous support in making IDCS 2011 a success. Last but not least, we express our gratitude to all authors of the accepted and submitted papers. Their contribution has made these proceedings a scholarly compilation of exciting research outcomes. August 2011
Jemal Abawajy
Giancarlo Fortino
Ragib Hasan
Mustafizur Rahman
IDCS 2011 Organizing Committee
Workshop Chairs

Jemal Abawajy (Deakin University, Australia)
Giancarlo Fortino (University of Calabria, Italy)
Ragib Hasan (Johns Hopkins University, USA)
Mustafizur Rahman (IBM, Australia)
Web, Publicity and Logistics Chairs

Al-Sakib Khan Pathan (International Islamic University, Malaysia)
Mukaddim Pathan (CSIRO, Australia)
International Program Committee

Joaquín García-Alfaro (TÉLÉCOM Bretagne, France)
Doina Bein (Pennsylvania State University, USA)
Rajkumar Buyya (University of Melbourne, Australia)
Antonio Coronato (ICAR-CNR, Italy)
Mustafa Mat Deris (Universiti Tun Hussein Onn, Malaysia)
Zongming Fei (University of Kentucky, USA)
S.K. Ghosh (IIT-Kharagpur, India)
Victor Govindaswamy (Texas A&M University-Texarkana, USA)
Jaehoon Paul Jeong (University of Minnesota, USA)
Syed Ishtiaque Ahmed (BUET, Bangladesh)
Tarem Ahmed (Brac University, Bangladesh)
Mohammad Mehedi Hassan (Kyung Hee University, South Korea)
Dimitrios Katsaros (University of Thessaly, Greece)
Fahim Kawsar (Bell Labs, Belgium, and Lancaster University, UK)
Ram Krishnan (University of Texas at San Antonio, USA)
Hae Young Lee (ETRI, South Korea)
Ignacio M. Llorente (Universidad Complutense de Madrid, Spain)
Carlo Mastroianni (ICAR-CNR, Italy)
Jaime Lloret Mauri (Universidad Politécnica de Valencia, Spain)
Sudip Misra (IIT-Kharagpur, India)
Muhammad Mostafa Monowar (University of Chittagong, Bangladesh)
Manzur Murshed (Monash University, Australia)
Marco Netto (IBM Research, Brazil)
George Pallis (University of Cyprus, Cyprus)
Rajiv Ranjan (University of New South Wales, Australia)
Thomas Repantis (Akamai Technologies, USA)
Riaz Ahmed Shaikh (University of Quebec in Outaouais, Canada)
Ramesh Sitaraman (University of Massachusetts, USA)
Mostafa Al Masum Shaikh (University of Tokyo, Japan)
Paolo Trunfio (University of Calabria, Italy)
Christian Vecchiola (University of Melbourne, Australia)
Spyros Voulgaris (Vrije Universiteit, The Netherlands)
Anwar Walid (Alcatel-Lucent Bell Labs, USA)
Lizhe Wang (Indiana University, USA)
Bin Xie (InfoBeyond Technology, USA)
Norihiko Yoshida (Saitama University, Japan)
M2A2 Foreword
It is with great pleasure that we present the proceedings of the Third International Workshop on Multicore and Multithreaded Architectures and Algorithms (M2A2 2011), held in conjunction with the 11th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2011) in Melbourne, Australia. Multicore systems are dominating the processor market, and it is expected that the number of cores will continue to increase in most commercial systems, whether high-performance, desktop, or embedded. This trend is driven by the need to increase the efficiency of the major system components, that is, the cores, the memory hierarchy, and the interconnection network. For this purpose, the system designer must trade off performance against power consumption, which is a major concern in current microprocessors. Therefore, new architectures or architectural mechanisms addressing this trade-off are required. In this context, load balancing and scheduling can help to improve energy savings. In addition, it remains a challenge to identify and productively program applications for these architectures so as to obtain a substantial performance improvement. The M2A2 2011 workshop provided a forum for engineers and scientists to address these challenges and to present new ideas, applications, and experience on all aspects of multicore and multithreaded systems. This year, because of the high quality of the submitted papers, only about 38% of the papers were accepted for the workshop. We would like to express our most sincere appreciation to everyone contributing to the success of this workshop. First, we thank the authors of the submitted papers for their efforts in their research work. Then, we thank the TPC members and the reviewers for their invaluable and constructive comments. Finally, we thank our sponsors for their support of this workshop.

August 2011
Houcine Hassan
Julio Sahuquillo
General Co-chairs

Houcine Hassan (Universidad Politécnica de Valencia, Spain)
Julio Sahuquillo (Universidad Politécnica de Valencia, Spain)
Steering Committee

Laurence T. Yang (St Francis Xavier University, Canada)
Jong Hyuk Park (Seoul National University of Technology, Korea)
Program Committee

Hideharu Amano (Keio University, Japan)
Hamid R. Arabnia (The University of Georgia, USA)
Luca Benini (University of Bologna, Italy)
Luis Gomes (Universidade Nova de Lisboa, Portugal)
Antonio Gentile (Università di Palermo, Italy)
Zonghua Gu (Hong Kong University of Science and Technology, Hong Kong)
Rajiv Gupta (University of California, Riverside, USA)
Houcine Hassan (Universidad Politécnica de Valencia, Spain)
Seongsoo Hong (Seoul National University, Korea)
Shih-Hao Hung (National Taiwan University, Taiwan)
Eugene John (University of Texas at San Antonio, USA)
Seon Wook Kim (Korea University, Korea)
Jihong Kim (Seoul National University, Korea)
Chang-Gun Lee (Seoul National University, Korea)
Sebastian Lopez (Universidad de Las Palmas, Spain)
Yoshimasa Nakamura (Kyoto University, Japan)
Sabri Pllana (University of Vienna, Austria)
Julio Sahuquillo (Universidad Politécnica de Valencia, Spain)
Zili Shao (The Hong Kong Polytechnic University, Hong Kong)
Kenjiro Taura (University of Tokyo, Japan)
HardBio 2011 Foreword
It gives us great pleasure to introduce this small collection of papers presented at the First International Workshop on Parallel Architectures for Bioinformatics Systems (HardBio 2011), October 23-26, 2011, Melbourne, Australia. Bioinformatics is a research field that focuses on algorithms and statistical techniques that allow efficient interpretation, classification, and understanding of biological datasets, to the general benefit of mankind. The datasets typically consist of huge numbers of DNA, RNA, or protein sequences. Sequence alignment is used to assemble the datasets for analysis. Comparisons of homologous sequences, gene finding, and prediction of gene expression are the most common techniques used on assembled datasets; however, the analysis of such datasets has many applications throughout all fields of biology. The downside of bioinformatics-related applications is that they require an enormous computational effort to execute. Therefore, a lot of research effort is being channeled towards the development of special-purpose hardware accelerators and dedicated parallel processors that allow for the efficient execution of this kind of application. The Program Committee received 12 submissions, from which it selected 4 for presentation and publication. Each paper was evaluated by three referees. Technical quality, originality, relevance, and clarity were the primary criteria for selection. We wish to thank all those who submitted manuscripts for consideration. We also wish to thank the members of the Program Committee who reviewed all of the submissions. We hope that many more researchers will submit the results of their work to next year's workshop.

August 2011
Nadia Nedjah
Luiza de Macedo Mourelle
Program Committee

Felipe Maia Galvão França (Federal University of Rio de Janeiro, Brazil)
Nader Bagherzadeh (University of California, Irvine, USA)
Leandro dos Santos Coelho (Pontifical Catholic University of Paraná, Brazil)
Jurij Silc (Jozef Stefan Institute, Slovenia)
Heitor Silvério Lopes (Federal Technological University of Paraná, Brazil)
Lech Józwiak (Eindhoven University of Technology, The Netherlands)
Zhihua Cui (Taiyuan University of Science and Technology, China)
Hamid Sarbazi-Azad (Sharif University of Technology, Iran)
Table of Contents – Part II
ADCN 2011 Papers

Lightweight Transactional Arrays for Read-Dominated Workloads . . . . 1
   Ivo Anjo and João Cachopo
Massively Parallel Identification of Intersection Points for GPGPU Ray Tracing . . . . 14
   Alexandre Solon Nery, Nadia Nedjah, Felipe M.G. França, and Lech Jozwiak
Cascading Multi-way Bounded Wait Timer Management for Moody and Autonomous Systems . . . . 24
   Asrar Ul Haque and Javed I. Khan
World-Wide Distributed Multiple Replications in Parallel for Quantitative Sequential Simulation . . . . 33
   Mofassir Haque, Krzysztof Pawlikowski, Don McNickle, and Gregory Ewing
Comparison of Three Parallel Point-Multiplication Algorithms on Conic Curves . . . . 43
   Yongnan Li, Limin Xiao, Guangjun Qin, Xiuqiao Li, and Songsong Lei
Extending Synchronization Constructs in OpenMP to Exploit Pipeline Parallelism on Heterogeneous Multi-core . . . . 54
   Shigang Li, Shucai Yao, Haohu He, Lili Sun, Yi Chen, and Yunfeng Peng
Generic Parallel Genetic Algorithm Framework for Protein Optimisation . . . . 64
   Lukas Folkman, Wayne Pullan, and Bela Stantic
A Survey on Privacy Problems and Solutions for VANET Based on Network Model . . . . 74
   Hun-Jung Lim and Tai-Myoung Chung
Scheduling Tasks and Communications on a Hierarchical System with Message Contention . . . . 89
   Jean-Yves Colin and Moustafa Nakechbandi
Spiking Neural P System Simulations on a High Performance GPU Platform . . . . 99
   Francis George Cabarle, Henry Adorna, Miguel A. Martínez-del-Amor, and Mario J. Pérez-Jiménez
SpotMPI: A Framework for Auction-Based HPC Computing Using Amazon Spot Instances . . . . 109
   Moussa Taifi, Justin Y. Shi, and Abdallah Khreishah
Investigating the Scalability of OpenFOAM for the Solution of Transport Equations and Large Eddy Simulations . . . . 121
   Orlando Rivera, Karl Fürlinger, and Dieter Kranzlmüller
Shibboleth and Community Authorization Services: Enabling Role-Based Grid Access . . . . 131
   Fan Gao and Jefferson Tan
A Secure Internet Voting Scheme . . . . 141
   Md. Abdul Based and Stig Fr. Mjølsnes
A Hybrid Graphical Password Based System . . . . 153
   Wazir Zada Khan, Yang Xiang, Mohammed Y. Aalsalem, and Quratulain Arshad
Privacy Threat Analysis of Social Network Data . . . . 165
   Mohd Izuan Hafez Ninggal and Jemal Abawajy
IDCS 2011 Papers

Distributed Mechanism for Protecting Resources in a Newly Emerged Digital Ecosystem Technology . . . . 175
   Ilung Pranata, Geoff Skinner, and Rukshan Athauda
Reservation-Based Charging Service for Electric Vehicles . . . . 186
   Junghoon Lee, Gyung-Leen Park, and Hye-Jin Kim
Intelligent Ubiquitous Sensor Network for Agricultural and Livestock Farms . . . . 196
   Junghoon Lee, Hye-Jin Kim, Gyung-Leen Park, Ho-Young Kwak, and Cheol Min Kim
Queue-Based Adaptive Duty Cycle Control for Wireless Sensor Networks . . . . 205
   Heejung Byun and Jungmin So
Experimental Evaluation of a Failure Detection Service Based on a Gossip Strategy . . . . 215
   Leandro P. de Sousa and Elias P. Duarte Jr.
On the Performance of MPI-OpenMP on a 12 Nodes Multi-core Cluster . . . . 225
   Abdelgadir Tageldin Abdelgadir, Al-Sakib Khan Pathan, and Mohiuddin Ahmed
A Protocol for Discovering Content Adaptation Services . . . . 235
   Mohd Farhan Md Fudzee and Jemal Abawajy
Securing RFID Systems from SQLIA . . . . 245
   Harinda Fernando and Jemal Abawajy
Modeling QoS Parameters of VoIP Traffic with Multifractal and Markov Models . . . . 255
   Homero Toral-Cruz, Al-Sakib Khan Pathan, and Julio C. Ramírez-Pacheco
Hybrid Feature Selection for Phishing Email Detection . . . . 266
   Isredza Rahmi A. Hamid and Jemal Abawajy
M2A2 2011 Papers

On the Use of Multiplanes on a 2D Mesh Network-on-Chip . . . . 276
   Cruz Izu
A Minimal Average Accessing Time Scheduler for Multicore Processors . . . . 287
   Thomas Canhao Xu, Pasi Liljeberg, and Hannu Tenhunen
Fast Software Implementation of AES-CCM on Multiprocessors . . . . 300
   Jung Ho Yoo
A TCM-Enabled Access Control Scheme . . . . 312
   Gongxuan Zhang, Zhaomeng Zhu, Pingli Wang, and Bin Song
Binary Addition Chain on EREW PRAM . . . . 321
   Khaled A. Fathy, Hazem M. Bahig, Hatem M. Bahig, and A.A. Ragb
A Portable Infrastructure Supporting Global Scheduling of Embedded Real-Time Applications on Asymmetric MPSoCs . . . . 331
   Eugenio Faldella and Primiano Tucci
Emotional Contribution Process Implementations on Parallel Processors . . . . 343
   Carlos Domínguez, Houcine Hassan, José Albaladejo, Maria Marco, and Alfons Crespo
A Cluster Computer Performance Predictor for Memory Scheduling . . . . 353
   Mónica Serrano, Julio Sahuquillo, Houcine Hassan, Salvador Petit, and José Duato
HardBio 2011 Papers

Reconfigurable Hardware Computing for Accelerating Protein Folding Simulations Using the Harmony Search Algorithm and the 3D-HP-Side Chain Model . . . . 363
   César Manuel Vargas Benítez, Marlon Scalabrin, Heitor Silvério Lopes, and Carlos R. Erig Lima
Clustering Nodes in Large-Scale Biological Networks Using External Memory Algorithms . . . . 375
   Ahmed Shamsul Arefin, Mario Inostroza-Ponta, Luke Mathieson, Regina Berretta, and Pablo Moscato
Reconfigurable Hardware to Radionuclide Identification Using Subtractive Clustering . . . . 387
   Marcos Santana Farias, Nadia Nedjah, and Luiza de Macedo Mourelle
A Parallel Architecture for DNA Matching . . . . 399
   Edgar J. Garcia Neto Segundo, Nadia Nedjah, and Luiza de Macedo Mourelle

Author Index . . . . 409
Table of Contents – Part I
ICA3PP 2011 Keynote

Keynote: Assertion Based Parallel Debugging . . . . 1
   David Abramson
ICA3PP 2011 Regular Papers

Secure and Energy-Efficient Data Aggregation with Malicious Aggregator Identification in Wireless Sensor Networks . . . . 2
   Hongjuan Li, Keqiu Li, Wenyu Qu, and Ivan Stojmenovic
Dynamic Data Race Detection for Correlated Variables . . . . 14
   Ali Jannesari, Markus Westphal-Furuya, and Walter F. Tichy
Improving the Parallel Schnorr-Euchner LLL Algorithm . . . . 27
   Werner Backes and Susanne Wetzel
Distributed Mining of Constrained Frequent Sets from Uncertain Data . . . . 40
   Alfredo Cuzzocrea and Carson K. Leung
Set-to-Set Disjoint-Paths Routing in Recursive Dual-Net . . . . 54
   Yamin Li, Shietung Peng, and Wanming Chu
Redflag: A Framework for Analysis of Kernel-Level Concurrency . . . . 66
   Justin Seyster, Prabakar Radhakrishnan, Samriti Katoch, Abhinav Duggal, Scott D. Stoller, and Erez Zadok
Exploiting Parallelism in the H.264 Deblocking Filter by Operation Reordering . . . . 80
   Tsung-Hsi Weng, Yi-Ting Wang, and Chung-Ping Chung
Compiler Support for Concurrency Synchronization . . . . 93
   Tzong-Yen Lin, Cheng-Yu Lee, Chia-Jung Chen, and Rong-Guey Chang
Fault-Tolerant Routing Based on Approximate Directed Routable Probabilities for Hypercubes . . . . 106
   Dinh Thuy Duong and Keiichi Kaneko
Finding a Hamiltonian Cycle in a Hierarchical Dual-Net with Base Network of p-Ary q-Cube . . . . 117
   Yamin Li, Shietung Peng, and Wanming Chu
Adaptive Resource Remapping through Live Migration of Virtual Machines . . . . 129
   Muhammad Atif and Peter Strazdins
LUTS: A Lightweight User-Level Transaction Scheduler . . . . 144
   Daniel Nicácio, Alexandro Baldassin, and Guido Araújo
Verification of Partitioning and Allocation Techniques on Teradata DBMS . . . . 158
   Ladjel Bellatreche, Soumia Benkrid, Ahmad Ghazal, Alain Crolotte, and Alfredo Cuzzocrea
Memory Performance and SPEC OpenMP Scalability on Quad-Socket x86_64 Systems . . . . 170
   Daniel Molka, Robert Schöne, Daniel Hackenberg, and Matthias S. Müller
Anonymous Communication over Invisible Mix Rings . . . . 182
   Ming Zheng, Haixin Duan, and Jianping Wu
Game-Based Distributed Resource Allocation in Horizontal Dynamic Cloud Federation Platform . . . . 194
   Mohammad Mehedi Hassan, Biao Song, and Eui-Nam Huh
Stream Management within the CloudMiner . . . . 206
   Yuzhang Han, Peter Brezany, and Andrzej Goscinski
Security Architecture for Virtual Machines . . . . 218
   Udaya Tupakula, Vijay Varadharajan, and Abhishek Bichhawat
Fast and Accurate Similarity Searching of Biopolymer Sequences with GPU and CUDA . . . . 230
   Robert Pawlowski, Bożena Malysiak-Mrozek, Stanislaw Kozielski, and Dariusz Mrozek
Read Invisibility, Virtual World Consistency and Probabilistic Permissiveness are Compatible . . . . 244
   Tyler Crain, Damien Imbs, and Michel Raynal
Parallel Implementations of Gusfield's Cut Tree Algorithm . . . . 258
   Jaime Cohen, Luiz A. Rodrigues, Fabiano Silva, Renato Carmo, André L.P. Guedes, and Elias P. Duarte Jr.
Efficient Parallel Implementations of Controlled Optimization of Traffic Phases . . . . 270
   Sameh Samra, Ahmed El-Mahdy, Walid Gomaa, Yasutaka Wada, and Amin Shoukry
Scheduling Concurrent Workflows in HPC Cloud through Exploiting Schedule Gaps . . . . 282
   He-Jhan Jiang, Kuo-Chan Huang, Hsi-Ya Chang, Di-Syuan Gu, and Po-Jen Shih
Efficient Decoding of QC-LDPC Codes Using GPUs . . . . 294
   Yue Zhao, Xu Chen, Chiu-Wing Sham, Wai M. Tam, and Francis C.M. Lau
ICA3PP 2011 Short Papers

A Combined Arithmetic Logic Unit and Memory Element for the Design of a Parallel Computer . . . . 306
   Mohammed Ziaur Rahman
Parallel Implementation of External Sort and Join Operations on a Multi-core Network-Optimized System on a Chip . . . . 318
   Elahe Khorasani, Brent D. Paulovicks, Vadim Sheinin, and Hangu Yeo
STM with Transparent API Considered Harmful . . . . 326
   Fernando Miguel Carvalho and Joao Cachopo
A Global Snapshot Collection Algorithm with Concurrent Initiators with Non-FIFO Channel . . . . 338
   Diganta Goswami and Soumyadip Majumder
An Approach for Code Compression in Run Time for Embedded Systems – A Preliminary Results . . . . 349
   Wanderson Roger Azevedo Dias, Edward David Moreno, and Raimundo da Silva Barreto
Optimized Two Party Privacy Preserving Association Rule Mining Using Fully Homomorphic Encryption . . . . 360
   Md. Golam Kaosar, Russell Paulet, and Xun Yi
SLA-Based Resource Provisioning for Heterogeneous Workloads in a Virtualized Cloud Datacenter . . . . 371
   Saurabh Kumar Garg, Srinivasa K. Gopalaiyengar, and Rajkumar Buyya
ΣC: A Programming Model and Language for Embedded Manycores . . . . 385
   Thierry Goubier, Renaud Sirdey, Stéphane Louise, and Vincent David
Provisioning Spot Market Cloud Resources to Create Cost-Effective Virtual Clusters . . . . 395
   William Voorsluys, Saurabh Kumar Garg, and Rajkumar Buyya
A Principled Approach to Grid Middleware: Status Report on the Minimum Intrusion Grid . . . . 409
   Jost Berthold, Jonas Bardino, and Brian Vinter
Performance Analysis of Preemption-Aware Scheduling in Multi-cluster Grid Environments . . . . 419
   Mohsen Amini Salehi, Bahman Javadi, and Rajkumar Buyya
Performance Evaluation of Open Source Seismic Data Processing Packages . . . . 433
   Izzatdin A. Aziz, Andrzej M. Goscinski, and Michael M. Hobbs
Reputation-Based Resource Allocation in Market-Oriented Distributed Systems . . . . 443
   Masnida Hussin, Young Choon Lee, and Albert Y. Zomaya
Cooperation-Based Trust Model and Its Application in Network Security Management . . . . 453
   Wu Liu, Hai-xin Duan, and Ping Ren
Performance Evaluation of the Three-Dimensional Finite-Difference Time-Domain (FDTD) Method on Fermi Architecture GPUs . . . . 460
   Kaixi Hou, Ying Zhao, Jiumei Huang, and Lingjie Zhang
The Probability Model of Peer-to-Peer Botnet Propagation . . . . 470
   Yini Wang, Sheng Wen, Wei Zhou, Wanlei Zhou, and Yang Xiang
A Parallelism Extended Approach for the Enumeration of Orthogonal Arrays . . . . 481
   Hien Phan, Ben Soh, and Man Nguyen

Author Index . . . . 495
Lightweight Transactional Arrays for Read-Dominated Workloads

Ivo Anjo and João Cachopo

ESW INESC-ID Lisboa / Instituto Superior Técnico / Universidade Técnica de Lisboa
Rua Alves Redol 9, 1000-029 Lisboa, Portugal
{ivo.anjo,joao.cachopo}@ist.utl.pt
Abstract. Many common workloads rely on arrays as a basic data structure on top of which they build more complex behavior. Others use them because they are a natural representation for their problem domains. Software Transactional Memory (STM) has been proposed as a new concurrency control mechanism that simplifies concurrent programming. Yet, most STM implementations have no special representation for arrays. This results, on many STMs, in inefficient internal representations, where much overhead is added while tracking each array element individually, and on other STMs in false-sharing conflicts, because writes to different elements on the same array result in a conflict. In this work we propose new designs for array implementations that are integrated with the STM, allowing for improved performance and reduced memory usage for read-dominated workloads, and present the results of our implementation of the new designs on top of the JVSTM, a Java library STM. Keywords: Parallel Programming, Software Transactional Memory.
1 Introduction
Software Transactional Memory (STM) [10, 15] is a concurrency control mechanism for multicore and multiprocessor shared-memory systems, aimed at simplifying concurrent application development. STM provides features such as atomicity and isolation for program code, while eliminating common pitfalls of concurrent programming such as deadlocks and data races. During a transaction, most STMs internally work by tracking the memory read and write operations done by the application on thread-local read and write-sets. Tracking this metadata adds overheads to applications that depend on the granularity of transactional memory locations. There are two main STM designs regarding granularity: Either word-based [4, 8] or object-based [7, 11]. Word-based designs associate metadata with either each individual memory location, or by mapping them to a fixed-size table; whereas object-based designs store
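For illustration, a word-based design with a fixed-size metadata table typically hashes a location (on the JVM, an object/field pair) to a table slot. The sketch below is a generic example of such a mapping, with made-up names; it is not the scheme of any particular STM mentioned in this paper.

// A fixed-size table of transactional metadata: distinct locations may hash
// to the same slot (possible false conflicts), but memory usage is bounded.
final class StripeTable {
    private final long[] versionedLocks;  // one metadata word per stripe

    StripeTable(int sizePowerOfTwo) {
        this.versionedLocks = new long[1 << sizePowerOfTwo];
    }

    // Map an object/field (identified here by its identity hash and field
    // offset) to a stripe index; the table size is a power of two.
    int stripeFor(Object obj, long fieldOffset) {
        int h = System.identityHashCode(obj) ^ (int) fieldOffset;
        return (h ^ (h >>> 16)) & (versionedLocks.length - 1);
    }

    long metadataFor(Object obj, long fieldOffset) {
        return versionedLocks[stripeFor(obj, fieldOffset)];
    }
}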
This work was supported by FCT (INESC-ID multiannual funding) through the PIDDAC Program funds and by the RuLAM project (PTDC/EIA-EIA/108240/2008).
transactional information on each object or structure’s header, and all of the object’s fields share the same piece of transactional metadata. Arrays, however, are not treated specially by STM implementations. Thus, programmers either use an array of transactional containers in each position, or they wrap the entire array with a transactional object. Neither option is ideal, if we consider that array elements may be randomly but infrequently changed. Because arrays are one of the most elemental data structures on computing systems, if we hope to extend the usage of STM to provide synchronization and isolation to array-heavy applications, minimizing the imposed overhead is very important. In this paper, we describe how existing transactional arrays are implemented, and explore new approaches that are integrated with the STM, achieving better performance and reducing memory usage for read-dominated workloads. Our work is based on the Java Versioned Software Transactional Memory (JVSTM) [2, 3], a multi-version STM. The rest of this work is organized as follows. Section 2 introduces the JVSTM transactional memory. Section 3 describes current black-box approaches to arrays. Section 4 introduces the new proposals for handling arrays. In Section 5, we compare the different array implementations. Experimental results are presented in Section 6, followed, in Section 7, by a survey of related work. Finally, in Section 8, we finish by presenting the conclusions and future research directions.
2 The JVSTM Software Transactional Memory
The Java Versioned Software Transactional Memory (JVSTM) is a pure Java library implementing an STM [3]. JVSTM introduces the concept of versioned boxes [2], which are transactional locations that may be read and written during transactions, much in the same way as in other STMs, except that they keep the history of values written to them by any committed transaction. Programmers using the JVSTM must use instances of the VBox class to represent the shared mutable variables of a program that they want to access transactionally. In Java, those variables are either class fields (static or not) or array components (each element of an array). As an example, consider a field f of type T in a class C whose instances may be accessed concurrently. To access f transactionally, the programmer must do two things: (1) transform the field f in C into a final field that holds an instance of type VBox<T>, and (2) replace all the previous accesses to f by the corresponding operations on the contents of the box now contained in f. JVSTM implements versioned boxes by keeping a linked list of VBoxBody instances inside each VBox: Each VBoxBody contains both the version number of the transaction that committed it and the value written by that transaction. This list of VBoxBody instances is sorted in descending order of the version number, with the most recent at the head. The key idea of this design is that transactions typically need to access the most recent version of a box, which is only one indirection-level away from the box object.
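As a concrete example of this transformation, the sketch below shows a simple counter before and after being made transactional; it assumes only the VBox operations used in this paper (a constructor taking the initial value, get, and put).

// Before: a plain field, unsafe when accessed by concurrent transactions.
class Counter {
    private int value;
    int get() { return value; }
    void inc() { value++; }
}

// After: the field becomes a final VBox, and all accesses go through the
// box, so reads and writes are mediated by the STM.
class TransactionalCounter {
    private final VBox<Integer> value = new VBox<Integer>(0);
    int get() { return value.get(); }
    void inc() { value.put(value.get() + 1); }
}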
Yet, because the JVSTM keeps all the versions that may be needed by any of the active transactions, a transaction that got delayed for some reason can still access a version of the box that ensures that it will always perform consistent reads: The JVSTM satisfies the opacity correctness criterion [9]. In fact, a distinctive feature of the JVSTM is that read-only transactions are lock-free and never conflict with other transactions. They are also very lightweight, because there is no need to keep read-sets or write-sets: Each read of a transactional location consists only of traversing the linked list to locate the correct VBoxBody from which the value is to be read. These two characteristics make the JVSTM especially suited for applications that have a high read/write transaction ratio.

Currently there are two versions of the JVSTM that differ in their commit algorithm. The original version of the JVSTM uses a lock-based commit algorithm, described below, whereas more recently Fernandes and Cachopo described a lock-free commit algorithm for the JVSTM [6]. Unless otherwise stated, the approaches described in this paper apply to both versions of the JVSTM. To synchronize the commits of read-write transactions, the lock-based JVSTM uses a single global lock: Any thread executing a transaction must acquire this lock to commit its results, which means that all commits (of read-write transactions) execute in mutual exclusion. After the lock acquisition, the committing transaction validates its read-set and, if valid, writes back its values to new VBoxBody instances, which are placed at the head of each VBox's history of values.

To prevent unbounded growth of the memory used to store old values for boxes, the JVSTM implements a garbage collection algorithm, which works as follows: Each committing transaction creates a list with all the newly created instances of VBoxBody and stores this list in its descriptor. The transaction descriptors themselves also form a linked list of transactions, with increasing version numbers. When the JVSTM detects that no transactions are running with a version number older than some descriptor's, it cleans the next field of each VBoxBody instance in that descriptor, allowing the Java GC to collect the old values.
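To make the version lookup concrete, here is a minimal sketch of the body list and of the read of the value visible to a transaction running with a given version number; the field and method names are our reconstruction from the description above, not the actual JVSTM source.

// A versioned-box body: bodies are kept in descending version order, so a
// transaction running at version v must read from the first body whose
// version is <= v. A body with version 0 is assumed to sit at the tail.
class VBoxBody<E> {
    final E value;
    final int version;
    volatile VBoxBody<E> next;  // cleared by the JVSTM GC to free old values

    VBoxBody(E value, int version, VBoxBody<E> next) {
        this.value = value;
        this.version = version;
        this.next = next;
    }

    E getBody(int maxVersion) {
        VBoxBody<E> body = this;
        while (body.version > maxVersion)
            body = body.next;  // move towards older versions
        return body.value;
    }
}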
3 Current Black-Box Array Implementations
In this section, we describe the two most common alternatives to implement transactional arrays with the JVSTM if we use only its provided API — that is, if we use the JVSTM as a black-box library.

3.1 Array of Versioned Boxes
The most direct and commonly used way of obtaining a transactional array with the JVSTM is the array of VBoxes. A graphical representation of the resulting structure is shown in Figure 1.

Fig. 1. Array of versioned boxes

One of the shortcomings of this approach is the array initialization: All positions of the array need to be initialized with a VBox before they are used, typically as soon as the array is created and before it is published. Trying to perform lazy initialization highlights one of the issues of implementing such a data structure outside the STM: The underlying native Java array is not under the control of the STM, and as such the programmer must provide his own synchronization mechanism for this operation. Side-stepping the synchronization provided by the STM while at the same time using the STM must be done carefully, or key STM characteristics, such as lock-freedom and atomicity, might be lost, and common concurrent programming issues such as deadlocks might arise again. We will see in Section 4.1 a variant of this approach that uses lazy initialization and knowledge of the JVSTM's internals. Because all VBoxes and their associated VBoxBody instances are normal Java objects, they still take up a considerable amount of memory when compared to the amount needed to store each reference in the VBox array. As such, it is not unexpected for the application to spend more than twice the space needed for the native array to store these instances in memory.
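A minimal sketch of this approach, with eager creation of one VBox per position; the wrapper class is ours, for illustration, and assumes the VBox API of Section 2.

// Array of versioned boxes: every position holds a VBox, created eagerly so
// that the native array is never written again after being published.
class ArrayOfVBoxes<Type> {
    private final VBox<Type>[] boxes;

    @SuppressWarnings("unchecked")
    ArrayOfVBoxes(int size) {
        this.boxes = new VBox[size];
        for (int i = 0; i < size; i++)
            this.boxes[i] = new VBox<Type>(null);  // full cost paid up front
    }

    Type get(int index) { return boxes[index].get(); }
    void put(int index, Type value) { boxes[index].put(value); }
}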
3.2 Versioned Box with Array

Fig. 2. Versioned box with array
The other simple implementation of a transactional array is one where a single VBox keeps the entire array, as shown in Figure 2. Creation of this kind of array is straightforward, with overheads comparable to a normal non-transactional array. Array reads are the cheapest possible, only adding the cost of looking up the correct VBoxBody to read from; but writes are very expensive, as they need to duplicate the entire array just to change one of the positions. In addition, a single array write conflicts with every other (non read-only) transaction that is concurrently accessing the array, as the conflict detection granularity is the VBox holding the entire array. Moreover, there is a very high overhead in keeping the history of values: For each version, an entire copy of the array is kept, even if only one element of the array was changed. This may lead the system to run out of memory very quickly, if writes to the array are frequent and some old running transaction prevents the garbage collector from running. In conclusion, this approach is suited only for very specific workloads, with zero or almost-zero writes to the array. On the upside, for those workloads, it offers performance comparable to native arrays, while still benefiting from transactional properties. It is also the only approach that allows the underlying array to change size and dimensions dynamically with no extra overhead.
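A sketch of a write under this representation: the whole array is copied so that the box always holds an immutable snapshot. The wrapper class is ours, for illustration.

// Versioned box holding the entire array: a single-element write clones the
// whole array, costs O(n) time and space per version kept, and conflicts
// with every concurrent read-write transaction using the same box.
class VBoxWholeArray<Type> {
    private final VBox<Type[]> box;

    VBoxWholeArray(Type[] initial) {
        this.box = new VBox<Type[]>(initial.clone());
    }

    Type get(int index) { return box.get()[index]; }

    void put(int index, Type value) {
        Type[] copy = box.get().clone();  // duplicate the entire array
        copy[index] = value;
        box.put(copy);                    // one conflict-detection point
    }
}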
Type value = getVBox(index).get();   // Reading from a VBoxArray
getVBox(index).put(newValue);        // Writing to a VBoxArray

VBox<Type> getVBox(int index) {      // Helper method getVBox
    VBox<Type> vbox = transArray[index];
    if (vbox == null) {
        vbox = new VBox<Type>((VBoxBody<Type>) null);
        vbox.commit(null, 0);
        if (!unsafe.compareAndSwapObject(transArray, ..., null, vbox))
            vbox = transArray[index];
    }
    return vbox;
}
Fig. 3. Code for the VBoxArray approach

4 New Array Proposals
In this section, we describe three proposals to implement transactional arrays that improve on the black-box approaches presented in the previous section.

4.1 VBoxArray and VBodyArray
The VBoxArray approach is obtained by adding lazy creation and initialization of VBoxes to the approach presented in Section 3.1. The main operations for this implementation are shown in Figure 3. The getVBox() helper method first tries to obtain a VBox from the specified array position. If it exists, it is returned; otherwise, a new one is created using an empty body, which is immediately written back and tagged with version 0. This is conceptually the same as if the VBox had been created by a transaction that ran before every other transaction and initialized all the boxes. The VBox is then put into the array in an atomic fashion: Either the compareAndSwap1 operation succeeds, and the box is placed on the underlying array, or it fails, meaning that another thread already initialized it.

We can take the VBoxArray one step further and obtain the VBodyArray by doing away with the VBoxes altogether. The insight is that a VBox is needed only to uniquely identify a memory location on which we can transactionally read and write. If we provide our transactional array inside a wrapper VBodyArray class, we can use another method to uniquely identify a memory position: the pair <array, index>. Using this pair, we no longer need the VBoxes, because the underlying array can directly contain the VBoxBody instances that would normally be kept inside them; initialization can still be done lazily. The VBodyArray saves a considerable amount of memory for larger arrays, and also lowers the overhead on reads, as fewer memory reads need to be done to reach the values.
1 Available in the sun.misc.Unsafe class included in most JVM implementations.
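A sketch of the VBodyArray idea, reusing the VBoxBody sketch from Section 2: the native array holds the body chains directly, and a position is identified by the pair <array, index>. The class shape and the lazy-initialization details are ours, for illustration.

import java.util.concurrent.atomic.AtomicReferenceArray;

// VBodyArray: no per-position VBox objects. Each slot holds the head of a
// VBoxBody history, installed lazily with a compare-and-swap on first use.
class VBodyArray<Type> {
    private final AtomicReferenceArray<VBoxBody<Type>> bodies;

    VBodyArray(int size) {
        this.bodies = new AtomicReferenceArray<VBoxBody<Type>>(size);
    }

    Type get(int index, int txVersion) {
        VBoxBody<Type> head = bodies.get(index);
        if (head == null)
            head = initialize(index);
        return head.getBody(txVersion);
    }

    private VBoxBody<Type> initialize(int index) {
        // As if written by a transaction that ran before all others (version 0).
        VBoxBody<Type> initial = new VBoxBody<Type>(null, 0, null);
        return bodies.compareAndSet(index, null, initial)
                ? initial
                : bodies.get(index);  // another thread won the race
    }
}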
Fig. 4. The VArray transactional array
Type value = array.values.get(index); // Read value from array (volatile read!) int version = array.version; // Read array version // If the array did not change, return the value read, otherwise check the log if (version <= currentTransactionVersion) return value; Type logValue = array.log.getLogValue(index, currentTransactionVersion); return logValue != null ? logValue : value;
Fig. 5. Reading from a VArray

4.2 VArray
The VArray approach, shown in Figure 4, does away entirely with the normal storage mechanisms of the JVSTM: No VBoxes and no VBoxBodies are needed. It is designed to have the upsides of the VBox with Array approach described in Section 3.2, but to eliminate or minimize the downsides.2 The main design idea is to have an array that keeps both a set of values tagged with a version and a log containing the remaining versions. Based on this design, two strategies are possible:

– The underlying array keeps the oldest values for each array position, and newer values are kept in the log; there must be a strategy to decide when to transfer values from the log to the main array.
– The underlying array keeps the latest values for each array position, and older values are kept in the log; there must be a strategy to allow garbage collection of older values from the log.

We argue that the second choice is more in line with the spirit of the JVSTM's design, because newer transactions find their values quickly, while older long-running transactions have to search through the log to find their older values. Additionally, JVSTM's existing garbage collection algorithm can, with minor modifications, be used to perform garbage collection of the log.

Reading from a VArray. We start by reading the value directly from the array and then we check the array version: If it is older than the current transaction version number, we may return the value that we read directly from the array. If, instead, an older value is needed, we have to check the log to find the value corresponding to our current version, and return it, if found; otherwise, we may safely return the value originally read from the array, because that position was never changed, although other array positions were. Figure 5 shows the code to read from a VArray.
2 The full source code for the JVSTM with the VArray class is available on the jvstm-lock-free branch at http://groups.ist.utl.pt/esw-inesc-id/git/jvstm.git/
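Putting this together, the state of a VArray can be summarized by the following sketch; the field names match those used in the code of Figures 5 and 6, while the rest of the shape is our reconstruction, not the actual implementation (the log node type is sketched after Figure 7 below).

import java.util.concurrent.atomic.AtomicReferenceArray;

// State of a VArray: 'values' holds the latest committed values (accessed
// with volatile semantics), 'version' tags them with the commit that
// produced them, and 'log' chains the overwritten older values.
class VArray<Type> {
    final AtomicReferenceArray<Type> values;  // latest values
    int version;             // plain field: the Synchronization paragraph
                             // below argues volatile access to 'values' suffices
    VArrayLogNode<Type> log; // newest log node first; updated at commit time

    VArray(int size) {
        this.values = new AtomicReferenceArray<Type>(size);
        this.version = 0;
        this.log = null;
    }
}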
int txNumber; // Version of transaction being committed VArrayEntry[] writesToCommit; // Sorted list of writes to be committed // Create and initialize logEntryIndexes to be used in the log int[] logEntryIndexes = new int[writesToCommit.length]; for (int i = 0; i < writesToCommit.length; i++) logEntryIndexes[i] = writesToCommit[i].index; // Create and place log node Type[] logEntryValues = (Type[]) new Object[writesToCommit.length]; array.log = new VArrayLogNode(logEntryIndexes, logEntryValues, txNumber - 1, array.log); // Bump array version array.version = txNumber; // Writeback values int i = 0; for (VArrayEntry entry : writesToCommit) { // Read old value from the array, and copy it to the log logEntryValues[i++] = array.values.get(entry.index); // Write the new value array.values.lazySet(entry.index, entry.object); // Volatile write! }
Fig. 6. Committing changes to a VArray
Fig. 7. Structure of a VArrayLogNode
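A sketch of the log node of Figure 7, reconstructed from the description in the text, together with the lookup invoked in Figure 5. The traversal below is equivalent to the search described in the "Accessing the Log" paragraph that follows: it keeps the match from the node with the smallest version >= the transaction's version, and returns null when the value must be read from the main array.

import java.util.Arrays;

// One node is created per commit that wrote to the array: 'version' is the
// last version for which the logged values were current, 'logEntryIndexes'
// is sorted so that lookups can use binary search, and 'next' holds older
// values still needed by long-running transactions.
class VArrayLogNode<Type> {
    final int[] logEntryIndexes;  // positions overwritten by the commit
    final Type[] logEntryValues;  // values those positions held before it
    final int version;
    final VArrayLogNode<Type> next;

    VArrayLogNode(int[] indexes, Type[] values, int version, VArrayLogNode<Type> next) {
        this.logEntryIndexes = indexes;
        this.logEntryValues = values;
        this.version = version;
        this.next = next;
    }

    Type getLogValue(int index, int txVersion) {
        Type result = null;
        // Walk from newer to older nodes; deeper nodes have smaller versions,
        // so the last match found comes from the smallest version >= txVersion.
        for (VArrayLogNode<Type> node = this; node != null && node.version >= txVersion; node = node.next) {
            int pos = Arrays.binarySearch(node.logEntryIndexes, index);
            if (pos >= 0)
                result = node.logEntryValues[pos];
        }
        return result;  // null: position unchanged as of txVersion, use the array
    }
}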
Writing and Committing to a VArray. Writing to a VArray is similar to writing to a VBox: the value to be written is added to the transaction's write-set. During the commit, the write-back to a VArray proceeds as follows:

1. Create a new log entry with the indexes of the array positions that are going to be overwritten and add that entry to the head of the log;
2. Update the array version;
3. Finally, back up to the log and write back each changed array position. Each write operation is done with volatile semantics.

These steps need to be done while inside the commit lock. On the lock-free version of the JVSTM, because there is no commit lock, each array is locked individually; note that this approach eliminates the property of lock-freedom from the commit of transactions that wrote to a VArray, but lock-freedom is restored when no active transactions are writing to VArray instances. Figure 6 shows a simplified version of the VArray commit code.

Accessing the Log. Figure 7 shows the structure of a VArrayLogNode. The VArray log is a linked list of VArrayLogNodes containing the older values of array positions that were overwritten by newer transactions. Inside each VArrayLogNode, a version
field keeps the last version for which the values contained in it were valid. This means that if the log contains two nodes, with versions 50 and 40, then the node with version 50 is valid for transactions with a version in the range ]40, 50] and the older entry is valid for transactions with versions in ]0, 40]. Moreover, each VArrayLogNode maintains two arrays: logEntryIndexes and logEntryValues. The first keeps an ordered list of the indexes that were changed by the transaction that created the log node, and the second keeps the values that were at those indexes and were overwritten in the main array. When a transaction with version n needs to look up the log for the value at index index, it first traverses the log nodes until it finds the log node with the smallest version >= n. It then checks that node for the index, by performing a binary search on the logEntryIndexes array. If this search finds the index, it returns the corresponding value. Otherwise, the search is resumed from the previous node, until a value is found or the beginning of the log is reached — meaning that the requested value should be read from the main array.

Synchronization. As we saw, the read algorithm first reads the value from the array, and then reads its version. To commit a new value, we reverse this order: First the committer updates the version, and then writes back the new values. Yet, without additional synchronization, we have a data race and the following can happen: The update of the array value may be reordered with the update of the version, which means that a reader may read the new value written by the committing transaction, but still read the old version value, causing the algorithm to return an invalid (newer) value to the application. To solve this issue, and taking into account the Java memory model [12], we might be inclined to make the field that stores the array version volatile. Unfortunately, this will not work: If the committing thread first does a volatile write on the array version, and then updates the array, and the reading thread does not observe the write to the array version, no synchronizes-with3 relation happens, and so the update to the array value may be freely reordered before the version write, making a reader read the new value and miss the new version. The other possible option would be for the committing thread to first write back the value, and then update the array version with a volatile write; in this case, a simple delay or context switch between the two writes would cause issues. As such, we can see that no ordering of writes to update both the array value and version can work correctly if just the version is declared volatile. As it turns out, the commit algorithm works correctly if only the array value is read and written with volatile semantics (through the usage of the AtomicReferenceArray class), while the version is accessed as a normal variable. This way, the reader can never read a newer value and an old version, because, by the definition of volatile, if we observe a value, we at least observe the correct version for that value, but may also
observe a later version, which poses no problem: In both cases the algorithm will correctly decide to check the log.

3 The volatile keyword, when applied to a field, states that if a thread t1 writes to a normal field f1 and then to a volatile field f2, then if another thread observes the write on f2, it is guaranteed that it will also see the write to f1, and also every other write done by t1 before the write to f2. This is called a synchronizes-with [12] relationship.

Table 1. Comparison of array implementations. The memory overheads are considered for two workloads: a workload where only a single position is ever used after the array is created, and one where the entire array is used.

Garbage Collection. We also extended the JVSTM garbage collection algorithm to work with the VArray log. As the linked-list structure of the array log is similar to the linked list of bodies inside a VBox, new instances of VArrayLogNode that are created during transaction commit are also saved in the transaction descriptor, and from then on the mechanism described in Section 2 is used.
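As a worked example of the read and synchronization rules above (values and versions are illustrative): suppose position 3 held a up to version 11, b for versions 12 to 14, and c from version 15 on.

// Main array: values[3] == c, array.version == 15.
// Log, newest node first:
//   node(version = 14, indexes = [3], values = [b])  <- created by the v15 commit
//   node(version = 11, indexes = [3], values = [a])  <- created by the v12 commit
//
// A reader at version 16 sees 15 <= 16 and returns c straight from the array.
// A reader at version 13 consults the log: node v14 >= 13 holds b; node v11
// is skipped (11 < 13), so the reader gets b, the value current at 13.
// A reader at version 10 matches both nodes, and the node with the smallest
// version >= 10 (v11) wins: the reader gets a, the value current at 10.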
5 Comparison of Approaches
Table 1 summarizes the key characteristics of the multiple approaches described in this paper. The single position memory overhead test case considers an array of n positions where, after creation, only one of those positions is ever used during the entire program; conversely, the entire array test case considers one where every position of the array is used. The memory overheads considered are in addition to a native array of size n, which all implementations use. The main objective of this work was the creation of an array implementation that provided better performance for read-only operations, while minimizing memory usage and still supporting write operations without major overheads. We believe VArray fulfills those objectives, as it combines the advantages of the "VBox with Array" approach, such as having a very low memory footprint and read overhead, with advantages from other approaches, notably conflict detection done at the array-position level, and low history overhead. Writes to a VArray are still more complex than in most other approaches, but as we will see in Section 6 they can still be competitive.
6 Experimental Results
We shall now present experimental results of the current implementation of VArray. They were obtained on two machines: one with two Intel Xeon E5520 processors (8 cores total) and 32GB of RAM, and another with four AMD Opteron 6168 processors (48 cores total) and 128GB of RAM, both running Ubuntu
10
I. Anjo and J. Cachopo
!"#$% #!
Fig. 8. Comparison of VArray versus the Array of VBoxes approach for the array benchmark, with a read-only workload on our two test systems
Fig. 9. Comparison of VArray versus the Array of VBoxes approach for the array benchmark, with varying number of read-write transactions (10%, 50% and 100%) on the 48-core AMD machine
10.04.2 LTS 64-bit and Oracle Java 1.6.0_22. For our testing, we compared VArray to the Array of VBoxes approach, using the array benchmark,⁴ which can simulate multiple array-heavy workloads. Before each test, the array was entirely initialized; note that after being fully initialized, the Array of Versioned Boxes and the VBoxArray behave similarly. Each test was run multiple times, and the results presented are the average over all executions. Figure 8 shows the scaling of VArray versus the Array of VBoxes approach for a read-only workload, with a varying number of threads. Each run consisted of timing the execution of 1 million transactions, with an array size of 1,000,000 on the 8-core machine and 10,000,000 on the 48-core machine. Due to the reduced overheads imposed on array reads, VArray presents better performance. Figure 9 shows the scaling of VArray versus the Array of VBoxes approach for a workload with a varying percentage of read-only and read-write transactions. Each read-only transaction reads 1000 (random) array positions, and each
⁴ http://web.ist.utl.pt/sergio.fernandes/darcs/array/
read-write transaction reads 1000 array positions and additionally writes to 10. Each run consisted of timing the execution of 100,000 transactions. As we can see, the increased write overhead of VArray eventually takes its toll: beyond a certain number of cores (which depends on the percentage of read-write transactions), VArray presents worse results than the Array of VBoxes approach. These results show that while VArray is better suited for read-only workloads, if needed it can still support a moderate read-write workload. To test the memory overheads of VArray, we measured the minimum amount of memory needed to run a read-only workload in the array benchmark, on a single CPU, for an array with 10 million Integer objects. Due to its design, VArray was able to complete the benchmark using only 57MB of RAM, 10% of the 550MB needed by the Array of VBoxes approach. Finally, we measured, using a workload comprised of 10% read-write transactions and 90% read-only transactions, and 4 threads, the minimum memory needed for both approaches to present acceptable performance, when compared with a benchmark run with a large heap. In this test, VArray took approximately 25% longer to execute with a 256MB heap, when compared to a 3GB heap; runs with an Array of VBoxes needed at least 800MB and also took 25% longer.
7 Related Work
Software Transactional Memory (STM) [15] is an optimistic approach to concurrency control on shared-memory systems. Many implementations have been proposed; Harris et al.'s book [10] provides a very good overview of the subject. CCSTM [1] is a library-based STM for Scala based on SwissTM [5]. Similarly to the JVSTM, the programmer has to explicitly make use of a special type of reference that mediates access to an STM-managed mutable value. Multiple memory locations can share the same STM metadata, enabling several levels of granularity for conflict detection. CCSTM also provides a transactional array implementation that eliminates some of the indirections needed to access transactional metadata, similarly to our VBodyArray approach. The DSTM2 [11] STM framework allows the automatic creation of transactional versions of objects based on supplied interfaces. Fields of transactional objects are allowed to be either scalars or other transactional types, which disallows arrays; to work around this issue, DSTM2 includes the AtomicArray class, which provides its own specific synchronization and recovery, but no further details on its implementation are given. Another approach to reducing the memory footprint of STM metadata for arrays and other data structures is changing the granularity of conflict detection. Word-based STMs such as Fraser and Harris's WSTM [8] and TL2 in per-stripe mode [4] use a hash function to map memory addresses to a fixed-size transactional metadata table; hash collisions may result in false positives, but memory usage is bounded by the chosen table size. Marathe et al. [13] compared word-based with object-based STMs, including the overheads added and memory usage; one of their conclusions is that the studied systems incur significant bookkeeping overhead for read-only transactions.
Riegel and Brum [14] studied the impact of word-based versus object-based STMs for unmanaged environments, concluding that object-based STMs can reach better performance than purely word-based STMs. Our VArray implementation is novel because it presents the same memory overheads as word-based schemes, while still detecting conflicts for each individual array position. The processing overhead for read-write transactions is still larger than with word-based approaches, because the transaction read-set must contain all individual array positions that were read, and all of them must be validated at commit time, which is something word-based STMs can further reduce.
8 Conclusions and Future Work
Software transactional memory is a very promising approach to concurrency. Still, to expand into most application domains, many research and engineering issues need to be examined and solved. The usage of arrays is one such issue. In this work we presented the first comprehensive analysis of transactional array designs, described how arrays are currently implemented on top of the JVSTM, and presented two implementations that improve on previous designs. In particular, the VArray implementation has memory usage comparable to native arrays, while preserving the lock-free property of JVSTM's read-only transactions. In addition, our experimental results show that VArray is highly performant for read-dominated workloads, and competitive for read-write workloads. Future research directions include researching the possibility of a lock-free VArray commit algorithm, and exploring the usage of Bloom filters for log lookups.
References

1. Bronson, N., Chafi, H., Olukotun, K.: CCSTM: A library-based STM for Scala
2. Cachopo, J., Rito-Silva, A.: Versioned boxes as the basis for memory transactions. Science of Computer Programming 63(2), 172–185 (2006)
3. Cachopo, J.: Development of Rich Domain Models with Atomic Actions. Ph.D. thesis, Technical University of Lisbon (2007)
4. Dice, D., Shalev, O., Shavit, N.: Transactional locking II. In: Dolev, S. (ed.) DISC 2006. LNCS, vol. 4167, pp. 194–208. Springer, Heidelberg (2006)
5. Dragojević, A., Guerraoui, R., Kapalka, M.: Stretching transactional memory. ACM SIGPLAN Notices 44, 155–165 (2009)
6. Fernandes, S., Cachopo, J.: Lock-free and scalable multi-version software transactional memory. In: Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, pp. 179–188. ACM, New York (2011)
7. Fraser, K., Harris, T.: Practical lock-freedom. Tech. rep. (2004)
8. Fraser, K., Harris, T.: Concurrent programming without locks. ACM Trans. Comput. Syst. 25 (2007)
9. Guerraoui, R., Kapalka, M.: On the correctness of transactional memory. In: PPoPP 2008: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 175–184. ACM, New York (2008)
10. Harris, T., Larus, J., Rajwar, R.: Transactional memory. Synthesis Lectures on Computer Architecture 5(1), 1–263 (2010)
11. Herlihy, M., Luchangco, V., Moir, M.: A flexible framework for implementing software transactional memory. ACM SIGPLAN Notices 41(10), 253–262 (2006)
12. Manson, J., Pugh, W., Adve, S.: The Java Memory Model
13. Marathe, V.J., Scherer, W.N., Scott, M.L.: Design tradeoffs in modern software transactional memory systems. In: Proceedings of the 7th Workshop on Languages, Compilers, and Run-Time Support for Scalable Systems, LCR 2004, pp. 1–7. ACM, New York (2004)
14. Riegel, T., Brum, D.B.D.: Making object-based STM practical in unmanaged environments. In: TRANSACT 2008: 3rd Workshop on Transactional Computing (2008)
15. Shavit, N., Touitou, D.: Software transactional memory. Distributed Computing 10(2), 99–116 (1997)
Massively Parallel Identification of Intersection Points for GPGPU Ray Tracing

Alexandre S. Nery¹,³, Nadia Nedjah², Felipe M.G. França¹, and Lech Jozwiak³

¹ LAM – Computer Architecture and Microelectronics Laboratory, Systems Engineering and Computer Science Program, COPPE, Universidade Federal do Rio de Janeiro
² Department of Electronics Engineering and Telecommunications, Faculty of Engineering, Universidade do Estado do Rio de Janeiro
³ Department of Electrical Engineering – Electronic Systems, Eindhoven University of Technology, The Netherlands
Abstract. The latest advancements in computer graphics architectures, such as the replacement of some fixed stages of the pipeline with programmable stages (shaders), have been enabling the development of parallel general-purpose applications on massively parallel graphics architectures (streaming processors). For years the graphics processing unit (GPU) has been optimized for an increasingly high throughput of massively parallel floating-point computations. However, only applications that exhibit Data Level parallelism can achieve substantial acceleration in such architectures. In this paper we present a parallel implementation of the GridRT architecture for GPGPU ray tracing. Such an architecture can expose two levels of parallelism in ray tracing: parallel ray processing and parallel intersection tests, respectively. We also present a traditional parallel implementation of ray tracing in GPGPU, for comparison against the GridRT-GPGPU implementation.
1 Introduction
High-fidelity computer-generated images are one of the main goals in the Computer Graphics field. Given a 3-D scene, usually described by a set of 3-D primitives (e.g. triangles), a typical rendering algorithm creates a corresponding image through several matrix computations and space transformations applied to the 3-D scene, together with many per-vertex shading computations [1]. All these computations are organized in pipeline stages, each one performing many SIMD floating-point operations in parallel. The Graphics Processing Unit (GPU) is also known as a Stream Processor because of such a massively parallel pipeline organization, which continuously processes a stream of input data through the pipeline stages. In the final stage, all primitives are rasterized to produce an image (a.k.a. frame). In order to achieve real-time rendering speed it is necessary to produce at least 60 frames per second (fps), so that the change between frames is not perceived and interactivity is ensured. The Streaming Processor model of current GPU architectures can deliver a sufficient throughput of frame rates for most 3-D scenarios, but at the cost
of a lower degree of realism in each produced frame. For example, important Global Illumination effects like shadows and reflections must be handled at the application level, because the hardware is based on a Local Illumination model and, thus, is specialized in processing 3-D primitives only [4]. Although the ray tracing algorithm [10] is also a computer graphics application for rendering 3-D scenes, the algorithm operates in opposition to traditional rendering algorithms [1]. For instance, instead of projecting the primitives onto the viewplane, where the final image is produced, the ray tracing algorithm fires rays towards the scene and traces their path in order to identify the visible objects, their properties, and the light trajectory within the scene, through several intersection computations. In the end, all this information is merged to produce the final image. For that reason, ray tracing is a computationally expensive application that can produce high-fidelity images of a 3-D scene, with shadow and reflection effects. Besides, the algorithm has a very high parallelization potential, because each ray can be processed independently from the others, usually achieving almost linear acceleration in a parallel implementation [4]. Thus, there are parallel implementations on clusters [11] and shared-memory systems [2], using spatial subdivision of the 3-D scene. Parallel implementations on GPGPUs (General Purpose Graphics Processing Units) have also achieved substantial results [9]. Some stages of the pipeline, such as the Vertex and Geometry processing stages, have recently evolved into programmable shaders that can be programmed to perform different algorithms [6]. So, the GPU is no longer dedicated to running graphics-related algorithms only, but also general-purpose parallel algorithms that can benefit from the massively parallel architecture of modern GPGPUs. For instance, Data Level parallel applications in general achieve high acceleration when mapped to GPGPUs, because these applications perform well on SIMD machines [9]. However, if control flow and recursion are strongly required, which is often the case for ray tracing, then existing Von Neumann architectures may be a better option for Task Level parallel applications. In ray tracing, every ray can be processed independently, in parallel, but each ray must be tested for intersections against the primitives of the 3-D scene and, if there is such an intersection, the computation may proceed in many different ways. So, the work granularity is at the task level and each task may execute through different branches, which makes control flow and recursion a big issue for ray tracing. Hence, there are consistent approaches to accelerating ray tracing with custom parallel architectures in hardware, as in [8,13], operating at low frequencies: the low frequency of operation is compensated by the parallelism of the custom design, and several limitations can be overcome by a custom hardware design. In general, the target device is a Field Programmable Gate Array (FPGA), which can be used to prototype the design; later, an Application Specific Integrated Circuit (ASIC) can be produced, operating at much higher frequencies. Throughout this paper we briefly describe our GridRT parallel architecture for ray tracing and we present a GPGPU implementation of the architecture
in CUDA, exhibiting Task Level parallelism of rays and Data Level parallelism of intersection computations. The CUDA kernel algorithm, which corresponds to the GridRT parallel architecture with some minor modifications, is also described. In the end, we present performance results for two modern NVidia Fermi GPUs, the GTX 460 and GTX 465. Furthermore, we describe a traditional parallel ray tracer implementation in GPGPU, for comparison with the GridRT-GPGPU implementation. The rest of this paper is organized as follows. First, Section 2 briefly explains the ray tracing algorithm. Then, Section 3 shows a traditional parallel implementation of the algorithm on GPGPUs. After that, Section 4 presents the GridRT architecture, before the GPGPU implementation is presented in Section 5. Finally, Section 6 presents performance results, while Section 7 draws the conclusions of this work.
2 Ray Tracing
The ray tracing algorithm is briefly explained in this section, while further details can be found in [10]. The first step of the algorithm is the setup of a virtual camera, so that primary rays can be fired towards the scene. Each primary ray passes through a pixel of the camera's viewplane, where the final image is going to be captured. For every primary ray, a simple and straightforward ray tracing algorithm usually computes intersection tests against all the 3-D primitives of the scene, looking for the primitives (objects) that are visible from the camera's perspective. If an intersection is encountered, the object properties are used to determine whether the ray will be reflected, refracted, or completely absorbed. For instance, if the ray is reflected or refracted, the algorithm is recursively executed to determine the objects that are visible from the previous intersection point's perspective, which is why the algorithm can naturally produce mirror-like effects in the final image. On the other hand, if the ray is absorbed, the processing ends and all the information that has been gathered until that point is merged to compose the color of the corresponding pixel of the viewplane. This ray tracing style is known as Whitted-style ray tracing [12]. The program's main entry point is presented in Algorithm 1, in which the primary rays are traced. The trace procedure in Algorithm 1 is responsible for determining the closest intersection point, while the shade procedure (called by the trace procedure) is responsible for coloring the pixel and recursively calling the trace procedure in case the intersected object's surface is specular or transparent. For the sake of simplicity and brevity, these two procedures are not described in this work; further details on shading algorithms can be found in [1]. In order to avoid intersection computations between each ray and the whole scene, a spatial subdivision of the scene can be applied to select only those objects that are in the direction of a given ray, avoiding unnecessary computation. There are several spatial subdivision techniques, such as Binary Space Partitioning trees, Bounding Volume Hierarchies, KD-Trees, and Uniform Grids [10,1], each one of them with its own advantages and disadvantages. For instance, the
Algorithm 1. Ray Tracing primary rays
1: 3-D scene = load3DScene(file);
2: viewplane = setupViewplane(width, height);
3: camera = setupCamera(viewplane, eye, view direction);
4: depth = 0;
5: for i = 1 to viewplane's width do
6:   for j = 1 to viewplane's height do
7:     ray = getPrimaryRay(i, j, camera);
8:     image[i][j] = trace(3-D scene, ray, depth);
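Since trace and shade are only sketched in prose here, the following minimal Java outline of the recursive structure may help. All types and helper methods (Scene, Ray, Hit, Color, MAX_DEPTH, and so on) are hypothetical stand-ins of our own, not code from the paper.

    // Sketch of the Whitted-style recursion described above; names are illustrative.
    Color trace(Scene scene, Ray ray, int depth) {
        Hit hit = scene.closestIntersection(ray);   // test the ray against the primitives
        if (hit == null) {
            return scene.backgroundColor();         // the ray left the scene
        }
        Color color = shade(scene, ray, hit);       // local illumination at the hit point
        if (depth < MAX_DEPTH) {
            if (hit.material().isSpecular()) {
                Ray reflected = ray.reflectAt(hit); // mirror-like effects come from here
                color = color.add(trace(scene, reflected, depth + 1)
                                  .scale(hit.material().reflectance()));
            }
            if (hit.material().isTransparent()) {
                Ray refracted = ray.refractAt(hit); // refraction spawns another recursive call
                color = color.add(trace(scene, refracted, depth + 1)
                                  .scale(hit.material().transmittance()));
            }
        }
        return color;  // an absorbed ray simply returns the accumulated color
    }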
KD-Tree structure adapts very well to the 3-D scene and, hence, selects fewer objects than the other techniques. However, the KD-Tree is more expensive and complex to build, as is the algorithm that is used to traverse the tree structure [5]. On the other hand, the Uniform Grid structure is less expensive to build and its traversal algorithm is very fast [10], but such a structure is not adaptive and, because of that, may select a few more objects for intersection tests or perform extra traversal steps through empty areas of the 3-D scene. In this work we use the Uniform Grid structure, which is the basis of the GridRT parallel architecture [8]. A small sketch of the grid indexing that makes this cheapness possible follows.
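As an illustration of why the Uniform Grid's construction and traversal are inexpensive, locating the voxel that contains a point is constant-time arithmetic. The Java sketch below uses our own names, not the paper's code.

    // Minimal uniform grid: mapping a point to its voxel is pure arithmetic.
    final class UniformGrid {
        final double minX, minY, minZ;    // grid origin (scene bounding-box corner)
        final double cellX, cellY, cellZ; // voxel edge lengths
        final int nx, ny, nz;             // subdivisions per axis

        UniformGrid(double minX, double minY, double minZ,
                    double cellX, double cellY, double cellZ,
                    int nx, int ny, int nz) {
            this.minX = minX; this.minY = minY; this.minZ = minZ;
            this.cellX = cellX; this.cellY = cellY; this.cellZ = cellZ;
            this.nx = nx; this.ny = ny; this.nz = nz;
        }

        // Flattened index of the voxel containing point (px, py, pz).
        int voxelOf(double px, double py, double pz) {
            int ix = clamp((int) ((px - minX) / cellX), nx);
            int iy = clamp((int) ((py - minY) / cellY), ny);
            int iz = clamp((int) ((pz - minZ) / cellZ), nz);
            return ix + nx * (iy + ny * iz); // index into the per-voxel triangle lists
        }

        private static int clamp(int i, int n) {
            return Math.max(0, Math.min(n - 1, i));
        }
    }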
3 Traditional Parallel GPGPU Ray-Tracer
The ray tracing algorithm exhibits parallelism at the task level. Each ray can be processed independently from the others, and each one can be assigned to an execution process or thread across different computing nodes or processing elements. So, one ray is going to be processed as a task, producing one pixel at the end of the computation. The main idea is to spread tasks (rays) across different processes. Also, it is possible to assign a group of tasks per process, instead of only one task. In the end, each task or group of tasks will produce the color of one or more pixels of the final image, respectively. Modern general-purpose GPUs are capable of executing many thousands of threads in parallel [6], achieving peaks of 1 TFLOPS or more. Thus, in modern GPGPUs, each thread can be assigned to a primary ray that crosses a pixel of the viewplane. The result is that a portion of the final image is going to be produced by a block of threads (one pixel per thread). The number of blocks of threads corresponds to the number of subdivisions the image is split into, which corresponds to distributing primary rays among threads. The CUDA kernel is presented in Algorithm 2, assuming that all data transfers between the host and the GPGPU have already been performed. Note that this algorithm does not have the loop construction presented in lines 5 and 6 of the sequential version in Algorithm 1, because now each ray has been assigned to a thread of a block of threads. Every block of threads has its own identifier, as does every thread. In that way, each thread can access its own data to process. So, in Algorithm 2, the given thread uses its own identifiers to select the corresponding ray that
Algorithm 2. Traditional parallel GPGPU ray tracer CUDA kernel
1: int i = blockDim.x * blockIdx.x + threadIdx.x;
2: int j = blockDim.y * blockIdx.y + threadIdx.y;
3: ray = rays[i][j];
4: color = trace(3-D scene, ray, depth);
5: image[i][j] = color;
will be traced, resulting in one pixel color. For example, with blockDim.x = 16, a thread with blockIdx.x = 2 and threadIdx.x = 5 computes i = 16 · 2 + 5 = 37 and thus fetches the ray at row 37 of the ray array. Depending on the configuration that is set at kernel launch, the identifiers can have up to three coordinates. In the case of Algorithm 2, only two coordinates are used (i, j), because the data (primary rays) is organized in two dimensions. In the end, the whole image will have been produced by parallel threads that processed one primary ray each, together with any secondary rays that may have been generated for each intersection test.
4 The GridRT Parallel Architecture
Before explaining the GridRT implementation in GPGPU, we explain the GridRT parallel architecture itself, which can be implemented on any kind of multiprocessor system, such as clusters, chip multiprocessors, GPGPUs, or custom parallel designs in FPGA. The GridRT architecture is strongly based on the Uniform Grid structure. In such a spatial subdivision scheme the 3-D scene is split into regions of equal size, called voxels. Each voxel has a list of the primitives (triangles) that are inside it, or almost inside it. Thus, only those voxels that are pierced by a ray are going to be sequentially accessed for intersection tests, from the voxel that is closest to the ray origin to the furthest. Therefore, if an intersection is found, no more tests are required for the given ray, because that intersection is already the closest to the ray origin. In the example depicted in Fig. 1a, three voxels were accessed until an intersection t1 was found in voxel v5. The GridRT parallel model, in turn, maps each voxel onto a Processing Element (PE), as depicted in Fig. 1b. So, intersection tests are performed in
(c) GridRT-GPGPU
Fig. 1. Sequential Uniform Grid, Parallel GridRT model and GridRT in GPGPU
Massively Parallel Identification of Intersection Points
19
parallel by those PEs that are pierced by a ray and, because of that, it becomes necessary to decide which PE holds the result that is closest to the ray origin. At first, one solution is to exchange the results between the PEs, but it would require every PE to wait for the others to finish their computation on the given ray before deciding which one holds the correct results. Thus, the GridRT parallel model uses the order of traversal for each ray to determine the correct result. For instance, in Fig. 1b, P E5 and P E6 have found an intersection each, t1 and t2 . According to the ray traversal order, P E5 is closer to the ray origin. Thus, P E5 may send an interrupt message to the next PE in the list, so it can abort its computation and forward the interrupt message to the next one, until every following PE is aborted. The computation is now reduced between the remaining PEs. If one of them finds an intersection within its piece of scene data, it can also proceed in the same way, sending interruption messages to the following ones in the list. Otherwise, if none of them finds an intersection, a feedback message is sent from the first to the last remaining PEs. Such message is used to ensure that none of the previous PEs in the list have found intersection tests. Then, the remaining PE holds the correct result, like P E5 of Fig. 1b, or none of them. Note that each PE needs to communicate such messages between their direct neighbors, which depends on the target architecture that the parallel model is going to be implemented. For example, if each PE is mapped onto a Process running on different computation nodes, the messages can be exchanged via Message Passing Interface (MPI) [7]. But if the target architecture is a FPGA parallel design, then the neighborhood of PEs can be connected by interrupt signals. Further details on the GridRT architecture and its communication model can be found in [8].
5
GridRT-CUDA Parallel Ray-Tracer
Following the GridRT parallel architecture presented in the previous section, the GridRT-CUDA implementation maps each voxel onto a block of threads, as depicted in Fig. 1c. Thus, every block of threads is performing intersection tests along a ray, in parallel. Also, the intersection tests are performed in parallel inside a block of threads. Thus, two levels of parallelism are exhibited in such organization. The first one is in the task level parallelism, while the second one is in the data level parallelism. For instance, if a given block has n triangles and n or more threads at disposal, then the intersection tests are performed in parallel by the threads of the block. Otherwise, parallel intersection tests can be performed in chunks of n triangles by n threads. In this work, the 3-D scene is such that there are always enough threads to process all the triangles in parallel inside the block, as will be presented in Section 6. However, in order to determine the correct result among the blocks of threads, a different approach from the one presented in Section 4 had to be developed, because threads from different blocks of threads cannot coordinate their activities. Only threads inside the same block can coordinate their activities through a Shared Memory. Thus, a given block of threads cannot inform the next block in the traversal list about its computation results.
20
A.S. Nery et al.
Algorithm 3. GridRT-CUDA Kernel 1 2 3 4 5
shared float res[max. number of triangles per voxel]; foreach ray i do foreach element j of the corresponding traversal list do if this blockId is in the list then if there are triangles in this block then
6
res[threadIdx.x ] = intersectTriangles(ray,vertices [threadIdx.x ]);
7
syncthreads();
8
if threadIdx.x = 0 then /* Finds the smallest result /* Copy the result to global memory
*/ */
Therefore, we let the host processor determine the correct result at the end of the whole computation, also according to the order of traversal presented in Section 4. Hence, each ray has an array of results associated to it and the size of an array corresponds to the maximum number of PEs, i.e. blocks, that can be traversed by a given ray. The size is given by the total number of subdivisions applied to each of the three axis (nx , ny , nz ) of the GridRT spatial structure, as defined in Eq. 1. For instance, considering the grid of Fig. 1c, the maximum size of the array is N = 7, since the uniform grid subdivision is nx = 4, ny = 4 and nz = 1. N = nx + (ny − 1) + (nz − 1)
(1)
When each block of threads have finished the intersection checks with respect to the corresponding voxel, the result is stored in the array at the block associated entry. Thereafter, the block can proceed with the computation of a different ray, which also has a different array of results associated to it. In the end, the matrix of results is copied from the GPU to the host processor, so it can proceed with further analysis of results. Each row of the matrix corresponds to the array of results computed by the block for a given ray, while each column contains the result that was computed by a block. The algorithm that is executed by a block of threads is shown in Algorithm 3. Each block of threads takes as input an array of rays, which has also an array of results associated to each row, thus yielding a matrix. The 3-D scene is already copied to the GPU before the kernel execution. The scene is stored according to the uniform grid structure as an unidimensional array. Each position of the array points to the list of triangles that belongs to the corresponding voxel (i.e. block of threads). Once the necessary data has been copied to the GPU, the kernel is launched. According to Algorithm 3, the first step taken by each block is to declare an array of shared results, as in line 1. This shared array is used to store the results from the parallel intersection tests, as in line 6. For each ray in the input data, the block will search for its own identifier in the traversal list, as in lines 3 and 4. Then, if there are any triangles in the block, parallel intersection tests are performed by the threads. Finally, one of the threads (the
Massively Parallel Identification of Intersection Points
21
Table 1. GridRT-CUDA kernel execution times in GTX 460 and GTX 465 Blocks of threads GridRT GTX 460 GridRT GTX 465 TradRT GTX 465
1 2 4 8 12 18 27 64 - 1.69 0.87 0.92 0.97 1.07 2.39 - 1.34 0.63 0.61 0.59 0.55 1.2 0.94 0.70 0.47 0.35 0.28 0.23 0.21 0.17
125 3.85 2.41 0.14
216 7.03 4.02 0.12
*All times are in seconds. Low-res Stanford Bunny 3-D scene.
one with identifier zero) searches for the smallest result (i.e., the one closest to the ray origin with respect to that block) in the array of shared results, as in line 8.
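To make the host-side resolution step concrete, here is a small Java sketch of the idea. It is our own illustration, not the paper's code; the NO_HIT sentinel and the layout of the results matrix are assumptions of this sketch. For each ray, the row of per-block results is scanned in traversal order and the first valid hit wins, because earlier entries correspond to voxels closer to the ray origin.

    final class HostResolver {
        static final float NO_HIT = Float.POSITIVE_INFINITY;

        // results[r][s]: intersection distance reported by the s-th block in ray r's
        // traversal list (NO_HIT if that block found no intersection or was unused).
        static float[] resolve(float[][] results) {
            float[] closest = new float[results.length];
            for (int r = 0; r < results.length; r++) {
                closest[r] = NO_HIT;          // default: the ray missed everything
                for (float t : results[r]) {
                    if (t != NO_HIT) {        // first hit in traversal order ...
                        closest[r] = t;       // ... is the closest one; stop here
                        break;
                    }
                }
            }
            return closest;
        }
    }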
6 Results
In this section we present the comparison results between our GridRT-CUDA implementation on two different Nvidia GPUs (GTX 460 and GTX 465), together with the results for our traditional parallel ray tracer in GPGPU. These results are summarized in Table 1. The execution times for configurations of 1 and 2 blocks of threads are not available for the GridRT implementation, because the execution was terminated due to a kernel execution timeout. A second, dedicated GPGPU graphics card could have been used to avoid this limitation; otherwise, the same GPU has to be shared with the host operating system's applications and thus cannot execute long-running CUDA kernels (up to tens of seconds). Also, because of the kernel execution timeout limitation, we could not use higher-resolution 3-D scenes. As we can observe from Table 1, the GridRT-CUDA implementation achieves acceleration on both GPU models. However, the performance starts to degenerate when more than 8 blocks of threads are used on the GTX 460, or more than 27 blocks of threads are used on the GTX 465. The latter scales better because it has 11 Streaming Multiprocessors (SMs), while the former has 7 SMs. In essence, a block of threads is executed by an SM: the more SMs are available, the more blocks of threads can be executed in parallel. The results from Table 1 for GridRT-CUDA are depicted in Fig. 2a. If more SMs were available, more acceleration would likely be achieved. The kernel execution times for a traditional parallel ray tracer in CUDA are depicted in Fig. 2b, together with those of GridRT-CUDA. In contrast, the traditional parallel ray tracer implementation uses a different approach: parallelism is employed at the ray (task) level only. Thus, blocks of threads are not mapped to voxels of the uniform grid subdivision. Instead, blocks of threads are mapped to groups of primary rays that are going to be traced in parallel, as presented in Section 3. So, each thread is in fact processing an independent primary ray, including any corresponding secondary rays and shadow rays that may be spawned by the algorithm. In the end, each thread produces the color of an individual pixel of the final image. From Table 1 and Fig. 2b, it is clear that this version of ray
Fig. 2. Execution time results and comparisons: (a) GridRT-CUDA kernel execution time on the GTX 460 and GTX 465; (b) traditional parallel CUDA ray tracing compared to GridRT-CUDA
tracing scales almost linearly with the number of blocks of threads. The explanation for such acceleration also lies in the GPGPU architecture itself: if two or more threads are going to execute through different branches, they are serialized [3]. We can see from GridRT-CUDA (Algorithm 3) that there are several possible branches of execution, which can lead to serialization of threads. For that reason, a custom parallel design in FPGA is preferable, because the architecture can be designed according to the characteristics of the application. For instance, although the execution time in [8] is higher, the acceleration is much higher, as more processing elements can be fit into the FPGA.
7 Conclusion
In this paper, two different implementations of parallel ray tracing are discussed: the GridRT-CUDA implementation and a traditional CUDA parallel ray tracer. These two implementations are analyzed and compared regarding performance. The GridRT-CUDA implementation achieves acceleration up to 27 blocks of threads on an Nvidia GTX 465 GPU and up to 8 blocks of threads on an Nvidia GTX 460 GPU. From that point, the performance degenerates, especially because of the Streaming Processor model, which is not good for applications that exhibit too many branches of execution, such as the GridRT architecture; several threads were serialized. The performance also degenerates because many blocks of threads have to compete for execution on the GPU's limited resources. A more powerful GPU is likely to achieve higher acceleration for even more blocks of threads. Compared to the traditional GPGPU ray tracer, the GridRT-CUDA performance is not good. However, since the GPGPU implementation introduces more hardware overhead compared to a custom hardware design (an ASIP-based ASIC implementation), the custom hardware implementation is expected to have lower area and power consumption, as well as better performance.
References

1. Akenine-Möller, T., Haines, E., Hoffman, N.: Real-Time Rendering, 3rd edn. A.K. Peters, Ltd., Natick (2008)
2. Carr, N.A., Hall, J.D., Hart, J.C.: The ray engine. In: HWWS 2002: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pp. 37–46. Eurographics Association, Aire-la-Ville (2002)
3. Fung, W.W.L., Sham, I., Yuan, G., Aamodt, T.M.: Dynamic warp formation and scheduling for efficient GPU control flow. In: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 40, pp. 407–420. IEEE Computer Society, Washington, DC, USA (2007)
4. Govindaraju, V., Djeu, P., Sankaralingam, K., Vernon, M., Mark, W.R.: Toward a multicore architecture for real-time ray-tracing. In: Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 41, pp. 176–187. IEEE Computer Society, Washington, DC, USA (2008)
5. Havran, V., Prikryl, J., Purgathofer, W.: Statistical comparison of ray-shooting efficiency schemes. Technical report, Institute of Computer Graphics and Algorithms, Vienna University of Technology, Favoritenstrasse 9-11/186, A-1040 Vienna, Austria (2000)
6. Kirk, D.B., Hwu, W.-m.W.: Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann Publishers Inc., San Francisco (2010)
7. Nery, A.S., Nedjah, N., França, F.M.G.: Two alternative parallel implementations for ray tracing: OpenMP and MPI. In: Mecánica Computacional, vol. XXIX, pp. 6295–6302. Asociación Argentina de Mecánica Computacional (2010)
8. Nery, A.S., Nedjah, N., França, F.M.G., Jozwiak, L.: A parallel architecture for ray-tracing with an embedded intersection algorithm. In: International Symposium on Circuits and Systems, pp. 1491–1494. IEEE Computer Society, Los Alamitos (2011)
9. Owens, J.D., Luebke, D., Govindaraju, N., Harris, M., Kruger, J., Lefohn, A.E., Purcell, T.J.: A survey of general-purpose computation on graphics hardware. Computer Graphics Forum 26(1), 80–113 (2007)
10. Suffern, K.: Ray Tracing from the Ground Up, 1st edn. A.K. Peters, Ltd., Natick (2007)
11. Wald, I., Ize, T., Kensler, A., Knoll, A., Parker, S.G.: Ray tracing animated scenes using coherent grid traversal. In: SIGGRAPH 2006: ACM SIGGRAPH 2006 Papers, pp. 485–493. ACM, New York (2006)
12. Whitted, T.: An improved illumination model for shaded display. Commun. ACM 23(6), 343–349 (1980)
13. Woop, S., Schmittler, J., Slusallek, P.: RPU: a programmable ray processing unit for realtime ray tracing. In: SIGGRAPH 2005: ACM SIGGRAPH 2005 Papers, pp. 434–444. ACM, New York (2005)
Cascading Multi-way Bounded Wait Timer Management for Moody and Autonomous Systems

Asrar Ul Haque¹ and Javed I. Khan²

¹ College of Computer Science and Information Tech., King Faisal University, Al-Ahsa 31982, Kingdom of Saudi Arabia
[email protected]
² Media Communications and Networking Research Laboratory, Department of Math & Computer Science, Kent State University, 233 MSB, Kent, OH 44242
[email protected]
Abstract. Timer management is one of the central issues in addressing the 'moody' and autonomous characteristics of the current Internet. In this paper we formalize the multi-way bounded wait principle for a 'moody' and autonomous environment. We propose an optimum scheme and compare it with a set of generalized heuristic-based timer management schemes recommended for the harness, a distributed communicational and computational system for moody and autonomous environments.

Keywords: Optimum timeout scheme, Timeout heuristics, Grid Computing, P2P search, Web service.
1 Introduction

Any distributed system with millions of components must learn to operate with incomplete information. This is becoming the case for various distributed systems operating over the Internet. A classic example is the search for service discovery [1-2]. Such a distributed search is quite different from conventional distributed algorithms. A particularly unique characteristic of such a search is that it is never complete. The search propagates via millions of other nodes from a source to the entire network, as illustrated in Fig. 1. While it is ideal to expect that answers will arrive from a sweep covering all the nodes in the network, that is almost never the case. A search must learn to adapt to work with an imperfect sweep. An interesting question faced by this set of distributed algorithms is how to maximize the quality of the result without waiting an inordinate amount of time. The network-based distributed algorithms run in an environment consisting of various manifestations of this inherent unreliability, such as dead-beat nodes, unreliable or busy peers, missing messages, authentication failure, intentional non-cooperation, selective cooperation, etc. We will call it the Moody and Autonomous Environment (MAE). The classical communication layer has handled only limited aspects and forms of this unreliability. Various schemes such as error-resilience coding and retransmission-based transport essentially tell how a communication layer can best try to
Fig. 1. (Top) A client begins searching for items. The search then propagates through thousands of other nodes. Each plays a role in forwarding/routing results back to the root of the request. (Bottom) The root peer waits for the results.
create a 'perfect' and fault-free notion of services for mostly point-to-point communication. Classical network algorithms running on top of such a transport thus assume that the imperfection can be sealed off at the lower layers and that they can operate over a virtual perfection. Unfortunately, this assumption of virtual perfection does not always hold in the emerging distributed MAE. All natural communication systems for the MAE use the principle of bounded wait to address the moodiness property. An entity in the MAE, while running a distributed algorithm, faces the dilemma of how long it should wait for some other entity: if it waits too long, it delays the overall computation completion time, or it may even miss the timer deadline of its parent. Conversely, if it waits too short a period, then it may miss communication from many of its children. Consequently, this deteriorates the overall quality of the computation. This paper formalizes the above multi-way bounded wait
principle for general distributed algorithms for MAE. We suggest an optimum scheme and compare it with a set of generalized heuristic-based timer management schemes. The solution is applicable to any schema of multi-way communication, whether inside a new protocol, at the middle layer, or as a pure application-level effort. In this paper we present the proposed timer management scheme within a formal framework of multi-way-communication-based general distributed computing, which we call the harness [3]. The harness is a network computing framework which has a reusable multi-way communication primitive designed to operate in MAE. The harness makes a separation between the communication part and the information part of the data exchange. A set of six plug-ins then allows the computation and communication parts to be programmed separately: the messaging pattern and the message content can be programmed independently of each other. Essentially, a set of pre-programmed communication patterns can be reused with another set of message synthesis procedures. The harness has been shown to solve various network algorithms; details can be found in [3]. The paper is arranged in the following way. First, in Section 2, we provide a brief overview of various interesting related work. We then formalize the multi-way bounded wait problem and solve it to set the timer optimally in Section 3. For comparison, we also provide a set of plausible heuristics. Finally, in Section 4, we provide a detailed performance comparison of the heuristics and the optimum scheme.
2 Related Work

As the problem is increasingly becoming real, various timeout schemes have recently been proposed for a range of large-scale network applications. The Network Weather Service [4], a Grid performance data management and forecasting service, is one of the first to try dynamic timeouts in forecast computing, and noted substantial improvement over a static scheme. An RTO (retransmission timeout) selection has been proposed [5] for multimedia to achieve the optimal tradeoff between the error probability and the rate cost in order to improve throughput. A timeout strategy has been proposed that associates costs with waiting time and retransmission attempts [6], where the timeout value was set to minimize the overall expected cost; the goal of this work was to reduce the number of retransmission attempts. In [7] a scheme was proposed for Deterministic Timeouts for Reliable Multicast (DTRM) to avoid negative feedback explosion. DTRM ensures that a retransmission caused by only one NACK from a receiver belonging to a sub-tree arrives early enough that the timers do not expire in the other receivers of that sub-tree. These recent schemes have used various heuristic approximations, but are notable for their pioneering role in timeout management for multi-way communication in a moody environment. The solution we propose assumes that the nodes have prior knowledge of link delay. Indeed, there has been a considerable amount of work related to finding link-level delay. Network tomography [8-12], based on statistical estimation theory, is an emerging field of characterizing various link-level performance metrics of a network. Network tomography closely estimates network-internal characteristics, including delay distribution, loss, and delay variance, by correlating end-to-end measurements for a
multicast tree. However, a major limitation of network tomography is that it focuses on multicast routing only, whereas the bulk of the traffic is unicast. This has been overcome by estimating the delay distribution using unicast, end-to-end measurements of back-to-back packets [13].
3 Multi-way Bounded Wait Principle

While running a search algorithm, a node might aggregate messages received from a set of downstream children before forwarding a message to its parent. As more messages are received from its children, the computation becomes more accurate, since each message contains search results from some other nodes as well. A node faces the dilemma of how long it should wait for the messages: if it waits too long, it delays the overall computation completion time, or it may even miss the timer deadline of its parent. Conversely, if it waits too short a period, then it may miss communication from many of its children. Consequently, this deteriorates the overall quality of the computation. In this section, the above multi-way bounded wait principle is formalized and an optimum scheme is suggested for setting the timers for message generation and aggregation at the intermediate nodes and the terminals. In the following subsections, in order to generalize the problem formulation, we assume that the timer value for message aggregation at the root is denoted by D (deadline) and that the timer value for message generation of its child is represented by T. Furthermore, we use the notion of profit (denoted by ω) to signify the total number of nodes pertaining to which search results have been accumulated in a message.

3.1 Formulation of the Problem

Let, as shown in Fig. 2, node j have a parent k and a set of children nodes i = {i1, i2, …, in}. Let r_{i_x j}(t) and r_k(t) be the probability distribution functions of the round trip time between
Fig. 2. Optimal Timer Setting for Node j
the nodes i_x and j, and between j and k, where x = {1..n}. Let ω_{i_x j} be the profit carried by a message from node i_x to j. Given D, the timeout value of k, calculate the maximum value of the expected profit from j to k and the corresponding timeout value T_opt for j.

3.2 Generic Profit Function Formulation
The question we now pose is how to maximize the profit P(t). Let ∫_0^T C(t)dt be the total profit accumulated at j by time T, and let ∫_0^{D−T} S(t)dt be the probability of successfully reaching the parent, k, in time (D − T) with the accumulated profit. The profit function is then the product of the profit accumulation and the probability of successful delivery:

P(t) = ∫_0^T C(t)dt · ∫_0^{D−T} S(t)dt    (1)

The profit accumulation is the summation of the product of the profit and the delay distribution of each child of j, i.e.,

∫_0^T C(t)dt = ∫_0^T Σ_{s∈i} ω_{sj} r_{sj}(t)dt    (2)

∫_0^{D−T} S(t)dt = ∫_0^{D−T} r_k(t)dt    (3)

From (1), (2), and (3) we get

P(D) = ∫_0^T Σ_{s∈i} ω_{sj} r_{sj}(t)dt · ∫_0^{D−T} r_k(t)dt    (4)
Fig. 3 illustrates the formulation of the profit function for node j as in Eq. 1. As the time T is increased, the area under r_{i_1 j}(t), r_{i_2 j}(t), and r_{i_n j}(t) also increases, indicating a higher accumulation of profit C(t). However, the area under ∫_0^{D−T} r_k(t)dt, i.e., ∫_0^{D−T} S(t)dt, decreases as T increases, since D is fixed. Thus the possibility of reaching the parent node k with the accumulated profit ∫_0^T C(t)dt diminishes as T is increased. The product of ∫_0^{D−T} S(t)dt and ∫_0^T C(t)dt is the expected profit accumulated at node k for a given T.
Fig. 3. Formulation of Profit Function

3.3 Solution for Optimal Timer Setting

The optimum time, T_opt, and the corresponding maximum profit, P_max, are calculated based on the following assumptions:

• The delay distributions of the parent and children of j are Normal.
• The delay distributions of the children of j are independent of each other.
• Node j generates a message for its parent even if no message is received from its children before a timeout occurs at j.

Further simplification of Eq. 4 is beyond the scope of this paper. However, it can be noted that P(D) in Eq. 4 is a positive, bounded, and continuous function of T. Furthermore, as T → ±∞, it goes to zero. Therefore, a global maximum of P(D) and the corresponding T, denoted by T_opt, must exist, and must satisfy the equation

dP(D)/dT = 0.
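To make this concrete, the following Java sketch (our own, under the stated Normal-distribution assumptions; it is not the authors' derivation) evaluates Eq. 4 numerically: with Normal round-trip times, P(T) = [Σ_s ω_s F_s(T)] · F_k(D − T), where F is the Normal CDF. Instead of solving dP(D)/dT = 0 in closed form, T_opt is located by a brute-force scan over [0, D]. The erf approximation is the standard Abramowitz-Stegun formula 7.1.26.

    final class ProfitOptimizer {
        // Abramowitz-Stegun approximation of erf(x) (formula 7.1.26).
        static double erf(double x) {
            double sign = Math.signum(x);
            double ax = Math.abs(x);
            double t = 1.0 / (1.0 + 0.3275911 * ax);
            double y = 1.0 - (((((1.061405429 * t - 1.453152027) * t) + 1.421413741) * t
                    - 0.284496736) * t + 0.254829592) * t * Math.exp(-ax * ax);
            return sign * y;
        }

        // Normal CDF with mean mu and standard deviation sigma.
        static double cdf(double x, double mu, double sigma) {
            return 0.5 * (1.0 + erf((x - mu) / (sigma * Math.sqrt(2.0))));
        }

        // Expected profit for child timeout T, deadline D, children (w, mu, sigma),
        // and the parent link (muK, sigmaK); this is Eq. 4 under Normal RTTs.
        static double profit(double T, double D, double[] w, double[] mu, double[] sigma,
                             double muK, double sigmaK) {
            double c = 0.0;
            for (int s = 0; s < w.length; s++) c += w[s] * cdf(T, mu[s], sigma[s]);
            return c * cdf(D - T, muK, sigmaK);
        }

        // Scan T over [0, D] for the maximizer, standing in for dP(D)/dT = 0.
        static double findTopt(double D, double[] w, double[] mu, double[] sigma,
                               double muK, double sigmaK) {
            double best = 0.0, bestP = -1.0;
            for (double T = 0.0; T <= D; T += D / 10000.0) {
                double p = profit(T, D, w, mu, sigma, muK, sigmaK);
                if (p > bestP) { bestP = p; best = T; }
            }
            return best;
        }
    }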
4 Simulation

In this section the performance of the optimal scheme is partially presented. To put the performance of the optimum scheme into perspective, we also constructed a number of timer-setting heuristics and picked the five best-performing ones. We present the optimum scheme in the context of these intuitive solutions. These heuristics are shown in Table 1 and are discussed in [3]. The marginal constants and factors used in the simulation of the heuristics were manually selected after extensively looking into the performance of the individual heuristics in various network scenarios. We assume ϕ, β, α, ρ, ξ, σ, and λ to be 40000, 10000, 10, 6.5, 10, 3000, and 1.5, respectively.
4.1 Measuring Parameters

One of the important parameters of interest is the completion time (CT), which is the time elapsed between the root generating the first request and receiving and processing all the replies. However, completion time alone cannot justify the usage of a heuristic. To better compare the various heuristics, we define capturing efficiency (CE) as the ratio of the number of nodes from which information has been collected at the root to the total number of responsive nodes. Let N_m be the overall count of nodes from which information has propagated to the root node, N_t be the total number of nodes, and N_NRN the number of NRN. Then,

CE = N_m / (N_t − N_NRN)    (5)

For example, with N_t = 10,000 nodes and NRN = 0.2% (so N_NRN = 20), a run in which information from roughly 7,200 nodes reaches the root yields CE ≈ 7200/9980 ≈ 0.72.
4.2 Impact of Size of Network

Figs. 4 and 5 illustrate the impact of network size on the heuristics with respect to CT and CE, respectively. The CT for the optimal scheme for 2500, 5000, and 10000 nodes is 10.11, 13.54, and 12.9 seconds, respectively. For MCD, CT increases by 2.3s as the graph size increases from 2500 to 10000 nodes, whereas for the other heuristics the increase is noticeable. However, CE is not 1 for all three graph sizes for MCD: for 10000 nodes, CE is only 0.72. The optimal scheme has CE = 1.0 for all three graph sizes.
Table 1. Various Timeout Heuristics

Scheme                                                      Formula
Heuristic MCD (Fixed Margin over Cumulative Delay (CRTT))   T_i^h = η_{i-1}^h + β, where β ≠ f(i, L, RTT)
Heuristic PCD (Proportionate Cumulative RTT)                T_i^h = α(η_{i-1}^h), where α ≠ f(i, L, RTT) and α > 1
Heuristic PLD (Proportionate Level over RTT)                T_{i-1}^h = RTT_i^k + γ_i, where γ_i = ρ(L − i)RTT_{i-1}^k and ρ > 1
Heuristic PDT (Proportionate RTT over Fixed Child Timer)    T_{i-1}^h = ξ · RTT_i^k + T_i^k, where ξ ≠ f(i, L, RTT) and ξ > 1
Heuristic MDT (Fixed Margin RTT over Fixed Child Timer)     T_{i-1}^h = RTT_i^k + T_i^k + σ, where σ ≠ f(i, L, RTT) and σ > 1
Optimal Scheme                                              T_opt = (1/2)(L + μ_j − μ_k) + √(π/2)(σ_k − σ_j)

Fig. 4. Impact of Size of Graph on CT (completion time, in seconds, for 2.5K, 5K, and 10K nodes; NRN Loc=Terminal, NRN=0.2%, α=2.3)

Fig. 5. Impact of Size of Graph on CE (capturing efficiency for 2.5K, 5K, and 10K nodes; NRN Loc=Terminal, NRN=0.2%, α=2.3)
5 Conclusions

Timer management is the natural technique to address the 'unreliability' posed by any MAE entity, which is inherently different from the unreliability handled by TCP. We formalized the multi-way bounded wait principle for MAE to respond to this sort of 'unreliability'. We introduced the notions of the lower bound of completion time and the upper bound of capturing efficiency. In this paper we used completion time and capturing efficiency to compare the optimal scheme with several heuristics proposed for the harness, showing the better performance of the optimal scheme. We have shown that the optimal scheme outperforms the other heuristics under various network conditions. Among the heuristics, the most promising is MCD. However, a major concern for MCD is that its performance degrades with the size of the network, whereas the optimal scheme scales well with the size of a network.
References

1. Meshkova, E., Riihijärvi, J., Petrova, M., Mähönen, P.: A survey on resource discovery mechanisms, peer-to-peer and service discovery frameworks. The International Journal of Computer and Telecommunications Networking Archive 52(11) (August 2008)
2. Ahmed, R., Boutaba, R.: A Survey of Distributed Search Techniques in Large Scale Distributed Systems. Communications Surveys & Tutorials 13(2) (May 2011)
3. Khan, J.I., Haque, A.U.: Computing with data non-determinism: Wait time management for peer-to-peer systems. Journal Computer Communications 31(3) (February 2008)
4. Allen, M.S., Wolski, R., Plank, J.S.: Adaptive Timeout Discovery using the Network Weather Service. In: Proceedings of HPDC-11 (July 2002)
5. Zhan, J.C.W., He, Z.: Optimal Retransmission Timeout Selection For Delay-Constrained Multimedia Communications. In: International Conference on Image Processing, ICIP 2004, October 24-27, vol. 3, pp. 2035–2038 (2004), doi:10.1109/ICIP.2004.1421483
6. Libman, L., Orda, A.: Optimal retrial and timeout strategies for accessing network resources. IEEE/ACM Transactions on Networking 10(4), 551–564 (2002)
7. Grossglauser, M.: Optimal deterministic timeouts for reliable scalable multicast. In: IEEE Infocom 1996, pp. 1425–1441 (March 1996)
8. Bu, T., Duffield, N., Presti, F.L., Towsley, D.: Network tomography on general topologies. In: Proc. of ACM SIGMETRICS (2002)
9. Duffield, N.G., Lo Presti, F.: Multicast Inference of Packet Delay Variance at Interior Networks Links. In: Proc. Infocom 2000, Tel Aviv, Israel (March 26-30, 2000)
10. Adams, A., Bu, T., Caceres, R., Duffield, N., Friedman, T., Horowitz, J., Lo Presti, F., Moon, S.B., Paxson, V., Towsley, D.: The use of end-to-end multicast measurements for characterizing internal network behavior. IEEE Communications Magazine (May 2000)
11. Lo Presti, F., Duffield, N.G., Horowitz, J., Towsley, D.: Multicast-Based Inference of Network-Internal Delay Distribution, preprint, AT&T Labs and University of Massachusetts (1999)
12. Bu, T., Duffield, N.G., Lo Presti, F., Towsley, D.: Network tomography on general topologies. ACM SIGMETRICS (June 2002)
13. Coates, M.J., Nowak, R.: Network Delay Distribution Inference from End-to-end Unicast Measurement. In: Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (May 2001)
14. Duffield, N.G., Horowitz, J., Lo Presti, F., Towsley, D.: Network delay tomography from end-to-end unicast measurements. In: Palazzo, S. (ed.) IWDC 2001. LNCS, vol. 2170, pp. 576–595. Springer, Heidelberg (2001)
World-Wide Distributed Multiple Replications in Parallel for Quantitative Sequential Simulation

Mofassir Haque¹, Krzysztof Pawlikowski¹, Don McNickle², and Gregory Ewing¹

¹ University of Canterbury, Department of Computer Science, Christchurch 8140, New Zealand
² University of Canterbury, Department of Management, Christchurch 8140, New Zealand
[email protected]
{Krys.Pawlikowski,Don.McNickle,Greg.Ewing}@canterbury.ac.nz
Abstract. With the recent deployment of global experimental networking facilities, dozens of computer networks with large numbers of computers have become available for scientific studies. Multiple Replications in Parallel (MRIP) is a distributed scenario of sequential quantitative stochastic simulation which offers a significant speedup of simulation if it is executed on multiple computers of a local area network. We report results of running MRIP simulations on PlanetLab, a global overlay network which can currently access more than a thousand computers in forty different countries around the globe. Our simulations were run using Akaroa2, a universal controller of quantitative discrete event simulation designed for automatic launching of MRIP-based experiments. Our experimental results provide strong evidence that global experimental networks, such as PlanetLab, can efficiently be used for quantitative simulation, without compromising speed and efficiency.

Keywords: Multiple Replications in Parallel, Experimental networking facilities, Akaroa2, PlanetLab, Sequential quantitative stochastic simulation, Open queuing network.
1 Introduction

Quantitative stochastic simulation of a complex scenario can take hours or days to complete. SRIP (Single Replication in Parallel) and MRIP (Multiple Replications in Parallel) are two methods used to reduce simulation time. In SRIP, the simulation program is divided into smaller logical parts and run on different computers. In MRIP, multiple processors run their own replications of a sequential simulation, but cooperate with central analyzers (one central analyzer for each performance measure analyzed) that are responsible for analyzing the results and stopping the simulation when the specified level of accuracy is met [1]. The MRIP technique can significantly speed up simulation if replications are launched on a larger homogeneous set of computers [2, 3]. In the last few years, a large number of experimental networking facilities have been, or are being, developed across the globe: e.g. PlanetLab, GENI, OneLab, G-Lab, Akari, Panlab, etc. [4]. These global networks often consist of thousands of computers.
Thus they provide a viable alternative for running distributed stochastic simulations in the Multiple Replications in Parallel (MRIP) scenario. We selected PlanetLab as the provider of distributed computing resources for investigating various aspects of MRIP simulations, since it is a continuously evolving computing platform with thousands of nodes [5]. These nodes can be easily accessed for running MRIP without investing in infrastructure. However, before using such a globally distributed networking facility for sequential stochastic simulation on multiple computers, factors such as the load at selected nodes and the potential communication overhead between them have to be carefully considered, as these computers can be shared by a large number of users and some of them are thousands of miles apart. The load generated by these users can vary significantly and quickly, and can thus adversely affect the performance of the computers and of the simulations running on them.
Fig. 1. PlanetLab with deployed nodes around the world [5]
We did extensive experimentation to determine the suitability of PlanetLab nodes for MRIP simulations. Our simulations were run with Akaroa2, a universal controller of quantitative discrete event simulation, designed for automatic launching of MRIP-based experiments. Experiments were designed to measure the times needed to produce final simulation results over various sets of PlanetLab computers. Results obtained from the experiments executed over PlanetLab nodes were compared with the results obtained from running MRIP simulations on a local area network at the University of Canterbury. This has allowed us to conclude that a global networking facility such as PlanetLab can be effectively utilized for running MRIP. The rest of the paper is organized as follows. Section 2 spells out the procedure for running Akaroa2 on PlanetLab. Section 3 explains in detail the experimental setup and evaluation metric. Section 4 presents experimental results, and conclusions are in Section 5.
2 Akaroa2 on PlanetLab

In Akaroa2, multiple independent replications of a stochastic simulation are run on different processors, which play the role of independent simulation engines producing
statistically equivalent output data during one simulation. Multiple simulation engines cooperate with the global analyzer that processes the streams of output data coming from the different simulation engines, and stops the simulation once the required accuracy of the results has been achieved. The accuracy is typically measured by the relative statistical error of the results. The two main processes of Akaroa2 are Akmaster and Akslave. The Akslave process initiates simulation engines on multiple processors, while Akmaster controls sequential collection of output data and their analysis. It collects local estimates from all running Akslaves, calculates final global estimates, displays results, and then terminates the simulation when the stopping criterion is reached [6]. Both steady-state simulations and terminating simulations are supported. In the former case, the procedures for sequential mean and variance analysis are described in [1, 7-8], while the procedure adopted for terminating simulation is presented in [2]. Akaroa2 is widely used for simulations executed on local area networks, as its records of the last 10 years (as of July 2011) show over 3100 downloads of the software by users from over 80 countries [9]. In order to run Akaroa2 on PlanetLab, we first need to copy and install Akaroa2 on all the nodes which will be used for running simulation engines. Copying and installing software on hundreds of machines is an intricate task. Either the CoDeploy program [10] provided by PlanetLab or, alternatively, simple shell scripts for automating copying, installation and running of Akaroa2 on PlanetLab can be used. The shell script we used can be downloaded from the PlanetLab New Zealand web site [11]. For proper execution of MRIP-based simulation, the path variable should be correctly set in the bash profile file of all participating PlanetLab nodes, and the simulation program should be copied into the directory specified in the path. The detailed procedure with step-by-step instructions for running Akaroa2 on PlanetLab using the Linux or Windows operating system can be downloaded from the PlanetLab New Zealand web site [11].
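To give the flavour of the sequential accuracy control described above, the following C sketch shows a minimal stopping rule of the kind Akaroa2 applies; it is our own illustration, not Akaroa2's actual code, and it uses a simple normal approximation that ignores the autocorrelation handled by Akaroa2's spectral analysis.

#include <math.h>

/* Minimal sketch of a sequential stopping rule: stop once the relative
   half-width of the 0.95 confidence interval for the mean drops below eps.
   Normal approximation only; a real analyzer must also account for
   autocorrelation in the simulation output stream. */
int should_stop(double sum, double sum_sq, long n, double eps)
{
    if (n < 100)                          /* crude warm-up threshold */
        return 0;
    double mean = sum / n;
    double var  = (sum_sq - n * mean * mean) / (n - 1);
    double half = 1.96 * sqrt(var / n);   /* z-quantile for 0.95 confidence */
    return mean != 0.0 && half / fabs(mean) <= eps;
}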
3 Experimental Setup

To study the feasibility of running Akaroa2 on PlanetLab, we conducted a large number of experiments, considering different strategies for selecting participating nodes of the network. The aim was to measure the times to produce simulation results, from the time instant when the simulation was launched until the time instant when the final results were obtained, to find out how using differently selected sets of PlanetLab nodes can affect users' quality of experience, in comparison with simulations executed on local computers only. We compared two of many possible strategies for selection of PlanetLab nodes for MRIP simulations. We assumed that the computers are either distributed over a restricted geographical region (so they operate in the same or close time zones), or they are distributed globally (so computers participating in MRIP simulations work in very different time zones).

3.1 Computing Setup CS1

In this computing setting, while operating from New Zealand, we installed Akaroa2 on PlanetLab nodes spread over the European Union. The Akmaster was installed in
Italy and simulation engines were located in France, UK, Belgium, Italy, Hungary and Poland. PlanetLab nodes were carefully selected using the CoMon utility [12] to avoid currently heavily loaded nodes. The CoMon utility is provided by PlanetLab for monitoring the resource utilization of all PlanetLab nodes. In CS1 our aim was to assess response times of MRIP-based simulation experiments. The experiments were run on Friday, beginning at 2pm British Standard Time.

3.2 Computing Setup CS2

In this computing environment, simulation engines of Akaroa2 were installed worldwide, so they operated in very different time zones. Again, while operating from New Zealand, we installed the Akmaster in Italy, and the simulation engines were launched in Europe, USA, Canada, New Zealand and Asia; see Figure 2. Nodes were again carefully selected using the CoMon utility, avoiding nodes which were heavily loaded. This setup was used to study and verify the effect of communication overhead when simulation engines are thousands of miles apart. The experiments were run on Friday, beginning at 2pm USA Central Standard Time.
Fig. 2. Global distribution of Akaroa2 in CS2
Note that the nodes of PlanetLab used by Akaroa2 represented a homogeneous set of computers, as the computers of PlanetLab have to satisfy some minimum common technical requirements. For comparing the quality of users' experience in such distributed simulation environments, we have also measured the waiting times for simulation results in the traditional local processing environment of Akaroa2, where its simulation engines are located around a local area network.

3.3 Computing Setup CS3

Here, the simulation experiments were run on computers linked by a local area network in a computer laboratory of the Department of Computer Science and Software Engineering, at the University of Canterbury, in Christchurch. Akmaster and Akslave were installed in this controlled local area network environment, the original home location of Akaroa2. The results were used as the reference for comparison with the results obtained from the two other, distributed computing environments.
The experiments were run on Friday, beginning from 2pm, New Zealand time. The nodes of the local area network, physically located in one laboratory, constitute a homogeneous set of computers. Laboratory and PlanetLab nodes are equipped with quad processors and both use the Fedora operating system based on the Linux kernel. However, the computers available on PlanetLab are of slightly higher technical standard in terms of memory and clock frequency than those available in our CS3 setting.

3.4 Simulation Setting and Evaluation

We ran the same sequential stochastic simulation in the MRIP scenario in all three computing setups: CS1, CS2 and CS3. For our study, we simulated a simple open queuing network, consisting of a CPU and two disk memories with unlimited buffer capacities, depicted in Figure 3. We estimated the steady-state mean response time (the mean time spent by a customer in this system), assuming that arriving customers form a Poisson process with λ = 0.033 tasks per second. All service times are exponentially distributed, with a mean service time at the CPU of 6 seconds and mean service times at Disk 1 and Disk 2 both of 14 seconds. This makes the CPU, Disk 1 and Disk 2 loaded at 96%, 92.4% and 92.4%, respectively.
Fig. 3. Simulated open queuing network
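These utilizations are consistent with the standard relation ρ = λVS between utilization, arrival rate, visit count and mean service time; the visit counts below are not stated in the paper but follow from the quoted loads:

V_CPU = ρ_CPU / (λ S_CPU) = 0.96 / (0.033 × 6) ≈ 4.85,
V_Disk1 = V_Disk2 = 0.924 / (0.033 × 14) = 2,

i.e., each task visits the CPU about five times and each disk twice before leaving the system.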
The simulation processes on all computers were to stop when the estimate of the steady-state mean response time reached a relative statistical error not greater than 5%, at a confidence level of 0.95. This should require about 20 million observations. Running the simulation in the Multiple Replications in Parallel scenario allowed us to collect this sample of output data faster, as it is produced by multiple simulation engines. To demonstrate that this attractive feature of MRIP remains practical also in the case of globally distributed simulation engines, we assessed the speedup and relative efficiency of MRIP simulations in setups CS1 and CS2, and compared the results with those from locally distributed simulation engines in CS3. The performance of our MRIP simulations was assessed by measuring the response time (RT) of a given simulation setting, defined as the time interval between the time of launching the simulation until
the time when the final results are delivered to the user. Then, the speedup of simulation at P > 1 simulation engines can be found as:

S(P) = Mean_RT(1) / Mean_RT(P)        (1)
where Mean_RT(P) is the mean response time of P simulation engines running the MRIP simulation, with P ≥ 1. Alternatively, we looked at the relative speedup of MRIP simulation, defined as

S_R(P) = (Mean_RT(1) − Mean_RT(P)) / Mean_RT(1) × 100%        (2)
for P = 1, 2, 3, .... Note that, due to the truncated Amdahl law for MRIP formulated in [2, 3], there exists a limit on the number of processors which would increase the speedup of MRIP simulation. It is also known that the largest speedup can be obtained in homogeneous computing environments. In the extreme case, if one simulation engine uses a very fast processor and the remaining processors are slow, a simulation will not benefit from MRIP at all, as the fastest simulation engine can produce the entire sample of observations needed for stopping the simulation before any of the remaining slower simulation engines is able to reach its first checkpoint. Another performance measure which we considered is the efficiency of distributed processing during MRIP simulation, or speedup per simulation engine:

E(P) = S(P) / P        (3)
In an ideal situation, the efficiency would be equal to one. However, in practical applications of parallel processing it is usually much smaller. E(P) measures how well the contributing processors are utilized for solving a given problem, despite their mutual communication and synchronization activities.
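As an illustration of how the three measures relate, the following small C program (ours, with hypothetical sample values rather than measurements from the paper) evaluates (1)-(3) for one configuration:

#include <stdio.h>

/* Illustrative evaluation of Eqs. (1)-(3); the response times below are
   hypothetical placeholders. */
int main(void)
{
    double mean_rt_1 = 160.0;   /* Mean_RT(1), hypothetical */
    double mean_rt_P = 45.2;    /* Mean_RT(P), hypothetical */
    int    P         = 8;

    double S  = mean_rt_1 / mean_rt_P;                        /* Eq. (1) */
    double SR = (mean_rt_1 - mean_rt_P) / mean_rt_1 * 100.0;  /* Eq. (2) */
    double E  = S / P;                                        /* Eq. (3) */

    printf("S(P)=%.2f  S_R(P)=%.1f%%  E(P)=%.2f\n", S, SR, E);
    return 0;
}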
4 Experimental Results

In this section, we present our experimental results obtained under computing setups CS1, CS2 and CS3. We use mean response time as the measure of quality for testing our hypothesis that the MRIP scenario can also be efficiently used in the case of world-wide distributed simulation engines. The mean response times obtained for CS1, CS2 and CS3, measured in seconds, are given in Table 1. Each reported result is an average over 10 independent measurements/simulations. The relative statistical errors of these estimates are not larger than 1% for CS3 and not larger than 6% for CS1 and CS2, at the 0.95 confidence level. Fig. 4 compares the mean response times of CS1, CS2 and CS3. The histogram clearly shows that the mean response time decreases as the number of nodes increases. The PlanetLab nodes are shared by a large number of users and are located hundreds of miles apart. Conversely, laboratory nodes are used by only one person and are located close to each other. The mean response times in the case of CS3 are therefore smaller than in the case of PlanetLab nodes, both in CS1 and CS2. In order to obtain good performance, PlanetLab nodes should be carefully selected, avoiding heavily loaded nodes and busy working hours.
Table 1. Mean response time (in seconds) for scenarios CS1, CS2 and CS3

Number of Nodes    CS1      CS2      CS3
2                  88.13    97.53    59.78
4                  61.94    75.48    47.53
6                  52.25    64.74    37.08
8                  45.23    59.34    32.81
10                 39.80    46.81    29.98
12                 34.32    43.21    28.62
15                 27.14    36.35    15.67
Fig. 4. Comparison of mean response times in CS1, CS2 and CS3
Comparison of the mean response times for CS1 and CS2 shows that these mean response times are much shorter if all the PlanetLab nodes are selected from one area (continent), for example within Europe, rather than from all over the world. This is primarily because of communication overhead. When the controller and simulation engines are located thousands of miles apart, the time used for exchanging data between the simulation engines and the controller directly affects the mean response time. We also ran the same experiment by selecting PlanetLab nodes from North America only, and found results similar to those of setup CS2. Speedup for the distributed scenarios of CS1 and CS3 is calculated using Equation (1) and given in Table 2. Speedup has been calculated using the mean response time of two nodes as a reference. In spite of the longer distance between nodes, the speedup offered by PlanetLab nodes in the case of CS1 is better than in CS3, because of the slightly better hardware of the PlanetLab nodes.
Table 2. Speedup for the distributed scenarios of CS1 and CS3

Number of Nodes    CS1     CS3
2                  1       1
4                  1.42    1.26
6                  1.69    1.61
8                  1.95    1.82
10                 2.21    1.99
12                 2.57    2.09
15                 3.24    3.17
Efficiency in the case of CS1 and CS3 has been calculated using Equation (3) and is shown in Table 3. In this case, there is only a small difference between the results. The efficiency decreases as the number of processors increases. This is due to the fact that processor communication is usually slower than computation, and the exchange of local estimates between Akslaves and Akmaster results in frequent communication.

Table 3. Efficiency for scenarios CS1 and CS3

Number of Nodes    CS1     CS3
2                  0.5     0.5
4                  0.31    0.35
6                  0.26    0.28
8                  0.22    0.24
10                 0.19    0.22
12                 0.17    0.21
15                 0.20    0.21
These results allow us to conclude that it has become practical to use the distributed computing resources of global experimental networks for fast quantitative stochastic simulation, paying only a small penalty in the form of a minor worsening of response times, speedup and efficiency of the simulation compared with the same simulations run on a local area network. The advantage of using globally distributed computing resources is that they can be substantially larger than the ones available locally. We conducted experiments using two different ways of selecting computers in PlanetLab for simulation engines and compared their performance with the performance of simulations run on computers of a local area network. The performance of MRIP in CS1 appears to be better than in CS2. Thus, for best results, selecting computers from a close geographical region and avoiding both heavily loaded nodes and busy hours is recommended.
5 Conclusions

In this paper we have shown that the distributed computing resources of global experimental networks, such as PlanetLab, can be effectively used for running quantitative stochastic simulations in the MRIP scenario. Only a small penalty (in the form of a minor worsening of performance) is paid for using globally distributed resources instead of local ones. Launching and on-line control of globally distributed simulations can be done by using, for example, Akaroa2. This is encouraging news for those who need to run time-consuming quantitative simulations to get accurate final results, but do not have access to a sufficiently large number of computers for launching multiple simulation engines. Recently, there has been a surge in the development of global and regional experimental networking facilities; see Table 4 [13]. Most of these networks offer free membership and can be effectively used for conducting simulation experiments under the control of Akaroa2.

Table 4. Selected experimental networking facilities, with size and accessibility

Name        Purpose         Size        Access
OneLab      Multipurpose    Regional    Free membership
Panlab      Multipurpose    Regional    Planned to be on payment
Federica    Multipurpose    Regional    Free membership
PlanetLab   Multipurpose    Global      Free membership
GENI        Multipurpose    Regional    Free membership
JNB 2       Multipurpose    Regional    Free membership
CNGI        Multipurpose    Regional    Free membership
In the future, we plan to investigate the upper bounds for the speedup of globally distributed sequential stochastic simulation, such as those in the MRIP scenario. This will require running experiments at full scale, employing hundreds of PlanetLab nodes as simulation engines, with simulations requiring extremely large samples of output data for producing accurate simulation results, in particular if the simulated processes are strongly correlated.

Acknowledgments. This work was partially supported by REANNZ (2010/2011 KAREN Capability Build Fund).
References

1. Pawlikowski, K., Yau, V., McNickle, D.: Distributed stochastic discrete-event simulation in parallel time streams. In: 26th Conference on Winter Simulation, pp. 723–730. Society for Computer Simulation International, Orlando (1994)
2. Pawlikowski, K., Ewing, G., McNickle, D.: Performance Evaluation of Industrial Processes in Computer Network Environments. In: European Conference on Concurrent Engineering, pp. 129–135. Int. Society for Computer Simulation, Erlangen (1998)
3. Pawlikowski, K., McNickle, D.: Speeding Up Stochastic Discrete-Event Simulation. In: European Simulation Symposium, pp. 132–138. ISCS Press, Marseille (2001)
4. Lemke, M.: The Role of Experimentation in Future Internet Research: FIRE and Beyond. In: 6th International Conference on Testbeds and Research Infrastructures for the Development of Networks and Communities, Berlin, Germany (2010)
5. PlanetLab, http://www.planet-lab.org/
6. Ewing, G., Pawlikowski, K., McNickle, D.: Akaroa-2: Exploiting Network Computing by Distributing Stochastic Simulation. In: 13th European Simulation Multi-Conference, Warsaw, Poland, pp. 175–181 (1999)
7. Ewing, G., Pawlikowski, K.: Spectral Analysis for Confidence Interval Estimation Under Multiple Replications in Parallel. In: 14th European Simulation Symposium, pp. 52–61. ISCS Press, Dresden (2002)
8. Shaw, N., McNickle, D., Pawlikowski, K.: Fast Automated Estimation of Variance in Sequential Discrete Event Stochastic Simulation. In: 25th European Conference on Modelling and Simulation, Krakow, Poland (2011)
9. Akaroa2, http://www.cosc.canterbury.ac.nz/research/RG/net_sim/simulation_group/akaroa/about.chtml
10. CoDeploy, http://codeen.cs.princeton.edu/codeploy/
11. PlanetLab NZ, http://www.planetlabnz.canterbury.ac.nz/
12. CoMon, http://comon.cs.princeton.edu
13. Haque, M., Pawlikowski, K., Ray, S.: Challenges to Development of Multipurpose Global Federated Testbed for Future Internet Experimentation. In: 9th ACS/IEEE International Conference on Computer Systems and Applications, Sharm El-Sheikh, Egypt (2011)
Comparison of Three Parallel Point-Multiplication Algorithms on Conic Curves

Yongnan Li¹,², Limin Xiao¹,², Guangjun Qin¹,², Xiuqiao Li¹,², and Songsong Lei¹,²

¹ State Key Laboratory of Software Development Environment, Beihang University, Beijing, 100191, China
² School of Computer Science and Engineering, Beihang University, Beijing, 100191, China
{liyongnan.buaa,guangjunster,xiuqiaoli,lss.linux}@gmail.com
[email protected]
Abstract. This paper makes a comparison of three parallel point-multiplication algorithms on conic curves over ring Zn. We propose one algorithm for parallelizing point-multiplication by utilizing the Chinese Remainder Theorem to divide point-multiplication over ring Zn into two different point-multiplications over finite fields and to compute them respectively. Time complexity and speedup ratio of this parallel algorithm are computed on the basis of our previous research about the basic parallel algorithms in the conic curves cryptosystem. A quantitative performance analysis is made to compare this algorithm with two other algorithms we designed before. The performance comparison demonstrates that the algorithm presented in this paper can reduce the time complexity of point-multiplication on conic curves over ring Zn and that it is more efficient than the preceding ones.

Keywords: conic curves, ring Zn, finite field Fp, point-addition, point-double, point-multiplication, Chinese Remainder Theorem.
1 Introduction

In recent years, three main classes of public key cryptosystems are considered both secure and efficient: integer factorization systems, discrete logarithm systems and discrete logarithm systems based on mathematical curves. The conic curves cryptosystem belongs to the third one. Professor Cao first presented the concept of conic curves cryptography in [1-2]. Then a public-key cryptosystem scheme on conic curves over ring Zn was proposed in [3-5]. Research in [6] introduced the definitions of extended point-addition and point-double on conic curves over ring Zn. In this paper, an efficient technique for parallel computation of the point-multiplication on conic curves over ring Zn is proposed, and our algorithm can reduce the time complexity of point-multiplication. The analysis of this parallel methodology is based on our previous work about the basic parallel algorithms used in the conic curves cryptosystem. Study in [7] proposed several parallel algorithms for the cryptosystem on conic curves over finite field Fp. In [8], original point-addition and point-double were parallelized for the cryptosystem on conic curves over ring Zn. Work in [9] introduced
traditional parallel point-multiplication in the conic curves cryptosystem over ring Zn and finite field Fp. Parallel extended basic operations of point-addition and point-double were proposed in [10]. Study in [11] designed two high performance algorithms of point-multiplication for the conic curves cryptosystem based on the standard NAF algorithm and the Chinese Remainder Theorem. The methodology presented in this paper partitions point-multiplication over ring Zn into two point-multiplications over finite field Fp and finite field Fq by applying the Chinese Remainder Theorem, which dates back to the ancient Chinese mathematical treatise Sunzi Suanjing. The two point-multiplications are executed respectively, and then the temporary results are merged to get the final value. This method is similar to the one we proposed in [11]; the difference is that the preceding research adopted the standard NAF algorithm to compute the point-multiplication over finite field Fp. Time complexity and speedup ratio of this parallel algorithm are computed on the basis of our previous research about the time complexities of the fundamental algorithms on conic curves. We then evaluate the performance of this parallel algorithm and compare it with the two old parallel algorithms. The parallel algorithm proposed in this paper not only accelerates the speed of point-multiplication, but also shows higher efficiency than the two old parallel algorithms we designed before. The rest of this paper is organized as follows. The next section introduces the definition of point-multiplication on conic curves over ring Zn. Section 3 depicts the time complexities of the basic operations in the conic curves cryptosystem. Section 4 presents the methodology of parallelizing point-multiplication on conic curves over ring Zn. The performance comparison of our techniques is presented in Section 5. The last section concludes the whole paper and points out some future works briefly.
2 Definition of Point-Multiplication
The definitions of point-addition and point-double must be introduced first. In the conic curves cryptosystem over ring Zn, Cn(a, b) denotes the conic curve. C1, C2 and C3 represent three different fields over ring Zn. For any points P(x1, y1) ∈ Cn(a, b) and Q(x2, y2) ∈ Cn(a, b), the operator ⊕ is defined as:

• If P ≠ Q, then the operation of point-addition is P ⊕ Q.
• If P = Q, then the operation of point-double is 2P.
The operator ⊕ is defined differently in the expressions of point-addition and point-double. Point-multiplication signifies the summation of many points on conic curves. Parameter k and parameter P represent the coefficient and a point on the conic curve, respectively. Point-multiplication kP is defined as:

kP = P ⊕ P ⊕ ... ⊕ P   (k terms)        (1)
In the conic curves cryptosystem over ring Zn, we define n = pq (p and q are two different odd prime integers). For more details, please refer to the research in [1-5].
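Although this paper computes kP with the parallel schemes below, the classical sequential way to evaluate (1) is the double-and-add method. The following is a minimal generic sketch of ours, in which Point, point_add and point_double are hypothetical stand-ins for the conic-curve operations ⊕ and 2P:

#include <stdint.h>

typedef struct { int64_t x, y; } Point;    /* placeholder representation */

Point point_add(Point a, Point b);          /* P ⊕ Q for P != Q */
Point point_double(Point a);                /* 2P */

/* Left-to-right double-and-add: scans the bits of k, doubling once per
   remaining bit and adding P for every 1-bit. Undefined for k == 0. */
Point point_mul(uint64_t k, Point p)
{
    Point acc = p;
    int i = 63;
    while (((k >> i) & 1) == 0)
        i--;                                /* skip leading zero bits */
    for (i--; i >= 0; i--) {
        acc = point_double(acc);
        if ((k >> i) & 1)
            acc = point_add(acc, p);
    }
    return acc;
}

For a random t-bit coefficient k this needs t − 1 doublings and about (t − 1)/2 additions on average, which is where the factor 3(t − 1)/2 in the sequential complexities of Table 2 below comes from.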
3 Time Complexities of Basic Operations

As depicted in [7-10], the time complexities of several basic operations are listed in Table 1 and Table 2. We set the runtime of a single-precision multiplication as the basic measurement unit. The time complexities of multiple-precision addition and subtraction are O(1).

Table 1. Time complexities of parallel operations (first term: computation; second term: communication)

Original point-addition over ring Zn:
  O(3N^2/s_n + 2N_p^2/s_p + 9lN + 3lA + 21) + O(3N^2/s_n + 2N_p^2/s_p + 6N + 6lN + 4N_p + 2lA + 5)

Original point-double over ring Zn:
  O(3N^2/s_n + 2N_p^2/s_p + 9lN + 3lA + 2a + 20.5) + O(3N^2/s_n + 2N_p^2/s_p + 6N + 6lN + 4N_p + 2lA + 5)

Extended point-addition over ring Zn:
  O(2N^2/s_n + 6N_p^2/s_p + 6lN_p + 28) + O(2N^2/s_n + 6N_p^2/s_p + 4N + 12N_p + 4lN_p + 10)

Extended point-double over ring Zn:
  O(2N^2/s_n + 6N_p^2/s_p + 6lN_p + 28) + O(2N^2/s_n + 6N_p^2/s_p + 4N + 12N_p + 4lN_p + 4)

Point-multiplication over ring Zn:
  O(t(3N^2/s_n + 2N_p^2/s_p + 9lN + 3lA + 24.5) − 3.5) + O(t(3N^2/s_n + 2N_p^2/s_p + 6N + 6lN + 4N_p + 2lA + 5) + …)

Point-multiplication over finite field Fp:
  O(t(3n^2/s + 10 + 3⌈lg X⌉ + 3⌈lg P⌉)) + O(t(3n^2/s + 6n + 2⌈lg X⌉ + 2⌈lg P⌉) + 1)

Multiplication:
  O(n^2/s + 2) + O(n^2/s + 2n)

Barrett reduction:
  O(2n^2/s + 8) + O(2n^2/s + 4n)
The meanings of the variables in Table 1 and Table 2 are:

• N: multiple-precision of operand n over ring Zn.
• N_p: multiple-precision of operand p over ring Zn.
• N_q: multiple-precision of operand q over ring Zn.
• a: a coefficient of the conic curve equation.
• b: a coefficient of the conic curve equation.
• A: a fixed coefficient over ring Zn.
• s: process number for computing multiplication.
• l: word length of the computer.
• X: numerator of inversion-multiplication.
• P: denominator of inversion-multiplication.
• n: multiple-precision of operand over finite field.
• t: the number of bits in coefficient k for computing point-multiplication kP.
• s_n: the value of s over ring Zn.
• s_p: the value of s over finite field Fp.
• s_q: the value of s over finite field Fq.

Table 2. Time complexities of sequential operations
Original point-addition over ring Zn: O(2N^2 + N_p^2 + N_q^2 + 10N + 12lN + 4lA + 1.5b + 19.5)
Original point-double over ring Zn: O(2N^2 + N_p^2 + N_q^2 + 10N + 12lN + 4lA + 1.5b + 2a + 17.5)
Extended point-addition over ring Zn: O(N^2 + 5N_p^2 + 5N_q^2 + 23N + 8lN + 37)
Extended point-double over ring Zn: O(N^2 + 4N_p^2 + 4N_q^2 + 19N + 8lN + 27)
Point-multiplication over ring Zn: O((3(t−1)/2)(2N^2 + N_p^2 + N_q^2 + 10N + 12lN + 4lA + 1.5b + 17.5) + (2a+1)(t−1))
Point-multiplication over finite field Fp: O((3(t−1)/2)(2n^2 + 9n + 5 + 4⌈lg X⌉ + 4⌈lg P⌉))
Multiplication: O(n^2 + 2n)
Barrett reduction: O(n^2 + 4n + 5)
The relationship of the variables N, N_p and N_q in Table 1 and Table 2 is N = N_p + N_q, with N_p ≥ N_q. The value of coefficient A is

A = (C_1 N + C_2 N_q + C_3 N_p) / (C_1 + C_2 + C_3)

where C_1, C_2 and C_3 stand for the numbers of points in the fields C1, C2 and C3. In this paper, the symbol "Fq" is defined analogously to finite field Fp; the distinction is that its modulus is the prime integer q.
4 Parallel Point-Multiplication
This section explains the methodology of parallelizing point-multiplication for the cryptosystem on conic curves over ring Zn. It uses the Chinese Remainder Theorem to partition point-multiplication over ring Zn into two point-multiplications over finite fields. As Fig. 1 shows, there are three steps in the parallel procedure of point-multiplication. Firstly, two reductions are calculated to divide parameter t into t_p and t_q. Then two point-multiplications are computed over finite field Fp and finite field Fq, respectively. Lastly, kP(t_p) and kP(t_q) are combined by calling the Chinese Remainder Theorem to get the value of kP(t).
Fig. 1. Parallel procedure using Chinese Remainder Theorem
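A scalar-level sketch of this split-and-merge procedure is given below. It is our own illustration with small machine integers (a real implementation works with multiple-precision numbers and applies the recombination of Eq. (8) to point results), and modinv is a hypothetical helper. Note that in the standard CRT formulation each residue is paired with the cofactor of the other modulus.

#include <stdint.h>

/* Modular inverse via the extended Euclidean algorithm;
   assumes gcd(a, m) == 1. */
static int64_t modinv(int64_t a, int64_t m)
{
    int64_t t = 0, newt = 1, r = m, newr = a % m, q, tmp;
    while (newr != 0) {
        q = r / newr;
        tmp = t - q * newt; t = newt; newt = tmp;
        tmp = r - q * newr; r = newr; newr = tmp;
    }
    return t < 0 ? t + m : t;
}

/* Step 1 reduces the scalar modulo p and q; the two halves can then be
   processed independently (step 2); this function performs the merge
   (step 3) for residues xp = x mod p and xq = x mod q. */
int64_t crt_merge(int64_t xp, int64_t xq, int64_t p, int64_t q)
{
    int64_t n  = p * q;
    int64_t cp = q * modinv(q % p, p) % n;   /* ≡ 1 (mod p), ≡ 0 (mod q) */
    int64_t cq = p * modinv(p % q, q) % n;   /* ≡ 0 (mod p), ≡ 1 (mod q) */
    return (xp * cp + xq * cq) % n;
}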
In the first step, two reductions are calculated to map parameter t over ring Zn into t_p over finite field Fp and t_q over finite field Fq. Then (in each of the following sums, the first O(·) term is the computation part and the second is the communication part):

T_p-left = O(2N_p^2/s_p + 8) + O(2N_p^2/s_p + 4N_p)        (2)

T_p-right = O(2N_q^2/s_q + 8) + O(2N_q^2/s_q + 4N_q)        (3)

We can conclude that the value of T_p-left is bigger than T_p-right because N_p ≥ N_q. This procedure costs two communication units (variable t and qq^-1), so the parallel runtime of the first step is

T_p-first = O(2N_p^2/s_p + 8) + O(2N_p^2/s_p + 4N_p + 2)        (4)
In the second step, two operations of point-multiplication over finite field Fp and finite field Fq are executed simultaneously. Then the values of the two point-multiplications and the inverses of the two moduli should be multiplied to get the final value of point-multiplication over ring Zn. Parameters pp^-1 and qq^-1 are two constants in the cryptosystem because p and q are fixed. The multiple-precision of kP(t_p)·pp^-1 is 2N_p and the multiple-precision of kP(t_q)·qq^-1 is 2N_q. Consequently, it costs one point-multiplication and one multiple-precision multiplication over each finite field. We get the parallel runtime of the second step:
T_p-left = O(t(3N_p^2/s_p + 6lN_p + 10) + N_p^2/s_p + 2) + O(t(3N_p^2/s_p + 6N_p + 4lN_p) + N_p^2/s_p + 2N_p + 1)        (5)

T_p-right = O(t(3N_q^2/s_q + 6lN_q + 10) + N_q^2/s_q + 2) + O(t(3N_q^2/s_q + 6N_q + 4lN_q) + N_q^2/s_q + 2N_q + 1)        (6)

Then,

T_p-second = O(t(3N_p^2/s_p + 6lN_p + 10) + N_p^2/s_p + 2) + O(t(3N_p^2/s_p + 6N_p + 4lN_p) + N_p^2/s_p + 2N_p + 1)        (7)
(
)
k ( t p ) = kP ( t p ) pp −1 + kP ( tq ) qq −1 mod n .
(8)
Sum of kP ( t p ) pp −1 and kP ( tq ) qq −1 is a 2Np multiple-precision integer. One multiple-precision reduction will be needed because the final result is a 2N multipleprecision integer and 2 N p ≥ N . Therefore, the third step costs one multiple-precision addition and one multiple-precision reduction. The parallel runtime of the third step is
Tp − third
communication 64computation 4744 8 644 474448 2 = O(2 N sn + 8) + O (2 N 2 sn + 4 N + 1) .
(9)
Consequently, the parallel runtime of point-multiplication is

T_p1 = O(t(3N_p^2/s_p + 10 + 6lN_p) + 2N^2/s_n + 3N_p^2/s_p + 18) + O(t(3N_p^2/s_p + 6N_p + 4lN_p) + 2N^2/s_n + 3N_p^2/s_p + 6N_p + 4N + 4)        (10)
The sequential runtime of point-multiplication can be looked up in Table 2:

T_s = O((3(t−1)/2)(2N^2 + N_p^2 + N_q^2 + 10N + 12lN + 4lA + 1.5b + 17.5) + (2a+1)(t−1))        (11)
Then we can get the speedup ratio:

S = O((3(t−1)/2)(2N^2 + N_p^2 + N_q^2 + 10N + 12lN + 4lA + 1.5b + 17.5) + (2a+1)(t−1)) / [O(t(3N_p^2/s_p + 10 + 6lN_p) + 2N^2/s_n + 3N_p^2/s_p + 18) + O(t(3N_p^2/s_p + 6N_p + 4lN_p) + 2N^2/s_n + 3N_p^2/s_p + 6N_p + 4N + 4)]        (12)

5 Performance Comparison
This section evaluates the performance of the parallel point-multiplication proposed in this paper. A quantitative performance comparison is made between this parallel point-multiplication and the two other parallel point-multiplications we presented before. The parameters over ring Zn are assumed as: N = 2s_n = 2N_p = 2N_q = 4s_p = 4s_q, a = 2, b = 1, l = 32. As demonstrated in [8],

A = (C_1 N + C_2 N_q + C_3 N_p) / (C_1 + C_2 + C_3)        (13)

If variable n over ring Zn is big enough, the value of C_1 is much bigger than C_2 and C_3. Then coefficient A will be approximately equal to N. Therefore, T_p1 in (10) is simplified as:
T_p1 = O(t(99N + 10) + 7N + 18) + O(70tN + 14N + 4)        (14)
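For the reader's convenience, the simplification can be checked by substituting the assumed parameters into (10); this worked substitution is ours:

3N_p^2/s_p = 3N,   2N^2/s_n = 4N,   6lN_p = 96N,   4lN_p = 64N,   6N_p = 3N,

so the computation part becomes t(3N + 96N + 10) + 4N + 3N + 18 = t(99N + 10) + 7N + 18, and the communication part becomes t(3N + 3N + 64N) + 4N + 3N + 3N + 4N + 4 = 70tN + 14N + 4.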
The time complexity of the parallel point-multiplication proposed in [11] is (15), and it can be simplified as (16).

T_p2 = O((t + 1/2)(3N_p^2/s_p + 6lN_p + 10) + 2N^2/s_n + 3N_p^2/s_p + 18) + O((t + 1/2)(3N_p^2/s_p + 6N_p + 4lN_p) + 2N^2/s_n + 3N_p^2/s_p + 4N + 6N_p + 4)        (15)

T_p2 = O((t + 0.5)(99N + 10) + 7N + 18) + O(70N(t + 0.5) + 14N + 4)        (16)
As demonstrated in Table 1 and Table 2, the runtimes of the traditional parallel point-multiplication and of the sequential point-multiplication can be simplified as:

T_p3 = O(t(384N + 24.5) − 3.5) + O(t(272N + 5) + 1)        (17)

T_s = O((3(t−1)/2)(2.5N^2 + 522N + 19) + 5(t−1))        (18)
On condition that one communication time unit costs 20 percent of one computation time unit, the performance evaluation and comparison are shown in Table 3, Fig. 2, Fig. 3 and Fig. 4. It can be seen that the methodology of parallelizing point-multiplication accelerates the speed of point-multiplication and that it is more efficient than the two other methods. We also made other assumptions about the relationship between communication time unit and computation time unit; the same conclusion is derived by analyzing the performance comparison under those different conditions.

Table 3. Performance evaluation

N     t     Tp1        Tp2        Tp3        Ts
8     10    9237.2     9694.2     35323.7    58837.5
8     20    18377.2    18834.2    70650.7    124212.5
8     30    27517.2    27974.2    105977.7   189587.5
8     40    36657.2    37114.2    141304.7   254962.5
8     50    45797.2    46254.2    176631.7   320337.5
8     60    54937.2    55394.2    211958.7   385712.5
8     70    64077.2    64534.2    247285.7   451087.5
16    10    18355.6    19264.6    70395.7    121693.5
16    20    36535.6    37444.6    140794.7   256908.5
16    30    54715.6    55624.6    211193.7   392123.5
16    40    72895.6    73804.6    281592.7   527338.5
16    50    91075.6    91984.6    351991.7   662553.5
16    60    109255.6   110164.6   422390.7   797768.5
16    70    127435.6   128344.6   492789.7   932983.5
24    10    27474      28835      105467.7   188869.5
24    20    54694      56055      210938.7   398724.5
24    30    81914      83275      316409.7   608579.5
24    40    109134     110495     421880.7   818434.5
24    50    136354     137715     527351.7   1028289.5
24    60    163574     164935     632822.7   1238144.5
24    70    190794     192155     738293.7   1447999.5
Fig. 2. Performance comparison while N=8
Fig. 3. Performance comparison while N=16
Fig. 4. Performance comparison while N=24
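The entries of Table 3 can be reproduced directly from the simplified expressions (14) and (16)-(18); the following small C program is our own check, with the communication part weighted by 0.2 as assumed above:

#include <stdio.h>

/* Totals from (14), (16), (17), (18); communication weighted by 0.2. */
static double tp1(double N, double t)
{ return t*(99*N + 10) + 7*N + 18 + 0.2*(70*t*N + 14*N + 4); }

static double tp2(double N, double t)
{ return (t + 0.5)*(99*N + 10) + 7*N + 18 + 0.2*(70*N*(t + 0.5) + 14*N + 4); }

static double tp3(double N, double t)
{ return t*(384*N + 24.5) - 3.5 + 0.2*(t*(272*N + 5) + 1); }

static double ts(double N, double t)
{ return 1.5*(t - 1)*(2.5*N*N + 522*N + 19) + 5*(t - 1); }

int main(void)
{
    /* First row of Table 3: prints 9237.2 9694.2 35323.7 58837.5 */
    printf("%.1f %.1f %.1f %.1f\n", tp1(8, 10), tp2(8, 10), tp3(8, 10), ts(8, 10));
    return 0;
}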
6 Conclusions

In this paper, we presented a methodology of parallelizing point-multiplication on conic curves over ring Zn. The method is based on the Chinese Remainder Theorem. The performance comparison between this parallel methodology and two other ones demonstrates that the technique introduced in this paper is the most efficient one. Our research, including the study in this paper, is focused on the basic parallel algorithms in the conic curves cryptosystem. We plan to design a parallel algorithm for Elgamal cryptography on conic curves over ring Zn based on these parallel algorithms in the future.

Acknowledgments. This study is sponsored by the fund of the State Key Laboratory of Software Development Environment under Grant No. SKLSDE-2009ZX-01, the Fundamental Research Funds for the Central Universities under Grant No. YWF-1002-058 and the National Natural Science Foundation of China under Grant No. 60973007.
References

1. Cao, Z.: A public key cryptosystem based on a conic over finite fields Fp. In: Advances in Cryptology: Chinacrypt 1998, pp. 45–49. Science Press, Beijing (1998) (in Chinese)
2. Cao, Z.: Conic analog of RSA cryptosystem and some improved RSA cryptosystems. Natural Science Journal of Heilongjiang University 16(4), 5–18 (1999)
3. Chen, Z., Song, X.: A public-key cryptosystem scheme on conic curves over the ring Zn. In: 6th International Conference on Machine Learning and Cybernetics, vol. 4, pp. 2183–2187. IEEE Press, Hong Kong (2007)
4. Sun, Q., Zhu, W., Wang, B.: The conic curves over Zn and public key cryptosystem protocol. J. Sichuan Univ. (Nat. Sci. Ed.) 42(3), 471–478 (2005) (in Chinese)
5. Wang, B., Zhu, W., Sun, Q.: Public key cryptosystem based on the conic curves over Zn. J. Sichuan Univ. (Engin. Sci. Ed.) 37(5), 112–117 (2005) (in Chinese)
6. Li, Y.: Research of Conic Curve Cryptosystems and the Construction of CC-CSP. Thesis for the degree of master in computer application technology, Northeastern University, pp. 25–27 (2008) (in Chinese)
7. Li, Y., Xiao, L., Hu, Y., Liang, A., Tian, L.: Parallel algorithms for cryptosystem on conic curves over finite field Fp. In: 9th International Conference on Grid and Cloud Computing, pp. 163–167. IEEE Press, Nanjing (2010)
8. Li, Y., Xiao, L., Liang, A., Wang, Z.: Parallel point-addition and point-double for cryptosystem on conic curves over ring Zn. In: 11th International Conference on Parallel and Distributed Computing, Applications and Technologies, pp. 317–322. IEEE Press, Wuhan (2010)
9. Li, Y., Xiao, L.: Parallel point-multiplication for conic curves cryptosystem. In: 3rd International Symposium on Parallel Architectures, Algorithms and Programming, pp. 116–120. IEEE Press, Dalian (2010)
10. Li, Y., Xiao, L., Chen, S., Tian, H., Ruan, L., Yu, B.: Parallel Extended Basic Operations for Conic Curves Cryptography over Ring Zn. In: 9th IEEE International Symposium on Parallel and Distributed Processing with Applications Workshops, pp. 203–209. IEEE Press, Busan (2011)
11. Li, Y., Xiao, L., Wang, Z., Tian, H.: High Performance Point-Multiplication for Conic Curves Cryptosystem based on Standard NAF Algorithm and Chinese Remainder Theorem. In: 2011 International Conference on Information Science and Applications. IEEE Press, Jeju (2011)
Extending Synchronization Constructs in OpenMP to Exploit Pipeline Parallelism on Heterogeneous Multi-core

Shigang Li, Shucai Yao, Haohu He, Lili Sun, Yi Chen, and Yunfeng Peng

University of Science and Technology Beijing, 100083 Beijing, China
[email protected]
Abstract. The ability of expressing multiple-levels of parallelism is one of the significant features in OpenMP parallel programming model. However, pipeline parallelism is not well supported in OpenMP. This paper proposes extensions to OpenMP directives, aiming at expressing pipeline parallelism effectively. The extended directives are divided into two groups. One can define the precedence at thread level while the other can define the precedence at iteration level. Through these directives, programmers can establish pipeline model more easily and exploit more parallelism to improve performance. To support these directives, a set of runtime interfaces for synchronization are implemented on the Cell heterogeneous multi-core architecture using signal block communications mechanism. Experimental results indicate that good performance can be obtained from the pipeline scheme proposed in this paper compared to the naive parallel applications. Keywords: Pipeline Parallelism, OpenMP, Cell architecture.
1 Introduction

Multi-core architectures are becoming the industry standard in the modern computer industry. There are two main categories: homogeneous multi-core and heterogeneous multi-core. The former includes only identical cores, while the latter integrates a control core and several accelerator cores. The IBM/Toshiba/Sony Cell processor [10] is a typical heterogeneous multi-core. It comprises a conventional Power Processor Element (PPE) and eight Synergistic Processing Elements (SPEs). SPEs do not have hardware caches, but each possesses a 256 KB local store. Communications between the PPE and SPEs can be implemented through DMA, signals or mailboxes. For different memory architectures, there are different programming models, such as message passing for distributed memory and shared-memory inter-core communication methods. A well-known programming model for shared-memory parallel programming is OpenMP [1]. In the current definition of OpenMP, multiple levels of parallelism [9, 15, 16, 17] can be expressed. However, pipeline parallelism is not well supported in OpenMP. Due to the requirements of both programmers and applications, it is necessary to extend OpenMP directives to express pipelined executions. In this paper we extend the OpenMP programming model with two groups of synchronization directives (one is
based on threads and the other is based on iterations), by which programmers can establish the pipeline model flexibly and simply. Furthermore, runtime interfaces are implemented. To evaluate the performance, we conduct experiments on a Cell Blade using the NAS IS, EP, LU [14] and SPEC2001 MOLDYN [20] benchmarks. In IS, EP and MOLDYN, speedup factors from 4.8 to 5.5 can be obtained with our pipeline model. In LU, the pipeline structure can be established easily using our extended directives rather than using complicated data structures. The remainder of the paper is structured as follows: Section 2 discusses the related work. Section 3 presents the extended directives for pipeline parallelism. In Section 4 we show the runtime support for pipeline parallelism on the Cell architecture. A case study is given in Section 5, and experimental results are presented in Section 6 before we conclude in Section 7.
2 Related Work

Pipeline parallelism has been researched for various programming languages and on different architectures. Gonzalez et al. present research on extending OpenMP directives to exploit pipelined executions in OpenMP [2]. This pipeline model is oriented toward work-sharing constructs, while our pipeline model is loop-oriented, which makes it possible to partition and pipeline the tasks in the critical region. Michailidis et al. propose a pipeline implementation of LU factorization in OpenMP using a queue data structure [5]. In this pipeline scheme, no extra directives are added to support the pipeline. In contrast, two functions Put() and Get() are implemented using the original syntax of OpenMP to exchange elements between threads of the pipeline. Baudisch et al. present an automatic synthesis procedure that translates synchronous programs to software pipelines [6]. The compiler [8] generates the dependency graph of the guarded actions. Based on this, the graph is split into horizontal slices [6] that form threads to implement pipeline stages. In addition, threads communicate through FIFO buffers to allow asynchronous execution. Decoupled Software Pipelining (DSWP) [12, 3] can exploit the fine-grained pipeline parallelism inherent in most applications. In order to ensure that critical resources are kept in one thread, DSWP partitions the loop body into a pipeline of threads, rather than partitioning a loop by putting different iterations into different threads. Low-overhead synchronization between the pipeline stages can be implemented with a special synchronization array [3]. Coarse-grained parallelism is suitable for stream programming, because streaming applications are naturally represented by independent filters that communicate over explicit data channels [13]. Thies et al. present exploiting coarse-grained pipeline parallelism in C programs [11] to improve the performance of streaming applications. On Parallel Processor Arrays (PPA), Syrivelis et al. present runtime support to extract coarse-grained pipeline parallelism out of sequential code [4]. Kurzak et al. present solving systems of linear equations on the Cell processor [7]; single instruction multiple data parallelism, instruction-level parallelism and thread-level parallelism are well exploited on the Cell processor. Nevertheless, pipeline parallelism is less involved in this scheme.
3 Synchronization Constructs Extensions
In order to exploit pipeline parallelism at both thread and iteration levels, we equip programmers with two corresponding sets of directives. One can define the precedences at thread level, while the other can define the precedences at iteration level. Synchronizations in the pipeline model are implemented by the extended directives. Each directive is described in detail as follows.

3.1 Synchronization Directives at Thread Level

The OpenMP API uses the fork-join execution model [1]. However, some constructs in the parallel region make the execution or the memory access sequential, such as the critical, atomic and ordered constructs. When the computation amount is great, synchronization constructs badly degrade the performance of parallel applications. A straightforward method to improve the performance is pipelining and specifying the precedence relationship between different threads. The syntax of the ThrdPipe construct (C or C++ version) is as follows:

#pragma omp ThrdPipe [clause[ [,]clause] ...] new-line
    for-loops

The clause is one of the following:

blck_num (integer-expression)
mry_syn (scalar-expression)

The ThrdPipe construct specifies that the iterations of the outermost loop will be executed in pipeline by the threads in the current context. This construct is one of the extended synchronization constructs which help the programmer to pipeline loops with data dependencies that cannot be parallelized by the loop construct directly. The blck_num clause specifies the number of blocks that the outermost loop is partitioned into. Subsequently, the block size and the theoretical speedup factor can be determined as follows.

• Block Size: In our runtime implementation, we simply use a static partition algorithm to partition the outermost loop (a sketch is given after this subsection). For the specified number of blocks p and a loop of n iterations, let the integers q and r satisfy n = p*q − r, with 0 <= r < p.
• Memory Synchronization: The mry_syn clause controls how the shared data written in the pipeline is buffered. One choice is to allocate address space equal to the whole shared array. In contrast, we can also choose to allocate address space equal to the block size to reduce the space overhead. In the former choice, memory access synchronization is not needed, because it is impossible to overlap the data blocks. In the latter choice, synchronization must be used to guarantee that there is no data block overlapping.
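The static partition rule above can be written out as follows; this is our own sketch, and block_bounds is a hypothetical helper rather than part of the proposed runtime:

/* Static partition of n iterations into p blocks with n = p*q - r,
   0 <= r < p: the first p-r blocks get q iterations, the last r blocks
   get q-1. Iterations [*lb, *ub) belong to block b (0 <= b < p). */
void block_bounds(int n, int p, int b, int *lb, int *ub)
{
    int q    = (n + p - 1) / p;   /* ceil(n/p) */
    int r    = p * q - n;         /* number of short blocks */
    int full = p - r;             /* blocks of size q */
    if (b < full) {
        *lb = b * q;
        *ub = *lb + q;
    } else {
        *lb = full * q + (b - full) * (q - 1);
        *ub = *lb + (q - 1);
    }
}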
3.2 Synchronization Directives at Iteration Level

Iteration dependencies are common in OpenMP applications. The loop construct can certainly partition a loop into several blocks. However, if the blocks are executed in parallel in separate threads, the final result will be invalid, as the iteration dependency makes the computation ordered. If we want to pipeline such a program segment, complex data structures would be introduced using the existing OpenMP syntax. In this section, two directives are extended to support pipelining at iteration level. The syntax of the IterWait and GotoIter constructs (C or C++ version) is as follows:

#pragma omp IterWait( iter_ident, syn_ident ) new-line
#pragma omp GotoIter( iter_ident, syn_ident ) new-line

Both the IterWait and GotoIter directives are used in the affected loop, and their two parameters iter_ident and syn_ident must be specified by the programmer. The IterWait construct, used at the beginning of the loop block, specifies that the thread cannot continue to perform the following code until the iteration specified by the iter_ident parameter has been performed. The GotoIter construct, used at the end of the loop block, lets the thread continue to perform the following code and informs the iteration specified by the iter_ident parameter that the current iteration has been performed. The parameter iter_ident is an integer expression which contains the loop control variable and an integer constant, combined by a plus or minus operation. In the expression, the loop control variable represents the current iteration. The whole expression specifies the iteration that should be waited for or should be informed. The parameter syn_ident indicates the synchronization ID. In our pipeline scheme, multiple synchronizations are supported in order to make the pipeline model more reliable. Nevertheless, a meaningful synchronization is formed only by the effective combination of one IterWait directive and its corresponding GotoIter directive, so an exclusive ID is assigned to them by the programmer through the syn_ident parameter. As a result, an IterWait directive and a GotoIter directive with the same ID form one synchronization operation.
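As a hypothetical illustration of how these directives might be written for a loop-carried dependency (the directive spelling follows the proposal above; the loop body and array names are ours):

/* Each iteration i reads a[i-1], so iteration i must wait until
   iteration i-1 has finished; synchronization ID 1 pairs the two
   directives across the pipelined threads. */
#pragma omp parallel for
for (i = 1; i < n; i++) {
    #pragma omp IterWait(i-1, 1)
    a[i] = a[i-1] + b[i];
    #pragma omp GotoIter(i+1, 1)
}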
4 Runtime Support on Cell Architecture

In this section, we present the runtime system that supports the extended directives. In order to take full advantage of the Cell architecture, we adopt the signal block transmission mechanism to implement the synchronization among different SPE threads, using the two signal notification channels of each SPE. The runtime library framework is based on our previous research work [18]. In this section, we only show the main runtime interfaces related to the extended synchronization constructs.
4.1 Runtime Interface Supporting Synchronization Based on Threads

Fig. 1 shows the main implementation of one pair of the thread synchronization runtime interfaces; the two interfaces are explained below. Another pair is implemented in the same way based on the other signal notification channel.

void goto_nxt_thread_sig1(int thrdid): This function sends a signal to another SPE. The current thread, whose ID is curr_id, sends the signal to the thread whose ID is thrdid. The addresses of the signal notification registers are stored in the array reg_add.

void wait_prv_thread_sig1(int thrdid): This function waits for signals from other SPEs in blocking mode. The current thread, whose ID is curr_id, reads the signal through the function spu_read_signal1().
void wait_prv_thread_sig1 (int thrdid){ uint32_t sig = 0; if(thrdid!=curr_id) /***Receive signal_1***/ sig=spu_read_signal1();}
Fig. 1. Thread synchronization runtime interface implementation
4.2 Runtime Interface Supporting Synchronization Based on Iterations

Fig. 2 shows the main part of the implementation of the iteration synchronization runtime interface. Different from the thread synchronization, these interfaces provide fine-grained synchronization at iteration level.

void goto_nxt_loopindex_sig1(int index, int k)
{
    /* Get lower and upper bounds of the loop */
    int lb = curr_lb;
    int ub = curr_ub;
    int id = -1;
    if (lb <= index && index <= ub)
        return;                      /* local iteration, executed serially */
    /* (remainder of the truncated listing reconstructed from the
       description in this subsection) */
    id = probe_threadId(index);      /* remote thread owning the iteration */
    pip_send_signal(reg_add[id], curr_id, 31 - id);
}
Fig. 2. Iteration synchronization runtime interface implementation
void goto_nxt_loopindex_sig1(int index, int k): In this function, we first get the lower bound and upper bound of the loop block in the current thread. Then we judge whether the iteration that needs to be synchronized is within these bounds. If it is within the bounds,
it will be executed in the current thread serially. Otherwise the function int probe_threadId(int iterID) will be called, which returns the ID of the remote thread in which the iteration is located. The function then sends a signal to the remote thread using that thread ID.

void wait_prv_loopindex_sig1(int index, int k): In this function, we also judge whether the iteration that needs to be synchronized is within the bounds. If not, the function reads the signal from the channel in blocking mode.
5 A Case Study

In this section, we present a case study, pipeline optimization of the IS critical region based on the thread synchronization of Section 4.1, to introduce how to use the extended directives and the pipeline models.
#pragma omp critical
{
    for (i = 0; i < MAX_KEY; i++)
        key_buff1[i] += prv_buff1[i];
}

Fig. 3a. Original code

#pragma omp ThrdPipe blck_num(16) mry_syn(1)
{
    for (i = 0; i < MAX_KEY; i++)
        key_buff1[i] += prv_buff1[i];
}

Fig. 3b. Extended code

Fig. 3c. Translated code

Fig. 3d. Pipeline model

Fig. 3. IS critical region pipeline implementation
One paradigm difficult to parallelize in OpenMP is the critical region, which may seriously sacrifice performance when the computation amount is large. IS contains a critical region in which the values of a private array are added to a shared array. The original code of the IS critical region is shown in Fig. 3a. The critical region degenerates to serial execution because of the read-write dependence on the shared array. In Fig. 3b, we use our extended ThrdPipe construct to optimize the critical region; its clauses indicate that the shared array key_buff1 is divided into 16 blocks and that memory access synchronization is used to reduce the space overhead. Fig. 3c shows the translated code. The constant PARTITION is equal to 16, and because memory synchronization is chosen, two pairs of thread synchronization runtime interfaces are inserted. Fig. 3d presents the pipeline model.
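Since the listing of Fig. 3c is not legible in this copy, the following sketch shows what the translation could look like under the description above; prev_spe and next_spe are hypothetical helpers, block_bounds is the hypothetical partition helper sketched in Section 3.1, and only one of the two signal pairs is shown:

/* Each SPE processes its share of every block n in pipeline order:
   wait for the predecessor to release block n, update the block,
   then hand it to the successor. */
for (n = 0; n < PARTITION; n++) {
    int lb, ub;
    wait_prv_thread_sig1(prev_spe);           /* pipeline-order signal      */
    block_bounds(MAX_KEY, PARTITION, n, &lb, &ub);
    for (i = lb; i < ub; i++)
        key_buff1[i] += prv_buff1[i];
    goto_nxt_thread_sig1(next_spe);           /* release block n downstream */
    /* the second pair (sig2, not shown) guards buffer reuse for mry_syn */
}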
6 Evaluations

The hardware environment for the experiments is a Cell Blade [19], which contains two 3.2 GHz Cell processors and 1 GB of system memory. Each Cell processor contains one PPE and eight SPEs. The benchmarks we use are NAS IS, EP, LU [14] and SPEC2001 MOLDYN [20]. The version of the operating system is Linux kernel 2.6.25-14. All the applications are compiled by the ppu-gcc and spu-gcc compilers integrated in Cell SDK 3.1, and the optimization level is O5.

We first use the ThrdPipe construct to pipeline the critical regions in IS, EP and MOLDYN. 8 SPEs in the Cell Blade are used in the pipeline model. Fig. 4, Fig. 5 and Fig. 6 respectively show the speedup factors of the IS, EP and MOLDYN critical regions after pipelining, for different data sizes and different numbers of blocks. After analyzing the results horizontally and vertically, three conclusions can be drawn.

Firstly, for a specific critical region, when the number of blocks is fixed, the speedup factor is proportional to the data size. This is because when the data size increases, the granularity of each block increases, so the overhead of data transfers and synchronization operations decreases as a proportion of the total execution time. In Fig. 4, when the IS critical region is divided into 8 blocks, the speedup factor increases from 0.99 to 3.85 as the data size increases from 8 KB to 4 MB.

Secondly, when the data size of one critical region is fixed, the speedup factor increases at the beginning and then declines as the number of blocks increases. This is because the more blocks the loop is divided into, the more parallelism the pipeline can exploit. Nevertheless, with an increasing number of blocks, the system introduces more synchronization operations. When the performance improvement cannot compensate for the extra overhead, the speedup factor decreases. In Fig. 5, when the data size in the EP critical region is 512 KB, the speedup factor increases from 1.37 to 2.10 as the number of blocks increases from 2 to 16. However, when the number of blocks further increases to 32, the speedup factor decreases to 2.00.

Thirdly, when the data size and the number of blocks are fixed, the speedup factor is proportional to the computation complexity in the critical region, because the task granularity is coarser when the computation complexity increases; the pipeline is then more effective. From Fig. 5 and Fig. 6, we can see that when the data size and the number of blocks are identical, MOLDYN gets a higher speedup factor than EP, as MOLDYN contains more complex computation than EP.
Fig. 4. IS critical region speedup after pipelining (speedup vs. data size from 8 KB to 4 MB, for 2, 4, 8, 16 and 32 blocks)
Fig. 5. EP critical region speedup after pipelining (speedup vs. data size from 16 KB to 8 MB, for 2, 4, 8, 16 and 32 blocks)
Fig. 6. MOLDYN critical region speedup after pipelining (speedup vs. data size from 16 KB to 8 MB, for 2, 4, 8, 16 and 32 blocks)
Table 1 presents the execution time of the blts function of LU, in which the original pipeline is located, before and after using the extended IterWait and GotoIter directives. The original pipeline is implemented primarily by busy waiting and by continuously flushing data between system memory and the local store. However, a flush operation is useless when there is no change in the flag[] array, and the great number of flush operations brings large overhead, especially on the Cell architecture. In our pipeline scheme, this problem is eliminated: the pipeline is supported through the signal-based block communication mechanism at runtime. The figures in Table 1 show that as the number of SPEs varies from 2 to 16, the execution time of the blts function falls by 23.3% on average (20.4%, 21.2%, 23.5% and 28.2% for 2, 4, 8 and 16 SPEs, respectively), which illustrates that our pipeline scheme is more efficient than the original one.
Table 1. Execution time of LU blts function before and after pipeline optimization

Execution time (second)     2 SPEs   4 SPEs   8 SPEs   16 SPEs
Pipeline based on flush     122.9    64.7     34.1     20.6
Pipeline based on signal    97.8     51.0     26.1     14.8
7 Conclusions
In this paper, OpenMP is extended with two sets of synchronization constructs to exploit pipeline parallelism. One set defines precedence at the thread level, while the other defines precedence at the iteration level. The main contributions of our work are: (i) more parallelism can be exploited using the extended directives; (ii) the extended directives help programmers express pipelines more easily; (iii) a highly efficient runtime library based on the block communication mechanism is implemented to support the extended directives on the Cell processor. Experimental results show that our implementation exhibits good performance and scalability. As future work, it will be interesting to investigate whether this scheme is suitable for homogeneous CMP architectures. Acknowledgments. The research is partially supported by the Hi-Tech Research and Development Program (863) of China under Grant No. 2006AA01Z105 and No. 2008AA01Z109, the Natural Science Foundation of China under Grant No. 60373008, and the Key Project of the Chinese Ministry of Education under Grant No. 106019 and No. 108008.
References
1. OpenMP Application Program Interface, Version 3.0. OpenMP Architecture Review Board (2008)
2. Gonzalez, M., Ayguadé, E., Martorell, X., Labarta, J.: Defining and supporting pipelined executions in OpenMP. In: Eigenmann, R., Voss, M.J. (eds.) WOMPAT 2001. LNCS, vol. 2104, pp. 155–169. Springer, Heidelberg (2001)
3. Rangan, R., Vachharajani, N., Vachharajani, M., August, D.I.: Decoupled software pipelining with the synchronization array. In: Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 177–188. IEEE Press, Washington, DC (2004)
4. Syrivelis, D., Lalis, S.: Extracting coarse-grained pipelined parallelism out of sequential applications for parallel processor arrays. In: Berekovic, M., Müller-Schloer, C., Hochberger, C., Wong, S. (eds.) ARCS 2009. LNCS, vol. 5455, pp. 4–15. Springer, Heidelberg (2009)
5. Michailidis, P.D., Margaritis, K.G.: Implementing parallel LU factorization with pipelining on a multicore using OpenMP. In: 13th IEEE International Conference on Computational Science and Engineering, pp. 253–260 (2010)
6. Baudisch, D., Brandt, J., Schneider, K.: Multithreaded code from synchronous programs: Generating software pipelines for OpenMP. In: Methoden und Beschreibungssprachen zur Modellierung und Verifikation (MBMV), Dresden, Germany (2010)
7. Kurzak, J., Dongarra, J.: QR factorization for the CELL processor. Scientific Programming 17, 31–42 (2009)
8. Baudisch, D., Brandt, J., Schneider, K.: Multithreaded code from synchronous programs: Extracting independent threads for OpenMP. In: Design, Automation and Test in Europe (DATE), pp. 949–952. European Design and Automation Association (2010)
9. Teruel, X., Unnikrishnan, P., Martorell, X., et al.: OpenMP tasks in IBM XL compilers. In: Proc. of the 2008 Conference of the Center for Advanced Studies on Collaborative Research, pp. 207–221. ACM Press, New York (2008)
10. Gschwind, M.: Chip multiprocessing and the cell broadband engine. In: CF 2006: Proceedings of the 3rd Conference on Computing Frontiers, pp. 1–8 (2006)
11. Thies, W., Chandrasekhar, V., Amarasinghe, S.: A practical approach to exploiting coarse-grained pipeline parallelism in C programs. In: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 356–369. IEEE Press, Washington, DC (2007)
12. Ottoni, G., Rangan, R., Stoler, A., August, D.I.: Automatic thread extraction with decoupled software pipelining. In: Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture, pp. 105–118. IEEE Press, Washington, DC (2005)
13. Gordon, M.I., Thies, W., Amarasinghe, S.: Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. In: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 151–162. ACM, New York (2006)
14. Jin, H., Frumkin, M., Yan, J.: The OpenMP implementation of NAS parallel benchmarks and its performance. NAS Technical Report NAS-99-011, NASA Ames Research Center, Moffett Field, CA (1999)
15. Ayguade, E., Copty, N., Duran, A., Hoeflinger, J., et al.: A proposal for task parallelism in OpenMP. In: Chapman, B., Zheng, W., Gao, G.R., Sato, M., Ayguadé, E., Wang, D. (eds.) IWOMP 2007. LNCS, vol. 4935, pp. 1–12. Springer, Heidelberg (2008)
16. Ayguade, E., Martorell, X., Labarta, J., Gonzalez, M., Navarro, N.: Exploiting multiple levels of parallelism in OpenMP: a case study. In: 1999 International Conference on Parallel Processing (ICPP), pp. 172–180 (1999)
17. Suess, M., Leopold, C.: Implementing data-parallel patterns for shared memory with OpenMP. In: Proceedings of the International Conference on Parallel Computing (PARCO). IOS Press, Amsterdam (2008)
18. Cao, Q., Hu, C., He, H., Huang, X., Li, S.: Support for OpenMP tasks on cell architecture. In: Hsu, C.-H., Yang, L.T., Park, J.H., Yeo, S.-S. (eds.) ICA3PP 2010. LNCS, vol. 6082, pp. 308–317. Springer, Heidelberg (2010)
19. Altevogt, P., Boettiger, H., Kiss, T., et al.: IBM BladeCenter QS21 hardware performance. IBM Technical White Paper WP101245, USA (2008)
20. SPEC: Standard Performance Evaluation Corporation, http://www.spec.org
Generic Parallel Genetic Algorithm Framework for Protein Optimisation
Lukas Folkman, Wayne Pullan, and Bela Stantic
Institute for Integrated and Intelligent Systems, Griffith University, Australia
Abstract. Proteins are one of the most vital macromolecules on the cellular level. In order to understand the function of a protein, its structure needs to be determined. For this purpose, different computational approaches have been introduced. Genetic algorithms can be used to search the vast space of all possible conformations of a protein in order to find its native structure. A framework for the design of such algorithms that is generic, easy to use, and fast on distributed systems may help further development of genetic algorithm based approaches. We propose such a framework based on a parallel master-slave model, implemented in C++ with the Message Passing Interface. We evaluated its performance on distributed systems with different numbers of processors and achieved a linear acceleration in proportion to the number of processing units. Keywords: Parallel Genetic Algorithm, Protein Optimisation, Protein Structure Prediction.
1 Introduction
Proteins form a group of the most important macromolecules in living organisms, both quantitatively and functionally. In order to understand the function of a protein, its three-dimensional (3D) structure needs to be determined. Different computational approaches for Protein Structure Prediction (PSP) have been proposed for this purpose. Some of them rely on a template database of known structures and look for sequence similarities. On the other hand, an ab initio (meaning ‘from the origin’) method for PSP is based on the thermodynamic hypothesis, which states that a protein’s native conformation is at its free energy minimum; hence, a conformational search is conducted to find such a 3D structure. The generic parallel genetic algorithm framework presented in this paper is aimed at the ab initio method. There are two major issues in addressing PSP using the ab initio approach: a) computationally modelling the problem and designing an effective energy function; b) an efficient conformational search technique to search the free energy landscape and identify conformations which have native features. The energy function is used to lead the conformational search through the vast space of possible conformations. Such an energy function needs to be able
to distinguish between random and near-native conformations. Energy functions can be classified, according to their use of statistical information of known structures, into physics-based and knowledge-based energy functions. In the latter, rather than calculating physical energy terms describing atomic interactions, energies are derived from statistical information about the experimentally solved structures in protein databases. It is not feasible to perform a search of all possible conformations in order to find the native one due to the size of the search space. Molecular dynamics simulations together with a physics-based energy function can be used for ab initio prediction. On the other hand, much faster approaches are stochastic and heuristic conformational searches. Simulated annealing using Monte Carlo (MC) [6], Replica Exchange MC [5] and MC with minimisation [8] have proven effective in a number of successful PSP algorithms [11,15]. Evolutionary computation methods such as genetic and population-based algorithms lend themselves well to optimisation problems in bioinformatics [10] as well as in PSP [13,7,9,4]. The strength of an effective Genetic Algorithm (GA) lies in its intelligent genetic operators, which are applied in turns to different conformations throughout the search. Besides genetic operators, different GA approaches may have a lot in common. Hence, several general GA frameworks are available, for example [2]. In this paper, we would like to propose a framework that is dedicated solely to PSP and requires as little input as possible from the user. Furthermore, several models for a Parallel Genetic Algorithm (PGA) were described in [14]. We utilise one of them in the design of our framework in order to enhance the performance of predictions on distributed systems. The rest of this paper is organised as follows: an overview of the related work in the field is given in Sect. 2. Section 3 describes our generic PGA framework. The experiments and the results obtained with an implementation of our framework are presented in Sect. 4. We conclude the paper and suggest future work in Sect. 5.
2 Related Work and Background
Genetic Algorithms (GAs) are heuristic search methods which can be used for various hard optimisation problems. At the beginning of a GA, a random population of individuals is generated. Each individual encodes a candidate solution of the optimisation problem. It is important for the population to have high entropy, meaning that individuals differ significantly and provide samples from different parts of the search space. Throughout the process of a GA, this population evolves using replication, crossovers, and mutations. Only the fittest individuals are allowed to stay in the population. By this process of natural selection, the best solutions of the problem are found in the end. It should be noted that the existence of a fitness function is essential for a GA.
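As a minimal illustration of this process (a hedged sketch with a toy fitness function and illustrative names, not the API of any of the frameworks discussed below):

#include <algorithm>
#include <random>
#include <vector>

struct Individual { std::vector<double> genes; double fitness = 0.0; };

// Toy fitness standing in for a protein energy function (lower is better).
double evaluate(const Individual& ind) {
    double e = 0.0;
    for (double g : ind.genes) e += g * g;
    return e;
}

// Single-point crossover: cut both parents at a random position and join
// (assumes both parents have the same gene count, at least 2).
Individual crossover(const Individual& a, const Individual& b, std::mt19937& rng) {
    std::uniform_int_distribution<size_t> cut(1, a.genes.size() - 1);
    const size_t c = cut(rng);
    Individual child = a;
    std::copy(b.genes.begin() + c, b.genes.end(), child.genes.begin() + c);
    return child;
}

// Mutation: randomly perturb one gene.
void mutate(Individual& ind, std::mt19937& rng) {
    std::uniform_int_distribution<size_t> pos(0, ind.genes.size() - 1);
    std::normal_distribution<double> delta(0.0, 0.1);
    ind.genes[pos(rng)] += delta(rng);
}

void evolve(std::vector<Individual>& pop, int iterations, std::mt19937& rng) {
    for (auto& ind : pop) ind.fitness = evaluate(ind);
    std::uniform_int_distribution<size_t> pick(0, pop.size() - 1);
    for (int it = 0; it < iterations; ++it) {
        Individual child = crossover(pop[pick(rng)], pop[pick(rng)], rng);
        mutate(child, rng);
        child.fitness = evaluate(child);
        // Natural selection: the child replaces the worst member only if fitter.
        auto worst = std::max_element(pop.begin(), pop.end(),
            [](const Individual& x, const Individual& y) { return x.fitness < y.fitness; });
        if (child.fitness < worst->fitness) *worst = std::move(child);
    }
}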
2.1 Genetic Algorithms in Protein Structure Prediction
GAs were applied for the first time to the PSP problem in [13]. The authors proposed using different conformations of the target protein for the population of
individuals. The protein’s energy function was used to evaluate the fitness. They designed the crossover as a binary operator that cuts two different conformations at a random position and swaps their two parts, producing two new conformations. A mutation was defined as a unary operator which makes a random rotation around one of the residues. The authors reported that, in the end, only the fittest (lowest-energy) conformations were present in the population. In [7], the Conformational Space Annealing (CSA) search is proposed, which is based on a distance cut-off GA with minimisation. CSA was applied in [9] for the prediction of proteins’ tertiary structures. Furthermore, a feature-based resampling method based on a GA for structure refinement was presented in [4].
2.2 Parallel Genetic Algorithms
Parallel Genetic Algorithms (PGAs) for PSP are discussed in [12]. Three models are described: master-slave, fine-grained, and coarse-grained. In the master-slave model, there is a master process distributing fitness calculation jobs among its slave processes. The decision making and genetic operators are invoked by the master process. In the other two models, the population is divided into several equal sub-populations, and each process takes care of one such sub-population. The sharing of individuals from these sub-populations among neighbouring processes is allowed in the fine-grained model; in the coarse-grained model, only limited information exchange is possible via the concept of migrations. Although both theoretical and experimental results in [12] suggest that the master-slave model does not perform as well as the other models, it should be noted that the authors used a two-dimensional hydrophobic-polar lattice model to represent a protein’s conformation. In such a case, energy calculation is fast, and this can be seen as the reason why the communication overhead in the master-slave model was so significant. For computationally expensive energy functions, the master-slave architecture may perform reasonably well. A lot of work has been done recently in the field of parallel multiobjective evolutionary algorithms (PMOEAs) [14,3]. In [14], four PMOEA models (master-slave, diffusion or fine-grained, island or coarse-grained, and hierarchical hybrid) are described. The first three are identical to those described in the previous paragraph and in [12]; the hierarchical hybrid model is based on a combination of the others. In [3], the authors designed an ab initio PSP PMOEA based on the master-slave model and achieved linear acceleration in proportion to the number of processors.
3 Methods
The overall design of our generic PGA framework and its implementation in the C++ programming language are described in detail in this section. The Boost object-oriented implementation of the Message Passing Interface (MPI) was used for communication among processors on distributed systems.
3.1 Generic Genetic Algorithm Framework
The class diagram of the framework is depicted in Fig. 1a.
Fig. 1. (a) Generic parallel genetic algorithm framework class diagram; (b) parallel master-slave genetic algorithm activity diagram
Pool and Conformations. The interfaces IPool and IConformation, together with their implementation classes Pool and Conformation, form the core of the whole framework. The Pool object represents the population of individuals and is responsible for deciding whether newly created individuals should be added to the pool or discarded. Furthermore, it is aware of which individuals are currently being processed. The pool contains a set of Conformation objects. The Conformation object represents a specific conformation of the target protein. It keeps a reference to the Protein object (IProtein interface), which describes static features such as the name and amino acid sequence of the protein. The Conformation object is used to access and set torsion angles, and it takes care of calculating its own free energy as well as the derivatives of the energy function for every torsion angle. For this reason, Conformation keeps a set of Derivative objects (IDerivative interface). A Derivative object is essentially a pair of the derivative value and the identification of the torsion angle (ITorsionId interface, TorsionId implementation) to which the value applies. At the beginning of a GA run, the population of different conformations needs to be generated. For this purpose, the Conformation object can be perturbed: this method randomly alters the φ and ψ backbone angles of a number of residues of the conformation by a specific value. The perturbation is accepted only if the altered angle values are valid according to the Ramachandran plot. This procedure ensures that all the sub-parts of the conformations can be used reasonably well in the GA search. Energy Function and Conformation Provider. In our framework, we decided to utilise some of Rosetta’s modules [11]. We decided to use Rosetta mainly
for two reasons. First of all, it is an open-source project, which makes it possible to extract only the needed functions and optimise them for our purpose. Secondly, Rosetta has proven itself a promising approach in several Critical Assessment of techniques for protein Structure Prediction (CASP) competitions [1]. We use Rosetta’s scoring function to calculate a conformation’s energy. Furthermore, we decided to use Rosetta’s Pose object to represent a protein’s spatial arrangement. Pose provides means for both full-atom and reduced modelling of a conformation; in the latter case, each side-chain is modelled as a united atom at its centre of mass. The framework’s user can choose which model to use according to their computational resources, as full-atom modelling is extremely expensive. The energy calculation is encapsulated in the RosettaProvider singleton object, which acts as a mediator for Rosetta’s functionalities and takes care of all the needed inclusions. This way it is easy to use Rosetta’s energy function, as the complexity of the robust system is hidden in our provider class. Local Optimisation. In our design, an individual is locally optimised after it has evolved, meaning the crossover or mutation operator has been applied to it. This is done in order to keep conformations at their local minima at all times. This approach was originally proposed for MC simulations in [8] and successfully applied in Rosetta [11]. CSA [7], a GA-based approach, applies local optimisation to each new individual. This helps to rescue conformations whose energy may have increased after crossover or mutation even though they were moved to another valley of the free energy surface, at whose bottom structures with lower energy can be found. Another advantage of this approach is that it redefines the continuous space to be searched into a discrete one composed of the local minima of the free energy [8]. The Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method is employed for this purpose here, similarly to [10]. The LBFGSOptimiser object fulfils this role (IOptimiser interface); based on the derivatives of the energy function for each torsion angle, the local minimum can be found. Genetic Operators. Only two methods, crossover and mutate, are defined in the IGeneticOperator interface. These methods are expected to be implemented by the framework’s user, as they form the core functionality of a GA. By implementing intelligent genetic operators, one can make one’s GA more successful than others. An important aspect of our design is the use of interfaces. As stated in the previous subsections, each implementation class implements an interface, and the whole framework communication is defined only in terms of these interfaces. This allows for a high level of flexibility.
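A hedged sketch of these core interfaces follows; the paper names the interfaces but not their signatures, so the methods below are our own illustration:

#include <memory>
#include <vector>

struct ITorsionId {                       // identifies one torsion angle
    virtual ~ITorsionId() = default;
};

struct IDerivative {                      // pair: derivative value + torsion id
    virtual ~IDerivative() = default;
    virtual double value() const = 0;
    virtual const ITorsionId& torsion() const = 0;
};

struct IConformation {
    virtual ~IConformation() = default;
    virtual double energy() const = 0;    // free energy, e.g., via a scoring function
    virtual std::vector<std::unique_ptr<IDerivative>> derivatives() const = 0;
    virtual void perturb(int residues, double maxAngleDeg) = 0;  // randomise phi/psi
};

struct IPool {                            // population of conformations
    virtual ~IPool() = default;
    virtual bool tryAdd(std::unique_ptr<IConformation> c) = 0;   // add or discard
};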
3.2 Parallel Implementation of the Framework
As described earlier in Sect. 2, there are several design patterns that address parallelism in GAs. We used the master-slave model in our approach – depicted in Fig. 1b. First of all, it is based on the idea that the master process maintains the
population of individuals and assigns jobs to its slave processes. Hence, there is no need to alter the GA itself. Secondly, in our framework, the workload for the slave processes should be high enough to keep them occupied and to utilise the whole parallel system efficiently. In order to comply with this, we alter the master-slave model described in Sect. 2 slightly: the fitness function is computed by the slave processes. Furthermore, after any change to a conformation, local optimisation of this structure is performed. This is another computation that is passed onto the slave processes, as it is computationally expensive. It can even be requested at no extra communication cost, since local optimisation and calculation of the energy are tightly coupled, so it is naturally performed within the same request. In order to utilise all the slave processes as much as possible, we decided to use a somewhat modified GA: we do not keep the state of a current generation, and individuals enter and leave the population as they are evolving. If we had kept the state of the current generation, the master process would have had to wait for all the slave processes to return the optimised individuals, and only then could it have made a decision about the individuals in the current generation. Some slave processes would have returned earlier and some later, but there would have been no work to give to the early ones, leaving these slave processes idle. In our design, after an individual is returned to the master by a slave process, the decision whether it is good enough to stay in the population is made immediately; in the case of a negative answer, the individual is discarded. The genetic operations performed on conformations are computationally inexpensive, so they are traditionally performed by the master process. To implement that approach, however, the master process would have had to keep track of which conformations had already evolved and were waiting to be sent to the slave processes; furthermore, some slave processes might have had to wait for the master process to finish genetic operations before they could return their results. The crossover and mutation always precede the local optimisation, so genetic changes, which are fast to perform, can be done by the slave processes within the same single request. We follow this idea in order to keep the design as simple as possible and to put the utilisation of the slave processes first, as they perform the most time-consuming activity.
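A minimal sketch of this dispatch loop in Boost.MPI follows. It is a hedged illustration, not the framework's real code: the message tags, the encoding of a conformation as a vector of doubles, the job and pool sizes, and evolve_and_optimise are all assumptions of the example, and it assumes at least one slave process and more jobs than slaves:

#include <boost/mpi.hpp>
#include <boost/serialization/vector.hpp>
#include <vector>

namespace mpi = boost::mpi;
const int TAG_WORK = 0, TAG_RESULT = 1, TAG_STOP = 2;

// Stand-in for crossover/mutation followed by L-BFGS local optimisation,
// both performed on the slave within a single request.
std::vector<double> evolve_and_optimise(std::vector<double> conf) {
    for (double& x : conf) x *= 0.99;
    return conf;
}

int main(int argc, char* argv[]) {
    mpi::environment env(argc, argv);
    mpi::communicator world;
    if (world.rank() == 0) {                         // master
        std::vector<std::vector<double>> pool(100, std::vector<double>(10, 1.0));
        const int jobs = 2000;
        int next = 0;
        for (int s = 1; s < world.size(); ++s)       // prime every slave
            world.send(s, TAG_WORK, pool[next++ % pool.size()]);
        for (int done = 0; done < jobs; ++done) {
            std::vector<double> result;
            mpi::status st = world.recv(mpi::any_source, TAG_RESULT, result);
            // Decide immediately whether the individual stays in the pool
            // (simplified here to an unconditional replacement).
            pool[done % pool.size()] = result;
            if (next < jobs)
                world.send(st.source(), TAG_WORK, pool[next++ % pool.size()]);
            else
                world.send(st.source(), TAG_STOP, std::vector<double>());
        }
    } else {                                         // slave
        for (;;) {
            std::vector<double> conf;
            mpi::status st = world.recv(0, mpi::any_tag, conf);
            if (st.tag() == TAG_STOP) break;
            world.send(0, TAG_RESULT, evolve_and_optimise(conf));
        }
    }
    return 0;
}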
4 Experimental Results
In order to evaluate the benefits, efficiency, and usability of our generic PGA framework for protein optimisation, the IGeneticOperator interface needed to be implemented. For this purpose, we used a simple version of the crossover and mutation operators from [4]. The crossover operator splits two conformations and swaps their two parts, whereas a mutation is performed as a rotation around a single residue. The chance of performing a crossover was equal to the chance of a mutation. The initialisation of the pool was conducted by perturbing 3 residues by 10–15° in each conformation. Our test set consisted of two proteins, 2ptl (78 residues) and chain A of 3mwx (326 residues), with 3D structures experimentally solved in the past.
2,000 iterations (where one iteration corresponds to the invocation of a genetic operator on one individual) of our GA were executed with a pool size of 100 conformations. This simulation was run as a serial program on a single processor as well as on 2, 3, 4 and 5 processors in parallel using our parallel implementation of the framework. Each run was executed 5 times and the results were averaged. Rosetta’s decoy structure of 2ptl and the experimentally determined structure of chain A of 3mwx were used as the input structures for the runs of our algorithm. Our intention is not to demonstrate the ability of the described implementation to predict the tertiary structure of proteins from their sequences; rather, we evaluate the performance of the parallel implementation of the framework.
4.1 Acceleration by Parallelism
The results describing the performance of our framework are presented in Fig. 2a. The x axis is the number of processors and the y axis is the average acceleration. The running time of the serial algorithm on the 2ptl input was 2,314 s on average, whereas it was only 476 s when a master process with 4 slave processes was employed on a parallel system with 5 processors. For the much larger chain A of 3mwx, the times were 26,448 s and 5,598 s on average, respectively, i.e., speedups of roughly 4.9 and 4.7. Figure 2a shows that we achieved linear acceleration in proportion to the number of processors employed. In addition, we were interested in how well the slave processes were utilised and what the communication overhead was. Figure 2b depicts the percentage of time that the master and slave processes spent working. The x axis represents the number of slave processes, whereas the percentage of time is shown on the y axis. The slave processes were utilised efficiently and kept working on the optimisation of conformations for 92.56% of the total running time on average. However, as the master process was responsible only for assigning jobs to the slave processes and managing the pool, it was idle 86.60% of the running time on average. Protein optimisation aims at minimising the protein’s free energy. In Table 1, the averaged results from our algorithm are listed when run on a parallel system with 5 processors.
Fig. 2. (a) Average acceleration by parallelism; (b) percentage of time spent working
Table 1. Mean initial and optimised energies and their deviations, and the best and mean RMSDs and TM-scores

Protein  Init. energy  δIE     Opt. energy  δOE    Best RMSD  Mean RMSD  Best TM-score  Mean TM-score
2ptl     248.21        142.72  −44.54       9.57   6.69 Å     10.24 Å    0.36           0.32
3mwx     1609.41       796.59  −426.07      45.49  11.45 Å    18.12 Å    0.69           0.64
It can be seen that the energy was significantly decreased. However, this does not guarantee good prediction accuracy. In Table 1, the best and mean RMSDs¹ and TM-scores² of the final models to the native structures are also listed. As the input to our algorithm for chain A of 3mwx was its native structure, the best model was moved away by 11.45 Å. In the case of the refinement of Rosetta’s 2ptl decoy structure, the best model had an RMSD of 6.69 Å to the native structure.
4.2 Analysis of the Results
The decoy structure of the small protein 2ptl was used to demonstrate the usability of the PGA implemented in our framework for protein structure refinement and to examine the benefits of running such a refinement on a distributed system. The second protein in the test set was chain A of 3mwx, for which an experimentally determined structure was used as the input. Having 326 residues, it is quite a large structure for ab initio prediction. We were especially interested in how the performance of our parallel framework would be sustained in this case. From the results (Fig. 2a), it is apparent that running our PGA on a distributed system with 2 processors does not accelerate the algorithm. This is a clear consequence of not loading the master process with a sufficient amount of work (Fig. 2b). Even though the master process’ workload is very low in the 3, 4 and 5 processor configurations as well, the acceleration becomes more evident and increases linearly in proportion to the number of processors. It should be noted that such acceleration will stop at some stage, depending on the size of the target protein; prediction of a larger protein would be able to load a higher number of processors efficiently.
5 Conclusion and Future Work
In this paper, we described our generic PGA framework for protein optimisation. The framework is designed and implemented in such a way that any of its components can be exchanged for a corresponding component implemented by the user; the rest of the framework will continue to function without the need for modification. The basic operation of the framework is to evolve a protein’s conformations using crossover and mutation operators, to locally optimise conformations according to their energy using the L-BFGS algorithm, and to evaluate their fitness.
1 Root Mean Square Deviation of the positions of the backbone Cα atoms.
2 A measure of similarity of the topologies of two protein structures [16].
The framework can exploit parallel systems in an efficient manner using the master-slave approach, where the master process assigns jobs to the slave processes. The communication overhead on the slave processes is kept as low as possible, which results in fast performance: 92.56% of the running time on the slave processes was utilised for evolving the conformations. Linear acceleration in proportion to the number of processors was achieved in the refinement of one small and one large protein chain. The master process was idle 86.60% of the running time, which implied only marginal acceleration on a distributed system with 2 processors. In the future, we need to address the utilisation of the master process in greater depth. Implementation of the island model for PGAs could also provide better workload distribution. Acknowledgements. This research is partly sponsored by NICTA Queensland Research Laboratory.
References
1. Bradley, P., Chivian, D., Meiler, J., Misura, K., Rohl, C., Schief, W., Wedemeyer, W., Schueler-Furman, O., Murphy, P., Schonbrun, J., et al.: Rosetta predictions in CASP5: successes, failures, and prospects for complete automation. Proteins: Structure, Function, and Bioinformatics 53(S6), 457–468 (2003)
2. Cahon, S., Melab, N., Talbi, E.: ParadisEO: A framework for the reusable design of parallel and distributed metaheuristics. Journal of Heuristics 10(3), 357–380 (2004)
3. Calvo, J., Ortega, J.: Parallel protein structure prediction by multiobjective optimization. In: 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing, pp. 268–275 (2009)
4. Higgs, T., Stantic, B., Hoque, M., Sattar, A.: Genetic algorithm feature-based resampling for protein structure prediction. In: 2010 IEEE Congress on Evolutionary Computation (CEC), pp. 1–8. IEEE, Los Alamitos (2010)
5. Kihara, D., Lu, H., Kolinski, A., Skolnick, J.: TOUCHSTONE: an ab initio protein structure prediction method that uses threading-based tertiary restraints. Proceedings of the National Academy of Sciences of USA 98(18), 10125 (2001)
6. Kirkpatrick, S., Gelatt, C., Vecchi, M.: Optimization by simulated annealing. Science 220(4598), 671 (1983)
7. Lee, J., Scheraga, H., Rackovsky, S.: Conformational analysis of the 20-residue membrane-bound portion of melittin by conformational space annealing. Biopolymers 46(2), 103–115 (1998)
8. Li, Z., Scheraga, H.: Monte Carlo-minimization approach to the multiple-minima problem in protein folding. Proceedings of the National Academy of Sciences of USA 84(19), 6611 (1987)
9. Oldziej, S., Czaplewski, C., Liwo, A., Chinchio, M., Nanias, M., Vila, J., Khalili, M., Arnautova, Y., Jagielska, A., Makowski, M., et al.: Physics-based protein-structure prediction using a hierarchical protocol based on the UNRES force field: assessment in two blind tests. Proceedings of the National Academy of Sciences of USA 102(21), 7547 (2005)
10. Pullan, W.: An unbiased population-based search for the geometry optimization of Lennard–Jones clusters: 2 ≤ N ≤ 372. Journal of Computational Chemistry 26(9), 899–906 (2005)
11. Rohl, C., Strauss, C., Misura, K., Baker, D.: Protein structure prediction using Rosetta. Methods in Enzymology 383, 66–93 (2004)
12. Santos, E., Lu, L., Santos Jr., E.: Efficiency of Parallel Genetic Algorithms For Protein Folding on 2-D HP Model. In: Proceedings of the Fifth Joint Conference in Information Sciences, Third International Workshop on Frontiers of Evolutionary Algorithms, Atlantic City, NJ, pp. 1094–1097 (2000)
13. Unger, R., Moult, J.: Genetic algorithms for protein folding simulations. Journal of Molecular Biology 231(1), 75–81 (1993)
14. Van Veldhuizen, D., Zydallis, J., Lamont, G.: Considerations in engineering parallel multiobjective evolutionary algorithms. IEEE Transactions on Evolutionary Computation 7(2), 144–173 (2003)
15. Wu, S., Skolnick, J., Zhang, Y.: Ab initio modeling of small proteins by iterative TASSER simulations. BMC Biology 5, 17 (2007)
16. Zhang, Y., Skolnick, J.: Scoring function for automated assessment of protein structure template quality. Proteins: Structure, Function, and Bioinformatics 68(4), 1020 (2007)
A Survey on Privacy Problems and Solutions for VANET Based on Network Model
Hun-Jung Lim¹ and Tai-Myoung Chung²
¹ Dept. of Computer Engineering, Sungkyunkwan University
² School of Information Communication Engineering, Sungkyunkwan University
[email protected], [email protected]
Abstract. After a long period of vehicle communication research, VANET is now at the implementation stage. However, most VANET research focuses on message transmission. A vehicle is an extremely personal device; therefore personal information, so-called privacy, has to be protected. In this paper, we analyze identity and location privacy threatening factors, problems, and solutions based on the network model. To analyze each solution's effectiveness, we define four attack models: external attack, internal attack, correlational attack, and relational attack. According to our research, most of the solutions use a pseudonym identity or an address changing scheme to protect identity privacy. Also, the solutions are weak against, or do not consider, the relational attack. We find that this is due to meeting the network model's transparency design goal and to protecting the vehicle's real identity even at the cost of revealing the vehicle's location. The results of this paper could guide the design of privacy preserving solutions and present the trend of existing solutions. Keywords: VANET, Identity Privacy, Location Privacy.
1 Introduction
VANET was developed to support Car-to-Car (C2C) and Car-to-Infrastructure (C2I) communication. For many years, global researchers and projects have been investigating VANET research issues: routing, security, address allocation, etc. Based on this research, some project groups have built testbeds and implemented communication programs on vehicles. The field test results for message exchange and network connectivity are fairly satisfactory. As further research, they have focused on security and privacy issues in VANET. Since the vehicle is an extremely personal device, its communication data should be secured and the driver's privacy should not be revealed. Generally, privacy means the "Right of an individual to decide for himself/herself when and on what terms his or her attributes should be revealed" [1]. Without privacy protection, a driver's attributes such as 5W1H (who, when, where, what, why, and how) can be revealed and used by adversaries. Privacy in the context of VANET can be categorized into three parts [2].
─ Data Privacy: Prevent others from obtaining communication data.
─ Identity Privacy: Prevent others from identifying the subject of communication.
─ Location Privacy: Prevent others from learning one's current or past location.
Usually, data privacy is easily achieved through an encryption method in the application layer. For that reason, identity privacy and location privacy are usually mentioned as the privacy issues in VANET [3]. In this paper, we identify the privacy threatening factors and solutions according to the TCP/IP network model, which will be the main network model in the VANET environment. The contributions of this paper are as follows:
─ By identifying the privacy threatening factors, we provide guidance for designing privacy protection solutions for future researchers.
─ By surveying the known privacy protection solutions, we describe the trend of solutions and their strengths and weaknesses.
─ By describing the threatening factors based on the network model, we cover all message fields transmitted over the network.
This paper is structured as follows. In Sections 2 to 5, we describe each layer's privacy threatening factors, problems, and solutions: the network access layer, network layer, transport layer, and application layer. In Section 6, we analyze the solutions based on the four attack models, and we conclude the paper in Section 7.
2 Network Access Layer
At the network access layer, TCP/IP does not define any specific protocol; it supports all standard and proprietary protocols. This layer puts data into frames and ensures error-free transmission. In the scope of the VANET environment, IEEE 802.11x technologies such as 802.11b, 802.11g, 802.11n and 802.11p can be adopted. In particular, IEEE 802.11p (WAVE) is designed for vehicular communication. The IEEE 802.11x standards define 'frame' types to be used in data transmission as well as in the management and control of wireless links. From the privacy viewpoint, the physical address field of the frame, which distinguishes a node within a local area, is a privacy threatening factor. Figure 1 illustrates the frame format and privacy threatening fields.
Fig. 1. IEEE 802.11 Format
A frame is divided into very specific and standardized sections. Each frame has a MAC header, a payload, and a frame check sequence (FCS). The MAC header is composed of a frame control field, a duration/ID field, up to four MAC address fields, and a sequence control field. Figure 2 describes the MAC address format.
Fig. 2. MAC Address Format
MAC addresses are most often generated and assigned by the manufacturer of a network interface card (NIC) and are stored in its hardware. This 48-bit address space contains 2^48 possible MAC addresses. The common MAC address notation is six groups of two hexadecimal digits separated by colons: 01:23:45:67:89:ab. The first 24 bits indicate the Organizationally Unique Identifier and the following 24 bits are the host identifier.
Untrusted network operators Shared channel radio Unsecured frame header Frequent broadcast of MAC address High Density of access points(Low-cost wireless LAN radio) Precise positioning technology
Solution
To the best of our knowledge, there is no VANET-specific WLAN privacy protection research. However, there are many solutions for the pervasive computing environment, and these solutions could be adapted to the VANET environment [4][6][5]. Gruteser, M. suggests short-lived interface identifiers as a privacy solution [4]. However, the concept of short-lived, periodically updated interface identifiers creates new challenges: identifier selection, identifier uniqueness, and integration with port authentication. Huang, L. [5] also insists that the “periodical address updates”
solution is weak against correlation attacks, which utilize the correlation between the old and new addresses of the same node. The author suggests the concept of a silent period as an additional solution. A silent period is defined as a non-transmission period between the use of the new and the old identifier. Pang, J. evaluates the correlation attack [7]. In [6], Jiang, T. summarizes five privacy leakage factors (time, location, sender node identity, receiver node identity, and content) and three privacy protecting methods.
- Anonymize the user or node identity with frequently changing pseudonyms
- Unlink different pseudonyms of the same user with silent periods between different pseudonyms
- Increase the entropy of the attackers' location estimation by reducing the precision of location algorithms
Another approach, suggested by Greenstein, B., uses an encryption method called SlyFi [8]. Figure 3 illustrates the privacy solutions for the network access layer.
Fig. 3. Network Access Layer Privacy Solutions
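As a small illustration of the frequently changing pseudonym idea above, a node could derive a short-lived, locally administered MAC address as follows (a hedged sketch; random_pseudonym_mac is an illustrative name, and a real scheme must additionally handle identifier selection, uniqueness, and port authentication, as noted above):

#include <array>
#include <cstdio>
#include <random>

std::array<unsigned char, 6> random_pseudonym_mac(std::mt19937& rng) {
    std::uniform_int_distribution<int> byte(0, 255);
    std::array<unsigned char, 6> mac;
    for (auto& b : mac) b = static_cast<unsigned char>(byte(rng));
    // Set the "locally administered" bit and clear the multicast bit
    // in the first octet, so the address cannot collide with vendor OUIs.
    mac[0] = (mac[0] | 0x02) & 0xFE;
    return mac;
}

int main() {
    std::mt19937 rng(std::random_device{}());
    const auto mac = random_pseudonym_mac(rng);
    std::printf("%02x:%02x:%02x:%02x:%02x:%02x\n",
                mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
    return 0;
}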
3 Network Layer
At the network layer, TCP/IP supports the Internetworking Protocol (IP). This layer is responsible for delivering a message from the source host to the destination host based on their addresses. IP has been extended from IPv4 to IPv6 as the solution to the address pool shortage. On February 3, 2011, the Internet Assigned Numbers Authority (IANA) announced that the last IPv4 address blocks had been allocated; we expect that the IPv4 based network model will turn into IPv6 rapidly. In this paper, we briefly explain IPv4 and investigate IPv6 in depth. In the scope of the VANET environment, the existing IPv4 and IPv6 protocols can be used. Additionally, mobility support IP can be used for the vehicle's mobility. From the privacy viewpoint, the logical address, which distinguishes a node within a global area, is the privacy threatening factor. Figure 4 illustrates the IPv4 and IPv6 packet formats and privacy threatening fields.
Fig. 4. IPv4 and IPv6 Format
3.1 IPv4
IPv4 is the most widely deployed Internet layer protocol. IPv4 is described in IETF publication RFC 791 (September 1981), replacing an earlier definition (RFC 760, January 1980). IPv4 uses 32-bit addresses, which limits the address space to about 4.3 billion (2^32) possible unique addresses. However, for some reasons, it is hard to allocate IP addresses uniquely, which is the most important feature of an IP address. Figure 5 illustrates the problem and the privacy solutions for IPv4.
Fig. 5. IPv4 Problem and Privacy Solutions
To solve the address shortage problem, the Dynamic Host Configuration Protocol [9], which assigns host addresses based upon availability, and Network Address Translation [10], which maps private addresses to public addresses, are applied. These two techniques have the unintentional benefit of protecting a host's privacy by hiding its address within a private address space and by periodically changing the host's address.
3.2 IPv6
IPv6 is the version of the Internet Protocol designed to succeed IPv4. Since 1994, IPv6 has been developed by the IETF 'IPng' (IP next generation) working group to deal with the IPv4 exhaustion problem. IPv6 is described in IETF publication RFC 2460 (December 1998). To overcome the shortage of addresses in IPv4, IPv6 employs 128-bit addresses, which provides about 3.4 × 10^38 (2^128) possible unique addresses. In addition to the larger address space, IPv6 is designed with enhanced security and management aspects. For the management aspect, IPv6 supports stateless auto-configuration, which is described in RFC 4862. Figure 6 illustrates the IPv6 stateless auto-configuration scheme.
Fig. 6. IPv6 Stateless auto-configuration scheme
Stateless auto-configuration allows an administrator to configure the network part of the address while each device automatically configures the interface identifier (IID) part. The IID is formed by extending the 48-bit Media Access Control (MAC) address to a 64-bit number, spanning half of the IPv6 address.
Problems
From a privacy point of view, the stateless auto-configuration scheme has two privacy risks: compromising location privacy through the IPv6 network prefix information, and compromising identity privacy through the IPv6 IID information [10, 11]. The first risk is that the network prefix reveals the network domain location, from which the node's location can be guessed approximately [12]. The second problem is similar to the network access layer problem: the IID component of the IPv6 address remains static unless the NIC is replaced. As a result, no matter what network the node accesses, the IID remains the same. Consequently, simple network tools such as ping and traceroute can be used to track a node's geographic location from anywhere in the world [13]. Lindqvist, J. insists that the 64-bit IID is enough to identify individuals and that it is a major threat to users' privacy [14]. Also, Trostle, J. mentions that it is possible to track a device by checking for global unicast addresses with the same interface identifier, reminding us of the risk of an unchanging interface identifier [15].
Solutions
To solve the first, location-related problem, Trostle, J. suggests encrypting parts of the prefix such that only appropriate routers in the network can decipher the prefix and obtain the topological information [15]. To solve the second, identity-related problem, the IETF suggests privacy extensions [16]. A privacy extension uses a hash value of a nonce with the EUI-64 generated IID, or of the previous hash result, creating a new IID as the user changes networks. The IETF also suggests using a Cryptographically Generated Address (CGA) as an IID [17]. Both [16] and [17] use a random number; therefore the address is dynamically obscured each time. Also, DHCPv6 can be used for identity privacy protection [18]: a DHCPv6 server assigns a random address to each node, independent of the MAC address. This solution also takes advantage of the abundance of IPv6 addresses by allocating one IP address per application. The above suggestions can solve the privacy problems; however, they also require additional resources such as hash calculation, a DHCP server, and a large number of addresses. Figure 7 illustrates the privacy solutions for IPv6.
80
H.-J. Lim and T.-M. Chung
Fig. 7. IPv6 Privacy Solutions
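For concreteness, the modified EUI-64 expansion that produces the static IID from the MAC address can be sketched as follows (a minimal illustration of the mapping described above; the privacy extensions [16] and CGA [17] replace this deterministic mapping with hashed or random values):

#include <array>

// 48-bit MAC -> 64-bit interface identifier (modified EUI-64).
std::array<unsigned char, 8> eui64_iid(const std::array<unsigned char, 6>& mac) {
    std::array<unsigned char, 8> iid;
    iid[0] = mac[0] ^ 0x02;   // flip the universal/local bit
    iid[1] = mac[1];
    iid[2] = mac[2];
    iid[3] = 0xFF;            // insert FFFE between the OUI
    iid[4] = 0xFE;            // and the host identifier
    iid[5] = mac[3];
    iid[6] = mac[4];
    iid[7] = mac[5];
    return iid;               // the same IID on every visited network
}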
3.3 Mobility Support IP
IPv4 and IPv6 usually allocate an address based on the network domain and route a message according to the network domain. However, in the VANET environment, a vehicle moves across network domains and periodically causes network handovers. At every handover, its IP address has to be changed, and the connection will therefore be broken. To solve this problem, mobility support IP technologies, called Mobile IP, are required. Mobile IP allows a user to move from one network to another while maintaining the connection. Mobile IP for IPv4 is described in IETF RFC 3344 and Mobile IPv6 in RFC 3775. In this paper, we focus on Mobile IPv6, for the same reason as in the Internet protocol case. Mobile IPv6 is divided into two categories based on the subject of mobility support: host based and network based.
Host based mobility support: Mobile IPv6
Mobile IPv6 is a host based mobility support protocol. The core of Mobile IPv6 is that the mobile node (MN) uses two kinds of addresses to support mobility: the Home Address (HoA) and the Care-of Address (CoA). First, the MN has a fixed IP address called the HoA. When it moves to another network, the MN generates a tentative IP address called the CoA. Then, the MN sends a Binding Update (BU) message with the new CoA to its correspondent node (CN). On receiving the BU message, the CN updates its HoA-CoA mapping table with the newly received CoA. Whenever the CN sends a message to the MN, the CN checks the mapping table with the HoA and uses the correlated CoA as the destination address. The advantage of a host based mobility support protocol is that the MN handles all mobility signalling, even when the network fully supports the mobility service. The disadvantage is that the MN requires an additional mobility stack and extra battery consumption.
Problem
From a privacy point of view, the advantage of host based mobility threatens privacy: the end-point source and destination IP addresses are revealed to others because the MN handles the entire mobility signalling. The privacy problems in the context of Mobile IPv6 are defined in RFC 4882 [19]. The primary goal is to prevent an adversary on the path between the MN and the CN from detecting roaming due to the disclosure of the HoA and the exposure of the CoA to the CN. For example, when an MN roams from its home network to another network, the use of the HoA in communication reveals to an adversary that the MN has roamed. Likewise, the use of the CoA in communication with a CN reveals that the MN has roamed.
Solution
The IETF suggests an Encrypted Home Address (eHoA) and a Pseudo Home Address (pHoA) to address this privacy threat [20]. To protect privacy from an adversary, the MN uses the eHoA. To protect privacy from the CN, the MN uses the pHoA in the extended home address test procedure to obtain a home keygen token; then, it uses the pHoA instead of the real home address in the reverse-tunneled correspondent binding update procedure.
Network based mobility support: Proxy Mobile IPv6
PMIPv6 is a network based mobility support protocol described in RFC 5213. The core of PMIPv6 is that the network handles all mobility support operations and the MN does not participate in mobility signalling. There are two methods to detect the MN's movement: beacon information at layer 2, and network prefix information in layer 3 router advertisement (RA) messages. In Mobile IPv6, the MN uses the layer 3 information to detect its movement: when the MN detects its movement by receiving a different network prefix in an RA message, it generates a new CoA for communication. In PMIPv6, however, the network uses the layer 2 information to detect the MN's movement: when the network detects the MN's movement by receiving a beacon message, it sends an RA message with the MN's previously used network prefix (i.e., the home network prefix). Even when the MN receives the layer 3 RA message, it cannot detect its movement because the network prefix is unchanged. Instead of the MN, as in Mobile IP, in PMIPv6 the LMA supports the mobility. Whenever the CN sends a message to the MN, the message is first transmitted to the LMA; the LMA then checks the MN's location and forwards the message.
Problem and Solution
From a privacy point of view, the no-IP-change feature of the network based mobility support mechanism solves the privacy problem that existed in MIPv6. Even when a PMIPv6 MN moves to another network, the MN does not change its IP address; it maintains the same IP address in every network, i.e., the network prefix of the IP address does not give away the MN's location. However, from the identity privacy viewpoint, the MN uses an IID based IP address and thus inherits the IP identity privacy problem described in Sect. 3.2. Figure 8 illustrates the privacy solutions for mobility support IP.
Fig. 8. Mobility Support IP Privacy Solutions
4 Transport Layer
At the transport layer, TCP/IP supports two protocols: TCP and UDP. Transport layer protocols provide process-to-process communication using port numbers. In the scope of the VANET environment, the existing transport protocols can be used. Additionally, for the dynamically changing topology of the VANET environment, some protocol variants are required. From the privacy viewpoint, the port number and the sequence number field of TCP can be privacy threatening factors, although with a low probability. The port number is 16 bits long and distinguishes a process within a host. Figure 9 illustrates the TCP/UDP formats and privacy threatening factors.
Fig. 9. TCP and UDP Format
Problem
Both TCP and UDP use a random port number between 49,152 and 65,535 to send a message. Even though the port number's attributes are random, short-lived, and not unique, it can threaten privacy through long-term profiling. However, the possibility of a privacy breach is much lower than the risks at the other layers. Also, TCP uses a sequence number to identify each byte of data and to support reliable transmission. Though the sequence number is not unique and changes every time, it could be a privacy threatening factor for the same reason as the port number.
Solution
Despite the port number's and sequence number's privacy threatening potential, the possibility is quite low. Also, transport layer information can be hidden from an adversary with an encryption mechanism such as IPsec.
5 Application Layer
At the application layer, TCP/IP supports many protocols and services such as mail, file transfer, the World Wide Web, and so on. This layer provides the means for the user to access information on the network; therefore, the application layer supports most kinds of services for users' needs. In the scope of the VANET environment, the applications are divided into two categories: comfort applications and safety applications. For the user's convenience and due to application requirements, each application carries the driver's identification or location information. From the privacy viewpoint, the user identity (ID) and the application contents are privacy threatening factors.
Problem
The application layer ID and location information are closely related to the user's identity. As with the transport layer, the application layer could hide this information using IPsec or another security mechanism. However, the ID may still be exposed, because an encrypted ID may be revealed, or a plaintext ID may be required for retrieving the decryption key. With this ID, an adversary could link the lower layer information to obtain the location.
Solution
The IETF Geopriv working group has researched secure privacy information transmission protocols for cases when a service provider needs such information [21]. Apart from the Geopriv working group, most research focuses on primitively blocking the privacy leak, especially for identity privacy. Common identity privacy protection solutions are divided into three categories.
─ Pseudonymization
─ Random silent period
─ Mix-zone
Beresford, A.R. proposes the concept of a mix-zone for privacy protection [22]. Mix-zones are anonymized regions of the network wherein mobile nodes change their identifiers to obfuscate the relation between entering and exiting events. Huang, L. identifies the mix-zone's weakness to the correlation attack [23]. The author also suggests a random silent period wherein mobile nodes turn off their transceivers and update their identifiers. In [24], Huang, L. further proposes arranging silent periods into cascades and, to enhance anonymity, coordinating silent periods. In [25], Li, M. suggests a user-centric solution letting mobile nodes coordinate their silent periods and decide whether to change pseudonyms. On the other hand, Gruteser, M. [4] proposes a network-centric solution in that direction. These pervasive computing based privacy preservation concepts have been extended to the VANET environment. In [26], Raya, M. designs the use of pseudonyms in VANET communication. In [27][28], Gerlach, M. suggests using a mix-context instead of pseudonyms for better privacy protection. Buttyán, L. analyzes the performance of pseudonym schemes [29]; the analysis shows that their effectiveness is limited, as an adversary monitoring 50% of the intersections can successfully track 60% of the vehicles with very high probability because of the non-uniformity of the traffic. Depending on the pseudonym changing subject, two further lines of research were added: network-centric [30] and user-centric [31]. As in the pervasive computing environment, the pseudonym changing solution is also weak in the VANET environment, so [32],[33] adapted the concept of the mix-zone to VANET. The basic idea of [33] is that vehicles do not transmit messages when their speed drops below a threshold, and they change pseudonym during each such silent period. Dahl, M. [34] analyzed the mix-zone in the context of VANET. Meanwhile, Sampigethaya, K. [35],[36] modifies the silent period concept with group based message transmission. Figure 10 illustrates the privacy solutions for the application layer.
Fig. 10. Application Layer Privacy Solutions
6 Solution Analysis with Attack Models
To analyze the effectiveness of privacy protection solutions in VANETs, we define four attack models [5, 6, 32]: the external attacker, the internal attacker, the correlational attacker, and the relational attacker. External attackers (ExtAtck) install their own radio receivers near the road network and passively eavesdrop on vehicle safety messages. These attackers are a kind of sniffer that does not emit any signals, but only listens and localizes vehicles. External attackers are strongest when they are densely scattered throughout wireless service areas, in which case they are capable of precisely locating a vehicle. Internal attackers (IntAtck) are network providers that must provide wireless services in addition to obtaining vehicle information. Although network providers themselves may be trustworthy, a provider may accidentally leak privacy-sensitive information. For example, wireless social communities (e.g., FON) or WiFi operators (e.g., Google) provide low-cost wireless Internet connectivity via WiFi networks in cities; with minor software or hardware modifications, this infrastructure can eavesdrop. Correlational attackers (CorAtck), which utilize the correlation between the old and new identifiers of the same vehicle, can defeat current protection methods. This attack assumes that, with enough temporal and spatial precision, an attacker may be able to correlate two identifiers that are sent separately from the same vehicle moving through space. Correlational attacks can be mounted by both internal and external attackers. Relational attackers (RelAtck) use information from other layers of the same vehicle. Even if one layer's information is strictly protected, these attackers can exploit upper or lower layer information by mapping together the information of the two layers that appears at the same time. For example, if a vehicle changes its MAC address periodically while its IP address stays unchanged, the attacker can track the MAC address changes through the IP address. Relational attacks can likewise be mounted by both internal and external attackers.
A Survey on Privacy Problems and Solutions for VANET Based on Network Model Solution Pseudonyms id [4] Silent period[5] Pseudonyms identifier [6] Slyfi[8] SLOW[33] SeVeCom[37] DHCP[9][18] NAT[10] Prefix Encryption [15] privacy extensions[16] CGA[17] eHoA&pHoA[20] Pseudonyms id[26] Mix-Context[27] User assigne[31] Network assigned[30] Group based[35][36] Mix-zone[32], [33]
*ɂ
85
ID.Pri
Loc.Pri
ExtAtck
IntAtck
CorAtck
RelAtck
ȿ
¯
ȿ
ȿ
¯
¯
ȿ
¯
ȿ
ȿ
ȿ
¯
ȿ
ȿ
ȿ
ȿ
ȿ
¯
ȿ
¯
ȿ
¯
˰
ȿ
¯
ȿ
ȿ
ȿ
¯
ȿ
ȿ
ȿ
¯
ȿ
ȿ
ȿ
ȿ
¯
¯
¯
ȿ
ȿ
ȿ
¯
ȿ
¯
¯
ȿ
ȿ
¯
˰
¯
ȿ
¯
ȿ
ȿ
˰
¯
ȿ
¯
ȿ
ȿ
˰
¯
ȿ
ȿ
ȿ
˰
¯
ȿ
¯
ȿ
ȿ
¯
¯
ȿ
¯
ȿ
ȿ
˰
¯
ȿ
¯
ȿ
ȿ
¯
¯
ȿ
¯
ȿ
¯
ȿ
¯
ȿ
ȿ
ȿ
ȿ
¯
ȿ
¯
ȿ
ȿ
ȿ
¯
6DWLVI\LQJ &RQGLWLRQDO6DWLVI\LQJ¯ 1RW6DWLVI\LQJ˰GdGuGy
According to our research, most solutions focus on identity privacy. Against the external attack, all of the solutions protect against the privacy threats. Most solutions also protect against the internal attack, with the exception of the network-based and key-exchange-based solutions. In the case of the correlation attack, the analysis has been carried out only for the periodic identity-change solutions. The early solutions are weak against correlation attacks; however, with the silent period and mix-zone concepts, correlation attacks are dramatically reduced. Despite this progress in solution development, almost all solutions remain vulnerable to the relational attack. We observe that VANET privacy work focuses on identity protection solutions and is weak against the relational attack. The reason for the former is that the owner's location is more important than the vehicle's location; i.e., the real risk factor is 'who' is somewhere, not that someone is in a 'specific place'. The latter is due to the
network model's transparency design goal. This design goal is an advantage for supporting various protocols at each layer without dependency problems; however, it is a disadvantage for VANET privacy protection.
7 Conclusion
In this paper, we investigated the privacy-threatening factors and their known solutions in VANETs, based on the network model. Most of the solutions use pseudonym identities or address-changing schemes to protect identity privacy. In addition, the known solutions are weak against, or do not consider, the relational attack. We interpret this as a consequence of meeting the network model's transparency design goal and of protecting the vehicle's real identity even while revealing the vehicle's location. However, these approaches do not ultimately solve the privacy problem. If an attacker is interested in the location of a vehicle, this is a potential threat to the VANET environment; for example, a thief may analyze the packets arriving at a police station and learn a patrol car's location. Therefore, in our future work, we plan to design a multi-layered privacy protection mechanism for VANET. Acknowledgments. This work (Grant No. 00044301) was supported by the Business for Cooperative R&D between Industry, Academy, and Research Institute, funded by the Korea Small and Medium Business Administration in 2010.
References
1. Kent, S.T., Millett, L.I.: IDs–not that easy: Questions about nationwide identity systems. Natl. Academy Pr., Washington DC (2002)
2. Beresford, A.R., Stajano, F.: Location Privacy in Pervasive Computing. IEEE Pervasive Computing 2, 46–55 (2005)
3. Fuentes, J.M., González-Tablas, A.I., Ribagorda, A.: Overview of Security Issues in Vehicular Ad-Hoc Networks (2010)
4. Gruteser, M., Grunwald, D.: Enhancing Location Privacy in Wireless LAN through Disposable Interface Identifiers: A Quantitative Analysis. Mobile Networks and Applications 10, 315–325 (2005)
5. Huang, L., Matsuura, K., Yamane, H., Sezaki, K.: Enhancing wireless location privacy using silent period. In: IEEE Wireless Communications and Networking Conference, 2005, vol. 2, pp. 1187–1192. IEEE, Los Alamitos (2005)
6. Jiang, T., Wang, H.J., Hu, Y.C.: Preserving location privacy in wireless LANs. In: Proceedings of the 5th International Conference on Mobile Systems, Applications and Services, pp. 246–257. ACM, New York (2007)
7. Pang, J., Greenstein, B., Gummadi, R., Seshan, S., Wetherall, D.: 802.11 user fingerprinting. In: Proceedings of the 13th Annual ACM International Conference on Mobile Computing and Networking, pp. 99–110. ACM, New York (2007)
8. Greenstein, B., McCoy, D., Pang, J., Kohno, T., Seshan, S., Wetherall, D.: Improving wireless privacy with an identifier-free link layer protocol. In: Proceeding of the 6th International Conference on Mobile Systems, Applications, and Services, pp. 17–20. Citeseer (June 2008)
9. Droms, R.: Dynamic host configuration protocol (1997)
10. Srisuresh, P., Holdrege, M.: RFC 2663: IP Network Address Translator (NAT) Terminology and Considerations (1999)
11. Thomson, S., Narten, T., Jinmei, T.: RFC 4862: IPv6 Stateless Address Autoconfiguration. Standards Track, http://www.ietf.org/rfc/rfc4862.txt
12. Haddad, W., Nordmark, E., Dupontand, F., Bagnulo, M., Park, S., Patil, B.: Privacy for Mobile and Multi-homed Nodes: MoMiPriv Problem Statement (2005)
13. Groat, S., Dunlop, M., Marchany, R., Tront, J.: The privacy implications of stateless IPv6 addressing. In: Proceedings of the Sixth Annual Workshop on Cyber Security and Information Intelligence Research, pp. 1–4. ACM, New York (2010)
14. Lindqvist, J.: IPv6 is Bad for Your Privacy (2007)
15. Trostle, J., Matsuoka, H., Tariq, M.M.B., Kempf, J., Kawahara, T., Jain, R.: Cryptographically protected prefixes for location privacy in IPv6. In: Martin, D., Serjantov, A. (eds.) PET 2004. LNCS, vol. 3424, pp. 142–166. Springer, Heidelberg (2005)
16. Narten, T., Draves, R., Krishnan, S.: RFC 4941: Privacy Extensions for Stateless Address Autoconfiguration in IPv6. IETF (September 2007)
17. Nikander, P., Arkko, J., Kempf, J., Zill, B.: SEcure Neighbor Discovery (SEND)
18. Droms, R., Bound, J., Volz, B., Lemon, T., Perkins, C., Carney, M.: Dynamic host configuration protocol for IPv6 (DHCPv6) (2003)
19. Koodli, R.: RFC 4882: IP Address Location Privacy and Mobile IPv6: Problem Statement (2007)
20. Qiu, Y.: RFC 5726: Mobile IPv6 Location Privacy Solutions (2010)
21. Geographic Location/Privacy (geopriv), http://datatracker.ietf.org/wg/geopriv/
22. Beresford, A.R., Stajano, F.: Mix zones: User privacy in location-aware services. In: Proceedings of the Second IEEE Annual Conference on Pervasive Computing and Communications Workshops, 2004, pp. 127–131. IEEE, Los Alamitos (2004)
23. Huang, L., Yamane, H., Matsuura, K., Sezaki, K.: Towards modeling wireless location privacy. In: Danezis, G., Martin, D. (eds.) PET 2005. LNCS, vol. 3856, pp. 59–77. Springer, Heidelberg (2006)
24. Huang, L., Yamane, H., Matsuura, K., Sezaki, K.: Silent Cascade: Enhancing Location Privacy without Communication QoS Degradation. In: Clark, J.A., Paige, R.F., Polack, F.A.C., Brooke, P.J. (eds.) SPC 2006. LNCS, vol. 3934, pp. 165–180. Springer, Heidelberg (2006)
25. Li, M., Sampigethaya, K., Huang, L., Poovendran, R.: Swing & swap: user-centric approaches towards maximizing location privacy. In: Proceedings of the 5th ACM Workshop on Privacy in Electronic Society, pp. 19–28. ACM, New York (2006)
26. Raya, M., Hubaux, J.P.: The security of vehicular ad hoc networks. In: Proceedings of the 3rd ACM Workshop on Security of Ad Hoc and Sensor Networks, pp. 11–21. ACM, New York (2005)
27. Gerlach, M., Guttler, F.: Privacy in VANETs using changing pseudonyms - ideal and real. In: IEEE 65th Vehicular Technology Conference, VTC 2007-Spring, pp. 2521–2525. IEEE, Los Alamitos (2007)
28. Gerlach, M.: Assessing and Improving Privacy in VANETs. ESCAR, Embedded Security in Cars (2006)
29. Buttyán, L., Holczer, T., Vajda, I.: On the Effectiveness of Changing Pseudonyms to Provide Location Privacy in VANETs. In: Stajano, F., Meadows, C., Capkun, S., Moore, T. (eds.) ESAS 2007. LNCS, vol. 4572, pp. 129–141. Springer, Heidelberg (2007)
30. Dötzer, F.: Privacy issues in vehicular ad hoc networks. In: Danezis, G., Martin, D. (eds.) PET 2005. LNCS, vol. 3856, pp. 197–209. Springer, Heidelberg (2006)
31. Golle, P., Greene, D., Staddon, J.: Detecting and correcting malicious data in VANETs. In: Proceedings of the 1st ACM International Workshop on Vehicular Ad Hoc Networks, pp. 29–37. ACM, New York (2004)
32. Freudiger, J., Raya, M., Felegyhazi, M., Papadimitratos, P., Hubaux, J.P.: Mix-zones for location privacy in vehicular networks. In: Proceedings of the 1st International Workshop on Wireless Networking for Intelligent Transportation Systems (WiN-ITS 2007) (2007)
33. Buttyán, L., Holczer, T., Weimerskirch, A., Whyte, W.: SLOW: A practical pseudonym changing scheme for location privacy in VANETs. In: 2009 IEEE Vehicular Networking Conference (VNC), pp. 1–8. IEEE, Los Alamitos (2010)
34. Dahl, M., Delaune, S., Steel, G.: Formal Analysis of Privacy for Vehicular Mix-Zones. In: Gritzalis, D., Preneel, B., Theoharidou, M. (eds.) ESORICS 2010. LNCS, vol. 6345, pp. 55–70. Springer, Heidelberg (2010)
35. Poovendran, R., Sampigethaya, K., Huang, L., Li, M., Matsuura, K., Sezaki, K.: CARAVAN: Providing Location Privacy for VANET (2005)
36. Sampigethaya, K., Li, M., Huang, L., Poovendran, R.: AMOEBA: Robust Location Privacy Scheme for VANET. IEEE Journal on Selected Areas in Communications 25, 1569–1589 (2007)
37. Papadimitratos, P., Buttyan, L., Hubaux, J.P., Kargl, F., Kung, A., Raya, M.: Architecture for secure and private vehicular communications. In: 7th International Conference on ITS Telecommunications, ITST 2007, pp. 1–6. IEEE, Los Alamitos (2007)
Scheduling Tasks and Communications on a Hierarchical System with Message Contention
Jean-Yves Colin and Moustafa Nakechbandi
LITIS, Université du Havre, IUT, 76610 Le Havre, France {jean-yves.colin,moustafa.nakechbandi}@univ-lehavre.fr
Abstract. A Directed Acyclic Graph (DAG) of tasks with small communication delays has to be scheduled on the identical parallel processors of clusters connected by a hierarchical network. The number of processors and of clusters is not limited. Message contention has to be avoided. Task duplication is allowed. In this paper, we present a new polynomial algorithm that computes the earliest start dates of all tasks and spreads these tasks to use few processors per cluster, for a DAG with small communication delays. It also avoids message contention, and always delivers messages on time. Keywords: Scheduling, DAG, Hierarchical Communications, Message contention, Task Duplication, CPM/PERT.
1 Introduction
The efficient use of distributed memory multiprocessors and grids is a very difficult problem. An application is made of different parts, with specific processing times and communication delays, that need to be scheduled carefully. Examples of applications include numerical analysis applications, logistics systems based on heterogeneous distributed computing systems, high performance Data Mining systems, and Automated Document Factories in banking environments. In the classical scheduling problem with communication delays, a positive processing time is associated to each task of a Directed Acyclic Graph (DAG) and a positive communication delay is associated to each precedence constraint between the tasks of this DAG. The tasks then have to be scheduled on the processors of the distributed memory multiprocessor or grid. This problem is known to be NP-hard in the general case even if the number of available processors is not limited [8]. Many studies are currently available on several aspects of this classical scheduling problem [3] [4] [5] [9] [13] [17] [18]. Task duplication, for example, is used in several studies to lower the communication overheads by executing identical copies of some of the tasks on different processors [1] [4] [5] [12].
This work is partially funded by the GRR "Transport Logistique et Technologie de l’Information" of the Université du Havre, France.
Hierarchical communications are taken into account in some studies too. The processors are typically grouped into clusters, with communication between processors of the same cluster being faster than communication between processors of different clusters [1] [7]. These models are increasingly recognized to be unrealistic, however, because they do not consider message contention [2] [11] [14] [21]. In [15] for example, the authors show the NP-completeness of the two-processor scheduling problem with tasks of execution time 1 or 2 units, unit interprocessor communication latency, and message contention. In [6], a CPM/PERT-like polynomial scheduling algorithm for DAGs with small communication delays and task duplication is proposed. It is optimal and always avoids message contention, if resources are not limited. It does not consider hierarchical communications, however. More recent studies use heuristics to avoid message contention, and present extensive experiments to evaluate the performance improvements [19] [20]. In this paper, we present a new polynomial algorithm for DAGs with small communication delays. The distributed architecture is made of clusters, has a two-level communication network, and has communication channels that can transmit at most one message at any time. The algorithm computes, if resources are not limited, the earliest start dates of all tasks and spreads these tasks to use few processors per cluster. It also schedules the communications so that message contention is avoided, and always delivers messages on time.
2 The 2lVds Problem
2.1 The 2lVds Model
A 2-levels Virtual Distributed System architecture (2lVds) is a distributed memory multi-processor architecture (or grid) with a not limited number of homogeneous processors. The processors are grouped into clusters. Both the number of clusters and the number of processors in each cluster are not limited. Each processor belongs to one and only one cluster (Fig. 1). There is a complete communication network between all the processors. Each direct connection between any two processors is made of two unidirectional channels, one in each direction.
Fig. 1. A 2lVds architecture (clusters of processors, with fast intra-cluster communication channels and slow inter-cluster communication channels)
All communication channels inside all clusters are identical, and all communication channels between processors of different clusters are identical too, but slower than the intra-cluster channels. Each unidirectional channel may carry at most one message at any time. An application is represented by a DAG G = (V, E) (or precedence graph), where V designates the set of tasks, and E the set of precedence constraints. Formally, a 2lVds scheduling problem may then be specified by the four parameters V, E, p, c, in which V = {1, 2, ..., n} is the set of n tasks, E is the set of arcs (i, j), with (i, j) ∈ E representing a precedence constraint from task i ∈ V to task j ∈ V, p is the set of processing times, with pi ∈ p being the processing time of task i ∈ V on any processor π of the 2lVds architecture, and c is the set of communication delays. To each arc (i, j) ∈ E are associated two values ci,j(1) ∈ c and ci,j(2) ∈ c. ci,j(1) is the positive communication delay of a message from task i to task j, if i and j are executed on different processors inside the same cluster (intra-cluster communication delay). ci,j(2) is the positive communication delay of a message from task i to task j, if i and j are executed in different clusters (inter-cluster communication delay), with ci,j(1) ≤ ci,j(2). If two communicating tasks i and j are executed on the same processor, there is no need for any communication, or its duration is considered negligible, so the communication delay is then 0. A task is indivisible, starts when all the data it needs from its predecessors are available, and sends all the data needed by its successors at the end of its execution. All the immediate successors of a task use the same result from this task. This assumption implies that a task needs to send only one message to a given processor, even if several of its successors are to be processed on it, because one message is enough for all. If it does not hold, the task may usually be divided into sub-tasks such that the assumption is satisfied. Fig. 2 presents an example of such a DAG. The value above each node is its processing time, and the two values above each arc are its two communication delays. Task duplication is allowed. That is, several instances (or copies) of the same task may be executed on different processors. We will denote ik the kth copy of task i. Because we must take into account the messages in a schedule, we will denote m(ik, jl) a message sent from a copy ik of task i to a copy jl of task j.
Fig. 2. Example of a DAG with two communication delays
A schedule S of a 2lVds scheduling problem is then a 5-tuple (F, tc, π, M, tm), where
– F(i) is the positive number of copies of task i ∈ V,
– tc(ik) is the starting time of copy ik of task i, 0 < k ≤ F(i),
– π(ik) is the processor assigned to copy ik of task i, 0 < k ≤ F(i),
– M(i, j) is the set of all messages sent by copies of task i to copies of task j,
– tm(m(ik, jl)) is the starting time of message m(ik, jl) ∈ M(i, j).
First, to be feasible, a schedule S must satisfy the following conditions:
– at least one copy of each task is processed, i.e. ∀i ∈ V, F(i) > 0,
– at any time, a processor executes at most one copy,
– for each (i, j) ∈ E, for any copy jl of j, there is one copy ik of i that is on the same processor or that sends its message on time to jl, i.e.
    if π(jl) = π(ik) then tc(jl) ≥ tc(ik) + pi
    else if π(jl) and π(ik) are in the same cluster then tc(jl) ≥ tc(ik) + pi + ci,j(1)
    else tc(jl) ≥ tc(ik) + pi + ci,j(2)
    end if
If, in a schedule S, ik and jl satisfy the above condition, we will say that the Generalized Precedence Constraint is true for the two copies (in short, that GPC(ik, jl) is true). Second, a feasible schedule S must additionally satisfy the condition that there is no message contention, i.e. in all channels used to transmit at least two messages m(ik, jl) and m(rt, sq) from a processor π(ik) to a processor π(jl), with message m(ik, jl) finishing before message m(rt, sq), we have
    if π(jl) and π(ik) are in the same cluster then tm(m(rt, sq)) ≥ tm(m(ik, jl)) + ci,j(1)
    else tm(m(rt, sq)) ≥ tm(m(ik, jl)) + ci,j(2)
    end if
Now, let C(ik) be the completion time of a copy ik of a task i, i.e. C(ik) = tc(ik) + pi. The maximum completion time, or makespan, Cmax of a solution S is the largest completion time of all copies of all tasks in this solution:

$$C_{max} = \max_{i \in V,\; k \le F(i)} \{\, t_c(i_k) + p_i \,\}. \qquad (1)$$
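As an illustration of the feasibility conditions above, the following minimal Python sketch checks the Generalized Precedence Constraint GPC(ik, jl) for one arc (i, j). The dictionary-based encoding of copies, processors and delays is our own illustrative assumption, not part of the paper's formal model.

```python
def gpc_holds(i, k, j, l, t_c, p, c1, c2, proc, cluster_of):
    """Check GPC(i_k, j_l) for an arc (i, j): does copy j_l start late
    enough to receive the result of copy i_k, given their placements?
    t_c[(i, k)]: start of copy i_k; proc[(i, k)]: its processor;
    cluster_of[pi]: the cluster of processor pi (our own encoding)."""
    if proc[(j, l)] == proc[(i, k)]:                          # same processor
        return t_c[(j, l)] >= t_c[(i, k)] + p[i]
    if cluster_of[proc[(j, l)]] == cluster_of[proc[(i, k)]]:  # same cluster
        return t_c[(j, l)] >= t_c[(i, k)] + p[i] + c1[(i, j)]
    return t_c[(j, l)] >= t_c[(i, k)] + p[i] + c2[(i, j)]     # different clusters
```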
As usual for this kind of problem, we want to minimize Cmax, that is, find a feasible solution S* with the smallest makespan C*max. One can note that, if ci,j(1) = ci,j(2), this scheduling problem is actually equivalent to the classical DAG scheduling problem with communication delays
which, in the general case, is an NP-hard problem, even if the number of processors is not limited [16]. For this reason, we will only consider a DAG satisfying the conditions in the following two equations. They guarantee that the DAG has small communication delays. We will denote PRED(i) (respectively SUCC(i)) the set of immediate predecessors (resp. successors) of task i in G.

$$\forall i \in V, \quad \min_{g \in PRED(i)} p_g \;\ge \max_{h \in PRED(i)-\{g\}} c_{h,i}(1). \qquad (2)$$
Equation (2) means that processing times are locally greater than or equal to the communication delays inside the clusters. It ensures that the earliest start date of any copy of each task may be computed in polynomial time.

$$\forall i \in V, \quad \min_{k \in SUCC(i)} p_k \;\ge \max_{j \in SUCC(i)-\{k\}} c_{i,j}(2). \qquad (3)$$
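Both conditions can be verified mechanically before running the algorithm. The following Python sketch does so, under our own data-layout assumptions (tasks as hashable IDs, delays keyed by arc); it is not part of the paper.

```python
def has_small_communication_delays(V, E, p, c1, c2):
    """Check equations (2) and (3) for a DAG.
    p[i]: processing times; c1/c2: intra-/inter-cluster delays per arc."""
    pred = {i: [h for (h, j) in E if j == i] for i in V}
    succ = {i: [j for (h, j) in E if h == i] for i in V}
    for i in V:
        if pred[i]:
            g = min(pred[i], key=lambda h: p[h])    # realizes the min in (2)
            if any(p[g] < c1[(h, i)] for h in pred[i] if h != g):
                return False                         # equation (2) violated
        if succ[i]:
            k = min(succ[i], key=lambda j: p[j])     # realizes the min in (3)
            if any(p[k] < c2[(i, j)] for j in succ[i] if j != k):
                return False                         # equation (3) violated
    return True
```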
Equation (3) is very similar to (2). It likewise means that the processing times are locally greater than or equal to the communication delays between the clusters. However, (2) deals with the predecessors of a task and with the intra-cluster communication delays, while (3) deals with the successors and with the inter-cluster communication delays. Also, (2) is true in most cases if (3) is true. One can note that there is already a trivial solution to the 2lVds problem: use one cluster only, and schedule all tasks on the processors of this cluster using the algorithm in [6]. This trivial solution, however, is not helpful at all, because real architectures have a limited number of processors in each cluster. For this reason, we propose the following new algorithm, 2lVdsOpt. It schedules the tasks and communications of a 2lVds problem in polynomial time and spreads the tasks on as many clusters as possible, to use fewer processors per cluster.
2.2 The 2lVdsOpt Algorithm
This algorithm has four steps. The first step 2lVdsLwb() computes the earliest start dates of all copies of each task of the DAG. The second step 2lVdsCs() computes the critical sequences of the DAG according to the earliest start dates calculated during the first step. The third step 2lVdsCc() computes the graph of the critical sequences of the DAG, and its connected components according to the communication delays ci,j(1). The last step 2lVdsBuild() computes the solution, scheduling the tasks and communications on the 2lVds architecture.
Computing the Earliest Start Dates. The first step of 2lVdsOpt computes the earliest start date bi of all copies of each task i of the DAG. This is done in procedure 2lVdsLwb() (cf. Algorithm 1). Table 1 presents the earliest start dates of each task of the DAG of Fig. 2 computed by procedure 2lVdsLwb().
Computing the Critical Sequences. The second step of 2lVdsOpt computes the critical sequences resulting from the earliest start dates calculated during step 1. Let B be the set of the earliest start dates bi of all tasks of V.
Algorithm 1. procedure 2lVdsLwb(V, E, p, c)
for all tasks i ∈ V such that PRED(i) = ∅ do
    let bi = 0 {assign 0 to i as its earliest start date bi}
end for
while there is a task i which has not been assigned an earliest starting date bi and whose predecessors h ∈ PRED(i) all have an earliest starting date bh assigned to them do
    let c = max_{h ∈ PRED(i)} (bh + ph + ch,i(1))
    find g ∈ PRED(i) such that bg + pg + cg,i(1) = c
    let bi = max( bg + pg , max_{h ∈ PRED(i)−{g}} (bh + ph + ch,i(1)) )
end while

Table 1. Earliest start dates bi of the tasks i of the DAG of Fig. 2 computed by procedure 2lVdsLwb() (cf. Algorithm 1)

task i:  1  2  3  4  5  6  7  8   9  10  11  12  13  14  15  16
bi:      0  2  4  6  4  7  9  11  0   9  11  13  11  14  16  18
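For concreteness, here is a direct Python transcription of procedure 2lVdsLwb(); the function name and container choices are ours, and termination relies on G being a DAG satisfying equations (2) and (3).

```python
def lwb(V, E, p, c1):
    """Earliest start dates b_i, transcribing procedure 2lVdsLwb (Algorithm 1).
    V: list of tasks; E: set of arcs (h, i); p[i]: processing times;
    c1[(h, i)]: intra-cluster delay of arc (h, i)."""
    pred = {i: [h for (h, j) in E if j == i] for i in V}
    b = {i: 0 for i in V if not pred[i]}          # source tasks start at date 0
    while len(b) < len(V):
        for i in V:
            if i in b or any(h not in b for h in pred[i]):
                continue                           # not ready yet
            # g realizes the max of b_h + p_h + c_{h,i}(1) over PRED(i)
            g = max(pred[i], key=lambda h: b[h] + p[h] + c1[(h, i)])
            others = [b[h] + p[h] + c1[(h, i)] for h in pred[i] if h != g]
            b[i] = max([b[g] + p[g]] + others)
    return b
```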
Let GC be the critical subgraph of G according to the earliest start dates in B. (i, j) is an arc of GC if (i, j) ∈ E and bj < bi + pi + ci,j(1). That is, an arc (i, j) in GC means that these two tasks must have copies on the same processor, because there is not enough time to transmit the result of any copy ik to a copy jl from one processor to another processor of the same cluster. GC is always a forest [5]. A critical sequence scs of the DAG is a proper path of GC. The computation is done in procedure 2lVdsCs() (cf. Algorithm 2).
Computing the Graph of the Critical Sequences. The third step of 2lVdsOpt builds the undirected graph GSC of the critical sequences scs and computes its connected components [10]. Let CC be the set of all computed critical sequences scs.
Algorithm 2. procedure 2lVdsCs(V, E, p, c, B)
GC = ∅
for all arcs (i, j) ∈ E do
    if bj < bi + pi + ci,j(1) then GC = GC ∪ {(i, j)} end if
end for
s = 0
for all tasks i ∈ V do
    if task i is a leaf of the critical subgraph GC then
        let critical sequence scs be the path from the root of the tree in GC that includes task i, to task i
        s = s + 1
    end if
end for
GSC has one node ss for each critical sequence scs of CC computed during the previous step. Also, there is one edge (ss, st) or (st, ss) in GSC if ∃(i, j) ∈ E, with i ∈ scs, i ∉ sct, and j ∈ sct, such that bj < bi + pi + ci,j(2). This edge means that there is not enough time to transmit one message from at least one task i of scs to another task j of sct between two clusters. So scs and sct must be processed in the same cluster. The computation is done in procedure 2lVdsCc() (cf. Algorithm 3).
Algorithm 3. procedure 2lVdsCc(V, E, p, c, B, CC)
GSC = ∅
for all critical sequences scs ∈ CC do
    let ss be the new node related to scs
end for
for all nodes ss do
    GSC = GSC ∪ {ss}
    for all nodes st ∈ GSC − {ss} do
        if there is no edge between ss and st in GSC, and there is at least one arc (i, j) of E with i ∈ scs, i ∉ sct and j ∈ sct, such that bj < bi + pi + ci,j(2) then
            add one edge between ss and st to GSC
        end if
    end for
end for
compute the connected components gs of GSC
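The following sketch, assuming critical sequences are given as lists of tasks, mirrors Algorithm 3 with a small union-find structure for the connected components; the data layout is again our own.

```python
def gsc_components(CC, E, b, p, c2):
    """Connected components of GSC, mirroring Algorithm 3. CC is the list
    of critical sequences (each a list of tasks)."""
    parent = list(range(len(CC)))
    def find(x):                                   # find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for s in range(len(CC)):
        for t in range(len(CC)):
            if s == t:
                continue
            # an arc (i, j) leaving sc_s into sc_t with too little slack
            tight = any((i, j) in E and b[j] < b[i] + p[i] + c2[(i, j)]
                        for i in CC[s] if i not in CC[t]
                        for j in CC[t])
            if tight:
                parent[find(s)] = find(t)          # sc_s and sc_t share a cluster
    groups = {}
    for s in range(len(CC)):
        groups.setdefault(find(s), []).append(s)
    return list(groups.values())
```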
Fig. 3 shows the six critical sequences sc1 to sc6 found for the DAG of Fig. 2 using the computed earliest start dates in Table 1. It also shows the graph of the critical sequences and its two connected components.
Computing the Solution. The last step of 2lVdsOpt builds a solution with minimal makespan using all the data computed in the preceding phases. One cluster is allocated to each connected component, and one processor of this cluster is allocated to each critical sequence of this connected component. One copy of each task of each critical sequence is executed at its earliest start date. All messages are sent as soon as the sending copy of the task finishes its execution. The computation is done in procedure 2lVdsBuild() (cf. Algorithm 4). Fig. 4 shows the Gantt chart of the final schedule found for the DAG of Fig. 2. Two clusters, each with three processors, are used. Tasks 1, 2, 9 and 10 have two copies each in this schedule.
2.3 Analysis of the Algorithm
Let n be the number of tasks and m be the number of arcs. The complexity of procedure 2lVdsLwb() is O(max(m, n)), and the complexity of procedure 2lVdsCs() is O(m). The complexity of building the graph of the critical sequences in 2lVdsCc() is O(n) [5], and of computing its connected components is O(n). Thus the complexity of 2lVdsCc() is O(n) too.
Fig. 3. The six critical sequences sc1 to sc6 in the critical graph GC of the DAG in Fig. 2 (left), and the graph GSC of these critical sequences with the two resulting connected components (right)
Algorithm 4. procedure 2lVdsBuild(V, E, p, c, B, CC, GSC)
for all connected components gc ∈ GSC do
    allocate a new cluster Πc to gc
    for all nodes ss ∈ gc do
        let scs be the critical sequence related to node ss
        allocate a new processor πs in cluster Πc to this critical sequence scs
        for all tasks i ∈ scs do
            F(i) = F(i) + 1, tc(iF(i)) = bi, π(iF(i)) = πs
        end for
    end for
end for
for all copies jl of task j do
    let π(jl) be the processor that executes jl
    for all tasks i ∈ PRED(j) do
        if there is no copy of task i on π(jl) and π(jl) does not already receive one message from any copy of i on time for copy jl then
            remove any message m from any copy of i to processor π(jl)
            find one copy ik that can send its message on time to jl
            send one message m(ik, jl) from copy ik at date bi + pi to processor π(jl)
        end if
    end for
end for
Fig. 4. Gantt chart of the solution of the DAG of Fig. 2, using two clusters Π1 and Π2, with three processors per cluster (π1, π2 and π3 in Π1, and π4, π5 and π6 in Π2)
Using a graph-level approach, one can show that the complexity of the first part of 2lVdsBuild() is O(n^2). Because the second part of 2lVdsBuild() tries, in the worst case, to find one suitable copy of each predecessor for each copy of each task, it is possible to establish that the complexity of this second part is O(m^2 n^2). The complexity of procedure 2lVdsBuild() is then O(m^2 n^2). So the complexity of the overall algorithm is O(m^2 n^2). Also, we have the following theorems.
Theorem 1. The solution built by 2lVdsOpt has minimal makespan.
Theorem 2. At least one copy of each task is executed.
Theorem 3. The GPC are true for all copies of all tasks.
Theorem 4. In the solution computed, each copy of each task receives at least one message on time from at least one copy of each of its predecessors, if a message is needed.
Theorem 5. There is no message contention on any unidirectional channel.
3 Conclusion
A Directed Acyclic Graph of tasks with small communication delays had to be scheduled on the identical parallel processors of several clusters connected by a hierarchical network. The number of processors and of clusters was not limited. Message contention had to be avoided. Task duplication was allowed. We presented a new polynomial algorithm that computes the earliest start dates of tasks and spreads these tasks to use few processors per cluster. It also schedules the communications so that there is no message contention and messages are always delivered on time.
References
1. Bampis, E., Giroudeau, R., König, J.-C.: Using Duplication for Multiprocessor Scheduling Problem with Hierarchical Communications. Parallel Processing Letters 10(1), 133–140 (2000)
2. Beaumont, O., Boudet, V., Robert, Y.: A Realistic Model and an Efficient Heuristic for Scheduling with Heterogeneous Processors. In: 11th Heterogeneous Computing Workshop (HCW 2002). IEEE Computer Society Press, Los Alamitos (2002)
3. Bittencourt, L.F., Sakellariou, R., Madeira, E.R.M.: DAG Scheduling Using a Lookahead Variant of the Heterogeneous Earliest Finish Time Algorithm. In: 18th Euromicro International Conference on Parallel, Distributed and Network-Based Computing (PDP 2010), Pisa, Italy (2010)
4. Bozdag, D., Ozguner, F., Catalyurek, U.V.: Compaction of Schedules and a Two Stage Approach for Duplication-Based DAG Scheduling. IEEE Transactions on Parallel and Distributed Systems 20(6), 857–871 (2009)
5. Colin, J.-Y., Chrétienne, P.: Scheduling with Small Communication Delays and Task Duplication. Operations Research 39(4), 680–684 (1991)
6. Colin, J.-Y., Colin, P.: Scheduling Tasks and Communications on a Virtual Distributed System. European Journal of Operational Research 94(2) (1996)
7. Colin, J.-Y., Nakechbandi, M.: Scheduling Tasks with Communication Delays on 2-Levels Virtual Distributed Systems. In: Proceedings of the 7th Euromicro Workshop on Parallel and Distributed Processing (PDP 1999), Funchal, Portugal (1999)
8. Garey, M., Johnson, D.: Computers and Intractability, a Guide to the Theory of NP-Completeness. Freeman, New York (1979)
9. Giroudeau, R., König, J.-C.: Scheduling with Communication Delay. In: Multiprocessor Scheduling: Theory and Applications, pp. 1–26. ARS Publishing (2007)
10. Hopcroft, J., Tarjan, R.: Efficient Algorithms for Graph Manipulation. Communications of the ACM 16, 372–378 (1973)
11. Kalinowski, T., Kort, I., Trystram, D.: List Scheduling of General Task Graphs under LogP. Parallel Computing 26, 1109–1128 (2000)
12. Kruatrachue, B., Lewis, T.G.: Grain Size Determination for Parallel Processing. IEEE Software 5(1), 23–32 (1988)
13. Kwok, Y.-K., Ahmad, I.: Static Scheduling Algorithms for Allocating Directed Task Graphs to Multi-Processors. ACM Computing Surveys (CSUR) 31(4), 406–471 (1999)
14. Marchal, L., Rehn, V., Robert, Y., Vivien, F.: Scheduling Algorithms for Data Redistribution and Load-Balancing on Master-Slave Platforms. Parallel Processing Letters 17(1), 61–77 (2007)
15. Norman, M.G., Pelagatti, S., Thanisch, P.: On the Complexity of Scheduling with Communication Delay and Contention. Parallel Processing Letters 5(3), 331–341 (1995)
16. Papadimitriou, C.B., Yannakakis, M.: Toward an Architecture Independent Analysis of Parallel Algorithms. In: Proceedings of the 20th Annual ACM Symposium on Theory of Computing, Santa Clara, California, USA (1988)
17. Rayward-Smith, V.J.: Scheduling with Unit Interprocessor Communication Delays. Discrete Math. 18, 55–71 (1987)
18. Sarkar, V.: Partitioning and Scheduling Parallel Programs for Execution on Multiprocessors. MIT Press, Cambridge (1989)
19. Sinnen, O., Sousa, L.: Communication Contention in Task Scheduling. IEEE Transactions on Parallel and Distributed Systems 16(6), 503–515 (2005)
20. Sinnen, O., To, A., Kaur, M.: Contention-Aware Scheduling with Task Duplication. Journal of Parallel and Distributed Computing 71(1), 77–86 (2011)
21. Tam, A., Wang, C.L.: Contention-Aware Communication Schedule for High Speed Communication. Cluster Computing 6(4), 339–353 (2003)
Spiking Neural P System Simulations on a High Performance GPU Platform
Francis George Cabarle (1), Henry Adorna (1), Miguel A. Martínez-del-Amor (2), and Mario J. Pérez-Jiménez (2)
(1) Algorithms & Complexity Lab, Department of Computer Science, University of the Philippines Diliman, Diliman 1101 Quezon City, Philippines [email protected], [email protected]
(2) Research Group on Natural Computing, Department of Computer Science and Artificial Intelligence, University of Seville, Avda. Reina Mercedes s/n, 41012 Sevilla, Spain {mdelamor,marper}@us.es
Abstract. In this paper we present our results in adapting a Spiking Neural P system (SNP system) simulator to a high performance graphics processing unit (GPU) platform. In particular, we extend our simulations to larger and more complex SNP systems using an NVIDIA Tesla C1060 GPU. The C1060 is manufactured for high performance computing and massively parallel computations, matching the maximally parallel nature of SNP systems. Using our GPU accelerated simulations we present speedups of around 200× for some SNP systems, compared to CPU only simulations. Keywords: Membrane computing, Spiking Neural P systems, GPU computing, CUDA, parallel computing.
1 Introduction
P systems are by nature distributed, parallel, and non-deterministic computing models defined within Membrane computing, which is a research area initiated by Gheorghe Păun in 1998 [16]. The objective, as with other disciplines of Natural computing (e.g. DNA/molecular computing, quantum computing, etc.), is to obtain inspiration from the way nature computes to provide efficient solutions to the limitations of conventional models of computation, e.g. a Turing machine. Membrane computing can be thought of as an extension of DNA or molecular computing, zooming out from the individual molecules of the DNA and including other parts and sections of living cells in the computation, introducing the concept of distributed computing as well [16]. P systems are abstractions of the compartmentalized structure and parallel processing of biochemical information in biological cells. There are several P system variants defined in the literature, each one based on the abstraction of different aspects (or ingredients) of cells, and many of them have been proven
to be computationally complete [5]. There are three general classifications of P systems considering the level of abstraction: cell-like (a rooted tree where the skin or outermost cell membrane is the root), tissue-like (a graph connecting the cell membranes) and neural-like (a directed graph, inspired by neurons interconnected by their axons and synapses). The last type refers to Spiking Neural P systems (in short, SNP systems), where the time difference (when neurons fire and/or spike) plays an essential role in the computations [11]. An interesting result of P systems is that they are able to solve computationally hard problems (e.g. NP-complete problems) usually in polynomial, often linear time, but usually requiring exponential space as a trade-off [16]. Due to the nature of P systems, they are yet to be fully implemented in vivo, in vitro, or even in silico. Thus, practical computations of P systems are driven by silicon-based simulators. There are several simulators for P systems implemented over different software and hardware technologies [7]. In practice, P system simulations are limited by the physical laws of silicon architectures, which are often inefficient or not suitable when dealing with P system features, such as massive parallelism. However, in order to improve the efficiency of the simulators, it is necessary to exploit current technologies, leading to solutions in the area of High Performance Computing (HPC), such as accelerators or many-core processors. In this respect, Graphics Processing Units (GPUs) have been consolidated as accelerators thanks to their throughput-oriented and highly parallel architecture [9]. Several simulators for P systems have been developed over highly parallel platforms, including reconfigurable hardware as in FPGAs [14], CPU-based clusters [6], as well as NVIDIA corporation's Compute Unified Device Architecture (CUDA) enabled GPUs [4,3]. These efforts show that parallel devices are very suitable for accelerating the simulation of P systems, at least for transition and active membrane P systems [3,4]. Efficiently simulating a Spiking Neural P (SNP) system, the P system variant of interest in this work, thus requires new efforts in parallel computing. Since SNP systems have already been represented as matrices due to their graph-like properties [18], simulating them on parallel devices such as GPUs is the next natural step. Matrix algorithms are well known in the parallel computing literature, including on GPUs [8], due to the highly parallelizable nature of linear algebra computations, which map directly to the data-parallel GPU architecture. An SNP system simulator using CUDA was presented in [1] and [2]. These previous works, however, were executed only on workstation GPUs, hence we intend to do better. We adapt and analyse the performance of this simulator on a high-end NVIDIA Tesla C1060 GPU, designed from the ground up for parallel computing and HPC, by simulating SNP systems of different sizes. A final simulator for SNP systems using CUDA would allow designers to check their models, and perform other complex computations such as computing backwards. This paper is organized as follows: Sections 2 and 3 provide backgrounds for CUDA and SNP systems, respectively. The design of the simulator and the simulation results are given in Sections 4 and 5, respectively.
2 GPU Computing and NVIDIA CUDA
As many-core platforms, GPUs are massively data-parallel processors which have high chip scalability in terms of processing units (cores, threads), and high bandwidth to internal GPU memories. The architectural difference between CPUs and GPUs is the reason why the latter offer a larger performance increase over CPU-only implementations for parallel code working on large amounts of input data [12]. The main advantages of using GPUs are their low cost, low maintenance and low power consumption relative to conventional parallel clusters and setups, while providing comparable or improved computational power [10]. For example, the latest GPUs of NVIDIA with 512 cores are readily available at consumer electronics stores for around $500. GPUs can be programmed using a framework introduced by NVIDIA in 2007 called CUDA [12]. CUDA is a programming model and hardware architecture for general purpose computations on NVIDIA's GPUs [12]. The programmer can use CUDA free of charge (including the compiler, driver, SDK, libraries, etc.), and it is easy to learn because it is an extension of the C language. CUDA implements a heterogeneous computing architecture, where two different parts are often considered: the host (CPU side) and the device (GPU side). The host part of the code is responsible for controlling the program execution flow, transferring data to and from the device memory, and executing specific codes, called kernel functions, on the device. The device acts as a parallel coprocessor to the host. The host outsources the parallel part of the program, as well as the data, to the device, since the device is more suited to parallel computations than the host. The kernel code is executed on the device by a set of threads. They are organized into a three-level hierarchy, from highest to lowest: a grid of thread blocks, blocks of threads, and threads, which can share data through shared memory and can perform simple barrier synchronization [12,15]. Using kernel functions, the programmer can specify the GPU resources: up to 65,535 blocks and up to 512 threads per block.
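To make the grid/block/thread hierarchy concrete, here is a minimal PyCUDA sketch (PyCUDA is also the wrapper used by the simulator described later); the kernel, array size and block/grid choices are illustrative assumptions only.

```python
import numpy as np
import pycuda.autoinit                      # creates a CUDA context on import
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void scale(float *v, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
    if (idx < n)
        v[idx] *= 2.0f;
}
""")
scale = mod.get_function("scale")

n = 2048
v = np.arange(n, dtype=np.float32)
threads = 512                               # threads per block (the limit above)
blocks = (n + threads - 1) // threads       # enough blocks to cover n elements
scale(drv.InOut(v), np.int32(n), block=(threads, 1, 1), grid=(blocks, 1))
print(v[:4])                                # -> [0. 2. 4. 6.]
```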
3 Spiking Neural P Systems
Now we first formally define SNP systems as computing models. An SNP system without delay, of degree m ≥ 1, is of the form Π = (O, σ1, ..., σm, syn, in, out), where:
[1.] O = {a} is the alphabet made up of only one object a, called spike;
[2.] σ1, ..., σm are m neurons of the form σi = (ni, Ri), 1 ≤ i ≤ m, where:
(a) ni ≥ 0 gives the initial number of spikes (a) contained in neuron σi;
(b) Ri is a finite set of rules of the following forms:
(b-1) E/ac → ap are Spiking rules, where E is a regular expression over a, c ≥ 1, and p ≥ 1 spikes are produced (with the restriction c ≥ p) and transmitted to each adjacent neuron with σi as the originating neuron, provided ac ∈ L(E); ak → ap is a special case of (b-1) where L(E) = {ac}, k = c, p = 1;
(b-2) as → λ are Forgetting rules, for s ≥ 1, such that for each rule E/ac → ap of type (b-1) from Ri, as ∉ L(E);
[3.] syn = {(i, j) | 1 ≤ i, j ≤ m, i ≠ j} are the synapses, i.e. the connections between neurons;
[4.] in, out ∈ {1, 2, ..., m} are the input and output neurons, respectively.
The system works as follows. At any given time, a neuron σi should use exactly one rule, if and only if the condition ac ∈ L(E) is met. This condition means that as long as the multiplicity of spikes is in the language generated by the regular expression E, a rule (or several of them) is (are) applicable. The rule to be used or applied is chosen non-deterministically. If a spiking rule is used, after rule application c spikes are consumed in σi, producing p spikes in all σj such that (i, j) ∈ syn. If a forgetting rule is applied, s spikes a are removed from σi and no spike is produced. The system follows a global clock. Parallelism is at the system level, although each neuron works sequentially.
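Since applicability only depends on whether the neuron's spike count lies in L(E), it can be checked mechanically. A minimal Python sketch, assuming E is supplied as a Python regular expression over 'a' (an encoding choice of ours), could look as follows:

```python
import re

def applicable(n_spikes, E, c):
    """A rule E/a^c -> a^p is applicable in a neuron holding n_spikes
    spikes iff a^{n_spikes} is in L(E) and at least c spikes exist."""
    return n_spikes >= c and re.fullmatch(E, "a" * n_spikes) is not None

# e.g., a rule a(aa)*/a -> a in a neuron holding 3 spikes:
print(applicable(3, "a(aa)*", 1))   # True: 'aaa' is in L(a(aa)*)
```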
Fig. 1. Π1 generates the set N - {1}. Π1 outputs are the time differences between the first spike of σ3 and its succeeding spikes. A total ordering of the neurons is seen (σ1 to σ3 ) including a total ordering of the rules (1 to 5).
We designate the SNP system shown in Figure 1 as Π1 [18]. For our simulations we use two additional systems, Figure 8 in [11] and Figure 14 in [11], which we designate as Π2 and Π3 respectively. Next we present the matrix representation of an SNP system and its computations. This representation makes use of the following vector and matrix definitions:
Configuration vector. Ck is the vector containing the number of spikes in every σ at the kth computation step/time. C0 is the initial Ck of the system.
Spiking vector. Sk shows, at a given Ck, if a rule is applied (having value 1) or not (having value 0 instead).
Spiking transition matrix. MSNP is a matrix comprised of elements aij, where aij is given as: −c if rule ri is in σj and is applied, consuming c spikes; p if rule ri is in σs (s ≠ j and (s, j) ∈ syn) and is applied, producing p spikes in total; 0 if rule ri is in σs (s ≠ j and (s, j) ∉ syn). The spiking transition matrix MΠ1 is shown in equation (1).

$$M_{\Pi_1} = \begin{pmatrix} -1 & 1 & 1 \\ -2 & 1 & 1 \\ 1 & -1 & 1 \\ 0 & 0 & -1 \\ 0 & 0 & -2 \end{pmatrix} \qquad (1)$$

Equation (2) provides the configuration vector at the (k + 1)th step:

$$C_{k+1} = C_k + S_k \cdot M_{\Pi} \qquad (2)$$
For Π1, C0 = <2, 1, 1>, and we have S0 = <1, 0, 1, 1, 0> given this C0. Note that a second alternative, S0 = <0, 1, 1, 1, 0>, is possible if we use rule (2) instead of rule (1) (but not both at the same time). Validity in this case means that only one among several applicable rules is used and thus represented in the Sk. The C0 and S0 for Π2 and Π3 can be shown similarly.
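Equation (2) is a plain vector-matrix update, so one step of Π1 can be reproduced with a few lines of NumPy, using exactly the values above (a CPU-side sketch of what the simulator later offloads to the GPU):

```python
import numpy as np

# M_Pi1 from equation (1): one row per rule, one column per neuron.
M = np.array([[-1,  1,  1],
              [-2,  1,  1],
              [ 1, -1,  1],
              [ 0,  0, -1],
              [ 0,  0, -2]])
C0 = np.array([2, 1, 1])         # initial configuration vector
S0 = np.array([1, 0, 1, 1, 0])   # the first valid spiking vector above

C1 = C0 + S0.dot(M)              # equation (2)
print(C1)                        # -> [2 1 2]
```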
4 Parallel SNP System Simulation on GPU
We designate the improved SNP system simulator in this paper as snpgpu-sim4, which is an update to the snpgpu-sim3 produced in [2]. The improvements of snpgpu-sim4 over snpgpu-sim3 include the use of multiple thread blocks to accommodate matrices of more than 512 elements, and a more streamlined part of the simulation code for handling the relationships between Ri, Ck, and Sk. This section will further expound on these, among other things. The simulator takes in 3 inputs: Mf, C0f, and Rf, which are the file counterparts of M, C0, and Ri, respectively. Skf is the file counterpart of Sk, which is produced by the simulator itself once it is run. PyCUDA was used in addition to the conventional Python and CUDA C languages. PyCUDA is a Python wrapper for NVIDIA CUDA C and C++, enabling programmers to create GPU software using Python, and has been used for high performance computing [13]. The inputs are text files with delimiters between one rule and another within a σ, and between the σs themselves. The elements of M are entered in row-major order format into the file, and are mapped onto each thread of a thread block within the block grid, as shown in Figure 3. Figure 2 shows an instance of host-device interaction. The host functions sequentially and calls the kernel function(s). The device is split up into a grid of thread blocks, each with their own threads, which operate on the data in a single program, multiple data (SPMD) programming style [12]. The simulation algorithm is shown in Algorithm 1, which also indicates where a specific part of the simulation runs (either host or device). Part I loads the 3 initial inputs and the succeeding inputs from their file counterparts, checking for formatting and pre-processing them for Part II. Part II, from Part I's outputs and from
Fig. 2. Diagram showing a single run of the simulation flow. The host runs sequentially while the device is made up of a grid of thread blocks, each with their own threads operating in parallel.
Algorithm 1. Overview of the SNP system simulation algorithm
Require: Input files: Ck, M, r.
I. (HOST) Load input files. Mf and Rf are loaded once only. C0f is also loaded once, then the Ckf s afterwards.
II. (HOST) Determine if a rule in Rf is applicable based on the number of spikes present in each σ seen in Ckf. Then generate all valid and possible spiking vectors in a list of lists Skf.
III. (DEVICE) Run the kernel function on all valid and possible Skf s from the current Ckf. Produce the file counterparts of the next configurations Ck+1 and their corresponding Skf s.
IV. (HOST+DEVICE) Repeat steps I to IV, until at least one of the two stopping criteria is encountered.
the Ckf and Rf, produces all the valid and possible Skf s. Part II produces all valid and possible Skf files as follows: for each ni of σi, the {1,0} strings are produced on a per-neuron level. For example, for Π1 we have n1 = 2 for σ1. Now we have the σ1 strings '10' (choose to use R1 instead of R2) and '01' (choose to use R2 over R1). We only have one string for σ2, the string '1', since σ2 has only one rule and it is readily applicable. Neuron σ3 also produces only one string, '10', since only one rule is applicable given its n3 = 1: only R4 is used in σ3, and not R5. Once all the neuron-level {1,0} strings are produced, the strings are exhaustively paired up with the strings of the other σs from left to right, as the ordering is important. The output of Part II in this example, given Ck = <2, 1, 1>, is therefore (1,0,1,1,0) and (0,1,1,1,0). Elements of the input files are treated as strings up to this point, because of the concatenation and regular expression checking processes, among others. Part III now treats the input elements as integral values. Equation (2) is performed in parallel such that each thread is either adding or multiplying a (matrix or vector) element. Once the Ck+1 are produced by the device, the results are
moved back to the host. Part IV then checks whether to proceed or to stop based on two stopping criteria for the simulation: (I) a zero vector (vector of zeros) is encountered; (II) the succeeding Ck s have all been produced in previous computations. Both (I) and (II) make sure that the simulation halts and does not enter an infinite loop.
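A minimal Python sketch of the Part II construction just described (the helper name and the boolean applicability flags are our own; in the simulator these flags come from the regular expression checks):

```python
from itertools import product

def neuron_choices(applicable_flags):
    """Given flags marking which of a neuron's rules are applicable,
    return the per-neuron {1,0} strings: exactly one applicable rule
    is chosen; a neuron with no applicable rule contributes all zeros."""
    idxs = [i for i, ok in enumerate(applicable_flags) if ok]
    if not idxs:
        return ["0" * len(applicable_flags)]
    return ["".join("1" if i == j else "0" for i in range(len(applicable_flags)))
            for j in idxs]

# Pi_1 at C0 = <2, 1, 1>: rules 1 and 2 applicable in sigma_1,
# rule 3 in sigma_2, rule 4 (not 5) in sigma_3.
per_neuron = [neuron_choices([True, True]),
              neuron_choices([True]),
              neuron_choices([True, False])]
spiking_vectors = ["".join(parts) for parts in product(*per_neuron)]
print(spiking_vectors)   # -> ['10110', '01110']
```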
Fig. 3. Different representations of a given matrix X: (a) original matrix form (b) linear array in row-major order form, (c) using CUDA thread blocks in a single thread block grid. The linear array shows how the array’s elements are laid out: a (4 × 4) grid made up of (2 × 2) thread blocks. Each thread in a thread block computes a unique element of the array in parallel, and all of them execute the same kernel function.
5 Simulation Results and Observations
The simulations in this paper were executed using an Intel Xeon E5504 quad-core CPU running at 2 GHz per core (there are two of these CPUs, so there are effectively 8 cores). Each core has a 4MB cache. The GPU is an NVIDIA Tesla C1060 high performance GPU with 240 streaming-processor (SP) cores organized as 30 streaming multiprocessors (SMs), and it has 4GB of memory for storing data used by the kernel functions. A 64-bit Ubuntu 10.04 Linux operating system was used to host the simulations. A sequential, i.e. CPU-only, version of snpgpu-sim4 was created and compared to snpgpu-sim4. We designate this CPU-only simulator as snpcpu-sim4. snpcpu-sim4 is identical to snpgpu-sim4 except for the computation of equation (2). Figure 4 shows the running times of the simulators with Π1 as the SNP system. The run times per SNP system are shown using three different time measurements: the real time, user time, and sys time, taken using the Ubuntu Linux command time, based on the Unix command of the same name. The real time is the time that has elapsed during the run of the program (a 'wall clock' time measurement). The user time is the time spent by the program running in the CPU while in user mode. The sys time is the total CPU time used by the OS on behalf of the program that is being measured, while the process is in kernel mode. A program or process in kernel mode can use system calls or services such as allocating memory for itself, including hardware access (a more privileged
execution mode), while being in user mode means the program is usually restricted to its initial resources only (a less privileged execution mode) [17]. In Figure 4 we see the large improvement of snpgpu-sim4 over snpcpu-sim4, as expected. As expected also, snpcpu-sim4 used up more CPU time, as seen in the real and sys times. It is worth mentioning that snpgpu-sim4 used a bit more of the CPU in the user times (though still significantly less than snpcpu-sim4), because snpgpu-sim4 still needed some work from the CPU to process the inputs. Another noteworthy point is that, in all three runtime figures (Figures 4 to 6), the user run time is far less than the other two time measurements, because it only measures the time used by the program alone in the CPU, and no other programs are involved in the time measurement. Table 1 summarizes the averages of the kernel function runtimes and of the CPU counterparts of the kernel functions, as well as the average speedups. The maximum size, in terms of the number of neurons (Cknum) and rules (Rnum) of a system, that the current setup can simulate is given by Cknum = 4 GBytes / (16 Bytes + 4 Bytes × Rnum).
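As a worked example of this capacity formula (the rule count is hypothetical; the byte constants are read off the formula itself):

```python
def max_neurons(r_num, mem_bytes=4 * 1024**3):
    """Largest Cknum the 4 GB device can hold, per the formula above."""
    return mem_bytes // (16 + 4 * r_num)

print(max_neurons(5))   # -> 119304647, i.e. ~119 million neurons for 5 rules
```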
Fig. 4. Runtime graph of snpgpu-sim4 versus snpcpu-sim4 for Π1 showing (a) real, (b) user, and (c) sys times usage
Fig. 5. Runtime graph of snpgpu-sim4 versus snpcpu-sim4 for Π2 showing (a) real, (b) user, and (c) sys times usage
Fig. 6. Runtime graph of snpgpu-sim4 versus snpcpu-sim4 for Π3 showing (a) real, (b) user, and (c) sys times usage

Table 1. Summary of averages: kernel and CPU times, and speedup. All time measurements are in seconds, except for KRTA which is in microseconds. RTSA is Real Time Speedup Average, UTSA is User Time Speedup Average, STSA is System Time Speedup Average. KRTA is the Kernel Runtime Average, the amount of time the kernel function spent running inside the GPU/device. CRTA is the CPU Runtime Average, the amount of CPU time used by the CPU only (i.e. sequential) counterpart of the kernel function.

     RTSA            UTSA          STSA            KRTA               CRTA
Π1   156.1439811343  3.5999180999  178.3754195194  107.33688871 μs    3.8535563
Π2   3.2014649226    0.9619771863  4.3513513514    216.442000587 μs   3.938559
Π3   67.0445847755   8.4018691589  192.8963174046  153.418998544 μs   3.9137748
Acknowledgments. Francis Cabarle is supported by the DOST-ERDT program. Henry Adorna is funded by the DOST-ERDT research grant and the Alexan professorial chair of the UP Diliman Department of Computer Science. M.A. Martínez-del-Amor and M.J. Pérez-Jiménez are supported by "Proyecto de Excelencia con Investigador de Reconocida Valía" of the "Junta de Andalucía" under grant P08-TIC04200, and by the project TIN2009–13192 of the "Ministerio de Educación y Ciencia" of Spain, both co-financed by FEDER funds.
References
1. Cabarle, F., Adorna, H., Martínez-del-Amor, M.A.: An Improved GPU Simulator For Spiking Neural P Systems. Accepted in the IEEE Sixth International Conference on Bio-Inspired Computing: Theories and Applications, Penang, Malaysia (September 2011)
2. Cabarle, F., Adorna, H., Martínez-del-Amor, M.A.: A Spiking Neural P system simulator based on CUDA. Accepted in the Twelfth International Conference on Membrane Computing, Paris, France (August 2011)
3. Cecilia, J.M., García, J.M., Guerrero, G.D., Martínez-del-Amor, M.A., Pérez-Hurtado, I., Pérez-Jiménez, M.J.: Simulating a P system based efficient solution to SAT by using GPUs. Journal of Logic and Algebraic Programming 79(6), 317–325 (2010)
4. Cecilia, J.M., García, J.M., Guerrero, G.D., Martínez-del-Amor, M.A., Pérez-Hurtado, I., Pérez-Jiménez, M.J.: Simulation of P systems with active membranes on CUDA. Briefings in Bioinformatics 11(3), 313–322 (2010)
5. Chen, H., Ionescu, M., Ishdorj, T.-O., Păun, A., Păun, G., Pérez-Jiménez, M.: Spiking neural P systems with extended rules: universality and languages. Natural Computing: an International Journal 7(2), 147–166 (2008)
6. Ciobanu, G., Wenyuan, G.: P Systems Running on a Cluster of Computers. In: Martín-Vide, C., Mauri, G., Păun, G., Rozenberg, G., Salomaa, A. (eds.) WMC 2003. LNCS, vol. 2933, pp. 123–139. Springer, Heidelberg (2004)
7. Díaz, D., Graciani, C., Gutiérrez, M.A., Pérez-Hurtado, I., Pérez-Jiménez, M.J.: Software for P systems. In: Păun, G., Rozenberg, G., Salomaa, A. (eds.) The Oxford Handbook of Membrane Computing, ch. 17, pp. 437–454. Oxford University Press, Oxford (2009)
8. Fatahalian, K., Sugerman, J., Hanrahan, P.: Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. In: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware (HWWS 2004), pp. 133–137. ACM, NY (2004)
9. Garland, M., Kirk, D.B.: Understanding throughput-oriented architectures. Communications of the ACM 53(11), 58–66 (2010)
10. Harris, M.: Mapping computational concepts to GPUs. In: ACM SIGGRAPH 2005 Courses, NY, USA (2005)
11. Ionescu, M., Păun, G., Yokomori, T.: Spiking Neural P Systems. Journal Fundamenta Informaticae 71(2,3), 279–308 (2006)
12. Kirk, D., Hwu, W.: Programming Massively Parallel Processors: A Hands On Approach, 1st edn. Morgan Kaufmann, MA (2010)
13. Klöckner, A., Pinto, N., Lee, Y., Catanzaro, B., Ivanov, P., Fasih, A.: PyCUDA: GPU Run-Time Code Generation for High-Performance Computing. Scientific Computing Group, Brown University, RI, USA (2009)
14. Nguyen, V., Kearney, D., Gioiosa, G.: A Region-Oriented Hardware Implementation for Membrane Computing Applications and Its Integration into Reconfig-P. In: Păun, G., Pérez-Jiménez, M.J., Riscos-Núñez, A., Rozenberg, G., Salomaa, A. (eds.) WMC 2009. LNCS, vol. 5957, pp. 385–409. Springer, Heidelberg (2010)
15. NVIDIA corporation: NVIDIA CUDA C programming guide, version 3.0. NVIDIA, CA, USA (2010)
16. Păun, G., Ciobanu, G., Pérez-Jiménez, M. (eds.): Applications of Membrane Computing. Natural Computing Series. Springer, Heidelberg (2006)
17. Stallings, W.: Operating systems: internals and design principles, 6th edn. Pearson/Prentice Hall, NJ, USA (2009)
18. Zeng, X., Adorna, H., Martínez-del-Amor, M.A., Pan, L., Pérez-Jiménez, M.: Matrix Representation of Spiking Neural P Systems. In: Gheorghe, M., Hinze, T., Păun, G., Rozenberg, G., Salomaa, A. (eds.) CMC 2010. LNCS, vol. 6501, pp. 377–391. Springer, Heidelberg (2010)
SpotMPI: A Framework for Auction-Based HPC Computing Using Amazon Spot Instances

Moussa Taifi, Justin Y. Shi, and Abdallah Khreishah

Temple University, Computer Science Department, Philadelphia, PA, USA
{moussa.taifi,shi,akhreish}@temple.edu
Abstract. Economies of scale offer cloud computing virtually unlimited, cost-effective processing potential. Theoretically, prices under fair market conditions should reflect the most reasonable costs of computations. The fairness is ensured by the mutual agreements between the sellers and the buyers. Resource use efficiency is automatically optimized in the process. While there is no lack of incentives for the cloud provider to offer an auction-based computing platform, using these volatile platforms for practical computing is a challenge for existing programming paradigms. This paper reports a methodology and a toolkit designed to tame these challenges for MPI applications. Unlike existing MPI fault tolerance tools, we emphasize dynamically adjusted optimal checkpoint-restart (CPR) intervals. We introduce a formal model, then an HPC application toolkit, named SpotMPI, to facilitate the practical execution of real MPI applications on volatile auction-based cloud platforms. Our models capture the intrinsic dependencies between critical time-consuming elements by leveraging instrumented performance parameters and publicly available resource bidding histories. We study algorithms with different computing vs. communication complexities. Our results show non-trivial insights into the optimal bidding and application scaling strategies.
1 Introduction
Economies of scale afford cloud computing extreme cost-effectiveness potential. While it is in general difficult to assess the real cost of a computation task, the auction-based provisioning scheme offers a fair pricing structure. Theoretically, prices under fair market conditions reflect the most reasonable costs of computations. The fairness is ensured by the mutual agreements between the sellers and the buyers. From the consumer's perspective, high performance computing (HPC) applications are the biggest potential beneficiaries since their infrastructure costs are the most expensive. From the seller's perspective, HPC applications represent the most reliable income stream since they are the most resource-intensive users. Theoretically, resource usage efficiency is automatically maximized under auction-based provisioning schemes.
Traditional HPC applications are typically optimized for hardware features to obtain processing efficiency. Since transient component errors can halt the entire application, it has become increasingly important to create autonomic applications that can automate checkpointing and restarting with little loss of useful work. Although existing HPC applications are not suitable for volatile computing environments, with an automated checkpoint-restart (CPR) HPC toolkit it is plausible that practical HPC applications could gain additional cost advantages using auction-based resources by dynamically minimizing CPR overheads. We first establish models for estimating running times of an HPC application using auction-based cloud resources. The proposed models take into account the time complexities of the HPC application, the overheads of checkpoint-restart, and the publicly available resource bidding history. We seek to unravel the interdependencies between the applications' computing/communication complexities, the number of required processors, bidding prices and the eventual processing costs. We then introduce the SpotMPI toolkit and show how it automates MPI application processing using volatile resources under the guidance of the formal models. Applying our models to recent bidding histories of Amazon EC2 HPC resources, we report preliminary results for two HPC application types with different computing and communication complexities.
2 Background

2.1 Auction-Based Computing: Spot Instances
Amazon is one of the first cloud computing vendors to provide at least two types of cloud instances: on-demand instances and spot instances. An on-demand instance has a fixed price. Once ordered, it provides services according to Amazon's Service Level Agreement (SLA). A spot instance is a type of resource whose availability is controlled by the current bidding price and the auction market. There are three special features of Amazon's spot instance pricing policy:
– A successful bid does not guarantee exclusive resource access for the entire requested duration. The Amazon engine can terminate access at any time if a higher bid is received.
– Amazon does not charge a partial hour (job terminated before reaching the hour boundary) if the termination was caused by out-bidding. Otherwise, the partial hour is charged in full if the user terminates the job.
– Amazon will only charge the user the highest market price that is less than the user's successful bid.
We have chosen two types of Amazon EC2 HPC resources for this study. The cc1.4xlarge and the cg1.4xlarge are cluster HPC instances that provide cluster-level performance (23 GB of memory, 8 cores, 10 Gigabit Ethernet). The main difference is the presence of GPUs (2 x NVIDIA Tesla "Fermi" M2050) in the cg1.4xlarge, which provide more power for compute-intensive applications.
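The three pricing rules above can be made concrete with a small sketch. The following Python function is purely illustrative (the function name and inputs are our own, not part of any AWS API); it computes the charge for a single spot-instance session under the stated rules, given the market price that was in effect for each hour of the session:

```python
def session_charge(hourly_market_prices, terminated_by_outbid):
    """Charge for one spot-instance session.

    hourly_market_prices: market price (not the bid) in effect for each
    hour of the session, oldest first; the last entry is the final,
    possibly partial, hour.  Per the rules above, the user always pays
    the market price, which is at most the successful bid.
    """
    if not hourly_market_prices:
        return 0.0
    charge = sum(hourly_market_prices[:-1])   # completed hours are always billed
    if not terminated_by_outbid:
        # A user-terminated partial hour is charged in full; an
        # out-of-bid termination makes the last partial hour free.
        charge += hourly_market_prices[-1]
    return charge
```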
SpotMPI: Auction-Based HPC Computing
111
Figure 1 records a sample market price history for the cc1.4xlarge instance type from May 5 to May 11, 2011. This instance type shows typical user behavior for legacy HPC applications. The cg1.4xlarge instance type illustrates resources for HPC applications that can benefit from GPU processing. Since many legacy HPC applications are not suitable for GPU processing, the cg1.4xlarge history is less interesting.
Fig. 1. Market Prices of the cc1.4xlarge Instance in May 2011
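Price histories like the one in Fig. 1 can be retrieved programmatically. As a hedged sketch (assuming the boto 2.x EC2 interface and its get_spot_price_history call; the counting helper is our own), one could pull the history and count how often a hypothetical bid would have been out-bid:

```python
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')
history = conn.get_spot_price_history(
    start_time='2011-05-05T00:00:00Z',
    end_time='2011-05-11T23:59:59Z',
    instance_type='cc1.4xlarge',
    product_description='Linux/UNIX')

def count_outbid_events(history, bid):
    """Count upward crossings of the market price above `bid`.

    Each crossing is one out-of-bid failure for a user holding `bid`.
    """
    events, running = 0, False
    for point in sorted(history, key=lambda p: p.timestamp):
        if not running and point.price <= bid:
            running = True                          # instance (re)acquired
        elif running and point.price > bid:
            events, running = events + 1, False     # out-bid: instance lost
    return events
```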
2.2 HPC in the Cloud
Although HPC applications are the biggest potential beneficiaries of cloud computing, except for a few simple applications there are still many practical concerns.
– Most mathematical libraries rely on optimized numerical codes that exploit common hardware features for extreme efficiency. Some of these hardware features, such as the hardware cache, are not mapped in virtual machines. Consequently, HPC applications suffer additional performance drawbacks on top of the normal virtualization overhead.
– Many HPC applications have high inter-processor communication demands. Current virtualized networks have difficulty meeting these high demands.
– All existing HPC applications handle only two communication states: success and failure. While success is a reliable state, failure is not. Existing applications treat a timeout as identical to a failure. Consequently, any transient component failure can halt the entire application. Using volatile spot instances for these applications is a serious challenge.
Initially, low-end cloud services provided little guarantee on the deliverable performance for HPC applications. Recently, high-end cloud resources have been developed specifically for HPC applications. These improvements have demonstrated promising features ([28], [25] and [17]) and show the diminishing overhead of virtual machine monitors such as XEN [6]. Due to the severity of declining MTBFs, fault tolerance for MPI applications has also progressed. These developments inspired the design and development of SpotMPI.
2.3 Checkpoint-Restart (CPR) MPI Applications
Much research has been done to provide fault tolerance for MPI applications. FTMPI [12] uses interactive process fault tolerance. Starfish [3] supports a number
of CPR protocols, and LA-MPI [13] provides message-level failure tolerance. Egida [21] experimented with a message-logging grammar specification as a means for fault tolerance. Cocheck [24] extends the Condor [18] scheduler to provide a set of fault tolerance methods that can be used by MPI applications. We chose OpenMPI's coordinated CPR because of its cluster-wide checkpoint advantage, since more fine-grained strategies will not work in this highly volatile environment ([16], [20] and [30]). The challenges of using volatile spot instances are not much different from those of regular clusters with crash failures. In fact, for performance analysis, the spot instance out-of-bid failures can be modeled as random crash failures, since the Amazon cloud engine terminates out-bid instances without prior notice [2]. These out-of-bid failures force applications to adopt technologies to prevent excessive work loss due to frequent interruptions. Map-reduce applications [11] can easily adopt single-task CPR using spot instances [8]. A map-reduce application does not require inter-task communications; parallel processing can be controlled externally to the individual tasks. Therefore, spot instances can be used as "accelerators" via a simple job monitor that tracks and restarts dead jobs automatically [9]. Other noticeable efforts using simulation to study spot instances include [5] and [26]. By simulating the behavior of a single instance under different bids, these works outlined the inherent tradeoff between completion time and budget. In [5] a decision model is proposed, built around a simulator that can determine, under a set of conditions, the total expected time of a single application. Yet another study [26] discussed a set of checkpoint strategies that can maximize the use of spot instances while minimizing the costs. To the best of the authors' knowledge, there has been no direct evaluation of practical MPI applications on spot instances. The volatile auction-based computing platform challenges established HPC programming practices.
3 Evaluating MPI Applications Using Auction-Based Platform
For HPC applications using a large number of processors, the CPR overhead is the biggest cost factor. Without CPR optimization, MPI applications are unlikely to gain practical acceptance on volatile auction-based platforms. We report a theoretical model based on application resource time complexities [22] and optimal CPR models ([10], [27]). In addition, we describe a toolkit named SpotMPI that can support autonomic MPI applications using spot instance clusters. This toolkit can monitor spot instances and bidding prices, automate checkpointing at bidding-price (and history) adjusted optimal intervals, and automatically restart the application after out-of-bid failures.
4 Theoretical Model
Auction prices vary dynamically depending on the supply and demand in the Amazon marketplace. There are no guidelines from Amazon as to how the prices are set.
Table 1. Definition of Symbols and Variables for Modeling the Runtime

Symbol       Description
t0           Interval of application-wide checkpoint
α            Expected rate of out-of-bid failures
K0           Time needed to create and store a checkpoint
K1           Time needed to read and recover a checkpoint
K2           Average out-of-bid downtime
T            Estimated time needed to run the application with no checkpoints and no failures
E            Expected running time between checkpoints
ET           Expected total running time
noobbid_i    Total number of out-of-bid failures corresponding to bid_i over t_observed
t_observed   Total observed time
P            Number of processing units
W            Instrumented processor capacity in number of computational steps per second
u            Instrumented network capacity in bytes per second
N            Problem size
ns           Number of iterations
Tpar         Parallel processing time
Unlike other projects [29] that use autoregressive models to maximize the profit of a fictitious cloud provider, we focus on the intrinsic characteristics of the user's application and the bidding history. We are interested in the inherent dependencies between these characteristics and their impact on the optimal CPR interval – the largest cost factor for MPI applications running on a volatile platform.

4.1 Bid-Aware Optimal CPR Interval
We assume that the time between consecutive out-of-bid failures is exponentially distributed with rate α. This allows the out-of-bid failures to be modeled the same way as component failures, but at different rates. Thus we can extend the previous work on optimal CPR intervals for distributed memory applications. In this paper, we refer to the original CPR interval work by [27], which was extended by [10] and later adapted to MPI by [19]. We start our discussion using the same symbols. As in [19], we obtain the expected application running time with checkpoints and failures. Important assumptions are that an out-of-bid failure occurs at most once per checkpoint interval and that all failures are independent:

ET = \frac{T}{t_0} \left( K_0 + t_0 + \alpha \left( t_0 K_1 + \frac{t_0^2}{2} \right) \right)

This leads to the optimal CPR interval ([19], [27] and [10]):

t_0 = \sqrt{\frac{2 K_0}{\alpha}}   (1)

A crucial difference between stable clusters and spot-instance clusters is that an out-of-bid failure forces an application downtime that is absent for component
failures. This means that the restart (which takes time K1) cannot begin until the average downtime per out-of-bid failure, K2, has elapsed. The expected running time using spot instances becomes:

ET = \frac{T}{t_0} \left( K_0 + t_0 + \alpha \left( t_0 (K_1 + K_2) + \frac{t_0^2}{2} \right) \right)
K2 can be obtained from the price history and the current bid. Given the same inputs, we can also calculate the new rate α of out-of-bid failures as noobbid_i / t_observed. Thus the optimal bid-aware CPR interval can be calculated as:

t_0 = \sqrt{\frac{2 K_0}{noobbid_i / t_{observed}}}   (2)
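Equations (1) and (2) translate directly into code. The following sketch is our own illustration; the numeric inputs are made-up examples, not measured values:

```python
from math import sqrt

def alpha_from_history(n_outbid_events, t_observed):
    """Empirical out-of-bid failure rate: the denominator of Eq. (2)."""
    return n_outbid_events / float(t_observed)

def optimal_cpr_interval(k0, alpha):
    """Optimal checkpoint interval t0 = sqrt(2*K0/alpha), Eqs. (1)-(2)."""
    return sqrt(2.0 * k0 / alpha)

# Illustrative values: a 5-minute checkpoint cost (K0, in hours) and
# 12 out-of-bid events observed over a 168-hour price window.
alpha = alpha_from_history(12, 168.0)            # failures per hour
t0 = optimal_cpr_interval(5.0 / 60.0, alpha)     # checkpoint interval, hours
```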
5 SpotMPI Toolkit
SpotMPI is an HPC toolkit constructed using the OpenMPI coordinated CPR library [15], Starcluster [1], and the BLCR project [14]. The OpenMPI and BLCR libraries facilitate automatic CPR at optimal intervals. Starcluster facilitates the creation and management of HPC clusters using Amazon EC2 resources. The latest Starcluster also supports spot instances and allows Python plugins during the launch of the cluster. As shown in Figure 2, the SpotMPI toolkit integrates multiple complementary tools for auction-based HPC computing.

Fig. 2. SpotMPI environment (block diagram connecting the Starcluster orchestration service, HPC applications, the SpotMPI framework, AWS Amazon spot instances, and the OpenMPI checkpoint/restart service)
SpotMPI has four components: cluster monitor, CPR calculator, checkpoint executor, and restarter. These modules are initiated by the Starcluster script at spot instance cluster creation time. The cluster monitor pulls the status and
bidding prices of all instances continuously. The interactive bidding price and dynamic price history are used by the CPR calculator to generate the next optimal CPR interval. A composite timing model (next section) is responsible for estimating the total processing times. The checkpoint executor saves the state of the MPI application in the user's EBS volume at dynamically adjusted intervals. Any out-of-bid failure will cause the application to halt. Upon a winning bid, the application will automatically restart from the last checkpoint using the OpenMPI restart library.

Fig. 3. SpotMPI architecture design (components: checkpoint interval forecaster, checkpoint/restart service, monitoring service, cluster orchestration (e.g., Starcluster), Amazon EC2 on-demand instances (e.g., for monitoring), Amazon EC2 spot instances (e.g., cluster instances cc1.4xlarge and cg1.4xlarge), Amazon EBS storage, and the user EBS volume holding data and MPI executables)
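A minimal sketch of the monitor/executor interplay is shown below. It assumes OpenMPI's coordinated CPR driven by the ompi-checkpoint command (exact flags vary between OpenMPI versions) and a user-supplied callable that re-estimates the failure rate from fresh price history; all glue code and names here are ours, not part of the SpotMPI distribution:

```python
import subprocess
import time
from math import sqrt

K0_HOURS = 5.0 / 60.0   # assumed time to create and store one checkpoint

def checkpoint(mpirun_pid):
    # Coordinated cluster-wide checkpoint via OpenMPI/BLCR; the command
    # takes the PID of the mpirun process.
    subprocess.call(['ompi-checkpoint', str(mpirun_pid)])

def monitor_loop(mpirun_pid, current_failure_rate):
    """Checkpoint at bid-aware optimal intervals, re-estimated each round.

    current_failure_rate: callable returning alpha (failures/hour)
    computed from the latest price history and the current bid.
    """
    while True:
        t0 = sqrt(2.0 * K0_HOURS / current_failure_rate())
        time.sleep(t0 * 3600.0)     # wait one optimal interval
        checkpoint(mpirun_pid)
```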
6 Computational Results
Steady State Timing Model. To evaluate T, we need an estimate of the failure-free processing time. We use the steady state timing model [22] to determine the required running time based on major component usage complexities. Table 1 shows the symbols used in the timing models. The general problem of assessing the processing time of a parallel application is difficult: there are too many hard-to-quantify factors. However, a steady state timing model can capture the intrinsic dependencies between major time-consuming elements, such as computing, communication and input/output, by using instrumented capabilities like W and u. The idea is to eliminate the nonessential constant factors. Thus contrasting timing models can reveal non-trivial parallel processing insights [7]. In this paper, we chose to study two typical algorithm classes (Table 2) for spot instance computing. Timing models in general can be applied to all deterministic algorithm classes [22].

Evaluation of Bid-Aware Optimal CPR Interval. We validate the bid-aware CPR interval against non-optimal intervals. Figure 4 visualizes the behavior of the speedup under different CPR intervals.
Table 2. Algorithm Classes A1 and A2

Class  Compute and Communication Complexities  Timing Model              Sample Application
A1     (O(n²), O(n))                           Tpar = N²/(PW) + 16N/u    Molecular force simulation
A2     (O(n³), O(n²))                          Tpar = N³/(PW) + 16N²/u   Linear solvers
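The two timing models of Table 2 are simple enough to evaluate directly. A small helper (ours, for illustration) using the instrumented capacities W (algorithmic steps per second) and u (bytes per second), with values taken from Table 3 below:

```python
def t_par_a1(n, p, w, u):
    """Table 2, class A1: O(n^2) compute, O(n) communication (seconds)."""
    return n**2 / (p * w) + 16.0 * n / u

def t_par_a2(n, p, w, u):
    """Table 2, class A2: O(n^3) compute, O(n^2) communication (seconds)."""
    return n**3 / (p * w) + 16.0 * n**2 / u

# Example with measured capacities: W = 1.5e9 steps/s, u = 250 MB/s.
t1 = t_par_a1(n=1e5, p=400, w=1.5e9, u=250e6)   # one A1 iteration
t2 = t_par_a2(n=1e4, p=400, w=1.5e9, u=250e6)   # one A2 iteration
```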
Figure 4 shows the clear advantages of bid-aware optimal CPR intervals, which avoid longer completion times and higher total costs. We also notice that as the bid increases, the advantage of the optimal CPR interval decreases; at higher bids, frequent checkpointing is simply not needed as much.

Fig. 4. A1 Speedup Using 100 Spot Instances and Different CPR Intervals (total expected speedup vs. bid prices from $0.52 to $0.60, for 0.01x, 0.2x, 1x, 20x, and 50x the optimal C/R interval)
Bidding Price and Application Processing Time. We are also interested in understanding, given the price history, how a new bid would affect the total processing time. From this we can then derive a number of other important metrics, such as speedup, efficiency, total cost, speedup per dollar, and efficiency per dollar when deploying different numbers of processing units. In the following calculations, we assume:
– The application uses the bid-aware optimal CPR intervals.
– The HPC application is run at the optimal granularity (synchronization overhead is zero).
– The Amazon resources deliver the advertised capabilities.
We can then plug the steady state timing models of Table 2 directly into the expected running time equation of Sect. 4.1:

ET(A_1) = \frac{\left( \frac{N^2}{PW} + \frac{16N}{u} \right) n_s}{t_0} \left( K_0 + t_0 + \alpha \left( t_0 (K_1 + K_2) + \frac{t_0^2}{2} \right) \right)   (3)

ET(A_2) = \frac{\left( \frac{N^3}{PW} + \frac{16N^2}{u} \right) n_s}{t_0} \left( K_0 + t_0 + \alpha \left( t_0 (K_1 + K_2) + \frac{t_0^2}{2} \right) \right)   (4)
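Since both equations share the same checkpoint/failure envelope, they reduce to one function of the per-iteration time. A compact sketch (ours; all times in consistent units, e.g. seconds, with α in failures per second):

```python
def expected_runtime(t_par, n_iter, t0, k0, k1, k2, alpha):
    """Eqs. (3)/(4): failure-free work T = t_par * n_iter, stretched by
    checkpoint cost k0, recovery k1, out-of-bid downtime k2 and rework."""
    T = t_par * n_iter
    return (T / t0) * (k0 + t0 + alpha * (t0 * (k1 + k2) + t0**2 / 2.0))
```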
Table 3. Critical Parameters

Variable  Range
P         200 to 1,000 instances
W         1.5 × 10^9 measured algorithmic steps per second using cc1.4xlarge
u         Network speed: 250 MB per second (measured)
N         Problem size: 10^4 to 10^5
ns        Number of iterations: 10^3 to 10^6
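Sweeping a bid range and instance counts through these models reproduces the shape of the curves reported below. The sketch assumes user-supplied callables that map a bid to a failure rate and to an average downtime (both derived from price history, as in Sect. 4.1); the checkpoint costs are illustrative placeholders, not measured values:

```python
def sweep(bids, processor_counts, alpha_of_bid, k2_of_bid,
          n=1e4, n_iter=1e3, w=1.5e9, u=250e6, k0=120.0, k1=120.0):
    """Speedup, maximum cost and speedup/dollar for an A2 workload.

    All times in seconds; alpha_of_bid returns failures/second.
    k0/k1 (checkpoint save/restore) are assumed values, not measurements.
    """
    results = []
    for bid in bids:
        alpha, k2 = alpha_of_bid(bid), k2_of_bid(bid)
        t0 = (2.0 * k0 / alpha) ** 0.5
        for p in processor_counts:
            t_par = n**3 / (p * w) + 16.0 * n**2 / u
            et = (t_par * n_iter / t0) * (
                k0 + t0 + alpha * (t0 * (k1 + k2) + t0**2 / 2.0))
            speedup = (n**3 / w * n_iter) / et    # vs. one processing unit
            max_cost = bid * p * et / 3600.0      # bid bounds the hourly price
            results.append((bid, p, speedup, max_cost, speedup / max_cost))
    return results
```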
Equations 3 and 4 capture the intrinsic dependencies between critical factors, such as bidding price, price history, the number of spot instances (P) and overall processing time. To minimize errors, we conducted program instrumentation to get the ranges of W and u. Table 3 shows the value ranges in our calculations. We report the results in Figures 5 and 6.

Fig. 5. A1 using 200 to 1,000 spot instances with n = 100,000 to 1,000,000 iterations (panels: total expected speedup, efficiency, expected time in hours, expected maximum cost, speedup per dollar, and efficiency per dollar (×10⁻³), each vs. bid prices from $0.52 to $0.60 for #PU = 200, 400, 600, 800, 1000)
First, we observe that HPC applications can indeed gain practical feasibility using spot instances under optimized CPR intervals. As indicated by Amdahl's law [4], diminishing returns are also clearly visible when the number of spot instances increases for the same algorithms. For A1 (with linear communication complexity), speedup and efficiency drop significantly when the number of spot instances exceeds 200. For A2 (with O(n²) communication complexity), speedup and efficiency drop much earlier. We also notice that for A2, the bidding prices have a much bigger impact on speedup than for A1. The added dimension of bidding price reveals the cost effectiveness of different configurations. Although higher bids can deliver better performance, the cost effectiveness actually decreases (see the Speedup Per Dollar charts). Therefore, the
users should use these figures to optimize for budget, processing deadline, or anything in between. The non-trivial insight is the high price sensitivity of algorithms with high communication complexities. The cost effectiveness is also difficult to visualize without the proposed tools. These results provide the basis for selecting the best number of processors (spot instances) and the most promising bidding price for a given objective.

Fig. 6. A2 Using 200 to 1,000 Spot Instances For N=10,000 and 1,000 Iterations (panels: total expected speedup, efficiency, expected time in hours, expected maximum cost, speedup per dollar, and efficiency per dollar (×10⁻³), each vs. bid prices from $0.52 to $0.60 for #PU = 200, 400, 600, 800, 1000)
7 Conclusion
Finding the optimal bidding strategy for an arbitrary application is a difficult problem. For specific applications, our proposed approach gives reasonable predictions that can guide the choice of a promising bidding strategy based on the intrinsic dependencies of critical factors. The timing model, together with the bid-aware CPR model, provides an effective tool to determine the optimal bid as well as the optimal number of processing units needed for completing a specific application. This research paves the way for more specialized pricing models for cloud providers by giving more insight into the return on investment. For example, since the speedup gain slows down once the number of processors reaches a certain level, it makes sense to give lower prices as "volume discounts" that are sensitive to the communication complexities. The new pricing models may change users' behavior, which in turn would also affect the providers, eventually reaching an equilibrium. Meanwhile, resource utilization is maximized. Other innovative ideas are also possible. For example, self-healing applications [23] could enjoy much better cost advantages by setting bidding ranges to organize defensive rings that protect the user's core interests while maintaining the lowest cost structures.
Spot instances give the provider much freedom in dispatching resources to meet dynamic user needs. This freedom allows for the best computational efficiency and fair revenue/cost generation. It also challenges the HPC community to develop highly efficient and more flexible programming means that can automatically exploit cheaper resources on the fly.

Acknowledgment. The authors would like to thank Professor Slobodan Vucetic and his Ph.D. student Vladimir Coric for the initial discussions of Amazon bidding histories. This research is supported in part by the National Science Foundation grant CNS 0958854 and educational resource grants from Amazon.com.
References 1. Starcluster (2010), http://web.mit.edu/stardev/cluster/ 2. Amazon hpc cluster instances (2011), http://aws.amazon.com/ec2/hpc-applications/ 3. Agbaria, A.M., Friedman, R.: Starfish: fault-tolerant dynamic mpi programs on clusters of workstations. In: Proceedings of the Eighth International Symposium on High Performance Distributed Computing, 1999, pp. 167–176 (1999) 4. Amdahl, G.M.: Validity of the single processor approach to achieving large scale computing capabilities. In: Proceedings of the Spring Joint Computer Conference, April 18-20, pp. 483–485. ACM, New York (1967) 5. Andrzejak, A., Kondo, D., Yi, S.: Decision model for cloud computing under sla constraints. In: Proc. IEEE Int Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS) Symp., pp. 257–266 (2010) 6. Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I., Warfield, A.: Xen and the art of virtualization. In: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pp. 164–177. ACM, New York (2003) 7. Blathras, K., Szyld, D.B., Shi, Y.: Timing models and local stopping criteria for asynchronous iterative algorithms. Journal of Parallel and Distributed Computing 58(3), 446–465 (1999) 8. Borthakur, D.: The hadoop distributed file system: Architecture and design (2007), http://developer.yahoo.com/hadoop/tutorial/ 9. Chohan, N., Castillo, C., Spreitzer, M., Steinder, M., Tantawi, A., Krintz, C.: See spot run: using spot instances for mapreduce workflows. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, p. 7. USENIX Association (2010) 10. Daly, J.T.: A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems 22(3), 303–312 (2006) 11. Dean, J., Ghemawat, S.: Mapreduce: Simplified data processing on large clusters. Communications of the ACM 51(1), 107–113 (2008) 12. Fagg, G., Dongarra, J.: Ft-mpi: Fault tolerant mpi, supporting dynamic applications in a dynamic world. In: Dongarra, J., Kacsuk, P., Podhorszki, N. (eds.) PVM/MPI 2000. LNCS, vol. 1908, pp. 346–353. Springer, Heidelberg (2000) 13. Graham, R.L., Choi, S.E., Daniel, D.J., Desai, N.N., Minnich, R.G., Rasmussen, C.E., Risinger, L.D., Sukalski, M.W.: A network-failure-tolerant message-passing system for terascale clusters. International Journal of Parallel Programming 31(4), 285–303 (2003)
14. Hargrove, P.H., Duell, J.C.: Berkeley lab checkpoint/restart (blcr) for linux clusters. In: Journal of Physics: Conference Series, vol. 46, p. 494. IOP Publishing (2006) 15. Hursey, J.: Coordinated Checkpoint/Restart Process Fault Tolerance for MPI Applications on HPC Systems. PhD thesis, Indiana University, Bloomington, IN, USA (July 2010) 16. Hursey, J., Squyres, J.M., Mattox, T.I., Lumsdaine, A.: The design and implementation of checkpoint/restart process fault tolerance for open mpi. In: Proc. IEEE Int. Parallel and Distributed Processing Symp. IPDPS 2007, pp. 1–8 (2007) 17. Iosup, A., Ostermann, S., Yigitbasi, N., Prodan, R., Fahringer, T., Epema, D.: Performance analysis of cloud computing services for many-tasks scientific computing. IEEE Transactions on Parallel and Distributed Systems 22(6), 931–945 (2011) 18. Litzkow, M., Tannenbaum, T., Basney, J., Livny, M.: Checkpoint and migration of unix processes in the condor distributed processing system. Technical report (1997) 19. Lusk, E.: Fault tolerance in mpi programs. Special issue of the Journal High Performance Computing Applications, IJHPCA (2002) 20. Moody, A., Bronevetsky, G., Mohror, K., de Supinski, B.R.: Design, modeling, and evaluation of a scalable multi-level checkpointing system. In: Proc. Int. High Performance Computing, Networking, Storage and Analysis (SC) Conf., pp. 1–11 (2010) 21. Rao, S., Alvisi, L., Vin, H.M.: Egida: An extensible toolkit for low-overhead fault-tolerance. In: Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, 1999. Digest of Papers, pp. 48–55. IEEE, Los Alamitos (1999) 22. Shi, J.Y.: Program scalability analysis. In: International Conference on Distributed and Parallel Processing. Georgetown University, Washington D.C. (1997) 23. Shi, J.Y., Taifi, M., Khreishah, A., Wu, J.: Sustainable gpu computing at scale. In: 14th IEEE International Conference in Computational Science and Engineering 2011 (2011) 24. Stellner, G.: Cocheck: Checkpointing and process migration for mpi. In: Proceedings of the 10th International Parallel Processing Symposium, IPPS 1996, pp. 526–531. IEEE Computer Society, Washington, DC, USA (1996) 25. Vecchiola, C., Pandey, S., Buyya, R.: High-performance cloud computing: A view of scientific applications. In: Proc. 10th Int. Pervasive Systems, Algorithms, and Networks (ISPAN) Symp., pp. 4–16 (2009) 26. Yi, S., Kondo, D., Andrzejak, A.: Reducing costs of spot instances via checkpointing in the amazon elastic compute cloud. In: 2010 IEEE 3rd International Conference on Cloud Computing, pp. 236–243. IEEE, Los Alamitos (2010) 27. Young, J.W.: A first order approximation to the optimum checkpoint interval. Communications of the ACM 17(9), 530–531 (1974) 28. Youseff, L., Wolski, R., Gorda, B., Krintz, C.: Evaluating the performance impact of xen on mpi and process execution for hpc systems. In: Proceedings of the 2nd International Workshop on Virtualization Technology in Distributed Computing, p. 1. IEEE Computer Society, Los Alamitos (2006) 29. Zhang, Q., Gürses, E., Boutaba, R., Xiao, J.: Dynamic resource allocation for spot markets in clouds. In: Proceedings of the 11th USENIX Conference on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services (2011) 30. Zheng, G., Shi, L., Kalé, L.V.: Ftc-charm++: An in-memory checkpoint-based fault tolerant runtime for charm++ and mpi. In: 2004 IEEE International Conference on Cluster Computing, pp. 93–103. IEEE, Los Alamitos (2004)
Investigating the Scalability of OpenFOAM for the Solution of Transport Equations and Large Eddy Simulations

Orlando Rivera (1), Karl Fürlinger (2), and Dieter Kranzlmüller (1,2)

(1) Leibniz Supercomputing Centre (LRZ), Munich, Germany
{Orlando.Rivera,Dieter.Kranzlmueller}@lrz.de
(2) MNM-Team, Ludwig-Maximilians-Universität (LMU), Munich, Germany
[email protected]
Abstract. OpenFOAM is a mainstream open-source framework for flexible simulation in several areas of CFD and engineering whose syntax is a high-level representation of the mathematical notation of physical models. We use the backward-facing step geometry with Large Eddy Simulations (LES) and semi-implicit methods to investigate the scalability and important MPI characteristics of OpenFOAM. We find that the master-slave strategy introduces an unexpected bottleneck in the communication of scalar values when more than a hundred MPI tasks are employed. An extensive analysis reveals that this anomaly is present only in a few MPI tasks but results in a severe overall performance reduction. The analysis work in this paper is performed with the tool IPM, a portable profiling and workload characterization tool for MPI programs.
1 Introduction

OpenFOAM (Open Field Operation and Manipulation) is an extensive framework for the solution of Partial Differential Equations (PDEs) using the Finite Volume Method (FVM). It is one of the most popular open source tools used in continuum mechanics and Computational Fluid Dynamics (CFD). Written in C++, it makes use of advanced features in OOP (Object Oriented Programming) and modern programming techniques to mimic the mathematical notation of tensor algebra and PDE solutions [8]. OpenFOAM is not a monolithic program. Instead, it consists of many libraries grouped by functionality, on top of which solvers are built. A solver is a problem-specific glue-like program, which is linked with appropriate libraries for a specific problem. Some libraries are common to all solvers: basic mesh manipulation, parallelization, the finite volume method, etc. Other libraries are used only if needed, e.g., for turbulence models, for compressible or incompressible flows, for dynamic mesh handling, and so on. The parallelization of OpenFOAM is performed using MPI (Message Passing Interface). Applications and solvers in OpenFOAM are the same for serial or parallel execution; a master-slave model is used for parallel runs. Non-blocking and blocking send/receive functions and reduction operations are at the core of each solver.
In this paper we analyze the scalability and performance characteristics of OpenFOAM for a Large Eddy Simulation test case. We perform a strong scaling study up to 256 MPI tasks and a weak scaling study up to 512 MPI tasks. We use performance tools, in particular the Integrated Performance Monitor (IPM), in order to gain an understanding of the main performance-influencing factors, and we identify imbalanced time in point-to-point operations as a cause of limited scalability. The rest of this paper is organized as follows: in Sect. 2 we give an overview of related performance studies that have been conducted using OpenFOAM. In Sect. 3 we describe our experimental setup, and in Sect. 4 we describe the results of our performance study. We conclude and discuss directions for future work in Sect. 5.
2 Related Work

The parallel behavior of OpenFOAM is not very well understood when executed on massively parallel systems. Scalability and efficiency of OpenFOAM is an area of open debate. It has been reported that OpenFOAM scales well for small core counts, up to 64 MPI tasks, using the lid-driven cavity test by Calegari et al. [12] and in a study for the HPC Advisory Council [9]. Pringle reported his efforts to port and test OpenFOAM on a Cray XT5 system (the HECToR system at EPCC) with the laminar lid-driven cavity test as well [4]. He successfully ran up to 2048 MPI tasks. Results on scalability were measured using meshes with 100³ and 200³ cells (one and eight million, respectively). In this setup OpenFOAM experiences a scalability limit at around 512 MPI tasks. The CSC - IT Center for Science in Finland has also reported benchmarking results running the lid-driven cavity test with up to 22 million cells on two systems, on one of them reaching superlinear scalability with 1024 cores [1].
3 Test Cases and Experimental Setup

An important type of simulation in turbulent regimes is the Large Eddy Simulation (LES), which is becoming very popular because it resolves the most relevant turbulent features within the fluid with a high degree of accuracy. LES is based on the concept that smaller-scale turbulence is isotropic, while larger turbulent and energetic eddies are the result of the current geometrical configuration. Larger scales are simulated, while small scales are filtered out and their effects are modeled; this process is known as the closure problem [10]. The building blocks of many LES solvers are the transport equations, which are the main mechanism for solving many problems in Fluid Dynamics. For a first set of experiments we solved a simplified form of the transport equation for a scalar quantity:

\frac{d\theta}{dt} + a \nabla \theta = 0   (1)
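To make Eq. (1) concrete, the sketch below solves its one-dimensional form with a first-order upwind finite-volume discretization and periodic boundaries. This is our own minimal illustration of the equation class, not OpenFOAM code:

```python
import numpy as np

def advect_upwind(theta, a, dx, dt, steps):
    """March d(theta)/dt + a d(theta)/dx = 0 with first-order upwind
    differencing and periodic boundaries; assumes a > 0."""
    c = a * dt / dx            # CFL number; the scheme is stable for c <= 1
    assert 0.0 < c <= 1.0
    for _ in range(steps):
        theta = theta - c * (theta - np.roll(theta, 1))
    return theta

# Advect a Gaussian pulse across a periodic unit interval.
x = np.linspace(0.0, 1.0, 200, endpoint=False)
theta0 = np.exp(-((x - 0.3) ** 2) / 0.002)
theta = advect_upwind(theta0, a=1.0, dx=1.0 / 200, dt=0.4 / 200, steps=100)
```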
For a second set we performed a full large eddy simulation (LES). The chosen geometry for our tests was the backward-facing step. This is a sufficiently simple test for the numerical and physical properties and there are enough experimental and numerical results for proper validation [7]. Most importantly, it allows us to discover some features and characteristics of the underlying methodology such as
those regarding the numerical stability, domain decomposition strategies, MPI execution characteristics, and the performance of iterative solvers and their scalability. The test case is described in detail in [6] and depicted in Fig. 1. The backward-facing step has a Reynolds number Re of 4800 with respect to the step height. The dimensionless inlet bulk velocity is 4.8, normed with respect to the geometrical units. The domain is 25 times larger than the step height in the x direction and three times the step height in the z direction.
Fig. 1. The backward facing step geometry. The step size is H, the total length of the domain is 25H, height and depth are 3H. Periodic boundary conditions on the near and far walls are used.
The solver used in our LES experiments is called pisoFoam for LES, OpenFOAM version 1.7.x, and it uses the Pressure Implicit solution by Split Operator method (PISO) [5]. The turbulence modeling is the k-equation eddy-viscosity model with the cube root of the cell volume as the LES delta. At the sides of the domain periodic boundary conditions are used, while top and bottom were set as non-slip walls. Finally, inlet boundary conditions with an artificial noise of 2% of the velocity value and an outlet of pressure-driven type were specified. Pressure equations were solved using the geometric algebraic multigrid (GAMG) and the Bi-Conjugate Gradient (BiCG) iterative solvers with 3 pressure corrections. Other fields were solved with the BiCG solver. The tests were run over 100 time steps with output at every 50 time steps. In all cases a CFL condition number less than 1.0 was specified. Three meshes with different resolutions were used. The first mesh has 250, 96, and 64 cells in the x, y and z direction, respectively. It contains 2.15 million hexahedral cells with an expansion ratio of 2 in the perpendicular direction, resulting in better resolution near the top and bottom walls; cells close to the wall are half the size of cells along the center line. Also, to capture flow features better, cells were stretched in the x direction close to the step by 50%. The maximal aspect ratio is 5.62, resulting in a mesh of good quality. Runs on 16, 32, 64, 128, and 256 cores were performed on this mesh for a strong scaling study. The second and third meshes contain 4 and 8 times more cells than the first mesh, i.e., 8.62 and 17.20 million cells, respectively. 256 MPI tasks were used on the 8.62 million cell mesh, and 512 MPI tasks on the finer mesh. The number of cells in the x, y, and z directions, as well as the expansion ratios, were adjusted to conform to meshes of characteristics equivalent to the first mesh, resulting in maximum aspect ratios of 3.90 for the 8.62 million cell mesh and 4.01 for the 17.20 million cell mesh. The second and third meshes were used in the weak scaling studies.
To understand the scalability characteristics of OpenFOAM, we employed IPM (Integrated Performance Monitor)¹, which is a portable profiling and workload characterization tool for MPI applications. IPM drastically reduces the overhead caused by application instrumentation while displaying very detailed information at the same time [2]. Studying an application with IPM reveals important parameters beyond mere studies of total wallclock scalability. It identifies bottlenecks, detects hot spots and collects statistics that help to optimize representative sections of the code. Despite some challenges posed by the advanced software engineering techniques used by OpenFOAM, IPM was able to instrument OpenFOAM and derive useful data. All our experiments were conducted on a massively parallel general-purpose computer, an SGI Altix 4700, with 9728 Intel Itanium 2 cores, a peak performance of 62.3 TFlops and 19 partitions connected by a high-performance NUMAlink interconnect [11], installed and operated by the Leibniz Supercomputing Centre (LRZ). All runs, when possible, were constrained to a single shared-memory partition. In case a partition was not large enough, two partitions with the same number of cores were specified.
4 Experimental Results

In this section we describe the results of our experiments. We start with the strong scaling study from 16 to 256 tasks (512 tasks for the transport equation), followed by the weak scaling study up to 512 tasks.

4.1 Strong Scaling Study

The solution of the simplified transport equation is obtained by means of the preconditioned conjugate gradient with a diagonal incomplete Cholesky decomposition. We used 16, 32, 64, 128, 256, and 512 MPI tasks for the strong scaling test on the 2.15 million cell mesh. Fig. 2 (left) shows the scalability results of this experiment. Evidently, this setup scales up to 128 tasks, after which a slowdown of the execution occurs. Looking at the contribution from various MPI calls, a large contribution of the MPI Recv routine becomes evident. Fig. 2 (right) shows the time spent in various MPI routines for each rank in the 128-task run. Evidently MPI Recv is the most dominant contributor, and the time in MPI Recv is highly variable across ranks. MPI Recv is used by pseudo-reduction operations in OpenFOAM, and our results suggest that paying more attention to how synchronization values are distributed by OpenFOAM's internal solvers is warranted. In the rest of this section we focus on the LES test case. For our strong scaling LES study we used 16, 32, 64, 128, and 256 MPI tasks on the 2.15 million cell mesh. All runs had the same setup, and the domain decomposition was done with the Metis partitioner in order to have approximately the same number of cells per sub-domain and to minimize the maximum connectivity of the sub-domains [3]. The resulting sub-domains are well-balanced and vary at most 3% in their number of cells. Table 1 shows the average number of cells per task for each run.
¹ http://www.ipm2.org
Fig. 2. Scaling study (left), and distribution of time in various MPI routines (right) for the transport equations test case (left panel: wall time in seconds vs. 16 to 512 MPI tasks; right panel: per-rank MPI time in seconds over ranks 0–127, split into MPI_Recv, MPI_Waitall, MPI_Allreduce, and total MPI)
In Table 1 we also summarize the wallclock execution time from 16 to 256 MPI tasks as measured by IPM. This table also shows the percentage of execution time spent in MPI and I/O routines. For both MPI and I/O the minimum, maximum, and average percentages over all MPI tasks are listed. Note that minimum and maximum values for MPI and I/O routines do not generally refer to the same MPI rank.

Table 1. Wallclock execution time, minimum, maximum, and average fraction of time spent in MPI and I/O routines, respectively, for 16, 32, 64, 128, and 256 MPI tasks

MPI Tasks  #cells (avg)  Wall Time (sec)  MPI % (min/max/avg)    I/O % (min/max/avg)
16         134,400       574.23           6.40 / 14.43 / 9.96    0.08 / 0.51 / 0.13
32         67,200        303.82           14.24 / 26.02 / 14.24  0.11 / 0.74 / 0.14
64         33,600        197.47           25.98 / 40.71 / 31.91  0.23 / 0.31 / 0.71
128        16,800        213.07           27.81 / 65.07 / 33.42  0.37 / 0.91 / 0.55
256        8,400         270.00           25.65 / 69.57 / 31.70  0.63 / 1.20 / 0.90
From Table 1 we see that the pisoFoam solver has acceptable scalability up to 64 MPI tasks. With 128 and 256 MPI tasks the execution actually slows down. This could be due to several factors, such as a data set that is too small, a suboptimal use of MPI communication facilities, or a slow interconnection network. If we plot the minimum, maximum and average percentages of the MPI wall time, we see that for 16, 32, and 64 MPI tasks the average lies in the middle between the maximum and minimum. However, for 128 and 256 tasks the average value follows the minimum, while the maximum is much larger (Fig. 3, left). This is a clear indication that one or a few tasks experience significant overheads. IPM does not only measure the overall time spent in MPI; it also gives statistics for individual MPI function calls and data transfer sizes. A comparison of the number of calls (Fig. 3, right) with the aggregate time spent in each function (not shown in Fig. 3) reveals that there is no correlation between these numbers. OpenFOAM's main communication mechanism for interchanging field values among sub-domains is by means of MPI Isend/MPI Irecv/MPI Waitall.
Fig. 3. Time in MPI as a fraction of wallclock time, plotting minimum, maximum, and average among ranks (left). Number of calls by MPI routine (log scale, right).
As shown in Fig. 3 (right), the number of calls of MPI Irecv/MPI Isend is one or two orders of magnitude larger compared to any other MPI function. The number of MPI Waitall calls remains almost constant, or even decreases slightly with the number of tasks. Calls to MPI Allreduce increase, but not at the same rate as sends and receives. From Fig. 4 (left) we see a reduction in the relative time used by MPI Isend/MPI Irecv and MPI Waitall when 256 MPI tasks are used, compared to the 64 MPI task run, although the number of calls increases (Fig. 3, right). Smaller communication areas, local communication patterns and the low-latency interconnect explain this behavior. In Fig. 4 (right) we can see that Metis has decomposed the domain with some sub-domains divided into more than two parts. Hence, increasing the total number of neighboring sub-domains, and, subsequently, the calls of MPI Isend/MPI Irecv, can result in load imbalance. As shown in Fig. 3 (right) and Fig. 4 (left), the time in MPI Allreduce is comparatively high considering that only a single scalar value is aggregated. How many times this function is called depends on the type of iterative solver and the number of
Fig. 4. Time spent in various MPI routines as the core count increases (left). Example for the graph-based Metis domain decomposition (right). Rank 0 is shown in dark gray, rank 5 in light gray.
iterations this solver requires to converge. For example, the 16 MPI task run, using the BiCG solver for the pressure, results in 5179 calls for the first time step. By specifying the GAMG solver, the number of calls is 7839. There is, however, no advantage in using BiCG over GAMG, since the number of iterations needed by BiCG is several hundred (500–700) compared to the 20–40 used by the GAMG solver, and the solution is three to ten times faster using the GAMG solver. Hence, trying to reduce the number of MPI Allreduce calls by changing the pressure-correction solver would be a mistake. The same can also be said for MPI Recv, which is not used to interchange values at the inter-domain boundary, but to transport single scalar values. Both MPI Allreduce and MPI Recv are called very infrequently but constitute an important fraction of the MPI time; for 256 MPI tasks more than 9% of the time is spent in these routines. MPI Recv is a limiting factor not only because its contribution grows with an increasing number of tasks, but also because a cluster of a few MPI tasks is responsible for this increment. If we recall Fig. 3 (left), for 128 and 256 MPI tasks the maximum value of the MPI time (as a percentage of the wall time) was far away from the average. This is because a few ranks were slowing down the whole execution. Since there is a synchronization at the end of each iteration, equation solution and time step, it is enough for one MPI rank, for whatever reason, to slow down to have a very negative impact on the whole application. Fig. 5 helps to understand this behavior. Here the absolute times for MPI Recv, MPI Allreduce and the sum of all MPI routines (ordered by rank for 128 and 256 MPI tasks) are plotted. MPI Allreduce uniformly constitutes about 25% and 32% of the MPI time for 128 and 256 MPI tasks.
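The communication pattern described above is easy to reproduce in miniature. The following mpi4py sketch (ours, illustrative only; OpenFOAM's actual implementation is C++) shows one iteration's boundary exchange via Isend/Irecv/Waitall followed by a single-scalar Allreduce of the kind used for convergence tests:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

field = np.random.rand(10000)                    # local portion of a field
left, right = (rank - 1) % size, (rank + 1) % size
recv_l, recv_r = np.empty(1), np.empty(1)

# Non-blocking boundary exchange, completed with a single Waitall.
reqs = [comm.Isend(field[:1],  dest=left),
        comm.Isend(field[-1:], dest=right),
        comm.Irecv(recv_l, source=left),
        comm.Irecv(recv_r, source=right)]
MPI.Request.Waitall(reqs)

# One scalar reduction per iteration (e.g., a residual norm).
local_res = np.array([np.abs(field).sum()])
global_res = np.empty(1)
comm.Allreduce(local_res, global_res, op=MPI.SUM)
```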
Fig. 5. Total MPI time, time in MPI Recv and MPI Allreduce per rank. 128 MPI tasks (top) and 256 MPI tasks (bottom).
MPI Recv behaves differently. In the 256-task case, for example, MPI ranks 0 to 7 have MPI overheads that double the average, primarily caused by the peak in MPI Recv time. For these ranks (0–7), between 61% and 69.57% of the wall time (270 sec) is spent solely in MPI routines. For 128 tasks the situation is similar: ranks 12 to 15 present the MPI Recv peak (4 ranks), while the rest have a smoother distribution. As mentioned before, MPI Recv/MPI Send pairs are mainly used to communicate scalar values, and as such interconnect bandwidth is not an issue here. MPI Recv is also not called more frequently than other functions; in fact, compared to MPI Isend/MPI Irecv, these calls are too few to argue that latency is a main factor. Whether 128 MPI tasks is an important limit, and under which conditions, is a question that cannot be answered at this point without more sophisticated tracing tools and a deeper understanding of how and when MPI Recv is used by OpenFOAM.

4.2 Weak Scaling

Several questions remained unanswered in the previous section, mainly because the mesh with 2 million cells was not large enough for runs with more than 256 MPI tasks. Two meshes were prepared, one of 8.62 million cells to be run with 256 cores and a second of 17.20 million cells to be used with 512 cores.

Table 2. Wall time, ranks with the minimum and maximum MPI and I/O time and average (percentage) for 256 and 512 tasks. Both the GAMG and BiCG solver were used for comparison for the 512 runs.

MPI Tasks  #cells (avg)  Solver  Wall Time (sec)  MPI % (min/max/avg)    I/O % (min/max/avg)
256        33,688        GAMG    486.64           31.2 / 54.39 / 36.77   0.39 / 1.29 / 0.74
512        33,600        GAMG    600.95           41.56 / 68.77 / 47.29  0.32 / 0.56 / 0.88
512        33,600        BiCG    1893.24          20.01 / 27.16 / 23.04  0.11 / 0.18 / 0.28
The domain decomposition was done with Metis using the same parameters as described in the strong scaling section, in order to have approximately 33,000 cells per sub-domain. This number proved to be sufficiently large for a reasonable speedup with 64 MPI tasks on the smaller mesh. In Table 2 we list the percentage of time in MPI and I/O for the weak scaling study. Both the GAMG and BiCG solver were used for the 512 runs for comparison. For increased scalability one can specify BiCG instead of GAMG, thus reducing the MPI overhead. However, the GAMG solver, with all its MPI issues, is 3 times faster than the BiCG solver (performance, rather than scalability alone, is the proper value to be measured). This analysis gave us an indication that one should isolate and pay more attention to the GAMG solver and its specific MPI functionality. MPI Recv and MPI Allreduce are important overheads for GAMG (with 256 and 512 MPI tasks). For the BiCG solver, MPI Allreduce represents an important overhead (11%, which is half of the MPI time). This result agrees with the ones reported in [9], [12], and [4].
In previous work [13] we have shown that MPI Recv time is concentrated in a few ranks, which slow down the simulation. However, in order to generalize these findings it will be necessary to run tests with more complicated geometries and domains, higher counts of cells per rank with larger domains, and new strategies for domain decomposition.
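Detecting such stragglers in a per-rank profile is mechanical. A small helper (our own illustration over per-rank data of the kind IPM reports) flags ranks whose MPI time is far above the mean:

```python
def flag_stragglers(mpi_time_per_rank, threshold=2.0):
    """Return ranks whose total MPI time exceeds threshold * mean,
    e.g. ranks 0-7 in the 256-task run of Fig. 5."""
    mean = sum(mpi_time_per_rank) / len(mpi_time_per_rank)
    return [r for r, t in enumerate(mpi_time_per_rank)
            if t > threshold * mean]
```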
5 Summary and Outlook

Understanding the parallel behavior of OpenFOAM is a complex task due to the internal structure and flexibility that OpenFOAM offers. However, some patterns and specific characteristics have been discovered. We have seen from the results presented which configurations are more suitable for scaling and under which circumstances bottlenecks may occur. One of the choices for this problem and type of mesh is to use the GAMG solver. This solver converges rapidly to a solution, faster than BiCG, but its MPI footprint increases and becomes more problematic with a larger number of MPI tasks. We have learned that the scalability of the GAMG solver is limited up to a certain point (in our study, 64 MPI tasks). An inflexion point can be seen, however, where the good scalability properties of BiCG could be exploited on massively parallel systems. Moreover, scalability alone is not the relevant parameter; overall performance is. At this point it is difficult to generalize; many more data points and test cases with different solvers need to be investigated. Compressible, density-based or large-Mach-number solvers might have a different MPI signature. In our experiments the major performance issues did not occur for the MPI routines with the largest visit counts or data transfers. Instead, MPI routines that appeared harmless at first, such as MPI Recv and MPI Allreduce, were identified as problematic. We have also recognized that these problems were concentrated in only a few MPI tasks, and that further research is needed to consolidate these observations. In terms of performance, I/O is not an important issue in our experiments. That does not mean there is no opportunity to improve it. In OpenFOAM each MPI instance reads its input and writes its output. If OpenFOAM breaks the barrier of thousands of cores, we will consequently have thousands of files, which is not suitable for any available file system. Lastly, we have to emphasize that all these findings would not have been possible at all if we did not have profiling and tracing tools. Without these tools we could be optimizing or paying attention to the wrong places. IPM was flexible enough to produce the required information (where other tools have failed), while at the same time being concise and informative (where some other tools could have produced huge amounts of redundant information). We hope these tools keep pace with developments in software and hardware, and specifically with the expected scale of next-generation machines.

Acknowledgements. This project was in part financially supported by the EU reintegration grant MADAME under agreement no. PIRG07-GA2010-268351.
References 1. CSC - IT Center for Science. OpenFOAM - CSC, http://www.csc.fi/english/research/sciences/CFD/CFDsoftware/openfoam/ofpage 2. Fürlinger, K., Wright, N.J., Skinner, D.: Effective performance measurement at petascale using IPM. In: Proceedings of The Sixteenth IEEE International Conference on Parallel and Distributed Systems (ICPADS 2010), Shanghai, China (December 2010) 3. Karypis, G., Kumar, V.: MeTis: Unstructured Graph Partitioning and Sparse Matrix Ordering System, Version 4.0 (2009), http://www.cs.umn.edu/~metis 4. Pringle, G.: Porting OpenFOAM to HECToR. A dCSE Project (2010), http://www.hector.ac.uk/cse/distributedcse/reports/openfoam/openfoam/index.html 5. Jasak, H.: Error Analysis and Estimation for the Finite Volume Method with Applications to Fluid Flow. PhD thesis, Department of Mechanical Engineering, Imperial College of Science, Technology and Medicine (1996) 6. Kobayashi, H., Wu, X.: Application of a local subgrid model based on coherent structures to complex geometries. Center for Turbulence Research, Stanford University and NASA. Annual research brief, pp. 69–77 (2006) 7. Le, H., Moin, P., Kim, J.: Direct numerical simulation of turbulent flow over a backward-facing step. Journal of Fluid Mechanics 330(1), 349–374 (1997) 8. Weller, H.G., Tabor, G., Jasak, H., Fureby, C.: A tensorial approach to computational continuum mechanics using object orientated techniques. Computers in Physics 12(6), 620–631 (1998) 9. HPC Advisory Council. OpenFOAM Performance Benchmark and Profiling (2010), http://www.hpcadvisorycouncil.com/pdf/OpenFOAM_Analysis_and_Profiling_Intel.pdf 10. Berselli, L.C., Iliescu, T., Layton, W.J.: Mathematics of Large Eddy Simulations of Turbulent Flows, 1st edn., pp. 18–25. Springer, Heidelberg (2005) 11. Leibniz-Rechenzentrum (LRZ). Hardware Description of HLRB II (2009), http://www.lrz.de/services/compute/hlrb/hardware/ 12. Calegari, P., Gardner, K., Loewe, B.: Performance Study of OpenFOAM v1.6 on a Bullx HPC Cluster with a Panasas Parallel File System. In: Open Source CFD Conference, Barcelona, Spain (November 2009) 13. Rivera, O., Fürlinger, K.: Parallel aspects of OpenFOAM with large eddy simulations. In: Proceedings of the 2011 International Conference on High Performance Computing and Communications (HPCC 2011), Banff, Canada (September 2011)
Shibboleth and Community Authorization Services: Enabling Role-Based Grid Access

Fan Gao and Jefferson Tan

Faculty of Information Technology, Monash University, VIC 3800, Australia
[email protected], [email protected]
Abstract. Classical authentication and authorization in grid environments can become a user management issue due to the flat nature of credentials based on X.509 certificates. While such credentials are able to identify user affiliations, such systems typically leave out a crucial aspect in user management and resource allocation: privilege levels. Shibboleth-based authentication mechanisms facilitate the secure communication of such user attributes within a trust federation. This paper describes a role-based access control framework that exploits Shibboleth attribute handling and CAS (Community Authorization Services) within a Grid environment. Users are able to obtain appropriate access levels to resources outside of their domain on the basis of their native privileges and resource policies. This paper describes our framework and discusses issues of security and manageability. Keywords: grids, resource allocation, user management, single sign-on.
1 Introduction

Grids are used to “solve unique research problems and to collaborate between different researchers across the globe” [1]. They provide substantial support for research in a variety of applications, but a number of challenges remain, including security. We focus on two types of security services within grids: access control and communication security. Currently, Public Key Infrastructure (PKI) [2] provides the primary authentication mechanism, as in the case of the Grid Security Infrastructure (GSI) [3], where PKI-based X.509 identity certificates are used for authentication. Apart from the ability to employ asymmetric encryption, there is the advantage of using proxy certificates to support Single Sign-On (SSO) [4], allowing for security delegation via the more limited proxy certificates. The notion of SSO is not unique to X.509 certificates, as it is also used in Shibboleth SSO [5], which works within the context of a trust federation. A participating entity entrusts user authentication to another participating entity from which a service is requested. Due to the design of secure transactions to facilitate release of user data to support such trust relationships, we have found an apt application for Shibboleth SSO: role-based access control, translating to role-based authorization. This paper focuses on authentication, authorization and resource access control with Shibboleth and Community Authorization Service (CAS). The goal is to provide access control in large-scale grids where service providers need not rely solely
on their own basis for the privileges that they grant to remote users. Our scheme supports the determination and application of a remote user’s appropriate authorization level based on local policies. In Section 2, related work is reviewed in light of our work. Section 3 gives an overview of the proposed framework’s design. Section 4 discusses the details of this framework, and we evaluate the framework design in Section 5. Finally, the conclusions are presented in Section 6.
2 Background and Related Works

The Grid Security Infrastructure (GSI) [3] relies on X.509 and SSL/TLS mechanisms through which a trusted third party can sign, and thus vouch for, either user or service credentials consisting of a public and private key pair and a certificate. This trusted third party is the Certificate Authority (CA), which bears the mutual trust of all participating entities. The four steps in GSI authentication to access the Grid [1] are as follows (a code sketch of the key-generation and request steps follows the list):

1. Obtain the CA certificate that contains its public key.
2. A key pair is generated at the local host together with a certificate request. The latter is signed using the user’s private key, which is used by the CA to verify the certificate request’s authenticity.
3. The CA verifies the user’s information, typically with the assistance of an authorized registration authority operator (RAO). The CA then signs the request using its private key, and releases the signed certificate to the user.
4. Store the CA’s certificate and the signed certificate, which are both used for subsequent authentication transactions.
These certificates (and associated keys) can form the basis of access control. From among the three access control models, Mandatory Access Control (MAC), Discretionary Access Control (DAC), and Role-Based Access Control (RBAC), the MAC model is often used in secure grids [6]. Security labels describe some aspect of an entity that can be used to describe access conditions with another entity. Two types of security labels are security classification on an object and security clearance on a user. In the DAC model, users control access to files and other resources that they create, use, or own, and can pass access rights on as preferred. However, the RBAC model seems more appropriate in incorporating the user’s role in deciding to allow access to resources, and roles reflect a user’s privileges. RBAC is considered to be scalable and flexible ([1], [7]). It also provides a mechanism to articulate policies instead of being locked into a particular security policy. Shibboleth-based grid resource access control [5] integrates Shibboleth and authorization based on the X.812 Access Control Framework standard [8]. Shibboleth pushes trust out towards local security infrastructures where users are accountable. Each participating organization relies on its own Identity Provider (IdP) to vouch for its users when accessing remote services. An optional Where Are You From (WAYF) service allows users to identify their home organizations prior to being redirected to their own IdP. A Service Provider (SP) integrates pre-existing or new web services, enforcing IdP-based authentication before making access control decisions.
Consequently, access to resources can be decided on the basis of local security policies of the resource provider. Cacheable Decentralized Groups Access Control [9], based on decentralized group membership, relies on a global namespace, and end users could create and manage the decentralized groups by access control lists (ACLs) on shared resources. In Trust Degree Based Access Control [10], a centralized security server is used to manage and compute a trust value for every access request. The authorization model, based on the eXtensible Access Control Markup Language (XACML), resulted in the Community Authorization Service (CAS) [10]. It promises to improve access control in terms of performance and ease of maintenance [9]. Some issues are crucial in the context of this study. Users are typically not experts in grid computing, and there should be a way to simplify authentication and authorization procedures. Moreover, as grids become increasingly complex, their management would dramatically increase in difficulty as well. Confidentiality is the centre of attention for Resource Providers (RPs), with data resources being commonly confidential in different communities and domains. Scalability affects the performance of GSI. Normally, additional mechanisms increase overheads, as would the storage requirement for user information such as attributes, keys, roles, etc. There are human overheads as well, such as in obtaining a signed certificate from the CA via a registration authority operator. Authentication management such as key production, cancellation, and certificate storage is also becoming a challenge in large-scale environments. There is also the problem of trust. If trust relationships are broken, the whole system could be affected. There are also interoperability issues among several security mechanisms that are individually effective but may not have been designed to fit together. These mechanisms only focus on particular issues or provide only part of the solution. There is no adherence to one complete, end-to-end standard among them.
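Before turning to the framework, the RBAC model discussed above can be made concrete with a minimal, hypothetical policy check in Python (the roles happen to mirror the prototype roles used later in this paper; the users, resources, and actions are illustrative only):

    # Minimal RBAC sketch: permissions attach to roles, never to individual users.
    PERMISSIONS = {
        "administrator": {("job", "submit"), ("job", "cancel"), ("file", "write")},
        "staff":         {("job", "submit"), ("file", "write")},
        "student":       {("job", "submit")},
    }

    USER_ROLES = {"alice": "administrator", "bob": "student"}  # hypothetical users

    def is_authorized(user: str, resource: str, action: str) -> bool:
        role = USER_ROLES.get(user)
        return role is not None and (resource, action) in PERMISSIONS.get(role, set())

    assert is_authorized("alice", "file", "write")
    assert not is_authorized("bob", "job", "cancel")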
3 RBAC with Shibboleth and CAS Our work emphasizes that utilizing a user’s role attribute, which Shibboleth-style authentication can supply at sign-on, enhances access control. As per Shibboleth SSO, users authenticate themselves through the IdP of their home organization, which releases one or more role attributes. The same sign-on allows them to obtain credentials from an on-demand CA, which can incorporate the role data into a newly signed certificate. This is used to obtain role-based capabilities from the CAS, which gives the users access to authorized resources at the appropriate level of access. As Figure 1 shows, the proposed RBAC framework includes six major components. The Authentication System is based on Shibboleth. The Grid Service Web Portal is the gateway to the target Grid resource. The Credential Proxy Server generates proxy certificates as needed by an authenticated user. The Certificate Pool System is a novel method of restricting access by using a limited pool of certificates that can be active at any given time. The Authorization Server enforces policies that take user credentials, including role attributes, to make authorization decisions. Grid Resources include the Globus gatekeeper service for launching jobs or the GridFTP service for file transfers. The simple workflow of our framework is as follows:
Fig. 1. The proposed RBAC framework with Shibboleth and CAS
1. The user sends an access request to the Grid Service Web Portal via a browser.
2. The portal sends an authentication and attribute request to the user.
3. The Authentication System authenticates the user and releases his role attribute.
4. The user forwards authentication data and the role attribute to the Grid Service Web Portal, which forwards the attributes to the Credential Proxy Server.
5. The authenticated user sends an access request to the Credential Proxy Server.
6. The Credential Proxy Server obtains a local certificate for the user from the Certificate Pool System, and generates a proxy credential based on the user’s role attribute, reflecting the appropriate privilege level for the user.
7. The Proxy Server sends the proxy credential to the Authorization Server.
8. The Authorization Server provides/activates the appropriate access capabilities for the user based on the role-based proxy credential.
9. The user may now access Grid resources at the appropriate level of access.
The Community Authorization Service. (CAS), once a component of the Globus Toolkit (GT) for fine-grained access control, is part of our proposed framework. The CAS Server verifies the user’s credentials and generates proxy credentials for the user with specific capabilities based on the user’s role, as well as the security policies stored in the CAS Database. The latter also stores information about users, groups, actions, resources and policies. CAS acts as a broker between the user and the resource provider, as sketched below. The resource provider also makes the final decision on user access to grid resources, i.e., the resource provider could temporarily reject the user even if the latter has certain authorization to the resources.
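The broker step can be illustrated with a minimal Python sketch, assuming a simple in-memory policy store (all names and policies are hypothetical; a real CAS issues signed, restricted proxy credentials rather than plain objects):

    from dataclasses import dataclass

    # Hypothetical CAS policy store: role -> allowed (resource, action) pairs.
    CAS_POLICY = {
        "staff":   [("gsiftp://storage.example.org", "read"),
                    ("gsiftp://storage.example.org", "write")],
        "student": [("gsiftp://storage.example.org", "read")],
    }

    @dataclass
    class Capability:
        subject: str       # distinguished name from the verified proxy credential
        resource: str
        action: str
        valid_for_s: int   # maximum validity period of the capability

    def issue_capabilities(subject_dn: str, role: str, ttl_s: int = 3600):
        # Broker step: map the verified role onto community access rights.
        return [Capability(subject_dn, res, act, ttl_s)
                for res, act in CAS_POLICY.get(role, [])]

    caps = issue_capabilities("/O=Example/CN=remote_staff/02", "staff")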
Fig. 2. Overview of CAS
Shibboleth. (Figure 3) is open source software developed by the Middleware Architecture Committee for Education - Internet2 (MACE-Internet2). For authentication purposes, Shibboleth can be considered an extension of the typical local authentication and authorization approach, in which a user can access resources of any participating organization of a federation using their local authentication information.
Fig. 3. Shibboleth architecture and workflow, redrawn from Jie et al. [4]
The Certificate Pool System (CPS) is a novel system to manage certificates in a local system for remote users. Figure 4 illustrates the architecture of CPS, which has two major components: the Certificate Pool Manager (CPM) and Certificate Pool Database (CPD). CPS stores a certain number of certificates in the CPD, based on the policies set by the owning organization. The certificates are generated and signed by the CA for remote users who want to use local Grid resources. These certificates have shorter lifespans than certificates for local users. The main functions of the CPM are:
Fig. 4. Architecture of the Certificate Pool Manager (CPM)
• To check the validity of certificates stored in the CPD.
• To generate certificate status reports for users or administrators.
• To allocate a certain number of certificates to particular roles. For instance, assume 100 certificates are stored in the CPD. The CPM can allocate 10 for administrators, 70 for staff and 20 for students. The ratio of certificates for different roles is variable (a minimal allocation sketch follows this list).
• To control the maximum number of concurrent remote user accesses to local resources.
• To respond to certificate requests from legitimate users or entities.
• To maintain certificates in the CPD, i.e., to destroy a certificate if revoked or if it expires, and to obtain or renew them via the CA when required.
• When the CPM sends a request to the local CA to sign the certificate, the CPM controls the maximum lifetime of certificates based on some policy.
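The allocation function lends itself to a small sketch. The following hypothetical Python fragment enforces the per-role quotas of the example above (10/70/20 out of 100 pooled certificates; the certificate strings are stand-ins for real X.509 objects):

    # Hypothetical certificate-pool quota sketch mirroring the example above.
    POOL_QUOTA = {"administrator": 10, "staff": 70, "student": 20}
    issued = {role: 0 for role in POOL_QUOTA}  # certificates currently checked out

    def request_certificate(role: str):
        """Hand out a pooled certificate if the role's quota is not exhausted."""
        if role not in POOL_QUOTA:
            raise ValueError(f"unknown role: {role}")
        if issued[role] >= POOL_QUOTA[role]:
            return None  # quota exhausted: caps concurrent remote access
        issued[role] += 1
        return f"pooled-cert/{role}/{issued[role]:02d}"  # stand-in for a certificate

    def release_certificate(role: str):
        """Return a certificate to the pool when it expires or is revoked."""
        issued[role] = max(0, issued[role] - 1)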
The CPS can improve authentication and authorization since users do not need personal certificates from the local CA. It increases security by centralized certificate control over concurrent remote users and certificate life spans. It also simplifies the management of certificates for remote users, and the workload for the local system administrator decreases.
4 Framework Design

The framework consists of one part for remote users and one for local users. The following are assumed:

• In our prototype, three roles are defined: administrator, staff and student. More sophisticated classifications may well suit other situations.
• The framework design does not reflect the influence of a firewall.
• HTTPS is used in all interactions in this framework, to provide SSL/TLS security.
4.1 Framework Design for Remote Users

The framework, based on the typical Single Sign-On model [12], is redesigned for our own purposes. Figure 5 illustrates seven major components, which we describe below.
Fig. 5. System architecture and workflow for a remote user
The Identity Provider (IdP) provides authentication and authorization services via the user’s home institution. Note that it relies on some underlying security infrastructure, typically an authentication, authorization and accounting (AAA) system for local user accounts and accounting data. The Where Are You From (WAYF) Service is used to redirect a remote user to their home institution’s IdP to begin the authentication process. A Portal is considered a gatekeeper to local system services. It must verify the legitimacy of a remote user and obtain the user’s attributes. The portal’s Assertion Requester generates and presents an authentication request to a user to authenticate through his/her home institution. It is assumed that the local system and the user’s home institution are in the same federation. The Attribute Requester is used to obtain the user’s attributes from his/her home institution based on the local system requirements and delivers these attributes to the local system for subsequent authorization purposes. The
GridShib-CA is used to generate local proxy credentials based on the user’s role attributes and certificate, stored in the Certificate Pool System (CPS). A specific Subject Name (SN) (e.g. remote_administrator/01, remote_staff/02, remote_student/05, etc.) is composed in that proxy credential on account of the user’s role. In effect, the GridShib-CA makes it possible for roles to enter the picture. It sends an attribute request to obtain a user’s attributes from the user’s home institution via the portal. It is also used to request local certificates for remote users from the CPS. The Certificate Pool System (CPS) is used to manage and store local certificates in the certificate pool for remote users. The Community Authorization Service (CAS) provides access control for remote users, based on a user’s Distinguished Name (DN) and the Subject Name (SN) in the proxy credentials, and based on local policies. CAS generates specific capabilities for a remote user, used to access authorized local resources. There is a “capability valid time” notice generated by the CAS to indicate the maximum validity time for those capabilities. The Resource Provider (RP) provides resources, e.g., computing resources, storage resources, applications and services. Figure 5 shows 19 steps of authentication and authorization for a remote user. Steps 1-8 are the typical authentication steps of Shibboleth via the portal for the remote user. Steps 9 through 15 present the procedure to generate role-based local proxy credentials for the remote user. Finally, steps 16 through 19 describe the local resource capability request procedure and access to local Grid resources.

4.2 Framework Design for a Local User

In comparison with that for a remote user, the architecture and workflow for a local user is much simpler. For example, the IdP is part of the local system, so there is no need to use the Portal and WAYF service to redirect the user to the IdP, though doing so will still work. In addition, the Certificate Pool System (CPS) is unlikely to apply to local users since they each have local certificates, and local policies may not be restrictive to local users. They may also authenticate directly via the local Authentication System behind the IdP, rather than through the IdP itself.
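Returning to the subject-name composition described in Sect. 4.1, a minimal sketch might look as follows (the exact format is our assumption, inferred from the examples remote_administrator/01 and remote_staff/02):

    # Hypothetical subject-name composition for role-based proxy credentials.
    def compose_subject_name(role: str, serial: int) -> str:
        allowed = {"administrator", "staff", "student"}
        if role not in allowed:
            raise ValueError(f"unknown role: {role}")
        return f"remote_{role}/{serial:02d}"

    assert compose_subject_name("staff", 2) == "remote_staff/02"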
5 Pros and Cons

Single Sign-On allows users to be authenticated by their home institution’s authentication infrastructure. If authenticated, they obtain local proxy credentials and capabilities to repeatedly transact with resources without subsequent sign-on steps. Usability improves for users who are easily authenticated across domains based on the trust relationship within a federation. “Role-grained” Access Control comes from having a user’s role attribute, allowing the local system to provide appropriate privileges without maintaining individual accounts for remote users. In a typical grid, resources are allocated via one-size-fits-all queues, or via groups to which individual users may be added manually. In our model, CAS policies apply to roles, i.e., “role-grained” allocation. User authentication is effectively delegated in hierarchical fashion via home institutions, providing scalability. Apart from the above benefits, other points of manageability can be made. Each institution needs only to manage its own IdP for its own users. New users or new institutions imply no major changes to local
account management, assuming that generic accounts are already available for remote users in general. Moreover, resource providers define role-based authorization policies for their resources, based on roles rather than per user. The CPS avoids having users approach a registration authority operator prior to the approval of a certificate request. Each institution can autonomously use its own authentication system but still be federated. Different mechanisms such as VOMS [1] and PERMIS [13] could be used locally without impairing interoperability. Privacy is protected in Shibboleth, as local systems request minimal user attributes from the user’s home IdP, and our framework only adds role data to what is normally required. Of course, there are disadvantages as well. First, Shibboleth is a static-trust-based infrastructure, in which the institutions must trust each other. If the trust relationship is broken, the whole federation is affected: an unauthorized user who masquerades as a valid remote user can access other institutions in the federation. Moreover, if the IdP releases the wrong role data for a given user, CAS will effectively authorize higher privileges than are appropriate. Second, attributes defined in the remote user’s IdP may not meet the authorization requirements of other institutions. Moreover, if these definitions change at some point, mismatches may arise. Third, CAS does not support users from multiple VOs, as it only supports basic access control policies. Finally, CAS may become a performance bottleneck with only one CAS server. There are other issues to consider. The framework is not particularly designed to withstand attacks. A masquerade attack can let an adversary gain privileged roles. A Denial-of-Service (DoS) attack can bring the CAS server down, making the authorization service unavailable to all. Also, according to the characteristics of Shibboleth, a user who belongs to a specific institution can only possess one role. What if the user belongs to multiple institutions, having different roles? This can be a minor accounting problem, or a bigger one if there is reason to revoke the user’s privileges completely, but the user’s privileges exist across multiple identities with individual roles.
6 Conclusions and Future Work

We discussed a proposed role-based access control system that addresses authentication, authorization and decentralized resource allocation and account management in a grid. It integrates Shibboleth and CAS to improve usability, scalability, manageability, privacy and performance of existing authentication and authorization systems using a Single Sign-On, role-based and “role-grained” access control mechanism. Our implementation supports workable workflows for remote and local users, and supports interoperability across different mechanisms. A novel Certificate Pool System was employed to improve security, manageability and performance. Possible extensions to our work include:

• a distributed CAS system to improve robustness and scalability;
• implementing this framework with VOMS [1], PERMIS [13] and other mechanisms;
• adding support for users with multiple roles in different institutions.
References
1. Chakrabarti, A.: Grid computing security. Springer, New York (2007)
2. Gutmann, P.: PKI: It’s not dead, just resting. Computer 35(8), 41–49 (2002)
3. Foster, I., Kesselman, C., Tsudik, G., Tuecke, S.: Security architecture for computational grids. In: 5th ACM Conf. on Computer and Communications Security (CCS 1998), pp. 83–92. ACM, NY (1998)
4. Jie, W., Arshad, J., Ekin, P.: Authentication and authorization infrastructure for Grids—issues, technologies, trends and experiences. J. Supercomput. 52(1), 82–96 (2010)
5. Sinnott, R.O., Jiang, J., Watt, J., Ajayi, O.: Shibboleth-based access to and usage of grid resources. In: Proc. 7th IEEE/ACM Int. Conf. Grid Computing (Grid 2006), pp. 136–143. IEEE Computer Society, Washington, DC (2006)
6. Daswani, N., Kern, C., Kesavan, A.: Foundations of security: what every programmer needs to know. Apress Media LLC, New York (2007)
7. Pereira, A.L., Muppavarapu, V., Chung, S.M.: Role-based access control for grid database services using the community authorization service. IEEE Trans. Dependable and Secure Computing 3(2), 156–166 (2006)
8. ITU-T Recommendation X.812 | ISO/IEC 10181-3:1996, Security Frameworks for open systems: Access control framework (1996)
9. Hemmes, J., Thain, D.: Cacheable decentralized groups for grid resource access control. In: 7th IEEE/ACM Int. Conf. Grid Computing (Grid 2006), pp. 192–199. IEEE Computer Society, Washington, DC (2006)
10. Ni, X., Luo, J., Song, A.: A trust degree based access control for multi-domains in grid environment. In: 11th Int. Conf. Computer Supported Cooperative Work in Design (CSCWD 2007), pp. 864–869. IEEE, Piscataway (2007)
11. Lang, B., Foster, I., Siebenlist, F., Ananthakrishnan, R., Freeman, T.: A multipolicy authorization framework for grid security. In: 5th IEEE Int. Symp. Network Computing and Applications (NCA 2006), pp. 269–272. IEEE, Los Alamitos (2006)
12. Jensen, J., Spence, D., Viljoen, M.: Grid single sign-on in CCLRC. In: Proc. UK e-Science All Hands Meeting 2006, Nottingham, UK. National e-Science Centre, Edinburgh (2006)
13. Chadwick, D., Otenko, A.: The PERMIS X.509 role based privilege management infrastructure. Future Generation Computer Systems 19(2), 277–289 (2003)
A Secure Internet Voting Scheme

Md. Abdul Based and Stig Fr. Mjølsnes

Department of Telematics, Norwegian University of Science and Technology (NTNU)
{based,sfm}@item.ntnu.no

Abstract. We describe information security requirements for a secure and functional Internet voting scheme. We then present a voting scheme with multiple parties that satisfies all these security requirements. In this scheme, the voter gets a signed key from the registrar, where the registrar signs the key in blinded form. The voter uses this signed key during the voting period. All other parties can verify this signature without knowing the identity of the voter; hence the scheme provides privacy of the voter. This voting scheme also satisfies voter verifiability and public verifiability. In addition, the scheme is receipt-free.
1 Introduction
Internet voting has been an active research topic in recent years. Several countries have started deploying Internet voting, and building a secure Internet voting scheme has proved to be a challenging task. In Internet voting, we want privacy of the voter, and at the same time we want to verify that the ballot cast by a voter has been counted properly. In addition, we do not want to reveal how the ballot is cast to a vote buyer or coercer. These requirements make Internet voting a difficult topic for researchers. For a physically coercion-free election, it is recommended that the voter first go to an election booth controlled by the election officials, show some identity to get access to the booth, and then vote over the Internet. We can make the voting scheme remote so that people can vote from anywhere, but in that case we must accept the risk of physical coercion: the voter may be coerced by a vote buyer to vote for a particular candidate in remote Internet voting. To prevent the voter from selling the vote, the voting scheme must be receipt-free. We can assume that there is a voter computer [1] inside the election booth to perform the cryptographic tasks on behalf of the voter. The voter just chooses a vote for a particular candidate, and the voter computer constructs the ballot from that vote on behalf of the voter. The voter is then not able to sell the vote to a vote buyer or coercer. We use a bulletin board in the voting scheme to make it verifiable. There are two ways of verification: the voter can verify that the ballot is counted properly, and anyone can verify that the counting is correct. The first one refers to voter verifiability and the second one is called public or universal verifiability. Our voting scheme is both voter verifiable and publicly verifiable.
Outline of the paper: Related work is described in Section 2. The security requirements for a secure and functional Internet voting scheme are presented in Section 3. In Section 4, the voting scheme is presented. The corresponding message sequence diagrams are shown in Section 5. The security analysis is done in Section 6, and the conclusions and future plans are presented in Section 7.
2 Related Work
An Internet voting scheme with multiple parties is presented in [2], where a registrar was proposed to sign a key for the voter that is used during the voting period. However, the registrar could easily link the key with the identity of the voter in that scheme. We improve this part in our voting scheme such that the registrar signs a blinded copy of the key K, so the registrar has no way to link the key with the identity of the voter. The inductive approach to verifying cryptographic protocols is published by Paulson in [3]; protocols are described inductively as sets of traces, where a trace is a list of communication events. Kremer and Ryan present the analysis of an electronic voting protocol in the applied pi calculus in [4]. Groth investigates the existing voting schemes based on homomorphic threshold encryptions, evaluates the security of the schemes in the UC framework in [5], and also proposes some modifications to make the voting schemes secure against active adversaries. Zero-knowledge protocols for voting systems and homomorphic cryptosystems are published in [5,6,7]. A non-interactive zero-knowledge protocol (NIZK) is described in [1]. The authors propose multiple counting servers to count the ballots jointly; the counting servers can individually verify the validity of the ballot without interacting with the voter. They also propose a voter computer to construct the ballot from a vote. We refer to their voter computer and NIZK protocol, which can also be used in our voting scheme.
3 Security Requirements
We classify the information security requirements of Internet voting schemes as basic security requirements and enhanced security requirements [8]. These requirements are mostly related to each other.

3.1 The Basic Requirements
The basic security requirements are:

Eligibility of the Voter. To allow the authenticated voter to cast the ballot, eligibility of the voter is important. To achieve this, authentication and identification of the voter are needed so that only eligible voters have access to cast the ballot.
Confidentiality of the Ballot. The ballot cast by a voter must be confidential throughout the transmission from the voter to the servers. The servers represent the election officials, authentication servers, or counting servers. In most cases, Public Key Cryptography (PKC) is used for this purpose through public key encryption techniques. The voter encrypts the ballot using the public key of the servers, so only the servers can decrypt and see the contents of the ballot.

Integrity of the Ballot. Integrity of the ballot guarantees that the ballot cast by an eligible voter is not altered or modified by any unauthorized party. To provide integrity, digital signature or group signature [9] techniques are mostly used [10].

Privacy and Secrecy. Privacy refers to both the privacy of the voter and the privacy of the ballot. Privacy should not be compromised in a voting scheme. Secrecy means that the way in which a voter casts the ballot should not be revealed to any party [11].

Robustness. The voting scheme should ensure that the components (both the servers and the voter computers) of the voting scheme remain functional during the voting period, such that if some components fail then the other components will still work without stopping the voting process.

Fairness. The voting scheme should not publish any intermediate results, so that no participant gains any knowledge about the partial results before the final tallying. This is important because knowledge of the partial results may affect the intentions of voters who have not yet cast their ballots to vote for a particular candidate. That is why a voting scheme should be fair.

Soundness. The counting process of a voting scheme should discard a ballot that is not valid. Ballot verification is needed before counting the ballots.

Completeness. The counting process of a voting scheme is complete if it counts all the valid ballots. In this case, the final tally is equal to the number of valid ballots cast by the voters. Verifiability of the counting process is important to provide completeness in a voting scheme.

Unreuseability of the Ballot. The counting process of a voting scheme should count only one ballot for each eligible voter. The authentication process in some schemes does not allow a voter to vote twice or more, though some voting schemes allow a voter to vote multiple times. In that case, the counting process includes only the final ballot cast by the voter.

3.2 The Enhanced Requirements
The enhanced security requirements for Internet voting schemes are:

Unlinkability and Untraceability. Unlinkability between the voter and the ballot is important for the privacy of the voter and the ballot. Untraceability means that neither the voter nor the ballot acquirer can add identifiable information to the ballot. This property supports the unlinkability property.
Validity of the Ballot. The counting process of a voting scheme should validate the ballots before counting them, to satisfy the soundness, completeness, and unreuseability requirements of the counting process. In an Internet voting scheme, we define a ballot as valid if the ballot is cast by an eligible voter and the content of the ballot is also valid. In general, Zero Knowledge Proof (ZKP) protocols [14] are mainly used for this purpose. ZKP protocols are classified into two types based on the interaction between the prover and the verifier:

• Interactive Zero Knowledge Proof protocols
• Non-Interactive Zero Knowledge Proof protocols
Interactive zero knowledge protocols [7] allow the prover to interact with the verifier and prove the validity of the ballot. The voters play the role of the prover and the counting servers play the role of the verifier in the voting schemes. First, the voter commits to a value of the ballot and sends the ballot to the servers. Then the servers send a random challenge string to the voter, and the voter responds to the challenge string. The verifiers then verify the ballot by verifying the responses. Unlike the interactive zero-knowledge protocol, there is no interaction between the prover and the verifier in a Non-Interactive Zero-Knowledge (NIZK) protocol [10]. In NIZK, the verifier verifies the ballot without interacting with the prover. A NIZK protocol requires no online communication between the prover and the verifier, hence it is usually faster and more efficient. On the other hand, a NIZK protocol requires that both the prover and verifier share a common random string (as challenge string), usually provided by a trusted third party, and a pre-arranged use of this random string is required.

Verifiability. Verifiability refers to the verifiability of the counting process. It is divided into two categories:
• Universal Verifiability or Public Verifiability
• Voter Verifiability
In universal verifiability or public verifiability, any observer can verify that the counting process contains all the valid ballots and that the final tally is correct [6]. On the other hand, if only the voters can verify the counting process and the tally, then the voting scheme is called voter verifiable [1,12]. In the scheme of [1], the counting servers publish the final tally and the nonce values that were added to the ballots. The scheme is voter verifiable, since every voter can verify the nonce values.

Receipt-freeness. In Internet voting, receipt-freeness [13] is another challenging requirement, introduced by Benaloh and Tuinstra [15] to provide coercion-resistance. Receipt-freeness means that, in a voting scheme, the voters should not be able to produce a receipt of the ballot to prove for which candidate the ballot was cast. This is very important to prevent vote-buying by a coercer or vote buyer.
Fig. 1. The Voting Scheme
Coercion-resistance. Coercion-resistance is achieved through receipt-freeness if the voting scheme does not allow the vote buyer or coercer to force a voter to vote for a particular candidate. Re-voting or multiple voting [1] by a voter is allowed in some schemes to make them coercion-resistant. To protect the voter from physical coercion, election booth based voting schemes are recommended. In the case of remote Internet voting, there is no cryptographic way to make it free from physical coercion.
4 The Voting Scheme
Fig. 1 shows the voting scheme. In this scheme, the voter interacts with a registrar to get a signed key and uses that key for voting. Then the voter sends the ballot to the ballot acquirer together with this signed key. The ballot acquirer verifies the signature of the registrar on this key and sends the ballot to the counting servers after the voting period is over. The ballot acquirer signs the key and sends the key back to the voter. The voter publishes this key on the bulletin board. The counting servers also sign the key and publish the signed key on the bulletin board. So the voter and any observer can verify that the number of keys signed by the counting servers is equal to or less than the number of keys signed by the ballot acquirer.

4.1 The Parties
The various parties in the voting scheme are as follows:
• Voter - A voter is one who is eligible to vote. We assume that the voter holds a smartcard containing some private information that identifies the eligibility of the voter to the registrar. The voter creates a private/public key pair, where the public key is to be signed by the registrar so that the voter can use this key during the voting period. After the voter has chosen the candidate to vote for, the choice is encoded, and the ballot is split into shares and sent to the ballot acquirer. We do not describe the ballot construction here, but rather refer to [1].
• Voter Computer - The Voter Computer (VC) is used to perform the cryptographic task on behalf of the voter to construct the ballot from a vote. We assume that the VC is trusted and operates inside an election booth. The voter cannot see the cryptographic details of the construction of a ballot, and so cannot sell the ballot to a vote buyer or coercer.
• Registrar - The role of the registrar is to validate the eligibility of the voter and to sign the blinded copy of the public key supplied by the voter.
• Ballot Acquirer - The role of the ballot acquirer is to acquire the ballot from the voter and to verify the signature of the registrar on the key included with the ballot. After the ballot acquisition stage, the ballot acquirer mixes all the ballots and sends these to the counting servers. To defend against coercion, a voter may vote multiple times; in that case, the ballot acquirer keeps a record and sends the latest ballot of the voter to the counting servers.
• Bulletin Board - The content of the bulletin board is public. This board is used to provide voter verifiability and universal verifiability. After voting, the voter sends the signed key (signed by the ballot acquirer) to the bulletin board. After counting, the same key is published by the counting servers. All counting servers publish the same key, so no single counting server can alter or delete a key without detection. The voter can verify that the key signed by the ballot acquirer is the same key published by the counting servers. Any observer can observe that the number of keys signed by the ballot acquirer must be equal to or larger than the number of keys published by the counting servers, because the counting servers may ignore some invalid ballots after ballot verification. In that case, the key will not be published, but the counting servers can maintain a list of the keys of the discarded ballots (if needed) for future claims by the voters.
• Counting Servers - A group of counting servers verifies that each individual ballot is correctly constructed and computes the result of the election using multiparty computations. No single counting server gets complete information about an individual ballot, but together they reach a result that all counting servers agree upon.

4.2 Communication Model
We assume that the bulletin board is implemented as a centralized server and the contents of the bulletin board are public. The voter and the counting servers communicate with the bulletin board over the Internet. The counting servers share the same role but are run by different parties; the communication between them is done over authenticated and encrypted channels.
Fig. 2. The Protocol between the Voter and the Registrar
The communication between the voter and the registrar is interactive over the Internet. The implementations of the Registrar, Ballot Acquirer, Counting Servers, and Bulletin Board are owned, developed and deployed by different trusted parties, and all Internet connections are authenticated and confidential based on standard technology such as TLS/SSL. Each party is assumed to have a public key known by all roles and a private key to sign messages with.
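As a minimal illustration of the assumed TLS-protected, mutually authenticated channels (all file paths are hypothetical), a party could configure its endpoint roughly as follows with Python's standard ssl module:

    import ssl

    # Hypothetical paths; each party holds a certificate, a private key, and the CA.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(certfile="/etc/voting/server.pem",
                        keyfile="/etc/voting/server.key")
    ctx.load_verify_locations(cafile="/etc/voting/ca.pem")
    ctx.verify_mode = ssl.CERT_REQUIRED  # mutual auth: peers must present a certificate

    # A client endpoint would use PROTOCOL_TLS_CLIENT with its own certificate
    # loaded, so that both ends of a counting-server channel are authenticated.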
5 The Protocol
The protocol is shown as message sequence diagrams in this section. The voter shows some valid identity to enter the election booth and generates the key pair (K, K⁻¹), where K⁻¹ is the private key and K is the corresponding public key.

5.1 The Protocol between the Voter and the Registrar
The protocol between the voter and the registrar is shown in Fig. 2. The voter sends a signed and encrypted blinded copy (the blinding scheme is not shown in detail here) of K to the registrar. That is, the voter sends [K̃V⁻¹]R to the registrar, where K̃ (the blinded copy of K) is signed with the private key of the voter V⁻¹ and encrypted with the public key of the registrar R. The registrar verifies the signature of the voter, and if the voter is valid and has not voted before, the registrar signs the blinded key with its private key R⁻¹, encrypts it with the public key of the voter V, and sends this signed and encrypted blinded key to the voter. The voter can then unblind the message and obtain a signed public key. A sketch of the blinding steps follows.
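A minimal Python sketch of these blind/sign/unblind steps, assuming textbook RSA blind signatures (the paper does not fix a concrete blinding scheme, so this choice and all names are illustrative; a real deployment would use a padded, standardized construction):

    import secrets
    from math import gcd

    # Registrar's RSA key: n and e are public, d is private (assumed to exist).
    def blind(k_hash: int, e: int, n: int):
        # Voter blinds the (hashed) public key K with a random factor r.
        while True:
            r = secrets.randbelow(n - 2) + 2  # r in [2, n-1]
            if gcd(r, n) == 1:
                break
        return (k_hash * pow(r, e, n)) % n, r

    def registrar_sign(blinded: int, d: int, n: int) -> int:
        # The registrar signs without seeing K, so it cannot link K to the voter.
        return pow(blinded, d, n)

    def unblind(blind_sig: int, r: int, n: int) -> int:
        # The voter strips the blinding factor, leaving a valid signature on K.
        return (blind_sig * pow(r, -1, n)) % n  # pow(r, -1, n): modular inverse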
5.2 The Protocol between the Voter and the Ballot Acquirer

Fig. 3. The Protocol between the Voter and the Ballot Acquirer

The protocol between the voter and the ballot acquirer is shown in Fig. 3. The voter generates the ballot (we can assume that a voter computer [1] performs the cryptographic task to construct a ballot from a vote, so that the voter does not know the cryptographic details of the vote), signs it with the private key K⁻¹, adds the signed key KR⁻¹ to the ballot, and encrypts it with the public key of the counting server (CS). The voter then adds the signed key again (KR⁻¹) to this message and encrypts the message with the public key of the ballot acquirer (BA). The voter sends this signed and encrypted ballot to the ballot acquirer. The ballot acquirer can verify the signature on the key after decrypting the message. The ballot acquirer signs the key and sends the signed key KBA⁻¹ to the voter.

5.3 The Protocol between the Voter and the Bulletin Board

Fig. 4. The Protocol between the Voter and the Bulletin Board
The protocol between the voter and the bulletin board is shown in Fig. 4. The voter sends the signed key KBA⁻¹ to the bulletin board.

5.4 The Protocol between the Ballot Acquirer and the Counting Servers
The protocol between the ballot acquirer and the counting servers is shown in Fig. 5. After verifying the signature on the key K, the ballot acquirer signs the encrypted message received from the voter with its private key BA⁻¹, and sends it to the counting server. The counting server only accepts ballots signed by the ballot acquirer. The overall layering of signatures and encryptions is sketched below.
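The layering across Figs. 3 and 5 can be summarized structurally; in the following Python sketch, sign and encrypt are mere stand-ins for the public-key operations above, and all names are placeholders rather than the scheme's implementation:

    # Structural sketch only: sign/encrypt stand in for real crypto operations.
    def sign(priv: str, msg) -> tuple:
        return ("signed", priv, msg)

    def encrypt(pub: str, msg) -> tuple:
        return ("encrypted", pub, msg)

    # Voter side (Fig. 3): the inner layer is readable only by the counting servers.
    ballot = "encoded-ballot"            # produced by the voter computer
    KR = ("K", "signed-by-R")            # the registrar-signed key, KR^-1
    inner = encrypt("CS_pub", (sign("K_priv", ballot), KR))
    to_ballot_acquirer = encrypt("BA_pub", (inner, KR))

    # Ballot acquirer side (Fig. 5): verify KR, then countersign the inner layer.
    _, _, (inner_layer, kr_copy) = to_ballot_acquirer  # after decrypting with BA_priv
    to_counting_servers = sign("BA_priv", inner_layer)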
Fig. 5. The Protocol between the Ballot Acquirer and the Counting Servers
Fig. 6. The Protocol between the Counting Servers and the Bulletin Board
5.5 The Protocol between the Counting Servers and the Bulletin Board
The protocol between the counting servers and the bulletin board is shown in Fig. 6. The counting servers (we assume that there are multiple counting servers to count the ballots [1]) decrypt the messages, count the ballots, and publish the tally. The counting servers also sign the key and publish the signed key KCS⁻¹ on the bulletin board (BB).
6 Security Analysis
This section informally analyzes the security properties of the voting scheme presented in this paper. The voting scheme satisfies the following:

Eligibility. The voter shows his identity to the election officials to enter the election booth and then sends the signed key (as blinded) to the registrar. The registrar verifies the signature of the voter. So, only an eligible voter can vote.

Confidentiality and Integrity of the Ballot. In our scheme, the ballot is encrypted with the public key of the counting servers and signed using the private key K⁻¹ of the voter. Only the counting servers can decrypt and see the content of the ballot. The counting servers can also verify the key KR⁻¹ by verifying the signature of the registrar on this key (because the key must be signed by the registrar). This provides confidentiality and integrity of the ballot.
Privacy and Secrecy. In our scheme, the voter first enters an election booth controlled by the election officials by showing some identity and credentials. Then the voter casts the vote inside the booth. No one learns how the voter casts the ballot inside the election booth. Using the blind signature and ballot encryption (encryption with the public key of the counting servers) techniques, our scheme provides privacy of the voter and the ballot.

Robustness and Fairness. There are multiple counting servers in our voting scheme. As long as a threshold number of counting servers is functional, the voting scheme will work. No intermediate results are published before the final tally. Since there are multiple counting servers to count the final tally, it is not possible for a single counting server to publish any intermediate result. The counting servers work together to count and publish the result of the voting.

Soundness, Completeness, and Unreuseability of the Ballot. These properties are achieved through ballot verification, voter verifiability and universal verifiability.

Unlinkability and Untraceability. In our scheme, the voter sends a signed (signed by the voter) blinded copy of the key K to the registrar. The registrar verifies the signature and signs the blinded copy of the key. Then the registrar sends this signed (signed by the registrar) blinded key to the voter. The voter unblinds it, gets the signed key, and uses this key during the voting period. The registrar has no knowledge about the key, so it cannot link the voter with this key. Hence, no one can add identifiable information to the ballot in our scheme.

Validity of the Ballot. Regarding the validation of the ballot by a set of counting servers, our scheme is similar to the scheme presented in [10]. By using the non-interactive zero-knowledge protocol [10], the counting servers can individually verify the validity of the ballot.

Voter Verifiability. After voting, the voter receives the signed key from the ballot acquirer (as shown in Fig. 3). This means that the ballot acquirer has received the ballot. The counting servers also publish the same key, signed by these servers, on the bulletin board. The voter can easily verify that the key included by the voter in the ballot is published by the counting servers.

Universal Verifiability. Any observer can observe the contents of the bulletin board. The counting servers publish the signed key (signed by the counting servers) after ballot verification and counting. The voter publishes the signed key (signed by the ballot acquirer) after the ballot has been received by the ballot acquirer. The number of keys published by the counting servers must be equal to or less than the number of keys signed by the ballot acquirer (the counting servers may discard some invalid ballots after ballot verification). Since all the counting servers publish the same key, no single counting server can alter or delete a key without detection. A simple audit over the bulletin board contents is sketched below.
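The universal-verifiability check reduces to a simple comparison over the public bulletin board contents; a minimal audit sketch in Python (the key strings are placeholders for the published signed keys):

    # Public bulletin board contents (placeholders for signed keys).
    keys_signed_by_ba = {"K1", "K2", "K3", "K4"}  # published by voters after voting
    keys_signed_by_cs = {"K1", "K2", "K4"}        # published by the counting servers

    def audit(ba_keys: set, cs_keys: set) -> bool:
        """Any observer's check: every counted key was acquired, so the counted
        set is no larger than the acquired set (invalid ballots may be dropped)."""
        return cs_keys <= ba_keys

    assert audit(keys_signed_by_ba, keys_signed_by_cs)
    discarded = keys_signed_by_ba - keys_signed_by_cs  # candidates for voter claims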
Receipt-freeness and Coercion-resistance. We assume a voter computer inside the election booth that performs the cryptographic task on behalf of the voter to construct the ballot from a vote. The voter does not know the detailed construction of the ballot, so the voter cannot prove to anyone how the voting was done. We regard this voter computer as a trusted ballot generator. Thus, the voting scheme is receipt-free and, hence, coercion-resistant.
7 Conclusions and Future Plans
In this paper, we first present the security requirements for a secure Internet voting scheme. We then present a new scheme that satisfies all these security requirements. The scheme involves multiple parties for voter authentication (the registrar) and ballot generation (the voter computer), and multiple counting servers to count the ballots. We do not need to trust all the counting servers: as long as a single counting server is honest, the ballot verification and counting will be correct in our voting scheme. This essentially increases public trust in ballot counting. The voter and any observer can verify that the counting is correct by observing the contents of the bulletin board. Thus, voter verifiability and public verifiability are also satisfied. In summary, we have presented a voting scheme that satisfies all the basic and enhanced security requirements for Internet voting schemes. We have analyzed the security properties of our voting scheme informally. Ongoing work involves verification of these security properties using a formal verification tool, for example Isabelle [3]. After implementation of some parts of the scheme (counting), performance evaluation is an important item of future work.

Acknowledgement. The authors are thankful to Peter Ryan, University of Luxembourg, and David Gray and Denis Butin, Dublin City University, for their valuable comments and suggestions regarding the voting scheme presented in this paper.
References
1. Based, M.A., Mjølsnes, S.F.: Universally Composable NIZK Protocol in an Internet Voting Scheme. In: Cuellar, J., et al. (eds.) STM 2010. LNCS, vol. 6710, pp. 147–162. Springer, Heidelberg (2011)
2. Based, M.A., Reistad, T.I., Mjølsnes, S.F.: Internet Voting using Multiparty Computations. In: Proceedings of the 2nd Norwegian Security Conference (NISK 2009), pp. 136–147. Tapir Akademisk Forlag (2009) ISBN: 978-82-519-2492-4
3. Paulson, L.C.: The Inductive Approach to Verifying Cryptographic Protocols. Journal of Computer Security (2000)
4. Kremer, S., Ryan, M.: Analysis of an Electronic Voting Protocol in the Applied Pi Calculus. In: Sagiv, M. (ed.) ESOP 2005. LNCS, vol. 3444, pp. 186–200. Springer, Heidelberg (2005)
5. Groth, J.: Evaluating Security of Voting Schemes in the Universal Composability Framework. Springer, Heidelberg (2004) ISBN 978-3-540-22217-0
6. Schoenmakers, B.: A Simple Publicly Verifiable Secret Sharing Scheme and its Application to Electronic Voting. In: Wiener, M. (ed.) CRYPTO 1999. LNCS, vol. 1666, pp. 148–164. Springer, Heidelberg (1999)
7. Iversen, K.R.: The Application of Cryptographic Zero-Knowledge Techniques in Computerized Secret Ballot Election Schemes. Ph.D. dissertation, IDT-report 1991:3, Norwegian Institute of Technology (February 1991)
8. Meng, B.: Analyzing and Improving Internet Voting Protocol. In: Proceedings of the IEEE International Conference on e-Business Engineering, pp. 351–354. IEEE Computer Society, Los Alamitos (2007) ISBN 0-7695-3003-6
9. Boyen, X., Waters, B.: Compact Group Signatures Without Random Oracles. In: Vaudenay, S. (ed.) EUROCRYPT 2006. LNCS, vol. 4004, pp. 427–444. Springer, Heidelberg (2006)
10. Based, M.A., Mjølsnes, S.F.: A Non-interactive Zero Knowledge Proof Protocol in an Internet Voting Scheme. In: Proceedings of the 2nd Norwegian Security Conference (NISK 2009), pp. 148–160. Tapir Akademisk Forlag (2009) ISBN: 978-82-519-2492-4
11. Gray, D., Sheedy, C.: E-Voting: a new approach using Double-Blind Identity-Based Encryption. Presented at STM 2010: 6th International Workshop on Security and Trust Management, Athens, Greece (September 23-24, 2010)
12. Chaum, D.: Secret-ballot receipts: True voter-verifiable elections. IEEE Security and Privacy 2(1), 38–47 (2004)
13. Hirt, M., Sako, K.: Efficient Receipt-Free Voting Based on Homomorphic Encryption. In: Preneel, B. (ed.) EUROCRYPT 2000. LNCS, vol. 1807, pp. 539–556. Springer, Heidelberg (2000)
14. Krantz, S.G.: Zero Knowledge Proofs. AIM Preprint Series, 2007-46 (July 25, 2007)
15. Benaloh, J., Tuinstra, D.: Receipt-free secret-ballot elections. In: Proceedings of STOC 1994, pp. 544–553 (1994)
A Hybrid Graphical Password Based System

Wazir Zada Khan1, Yang Xiang2, Mohammed Y. Aalsalem1, and Quratulain Arshad1

1 School of Computer Science, Jazan University, Saudi Arabia
{wazirzadakhan,aalsalem.m}@jazanu.edu.sa, [email protected]
2 School of Information Technology, Deakin University, Australia
[email protected]
Abstract. In this age of electronic connectivity, where we all face viruses, hackers, eavesdropping and electronic fraud, there is indeed no time when security is not critical. Passwords provide a security mechanism for authentication and protection services against unwanted access to resources. Graphical passwords are one promising alternative to textual passwords. According to human psychology, humans are able to remember pictures easily. In this paper, we have proposed a new hybrid graphical password based system, which is a combination of recognition and recall based techniques that offers many advantages over the existing systems and may be more convenient for the user. Our scheme is resistant to shoulder surfing attacks and many other attacks on graphical passwords. This resistant scheme is proposed for small mobile devices (such as smart phones, e.g., iPod, iPhone, and PDAs), which are handier and more convenient to use than traditional desktop computer systems. Keywords: Graphical passwords, Authentication, Network Security.
1 Introduction

A password is a secret that is shared by the verifier and the customer. ”Passwords are simply secrets that are provided by the user upon request by a recipient.” They are often stored on a server in an encrypted form so that a penetration of the file system does not reveal password lists [2]. Passwords are the most common means of authentication because they do not require any special hardware. Typically passwords are strings of letters and digits, i.e., they are alphanumeric. Such passwords have the disadvantage of being hard to remember [3]. Weak passwords are vulnerable to dictionary attacks and brute force attacks, whereas strong passwords are harder to remember. To overcome the problems associated with password based authentication systems, researchers have proposed the concept of graphical passwords, which use pictures instead of textual passwords and are partially motivated by the fact that humans can remember pictures more easily than a string of characters [4]. Graphical passwords have been known since the mid-1990s; the idea was originally described by Greg Blonder in 1996 [5]. The first and most important advantage is that they are easier to remember than textual passwords. Human beings have the ability to remember faces of people, places they visit and things they have
seen for a longer duration. Thus, graphical passwords provide a means for making more user-friendly passwords while increasing the level of security. Besides these advantages, the most common problem with graphical passwords is the shoulder surfing problem: an onlooker can steal a user’s graphical password by watching in the user’s vicinity. Many researchers have attempted to solve this problem by providing different techniques [6]. Due to this problem, most graphical password schemes recommend small mobile devices (PDAs) as the ideal application environment. Another common problem with graphical passwords is that it takes longer to input graphical passwords than textual passwords [6]. The login process is slow and it may frustrate impatient users. Graphical passwords serve the same purpose as textual passwords, but consist of handwritten designs (drawings), possibly in addition to text. The use of smart phones, such as iPods and PDAs, has increased due to their small size, compact deployment and low cost. In this paper, considering the problems of text based password systems, we have proposed a new graphical password scheme which has desirable usability for small mobile devices. Our proposed system is a new hybrid graphical-password-based system, which is a combination of recognition and recall based techniques and consists of two phases. During the first phase, called the registration phase, the user has to first select his username and a textual password. Then objects are shown to the user to select from them as his graphical password. After selecting, the user has to draw those selected objects on a touch-sensitive screen using a stylus. During the second phase, called the authentication phase, the user has to give his username and textual password and then give his graphical password by drawing it in the same way as done during the registration phase. If they are drawn correctly, the user is authenticated and only then can he/she access his/her account. A minimal sketch of these two phases is given at the end of this section. For practical implementation of our system we have chosen the i-mate JAMin smart phone, which is produced by HTC; devices such as the Palm Pilot, Apple Newton, Casio Cassiopeia E-20 and others also allow users to provide graphics input. The i-mate JAMin has a display size of 240x320 pixels and an important handwriting recognition feature. The implementation details are out of the scope of this paper. The rest of the paper is organized as follows. In Section II, all existing graphical password based schemes are classified into four main categories. Section III gives a little review of existing research and schemes which are strongly related to our work. Section IV discusses the problems of all existing graphical password based schemes. In Section V our proposed system is described in detail. In Section VI we have compared our proposed system with existing schemes by drawing out the flaws in existing schemes. Section VII provides discussion. Finally Section VIII concludes the paper.
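A minimal Python sketch of the two phases described above, with the drawing-recognition step abstracted behind a stub (the hashing choices and the match_drawing function are our assumptions, not the paper's implementation):

    import hashlib, secrets

    accounts = {}  # username -> (salt, password_hash, selected_object_ids)

    def register(username: str, text_password: str, selected_objects: list):
        """Registration phase: store the textual password (hashed) and the
        identifiers of the objects chosen as the graphical password."""
        salt = secrets.token_bytes(16)
        pw = hashlib.pbkdf2_hmac("sha256", text_password.encode(), salt, 100_000)
        accounts[username] = (salt, pw, list(selected_objects))

    def match_drawing(drawn, expected_object) -> bool:
        # Stand-in for the device's handwriting/sketch recognizer.
        return drawn == expected_object

    def authenticate(username: str, text_password: str, drawings: list) -> bool:
        """Authentication phase: textual password first, then the drawn objects."""
        if username not in accounts:
            return False
        salt, pw, objects = accounts[username]
        cand = hashlib.pbkdf2_hmac("sha256", text_password.encode(), salt, 100_000)
        if not secrets.compare_digest(cand, pw):
            return False
        return len(drawings) == len(objects) and all(
            match_drawing(d, o) for d, o in zip(drawings, objects))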
2 Classification of Graphical Password Based Systems
Graphical password schemes can be broadly classified into four main categories: recognition-based, cued-recall based, pure-recall based and hybrid systems. Recognition Based Systems are also known as Cognometric Systems or Searchmetric Systems. Recognition based techniques involve identifying whether one has seen an image before: the user must only be able to recognize previously seen images, not generate them unaided from memory.
Pure Recall Based Systems are also known as Drawmetric Systems. In pure recall-based methods, the user has to reproduce something that he or she created or selected earlier, during the registration stage. Cued Recall Based Systems are also called Iconmetric Systems. In cued recall-based methods, a user is provided with a hint so that he or she can recall his/her password. Hybrid Systems are typically a combination of two or more schemes, such as recognition and recall based schemes, or textual with graphical password schemes.
3 Related Work
Haichang Gao et al. [10] proposed and evaluated a new shoulder-surfing resistant scheme called Come from DAS and Story (CDS), which has desirable usability for PDAs. This scheme adopts a drawing input method similar to DAS and inherits the association mnemonics of Story for sequence retrieval. It requires users to draw a curve across their password images (pass-images) in order, rather than clicking directly on them. The drawing method seems more compatible with people's writing habits, which may shorten the login time. The drawing input trick, along with complementary measures such as erasing the drawing trace, displaying degraded images, and starting and ending with randomly designated images, provides good resistance to shoulder surfing. A user study was conducted to explore the usability of CDS in terms of accuracy, efficiency and memorability, and to benchmark its usability against that of a Story scheme. The main contribution is that CDS overcomes a drawback of recall-based systems by erasing the drawing trace, and introduces the drawing method to a variant of Story to resist shoulder surfing. P.C. van Oorschot and Tao Wan [1] proposed a hybrid authentication approach called TwoStep, in which users continue to use text passwords as a first step but then must also enter a graphical password. In step one, a user is asked for her user name and text password. After supplying this, and independent of whether or not it is correct, in step two the user is presented with an image portfolio. The user must correctly select all images (one or more) pre-registered for this account in each round of graphical password verification; otherwise, account access is denied despite a valid text password. Using text passwords in step one preserves the existing user sign-in experience. If the user's text password and image selections so far are correct, the image portfolios presented are those defined during password creation. Otherwise, the image portfolios (including their layout dimensions) presented in the first and subsequent rounds are random, but are respectively a deterministic function of the user name and text password string entered, and of the images selected in the previous round.
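The deterministic decoy behaviour of TwoStep described above can be illustrated with a short sketch. This is our own illustration under assumed parameters, not the authors' implementation: the image pool, portfolio size, and hash construction are hypothetical, and the real scheme also folds the images selected in previous rounds into later portfolios.

```python
import hashlib
import random

# Assumed image set; in a real deployment this would be the server's pool.
IMAGE_POOL = [f"img_{i:03d}.png" for i in range(1000)]

def decoy_portfolio(username: str, text_pw: str, round_no: int,
                    size: int = 9) -> list[str]:
    """Portfolio shown in a given round: a deterministic function of the
    user name and text password entered, so a wrong password still yields
    a stable-looking portfolio and reveals nothing about step one."""
    seed = hashlib.sha256(f"{username}|{text_pw}|{round_no}".encode()).digest()
    rng = random.Random(seed)          # deterministic, per-input RNG
    return rng.sample(IMAGE_POOL, size)
```

The same inputs always produce the same portfolio, so an attacker probing with a guessed password cannot distinguish a decoy portfolio from a genuine one by repetition alone.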
4 Problem Domain
There are many problems with each of the graphical password based authentication methods; these are discussed below.
4.1 Problems of Recognition Based Methods
Dhamija and Perrig proposed a graphical password scheme, Déjà Vu, based on the Hash Visualization technique [11]. The drawback of this scheme is that the server needs to store a large number of pictures, which may have to be transferred over the network, delaying the authentication process. Another weakness is that the server needs to store the seeds of the portfolio images of each user in plaintext. Also, the process of selecting a set of pictures from the picture database can be tedious and time consuming for the user [7]. The scheme is also not really secure, because the passwords need to be stored in the database, where they are easy to see. Sobrado and Birget developed a graphical password technique that deals with the shoulder surfing problem [3]. In their first scheme, the system displays a number of pass-objects (pre-selected by the user) among many other objects, as shown in Fig. 1. To be authenticated, a user needs to recognize the pass-objects and click inside the convex hull formed by all of them. They developed several schemes to address shoulder surfing, but the main drawback of these schemes is that the login process can be slow. Another recognition based technique was proposed by Man et al.: a shoulder-surfing resistant algorithm similar to that developed by Sobrado and Birget. The difference is that Man et al. introduced several variants for each pass-object, with each variant assigned a unique code; during authentication, the user recognizes the pre-selected objects and enters a string formed from the alphanumeric codes of their displayed variants. Although it is very hard to break this kind of password, the method still requires the user to memorize the alphanumeric codes for all pass-object variants. "Passface" is another recognition based system. Its developers argue that it is easier for human beings to remember human faces than any other kind of password. However, Davis et al. [12] found that most users tend to choose faces of people from their own race, which makes the Passface password somewhat predictable. Furthermore, some faces might not be welcomed by certain users, making the login process unpleasant. Another limitation is that the system cannot be used by people who are face-blind [6].
4.2 Problems of Recall Based Methods
The problem with grid based methods is that, during authentication, the user must draw his/her password in the same grid cells and in the same sequence; it is really hard to remember the exact coordinates of the grid. The problem with Passlogix is that the full password space is small; in addition, a user-chosen password might be easily guessable [6]. The DAS scheme has several limitations: it is vulnerable to shoulder surfing if a user accesses the system in a public environment; there is still a risk that attackers gain access to the device if they obtain a copy of the stored secret; brute force attacks can be launched by trying all possible combinations of grid coordinates; drawing a diagonal line and identifying a starting point on an oval-shaped figure can itself be a challenge for users; and, finally, difficulties can arise when the user chooses a drawing containing strokes that pass too close to a grid-line, so the scheme may not be able to distinguish which cell the user is choosing.
"PassPoints" is an extended version of Blonder's idea, eliminating the predefined boundaries and allowing arbitrary images to be used. With this scheme, it takes time to locate the correct click region and to determine precisely where to click. Another problem with these schemes is that it is difficult to input a password through a keyboard, the most common input device; if the mouse does not function well or a light pen is not available, the system cannot work properly [6]. Overall, with both "PassPoints" and "Passlogix", looking for small spots in a rich picture might be tiresome and unpleasant for users with weak vision. VisKey's main drawback is input tolerance: pointing to the exact spots on the picture has proven to be quite hard, so VisKey accepts all input within a certain tolerance area around each spot, and also allows users to set the size of this area in advance. However, some caution related to the input precision is needed, since it directly influences the security and usability of the password. With practically settable parameters, a four-spot VisKey theoretically provides approximately 1 billion possibilities for defining a password. Unfortunately, this is not large enough to prevent offline attacks by a high-speed computer, so no fewer than seven defined spots are required to overcome the likelihood of brute force attacks.
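The tolerance parameters behind VisKey's figure are not given here, so the following back-of-the-envelope calculation rests on assumed values: if the screen offers roughly $T$ distinguishable tolerance regions and a password consists of $k$ ordered spots, the theoretical password space is $T^k$. Taking $T \approx 180$ gives $180^4 \approx 1.05 \times 10^9 \approx 2^{30}$ for four spots, consistent with the approximately 1 billion possibilities quoted above and small enough for an offline search, whereas seven spots give $180^7 \approx 6 \times 10^{15} \approx 2^{52}$.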
Fig. 1. A shoulder-surfing resistant graphical password scheme [3]
5 Proposed System
Taking into account the problems and limitations of the graphical password schemes discussed above, we have proposed a hybrid system for authentication. This hybrid system is a mixture of both recognition and recall based schemes. Our proposed system is an approach towards more reliable, secure, user-friendly, and robust authentication. We have also reduced the shoulder surfing problem to some extent.
5.1 Proposed Algorithm
Steps 1-2 are registration steps and steps 3-9 are authentication steps. The algorithm of our proposed system is as follows:
─ Step 1: The user types a username and a textual password, which are stored in the database. During authentication the user has to give that specific username and textual password in order to log in.
─ Step 2: Objects are displayed to the user and he/she selects a minimum of three objects from the set; there is no limit on the maximum number of objects. This is done using one of the recognition based schemes. The selected objects are then drawn by the user and stored in the database with the specific username. Objects may be symbols, characters, auto shapes, simple everyday objects, etc. Examples are shown in Figure 2.
─ Step 3: During authentication, the user draws the pre-selected objects as his password on a touch-sensitive screen (or according to the environment) with a mouse or a stylus. This is done using pure recall based methods.
─ Step 4: The system performs pre-processing.
─ Step 5: The system takes the input from the user and merges the strokes in the user-drawn sketch.
─ Step 6: After stroke merging, the system constructs the stroke hierarchy.
─ Step 7: The system performs sketch simplification.
─ Step 8: Three types of features are extracted from the sketch drawn by the user.
─ Step 9: The last step is hierarchical matching.
During registration, the user selects the username and a textual password in a conventional manner and then chooses the objects as password. The minimum length of the textual password is L = 6; it can be a mixture of digits, lowercase and uppercase letters. After this, the system shows objects on the screen of the PDA to select as a graphical password. After choosing the objects, the user draws them on the screen with a stylus or a mouse. The objects drawn by the user are stored in the database
Fig. 2. Some examples of objects shown to the user
with his/her username. In object selection, each object can be selected any number of times. A flow chart of the registration phase is shown in Figure 3. During authentication, the user first gives his username and textual password and then draws the pre-selected objects. These objects are matched against the templates of all the objects stored in the database. A flow chart of the authentication phase is shown in Figure 4. The phases during authentication (pre-processing, stroke merging, hierarchy construction, sketch simplification, feature extraction, and hierarchical matching) are the steps proposed by Wing Ho Leung and Tsuhan Chen in their paper [13], where they propose a novel method for the retrieval of hand-drawn sketches from a database, finally ranking the best matches. In the proposed system, the user is authenticated only if the drawn sketch fully matches the selected object's template stored in the database. Pre-processing of hand-drawn sketches is done prior to recognition and normally involves noise reduction and normalization. The noise in the user-drawn image is generally due to the limited accuracy of human drawing [14]. A number of techniques can be used to reduce noise, including smoothing, filtering, wild point correction, etc. In the proposed system, Gaussian smoothing is used, which eliminates noise introduced by the tablet or by shaky drawing.
The Gaussian smoothing kernel in one dimension is $G(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-x^2/(2\sigma^2)}$, or specifically in two dimensions, $G(u,v) = \frac{1}{2\pi\sigma^2} e^{-r^2/(2\sigma^2)}$, where $r$ is the blur radius ($r^2 = u^2 + v^2$) and $\sigma$ is the standard deviation of the Gaussian distribution. If the user draws a very large or a very small sketch, the system performs size normalization, which adjusts the symbols or sketches to a standard size. The stroke merging phase is used to merge strokes that are broken at end points. If the end points are not close, a stroke is considered an open stroke, and it may
be merged with another open stroke if the end point of one stroke is close to the end point of the other. The strokes are then represented in a hierarchy to simplify the image and make it meaningful for further phases [13]. In the next step, sketch simplification, a shaded region is represented by a single hyper-stroke. After sketch simplification, three types of features are extracted from the user's re-drawn sketch: hyper-stroke features, stroke features, and bi-stroke features.
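To make the pre-processing steps concrete, the following is a minimal sketch of Gaussian smoothing and size normalization for a stroke given as a sequence of (x, y) sample points; here the smoothing is applied along the stroke's sample sequence. The parameter values (sigma, the target size) are our own assumptions rather than values from the paper.

```python
import numpy as np

def gaussian_smooth(points, sigma: float = 2.0) -> np.ndarray:
    """Smooth a stroke given as an (n, 2) array of (x, y) samples by
    taking a Gaussian-weighted average over neighbouring samples."""
    pts = np.asarray(points, dtype=float)
    idx = np.arange(len(pts))
    out = np.empty_like(pts)
    for i in range(len(pts)):
        w = np.exp(-((idx - i) ** 2) / (2.0 * sigma ** 2))  # Gaussian weights
        out[i] = (w / w.sum()) @ pts                          # weighted mean
    return out

def normalize_size(points, target: float = 100.0) -> np.ndarray:
    """Scale the sketch so its bounding box has a standard size."""
    pts = np.asarray(points, dtype=float)
    mins = pts.min(axis=0)
    span = (pts.max(axis=0) - mins).max()
    return (pts - mins) * (target / max(span, 1e-9))
```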
Fig. 3. Flow chart for Registration Phase
In the last step, hierarchical matching, the similarity is evaluated in a top-to-bottom hierarchical manner. The user is allowed to draw in an unrestricted manner. The overall process is challenging because free-hand sketching is a difficult job. The order in which the user selected the objects matters in our proposed system, i.e., during the authentication phase, the user must draw his pre-selected objects in the same order as he selected them during the registration phase. In this way, the total number of combinations for each password is 2^n - 1, where n is the number of objects selected by the user as his password during the registration phase.
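Putting the two phases together, the following is a minimal sketch of the registration and authentication flow under stated assumptions: match_sketch is a hypothetical stand-in for the hierarchical matching pipeline of [13], and the storage structures are illustrative only.

```python
import hashlib

DB = {}  # username -> record; stands in for the device database

def match_sketch(drawn, template) -> bool:
    """Placeholder for the matching pipeline of [13] (pre-processing,
    stroke merging, feature extraction, hierarchical matching)."""
    return drawn == template  # exact equality stands in for template matching

def register(username, text_pw, objects_in_order, drawn_templates):
    # Constraints from the paper: text password of length >= 6 and at
    # least three selected objects; the object order is part of the secret.
    assert len(text_pw) >= 6 and len(objects_in_order) >= 3
    DB[username] = {
        "pw": hashlib.sha256(text_pw.encode()).hexdigest(),
        "objects": list(objects_in_order),
        "templates": list(drawn_templates),
    }

def authenticate(username, text_pw, drawn_sketches) -> bool:
    rec = DB.get(username)
    if rec is None or hashlib.sha256(text_pw.encode()).hexdigest() != rec["pw"]:
        return False
    if len(drawn_sketches) != len(rec["templates"]):
        return False
    # Each sketch must match the stored template at the same position,
    # enforcing the registration order of the objects.
    return all(match_sketch(s, t)
               for s, t in zip(drawn_sketches, rec["templates"]))
```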
6 Comparison of Proposed System with Existing Schemes
Our system offers many advantages over existing systems, as discussed below. Compared to the "Passface" system, our system can also be used by those who are face-blind. We have used objects instead of human faces for the password because, later, during the authentication phase, the user has to draw his/her password, and it is much more difficult to draw human faces than simple objects. We also believe that everyday objects are easier to remember than human faces. Our system eliminates the problem with grid based techniques, where the user has to remember exact coordinates, which is not easy; our system just compares the shapes of the objects drawn by the user during authentication. Our scheme is less vulnerable to brute force attacks as the password space is large. It is also less vulnerable to online and offline dictionary attacks. Since
Fig. 4. Flow chart for Authentication Phase
a stylus is used, drawing objects is easy for the user, and carrying out a dictionary attack is impractical. Our scheme improves on the Man et al. scheme, because in that scheme the user has to remember both the objects and the associated codes and strings; in our method the user has to remember only the objects he selected and the way he drew them during registration. Our proposed system differs from CDS in that the user has to first select a textual password and then a graphical password, making it more secure. Compared to van Oorschot and Wan's TwoStep authentication system, our system is more secure, since users not only select a graphical password but also draw it: even if the textual password is compromised, the graphical password cannot be stolen or compromised, since the user must also draw it. Our proposed system works in the same way as the TwoStep system, i.e., the user has to choose a textual password before choosing a graphical password, but the difference is that in our system, during authentication, after giving the username and textual password, the user has to draw his graphical password, which is matched against the stored template drawn during the registration phase. This approach protects the password from being compromised and prevents attackers from launching various attacks. Thus our system is more secure and reliable than the TwoStep authentication system. As with all graphical based systems, our system will also be slow; the normalization and matching will take time. Another issue is that our system is somewhat user dependent during authentication: it depends upon the user's drawing ability, so the system may fail to verify the objects drawn by the user, and as a result the actual user may not be authenticated. The possible attacks on graphical passwords are brute force attacks, dictionary attacks, guessing, spyware, shoulder surfing and social engineering. Graphical passwords are less vulnerable to these attacks than text based passwords, and researchers believe it is more difficult to break graphical passwords using these traditional attack methods. Our system is resistant to almost all the possible attacks on graphical passwords.
7 Conclusion and Future Work
The core element of computational trust is identity. Currently many authentication methods and techniques are available, each with its own advantages and shortcomings. There is a growing interest in using pictures as passwords rather than text, but relatively little research has been done on graphical passwords so far. In view of the above, we have proposed an authentication system based on graphical password schemes. Although our system aims to reduce the problems with existing graphical password schemes, it also has some limitations and issues, like all other graphical password techniques. To conclude, we need our authentication systems to be more secure, reliable and robust, and there is always room for improvement. Currently we are working on the system implementation and evaluation. In the future, other important aspects of our system will be investigated, such as user adoptability, usability and security.
References
1. van Oorschot, P.C., Wan, T.: TwoStep: An Authentication Method Combining Text and Graphical Passwords. In: 4th International Conference, MCETECH 2009, Ottawa, Canada (May 4-6, 2009)
2. Authentication, http://www.objs.com/survey/authent.htm (last visited on May 15, 2011)
3. Sobrado, L., Birget, J.C.: Graphical Passwords. The Rutgers Scholar, An Electronic Bulletin for Undergraduate Research, vol. 4 (2002), http://rutgersscholar.rutgers.edu/volume04/sobrbirg/sobrbirg.htm
4. Elftmann, P.: Diploma Thesis, Secure Alternatives to Password-Based Authentication Mechanisms, Aachen, Germany (October 2006)
5. Blonder, G.E.: Graphical password. U.S. Patent 5559961, Lucent Technologies, Inc., Murray Hill, NJ (August 1995)
6. Suo, X., Zhu, Y., Owen, G.S.: Graphical Passwords: A Survey. In: Proceedings of the Annual Computer Security Applications Conference (2005)
7. Approaches to Authentication, http://www.e.govt.nz/plone/archive/services/see/see-pki-paper-3/chapter6.html (last visited on May 15, 2011)
8. Roman, V.Y.: User authentication via behavior based passwords. In: Systems, Applications and Technology Conference, Farmingdale, NY (2007)
9. Biometric Authentication, http://www.cs.bham.ac.uk/~mdr/teaching/modules/security/lectures/biometric.html (last visited on May 02, 2011)
10. Gao, H., Ren, Z., Chang, X., Liu, X., Aickelin, U.: A New Graphical Password Scheme Resistant to Shoulder-Surfing. In: 2010 International Conference on CyberWorlds, Singapore (October 20-22, 2010)
11. Perrig, A., Song, D.: Hash Visualization: A New Technique to Improve Real-World Security. In: International Workshop on Cryptographic Techniques and E-Commerce, pp. 131–138 (1999)
12. Davis, D., Monrose, F., Reiter, M.K.: On User Choice in Graphical Password Schemes. In: 13th USENIX Security Symposium (2004)
13. Leung, W.H., Chen, T.: Hierarchical Matching for Retrieval of Hand-Drawn Sketches. In: Proceedings of the International Conference on Multimedia and Expo (ICME 2003), vol. 2 (2003)
14. Khan, H.Z.U.: Comparative Study of Authentication Techniques. International Journal of Video & Image Processing and Network Security (IJVIPNS) 10(04)
15. Token Based Authentication, http://www.w3.org/2001/sw/Europe/events/foaf-galway/papers/fp/token_based_authentication/ (last visited on May 02, 2011)
16. Knowledge Based Authentication, http://csrc.nist.gov/archive/kba/index.html (last visited on May 02, 2011)
17. Knowledge Based Authentication, http://searchsecurity.techtarget.com/definition/knowledge-based-authentication (last visited on May 02, 2011)
18. A Survey on Recognition Based Graphical User Authentication Algorithms, http://www.scribd.com/doc/23730953/A-Survey-on-Recognition-Based-Graphical-User-Authentication-Algorithms (last visited on May 02, 2011)
19. Jain, A., Bolle, R., Pankanti, S. (eds.): Biometrics: Personal Identification in Networked Society. Kluwer Academic, Boston (1999)
20. Hurson, A.R., Ploskonka, J., Jiao, Y., Haridas, H.: Security Issues and Solutions in Distributed Heterogeneous Mobile Database Systems. In: Advances in Computers, vol. 61, pp. 107–198 (2004)
21. Biddle, R., Chiasson, S., van Oorschot, P.C.: Graphical Passwords: Learning from the First Twelve Years. Carleton University - School of Computer Science, Technical Report TR-11-01 (January 4, 2011)
22. Weinshall, D.: Cognitive Authentication Schemes Safe Against Spyware (short paper). In: IEEE Symposium on Security and Privacy (May 2006)
23. Hayashi, E., Christin, N., Dhamija, R., Perrig, A.: Use Your Illusion: Secure Authentication Usable Anywhere. In: 4th ACM Symposium on Usable Privacy and Security (SOUPS), Pittsburgh (July 2008)
24. Davis, D., Monrose, F., Reiter, M.: On User Choice in Graphical Password Schemes. In: 13th USENIX Security Symposium (2004)
25. Passfaces Corporation: The Science Behind Passfaces. White paper, http://www.passfaces.com/enterprise/resources/white_papers.htm (last visited on May 05, 2011)
26. De Angeli, A., Coventry, L., Johnson, G., Renaud, K.: Is a Picture Really Worth a Thousand Words? Exploring the Feasibility of Graphical Authentication Systems. International Journal of Human-Computer Studies 63(1-2), 128–152 (2005)
27. Moncur, W., Leplatre, G.: Pictures at the ATM: Exploring the Usability of Multiple Graphical Passwords. In: ACM Conference on Human Factors in Computing Systems (CHI) (April 2007)
28. Pering, T., Sundar, M., Light, J., Want, R.: Photographic Authentication Through Untrusted Terminals. In: Pervasive Computing, pp. 30–36 (January-March 2003)
29. Wiedenbeck, S., Waters, J., Sobrado, L., Birget, J.: Design and Evaluation of a Shoulder-Surfing Resistant Graphical Password Scheme. In: International Working Conference on Advanced Visual Interfaces (AVI) (May 2006)
30. Bicakci, K., Atalay, N.B., Yuceel, M., Gurbaslar, H., Erdeniz, B.: Towards Usable Solutions to Graphical Password Hotspot Problem. In: 33rd Annual IEEE International Computer Software and Applications Conference (2009)
31. Jermyn, I., Mayer, A., Monrose, F., Reiter, M., Rubin, A.: The Design and Analysis of Graphical Passwords. In: 8th USENIX Security Symposium (August 1999)
32. Valentine, T.: An Evaluation of the Passface™ Personal Authentication System. Technical Report, Goldsmiths College, University of London, London (1998) (the first report known in the literature)
Privacy Threat Analysis of Social Network Data Mohd Izuan Hafez Ninggal and Jemal Abawajy School of Information Technology, Deakin University, 3217 Victoria, Australia {mninggal,jemal}@deakin.edu.au
Abstract. Social network data has been increasingly made publicly available and analyzed in a wide spectrum of application domains. The practice of publishing social network data has brought privacy concerns to the forefront, and serious concerns about privacy protection in social networks have been raised in recent years. Realizing the promise of social network data requires addressing these concerns. This paper considers privacy disclosure in social network data publishing. We present a systematic analysis of the various risks to privacy in the publishing of social network data, and we identify various attacks that can be used to reveal private information from such data. This information is useful for developing practical countermeasures against privacy attacks. Keywords: Privacy disclosure, Social networks, Threat analysis, Data publications.
1 Introduction
Online social networking has become one of the most popular activities on the web [1-2]. The dramatic increase in the number, size, and variety of online social networks has generated interesting data management and data mining problems. An important concern in the release of these data for study is privacy, since social networks usually contain personal information. Privacy is an important issue when one wants to make use of data that involves individuals' sensitive information. While the immense potential and importance of social networks as a communication tool is expressly acknowledged, many real-world social networks contain sensitive information, and serious privacy concerns have been raised. Social networks often contain private attribute information about individuals as well as their sensitive relationships. Simply removing or replacing identifying attributes such as name and SSN by meaningless unique identifiers is far from sufficient to protect privacy [3]. Serious concerns about privacy protection in social networks have been raised in recent years [4-5], particularly when social network data is published [3]. The increasing availability of rich social media, popular online social networking sites, and sophisticated data mining techniques have made privacy in social networks a serious concern. As a result, the thriving phenomenon of social networks has also attracted the attention of lawmakers. In a significant number of countries (primarily, but not exclusively, in Europe and North America), Data Protection Authorities have
started to scrutinize both the business models of social networks and the practices performed on such platforms. Such attention is usually reserved for general privacy issues (such as identity theft, online fraud, and data security) or for sector-specific data protection problems (such as financial transactions, e-commerce, electronic communication, and children's protection), but not yet for marketing performed on such platforms (the exceptions, with provisions specifically governing commercial communication on social networks, are a number of countries that are members of the European Union). The US has in place statute law provisions and self-regulation guidelines specifically dealing with privacy issues on social networks. In most countries, the general provisions and requirements in place for the processing of personal data will apply to social networks. In some jurisdictions (including several non-European ones in North and South America as well as in Asia and Oceania), dedicated rules are in preparation and likely to come into force in the near future. In this paper, we consider privacy disclosure in social network data publishing. We present a systematic analysis of the various risks to privacy in published social network data. Privacy becomes an important issue when one wants to make use of data that involves individuals' sensitive information, and social network data often contains sensitive information about its users [4-6]. Simply removing all identifiable personal information (such as names and social security numbers) before releasing the data is insufficient, and the conventional methods proposed for microdata privacy [7] cannot be used directly to ensure the privacy of social network data, due to the complex dependencies between the data and the various relationships. We survey and analyze the various possible privacy leaks that can arise when social network data is published. We identify various attacks, such as different structural queries, that can be used to reveal private information from social network data. This information is useful for developing countermeasures to reduce the risk of such attacks. The rest of the paper is organized as follows. Section 2 presents a high level architecture of social networks and discusses the motivation for social network data publication. Section 3 analyses various threats to social network data privacy. The classification of privacy breaches is presented in Section 4. Finally, Section 5 gives the conclusion.
2 Publishing Social Network Data
Social networks have become an important data source; certainly, they have made data collection on individuals much easier. This new phenomenon has generated a wealth of data that is collected and maintained by social media service providers, and sometimes it is necessary or beneficial to release such data to the public. The data generated by social media services is often referred to as social network data. In many situations, the data needs to be published and shared with others. The usefulness of social network data in capturing real-world social activities has attracted many parties demanding the data for analysis purposes. Social network analysis has been a key technique in modern sociology, geography, economics, and information science [3]. Researchers in sociology, epidemiology, and health-care related fields collect data about geographic, friendship, family, and sexual networks; social network data is useful to them to study disease propagation and risk. In addition, there is also
increased interest from researchers in government institutions in mining social network data for information and security purposes [8]. All these pursuits require the data to be shared or published.
Fig. 1. High level system components of social network
Social networks describe entities (often people) and the relationships between them. Social network analysis is often used to understand the nature of these relationships, such as patterns of influence in communities, or to detect collusion and fraud. The high level system components of a social network are shown in Fig. 1. In the architecture, there are users, social media services, and third party data recipients. Examples of social media services include collaborative projects, blogs, content communities, social networking sites, virtual game worlds, and virtual communities [9]. Facebook, the most popular social networking site, currently has more than 500 million active users, who spend over 700 billion minutes per month using the application [10]. Social media service users can be any real-world entity, such as an individual or an organization. When a user signs up for an online social media service, he/she is usually asked to create a profile and to give information about himself/herself. This includes personally identifiable information, like a social security number, name and phone number, which uniquely identifies a person. Users may also give semi-identifiable information, like their home address, the school they attended or the company they have worked for, as well as private or sensitive information that they may wish to make available to selected entities while keeping it hidden from public view. Sensitive information can include religion, political views, type of disease (as in a healthcare network) or income (as in a financial network). On top of that, there is also data generated from social activity on the services, some of which may also carry sensitive information. All this information is kept and maintained by the service provider. In many situations, the data needs to be published and shared with others. The overall intent is for the data to be used for the public good, such as in the evaluation of economic models, in the identification of social trends, and in the pursuit of the state-of-the-art in various fields. Usually, such data contains personal information such as medical records, salaries, and so on, so that a straightforward release of the data is not appropriate. For example, business companies analyze the social connections in social network data to uncover customer relationships that can benefit their services and product sales. The analysis of social network data is believed to potentially provide an alternative view of real-world phenomena due to the strong connection
between the actors behind the network data and real-world entities [11-12]. In fulfilling the demands for the data, online social media operators have been sharing the data they maintain with external third parties such as business advertisers, application developers, and academic researchers. Therefore, releasing the data to third parties has to be done in a way that can guarantee the privacy of the users; in other words, the data must undergo a privacy-preserving phase before being released. Online social media service providers who maintain the data may have a specific interest in particular analysis outcomes of their data, but due to a lack of in-house expertise to conduct the analysis, outsourcing the task to external parties often comes as the alternative option. In other situations, the owner of the data shares the data with third parties; for example, advertising partners tend to be interested in the sort of information held by the service provider, since the data usually contains valuable information that can enable better social targeting of advertisements. The request to use the data can also come from third party applications embedded in the social media application itself. For instance, Facebook has thousands of third-party applications, and the number is growing exponentially [13]. Even though the process of data sharing in this case is implicit, the data is indeed passed from the data owner (the service provider) to a different party (the application), and the data given to these applications is usually not sanitized to protect users' privacy [14]. Social network data usually contains users' private information, so it is important to protect this information in any sharing activities. When data is shared, the risk of privacy violation is a key problem both for those participating in activities on social networks (be it for private or for business purposes) and for the companies running such platforms. Thus, publishing the data may violate individual privacy. Individual privacy is defined as "the right of the individual to decide what information about himself should be communicated to others and under what circumstances" [15]. A privacy breach occurs when sensitive information about the user is disclosed to an adversary. There are well-known examples of unintended disclosure of private information, such as the AOL search data [16] and attacks on Netflix data [9]. The problem is challenging due to the diversity and complexity of the data and its relationships, on which an adversary can use many types of background knowledge to conduct an attack. In the following section, we analyze various threats to social network data. Although in many ways a user gives 'consent' when signing up to an online social network site, most users are unaware of the implications of voluntarily providing personal information on profiles, and are not aware of how this information may be processed. The privacy implications associated with online social networking depend on the level of identifiability of the information provided, its possible recipients, and its possible uses [17].
According to a survey released for EU Data Protection Day 2010, almost 50% of human resources professionals active in Europe perform online checks on candidates, and 25% of all job applications are rejected based on the results of searches performed on candidates' online reputations and profiles, where dismissal is usually grounded on 'inappropriate comments' or 'unsuitable photos or videos' found on the Internet. Interestingly, the same survey shows that, on the other side, consumers continue to significantly underestimate the risks deriving from their online profiles: in the UK only 9% (in France 10% and in
Germany 13%) of job seekers believed that personal information available about them on the Internet would influence the outcome of their applications. In the health care area, Personal Health Record (PHR) systems such as Google Health (health.google.com) allow users to store and manage personal information, including health information, emergency contacts, insurance plans, medications, immunizations, past procedures, test results, medical conditions, allergies, family histories and lab results. Sharing of this information across user accounts is also supported. Placing detailed health histories online could expose users to significant risks [18]. Social finance network services like Wikinvest (www.wikinvest.com) are also changing the way finance is done. Unacceptable disclosure of these types of data can result in serious consequences for individuals, ranging from scams and fraud to physical threats.
3 Threat Analysis
In this section, we examine how a social network is translated into a data graph, what kinds of sensitive information may be at risk, and how an adversary may launch an attack on individual privacy.
3.1 Data Representation
The data generated by social media services is usually viewed or represented as a graph network that contains vertices and connections between them. The network considered here is binary, symmetric, and without self-loops. Formally, a social network can be represented as a graph $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_n\}$ is a set of vertices and $E \subseteq V \times V$ is the set of edges. We define $\deg(v)$ as the degree of vertex $v$, which is the number of vertices connected to $v$.
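As a minimal illustration of this representation, the following sketch stores an undirected, unlabelled graph as an adjacency list and exposes the degree function; the class and method names are our own, not from the paper.

```python
from collections import defaultdict

class SocialGraph:
    """Binary, symmetric graph G = (V, E) without self-loops."""

    def __init__(self):
        self.adj = defaultdict(set)  # vertex -> set of adjacent vertices

    def add_edge(self, u, v):
        if u != v:                   # no self-loops, as stated above
            self.adj[u].add(v)
            self.adj[v].add(u)       # symmetric (undirected) edge

    def degree(self, v):
        """deg(v): number of vertices connected to v."""
        return len(self.adj[v])
```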
Andy Eddie
Izuan
Claudia Bob Davood
Fed Gary Jemal
a) Original network
b) Naïve anonymized network
Fig. 2. Graphical social network representation
Fig. 2 illustrates an example of a social network as a graph. The vertices usually represent real-world actors or entities, such as individuals or organizations. Each vertex has a profile that usually contains personal attributes, such as name, gender, birth date, political view, religion, etc. These individuals are connected by edges that represent some sort of social tie or link between them. For example, in social networking sites, these edges represent the connected friends each member has.
Therefore, an edge can also have attributes that describe the properties of the connection. For example, in the content-communities type of social media, this attribute can be the content of the document being collaborated on, or the text comments and timestamps that a user made in his/her subscribed blogs. It may also be the type of relationship between two users in social networking sites (e.g., political affiliation).
3.2 Background Knowledge
The first step in understanding attacks on privacy is to know what external information about a graph may be acquired by an adversary. Background knowledge is the information that is known to the attacker and used to perpetrate a privacy attack. In social network data, the information that can be used as background knowledge to intrude on user privacy consists of personal attributes and structural (or topological) attributes. A personal attribute is information that describes a person, such as name, address, date of birth, political view, etc. Some attributes act as identifiers by themselves, since they can be unique to individuals. Other attributes act as semi-identifiers or quasi-attributes; several quasi-attributes combined together can potentially identify a person, and they are therefore usually exploited as mapping parameters to find a targeted individual in social network data.
Fig. 3. Structural information of Gary
The structural attribute is information that describes how an entity is connected to other entities in the social network data. This information includes:
i. Degree - the number of direct social links or relationships that an entity has; Fig. 3(a) shows the social links that Gary has. This does not carry direct sensitive information but can be used as an effective mapping parameter when searching for a target individual in a network. Furthermore, acquiring this information from the network is relatively easy: in Facebook, for example, the number of friends that appears in a user profile is the number of direct social ties the user has. Degree does not uniquely represent an individual; however, in situations where the data range is small - e.g., involving only a specific social group in the network - the returned results could be very few or possibly unique. Using Gary's degree, an adversary can map Gary to vertex 7 in the naïve anonymized network of Fig. 2(b).
ii. Neighborhood - a set of neighboring entities that have direct social links to a target entity and may also have mutual links among themselves, as in Fig. 3(b). If an adversary knows that Gary has four best friends, three of whom also know each other, the adversary could still map Gary to vertex 7 even if several other vertices share the same degree.
iii. Sub-graph - a set of relationships to which the target entity is connected, forming a subset of the whole graph (Fig. 3(c)). Assuming that an adversary knows this richer information may seem too strong. However, an adversary may create a set of dummy profiles with social links between them in certain patterns, and then use those dummy profiles to establish social links to the target individual; a social link can be established simply by adding the target to a friend list or address book. Alternatively, the adversary can construct a coalition with other friends, likewise forming a small, uniquely identifiable sub-graph. Knowing the specific pattern of relationships that he/she purposely created, the attacker later uses that pattern to locate the target individual in the released data. These are known as the active attack and the passive attack, respectively [5].
iv. Network graph metrics - a network graph has many metrics, some of which can implicitly reveal an individual. For instance, in a closed community network such as a political movement group, the centrality metric could potentially reveal the leader of the group. In another situation, an individual may not know about the other relationships of his/her neighbors; the closeness metric could expose unexpected affiliations among some of those neighbors. Even if a social network is published simply as a graph with no other information (such as node attributes), re-identification attacks are still possible: it has been shown that if an attacker (or a set of attackers) participates in the social network, then in many cases it is possible for the attacker to identify nodes corresponding to accounts under his control [5].
3.3 Data Mapping Mechanism
Adversaries normally access the data by performing queries, usually with several parameters such as auxiliary information. Structural queries are a series of knowledge queries that provide answers to restricted knowledge about a target node in the network [19]. These queries exploit the structural information that may be available to an adversary, including complete and partial descriptions of vertex neighborhoods, and connections to hubs in the network.
i. Vertex refinement queries - a locally expanding structural query that describes the structure of the local neighborhood from a vertex's perspective (the targeted individual's perspective) in an iterative way. The weakest knowledge query, $H_0(v)$, simply returns the attribute information (label) of the vertex $v$; for an unlabelled graph, it returns null. $H_1(v)$ returns the degree (the number of social links) of vertex $v$, and $H_2(v)$ returns the degree of each neighbor of vertex $v$. These iterative queries can be defined as $H_i(v) = \{ H_{i-1}(z_1), H_{i-1}(z_2), \ldots, H_{i-1}(z_m) \}$, where $z_1, \ldots, z_m$ are the vertices adjacent to $v$; that is, $H_i(v)$ returns the multiset of $H_{i-1}$ values of the vertices adjacent to $v$.
Example 3.1. Fig. 3 shows the computation of $H_0$, $H_1$, and $H_2$ for each node. $H_0(v)$ is the vertex label of each vertex; in this case the graph is unlabelled, so $H_0(v)$ is null and uniform for all vertices. Assume the targeted individual is Gary in Fig. 2(a); then $H_1(Gary) = \{4\}$, which is Gary's degree, and $H_2(Gary) = \{2, 2, 3, 3\}$, which represents Gary's neighbors' degrees.
ii. Sub-graph queries - these queries identify the existence of a sub-graph around the targeted vertex by counting the number of edges in the described sub-graph. With these queries, the adversary is assumed to be able to gather some fixed number of social links focused around the target $v$. By exploring the neighborhood of $v$, the adversary learns of the existence of a sub-graph around $v$, representing partial information about the structure around $v$.
iii. Hub fingerprint queries - these queries give information about how a vertex is linked to a set of selected hubs in the network. A hub is a vertex with high degree and high betweenness centrality; hubs are often outliers in a network. For example, in social networking sites like Facebook, a hub may correspond to a very famous person with a very high number of social links. Together, these queries represent a range of structural information that may be available to adversaries, including complete and partial descriptions of a vertex's local neighborhood and a vertex's connections to hubs in the network.
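The vertex refinement queries can be sketched in a few lines of code. Since the full edge set of Fig. 2 is not given in the text, the adjacency list below is a hypothetical graph constructed only to be consistent with Example 3.1 (deg(Gary) = 4 and Gary's neighbors have degrees {2, 2, 3, 3}).

```python
from collections import defaultdict

# Hypothetical edge set consistent with Example 3.1.
EDGES = [("Gary", "Andy"), ("Gary", "Bob"), ("Gary", "Claudia"),
         ("Gary", "Davood"), ("Andy", "Eddie"), ("Bob", "Eddie"),
         ("Claudia", "Davood"), ("Claudia", "Eddie"), ("Davood", "Eddie")]

adj = defaultdict(set)
for u, v in EDGES:
    adj[u].add(v)
    adj[v].add(u)

def H(i: int, v: str):
    """Vertex refinement query H_i(v) on an unlabelled graph."""
    if i == 0:
        return None                       # H_0: vertex label (null here)
    if i == 1:
        return len(adj[v])                # H_1: degree of v
    # H_i: multiset (as a sorted list) of H_{i-1} over v's neighbors
    return sorted(H(i - 1, z) for z in adj[v])

print(H(1, "Gary"))   # 4
print(H(2, "Gary"))   # [2, 2, 3, 3] -- Gary's neighbors' degrees
```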
4 Classification of Privacy Breach
Privacy breaches in social networks can be categorized into identity disclosure, sensitive link disclosure and sensitive attribute disclosure [20-21]. Identity disclosure happens when an adversary is able to map a record to a specific individual. Identity disclosure may be considered the key privacy violation in social networks because it usually leads to the disclosure of content information as well as information about the relationships the individual has. It can also lead to the revelation of an individual's existence in a closed community network where he/she has a strong expectation that membership remains private. For example, Facebook allows its users to create network groups with invited-only membership; such closed community groups could range in theme from secret societies to political movements to religious purposes. Therefore, revealing someone's existence in such a group would also violate their privacy. In the sensitive link disclosure attack, the relationships between two individuals are revealed. The links among vertices in social network data can symbolize relationships between individuals or organizations; this information is generated from social activities when using social media services. There are relationships that are safe for the public to know, but individuals may not be prepared to reveal certain specific relationships. An adversary may want to know the degree of relationship between two entities. The disclosure occurs when the adversary is able to find out the existence of a relationship between two users that the involved individuals prefer to keep private. For example, in social network data, based on the friendship relationships of a person and the public preferences of the friends, such as political affiliation, it may be possible to infer the personal preferences of the person in
question as well. Two entities in a social network may have many connections; some are safe for the public to know, and others should remain private. If the relationship between two individuals can be determined via a certain path, then privacy is compromised. In the sensitive attribute disclosure attack, the sensitive data associated with a vertex or edge is compromised. Attribute disclosure occurs when an adversary is able to determine the value of a sensitive user attribute that the user intended to keep private. Sensitive attributes may be associated with an entity as well as with a link relationship. At the application level, the visibility of attribute information is often variable: a member's profile can be set to be viewable publicly or only by a limited set of people in the network. In social networking sites, content that is commonly viewable by the public is usually about hobbies and interests. However, certain applications require the user to give specific information according to the application's theme. In health-based applications, there could be information such as drinking and drug habits or the type of disease, which the user gives in the profile for monitoring purposes by other users in the system, such as a doctor. In online sexual networks, on the other hand, there is sexual-based information like preferences and orientation. Meanwhile, there is also sensitive information generated from the interactions between users; for example, in messaging networks and email, the sensitive content usually comprises the text messages, the timestamps, the frequency of interaction and other information corresponding to both parties. Users usually have a strong expectation that this information is kept private [5].
5 Conclusion
Although in many ways a user gives 'consent' when signing up to an online social network site, most users are unaware of the implications of voluntarily providing personal information on profiles, and are not aware of how this information may be processed. The privacy implications associated with online social networking depend on the level of identifiability of the information provided, its possible recipients, and its possible uses. In this paper, we considered privacy disclosure in social network data publishing. We presented a systematic analysis of the various risks to privacy in the publishing of social network data, and identified various attacks that can be used to reveal private information from such data. This information is useful for developing practical countermeasures against privacy attacks.
References 1. Alexa. The top 500 sites on the web (2011), http://www.alexa.com/topsites 2. Bonneau, J., Preibusch, S.: The privacy jungle: On the market for data protection in social networks. Economics of Information Security and Privacy, 121–167 (2010) 3. Zhou, B., Pei, J., Luk, W.: A brief survey on anonymization techniques for privacy preserving publishing of social network data. ACM SIGKDD Explorations Newsletter 10(2), 12–22 (2008) 4. Kleinberg, J.M.: Challenges in mining social network data: processes, privacy, and paradoxes. ACM, New York (2007)
5. Backstrom, L., Dwork, C., Kleinberg, J.: Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. ACM, New York (2007)
6. Srivastava, J., et al.: Data mining based social network analysis from online behavior (2008)
7. Fung, B., et al.: Privacy-preserving data publishing: A survey of recent developments. ACM Computing Surveys (CSUR) 42(4), 1–53 (2010)
8. Rosenblum, D.: What anyone can know: The privacy risks of social networking sites. IEEE Security & Privacy, 40–49 (2007)
9. Facebook. Facebook Statistics (2011), http://www.facebook.com/press/info.php?statistics
10. Kaplan, A.M., Haenlein, M.: Users of the world, unite! The challenges and opportunities of Social Media. Business Horizons 53(1), 59–68 (2010)
11. Watts, D.J., Strogatz, S.H.: Collective dynamics of 'small-world' networks. Nature 393(6684), 440–442 (1998)
12. Watts, D.J.: Networks, dynamics, and the small-world phenomenon. American Journal of Sociology, 493–527 (1999)
13. Narayanan, A., Shmatikov, V.: De-anonymizing social networks. IEEE, Los Alamitos (2009)
14. Felt, A., Evans, D.: Privacy protection for social networking APIs. In: 2008 Web 2.0 Security and Privacy, W2SP 2008 (2008)
15. Westin, A.F.: Privacy and Freedom, London, vol. 97 (1967)
16. Hansell, S.: AOL removes search data on vast group of web users. New York Times 8, C4 (2006)
17. Gross, R., Acquisti, A.: Information revelation and privacy in online social networks. ACM, New York (2005)
18. Williams, J.: Social networking applications in health care: threats to the privacy and security of health information. ACM, New York (2010)
19. Hay, M., et al.: Resisting structural re-identification in anonymized social networks. Proceedings of the VLDB Endowment 1(1), 102–114 (2008)
20. Liu, K., et al.: Privacy-preserving data analysis on graphs and social networks. Next Generation of Data Mining, 419–437 (2008)
21. Zheleva, E., Getoor, L.: Preserving the privacy of sensitive relationships in graph data. Springer, Heidelberg (2007)
Distributed Mechanism for Protecting Resources in a Newly Emerged Digital Ecosystem Technology Ilung Pranata, Geoff Skinner, and Rukshan Athauda University of Newcastle, School of Design, Communication and IT, University Drive, Callaghan NSW, Australia {Ilung.Pranata,Geoff.Skinner,Rukshan.Athauda}@newcastle.edu.au
Abstract. A Digital Ecosystem (DE) is characterized as an open and dynamic environment where interaction and collaboration between its entities are highly promoted. A major requirement for promoting such intensive interaction and collaboration in a DE environment is the ability to secure and uphold the confidentiality, integrity and non-repudiation of shared resources and information. However, current developments of such security mechanisms for protecting shared resources are still in their infancy. Most of the proposed protection frameworks do not provide a scalable and effective mechanism for engaging multiple interacting entities to protect their resources. This is an even greater issue when multiple resources are exchanged and shared in an open and dynamic environment. Therefore, we propose a distributed mechanism for individual enterprises to manage their own authorization processes and resource access permissions, with the aim of providing rigorous protection of entities' resources. Keywords: Authentication, authorization, digital ecosystem.
1 Introduction
Since its first introduction in 2002, the newly emerging concept of the Digital Ecosystem (DE) has attracted considerable attention from researchers, businesses, ICT professionals and communities around the world. The concept is aimed at achieving the goals set by the Lisbon summit in March 2000, which primarily focus on the dynamic formation of a knowledge-based economy [1]. Further, the knowledge-based economy is expected to lead to the creation of more jobs and greater social inclusion, sustaining world economic growth [2]. DE is a multi-dimensional concept that encompasses several current technology models, such as collaborative environments [3], distributed systems [4], and grid technology [5]. The combination of concepts from these models gives a DE environment the ability to deliver an open, flexible and loosely coupled resource sharing environment. On the other hand, this combination also raises several complicated security issues which need to be addressed before the full implementation of the DE concept. Unfortunately, evaluation of the DE security dimensions in the current literature reveals a number of deficiencies in its security architecture, particularly in protecting enterprise resources and information. There is a need for a comprehensive resource protection solution that is able to provide a
strong and rigorous mechanism to safeguard critical resources and, further, to reduce the possibility of information leakage to unauthorized parties. A key challenge for enterprises involved in a DE environment is to determine the right users who are able to access the services, resources, knowledge and information hosted by these enterprises. This challenge occurs for several reasons: first, multiple resources are published and shared by each enterprise in a DE environment; and second, various clients are able to access each individual resource. For these reasons, enterprises urgently need a mechanism that effectively manages their clients' access control and authorization permissions, with the aim of protecting their resources. In this paper, we attempt to deliver a comprehensive framework allowing enterprises to protect their resources and information from any unauthorized use.
2 Related Work

In a DE environment where multiple interacting entities exist, the effort required to enforce a strong authentication and authorisation mechanism is extensive. We identify three core issues that make enforcing such mechanisms challenging. First, as the DE community expands to incorporate more entities, resource providers face a challenge in identifying the legitimate entities that may access their resources. Second, the fact that each entity would have a different set of access permissions for multiple resources further complicates the issue. Third, each resource provider is likely to host multiple resources and services in a DE environment. This situation, in turn, makes it a great challenge to authorize the right entities for the right resources with the right permissions. Failure to assign the right permissions to entities would compromise the usage of resources and negatively impact the resource provider. Current internet mechanisms are still far from adequate for providing reliable authentication and authorisation processes in a DE environment. This view is reflected in our analysis of a number of internet mechanisms in the literature. The most prominent mechanism for managing client credentials is the Identity Provider (IdP), or Credential Provider [6, 7]. An IdP mainly focuses on storing client credentials and providing them to resource providers for client authentication. On every authentication, the resource provider requests the client credential from its trusted IdP whenever it receives an access request from a client. In later developments, technology standards such as SAML [8] and Liberty Alliance [9] were adopted in this mechanism to federate multiple entities for Single Sign-On (SSO) services. Similarly, the Credential Server (CRES) [10] and the Grid Security Infrastructure (GSI) MyProxy [11] utilize the IdP concept and further leverage it for a large number of servers dispersed over a wide geographical area. Both mechanisms store clients' credentials on a local server; however, authentication of a remote client can be facilitated by requesting his credential from the trusted remote server. In both the MyProxy and CRES mechanisms, the resource provider requests the client credential from the local server on every authentication. The local server then creates a certificate token which contains the client information. Subsequently, the
certificate token is sent to the resource provider as acknowledgement of the authentic client. When the resource provider receives the token, it allows the client to access the resources based on the trust established with the publisher of the token. The SSL/TLS technology [12] has been extensively used in e-commerce transactions for secure authentication and communication. This technology is designed with high reliance on the Certificate Authority (CA) to ensure the legitimacy of an entity; therefore, SSL/TLS also represents centralized credential management. Although these approaches could be deployed in a DE environment, the conspicuous issue of single-server failure must be carefully considered. In the event that the credential provider server is down, chaos could ensue in the DE community due to the unavailability of credential services for client authentication. Our literature review further identifies that several internet authorisation mechanisms take a similar approach to the authentication mechanisms above. The most prominent authorisation mechanisms, such as CAS [13], Akenti [14], and PMI [15], utilize a central server to assign multiple access permissions to clients individually, although their implementations differ from one another. These mechanisms also inherit several issues pertinent to the central management of authorisation permissions. First, central management faces real issues with bottlenecks and failures on its servers: a security breach would occur if the central servers failed to perform their authorisation processes over the clients. Although it is possible to replicate the central server, replication brings abundant administrative issues, considering the huge amount of data that needs to be replicated. Second, challenges occur when the central server attempts to assign access permissions to the DE member entities. As a large number of resource providers host one or more resources each, the central server needs to register each resource and its access permissions individually. This situation becomes even more challenging as a single resource can be associated with multiple different access permissions, and each client may have different access permissions assigned to him. Therefore, central management is impractical when there is a huge number of entities in a DE environment. Third, serious administration issues would occur as a DE environment grows in size and diversity, driven by the great benefits its members can achieve. A central server would experience a huge burden managing all client and resource provider accounts and permissions, even with supercomputers or grid collections of computers. The DE literature clearly reveals that a DE is characterized as an open environment in which centralized structure is minimized. A DE must be engineered to provide a highly resilient infrastructure while avoiding single points of control and failure [16, 17]. Therefore, a completely distributed control mechanism that is immune to central control failure is required. It is evident that the aforementioned internet mechanisms are inappropriate for a DE environment due to their centralized management. In this paper, we propose a solution that manages authentication and authorization in a fully distributed fashion, in which each entity manages its own authentication and authorization mechanism with the utilization of a capability token.
We term our solution the Distributed Resource Protection Mechanism (DRPM) [18, 19]. In this paper, we enhance our solution by eliminating its reliance on a central credential server, and we further secure the mechanism by utilizing a Public Key Infrastructure [15].
3 Overview of DRPM

3.1 Identifying an Entity through a Client Profile

The present mechanism for service discovery in a DE environment requires a client to search for resources using a semantic discovery portal through a browser or rich application [20]. The discovery portal searches and lists all resources provided by DE resource providers. Once the client finds the resource, it contacts the resource provider and requests that resource. At this stage, the resource provider does not know any information about the client or its intended purpose for the resource. This may put the resource at risk, as it may contain highly sensitive information which must be protected from misuse and malicious acts. Therefore, it is crucial for a resource provider to understand its client's information before any access to the resource is granted. Taking this into consideration, we adopt the creation of a client profile that aims to capture all required, but voluntarily provided, information about a client. The information contained in a client profile provides the necessary data about who the client is and about their intentions and purpose in using the requested resources. The aim of implementing a client profile is to assure the resource provider that resources are not going to the wrong entities, and further to uphold the confidentiality and integrity of the resources. The use of client profiles also facilitates auditing of who is accessing a resource. For example, there may be a situation where a resource provider needs to trace back which client was delegated access to a resource, in case of an incident involving a dispute or counterfeiting of the resource. To fully implement client profiles, a client registration portal is employed in DRPM; a client profile is generated through this portal. Further, resource providers are able to customize the registration portal to contain only the information that is important to them. New clients wishing to access a specific resource are initially redirected to this portal. If they wish to access the resource, they must fill in all the information required by the resource provider to produce a client profile. Once produced, the client profile is stored in the resource provider's repository. This procedure provides an additional, enhanced method for determining who is accessing a particular resource at a particular time inside a DE environment.

3.2 Storing Permissions in a Capability Token

It is always a challenge to enforce client access permissions on the available resources within a DE environment, due to the large number of entities actively interacting there. Further, these entities may make the same request for a particular resource either at the same or at different times. To solve the issue of managing multiple resource access permissions for a diverse range of DE clients, we utilize and further evolve the concept of capability introduced by the CAS server used in collaborative environments. In CAS, a capability stores all access rights of a user as determined by a community policy. However, the implementation of the capability in our framework differs slightly from that in a CAS server. In our framework, the capability contains all the necessary
access permissions for each client to perform a set of operations on a particular resource. The capability is produced by the resource provider that hosts the resource. It is used to grant the client access to the resources, and it further facilitates the authorization process for clients. Once a client profile is created, the list of client authorization permissions is assigned into a capability token. The client's access permissions and policies are expressed in XML [21] due to its simplicity, wide usability and self-descriptive character. Our basic design of a capability token contains the client profile identifier, resource provider identifier, resource identifier and a list of access permissions. A time-stamp can be included in the capability token to determine the validity period of a client's access to the resources. In the event that the trustworthiness of a new client is equivocal, a short-lived capability token can be issued; once the trustworthiness of the client gradually increases, the resource provider can replace the short-lived token with one of longer validity. Additionally, the Uniform Resource Locator (URL) of the resources is embedded in the token to provide automatic and seamless connection to resource servers. Once a capability token is created, it is disseminated to the requesting client. Every time the client makes a request to the resource provider, the client sends back its initially configured capability; the resource provider then authenticates the client's capability token and grants access based on the permissions listed in it.
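To make the token layout concrete, the following is a minimal sketch of such a capability token serialized in XML, built with Python's standard library. All element names, identifiers and the one-hour validity period are illustrative assumptions; the paper fixes only the fields themselves (client profile, provider and resource identifiers, permissions, time-stamp and resource URL), not a concrete schema.

```python
# Hypothetical capability token layout; element names are assumptions.
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

def build_capability_token(profile_id, provider_id, resource_id,
                           resource_url, permissions, lifetime_hours=1):
    token = ET.Element("capability")
    ET.SubElement(token, "clientProfileId").text = profile_id
    ET.SubElement(token, "resourceProviderId").text = provider_id
    resource = ET.SubElement(token, "resource", id=resource_id, url=resource_url)
    perms = ET.SubElement(resource, "permissions")
    for p in permissions:                       # e.g. ["read", "print"]
        ET.SubElement(perms, "permission").text = p
    # Short-lived token for a client whose trustworthiness is still equivocal.
    expiry = datetime.now(timezone.utc) + timedelta(hours=lifetime_hours)
    ET.SubElement(token, "validUntil").text = expiry.isoformat()
    return ET.tostring(token, encoding="unicode")

print(build_capability_token("cl-042", "rp-007", "res-13",
                             "https://rp.example/res/13", ["read", "print"]))
```

Replacing a short-lived token with a longer-lived one then amounts to reissuing the token with a later validUntil value once the client's trust has grown.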
4 Developing a Secure DRPM Workflow

In this section, we present a secure DRPM which provides a strong authentication and authorization mechanism while upholding the confidentiality, integrity and non-repudiation of resources. The following notation will be used to define the secure DRPM:

- Cl: client that requests the resources.
- RP: resource provider that hosts the resources.
- PKi: public key of i.
- SKi: secret key of i.
- Clcp: capability token of the client.
- Si(x)i: object x signed with the private key of i.
- E[x]j: object x encrypted with the public key of j.
- ATCl: authentication token of the client.
- SyKi: symmetric key passphrase of i.
- i → j:{x1, ..., xn}: a message sent from i to j with contents x1 to xn.

given that:

- PKi corresponds to SKi: the public key of i is related only to the secret key of i; therefore, Si(x)i can only be verified with PKi, and E[x]i can only be decrypted with SKi.
4.1 Securing Registration Workflow

The DRPM registration portal is used to generate a client profile during the initial resource provisioning. This registration portal also captures the client information and possibly his reasons for accessing the resources. The registration process comprises three main stages: client registration, public key exchange, and secure transfer of the capability token. The resource provider's endorsed certificate is utilized to identify the authentic resource provider based on its community-endorsed public key certificate, which is discussed in the next sub-section. Public Key Infrastructure (PKI) is used to provide secure communication between the client and the resource provider. Figure 1 shows the principal workflow securing the three stages of the registration process.
Fig. 1. DRPM secure registration workflow
The registration steps are detailed below (a code sketch of the cryptographic steps is given at the end of Section 4.2):
1. A new client contacts the resource provider to request a resource (Cl → RP). The resource provider sends its WoT-endorsed public key to the client (Cl ← RP:{PKRP}). Once the client determines and accepts the trustworthiness of the public key, he stores the resource provider's trusted public key and fills in his information on the registration portal.
2. After the client information is filled in, the registration portal builds a unique client profile which identifies the client, and sends this client profile to the repository server.
3. The resource provider then requests the client certificate and stores the client public key in its repository (Cl:{PKCl} → RP). If required, WoT verification can be performed on the client certificate to ensure the trustworthiness of the client.
4. The resource provider generates a client capability token based on the client's allowed permissions.
5. The resource provider uses its own private key to sign the capability token (SKRP + Clcp = Si(Clcp)RP). The SHA algorithm is used to hash the capability token. This process enhances the integrity of the capability token over the untrusted network.
6. The resource provider then uses the client's public key, received in step 3, to encrypt the signed message (PKCl + Si(Clcp)RP = E[Si(Clcp)RP]Cl) and sends it to the client endpoint (Cl ← RP:{E[Si(Clcp)RP]Cl}).
7. The client uses his own private key to decrypt the encrypted capability token (E[Si(Clcp)RP]Cl - SKCl = Si(Clcp)RP). This process further ensures the confidentiality of the capability token: the token is considered breached if the client cannot decrypt the message.
8. The client then uses the resource provider's public key to recover the capability token from the signed message (Si(Clcp)RP - PKRP = Clcp). This process further ensures that the client receives the capability token from the genuine resource provider, unchanged.

Note that at the final step of the registration process, the client holds his capability token and the public key retrieved from the resource provider. The capability token and the resource provider's public key are then stored in the client repository for future communication and resource access. At the other endpoint, the resource provider stores the client's public key in its own repository. We trust that the combination of encryption and hashing mechanisms further upholds the confidentiality, integrity and non-repudiation of the capability token during its transfer.

4.2 Fine-Grained Resource Access Workflow

Once a client has successfully registered with the resource provider, the client presents his capability token to the resource provider on every access request. The capability token, which contains the client assertions and authorization permissions, is primarily used as the basis for granting resource access. The resource provider utilizes the client's capability token to authenticate and authorize client access. The three foremost protection requirements for resource access are identification of the resource provider, secure transfer of the capability token, and authentication of the requesting client. A detailed workflow that ensures security protection on each resource access is provided in Figure 2. The steps are as follows:

1. The client looks in his repository for the capability token of his intended resource provider and retrieves it. The capability token contains the client access permissions and the resource URL. At this stage, the client also determines a symmetric pass key, which will be shared with the resource provider, and generates the authentication token consisting of the symmetric pass key and the capability token (Clcp + SyKcl = ATCl).
2. The client uses his private key to sign the authentication token (SKCl + ATCl = Si(ATCl)Cl). The signing process is essential to uphold the non-repudiation of the capability token.
3. The client then encrypts the signed token using the resource provider's public key (PKRP + Si(ATCl)Cl = E[Si(ATCl)Cl]RP) and sends the encrypted message to the resource provider (Cl:{E[Si(ATCl)Cl]RP} → RP).
4. When the resource provider receives the encrypted message, it uses its own private key to decrypt the message and retrieve the signed token (E[Si(ATCl)Cl]RP - SKRP = Si(ATCl)Cl).
Fig. 2. DRPM resource access protection
5. The resource provider then verifies the signature of the token using the client public key (Si(ATCl)Cl - PKCl = ATCl). It then verifies the integrity of the capability token by generating its hash using the SHA algorithm.
6. The resource provider retrieves the access permissions listed in the capability token.

Note that in step 1 of the workflow, the client determines a symmetric pass key. This pass key is used to generate a symmetric key for further communication once the capability token authentication and authorization processes succeed. In the event that the capability token is stolen through a man-in-the-middle attack, the unauthorized entity is still unable to access the resource, because the symmetric key passphrase is shared only between the legitimate client and the resource provider. If there is a security breach in which the resource provider generates a new public-private key pair, the client is no longer able to encrypt with his current copy of the resource provider's public key; a request then needs to be made to obtain the new public key. PKI is extensively utilized during the DRPM resource access workflow. The peer public keys retained by both the client and the resource provider during registration are re-used to provide the confidentiality and integrity of the capability token. PKI is primarily adopted during the initial handshake and capability token transfer. Due to the higher computational cost of PKI, we suggest utilizing a symmetric key for transferring data after the authentication and authorization process. The symmetric key can be incorporated into the capability token message before encryption, and resource exchanges are then encrypted with this symmetric key over the untrusted network. A sketch of these cryptographic steps is given below.
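The following is a minimal sketch of steps 1-5 above, using the pyca/cryptography package. The message framing, the RSA-PSS/OAEP parameter choices, and the hybrid wrapping of the payload under a fresh symmetric key are assumptions made for the sketch (plain RSA cannot encrypt an arbitrarily long token); the paper itself specifies only signing with one's own private key, encrypting for the peer, and SHA hashing.

```python
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

PSS = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                  salt_length=padding.PSS.MAX_LENGTH)
OAEP = padding.OAEP(mgf=padding.MGF1(hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

def client_send(token: bytes, sk_cl, pk_rp):
    sym_key = Fernet.generate_key()             # SyKcl, chosen by the client
    at = sym_key + b"|" + token                 # ATCl = Clcp + SyKcl
    sig = sk_cl.sign(at, PSS, hashes.SHA256())  # Si(ATCl)Cl, SHA-256 inside
    # Hybrid E[...]RP: seal the payload with a fresh symmetric key that is
    # itself wrapped under the resource provider's public key.
    seal = Fernet.generate_key()
    body = Fernet(seal).encrypt(len(sig).to_bytes(2, "big") + sig + at)
    wrapped = pk_rp.encrypt(seal, OAEP)
    return wrapped, body

def provider_receive(wrapped: bytes, body: bytes, sk_rp, pk_cl):
    seal = sk_rp.decrypt(wrapped, OAEP)         # step 4: SKRP opens the message
    plain = Fernet(seal).decrypt(body)
    sig_len = int.from_bytes(plain[:2], "big")
    sig, at = plain[2:2 + sig_len], plain[2 + sig_len:]
    pk_cl.verify(sig, at, PSS, hashes.SHA256()) # step 5: raises if forged
    sym_key, token = at.split(b"|", 1)
    return sym_key, token                       # permissions read from token

# Demo with freshly generated key pairs standing in for the WoT-endorsed ones.
sk_cl = rsa.generate_private_key(public_exponent=65537, key_size=2048)
sk_rp = rsa.generate_private_key(public_exponent=65537, key_size=2048)
msg = client_send(b"<capability>...</capability>", sk_cl, sk_rp.public_key())
print(provider_receive(*msg, sk_rp, sk_cl.public_key()))
```

The same sign-then-encrypt pattern, with the roles of client and resource provider swapped, realizes steps 5-8 of the registration workflow in Section 4.1.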
6 Implementation Strategies and Scalability Testing

Due to the limitation on the length of this paper, this section provides only a very brief review of the implementation strategies and the scalability performance of our proposed DRPM mechanism. Our DRPM prototype implementation was divided into two major applications: the resource provider application and the client application. The resource provider application was built from three main system components: the listener
module, the registration page and the resource page. The main tasks of these components were to listen for incoming connections from clients, to automatically create the client profile and capability token, to securely exchange the information, and to host multiple resources. In contrast, the client application was primarily used to securely register for and access the hosted resources. We tested the scalability of the prototype's listener server component in handling multiple HttpWebRequest requests. The test was conducted using the Apache JMeter [21] tool, which specializes in web scalability and performance testing.
Fig. 3. DRPM listener component scalability testing
In our test bed, 1000 users were generated to access the listener component concurrently. Each user accessed the listener component either to register or to access the resources. Our test shows that the average elapsed time was 162 ms, with an aggregate highest elapsed time of 327 ms for the resource access process and an aggregate lowest elapsed time of 5 ms for the registration process. The highest elapsed times were primarily due to the encryption/decryption and hash verification of the capability token.
7 Conclusion

In this paper we have highlighted the need to protect enterprise resources from unauthorized use in a Digital Ecosystem (DE) environment. Further, we have also analysed the appropriateness of several existing security mechanisms for DE. After a thorough analysis, we found a number of deficiencies in the ability of the current mechanisms to promote strong community protection. Therefore, we propose the Distributed Resource Protection Mechanism (DRPM) to provide comprehensive resource protection for DE. DRPM can be classified as a new approach to facilitating the authorization process for requests for specific resources or information. DRPM emphasizes a decentralized authorization mechanism that is performed by each resource provider. This is achieved by utilizing the client profile and capability
token for its authentication and authorization permissions. Several future works, such as a security analysis of DRPM and scalability testing of the prototype, are needed to ensure strong protection for DE member entities. Further, investigation of an effective trust mechanism to improve the overall DRPM security is critically needed. Our proposal incorporates the Web of Trust (WoT) to actively engage the community in protecting the resources. As trust is critical in DRPM for building the confidence of the entities to interact and share their resources, a close analysis of the applicability of WoT for developing effective trust management is desired.
References
1. Nachira, F., Dini, P., Nicolai, A.: A network of digital business ecosystems for Europe: roots, processes and perspectives. European Commission, Bruxelles, Introductory Paper (2007)
2. Dini, P., Darking, M., Rathbone, N., Vidal, M., Hernandez, P., Ferronato, P., Briscoe, G., Hendryx, S.: The digital ecosystems research vision: 2010 and beyond. European Commission, Bruxelles, Position Paper (2005)
3. Ballesteros, I.L.: New Collaborative Working Environments 2020. European Commission, Report on industry-led FP7 consultations and 3rd Report of the Experts Group on Collaboration@Work (2006)
4. van Steen, M., Homburg, P., Tanenbaum, A.S.: Globe: a wide area distributed system. IEEE Concurrency 7, 70–78 (1999)
5. Czajkowski, K., Kesselman, C., Fitzgerald, S., Foster, I.: Grid Information Services for Distributed Resource Sharing. In: 10th IEEE International Symposium on High Performance Distributed Computing, HPDC-10 (2001)
6. Koshutanski, H., et al.: Distributed Identity Management Model for Digital Ecosystems. In: International Conference on Emerging Security Information, Systems and Technologies (Securware 2007), Valencia (2007)
7. Seigneur, J.M.: Demonstration of security through collaboration in the digital business ecosystem. In: Proceedings of the IEEE SECOVAL Workshop, Athens, Greece (2005)
8. Hughes, J., Maler, E.: Security Assertion Markup Language (SAML) v2.0 Technical Overview. OASIS, Working Paper (2005)
9. Liberty Alliance: Liberty Alliance Project (2011), http://www.projectliberty.org/
10. Seigneur, J.M.: Demonstration of security through collaboration in the digital business ecosystem. In: Proceedings of the IEEE SECOVAL Workshop, Athens, Greece (2005)
11. Novotny, J.: An online credential repository for the Grid: MyProxy. In: Proceedings of the IEEE Tenth International Symposium on High Performance Distributed Computing (HPDC-10), San Francisco, USA (2001)
12. Chou, W.: Inside SSL: The Secure Sockets Layer Protocol. IEEE Computer Society: IT Professional (2002)
13. Pearlman, L., et al.: A Community Authorization Service for Group Collaboration. In: Proceedings of the Third International Workshop on Policies for Distributed Systems and Networks, Monterey, USA (2002)
14. Thompson, M., et al.: Certificate-based access control for widely distributed resources. In: Proceedings of the 8th Conference on USENIX Security Symposium, Washington DC (1999)
15. Weise, J.: Public Key Infrastructure Overview. Sun BluePrints Online (2001)
16. Boley, H., Chang, E.: Digital Ecosystem: Principles and Semantics. In: Inaugural IEEE International Conference on Digital Ecosystems and Technologies (IEEE DEST 2007), Cairns, Australia (2007)
17. Briscoe, G., Wilde, P.: Digital Ecosystems: Evolving Service-Oriented Architectures. In: Proceedings of the 1st International Conference on Bio Inspired Models of Network, Information and Computing Systems, New York, USA (2006)
18. Pranata, I., Skinner, G.: Managing enterprise authentication and authorization permissions in digital ecosystem. In: 3rd IEEE International Conference on Digital Ecosystems and Technologies (DEST), Istanbul, Turkey (2009)
19. Pranata, I., Skinner, G.: Digital ecosystem access control management. WSEAS Transactions on Information Science and Applications 6, 926–935 (2009)
20. Kennedy, J.: Distributed infrastructural service. In: Nachira, F., Dini, P., Nicolai, A., Le Louarn, M., Leon, L.R. (eds.) Digital Ecosystem Technology. European Commission: Information Society and Media (2007)
21. Apache JMeter (July 2011), http://jakarta.apache.org/jmeter/
Reservation-Based Charging Service for Electric Vehicles

Junghoon Lee, Gyung-Leen Park, and Hye-Jin Kim

Dept. of Computer Science and Statistics, Jeju National University, 690-756, Jeju-Do, Republic of Korea
{jhlee,glpark,hjkim82}@jejunu.ac.kr
Abstract. This paper designs a telematics service capable of providing electric vehicles with a reservation-based charging mechanism, aiming at improving the acceptance ratio. Via the telematics network, each vehicle retrieves the current reservation status of the charging stations of interest and then sends a reservation request specifying its requirements on charging amount and time constraints. Receiving the request, the charging station checks whether it can meet the requirement of the new request without violating the constraints of already admitted requests. In this admission test, the charging scheduler, which may run in the charging station or a remote data center, implements a genetic algorithm so as to respond promptly to the fast-moving vehicle. The performance measurement results, obtained from a prototype implementation, show that the proposed scheme can significantly improve the acceptance ratio over the whole range of the number of tasks and permissible peak loads, compared with a conventional scheduling strategy. Keywords: Smart transportation, electric vehicle telematics, charging schedule, reservation service, acceptance ratio.
1 Introduction
Telematics means the integration of telecommunications and informatics, especially focusing on applications in vehicles. The in-vehicle telematics device is an onboard computing platform having computing power and a wireless connection to an information server. Empowered by the ongoing development of wireless communication technologies, vehicles can remain better connected to the global network, through which many sophisticated services can be provided. The telematics device also provides a user interface to drivers and passengers, inviting a variety of telematics applications, which are necessarily location-based services built upon GPS technology. Examples of telematics services include real-time traffic information, path finding, and vicinity information retrieval. Nowadays, vehicle telematics is required to extend its service area to electric vehicles, or EVs for short.
This research was supported by the MKE (The Ministry of Knowledge Economy) Korea, under IT/SW Creative research program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2011-(C1820-1101-0002)).
In addition to the classical services, the EV telematics system must consider EV-specific requirements such as online advance booking of charging spots, remote vehicle diagnostics, and time display for the next charging [1]. Even though many researchers and developers are working to improve the driving range while decreasing the charging time, weight, and cost of batteries, EVs still need to be charged frequently, and it takes tens of minutes to charge an EV [2]. Moreover, drivers want to have their cars charged by a certain time instant, for example, before they depart for their offices, their homes, and the like. As a result, the charging station must coordinate, or schedule, multiple requests having different time constraints, charging amounts, and power consumption dynamics. In addition, when the station is charging multiple vehicles at the same time, the power consumption may quickly exceed the permissible bound contracted with its utility company [3]. Considering battery capacities ranging from 15 to 50 kWh, area-wide peak load can also become serious when a number of EVs start charging during a short time window. Without an appropriate distribution of EVs over charging stations, not only can the waiting time increase intolerably, but the power consumption may also exceed the permissible range, possibly resulting in extra cost. The availability of charging station information can distribute, and even assign, EVs over multiple stations. Information retrieval from a vehicle on the move is one of the typical applications of vehicle telematics. After retrieving the charging station information through the telematics system via the wireless vehicle network, a driver can select when and where to charge. The telematics device, manipulated by the driver, attempts to make a reservation at a station and possibly changes the route with a new path plan. The telematics system collects the current status of each charging station, generally via the wired network, including the Internet. On the charging station's side, it must be able to post its current prices and estimated waiting times, considering the current queue length and confirmed reservations. The stations inherently try to serve as many vehicles as possible. In addition, they have some restrictions on the maximum permissible power, the number of chargers, and so on. As a result, it is necessary to schedule the requests from vehicles according to a scheduling policy and to announce the status to EVs via the telematics system. In this regard, this paper designs an EV telematics service capable of providing efficient charging station selection. For EVs, charging information is provided to let them make a charging plan, while the telematics application integrates the decision into the vehicle's routing schedule. In addition, for stations, a charging scheduler is needed to estimate the waiting time and to keep the peak load below the power level contracted with the energy supplier. Here, scheduling is in most cases a very complex, time-consuming problem, greatly sensitive to the number of tasks. It is difficult to solve with conventional optimization schemes, whose severe execution times make them impractical in a real system. Accordingly, we develop a scheduling scheme based on a genetic algorithm [4], an efficient search technique based on principles of natural selection and genetics.
This paper is organized as follows: After issuing the problem in Section 1, Section 2 describes the background of this paper. Section 3 designs an EV telematics service for efficient EV charging. The performance measurement results are discussed in Section 4. Finally, Section 5 summarizes and concludes this paper with a brief introduction of future work.
2 Background
The smart grid is a next-generation power network which combines information technology with the legacy power network to optimize energy efficiency [5]. EVs are one of the most important components of the smart grid, as their batteries can be efficiently charged via the smart grid, replacing a petroleum-based transportation infrastructure that creates a great deal of air pollution. EVs need a nationwide power charging infrastructure [6], creating new business models such as the management of diverse vehicle types, charging stations, and subsidiary services [7]. A charging facility can be installed not just in commercial charging stations; we can further consider service areas such as universities, offices, public institutes, shopping malls, airport parking lots, and the like. Many vehicles will concentrate in those places, and they must be served according to a well-defined reservation and scheduling strategy [8]. Moreover, as EVs are necessarily equipped with one or more vehicle network interfaces, they can easily interact with a charging scheduler and other telematics services, which may even reside in a remote computing cluster [9]. Meanwhile, the telematics system can provide diverse information services to drivers, taking advantage of two-way communication between the drivers and the services. In particular, IEEE 802.11 WLAN and DSRC (Dedicated Short Range Communication) provide vehicle-to-vehicle communication, while cellular networks such as GSM (Global System for Mobile communications) support globally ubiquitous connection. A service usually exploits the current location and the underlying geographic information, such as road networks and POIs (Points Of Interest). Many services will be available to EVs, for example, on-demand information fuelling, remote vehicle diagnostics, interior pre-conditioning, green report generation for monthly EV miles, and the like [10]. Among these, efficient charging is the most essential. Given the already available telematics service framework by which EVs and stations can interact, the scheduling policy in the charging station is the most critical factor for guaranteeing reasonable waiting times to drivers as well as keeping the energy consumption below the contracted amount. As for energy consumption scheduling, our previous work designed a power management scheme capable of reducing the peak power consumption [11]. It finds the optimal schedule for a task set consisting of nonpreemptive and preemptive tasks, each of which has its own consumption profile as in [12]. To deal with the intolerable scheduling latency for large numbers of tasks and slots, the feasible combinatory allocations are precalculated in advance of search space expansion for preemptive tasks. Then, for every partial allocation consisting only of nonpreemptive tasks, the scheduler maps the combination of each preemptive
task to the allocation table one by one, checking the peak power requirement. This scheme significantly reduces the scheduling time by pruning unnecessary branches in the search space and seems to work efficiently for charging tasks as well. However, it does not consider other constraints such as the number of chargers, precedence relations, and the permissible contracted amount. The speedup is still not enough for practical use, so further enhancement must be achieved. Besides, current DSM (Demand Side Management) programs consider appliance scheduling in homes and buildings, and some of them can be applied to the charging scheduler [13,14].
3 Scheduling Scheme

3.1 EV Service Architecture
Even though multihop vehicle-to-vehicle networks can connect vehicles and charging stations without cost, the connection is not stable. Hence, we assume that vehicles and stations are connected via a global cellular network. A telematics service works between drivers and charging stations to support efficient information exchange, as shown in Figure 1. The information necessary for charging services includes the estimated distance covered on the current charge, the availability and booking of charging stations, the locations of charging stations, and the state of charging [10]. Basically, drivers generally have several options and decide on a station according to their preferences on waiting time, cost, and the like. To this end, the current reservation status of each charging station is posted on the telematics server so that drivers can retrieve this information. After selecting a station to contact, a driver sends a reservation request specifying its requirements. When receiving a request, the station must be able to check whether it can accept the request. The station conducts this test and returns the result fast enough for the moving vehicle to decide and confirm its reservation.
Fig. 1. EV telematics architecture
Each requirement consists of the vehicle type, estimated arrival time, desired service completion time (deadline), charging amount, and so on. Figure 1 also shows the road network of our target area, namely, Jeju city. The charging stations are registered on this map. Based on this road map and classical real-time traffic information, the vehicle can quite accurately estimate when it will arrive at a specific charging station. The station must, indispensably, be reachable with the remaining battery. Receiving the request, the scheduler prepares the power load profile of the vehicle type from the well-known vehicle specification database. Then, it checks whether the station can meet the requirement of the new request without violating the constraints of already admitted requests, based on its own scheduling strategy. The result is delivered back to the vehicle, and the driver may accept the schedule, attempt a renegotiation, or choose another station. Entering the reserved station, the vehicle is assigned a charger and waits in the queue according to the schedule. Actually, an EV is connected to a power line while it is waiting; however, the power supply begins under the control of the scheduler by activating the connection relay. Hence, our scheduling model disregards the vehicle switch time, which is analogous to context switch overhead. A minimal sketch of such a reservation request is given below.
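The following is a small sketch of the reservation request together with the timing sanity check a station can run before invoking the full admission test. The field names, the slot granularity, and the constant-power approximation of the charger are assumptions made for the sketch; the paper specifies only the fields themselves.

```python
from dataclasses import dataclass

SLOT_MINUTES = 5  # schedule granularity used throughout this paper

@dataclass
class ChargingRequest:
    vehicle_type: str
    arrival_slot: int     # Ai: estimated arrival time, in slot units
    deadline_slot: int    # Di: desired completion time
    amount_kwh: float     # requested charging amount

def charging_slots(req: ChargingRequest, charger_kw: float) -> int:
    # Ui: operation length in slots, derived here from the requested energy
    # and an assumed constant charger power; real profiles vary per slot.
    hours = req.amount_kwh / charger_kw
    return max(1, round(hours * 60 / SLOT_MINUTES))

def timing_feasible(req: ChargingRequest, charger_kw: float) -> bool:
    # The latest start time Di - Ui must not precede the arrival Ai;
    # otherwise no schedule can satisfy the request at all.
    return req.deadline_slot - charging_slots(req, charger_kw) >= req.arrival_slot

req = ChargingRequest("compact EV", arrival_slot=3, deadline_slot=15, amount_kwh=6.0)
print(timing_feasible(req, charger_kw=7.2))  # True: admission test may proceed
```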
3.2 Charging Scheduler
The reservation service can run in the station or on a high-performance server in a data center. The scheduler must decide whether a station can accept the request before the requesting vehicle passes by its vicinity, so accuracy can be somewhat sacrificed for fast computation. Each charging operation can be modeled as a task. For a task, the power consumption behavior can vary according to the charging stage, remaining amount, vehicle type, and so on. The load power profile is practical for characterizing the power consumption dynamics along the battery charging stages [12]. Web portals like Google PowerMeter also centralize energy consumption data about their users, and such profiles can be exploited in generating a better charging schedule [14]. In the profile, the power demand is aligned to fixed-size time slots, during which the power consumption is constant, considering the availability of automatic voltage regulators. The length of a time slot can be tuned according to the system requirements on schedule granularity and computing time. In a power schedule, the slot length can be a few minutes, for example, 5 minutes; this length coincides with the time unit generally used in real-time price signals. A charging task Ti can be modeled by the tuple <Ai, Di, Ui>. Tasks are practically nonpreemptive in charging stations: even though charging can be preempted in the single-user case, as in an individual home, in the charging station the charging process continues to the end once it has started. The charging order can be changed only before the charging operation begins. Ai is the activation time of Ti, Di is its deadline, and Ui denotes its operation length, which corresponds to the length of the consumption profile. Ai is the estimated arrival time of the vehicle. Each task can start anywhere from its activation time to its latest start time, which is calculated by subtracting Ui from Di.
When a start time is selected, the profile entries are simply copied to the allocation table one by one, as the task can be neither suspended nor resumed during its operation. The choice of start time is bounded by M, the number of time slots in the scheduling window; hence the time complexity of search space traversal for a single charging task is O(M), making the total complexity O(M^N) for optimal schedules, where N is the number of tasks. On an average-performance PC, it takes tens of minutes, or sometimes a couple of hours, to generate an optimal schedule, which investigates all feasible schedules. In contrast, genetic algorithms are efficient search techniques based on principles of natural selection and genetics. They have been successfully applied to find acceptable solutions to problems in business, engineering, and science within a reasonable time bound. Each evolutionary step generates a population of candidate solutions and evaluates the population according to a fitness function to select the best solutions, which mate to form the next generation. Over a number of generations, good traits come to dominate the population, resulting in an improvement in the quality of the solutions. It must be mentioned that the genetic algorithm process can also run indefinitely without finding any better solution than it did in the first part of the process. In our scheduling problem, a chromosome corresponds to a single feasible schedule and is represented by a fixed-length integer-valued vector. Each element denotes the start time of a charging task; as tasks cannot be suspended once they have begun, the start time alone is enough to describe their behavior in a schedule. For example, if the consumption profile for a task is (3, 4, 5, 2) and the vector element is 2, the allocation for this task will be (0, 0, 3, 4, 5, 2, ...). As a result, the allocation vector can be converted into an allocation table with N rows and M columns. For an allocation, the scheduler can calculate the per-slot power requirement and the peak load. If the peak load exceeds the permissible bound, the fitness value for the allocation is set to the lowest, and the iteration discards it. The fitness function thus evaluates the quality of an allocation before the process proceeds to the next step. Each iteration consists of selection and recombination. Selection is a method that picks parents according to the fitness function; Roulette Wheel selection gives precedence for mating to chromosomes having better fitness values. Recombination, or crossover, is the process of taking two parents and producing a child, with the hope that the child will be a better solution. This operation randomly selects a pair of crossover points and swaps the substrings between the parents. This step may generate schedules identical to existing ones in the population; as it is meaningless to keep multiple instances of a single schedule, duplicates are replaced by new random chromosomes. The charging scheduler is subject to the time constraint of each charging task; however, this constraint is always met, as the scheduler selects the start time only within the valid range, namely from Ai to Di - Ui. In addition, we can directly control the number of iterations according to the allowable scheduling time: a more accurate schedule can be found with more scheduling time, that is, more iterations. A minimal sketch of this scheduler is given below.
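The following Python sketch instantiates the scheduler just described. The three sample tasks, the peak bound, the population size, and the exact fitness scaling are illustrative assumptions; the chromosome encoding (one start slot per task), the peak-load fitness criterion, Roulette Wheel selection, two-point crossover, and the replacement of duplicate chromosomes follow the description above.

```python
import random

M = 20           # number of time slots in the scheduling window
PEAK_BOUND = 8   # contracted (permissible) peak power, illustrative

# Each task: (A, D, profile), with U = len(profile). Valid start slots
# range over [A, D - U], so deadlines hold by construction.
tasks = [
    (0, 8,  [3, 4, 5, 2]),
    (2, 12, [2, 2, 3]),
    (5, 18, [4, 4, 2, 1]),
]

def random_chromosome():
    return [random.randint(a, d - len(p)) for (a, d, p) in tasks]

def peak_load(chrom):
    slots = [0] * M
    for start, (_, _, profile) in zip(chrom, tasks):
        for i, demand in enumerate(profile):    # copy profile entries into
            slots[start + i] += demand          # the allocation table
    return max(slots)

def fitness(chrom):
    # Schedules exceeding the permissible bound get the lowest fitness,
    # so the iteration effectively discards them.
    pk = peak_load(chrom)
    return 1e-6 if pk > PEAK_BOUND else 1.0 / pk

def roulette_pick(population, fits):
    r, acc = random.uniform(0, sum(fits)), 0.0
    for chrom, f in zip(population, fits):
        acc += f
        if acc >= r:
            return chrom
    return population[-1]

def crossover(p1, p2):
    # Two-point crossover: swap the substring between two random points.
    i, j = sorted(random.sample(range(len(tasks) + 1), 2))
    return p1[:i] + p2[i:j] + p1[j:]

population = [random_chromosome() for _ in range(25)]
for _ in range(1000):                           # iteration count is tunable
    fits = [fitness(c) for c in population]
    children = [crossover(roulette_pick(population, fits),
                          roulette_pick(population, fits))
                for _ in population]
    # Duplicates carry no information; replace them with fresh chromosomes.
    seen, next_gen = set(), []
    for c in children:
        while tuple(c) in seen:
            c = random_chromosome()
        seen.add(tuple(c))
        next_gen.append(c)
    population = next_gen

best = max(population, key=fitness)
print("start slots:", best, "peak load:", peak_load(best))
```

Because start slots are drawn only from [Ai, Di - Ui], every chromosome respects the deadlines by construction, so the fitness function only has to police the peak load.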
4 Performance Measurement
This section implements the prototype of the proposed allocation scheme using Visual C++ 6.0. Our implementation runs on a platform equipped with an Intel Core2 Duo CPU, 3.0 GB of memory, and the Windows Vista operating system. The experiment sets the schedule length, namely M, to 20 time units; if a single time unit equals 5 min, the total schedule length is 100 min, but the time scale is a tunable parameter. For each task, the activation time is selected randomly between 0 and M, and the operation length is also selected randomly, but it is clipped so that the finish time, namely the sum of the start time and the operation length, does not exceed M. The deadline of a charging task is on average 1.5 times as large as its charging time. In addition, the power demand for each time slot takes a value of 1 through 5. The power scale, for example kW, is not explicitly specified in this experiment, as it is a relative term. The experiments measure the acceptance ratio as well as the effect of the number of iterations. For every parameter setting, 50 task sets are generated. The scheduler runs on every request arrival; if n tasks are already accepted, the next request invokes scheduling of (n + 1) tasks. Hence, the schedulability of a task set is a critical performance metric. We define the acceptance ratio as the ratio of the number of accepted task sets to the total number of task sets: if the peak load of the schedule generated by a scheduling scheme is less than the contracted power, the set is considered accepted. The uncoordinated scheduler is selected for performance comparison, as in [12]. This scheme initiates each task as soon as the task is ready and runs it without preemption. It employs no control strategy, but it is important as it provides a baseline for a comparative assessment of the efficiency of other charging strategies. The first experiment measures the acceptance ratio according to the number of charging tasks submitted to a charging station. The experiment changes the number of tasks from 3 to 15, while the power requirement is exponentially distributed with the same average. The contracted power is set to 20; hence, if the peak load of a schedule for a given task set is less than 20, the charging station can accept the task set. As shown in Figure 2(a), the uncoordinated scheme misses task sets even when the number of tasks is just 3 and cannot accept any set once the number of tasks reaches 9. In contrast, the proposed scheme accepts all task sets up to 6 tasks and then misses more sets as the number of tasks increases; when a task set contains 10 tasks, its acceptance ratio falls below 10%. A direct comparison of acceptance ratios between the two schemes is of limited meaning, as the gap reaches 76% when the task set includes 6 tasks. In any case, the proposed scheme shifts the breakdown point by at least 4 tasks; that is, our scheme can service at least 4 more charging tasks. The second experiment measures the effect of the contracted power, namely the permissible peak load, on the acceptance ratio. In this experiment, the number of tasks is set to 10. According to Figure 2(b), when the permissible bound reaches 32, our scheme accepts all task sets, whereas the uncoordinated scheme shows just a 45% acceptance ratio. A sketch of the task-set generation just described is given below.
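For concreteness, the following is a minimal sketch of the random task-set generation described above. The exact distribution of the deadline slack (uniform between 1.0x and 2.0x, giving the stated 1.5x average) is an assumption; the paper states only the average.

```python
import random

M = 20  # schedule length in time units, as in the experiment

def random_task():
    a = random.randint(0, M - 1)            # activation time in [0, M)
    u = random.randint(1, M - a)            # operation length, clipped to M
    # Deadline slack uniform in [1.0, 2.0] gives the stated 1.5x average,
    # capped at the schedule end; the feasible window [a, d - u] is nonempty.
    d = min(M, a + max(u, round(u * random.uniform(1.0, 2.0))))
    profile = [random.randint(1, 5) for _ in range(u)]  # per-slot demand 1..5
    return (a, d, profile)

task_set = [random_task() for _ in range(10)]
print(task_set)
```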
Fig. 2. Acceptance ratio of the proposed and uncoordinated schemes: (a) effect of the number of tasks; (b) effect of the contracted power
Judging from these results, the proposed scheme enables the charging station manager to contract for at least 50% less power for the same acceptance ratio, compared with the uncoordinated case. This can help increase business profits. Up to this point, the number of iterations was 1000 and the population size 25. The next experiment measures the effect of the genetic algorithm-specific parameters on the acceptance ratio. Figure 3(a) plots the acceptance ratio against the number of iterations for two cases where the task sets have 8 and 10 tasks, respectively. Further iteration improves the acceptance ratio, but the effect is not so significant in either case once the number of iterations exceeds 1000. In addition, Figure 3(b) plots the execution time against the number of iterations. The execution time is measured using the Microsoft Windows GetTickCount system call, which has 1 ms granularity. Figure 3(b) shows that the execution time is exactly linear in the number of iterations, while the difference between the sets of 8 and 10 tasks is very small. From this, we can decide on a reasonable iteration count according to the desired execution time bound and accuracy.
2.5 "Task=8" "Task=10"
"Task=8" "Task=10" Execution time (sec)
Acceptance ratio
0.8 0.6 0.4 0.2 0
2 1.5 1 0.5 0
0
500
1000
1500
Iteration
(a) Effect of iterations
2000
0
500
1000
1500
Iteration
(b) Execution time
Fig. 3. Acceptance ratio according to the genetic algorithm parameters
2000
194
5 Concluding Remarks
In this paper, we have designed a telematics service capable of providing an efficient reservation mechanism to EVs, which can replace fossil-fuel vehicles and reduce gas emissions, aiming at their fast penetration into our daily life. Charging stations first post their reservation status and estimated waiting times on the telematics server; an EV retrieves this information to select a station and sends a reservation request specifying its requirements on charging amount and time constraints. Receiving the request, the charging station checks whether it can meet the requirement of the new request without violating the constraints of already admitted requests. In this admission test, the charging scheduler, which may run in the charging station or a remote data center, implements a genetic algorithm to respond promptly to the fast-moving vehicle. The performance of our design has been measured with a prototype implementation in terms of the acceptance ratio according to the number of tasks and the permissible peak load. The result analysis shows that the proposed scheme remarkably improves the acceptance ratio for the given parameter settings, compared with a conventional uncoordinated scheduling scheme, accepting at least 4 more charging tasks and allowing a contract for 50% less power at the same acceptance ratio. As future work, we are planning to extend our work to global peak-load reduction among a set of charging stations connected to a single utility company. As the provider sets pricing tariffs that differentiate rates in time and level, this global optimization is very important for energy cost saving [13]. In addition, it can help avoid a system-wide power shortage, which would make it necessary to rebuild the cable system or build more power plants. Such goals can be achieved by an efficient scheduling scheme.
References
1. Guille, C., Gross, G.: A Conceptual Framework for the Vehicle-to-grid (V2G) Implementation. Energy Policy 37, 4379–4390 (2009)
2. Markel, T., Simpson, A.: Plug-in Hybrid Electric Vehicle Energy Storage System Design. In: Advanced Automotive Battery Conference (2006)
3. Spees, K., Lave, L.: Demand Response and Electricity Market Efficiency. The Electricity Journal, 69–85 (2007)
4. Katsigiannis, Y., Georgilakis, P., Karapidakis, E.: Multiobjective Genetic Algorithm Solution to the Optimum Economic and Environmental Performance Problem of Small Autonomous Hybrid Power Systems with Renewables. In: IET Renewable Power Generation, pp. 404–419 (2010)
5. Gellings, C.W.: The Smart Grid: Enabling Energy Efficiency and Demand Response. CRC Press, Boca Raton (2009)
6. Morrow, K., Karner, D., Francfort, J.: Plug-in Hybrid Electric Vehicle Charging Infrastructure Review. Battelle Energy Alliance (2008)
7. Kaplan, S.M., Sissine, F.: Smart Grid: Modernizing Electric Power Transmission and Distribution; Energy Independence, Storage and Security. TheCapitol.Net (2009)
8. Schweppe, H., Zimmermann, A., Grill, D.: Flexible In-vehicle Stream Processing with Distributed Automotive Control Units for Engineering and Diagnosis. In: IEEE 3rd International Symposium on Industrial Embedded Systems, pp. 74–81 (2008)
9. Ipakchi, A., Albuyeh, F.: Grid of the Future. IEEE Power & Energy Magazine, 52–62 (2009)
10. Frost & Sullivan: Strategic Market and Technology Assessment of Telematics Applications for Electric Vehicles. In: 10th Annual Conference of Detroit Telematics (2010)
11. Lee, J., Park, G., Kim, S., Kim, H., Sung, C.: Power Consumption Scheduling for Peak Load Reduction in Smart Grid Homes. In: ACM Symposium on Applied Computing, pp. 584–588 (2011)
12. Derin, O., Ferrante, A.: Scheduling Energy Consumption with Local Renewable Micro-Generation and Dynamic Electricity Prices. In: First Workshop on Green and Smart Embedded System Technology: Infrastructures, Methods, and Tools (2010)
13. Mohsenian-Rad, A., Wong, V., Jatskevich, J., Leon-Garcia, A.: Autonomous Demand-Side Management Based on Game-Theoretic Energy Consumption Scheduling for the Future Smart Grid. IEEE Transactions on Smart Grid 1, 320–331 (2010)
14. Caron, S., Kesidis, G.: Incentive-Based Energy Consumption Scheduling Algorithms for the Smart Grid. In: IEEE SmartGridComm (2010)
Intelligent Ubiquitous Sensor Network for Agricultural and Livestock Farms

Junghoon Lee 1, Hye-Jin Kim 1, Gyung-Leen Park 1, Ho-Young Kwak 2, and Cheol Min Kim 3

1 Dept. of Computer Science and Statistics, 2 Dept. of Computer Engineering, 3 Dept. of Computer Education, Jeju National University, 690-756, Jeju-Do, Republic of Korea
{jhlee,hjkim82,glpark,kwak,cmkim}@jejunu.ac.kr
Abstract. This paper designs and implements an intelligent ubiquitous sensor network architecture for agricultural and livestock farms, which embrace a variety of sensors and create a great volume of sensor data records. To efficiently and accurately detect specific events in the great amount of sensor data, which may include not only erroneous terms but also correlated attributes, the middleware module embeds empirical event patterns and a knowledge description. For the filtered data, the data mining module opens an interface to define the relationship between environmental aspects and facility control equipment, to set the trigger conditions for control actions, and to integrate new event detection logic. Finally, the remote user interface for monitoring and control is implemented as Microsoft Windows, Web, and mobile device applications. Keywords: Ubiquitous sensor network, middleware, rule-based data processing, event detection, control box interface.
1 Introduction
Nowadays, wireless sensor networks have been successfully applied to environmental and wildlife habitat monitoring [1], while their intelligent and efficient management improves the productivity and revenue of agricultural and livestock farms [2]. Sensor data, inherently quite different from traditional data records, are created in the form of a real-time, continuous, ordered sequence of sensor readings. Here, the temporal order can be decided either implicitly by arrival time or explicitly by timestamp, so a data stream is defined as a continuous sequence of tuples. The structure of data items in a data stream can change over time. Moreover, many data streams can include a spatial tag in addition to the temporal order, making it possible to host geographic applications on the sensor network.
This research was supported by the MKE (The Ministry of Knowledge Economy), through the project of Region technical renovation, Republic of Korea.
A sensor network can be viewed as a large database system which responds to queries issued from various applications [3,4]. For example, the SyncSQL language expresses composable queries over streams; its designers point out that composition of queries, and hence support for views, is not possible in the append-only stream model [5]. This language employs the tagged stream model, in which a data stream is treated as a sequence of modifications over a given relation. In particular, the sliding-window approach is generalized by introducing the synchronization principle, which empowers SyncSQL with a formal mechanism to express queries with arbitrary refresh conditions. Besides, this work includes an algebraic framework for SyncSQL queries, a set of equivalences and transformation rules, and a query-matching algorithm. The main task of ubiquitous sensor networks, or USN in short, is monitoring sensor values, deciding on control actions, and triggering appropriate actuators [6]. For example, if the current lightness drops below the permissible level, the USN can turn on the light in the greenhouse. Moreover, if the current CO2 level is higher than a specific bound, a ventilator is activated to refresh the air. To this end, a large number of sensors are installed over the wide target area, and each of them reports its sensor values to the controller, creating a tremendous amount of data records. The USN must handle this large volume of sensor records and analyze them. Here, multiple sensor records can be correlated because they capture the same event, and the records have sequential or spatial correlation. Moreover, sensor values can contain garbage and measurement errors. The instability of wireless networks can also jeopardize a correct analysis. A wrong reaction stemming from faulty data analysis can burn out actuator motors, waste power, and lead to many hazardous problems. In this regard, this paper designs and implements a USN architecture for agricultural and livestock farms, aiming at efficiently and accurately handling the great volume of sensor data obtained from a variety of sensor devices and generating correct control actions. Our implementation focuses on the data processing middleware that interacts with sensor nodes containing CO2, temperature, humidity, lightness, and wind sensors. The system design opens an interface to define rules that filter the raw data, correlate multiple streams, and decide the control action. Next, the remote user interface for monitoring and controlling the USN is implemented as Windows, Web, and mobile device applications. The rest of this paper is organized as follows: After introducing the problem in Section 1, Section 2 describes the background and related work, focusing on the target USN architecture. Section 3 describes the raw data processing and middleware processing of the proposed system. Section 4 presents the user interface implementation details. Finally, Section 5 concludes this paper with a brief introduction of future work.
2 Background and Related Work
Under the research and technical project named Development of convergence techniques for agriculture, fisheries, and livestock industries based on the
ubiquitous sensor networks, our project team has designed and developed an intelligent USN framework [7]. This framework provides an efficient and seamless runtime environment for a variety of monitor-and-control applications on sensor networks. The sensor node, built on the Berkeley mote platform, comprises sensors, a microprocessor, a radio transceiver, and a battery [8]. Over the sensor network, which mainly exploits the ZigBee technology, composite sensors detect events such as a change in the body heat of livestock via attached biosensors, as well as the humidity, CO2, and NH3 levels via environmental sensors. Each node runs the IP-USN protocol and implements the corresponding routing schemes [9]. The sensor network and the global network, namely the Internet, are connected through the USN gateway. At this stage, the system is to integrate a remote control model to provide remote irrigation and the activation of heaters or fans.
Fig. 1. Agricultural USN framework
Our previous work designed an intelligent data processing framework for ubiquitous sensor networks and implemented its prototype [7]. Much focus is put on how to handle the sensor data stream, as well as on the interoperability between the low-level sensor data and application clients. This work first designs a systematic middleware which mediates the interaction between the application layer and low-level sensors, for the sake of analyzing a great volume of sensor data by filtering and integrating it to create value-added context information. Then, an agent-based architecture is proposed for real-time data distribution to forward a specific event to the appropriate application, which is registered in the directory service via the open interface. The prototype implementation demonstrates that this framework can not only host sophisticated applications on the USN but also autonomously evolve into new middleware, taking advantage of promising technologies such as software agents, XML, and the like. In particular, cloud
computing can provide a high-speed data processing framework for sensor streams [10]. It must be mentioned that XML data stream processing is also of interest, as XML has become a common part of information systems, including RFID (Radio Frequency IDentification), ad-hoc sensor data collection, network traffic management, and so-called service-oriented architectures [11]. Generally, XML streams are created as a by-product of information exchange in XML systems, rather than from raw sensor values. XML data streams can be viewed as a sequence of XML documents, where each data item in the stream is a valid standalone XML document, independent of the other items in the stream. Moreover, queries on a data stream can support data mining and filtering: the former evaluates queries that span a long time period, processing a great deal of time-sequenced data, while the latter takes from the stream the data items matching the filtering condition. In any case, processing XML has an attractive real-world motivation, and our system will also take advantage of XML technologies for the interactions between data processing modules.
3 Intelligent USN Architecture
3.1 Raw Data Processing
Each sensor output must be converted to daily-life values. First, the sensor board consistently supplies 2500 mV to the soil humidity sensor device, which generates a voltage between 250 and 1000 mV. Here, 250 mV corresponds to 0 % humidity while 1000 mV corresponds to 100 % humidity. Next, pyranometer sensors are used to measure the solar radiation flux density on a planar surface, generally in watts per square meter. According to the sensor device specification, 220 mV is detected in full sunlight, namely, 1100 W m^-2. Hence, by Eq. (1), we can obtain the solar radiation value:

S_r = S_o × C_v = 220 mV × 5.0 W m^-2/mV = 1100 W m^-2,    (1)
where S_r is the solar radiation, S_o is the sensor output, and C_v is the conversion factor of 5 W m^-2/mV. Anemometer sensors, commonly used in weather station instruments, measure wind speed and direction. The device calculates the wind direction based on the probed voltage values measured at different angles. The relationship between the output voltage and the phase angle is shown in Figure 2, and the corresponding measurement is estimated as in Eq. (2):

θ [deg] = (V_out − 2431) / (−6.8473)    (2)

In addition, wind speed is measured by counting the number of rotations of a wind cup during a unit time. Namely,

W_s = (π / t) × N_r,    (3)
Fig. 2. Phase angle and wind direction
where W_s is the wind speed estimate and N_r is the number of rotations during the time interval t.
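To make these conversions concrete, the following Python sketch implements Eqs. (1)-(3) and the soil-humidity mapping described above. The constants come from the text; the function names are our own, so this is an illustrative sketch rather than the authors' implementation.

```python
# Illustrative sketch of the raw-data conversions of Eqs. (1)-(3).
from math import pi

PYRANOMETER_CV = 5.0  # conversion factor C_v, in W m^-2 per mV (Eq. (1))

def soil_humidity_percent(mv):
    """Map the 250-1000 mV soil humidity sensor output to 0-100 %."""
    return (mv - 250.0) / (1000.0 - 250.0) * 100.0

def solar_radiation(so_mv):
    """Eq. (1): S_r = S_o * C_v; e.g. 220 mV in full sunlight -> 1100 W m^-2."""
    return so_mv * PYRANOMETER_CV

def wind_direction_deg(v_out_mv):
    """Eq. (2): phase angle estimated from the anemometer output voltage."""
    return (v_out_mv - 2431.0) / (-6.8473)

def wind_speed(rotations, interval):
    """Eq. (3): W_s = (pi / t) * N_r for N_r rotations in interval t."""
    return pi / interval * rotations

print(solar_radiation(220.0))      # 1100.0 (full sunlight)
print(soil_humidity_percent(625))  # 50.0
```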
3.2 Middleware Layer Processing
Middleware works between the sensor interface and the high-end data analyzers, as shown in Figure 3. To begin with, as the collected data may have erroneous readings and garbage values, the middleware is required to check the validity range of the collected data first and to prevent multiple reactions to a single event. To this end, the duplicate detector module checks the event length and value changes for the real-time sensor data, storing the filtered events in the database. The controller logic can define the control target and reaction range to activate the predefined control action for a specific event. Then, from the database, the meaningful context is extracted by sophisticated classification and time-series analysis of the event-level sequences. The series of event patterns and interpreted knowledge is embedded in the analysis module to recognize abnormal conditions instantly. Moreover, our system opens an interface to define the relationship between environmental aspects and facility control equipment, set the control action trigger condition, and integrate new event detection logic. The inference engine defines a set of rules to detect events. To define a rule, each sensor and node is assigned a unique identifier, while the max(), min(), average(), count(), and run() functions are provided for better event specification. Using these, we can specify several rules, for example, report an event when the average temperature of node 123 is higher than 35, or turn on all fans installed at sensor node 452. Based on this rule base, the middleware checks the validity of the sensor data and requests retransmission if it contains an error term. After calculating the difference from the previous sensor reading, the middleware detects an abnormal condition based on the empirically obtained event patterns and knowledge. This procedure is illustrated in Figure 4; a small sketch of such a rule follows the figure captions below.
Fig. 3. Middleware architecture
Fig. 4. Event logic
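To illustrate the rule interface just described, here is a hypothetical Python sketch. Only the aggregate primitives (max, min, average, count) and the two example rules come from the text; the class layout, names, and callback mechanism are our own assumptions.

```python
# Hypothetical sketch of the inference engine's rule interface.
from statistics import mean

class Rule:
    def __init__(self, node_id, aggregate, threshold, action):
        self.node_id = node_id      # unique node identifier
        self.aggregate = aggregate  # e.g. mean, max, min, len
        self.threshold = threshold
        self.action = action        # callback triggered on the event

    def evaluate(self, readings):
        """Fire the action when the aggregate of the node's readings
        exceeds the threshold."""
        values = readings.get(self.node_id, [])
        if values and self.aggregate(values) > self.threshold:
            self.action()

# Example rule from the text: "report an event when the average
# temperature of node 123 is higher than 35".
rule = Rule(123, mean, 35.0, lambda: print("node 123: high temperature"))
rule.evaluate({123: [34.8, 35.6, 36.1]})  # mean is 35.5 -> triggers
```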
4 User Interface and Control Action
Remote monitor-and-control keeps track of the environmental sensor values for temperature, humidity, lightness, and CO2. It can turn the power switch of each actuator device on or off. For example, the temperature monitor tracks the current temperature at a specific position selected via the geographic map. Upon the initiation command, the server module begins to collect and store sensor readings. During the lifetime of this operation, event detection is carried out based on the criteria specified in the query. The client also retrieves the current temperature value to monitor the up-to-date temperature change. Figure 5 shows the user interface implemented in this application.
Fig. 5. User interface: (a) normal status, (b) abnormal status detection
Fig. 6. Control box interface
First of all, it displays the map and the locations of the sensors along with their current status. In addition, the series of sensor values is scrolled in the listbox, while a graph is created to plot the temperature change. Figure 5(a) indicates the normal status, where no sensor value deviates from the given bound and all nodes are marked blue. In contrast,
in Figure 5(b), one sensor node detects a value out of the normal range, and this node turns red. In addition, a remote control interface is implemented in an embedded control box and a u-Multi smart mote. First, the embedded control box application interacts with the control system via TCP/IP. As a Microsoft Windows application, it sends control commands such as current status retrieval and specific control action triggers, as shown in Figure 6. In response to such a command, the sensor network sends an acknowledgment back to the control box. In addition, the sensor network can automatically report an event when the current sensor reading approaches its permissible limit. This control box application is also implemented as a Web program. Second, the u-Multi smart mote is functionally similar to the control box, except that its communication interface is SMS (Short Message Service) instead of TCP/IP and it is developed for a smaller user display. One of the most critical events is power breakage; the battery can survive for tens of minutes, allowing a failure recovery procedure that includes notification of human managers.
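The command/acknowledgment exchange over TCP/IP can be sketched in a few lines; the host, port, and message format below are assumptions, since the paper does not specify them.

```python
# Hypothetical sketch of the control-box command exchange over TCP/IP.
import socket

def send_command(command, host="192.168.0.10", port=9000, timeout=5.0):
    """Send one control command and return the acknowledgment."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("utf-8"))
        return sock.recv(1024).decode("utf-8")

# e.g. send_command("GET_STATUS") or send_command("FAN_ON 452")
```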
5 Concluding Remarks
In this paper, we have designed and developed an intelligent ubiquitous sensor network targeting agricultural and livestock farms, which have a variety of sensors and create a great volume of sensor data records during the monitoring phase. For the sake of efficiently and accurately detecting specific events in the great amount of sensor data, which may include not only erroneous readings but also correlated components, the middleware module embeds empirical event patterns and a knowledge description. It also interprets sensor-specific data into actual values. For the filtered data, the data mining module opens an interface to define the relationship between environmental aspects and facility control equipment, set the control action trigger condition, and integrate new event detection logic. Finally, the remote user interface for monitoring and controlling the USN is implemented as Windows, Web, and mobile device applications. As future work, we are planning to design an advanced data inference engine for management information as well as sensor data [13]. The sophisticated data analysis will create a new type of management messages, and those messages will make the USN more intelligent.
References 1. Golab, L., Özsu, M.T.: Issues in Data Stream Management. ACM SIGMOD Record 32, 5–14 (2003) 2. Lee, J., Park, G., Kim, H., Kim, C., Kwak, H., Lee, S., Lee, S.: Intelligent Management Message Routing in Ubiquitous Sensor Networks. In: Int'l Conference on Computational Collective Intelligence - Technologies and Applications (2011) 3. Madden, S., Franklin, M., Hellerstein, J., Hong, W.: TinyDB: an Acquisitional Query Processing System for Sensor Networks. ACM Transactions on Database Systems 30 (2005)
4. Madden, S., Franklin, M.: Fjording the Stream: An Architecture for Queries over Streaming Sensor Data. In: Proc. of the 2002 Intl. Conf. on Data Engineering (2002) 5. Ghanem, T., Elmagarmid, A., Larson, P., Aref, W.: Supporting Views in Data Stream Management Systems. ACM Transactions on Database Systems 35(1) (2010) 6. Culler, D., Estrin, D., Srivastava, M.: Overview of Sensor Networks. IEEE Computer 37, 41–49 (2004) 7. Lee, J., Park, G., Kwak, H., Kim, C.: Efficient and Extensible Data Processing Framework in Ubiquitous Sensor Networks. In: International Conference on Intelligent Control Systems Engineering, pp. 324–327 (2011) 8. http://www.tinyos.net 9. Cuevas, A., Urueña, M., Laube, A., Gomez, L.: LWESP: Light-Weight Exterior Sensornet Protocol. In: IEEE International Conference on Computer and Communications (2009) 10. Kang, M., Kang, D., Crago, S., Park, G., Lee, J.: Design and Development of a Run-time Monitor for Multi-Core Architectures in Cloud Computing. Sensors 11, 3595–3610 (2011) 11. Ulrych, J.: Processing XML data streams: A survey. In: WDS Proc. Contributed Papers, pp. 218–223 (2008) 12. Madden, S., Franklin, M., Hellerstein, J., Hong, W.: The Design of an Acquisitional Query Processor for Sensor Networks. In: ACM SIGMOD (2003) 13. Woo, H., Mok, A.: Real-Time Monitoring of Uncertain Data Streams Using Probabilistic Similarity. In: Proc. of IEEE Real-Time Systems Symposium, pp. 288–300 (2007)
Queue-Based Adaptive Duty Cycle Control for Wireless Sensor Networks
Heejung Byun¹ and Jungmin So²
¹ Dept. of Information and Telecommunication Engineering, Suwon University, Hwaseong-si, Gyeonggi-do, Korea
[email protected]
² Dept. of Computer Engineering, Hallym University, Chuncheon, Gangwon-do, Korea
[email protected]
Abstract. This paper proposes a control-based approach to duty cycle adaptation for wireless sensor networks. The proposed method controls the duty cycle through queue management in order to achieve high performance under variable traffic rates. To achieve energy efficiency while minimizing the delay, we design a feedback controller, which adapts the sleeping interval time to traffic changes dynamically by constraining the queue length at a predetermined value. Based on control theory, we analyze the adaptation behavior of the proposed controller and demonstrate system stability. The simulation results show that the proposed method outperforms existing scheduling protocols by achieving more energy savings while minimizing the delay. Keywords: Wireless sensor networks, energy, delay, queue management, analytic analysis.
1 Introduction
Wireless sensor networks (WSNs) have a wide range of applications that help to sense and monitor environmental attributes, such as target tracking, infrastructure security, fire detection, and traffic control. These networks are usually deployed in an ad hoc manner, with all nodes sharing the same communication medium. Typically, a WSN is composed of a large number of distributed sensor nodes which are often battery-powered and required to operate for years with no human intervention after deployment. Therefore, a major problem in deploying WSNs is their dependence on limited battery power. Many research efforts in recent years have focused on developing power-saving methods for WSNs. These methods include power-efficient MAC layer protocols [1]-[9] and network layer routing protocols [10]-[11]. These protocols save energy but introduce extra end-to-end delay, i.e., sleep delay. In WSNs, delay is a key factor for delay-sensitive applications, such as health or military applications. Many approaches have been proposed to achieve a good tradeoff between energy consumption and delay [5]-[9]. Adaptive listening [5]
suggests the use of overhearing to reduce the sleep delay. DSMAC [6] dynamically changes each node's duty cycle to meet applications' demands, so that a node increases its duty cycle by adding extra active periods when it requires lower latency or when the traffic load increases. U-MAC [7] tunes its duty cycle based on a utilization function, which is the ratio of the actual transmissions and receptions performed by the node over the whole active period. RL-MAC [8] optimizes active and sleep periods with the double aim of increasing throughput and saving energy, based on an MDP (Markov decision process). DutyCon [9] proposes a feedback controller which controls the duty cycle to guarantee an end-to-end communication delay while achieving energy efficiency. To do this, DutyCon decomposes the end-to-end delay requirement problem into a set of single-hop delay requirement problems. However, DutyCon requires a time stamp on the sender side to calculate the delay at the receiver, and the use of the slack time of each packet results in a slow response to traffic changes. In this paper, we propose an adaptive duty cycle control mechanism based on queue management, with the aims of energy saving and delay reduction. The queue states implicitly reflect the network status, so that we can infer traffic variations or topology changes. Using the queue length of a sensor node and its variation, we present a control-based approach and design a distributed duty cycle controller, which adapts the sleeping interval time to the variable traffic rate. Based on control theory, we derive the steady state and show system stability for the proposed controller.
2 Duty Cycle Control
2.1 Network Modeling
We first introduce the following notation:
– G = (M, L), a WSN where M is the node set and L is the link set of the network.
– l_m, the outgoing wireless link of node m (0 ≤ m ≤ M − 1), where M is the cardinality of M.
– τ_c, the time period of duty cycle control; specifically, the duration of time slot [n, n + 1).
– D_{l_m}, the link transmission rate at link l_m.
– q_{l_m}, the queue length of link l_m.
– c_m, the time length of the sleep interval of node m.
– w_m, the input process, i.e., the number of packets that arrive during time slot [n, n + 1), including the traffic generated by node m.

The queue length of link l_m at the (n + 1)-th iteration can be modeled as:

q_{l_m}(n+1) = [q_{l_m}(n) + w_m(n) − D_{l_m}(τ_c − k_m(n)·c_m(n) − mod(τ_c, c_m(n) + T_a))]^+
            = [q_{l_m}(n) + w_m(n) − D_{l_m} k_m(n) T_a]^+    (1)
Queue-Based Adaptive Duty Cycle Control for Wireless Sensor Networks
207
where k_m(n) = τ_c / (c_m(n) + T_a) and [·]^+ = max(·, 0). The value of T_a stands for the active period, which has a fixed size. Note that the value of c_m(n) remains constant during iteration [n, n + 1). Active periods are of fixed size, whereas the length of the sleep periods depends on a value determined by the duty cycle controller. During a single control period there may be multiple active times, and the queued packets are transmitted during the active times. We assume that the average network condition of a link does not change frequently; thus, we simplify by taking the value of D_{l_m} to be stable during a control period [9].
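The queue recursion is easy to transcribe for simulation purposes. The following is our sketch, not the authors' code, with k_m treated as real-valued, consistent with the steady-state derivation later in this section.

```python
def queue_update(q, w, D, c, tau_c, T_a):
    """Eq. (1): q(n+1) = [q(n) + w(n) - D * k(n) * T_a]^+,
    with k(n) = tau_c / (c(n) + T_a) treated as real-valued."""
    k = tau_c / (c + T_a)
    return max(q + w - D * k * T_a, 0.0)
```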
2.2 Duty Cycle Controller Design
Based on the network model, we design the distributed duty cycle controller. We control the duty cycle of each node by dynamically adjusting its sleep interval time under variable traffic conditions. In each control period, the controller adjusts a node's sleeping interval time using the local information available at the node. The proposed scheme does not need to gather information about its neighbors' state, such as the delay they experience [7]-[9]. Using the local information, we propose a dynamic duty cycle controller to meet time-varying or spatially non-uniform traffic loads by constraining the queue length at a predetermined threshold:

c_m(n+1) = c_m(n) + β(q_{l_m}^{th} − q_{l_m}(n+1)) − γ(q_{l_m}(n+1) − q_{l_m}(n))    (2)
where β and γ are the control parameters to be chosen, and q_{l_m}^{th} is the queue threshold. The sleep interval time increases linearly as the queue length becomes smaller than the queue threshold. Meanwhile, the sleep interval time decreases as the forward difference of the queue length becomes larger than zero, because an increased forward difference of the queue length means increased latency. The values of β and γ determine the stability of the controller. The range of β and γ for stable performance is established using a stability analysis of the closed-loop system, shown in the next section. The queue threshold can be set according to the application requirement. When the queue threshold is low, a node increases the duty cycle by adding active periods, resulting in low latency. On the contrary, as the queue threshold becomes larger, the delay increases, because the proposed controller increases the sleeping interval time in order to buffer packets until the queue length reaches the queue threshold. Hence, for delay-sensitive applications, the queue threshold can be set to a rather small value. Since each node can be assigned a different duty cycle, the sender has to synchronize its duty cycle with the receiver such that the receiver and sender nodes are active at the same time. Therefore, each node needs to exchange its determined schedule with its neighbors. As in S-MAC [2], we assume that each node maintains a schedule table that stores the schedules of all its neighbors, and that the sensor nodes exchange schedules with their neighbors using ACK packets.
2.3 Stability Analysis
Based on the network model and the duty cycle controller, we analyze the system stability for the cases of variable τ_c and fixed τ_c, respectively. First we consider the
case of variable τ_c, such that k_m = 1 for all iterations. Then the system can be represented by a discrete-time model where the duration of the control period equals (c_m(n) + T_a):

q_{l_m}(n+1) = [q_{l_m}(n) + w_m(n) − D_{l_m}(τ_c − c_m(n))]^+
c_m(n+1) = c_m(n) + β(q_{l_m}^{th} − q_{l_m}(n+1)) − γ(q_{l_m}(n+1) − q_{l_m}(n))    (3)
Let q_{l_m,s} and c_{m,s} denote the average steady-state solutions of the queue length q_{l_m}(n) and the sleep interval time c_m(n) of node m, respectively. Note that in the neighborhood of the steady state, we ignore the saturation nonlinearity. From asymptotic theory [12]-[14], we obtain the average steady points of the queue length and sleep interval time:

q_{l_m,s} = q_{l_m}^{th},    c_{m,s} = τ_c − w_{m,a} / D_{l_m}    (4)
where the value of w_{m,a} denotes the average value of w_m. Thus, from (4), we can see that the queue length converges to the desired threshold in the steady state and that the sleeping interval time is adapted with consideration of the traffic rate. Now we analytically show the stability of the proposed controller around the steady point. Let

δq_{l_m} = q_{l_m} − q_{l_m,s},    δc_m = c_m − c_{m,s},    δw_m = w_m − w_{m,a}.

Then (3) can be rewritten as:

δq_{l_m}(n+1) = δq_{l_m}(n) + δw_m(n) + D_{l_m} δc_m(n)
δc_m(n+1) = δc_m(n) − β δq_{l_m}(n) − (β + γ) D_{l_m} δc_m(n) − (β + γ) δw_m(n)    (5)

For the purpose of analytic simplicity, we concentrate on networks where the traffic load is arbitrarily constant at the average steady point; however, the results of this paper can be generalized to stochastic traffic loads. Let x(n) = [δq_{l_m}(n)  δc_m(n)]^T. Then the characteristic polynomial of (5) can be obtained as:

Φ(z) = z² + ((β + γ) D_{l_m} − 2) z + 1 − γ D_{l_m}    (6)
In order for the controller to be stable, Φ(z) should have all zeros within the unit circle. Hence the system is asymptotically stable if the control parameters satisfy the following relation:

(β + 2γ) D_{l_m} < 4    (7)
From now on, we consider the case of fixed τ_c. Then there may be multiple active times and k_m can be variable:

q_{l_m}(n+1) = [q_{l_m}(n) + w_m(n) − D_{l_m} k_m(n) T_a]^+    (8)
c_m(n+1) = c_m(n) + β(q_{l_m}^{th} − q_{l_m}(n+1)) − γ(q_{l_m}(n+1) − q_{l_m}(n))    (9)
From (8) and (9), the average steady points of the queue length and sleep interval time are derived:

q_{l_m,s} = q_{l_m}^{th},    c_{m,s} = T_a (D_{l_m} τ_c / w_{m,a} − 1)    (10)
Thus, we can see that the queue length converges to the desired threshold and the sleeping interval time is adapted in inverse proportion to the traffic rate. With the notation δq_{l_m}, δc_m, δw_m, we rewrite (8) and (9) as:

δq_{l_m}(n+1) = f_1(δq_{l_m}(n), δc_m(n))    (11)
δc_m(n+1) = f_2(δq_{l_m}(n), δc_m(n))    (12)
where

f_1(δq_{l_m}(n), δc_m(n)) = δq_{l_m}(n) + δw_m(n) − ζ(δc_m(n))
f_2(δq_{l_m}(n), δc_m(n)) = δc_m(n) − β δq_{l_m}(n) − (β + γ)(δw_m(n) − ζ(δc_m(n)))

and ζ(δc_m(n)) = D_{l_m} T_a τ_c / (T_a + δc_m(n) + c_{m,s}).

The qualitative behavior of a nonlinear system near a steady point can be determined via linearization with respect to that point. We approximate the nonlinear system described in (11) and (12) by the following linear system:
δq_{l_m}(n+1) = a_{11} δq_{l_m}(n) + a_{12} δc_m(n)    (13)
δc_m(n+1) = a_{21} δq_{l_m}(n) + a_{22} δc_m(n)    (14)
Rewriting this equation in vector form, we obtain

x(n+1) = A x(n)    (15)

where

A = [a_{11}  a_{12}; a_{21}  a_{22}] = [∂f_1/∂δq_{l_m}  ∂f_1/∂δc_m; ∂f_2/∂δq_{l_m}  ∂f_2/∂δc_m] |_{δq_{l_m}=0, δc_m=0}
  = [1,  D_{l_m} T_a τ_c / (T_a + c_{m,s})²;  −β,  1 − (β + γ) D_{l_m} T_a τ_c / (T_a + c_{m,s})²]

Then the system is asymptotically stable if the control parameters satisfy the following relation:

(β + 2γ) D_{l_m} T_a τ_c / (T_a + c_{m,s})² < 4    (16)
Therefore, since the origin of the linearized state equation is stable in a small neighborhood of the steady point, the trajectories of the nonlinear state equation will behave like a stable node.
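A short closed-loop simulation of Eqs. (8)-(9) makes this behavior visible: the queue settles at the threshold and the sleep interval near the value predicted by Eq. (10). In this sketch, q_th, τ_c, β, and γ follow the simulation section below; D, T_a, and the constant arrival rate w are our illustrative assumptions.

```python
# Closed-loop sketch of Eqs. (8)-(9); with stable parameters the queue
# converges to q_th and the sleep interval to T_a*(D*tau_c/w - 1), Eq. (10).
def simulate(steps=20000, q_th=4.0, tau_c=0.2, T_a=0.01,
             D=1000.0, w=2.0, beta=0.0005, gamma=0.001):
    q, c = 0.0, tau_c                    # initial queue and sleep interval
    for _ in range(steps):
        k = tau_c / (c + T_a)            # active periods per control slot
        q_new = max(q + w - D * k * T_a, 0.0)                          # Eq. (8)
        c = max(c + beta * (q_th - q_new) - gamma * (q_new - q), 0.0)  # Eq. (9)
        q = q_new
    return q, c

print(simulate())  # approximately (4.0, 0.99) for these parameters
```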
Fig. 1. Proposed algorithm: (a) duty cycle and (b) queue length
3 Simulation Results
To show the effectiveness of the proposed algorithm, we compare it with DutyCon [9]. We use the following network model: the network topology is a linear network with 4 nodes, where nodes 0, 1, and 2 are the data source nodes and packets are forwarded to the data sink, node 3, through the intermediate nodes. Thus, the traffic load is relatively high for nodes located close to the sink, due to the forwarded and generated traffic. We assume that the sender sends a packet only once every active period. The packet size is 100 bytes and the transmission rate is 250 kbps. All nodes are initialized with a 50% duty cycle. For the proposed algorithm, we set q_{l_m}^{th} = 4 packets and τ_c = 0.2 second for all nodes. From (16), we set β = 0.0005 and γ = 0.001. Fig. 1 shows the evolution of the duty cycle and queue length of nodes 0, 1, and 2 for the proposed algorithm when the traffic rate changes at runtime. The data source nodes 0, 1, and 2 generate packets in a uniform distribution at a rate of 2 packets per control period. The queue threshold is set to 5 packets for all nodes. The traffic load is lowest at node 0 because it does not need to forward packets for other nodes, and is highest at node 2. To maintain the queue length under the threshold, the duty cycle becomes high when the traffic load is high, and vice versa. At 200 s, the packet arrival rate increases to 3 packets per control period, causing the duty cycle of each node to increase. However, the queue lengths of nodes 0, 1, 2, and 3 are successfully maintained around the queue threshold irrespective of the traffic change. In order to compare the average performance of DutyCon and the proposed algorithm in terms of queue length, delay, and power consumption, the
Fig. 2. Average queue length under different packet arrival rate
following simulations are carried out by varying the packet arrival rate. Poisson traffic is assumed for all nodes. In this simulation, the queue threshold is 4 packets and the delay requirement for DutyCon is 0.2 seconds. For each packet arrival rate, we ran 100 iterations to measure the average performance. Figure 2 shows the average queue length under different packet arrival rates. When the traffic load is light, DutyCon keeps the queue length at a low value. However, as the traffic load becomes heavy, the queue length of DutyCon increases rapidly. The reason for this result is that DutyCon controls the sleeping interval time using the slack time information; therefore, DutyCon cannot react quickly to the increasing traffic, resulting in a large queue length. On the other hand, the proposed algorithm successfully controls the queue length around the queue threshold regardless of the traffic load. This is because the proposed controller adapts the duty cycle to the traffic variation in a timely manner, so that the queue length converges to the queue threshold. Therefore the proposed algorithm avoids a large backlog of packets in intermediate nodes. Figure 3 shows the average delay performance under different packet arrival rates. The results indicate that when the traffic rate is low, DutyCon can keep the average delay very close to the desired requirement. However, as the traffic load grows, DutyCon loses effective control of the average delay. After the packet arrival rate reaches 25 packets/s, the average delay is reduced due to the increased duty cycle, but it is still longer than the delay requirement. The proposed algorithm exhibits a somewhat longer delay than DutyCon when the traffic load is light. The reason is that when the traffic load is light, the proposed controller increases the sleeping interval time in order to
Fig. 3. Average delay under different packet arrival rate
Fig. 4. Average power consumption under different packet arrival rate
buffer the packets until the queue length reaches the queue threshold. However, as the traffic rate becomes high, the average delay is reduced due to the increased duty cycle and the proposed algorithm performs better than DutyCon.
We evaluate the average power consumption under different packet arrival rates. In our simulations, we set the transmitting power to 24.75 mW and the sleeping power to 15 μW. Figure 4 shows that when the traffic load is light, the average power consumption of both DutyCon and the proposed algorithm is small, due to the low duty cycle. As the packet arrival rate increases, nodes have fewer chances to go to sleep and thus spend more time in transmission. In DutyCon, the average power consumption increases rapidly until the packet arrival rate reaches 15 packets/s and then increases slowly thereafter. However, the average power consumption of the proposed algorithm is much lower than that of DutyCon, which leads to a longer lifetime under heavy traffic load. According to these results, we can see that when the traffic is light, both DutyCon and the proposed algorithm work well. However, as the traffic load becomes heavy, DutyCon loses effective control of the network performance, whereas the proposed algorithm achieves a significant improvement in delay and power consumption. Therefore, the proposed algorithm achieves energy efficiency while minimizing the delay, by controlling the queue length close to the queue threshold and avoiding a large backlog of packets in intermediate nodes.
4 Conclusions
In this paper, we have proposed a control-based approach to adaptive duty cycle control for wireless sensor networks. The proposed approach controls the duty cycle through queue management in order to achieve high performance under variable traffic rates. To achieve energy efficiency while minimizing the delay, we design a feedback controller, which changes the sleeping interval time dynamically by constraining the queue length at a predetermined value. This results in lower energy consumption and faster adaptation to traffic changes. Our simulation results show that the proposed algorithm significantly improves both energy efficiency and delay performance by adapting the duty cycle properly under different traffic rates. Acknowledgments. This research was supported by the GRRC SUWON 2011B5 program of Gyeonggi province.
References 1. Bachir, A., Dohler, M., Watteyne, T., Leung, K.K.: MAC essentials for wireless sensor networks. IEEE Communications Surveys & Tutorials 12, 222–248 (2010) 2. Ye, W., Heidemann, J., Estrin, D.: An energy-efficient MAC protocol for wireless sensor networks. In: Proceedings of IEEE Infocom, pp. 1567–1576 (2002) 3. Van Dam, T., Langendoen, K.: An adaptive energy-efficient MAC protocol for wireless sensor networks. In: Proceedings of ACM SenSys, pp. 171–180 (2003) 4. Havinga, P., Smit, G.: E2MaC: an energy efficient MAC protocol for multimedia traffic. Technical Report (1998) 5. Ye, W., Heidemann, J., Estrin, D.: Medium access control with coordinated, adaptive sleeping for wireless sensor networks. IEEE/ACM Trans. on Networking 12, 493–506 (2004)
6. Lin, P., Qiao, C., Wang, X.: Medium access control with a dynamic duty cycle for sensor networks. In: Proceedings of IEEE WCNC, vol. 3, pp. 1534–1539 (2004) 7. Yang, S.H., Tseng, H.-W., Wu, E., Chen, G.-H.: Utilization based duty cycle tuning MAC protocol for wireless sensor networks. In: Proc. of IEEE GLOBECOM, vol. 6, pp. 3258–3262 (2005) 8. Liu, Z., Elhanany, I.: RL-MAC: a reinforcement learning based MAC protocol for wireless sensor networks. International Journal of Sensor Networks 1, 117–124 (2006) 9. Wang, X., Xing, G., Yao, Y.: Dynamic duty cycle control for end-to-end delay guarantees in wireless sensor networks. In: International Workshop on Quality of Service (IWQoS), pp. 1–9 (2010) 10. Jurdak, R., Baldi, P., Lopes, C.V.: Energy-aware adaptive low power listening for sensor networks. In: Proceedings of INSS, pp. 24–29 (2005) 11. Hu, H., Yang, Z.: The study of power control based cooperative opportunistic routing in wireless sensor networks. In: Proceedings of ISPACS, pp. 345–348 (2007) 12. Lim, J.-T., Shim, K.H.: Asymptotic performance evaluation of token passing networks. IEEE Trans. Industrial Electronics 40, 384–385 (1993) 13. Lim, J.-T., Shim, K.H.: Performance analysis and design of token-passing networks with two message priorities. IEE Proc. Communications 144, 11–16 (1997) 14. Lim, J.-T., Shim, K.H.: Extreme-point robust stability of a class of discrete-time polynomials. Electronics Letters 32, 1421–1422 (1996)
Experimental Evaluation of a Failure Detection Service Based on a Gossip Strategy
Leandro P. de Sousa and Elias P. Duarte Jr.
Federal University of Parana (UFPR) - Dept. Informatics, P.O. Box 19018, Curitiba 81531-980, PR, Brazil
{leandrops,elias}@inf.ufpr.br
Abstract. Failure detectors were first proposed as an abstraction that makes it possible to solve consensus in asynchronous systems. A failure detector is a distributed oracle that provides information about the state of the processes of a distributed system. This work presents a failure detection service based on a gossip strategy. The service was implemented on the JXTA platform. A simulator was also implemented so that the detector could be evaluated for a larger number of processes. Experimental results show that increasing the frequency at which gossip messages are sent gives better results than increasing the fanout. Results are included for fault and recovery detection times and the mistake rate of the detector. Keywords: Failure Detectors, P2P, Probabilistic Dissemination.
1 Introduction
Several distributed applications involve some kind of agreement between their components [11]. Processes must reach consensus whenever they need to decide on the same value given an initial entry consisting of a set of possible values. As both processes and communication channels can fail in real distributed systems, a basic condition for distributed processes to reach an agreement is that each process must know the state (working or failed) of the other processes in the system. In some types of distributed systems, this can be hard or even impossible to implement. That is the case for asynchronous systems: in this type of system, processes and their communication channels can behave arbitrarily slowly, making it impossible to distinguish slow from failed processes. Fischer, Lynch, and Paterson proved in [4] that consensus is impossible in an asynchronous system in which even a single process can fail by crashing. This result is known as the FLP impossibility. As a way of avoiding the FLP impossibility and thus solving the consensus problem in asynchronous systems, Chandra and Toueg proposed abstractions called unreliable failure detectors [1]. A failure detector is a distributed oracle that provides information about the state of the processes of a distributed system. Failure detectors can make mistakes, i.e., fault-free but slow processes can be erroneously
This work was partially supported by grant 304013/2009-9 from the Brazilian Research Agency (CNPq).
considered to be suspect. Chandra and Toueg proposed two properties to classify failure detectors: completeness and accuracy. Completeness requires that if a process has crashed then it is suspected by the failure detector, while accuracy restricts the mistakes that the detector can make. Even though it is impossible to implement perfect failure detectors in completely asynchronous systems, consensus algorithms using unreliable failure detectors can complete successfully if the detector output can be trusted for a long enough period [10]. Also, solutions built around failure detectors are simpler and more generic, as failure detectors encapsulate the timing properties of the system [1]. In order for real applications to use and take advantage of failure detectors, certain properties may be required. Completeness [1] is said to be weak if eventually every process that crashes is permanently suspected by some correct process. Accuracy can be eventual: there is a time after which mistakes do not occur. Applications have timing restrictions, and detectors that are too slow may not suffice. For this very reason, [2] proposes metrics for the quality of service, or simply QoS, of failure detectors. The metrics proposed by the authors are mostly used to describe the speed and accuracy of detection. Related work includes [3], a protocol that combines a failure detector with a dissemination protocol for membership information. This failure detector was initially proposed in [6]. The detector uses a randomized ping strategy, where each process periodically tests another process, selected at random. Information about group membership and process failures is piggybacked on the ping messages sent by the detectors. Recently, an implementation of a failure detection service was reported for a P2P storage system that tries to improve the detection QoS by using monitoring together with a prediction model [13]. This work presents the specification and implementation of a distributed failure detection service based on epidemic dissemination. The detection service proposed in this paper is based on the gossip strategy proposed in [12]. In the algorithm, processes periodically send gossip messages to a group of other processes, chosen randomly. Failure detection is based on a heartbeat mechanism: each gossip message contains the heartbeat value of the sending process and the last heartbeat values it has received for every other process. The detection protocol is probabilistic and uses a gossip strategy [5]. To use the detector, a process must instantiate the service and then join a detection group. At any moment, the process can query its local detector and receive a list of processes suspected to have failed. The detection service was implemented as a prototype on the P2P JXTA platform [7]. A simulator was also implemented, using the SMPL library [9]. Experimental results are reported and show that increasing the frequency at which gossip messages are sent gives better results than increasing the fanout. Results are given for fault and recovery detection times and for the mistake rate of the detector. The rest of this paper is organized as follows. Section 2 presents the proposed detection service and the gossip algorithm on which the detector is based.
In Section 3, experimental results are given after a description of the service implementation. Finally, Section 4 concludes the work and discusses future work.
2 The Proposed Detection Service
In this section the failure detection service is described. The service was implemented on the JXTA platform and is called JXTA-FD from here on. The detection algorithm implemented by the JXTA-FD service is based on the algorithm proposed in [12]. Processes monitor each other using heartbeats, which are disseminated through gossip messages (epidemic dissemination). The complete algorithm is shown in Figure 1. We consider a system where each process can directly send and receive messages from every other process. The system is asynchronous; in particular, the system has probabilistic properties for message delays and process failures. Only crash failures are considered. Failed processes can rejoin the system with a new identity. Each process executes an instance of the detection algorithm. The algorithm is divided into four distinct tasks that execute in parallel: ReceiverTask, GossipTask, BroadcastTask, and CleanupTask. The following sections describe these tasks in detail, as well as the data structures used.
2.1 Data Structures
The HeartBeat Table, or HBTable, is the most important of the data structures used by the detection algorithm; it stores the heartbeat values received from other processes and records the last local time instant at which each entry was updated. The HBTable is implemented as a hash table that uses the process identifier, called ID, as its key. Each entry stores a tuple consisting of two integers: the heartbeat value and the timestamp of the last update. An HBTable provides five operations. update(ID, hbvalue), when executed, verifies whether the hbvalue received is larger than the one stored for the given ID. If it is, the new value is stored and the timestamp is updated. If the HBTable does not keep an entry for the given ID, a new one is included. The get_hbvalue(ID) and get_tstamp(ID) operations return, respectively, the hbvalue and the timestamp for a given ID. Finally, size() returns the number of entries in the table and get_ids() returns the set of IDs stored as table keys. The algorithm also uses two integers: localHB, the local heartbeat value, and timeOfLastBcast, the time of the last broadcast received.
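The HBTable maps naturally onto a small class. The following Python sketch (ours, not the authors' Java module) implements the five operations just listed:

```python
# Sketch of the HBTable: heartbeat values keyed by process ID, each
# paired with the local time of its last update.
import time

class HBTable:
    def __init__(self):
        self._entries = {}  # ID -> (hbvalue, timestamp)

    def update(self, pid, hbvalue):
        """Store hbvalue only if it is newer than the stored one."""
        old = self._entries.get(pid)
        if old is None or hbvalue > old[0]:
            self._entries[pid] = (hbvalue, time.monotonic())

    def get_hbvalue(self, pid):
        return self._entries[pid][0]

    def get_tstamp(self, pid):
        return self._entries[pid][1]

    def size(self):
        return len(self._entries)

    def get_ids(self):
        return set(self._entries)
```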
2.2 ReceiverTask
The ReceiverTask routine is executed every time a gossip message is received, including broadcasts. Each message is composed of a set of tuples, each representing the heartbeat value of a specific process. When a gossip message arrives, the update(ID, hbvalue) operation of the HBTable is called for each of the tuples. When the received message is a broadcast, timeOfLastBcast is updated.
Every JXTA-FD instance executes the following:

|| Initialization:
    table ← new HBTable
    heartbeat ← 0
    timeOfLastBcast ← 0
    start tasks ReceiverTask, GossipTask, BroadcastTask and CleanupTask

|| ReceiverTask: whenever a gossip message m arrives
    for all <ID, hbvalue> ∈ m do
        table.update(ID, hbvalue)
    end for
    if m is a broadcast then
        timeOfLastBcast ← current time
    end if

|| GossipTask: repeat every GOSSIP_INTERVAL units of time
    if table is not empty then
        numberOfTargets ← min(FANOUT, table.size())
        targets ← choose numberOfTargets random elements from table.get_ids()
        for all t ∈ targets do
            send gossip message to t
        end for
        heartbeat ← heartbeat + 1
    end if

|| BroadcastTask: repeat every BCAST_TASK_INTERVAL units of time
    if shouldBcast() then
        send gossip message by broadcast
        timeOfLastBcast ← current time  {not necessary if the process receives its own broadcasts}
    end if

|| CleanupTask: repeat every CLEANUP_INTERVAL units of time
    for all id ∈ table.get_ids() do
        timeFromLastUpdate ← current time − table.get_tstamp(id)
        if timeFromLastUpdate ≥ REMOVE_TIME then
            remove id from table
        end if
    end for

Fig. 1. Detection algorithm used by the JXTA-FD service
2.3 GossipTask
The GossipTask routine is executed periodically, every GOSSIP_INTERVAL time units. It is responsible for sending gossip messages to other processes. At each execution, it checks whether the HBTable is empty. If it is, there is nothing to
be done, as no other process is known. If there are entries in the table, FANOUT processes are chosen randomly from the set of known processes (fewer if there are not enough entries). A gossip message is sent to each of the chosen processes. For every entry in the table, a tuple is added to the message. A tuple containing the local process ID and the localHB value is also included. After the messages are sent, localHB is incremented.
2.4 BroadcastTask
The BroadcastTask routine is executed periodically in order to allow processes to find each other after they start up, and also to improve the speed at which the output of the detector stabilizes after the occurrence of multiple simultaneous failures. Broadcast messages are sent occasionally: each time BroadcastTask executes, there is a chance that a message is broadcast to every process in the system. The probability that a broadcast is performed is computed using the service parameters and timeOfLastBcast. This probability must be defined so that it avoids too frequent or too many simultaneous broadcasts. The JXTA-FD service uses the broadcast probability proposed in [12], p(t) = (t / BCAST_MAX_PERIOD)^BCAST_FACTOR, where t is the number of time units since the last broadcast received, and BCAST_MAX_PERIOD and BCAST_FACTOR are algorithm parameters described in the following. Every time BroadcastTask is executed, a broadcast message is sent with probability p(t). In this way, the mean time between broadcasts depends on the frequency at which the gossip routine is executed (controlled by the BCAST_TASK_INTERVAL parameter), the number of processes in the system, and the algorithm parameters. BCAST_MAX_PERIOD is the maximum interval between broadcasts: as t approaches this value, the probability p(t) approaches 1. BCAST_FACTOR is a positive floating point number and controls how close to BCAST_MAX_PERIOD the broadcasts tend to occur. The higher the BCAST_FACTOR value, the closer to BCAST_MAX_PERIOD the broadcasts are sent.
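A direct rendering of the broadcast decision follows. The default parameter values (20 and 4.764) are taken from the JXTA experiments in Section 3.2; the use of a monotonic clock is our own choice.

```python
# Sketch of the broadcast decision:
# p(t) = (t / BCAST_MAX_PERIOD) ** BCAST_FACTOR.
import random
import time

def should_bcast(time_of_last_bcast, max_period=20.0, factor=4.764):
    """Return True with probability p(t), where t is the time elapsed
    since the last broadcast received."""
    t = time.monotonic() - time_of_last_bcast
    p = min(t / max_period, 1.0) ** factor
    return random.random() < p
```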
2.5 CleanupTask
The CleanupTask routine is responsible for the removal of old entries from the local HBTable. Every CLEANUP_INTERVAL time units, entries that have not been updated in the last REMOVE_TIME time units are removed from the table.
2.6 Detector Output
At any moment, a process can query its detector for the set of suspect or correct processes. A process is suspect if the time since its last update in the HBTable is larger than or equal to SUSPECT_TIME.
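The query itself reduces to a timestamp comparison; a minimal sketch, assuming the HBTable class given earlier:

```python
# Detector output: a peer is suspect if its HBTable entry has not been
# refreshed for SUSPECT_TIME units of time.
import time

def get_suspects(table, suspect_time):
    now = time.monotonic()
    return {pid for pid in table.get_ids()
            if now - table.get_tstamp(pid) >= suspect_time}
```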
3 Implementation and Experimental Results
The JXTA-FD service was implemented as a Module for the JXTA platform, version 2.5, using the Java language. Process (or peer) monitoring is done in the context of a Peer Group, and the JXTA-FD module must be loaded and started for every group that is to be monitored. Only peers that are executing the module participate in the algorithm. At any moment, a peer can query its detection module for the list of processes considered suspect or correct. A number of parameters are available for configuring the behavior of the algorithm. The most important are GOSSIP_INTERVAL, FANOUT, and SUSPECT_TIME. The first parameter (GOSSIP_INTERVAL) controls the interval at which gossip messages are sent by the GossipTask routine. The second parameter (FANOUT) controls the number of gossip messages that are sent at each interval. Finally, SUSPECT_TIME represents the interval after which a silent peer is considered to be suspect. The service parameters for a given group must be specified before the module is initialized.
3.1 Experimental Results
To evaluate the proposed failure detection service, the empirical study included experiments executed with both the JXTA implementation and a simulator, which was implemented using a discrete-event library called SMPL [9]. Two strategies for configuring the detector were evaluated. In the first strategy, on each execution of the gossip task, only one gossip message is sent; to increase the detector accuracy, the interval between gossip messages (the GOSSIP_INTERVAL parameter) is decreased, i.e., a shorter gossip interval is employed. In the second strategy, the interval between gossip messages is fixed, so in order to increase the detector accuracy, more gossip messages (the FANOUT parameter) are sent at each interval. The two strategies were compared while using the same bandwidth, that is, the number of tuples sent in a given time interval is the same for both strategies. These strategies are represented in every figure as Gossip and Fanout, respectively. To simulate the delay and loss of messages, a simple mechanism that drops a percentage of the messages received was implemented: each message has a chance of being discarded. This mechanism was adopted to simplify the implementation and the analysis of the results, given that sufficiently delayed messages have the same impact as lost messages on the detection accuracy.
3.2 JXTA Implementation Results
The experiments were conducted for a group of peers executing on one host. Each experiment ran for 15 minutes, in which 10 peers ran the detection service. Every peer employed the same parameters to configure the service. SUSPECT_TIME was set to 5 seconds and REMOVE_TIME to 20 seconds. Each peer queries its detector at intervals of 1 second. The values of the BCAST_MAX_PERIOD and BCAST_FACTOR parameters were, respectively, 20 and 4.764. The charts shown here are presented with a confidence interval of 95%.
Fig. 2. (a) Impact of bandwidth usage on the number of mistakes. 30% of messages are dropped. (b) Impact of message loss on the number of mistakes. The bandwidth usage is fixed as 25.
Mistake Probability. These experiments were executed to evaluate the impact of the detection parameters and message losses on the number of mistakes the detector makes. A mistake occurs when a working peer is considered to be suspect by some other peer. In these experiments peers never failed; in this way, every suspicion by some detector is a mistake. Figure 2(a) shows the probability of a given query returning a mistake for a message loss rate of 30%. It is possible to see that increasing the bandwidth used by the two strategies also increases the accuracy of the detection. The probability of a mistake for a bandwidth value of 12.5 is approximately 0.02700 for the Gossip strategy and 0.03019 for the Fanout strategy. For a bandwidth value of 25, the values are 0.00015 and 0.00035, respectively. For a bandwidth value of 50, no mistakes were made. As the chart shows, for such a small group of peers, the difference between the two strategies is not very significant. Even so, the Gossip strategy presented better accuracy, making approximately 50% fewer mistakes for a bandwidth value of 25. Figure 2(b) shows the impact of message loss on the number of mistakes made by the detector. The results show that the Gossip strategy is slightly more resilient to message losses: for loss rates of 20% and 30%, its number of mistakes is approximately 10% smaller than that of the Fanout strategy. Detection and Recovery Time. These experiments have the objective of verifying the difference in the detection time and the recovery time for the two proposed configuration strategies. The detection time is the mean time between the failure of a given process and the moment another process starts suspecting it. The recovery time is the mean time between a process recovering from a failure and the moment another process stops suspecting it. The recovery time can also be seen as the time it takes for a new process to be discovered by another process.
Fig. 3. (a) Detection time, for different bandwidth values. (b) Recovery time, for different bandwidth values.
The tests were executed with the same configuration as the previous experiments, with no message loss. At a given moment, one peer ceases its execution; it resumes 10 seconds later. Figure 3(a) shows the detection time for different values of bandwidth used. The results show a wide variation in detection time. For low bandwidth values, the variation is probably due to the large number of mistakes: a given peer might already be suspect when it actually fails. The chart also shows that the detection time for the Fanout strategy is approximately 20% smaller for a bandwidth value of 50. Figure 3(b) shows the recovery time for different bandwidth values. In this case, the Gossip strategy is superior, having a recovery time approximately 50% smaller than the other strategy for a bandwidth value of 50. This difference is probably due to the higher frequency of updates. It can also be seen that the recovery time is directly affected by the bandwidth used.
3.3 Simulation Results
The simulation experiments were executed so that the detection algorithm could be evaluated for a larger number of processes, and without the overhead of the JXTA platform affecting the results. The experiments were conducted for a group of 200 peers. Some parameters are fixed for all the experiments: the SUSPECT_TIME value is 5 time units and the REMOVE_TIME value is 20 time units. The detectors are queried every 0.25 time units. The BroadcastTask routine is executed every 1 time unit, and the values of BCAST_MAX_PERIOD and BCAST_FACTOR are 20 and 8.2, respectively. This causes a broadcast to be executed approximately every 10 time units. The Gossip strategy keeps the FANOUT value at 1 and decreases GOSSIP_INTERVAL, while the Fanout strategy keeps GOSSIP_INTERVAL at 2 time units and increases the FANOUT value.
Fig. 4. (a) Mistake probability for different bandwidth values. Experiment done with 50% of the messages being dropped. (b) Mistake probability for different percentages of message loss.
Mistake Probability. Figure 4(a) shows how the bandwidth used affects the number of mistakes made by the detector; 50% of the messages are dropped. The chart shows that increasing the frequency of the gossip messages has a much larger impact on the detection accuracy than increasing the FANOUT. In some cases, the Gossip strategy is an order of magnitude better than the Fanout strategy. In Figure 4(b), the impact of message loss on the accuracy of the detector is shown. Bandwidth is approximately 550 (FANOUT is 5 and GOSSIP_INTERVAL is 0.4 time units). The chart shows that the Gossip strategy is much better at preventing mistakes for every message loss rate tested. These results, together with the results from the JXTA experiments, show again that the Gossip strategy is far superior to the Fanout strategy in terms of detection accuracy. They also show that the difference between the two strategies becomes even larger as the number of processes in the group increases.
4 Conclusions and Future Work
This work described the specification, implementation and evaluation of a failure detection service based on a gossip strategy. The detection service was evaluated through experiments on both the JXTA platform and a simulator. Experimental results show that the algorithm scales well as the number of processes in the group grows, and that it is robust in terms of the number of mistakes it makes. Results also show that increasing the frequency with which gossip messages are sent gives much better detection accuracy than increasing the fanout of the algorithm; both strategies use exactly the same amount of bandwidth. Future work includes implementing the failure detection service on another platform, as we had problems using JXTA to develop our system. Although we
tried hard, going through the available documentation (which was mostly outdated) and the mailing lists, the JXTA relays and rendezvous could not be made to work correctly, and communication between machines in different networks was not possible. Future work also includes the implementation of a consensus algorithm, such as Paxos [8], based on the proposed failure detection service.
References

1. Chandra, T.D., Toueg, S.: Unreliable failure detectors for reliable distributed systems. J. ACM 43(2), 225–267 (1996)
2. Chen, W., Toueg, S., Aguilera, M.K.: On the quality of service of failure detectors. IEEE Trans. Comput. 51(1), 13–32 (2002)
3. Das, A., Gupta, I., Motivala, A.: SWIM: scalable weakly-consistent infection-style process group membership protocol. In: Proc. International Conference on Dependable Systems and Networks (DSN 2002), pp. 303–312 (June 23-26, 2002)
4. Fischer, M.J., Lynch, N.A., Paterson, M.S.: Impossibility of distributed consensus with one faulty process. J. ACM 32(2), 374–382 (1985)
5. Gupta, I., Birman, K.P., van Renesse, R.: Fighting fire with fire: using randomized gossip to combat stochastic scalability limits. Quality and Reliability Engineering International 18(3), 165–184 (2002)
6. Gupta, I., Chandra, T.D., Goldszmidt, G.S.: On scalable and efficient distributed failure detectors. In: PODC 2001: Proceedings of the Twentieth Annual ACM Symposium on Principles of Distributed Computing, pp. 170–179. ACM, New York (2001)
7. JXTA website, http://java.net/projects/jxta/ (last accessed April 2011)
8. Lamport, L.: The part-time parliament. ACM Trans. Comput. Syst. 16(2), 133–169 (1998)
9. MacDougall, M.H.: Simulating Computer Systems, Techniques and Tools. The MIT Press, Cambridge (1997)
10. Raynal, M.: A short introduction to failure detectors for asynchronous distributed systems. SIGACT News 36(1), 53–70 (2005)
11. Turek, J., Shasha, D.: The many faces of consensus in distributed systems. Computer 25(6), 8–17 (1992)
12. van Renesse, R., Minsky, Y., Hayden, M.: A gossip-style failure detection service. Tech. rep., Cornell University, Ithaca, NY, USA (1998)
13. Wan, Y., Luo, Y., Liu, L., Feng, D.: A dynamic failure detector for P2P storage system. In: NISS (2009)
On the Performance of MPI-OpenMP on a 12 Nodes Multi-core Cluster

Abdelgadir Tageldin Abdelgadir1, Al-Sakib Khan Pathan1,*, and Mohiuddin Ahmed2

1 Department of Computer Science, International Islamic University Malaysia, Gombak 53100, Kuala Lumpur, Malaysia
2 Department of Computer Network, Jazan University, Saudi Arabia
[email protected], [email protected], [email protected]
Abstract. With the increasing number of Quad-Core-based clusters and the introduction of compute nodes designed with large memory capacity shared by multiple cores, new problems related to scalability arise. In this paper, we analyze the overall performance of a cluster built with nodes having a dual Quad-Core Processor on each node. Some benchmark results are presented, and observations are made on handling such processors in a benchmark test. A Quad-Core-based cluster's complexity arises from the fact that both local communication and network communication between the running processes need to be addressed. The potential of an MPI-OpenMP approach is pinpointed because of its reduced communication overhead. We conclude that an MPI-OpenMP solution should be considered in such clusters, since optimizing network communication between nodes is as important as optimizing local communication between processors in a multi-core cluster. Keywords: MPI-OpenMP, hybrid, Multi-Core, Cluster.
1 Introduction

The integration of two or more processors within a single chip is an advanced technology for tackling the disadvantages exposed by a single core when it comes to increasing speed, as more heat is generated and more power is consumed by single cores. In this new context the word core also refers to a processor, and the two terms can be used interchangeably. Famous and common examples of these processors are the Intel Quad Core, which is the processor our research cluster is based on, and the AMD Opteron or Phenom Quad-Core. This aggregation of classical cores into a single "Processor" has introduced the division of workload among the multiple processing cores as if the execution happened on a fast single processor; it has also introduced the need for parallel and multi-threaded approaches to solving most kinds of problems. When Quad-Core processors are deployed in a cluster, three types of communication links must be considered: (i) between the two processors on
* This work was supported by IIUM research incentive funds. Abdelgadir Tageldin Abdelgadir has also been working with the MIMOS Berhad research institute.
the same chip, (ii) between the chips in the same node, and (iii) between processors in different nodes. All these communication paths need to be considered on such a cluster in order to deal with the associated challenges [1], [2], [3]. The rest of the paper is organized as follows: in Section 2, we briefly introduce MPI and OpenMP and discuss performance measurement with High Performance Linpack (HPL); Section 3 presents the architecture of our cluster; Section 4 describes the research methodology used; Section 5 records our findings and future expectations; and Section 6 concludes the paper.
2 Basic Terminologies and Background

2.1 MPI and OpenMP

Message passing models provide a method of communication amongst sequential processes in a parallel environment. These processes execute on the different nodes of a cluster but interact by "passing messages", hence the name. There can be more than a single process thread in each processor. The Message Passing Interface (MPI) [8] approach focuses on process communication happening across the network, while OpenMP targets communication between processes and threads sharing memory within a node. With this in mind, it makes sense to employ OpenMP parallelization for communication within the node and MPI for message passing and network communication between nodes. It is also possible to use MPI for each core as a separate entity with its own address space, although this forces us to deal with the cluster differently. With these simple definitions of MPI and OpenMP, a question arises as to whether it is advantageous to employ a hybrid mode, running an MPI process with multiple OpenMP threads on each node, so that there is at least some explicit intra-node communication [2], [3].

2.2 Performance Measurement with HPL

High Performance Linpack (HPL) is a well-known benchmark suitable for parallel workloads that are core-limited and memory intensive. Linpack is a floating-point benchmark that solves a dense system of linear equations in parallel. The result of the test is a metric measured in GigaFlops, billions of floating point operations per second. Linpack performs an operation called LU factorization. This is a highly parallel process that utilizes the processor's cache up to the maximum limit possible, though the HPL benchmark itself may not be considered a memory-intensive benchmark. The processor operations it performs are predominantly 64-bit floating-point vector operations using SSE instructions. This benchmark is used to determine the world's top-500 fastest computers. In this work, HPL is used to measure the performance of a single node and, consequently, of a cluster of nodes, through a simulated replication of scientific and mathematical applications that solve a dense system of linear equations.
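As a minimal illustration of the hybrid mode discussed in Section 2.1 (a generic sketch, not the HPL code used in this paper), the following C program uses MPI for message passing between processes and OpenMP threads for the cores that share memory:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                 /* typically one process per node */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* OpenMP covers the cores sharing memory inside the node. */
    #pragma omp parallel
    {
        printf("process %d of %d, thread %d of %d\n",
               rank, size, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}

On a cluster like the one described in Section 3, such a program would typically be launched with one MPI process per node and eight OpenMP threads per process (e.g., OMP_NUM_THREADS=8); launcher names and options vary across MPI distributions.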
In the HPL benchmark, there are a number of metrics used to rate a system. One of these important measures is Rmax, measured in Gigaflops, which represents the maximum performance achievable by a system. In addition, there is also Rpeak, the theoretical peak performance for a specific system [4]; this is obtained from:

Rpeak = Nproc * Clock_freq * FP/clock    (1)

where Nproc is the number of processors available, Clock_freq is the frequency of a processor in MHz or GHz, and FP/clock is the number of floating-point operations per clock cycle.
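Equation (1) can be evaluated directly. The short C sketch below reproduces the per-node theoretical peak used in Section 4 (eight cores at 3.0 GHz, four floating-point operations per cycle via SSE); the function name is ours:

#include <stdio.h>

/* Equation (1): Rpeak = Nproc * Clock_freq * FP/clock */
double rpeak_gflops(int nproc, double clock_ghz, int fp_per_clock) {
    return nproc * clock_ghz * fp_per_clock;
}

int main(void) {
    /* One compute node: 2 quad-core Xeons = 8 cores at 3.0 GHz. */
    printf("Rpeak per node = %.0f Gflops\n", rpeak_gflops(8, 3.0, 4)); /* 96 */
    return 0;
}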
3 The Architecture of Our Cluster

Our cluster consists of 12 Compute Nodes and a Head Node, as depicted in Figure 1.
Fig. 1. Cluster physical architecture
3.1 Machine Specifications

The cluster consisted of two node types: a Head Node and Compute Nodes. The tests were run on the compute nodes only, as the Head Node differed in both capacity and speed, and its inclusion would have increased the complexity of the tests. The Compute Node specifications shown in Table 1 are the same for all nodes. Each node has a dual Intel Xeon Quad-Core Processor running at 3.00 GHz; the system thus had eight of the mentioned processors per node. The sufficient size of the cache reduces the latencies in accessing instructions and data, which generally improves performance for applications working on large data sets. The Head Node specification (Table 2) was similar, with an Intel Quad Xeon Quad-Core Processor running at 2.93 GHz.
Table 1. Compute Node processing specifications

Element          Features
processor        0 (up to 7)
cpu family       6
model name       Intel(R) Xeon(R) CPU E5450 @ 3.00GHz
stepping         6
cpu MHz          2992.508
cache size       6144 KB
cpu cores        4
fpu              yes
flags            fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips         6050.72
clflush size     64
cache_alignment  64
address sizes    38 bits physical, 48 bits virtual
RAM              16GB
3.2 Cluster Configuration

The cluster was built using the Rocks 5.1 64-bit Cluster Suite. Rocks [9] is a Linux distribution based on CentOS [10], intended for High Performance Computing systems. The Intel 10.1 compiler suite was used, and the Intel MPI implementation and the Intel Math Kernel Library were utilized as well. The cluster was connected to two networks, one used for MPI-based operations and the other for normal data transfer. As a side note relevant to practitioners, a failed attempt was made with an HPCC version of the Linpack benchmark that utilized an OpenMPI library implementation; the results were unexpectedly low, and tests based on the OpenMPI configuration and subsequent planned test-runs were aborted.

Table 2. Head Node specifications

Element          Features
processor        0 (up to 16)
cpu family       6
model name       Intel(R) Xeon(R) CPU X7350 @ 2.93GHz
stepping         11
cpu MHz          2925.874
cache size       4096 KB
cpu cores        4
fpu              yes
flags            fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips         5855.95
clflush size     64
cache_alignment  64
address sizes    40 bits physical, 48 bits virtual
4 Research Methodology

Tests were done in two main iterations: the first iteration was a single-node performance measurement, followed by an extended iteration that included all 12 nodes. These tests consumed a lot of time; the cluster was not fully dedicated to pure research purposes, as it was also used as a production cluster, so time for test-runs was limited. Our main research focused on examining to what extent the cluster would scale, as it was the first Quad-Core cluster to be deployed at the site. In this paper, we focus on the much more successful test-run of the hybrid implementation of HPL by Intel for Xeon processors. In each of the iterations, different configurations and set-ups were implemented; these included changing the grid topology used by HPL according to different settings. This was needed since the cluster contained both an internal grid – between processors – and an external grid composed of the nodes themselves. In each test trial, a configuration was set and performance was measured using HPL. An analysis of the factors affecting performance was recorded for each trial, and graphs were generated to clarify process distribution in the grid of processes.

4.1 Single Node Test

The test for a single node was done for all nodes. This is a precautionary measure to check whether all nodes perform as expected, since the cluster's performance in an HPL test-run is limited by the slowest of the nodes. Table 3 shows the results from the different nodes; the average is approximately 75.6 Gflops. The theoretical peak can be calculated using Equation (2). In each node, there are dual Xeon Quad processors, making the theoretical peak performance equal to:

Rpeak = 8 * 3 * 4 = 96 Gflops/node    (2)
But the maximum performance obtained was an approximate average of 75.6 Gflops/node; this is the Rmax value obtainable for a single node. The efficiency is thus calculated at 78.8%.

Table 3. Performance of Cluster Nodes

Node 1: 7.517e+01 Gflops, Node 2: 7.559e+01 Gflops, Node 3: 7.560e+01 Gflops, Node 4: 7.552e+01 Gflops, Node 5: 7.558e+01 Gflops, Node 6: 7.559e+01 Gflops, Node 7: 7.557e+01 Gflops, Node 8: 7.560e+01 Gflops, Node 9: 7.537e+01 Gflops, Node 10: 7.561e+01 Gflops, Node 11: 7.557e+01 Gflops, Node 12: 7.562e+01 Gflops
Table 4 shows the parameters used for the single node test.

4.2 Multiple Nodes Test

The multiple node test required many iterations to scale well and reach optimal performance in the limited time available. The first thing put into consideration was the grid topology to be used in order to achieve good results. Several grids were proposed based on the knowledge gathered from previous experience; it is considered that in a cluster-wide test, attainment of high
performance depends on the number of cores and the frequency of the processor being used on each node. Distribution of processes is crucial; a balanced distribution of processes will generally result in better performance.

Table 4. HPL configuration for Single Node test

Choice   Parameters
6        device out (6=stdout,7=stderr,file)
1        # of problems sizes (N)
40000    Ns
1        # of NBs
192      NBs
0        PMAP process mapping (0=Row-,1=Column-major)
1        # of process grids (P x Q)
1        Ps
8        Qs
16.0     threshold
1        # of panel fact
012      PFACTs (0=left, 1=Crout, 2=Right)
1        # of recursive stopping criterium
42       NBMINs (>= 1)
1        # of panels in recursion
2        NDIVs
1        # of recursive panel fact.
102      RFACTs (0=left, 1=Crout, 2=Right)
1        # of broadcast
0        BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1        # of lookahead depth
0        DEPTHs (>=0)
2        SWAP (0=bin-exch,1=long,2=mix)
256      swapping threshold
1        L1 in (0=transposed,1=no-transposed) form
1        U in (0=transposed,1=no-transposed) form
0        Equilibration (0=no,1=yes)
8        memory alignment in double (> 0)
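Table 4 corresponds line-for-line to HPL's plain-text input file. A sketch of the equivalent HPL.dat is shown below, following the standard layout (the two header comment lines and the output-file line are part of that layout and do not appear in the table). The compound entries printed as 012, 42 and 102 in the table are reproduced here as the space-separated lists 0 1 2, 4 2 and 1 0 2, matching the stock HPL input file; that reading is our assumption.

HPLinpack benchmark input file
(comment line)
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
40000        Ns
1            # of NBs
192          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
8            Qs
16.0         threshold
1            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4 2          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1 0 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
256          swapping threshold
1            L1 in (0=transposed,1=no-transposed) form
1            U  in (0=transposed,1=no-transposed) form
0            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)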
Generally, HPL is controlled by two main parameters that describe how processes are distributed across the cluster's nodes; these values, P and Q, are both critical benchmark-tuning parameters when good performance is required. P and Q should be as close to equal as possible, and when they are not equal, P should be less than Q. This is because P multiplied by Q gives the number of MPI processes to be used and determines how they are distributed across the nodes. In this cluster, there are several choices, such as 1x96, 2x48, 3x32, 4x24, 6x16 and 8x12. However, the network can affect performance, as can, in our case, the introduction of multi-cores within a single node; so different trials are needed to achieve the best performance.
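The candidate grids listed above are exactly the factorizations P x Q of 96 with P <= Q; a few lines of C enumerate them, which can be handy when planning trials on other core counts (the function name is ours):

#include <stdio.h>

/* Enumerate candidate HPL process grids P x Q with P <= Q, P*Q = nprocs. */
void list_grids(int nprocs) {
    for (int p = 1; p * p <= nprocs; p++)
        if (nprocs % p == 0)
            printf("%dx%d\n", p, nprocs / p);
}

int main(void) {
    list_grids(96);   /* prints 1x96, 2x48, 3x32, 4x24, 6x16, 8x12 */
    return 0;
}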
Another parameter needed is N, the size of the problem to be fed to HPL. We used the following formula, as in [4], to estimate the problem size:

N = sqrt[((Σ Msizes) * 1000000000) / 8]    (3)

where Msizes is the memory of each node in GB, so the sum is the total cluster memory in bytes, and the division by 8 accounts for the 8 bytes occupied by each double-precision matrix element.
This gives a value that is approximately the upper bound for N, for example: N = sqrt(12 * 16 * 1000000000 / 8) ~= 154919. However, it is preferable not to take the whole result; we chose 140000 as N, leaving more than 25% of memory to other local system processes. This avoids the use of virtual memory, which would render the whole test-run useless: an overloaded system will use the swap area, and this will negatively affect the results of the benchmark. It is advisable to make full use of main memory while avoiding the virtual memory. The optimal performance was achieved with the HPL input parameters in Table 5.
Table 5. HPL configuration for 12 Nodes test

Parameter   Value
N           140000
NB          192
PMAP        Row-major process mapping
P           6
Q           16
RFACT       Crout
BCAST       1ring
SWAP        Mix (threshold = 256)
L1          no-transposed form
U           no-transposed form
EQUIL       no
ALIGN       8 double precision words
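The problem-size rule of Equation (3), together with the headroom margin discussed above, can be restated in a few lines of C (our own restatement; function and variable names are illustrative):

#include <math.h>
#include <stdio.h>

/* Equation (3): N = sqrt(total_memory_bytes / 8), since each element of
   the N x N matrix is an 8-byte double. */
long estimate_hpl_n(int nodes, double mem_gb_per_node) {
    double bytes = nodes * mem_gb_per_node * 1e9;
    return (long)sqrt(bytes / 8.0);
}

int main(void) {
    long n_max = estimate_hpl_n(12, 16.0);              /* ~154919 */
    printf("memory-bound upper limit: N ~= %ld\n", n_max);
    printf("chosen N = 140000 (leaves headroom for the OS)\n");
    return 0;
}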
A first expectation was that 3x32 or 4x24 would produce the optimal performance, but a 6x16 grid (Figure 2 and Figure 3) obtained the best performance, at 662.6 Gflops. The performance increase is linear to some extent, but does not equal the absolute sum over 12 nodes, which is 907 Gflops. This is acceptable, as a cluster's performance does not scale linearly in reality [1]; the efficiency of the cluster is thus calculated at approximately 60%, which is satisfactory for a Gigabit-based cluster.
5 Observations, Discussions, and Future Expectations

Looking at the general topological structure of this cluster, we notice that different cores will be completing the same process in parallel; this leads to high network communication between the different nodes in such clusters. Moreover, processing speed tends to be faster than the Gigabit network communication link speed available to the cluster. This translates into waiting time in which some cores may become idle. In preliminary test-runs, we opted for an MPI-only approach based on our previous experience with clusters; the results were disappointing, reaching a
maximum of approximately 205 Gflops. An option was proposed to run the Linpack benchmark test using Intel's MPI library in its hybrid mode. This version featured an MPI-OpenMP implementation of HPL: it uses MPI for the network communication while utilizing OpenMP for local communication between cores. This approach seemed more appropriate for a multi-node cluster, and the results previously presented in this paper are based on this hybrid implementation of the HPL benchmark. The direct effect of this was a fully saturated network as well as fully utilized processors [5], [6].
Fig. 2. Physical view of 6x16 cores grid
Fig. 3. Abstract view of MPI process distribution on the 6x16 grid
Another factor observed to affect the performance of this cluster is the network. The current set-up of the cluster includes two networks, one used solely for MPI traffic; this is the network that obtained the highest result. From these findings, it is recommended that multi-core clusters deployed for MPI jobs have a dedicated network to run those types of jobs. It was noticed in the test-run phases that MPI processes in general generate huge amounts of data, which in turn requires a lot of network bandwidth. This is mainly caused by the higher speed of multi-processing in each node relative to the network speed available in the test cluster. The main advantage of using Intel's MPI implementation in this work is the ability to define network or device fabrics, in other words, to define the cluster's physical connectivity. In this cluster, the fabric can be defined as a TCP network with shared-memory cores, which is analogous to an Ethernet-based SMP cluster. When running without explicitly defining the underlying fabric of our cluster, overall performance degradation was noticeable, as the cluster's overall benchmark result was merely 240 Gflops, about 40 Gflops more than the previously mentioned failed attempts with an MPI-only approach, but still a low number considering the overall expected performance. This was caused by the MPI processes being started without knowledge of the multiple-core nodes: in this scenario, each core is treated as a single component with no communicative relation to its neighboring cores within the same node, resulting in communication rather than processing, which leads to more idle time for that specific core. To solve the problem of the low overall results, a new parameter defining the underlying mechanism for the running MPI library was introduced in the next test-runs, and as expected, the results reached a maximum of 662.6 Gflops. This was the expected result at the beginning of the test, but it was not achievable in our preliminary runs, since it additionally required a definition of the underlying fabric for Intel's MPI. The addition of the option led to an execution aware of both communication types available in this cluster, namely the Gigabit communication between the nodes and the shared-memory communication within a node's cores, which essentially led to the better performance achieved.
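The paper does not give the exact option used, but in Intel MPI 4.x a fabric of this kind (shared memory within a node, TCP between nodes) is selected along the following lines; flag names vary across versions, so treat this as an assumption-laden sketch:

# shared memory inside each node, TCP/Ethernet between nodes
export I_MPI_FABRICS=shm:tcp
mpirun -np 96 -perhost 8 ./xhpl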
Another aspect of these tests was how the cluster was viewed or perceived physically, and how that differed from the way we should look at it. When dealing with multi-core processors, an abstract view is needed as well, and the best method was to use diagrams such as Figures 2 and 3. These figures depict how the 6x16 topology was chosen and how processes are distributed among nodes. It can be noticed from the figures that processes are assigned in a round-robin way across different cores, not nodes. In this cluster, each node has 8 processors, so it can be viewed as 8 different single-core-processor nodes. This distribution of processes affects the overall performance as well. Unexpectedly, the 6x16 grid performed well as a result of having more related processes on a single node, as well as needing less communication between the processes across the grid. In this configuration, each of the running processes can heavily utilize the shared cache and local communication bridges to accomplish some of the tasks, while network communication happens as the processing cores are being utilized. Table 6 summarizes the best as well as the unexpected results obtained from several test-runs. From Table 6, we can notice the drastic performance difference obtained by changing the way we deal with modern-day computer clusters; the high increase in performance was the result of the experience we gained in dealing with these new types of clusters.
Table 6. General summary of trials

Option Types                  Gflops Obtained  PxQ   Problem size N
OpenMPI, MPI                  207 Gflops       8x12  140000
  Comments: Low results, tested with different topologies and mapping schemes.
Intel MPI, fabric-less        204 Gflops       8x12  140000
  Comments: Another low result, although expectations were high, using the non-MPI-only network.
Intel MPI, fabric-less        224.6 Gflops     6x16  140000
  Comments: Good indication of the 6x16 topology, which led us to choose it in later phases.
Intel MPI, TCP+Shared Mem.    662.6 Gflops     6x16  140000
  Comments: The hybrid mode reaches a new peak, 60% overall efficiency.
In general, we can summarize the main observations gathered from a modern-day cluster in the following points:

1. The network significantly affects the cluster's performance. Thus, separating the MPI network from the normal network may result in better overall performance of the cluster.
2. The cluster's compute processors, and the architecture from which the processors inherit their features, should be studied, as different processors perform differently.
3. The MPI implementation in use must be considered, since not all implementations provide the same features or perform similarly, as shown in Table 6, even though all of them can run MPI jobs; the different approaches available to cluster users also matter. Examples of available MPI libraries are the OpenMPI library and the Intel MPI library implementation.
4. Both the physical and abstract aspects are important; details of how MPI applications process data must at least be known by a cluster administrator, as these details will determine how a cluster performs.
5. Multi-node cluster scalability is still debatable; scaling a cluster without upgrading network bandwidth may not achieve its goal of performance improvement, as we found in this work. Performance degradation caused by scaling up was relatively high; we assume a faster network will yield better performance in relation to scalability for these types of clusters.
6 Conclusion

From the obtained results, we can observe the difference between an MPI-OpenMP hybrid implementation and an MPI-only implementation, and how heavily this can affect a benchmark test on multi-core clusters. The numbers obtained are the results of test-runs executed on a 12-node cluster with 96 cores. The tests were done for the purpose of knowing how scalable the cluster was and how well it performs; along the way, many more observations were recorded that we hope will benefit researchers and practitioners working with such clusters.
References

1. Buyya, R. (ed.): High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall PTR, NJ (1999)
2. Dongarra, J.J., Luszczek, P., Petitet, A.: The LINPACK Benchmark: Past, Present, and Future. Concurrency and Computation: Practice and Experience 15, 1–18 (2003)
3. Gepner, P., Fraser, D.L., Kowalik, M.F.: Second Generation Quad-Core Intel Xeon Processors Bring 45 nm Technology and a New Level of Performance to HPC Applications. In: Bubak, M., et al. (eds.) ICCS 2008, Part I. LNCS, vol. 5101, pp. 417–426. Springer, Heidelberg (2008)
4. Pase, D.M.: Linpack HPL Performance on IBM eServer 326 and xSeries 336 Servers. IBM (July 2005), ftp://ftp.software.ibm.com/eserver/benchmarks/wp_Linpack_072905.pdf
5. Saini, S., Ciotti, R., Gunney, B.T.N., Spelce, T.E., Koniges, A., Dossa, D., Adamidis, P., Rabenseifner, R., Tiyyagura, S.R., Mueller, M.: Performance Evaluation of Supercomputers using HPCC and IMB Benchmarks. Journal of Computer and System Sciences 74(6) (2007), doi:10.1016/j.jcss.2007.07.002
6. Rane, A., Stanzione, D.: Experiences in Tuning Performance of Hybrid MPI/OpenMP Applications on Quad-core Systems. In: Proceedings of the 10th LCI International Conference on High-Performance Clustered Computing (2009)
7. Tools for the Classic HPC Developer. Whitepaper published by The Portland Group, v2.0 (September 2008), http://www.pgroup.com/lit/pgi_whitepaper_tools4hpc.pdf
8. Wu, X., Taylor, V.: Performance Characteristics of Hybrid MPI/OpenMP Implementations of NAS Parallel Benchmarks SP and BT on Large-Scale Multicore Clusters. The Computer Journal (to be published, 2011)
9. http://www.rocksclusters.org/
10. http://www.centos.org/
A Protocol for Discovering Content Adaptation Services

Mohd Farhan Md Fudzee1,2 and Jemal Abawajy1

1 School of Information Technology, Deakin University, 3217 Victoria, Australia
{mfmd,jemal}@deakin.edu.au
2 Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, 86400 Johor, Malaysia
Abstract. The service-oriented content adaptation scheme has emerged to address the content adaptation problem. In this scheme, content adaptation functions are provided as services by multiple providers located across a wide area network. To benefit from these services, clients must be able to locate them in the network; this makes service discovery an important component. In this paper, we propose a service discovery protocol that takes into account the searching space, searching time, QoS and physical location of the potential providers. The performance of the proposed protocol is studied in terms of discoverability under various conditions and shown to be substantially better than the keyword-based and QoS-based approaches. Keywords: Content adaptation service, discovery protocol, discoverability.
1 Introduction

Online content is becoming increasingly rich and varied in format. Most of this content, however, is originally designed for desktop-level displays and tends to be made up of different media objects [1]. With the proliferation of user devices varied in size and capability (e.g., processing power, input and output facilities), it is becoming increasingly difficult to deliver content directly to varying devices without adjustment. To address this problem, a service-oriented content adaptation (SOCA) scheme has recently emerged as an efficient paradigm [2], [3], [4]. SOCA promotes the idea of assembling content adaptation functions into a network of services, thus enabling clients to access a variety of adaptation services such as content annotation, transcoding and translation. To benefit from these services, a client must be able to locate and invoke them in the network. This necessitates service discovery as an important component. Content adaptation is a time-critical Internet service, and most of its clients use mobile devices with wireless network connectivity [3]. Therefore, what is required is a protocol that quickly terminates when the specified search space for potential services is achieved, rather than performing an extensive search. In this way, searching time can be reduced. It should also take into account the quality of service (QoS) levels offered by the service providers, to match them with the client QoS requirements. Furthermore, choosing closer providers can avoid high-latency hops, hence relatively reducing the
time to deliver the adapted content version. Although there are many service discovery protocols, none has been developed specifically for SOCA systems (e.g., [1], [2], [3], [4]) or solves the aforementioned issues simultaneously. The innovative aspect of our work is that the service discovery protocol quickly terminates the search when the specified search space or number of matched services is achieved. The rest of the paper is organized as follows. In Section 2, related work is presented. Section 3 presents the service discovery protocol, including the system model. Sections 4 and 5 present the performance evaluation and the discussion of the results, respectively. Finally, we conclude the paper in Section 6.
2 Related Work

A service discovery protocol is paramount to any service-oriented system. The service discovery requirements (i.e., service description, storage of service descriptions, message communication and searching methods) differ from one application scenario to another (e.g., wired, wireless, MANET, vehicular networks). Many existing service discovery solutions are originally based on the Universal Description, Discovery and Integration (UDDI) reference model. Current discovery solutions can be divided into two dominant approaches: function-based and non-functional-based. Function-based approaches utilize the service's description (i.e., the function's name, input and output parameters, preconditions, and effects) to match user queries for services. One of the well-known function-based approaches is keyword matching, for example: Sun's Jini, IBM's Salutation, and OASIS's UDDI (for wired networks); and Bluetooth and ZigBee (for wireless networks) [5], [6]. However, a keyword-based approach may return a huge list containing inappropriate services that do not satisfy the requester's intended requirements. To solve this, ontology technology is used: an ontology organizes service profiles according to their semantics [6]. Ontology-based approaches, however, may suffer from performance problems (e.g., inappropriate matching) due to the use of immature ontology reasoners [7]. On the other hand, non-functional-based approaches incorporate the service's QoS attributes together with the service's descriptions. QoS is a set of the service's non-functional attributes such as cost, time, rating, and reputation. Efforts such as OWL-Q are trying to make QoS description more flexible and to present a formal description of a service [8]. Existing methods such as [9], [10] perform matchmaking between the client's and the service's QoS, thus requiring that both QoSs be known a priori. However, discovering all services from an Internet-scale list that specifically match the client's QoS requirements is time consuming. Although many service discovery protocols exist, to the best of our knowledge none is designed for SOCA, and existing ones cannot be directly adapted, as they tend to perform extensive searching, which renders them unsuitable for SOCA. Our previous work in [14] proposes a method to discover potential services when the client QoS is unknown; however, it does not address these specific issues. Therefore, what is required is an algorithm that quickly terminates when the specified search space or a number of matched services is achieved. We aim to provide a solution to this.
3 Service Discovery Protocol

In this section, we describe the system model and the proposed protocol.

3.1 System Model
In the system, there are N content adaptation service providers and B brokers geographically distributed across a wide area network. A service provider may provide or perform one or more content adaptation tasks; formally, the system offers a set of tasks T = {t1, t2, …} and a set of services S = {s1, s2, …, sN}. The service providers advertise their services, along with the offered QoS levels (including the cost), using the UDDI publication API web interface at the local business service registry. Service registries are distributed across the wide area network and maintained by mediators (i.e., business organizations such as HP and IBM). Each broker keeps and maintains a list of local service registry information. It uses the pinger logic to measure the proximity of the service registry(s). Upon receiving the proximity measurement from each accessible service registry, the broker lists these registries in ascending order of the proximity measurement. The broker also provides access to content servers for the clients. It uses the adaptation decision-taking engine (ADTE) to produce the required tasks in a manner similar to [8]. The resulting tasks can be mapped into detailed processes using the web services business process execution language (WS-BPEL 2.0). The output from WS-BPEL is a sequence of required services.
3.2 Proposed Protocol
There are multiple brokers; each receives requests from many clients. A request is composed of one or more tasks. Each broker has to correctly store this detail and keep track of each look-up inquiry to the registry(s). Upon receiving the request (containing the required tasks) for a specific client, including the corresponding client QoS requirements (Step A0), the broker performs the discovery algorithm and initiates a look-up to the accessible local service registry(s). Each request is uniquely identified by the client ID and the client's request ID. Figure 1 presents the detailed protocol. The registry is responsible for caching and maintaining service advertisements and allows for service look-up (Step B0). Providers are responsible for publishing and periodically updating their service offers, including the availability status (Step C0). Through the inquiry API, the broker sends a runtime message using SOAP via the HTTP protocol to the UDDI public business XML-based registry, to perform a browse and drill-down query (Step A1). Each inquiry made by the broker is identified by the broker ID and the inquiry ID. The inquiry API allows the broker to locate and obtain details entered in a UDDI registry. The browse and drill-down query combines 'find' and 'get' operations made on the key attribute associated with the data retrieved. The query message contains the look-up for available services for performing the content adaptation tasks. The query includes information on the specific adaptation functions, QoS levels including availability, IP address, proximity measurement and handle (i.e., binding template). The registry responds to the query made by the broker and returns a message containing the services and required information, keyed by the registry ID (Step B1). The binding template provides the command with which a program can start interacting with the service [11].
Fig. 1. The proposed service discovery protocol
Upon receiving the reply message from the registry (Step A2), the following assessment is carried out: if the discovery algorithm returns services (Step A2.0), the broker returns the potential service providers (Step A3). The service providers can then be automatically invoked using their binding templates. Otherwise, the broker informs the client that content adaptation is not performable due to service unavailability, and terminates its operation (Step A2.1). Figure 2 outlines the pseudo-code of the service discovery algorithm. The inputs to the algorithm are the set of tasks and the set of published services for each task. We assume the numbers of required tasks and local registries are known a priori. During initialization, the algorithm sets the number of services in the sorted list, the timer t, the partial matched service list and the number of searched services NT to zero. The broker also sets the acceptable waiting period for each requested registry to respond, Trespond, the minimum number of services to be searched, NMIN, and the maximum number of matched services, NMAX. It then dispatches the broker's agent to the first nearest accessible business registry. The agent is executed at the registry and then moves to the next registry for service collection. Upon arrival at a registry, the agent performs the registry accessibility assessment. It sends a drill-down query for information. The agent reads and stores the current time upon arrival as Tstart and calculates the time to get a reply as Twait = Tstart + timer t. If the waiting time Twait is greater than Trespond, it aborts query execution; otherwise, for each task, it starts retrieving services advertised by providers at the registry until either the maximum number of matched services NMAX or the minimum number of searched services NMIN is reached. For each service retrieved, it evaluates the match category based on Definitions 2 to 4, performs the update algorithm (Figure 3) and increments the number of currently searched services NT by 1. If either of the two conditions is met (i.e., the matched services reach NMAX or the searched services reach NMIN), the algorithm breaks the DO loop (lines 5 to 10). Definitions 1 to 4 are as follows:

Definition 1. (Client Requirement) Let R be the client requirements, defined in the form R = (t, Q), where t is the particular content adaptation task required and Q is the set of maximum/minimum QoS levels agreed by the client. A maximum QoS level (i.e., upper bound) is set for positive monotonic QoS such as reputation and rating. A minimum QoS level (i.e., lower bound) is set for negative monotonic QoS such as cost and time.

Definition 2. (Matching) Suppose that the client requirement R = (t, Q) is given, and let a service offer function ts with QoS levels Qs. If (ts = t and Qs satisfies every bound in Q), then it is a matching category.

Definition 3. (Partial matching) If (ts = t but Qs satisfies only some of the bounds in Q), then it is a partial matching category.

Definition 4. (Non-matching) If (ts ≠ t), then it is a non-matching category.

Algorithm 1. Service Discovery
INPUT: T, S, R
OUTPUT: sorted lists for all tasks
BEGIN
1: Initialization
2: FOR each registry DO
3:   Registry accessibility assessment
4:   FOR each task DO
5:     DO
6:       Retrieve service
7:       Find service matching category
8:       Update
9:       Increase NT by 1
10:    WHILE ((NT < NMIN) AND (matched services < NMAX))
11:    IF ((NT = NMIN) AND (sorted list = empty)) THEN
12:      Append partial matching list into sorted list //constraint relaxation
13:    END IF
14:    Attach proximity measurement and handle
15:    Proximity assessment
16:  END FOR
17: END FOR
18: RETURN sorted lists for all tasks
END

Fig. 2. Service discovery algorithm
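Definitions 2–4 amount to a three-way classification, which the C sketch below illustrates. The structure fields, the single cost/rating pair, and the bound directions (a maximum for the negative-monotonic cost, a minimum for the positive-monotonic rating) are our own simplifying assumptions, not the authors' implementation:

#include <stdbool.h>
#include <string.h>

typedef enum { MATCHING, PARTIAL_MATCHING, NON_MATCHING } Category;

typedef struct {
    const char *task;    /* required adaptation function, e.g. "transcoding" */
    double max_cost;     /* negative-monotonic QoS bound (assumption) */
    double min_rating;   /* positive-monotonic QoS bound (assumption) */
} Requirement;

typedef struct {
    const char *task;    /* function offered by the provider */
    double cost;
    double rating;
} Service;

/* Definition 2: the task matches and every QoS bound holds.
   Definition 3: the task matches but the QoS bounds only partly hold.
   Definition 4: the required task is not offered at all. */
Category classify(const Service *s, const Requirement *r) {
    if (strcmp(s->task, r->task) != 0)
        return NON_MATCHING;
    bool cost_ok   = (s->cost   <= r->max_cost);
    bool rating_ok = (s->rating >= r->min_rating);
    return (cost_ok && rating_ok) ? MATCHING : PARTIAL_MATCHING;
}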
Figure 3 depicts the update algorithm. Upon receiving the input (i.e., a service and its matching category) from the main discovery algorithm, it performs the following assessments. Matching services are appended and sorted into the sorted list, and the counter of matched services is increased by 1. The first service that belongs to the partial matching category is kept in the partial matching service list if that list is empty. Any non-matching service, or any partial matching service except the first one, is discarded.

Algorithm 2. Update
INPUT: service (s) and its matching category
OUTPUT: Updated sorted list
BEGIN
1: IF (category = matching) THEN
2:   Append and sort service (s) into the sorted list
3:   Increase the matched counter by 1
4: ELSE IF ((category = partial) AND (partial list = empty)) THEN
5:   Store service (s) into the partial list
6: ELSE
7:   Discard service (s)
8: END IF
END

Fig. 3. Update algorithm
Then it carries out the following assessment: if either the number of currently searched services NT equals the minimum number of searched services NMIN, or the matched services list is empty, the service from the partial matching list is appended into the sorted list (i.e., QoS relaxation is imposed). Otherwise, it attaches the provider's proximity measurement and handle to each service stored in the sorted list and performs the proximity assessment function. To perform this function, the proximity measurement of each provider is required; the registry is assumed to have the capability of measuring proximity to providers using the pinger logic, in the same manner the broker measures proximity to the registry. The function then sorts the list and returns it in ascending order of the proximity measurement. The look-up and proximity assessment processes are repeated for each task, eventually returning a sorted list for all tasks. Finally, the discovery algorithm returns the list for each task to the broker. Upon receiving the output from the discovery algorithm, for each task, the broker can randomly select one of the providers from the list, preferably one with a higher rank. In this way, a set of closer providers is discovered to perform the client's request. Alternatively, the broker can use QoS criteria while preserving proximity to select the best possible composition of service providers, in a manner similar to [2], [13]. For each provider selected, the corresponding handle is used by the broker to enable further communication. The proposed discovery algorithm has several strengths. It finds services that match the client requirements. Once the specified search space or a number of matched services is achieved, the algorithm terminates quickly; thus it avoids performing an extensive search, which significantly reduces searching time. Also, the algorithm returns sets of closer providers to select from. If required, checking the actual
responsiveness of the top services requires round-trip time (RTT) measurement to only a small number of providers.
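The final proximity assessment is, in essence, a sort of the matched providers by their measured proximity; a minimal C sketch (types and names are illustrative assumptions) is:

#include <stdlib.h>

typedef struct {
    const char *handle;    /* binding template used to invoke the provider */
    double proximity_ms;   /* pinger-logic measurement to the provider */
} Provider;

static int by_proximity(const void *a, const void *b) {
    double d = ((const Provider *)a)->proximity_ms
             - ((const Provider *)b)->proximity_ms;
    return (d > 0) - (d < 0);
}

/* Proximity assessment: order the matched providers so that closer
   (lower-latency) ones come first in the returned list. */
void proximity_sort(Provider *list, size_t n) {
    qsort(list, n, sizeof *list, by_proximity);
}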
4 Performance Evaluation

We develop a metric that computes the discoverability of the service discovery protocol. The discoverability metric [0…1] quantitatively expresses the mixture of three important factors: searching time, the number of searched service providers, and the match type of each service returned by the protocol to serve a given request. Unlike the discoverability metric presented in [12], ours is designed to evaluate requests with multiple tasks, i.e., multiple service types. The discoverability metric D is formulated as:

D = (Sspace + Smatch + Stime) / 3    (1)

where Sspace is the search space score, Smatch is the aggregate score of the returned services' matching categories and Stime is the searching time score. The discoverability metric is developed based on the observation of the following factors:

1. If the search space (of services) is increased, a better assessment is made. Hence, it is directly proportional to the discoverability metric, i.e., discoverability is better if the search space is wider. Given the number of tasks T, the number of searched services by the end of the search execution NT, and the total services available for each task N, Sspace is computed by adding NT/N over all tasks and then dividing by T.

2. If the returned services belong to the matching category, the client will be served better, as the client is provided with exact or better QoS levels than required. Hence, the match category is directly proportional to discoverability. Given the number of tasks T, the match value of each returned service, and the total services returned for each task, Smatch is computed by adding, over all tasks, the sum of the returned services' matching values divided by the number of services returned for that task, and then dividing by T. The matching value is 1, 0.5 or 0 for match, partial match or non-match, respectively.

3. In terms of searching time TST, the more time taken, the later service providers will be located, which relatively increases the amount of time to provide the client with the adapted content version. Hence, searching time is inversely proportional to discoverability. Given the searching time for the particular protocol until it terminates, TST, and the minimum searching time among the protocols being compared, Stime is computed by dividing that minimum searching time by TST.

As experiments may not fetch searching time from the protocols being compared, and given the difficulty of estimating the total services available, we simulate the system with all possible cases, considering variations in multiple factors: search space, matching category of each returned service, and simulation searching time. Simulation also allows the simulated environment to be controlled and the exact setting to be repeated. Two different simulations were conducted to study the discoverability metric towards (1) the number of tasks and (2) the number of service providers.
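A direct transcription of Equation (1) and the three factor definitions is given below; the variable names mirror those introduced above, and each factor is assumed to be pre-normalized to [0, 1]:

#include <stdio.h>

/* Factor 1: mean of NT/N over the T tasks (search space score). */
double search_space_score(const int nt[], const int n[], int tasks) {
    double sum = 0.0;
    for (int i = 0; i < tasks; i++)
        sum += (double)nt[i] / n[i];
    return sum / tasks;
}

/* Factor 2: mean over tasks of the average matching value of the
   returned services (1.0 match, 0.5 partial, 0.0 non-match). */
double match_score(const double mc_sum[], const int returned[], int tasks) {
    double sum = 0.0;
    for (int i = 0; i < tasks; i++)
        sum += mc_sum[i] / returned[i];
    return sum / tasks;
}

/* Factor 3: minimum searching time among compared protocols over this
   protocol's searching time, so the fastest protocol scores 1.0. */
double time_score(double t_min, double t_st) {
    return t_min / t_st;
}

/* Equation (1): discoverability as the mean of the three factors. */
double discoverability(double s_space, double s_match, double s_time) {
    return (s_space + s_match + s_time) / 3.0;
}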
These variations are chosen to evaluate the scalability and reliability of the proposed protocol compared with others. The value used for each parameter is in line with the current literature and reflects the actual environment: the numbers of tasks, service providers and QoS attributes are in line with the work of [1], [13]. We used two well-known service discovery protocols as the baseline approaches. A keyword-based protocol (e.g., Sun's Jini, IBM's Salutation, and OASIS's UDDI) and a more recent QoS-based protocol (e.g., DAML-QoS [9]) were chosen because they are widely accepted and comparable to our protocol (i.e., wired network applications). A keyword-based protocol is characterized by the ability to match the required adaptation function. A QoS-based protocol, on the other hand, has the ability to return services that match both the required adaptation functions and the client QoS requirements. Both protocols perform extensive searching.
5 Results and Discussion

Extensive simulation analysis of the proposed algorithm has been carried out. Figure 4(a) shows the discoverability ratio (y-axis) as a function of the number of tasks (x-axis). In this simulation, we varied the number of tasks from 1 to 5. The numbers of registries, services, QoS attributes, the sorted list size and NMIN were set to 1, 10, 1, 5 and 5, respectively. The proportion of matching category services was randomized between 70% and 90%. As can be seen, there is only a very small decrement in discoverability for the proposed algorithm compared to the others along the x-axis. The proposed discovery algorithm provides the highest discoverability, while the keyword-based algorithm provides the least. The proposed algorithm constantly produces around 85% discoverability. There is a considerable difference, of 8% on average, between the proposed algorithm and the QoS-based one, and of 15% on average between the proposed algorithm and the keyword-based one, along the x-axis. This indicates that the proposed algorithm is more stable towards task variations than the others. This is due to its early termination when the specified search space is achieved (NMIN is met), which minimizes search time while still returning matching category services. For the others, extensive searching results in a considerable and increasing decrement as the number of tasks increases. In addition, the keyword-based approach suffers the most, as it tends to return every single service regardless of its matching category. Figure 4(b) shows the discoverability ratio (y-axis) as a function of the number of service providers (x-axis). In this simulation, we varied the number of service providers from 5 to 20. The numbers of registries, tasks, the sorted list size and QoS attributes were set to 1, 2, 5 and 1, respectively. NMIN was set to the rounded half of the total available services. The proportion of matching category services was again randomized between 70% and 90%. As can be seen, there is a very small decrement in discoverability for the proposed discovery algorithm and a small decrement for the others along the x-axis. The proposed algorithm generated higher discoverability (around 84%), while the keyword-based algorithm provided the least (around 70%). The reason is that the keyword-based approach searched and returned all available services, including those in the non-matching category. There is a small difference of 6% between the proposed algorithm and the QoS-based one, and of 15% between the proposed algorithm and the keyword-based one, along the x-axis. The slight decrement between the proposed algorithm and the QoS-based one (compared to Figure 4(a))
is due to the fact that the proposed discovery algorithm terminates quickly (NMIN is met or the sorted list is full) even though the number of providers increases along the x-axis. The simulation implies that the number of services has a minor impact on the discoverability of all the algorithms. It is worth noting, however, that if NMIN is constant (as observed in a different simulation setting), the proposed algorithm will experience a slight discoverability decrement as the number of providers increases along the x-axis.
Fig. 4. Discoverability towards (a) task and (b) service provider variations
In summary, some key findings were observed and a comparative discussion can be made. The discoverability of the proposed algorithm is higher than that of the others (i.e., the keyword-based and QoS-based approaches) due to certain factors. First, it has the same capability as (1) the keyword-based approach in terms of locating service(s) that match the required adaptation function(s), and (2) the QoS-based approach in terms of discovering potential services that match the client QoS requirements. Second, it has the advantageous feature of quickly terminating the search when the specified search space or a number of matching category services is achieved. As a result, the proposed algorithm benefits from minimized searching time and matching services. On the other hand, both the keyword-based and QoS-based approaches suffer from longer searching times due to their extensive search, and the keyword-based approach is additionally penalized by the non-matching service(s) it returns. Both approaches, however, score slightly higher than the proposed algorithm in terms of a wider search space.
6 Conclusion

In this paper, we proposed a service discovery mechanism for the SOCA platform. To the best of our knowledge, most (if not all) service discovery protocols do not take the search space, matching category, searching time and network proximity factors into account simultaneously. The proposed service discovery protocol is proximity-aware and quickly terminates when the specified search space or a number of matching category services is achieved. Consequently, searching time is significantly minimized, as well
as the accumulated time to provide the client with the adapted content version. In summary, the proposed protocol was able to clearly meet its objective: it increases discoverability and outperforms the keyword-based and QoS-based approaches. We summarize our contributions in three points: (1) we proposed a service discovery protocol for SOCA, (2) we designed the discoverability metric, and (3) the proposed discovery algorithm was simulated under various conditions and demonstrated to have high discoverability compared to the pure keyword-based and QoS-based approaches. In future, we plan to study how to integrate QoS as a selection criterion. We are also working on how to optimize the settings of the specified search space and the number of matched services.
References

1. Fawaz, Y., Berhe, G., Brunie, L., Scuturici, V.-M., Coquil, D.: Efficient Execution of Service Composition for Content Adaptation in Pervasive Computing. Int. Journal of Digital Multimedia Broadcasting, 1–10 (2008)
2. Md Fudzee, M.-F., Abawajy, J.: QoS-based Adaptation Service Selection Broker. Future Generation Computer Systems 27(3), 256–264 (2011)
3. Azhan, N., Hui, S., Imran, G., Izzuddin, T.: Using Service-based Content Adaptation Platform to Enhance Mobile User Experience. In: Int. Conf. on Mobile Technology and Applications, pp. 552–557. ACM Press, New York (2007)
4. Tonnies, S., Kohncke, B., Hennig, P., Balke, W.: A Service Oriented Architecture for Personalized Rich Media Delivery. In: IEEE Int. Conf. on Service Computing, pp. 340–347. IEEE Press, New York (2009)
5. Mian, A., Baldoni, R., Beraldi, R.: A Survey of Service Discovery Protocols in Multihop Mobile Ad Hoc Networks. IEEE Pervasive Computing 8(1), 66–74 (2009)
6. Yu, Q., Liu, X., Bouguettaya, A., Medjahed, B.: Deploying and Managing Web Services: Issues, Solutions, and Directions. The VLDB Journal 17, 537–572 (2008)
7. Kritikos, K., Plexousakis, D.: Requirements for QoS-based Web Service Description and Discovery. IEEE Trans. on Service Computing 2(4), 320–327 (2009)
8. Song, X., Dou, W.: A Workflow Framework for Intelligent Service Composition. Future Generation Computer Systems 27(5), 627–636 (2011)
9. Kritikos, K., Plexousakis, D.: Mixed-integer Programming for QoS-based Web Service Matchmaking. IEEE Trans. on Service Computing 2(2), 122–139 (2009)
10. Dastjerdi, A., Tabatabaei, S., Buyya, R.: An Effective Architecture for Automated Appliance Management System Applying Ontology-based Cloud Discovery. In: 10th IEEE/ACM CCGrid, pp. 104–112. IEEE Press, New York (2010)
11. Pastore, S.: The Service Discovery Methods Issue: A Web Services UDDI Specification Framework Integrated in a Grid Environment. Journal of Network and Computer Applications 31, 93–107 (2008)
12. Chakraborty, D., Joshi, A., Yesha, Y., Finin, T.: GSD: A Novel Group-based Service Discovery Protocol for MANETs. In: 4th IEEE Conf. on Mobile and Wireless Communications Networks, pp. 140–144. IEEE Press, New York (2002)
13. Berhe, G., Brunie, L., Pierson, J.: Content Adaptation in Distributed Multimedia Systems. Journal of Digital Information Management 3(2), 96–100 (2005)
14. Md Fudzee, M.-F., Abawajy, J., Deris, M.: Service Discovery for Service-oriented Content Adaptation. In: 7th ICST Mobiquitous, pp. 1–2. Springer, Heidelberg (2010)
Securing RFID Systems from SQLIA
Harinda Fernando and Jemal Abawajy
School of IT, Deakin University, Australia
{hsf,jemal.abawajy}@deakin.edu.au
Abstract. While SQL injection attacks have been plaguing web applications for years, the threat they pose to RFID systems has only recently been identified. Because the architectures of web systems and RFID systems differ considerably, the prevention and detection techniques proposed for web applications are not suitable for RFID systems. In this paper we propose a system to secure RFID systems against tag-based SQLIA. Our system is optimized for the architecture of RFID systems and consists of a query structure matching technique and a tag data cleaning technique. The novelty of the proposed system is that it is specifically aimed at RFID systems and is able to detect and prevent second-order injections, a problem most current solutions have not addressed. The preliminary evaluation of our query matching technique is very promising, showing a very high detection rate with minimal false positives.
1 Introduction
RFID is a tagging technology that allows an object, place, or person to be automatically identified from a distance without visual or physical contact [1]. But the nature of RFID technology dictates that the enhanced automation and productivity come at the price of an increase in security threats. One such major problem for RFID systems is their vulnerability to SQL Injection Attacks (SQLIA). A successful SQLIA can have a range of detrimental impacts on the system, including corruption and compromise of information or infection of other programs and system components. SQLIA have been a major issue for web-based systems for a number of years, but the possibility of them impacting RFID systems was not considered until recently. The authors of [2], who first identified the possibility of RFID-based SQLIA, demonstrated how a fully functional RFID virus can infect and spread via SQLIA. These RFID viruses use SQLIA as their attack vector and infect new tags and databases. While a number of different techniques have been proposed for SQLIA detection and prevention in web applications, none of them has been truly effective so far. In addition, the differences in the architecture of web and RFID systems mean that most of the approaches proposed for web systems do not work for RFID applications. Therefore the detection and prevention of SQLIA in RFID systems remains an urgent research problem. In this paper we propose a dual-pronged defense mechanism for protecting RFID systems from tag-based SQLIA. The use of a dual mechanism affords protection against second-order injection while ensuring that the defense is hard to bypass. The proposed method consists of cleaning (validation and sanitization) of RFID-
based data and then matching the structure of the dynamically generated queries against the legal structures defined by the programmers. The evaluation of the system was promising, with a 100% detection rate and a 0% false positive rate. The rest of this paper is organized as follows. Section 2 presents the related work in the area. Section 3 presents SQLIA in the context of RFID systems. Section 4 describes the proposed SQLIA prevention technique, while Section 5 presents the results of our evaluation. We finish the paper with our conclusions in Section 6.
2 Related Work
An SQLIA occurs when an attacker changes the logic, semantics or syntax of a legitimate SQL query by inserting additional SQL keywords and operators into the inputs used to build that query, in such a manner that the database interprets the input as part of the command, so that the changed query compromises the security of the database when executed [3]. While a large number of SQLIA defense mechanisms exist for web-based systems, very little work exists on preventing SQLIA in RFID applications. Overall, SQLIA defense techniques can be classified into two main types: (1) defensive coding practices and (2) detection and prevention techniques [4].

Defensive coding practices revolve around ensuring that all accepted inputs are validated before being used to build queries [5]. Examples of defensive coding are input type checking, which ensures that the input data is type-consistent with the expected data for that value, and encoding of inputs, which changes the input so that the database does not mistake meta-characters in the input for keywords, tokens or operators [3]. Defensive coding techniques remain one of the simplest ways to prevent SQL injection attacks. Unfortunately, they are typically very simple, can be easily bypassed, and are prone to human error, which makes them less effective [5]. In RFID-based systems this is not an issue, because all dynamic queries are generated by the middleware, so it is sufficient to place input validation at that single point. In addition, defensive coding is only one half of our defense mechanism, which means that even if it is bypassed, the query matching will still detect any SQLIA.

The other approach to SQLIA security is the various SQLIA detection and prevention techniques. Unfortunately, as most of these techniques have been developed specifically to protect web-based systems, they do not translate well to the architecture used in RFID systems. Black-box testing techniques such as the one proposed in [6] use a web crawler to identify all possible attack points in the application. Because there is only one possible attack point for SQLIA in RFID systems, this type of technique is overkill and unnecessary for RFID systems. The new query development paradigms proposed in [7] use encapsulation of database queries to provide a safe and reliable way to access the database. While this system is secure, it cannot be used for existing legacy systems, and it requires programmers to learn a completely new development process, which makes the overhead of implementing such a system too high. SQLrand [8] is an instruction-set randomization technique that allows developers to create SQL queries using randomized instructions. This technique is based on cryptographic integrity-check
systems; it not only places significant overhead on the system, but its security is also fully dependent on the secrecy of the key used in the randomization process. The computational overhead is an issue for RFID systems, which require very high throughput rates. Static code checking is a method by which the source code is checked for various weaknesses that make it vulnerable to SQLIA. These methods are only effective against a specific type of SQLIA [9]. Because RFID systems are targeted by around six different SQLIA types, this makes them unsuitable for these systems. Another common technique is a hybrid of static code analysis and dynamic runtime monitoring. In this technique the code is analyzed for weaknesses, and all possible legal query patterns are identified, during the static analysis phase; the identified query patterns are then used to analyze and validate the SQL queries generated during the runtime monitoring phase. AMNESIA [10], an approach based on this method, uses a web crawler to identify possible input sources for the system. Given the much lower number of legal query patterns and the single point of query generation in RFID, this approach is unnecessarily complex. In addition, the NDFAs it uses for query matching are quite complex and may overestimate the set of legal queries, which may result in illegal queries being mistaken for legal ones. In SQLCheck [11] the authors generate a parse tree to represent legal queries and compare it to the parse tree of the dynamically generated query. Unfortunately this approach uses keys that must be kept secret and requires the developer to use special intermediary libraries or to manually insert special markers in the code. It also uses a needlessly complex query matching system, putting unnecessary overhead on the system and making it unsuitable for RFID systems as well [3]. In general, all current query pattern checking techniques have (1) unneeded complexity and computational overhead and (2) weaknesses in the query models, due to the automated manner in which they are built, resulting in lower security [4]. Finally, and most significantly, none of these systems is capable of detecting or preventing second-order SQLIA attacks.
3 SQLIA in Networked RFID
While SQLIA have been an issue for web systems for a long time, they have only recently been identified as a threat to RFID systems. Figure 1 shows how tag-based SQLIA are mounted on RFID systems. RFID tags store data that is read and forwarded by the readers to the middleware. The middleware uses the received data to build dynamic RFID queries, which are then forwarded to the database. When attackers want to mount an SQLIA on this system, they save the malicious data on the tag itself. This data, which is read and forwarded to the middleware by readers, is then used to build SQL queries that are forwarded to the database for execution [2]. These malicious queries can mount a range of attacks, varying from deleting tables and crashing the server to corrupting the data stored in the tables. Later on, additional tags may be updated with the corrupted data stored in the database. If the malicious data is written correctly, this will cause the recently updated tag to become infected, and it will in turn go on to infect and compromise other systems' middleware and databases [12].
Fig. 1. SQLIA in RFID systems
But differences between web systems and RFID systems affect the way SQLIA are mounted on them and therefore how they can be prevented. Unlike in web-based applications, where queries can originate from a large number of different applications or web pages, the dynamic queries in RFID systems are all generated by the middleware. Another key feature of RFID systems is the limited amount of data stored on the tag and the limited access given to the tag [12]. These features allow very strict data standards to be set for tag data compared to web form input, and make it relatively easy to validate and sanitize input data coming from the RFID tags. RFID tags are also treated as simple data containers that send data to the system, as opposed to web pages, which are treated as input-output devices. This makes most attacks based on getting feedback or error messages in response to SQLIA useless against RFID systems. In addition, RFID systems have a smaller number of dynamically generated queries, set by a single developer. The number of valid structures possible for the dynamically generated queries is therefore relatively low and easy to track, which makes the generation of valid query structures easier.
4 Proposed Solution
Taking into account the above differences between web and RFID systems, we propose a simple yet effective dual-pronged SQLIA detection system for RFID systems (shown in Figure 2). It consists of RFID tag data cleaning and dynamic SQL query pattern matching. The technique we propose has two mechanisms because data validation easily detects the simpler SQLIA and allows the detection of second-order injection, but can be bypassed by more complex attacks. The query matching, in turn, is much more difficult to evade but does not protect against second-order injection. Therefore, by using both techniques we ensure protection against second-order injection while making the overall defense much more difficult to bypass.
Each of our proposed methods has two distinct phases. During static analysis, the system and the RFID tag data are analyzed, and policies concerning the data stored on the tags and the legal structures of the dynamic queries are made. During the runtime monitoring phase, the system is monitored while in use and the policies made during the previous phase are enforced. The static phase takes place once during the development of the system and can be repeated regularly to ensure that all rules and legal query structures are up to date. The runtime monitoring phase is ongoing and runs continuously while the system is in use. The three modules shown in the proposed-method box (RFID data cleaning, dynamic query generation and SQL query pattern matching) are all implemented in the RFID middleware.
Fig. 2. Overview of proposed system
SQLIA depend on inputting data in unexpected or unusual formations and structures to succeed. Therefore, if we can verify that the received inputs are of the expected structure and type, a majority of the simpler SQLIA can be prevented [5]. In our technique, RFID tag data cleaning consists of ensuring that the data received from RFID tags adheres to a pre-defined set of rules and standards. To increase the protection afforded by this approach, we use two distinct types of rules: white-list rules and black-list rules. White-list rules are used to validate that the received data matches preset conventions for features such as data type, maximum and minimum length, and formatting standards. Black-list rules are used to ensure that the data does not contain any forbidden characters or keywords. While this approach is not effective in web-based systems because of their architecture, the different architecture used in RFID systems makes data cleaning a simple yet effective technique for preventing SQLIA. Additionally, if done properly, this technique can prevent second-order injection attacks, which cannot be prevented using query structure matching techniques.
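To make the rule-based cleaning concrete, the following Python sketch shows one possible realization of white-list and black-list rules; the field names, length limits and forbidden tokens are illustrative assumptions, not values prescribed by this paper.

import re

# Illustrative white-list rules: allowed pattern and maximum length per tag field.
WHITE_LIST = {
    "tag_id":       {"pattern": re.compile(r"^[0-9A-F]{8,24}$"), "max_len": 24},
    "product_name": {"pattern": re.compile(r"^[A-Za-z0-9 ]+$"),  "max_len": 40},
}

# Illustrative black-list rules: characters and keywords that must never appear.
FORBIDDEN_CHARS = set("'\";")
FORBIDDEN_KEYWORDS = {"select", "insert", "update", "delete", "drop", "union", "--"}

def clean_field(name, value):
    """Return True only if the field passes both white-list and black-list rules."""
    rule = WHITE_LIST.get(name)
    if rule is None or len(value) > rule["max_len"]:
        return False
    if not rule["pattern"].match(value):                # validation (white list)
        return False
    if any(ch in FORBIDDEN_CHARS for ch in value):      # sanitization (black list)
        return False
    lowered = value.lower()
    # Deliberately strict substring check; legitimate data never contains these.
    return not any(kw in lowered for kw in FORBIDDEN_KEYWORDS)

# A tag field carrying an injection payload is rejected; clean data passes.
assert not clean_field("product_name", "milk'; DROP TABLE product;--")
assert clean_field("tag_id", "04A224E9B31C80")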
While RFID data cleaning is one of the simplest and most effective countermeasures to RFID SQLIA, its strength rests purely on the strength of the rules defined [5]. To ensure security against more complex attacks that can bypass this measure, we propose a second security mechanism that takes into account the structure of legal SQL queries. SQLIA work by injecting additional code or conditions into legal SQL queries in such a manner as to make them carry out an unexpected process [13]. Our query pattern matching mechanism takes advantage of this fact to identify when dynamically generated queries differ in structure from the expected query, and therefore to identify them as SQLIA and block their execution. The proposed technique is a simple and computationally minimal query pattern matching technique that employs string comparison and is sufficient for protecting RFID systems. The proposed method is also easy to develop and integrate, and provides protection stronger than or equivalent to that offered by [8, 10, 14] when implemented in the specific architecture present in RFID systems.
4.1 Static Analysis
The first step during static analysis is data cleaning rule creation (Figure 3). Because tag data must be stored as separate values rather than one long contiguous block, we must first identify all the data fields that will be stored on the tag. Once all data fields have been identified, the validation and sanitization rules for each of those fields must be set. To create the validation rules, the data field features that can be used for validation (data type, maximum length, minimum length, etc.) must be identified, and the allowable values of each feature must then be determined for each field. For data to be sanitized, it must be free of illegal special characters and keywords. To set the sanitization rules, we first analyze which special characters and keywords are not allowed to appear in each data field. These data validation and sanitization rules and standards must then be stored in a format that is available to the middleware.
Identify the field features that will be used in validation
Identify the feature values for each data field
Create validation (while list)
Decide which keywords are not allowed Yes Are any characters not allowed?
No
Are any keywords not allowed?
No
Create sanitization (black list) rules
Yes Decide which characters are not allowed
Fig. 3. RFID data cleaning rule creation process
Next, the legal query structures for valid queries must be defined. To do this, all possible queries that incorporate RFID tag data and are dynamically generated by the middleware must be uniquely identified. Then the legal query pattern for each identified query must be created. In developing a query structure model we use the concept of tokens to decompose the query into its constituent parts while preserving its logical structure but removing any user input. In our technique, tokens are defined as individual string parts and can be one of four main types: keywords, symbols/operators, identifiers and literals. The first three types of tokens are important for the logic and structure of the query; the fourth is only user input, which has no effect on either the logic or the structure of the query. Legitimate RFID tags contain only the literals and therefore will only change the values of the literals in the query when used to build dynamic queries. Therefore, to build the legal query structure for any query we must identify the positions where literals will be inserted. Take the following example query (dark blue – keywords, orange – identifiers, green – operators/symbols, red – literals):
Fig. 4. Tokenized SQL query
Now, by replacing the literals with “?” we ensure that the legal structure we define for this query does not take into account the changing literal values, thereby allowing the tag input to change as required. By keeping the first three types of tokens we ensure that the structure contains all the data concerning the query logic and structure, allowing the logic of the dynamic queries to be validated. Therefore the legal structure for any query generated from the above example would be: insert into product (tag_id, product_name) values(?,?); RFID systems have relatively few dynamically generated queries containing user input (RFID tag data) compared to web systems, and all of them are developed internally by the same company. Therefore, while query identification can be done automatically using existing methods, for our system we recommend that the programmer who develops the query generation software also define the legal query structure for each query generated by that software manually. This has the twin advantages of minimizing the coding required and ensuring the correctness of the developed query models, without the over- or under-compensation inherent in models developed by automated systems. Once the legal query models have been developed, they must be saved along with the unique identifier of the corresponding SQL query.
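As an illustration of this literal-replacement step, the short Python sketch below approximates it with regular expressions; the paper relies on the DBMS parser rather than regexes, so this is only a simplified stand-in.

import re

# Replace string and numeric literals with "?" so that only keywords,
# identifiers and operators, i.e. the query's logical structure, remain.
LITERAL = re.compile(r"'(?:[^']|'')*'|\b\d+(?:\.\d+)?\b")

def query_structure(sql):
    stripped = LITERAL.sub("?", sql)
    return re.sub(r"\s+", " ", stripped).strip().lower()

print(query_structure(
    "insert into product (tag_id, product_name) values ('04A2', 'milk')"))
# -> insert into product (tag_id, product_name) values (?, ?)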
4.2 Runtime Monitoring
When data is retrieved from the RFID tags, it arrives at the middleware as a single block. The middleware must therefore first identify each individual field in the full block and separate the data into its fields. The identifier of each field is then used to extract the RFID data cleaning rules for that field from the validation and sanitization rules developed during static analysis. Then, for each individual field of data received from the reader, the data feature values such as maximum length, minimum length and data type must be extracted by analyzing the separated
data fields. Finally, the extracted feature values must be matched against the values stored in the validation and sanitization rules. If the received data follows the set rules, the data is passed on for query generation; otherwise it is rejected and the tag is flagged as corrupted.
GQ – dynamically generated query
ID – unique identifier that associates GQ with the legal query pattern
GQp – GQ after it has been parsed by the database
QSa – actual query structure of GQ as extracted from GQp
QSl – legal query structure for GQ as defined by the developer

Algorithm: Query structure matching algorithm
INPUT: GQ, ID, QSl
OUTPUT: validated QS
BEGIN Query structure matching
1. Receive GQ and corresponding ID from middleware
2. Submit GQ to a parse function of the DBMS
3. Receive GQp as output of parse function
4. Generate QSa by removing literals from the parsed query GQp
5. Use ID to retrieve QSl from storage
6. IF (QSl != QSa) THEN
7.   Reject query
8. ELSE
9.   Submit query to DBMS for execution
10. ENDIF
END Query structure matching

Fig. 5. RFID Query structure matching algorithm
The second step in this process is comparing the predefined query structure for a query with the actual structure of the dynamically generated query (Figure 5). Once the tag data is received by the middleware, it uses that data to generate dynamic queries. Rather than sending these queries to the database, our system passes them to the query matching module. The query matching module thus receives a generated query (GQ) and the associated identifier (ID) from the query generation module. When the GQ is received, the module calls the parse function of the DBMS with GQ as a parameter. The DBMS parses the query (but does not execute it) and returns the resulting parsed query (GQp) to the query matching module. The module then strips all literals from GQp to generate the actual query structure (QSa) of GQ, and uses ID to retrieve the legal query structure QSl corresponding to GQ. Finally, it compares QSl with QSa using simple string comparison. If the two do not match, the query is identified as an SQLIA and rejected; otherwise it is forwarded to the database for execution. The algorithm for this process is presented in Figure 5. A number of other proposals that use a query matching approach for SQLIA security use more complex methods of comparing the legal query structure with the actual query structure. Because our approach uses simple string comparison, it requires less overhead than those techniques while offering a comparable level of security, as the evaluation shows.
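A minimal rendering of this matching step in Python could look as follows; it reuses the same regex-based literal stripping as the earlier sketch in place of the DBMS parse function, and the query identifier and legal structure shown are illustrative.

import re

def query_structure(sql):
    # Stand-in for steps 2-4 of the algorithm: "parse", then strip literals.
    stripped = re.sub(r"'(?:[^']|'')*'|\b\d+(?:\.\d+)?\b", "?", sql)
    return re.sub(r"\s+", " ", stripped).strip().lower()

# QSl per query ID, as defined by the developer during static analysis.
LEGAL_STRUCTURES = {
    "Q1": "insert into product (tag_id, product_name) values (?, ?)",
}

def is_legitimate(gq, query_id):
    """Steps 5-10: retrieve QSl by ID and string-compare it with QSa."""
    qsl = LEGAL_STRUCTURES.get(query_id)
    return qsl is not None and query_structure(gq) == qsl

bad = "insert into product (tag_id, product_name) values ('1' or '1'='1', 'x')"
print(is_legitimate(bad, "Q1"))   # False: injected tautology changed QSa
good = "insert into product (tag_id, product_name) values ('04A2', 'milk')"
print(is_legitimate(good, "Q1"))  # True: structure matches QSl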
5 Evaluation
We tested our technique to evaluate its performance in terms of detection rate and false positive rate. In our evaluation we tested all three types of dynamically generated SQL queries possible in RFID systems (SELECT, UPDATE, INSERT) against each type of attack possible on that query type. To evaluate our technique we used two programs: the freely available demo version of the General SQL Parser (GSP Demo) and a simple string comparison program we had written. For parsing the dynamic queries, we used the pretty-print facility of the GSP Demo, which marks the literals in red. Once a query was parsed, we replaced all red text (literals) with “?”. The resulting string was then compared with the legal query structure using the string comparison program: it takes the dynamic query stripped of all literals and runs a simple string comparison between the predefined legal query structure and the structure of the parsed dynamic query. If the two query structures match, the query is considered legitimate; if they do not, the query is considered an SQLIA. Table 1 shows the results of the evaluation.
Table 1. Test results: queries tested (detected)

Type of SQLIA           Select      Update      Insert      Total
Tautologies             21 (21)     21 (21)     N/A         42 (42)
Union query             18 (18)     6 (6)       12 (12)     36 (36)
Piggy-backed queries    15 (15)     15 (15)     15 (15)     45 (45)
Alternate encodings     12 (12)     12 (12)     12 (12)     36 (36)
Commenting queries      2 (2)       5 (5)       1 (1)       8 (8)
Total                   68 (68)     59 (59)     40 (40)     167 (167)
In our testing we did not test incorrect/illegal queries, inference, blind injection or timing attacks. These attacks are based on gaining information about the system and the database from the feedback the attacker receives in response to the SQLIA. Because RFID tags, and therefore attackers, do not receive results or error messages, attacks based on receiving feedback from the system are ineffectual against RFID systems. The results of the testing are straightforward and very positive: for all query types and all SQLIA types tested, our query structure matching technique identified all SQLIA, a 100% detection rate. In addition, during the testing process we also tested around 120-130 legal queries; all of them were allowed by the technique, giving a 0% false positive rate.
6 Conclusions and Future Work
In this paper we presented a simple but secure method for detecting and preventing RFID tag-based SQLIA. The technique consists of two stand-alone methods that can be implemented individually but give much stronger protection when combined. The first method is a simple data cleaning technique for the RFID tag data. By using a
mix of white-listing and black-listing, this technique prevents ‘bad’ data from being used when building dynamic queries. The second method is an SQL query structure matching technique that uses simple string comparison. Our techniques have the advantage of protecting against all SQLIA types possible on RFID systems while being simpler than comparable methods. The initial testing of the query structure matching method yielded very positive results, showing a detection rate of 100% and a false positive rate of 0%. Our system is also able to protect against second-order injection attacks, a type of SQLIA most other methods do not take into consideration. Our future work will investigate how these techniques can be modified to better suit web application architectures.
References
1. Glover, B., Bhatt, H.: RFID Essentials. Theory in Practice. O'Reilly Media, Sebastopol (2006)
2. Rieback, M., Simpson, P., Crispo, B., Tanenbaum, A.: RFID malware: Design principles and examples. Pervasive and Mobile Computing 2(4), 405–426 (2006)
3. Amirtahmasebi, K., Jalalinia, S.R., Khadem, S.: A survey of SQL injection defense mechanisms. In: 6th International Conference for Internet Technology and Secured Transactions. IEEE, London (2009)
4. Tajpour, A., Zade Shooshtari, M.J.J.: Evaluation of SQL Injection Detection and Prevention Techniques. In: 2nd International Conference on Computational Intelligence, Communication Systems and Networks, pp. 216–221. IEEE, Liverpool (2010)
5. Halfond, W., Viegas, J., Orso, A.: A classification of SQL-injection attacks and countermeasures. In: International Symposium on Secure Software Engineering. Citeseer (2006)
6. Huang, Y.W., Huang, S.K., Lin, T.P., Tsai, C.H.: Web application security assessment by fault injection and behavior monitoring. In: 11th International World Wide Web Conference. ACM, Honolulu (2003)
7. McClure, R.A., Krüger, I.H.: SQL DOM: compile time checking of dynamic SQL statements. In: 27th International Conference on Software Engineering. ACM, Missouri (2005)
8. Boyd, S.W., Keromytis, A.D.: SQLrand: Preventing SQL injection attacks. In: Jakobsson, M., Yung, M., Zhou, J. (eds.) ACNS 2004. LNCS, vol. 3089, pp. 292–302. Springer, Heidelberg (2004)
9. Wassermann, G., Su, Z.: An analysis framework for security in Web applications. In: First FSE Workshop on Specification and Verification of Component-Based Systems (2004)
10. Halfond, W.G.J., Orso, A.: AMNESIA: analysis and monitoring for NEutralizing SQL-injection attacks. In: 3rd International ICSE Workshop on Dynamic Analysis. ACM, MO (2005)
11. Su, Z., Wassermann, G.: The essence of command injection attacks in web applications. In: 33rd Annual Symposium on Principles of Programming Languages. ACM, New York (2006)
12. Suliman, A., Shankarapani, M., Mukkamala, S., Sung, A.: RFID malware fragmentation attacks. IEEE, Los Alamitos (2008)
13. Das, D., Sharma, U., Bhattacharyya, D.: An Approach to Detection of SQL Injection Vulnerabilities Based on Dynamic Query Matching. International Journal of Computer Applications 1(25), 39–45 (2010)
14. Buehrer, G., Weide, B.W., Sivilotti, P.A.G.: Using parse tree validation to prevent SQL injection attacks. In: International Conference on Software Engineering and Middleware. ACM, New York (2005)
Modeling QoS Parameters of VoIP Traffic with Multifractal and Markov Models
Homero Toral-Cruz 1,2, Al-Sakib Khan Pathan 3, and Julio C. Ramírez-Pacheco 4
1 Dept. of Sciences and Engineering, Universidad de Quintana Roo, México
2 Dept. of Postgraduate, Instituto Tecnológico Superior de Las Choapas, México
3 Dept. of Computer Science, International Islamic University Malaysia, Malaysia
4 Dept. of Basic Sciences and Engineering, Universidad Del Caribe, México
[email protected], [email protected], [email protected]
Abstract. In this paper, we analyze the jitter and packet loss behavior of voice over Internet protocol (VoIP) traffic by means of network measurements and simulation results. As a result of these analyses, we provide a detailed characterization and accurate modeling of these Quality of Service (QoS) parameters. Our studies reveal that VoIP jitter can be modeled by self-similar and multifractal models. We present a methodology for simulating packet loss. Besides, we find relationships between the Hurst parameter (H) and the packet loss rate (PLR).
Keywords: VoIP, PLR, Jitter, H Parameter, Markov Chains, Multifractality.
1 Introduction
The voice quality of VoIP applications depends on many parameters, such as bandwidth, one-way delay (OWD), jitter, PLR, codec, voice data length, and de-jitter buffer size. In particular, packet loss and jitter have an important impact on voice quality [1]. To achieve a satisfactory level of voice quality, VoIP networks must be designed using correct traffic models. In this work, we provide a detailed characterization and accurate modeling of the main QoS parameters, namely jitter and packet loss. This characterization and these models can be used by other researchers to design and implement de-jitter buffers, synthetic generators of VoIP jitter data traces, and effective schemes for packet loss recovery. The paper is organized as follows: Section 2 presents related works. Section 3 presents the measurement description. In Section 4, we discuss the jitter and packet loss behaviors. In Section 5, we propose that VoIP jitter can be modeled by self-similar and multifractal models. In Section 6, we present a methodology for simulating packet loss on VoIP traffic and propose a new model that relates the H parameter to the PLR. Finally, Section 7 concludes the paper, highlighting the achievements of this work and possible future uses of our findings.
2 Related Works and Motivation
In [2], [3], [4], [5] the QoS parameters of VoIP applications are studied; however, the relationships between them, needed to assess their overall effects, are not considered. These studies are therefore limited, because the impact of each QoS parameter is analyzed separately from the others. In this work, we show that some QoS parameters are intricately related to each other. It has been shown through empirical studies that data traffic exhibits a self-similar nature and long range dependence (LRD) [6], [7]. The presence of LRD is remarkably universal and has become an indispensable part of traffic modeling, in particular for TCP/IP traffic [8]. Furthermore, the discovery of evidence for multifractal behavior raised hopes that another "traffic invariant" had been found, which could lead to a complete, robust model of aggregate wide-area network traffic over all time scales [8]. The multifractal behavior of network traffic was first noticed by Riedi and Véhel [9]. Subsequently, various studies have addressed the characterization and modeling of multifractal traffic, essentially within the framework of random cascades [10]. On the other hand, packet losses in the Internet occur due to temporarily overloaded situations, are bursty in nature, and exhibit temporal dependency [11]. Consequently, there is a strong correlation between consecutive packet losses, resulting in bursty packet loss behavior. This temporal dependency can be effectively modeled by a finite Markov chain [11]. A previous work [12] presented a methodology for simulating packet loss; that methodology is restricted because it incorporates only one microscopic period of packet loss, using a 2-state Markov chain. In order to generalize it, in this work we propose to incorporate n microscopic periods of packet loss by means of a 4-state Markov chain.
3 Network Measurements
In order to accomplish our analysis, extensive jitter and packet loss measurements were collected, as follows:
- Test calls were established by a VoIP application called “Alliance Foreign eXchange Station” (FXS) [13].
- The jitter and packet loss were measured by Wireshark [14] to obtain a set of data traces.
- The measurement scenario was based on a typical H.323 architecture (Figure 1(a)).
- The parameter configuration employed in the test calls is shown in Figure 1(b):
a) Four simultaneous test calls were established between the A1/B1, A2/B2, A3/B3 and A4/B4 endpoints, see Figure 1.
b) The configurations used in the test calls are based on two parameters: codec type (G.711 and G.729) and voice data length (10 ms, 20 ms, 40 ms and 60 ms).
c) The measurement period was one hour for each test call (call duration).
d) For each measurement period (one hour), four jitter and packet loss data traces were obtained.
e) The 4 configuration sets contain more than 113 million voice packets, corresponding to 710 jitter and 710 packet loss data traces, measured during typical working hours (between 10:00-16:00 hrs.) between 2004 and 2010 [15].
(a) Measurement scenario

(b) Parameter configuration employed in the test calls:

Set     A1/B1        A2/B2        A3/B3        A4/B4
Set 1   G.711-10ms   G.711-20ms   G.711-40ms   G.711-60ms
Set 2   G.729-10ms   G.729-20ms   G.729-40ms   G.729-60ms
Set 3   G.711-10ms   G.711-20ms   G.729-10ms   G.729-20ms
Set 4   G.711-40ms   G.711-60ms   G.729-40ms   G.729-60ms

Fig. 1. Network measurements
4 Jitter and Packet Loss Behavior
Jitter - When voice packets are transmitted from source to destination over IP networks, packets may experience variable delay, called jitter. The packet inter-arrival time (IAT) on the receiver side is not constant even if the packet inter-departure time (IDT) on the sender side is constant. As a result, packets arrive at the destination with varying delays (between packets), referred to as jitter. We measure and calculate the difference between arrival times of successive voice packets on the receiver side according to RFC 3550 [16]; this is illustrated in Figure 2. Let $S_K$ be the RTP timestamp and $R_K$ the arrival time in RTP timestamp units for packet $K$. Then, for two successive packets $K$ and $K-1$, the one-way delay (OWD) difference is given by:

$J(K) = (R_K - S_K) - (R_{K-1} - S_{K-1}) = (R_K - R_{K-1}) - (S_K - S_{K-1}) = IAT(K) - IDT(K)$   (1)

$IAT(K) = J(K) + IDT(K)$   (2)

where $IDT(K) = S_K - S_{K-1}$ is the inter-departure time (in our experiments, IDT = {10 ms, 20 ms, 40 ms, 60 ms}) and $IAT(K) = R_K - R_{K-1}$ is the inter-arrival time or arrival jitter for packets $K$ and $K-1$. In the current context, $IAT(K)$ is referred to as jitter.
On the other hand, the voice data lengths of 10 ms, 20 ms, 40 ms and 60 ms are used and the successive voice packets are transmitted at a constant rate, i.e., 1 packet/10 ms, 1 packet/20 ms, 1 packet/40 ms and 1 packet/60 ms, respectively. However, when voice packets are transported over IP networks, they may experience delay variations and packet loss. A relationship between jitter and packet loss can be established using the following equations. If packet $K-1$ is lost,

$IAT(K) = J(K) + 2 \cdot IDT(K)$   (3)

Therefore, if $n$ consecutive packets are lost,

$IAT(K) = J(K) + (n+1) \cdot IDT(K)$   (4)

Equation (4) thus describes the effect of packet loss on VoIP jitter.
Fig. 2. Jitter experienced across Internet paths
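As a small numerical illustration of equations (1)-(4), the Python sketch below computes J(K) from send and arrival timestamps; the timestamp values are invented for the example.

def arrival_jitter(R, S):
    """J(K) = (R_K - R_{K-1}) - (S_K - S_{K-1}), as in equation (1);
    R and S must be in the same time units."""
    return [(R[k] - R[k - 1]) - (S[k] - S[k - 1]) for k in range(1, len(R))]

# Packets sent every 20 ms (IDT = 20 ms); the third packet arrives 5 ms late.
S = [0, 20, 40, 60]          # RTP send timestamps
R = [100, 120, 145, 160]     # arrival times at the receiver
print(arrival_jitter(R, S))  # [0, 5, -5]

# Per equation (4), if n consecutive packets are lost, the next measured
# inter-arrival time grows to J(K) + (n + 1) * IDT, e.g. 60 ms for n = 2.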
Packet Loss - As stated earlier in the related works section, temporal dependency can be effectively modeled by a finite Markov chain [11]. In this work, we use 2-state and 4-state Markov chains for this purpose. Figure 3 shows the state diagram of a 2-state Markov chain. In this model, one of the states (lost) represents a packet loss and the other state (found) represents the case where packets are correctly received. The transition probabilities in this model, as shown in Figure 3, are represented by p and q. The average numbers of consecutively lost and received packets can be calculated as b and g, respectively, as shown below:

$b = \sum_{n=1}^{\infty} n \cdot q (1-q)^{n-1} = \frac{1}{q}, \quad g = \frac{1}{p}$   (5)

The overall loss probability or PLR can be calculated according to equation (6):

$PLR = \frac{p}{p+q}$   (6)
The transition probabilities p and q can be estimated by equation (7):

$p = \frac{q \cdot PLR}{1 - PLR} = \frac{\sum_{i=1}^{\max(i)} m_i}{m_0}, \quad q = \frac{1}{b} = \frac{\sum_{i=1}^{\max(i)} m_i}{\sum_{i=1}^{\max(i)} i \cdot m_i}$   (7)

where $m_i$ is the number of times $i$ packets were lost consecutively, $m_0$ is the overall number of received packets, $\sum_{i=1}^{\max(i)} m_i$ is the number of consecutive loss events, and $\sum_{i=1}^{\max(i)} i \cdot m_i$ is the overall number of lost packets.
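A minimal sketch of this estimation in Python, assuming the trace contains at least one loss burst and one received packet, is given below.

from itertools import groupby

def estimate_2state(trace):
    """Estimate p, q and PLR of the 2-state chain from a binary loss trace
    (1 = lost, 0 = received), following equations (5)-(7)."""
    m0 = trace.count(0)                        # overall received packets
    bursts = [len(list(g)) for v, g in groupby(trace) if v == 1]
    loss_events = len(bursts)                  # sum over i of m_i
    lost_packets = sum(bursts)                 # sum over i of i * m_i
    q = loss_events / lost_packets             # q = 1/b
    p = loss_events / m0
    return p, q, p / (p + q)                   # PLR, equation (6)

trace = [0, 0, 1, 1, 0, 0, 0, 1, 0, 0]         # two bursts, lengths 2 and 1
print(estimate_2state(trace))                  # PLR = 0.3, matching 3/10 losses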
The data traces collected from real IP networks can be modeled accurately with a higher number of states, i.e., n-state Markov chains. However, for network planning, a trade-off is desirable between very accurate modeling of the data traces and a low number of model input parameters, in order to yield a model still usable by network planners with reasonable effort. Hence, we used a simplification of an n-state chain, i.e., the 4-state Markov chain. Figure 4 shows the state diagram of this chain. In this model, a ‘good’ and a ‘bad’ state are distinguished, which represent periods of lower and higher packet loss, respectively. Both for the ‘bad’ and the ‘good’ state, an individual 2-state Markov chain represents the dependency between consecutively lost or found packets.
Fig. 3. 2-state Markov chain

Fig. 4. 4-state Markov chain
The two 2-state chains can be described by 4 independent transition probabilities (two for each one). Two further probabilities characterize the transitions between the two 2-state chains, leading to a total of six independent parameters for this particular 4-state Markov chain. In the “good state” (G) packet losses occur with (low) probability $P_G$, while in the “bad state” (B) they occur with (high) probability $P_B$. The occupancy times for states B and G are both geometrically distributed, with respective means $1/p_{32}$ and $1/p_{23}$. The steady-state probabilities of being in states G and B are $\pi_G = \frac{p_{32}}{p_{32}+p_{23}}$ and $\pi_B = \frac{p_{23}}{p_{32}+p_{23}}$, respectively. The packet loss rates in the ‘good’ and ‘bad’ states can be calculated by:

$P_G = \frac{p_{21}}{p_{21}+p_{12}}, \quad P_B = \frac{p_{43}}{p_{43}+p_{34}}$   (8)

The overall packet loss for the 4-state Markov chain is given by [17]:

$PLR = P_G \cdot \pi_G + P_B \cdot \pi_B$   (9)
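For illustration, a loss-pattern generator driven by this 4-state chain might be sketched as follows. The mapping of states to lost/found and the self-loop completion of the transition matrix are our reading of Fig. 4 (the paper lists only the six independent probabilities), and the numeric probabilities are invented.

import random

# States: 1 = lost/good, 2 = found/good, 3 = lost/bad, 4 = found/bad.
# Probability mass not listed for a state stays in that state (self-loop).
def transition_matrix(p12, p21, p23, p32, p34, p43):
    return {
        1: {2: p12},
        2: {1: p21, 3: p23},
        3: {2: p32, 4: p34},
        4: {3: p43},
    }

def generate_pattern(n, moves, state=2, seed=7):
    rng = random.Random(seed)
    pattern = []
    for _ in range(n):
        pattern.append(1 if state in (1, 3) else 0)   # 1 = lost packet
        r, acc = rng.random(), 0.0
        for nxt, prob in moves[state].items():
            acc += prob
            if r < acc:
                state = nxt
                break                                  # else: stay in state
    return pattern

moves = transition_matrix(p12=0.95, p21=0.01, p23=0.002,
                          p32=0.05, p34=0.3, p43=0.4)
P = generate_pattern(100000, moves)
print(sum(P) / len(P))   # empirical overall PLR of the generated pattern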
5 Jitter Modeling
Self-Similarity, SRD and LRD - Following the methodology proposed in [18] to find correlations and LRD, the Hurst parameter of the jitter data traces is estimated by the wavelet-based estimator [19] as a function of the aggregation level m (m = {1, 2, 4, 8, 16, 32, 64, 128}). Figure 5 shows the Hurst parameter of representative jitter data traces at different aggregation levels m. It can be observed that one set of jitter data traces has Hurst parameters larger than 0.5 for all aggregation levels, indicating a high degree of LRD. In contrast, other sets of jitter data traces have Hurst parameters lower than 0.5. These results are not a strong indication of LRD; they indicate that the autocovariance (ACV) functions decay quickly to zero, i.e., no memory property or short range dependence (SRD). Figure 6 shows the comparison between the ACV function of a measured data trace with H = 0.35 and the theoretical ACV function. It can be observed that the ACV function of the measured data trace behaves similarly to the ideal model and decays quickly to zero. A comparison was also made between the ACV function of a measured data trace with H = 0.58 and the theoretical ACV function, as shown in Figure 7. In this figure, a similar behavior can be observed, with a very slow decay of the ACV functions. These results show that VoIP jitter exhibits self-similar characteristics with SRD or LRD; therefore, a self-similar process can be used to model the jitter behavior.
Fig. 5. Hurst parameter for VoIP jitter data traces
Fig. 6. ACV for jitter data traces with SRD

Fig. 7. ACV for jitter data traces with LRD
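The paper estimates H with the wavelet-based estimator of [19]; as a simpler, self-contained illustration of how H can be read off aggregated series, the sketch below implements the classical variance-time (aggregated-variance) estimator, which is a different and cruder estimator than the one used here.

import numpy as np

def hurst_variance_time(x, levels=(1, 2, 4, 8, 16, 32, 64, 128)):
    """Variance-time estimate of H: for a self-similar series, the variance
    of the m-aggregated series scales as m**(2H - 2)."""
    x = np.asarray(x, dtype=float)
    log_m, log_v = [], []
    for m in levels:
        k = len(x) // m
        agg = x[: k * m].reshape(k, m).mean(axis=1)  # block means of size m
        log_m.append(np.log(m))
        log_v.append(np.log(agg.var()))
    slope = np.polyfit(log_m, log_v, 1)[0]           # slope = 2H - 2
    return 1 + slope / 2

rng = np.random.default_rng(1)
print(hurst_variance_time(rng.normal(size=1 << 14)))  # ~0.5 for white noise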
Multifractal behavior - In this section we review the evidence for multifractal behavior of VoIP jitter. To accomplish this analysis, we decompose the time series of VoIP jitter into a set of time series or components [20]. The behavior of these components is used to determine the kind of asymptotic fractal scaling. If the variance of the components of a time series is modeled by a straight line, the time series exhibits monofractal behavior. On the other hand, if the variance of the components cannot be adequately modeled with a linear model, then the scaling behavior should be described with more than one scaling parameter, i.e., the time series exhibits multifractal behavior. Figure 8 shows the component behaviors of a VoIP jitter data trace belonging to the data sets with SRD; the variance of the components of this time series is modeled by a straight line. Figure 9 shows the component behaviors of a VoIP jitter data trace belonging to the data sets with LRD; the variance of the components of this time series cannot be adequately modeled with a linear model. These results show that VoIP jitter with SRD or LRD exhibits monofractal or multifractal behavior, respectively.
Fig. 8. Components behavior of VoIP jitter data traces: monofractal behavior (LD-Diagram, H = 0.43)

Fig. 9. Components behavior of VoIP jitter data traces: multifractal behavior (LD-Diagram, H1 = 0.43, H2 = 1.11)
6 Packet Loss Modeling: A Power Law Model
6.1 Packet Loss Model Framework
In this paper, a description of VoIP packet loss based on narrow and wide time windows is used. The packet loss behavior over a narrow time window is called here microscopic, and the packet loss behavior over wide time windows is called macroscopic [21]. Microscopic behavior refers to a packet loss period observed in a “time window” W_1 of the packet loss data trace, where this packet loss period has a specific PLR_1. Macroscopic behavior, on the other hand, refers to the set of microscopic periods (W_1, W_2, W_3, ..., W_n) observed over the whole packet loss data trace, where each
microscopic period has a particular PLR (PLR_1, PLR_2, PLR_3, ..., PLR_n), as shown in Figure 10(a). As the figure shows, packet losses do not occur homogeneously. Figures 10(b) and 10(c) show some packet loss patterns extracted from VoIP test calls. In Figure 10(b), the packet loss behavior is homogeneous, i.e., the packet
loss pattern is represented by a single microscopic period. In Figure 10(c), the packet loss is non-homogeneous, i.e., the packet loss pattern is represented by a concatenation of two microscopic periods. Microscopic and macroscopic behavior can be effectively modeled by 2-state and 4-state Markov chains, respectively.
Fig. 10. Packet loss descriptions from VoIP test calls: (a) microscopic and macroscopic behavior, (b) homogeneous PLR, (c) non-homogeneous PLR
6.2 Methodology for Simulating Packet Loss
The current methodologies for simulating packet loss consist only of generating packet loss patterns by Markov chains of different orders. Studies based on these methodologies are therefore limited, because the impact of this parameter is analyzed separately from the others. A new methodology to simulate packet loss in two stages is proposed: first, a packet loss pattern is generated; second, this packet loss pattern is applied to a VoIP jitter data trace, i.e., the effect of the packet loss pattern on the VoIP jitter is simulated via the relationship shown in equation (4). Let X = {X_t : t = 1, ..., N} be a VoIP jitter data trace of length N, self-similar (H parameter 0 < H_0 < 0.5), and with a low packet loss rate PLR_0. The packet loss patterns are generated by means of a 4-state Markov chain and are represented as binary sequences P = {P_t^τ : t = 1, ..., N; τ = 0, 1, 2, ..., T-1}, where P_t^τ = 1 means a lost packet, P_t^τ = 0 means a successfully received packet, N is the length of the packet loss pattern, and T is the number of packet loss patterns used. The relationship between jitter and packet loss from equation (4) is used to apply the packet loss patterns to X_t by the pseudo-code shown in Algorithm 1. As a result of applying Algorithm 1, the new time series X̂_t^τ were obtained, for t = 1, ..., N and
τ = 0, 1, 2, ..., T-1. For each X̂_t^τ the PLR and H parameter were calculated, and the functions f(PLR_τ, H_τ) were generated.

Algorithm 1. Pseudo-code for applying the packet loss patterns
1. FOR n = 2 to N
2.   IF ( P[n] = 1 )
3.     X[n] = X[n] + X[n-1]
4.   END IF
5. END FOR
6. i = 1
7. FOR n = 2 to N
8.   IF ( P[n] ≠ 1 )
9.     X̂[i] = X[n-1]
10.    i = i + 1
11.  END IF
12. END FOR
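A direct, runnable rendering of Algorithm 1 in Python might read as follows (a sketch; the example trace and pattern are invented).

def apply_loss_pattern(x, p):
    """Apply a binary loss pattern p (1 = lost) to a jitter trace x:
    steps 1-5 fold each lost packet's inter-arrival time into its successor
    (equation (4)); steps 6-12 drop the lost positions to form X^."""
    x = list(x)
    for n in range(1, len(x)):
        if p[n] == 1:
            x[n] += x[n - 1]
    return [x[n - 1] for n in range(1, len(x)) if p[n] != 1]

jitter = [20.0, 20.0, 20.0, 20.0, 20.0]   # IAT samples, IDT = 20 ms, no jitter
pattern = [0, 0, 1, 0, 0]                 # the packet at index 2 is lost
print(apply_loss_pattern(jitter, pattern))  # [20.0, 40.0, 20.0]: the gap widens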
6.3 Simulation Results
The simulations were carried out over the measured VoIP jitter data traces. Figure 11 illustrates the relationships between the packet loss rate and the Hurst parameter. Figure 11(a) shows the empirical functions f(PLR_τ, H_τ) obtained from the simulation results together with the function f_REAL(PLR_ε, H_ε). The functions f(PLR_τ, H_τ) result from applying T packet loss patterns to representative VoIP jitter data traces X_t; in these functions, each point represents the PLR_τ and H_τ of a particular new time series X̂_t^τ. The function f_REAL(PLR_ε, H_ε) is generated from E jitter data traces; each point represents the PLR_ε and H_ε of a particular jitter data trace X_t^ε, where t = 1, ..., N, ε = 1, 2, ..., E, and E is the number of representative jitter data traces used. The differences between the simulated functions f(PLR_τ, H_τ) and the function f_REAL(PLR_ε, H_ε) were quantified in terms of the MSE. Figure 11(b) shows the fitted parameters and the MSE between f(PLR_τ, H_τ) and f_REAL(PLR_ε, H_ε).
(a) Relationship between PLR and H

(b) Fitted parameters:

                       Ĥ_0      â        b̂        MSE
G.711                  0.0428   0.5659   0.2760   0.001474
G.729                  0.0430   0.5716   0.2805   0.002305
f_REAL(PLR_ε, H_ε)     0.0429   0.5471   0.2475   -

Fig. 11. Relationship between PLR and H parameter
6.4 Proposed Model
From the simulations carried out, it was found that the relationship between the H parameter and the PLR can be modeled by a power-law function, characterized by three fitted parameters, as follows:
$H_M = \hat{H}_0 + \hat{a} \cdot (PLR)^{\hat{b}}$   (10)

where $H_M$ is the H parameter of the model found; $\hat{H}_0$, $\hat{a}$, and $\hat{b}$ are the fitted parameters; and $\hat{H}_0$ is the H parameter when PLR = 0. The fitted parameters are estimated by linear regression. The strategy for finding the parameter values $\hat{H}_0$, $\hat{a}$, and $\hat{b}$ is to minimize the mean squared error, i.e., $MSE = \int_r (\hat{H}_0 + \hat{a} r^{\hat{b}} - H_\tau)^2 \, dr$, and the validity of the proposed model corresponds to the range of $r = PLR$ considered (e.g., 0%-4%).
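Under the assumption that Ĥ_0 is fixed to the H value measured at PLR = 0, the fit of equation (10) reduces to a linear regression in log-log coordinates, as the following sketch (with invented sample points) shows.

import numpy as np

def fit_power_law(plr, hurst, h0):
    """Fit H_M = h0 + a * PLR**b via log(H - h0) = log(a) + b*log(PLR)."""
    plr, hurst = np.asarray(plr, float), np.asarray(hurst, float)
    mask = (plr > 0) & (hurst > h0)          # logs need positive arguments
    b, log_a = np.polyfit(np.log(plr[mask]), np.log(hurst[mask] - h0), 1)
    return np.exp(log_a), b

plr = [0.5, 1.0, 2.0, 3.0, 4.0]              # PLR in percent, validity range
hurst = [0.51, 0.60, 0.72, 0.80, 0.86]       # invented H values for illustration
a, b = fit_power_law(plr, hurst, h0=0.043)
print(a, b)                                  # fitted a and b of equation (10)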
7 Conclusions
In this paper, the jitter and packet loss behavior of VoIP traffic was analyzed. As a result of these analyses, a detailed characterization and accurate modeling of these main QoS parameters was provided. First, we proposed that VoIP jitter can be properly modeled by means of self-similar and multifractal models. Second, a methodology for simulating packet loss on VoIP jitter was presented. Finally, a relationship between the Hurst parameter and the PLR was found; this relationship can be modeled by means of a power-law function with three fitted parameters, summarized by equation (10). This characterization and these models can be used by other researchers to design and implement de-jitter buffers, synthetic generators of VoIP jitter data traces, and effective schemes for packet loss recovery.
References
1. Markopoulou, A., Tobagi, F., Karam, M.: Loss and Delay Measurements of Internet Backbones. Computer Communications 29(10), 1590–1604 (2006)
2. Karapantazis, S., Pavlidou, F.-N.: VoIP: A comprehensive survey on a promising technology. Computer Networks 53(12), 2050–2090 (2009)
3. Salah, K.: On the deployment of VoIP in Ethernet networks: methodology and case study. Computer Communications 29(8), 1039–1054 (2006)
4. Turunen, J., Loula, P., Lipping, T.: Assessment of objective voice quality over best-effort networks. Computer Communications 28(5), 582–588 (2005)
5. Zhang, L., Zheng, L., Ngee, K.S.: Effect of delay and delay jitter on voice/video over IP. Computer Communications 25(9), 863–873 (2002)
6. Park, K., Willinger, W.: Self-Similar Network Traffic and Performance Evaluation, ch. 1. John Wiley & Sons, Inc., Chichester (2000)
7. Paxson, V., Floyd, S.: Wide area traffic: the failure of Poisson modeling. IEEE/ACM Transactions on Networking (TON) 3(3), 226–244 (1995)
8. Veitch, D., Hohn, N., Abry, P.: Multifractality in TCP/IP traffic: the case against. Computer Networks 48(3), 293–313 (2005)
9. Riedi, R., Véhel, J.: Multifractal properties of TCP traffic: A numerical study. Technical Report No. 3129, INRIA Rocquencourt, France (1997), http://www.dsp.rice.edu/~riedi
10. Gilbert, A., Willinger, W., Feldmann, A.: Scaling analysis of conservative cascades, with applications to network traffic. IEEE Trans. Info. Theo. 45(3), 971–992 (1999)
11. Yajnik, M., Moon, S., Kurose, J., Towsley, D.: Measurement and modelling of the temporal dependence in packet loss. In: Proc. IEEE INFOCOM 1999, NY, pp. 345–352 (1999)
12. Tarnay, K., Adamis, G., Dulai, T.: Advanced Communication Protocol Technologies: Solutions, Methods, and Applications, ch. 17. IGI Global (2011)
13. Advanced Information CTS (Centro de Tecnología de Semiconductores) Property, Alliance FXO/FXS/E1 VoIP System, http://www.cts-design.com
14. Wireshark: A Network Protocol Analyzer, http://www.wireshark.org/
15. Toral, H.: QoS Parameters Modeling of Self-similar VoIP Traffic and an Improvement to the E Model. PhD Thesis, Electrical Engineering, Telecommunication Section, CINVESTAV, Guadalajara, Jalisco, Mexico (2010)
16. RFC 3550, RTP: A Transport Protocol for Real-Time Applications. Internet Engineering Task Force (2003), http://www.ietf.org/rfc/rfc3550.txt
17. Yee, J.R., Weldon Jr., E.J.: Evaluation of the Performance of Error-Correcting Codes on a Gilbert Channel. IEEE Trans. on Communications 43(8), 655–659 (1995)
18. Fitzek, F.H.P., Reisslein, M.: MPEG-4 and H.263 video traces for network performance evaluation. IEEE Network 15(6), 40–54 (2001)
19. Veitch, D., Abry, P.: A wavelet based joint estimator for the parameters of LRD. IEEE Transactions on Information Theory 45(3), 878–897 (1999)
20. Estrada, L., Torres, D., Toral, H.: Variance Error for Finite-length Self-similar Time Series. In: 7th International Conference on Computing, Communications and Control Technologies (CCCT 2009), Orlando, Florida, USA, pp. 193–198 (2009)
21. Raake, A.: Short- and Long-Term Packet Loss Behavior: Towards Speech Quality Prediction for Arbitrary Loss Distributions. IEEE Transactions on Audio, Speech, and Language Processing 14(6), 1957–1968 (2006)
Hybrid Feature Selection for Phishing Email Detection
Isredza Rahmi A. Hamid and Jemal Abawajy
School of Information Technology, Deakin University
{iraha,jemal}@deakin.edu.au
Abstract. Phishing emails are more active than ever before, putting the average computer user and organizations at risk of significant data, brand and financial loss. Through an analysis of a number of collected phishing and ham emails, this paper focuses on fundamental attacker behavior that can be extracted from email headers. It also puts forward a hybrid feature selection approach based on a combination of content-based and behavior-based features, which can mine the attacker behavior from the email header. On a publicly available test corpus, our hybrid feature selection achieves a 96% accuracy rate. In addition, we successfully tested the quality of our proposed behavior-based features using information gain.
Keywords: Internet Security, Behavior-based, Feature Selection, Phishing.
1 Introduction
Phishing emails have become a common problem in recent years. Phishing is a type of semantic attack in which victims are sent emails that deceive them into providing sensitive information, such as account numbers or passwords, or other personal information to the phisher. Normally, phishers send a large number of fake e-mails pretending to be from a legitimate and well-known business organization. Generally, the email content insists that the victim update personal information to avoid losing access rights to services provided by the organization, and lures the user to a bogus web site implemented by the attacker. According to the Anti-Phishing Working Group phishing trend report, the number of phishing attacks through email increased from about 170,000 in 2005 to about 440,000 in 2009 [2]. Based on a Gartner survey, approximately 109 million U.S. adults have received phishing e-mail attacks, with the average loss per victim estimated at $1,244. Phishing email detection has drawn a lot of consideration from many researchers. Several good anti-phishing techniques, both content-based [6], [11], [16] and behavior-based [7], [5], [13], have been developed to address the phishing problem. However, phishing attacks have continued to be a serious problem. This is because phishing has become more and more complicated, and phishers continually change their ways of perpetrating attacks to defeat the anti-phishing techniques. Moreover, most phishing emails are nearly identical to normal emails, so existing anti-phishing techniques such as content-based approaches are not able to curb phishing attacks. Furthermore, most existing email filtering approaches are static and can easily be defeated by modifying the contents of emails and link strings.
In this paper, we present an approach to detect phishing emails using hybrid features that combine content-based and behavior-based approaches. The main objective of this paper is to identify behavior-based features in phishing emails that cannot be disguised by an attacker. By analyzing attacker patterns, we observe that an email sender that appears to operate from more than one domain could indicate abnormal activity, and a domain server that handles email for more than one sender domain could indicate abnormal email as well. This is done by analyzing the email header, which is usually neglected by other approaches; in particular, we analyze the Message-ID tag and the sender email address in order to mine the attacker's behavior. This study applies the proposed hybrid feature selection to a dataset of 6923 emails drawn from the Nazario [14] phishing email collection, ranging from 2004 to 2007, and from SpamAssassin [17] for ham emails. The results show that the proposed hybrid feature selection approach is effective in identifying and classifying phishing email. The remainder of this paper is organized as follows. Section 2 describes related research on phishing email detection approaches proposed in recent years. Section 3 examines the phishing email feature selection approach, including the data and feature set used in the experiment and the hybrid feature selection algorithm. Section 4 gives the performance analysis results and the effectiveness of the proposed hybrid feature selection. Section 5 concludes the work and discusses directions for future work.
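To illustrate the kind of header analysis meant here, the Python sketch below extracts the sender and Message-ID domains and flags a mismatch; the sample message and the exact feature encoding are illustrative assumptions, not the paper's final feature set.

from email import message_from_string

def header_domain_features(raw_email):
    """Compare the From-address domain with the domain in the Message-ID;
    a mismatch is one indicator of abnormal sending behavior."""
    msg = message_from_string(raw_email)
    sender = (msg.get("From") or "").rsplit("@", 1)[-1].strip("<> ").lower()
    mid = (msg.get("Message-ID") or "").rsplit("@", 1)[-1].strip("<> ").lower()
    return {"sender_domain": sender,
            "message_id_domain": mid,
            "domains_differ": int(bool(sender) and bool(mid) and sender != mid)}

sample = ("From: support@paypa1-example.com\n"
          "Message-ID: <20110401.1234@mailer.example.net>\n"
          "Subject: Verify your account\n\n"
          "body")
print(header_domain_features(sample))
# {'sender_domain': 'paypa1-example.com',
#  'message_id_domain': 'mailer.example.net', 'domains_differ': 1}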
2 Related Work

Several anti-phishing techniques have been proposed in recent years to detect and prevent the increasing number of phishing attacks. In general, phishing detection can be classified into server-based and client-based techniques. Server-based techniques are typically implemented by service providers such as ISPs, e-commerce stores or other financial institutions. Client-based techniques, on the other hand, are implemented at the user's end point through browser plug-ins or email analysis. Various feature selection approaches have recently been introduced to assist phishing detection mechanisms. Most previous research [6], [11], [16] focused on email content in order to classify emails as either abnormal or normal. An earlier attempt [11] presented an approach based on natural structural characteristics of emails. The features included the number of words in the email, the vocabulary, the structure of the subject line, and the presence of 18 keywords. The authors tested on 400 emails divided into five sets with different types of feature selection; their results were best when more features were used to classify phishing email with a Support Vector Machine classifier. However, the significance of the results is difficult to assess because of the small size of the email collection. Fette et al. [6], on the other hand, considered 10 features, mostly examining URLs and the presence of JavaScript, to flag emails as phishing. Nine features were extracted from the email and the last feature was obtained from a WHOIS query. They followed a similar approach to [11] but used larger datasets of about 7,000 normal emails and 860 phishing emails. They focused on URL properties, which is not the best approach, because an attacker can use tools such as TinyURL (http://tiny.cc/) to obfuscate a URL and make it look valid. Their filter scores a 97.6% F-measure, with a 0.13% false positive rate and a 3.6% false negative rate.
Abu-Nimeh et al. [16] studied the performance of different classifiers used in text mining, such as logistic regression, classification and regression trees, Bayesian additive regression trees, Support Vector Machines, random forests, and neural networks. They tested on a public collection of about 1,700 phishing emails and 1,700 legitimate emails from private mailboxes. They focused on the richness of words, classifying phishing email based on 43 keywords; the features represent the frequency of "bag-of-words" terms that appear in phishing and legitimate emails. As phishing emails always look similar to normal email, this approach may no longer be reliable. Recently, behavior-based approaches to determining phishing messages have been proposed [7], [5], [13]. Zhang et al. [7] detect abnormal mass-mailing hosts in the network layer by mining traffic in the session layer. Toolan et al. [5] investigated 40 features that have been used in the recent literature and proposed behavioral features such as the number of words in the send field, the total number of characters in the sender field, the difference between the sender's domain and the reply-to domain, and the difference between the sender's domain and the email's modal domain. Ahmed Syed et al. [13] proposed behavioral blacklisting using 4 log-based features on live data. Ma et al. [8] claimed to classify phishing email based on hybrid features; they used 7 features derived from 3 types of email features (content features, orthographic features and derived features), which can also be considered a content-based approach. In terms of detecting phishing using content, text-based classification does not seem to be the best approach, because phishing messages are nearly identical to normal emails. Content-based filtering might be more effective if messages had a long lifetime and a large amount of duplication. However, attackers use increasingly sophisticated techniques that make them difficult to detect; they have become more advanced by building phishing pages with non-HTML components such as images, Flash objects and Java applets. Moreover, the updating rate of filters is often outpaced by the changing rate of the attacks, because phishing emails continuously modify senders and link strings. Therefore, this remains an open problem. Although there are clear advantages to filtering phishing attacks at the email level, at present there are not many methods specifically designed to target phishing emails based on phishing behavior; there is very little research on behavior-based approaches. Our study differs from previous work on feature selection in several ways. First, we propose a hybrid feature selection combining content-based and behavior-based features. We analyze email header information, particularly the sender email and the email's message-ID tag, to evaluate attacker behavior. We mine attacker behavior by considering whether the sender sends emails from more than a single domain and whether a domain name is used by more than one sender's domain. We choose the Bayes Net algorithm as our classifier because it is a powerful knowledge representation and reasoning mechanism. Second, we produce promising results using 7 features, with 96% accuracy and 4% false positive and false negative rates respectively.
3 Phishing Email Feature Selection Approach

Email filtering can be divided into two categories: origin-based filtering and content-based filtering. Origin-based filtering focuses on the source of the email and verifies whether
this source is on a white verification list or a black verification list. In contrast, content-based filters focus on the subject and body of the email. Phishing emails can be detected by filtering based on textual, linguistic or structural features. Textual and linguistic features identify phishing emails based on word composition and grammatical construction. Structural features instead focus on identifying the presence of obvious signs in the email body that implicate it as spoofed.

3.1 System Model

Fig. 1 shows the basic system components and general processing steps, which are extended from [15]. The processing phases include: pre-processing of the email, feature extraction and selection, feature assessment, classification, and finally the evaluation of the classification result. We used the Bayes Net algorithm as our classifier, as it is a powerful knowledge representation and reasoning mechanism. Moreover, it is a simple and widely used classification method because of its ability to manipulate tokens and associated probabilities according to the user's classification decisions, and because of its empirical performance.
Fig. 1. System Model
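The processing phases above can be summarized, at a very high level, with the following sketch. It assumes a numeric feature matrix X and a label vector y (1 = phishing, 0 = ham) have already been produced by the extraction stage; the paper itself uses WEKA's Bayes Net classifier, for which scikit-learn's Bernoulli naive Bayes serves only as a convenient stand-in here.

    # Minimal sketch of the classification stage of Fig. 1 (assumptions noted above)
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.metrics import accuracy_score

    def classify(X, y):
        # 60% training / 40% testing split, as used in the paper
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, train_size=0.6, test_size=0.4, random_state=0)
        clf = BernoulliNB().fit(X_train, y_train)  # stand-in for WEKA's Bayes Net
        return accuracy_score(y_test, clf.predict(X_test))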
We used the open source software mbox2xml as a disassembly tool: a Python module that exports the information from mbox format to XML format. We modified the schema in order to extract all features and store them in a database. The next step in the process is to generate the components of a feature vector by analyzing the database. After that, we constructed training and testing datasets of 60% and 40% respectively. The training set was used to train the classifier, and the test set to estimate the error rate of the trained classifier. Information gain is generated in the Feature Assessment step to evaluate accuracies before and after the removal of each feature. Feature Assessment is applied repeatedly until the best feature vector is identified. Lastly, the improved feature matrix, i.e. the most optimized feature set, is identified and a good classifier is generated.

3.2 Feature Extraction and Selection

It is well known that an email consists of a header and a message body. The email header contains common identification fields such as from, to, date and subject, as well as the route information an email takes as it is transferred from one computer to another. It travels through a
Mail Transfer Agent (MTA), where it is stamped with a date, time and recipient. This part of the email header is not visible to most users, but it is a useful indicator for determining phishing email. We find that the message-ID tag found in the email header is a globally unique identifier and can be used to mine sender behavior. The features we identified in the email header are: (1) Subject-based features: related to the presence/absence of blacklisted words in the email subject; (2) Sender-based features: extracted from the sender email address; (3) Behavior-based features: extracted from email header information including the sender email and the email's message-ID. The body-based features include the following: (1) URL-based: extracted from the email HTML; (2) Keyword-based: related to the presence/absence of blacklisted words in the email body; (3) Form-based: related to the presence/absence of a form in the email body; (4) Script-based: related to the presence/absence of a script in the email body.

3.2.1 Features Defined in an Email

Email messages have two basic parts: the header and the body. The header contains information about who the message was sent from, the recipients, the date and the route, with optional fields such as received, reply-to, subject and message-ID. This is followed by the body of the message. In our analysis, we considered the message-ID and from tags in the email header. We experimented with five features belonging to the email structure and two additional features extracted from sender behavior. The features are listed below:

1) Domain_sender: This binary feature represents the similarity of the domain name extracted from the email sender with the message-ID domain. If they are similar, we consider the email normal and set the value to 0; otherwise we set the value to 1 to indicate that the email is abnormal. This feature has been proposed by [5].
2) Subject_blacklist_words: This binary feature represents the appearance of blacklisted words in the subject of an email, as included in the bags of words in [11]. If the email subject contains a blacklisted word, the email is abnormal and the value is set to 1. This feature has been used in [8].
3) URL_IP: This numerical feature counts the number of links that use an IP address. This feature has been used in [1].
4) URL_dots: This numerical feature counts the number of links in the email that contain more than 5 dots. This feature has been used in [11], although they calculate the maximum number of dots over all links.
5) URL_symbol: This numerical feature counts the occurrence of links in the email that contain symbols. This has been used in [18], but we incorporate other symbols such as "%" and "&" to detect obfuscated URLs.

The behavior features newly proposed in this paper, illustrated in the sketch below, include: (1) Unique_sender (US): This binary feature represents whether the sender sends emails from more than a single domain. If so, we consider the sender a phisher and set the value to 1; otherwise the value is 0. (2) Unique_domain (UD): This binary feature denotes whether the domain name is used by more than one sender domain. If so, we consider the email abnormal; otherwise the email is normal and the value is set to 0.
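A minimal sketch of mining the two behavior-based features follows; the input format (triples of sender address, sender domain and message-ID domain, parsed from the From and Message-ID header fields) is an assumption for illustration.

    from collections import defaultdict

    def behavior_features(headers):
        # headers: assumed list of (sender_address, sender_domain, message_id_domain)
        mid_domains_per_sender = defaultdict(set)
        sender_domains_per_mid = defaultdict(set)
        for addr, sdom, mdom in headers:
            mid_domains_per_sender[addr].add(mdom)
            sender_domains_per_mid[mdom].add(sdom)
        out = []
        for addr, sdom, mdom in headers:
            us = int(len(mid_domains_per_sender[addr]) > 1)    # Unique_sender (US)
            ud = int(len(sender_domains_per_mid[mdom]) > 1)    # Unique_domain (UD)
            out.append((us, ud))
        return out

Applied to the four example messages in Table 1 below, this yields US = 1 for the two paypal.com messages (their message-IDs come from two different domains) and US = UD = 0 for the talios.com messages.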
Table 1. Datasets for sender behavior

Email Sender        Domain Email Sender  Message-ID                             Domain Message-ID  US  UD
service@paypal.com  paypal.com           q4-6$c--0--$-w@qmb02q                  qmb02q             1   0
service@paypal.com  paypal.com           YYBTIESSYZXKLGFQV [email protected]    hotmail.com        1   0
mark@talios.com     talios.com           2060000.1012684767@spawn.se7en.org     spawn.se7en.org    0   0
mark@talios.com     talios.com           2060000.1012684767@spawn.se7en.org     spawn.se7en.org    0   0
Table 2. Phishing dataset files summary

Duration                      Email messages
27 Nov 2004 – 13 June 2005    414
14 June 2005 – 14 Nov 2005    443
15 Nov 2005 – 7 Aug 2006      1423
7 Aug 2006 – 7 Aug 2007       2279
Total phishing datasets       4559
3.2.2 Mining Sender Behavior

Sender behavior is mined from the email header. The dataset we selected from the email headers has the structure shown in Table 1. After all the features are defined, we extract all 7 features from each email. The values of the features are of various types. Domain_sender, Subject_blacklist_words, Unique_sender and Unique_domain are binary. All URL-based features are numerical, but in vastly different ranges; for example, the URL_dots count may be under five while other counts range higher. In order to treat all the original features as equally important, the value of each feature needs to be normalized before the classification process. Features with numerical values are normalized by dividing the actual value by the maximum value of that feature, so that numerical values are limited to the range [0, 1], as in the sketch below.
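The normalization step described above amounts to a per-feature division by the maximum observed value:

    def normalize(values):
        # Divide each numerical feature value by the maximum observed value
        # so that it falls into the range [0, 1].
        m = max(values)
        return [v / m if m else 0.0 for v in values]

    # e.g. normalize([0, 2, 5, 10]) -> [0.0, 0.2, 0.5, 1.0]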
4 Performance Analysis

4.1 Experimental Setup

This section presents our experimental setup. In our study, classification was performed using WEKA (Waikato Environment for Knowledge Analysis). For our preliminary experiments, we used the freely available pre-classified phishing datasets of [14]. We used 4 data files, as presented in Table 2. These phishing datasets have been used in phishing detection research, including work by [3], [4], [5], [6], [9], [10], [12] and [16]. For the non-phishing datasets, we used the SpamAssassin Project [17] easy ham directory, which provides 2,364 ham emails. We randomly generated 3 datasets containing varying numbers of phishing and ham emails from the overall collection. Corpus 1 consists of phishing emails selected from the first two phishing datasets, ranging from November 2004 to November 2005; its ham messages consist of 1/3 of the ham email collection.
Corpus 2 is taken from the third phishing collection, while its ham emails are taken from half of the ham collection. Corpus 3 contains phishing emails ranging from August 2006 to August 2007 and the whole ham collection. Emails in corpus 2 and corpus 3 that contain unreadable symbols, Chinese text or Nigerian online scams are discarded. The details of each dataset are summarized in Table 3.

Table 3. Summary of datasets

          Example                  Training Size (60%)      Testing Size (40%)
          Total  Phishing  Ham     Total  Phishing  Ham     Total  Phishing  Ham
Corpus 1  1645   857       788     986    514       472     659    343       316
Corpus 2  2495   1313      1182    1496   787       709     999    526       473
Corpus 3  4594   2230      2364    2756   1338      1418    1838   892       946
Table 4. Weighted average classification results for the 3 corpuses

          FN     FP     Precision  Recall  Error  Accuracy
Corpus 1  0.042  0.041  0.96       0.96    0.04   96%
Corpus 2  0.081  0.079  0.92       0.92    0.08   92%
Corpus 3  0.079  0.081  0.92       0.92    0.08   92%
4.2 Performance Metrics

In order to measure the effectiveness of the classification, we refer to the four possible outcomes: (1) True positive (TP): the classifier correctly identifies an instance as positive. (2) False positive (FP): the classifier incorrectly identifies as positive an instance that is in fact negative. (3) True negative (TN): the classifier correctly identifies an instance as negative. (4) False negative (FN): the classifier incorrectly identifies as negative an instance that is in fact positive. To measure the effectiveness of our approach, we use four metrics also used in previous work [5], [6], [8] and [11]: (1) Precision (P): the fraction of assignments that are correct; (2) Recall (R): the portion of the correct categories that were assigned; (3) Accuracy (A): the percentage of all decisions that were correct; and (4) Error (E): the fraction of misclassified instances.

4.3 Results and Discussion

This section presents the classification outcome of the Bayes Net algorithm on the extracted features.

4.3.1 Feature Selection

Table 4 presents the experimental results of the selected classifier for the three corpuses. Since the corpuses are not all the same size, we calculated the weighted average over all corpuses. Our results show that hybrid feature selection, combining content-based and behavior-based features, gives quite promising results. This is evidence that features based on sender and domain behavior can be used to determine phishing email.
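For reference, the metrics of Section 4.2 and the weighted averaging behind Table 4 can be sketched as below; weighting each corpus by its size is our assumption of how the weighted averages were formed.

    def metrics(tp, fp, tn, fn):
        # Assumes non-degenerate counts (no zero denominators)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        accuracy = (tp + tn) / (tp + fp + tn + fn)
        error = 1.0 - accuracy
        return precision, recall, accuracy, error

    def weighted_average(per_corpus):
        # per_corpus: list of (corpus_size, metric_value) pairs
        total = sum(size for size, _ in per_corpus)
        return sum(size * value for size, value in per_corpus) / total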
Table 5. Comparison of our approach with existing works

Work                        Features               Approach                       Sample                                      Accuracy
Fette et al. [6]            10                     URL-based and script-based     Phishing 860, non-phishing 6950             92% (misclassification 0.1%)
Chandrasekaran et al. [11]  Sets 1-5: 18,20,7,7,7  Content-based                  Phishing 200, non-phishing 200              Set 1: 95%, Set 2: 100%, Set 3: 80%, Set 4: 90%, Set 5: 75%
Ma et al. [8]               7                      Hybrid-based                   Phishing 46,525 (7%), non-phishing 613,048  99%
Abu-Nimeh et al. [16]       43                     Keyword-based                  Phishing and ham 2889                       NN 94.5%, RF 94.4%, SVM 94%, LR 93.8%, BART 93.2%, CART 91.6%
Toolan et al. [5]           22                     Behavioral- and content-based  Total 6097 (non-phishing 70%, spam 30%)     Dataset 1: 97%, Dataset 2: 84%, Dataset 3: 79%
Zhang et al. [7]            7                      Behavior-based                 2328 hosts                                  Test 99.6%, train 95.8%
Nadeem et al. [13]          4                      Behavioral blacklisting        Dataset 1: 20,000, Dataset 2: 1437          Captures 50% of spam email
Our approach                7                      Hybrid features                Total dataset 6923                          Corpus 1: 96%, Corpus 2: 92%, Corpus 3: 92%
4.3.2 Comparative Analysis

In Table 5 we compare our results with existing work. Fette et al. [6] used 10 features based on URL and script presence and achieved 92% accuracy, but with a high false positive rate. Chandrasekaran et al. [11] tested five sets of data with different types of feature selection; their results were best when more features were used to classify phishing email. Ma et al. [8] proposed a hybrid-based approach with 7 features and achieved 99% accuracy, but used only a small sample of phishing emails. Abu-Nimeh et al. [16] examined 43 keywords to determine phishing email and found that the neural network performed best, with 94.5% accuracy. Toolan et al. [5], Zhang et al. [7] and Nadeem et al. [13] used behavior-based approaches to classify phishing email. Toolan et al. [5] used 22 features on 3 datasets, with approximately 97% accuracy at best. Zhang et al. [7] used a small live sample with 7 features, achieving 99.6% and 95.8% accuracy on their test and training datasets respectively. Even though Nadeem et al. [13] used only 4 features, they managed to capture 50% of spam email, with high FN and FP rates. Our work used 7 features on several datasets and achieved 96% accuracy.

4.3.3 Information Gain of Each Feature

In our experiment, we used 7 features to determine whether an email is phishing or not. However, not every feature is a good indicator. Some of them perform well
and some are weak features. Therefore, the information gain (IG) of each feature is calculated for each corpus to assess the quality of our proposed behavior-based features. Table 6 shows the ranking of the features: Subject_blacklist_word is the best feature, while URL_symbols is the least effective feature across all corpuses. We trained the classifier by taking one weak feature away at a time for each corpus, from rank 7 up to rank 5. We discovered that the classifier performs best when it is built on all 7 features. Table 7 shows that the accuracy of the shortened feature vector decreases as the feature vector is reduced; however, corpus 2 performs best using only 5 features. This shows that the features support each other, in agreement with [8]. Moreover, our behavior-based features classify email messages dynamically depending on the corpus. An example information gain computation is sketched after the table.

Table 6. Information gain for each corpus

      Corpus 1                        Corpus 2                        Corpus 3
Rank  IG      Feature                 IG      Feature                 IG      Feature
1     0.4448  Subject_blacklist_word  0.3882  Subject_blacklist_word  0.3715  Subject_blacklist_word
2     0.2426  URL_IP                  0.2765  Unique_domain           0.2726  URL_dots
3     0.1973  Unique_domain           0.2399  URL_dots                0.2043  Unique_sender
4     0.157   Unique_sender           0.2255  URL_IP                  0.1665  URL_IP
5     0.1021  URL_dots                0.2111  Unique_sender           0.1313  Unique_domain
6     0.0396  Domain_sender           0.0622  Domain_sender           0.0617  Domain_sender
7     0.022   URL_symbols             0.0368  URL_symbols             0.0395  URL_symbols
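The information gain of a binary feature, as used for the ranking in Table 6, can be computed as in the following sketch:

    import math

    def entropy(pos, neg):
        h = 0.0
        for c in (pos, neg):
            p = c / (pos + neg)
            if p > 0:
                h -= p * math.log2(p)
        return h

    def information_gain(samples):
        # samples: list of (feature_value, is_phishing) pairs, feature_value in {0, 1}
        pos = sum(1 for _, y in samples if y)
        neg = len(samples) - pos
        ig = entropy(pos, neg)
        for v in (0, 1):
            labels = [y for f, y in samples if f == v]
            if labels:
                sp = sum(labels)
                ig -= len(labels) / len(samples) * entropy(sp, len(labels) - sp)
        return ig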
Table 7. Comparison between the original and shortened feature vectors (accuracy)

          7 features  6 features  5 features  4 features
Corpus 1  95.9%       95.7%       95.7%       95.4%
Corpus 2  92.0%       91.9%       92.2%       90.4%
Corpus 3  92.0%       91.3%       91.4%       89.4%
5 Conclusion and Future Directions

In this paper, we propose behavior-based features to detect phishing emails by observing sender behavior. We extract all features using mbox2xml as a disassembly tool. We then mine the sender behavior to identify whether an email came from a legitimate sender or not, taking into account senders who tend to send email from more than a single domain and domains that handle different kinds of sender domains. Combining these features, we used the Bayes Net algorithm to classify the corpuses into phishing or ham emails. This hybrid feature selection approach produces promising results, using 7 features to achieve 96% accuracy with 4% false positive and false negative rates respectively. The feature selection used in this paper does not work on graphical content, as some attackers bypass content-based approaches using images. The results motivate future work to explore attackers'
behavior and profile their modus operandi. As future work, we will mine attacker behavior to understand attackers' motivations and profile them.
References

[1] Bergholz, A., Paaß, G., Reichartz, F., Strobel, S., Chung, J.H.: Improved Phishing Detection Using Model-Based Features. In: Proceedings of the International Conference on E-mail and Anti-Spam (2008)
[2] The Anti-Phishing Working Group, http://www.apwg.org/
[3] Liu, C.: Fighting Unicode-Obfuscated Spam. In: Proceedings of E-Crime Research (2007)
[4] Toolan, F., Carthy, J.: Phishing Detection Using Classifier Ensembles. In: eCrime Researchers Summit (2009)
[5] Toolan, F., Carthy, J.: Feature Selection for Spam and Phishing Detection. In: eCrime Researchers Summit, eCrime (2010)
[6] Fette, I., Sadeh, N., Tomasic, A.: Learning to Detect Phishing Emails. Technical report, Institute for Software Research International, School of Computer Science, Carnegie Mellon University (2006)
[7] Zhang, J., Du, Z., Liu, W.: A Behavior-Based Detection Approach to Mass-Mailing Hosts. In: Proceedings of the Sixth International Conference on Machine Learning and Cybernetics (2007)
[8] Ma, L., Ofoghani, B., Watters, P., Brown, S.: Detecting Phishing Emails Using Hybrid Features. In: Symposia and Workshops on Ubiquitous, Autonomic and Trusted Computing (2009)
[9] Zhou, L., Shi, Y., Zhang, D.: A Statistical Language Modeling Approach to Online Deception Detection. IEEE Transactions on Knowledge and Data Engineering (2007)
[10] Bazarganigilani, M.: Phishing E-Mail Detection Using Ontology Concept and Naïve Bayes Algorithm. International Journal of Research and Reviews in Computer Science (IJRRCS) (2011)
[11] Chandrasekaran, M., Narayanan, K., Upadhyaya, S.: Phishing Email Detection Based on Structural Properties. In: Proceedings of the NYS Cyber Security Conference (2006)
[12] Chandrasekaran, M., Shankaranarayanan, V., Upadhyaya, S.: CUSP: Customizable and Usable Spam Filters for Detecting Phishing Emails. In: NYS Symposium, Albany, NY (2008)
[13] Ahmed Syed, N., Feamster, N., Gray, A.: Learning to Predict Bad Behavior. In: NIPS 2007 Workshop on Machine Learning in Adversarial Environments for Computer Security (2008)
[14] Nazario, J.: Phishing Corpus, http://www.monkey.org/jose/wiki/doku.php?id=phishingcorpus
[15] Basnet, R.B., Sung, A.H.: Classifying Phishing Emails Using Confidence-Weighted Linear Classifiers. In: International Conference on Information Security and Artificial Intelligence (ISAI) (2010)
[16] Abu-Nimeh, S., Nappa, D., Wang, X., Nair, S.: A Comparison of Machine Learning Techniques for Phishing Detection. In: Proceedings of the APWG eCrime Researchers Summit, Pittsburgh, USA (2007)
[17] SpamAssassin public corpus, http://spamassassin.apache.org/publiccorpus
[18] Gansterer, W.N., Pölz, D.: E-Mail Classification for Phishing Defense. In: Advances in Information Retrieval, LNCS (2009)
On the Use of Multiplanes on a 2D Mesh Network-on-Chip

Cruz Izu
School of Computer Science, The University of Adelaide, Adelaide 5001, South Australia
[email protected]
Abstract. Like interconnection networks for parallel systems, networks-on-chip (NoC) must provide high bandwidth and low latency, but they are further constrained by their on-chip power budget. Consequently, simple network topologies such as the 2D mesh with shallow buffers, and simple routing strategies such as dimension-order routing (DOR), have been widely used to achieve this goal. A low number of virtual channels can be used to eliminate head-of-line blocking and increase network throughput. Given the spare routing area in deep submicron technology, another possibility is to replicate the simple network one or more times. This work compares and combines the two approaches by considering the distribution of a fixed number of virtual channels over one or more planes. A thorough evaluation of the possible 2D mesh network configurations under a range of workloads shows that, provided there is spare area, replicating the 2D mesh with 2 virtual channels results in the best power/performance trade-off. Keywords: Network-on-chip, replication, virtual channels, evaluation.
1 Introduction

Multicore systems with high communication demands have replaced bus-based and crossbar interconnects with packet-switching networks-on-chip (NoC). As these networks are limited by their on-chip power budget [12, 15], wormhole routing is widely used, normally paired with a low number of virtual channels. If the network must support multiple traffic classes, each class is mapped to its own virtual channel. Recent work on NoCs has focused on the use of replication as a simple way to increase throughput without any latency penalty at low loads. By duplicating the network we double its throughput while using the spare area that may result in a CMP tile after mapping of the basic router [4]. In some cases both approaches are combined, resulting in much higher performance with a modest energy overhead [1]. Plane replication is not a new idea; in fact, the mad postman network [8] used 4 independent planes to forward packets in a 2D mesh, but replication has not been explored further due to its silicon cost. However, multiplanes are going to play a part in future NoC design, as they provide the following advantages: they are able to use the spare area in tiled
architectures, they exhibit low power dissipation rates, and they can simplify the critical path by reducing or removing virtual channel arbitration. It is therefore of interest to explore the use of multiplanes and/or virtual channels in more depth. The performance impact of adding virtual channels to a single-plane router is well understood [2,3,5]. However, there is only limited exploration of the multiplane approach. Noh et al. [11] were the first to propose the use of fully replicated planes, although in their design the two planes were not fully independent and packets could change planes as they advance through the network. The paper showed the advantages of using both multiplanes and VCs but did not provide an evaluation with real loads. Carara et al. [4] considered a flit width constraint when comparing multiplanes (MPs) versus virtual channels (VCs). Yoon et al. [16] compared them under a fixed wire and buffer constraint, and showed that MPs are more efficient than VCs when the input buffer storage is limited; besides, they provide higher per-watt efficiency for regular traffic patterns. Finally, Gilabert et al. [6] explored different architectures for deterministic wormhole routing: a traditional VC design, the multi-switch approach in which a VC-less switch is replicated as many times as there are virtual channels, and the multi-network approach, which corresponds to the multiplane design with both link width and buffer constraints. Although [6] provided a system-level analysis, it reduced the number of injections in the multiplane to only one packet per cycle, so the multiplane approach could not make good use of the added planes. In short, the literature has shown that multiplanes can achieve lower area and power consumption, but there is limited analysis of this approach at the system level. The goal of this paper is to cover this evaluation gap under real loads by comparing the 3 possible alternatives: VC-less multiplanes, single-plane virtual channels, and multiplanes with virtual channels. The rest of the paper is organized as follows: Section 2 presents the description and background for multiplanes. Section 3 describes the simulation environment. Section 4 presents the results of the evaluation, followed by the conclusions in Section 5.
2 Multiplanes

A standard 2D mesh network can be seen as a network with a single plane. We assume a generic VC router, as shown in Figure 1, with a 4-stage pipeline: flit transmission, routing logic, VC and switch allocation, and crossbar traversal. The number of ports in a 2D mesh router is five: four network ports and the node injection/delivery port. If we duplicate the router component, each network interface is able to send messages using one of the two available planes. Messages move only through one plane, and there is no connection between the two planes. In other words, bandwidth between the network interface (NI) and the network plane(s) grows with the number of planes, and we are able to inject packets into multiple planes simultaneously.

2.1 Base Router Selection

Most NoC studies present a detailed implementation of the router followed by logic synthesis, and a post place-and-route simulation to measure power consumption.
Fig. 1. Architecture of a generic virtual channel router
At each step of this process there are trade-off decisions that will differ depending on the power and area budget of a given project. For example, when the critical path is simplified, we can trade off speed against area/power by using relaxed delay constraints [6]. As our main goal is to provide a system-level analysis of the use of multiplanes, instead of implementing each router we have used the router power simulator Orion 2.0 [9]. This high-level tool allows us to explore multiple configurations at the system level without the overheads of logic synthesis. The first step is to select the VC base router; in other words, we need to select the number of virtual channels and the buffer capacity of the single plane. We set the technology parameters to 65 nm CMOS with a core voltage of 1.0 V and a target clock frequency of 1 GHz. We model a 5x5 router with 32-bit links. Figure 2a shows the power estimation for a link load ranging from 0 (standby power) to full load. As we add more virtual channels, power consumption increases significantly. Using 2 VCs increases standby power by 30%, but the overhead is only 10-15% at medium loads. Having 4 virtual channels doubles standby power, with an overhead of 35-50% at medium loads. Buffer area grows with the number of virtual channels and is responsible for 30-40% of the standby power. Crossbar area remains constant as the number of inputs is fixed (one per port), but both VC and switch allocation grow moderately with the number of VCs. The area overhead for a 2 and 4 VC router is 16% and 50% respectively, compared to the VC-less counterpart. Figure 2b shows the impact that virtual channels have on the completion time of four traces from the SPEC benchmark running on a 5x5 mesh, normalized to the base case of a single channel. These traces were acquired from the operand network of the tiled TRIPS processor [14] (see Section 3.2 for further reference). There is a clear benefit in the use of 2 VCs, as this reduces completion time by 20-40%. Moving from 2 to 4 VCs reduces completion time in 3 of the 4 traces. Increasing the number of virtual channels above 4 shows no performance gains. In short, virtual channels are quite costly, but they are widely used because of their significant impact on network performance. Based on these preliminary results, we limit our exploration to a maximum of 4 virtual channels per router.
Fig. 2. (a) Power cost of adding virtual channels and (b) their impact on trace completion time
2.2 Replication Alternatives

Now that we have our base router, a 4 VC single-plane mesh, we consider its multiplane alternatives. As NoCs are normally used in a resource-constrained domain, we should consider the implications of router replication. Table 1 summarizes the replication alternatives for a standard 2D mesh network. If spare area is available, such as in [1], we can consider full replication. Assuming the NIs use the 2 planes in an alternating fashion, each plane will support half of the injected traffic, doubling the peak network throughput and reducing latency at low loads. Thus, we skip this case and focus on network performance under constrained replication conditions.

Table 1. Replication alternatives for a single-plane network with link width w and m buffers
                          Link width  Buffer capacity
Full replication          2*w         2*m
Buffer restricted         2*w         2*(m/2)
Bandwidth restricted      2*(w/2)     2*m
Link & buffer restricted  2*(w/2)     2*(m/2)
If area is a critical constraint, we can replicate the router but keep the total buffer space fixed, as buffers use a significant percentage of the total router area [9, 10]. If the router uses v virtual channels and each VC buffer has b flits, the total buffer capacity per router is m = 5*v*b. Most wormhole router implementations allocate to each VC the minimal buffer space determined by the round-trip latency. In this case it is not possible to reduce the value b when replicating; instead, we can keep the buffer capacity constant in a 2-multiplane network by halving the number of virtual channels used in each plane. For example, if the single-plane network has 4 VCs, the dual-plane alternative will have 2 VCs per plane. An advantage of this approach is that the VC arbitration stage is simplified as the number of planes grows, and is eventually removed from the router.
Regarding the link restriction, we note that on-chip networks are not limited by pin count as inter-chip networks are, having enormous wiring resources at their disposal [3,12]. Thus, in this study we consider our design to be buffer constrained only. This results in the following 3 configurations: (a) 1P-4VC: a single plane with 4 virtual channels, each with a 4-flit input buffer; (b) 2P-2VC: 2 planes, each with 2 VCs and similar input buffers; (c) 4P-1VC: 4 planes, with only one 4-flit input buffer per port. The constant total buffer capacity across these configurations is sketched below. Since Orion 2.0 cannot model 2 or more planes when calculating power consumption, we assumed a replicated plane with a balanced fraction of the load received by a single plane. This leads to power overheads of 8-30% for the 2P-2VC and 26-100% for the 4P-1VC configuration compared to the 1P-4VC base case. Note this is a worst-case scenario, as the design could be optimized to match or even reduce the area and power dissipation of the base router, as shown in [4,6,11].
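A small sketch of the buffer constraint follows: the total buffer capacity per router is m = 5*v*b flits (5 ports, v VCs of b flits each), and it stays constant across the three configurations as the VCs per plane are halved each time the planes double. The parameter values are taken from the text above.

    def buffer_per_router(ports, vcs, flits_per_vc):
        return ports * vcs * flits_per_vc  # m = 5 * v * b

    for planes, vcs in ((1, 4), (2, 2), (4, 1)):
        m = buffer_per_router(5, vcs, 4)
        # every configuration totals 80 flits of buffering across its planes
        print(f"{planes}P-{vcs}VC: {planes} x {m} = {planes * m} flits")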
3 Evaluation Environment

This section briefly presents the network simulator used to model multiplanes and the application traces.

3.1 Popnet Simulator and Multiplanes

Popnet is a C++ network simulator for NoCs [13] with the Orion 1.0 power models embedded in it. It models deterministic wormhole routers for torus and mesh topologies. We can change the following network parameters: the network radix and number of dimensions, the size of the input and output buffers, the flit size, the link length, and the routing strategy. We simulated a 5x5 mesh network with one or more planes. This network size is selected so that we can use the application traces described in the next section instead of relying on synthetic traffic loads. For more details regarding the TRIPS network, please refer to [7]. We modeled a multiplane network by dividing the original trace file into 2 or 4 files, one per plane, assuming each node interface selects the plane to inject a packet in round-robin fashion: the 1st packet uses plane 0, the 2nd packet uses plane 1, and so on (see the sketch below). Each plane is simulated separately by feeding the modified trace file to the simulator with the parameters of the given plane. Note this relies on the assumption that injection and delivery of packets can happen simultaneously in each plane. The reported completion time is the worst completion time over the set of planes; the observed variation in completion time between planes was minimal. The output of each simulation includes the completion time, the number of delivered messages, and the power consumption (as this is based on Orion 1.0, it does not include standby power).
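The trace-splitting step described above might look like the following sketch (the trace representation as a list of packet records is an assumption; Popnet's actual trace format is not reproduced here):

    def split_trace(trace_lines, n_planes):
        # Round-robin assignment of packets to planes, mirroring how each NI
        # is assumed to pick an injection plane: packet k goes to plane k mod n.
        planes = [[] for _ in range(n_planes)]
        for k, packet in enumerate(trace_lines):
            planes[k % n_planes].append(packet)
        return planes

    # e.g. write each sublist to its own per-plane trace file and simulate separately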
3.2 SPEC Traces

The network traces applied to this simulator are taken from the TRIPS OPN network [7] for a range of SPEC benchmarks. Although SPEC is a single-threaded benchmark, the TRIPS processor implements a tiled microarchitecture, and each trace records the transfers of operands between the 5x5 tiles of the processor. The execution of these traces shows the impact that the different network configurations have on the completion time of the global application. Each trace contains 1 million packets. In examining the application traces, we note the following common characteristics: node 0 has no incoming or outgoing traffic, the other 4 nodes on the diagonal together generate half of the network traffic, and the destinations are mostly balanced. In other words, the traffic resembles hot-region traffic, but caused by congestion at the sources instead of at the destinations. The average distance travelled is in the range of 2.1 to 2.3 hops, lower than the network average distance of 3.3 hops.
4 Multiplane Evaluation

We have compared the three configurations: one plane with 4 virtual channels (1P-4VC), two planes with 2 virtual channels each (2P-2VC), and 4 planes with no virtual channels (4P-1VC). In all cases we have kept the buffer capacity constant, as described before.

4.1 Completion Times

Having fixed the number of virtual channels to 4, we compare the performance results when using one, two or four planes. Figure 3 shows the impact of using multiplanes for the 4 application traces. The values are normalized to the single-plane completion time for each traffic load. The application compress is not limited by the network bandwidth, and in this case adding multiplanes does not reduce its completion time. For the other three application loads, using 2 multiplanes is quite effective; as we double the network bandwidth with 2 planes, completion times are reduced by 40-45%. This is to be expected for applications with heavy communication demands that saturate the network.
Fig. 3. Impact of the number of planes on trace completion time
However, using 4 planes is less effective; although it reduces completion time by a further 5-10%, it requires twice as much bandwidth as its 2P-2VC counterpart.

4.2 Power Consumption

The second metric of interest is the power consumption under each network load. We consider first the compress trace, which uses the network at a load below saturation. Figure 4 shows the average router power dissipation per cycle during the simulation time for each network configuration. The top value in each graph represents the total network consumption per cycle. The network load shows small fluctuations until the nodes stop injecting messages, at approximately 650,000 cycles.
Fig. 4. Average power dissipation per cycle – compress trace
With multiplanes, each plane carries a fraction of the load and consumes a similar fraction of the power. However, the average delay per packet is reduced by 40% when using 2P-2VC. The delay does not decrease any further when using 4P-1VC. Therefore, the hybrid approach is the winner at medium loads. The other three traces, swim, mgrid and hydro2d, exhibited similar trends, as they have heavy communication demands. We have chosen swim as the example to discuss power efficiency for the other loads. Figure 5 shows the average router power dissipation per cycle during the simulation time for each network configuration under the swim application load. We can see the network is heavily loaded in all cases, until packet injection stops around 120,000 cycles. In the case of the single plane, at this point in time only 60% of the messages have been delivered, while for 2 or 4 multiplanes nearly 80% of the
messages have been delivered by then. The increased throughput reduces the additional time needed by the network to deliver the remaining traffic. Although multiplane networks exhibit a higher power cost per cycle, they deliver higher throughput as well. In fact, the total energy used to deliver this trace is similar for all configurations, as shown in Figure 6a. This is to be expected, as Orion 1.0 does not take into account standby power, measuring only the power used when sending messages through each router component.
Fig. 5. Average power dissipation per cycle – swim trace
Fig. 6. (a) Total energy used to deliver the swim trace and (b) power consumption per router component
Figure 6b shows the power consumption split by router component. In spite of the limited buffer space, the memory elements attached to the input ports are responsible for 85% of the power consumed. The power to transmit flits over the router links represents 9-10%, the crossbar uses 4-5%, while the arbiter and clock use less than 1%. Again, the values for swim are representative of the other two application loads. In short, this evaluation has shown that a hybrid configuration with 2 planes and 2 VCs per plane is effective. For non-saturated loads such as compress, the additional bandwidth results in lower latencies, and for heavily loaded applications it significantly reduces completion times. However, using 4 multiplanes (and no virtual channels) does not result in significant gains in spite of doubling the bandwidth compared with 2P-2VC. A possible reason for this poor performance is that head-of-line blocking in the absence of virtual channels prevents the application from using the additional bandwidth. We explore this topic further in the next section.

4.3 Evaluation of 4P-2VC

In this section we maintain the buffer constraints for the 4-plane network by splitting the 4-flit buffer into 2 virtual channels of 2 flits each. This buffer may not be realistic for some implementations, as it may not cover the round-trip latency from one wormhole router to the next. However, the goal of this evaluation is to identify whether head-of-line blocking is the reason why the 4P-1VC configuration underperformed in the previous tests. We have run the three traces with this new configuration, and the results in terms of completion time for each trace are shown in Figure 7. As can be observed, there is a reduction in completion time compared with 4P-1VC, but the gains are in the 6-8% range. Thus, the impact of head-of-line blocking on performance is not significant.
Fig. 7. Trace completion times for 2 and 4 multiplane configurations
Note that our experiments are limited to a 5x5 network due to the original trace configuration, and such a small network has limited adaptivity. Thus, we should extend this work to larger networks, in which the use of more virtual channels increases adaptivity.
5 Conclusions

This paper has explored the advantages of using multiplanes in a 2D mesh network at the system level, using a power simulator and real application traces to measure the impact that multiplanes have on network performance. We have extended previous multiplane evaluations by considering not only multiplanes versus virtual channels but also a hybrid configuration that combines 2 multiplanes with the use of virtual channels. In fact, the hybrid router 2P-2VC shows the best performance for the four application loads. Multiplanes increase network performance at medium to high loads, with a moderate power cost. For networks with variable traffic demands, it should be possible to use power gating to shut off some of the network planes when network activity is low. Therefore, a hybrid multiplane is the best approach under buffer and power constraints. We will extend this work by evaluating larger systems under a wider range of application loads. We also need to consider the interface requirements to take full advantage of the increased bandwidth provided by multiplanes.
References

1. Balfour, J., Dally, W.J.: Design tradeoffs for tiled CMP on-chip networks. In: Proceedings of the 20th Annual International Conference on Supercomputing (ICS 2006), pp. 187–198. ACM, New York (2006), doi:10.1145/1183401.1183430
2. Dally, W.J.: Virtual-Channel Flow Control. IEEE Trans. Parallel Distrib. Syst. 3(2), 194–205 (1992), doi:10.1109/71.127260
3. Dally, W.J., Towles, B.: Route packets, not wires: on-chip interconnection networks. In: Proceedings of the 38th Annual Design Automation Conference (DAC 2001), pp. 684–689. ACM, New York (2001), doi:10.1145/378239.379048
4. Carara, E., Moraes, F., Calazans, N.: Router architecture for high-performance NoCs. In: Proceedings of the 20th Annual Conference on Integrated Circuits and Systems Design (SBCCI 2007), pp. 111–116. ACM, New York (2007), doi:10.1145/1284480.1284515
5. Duato, J., Yalamanchili, S., Ni, L.: Interconnection Networks: An Engineering Approach. IEEE Computer Society Press, Los Alamitos (1997)
6. Gilabert, F., Gomez, M.E., Medardoni, S., Bertozzi, D.: Improved Utilization of NoC Channel Bandwidth by Switch Replication for Cost-Effective Multi-processor Systems-on-Chip. In: Proceedings of the Fourth ACM/IEEE International Symposium on Networks-on-Chip (NOCS 2010), pp. 165–172. IEEE Computer Society, Washington, DC, USA (2010), doi:10.1109/NOCS.2010.25
7. Gratz, P., Kim, C., Sankaralingam, K., Hanson, H., Shivakumar, P., Keckler, S.W., Burger, D.: On-Chip Interconnection Networks of the TRIPS Chip. IEEE Micro 27(5), 41–50 (2007), doi:10.1109/MM.2007.90
8. Jesshope, C.R., Izu, C.: The MP1 Network Chip and its Application to Parallel Computers. The Computer Journal 36(8), 763–777 (1993), doi:10.1093/comjnl/36.8.763
9. Kahng, A., Li, B., Peh, L.-S., Samadi, K.: Orion 2.0: A fast and accurate NoC power and area model for early-stage design space exploration. In: Proceedings of the Conference on Design, Automation and Test in Europe. ACM, New York (2009)
10. Matsutani, H., Koibuchi, M., Wang, D., Amano, H.: Adding Slow-Silent Virtual Channels for Low-Power On-Chip Networks. In: Proceedings of the Second ACM/IEEE International Symposium on Networks-on-Chip (NOCS 2008), pp. 23–32. IEEE Computer Society, Washington, DC, USA (2008)
11. Noh, S., Ngo, V.-D., Jao, H., Choi, H.-W.: Multiplane Virtual Channel Router for Network-on-Chip Design. In: First International Conference on Communications and Electronics (ICCE 2006), October 10-11, pp. 348–351 (2006), doi:10.1109/CCE.2006.350796
12. Owens, J.D., Dally, W.J., Ho, R., Jayasimha, D.N., Keckler, S.W., Peh, L.-S.: Research Challenges for On-Chip Interconnection Networks. IEEE Micro 27(5), 96–108 (2007)
13. Shang, L., Peh, L.-S., Jha, N.K.: Dynamic Voltage Scaling with Links for Power Optimization of Interconnection Networks. In: Proceedings of the 9th IEEE International Symposium on High-Performance Computer Architecture, pp. 79–90 (February 2003)
14. Sankaralingam, K., Nagarajan, R., Gratz, P., Desikan, R., Gulati, D., Hanson, H., Kim, C., Liu, H., Ranganathan, N., Sethumadhavan, S., Sharif, S., Shivakumar, P., Yoder, W., McDonald, R., Keckler, S.W., Burger, D.C.: The Distributed Microarchitecture of the TRIPS Prototype Processor. In: 39th International Symposium on Microarchitecture (MICRO) (December 2006)
15. Vangal, S.R., et al.: An 80-tile sub-100W teraFLOPS processor in 65-nm CMOS. IEEE Journal of Solid-State Circuits (2008)
16. Yoon, Y.J., Concer, N., Petracca, M., Carloni, L.: Virtual channels vs. multiple physical networks: a comparative analysis. In: Proceedings of the 47th Design Automation Conference (DAC 2010), pp. 162–165. ACM, New York (2010), doi:10.1145/1837274.1837315
A Minimal Average Accessing Time Scheduler for Multicore Processors

Thomas Canhao Xu, Pasi Liljeberg, and Hannu Tenhunen
Turku Center for Computer Science, Joukahaisenkatu 3-5 B, 20520, Turku, Finland
Department of Information Technology, University of Turku, 20014, Turku, Finland
{canxu,pasi.liljeberg,hannu.tenhunen}@utu.fi
Abstract. In this paper, we study and analyze process scheduling for multicore processors. It is expected that hundreds of cores will be integrated on a single chip, known as a Chip Multiprocessor (CMP). However, operating system process scheduling, one of the most important design issues for CMP systems, has not been well addressed. We define a model for future CMPs, based on which a minimal average accessing time scheduling algorithm is proposed to reduce on-chip communication latencies and improve performance. The impact of memory access and inter-process communication (IPC) on scheduling is analyzed. We explore six typical core allocation strategies. Results show that a strategy with the minimal average accessing time for both core-core and core-memory communication outperforms the other strategies; the overall performance for three applications (FFT, LU and H.264) improves by 8.23%, 4.81% and 10.21% respectively compared with the other strategies.
1 Introduction
CMP technology enables today's semiconductor companies to integrate more than one core on a single chip. It is predictable that, in the future, chips with hundreds of cores will appear on the market. However, current communication schemes in CMPs are based on the shared bus architecture, which suffers from high communication delay and low scalability. Therefore, the Network-on-Chip (NoC) has been proposed as a promising approach for future systems with hundreds or even thousands of cores on a chip [1]. A NoC-based multicore processor is different from modern processors in that a network is used as the on-chip communication medium. Figure 1 shows a NoC with a 4x4 mesh network. The underlying network is comprised of network links and routers (R), each of which is connected to a processing element (PE) via a network interface (NI). The basic architectural unit of a NoC is the tile/node (N), which consists of a router, its attached NI and PE, and the corresponding links. Communication among PEs is achieved via the transmission of network packets. Intel has demonstrated
an experimental microprocessor containing 48 x86 cores on a chip; the chip implements a 4×6 2D mesh network with 2 cores per tile [2]. Tile-Gx, the latest generation of NoC from Tilera, brings 16 to 100 processor cores interconnected with a mesh on-chip network [3]. In Figure 1, memory controllers are placed on the upper and lower sides of the chip; this represents a typical NoC design, similar to the Intel and Tilera chips.

Fig. 1. A 4x4 mesh multicore processor with on-chip memory controllers

The design of operating system schedulers is one of the most important issues for CMPs. For large-scale CMPs such as hundred-core chips, it is obvious that scheduling multi-threaded tasks to achieve better or even optimal efficiency is crucial. Several multiprocessor scheduling policies, such as round-robin, co-scheduling and dynamic partitioning, have been studied and compared in [4]. However, these policies are designed mainly for the conventional shared-bus communication architecture. Many heuristic-based scheduling methods have also been proposed [5]. These methods are based on different assumptions, e.g. prior knowledge of the tasks and of the execution time of each task in a program, presented as a directed acyclic graph. Hypercube scheduling has been proposed for off-chip systems [6]; hypercube systems, usually based on Non-Uniform Memory Access (NUMA) or cache-coherent NUMA architectures [7], are different from CMPs. It is claimed in [8] that the network latency is greatly affected by the distance between a core and a memory controller. Therefore, how to reduce the distances between tasks and memory controllers is one of the main considerations in our approach. However, the work in [8] is based on explicitly enumerating all possible permutations of memory controller placement beforehand, whereas our study focuses on the scheduling side instead of the hardware design. Task scheduling for NoC platforms is studied in [9] and [10], but the effect of memory controller placement is not considered in these papers. In this paper, we propose and discuss a novel scheduler for NoC-based CMPs which aims to minimize the average network latency between memory modules and cores. With decreased latencies, lower power consumption and higher performance can be achieved. To confirm our theory, we model and analyze a 64-core NoC with an 8x8 mesh (Figure 9) and present the performance of different allocation strategies using a full system simulator.
2 Motivation
An unoptimized scheduling algorithm can cause hotspots and traffic contention. As a result, the average network latency, one of the most important factors of a NoC, increases and overall performance degrades. Figure 2 shows the network
request rate of each processing core when running FFT on a 16-core NoC under the GEMS/Simics simulation environment. The detailed system configuration can be found in Section 5.1 (except for the number of cores and memory controllers: here we use a 4x4 mesh with 16 cores and 8 memory controllers). In Figure 2, the horizontal axis is time, segmented in 216K-cycle percentage fragments. The traffic trace has 1.64M packets, with 21.6M cycles executed. The traffic is shown for all 16 nodes. It is revealed that 63.9% of the data traffic is concentrated on five nodes (N0 29.6%, N8 6.7%, N11 10.0%, N13 8.7% and N15 8.8%). The top point-to-point traffics are listed in Table 1. A small portion of source-destination pairs generates a sizable portion of the traffic, e.g. 3.13% of the pairs (8/256) generate 32.07% of the traffic.
Fig. 2. Network request rate for the 16-core NoC running FFT. The time is segmented in 216K-cycle/percentage fragments.
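The per-node and per-pair concentration figures quoted above can be extracted from a trace with a sketch like the following (the (src, dst) pair format is an assumption about the trace contents):

    from collections import Counter

    def traffic_shares(packets):
        # packets: iterable of (src, dst) node-id pairs taken from a network trace
        packets = list(packets)
        per_node = Counter(src for src, _ in packets)
        per_pair = Counter(packets)
        total = len(packets)
        node_share = {n: c / total for n, c in per_node.items()}
        return node_share, per_pair.most_common(8)  # cf. Table 1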
Assuming X-Y deterministic routing, Equation 1 shows the access time (latency) required for a core-core communication. The latency involves the in-tile links (between NI and PE, L_Link_delay1), the routers (L_Router_delay), the tile-tile links (L_Link_delay2) and the number of hops required to reach the destination (n_hop). Obviously, without a proper schedule, the communication overhead can be an obstacle for future multicore processors.

    L_C = (n_hop + 1) × L_Router_delay + 2 × L_Link_delay1 + n_hop × L_Link_delay2    (1)

Table 1. Top point-to-point traffics

Src  Dst  Percentage
0    11   7.43
0    4    4.11
0    3    3.94
15   11   3.66
13   6    3.63
11   0    3.54
0    12   3.49
8    11   2.27
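Equation 1 translates directly into the following sketch; the delay values are placeholders, since the paper does not fix them here, and X-Y routing makes the hop count the Manhattan distance between tiles.

    def core_core_latency(src, dst, l_router=1.0, l_link_ni=1.0, l_link_tile=1.0):
        # src, dst: (x, y) tile coordinates; hops under X-Y routing equals the
        # Manhattan distance between the two tiles.
        hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
        return (hops + 1) * l_router + 2 * l_link_ni + hops * l_link_tile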
3 Scheduling with Minimal Average Accessing Time
In this section, we define a model for our system and propose a new scheduling algorithm aimed at minimizing the average access time. We then analyze the advantages and limitations of our algorithm from different aspects.
3.1 NoC Model and Access Time
Our proposed algorithm considers the on-chip topology; scheduling decisions are made based on this information. We use a NoC model as described below.

Definition 1. A NoC N(P(X,Y), M) consists of a PE mesh P(X,Y) of width X and length Y, and on-chip memory controllers M (connected to the upper and lower sides of the NoC). Figure 9 shows a NoC of N(P(8,8), 16).

Definition 2. A N(P(X,Y), M) consists of X×Y PEs, which is the maximum number of concurrent threads it can process.

Definition 3. Each PE is denoted by a coordinate (x, y), where 0≤x≤X−1 and 0≤y≤Y−1. Each PE contains a core, a private L1 cache and a shared L2 cache.

Definition 4. The Manhattan Distance between n_i(x_i, y_i) and n_j(x_j, y_j) is MD(n_i, n_j) = |x_i − x_j| + |y_i − y_j|.

Definition 5. Two nodes n_1(x_1, y_1) and n_2(x_2, y_2) are interconnected by a router and the related link only if they are adjacent, i.e. |x_1 − x_2| + |y_1 − y_2| = 1.

Definition 6. A task T(n) with n threads requests the allocation of n cores.

Definition 7. nFree is a list of all unallocated nodes in N.

Definition 8. R(T(n)) is an unallocated region in P with n cores for T(n).

The average core access time (ACT) and the average memory access time (AMT) are calculated when making scheduling decisions. The aim of the algorithm is to minimize the average network latency of the system, which is one of the most important metrics of a NoC. ACT is defined as the average number of nodes a message has to go through from a node to the other nodes:

ACT = Σ MD(n_i, n_j) / n,  such that ∀ i ≠ j ∈ P and n_i ≠ n_j    (2)

For a rectangular allocation with A×B nodes, according to [11], ACT can be calculated with Equation 3:

ACT = (A + B)/3 × (1 − 1/(A×B))    (3)

For example, 4×4 and 2×8 are both possible rectangular core allocations for a task with 16 threads, but the ACT of 4×4 is smaller than that of 2×8 (2.5 versus 3.125). In terms of ACT, an allocation shape has a lower ACT the closer it is to a square. Figures 3a and 3b show two core allocation schemes for a task with 15 threads; the ACT in Figure 3b is lower than in Figure 3a (2.4177 and 2.4888, respectively).

Fig. 3. Comparison of two core allocation schemes for 15 threads
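As a quick check of Equation 3, the short C program below (ours) reproduces the ACT values quoted above for the 4×4 and 2×8 allocations.

#include <stdio.h>

/* ACT of a rectangular A x B allocation, per Equation 3 */
static double rect_act(int a, int b)
{
    return (a + b) / 3.0 * (1.0 - 1.0 / (double)(a * b));
}

int main(void)
{
    printf("4x4: %.4f\n", rect_act(4, 4));  /* prints 2.5000 */
    printf("2x8: %.4f\n", rect_act(2, 8));  /* prints 3.1250 */
    return 0;
}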
Consider next the memory controller placement: in Figure 1, for example, the memory controllers are placed at the top and bottom of the chip. The number of transistors required for a memory controller is quite small compared with the billions of total transistors in a chip; a DDR2 memory controller is about 13,700 gates in an application-specific integrated circuit (ASIC) and 920 slices in a Xilinx Virtex-5 field-programmable gate array (FPGA) [12]. The memory controllers are shared by all processors to provide a large physical memory space to each processor. Each controller controls a part of the physical memory, and each processor can access any part of the memory [13]. Traditionally, a physical address is mapped to a memory controller according to its address bits and cache line address; in this case, memory traffic is distributed evenly across all controllers. In our study, however, we assume that a physical address is mapped to a memory controller according to its physical location in the on-chip network, i.e. to the nearest controller in terms of MD [14]. We define AMT as the minimal number of nodes a message has to go through from a node to a memory controller, since more than one controller can co-exist:

AMT = Σ min(MD(n_i, M)) / n,  ∀ i ∈ P    (4)

Equation 5 shows the access time required for a core-memory communication (not considering the latencies of the memory controller and the memory itself):

LM = L_Link_delay1 + (n_hop + 1) × (L_Router_delay + L_Link_delay2)    (5)
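The following C sketch (ours; the array layout is an assumption, not a structure from the paper) computes AMT per Equation 4 under the nearest-controller mapping described above.

#include <stdlib.h>

/* Manhattan Distance of Definition 4 */
static int md(int x1, int y1, int x2, int y2)
{
    return abs(x1 - x2) + abs(y1 - y2);
}

/* AMT of an allocated node set (Equation 4): each node is charged the
 * distance to its nearest memory controller. nodes[i] and ctrls[j]
 * hold (x, y) coordinates. */
static double amt(const int (*nodes)[2], int n, const int (*ctrls)[2], int m)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        int best = md(nodes[i][0], nodes[i][1], ctrls[0][0], ctrls[0][1]);
        for (int j = 1; j < m; j++) {
            int d = md(nodes[i][0], nodes[i][1], ctrls[j][0], ctrls[j][1]);
            if (d < best)
                best = d;        /* nearest controller in terms of MD */
        }
        sum += best;
    }
    return (double)sum / n;
}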
3.2 Analyzing Different Scheduling Strategies
Figure 4 shows six typical allocations of a task with 16 threads on a 64-core CMP configuration; all cores are initially free, and gray nodes denote allocated cores.

Fig. 4. Comparison of different core/memory allocation schemes

One of the worst-case ACT configurations is shown in Figure 4c, in which the 16 threads are distributed to the four corners of the CMP; the thread-thread communication delay is thus very high, and by Equation 2 the ACT is 6.625. The square allocation shown in Figure 4a gives the most promising ACT, reduced to the minimum of 2.5. As noted for Equation 3, for a rectangular core allocation a quasi-square shape has the lowest ACT value, so Figure 4a clearly yields a minimal ACT for a 16-thread task. In terms of AMT, however, although the ACT in Figure 4c is the worst,
its AMT is only 1.75. Figure 4a, despite having the best ACT value, has an AMT of 2.5; this allocation might not be optimal if a task has many memory accesses. Each time a cache miss happens, a request to the memory subsystem is generated to fetch the required data, and lower network latency translates into higher performance. The best-case AMT is shown in Figure 4b, because each allocated core is connected to a memory controller directly. Figure 4d shows the worst AMT value, which is 4. Two balanced allocation strategies are shown in Figures 4e and 4f. In these strategies, although neither ACT nor AMT beats the other strategies as a single value, the average of the two factors is better than in the other four strategies. For instance, the allocated cores in Figure 4e form two lines adjacent to each other; the ACT and AMT are therefore 3.125 and 1.5, respectively, and the average value is lower than in Figure 4a (2.3125 versus 2.5). Figure 4f shows another possibility, with a further reduced average of ACT and AMT. Table 2 summarizes these data. We note that reducing ACT (between 6.625 and 2.5) increases AMT (between 1 and 4), and vice versa.

Table 2. ACTs and AMTs for different allocation strategies

Strategy    ACT     AMT     Average
Figure 4a   2.5000  2.5000  2.5000
Figure 4b   6.1250  1.0000  3.5625
Figure 4c   6.6250  1.7500  4.1875
Figure 4d   3.1250  4.0000  3.5625
Figure 4e   3.1250  1.5000  2.3125
Figure 4f   2.6406  1.8750  2.2578
3.3 The Algorithm
In our case (irregularly shaped allocation with ACT and AMT constraints), given a task with n executing threads, we define the problem as determining the best core allocation for the task by selecting a region containing n cores. The problem can be described as: find a region R(T(n)) inside N(P(X,Y), M) and a node list Nl of R(T(n)) which minimizes the average of ACT and AMT.

Algorithm 1. The steps of the region selection algorithm
1. ∀n ∈ nFree, calculate all min(MD(n, M)).
2. ∀n ∈ nFree, start with the first free node n_i, calculate MD(n_i, n_j) for all other n_j ∈ nFree, and sort them in ascending order MD_1 ≤ MD_2 ≤ MD_3 ≤ ... ≤ MD_k.
3. Repeat 2 for the remaining free nodes.
4. Select the R(T(n)) from 3 which contains an Nl that satisfies T(n) with min{(ACT + AMT)/2}.
The pseudo code of the algorithm is shown in Algorithm 1; Figure 4f shows its outcome, with the minimal average value of ACT and AMT. Algorithm 1 uses exhaustive search, so it always finds the best region. However, it is very important to design an efficient scheduling algorithm. Our problem is in nondeterministic polynomial time (NP): to determine whether an allocation strategy has the lowest combination of ACT and AMT, it suffices to enumerate the allocation possibilities and then verify whether these possibilities produce the lowest
value. We consider the problem to be NP-complete: although any allocation can be verified in polynomial time, there is no known efficient way to find the best allocation, and it is as difficult as other NP-complete problems. The time required to solve this problem increases very quickly as the size of the input grows (e.g. the number of free nodes and the number of threads in a task). As a result, exhaustive search is feasible only for a small NoC, because of the high computational complexity of the large search space: an 8×8 mesh with 16 threads has C(64,16) = 488,526,937,079,580 different allocation possibilities! In the real world, however, a task is likely to have fewer threads, and fewer PEs are available for allocation; faulty PEs can also be excluded from the search space. The search space can thus be much smaller. Figure 5 shows a fragmented allocation in which only 28 cores are available for a new task; in this case, there are only C(28,16) = 30,421,755 allocation possibilities for a 16-thread task.

Fig. 5. A fragmented situation with 36 cores occupied (gray) and 28 cores free (white)

Heuristic scheduling algorithms have been proposed that assume a clear view of the behavior of a program beforehand [5]; for instance, the longest path to a leaf is selected in the dependence directed acyclic graph [5]. However, this method is not practical for millions of different applications. We therefore extend Algorithm 1 with a greedy heuristic approximation. As noted above, an allocation shape closer to a square has a lower ACT, so calculating all combinations is unnecessary. Take Figure 5 for example: to schedule a task with 8 threads, we start from the quasi-square regions closest to the number of nodes required by the task. In this case, we have 4 candidates: R1(N33–N35, N41–N43), R2(N31, N32, N39, N40, N47, N48), R3(N38–N40, N46–N48) and R4(N38, N39, N46, N47, N54, N55). To select the other two nodes, the nodes adjacent to the region are considered. The improved algorithm is shown below.
Algorithm 2. The steps of the greedy heuristic approximation
1. ∀n ∈ nFree, calculate the ACT and AMT of every region which contains ≤ T(n) nodes.
2. If a region from 1 is smaller than the task, add nodes adjacent to the region.
3. Select the R(T(n)) from 2 which contains an Nl that satisfies T(n) with min{(ACT + AMT)/2}.
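A minimal C sketch of the scoring step shared by Algorithms 1 and 2 is given below; it is our illustration, not the authors' code. ACT is computed as the mean pairwise Manhattan Distance over the region (the normalization that matches Equation 3 for rectangles), and the returned value is the objective (ACT + AMT)/2; the greedy step of Algorithm 2 would call this once per candidate adjacent node and keep the node with the lowest score.

#include <stdlib.h>

typedef struct { int x, y; } node_t;

static double region_score(const node_t *r, int n,
                           const node_t *ctrl, int m)
{
    long pair_sum = 0;
    long amt_sum = 0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)            /* all node pairs */
            pair_sum += abs(r[i].x - r[j].x) + abs(r[i].y - r[j].y);
        int best = abs(r[i].x - ctrl[0].x) + abs(r[i].y - ctrl[0].y);
        for (int j = 1; j < m; j++) {          /* nearest controller */
            int d = abs(r[i].x - ctrl[j].x) + abs(r[i].y - ctrl[j].y);
            if (d < best)
                best = d;
        }
        amt_sum += best;
    }
    double act = (double)pair_sum / ((double)n * n);
    double amt = (double)amt_sum / n;          /* Equation 4 */
    return (act + amt) / 2.0;                  /* objective to minimize */
}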
3.4 Discussion
Although our goal is to find the best combination of ACT and AMT using the average of the two, the relative weights of ACT and AMT should be considered as well. Different applications have their own profiles: memory-intensive or IPC-intensive. Research has shown that scientific applications such as transitive closure, sparse matrix-vector multiplication, histogram and mesh adaption are memory-intensive [15]. It is also shown by Patrick Schmid et al. [16] that video editing, 3D gaming and file compression are memory-sensitive applications in daily computing, while other applications concentrate more on thread-thread communication. It is difficult to determine the behavior of an application automatically beforehand, since there are millions of applications and the number is still increasing. One feasible way is to add an interface between the application and the OS through which the application tells the OS whether it is memory-intensive. Another is to add a low-overhead profiling module inside the OS that traces program access patterns dynamically: calls to memory management functions such as malloc(), free() and realloc() are collected as histograms for evaluating the weight of AMT, and calls to thread management functions such as pthread_create(), pthread_join() and pthread_mutex*() are collected as histograms for evaluating the weight of ACT. Note that these histograms can only be used for rescheduling (thread migration, or in case of a faulty PE), i.e. there are no access patterns for the first run of a program.

Another issue is that the time spent finding the best combination of ACT and AMT may not pay off: if the differences between allocation strategies are small and the search algorithm takes too long, a near-optimal allocation strategy is preferable. In this paper, we evaluate the performance differences of several allocation strategies for three 16-thread tasks with different IPC and memory access intensities. The detailed performance analysis is given in the following sections.
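A possible realization of this weighting is sketched below in C (ours; the paper does not fix a formula): the histogram totals of memory-management and thread-management calls set the relative weights of AMT and ACT, falling back to the plain average on a first run.

static double weighted_score(double act, double amt,
                             unsigned long mem_calls,     /* malloc/free/realloc  */
                             unsigned long thread_calls)  /* pthread_* operations */
{
    double total = (double)mem_calls + (double)thread_calls;
    if (total == 0.0)
        return (act + amt) / 2.0;   /* first run: no access pattern yet */
    double w_amt = (double)mem_calls / total;
    double w_act = (double)thread_calls / total;
    return w_act * act + w_amt * amt;
}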
4 Case Studies

4.1 FFT
The fast Fourier transform (FFT) is an algorithm to compute the discrete Fourier transform and its inverse; it is widely used in digital signal processing. There are many FFT implementations; we select a one-dimensional, radix-n, six-step algorithm from [17], which is optimized to minimize IPC. The algorithm has two input data sets: one with n² complex data points to be transformed, and another with n² complex data points referred to as the roots of unity. The two data sets are organized and partitioned as n×n matrices; a partition of contiguous rows is assigned to a processor and distributed to its local cache. The six steps are: (1) transpose the input data set matrix; (2) perform one-dimensional FFTs on the resulting matrix; (3) multiply the resulting matrix by the roots of unity; (4) transpose the resulting matrix; (5) perform one-dimensional FFTs on the resulting matrix; (6) transpose the resulting matrix. The communication among processors can be a bottleneck in the three matrix transpose steps (steps 1, 4 and 6). During a transpose step, a processor transposes a contiguous sub-matrix locally as well as a sub-matrix from every other processor, so the transpose requires communication among all processors. It is shown in [18] that the performance is mostly determined by the data latencies between processors. Our workload contains 64K points with 16 threads.
4.2 LU
LU decomposition/factorization factors a matrix into the product of a lower triangular matrix and an upper triangular matrix. It is used in numerical analysis to solve linear equations or to calculate the determinant; its main application fields include digital signal processing, wireless sensor networks and the simulation of electric field components. We select an LU decomposition kernel from [18]. This program is optimized to reduce IPC by using blocking: a dense n×n matrix M is divided into an N×N array of B×B blocks (n = N×B). The blocking method exploits temporal locality on the elements of individual sub-matrices. As shown in Figure 6, the diagonal block (DB) is decomposed first; the perimeter blocks (PB) are updated using DB information, and the interior blocks (IB) are updated using the corresponding PB information. The matrix blocks are assigned to processors (P1, P2, ...) using a 2D scatter decomposition. Importantly, since the computation of an IB involves a dense matrix multiplication of two blocks, the computation is performed by the processor that owns the block in order to reduce IPC.

Fig. 6. LU decomposition algorithm with blocking

Despite the optimization, communication between processors can still be a bottleneck. One situation is when the processors require a DB to update their own PBs: a copy of the DB is sent to all requesting processors by the processor that updates the DB. Another case is when processors require PBs to update their IBs: a copy of the PB is sent to all requesting processors by the processor that updates the PB [18]. The 16-thread workload used in our experiment has a 512×512 input matrix with 16×16-element blocks.
4.3 H.264
H.264 is the latest video stream coding standard, optimized for higher coding efficiency than previous standards. We select a data-parallel H.264 coding implementation from [19]. In this program, video stream data are distributed to processors, and multiple video streams can be processed simultaneously in data-parallel coding. The program is multithreaded with frame-level parallelization, achieving coarse-grained pipeline parallelism. It is noted in [20] that independent frame sequences are required to realize full frame-level parallelization; however, frames in H.264 depend on each other. Of the three frame types I, P and B [21], an I frame does not need any reference frame, a P frame refers to the previous P frame, and a B frame refers to the previous and next P frames.

Fig. 7. Dependency in a video sequence, with 2 B frames

Take the video sequence in Figure 7 for example: the first I frame refers to nothing, while the fourth frame (P) refers to the
first I frame and is referred to by the previous B frames (2nd and 3rd) and the next P frame (7th). Full parallelization of all frames is impossible due to the dependencies in the frame sequence. In the program, a thread T is generated for each frame (Figure 8). Previous frames must be completed before new frames can be coded, because motion prediction and compensation involve previous frames. Data dependency among threads is heavy because of the shared reads and writes; the shared data are deblocking pixels and reference frames. Moreover, since the program processes image data, the local cache of a processor is usually too small for the frame information, so data transfers from external memory to the local cache can be a bottleneck as well. In terms of IPC and external memory communication, H.264 is thus the toughest of the three applications. We select "simlarge" as our workload: a standard video clip from PARSEC, taken from an open-source movie [19], which models a high-motion chase scene.

Fig. 8. Frame-level parallelization of a video sequence
5 Experimental Evaluation

5.1 Experiment Setup
The simulation platform is based on a cycle-accurate NoC simulator which produces detailed evaluation results; it models the routers and links accurately. The state-of-the-art router in our platform includes a routing computation unit, a virtual channel allocator, a switch allocator, a crossbar switch and four input buffers. A deterministic routing algorithm is selected to avoid deadlocks. We use a 64-node network modeling a single-chip CMP for our experiments: a full system simulation environment of an 8×8 mesh with 64 nodes, each with a core and related caches, has been implemented (Figure 9). The 16 memory controllers are connected to two sides of the mesh network.

Fig. 9. An 8×8 mesh-based NoC with 16 memory controllers attached to the top and bottom sides

The simulations run the Solaris 9 operating system on the UltraSPARC III+ instruction set with an in-order issue structure. Each processor core runs at 2 GHz, is attached to a wormhole router, and has a private write-back L1 cache (split I+D, each 16KB, 4-way, 64-bit line, 3-cycle). The 64MB L2 cache shared by all processors is split into banks (64 banks, each 1MB, 64-bit line, 6-cycle). We set up a system with 4GB of main memory; the latency from the main memory to the L2 cache is 260 cycles. The simulated memory/cache architecture mimics SNUCA. A two-level
distributed directory cache coherence protocol, MOESI (based on MESI), has been implemented in our memory hierarchy, in which each L2 bank has its own directory. The protocol has five cache line states: Modified (M), Owned (O), Exclusive (E), Shared (S) and Invalid (I). We use the Simics [22] full system simulator as our simulation platform.
5.2 Result Analysis
We evaluate performance in terms of Average Network Latency (ANL), Average Link Utilization (ALU), Execution Time (ET) and Cache Hit Latency (CHL). ANL is the average number of cycles required for the transmission of all network messages; the cycle count of each message runs from the injection of the message header into the network at the source node to the reception of the tail flit at the destination node. ALU is defined as the number of flits transferred between NoC resources per cycle. Under the same configuration and workload, lower metric values are favorable.

The results are illustrated in Figures 10a, 10b and 10c for FFT, LU and H.264, respectively. Allocation F (Figure 4f, similarly hereinafter) outperforms the other strategies on average in all three applications. For example, for FFT the ANL of allocation F is 10.42% lower than that of allocation C, and 3.84% lower than that of allocation A. This is primarily due to the better ACT and AMT numbers of allocation F compared with the other allocations. As noted earlier, the transpose steps in FFT require communication among all processors (especially the last stage, see Figure 2); in this case, ACT plays the major role. Allocations B, C and D are clearly not favorable strategies here, since both values are high. The ACT of allocation E is very high compared with A and F (3.13 in E, 2.50 in A and 2.64 in F), which is why the ANL of E is worse than that of A and F. The ANL differences in LU are not as significant as in FFT, e.g. the ANL of allocation F is 7.02% lower than that of allocation C, and 1.84% lower than that of allocation A; the reason is that LU generates less network traffic than FFT. The larger ANL differences in H.264 reflect its higher demand on core-core and core-memory communication compared with FFT and LU.

Fig. 10. Normalized performance metrics (ANL, ALU, Execution Time, Cache Hit Latency) with different allocation strategies: (a) FFT, (b) LU, (c) H.264
The ALUs of FFT for allocations E and F are lower than for the other strategies as well, e.g. 12.69% and 12.05% lower than for allocation C, respectively. ALU is directly related to the average of ACT and AMT; however, as observed above, ALU is also affected by the traffic intensity of an application. In terms of ET, ACT again plays the major role: allocations A and F show the most promising performance, while the other strategies do not perform well. For instance, the ET of allocation F in the three applications is reduced by 1.69%, 0.67% and 2.87% compared with allocation A, respectively. CHL is more closely related to ACT: allocations A and F have lower CHLs, while allocations with high ACT (B, C, D and E) have much higher values. We note that ACT is more important than AMT in most cases, because most multithreaded applications today are still optimized for IPC. An extreme application could benefit more from closer memory controllers, e.g. one with many threads constantly sending memory requests to the controllers and no communication between threads. We also note that allocations A and F provide better performance than the other allocations in most cases. Considering the four metrics, on average, allocation F improves performance by 8.23%, 4.81% and 10.21% in FFT, LU and H.264, respectively, compared with the other allocations.
6 Conclusion and Future Work
In this paper, we studied the problem of process scheduling for multicore processors. A NoC-based model of the multicore processor was defined, and process scheduling was analyzed in terms of IPC and memory access within this model. An algorithm was proposed to minimize overall on-chip communication latencies and improve performance. The results show that different scheduling strategies have a strong impact on system performance and give a guideline for designing future CMP schedulers. Our next step is to analyze and compare the weights of the average core access time and the average memory access time. The trade-off of finding the best allocation strategy will be studied, and we will also evaluate the impact of memory controller placement on scheduling.
References

1. Benini, L., De Micheli, G.: Networks on chips: a new SoC paradigm. IEEE Computer 35(1), 70–78 (2002)
2. Intel: Single-chip Cloud Computer (May 2010), http://techresearch.intel.com/articles/Tera-Scale/1826.htm
3. Tilera Corporation (August 2010), http://www.tilera.com
4. Leutenegger, S.T., Vernon, M.K.: The performance of multiprogrammed multiprocessor scheduling algorithms. In: Proc. of the 1990 ACM SIGMETRICS, pp. 226–236 (1990)
5. Hakem, M., Butelle, F.: Dynamic critical path scheduling parallel programs onto multiprocessors. In: Proc. of the 19th IEEE IPDPS, p. 203b (2005)
6. Sharma, D.D., Pradhan, D.K.: Processor allocation in hypercube multicomputers: fast and efficient strategies for cubic and noncubic allocation. IEEE TPDS 6(10), 1108–1123 (1995)
7. Laudon, J., Lenoski, D.: The SGI Origin: a ccNUMA highly scalable server. In: Proc. of the 24th ISCA, pp. 241–251 (June 1997)
8. Abts, D., Enright Jerger, N.D., Kim, J., Gibson, D., Lipasti, M.H.: Achieving predictable performance through better memory controller placement in many-core CMPs. In: Proc. of the 36th ISCA (2009)
9. Chen, Y.J., Yang, C.L., Chang, Y.S.: An architectural co-synthesis algorithm for energy-aware network-on-chip design. J. Syst. Archit. 55(5-6), 299–309 (2009)
10. Hu, J., Marculescu, R.: Energy-aware communication and task scheduling for network-on-chip architectures under real-time constraints. In: DATE 2004 (2004)
11. Lei, T., Kumar, S.: A two-step genetic algorithm for mapping task graphs to a network on chip architecture. In: DSD, pp. 180–187 (September 2003)
12. HiTech Global: DDR2 memory controller IP core for FPGA and ASIC (June 2010), http://www.hitechglobal.com/ipcores/ddr2controller.htm
13. Kim, Y., Han, D., Mutlu, O., Harchol-Balter, M.: ATLAS: a scalable and high-performance scheduling algorithm for multiple memory controllers. In: 2010 IEEE 16th HPCA, pp. 1–12 (2010)
14. Awasthi, M., Nellans, D.W., Sudan, K., Balasubramonian, R., Davis, A.: Handling the problems and opportunities posed by multiple on-chip memory controllers. In: Proc. of the 19th PACT, pp. 319–330. ACM, New York (2010)
15. Gaeke, B.R., Husbands, P., Li, X.S., Oliker, L., Yelick, K.A., Biswas, R.: Memory-intensive benchmarks: IRAM vs. cache-based machines. In: Proc. of the 16th IPDPS (2002)
16. Schmid, P., Roos, A.: Core i7 memory scaling: from DDR3-800 to DDR3-1600. Tom's Hardware (2009)
17. Bailey, D.H.: FFTs in external or hierarchical memory. The Journal of Supercomputing 4, 23–35 (1990), doi:10.1007/BF00162341
18. Woo, S.C., Singh, J.P., Hennessy, J.L.: The performance advantages of integrating block data transfer in cache-coherent multiprocessors. In: ASPLOS-VI, pp. 219–229. ACM, New York (1994)
19. Bienia, C., Kumar, S., Singh, J.P., Li, K.: The PARSEC benchmark suite: characterization and architectural implications. In: Proc. of the 17th PACT (October 2008)
20. Xu, T., Yin, A., Liljeberg, P., Tenhunen, H.: A study of 3D network-on-chip design for data parallel H.264 coding. In: NORCHIP, pp. 1–6 (November 2009)
21. Pereira, F.C., Ebrahimi, T.: The MPEG-4 Book. Prentice Hall, Englewood Cliffs (2002)
22. Magnusson, P., Christensson, M., Eskilson, J., Forsgren, D., Hallberg, G., Hogberg, J., Larsson, F., Moestedt, A., Werner, B.: Simics: a full system simulation platform. Computer 35(2), 50–58 (2002)
Fast Software Implementation of AES-CCM on Multiprocessors

Jung Ho Yoo

SAMSUNG THALES Co. Ltd., 259, Gongdan-Dong, Gumi-City, 730-904, South Korea
[email protected]
Abstract. This paper presents a novel software implementation of AES-CCM (Advanced Encryption Standard - Counter mode with Cipher Block Chaining Message Authentication Code) for multiprocessors. The software includes AES key expansion for dual processors and cipher/inverse cipher for dual/quad processors. Measured on a Xilinx MicroBlaze multiprocessor platform, the speedup of our AES key expansion, cipher and inverse cipher is up to 1.7, 2.6 and 2.6 times, respectively. Using the new software implementation of AES, AES-CCM for IEEE 802.11i is implemented on octet (eight) MicroBlaze processors. The fast software implementation of AES-CCM for multiprocessors is up to 3.6 times faster than the implementation for a single processor.

Keywords: AES, AES-CCM, CCMP, Multiprocessor Implementation.
1 Introduction
The growing demand for wired and wireless communication increases the use of cryptographic algorithms in embedded systems. The data rate of current communication systems is relatively high, so the throughput of the cryptographic algorithm must also be high. AES [1] is a widely used cipher in digital communication systems such as Wireless LAN, WiMedia UWB, ZigBee, Secure Socket Layer (SSL) and Internet Protocol Security (IPsec). CCMP (CTR with CBC-MAC Protocol) is a security protocol that provides confidentiality, authentication, integrity, and replay protection for information exchanged over unsecured WLAN. CCMP is based on AES-CCM [2], which combines AES CTR (Counter mode) for confidentiality and AES CBC-MAC (Cipher Block Chaining-Message Authentication Code mode) for integrity. AES-CCM can be implemented fully in software, in software with hardware acceleration through instruction set extensions, or fully in hardware. The software implementation has advantages in terms of flexibility and adaptability, e.g. for SDR (Software Defined Radio). In addition, the rapid growth of MPSoC (Multi-Processor System on Chip) helps increase the execution speed of embedded software. In this paper, we propose a novel software implementation of AES key expansion and cipher/inverse cipher for multiprocessors. The proposed AES implementations are intended for dual/quad processors, and we have verified their
performance on MicroBlaze multiprocessors. On the basis of the AES performance results, we built an AES-CCM/CCMP system for high-speed WLAN [3] and show the fast performance of the multiprocessor implementation. This paper is organized as follows. Section 2 briefly summarizes AES-CCM/CCMP. Section 3 reviews related work on software implementations of AES. Section 4 proposes the software implementation of AES for multiprocessors. Section 5 explains the system architecture of AES-CCM for multiprocessors. Section 6 gives the implementation results and their analysis. Finally, Section 7 concludes the paper.
2 CCMP
CCMP [2][3] uses AES with a 128-bit key (AES-128), out of the possible 128-, 192- and 256-bit key lengths. CCMP encapsulation encrypts with AES-CCM and encapsulates the ciphered MPDU frame body, the CCMP header and the MIC value. CCM encryption (AES-CCM) encrypts the frame body of the MPDU for confidentiality and generates a MIC (Message Integrity Code) for integrity. The frame body of the MPDU is encrypted 128 bits at a time by AES CTR. MIC generation, however, is computed sequentially: under AES CBC-MAC, the XOR of the previous 128-bit frame-body block with the AES output is the input of the next AES cipher, and that AES output is in turn XORed with the next 128-bit frame-body block. When the receiver gets a CCMP-encapsulated MPDU, CCMP decapsulation decrypts it with AES-CCM and decapsulates the ciphered MPDU frame body, CCMP header and MIC value. After decrypting the received MPDU with AES-CCM, CCMP decapsulation checks the integrity by comparing the calculated MIC with the MIC from the received encrypted MPDU. For high-speed WLAN [3], the frame body of the plaintext MPDU, which is the input of CCMP encapsulation, can be up to 7,919 bytes. The 128-bit AES key is the same for the AES CTR and AES CBC-MAC of one MPDU; that is, all AES ciphers in AES-CCM use the same key for the MPDU.
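The contrast between the two modes is the crux of the multiprocessor design, so the C sketch below (ours) shows their data dependences. aes_encrypt_block() stands in for any AES-128 cipher, and the CCM-specific block formatting (B_0, the exact counter layout, padding of the last block) is omitted for brevity, so this is a structural illustration rather than a spec-complete CCM.

#include <stdint.h>
#include <string.h>

void aes_encrypt_block(const uint8_t key[16], const uint8_t in[16],
                       uint8_t out[16]);   /* any AES-128 implementation */

/* CBC-MAC pass: inherently sequential, block i depends on block i-1 */
void cbc_mac(const uint8_t key[16], const uint8_t *body,
             size_t nblocks, uint8_t mic[16])
{
    uint8_t x[16] = {0};          /* full CCM would start from E(key, B_0) */
    for (size_t i = 0; i < nblocks; i++) {
        for (int b = 0; b < 16; b++)
            x[b] ^= body[16 * i + b];   /* XOR next frame-body block in    */
        aes_encrypt_block(key, x, x);   /* then encrypt the chaining value */
    }
    memcpy(mic, x, 16);
}

/* CTR pass: every block is independent, so blocks can be spread over cores */
void ctr_encrypt(const uint8_t key[16], uint8_t *body,
                 size_t nblocks, const uint8_t ctr0[16])
{
    uint8_t ctr[16], ks[16];
    for (size_t i = 0; i < nblocks; i++) {
        memcpy(ctr, ctr0, 16);
        ctr[15] = (uint8_t)(ctr0[15] + i + 1);  /* toy increment, <255 blocks */
        aes_encrypt_block(key, ctr, ks);
        for (int b = 0; b < 16; b++)
            body[16 * i + b] ^= ks[b];
    }
}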
3 Related Works
Implementation of AES has been studied in three major areas: fully hardware design, software-hardware co-design, and fully software design. In software-hardware co-design, SubBytes or MixColumns is performed by hardware through extended special instructions, while the other transforms are done by general instructions [7][8][9][10][16]. Recently, Intel introduced the AES-NI instructions to support software-hardware based AES [16]. However, the expense of instruction set extension includes modifying the instruction set, the processor architecture, and the compiler. For fully software design, AES implementations target either a single processor or multiprocessors. On a single processor, T-lookup [4][14] combines SubBytes/Inverse SubBytes and MixColumns/Inverse MixColumns into 1, 4, or 8-Kbyte lookup tables. Bertoni's transposed implementation [5] uses a transposed state that
performs MixColumns/Inverse MixColumns by a multi-stage 'double and add' algorithm. Bertoni's work has a smaller lookup table and a faster inverse cipher on ARM, ST22 and Pentium than T-lookup. Atasu [6] studied efficient AES for ARM by exploiting the ARM instruction set. However, T-lookup is much faster than the transposed implementation for encryption [5]. On multiprocessors, Huerta et al. [11] implemented AES purely in software on MicroBlaze-based multiprocessors, using dual, quad and octet processors to encrypt/decrypt 1024 bytes. The disadvantages of that implementation are the redundant key expansion on every processor and the independent 128-bit cipher/inverse cipher per processor, which is applicable only to AES CTR, not to AES CBC-MAC. In other words, that software implementation is inefficient and not suitable for CCMP, because CCMP uses the same 128-bit key for one MPDU and, for AES CBC-MAC, the AES work of one block must be divided among cooperating processors at a granularity below 128 bits. In addition, the instruction-level parallelism in AES is analyzed in [15]. Our AES implementation is a fully software design on a multiprocessor platform: dual or quad processors perform the AES function. T-lookup and the transposed implementation are adopted in our multiprocessor implementation, because they are known as the fastest fully software implementations among previous studies.

Proposed implementation of a round of key expansion for T-lookup by dual processors
KeyExpansion_on_Processor1()
  while (i < Nb * (Nr+1))
    /* Expand seed */
    seed = SubWord(RotWord(seed)) xor Rcon[i/Nk]
    w0_xor_w1_xor_w2_xor_w3 = getword_from_processor2()
    putword_to_processor2(seed)                     /* send seed to processor 2 */
    seed = w0_xor_w1_xor_w2_xor_w3 xor seed
    /* Update */
    i = i + Nk
  end while

KeyExpansion_on_Processor2()
  while (i < Nb * (Nr+1))
    k = i - Nk
    w0_xor_w1 = w[k] xor w[k+1]
    w0_xor_w1_xor_w2 = w0_xor_w1 xor w[k+2]
    w0_xor_w1_xor_w2_xor_w3 = w0_xor_w1_xor_w2 xor w[k+3]
    putword_to_processor1(w0_xor_w1_xor_w2_xor_w3)  /* send to processor 1 */
    seed = getword_from_processor1()                /* receive the seed */
    w[i]   = w[k] xor seed                          /* 1st */
    w[i+1] = w0_xor_w1 xor seed                     /* 2nd */
    w[i+2] = w0_xor_w1_xor_w2 xor seed              /* 3rd */
    w[i+3] = w0_xor_w1_xor_w2_xor_w3 xor seed       /* 4th */
    i = i + Nk
  end while
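For reference, the seed computed by processor 1 above corresponds to the standard FIPS-197 step temp = SubWord(RotWord(temp)) xor Rcon[i/Nk]; a C rendering (ours, not the paper's code) is shown below, where Sbox[] is the FIPS-197 S-box table, not reproduced here.

#include <stdint.h>

extern const uint8_t Sbox[256];      /* FIPS-197 S-box */

static const uint32_t Rcon[10] = {   /* round constants, MSB-aligned */
    0x01000000, 0x02000000, 0x04000000, 0x08000000, 0x10000000,
    0x20000000, 0x40000000, 0x80000000, 0x1b000000, 0x36000000
};

static uint32_t next_seed(uint32_t prev, int round)   /* round = 0..Nr-1 */
{
    uint32_t w = (prev << 8) | (prev >> 24);          /* RotWord */
    w = ((uint32_t)Sbox[(w >> 24) & 0xff] << 24) |    /* SubWord */
        ((uint32_t)Sbox[(w >> 16) & 0xff] << 16) |
        ((uint32_t)Sbox[(w >>  8) & 0xff] <<  8) |
         (uint32_t)Sbox[ w        & 0xff];
    return w ^ Rcon[round];
}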
4 Software Implementation of AES for Multiprocessors
The AES software consists of the key expansion and the cipher or inverse cipher algorithm. For each part, a suitable multiprocessing implementation is required for dual or quad processors.

4.1 Key Expansion
The key expansion generates a 1408-bit (= 128 × 11) expanded key from a 128-bit input key (Nk = 4: the number of 32-bit words in the key). Each round produces 128 bits of expanded key: 1280 bits come from the 10 rounds (Nr = 10: the number of rounds) and 128 bits come from the input key, for a total of 1408 bits. In the function KeyExpansion_on_Processor1, a round generates the 32-bit seed variable with the RotWord, SubWord, Rcon and XOR functions: RotWord is a cyclic permutation, SubWord is a substitution using the S-box, and Rcon is a constant word array. In the function KeyExpansion_on_Processor2, the four 32-bit key words w[i] to w[i+3] are expanded from the seed and the previous round's w[k], w0_xor_w1, w0_xor_w1_xor_w2 and w0_xor_w1_xor_w2_xor_w3 variables. w[] is four bytes (one word) of the state; a state is 128 bits (Nb = 4: the number of columns in the state).

When the key expansion is computed by dual processors, the functions KeyExpansion_on_Processor1() and KeyExpansion_on_Processor2() run in parallel as shown above. The 'seed' variable calculated on processor 1 and sent to processor 2 is the seed for expanding the four-word key: processor 1 generates 'seed', and processor 2 expands the four-word key from it. While processor 1 calculates the seed, processor 2 prepares the w0_xor_w1, w0_xor_w1_xor_w2 and w0_xor_w1_xor_w2_xor_w3 variables to minimize the time needed to expand the key from the seed. While processor 1 performs RotWord, SubWord and Rcon, processor 2 does the XOR operations with the expanded key of the previous round and stores the expanded key. This similar workload on processors 1 and 2 keeps the proposed algorithm balanced between processors. During one round on dual processors, the seed and the w0_xor_w1_xor_w2_xor_w3 variable are exchanged. For the inverse-cipher key expansion under the T-lookup implementation, the processors perform an additional substitution for Inverse MixColumns after computing the expanded key.

For the transposed key expansion, the program code below shows the algorithm for dual processors (the transposed key expansion and cipher/inverse cipher for a single processor are in [5]). In contrast with the first key expansion, the transposed key expansion is symmetric between the two processors. Processor 1 performs SubBytes, Rcon and XOR with the right-shifted expanded key of the previous round; processor 2 performs SubBytes and XOR with the expanded key of the previous round. Because both processors substitute with the S-box (the substitution table defined in FIPS-197), each holds its own copy of the S-box. The resulting expanded key is located in the memory of processor 1.
Proposed implementation of key expansion for transposed by dual processors
KeyExpansion_Transpose_on_Processor1()
  while (i < Nb * (Nr+1))
    k = i - Nk
    w[i+2] = w[k+2] xor ShiftL24(pad(Sbox(Bytes(w[k+3]))))
    w[i+3] = w[k+3] xor ShiftL24(pad(Sbox(Bytes(w[k]))))
    w[i+2] = w[i+2] xor ShiftR8(w[i+2]) xor ShiftR16(w[i+2]) xor ShiftR24(w[i+2])
    w[i+3] = w[i+3] xor ShiftR8(w[i+3]) xor ShiftR16(w[i+3]) xor ShiftR24(w[i+3])
    w[i] = getword_from_processor2()        /* Receive 1st expanded word */
    putword_to_processor2(w[i+2, i+3])      /* Send 3rd, 4th */
    i = i + Nk
  end while

KeyExpansion_Transpose_on_Processor2()
  while (i < Nb * (Nr+1))
    k = i - Nk
    w[i]   = w[k] xor ShiftL24(pad(Sbox(Bytes(w[k+1])))) xor Rcon[i/Nk]
    w[i+1] = w[k+1] xor ShiftL24(pad(Sbox(Bytes(w[k+2]))))
    w[i]   = w[i] xor ShiftR8(w[i]) xor ShiftR16(w[i]) xor ShiftR24(w[i])
    w[i+1] = w[i+1] xor ShiftR8(w[i+1]) xor ShiftR16(w[i+1]) xor ShiftR24(w[i+1])
    putword_to_processor1(w[i])             /* Send 1st expanded word */
    w[i+2, i+3] = getword_from_processor1() /* Receive 3rd, 4th */
    i = i + Nk
  end while
4.2 Cipher and Inverse Cipher
There are two methods to implement the AES cipher and inverse cipher on multiprocessors. In the first, one processor handles one whole AES block (128 bits) [11]. In the second, each processor takes only part of one AES block, and the combined result of n processors forms one AES block (128 bits) [1]. The first method has low inter-processor overhead, but it is not applicable to AES CBC-MAC, whose structure is a recursive AES cipher; AES CTR, on the other hand, consists of a large number of independent AES ciphers. To increase the speed of AES-CCM, not only AES CTR but also AES CBC-MAC must be computed by multiprocessors. Consequently, one AES block (128 bits) must be divided into multiple tasks so that AES CBC-MAC can be computed by multiprocessors. The state can be divided by row, by column, or by byte; the T-lookup implementation operates on four 32-bit column variables and the transposed method on four 32-bit row variables.
Proposed implementation of a round in T-lookup cipher by dual processors
Cipher_Tlookup_on_Processor1()
  for round = 1 step 1 to Nr-1   /* Round */
    /* SubBytes, ShiftRows, MixColumns by Tnbox, ShiftR */
    state[0] = T1box(ShiftR24(state[0])) xor    /* column 0 */
               T2box(ShiftR16(state[1])) xor
               T3box(ShiftR8 (state[2])) xor
               T4box(         state[3])
    state[1] = T1box(ShiftR24(state[1])) xor    /* column 1 */
               T2box(ShiftR16(state[2])) xor
               T3box(ShiftR8 (state[3])) xor
               T4box(         state[0])
    AddRoundKey_column01(state[0,1])
    putword_to_processor2(state[0, 1])
    state[2, 3] = getword_from_processor2()
  end for

Cipher_Tlookup_on_Processor2()
  for round = 1 step 1 to Nr-1   /* Round */
    /* SubBytes, ShiftRows, MixColumns by Tnbox, ShiftR */
    state[2] = T1box(ShiftR24(state[2])) xor    /* column 2 */
               T2box(ShiftR16(state[3])) xor
               T3box(ShiftR8 (state[0])) xor
               T4box(         state[1])
    state[3] = T1box(ShiftR24(state[3])) xor    /* column 3 */
               T2box(ShiftR16(state[0])) xor
               T3box(ShiftR8 (state[1])) xor
               T4box(         state[2])
    AddRoundKey_column23(state[2,3])
    putword_to_processor1(state[2, 3])
    state[0, 1] = getword_from_processor1()
  end for

While the T-lookup implementation normally runs on a single processor, the proposed implementation in the Cipher_Tlookup code above uses dual processors handling columns 0, 1 and 2, 3, respectively. Both processors hold the 8KB T-lookup tables (T1box, T2box, T3box, T4box). After each round transformation, all columns are mutually exchanged for the ShiftRows transformation of the next round. When four processors are used with T-lookup, each processor performs the round transform on one column and exchanges parts of the other three columns with the other processors. For the transposed implementation, the proposed Cipher_Transpose code below assigns rows 0, 1 and 2, 3 to processors 1 and 2, respectively. The MixColumns step requires data exchange between the two processors during the computation, but it needs only the S-box table. At the expense of the small lookup table, each processor reads 128 bits from the other and writes 128 bits to the other. Thus, the multiprocessor implementation of the transposed state needs more data exchanges than the T-lookup one, which reduces the efficiency of multiprocessing. Similarly, four processors can cooperate on one AES cipher by each taking one row in the transposed implementation.
Proposed implementation of a round in transposed cipher by dual processors
Cipher_Transpose_on_Processor1()
  for round = 1 step 1 to Nr-1
    SubBytes(state[0,1])
    ShiftRows_Transpose_column01(state[0,1])
    /* MixColumns */
    temp = state[0] xor state[1]
    putword_to_processor2(temp)
    temp = getword_from_processor2()
    y[0] = state[1] xor temp
    y[1] = state[0] xor temp
    state[0] = {02} mul state[0]
    state[1] = {02} mul state[1]
    putword_to_processor2(state[0])
    state[2] = getword_from_processor2()
    y[0] = y[0] xor state[0] xor state[1]
    y[1] = y[1] xor state[1] xor state[2]
    state = y
    AddRoundKey(state[0,1], w[round*Nb, round*Nb+1])
  end for

Cipher_Transpose_on_Processor2()
  for round = 1 step 1 to Nr-1
    SubBytes(state[2,3])
    ShiftRows_Transpose_column23(state[2,3])
    /* MixColumns */
    temp = state[2] xor state[3]
    putword_to_processor1(temp)
    temp = getword_from_processor1()
    y[2] = temp xor state[3]
    y[3] = temp xor state[2]
    state[2] = {02} mul state[2]
    state[3] = {02} mul state[3]
    putword_to_processor1(state[2])
    state[0] = getword_from_processor1()
    y[2] = y[2] xor state[2] xor state[3]
    y[3] = y[3] xor state[3] xor state[0]
    state = y
    AddRoundKey(state[2,3], w[round*Nb+2, round*Nb+3])
  end for
5 System Architecture of AES-CCM
To verify our software implementation, a Xilinx MicroBlaze-based multiprocessor system is built on a Xilinx Virtex-II Pro FPGA (XC2VP100-FF1704-6). The MicroBlaze is a soft-core 32-bit RISC processor by Xilinx. A MicroBlaze can be connected to another MicroBlaze through a dedicated channel (Fast Simplex Link, FSL) or a shared bus (On-chip Peripheral Bus, OPB, or Processor Local Bus, PLB); FSL is suitable for
a small amount of data exchange, while OPB or PLB suits large amounts. The input and output MPDUs for CCMP encapsulation and decapsulation are exchanged through a dual-port RAM on the OPB bus. To exchange data during an AES round for AES CBC-MAC, or the 128-bit input/output for AES CTR, 32-bit FSL channels are used. A minimal processing element consists of one MicroBlaze processor (operating at 100 MHz), RAM, and peripherals, as in Fig. 1; the instructions and data are located in the dual-port RAM. This element communicates with other processing elements through the FSL or OPB buses.
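A minimal sketch of the word-exchange primitives over FSL is shown below in C; it assumes the blocking putfsl/getfsl macros from Xilinx's mb_interface.h, and the function names simply mirror the putword/getword calls used in the listings above.

#include <mb_interface.h>

static inline void putword_to_peer(unsigned int w)
{
    putfsl(w, 0);        /* blocking write to FSL channel 0 */
}

static inline unsigned int getword_from_peer(void)
{
    unsigned int w;
    getfsl(w, 0);        /* blocking read from FSL channel 0 */
    return w;
}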
Fig. 1. Block diagram of a processing element

Table 1. Comparison of resource utilization of the multiprocessor platform

# of MicroBlaze Processors   Slice Flip Flops   4-input LUTs   Block RAMs
Single                       1,522              1,888          32
Dual                         3,028              3,876          64
Quad                         4,127              6,382          128
Octet                        7,067              11,890         288
To maximize the processing speed of AES-CCM, octet MicroBlaze processors are employed, as in Fig. 2. Group A undertakes AES CTR by calculating independent AES ciphers [11], and group B performs AES CBC-MAC by calculating AES cooperatively with the proposed key expansion and cipher code above. Table 1 shows the resource utilization of the system in terms of Slice Flip-Flops, LUTs, and Block RAMs.
6 Results and Analysis
We implemented the key expansion and the 128-bit cipher/inverse cipher of one AES for the T-lookup and transposed implementations to compare processing speeds. The system architecture for this measurement is Group A in Fig. 2, using the test vectors in FIPS-197. An internal timer connected to the MicroBlaze processor through the OPB bus measures the processing time.
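A measurement loop of this kind might look like the hedged C sketch below, using the XTmrCtr driver that Xilinx ships for its timer core; the device ID and the function under test are placeholders, not names from the paper.

#include "xtmrctr.h"

#define TIMER_DEVICE_ID 0              /* placeholder device ID */

extern void aes_cipher_under_test(void);

static u32 measure_cycles(void)
{
    XTmrCtr timer;
    XTmrCtr_Initialize(&timer, TIMER_DEVICE_ID);
    XTmrCtr_Reset(&timer, 0);
    XTmrCtr_Start(&timer, 0);
    aes_cipher_under_test();           /* e.g. one 128-bit cipher */
    return XTmrCtr_GetValue(&timer, 0);  /* elapsed timer ticks */
}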
Fig. 2. Block diagram of the system architecture for AES-CCM

Table 2. Data exchange and cycles for key expansion

                              Data Exchange [bits]              # of Cycles
# of MicroBlaze Processors    Key Exp.   Inverse Key Exp.   Key Exp.   Inverse Key Exp.
Transposed  Single [5]        -          -                  642        622
Transposed  Dual (Ours)       848        848                405        405
T-lookup    Single [14]       -          -                  411        1,257
T-lookup    Dual (Ours)       768        1,920              274        726
From Table 2, the 1.7× speedup of the key expansion for the inverse cipher by T-lookup is the highest, because this variant needs the table substitution for Inverse MixColumns after computing the expanded key. The data exchange, including the reads/writes per processor, is 1,920 bits for the T-lookup inverse-cipher key expansion: about 2.5 times that of the T-lookup key expansion for the cipher, because Inverse MixColumns is applied to the expanded key.

Table 3. Data exchange and cycles for cipher/inverse cipher

                              Data Exchange [bits]          # of Cycles
# of MicroBlaze Processors    Cipher    Inverse Cipher    Cipher    Inverse Cipher
Transposed  Single [5]        -         -                 2,087     2,959
Transposed  Dual (Ours)       1,280     2,144             1,102     1,601
Transposed  Quad (Ours)       2,688     4,224             695       1,201
T-lookup    Single [14]       -         -                 823       822
T-lookup    Dual (Ours)       768       768               469       469
T-lookup    Quad (Ours)       912       912               312       312
In Table 3, the inverse cipher of the transposed implementation needs more data exchange than the cipher, because the Inverse MixColumns of the transposed implementation has more calculation stages than its MixColumns. For the quad-processor implementations, the proposed cipher is divided further, down to one row or column per processor. Among the various implementations, the T-lookup cipher and inverse cipher on quad processors show the fastest performance, a 2.6× speedup over the T-lookup single processor. Compared with other AES-related studies in Table 4, our proposal has the highest speed among fully software implementations, even faster than the SW-HW co-design of [8]. In terms of resources (Table 4), our system uses more Slice F/Fs, LUTs and Block RAMs than [14], in order to build the quad MicroBlaze cores. Thus, AES CBC-MAC uses this T-lookup quad-core implementation to obtain the best results of Table 3 and the Cipher_Tlookup code, while AES CTR is implemented with the independent T-lookup 8KB implementation on each processor.

Table 4. Comparison of required cycles for a 128-bit AES cipher
Platform                               Implementation   Resources (Counts)                                 Cycles
MicroBlaze [10]                        SW               Slice (1,321), Block RAM (74)                      7,076
ARM7TDMI [5]                           SW               ASIC                                               2,309
ARM9TDMI [5]                           SW               ASIC                                               1,883
MicroBlaze [14] (single in Table 1)    SW               Slice F/F (1,522), LUT (1,888), Block RAM (32)     1,234
Leon2 [8]                              SW-HW            ASIC                                               512
Ours - Quad Core (quad in Table 1)     SW               Slice F/F (4,127), LUT (6,382), Block RAM (128)    312
MicroBlaze [10]                        SW-HW            Slice (4,997)                                      202
Leon2 [9]                              SW-HW            ASIC                                               196
Intel Westmere [16]                    SW-HW            ASIC                                               22.08 (CTR)
AES-CCM for CCMP runs on the octet-core platform shown in Fig. 2, and its performance is shown in Table 5. CCMP encapsulation achieves a high speedup over the single processor (3.6× at an MPDU length of 7,919 bytes). This factor is higher than the speedup of one AES on T-lookup quad processors, but lower than the number of processors used, because the bottleneck of CCMP lies in AES CBC-MAC, which cannot be divided into independent AES operations as in Huerta's algorithm. In spite of this weakness of the AES CBC-MAC algorithm, our multiprocessor key expansion and cipher implementation plays an important role in boosting the processing speed of CCMP. The multiprocessor's performance gain over the single processor demonstrates our software's superior processing speed.
Table 5. Comparison of required cycles for CCMP encapsulation

Platform                                Implementation   Cycles per byte (Length)
ASIC - 72 MHz [7]                       HW (CCMP)        0.722 (2,000 bytes)
MicroBlaze [14] (single in Table 1)     SW (CCMP)        107.7840 (7,919 bytes)
Ours - Octet Core (octet in Table 1)    SW (CCMP)        29.5113 (7,919 bytes)
Pentium IV 3 GHz [12]                   SW (AES-CCM)     9048.0103 (2,312 bytes)
64-bit coprocessor [13]                 SW (CCMP)        2712.0113 (128 Kbytes)
As Table 5 shows, the proposed software boosts the CCMP processing speed through multiprocessing. Even though our software on the FPGA is a little slower than a hardware design, the software design has advantages in terms of cost, time to market, flexibility and adaptability for SDR. Additionally, the proposed software can be ported to various multiprocessors, such as Intel dual/quad cores or the ARM Cortex-A9 MPCore, with small modifications of the processor-dependent data exchange functions. In short, our CCMP system can meet a 65 Mbps data rate if the MicroBlaze runs at over 235 MHz, e.g. on a Xilinx Virtex-7 FPGA.
7 Conclusion
In this paper, we propose novel AES software for multiprocessors based on the T-lookup and transposed implementations. The software uses dual cores for key expansion and dual or quad cores for the cipher, and its performance is measured on Xilinx MicroBlaze processors. From the results in Tables 2 and 3, we conclude that T-lookup on quad processors is the fastest implementation. With our quad-core ciphering implementation, a multicore system with more than four cores can build a scalable high-speed AES cipher by grouping quad cores into one ciphering function. The proposed algorithms are suitable for embedded systems in which hardware acceleration or new instructions are hard to add. In future work, the proposed software can be tested on Intel multicore processors without AES-NI instructions and adapted to other block cipher algorithms similar to AES.
References

1. Daemen, J., Rijmen, V.: The Design of Rijndael. Springer, Heidelberg (2002)
2. IEEE 802.11-2007: IEEE Standard for Information technology - Telecommunications and information exchange between systems - Local and metropolitan area networks - Specific requirements - Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications (2007)
3. IEEE 802.11n-2009: IEEE Standard for Information technology - Telecommunications and information exchange between systems - Local and metropolitan area networks - Specific requirements - Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications, Amendment 5: Enhancements for Higher Throughput (2009)
4. Gladman, B.: Cryptographic Technology Interests, http://www.gladman.me.uk/
5. Bertoni, G., Breveglieri, L., Fragneto, P., Macchetti, M., Marchesin, S.: Efficient Software Implementation of AES on 32-Bit Platforms. In: Kaliski Jr., B.S., Koç, Ç.K., Paar, C. (eds.) CHES 2002. LNCS, vol. 2523, pp. 159–171. Springer, Heidelberg (2003)
6. Atasu, K., Breveglieri, L., Macchetti, M.: Efficient AES Implementations for ARM Based Platforms. In: SAC 2004, ACM Symposium on Applied Computing, vol. 1, pp. 841–845 (2004)
7. Mitsuyama, Y., Kimura, M., Onoye, T., Shirakawa, I.: Architecture of IEEE 802.11i Cipher Algorithms for Embedded Systems. IEICE Transactions on Fundamentals E88-A(4), 899–906 (2005)
8. Tillich, S., Großschädl, J.: Instruction Set Extensions for Efficient AES Implementation on 32-bit Processors. In: Goubin, L., Matsui, M. (eds.) CHES 2006. LNCS, vol. 4249, pp. 270–284. Springer, Heidelberg (2006)
9. Elbirt, A.J.: Fast and Efficient Implementation of AES via Instruction Set Extensions. In: AINAW 2007, Proc. of the 21st International Conference on Advanced Information Networking and Applications Workshops, vol. 1 (2007)
10. Gonzalez, I., Gomez-Arribas, F.J.: Ciphering algorithms in MicroBlaze-based embedded systems. IEE Proceedings - Computers and Digital Techniques 153(2), 87–92 (2006)
11. Huerta, P., Castillo, J., Mártinez, J.I., López, V.: A MicroBlaze Based Multiprocessor SoC. WSEAS Transactions on Circuits and Systems, 423–430 (2005)
12. León, M., Aldeco, R., Merino, S.: Performance Analysis of the Confidentiality Security Service in the IEEE 802.11 using WEP, AES-CCM, and ECC. In: 2nd International Conference on Electrical and Electronics Engineering (2005)
13. VOCAL Technologies, http://www.vocal.com
14. Anescu, G.: A C++ Implementation of the Rijndael Encryption/Decryption Method (2002), http://www.codeproject.com/KB/security/aes.aspx
15. Clapp, C.S.K.: Instruction-level Parallelism in AES Candidates. In: The Second AES Candidate Conference (1999)
16. Gueron, S.: White Paper - Intel Advanced Encryption Standard (AES) Instructions Set. Intel (2010)
A TCM-Enabled Access Control Scheme

Gongxuan Zhang*, Zhaomeng Zhu, Pingli Wang, and Bin Song

School of Computer Science & Technology, Nanjing University of Science & Technology, 210094 Nanjing, China
{gongxuan,face601,pingli_w2000}@mail.njust.edu.cn, [email protected]
Abstract. A Trusted Cryptography Supporting Platform is a computer platform with highly dependable and available software and hardware, within which the security mechanism is reliable and robust because encryption/decryption and authentication techniques are adopted on top of the operating system, based on a trusted cryptography module in a chip or on an ARM board. The USB disk is a popular, flexible, removable storage device, but it also brings new information security risks. In this paper, a TCM (Trusted Cryptography Module)-enabled transparent file encryption/decryption strategy is proposed, in which a Minifilter driver subroutine is programmed under Microsoft's latest Minifilter framework so that files on a USB disk can be transparently encrypted or decrypted. With the TSM/SDK (TCM Service Module/Software Development Kit), the file encryption/decryption procedures are better kept safe by invoking TCM's hash component, random function component and encryption/decryption component. Hence, the removable storage's data (files) enjoy high security, because TCM is an individual hardware module, the encryption/decryption operations run within TCM, and the key is stored in TCM.

Keywords: Removable storage, TCM, Encryption/Decryption, Minifilter framework.
1 Introduction
With removable storages like USB disks widely used for information storage, flexibility comes with many new security risks, because such devices are easily stolen or lost together with the important information they hold. In high-security sites in particular, general-purpose USB disks that store files without any secure processing pose extreme risks. To prevent the risk of data disclosure, some enterprises adopt access control measures on the communication ports of I/O equipment, e.g., prohibiting the USB interface or disabling the NIC. Indeed, these measures can efficiently avoid data disclosure, but they lower the usefulness of the computer, and some people with higher privileges are still permitted to obtain the data through other I/O interfaces. This means that important data can still be revealed and the risk remains [1].
* The work is partially supported by the Natural Science Foundation of China (60850002, 60803001).
For the above issue, one solution is the application of cryptography: data or files are encrypted and then stored on the USB disk. People can decrypt the files and read the originals only if they hold the files' keys. Recently, dynamic encryption has been widely used for information encryption/decryption because it appears transparent: users can keep their usual workflow and experience little interference from the encryption/decryption procedures. There are two different implementations of dynamic encryption/decryption, at the application layer or at the driver layer. At the application layer, encryption/decryption is done by invoking special API subroutines. At the driver layer, in contrast, encryption/decryption is implemented by capturing the IRPs (I/O Request Packets) of the file system kernel, so that a file is encrypted/decrypted when a file create or read is extracted from the IRP attributes. The file is thus well protected, because file encryption/decryption happens within the operating system kernel and is difficult to expose [2]. In fact, this pure software solution is not perfectly secure, because some access constraints are not compulsory, although access constraints on kernel data exist in most operating systems [3]. Privileged processes or components of the operating system can access and obtain the desired kernel data. Since the key is stored as plain text in the kernel for a long time in a driver-based encryption/decryption system, tampered or malicious components and hacker processes can easily obtain the key and the important data. Accordingly, the other solution is to introduce a TCM (Trusted Cryptography Module) into the encryption/decryption system to enhance file security [4]. TCM is an individual hardware module; the encryption/decryption operations can be done within TCM and the key is stored in TCM. So far, mixing software and hardware is the best way to address information security, because purely software-based checks for malicious code can be circumvented by malicious programs without related hardware support [5]. The rest of the paper is organized as follows. In Section 2, related work is summarized. In Section 3, a TCM-based transparent file encryption/decryption strategy is proposed, and in Section 4 the data structures and the invocation of TSM APIs are presented in more detail. Finally, Section 5 concludes the paper.
2 Related Work

2.1 Minifilter Framework
The 'Filter Manager' component is available in Microsoft Windows XP SP3 and later. A so-called 'Minifilter driver' can thus be developed by a third party and managed by the filter manager. A Minifilter finds its location in the file system I/O stack through its 'Load Order Group' and coordinates well with filter drivers such as anti-virus components, mirroring components and so on. A Minifilter's location is identified by a unique identifier called 'Altitude', and a Minifilter instance is attached to the Minifilter driver. Figure 1 shows the framework with a simplified filter manager and Minifilter I/O stack.
Fig. 1. The Minifilter Framework
A Minifilter can filter IRP-based I/O operations, including Fast-I/O and filter callbacks of the file system. For an I/O operation being filtered, the Minifilter registers a 'pre-operation' callback subroutine and a 'post-operation' callback subroutine. Windows kernel drivers communicate with other drivers and with the operating system through the kernel data structure of I/O Request Packets (IRPs, for short). An upper application program sends an I/O request when communicating with a lower driver program, and the I/O request is packed into an IRP structure by the I/O manager of the operating system; the IRP thus describes the I/O request. The IRP is then delivered down the I/O stack. When receiving an IRP, the driver program calls the related dispatch subroutine to process it.
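As a concrete illustration, a minimal C sketch of how a Minifilter registers such pre/post-operation callbacks with the filter manager follows. It uses the documented FltRegisterFilter/FltStartFiltering entry points; the PreWrite/PostWrite names and the empty callback bodies are placeholders of ours:

    #include <fltKernel.h>

    PFLT_FILTER gFilterHandle; /* handle returned by the filter manager */

    /* Pre-operation callback: runs before IRP_MJ_WRITE reaches the file system. */
    FLT_PREOP_CALLBACK_STATUS
    PreWrite(PFLT_CALLBACK_DATA Data, PCFLT_RELATED_OBJECTS FltObjects,
             PVOID *CompletionContext)
    {
        /* inspect or modify the request here */
        return FLT_PREOP_SUCCESS_WITH_CALLBACK; /* request the post-operation callback */
    }

    /* Post-operation callback: runs after the lower drivers complete the request. */
    FLT_POSTOP_CALLBACK_STATUS
    PostWrite(PFLT_CALLBACK_DATA Data, PCFLT_RELATED_OBJECTS FltObjects,
              PVOID CompletionContext, FLT_POST_OPERATION_FLAGS Flags)
    {
        return FLT_POSTOP_FINISHED_PROCESSING;
    }

    const FLT_OPERATION_REGISTRATION Callbacks[] = {
        { IRP_MJ_WRITE, 0, PreWrite, PostWrite },
        { IRP_MJ_OPERATION_END }
    };

    const FLT_REGISTRATION FilterRegistration = {
        sizeof(FLT_REGISTRATION), FLT_REGISTRATION_VERSION, 0,
        NULL,       /* no context registration */
        Callbacks,  /* operation callbacks */
        NULL        /* unload callback omitted in this sketch */
    };

    NTSTATUS DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath)
    {
        NTSTATUS status = FltRegisterFilter(DriverObject, &FilterRegistration,
                                            &gFilterHandle);
        if (NT_SUCCESS(status)) {
            status = FltStartFiltering(gFilterHandle);
            if (!NT_SUCCESS(status))
                FltUnregisterFilter(gFilterHandle);
        }
        return status;
    }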
2.2 TCM Architecture and Invocation
A trusted computing platform is a computing device integrated with a cryptographic chip called a trusted cryptography module (TCM), under the 'Functionality and Interface Specification of Cryptographic Support Platform for Trusted Computing' specified by the Trusted Cryptography Module Union (TCM-U), an industry consortium formed to develop standards for trusted computing platforms in China [5]. As an important development direction, research on trusted secure computing platforms has been supported by the national 'eleventh five-year project' and the national '863 high-tech plan' in China. Based on TCM, a trusted computing platform can implement many security features, such as secure boot, sealed storage, and software integrity attestation. According to the TCM-U definition, the Trusted Cryptography Module (TCM) and the TCM Service Module (TSM) are the two main parts of the Trusted Cryptography Supporting Platform (TCSP, for short). TCSP is based on cryptography and provides such secure
[Figure 2 depicts the internal TCM architecture: a CPU core, GPIO, an LPC bus interface, flash data and program memories, RAM and ROM, a monotonic counter, a tick counter, control and security logic, and, on the internal bus, the SMS4, HMAC, SM3 and SM2 cryptographic engines together with an RNG.]
Fig. 2. TCM Architecture
functions as platform integrity, identification dependability and data security. TCM is a necessary infrastructure of TCSP and contains a collection of encryption algorithms (engines). TCM is packaged as individual hardware or firmware, or integrated as IP cores into other chips; its architecture is depicted in Figure 2. With the encryption engines, a system platform's integrity is guaranteed and trustfully reported to outer entities by measuring each component's integrity, its identity is authenticated through the platform's identification management, and the platform's sensitive data is protected. Within TCM there are two protection sub-systems, for memory and for execution, with which the trusted roots can be created and a strict security mechanism built. To keep TCM from becoming a performance bottleneck, only the protection kernels of the two sub-systems are stored and executed in TCM, while the other subroutines are stored and executed on the PC. Those support modules constitute the TCM Service Module (TSM, for short), which is invoked through its APIs. For application programs, TSM supplies a series of standardized interfaces, such as context management, policy management, TCM management, key management, data encryption/decryption, PCR (Platform Configuration Register) management, non-volatile memory management, hash operations and key negotiation [5].
3 Transparent File Encryption Strategy
For removable storages without a reliable security mechanism, stored sensitive data or files can be revealed by a malicious hacker when the storage is lost. It is therefore vital to choose adequate keys and encryption algorithms. In this paper, symmetric keys are applied and different keys are used to encrypt different files. In detail, there are two types of keys: the volume host key and the file store key. The host key is bound to a volume, and one volume has one host key. The file store key is bound to a file: each file is encrypted and stored with a file store key generated by TCM's RNG. The file store key is generated when the file is created, and all READ/WRITE operations on the file are encrypted or decrypted with this key. The encrypted or decrypted file is then stored on the removable storage. At the same time, the file
store key is encrypted with the host key and stored with the file. The host key is derived from the user's password via the hash module. To enhance system security, a 128-bit 'salt-value file' is generated by TCM before encryption, and the host key is obtained by hashing the user password mixed with the salt value. With this method, dictionary attacks can be efficiently avoided, because without the password and the salt-value file the hacker can only try to break the 128-bit SMS4 key directly. Even then, the file is still kept safe, because its encryption key is generated by the TCM's RNG and is quite different from the keys of other files. Figure 3 shows the proposed access control scheme. Information processed by the scheme enjoys high security owing to the combination of the host key, the file store key, TCM key protection and the TCM encryption engines [6].

[Figure 3 shows the scheme's components: TSM-based server programs and a user management program in user mode, the Minifilter-based file system filter-driver in kernel mode, the TCM, and the encrypted files stored on the removable storage.]

Fig. 3. TCM-based Access Control Scheme
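To make the key hierarchy concrete, the following C sketch outlines the derivation just described. The tcm_* function names are hypothetical placeholders for the corresponding TSM services (hash, RNG and SMS4 engine invocations); only the key roles and the 128-bit sizes come from the paper:

    #include <stdint.h>
    #include <string.h>

    #define KEY_LEN  16  /* 128-bit SMS4 keys, as in the paper */
    #define SALT_LEN 16  /* 128-bit salt-value file */

    /* Hypothetical TSM wrappers; the real TSM API names differ. */
    void tcm_hash(const uint8_t *in, size_t len, uint8_t out[KEY_LEN]);
    void tcm_rng(uint8_t *out, size_t len);
    void tcm_sms4_encrypt(const uint8_t key[KEY_LEN],
                          const uint8_t *in, uint8_t *out, size_t len);

    /* Host key: bound to a volume, derived from password + salt-value file. */
    void derive_host_key(const char *password, const uint8_t salt[SALT_LEN],
                         uint8_t host_key[KEY_LEN])
    {
        uint8_t buf[256];
        size_t pw_len = strlen(password);
        memcpy(buf, password, pw_len);
        memcpy(buf + pw_len, salt, SALT_LEN);      /* mix password and salt */
        tcm_hash(buf, pw_len + SALT_LEN, host_key);
    }

    /* File store key: bound to a file, generated by the TCM RNG at file
     * creation time and stored with the file, wrapped under the host key. */
    void create_file_key(const uint8_t host_key[KEY_LEN],
                         uint8_t file_key[KEY_LEN],
                         uint8_t wrapped_key[KEY_LEN])
    {
        tcm_rng(file_key, KEY_LEN);                         /* fresh per-file key */
        tcm_sms4_encrypt(host_key, file_key, wrapped_key, KEY_LEN);
    }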
In Figure 3, the scheme is composed of the Minifilter-based file system filter-driver, the TSM-based server programs, and related management programs, such as the user management program at the application layer. The Minifilter-based file system filter-driver's task is to filter the file operations of interest, such as create, cleanup, read and write. It also handles the various keys and the storage of related information, captures the related IRP packets, sends the desired data to the TSM servers for encryption, receives the data back and replaces the original, and handles buffer and memory exchange. The Minifilter is the core of the whole scheme. The TSM-based server's task is to supply services including uploading the host key into/downloading it from TCM, executing the encryption/decryption algorithms, and generating the file store key with the RNG. The TSM server exchanges information with the Minifilter-based file system filter-driver through the Minifilter's communication port. In addition, the user management program's task is to detect the insertion or removal of the removable storage (the USB disk) and to load/unload related driver subroutines. With the user management program, when a new removable storage is inserted into the system, the user is asked whether the storage should be transparently encrypted or decrypted. If the user's answer is yes, the user management program delivers the user password and the related salt value to the TSM components for data encryption/decryption. Conversely, the user management program informs the related components to release resources when the removable storage is removed. Moreover, the user management program may create a new salt-value file for users.
With the proposed scheme, the Minifilter's communication ports can be used for communication and coordination among components, in particular between the file system Minifilter driver in kernel mode and the other components in user mode.
4 TCM-Based Application and Implementation

4.1 Minifilter-Based Filter Driver: SeUSB
As mentioned, the file filter driver is the core of the system. In this paper, a file filter driver subroutine based on the Minifilter framework (called SeUSB) is designed and implemented. In SeUSB, four IRP types need to be handled: IRP_MJ_CREATE, IRP_MJ_CLEANUP, IRP_MJ_WRITE and IRP_MJ_READ. IRP_MJ_CLEANUP is forced to clean up the system's buffers, because data is buffered by the buffer management of the Windows system: Windows programs read data from disk via an IRP the first time it is accessed and then buffer the data in memory for later operations. The data is still in memory after the Minifilter driver is unloaded; the buffered, decrypted data in memory is therefore exposed to hacker attacks. In the scheme, a 'memory replace' technique is used to exchange memory areas when a data read or write occurs, because the encryption/decryption procedure must be transparent and the original memory buffer must be left untouched. In detail, a new memory buffer is created in the pre-operation of Write or Read and replaces the original buffer, and all operations are done in the new buffer; the IRP packet is then delivered down to the lower driver. In the post-operation, the related operations are first performed on both the new buffer and the original one, and the original buffer is swapped back afterwards. All operations appear transparent to the upper programs. To implement secure communication and support many different types of communication, a new 'communication port' object is introduced in the Minifilter. Correspondingly, a communication port '\SeUSBport' is created for SeUSB and used for communication between the TCMSrv subroutines and the user management program. The user management program can open the communication port for information exchange when it needs to connect with SeUSB, and disconnect the port after the operations have finished. Meanwhile, after the driver is loaded, TCMSrv opens a port which stays connected to the SeUSB driver the whole time, so as to receive TCM service requests. This means that the SeUSB subroutine can connect with the TCM server through the communication port, receive the volume host key (achieved by registering an instance 'SeUSBPortMessage'), and send instructions to TCMSrv. When a new user comes, SeUSBPortMessage is invoked. In SeUSBPortMessage, the SeUSB driver gets a user-defined buffer which is organized with the structure SEUSB_COMMAND as follows:

    typedef struct _SEUSB_COMMAND {
        WCHAR Target[VOLUME_NAME_LEN];
        UCHAR InstanceMainKey[KEY_LEN];
    } SEUSB_COMMAND, *PSEUSB_COMMAND;

After getting an instruction, the SeUSB driver attempts to obtain the context of the related volume instance, performs some identity checks, sends the volume host key to TCMSrv for loading into TCM, and later releases the host key from the memory buffer.
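The 'memory replace' technique can be sketched as a pre-write callback along the following lines. This is a simplified sketch: error handling and MDL-described buffers are elided, and encrypt_in_place is a placeholder of ours standing in for the round trip to the TCM services:

    extern void encrypt_in_place(void *buf, ULONG len); /* hypothetical: delegate to TCMSrv */

    /* Pre-operation for IRP_MJ_WRITE: swap in a private buffer holding ciphertext. */
    FLT_PREOP_CALLBACK_STATUS
    SeUsbPreWrite(PFLT_CALLBACK_DATA Data, PCFLT_RELATED_OBJECTS FltObjects,
                  PVOID *CompletionContext)
    {
        ULONG len = Data->Iopb->Parameters.Write.Length;
        PVOID orig = Data->Iopb->Parameters.Write.WriteBuffer;
        PVOID swapped;

        if (len == 0)
            return FLT_PREOP_SUCCESS_NO_CALLBACK;

        swapped = FltAllocatePoolAlignedWithTag(FltObjects->Instance,
                                                NonPagedPool, len, 'bUeS');
        if (swapped == NULL)
            return FLT_PREOP_SUCCESS_NO_CALLBACK; /* pass through unmodified */

        RtlCopyMemory(swapped, orig, len);        /* leave the original buffer untouched */
        encrypt_in_place(swapped, len);

        Data->Iopb->Parameters.Write.WriteBuffer = swapped;
        FltSetCallbackDataDirty(Data);            /* we modified the Iopb */

        *CompletionContext = swapped;             /* let the post-op free it */
        return FLT_PREOP_SUCCESS_WITH_CALLBACK;
    }

    /* Post-operation: the write has completed; release the private buffer. */
    FLT_POSTOP_CALLBACK_STATUS
    SeUsbPostWrite(PFLT_CALLBACK_DATA Data, PCFLT_RELATED_OBJECTS FltObjects,
                   PVOID CompletionContext, FLT_POST_OPERATION_FLAGS Flags)
    {
        FltFreePoolAlignedWithTag(FltObjects->Instance, CompletionContext, 'bUeS');
        return FLT_POSTOP_FINISHED_PROCESSING;
    }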
4.2 TCM Server: TCMSrv
TCMSrv listens for and receives SeUSB requests, and invokes the TCM encryption/decryption engines to encrypt or decrypt data or files. The TCMSrv services are listed in Figure 4.

[Figure 4 lists the services of the TCM server: TCM connection/configuration, volume host key load, volume host key unload/release, file store key generation, data encryption, and data decryption.]

Fig. 4. Services of TCM server
For encryption/decryption operations, since the Storage Main Key (SMK) is used frequently, in the proposed scheme the SMK is obtained and loaded into TCM at TCMSrv initialization. At the same time, TCMSrv initializes a TSM_HKEY array which stores the handles of the volume host keys. Finally, TCMSrv uses two data structures, SEUSB_MESSAGE and SEUSB_REPLY, for communication over SeUSB's communication port. The two structures are given below.

    typedef struct _SEUSB_MESSAGE {
        FILTER_MESSAGE_HEADER Header;
        SEUSB_MESSAGE_DATA Data;
    } SEUSB_MESSAGE, *PSEUSB_MESSAGE;

    typedef struct _SEUSB_REPLY {
        FILTER_REPLY_HEADER Header;
        SEUSB_REPLY_DATA Data;
    } SEUSB_REPLY, *PSEUSB_REPLY;

After initialization, TCMSrv uses FilterLoad to load the driver and FilterConnectCommunicationPort to connect to the driver's communication port. It then listens for and receives related requests in a loop.
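The user-mode side of this loop could look roughly as follows. The sketch is built on the documented fltuser.h functions FilterLoad, FilterConnectCommunicationPort, FilterGetMessage and FilterReplyMessage; the message payload types are the ones defined above, and handle_request is a placeholder of ours for the TCM engine invocations:

    #include <windows.h>
    #include <fltuser.h>   /* link against fltlib.lib */

    extern void handle_request(SEUSB_MESSAGE_DATA *in, SEUSB_REPLY_DATA *out);

    int tcmsrv_main(void)
    {
        HANDLE port;
        SEUSB_MESSAGE msg;
        SEUSB_REPLY reply;

        if (FAILED(FilterLoad(L"SeUSB")))                /* load the Minifilter driver */
            return 1;
        if (FAILED(FilterConnectCommunicationPort(L"\\SeUSBport", 0, NULL, 0,
                                                  NULL, &port)))
            return 1;

        for (;;) {                                       /* request-service loop */
            if (FAILED(FilterGetMessage(port, &msg.Header, sizeof(msg), NULL)))
                break;

            handle_request(&msg.Data, &reply.Data);      /* invoke the TCM engines */

            reply.Header.Status = 0;                     /* STATUS_SUCCESS */
            reply.Header.MessageId = msg.Header.MessageId;
            FilterReplyMessage(port, &reply.Header, sizeof(reply));
        }
        CloseHandle(port);
        return 0;
    }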
5 Test and Conclusion
In the test platform, the TCM chip sits on a small printed circuit board with a 20-pin LPC interface and is connected to the PC mainboard; the related TSM SDK runs on the Windows platform. All programs were developed with Visual Studio 2005.
When the USB disk is inserted and the password is input correctly, we obtain the original text, as shown in Figure 5.
Fig. 5. The correct test result
When the USB disk is inserted and the password is input incorrectly, we obtain only garbled text, as shown in Figure 6.
Fig. 6. The incorrect test result
The above test results show that the proposed TCM-enabled transparent file encryption/decryption strategy with a Minifilter driver achieves transparent encryption/decryption. It is an efficient way to protect the removable storage's data and files from being revealed.
References
1. Lin, H.: Research and Implementation for File Transparent Encryption based on Minifilter. Zhejiang University of Technology (2009)
2. Chen, M.: Development of a Secure File Kernel based on the New-Generation Filter-Driver Framework. Southwest Jiaotong University (2009)
3. Reid, J.F., Caelli, W.J.: DRM, Trusted Computing and Operating System Architecture. Research and Practice in Information Technology 44, 127–136 (2005)
4. Kong, W.: TPM Working Model. Journal of Wuhan College of Sci. and Tech. 18(1), 44–47 (2005)
5. The National Standard: Functionality and Interface Specification of Cryptographic Support Platform for Trusted Computing (2007)
6. Huang, G.: The Core Technique Analysis of Windows Encryption File System. Computer and Information Technology 13(4), 1–12 (2005)
Binary Addition Chain on EREW PRAM

Khaled A. Fathy 1, Hazem M. Bahig 2, Hatem M. Bahig 2, and A.A. Ragb 3

1 Department of Basic Science, Faculty of Engineering, Sinai University, Egypt
2 Computer Science Division, Department of Mathematics, Faculty of Science, Ain Shams University, Cairo, Egypt
3 Department of Mathematics, Faculty of Science, Al-Azhar University, Egypt
{hazem baheeg,hmbahig}@sci.asu.edu.eg
Abstract. An addition chain for a natural number x of n bits is a sequence of numbers a_0, a_1, ..., a_l such that a_0 = 1, a_l = x, and a_k = a_i + a_j with 0 ≤ i, j < k ≤ l. The addition chain problem asks: what is the minimal number of additions needed to compute x starting from 1? In this paper, we present a new parallel algorithm to generate a short addition chain for x. The algorithm has running time O(log^2 n) using a polynomial number of processors under the EREW PRAM (exclusive read exclusive write parallel random access machine) model. The algorithm is faster than previous algorithms and is based on the binary method.

Keywords: Addition chain, binary method, parallel algorithm, EREW PRAM.
1 Introduction
An addition chain, AC, for a natural number x is a sequence of numbers a_0, a_1, a_2, ..., a_l such that a_0 = 1, a_l = x and a_k = a_i + a_j, 0 ≤ i, j < k ≤ l. The minimal length l for which an addition chain for x exists is denoted by ℓ(x). For example, an addition chain for x = 47 is a_0 = 1, a_1 = a_0 + a_0 = 2, a_2 = a_0 + a_1 = 3, a_3 = a_1 + a_2 = 5, a_4 = a_3 + a_3 = 10, a_5 = a_4 + a_4 = 20, a_6 = a_2 + a_5 = 23, a_7 = a_6 + a_6 = 46, a_8 = a_0 + a_7 = 47. The study of addition chains for x arises naturally when considering how to raise m to the x-th power, m^x, given m and x (or how to compute the modular exponentiation m^x mod g, where g is a positive integer). For example, to compute m^47 we could proceed as follows: m, m^2, m^4, m^8, m^16, m^32, m^40, m^44, m^46, m^47. The corresponding addition chain is 1, 2, 4, 8, 16, 32, 40, 44, 46, 47. Alternatively, one can compute m^47 as follows: m, m^2, m^3, m^5, m^10, m^20, m^23, m^46, m^47. The corresponding addition chain is 1, 2, 3, 5, 10, 20, 23, 46, 47. Clearly, the first computation of m^47 needs 9 multiplications while the second one needs 8 multiplications. The problem of finding the optimal number of multiplications corresponds to finding an addition chain of minimal length. Downey, Leong and Sethi [5] prove that the problem of finding a minimal-length addition chain for a set S of natural numbers (in this case the chain is called an addition sequence) is NP-complete. A proof for the case S = {x} is not known. Therefore, it is interesting to consider short (not necessarily minimal length) addition
chains, provided that they can be generated efficiently. Different techniques for finding an addition chain of short length have been suggested, such as the binary, factor, window, power tree, m-ary and continued fraction methods [3,4,6,8,9]. The chain is called a binary addition chain if the binary technique is used. The length of the binary chain is ⌊log x⌋ + ν(x) − 1, where ν(x) is the number of ones in the binary representation of x. The binary technique is suitable (simple and fast) for many applications where neither m nor x is fixed [10], and has near-optimal length when ν(x) is small. The parallelization of addition chains and modular exponentiations has been studied under different parallel models with the goal of speeding up the running time [2,6,12]. In this paper we are concerned with addition chains under the PRAM (Parallel Random Access Machine) model, and more specifically the EREW (Exclusive Read Exclusive Write) PRAM model. A PRAM is a machine consisting of a finite number p of processors (RAMs) operating synchronously on an infinite global memory consisting of cells numbered 0, 1, 2, .... In each step, each processor may carry out local computation, access the global memory, or write to one global memory cell. Various PRAM models have been introduced; they differ in the conventions regarding concurrent reading and writing [1,7].
(1) Exclusive Read Exclusive Write (EREW) PRAM: each memory location may only be read from or written to by one processor in each cycle.
(2) Concurrent Read Exclusive Write (CREW) PRAM: each memory location may be read from by several processors, but written to by only a single processor in each cycle.
(3) Concurrent Read Concurrent Write (CRCW) PRAM: each memory location may be read from or written to by several processors at the same time.
A parallel algorithm using time T_p(n) and p processors to solve a given problem Q of size n is said to be cost optimal if T_p(n) × p = O(T*(n)), where T*(n) is the running time of the fastest sequential algorithm for the problem Q [1]. This paper presents a new parallel algorithm for generating a short addition chain. The proposed algorithm is based on the binary method but differs from previous works. Our algorithm runs in O(log^2 n) time using n^2 log n log log n processors on the EREW PRAM. Compared with the previous algorithms, our algorithm is faster, but it is not cost optimal. The structure of the paper is as follows: we present a brief description of the sequential and parallel binary methods in Section 2. In Section 3, we give a new parallel algorithm for computing a short addition chain on the EREW PRAM. Finally, the conclusion of our work is given in Section 4.
2 Binary Method for AC
In this section we give a brief description of the sequential and parallel algorithms for generating addition chains using the binary method. Let (x_{n−1} x_{n−2} ... x_1 x_0)_2 be the binary representation of the natural number x. The algorithm that generates an addition chain for x using the binary method is as follows [6,8,9].
Algorithm 1
Input: A number x = (x_{n−1} x_{n−2} ... x_1 x_0)_2.
Output: A binary addition chain for x: a_0, a_1, ..., a_{n+ν(x)−2}.
Begin
1. a_0 = 1, j = 0
2. for i = n − 2 down to 0 do
     j = j + 1
     a_j = a_{j−1} + a_{j−1}
     if x_i = 1 then
       j = j + 1
       a_j = a_{j−1} + 1
End.

Note that the value of j at the end of Algorithm 1 is equal to n + ν(x) − 2. The running time of Algorithm 1 is O(n^2), since the running time for adding two n-bit integers is O(n). A simple parallel version of Algorithm 1 on the EREW PRAM can be obtained by parallelizing the addition operation. Algorithm 2 [1,6,12] generates a binary addition chain for x under the EREW PRAM, where +_p denotes the addition of numbers in parallel.

Algorithm 2
Input: x = (x_{n−1} x_{n−2} ... x_1 x_0)_2.
Output: A binary addition chain for x: a_0, a_1, ..., a_{n+ν(x)−2}.
Begin
1. a_0 = 1, j = 0
2. for i = n − 2 down to 0 do
     j = j + 1
     a_j = a_{j−1} +_p a_{j−1}
     if x_i = 1 then
       j = j + 1
       a_j = a_{j−1} +_p 1
End.

The running time of Algorithm 2 is O(n log n), since there are O(n) iterations and in each iteration the addition of two n-bit numbers takes O(log n) time using n/log n processors on the EREW PRAM [1,6,12]. Algorithm 2 is cost optimal and uses constant storage.
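For readers who prefer running code, Algorithm 1 translates directly into the following C sketch, here using 64-bit machine words instead of the n-bit big integers assumed by the complexity analysis:

    #include <stdio.h>
    #include <stdint.h>

    /* Generate the binary addition chain of x into chain[]; returns its length.
     * Scans the bits of x from the second most significant bit downwards:
     * each bit costs a doubling, each 1-bit an extra increment (Algorithm 1). */
    int binary_addition_chain(uint64_t x, uint64_t chain[])
    {
        int n = 64 - __builtin_clzll(x);   /* number of significant bits of x */
        int j = 0;
        chain[0] = 1;
        for (int i = n - 2; i >= 0; i--) {
            chain[j + 1] = chain[j] + chain[j];   /* a_j = 2 * a_{j-1} */
            j++;
            if ((x >> i) & 1) {
                chain[j + 1] = chain[j] + 1;      /* a_j = a_{j-1} + 1 */
                j++;
            }
        }
        return j + 1;  /* chain has n + nu(x) - 1 elements */
    }

    int main(void)
    {
        uint64_t chain[128];
        int len = binary_addition_chain(47, chain);
        for (int i = 0; i < len; i++)
            printf("%llu ", (unsigned long long)chain[i]);
        printf("\n");  /* prints: 1 2 4 5 10 11 22 23 46 47 */
        return 0;
    }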
3 The Proposed Algorithm
In this section, we introduce a new parallel addition chain algorithm under the EREW PRAM. The section consists of five subsections. In Section 3.1, we give the main idea of the new algorithm, which consists of two stages. The implementation details of the first stage are given in Section 3.2, while the implementation details of the second stage are given in Section 3.3. The full description of the
new algorithm is given in Section 3.4. The complexity analysis of the proposed algorithm and a comparison between the proposed algorithm and previous algorithms are given in Section 3.5.
3.1 Main Idea
Let us describe the main idea behind the new parallel algorithm with the following example. Given x = 35205 = (1000100110000101)_2 of 16 bits, we can map the bits x_i of x into an array Y of length n as shown in Figure 1. The binary addition chain for x obtained by Algorithm 1 is given in Figure 2. Clearly, each zero bit in the array Y generates one element in the binary addition chain, while each one bit, except the first one Y[0] = y_0, generates two elements, say a_k and a_{k+1}, in the binary addition chain, where k is equal to the number of previous bits in Y plus the number of previous ones minus one.
Fig. 1. Mapping the n bits to an array of n elements

Fig. 2. Addition chain for x

Fig. 3. How to generate the elements between two successive marked elements (each unmarked element is obtained by doubling, *2)

Fig. 4. (1) Relation between two successive marked elements (17 = 1·2^4 + 1, 137 = 17·2^3 + 1, 275 = 137·2 + 1, 8801 = 275·2^5 + 1, 35205 = 8801·2^2 + 1). (2) Relation between a marked element and the previous one.
As we can see in Figure 3, if we have two consecutive ones in the array Y at positions i_1 and i_2, i_1 < i_2, then the first one generates two elements in the binary chain, say a_{k_1} and a_{k_1+1}, while the second one generates two elements, say a_{k_2} and a_{k_2+1}. For every zero bit between i_1 and i_2, the element of the binary chain that corresponds to the zero bit can be generated by doubling a_{k_1+1} m times, where m is equal to the position of the zero bit minus i_1. The relation between the two elements a_{k_1+1} and a_{k_2+1} is a_{k_2+1} = a_{k_1+1} · 2^{e_{i_2} − e_{i_1}} + 1, as shown in Figure 4, where e_i represents the position of the (i+1)-th one in the array Y. Assume that b_0, b_1, ..., b_{ν(x)−1} represent the elements of the chain corresponding to the second elements generated by the ones. If b_{i−1} and b_i are two such elements, corresponding to two successive ones, then

    b_i = 1                                 if i = 0
    b_i = 2^{e_i − e_{i−1}} b_{i−1} + 1     if 1 ≤ i < ν(x)        (1)

Therefore, the main idea of computing an addition chain a_0, a_1, ..., a_{n+ν(x)−2} under the EREW PRAM based on the binary method consists of two stages.
Stage 1: for the (i+1)-th one (1 ≤ i < ν(x)) in the array Y, compute, in parallel, the corresponding two elements of the binary chain: the first element is b_i − 1 and the second element is b_i.
Stage 2: generate the remaining elements of the binary chain from the elements generated in Stage 1.
For example, in the first stage we construct the sequence 1, 16, 17, 136, 137, 274, 275, 8800 and 8801 in parallel. In the second stage we construct the subsequences (2, 4, 8), (34, 68), (550, 1100, 2200, 4400), and (17602, 35204) in parallel from the elements 1, 17, 137, 275, and 8801, respectively. Thus, the obtained binary chain is 1, 2, 4, 8, 16, 17, 34, 68, 136, 137, 274, 275, 550, 1100, 2200, 4400, 8800, 8801, 17602, 35204, 35205. Now, we have two questions. The first is how to compute the elements of the first stage; the second is how to construct the remaining elements from those of the first stage.
3.2 Implementation Details for Stage 1
Now, we show how to compute the elements of Stage 1. From Eq. (1), we get

    b_1 = 2^{e_1 − e_0} b_0 + 1 = 2^{e_1 − e_0} + 1,
    b_2 = 2^{e_2 − e_1} b_1 + 1 = 2^{e_2 − e_0} + 2^{e_2 − e_1} + 1,
    b_3 = 2^{e_3 − e_2} b_2 + 1 = 2^{e_3 − e_2} (2^{e_2 − e_0} + 2^{e_2 − e_1} + 1) + 1
        = 2^{e_3 − e_0} + 2^{e_3 − e_1} + 2^{e_3 − e_2} + 1,
    ...
    b_i = 2^{e_i − e_0} + 2^{e_i − e_1} + 2^{e_i − e_2} + ··· + 2^{e_i − e_{i−1}} + 1
        = 2^{e_i} (1/2^{e_0} + 1/2^{e_1} + 1/2^{e_2} + ··· + 1/2^{e_{i−1}}) + 1.
Let s_i = 1/2^{e_0} + 1/2^{e_1} + 1/2^{e_2} + ··· + 1/2^{e_i}. Then

    b_i = 2^{e_i} s_{i−1} + 1,  ∀ 1 ≤ i < ν(x).        (2)
and 0 ≤ i < ν(x). We can construct S by applying the prefix sums on L. By applying this step on L, we obtain S = (1, 1.0625, 1.07031, 1.07422, 1.07434, 1.07437). Step 7: Construct an array B of length ν(x). Each element bi represents the value of the second element that corresponds to (i + 1)−th one in the array Y. The value of bi is equal to qei ∗p si−1 + 1, and b0 = 1, ∀1 ≤ i < ν(x). The position of bi in the binary chain A is ei + i. Therefore, aei +i = bi = qei ∗p si−1 + 1, ∀1 ≤ i < ν(x). Thus, the position of the first element (bi − 1) that corresponds to (i + 1)−th (i = 0) one in the array Y is ei + i − 1. Hence, aei +i−1 = bi − 1.
Step 7: Construct an array B of length ν(x). Each element b_i represents the value of the second element that corresponds to the (i+1)-th one in the array Y. The value of b_i is equal to q_{e_i} ∗_p s_{i−1} + 1, with b_0 = 1, ∀ 1 ≤ i < ν(x). The position of b_i in the binary chain A is e_i + i. Therefore, a_{e_i + i} = b_i = q_{e_i} ∗_p s_{i−1} + 1, ∀ 1 ≤ i < ν(x). Thus, the position of the first element (b_i − 1) that corresponds to the (i+1)-th one (i ≠ 0) in the array Y is e_i + i − 1. Hence, a_{e_i + i − 1} = b_i − 1. By applying this step, we obtain B = (1, 17, 137, 275, 8801, 35205). All elements of the binary chain corresponding to a 1 bit are: A = (a_0 = 1, a_4 = 16, a_5 = 17, a_8 = 136, a_9 = 137, a_{10} = 274, a_{11} = 275, a_{16} = 8800, a_{17} = 8801, a_{20} = 35205).
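The following C fragment illustrates Steps 1-7 sequentially for the running example x = 35205. It is a sketch for intuition only: the paper performs each step with parallel prefix operations, and the floating-point array S stands in for the exact rational arithmetic a full implementation would need:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        const int n = 16;
        unsigned x = 35205;              /* (1000100110000101)_2 */
        int Y[16], W[16], E[16], i;
        double L[16], S[16];
        unsigned long long Q[16], B[16];

        for (i = 0; i < n; i++) Y[i] = (x >> (n - i - 1)) & 1;   /* Step 1 */
        for (i = 0; i < n; i++) W[i] = (i ? W[i-1] : 0) + Y[i];  /* Step 2: prefix sums */
        for (i = 0; i < n; i++) if (Y[i]) E[W[i] - 1] = i;       /* Step 3 */
        for (i = 0; i < n; i++) Q[i] = 1ULL << i;                /* Step 4: q_i = 2^i */

        int nu = W[n - 1];                                       /* nu(x) = 6 */
        for (i = 0; i < nu; i++) L[i] = 1.0 / (double)Q[E[i]];   /* Step 5 */
        for (i = 0; i < nu; i++) S[i] = (i ? S[i-1] : 0) + L[i]; /* Step 6: prefix sums */

        B[0] = 1;                                                /* Step 7: Eq. (2) */
        for (i = 1; i < nu; i++)
            B[i] = (unsigned long long)llround((double)Q[E[i]] * S[i-1]) + 1;

        for (i = 0; i < nu; i++) printf("%llu ", B[i]);
        printf("\n");   /* prints: 1 17 137 275 8801 35205 */
        return 0;
    }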
3.3 Implementation Details for Stage 2
In the second stage, we want to compute the elements corresponding to the zeros in the array Y. If y_i = 0, then the position of the corresponding element in the binary chain is i + w_i − 1. On the other hand, the second element generated by the previous one is a_{e_{w_i−1} + w_i − 1}. Therefore, a_{i + w_i − 1} = a_{e_{w_i−1} + w_i − 1} ∗_p q_{i − e_{w_i−1}}. Thus,

    a_{i + w_i − 1} = (a_{e_{w_i−1} + w_i − 1} / q_{e_{w_i−1}}) ∗_p q_i.        (3)

The term a_{e_{w_i−1} + w_i − 1} / q_{e_{w_i−1}} depends only on w_i, and w_i changes only when y_i = 1. We use
Eq. (3) to compute the elements corresponding to any set of zeros between any two consecutive ones in the array Y. The following steps use an auxiliary array Z of length n to compute Eq. (3).
Step 8: Construct an array Z of length n, where each element z_i is computed by the following test, ∀ 0 ≤ i < n − 1: if y_i = 1 then z_i = a_{i + w_i − 1} / q_i, else z_i = 0. By applying this step, we obtain Z = (1, 0, 0, 0, 1.0625, 0, 0, 1.07031, 1.07422, 0, 0, 0, 0, 1.07434, 0, 1.07437).
Step 9: To compute z_i for y_i = 0, we use the interval broadcast algorithm [1] on Z. The interval broadcasting problem is: given an array C = (c_1, c_2, ..., c_n), certain positions of C are referred to as leaders; each leader holds a datum while all other positions of C are empty. Interval broadcasting is copying the datum in each leader into all positions of the array following the leader, up to, but not including, the next leader (if it exists). This problem can be solved in O(log n) time using n/log n processors under the EREW PRAM [1]. By applying the interval broadcast algorithm on the array Z, we obtain Z = (1, 1, 1, 1, 1.0625, 1.0625, 1.0625, 1.07031, 1.07422, 1.07422, 1.07422, 1.07422, 1.07422, 1.07434, 1.07434, 1.07437).
Step 10: Compute the elements of the binary chain that correspond to the zero bits, i.e., y_i = 0. This is done as follows: if y_i = 0, then a_{i + w_i − 1} = z_i ∗_p q_i, ∀ 0 ≤ i < n − 1. By applying this step, we obtain a_1 = z_1 ∗_p 2^1 = 2, a_2 = z_2 ∗_p 2^2 = 4, a_3 = z_3 ∗_p 2^3 = 8, a_6 = z_5 ∗_p 2^5 = 34, a_7 = z_6 ∗_p 2^6 = 68, a_{12} = z_9 ∗_p 2^9 = 550, a_{13} = z_{10} ∗_p 2^{10} = 1100, a_{14} = z_{11} ∗_p 2^{11} = 2200, a_{15} = z_{12} ∗_p 2^{12} = 4400, a_{18} = z_{14} ∗_p 2^{14} = 17602.
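Stage 2 can likewise be illustrated sequentially. The program below recomputes Stage 1 compactly and then performs Steps 8-10; the serial rightward scan mimics what the interval broadcast of Step 9 achieves in O(log n) parallel time (again a sketch, with floating point standing in for exact arithmetic):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        const int n = 16;
        unsigned x = 35205;
        int Y[16], W[16], E[16], i, nu;
        double L[16], S[16], Z[16];
        unsigned long long Q[16], B[16], A[32];

        /* Stage 1 (Steps 1-7), as in the previous fragment. */
        for (i = 0; i < n; i++) Y[i] = (x >> (n - i - 1)) & 1;
        for (i = 0; i < n; i++) W[i] = (i ? W[i-1] : 0) + Y[i];
        for (i = 0; i < n; i++) if (Y[i]) E[W[i] - 1] = i;
        for (i = 0; i < n; i++) Q[i] = 1ULL << i;
        nu = W[n - 1];
        for (i = 0; i < nu; i++) L[i] = 1.0 / (double)Q[E[i]];
        for (i = 0; i < nu; i++) S[i] = (i ? S[i-1] : 0) + L[i];
        B[0] = 1;
        for (i = 1; i < nu; i++)
            B[i] = (unsigned long long)llround((double)Q[E[i]] * S[i-1]) + 1;

        /* Step 8: leaders z_i = a_{i+w_i-1} / q_i at the 1-bits; that chain
         * element is exactly b_{w_i-1}, already computed in Stage 1. */
        for (i = 0; i < n; i++)
            Z[i] = Y[i] ? (double)B[W[i] - 1] / (double)Q[i] : 0.0;

        /* Step 9: interval broadcast, as a serial scan: copy each leader's
         * value rightwards up to the next leader. */
        for (i = 1; i < n; i++)
            if (Z[i] == 0.0) Z[i] = Z[i-1];

        /* Elements from the 1-bits (Stage 1), then Step 10 for the 0-bits. */
        A[0] = 1;
        for (i = 1; i < nu; i++) { A[E[i] + i] = B[i]; A[E[i] + i - 1] = B[i] - 1; }
        for (i = 0; i < n; i++)
            if (!Y[i]) A[i + W[i] - 1] = (unsigned long long)llround(Z[i] * (double)Q[i]);

        for (i = 0; i < n + nu - 1; i++) printf("%llu ", A[i]);
        printf("\n"); /* 1 2 4 8 16 17 34 68 ... 8800 8801 17602 35204 35205 */
        return 0;
    }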
3.4 The Algorithm
The pseudocode of the new algorithm to compute a binary addition chain for an n-bit number x under the EREW PRAM is as follows.

Algorithm 3
Input: x = (x_{n−1} x_{n−2} x_{n−3} ... x_1 x_0)_2.
Output: A binary addition chain of x in the array A of length n + ν(x) − 1.
Begin:
1. for i = 0 to n − 1 do in parallel
     y_i = x_{n−i−1}
2. for j = 0 to log n − 1 do
     for i = 0 to n/log n − 1 do in parallel
       if j = 0 then w_{i log n} = y_{i log n}
       else w_{i log n + j} = w_{i log n + j − 1} + y_{i log n + j}
   for j = 0 to log(n/log n) − 1 do
     for i = 2^j + 1 to n/log n do in parallel
       w_{i log n − 1} = w_{(i − 2^j) log n − 1} + w_{i log n − 1}
   for j = 0 to log n − 2 do
     for i = 1 to n/log n − 1 do in parallel
       w_{i log n + j} = w_{i log n − 1} + w_{i log n + j}
3. for i = 0 to n − 1 do in parallel
     if y_i = 1 then e_{w_i − 1} = i
4. for j = 0 to log n − 1 do
     for i = 0 to n/log n − 1 do in parallel
       if j = 0 then
         if i = 0 then q_i = 1 else q_{i log n} = 2
       else q_{i log n + j} = q_{i log n + j − 1} ∗_p 2
   for j = 0 to log(n/log n) − 1 do
     for i = 2^j + 1 to n/log n do in parallel
       q_{i log n − 1} = q_{(i − 2^j) log n − 1} ∗_p q_{i log n − 1}
   for j = 0 to log n − 2 do
     for i = 1 to n/log n − 1 do in parallel
       q_{i log n + j} = q_{i log n − 1} ∗_p q_{i log n + j}
5. ν(x) = w_{n−1}
   for i = 0 to ν(x) − 1 do in parallel
     l_i = 1 /_p q_{e_i}
6. for j = 0 to log ν(x) − 1 do
     for i = 0 to ν(x)/log ν(x) − 1 do in parallel
       if j = 0 then s_{i log ν(x)} = l_{i log ν(x)}
       else s_{i log ν(x) + j} = s_{i log ν(x) + j − 1} +_p l_{i log ν(x) + j}
   for j = 0 to log(ν(x)/log ν(x)) − 1 do
     for i = 2^j + 1 to ν(x)/log ν(x) do in parallel
       s_{i log ν(x) − 1} = s_{(i − 2^j) log ν(x) − 1} +_p s_{i log ν(x) − 1}
   for j = 0 to log ν(x) − 2 do
     for i = 1 to ν(x)/log ν(x) − 1 do in parallel
       s_{i log ν(x) + j} = s_{i log ν(x) − 1} +_p s_{i log ν(x) + j}
7. a_0 = 1
   for i = 1 to ν(x) − 1 do in parallel
     a_{e_i + i} = b_i = q_{e_i} ∗_p s_{i−1} +_p 1
     a_{e_i + i − 1} = b_i −_p 1
8. for i = 0 to n − 1 do in parallel
     if y_i = 1 then z_i = a_{e_{w_i−1} + w_i − 1} /_p q_{e_{w_i−1}}
     else z_i = 0
9. for j = 1 to log n − 1 do
     for i = 0 to n/log n − 1 do in parallel
       if w_{i log n + j} = w_{i log n + j − 1} then z_{i log n + j} = z_{i log n + j − 1}
   for j = 0 to log(n/log n) − 1 do
     for i = 2^j + 1 to n/log n do in parallel
       if w_{i log n − 1} = w_{(i − 2^j) log n − 1} then z_{i log n − 1} = z_{(i − 2^j) log n − 1}
   for j = 0 to log n − 2 do
     for i = 1 to n/log n − 1 do in parallel
       if w_{i log n + j} = w_{i log n − 1} then z_{i log n + j} = z_{i log n − 1}
10. for i = 1 to n do in parallel
      if y_i = 0 then a_{i + w_i − 1} = z_i ∗_p q_i
End.
3.5 Analysis
Step 1 takes O(1) time using O(n) processors. Step 2 takes O(log n) time using O(n/log n) processors. Step 3 takes O(1) time using O(n) processors. Step 4 takes O(log^2 n) time using O(n^2 log log n) processors. Step 5 takes O(log n log log n) time using O(ν(x) n log n log log n) processors. Step 6 takes O(log ν(x) log n) time using O(nν(x)/(log n log ν(x))) processors. Step 7 takes O(log n) time using O(ν(x) n log n log log n) processors. Step 8 takes O(log n log log n) time using O(n^2 log n log log n) processors. Step 9 takes O(log n) time using O(n/log n) processors. Step 10 takes O(log n) time using O(n^2 log n log log n) processors. Therefore, the overall time of the parallel addition chain algorithm is O(log^2 n) using O(n^2 log n log log n) processors under the EREW PRAM, where ν(x) ≤ n. We now compare the proposed algorithm with previous works (the best sequential and parallel algorithms based on the binary method) according to the following measures.
Length of Chain: It is clear that all algorithms produce a chain of length n + ν(x) − 1.
Running Time: The proposed algorithm has running time O(log^2 n), while the best known sequential and parallel algorithms have running times O(n^2) and O(n log n), respectively. Therefore, the proposed algorithm is faster than the best known sequential and parallel algorithms.
Optimality: The best known parallel algorithm is cost optimal, because the cost (time-processor product) of the algorithm is O(n^2), which equals the running time of the best known sequential algorithm. The proposed algorithm, however, is not cost optimal, because its cost is O(n^2 log^3 n log log n).
Communication Complexity: The communication complexity is defined as the worst-case bound on the traffic between the shared memory and any local memory of a processor [7]. The proposed algorithm has communication complexity O(log^2 n), while the best known parallel algorithm has O(n log n) communication complexity.
4 Conclusion
In this paper we addressed how to parallelize the addition chain problem on the EREW PRAM. A new algorithm based on the binary method was proposed. The proposed algorithm runs in O(log^2 n) time using a polynomial number of processors, and is faster than previous results.
References
1. Akl, S.: Parallel Computation: Models and Methods. Prentice Hall, Upper Saddle River (1997)
2. Chang, C., Lou, D.: Parallel Computation of the Multi-Exponentiation for Cryptosystems. International Journal of Computer Mathematics 63, 9–26 (1997)
3. Bergeron, F., Berstel, J., Brlek, S.: Efficient Computation of Addition Chains. J. de Théorie des Nombres de Bordeaux 6, 21–38 (1994)
4. Bergeron, F., Berstel, J., Brlek, S., Duboc, C.: Addition Chains Using Continued Fractions. Journal of Algorithms 10, 403–412 (1989)
5. Downey, P., Leong, B., Sethi, R.: Computing Sequences with Addition Chains. SIAM Journal on Computing 3, 638–696 (1981)
6. Gordon, D.M.: A Survey of Fast Exponentiation Methods. Journal of Algorithms 27(1), 129–146 (1998)
7. Jaja, J.: An Introduction to Parallel Algorithms. Addison-Wesley, Reading (1992)
8. Knuth, D.: The Art of Computer Programming, vol. 2: Seminumerical Algorithms. Addison-Wesley, Reading (1973)
9. Kruijssen, S.: Addition Chains: Efficient Computing of Powers. Bachelor Project, Amsterdam (2007)
10. Rooij, P.: Efficient Exponentiation Using Precomputation and Vector Addition Chains. In: De Santis, A. (ed.) EUROCRYPT 1994. LNCS, vol. 950, pp. 389–399. Springer, Heidelberg (1995)
11. Schönhage, A., Strassen, V.: Schnelle Multiplikation großer Zahlen. Computing 7, 281–292 (1971)
12. Sorenson, J.: A Sublinear-Time Parallel Algorithm for Integer Modular Exponentiation. In: Proceedings of the Conference on the Mathematics of Public-Key Cryptography (1999)
A Portable Infrastructure Supporting Global Scheduling of Embedded Real-Time Applications on Asymmetric MPSoCs

Eugenio Faldella and Primiano Tucci

Department of Electronics, Computer Engineering and Systems, University of Bologna, Italy
{eugenio.faldella,primiano.tucci}@unibo.it
Abstract. Multiprocessor systems-on-chip (MPSoCs) open notable perspectives in the design of highly-targeted embedded solutions for real-time multitasking applications. The heterogeneity of available platforms, however, often hinders the plain applicability of efficient scheduling policies, particularly in the case of loosely coupled architectures which do not provide any support for inter-processor task migration. In this paper we present a portable software infrastructure addressing the global scheduling of periodic real-time tasks on such platforms. We show that global scheduling policies, under the restricted-migration model, are applicable also on asymmetric multiprocessing systems, and experimentally evaluate the validity of the approach using different FPGA-based configurations that recall manifold architectures of commercial MPSoCs.

Keywords: AMP, Global Scheduling, MPSoC, Real-Time, R-EDF, RTOS.
1 Introduction

MPSoCs open up new vistas as regards the reference architectures which can be exploited to face the ever-increasing complexity of modern embedded systems. In particular, in the case of FPGA-based MPSoCs, highly-versatile platforms can be customized ad hoc in order to meet the several, and often contrasting, requirements of most embedded applications [1,2]. Such platforms turn out to be particularly suitable when considering the intensive and highly time-constrained computational demands that characterize the application scenario of embedded real-time systems [3-5]. Multiprocessing, however, actually brings in many new issues in both hardware and software design. The choice of the reference architectural scheme, either symmetric multiprocessing (SMP) or asymmetric multiprocessing (AMP), represents the first crucial design issue, involving complex tradeoffs between the high-level services exposed by the software platform and the low-level hardware requirements. SMP involves a set of closely-coupled, architecturally identical processors interfaced to a shared bus, which operate as a single resource pool (Fig. 1a). The platform exhibits a coordinated environment in which a unique real-time operating system (RTOS) yields a uniform memory view and is able to dynamically spawn, execute and migrate each task on any processor. This scheme, which is widely used in multicore PCs, enables support for global scheduling policies (GSPs), paving the way to
dynamic load balancing of computationally intensive multitasking applications. It requires, however, dedicated hardware resources to support interlock mechanisms and cache coherence management policies, which can prove very costly for an embedded system due to both area and power requirements. For such reasons, in the case of small-scale and highly-integrated embedded real-time systems, AMP often proves to be the leading choice. It can be viewed as a multi-uniprocessor scheme (Fig. 1b), in which the processors are independent and do not necessarily share all the memory space (the term asymmetric, by itself, does not imply a difference between processors; the prefix heterogeneous is preferred for that purpose). AMP, however, has a strong impact on the overall software organization, which needs to be approached in a decentralized fashion: distinct RTOS instances must be independently executed on each processor, working as separate environments and lacking any direct support for task migration.
Fig. 1. SMP and AMP Architectures
From the real-time scheduling point of view, it is often implicitly assumed that the lack of task migration capability completely hinders the applicability of GSPs, constraining software designers to rely on partitioning approaches, in which each task is statically assigned to a predetermined processor. In some situations such an assumption turns out to be too conservative. There is, in fact, a subclass of GSPs conforming to the restricted migration model [6,7] which, under certain hardware scenarios such as the ones considered in this work, can still be applied. The key to an effective integration of such scheduling policies lies in the further definition of an agile software infrastructure capable of encapsulating fundamental recurring patterns (e.g., the runtime task model and scheduling policies), providing the designer with a high-level view of the underlying platform that abstracts the mechanisms required to deal with it. To our knowledge, methodological issues related to the design, implementation and performance evaluation of restricted migration policies on AMP architectures have not yet been considered. In this work we first highlight the limitations faced by software designers when dealing with the execution of periodic tasks on conventional FPGA-based MPSoC platforms. We then outline the main features of a portable infrastructure that overcomes such limitations and enables performance enhancement through the application of the aforementioned scheduling policies. The infrastructure has been experimentally evaluated using an Altera Cyclone IV FPGA equipped with
multiple NIOS-II soft-cores, combining different processor and memory layouts in order to reflect manifold MPSoC architectures and give generality to the results. The rest of the paper is organized as follows. Sec. 2 discusses background context and related work. The run-time task model that underpins our scheduling infrastructure, along with its main architectural features, is described in Sec. 3. Sec. 4 presents the target platform and the evaluation methodology. Experimental results are reported in Sec. 5. Concluding remarks and future research directions are outlined in Sec. 6.
2 Related Work

Depending on the degree of migration allowed, real-time multiprocessor scheduling policies are typically divided into three classes [8]: no-migration, full-migration, and restricted-migration. No-migration policies presuppose a static partitioning of the n application tasks into m disjoint subsets, each one handled on a specific processor by a uniprocessor scheduler. As a result, all jobs of each task are always executed on the same processor. Full-migration policies, conversely, involve a single system-wide scheduler that is allowed to delegate the execution of each task to any processor. Moreover, the execution can be suspended, possibly more than once, and later resumed on a different processor (job-level migration). Restricted-migration policies, finally, envisage that the execution of different jobs of any task may be delegated to different processors, with the only constraint that every job, even if preempted, has to be entirely executed on the same processor (task-level migration). Full-migration policies generally provide the best performance, especially in bounded-tardiness soft real-time systems [9,10]. However, they make the strong assumption of an underlying SMP platform, needed to handle a shared task queue and perform inter-processor task migration. No-migration policies, conversely, may be applied on both SMP and AMP platforms, since they substantially operate as a multiplicity of legacy uniprocessors. As a matter of fact, since AMP turns out to be the only architectural scheme supported in most MPSoCs, many studies are currently being undertaken, some investigating partitioning approaches [11,12], others aiming to extend SMP facilities to AMP platforms [13,14]. Although the latter prove to be functionally correct, the overhead they introduce is not negligible and the overall platform does not scale well as the number of cores increases [15]. Restricted-migration policies have so far received less attention. In [6,7] a restricted-migration variant of the earliest deadline first algorithm (r-EDF) is proven to be not worse than the highest known utilization bound for global fixed-priority scheduling. It is also worth noting that non-preemptive versions of global policies, such as NPG-EDF [16,17], still fall within the class of restricted-migration policies. Restricted-migration policies can bring significant benefits even for very simple real-time applications, such as the one depicted in Fig. 2, which considers the scheduling of 3 tasks (the parameters associated to each task represent, respectively, its computation time, period and resulting processor utilization) on a 2-way multiprocessor system. It is evident that no-migration policies are not able to feasibly schedule this task set, since the utilization factor of any pair of tasks exceeds the computational capacity of a single processor. A r-EDF policy, however, is able to successfully schedule the application meeting all deadlines, as shown.
[Figure 2 plots, over time units 1-12, the r-EDF schedule of T1 {1, 2, 50%}, T2 {2, 3, 66%} and T3 {3, 4, 75%}, marking for each task whether it is ready, running on core 1, or running on core 2.]
Fig. 2. Schedule of a sample application using r-EDF
3 Asymmetric Multiprocessor Scheduling Infrastructure

In every multiprocessor system, the RTOS plays a key role in the deterministic and reliable operation of the application software, providing fundamental support services for task scheduling, resource management and inter-process communication. As regards AMP platforms, a wide variety of solutions characterizes the embedded real-time scenario, ranging from very lightweight and responsive RTOSs to fully featured and modularly expandable ones. Most embedded RTOSs, however, provide very limited support even for very common real-time software patterns. The run-time model they exhibit, in fact, merely consists of a set of plain processes whose execution is transparently carried out on a single processor in accordance with user-defined priority levels, further supported by conventional synchronization mechanisms such as semaphores, mutexes and task control system calls. Even the notion of periodic execution of a task is completely lacking, as well as its temporal attributes. Existing research projects that realize more advanced scheduling policies, such as LITMUS^RT [18] or Linux SCHED_EDF [19], cannot be applied in this context since they are intended for realizing unrestricted GSPs on SMP architectures. In order to address these issues and raise the design abstraction level, we developed a portable infrastructure which exploits for its operations only the primitives provided by the RTOS. It takes care of both the high-level interaction with the application designer and the low-level mechanisms which coordinate the RTOS instances of the AMP platform and realize the GSP. The only architectural assumption we make hereafter is the availability of a shared memory (whose hardware details are discussed in Sec. 4) accessible by all processors, in order to hold the working sets of the application tasks. The interface provided to the application designer requires only the attributes that characterize each task from the scheduling point of view. The arguments involved by the create_real_time_task primitive envisioned for this purpose include the temporal attributes (period and relative deadline), a collection of policy-dependent attributes that may be further required by the GSP for its operation (such as the time quantum for time-driven policies), and a job entry-point, i.e., a pointer to the task code taking as argument the reference to the memory working set; a sketch of this interface is given after Fig. 3.

3.1 The Shadow Process Model

The fundamental contribution brought in by the scheduling infrastructure is represented by the underlying run-time model, herein called the shadow process model. Its purpose is to concretely map periodic real-time tasks onto RTOS processes and
manage their execution flow in a centralized manner, according to the global decisions of the GSP it implements, realizing task-level migrations even if RTOS processes cannot actually be migrated. To this end, each time a real-time task is created through the above interface, the infrastructure instantiates a corresponding shadow process on each of the m processors. From the software standpoint, each shadow process consists in a cyclic routine (Fig. 3) that, upon each iteration triggered by the infrastructure, issues the execution of a new job of the corresponding real-time task, invoking the entry-point defined by the application designer. At the beginning of each cycle, a shadow process suspends itself (3.) waiting for scheduling decisions. When the infrastructure triggers the release of a new job, the shadow process is resumed, causing the RTOS to carry on its execution. In the absence of pending removal requests (4.), the shadow process starts the user-defined job (5.). When the job is completed, the event is notified back (6.), then the shadow process suspends itself again, waiting for the next trigger. At any time, at most one shadow process is made ready on each RTOS. This shadow process corresponds to the real-time task that is expected to execute on that processor according to the infrastructure. No ambiguity can exist as regards the RTOS behavior which, independently of its inner scheduling policy (e.g., FIFO, round-robin), cannot do anything but execute the only ready process, if any. To accomplish this, we take advantage of the task suspend and task resume system calls offered by every RTOS to control the scheduling of processes. In lieu of these system calls, semaphores could be exploited to achieve a similar behavior. Full AMP compliance is ensured, since the restricted migration model guarantees that preempted tasks cannot be resumed on any processor other than the one where the job execution started; therefore no migration of process state is required. The only state that the infrastructure must keep coherent is the working set of each task, which might be accessed by distinct jobs of the same task from different processors and, as later discussed, depends on the hardware architecture.
1. void shadow_process(rt_task *t) {
2.   while (true) {
3.     OS_suspend(OS_SELF);
4.     if (t->exit) return;
5.     t->entry_point(t->ws);
6.     notify_job_completion(t);
7.   }
8. }

Fig. 3. Pseudo-code of the shadow process
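For concreteness, the task descriptor and the creation primitive referenced above could take a form along the following lines. This is a minimal sketch: only the exit, entry_point and ws fields and the attribute names appear in the paper, while the remaining types and fields are our assumptions:

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct rt_task {
        uint32_t period;            /* temporal attributes (e.g., in RTOS ticks) */
        uint32_t relative_deadline;
        uint32_t time_quantum;      /* policy-dependent attribute (time-driven GSPs) */
        void   (*entry_point)(void *ws); /* job code, invoked once per job release */
        void    *ws;                /* working set, placed in shared memory */
        bool     exit;              /* set when the task has to be removed */
        int      affinity;          /* processor where the current job runs */
        int      state;             /* IDLE / SCHEDULABLE / RUNNING / PREEMPTED */
    } rt_task;

    /* Creates the real-time task: the infrastructure fills the descriptor and
     * asks each of the m RTOS dispatchers to instantiate a shadow process. */
    rt_task *create_real_time_task(uint32_t period, uint32_t relative_deadline,
                                   uint32_t time_quantum,
                                   void (*entry_point)(void *), void *ws);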
3.2 The Scheduling Coordinator and RTOS Dispatchers

In order to concretely put this model in operation, the infrastructure needs to carry out two fundamental activities: realizing the GSP and enforcing its actions. In order to
decouple these two aspects, we introduce two dedicated software components, respectively the scheduling coordinator and the RTOS dispatchers. The scheduling coordinator, which from the software point of view is a unique process in the system and can be instantiated on any of the m processors, implements the GSP, taking all the scheduling decisions for the whole system. It perceives only a high-level, platform-independent view of the multiprocessor system. Its operation is decoupled, by means of message-passing interaction, from the underlying shadow process model, which is handled by the RTOS dispatchers. In more detail, the following messages are envisaged (a sketch of the message format is given after Fig. 4):

• create_task: sent by the scheduling coordinator to the m dispatchers when a real-time task is created through the create_real_time_task primitive, in order to let the dispatchers instantiate the corresponding shadow process.
• activate_task: sent by the scheduling coordinator to a dispatcher to activate (release or resume) a job on a specific processor.
• execution_completed: sent by a dispatcher to the scheduling coordinator to notify the completion of a job when the corresponding shadow process invokes the notify_job_completion primitive.

The state of the real-time tasks is rendered to the scheduling coordinator through two data structures, the task state vector and the processor state vector, which are updated as a result of the message exchange with the RTOS dispatchers (Fig. 4). The first keeps track of the evolution of each task, reporting, along with the temporal attributes, its current state (Table 1) and processor affinity. The second reflects the current state (either IDLE or BUSY) of each processor, as well as the real-time task currently running on it when in the BUSY state.
Fig. 4. Overview of the restricted-migration scheduling infrastructure
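As anticipated above, a minimal sketch of the coordinator/dispatcher messages; the message names come from the text, while the struct layout and the two helper functions are our own illustrative assumptions.

typedef enum {
    MSG_CREATE_TASK,          /* coordinator -> all m dispatchers            */
    MSG_ACTIVATE_TASK,        /* coordinator -> one dispatcher               */
    MSG_EXECUTION_COMPLETED   /* dispatcher  -> coordinator                  */
} msg_type;

typedef struct {
    msg_type type;
    int task_id;              /* index into the task state vector            */
    int cpu_id;               /* target (activation) or sender (completion)  */
} sched_msg;

void create_shadow_process(int task_id);   /* hypothetical helpers wrapping */
void resume_shadow_process(int task_id);   /* the shadow process actions    */

/* Dispatcher-side handling, triggered by the mailbox hardware. */
void dispatcher_handle(const sched_msg *m)
{
    switch (m->type) {
    case MSG_CREATE_TASK:   create_shadow_process(m->task_id); break;
    case MSG_ACTIVATE_TASK: resume_shadow_process(m->task_id); break;
    default:                break;
    }
}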
The role of the RTOS dispatchers is twofold: enforcing the actions issued by the scheduling coordinator (task creation and activation) and notifying back job completions. From the software standpoint, the former is realized through a service routine whose execution is triggered by the mailbox hardware. Correspondingly, the dispatcher performs the actions envisaged by the shadow process model, resuming the new task and possibly suspending the previously running task (if any) on the local RTOS. Analogously, the
notify_job_completion primitive, which is invoked by the shadow process after each job execution (Fig. 3), is modeled as a procedure which sends back the execution_completed message to the scheduling coordinator, allowing the GSP to proceed with further scheduling decisions.

Table 1. Run-time states of real-time tasks

State        Description
IDLE         The task is waiting for the next job release.
SCHEDULABLE  A new job can be released but has not yet been activated (e.g., due to higher-priority tasks).
RUNNING      The task has been activated and is currently running on the processor specified by its affinity.
PREEMPTED    Task execution has been preempted after beginning on the affinity processor.
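A hedged C sketch of the two bookkeeping structures just described; the state names follow Table 1 and the text, while the struct layouts are assumptions of ours.

typedef enum { TASK_IDLE, TASK_SCHEDULABLE, TASK_RUNNING, TASK_PREEMPTED } task_state;

typedef struct {                /* task state vector entry (one per task)   */
    task_state state;
    unsigned   period, deadline;
    int        affinity;        /* processor where the current job started  */
} task_entry;

typedef struct {                /* processor state vector entry             */
    enum { CPU_IDLE, CPU_BUSY } state;
    int running_task;           /* meaningful only when CPU_BUSY            */
} cpu_entry;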
4 Evaluation Methodology

Originally exploited as prototyping platforms for later implementation in ASICs, FPGAs have become feasible vehicles for final designs, enabling an agile integration of manifold hardware resources, such as general-purpose processors (soft-cores), memories and peripheral devices, suitably interconnected via a customizable bus. Currently available design tools leave high degrees of freedom to the designer, particularly as regards the inter-processor communication infrastructure and the memory layout. Customization options typically involve not only the choice of the memory technology, which can range from fast on-chip memory to external solutions, but also the interconnection topology, allowing a memory either to be tightly coupled to a single core, avoiding any contention, or to be shared, sacrificing access times in favor of resource saving. The Altera NIOS-II soft-core has been chosen as the reference architecture for the experimental evaluations, due to the flexibility of its integrated development environment, which permits easy customization of different hardware templates transparently supported by the bundled μC/OS-II RTOS. The NIOS-II/f fast version employed in our experiments can be further endowed with a write-back, directly mapped data cache (D-cache), which reduces bus contention by exploiting the spatial and temporal locality of memory accesses. Lacking any hardware coherency support, explicit cache flushes and proper synchronization must be handled by software in order to guarantee the coherency of memory shared by different cores. The message-passing infrastructure has been realized using the FIFO core provided by the Altera SOPC Builder, realizing a 1-to-m bidirectional channel between soft-cores. Using an Altera Cyclone IV FPGA clocked at 50 MHz and combining different memory and cache layouts as shown in Table 2, we investigated four reference hardware templates based on NIOS-II/f cores: shared memory (TS), shared memory with D-cache (TSC), dedicated memory (TD), dedicated memory with D-cache (TDC). As regards the memory technology, we used internal M9K SRAM blocks for the on-chip memory and an external SDRAM module for the shared memory. In order to preserve the memory consistency of the shadow process model in the TSC and TDC templates, explicit cache flushes are performed on job boundaries, as sketched below.
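On the cached templates, the job-boundary flush just mentioned can rely on the NIOS-II HAL cache routines; a minimal sketch, assuming the alt_dcache_flush(start, len) HAL call (verify the exact signature against the BSP; the wrapper name is ours):

#include <sys/alt_cache.h>

/* Write back and invalidate a task's working set at a job boundary so
   that a later job of the same task, possibly released on another core,
   observes consistent data in the shared memory. */
static void flush_working_set(void *ws, unsigned len)
{
    alt_dcache_flush(ws, len);
}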
Table 2. Configuration of the reference hardware templates
                                     TS                TSC               TD                TDC
Instruction cache                    2 kB              2 kB              2 kB              2 kB
Data cache                           No                2 kB              No                2 kB
RTOS memory (instructions + data)    External memory   External memory   On-chip memory    On-chip memory
Tasks memory (instructions)          External memory   External memory   On-chip memory    On-chip memory
Tasks memory (data)                  External memory   External memory   External memory   External memory
The goals of the experimental evaluation are twofold.

Infrastructure overhead. Two key factors contribute to such overhead: (i) the job activation overhead, i.e. the interval that elapses between the issue of an activate_task message by the GSP and the execution of the corresponding shadow process; (ii) the job completion overhead, i.e. the interval that elapses between the completion of a job, the update of the working set and the reception of the corresponding message by the GSP. The additional time taken by the GSP to carry out its scheduling decisions has not been accounted for, since it strongly depends on the particular GSP employed and is extensively discussed in the respective studies referenced herein.

Performance slowdown. Apart from the infrastructure overhead itself, we analyze how the run-time execution of application tasks is further biased by the hardware platform. The different hardware templates, in fact, are likely to respond differently to the workload of the real-time tasks, in particular to changes in the number of cores simultaneously executing and in their working-set size. Furthermore, the more or less frequent context switches and task migrations issued by the GSP can additionally contribute to the run-time duration. In order to account for these additional contributions and determine the factors which influence them, we set up an experimental test-bench which combines (Fig. 5) the four hardware templates (T) with 4 different numbers of cores (m), 6 working-set sizes (S, referred to as W in Figs. 7 and 8), 4 preemption rates (P) and 4 migration rates (M, expressed in migrations per period), for a total of 1536 scenarios.
Fig. 5. Test-bench parameters
For each scenario, we schedule a fixed set of 16 identical tasks, in which each job executes a CoreMark [20] instance in order to emulate a realistic workload on the working set. Task periods were chosen to be long enough to compensate for the duration variance due to the different platforms, avoiding overrun conditions. We employ a regular scheduling pattern relying on a quantum-driven round-robin scheme, in order to deliver a constant number of preemptions and migrations according to the configuration of each scenario. At each period the 16 tasks are arranged in m clusters and each cluster is scheduled on each core in round-robin fashion using a time-quantum P ('NO' means that task jobs are executed sequentially). On the next period the pattern repeats, shifting the clusters by M positions, as sketched below.
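A small C sketch of this pattern (our own illustration of the rule just described, not the authors' test-bench code): the task-to-cluster assignment is fixed, while the cluster-to-core mapping rotates by M positions every period.

#define NTASKS 16

/* Core on which a task runs during a given period, for m cores and a
   per-period cluster shift of M positions. */
int core_of_task(int task, int m, int period_idx, int M)
{
    int cluster = task % m;                 /* fixed clustering of the tasks */
    return (cluster + period_idx * M) % m;  /* clusters rotate across cores  */
}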
5 Experimental Results

Figs. 6a and 6b show the two contributions to the infrastructure overhead. Each column reports the overhead measured for each hardware template as a function of m, averaged over the variations of the S, P and M parameters which, not surprisingly, turned out to have a negligible influence on the infrastructure overhead. The job activation measurements show that both the TD and TDC templates exhibit an almost constant overhead as m increases, since the operations performed on the shared memory are minimal. On the other hand, the TS and TSC templates exhibit worse scalability, in particular in the case of simultaneous activations on the cores, as both data and instruction ports contribute to the contention of the shared memory module when RTOS scheduling primitives are invoked. Furthermore, it may also be noted that, for both the dedicated and shared cases, the templates involving a data cache exhibit slightly higher overheads. The limited size of the data cache, in fact, is likely to cause a lag due to the write-back of stale cache lines prior to executing the dispatcher code, producing for such a short routine the opposite effect to the one expected. As regards the completion overheads, both the TS and TD templates exhibit a very limited, yet expected, contribution. The corresponding templates involving a data cache, instead, introduce a more substantial overhead (on the order of tens of microseconds), required to invalidate and write back the data cache in order to preserve the consistency of the working sets. In this case, while the TDC template exhibits an almost linear behavior, the TSC template suffers from concurrent data and instruction cache contention, causing increased (≈ 2x) overheads in the 8-core configuration. Cumulative infrastructure overheads are shown in Fig. 6c as the sum of the two contributions. The dedicated templates exhibit overall good scalability, inducing a small and almost constant overhead even in the 8-core configuration, while the shared templates prove to be negatively influenced by the shared memory bottleneck.
Fig. 6. Infrastructure overhead due to job activation (a), completion (b) and cumulative (c)
In addition to the overhead directly introduced by the scheduling infrastructure, Figs. 7(a-d) show how the run-time performance of application tasks is affected by preemptions. Each of the 4 charts reports the average time required to complete a whole job, issuing preemptions at different rates (according to the P parameter), as a function of m, under each hardware template. TD proves to be the least influenced template, incurring, in the {m=8 cores; P=1 ms} configuration, a slowdown of 1.8% (7 μs) compared to the sequential execution case. In the corresponding template involving a data cache (TDC), preemptions caused a higher relative increase of 6.9% (5 μs)
in the analogous configuration. The shared templates proved to suffer most from the influence of preemptions: in particular, TS exhibits a slowdown of 24.5% (98 μs) in the {m=8 cores; P=1 ms} configuration, while the introduction of the data cache induces in the TSC template a slowdown of 30.8% (25 μs). As a broader consideration, it may be noted that the effect of the data cache on the preemption overhead is of a lesser extent compared to the speedup it provides to the tasks' run time. In order to provide a comparative evaluation of the overall run-time overhead factors, Figs. 8(a-d) show, for each hardware template, the relative slowdowns, highlighting, at variations of W, the difference between the slowdown due to the hardware architecture and the slowdown due to the scheduling infrastructure. For each column, the lower colored part reports the ratio between the average run time on the m-way multiprocessor configuration performing sequential job execution and the corresponding measurement on the uniprocessor configuration. The upper (red) part shows the surplus slowdown introduced by the infrastructure, using the preemptive round-robin execution with the tightest (P = 1 ms) quantum. It may be clearly noted that the slowdown introduced by the infrastructure is definitely marginal in the TD and TS templates when compared to the slowdown introduced by the multiprocessor hardware architecture. Such slowdown becomes comparable only in the TDC and TSC templates, highlighting how preemptions suffer from a worse exploitation of the caches. As a final remark, it may be noted that none of the considered graphs reports the effect of task migrations: in all of the combinations considered, changes of the M parameter did not produce any remarkable effect on the measurements, therefore they have been omitted.
Fig. 7. Absolute run-time performance of the TD (a), TDC (b), TS (c) and TSC (d) templates, varying the m and P parameters with W = 16 kB
Fig. 8. Relative slowdown of the TD (a), TDC (b), TS (c) and TSC (d) templates, varying the W and m parameters
6 Concluding Remarks

We presented the essential implementation details of a portable scheduling infrastructure which enables the global scheduling of real-time tasks on asymmetric multiprocessor platforms, according to the restricted-migration model. We put the focus on the mechanisms which, regardless of the particular scheduling policy employed, allow job preemptions and task migrations to be performed arbitrarily on mainstream embedded AMP platforms, employing only the elementary scheduling primitives offered by almost every RTOS. In order to decouple these low-level scheduling mechanisms from user-definable high-level scheduling policies, we presented a run-time approach, called the shadow process model, which introduces two software components with the aim of managing the two aspects separately, handling the decoupling by means of message-passing interaction. We experimentally evaluated the viability of our approach employing four reference FPGA-based multiprocessor templates combining different memory models and cache layouts, and analyzed both the overhead directly introduced by our infrastructure and the further consequences on run-time performance, paying particular attention to the effect of scheduling decisions, i.e. preemptions and migrations, on the tasks' run time. In this regard we showed that the overhead introduced by the proposed infrastructure has a limited extent, especially on the hardware platforms which provide private memory for the RTOS. Furthermore, we showed that job preemptions induce a slowdown which is smaller than the slowdown caused by the multiprocessor parallelism. Task migrations, instead, showed no remarkable effect in the proposed approach.
As future research directions, the experimental evaluations presented herein should be extended to contemplate more complex MPSoC architectures involving other communication and interaction paradigms, such as networks-on-chip, studying the viability of the approach on hardware platforms which do not assume any shared memory. Furthermore, we plan to exploit the hardware configurability of FPGAs to replace the scheduling coordinator with a hardware implementation, freeing the soft-cores from the computational cost of the scheduling policy.
References

1. Lee, E.A.: What's ahead for embedded software? Computer 33, 18–26 (2000)
2. Martin, G.: Overview of the MPSoC design challenge. In: 43rd ACM/IEEE Design Automation Conf., pp. 274–279 (2006)
3. Tumeo, A., et al.: A dual-priority real-time multiprocessor system on FPGA for automotive applications. In: DATE 2008, pp. 1039–1044 (2008)
4. Ben Othman, S., Ben Salem, A.K., Ben Saoud, S.: MPSoC design of RT control applications based on FPGA SoftCore processors. In: ICECS 2008, pp. 404–409 (2008)
5. Joost, R., Salomon, R.: Advantages of FPGA-based multiprocessor systems in industrial applications. In: IECON 2005, p. 6 (2005)
6. Baruah, S., Carpenter, J.: Multiprocessor fixed-priority scheduling with restricted interprocessor migrations. In: ECRTS 2003, pp. 195–202 (2003)
7. Funk, S., Baruah, S.K.: Restricting EDF migration on uniform multiprocessors. In: 12th International Conference on Real-Time Systems (2004)
8. Carpenter, J., Holman, P., Anderson, J., Srinivasan, A., Baruah, S., et al.: Handbook of Scheduling: Algorithms, Models, and Performance Analysis, pp. 30-1–30-19. Chapman and Hall/CRC, Boca Raton (2004)
9. Devi, C.U., Anderson, J.: Tardiness bounds under global EDF scheduling on a multiprocessor. Real-Time Systems 38, 133–189 (2008)
10. Lauzac, S., Melhem, R., Mosse, D.: Comparison of global and partitioning schemes for scheduling rate monotonic tasks on a multiprocessor. In: 10th Euromicro Workshop on Real-Time Systems, pp. 188–195 (1998)
11. Xie, X., et al.: Asymmetric Multi-Processor Architecture for Reconfigurable System-on-Chip and Operating System Abstractions. In: ICFPT 2007, pp. 41–48 (2007)
12. Monot, A., et al.: Multicore scheduling in automotive ECUs. In: ERTSS 2010 (2010)
13. Hung, A., Bishop, W., Kennings, A.: Symmetric multiprocessing on programmable chips made easy. In: DATE 2005, pp. 240–245 (2005)
14. Huerta, P., et al.: Exploring FPGA capabilities for building symmetric multiprocessor systems. In: SPL 2007, pp. 113–118 (2007)
15. Huerta, P., et al.: Symmetric Multiprocessor Systems on FPGA, pp. 279–283. IEEE Computer Society, Los Alamitos (2009)
16. Baruah, S.: The Non-preemptive Scheduling of Periodic Tasks upon Multiprocessors. Real-Time Systems 32, 9–20 (2006)
17. Kargahi, M., Movaghar, A.: Non-preemptive earliest-deadline-first scheduling policy: a performance study. In: MASCOTS 2005, pp. 201–208 (2005)
18. Calandrino, J., et al.: LITMUS^RT: A Testbed for Empirically Comparing Real-Time Multiprocessor Schedulers. In: RTSS 2006, pp. 111–126 (2006)
19. Faggioli, D., et al.: An EDF scheduling class for the Linux kernel. In: Real-Time Linux Workshop (2009)
20. The Embedded Microprocessor Benchmark Consortium: EEMBC Benchmark Suite
Emotional Contribution Process Implementations on Parallel Processors Carlos Domínguez, Houcine Hassan, José Albaladejo, Maria Marco, and Alfons Crespo Departamento de Informática de Sistemas y Computadores, Universidad Politécnica de Valencia, Valencia, Spain [email protected]
Abstract. An emotional agent software architecture for real-time mobile robotic applications has been developed. In order to allow the agent to undertake more dynamically constrained application problem solving, the computation time of the operational processes should be reduced, so that the time gained can be used for executing more complex processes. In this paper, the response time of the operational processes, in each attention cycle of the agent, is decreased by parallelizing the highly parallel processes of the architecture, namely, the emotional contribution processes. The implementation of these processes has been evaluated on Field Programmable Gate Arrays (FPGAs) and multicore processors. Keywords: FPGA, Multicore, Load balancing, Robotics, Agents, Real-time.
1 Introduction

Robotic agents can solve problems in dynamic environments with uncertainty. The agents are supposed to have considerable autonomy to define their objectives and apply the appropriate strategies to reach them. Many agent architectures have been proposed, from the purely reactive to the purely deliberative ones, through hybrid solutions as well. One of the approaches widely studied by different authors is the emotional approach [7], inspired by natural emotional agents. Various models of emotion have been proposed. Many researchers consider mainly the problem of the expression of emotional states by the agent, which is very useful in the communication of people with machines and between artificial agents [8]. Other researchers, however, consider the emotional process from a more general point of view, as a mechanism for the motivation of the agent behavior [9]. In this sense, RTEA (Real-Time Emotional Agent) [10], an emotional agent architecture for real-time applications, has been developed. The RTEA architecture defines a set of operational processes (emotion, motivation and attention) which are executed together with the application processes that solve specific problems. An important parameter in RTEA, which limits the type of problems that the agent can solve, is the maximum frequency of its cycle of attention. In every cycle of attention, the processor of RTEA must complete all the operational processes (situation appraisal, emotion, motivation and attention) and additionally it must have sufficient bandwidth to significantly advance the problem-solving processes, which
are composed of both reactive processes and deliberative processes. Thus, the capacity of the processor is an important parameter when deciding how to manage, with a predetermined risk level, the solving of a problem with given dynamics. An RTEA implementation has been developed to control a mobile service robot using a general-purpose processor, in which both the operational processes and the application processes run on a single-core processor with a multitasking operating system. In this type of application, the agent has to deal with the resolution of a large set of simultaneous problems such as transport of objects, information point, security, cleaning, etc. The agenda of the agent can grow substantially. Since the agent must select its targets by running the operational processes of appraisal, emotion, motivation and attention, and these processes must be evaluated in every cycle of attention, the relative load of these processes may be significant. In this paper we propose to reduce the relative load of the operational processes in the RTEA implementation, thereby increasing the bandwidth available for application processes or, alternatively, shortening the period of the attention cycle so as to deal with dynamic problems requiring shorter response times. Specifically, in our application we have improved the agent performance by increasing the navigation and operation speed of the mobile robot. To reduce the processing time of the operational processes we have considered two different alternatives for the processor. On the one hand, the utilization of a general-purpose multicore processor, which can dedicate specific cores to run the operational processes of emotion, motivation and attention, and other cores to run the application processes, balancing the process load between the different cores. On the other hand, the design of a special-purpose processor for the operational processes on FPGA devices and the implementation of the full system on a hybrid processor, with general-purpose cores for the application processes and special-purpose cores for the operational processes. For the specific processor design, the operational processes have been characterized and the parallel processes have been identified. We have described the emotional processor in VHDL. The project has evaluated both processor alternatives using a set of problems of varying complexity, considering the benefits achievable with a set of commercially available FPGAs and multicore processors. The rest of the paper is organized as follows: Section 2 reviews the state of the art of FPGA implementations. Section 3 describes the general characteristics of the RTEA architecture and highlights the processes to be executed in the proposed processor. Section 4 describes the design of both processor alternatives. Section 5 sets out the evaluation and presents the results. Finally, Section 6 summarizes the conclusions.
2 Related Work

In [1] a review of Field Programmable Gate Array (FPGA) technologies and their contribution to industrial control applications is presented. To illustrate the benefits of using FPGAs in complex control applications, a sensorless motor controller based on the Extended Kalman Filter is studied. In [2], a coarse-grain parallel deoxyribonucleic acid (DNA) algorithm for optimal configurations of an omnidirectional mobile robot with a five-link robotic arm performing fire extinguishment is presented. Hardware/software co-design and System-on-a-Programmable-Chip technology
on an FPGA are employed to implement the proposed algorithm and significantly shorten its processing time. A hardware-software coprocessing speech recognizer for real-time embedded applications is presented in [3]. The system consists of a standard microprocessor and a hardware accelerator for Gaussian mixture model (GMM) emission probability calculation implemented on an FPGA. The GMM accelerator is optimized for timing performance by exploiting data parallelism. The development and implementation of a generalized backpropagation multilayer perceptron architecture, described in VHDL, is proposed in [4]. By exploiting the reconfigurability of FPGAs, the authors are able to perform fast prototyping of hardware-based ANNs to find optimal application-specific configurations, in terms of the cost/speed/accuracy trade-offs affecting hardware-based neural networks. A design environment for the synthesis of embedded fuzzy logic controllers on FPGAs, which provides a novel implementation technique, has been developed in [5]. This technique allows accelerating the exploration of the design space of fuzzy control modules, as well as a co-design flow that eases their integration into complex control systems. In [6] an embedded adaptive robust controller for trajectory tracking and stabilization of an omnidirectional mobile platform is proposed. This adaptive controller is implemented on a high-performance field-programmable gate array chip using hardware/software co-design techniques and system-on-a-programmable-chip design. Simulations show the effectiveness and merit of the proposed control method in comparison with a conventional proportional-integral feedback controller.
3 Model of the Agent

In RTEA, the agent behavior is based on the concept of problem solving. A thought is an execution context of mental processes of observation, deduction, decision and action, related to the resolution of the problem generated by a desire. Every thought has a level of motivation, which is the basic parameter used in the negotiation of attention, and therefore it plays an essential role in the direction that the actual behavior takes. The mechanism of thought motivation in RTEA is emotional. An emotion is a process in which the evaluation of the current situation produces an emotional state, and this triggers as a response a motivation for a behavior related to the situation. Figure 1 shows the general flow of information in an RTEA agent. A main branch of this flow of information is covered by the application processes, and it connects the interface devices with the environment (sensors and motors) through two major ways, the reactive one and the deliberative one. This main branch develops and materializes the effective behavior of the agent, the one that has effects on its environment, giving responses to stimuli. A second main branch of the information flow is covered by the operational processes: emotion-motivation-attention and emotion-desire. These operational processes are reactive processes, i.e. with bounded and relatively short response times. They embody the emotional behavior of the agent that causes changes in its attitude. The response of the so-called intrinsic emotions consists of establishing the level of motivation of the associated thoughts.
Fig. 1. Flow control in the RTEA agent
Fig. 2. Emotional control
Figure 2 shows the branch of the emotional flow that is established from the situations to the motivations. Emotional process concepts are represented as normalized real numbers: Situation Appraisal [-1, +1], Emotional Contribution [-1, +1], Emotional State [0, +1], Motivation of Thought [0, +1]. Note that different appraisals can contribute to the same emotional state. For example, the "fear to crash" emotion could consider appraisals like "distance" and "relative speed", so that even at a small distance, if the object moves away, the appreciation of the collision problem may decrease. Therefore, emotional contributions should be weighted, defining a unitary partition, so that the emotional state is always defined within its standardized range [0, +1]. The situation appraisal, emotional contribution and response processes are based on appraisal, contribution and response functions, respectively. Sigmoid-type functions have been chosen because of their descriptive properties, as they adapt well to the models of appraisal, contribution and response we want to represent, with slight variations at the ends of the range that tend to asymptotic values and abrupt variations
around an inflection point in the center of the range. Specifically, sigmoid functions and hyperbolic tangents have been used. The basic hyperbolic tangent is given in (1).
y(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}    (1)
The hyperbolic tangent is an s-shaped curve, with rapid growth around its center point and two saturations at the ends, following two asymptotes. To speed up or slow down the activation and to vary its intensity, a translation, a scaling and offsets are applied. The parametric hyperbolic tangent is shown in (2).
y(x) = \frac{e^{(x - x_0)k_x} - e^{-(x - x_0)k_x}}{e^{(x - x_0)k_x} + e^{-(x - x_0)k_x}} \, k_y + y_0    (2)
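Since (1) is exactly tanh(x), (2) reduces to a scaled and shifted hyperbolic tangent; a direct C transcription is given here for reference (the function name is ours).

#include <math.h>

/* Parametric hyperbolic tangent of (2): x0 shifts the inflection point,
   kx controls the slope, ky the output scale and y0 the output offset. */
double parametric_tanh(double x, double x0, double kx,
                       double ky, double y0)
{
    return tanh((x - x0) * kx) * ky + y0;
}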
In a first phase, we have identified three main parts of the emotional architecture, based on the above functions, whose critical processes could be executed in parallel: (1) the emotional motivation process, which makes a subjective appraisal of the current situation of the problem, activates an emotional state based on that appraisal and, in response, motivates certain behaviors; (2) the attention process, which allocates processing resources to the problem-solving processes; (3) the set of reactive processes of the application, which require strict deadlines on the response of the agent. This paper considers the emotional motivation process (central part of Figure 2), and more specifically the contribution process, in which a set of appraisals of the current situation contribute to establishing an emotional state; a sketch is given below.
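As an illustration of that contribution stage, a hedged C sketch reusing parametric_tanh above; the function name and the parameter values (chosen only so that each contribution is normalized and the weights form a unitary partition) are our assumptions, not the paper's tuned configuration.

/* Weighted combination of n appraisals (each in [-1,+1]) into an
   emotional state in [0,+1], assuming the weights sum to 1. */
double emotional_state(const double *appraisal,
                       const double *weight, int n)
{
    double state = 0.0;
    for (int i = 0; i < n; i++) {
        /* map each appraisal into a normalized contribution in [0,+1] */
        double c = parametric_tanh(appraisal[i], 0.0, 2.0, 0.5, 0.5);
        state += weight[i] * c;
    }
    return state;
}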
4 FPGA and Multicore Agent Process Design

4.1 Problem Definition

The computational load is related to the complexity of the application problem and to the period of the attention cycle, which in turn is related to the dynamics of the application problem (another important parameter in the measure of its complexity). The complexity varies widely depending on the problem, so we consider a specific example of application consisting of the problem of controlling a mobile service robot in a public building. The assistant robot provides basic services (e.g., information, transportation of goods, surveillance of facilities and cleaning) in public buildings. In this application, users make service requests, which are recorded in the agenda of the agent. The resolution of each of these problems is motivated by an emotional process. These emotional motivation processes are triggered in each attention cycle of the agent. A typical attention cycle is between 0.1 s and 1 s depending on the dynamics of the problem. For the transport problem, the number of running emotional processes may be around 10000. For a more complex problem that integrates various services, this number could reach 8 million processes. Because of the large number of processes generated to solve a problem, a small part of the operational processes has been selected, particularly the emotional motivation system, and we have identified the possibility of executing in parallel
these processes, to be implemented on FPGAs and multicore processors, in order to invest the saved resources in the execution of more processes that allow more complex problems to be undertaken. From this analysis it was noticed that these processes are highly parallel and that the implementation of a small subset of the emotional architecture processes could be carried out on commercial multicore processors or FPGAs of medium performance. This article proposes a comparative study of the implementation of a subset of these emotional processes of the agent in specific systems based on FPGAs and multicore processors, depending on the complexity of the problem (emotional contributions executed per cycle of attention), in terms of the execution time of the emotional processes. To this end, an emotional process design in C++ for multicores has been considered, as well as a Matlab implementation. A VHDL design of the emotional process on FPGAs has also been proposed. For evaluation purposes, the computational load of the emotional processor has been defined as the number of emotional contributions per unit of time (MOPS, millions of operations per second). The relationship between 1 MOPS and 1 MFLOPS is 240.

4.2 FPGA Implementation

The block diagram of the implementation of one of the basic functions that compose an emotion is shown in Figure 3. To compare the performance of the previous solutions with a semi-custom hardware implementation, the function is implemented using the resources of the function library provided by the development tool for FPGAs, Altera Quartus II 10.1.
Fig. 3. Block diagram of a basic function composing an emotional contribution
The design and implementation of the agent's emotional processes on FPGAs has been carried out in a modular way, using the library components available in the VHDL specification language. Furthermore, a functional simulation has been performed to test the validity of the design. For the synthesis, FPGA application models of varying complexity have been used, in order to analyze the level of parallelization achievable according to the available resources. Then, a post-synthesis simulation has been performed to verify that the VHDL design could be implemented as logic blocks. Besides, "Placement and Routing" has been carried out to obtain good connections and thus top performance of the device operating frequency. Finally, the design has been validated on an Altera STRATIX III FPGA (model EP3SE50F780C2) of medium performance.
4.3 Multicore Implementation

Regarding the implementation of the emotional processes in C++ running on a multicore, several aspects have been considered. On the one hand, the agent software, which has been executed sequentially, consists of five main modules: belief, relationship, behavior, emotion and attention. The emotional module, which is highly parallel as mentioned above, has been extracted to run on a multicore processor (where each core is a 3.3 GHz i5 processor), and performance measures have been taken to compare the results with those obtained when executing the processes on FPGA-specific processors. The execution of the emotional process on multicore systems has been performed at the process level, using the operating system scheduler to balance the load.
5 Experimental Results

Regarding multicore processors, 100 sets of 1000 tasks each have been defined and, using operating system services, task priorities have been assigned, as well as their affinity to cores within the processor. Experiments have been executed on a multicore machine with 1, 2 and 4 cores, where each core is a 3.33 GHz Intel Core i5. For each processor configuration, two implementations of the emotional contribution processes, in two different languages, have been executed to compare the differences between them. First, we decided to implement the processes in Matlab, since it is widely used software in the automation and control community. Then, the C++ implementation was developed to analyze the overhead that the compiler could generate. The results of these implementations can be seen in Figure 4.
Fig. 4. Multicore software implementation
The processes implemented in C++ provide better computing capabilities than the same processes implemented in Matlab. In the case of C++, the assignment of sets of emotional processes to the processor, when using 1, 2 and 4 cores, provides a computing capacity of around 25, 47 and 89 million operations per second, respectively. As for Matlab, the number of operations per second is lower, at around
9, 17 and 30 MOPS. It can be observed that the results are even more favorable for C++ as the number of cores increases: in proportion, the improvements of the C++ implementation with respect to Matlab are about 2, 2.5 and 2.8 times for 1, 2 and 4 cores, respectively. Therefore, in the successive studies the comparisons between the FPGA and the multicore will be performed with the C++ implementation of the emotional processes. The next experiment consists of comparing the results obtained with multicores against an optimized implementation of the contribution processes on an FPGA: the STRATIX III (EP3SE50F780C2) device from ALTERA. Speed optimization has been selected, since the bottleneck is imposed by the number of inputs and outputs of the FPGA and by the DSP (Digital Signal Processing) blocks that the device incorporates. For the proposed design, the number of DSP circuits able to operate in parallel on the STRATIX III is 7, with 4-stage pipelining. The results can be seen in Figure 5.
Fig. 5. Multicore and FPGA performance
For the service robot application of carrying objects, the processing capacity of both implementations (FPGA and multicore processors) has been evaluated. A set of 150 simulations of the transportation problem, of varying complexity (e.g., varying the number of pieces and routes), has been evaluated. Depending on the complexity of the problem, between 10000 and 8 million emotional contributions can arise in each attention cycle of the agent for each simulation, taking into account that the attention cycle of the agent on high alert has been defined as 0.2 s. In summary, Figure 5 shows the average results obtained for the set of tested simulations. The Stratix III FPGA has provided a processing capacity of about 14 MOPC (millions of emotional contribution operations per attention cycle of the agent). On the other hand, with the multicore processors, the processing capacity is on average 5 MOPC using 1 core, 9.4 MOPC with 2 cores and 17.8 MOPC with 4 cores. For the specific service robotic application evaluated, of limited complexity (transport: from 10000 contributions to 8 MOPC), it has been shown that the problem can be resolved with an FPGA of medium performance such as the Stratix III (14 MOPC), by using the proposed parallel design and pipelining. However, for the multicore
processors, the application requires at least 2 cores (9.4 MOPC). In this case, other cognitive processes of the agent (deliberative, planning, learning) are being executed on the other cores that are not being used for the calculation of emotional contributions. Note that, for more complex problems, the number of cores needed would also grow.
6 Conclusions

The analyzed FPGA, even being a development system, allows a greater number of operations per attention cycle of the agent than the dual-core processor, thanks to the proposed parallelization and pipelining. Therefore, for the prototype of the service robot the choice of the FPGA may be more convenient, as it relieves the multicore processor from the emotional processes; otherwise, the number of cores that the agent would need to provide to solve more complex problems would be insufficient. It should be pointed out that the analyzed problem is of low complexity; aspects such as the attention and application processes, which have not been considered here, would further load the multicore processor and would worsen the chances of multicore problem solving. For more complex applications of the service robot (e.g., integrating multiple services simultaneously), the required computing power would be even greater, so higher-performance FPGAs should be analyzed in future work. In this case, FPGA prices would start to be prohibitive for the development of the prototype. However, the current market trend is to have, in the near future, processors with a large number of cores (e.g., 32) at a very competitive price. Under these conditions, a larger number of cores (e.g., 6) could be dedicated to parallelizing a larger number of processes for more complex service applications. In this case, aspects such as the distribution of the load between cores should be analyzed.
References

1. Monmasson, E., Idkhajine, L., Cirstea, M.N., Bahri, I., Tisan, A., Naouar, M.W.: FPGAs in Industrial Control Applications. IEEE Trans. on Industrial Informatics 7(2) (2011)
2. Tsai, C.-C., Huang, H.-C., Lin, S.-C.: FPGA-based parallel DNA algorithm for optimal configurations of an omnidirectional mobile service robot performing fire extinguishment. IEEE Trans. on Ind. Electron. 58(3), 1016 (2011)
3. Cheng, O., Abdulla, W., Salcic, Z.: Hardware-software codesign of automatic speech recognition system for embedded real-time applications. IEEE Trans. on Ind. Electron. 58(3), 850–859 (2011)
4. Gomperts, A., Ukil, A., Zurfluh, F.: Development and Implementation of Parameterized FPGA-Based General Purpose Neural Networks for Online Applications. IEEE Trans. on Industrial Informatics 7(1) (2011)
5. Chia-Feng, J., Chun-Ming, L., Chiang, L., Chi-Yen, W.: Ant colony optimization algorithm for fuzzy controller design and its FPGA implementation. IEEE Trans. on Ind. Electron. 55(3), 1453–1462 (2008)
6. Huang, H.-C., Tsai, C.-C.: FPGA Implementation of an Embedded Robust Adaptive Controller for Autonomous Omnidirectional Mobile Platform. IEEE Trans. on Industrial Electronics 56(5), 1604–1616 (2009)
7. Damiano, L., Cañamero, L.: Constructing Emotions. Epistemological groundings and applications in robotics for a synthetic approach to emotions. In: AI-Inspired Biology (AIIB) Symposium, Leicester, UK (2010)
8. Moshkina, L., Arkin, R.C.: Beyond Humanoid Emotions: Incorporating Traits, Attitudes and Moods. In: IEEE International Conference on Robotics and Automation (2009)
9. Sloman, A.: Some Requirements for Human-Like Robots: Why the Recent Over-Emphasis on Embodiment Has Held Up Progress. In: Sendhoff, B., Körner, E., Sporns, O., Ritter, H., Doya, K. (eds.) Creating Brain-Like Intelligence. LNCS, vol. 5436, pp. 248–277. Springer, Heidelberg (2009)
10. Domínguez, C., Hassan, H., Albaladejo, J., Crespo, A.: Simulation Framework for Validation of Emotional Agents. In: Arabnia, H.R. (ed.) The 2010 International Conference on Artificial Intelligence, Las Vegas, Nevada, USA. CSREA Press (2010)
A Cluster Computer Performance Predictor for Memory Scheduling Mónica Serrano, Julio Sahuquillo, Houcine Hassan, Salvador Petit, and José Duato Department of Computer Engineering (DISCA), Universidad Politécnica de Valencia, Camino de Vera s/n, 46022 Valencia, Spain [email protected], {jsahuqui,husein,spetit,jduato}@disca.upv.es
Abstract. Remote Memory Access (RMA) hardware allows a given motherboard in a cluster to directly access the memory installed in a remote motherboard of the same cluster. In recent works, this characteristic has been used to extend the addressable memory space of selected motherboards, which enables a better balance of main memory resources among cluster applications. This approach is much more cost-effective than implementing a full-fledged shared memory system. In this context, the memory scheduler is in charge of finding a suitable distribution of local and remote memory that maximizes performance and guarantees a minimum QoS among the applications. Note that since changing the memory distribution is a slow process involving several motherboards, the memory scheduler needs to make sure that the target distribution provides better performance than the current one. In this paper, a performance predictor is designed in order to find the best memory distribution for a given set of applications executing on a cluster motherboard. The predictor uses simple hardware counters to estimate the expected impact on performance of the different memory distributions. The hardware counters provide the predictor with information about the time spent in processor, memory access and network. The performance model used by the predictor has been validated in a detailed microarchitectural simulator using real benchmarks. Results show that the prediction accuracy never deviates more than 5% from the real results, being less than 0.5% in most of the cases. Keywords: cluster computers, memory scheduling, remote memory assignment, performance estimation.
1 Introduction

Since their introduction, cluster computers have been improving their performance and lowering their implementation costs with respect to supercomputers. Nowadays, it is easy to find many computer organizations of this type in the top positions of high-performance computer rankings such as the TOP500 [1]. This transition has been possible as advanced microarchitectural techniques and interconnection solutions only available in supercomputers enter the consumer market (i.e., they are commoditized), which in
turn allows new ways to improve the performance of current cluster designs while maintaining or even lowering their expenses. However, since cluster architectures are loosely coupled by design, there is no standard commodity framework supporting the access to memory installed on remote nodes. Therefore, to cope with applications demanding large amounts of main memory (e.g., enterprise-level databases and services, large computing-intensive parallel applications, etc.), cluster systems must rely on slower OS-based solutions such as swapping on remote RAM disks or implementing software-based shared memory. This, in turn, reduces the competitive advantages of this type of system. So far, Remote Memory Access (RMA) hardware [2], which allows a given node to directly access remote memory, has only been available in supercomputer systems like BlueGene/L [3], BlueGene/P [4], or the Cray XT [5]. Nevertheless, commodity implementations for cluster computers are already entering the market. For example, the HyperTransport consortium [6], which is composed of more than 60 members from the leading industry (AMD, HP, Dell, IBM, etc.) and universities, is extending the HyperTransport technology, enabling the development of cluster systems supporting remote memory accesses. This work focuses on a cluster prototype that implements the aforementioned HyperTransport extensions and whose nodes are linked using a fast interconnection network. In this context, we assume that the OS running on the nodes offers inter-node memory allocation capabilities that enable the assignment of remote memory portions to local applications. As these regions have different latencies, the performance of a given application strongly depends on how its assigned memory is distributed among the regions. Since each application contributes with its performance to the global performance, a memory scheduler that maximizes the global performance is required. This memory scheduler must be aware not only of the characteristics (i.e., latency, bandwidth) of the different memory regions but also of the executing applications' memory requirements. For example, allocating 25% of the available remote memory to a memory-intensive application could lead to worse performance results than allocating the whole remote memory to an application with good cache locality. To decide how to distribute the different memory regions among the running applications, the scheduler needs information about the expected performance of a given memory distribution. To obtain this information two solutions can be devised: i) performing an off-line profiling of the benchmarks varying the memory distribution, and ii) dynamically predicting the performance of the benchmarks by measuring their utilization of the system resources during execution. The first solution was developed in a previous work [7], where we analyzed how the memory distribution impacts the performance of applications with different memory requirements, and presented an ideal memory allocation algorithm (referred to as SPP) that distributes the memory space among applications to maximize global performance. The generalization of SPP to any number n of applications was published in [8], where we also presented an efficient heuristic algorithm that approximates the performance results provided by SPP while reducing its complexity by a factor of (n − 1)!. Both algorithms consider a quality of service (QoS) parameter for each application in order to guarantee minimum performance requirements.
In contrast to these works, this paper proposes a performance predictor that provides the information required by the memory scheduler. The main aim of the proposed predictor is to be used by the memory scheduler to maximize the system performance while guaranteeing specific QoS requirements. To perform the predictions, 3 sample executions of every benchmark are required, each one considering that the complete working set of the benchmark is stored in a different memory region (i.e., L, Lb or R). Using these samples, the performance of any other memory distribution is estimated. The proposed predictor is driven by a novel performance model fed by simple hardware counters (like those available in most current processors) that measure the distribution of execution time devoted to processor, memory, and network resources. Although the model can be implemented for any type of processor, this work considers in-order execution for simplicity reasons. The model has been validated by comparing its estimations with the performance values obtained by the execution of real benchmarks in the Multi2Sim simulation framework [9]. The results show that the dynamic predictor is very accurate, since its deviation with respect to the real results is always lower than 5%, and much lower in most of the cases. The remainder of this paper is organized as follows. Section 2 describes the system prototype. Section 3 details our proposed performance model. Section 4 validates the model by comparing its predictions with detailed cycle-by-cycle simulation results. Section 5 discusses previous research related to this work, and finally, Section 6 presents some concluding remarks.
2 Cluster Prototype

A cluster machine with the required hardware/software capabilities is being prototyped in conjunction with researchers from the University of Heidelberg [2], who have designed the RMA connection cards. The machine consists of 64 motherboards, each one including 4 quad-core 2.0 GHz Opteron processors in a 4-node NUMA system (1 processor per node) and 16 GB of RAM per motherboard. The connection to remote motherboards is implemented through a regular HyperTransport [10] interface to the local motherboard and a High Node Count HyperTransport [11] interface to the remote boards. This interface is attached to the motherboard by means of HTX-compatible cards [12]. When a processor issues a load or store instruction, the memory operation is forwarded to the memory controller of the node handling that memory address. The RMA connection cards include their own controller, which handles the accesses to remote memory. Unlike typical memory controllers, the RMA controller has no memory banks directly connected to it. Instead, it relies on the banks installed in remote motherboards. This controller can be reconfigured so that memory accesses to a given memory address are forwarded to the selected motherboard. Since the prototype is still under construction, in order to carry out the experiments and validate the proposed performance model, the cluster machine has been modeled using Multi2Sim. Multi2Sim is a simulation framework for superscalar, multithreaded, and multicore processors. It is an application-only, execution-driven microarchitectural simulator, which allows the execution of multiple applications to be simulated without booting a complete OS.
Fig. 1. Block diagram of the 2-node NUMA system model and RMA

Table 1. Memory subsystem characteristics

Characteristic                       Description
# of processors                      2 per motherboard
L1 cache: size, #ways, line size     64KB, 2, 64B
L1 cache latency                     3
L2 cache: size, #ways, line size     1MB, 16, 64B
L2 cache latency                     6
Memory address space                 512MB, 256MB per motherboard
L latency                            100
Lb latency                           142
R latency                            410
In addition, the whole system has been scaled down to keep simulation times reasonable. The scaled system consists of two motherboards, each one composed of a 2-node NUMA system as shown in Figure 1. Each node includes a processor with private caches, its memory controller and the associated RAM memory. Table 1 shows the memory subsystem characteristics, where the memory latencies and cache organizations resemble those of the real prototype. The RMA connection cards have been assumed to have no internal storage capacity. Likewise, the Multi2Sim coherence protocol has been extended to model the RMA functionality.
3 Performance Model

A system whose running applications can be executed using different memory distributions (L, Lb, R) needs a mechanism to determine which memory distribution should be assigned to each application. This section presents a methodology for predicting the
impact on performance of the different memory distributions, and then using the predictions to guide the assignment of memory regions to applications in order to meet memory constraints and reduce performance loss. This work assumes that the predictor evaluates seven possible memory distributions (three samples and four estimated cases), since this number of data points is enough to sufficiently characterize the performance of each application over the complete set of possible memory distributions [8]. To predict the performance (execution time) of a running application A when having a memory assignment {L = X, Lb = Y, R = Z}, an analytical method has been designed. Existing processors implement performance counters, readable by software, for debugging purposes. In this paper, these counters are utilized by an application-to-memory assignment prediction mechanism: they are used to track the number of cycles spent on each considered event during a full scheduling quantum.

3.1 Analytical Model

The execution time of a given application can be estimated from two main components, as stated by Equation 1:

T_{ex} = C_{Dispatch} + C_{mem\ stalls}   (1)
Each C_x is the number of processor cycles spent on a given type of activity. As the dispatch width has been assumed to be 1, the execution time can be expressed as the sum of the number of dispatched instructions plus the number of cycles stalled due to memory accesses. In the devised system, stalls due to a full load-store queue (LSQ) are critical for performance, mainly in those benchmarks having a high rate of memory accesses. On the other hand, the dispatch stage remains stalled during the execution of a load instruction. This includes both the accesses to private caches (i.e., L1 and L2) and to the main memory, with their respective access times as well as the delays related to the network or structural hazards. To project the IPC, the performance model breaks down the memory components of the execution time into memory region-dependent and memory region-independent components:

C_{mem\ stalls} = C_{L} + C_{Lb} + C_{R} + C_{private\ caches} + C_{LSQ\ iwidth}   (2)
C_L, C_{Lb}, and C_R refer to the cycles spent on each memory region, that is, Local, Local to Board, and Remote, respectively. Each C includes the cycles due to several activities related to that memory region. In particular, stalls due to the following reasons have been taken into account: Main memory access time. This time includes both the cycles spent in the data read from the main memory and the message traffic through the memory network. Delayed hit. This type of stall occurs when the memory access cannot be performed because the accessed block is already locked by another memory instruction, that is, a new block is being brought in.
Write concurrency. This type of stall happens because concurrent accesses to the same block in a given cache are not allowed if one of them is a write. Full LSQ. The dispatch stage is stalled because there is no free entry in the LSQ. The remaining components of the equation can be considered as a constant k for every memory region. The region-independent components are the following: Private caches access time. Number of cycles spent in accessing the first- and second-level caches of the system. These accesses are region-independent since no memory module is accessed. LSQ issue width limitation. Only one load or store can be issued at a given cycle. So, if a load instruction is ready to be issued and there is an access conflict between a load and a store, they are issued in program order, and the youngest instruction retries the next cycle. The final equation used by the performance predictor is Equation 3:

T_{ex} = C_{Dispatch} + C_{L} + C_{Lb} + C_{R} + k   (3)
3.2 Estimating Performance

The model assumes that the implemented target machine provides the required performance counters to obtain the values for the components of Equation 3. Notice that network traffic is taken into account, so congestion is also quantified. The predictor requires running each benchmark three times to gather the values needed to project performance. Each sample corresponds to all the memory accesses falling in one single region, that is, i) all accesses to the local memory region (i.e., T_{ex,L=100%}), ii) all accesses to the other node in the local motherboard (i.e., T_{ex,Lb=100%}), and iii) all accesses to the remote memory region (i.e., T_{ex,R=100%}):

Sample 1 (L = 100%, Lb = 0%, R = 0%): T_{ex,L=100%} = C_{L:L=100%} + k
Sample 2 (L = 0%, Lb = 100%, R = 0%): T_{ex,Lb=100%} = C_{Lb:Lb=100%} + k
Sample 3 (L = 0%, Lb = 0%, R = 100%): T_{ex,R=100%} = C_{R:R=100%} + k

To predict the execution time for a given memory distribution, the predictor calculates a weighted execution time, T_{ex\ weighted}, from the three samples. It takes each non-null memory region component C of each of the samples and multiplies it by the fraction f of accesses of the destination memory region:

T_{ex\ weighted} = C_{L,L=100%} \cdot f_{L} + C_{Lb,Lb=100%} \cdot f_{Lb} + C_{R,R=100%} \cdot f_{R} + k   (4)
For any given memory distribution, Equation 4 can be used to predict the execution time given the components gathered for the three samples. This provides a mechanism to identify the optimal memory distribution at which to run a given execution phase with minimal performance loss, and this prediction will be an input for the memory scheduler. Table 2 analyzes an example of prediction for the benchmark FFT, where the execution time of the memory distribution (50%, 50%, 0%) is obtained from the three samples. The estimated execution time is 2774807.8 and the real detailed cycle-by-cycle simulation execution time is 2774931, so the model's estimation deviates less than 0.005% from the target value.
Table 2. Performance predictor working example

              C        f      C_pond
Sample 1      44687    0.5    22343.5
Sample 2      62236    0.5    31118
Sample 3      166757   0      0
k                             2721346.3
t_ex weighted                 2774807.8
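To make the use of Equation 4 concrete, a minimal C++ sketch of the prediction step is given below, using the FFT numbers from Table 2. The struct and function names are our own illustrative choices, not part of the authors' implementation.

#include <cstdio>

// Region-dependent cycle counts gathered from the three sample runs
// (all accesses in L, Lb or R, respectively), plus the region-independent
// constant k. Values are the FFT example of Table 2.
struct SampleComponents {
    double c_L;   // C_L measured with L = 100%
    double c_Lb;  // C_Lb measured with Lb = 100%
    double c_R;   // C_R measured with R = 100%
    double k;     // region-independent cycles (dispatch, private caches, LSQ)
};

// Equation 4: weighted execution time for a distribution (fL, fLb, fR).
double predict_time(const SampleComponents& s,
                    double fL, double fLb, double fR) {
    return s.c_L * fL + s.c_Lb * fLb + s.c_R * fR + s.k;
}

int main() {
    SampleComponents fft = {44687.0, 62236.0, 166757.0, 2721346.3};
    // Memory distribution (L = 50%, Lb = 50%, R = 0%) from the example.
    std::printf("predicted cycles: %.1f\n",
                predict_time(fft, 0.5, 0.5, 0.0));  // ~2774807.8
    return 0;
}

A memory scheduler could simply evaluate the seven candidate distributions with such a function and select the fastest one that satisfies the QoS constraints.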
Fig. 2. Model Validation. Detailed cycle-by-cycle simulation vs model.
4 Validating the Model

This section analyzes the prediction accuracy. We proceeded by running experiments for the four benchmarks with eight memory distributions: i) (100%, 0%, 0%), ii) (50%, 50%, 0%), iii) (0%, 100%, 0%), iv) (75%, 0%, 25%), v) (50%, 25%, 25%), vi) (50%, 0%, 50%), vii) (25%, 0%, 75%), viii) (0%, 0%, 100%). Then we took the components of the three samples (i, iii, and viii) and applied the model to each benchmark to obtain the execution time for each of the remaining memory distributions. Finally, the Instructions Per Cycle (IPC) was calculated for each case. Figure 2 compares the simulated performance results (sim) against the values calculated by the performance predictor (model). The model and detailed cycle-by-cycle simulation curves overlap, since the model's deviation is lower than 5% in the worst case, and near 0% for some of the benchmarks, for instance, FFT.
5 Related Work

Previous research has addressed the problem of performance prediction in order to characterize and classify the memory behavior of applications and predict their performance. Zhuravlev et al. [13] estimated that factors like memory controller, memory bus and prefetching hardware contention contribute more to overall performance degradation than cache space contention. To alleviate these factors, they minimize the total number of misses issued from each cache; to that end, they developed scheduling algorithms that distribute threads such that the miss rate is evenly distributed among the caches. In [14], the authors propose a classification algorithm for determining programs' cache-sharing behaviors. Their scheme can be implemented directly in hardware to provide dynamic classification of program behaviors. They also propose a very simple dynamic cache partitioning scheme that performs slightly better than the Utility-based Cache Partitioning scheme while incurring a lower implementation cost. In [15], a fast and accurate shared-cache-aware performance model for multi-core processors is proposed. The model estimates the performance degradation due to cache contention of processes running on CMPs. It uses reuse distance histograms, cache access frequencies, and the relationship between the throughput and cache miss rate of each process to predict its effective cache size when running concurrently and sharing the cache with other processes, allowing instruction throughput estimation. The average throughput prediction error of the model was 1.57%. In [16], the authors apply machine learning techniques to predict performance on multi-core processors. The main contribution of the study is the enumeration of solo-run program attributes that can be used to predict paired-run performance, where the paired run involves contention for shared resources between co-running programs. The research papers above focus on multicore or CMP processors; the work proposed in this paper, in contrast, focuses on cluster computers and deals with the problem of predicting application behaviour when using remote memory, in order to allow a scheduler to improve system performance. Other research papers dealing with remote memory allocation are mainly focused on memory swapping. Liang et al. design a remote paging system for remote memory utilization in InfiniBand clusters [17]. In [18], the use of remote memory for virtual memory swapping in a cluster computer is described. Midorikawa et al. propose the distributed large memory system (DLM), a user-level, software-only solution that provides very large virtual memory by using remote memory distributed over the nodes of a cluster [19]. These papers use remote memory for swapping over cluster nodes and present their systems as an improvement over disk swapping. In contrast, our research aims at predicting system performance depending on different assignments of remote memory to applications. The predictions will be used by a memory scheduler to decide dynamically which configuration best enhances system performance.
6 Conclusions

This paper has presented a performance predictor that is able to estimate the execution time of an application for a given memory distribution. We first carried out a study to
determine the events considered by our model, and classified them as memory-region dependent and independent. The model assumes that the number of cycles spent in each considered event is obtained from hardware counters of the target machine. The devised predictor has been used to estimate the performance of different memory distributions for four benchmarks. The accuracy of the prediction has been validated: the deviation of the model with respect to the real results is always lower than 5% and very close to 0% in several of the studied cases. This study constitutes the first step of deeper work in the area of memory scheduling. The performance estimates produced by the predictor will feed a memory scheduler that dynamically chooses the optimal target memory distribution for each application concurrently running in the system, in order to achieve the best overall system performance.

Acknowledgements. This work was supported by Spanish CICYT under Grant TIN2009-14475-C04-01, and by Consolider-Ingenio under Grant CSD2006-00046.
References

1. Meuer, H.W.: The top500 project: Looking back over 15 years of supercomputing experience. Informatik-Spektrum 31, 203–222 (2008), doi:10.1007/s00287-008-0240-6
2. Nussle, M., Scherer, M., Bruning, U.: A Resource Optimized Remote-Memory-Access Architecture for Low-latency Communication. In: International Conference on Parallel Processing, pp. 220–227 (September 2009)
3. Blocksome, M., Archer, C., Inglett, T., McCarthy, P., Mundy, M., Ratterman, J., Sidelnik, A., Smith, B., Almási, G., Castaños, J., Lieber, D., Moreira, J., Krishnamoorthy, S., Tipparaju, V., Nieplocha, J.: Design and implementation of a one-sided communication interface for the IBM eServer Blue Gene supercomputer. In: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, p. 120. ACM, New York (2006)
4. Kumar, S., Dózsa, G., Almasi, G., Heidelberger, P., Chen, D., Giampapa, M., Blocksome, M., Faraj, A., Parker, J., Ratterman, J., Smith, B.E., Archer, C.: The deep computing messaging framework: generalized scalable message passing on the Blue Gene/P supercomputer. In: ICS, pp. 94–103 (2008)
5. Tipparaju, V., Kot, A., Nieplocha, J., Bruggencate, M.T., Chrisochoides, N.: Evaluation of Remote Memory Access Communication on the Cray XT3. In: IEEE International Parallel and Distributed Processing Symposium, pp. 1–7 (March 2007)
6. HyperTransport Technology Consortium: HyperTransport I/O Link Specification Revision (October 3, 2008)
7. Serrano, M., Sahuquillo, J., Hassan, H., Petit, S., Duato, J.: A scheduling heuristic to handle local and remote memory in cluster computers. In: High Performance Computing and Communications (2010) (accepted for publication)
8. Serrano, M., Sahuquillo, J., Petit, S., Hassan, H., Duato, J.: A cost-effective heuristic to schedule local and remote memory in cluster computers. The Journal of Supercomputing, 1–19 (2011), doi:10.1007/s11227-011-0566-8
9. Ubal, R., Sahuquillo, J., Petit, S., López, P.: Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors. In: Proceedings of the 19th International Symposium on Computer Architecture and High Performance Computing (2007)
10. Keltcher, C.N., McGrath, K.J., Ahmed, A., Conway, P.: The AMD Opteron Processor for Multiprocessor Servers. IEEE Micro 23(2), 66–76 (2003)
11. Duato, J., Silla, F., Yalamanchili, S.: Extending HyperTransport Protocol for Improved Scalability. In: First International Workshop on HyperTransport Research and Applications (2009)
12. Litz, H., Fröning, H., Nuessle, M., Brüning, U.: A HyperTransport Network Interface Controller for Ultra-low Latency Message Transfers. HyperTransport Consortium White Paper (2007)
13. Zhuravlev, S., Blagodurov, S., Fedorova, A.: Addressing shared resource contention in multicore processors via scheduling. In: Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 129–142 (2010)
14. Xie, Y., Loh, G.H.: Dynamic Classification of Program Memory Behaviors in CMPs. In: 2nd Workshop on Chip Multiprocessor Memory Systems and Interconnects, in conjunction with the 35th International Symposium on Computer Architecture (2008)
15. Xu, C., Chen, X., Dick, R.P., Mao, Z.M.: Cache contention and application performance prediction for multi-core systems. In: IEEE International Symposium on Performance Analysis of Systems and Software, pp. 76–86 (2010)
16. Rai, J.K., Negi, A., Wankar, R., Nayak, K.D.: Performance prediction on multi-core processors. In: 2010 International Conference on Computational Intelligence and Communication Networks (CICN), pp. 633–637 (November 2010)
17. Liang, S., Noronha, R., Panda, D.K.: Swapping to Remote Memory over InfiniBand: An Approach using a High Performance Network Block Device. In: CLUSTER, pp. 1–10. IEEE, Los Alamitos (2005)
18. Werstein, P., Jia, X., Huang, Z.: A Remote Memory Swapping System for Cluster Computers. In: Eighth International Conference on Parallel and Distributed Computing, Applications and Technologies, pp. 75–81 (2007)
19. Midorikawa, H., Kurokawa, M., Himeno, R., Sato, M.: DLM: A distributed Large Memory System using remote memory swapping over cluster nodes. In: IEEE International Conference on Cluster Computing, pp. 268–273 (October 2008)
Reconfigurable Hardware Computing for Accelerating Protein Folding Simulations Using the Harmony Search Algorithm and the 3D-HP-Side Chain Model César Manuel Vargas Benítez, Marlon Scalabrin, Heitor Silvério Lopes, and Carlos R. Erig Lima Bioinformatics Laboratory, Federal University of Technology - Paraná, Av. 7 de setembro, 3165 80230-901, Curitiba (PR), Brazil [email protected], [email protected], {hslopes,erig}@utfpr.edu.br
1 Introduction
Proteins are essential to life and they have countless biological functions. They are synthesized in the ribosome of cells following a template given by the messenger RNA (mRNA). During the synthesis, the protein folds into a unique three-dimensional structure, known as the native conformation. This process is called protein folding. Several diseases are believed to be the result of the accumulation of ill-formed proteins. Therefore, understanding the folding process can lead to important medical advancements and the development of new drugs. Thanks to the several genome sequencing projects being conducted in the world, a large number of new proteins have been discovered. However, only a small number of such proteins have their three-dimensional structure known. For instance, the UniProtKB/TrEMBL repository of protein sequences currently has around 16.5 million records (as of July 2011), while the Protein Data Bank (PDB) has the structure of only 74,800 proteins. This gap is due to the cost and difficulty of unveiling the structure of proteins from the biochemical point of view. Computer Science has an important role here, proposing models and computational approaches for studying the Protein Folding Problem (PFP). The Protein Folding Problem (PFP) can be defined as finding the three-dimensional structure of a protein by using only the information about its primary structure (e.g., polypeptide chain or linear sequence of amino acids) [9]. The three-dimensional structure is the folding (or conformation) of a polypeptide as a result of interactions between the side chains of amino acids that are in different regions of the primary structure. The simplest computational model for the PFP is known as the Hydrophobic-Polar (HP) model, both in two (2D-HP) and three (3D-HP) dimensions [5]. Although simple, the computational approach for searching a solution
This work is partially supported by the Brazilian National Research Council – CNPq, under grant no. 305669/2010-9 to H.S.Lopes and CAPES-DS scholarships to C.M.V. Benítez and M.H. Scalabrin.
for the PFP using the HP models was proved to be NP-complete [3]. This fact emphasizes the necessity of using heuristic and massively parallel approaches for dealing with the problem. In this scenario, reconfigurable computing is an interesting methodology due to the possibility of massively parallel processing. However, this methodology has been sparsely explored in molecular biology applications. For instance, [7] presents a methodology for the design of a system based on reconfigurable hardware applied to the protein folding problem, where different strategies are devised to achieve a significant reduction of the search space of possible foldings. Also, [12] presents a methodology for the design of a reconfigurable computing system applied to the protein folding problem using Molecular Dynamics (MD). [13] proposes a complete fine-grained parallel hardware implementation on FPGA to accelerate the GOR-IV package for 2D protein structure prediction. [4] presents an FPGA-based approach for accelerating string set matching for Bioinformatics research. A survey of FPGAs for acceleration of high-performance computing and their application to computational Molecular Biology is presented in [11]. The main focus of this work is to develop approaches for accelerating protein folding simulations using the Harmony Search algorithm and the 3D-HP-SC (three-dimensional Hydrophobic-Polar Side-Chain) model of proteins.
2 The 3D-HP Side-Chain Model (3D-HP-SC)
The HP model divides the 20 proteinogenic amino acids into only two classes, according to their affinity to water: hydrophilic (or polar) and hydrophobic. When a protein is folded into its native conformation, the hydrophobic amino acids tend to group themselves in the inner part of the protein, in such a way as to be protected from the solvent by the polar amino acids, which are preferably positioned outwards. Hence, a hydrophobic core is usually formed, especially in globular proteins. In this model, the conformation of a protein (that is, a folding) is represented in a lattice, usually square (for the 2D-HP) or cubic (for the 3D-HP). Both the 2D-HP and 3D-HP models have been frequently explored in the recent literature [9]. Since the expressiveness of the HP models is very poor from the biological point of view, a further improvement of the model is to include a bead that represents the side-chain (SC) of each amino acid [8]. Therefore, a protein is modeled by a backbone (common to any amino acid) and a side-chain, either hydrophobic (H) or polar (P). The side-chain is responsible for the main chemical and physical properties of specific amino acids. The energy of a conformation is an inverse function of the number of adjacent amino acids in the structure which are non-adjacent in the sequence. To compute the energy of a conformation, the HP model considers that the interactions between hydrophobic amino acids represent the most important contribution to the energy of the protein. Li et al. [8] proposed an equation that considers only three types of interactions (without differentiating between types of side-chains). In this work we use a more realistic approach, proposed by [2], to compute the
energy of a folding, observing all possible types of interactions, as shown in Equation 1:

H = \epsilon_{HH} \sum_{i=1,\, j>i}^{n} \delta(r^{HH}_{ij}) + \epsilon_{BB} \sum_{i=1,\, j>i+1}^{n} \delta(r^{BB}_{ij}) + \epsilon_{HP} \sum_{i=1,\, j>i}^{n} \delta(r^{HP}_{ij}) + \epsilon_{BP} \sum_{i=1,\, j \neq i}^{n} \delta(r^{BP}_{ij}) + \epsilon_{BH} \sum_{i=1,\, j \neq i}^{n} \delta(r^{BH}_{ij}) + \epsilon_{PP} \sum_{i=1,\, j>i}^{n} \delta(r^{PP}_{ij})   (1)
In this equation, \epsilon_{HH}, \epsilon_{BB}, \epsilon_{BH}, \epsilon_{BP}, \epsilon_{HP} and \epsilon_{PP} are the weights of the energy for each type of interaction, respectively: hydrophobic side-chains (HH), backbone-backbone (BB), backbone-hydrophobic side-chain (BH), backbone-polar side-chain (BP), hydrophobic-polar side-chains (HP), and polar side-chains (PP). In a chain of n amino acids, the distance (in the three-dimensional space) between the i-th and j-th amino acids interacting with each other is represented by r^{**}_{ij}. For the sake of simplification, in this work we used unit distance between amino acids (r^{**}_{ij} = 1). Therefore, \delta is an operator that returns 1 when the distance between the i-th and j-th elements (either backbone or side-chain) for each type of interaction is the unity, and 0 otherwise. We also used an optimized set of weights for each type of interaction, defined by [2]. During the folding process, interactions between amino acids take place and the energy of the conformation tends to decrease. Accordingly, the conformation tends to converge to its native state, in agreement with Anfinsen's thermodynamic hypothesis [1]. In this work we consider the symmetric of H, such that the PFP is understood as a maximization problem.
3 Harmony Search Algorithm
The Harmony Search (HS) meta-heuristic is inspired by the skills musicians use in composition, memorization and improvisation. Musicians use these skills to pursue a perfect composition with a perfect harmony. Similarly, the HS algorithm uses its search strategies to pursue the optimal solution to an optimization problem. The pseudo-code of the HS algorithm is presented in Algorithm 1 [6]. The HS algorithm starts with a Harmony Memory of size HMS, where each memory position is occupied by a harmony of size N (the musicians). At each improvisation step, a new harmony is generated from the harmonies already present in the harmony memory. If the new harmony is better than the worst harmony in the harmony memory, the latter is replaced by the former. The steps of improvising and updating the harmony memory are repeated until the maximum number of improvisations (MI) is reached. The HS algorithm can be described by five main steps, detailed below [6]:
For more information see the HS repository: http://www.hydroteq.com
Algorithm 1. Pseudo-code of the Harmony Search algorithm
 1: Parameters: HMS, HMCR, PAR, MI, FW
 2: Start
 3: Objective Function f(x), x = [x_1, x_2, ..., x_N]
 4: Initialize Harmony Memory x_i, i = 1, 2, ..., HMS
 5: Evaluate each Harmony in HM: f(x_i)
 6: cycle <- 1
 7: while cycle < MI do
 8:   for j <- 1 to N do
 9:     if random <= HMCR then {Rate of Memory Consideration}
10:       x'_j <- x^i_j, with i in [1, HMS] {chosen randomly}
11:       if random <= PAR then {Pitch Adjusting Rate}
12:         x'_j <- x'_j +/- r x FW {with r random}
13:       end if
14:     else {Random Selection}
15:       Generate x'_j randomly
16:     end if
17:   end for
18:   Evaluate the new harmony generated: f(x')
19:   if f(x') is better than the worst harmony in HM then
20:     Update Harmony Memory
21:   end if
22:   cycle <- cycle + 1
23: end while
24: Results and views
25: End
1. Initialization and Setting of Algorithm Parameters: In the first step, as in any optimization problem, the problem is defined as an objective function to be optimized (line 3), which may or may not be constrained. Originally, Harmony Search was designed for solving minimization problems [6]. The four main parameters of the algorithm are also defined here: the Harmony Memory size (HMS), the Harmony Memory Consideration Rate (HMCR), the Pitch Adjusting Rate (PAR), and the Maximum number of Improvisations (MI).
2. Harmony Memory Initialization: The second step is the initialization of the Harmony Memory (HM) with a number of randomly generated harmonies (line 4). The Harmony Memory is the vector in which the best harmonies found during execution are stored. Each harmony is a vector representing a possible solution to the problem.
3. Improvise a New Harmony: In the third step, a new harmony is improvised based on a combination of several other harmonies found in HM (lines 8–17). For each variable of the new harmony, a harmony of HM is arbitrarily selected, and the probability HMCR determines whether or not its value is used. If the value from another harmony is used, it receives small adjustments (of Fret Width, FW) with probability PAR. If the value from another harmony is not used, a random value within the range of allowed values is assigned. Thus, the parameters HMCR and PAR are responsible for establishing a balance between exploration and exploitation of the search space.
4. Update Harmony Memory: In the fourth step, each newly improvised harmony is checked to see whether it is better than the worst harmony in HM (lines 19–21). If so, the new harmony replaces the worst one in HM.
5. Verification of the Stopping Criterion: In the fifth step, at the end of each iteration, the algorithm checks whether the stopping criterion, usually a maximum number of improvisations (MI), has been met. If so, the execution is completed; otherwise, the algorithm returns to the improvisation step until the stopping criterion is reached.
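To complement the pseudo-code, the following self-contained C++ sketch implements the improvisation loop for a toy continuous minimization problem. The objective function, variable domain and random-number choices are our own illustrative assumptions; they do not reflect the paper's hardware design.

#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    // Algorithm 1 parameters (HMS, HMCR, PAR and MI follow the values used
    // later in Section 5; FW is rescaled for this continuous toy problem).
    const int HMS = 20, N = 4, MI = 100000;
    const double HMCR = 0.9, PAR = 0.3, FW = 0.05;

    // Toy objective to minimize: the sphere function.
    auto f = [](const std::vector<double>& x) {
        double s = 0.0;
        for (double v : x) s += v * v;
        return s;
    };

    std::mt19937 rng(42);
    std::uniform_real_distribution<double> u01(0.0, 1.0);
    std::uniform_real_distribution<double> dom(-1.0, 1.0);

    // Step 2: initialize the Harmony Memory with HMS random harmonies.
    std::vector<std::vector<double>> hm(HMS, std::vector<double>(N));
    std::vector<double> fit(HMS);
    for (int i = 0; i < HMS; ++i) {
        for (double& v : hm[i]) v = dom(rng);
        fit[i] = f(hm[i]);
    }

    for (int cycle = 0; cycle < MI; ++cycle) {
        // Step 3: improvise a new harmony, variable by variable.
        std::vector<double> x(N);
        for (int j = 0; j < N; ++j) {
            if (u01(rng) <= HMCR) {            // memory consideration
                x[j] = hm[rng() % HMS][j];
                if (u01(rng) <= PAR)           // pitch adjustment
                    x[j] += (2.0 * u01(rng) - 1.0) * FW;
            } else {                           // random selection
                x[j] = dom(rng);
            }
        }
        // Step 4: replace the worst harmony if the new one is better.
        int worst = (int)(std::max_element(fit.begin(), fit.end()) - fit.begin());
        double fx = f(x);
        if (fx < fit[worst]) { hm[worst] = x; fit[worst] = fx; }
    }

    int best = (int)(std::min_element(fit.begin(), fit.end()) - fit.begin());
    std::printf("best objective: %g\n", fit[best]);
    return 0;
}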
4 Methodology
This section describes in detail the implementation of the Harmony Search algorithm for the PFP using the 3D-HP-SC model of proteins. Four versions were developed: a desktop computer version and three different FPGA-based implementations. The FPGA-based versions were developed in VHDL (Very High Speed Integrated Circuit Hardware Description Language) and implemented on an FPGA (Field Programmable Gate Array) device. Two of these versions also used an embedded processor (Altera's NIOS II) as part of their hardware design. On the other hand, the software implementations (i.e., for both the NIOS II and the desktop computer) were developed in the ANSI-C programming language. The first hardware-based approach is a version for the 32-bit NIOS II embedded processor, and simply reproduces the software implemented on the desktop computer. The second hardware-based approach is a version for the NIOS II with a dedicated hardware block specifically developed for computing the fitness function, as shown in Figure 1. The HS algorithm runs on the NIOS II processor and the block called "Fitness Calculation System" works as a slave of the NIOS II. The processor is responsible for initializing the Harmony Memory, improvising new harmonies, updating the Harmony Memory and, finally, distributing the individuals (also called harmonies) to the slave block. The slave, in turn, is responsible for computing the fitness function for each individual received. The internal structure of this block is described later.
clk Sequence
rst harmony
Harmony Search Algorithm (NIOS II)
reset Enable busy
Fitness Calculation System
Results MUX *Energy *Colisions *Fitness
fitness
Fig. 1. Functional block diagram of the folding system with NIOS II embeddedprocessor
The third hardware-based approach is fully implemented in hardware and does not use an embedded processor, as shown in Figure 2. The block called "Harmony Search Core" performs the HS algorithm. The Harmony Memory initialization is performed by producing a new harmony for each position of the Harmony Memory. Each variable of each new harmony is independent of the others. Therefore, each new harmony is generated in one clock pulse using a set of N random number generators, where N is the number of variables in the harmony. Once the Harmony Memory is loaded with the initial harmonies, the iterative optimization process of the HS algorithm is started. At each iteration, four individuals (harmonies) are evaluated simultaneously (in parallel), thus expecting an improvement in performance. In the improvisation step of the algorithm, the selection of each variable of the new harmony is performed independently. This procedure is done in only N clock pulses, as before. After that, the Harmony Memory is updated by inserting the new harmonies in their proper positions. The following positions are shifted, discarding the four worst harmonies. To find the insertion position, the position of the worst harmony in the Harmony Memory is always maintained in a latch. Each variable to be replaced is treated simultaneously. Once the optimization process is completed, the best harmony found is transferred from the Harmony Memory to the "Fitness Calculation System" block in order to display all relevant information about the conformation represented by this harmony. The chronometer block measures the total elapsed processing time of the system. The multiplexer block selects the output data among the obtained results (energy of each interaction, number of collisions, fitness and the processing time) to be shown on a display interface. The random number generator is implemented using the Maximum Length Sequence (MLS) pseudo-random number approach. MLS is an n-stage linear shift register that can generate binary periodic sequences of maximal period length L = 2^n - 1. In this work, we used n = 7 or n = 4 for all probability values mentioned in Algorithm 1, and n = 5 for generating the variables of the new harmonies in the improvisation process.
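As an illustration of the MLS approach, the C++ sketch below steps a 7-stage maximal-length LFSR. The tap positions (the primitive polynomial x^7 + x^6 + 1) are an assumption of ours, since the paper does not specify which taps the hardware uses.

#include <cstdint>
#include <cstdio>

// One step of a 7-stage Fibonacci LFSR with taps 7 and 6
// (polynomial x^7 + x^6 + 1), giving the maximal period 2^7 - 1 = 127.
uint8_t lfsr7_step(uint8_t state) {
    uint8_t feedback = ((state >> 6) ^ (state >> 5)) & 1u;
    return (uint8_t)(((state << 1) | feedback) & 0x7Fu);
}

int main() {
    uint8_t s = 0x01;  // any non-zero seed works
    for (int i = 0; i < 10; ++i) {
        s = lfsr7_step(s);
        std::printf("%u ", s);
    }
    std::printf("\n");
    return 0;
}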
Fig. 2. Functional blocks of the proposed folding system without NIOS II embedded processor
Fig. 3. Fitness computing system
Figure 3 shows a functional block diagram of the "Fitness Calculation System", which has three main elements: a three-dimensional conformation decoder, a coordinates memory and a fitness computation block. By calculating the energy of each type of interaction and the number of collisions between the elements (side-chains and backbone), the fitness of the conformation is obtained. The blocks that perform these operations are described as follows.

Harmony Representation: The encoding of the candidate solutions (harmonies of the HS algorithm) is an important issue and must be carefully implemented. Encoding can have a strong influence not only on the size of the search space, but also on the hardness of the problem, due to the establishment of unpredictable cross-influences between the musicians of a harmony. There are several ways of representing a folding in an individual, as pointed out in [9]: distance matrix, Cartesian coordinates (absolute coordinates), or relative internal coordinates. In this work we used relative internal coordinates, because they are the most efficient for the PFP using lattice models of proteins. In this coordinate system, a given conformation of the protein is represented as a set of movements in a three-dimensional cubic lattice, where the position of each amino acid of the chain is described relative to its predecessor. As mentioned in Section 2, using the 3D-HP-SC model, each amino acid of the protein is represented by a backbone (BB) and a side-chain, either hydrophobic (H) or polar (P). Using relative internal coordinates in the three-dimensional space, there are five possible relative movements for the backbone (Left, Front, Right, Down and Up), and another five for each side-chain (left, front, right, down, up). It is important to note that the side-chain movement is relative to the backbone. The combination of these possible movements gives 25 possibilities. Each possible movement is represented by a symbol which, in turn, is encoded in a 5-bit binary format (the number of bits needed to represent the alphabet of 25 possible movements, between 0 and 24). Invalid values (value >= 25) are replaced by the largest possible value (24). Considering the folding of an n-amino-acid-long protein, a harmony of n - 1 musicians represents the set of movements of the backbone and side-chains in the three-dimensional lattice, and the resulting search space has 25^{n-1} possible foldings/conformations.

Three-Dimensional Conformations Decoder: The harmony, representing a given conformation, has to be converted into the Cartesian coordinates that embed the conformation in the cubic lattice. Therefore, a progressive sequential procedure is necessary, starting from the first amino acid. The coordinates are generated by a combinational circuit for the whole conformation. These coordinates are stored in the "Coordinates Memory" which, in turn, provides the coordinates of all elements (backbone and side-chains) on a parallel output bus. The algorithm for the decoding process (harmony -> conformation) is as follows. The harmony is read and decoded into a vector using the set of possible movements. In the next step, the elements of the first amino acid are placed in the three-dimensional space. For each movement, four steps are performed. First, the direction of the movement is obtained from the next movement and the direction of the movement of the predecessor amino acid. The backbone coordinates are obtained similarly from the predecessor amino acid. The next step consists in determining the coordinates of the side-chain of the amino acid from the movement and the coordinates of the backbone. Finally, the coordinates obtained in this process are stored in the "Coordinates Memory".
Fig. 4. Left: Example of relative 3D movements of a folding. Right: Diagram representing the possible interactions between the elements of a protein chain.
Figure 4 (left) shows a conformation for a hypothetical 4-amino-acid-long protein, where the Cartesian coordinates of each element are represented as x_i (row), y_i (column), z_i (depth), and obtained from the relative movement of the current amino acid and the position of its predecessor. Blue balls represent the polar residues and the red ones the hydrophobic residues. The backbone and the connections between elements are shown in gray. The search space for the protein represented in this figure has 25^{n-1} = 25^3 = 15625 possible conformations. Here, the folding is formed by three movements: Ul->Dl->Dl. In this figure, the backbone and the side-chain of the first amino acid of the chain are also indicated; they are set to the origin of the coordinate system, (0,0,0) and (0,-1,0), respectively.

Fitness Function: In this work, we used a simplified fitness function based on the one formerly proposed by [2]. Basically, this function has two terms: fitness = H - (NC * PenaltyValue). The first is relative to the free energy of the folding (H, see Equation 1) and the second is a penalty term that decreases the fitness value according to the number of collisions in the lattice. The energy term takes into account the number of hydrophobic bonds, hydrophilic interactions, and interactions with the backbone. The penalty is composed of the number of points in the three-dimensional lattice occupied by more than one element (NC, the number of collisions), multiplied by the penalty weight (PenaltyValue). The blocks named "Interactions calculation", "Collisions detection" and "Energy calculation" compute the energy of each type of interaction (see Figure 4 (right) for a visual representation), the number of collisions between elements, and the free energy (H), respectively. Finally, the block called "Fitness Calculation" computes the fitness function. It is important to note that, in the current version of the system, due to hardware limitations, all energies are computed using a sequential procedure, comparing the coordinates of all elements of the protein. As the length of sequences increases, the demand for hardware resources will increase accordingly.
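A minimal software rendering of this fitness computation is sketched below in C++, assuming the decoded conformation is available as a list of integer lattice coordinates for all backbone and side-chain elements. The data layout, the helper names, and the collision count (occupancy minus one per lattice point) are our own assumptions; the free energy H is taken as already computed from Equation 1.

#include <cstdio>
#include <map>
#include <tuple>
#include <vector>

struct Coord { int x, y, z; };

// NC: number of lattice points occupied by more than one element,
// counting each extra occupant as one collision (assumed reading).
int count_collisions(const std::vector<Coord>& elems) {
    std::map<std::tuple<int, int, int>, int> occupancy;
    for (const Coord& c : elems) ++occupancy[std::make_tuple(c.x, c.y, c.z)];
    int nc = 0;
    for (const auto& kv : occupancy)
        if (kv.second > 1) nc += kv.second - 1;
    return nc;
}

// fitness = H - NC * PenaltyValue, with H (sign-flipped free energy of
// Equation 1) computed elsewhere from the interaction counts.
double fitness(double H, const std::vector<Coord>& elems,
               double penalty_value) {
    return H - count_collisions(elems) * penalty_value;
}

int main() {
    // Toy conformation with one collision: two elements at (1,0,0).
    std::vector<Coord> elems = {{0,0,0}, {0,-1,0}, {1,0,0}, {1,0,0}};
    std::printf("fitness = %.1f\n", fitness(3.0, elems, 10.0));  // 3 - 10 = -7
    return 0;
}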
5 Experiments and Results
All hardware experiments in this work were run on a NIOS II Development Kit with an Altera Stratix II EP2S60F672C5ES FPGA device, using a 50 MHz internal clock. The experiments for the software version were run on a desktop computer with an Intel Core2Quad processor at 2.8 GHz, running Linux. In the experiments reported below, five synthetic sequences from [2] were used, with 20, 27, 31, 36 and 48 amino acids, respectively (listed after Table 1).
Table 1. Comparative performance of the several approaches (processing times in seconds)

n     tp_NIOS    tp_NIOS-HW    tp_SW     tp_HW
20    557.3      54.0          6.5       1.6
27    912.8      75.0          7.7       3.0
31    1186.8     87.3          7.9       4.0
36    1460.5     107.7         9.4       5.0
48    2414.9     174.8         13.44     10.0
The five sequences are: (HP)2PH2PHP2HPH2P(PH)2; H3P2H4P3(HP)2PH2P2HP3H2; (HHP)3H(HHHHHPP)2H7; PH(PPH)11P; HPH2P2H4PH3P2H2P2HPH3(PH)2HP2H2P3HP8H2.

In this work, no specific procedure was used to adjust the running parameters of the HS algorithm. Factorial experiments and self-adjusting parameters [10] are frequently used in the literature, but these issues fall outside the focus of this work. Instead, we used the default parameters suggested in the literature: MI = 100000, HMS = 20, PAR = 30%, FW = 5 and HMCR = 90%.

It is important to recall that the main objective of this work is to decrease the processing time of protein folding simulations using the 3D-HP-SC model. Each developed approach was applied to the sequences mentioned before. Results are shown in Table 1. In this table, the first column identifies the sequence length, and columns tp_NIOS, tp_NIOS-HW, tp_SW and tp_HW show the total elapsed processing time for, respectively, the NIOS II approach, the NIOS II with the "Fitness Calculation System" block, the software approach, and the hardware-based system without embedded processor. Overall, the processing time of any approach is a function of the length of the sequence, possibly growing exponentially as the number of amino acids increases. This fact, by itself, strongly suggests the need for highly parallel approaches for dealing with the PFP. To facilitate the comparison of performance between the approaches, Figure 5 presents the speedups obtained, where:

– Spa = tp_NIOS / tp_NIOS-HW: speedup of the NIOS II with the "Fitness Calculation System" block relative to the plain NIOS II approach;
– Spb = tp_NIOS-HW / tp_SW: speedup of the software relative to the NIOS II with the "Fitness Calculation System" block;
– Spc = tp_NIOS-HW / tp_HW: speedup of the hardware-based system without embedded processor relative to the NIOS II with the "Fitness Calculation System" block;
– Spd = tp_SW / tp_HW: speedup of the hardware-based system without embedded processor relative to the software for desktop computers.
Fig. 5. Comparison of speedups between the approaches (Spa, Spb, Spc and Spd, plotted against the sequence length n)
The NIOS II version presented the worst performance (i.e., the highest processing time) among all implementations. Its processing time was larger than that of the software approach due to the low frequency of its internal clock (compared with the desktop processor). It is also observed that the NIOS II with the "Fitness Calculation System" block achieved a significant speedup compared to the plain NIOS II approach, ranging from 10x to 13x depending on the length of the sequence, mainly due to the number of clock cycles needed to execute each instruction on the NIOS II processor. The hardware-based system without the embedded processor showed the best performance, mainly due to its several levels of parallelism, namely, in the Harmony Memory initialization, in the improvisation, and in the parallel evaluation of several fitness functions. This approach was significantly better than the remaining hardware-based approaches, achieving a speedup ranging from 17x to 34x, also depending on the length of the sequence. Compared with the software approach, it achieved speedups ranging from 1.5x to 4.1x. The speedup decreases as the length of the sequences grows, due to the sequential procedure used to compute the energy for each type of interaction (as mentioned in Section 4).
6 Conclusions and Future Works
The PFP is still an open problem for which there is no closed computational solution. As mentioned before, even the simplest discrete model for the PFP leads to an NP-complete problem, thus justifying the use of metaheuristic methods and parallel computing. While most works have used the 2D- and 3D-HP models, the 3D-HP-SC model is still poorly explored (see [2]), although it is a more expressive model from the biological point of view. Improvements will be made in future versions of the hardware-based system without the embedded processor, such as the full parallelization of
the energy computation. Also, future work will investigate hardware versions of other evolutionary computation approaches, such as Ant Colony Optimization (ACO), Particle Swarm Optimization (PSO) or the traditional Genetic Algorithm (GA), applied to the PFP, so as to develop parallel hybrid versions and different parallel topologies. Regarding the growth of hardware resource usage, future work will consider the use of larger devices or multi-FPGA boards. Overall, the results lead to interesting insights and suggest the continuity of this work. We believe that the use of reconfigurable computing for the PFP using the 3D-HP-SC model is very promising for this area of research.
References

1. Anfinsen, C.B.: Principles that govern the folding of protein chains. Science 181(96), 223–230 (1973)
2. Benítez, C.M.V., Lopes, H.S.: Hierarchical parallel genetic algorithm applied to the three-dimensional HP side-chain protein folding problem. In: Proc. of the IEEE Int. Conf. on Systems, Man and Cybernetics, pp. 2669–2676 (2010)
3. Berger, B., Leighton, F.T.: Protein folding in the hydrophobic-hydrophilic HP model is NP-complete. Journal of Computational Biology 5(1), 27–40 (1998)
4. Dandass, Y.S., Burgess, S.C., Lawrence, M., Bridges, S.M.: Accelerating string set matching in FPGA hardware for bioinformatics research. BMC Bioinformatics 9(197) (2008)
5. Dill, K.A., Bromberg, S., Yue, K., Fiebig, K.M., et al.: Principles of protein folding – a perspective from simple exact models. Protein Science 4(4), 561–602 (1995)
6. Geem, Z.W., Kim, J.-H., Loganathan, G.V.: A new heuristic optimization algorithm: Harmony search. Simulation 76(2), 60–68 (2001)
7. Armstrong Junior, N.B., Lopes, H.S., Lima, C.R.E.: Preliminary steps towards protein folding prediction using reconfigurable computing. In: Proc. 3rd Int. Conf. on Reconfigurable Computing and FPGAs, pp. 92–98 (2006)
8. Li, M.S., Klimov, D.K., Thirumalai, D.: Folding in lattice models with side chains. Computer Physics Communications 147(1), 625–628 (2002)
9. Lopes, H.S.: Evolutionary algorithms for the protein folding problem: A review and current trends. In: Smolinski, T.G., Milanova, M.G., Hassanien, A.-E. (eds.) Computational Intelligence in Biomedicine and Bioinformatics. SCI, vol. 151, pp. 297–315. Springer, Heidelberg (2008)
10. Maruo, M.H., Lopes, H.S., Delgado, M.R.B.: Self-adapting evolutionary parameters: Encoding aspects for combinatorial optimization problems. In: Raidl, G.R., Gottlieb, J. (eds.) EvoCOP 2005. LNCS, vol. 3448, pp. 154–165. Springer, Heidelberg (2005)
11. Ramdas, T., Egan, G.: A survey of FPGAs for acceleration of high performance computing and their application to computational molecular biology. In: Proc. of the IEEE TENCON, pp. 1–6 (2005)
12. Sung, W.-T.: Efficiency enhancement of protein folding for complete molecular simulation via hardware computing. In: Proc. 9th IEEE Int. Conf. on Bioinformatics and Bioengineering, pp. 307–312 (2009)
13. Xia, F., Dou, Y., Lei, G., Tan, Y.: FPGA accelerator for protein secondary structure prediction based on the GOR algorithm. BMC Bioinformatics 12, S5 (2011)
Clustering Nodes in Large-Scale Biological Networks Using External Memory Algorithms

Ahmed Shamsul Arefin1, Mario Inostroza-Ponta2, Luke Mathieson3, Regina Berretta1,4, and Pablo Moscato1,4,5,*

1 Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, The University of Newcastle, Callaghan, New South Wales, Australia
2 Departamento de Ingeniería Informática, Universidad de Santiago de Chile, Chile
3 Department of Computing, Faculty of Science, Macquarie University, Sydney, Australia
4 Hunter Medical Research Institute, Information Based Medicine Program, Australia
5 ARC Centre of Excellence in Bioinformatics, Callaghan, NSW, Australia
{Ahmed.Arefin,Regina.Berretta,Pablo.Moscato}@newcastle.edu.au, [email protected], [email protected]
Abstract. Novel analytical techniques have dramatically enhanced our understanding of many application domains, including biological networks inferred from gene expression studies. However, there are clear computational challenges associated with the large datasets generated from these studies. The algorithmic solution of some NP-hard combinatorial optimization problems that naturally arise in the analysis of large networks is difficult without specialized computer facilities (i.e., supercomputers). In this work, we address the data clustering problem of large-scale biological networks with a polynomial-time algorithm that uses reasonable computing resources and is limited by the available memory. We have adapted and improved the MSTkNN graph partitioning algorithm and redesigned it to take advantage of external memory (EM) algorithms. We evaluate the scalability and performance of our proposed algorithm on a well-known breast cancer microarray study and its associated dataset. Keywords: Data clustering, external memory algorithms, graph algorithms, gene expression data analysis.
1 Introduction
The analysis of biological networks has become a major challenge due to the recent development of high-throughput techniques that are rapidly producing very large data sets. A number of algorithms, techniques and applications have been proposed to obtain useful information from various types of biological networks. Data clustering is perhaps the most common and widely used approach for global network analysis; it helps to uncover important functional modules in the network. Numerous clustering algorithms for analyzing biological networks have been developed. These traditional algorithms/tools work well on moderate-size networks and can produce
Corresponding author.
informative results. Interestingly, the size and number of biological networks are continuously growing due to extensive data integration from newly discovered biological processes and novel microarray techniques that also consider ncRNAs. To handle large-scale networks, existing algorithms need to scale well and need to be re-implemented using cutting-edge software and hardware technologies. In this work, we have enhanced and re-implemented a graph-based clustering algorithm known as MSTkNN, proposed by Inostroza-Ponta et al. [1], to tackle the task of clustering large-scale biological networks. Given a weighted undirected graph G (or, in its special case, a non-negative square matrix of distances among a set of objects, i.e., a complete weighted graph), the MSTkNN algorithm starts by building a proximity graph. This graph is defined as having the same set of nodes as the original graph, and as its set of edges the intersection of the edges of the minimum spanning tree (MST(G)) and the k-nearest neighbor graph (kNN(G)). Gonzáles et al. [2] also used this proximity graph, with k = ⌊ln(n)⌋, where n is the number of nodes. In the MSTkNN algorithm, the value of k is determined automatically and a recursive procedure partitions the graph until a stopping criterion halts the recursive partition of a cluster [3]. MSTkNN does not require any fixed parameter (e.g., a predetermined number of clusters) and it performs better than some other well-known classical clustering algorithms (e.g., K-Means and SOMs) in terms of homogeneity and separation [3], in spite of not using an explicitly defined objective function. In addition, it performs well even if the dataset has clusters of different mixed types (i.e., MSTkNN is not biased to "prefer" convex clusters).

We propose here a different approach to make the basic idea behind MSTkNN practically applicable to large datasets. In the worst-case situation, the input is a similarity/dissimilarity matrix at the start of the computation and, for a very large data set, this matrix may not fit in the computer's internal memory (in-memory) or even in the computer's external memory (EM). To overcome this problem, given G, we compute and store only a qNN graph (with q = k + 1) of the similarity matrix and compute its MST (i.e., MST(qNN)). Additionally, we annotate each edge of MST(qNN) with a non-negative integer value which is a function of the relative distance between the two nodes of that edge and their nearest neighbors. Finally, we recursively partition the MST(qNN) using this set of annotations on the edges to produce the clusters. Unlike the MSTkNN in [1], we compute the MST only once, instead of at each recursive step, and we show that our clustering result is the same as that of the previously proposed algorithm.

We have implemented our proposed algorithm by adapting the EM algorithmic approaches presented in [4-6], which gives an excellent performance improvement over the previous implementation. EM algorithms are very efficient when most of the data needs to be accessed from external memory; they improve the running time by reducing the number of I/Os between internal and external memory. Further details on EM algorithms can be found in [7]. Additionally, we now have the benefit of employing parallel and distributed computing to calculate the similarity/distance matrix and to compute the qNN graph, which has made our data preprocessing reasonably fast on large data sets.
2 Related Work
Several graph-based clustering algorithms/tools have been developed in the past years, and their advantages for analyzing biological networks are clearly demonstrated in several publications [1, 8-9]. We can see graph-based clustering as a general domain of problems in which the task is often cast as an optimization problem (generally defined on a weighted graph). Given the graph, it is partitioned using certain predefined conditions. Each partition, representing a subgraph/component of the graph, is either further partitioned or reported as a cluster, based on certain stopping criteria and guided by an objective function. In Table A.1, we present a brief list of known graph-based clustering algorithms/tools for biological data sets, along with the maximum test data set sizes in the relevant published literature. It is clear from the table that traditional graph-based clustering algorithms can serve as a primary/first tool for analyzing biological networks. However, new algorithms, designed with more advanced technologies, are necessary to deal with larger data sets. Surprisingly, EM algorithms, which are very convenient for handling massive data sets, have not yet been applied to clustering biological networks. We have found only a few attempts in the published literature that exploit EM algorithms in bioinformatics, all of them apparently related to sequence searching [10-11]. There exist several graph-based EM algorithms [12-13] that could be further investigated for their applicability to biological networks. In this work, we have adapted the EM computation of minimum spanning trees (EM MST) [4] and connected components (EM CC) [5-6]. These algorithms are capable of handling sparse graphs with up to billions of nodes.
3 Methods

3.1 The Original MSTkNN Algorithm
The original MSTkNN algorithm, presented in [3], takes an undirected complete graph G and computes two proximity graphs: a minimum spanning tree (G_MST) and a k-nearest neighbor graph (G_kNN), where the value of k is determined by:

k = min{⌊ln(n)⌋; min{k : G_kNN is connected}}   (1)
Subsequently, the algorithm inspects all edges in G_MST. If, for a given edge (x,y), neither x is one of the k nearest neighbors of y nor y is one of the k nearest neighbors of x, the edge is eliminated from G_MST. This results in a new graph G′ = G_MST − {(x,y)}. Since G_MST is a tree, after the first edge is deleted G′ becomes a forest. The algorithm continues applying the same procedure to each subtree in G′ (with the value of k re-adjusted to k = ⌊ln(n)⌋, where n is now the number of nodes in each subtree), until no further partition is possible. The final partition of the nodes of G′ induced by the forest is the result of the clustering algorithm.
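The edge test at the heart of this procedure can be written as a one-line predicate; the container choice and function name below are hypothetical, not from the paper.

#include <unordered_set>
#include <vector>

// Edge (x, y) of G_MST is kept iff x is among the k nearest neighbors of y
// or y is among the k nearest neighbors of x; otherwise it is cut.
// knn[v] is assumed to hold the node ids of v's k nearest neighbors.
bool keep_edge(int x, int y,
               const std::vector<std::unordered_set<int>>& knn) {
    return knn[x].count(y) > 0 || knn[y].count(x) > 0;
}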
3.2 MSTkNN+: The Modified MSTkNN Algorithm
The original MSTkNN algorithm requires n(n − 1)/2 distance values (between all pairs of the n elements) as input. For a large data set, this may be too large to fit in the computer's internal memory and, for even larger values of n, it may not even fit in external memory. Even if we can store the distance matrix in external memory, the computation will slow down dramatically because of the increased number of I/O operations. Therefore, we modified this step: instead of creating the complete graph from the distance matrix, we compute a q-nearest neighbor graph (G_qNN), where q = ⌊ln(n)⌋ + 1. This procedure reduces the input graph size, but still creates a reasonable clustering structure of the data set. The value of q is determined from the inclusion relationship [2] between G_MST and the family of nested graphs (G_kNN, for k > ln(n)). Then, we compute the MST of the G_qNN graph, which we call G_MSTp. We annotate each edge in G_MSTp according to the following procedure: for each edge (a,b) in E(G_MSTp), we assign an integer value p, defined as follows. Let f(a,b) be the index of b in the sorted list of nearest neighbors of a in G_qNN. The value of p is given by:

p = min{f(a, b), f(b, a)}   (2)
We define the maximum value of p in the MSTp (or any of its components) as pmax, and then we partition GMSTp with the following criteria:

C1. If p > ⌊ln(n)⌋, we remove the edge;
C2. If pmax < ⌊ln(n)⌋, we remove the edges with weight pmax − 1; and
C3. If pmax = 1 or pmax = ⌊ln(n)⌋, we do not remove any edge; the result is a "cluster".
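The annotation of equation (2) and the three criteria can be sketched in C++ as follows; the rank table f and all function names are illustrative assumptions, not part of the published algorithm.

#include <algorithm>
#include <cmath>
#include <vector>

// f[a][b] is the (1-based) rank of b in the sorted nearest-neighbor
// list of a in the qNN graph.
int annotate(const std::vector<std::vector<int>>& f, int a, int b) {
    return std::min(f[a][b], f[b][a]);  // p = min{f(a,b), f(b,a)}, eq. (2)
}

// Applies criteria C1-C3 to an edge with annotation p, in a component
// of n nodes whose maximum annotation is pmax.
bool removeEdge(int p, int pmax, int n) {
    const int thr = static_cast<int>(std::floor(std::log(n)));
    if (p > thr) return true;                      // C1: remove the edge
    if (pmax < thr && p == pmax - 1) return true;  // C2: remove edges of weight pmax-1
    return false;                                  // C3: keep the edge
}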
The final output of our algorithm is a set of partitions or clusters of the input data. The algorithm does not require any pre-determined value for q, but it is obviously possible to change the threshold from ⌊ln(n)⌋ to any other user-defined parameter. The algorithm can be understood as a recursive procedure (see below):

Algorithm 1. PRE-MSTkNN+ (D: distance matrix)
1: Compute GqNN.
2: Compute GMSTp = MST(GqNN).

Algorithm 2. PRUNE-MSTkNN+ (GMSTp)
1: G' = Partition GMSTp, using the criteria C1, C2 and C3 described above.
2: c = connectedComponent(G')
3: If c > 1 then
4:   Gcluster = ∪(i=1..c) PRUNE-MSTkNN+(components(G'_i))
5: End if
6: Return Gcluster
The function connectedComponent() gives the number of components in G', and the function components() identifies and returns each of the components. Unlike the original algorithm in [1], we compute the MST only once (at the beginning), instead of at each recursive step. This change also gives a significant speed-up in run-time performance over the previous algorithm. The following lemma proves that this approach is sound (i.e., a partitioned subtree also represents an exact MST of the relevant component of the complete graph):

Lemma 1. Let T be a minimum spanning tree for a weighted graph G. If we select an edge e from T and partition the graph according to the subgraphs induced by the subtrees obtained by excluding e from T, then these subtrees are also minimum spanning trees for the corresponding subgraphs.

Proof. Let T be a minimum spanning tree for a graph G. Let T be partitioned into two subtrees A and B with vertex and edge sets V(A), V(B), E(A) and E(B), respectively. Furthermore, let V(A) ∩ V(B) = ∅ and V(A) ∪ V(B) = V(G), and let A and B be connected by a single edge e in T. Now consider the graph G[V(A)] and let T' be a minimum spanning tree for G[V(A)]. We define the weight w of a spanning tree to be the sum of the weights of its edges, and extend this in the natural way to any subtree. Then, w(T) = w(A) + w(B) + w(e). Now, assume that w(T') < w(A). Then we could replace the subtree A with T' and join it to B using e. As V(A) and V(B) are disjoint, we cannot introduce any cycles; therefore T' joined with B via e must be a tree and, further, a spanning tree for G. However, this new tree would have weight less than w(T), contradicting the minimality of T. Therefore, such a T' cannot exist.
The main advantage of this algorithm over all other MST-based graph clustering algorithms (for example [8-9]) is that it prunes the MST edges using local connectivity, instead of using the exact distance between the two nodes of an edge (e.g., deleting the longest edge). Our algorithm can produce better results in terms of local connectivity (i.e., homogeneity), which is a desirable characteristic in clustering biological networks.

3.3 Implementation
The Test Environment. The computational tests were performed on a 16-node cluster (Intel Xeon 5550 processors, 2.67 GHz, 8 cores per node). The programs were written in C++ with the support of the STL, STXXL¹ and BOOST² libraries and compiled using the g++ 4.4.4 compiler on a Linux OS, kernel ver. 2.6.9.

¹ http://stxxl.sourceforge.net/
² http://www.boost.org/

Parallel/Distributed NN graph computation. To compute the distance matrix, we use the message-passing interface (MPI) to distribute the data set (row-wise) onto P parallel processors and then initiate the parallel computation of the distance metric in each of them using OpenMP (Multi-Processing). The method for efficiently distributing the computation of the upper/lower triangle of the symmetric similarity matrix will be discussed later.
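A minimal sketch of this row-wise distribution is given below; dist() stands for the chosen metric (the similarity matrix here is derived from Pearson's correlation), and all names and the storage layout are illustrative assumptions rather than the authors' actual code.

#include <mpi.h>
#include <vector>

// Assumed distance function between two expression profiles.
double dist(const std::vector<double>& a, const std::vector<double>& b);

// Rank r computes rows r, r+P, r+2P, ... of the distance matrix, and
// OpenMP threads share the columns of each row. Only the upper
// triangle is computed, since the matrix is symmetric.
void computeDistances(const std::vector<std::vector<double>>& data,
                      std::vector<std::vector<double>>& d) {
    int rank = 0, procs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &procs);
    const int n = static_cast<int>(data.size());
    for (int i = rank; i < n; i += procs) {   // this rank's rows
        #pragma omp parallel for
        for (int j = i + 1; j < n; ++j)
            d[i][j] = dist(data[i], data[j]);
    }
}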
The EM MST and CC computation. We compute the MST using the EM MST algorithm in [4]. The I/O complexity of this algorithm is O(sort(m)·log(n/M)), where n is the number of nodes of the original graph, m is the number of edges, M is the number of nodes that fit into the computer's internal memory, and sort(m) is the time required for sorting the m edges. After partitioning the MST, we identify the connected components using the EM connected components algorithm in [5-6]. The I/O complexity of this algorithm is O(m·log(log(n))). Unlike other clustering tools, we store the connected components/clusters in external memory and keep only the list of components in the computer's internal memory. This avoids excessive use of internal memory, even when there is a large number of components or clusters. Additionally, we tuned the implementations of the adapted algorithms [4-6] for better performance on denser graphs.
4 Results

4.1 Data Description

We used two different data sets to demonstrate the performance of our proposed EM algorithm, MSTkNN+. The first data set is used to illustrate the algorithm and contains a distance matrix between 10 Australian cities. The second data set is a breast cancer gene-expression data set from a study by van de Vijver et al. [14]. This microarray data set contains the expression of 24,158 probe sets in 295 primary breast cancer patients, together with the clinical metastasis information (in terms of years to relapse) for all patients. We also created a third, larger data set from van de Vijver et al. [14] as follows. First, we filtered the probe sets using Fayyad and Irani's algorithm [15]. This step is supervised and aims at finding probe sets that are differentially expressed between the samples labeled "metastasis" and those labeled "non-metastasis". The latter label does not mean that these patients had no relapse; rather, "non-metastasis" indicates that the patient had no relapse within five years after the initial diagnosis, although a metastasis may still have occurred during the study, up to 14 years later in one case. Next, we refined the selection of probe sets using the (alpha-beta)-k-Feature Set methodology [16], obtaining a set of 876 probe sets. Finally, we produced a new large data set by subtracting the expression values of each possible pair of probes. These unique probe pairs are termed metafeatures, as in Rocha de Paula et al. [17]. The result is an artificial data set with 384,126 elements, including all the filtered probes and all metafeatures.

4.2 Application on the City Distance Data Set
Our first application is on a distance matrix that we created by taking distances among 10 Australian cities. The data set is given in Table A.2. We first create a qNN graph from the data set (see Table A.3) for q = 3 and an MSTp, where we annotate each edge with an integer value (p) as described in equation (2). For example (see Figure 1(a) and Table A.3), Adelaide is the third nearest neighbor of Melbourne
Fig. 1. (a) The MSTp created from the 10 Australian cities (the locations of the cities on the map are schematic). The edge between "Albany" and "Adelaide" is the candidate for deletion, as the neighborhood value p > ⌊ln(10)⌋ = 2. (b) In the first iteration of MSTkNN+, the edge between "Katherine" and "Adelaide" is the next candidate for deletion, as p > ⌊ln(7)⌋ = 1, where 7 is the number of elements in that component. (c) Final clustering result.
and Melbourne is the first nearest neighbor of Adelaide. Therefore, we give a weight of 1 (the minimum) to the edge that connects Adelaide and Melbourne. Finally, we prune the MST edges using the criteria C1, C2 and C3 on each of the components. The result of our algorithm is presented in Figure 1(c).

4.3 Application on the Breast Cancer Data Set
Our second application is on the breast cancer data set. It contains the gene expression values measured on 24,158 probe sets for 295 primary breast cancer patients [14]. We first compute a similarity matrix using Pearson's correlation and create a qNN graph that contains 24,158 vertices and 265,738 edges. Next, we create the MSTp. Finally, we apply our proposed algorithm to partition the MSTp and obtain the clusters (see Figure 2).
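For reference, Pearson's correlation between two expression profiles can be computed as below; this is a standard textbook formulation, not the authors' code.

#include <cmath>
#include <vector>

// Pearson's correlation coefficient between two equally long profiles,
// used as the similarity measure when building the qNN graph.
double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    const double n = static_cast<double>(x.size());
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
    }
    const double cov = sxy - sx * sy / n;          // co-variation term
    const double vx = sxx - sx * sx / n;           // variation of x
    const double vy = syy - sy * sy / n;           // variation of y
    return cov / std::sqrt(vx * vy);
}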
Fig. 2. Visualization of the clusters from the breast cancer data set of [14]. Some genes of interest are highlighted.
Additionally, we used iHOP³ to assess the importance of the genes that occupy central regulatory positions in some of the clusters (see Figure 3). Our results show that many of the centrally positioned genes have already been discussed in connection with breast cancer and its progression (see Table 1). Moreover, the genes with fewer published papers can also be further investigated, based on their conspicuous position in the clustering and their adjacency to genes that have already been implicated in breast cancer.

Table 1. The number of publications associated with some of the observed most central genes, obtained using iHOP and PubMed by searching for the name of the gene and its aliases together with the words "breast" and "cancer" (ordered by gene symbol; highly referenced genes are in bold face)

Gene Symbol | Gene Name                          | Breast | Cancer
COPS8       | COP9 constitutive photomorphogen   | 10     | 57
CPNE1       | copine I                           | 1      | 3
ESR1        | estrogen receptor 1                | 17,352 | 28,250
EST         | mitogen-activated protein kinase 8 | 165    | 879
FGF8        | fibroblast growth factor 8         | 27     | 156
FOXA1       | forkhead box A1                    | 60     | 120
GATA3       | GATA binding protein 3             | 219    | 1399
GPR35       | G protein-coupled receptor         | 0      | 2
HAPLN3      | hyaluronan and proteoglycan link 3 | 1      | 1
HIC2        | hypermethylated in cancer          | 13     | 122
LOC729589   | hypothetical LOC729589             | 0      | 0
MTNR1A      | melatonin receptor 1A              | 194    | 1193
NCOA7       | nuclear receptor coactivator 7     | 1      | 3
PLEKHA3     | pleckstrin homology domain 3       | 0      | 2
PLK1        | polo-like kinase 1                 | 49     | 458
SPAST       | spastic paraplegia 4               | 0      | 3

4.4 Application on an Expanded Breast Cancer Data Set with 384,126 Vertices and 4,993,638 Edges
Finally, we apply our proposed algorithm (MSTkNN+) to a large-scale "artificial" data set that is an expanded version of the breast cancer data set [14]. This data set has 384,126 elements (the values of 383,250 metafeatures together with the values of the 876 probe sets obtained by filtering the original data set). Additionally, we include the clinical metastasis information as a "phenotypical dummy probe set". As previously described, we first create the qNN graph, containing 384,126 vertices and 4,993,638 edges. Next, we apply MSTkNN+ to find the clusters. Due to the limitations of existing visualization tools, it is impossible to provide a picture of the complete clustering. Instead, we present the group of metafeatures that cluster closely with the "phenotypical dummy probe set" (years to relapse), zooming in on a particularly interesting part (see Figure 3). We find one metafeature (BCAR1-SLC40A1) that correlates better with the metastasis information than either of the individual probe sets alone (e.g., genes BCAR1 or SLC40A1; see Figure 4).

³ http://www.ihop-net.org/UniPub/iHOP/
Fig. 3. Partial visualization of the cluster that contains the clinical metastasis information as a phenotypical gene. The rectangular nodes indicate that the genes in these metafeatures share a common biological pathway (identified using GATHER⁴).
Fig. 4. The metafeature (BCAR1-SLC40A1) shows better correlation with the clinical metastasis values of each patient than either feature (i.e., BCAR1, Breast Cancer Anti-estrogen Resistance 1, or SLC40A1, Ferroportin-1) alone
It is also interesting to note the presence of SLC40A1 in three of the metafeatures co-expressed with the time-to-relapse values (the clinical metastasis "dummy probe set"). Jiang et al. suggested that "breast cancer cells up-regulate the expression of iron importer genes and down-regulate the expression of iron exporter SLC40A1 to satisfy their increased demand for iron" [18]. This indicates that, for those tumors that may relapse (and for which a different genetic signature may need to be found), the joint expression of BCAR1 and Ferroportin may be associated with time to relapse. Similarly, the other identified metafeatures could be investigated further.

4.5 Performance Comparisons
We compared the solutions of our clustering approach against K-Means, SOM, CLICK and the original MSTkNN [1], using the homogeneity and separation indexes, which indicate, respectively, how similar the elements within a cluster are and how dissimilar the clusters are from one another (see Table 2). We used the implementations of K-Means, SOM and CLICK available in the Expander tool⁵; the implementation of MSTkNN in [1] was obtained from http://cibm.newcastle.edu.au. The averages of homogeneity (Havg) and separation (Savg) were computed as in [19], and Pearson's correlation was used as the metric for computing the similarity matrix.

⁴ http://gather.genome.duke.edu/
⁵ http://www.cs.tau.ac.il/~rshamir/expander/
Table 2. Performance comparisons with K-Means, SOM, CLICK and the original MSTkNN approach in terms of homogeneity and separation

Data               | Algorithm                    | Param. | Havg  | Savg   | #Clust. | Time (min) | Mem. (MB)
Breast Cancer      | K-Means                      | K=41   | 0.521 | -0.186 | 41      | ~1         | ~250
Filtered n=876     | SOM                          | 3×3    | 0.501 | -0.015 | 9       | ~0.2       | ~200
                   | CLICK                        | -      | 0.538 | -0.281 | 8       | ~0.5       | ~250
                   | MSTkNN                       | -      | 0.287 | 0.386  | 41      | ~0.5       | ~250
                   | MSTkNN+                      | -      | 0.288 | 0.389  | 45      | ~0.3       | ~156
Complete n=24,158  | K-Means, SOM, CLICK          | -      | -     | -      | -       | -          | -
                   | MSTkNN                       | -      | 0.429 | 0.390  | 732     | ~12        | ~8,100
                   | MSTkNN+                      | -      | 0.430 | 0.398  | 745     | ~5^        | ~650†
Expanded n=384,126 | K-Means, SOM, CLICK, MSTkNN  | -      | -     | -      | -       | -          | -
                   | MSTkNN+ (ours)               | -      | 0.630 | 0.410  | 2,587   | ~15^       | ~1,500†

^ Does not include the time for computing the similarity matrix.
† Internal memory consumption can be pre-defined with EM environment parameters.
From Table 2, we can clearly see that MSTkNN succeeds in producing small, precise clusters from the filtered expression data (n=876). Even for the same number of clusters, it performs better (i.e., higher homogeneity and lower separation values) than K-Means (even when we intentionally set K=41 in K-Means), SOM and CLICK. The proposed MSTkNN+ showed better performance in terms of homogeneity, time and memory usage, although the separation value increased slightly. For the complete breast cancer data set (n=24,158), only MSTkNN and our proposed algorithm were able to cluster the data set, with high and low in-memory usage, respectively. The other algorithms could not complete the task and ran indefinitely on the test machine. Finally, for the expanded breast cancer data set (n=384,126), only our proposed algorithm's implementation, MSTkNN+, could successfully cluster the whole data set, in 15 minutes and using a reasonable amount of internal memory.
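For concreteness, one common formulation of the two indexes is sketched below (the paper computes them as in [19]; the exact definitions there may differ in detail). All type and function names are ours, and sim() stands for the similarity measure, Pearson's correlation in this paper.

#include <cstddef>
#include <vector>

// Illustrative types: a cluster holds its member profiles and centroid.
struct Cluster {
    std::vector<std::vector<double>> elements;
    std::vector<double> centroid;
};
double sim(const std::vector<double>& a, const std::vector<double>& b);

// Average homogeneity: mean similarity of each element to its own
// cluster centroid (higher means tighter clusters).
double havg(const std::vector<Cluster>& cs) {
    double s = 0; std::size_t cnt = 0;
    for (const Cluster& c : cs)
        for (const auto& e : c.elements) { s += sim(e, c.centroid); ++cnt; }
    return s / static_cast<double>(cnt);
}

// Average separation: size-weighted mean similarity between centroids
// of different clusters (lower means better-separated clusters).
double savg(const std::vector<Cluster>& cs) {
    double s = 0, w = 0;
    for (std::size_t i = 0; i < cs.size(); ++i)
        for (std::size_t j = i + 1; j < cs.size(); ++j) {
            const double wij = static_cast<double>(cs[i].elements.size())
                             * static_cast<double>(cs[j].elements.size());
            s += wij * sim(cs[i].centroid, cs[j].centroid);
            w += wij;
        }
    return s / w;
}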
5 Conclusion and Future Work

In this paper, we have proposed a significant improvement to the existing MSTkNN-based clustering approach. Our implementation is faster (due to parallel/distributed pre-processing and algorithmic enhancements) and more memory-efficient and scalable (due to the EM implementation) than the one in [1]. The clusters identified by our approach are meaningful, precise and comparable with those of other state-of-the-art algorithms. Our future work includes the design and implementation of a nearest-neighbor-based MST algorithm, so that we can eliminate the prohibitive computation of the similarity matrix when the data set is extremely large. Finding the nearest neighborhood of a point in space is widely researched, and one way to do so is to build a k-d tree. Other approaches, such as GPU-based similarity matrix computation, could also help accelerate the clustering process.
References
1. Inostroza-Ponta, M.: An Integrated and Scalable Approach Based on Combinatorial Optimization Techniques for the Analysis of Microarray Data. PhD thesis, The University of Newcastle, Australia (2008)
2. Gonzalez-Barrios, J.M., Quiroz, A.J.: A clustering procedure based on the comparison between the k nearest neighbors graph and the minimal spanning tree. Statistics and Probability Letters 62(3), 23–34 (2003)
3. Inostroza-Ponta, M., Mendes, A., Berretta, R., Moscato, P.: An integrated QAP-based approach to visualize patterns of gene expression similarity. In: Randall, M., Abbass, H.A., Wiles, J. (eds.) ACAL 2007. LNCS (LNAI), vol. 4828, pp. 156–167. Springer, Heidelberg (2007)
4. Dementiev, R., Sanders, P., Schultes, D., Sibeyn, J.: Engineering an external memory minimum spanning tree algorithm. In: 3rd IFIP Intl. Conf. on Theoretical Computer Science, pp. 195–208 (2004)
5. Sibeyn, J.: External Connected Components. In: Hagerup, T., Katajainen, J. (eds.) SWAT 2004. LNCS, vol. 3111, pp. 468–479. Springer, Heidelberg (2004)
6. Schultes, D.: External memory spanning forests and connected components. Technical report (2004), http://algo2.iti.kit.edu/dementiev/files/cc.pdf
7. Vitter, J.S.: External memory algorithms and data structures: Dealing with massive data. ACM Computing Surveys 33 (2001)
8. Xu, Y., Olman, V., Xu, D.: Clustering Gene Expression Data Using a Graph-Theoretic Approach: An Application of Minimum Spanning Trees. Bioinformatics 18(4), 526–535 (2002)
9. Grygorash, O., Zhou, Y., Jorgensen, Z.: Minimum Spanning Tree Based Clustering Algorithms. In: Proc. of the 18th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2006), pp. 73–81. IEEE Computer Society, Washington, DC, USA (2006)
10. Doowang, J.: An external memory approach to computing the maximal repeats across classes of DNA sequences. Asian Journal of Health and Information Sciences 1(3), 276–295 (2006)
11. Choi, J.H., Cho, H.G.: Analysis of common k-mers for whole genome sequences using SSB-tree. Japanese Society for Bioinformatics 13, 30–41 (2002)
12. Chiang, Y., Goodrich, M.T., Grove, E.F., Tamassia, R., Vengroff, D.E., et al.: External-memory graph algorithms. In: SODA 1995: Proceedings of the Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 139–149. SIAM, Philadelphia (1995)
13. Abello, J., Buchsbaum, A.L., Westbrook, J.R.: A functional approach to external graph algorithms. Algorithmica, 332–343 (1998)
14. van de Vijver, M.J., He, Y.D., van't Veer, L.J., Dai, H., et al.: A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med. 347(25) (2002)
15. Fayyad, U.M., Irani, K.B.: Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In: IJCAI, pp. 1022–1029 (1993)
16. Cotta, C., Sloper, C., Moscato, P.: Evolutionary Search of Thresholds for Robust Feature Set Selection: Application to the Analysis of Microarray Data. In: Raidl, G.R., Cagnoni, S., Branke, J., Corne, D.W., Drechsler, R., Jin, Y., Johnson, C.G., Machado, P., Marchiori, E., Rothlauf, F., Smith, G.D., Squillero, G. (eds.) EvoWorkshops 2004. LNCS, vol. 3005, pp. 21–30. Springer, Heidelberg (2004)
17. Rocha de Paula, M., Ravetti, M.G., Rosso, O.A., Berretta, R., Moscato, P.: Differences in abundances of cell-signalling proteins in blood reveal novel biomarkers for early detection of clinical Alzheimer's disease. PLoS ONE 6(e17481) (2011)
18. Jiang, X.P., Elliot, R.L., Head, J.F.: Manipulation of iron transporter genes results in the suppression of human and mouse mammary adenocarcinomas. Anticancer Res. 30(3), 759–765 (2010)
19. Shamir, R., Sharan, R.: CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. In: Proc. of ISMB, pp. 307–316 (2000)
Appendix

Table A.1. A list of known graph-based clustering algorithms/tools for biological networks⁶

Name      | Approach            | Language     | Max. test data (n)
cMonkey   | Bi-clustering       | R            | 2,993
GTOM      | Topological overlap | R            | 4,000
SAMBA     | Neighborhood search | C/C++        | 4,177
CAST      | Affinity search     | Matlab       | 6,000
NNN       | Mutual NN search    | Java         | 6,162
EXCAVATOR | MST                 | C, Java      | 6,178
HCS       | Minimum cut         | Matlab, LEDA | 7,800
MSTkNN    | Intersect MST-kNN   | Java         | 14,772
CLICK     | Mincut              | C/C++        | 29,600
Ncut-KL   | Mincut              | C/C++        | 40,703
TribeMCL  | MCL                 | —            | 80,000
MPI-MCL   | MCL, dist. comp.    | Fortran, MPI | 125,008

Table A.2. A distance matrix in km for 10 Australian cities⁷
Table A.2. A distance matrix in KMs for 10 Australian cities7 Canb. 0 240 473 967 3102 3141 1962 865 2838 3080
Canberra Sydney Melb. Adelaide Perth Darwin Katherine Hobart Albany Bunbury
Syd. 240 0 713 1163 3297 3153 2030 1060 3046 3282
Melb 473 713 0 654 2727 3151 1892 601 2436 2690
Adel. 967 1163 654 0 2136 2620 1330 1165 1885 2118
Perth 3102 3297 2727 2136 0 2654 1995 3017 392 156
Darwin 3141 3153 3151 2620 2654 0 1291 3743 2828 2788
Kath. 2870 2882 2885 2364 2562 271 0 3478 2702 2688
Hobart 865 1060 601 1165 3017 3743 2470 0 2678 2951
Albany 2838 3046 2436 1885 392 2828 1993 2678 0 279
Bunb. 3080 3282 2690 2118 156 2788 2688 2951 279 0
Table A.3. The three nearest neighbors (q = 3) of each of the 10 Australian cities

City      | q = 1           | q = 2           | q = 3
Canberra  | Sydney (240)    | Melbourne (473) | Hobart (865)
Sydney    | Canberra (240)  | Melbourne (713) | Hobart (1060)
Melbourne | Canberra (473)  | Hobart (601)    | Adelaide (654)
Adelaide  | Melbourne (654) | Canberra (967)  | Sydney (1163)
Perth     | Bunbury (156)   | Albany (392)    | Adelaide (2136)
Darwin    | Katherine (271) | Adelaide (2620) | Perth (2654)
Katherine | Darwin (1291)   | Adelaide (1330) | Melbourne (1892)
Hobart    | Melbourne (601) | Canberra (865)  | Sydney (1060)
Albany    | Bunbury (279)   | Perth (392)     | Adelaide (1885)
Bunbury   | Perth (156)     | Albany (279)    | Adelaide (2118)

⁶ Details about the methods and test environments can be found in the relevant publications.
⁷ Computed using the distance tool at http://www.geobytes.com/citydistancetool.htm
Reconfigurable Hardware to Radionuclide Identification Using Subtractive Clustering

Marcos Santana Farias¹, Nadia Nedjah², and Luiza de Macedo Mourelle³

¹ Department of Instrumentation, Nuclear Engineering Institute, Brazil
[email protected]
² Department of Electronics Engineering and Telecommunications, State University of Rio de Janeiro, Brazil
[email protected]
³ Department of Systems Engineering and Computation, State University of Rio de Janeiro, Brazil
[email protected]
Abstract. Radioactivity is the spontaneous emission of energy from unstable atoms. Radioactive sources contain radionuclides. A radionuclide undergoes radioactive decay and emits gamma rays and subatomic particles, constituting ionizing radiation. The gamma-ray energies of a radionuclide are used to determine the identity of the gamma emitters present in a source. This paper describes a hardware implementation of the subtractive clustering algorithm to perform radionuclide identification.

Keywords: Radionuclides, data classification, reconfigurable hardware, subtractive clustering.
1 Introduction

Radioactive sources contain radionuclides. A radionuclide is an atom with an unstable nucleus, i.e. a nucleus characterized by excess energy, which is available to be imparted. In this process, the radionuclide undergoes radioactive decay and emits gamma rays and subatomic particles, constituting ionizing radiation. Radionuclides may occur naturally but can also be artificially produced [10]. Radioactivity is thus the spontaneous emission of energy from unstable atoms. Correct radionuclide identification can be crucial for planning protective measures, especially in emergency situations, as it defines the type of radiation source and its radiological hazard [6]. The gamma-ray energy of a radionuclide is a characteristic of the atomic structure of the material. This paper introduces the application of a classification method for radioactive elements that allows rapid and efficient identification and can be implemented in portable systems. Our intention is to run a clustering algorithm on portable equipment to perform identification of radionuclides. Clustering algorithms consume considerable processing time when implemented in software, especially on processors intended for portable use, such as micro-controllers. Thus, a custom implementation in reconfigurable hardware is a good choice for embedded systems, which require real-time execution as well as low power consumption.
The rest of this paper is organized as follows: first, in Section 2, we present the principles of nuclear radiation detection. In Section 3, we briefly review existing clustering algorithms and concentrate on the subtractive clustering algorithm. In Section 4, we describe the proposed architecture for the cluster center calculator using the subtractive clustering algorithm. Thereafter, in Section 5, we present some performance figures to assess the efficiency of the proposed implementation. Last but not least, in Section 6, we draw some conclusions and point out directions for future work.
2 Radiation Detection

Radioactivity and ionizing radiation are not naturally perceived by the human sense organs and cannot be measured directly. Therefore, detection is performed by analyzing the effects produced by radiation when it interacts with a material. There are three main types of ionizing radiation emitted by radioactive atoms: alpha, beta and gamma. Alpha and beta are particles that have mass and are electrically charged, while gamma rays, like x-rays, are electromagnetic waves. The emission of alpha and beta radiation is always accompanied by the emission of gamma radiation, so most detectors target gamma radiation. The gamma energy emitted by a radionuclide is a characteristic of the atomic structure of the material. This energy is measured in electronvolts (eV). One electronvolt is an extremely small amount of energy, so it is common to use kiloelectronvolts (keV) and megaelectronvolts (MeV). Consider, for instance, Cesium-137 (137Cs) and Cobalt-60 (60Co), which are two common gamma-ray sources. These radionuclides emit radiation at one or two discrete energies: Cesium-137 emits 0.662 MeV gamma rays, and Cobalt-60 emits 1.33 MeV and 1.17 MeV gamma rays. These energies are known as decay energies and define the decay scheme of the radionuclide. Each radionuclide, among many others, has a unique decay scheme by which it is identified [10]. When these emissions are collected and analyzed with a gamma-ray spectroscopy system, a gamma-ray energy spectrum can be produced. A detailed analysis of this spectrum is typically used to determine the identity of the gamma emitters present in the source. The gamma spectrum is characteristic of the gamma-emitting radionuclides contained in the source [11]. A typical gamma-ray spectrometry system (Fig. 1) consists of a scintillator detector and a measurement system. The interaction of radiation with the system occurs in the scintillator detector, and the measurement system interprets this interaction. The scintillator detector emits light when gamma radiation transfers to it all or part of its energy. This light is detected by a photomultiplier optically coupled to the scintillator, which outputs an electrical signal whose amplitude is proportional to the deposited energy. The fact that these detectors provide an electrical signal proportional to the deposited energy allows the generation of the gamma energy spectrum of a radioactive element (a histogram). This spectrum is obtained using a
Fig. 1. Gamma spectrometry system - main components
multichannel analyzer (MCA). The MCA consists of an ADC (Analog-to-Digital Converter), which converts the amplitude of the analog input into a number, or channel. Each channel is associated with a counter that accumulates the number of pulses with a given amplitude, forming a histogram. These data form the energy spectrum of the gamma radiation. Since different radionuclides emit radiation with different energy distributions, analysis of the spectrum can provide information on the composition of the radioactive source and allow its identification. Figure 2 shows a spectrum generated by simulation for a radioactive source with 137Cs and 60Co. The x-axis represents the channels of a 12-bit ADC. In such a representation, the 4096 channels of the MCA correspond to 2.048 MeV in the energy spectrum. The first peak, in channel 1324, is characteristic of 137Cs (0.662 MeV). The second and third peaks are energies of 60Co.
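In software terms, the MCA behaves like the following sketch, in which each 12-bit ADC sample increments one of 4096 channel counters; the function name is ours.

#include <cstdint>
#include <vector>

// Builds the energy histogram: each digitized pulse amplitude selects
// one of the 4096 channels, whose counter is incremented.
std::vector<uint32_t> buildSpectrum(const std::vector<uint16_t>& adcSamples) {
    std::vector<uint32_t> channels(4096, 0);
    for (uint16_t amplitude : adcSamples)
        ++channels[amplitude & 0x0FFF];  // keep the 12-bit channel index
    return channels;
}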
Fig. 2. Energy spectrum simulated for a source with 137Cs and 60Co (counts vs. channels)
The components and characteristics of a gamma spectrometry system (the type of detector, the detection time, the noise of the high-voltage source, the number of channels, the stability of the ADC, temperature changes) can affect the formation of the spectrum and the quality of the result. For this reason, it is difficult to establish a system for automatic identification of radionuclides, especially for a wide variety of them. Equipment available on the market, using different identification algorithms and covering different numbers of identifiable radionuclides, does not perform well [6].
3 Clustering Algorithms

Clustering algorithms partition a collection of data into a certain number of clusters, groups or subsets. The aim of the clustering task is to group the data into clusters in such a way that the similarity between members of the same cluster is higher than that between members of different clusters. Clustering of numerical data forms the basis of many classification algorithms. Various clustering algorithms have been developed. One of the first and most commonly used is based on the Fuzzy C-means method (FCM). Fuzzy C-means is a clustering method that allows one piece of data to belong to two or more clusters. It was developed by Dunn [1], improved by Hathaway [7], and is commonly used in pattern recognition. Yager and Filev [2] introduced the so-called mountain function as a measure of spatial density around the vertices of a grid, shown in (1):

$$M(v_i) = \sum_{j=1}^{n} e^{-\alpha \|x_j - v_i\|^{2}} \qquad (1)$$
where α > 0 and M is the mountain function, calculated for the ith vertex vi during the first step; n is the total number of data points or samples, which are assumed to be available before the algorithm is initiated. The norm ‖·‖ denotes the Euclidean distance between the points used as arguments, and xj is the current data point or sample. A vertex surrounded by many data points or samples will have a high value for this function and, conversely, a vertex with no neighboring data points will have a low value. It should be noted that this function is used only during the first step, over the whole set of available data. During the subsequent steps, the function is redefined by subtracting a value proportional to the peak value of the mountain function. A very similar approach is the subtractive clustering (SC) proposed by Chiu in [3]. It uses the so-called potential value, defined as in (2):

$$P_i = \sum_{j=1}^{n} e^{-\alpha \|x_j - x_i\|^{2}}, \quad \alpha = \frac{4}{r_a^{2}} \qquad (2)$$

wherein Pi is the potential value of the ith data point as a cluster center, xi is the data point and ra is a positive constant, called the cluster radius.
The potential value associated with each data point depends on its distance to all its neighbors. Considering (2), a data point or sample that has many points or samples in its neighborhood will have a high potential value, while a remote data point or sample will have a low potential value. After calculating the potential of each point or sample, the one, say x*, with the highest potential value, say P*, is selected as the first cluster center. Then the potential of each point is reduced as defined in (3); this avoids closely spaced clusters. Until the stopping criterion is satisfied, the algorithm continues selecting centers and revising potentials iteratively:

$$P_i = P_i - P^{*} e^{-\beta \|x_i - x^{*}\|^{2}}, \quad \beta = \frac{4}{r_b^{2}} \qquad (3)$$

In (3), rb defines the radius of the neighborhood within which a significant potential revision occurs, and x* and P* refer to the most recently selected cluster center and its potential. The data points or samples near the selected center will have significantly reduced density measures, making them unlikely to be selected as the next cluster center. The subtractive clustering algorithm can be briefly described by the following 4 main steps:

– Step 1: Using (2), compute the potential Pi for each point or sample, 1 ≤ i ≤ n;
– Step 2: Select the data point or sample x* with the highest potential value P*;
– Step 3: Revise the potential value of each data point or sample according to (3), and find the new maximum value maxPi;
– Step 4: If maxPi ≤ εP*, wherein ε is the reject ratio and P* the potential of the first selected center, terminate the computation; otherwise, take the data point or sample with the highest potential value as the next cluster center and return to Step 3.

The main advantage of this method is that the number of clusters or groups is not predefined, as it is in the fuzzy C-means method, for instance. Therefore, this method is suitable for applications where one does not know, or does not want to assign, an expected number of clusters a priori. This is the main reason for choosing this method for the identification of radionuclides.
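For reference, steps 1-4 can be sketched in C++ for one-dimensional samples as follows; the parameter values are illustrative defaults, and the accept-ratio refinement found in some formulations of [3] is omitted.

#include <cmath>
#include <vector>

// Minimal sketch of subtractive clustering on 1-D samples x, returning
// the selected cluster centers. ra, rb and eps (the reject ratio) are
// assumed parameter values, not taken from the paper.
std::vector<double> subtractiveClustering(const std::vector<double>& x,
                                          double ra = 0.5, double rb = 0.75,
                                          double eps = 0.15) {
    const double alpha = 4.0 / (ra * ra), beta = 4.0 / (rb * rb);
    const std::size_t n = x.size();
    std::vector<double> P(n, 0.0), centers;
    for (std::size_t i = 0; i < n; ++i)          // step 1: potentials, eq. (2)
        for (std::size_t j = 0; j < n; ++j)
            P[i] += std::exp(-alpha * (x[i] - x[j]) * (x[i] - x[j]));
    std::size_t best = 0;
    for (std::size_t i = 1; i < n; ++i) if (P[i] > P[best]) best = i;
    const double Pfirst = P[best];               // potential of the first center
    while (true) {
        centers.push_back(x[best]);              // step 2: select a center
        const double xc = x[best], Pc = P[best];
        for (std::size_t i = 0; i < n; ++i)      // step 3: revise, eq. (3)
            P[i] -= Pc * std::exp(-beta * (x[i] - xc) * (x[i] - xc));
        best = 0;
        for (std::size_t i = 1; i < n; ++i) if (P[i] > P[best]) best = i;
        if (P[best] <= eps * Pfirst) break;      // step 4: reject-ratio test
    }
    return centers;
}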
4 Proposed Architecture

This section provides an overview of the macro-architecture and describes the broad objectives of the proposed hardware, which implements the subtractive clustering algorithm briefly explained in Section 3. The hardware implementation of this algorithm is the central piece of a classification system for radioactive elements. For reference, we call this hardware hsc (hardware for subtractive clustering). This component processes all the arithmetic computation, described in Section 3, to calculate the potential of each point in the subtractive clustering
algorithm. It has two components (exp1 and exp2) to compute the exponential value $e^{-\alpha\|x_i - x_j\|^{2}}$ and one component to sum (adder). The other component of this macro-architecture is called slc (storage, loading and control), which provides the hsc with the set of samples for the selection of cluster centers and stores the calculated potential of each sample. This component also hosts the controller of the hsc. Figure 3 shows the components of the described macro-architecture.
Fig. 3. Macro-architecture components - SLC and HSC
The slc is a controller based on a state machine. It includes a dual-port memory md, which provides the data to be clustered, and a memory mp, which bookkeeps the potential associated with each clustered data point. The data, in this case, could be provided by an ADC belonging to a typical gamma-ray spectrometry system. The registers xmax, xi and xIndex maintain the required data until the components exp1 and exp2 have completed the related computation. We assume the xmax value is available in memory md at address 0. xmax is the largest value found in the data stored in md; this register is used for data normalization. The two exp components inside the hsc receive, at the same time, different xj values from the dual-port memory md, so the two modules start at the same time and thus run in parallel. The samples fed to the exp components are two distinct values xj from two subsequent memory addresses. After the computation of $e^{-\alpha\|x_i - x_j\|^{2}}$ by exp1 and exp2, the component adder sums and accumulates the values provided at its input ports. This process is repeated until all data xj, 1 ≤ j ≤ N, are handled. Thus, this calculation determines the first Pi value to be stored in memory mp. After that, the process
is repeated to compute the potential values of all data points in memory md. At this point, the first cluster center, i.e. the sample with the maximum potential value, has been found. The slc component works as the main controller of the process: the components exp1 and exp2 are triggered by the signal StartExp sent by the slc. The proposed architecture allows the hardware for subtractive clustering (hsc) to be scaled by adding more of these components in parallel for the computation of the factors $e^{-\alpha\|x_j - x_i\|^{2}}$. This provides greater flexibility in implementing the hardware. Figure 4 shows how new hsc components are assembled in parallel. Each hsc component calculates, in parallel, the potential of a point i, i.e. the value Pi in (3). For this reason, each hsc module must receive and record a value of xi to work on during the calculation of the potential of a point. Since these values are at different addresses of the memory, the registering of the value xi has to be done at different times, because the memory cannot have its number of ports increased as the number of hsc components grows. So that the number of control signals provided by the slc component need not increase when new hsc components are added, each hsc component itself sends some control signals to the subsequent one.
Fig. 4. Macro-architecture with hsc components in parallel
These signals load the value xi (LEXi) and start the potential reduction of each point (StartPot), as shown in (3). Moreover, each hsc component receives the signal EndAdd, which indicates the end of the operation in the adder component of the subsequent hsc component. This ensures that the main control (slc) only receives these signals after all parallel hsc components have completed their transactions at each stage, allowing the hardware to be scaled without changes to the main control. Figure 5 shows the effect of this scaling, simulating different processing times in the hsc modules. The n hsc components, implemented in parallel, compute the potentials of n points of the sample set. As explained before, the recording of the value xi has to be done at different times so it can be used in the calculation of the potential.
Fig. 5. Control signals with scaled architecture
Figure 5 shows that the first hsc component receives the signal LEXi to load xi from the slc control and, after this, sends the signal LEXi to the subsequent hsc. Only after all of the hsc components have recorded their values xi is the signal that starts the exp components (StartExp) sent, together with the first pair of values xj on the dual bus BD. The internal architecture of the modules exp1 and exp2 permits the calculation of the exponential value $e^{-\alpha\|x_i - x_j\|^{2}}$. The exponential was approximated by second-order polynomials using the least-squares method [8]. The architecture computes these polynomials, and all values are represented using fractions, as in (4):

$$e^{-\alpha x} \approx \frac{N_a}{D_a}\left(\frac{N_v}{D_v}\right)^{2} + \frac{N_b}{D_b}\,\frac{N_v}{D_v} + \frac{N_c}{D_c} \qquad (4)$$
wherein the factors Na/Da, Nb/Db and Nc/Dc are pre-determined coefficients, and Nv/Dv is the fractional representation of the variable (αx). For high precision, the coefficients were calculated separately for the ranges [0, 1[, [1, 2[, [2, 4[ and [4, 8]. These coefficients appear in the quadratic polynomials of (5).
$$e^{-(\alpha x)} \approx \begin{cases} P_{[0,1[}\left(\frac{N_v}{D_v}\right) = \frac{773}{2500}\left(\frac{N_v}{D_v}\right)^{2} - \frac{372}{400}\,\frac{N_v}{D_v} + \frac{9953}{10000} \\[4pt] P_{[1,2[}\left(\frac{N_v}{D_v}\right) = \frac{569}{5000}\left(\frac{N_v}{D_v}\right)^{2} - \frac{2853}{5000}\,\frac{N_v}{D_v} + \frac{823}{1000} \\[4pt] P_{[2,4[}\left(\frac{N_v}{D_v}\right) = \frac{67}{2500}\left(\frac{N_v}{D_v}\right)^{2} - \frac{2161}{10000}\,\frac{N_v}{D_v} + \frac{4565}{10000} \\[4pt] P_{[4,8[}\left(\frac{N_v}{D_v}\right) = \frac{16}{10000}\left(\frac{N_v}{D_v}\right)^{2} - \frac{234}{10000}\,\frac{N_v}{D_v} + \frac{835}{10000} \\[4pt] P_{[8,\infty[}\left(\frac{N_v}{D_v}\right) = 0 \end{cases} \qquad (5)$$
The accuracy of these calculated values, i.e. an introduced error no larger than 0.005, is adequate to properly obtain the potential values among the data provided during the subtractive clustering process. The absolute error introduced is shown in Fig. 6. Depending on the data, this requires the number of bits representing the numerator and denominator to be at least twice the maximum found in the data points provided.
Fig. 6. Absolute error introduced by the approximation
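In software terms, the piecewise approximation of (5) reduces to the following sketch; the hardware evaluates the same polynomials with numerator/denominator (fractional) fixed-point arithmetic, whereas plain doubles are used here for clarity. The function name is ours.

// Approximates e^{-v} for v >= 0 using the coefficients of (5),
// where v plays the role of the argument (alpha x).
double expApprox(double v) {
    if (v < 1.0) return (773.0/2500)*v*v - (372.0/400)*v   + 9953.0/10000;
    if (v < 2.0) return (569.0/5000)*v*v - (2853.0/5000)*v + 823.0/1000;
    if (v < 4.0) return (67.0/2500)*v*v  - (2161.0/10000)*v + 4565.0/10000;
    if (v < 8.0) return (16.0/10000)*v*v - (234.0/10000)*v  + 835.0/10000;
    return 0.0;  // e^{-v} is negligible for v >= 8
}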
Figure 7 presents the micro-architecture of the components exp1 and exp2. It uses four multipliers, one adder/subtracter and several registers. These registers are all right-shifters: the controller adjusts the binary numbers with right shifts in these registers in order to maintain the binary framing after each operation. This is necessary to keep the multiplication results within the bit frame used, without much loss of precision; the closest fraction is used instead of a simple truncation of the higher bits of the product. In this architecture, multipliers mult1, mult2, mult3 and mult4 operate in parallel to accelerate the computation. The state machine in the controller triggers these operations and controls the various multiplexers of the architecture. The computation defined in (4) is performed as described hereafter.

– Step 1: Compute Nv × Nv, Nb × Nv, Dv × Dv and Db × Dv;
– Step 2: Right-shift the registers to bring the bit frame back to its original size and, in parallel, compute A = Na × Nv × Nv, C = Nb × Nv × Dc, D = Db × Dv × Nc and E = Db × Dv × Dc;
– Step 3: Compute C + D and, in parallel, B = Da × Dv × Dv;
– Step 4: Compute A/B + (C + D)/E.
Fig. 7. Architecture of the EXP modules to compute the exponential
5 Results

The data shown in Figure 2 were obtained using a simulation program called Real Gamma-Spectrum Emulator. These data are in a two-column spreadsheet format, where the first column corresponds to the channel and the second to the number of counts accumulated in that channel. To validate the chosen method (subtractive clustering), the algorithm was implemented in Matlab using the simulated data. As seen in the introduction, these data simulate a radioactive source consisting of 137Cs and 60Co. To apply the subtractive clustering algorithm in Matlab, the data provided by the simulation program have to be converted into one-dimensional data in a single column. For example, if channel 1324 accumulates 100 counts, the value 1324 should appear 100 times in the input; only in this way is the clustering algorithm able to split the data into subgroups by frequency of appearance. In a real application, these data would be equivalent to the output of the ADC of a gamma spectrometry system, as shown in the introduction. In the spectrum of Fig. 2, one can see three peaks. The first one, in channel 1324, is characteristic of 137Cs (0.662 MeV). The second and third peaks correspond to the energies of 60Co. The circular black marks near the first and second peaks show the result of applying the subtractive clustering algorithm to the available data with the Matlab software. These circular marks are the centers of the found clusters, which lie very near (one channel to the left of) the signal peaks, the expected result. With the configuration used for the algorithm in Matlab, the third peak was not found; this result can change by adjusting the radius ra in (2). This is enough to conclude that the data provided belong to a radioactive source with 137Cs and 60Co, and that the subtractive clustering method can be used to identify these radionuclides.
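The data preparation just described amounts to the following sketch, where each (channel, counts) pair is expanded into repeated samples; names are ours.

#include <utility>
#include <vector>

// Expands a spectrum row such as (1324, 100) into 100 copies of the
// value 1324, so the clustering sees count frequencies as densities.
std::vector<int> spectrumToSamples(const std::vector<std::pair<int,int>>& spectrum) {
    std::vector<int> samples;
    for (const auto& [channel, counts] : spectrum)
        samples.insert(samples.end(), counts, channel);
    return samples;
}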
Since the proposed architecture is based on the same algorithm, it is expected to find the same result. The initial results show that the expected cluster center can be identified as in the Matlab specification. The hardware takes about 12,660 clock cycles to yield one sum of exponential values ($\sum_{j=1}^{n} e^{-\alpha\|x_i - x_j\|^{2}}$). Considering n points in the available data set, the identification of the first cluster center would take n times that amount. Finding the center of the second cluster is faster: it takes about 13,000 clock cycles. This result can change with the data and depends on the amount of right-shift adjustment required in the shift registers during the process.
6 Conclusions

This paper describes the implementation of the subtractive clustering algorithm for radionuclide identification. The results show that the expected cluster centers can be identified with good efficiency. In data from the simulation of radioactive source signals, after conditioning of the signal and its conversion to digital form, the cluster centers represent the points that characterize the energies emitted by the simulated radionuclides. The identification of these points can classify the radioactive elements present in a sample. With this method, it was possible to identify more than one cluster center, which allows more than one radionuclide to be recognized in a radioactive source. These results reveal that the proposed hardware for subtractive clustering can be used to develop a portable system for radionuclide identification. Such a system can be developed and enhanced by integrating the proposed hardware with software executed by a processor inside the FPGA, bringing reliability and faster identification, important characteristics for these systems. Following this work, we intend to develop the portable system and also a software-only implementation using an embedded processor or a micro-controller, to compare it with the hardware-only solution developed here.
References
1. Dunn, J.C.: A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters. Journal of Cybernetics 3, 32–57 (1973)
2. Yager, R.R., Filev, D.P.: Learning of Fuzzy Rules by Mountain Clustering. In: Proc. of SPIE Conf. on Application of Fuzzy Logic Technology, Boston, pp. 246–254 (1993)
3. Chiu, S.L.: A Cluster Estimation Method with Extension to Fuzzy Model Identification. In: Proc. IEEE Internat. Conf. on Fuzzy Systems, pp. 1240–1245 (1994)
4. Navabi, Z.: VHDL - Analysis and Modeling of Digital Systems, 2nd edn. McGraw-Hill, New York (1998)
5. The MathWorks, Inc.: Fuzzy Logic Toolbox - For Use With MATLAB. The MathWorks, Inc. (1999)
6. ANSI Standard N42.34: Performance Criteria for Hand-held Instruments for the Detection and Identification of Radionuclides (2003)
7. Hathaway, R.J., Bezdek, J.C., Hu, Y.: Generalized fuzzy C-means clustering strategies using Lp norm distances. IEEE Transactions on Fuzzy Systems 8, 576–582 (2000) 8. Rao, C.R., Toutenburg, H., Fieger, A., Heumann, C., Nittner, T., Scheid, S.: Linear Models: Least Squares and Alternatives. Springer Series in Statistics (1999) 9. Santi-Jones, P., Gu, D.: Fractional fixed point neural networks: an introduction. Department of Computer Science, University of Essex, Wivenhoe Park, Colchester, Essex (2008) 10. Knoll, G.F.: Radiation Detection and Measurement, 2nd edn. John Wiley & Sons, Chichester (1989) 11. Gilmore, G., Hemingway, J.: Practical Gamma Ray Spectrometry. John Wiley & Sons, Chichester (1995)
A Parallel Architecture for DNA Matching

Edgar J. Garcia Neto Segundo¹, Nadia Nedjah¹, and Luiza de Macedo Mourelle²

¹ Department of Electronics Engineering and Telecommunications, Faculty of Engineering, State University of Rio de Janeiro, Brazil
² Department of Systems Engineering and Computation, Faculty of Engineering, State University of Rio de Janeiro, Brazil
Abstract. DNA sequences often appear as fragments, little pieces found at a crime scene or in a hair sample for a paternity exam. In order to compare these fragments with the subject or target sequence of a suspect, we need an efficient tool for DNA sequence alignment and matching. DNA matching is thus a bioinformatics task that finds relationships between sequences through alignment and then tries to interpret them. Usually done in software through the analysis of database clusters, DNA matching requires substantial computational resources, which may increase a bioinformatics project's budget. We propose a parallel hardware architecture, based on a heuristic method, capable of reducing the time spent on the matching process.
1 Introduction

Despite the discoveries about DNA made decades ago [13], computers were long unable to provide enough performance for some specific tasks; indeed, the nature of biological applications implies a prohibitive computational cost. Advances in computing now allow scientists to use informatics techniques to solve biological problems, or to improve existing methods. The field that combines computational knowledge with biological questions is called bioinformatics, or computational biology. It involves finding the genes in the DNA sequences of various organisms, developing methods to predict the structure and/or function of newly discovered proteins and structural RNA sequences, clustering protein sequences into families of related sequences, developing protein models, and aligning similar proteins and generating phylogenetic trees to examine evolutionary relationships [13]. One of the main challenges in bioinformatics consists of aligning DNA strings and understanding any functional relationships that may exist between them. For this purpose, algorithms are specifically developed to reduce the time spent in the DNA matching process, evaluating the degree of similarity between sequences. These algorithms are usually based on dynamic programming and may work well, in fair time and at fair cost, for short sequences, but they commonly take more time as the strings get bigger. Massively implemented in software, algorithms for DNA alignment compare a query sequence with a subject sequence, often stored in a public database, running a global or local search over the subject string to find the optimal alignment of the two sequences. The Needleman-Wunsch [9] and Smith-Waterman [16] algorithms are well-known algorithms for DNA
alignment. The former is based on a global search strategy and the latter uses local search. While global search based methods process the entire search space, local search based methods attempt to reduce this space, finding small similarities that are expanded in later stages. Consequently, local search based techniques are more appropriate for locating sections where global search based alignment algorithms usually fail. The major advantage of the methods based on dynamic programming is the guarantee of discovering the best match. However, that guarantee requires huge computational resources [2, 4]. DNA matching algorithms based on heuristics [19] emerged as an alternative to dynamic programming, in order to remedy the high computational cost and time requirements. Instead of aiming at the optimal alignment, heuristics based methods attempt to find a set of acceptable or pseudo-optimal solutions. By ignoring unlikely alignments, these techniques have improved the performance of DNA matching [3]. Among heuristics based methods, BLAST [1, 7] and FASTA [5, 10, 11] stand out. Both of them have well-defined procedures for the three main stages of aligning algorithms: seeding, extending and evaluating. BLAST is the fastest algorithm known so far [12, 14]. In this paper, we focus on this algorithm and propose a massively parallel architecture suited for hardware implementation of DNA matching using the BLAST algorithm. The main objective of this work is the acceleration of the aligning procedure. The rest of this paper is organized as follows: first, in Section 2, we briefly describe how the BLAST algorithm operates and report on its main characteristics. Then, in Section 3, we focus on the description of the proposed architecture. Subsequently, in Section 5, we draw some conclusions and point out new directions for future work.
2 The BLAST Algorithm

The BLAST (Basic Local Alignment Search Tool) [1] algorithm is a heuristic search based method that seeks words of length w that score at least t, called the threshold, when aligned with the query. The scoring process is performed according to predefined criteria that are usually prescribed by geneticists. This task is called seeding, in which BLAST attempts to find regions of similarity from which to start its matching procedure. This step has a very powerful heuristic advantage, because it only keeps pairs whose matching score is larger than the pre-defined threshold t. Of course, there is some risk of leaving out some worthy alignments. Nonetheless, using this strategy, the search space decreases drastically, thereby accelerating the convergence of the matching process. After identifying all possible alignment locations, or seeds, and discarding those pairs that do not score at least the prescribed threshold, the algorithm proceeds with the extension stage. It consists of extending the alignment words to the right and to the left within both the subject and query sequences, in an attempt to find a locally optimal alignment. Some versions of BLAST introduce the use of a wildcard symbol, called the gap, which can be used to replace any mismatch [1]. Here, we do not allow gaps. Finally, BLAST tries to improve the score of the high scoring pairs (HSPs) through a second extension process; a pair is dismissed when its score does not reach a new pre-defined threshold. HSPs that meet this criterion will be reported by BLAST as final results, provided that they do not exceed the pre-
scribed cutoff value, which specifies the number of descriptions and/or alignments that should be reported. This last step is called evaluating. BLAST employs a measure based on well-defined mutation scores. It directly approximates the results that would be obtained by any dynamic programming algorithm optimizing this measure. The method allows the detection of weak but biologically significant sequence similarities, and the algorithm is more than one order of magnitude faster than earlier heuristic algorithms. Compared to other heuristics-based methods, such as FASTA [5], BLAST performs DNA and protein sequence similarity alignment much faster, while being considered equally sensitive. BLAST is very popular due to the availability of the program online at the National Center for Biotechnology Information (NCBI), among other sites.
3 The Proposed Macro-architecture

Although well known, BLAST implementations are usually done in software [15]. While software implementations are of low cost, they often yield a low throughput. On the other hand, dedicated hardware implementations usually impose a much higher cost, but they provide better performance. The main motivation of this work is to propose hardware that implements the steps of the BLAST algorithm so as to achieve a reduced response time and thus a high throughput. For this purpose, we exploit some important features of BLAST to massively parallelize the execution of all independent tasks. The parallel architecture presented in this section is designed to execute the ungapped alignment using the BLAST procedure [1]. This is done for nucleotides of DNA sequences. A nucleotide can be one of four possibilities: A (Adenine), T (Thymine), C (Cytosine) and G (Guanine); thus, a nucleotide may be represented using two bits. In order to speed up the nucleotide comparison, we use two identical matching components, one for the most significant bits and the other for the least significant bits. These components operate synchronously and in parallel. This should accelerate the comparison process up to twice the speed of a simple bit-at-a-time comparison. The macro-architecture of the aligning hardware in Fig. 1 shows that the query and subject sequences (QS and SS) are stored in four incoming registers, wherein LSW and MSW stand for Least and Most Significant Word, respectively. In this figure, and throughout the figures of this paper, the components that appear in the background in gray are the ones that operate on the LSW of the query and subject sequences. We will use this architecture to show the computational steps of BLAST.

3.1 Seeding

Intuitively, an alignment of two sequences consists of some comparisons followed by evaluations, using a kind of pointer that marks the start and end positions in the query and subject sequences. Our parallel hardware takes advantage of this idea, performing the same task in parallel.
Fig. 1. The macro-architecture of the aligning hardware
The hardware architecture for this step depends on a parameter that sets the required speed and sensitivity of the alignment process. The query sequence is divided into words, as illustrated in Fig. 2. The words are logical mappings of the bits of the query sequence. Let w be the size of the words to be formed, and let n and m be the total sizes of the query sequence QS and the subject sequence SS, respectively. Then the query sequence is subdivided into n-w+1 words, where the ith word is formed by (QSi, QSi+1, QSi+2, …, QSi+w−1). Similarly, the subject sequence is subdivided into m-w+1 words, where the jth word is formed by (SSj, SSj+1, SSj+2, …, SSj+w−1). Each cycle, the subject sequence is shifted by one position and compared to the query sequence accordingly. The sensitivity of the algorithm depends on the value of w: for small values of w, many words are generated, and the hardware becomes more sensitive but slower than for larger values of w.
(Fig. 2 shows the most- and least-significant-word query-sequence registers, built from D flip-flops holding bits 0-5, whose outputs are tapped to form the words word0, word1 and word2.)
Fig. 2. Illustration of the seeds identification process
Finally, the words are compared with the subject sequence. This comparison grades the matches according to a predefined score table, and words that score below the threshold t are discarded. The remaining words are called seeds. For each seed, we create a block, so that the subsequent steps of the algorithm proceed in parallel. As is usual for DNA strings, only identical string fragments shared by the subject and query sequences are considered seeds; hence our hardware finds identical strings and discards everything else.
Fig. 3. Illustration of the comparison during the seeding process
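A behavioral model of the shift-and-compare seeding of Figs. 2 and 3 is sketched below: each cycle the subject register advances by one position, and every query word is compared against the aligned subject window. Since only identical fragments are kept, the score table degenerates to exact equality here; in the hardware all word comparisons of a cycle happen simultaneously, whereas the inner loop below only emulates this.

    def find_seeds(query, subject, w):
        seeds = []                                     # (query offset, shift) pairs
        for shift in range(len(subject) - w + 1):      # one subject shift per cycle
            window = subject[shift:shift + w]
            for i in range(len(query) - w + 1):        # all query words, in parallel
                if query[i:i + w] == window:
                    seeds.append((i, shift))
        return seeds

    print(find_seeds("ACGTAC", "TACGT", 3))  # [(3, 0), (0, 1), (1, 2)]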
Some VHDL [8] features, such as the generate construct, enable the description of repetitive and conditional structures in circuit specifications, and generic declarations allow us to parameterize the size and structure of a circuit. Thus, for each seed, we generate an extension block, which is described in the next section, so that all blocks operate in parallel for all the found seeds. 3.2 Extension In this step, each seed is analyzed again in an attempt to improve its score. To this end, we stretch the alignment between the query seed and the subject sequence, stored in a register. The extension is done in both the left and right directions, starting from the position where the exact match occurred. In the extension step, we look either for exact matches or for matches that meet the threshold constraints. The positions of the extended words that generated a hit are bookkept using a tag. This tag is formed from two pieces of data: register position and offset, as shown in Fig. 4, wherein the part of the tag labeled ss indicates a position in the subject-sequence register and wf indicates the relative position of the current word.
Fig. 4. Tags formation for the found seeds
For further processing, these tags are stored into a FIFO and then sent to a processor, which performs the comparison and scoring tasks. For each word generated in the seeding step, we have one comparison block that creates one tag and feeds a dedicated FIFO. Therefore, the required work is done in a totally parallel manner until it reaches the load balancer. In general, BLAST does not stop the extension until the accumulated total score of the high-scoring pair (HSP) begins to decrease; when the extension should stop depends on a predefined parameter, called the drop-off. In our implementation, though, the extension stops when a mismatch is found.
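The sketch below models the tag format of Fig. 4 and its FIFO buffering. The field names ss and wf follow the figure; the field widths, the score field and the sample values are arbitrary placeholders.

    from collections import deque, namedtuple

    # ss: hit position in the subject-sequence register; wf: relative word position.
    Tag = namedtuple("Tag", ["ss", "wf", "score"])

    fifo = deque()                            # one FIFO per comparison block
    fifo.append(Tag(ss=1, wf=0, score=3))     # produced by the seeding stage
    fifo.append(Tag(ss=2, wf=1, score=3))

    tag = fifo.popleft()                      # later withdrawn by the load balancer
    print(tag.ss, tag.wf)                     # 1 0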
Fig. 5. Extension, comparison and scoring dynamics
A tag is treated by one of the extension processors, which first computes the absolute position of the subsequence corresponding to the tag. After that, it fetches from the subject and query registers the contents of the next positions, which are either to the left or to the right of the subsequence being processed. Subsequently, it compares the subsequences of bits while scoring them. Thereafter, the processor updates or discards the tag. The updated tags are stored into memory for final evaluation and result output. So the extension processor, whose architecture is shown in Fig. 6, performs several very simple tasks, which, as explained, need to be done sequentially. The right and left extensions are started immediately when a tag for a seed is generated, assuming that there exists an idle processor.
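A behavioral model of one extension processor is sketched below: from the absolute seed position, it repeatedly fetches the next symbol on each side, scores it and stops at the first mismatch, as in our implementation. The left and right passes are shown sequentially here for clarity, whereas the hardware leads them in parallel; the unit match score is a placeholder.

    def extend(query, subject, tag, w, match=1):
        qpos, spos = tag["wf"], tag["ss"]          # absolute positions from the tag
        left, right, score = 0, 0, w * match       # the seed itself scores w matches
        while qpos - left > 0 and spos - left > 0 \
                and query[qpos - left - 1] == subject[spos - left - 1]:
            left += 1                              # extend to the left
            score += match
        while qpos + w + right < len(query) and spos + w + right < len(subject) \
                and query[qpos + w + right] == subject[spos + w + right]:
            right += 1                             # extend to the right
            score += match
        return {**tag, "left": left, "right": right, "score": score}

    print(extend("ACGTAC", "TACGT", {"wf": 0, "ss": 1}, w=3))
    # {'wf': 0, 'ss': 1, 'left': 0, 'right': 1, 'score': 4}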
Fig. 6. The extension processor architecture
In order to process several tags in parallel, we opted to include several processors that operate in parallel. We presume that the seed generation process yields tags faster than a processor can consume them, since a processor has to extend the subsequence to the left and to the right, which can be time consuming. For this reason, we decided to include a FIFO between the seeding stage and the extension processors, which smooths the fast arrival of tags against their slower processing. Note that the left and right extensions are led separately and in parallel. The width of the FIFO is determined by the size of the tags, while its depth is derived from the number of included processors. As there are more FIFOs than processors, we use a load balancer that dispatches tags to processors. This component monitors the contents of the FIFOs and selects the next tag to be processed. It always withdraws tags from the FIFO that has the most unavailable entries. The main purpose of the load balancer is to prevent a FIFO from becoming full, because when this happens the seeding process associated with the full FIFO must halt until a new entry becomes available.
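The dispatch rule just described reduces to a small selection function, sketched below; the dispatch to an idle processor is elided.

    from collections import deque

    def select_next(fifos):
        candidates = [f for f in fifos if f]       # skip empty FIFOs
        if not candidates:
            return None
        fullest = max(candidates, key=len)         # most unavailable entries
        return fullest.popleft()                   # withdraw from the fullest FIFO

    fifos = [deque([1]), deque([2, 3, 4]), deque()]
    print(select_next(fifos))                      # 2, taken from the fullest FIFO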
3.3 Evaluating Once a tag has been processed and updated by the extension processor, it is evaluated by comparing the obtained score against the second threshold. The final results of the DNA alignment process are those subsequences whose associated tags scored above this predefined threshold. The final outputs are presented in the form of tags. Note that this stage is implemented by a simple binary comparator of two signed integers, namely the score associated with the considered tag and the threshold.
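Behaviorally, this stage is a one-line filter over the extended tags, as sketched below; the threshold value is assumed.

    def evaluate(tags, threshold2):
        # Signed comparison of each extended tag's score against the threshold.
        return [t for t in tags if t["score"] > threshold2]

    print(evaluate([{"ss": 1, "score": 9}, {"ss": 4, "score": 2}], threshold2=5))
    # [{'ss': 1, 'score': 9}]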
4 Conclusion In this paper, we presented a reconfigurable parallel hardware architecture for DNA alignment. It exploits the inherent advantages of reconfigurable hardware, such as availability and low cost. The proposed architecture is easily scalable for different query, subject and word sizes. Moreover, the overall architecture is inherently parallel, resulting in reduced signal propagation delay. A parameterized VHDL description was written and simulated with ModelSim XE III 6.4 [6]. Future work consists of evaluating the characteristics of such an implementation on an FPGA [17] and how it performs in a real-case DNA alignment.
References
[1] Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410 (1990)
[2] Baldi, P., Brunak, S.: Bioinformatics: The Machine Learning Approach, 1st edn. MIT Press, Cambridge (2001)
[3] Baxevanis, A.D., Francis Ouellette, B.F.: Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 1st edn. Wiley Interscience, Hoboken (1998)
[4] Giegerich, R.: A systematic approach to dynamic programming in bioinformatics. Bioinformatics 16(8), 665–677 (2000)
[5] Lipman, D.J., Pearson, W.R.: Rapid and sensitive protein similarity searches. Science 227(4693), 1435–1441 (1985)
[6] ModelSim: High performance and capacity mixed HDL simulation. Mentor Graphics (2011), http://model.com
[7] Mount, D.W.: Bioinformatics: Sequence and Genome Analysis, 2nd edn. Cold Spring Harbor Laboratory Press (2004)
[8] Navabi, Z.: VHDL: Analysis and Modeling of Digital Systems, 2nd edn. McGraw-Hill, New York (1998)
[9] Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453 (1970)
[10] Pearson, W.R., Lipman, D.J.: Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the United States of America 85(8), 2444–2448 (1988)
[11] Pearson, W.: Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. Genomics 11(3), 635–650 (1991)
[12] Pearson, W.: Comparison of methods for searching protein sequence databases. Protein Science 4(6), 1145 (1995)
[13] Searls, D.B.: The language of genes. Nature 420, 211–217 (2002)
[14] Shpaer, E.G., Robinson, M., Yee, D., Candlin, J.D., Mines, R., Hunkapiller, T.: Sensitivity and selectivity in protein similarity searches: a comparison of Smith-Waterman in hardware to BLAST and FASTA. Genomics 38(2), 179–191 (1996)
[15] Oehmen, C., Nieplocha, J.: ScalaBLAST: A scalable implementation of BLAST for high-performance data-intensive bioinformatics analysis. IEEE Transactions on Parallel and Distributed Systems 17(8), 740–749 (2006)
[16] Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147(1), 195–197 (1981)
[17] Wolf, W.: FPGA-Based System Design. Prentice-Hall, Englewood Cliffs (2004)
Author Index
Aalsalem, Mohammed Y. II-153 Abawajy, Jemal II-165, II-235, II-245, II-266 Abdelgadir, Abdelgadir Tageldin II-225 Abramson, David I-1 Adorna, Henry II-99 A. Hamid, Isredza Rahmi II-266 Ahmed, Mohiuddin II-225 Albaladejo, José II-343 Anjo, Ivo II-1 Araújo, Guido I-144 Arefin, Ahmed Shamsul II-375 Arshad, Quratulain II-153 Athauda, Rukshan II-175 Atif, Muhammad I-129 Aziz, Izzatdin A. I-433 Backes, Werner I-27 Bahig, Hatem M. II-321 Bahig, Hazem M. II-321 Baldassin, Alexandro I-144 Bardino, Jonas I-409 Based, Md. Abdul II-141 Bellatreche, Ladjel I-158 Benítez, César Manuel Vargas II-363 Benkrid, Soumia I-158 Berretta, Regina II-375 Berthold, Jost I-409 Bichhawat, Abhishek I-218 Brezany, Peter I-206 Buyya, Rajkumar I-371, I-395, I-419 Byun, Heejung II-205 Cabarle, Francis George II-99 Cachopo, João I-326, II-1 Carmo, Renato I-258 Carvalho, Fernando Miguel I-326 Chang, Hsi-Ya I-282 Chang, Rong-Guey I-93 Chen, Chia-Jung I-93 Chen, Xu I-294 Chen, Yi II-54 Chu, Wanming I-54, I-117 Chung, Chung-Ping I-80
Chung, Tai-Myoung II-74 Cohen, Jaime I-258 Colin, Jean-Yves II-89 Crain, Tyler I-244 Crespo, Alfons II-343 Crolotte, Alain I-158 Cuzzocrea, Alfredo I-40, I-158 da Silva Barreto, Raimundo I-349 David, Vincent I-385 de Macedo Mourelle, Luiza II-387, II-399 de Sousa, Leandro P. II-215 Dias, Wanderson Roger Azevedo I-349 Dinh, Thuy Duong I-106 Domínguez, Carlos II-343 Duan, Hai-xin I-182, I-453 Duarte Jr., Elias P. I-258, II-215 Duato, José II-353 Duggal, Abhinav I-66 El-Mahdy, Ahmed I-270 Ewing, Gregory II-33 Faldella, Eugenio II-331 Fathy, Khaled A. II-321 Fernando, Harinda II-245 Folkman, Lukas II-64 França, Felipe M.G. II-14 Fürlinger, Karl II-121 Gao, Fan II-131 Garcia Neto Segundo, Edgar J. II-399 Garg, Saurabh Kumar I-371, I-395 Ghazal, Ahmad I-158 Gomaa, Walid I-270 Gopalaiyengar, Srinivasa K. I-371 Goscinski, Andrzej M. I-206, I-433 Goswami, Diganta I-338 Goubier, Thierry I-385 Gu, Di-Syuan I-282 Guedes, André L.P. I-258 Hackenberg, Daniel I-170 Han, Yuzhang I-206
Haque, Asrar Ul II-24 Haque, Mofassir II-33 Hassan, Houcine II-343, II-353 Hassan, Mohammad Mehedi I-194 He, Haohu II-54 Hobbs, Michael M. I-433 Hou, Kaixi I-460 Huang, Jiumei I-460 Huang, Kuo-Chan I-282 Huh, Eui-Nam I-194 Hussin, Masnida I-443 Imbs, Damien I-244 Inostroza-Ponta, Mario II-375 Izu, Cruz II-276
Jannesari, Ali I-14 Javadi, Bahman I-419 Jiang, He-Jhan I-282 Jozwiak, Lech II-14 Kaneko, Keiichi I-106 Kaosar, Md. Golam I-360 Katoch, Samriti I-66 Khan, Javed I. II-24 Khan, Wazir Zada II-153 Khorasani, Elahe I-318 Khreishah, Abdallah II-109 Kim, Cheol Min II-196 Kim, Hye-Jin II-186, II-196 Kozielski, Stanislaw I-230 Kranzlmüller, Dieter II-121 Kwak, Ho-Young II-196 Lau, Francis C.M. I-294 Lee, Cheng-Yu I-93 Lee, Junghoon II-186, II-196 Lee, Young Choon I-443 Lei, Songsong II-43 Leung, Carson K. I-40 Li, Hongjuan I-2 Li, Keqiu I-2 Li, Shigang II-54 Li, Xiuqiao II-43 Li, Yamin I-54, I-117 Li, Yongnan II-43 Liljeberg, Pasi II-287 Lim, Hun-Jung II-74 Lima, Carlos R. Erig II-363 Lin, Tzong-Yen I-93
Liu, Wu I-453 Lopes, Heitor Silvério II-363 Louise, Stéphane I-385 Majumder, Soumyadip I-338 Malysiak-Mrozek, Bożena I-230 Marco, Maria II-343 Martínez-del-Amor, Miguel A. II-99 Mathieson, Luke II-375 McNickle, Don II-33 Md Fudzee, Mohd Farhan II-235 Mjølsnes, Stig Fr. II-141 Molka, Daniel I-170 Moreno, Edward David I-349 Moscato, Pablo II-375 Mrozek, Dariusz I-230 Müller, Matthias S. I-170 Nakechbandi, Moustafa II-89 Nedjah, Nadia II-14, II-387, II-399 Nery, Alexandre Solon II-14 Nguyen, Man I-481 Nicácio, Daniel I-144 Ninggal, Mohd Izuan Hafez II-165 Park, Gyung-Leen II-186, II-196 Pathan, Al-Sakib Khan II-225, II-255 Paulet, Russell I-360 Paulovicks, Brent D. I-318 Pawlikowski, Krzysztof II-33 Pawlowski, Robert I-230 Peng, Shietung I-54, I-117 Peng, Yunfeng II-54 Pérez-Jiménez, Mario J. II-99 Petit, Salvador II-353 Phan, Hien I-481 Pranata, Ilung II-175 Pullan, Wayne II-64 Qin, Guangjun II-43 Qu, Wenyu I-2 Radhakrishnan, Prabakar I-66 Ragb, A.A. II-321 Rahman, Mohammed Ziaur I-306 Ramírez-Pacheco, Julio C. II-255 Raynal, Michel I-244 Ren, Ping I-453 Rivera, Orlando II-121 Rodrigues, Luiz A. I-258
Sahuquillo, Julio II-353 Salehi, Mohsen Amini I-419 Samra, Sameh I-270 Santana Farias, Marcos II-387 Scalabrin, Marlon II-363 Schöne, Robert I-170 Serrano, Mónica II-353 Seyster, Justin I-66 Sham, Chiu-Wing I-294 Sheinin, Vadim I-318 Shi, Justin Y. II-109 Shih, Po-Jen I-282 Shoukry, Amin I-270 Silva, Fabiano I-258 Sirdey, Renaud I-385 Skinner, Geoff II-175 So, Jungmin II-205 Soh, Ben I-481 Song, Biao I-194 Song, Bin II-312 Stantic, Bela II-64 Stojmenovic, Ivan I-2 Stoller, Scott D. I-66 Strazdins, Peter I-129 Sun, Lili II-54 Taifi, Moussa II-109 Tam, Wai M. I-294 Tan, Jefferson II-131 Tenhunen, Hannu II-287 Tichy, Walter F. I-14 Toral-Cruz, Homero II-255 Tucci, Primiano II-331 Tupakula, Udaya I-218
Varadharajan, Vijay I-218 Vinter, Brian I-409 Voorsluys, William I-395 Wada, Yasutaka I-270 Wang, Pingli II-312 Wang, Yini I-470 Wang, Yi-Ting I-80 Wen, Sheng I-470 Weng, Tsung-Hsi I-80 Westphal-Furuya, Markus I-14 Wetzel, Susanne I-27 Wu, Jianping I-182 Xiang, Yang I-470, II-153 Xiao, Limin II-43 Xu, Thomas Canhao II-287 Yao, Shucai II-54 Yeo, Hangu I-318 Yi, Xun I-360 Yoo, Jung Ho II-300 Zadok, Erez I-66 Zhang, Gongxuan II-312 Zhang, Lingjie I-460 Zhao, Ying I-460 Zhao, Yue I-294 Zheng, Ming I-182 Zhou, Wanlei I-470 Zhou, Wei I-470 Zhu, Zhaomeng II-312 Zomaya, Albert Y. I-443