Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
4967
Roman Wyrzykowski, Jack Dongarra, Konrad Karczewski, Jerzy Wasniewski (Eds.)

Parallel Processing and Applied Mathematics
7th International Conference, PPAM 2007
Gdansk, Poland, September 9-12, 2007
Revised Selected Papers
Volume Editors

Roman Wyrzykowski
Konrad Karczewski
Czestochowa University of Technology
Institute of Computer and Information Science
Dabrowskiego 73, 42-200 Czestochowa, Poland
E-mail: {roman, xeno}@icis.pcz.pl

Jack Dongarra
University of Tennessee, Electrical Engineering and Computer Science Department
Volunteer Boulevard 1122, 37996-3450 Knoxville, TN, USA
E-mail: [email protected]

Jerzy Wasniewski
Technical University of Denmark, Department of Informatics and Mathematical Modelling
Richard Petersens Plads, Building 321, 2800 Kongens Lyngby, Denmark
E-mail: [email protected]
Library of Congress Control Number: 2008927189
CR Subject Classification (1998): D, F.2, G, B.2-3, C.2, J.2
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN 0302-9743
ISBN-10 3-540-68105-1 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-68105-2 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2008 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12270521 06/3180 543210
Preface
This volume comprises the proceedings of the 7th International Conference on Parallel Processing and Applied Mathematics - PPAM 2007, which was held in Gdańsk, Poland, September 9–12, 2007. It was organized by the Department of Computer and Information Sciences of the Czestochowa University of Technology, with the help of the TASK Academic Computer Centre in Gdańsk. The main organizer was Roman Wyrzykowski.

PPAM is a biennial conference. Six previous events have been held in different places in Poland since 1994. The proceedings of the last three conferences have been published by Springer in the Lecture Notes in Computer Science series (Nałęczów, 2001, vol. 2328; Czestochowa, 2003, vol. 3019; Poznań, 2005, vol. 3911).

The PPAM conferences have become an international forum for exchanging ideas between researchers involved in parallel and distributed computing, including theory and applications, as well as applied and computational mathematics. The focus of PPAM 2007 was on software tools which facilitate efficient and convenient utilization of modern computing architectures, as well as grid computing, and large-scale applications.

This meeting gathered more than 230 participants from 35 countries. A strict refereeing process resulted in the acceptance of 141 contributed presentations, while approximately 45% of the submissions were rejected. Regular tracks of the conference covered such important fields of parallel/distributed/grid computing and applied mathematics as:

– Parallel/distributed architectures and mobile computing
– Numerical algorithms and parallel numerics
– Parallel and distributed non-numerical algorithms
– Environments and tools for parallel/distributed/grid computing
– Applications of parallel/distributed/grid computing
– Evolutionary computing, meta-heuristics and neural networks

The plenary and invited talks were presented by:

– David A. Bader from the Georgia Institute of Technology (USA)
– Ben Bennett from ClearSpeed Technology (UK)
– Ewa Deelman from the University of Southern California (USA)
– Jack Dongarra from the University of Tennessee and Oak Ridge National Laboratory (USA)
– Richard Dracott from Intel (USA)
– Erik Elmroth from the Umeå University (Sweden)
– Fabrizio Gagliardi from Microsoft (USA)
– Angel E. Garcia from the Rensselaer Polytechnic Institute (USA)
– Fred Gustavson from the IBM T.J. Watson Research Center (USA)
– Hans-Christian Hoppe from Intel (Germany)
– Vladik Kreinovich from the University of Texas at El Paso (USA)
– Jarek Nieplocha from the Pacific Northwest National Laboratory (USA)
– Jennifer Schopf from the Argonne National Laboratory (USA) and eScience Institute (UK)
– Masha Sosonkina from the Ames Laboratory and Iowa State University (USA)
– Boleslaw Szymański from the Rensselaer Polytechnic Institute (USA)
– Jerzy Waśniewski from the Technical University of Denmark (Denmark)
Workshops and Minisymposia

Important and integral parts of the PPAM 2007 conference were the workshops:

– The Second Minisymposium on Novel Data Formats and Algorithms for Dense Linear Algebra Computations organized by Fred Gustavson from the IBM T.J. Watson Research Center (USA), and Jerzy Waśniewski from the Technical University of Denmark (Denmark)
– Combinatorial Tools for Parallel Sparse Matrix Computations Workshop organized by Laura Grigori from INRIA (France), and Masha Sosonkina from the Ames Laboratory and Iowa State University (USA)
– The Third Grid Application and Middleware Workshop - GAMW'07 organized by Ewa Deelman from the USC Information Sciences Institute (USA), and Norbert Meyer from the Poznań Supercomputing and Networking Center (Poland)
– The Third Workshop on Large Scale Computations on Grids - LaSCoG'07 organized by Marcin Paprzycki from SWPS in Warsaw (Poland), Dana Petcu from the Western University of Timisoara (Romania), and Przemyslaw Stpiczyński from Marie Curie-Sklodowska University in Lublin (Poland)
– Workshop on Models, Algorithms and Methodologies for Grid-Enabled Computing Environments organized by Giovanni Aloisio from the University of Lecce (Italy), and Giuliano Laccetti from the University of Naples Federico II (Italy)
– Workshop on Scheduling for Parallel Computing - SPC'07 organized by Maciej Drozdowski from the Poznań University of Technology (Poland)
– The Second Workshop on Language-Based Parallel Programming Models - WLPP'07 organized by Ami Marowka from the Shenkar College of Engineering and Design in Ramat-Gan (Israel)
– Workshop on Performance Evaluation of Parallel Applications on Large-Scale Systems organized by Jan Kwiatkowski, Dariusz Konieczny and Marcin Pawlik from the Wroclaw University of Technology (Poland)
– Workshop on Parallel Computational Biology - PBC'2007 organized by David A. Bader from the Georgia Institute of Technology in Atlanta (USA), Denis Trystram from ID-IMAG in Grenoble (France), and Jaroslaw Zola from the Iowa State University (USA)
– Workshop on High Performance Computing for Engineering Applications organized by Piotr Doerffer from the Institute of Fluid Machinery in Gdańsk (Poland), and Jacek Rokicki from the Warsaw University of Technology (Poland)
– Minisymposium on Interval Analysis organized by Vladik Kreinovich from the University of Texas at El Paso (USA), Pawel Sewastjanow from the Czestochowa University of Technology, Bartlomiej J. Kubica from the Warsaw University of Technology (Poland), and Jerzy Waśniewski from the Technical University of Denmark (Denmark)
Tutorials

The PPAM 2007 meeting began with four half-day tutorials:

– Globus: fundamental tools for building your Grid, by Jennifer Schopf from the Argonne National Laboratory (USA) and eScience Institute (UK)
– Intel tools for grid programming, by Ralf Ratering from Intel (Germany)
– New data structures for the Cell processor, by Fred Gustavson from the IBM T.J. Watson Research Center (USA), and Jerzy Waśniewski from the Technical University of Denmark (Denmark)
– Grid computing with GridWay on Globus infrastructures: porting applications using the DRMAA standard, by GridWay Team (Spain)
The New Topics at PPAM 2007

The New Computer Architectures [1]: Power consumption, heat dissipation and other physical limitations are pushing the microprocessor industry towards multicore design patterns. Most processor manufacturers, such as Intel and AMD, are following more conventional approaches, which consist of homogeneous, symmetric multicores where execution units are replicated on the same die; multiple execution units share some cache level (generally L2 and L3) and the bus to memory. Other manufacturers have proposed approaches that are still homogeneous but place a stronger emphasis on parallelism and hyperthreading. Yet other chip manufacturers have started exploring heterogeneous designs where cores have different architectural features. One such example is the Cell Broadband Engine (see [2], [3], and [4]) developed by STI, a consortium formed by Sony, Toshiba and IBM. The Cell BE has outstanding floating-point computational power, which makes it a serious candidate for high performance computing systems. IBM shipped the first Cell-based system, the BladeCenter QS20, on September 12th 2006. This blade was equipped with two Cell processors, each with 512 MB of memory, connected in a NUMA configuration; external connectivity was provided by a Gigabit and an InfiniBand network interface. The QS20 blade has since been replaced by the new BladeCenter QS21, which has larger memory. Its impressive computational power, coupled with high speed network interfaces, makes it a good candidate for high performance cluster computing. At almost the same time (November 11th 2006), Sony released the Play Station 3 (PS3) gaming console. Even though this console is not meant for high performance computing, it is still equipped with a (stripped down) Cell processor, and its price ($500) makes it an attractive solution for building a Cell-based cluster.

Four speakers, Alfredo Buttari, Jack Dongarra, Fred Gustavson and Adrianto Wirawan, discussed the new Cell architecture in their lectures. There was also a tutorial on New Data Structures and Cell-related work by Jack Dongarra's team, and a minisymposium on "Novel Data Formats and Algorithms for Dense Linear Algebra Computations", both organized by Fred Gustavson and Jerzy Waśniewski. This latter material partly featured the Cell architecture as an example of the new Multi-Core / Many Core environment and the role that New Data Structures could play for Cell. Ben Bennett, Richard Dracott, and Jerzy Waśniewski also discussed the new Multi-Core / Many Core environment in their lectures.

[1] Taken from the Introduction of LAPACK Working Note Number 185: Alfredo Buttari, Jack Dongarra, and Jakub Kurzak. Limitations of the Play Station 3 for High Performance Cluster Computing, UT-CS-07-597, May 2007; http://www.netlib.org/lapack/lawnspdf/lawn185.pdf.
[2] H.P. Hofstee. Power efficient processor architecture and the Cell processor. In Proceedings of the 11th Int'l Symposium on High-Performance Computer Architecture, 2005.
[3] J.A. Kahle, M.N. Day, H.P. Hofstee, C.R. Johns, T.R. Maeurer, and D. Shippy. Introduction to the Cell multiprocessor. IBM J. Res. & Dev., 49(4/5):589-604, 2005.
[4] IBM. Cell Broadband Engine Architecture, Version 1.0, August 2005.

Interval analysis methods: Numerical methods, such as optimization methods or methods for solving systems of equations, usually produce approximate solutions. Often, there are no guaranteed bounds on the accuracy of these approximate solutions, or there are bounds but they are too wide to be practically useful. In such situations, it is desirable to have verified numerical computations, i.e., computations that would produce verified (provable) accuracy.
When a (final or intermediate) approximate result x of the computation comes with a verified bound Δ, it means that the actual (unknown) value of the estimated quantity belongs to the interval [x − Δ, x + Δ]. In view of this fact, verified numerical computing is also known as interval analysis. To address these issues, a minisymposium on interval analysis methods was organized by Vladik Kreinovich, Pawel Sewastjanow, Bartlomiej J. Kubica, and Jerzy Waśniewski. The intent was to present a state-of-the-art overview of the challenging and dynamic field of verified computing techniques and interval analysis for researchers, experts, and scientists who are current and future users of these techniques.

This minisymposium included a tutorial on interval techniques and their application, and several research talks. Some of these talks described new interval (verification) techniques for solving various numerical problems: computing the value of a given special function (Evgueni Petrov), estimating the sum of an infinite series (Boguslaw Bożek, Wieslaw Solak, and Zbigniew Szydelko), solving algebraic equations (Pawel Sevastjanov and Ludmila Dymowa) and differential equations (Karol Gajda, Malgorzata Jankowska, Andrzej Marciniak, and Barbara Szyszka), and checking properties of the solutions, such as their monotonicity (Iwona Skalna). Two talks discussed decision making under interval uncertainty: Bartlomiej J. Kubica and Adam Woźniak showed how to compute the Pareto set in a multi-criterial problem, and Van Nam Huynh, Vladik Kreinovich, Yoshiteru Nakamori, and Hung T. Nguyen showed how to efficiently predict human decisions under such uncertainty. Eva Dyllong, Cornelius Grimm, Jorge Flórez, Mateu Sbert, Miguel A. Sainz, and Josep Vehí applied interval techniques to problems of computer vision, computer graphics, and computer-aided design and manufacturing. Finally, Piotr Orantek and Antoni John used interval techniques in biomedical engineering: namely, in the analysis of the human pelvic bone.
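As a rough illustration of the enclosure [x − Δ, x + Δ] described above, the following Python sketch propagates verified bounds through addition and multiplication. It is only a minimal example under stated assumptions: the class name Interval and its methods are hypothetical and are not taken from any talk in this volume, and stepping each endpoint outward with math.nextafter (Python 3.9 or later) is used as a simple, conservative stand-in for true directed rounding.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi] intended to contain the true value."""
    lo: float
    hi: float

    @staticmethod
    def from_bound(x: float, delta: float) -> "Interval":
        # Approximate result x with verified bound delta -> [x - delta, x + delta].
        return Interval(x - delta, x + delta)._widen()

    def _widen(self) -> "Interval":
        # Move each endpoint outward by one floating-point number so that
        # round-to-nearest errors in the endpoint arithmetic cannot push
        # the true value outside the interval (illustrative assumption).
        return Interval(math.nextafter(self.lo, -math.inf),
                        math.nextafter(self.hi, math.inf))

    def __add__(self, other: "Interval") -> "Interval":
        # Sum of two enclosures: add lower ends and upper ends.
        return Interval(self.lo + other.lo, self.hi + other.hi)._widen()

    def __mul__(self, other: "Interval") -> "Interval":
        # Product of two enclosures: extremes occur at endpoint products.
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))._widen()

    def __contains__(self, value: float) -> bool:
        return self.lo <= value <= self.hi


a = Interval.from_bound(2.0, 0.1)   # true value lies in about [1.9, 2.1]
b = Interval.from_bound(-3.0, 0.2)  # true value lies in about [-3.2, -2.8]
total = a + b                       # encloses about [-1.3, -0.7]
product = a * b                     # encloses about [-6.72, -5.32]
```

With these inputs, both `-1.0 in total` and `-6.0 in product` hold, and the computed endpoints are slightly wider than the exact ones. Production interval libraries typically control the rounding mode directly rather than widening after the fact; the one-step widening here simply keeps the sketch short.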
Acknowledgements

The organizers are indebted to the PPAM 2007 sponsors, whose support was vital to the success of the conference. The main sponsor was the Intel Corporation. The other sponsors were: Microsoft Corporation, IBM Corporation, Action S.A., and SIAM. We thank all members of the International Program Committee and the additional reviewers for their diligent work in refereeing the submitted papers. Finally, we thank all of the local organizers from the Czestochowa University of Technology, the TASK Academic Computer Centre in Gdańsk, and the Academy of Music in Gdańsk, who helped us to run the event very smoothly. We are especially indebted to Grażyna Kolakowska, Urszula Kroczewska, Łukasz Kuczyński, and Marcin Woźniak from the Czestochowa University of Technology; to Mścislaw Nakonieczny, Ewa Politowska and Rafal Tylman from the TASK Academic Computer Centre; and to Henryk Maczka from the Academy of Music in Gdańsk.
PPAM 2009

We hope that this volume will be useful to you. We would like everyone who reads it to feel invited to the next conference, PPAM 2009, which will be held in Wroclaw (Poland), September 2009.
February 2008
Roman Wyrzykowski Jack Dongarra Konrad Karczewski Jerzy Wa´sniewski
Organization
Program Committee

Jan Weglarz               Poznań University of Technology, Poland (Honorary Chair)
Roman Wyrzykowski         Czestochowa University of Technology, Poland (Chair of Program Committee)
Boleslaw Szymański        Rensselaer Polytechnic Institute, USA (Vice-Chair of Program Committee)
Peter Arbenz              ETH, Zurich, Switzerland
Piotr Bala                N. Copernicus University, Poland
Radim Blaheta             Institute of Geonics, Czech Academy of Sciences
Jacek Blażewicz           Poznań University of Technology, Poland
Tadeusz Burczyński        Silesia University of Technology, Poland
Peter Brezany             University of Vienna, Austria
Jerzy Brzeziński          Poznań University of Technology, Poland
Marian Bubak              Institute of Computer Science, AGH, Poland
Raimondas Čiegis          Vilnius Gediminas Tech. University, Lithuania
Bogdan Chlebus            University of Colorado at Denver, USA
Zbigniew Czech            Silesia University of Technology, Poland
Jack Dongarra             University of Tennessee and ORNL, USA
Maciej Drozdowski         Poznań University of Technology, Poland
Jacek Gondzio             University of Edinburgh, Scotland, UK
Andrzej Gościński         Deakin University, Australia
Frederic Guinand          Universite du Havre, France
Marta Fairen              Univer. Polit. de Catalunya, Barcelona, Spain
Ladislav Hluchy           Slovak Academy of Sciences, Bratislava
Alexey Kalinov            Institute for System Programming, Russia
Ayse Kiper                Middle East Technical University, Turkey
Jacek Kitowski            Institute of Computer Science, AGH, Poland
Jozef Korbicz             University of Zielona Góra, Poland
Stanislaw Kozielski       Silesia University of Technology, Poland
Dieter Kranzlmueller      Johannes Kepler University Linz, Austria
Henryk Krawczyk           Gdańsk University of Technology, Poland
Piotr Krzyżanowski        University of Warsaw, Poland
Jan Kwiatkowski           Wroclaw University of Technology, Poland
Marco Lapegna             University of Naples, Italy
Alexey Lastovetsky        University College Dublin, Ireland
Aleksandr Legalov         Krasnoyarsk State Technical University, Russia
Vyacheslav Maksimov       Ural Branch, Russian Academy of Sciences
Victor E. Malyshkin       Siberian Branch, Russian Academy of Sciences
Tomas Margalef            Universitat Autonoma de Barcelona, Spain
Ami Marowka               Shenkar College of Eng. and Design, Israel
Norbert Meyer             PSNC, Poznań, Poland
Jarek Nabrzyski           PSNC, Poznań, Poland
Marcin Paprzycki          SWPS, Warsaw, Poland
Dana Petcu                Western University of Timisoara, Romania
Edwige Pissaloux          Universite de Rouen, France
Jacek Rokicki             Warsaw University of Technology, Poland
Leszek Rutkowski          Czestochowa University of Technology, Poland
Jaroslaw Rybicki          Gdańsk University of Technology, Poland
Franciszek Seredyński     Polish Academy of Sciences, Warsaw
Robert Schaefer           Institute of Computer Science, AGH, Poland
Norbert Sczygiol          Czestochowa University of Technology, Poland
Jurij Silc                Jozef Stefan Institute, Slovenia
Peter M.A. Sloot          University of Amsterdam, The Netherlands
Przemyslaw Stpiczyński    UMCS, Lublin, Poland
Maciej Stroiński          Supercomp. and Networking, Poznań, Poland
Domenico Talia            University of Calabria, Italy
Andrei Tchernykh          CICESE, Ensenada, Mexico
Roman Trobec              Jozef Stefan Institute, Slovenia
Denis Trystram            ID-IMAG, Grenoble, France
Marek Tudruj              Polish Academy of Sciences, Warsaw
Pavel Tvrdik              Czech Technical University, Prague
Jens Volkert              Johannes Kepler University Linz, Austria
Jerzy Waśniewski          Technical University of Denmark
Bogdan Wiszniewski        Gdańsk University of Technology, Poland
Krzysztof Zieliński       Institute of Computer Science, AGH, Poland
Jianping Zhu              University of Akron, USA
Table of Contents
Parallel/Distributed Architectures and Mobile Computing

Safety of a Session Guarantees Protocol Using Plausible Clocks
    Jerzy Brzeziński, Michal Kalewski, and Cezary Sobaniec

On Checkpoint Overhead in Distributed Systems Providing Session Guarantees
    Arkadiusz Danilecki, Anna Kobusińska, and Marek Libuda

Performance Evolution and Power Benefits of Cluster System Utilizing Quad-Core and Dual-Core Intel Xeon Processors
    Pawel Gepner, David L. Fraser, and Michal F. Kowalik

Skip Ring Topology in FAST Failure Detection Service
    Jacek Kobusiński, Filip Gorski, and Stanislaw Stempin

Inter-processor Communication Optimization in Dynamically Reconfigurable Embedded Parallel Systems
    Eryk Laskowski and Marek Tudruj

An Algorithm to Improve Parallelism in Distributed Systems Using Asynchronous Calls
    Rouzbeh Maani and Saeed Parsa

IEBS Ticketing Protocol as Answer to Synchronization Issue
    Barbara Palacz, Tomasz Milos, Lukasz Dutka, and Jacek Kitowski

Analysis of Distributed Packet Forwarding Strategies in Ad Hoc Networks
    Marcin Seredynski, Pascal Bouvry, and Mieczyslaw A. Klopotek

Implementation and Optimization of Dense LU Decomposition on the Stream Processor
    Ying Zhang, Tao Tang, Gen Li, and Xuejun Yang
Numerical Algorithms and Parallel Numerics

An Adaptive Interface for the Efficient Computation of the Discrete Sine Transform
    Pedro Alonso, Miguel O. Bernabeu, and Antonio-Manuel Vidal-Maciá

Incomplete WZ Factorization as an Alternative Method of Preconditioning for Solving Markov Chains
    Beata Bylina and Jaroslaw Bylina

A Block-Based Parallel Adaptive Scheme for Solving the 4D Vlasov Equation
    Olivier Hoenen and Eric Violard

On Optimal Strategies of Russia's Behavior on the International Market for Emissions Permits
    Alexey Kadiyev, Vyacheslav Maksimov, and Valeriy Rozenberg

Message-Passing Two Steps Least Square Algorithms for Simultaneous Equations Models
    Jose-Juan López-Espín and Domingo Giménez

Parallel Implementation of Cholesky LL^T-Algorithm in FPGA-Based Processor
    Oleg Maslennikow, Volodymyr Lepekha, Anatoli Sergiyenko, Adam Tomas, and Roman Wyrzykowski

Dimensional Analysis Applied to a Parallel QR Algorithm
    Robert W. Numrich

Sparse Matrix-Vector Multiplication - Final Solution?
    Ivan Šimeček and Pavel Tvrdík
Parallel and Distributed Non-numerical Algorithms

Petascale Computing for Large-Scale Graph Problems
    David A. Bader

The Buffered Work-Pool Approach for Search-Tree Based Optimization Algorithms
    Faisal N. Abu-Khzam, Mohamad A. Rizk, Deema A. Abdallah, and Nagiza F. Samatova

Parallel Scatter Search Algorithm for the Flow Shop Sequencing Problem
    Wojciech Bożejko and Mieczyslaw Wodecki

Theoretical and Practical Issues of Parallel Simulated Annealing
    Agnieszka Debudaj-Grabysz and Zbigniew J. Czech

Modified R-MVB Tree and BTV Algorithm Used in a Distributed Spatio-temporal Data Warehouse
    Marcin Gorawski and Michal Gorawski

Towards Stream Data Parallel Processing in Spatial Aggregating Index
    Marcin Gorawski and Rafal Malczok

On Parallel Generation of Partial Derangements, Derangements and Permutations
    Zbigniew Kokosiński

Parallel Simulated Annealing Algorithm for Graph Coloring Problem
    Szymon Łukasik, Zbigniew Kokosiński, and Grzegorz Świętoń

Parallel Algorithm to Find Minimum Vertex Guard Set in a Triangulated Irregular Network
    Masoud Taghinezhad Omran

JaCk-SAT: A New Parallel Scheme to Solve the Satisfiability Problem (SAT) Based on Join-and-Check
    Daniel Singer and Anthony Monnet
Environments and Tools for Parallel/Distributed/Grid Computing

Designing Service-Based Resource Management Tools for a Healthy Grid Ecosystem
    Erik Elmroth, Francisco Hernández, Johan Tordsson, and Per-Olov Östberg

BC-MPI: Running an MPI Application on Multiple Clusters with BeesyCluster Connectivity
    Pawel Czarnul

Managing Distributed Architecture with Extended WS-CDL
    Konrad Dusza and Henryk Krawczyk

REVENTS: Facilitating Event-Driven Distributed HPC Applications
    Dawid Kurzyniec, Vaidy Sunderam, and Magdalena Slawińska

Empowering Automatic Semantic Annotation in Grid
    Michal Laclavík, Marek Ciglan, Martin Šeleng, and Ladislav Hluchý

Fault Tolerant Record Placement for Decentralized SDDS LH*
    Grzegorz Łukawski and Krzysztof Sapiecha

Grid Services for HSM Systems Monitoring
    Darin Nikolow, Renata Slota, and Jacek Kitowski

The Vine Toolkit: A Java Framework for Developing Grid Applications
    Michael Russell, Piotr Dziubecki, Piotr Grabowski, Michal Krysiński, Tomasz Kuczyński, Dawid Szjenfeld, Dominik Tarnawczyk, Gosia Wolniewicz, and Jaroslaw Nabrzyski

Enhancing Productivity in High Performance Computing through Systematic Conditioning
    Magdalena Slawińska, Jaroslaw Slawiński, and Vaidy Sunderam

A Formal Model of Multi-agent Computations
    Maciej Smolka

An Approach to Distributed Fault Injection Experiments
    Janusz Sosnowski, Andrzej Tymoczko, and Piotr Gawkowski
Applications of Parallel/Distributed/Grid Computing

Parallel Solution of Nonlinear Parabolic Problems on Logically Rectangular Grids
    Andrés Arrarás, Laura Portero, and Juan Carlos Jorge

Provenance Tracking in the ViroLab Virtual Laboratory
    Bartosz Baliś, Marian Bubak, and Jakub Wach

Efficiency of Interactive Terrain Visualization with a PC-Cluster
    Dariusz Dalecki, Jacek Lebiedź, Krzysztof Mieloszyk, and Bogdan Wiszniewski

Implementing Commodity Flow in an Agent-Based Model E-Commerce System
    Maria Ganzha, Maciej Gawinecki, Pawel Kobzdej, Marcin Paprzycki, and Tomasz Serzysko

MPI and OpenMP Computations for Nuclear Waste Deposition Models
    Ondřej Jakl, Roman Kohut, and Jiří Starý

A Pipelined Parallel Algorithm for OSIC Decoding
    Francisco-Jose Martínez-Zaldívar, Antonio-Manuel Vidal-Maciá, and Pedro Alonso

A Self-scheduling Scheme for Parallel Processing in Heterogeneous Environment: Simulations of the Monte Carlo Type
    Grzegorz Musial, Lech Dębski, Dorota Jeziorek-Kniola, and Krzysztof Goląb

Asynchronous Parallel Molecular Dynamics Simulations
    Jaroslaw Mederski, Łukasz Mikulski, and Piotr Bala

Parallel Computing of GRAPES 3D-Variational Data Assimilation System
    Xiaoqian Zhu, Weimin Zhang, and Junqiang Song
Evolutionary Computing, Meta-Heuristics and Neural Networks

The Effects of Heterogeneity on Asynchronous Panmictic Genetic Search
    Boleslaw K. Szymanski, Travis Desell, and Carlos Varela

A Parallel Sensor Selection Technique for Identification of Distributed Parameter Systems Subject to Correlated Observations
    Przemyslaw Baranowski and Dariusz Uciński

Distributed Segregative Genetic Algorithm for Solving Fuzzy Equations
    Octav Brudaru and Octavian Buzatu

Solving Channel Borrowing Problem with Coevolutionary Genetic Algorithms
    Krzysztof Gajc and Franciszek Seredynski

Balancedness in Binary Sequences with Cryptographic Applications
    Candelaria Hernández-Goya and Amparo Fúster-Sabater

A Cost-Benefit-Based Adaptation Scheme for Multimeme Algorithms
    Wilfried Jakob

Optimizing the Shape of an Impeller Using the Differential Ant-Stigmergy Algorithm
    Peter Korošec, Jurij Šilc, Klemen Oblak, and Franc Kosel

Parallel Algorithm for Simulation of Circuit and One-Way Quantum Computation Models
    Marek Sawerwain

Modular Rough Neuro-fuzzy Systems for Classification
    Rafal Scherer, Marcin Korytkowski, Robert Nowicki, and Leszek Rutkowski

Tracing SQL Attacks Via Neural Networks
    Jaroslaw Skaruz, Franciszek Seredynski, and Pascal Bouvry

Optimization of Parallel FDTD Computations Using a Genetic Algorithm
    Adam Smyk and Marek Tudruj

Modular Type-2 Neuro-fuzzy Systems
    Janusz Starczewski, Rafal Scherer, Marcin Korytkowski, and Robert Nowicki

Evolutionary Viral-type Algorithm for the Inverse Problem for Iterated Function Systems
    Barbara Strug, Andrzej Bielecki, and Marzena Bielecka

Tackling the Grid Job Planning and Resource Allocation Problem Using a Hybrid Evolutionary Algorithm
    Karl-Uwe Stucky, Wilfried Jakob, Alexander Quinte, and Wolfgang Süß

Evolutionary Algorithm with Forced Variation in Multi-dimensional Non-stationary Environment
    Dariusz Wawrzyniak and Andrzej Obuchowicz

Hybrid Flowshop with Unrelated Machines, Sequence Dependent Setup Time and Availability Constraints: An Enhanced Crossover Operator for a Genetic Algorithm
    Victor Yaurima, Larisa Burtseva, and Andrei Tchernykh

The Second Minisymposium on Novel Data Formats and Algorithms for Dense Linear Algebra Computations

The Relevance of New Data Structure Approaches for Dense Linear Algebra in the New Multi-Core / Many Core Environments . . . 618
    Fred G. Gustavson
Three Versions of a Minimal Storage Cholesky Algorithm Using New Data Structures Gives High Performance Speeds as Verified on Many Computers . . . 622
    Jerzy Waśniewski and Fred G. Gustavson
Hardware-Oriented Implementation of Cache Oblivious Matrix Operations Based on Space-filling Curves . . . 628
    Michael Bader, Robert Franz, Stephan Günther, and Alexander Heinecke
Parallel Tiled QR Factorization for Multicore Architectures . . . 639
    Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra
Application of Rectangular Full Packed and Blocked Hybrid Matrix Formats in Semidefinite Programming for Sensor Network Localization . . . 649
    Jacek Blaszczyk, Ewa Niewiadomska-Szynkiewicz, and Michal Marks
New Data Structures for Matrices and Specialized Inner Kernels: Low Overhead for High Performance . . . 659
    José R. Herrero
The Implementation of BLAS for Band Matrices . . . 668
    Alfredo Remón, Enrique S. Quintana-Ortí, and Gregorio Quintana-Ortí
Parallel Solution of Band Linear Systems in Model Reduction . . . 678
    Alfredo Remón, Enrique S. Quintana-Ortí, and Gregorio Quintana-Ortí
Evaluating Linear Recursive Filters Using Novel Data Formats for Dense Matrices . . . 688
    Przemyslaw Stpiczyński

Combinatorial Tools for Parallel Sparse Matrix Computations Workshop

Application of Fusion-Fission to the Multi-way Graph Partitioning Problem . . . 698
    Charles-Edmond Bichot
A Parallel Approximation Algorithm for the Weighted Maximum Matching Problem . . . 708
    Fredrik Manne and Rob H. Bisseling
Heuristics for a Matrix Symmetrization Problem . . . 718
    Bora Uçar
A Supernodal Out-of-Core Sparse Gaussian-Elimination Method . . . 728
    Sivan Toledo and Anatoli Uchitel

The Third Grid Applications and Middleware Workshop (GAMW’07)

A Large-Scale Semantic Grid Repository . . . 738
    Marian Babik and Ladislav Hluchy
Scientific Workflow: A Survey and Research Directions . . . 746
    Adam Barker and Jano van Hemert
A Light-Weight Grid Workflow Execution Engine Enabling Client and Middleware Independence . . . 754
    Erik Elmroth, Francisco Hernández, and Johan Tordsson
Supporting NAMD Application on the Grid Using GPE . . . 762
    Rafal Kluszczyński and Piotr Bala
A Grid Advance Reservation Framework for Co-allocation and Co-reservation Across Heterogeneous Local Resource Management Systems . . . 770
    Changtao Qu
Using HLA and Grid for Distributed Multiscale Simulations . . . 780
    Katarzyna Rycerz, Marian Bubak, and Peter M.A. Sloot
The OpenCF: An Open Source Computational Framework Based on Web Services Technologies . . . 788
    Adrian Santos, Francisco Almeida, and Vincente Blanco
Service Level Agreement Metrics for Real-Time Application on the Grid . . . 798
    Łukasz Skital, Maciej Janusz, Renata Slota, and Jacek Kitowski
Dynamic Control of Grid Workflows through Activities Global State Monitoring . . . 807
    Marek Tudruj, Damian Kopanski, and Janusz Borkowski
Transparent Access to Grid-Based Compute Utilities . . . 817
    Constantino Vázquez, Javier Fontán, Eduardo Huedo, Rubén S. Montero, and Ignacio M. Llorente
Towards Secure Data Management System for Grid Environment Based on the Cell Broadband Engine . . . 825
    Roman Wyrzykowski and Lukasz Kuczynski
Ontology Alignment for Contract Based Virtual Organizations Negotiation and Operation . . . 835
    Joanna Zieba, Bartosz Kryza, Renata Slota, Lukasz Dutka, and Jacek Kitowski

The Third Workshop on Large Scale Computations on Grids (LaSCoG’07)

On Service-Oriented Symbolic Computing . . . 843
    Alexandru Cârstea, Marc Frîncu, Alexander Konovalov, Georgiana Macariu, and Dana Petcu
CPPC-G: Fault-Tolerant Applications on the Grid . . . 852
    Daniel Díaz, Xoán C. Pardo, María J. Martín, Patricia González, and Gabriel Rodríguez
Garbage Collection in Object Oriented Condensed Graphs . . . 860
    Sunil John and John P. Morrison
MASIPE: A Tool Based on Mobile Agents for Monitoring Parallel Environments . . . 870
    David E. Singh, Alejandro Miguel, Félix García, and Jesús Carretero
Geovisualisation Service for Grid-Based Assessment of Natural Disasters . . . 880
    Peter Slížik and Ladislav Hluchý
Web Portal to Make Large-Scale Scientific Computations Based on Grid Computing and MPI . . . 888
    Assel Zh. Akzhalova and Daniar Y. Aizhulov

Workshop on Models, Algorithms and Methodologies for Grid-Enabled Computing Environments

The GSI Plug-In for gSOAP: Building Cross-Grid Interoperable Secure Grid Services . . . 894
    Massimo Cafaro, Daniele Lezzi, Sandro Fiore, Giovanni Aloisio, and Robert van Engelen
Implementing Effective Data Management Policies in Distributed and Grid Computing Environments . . . 902
    Luisa Carracciuolo, Giuliano Laccetti, and Marco Lapegna
Data Mining on Desktop Grid Platforms . . . 912
    Valerie Fiolet, Richard Olejnik, Eryk Laskowski, Łukasz Masko, Marek Tudruj, and Bernard Toursel
Distributed Resources Reservation Algorithm for GRID Networks . . . 922
    Matviy Il’yashenko
A PMI-Aware Extension for the SSH Service . . . 932
    Giuliano Laccetti and Giovanni Schmid
An Integrated ClassAd-Latent Semantic Indexing Matchmaking Algorithm for Globus Toolkit Based Computing Grids . . . 942
    Raffaele Montella, Giulio Giunta, and Angelo Riccio
A Grid Computing Based Virtual Laboratory for Environmental Simulations . . . 951
    Raffaele Montella, Giulio Giunta, and Giuliano Laccetti
Exploring the Behaviour of Fine-Grain Management for Virtual Resource Provisioning . . . 961
    Fernando Rodríguez-Haro, Felix Freitag, Leandro Navarro, and Rene Brunner

Workshop on Scheduling for Parallel Computing (SPC’07)

Parallel Irregular Computations with Dynamic Load Balancing through Global Consistent State Monitoring . . . 971
    Janusz Borkowski and Marek Tudruj
On-Line Partitioning for On-Line Scheduling with Resource Conflicts . . . 981
    Piotr Borowiecki
A Multiobjective Evolutionary Approach for Multisite Mapping on Grids . . . 991
    Ivanoe De Falco, Antonio Della Cioppa, Umberto Scafuri, and Ernesto Tarantino
Scheduling with Precedence Constraints: Mixed Graph Coloring in Series-Parallel Graphs . . . 1001
    Hanna Furmańczyk, Adrian Kosowski, and Pawel Żyliński
A New Model of Multi-installment Divisible Loads Processing in Systems with Limited Memory . . . 1009
    Maciej Drozdowski and Marcin Lawenda
Scheduling DAGs on Grids with Copying and Migration . . . 1019
    Israel Hernandez and Murray Cole
Alea – Grid Scheduling Simulation Environment . . . 1029
    Dalibor Klusáček, Luděk Matyska, and Hana Rudová
Cost Minimisation in Unbounded Multi-interface Networks . . . 1039
    Adrian Kosowski and Alfredo Navarra
Scheduling in Multi-organization Grids: Measuring the Inefficiency of Decentralization . . . 1048
    Krzysztof Rzadca
Tightness Results for Malleable Task Scheduling Algorithms . . . 1059
    Ulrich M. Schwarz

The Second Workshop on Language-Based Parallel Programming Models (WLPP’07)

Universal Grid Client: Grid Operation Invoker . . . 1068
    Tomasz Bartyński, Maciej Malawski, Tomasz Gubala, and Marian Bubak
Divide-and-Conquer Parallel Programming with Minimally Synchronous Parallel ML . . . 1078
    Radia Benheddi and Frédéric Loulergue
Cloth Simulation in the SILC Matrix Computation Framework: A Case Study . . . 1086
    Tamito Kajiyama, Akira Nukada, Reiji Suda, Hidehiko Hasegawa, and Akira Nishida
Computing the Irregularity Strength of Connected Graphs by Parallel Constraint Solving in the Mozart System . . . 1096
    Adam Meissner, Magdalena Niwińska, and Krzysztof Zwierzyński
DPSKEL: A Skeleton Based Tool for Parallel Dynamic Programming . . . 1104
    Ignacio Peláez, Francisco Almeida, and Fernando Suárez
SkelJ: Skeletons for Object-Oriented Applications . . . 1114
    Joao L. Sobral
Formal Semantics of DRMA-Style Programming in BSPlib . . . 1122
    Julien Tesson and Frédéric Loulergue
A Container-Iterator Parallel Programming Model . . . 1130
    Gerhard Zumbusch

Workshop on Performance Evaluation of Parallel Applications on Large-Scale Systems

Semantic-Oriented Approach to Performance Monitoring of Distributed Java Applications . . . 1140
    Wlodzimierz Funika, Piotr Godowski, and Piotr Pęgiel
Using Experimental Data to Improve the Performance Modelling of Parallel Linear Algebra Routines . . . 1150
    Luis-Pedro García, Javier Cuenca, and Domingo Giménez
Comparison of Execution Time Decomposition Methods for Performance Evaluation . . . 1160
    Jan Kwiatkowski, Marcin Pawlik, and Dariusz Konieczny
An Extensible Timing Infrastructure for Adaptive Large-Scale Applications . . . 1170
    Dylan Stark, Gabrielle Allen, Tom Goodale, Thomas Radke, and Erik Schnetter
End to End QoS Measurements of TCP Connections . . . 1180
    Witold Wysota and Jacek Wytrebowicz
Performance Evaluation of Basic Linear Algebra Subroutines on a Matrix Co-processor . . . 1190
    Ahmed S. Zekri and Stanislav G. Sedukhin

Workshop on Parallel Computational Biology (PBC’2007)

High Throughput Comparison of Prokaryotic Genomes . . . 1200
    Luciana Carota, Lisa Bartoli, Piero Fariselli, Pier L. Martelli, Ludovica Montanucci, Giorgio Maggi, and Rita Casadio
A Parallel Classification and Feature Reduction Method for Biomedical Applications . . . 1210
    Mario R. Guarracino, Salvatore Cuciniello, and Davide Feminiano
Applying SIMD Approach to Whole Genome Comparison on Commodity Hardware . . . 1220
    Arpith Jacob, Marcin Paprzycki, Maria Ganzha, and Sugata Sanyal
Parallel Multiprocessor Approaches to the RNA Folding Problem . . . 1230
    Étienne Ogoubi, David Pouliot, Marcel Turcotte, and Abdelhakim Hafid
Protein Similarity Search with Subset Seeds on a Dedicated Reconfigurable Hardware . . . 1240
    Pierre Peterlongo, Laurent Noé, Dominique Lavenier, Gilles Georges, Julien Jacques, Gregory Kucherov, and Mathieu Giraud
Parallel DNA Sequence Alignment on the Cell Broadband Engine . . . 1249
    Adrianto Wirawan, Kwoh Chee Keong, and Bertil Schmidt

Workshop on High Performance Computing for Engineering Applications

Scalability and Performance Analysis of a Probabilistic Domain Decomposition Method . . . 1257
    Juan A. Acebrón and Renato Spigler
Scalability Analysis for a Multigrid Linear Equations Solver . . . 1265
    Krzysztof Banaś
A Grid-Enabled Lattice-Boltzmann-Based Modelling System . . . 1275
    Gérard Dethier, Cyril Briquet, Pierre Marchot, and P.A. de Marneffe
Parallel Bioinspired Algorithms in Optimization of Structures . . . 1285
    Waclaw Kuś and Tadeusz Burczyński
3D Global Flow Stability Analysis on Unstructured Grids . . . 1293
    Marek Morzyński and Frank Thiele
Performance of Multi Level Parallel Direct Solver for hp Finite Element Method . . . 1303
    Maciej Paszyński
Graph Transformations for Modeling Parallel hp-Adaptive Finite Element Method . . . 1313
    Maciej Paszyński and Anna Paszyńska
Acceleration of Preconditioned Krylov Solvers for Bubbly Flow Problems . . . 1323
    Jok Man Tang and Kees Vuik
Persistent Data Structures for Fast Point Location . . . 1333
    Michal Wichulski and Jacek Rokicki

Minisymposium on Interval Analysis

A Reliable Extended Octree Representation of CSG Objects with an Adaptive Subdivision Depth . . . 1341
    Eva Dyllong and Cornelius Grimm
Efficient Ray Tracing Using Interval Analysis . . . 1351
    Jorge Flórez, Mateu Sbert, Miguel A. Sainz, and Josep Vehí
A Survey of Interval Runge–Kutta and Multistep Methods for Solving the Initial Value Problem . . . 1361
    Karol Gajda, Malgorzata Jankowska, Andrzej Marciniak, and Barbara Szyszka
Towards Efficient Prediction of Decisions under Interval Uncertainty . . . 1372
    Van Nam Huynh, Vladik Kreinovich, Yoshiteru Nakamori, and Hung T. Nguyen
Interval Methods for Computing the Pareto-Front of a Multicriterial Problem . . . 1382
    Bartlomiej Jacek Kubica and Adam Woźniak
Fuzzy Solution of Interval Linear Equations . . . 1392
    Pavel Sevastjanov and Ludmila Dymova
On Checking the Monotonicity of Parametric Interval Solution of Linear Structural Systems . . . 1400
    Iwona Skalna

Author Index . . . 1411

Safety of a Session Guarantees Protocol Using Plausible Clocks

Jerzy Brzeziński, Michał Kalewski, and Cezary Sobaniec

Institute of Computing Science, Poznań University of Technology, Poland
{Jerzy.Brzezinski,Michal.Kalewski,Cezary.Sobaniec}@cs.put.poznan.pl
Abstract. Session guarantees are a group of consistency models used to manage replica consistency in a distributed system from the client’s perspective. In this paper we present a novel protocol implementing session guarantees and prove its safety. The protocol uses server-based version vectors conceptually based on plausible clocks. The version vectors have a constant size and accept dynamic reconfigurations, which is the main advantage of this approach. The cost is reduced accuracy of representation of sets of operations, which, however, does not violate session guarantees.

Keywords: consistency models, session guarantees, plausible clocks.
1 Introduction
Replication in distributed systems is used for achieving high performance and availability. Unfortunately, replication introduces inconsistency of replicas when they are updated. Users are usually not interested in the internal mechanisms providing high availability; they want the system to behave transparently, as if it were not replicated. However, transparent replication requires a strong consistency model, i.e. the system must provide strong guarantees on the ordering of modifying operations, which usually makes the system inefficient.

Several consistency models weaker than strong consistency have been developed, mainly in the context of Distributed Shared Memory systems. However, these data-centric consistency models [1] do not take mobility into account; they assume that clients are bound to the servers maintaining the replicas. Another class is formed by client-centric consistency models, also known as session guarantees [2]. Session guarantees maintain data consistency from the point of view of a single client, while the client can switch from one server to another. The data observed by the user are consistent with the previous sequence of operations issued by the client within a session. There are four different guarantees, each ordering different types of operations.

Definitions of session guarantees refer to sets of operations issued by clients. A consistency protocol of session guarantees has to represent these sets efficiently, which is not trivial because the sets grow monotonically. The original authors of session guarantees proposed version vectors for efficient representation of sets of operations. Version vectors are conceptually based on vector clocks [3,4] used for keeping track of causal dependencies in distributed systems. In a static system, version vectors have a constant size, representing all operations performed up to the logical time represented by the value of a given vector. However, in a dynamic system the set of servers participating in the processing may change: some servers may crash, others may be added to the system. As a consequence, the structure of version vectors must reflect such reconfigurations. New positions in the version vector must be allocated, and some may become unused. In general, frequent reconfigurations may cause the size of version vectors to grow in a way that is difficult to manage.

In this paper we propose adoption of plausible clocks [5,6], which are used to approximate causal dependencies in distributed systems. The main advantage of version vectors based on plausible clocks is their constant size, regardless of the number of servers and their reconfigurations. The cost is decreased accuracy of representation of sets of operations, which results in additional synchronization between servers that is not required by session guarantees but does not violate the model. This paper presents a consistency protocol of session guarantees, called VsRSG, and formally proves its safety. In this context safety means that clients’ requirements concerning session guarantees are preserved throughout the whole execution of the protocol.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1–10, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Session Guarantees
We consider a weakly consistent replicated storage system. The system consists of a number of servers holding a full copy of a set of shared objects, and clients running applications that access the objects. Clients are separated from servers, i.e. a client application may run on a different computer than the server. A client may access a shared object after selecting a single server and sending a direct request to the server. Clients are mobile, i.e. they can switch from one server to another during application execution. Session guarantees are expected to take care of data consistency observed by a migrating client. The set of shared objects replicated by the servers does not imply any particular data model or organization. Operations performed on shared objects are divided into reads and writes. A read does not change states of the shared objects, while a write does. A write may cause an update of an object, it may create a new object, or delete an existing one. A write may also atomically update states of several objects.

Operations on shared objects issued by a client $C_i$ are ordered by a relation $\xrightarrow{C_i}$ called client issue order. A server $S_j$ performs operations in an order represented by a relation $\xrightarrow{S_j}$. Writes and reads on objects will be denoted by $w$ and $r$, respectively. An operation performed by a server $S_j$ will be denoted by $w|_{S_j}$ or $r|_{S_j}$.

Definition 1. Relevant writes $RW(r)$ of a read operation $r$ is the set of writes that have influenced the current state of objects observed by the read $r$.
The exact meaning of relevant writes will strongly depend on the characteristics of a given system or application. For example, in case of simple isolated objects (i.e. objects with methods that access only their internal fields), relevant writes of a read on object $x$ may be represented by all previous writes on object $x$. Session guarantees have been defined in [2]. The following more formal definitions are based on those concepts. The definitions assume that operations are unique, i.e. they are labeled by some internal unique identifiers. Here $\xrightarrow{C_i}$ is the client issue order of client $C_i$ and $\xrightarrow{S_j}$ is the execution order of server $S_j$.

Definition 2. Read Your Writes (RYW) session guarantee is defined as follows:
$$\forall C_i\ \forall S_j \left[ w \xrightarrow{C_i} r|_{S_j} \Rightarrow w \xrightarrow{S_j} r \right]$$

Definition 3. Monotonic Writes (MW) session guarantee is defined as follows:
$$\forall C_i\ \forall S_j \left[ w_1 \xrightarrow{C_i} w_2|_{S_j} \Rightarrow w_1 \xrightarrow{S_j} w_2 \right]$$

Definition 4. Monotonic Reads (MR) session guarantee is defined as follows:
$$\forall C_i\ \forall S_j \left[ r_1 \xrightarrow{C_i} r_2|_{S_j} \Rightarrow \forall w_k \in RW(r_1) : w_k \xrightarrow{S_j} r_2 \right]$$

Definition 5. Writes Follow Reads (WFR) session guarantee is defined as follows:
$$\forall C_i\ \forall S_j \left[ r \xrightarrow{C_i} w|_{S_j} \Rightarrow \forall w_k \in RW(r) : w_k \xrightarrow{S_j} w \right]$$
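The RYW and MW definitions compare two orders: the order in which a client issued its operations and the order in which a server executed them. A minimal Python sketch of such a check is given below. It assumes that operations are identified by strings and that each server exposes its execution order as a list; the names `holds_ryw`, `holds_mw` and the data layout are illustrative assumptions, not part of the protocol. MR and WFR are left out because they additionally depend on the application-specific set RW(r).

```python
from typing import Dict, List, Tuple

# (operation id, kind, server): kind is "w" or "r"; ids are unique.
# This layout is a hypothetical choice made for the sketch.
Issued = Tuple[str, str, str]


def precedes(log: List[str], a: str, b: str) -> bool:
    """True if a and b both appear in the server log and a comes first."""
    return a in log and b in log and log.index(a) < log.index(b)


def _ordered(issue_order: List[Issued], logs: Dict[str, List[str]],
             first_kind: str, second_kind: str) -> bool:
    """For every earlier operation of first_kind and later operation of
    second_kind issued by the client, require the same order in the log of
    the server that performed the later operation."""
    for i, (a_id, a_kind, _) in enumerate(issue_order):
        if a_kind != first_kind:
            continue
        for b_id, b_kind, server in issue_order[i + 1:]:
            if b_kind == second_kind and not precedes(logs[server], a_id, b_id):
                return False
    return True


def holds_ryw(issue_order: List[Issued], logs: Dict[str, List[str]]) -> bool:
    """Read Your Writes: earlier writes precede later reads at the read's server."""
    return _ordered(issue_order, logs, "w", "r")


def holds_mw(issue_order: List[Issued], logs: Dict[str, List[str]]) -> bool:
    """Monotonic Writes: earlier writes precede later writes at the later write's server."""
    return _ordered(issue_order, logs, "w", "w")


# A client writes at S3, migrates, and reads at S1.
session = [("w1", "w", "S3"), ("r1", "r", "S1")]
ryw_ok = holds_ryw(session, {"S1": ["w1", "r1"], "S3": ["w1"]})
ryw_broken = holds_ryw(session, {"S1": ["r1"], "S3": ["w1"]})
```

In the second call server S1 performs the read without having seen `w1`, which is exactly the situation RYW forbids.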
3 VsRSG Protocol
The proposed VsRSG protocol implementing session guarantees intercepts communication between clients and servers: at the client side before sending a request, at the server side after receiving the request and before sending a reply, and at the client side after receiving the reply. These interceptions are used to exchange and maintain additional data structures necessary to preserve appropriate session guarantees. After receipt of a new request a server checks whether its state is sufficiently up to date to satisfy the client’s requirements. If the server’s state is outdated, the request is postponed and will be resumed after updating the server.

Servers periodically exchange information about writes performed in the past in order to synchronize the states of replicas. This synchronization procedure eventually causes total propagation of all writes directly submitted by clients. It does not influence safety of the VsRSG protocol but rather its liveness, therefore it will not be discussed in this paper (an example procedure is presented in [7]).

Every server $S_j$ records all writes performed locally in a history. The writes result from direct client requests, or are incorporated from other servers during the synchronization procedure. The writes are performed sequentially, therefore the history is totally ordered. Formally, histories are defined as follows:
Definition 6. A history $H_{S_j}$ is a linearly ordered set $\left(O_{S_j}, \xrightarrow{S_j}\right)$, where $O_{S_j}$ is a set of writes performed by a server $S_j$, and relation $\xrightarrow{S_j}$ represents an execution order of the writes.

During synchronization of servers the histories are concatenated. A concatenation of two histories is constructed as a sum of the first history and the new writes found in the second history. The orderings of the respective histories are preserved, and new writes are added at the end of the first history.

A write in the VsRSG protocol is labeled with a vector timestamp set to the current value of the version vector $V_{S_j}$ of the server $S_j$ performing the write for the first time. In the presentation of the VsRSG protocol the vector timestamp of a write $w$ is returned by a function $T : O \to V$. A single $i$-th position of the version vector timestamp associated with a write will be denoted by $T(w)[i]$.

The VsRSG protocol is presented in Fig. 2. A request sent by a client is a pair $\langle op, SG \rangle$, where $op$ is an operation to be performed, and $SG$ is a set of session guarantees required for this operation. Before sending to the server, the request is supplemented with a vector $W$ representing the client’s requirements. A reply is a triple $\langle op, res, W \rangle$, where $op$ is the operation just performed, $res$ represents the results of the operation (delivered to the application), and $W$ is a vector representing the state of the server just after performing the operation.

Before a client $C_i$ sends a request, a vector $W$ representing its requirements is calculated based on the type of operation and the set $SG$ of session guarantees required for the operation. The vector $W$ is set to either $0$, or $W_{C_i}$ (a vector representing writes issued by client $C_i$), or $R_{C_i}$ (a vector representing writes relevant to reads issued by the client), or to the maximum of these two vectors (lines 1, 3 and 6). The maximum of two vectors $V_1$ and $V_2$ is a vector $V = \max(V_1, V_2)$, such that $V[i] = \max(V_1[i], V_2[i])$.
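The client-side computation of the vector $W$ can be sketched in a few lines of Python. This is only an illustration of lines 1 to 7 of Fig. 2 under the assumption that vectors are plain lists of integers and that the set SG holds the guarantee names as strings; the function names `vmax` and `required_vector` are hypothetical.

```python
from typing import List, Set

Vector = List[int]


def vmax(v1: Vector, v2: Vector) -> Vector:
    """Position-wise maximum: V[i] = max(V1[i], V2[i])."""
    return [max(a, b) for a, b in zip(v1, v2)]


def required_vector(is_write: bool, sg: Set[str],
                    w_c: Vector, r_c: Vector) -> Vector:
    """Vector W attached to a request, computed from the client's W_Ci and R_Ci."""
    w = [0] * len(w_c)
    # MW constrains writes and RYW constrains reads; both refer to the client's own writes.
    if (is_write and "MW" in sg) or (not is_write and "RYW" in sg):
        w = vmax(w, w_c)
    # WFR constrains writes and MR constrains reads; both refer to writes relevant to past reads.
    if (is_write and "WFR" in sg) or (not is_write and "MR" in sg):
        w = vmax(w, r_c)
    return w


# A read that asks for both RYW and MR needs the maximum of the two client vectors.
example = required_vector(False, {"RYW", "MR"}, w_c=[2, 0], r_c=[1, 3])
```

A guarantee that does not apply to the type of the operation (for example RYW on a write) leaves $W$ untouched, so a request with no applicable guarantee carries the zero vector and is never postponed on its account.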
On receipt of a new request a server $S_j$ checks whether its local version vector $V_{S_j}$ dominates the vector $W$ sent by the client (line 14), which is expected to be sufficient for providing appropriate session guarantees. A version vector $V_1$ dominates a version vector $V_2$, which is denoted by $V_1 \geq V_2$, when $\forall i : V_1[i] \geq V_2[i]$. If the state of the server is not sufficiently up to date, the request is postponed (line 15), and will be resumed after synchronization with another server (line 41). As a result of writes performed by a server $S_j$, its version vector $V_{S_j}$ is updated (line 19), and a timestamped operation is recorded in history $H_{S_j}$ (lines 20 and 21). The current value of the server version vector $V_{S_j}$ is returned to the client (line 23) and updates the client’s vector $W_{C_i}$ in case of writes (line 26), or $R_{C_i}$ in case of reads (line 28).

Version vectors are used for efficient representation of the sets of writes that are required by clients and must be checked at the server side. Version vectors used by the standard VsSG protocol have the following form: $[v_1\ v_2\ \ldots\ v_N]$, where $N$ is the total number of servers in the system. A single position $v_j$ denotes the number of writes performed by a server $S_j$, and changes whenever a new write request is performed by the server. Because every server increments the version vector for every write, and the changes are done at different positions, the values of version
[Figure 1: three timelines (a), (b) and (c) with clients C1 and C2 and servers S1 and S3, showing the writes w(x)1 and w(x)2, the resulting version vectors, and a read issued by C2 with the RYW guarantee.]
Fig. 1. Ordering of writes in a protocol using server-based constant size version vectors
vectors at servers during execution of writes are unique. The version vectors must be unique because they are used for identifying write operations.

The VsRSG protocol uses constant size server-based version vectors consisting of $R$ positions, and a server $S_j$ is expected to update a single position $j \bmod R$ of the server’s version vector $V_{S_j}$. Obviously, for $R < N$ there are servers that share a single position of the version vectors, and thus certain writes may result in indistinguishable values of version vectors. Let us consider the case shown in Fig. 1(a). There are 2 clients and 3 servers (server $S_2$ is not shown). Let $R = 2$, which means that servers $S_1$ and $S_3$ share the first position of the version vector. Client $C_1$ issues a write of value 1 to variable (object) $x$, which is denoted by $w(x)1$, and, as a result of the write, $V_{S_1}$ is updated to $[1\ 0]$. Client $C_2$ concurrently issues another write $w(x)2$ at server $S_3$, which causes an update of the server version vector $V_{S_3}$ to $[1\ 0]$. Because these two servers share the first position of the version vector, the vector timestamps of the writes are identical. As a result, client $C_2$, after migrating to server $S_1$, cannot observe its previous write even though the RYW session guarantee was requested.

In order to solve the problem, additional ordering must be forced. The values of version vectors will be unique if writes performed by different servers sharing a common position in the version vector are globally ordered. This can be achieved by means of additional sequencers used by groups of servers. The VsRSG protocol generates a sequence number for every write (see line 10 of Fig. 2). The sequence numbers are generated for servers sharing the same position of the version vectors. Sequence numbers are checked
before performing writes (line 14), and thus writes that are requested before all previous writes of the appropriate group of servers have been performed are suspended. In the case shown in Fig. 1(a) this results in performing either $w(x)1$ before $w(x)2$, as in Fig. 1(b), or $w(x)2$ before $w(x)1$, as in Fig. 1(c). Regardless of the ordering of the writes, the requested RYW session guarantee is provided for client $C_2$.

Depending on the length of the version vector, the additional ordering forced by the VsRSG protocol limits possible concurrency of request processing; at most $R$ write requests can be performed concurrently. However, in a typical scenario reads are much more frequent than writes. Reads are processed by the VsRSG protocol in a very similar manner to the standard VsSG protocol, without additional ordering. The version vectors have a constant size, therefore their management is greatly simplified. They are also ready for changes: new participants of the processing may come and go at minimal additional cost. The price of these advantages is decreased accuracy of the version vectors resulting from their limited capacity. The sets of operations represented by plausible version vectors are larger than the sets of writes of full version vectors, which results in more “eager” propagation of updates. A positive consequence of this fact is a reduction of the number of cases where client requests must wait due to missing writes at servers.

The VsRSG protocol uses a simple mapping between servers and positions in the version vector, which consists in “wrapping” the server’s number around the length of the version vector using the modulo function. In general, the mapping can be an arbitrary function that best suits the system or application needs (see examples in [6]). The system designer can take into account additional specific knowledge concerning its characteristics, and propose a more appropriate mapping.
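The collision described for Fig. 1(a) and its resolution by ordering can be reproduced with a short Python sketch. It assumes that vectors are lists indexed from 0 and that the shared sequencer simply forces the second write to be applied on top of the first one; the helper names `slot` and `timestamp_after_write` are hypothetical, and the list index used for the shared position is an artifact of the sketch rather than a statement about the figure.

```python
from typing import List


def slot(server_id: int, r: int) -> int:
    """Position updated by server S_j: j mod R."""
    return server_id % r


def timestamp_after_write(vector: List[int], server_id: int) -> List[int]:
    """Return a copy of `vector` with the server's position incremented by one."""
    result = list(vector)
    result[slot(server_id, len(vector))] += 1
    return result


R = 2

# Without additional ordering, S1 and S3 both start from the zero vector
# and produce the same timestamp for two different writes.
t1 = timestamp_after_write([0] * R, server_id=1)
t3 = timestamp_after_write([0] * R, server_id=3)
collision = (t1 == t3)

# With a shared sequencer (simplified here), the write at S3 is suspended
# until S3 has incorporated the write of S1, so the timestamps differ.
t3_ordered = timestamp_after_write(t1, server_id=3)
```

With $R = 2$ servers $S_1$ and $S_3$ map to the same position, while $S_2$ maps to the other one, so only writes within the same group need to be sequenced.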
4 Safety of VsRSG Protocol
Definition 7. A supremum of a set of writes $O_{S_j}$, denoted by $V(O_{S_j})$, is a vector that is set to 0 for an empty set, and for nonempty sets its $i$-th position is defined as $V(O_{S_j})[i] = \max_{w \in O_{S_j}} T(w)[i]$.

Lemma 1. For every server $S_j$ running VsRSG protocol at every moment: $V(O_{S_j}) = V_{S_j}$.

Proof. By induction.
1. Basis. At the very beginning $V_{S_j} = 0$, and the set of writes $O_{S_j} = \emptyset$, therefore $V(O_{S_j}) = 0$, hence $V(O_{S_j}) = V_{S_j}$.
2. Induction step. Let us assume a state where the condition $V(O_{S_j}) = V_{S_j}$ holds. The set $O_{S_j}$ and the version vector $V_{S_j}$ can change only in the following two situations:
(a) Server $S_j$ accepts a new write requested by a client. This causes the value of $V_{S_j}[j \bmod R]$ to be incremented by 1; next the write is timestamped with the current value of vector $V_{S_j}$, and the write is added to $O_{S_j}$. This causes $V(O_{S_j})$ to be also incremented at position $j \bmod R$ by 1 (lines 19 and 21 of Fig. 2). As a result, the condition $V(O_{S_j}) = V_{S_j}$ still holds.
Safety of a Session Guarantees Protocol Using Plausible Clocks

Upon sending a request $\langle op, SG \rangle$ to server $S_j$ at client $C_i$
 1: $W \leftarrow 0$
 2: if (iswrite(op) and $MW \in SG$) or (not iswrite(op) and $RYW \in SG$) then
 3:   $W \leftarrow \max(W, W_{C_i})$
 4: end if
 5: if (iswrite(op) and $WFR \in SG$) or (not iswrite(op) and $MR \in SG$) then
 6:   $W \leftarrow \max(W, R_{C_i})$
 7: end if
 8: send $\langle op, W \rangle$ to $S_j$

Upon receiving a request $\langle op, W \rangle$ from client $C_i$ at server $S_j$
 9: if iswrite(op) then
10:   $seq \leftarrow$ getSeqNumber($j \bmod R$)
11: else
12:   $seq \leftarrow 0$
13: end if
14: while $V_{S_j} \not\geq W \lor seq > V_{S_j}[j \bmod R] + 1$ do
15:   wait()
16: end while
17: perform op and store results in res
18: if iswrite(op) then
19:   $V_{S_j}[j \bmod R] \leftarrow V_{S_j}[j \bmod R] + 1$
20:   timestamp op with $V_{S_j}$
21:   $H_{S_j} \leftarrow H_{S_j} \cup \{op\}$
22: end if
23: $W \leftarrow V_{S_j}$
24: send $\langle op, res, W \rangle$ to $C_i$

Upon receiving a reply $\langle op, res, W \rangle$ from server $S_j$ at client $C_i$
25: if iswrite(op) then
26:   $W_{C_i} \leftarrow \max(W_{C_i}, W)$
27: else
28:   $R_{C_i} \leftarrow \max(R_{C_i}, W)$
29: end if
30: deliver res

Every $\Delta t$ at server $S_j$
31: foreach $S_k \neq S_j$ do
32:   send $\langle S_j, H_{S_j} \rangle$ to $S_k$
33: end for

Upon receiving an update $\langle S_k, H \rangle$ at server $S_j$
34: foreach $w_i \in H$ do
35:   if $V_{S_j} \not\geq T(w_i)$ then
36:     perform $w_i$
37:     $V_{S_j} \leftarrow \max(V_{S_j}, T(w_i))$
38:     $H_{S_j} \leftarrow H_{S_j} \cup \{w_i\}$
39:   end if
40: end for
41: signal()

Fig. 2. VsRSG consistency protocol of session guarantees
(b) Server $S_j$ incorporates a write $w$ received from another server. This causes the current value of $V_{S_j}$ to be maximized with the vector $T(w)$ of the write being added (line 37). The new write is then added to $O_{S_j}$ (line 38). As a result, values of $V_{S_j}$ and $V(O_{S_j})$ will be incremented at the same positions by the same values, therefore the condition $V(O_{S_j}) = V_{S_j}$ still holds.

A set of writes represented by a version vector $V$ will be denoted by $WS(V)$, and is formally defined in the following manner.

Definition 8. A write-set $WS(V)$ of a given version vector $V$ is defined as $WS(V) = \bigcup_{j=1}^{N_S} \{\, w \in O_{S_j} : T(w) \leq V \,\}$.

Lemma 2. For any two vectors $V_1$ and $V_2$ used by servers and clients of VsRSG protocol: $V_1 \geq V_2 \iff WS(V_1) \supseteq WS(V_2)$.

Proof.
1) Sufficient condition. By contradiction, let us assume that $V_1 \geq V_2 \land WS(V_1) \not\supseteq WS(V_2)$, which means that $\exists w \in O\ [\, w \notin WS(V_1) \land w \in WS(V_2) \,]$ and, according to Definition 8, $\exists k\ (T(w)[k] > V_1[k] \land T(w)[k] \leq V_2[k]) \implies V_1[k] < V_2[k] \implies V_1 \not\geq V_2$.
2) Necessary condition. By contradiction, let us assume that $WS(V_1) \supseteq WS(V_2) \land V_1 \not\geq V_2$, which means that $\exists k : V_1[k] < V_2[k]$. Version vectors at position $k$ are only incremented when a new write is performed by a server $S_j$, where $j \bmod R = k$ (line 19). The new value at position $k$ is unique among all version vectors identifying writes, because it is generated by the function getSeqNumber($k$), returning consecutive numbers for a given position (line 10). Before the server version vector is incremented, its value must be exactly 1 less than the sequence number, therefore the incrementation at line 19 has the same effect as the assignment $V_{S_j}[j \bmod R] = seq$. As a result, every value of the version vector at position $k$ is associated with a write accepted by some server. Based on the observation $\exists k : V_1[k] < V_2[k]$, it can be concluded that $\exists w \in O_{S_j}\ [\, w \in WS(V_2) \land w \notin WS(V_1) \,]$ and hence $WS(V_1) \not\supseteq WS(V_2)$.
The implication $V_1 \geq V_2 \implies WS(V_1) \supseteq WS(V_2)$ is true for any vectors $V_1$ and $V_2$, while the other implication $WS(V_1) \supseteq WS(V_2) \implies V_1 \geq V_2$ is true only for vectors maintained by servers and clients, i.e. for vectors $V_{S_j}$, $W_{C_i}$, and $R_{C_i}$.

Lemma 3. At any time during execution of VsRSG protocol $O_{S_j} = WS(V_{S_j})$.

Proof. By contradiction:
1) Let us assume that $\exists w \in O_{S_j} : w \notin WS(V_{S_j})$. According to Definition 8, a write $w$ does not belong to $WS(V_{S_j})$ when $T(w) \not\leq V_{S_j}$. This implies that $\exists k : T(w)[k] > V_{S_j}[k]$, and, according to Lemma 1, $T(w)[k] > V(O_{S_j})[k]$, which implies $V(O_{S_j}) \not\geq T(w)$. Based on Definition 7, $w \notin O_{S_j}$, which is a contradiction.
2) Let us assume that $\exists w \in WS(V_{S_j}) : w \notin O_{S_j}$. According to Definition 7, a write $w$ does not belong to $O_{S_j}$ when $V(O_{S_j}) \not\geq T(w)$. This implies that $\exists k : T(w)[k] > V(O_{S_j})[k]$, and, according to Lemma 1, $T(w)[k] > V_{S_j}[k]$, which implies $T(w) \not\leq V_{S_j}$. Based on Definition 8, $w \notin WS(V_{S_j})$, which is a contradiction.

Lemma 4. At any time during execution of VsRSG protocol $WS(W_{C_i})$ contains all writes issued by a client $C_i$.

Proof. A write issued by a client $C_i$ and performed by a server $S_j$ updates the client's vector $W_{C_i}$ by calculating a maximum of its current value and the value of the server version vector $V_{S_j}$ (lines 23 and 26). Hence, after performing the write, $W_{C_i} \geq V_{S_j}$, and (according to Lemma 2) $WS(W_{C_i}) \supseteq WS(V_{S_j})$, and (according to Lemma 3) $WS(W_{C_i}) \supseteq O_{S_j}$. It means that the write-set $WS(W_{C_i})$ contains all writes requested directly at server $S_j$, including also writes requested by client $C_i$ at server $S_j$. The vector $W_{C_i}$ monotonically increases, therefore no past write is lost in case of a migration to another server.

Lemma 5. At any time during execution of VsRSG protocol $WS(R_{C_i})$ contains all writes relevant to reads issued by a client $C_i$.

Proof. A read issued by a client $C_i$ and performed by a server $S_j$ updates the client's vector $R_{C_i}$ by calculating a maximum of its current value and the value of the server version vector $V_{S_j}$ (lines 23 and 28). Hence (according to Lemmata 2 and 3) $R_{C_i} \geq V_{S_j} \implies WS(R_{C_i}) \supseteq WS(V_{S_j}) = O_{S_j}$. It means that the write-set $WS(R_{C_i})$ contains all writes performed at server $S_j$, therefore also writes relevant to reads requested by client $C_i$ at server $S_j$. The vector $R_{C_i}$ monotonically increases, therefore no past write is lost in case of a migration to another server.

Theorem 1. RYW session guarantee is preserved by VsRSG protocol for clients requesting it.

Proof. Let us consider two operations $w$ and $r$, issued by a client $C_i$ requiring RYW session guarantee. Let the read follow the write in the client's issue order, and let the read be performed by a server $S_j$, i.e. $w \stackrel{C_i}{\rightarrow} r|_{S_j}$. After performing $w$ we have (according to Lemma 4) $w \in WS(W_{C_i})$. Because $V_{S_j} \geq W_{C_i}$ is fulfilled before performing $r$ (lines 3 and 14), we get (according to Lemma 2) $WS(V_{S_j}) \supseteq WS(W_{C_i}) \implies w \in WS(V_{S_j})$. Because local operations at servers are totally ordered, we get $w \stackrel{S_j}{\rightarrow} r$. This will happen for any client $C_i$ requiring RYW and any server $S_j$, so $\forall C_i\ \forall S_j\ [\, w \stackrel{C_i}{\rightarrow} r|_{S_j} \implies w \stackrel{S_j}{\rightarrow} r \,]$, which means that RYW session guarantee is preserved.

Theorem 2. MR session guarantee is preserved by VsRSG protocol for clients requesting it.

Proof. Let us consider two reads $r_1$ and $r_2$, issued by a client $C_i$ requiring MR session guarantee. Let the second read follow the first read in the client's issue order, and let the second read be performed by a server $S_j$, i.e. $r_1 \stackrel{C_i}{\rightarrow} r_2|_{S_j}$. After performing $r_1$ we have (according to Lemma 5) $\forall w_k \in RW(r_1) : w_k \in WS(R_{C_i})$. Because $V_{S_j} \geq R_{C_i}$ is fulfilled before performing $r_2$ (lines 6 and 14), we get (according to Lemma 2) $WS(V_{S_j}) \supseteq WS(R_{C_i}) \implies \forall w_k \in RW(r_1) : w_k \in WS(V_{S_j})$. Because local operations at servers are totally ordered, we get $\forall w_k \in RW(r_1) : w_k \stackrel{S_j}{\rightarrow} r_2$. This will happen for any client $C_i$ and any server $S_j$, so $\forall C_i\ \forall S_j\ [\, r_1 \stackrel{C_i}{\rightarrow} r_2|_{S_j} \implies \forall w_k \in RW(r_1) : w_k \stackrel{S_j}{\rightarrow} r_2 \,]$, which means that MR session guarantee is preserved.
A theorem and a proof for MW are analogous to RYW. A theorem and a proof for WFR are analogous to MR.
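The proofs above rest on two steps of Fig. 2: the client's choice of the vector $W$ (lines 1-8) and the server's wait condition (line 14). The following Python fragment is a minimal sketch of these two steps only, assuming that the wait condition reads "the server vector does not yet dominate $W$, or the sequence number is not the next one". The function names `dominates`, `vmax`, `required_vector` and `must_wait` are hypothetical; vectors are plain lists and `pos` is the index $j \bmod R$.

```python
def dominates(v, w):
    """True when v >= w holds at every position."""
    return all(a >= b for a, b in zip(v, w))


def vmax(v, w):
    """Position-wise maximum of two vectors of equal length."""
    return [max(a, b) for a, b in zip(v, w)]


def required_vector(is_write, guarantees, w_c, r_c):
    """Lines 1-8 of Fig. 2: the vector W a client attaches to its request."""
    w = [0] * len(w_c)
    if (is_write and "MW" in guarantees) or (not is_write and "RYW" in guarantees):
        w = vmax(w, w_c)  # writes issued by this client must be present
    if (is_write and "WFR" in guarantees) or (not is_write and "MR" in guarantees):
        w = vmax(w, r_c)  # writes relevant to this client's reads must be present
    return w


def must_wait(v_s, w, seq, pos):
    """Line 14 of Fig. 2 (assumed reading): suspend the request while the server
    misses required writes or an earlier write on the shared position is pending."""
    return (not dominates(v_s, w)) or seq > v_s[pos] + 1
```

For instance, a client whose write vector is `[2, 0]` and who requests a read with RYW at a server holding `[1, 0]` is suspended until synchronization raises the server vector to `[2, 0]`; the same read without any guarantee is served immediately.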
References
1. Tanenbaum, A.S., van Steen, M.: Distributed Systems: Principles and Paradigms. Prentice Hall, New Jersey (2002)
2. Terry, D.B., Demers, A.J., Petersen, K., Spreitzer, M., Theimer, M., Welch, B.W.: Session guarantees for weakly consistent replicated data. In: Proc. of the Third Int. Conf. on Parallel and Distributed Information Systems (PDIS 1994), Austin, USA, pp. 140–149. IEEE Computer Society, Los Alamitos (1994)
3. Mattern, F.: Virtual time and global states of distributed systems. In: Cosnard, Q., Raynal, R. (eds.) Proc. of the Int. Conf. on Parallel and Distributed Algorithms, pp. 215–226. Elsevier Science Publishers B.V, Amsterdam (1988)
4. Fidge, C.: Logical time in distributed computing systems. Computer 24, 28–33 (1991)
5. Torres-Rojas, F.J., Ahamad, M.: Plausible clocks: Constant size logical clocks for distributed systems. Distributed Computing 12, 179–196 (1999)
6. Gidenstam, A., Papatriantafilou, M.: Adaptive plausible clocks. In: Proc. of the 24th Int. Conf. on Distributed Computing Systems (ICDCS 2004), Tokyo, Japan, pp. 86–93 (2004)
7. Petersen, K., Spreitzer, M.J., Terry, D.B., Theimer, M.M., Demers, A.J.: Flexible update propagation for weakly consistent replication. In: Proc. of the 16th ACM Symp. on Operating Systems Principles (SOSP-16), Saint Malo, France, pp. 288–301 (1997)
On Checkpoint Overhead in Distributed Systems Providing Session Guarantees
Arkadiusz Danilecki, Anna Kobusińska, and Marek Libuda
Institute of Computing Science, Poznań University of Technology, Poland
{Arkadiusz.Danilecki,Anna.Kobusinska,Marek.Libuda}@cs.put.poznan.pl
Abstract. This paper presents the performance evaluation of checkpointing and rollback-recovery protocols for distributed mobile systems that guarantee client-centric consistency models despite failures of servers. The performance of the considered protocols is evaluated by estimating the checkpointing overhead. Additionally, the paper discusses the influence of the consistency models provided by the system on the moments of taking checkpoints. The impact of the obtained results on strategies for checkpointing in a mobile environment guaranteeing client-centric consistency models is assessed.

Keywords: checkpointing, rollback-recovery, performance evaluation, distributed mobile systems, session guarantees.
1 Introduction
The variety of computational and networking capabilities introduced by mobile computing results in our growing reliance on mobile applications. It also mandates deploying mobile systems that meet not only the performance demands imposed on them, but also reliability requirements. A well-known method to achieve fault tolerance in distributed systems is checkpointing and rollback recovery. This technique reduces the overall expected time of completing the computation in case of failures, at the same time increasing the efficiency and enhancing the reliability of the system. In general, checkpointing and rollback-recovery is based on the idea of saving the current system state during a failure-free execution, to be able to restore the error-free state from the saved data in case of failures. Each time the system's state is recorded into a stable storage, able to survive all failures, a checkpoint is said to be taken. The benefit of checkpointing, however, comes at a price: excessive checkpointing results in performance degradation, while insufficient checkpointing incurs an expensive recovery overhead. Thus, the appropriate checkpoint interval minimizes the expected recovery time of the application in the presence of failures and reduces the time added to the application as a result of checkpointing in the failure-free execution. Therefore, the problem of placing checkpoints
This work was supported in part by the State Committee for Scientific Research (KBN), Poland, under grant KBN 3 T11C 073 28.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 11–19, 2008. © Springer-Verlag Berlin Heidelberg 2008
A. Danilecki, A. Kobusińska, and M. Libuda
"optimally" in time to meet the system performance objective is of paramount importance. Determining the appropriate checkpoint interval is a field of research with a rich history. The first papers on the topic appeared in the 1970s in the context of transaction processing systems. Later work concentrated on real-time and distributed systems. It was suggested that the checkpointing frequency should be a function of the failure rate [1,2]. On the other hand, it was proposed that the number of checkpoints should rely on the distribution of the program execution time [3]. In turn, in this paper we state that the moment of taking a checkpoint depends on the consistency model provided by the system. We consider the client-centric consistency models (called session guarantees) proposed for distributed mobile systems, and affirm that the most desirable moments of taking checkpoints in such systems depend on the semantics of the required session guarantees. To support this statement, we evaluate the performance of the checkpointing and rollback-recovery protocols proposed in [4], which provide session guarantees despite failures of servers. The checkpointing scenarios considered in these protocols are evaluated by estimating the checkpointing overhead. Additionally, the impact of the obtained results on strategies for checkpointing in a mobile environment guaranteeing client-centric consistency models is assessed.

A realistic performance evaluation of checkpointing and rollback-recovery is difficult due to numerous factors that must be taken into account, involving a detailed characterization of the system and of a client's application. Therefore, the performance of the proposed rollback-recovery protocols is evaluated in this paper quantitatively, by carrying out appropriate simulation experiments.

The rest of the paper is organized as follows. We describe the system model and session guarantees in Section 2. An informal presentation of the rollback-recovery protocols appears in Section 3. Section 4 lists the input parameters of the simulation experiments. The results of the experiments follow in Section 5, and finally we conclude in Section 6.
2 Distributed Mobile System Model
Throughout this paper, a replicated distributed storage system is considered. The system consists of a number of unreliable servers, each holding a full copy of the shared objects, and clients running applications that access these objects. Clients are mobile, i.e. they can switch from one server to another during an application execution. To access a shared object, a client selects a single server and sends a direct request to this server. Operations are issued by clients sequentially, i.e. a new operation may be issued only after the results of the previous one have been obtained. The storage replicated by servers does not imply any particular data model or organization. Operations performed on shared objects are basically divided into reads and writes. Clients can concurrently submit conflicting writes at different servers, e.g. writes that modify overlapping parts of the data storage. It is assumed that clients perceive the data from the replicated storage according to session guarantees [5], also called client-centric consistency models, which specify consistency conditions of the state of replicas that must hold at a
On Checkpoint Overhead in Distributed Systems
server to execute an operation requested by the client. The informal definitions of session guarantees in the context of replicated shared objects are given below:
– Read Your Writes (RYW) guarantee states that a read operation requested by a client can be executed on a server that has performed all write operations previously issued by the requesting client.
– Monotonic Reads (MR) guarantee states that a read operation requested by a client can be executed by a server that has performed all relevant writes preceding past read operations of the client. The term relevant writes has been introduced in [5]. The interpretation of this term for the purpose of this paper is that all write operations that influence the current state of a given object are relevant to a read operation performed by the server.
– Monotonic Writes (MW) guarantee states that a write operation issued by a client can be executed on a server if it has performed all write operations previously issued by the client.
– Writes Follow Reads (WFR) guarantee states that a write requested by a client can be executed on a server that has performed all writes relevant to reads previously requested by the client.

In this paper, we focus on failures of servers and assume the crash-recovery failure model, i.e. servers may crash and recover after crashing a finite number of times [6]. Additionally, we assume that servers have access to a stable storage, able to survive all failures. To ensure fault tolerance, objects are saved in the stable storage in the form of a log or a checkpoint. A log is a set of operations issued by the client, while a checkpoint is a pair consisting of a vector describing the writes performed by the server and the ordered set of operations performed by the server, called the history.
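The four informal definitions above can be read as a single admission test. The following Python sketch is only an illustration under simplifying assumptions: writes are represented by opaque identifiers, and the sets `client_writes` (writes issued by the client) and `client_read_relevant` (writes relevant to the client's past reads) are assumed to be known exactly. The function name `can_execute` is hypothetical.

```python
def can_execute(op_is_write, guarantees, performed, client_writes, client_read_relevant):
    """Return True if a server that has performed `performed` may execute the request.

    guarantees           -- subset of {"RYW", "MR", "MW", "WFR"} requested by the client
    performed            -- set of writes already performed by the server
    client_writes        -- set of writes previously issued by the client
    client_read_relevant -- set of writes relevant to the client's previous reads
    """
    required = set()
    # RYW constrains reads, MW constrains writes; both refer to the client's own writes.
    if (op_is_write and "MW" in guarantees) or (not op_is_write and "RYW" in guarantees):
        required |= client_writes
    # MR constrains reads, WFR constrains writes; both refer to writes seen by past reads.
    if (op_is_write and "WFR" in guarantees) or (not op_is_write and "MR" in guarantees):
        required |= client_read_relevant
    return required <= performed
```

A request issued without any guarantee is always admissible in this sketch, because the set of required writes stays empty.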
3 The Rollback-Recovery Protocols
To preserve the required session guarantee, the rollback-recovery protocol must ensure that writes issued by the client and essential to preserve this guarantee are not lost as a result of a server's failure. In the proposed protocols, this is achieved by logging appropriate operations obtained by servers in the stable storage. To optimize the protocol, servers save only some of the obtained operations, namely those received directly from clients. Servers periodically exchange information about writes performed in the past in order to synchronize the states of their replicas. This synchronization procedure eventually causes total propagation of all writes directly submitted by clients. The operations obtained during the synchronization procedure are performed by the server, but they need not be logged, because they have already been saved in the stable storage of other servers. Each server periodically takes a checkpoint of its state. When the checkpoint is taken, the log is cleared. After a failure, the failed server restarts from the latest checkpoint and replays the operations from the log.
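The interplay of logging, checkpointing and recovery described above can be sketched in a few lines of Python. This is a deliberately reduced illustration, not the protocols of [4]: the class `Server` and its methods are hypothetical, the checkpoint stores only the replica state (the version vector and history are omitted), and logging a client write before performing it is an assumption made here.

```python
class Server:
    """Toy replica with a volatile state and a simulated stable storage."""

    def __init__(self):
        self.state = {}              # volatile: object name -> value
        self.stable_log = []         # stable: writes received directly from clients
        self.stable_checkpoint = {}  # stable: last saved copy of the state

    def client_write(self, key, value):
        """Write received directly from a client: logged, then performed."""
        self.stable_log.append((key, value))
        self.state[key] = value

    def sync_write(self, key, value):
        """Write obtained during synchronization: performed but not logged,
        since the server that accepted it has already saved it."""
        self.state[key] = value

    def take_checkpoint(self):
        """Save the current state and clear the log."""
        self.stable_checkpoint = dict(self.state)
        self.stable_log.clear()

    def crash_and_recover(self):
        """Lose the volatile state, restart from the checkpoint, replay the log."""
        self.state = dict(self.stable_checkpoint)
        for key, value in self.stable_log:
            self.state[key] = value
```

In this sketch a synchronized write that arrived after the latest checkpoint is missing after recovery; it is expected to be obtained again in a later synchronization round from the server that logged it.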
The moments of taking checkpoints are determined by session guarantee requirements [4]. Therefore, for each guarantee, the sets of operations essential to preserve it are distinguished, and a desirable moment of taking a checkpoint, denoted by DMTC, is defined. DMTC indicates a moment before which there is no need to take a checkpoint, because the server has not performed any operation required by a given session guarantee. In the case of the RYW guarantee, DMTC is determined by obtaining a read request from a client. For the MR guarantee, DMTC is determined by obtaining a read request which follows a read previously issued by the same client. For the MW session guarantee, DMTC is determined by obtaining a write request which follows a write previously issued by the same client. Finally, for the WFR session guarantee, DMTC is determined by obtaining a write request which follows a read request issued by the same client. Depending on system characteristics, checkpoints can in general be taken every n-th DMTC, where n = 1, 2, 3, ... However, there is no need to take checkpoints between two consecutive DMTCs.

In this paper, several rollback-recovery protocols proposed in [4], called rVsSG, rVsRYW, rVsMW, rVsMR, rVsWFR and rVsAll, are considered. In the rVsSG protocol, servers take checkpoints periodically, every K operations issued by clients. Protocols rVsRYW, rVsMW, rVsMR, and rVsWFR take checkpoints according to the DMTC for RYW, MW, MR and WFR session guarantees, respectively. Finally, the rVsAll protocol gathers the features of the rVsRYW, rVsMW, rVsMR, and rVsWFR protocols and determines the moment of taking a checkpoint in accordance with the DMTC of all session guarantees.
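The DMTC rules for the four guarantees, together with the option of taking a checkpoint only every n-th DMTC, can be sketched as follows. This is a minimal illustration of the informal rules stated above; the names `is_dmtc` and `CheckpointScheduler` are hypothetical, and the client's earlier operations are assumed to be available as a simple list of `"r"` and `"w"` markers.

```python
def is_dmtc(guarantee, op_is_write, client_history):
    """Decide whether the incoming request marks a DMTC for the given guarantee.

    client_history -- list of "r"/"w" markers for operations the same client
                      issued earlier (assumed bookkeeping, not part of [4]).
    """
    issued_read = "r" in client_history
    issued_write = "w" in client_history
    if guarantee == "RYW":
        return not op_is_write                   # any read request
    if guarantee == "MR":
        return not op_is_write and issued_read   # read following a read
    if guarantee == "MW":
        return op_is_write and issued_write      # write following a write
    if guarantee == "WFR":
        return op_is_write and issued_read       # write following a read
    raise ValueError(f"unknown session guarantee: {guarantee}")


class CheckpointScheduler:
    """Takes a checkpoint on every n-th DMTC (n = 1 means every DMTC)."""

    def __init__(self, n=1):
        self.n = n
        self.count = 0

    def on_dmtc(self):
        """Register one DMTC; return True when a checkpoint should be taken."""
        self.count += 1
        return self.count % self.n == 0
```

With `CheckpointScheduler(n=5)`, for example, only the 5th, 10th, 15th, ... DMTC triggers a checkpoint.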
4 Input Parameters of Simulation Experiments
The performance of the proposed rollback-recovery protocols is quantitatively evaluated in terms of checkpointing overhead. Since the checkpoint overhead is the amount of time added to the application in the failure-free run as the result of checkpointing, the total execution time of the protocol was the main criterion of the evaluation. The experiments were performed with respect to two points of reference: the coherence protocol called VsSG [7], and the rollback-recovery protocol rVsPes [4], which implements typical pessimistic checkpointing, where every server logs every write operation it receives and takes checkpoints from time to time. For the performance analysis a refined simulator, called Distributed Algorithms Simulator (DAS) [8], has been developed. Since, in general, the simulation parameters depend on the system configuration and hardware characteristics, the values used in DAS were estimated through benchmark tests, which assessed the memory read/write transfer time, hard drive read/write transfer time, LAN, WAN and WIFI network delays, failure model, intensity of client migrations, and finally log and checkpoint delays. To ensure that the simulator works as expected, the mean values obtained during the validation tests were used as DAS input values. In the performed experiments we assumed that the hard drive write and read access times are set to 3 and 2, respectively. These values represent the
time of saving a single operation in the stable storage and of reading it from the storage while retrieving this operation during rollback-recovery. The server's local read/write operation execution time is characterized by the memory access time and is set to 0. The cost of logging and checkpointing is calculated on the basis of the definitions of these mechanisms and the above values. Thus, the cost of logging an operation in the stable storage is simply the hard drive access time, while the cost of a checkpoint is the number of objects saved in the checkpoint plus the number of operations in the history, multiplied by the hard drive access time.

The simulation experiments were repeated 150 times for each protocol and each combination of the following parameters: 10, 50, 100, 200, 300, 400 clients, 15 servers, and two client application scenarios, in which 20 and 80 percent of all access operations are writes (from now on these scenarios will be denoted by 0.2 and 0.8, respectively). The obtained results are presented in the following section. For each of the conducted experiments, the 99% confidence limits for the mean time value never exceed 1.81% of the mean value, while the 95% confidence limits never exceed 1.37% of the mean value. Therefore, each point in the figures presented in the next section corresponds, with 99% probability, to a range no wider than 1.81% of the mean time value, and, with 95% probability, to a range no wider than 1.37%.
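The cost model and the size of the experiment grid described above can be written down as a short Python sketch. The constant and function names are hypothetical. The checkpoint cost is read here as the sum of saved objects and history operations multiplied by the hard drive write time, which is one plausible interpretation of the description, and the choice of the write time (rather than the read time) for saving is likewise an assumption.

```python
HDD_WRITE_TIME = 3   # time of saving a single operation in the stable storage
HDD_READ_TIME = 2    # time of reading a single operation back during recovery
MEMORY_ACCESS_TIME = 0  # local read/write execution time at a server


def log_cost(write_time=HDD_WRITE_TIME):
    """Cost of logging one operation: a single hard drive access."""
    return write_time


def checkpoint_cost(num_objects, history_len, write_time=HDD_WRITE_TIME):
    """Assumed reading: (objects + operations in history) * hard drive access time."""
    return (num_objects + history_len) * write_time


# Parameter grid of the experiments: client counts and write ratios, 15 servers.
CLIENT_COUNTS = [10, 50, 100, 200, 300, 400]
WRITE_RATIOS = [0.2, 0.8]
NUM_SERVERS = 15
REPETITIONS = 150


def runs_per_protocol():
    """Number of simulation runs needed for one protocol over the whole grid."""
    return REPETITIONS * len(CLIENT_COUNTS) * len(WRITE_RATIOS)
```

Under these assumptions, a checkpoint holding 4 objects and a history of 10 operations costs 42, which equals the cost of logging 14 individual operations.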
5 Simulation Experiments and Performance Evaluation
Figure 1 presents a comparison of the VsSG, rVsSG, rVsAll and rVsPes protocols. The vertical axes of the graphs use a logarithmic scale and represent the time, while the horizontal axes represent the number of clients. In the failure-free run, the best performance, regardless of the application scenario and the number of clients, is achieved by the VsSG consistency protocol, which does not bear any additional expenses connected with taking a checkpoint and accessing the stable storage. On the other hand, the highest checkpoint overhead is incurred by the rVsPes rollback-recovery protocol, which saves in the stable storage every obtained write operation, i.e. the ones issued by clients as well as the ones obtained during synchronization with other servers. Since frequent access to the stable storage is very time-consuming, the performance penalty in the case of the rVsPes protocol is very high. Obviously, the higher the percentage of writes among all access operations in the client's application scenario, the higher the performance degradation. This is visible in Figure 1, where the curve characterizing the rVsPes protocol for the application scenario with a 0.8 ratio of writes to all operations is higher and grows faster than in the case of the application scenario with a 0.2 write ratio.

Fig. 1. Checkpoint overhead for 0.2 and 0.8 client's application scenario (two panels; vertical axis: execution time, logarithmic scale; horizontal axis: number of clients; curves: VsSG, rVsSG, rVsAll, rVsPes)

Important observations are related to the rVsSG and rVsAll rollback-recovery protocols. They both log only operations received directly from clients. Thus, the number of taken checkpoints and the overall execution time for these protocols are significantly smaller than for rVsPes. In our experiments, the rVsSG protocol takes a checkpoint only when the server's log contains 10 operations, while in the case of the rVsAll protocol, the moment of checkpointing the server's state is determined by the required session guarantee. In general, checkpoints in the rVsAll protocol are taken every:
– read operation r required with RYW, if, after the latest checkpoint, r was preceded in the server's execution order by writes issued by the same client
– second write w required with MW, if, since the latest checkpoint, the server has already performed one write issued by the same client
– second read r required with MR, if, since the latest checkpoint, the server has already performed one read request issued by the same client and having a non-empty set of relevant writes
– write w required with WFR, if, since the latest checkpoint, w follows in the server's execution order a read operation issued by the same client and having a non-empty set of relevant writes.
When clients issue significantly more reads than writes, the checkpoint overhead of the rVsAll protocol is higher than that of rVsSG. This comes from the fact that the rVsAll protocol takes checkpoints also while receiving read requests issued with the RYW or MR session guarantee. For the opposite client application scenario, where clients issue considerably more writes than reads, the execution time of the rVsAll protocol is higher than that of rVsSG only for a small number of clients. With the growing number of clients, the number of obtained requests also grows, so the execution time, and thus the checkpoint overhead, becomes higher for the rVsSG protocol. In our experiment, the number of clients at which the relation between the checkpoint overheads of the rVsSG and rVsAll protocols changes is close to 200 (Fig. 1).

As mentioned in Section 3, the moment of taking a checkpoint in the case of the rVsAll protocol may be changed, and checkpoints may be taken every multiple of DMTC. Figure 2 presents a comparison of the checkpoint overhead of the rVsAll protocol with checkpoints taken every DMTC, every 5th DMTC, every 10th DMTC and every 20th DMTC. The obtained results depend on the number of clients in the system. For both types of the client's application scenario, rescheduling the moment of taking a checkpoint results in a significant decrease of the checkpoint overhead, which is the effect of the decrease of checkpoint costs when the checkpoints are taken less frequently. The experiments show that the differences in the overhead of rollback-recovery protocols where checkpoints are taken every n-th DMTC (where n = 5, 10, 20) are not high (Fig. 2).

Fig. 2. Checkpoint overhead of rVsAll protocol with checkpoints taken every 1, 5, 10, 20-DMTC for 0.2 and 0.8 application scenario (two panels; vertical axis: execution time; horizontal axis: number of clients; curves: rVsAll, rVsAll5, rVsAll10, rVsAll20)

Another observation is that, in the case of the 0.2 client's application scenario, the checkpoint overhead of the rVsSG protocol and of the rVsAll protocols that take checkpoints every 5th, 10th and 20th DMTC is lower than that of the rVsAll protocol taking checkpoints every single DMTC. Thus, a natural question arises regarding the relationship between the overheads of the first two protocols. The comparison of the execution time of the protocols rVsSG, rVsAll and rVsAll taking checkpoints every 20th DMTC, for the 0.2 and 0.8 client's application scenarios, is shown in Figure 3.

Fig. 3. Checkpoint overhead of protocols: rVsSG, rVsAll and rVsAll taking checkpoints every 20th-DMTC (vertical axis: execution time; horizontal axis: number of clients; curves for the 0.2 and 0.8 scenarios: rVsSG, rVsAll, rVsAll20)

Again, the obtained results depend on the number of clients in the system. In the case of both application scenarios, the execution time of the rVsAll protocol that takes checkpoints every 20th DMTC is the shortest, so the checkpoint overhead of this protocol is the lowest. The obtained results demonstrate that rVsAll protocol, for which the moment of taking a checkpoint
may be rescheduled, is an efficient approach for providing fault tolerance for distributed applications with session guarantees (Fig. 3).

The last experiment compares the performance of protocols taking checkpoints according to the requirements of only one, pre-defined session guarantee. The obtained results are shown in Figure 4. The overhead of the rVsAll protocol is the highest in the case of the 0.2 client's application scenario, which results from the fact that this protocol takes checkpoints when both types of client requests, namely reads and writes, are received.

Fig. 4. The execution time of rollback-recovery protocols taking checkpoints according to the requirements of only one session guarantee for 0.2 and 0.8 application scenario (two panels; vertical axis: execution time; horizontal axis: number of clients; curves: VsSG, rVsSG, rVsAll, rVsMR, rVsMW, rVsRYW, rVsWFR)
In the 0.2 application scenario, the overhead of the rVsMR and rVsRYW protocols, which take checkpoints on read operations, is higher than that of the rVsMW and rVsWFR protocols. Conversely, in the 0.8 client's application scenario, the overhead of the rVsMW and rVsWFR protocols is higher than the overhead of rVsMR and rVsRYW.
6 Conclusions
This paper has described the experimental performance evaluation of checkpointing and rollback-recovery protocols, done by estimating the checkpoint overhead. The obtained empirical results show that rollback-recovery mechanisms integrated with consistency protocols of session guarantees efficiently ensure that session guarantees are provided despite server failures. These results also indicate that the proposed rVsSG, rVsRYW, rVsMW, rVsMR, rVsWFR and rVsAll protocols outperform the pessimistic rollback-recovery protocol. The advantage of the proposed protocols over the rVsPes protocol results from the smaller number of operations they save in the stable storage, and thus from their less frequent access to this storage. However, the simulation experiments also show that the moment of taking a checkpoint, determined by both the applied consistency model and the client application scenario, has a significant impact on the protocol performance. Rescheduling the moment of taking a checkpoint results in the significant decrease of the
On Checkpoint Overhead in Distributed Systems
19
checkpoint overhead of the proposed protocols. Thus, in the paper the evaluation of the cost of various moments of taking a checkpoint was done. Our future work is closely related with described in this paper experimantal results and encompassess the mechanism of adapting (tuning) rollback-recovery protocols to specific characteristics of clients requests.
References
1. Bruno, J., Coffman, E.: Optimal fault-tolerant computing on multiprocessor systems. Acta Informatica 34, 881–904 (1997)
2. Plank, J., Li, K., Puening, M.: Diskless checkpointing. IEEE Trans. Parallel and Distributed Systems 9, 972–986 (1998)
3. Ling, Y., Mi, J., Lin, X.: A variational calculus approach to optimal checkpoint placement. IEEE Trans. on Computers 50(7), 699–708 (2001)
4. Kobusinska, A.: Rollback-Recovery Protocols for Distributed Mobile Systems Providing Session Guarantees. PhD thesis, Institute of Computing Science, Poznan University of Technology (2006)
5. Terry, D.B., Demers, A.J., Petersen, K., Spreitzer, M., Theimer, M., Welch, B.W.: Session guarantees for weakly consistent replicated data. In: Proc. of the Third Int. Conf. on Parallel and Distributed Information Systems (PDIS 1994), Austin, USA, pp. 140–149. IEEE Computer Society, Los Alamitos (1994)
6. Guerraoui, R., Rodrigues, L.: Introduction to distributed algorithms. Springer, Heidelberg (2004)
7. Sobaniec, C.: Consistency Protocols of Session Guarantees in Distributed Mobile Systems. PhD thesis, Institute of Computing Science, Poznan University of Technology (2005)
8. Gorski, F., Kobusińska, A., Marciniak, P., Plaza, M., Stempin, S.: Das - the distributed algorithms simulator. Technical Report RA-021/06, II-PP (2006)
Performance Evolution and Power Benefits of Cluster System Utilizing Quad-Core and Dual-Core Intel Xeon Processors
Pawel Gepner, David L. Fraser, and Michal F. Kowalik
Intel Corporation
{pawel.gepner,david.l.fraser,michal.f.kowalik}@intel.com
Abstract. Multi-core processors represent an evolutionary change in conventional computing as well as setting the new trend for high performance computing. Chip-level multiprocessing architectures with a large number of cores continue to offer dramatically increased performance and power savings. Energy efficiency and scalability in performance have become more important to many enterprises and play an important role in a cluster environment as well. This paper describes how much we may expect from cluster systems in terms of performance and power saving when the installation is based on servers built on the Intel Xeon Quad-Core and Intel Xeon Dual-Core processor families.
Keywords: HPC, multi-core processors, dual-core processors, quad-core processors, parallel processing, benchmarks.
1 Introduction
Since the launch of the first microprocessors based on Intel Core Microarchitecture in June 2006, the high-performance computing (HPC) community has had an interesting choice. The new Xeon processor, based on this energy-efficient philosophy and a performance-driven architecture, not only delivers record performance but also significantly reduces power consumption. The HPC industry is becoming increasingly aware of the impact that power consumption has on the total cost of an HPC installation. The operational cost of running a cluster and cooling it in a huge server room, or of maintaining specialized buildings to prevent system failure, increases as server power consumption rises. New metrics of HPC installation success are no longer focused on pure performance alone, but rather on delivering a system which provides leadership in both raw performance and performance per watt. The new generation of Intel Xeon Dual-Core and Xeon Quad-Core processors is focused on providing leadership in both aspects, and both benefit a broad spectrum of applications including HPC-optimized code.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 20–28, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Performance and Performance-Per-Watt Consideration
True performance is a combination of both clock frequency and Instructions Per Clock (IPC), so performance can be improved by increasing either frequency or IPC. The frequency is a function of both the manufacturing process and the microarchitecture. With our existing 65 nm CMOS process technology and a microarchitecture optimized for high frequency (e.g. a long pipeline design) such as NetBurst, we can reach at most 3.8 GHz today. Unfortunately, a high clock rate has implications for power consumption. The NetBurst-based processors available today reach a maximum speed of 3.8 GHz with a thermal guideline of 115 W. Dealing with such a thermal budget is not an easy task. It would be wrong to assume that a new process technology (45 nm CMOS) will change the situation dramatically. Leakage power limits frequency scaling (Figure 1), and it is the most important constraint on increasing frequency.
Fig. 1. Leakage Power (% of total) vs. process technology
If frequency cannot easily be increased and thermal management is already an issue, then we need to focus on increasing instructions per clock while fitting within an acceptable thermal envelope. Pure performance is important, but we always need to consider the implications for power when measuring the performance of a system. More and more, we look for the best ratio of performance per watt. Power consumption is proportional to the dynamic capacitance, to the square of the voltage supplied to the transistors and I/O buffers, and to the frequency at which the transistors and signals switch, so we can express it as:
\[ \text{power} = (\text{dynamic capacitance}) \times \text{voltage}^{2} \times \text{frequency} \tag{1} \]
Taking into account performance and power equations, CPU designers need to balance IPC efficiency from one side and voltage and frequency from the other to offer a compromise of performance and power efficiency of the processor.
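The trade-off described above can be made concrete with a small sketch of equation (1) combined with the relation performance = frequency × IPC. The two "designs" and all their numeric values (capacitance, voltage, frequency, IPC) are hypothetical and chosen only for illustration; they are not measurements from this paper.

```python
# Illustrative sketch of equation (1) and of performance = frequency x IPC.
# All numeric values below are hypothetical, not data from the paper.

def dynamic_power(capacitance_f, voltage_v, frequency_hz):
    """Dynamic power in watts: C * V^2 * f (equation 1)."""
    return capacitance_f * voltage_v ** 2 * frequency_hz


def performance(frequency_hz, ipc):
    """Instructions per second: clock frequency times instructions per clock."""
    return frequency_hz * ipc


def perf_per_watt(capacitance_f, voltage_v, frequency_hz, ipc):
    """Instructions per second delivered for each watt of dynamic power."""
    power = dynamic_power(capacitance_f, voltage_v, frequency_hz)
    return performance(frequency_hz, ipc) / power


# Hypothetical design A: high clock and voltage, low IPC.
design_a = dict(capacitance_f=1e-8, voltage_v=1.4, frequency_hz=3.8e9, ipc=1.0)
# Hypothetical design B: lower clock and voltage, higher IPC.
design_b = dict(capacitance_f=1e-8, voltage_v=1.2, frequency_hz=2.4e9, ipc=2.0)

for name, d in (("A", design_a), ("B", design_b)):
    power = dynamic_power(d["capacitance_f"], d["voltage_v"], d["frequency_hz"])
    perf = performance(d["frequency_hz"], d["ipc"])
    print(f"design {name}: {power:.2f} W, {perf:.2e} instr/s, "
          f"{perf_per_watt(**d):.2e} instr/s per W")
```

In this simplified model the frequency cancels out of the performance-per-watt ratio, which leaves IPC, capacitance and voltage as the levers. That is one way to read the balance between IPC on one side and voltage and frequency on the other.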
3 Multi-core Processors
In addition to the methods and considerations discussed so far, there is one more way to build a high performance system. Dual-core and multi-core processor systems are going to change the dynamics of the market and enable innovative new designs that deliver high performance with optimized power characteristics. They drive multithreading and parallelism at a level higher than the instruction level, and bring it to mainstream computing on a massive scale. From the operating system (OS) level they look like a symmetric multiprocessor (SMP) system, but they bring many more advantages than the typical dual- or multi-processor systems we know from classic server architecture. Multi-core processing is a long-term strategy for Intel that began more than a decade ago. Intel has more than 15 multi-core processor projects underway, and we are on track to deliver multi-core processors in high volume across multiple platform families. Intel's multi-core architecture will possibly feature dozens or even hundreds of processor cores on a single die. In addition to general-purpose cores, Intel multi-core processors will eventually include specialized cores for processing graphics, speech recognition algorithms, communication protocols, and more. Many new and significant innovations designed to optimize power, performance, and scalability are implemented in the new multi-core processors. How these innovations are reflected in overall system performance and in the performance-per-watt ratio is described below. In the testing environment we measured single-system performance and performance per watt, as well as a cluster configuration of the kind found in a typical HPC workload.
4 Selected Benchmarks and Platforms Configuration Details
The benchmarks selected for the testing environment are Fluent, LS-DYNA, Amber, Star-CD and LINPACK. These application and synthetic benchmarks represent a broad spectrum of HPC workloads and are a typical test suite for this class of calculation.
Fluent is a commercial engineering application used to model computational fluid dynamics. The benchmark consists of 9 standard workloads organized into small, medium and large models. These comparisons use all but the largest of the models, which does not fit into the 8GB of memory available on the platforms. The Rating, the default Fluent metric, was used to calculate the ratio of the platforms by taking a geometric mean of the 8 workload ratings measured.
LS-DYNA is a commercial engineering application used in finite element analysis such as car collision simulation. The workload used in these comparisons is called 3 Vehicle Collision and is publicly available from http://www.topcrunch.org/. The metric for the benchmark is elapsed time in seconds.
Amber is a package of molecular simulation programs. The workload measures the number of problems solved per day (PS) using eight standard molecular dynamics simulations. See http://amber.ch.ic.ac.uk/amber8.bench1.html for more information.
Star-CD is a suite of test cases selected to demonstrate the versatility and robustness of STAR-CD in computational fluid dynamics solutions. The metric produced is elapsed seconds (wall clock) converted to jobs per day. For more information go to http://www.cd-adapco.com/products/STAR-CD.
LINPACK is a floating-point benchmark that solves a dense system of linear equations in parallel. The metric produced is Giga-FLOPS, or billions of floating-point operations per second. The benchmark is used to rank the world's fastest computers; the ranking is published at http://www.top500.org/.
Performance was measured using these benchmarks, comparing system configurations of the Quad-Core Intel Xeon Processor X3220 (2.40 GHz, 8 MB L2 cache, 1066 MHz FSB) to the Intel Xeon processor 3070 (2.67 GHz, 4 MB L2 cache, 1066 MHz FSB) and to the Intel Pentium D processor 950 (3.40 GHz, 2 MB L2 cache, 800 MHz FSB).
Configuration details:
– Quad-Core Intel Xeon Processor X3220 based platform: Dell PowerEdge 860 using Quad-Core Intel Xeon processor X3220 (2.40 GHz, 8 MB L2 cache, 1066 MHz system bus), 4 x 1GB (dual-ranked) ECC 533MHz DDR2 SDRAM; Red Hat Enterprise AS Linux 4, Update 3, EM64T, TR673.
– Intel Xeon Processor 3070 based platform: Intel preproduction customer reference board Whitney A1 BIOS EXTWM212 with Intel Xeon processor 3070 (2.67 GHz with 4M L2 Cache, 1066 MHz system bus), 4 x 1GB (dual-ranked) ECC 533MHz DDR2 SDRAM; HW Prefetch Enabled; Red Hat Enterprise AS Linux 4, Update 3, EM64T, TR673.
– Intel Pentium D Processor 950 based platform: Intel SR1475NH1-E E7230 chipset server with Intel Pentium D Processor 950 (3.40 GHz with 2x 2M L2 Cache, 800 MHz system bus), 4GB (4x1GB) 533MHz DDR2; Red Hat Enterprise AS Linux 4, Update 3, EM64T, TR626.
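The metrics named in the benchmark descriptions above are simple to derive from raw results. The sketch below shows one possible way to compute them: a platform ratio from the geometric mean of per-workload ratings (as described for Fluent), jobs per day from elapsed wall-clock seconds (as for Star-CD), and Giga-FLOPS from an operation count and a run time (as for LINPACK). The function names and the sample numbers are assumptions made for illustration, not part of any benchmark suite.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def geometric_mean(values):
    """Geometric mean of positive numbers, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def platform_ratio(ratings_a, ratings_b):
    """Ratio of platform A to platform B from per-workload ratings."""
    return geometric_mean(ratings_a) / geometric_mean(ratings_b)


def jobs_per_day(elapsed_seconds):
    """Convert the elapsed time of one job into jobs per day."""
    return SECONDS_PER_DAY / elapsed_seconds


def gflops(operation_count, elapsed_seconds):
    """Billions of floating-point operations per second."""
    return operation_count / elapsed_seconds / 1e9


# Hypothetical ratings for two platforms over the same workloads.
ratings_quad = [220.0, 310.0, 150.0, 95.0]
ratings_dual = [110.0, 170.0, 90.0, 50.0]
print(f"platform ratio: {platform_ratio(ratings_quad, ratings_dual):.2f}")
print(f"jobs per day for a 1800 s job: {jobs_per_day(1800.0):.1f}")
print(f"GFLOPS for 6e12 operations in 300 s: {gflops(6e12, 300.0):.1f}")
```

The geometric mean is used rather than the arithmetic mean so that no single large workload dominates the combined rating.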
5 Comparing Single Processor Performance
In this section we focus on single-processor performance for different types of CPU microarchitecture operating under typical HPC workloads. We compared the NetBurst-based dual-core Intel Pentium D Processor 950 with CPUs based on Intel Core Microarchitecture, such as the Intel Xeon processor 3070. We also observe how the new microarchitecture accelerates performance in the HPC environment and how different dual-core products scale. We may also see how the Quad-Core Intel Xeon processor X3220 benefits HPC applications compared to dual-core CPUs.
Fig. 2. LINPACK: Dense Floating-Point Operations
Using LINPACK HPL we see a 50% performance improvement between the system based on the NetBurst CPU and the one based on Intel Core Microarchitecture. Both products are dual-core, but the Intel Xeon 3070 significantly improves floating-point calculation by utilizing 128-bit SSE instructions, and the result is clearly visible. Quad-core performance is even more spectacular: we see a 250% performance improvement versus the Intel Pentium D Processor 950 and 66% versus the Intel Xeon 3070. This shows great scaling for applications that solve a dense system of linear equations in parallel. Molecular simulations also show great scaling.
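As a quick arithmetic check of how the LINPACK percentages quoted above compose, the sketch below chains the two relative improvements. It assumes that "50%" and "66%" denote improvements over the respective baseline (factors of 1.5 and 1.66); this reading is an assumption, not something the text states explicitly.

```python
def improvement_to_factor(percent):
    """Turn 'X% improvement over baseline' into a multiplicative factor."""
    return 1.0 + percent / 100.0


# Normalise the Pentium D 950 result to 1.0 (arbitrary unit).
pentium_d_950 = 1.0
# Xeon 3070: 50% improvement over the Pentium D 950.
xeon_3070 = pentium_d_950 * improvement_to_factor(50)
# Xeon X3220: 66% improvement over the Xeon 3070.
xeon_x3220 = xeon_3070 * improvement_to_factor(66)

print(f"Xeon 3070  vs Pentium D 950: {xeon_3070 / pentium_d_950:.2f}x")
print(f"Xeon X3220 vs Pentium D 950: {xeon_x3220 / pentium_d_950:.2f}x")
```

Under this reading the quad-core result is about 2.5 times the Pentium D 950 result, so the quoted 250% figure is consistent with the other two numbers if it is read as 2.5 times the baseline rather than as 250% on top of it.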
Fig. 3. Amber: Molecular Modeling Application
Fig. 4. Fluent: Computational Fluid Dynamics Application
Fig. 5. Star-CD: Computational Fluid Dynamics Application
Amber shows almost double the performance when we compare the dual-core and quad-core products. As Figure 3 shows, Intel Core Microarchitecture brings a 30% improvement over the NetBurst-based product. In computational fluid dynamics modeling, the quad-core product can deliver between 80% and 210% more than a dual-core based product, and the new Intel Core Microarchitecture brings a 40% improvement when comparing the NetBurst dual-core to the Intel Core Microarchitecture based dual-core. Automotive HPC departments may find the multi-core solution very useful as well. LS-Dyna, a commercial engineering application used in finite element analysis such as car collision simulation, shows more than a 60% performance improvement on the quad-core versus the Intel Pentium D 950, and the Intel Xeon 3070 gives a 20% performance increase over previous-generation dual-core products.
Fig. 6. LS-Dyna: Finite Element Analysis for Crash Simulations
6 Cluster Performance Study
The cluster configuration is built from four typical two-processor nodes connected with Gigabit Ethernet. The cluster was installed using the Rocks Cluster Package, and the Intel Xeon Processor 5150 was selected to populate the nodes. We tested the scalability of the system running 4, 8, and 16 cores (threads) on one, two, and four servers respectively.
Fig. 7. Cluster speedup across benchmarks
Figure 7 shows the results of running the HPL, Fluent, Amber and Star-CD benchmarks. As we can see, Fluent and HPL scale very well because of the nature of those benchmarks. Both are highly CPU-intensive, and the type of interconnect used does not play an important role. Star-CD is a more communication-intensive benchmark, and with Gigabit Ethernet as the interconnect we see less scalability. Changing the interconnect to a faster interface will definitely deliver better scaling. So for compute-intensive benchmarks we see great scalability, and dual-core processors are also a good solution for communication-intensive benchmarks because they reduce the number of nodes. Degradation inside a node is less noticeable than degradation across connected nodes. The gap between one and two nodes is smaller than the gap between two and four nodes, where the interconnect plays a more important role.
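The scaling behaviour discussed above is commonly quantified as speedup and parallel efficiency relative to the smallest configuration. The following sketch shows that calculation for the 4, 8 and 16 core runs; the elapsed times are hypothetical placeholders, not the measurements behind Figure 7.

```python
def speedup(base_time, time):
    """How many times faster a run is than the baseline run."""
    return base_time / time


def parallel_efficiency(base_cores, base_time, cores, time):
    """Speedup divided by the increase in core count (1.0 is ideal scaling)."""
    return speedup(base_time, time) / (cores / base_cores)


# Hypothetical (cores, elapsed seconds) pairs for one benchmark:
# 4 cores on one node, 8 cores on two nodes, 16 cores on four nodes.
runs = [(4, 1000.0), (8, 520.0), (16, 300.0)]

base_cores, base_time = runs[0]
for cores, elapsed in runs:
    s = speedup(base_time, elapsed)
    e = parallel_efficiency(base_cores, base_time, cores, elapsed)
    print(f"{cores:2d} cores: speedup {s:.2f}, efficiency {e:.2f}")
```

With numbers of this shape, efficiency drops more between two and four nodes than between one and two, which is the pattern the text attributes to the growing role of the interconnect.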
7 Performance/Watt Consideration
Intel Core Microarchitecture was designed from the outset for energy efficiency. The quad-core product delivers better performance than the dual-core product on the same platform, so in general problems are solved more quickly and money is saved on energy. Given that the two products differ only slightly in thermal design power (TDP), the results can be anticipated. Figure 8 shows how the Quad-Core Intel Xeon Processor provides best-in-class performance and energy efficiency in a single-socket configuration under typical HPC benchmark workloads.
Fig. 8. Performance per system watt - Dual Core and Quad Core comparison
8 Conclusion
With the release of the first dual-core processor we have entered a new era of processor architecture. Dual-core and multi-core processors will become the standard for delivering greater performance, improved performance per watt, and new capabilities. New multi-core products provide a tremendous advantage in the HPC environment. We see a significant performance advantage, ranging from 20% to 210% in single-CPU system performance, and in addition observe a significant improvement in performance per watt. In a cluster environment we see great scaling compared to clusters built on previous generations of processor architecture, resulting in a reduction in the number of nodes necessary to achieve the intended performance level. Multi-core processors do more work per clock cycle and can be designed to operate at lower frequencies than their single-core counterparts. All of this drives significant improvements in user experience in the HPC environment and at the same time extends Moore's Law well into the future.
Skip Ring Topology in FAST Failure Detection Service
Jacek Kobusiński, Filip Gorski, and Stanisław Stempin
Institute of Computing Science, Poznań University of Technology, Poland
{jkobusinski,fgorski}@cs.put.poznan.pl,
[email protected]
Abstract. This paper addresses the problem of communication among loosely coupled groups of nodes in distributed systems. We describe a novel proposal of logical communication topology based on skip list data structure. We enhance this structure to make it more resilient to failures. Its good self-stabilization characteristics are shown through extensive simulation experiments. We present this new concept in the context of our failure detection service, where we use it at a local communication level.
Keywords: failure detection, distributed systems, fault tolerance, probabilistic communication.
1 Introduction
The use of distributed systems as a computing platform constitutes a very promising research and business area due to their availability, economic aspects and scalability. The intense development of Grids, P2P networks, clusters and high-speed network topologies has made it possible to allocate an enormous amount of resources to distributed applications at a reasonably low cost. Their level of parallelism can improve the performance of existing applications and raise the processing power of distributed systems to a new, higher level. Unfortunately, those systems are failure prone, and the probability that a failure occurs during computations is higher than in traditional systems. Moreover, failures of the underlying communication network have a great impact on system behavior and condition. There are many techniques to prolong the time in which a component will work correctly, or even to avoid failure altogether by introducing new rigorous technology procedures. These efforts are often very expensive and cannot ensure the full reliability of the system. Therefore, to overcome the problem, one should construct a fault-tolerant mechanism to detect unavoidable failures and make their effects transparent to a user.
Fault tolerance can be achieved by introducing a certain degree of redundancy, either in time or in space. A common approach consists in replicating vulnerable components of the system. Replication can guarantee, to a large extent, continuous availability of resources and continuity of work, but it must be complemented with a failure detection mechanism.
This work was supported by France Telecom, under project "Brain" No. 21/06.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 29–38, 2008. © Springer-Verlag Berlin Heidelberg 2008
Chandra and Toueg presented the concept of failure detectors [1] as an abstract mechanism that supports the asynchronous system model. To illustrate the basic idea behind this mechanism, consider two processes p1 and p2. Process p1 uses its failure detector module to evaluate whether process p2 is crashed or alive. If the failure detector's response is crashed, we say that process p1 suspects p2. A failure detector that makes no mistakes and eventually suspects all failed processes is called a perfect failure detector. However, these requirements are very difficult to fulfill in practice. So one can weaken the liveness (completeness) or safety (accuracy) property to construct an unreliable failure detector [1], which can also be used to solve agreement problems in an asynchronous system. In [2] the authors propose a new type of failure detector, called an accrual failure detector. It provides a non-negative accrual value representing the current suspicion level of the monitored process, which can be further interpreted by the application. A general survey of the above-mentioned failure detectors can be found in [3,4,5].
The failure detection mechanism [6,7,8,9,10,11,12] can be built into a distributed application or implemented as a service. The first approach allows the detection parameters to be adjusted to particular needs. However, the mechanism must be implemented as an integral component of the application. When more applications want to use such a mechanism, there is no possibility of sharing it. Besides, resources such as CPU power or link bandwidth allocated by this mechanism are used inefficiently. Considering these drawbacks, a failure detection service seems to be a good alternative. In this approach one can create an environment for end users with common services that can be shared among different users working concurrently.
In our recent research we focus on building flexible, adaptable, scalable and timely (in short FAST) failure detection service, which is composed of the following modules: failure detection, inter-group communication and local communication. While the inter-group communication module is responsible for efficient message dissemination in large scale environment, the local communication module has to provide a fast, scalable and failure resilient communication mechanism, as well as support for failure detection and low maintenance cost. Unfortunately, the existing communication topologies do not fulfill these requirements. Thus, we propose a novel approach called skip ring, which we present in the context of FAST. The remaining part of the paper is structured in the following way. In section 2, we define the system model. Section 3 describes the architecture of the FAST failure detection service. Description and experimental results concerning the skip ring topology are presented in section 4. Finally, section 5 brings our concluding remarks.
2 System Model
Throughout this paper, a distributed system is considered. For reasons of complexity and overhead, we model the system as consisting of processes and computers (hosts). We do not include a network, computer elements or low-level software abstractions as monitored components, because the primary objective is not to construct a service that will diagnose the cause of a failure but to enable the construction of a reliable distributed application. So the system consists of a set of nodes that are subject to crash failures and are connected by quasi-reliable communication channels. Quasi-reliability means that there is no message corruption, no message loss, and no creation of spurious messages. We assume a crash-stop failure model of nodes and consider the system to be asynchronous, because there exists no bound either on communication delays or on process speed. Moreover, according to [13], we assume that the network performs in a synchronous way "most of the time", i.e. long stable periods, where the system behaves like a synchronous one, alternate with short instability periods, where no timing guarantees can be made. This phenomenon, called partial synchrony [14], justifies the use of "weaker" failure detectors because their properties are eventually satisfied.
3 FAST Failure Detection Service
The architecture of the FAST failure detection service is modular, flexible and open. It allows any single component to be swapped without changing the whole service (Figure 1), as long as the requirements of the other components are met. This feature can be used to compare different failure detection methods and communication schemes.
Fig. 1. Fast service architecture
Fig. 2. Inter-group communication pattern
Considering the scalability issues, we assume a two-level structure. To reduce the number of messages exchanged over internetwork links and to speed up the process of dissemination, we divide the entire system into groups of hosts. In general, one can imagine the simplest architecture without any hierarchy, where all modules are replaced by a single module. On the other hand, one can extend this architecture and add further hierarchy levels by adding new modules. In the following subsections we briefly present general assumptions and ideas concerning the three components of our service: failure detection, inter-group communication and local communication.
3.1 Failure Detection
The failure detection protocol is the key component of the designed service. It uses all the other components to disseminate acquired information about node failures. A node can learn about the failure of another node in two ways: by detecting it by itself or by getting this information from others. While the former method is faster, it can incur high cost and scale poorly. The latter method is based on distributing the monitoring responsibility among nodes. The drawback of this dispersal is the latency arising from information dissemination between the monitor and other nodes. As scalability issues are more important, we decided to disperse the monitoring responsibility. Another important aspect is the type of information maintained and shared by failure detectors. It can be either positive (correct nodes) or negative (suspected nodes). Regardless of the method, the issue of incorrect suspicions has to be considered, as failure detectors are unreliable. The former approach handles this issue transparently, while the latter should be enhanced by a routine that sends a single message carrying positive information once a false suspicion is discovered. On the other hand, it is reasonable to expect a majority of nodes to be correct most of the time, and therefore the latter approach consumes less bandwidth and memory. Taking these facts into consideration, we decided to apply the negative information scenario. To increase the failure resilience of the proposed detection mechanism, we impose a lower limit of m on the number of monitors for each node. In addition, in order to avoid creating a single point of failure, we are interested in a more or less equal distribution of the monitoring responsibility among all the nodes. Therefore, we restrict the number of monitored nodes for each failure detector to a multiple of m. Finally, to satisfy the various requirements of a user's application, we decided to use the accrual failure detector [2].
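To illustrate the general idea of an accrual failure detector, the sketch below keeps a window of heartbeat inter-arrival times and reports a non-negative suspicion level that grows the longer a heartbeat is overdue. The suspicion function used here (time since the last heartbeat divided by the mean inter-arrival time) is a deliberately simple assumption chosen for illustration; it is not the function defined in [2] nor the one used in the FAST service, and the class and method names are hypothetical.

```python
from collections import deque


class SimpleAccrualDetector:
    """Toy accrual detector: suspicion grows while heartbeats are missing."""

    def __init__(self, window=100):
        # Recent heartbeat inter-arrival times (sliding window).
        self.intervals = deque(maxlen=window)
        self.last_arrival = None

    def heartbeat(self, now):
        """Record a heartbeat received from the monitored node at time `now`."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def suspicion(self, now):
        """Non-negative suspicion level; higher means a crash is more likely."""
        if self.last_arrival is None or not self.intervals:
            return 0.0  # not enough history to suspect anything
        mean_interval = sum(self.intervals) / len(self.intervals)
        if mean_interval <= 0:
            return 0.0
        return max(0.0, (now - self.last_arrival) / mean_interval)


detector = SimpleAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):   # heartbeats arriving once per time unit
    detector.heartbeat(t)

# The application, not the detector, decides which level means "suspected".
for t in (3.5, 5.0, 8.0):
    print(f"t={t}: suspicion level {detector.suspicion(t):.1f}")
```

Leaving the threshold to the application is the point of the accrual approach: one application may react at a low suspicion level while another waits for a much higher one.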
Instead of a binary (correct/suspected) decision, the accrual failure detector provides a non-negative suspicion value, which can be further interpreted by a user's application. The higher the value it returns, the more probable it is that the node has crashed. This kind of failure detection requires that monitoring relationships be stable and not change frequently.
3.2 Inter-group Communication
At the inter-group communication level, nodes are organized into groups based on some metric. In our case, it is the similarity of IP addresses, which can be replaced with a more complex method if required. Each group is assigned a unique identifier, the GID. When a group's size reaches a value from an a priori defined range, a split process occurs. As a result, two disjoint groups are created from its members. Their GIDs are unambiguously derived from the GID of the original group. We expect the communication at this level to be scalable, fast and failure resilient. Despite its speed and reliability, all-to-all communication does not guarantee scalability. On the other hand, the hierarchical approach, which guarantees scalability, is difficult to maintain due to its lack of failure resilience. With regard to the above facts, we decided to use gossiping, a randomized communication pattern that fulfills our requirements. We have evaluated many different gossiping patterns based on rumor age, timeout, probability and feedback mechanism. Moreover, we decided to use a cooperative bidirectional push-pull approach, which makes the dissemination faster. Finally, we modified the basic algorithm to limit the number of messages sent while keeping the dissemination efficiency high. This was possible due to a combination of gossip aggregation and the different characteristics of the push-pull approach. While gossiping makes our service more resilient to communication failures and easily adaptable to dynamic changes of membership or network topology, it cannot guarantee deterministic broadcast to all nodes. To solve this problem, we additionally introduced a simple ring communication pattern that is used only occasionally. The maintenance of this structure induces almost no extra cost, since it is done during the exchange of gossip messages. Each node maintains randomly chosen addresses of two other nodes that are members of adjacent groups (one per group). As the nodes of one group have, with high probability, different contact addresses, there are many parallel connections between those groups. This is very beneficial for self-stabilization and fault tolerance. The logical scheme of the described inter-group communication is presented in Figure 2, where each cloud represents a group, arrows depict gossiping, and dotted arcs show the logical ring structure.
3.3 Local Communication
At the local communication level, we handle the issues of communication between the nodes constituting a single autonomous group, which is self-sufficient, i.e. it does not rely on any other components. The term group refers to a set of loosely coupled nodes, with a limited size and no guarantees about a consistent group view or message delivery. We require that nodes establish some long-term connections for the reasons mentioned in section 3.1. We do not assume complete knowledge of group membership or group size at any single node. However, it should be possible for at least some nodes to estimate that size in order to perform the split procedure mentioned above (section 3.2). In order to disseminate information among all the nodes in the group, we considered communication schemes similar to those discussed in the context of inter-group communication (section 3.2). We think that the all-to-all pattern is not acceptable due to its lack of scalability. We also claim that gossiping does not satisfy our long-term connection requirement. Thus, we decided to use a hierarchical structure, which offers acceptable message propagation delays. As existing solutions incur high cost or require stronger guarantees about group membership or centralized knowledge, we propose a novel logical topology called skip ring, which is described in detail in section 4.
Fig. 3. Skip list
Fig. 4. Skip ring
4 Skip Ring Topology
Our service uses the skip ring topology, which is based on the skip list structure proposed in [15]. A skip list is similar to a balanced tree in its good lookup performance (O(log n)), but it does not suffer from high costs of inserting and deleting items (Figure 3). This is achieved by constructing the ordered list from items of different levels, chosen randomly by each element independently. Each item starts at level 1 and repeatedly tries to increase its level with probability p (we assume p = 1/2) until the first unsuccessful try. It is easy to see that the levels are distributed in the following proportion: 50% are of level 1, 25% are of level 2, 12.5% are of level 3 and so on. An item that has l forward pointers is called a level l item. The pointer at level j (1 ≤ j ≤ l) points to the next element in the list of level j or higher. As mentioned earlier, we want to use the skip list as a communication pattern. The first modification we made was to remove the head and tail of the list and make it cyclic. This topology is used as a communication pattern at the group level, where items represent the group members and pointers stand for network links. This structure is fragile, as nodes can fail. To increase its failure resilience, it is enhanced with extra pointers to previous items. The final structure, called skip ring, augmented with those backward pointers, is presented in Figure 4. For clarity, the newly added pointers are shown for only one selected element. The selection strategy and the number of backward pointers are very important in the context of self-stabilization of our structure. We carried out a set of experiments, and their results show that the resilience to failure can be increased by choosing more than two (k > 2) backward pointers. The more pointers we use, the more resilient to failure the skip ring structure is (Figures 5d, 5e and 5f). On the other hand, the cost of maintenance grows as the number of pointers increases.
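The construction described above can be illustrated with a minimal Python sketch. It is not the FAST implementation: the function names, the `max_level` cap and the representation of the ring as a list of levels indexed by ring position are assumptions made for illustration, and backward pointers are left out.

```python
import random


def random_level(p=0.5, max_level=32, rng=random):
    """Start at level 1 and keep promoting with probability p
    until the first unsuccessful try (max_level is an assumed safety cap)."""
    level = 1
    while level < max_level and rng.random() < p:
        level += 1
    return level


def build_skip_ring(levels):
    """Compute the forward pointers of a cyclic skip list.

    levels[i] is the level of the node at ring position i.
    forward[i][j] is the position of the next node clockwise whose level
    is at least j + 1, so forward[i] has exactly levels[i] entries.
    """
    n = len(levels)
    forward = []
    for i in range(n):
        pointers = []
        for lvl in range(1, levels[i] + 1):
            j = (i + 1) % n
            # Walk clockwise until a node of sufficient level is found;
            # the walk stops at i itself if no other node qualifies.
            while levels[j] < lvl:
                j = (j + 1) % n
            pointers.append(j)
        forward.append(pointers)
    return forward
```

For a ring of four nodes with levels `[1, 3, 1, 2]`, `build_skip_ring` yields `[[1], [2, 3, 1], [3], [0, 1]]`: the only level 3 node points to itself at level 3, because removing the head and tail leaves no other node of that level on the ring.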
Our theoretical analysis shows that k should be $O(\log_{1/f} n)$, where n is the size of the group and f is the number of failed nodes. These theoretical results have been confirmed by empirical data obtained from simulations (Figures 5e and 5f), where we present the failure resilience of the skip ring for groups of different sizes, assuming the number of backward pointers scales according to the theoretical results. We have considered two alternative strategies for selecting backward nodes: random (RND) and deterministic (LAST). In the former, the node randomly chooses k other nodes (RND), while in the latter, the node selects the
Skip Ring Topology in FAST Failure Detection Service
[Figure 5 plots omitted. Panels: (a) rnd k=3 m=25%; (b) last k=3 m=25%; (c) k=3; (d) m=25%; (e) last m=25%; (f) last m=25% f/n=const. Vertical axis in every panel: % of stabilized runs. Horizontal axes: buffer size of known nodes as % of total nodes, concurrently failed nodes as % of total nodes, or number of backward pointers.]
Fig. 5. Skip ring self-stabilization experimental results
closest k nodes behind (LAST). In both scenarios, the node chooses candidates from all locally known nodes. The simulations show that the second scenario is much better with respect to the self-stabilization property. All the experiments were conducted for groups of size 128 (n = 128), unless specified otherwise. We measured the percentage of stabilized runs for different parameter configurations. By “stabilized run” we mean a run in which the structure overcomes the node crashes and reconfigures properly.
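The two selection strategies can be sketched as follows. This is only an illustration under stated assumptions: nodes are identified by their ring position (0 to n-1), "behind" is read as counter-clockwise distance on the ring, and the function name and signature are hypothetical rather than taken from FAST.

```python
import random


def select_backward(self_pos, known, k, n, strategy, rng=random):
    """Pick up to k backward-pointer targets from the locally known nodes.

    self_pos: ring position of this node (0..n-1)
    known:    ring positions of locally known nodes (the buffer of size m)
    strategy: "RND" picks k known nodes at random,
              "LAST" picks the k known nodes closest behind self_pos.
    """
    candidates = sorted(set(known) - {self_pos})
    if strategy == "RND":
        return rng.sample(candidates, min(k, len(candidates)))
    if strategy == "LAST":
        # Distance measured backwards (counter-clockwise) around the ring.
        candidates.sort(key=lambda pos: (self_pos - pos) % n)
        return candidates[:k]
    raise ValueError("unknown strategy: " + strategy)
```

With `self_pos=5`, `known=[1, 3, 4, 7, 9]`, `k=2` and `n=10`, the LAST strategy returns `[4, 3]`, the two known nodes immediately behind position 5, whereas RND may return any two of the five known nodes.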
Figure 5a presents the percentage of stabilized runs for the RND scenario as the size of the buffer of locally known nodes (m) changes, for different numbers of concurrent failures (f) and a constant number of backward pointers (k = 3). It is easy to notice a strong relation between the failure rate f and the number of successfully stabilized runs: stabilization becomes more difficult as f increases. For a given f, the situation can be improved by extending the amount of information about other members of the ring gathered by each node. Figure 5b shows the corresponding results for the LAST scenario. In this case, the relation between the failure rate f and the number of successfully stabilized runs is hard to observe, which is a consequence of the better fault resilience of such a skip ring. We can also see that the size of the buffer has little influence on successful stabilization, and for most configurations it would be unreasonable to keep it high. Comparing the two plots, it is evident that the LAST scenario behaves better than the RND scenario for any given f and buffer size. A rational choice of the scenario can clearly improve the self-stabilization property of the system while decreasing its memory requirements. We believe that there is room for further improvement in this area. In Figure 5c we present the percentage of stabilized runs (k = 3) for both scenarios with different buffer sizes (25%, 50% and 100% for RND; 25% and 50% for LAST) as the percentage of concurrent failures changes. This plot demonstrates once more the advantage of the LAST scenario over RND. We can also see that LAST with m = 50% is comparable to RND with full knowledge of the ring. The improvements in stabilization from a larger buffer are less significant for the LAST scenario than for the RND scenario. We therefore think it is reasonable to set m to 25% for further analysis of the LAST scenario.
In Figure 5d we analyze the impact of backward pointers on skip ring stabilization when m = 25% and the failure rate changes. Not surprisingly, there is a strong impact for the LAST scenario: the more backward pointers, the easier it is to stabilize the ring. It is also noticeable that selecting a sophisticated scenario like LAST should go together with careful consideration of other parameter values, especially k. The data series for LAST with k = 1 shows almost no improvement over RND with k = 1. The following two plots highlight the relation between the success of stabilization and the number of backward pointers (k). The first of them (Figure 5e) presents the positive effect of increasing k for different sizes of the skip ring with the same percentage of concurrent failures and buffer size. It is evident that for constant f there exists a value k′ such that for k > k′ further stabilization improvements are insignificant. The same regularity can be observed in the next plot (Figure 5f); however, instead of variable ring sizes, we analyze various numbers of concurrent failures f. The shape of the plot again suggests that such a value k′ exists, and it is also clear that it depends on f. This is an empirical validation of our expectations derived from the theoretical analysis.
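One possible reading of the logarithmic bound on k can be made concrete with a short calculation. The sketch below is not the authors' analysis: it assumes that f denotes the fraction of failed nodes (0 < f < 1) and that each backward pointer targets a failed node independently with probability f. Under these assumptions a node loses all k backward pointers with probability f^k, and requiring n * f^k <= 1 gives k >= log_{1/f} n. The function name is hypothetical.

```python
def min_backward_pointers(n, f):
    """Smallest k with n * f**k <= 1, i.e. k >= log_{1/f}(n).

    Assumption (not from the paper): f is the fraction of failed nodes and
    each backward pointer hits a failed node independently with probability f.
    """
    if not 0.0 < f < 1.0:
        raise ValueError("f must be a fraction strictly between 0 and 1")
    k = 0
    # n * f**k bounds the expected number of nodes whose k backward
    # pointers all lead to failed nodes.
    while n * f ** k > 1.0:
        k += 1
    return k
```

Under these assumptions, a group of n = 128 nodes with a quarter of the nodes failed needs k = 4, and doubling n at a fixed failure fraction of one half adds exactly one pointer, which is the logarithmic growth the bound expresses.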
5 Conclusion
Our recent studies concentrate on FAST — a failure detection service. In the paper, we described its architecture and characterized its components briefly.
During the design and implementation of FAST we tried to merge some promising techniques presented recently in the literature and to overcome difficulties in finding an appropriate topology for local communication. Therefore, we proposed a novel probabilistic communication pattern based on the skip list: the skip ring. We showed how to transform the skip list, a data structure, into the skip ring, a communication topology. As the next step, we enhanced it with additional pointers to make it failure resilient. We also showed the results of theoretical and experimental studies of the proposed solution. They confirm that the proposed skip ring is a scalable and failure-resilient topology that provides fast message dissemination and incurs low maintenance cost. Moreover, it is designed to support failure detection. We believe that the skip ring is an interesting solution with very good characteristics, which make it suitable for a wide variety of distributed applications, including the local management component of FAST. As future work, we plan to extend our studies concerning efficient ring size estimation, improved scenarios of backward pointer selection and integration of an accrual failure detection mechanism into the skip ring.
References
1. Chandra, T.D., Toueg, S.: Unreliable failure detectors for reliable distributed systems. Journal of the ACM 43, 225–267 (1996)
2. Hayashibara, N., Défago, X., Yared, R., Katayama, T.: The ϕ accrual failure detector. In: SRDS, pp. 66–78 (2004)
3. Brzeziński, J., Kobusiński, J.: A survey of failure detector protocols. Foundations of Computing and Decision Sciences 28, 65–81 (2003)
4. Reynal, M.: A short introduction to failure detectors for asynchronous distributed systems. SIGACT News, 53–70 (2005)
5. Freiling, F., Guerraoui, R., Kouznetsov, P.: The failure detector abstraction. Technical Report TR 2006-003, Department for Mathematics and Computer Science, University of Mannheim (2006)
6. van Renesse, R., Minsky, Y., Hayden, M.: A gossip-based failure detection service. In: Proc. of the Int. Conf. on Distributed Systems Platforms and Open Distributed Processing, pp. 55–70 (1998)
7. Stelling, P., DeMatteis, C., Foster, I.T., Kesselman, C., Lee, C.A., von Laszewski, G.: A fault detection service for wide area distributed computations. Cluster Computing 2, 117–128 (1999)
8. Gupta, I., Chandra, T.D., Goldszmidt, G.S.: On scalable and efficient distributed failure detectors. In: Proc. of 20th Annual ACM Symp. on Principles of Distributed Computing, pp. 170–179. ACM Press, New York (2001)
9. Bertier, M., Marin, O., Sens, P.: Implementation and performance evaluation of an adaptable failure detector. In: Proc. of the Int. Conf. on Dependable Systems and Networks, Washington, DC, pp. 354–363 (2002)
10. Hayashibara, N., Cherif, A., Katayama, T.: Failure detectors for large-scale distributed systems. In: Proc. of the 1st Workshop on Self-Repairing and Self-Configurable Distributed Systems (RCDS), 21st IEEE Int’l Symp. on Reliable Distributed Systems (SRDS-21), Osaka, Japan, pp. 404–409 (2002)
11. Dunagan, J., Harvey, N.J.A., Jones, M.B., Kostić, D., Theimer, M., Wolman, A.: FUSE: Lightweight guaranteed distributed failure notification. In: Proc. of the 6th Symp. on Operating Systems Design and Implementation, pp. 151–166 (2004)
12. Horita, Y., Taura, K., Chikayama, T.: A scalable and efficient self-organizing failure detector for grid applications. In: Proc. of 6th Int. Workshop on Grid Computing, pp. 202–210 (2005)
13. Cristian, F., Fetzer, C.: The timed asynchronous distributed system model. IEEE Trans. on Parallel and Distributed Systems 10, 642–657 (1999)
14. Dwork, C., Lynch, N., Stockmeyer, L.: Consensus in the presence of partial synchrony. Journal of the ACM 35, 288–323 (1988)
15. Pugh, W.: Skip lists: a probabilistic alternative to balanced trees. Communications of the ACM 33, 668–676 (1990)
Inter-processor Communication Optimization in Dynamically Reconfigurable Embedded Parallel Systems
Eryk Laskowski (1) and Marek Tudruj (1,2)
(1) Institute of Computer Science PAS, 01-237 Warsaw, Ordona 21, Poland
(2) Polish-Japanese Institute of Information Technology, ul. Koszykowa 86, 02-008 Warsaw, Poland
{laskowsk,tudruj}@ipipan.waw.pl
Abstract. The paper concerns program design methods for a new kind of parallel embedded system in which the communication infrastructure is dynamically adaptable at run time to the needs of a particular application program. The new system architecture assumes a fairly large number of autonomous communication links in each executive processor. Inter-processor link connections are subject to dynamic reconfiguration according to a strategy elaborated at compile time, based on analysis of the application program graph. Automatic program structuring methods are used to define the structuring of reconfigurable processor link sets, which enables look-ahead connection reconfiguration that overlaps with current program execution, including data communication. Algorithms for the respective program task scheduling and dynamic program decomposition into sections executed with look-ahead created connections of subsets of processor links are presented. Results of simulation experiments with the structuring of parallel numerical programs of matrix multiplication are presented. The experiments compare the program structuring quality of look-ahead connection reconfiguration based on multiple crossbar switches with the quality of reconfiguration in a single crossbar switch with the use of multiple link subsets reconfigured in advance.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 39–48, 2008. © Springer-Verlag Berlin Heidelberg 2008
1 Introduction
In previous papers we have investigated a novel message-passing circuit-switching system architecture based on redundant communication resources [3, 5, 6]. This architecture involves a new program execution paradigm based on the look-ahead dynamic reconfiguration of inter-processor connections. It is based on preparing inter-processor connections in advance in spare (redundant) communication resources (e.g. additional crossbar switches). The experimental results have shown [5, 6] that look-ahead reconfiguration in redundant communication resources gives very good speedups, especially for fine-grain parallel programs. Look-ahead reconfiguration is particularly beneficial when applied in embedded parallel systems based on user-level communication [1]. The user-level
communication provides a very low software overhead (sometimes as low as 1 CPU cycle time, e.g. the DIMMnet network interface [2]) by accessing the inter-processor network directly from the user program without intervention from the operating system. This imposes tough requirements on the efficiency of inter-processor connection setting in high performance parallel embedded systems. It is interesting to check whether some ideas from the look-ahead reconfiguration paradigm can be exploited in standard message-passing parallel systems based on a single reconfigurable communication network. The lack of network hardware redundancy in standard systems directs our effort towards a better reconfiguration control approach, namely careful program graph structuring in a non-redundant communication infrastructure. In this paper, we propose a new strategy of inter-processor connection reconfiguration control for message-passing multiprocessor systems, where application programs are partitioned into sections executed with fixed inter-processor connections to reduce reconfiguration control time overheads. In [7], the first attempt to use the look-ahead reconfiguration paradigm in a standard message-passing system was presented. Unfortunately, the solution presented there was based on static partitioning of processor link sets, which gave disappointing speedups in comparison to the classical on-request approach. The paper presents how program execution control that originated in look-ahead reconfigurable systems can be applied in a classic circuit-switching on-request reconfigurable parallel system. Although the program execution paradigm of look-ahead reconfiguration is inherently based on hardware redundancy, it is possible to use specific program execution control based on partitioning an application program into sections in a standard hardware environment.
Thanks to this, we can expect to achieve speedup improvements without investing in a sophisticated, and thus expensive, system architecture. In the paper, we show that parallel program graph partitioning for optimized reconfiguration control is justified, which is confirmed by the results of experimental research. The paper consists of three parts. In the first part, the principles of inter-processor connection reconfiguration are discussed. In the second part, the main features of the applied graph partitioning algorithm are discussed. In the third part, experimental results on the efficiency of program execution based on dynamic reconfiguration are shown and discussed.
2 Look-Ahead Dynamically Reconfigurable Multiprocessor Systems
In the paper, we discuss multiprocessor systems with distributed memory and communication based on message passing. In such systems, which are reconfigurable through inter-processor connection switching, the connection between the sender processor link and the receiving processor link has to be created before a message can be sent. With dynamic on-request reconfiguration, each communication in an application program provokes a connection request directed to the
operating system. The system creates the requested connection and acknowledges the requesting processes. When the communication is completed, link release messages are sent to the operating system to enable new connections. The look-ahead dynamic connection reconfiguration paradigm is based on anticipated connection setting in some redundant communication resources provided in the system. Some subsets of the redundant resources are allocated to program execution and others to look-ahead connection setting. The redundant resources considered in this paper are multiple processor link sets and, for comparison purposes, link connection switches (crossbar switches, multistage connection networks). In look-ahead dynamically reconfigurable systems, an application program is divided into communication-disjoint sections. Inter-processor connections for the currently executed sections are created in advance in some connection switching devices. The connections do not change during execution of the current sections. In parallel with the execution of the current sections, a configuring control process sets the connections for the next sections in redundant connection switches. The partitioning of application programs into sections is performed automatically at compile time, based on an analysis of the application program graph.
2.1 Systems with Many Link Connection Switches
Creation of link connections in parallel with communication can be enabled by the use of several link connection switches (crossbars) for program execution and anticipated connection setting. The general structure of a look-ahead link connection reconfigurable system with several connection switches is shown in Fig. 1. When WPi executes a current program section using links connected in one connection switch, the global control subsystem (GCS) sets, in another switch, connections required for execution of next sections. GCS controls the link set config.
[Figure 1 diagram omitted. Labels: crossbar switches S1, S2, ..., SX with config. inputs; links; processor link set switch; worker processors WP1, WP2, ..., WPN; control; GCS; Communication Control Path; Synchronization Path.]
Fig. 1. Structure of the look-ahead reconfigurable system with multiple connection switches
switch, which is used to change connections between the links of WPi and the crossbar switches. When the connections for the next sections are ready and the links of WPi are no longer used in the current section, the link set switch connects the WPi links to the configured connection switch. We assume that the switching time of the link set switch is very small compared with the connection setting time in the crossbar switch, so new connections are made available to program execution in an almost time-transparent way. Depending on the strategy used for program execution, the link set switch can switch all or selected links of a worker processor subsystem. The control bus connects worker processor subsystems with the global control subsystem. It is used for transferring section completion (end of use of links) messages from WPi to GCS and section activation messages from GCS to WPi. The simplest implementation of the Communication Control Path is a bus, but a more efficient solution can use direct links (data lines) connecting worker processors with the GCS. Besides the Communication Control Path, the Synchronization Path is used in this system. This path is used to synchronize end-of-section reports coming from worker subsystems.
2.2 Systems with Multiple Processor Link Sets
Instead of multiple connection switches, it is possible to use multiple processor link sets connected to a single link connection switch (e.g. a single crossbar). In such a configuration, the links of each worker processor subsystem are divided into two subsets: one used for the execution of current program sections and the other used for reconfiguration. The complete structure of the look-ahead reconfigurable system with partitioned processor link sets is shown in Fig. 2. This solution requires less hardware, but it requires processors (WPi) with a larger link connection switch controlled by the global control subsystem (GCS). The worker processor subsystems (WPi) are controlled by the GCS using a communication control path. It can be a bus, which transfers the end-of-link-use and section activation messages.
[Figure 2 diagram omitted. Labels: crossbar switch S with config. input; links of worker processors WP1, WP2, ..., WPN, each split into c.l. and w.l.; GCS; control; Communication Control Path. Legend: w.l. = working links, c.l. = configured links.]
Fig. 2. Look-ahead reconfigurable system with partitioned processor link sets
The partitioning of processor links into the subset of working links and the subset of configured links can be done using two strategies. Either the set of all processor links is statically divided into two equal-size subsets [7], or it is dynamically divided into two subsets, usually of unequal size, depending on the current program needs. In the paper, we examine in simulation experiments the performance obtained with the dynamic link partitioning strategy and compare it with the strategy based on statically dividing the processor links into two equal-size subsets.
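The difference between the two strategies can be illustrated with a small Python sketch. The function names and the rule used for the dynamic split (give the current section exactly the number of links it needs) are assumptions made for illustration; the text only states that the dynamic split depends on the current program needs.

```python
def partition_static(links):
    """Static strategy: two equal-size subsets (working, configured)."""
    half = len(links) // 2
    return links[:half], links[half:]


def partition_dynamic(links, needed_now):
    """Dynamic strategy (assumed rule): the current section keeps as many
    working links as it needs; the rest are free for look-ahead configuration."""
    if not 0 <= needed_now <= len(links):
        raise ValueError("needed_now must be between 0 and the number of links")
    return links[:needed_now], links[needed_now:]
```

With four links and a current section that uses only one of them, the static split leaves two links for look-ahead configuration, while this dynamic split leaves three, so more connections for the next section can be prepared in advance.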
3 Parallel Program Scheduling and Partitioning into Sections
An application program is represented as a weighted Directed Acyclic Graph (DAG) with computation task nodes and communication (i.e. node data dependency) edges. Programs are executed according to the macro data flow model. The graph is static and deterministic. A two-phase approach is used to tackle the problem of scheduling and graph partitioning in the presented environments [5, 6]. In the first phase, a scheduling algorithm reduces the number of communications and minimizes program execution time, based on the program DAG. The scheduling algorithm is an improved version of ETF (Earliest Task First) scheduling [4]. In the second phase, the scheduled program graph is partitioned into sections for look-ahead execution in the assumed environment. A scheduled program is specified by the Assigned Program Graph (APG) [5, 6] (see Fig. 4). The partition of the program graph into sections is defined using the Communication Activation Graph (CAG) [5, 6]. Program sections correspond to sub-graphs of the CAG for which the following conditions are fulfilled:
1. section sub-graphs are mutually disjoint with respect to communication edges connecting processes allocated to different processors,
2. sub-graphs are connected with respect to activation and communication edges,
3. inter-processor connections do not change inside a section sub-graph,
4. section sub-graphs are complete with respect to activation paths and include all communication edges incident to all communication nodes on all activation paths between any two communication nodes which belong to a section,
5. all connections for a section are prepared before the activation of this section, which simultaneously concerns all processors involved in the section execution.
3.1 Program Graph Partitioning into Sections
In the second phase of program structuring, a program graph partitioning heuristic is used to find program sections (Fig. 3). In addition, the heuristic assigns a crossbar switch to each section. The algorithm starts with an initial partition, which consists of sections each built of a single communication and all assigned to the same crossbar switch. In each step, a vertex of the CAG is selected and then the algorithm
Begin
  Initialize the set of sections
    /each section is composed of a single communication and assigned to crossbar 1/
  While the stop condition is not satisfied
    /stop condition: each vertex of CAG is visited, no execution time improvement during a given number of steps/
    Select a vertex v of CAG which maximizes the selection function and which is not placed in the tabu list
    Try to merge the section of v with some sections of predecessors of v
    Select the merging which gives the shortest partitioned program execution time
  EndWhile
End
Fig. 3. The general scheme of the graph partitioning algorithm
tries to include this vertex in a union of existing sections determined by the edges of the current vertex. The heuristic tries to find a union of sections that does not break the rules of graph partitioning. The union that gives the shortest program execution time is selected. See [5, 6] for a detailed description of the partitioning heuristic.
3.2 Simulated Execution of Programs
The program execution time is obtained during graph structuring by simulated execution of the partitioned graph in a look-ahead (or on-request) reconfigurable system. A simulator for symbolic execution of APG graphs, written in C, has been used. For simulated execution of the program graph in a look-ahead reconfigurable system, an APG graph with a valid partition is extended by subgraphs which model the look-ahead reconfiguration control (Fig. 4). The functioning of the Communication Control Path, the Synchronization Path and the Global Control Subsystem (GCS) is modeled as subgraphs executed on additional virtual processors. Weights in the graph nodes correspond to latencies of the respective control actions, such as crossbar switch reconfiguration, bus latency, connection setting and activation of program sections. These weights are expressed as control parameters of the simulation (tS, tF, tR, tA).
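The idea of modeling control actions as extra weighted nodes can be sketched in Python. This is a simplified illustration, not the authors' C simulator: it assumes that a node starts as soon as all its predecessors have finished, that processor contention is already encoded as precedence edges, and that the graph is acyclic. The function name and the dictionary-based graph representation are hypothetical.

```python
def simulated_finish_times(weights, preds):
    """Finish time of every node in a weighted DAG.

    weights: node -> latency of the task, communication or control action
    preds:   node -> iterable of nodes that must finish first
    The graph is assumed to be acyclic; no cycle detection is performed.
    """
    finish = {}

    def visit(node):
        if node not in finish:
            # A node starts when its latest predecessor has finished.
            start = max((visit(p) for p in preds.get(node, ())), default=0)
            finish[node] = start + weights[node]
        return finish[node]

    for node in weights:
        visit(node)
    return finish
```

As a usage example, take a task A (weight 10) that feeds a communication C (weight 5) followed by a task B (weight 20), and add a control node R that stands for setting the connection and must precede C. With `weights = {"A": 10, "R": 15, "C": 5, "B": 20}` and `preds = {"C": ["A", "R"], "B": ["C"]}`, B finishes at time 40. Lowering the weight of R to 1 moves the finish of B to 35, because the connection is then ready before A completes and the control action is fully hidden.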
4 Experimental Results
During the experiments we performed an evaluation and comparative studies of the execution efficiency of programs for different program execution control strategies and system parameters. We examined program execution speedup as a function of the reconfiguration control parameters (tR and tV), the number of crossbar switches and the number of processor links, for the parallel program of Strassen matrix multiplication (two recursion levels). The results presented in the paper were obtained
[Figure 4 diagram omitted. Labels: processors P1 to P6; sections S1 to S6; links L1 and L2; Control Bus Process; Activation Process; Reconfiguration Process; end of link use messages; activation messages; synchronisation for activation; control parameters tS, tF, tR, tA.]
Fig. 4. Modeling the reconfiguration control in an APG
for two graphs of the Strassen matrix multiplication program: Strassen (two recursion levels) and Strassen (cg) (Strassen multiplication with coarse-grain parallelism, one recursion level). The table below shows the parameters of the program graphs (the average distance between reconfigurations and the granularity, defined as the communication/computation ratio):

graph name      nb. of vertices   nb. of communications   granularity   distance between reconfigurations
Strassen        209               352                     1.2           80-130
Strassen (cg)   27                44                      4.2           500-1300
Program execution was simulated using the look-ahead paradigm in the following architectural variants: a system with dynamically partitioned multiple processor link sets (multiple-link-sets) and a system with a single connection switch with on-request dynamically established connections (on-request). The number of processors (4, 8, 12) and the number of processor links (2, 4, 8) were used as basic parameters in the experiments. Additional parameters were: tR, the reconfiguration time of a single connection, in the range 1-100, and tV, the section activation time overhead (tV = tF + tS + tA; for the components of tV see Fig. 4), in the range 1-100. Fig. 5 shows the maximal speedup for the fine-grain version of the Strassen program graph. Both architectures obtained similar maximal speedup values. This outcome was obtained for very small values of the time parameters representing the reconfiguration control overhead (tV) and the creation of a single connection (tR). For these very small values of tV and tR, the impact of reconfiguration control overheads on the total execution time of the program is minimal or even negligible. Thus, regardless of the efficiency of reconfiguration control, both architectures give similar results. Fig. 5 also shows the average speedup for a wide range of values of the parameters tV and tR. The range of tR and tV used during the experiments included values for which
[Figure 5 chart omitted. Vertical axis: Speedup (0 to 10). Horizontal tick labels: 2 and 4, grouped under 4, 8 and 12. Series: Max (multiple-link-set), Max (on-request), Ave (multiple-link-set), Ave (on-request), Tr=1, Tv=18 (multiple-link-set), Tr=1, Tv=18 (on-request).]
Fig. 5. Maximal and average speedup for Strassen program for full range of tR and tV, and speedup for tR = 1, tV = 18
the efficiency of reconfiguration control is very low; thus the average speedup was substantially lower than the maximal one. The architecture based on multiple link sets yields better results in the presence of higher reconfiguration time overheads, due to a more effective reconfiguration subsystem based on optimized program control and section partitioning. The analysis of experimental results for different values of tV and tR reveals that the better results of the multiple link set architecture were obtained when the reconfiguration control overhead (tV) had a significant influence on the total program execution time. An example of such a situation is shown in Fig. 5, where the speedup for the selected values tR = 1 (a very small value) and tV = 18 (a very large value) is presented. It should be noted that for these values of tV and tR, the system based on classical on-request reconfiguration control completely fails, especially for higher numbers of processors. In the system based on multiple link sets, the number of control messages is reduced compared with classical on-request control. So, for a higher number of processors in the system (and thus a higher number of control messages transmitted), the efficiency of reconfiguration control is preserved for larger values of the tV parameter.
[Figure 6 chart omitted. Vertical axis: Speedup (0 to 7). Horizontal tick labels: 2 and 4, grouped under 4, 8 and 12. Series: Max (multiple-link-set), Max (on-request), Ave (multiple-link-set), Ave (on-request), Tr=10, Tv=48 (multiple-link-set), Tr=10, Tv=48 (on-request).]
Fig. 6. Maximal and average speedup for Strassen (coarse-grain) program for full range of tR and tV, and speedup for tR = 10, tV = 48
[Figure 7 chart omitted. Vertical axis: Speedup (0 to 3.5). Horizontal tick labels: 4, 8, 12. Series: dyn. mls, stat. mls, 4 crossbars, on-request.]
Fig. 7. Comparison of average speedup of Strassen program in look-ahead reconfigurable system (4 crossbars: 4 crossbar switches), dynamically partitioned multiple-link-sets (dyn. mls), statically partitioned multiple-link-sets (stat. mls), on-request system (on-request)
The aforementioned results were confirmed for the coarse-grain version of the Strassen program graph (Fig. 6). The coarse-grain version yields lower values of maximal speedup, since we were unable to parallelize it for more than 8 processors (the same results were obtained for 8 and 12 processors, Fig. 6). Coarse-grain parallelization reduces the impact of reconfiguration control overheads on total program execution time. Fig. 6 shows that the average speedup was similar to the maximal one. So, as expected, the difference between classical on-request reconfiguration control and multiple link set control is smaller than for fine-grain parallelism, but the architecture based on multiple link sets still gives better results. Fig. 7 shows the average speedup of the fine-grain Strassen program for all system architectures discussed in the paper.
5 Conclusions
In the paper, we have discussed the idea of look-ahead dynamic connection reconfiguration in multi-processor systems implemented as parallel embedded systems whose communication behavior is tailored to program needs. We have shown that the system based on multiple link sets and optimized reconfiguration control, in which the program graph is partitioned into sections executed with fixed inter-processor connections, gives better speedups than classic on-request reconfiguration control, especially when the reconfiguration subsystem is not efficient enough or not adjusted to the application’s communication requirements. Importantly, the better performance is achieved without changing the standard system architecture. So, the proposed parallel program optimization method provides much better program behavior without any additional hardware cost.
E. Laskowski and M. Tudruj
Experiments with Strassen matrix multiplication have shown that look-ahead reconfiguration with multiple crossbar switches gives, in general, a better program execution speedup and a larger applicability area with respect to various system time parameters than reconfiguration with link set partitioning. The lower the reconfiguration efficiency of the system, the better the results obtained from applying look-ahead dynamic connection reconfiguration. This confirms the value of look-ahead reconfiguration for fine-grained parallel programs, which have time-critical reconfiguration requirements. The presented experiments show that for low values of unitary connection reconfiguration time and program section activation time, the architectures with multiple crossbar switches and with link set partitioning perform similarly well.
References
1. Chen, J., Watson, W.: User-level communication on Alpha Linux Systems. In: Intl. Symp. on Parallel Architectures, Algorithms and Networks, ISPAN. IEEE CS Press, Los Alamitos (2000)
2. Tanabe, N., et al.: Low Latency Communication on DIMNET-I network Interface Plugged into a DIMM slot. In: Intl. Conf. on Parallel Computing in Electrical Engineering, Warsaw, pp. 9–14. IEEE CS Press, Los Alamitos (2002)
3. Tudruj, M.: Look-Ahead Dynamic Reconfiguration of Link Connections in Multi-Processor Architectures. In: Parallel Computing 1995, Gent, September 1995, pp. 539–546 (1995)
4. Hwang, J.-J., Chow, Y.-C., Angers, F.D., Lee, C.-Y.: Scheduling Precedence Graphs in Systems with Interprocessor Communication Times. SIAM J. Comput. 18(2) (1989)
5. Laskowski, E.: New Program Structuring Heuristics for Multi-Processor Systems with Redundant Communication Resources. In: Proceedings of PARELEC 2002, Warsaw, September 2002, pp. 183–188 (2002)
6. Laskowski, E.: Program Scheduling in Look-Ahead Reconfigurable Parallel Systems with Multiple Communication Resources. In: Proceedings of PARELEC 2004, Dresden, September 2004, pp. 256–261 (2004)
7. Laskowski, E., Tudruj, M.: Efficient Parallel Embedded Computing Through Look-Ahead Configured Dynamic Inter-Processor Connections. In: Proceedings of 5-th International Symposium on Parallel Computing in Electrical Engineering PARELEC 2006, Bialystok, Poland, September 13-17 (2006) © 2006 IEEE
An Algorithm to Improve Parallelism in Distributed Systems Using Asynchronous Calls
Rouzbeh Maani and Saeed Parsa
Iran University of Science and Technology
[email protected],
[email protected]
Abstract. Distributed systems use different methods to achieve parallelism. A common approach is to use asynchronous calls. In this method the caller and the called method are located on different workstations, and the caller continues to run after calling the remote method. Although the caller and called methods can execute concurrently, the caller must stop as soon as an instruction after the call depends on a value affected by the called method. In this article, an instruction scheduling algorithm is presented to achieve more concurrency in the execution of such distributed code.
1 Introduction
The use of asynchronous calls is a common approach to creating concurrency in distributed programs. The concurrent execution of the caller and the called methods in this approach speeds up program execution [1][2]. However, programmers are used to thinking sequentially: when they write a call or any other statement, they want to use its result immediately afterwards. This way of thinking works against speeding up distributed code through asynchronous calls, because the concurrent execution of the caller and the called method ends at the very first position within the caller where any value affected by the called method is required.

Instruction scheduling algorithms have conventionally been used to enhance parallelism in the execution of program code [3][4][5]. These algorithms mostly depend on the architecture of the underlying system. As a case in point, in pipelined architectures instruction scheduling is applied to speed up programs by keeping the pipeline full. This is achieved by finding sequences of unrelated instructions that can be overlapped in the pipeline [6]. As another example, Nystrom and Eichenberger have presented an instruction scheduling algorithm that focuses mainly on minimizing the execution overhead due to inter-cluster communication in VLIW architectures [7]. Instruction scheduling has also been applied to speed up program execution by increasing throughput [8][9], minimizing register pressure [10][11], reducing the effect of cache misses, or a combination of these [12][13][14].

Our main motivation has been to enhance the degree of concurrency in the execution of remote asynchronous calls. This is achieved by maximizing the distance between the call statement and the very first position within the caller where any value affected by the called method is used. The proposed instruction scheduling algorithm needs to know the execution time of each statement. For this purpose, a time estimation approach can be used [15].

The remaining parts of this article are organized as follows. Section 2 discusses the different types of dependencies between statements and their effects on the reordering of instructions. Section 3 presents the instruction scheduling algorithm. Section 4 includes a case study that applies the proposed algorithm to a sample distributed code. Section 5 concludes.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 49–58, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Instruction Scheduling, Challenges and Requirements
As mentioned before, the aim is to gain the maximum possible concurrency in remote asynchronous calls by reordering the statements. The maximum amount of concurrency is achieved when the caller method uses the values affected by the called method just after the asynchronous call completes. Therefore, the distance between each asynchronous call statement and the statements using any values affected by the called method should be increased, if possible. An instruction may be moved within the program code as long as control and data dependencies among the instructions are not violated [6]. Control and data dependencies and their effects on instruction reordering are described in the next subsections.

2.1 Control Dependency
By definition, an instruction is control dependent on another instruction if its execution is determined by that instruction. For example, all the instructions within the body of an If statement or a For statement are control dependent on the instruction with which the statement begins. Control dependencies between instructions can be represented by a directed graph in which each node represents an instruction and the edges indicate control dependencies between the instructions. Any instruction scheduling algorithm must preserve these control dependencies. To do so, instructions are only moved within their control dependency scope. A control dependency scope consists of all the instructions that are directly control dependent on the same instruction. Fig. 1 shows a sample code and the corresponding control dependency scopes. In this graph solid arrows represent control dependencies. For example, all the instructions that are control dependent on instruction number 2 are in scope2. Instruction number 2 itself is in scope1.

2.2 Data Dependency
To preserve the program semantics, data dependencies must be preserved while the program instructions are reordered. Generally, there are three kinds of data dependencies between two instructions: Read after Write, Write after Read and Write after Write [6]. Data dependencies are represented by a graph in which each node represents an instruction and the edges indicate data dependencies between the instructions. Data dependencies are shown by dotted arrows in Fig. 1. A graph representing both data and control dependencies is called a task graph. To prevent any dependency violation while reordering the instructions, a procedure called ResolveDependency is used. This procedure takes an instruction and checks whether all the instructions that are its parents in the data dependency graph execute before it. Fig. 2 gives the pseudo code of the ResolveDependency procedure.

Fig. 1. A sample code and its corresponding task graph

Algorithm: ResolveDependency(TheInstruction)
Begin
  Find the parents of TheInstruction in DATA DEPENDENCY GRAPH and put them in ListOfParents
  For each Parent in ListOfParents
  Begin
    Find the CommonScope between TheInstruction and Parent
    InstructionParentInCommonScope ← Ancestor of TheInstruction in the CONTROL DEPENDENCY GRAPH the scope of which is CommonScope
    ParentParentInCommonScope ← Ancestor of Parent in the CONTROL DEPENDENCY GRAPH the scope of which is CommonScope
    If InstructionParentInCommonScope is located before ParentParentInCommonScope, put it after ParentParentInCommonScope
  End
End

Fig. 2. Resolve Dependency Algorithm
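As a small illustration of the three kinds of data dependency named in this subsection, the following Python sketch classifies the dependency of a later instruction on an earlier one from their read and write sets. The `Instr` record, its field names and the function `data_dependencies` are assumptions made for this illustration; they do not appear in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: the variables it reads and writes."""
    text: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def data_dependencies(first: Instr, second: Instr) -> set:
    """Return the kinds of data dependency of `second` on the earlier `first`."""
    kinds = set()
    if first.writes & second.reads:
        kinds.add("RAW")  # Read after Write: second reads what first wrote
    if first.reads & second.writes:
        kinds.add("WAR")  # Write after Read: second overwrites what first read
    if first.writes & second.writes:
        kinds.add("WAW")  # Write after Write: both write the same variable
    return kinds


i1 = Instr("a = b + c", reads=frozenset({"b", "c"}), writes=frozenset({"a"}))
i2 = Instr("d = a * 2", reads=frozenset({"a"}), writes=frozenset({"d"}))
i3 = Instr("b = 7", writes=frozenset({"b"}))

print(data_dependencies(i1, i2))  # i2 reads a, which i1 wrote
print(data_dependencies(i1, i3))  # i3 overwrites b, which i1 read
print(data_dependencies(i2, i3))  # no shared variables
```

An empty result means that the two instructions may be swapped as far as data dependencies are concerned; control dependency scopes still have to be respected.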
3 Instruction Scheduling, Challenges and Requirements
In this section the proposed instruction scheduling algorithm is presented. In this algorithm, instructions are categorized into three types:
– Call Instructions: Asynchronous remote call instructions
– Use Instructions: Instructions which use any value affected by a Call Instruction
– Common Instructions: Other instructions

The algorithm first moves Call instructions and their parents in the caller task graph to the reordered code. In the second step, Use instructions and their parents are moved. In the third step, Common instructions are moved into the reordered code, between Calls and Uses. In the fourth step, the reordered code is further optimized. Before an instruction is moved to the reordered code, the ResolveDependency procedure described above is invoked to resolve any dependency conflicts caused by the move. These four steps are described in more detail in the next subsections.

3.1 First Step
In the first step of the algorithm, presented in Fig. 3, asynchronous call instructions within each method are moved to the top positions in the reordered code of the method. The longer the execution time of a called method, the longer the caller may wait for the call results; hence, if there are several Call instructions within a method, the Call instruction with the longest completion time is moved to the reordered code first. Before a Call instruction is moved, all its parents in the task graph are moved to the reordered code. Again, among the parents priority is given to the Call instructions with the longest execution time. After the Call instructions, the Common instructions and finally all the Use instructions among the parent nodes are selected. A selected node is moved to the reordered code only if it has not already been moved.

3.2 Second Step
In the second step of the algorithm, presented in Fig. 3, all the remaining Use instructions within each method are moved to the bottom of the reordered code of the method. When a Use instruction is moved, all its parents in the task graph are moved to the reordered code as well. A selected instruction is moved to the reordered code only if it has not already been moved.

3.3 Third Step
Algorithm Step 1: ScheduleTheCalls(ListOfInstructions:List)
Begin
  While there is a CallInstruction in ListOfInstructions which is not Moved To Reordered Code
  Begin
    Find a CallInstruction with the longest execution time from ListOfInstructions which is not Moved To Reordered Code
    Find all the parent nodes of the found CallInstruction in Task Graph and put them in the NewListOfInstructions
    ScheduleTheCalls(NewListOfInstructions)
    MoveToReorderedCode(CallInstruction)
  End
  While there is a CommonInstruction in ListOfInstructions which is not Moved To Reordered Code
  Begin
    Take a CommonInstruction
    Find all control dependent parents of the CommonInstruction and put them in ListOfControlParents
    For each ParentInstruction in ListOfControlParents
      MoveToReorderedCode(ParentInstruction)
    MoveToReorderedCode(CommonInstruction)
  End
  While there is a UseInstruction in ListOfInstructions which is not Moved To Reordered Code
  Begin
    Take a UseInstruction
    Find all control dependent parents of the UseInstruction and put them in ListOfControlParents
    For each ParentInstruction in ListOfControlParents
      MoveToReorderedCode(ParentInstruction)
    MoveToReorderedCode(UseInstruction)
  End
End

Algorithm Step2: SelectUseInstructions
Begin
  While there is a UseInstruction which is not Moved To Reordered Code
  Begin
    Take a UseInstruction
    Find all control dependent parents of the UseInstruction and put them in ListOfControlParents
    For each ParentInstruction in ListOfControlParents
      MoveToReorderedCode(ParentInstruction)
    MoveToReorderedCode(UseInstruction)
  End
End

Fig. 3. Step 1 and 2 of the algorithm

As mentioned before, in an asynchronous remote call the caller has to wait for the completion of the call at the first position where any value affected by the called method is used. To reduce the wait time, in the third step the remaining Common instructions are positioned between each Call and its corresponding Use instruction, if possible. Here, priority is given to the Use instruction whose wait time is longest. The wait time equals the completion time of the Call minus the time elapsed between the Call position and the Use position. If the computed wait time is less than zero, it is set to zero (Fig. 4). To find suitable positions for inserting Common instructions, a procedure called FindCandidatePositions is invoked for each Call and its corresponding Use instruction. Here, suitable positions are defined as all the positions within each scope located between the Call and its corresponding Use instruction.
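The wait-time rule of the third step can be written down directly. The sketch below is only an illustration: the function name `wait_time`, its parameters and the numeric values are hypothetical, and times are taken to be estimated execution times in some common unit.

```python
def wait_time(call_completion_time, times_between_call_and_use):
    """Wait time at a Use: Call completion time minus elapsed time, floored at zero."""
    elapsed = sum(times_between_call_and_use)
    return max(0, call_completion_time - elapsed)


# Hypothetical times, e.g. in estimated cycles.
before = wait_time(100, [10, 15])          # two instructions between Call and Use
after = wait_time(100, [10, 15, 40, 50])   # two Common instructions inserted
print(before, after)
```

Each Common instruction placed between the Call and the Use lowers the remaining wait time by its own execution time, which is why the third step stops filling a gap once the wait time reaches zero.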
1  Algorithm Step3: ScheduleInstructionsFillWaitingTimes(ListOfUseInstructions)
2  Begin
3    While there is an Instruction in ListOfUseInstructions which is not Moved To Reordered Code
4    Begin
5      Find the UseInstruction with the longest wait time from ListOfUseInstructions
6      Remove the UseInstruction from ListOfUseInstructions
7      CandidatePlacesList ← FindCandidatePositions(CallInstruction, UseInstruction)
8      For each place in CandidatePlacesList, if it is possible to put an Instruction which is not Moved To Reordered Code
9      Begin
10       Take the Instruction
11       Find all control dependent parents of the Instruction and put them in ListOfControlParents
12       For each ParentInstruction in ListOfControlParents
13         MoveToReorderedCode(ParentInstruction)
14       MoveToReorderedCode(Instruction)
15       Subtract the time of the Instruction from the wait time of the UseInstruction
16       If the wait time of the UseInstruction is less than or equal to 0 then go to 3
17     End
18   End
19   While there is an Instruction which is not Moved To Reordered Code
20   Begin
21     Take an Instruction
22     Find all control dependent parents of the Instruction and put them in ListOfControlParents
23     For each ParentInstruction in ListOfControlParents
24       MoveToReorderedCode(ParentInstruction)
25     MoveToReorderedCode(Instruction)
26   End
27 End

1  Algorithm Step4: MoveInstructionBeforWaitAndAfterCall(ListOfUseInstructions)
2  Begin
3    While there is an Instruction in ListOfUseInstructions which is not Moved To Reordered Code
4    Begin
5      Find the UseInstruction with the lowest wait time from ListOfUseInstructions
6      Remove the UseInstruction from ListOfUseInstructions
7      Add all instructions of the scope which are placed after the UseInstruction to MoveBeforeUseList
8      Find the corresponding call Instruction of the UseInstruction and name it CallInstruction
9      Add all instructions of the scope which are placed before the CallInstruction to MoveAfterCallList
10     For each instruction in MoveBeforeUseList, if it is possible to move it before the UseInstruction
11     Begin
12       Move the instruction to the place before the UseInstruction
13       Subtract the time of the instruction from the wait time of the UseInstruction
14       If the wait time of the UseInstruction is less than or equal to 0 then go to 3
15     End
16     For each instruction in MoveAfterCallList, if it is possible to move it after the CallInstruction
17     Begin
18       Move the instruction to the place after the CallInstruction
19       Subtract the time of the instruction from the wait time of the UseInstruction
20       If the wait time of the UseInstruction is less than or equal to 0 then go to 3
21     End
22   End
23 End

Fig. 4. Step 3 and 4 of the algorithm
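The combined effect of the first three steps can be sketched for the simplest case. The Python code below is a simplified illustration, not the algorithm of Figs. 3 and 4: it assumes a single control dependency scope, ignores step 4, and folds steps 2 and 3 into one ordering rule (Common instructions that do not depend on a Use are placed before the Uses). The `Instr` record, the function `reorder` and the sample program are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    kind: str         # "call", "use" or "common"
    time: int = 1     # estimated execution time
    deps: tuple = ()  # names of the data-dependency parents


def reorder(instrs):
    """Calls first (longest first), independent Common code next, Uses last."""
    by_name = {i.name: i for i in instrs}
    order, placed = [], set()

    def place(ins):
        # Move an instruction only after all of its parents have been moved.
        if ins.name in placed:
            return
        for parent in ins.deps:
            place(by_name[parent])
        placed.add(ins.name)
        order.append(ins.name)

    def needs_use(ins):
        # True if the instruction is, or transitively depends on, a Use.
        return ins.kind == "use" or any(needs_use(by_name[p]) for p in ins.deps)

    calls = [i for i in instrs if i.kind == "call"]
    for call in sorted(calls, key=lambda i: i.time, reverse=True):
        place(call)          # Calls and their parents go to the top
    for ins in instrs:
        if ins.kind == "common" and not needs_use(ins):
            place(ins)       # fill the gap between Calls and Uses
    for ins in instrs:
        place(ins)           # Uses and whatever depends on them go last
    return order


program = [
    Instr("a", "common"),
    Instr("call_f", "call", time=50, deps=("a",)),
    Instr("use_f", "use", deps=("call_f",)),
    Instr("b", "common", time=20),
    Instr("c", "common", time=20, deps=("b",)),
]
print(reorder(program))
```

In the sample program the Use originally follows the Call directly; after reordering, the two independent Common instructions run while the remote call is in progress.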
3.4 Fourth Step
In the first step of the algorithm, when the dependent parents of a Call instruction are moved, there may be Use instructions among the parents. Moving a Use instruction to the reordered code at this step may increase the wait time for the Call results. The wait time of Use instructions might also be increased for other reasons. To resolve this problem, the fourth step attempts to move all the instructions located after each Use instruction to a position immediately before it. Similarly, it attempts to move all the instructions within the reordered code that are positioned before a Call instruction to a position after it. Here again, priority is given to the Call and Use instructions with the lowest wait time. Note that it is not always possible to move an instruction before or after another one. In general, instruction 2 can be moved to a position before instruction 1 if and only if instruction 2 and all its successors in the control dependency graph have no data dependencies on instruction 1 or on any of the instructions between instructions 1 and 2. Fig. 4 presents the pseudo code of step 4 of the proposed instruction reordering algorithm.
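The movability condition stated above can be checked mechanically. The following Python sketch is one possible reading of that condition; the function `can_move_before`, the dictionary-based graph representation and the instruction labels are assumptions made for illustration only.

```python
def can_move_before(order, instr2, instr1, control_children, data_parents):
    """Check whether instr2 may be moved to the position just before instr1.

    order            -- current instruction order (instr1 precedes instr2)
    control_children -- maps an instruction to those directly control dependent on it
    data_parents     -- maps an instruction to the instructions it data-depends on
    """
    # instr2 together with all its successors in the control dependency graph
    moved, stack = set(), [instr2]
    while stack:
        ins = stack.pop()
        if ins not in moved:
            moved.add(ins)
            stack.extend(control_children.get(ins, ()))

    # instr1 and every instruction between instr1 and instr2
    start, end = order.index(instr1), order.index(instr2)
    jumped_over = set(order[start:end])

    # None of the moved instructions may depend on an instruction that is jumped over.
    return all(not (set(data_parents.get(ins, ())) & jumped_over) for ins in moved)


order = ["call", "use", "if", "body", "tail"]
control_children = {"if": ["body"]}                 # body is inside the if
data_parents = {"use": ["call"], "body": ["use"]}   # body reads a value set by use

print(can_move_before(order, "tail", "use", control_children, data_parents))
print(can_move_before(order, "if", "use", control_children, data_parents))
```

In this example `tail` can be moved before `use`, whereas the `if` statement cannot, because an instruction in its body depends on `use`.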
4 Case Study
In this section the effect of applying the proposed algorithm to a Traveling Salesman Problem (TSP) program is demonstrated. The TSP program code includes 17 classes and 129 methods. After the call flow graph is extracted from the program code, the graph is partitioned into two clusters, as shown in Fig. 5. These clusters are then distributed and executed on separate computing nodes.

Fig. 5. Clustered Call Flow Graph of TSP program

(a) Execution Time in Java Virtual Machine Cycles
(b) Speed Up before and after using Instruction Scheduling
(c) Ratio of speedup after using instruction scheduling algorithm to before
(d) Amount of concurrent time in Java Virtual Machine Cycles
(e) Ratio of concurrent time after using algorithm to before
Fig. 6. TSP program experimental results

To evaluate the proposed instruction scheduling algorithm, the execution time of the program is obtained statically by applying the time estimation algorithm presented in [16], which estimates program execution time in Java Virtual Machine Cycles in a platform-independent way. Java Virtual Machine Cycles can easily be translated into seconds on different platforms [17]. The execution time of the program, in Java Virtual Machine Cycles, for different input graphs is shown in Fig. 6.a. The figure depicts the serial execution time and the parallel execution time before and after applying the proposed algorithm for each input graph. It is clear that the execution time is reduced when the program runs as a distributed program and that it can be improved further by using the instruction scheduling algorithm. Fig. 6.b shows the speedup before and after applying the scheduling algorithm. The presented algorithm improves the speedup across the different inputs. The highest speedup is obtained for the input graph with 120 nodes and 250 edges: 1.45 before and 1.49 after applying the algorithm. The ratio of the speedup after using the instruction scheduling algorithm to the speedup before applying it is given in Fig. 6.c. Another measure is the amount of concurrent time before and after using the presented algorithm, which is shown in Fig. 6.d. The ratio of the concurrent time after using the instruction scheduling algorithm to that before applying it is depicted in Fig. 6.e. This shows that the amount of concurrent time can be increased by a factor of 1.06.
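As a worked example of the ratio plotted in Fig. 6.c, the two speedup values quoted above for the input graph with 120 nodes and 250 edges can be divided directly. The variable names are chosen for this illustration only.

```python
speedup_before = 1.45  # reported for the input graph with 120 nodes and 250 edges
speedup_after = 1.49   # same input, after instruction scheduling

ratio = speedup_after / speedup_before
print(f"speedup ratio: {ratio:.3f}")  # about a 2.8 % gain for this input
```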
5 Conclusion
Programmers may develop distributed code without worrying about concurrency in the execution of the program code. The inherent concurrency in the execution of distributed code can be exploited efficiently by applying an instruction scheduling technique. Maximum concurrency can be achieved by maximizing the distance between asynchronous call instructions and the instructions that use the results of the calls. In this paper an algorithm for enhancing concurrency in the execution of distributed program code is presented. The experimental results demonstrate a relatively high improvement of concurrency in the execution of a distributed program for solving the TSP.
References
1. Parsa, S., Bushehrian, O.: Automatic Translation of Serial into Distributed Code Using CORBA Event Channels. In: Yolum, P., Güngör, T., Gürgen, F., Özturan, C. (eds.) ISCIS 2005. LNCS, vol. 3733, pp. 152–161. Springer, Heidelberg (2005)
2. Parsa, S., Khalilpoor, V.: Automatic Distribution of Sequential Code Using JavaSymphony Middleware. In: Wiedermann, J., Tel, G., Pokorný, J., Bieliková, M., Štuller, J. (eds.) SOFSEM 2006. LNCS, vol. 3831, pp. 440–450. Springer, Heidelberg (2006)
3. Aleta, A., Codina, J., Sanchez, J., Gonzalez, A., Kaeli, D.: Exploiting pseudoschedules to guide data dependence graph partitioning. In: Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques, Charlottesville, VA (September 2002)
4. Golumbic, M.C., Rainish, V.: Instruction scheduling beyond basic blocks. IBM J. Res. Develop. 34(1) (January 1990)
5. Hagog, M., Zaks, A.: Swing Modulo Scheduling for GCC. In: GCC Developers Summit (2004)
6. Hennessy, J., Patterson, D.: Computer Architecture: A Quantitative Approach, 3rd edn. Morgan Kaufmann Publishers, San Francisco (2003)
7. Nystrom, E., Eichenberger, A.: Effective cluster assignment for modulo scheduling. In: Proceedings of the 31st International Symposium on Microarchitecture (MICRO-31), Dallas, TX, December 1998, pp. 103–114 (1998)
8. Jain, S.: Circular Scheduling: A New Technique to Perform Software Pipelining. In: Proc. of the Int. Conf. on Programming Languages, Design and Implementation (1991)
9. Rau, B.: Iterative Modulo Scheduling: An Algorithm for Software Pipelining Loops. In: Proc. of 27th Int. Symp. on Microarchitecture, November 1994, pp. 67–74 (1994)
10. Eichenberger, A., Davidson, E.: Stage Scheduling: A Technique to Reduce the Register Requirements of a Modulo Schedule. In: Proc. of the 28th Int. Symposium on Microarchitecture, pp. 338–349 (1995)
11. Eichenberger, A., Davidson, E., Abraham, S.: Optimum Modulo Schedules for Minimum Register Requirements. In: Proc. of Supercomputing 1995 (1995)
12. Huff, R.: Lifetime-Sensitive Modulo Scheduling. In: Proc. of the Int. Conf. on Programming Languages, Design and Implementation, pp. 318–328 (1993)
13. Llosa, J., Valero, M., Ayguade, E., Gonzalez, A.: Modulo Scheduling with Reduced Register Pressure. IEEE Transactions on Computers 47(6), 625–638 (1998)
14. Sanchez, J., Gonzalez, A.: Cache Sensitive Modulo Scheduling. In: Proc. of 30th Int. Symp. on Microarchitecture, December 1997, pp. 338–348 (1997)
15. Corti, M., Gross, T.: Instruction duration estimation by partial trace evaluation. In: Proc. WIP Session, 10th Real-Time and Embedded Technology and Applications Symposium, Toronto, Canada. IEEE, Los Alamitos (2004)
16. Parsa, S., Bushehrian, O.: On the Optimal Object-Oriented Program Remodularization. In: Shi, Y., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2007. LNCS, vol. 4487, pp. 599–602. Springer, Heidelberg (2007)
17. Schoeberl, M.: JOP: A Java Optimized Processor for Embedded Real-Time Systems. PhD thesis, Vienna University of Technology (2005)
IEBS Ticketing Protocol as Answer to Synchronization Issue
Barbara Palacz (1), Tomasz Milos (1), Lukasz Dutka (2), and Jacek Kitowski (1,2)
(1) Institute of Computer Science, AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Cracow, Poland
(2) Academic Computer Centre CYFRONET-AGH, ul. Nawojki 11, 30-950 Cracow, Poland
{palacz,milos}@student.agh.edu.pl, {dutka,kito}@agh.edu.pl
Abstract. In this paper we present a novel synchronization technology designed for distributed computer systems that use replication as a means of redundancy. Our mechanism was primarily created for Intelligent ExaByte Storage, a network storage system with multiuser access. The protocol is based on the concept of digital identity certificates and supports real-time update of replicas and annulment of operations that might have produced out-of-date results. It also guarantees that the number of replicas stays the same at all times while the system is operating.
1 Introduction
Nowadays, there exist a great number of distributed computer systems that require a mechanism for managing the synchronization of their resources. Most of these systems try to improve their reliability, performance or fault tolerance through the use of redundancy. In the context of storage systems, redundancy often involves data replication, which can be regarded as the process of managing copies of data, as well as a caching strategy in which identical files are available from multiple locations [1]. Data replication is one of the key techniques for improving data locality, increasing dependability, reducing access delay and enhancing scalability of distributed systems in wide area computing environments [2]. The literature offers several papers that address this issue (e.g. [3], [4], [5], [6]). The process of replication is closely related to the problem of replica synchronization, since most of the time strict data consistency is of paramount importance. The paper presents a novel approach to the issue of synchronization that is inseparably bound to IEBS, the Intelligent ExaByte Storage. IEBS is a distributed storage system whose structure is the basis of our mechanism and with which our protocol is meant to be used.
2 Synchronization Problem in Distributed Systems
As mentioned before, most distributed computer systems with multiuser access use replication as a means of redundancy. Providing replication is equivalent to having multiple copies of the same data stored in the system. This raises the question of replica synchronization, that is, of keeping the data coherent. When multiple users can modify the system data, more than one user may need to execute an operation on the same data simultaneously. Multiple read operations do not cause a problem, since each user may acquire a different data copy. Update operations are another, much more complicated, matter. When one data copy is modified, all other copies must be synchronized before they are ever used, so that the system data stays coherent. The question is how this can be done in a system storing great amounts of data, such as IEBS. This is exactly the problem that our synchronization protocol is trying to solve.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 59–67, 2008. © Springer-Verlag Berlin Heidelberg 2008
3 Idea of Storage by IEBS
To understand our approach to the problem of data synchronization, it is essential first to grasp the general concept and structure of the IEBS storage system. It is a system intended to supply distributed usable storage with a capacity of the order of exabytes, and to provide an intelligent way of maintaining the retained information from many independent authorities simultaneously [7].

3.1 Overview of System Structure
The IEBS system supplies the necessary storage by means of thousands of disk drives, each with a capacity of the order of a terabyte. All disks are equipped with simple processing units and Ethernet interfaces, and they work together over a UDP/IP network [7]. The disks are located in racks, which in turn are placed in geographically distributed data centers. Each rack is managed by a device called the rack manager, which maintains the work of all its disk drives. The storage system also comprises several additional devices, which are necessary for it to function as a whole. Among other things, these components provide easy management of the great amounts of data retained in the system and high-quality access control, which is extremely significant in a system of such a scale. In the context of the synchronization issue, three groups of devices are of interest: block managers, access control servers and rack managers. First, the block managers are responsible for maintaining the entire ExaByte storage. They are a group of servers that divide the storage into logical units called blocks, which serve as containers for data segments. Block managers store information about the location of all objects, which are divided into pieces and kept in these blocks, and extract this information when needed. Second, the access control servers supervise the system’s access authorization. To do so, they store information about all user accounts, access groups and access control lists. Moreover, they are responsible for creating digital identity certificates (tickets), which are the basis of our synchronization mechanism.
Finally, rack managers reside in each rack of every data center of the IEBS system. Aside from rack administration, their primary responsibility is to aggregate communication from disk drives to block managers. This aggregation is essential in a system of this scale, because otherwise block managers might be flooded with information flowing from thousands of disk drives simultaneously.
3.2 Replication as Problematic Feature
To appreciate our synchronization mechanism, it is essential to understand the extent of the problem in the IEBS system specifically. Not only is it a matter of maintaining data that might be accessed by multiple machines simultaneously, but there is also the issue of replica synchronization. The IEBS system supports several supplementary features, which collectively contribute to the intelligence of the storage. Thanks to the processing power built into the disk drives, the system is capable of performing some advanced data processing scenarios, such as redundancy support on demand [7]. This means that every client may request that a file be replicated a specified number of times, indicating what type of replication is to be performed. On-line replication means that the operation is performed immediately, while the client waits for it to finish. Off-line replication means that the client does not want to wait until all replicas are written, so the operation is performed later. In both cases, the disk drives communicate only with each other, passing on the replication request without the support of any additional devices. The problem that arises here is that, to keep the system coherent, it is unacceptable to store obsolete replicas of a file. Therefore, after any change is made to one copy of a file, the remaining copies must be synchronized and updated before they may be used again. Furthermore, if one of the disk drives crashes, the number of replicas must remain constant, so all copies of a file that were lost have to be written to some other disk drive.
Fig. 1. On-line replication
Fig. 2. Off-line replication
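The difference between the two replication modes can be illustrated with a minimal Python sketch. It is not the IEBS implementation: the `Drive` class, the function names, the in-memory block store and the `pending` queue are all hypothetical, and `copies` is assumed to mean the number of replicas written in addition to the first copy.

```python
from collections import deque


class Drive:
    """Hypothetical disk drive that keeps blocks in memory (illustration only)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data, replicas_left, chain):
        # Store the local copy, then pass the request on to the next drive.
        self.blocks[block_id] = data
        if replicas_left > 0 and chain:
            chain[0].write(block_id, data, replicas_left - 1, chain[1:])


def replicate_online(first, others, block_id, data, copies):
    """On-line: the call returns only after all replicas are written."""
    first.write(block_id, data, copies, others)


def replicate_offline(first, others, block_id, data, copies, pending):
    """Off-line: write one copy now and queue the remaining work."""
    first.blocks[block_id] = data
    pending.append((others, block_id, data, copies))


def run_pending(pending):
    """Later, the drives finish queued replications among themselves."""
    while pending:
        others, block_id, data, copies = pending.popleft()
        if others and copies > 0:
            others[0].write(block_id, data, copies - 1, others[1:])
```

In both functions the request travels from drive to drive, mirroring the statement that no additional devices are involved; only the moment at which the client regains control differs.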
4 Synchronization Protocol
The synchronization mechanism included in IEBS is based on the concept of digital identity certificates, so-called tickets. In IT security, a ticket is a proof of access and usage rights to a particular service [8]; thus it can be used as a means of authentication or as proof of authorization. IEBS retains this definition of a ticket.
Fig. 3. Access control and ticket creation
As mentioned before, our tickets are created by the access control servers in response to a client's request to perform an operation. Tickets are issued on the basis of the client's credentials and access rights to the object on which the operation is to be executed. If the client lacks the required access rights, the operation is rejected and no ticket is created. Moreover, each ticket is signed by the access control server for later verification and is issued for a limited time period. Once a ticket expires, it may not be used until it is renewed by the access control server. If a ticket never expired, it could become something the client could give away without any consequences [9].
4.1 General Operation Scenario
We have already explained the concept of tickets and the responsibilities of the system devices that control both data management and synchronization. We now present a basic concrete situation in which they cooperate to execute an operation requested by a client. Let us consider a simple read operation. Since the client must be authenticated, the request to execute an operation on a specified object is first sent to the access control server, where the ticket is created. The ticket strictly indicates the client's identity, the type of operation requested (in this case a read), and the object of the request (file specification). Right after its creation, the ticket is forwarded to one of the block managers, which stores all information about the specified file in a block database. Since the file might have been divided into data segments, the block manager retrieves information about their exact location. This information takes the form of a so-called block handle, containing the disk drive's identification and the block's address on the drive. Each block handle is added to one of the client's tickets, one block handle per ticket. Since there was initially only one ticket, it is duplicated as many times as needed, and the tickets are sent back to the client. Afterwards, the client forwards the tickets directly to the disk drives indicated by the block handles. Thanks to their processing power, the drives are able to analyze the tickets, checking the client's access rights, the signature of the access control server, the ticket's time limit and the list of annulled tickets, which is stored on each disk drive itself. If all these conditions are met, the operation is executed, and the client receives the results.
4.2 Synchronization Problems
We have described the basic scenario of an operation performed in the IEBS system. In the case above, there were very few synchronization problems. However, there are several distinct situations that are not as simple as they seem. Since the IEBS system supports additional advanced features, a client may want files to be replicated several times. What happens if a client requests to modify a file that already has replicas? In this case, the answer is evident: the replicas must also be updated and synchronized. But what if, during a file's modification, a different client requests some other operation on the same file? Should that client get a reference to one of its replicas? Certainly not, because this may lead to data incoherency. Therefore, block managers have to keep track of which copies of files are currently being used, and in what way (what operations are to be executed on them). The ultimate problem arises when an update operation is to be performed on a file for which tickets for other operations were issued previously. To clarify, consider the following situation: a client requests to read a file, and thereafter another client requests an update operation on the same file. Since tickets are issued for a limited time period, they do not have to be used at the moment they are received. Thus, if the read operation has not yet been performed, it must be canceled so that the client does not receive out-of-date data.
4.3 Update with Replica Synchronization
As we can see, there are several significant synchronization requirements that our protocol must meet. They will all be covered while explaining an example operation that includes all the problems mentioned in the previous section. We focus on an update operation performed on a file with replicas. The general process is basically the same as for a simple read operation, with several differences. It starts out similarly, with client authorization, ticket creation and forwarding of the ticket to one of the block managers. It is at this point that our synchronization protocol must begin its action.
Fig. 4. First phase of an update operation
The block manager, seeing that the requested operation is an update, retrieves all replicas of the specified file. Each block containing data segments of these replicas is then marked as locked in the block database. This signifies that the blocks may not be used again until they are properly synchronized with the copy to be modified. Therefore, the block manager will not issue block handles pointing to any of these blocks until the update operation is complete. Afterwards, still before block handle creation, the block manager informs all interested rack managers about the annulment of tickets that were previously issued for any of the blocks containing the file's replicas. Interested rack managers are those that maintain at least one disk drive on which a data segment of a replica is stored. The rack managers forward the annulment to the specified disk drive, which adds each such ticket to its list of annulled tickets. Any ticket that appears on this list will not be executed and will be returned to the client with an appropriate notification. Finally, the block handles are inserted into the tickets. They indicate the location of the data segments of the only unlocked file copy. The tickets are sent back to the client, who uses them according to the scenario presented before, forwarding them to the specified disk drives, where they are executed. When the modification is done and the client closes the file, a notification is sent by the drive to its rack manager. In the meantime, after block handle creation, the block manager has already sent the interested rack managers information about the location of the replicas of each data segment being modified. This way, when a rack manager receives a notification that the update operation has ended, it forwards the replica information to the disk drives that store the updated copy of the file.
Therefore, every disk drive holding a block whose contents changed during the update knows which other blocks (on other disk drives) need to be updated. The changes made are thus passed on from drive to drive until all replicas are synchronized. At that point, another notification is sent to the rack manager. Knowing that all data segments of a file have been updated and synchronized, the rack managers notify the block managers, also forwarding the information about the changes that were made to any of their disk drives' blocks. Finally, when the block manager receives confirmation that the entire synchronization process is over, it unlocks all replicas of the file that was modified. From then on, all copies of the file may be used again, until the next update locks them.
Fig. 5. Second phase of an update operation
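The interplay of ticket checks, locking and annulment can be sketched in a few lines of Python. This is only an illustration under several assumptions, not the IEBS protocol itself: all class and method names are hypothetical, the signature is an HMAC with a made-up key, block handles are reduced to block names, rack managers are omitted (the block manager talks to drives directly), and new tickets are simply refused while any replica of the file is locked.

```python
import hashlib
import hmac
from dataclasses import dataclass, replace

SECRET = b"access-control-server-key"  # hypothetical signing key


@dataclass(frozen=True)
class Ticket:
    ticket_id: int
    client: str
    operation: str          # "read" or "update"
    block: str              # block handle, simplified to a block name
    expires_at: float
    signature: bytes = b""


def sign(t):
    """Signature of the access control server over the ticket fields."""
    msg = f"{t.ticket_id}|{t.client}|{t.operation}|{t.block}|{t.expires_at}"
    return hmac.new(SECRET, msg.encode(), hashlib.sha256).digest()


class DiskDrive:
    def __init__(self):
        self.annulled = set()   # local list of annulled ticket ids

    def execute(self, t, now):
        """Run the operation only if every check on the ticket passes."""
        if not hmac.compare_digest(t.signature, sign(t)):
            return "rejected: bad signature"
        if now > t.expires_at:
            return "rejected: expired"
        if t.ticket_id in self.annulled:
            return "rejected: annulled"
        return "executed"


class BlockManager:
    def __init__(self, replicas, drives):
        self.replicas = replicas    # file name -> blocks holding its copies
        self.drives = drives        # block name -> DiskDrive storing it
        self.locked = set()
        self.issued = {}            # block name -> ids of outstanding tickets
        self.next_id = 0

    def issue(self, client, operation, file, now, lifetime=60.0):
        blocks = self.replicas[file]
        if any(b in self.locked for b in blocks):
            return None             # assumption: refuse while an update runs
        target = blocks[0]
        if operation == "update":
            # Lock the other replicas and annul tickets issued earlier.
            self.locked.update(blocks[1:])
            for b in blocks:
                for old in self.issued.pop(b, set()):
                    self.drives[b].annulled.add(old)
        self.next_id += 1
        t = Ticket(self.next_id, client, operation, target, now + lifetime)
        t = replace(t, signature=sign(t))
        self.issued.setdefault(target, set()).add(t.ticket_id)
        return t

    def update_finished(self, file):
        """Called once all replicas are reported as synchronized."""
        self.locked.difference_update(self.replicas[file])
```

In this sketch, a read ticket issued before an update is rejected by the drive as annulled, which corresponds to the cancellation of outstanding reads described in Section 4.2, and no new tickets for the file are handed out until `update_finished` is called.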
5 Related Work
In the previous section, we presented the entire process of replica synchronization, which is specific to the IEBS storage system. Several other distributed replica systems have also been developed. AFS and NFS are systems in which strict data coherency is of major importance, although both assume a much tighter coupling of individual nodes than can be supported in geographically distributed environments such as IEBS [10]. Furthermore, AFS (Andrew File System) does not support multisite replication [1]. One very popular replica synchronization tool is nsync, which is widely used in Grid environments such as GridLab, Globus and SRB [10]. It is implemented as an asynchronous distributed algorithm that, similarly to IEBS, performs all synchronization at the block level. However, unlike IEBS, it does not synchronize all replicas as soon as a block modification is over, but does so at regular intervals. Each time the synchronization is performed, it first detects all files that have been modified since the last synchronization, and then it propagates the modifications [10]. Another common synchronization solution is a tool called rsync++, which is an improvement of the open source file mirroring tool rsync [11]. It is based on a master-slave model, in which replica update information is generated on the master site, transported through the network to the slave sites and finally processed on each of the slave sites [12]. What is also interesting in this solution is that the differences between the old and new copies of a file are calculated locally, and only the patches are transmitted to other nodes [10].
6 Concluding Remarks
Replication is one of the key techniques for achieving better access times and reliability in data grids and storage systems [2], both of which are extremely important in modern computer systems. Various technologies attempt to solve the problem of replica synchronization, but many systems develop their own synchronization mechanisms that are specific to their structure. This is the case with Intelligent ExaByte Storage (IEBS), for which a new ticket-based synchronization protocol has been created. As IEBS is a distributed system that can be maintained by many institutions simultaneously, it generates multiple synchronization issues, all of which can be resolved by our novel protocol. The protocol supports real-time update of replicas, meaning that it does not allow obsolete data to be stored in the system. Furthermore, it provides an advanced mechanism for annulling operations that might have produced out-of-date results, and a guarantee that the number of replicas stays the same at all times during the system's operation. It is therefore well suited to the needs of the IEBS storage system.
Acknowledgements The work described in this paper was supported in part by the European Union through the project int.eu.grid funded by the European Commission program IST-2006-031857.
References
1. Hoschek, W., Jaén-Martínez, F.J., Samar, A., Stockinger, H., Stockinger, K.: Data Management in an International Data Grid Project. In: Buyya, R., Baker, M. (eds.) GRID 2000. LNCS, vol. 1971. Springer, Heidelberg (2000)
2. Qin, X., Jiang, H.: Data Grid: Supporting Data-Intensive Applications, May 07 (2003)
3. Stockinger, H.: Distributed Database Management Systems and the Data Grid. In: Eighteenth IEEE Symposium on Mass Storage Systems and Technologies (MSS 2001), January 26 (2001)
4. Stockinger, H., Stockinger, K., Schikuta, E., Willers, I.: Towards a Cost Model for Distributed and Replicated Data Stores. In: Proceedings of the Ninth Euromicro Workshop on Parallel and Distributed Processing, pp. 461–467. IEEE Computer Society, Los Alamitos (2001)
5. Lamehamedi, H., Szymanski, B., Shentu, Z., Deelman, E.: Data Replication Strategies in Grid Environments. In: Proceedings 5th International Conference on Algorithms and Architectures for Parallel Processing, pp. 378–383. IEEE Computer Science Press, Los Alamitos (2002)
6. Slota, R., Skital, L., Nikolow, D., Kitowski, J.: Algorithms for Automatic Data Replication in Grid Environment. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 707–714. Springer, Heidelberg (2006)
7. Dutka, L., Palacz, B., Kitowski, J.: IEBS - Intelligent ExaByte Storage based on Grid Approach. In: CGW 2006 Proceedings (2006)
8. Mobile electronic Transactions: MeT White Paper on Mobile Ticketing. Mobile electronic Transactions Ltd (2003)
9. Blumson, S.: DLPS Security Alternatives. University of Michigan, http://www.umdl.umich.edu/pubs/dlpssecurity.html
10. Schutt, T., Schintke, F., Reinefeld, A.: Efficient Synchronization of Replicated Data in Distributed Systems. In: Sloot, P.M.A., Abramson, D., Bogdanov, A.V., Gorbachev, Y.E., Dongarra, J., Zomaya, A.Y. (eds.) ICCS 2003. LNCS, vol. 2657, pp. 274–283. Springer, Heidelberg (2003)
11. Dempsey, B.J., Weiss, D.: Towards an Efficient, Scalable Replication Mechanism for the I2-DSI Project. Tech. Rep. TR-1999-01, School of Information and Library Science, University of North Carolina, Chapel Hill, NC (April 1999)
12. Dempsey, B.J., Weiss, D.: On the Performance and Scalability of a Data Mirroring Approach for I2-DSI. University of North Carolina at Chapel Hill (1999); In: Plank, J., Dempsey, B.: Internet2 Network Storage Symposium (NetStore 1999), Seattle, WA, October 14-15 (1999)
Analysis of Distributed Packet Forwarding Strategies in Ad Hoc Networks
Marcin Seredynski (1), Pascal Bouvry (1), and Mieczyslaw A. Klopotek (2)
(1) University of Luxembourg, Faculty of Sciences, Technology and Communication, 6, rue Coudenhove Kalergi, L-1359, Luxembourg, Luxembourg, {marcin.seredynski, [email protected]}@uni.lu
(2) Polish Academy of Sciences, Institute of Computer Science, Ordona 21, 01-237 Warsaw, Poland, [email protected]
Abstract. Battery energy conservation and the prevention of selfish behavior are important, closely related issues in ad hoc networks. In this paper we demonstrate how a strategy-based packet forwarding approach can increase throughput in such networks and at the same time minimize the usage of resources of participating nodes. Each node is asked to use a strategy that defines the conditions under which packets are forwarded. Such a strategy is based on the notion of trust in, and the activity of, the source node of the packet. We demonstrate that nodes applying such a strategy use the network more efficiently than nodes not using such reputation-based packet forwarding. A genetic algorithm (GA) is applied to evolve good strategies, while a game-theoretical model of the ad hoc network is used to evaluate the strategies.
Keywords: Ad hoc networks, cooperation, game theory.
1 Introduction
A wireless mobile ad hoc network is composed of two or more devices (nodes) equipped with wireless communications and networking capability [1]. Such a network does not rely on any fixed architecture. Routing functionality is incorporated into the mobile nodes. Devices can communicate with each other directly only when they are within each other's radio range. Otherwise, intermediate nodes must be used to forward packets. As a result, nodes, besides sending their own packets, are also expected to forward packets on behalf of others. The topology of such a network may change quickly and unpredictably. Most of the devices participating in the network run on batteries [2] [3] [4]. As a result, the temptation to save energy, and therefore to act selfishly, might be very high. So-called self-policing mobile ad hoc networks [5] [6] [7] [8] [9] are a common solution to this problem. In such networks nodes are equipped with a reputation management system combined with a response mechanism. Each node keeps its own ratings of other network participants, based on its own experience and on reputation data received from other nodes. If a packet forwarding request comes from a node with a bad reputation, it is very likely to be discarded. Such an approach enforces cooperation because selfish nodes cannot use the network for their own purposes unless they contribute to packet forwarding. Energy can be saved either by discarding packets (instead of forwarding them) or by switching the wireless interface into sleep mode. The second option is by far the more effective way of saving battery energy [10]: power consumption in sleep mode is about 98% lower than in idle mode. The significantly higher idle power consumption reflects the cost of listening to the wireless channel. If a node wants to participate actively in the network, its network interface should be in idle mode, ready to receive traffic from its neighbors. Switching into sleep mode is very tempting, because such behavior goes unnoticed by other network participants: it is not possible to distinguish between a node that is in sleep mode and a node that has temporarily left the network. In order to force nodes to minimize the time spent in sleep mode, their activity has to be taken into account by the reputation management system. In this paper we address the problem of selfish behavior in self-policing ad hoc networks. Our approach aims at enforcing cooperation using a packet forwarding strategy. The decision whether to forward or discard a packet depends on the trust level in the source node of the packet and on its activity. Nodes using such an approach will spend less battery energy for sending the same number of packets than nodes not using it. We also propose a new game-theoretic model of the ad hoc network whose goal is to evaluate strategies. A GA is used to evolve the strategies. The paper is organized as follows. In the next section, related work is discussed. Then, in Section 3, we present our trust and activity evaluation mechanisms and describe the strategy-driven forwarding approach. This is followed by Section 4, where we present our game-based model of the ad hoc network. Next, in Section 5, we describe the evolution of strategies using the GA. Simulation results are presented in Section 6. The last section concludes the paper.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 68–77, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Related Work
A good survey of cooperation models with a game-theoretical analysis can be found in [8] [11] [12]. In [3] the authors present two techniques, watchdog and pathrater, that aim at improving the throughput of the network in the presence of selfish nodes. First, the watchdog mechanism identifies selfish nodes, and next, pathrater helps the routing protocol to avoid these nodes. Such mechanisms do not discourage nodes from selfish behavior, because selfish nodes are not excluded from the network. The authors show that in a network composed of 50 nodes, 20 of which are selfish, the proposed mechanisms can increase the throughput by 17%. In [6] the authors propose a mechanism called CONFIDANT, whose goal is to make selfish behavior unattractive. Both first-hand and second-hand observations are used. Similarly to CORE, packets coming from selfish nodes are not forwarded by normally behaving nodes. Additionally, if a selfish node behaves correctly for a certain amount of time, it may be re-integrated into the network. In [2] the authors present an economic approach to the problem. The network is modelled as a market in which a virtual currency called the nuglet is used. In such a network nodes have to pay for the packets they want to send and are paid when they forward packets coming from other nodes.
3 Strategy Based on Trust and Activity Levels
Evaluation of trust and path rating
We assume that each node uses an omni-directional antenna with the same radio range. A source routing protocol is used, which means that a list of intermediate nodes is included in the packet's header. In our model the reputation information is gathered only by nodes directly participating in the packet forwarding. Similarly to the watchdog mechanism proposed in [3], each node monitors the behavior of the next forwarding node. Reputation data are collected in the following way. Suppose that node A wants to send a packet to node E using intermediate nodes B, C, and D. If the communication is successful, node E receives the packet and all nodes participating in the forwarding process update reputation information about each other. If the communication fails (for example, node D decides to discard the packet), the event is recorded by the watchdog mechanism of node C. In such a case node C forwards an alert about the selfish node D to node B, and node B then forwards it to the source node A. If at some point node B would like to check how trustworthy node A is (using the collected reputation data concerning node A), it has to do the following: first calculate the fraction of packets correctly forwarded by node A (the forwarding rate) and then, using a trust lookup table (Figure 1a), obtain one of the four possible trust levels. For example, a forwarding rate of 0.95 results in trust level 3. If a source node has more than one path available to the destination, it will choose the one with the best rating, which is calculated as the product of all known forwarding rates of the nodes belonging to the route. An unknown node has a forwarding rate set to 0.5.
Evaluation of the activity level
Three activity levels are defined: low (LO), medium (ME) and high (HI). These levels are calculated using the same reputation data as used for trust evaluation.
To verify the activity level of a source node, an intermediate node calculates the average number of packets forwarded by all known nodes and compares it with the number of packets forwarded by the source node. The exact activity level is assigned using an activity lookup table (Figure 1b).
Coding the strategy
The decision whether to forward or discard a packet is determined by a strategy represented by a binary string of length 13. An example of a strategy is shown in Figure 1c. The exact decision is based on the trust and activity levels of the source node. There are 12 possible combinations of those levels. Decisions for each case are represented by bits no. 0-11. Bit no. 12 defines the behavior against an unknown node. Decision F means "forward the packet", while D stands for the opposite ("discard the packet"). For example, suppose that node B receives a packet originally coming from node A. Assuming that node B has trust level 3 in node A and that node A's activity is LO, then according to the strategy shown in Figure 1c the decision is to forward the packet to the next hop (F, bit no. 9). More advanced and refined algorithms for trust and activity calculation would additionally have to take into account several properties of the network (such as the security algorithms used in the network, social relations between nodes, etc.).
a) Trust lookup table, with fr(B,A) = pfA / psA
   fr(B,A)      TLBA
   1 - 0.9      3
   0.9 - 0.6    2
   0.6 - 0.3    1
   0.3 - 0      0
   fr(B,A): forwarding rate of node A; pfA: number of packets forwarded by node A; psA: number of packets sent to node A; TLBA: trust level of node B in node A
b) Activity lookup table
   packets forwarded                     ALA
   pfA > pfAVR + pfAVR * 0.2             HI
   pfA within pfAVR +/- pfAVR * 0.2      ME
   pfA < pfAVR - pfAVR * 0.2             LO
   pfA: number of packets forwarded by node A; pfAVR: average number of packets forwarded by all known nodes; ALA: activity level of node A
c) Coding of the strategy (D: discard intermediate packet, F: forward intermediate packet)
   bit no.     0   1   2   3   4   5   6   7   8   9   10  11  12
   trust       0   0   0   1   1   1   2   2   2   3   3   3   unknown player
   activity    LO  ME  HI  LO  ME  HI  LO  ME  HI  LO  ME  HI  -
   decision    D   D   D   F   F   F   D   D   D   F   D   D   F
Fig. 1. Computation of trust level (a), computation of activity level (b), coding of the strategy (c)
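The lookup tables and the strategy decoding described in this section can be expressed as a short Python sketch. The function names are hypothetical, the strategy is written as a 13-character string of "F" and "D" rather than as bits, and the handling of boundary values (for example, whether a forwarding rate of exactly 0.9 maps to level 3 or 2) is an assumption, since the lookup tables do not state it.

```python
def trust_level(forwarding_rate):
    """Map a forwarding rate in [0, 1] to a trust level 0..3 (Figure 1a).

    Assumption: boundary values belong to the higher level.
    """
    if forwarding_rate >= 0.9:
        return 3
    if forwarding_rate >= 0.6:
        return 2
    if forwarding_rate >= 0.3:
        return 1
    return 0


def activity_level(pf_a, pf_avr):
    """Return 0 (LO), 1 (ME) or 2 (HI) following Figure 1b."""
    if pf_a > pf_avr * 1.2:
        return 2
    if pf_a < pf_avr * 0.8:
        return 0
    return 1


def decide(strategy, trust=None, activity=None):
    """Look up 'F' (forward) or 'D' (discard) in a 13-character strategy."""
    if trust is None:
        return strategy[12]          # position 12: unknown source node
    return strategy[3 * trust + activity]


def path_rating(forwarding_rates):
    """Product of forwarding rates; None marks an unknown node (rate 0.5)."""
    rating = 1.0
    for fr in forwarding_rates:
        rating *= 0.5 if fr is None else fr
    return rating
```

With the strategy of Figure 1c written as `"DDDFFFDDDFDDF"`, the call `decide("DDDFFFDDDFDDF", trust_level(0.95), 0)` returns `"F"`, which agrees with the example in the text (trust level 3, activity LO, bit no. 9).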
4 Ad Hoc Network Game, Payoff Table, Tournament Scheme and Types of Players
Packet forwarding as Ad Hoc Network Game
We define an Ad Hoc Network Game as a game in which one node (also referred to as a player) originates a packet and some other nodes have to decide whether to forward or to discard it. The number of game participants depends on the length of the path leading from the source to the destination node. The game participants are the source node and all intermediate nodes. The destination node is not a part of the game. Each player is said to play his own game when being the source of a packet, and is said to be a participant of another player's game when being an intermediate node. After the game is finished, all its participants receive payoffs according to the decisions they made.
a) Payoffs of intermediate nodes (T: trust in the source node, A: activity of the source node, F: decision forward, D: decision discard, Rank: forwarding preference)
   T   A    payoff F   payoff D   Rank
   3   HI   14         0          1
   3   ME   9          3          4
   3   LO   5          7          7
   2   HI   12         1          2
   2   ME   8          4          5
   2   LO   4          8          8
   1   HI   10         2          3
   1   ME   7          5          6
   1   LO   3          9          9
   0   HI   2          10         10
   0   ME   1          12         11
   0   LO   0          14         12
b) Payoffs of source nodes
   transmission status   payoff
   success (S)           12
   failure (F)           0
c) Payoffs for the interface mode
   interface mode   payoff (per round)
   sleep            6
   idle             0
Fig. 2. Payoff tables for: intermediate nodes forwarding or discarding packets (a), source nodes originating packets (b), nodes staying in a sleep mode (c)
Payoff table and fitness function
The goal of payoffs is to capture the essential relations between alternative decisions and their consequences. Three payoff tables are defined for different kinds of activities: one for intermediate nodes forwarding packets (Figure 2a), one for the source node of the packet (Figure 2b), and one for time spent in sleep mode (Figure 2c). Payoffs received by intermediate nodes depend on their decisions (packet discarded or forwarded) and on their trust in, and the activity level of, the source node. Generally, the higher the trust and activity levels are, the higher the payoff received by the node forwarding the packet. A high trust level in the source node means that in the past this node has already forwarded some packets for the currently forwarding node, so it is more likely that such a node will be used as an intermediate node in the future. Forwarding for such a node might therefore be considered an investment of trust for future situations. On the other hand, when a node decides to discard a packet, it is rewarded for saving its battery life. When defining a payoff for forwarding or discarding packets, one has to decide on the preference order of particular combinations of trust and activity values. Our ranking of payoffs for forwarding and discarding packets is shown in Figure 2a, column "Rank". The highest payoff is received for forwarding packets for a source node with trust 3 and activity HI, then for trust 2 and activity HI, next for trust 1 and activity HI, etc. As for the source node, the exact payoff depends only on the status of the transmission. If the packet reaches the destination, the transmission status is denoted as S (success). Otherwise (the packet is discarded by one of the intermediate nodes), the transmission status is denoted as F (failure). Players also receive an additional payoff for the time (rounds) spent in sleep mode, as a reward for saving the battery. The fitness value of each player is calculated as follows:
$$\mathit{fitness} = \frac{tps + tpf + tpd}{ne}, \qquad (1)$$
where tps, tpf and tpd are the total payoffs received for sending own packets, forwarding packets on behalf of others, and discarding them, respectively, and ne is the number of all events (the number of own packets sent, packets forwarded and packets discarded).
Evaluation of strategies in a tournament
Each player uses a single strategy, which is evaluated in a tournament that represents an ad hoc network. Several players using different strategies play a number of Ad Hoc Network Games. A tournament is composed of R rounds. In every round each player is the source of a packet exactly once (plays its own game) and participates in packet forwarding several times. Destination and intermediate nodes are chosen randomly. Each node might decide to spend a certain number of rounds in sleep mode and, as a result, become unavailable for routing purposes. Whenever it wants to send its own packet, however, it switches to idle mode. The tournament itself can be described as follows:
Tournament scheme
Step 1: Specify i (source node) as i := 1, K as the number of players participating in the tournament and R as the number of rounds.
Step 2: Randomly select player j (destination of the packet) and intermediate nodes.
Step 3: For each available path calculate its rating (as described in Section 3) and select the path with the best reputation.
Step 4: Play the Ad Hoc Network Game.
Step 5: Update the payoffs of the source node i and of all intermediate nodes (game participants) that received the packet.
Step 6: Update the reputation data among all game participants.
Step 7: If i < K, then choose the next player i := i + 1 and go to Step 2. Else go to Step 8.
Step 8: If r < R, then r := r + 1 and go to Step 1 (next round). Else stop the tournament.
Types of players
We defined five different types of players participating in the network (tournament): normal players 1 and 2 (NP1, NP2), the low activity player (LAP), and two types of selfish players, SP1 and SP2.
Both types of normal players (NP1 and NP2) play according to their trust- and activity-based strategy. The only difference between them is the approach to entering sleep mode. NP2 is more focused on battery saving than NP1: NP2 goes into sleep mode as soon as over 40% of its packets reach the destination, while NP1 does so at the level of 90%. When those rates fall below the predefined thresholds, the player switches back to idle mode. The remaining types of players do not take into account any information concerning the source node when forwarding packets. LAP spends 25% of the time (rounds) in sleep mode and forwards all packets whenever it is awake. Selfish players (SP1 and SP2) are always awake. SP1 players discard all packets, while SP2 players forward packets with a probability of 0.5.
M. Seredynski, P. Bouvry, and M.A. Klopotek

Table 1. Types of players participating in the network

type of player        forwarding strategy                      sleeping approach
normal 1 (NP1)        based on trust/activity of src. node     sleep if succ. rate > 0.9
normal 2 (NP2)        based on trust/activity of src. node     sleep if succ. rate > 0.4
low activity (LAP)    always forward when awake                25% of time (rounds)
selfish 1 (SP1)       never forward                            never
selfish 2 (SP2)       forward with a probability of 0.5        never
All players try to send the same number of packets. Properties of each type of player are summarized in Table 1.
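The per-type rules of Table 1 can be sketched in a few lines of Python. This is only an illustration, not the authors' simulator: the function and dictionary names are hypothetical, the trust/activity decision of normal players is passed in as a ready-made boolean, and the 25% sleep share of LAP is modelled as an independent per-round probability, which is an assumption.

```python
import random

# Sleep thresholds for normal players, taken from Table 1.
# All names in this sketch are hypothetical.
SLEEP_THRESHOLD = {"NP1": 0.9, "NP2": 0.4}


def is_asleep(player_type, success_rate, rng=random):
    """Decide whether a player sits out the current round."""
    if player_type in SLEEP_THRESHOLD:
        # Normal players sleep while their success rate exceeds the threshold.
        return success_rate > SLEEP_THRESHOLD[player_type]
    if player_type == "LAP":
        # Assumption: 25% of rounds asleep, drawn independently per round.
        return rng.random() < 0.25
    return False  # SP1 and SP2 are always awake


def forwards(player_type, strategy_says_forward, rng=random):
    """Decide whether an awake player forwards a packet."""
    if player_type in ("NP1", "NP2"):
        # Outcome of the trust/activity-based strategy (not modelled here).
        return strategy_says_forward
    if player_type == "LAP":
        return True
    if player_type == "SP1":
        return False
    if player_type == "SP2":
        return rng.random() < 0.5
    raise ValueError(f"unknown player type: {player_type}")
```

Keeping the sleep rule and the forwarding rule in separate functions mirrors the two columns of Table 1, so either rule can be changed without touching the other.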
5 Evolution of the Behavior Using GA
Only strategies of normal players evolve; other types of players continue to use the same forwarding strategy. There are two populations of normal players, denoted POP1 (players NP1) and POP2 (players NP2). At the beginning, randomly generated strategies are assigned to each normal player. Then players are evaluated in a tournament as described in Section 4. During the evaluation, normal players have to compete with other types of players. Afterwards, selection and reproduction operators are applied to the population of strategies used by normal players (independently for POP1 and POP2). N pairs of strategies are selected using tournament selection. New strategies are obtained by applying crossover and mutation operators to each of the N selected pairs. A standard one-point crossover is used. One of the two strategies created by crossover is randomly selected for the next generation. Finally, a standard uniform bit-flip mutation is applied. As a result, a new population of strategies is created for each of the two populations. The process is repeated a predefined number of times. In each generation the number of players of each type participating in the tournament remains the same.
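One generation of the procedure above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: strategies are assumed to be lists of bits, the tournament size of 2 is an assumption, all function names are hypothetical, and the default probabilities are the values reported in Section 6.

```python
import random


def tournament_select(population, fitness, k=2, rng=random):
    """Return the fittest of k randomly drawn strategies (k=2 is an assumption)."""
    contenders = rng.sample(range(len(population)), k)
    best = max(contenders, key=lambda idx: fitness[idx])
    return population[best]


def one_point_crossover(a, b, rng=random):
    """Swap the tails of two equal-length bit lists at a random cut point."""
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def bit_flip_mutation(bits, p_mut, rng=random):
    """Flip each bit independently with probability p_mut."""
    return [1 - bit if rng.random() < p_mut else bit for bit in bits]


def next_generation(population, fitness, p_cross=0.9, p_mut=0.001, rng=random):
    """Build a new population: one child per selected pair of parents."""
    new_population = []
    for _ in range(len(population)):
        parent_a = tournament_select(population, fitness, rng=rng)
        parent_b = tournament_select(population, fitness, rng=rng)
        if rng.random() < p_cross:
            children = one_point_crossover(parent_a, parent_b, rng)
        else:
            children = (list(parent_a), list(parent_b))
        # Only one of the two offspring survives, chosen at random.
        child = rng.choice(children)
        new_population.append(bit_flip_mutation(child, p_mut, rng))
    return new_population
```

In this sketch `next_generation` would be called once for POP1 and once for POP2, since the two populations evolve independently.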
6 Experiments
Conditions of experiments
In each experiment the players compete in a network (tournament) composed of 50 players, but the numbers of players of the particular types differ between experiments. Settings of the tournament for each experiment are shown in Table 2. Both populations, POP1 and POP2, are composed of 50 players. The following GA parameters are used: crossover probability 0.9, mutation probability 0.001, number of generations 100. Each tournament is composed of 600 rounds. Unknown nodes are assigned a default trust value of 1. All experiments were repeated 60 times and the average value was taken as the result. Experiments were performed according to the following steps:
Step 1: Load SP1, SP2 and LAP players according to the values shown in Table 2.
Step 2: Randomly select players from POP1 and POP2 (according to the values shown in Table 2) and load them into the tournament.
Analysis of Distributed Packet Forwarding Strategies
Table 2. Number of players in the tournament in each experiment

                  experiment 1   experiment 2   experiment 3
number of NP1          25             15             10
number of NP2          25             15             10
number of SP1           0              5              5
number of SP2           0             10             20
number of LAP           0              5              5
Step 3: Play the tournament with the selected players as described in Section 4.
Step 4: If some players from POP1 or POP2 have not yet played in the tournament then go to step 1, otherwise go to step 5.
Step 5: Create a new POP1 and POP2 (as described in Section 5).
Step 6: Repeat steps 1-5 for the predefined number of generations.

Experimental results
The success rate of a player is defined as the percentage of its packets that reached the destination. All results were taken from the last generation of each experiment (average values per type of player). All types of players were trying to send the same number of packets. The performance of each type of player in experiment 1 is shown in Table 3. One can see that in a network composed of normal players only, there was a significant difference in the success rate between the two participating types of players. Although NP1 and NP2 were trying to send the same number of packets, NP2 managed to successfully send only 28.6%, which was a very bad result compared with the almost 83% sent by NP1. This was due to the fact that NP2 players were seen by NP1 as nodes with activity below the average. Strategies used by NP1 preferred nodes with at least an average activity level.

Table 3. Performance of various types of players in experiment 1

type of player   success rate   number of packets forwarded   sleep time (rounds)
NP1              82.7%          914                           5%
NP2              28.6%          615                           13%
If NP2 changed its sleeping threshold from 0.4 to 0.6, its success rate would improve from 28.6% to 46.8%. In the next two experiments the network was composed of all types of players. The results of experiment 2 are shown in Table 4. This time the success rate of NP2 slightly improved, while for NP1 it decreased significantly (from almost 83% to 47%). In order to achieve these results both types of players had to forward the same number of packets, but NP2 additionally spent 15% of the time in sleep mode, saving battery life. In order to decide which "sleeping" approach was better, one has to choose between sending more packets and saving more battery life. As for the other types of players, LAP managed to send 40% of its packets (the second best result). It was quite a costly effort: LAP nodes had to forward on average 2400 packets (NP1 and NP2 forwarded only 720 packets). On the other hand such players spent
Table 4. Performance of various types of players in experiment 2

type of player   success rate   number of packets forwarded   sleep time (rounds)
NP1              47%            720                           0%
NP2              34.5%          720                           15%
SP1              4.3%           0                             0%
SP2              10.5%          2839                          0%
LAP              40%            2400                          25%

Table 5. Performance of various types of players in experiment 3

type of player   success rate   number of packets forwarded   sleep time (rounds)
NP1              34.5%          1095                          0%
NP2              33.5%          1005                          7%
SP1              7.1%           0                             0%
SP2              16.8%          3404                          0%
LAP              29.3%          6191                          25%
25% of the time in sleep mode. The probabilistic forwarding strategy of SP2 turned out to be very unsuccessful. Those nodes forwarded the highest number of packets among all types of players and obtained a success rate of only 10.5%. In the network tested in experiment 3 more SP2 players were present. The results are shown in Table 5. The success rate of NP1 decreased again, while NP2 saw a very slight decrease. Both types of normal players had almost the same success rate, although overall the sleeping strategy of NP2 turned out to be more successful than that of NP1. While having almost the same success rate, NP2 managed to spend 7% of the time in sleep mode (NP1 did not sleep at all). As for the other types of players, all of them improved their success rate except LAP. The success rate of LAP was very close to that of normal players, but to reach this level those players had to forward on average 6191 packets (about 6 times more than normal players). The experiments showed that even when players using reputation-based packet forwarding strategies form only 40% of the population of the network (experiment 3), they still obtain the best results in terms of success rate and battery use. Other types of players had to spend far more energy on packet forwarding in order to be able to send their own packets. Such energy was often wasted when forwarding packets for nonreciprocal players like SP1.
7 Conclusions
Due to the limited energy of the batteries, nodes might be reluctant to share packet forwarding responsibilities. Such behavior might seriously threaten the existence of the network. In this paper we demonstrated how a strategy-driven packet forwarding approach based on the notions of trust and activity can increase the cooperation level in ad hoc networks. A fair contribution to packet forwarding alongside high activity is the only way for nodes to send a large
number of their own packets. However, the most efficient way to send own packets required the use of reputation-based forwarding strategies.
References
1. Ilyas, M., Mahgoub, I. (eds.): Mobile Computing Handbook. Auerbach Publications (2005)
2. Buttyan, L., Hubaux, J.P.: Nuglets: a virtual currency to stimulate cooperation in self-organized mobile ad hoc networks. Technical Report DSC/2001/001, Swiss Federal Institute of Technology (2001)
3. Marti, S., Giuli, T., Lai, K., Baker, M.: Mitigating routing misbehavior in mobile ad hoc networks. In: Proc. ACM/IEEE 6th International Conference on Mobile Computing and Networking (MobiCom 2000), pp. 255–265 (2000)
4. Michiardi, P., Molva, R.: Simulation-based analysis of security exposures in mobile ad hoc networks. In: Proc. European Wireless Conference (2002)
5. Buchegger, S., Boudec, J.Y.L.: The effect of rumor spreading in reputation systems for mobile ad-hoc networks. In: Proc. Workshop on Modeling and Optimization in Mobile, Ad Hoc and Wireless Networks (WiOpt 2003), pp. 131–140 (2000)
6. Buchegger, S., Boudec, J.Y.L.: Performance analysis of the confidant protocol. In: ACM 3rd International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc 2002), pp. 226–236 (2002)
7. Buchegger, S., Boudec, J.Y.L.: Self-policing mobile ad-hoc networks by reputation systems. IEEE Communications Magazine, Special Topic on Advances in Self-Organizing Networks 43(7) (July 2005)
8. Giordano, S., Urpi, A.: A self-organized and cooperative ad hoc networking. In: Basagni, S., Conti, M., Giordano, S., Stojmenovic, I. (eds.) Mobile Ad Hoc Networking, Wiley-IEEE Press (2004)
9. Michiardi, P., Molva, R.: Core: A collaborative reputation mechanism to enforce node cooperation in mobile ad hoc networks. In: Proc. IFIP 6th Conference on Security Communications, and Multimedia (CMS 2002), pp. 107–121 (2002)
10. Feeney, L., Nilsson, M.: Investigating the energy consumption of a wireless network interface in an ad hoc networking environment. In: Proc. The IEEE Conference on Computer Communications (INFOCOM 2001), pp. 1548–1557 (2001)
11. Felegyhazi, M., Hubaux, J.P.: Game theory in wireless networks: A tutorial. Technical report, EPFL - Switzerland (2006)
12. MacKenzie, A., DaSilva, L.A.: Game Theory for Wireless Engineers. Morgan & Claypool Publishers (2006)
Implementation and Optimization of Dense LU Decomposition on the Stream Processor
Ying Zhang, Tao Tang, Gen Li, and Xuejun Yang
School of Computer, National University of Defense Technology, 410073 Changsha, China
{zhangying,tangtao,ligen,yangxuejun}@nudt.edu.cn
Abstract. Developing scientific computing applications on the stream processor has attracted the attention of many researchers. In this paper, we implement and optimize dense LU decomposition on the stream processor. Unlike other existing parallel algorithms for LU decomposition, the StreamLUD algorithm aims at exploiting producer-consumer locality and at overlapping off-chip memory access with kernel execution. Simulation results show that, for matrices of different sizes, our optimized StreamLUD achieves a speedup of 2.56 to 3.64 over the LUD of HPL on an Itanium 2 processor.
Keywords: stream processor; LU decomposition; kernels; stream; producer-consumer locality; scientific computing.
1 Introduction
Of all emerging processor architectures, the stream processor has attracted the attention of many scientific computing researchers[1,2,3,4,5,6] because of its high parallel computing capacity, memory system, low power and low cost[7,8,9]. The stream processor differs fundamentally from conventional processors in both its programming model and its architecture. The stream programming model organizes applications at two levels: kernel level and stream level[10,11,12,13]. A kernel is a computation-intensive function that
Fig. 1. Block diagram of a stream processor
Fig. 2. Stream flowing across the memory system
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 78–88, 2008. © Springer-Verlag Berlin Heidelberg 2008
operates on sequences of records called streams. Each kernel takes streams of records as input and produces streams as output. Kernels are written in a C-like language called KernelC. The stream-level program declares the streams and defines the high-level control and data flow between kernels. Stream-level programs are written in a programming language extension called StreamC, intermixed with C++. The stream programming model exposes three levels of parallelism of the stream architecture to programmers: Instruction-Level Parallelism (ILP), Data-Level Parallelism (DLP) and Task-Level Parallelism (TLP). Both the organization of ALUs and the memory system of the stream architecture (as shown in figure 1) differ from those of a conventional architecture. There are N clusters controlled by a single controller to work in SIMD mode, and every cluster consists of M ALUs (in the stream processor prototype Imagine[14,15] developed by Stanford University, N = 8 and M = 6). The stream processor also has a three-level memory hierarchy: local register files (LRF) near the ALUs, exploiting locality within kernels; global stream register files (SRF), exploiting producer-consumer locality between kernels; and the streaming memory system, exploiting global locality. Figure 2 shows how a stream flows across the three levels of the memory hierarchy during the execution of a stream program. First, the stream is loaded from off-chip memory into SRF and distributed among buffers in SRF. Then it is loaded from SRF to LRF as operands of a kernel. After the kernel has finished, the input and output streams are stored back to SRF. If there is producer-consumer locality between this kernel and a later kernel, the result streams are kept in SRF. Otherwise, they are stored back to off-chip memory. Dense LU decomposition is representative of many dense linear algebra computations and a typical dense matrix factorization.
The kernel is widely used in a variety of solvers and eigenvalue computations, and radar cross-section applications typically require LU decomposition of large matrices. Additionally, LU decomposition is a typical kernel of many benchmark packages, such as SPLASH and NPB, and is the basic algorithm of LINPACK, the most popular benchmark for evaluating the sustained floating-point performance of supercomputers. Programming the stream processor for general-purpose use is hard, although the stream processor can provide high performance. In implementing StreamLUD (Stream LU Decomposition), organizing data as streams is an important and difficult task. Additionally, StreamLUD not only exploits the inherent parallelism, as existing parallel algorithms for LU decomposition do, but also aims at exploiting producer-consumer locality and at overlapping memory access with kernel execution. In this paper, we present our implementation of dense LU decomposition on the stream processor, which was originally aimed at media applications. We give the mapping method and our optimizations, and run our StreamLUD on Isim[14,15], a cycle-accurate stream simulator. Compared with an Itanium 2 processor, the stream processor achieves a speedup of 2.56 to 3.64 for matrices of different sizes.
The remainder of this paper describes and evaluates our implementation and optimization of dense LU decomposition on the stream architecture. In section 2 we describe our implementation of StreamLUD on the stream processor. In section 3 we evaluate and analyze the performance of StreamLUD on Isim. In section 4 we optimize our implementation specifically for the stream processor, and evaluate and analyze our optimizations. Section 5 discusses what must be done before the stream processor can support scientific computing applications well, and section 6 concludes.
2 Dense LU Factorization
The LU factorization has the form $A = P \times L \times U$, where $P$ is a permutation matrix, $L$ is lower triangular with unit diagonal elements, and $U$ is upper triangular. $L$ and $U$ are stored in $A$. The kernel is widely used in a variety of solvers and eigenvalue computations. Solving a system of equations requires $O(n^3)$ floating-point operations, more specifically $\frac{2}{3}n^3 + 2n^2 + O(n)$ floating-point additions and multiplications.

2.1 Algorithm
Our StreamLUD is based on the block-partitioned algorithm[16,17], which divides a dense n × n matrix A into an N × N array of B × B blocks (n = NB). Blocking is performed in our algorithm to improve the temporal locality in LRF and the producer-consumer locality in SRF, and to reduce the number of loop nests and the number of streams in the stream program, which consequently reduces the overheads of the stream program. The block size should be large enough to exploit the locality in LRF and SRF, and small enough to avoid overflowing the LRF. It should also be a multiple of the number of clusters in the stream processor, because all streams are consumed by clusters in SIMD mode. In our experiment, we choose a block size of 8, which is the number of clusters in our simulator, Isim[14,15]. The pseudo-code in figure 3(a), expressed in terms of streams and kernels, shows the most important steps in StreamLUD.

(a) StreamLUD
1. For k = 0 to N-1 do
2.   Update l_edge_in, u_edge_in, l_body_in, u_body_in, diagonal_block_T
3.   Factor edge:
       lu_edge_kc(in: u_edge_in, l_edge_in, diagonal_block_T;
                  out: u_edge_out, l_edge_out)
4.   Update the body using the corresponding edge records:
       lu_body_kc(in: u_edge_out, l_edge_out, l_body_in, u_body_in;
                  out: l_body_out, u_body_out)

(b) lu_edge_kc
1. Factor the first block of u_edge_in and record the diagonal elements
2. For k = 0 to (u_edge_length/block_size - 1) do
     Factor the remaining elements in u_edge_in.
3. For k = 0 to (l_edge_length/block_size) do
     Factor l_edge_in using the diagonal elements generated above.

(c) lu_body_kc
1. Factor l_body_in using the corresponding u_edge_out and l_edge_out
2. Factor u_body_in using the corresponding l_edge_out and u_edge_out

Fig. 3. Pseudo-code of StreamLUD and its kernels

The pseudo-code of kernel lu_edge_kc is shown in figure 3(b), and that of kernel lu_body_kc is shown in
Fig. 4. Simplified dataflow graph of a single iteration of StreamLUD
Fig. 5. Concrete computation of a single k iteration
figure 3(c). Figure 4 shows the dataflow graph of a single k iteration, and figure 5 shows symbolically the concrete computation corresponding to a single k iteration of the pseudo-code. In StreamLUD, we organize all computation as two kernels, lu_edge_kc and lu_body_kc, which process all input streams, as shown in figure 4. In every iteration, the first block_size × block_size records of u_edge_in are decomposed first, using the first block_size × block_size records of u_edge_in and stream diagonal_block_T. The diagonal elements of this block are recorded in step 1 of figure 3(b). Stream diagonal_block_T has the same records as the first block_size × block_size records of u_edge_in, but their record sequences differ. The former organizes the elements of one column one by one, then the next column, while the latter does the opposite, proceeding row by row. Then the remaining records of u_edge_in and l_edge_in are factorized in steps 2 and 3 of figure 3(b). Finally, streams l_body_in and u_body_in are updated with the corresponding records of l_edge_out and u_edge_out, as shown in steps 1 and 2 of figure 3(c).

2.2 Organizing Stream
The stream programming model exposes many architectural details to programmers, such as the use of SRF, off-chip memory, and clusters. Although this makes programming harder, it provides a consistent structure that is specialized for stream applications[18]. How streams are organized therefore directly determines program behavior and performance, so it is very important to organize streams efficiently. We now describe how streams are organized in StreamLUD. Figure 6 shows the stream organization of StreamLUD when decomposing a 48 × 48 matrix that is divided into a 6 × 6 array of 8 × 8 blocks. Of the blocked array, the lower triangular part is organized as stream l, one column after another; the upper triangular part, including the diagonal blocks, is organized as stream u, one row after another. The record length of stream l equals the row length of a block, and consequently the same row of a block lies on the same cluster; the record length of stream u equals 1 and the record sequence of blocks in stream u is row first, and consequently the same column of a block lies on the same cluster.
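The block ordering just described can be illustrated with a short Python sketch. It only enumerates block coordinates and says nothing about how records are laid out inside a block; the function name is hypothetical, and the top-to-bottom order within a block column (and left-to-right within a block row) is an assumption.

```python
def stream_block_order(num_blocks):
    """Return block coordinates (row, col) in the order they enter streams l and u."""
    # Stream l: blocks strictly below the diagonal, one block column after another.
    stream_l = [(r, c) for c in range(num_blocks) for r in range(c + 1, num_blocks)]
    # Stream u: diagonal and upper blocks, one block row after another.
    stream_u = [(r, c) for r in range(num_blocks) for c in range(r, num_blocks)]
    return stream_l, stream_u


# The 48 x 48 example with 8 x 8 blocks gives a 6 x 6 array of blocks.
matrix_size, block_size = 48, 8
stream_l, stream_u = stream_block_order(matrix_size // block_size)
print(len(stream_l), "blocks in stream l,", len(stream_u), "blocks in stream u")
```

Every block of the 6 × 6 array belongs to exactly one of the two streams, which is what allows the per-iteration edge and body streams to be derived as sub-ranges of stream l and stream u.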
Fig. 6. Simplified graph of data stream organization
Our stream organization eases programming and improves program performance. First, the streams l_edge_in, u_edge_in, l_body_in, u_body_in and diagonal_block_T, which are all parts of stream l or stream u, can easily be derived from stream l or stream u when they are updated. Secondly, when streams l_edge_in and u_edge_in are factorized, records that are processed by identical operations lie on different clusters, which fits SIMD mode perfectly. Thirdly, when l_body_in and u_body_in are updated, the dot product of the corresponding row of l_edge_out and the corresponding column of u_edge_out must be computed. This stream organization makes the rows and columns participating in the same dot product lie on the same cluster. As a result, the dot product can be computed without collecting any data from other clusters, and consequently the number of inter-cluster communications is reduced.
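To make the edge and body steps concrete, the following is a minimal Python sketch of a right-looking blocked LU factorization. It is not the StreamLUD code: it works on plain nested lists rather than streams, it omits pivoting (an assumption that is only safe for matrices such as diagonally dominant ones), and the function names are hypothetical. The edge steps correspond loosely to lu_edge_kc and the body update, with its row-by-column dot products, to lu_body_kc.

```python
def blocked_lu(a, block):
    """Factor the n x n matrix a in place into L and U (no pivoting).

    L (unit diagonal, not stored) ends up below the diagonal and U on and
    above it. n must be a multiple of block.
    """
    n = len(a)
    for k0 in range(0, n, block):
        k1 = k0 + block
        # Factor the diagonal block with unblocked elimination.
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                a[i][k] /= a[k][k]
                for j in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]
        # U edge: forward substitution on the block row right of the diagonal block.
        for j in range(k1, n):
            for i in range(k0, k1):
                for p in range(k0, i):
                    a[i][j] -= a[i][p] * a[p][j]
        # L edge: solve against the upper factor of the diagonal block.
        for i in range(k1, n):
            for j in range(k0, k1):
                for p in range(k0, j):
                    a[i][j] -= a[i][p] * a[p][j]
                a[i][j] /= a[j][j]
        # Body: subtract dot products of L-edge rows and U-edge columns.
        for i in range(k1, n):
            for j in range(k1, n):
                a[i][j] -= sum(a[i][p] * a[p][j] for p in range(k0, k1))
    return a


def lu_product(lu):
    """Multiply the unit-lower factor L by the upper factor U stored in lu."""
    n = len(lu)
    return [[sum((1.0 if p == i else lu[i][p]) * lu[p][j]
                 for p in range(min(i, j) + 1))
             for j in range(n)] for i in range(n)]
```

Calling `lu_product(blocked_lu(copy_of_a, 8))` on a copy of a well-conditioned matrix should reproduce the original matrix up to rounding, which is a convenient way to check the sketch.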
3 Experimental Setup and Performance Evaluation
In our experiments, we use Isim[14,15], a cycle-accurate stream processor simulator supplied by Stanford University, to measure the performance of StreamLUD. The baseline configuration of the simulated stream processor and memory system is detailed in table 1, and is used for all experiments unless noted otherwise. For comparison, LU decomposition from HPL is compiled with Intel's IA64 compiler (with optimization options -O0, -O1, -O2 and -O3) and run on an Itanium 2 processor whose baseline parameters are listed in table 2. The best performance results on the Itanium 2 processor are chosen for comparison with the StreamLUD results. Table 3 compares the performance of LU decomposition of matrices of different sizes on Itanium 2 and on Isim, and lists the corresponding speedup of Isim over Itanium 2. The speedup is always greater than 1, which means that for dense LU decomposition the stream processor achieves much better performance. For 64 × 64 and 96 × 96 matrices, the speedup grows with the matrix size. This is because the input and output streams of StreamLUD all fit in SRF, which means that running kernels get all streams participating in the computation from SRF rather than from off-chip memory. For 128 × 128, 160 × 160 and 192 × 192 matrices, the speedup over Itanium 2 gets smaller and smaller. This is because the input and output streams of the corresponding StreamLUD are larger than SRF and double buffering occurs. In that case, while a kernel is running, only a part
Table 1. Baseline parameters of Isim

Parameter                        Value
Number of clusters               8
Operating frequency              500MHz
Capacity of LRF                  9.6KB
Capacity of SRF                  128KB
Bandwidth of LRF                 544GBps
Bandwidth of SRF                 32GBps
Bandwidth of off-chip DRAM       2.67GBps
Table 2. Baseline parameters of the Itanium 2 processor

Parameter                                      Value
Operating frequency                            1.5GHz
Capacity of Data Cache level 1                 16KB
Capacity of Instruction Cache level 1          16KB
Capacity of Data/Instruction Cache level 2     256KB
Capacity of Data/Instruction Cache level 3     4MB
Table 3. Performance of LU decomposition of matrices with different sizes on Itanium 2 and on Isim (ms) and the speedup

Matrix Size    Itanium 2    Isim        Speedup
64 × 64        0.31171      0.014973    2.082
96 × 96        0.78165      0.032339    2.417
128 × 128      1.54545      0.068209    2.266
160 × 160      2.60274      0.120907    2.153
192 × 192      4.1098       2.81449     1.460
of some stream is in SRF. After this part has been consumed by the corresponding kernel, the stream processor stalls to wait for another part of the stream, which is being loaded from off-chip memory to SRF by double buffering. These kernel stalls and the double buffering degrade StreamLUD performance. We now take the 128 × 128 matrix as an example to discuss the details of StreamLUD on Isim. Figure 7 depicts cluster and memory system occupancy. The leftmost column is the cycle number. The CLUSTERS column shows which kernels were running
Fig. 7. Cluster and memory system occupancy

Fig. 8. Main parts active time and program run time (bars: MC, off-chip memory, run time)

Fig. 9. Traffic of off-chip memory, SRF and LRF (read, write, read-write)
during which cycles, and the MEMORY STREAM columns show the streams passing between Isim and its off-chip memory. Note that because, in our algorithm, the input streams of kernel lu_edge_kc in iteration k depend on the output streams of kernel lu_body_kc in iteration k-1, there is little overlap between kernel execution and off-chip memory access, except when double buffering occurs. But double buffering generates a lot of kernel stalls. So, StreamLUD should be optimized to avoid this problem. Figure 8 shows the run time of StreamLUD, the off-chip memory access time, and the microcontroller (MC) active time, which equals the total kernel run time. MC active time and off-chip memory access time account for large shares of the run time, 45% and 61% respectively. There is little overlap between them, only 6% of the total run time. Read, write and total traffic of off-chip memory, SRF and LRF are shown in figure 9. Their ratio is 1 : 2.7 : 26 with respect to total traffic. Temporal and spatial locality in LRF is well exploited in our StreamLUD, but locality in SRF is exploited poorly. So we should optimize our StreamLUD to improve its SRF locality.
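As a quick consistency check on these shares (a sketch that assumes all three percentages are fractions of the same total run time), inclusion-exclusion gives the fraction of the run time during which the microcontroller or the off-chip memory is active:

```latex
\[
  f_{\mathrm{MC} \cup \mathrm{mem}}
  = f_{\mathrm{MC}} + f_{\mathrm{mem}} - f_{\mathrm{overlap}}
  = 0.45 + 0.61 - 0.06 = 1.00
\]
```

Under that assumption the two activities together cover the whole run time while running concurrently only 6% of the time, which is why overlapping them is a natural optimization target.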
4 Implementation Optimizations
Imagine provides two methods, unroll and pipeline, to optimize kernels, and two methods, doUnroll and doSoftwarePipeline, to optimize stream-level programs[12,13]. The implementation above has already been optimized by these methods. Next, we describe the optimizations we applied specifically to StreamLUD. From figure 9, we can see that the SRF locality of StreamLUD isn't exploited well. From figure 7 we can see that off-chip memory access seldom overlaps kernel execution. Off-chip memory access is on the critical path of StreamLUD. So, we optimize our StreamLUD in these two ways.

4.1 Exploiting Producer-Consumer Locality in SRF
There is a dependency between l_body_out of iteration k − 1 and l_edge_in of iteration k, and between u_body_out of iteration k − 1 and u_edge_in of iteration k. So, l_edge_in and u_edge_in of iteration k can't be loaded into SRF until l_body_out and u_body_out have been saved to off-chip memory, as shown in figure 7. Consequently, kernel lu_edge_kc can only run after l_body_out and u_body_out of iteration k − 1, and l_edge_in and u_edge_in of iteration k, have been transferred between SRF and off-chip memory. As a result, memory access latency cannot be overlapped with the execution of kernel lu_edge_kc of iteration k or kernel lu_body_kc of iteration k − 1. As shown in figure 5, in iteration k the input streams l_edge_in and l_body_in are parts of the output stream l_body_out produced by iteration k − 1, and the input streams u_edge_in and u_body_in are parts of the output stream u_body_out produced by iteration k − 1. If the output streams of iteration k − 1 are organized as l_edge_in_next, l_body_in_next, u_edge_in_next and u_body_in_next, they do not have to be saved to off-chip memory, but can remain in SRF as the input streams of iteration k.
The dataflow graph of two neighboring iterations of optimized StreamLUD is shown in figure 10. From the figure, we can see that the output streams of kernel lu_body_kc need not be saved to off-chip memory but are kept in SRF as input streams of kernel lu_edge_kc of the next iteration, except stream u_edge_in_next. Because the input stream diagonal_block_T of the next iteration depends on the output stream u_edge_in_next of the previous iteration, stream u_edge_in_next must not only be kept in SRF but also be written to off-chip memory. The performance results of the SRF locality optimization are shown in figure 11. Figure 12 shows off-chip memory traffic with and without our optimization. Exploiting SRF locality improves program performance greatly. Over the Itanium 2 processor, the speedup achieved by the optimized StreamLUD ranges from 2.33 to 3.30. This optimization not only cuts down memory accesses but also reduces the SRF space needed by StreamLUD, because between adjacent iterations the output streams of the previous iteration and the input streams of the next iteration are the same. Over the unoptimized program, the speedup achieved by the optimization grows with the matrix size. For 64 × 64 and 96 × 96 matrices, exploiting SRF locality sharply cuts down off-chip memory access, which accounts for a large share of total program run time. For the other matrices, exploiting SRF locality not only cuts down off-chip memory access but also reduces double buffering. For the 128 × 128 matrix, there is even no double buffering during StreamLUD execution.

4.2 Balancing Off-Chip Memory Access and Kernel Execution
As discussed above, while the first block_size × block_size records of u_edge_in are factorized, all records of the input stream diagonal_block_T are the same as the first block_size × block_size records of u_edge_in, but the record sequences of
Fig. 10. Dataflow graph of two neighboring iterations

Fig. 11. Performance results of exploiting SRF locality optimization (speedup over Itanium and speedup over unoptimized, for matrix sizes 64 to 192)

Fig. 12. Comparison of off-chip memory traffic with and without our optimization (matrix sizes 64 to 192)
Fig. 13. Performance results of balancing off-chip memory access and kernel execution (speedup over Itanium and speedup over unoptimized, for matrix sizes 64 to 192)

Fig. 14. Comparison of off-chip memory traffic with and without our optimization (matrix sizes 64 to 192)
them are quite different. The former organizes the elements of one column one by one, then the next column; the latter, on the contrary, organizes the elements of one row one by one, then the next row. In the optimization that balances off-chip memory access and kernel execution, the input stream diagonal_block_T is eliminated. When kernel lu_edge_kc runs, the records of stream diagonal_block_T can be supplied by fetching the first block_size × block_size records of u_edge_in from other clusters via inter-cluster communication. So, with this optimization, the output stream u_edge_in_next of the previous iteration need not be saved to off-chip memory. Additionally, communication between clusters is far faster than off-chip memory access, so this optimization can improve program performance. The results of balancing off-chip memory access and kernel execution are shown in figure 13, and figure 14 shows off-chip memory traffic with and without our optimization. Balancing off-chip memory access and kernel execution improves program performance greatly. Over the Itanium 2 processor, the speedup achieved by the optimized StreamLUD ranges from 2.56 to 3.64. Over the unoptimized StreamLUD, the speedup achieved by our optimization grows with the matrix size, except for the 192 × 192 matrix. This is because in the unoptimized program the records of stream diagonal_block_T are produced by the previous iteration, so stream u_edge_in_next must be saved to SRF. Consequently, saving stream u_edge_in_next and loading stream diagonal_block_T can't be overlapped with any kernel execution. After the optimization, kernel lu_edge_kc no longer needs stream diagonal_block_T, which cuts down off-chip memory access, and the SRF space needed by StreamLUD is also reduced. For the 160 × 160 matrix, there is even no double buffering during StreamLUD execution.
But for the 192 × 192 matrix, double buffering still occurs, which degrades program performance.
5
Discussion
StreamLUD achieves good performance on the stream processor compared with LUD from HPL on the Itanium 2 processor. However, before scientific computing applications can be handled well on the stream processor, some key issues must be addressed:
– Allocating and using the SRF effectively is an urgent problem, because how the SRF is employed is the key to program performance.
– When a large matrix is processed, double buffering occurs. It is therefore worthwhile to improve program performance for large input and output streams that exceed the capacity of the SRF. Although Imagine supplies stripmining to solve this problem, it is a semi-automatic method whose effect depends heavily on the programmer.
6
Conclusion
In this paper, we implement and optimize dense LU decomposition, which is widely used in scientific computing applications, on a stream processor designed for media applications. Optimizing the StreamLUD program improves its performance on the stream processor by exploiting its producer-consumer locality and by balancing off-chip memory access and kernel execution. Measurements of StreamLUD implemented and optimized on Isim show that the architecture can run scientific applications successfully. Across the matrix sizes tested, the stream processor achieves a speedup from 2.56 to 3.64 over the Itanium 2 processor. Although the stream processor achieves good program performance, a great deal of work remains to make this architecture fit scientific applications well, as discussed above.
An Adaptive Interface for the Efficient Computation of the Discrete Sine Transform

Pedro Alonso, Miguel O. Bernabeu, and Antonio-Manuel Vidal-Maciá

Universidad Politécnica de Valencia, Camino de Vera s/n, 46022 Valencia, Spain
{palonso,mbernabeu,avidal}@dsic.upv.es
Abstract. This paper presents an easy-to-use interface for applying the Discrete Sine Transform (DST) to a vector. This transform is an FFT-related routine that is frequently used in many applications, such as the translation of one class of structured matrices into another. The ease of use of the interface has been achieved by exploiting features of the Fortran 90/95 programming language. In addition, a technique has been incorporated to resolve the performance breakdown that arises when the vector size cannot be decomposed into small prime numbers. This breakdown stems from the divide-and-conquer type of algorithms used when applying the DST.
1
Introduction
This paper presents a package for computing the type I Discrete Sine Transform (also Sine–I Transform, DST–I, or simply DST) [1] of an n–array x, y = Sx, where

$$ S = [S_n]_{jk} = \alpha \sin\frac{2\pi k j}{n+1} , \tag{1} $$

for k, j = 0, . . . , n − 1. If α = 1 the transformation is termed unnormalized; otherwise, if α = 2/(n + 1), the transformation is termed normalized [2].

The DST has a wide range of applicability, both in signal processing and in the numerical solution of partial differential equations. One of its most important applications is the mapping of Toeplitz matrices into the class of Cauchy–like matrices. This mapping has recently been used to build successful parallel routines for solving linear symmetric [3] and hermitian [4] Toeplitz systems.

It is well known that the DST is closely related to the Fast Fourier Transform (FFT) [1], so the computational cost of applying the DST to an array is O(n log n) operations. Many routine libraries, using differing algorithms, are available for computing the FFT of a vector. Algorithms based on divide-and-conquer methods attain a cost of O(n log2 n) operations only if the size of the problem is a power of two (radix–2); otherwise, they perform very poorly. Other radices p (p ≠ 2) can be used, but the performance inevitably degrades in a manner inversely proportional to the size of the largest prime number in the prime decomposition of n.

Most available FFT packages do not have a specific routine for applying the DST to an n–vector, perhaps because the DST can be obtained by applying a real-data FFT to an expanded vector of size 2n + 2. However, some performance and ease-of-use advantages are lost with this omission. The fftpack is one of the best known packages that includes a DST routine [5]. The routine runs fast as long as the largest prime number into which n + 1 is decomposed is 'small' (the concept of small will be set experimentally); otherwise, the cost can be very high. To solve the problem arising when the largest prime number is itself large, a technique called Chirp–z factorization [1], as used for the FFT, has been adapted to the DST case.

The fftpack and the mentioned factorization technique use additional workspaces and intermediate FFTs that make their use fairly unintuitive. Thus, a software module has been developed that makes applying a DST to a vector v as easy as typing dst(v); that is, in the same way as in Matlab or Octave. Our module is written in Fortran 90/95 [6], which enables routine calls to be made easily. Moreover, the package selects at runtime, with negligible overhead, the better of the fftpack routine and the Chirp–z factorization. In addition, the nontrivial problem of using the package in multithreaded contexts has been addressed.

The following section deals with the Chirp–z factorization. In Section 3 the basics of the Fortran 90/95 module are outlined. The routine selection tool is explained in Section 4 and the installation process is shown in Section 5. Section 6 shows a real example involving use of the module under differing conditions. Finally, conclusions and further research lines are presented.

Supported by the Directorate of Research and Technology Transfer of the Valencian Regional Administration under grant number GV06/091. Supported by Spanish MCYT and FEDER under Grant TIC 2003-08238-C02-02.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 89–98, 2008. © Springer-Verlag Berlin Heidelberg 2008
2
The Chirp–z Factorization of the DST
It is possible to convert the DST (1) into a convolution of expanded length. The central idea can be found in Bluestein (1970), and is very well explained by C. Van Loan in [1]. The exposition in [1] is a reformulation called the Chirp–z factorization of the FFT, which is extended in this paper to the case of the DST. Taking Theorem 4.2.5 of [1] and changing dimensions as needed, the FFT given by the matrix

$$ F = [F_{2(n+1)}]_{jk} = \exp\left( -\frac{2\pi k j\, i}{2(n+1)} \right) , $$

where k, j = 0, . . . , 2n + 1 and $i = \sqrt{-1}$, can be factorized as

$$ F = \Sigma T \Sigma , \tag{2} $$

where

$$ \Sigma = \mathrm{diag}\left( \sigma_0 , \ldots , \sigma_{2n+1} \right) , \tag{3} $$

and T is the following Toeplitz matrix:

$$ T = \begin{pmatrix}
\tilde{\sigma}_0 & \tilde{\sigma}_{-1} & \cdots & & \tilde{\sigma}_{-(2n+1)} \\
\tilde{\sigma}_1 & \tilde{\sigma}_0 & \tilde{\sigma}_{-1} & & \vdots \\
\vdots & \tilde{\sigma}_1 & \ddots & \ddots & \\
 & & \ddots & \ddots & \tilde{\sigma}_{-1} \\
\tilde{\sigma}_{2n+1} & \cdots & & \tilde{\sigma}_1 & \tilde{\sigma}_0
\end{pmatrix} , \tag{4} $$

where the tilde denotes the conjugate, and

$$ \sigma_j = \exp\left( -\frac{\pi j^2 i}{2(n+1)} \right) , $$

for j = −(2n + 1), . . . , −1, 0, 1, . . . , 2n + 1.

Thus, the computation of a DST involves a convolution operation, expressed as the product of T by a vector. The convolution can be evaluated using FFT techniques, giving yet another computational framework. Given the block form of the FFT matrix presented in [1] through Theorems 4.4.1-2, the DST y of a vector x of size n can be computed by means of an FFT,

$$ \begin{pmatrix} y_0 \\ y \\ z_0 \\ z \end{pmatrix} = \frac{i}{2}\, F_{2(n+1)} \begin{pmatrix} 0 \\ x \\ 0 \\ -\bar{x} \end{pmatrix} , \tag{5} $$

where $\bar{x}$ represents the vector x with its elements in reverse order. Scalars $y_0$, $z_0$ and vector z are irrelevant entries. Applying the Chirp–z factorization to the computation of $F_{2(n+1)}$ in (5) yields the Chirp–z factorization for the DST.

Summarizing, given a vector x of size n, the problem of computing the DST y = Sx can be solved by computing a convolution composed of two FFTs and one inverse FFT of size $m = 2^j$, where j is chosen such that m is the minimum value greater than 4n + 3, with 4n + 3 being the minimum order of the circulant matrix containing T used in the convolution.
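The summary above can be illustrated with a small sketch. It is not the authors' Fortran code: the function names are hypothetical, a plain recursive radix-2 FFT stands in for the fftpack routines, and the DST is assumed to use the usual unnormalized DST-I kernel sin(πjk/(n+1)) with j, k = 1, . . . , n, which is the kernel that identity (5) produces. The transform of length 2(n+1) is evaluated through the factorization F = ΣTΣ, with the product by T computed as a circular convolution of power-of-two size m > 4n + 3.

```python
import cmath
import math


def fft_pow2(a, inverse=False):
    """Recursive radix-2 FFT; len(a) must be a power of two (unscaled)."""
    m = len(a)
    if m == 1:
        return list(a)
    even = fft_pow2(a[0::2], inverse)
    odd = fft_pow2(a[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * m
    for k in range(m // 2):
        w = cmath.exp(sign * 2j * math.pi * k / m) * odd[k]
        out[k] = even[k] + w
        out[k + m // 2] = even[k] - w
    return out


def chirpz_dft(u):
    """DFT of arbitrary length N through F = Sigma * T * Sigma."""
    N = len(u)
    m = 1
    while m <= 2 * N - 1:          # smallest power of two greater than 2N - 1
        m *= 2
    sigma = [cmath.exp(-1j * math.pi * j * j / N) for j in range(N)]
    a = [sigma[k] * u[k] for k in range(N)] + [0j] * (m - N)
    b = [0j] * m                   # first column of the circulant embedding T
    for d in range(N):
        b[d] = sigma[d].conjugate()
        b[-d] = sigma[d].conjugate()   # sigma_{-d} equals sigma_d
    fa, fb = fft_pow2(a), fft_pow2(b)                            # two FFTs
    c = fft_pow2([p * q for p, q in zip(fa, fb)], inverse=True)  # one inverse
    return [sigma[j] * c[j] / m for j in range(N)]


def dst_chirpz(x):
    """Unnormalized DST-I of x through identity (5)."""
    n = len(x)
    u = [0.0] + list(x) + [0.0] + [-t for t in reversed(x)]
    w = chirpz_dft([complex(t) for t in u])   # length 2(n+1), so 2N - 1 = 4n + 3
    return [(0.5j * w[j]).real for j in range(1, n + 1)]


def dst_naive(x):
    """Direct O(n^2) evaluation, used only as a reference."""
    n = len(x)
    return [sum(x[k - 1] * math.sin(math.pi * j * k / (n + 1))
                for k in range(1, n + 1)) for j in range(1, n + 1)]
```

Comparing dst_chirpz with dst_naive on short vectors, including one whose n + 1 is prime, is a simple way to check the sketch.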
3
The DST Fortran 90/95 Module
The goal of the DST module is to make it easy for the programmer to compute the DST of a vector. Applying the DST to a given vector v is as easy as typing dst(v), that is, as easy as in Matlab or Octave. The solution is accomplished by means of a Fortran 90/95 module. The use of Fortran 90/95 [7] makes this kind of simple interface possible, as in other numerical libraries such as LAPACK95 [8], which is really an interface for LAPACK [9]. The Fortran module hides all the implementation details and enables the optimal computational method. The essential point of the module is that it stores state properties from one call to a public routine to the next by means of static (save in Fortran) variables.

The main routine of the module is called dst, and it is the only public routine. The interface is dst(v,n_). The first argument is the array whose DST is to be computed. The second argument is an optional integer representing the size of the array. If the second argument is not present, the size of the problem is taken from the array by means of the Fortran intrinsic function size. Then, an auxiliary array of size size(v)+1 is allocated, because the fftpack routine requires on entry an array of n + 1 entries. Alternatively, the user can call dst with an array of size n + 1 and set the second argument to n_= n, so that the routine works on v directly and avoids using the auxiliary array. In any case, the auxiliary array is allocated only once, because the module keeps it for reuse in further calls.

One of the main advantages of this module is that the initialization step performed before the DST computation is automatically carried out only at the first call to dst, as long as the following calls to dst use the same vector size. Thus, one of the state variables is the actual size (actual_size). The first call to dst performs a call to the fftpack initialization routine dsinti( n, wsave ), where n is the size of the DST and wsave is a workspace required by the routine. The workspace wsave is allocated at this first call and stored in a saved array for further calls as long as the size of the transformation does not change. With the module, the workspaces are thus completely hidden from the user. Further calls only invoke the fftpack routine dsint( n, v, wsave ) to apply the DST to v.

The same technique is used when the Chirp–z factorization is selected for the computation of the DST. The advantages of this initialization stage are then even clearer.
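The idea of keeping the initialized workspace between calls can be sketched as follows. This is only an illustration under stated assumptions: the class DstModule and its members are hypothetical names, an instance plays the role of the Fortran module with its save variables, a precomputed sine table stands in for the workspace filled by dsinti, and the kernel is assumed to be the unnormalized sin(πjk/(n+1)).

```python
import math


class DstModule:
    """Holds the state that the Fortran module keeps in `save` variables."""

    def __init__(self):
        self.actual_size = None   # size used by the previous call
        self.wsave = None         # workspace kept between calls
        self.init_calls = 0       # added only to make the caching observable

    def _init(self, n):
        # Stand-in for dsinti: precompute the sine table once per size.
        self.init_calls += 1
        self.wsave = [[math.sin(math.pi * j * k / (n + 1))
                       for k in range(1, n + 1)] for j in range(1, n + 1)]
        self.actual_size = n

    def dst(self, v, n_=None):
        """Unnormalized DST-I of the first n entries of v."""
        n = len(v) if n_ is None else n_
        if n != self.actual_size:     # first call, or the size changed
            self._init(n)
        return [sum(row[k] * v[k] for k in range(n)) for row in self.wsave]
```

Two consecutive calls with vectors of the same length trigger a single initialization; a call with a different length triggers a new one.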
The initialization stage, carried out at the first call to dst as in the other case, is made up of the following tasks:
– array allocations,
– computation of the size of the FFT that will be used (as shown at the end of Section 2),
– initialization of the array Σ (3),
– initialization of the FFT that will be used in the convolution of matrix T (4) with other arrays.

Other fftpack routines, such as dcffti for the initialization of a workspace, dcfftf (forward FFT) and dcfftb (backward FFT), are used in this case. The workspaces needed by these routines are allocated and saved for reuse in following calls.

Another important task performed by the initialization stage on the first call to dst is deciding which routine to use to perform the DST: the fftpack routine (dsint) or the Chirp–z factorization based routine (ChirpDST). This decision is made by an external routine (fftpack_or_ChirpDST), which uses information gathered at installation time (Section 4). Once the selection is made, it is stored in a boolean state variable (use_fftpack) to be used in further calls. The following scheme of the dst routine gives a clear view of the essentials of the process.

    ! (re)initialize only when the problem size changes
    if( actual_size .ne. n ) then
      actual_size = n
      use_fftpack = (fftpack_or_ChirpDST(n) .eq. 1)
      if (use_fftpack) then
        call init_dsint(n)
      else
        call init_ChirpDST(n)
      end if
    end if
    ! apply the routine selected for this size
    if (use_fftpack) then
      call apply_dsint(n, v)
    else
      call apply_ChirpDST(n, v)
    end if

The routine decides whether to run the initialization process by comparing the size stored in the module (actual_size) with the size n of the vector received as actual argument. Obviously, repeated calls with the same array size use the module more efficiently than calls with alternating sizes.

Furthermore, the use of the module from C source code has been considered. Arrays passed from C code to Fortran 90/95 do not have the same attributes as Fortran ones, because Fortran 90/95 cannot take the array shape from an array allocated by C. The solution consists in implementing the Fortran routine cdst(n,v), where n is the integer representing the size and v is the array to be transformed. The only task of cdst is to call dst. It can be called from any C code in the usual way, cdst_( n, v ), taking into account the familiar underscore.

Features found in Fortran 90/95, such as the use of hidden workspaces, have been used to create user-friendly and efficient code. However, the use of these workspaces caused a thread-safety problem in the dst subroutine. Calls from concurrent threads returned wrong results because they accessed the same workspace concurrently (dst is a non-reentrant routine).
Protecting the workspaces with critical sections is not a solution: when the routine is called from different threads with different problem sizes, it would only work well with one problem size at a time. In addition, we noticed that the initialized data is specific to the problem size, so sequences of calls with different n (e.g. calls with alternating problem sizes n = 2000, n = 1000, n = 2000, . . . ) produced unnecessary initialization work and memory allocation/deallocation overheads.

To ensure thread safety and avoid unnecessary workload, we have extended the model of the Fortran 90/95 module private data presented in Section 3. Once the routine detects a change in the problem size, a new workspace for the new size n is created and initialized without discarding the previous ones. That is, a new structure made of pointers to real data workspaces and simple variables is created, so future calls with size n will find the structure already initialized. This feature has been implemented by means of a linked list. Starting from an empty list, each time a new problem size is detected, a new node is created and added to the list. The nodes are identified by n. Each contains the pointers, initially null, needed to allocate and initialize the data, plus a boolean variable indicating whether dsint or ChirpDST will be used for that problem size.

This improvement still does not solve the contention of concurrent data access, because the linked list is located in the heap of the process, a memory space common to all the threads. To fix this problem, every time a thread calls dst, and prior to the computation itself, it allocates the necessary memory in its own stack by copying the initialized data from the heap. Note that only arrays that are both read and written during the computations (i.e. wsave in dsint or the workspace for the convolution of matrix T in the Chirp–z factorization) are processed in this way. Arrays that are only read (such as Σ) are used directly from the heap. Furthermore, this improvement avoids the overhead that locks would introduce to prevent concurrent access to common storage areas.
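The scheme described above can be mimicked in a few lines. The sketch below is an illustration, not the module's implementation: all names are hypothetical, a dictionary keyed by n stands in for the linked list, a sine table plays the role of the read-only data, and a scratch list plays the role of the read/write workspace. The lock around node creation is an assumption of this sketch (the text does not say how list updates are synchronized); the transform itself runs without any lock because each call works on its own copy of the scratch area.

```python
import math
import threading


class Workspace:
    """Initialized data for one problem size (one node of the list)."""

    def __init__(self, n):
        self.n = n
        # Read-only after initialization: shared by all threads.
        self.table = [[math.sin(math.pi * j * k / (n + 1))
                       for k in range(1, n + 1)] for j in range(1, n + 1)]
        # Template of the read/write scratch area: copied on every call.
        self.scratch = [0.0] * n


_nodes = {}                      # keyed by n, standing in for the linked list
_nodes_lock = threading.Lock()   # guards node creation only (assumption)


def _node_for(n):
    with _nodes_lock:
        if n not in _nodes:
            _nodes[n] = Workspace(n)   # previous sizes are kept, not discarded
        return _nodes[n]


def known_sizes():
    return sorted(_nodes)


def dst(v):
    node = _node_for(len(v))
    work = list(node.scratch)          # private copy, local to this call
    for j, row in enumerate(node.table):
        work[j] = sum(r * t for r, t in zip(row, v))
    return work
```

Calling dst from several threads with alternating sizes then creates one node per size and returns the same values as sequential calls.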
4
Selection of the Best Routine
One of the main contributions of this work is the technique used to select the best DST routine at runtime: the fftpack routine (dsint) or the routine based on the Chirp–z factorization (ChirpDST). The cost of the fftpack routine depends on the size of the prime numbers into which n + 1 is decomposed, where n is the size of the transformation. Thus, the first approach consisted of selecting a threshold prime number: if the largest prime number in the decomposition of n + 1 was lower than the threshold, the fftpack routine (dsint) was chosen; otherwise, the Chirp–z based routine (ChirpDST) was selected.

However, this first approach did not work. Consider the example shown in Table 1, executed on our target machine, and assume that some prime number lower than and close to 127 was selected as the threshold. Then, for a problem size of n = 126, ChirpDST was chosen. The selection seemed correct, since ChirpDST runs 1.32 times faster than dsint for n = 126. The selection was also suitable for n = 253, a problem size for which the largest prime value was also larger than the threshold (n + 1 = 2 × 127). However, the selection became incorrect for a larger multiple of 127 (n = 380, n + 1 = 3 × 127). For this last problem size, dsint was almost twice as fast as ChirpDST.

Table 1. Comparison of the fftpack and the Chirp–z factorization based routines

n      n + 1 prime decomposition   dsint/ChirpDST (time ratio)   4n + 3   fft size
126    127                         1.32                          507      512
253    2 × 127                     1.18                          1015     1024
380    3 × 127                     0.51                          1523     2048

This behavior is easy to explain in the light of Fig. 1. The figure shows the execution time of both routines for all the sizes in the range 1–1024. For the sake of clarity, both axes use a logarithmic scale. The time used by dsint is quite short across the different values. However, the times for very similar sizes can be dramatically different (e.g. the time for size n = 1888, where n + 1 = 1889 is prime, is 25.20 ms, whereas the time for n = 1889, where n + 1 = 1890 = 2 × 3^3 × 5 × 7, is only 0.18 ms). Table 2 shows how this problem grows with the largest prime number and explains the scattering of the dsint time values.

Fig. 1. Comparison of time between the fftpack routine (dsint) and the Chirp–z factorization technique (ChirpDST) (logarithmic scales; horizontal axis n + 1 from 1 to 1024, vertical axis time in sec. from 1e-06 to 1e-02). The sparsity in the dots for dsint shows how differing and 'unexpected' the execution time can be, whereas the graphic of ChirpDST shows a layout in 'time steps'.

Table 2. Comparison of the fftpack and the Chirp–z factorization based routines

n      dsint/ChirpDST   (dsint time / ChirpDST time)   n + 1 prime decomp.
4997   0.017            (0.829 / 48.418)               4998 = 2 × 3 × 7^2 × 17
4998   3.579            (173.844 / 48.578)             4999 = 4999
4999   0.008            (0.405 / 48.339)               5000 = 2^3 × 5^4

In contrast, the time values of ChirpDST are clustered into time 'steps'. The DST of an n–array with ChirpDST consists of the computation of three FFTs of size m, where m is the minimum power of two greater than 4n + 3 (Table 1 also shows m). Table 3 shows results similar to those of Table 1, using the prime number 1889 as the largest prime number in the decomposition of n + 1. It can be seen that different multiples of 1889 fall into different time 'steps' of ChirpDST; the ratio between both times varies from 6.31 to 0.63.

The interpretation of Fig. 1 means that our first approach was correct as long as n belongs to the same time 'step'. Size n belongs to 'step x', for x = 0, 1, . . . , if n ∈ [2^x, 2^(x+1)). The time of ChirpDST is rather similar for all values of n inside 'step x', because it computes three FFTs of size m = 2^(x+3). Thus, a different threshold must be chosen in each time 'step'.
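The quantities used in this section are easy to compute. The following sketch uses hypothetical helper names (they are not routines of the package) and reproduces the 'fft size' entries of Tables 1 and 3 as well as the relation m = 2^(x+3) between the 'step' index and the FFT size.

```python
def largest_prime_factor(k):
    """Largest prime factor of k >= 2, found by trial division."""
    largest = 1
    p = 2
    while p * p <= k:
        while k % p == 0:
            largest = p
            k //= p
        p += 1
    return k if k > 1 else largest


def time_step(n):
    """Index x of the 'step' that contains n, i.e. 2**x <= n < 2**(x + 1)."""
    return n.bit_length() - 1


def chirp_fft_size(n):
    """Smallest power of two greater than 4n + 3."""
    m = 1
    while m <= 4 * n + 3:
        m *= 2
    return m
```

For instance, n = 126, 253 and 380 all have 127 as the largest prime factor of n + 1, but they lie in steps 6, 7 and 8, so ChirpDST works with FFTs of size 512, 1024 and 2048, respectively.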
5
Installation and Tuning
Our tools consist of a Fortran 90/95 module (Section 3), as well as other Fortran and C routines. The installation process consists of the usual compilation to build a library (libDST.a) and some header files. Following the initial installation stage, a tuning process takes place in which the threshold prime value of each 'step' is computed, yielding a table similar to Table 4. The output of the tuning stage shows the x value defined in the previous section in the first column and the threshold in the second column (INT_MAX means that the choice will always be dsint in the given 'step').

Table 3. Comparison of time for different 'time' steps

n                   1888          3777           5666           11333          17000
prime dec. n + 1    1889          2 × 1889       3 × 1889       2 × 3 × 1889   3 × 3 × 1889
dsint (msec.)       25.2          34.5           38.7           60.8           84.9
ChirpDST (msec.)    4.0           11.1           26.1           59.3           134.6
dsint/ChirpDST      6.31          3.11           1.48           1.03           0.63
fft size            8192 (2^13)   16384 (2^14)   32768 (2^15)   65536 (2^16)   131072 (2^17)

Table 4. Example of 'step' prime thresholds obtained by the tuning process

step (x)   threshold
0          INT_MAX
1          INT_MAX
2          INT_MAX
3          INT_MAX
4          INT_MAX
5          INT_MAX
6          89
7          163
8          193
9          251

The tuning stage can take some time, because problems of different sizes must be tested in order to obtain the threshold primes. The table obtained is used to build a C header file that will be used by the selection routine (fftpack_or_ChirpDST). Later, the software is compiled again and the definitive library for the given target machine is completely installed. At its first call at runtime, the dst routine will choose the best routine in negligible time.
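A selection routine driven by such a table could look like the sketch below. It is written in Python for illustration only and is not the package's C/Fortran code. The thresholds are copied from the example in Table 4, the value of INT_MAX is a stand-in for the C constant, and the return convention (1 selects dsint, 0 selects ChirpDST) follows the test `fftpack_or_ChirpDST(n) .eq. 1` in the scheme of Section 3. Sizes beyond the last tabulated step are rejected because the example table gives no threshold for them.

```python
INT_MAX = 2**31 - 1   # stand-in for the C constant

# Example thresholds of Table 4, indexed by the step x (steps 0..9 only).
THRESHOLDS = [INT_MAX] * 6 + [89, 163, 193, 251]


def largest_prime_factor(k):
    """Largest prime factor of k >= 2, found by trial division."""
    largest = 1
    p = 2
    while p * p <= k:
        while k % p == 0:
            largest = p
            k //= p
        p += 1
    return k if k > 1 else largest


def fftpack_or_chirpdst(n):
    """Return 1 to select dsint and 0 to select ChirpDST for size n >= 1."""
    x = n.bit_length() - 1            # step index: 2**x <= n < 2**(x + 1)
    if x >= len(THRESHOLDS):
        raise ValueError("no tuned threshold for this size in the sketch")
    # dsint is chosen when the largest prime factor of n + 1 is below
    # the threshold of the step; INT_MAX therefore always selects dsint.
    return 1 if largest_prime_factor(n + 1) < THRESHOLDS[x] else 0
```

With these example thresholds, n = 126 (step 6, largest prime 127) selects ChirpDST, whereas n = 380 (step 8, largest prime 127) selects dsint, in agreement with the time ratios of Table 1 for those two sizes.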
6
Example of Use
The complete module, with the improvements described above, has proved useful in the solution of symmetric Toeplitz linear systems [3],

$$ T x = b , \qquad T \in \mathbb{R}^{n \times n} , \quad x, b \in \mathbb{R}^{n} . \tag{6} $$

The first stage in solving problem (6) consists of translating the problem to the field of Cauchy–like matrices,

$$ T x = b \;\rightarrow\; C \hat{x} = \hat{b} , \qquad \text{where } C = S T S , \; \hat{x} = S x \text{ and } \hat{b} = S b , \tag{7} $$

with C being a symmetric Cauchy–like matrix. The transformation is carried out by means of a DST of size n, S (1). It is known that matrix C has a large sparsity that can be exploited, since the entries $c_{i,j}$, i, j = 0, . . . , n − 1, such that i + j is odd are zero. Performing an odd-even permutation of the Cauchy–like linear system (7), two independent Cauchy–like linear systems arise, of size ⌈n/2⌉ and ⌊n/2⌋, respectively. Once these two reduced linear systems are solved by two concurrent threads, each thread makes a DST call (of size ⌈n/2⌉ and ⌊n/2⌋, respectively) to undo the transformation, which returns to the Toeplitz field and yields the solution of the original linear system (6).

This practical example shows how the Fortran 90/95 module gives the programmer an easy way to handle the DST by making simple calls to dst, passing only the target array as argument. The programmer is thus freed from worrying about workspaces, data initialization, or selecting the best-performing routine, as the array size is set at each call in a multi-thread context.
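The zero pattern of C can be checked numerically on a small case. The sketch below is only an illustration: the helper names are hypothetical, the products are formed with naive O(n^3) loops rather than with fast transforms, and S is assumed to be the DST-I matrix with entries α sin(πjk/(n+1)), j, k = 1, . . . , n, and α = sqrt(2/(n+1)); the zero pattern itself does not depend on α.

```python
import math


def dst_matrix(n):
    """DST-I matrix with the assumed normalization sqrt(2 / (n + 1))."""
    alpha = math.sqrt(2.0 / (n + 1))
    return [[alpha * math.sin(math.pi * j * k / (n + 1))
             for k in range(1, n + 1)] for j in range(1, n + 1)]


def matmul(a, b):
    """Naive dense matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def cauchy_like(first_row):
    """C = S T S for the symmetric Toeplitz matrix T with this first row."""
    n = len(first_row)
    t = [[first_row[abs(i - j)] for j in range(n)] for i in range(n)]
    s = dst_matrix(n)
    return matmul(matmul(s, t), s)
```

For any symmetric Toeplitz first row, the entries of cauchy_like(first_row) with i + j odd vanish up to rounding, which is what allows the odd-even permutation to split the system into two independent halves.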
7
Conclusions
An initial real transformation (DST) of a future library consisting of different FFT–related routines has been developed. This first routine sets the basis for building all the FFT–related routines with an easy-to-use interface and performance free of drawbacks. The first goal is achieved thanks to the features introduced by Fortran 90/95, one of the most widely used programming languages for numerical algorithms, which even provides an easy interface for the traditional C language. The second objective has been achieved by means of an intelligent installation/tuning process.

As numerical programmers know, a very powerful and popular package for FFT–related functions, called FFTW [10], is currently available. This package has become a standard because the new Intel processors are provided with the same interface as FFTW for using native FFT–related routines. Our interface is so simple that it can be used as a layer over the FFTW function calls. Our future library will offer the option of being used over the fftpack or the FFTW routines, in the latter case providing the performance of native routines.
References
1. Van Loan, C.: Computational Frameworks for the Fast Fourier Transform. SIAM Press, Philadelphia (1992)
2. Bojanczyk, A.W., Heinig, G.: Transformation techniques for Toeplitz and Toeplitz-plus-Hankel matrices, part I: Transformations. Technical Report 96-250, Cornell Theory Center (1996)
3. Alonso, P., Vidal, A.M.: The symmetric-Toeplitz linear system problem in parallel. In: Sunderam, V.S., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2005. LNCS, vol. 3514, pp. 220–228. Springer, Heidelberg (2005)
4. Alonso, P., Bernabeu, M.O., Vidal, A.M.: A parallel solution of hermitian Toeplitz linear systems. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3991, pp. 348–355. Springer, Heidelberg (2006)
5. Swarztrauber, P.: Vectorizing the FFT's. Academic Press, New York (1982)
6. Metcalf, M., Reid, J.K.: Fortran 90/95 Explained, 2nd edn. Oxford University Press, Inc., New York (1999)
7. Akin, E.: Object Oriented Programming Via FORTRAN 90/95. Cambridge University Press, New York (2003)
8. Barker, V.A., Blackford, L.S., Dongarra, J., Du Croz, J., Hammarling, S., Marinova, M., Wasniewski, J., Yalamov, P.: LAPACK95 Users' Guide. SIAM: Software, Environments and Tools, SIAM, Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688, USA (2001)
9. Anderson, E., Bai, Z., Bischof, C., Demmel, J., Dongarra, J.: LAPACK Users' Guide; LAPACK Quick Reference Guide to the Driver Routines: Release 2.0, 2nd edn. SIAM Press, Philadelphia (1995)
10. Frigo, M., Johnson, S.G.: The design and implementation of FFTW3. Proceedings of the IEEE 93, 216–231 (2005), special issue on Program Generation, Optimization, and Platform Adaptation
Incomplete WZ Factorization as an Alternative Method of Preconditioning for Solving Markov Chains

Beata Bylina and Jaroslaw Bylina

Department of Computer Science, Institute of Mathematics, Marie Curie-Sklodowska University, Pl. M. Curie-Sklodowskiej 1, 20-031 Lublin, Poland
Abstract. The purpose of the article is to present and evaluate the usefulness of a new preconditioning technique (for the Gauss-Seidel algorithm), namely incomplete WZ factorization, for the iterative solution of sparse and singular systems of linear equations that arise when modeling with Markov chains. The incomplete WZ factorization proposed here is compared with the incomplete LU factorization with respect to the amount of fill-in (newly created non-zeros) and with respect to the accuracy improvement of the preconditioned algorithms over the non-preconditioned ones. The paper presents the results of numerical experiments conducted for various matrices representing Markov chains. The experiments show that the incomplete WZ factorization can be a real alternative, because it is faster than the incomplete LU factorization and the fill-in generated in the process is smaller (the output matrices are sparser).
1
Introduction
When modeling a real system (for example, a computer system or a communications network) with a Markov chain, we come across a linear system (of n equations with n unknowns) when we want to find the stationary probabilities of the states of the modeled system [15]:

$$ Q^T x = 0 , \qquad \sum_{i=1}^{n} x_i = 1 , \qquad x \ge 0 , \tag{1} $$

where the unknown $x = (x_1, \ldots, x_n)$ is a vector of probabilities of states of the modeled system after a long run ($x_i$ being the probability of the ith state) and the matrix $Q = (q_{ij})$ is an infinitesimal generator (that is, a transition rate matrix) of the given Markov chain ($q_{ij}$ being the rate of a transition from the ith state to the jth state).

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 99–107, 2008. © Springer-Verlag Berlin Heidelberg 2008
Equation (1) has a unique solution in the most interesting cases (when the Markov chain is irreducible, so the rank of Q is n − 1). However, the matrix Q has some unusual properties that require special treatment. Namely, Q is a huge, sparse, ill-conditioned matrix with a weakly dominant diagonal.

For solving the system (1), one generally uses iterative [15], projection and decomposition methods. However, direct methods are sometimes used, especially when high accuracy is needed [5,15]. All the methods have their own advantages and disadvantages. Two kinds of methods for solving linear systems, direct and iterative, can be combined [8]. Preconditioning is an example of such a fusion. The very concept of preconditioning is almost as old as iterative methods [11]. The idea of incomplete factorization was presented by Buleev [2,3] and Varga [16]. The papers that popularized incomplete factorizations were [12,13].

There is a need for preconditioners that are fast, stable, scalable, easy to parallelize and that generate little fill-in. Incomplete LU factorization has some of these features, but this paper presents another one, which is perhaps better. Moreover, work on incomplete factorizations is needed for other purposes, for example for block approximate inverses [1].

Preconditioning is presented in Section 2, which also describes the incomplete LU and WZ factorizations, their application to preconditioning, and the cost of the computations needed for such preconditioning. Section 3 presents the results of a numerical experiment that consists of solving linear systems arising in Markovian modeling with the Gauss-Seidel method preconditioned with the incomplete WZ and LU factorizations. Section 3 also analyzes the properties of both preconditioners and their influence on the convergence of the Gauss-Seidel method. Section 4 contains some conclusions.
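To make system (1) concrete, the following sketch applies plain (not preconditioned) Gauss-Seidel sweeps to a tiny, made-up three-state generator. The function name, the generator and the number of sweeps are assumptions chosen for illustration; the matrices used in the experiments of the paper are far larger and sparse.

```python
def stationary_gauss_seidel(Q, sweeps=200):
    """Gauss-Seidel sweeps on Q^T x = 0, renormalized so that sum(x) = 1."""
    n = len(Q)
    x = [1.0 / n] * n
    for _ in range(sweeps):
        for i in range(n):
            # Row i of Q^T is column i of Q; the diagonal Q[i][i] is negative.
            s = sum(Q[j][i] * x[j] for j in range(n) if j != i)
            x[i] = -s / Q[i][i]
        total = sum(x)
        x = [t / total for t in x]
    return x


# Hypothetical irreducible generator: every row sums to zero.
Q = [[-2.0, 1.0, 1.0],
     [3.0, -4.0, 1.0],
     [1.0, 2.0, -3.0]]
x = stationary_gauss_seidel(Q)
```

For this generator the stationary vector can be checked by hand: x = (1/2, 1/4, 1/4) satisfies Q^T x = 0 and sums to one.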
2 Preconditioning
The convergence rate of iterative methods depends on the properties of the coefficient matrix of the linear system. The matrix Q is ill-conditioned, which can make the convergence of iterative methods slow. One way to prevent such problems is to transform system (1) into an equivalent system (having the same solution) with better numerical properties. Such a transformation can be done by preconditioning, that is, by converting system (1) into:

$$M^{-1} Q^T x = 0, \qquad \sum_{i=1}^{n} x_i = 1, \qquad x \ge 0, \tag{2}$$
where the nonsingular matrix M (known as a preconditioner) approximates the matrix $Q^T$ in some sense. System (2) has the same solutions as (1), but it is (hopefully) better conditioned. Generally, computing and using a good preconditioner is an expensive task that consists of finding the matrix M and its inverse. If preconditioning is to be used, that cost should be repaid by a reduced number of iterations needed
Incomplete WZ Factorization as an Alternative Method of Preconditioning
to reach the required accuracy, or by reusing the same preconditioner for several linear systems. Various techniques are used for finding the preconditioner matrix M, but none of them is sufficiently universal. The only sure way to verify the usability of a preconditioning technique is to test it empirically. The preconditioner matrix is usually built from the original coefficients of the matrix Q. In [1], preconditioners for Krylov subspace methods for solving large singular linear systems arising from Markov modeling are considered. Here we describe a way of constructing preconditioners based on incomplete factorizations for stationary iterative methods.

2.1 Incomplete LU Factorization
Incomplete LU factorization (denoted ILU) is based on the well-known LU factorization, where a lower triangular matrix $\tilde{L}$ (with ones on the diagonal) and an upper triangular matrix $\tilde{U}$ are found, and where the preconditioner matrix $M = \tilde{L}\tilde{U}$ is a kind of approximation of the matrix $Q^T$.

There are many variants of ILU, the most straightforward being ILU(0) [15]. In ILU(0), computations are conducted as in the traditional (complete) LU factorization (that is, Gaussian elimination), but new non-zero elements ($l_{ij}$ and $u_{ij}$) arising in the process are dropped if they appear in the place of a zero element of the original matrix $Q^T$. Hence, the factors together have the same number of non-zeros as the original matrix $Q^T$. Thereby, the most important problem of the factorization of sparse matrices, the fill-in (non-zero elements appearing in the factors in places where the original matrix has zeros, which makes the output factors dense and renders their packed storage impossible), is eliminated at the expense of accuracy. After ILU(0) we have:

$$Q^T = \tilde{L}\tilde{U} + R_{LU}, \tag{3}$$

where $\tilde{L}$ and $\tilde{U}$ are (respectively) lower and upper triangular matrices and the remainder matrix $R_{LU}$ is supposed to be small in some sense.

Let $M = \tilde{L}\tilde{U}$; then $M^{-1} = \tilde{U}^{-1}\tilde{L}^{-1}$ and equation (2) takes the shape:

$$\tilde{U}^{-1}\tilde{L}^{-1} Q^T x = 0, \qquad \sum_{i=1}^{n} x_i = 1, \qquad x \ge 0. \tag{4}$$
Let $S_{LU} = \tilde{U}^{-1}\tilde{L}^{-1} Q^T$. Now, equation (4) takes the shape:

$$S_{LU}\, x = 0, \qquad \sum_{i=1}^{n} x_i = 1, \qquad x \ge 0. \tag{5}$$
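The ILU(0) rule described in this section (eliminate as in Gaussian elimination, but never write into a position that is zero in the original matrix) can be sketched in C as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function name `ilu0_dense`, the dense row-major storage and the in-place layout of the factors are all choices made for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* In-place ILU(0) sketch for a dense row-major n x n matrix a.
 * On return, the strict lower triangle holds the multipliers of the unit
 * lower triangular factor and the upper triangle holds the upper factor.
 * Entries that are zero in the input stay zero (no fill-in).
 * Returns 0 on success, -1 on allocation failure or a zero pivot.
 * (Hypothetical helper written for illustration only.) */
int ilu0_dense(int n, double *a)
{
    assert(n > 0 && a != NULL);

    /* Remember the sparsity pattern of the original matrix. */
    unsigned char *nz = malloc((size_t)n * (size_t)n);
    if (nz == NULL) return -1;
    for (int p = 0; p < n * n; p++) nz[p] = (a[p] != 0.0);

    for (int i = 1; i < n; i++) {
        for (int k = 0; k < i; k++) {
            if (!nz[i * n + k]) continue;          /* multiplier outside the pattern: dropped */
            if (a[k * n + k] == 0.0) { free(nz); return -1; }
            a[i * n + k] /= a[k * n + k];          /* multiplier l_ik */
            for (int j = k + 1; j < n; j++) {
                if (nz[i * n + j])                 /* update only inside the pattern */
                    a[i * n + j] -= a[i * n + k] * a[k * n + j];
            }
        }
    }
    free(nz);
    return 0;
}
```

For a 3 × 3 "arrow" matrix with zeros at positions (2,3) and (3,2), a complete LU factorization would create non-zeros in both places, whereas this sketch leaves them at zero.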
2.2 Incomplete WZ Factorization

The WZ factorization decomposes the given matrix ($Q^T$ in this paper) into a product of two matrices, W and Z (Figure 1).
Fig. 1. The form of the output matrices in the WZ factorization (left: W; right: Z)
The WZ factorization was proposed by Evans and Hatzopoulos [10] as a matrix factorization intended for SIMD machines (Single Instruction stream, Multiple Data stream in Flynn's classification; such an architecture is characterized by multiple processing units). In [6,9,10,17] the WZ factorization is developed further, together with its modifications and implementations (parallel ones, among others).

Incomplete WZ factorization [4] (denoted IWZ) is based on the WZ factorization described above: we find matrices $\tilde{W}$ and $\tilde{Z}$ (of the form of the matrices W and Z shown in Figure 1) such that the product $\tilde{W}\tilde{Z}$ is a kind of approximation of the matrix $Q^T$.

Let us present the simplest form of IWZ, that is IWZ(0). In IWZ(0), computations are conducted as in the complete WZ factorization, but new non-zero elements ($w_{ij}$ and $z_{ij}$) arising in the process are dropped if they appear in the place of a zero element of the original matrix $Q^T$. Hence, as in ILU(0), the factors together have the same number of non-zeros as the original matrix $Q^T$ and there is no fill-in.

Let $M = \tilde{W}\tilde{Z}$; then $M^{-1} = \tilde{Z}^{-1}\tilde{W}^{-1}$ and equation (2) takes the shape:

$$\tilde{Z}^{-1}\tilde{W}^{-1} Q^T x = 0, \qquad \sum_{i=1}^{n} x_i = 1, \qquad x \ge 0. \tag{6}$$
Let $S_{WZ} = \tilde{Z}^{-1}\tilde{W}^{-1} Q^T$. Now, equation (6) takes the shape:

$$S_{WZ}\, x = 0, \qquad \sum_{i=1}^{n} x_i = 1, \qquad x \ge 0. \tag{7}$$
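A possible realization of IWZ(0) on a dense array is sketched below in C. It is an illustration under assumptions rather than the authors' code: the name `iwz0_dense`, the separate output arrays `w` and `z`, and the row-major dense storage are hypothetical choices. At step k the rows k and n−1−k of the working matrix become rows of Z, the two multipliers of each inner row are obtained from a 2 × 2 system, and every element that would land on a zero of the original matrix is dropped.

```c
#include <assert.h>
#include <stdlib.h>

/* IWZ(0) sketch for a dense row-major n x n matrix a (left unmodified).
 * Writes W (unit diagonal) and Z (butterfly shape) into w and z, which
 * must be distinct n x n arrays. Returns 0 on success, -1 on allocation
 * failure or a singular 2 x 2 pivot block.
 * (Hypothetical helper written for illustration only.) */
int iwz0_dense(int n, const double *a, double *w, double *z)
{
    assert(n > 0 && a != NULL && w != NULL && z != NULL);
    size_t total = (size_t)n * (size_t)n;
    double *t = malloc(total * sizeof *t);          /* working copy of a */
    if (t == NULL) return -1;
    for (size_t p = 0; p < total; p++) { t[p] = a[p]; w[p] = 0.0; z[p] = 0.0; }
    for (int i = 0; i < n; i++) w[i * n + i] = 1.0;

    for (int k = 0; k < (n + 1) / 2; k++) {
        int k2 = n - 1 - k;
        /* Rows k and k2 of the working matrix are final rows of Z. */
        for (int j = k; j <= k2; j++) {
            z[k * n + j]  = t[k * n + j];
            z[k2 * n + j] = t[k2 * n + j];
        }
        if (k2 - k < 2) break;                      /* no inner rows left */

        double det = t[k * n + k] * t[k2 * n + k2] - t[k2 * n + k] * t[k * n + k2];
        if (det == 0.0) { free(t); return -1; }

        for (int i = k + 1; i < k2; i++) {
            double aik = t[i * n + k], aik2 = t[i * n + k2];
            double wik = 0.0, wik2 = 0.0;
            /* Multipliers are kept only where the original matrix is non-zero. */
            if (a[i * n + k] != 0.0)
                wik = (t[k2 * n + k2] * aik - t[k2 * n + k] * aik2) / det;
            if (a[i * n + k2] != 0.0)
                wik2 = (t[k * n + k] * aik2 - t[k * n + k2] * aik) / det;
            w[i * n + k] = wik;
            w[i * n + k2] = wik2;
            for (int j = k + 1; j < k2; j++) {
                if (a[i * n + j] != 0.0)            /* update only inside the pattern */
                    t[i * n + j] -= wik * t[k * n + j] + wik2 * t[k2 * n + j];
            }
        }
    }
    free(t);
    return 0;
}
```

When the input matrix has no zero entries nothing is dropped, so the product of the two factors reproduces the input; this gives a simple sanity check for the sketch.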
There is one more thing worth noting: because the matrix $Q^T$ is singular, its complete factorizations (both LU and WZ) are difficult to compute. However, the factors of an incomplete factorization of a sparse matrix are singular with probability zero, hence incomplete factorizations are easier to compute.
3 Numerical Experiment
The experiment was conducted on a Pentium IV 2.8 GHz computer with 1 GB of RAM, running the Debian GNU/Linux operating system. The described algorithms were implemented in C and compiled with gcc using the optimization option -O3. All the matrices were stored as full two-dimensional square arrays, fitting wholly in the machine's RAM. The matrices with IDs from 1 to 6 used in the tests were generated by the authors on the basis of some abstract queuing models (the matrices are infinitesimal generators of Markov chains describing these models); they are neither symmetric nor structured in any particular way. Table 1 presents the essential characteristics of the matrices (n is the number of rows/columns of the matrix, nz is the number of non-zeros in the matrix and d is the average number of non-zeros in a row/column of the matrix).

Table 1. Essential characteristics of the matrices used in tests

matrix ID      n        nz       d
    1        100      1190    11.9
    2       1000      7744     7.7
    3       1500     37955    25.3
    4       1500      5873     3.9
    5       3000    120590    40.2
    6       3000     11636     3.9
    7       2116     12376     5.9
Matrix 7 was generated from a standard two-dimensional model [7,14]. The states of such a chain are described by two numbers (u, v), $u = 0, \ldots, N_x$, $v = 0, \ldots, N_y$ (here $N_x = N_y = 45$), and transitions are only allowed from (u, v) to (u′, v′) if $|u - u'| \le 1$ and $|v - v'| \le 1$. It was assumed, as in [7], that only some of the transitions from each state are permitted.

Each of these matrices was incompletely factorized with the algorithms ILU(0) and IWZ(0), and then the matrices $S_{LU}$ and $S_{WZ}$ were created. Table 2 presents the times of the ILU(0) and IWZ(0) factorizations. Table 3 presents the numbers of non-zeros in the matrices $S_{LU}$ and $S_{WZ}$. It can be noticed that the number of non-zeros in those matrices depends on the sparsity of the input matrix $Q^T$. More importantly, the number of non-zeros in $S_{WZ}$ is smaller than in $S_{LU}$.

In this paper, the aim of incomplete factorization is its use as a preconditioner for the Gauss-Seidel method. That is, we use the Gauss-Seidel method (although other iterative, projection etc. methods can be used as well) to solve equation (5) when ILU(0) is used, or equation (7) when IWZ(0) is used, instead of the original equation (1). The Gauss-Seidel method is given by:

$$x^{(k+1)} = (D - L)^{-1} U x^{(k)}, \tag{8}$$
or in the scalar form:

$$x_i^{(k+1)} = \frac{\sum_{j=1}^{i-1} l_{ij}\, x_j^{(k+1)} + \sum_{j=i+1}^{n} u_{ij}\, x_j^{(k)}}{d_{ii}}, \qquad \text{for } i = 1, \ldots, n, \tag{9}$$
where the coefficient matrix (in this paper: $Q^T$, $S_{LU}$ and $S_{WZ}$) is written as D − L − U ($D = (d_{ij})$ is a diagonal matrix, $U = (u_{ij})$ is an upper triangular matrix with a zero diagonal, and $L = (l_{ij})$ is a lower triangular matrix with a zero diagonal).

Because equation (1) without the additional condition $\sum_{i=1}^{n} x_i = 1$ has an infinite number of solution vectors (and so have (5) and (7)), the output vector x has to be normalized after the computations (in this case it should be divided by its first norm $\|x\|_1$). After that, when $\operatorname{rank} Q^T = n - 1$ (which is true in our cases), there is only one solution; it fulfills the condition $\sum_{i=1}^{n} x_i = 1$ and, moreover, the condition $x \ge 0$ (which is true when Q is an infinitesimal generator of a Markov chain).

A vector $x^{(0)} = (x_i^{(0)})$ with $x_i^{(0)} = 1/i$ was chosen as the initial vector. As a measure of the accuracy of the solution we chose:

$$\varepsilon^{(k)} = \| 0 - Q^T x^{(k)} \|_2. \tag{10}$$
Table 2. Comparison of times of ILU(0) and IWZ(0) factorization

matrix ID   IWZ(0) time [s]   ILU(0) time [s]
    1             0.01              0.01
    2             1.17              2.18
    3             4.60              8.13
    4             3.02              5.80
    5            30.69             57.59
    6            20.43             41.30
    7             9.88             15.04
Table 3. Numbers of non-zeros in the matrices $S_{LU}$ and $S_{WZ}$ (distinguished IDs, marked *, are for matrices which are sparser (d ≤ 8) and where, at the same time, the greater improvement in sparsity is seen). The last column gives by how many percent $S_{WZ}$ is sparser than $S_{LU}$, that is $(nz(S_{LU}) - nz(S_{WZ}))/nz(S_{LU})$.

matrix ID   nz(S_LU)   nz(S_WZ)   sparser by
    1           9901       9632      2.72%
    2*        910496     535808     41.15%
    3        2250000    2155092      4.22%
    4*        200173      66250     66.90%
    5        9000000    8766420      2.60%
    6*        397272     137560     65.37%
    7*       3777061     732514     80.61%
Table 4. Comparison of accuracy of tested methods (distinguished IDs, marked *, are for matrices which are sparser (d ≤ 8) and where, at the same time, the greater improvement in accuracy is seen)

matrix ID   step   GS         IWZ(0)+GS   ILU(0)+GS
    1          1   5.0e-01    1.4e-01     1.7e-01
               5   4.5e-04    1.8e-05     6.6e-05
              10   6.8e-08    3.7e-10     2.16e-09
    2*         1   1.7e-01    5.9e-02     4.7e-02
               5   9.6e-04    2.1e-06     1.1e-06
              10   1.0e-06    3.6e-11     1.9e-11
    3          1   1.3e-01    2.3e-02     1.9e-02
               5   2.5e-05    1.5e-06     2.7e-05
              10   3.2e-10    3.6e-11     1.4e-10
    4*         1   1.4e-01    8.8e-02     7.5e-02
               5   7.5e-03    1.6e-04     1.1e-04
              10   1.5e-04    4.0e-08     1.4e-08
    5          1   9.6e-02    1.2e-02     1.1e-02
               5   7.1e-06    8.2e-07     1.5e-06
              10   9.7e-11    2.6e-11     7.1e-11
    6*         1   1.1e-01    6.3e-02     5.3e-02
               5   5.8e-03    1.1e-04     8.9e-05
              10   1.5e-04    2.9e-08     1.1e-08
    7*         1   5.2e-02    6.5e-02     1.7e+00
              50   1.1e-01    1.6e-09     1.2e-15
             100   1.3e-08    1.5e-15     1.2e-15
Table 4 shows the values of (10) at selected steps of the Gauss-Seidel method for three methods: Gauss-Seidel alone (denoted GS), Gauss-Seidel preconditioned with IWZ(0) (denoted IWZ(0)+GS) and Gauss-Seidel preconditioned with ILU(0) (denoted ILU(0)+GS).
4 Conclusion
The paper presents the incomplete WZ factorization and compares it with the incomplete LU factorization. We can see that the cost of both incomplete factorizations is $O(n^3)$; that both preconditioners significantly improve the convergence of the Gauss-Seidel method, especially for very sparse matrices (for example matrices 2 and 4; the sparser the matrix, the better the effect of preconditioning, see also [1]); and that the accuracy is roughly the same for both preconditioners (slightly to the advantage of IWZ(0)). So, as far as accuracy is concerned, it does not matter which of the factorizations we choose (IWZ(0) or ILU(0)). However, from the results of the experiment we can conclude the following.

1. When we care about execution time, we should choose IWZ(0), because it is about twice as fast as ILU(0). This is so because the loops in IWZ(0) have half the repetitions of the loops in ILU(0), but each loop iteration is longer, and an optimizing compiler (such as gcc -O3) can make more effective use of the mini-vector properties of modern processors and of the multi-level organization of the memory. These results relate to the authors' own implementations of both algorithms.
2. The number of non-zeros in $S_{WZ}$ is smaller than in $S_{LU}$.
3. For very sparse matrices (d ≤ 8) the matrix $S_{WZ}$ is much sparser than the matrix $S_{LU}$, and for such matrices the Gauss-Seidel method improves very much with preconditioning.
4. Matrix 7 turned out to need many more iterations to reach sufficient accuracy. However, it can be observed that the presented algorithms improve the convergence of the Gauss-Seidel method. This is contrary to the observations in [7]; however, the Gauss-Seidel method was not considered there (other iterative methods were, with SOR being the most similar).

On the basis of the results presented, it appears that the incomplete WZ factorization is a good alternative to the incomplete LU factorization as a preconditioner, especially for sparse matrices, because it is faster than the incomplete LU factorization and the fill-in generated in the process is smaller.
Acknowledgements This work was partially supported by Marie Curie-Sklodowska University in Lublin, Poland within the project Niekompletny rozklad WZ jako uwarunkowanie wstepne (ang. preconditioning) stosowane do znajdowania wektora prawdopodobienstw stanow sieci calkowicie optycznych modelowanych lancuchami Markowa. This work was also partially supported within the project Metody i modele dla kontroli zatloczenia i oceny efektywnosci mechanizmow jakosci uslug w Internecie nastepnej generacji (N517 025 31/2997).
References

1. Benzi, M., Ucar, B.: Block Triangular preconditioners for M-matrices and Markov chains. Electronic Transactions on Numerical Analysis 26 (to appear, 2007)
2. Buleev, N.I.: A numerical method for solving two-dimensional diffusion equations. At. Energ. 6, 338 (1959) (in Russian)
3. Buleev, N.I.: A numerical method for solving two- and three-dimensional diffusion equations. Mat. Sb. 51, 227 (1960) (in Russian)
4. Bylina, B., Bylina, J.: Niekompletny rozkład WZ jako uwarunkowanie wstępne dla rozwiązywania Markowowskich modeli sieci optycznych. Sieci komputerowe, Wydawnictwa Komunikacji i Łączności, Warszawa, tom 1 (Nowe technologie), p. 103 (2007) (in Polish)
5. Bylina, B., Bylina, J.: The Vectorized and Parallelized Solving of Markovian Models for Optical Networks. In: Bubak, M., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2004. LNCS, vol. 3037, p. 578. Springer, Heidelberg (2004)
6. Chandra Sekhara Rao, S.: Existence and uniqueness of WZ factorization. Parallel Computing 23, 1129 (1997)
7. Dayar, T., Stewart, W.J.: Comparison of Partitioning Techniques for Two-Level Iterative Solvers on Large, Sparse Markov Chains. SIAM Journal on Scientific Computing 21, 1691 (2000)
8. Duff, I.S.: Combining direct and iterative methods for the solution of large systems in different application areas. Technical Report RAL-TR-2004-033
9. Evans, D.J., Barulli, M.: BSP linear solver for dense matrices. Parallel Computing 24, 777 (1998)
10. Evans, D.J., Hatzopoulos, M.: The parallel solution of linear system. Int. J. Comp. Math. 7, 227 (1979)
11. Jacobi, C.G.J.: Über eine neue Auflösungsart der bei der Methode der kleinsten Quadrate vorkommenden linearen Gleichungen. Astron. Nachrichten 22, 297 (1845)
12. Kershaw, D.S.: The incomplete Cholesky conjugate gradient method for the iterative solution of systems of linear equations. J. Comput. Phys. 26, 43 (1978)
13. Meijerink, J.A., van der Vorst, H.A.: An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix. Math. Comput. 31, 148 (1977)
14. Pollett, P.K., Stewart, D.E.: An Efficient Procedure for Computing Quasi-Stationary Distributions of Markov Chains with Sparse Transition Structure. Advances in Applied Probability 26, 68 (1994)
15. Stewart, W.: Introduction to the Numerical Solution of Markov Chains. Princeton University Press, Chichester, West Sussex (1994)
16. Varga, R.S.: Factorizations and normalized iterative methods. In: Langer, R.E. (ed.) Boundary Problems in Differential Equations, Univ. Wisconsin Press, Madison (1960)
17. Yalamov, P., Evans, D.J.: The WZ matrix factorization method. Parallel Computing 21, 1111 (1995)
A Block-Based Parallel Adaptive Scheme for Solving the 4D Vlasov Equation

Olivier Hoenen and Eric Violard
Laboratoire LSIIT - ICPS, UMR CNRS 7005, Université Louis Pasteur, Strasbourg
{hoenen,violard}@icps.u-strasbg.fr
Abstract. We present a parallel algorithm for solving the 4D Vlasov equation. Our algorithm is designed for distributed memory architectures. It uses an adaptive numerical method which reduces the computational cost. This adaptive method is a semi-Lagrangian scheme based on hierarchical finite elements. It involves a local interpolation operator. Our algorithm handles both irregular data dependencies and the large amount of data by distributing data into blocks. Performance measurements on a PC cluster confirm the pertinence of our approach. This work is part of the CALVI project¹.

Keywords: Parallel numerics, 4D Vlasov equation, Adaptive method.
1 Introduction
The Vlasov equation describes the evolution in time of charged particles under the effects of electromagnetic fields. It is used to model important phenomena in plasma physics such as controlled thermonuclear fusion. This equation is defined in the phase space, i.e., the position and velocity space, which has 6 dimensions in the real case (one dimension of velocity for each dimension of position). Methods discretizing the Vlasov equation on a mesh of the phase space have been proposed to get an accurate description of the physics. Due to the high number of dimensions of the equation domain, solving this equation with such a numerical method yields a very large computational problem. In order to reduce this computational cost, adaptive methods have been developed. Some of these methods are based on semi-Lagrangian schemes [1,2]. We developed such an adaptive method and its parallel implementation for the 2D Vlasov equation [3], but a higher number of dimensions brings new challenges. On the other hand, some parallel implementations of 4D adaptive Vlasov solvers exist, but they are essentially designed for shared memory architectures. Therefore new parallel adaptive schemes have to be developed for distributed memory machines.

¹ Supported by a grant from the Alsace Region. CALVI is a French INRIA project devoted to the numerical simulation of problems in plasma physics and beam propagation.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 108–117, 2008. c Springer-Verlag Berlin Heidelberg 2008
This paper presents an appropriate block-based semi-Lagrangian scheme and its parallelization. This scheme is based on a hierarchical finite element decomposition [4] which involves a local interpolation operator. This method yields an efficient parallelization based on the distribution of data into blocks. The paper is organized as follows. Section 2 presents our adaptive semi-Lagrangian scheme. Section 3 presents its parallel implementation in detail. Section 4 shows our experimental results before concluding.
2 An Adaptive Numerical Scheme
We consider the four-dimensional Vlasov equation

$$\frac{\partial f}{\partial t} + \mathbf{v}\cdot\frac{\partial f}{\partial \mathbf{x}} + \mathbf{E}(t,\mathbf{x})\cdot\frac{\partial f}{\partial \mathbf{v}} = 0, \tag{1}$$

whose unknown $f(t, (\mathbf{x}, \mathbf{v}))$ represents the distribution of particles at time t, where $\mathbf{x} = (x, y) \in \mathbb{R}^2$ and $\mathbf{v} = (v_x, v_y) \in \mathbb{R}^2$ are the coordinates of a point in the phase space, and $\mathbf{E}(t, \mathbf{x})$ is the so-called self-consistent electrostatic field, which is generated by the particles. The equation is coupled with Poisson's equation, which gives the electric field E. Our resolution scheme is based on the splitting of the equation (introduced in [5]) into a succession of three transport equations of the form

$$\frac{\partial f}{\partial t} + U(t, z)\,\frac{\partial f}{\partial z} = 0, \tag{2}$$
whose resolution, the so-called advection in z (where z stands for phase space coordinates), is performed by using a semi-Lagrangian scheme. This scheme uses the property of conservation of the unknown along the characteristic curves, i.e.,

Property 1. $f(t^{n+1}, (\mathbf{x}, \mathbf{v})) = f(t^{n}, A_z^{\Delta t}(\mathbf{x}, \mathbf{v}))$, where $\Delta t = t^{n+1} - t^{n}$ is the time step and $A_z^{\Delta t}$, the so-called advection operator, is a one-to-one correspondence between points of the phase space (see [6] for more details).
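Our numerical method therefore boils down to performing three successive advections on a phase space mesh: one in x, one in y and one in v. These advections are defined by the advection operators:

$$\begin{aligned}
A_x^{\Delta t}(\mathbf{x}, \mathbf{v}) &= ((x - v_x \Delta t,\; y),\; \mathbf{v}),\\
A_y^{\Delta t}(\mathbf{x}, \mathbf{v}) &= ((x,\; y - v_y \Delta t),\; \mathbf{v}),\\
A_{\mathbf{v}}^{\Delta t}(\mathbf{x}, \mathbf{v}) &= (\mathbf{x},\; \mathbf{v} - \mathbf{E}(t, \mathbf{x})\Delta t).
\end{aligned} \tag{3}$$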
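The three advection operators of (3) act on a single phase space point and can be written down directly. The C sketch below is only an illustration: the `PhasePoint` type and the function names are hypothetical, and the electric field at the point is assumed to be supplied by the caller as the two components `ex` and `ey`.

```c
#include <assert.h>

/* A point of the 4D phase space: position (x, y) and velocity (vx, vy).
 * (Hypothetical type introduced for this example.) */
typedef struct { double x, y, vx, vy; } PhasePoint;

/* Advection in x: only the x coordinate moves, by -vx * dt. */
PhasePoint advect_x(PhasePoint p, double dt)
{
    p.x -= p.vx * dt;
    return p;
}

/* Advection in y: only the y coordinate moves, by -vy * dt. */
PhasePoint advect_y(PhasePoint p, double dt)
{
    p.y -= p.vy * dt;
    return p;
}

/* Advection in v: the velocity moves by -E(t, x) * dt, where (ex, ey) is
 * the electric field evaluated at the position of p (assumed given). */
PhasePoint advect_v(PhasePoint p, double ex, double ey, double dt)
{
    p.vx -= ex * dt;
    p.vy -= ey * dt;
    return p;
}
```

For instance, applying `advect_x` with `dt = 0.5` to the point ((1, 2), (3, 4)) moves it to ((−0.5, 2), (3, 4)), leaving the velocity untouched.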
We use a dyadic structured adaptive mesh, i.e., a mesh in which each cell belongs to a uniform grid of $2^j$ cells per dimension. The integer j is called the level of the cell. In this 4-dimensional dyadic mesh, each cell of level j, say α, is a 4-cube and can be refined into $2^4$ smaller cells of level j + 1. These smaller cells are called the daughters of α and, by analogy, cell α is called their mother. We denote by J the finest level and by $j_0$ the coarsest level of our mesh cells.
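The level arithmetic of such a dyadic mesh can be illustrated with a few C helpers. They are a sketch under assumptions (the names, the half-open interval convention and the per-dimension indexing are choices made for this example): a grid of level j has $2^j$ cells per dimension, a cell has $2^4$ daughters in four dimensions, and along each dimension the mother of cell i has index i/2.

```c
#include <assert.h>

enum { DIM = 4 };   /* number of phase space dimensions */

/* Number of cells per dimension of the uniform grid of a given level. */
long cells_per_dim(int level)
{
    assert(level >= 0 && level < 31);
    return 1L << level;
}

/* Index, along one dimension, of the cell of the given level that contains
 * z, for a domain [zmin, zmax) in that dimension. */
long cell_index(double z, double zmin, double zmax, int level)
{
    assert(zmax > zmin && z >= zmin && z < zmax);
    long n = cells_per_dim(level);
    long i = (long)((z - zmin) / (zmax - zmin) * (double)n);
    return i < n ? i : n - 1;   /* guard against rounding at the upper edge */
}

/* Number of daughters of one cell: 2^DIM. */
int daughters_per_cell(void)
{
    return 1 << DIM;
}

/* Along one dimension, index of the mother (level j - 1) of cell i (level j). */
long mother_index(long i)
{
    assert(i >= 0);
    return i / 2;
}
```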
At any time step $t^n = n\Delta t$, the solution f is represented by a dyadic mesh $M^n$ and a function $F^n$ which gives the value of f at every point (x, v) corresponding to a node of $M^n$. Each cell has $3^4$ equally spaced nodes and we use a biquadratic Lagrange interpolation to reconstruct the value of f at any point (x, v) within a cell.

Let us now present one advection in z, i.e., our numerical scheme for solving one transport equation. It gives the new representation $(M^{n+1}, F^{n+1})$ from a known old one $(M^n, F^n)$. It consists of the following procedure, applied to every cell α of the uniform grid of level $j_0$, where $M_\alpha^{n+1}$ denotes the part of $M^{n+1}$ which is contained in α:

1. Mesh prediction: Let $j_\beta$ denote the level of any cell β; $C_\beta$, the point of the phase space corresponding to the center of β; β′, the cell of $M^n$ which contains the advected point $A_z^{\Delta t}(C_\beta)$; and $j_{\beta'}$, the level of β′. Then recursively refine each cell β of $M_\alpha^{n+1}$ such that $j_{\beta'} \ge j_\beta$.
2. Computation of values: For every node of $M_\alpha^{n+1}$, let N be its corresponding point in the phase space and α′ the cell of $M^n$ which contains the advected point $A_z^{\Delta t}(N)$. By the conservation property 1, set $F^{n+1}(N)$ to the interpolated value at point $A_z^{\Delta t}(N)$, using the values of $F^n$ at the nodes of α′.
3. Mesh compression: For every group of $2^4$ daughter cells in $M_\alpha^{n+1}$, compare every value at their nodes with the value obtained by interpolation from the values at the mother's nodes. If the L1-norm of the difference is lower than a given threshold, then replace the daughter cells with their mother in $M_\alpha^{n+1}$.

Figure 1 shows the time loop of our resolution scheme. Notice that, for the sake of conciseness, the diagnostic step is not shown. As the advection in v uses the electric field E, it is preceded by the computation of E. In order to compute E, Poisson's equation is discretized on a uniform grid of the position space (x) with $2^{J+1} + 1$ points per dimension.
The computation of E then consists in a summation of values onto the 2D position space.

Fig. 1. The time loop of our resolution scheme: ($M^n$, $F^n$) → advection in x → advection in y → computation of E → advection in v → ($M^{n+1}$, $F^{n+1}$)
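The interpolation used when computing values inside a cell with three equally spaced nodes per dimension can be sketched in C as follows. This is an illustration under assumptions, not the authors' code: the local coordinate is taken in [0, 1] with nodes at 0, 1/2 and 1, the function names are hypothetical, and only the two-dimensional tensor product is shown; a four-dimensional cell would nest the same one-dimensional rule over its four dimensions.

```c
#include <assert.h>

/* Quadratic Lagrange interpolation through the nodes t = 0, 1/2, 1 with
 * values f0, f1, f2, evaluated at a local coordinate t in [0, 1]. */
double lagrange3(double f0, double f1, double f2, double t)
{
    assert(t >= 0.0 && t <= 1.0);
    double l0 = 2.0 * (t - 0.5) * (t - 1.0);   /* equals 1 at t = 0   */
    double l1 = -4.0 * t * (t - 1.0);          /* equals 1 at t = 1/2 */
    double l2 = 2.0 * t * (t - 0.5);           /* equals 1 at t = 1   */
    return l0 * f0 + l1 * f1 + l2 * f2;
}

/* Biquadratic interpolation on a 3 x 3 block of nodal values stored
 * row-major in f (first index along s, second along t). */
double biquadratic(const double *f, double s, double t)
{
    double g0 = lagrange3(f[0], f[1], f[2], t);
    double g1 = lagrange3(f[3], f[4], f[5], t);
    double g2 = lagrange3(f[6], f[7], f[8], t);
    return lagrange3(g0, g1, g2, s);
}
```

Because the rule is exact for polynomials of degree at most two in each variable, sampling $s^2 + st$ at the nine nodes and interpolating at (0.3, 0.7) gives back 0.30 up to rounding.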
3 Parallelization
In our numerical scheme, an advection consists in applying the same treatment to every part $M_\alpha^{n+1}$ of the dyadic mesh, in any order. In the following, we call block all the data used to describe such a part. Our parallel implementation is based on the distribution of blocks amongst processors. We first present the data structure which represents the dyadic mesh. We then propose a partitioning of the mesh which determines how blocks are distributed. Finally, we show how to optimize communications.
3.1 Data Structure
The data structure has a great impact on the efficiency of the implementation of an adaptive method. In our case, due to the high number of dimensions of the domain and the large amount of data it represents, we look for a good tradeoff between data access cost and memory usage. In our algorithm, the access to any value is defined by the advection operator, which uses an absolute location within the phase space. Therefore, random access is critical to our algorithm (as shown in [7]), and we use arrays to store values. Moreover, we use pointers to handle mesh sparsity and dynamically allocated arrays to minimize memory usage.

Our data structure is composed of four-dimensional arrays and has two levels. The first level is made of two arrays: one $[2^{j_0+1}]^4$-sized array of double, which stores the values at the nodes of the cells of level $j_0$, and one $[2^{j_0}]^4$-sized array of pointers, whose elements are in one-to-one correspondence with the cells of the uniform grid of level $j_0$. Each pointer either is NULL, meaning that the corresponding cell belongs to the mesh, or points to an array of the second level. An array of the second level stores the values of all the nodes which are located within the corresponding part of the mesh. The undefined value is stored in the array when the corresponding node does not exist in the mesh. The second level is thus made of some $[2^{J-j_0+1}]^4$-sized arrays of double. Moreover, since the mesh adapts at each time step, the arrays of the second level are dynamically allocated. A data block is stored either in the array of the first level, if it describes a single cell of level $j_0$, or in an array of the second level.

3.2 Mesh Partitioning
A good mesh partitioning is one which both reduces the cost of communications and balances the computational load [8]. In order to address the first issue, we use the following properties of our numerical method:

- During any advection in v, the treatment of any block, say B, only requires data within blocks having the same coordinates in position space as B. Symmetrically, during any advection in x, the treatment of any block B only requires data within blocks having the same coordinates in velocity space as B.
- Any v-advection exhibits irregular and unpredictable data dependencies, because these dependencies are defined by the self-consistent electrostatic field. On the other hand, the data dependencies in any x-advection are predictable, as they are only defined by v and Δt.

According to these properties, if we partition the mesh along the position dimensions, then the advections in v induce no communication. Moreover, the communications which occur during the advections in x are predictable and linear. Hence, we choose to reduce the partitioning problem to partitioning the 2D uniform grid of level $j_0$ of the position space. This means that each partition is defined by a 2D area made of some cells of this uniform grid. The partition contains all the mesh cells of the 4D phase space whose projection onto the position space is included in this area. As said previously, the induced communications are linear, therefore we only consider connected areas to form our partitions. More precisely, we choose to define our partitions from rectangular areas, which simplifies the communication scheme.

Now, to meet the second requirement, that is load balancing, we build approximately balanced areas for any given test case. Notice that our approach in this work is to perform a static partitioning. Future work will be directed at modifying partitions dynamically to follow the evolution in time of the physics more closely. As said previously, each area is a 2D rectangular one. Numerous techniques tend to build such areas by minimizing the ratio between height and width and by reducing the number of border elements in each direction, which usually induce communications (see [9] as an example). Besides these techniques, our partitioning uses some knowledge of the considered test case. More precisely, it uses a bounding box, say B, which approximates the shape of the beam of particles. It is assumed that most of the computational load is carried by the mesh elements contained within the bounding box B. The other parameters used in the building of our partitions are P, the number of processors, and $j_0$, the level of the uniform grid. Notice that level $j_0$ defines the coarse grid and determines the number of discretization points per surface unit of the physical domain. We therefore consider it as a parameter and not as an unknown, because we do not want to influence the accuracy of the solution. The partitioning problem then reduces to finding P rectangles composed of coarse cells such that these rectangles do not overlap and every rectangle covers approximately the same extent of the surface of B. Our partitioning problem is an optimization problem: minimize the greatest area amongst the P surface extents of B which are covered by each of the P rectangles.

This optimization problem can be expressed as a 0-1 integer linear programming problem. It is thus in general NP-hard and, as such, it is considered unlikely that there exists an efficient algorithm for solving it. Instead, we propose a heuristic which consists in recursively splitting the bounding box area into 2 approximately equal parts. It is thus assumed that the number of processors is a power of 2.

3.3 Obtaining Regular Communications
The communications are generated by the data dependences which are induced by the advection operator: the treatment of a node corresponding to the point $(x, y, v_x, v_y)$ of the phase space requires the values of the cell containing the advected point. As we partition the mesh along the (x, y)-axes (cf. Section 3.2), communications only occur during an x- or y-advection and the data dependences are defined by the linear advection operators $x \mapsto x - v_x\Delta t$ and $y \mapsto y - v_y\Delta t$. Figure 2 (left) shows these data dependences for the x-advection in the Cartesian $(x, v_x)$-plane: the data on the oblique line are needed to compute the data on the vertical line. Figure 2 (right) shows which data blocks are required to compute the data blocks of a column. In this particular case, we observe that the computation of any data block only requires the data within two blocks: the block itself and a neighboring block on the left or on the right, depending on the sign of $v_x$. This is a favorable situation, where the communication volume is minimal.

Fig. 2. Data dependencies and blocks required for updating one column of blocks (axes: x from $x_{\min}$ to $x_{\max}$ and $v_x$ from $v_{x\min}$ to $v_{x\max}$; the oblique line is $x - v_x\,\Delta t$)

This situation occurs when the test case parameters satisfy a particular condition, and our algorithm works on the assumption that this condition holds. The condition can be expressed by the following inequality:

$$\frac{x_{\max} - x_{\min}}{2^{j_0}} \ge \max(|v_{x\min}|, |v_{x\max}|)\,\Delta t. \tag{4}$$

3.4 Communication Overlapping
Conceptually, in the application of partitioning, each partition is assigned to a processor. In this section, we consider a lower level of abstraction and discuss how data have to be effectively distributed across processors in order to reach the best performance. To allow easy access to boundary data on neighboring processors, the data structure is extended to include ghost cells on each processor. Ghost cells are remote data blocks replicated in local memory to reduce access time (see [10]). The data structure has memory allocated for some boundary blocks that can be filled in with data from the adjoining processors. This is illustrated in Figure 3. In the following, the region assigned to a processor, formed from the partition extended with replicated blocks, will be referred to as the local region. Each advection in our numerical scheme then consists in an "update operation" which computes data within the local region and fills in some of the replicated blocks with updated data.
Fig. 3. Extension of global ordinary data structure to local regions, i.e., data structures with replicated blocks (for 3 processors p0, p1 and p2, in the (x, y)-plane)
Several algorithms for this update operation can be investigated. Their efficiency mostly depends on the characteristics of the computer system and on the size or dimensionality of the problem (see [11] for more details). Our implementation is described in Algorithm 1. This algorithm tends to maximize the overlapping of communication and computation.

Algorithm 1. Update of one local region
Input: A, one local region
Output: B, the updated region
begin
    init recv of replicated blocks of B from adjoining processors
    foreach border block of B do
        compute it from blocks of A
        init its send to the neighboring processor
    end
    compute inner blocks of B (from inner and border blocks of A)
    wait all send/recv
end
We use two data structures we call A and B. Data structure A stores the data before the operation and B stores the resulting data. Since our numerical scheme consists of a succession of advections, the output data is used as the input by the next operation. The role of each structure changes each time the operation is performed. This change of role is achieved by pointer assignments. In addition, the local region is virtually split into three disjoint parts: the replicated blocks, the border blocks and the inner blocks. These parts are defined by the direction (x, y or v = (vx, vy)) of the next advection. They are shown in Fig. 4. The algorithm works on the assumption that all required data of A are initially available. In particular, it means that the replicated blocks have been filled in with data at the first time step. The algorithm first initiates the non-blocking receive of all these blocks in B. Then, it computes each border block of B and initiates its send as soon as it is computed. Communications are overlapped with the computation of inner blocks and of the next border blocks of B.
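The ordering of Algorithm 1 and the exchange of roles between A and B can be illustrated with a small sequential simulation. This is only a sketch under strong assumptions, not the authors' MPI code: the grid is one-dimensional and periodic, each region holds a single replicated (ghost) cell on its left, the "advection" is a shift by one cell to the right, and a dictionary stands in for the messages in flight. All function names are hypothetical.

```python
def make_regions(values, p):
    """Split a periodic 1D grid into p regions stored as [ghost, cell, cell, ...]."""
    n = len(values) // p          # assumes len(values) is a multiple of p
    return [[values[r * n - 1]] + values[r * n:(r + 1) * n] for r in range(p)]


def update(A, B):
    """Fill B from A in the order used by Algorithm 1 (shift by one cell)."""
    p = len(A)
    in_flight = {}                              # stands in for the posted receives
    for r in range(p):
        B[r][-1] = A[r][-2]                     # border block is computed first ...
        in_flight[(r + 1) % p] = B[r][-1]       # ... and its send is initiated at once
    for r in range(p):
        for i in range(1, len(A[r]) - 1):       # inner blocks, while messages travel
            B[r][i] = A[r][i - 1]
    for r in range(p):
        B[r][0] = in_flight[r]                  # wait: replicated block of B is filled


def run(values, p, steps):
    A = make_regions(values, p)
    B = [region[:] for region in A]
    for _ in range(steps):
        update(A, B)
        A, B = B, A                             # roles swap by reference, no copy
    return [v for region in A for v in region[1:]]


print(run(list(range(12)), p=3, steps=5))
```

In a real distributed run the first and last loops would correspond to non-blocking receives and a final wait, so that the inner-block computation hides the transfer time; the simulation only reproduces the data dependencies.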
Fig. 4. The local region p2 in Fig. 3 split into three kinds of blocks (replicated blocks, border blocks and inner blocks): (a) when the next update is an x-advection, (b) when the next update is a y-advection, (c) when the next update is a v-advection (no replicated or border block)
A Block-Based Parallel Adaptive Scheme
4 Implementation and Performance Measurements
Our code is written in C and uses the portable MPI implementation MPICH [12]. In order to reduce memory copies, we defined two MPI data-types that describe the storage of data within the cells which are received or sent. A difficulty here is that data blocks may have two different memory mappings because a data block describes one or more cells, as said in section 3.1. Because of the adaptive mesh, it is not possible to predict which kind of block will be sent or received. Therefore we use two MPI data-types, MPI_Block1 and MPI_Block2, one for each block kind. These two types are identical except that the extent of type MPI_Block2 is greater than the extent of type MPI_Block1 in memory. Each of these two types is a structure with the same fields and the same offset for each of these fields. We use type MPI_Block2 to initiate a block receive. Therefore data are received in place whatever the kind of the block. Last, a call to function MPI_Get_count is used to determine the block type.

For our experiments, we use a PC cluster equipped with 17 dual-core processors with 2 GB of RAM each. Processors are Athlon 4800+ clocked at 2.4 GHz and connected through Gigabit Ethernet. Each core is seen as an independent MPI processor unit. Our test case is the uniform magnetic focusing of a semi-Gaussian beam of protons. The emittance of this proton beam is $2.36 \times 10^{-5}\,\pi$ m rad, its current is equal to 100 mA, and its energy is 5 MeV. The initialization parameters are computed by solving the envelope equation of the equivalent KV beam (see [13]). These physical parameters give a tune depression of 0.7. The length of the period is equal to S = 26.6292 m. We perform 75 time steps (with Δt = 0.0841 s). The particle beam is given by a semi-Gaussian distribution function
$$f_0(x, v) = \begin{cases} \dfrac{1}{8\pi^2}\, e^{-\frac{v_x^2 + v_y^2}{2}} & \text{if } x^2 + y^2 < 6 \\ 0 & \text{else} \end{cases} \qquad (5)$$
where x and v live in $[-6, 6]^2$.
The number of points is equal to 128 in each direction of position space and 64 in each direction of velocity space. We use j0 = 3 and J = 5. Figure 5 shows projections of the distribution function.

Impact of load balancing. We compare two partitionings of the mesh. The first one uses our heuristics to build balanced areas with a bounding box set
Fig. 5. Representation of the x − y projection (left) and x − vx projection (right) of the distribution function at time step 20
to the square $[-3.5, 3.5]^2$. The second one splits the domain into equally sized areas such that the difference between width and height is minimum. Figure 6 (right) shows the wall-clock time of the code in each of these two cases. Since the x-projection of the distribution function is round in shape and forms a disk of center (0, 0), the second partitioning gives well-balanced areas for 4 processors. Therefore our heuristics cannot enhance performance in that case. Otherwise simulations are between 15% and 30% faster with our heuristics.
[Figure 6: two plots against the number of processors (4, 8, 16, 32). One shows the wall-clock time (hh:mm) for the "equi part" and "box part" partitionings; the other shows the speedup for the linear, balanced and imbalanced cases, annotated with the efficiency labels 90.8%, 93.1%, 95.9%, 47%, 58.5% and 79%.]
Fig. 6. Performances of our code on a PC cluster
Performance and speedup. We observe that the performance improves up to 32 processors, but the speedup is sub-linear due to load imbalance arising from the beam evolution in time. Figure 6 (left) shows the speedup and efficiency obtained when the workload is well balanced throughout the execution (setting the compression threshold to 0) in comparison with an unbalanced execution. Note that this version of the code is not fully optimized; in particular, the cost of the interpolation operation can be reduced. We observe a slight loss of efficiency. It is not due to communications because they are totally overlapped by computation: we checked that in this experiment all messages are received before the wait barrier is reached. Thus, it can be explained by the computation of field E, which is sequential.
5 Conclusion
In this paper, we proposed a new adaptive numerical scheme and its parallel implementation for solving the 4D Vlasov equation. Contrary to the previous method [14], each part of the predicted mesh can be computed independently of the others at each time step. Thus, our parallel implementation based on mesh partitioning induces low overhead. Moreover, we made some assumptions on the physical parameters in order to ensure a regular communication pattern.
Experiments on a PC cluster showed good scalability provided the computational workload is well balanced throughout the whole simulation. They showed that a static partitioning alone is not enough to obtain good scalability and performance, even in the advantageous case of the focusing beam. Future work therefore consists of integrating into the code a mechanism that updates the partitioning in order to follow the evolution of the physics. We then intend to compare the scalability of our solver for distributed memory machines with that of codes which target shared memory machines.
References
1. Gutnic, M., Haefele, M., Latu, G.: A parallel Vlasov solver using a wavelet based adaptive mesh refinement. In: 7th Workshop on High Perf. Scientific and Engineering Computing (ICPP 2005), pp. 181–188 (2005)
2. Sonnendrücker, E., Filbet, F., Friedman, A., Oudet, E., Vay, J.L.: Vlasov simulation of beams with a moving grid. Comput. Phys. Commun. 164, 390 (2004)
3. Hoenen, O., Mehrenberger, M., Violard, E.: Parallelization of an adaptive Vlasov solver. In: Kranzlmüller, D., Kacsuk, P., Dongarra, J. (eds.) EuroPVM/MPI 2004. LNCS, vol. 3241, pp. 430–435. Springer, Heidelberg (2004)
4. Campos-Pinto, M., Mehrenberger, M.: Adaptive numerical resolution of the Vlasov equation. In: Numerical Methods for Hyperbolic and Kinetic Problems, CEMRACS, pp. 43–58 (2004)
5. Cheng, C.Z., Knorr, G.: The integration of the Vlasov equation in configuration space. J. of Comput. Phys. 22, 330–351 (1976)
6. Sonnendrücker, E., et al.: The semi-Lagrangian method for the numerical resolution of the Vlasov equation. J. Comput. Phys. 149, 201–220 (1999)
7. Hoenen, O., Violard, E.: An efficient data structure for an adaptive Vlasov solver. Technical Report 06-02, LSIIT–ICPS (February 2006)
8. Boukerche, A., Tropper, C.: A static partitioning and mapping algorithm for conservative parallel simulations. In: 8th workshop on Parallel and Distributed Simulation (PADS 1994), pp. 164–172 (1994)
9. Berger, M.J., Bokhari, S.H.: A partitioning strategy for nonuniform problems on multiprocessors. IEEE Transactions on Computers 36(5), 570–580 (1987)
10. Ding, C., He, Y.: A ghost cell expansion method for reducing communications in solving PDE problems. In: Supercomputing 2001 (CDROM) (2001)
11. Palmer, B., Nieplocha, J.: Efficient algorithms for ghost cell updates on two classes of MPP architectures. In: Parallel and Distributed Computing and Systems (PDCS 2002), pp. 192–197 (2002)
12. Gropp, W., et al.: A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing 22(6), 789–828 (1996)
13. Filbet, F., Sonnendrücker, E.: Modeling and numerical simulation of space charge dominated beams in the paraxial approximation. Technical Report RR-5547, INRIA Lorraine (2005)
14. Mehrenberger, M., et al.: A parallel adaptive Vlasov solver based on hierarchical finite element interpolation. Nucl. Inst. and Meth. in Phys. Res. 558(A), 188–191 (2006)
On Optimal Strategies of Russia’s Behavior on the International Market for Emissions Permits Alexey Kadiyev, Vyacheslav Maksimov, and Valeriy Rozenberg Institute of Mathematics and Mechanics, Ural Branch of RAS, 16 S.Kovalevskoi Str., 620219 Ekaterinburg, Russia {kadiev,maksimov,rozen}@imm.uran.ru http://www.imm.uran.ru
Abstract. A simplified dynamical model of the international market for greenhouse gases emissions permits is considered. A procedure for constructing optimal strategies of Russia’s behavior is suggested. A possibility of obtaining algorithm’s input data from integrated assessment models is discussed. A specific numerical analysis is performed.
1 Introduction
The global climate change is one of the most important problems in the modern world. The driving forces of this process are not completely studied yet, and its ecological, social, and economic consequences are rather disputable and complicated from the analytical viewpoint. However, the experts are in agreement that the dramatic climate change observed in the last century is explained to some extent by the increase of the atmospheric concentration of greenhouse gases (GHG), first of all, CO2 , due to man’s impact that is characterized by the essential increase of fossil fuel consumption. One of the efforts of the international community to control environmental impacts is the Kyoto Protocol developed by the United Nations Framework Convention on Climate Change (UNFCCC) and passed in December 1997. As is known, after the waivers of ratification of the Kyoto Protocol by the main countries producing GHG emissions, USA and China, the future of the Protocol directly depended on the position of Russia. However, even after the President of the Russian Federation signed the Federal Law “On the ratification of the Kyoto Protocol...” on 4 November 2004, the debate in Russia about future costs and benefits of being a Party to the Kyoto Protocol has continued with hardly mitigated intensity since many statements of the Protocol and mechanisms of its application have an ambiguous value in the context of developing economy of Russia. In the discussion, arguments of proponents and opponents of Russia’s participation in the Kyoto Protocol are
The work was supported by the International Institute for Applied Systems Analysis (Laxenburg, Austria), by the Russian Foundation for Basic Research (project no. 06-01-00359), and by the Programs of Basic Research of the Russian Academy of Sciences no. 16 “Environment and climate change: natural catastrophes” and no. 22 “Control processes”.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 118–126, 2008. © Springer-Verlag Berlin Heidelberg 2008
often based on results of application of different mathematical models, mainly of two types: (i) integrated models for evaluating regional and global effects of GHG reduction policies; (ii) optimization models. In the present paper, a model-oriented approach to constructing optimal strategies of Russia’s behavior on the international market for emissions permits is applied. This market is one of the Kyoto flexible mechanisms. It is regulated by the Marrakech Accords (passed in 2001) and was intensively discussed at the first and second Meetings of the Parties to the Kyoto Protocol (November 2005, Montreal, Canada; November 2006, Nairobi, Kenya). A simple model is characterized by Russia’s monopoly on the trade with the Annex B countries and by an opportunity of banking permits and optimizing their sale over time. Note that, due to the collapse of the industrial sectors in the 1990s, Russia actually does not need to reduce emissions for selling permits, since the amount of Russian so-called “hot air” is large enough. Therefore, Russia’s monopoly is considered as the first approximation to a more complicated multi-pole market. The model uses a demand function describing the market price for emissions permits in the Annex B countries, a cost function for emissions abatement in Russia, and the temporal dynamics of “hot air”. To obtain specific dependencies, different integrated assessment models are applied.
2 The Simple Model of Dynamics of the Stock of Permits
To describe the process of emissions permits banking with an opportunity at every time moment to sell some amount on the market and/or to increase the stock by emissions abatement, we use a dynamical controlled system. Let x(t) be the stock of permits that are banked at time t; h(t) be the “hot air” available for sale (this function is actually regulated by the Protocol and assumed to be known); q(t) be the emissions abatement; u(t) be the amount of permits supplied for sale. The last two functions are control parameters. It is natural to equate the rate of the stock of permits, $\dot{x}(t)$, with the difference of two values, h(t) + q(t) and u(t). This results in the following differential equation:
$$\dot{x}(t) = h(t) + q(t) - u(t), \quad t \in [t_0, T]. \qquad (1)$$
We assume that the initial time is a moment $t_0$ when the stock of permits equals zero, i.e.,
$$x(t_0) = 0. \qquad (2)$$
It is evident that the “hot air” h(t), the emissions abatement q(t), and the amount of permits u(t) supplied for sale can not exceed some definite values; this can be expressed as the following constraints:
$$a_1(t) \le q(t) \le b_1(t), \quad a_2(t) \le u(t) \le b_2(t), \quad a_3(t) \le h(t) \le b_3(t),$$
$$a_1(t) \ge 0, \quad a_2(t) \ge 0, \quad a_3(t) \ge 0, \qquad (3)$$
$$x(t) \ge 0. \qquad (4)$$
Note that constraints (3) provide fulfillment of the inequality x(t) ≤ K, where K is a constant, which can be explicitly written. Therefore, condition (4) is naturally replaced by the condition
$$0 \le x(t) \le K. \qquad (5)$$
We assume that all the scalar functions from the right-hand part of (1) belong to the space $L_2([t_0, T]; \mathbb{R})$, and that the functions $a_i(\cdot)$, $b_i(\cdot)$, i = 1, 2, 3, are continuous. In what follows, we consider any function h(·) as a known one. Functions q(·) and u(·) satisfying relations (3) and providing fulfillment of inequality (5) are called admissible controls. The set of all admissible controls is denoted by the symbol $U_*$. A solution of equation (1) is a Carathéodory solution and belongs to the space of absolutely continuous functions $A([t_0, T]; \mathbb{R})$. A solution corresponding to a pair of admissible controls $(q(\cdot), u(\cdot)) \in U_*$ is denoted by $x(\cdot; q(\cdot), u(\cdot))$.
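To make the dynamics concrete, system (1)–(2) can be integrated numerically and the phase constraint (5) checked along the trajectory. The following is a minimal sketch, assuming an explicit Euler scheme and constant sample functions chosen purely for illustration; the function names and numbers are hypothetical and not taken from the paper.

```python
def simulate_stock(h, q, u, t0, T, steps):
    """Explicit Euler integration of x'(t) = h(t) + q(t) - u(t) with x(t0) = 0."""
    dt = (T - t0) / steps
    x = 0.0
    trajectory = [x]
    for k in range(steps):
        t = t0 + k * dt
        x += dt * (h(t) + q(t) - u(t))
        trajectory.append(x)
    return trajectory


def satisfies_phase_constraint(trajectory, K, tol=1e-9):
    """Check 0 <= x(t) <= K, i.e. condition (5), at every grid point."""
    return all(-tol <= x <= K + tol for x in trajectory)


# Hypothetical data: 100 MtC/yr of hot air, 20 MtC/yr abated, 110 MtC/yr sold
traj = simulate_stock(lambda t: 100.0, lambda t: 20.0, lambda t: 110.0,
                      t0=2010.0, T=2030.0, steps=200)
print(traj[-1], satisfies_phase_constraint(traj, K=250.0))
```

Selling more than h(t) + q(t) from an empty stock drives x(t) below zero, which is exactly what condition (5) excludes.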
3 Statement of Optimal Control Problems
Let us formulate a problem of optimal control for system (1)–(3), (5).
Problem P1. It is required to find functions $q_*(\cdot)$ and $u_*(\cdot)$ solving the extremal problem
$$\max_{u,q} F(u, q), \qquad (6)$$
$$F(u, q) = \int_{t_0}^{T} \beta(\tau)\left[P(u(\tau))u(\tau) - C(q(\tau))q(\tau)\right] d\tau + \beta(T)\pi(T)x(T), \qquad (7)$$
$$x(T) = \int_{t_0}^{T} \left(h(\tau) + q(\tau) - u(\tau)\right) d\tau, \qquad (8)$$
and satisfying constraints (3) and (5). Here β(t) is the discount rate; P (u(t)) is the price of permits, which, as a rule, is inversely proportional to the amount of permits on the market; C(q(t)) is the cost function for marginal abatement, which, as a rule, is directly proportional to the level of abatement (it is determined by a so called regional MAC curve); π(t) is the expected price of permits. Thus, the integral term characterizes the total income from operations on the market minus abatement costs, whereas the terminal one represents the cost of all emissions permits banked till the moment T ; both with discounting. Optimization problems of a similar type have been investigated by many authors, see, for example, [1, 2]. Note that in the model described above the idea of possible banking of permits that can be profitable, firstly, due to the growth of demand and, consequently, of the market price for emissions permits in the future and, secondly, for decreasing own abatement costs (in a remote perspective) is realized.
We assume that the functions P(·) and C(·) are such that the functional F(u, q) is strongly convex with respect to u and q. Since equality (8) is valid, functional (7) can be rewritten in the form:
$$F(u, q) = \int_{t_0}^{T} \left[(\beta(\tau)P(u(\tau)) - \beta(T)\pi(T))u(\tau) - (\beta(\tau)C(q(\tau)) - \beta(T)\pi(T))q(\tau) + \beta(T)\pi(T)h(\tau)\right] d\tau. \qquad (9)$$
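The passage from (7)–(8) to (9) is a regrouping of terms, and it can be verified numerically. The sketch below evaluates both forms with the same midpoint rule; the price, cost, discount and control functions are hypothetical placeholders chosen only to exercise the formulas, not data from the paper.

```python
import math


def midpoint(f, a, b, n=2000):
    """Composite midpoint rule for the integral of f over [a, b]."""
    w = (b - a) / n
    return w * sum(f(a + (k + 0.5) * w) for k in range(n))


t0, T = 2010.0, 2030.0
beta = lambda t: math.exp(-0.03 * (t - t0))       # hypothetical discount factor
P = lambda u: 300.0 - u                           # hypothetical demand (price) function
C = lambda q: 0.5 * q                             # hypothetical marginal abatement cost
pi_T = 100.0                                      # hypothetical expected terminal price
h = lambda t: max(0.0, 150.0 - 7.0 * (t - t0))    # hypothetical "hot air"
u = lambda t: 80.0 + 2.0 * (t - t0)               # hypothetical sales control
q = lambda t: 30.0                                # hypothetical abatement control


def functional_form_7():
    x_T = midpoint(lambda s: h(s) + q(s) - u(s), t0, T)                 # equation (8)
    income = midpoint(lambda s: beta(s) * (P(u(s)) * u(s) - C(q(s)) * q(s)), t0, T)
    return income + beta(T) * pi_T * x_T


def functional_form_9():
    c = beta(T) * pi_T
    return midpoint(lambda s: (beta(s) * P(u(s)) - c) * u(s)
                    - (beta(s) * C(q(s)) - c) * q(s) + c * h(s), t0, T)


print(functional_form_7(), functional_form_9())
```

Because the quadrature rule is linear, the two values agree up to rounding error for any choice of the placeholder functions.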
Let us formulate an auxiliary problem of optimal control.
Problem P2. It is required to find functions $q_\alpha(\cdot)$ and $u_\alpha(\cdot)$ solving the extremal problem
$$\max_{u,q} F_\alpha(u, q), \qquad (10)$$
$$F_\alpha(u, q) = \int_{t_0}^{T} \left[(\beta(\tau)P(u(\tau)) - \beta(T)\pi(T))u(\tau) - (\beta(\tau)C(q(\tau)) - \beta(T)\pi(T))q(\tau) + \beta(T)\pi(T)h(\tau) - \alpha x^2(\tau)\right] d\tau, \qquad (11)$$
$$x(\tau) = \int_{t_0}^{\tau} \left(h(\xi) + q(\xi) - u(\xi)\right) d\xi,$$
and satisfying constraints (3) and (5). Here α > 0 is a small parameter. Note that, by virtue of the strong convexity of functionals (9) and (11) with respect to u and q, and of the convexity, boundedness, and closedness (in $L_2([t_0, T]; \mathbb{R}) \times L_2([t_0, T]; \mathbb{R})$) of the set $U_*$, Problems P1 and P2 have unique solutions, $(q_*(\cdot), u_*(\cdot)) \in U_*$ and $(q_\alpha(\cdot), u_\alpha(\cdot)) \in U_*$, respectively.

Theorem 1. Let, for any α > 0, functions $q_\alpha(\cdot)$ and $u_\alpha(\cdot)$ solve Problem P2. Then, for the functional sequence $(q_\alpha(\cdot), u_\alpha(\cdot))$, the following convergence is valid as α → 0:
$$(q_\alpha(\cdot), u_\alpha(\cdot)) \to (q_*(\cdot), u_*(\cdot)) \text{ weakly in } L_2([t_0, T]; \mathbb{R}) \times L_2([t_0, T]; \mathbb{R}), \qquad (12)$$
where $(q_*(\cdot), u_*(\cdot))$ is the unique solution of Problem P1.

Taking into account convergence (12), we solve auxiliary Problem P2 instead of Problem P1. The former, being a problem of optimal control under phase constraints, needs special solving methods. The algorithm used in the present work is described in [3]. It consists in reducing the problem with phase constraints to a sequence of classical optimal control problems, which can be solved, for example, by means of Pontryagin’s maximum principle [4].
4 Results of Numerical Modeling
In the numerical experiments, system (1) was considered on the time interval [2010, 2030], under the assumption that the Kyoto mechanisms are applicable on the whole interval (a so called “Kyoto Forever” scenario), i.e., under the assumption that the emissions levels (regulated by the Protocol) for the Annex B countries and the international market for emissions permits are preserved. The unit of the stock of permits x(t) was megaton of carbon equivalent (1 MtC); respectively, the controls u(t) and q(t) as well as the function h(t) were measured in MtC per year. The dynamics of carbon dioxide CO2 was studied. All prices were given in USD. As a forecast of the dependence of the market price for emissions permits on the amount of permits supplied for sale, under the conditions of Russia’s monopoly (actually, as an estimate of the demand for permits in the Annex B countries), the demand function P (u(·)) from model GEMINI-E3 was chosen, see Fig. 1a. This model is a general equilibrium model of the world economy [5]. The cost function C(q(·)) for the marginal abatement depending on the level of abatement (a so called regional MAC curve) was taken from the same model, see Fig. 1b. The linear interpolation was used between the pictured curves.
Fig. 1. Input data: (a) law of demand; (b) MAC curve of Russia. Functions for 2010, 2020, and 2030 are presented.
The constraints on the controls u(t) and q(t) were chosen as constants: a1 (t) = a2 (t) = 0, b1 (t) = b2 (t) = 250. We considered the process without discounting (β = 1), the parameter α (see (11)) was equal to 0.01, the expected price of permits at the terminal time π(T ) was equal to 0. The main goal of the experiment was studying the dependence of the optimal dynamics of control parameters u(t) and q(t), the stock of permits x(t), and the income obtained by Russia from operations on the market for permits on the amount of “hot air”, additionally (to the abatement) available for sale, i.e., on the function h(t). Actually, the value of the function h(t) is the difference between emissions at the moment t and the known emissions level of 1990 (the Kyoto level for Russia, 646 MtC). Then, using different scenarios of the economic
Table 1. Estimates for the temporal dynamics of Russian “hot air”, MtC per year

Variant   2010   2015   2020   2025   2030
(1)        155    114     69     33     -1
(2)        169    110     52    -13    -85
(3)        155    125     93     60     25
(4)        132     78     19    -19    -59
(5)         85     67     57     28     -3
(6)        186    105     41     16      6
(7)        300    245    199    163    136
Remarks. 1. Variants (1)–(7) correspond to the following forecasts: (1) — the reference scenario of the International Energy Outlook 2006 [6]; (2) — the forecast of the Energy Research Institute of RAS [7]; (3) — the reference scenario of the Fourth National Communication of the RF [8]; (4) — the innovation-active scenario of the Fourth National Communication of the RF [8]; (5) — the simulation results by MERGE [9]; (6) — the simulation results by EPPA [10]; (7) — the simulation results by GEMINI-E3 [5]. 2. Between the points listed in Table 1, the linear interpolation was used. 3. Negative values were replaced by zeros.
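Remarks 2 and 3 describe how a continuous function h(t) is obtained from the tabulated points. A minimal sketch of one possible reading is given below, using the values of variant (1) from Table 1. The order of the two operations is an assumption here: negative tabulated values are first replaced by zeros and the result is then interpolated linearly; the function name is hypothetical.

```python
YEARS = [2010, 2015, 2020, 2025, 2030]
VARIANT_1 = [155, 114, 69, 33, -1]     # row (1) of Table 1, MtC per year


def hot_air(t, years, values):
    """Piecewise-linear h(t); negative table entries are clamped to zero first
    (assumed order of Remarks 2 and 3)."""
    if not years[0] <= t <= years[-1]:
        raise ValueError("t lies outside the tabulated interval")
    clamped = [max(0.0, float(v)) for v in values]
    for k in range(len(years) - 1):
        t_a, t_b = years[k], years[k + 1]
        if t_a <= t <= t_b:
            w = (t - t_a) / (t_b - t_a)
            return clamped[k] + w * (clamped[k + 1] - clamped[k])


print(hot_air(2012.5, YEARS, VARIANT_1))   # halfway between 155 and 114
```

Clamping after interpolation instead would give slightly different values near the sign changes (for example between 2025 and 2030 in variant (1)), so the choice should be checked against the original computations.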
development of Russia and applying different models forecasting the dynamics of CO2 emissions, we obtained several scenarios of the dynamics of h(t), see Table 1. In addition to variants (1)–(7), two “extremal” cases with constant functions h(t) were computed: variant (0), where h(t) = 0, and variant (8), where h(t) = 163. The temporal dynamics of the main output parameters is presented in Figs. 2–3. Note that, due to the method’s error, it makes sense to limit the analysis to 2028. It is evident that the income from permits sale (see extremal problem (10)–(11)) should take a minimal (compared with other variants) value in variant (0);
Fig. 2. The temporal dynamics of (a) the amount of permits supplied for sale, u(t); (b) the emissions abatement, q(t). Variants (0)–(8).
this fact is confirmed by simulations. It turns out that variant (8) provides the maximum possible income over different functions h(t) (when the remaining parameters of the problem are fixed); the same result is obtained in variant (7). The maximality of the income in these variants is explained by the zero optimal value of q(t) (see Fig. 2b). Note that the least integer providing the maximum above was taken as the constant value of h(t) in variant (8). As is seen in Fig. 2a, the amount of permits supplied by Russia for sale on the international market varies in 2010 from 73 MtC to 127 MtC, in 2020 from 119 MtC to 167 MtC, and in 2028 from 142 MtC to 187 MtC (in variant (0) and in variants (7), (8), respectively). In all the variants, the amount of permits supplied for sale increases with time, and the growth rate is approximately the same (it varies from 2.0% to 3.4% per year). In contrast, the emissions abatement is rather stable with time in all the variants (see Fig. 2b; the most considerable growth is observed in variant (0)). The share of emissions reduction in the amount of permits supplied for sale ranges from the evident 0% in variants (7), (8) to 94% in 2010, 65% in 2020, and 59% in 2028 in variant (4) (with the exception of variant (0), where this index is not informative). In all the variants, it is inexpedient to use the disposable “hot air” all at once; the maximal banking of permits (the abatement is also taken into account) with the purpose of future income increase is observed in variant (6): 160 MtC in 2010 (with the exception of variants (7), (8), where this index is not informative). Note that the banking becomes possible due to the intertemporal optimization (on the whole time interval). As to the market price of permits (see Fig.
3a), it rises from 116 USD/MtC in 2010 up to 229 USD/MtC in 2028 in variants (7), (8) (the minimal prices) and from 171 USD/MtC in 2010 up to 287 USD/MtC in 2028 in variant (0) (the maximal prices), in all the variants rather slowly (from 2.9% up to 3.9% per year) increasing with time. For the comparative analysis of modeling results in variants (0)–(8), the histogram (Fig. 3b), where the maximum possible income (for the whole time
Fig. 3. Modeling results: (a) the temporal dynamics of the price of permits; (b) the Russian income from permits sale (in % of the maximum possible income in the model). Variants (0)–(8).
interval) is taken as 100%, is constructed. Analyzing the histogram, we conclude that the maximal income loss over forecasts (1)–(7) is 10.6% (in variant (4)), whereas the maximum possible loss is 20.5%. The average (over variants (1)– (7)) loss is rather small (6.2%). Hence, we can deduce that domestic resources of Russia (namely, an opportunity of relatively cheap (especially comparing with countries of European Union and Japan) emissions reduction in Russia due to incomplete realization of the energy effectiveness and energy saving potential) provide a considerable income from permits sale even in the case of unfavorable situation with “hot air”. It turns out that the dependence of the value of this income on the function h(t) is not so essential, as one can suppose, when analyzing mathematical model (1)–(11).
5 Concluding Remarks
It should be noted that there is a high level of uncertainty in the specification of parameters of the model in question. In the paper, several scenarios forecasting the temporal dynamics of “hot air” were studied. It is reasonable to consider the analysis of the dependence of optimal strategies of Russia’s behavior on the international market for permits on variation of different model parameters (in particular, the functions presented in Fig. 1) as one of the basic perspective directions in modeling.
References
1. Grimm, B., Pickl, S., Reed, A.: Management and optimization of environmental data within emissions trading markets - VEREGISTER and TEMPI. In: Antes, R., Hansjürgens, B., Letmathe, P. (eds.) Emissions Trading and Business, pp. 165–176. Physica, Heidelberg (2006)
2. Bernard, A., Haurie, A., Vielle, M., Viguier, L.: A two-level dynamic game of carbon emissions trading between Russia, China, and Annex B countries. Swiss National Centre of Competence, NCCR-WP4 Working paper 11, University of Geneva (September 2002)
3. Kryazhimskii, A., Ruszczyński, A.: Constraint aggregation in infinite-dimensional spaces and applications. IIASA Interim Report 97-051, Laxenburg, Austria (1997)
4. Pontryagin, L.S., Boltyanskii, V.G., Gamkrelidze, R.V., Mishchenko, E.F.: Mathematical theory of optimal processes. Moscow, Nauka (1969) (in Russian)
5. Bernard, A., Reilly, J., Vielle, M., Viguier, L.: Russia’s role in the Kyoto Protocol. In: Proceedings of the Annual Meeting of the International Energy Workshop jointly organized by EMF/IEA/IIASA, Stanford University, USA, June 18-20 (2002)
6. International Energy Outlook 2006. Energy Information Administration, www.eia.doe.gov/oiaf/ieo/index.html
7. Makarov, A., Likhachev, V.: Mitigation Measures and Policies from Russian Perspective. In: Workshop on Post-2012 Climate policies, IIASA, Laxenburg, Austria, June 13 (Preprint, 2006)
8. The Fourth National Communication of the Russian Federation. Israel, Y.A., Nakhutin, A.I., Semenov, S.M., et al. (eds.), ANO Meteoagentstvo Rosgidrometa, Moscow (2006)
9. Manne, A., Mendelson, R., Richels, R.: MERGE - a Model for Evaluating Regional and Global Effects of GHG reduction policies. Energy Policy 23(1), 17–34 (1995)
10. Babiker, M.H., Reilly, J.M., Mayer, M., Eckaus, R.S., Sue Wing, I., Hyman, R.C.: The MIT Emissions Prediction and Policy Analysis (EPPA) Model: revisions, sensitivities, and comparisons of results. MIT Joint Program on the Science and Policy of Global Change, report 71, Cambridge, MA, USA (2000)
Message-Passing Two Steps Least Square Algorithms for Simultaneous Equations Models
Jose-Juan López-Espín (1) and Domingo Giménez (2)
(1) Departamento de Estadística, Matemáticas e Informática, Universidad Miguel Hernández, 03202, Elche, Spain
[email protected]
(2) Universidad de Murcia, Departamento de Informática y Sistemas, 30071 Murcia, Spain
[email protected]
Abstract. The solution of Simultaneous Equations Models in high performance systems is analyzed. Message-passing algorithms for the Two-stage Least Squares method are developed. Algorithms are studied theoretically and experimentally. The algorithms make extensive use of basic libraries like BLAS, LAPACK, and ScaLAPACK to obtain efficient and portable versions.
1 Introduction
The solution of Simultaneous Equations Models in high performance systems is studied. These models can be solved through a variety of methods. In [5], a study of Indirect Least Square (ILS) and Two Step Least Square (2SLS) was presented. The cost of using ILS is lower than that of using 2SLS, but ILS can only be used in exactly identified equations. 2SLS can be used in exactly identified equations and in over identified equations, and for this reason 2SLS must be improved. In this paper three different versions of the 2SLS algorithm are presented. The first is a basic algorithm which will be improved in the second and the third versions. In the first version, the structure of the parallel 2SLS algorithm is stated. In the other versions, the same structure is followed but matrix decompositions are used to obtain lower costs. A number of variables will be used to analyze the algorithms. The system will be in structural form, and each equation will have one endogenous variable (called main endogenous) as a function of the other variables [3]. The variables which appear in the algorithms are:
– The number of data per equation, d. We suppose d has the same value for each equation.
– The number of endogenous variables in the system, N. We suppose it coincides with the number of equations in the system. Each endogenous variable is the main one in one equation.
– The number of predetermined variables in the system, K.
– The number of endogenous variables in equation i, $n_i$.
– The number of predetermined variables in equation i, $k_i$.
– X is the predetermined matrix. It has dimension d × K and contains the data of the predetermined variables.
– Y is the endogenous matrix. It has dimension d × N and contains the data of the endogenous variables.
There are three types of equations: non identified ($n_i - 1 > K - k_i$), with no solution; exactly identified ($n_i - 1 = K - k_i$), for which it is possible to use ILS or 2SLS; and over identified ($n_i - 1 < K - k_i$), where it is necessary to use 2SLS. In an equation, the problem of the correlation between random and endogenous variables is avoided by substituting the original variable for a new variable called proxy which has been calculated previously.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 127–136, 2008. © Springer-Verlag Berlin Heidelberg 2008
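The three-way classification of equations translates directly into a small decision function. The sketch below follows the inequalities stated above; the function name and the returned labels are hypothetical.

```python
def identification_type(n_i, k_i, K):
    """Classify equation i from n_i (endogenous variables in the equation),
    k_i (predetermined variables in the equation) and K (predetermined
    variables in the whole system)."""
    included = n_i - 1        # endogenous variables other than the main one
    excluded = K - k_i        # predetermined variables absent from the equation
    if included > excluded:
        return "non identified"        # no solution
    if included == excluded:
        return "exactly identified"    # ILS or 2SLS can be used
    return "over identified"           # 2SLS is required


print(identification_type(n_i=3, k_i=2, K=6))
```

A solver could use such a test to decide, equation by equation, whether the cheaper ILS method is applicable.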
2 Parallel 2SLS Algorithms
Parallel algorithms for 2SLS on distributed memory have been developed. Parallelization can be applied at different levels: in the basic matrix operations using PBLAS and ScaLAPACK, by dividing the work of the loops among the processors in the system, etc. In each case the parallelization has been applied at the highest possible level. It is assumed that $X$ and $Y$ are distributed among all the processors in ScaLAPACK style. When the algorithm finishes, the solution is also distributed. It is also assumed that all the processors have the complete structure of the system at the beginning of the algorithm, so it is not necessary to receive data from other processors.

In the theoretical study, the typical model of communications is used [4]. It is assumed that $t_s$ is the start-up time in point-to-point communications and $t_{sb}$ in broadcasts, and that $t_w$ and $t_{wb}$ are the word-sending times in a point-to-point and a broadcast communication, respectively. In some parts of the algorithms, a matrix is distributed among all the processors in ScaLAPACK style and the complete matrix must be sent to all the processors. The MPI library can be used here (routine MPI_Alltoall). The cost of this routine is assumed to be $T_{A2A}(T, p) = (p - 1)t_s + T t_w$, where $T$ is the total amount of data to send and $p$ is the number of processors. The values of $t_s$ and $t_w$ depend on the hardware.

The OLS technique is used in the algorithms studied in this work. The computation in OLS consists of $(X^t X)^{-1} X^t Y$. The resulting matrix is the estimator $\hat{\beta}$ of the coefficients $\beta$ in the equation

$$Y_t = \beta_1 X_{1,t} + \ldots + \beta_n X_{n,t} + u_t \qquad (1)$$

and $\hat{Y} = X\hat{\beta}$ is calculated to obtain an estimation of $Y$. The cost of OLS is

$$T_{OLS}(N, d, K, \delta_{est}) = \frac{2}{3}K^3 + 2K^2(d + N) + 2NKd + \delta_{est}\, 2NKd \qquad (2)$$

where $\delta_{est}$ is 1 if $\hat{Y}$ is requested in the algorithm, and 0 otherwise.
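The two cost expressions above are simple to evaluate numerically. The following Python sketch encodes the all-to-all communication model and the sequential OLS operation count of equation (2). The function names, the problem size and the values of ts and tw are assumptions chosen for illustration only; real values of ts and tw have to be measured on the target machine.

```python
def t_a2a(total_words, p, ts, tw):
    """All-to-all cost model from the text: (p - 1) * ts + T * tw."""
    return (p - 1) * ts + total_words * tw


def t_ols(N, d, K, est):
    """Sequential OLS cost of equation (2).

    est is 1 if the estimation Y-hat is requested and 0 otherwise.
    """
    return (2.0 / 3.0) * K**3 + 2 * K**2 * (d + N) + 2 * N * K * d + est * 2 * N * K * d


if __name__ == "__main__":
    # Hypothetical problem size and machine parameters (illustrative assumptions).
    N, d, K, p = 1000, 500, 400, 8
    print("OLS cost with estimation:", t_ols(N, d, K, 1))
    print("cost of sending Y-hat:", t_a2a(d * N, p, ts=1e-5, tw=1e-8))
```

Keeping the model as two small functions makes it easy to reuse them when the costs of the complete 2SLS versions are assembled later in the section.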
When parallel multiplications and a parallel inverse are used in $(X^t X)^{-1} X^t Y$, the cost of OLS, $T_{OLSp}(N, d, K, \delta_{est})$, is the cost of the sequential case divided by the number of processors, plus a communication cost of lower order.

2.1 Parallel Two-Stage Least Squares (First Version)
Since the solution of a system by 2SLS requires the use of OLS in each equation with proxies, predetermined and endogenous variables, it is necessary to send all the proxies and part of the system structure to each processor. Algorithm 1 shows the scheme of 2SLS when solving a complete system. Matrix $X_e$ is formed by the endogenous variables of the equation (which now take the values of the proxies), excluding the main endogenous variable, and by the predetermined variables. Thus, matrix $X_e$ has size $d \times (n_i - 1 + k_i)$.

Initially, in 2SLS all the proxies are calculated. Some of the $N$ proxies may not be used, but since they cannot be identified and the matrices are distributed at the beginning of the algorithm, it is better to calculate all of them. The cost is $T_{OLSp}(N, d, K, \delta_{est} = 1)$. When OLS ends, $\hat{Y}$ is distributed among all the processors in ScaLAPACK style and the complete matrix must be sent to all the processors. The cost of sending $\hat{Y}$ is $T_{A2A}(dN, p)$.

Algorithm 1. Scheme of the parallel 2SLS algorithm
1: $\hat{Y} = X(X^t X)^{-1} X^t Y$
2: Distribute $\hat{Y}$ to all the processors
3: IN PARALLEL Each processor $q$ DO
4:   for $j = 1 \ldots N/p$ do
5:     $i = q + (j - 1)p$
6:     Compute $X(X_e^t X_e)^{-1} X_e^t y_i$
7:   end for
8: END PARALLEL
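The index rule i = q + (j − 1)p in Algorithm 1 assigns the equations to the processors cyclically. A minimal Python sketch of this assignment is given below; the function name is hypothetical, processors and equations are numbered from 1 as in the algorithm, and it is assumed that p divides N exactly.

```python
def equations_for_processor(q, p, N):
    """Equations solved by processor q (1-based) under the rule i = q + (j - 1) * p.

    Assumes that p divides N, so each processor receives N // p equations.
    """
    return [q + (j - 1) * p for j in range(1, N // p + 1)]


if __name__ == "__main__":
    N, p = 8, 4  # illustrative sizes, not taken from the experiments
    for q in range(1, p + 1):
        print("processor", q, "->", equations_for_processor(q, p, N))
```

With this rule, consecutive equations go to different processors, which helps to balance the work when neighbouring equations have similar structure.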
To simplify the problem it is assumed that the structure of all the equations is similar, and that the cost of solving the first $N/p$ equations is similar to that of solving any other $N/p$ equations. Thus the cost is:

$$\begin{aligned}
T_{2SLSp,1ver}(N, d, K) &= T_{OLSp}(N, d, K, \delta_{est} = 1) + T_{A2A}(dN, p) \\
&\quad + \max_{q=1\ldots p} \sum_{j=1}^{N/p} T_{OLS}(1, d, k_{q+(j-1)p} + n_{q+(j-1)p} - 1, \delta_{est} = 0) \\
&\approx T_{OLSp}(N, d, K, \delta_{est} = 1) + T_{A2A}(dN, p) + \sum_{i=1}^{N/p} T_{OLS}(1, d, k_i + n_i - 1, \delta_{est} = 0)
\end{aligned} \qquad (3)$$

2.2 Second Version of 2SLS (QR Decomposition)
An algorithm using the Householder decomposition is studied. QR decomposition has been used before in other econometric techniques with good results [1,2,7]. The exogenous matrix $X$ is decomposed as $QR$ and the new matrices are used. Given the exogenous matrix $X$ (with dimension $d \times K$), there is an orthogonal matrix $Q$ (with dimension $d \times d$) and a triangular matrix $R$ (with dimension $d \times K$), with $X = QR$. $R$ has the form $\begin{pmatrix} R_1 \\ 0 \end{pmatrix}$, with blocks of size $K \times K$ and $(d - K) \times K$, where $R_1$ is an upper triangular matrix. Partitioning $Q$ as $(Q_1 | Q_2)$, where $Q_1$ is $d \times K$, we obtain:

$$X = QR = (Q_1 | Q_2) \begin{pmatrix} R_1 \\ 0 \end{pmatrix} = Q_1 R_1 \qquad (4)$$

An expression for $\hat{Y}$ is:

$$\hat{Y} = X\Pi = QRR_1^{-1}Q_1^t Y = Q \begin{pmatrix} \mathrm{Id} \\ 0 \end{pmatrix} Q_1^t Y = Q_1 Q_1^t Y \qquad (5)$$

Using this expression it is easy to prove:

$$\hat{Y}^t \hat{Y} = Y^t Q_1 Q_1^t Q_1 Q_1^t Y = Y^t Q_1 Q_1^t Y = Y^t \hat{Y} \qquad (6)$$
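The step from $X\Pi$ to $QRR_1^{-1}Q_1^t Y$ in equation (5) uses an expression for $\Pi$ that is not written out. Assuming that $\Pi = (X^t X)^{-1} X^t Y$ is the OLS coefficient matrix of Section 2 and that $R_1$ is nonsingular, it follows from $X = Q_1 R_1$ and $Q_1^t Q_1 = \mathrm{Id}$:

```latex
\begin{aligned}
\Pi &= (X^t X)^{-1} X^t Y
     = (R_1^t Q_1^t Q_1 R_1)^{-1} R_1^t Q_1^t Y \\
    &= (R_1^t R_1)^{-1} R_1^t Q_1^t Y
     = R_1^{-1} R_1^{-t} R_1^t Q_1^t Y
     = R_1^{-1} Q_1^t Y .
\end{aligned}
```

This is also the expression for the coefficients used in the OLS scheme based on the QR decomposition.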
Because $R_1$ is an upper triangular matrix, it is less costly to calculate its inverse. LAPACK can be used here and also in the QR decomposition. The routines used in the experiments in section 3 are: dtrtri for the inverse; dgeqrf for the QR decomposition; and dorgqr to calculate $Q$. The cost of dgeqrf is $\frac{2}{3}K^2(3d - K)$ and the cost of dorgqr is $\frac{2}{3}K^2(3d - K)$. The three versions of the 2SLS algorithm have been compared using only one processor (table 3). A parallel version can be developed by replacing the LAPACK functions (dgeqrf, dorgqr and dtrtri) with their ScaLAPACK equivalents (pdgeqrf, pdorgqr and pdtrtri). The scheme of a new OLS using the QR decomposition is shown in algorithm 2.

Algorithm 2. Scheme of OLS using QR decomposition (OLS2ver)
1: Obtain $Q_1$ and $R_1$ {cost → $\frac{4}{3}K^2(3d - K)$}
2: Compute $R_1^{-1}$ {cost → $\frac{1}{3}K^3$}
3: coef = $R_1^{-1} Q_1^t y_i$ {cost → $2K(K + d)$}
4: if estimation = TRUE then
5:   est = $Q_1 Q_1^t y_i$ {cost → $2dK$}
6: end if
Algorithm 2 is used by each processor in each equation solved by 2SLS. When OLS is used in this case, the matrix $X$ is formed by the proxy variables and the predetermined variables of the equation, and $y_i$ is the main endogenous variable of the $i$-th equation. The cost of OLS using the QR decomposition is:

$$T_{OLS2ver}(d, K, \delta_{est}) = 4K^2 d - K^3 + 2K(K + d) + \delta_{est}\, 2dK \qquad (7)$$
2.3 Third Version of 2SLS (Inverse Decomposition)
It is possible to obtain better results by using a special decomposition of a matrix in 2SLS, according to the theorem below.

Theorem 1. For symmetric matrices $A$ and $D$, if all necessary inverses exist, then

$$\begin{pmatrix} A & B \\ B^t & D \end{pmatrix}^{-1} = \begin{pmatrix} A^{-1} & 0 \\ 0 & 0 \end{pmatrix} + \begin{pmatrix} -E \\ \mathrm{Id} \end{pmatrix} (D - B^t A^{-1} B)^{-1} (-E^t, \mathrm{Id}) \qquad (8)$$

where $E = A^{-1}B$.

Proof. See Appendix A of [6].

Let us suppose that an equation is solved using 2SLS. Since the proxies have already been calculated, the first step is to replace the original endogenous variables in the equation with the proxies:

$$y_j = \alpha_0 + \gamma_1 x_{j1} + \ldots + \gamma_k x_{jk} + \alpha_1 \hat{y}_{j1} + \ldots + \alpha_m \hat{y}_{jm} + u_j \qquad (9)$$
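Returning to Theorem 1, the formula can be checked on the smallest possible case, in which every block is a 1 × 1 matrix. The Python sketch below is only an illustration under that assumption (the function name is hypothetical); it uses exact fractions so that the result can be compared with the ordinary 2 × 2 inverse without rounding effects.

```python
from fractions import Fraction


def block_inverse_scalar(a, b, d):
    """Inverse of [[a, b], [b, d]] via Theorem 1 with 1x1 blocks A = a, B = b, D = d."""
    a, b, d = Fraction(a), Fraction(b), Fraction(d)
    e = b / a                    # E = A^{-1} B
    s_inv = 1 / (d - b * e)      # (D - B^t A^{-1} B)^{-1}
    col = (-e, Fraction(1))      # column block (-E, Id)^t
    row = (-e, Fraction(1))      # row block (-E^t, Id); E^t = E for scalars
    first = [[1 / a, Fraction(0)], [Fraction(0), Fraction(0)]]
    return [[first[r][c] + col[r] * s_inv * row[c] for c in range(2)]
            for r in range(2)]


if __name__ == "__main__":
    # Illustrative symmetric matrix [[4, 2], [2, 3]].
    for row in block_inverse_scalar(4, 2, 3):
        print(row)
```

For general block sizes the same three factors appear, with the scalar operations replaced by matrix products and inverses.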
The set of proxies will be denoted by $\hat{Y}_1$ and the set of predetermined variables by $X_1$. The matrix used in subsequent calls to OLS is $[X_1 \hat{Y}_1]$ (dimension $d \times (m + k)$). Then a call to OLS is made to calculate $([X_1 \hat{Y}_1]^t [X_1 \hat{Y}_1])^{-1} [X_1 \hat{Y}_1]^t y_j$. Using the previous theorem, the inverse of the matrix can be calculated as follows:

$$\begin{aligned}
\left( \begin{pmatrix} X_1^t \\ \hat{Y}_1^t \end{pmatrix} \begin{pmatrix} X_1 & \hat{Y}_1 \end{pmatrix} \right)^{-1}
&= \begin{pmatrix} X_1^t X_1 & X_1^t \hat{Y}_1 \\ \hat{Y}_1^t X_1 & \hat{Y}_1^t \hat{Y}_1 \end{pmatrix}^{-1}
= \begin{pmatrix} (X_1^t X_1)^{-1} & 0 \\ 0 & 0 \end{pmatrix} \\
&\quad + \begin{pmatrix} -(X_1^t X_1)^{-1} X_1^t \hat{Y}_1 \\ \mathrm{Id} \end{pmatrix}
(\hat{Y}_1^t \hat{Y}_1 - \hat{Y}_1^t X_1 (X_1^t X_1)^{-1} X_1^t \hat{Y}_1)^{-1}
(-\hat{Y}_1^t X_1 (X_1^t X_1)^{-1}, \mathrm{Id})
\end{aligned} \qquad (10)$$

where much of the data can be taken from the matrices calculated at the beginning of the algorithm. The operations necessary in equation 10 are shown below (some of them were obtained in previous computations and others are performed at this point):

– $X_1^t X_1$ is taken from $X^t X$, which was calculated at the beginning.
– $(X_1^t X_1)^{-1}$ is calculated (cost $\frac{2}{3}k^3$).
– $X_1^t \hat{Y}_1$ is taken from $X^t Y$, which was calculated at the beginning, because $X^t \hat{Y} = X^t X \Pi = X^t X (X^t X)^{-1} X^t Y = X^t Y$.
– $(X_1^t X_1)^{-1} X_1^t \hat{Y}_1$ is calculated (cost $2k^2 m$).
– $\hat{Y}_1^t X_1 (X_1^t X_1)^{-1} X_1^t \hat{Y}_1$ is calculated (cost $2m^2 k$).
– $\hat{Y}_1^t \hat{Y}_1$ is taken from $\hat{Y}^t \hat{Y}$, which was calculated at the beginning.
– $(\hat{Y}_1^t \hat{Y}_1 - \hat{Y}_1^t X_1 (X_1^t X_1)^{-1} X_1^t \hat{Y}_1)^{-1}$ is calculated (cost $\frac{2}{3}m^3$).
– Multiplication of the first and the second matrices in equation 10 has cost $2m^2 k$, because the identity matrix is not computed.
– Multiplication of the previous matrix and the third matrix in equation 10 has cost $2m^2 k$ because it also includes the identity matrix.

The cost of additions is not considered, and the total cost is $\frac{2}{3}m^3 + \frac{2}{3}k^3 + 6m^2 k + 2k^2 m$. Matrix $[X_1 \hat{Y}_1]^t y_j$ must also be calculated. $X_1^t y_j$ can be taken from $X^t Y$, which was calculated at the beginning. $\hat{Y}_1^t y_j$ can be taken from $\hat{Y}^t \hat{Y}$ ($\hat{Y}^t \hat{Y} = \hat{Y}^t Y$, equation 6). Finally, the multiplication of the two matrices is made and the total cost is $T_{OLS3ver} = \frac{2}{3}(m^3 + k^3) + 2(m + k)^2 + 6m^2 k + 2k^2 m$ when only one equation is solved. This can be compared with the cost of the OLS used in the first version of 2SLS for an equation with $m$ endogenous variables, $k$ predetermined variables and sample size $d$, $T_{OLS1ver} = T_{OLS}(1, d, m + k, \delta_{est} = 0) = \frac{2}{3}(m + k)^3 + 2(m + k)^2 d + 2(m + k)^2 + 2(m + k)d$. Since $T_{OLS3ver}$ does not depend on $d$, the cost of solving the equations (without considering the matrix operations in lines 1, 2 and 3 of algorithm 4) is independent of the sample size $d$.

It is possible to deduce that $T_{OLS1ver} > T_{OLS3ver}$. Because $\frac{2}{3}m^3 + \frac{2}{3}k^3 \le \frac{2}{3}(m + k)^3 = \frac{2}{3}(m^3 + k^3 + 3m^2 k + 3mk^2)$, it is sufficient to see that $6m^2 k + 2k^2 m < 2(m + k)^2 d + 2(m + k)d$. $(X^t X)^{-1}$ must be calculated, so it is necessary that $K \le d$ ($X$ has dimension $d \times K$), and $0 < m + k \le K \le d$ because the equation is identified. From the previous inequality, it is deduced that $(m + k)^3 < d(m + k)(m + k + 1)$ and $3m^2 k + k^2 m \le (m + k)^3$. Then $3m^2 k + k^2 m < (m + k)^3 < d(m + k)(m + k + 1)$ and $T_{OLS3ver} < T_{OLS1ver}$. Of course, there are several extra operations at the beginning of the algorithm ($\hat{Y}^t \hat{Y}$ in line 2 of algorithm 4 must be computed). These extra operations have a cost of $2N^2 d$. Experimental results confirm that the third version is faster than the first version.

On comparing expressions 7 and 11, it is possible to deduce that $T_{OLS2ver} > T_{OLS3ver}$.
Because 2SLS is used in over-identified equations, $K + 1 > n_i + k_i > 0$, and then $3K^3 \ge 3(n_i + k_i)^3 > \frac{2}{3}(n_i^3 + k_i^3) + 6n_i^2 k_i + 2n_i k_i^2$. Furthermore $4K^2 d - K^3 \ge 3K^3$ and $2K(K + d) \ge 4K^2 > 2(n_i + k_i)^2$, because $K \le d$ ($X$ has dimension $d \times K$). From the previous inequalities, it is deduced that $T_{OLS3ver} < T_{OLS2ver}$.

When $k_i > 0$ and $n_i > 1$, the equation can be solved by OLS3ver. When $k_i = 0$ or $n_i = 1$, which is very improbable, OLS can be used with a reduced cost because matrix $\hat{Y}^t \hat{Y}$ has been calculated before. Algorithm 3 shows the scheme of OLS using the inverse decomposition (equation 10). The cost of OLS using the inverse decomposition is:

$$T_{OLS3ver}(k_i, n_i) = \frac{2}{3}(k_i^3 + n_i^3) + 2(n_i + k_i)^2 + 6n_i^2 k_i + 2k_i^2 n_i \qquad (11)$$
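The per-equation costs of the three versions can be compared numerically. The Python sketch below evaluates the first-version cost T_OLS(1, d, m + k, 0), equation (7) and equation (11). The function names and the sample sizes are assumptions made for illustration; in particular, the number of columns used in equation (7) is taken here as n + k, which is the smallest value allowed by the argument above, so the comparison is a conservative one.

```python
def t_ols1(m, k, d):
    """First version: T_OLS(1, d, m + k, delta_est = 0)."""
    c = m + k
    return (2.0 / 3.0) * c**3 + 2 * c**2 * d + 2 * c**2 + 2 * c * d


def t_ols2(d, K, est=0):
    """Second version (QR decomposition), equation (7)."""
    return 4 * K**2 * d - K**3 + 2 * K * (K + d) + est * 2 * d * K


def t_ols3(k, n):
    """Third version (inverse decomposition), equation (11)."""
    return (2.0 / 3.0) * (k**3 + n**3) + 2 * (n + k)**2 + 6 * n**2 * k + 2 * k**2 * n


if __name__ == "__main__":
    n, k, d = 5, 10, 100  # hypothetical equation structure and sample size
    print("version 1:", t_ols1(n, k, d))
    print("version 2:", t_ols2(d, n + k))
    print("version 3:", t_ols3(k, n))
```

Only the first two costs grow with the sample size d, which is the reason the third version benefits most when d is large.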
A scheme of the third version of 2SLS is shown in algorithm 4. A call to OLS3ver is made sequentially in each equation and the scheme studied above is used.
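Algorithm 3. Scheme of the OLS algorithm using inverse decomposition (OLS3ver)
1: Compute $(X_1^t X_1)^{-1}$, $(X_1^t X_1)^{-1} X_1^t \hat{Y}_1$, $\hat{Y}_1^t X_1 (X_1^t X_1)^{-1} X_1^t \hat{Y}_1$ and $(\hat{Y}_1^t \hat{Y}_1 - \hat{Y}_1^t X_1 (X_1^t X_1)^{-1} X_1^t \hat{Y}_1)^{-1}$
2: Compute $\begin{pmatrix} X_1^t X_1 & X_1^t \hat{Y}_1 \\ \hat{Y}_1^t X_1 & \hat{Y}_1^t \hat{Y}_1 \end{pmatrix}^{-1}$ {using equation 10}
3: Compute $([X_1 \hat{Y}_1]^t [X_1 \hat{Y}_1])^{-1} [X_1 \hat{Y}_1]^t y_j$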
Algorithm 4. Scheme of the parallel 2SLS algorithm 3rd version
1: $\hat{Y} = X(X^t X)^{-1}(X^t Y)$ {saving $\Pi$, $X^t X$ and $X^t Y$}
2: $\hat{Y}^t \hat{Y}$ {parallel multiplications}
3: Distribute $\Pi$, $X^t X$, $X^t Y$, $\hat{Y}$, $\hat{Y}^t \hat{Y}$ to all the processors
4: IN PARALLEL Each processor $q$ DO
5:   for $j = 1 \ldots N/p$ do
6:     $i = q + (j - 1)p$
7:     OLS3ver($y_i$, $\hat{Y}$, $X$, $X^t X$, $X^t Y$, $\hat{Y}^t \hat{Y}$)
8:   end for
9: END PARALLEL
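The total cost of the third version of 2SLS is:

$$\begin{aligned}
T_{2SLSp,3ver}(N, d, K) &= T_{OLSp}(N, d, K, \delta_{est} = 1) + \frac{2N^2 d}{p} + T_{A2A}(K^2 + N^2 + 2KN + dN, p) \\
&\quad + \max_{q=1\ldots p} \left\{ \sum_{\substack{j=1 \\ t = q + (j-1)p}}^{N/p} \left[ \frac{2}{3}(k_t^3 + n_t^3) + 2(n_t + k_t)^2 + 6n_t^2 k_t + 2k_t^2 n_t \right] \right\} \\
&\approx T_{OLSp}(N, d, K, \delta_{est} = 1) + \frac{2N^2 d}{p} + T_{A2A}(K^2 + N^2 + 2KN + dN, p) \\
&\quad + \sum_{i=1}^{N/p} \left[ \frac{2}{3}(k_i^3 + n_i^3) + 2(n_i + k_i)^2 + 6n_i^2 k_i + 2k_i^2 n_i \right]
\end{aligned} \qquad (12)$$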
3 Experimental Results
Experimental results have been obtained in:
– Kefren: A cluster of 20 dual-processor Pentium Xeon 2 GHz nodes interconnected by an SCI network with a Bull 2D topology in a 4 × 5 mesh. Each node has 1 Gigabyte of RAM.
Table 1. Execution time (in seconds) of the third version of the 2SLS algorithm in Kefren on one processor, with N = 1000 and K = 400, varying the sample size (d)

d:                              500              1000             1500             2000
                                time     %       time     %       time     %       time     %
total time                      197.43           202.78           203.35           228.08
$\hat{Y}$, $\hat{Y}^t \hat{Y}$    2.05   1.04      3.40   1.68      4.99   2.45      6.29   2.76
Table 2. Execution time (in seconds) and speed-up (Sp) of the third version of the 2SLS algorithm in Marenostrum, when varying the number of endogenous variables (N), the sample size (d) and the number of processors

N:       500             1000             1500              2000              2500
d:       500             500              1000              1000              1500
proc.    time    Sp      time     Sp      time      Sp      time      Sp      time       Sp
1        21.20           352.22           2014.73           7005.57           18471.71
4         6.26    3.39    91.55    3.85    528.08    3.82   1820.22    3.85    4930.59    3.75
8         3.49    6.07    48.23    7.30    274.49    7.34    944.70    7.42    2536.07    7.28
16        2.12   10.00    26.97   13.06    148.08   13.61   1425.33    4.92    1348.91   13.69
32        1.36   15.55    15.69   22.45     83.77   24.05    273.55   25.61     677.84   27.25
64        1.12   18.96    10.38   33.94     48.42   41.61    150.91   46.42     393.16   46.98
Table 3. Execution time (in seconds) and speed-up (Sp) of the second and third versions of the 2SLS algorithm with respect to the first version, with one processor in Kefren, when varying the number of endogenous variables (N) and the sample size (d)

N:        500            1000            1500             2000              2500
d:        500            500             1000             1000              1500
          time    Sp     time     Sp     time     Sp      time      Sp      time       Sp
1st ver   72.82          790.93          7031.96          19337.92          97874.21
2nd ver   74.32   0.98   549.07   1.44   4643.24  1.51     9676.49  2.00    29830.90   3.28
3rd ver   12.91   5.64   198.16   3.99   1225.33  5.74     4192.10  4.61    10217.02   9.58
– Marenostrum: A supercomputer based on PowerPC processors, the BladeCenter architecture, a Linux system and a Myrinet interconnection. Its main characteristics are: 10240 IBM PowerPC 970MP processors at 2.3 GHz (2560 JS21 blades), 20 TB of main memory, 280 + 90 TB of disk storage and a peak performance of 94.21 Teraflops. Marenostrum is the most powerful supercomputer in Europe and the fifth in the world, according to the latest TOP500 list.

As has been shown, the theoretical cost depends on a large number of parameters. To simplify the experiments, these have been carried out with different numbers of endogenous variables (N), and the number of exogenous variables has been taken as 40% of N. The Simultaneous Equations Systems used in the experiments have been created randomly. In all the experiments, the system to solve is in P0, and the solutions are in P0 when the algorithm finishes.

Fig. 1. Quotient between theoretical and experimental execution times of the second version of 2SLS, varying the number of processors

Table 1 shows the cost of the third version of the 2SLS algorithm when d is varied and the other parameters are fixed. It can be seen that the total cost depends only slightly on d. The cost of calculating $\hat{Y}$ and $\hat{Y}^t \hat{Y}$ (lines 1 and 2 of algorithm 4) increases strongly with d, but this cost, together with the communications, accounts for only about 2 per cent of the total cost. The cost of calculating $\hat{Y}$ and $\hat{Y}^t \hat{Y}$ is therefore negligible compared with the total cost. Thus, it is important to parallelize the solution of the equations (lines 5 to 8 of algorithm 4).

The third version of the 2SLS algorithm is analyzed experimentally (table 2). It can be seen that the speed-up is good and that it improves as N, K and d increase. Figure 1 shows that the model of the execution time (equation 4) approximates the experimental time well. The quotient shown is for the second version of 2SLS. Table 3 shows a comparison between the first, second and third versions of the algorithm with one processor. It can be seen that the third algorithm is the best.
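The speed-up figures can be recomputed directly from the execution times, and the parallel efficiency (speed-up divided by the number of processors) follows from them. The Python sketch below does this for the first column of Table 2 (N = 500, d = 500). The function name is hypothetical, and because the printed times are rounded to two decimals the recomputed speed-ups may differ from the tabulated ones in the last digit.

```python
# Execution times in seconds from Table 2, column N = 500, d = 500.
TIMES = {1: 21.20, 4: 6.26, 8: 3.49, 16: 2.12, 32: 1.36, 64: 1.12}


def speedup_and_efficiency(times):
    """Return {p: (speed-up, efficiency)} relative to the one-processor time."""
    t1 = times[1]
    return {p: (t1 / t, t1 / (t * p)) for p, t in times.items()}


if __name__ == "__main__":
    for p, (sp, eff) in sorted(speedup_and_efficiency(TIMES).items()):
        print(f"p = {p:2d}  speed-up = {sp:6.2f}  efficiency = {eff:5.2f}")
```

For this small problem the efficiency falls as processors are added, whereas the larger problems in Table 2 keep a much higher efficiency.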
4 Conclusions
Algorithms for the solution of Simultaneous Equations Models with Two-Stage Least Squares methods have been developed for message-passing systems. Simultaneous Equations Models appear in different scientific fields, and because the size of the applications varies greatly, it is of interest to have tools for the solution of both small and large systems. Routines in LAPACK and ScaLAPACK style have been developed. The parallel versions of the algorithms have proved to be scalable, and thus suitable for solving very large systems, such as those arising in the simulation of a national or the world economy.
Acknowledgment. This work has been funded in part by the Spanish MCYT and FEDER under Grant TIC2003-08238-C02-02 and by Fundación Séneca under Grant 02973/PI/05. The authors gratefully acknowledge the computer resources, technical expertise and assistance provided by the Barcelona Supercomputing Center - Centro Nacional de Supercomputación (Marenostrum) and the Networking Group of the Polytechnic University of Valencia (Kefren).
References
1. Foschi, P., Kontoghiorghes, E.J.: Estimation of VAR Models: Computational Aspects. Computational Economics 21(1), 3–22 (2003)
2. Gatu, C., Kontoghiorghes, E.J.: Parallel algorithms for computing all possible subset regression models using the QR decomposition. Parallel Computing 29(4), 505–521 (2003)
3. Greene, W.H.: Econometric Analysis, 3rd edn. Prentice-Hall, Englewood Cliffs (1998)
4. Kumar, V., Grama, A., Gupta, A., Karypis, G.: Introduction to Parallel Computing. Design and Analysis of Algorithms. The Benjamin/Cummings Publishing Company (1994)
5. López-Espín, J.J., Giménez, D.: Solution of Simultaneous Equations Models in high performance systems. In: PARA 2006 (June 2006)
6. Mardia, K.V., Kent, J.T., Bibby, J.M.: Multivariate Analysis. Academic Press, London (1979)
7. Yanev, P., Foschi, P., Kontoghiorghes, E.J.: Algorithms for Computing the QR Decomposition of a Set of Matrices with Common Columns. Algorithmica 39(1), 83–93 (2004)
Parallel Implementation of Cholesky LL^T-Algorithm in FPGA-Based Processor

Oleg Maslennikow (1), Volodymyr Lepekha (2), Anatoli Sergiyenko (2), Adam Tomas (3), and Roman Wyrzykowski (3)

(1) Technical University of Koszalin, ul. Sniadeckich 2, 75-453 Koszalin, Poland
[email protected]
(2) National Technical University of Ukraine, pr. Peremogy 37, 03056 Kiev, Ukraine
(3) Czestochowa University of Technology, ul. Dabrowskiego 69, 42-200 Czestochowa, Poland
Abstract. A fixed-size processor array architecture intended for the realization of matrix LL^T-decomposition based on the Cholesky algorithm is proposed. In order to implement this architecture in modern FPGA devices, an arithmetic unit (AU) operating in rational fraction arithmetic is designed. The AU is intended to be configured in Xilinx Virtex-4 FPGAs, and its hardware complexity is much lower than that of similar AUs operating on floating-point numbers.
1 Introduction
The majority of modern HPC clusters are built from high-end PCs, which significantly reduces procurement costs. Double-precision floating-point calculation has become the standard method for solving linear algebra problems. However, floating-point superscalar microprocessors are poorly utilized because of frequent pipeline stalls and limited data memory bandwidth. This disadvantage is only partly compensated by larger cache memories and high processor clock frequencies.

Many classes of applications demonstrate significant performance improvements when implemented in a field programmable gate array (FPGA). Although the clock rate of an FPGA rarely exceeds one tenth that of a PC, applications which exhibit massive parallelism are typically sped up by FPGAs. Therefore, many efforts have been made to introduce floating-point arithmetic units (AUs) configured in FPGAs, both in academic institutions [1], [2], [3], [4], [5] and in industry [6], [7], [8], [9]. Some authors show that FPGAs can provide excellent peak performance for such AUs [1], [2], [8]. However, it is shown in [9] that FPGAs could not become prevalent in this area, because they do not provide a good performance/cost ratio for double-precision floating-point calculations.

To minimize the drawbacks of floating-point FPGA implementations, a set of limitations is used. The standardized underflow calculations and rounding methods are usually not implemented because they are both hardware- and time-consuming. Addition, multiplication, division and square root operations are often implemented in separate AUs, because combining them in a single AU increases the hardware volume substantially [7]. Only multiplication with addition is combined in a single AU, because it is a natural operation in matrix calculations [2].

FPGAs give the opportunity to configure a system architecture that is adapted to the application field. Therefore, in papers [11], [12], floating-point AUs are used whose data bit width depends on the required calculation precision. Moreover, it is proposed in [12] to implement the general calculations in AUs with a small bit width, and to use double-precision AUs to derive precise results in the final iteration. Since the complexity of a floating-point multiplier and divider is proportional to the square of the operand bit width, using the logarithmic number system can have some advantages [13].

Rational numbers are real numbers that arise as solutions of linear equations or of integer polynomial divisions. Any rational number can be represented by a fraction, which consists of an integer numerator and an integer denominator [17]. Therefore, the representation of data by rational fraction numbers is a natural choice for solving linear algebra problems in FPGAs. In particular, the rational fraction number system is proposed in [14] for solving Toeplitz matrix problems. Later, this system was successfully used to implement the conjugate gradient method.

The Cholesky matrix factorization algorithm is widely used in many applications, such as modeling and least squares problems, due to its comparatively low computational complexity and its convergence properties. In this paper, we propose a processor array architecture, and an AU for it operating in rational fraction arithmetic (RFA), which are intended for the implementation of the Cholesky algorithm. This AU is designed to be configured in the Xilinx Virtex-4 FPGA platform, and its hardware complexity is much lower than that of similar AUs operating on floating-point numbers.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 137–147, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Cholesky Algorithm and Processor Array Structures for Its Implementation
$LL^T$-decomposition is [16], [19]-[22] an efficient alternative to classical Gaussian elimination for the solution of dense linear systems

$$Ax = b, \qquad (1)$$

when $A$ is an $N \times N$ symmetric positive definite matrix. The Cholesky algorithm is usually used for the $LL^T$-decomposition of such matrices $A$ in such a way that

$$L L^T = A, \qquad (2)$$

where $L$ is a lower triangular $N \times N$ matrix. This algorithm can be represented by the following construction:
for i = 1 to N do
begin
  l(i, i) = SQRT(a(i, i));
  for j = i+1 to N do
  begin
    l(j, i) = a(j, i)/l(i, i);
    for k = i+1 to j do
      a(j, k) = a(j, k) - l(j, i)*l(k, i);
  end
end

Note that the input matrix $A = A_1 = \{a_{ij}\}$ (where $i = 1, \ldots, N$; $j = 1, \ldots, i$) is recursively modified during $N$ computation steps in order to obtain the lower triangular matrix $L = A_{N+1}$. Moreover, in the $i$-th algorithm step ($i = 1, \ldots, N$) the elements of the $i$-th column of the matrix $L$ are computed, and the columns $k = i+1, i+2, \ldots, N$ of the matrix $A_i$ are modified. The main advantage of the Cholesky algorithm is its low computational complexity (for example, two times lower than that of Gaussian elimination [16]). Moreover, this algorithm is numerically stable without any pivoting procedures and is better suited to parallel implementation.

The basic dependence graph $G_B$ of the Cholesky algorithm is three-dimensional, and is shown in the left part of Fig. 1 for $N = 4$. It has been constructed in accordance with the method for deriving the dependence graph of a regular algorithm proposed in [18]. The nodes of $G_B$ are placed at the nodes of the three-dimensional lattice $Q = \{K = (i, j, k) : 1 \le i \le N, (i+1) \le j \le N, k \le j\}$, which form a pyramid. The height of the lattice is $N$ units (or layers), where the $i$-th layer ($i = 1, 2, \ldots, N$) corresponds to the computation of the elements $l_{ji}$, where $j = i+1, \ldots, N$, and to the transformation of the other columns of the matrix $A$, from the $(i+1)$-th to the $N$-th.

The right part of Fig. 1 shows an example of the processor array structure S1, obtained after space-time mapping of the graph $G_B$ using the method of [19]. This architecture consists of $N$ identical processor units (PUs), where the $i$-th PU performs the $i$-th algorithm step. Therefore all PUs must perform both types of algorithm operations: multiplication with subtraction, and division.
Because a new matrix $A$ can be processed as soon as the input of the previous matrix $A$ is completed, the above scheme is characterized by a pipelining period of $t = N^2$ steps and a processor utilization $\eta \to 1/6$. Note that the processor utilization can be improved in a processor array architecture consisting of a given number $h$ of processor elements. In order to design such a processor array structure, where the value of $h$ depends on the target FPGA device, the locally parallel globally sequential (LPGS) method of dependence graph decomposition is used [19]. In accordance with this method, all nodes within one subgraph are processed concurrently, while the subgraphs are processed sequentially. As a result, all intermediate data corresponding to data dependencies between subgraphs can be stored in a FIFO buffer or in a random access memory (RAM) block outside the processor array.
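Before the array structures are refined further, the standard algorithm given at the start of this section can be sketched in software as a reference. The Python version below is a plain sequential sketch, not a model of the processor array; the function name is hypothetical, indices start at 0, and the update of a(j, k) uses the already computed elements l(j, i) and l(k, i), which is the form shown in Table 1.

```python
import math


def cholesky_standard(a):
    """Return the lower triangular L with L * L^T = a.

    a is a symmetric positive definite matrix given as a list of lists;
    only its lower triangle is read and the input is not modified.
    """
    n = len(a)
    a = [row[:] for row in a]              # work on a copy
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        l[i][i] = math.sqrt(a[i][i])
        for j in range(i + 1, n):
            l[j][i] = a[j][i] / l[i][i]
            for k in range(i + 1, j + 1):
                a[j][k] -= l[j][i] * l[k][i]
    return l


if __name__ == "__main__":
    # Small illustrative matrix (an assumption, not data from the paper).
    A = [[4, 12, -16], [12, 37, -43], [-16, -43, 98]]
    for row in cholesky_standard(A):
        print(row)
```

Each pass of the outer loop corresponds to one layer of the dependence graph: one square root, a column of divisions, and a triangle of multiply-subtract updates.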
Fig. 1. Dependence graph of Cholesky algorithm and one of possible processor array structures
Starting with the graph $G_B$, we try to decompose it into a set of $s = \,]N/h[$ subgraphs having the same (or nearly the same) topology, where $]e[$ denotes the nearest integer greater than or equal to $e$. As is evident from Fig. 1, this can be done only if we cut the graph $G_B$ using a set of planes parallel to the $i$-axis. These planes decompose $G_B$ into $s$ regular subgraphs with $h$ layers each. After mapping each resulting $g$-th subgraph ($g = 1, 2, \ldots, s$) onto the corresponding architecture, the fixed-size processor array structure S2 is obtained (see Fig. 2). In this array all PUs are of the same type, and must implement addition, multiplication, division and square root. The total execution time $T$ of the Cholesky algorithm in this array is nearly equal to

$$T = \sum_{g=1}^{s} (N - g \cdot h + h)^2 = s \cdot N^2/3 + N \qquad (3)$$
time steps, and the asymptotic processor utilization is $\eta \to 1/2$ when $N \gg h$. Note that this array processor is characterized by the highest utilization $\eta$ and the lowest time $T$ among known architectures [19]-[22].

Fig. 2. The designed fixed-size processor array structure S2

The analysis of the processing units of any modern architecture shows that, firstly, all operations are implemented in pipelined mode. Secondly, different operations such as addition, division, and square root have different latencies. Therefore, to keep the calculations synchronous, the synchronization period of the Cholesky algorithm must be equal to the period of the longest operation, which here is the square root. As a result, the above estimates of the PU utilization can decrease dramatically. This fact motivated the search for another form of the Cholesky algorithm, in which the square root operations are separated from the others.

Table 1 shows the calculations in the original Cholesky algorithm for N = 4. Let us consider the calculation of the variable a*22 = a22 − a21·a21/(l11·l11). In this expression, the value l11 is equal to the square root of a11, so the product (l11·l11) merely recomputes a11. Thus, unnecessary calculations are made here, which introduce additional errors. Note that similar calculations are made for the elements a*32, ..., a*44. In order to eliminate the unnecessary computations, we introduce extra variables $m_{ji}$ in the original Cholesky algorithm, where $m_{ji} = a_{ji}/a_{ii}$ (for example, m21 = a21/a11), and change the standard order of computation. The modified algorithm is shown in Table 2. The analysis of the modified algorithm shows that the arithmetic unit of the $i$-th PU must perform only two types of operations: division, and multiplication with subtraction. Note that the new order allows us to group all square root operations in the last ($N$-th) algorithm step, and therefore to move these operations to the last processor element of the array, denoted as SP in Fig. 3. Such modifications lead to a further reduction of the hardware overhead of the processor array. The only disadvantage is an increase in the memory volume and in the bandwidth of the communication channels needed to store and transfer the multipliers $m_{ji}$.
Fig. 3. Modified processor array structure
Table 1. Standard order of computations in Cholesky LL^T-decomposition algorithm (N = 4)

Step 1                    Step 2                      Step 3                       Step 4
l11 = √a11                l22 = √a*22                 l33 = √a**33                 l44 = √a***44
l21 = a21/l11             l32 = a*32/l22              l43 = a**43/l33
l31 = a31/l11             l42 = a*42/l22              a***44 = a**44 − l43·l43
l41 = a41/l11             a**33 = a*33 − l32·l32
a*22 = a22 − l21·l21      a**43 = a*43 − l42·l32
a*32 = a32 − l31·l21      a**44 = a*44 − l42·l42
a*42 = a42 − l41·l21
a*33 = a33 − l31·l31
a*43 = a43 − l41·l31
a*44 = a44 − l41·l41
Table 2. Cholesky algorithm with reordered computations (N = 4)

Step 1                    Step 2                       Step 3                        Step 4
m21 = a21/a11             m32 = a*32/a*22              m43 = a**43/a**33             l11 = √a11
m31 = a31/a11             m42 = a*42/a*22              a***44 = a**44 − a**43·m43    l22 = √a*22
m41 = a41/a11             a**33 = a*33 − a*32·m32                                    l33 = √a**33
a*22 = a22 − a21·m21      a**43 = a*43 − a*42·m32                                    l44 = √a***44
a*32 = a32 − a31·m21      a**44 = a*44 − a*42·m42                                    l21 = m21·l11
a*42 = a42 − a41·m21                                                                 l31 = m31·l11
a*33 = a33 − a31·m31                                                                 l41 = m41·l11
a*43 = a43 − a41·m31                                                                 l32 = m32·l22
a*44 = a44 − a41·m41                                                                 l42 = m42·l22
                                                                                     l43 = m43·l33
The modified Cholesky algorithm can be represented as follows:

for i = 1 to N-1 do
begin
  for j = i+1 to N do
  begin
    m(j, i) = a(j, i)/a(i, i);
    for k = i+1 to j do
      a(j, k) = a(j, k) - a(j, i)*m(k, i);
  end
end
for i = 1 to N do
  l(i, i) = SQRT(a(i, i));                                   (4)
for i = 1 to N do
  for j = i+1 to N do
    l(j, i) = m(j, i)*l(i, i);

It is obvious from this description that the square root calculations can be implemented separately, in pipelined mode.
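The reordered computation can also be sketched in software. In the Python version below (a sequential illustration with a hypothetical function name and 0-based indices), the first loop nest uses only divisions and multiply-subtract operations and is carried out in exact rational arithmetic with the standard `fractions` module; floating-point values appear only in the final stage, where the square roots are grouped. Using exact fractions here is an assumption made for clarity and does not model the fixed bit width of a hardware unit.

```python
import math
from fractions import Fraction


def cholesky_modified(a):
    """Cholesky factor computed with the square roots deferred to the end."""
    n = len(a)
    a = [[Fraction(x) for x in row] for row in a]   # exact working copy
    m = [[Fraction(0)] * n for _ in range(n)]
    # First loop nest: divisions and multiplications with subtraction only.
    for i in range(n - 1):
        for j in range(i + 1, n):
            m[j][i] = a[j][i] / a[i][i]
            for k in range(i + 1, j + 1):
                a[j][k] -= a[j][i] * m[k][i]
    # Final stage: all square roots, then scaling of the multipliers.
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        l[i][i] = math.sqrt(a[i][i])
    for i in range(n):
        for j in range(i + 1, n):
            l[j][i] = float(m[j][i]) * l[i][i]
    return l


if __name__ == "__main__":
    # Small illustrative matrix (an assumption, not data from the paper).
    A = [[4, 12, -16], [12, 37, -43], [-16, -43, 98]]
    for row in cholesky_modified(A):
        print(row)
```

The price of the reordering is visible in the code: the multipliers m(j, i) have to be kept until the last stage, which matches the increase in memory volume noted above.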
3 Rational Fraction Arithmetic Unit
In this section, we discuss how to implement the obtained processor array architecture in Xilinx Virtex-4 FPGAs. The main quality criteria of the design process are reducing the hardware of the PU arithmetic unit and maximizing the clock frequency. The designed AU must perform all types of arithmetic operations (square roots, divisions, and multiplications with subtraction) in pipelined mode. The authors propose to use rational fraction AUs in the designed processor array in order to use the FPGA resources effectively.

Pipelined implementation of a given arithmetic operation is effective if the AU consecutively performs several identical operations of this type. It is clear from Table 2 that only the division result can initiate the inner loop of the first loop nest. Therefore it is preferable to implement the division operation and the multiplication with subtraction with an equal calculation period. Moreover, it is better to pass the division result m(j, i) directly to the input of the multiplication-with-subtraction hardware than to store it in memory.

As a result, the AU performs multiplication with or without addition (subtraction), $P = Z \times X + Y$; division, $P = Z/X$; and square root, $P = \mathrm{SQRT}(X)$, where $X$, $Y$, $Z$ and $P$ are rational fraction numbers. In the case of multiplication with addition, the AU calculates the following expression:

$$\frac{P_n}{P_d} = \frac{Z_n}{Z_d} \cdot \frac{X_n}{X_d} + \frac{Y_n}{Y_d} = \frac{(Z_n \cdot X_n) \cdot Y_d + (Z_d \cdot X_d) \cdot Y_n}{(Z_d \cdot X_d) \cdot Y_d} \qquad (4)$$

Note that in the case of a division operation $P = Z/X$, the operands $Z$ and $X$ are interchanged and the operand $Y$ is equal to zero. The Newton-Raphson method makes it possible to calculate $P = \mathrm{SQRT}(X)$ starting from a given approximate value $P^1$ of the result $P$. This value is then refined by performing $K$ iterations of the following form:

$$P^{k+1} = (P^k + X/P^k)/2, \qquad (5)$$

where $k = 1, 2, \ldots, K$. This expression differs from the multiplication with addition only in the exchange of the numerator and denominator of the multiplier, and in a single-bit right shift of the resulting numerator. Therefore, only slight modifications are required to perform the square root iteration on the AU designed for multiplication with addition and for division. In the proposed AU structure, the values of the numerator $P_n^1$ and denominator $P_d^1$ of $P^1$ are obtained from read-only memory (ROM) modules. Their volume is very small, and after performing $K = 4$ iterations the required precision of the result is provided.

The AU structure is shown in Fig. 4, where the indices n and d denote the numerator and denominator of the corresponding fraction number. This AU consists of five multiply units MPU, one adder SM and two normalization blocks NM1 and NM2. These blocks shift both the numerator and the denominator of the operation result left by the same number of bits, to prevent the loss of significant bits when the multiplication result is truncated. The input multiplexers MX serve to exchange the numerator and denominator of an operand $Z$, to feed back the intermediate results $P^i$ during the square root calculation, and to feed back the division result before the inner loop of the modified Cholesky algorithm is executed.

The designed rational fraction AU has been implemented in the Xilinx Virtex4 XC2VP4 device. This device has built-in DSP48 blocks based on 18-bit multiply units with 48-bit accumulators, and 18-kilobit dual-port BlockRAM blocks. The fraction bit width was chosen to be 35. As was shown in [15], this bit width is sufficient to process matrices with up to thousands of columns with adequate precision. A single 35-bit multiplier is implemented on four DSP48 blocks without additional logic.

In Table 3 the performance of the designed AU is compared with known double-precision floating-point AUs which have been implemented in similar FPGA devices. To compare the efficiency of the different variants, we consider a system corresponding to the structure S3 from Fig. 3, which can be configured in a single FPGA device. Note that the Xilinx XC4VSX55 device contains 512 DSP48 modules and 24648 CLB slices. This comparison shows that the proposed AU requires a configurable hardware volume 1.7 to 9 times smaller than that of the other AUs, and that this small volume in CLB slices allows it to provide a much higher throughput.
Fig. 4. Structure of the proposed rational fraction AU
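The three AU operations described above can be mimicked in software. The following is a minimal sketch, not the authors' hardware design: it assumes unbounded Python integers instead of 35-bit fields, omits the normalization and truncation performed by NM1 and NM2, and halves the result by doubling the denominator rather than shifting the numerator. The function names and the (numerator, denominator) tuple layout are illustrative assumptions.

```python
from fractions import Fraction  # used only to display and check the results


def mul_add(z, x, y):
    """P = Z * X + Y on (numerator, denominator) pairs, following Eq. (4)."""
    (zn, zd), (xn, xd), (yn, yd) = z, x, y
    return (zn * xn) * yd + (zd * xd) * yn, (zd * xd) * yd


def divide(z, x):
    """P = Z / X: reuse mul_add with X inverted and Y = 0."""
    xn, xd = x
    return mul_add(z, (xd, xn), (0, 1))


def sqrt_fraction(x, p1, iterations=4):
    """Refine the initial guess p1 with Eq. (5): P <- (P + X / P) / 2."""
    p = p1
    for _ in range(iterations):
        pn, pd = p
        sn, sd = mul_add(x, (pd, pn), p)  # X * (1 / P) + P
        p = (sn, 2 * sd)                  # halve by doubling the denominator
    return p


# Square root of 2, starting from the (hypothetical) table value 3/2.
root_n, root_d = sqrt_fraction((2, 1), (3, 2))
print(Fraction(root_n, root_d), float(Fraction(root_n, root_d)))
```

The sketch shows why one datapath suffices: division and the square root step are both calls to the same multiply-add with one operand inverted. In fixed-width hardware the growing numerators and denominators must be renormalized after every operation, which this software model does not attempt.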
Table 3. Comparison of the designed AU with double precision floating point AUs (* division is not implemented; ** a floating point divider was added to the PU)

| AU parameter                                      | Proposed AU | AU in [7] | AU in [2]* |
|---------------------------------------------------|-------------|-----------|------------|
| Hardware volume: slices                           | 811         | 7311      | 1419       |
| Hardware volume: multiplier units                 | 20          | 9         | 9          |
| Pipeline stages (latent delay):                   |             |           |            |
|   for addition and multiplication                 | 9           | 12+10     | 5+8        |
|   for division                                    | 9           | 56        | -          |
|   for square root                                 | 45          | 56        | -          |
| Maximum clock frequency (MHz)                     | 175         | 187       | 200        |
| PU number in XC4VSX55 device                      | 25          | 5         | 5**        |
| Throughput of the system in XC4VSX55, Mflops      | 4300        | 900       | 1000       |
In order to estimate the precision of computations in the designed rational fraction AU, the calculation error δ for the elements of the resulting matrix L has been determined as:

\[ \delta = \Big[ \sum_{i=1}^{N} \sum_{j=1}^{i} (a_{i,j} - q_{i,j})^2 \Big] / (0.5 \cdot N^2) \tag{6} \]

where $q_{i,j}$ are the elements of the matrix $Q = L \cdot L^T$. We have computed the values of δ for different data bit widths of the AU. The computation has been performed for input matrices A(N, N) with N = 10...120. The results obtained for N = 20 are presented in Fig. 5. It follows from Fig. 5 that the error δ is approximately $5 \times 10^{-9}$ when the bit width is equal to 35. Moreover, experimental results show that the error δ increases slowly (by up to 2.5 times) as the matrix size N increases from 10 to 120. Therefore, the designed 35-bit rational fraction AU can be used to compute the $LL^T$ decomposition of matrices with up to 1000 columns with sufficient precision.
Fig. 5. Computational errors of the proposed rational fraction AU
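The error measure of Eq. (6) is easy to evaluate in software. The sketch below is an illustration only: it uses ordinary double precision floats rather than the 35-bit rational fractions of the paper, a textbook Cholesky factorization rather than the processor-array algorithm, and it implements Eq. (6) as printed (sum of squared residuals over the lower triangle divided by 0.5·N²). The function names are assumptions.

```python
import math


def cholesky(a):
    """Textbook Cholesky factorization A = L * L^T of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for j in range(n):
        diag = a[j][j] - sum(low[j][k] ** 2 for k in range(j))
        low[j][j] = math.sqrt(diag)
        for i in range(j + 1, n):
            acc = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = acc / low[j][j]
    return low


def delta(a, low):
    """Error of Eq. (6): squared residuals of the lower triangle of A - L * L^T."""
    n = len(a)
    total = 0.0
    for i in range(n):
        for j in range(i + 1):
            q_ij = sum(low[i][k] * low[j][k] for k in range(j + 1))
            total += (a[i][j] - q_ij) ** 2
    return total / (0.5 * n * n)


a = [[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]]
low = cholesky(a)
print(delta(a, low))
```

Replacing the float operations by a reduced-precision arithmetic model and sweeping the bit width would give a curve of the kind reported in Fig. 5; that experiment is not reproduced here.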
4 Conclusions
The Cholesky algorithm for the $LL^T$ decomposition of symmetric matrices is characterized by a minimized number of computations and inherent parallelism, at the cost of a set of division and square root operations. These operations make it unattractive to implement the algorithm in FPGA-based parallel systems that perform double precision floating-point calculations. In this paper, an arithmetic unit structure is proposed that calculates division and square root as well as multiplication with addition on the basis of the rational fraction data representation, and that can be configured effectively in modern FPGAs. The advantages of using rational fractions in a modern FPGA implementation are small hardware volume, high throughput, and the possibility of regulating the precision and hardware volume by selecting the data width. The VHDL modeling has shown that such a data representation can be used in solving systems of linear equations by different methods, with a significantly reduced hardware complexity of the AUs compared to similar AUs operating on floating-point numbers, and without decreasing the AU performance.
Acknowledgment This work was supported by the Polish Ministry of Science and Higher Education under grant N515 002 32/0176.
References
1. Underwood, K.D., Hemmert, K.S.: Closing the Gap: CPU and FPGA Trends in Sustained Floating Point BLAS Performance. In: Proc. IEEE Symp. Field Programmable Custom Computing Machines, FCCM 2004 (2004)
2. Dou, Y., Vassiliadis, S., Kuzmanov, G.K., Gaydadjiev, G.N.: 64-bit Floating Point FPGA Matrix Multiplication. In: ACM/SIGDA 13-th Int. Symp. on Field Programmable Gate Arrays, FPGA-2005, pp. 86–95 (2005)
3. Johnson, J., Nagvajara, P., Nwankpa, C.: High-Performance Linear Algebra Processor using FPGA. In: Proc. High Performance Embedded Computing, HPEC 2003 (2003)
4. El-Kurdi, Y., Gross, W.J., Giannacopoulos, D.: Sparse Matrix-Vector Multiplication for Finite Element Method Matrices on FPGAs. In: Proc. 14th IEEE Symp. on Field-Programmable Custom Computing Machines, FCCM 2006 (2006)
5. Beauchamp, M.J., Hauck, S., Underwood, K.D., Hemmert, K.S.: Embedded Floating-Point Units in FPGAs. In: Proc. ACM Int. Symp. on Field Programmable Gate Arrays, Monterey, CA (2006)
6. Durbano, J.P., Ortiz, F.E., Humphrey, J.R., Prather, D.W.: FPGA-Based Acceleration of the 3D Finite-Difference Time-Domain Method. In: Proc. 12-th IEEE Symp. on Field-Programmable Custom Computing Machines, FCCM 2004 (2004)
7. Xilinx Floating-point Operators v2.0 DS335, Xilinx (2006), http://www.xilinx.com
8. Storaasli, O.: Scientific Applications on a NASA Reconfigurable Hypercomputer. In: Proc. 5-th MAPLD Conf. (2002)
9. Strenski, D.: Computational Bottlenecks and Hardware Decisions for FPGAs. FPGA and Structured ASIC Journal (14), 1–20 (2006)
10. Craven, S., Athanas, P.: Examining the Viability of FPGA Supercomputing. EURASIP Journal on Embedded Systems (2) (2007)
11. Lienhart, G., Kugel, A., Männer, R.: Using floating-point arithmetic on FPGAs to accelerate scientific n-body simulations. In: Proc. 10-th Ann. IEEE Symp. on Field-Programmable Custom Computing Machines, FCCM 2002, p. 182 (2002)
12. Strzodka, R., Goddeke, D.: Pipelined Mixed Precision Algorithms on FPGAs for Fast and Accurate PDE Solvers from Low Precision Components. In: Proc. Field-Programmable Custom Computing Machines (2006)
13. Matousek, R., Tichy, M., Phol, Z., Kadlec, J., Softley, C., Coleman, N.: Logarithmic number systems and floating-point arithmetics on FPGA. In: Proc. 12-th Int. Conf. on Field Programmable Logic and Applications, London, pp. 627–636 (2002)
14. Maslennikow, O., Shevtshenko, J., Sergyienko, A.: Configurable Microprocessor Array for DSP applications. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds.) PPAM 2004. LNCS, vol. 3019, pp. 36–41. Springer, Heidelberg (2004)
15. Maslennikow, O., Sergyienko, A., Lepekha, V.: FPGA Implementation of the Conjugate Gradient Method. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 526–533. Springer, Heidelberg (2006)
16. Golub, G.H., Van Loan, C.F.: Matrix Computations, 2nd edn., p. 642. J. Hopkins Univ. Press, Baltimore (1989)
17. Irvin, M.J., Smith, D.R.: A rational arithmetic processor. In: Proc. 5-th Symp. Comput. Arithmetic, pp. 241–244 (1981)
18. Wyrzykowski, R., Kanevski, J., Maslennikova, N., Maslennikov, O., Ovramenko, S.: Formalized Construction Method of Array Functional Graphs for Regular Algorithms. In: Engineering Simulation, vol. 14, pp. 217–232. Gordon and Breach Science Publishers, England (1997)
19. Kung, S.Y.: VLSI Array Processors. Prentice-Hall, Englewood Cliffs (1988)
20. Moreno, J.H., Lang, T.: Matrix Computation on Systolic-Type Arrays. Kluwer Acad. Publ., Boston (1992)
21. Quinton, P., Robert, Y.: Systolic Algorithms and Architectures. Prentice-Hall, Englewood Cliffs (1991)
22. Cosnard, M., Trystram, D.: Parallel Algorithms and Architectures. Int. Thomson Computer Press, Boston (1995)
Dimensional Analysis Applied to a Parallel QR Algorithm
Robert W. Numrich
Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN 55455 USA
Abstract. We apply dimensional analysis to a formula for execution time for a QR algorithm from a paper by Henry and van de Geijn. We define a single efficiency surface that reduces performance analysis for this algorithm to an exercise in differential geometry. As the problem size and the number of processors change, different machines move along different paths on the surface determined by two computational forces specific to each machine. We show that computational force, also called computational intensity, is a unifying concept for understanding the performance of parallel numerical algorithms.

Keywords: Scalability, performance analysis, computational intensity, computational force, parallel numerical algorithms, dimensional analysis.
1 Introduction
In a paper on the scalability of a parallel QR algorithm, Henry and van de Geijn [8] report the formula,

\[ t = (Cn + 20n^2/p + 40n)\gamma + (n/r + 2n/h)\alpha + (3n + 4n^2/(hp))\beta , \tag{1.1} \]
for execution time as a function of problem size n and the number of processors p. The details of the algorithm are not important for the purpose of this paper except to say that, according to Henry and van de Geijn, the QR algorithm is notoriously difficult to program such that it scales to large numbers of processors. In a later paper, Henry, Watkins and Dongarra [9] report another algorithm, which scales better but has a more complicated timing formula. For simplicity of presentation, we use the older formula to illustrate the principles of dimensional analysis as they apply to performance analysis for parallel algorithms. In a system of units based on length, energy and time, the quantities involved in timing formula (1.1) have the matrix of dimensions,

\[
\begin{array}{l|ccc|ccccccccc}
 & l_0 & e_0 & t_0 & t & \gamma & \alpha & \beta & n & p & C & r & h \\ \hline
L\,(\mathrm{word}) & 1 & 0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
E\,(\mathrm{flop}) & 0 & 1 & 0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
T\,(\mathrm{s}) & 0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0
\end{array}
\tag{1.2}
\]

where the unit of length is the eight-byte word, $l_0$ = 1 word, the unit of energy (work) is the eight-byte floating-point operation, $e_0$ = 1 flop, and the unit of time is one second, $t_0$ = 1 s. The quantity t is the execution time; the quantity γ is the reciprocal of the computational power of a single processor; α is a startup time; β is the reciprocal of bandwidth; n is the matrix size; and p is the number of processors. The three parameters C, h, and r are specific to the details of the QR algorithm. The block size h is set to the value h = n/p; the “bundling factor” r is set to the value r = 2; and the number of operations C for a Householder transformation is an undefined small number, which we take to be C = 5.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 148–155, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Dimensional Analysis
Dimensional analysis is based on the algebraic properties of the matrix of dimensions [2,4]. From the matrix (1.2), it is clear that formula (1.1) is not dimensionally consistent. This formula, along with many others like it, lacks a clear distinction between the unit of length and the unit of work. The problem size, which is dimensionless, enters into the calculation of both the number of operations performed and the amount of data moved. The reader is expected to infer units of work or units of length based on various powers of the problem size. The work done in this QR algorithm, for example, corresponds to the first factor of the first term of formula (1.1),

\[ w = (Cn + 20n^2/p + 40n)\,e_0 . \tag{2.1} \]

The first factor of the second term counts the number of startup penalties incurred,

\[ s = (n/r + 2n/h) , \tag{2.2} \]

and is dimensionless. The first factor in the third term,

\[ l = (3n + 4n^2/(hp))\,l_0 , \tag{2.3} \]

equals the number of words moved. To make the formula dimensionally consistent, therefore, we need to rewrite it as,

\[ t = w\gamma + s\alpha + l\beta . \tag{2.4} \]

Execution time, then, is a function of six variables,

\[ t = t(w, l, s, \alpha, \beta, \gamma) . \tag{2.5} \]
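As a numerical illustration of the rewriting above, the following sketch evaluates formula (1.1) directly and through the decomposition (2.1)-(2.4), using the paper's settings h = n/p, r = 2 and C = 5. The function names and the machine parameters passed in the example are illustrative assumptions, not values taken from Henry and van de Geijn.

```python
import math


def exec_time(n, p, gamma, alpha, beta, C=5.0, r=2.0):
    """Formula (1.1) with the block size h = n / p."""
    h = n / p
    return ((C * n + 20 * n**2 / p + 40 * n) * gamma
            + (n / r + 2 * n / h) * alpha
            + (3 * n + 4 * n**2 / (h * p)) * beta)


def work_startups_words(n, p, C=5.0, r=2.0):
    """The three counts of Eqs. (2.1)-(2.3), in units of e0, 1 and l0."""
    h = n / p
    w = C * n + 20 * n**2 / p + 40 * n  # floating-point operations
    s = n / r + 2 * n / h               # startup penalties (dimensionless)
    l = 3 * n + 4 * n**2 / (h * p)      # words moved
    return w, s, l


# Hypothetical machine: 1 Mflop/s, 100 microsecond startup, 1 Mword/s.
gamma, alpha, beta = 1e-6, 1e-4, 1e-6
w, s, l = work_startups_words(2000, 4)
t = w * gamma + s * alpha + l * beta    # Eq. (2.4)
print(w, s, l, t)
```

With h = n/p the counts simplify to s = n/2 + 2p and l = 7n, which is where the constants in the software forces of the next section come from.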
We have found in previous work [15,16] that a system of units based on length, force and time provides a unifying framework for performance analysis. Computational force is measured in the unit,

\[ f_0 = e_0 / l_0 = 1 \ \mathrm{flop/word}. \tag{2.6} \]

It is also called computational intensity [6,10,11,12,13] and is recognized as an important quantity in performance analysis. In a system of measurement with length, force and time as primary units, the quantities involved in relationship (2.5) have the matrix of dimensions,

\[
\begin{array}{l|ccc|ccccccc}
 & l_0 & f_0 & t_0 & t & w & l & s & \alpha & \beta & \gamma \\ \hline
L\,(\mathrm{word}) & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & -1 & -1 \\
F\,(\mathrm{flop/word}) & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & -1 \\
T\,(\mathrm{s}) & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 1 & 1
\end{array}
\tag{2.7}
\]
The basic assumption of dimensional analysis is that relationship (2.5) is independent of the system of units we use [1,2,5,14,15,16]. Under this assumption, we may pick three scaling parameters, $\alpha_L$, $\alpha_F$, $\alpha_T$, one for each primary unit, and scale each quantity, according to its entries in the matrix of dimensions, such that the relationship still holds,

\[ \alpha_T t = t(\alpha_L \alpha_F w,\ \alpha_L l,\ s,\ \alpha_T \alpha,\ \alpha_L^{-1} \alpha_T \beta,\ \alpha_L^{-1} \alpha_F^{-1} \alpha_T \gamma) . \tag{2.8} \]

We can pick these scale factors in such a way as to establish a new system of units where the number of parameters reduces from six to three and all quantities are scaled to dimensionless ratios. Indeed, if we pick them such that

\[ \alpha_L \alpha_F w = 1 , \qquad \alpha_L l = 1 , \qquad \alpha_L^{-1} \alpha_F^{-1} \alpha_T \gamma = 1 , \tag{2.9} \]

we can solve for the three parameters to find

\[ \alpha_L = 1/l , \qquad \alpha_F = l/w , \qquad \alpha_T = 1/(\gamma w) . \tag{2.10} \]

Relationship (2.8) scaled by these parameters becomes the new relationship,

\[ t/(\gamma w) = t(1,\ 1,\ s,\ \alpha/(\gamma w),\ l\beta/(\gamma w),\ 1) , \tag{2.11} \]
involving only three quantities rather than six. We recognize the new unit of time, $t_* = 1/\alpha_T$, as the minimum time to complete the computation with the value,

\[ t_* = \gamma w , \tag{2.12} \]

from (2.10). The left side of equation (2.11) is, therefore, the reciprocal of the efficiency,

\[ t/t_* = 1/e . \tag{2.13} \]

Scaling both sides of (2.4) using the parameters from (2.10), we obtain the efficiency function,

\[ e = [1 + s\alpha/(\gamma w) + l\beta/(\gamma w)]^{-1} . \tag{2.14} \]

3 Computational Forces and the Efficiency Surface
The efficiency formula (2.14) depends on ratios of computational forces. To recognize this fact, we first observe that the efficiency formula assumes the simple form,

\[ e = \frac{1}{1 + u + v} , \tag{3.1} \]

in terms of the two variables,

\[ u = s\alpha/(\gamma w) , \qquad v = l\beta/(\gamma w) . \tag{3.2} \]

Each of these variables is the ratio of opposing computational forces. Using the quantities defined by equations (2.1)-(2.2) and the values h = n/p, r = 2, we find the first variable,

\[ u(n, p) = \frac{\phi_1^H}{\phi_1^S(n, p)} , \tag{3.3} \]

is the ratio of a hardware force,

\[ \phi_1^H = (\alpha / l_0)/\gamma , \tag{3.4} \]

to a software force,

\[ \phi_1^S = \frac{n(20n/p + 40 + C)}{n/2 + 2p}\, f_0 . \tag{3.5} \]

In the same way, the second variable is the ratio of two other opposing forces,

\[ v(n, p) = \frac{\phi_2^H}{\phi_2^S(n, p)} , \tag{3.6} \]

with a second hardware force,

\[ \phi_2^H = \beta/\gamma , \tag{3.7} \]

and a second software force,

\[ \phi_2^S(n, p) = \frac{20n/p + 40 + C}{7}\, f_0 . \tag{3.8} \]
The two hardware forces are independent of the problem size and the number of processors. The first one is determined by the ratio of the startup time to the reciprocal of the computational power, and the second one is determined by the ratio of the reciprocal of the bandwidth to the reciprocal of the computational power. The two software forces, on the other hand, are functions of problem size and the number of processors, independent of the hardware. They are determined by this particular version of the QR algorithm. To help visualize how efficiency changes as a function of problem size and the number of processors, we define curvilinear coordinates on the efficiency surface. For fixed problem size, $n = n^*$, we use the number of processors as a parameter to define the curvilinear coordinates,

\[ u^*(p) = u(n^*, p) , \qquad v^*(p) = v(n^*, p) . \tag{3.9} \]

As the number of processors changes, these coordinates define paths along the surface determined by the equation,

\[ e^*(p) = \frac{1}{1 + u^*(p) + v^*(p)} . \tag{3.10} \]

These paths correspond to strong scaling where the problem size remains fixed and the efficiency decreases as the number of processors increases. Different machines follow different paths determined by their particular values of the hardware forces $\phi_1^H$ and $\phi_2^H$. Similarly, for fixed number of processors, $p = p_*$, the curvilinear coordinates,

\[ u_*(n) = u(n, p_*) , \qquad v_*(n) = v(n, p_*) , \tag{3.11} \]

define paths along the surface,

\[ e_*(n) = \frac{1}{1 + u_*(n) + v_*(n)} , \tag{3.12} \]

as the problem size changes. These paths correspond to weak scaling where the problem size increases to obtain higher efficiency for a fixed number of processors.
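The coordinates and paths of this section can be tabulated with a few lines of code. The sketch below assumes h = n/p, r = 2, C = 5 and the units l0 = e0 = 1, and it uses illustrative hardware forces of 100 and 1 flop/word; the function names are hypothetical. It also checks numerically that the force form (3.1)-(3.8) agrees with the efficiency function (2.14).

```python
def software_forces(n, p, C=5.0):
    """Software forces of Eqs. (3.5) and (3.8) in flop/word."""
    core = 20.0 * n / p + 40.0 + C
    return n * core / (n / 2.0 + 2.0 * p), core / 7.0


def coordinates(n, p, phi_h1, phi_h2, C=5.0):
    """Coordinates u and v of Eqs. (3.3) and (3.6)."""
    phi_s1, phi_s2 = software_forces(n, p, C)
    return phi_h1 / phi_s1, phi_h2 / phi_s2


def efficiency(u, v):
    """Efficiency surface of Eq. (3.1)."""
    return 1.0 / (1.0 + u + v)


def path_fixed_p(p, sizes, phi_h1, phi_h2):
    """Path of Eqs. (3.11)-(3.12): fixed processor count, growing problem size."""
    return [efficiency(*coordinates(n, p, phi_h1, phi_h2)) for n in sizes]


def path_fixed_n(n, procs, phi_h1, phi_h2):
    """Path of Eqs. (3.9)-(3.10): fixed problem size, growing processor count."""
    return [efficiency(*coordinates(n, p, phi_h1, phi_h2)) for p in procs]


def efficiency_2_14(n, p, alpha, beta, gamma, C=5.0):
    """Efficiency function (2.14) evaluated from w, s and l with h = n/p, r = 2."""
    w = C * n + 20.0 * n**2 / p + 40.0 * n
    s = n / 2.0 + 2.0 * p
    l = 7.0 * n
    return 1.0 / (1.0 + s * alpha / (gamma * w) + l * beta / (gamma * w))


growing_n = path_fixed_p(512, range(500, 5001, 500), 100.0, 1.0)
growing_p = path_fixed_n(4000, [4, 16, 64, 256], 100.0, 1.0)
print(growing_n)
print(growing_p)
```

For these inputs the efficiency rises along the fixed-p path and falls along the fixed-n path, matching the qualitative description in the text.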
4 Discussion
The relative sizes of the opposing hardware and software forces determine the efficiency of the algorithm. From (3.1) it is clear that high efficiency requires small values for the variables u and v. Machines with large hardware forces require large software forces to obtain the same efficiency. The following table,

|               | Machine 1 (ca. 1990) | Machine 2 (ca. 2007) |
|---------------|----------------------|----------------------|
| α             | 10^-4 s              | 10^-6 s              |
| β^-1          | 10^6 word/s          | 10^9 word/s          |
| γ^-1          | 10^6 flop/s          | 10^9 flop/s          |
| φ_1^H         | 10^2 flop/word       | 10^3 flop/word       |
| φ_2^H         | 1 flop/word          | 1 flop/word          |

(4.1)
shows how the hardware forces $\phi_1^H$ and $\phi_2^H$ have changed over approximately the last two decades. We expect modern machines to be less efficient than older machines for the same size problem. For this algorithm, equal efficiency requires higher software forces obtained by increasing the problem size or by decreasing the number of processors. Figure 1 shows two paths along the efficiency surface, one for each machine in the table. The number of processors is fixed at p = 512, and each point on a path corresponds to a different problem size calculated from the curvilinear coordinates $u_*(n)$ and $v_*(n)$ from equation (3.11). For large enough problem size, both machines approach perfect efficiency. But they follow different paths at different rates along the surface to reach the summit. The modern machine is clearly less efficient than the earlier machine.

[Figure 1: efficiency e plotted as a surface over the coordinates u and v]

Fig. 1. Two paths along the efficiency surface for fixed number of processors, p = 512. The problem size increases from n = 500 at low efficiency in increments of 100. In the limit of very large problem size, both machines approach perfect efficiency. They approach the limit at different rates along different paths on the surface. The higher path on the surface corresponds to Machine 1; the lower path to Machine 2.

Our approach to performance analysis [14,15,16] is quite different from the more familiar approaches that have appeared in the literature. Given formula (3.1), our first impulse is to plot the surface, or contours of the surface, as a function of the problem size and the number of processors. For example, Henry and van de Geijn [8] display the results in the first three columns of the following table,

| p  | n    | efficiency (measured) | efficiency (modeled) |
|----|------|-----------------------|----------------------|
| 4  | 2000 | 0.63                  | 0.76                 |
| 8  | 2800 | 0.65                  | 0.69                 |
| 16 | 4000 | 0.63                  | 0.62                 |
| 32 | 4800 | 0.53                  | 0.39                 |
| 48 | 6000 | 0.48                  | 0.45                 |
| 64 | 8000 | 0.47                  | 0.45                 |
| 96 | 9600 | 0.39                  | 0.39                 |

(4.2)
measured on an Intel Paragon XP/S Model 140 parallel system. Unfortunately, they do not specify the parameters α, β, and γ that enter into their timing formula (1.1). Nor, for that matter, do they specify the value of the constant C, which we have arbitrarily taken as C = 5. This lack of information makes it difficult to compare their measured results with their analytic results. It is especially difficult, after ten to fifteen years, to determine what exactly the Intel Paragon system was and what the particular parameters for the machine might have been. As our best guess, we have used the values α = 30 microseconds and $\beta^{-1}$ = 175 Mbyte/s from Bolen et al. [3]. From reference [7] we find that the frequency of the machine was ν = 200 MHz and that it could perform one eight-byte floating-point operation per clock cycle. We therefore picked the value $\gamma^{-1}$ = 200 Mflop/s. These values correspond to forces $\phi_1^H$ = 6000 flop/word and $\phi_2^H$ = 9.1 flop/word.

The fourth column in Table (4.2) shows the efficiencies calculated using these parameters in formula (3.1), and the solid line in Figure 2 shows the path across the efficiency surface followed by these values. The bullets mark the measured efficiencies from column three of the table. Some of the measured values lie below the surface and are not visible from this viewing angle. Agreement of the measured values with the predicted values is reasonably good, especially given that we had to guess the values for the parameters that enter the analytic formula.

[Figure 2: efficiency e plotted as a surface over the coordinates u and v]

Fig. 2. The path followed across the efficiency surface for the selected values of p and n chosen by Henry and van de Geijn [8]. The solid line is the path on the surface predicted by the analytic formula (3.1). The bullets are the measured efficiencies. Some of the measured values lie below the surface and are not visible from this perspective.
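The numbers quoted in this section can be rechecked with a short script. The sketch below derives the hardware forces from α, the bandwidth and the computational power, and then evaluates formula (3.1) for the (p, n) pairs of Table (4.2). It assumes C = 5, h = n/p, r = 2, eight-byte words, and decimal megabytes (175 Mbyte/s = 175·10^6 byte/s); the function names are hypothetical.

```python
WORD_BYTES = 8  # the unit of length is the eight-byte word


def hardware_forces(alpha_s, words_per_s, flops_per_s):
    """Hardware forces of Eqs. (3.4) and (3.7) in flop/word."""
    gamma = 1.0 / flops_per_s
    beta = 1.0 / words_per_s
    return alpha_s / gamma, beta / gamma


def modeled_efficiency(n, p, phi_h1, phi_h2, C=5.0):
    """Formula (3.1) with the software forces (3.5) and (3.8)."""
    core = 20.0 * n / p + 40.0 + C
    u = phi_h1 * (n / 2.0 + 2.0 * p) / (n * core)
    v = phi_h2 * 7.0 / core
    return 1.0 / (1.0 + u + v)


machine1 = hardware_forces(1e-4, 1e6, 1e6)   # Table (4.1), ca. 1990
machine2 = hardware_forces(1e-6, 1e9, 1e9)   # Table (4.1), ca. 2007
paragon = hardware_forces(30e-6, 175e6 / WORD_BYTES, 200e6)

runs = [(4, 2000), (8, 2800), (16, 4000), (32, 4800),
        (48, 6000), (64, 8000), (96, 9600)]
modeled = [modeled_efficiency(n, p, *paragon) for p, n in runs]
for (p, n), e in zip(runs, modeled):
    print(p, n, round(e, 2))
```

Under these assumptions the script gives forces of about 6000 and 9.1 flop/word for the Paragon and matches the modeled column of Table (4.2) to two decimals in six of the seven rows. For p = 32, n = 4800 it evaluates to about 0.49 rather than the tabulated 0.39, which may indicate a misprint in the table or a difference in the parameters actually used.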
5 Summary
We have shown that for a particular QR algorithm there exists one efficiency surface for all machines. Each machine follows its own path along the surface depending on two hardware forces $\phi_1^H$ and $\phi_2^H$. Two machines with the same hardware forces are self-similar and scale the same way even though the particular values for their startup time α, their bandwidth $\beta^{-1}$, and their computational power $\gamma^{-1}$ may be quite different. For a fixed problem size and number of processors, the distance along the geodesic between paths is a measure of the distance between machines. The two paths shown in Figure 1, for example, indicate that modern machines are a long distance from earlier machines and that they are less efficient. They travel a long way across the surface at low efficiency before mounting the ascent to the summit. The modern machine has higher values for the coordinate $u_*(n)$ for each point along the paths. Startup times, although they have decreased, have not kept up with higher computational power so that the hardware force $\phi_1^H$, which determines the coordinate $u_*(n)$, has increased by a factor of ten. It is harder now to overcome long startup times than it was before. This difficulty translates directly into a harder job for the programmer who must overcome the larger hardware force by designing a new algorithm as done, for example, by Henry, Watkins and Dongarra in a later paper [9].
Acknowledgment The United States Department of Energy supported this research by Grant No. DE-FG02-04ER25629 through the Office of Science.
References
1. Barenblatt, G.I.: Scaling. Cambridge University Press, Cambridge (2003)
2. Birkhoff, G.: Hydrodynamics: A Study in Logic, Fact and Similitude, 2nd edn. Princeton University Press, Princeton (1960)
3. Bolen, J., Davis, A., Dazey, B., Gupta, S., Henry, G., Robboy, D., Schiffler, G., Scott, D., Stallcup, M., Taraghi, A., Wheat, S., Fisk, L., Istrail, G., Jong, C., Riesen, R., Shuler, L.: Massively Parallel Distributed Computing: World’s First 281 GigaFlop Supercomputer. In: Proceedings of the Intel Supercomputer Users Group (1995), http://www.cs.utk.edu/~ghenry/isug.ps
4. Brand, L.: The Pi Theorem of Dimensional Analysis. Arch. Rat. Mech. Anal. 1, 35–45 (1957)
5. Bridgman, P.W.: Dimensional Analysis, 2nd edn. Yale University Press, New Haven (1931)
6. Callahan, D., Cocke, J., Kennedy, K.: Estimating interlock and improving balance for pipelined architectures. Journal of Parallel and Distributed Computing 5, 334–358 (1988)
7. Henry, G.: (1995), http://www.cs.utk.edu/~ghenry/mplin.soft
8. Henry, G., van de Geijn, R.A.: Parallelizing the QR algorithm for the unsymmetric algebraic eigenvalue problem: myths and reality. SIAM Journal on Scientific Computing 17(4), 870–883 (1996)
9. Henry, G., Watkins, D., Dongarra, J.: A parallel implementation of the unsymmetric QR algorithm for distributed memory architectures. SIAM Journal on Scientific Computing 24(1), 284–311 (2002)
10. Hockney, R.W.: Performance parameters and benchmarking of supercomputers. Parallel Computing 17, 1111–1130 (1991)
11. Hockney, R.W.: The Science of Computer Benchmarking. Society for Industrial and Applied Mathematics (SIAM), Philadelphia (1996)
12. McCalpin, J.D.: Memory bandwidth and machine balance in current high performance computers. IEEE Computer Society Technical Committee on Computer Architecture Newsletter (December 1995)
13. Miles, D.: Compute intensity and the FFT. In: Proceedings Supercomputing 1993, pp. 676–684 (1993)
14. Numrich, R.W.: A note on scaling the Linpack benchmark. Journal of Parallel and Distributed Computing 67(4), 491–498 (2007)
15. Numrich, R.W.: Computer Performance Analysis and the Pi Theorem. Under review (2007)
16. Numrich, R.W.: Computational force: A unifying concept for scalability analysis. In: Proceedings of ParCo 2007 (to appear, 2007)
Sparse Matrix-Vector Multiplication - Final Solution?
Ivan Šimeček and Pavel Tvrdík
Department of Computer Science and Engineering, Czech Technical University, Prague
xsimecek, [email protected]
Abstract. Algorithms for the sparse matrix-vector multiplication (SpM×V for short) are important building blocks in solvers of sparse systems of linear equations. Due to matrix sparsity, the memory access patterns are irregular and the utilization of a cache suffers from low spatial and temporal locality. To reduce this effect, the register blocking formats were designed. This paper introduces a new combined format for storing sparse matrices that extends the possibilities of the variable-sized register blocking format.
1 Introduction
There are several formats for storing sparse matrices. They have been designed mainly for the SpM×V. The SpM×V for the most common format, the compressed sparse rows (CSR for short) format, suffers from low performance due to indirect addressing. Several papers have focused on increasing the efficiency of the SpM×V [1,2]. There are some formats, such as register blocking, that eliminate indirect addressing during the SpM×V, so that vector instructions can be used. These formats are suitable only for matrices with a known structure of nonzero elements. The overhead of reorganizing a matrix from one format to another is often of the order of tens of executions of a SpM×V. Such a reorganization therefore pays off only if the same matrix A is multiplied with many different vectors, e.g., in iterative linear solvers. In this paper, we propose a new format designed mainly for the SpM×V. We compare its performance with the CSR format and other related works. Our measurements show that this new format can achieve a significant speedup.
2 Terminology and Notation

2.1 The Cache Model

The cache model we consider corresponds to the structure of L1 and L2 caches in the Intel x86 architecture. An s-way set-associative cache consists of h sets and one set consists of s independent lines. Let $C_S$ denote the size of the data part of a cache in bytes and $L_S$ denote the cache line size in bytes. Then $C_S = s \cdot L_S \cdot h$.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 156–165, 2008. © Springer-Verlag Berlin Heidelberg 2008

2.2 Definitions and Notation
In the following text, we assume that A is a real sparse matrix of order n. We will use the following definitions and notation:
– $n_Z$ denotes the total number of nonzero elements in A.
– A block is a small submatrix of matrix A. It can be linear (horizontal, vertical, or diagonal) or rectangular.
– A partially full block is a block that contains at least one nonzero element. Its fill-in ratio, denoted by α, is the ratio between the number of nonzero elements in the block and the size of the block. To be more precise, we will sometimes call these blocks α-dense blocks.
– A fully dense block is a block that contains only nonzero elements (α = 1).
– $N_B$ denotes the number of partially full blocks.
– L denotes the total number of elements in all blocks.
– $S_B$ denotes the size of the auxiliary data structure for a block (see Section 2.3).
– $S_D$ denotes the size of type double in bytes.
– $S_I$ denotes the size of type integer in bytes.

2.3 Common Sparse Matrix Formats
The coordinate (XY) format. In this simplest sparse format, the matrix A is represented by 3 linear arrays A, Xpos, and Ypos. Array A[1, ..., $n_Z$] stores the nonzero elements of A; arrays Xpos[1, ..., $n_Z$] and Ypos[1, ..., $n_Z$] contain the x-positions and y-positions, respectively, of the nonzero elements.

The compressed sparse row format. It is the most common format for storing sparse matrices. A matrix A stored in the CSR format is represented by 3 linear arrays A, adr, and ci. Array A[1, ..., $n_Z$] stores the nonzero elements of A, array adr[1, ..., n] contains the indexes of the initial nonzero elements of the rows of A, and array ci[1, ..., $n_Z$] contains the column indexes of the nonzero elements of A. Hence, the first nonzero element of row j is stored at index adr[j] in array A.

Register blocking (RB) formats. Sparse matrices often contain dense submatrices (blocks in our terminology), so various blocking formats were designed to accelerate matrix operations (mainly the SpM×V). Compared to the CSR format, the aim of these formats is to consume less memory and to allow better use of registers. In addition, the effect of indirect addressing is reduced and vector (SIMD) instructions can be used. The RB formats are designed to handle randomly occurring dense blocks in a sparse matrix. Geometrical and topological properties (such as the structure and size of one-subgraphs) in matrices with faulty components have been studied by many authors (for example in [3]). Blocks can be linear (horizontal, vertical, diagonal) or rectangular. There are 2 variants:
– Fixed-sized register blocking (FRB for short): Elements of A are grouped into partially full blocks of the same size. For example, in the case of the rectangular FRB, each block i has size $r_x \cdot r_y$ and is described by a pair [$x_i^{start}$, $y_i^{start}$] (the position of its upper left corner) and $r_x \cdot r_y$ numbers (the values of all its elements, including zeroes). This format has been studied in depth [2,4] (including fast heuristics for finding suboptimal sizes $r_x$ and $r_y$ of rectangular blocks) and it is discussed in Section 3.1.
– Variable-sized register blocking (VRB for short): Nonzero elements of A are grouped into blocks whose sizes and shapes can differ. For example, in the case of the diagonal VRB, each block i is described by a pair [$x_i^{start}$, $y_i^{start}$] (the position of its beginning), by $y_i^{end}$ (the y-position of its end), and by $y_i^{end} - y_i^{start} + 1$ numbers (the values of all its elements). The main idea of the diagonal VRB format is illustrated in Figure 1 b). It can be implemented in 2 ways:
  • a linked list of diagonal blocks (see Figure 1 c));
  • a pair of arrays block_elems[1, ..., L] (elements of diagonal blocks) and block_aux[1, ..., $N_B$] (items $x_i^{start}$, $y_i^{start}$, $y_i^{end}$, and pointers to the first elements of blocks in the array block_elems[1, ..., L]) (see Figure 1 d)).
Fig. 1. The idea of the VRB format: a) Nonzero elements of a sparse matrix A. b) A diagonal blocking of the nonzero elements. c) Blocks are stored as linked dense arrays. d) Blocks are stored in arrays block_aux and block_elems.
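The difference between the CSR format and the array variant of the diagonal VRB format can be seen in a small SpM×V sketch. The code below is an illustration under several assumptions that are not in the paper: indices are 0-based, the array adr has n + 1 entries (a common convention that marks the end of the last row), the x-position is taken to be the column and the y-position the row, and only fully dense diagonal blocks (maximal runs of nonzeros) are formed. All function names are hypothetical.

```python
def to_csr(dense):
    """Build the CSR arrays (values, column indexes ci, row starts adr)."""
    vals, ci, adr = [], [], [0]
    for row in dense:
        for j, a in enumerate(row):
            if a != 0:
                vals.append(a)
                ci.append(j)
        adr.append(len(vals))
    return vals, ci, adr


def spmv_csr(vals, ci, adr, x):
    """y = A * x in CSR; note the indirect access x[ci[k]]."""
    y = [0.0] * (len(adr) - 1)
    for i in range(len(y)):
        for k in range(adr[i], adr[i + 1]):
            y[i] += vals[k] * x[ci[k]]
    return y


def to_diag_vrb(dense):
    """Group nonzeros into maximal runs along diagonals of a square matrix."""
    n = len(dense)
    block_elems, block_aux = [], []  # aux item: (x_start, y_start, y_end, first)
    for d in range(-(n - 1), n):     # d = column - row selects one diagonal
        i = max(0, -d)
        while i < n and i + d < n:
            if dense[i][i + d] == 0:
                i += 1
                continue
            start, first = i, len(block_elems)
            while i < n and i + d < n and dense[i][i + d] != 0:
                block_elems.append(dense[i][i + d])
                i += 1
            block_aux.append((start + d, start, i - 1, first))
    return block_elems, block_aux


def spmv_diag_vrb(block_elems, block_aux, x, n):
    """y = A * x in diagonal VRB; inside a block all accesses have unit stride."""
    y = [0.0] * n
    for x_start, y_start, y_end, first in block_aux:
        for t in range(y_end - y_start + 1):
            y[y_start + t] += block_elems[first + t] * x[x_start + t]
    return y


dense = [[1, 0, 2], [0, 3, 0], [4, 0, 5]]
x = [1.0, 2.0, 3.0]
y_csr = spmv_csr(*to_csr(dense), x)
y_vrb = spmv_diag_vrb(*to_diag_vrb(dense), x, len(dense))
print(y_csr, y_vrb)
```

In the VRB loop the positions in x, y and block_elems all advance together, which is what makes the inner loop a candidate for vector instructions, whereas the CSR loop needs one index lookup per nonzero.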
3 Related Work and Comparison
Storing a matrix in register blocking formats is the most common technique for improving the performance of the SpM×V. There are 2 implementations of RB formats that are relevant for the proposal of this paper:
– the SPARSITY method [2,4] (and its successor OSKI),
– the CARB method [5].
3.1 The SPARSITY Method
The SPARSITY method is an implementation of the rectangular FRB format. The SPARSITY algorithm for transforming a sparse matrix into the rectangular FRB format can be described by the following pseudocode.

The SPARSITY transformation algorithm (simplified pseudocode)
  for rx = 1 to max do
    for ry = 1 to max do
      find a decomposition of the matrix into rx × ry blocks such that all nonzero elements are covered and the number of blocks is minimal
      predict the SpM×V performance for that decomposition
    endfor
  endfor
  transform the matrix into the FRB format with the best performance

The most important feature of this format is that it can adapt to machine-specific parameters thanks to the performance prediction. Before the transformation, machine-specific parameters are measured and incorporated into the performance prediction.

3.2 The CARB Method
In [5], we have proposed an implementation of the diagonal VRB format, called the CARB format. Matrix A is viewed as a set of equal-sized horizontal belts. The belt size is computed from cache hierarchy parameters. Also, a lower bound $\bar{\alpha}$ on the fill-in ratio $\alpha$ is estimated by a block heuristic using similar architecture-dependent parameters. The CARB transformation algorithm can be described by the following pseudocode.

The CARB transformation algorithm (simplified pseudocode)
  for every belt do
    for every diagonal in the current belt do
      if the fill-in ratio of the current diagonal is greater than $\bar{\alpha}$
        then represent the current diagonal as a diagonal block
        else represent nonzero elements in the current diagonal in the CSR format
      endif
    endfor
  endfor

The most important features of this method are a very small transformation overhead and the possibility of combining the CSR and the VRB formats.
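The belt-and-diagonal loop above can be sketched as follows. This is only an illustration under stated assumptions, not the CARB implementation: the fill-in ratio of a diagonal is taken to be the number of its nonzeros divided by the distance between its first and last nonzero inside the belt, isolated elements are never turned into blocks, and the elements left for the CSR part are simply returned as (row, column, value) triples. The function name carb_like_split is hypothetical.

```python
from collections import defaultdict


def carb_like_split(nonzeros, belt_height, alpha_bar):
    """Split nonzeros {(row, col): value} into diagonal blocks and leftovers.

    Returns (blocks, leftovers): blocks are (row_start, col_start, values)
    with explicit zeros filled in; leftovers are (row, col, value) triples
    that would be stored in the CSR part.
    """
    # Group the nonzeros by (belt index, diagonal offset col - row).
    groups = defaultdict(list)
    for (row, col), value in nonzeros.items():
        groups[(row // belt_height, col - row)].append((row, value))

    blocks, leftovers = [], []
    for (_belt, offset), items in sorted(groups.items()):
        items.sort()
        first_row, last_row = items[0][0], items[-1][0]
        span = last_row - first_row + 1
        fill_ratio = len(items) / span  # assumed definition of the fill-in ratio
        # Extra assumption: a lone element is not worth a block header.
        if len(items) > 1 and fill_ratio > alpha_bar:
            dense = [0.0] * span
            for row, value in items:
                dense[row - first_row] = value
            blocks.append((first_row, first_row + offset, dense))
        else:
            leftovers.extend((row, row + offset, value) for row, value in items)
    return blocks, leftovers


nonzeros = {(0, 0): 1.0, (1, 1): 2.0, (2, 2): 3.0, (3, 3): 4.0,
            (0, 1): 5.0, (2, 3): 6.0}
blocks, leftovers = carb_like_split(nonzeros, belt_height=4, alpha_bar=0.7)
print(blocks, leftovers)
```

In this example the full main diagonal becomes one block, while the superdiagonal, which is only two thirds full, stays in the CSR part.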
Fig. 2. The block heuristic tries to find in the CARB format a configuration of blocks with minimal space complexity
4 Design of a New Format for Sparse Matrices
In this section, we describe a new linear VRB method called SMURB (Specific Machine parameters Using Register Blocking). First, we describe the new transformation algorithm and derive the supporting equations. Second, we describe our simplifying assumptions for the estimation of time complexity. For the purpose of finding an efficient SMURB format, a matrix A is partitioned into a set of square regions. A region is a temporary data structure representing a submatrix of A.

4.1 The SMURB Transformation Algorithm
The SMURB transformation algorithm follows.

SMURB transformation (simplified pseudocode)
  for every region do
    if the number of nonzero elements in the current region > threshold
      then
        compute arrays Verti and Hori using Eqs. (1)-(5)
        compute array C using Eq. (6)
        backtrace the decomposition of the current region from array C
      else
        represent nonzero elements in the current region in the CSR format
    endif
  endfor

Let us discuss the algorithm in more detail. The matrix is decomposed into equal-sized square regions of side length γ. The choice of γ is a tradeoff between the space complexity of temporary data structures (for the matrix transformation) and the probability of finding suitable blocks in a region. The experiments indicate that the best choice is γ between 50 and 200 for most of today's CPU architectures. The SMURB method uses the following transformation algorithm to find linear blocks inside each region. It uses dynamic programming to estimate the
space complexity using different blocking schemas. It evaluates the complexity for each element in a region, so its space complexity for a γ × γ region is $\Theta(\gamma^2)$. We precompute arrays Hori (and Verti) with the space complexity of horizontal (and vertical) blocks. Firstly, we define the array Single(x, y) for each element of the matrix A in this way:

$$\mathrm{Single}(x, y) = \begin{cases} 0 & \text{if } A(x, y) = 0, \\ S_D + 2 S_I & \text{if } A(x, y) \neq 0. \end{cases} \tag{1}$$

Then we can precompute arrays $\mathrm{Verti}_n(x, y)$, $\mathrm{Verti}_b(x, y)$, and $\mathrm{Verti}(x, y)$. All these arrays denote the space complexity of elements from (x, 0) to (x, y) using vertical blocks, but
– In $\mathrm{Verti}_b(x, y)$, the element (x, y) is in a vertical block.
– In $\mathrm{Verti}_n(x, y)$, the element (x, y) is not in a vertical block.
– $\mathrm{Verti}(x, y) = \min(\mathrm{Verti}_n(x, y), \mathrm{Verti}_b(x, y))$.
These arrays can be computed as follows:

$$\mathrm{Verti}_n(x, 0) = \mathrm{Single}(x, 0), \tag{2}$$

$$\mathrm{Verti}_b(x, 0) = S_B + S_D \quad \text{// start of vertical block}, \tag{3}$$

$$\mathrm{Verti}_n(x, y) = \min \begin{cases} \mathrm{Verti}_n(x, y-1) + \mathrm{Single}(x, y) & \text{// continuous vertical non-block,} \\ \mathrm{Verti}_b(x, y-1) + \mathrm{Single}(x, y) & \text{// end of vertical block,} \end{cases} \tag{4}$$

$$\mathrm{Verti}_b(x, y) = \min \begin{cases} \mathrm{Verti}_n(x, y-1) + S_B + S_D & \text{// start of vertical block,} \\ \mathrm{Verti}_b(x, y-1) + S_D & \text{// continuous vertical block.} \end{cases} \tag{5}$$

Similarly, we precompute arrays $\mathrm{Hori}_n(x, y)$, $\mathrm{Hori}_b(x, y)$, and $\mathrm{Hori}(x, y)$ that estimate the space complexity of elements from (0, y) to (x, y) using horizontal blocks. For elements with coordinates x ≥ 1, y ≥ 1, we can compute the space complexity of elements from (0, 0) to (x, y) by the following equation (for a better illustration, see Figure 3):

$$C(x, y) = \min \begin{cases} C(x-1, y) + \mathrm{Verti}(x, y) & \text{// continuous vertical block,} \\ C(x, y-1) + \mathrm{Hori}(x, y) & \text{// continuous horizontal block,} \\ C(x-1, y-1) + \mathrm{Hori}(x-1, y) + \mathrm{Verti}(x, y-1) + \mathrm{Single}(x, y) & \text{// continuous diagonal block.} \end{cases} \tag{6}$$
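The recurrences (1)-(6) can be evaluated with a short dynamic program. The sketch below is a minimal illustration rather than the SMURB implementation: the constants S_D, S_I, and S_B are hypothetical byte costs, the region is passed as a list of rows A[y][x], and, since the boundary values of C are not stated above, every term with a negative index is assumed to be zero. Only the final cost is returned; the backtrace step is omitted.

```python
S_D, S_I, S_B = 8, 4, 12  # hypothetical costs: one value, one index, one block header


def single(a):
    # Eq. (1): an isolated nonzero costs one value plus two indices.
    return 0 if a == 0 else S_D + 2 * S_I


def line_costs(values):
    """Eqs. (2)-(5) along one column (or row); returns min(n, b) per prefix."""
    n = single(values[0])  # Eq. (2): first element outside a block
    b = S_B + S_D          # Eq. (3): first element starts a block
    best = [min(n, b)]
    for a in values[1:]:
        # The right-hand side uses the previous n and b, as in Eqs. (4)-(5).
        n, b = min(n, b) + single(a), min(n + S_B + S_D, b + S_D)
        best.append(min(n, b))
    return best


def region_cost(A):
    """Evaluate Eq. (6) on a square region A[y][x]; returns C at the last corner."""
    g = len(A)
    verti = [line_costs([A[y][x] for y in range(g)]) for x in range(g)]  # verti[x][y]
    hori = [line_costs(A[y]) for y in range(g)]                          # hori[y][x]
    # C is padded by one row and column of zeros (assumed boundary values).
    C = [[0] * (g + 1) for _ in range(g + 1)]
    for x in range(g):
        for y in range(g):
            v_prev = verti[x][y - 1] if y > 0 else 0
            h_prev = hori[y][x - 1] if x > 0 else 0
            C[x + 1][y + 1] = min(
                C[x][y + 1] + verti[x][y],                    # first case of Eq. (6)
                C[x + 1][y] + hori[y][x],                     # second case
                C[x][y] + h_prev + v_prev + single(A[y][x]),  # third case
            )
    return C[g][g]


column = [[1, 0, 0], [1, 0, 0], [1, 0, 0]]
print(region_cost(column))
```

For the example region, which contains a single full column, the program returns the cost of one vertical block with three elements, S_B + 3 S_D, which is lower than storing the three elements individually.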
Fig. 3. Graphical representation of the computation of the space complexity: a) C(x−1, y) and Verti(x, y); b) C(x, y−1) and Hori(x, y); c) C(x−1, y−1), Hori(x−1, y), Verti(x, y−1), and Single(x, y)
In our approach, the algorithm finds the optimal configuration of blocks with minimal space complexity. The transformation time complexity for a γ × γ region is $\Theta(\gamma^2)$, which can be too much for some types of matrices. Therefore, the region size γ and the constant threshold must be chosen carefully.

4.2 Estimation of the SpM×V Time Complexity
Time complexity can be predicted similarly by replacing the terms $S_D$, $S_I$, and $S_B$ in Eqs. (1)-(6) with $T_D$, $T_I$, and $T_B$. We assume the following model:
– The SpM×V using the XY format takes time $T = n_Z \cdot T_I$, where $T_I$ is an architecture-dependent constant.
– The SpM×V using the CSR format takes time $T = n \cdot T_n + n_Z \cdot T_{nz}$, where $T_n$ and $T_{nz}$ are architecture-dependent constants.
– The multiplication of $N_B$ blocks (all these blocks together contain L elements) takes time $T = N_B \cdot T_B + L \cdot T_D$, where $T_B$ and $T_D$ are architecture-dependent constants.

4.3 Comparison with Related Results
Let us briefly compare the previous methods and the SMURB format.
– In SPARSITY, all nonzero elements are in blocks. In the SMURB method (as in the CARB format), we use a combination of the VRB and the CSR formats.
– In SPARSITY, only 2D rectangular blocks of the same size can occur (the FRB format, see Section 2.3). In the CARB format, only diagonal blocks are used. In the SMURB format, we use only linear (horizontal, vertical, or diagonal) blocks. Block sizes can differ (the VRB format, see Section 2.3).
– Various (even architecture-dependent) block heuristics can be designed. We have designed a heuristic to find blocks with minimal space complexity or minimal time of the SpM×V. It is adaptive to machine-specific parameters.
– In SPARSITY, the time overhead of the matrix transformation is very high (in the order of thousands of executions of the SpM×V). In the CARB format, the transformation is fast simply because the block heuristic is trivial. Since the SMURB method uses much more complex heuristics, it is applied only to regions that are "dense enough", because of the overhead of the region transformation. This approach minimizes the total overhead of the matrix transformation, and the matrix transformation algorithm is relatively fast (approximately one order of magnitude faster than in SPARSITY).
– We extend CARB's idea of belts to regions. The regions divide matrix A in 2 dimensions so that the sizes of all regions are approximately the same. The usage of regions allows as many elements as possible to be kept in the cache memory during the SpM×V (an application of cache blocking).
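The machine-adaptive choice mentioned above rests on the linear cost model of Section 4.2, which can be evaluated directly. The following sketch only restates those formulas; the constants and the matrix sizes are made-up values for illustration, not measurements from this paper.

```python
def time_csr(n, n_z, t_n, t_nz):
    # CSR model from Section 4.2: T = n * T_n + n_Z * T_nz
    return n * t_n + n_z * t_nz


def time_blocks(n_b, num_elems, t_b, t_d):
    # Block model from Section 4.2: T = N_B * T_B + L * T_D
    return n_b * t_b + num_elems * t_d


# Hypothetical matrix: 1000 rows and 5000 nonzeros; blocking it yields
# 200 blocks holding 5500 elements (including explicitly stored zeros).
csr_time = time_csr(n=1000, n_z=5000, t_n=4, t_nz=6)
block_time = time_blocks(n_b=200, num_elems=5500, t_b=10, t_d=3)
predicted_speedup = csr_time / block_time
print(csr_time, block_time, predicted_speedup)
```

Under these assumed constants the blocked representation is predicted to be faster even though it stores 500 extra zeros, because the per-element cost inside a block is lower than the per-nonzero cost in the CSR format.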
5 Evaluation of the Results
All results were measured on a Pentium Celeron M420 at 1.6 GHz with 1 GB of memory at 266 MHz, running Windows XP Professional, with the following cache parameters: The L1 cache is a data cache with LS = 64, CS = 32K, s = 8, h = 64, and the LRU replacement strategy. The L2 cache is a data cache with LS = 64, CS = 1 MB, s = 8, h = 2048, and the LRU strategy. Software: Microsoft Visual Studio 2003 with the Intel compiler version 9.0 and the switches /O3 /Og /Oa /Oy /Ot /Qpc64 /QxB /Qipo /Qsfalign16 /Zp16.

5.1 Test Data
We have used 52 real matrices from various technical areas from the MatrixMarket and Harwell sparse matrix test collections.

5.2 Evaluation of the Results
The matrix transformation performance. The SMURB matrix transformation overhead depends strongly on the constant threshold (see Section 4.1). We have chosen threshold = 250. For this threshold, the time of the transformation varies from 50 to 200 executions of the SpM×V over all testing matrices.
The SpM×V performance. Comparison with the CSR format. We compare the SpM×V performance for the CSR and the SMURB formats. Speedups depend strongly on the structure of nonzero elements (that is, on the application domain of the matrix). Great speedups (sometimes over
200%) were achieved for matrices of partial differential equation (PDE) origin. No speedup was achieved for matrices with near-random structure. More exactly, a significant speedup (more than 10%) has been achieved for real matrices from the following areas: diffusion operators, structural engineering, simulation of solid objects. On the other hand, matrices from the following areas turned out to be unsuitable for our new format: economical and social models, driven cavity problems, nonlinear chemical problems, matrices with near-random structure; however, no slowdowns were observed.
Comparison with the SPARSITY method. It is hard to compare our method and the SPARSITY method. The SMURB method achieved a speedup for a wider range of matrices due to the more universal VRB scheme. But SPARSITY achieved higher speedups for some types of matrices due to the smaller overhead of the FRB format. Both of these methods fail for matrices with (almost) random structure.
Comparison with the CARB method. Figure 4 shows the relative speedup of the CARB and the SMURB formats for all testing matrices. Our implementation is always faster than the CARB format due to the more complex and machine-adaptive block heuristics.
Fig. 4. The relative speedup of the SpM×V using the CARB and SMURB formats (plot: speedup over the CSR format, on a scale from 0.4 to 2.4, for each testing matrix; series SMURB and CARB)
6 Conclusions
In this paper, we described a new format called SMURB for storing sparse matrices that combines the advantages of the CSR format and the VRB format. This format uses a new transformation algorithm with a small overhead. This format is also adaptive to machine-specific parameters. For many types of matrices arising from various technical disciplines, our new format gives a significant performance speedup.
Acknowledgements. This research has been supported by MŠMT under research program MSM6840770014.
References
1. Mellor-Crummey, J., Garvin, J.: Optimizing sparse matrix vector product computations using unroll and jam. International Journal of High Performance Computing Applications 18(2), 225–236 (2004)
2. Vuduc, R., Demmel, J.W., Yelick, K.A., Kamil, S., Nishtala, R., Lee, B.: Performance optimizations and bounds for sparse matrix-vector multiply. In: Proceedings of Supercomputing 2002, Baltimore, MD, USA (November 2002)
3. Nehéz, M.: On Geometrical Properties of Random Tori and Random Graph Models. Journal of Electrical Engineering 51(12/s), 59–62 (2000)
4. Im, E.: Optimizing the Performance of Sparse Matrix-Vector Multiplication. Dissertation thesis, University of California at Berkeley (2001)
5. Tvrdík, P., Šimeček, I.: A new diagonal blocking format and model of cache behavior for sparse matrices. Proceedings of Parallel Processing and Applied Mathematics 12(4), 617–629 (2005)
Petascale Computing for Large-Scale Graph Problems
David A. Bader
College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA
Abstract. Graph theoretic problems are representative of fundamental kernels in traditional and emerging computational sciences such as chemistry, biology, and medicine, as well as applications in national security. Yet they pose serious challenges for parallel machines due to non-contiguous, concurrent accesses to global data structures with low degrees of locality. Few parallel graph algorithms outperform their best sequential implementation due to long memory latencies and high synchronization costs. In this talk, we consider several graph theoretic kernels for connectivity and centrality and discuss how the features of petascale architectures will affect algorithm development, ease of programming, performance, and scalability.
1 Petascale Computing
Computational science enables us to investigate phenomena where economics or constraints preclude experimentation, evaluate complex models and manage massive data volumes, model processes across interdisciplinary boundaries, and transform business and engineering practices. Increasingly, cyberinfrastructure is required to address our national and global priorities, such as sustainability of our natural environment by reducing our carbon footprint and by decreasing our dependencies on fossil fuels, improving human health and living conditions, understanding the mechanisms of life from molecules and systems to organisms and populations, preventing the spread of disease, predicting and tracking severe weather, recovering from natural and human-caused disasters, maintaining national security, and mastering nanotechnologies. Several of our most fundamental intellectual questions also require computation, such as the formation of the universe, the evolution of life, and the properties of matter. Realizing that cyberinfrastructure is essential to research innovation and competitiveness, several nations are now in a “new arms race to build the world’s mightiest computer” (John Markoff, New York Times, August 19, 2005). These petascale computers, expected around 2008 to 2012, will perform $10^{15}$ operations per second, nearly an order of magnitude faster than today’s speediest
This keynote talk presented joint work with Kamesh Madduri. This work was supported in part by NSF Grants CNS-0614915, CAREER CCF-0611589, and DBI-0420513; and DARPA contract NBCH30390004.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 166–169, 2008. © Springer-Verlag Berlin Heidelberg 2008
supercomputer. In fact, several nations are in a worldwide race to deliver high-performance computing systems that can achieve 10 petaflops or more within the next five years.
2 Massive Graph Theoretic Applications
The modeling and analysis of massive, dynamically evolving semantic networks raises new and challenging research problems to respond quickly to complex queries. Empirical studies on real-world systems such as the Internet, socioeconomic interactions, and biological networks have revealed that they exhibit common structural features – a low graph diameter, skewed vertex degree distribution, self-similarity, and dense sub-graphs. Analogous to the small-world (short paths) phenomenon, these real-world data sets are broadly referred to and modeled as small-world networks [1,2]. Our research highlights the design and implementation of novel high performance computing approaches to efficiently solve advanced small-world network analysis queries, enabling analysis of networks that were previously considered too large to be feasible. For tractable analysis of large-scale networks, we present SNAP (Small-world Network Analysis and Partitioning) [3], an open-source graph analysis framework. SNAP is a modular infrastructure that provides an optimized collection of algorithmic building blocks (efficient implementations of key graph-theoretic analytic approaches) to end-users. In prior work, we have designed novel parallel algorithms for several graph problems that run efficiently on shared memory systems. Our implementations of breadth-first graph traversal [4], shortest paths [5,6], spanning tree [7], minimum spanning tree, connected components [8], and other problems achieve impressive parallel speedup for arbitrary, sparse graph instances. We redesign and integrate several of our recent parallel graph algorithms into SNAP, with additional optimizations for social networks. Thus, SNAP provides a simple and intuitive interface for the network analyst, effectively hiding the parallel programming complexity involved in low-level algorithm design from the user while providing a productive high-performance environment for complex queries.

2.1 Centrality Analysis Queries
One of the fundamental problems in network analysis is to determine the importance (or the centrality) of a given entity in a network. Some of the well-known metrics for computing centrality are closeness, stress and betweenness [9]. Of these indices, betweenness has been extensively used in recent years for the analysis of social-interaction networks, as well as other large-scale complex networks. Some applications include lethality in biological networks, study of sexual networks and AIDS, organizational behavior, supply chain management processes, as well as identifying key actors in terrorist networks. Betweenness is also used as the primary routine in accepted social network analysis algorithms for clustering and community identification in real-world networks.
Betweenness is a global centrality metric that is based on shortest-path enumeration. It is compute-intensive with a quadratic time complexity in the number of vertices. We explore high performance computing techniques [10] that exploit the typical small-world graph topology to speed up exact, as well as approximate, centrality computation. We demonstrate the capability to compute betweenness on networks that are three orders of magnitude larger than ones that can be processed by state-of-the-art network analysis packages. In contrast to existing approaches, we use a global topological measure for centrality queries in a large network and also support this metric in SNAP.

2.2 Path-Based Queries
Several common analysis queries can be naturally formulated as path-based problems. For instance, while analyzing a collaboration network, we might be interested in chains of publications or patents relating the work of two researchers. In a social network, relationship attributes are frequently encapsulated in the edge type, and we may want to discover paths formed by the composition of specific relationships (for instance, subordinate of and friend of). SNAP supports common variants of shortest-path and flow-based query formulations. Other related advanced queries include subgraph isomorphism (finding an exact or approximate pattern in the large graph) and connection subgraphs (informally, finding a subgraph relating two entities of interest) that are extensions of simpler path-based algorithms.

2.3 Automated Community Detection
A key problem in social network analysis is that of finding communities, dense components, or detecting other latent structure. In recent work, we designed three new clustering schemes (two hierarchical agglomerative approaches, and one divisive clustering algorithm) [3] that optimize modularity, a popular clustering measure. We also conducted an extensive experimental study and demonstrated that our parallel schemes give significant running time improvements over existing modularity-based clustering heuristics. For instance, our novel divisive clustering approach based on approximate edge betweenness centrality is more than two orders of magnitude faster than the Newman-Girvan algorithm [11] on a multicore computer, while maintaining comparable clustering quality.
3 Summary
The analysis of massive graphs requires petascale computing systems. We discuss the design and implementation of efficient parallel algorithms for novel community structure identification, classical graph-theoretic kernels, topological indices that provide insight into the network structure, and preprocessing kernels for small-world graphs. Our results demonstrate that these parallel approaches are several orders of magnitude faster than competing algorithms – this enables
analysis of networks that were previously considered too large to be tractable. As part of ongoing work, we are designing new small-world network analysis kernels and incorporating existing techniques into SNAP. Our current focus is on the design of novel algorithms for massive-scale small-world networks, including better techniques for centrality analysis, path-based queries, spectral partitioning, and community detection.
References
1. Watts, D., Strogatz, S.: Collective dynamics of small world networks. Nature 393, 440–442 (1998)
2. Amaral, L., Scala, A., Barthélémy, M., Stanley, H.: Classes of small-world networks. Proceedings of the National Academy of Sciences USA 97, 11149–11152 (2000)
3. Bader, D., Madduri, K.: SNAP, Small-world Network Analysis and Partitioning: an open-source parallel graph framework for the exploration of large-scale networks. In: Proc. Int’l Parallel and Distributed Processing Symp. (IPDPS 2008), Miami, FL (2008)
4. Bader, D., Madduri, K.: Designing multithreaded algorithms for breadth-first search and st-connectivity on the Cray MTA-2. In: Proc. 35th Int’l Conf. on Parallel Processing (ICPP), Columbus, OH, IEEE Computer Society, Los Alamitos (2006)
5. Madduri, K., Bader, D., Berry, J., Crobak, J.: An experimental study of a parallel shortest path algorithm for solving large-scale graph instances. In: Proc. The 9th Workshop on Algorithm Engineering and Experiments (ALENEX 2007), New Orleans, LA (2007)
6. Crobak, J., Berry, J., Madduri, K., Bader, D.: Advanced shortest path algorithms on a massively-multithreaded architecture. In: Proc. Workshop on Multithreaded Architectures and Applications, Long Beach, CA (2007)
7. Bader, D.A., Cong, G.: A fast, parallel spanning tree algorithm for symmetric multiprocessors (SMPs). Journal of Parallel and Distributed Computing 65, 994–1006 (2005)
8. Bader, D., Cong, G., Feo, J.: On the architectural requirements for efficient execution of graph algorithms. In: Proc. 34th Int’l Conf. on Parallel Processing (ICPP), Oslo, Norway, IEEE Computer Society, Los Alamitos (2005)
9. Freeman, L.: A set of measures of centrality based on betweenness. Sociometry 40, 35–41 (1977)
10. Bader, D., Madduri, K.: Parallel algorithms for evaluating centrality indices in real-world networks. In: Proc. 35th Int’l Conf. on Parallel Processing (ICPP), Columbus, OH, IEEE Computer Society, Los Alamitos (2006)
11. Girvan, M., Newman, M.: Community structure in social and biological networks. Proceedings of the National Academy of Sciences USA 99, 7821–7826 (2002)
The Buffered Work-Pool Approach for Search-Tree Based Optimization Algorithms
Faisal N. Abu-Khzam (1), Mohamad A. Rizk (1), Deema A. Abdallah (1), and Nagiza F. Samatova (2,3)
(1) Division of Computer Science and Mathematics, Lebanese American University, Beirut, Lebanon
[email protected]
(2) Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA
(3) Computer Science Department, North Carolina State University, Raleigh, NC, USA
Abstract. Recent advances in algorithm design have shown a growing interest in seeking exact solutions to many hard problems. This new trend has been motivated by hardness of approximation results that appeared in the last decade, and has taken a great boost by the emergence of parameterized complexity theory. Exact algorithms often follow the classical search-tree based recursive backtracking strategy. Different algorithms adopt different branching and pruning techniques in order to reduce the unavoidable exponential growth in run time. This paper is concerned with another time-saving approach by developing new methods for exploiting high-performance computational platforms. A load balancing strategy is presented that could exploit multi-core architectures, such as clusters of symmetric multiprocessors. The well-known Maximum Clique problem is used as an exemplar to illustrate the utility of our approach.
1 Introduction and Background
Computationally demanding applications remain the main source of challenge for algorithm designers and for the computer industry. The major reason for
This research has been supported by the “Exploratory Data Intensive Computing for Complex Biological Systems” project from U.S. Department of Energy (Office of Advanced Scientific Computing Research, Office of Science). The work of N. F. Samatova was also sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory. Oak Ridge National Laboratory is managed by UT-Battelle, LLC for the U.S. D.O.E. under contract no. DE-AC05-00OR22725. Faisal N. Abu-Khzam is the communicating author.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 170–179, 2008. © Springer-Verlag Berlin Heidelberg 2008
this challenge stems from the fact that most real life questions are modeled by computationally intractable problems. Algorithm designers have resorted to approximation methods whenever it is acceptable to trade quality for time. However, some problems are hard to approximate with a guaranteed error bound, unless some strongly believed complexity hierarchy collapses. Another issue with approximation methods is (what we call) two-fold approximations: adopting simplifying assumptions to model a question by a certain problem and adopting an approximate solution to that problem. This kind of double inaccuracy is not desired, especially when a problem is to be solved only once in a while. For example, when data is produced over years (as in some biological experiments), one may afford spending a few days or even weeks to get an accurate answer. Determining whether an accurate solution exists that is affordable requires more scientific efforts and extensive experimental study. With the high expected increase in multiprocessor performance, a natural question to pose is whether we would eventually bridge the gap between intractability and practicality, by solving practical instances of problems that were once judged (and dismissed) as intractable. The hope for a positive answer has been intensified recently when a new atmosphere that favors exact algorithms appeared in the algorithms community. This is partially attributed to the emergence of fixed-parameter tractability [1], which led to new methodologies for problem reduction and novel parameter-driven search algorithms whose exponential-time behavior is restricted to some input parameter(s). Coupling the best exact algorithm for a problem with a highly scalable parallel implementation sounds like the right approach to bridging the gap between intractability and practicality.
We shall try to realize this vision by providing a special parallel technique for problems that are solved (optimally) via recursive backtracking. We use the Maximum Clique problem as an exemplar.
2 The Maximum Clique Problem
Maximum Clique (henceforth MC) is a well-known NP-complete problem [2]. Its decision version is often formulated as follows:
Input: a graph G = (V, E) and a positive integer k < |V|.
Question: is there a subset V′ of V with |V′| ≥ k for which every pair of vertices in V′ is joined by an edge in E?
When posed as an optimization search problem (in which a clique of maximum size is sought), Maximum Clique finds application in a wide variety of domains. In computational biology, for example, powerful Maximum Clique tools are needed in the context of cis regulatory motif finding [3], microarray analysis [4], and the study of quantitative trait loci [5]. Moreover, recent clique finding methods have been incorporated into ClustalXP [6], the parallel version of ClustalW. There are many methods for solving the optimization version of MC. Most of these methods appeared in the context of solving the Independent Set problem [7,8,9], which is equivalent to MC: an independent set in a graph G is a clique
in its complement. Here, the complement of G is the graph representing the complement of the adjacency relation of G. A search-tree based clique algorithm can be viewed as a traversal of a virtual tree whose nodes are search states. Every search state consists of a set CurrentK holding the current clique, a set Clique_Neighbors containing common neighbors of the elements of CurrentK, and a few parameters. The largest clique found during the search, say MaxK, is kept unchanged along with its size, until a search-tree node is reached whose CurrentK is larger than MaxK. We consider a particular recent MC algorithm, due to Tomita and Kameda [10], which showed a great improvement over previously known methods. We refer to this algorithm as TK in what follows. The TK algorithm employs a greedy coloring of the Clique_Neighbors vertices at every state of its search. Initially, Clique_Neighbors is the whole vertex set. The assigned colors are represented by positive integers in such a way that no two adjacent vertices receive the same color. Note that the maximum clique size in a graph cannot exceed the number of colors assigned by any coloring. The TK coloring method guarantees that the colored vertices are also sorted according to their color values. A vertex of color c must have at least c − 1 adjacent vertices with c − 1 different smaller colors. This has a significant impact on the performance of the search. In fact, if the size of MaxK is greater than the sum of |CurrentK| and the maximum assigned color, then the corresponding branch of the search is terminated (or pruned). As for the criterion for selecting a candidate vertex for branching/expansion, the last vertex in the sorted list is considered, being a vertex of maximum color value. We refer to this vertex as Candidate in the sequel. Once Candidate is determined, Clique_Neighbors is updated to become the intersection between itself and the neighbors of Candidate. If the intersection set is not empty, the coloring procedure is applied to the new Clique_Neighbors set and the branching function is called again. If this intersection is empty and the size of the new CurrentK is greater than the size of MaxK, the latter is then replaced by CurrentK. Otherwise, the vertex is removed from the clique and another one is to be chosen. If no other candidate is found, the search backtracks.
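The search described above can be sketched as a small sequential branch-and-bound routine. The code below is written in the spirit of coloring-based clique search and is not the TK implementation: the coloring is a plain sequential greedy coloring, the initial vertex order (by decreasing degree) is an assumption, and the bound prunes as soon as |CurrentK| plus the largest remaining color cannot exceed |MaxK|. The names greedy_color_sort and max_clique are chosen for illustration.

```python
def greedy_color_sort(adj, candidates):
    """Greedily color the candidates; return them sorted by color, with colors."""
    classes = []  # classes[c] holds pairwise non-adjacent vertices
    for v in candidates:
        for cls in classes:
            if all(u not in adj[v] for u in cls):
                cls.append(v)
                break
        else:
            classes.append([v])
    order, colors = [], []
    for color, cls in enumerate(classes, start=1):
        for v in cls:
            order.append(v)
            colors.append(color)
    return order, colors


def max_clique(adj):
    """Return one maximum clique of the graph given as {vertex: set_of_neighbors}."""
    best = []  # plays the role of MaxK

    def expand(current, candidates):
        nonlocal best
        order, colors = greedy_color_sort(adj, candidates)
        # Branch on the vertex of maximum color first (the last one in the list).
        for i in range(len(order) - 1, -1, -1):
            if len(current) + colors[i] <= len(best):
                return  # prune: the remaining colors cannot beat the best clique
            candidate = order[i]
            # Vertices after position i were already tried, so only order[:i] remains.
            neighbors = [u for u in order[:i] if u in adj[candidate]]
            if neighbors:
                expand(current + [candidate], neighbors)
            elif len(current) + 1 > len(best):
                best = current + [candidate]

    expand([], sorted(adj, key=lambda v: len(adj[v]), reverse=True))
    return best


# A triangle {0, 1, 2} with a path 2 - 3 - 4 attached.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
clique = max_clique(graph)
print(sorted(clique))
```

Each call to expand corresponds to one search state (CurrentK, Clique_Neighbors), which is exactly the unit that a parallel scheme can ship to another processor as a task.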
3 The Buffered Work-Pool Approach
The Buffered Work-Pool approach (henceforth BWP) is a hybrid dynamic load balancing technique that combines threading and message passing between the different processors of any cluster. This is a master-worker technique that is similar to decentralized work-pool (see chapter 7 of [11]). The main objectives are to allow workers to manage their own local shared work-pools (or task-buffers), from/to which the different local threads can exchange tasks, and to let them communicate tasks through the master. The master acts like a grid middleware, whose role is to reduce the overhead of communication and synchronization. Relying on a master process in the inter-processor communication facilitates
Fig. 1. The General BWP Approach
termination detection. Figure 1 illustrates the main functions of a general BWP computation. We describe these functions in the following subsections. We use a BWP system in a computation whenever a single task has the potential of producing a large number of other tasks. Such is the case in search-tree based exact algorithms. To explicate, note that each search-tree node (or search-state) represents a complete problem instance that can be viewed as a task by itself. All the state-related information could be encoded in the structure of its corresponding task. In the following description of the general BWP approach, and for the sake of illustration, we assume a search-tree based computation.

3.1 The Master
The master starts the computation with any problem-specific pre-processing steps. It then fills its task-buffer with a set of initial tasks and distributes them equally among the workers (in a sequential manner). Afterwards, the master’s entire task revolves around communication with the workers. If it receives a request for tasks from a worker, it answers by allotting a number of tasks to be exchanged after a master-worker agreement is settled. Similarly, if a worker sends a request for sharing part of its load with other workers, the master tries to accommodate part of that worker’s task-buffer in order to delegate it later to other workers that are requesting tasks.
3.2 The Worker
After receiving the initial set of tasks, each worker starts one thread to handle the communication with the master and initiates a pre-defined number of search threads that are in charge of expanding/branching the tasks from its task-buffer. Tasks are added to the task-buffer when all of the following conditions are satisfied:
– The number of sub-tasks generated from the task being expanded is greater than a predefined parameter called Add Threshold.
– The current search-tree level (or, in general, a computation stage) is a multiple of a user-defined parameter Level Threshold. This allows the user to determine how coarse or fine the computation should be.
– The worker’s buffer is not full and not locked by another thread.
If any of the above conditions fails, the search thread expands its tasks recursively (it proceeds as in the sequential version).
3.3 Master/Worker Communication
During the master-worker communication, we distinguish various types of messages whose role is identified by special message tags. Some messages are problem specific (such as updating the maximum clique size), while others are used to exchange tasks and to settle agreements. An agreement is needed to determine the number of tasks to be exchanged.
The worker can issue two types of requests. A Tasks Request is used to request tasks when the number of tasks in its buffer drops below the pre-defined parameter Worker Starving Level. The master sends back the current number of tasks in its buffer. If the master’s answer is greater than zero, the worker responds with the number of empty slots below its buffer’s Hungry Level. A Delegation Request is used by the worker to delegate some tasks when its buffer is full. In this case, the master sends back the number of empty slots in its task-buffer. If the master’s answer is positive, the worker sends the number of tasks that are above its Hungry Level. If the worker’s task-buffer is no longer full when it receives the master’s response, it sends back a zero answer.
The master sends a Normal Request or an Urgent Request message when the number of tasks in its buffer drops below the normal (hungry) or urgent (starving) level, respectively. In the case of a Normal Request, the worker’s reply message has a Normal Request Answer tag and contains the number of tasks in its buffer that are above its Worker Starving Level. An Urgent Request Answer tag is used in response to an Urgent Request message from the master, in which case the worker sends a pre-defined percentage of the number of tasks in its buffer. The master’s requests are sent only to those workers that have tasks and have answered previous requests. Similarly, the worker’s communication thread does not send a Tasks Request message unless the master has available tasks and has previously answered the worker’s requests.
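The numbers that a worker reports while settling an agreement can be sketched as follows. This is only an illustration of the rules stated above: the class and method names are hypothetical, the 50% value for the pre-defined percentage is an assumed default, and no actual message passing is shown.

```python
from dataclasses import dataclass


@dataclass
class WorkerBuffer:
    tasks: int           # number of tasks currently in the buffer
    capacity: int        # size of the buffer
    hungry_level: int    # the buffer's Hungry Level
    starving_level: int  # Worker Starving Level
    urgent_share: float = 0.5  # assumed value of the pre-defined percentage

    def is_full(self):
        return self.tasks >= self.capacity

    def tasks_request_reply(self, master_tasks):
        """After a Tasks Request: empty slots below the Hungry Level."""
        if master_tasks <= 0:
            return 0  # the master has nothing to give
        return max(0, self.hungry_level - self.tasks)

    def delegation_reply(self, master_free_slots):
        """After a Delegation Request: tasks above the Hungry Level."""
        if master_free_slots <= 0 or not self.is_full():
            return 0  # zero answer: no room at the master, or no longer full
        return self.tasks - self.hungry_level

    def normal_request_answer(self):
        """Reply to a Normal Request: tasks above the Worker Starving Level."""
        return max(0, self.tasks - self.starving_level)

    def urgent_request_answer(self):
        """Reply to an Urgent Request: a fixed share of the buffer."""
        return int(self.tasks * self.urgent_share)


if __name__ == "__main__":
    starving = WorkerBuffer(tasks=2, capacity=20, hungry_level=10, starving_level=3)
    print(starving.tasks_request_reply(master_tasks=5))
    full = WorkerBuffer(tasks=20, capacity=20, hungry_level=10, starving_level=3)
    print(full.delegation_reply(master_free_slots=4))
```

How the two sides reconcile the numbers they exchange (for example, by taking the smaller one) is not specified in the text and is deliberately left out of the sketch.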
The communication algorithms are shown below. (Note that New Message() is true if the message buffer has a new message.)
Master-Communication-Function
Begin
  While (Terminate == FALSE)
    if (number of tasks ≤ Master Hungry Level)
      for each worker do
        if (worker is not starving and has answered previous requests)
          ISend(Normal Request, worker, number of tasks);
    if (number of tasks ≤ Master Starving Level)
      for each worker do
        if (worker is not starving and has answered previous requests)
          ISend(Urgent Request, worker, number of tasks);
    if (New Message())
      if (new message tag == Normal Request Answer) Receive Normal Request Answer();
      if (new message tag == Urgent Request Answer) Receive Urgent Request Answer();
      if (new message tag == Tasks Request) Send Tasks Request Answer();
      if (new message tag == Delegation Request) Send Delegation Request Answer();
End
Worker-Communication-Function
Begin
  While (Terminate == FALSE)
    if (number of tasks ≤ Worker Starving Level)
      if (master has tasks and has answered previous requests)
        ISend(Tasks Request, master, number of tasks);
    if (task-buffer is full)
      if (master has answered previous requests)
        ISend(Delegation Request, master, number of tasks);
    if (New Message())
      if (new message tag == Normal Request) Send Normal Request Answer();
      if (new message tag == Urgent Request) Send Urgent Request Answer();
      if (new message tag == Tasks Request Answer) Receive Tasks Request Answer();
      if (new message tag == Delegation Request Answer) Receive Delegation Request Answer();
      if (new message tag == Termination) Terminate = TRUE
End
Note that the master and the workers always send the number of tasks they own, even if no tasks are to be exchanged. This is used, in the above protocol, as an estimate of the ability of the worker/master to delegate tasks. Blocking messages are used when the master and a worker are settling an agreement or exchanging tasks¹. However, negative answers can use non-blocking send routines, because they will not be followed by an exchange of tasks.
3.4 Termination Detection
The master detects termination and broadcasts a termination signal to all workers only when the following conditions are satisfied:
– All workers and the master have zero tasks in their local task-buffers;
– No worker has an active search thread;
– No messages are in transmission.
When a worker has an empty task-buffer and receives an Urgent Request message with a zero value (i.e., the number of tasks in the master’s buffer is zero), it does one of the following:
– If all of its search threads are idle and the master has answered all its previous Tasks Request messages, it sends back a negative value for its number of tasks, signaling that it is ready to terminate.
– If any of its search threads is active, or it has an unanswered Tasks Request message, the worker replies with a zero number of tasks.
On the other hand, when the master has an empty task-buffer and receives a negative value in all the Urgent Request Answer messages, it broadcasts a termination signal.
3.5 Avoiding Deadlocks
While settling an agreement with a certain worker X, the master uses blocking send/receive messages. To avoid having a deadlock, the master probes its message buffer, looking for any blocking message from X. This takes place after a blocking send message and before its corresponding receive message. If a blocking message from X arrives whose tag is different from the master’s blocking send, then a deadlock may occur. In this case, the master serves the worker’s message before it goes back into its blocking receive state.
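To close the description of the protocol, the termination handshake of Section 3.4 can be condensed into two small decision functions. This is a sketch under stated assumptions: the function and parameter names are hypothetical, −1 is an arbitrary choice for the negative value, and the 50% share for ordinary Urgent Request answers is an assumed default.

```python
def worker_urgent_answer(buffer_size, master_tasks, active_threads,
                         unanswered_requests, urgent_share=0.5):
    """Value a worker sends back in an Urgent Request Answer message."""
    if buffer_size == 0 and master_tasks == 0:
        if active_threads == 0 and unanswered_requests == 0:
            return -1  # negative value: ready to terminate
        return 0       # still busy, or waiting for an answer from the master
    # Ordinary case: offer a pre-defined share of the local buffer.
    return int(buffer_size * urgent_share)


def master_should_terminate(master_tasks, urgent_answers):
    """True when the master may broadcast the termination signal."""
    return (master_tasks == 0
            and len(urgent_answers) > 0
            and all(answer < 0 for answer in urgent_answers))


if __name__ == "__main__":
    answers = [worker_urgent_answer(0, 0, 0, 0) for _ in range(3)]
    print(master_should_terminate(0, answers))
```

A worker that still has an active search thread answers zero, so a single busy worker is enough to postpone the broadcast.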
4 A Buffered Work-Pool Algorithm for Maximum Clique
In the BWP version of the TK algorithm, the computation process begins at the master with the pre-processing steps that include coloring and sorting of vertices.
¹ Note that non-blocking send/receive messages allow the sender/receiver to proceed with its computation instead of sitting idle waiting for a response to its message.
The master starts the search phase and creates a number of tasks by imitating the sequential branching in a breadth-first manner until the number of visited search-tree nodes equals the number of processors available. Every task has its own CurrentK clique. Each worker, consequently, initiates a communication thread and a user-predefined number of search threads.
4.1 The Task Structure
The task structure is made up of a linear array called task. The size of task is n + 5, where n is the number of vertices in the graph. The first n slots contain integer values representing the states of the vertices. If a vertex v is active (a member of the Candidate set), task[v] is the index of v in the sorted list. Otherwise, task[v] is either −1 or −2, depending on whether v is in CurrentK or has just been deleted. The remaining five slots of task contain the candidate vertex and its color, the size of CurrentK, and the numbers of vertices and edges in the subgraph induced by the Candidate set.
4.2 Parallel Branching
When the branching function is called on a task chosen by its corresponding search thread, it starts by checking whether the maximum clique size is smaller than the sum of its candidate vertex color and its CurrentK size. If so, the candidate vertex is added to CurrentK, whose size is thus incremented by one. Then the intersection between the neighbors of Candidate and the current CliqueNeighbors set is determined in order to extract a new set of active neighbors of the clique. As in the sequential algorithm, if the intersection is empty and |CurrentK| > |MaxK|, then CurrentK is sent to the master in order to replace MaxK, and the master broadcasts |CurrentK| as the new maximum clique size found so far. If |CurrentK| is not greater than |MaxK|, then Candidate is removed from the clique and another one is chosen for a similar procedure. On the other hand, if the intersection set is non-empty, the greedy coloring is re-applied to CliqueNeighbors to produce a new ordering by color, as in the sequential version. A new list of tasks is produced in the same breadth-first manner used by the master. The current thread proceeds with the branching as in the sequential version, unless all the conditions for adding tasks hold (see Section 3.2), in which case the newly generated tasks are added to the local task-buffer.
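The test that decides whether newly generated tasks go to the local task-buffer (the three conditions of Section 3.2) might look as follows. The sketch is hypothetical: the class name, the use of a non-blocking lock to model "not locked by another thread", and the policy of returning tasks that do not fit to the calling thread are assumptions, not details given in the text.

```python
import threading


class TaskBuffer:
    """A bounded, lockable local work-pool shared by the search threads."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tasks = []
        self.lock = threading.Lock()

    def try_add(self, subtasks, level, add_threshold, level_threshold):
        """Try to buffer `subtasks`; return those the caller must expand itself."""
        # Condition 1: enough sub-tasks were generated.
        if len(subtasks) <= add_threshold:
            return list(subtasks)
        # Condition 2: the search-tree level is a multiple of Level Threshold.
        if level % level_threshold != 0:
            return list(subtasks)
        # Condition 3: the buffer is not locked by another thread ...
        if not self.lock.acquire(blocking=False):
            return list(subtasks)
        try:
            # ... and it is not full.
            room = self.capacity - len(self.tasks)
            if room <= 0:
                return list(subtasks)
            self.tasks.extend(subtasks[:room])
            return list(subtasks[room:])  # assumed policy for the overflow
        finally:
            self.lock.release()


if __name__ == "__main__":
    buffer = TaskBuffer(capacity=3)
    leftover = buffer.try_add(["t1", "t2", "t3", "t4"], level=4,
                              add_threshold=2, level_threshold=2)
    print(buffer.tasks, leftover)
```

Whenever any condition fails, the caller receives all sub-tasks back and continues recursively, as in the sequential version.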
5 Preliminary Experimental Results
We have implemented the sequential TK algorithm, together with our parallel version. The two versions were tested on graphs from the DIMACS Benchmark (http://dimacs.rutgers.edu/Challenges). For the preliminary experiments we report below, we chose three graphs on which the sequential TK is known to take hours. These graphs are named brock800_1, brock800_4 and p_hat1000-2. Figure 2 shows the average speedups obtained over all the different runs
Fig. 2. Sample Experimental Results
(on the three files) after adjusting the parameters. The experiments were conducted on a cluster of dual-core processors at Oak Ridge National Lab. Each node consists of a Dual Intel 3.4GHz Xeon EM64T processor with 4GB of memory and dual Gigabit Ethernet Interconnects. As for the number of threads per machine, we tried different values and we found that using five threads per machine gave the best performance on the used dual-core processors. Finally, note that our user-defined parameters play a major role in the efficiency of the produced code. More experiments are needed on randomly generated graphs and known benchmarks in order to determine the relationships between the structure of input graphs (such as density, chromatic number, and degree distribution) and the various parameters.
6 Concluding Remarks
We presented a dynamic load balancing technique that targets clusters of shared-memory multiprocessors. The Buffered Work-Pool approach is suitable for highly demanding computations where each task assigned to a process has the potential of producing a large number of sub-tasks. The experiments presented here reflect work in progress. Further experiments are needed to assess the BWP approach on large platforms and on other problems. The BWP method applies well to many other combinatorial problems. We are currently using it in designing algorithms for many classical problems like SAT, Vertex Cover and Maximal Cliques enumeration.
References
1. Downey, R.G., Fellows, M.R.: Parameterized Complexity. Springer, Heidelberg (1999)
2. Garey, M.R., Johnson, D.S.: Computers and Intractability. W.H. Freeman, New York (1979)
3. Leung, H., Chin, F.: An Efficient algorithm for String Motif Discovery. In: Proceedings, Asia-Pacific Bioinformatics Conference (APBC), pp. 79–88 (2006)
4. Ji, Y., Stormo, G.: Clustering binary fingerprint vectors with missing values for DNA array data analysis. Journal of Computational Biology 11(5), 887–901 (2004)
5. Chesler, E., Lu, L., Shou, S., Qu, Y., Gu, J., Wang, J., Hsu, H., Mountz, J., Baldwin, N., Langston, M., Hogenesch, J., Threadgill, D., Manly, K., Williams, R.: Complex Trait Analysis of Gene Expression Uncovers Polygenic and Pleiotropic Networks that Modulate Nervous System Function. Nature Genetics 37(3), 233–242 (2005)
6. Abu-Khzam, F.N., Cheetham, J., Dehne, F., Langston, M.A., Pitre, S., Rau-Chaplin, A., Shanbhag, P., Taillon, P.J.: ClustalXP, http://134.117.206.42:8000
7. Fomin, F.V., Grandoni, F., Kratsch, D.: Measure and conquer: A Simple O(2^{0.288n}) Independent Set Algorithm. In: SODA 2006: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, Miami, Florida, pp. 18–25 (2004)
8. Jian, T.: An O(2^{0.304n}) Algorithm for Solving Maximum Independent Set Problems. IEEE Transactions on Computers 35(9), 847–851 (1986)
9. Robson, J.M.: Finding a maximum independent set in time O(2^{n/4}). Universite Bordeaux I, LaBRI, Technical Report 1251-01 (2001)
10. Tomita, E., Kameda, T.: An efficient branch-and-bound algorithm for finding a maximum clique with computational experiments. Journal of Global Optimization 37, 95–111 (2007)
11. Wilkinson, B., Allen, M.: Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, 2nd edn. Prentice-Hall, Inc., Upper Saddle River (2004)
Parallel Scatter Search Algorithm for the Flow Shop Sequencing Problem
Wojciech Bożejko¹ and Mieczyslaw Wodecki²
¹ Institute of Computer Engineering, Control and Robotics, Wroclaw University of Technology, Janiszewskiego 11-17, 50-372 Wroclaw, Poland, [email protected]
² Institute of Computer Science, University of Wroclaw, Joliot-Curie 15, 50-383 Wroclaw, Poland, [email protected]
Abstract. In this paper we consider the strongly NP-hard flow shop problem with the criterion of minimizing the sum of the jobs’ finishing times. We present a parallel algorithm based on the scatter search method. The obtained results are compared to the best known from the literature. Superlinear speedup has been observed in the parallel calculations.
Keywords: metaheuristics, scatter search, flow shop problem.
1 Introduction
We consider the permutation flow shop scheduling problem, described as follows. A number of jobs are to be processed on a number of machines. Each job must go through all the machines in exactly the same order, and the job order must be the same on every machine. Each machine can process at most one job at any point in time, and each job may be processed on at most one machine at any time. The objective is to find a schedule that minimizes the sum of the jobs’ completion times. The problem is denoted by F||Csum. There are plenty of good heuristic algorithms for solving the F||Cmax flow shop problem, whose objective is to minimize the maximal job completion time. Owing to its special properties (blocks of the critical path, [4]) it is recognized as easier than the problem with the objective Csum. Unfortunately, no similar properties (which could speed up computations) are known for the F||Csum flow shop problem. Constructive algorithms (LIT and SPD from [11], NSPD [7]) have low efficiency and can only be applied to a limited range. A hybrid algorithm consisting of elements of tabu search, simulated annealing and path relinking methods is given in [9]. The results of this algorithm, applied to the Taillard benchmark tests [10], are currently the best known in the literature. The big disadvantage of the algorithm is its long running time. Parallel computing is a way to speed it up. This work continues the authors’ research on constructing efficient parallel algorithms to solve hard combinatorial problems ([1,2,3,12]). Further, we present a parallel algorithm based on the scatter search method which not only speeds up the computations, but also improves the quality of the results.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 180–188, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Problem Definition and Notation
The flow shop problem can be defined as follows: there is a set of n jobs J = {1, 2, ..., n} and a set of m machines M = {1, 2, ..., m}. Job j ∈ J consists of a sequence of m operations $O_{j1}, O_{j2}, \ldots, O_{jm}$. Operation $O_{jk}$ corresponds to the processing of job j on machine k during an uninterrupted processing time $p_{jk}$. We want to find a schedule such that the sum of the jobs’ completion times is minimal. Let π = (π(1), π(2), ..., π(n)) be a permutation of jobs {1, 2, ..., n} and let Π be the set of all permutations. Each permutation π ∈ Π defines a processing order of jobs on each machine. We wish to find a permutation π* ∈ Π such that
$$C_{sum}(\pi^*) = \min_{\pi \in \Pi} C_{sum}(\pi), \quad \text{where} \quad C_{sum}(\pi) = \sum_{i=1}^{n} C_{i,m}(\pi),$$
where $C_{i,j}(\pi)$ is the time required to complete job i on machine j in the processing order given by the permutation π. The completion time of job π(j) on machine k can be found using the recursive formula
$$C_{\pi(j),k} = \max\{C_{\pi(j-1),k},\; C_{\pi(j),k-1}\} + p_{\pi(j),k},$$
where π(0) = 0, $C_{0,k} = 0$ for k = 1, 2, ..., m, and $C_{j,0} = 0$ for j = 1, 2, ..., n. This problem is strongly NP-hard.
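The recursion translates directly into code. The sketch below uses 0-based job and machine indices and a nested-list layout p[job][machine]; the function names and the small instance are made up for illustration.

```python
from itertools import permutations


def completion_times(perm, p):
    """Completion time on the last machine for each job, in the order `perm`.

    p[j][k] is the processing time of job j on machine k (0-based indices).
    """
    m = len(p[0])
    prev = [0] * m  # completion times of the previous job on every machine
    finished = []
    for job in perm:
        cur = [0] * m
        for k in range(m):
            earlier_machine = cur[k - 1] if k > 0 else 0
            # C_{pi(j),k} = max(C_{pi(j-1),k}, C_{pi(j),k-1}) + p_{pi(j),k}
            cur[k] = max(prev[k], earlier_machine) + p[job][k]
        finished.append(cur[m - 1])
        prev = cur
    return finished


def c_sum(perm, p):
    """Sum of the jobs' completion times on the last machine."""
    return sum(completion_times(perm, p))


if __name__ == "__main__":
    p = [[3, 2], [1, 4], [2, 1]]  # 3 jobs, 2 machines (illustrative data)
    print(c_sum([0, 1, 2], p))
    print(min(permutations(range(3)), key=lambda perm: c_sum(perm, p)))
```

Exhaustive enumeration as in the last line is only feasible for toy instances, which is why the paper turns to a metaheuristic.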
3 Scatter Search Method
The main idea of the scatter search method is presented in [6]. The algorithm is based on the idea of evaluation of the so-called starting solutions set. In the classic version a linear combination of the starting solutions is used to construct a new solution. In the case of a permutational representation of the solution, a linear combination of permutations gives an object which is not a permutation. Therefore, in this paper a path relinking procedure is used to construct a path from one solution of the starting set to another solution from this set. The best element of such a path is chosen as a candidate to add to the starting solution set.
Algorithm 1. Scatter search
for i := 1 to iter do
  Step 1. Generate a set of unrepeated starting solutions S, |S| = n.
  Step 2. For n/2 randomly chosen pairs from S apply the path relinking procedure to generate a set S′ of n/2 solutions which lie on paths.
  Step 3. Apply the local search procedure to improve the value of the cost function of solutions from the set S′.
  Step 4. Add solutions from the set S′ to the set S. Leave in the set S at most n solutions by deleting the worst and repeated solutions.
  Step 5. if |S| < n then
            Add new random solutions to the set S such that elements in the set S do not duplicate and |S| = n.
end for.
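Steps 4 and 5 of Algorithm 1 amount to a merge, a truncation and a refill of the solution set. One possible reading is sketched below; the function name, the tuple-based duplicate detection and the callback parameters `cost` and `random_solution` are assumptions made for illustration.

```python
import random


def update_solution_set(S, S_new, n, cost, random_solution):
    """Merge S_new into S, keep the n best distinct solutions, refill if short."""
    # Step 4: merge; a set of tuples removes repeated solutions.
    pool = {tuple(s) for s in S} | {tuple(s) for s in S_new}
    kept = sorted(pool, key=cost)[:n]  # drop the worst ones
    # Step 5: add new random solutions until |S| = n, avoiding duplicates.
    # (Assumes the solution space holds at least n distinct solutions.)
    seen = set(kept)
    while len(kept) < n:
        candidate = tuple(random_solution())
        if candidate not in seen:
            seen.add(candidate)
            kept.append(candidate)
    return [list(s) for s in kept]


if __name__ == "__main__":
    rng = random.Random(0)

    def random_permutation():
        perm = list(range(4))
        rng.shuffle(perm)
        return perm

    def toy_cost(s):  # stand-in for Csum
        return sum(i * x for i, x in enumerate(s))

    S = [[0, 1, 2, 3], [0, 1, 2, 3]]
    print(update_solution_set(S, [[3, 2, 1, 0]], 4, toy_cost, random_permutation))
```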
4 Path Relinking
The basis of the path relinking procedure, which connects two solutions π1, π2 ∈ Π, is the multi-step crossover fusion (MSXF) described by Reeves and Yamada [9]. Its idea is based on a stochastic local search, starting from the solution π1, to find a new good solution, where the other solution π2 is used as a reference point. The neighborhood N(π) of the permutation (individual) π is defined as the set of new permutations that can be obtained from π by exactly one adjacent pairwise exchange, which exchanges the positions of two adjacent jobs in the solution connected with permutation π. The distance measure d(π, σ) is defined as the number of adjacent pairwise exchanges needed to transform permutation π into permutation σ. Such a measure is known as Kendall’s τ measure.
Algorithm 2. Path-relinking procedure
Let π1, π2 be reference solutions. Set x = q = π1;
repeat
  For each member yi ∈ N(x), calculate d(yi, π2);
  Sort yi ∈ N(x) in ascending order of d(yi, π2);
  repeat
    Select yi from N(x) with a probability inversely proportional to the index i;
    Calculate Csum(yi);
    Accept yi with probability 1 if Csum(yi) ≤ Csum(x), and with probability PT(yi) = exp((Csum(x) − Csum(yi)) / T) otherwise (T is temperature);
    Change the index of yi from i to n and the indices of yk, k = i+1, ..., n from k to k−1;
  until yi is accepted;
  x ← yi;
  if Csum(x) < Csum(q) then q ← x;
until some termination condition is satisfied;
return q {q is the best solution lying on the path from π1 to π2}
The termination condition was exceeding 100 iterations of the path relinking procedure.
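The three building blocks of Algorithm 2 (the adjacent-exchange neighborhood, the Kendall τ distance and the acceptance probability) can be sketched as follows. The function names are illustrative, and the distance is computed with a simple quadratic inversion count rather than a faster merge-based method.

```python
import math


def kendall_tau_distance(pi, sigma):
    """Number of adjacent exchanges needed to turn `pi` into `sigma`."""
    position = {job: i for i, job in enumerate(sigma)}
    seq = [position[job] for job in pi]
    # The distance equals the number of inversions of `seq`.
    return sum(1 for i in range(len(seq))
               for j in range(i + 1, len(seq)) if seq[i] > seq[j])


def adjacent_swap_neighbors(pi):
    """All permutations obtained from `pi` by one adjacent pairwise exchange."""
    neighbors = []
    for i in range(len(pi) - 1):
        y = list(pi)
        y[i], y[i + 1] = y[i + 1], y[i]
        neighbors.append(y)
    return neighbors


def acceptance_probability(cost_x, cost_y, temperature):
    """Probability of moving from x to y, as in Algorithm 2."""
    if cost_y <= cost_x:
        return 1.0
    return math.exp((cost_x - cost_y) / temperature)


if __name__ == "__main__":
    x, target = [1, 2, 3, 4], [4, 3, 2, 1]
    ranked = sorted(adjacent_swap_neighbors(x),
                    key=lambda y: kendall_tau_distance(y, target))
    print(ranked)
```

Each adjacent exchange changes the distance to the reference solution by exactly one, so sorting the neighbors by distance separates those that move toward π2 from those that move away.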
5 Parallel Scatter Search Algorithm
The parallel algorithm was designed to run on a cluster of 152 dual-core Intel Xeon 2.4 GHz processors connected by Gigabit Ethernet with 3Com SuperStack 3870 switches, installed in the Wroclaw Centre of Networking and Supercomputing. This supercomputer has distributed memory, where each processor has its own local 4 GB of memory. Taking this type of architecture into consideration, we chose a client-server model for the scatter search algorithm proposed here, where path-relinking procedures are executed by processors on local data and communication takes place rarely, to create a common set of new starting solutions. The process of communication and evaluation of the starting solutions set S is controlled by processor number 0. We call this model global. For comparison, a model without communication was also implemented, in which independent scatter search threads are executed in parallel. The result of such an algorithm is the best of the solutions generated by all the searching threads. We call this model independent. The algorithms were implemented in C++ using the MPI (mpich 1.2.7) library and executed under the OpenPBS batching system, which measures the times of processor usage.
Algorithm 3. Parallel scatter search algorithm for the SIMD model without shared memory
parfor p := 1 to number of processors do
  for i := 1 to iter do
    Step 1. if (p = 0) then {only processor number 0}
              Generate a set of unrepeated starting solutions S, |S| = n.
              Broadcast the set S among all the processors.
            else {other processors}
              Receive from processor 0 a set of starting solutions S.
            end if;
    Step 2. For n/2 randomly chosen pairs from S apply the path relinking procedure to generate a set S′ of n/2 solutions which lie on paths.
    Step 3. Apply the local search procedure to improve the value of the cost function of solutions from the set S′.
    Step 4. if (p ≠ 0) then
              Send solutions from the set S′ to processor 0
            else {only processor number 0}
              Receive sets S′ from other processors and add their elements to the set S
    Step 5.   Leave in the set S at most n solutions by deleting the worst and repeated solutions.
              if |S| < n then
                Add new random solutions to the set S such that elements in the set S do not duplicate and |S| = n.
              end if;
            end if;
  end for;
end parfor.
6 Computer Simulations
Tests were based on 50 instances with 100, ..., 500 operations (n × m = 20×5, 20×10, 20×20, 50×5, 50×10) due to Taillard [10], taken from the OR-Library [8]. The results were compared to the best known, taken from [9]. For each version of the scatter search algorithm (global or independent), the following metrics were calculated:
– ARPD: Average Percentage Relative Deviation to the benchmark’s cost function value from [9],
– ttotal (in seconds): real time of executing the algorithm for 50 benchmark instances from [10],
– tcpu (in seconds): the sum of the time consumed on all processors for 50 benchmark instances from [10].
Table 1. Average percentage deviations ARPD (independent model, no communication). The total number of iterations over all processors is 1600.
          Processors
n×m       1 (iter = 1600)   2 (iter = 800)   4 (iter = 400)   8 (iter = 200)
20×5      0.007             0.021            0.065            0.111
20×10     0.000             0.012            0.010            0.024
20×20     0.000             0.013            0.047            0.046
50×5      1.024             1.093            1.364            1.662
50×10     1.060             1.312            1.425            1.821
average   0.418             0.490            0.582            0.733
Tables 1, 2, 3 and 4 present the results of computations of the scatter search method for a number of iterations (as a sum of iterations on all the processors) equal to 1600. The cost of computations, understood as the sum of the time consumed on all the processors, is about 7 hours for all 50 benchmark instances of the flow shop problem (Tables 2 and 4). The best results (average percentage deviations from the best known solutions) are obtained by the 2-processor version of the global model of the scatter search algorithm (with communication), see Figure 1. Because the time consumed on all the processors is only a little longer than the time of the sequential version, we can say that the speedup of this version of the algorithm is almost linear
Table 2. Times of execution (for all 50 instances, independent model, iter = 1600). Cluster of Xeon 3000 2.4 GHz processors.
Processors   ttotal (hours:min:sec)   tcpu (hours:min:sec)
1            7:13:30                  7:13:13
2            3:34:08                  7:04:44
4            1:46:05                  6:58:43
8            0:53:33                  6:57:44
Table 3. Average percentage deviations ARPD (global model, with communication). The total number of iterations over all processors is 1600.
          Processors
n×m       1 (iter = 1600)   2 (iter = 800)   4 (iter = 400)   8 (iter = 200)
20×5      0.21              0.020            0.007            0.077
20×10     0.037             0.006            0.004            0.013
20×20     0.008             0.000            0.004            0.015
50×5      0.917             0.762            0.978            1.208
50×10     1.171             0.860            1.126            1.448
average   0.431             0.330            0.423            0.552
Table 4. Times of execution (for all 50 instances, global model, iter = 1600). Cluster of Xeon 3000 2.4 GHz processors.
Processors   ttotal (hours:min:sec)   tcpu (hours:min:sec)
1            7:26:00                  7:25:51
2            3:52:36                  7:17:39
4            2:14:02                  7:04:07
8            1:24:31                  7:06:52
Table 5. Average percentage deviations ARPD (independent model, no communication). The total number of iterations over all processors is 16000.
          Processors
n×m       1 (iter = 16000)   2 (iter = 8000)   4 (iter = 4000)   8 (iter = 2000)
20×5      0.000              0.007             0.000             0.006
20×10     0.000              0.000             0.000             0.000
20×20     0.000              0.000             0.000             0.000
50×5      0.904              1.037             0.906             0.903
50×10     0.913              0.986             1.033             0.989
average   0.363              0.406             0.388             0.380
(or even super-linear, because the sequential algorithms, for both the independent and the global model, have worse ARPD results). The situation is clearer for a number of iterations equal to 16000 (Tables 5, 6, 7 and 8). The cost of computations, understood as the sum of the time consumed on all
Table 6. Times of execution (for all 50 instances, independent model, iter = 16000). Cluster of Xeon 3000 2.4 GHz processors.
Processors   ttotal (hours:min:sec)   tcpu (hours:min:sec)
1            75:27:40                 75:25:48
2            37:40:08                 75:02:51
4            18:38:23                 74:10:18
8            9:06:24                  72:19:26
Table 7. Average percentage deviations ARPD (global model, with communication). The total number of iterations over all processors is 16000.
          Processors
n×m       1 (iter = 16000)   2 (iter = 8000)   4 (iter = 4000)   8 (iter = 2000)
20×5      0.000              0.000             0.000             0.008
20×10     0.000              0.000             0.000             0.004
20×20     0.000              0.000             0.000             0.000
50×5      0.993              0.677             0.537             0.449
50×10     1.103              0.648             0.474             0.404
average   0.419              0.265             0.202             0.173
Table 8. Times of execution (for all 50 instances, global model, iter = 16000). Cluster of Xeon 3000 2.4 GHz processors.
Processors   ttotal (hours:min:sec)   tcpu (hours:min:sec)
1            75:23:43                 75:20:42
2            41:19:51                 77:57:57
4            23:28:19                 75:46:07
8            14:30:03                 74:38:51
the processors, is about 75 hours for all 50 benchmark instances of the flow shop problem (Tables 6 and 8). The best results are achieved by the 8-processor version of the global model of scatter search; they are 58.6% better than the results of the sequential global scatter search algorithm, and 52.3% better than the results of the sequential independent model of the scatter search algorithm (see Figure 2). The time consumed on all 8 processors is shorter than the time of either sequential version. We can say that the speedup of the 8-processor global version of the scatter search algorithm is superlinear: better results are achieved at a lower cost of computations. This anomaly can be understood as a situation in which the sequential algorithm could execute its search threads so as to follow a better path through the solution space, which is what the parallel algorithm does. As we can see in Table 5, such a situation takes place only for the global model of the scatter search algorithm; independent searches are not as effective. The advantage of the global model of calculations over the independent searches is especially visible for the large instances of the flow shop problem, for
[Figure: bar chart of ARPD (vertical axis, 0 to 1) against the number of processors (1, 2, 4, 8) for the global model and the independent model.]
Fig. 1. Average percentage deviations ARPD of the global and independent scatter search algorithms for the number of iterations iter = 1600 for all 50 instances from OR-Library [8]
[Figure: bar chart of ARPD (vertical axis, 0 to 1) against the number of processors (1, 2, 4, 8) for the global model and the independent model.]
Fig. 2. Average percentage deviations ARPD of the global and independent scatter search algorithms for the number of iterations iter = 16000 for all 50 instances from OR-Library [8]
n = 50, m = 5, 10. The ARPD is about 50% better for the 8-processor implementation compared with the 1-processor version, for the same number of iterations calculated as the sum of iterations executed on all processors.
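The relative improvements quoted in this section can be re-derived from the rounded table entries. The helper below is illustrative (its name is not from the paper); because it works on values rounded to three decimals, the results may differ in the last digit from percentages computed on unrounded data.

```python
def relative_improvement(baseline, value):
    """Percentage by which `value` improves on `baseline` (lower ARPD is better)."""
    return 100.0 * (baseline - value) / baseline


if __name__ == "__main__":
    # Averages from Tables 5 and 7 (iter = 16000), 8-processor global model = 0.173.
    print(round(relative_improvement(0.419, 0.173), 1))  # vs. sequential global
    print(round(relative_improvement(0.363, 0.173), 1))  # vs. sequential independent
    # Large instances in Table 7: 1 processor vs. 8 processors, global model.
    print(round(relative_improvement(0.993, 0.449), 1))  # 50x5
    print(round(relative_improvement(1.103, 0.404), 1))  # 50x10
```

On the rounded averages the first value comes out at about 58.7%, in line with the 58.6% quoted in the text, and the two large-instance values are both above 50%.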
7 Conclusions
We have discussed a new approach to permutation flow shop scheduling based on a parallel scatter search algorithm. Its advantage is especially visible for large problems. Compared with the sequential algorithm, parallelization increases the quality of the obtained solutions while keeping the cost of computations comparable.
Acknowledgements
The calculations were done in the Wroclaw Centre of Networking and Supercomputing.
References
1. Bożejko, W., Wodecki, M.: Solving the flow shop problem by parallel tabu search. In: Proceedings of PARELEC 2004, pp. 189–194. IEEE Computer Society, Los Alamitos (2004)
2. Bożejko, W., Wodecki, M.: Parallel genetic algorithm for the flow shop scheduling problem. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds.) PPAM 2004. LNCS, vol. 3019, pp. 566–571. Springer, Heidelberg (2004)
3. Bożejko, W., Wodecki, M.: A fast parallel dynasearch algorithm for some scheduling problems. In: Proceedings of PARELEC 2006, pp. 275–280. IEEE Computer Society, Los Alamitos (2006)
4. Grabowski, J., Pempera, J.: New block properties for the permutation flow-shop problem with application in TS. Journal of Operational Research Society 52, 210–220 (2001)
5. Holland, J.H.: Adaptation in natural and artificial systems: An introductory analysis with applications to biology, control, and artificial intelligence. University of Michigan Press (1975)
6. James, T., Rego, C., Glover, F.: Sequential and Parallel Path-Relinking Algorithms for the Quadratic Assignment Problem. IEEE Intelligent Systems 20(4), 58–65 (2005)
7. Liu, J.: A new heuristic algorithm for csum flowshop scheduling problems, Personal Communication (1997)
8. OR-Library: http://people.brunel.ac.uk/~mastjjb/jeb/info.html
9. Reeves, C.R., Yamada, T.: Solving the Csum Permutation Flowshop Scheduling Problem by Genetic Local Search. In: IEEE International Conference on Evolutionary Computation, pp. 230–234 (1998)
10. Taillard, E.: Benchmarks for basic scheduling problems. European Journal of Operational Research 64, 278–285 (1993)
11. Wang, C., Chu, C., Proth, J.: Heuristic approaches for n/m/F/ΣCi scheduling problems. European Journal of Operational Research, 636–644 (1997)
12. Wodecki, M., Bożejko, W.: Solving the flow shop problem by parallel simulated annealing. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds.) PPAM 2001. LNCS, vol. 2328, pp. 236–247. Springer, Heidelberg (2002)
Theoretical and Practical Issues of Parallel Simulated Annealing

Agnieszka Debudaj-Grabysz (1) and Zbigniew J. Czech (1,2)

(1) Silesia University of Technology, Institute of Computer Science, Akademicka 16, 44-100 Gliwice, Poland
{agrabysz,zczech}@polsl.pl
(2) Silesia University, Sosnowiec, Poland
Abstract. Several parallel simulated annealing algorithms with different co-operation schemes are considered. The theoretical analysis of speedups of the algorithms is presented. The outcome of the theoretical analysis was verified by practical experiments whose aim was to investigate the influence of the co-operation of parallel simulated annealing processes on the quality of results. The experiments were performed assuming a constant cost of parallel computations, i.e., searching for solutions was conducted with a given number of processors for a specified period of time. For the experiments a suite of benchmarking tests for the vehicle routing problem with time windows was used.

Keywords: Simulated annealing, parallel processing, co-operation of processes, vehicle routing problem with time windows.
1 Introduction
The paper presents several co-operation schemes for parallel simulated annealing (SA) algorithms. SA is a popular heuristic method of optimization. The algorithms with periodical and hybrid communication are compared with a no communication algorithm. The distinctive feature of the algorithm with hybrid communication is that it is intended to be executed on clusters of shared-memory (SMP) nodes, combining the benefits of both shared and distributed memory systems.

The theoretical analysis of speedups of the algorithms is presented. It considers the computational complexity of SA trials under the assumption that the algorithms stop after having performed a specified number of trials. The outcome of the theoretical analysis was verified by practical experiments whose aim was to investigate the influence of the co-operation of parallel SA processes on the quality of results. The experiments were performed assuming a constant cost of parallel computations, i.e. searching for solutions was conducted with a given number of processors for a specified period of time. Such an approach guarantees linear speedup, which might be desirable in practical applications. For the experiments a suite of benchmarking tests for the vehicle routing problem with time windows (VRPTW) was used.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 189–198, 2008. © Springer-Verlag Berlin Heidelberg 2008
Simulated annealing is a heuristic method of optimization employed when the solution space is too large to be explored exhaustively within a reasonable amount of time. The VRPTW is an example of such a problem. Other examples are: school bus routing, newspaper and mail distribution, delivery of goods to department stores etc. The optimization of routing lowers distribution costs, whereas parallelization makes it possible to find better routes if the computation time is limited.

The SA bibliography focuses mainly on the sequential versions of the algorithm [2,15]. However, parallel versions are also investigated, since the sequential method is considered to converge slowly as compared with other heuristics [16]. In [1,3,11,12,13] some recommendations for the parallelization of SA are given. The VRPTW, formulated by Solomon [14], who also proposed a suite of benchmarking tests, has a rich bibliography [16]. The research whose results are presented in this paper is a continuation of our efforts described in [8,9,10,5,6], where parallel SA algorithms to solve the VRPTW are discussed.

The VRPTW is a bicriterion optimization problem. Solving it consists in finding a solution with the minimum number of routes (first optimization criterion) and then minimizing the total distance travelled by vehicles (second optimization criterion). This paper focuses on the first optimization goal, i.e. on minimizing the number of solution routes.

Section 2 presents the theoretical foundation of sequential and parallel SA algorithms. Section 3 describes a few variants of the co-operation of processes in parallel simulated annealing. Section 4 is devoted to the theoretical analysis of the algorithms. The results of the experiments are described in section 5. The last section contains conclusions.
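The bicriterion ordering of VRPTW solutions described in this introduction (fewer routes first, shorter total distance second) can be illustrated with a lexicographic key. This is only a sketch: the route representation as lists of customer ids and the name `solution_key` are assumptions made for the example, not part of the paper.

```python
from typing import List, Tuple

Route = List[int]  # customer ids served by one vehicle (hypothetical representation)


def solution_key(routes: List[Route], total_distance: float) -> Tuple[int, float]:
    """Bicriterion key: number of routes first, total distance second."""
    return (len(routes), total_distance)


a = solution_key([[1, 2], [3]], 120.0)    # 2 routes, longer distance
b = solution_key([[1], [2], [3]], 90.0)   # 3 routes, shortest distance
c = solution_key([[1, 3], [2]], 110.0)    # 2 routes, shorter than a

# Python compares tuples lexicographically, so the distance only breaks ties.
best = min([a, b, c])
```

Here `best` is the key of solution `c`: the three-route solution `b` loses despite its shorter distance, which mirrors the priority of the first criterion.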
2 Sequential and Parallel Simulated Annealing
Simulated annealing searches for the optimal state, i.e. the state which minimizes (or maximizes) the cost function. This is achieved by comparing the current solution with a random solution taken from a specific neighborhood. If a neighbor solution has a lower cost than the current solution then it is accepted. Worse solutions can also be accepted with some probability, which prevents the algorithm from getting stuck prematurely in local optima. The probability of accepting a worse solution decreases over the process of annealing, and it is governed by a control parameter called temperature. An outline of the SA algorithm is presented in Figure 1. A single execution of the innermost loop is called a trial. The final solution which is returned is the best one ever found.

S ← GetInitialSolution();
T ← InitialTemperature;
for i ← 1 to NumberOfTemperatureReduction do
    for j ← 1 to ChainLength do
        S′ ← GetSolutionFromNeighborhood();
        ΔC ← CostFunction(S′) − CostFunction(S);
        if (ΔC < 0 or AcceptWithProbabilityP(ΔC, T)) then
            S ← S′; {the trial is accepted}
        end if;
    end for;
    T ← λT; {with λ < 1}
end for;

Fig. 1. Simulated annealing algorithm

Simulated annealing can be modelled in terms of Markov chains. Namely, the algorithm is considered to generate a sequence of Markov chains, where each chain consists of trials executed at the same temperature. Since in SA each new state is potentially a modification of the previous state, the process is often considered inherently sequential. However, a few strategies to develop parallel SA have been proposed. In our implementations the trials are executed in parallel by a number of processors. We assumed that the length of Markov chains was fixed in such a way that the number of trials executed by all processors within each chain is the same as in the sequential algorithm.
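The outline in Figure 1 can be turned into a small runnable sketch. The acceptance rule exp(−ΔC/T) used below is the common Metropolis rule and is an assumption, since the paper does not define AcceptWithProbabilityP; the toy cost function and neighbourhood are likewise hypothetical and serve only to exercise the loop.

```python
import math
import random


def simulated_annealing(initial, neighbor, cost, t0, cooling,
                        n_reductions, chain_length, rng):
    """Sequential SA following Fig. 1; returns the best solution ever found."""
    current = best = initial
    temperature = t0
    for _ in range(n_reductions):        # one Markov chain per temperature
        for _ in range(chain_length):    # each iteration is one trial
            candidate = neighbor(current, rng)
            delta = cost(candidate) - cost(current)
            # Metropolis rule (assumed form of AcceptWithProbabilityP).
            if delta < 0 or rng.random() < math.exp(-delta / temperature):
                current = candidate      # the trial is accepted
                if cost(current) < cost(best):
                    best = current
        temperature *= cooling           # cooling factor below 1
    return best


def toy_cost(x):
    """Hypothetical cost with its minimum at x = 3."""
    return (x - 3) ** 2


def toy_neighbor(x, rng):
    """Hypothetical neighbourhood: move one step left or right."""
    return x + rng.choice((-1, 1))


result = simulated_annealing(50, toy_neighbor, toy_cost, t0=10.0, cooling=0.9,
                             n_reductions=60, chain_length=50,
                             rng=random.Random(0))
```

Tracking `best` separately from `current` reflects the remark that the returned solution is the best one ever found, which the pseudocode itself leaves implicit.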
3 Co-operation of Processes in Parallel Simulated Annealing
No communication algorithm (NC). The main assumptions for the no communication algorithm were formulated in [2], where the division algorithm was proposed. The method uses all available processors to run many sequential algorithms, where the original chain is split into sub-chains whose length is ChainLength (see Figure 1) divided by the number of processors. At the end, the best solution found is selected as the final one. The use of a large number of processors can result in excessive shortening of the sub-chains, which in turn may negatively affect the quality of results.

Periodical communication algorithm (PC). The idea of periodically interacting searches was fully developed in [13]. As in the NC algorithm, the length of the sub-chains is decreased. Additionally, processes communicate after performing a part of a chain called a segment, and the best solution is selected and mandated for all processes. In the PC algorithm the segment length is defined by the number of temperature drops. As suggested in [13], the period of the information exchange needs to be carefully selected to prevent search paths from being trapped in local minima areas as a result of communication.

Hybrid communication algorithm (HC). The details of the method were described in [9]. The implementation is intended to run on clusters of SMP nodes, so the parallelization is accomplished using two levels. Intensively communicating operations are moved to the inner level, where a shared memory environment is used. The remaining, comparatively rare, communication (the outer level) takes place in a distributed memory environment.
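The periodical communication (PC) scheme described earlier in this section can be simulated sequentially to see its control flow. This is a sketch under several assumptions: the Metropolis acceptance rule, the choice of the best current state as the solution mandated for all processes, and the toy cost function are not taken from the paper, and the processes are executed one after another rather than in parallel.

```python
import math
import random


def run_subchain(state, temperature, steps, cost, neighbor, rng):
    """Run `steps` SA trials at a fixed temperature and return the final state."""
    for _ in range(steps):
        candidate = neighbor(state, rng)
        delta = cost(candidate) - cost(state)
        # Metropolis acceptance rule (assumed).
        if delta < 0 or rng.random() < math.exp(-delta / temperature):
            state = candidate
    return state


def periodic_communication(p, chain_length, n_chains, omega, t0, cooling,
                           start, cost, neighbor, seed=0):
    """Simulate p co-operating searches; they exchange every `omega` chains."""
    rngs = [random.Random(seed + k) for k in range(p)]
    states = [start] * p
    best = start
    temperature = t0
    sub_length = max(1, chain_length // p)   # sub-chain = chain / processors
    for chain in range(1, n_chains + 1):
        states = [run_subchain(s, temperature, sub_length, cost, neighbor, r)
                  for s, r in zip(states, rngs)]
        leader = min(states, key=cost)
        if cost(leader) < cost(best):
            best = leader
        if chain % omega == 0:               # end of a segment
            states = [leader] * p            # best solution mandated for all
        temperature *= cooling
    return best


def toy_cost(x):
    return (x - 3) ** 2


def toy_neighbor(x, rng):
    return x + rng.choice((-1, 1))


best = periodic_communication(p=4, chain_length=200, n_chains=40, omega=10,
                              t0=10.0, cooling=0.9, start=50,
                              cost=toy_cost, neighbor=toy_neighbor)
```

Setting `omega` larger than `n_chains` removes all exchanges, so the same function then behaves like the NC scheme with a final selection of the best result.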
Outer-level parallelization. Each Markov chain of SA optimization is divided into sub-chains. Their length is equal to the length of the original chain divided by the number of sub-chains. The main idea is to assign a separate sub-chain to a cluster node to allow nodes to generate different sub-chains simultaneously. In this way the computational effort of generating a Markov chain is distributed among the available nodes. After the first Markov chain has been generated, the subsequent chains are generated with no communication among nodes. Each node takes the outcome of the last trial of the preceding sub-chain as the starting point for the subsequent sub-chain. At the end, the best solution found is produced as the final one. It is worth mentioning that the sub-chain length is shortened by the number of nodes instead of the number of processors. That results in longer sub-chains as compared with the NC and PC algorithms.

Inner-level parallelization. Within a node a few threads communicate with each other while building a single sub-chain of the length determined at the outer level. The idea of parallelization consists in dividing the total number of trials in each sub-chain into rather small sets of trials. Each thread performs its part of the set independently of the other threads. Having completed a set, the master thread selects a solution among all solutions which have been accepted and the remaining ones are discarded. The selected solution is made common for all threads and it becomes the starting point for further computation.
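The difference in sub-chain length between the schemes can be made concrete with a short calculation. The sketch below assumes integer division and a hypothetical chain length of 12000 trials; the function names are introduced here for illustration only.

```python
def subchain_length_nc_pc(chain_length: int, processors: int) -> int:
    """NC and PC: the chain is divided by the number of processors."""
    return chain_length // processors


def subchain_length_hc(chain_length: int, processors: int,
                       threads_per_node: int) -> int:
    """HC: the chain is divided by the number of nodes, not processors."""
    nodes = processors // threads_per_node
    return chain_length // nodes


L = 12000   # hypothetical sequential chain length
p = 40      # processors
nc = subchain_length_nc_pc(L, p)
hc2 = subchain_length_hc(L, p, threads_per_node=2)
hc4 = subchain_length_hc(L, p, threads_per_node=4)
```

With these numbers the NC and PC sub-chains have 300 trials, while HC2 and HC4 keep 600 and 1200 trials per sub-chain, which is the lengthening effect noted for the outer level.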
4 Theoretical Analysis of Algorithms
The speedups of the algorithms are established through the analysis of the computational complexity of trials. It is assumed that the algorithms stop when a defined number of sub-chains are completed. The results of the analysis are presented in Table 1, where the parameters denote:
– p: number of processors,
– L: length of the Markov chain for the sequential algorithm (see Figure 1),
– ω: period of communication (in the PC algorithm only),
– λ: intensity of the Poisson process of generation of trials, to be identified experimentally,
– β: coefficient determining the duration of a broadcast type inter-node communication, to be identified experimentally,
– M: number of Markov chains for the corresponding sequential algorithm,
– d: size of the set of trials (in the inner level of the HC algorithm),
– t1: duration of the sequential part executed on the inner level of parallelization, to be identified experimentally.

The details of the derivation of the formulas shown in Table 1 can be found in [7].

Table 1. Theoretical speedups

Algorithm    Speedup
PC           $\dfrac{p}{1 + \sqrt{\dfrac{p(p-1)^2}{(2p-1)L\omega}} + \dfrac{2\beta p\lambda}{L\omega}\log_2 p}$
NC           $\dfrac{p}{1 + \sqrt{\dfrac{p(p-1)^2}{(2p-1)LM}} + \dfrac{\beta p\lambda}{LM}\log_2 p}$
HC           $\dfrac{p}{1 + \dfrac{d-1}{\sqrt{2d-1}} + \lambda t_1 + \dfrac{\beta p\lambda}{LM}\log_2 \dfrac{p}{d}}$

Note that the formulas differ only in the denominators. By definition, the denominator is the inverse of the parallel efficiency, $\eta^{-1}$. For arbitrarily and experimentally defined parameters the parallel efficiency as a function of the number of processors is presented in Figure 2. In terms of efficiency the NC algorithm seems to be the best. The HC algorithm is the worst, although its efficiency has the striking feature of being relatively independent of the number of processors. One could conclude that only the trivial parallelization of the NC algorithm is recommended. Nevertheless, when analysing the chart the following caveats should not be ignored:
– In a real life application a user can demand the linear speedup, or he/she can set a time limit for the execution of the algorithm. In other words, one might need to stop the algorithm after a specified period of execution time.
– The way of exploration of the search space is not taken into account.
– The influence of the co-operation of processes is not considered.
Fig. 2. Parallel efficiency vs. number of processors (PC1, PC10 – the PC algorithm with the period of communication 1 and 10; NC, HC2, HC4 – the NC and HC algorithms with 2 and 4 parallel threads). [Chart: $\eta^{-1}$, from 0 to 10, against the number of processors, from 1 to 10000.]
The first note deserves explanation. As assumed at the beginning of this section, the chart concerns executions with no time limits (instead, the number of sub-chains to be generated is defined). In [7] the following theorem is proved:

Theorem 1. The deterioration of the solution quality of the parallel algorithm with the time limit is a non-increasing function of the efficiency of the parallel algorithm without the time limit.

The deterioration of the solution quality is measured by the difference between the expected solutions obtained by the parallel and sequential algorithms (the result of the sequential algorithm is taken as a reference). The theorem says that lower theoretical efficiency indicates a possibly worse quality of the solution achieved if the time limit is imposed. In the context of the investigated algorithms running with the time limit one may infer from the chart that it is necessary to take into account the number of available processors. More specifically, for a small number of processors one may expect that the HC and NC algorithms will yield the worst and best results, respectively. On the other hand, for a large number of processors the HC algorithm is expected to show its superiority. We believe, however, that these theoretical conclusions should be verified in practice, which is the subject of the next section.
5 Experimental Results
In the vehicle routing problem with time windows it is assumed that there is a warehouse "centrally" located with respect to the customers. There is a road between each pair of customers and between each customer and the warehouse. The objective is to supply goods to all customers at the minimum cost. A solution with fewer routes (first goal of optimization) is better than a solution with a smaller total distance travelled (second goal of optimization). A time window is associated with each customer as well as with the warehouse. Each customer has its own demand for goods and should be visited only once. Every route must start and terminate at the warehouse and must not exceed the maximum vehicle capacity. The sequential algorithm proposed in [4] was the starting point for parallelization.

The experiments were carried out on the NEC Xeon EM64T Cluster installed at the High Performance Computing Center in Stuttgart. Additionally, the NEC TX-7 (ccNUMA) system was used. Due to the lack of access to a genuine SMP cluster with 4 CPUs per node, the use of 4 threads per node was emulated (the NEC Xeon EM64T Cluster consists of 2-CPU nodes). The experiments were conducted on 5 tests from Solomon's benchmarking suite [14]: R108, R111, RC105, RC106 and RC108. It was shown in [5] that all the tests differ from each other substantially in terms of difficulty. This difficulty was measured by the factor Pr1, the probability that after an execution of the co-operating searches algorithm (CS, see [5] for details) a solution with the minimum number of routes is found. In the current research we measured the same factor for the algorithms PC, NC and HC. However:
– the following numbers of processors: 4, 8, 16, 20, 32, 40, 60, 80, 100, 120 were used instead of 5, 10, 15, 20,
– a limit on the execution time was imposed instead of a limit on the number of performed cooling stages.

It should also be stressed that the goal of parallelization in the PC, NC and HC algorithms is to achieve a high speedup with as small a deterioration of solution quality as possible.

Table 2. Solomon's tests ranked by the factor Pr1 obtained for the CS algorithm

Test     CS     PC1    PC10   NC     HC2    HC4
RC106    0.21   0.17   0.13   0.07   0.12   0.16
RC105    0.35   0.32   0.60   0.68   0.71   0.73
R111     0.66   0.47   0.30   0.23   0.31   0.48
R108     0.94   0.46   0.56   0.82   0.88   0.92
RC108    1.00   0.96   0.86   1.00   1.00   1.00
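The factor Pr1 reported in Table 2 is a probability of finding a solution with the minimum number of routes, so it can be estimated as a relative frequency over repeated runs. The sketch below assumes this frequency interpretation; the list of route counts is made-up illustration data, not a result from the paper.

```python
def estimate_pr1(route_counts, min_routes):
    """Fraction of runs whose final solution uses the minimum number of routes."""
    hits = sum(1 for routes in route_counts if routes == min_routes)
    return hits / len(route_counts)


# Hypothetical outcome of 10 independent runs (number of routes per run).
runs = [12, 11, 11, 12, 11, 13, 11, 12, 11, 11]
pr1 = estimate_pr1(runs, min_routes=11)
```

For this made-up sample six of the ten runs reach 11 routes, giving an estimate of 0.6.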
Table 2 shows the results of the experiments for the selected Solomon's tests ranked by the factor Pr1 obtained by executing the CS algorithm. Our results confirm that the tests differ from each other substantially in terms of difficulty. It can be seen that in every case an algorithm with co-operating processes, either the PC or HC, outperforms the NC algorithm.

Let us compare the results obtained for growing numbers of processors (Figure 3). For the hardest test RC106 a significant loss of the quality of results is observed as the number of processors increases. The co-operation of processes alleviates this loss, as in every case the results of the NC algorithm are worse than those of the other algorithms. Generally, the results of the PC1 algorithm are the best, although the results of the HC4 algorithm can also be described as satisfactory.

Considering the test RC105, the algorithms with periodic communication do not perform well. For every number of processors the NC algorithm generates results of better quality than the PC algorithms. All versions of the HC algorithm perform better than the versions of the PC. If the number of processors exceeds 40, the HC versions outperform the NC algorithm as well. Note that, except for 4 processors, the HC4 algorithm produces the results which are the closest to the solutions generated by the sequential algorithm (number of processors = 1).

The positive influence of the co-operation of processes can be observed for the R111 test. The HC4 algorithm gives the best results and the NC the worst. The former algorithm also gives the best results for the test R108, although the NC algorithm outperforms both versions of the PC algorithm. In the case of the test RC108 it is easy to find a solution with the minimum number of routes.
Fig. 3. Comparison of quality of results. [Five charts, one per test: RC106, RC105, R111, R108, RC108. Each plots the number of solutions (0 to 100) against the number of processors (1, 4, 8, 16, 20, 32, 40, 60, 80, 100, 120) for the PC1, PC10, NC, HC2 and HC4 algorithms.]
Using the HC algorithms this can be done even for more than 400 processors (which is not shown in the figure). The quality of results generated by the NC algorithm decreases when the number of processors exceeds 120. Solutions of the PC algorithms lose their quality much earlier.
6 Conclusions
The theoretical analysis shows that the efficiency of the NC algorithm is the best among the algorithms under consideration. The experiments indicate that the co-operation of processes (not present in the NC algorithm) can improve the quality of solutions. The influence of periodic communication (PC) strongly depends on the test. Although for the test R111 the results of the PC1 algorithm were very good, for the test R108 they were the worst. The periodical co-operation of processes does not always compensate for the shorter annealing sub-chains executed by processes. In the HC algorithms the inner level communication makes it possible to extend the length of the sub-chains while preserving the same number of trials executed within a cooling stage. That is why the quality of results of these algorithms is generally better than that of the NC algorithm. The HC algorithms can also be described as "balanced", since the quality of results is satisfactory for all investigated tests. Summing up, the HC algorithms are the advisable choice for solving the VRPTW by parallel simulated annealing.
Acknowledgement

The authors thank Rolf Rabenseifner of the High Performance Computing Center in Stuttgart for his contribution to this work. The research was supported by the HPC-Europa project (contract No RII3-CT-2003-506079), the Minister of Education and Science of Poland grants 3 T11F 004 29 and BK-239/RAu2/2006. Computing time was provided within the framework of the HLRS-NEC cooperation and by the following computing centers: Academic Computer Centre in Gdańsk TASK, Academic Computer Centre CYFRONET AGH, Kraków (computing grant 027/2004), Poznań Supercomputing and Networking Center, Interdisciplinary Centre for Mathematical and Computational Modelling, Warsaw University (computing grant G27-9), Wrocław Centre for Networking and Supercomputing (computing grant 04/97).
References

1. Aarts, E., de Bont, F., Habers, J., van Laarhoven, P.: Parallel implementations of the statistical cooling algorithm. Integration, the VLSI journal, 209–238 (1986)
2. Aarts, E., Korst, J.: Simulated Annealing and Boltzmann Machines. John Wiley & Sons, Chichester (1989)
3. Azencott, R. (ed.): Simulated Annealing Parallelization Techniques. John Wiley & Sons, New York (1992)
4. Czarnas, P.: Traveling Salesman Problem With Time Windows. Solution by Simulated Annealing. MSc thesis (in Polish), Uniwersytet Wrocławski, Wrocław (2001)
5. Czech, Z.J.: Speeding up sequential simulated annealing by parallelization. In: Proc. of the International Symposium on Parallel Computing in Electrical Engineering (PARELEC 2006), Białystok, pp. 349–356 (2006)
6. Czech, Z.J.: Co-operation of processes in parallel simulated annealing. In: Proc. of the 18th IASTED International Conference on Parallel and Distributed Computing and Systems, Dallas, Texas, pp. 401–406 (2006)
7. Debudaj-Grabysz, A.: Parallel simulated annealing algorithms. PhD thesis (in Polish), Silesian University of Technology, Gliwice (2007)
8. Debudaj-Grabysz, A., Czech, Z.J.: A concurrent implementation of simulated annealing and its application to the VRPTW optimization problem. In: Juhasz, Z., Kacsuk, P., Kranzlmüller, D. (eds.) Distributed and Parallel Systems. Cluster and Grid Computing. Kluwer International Series in Engineering and Computer Science, vol. 777, pp. 201–209 (2004)
9. Debudaj-Grabysz, A., Rabenseifner, R.: Nesting OpenMP in MPI to implement a hybrid communication method of parallel simulated annealing on a cluster of SMP nodes. In: Di Martino, B., Kranzlmüller, D., Dongarra, J. (eds.) EuroPVM/MPI 2005. LNCS, vol. 3666, pp. 18–27. Springer, Heidelberg (2005)
10. Debudaj-Grabysz, A., Rabenseifner, R.: Load balanced parallel simulated annealing on a cluster of SMP nodes. In: Nagel, W.E., Walter, W.V., Lehner, W. (eds.) Euro-Par 2006. LNCS, vol. 4128, pp. 1075–1084. Springer, Heidelberg (2006)
11. Greening, D.R.: Parallel Simulated Annealing Techniques. Physica D 42, 293–306 (1990)
12. Lee, F.A.: Parallel Simulated Annealing on a Message-Passing Multi-Computer. PhD thesis, Utah State University (1995)
13. Lee, K.-G., Lee, S.-Y.: Synchronous and Asynchronous Parallel Simulated Annealing with Multiple Markov Chains. IEEE Transactions on Parallel and Distributed Systems 7(10), 993–1008 (1996)
14. Solomon, M.: Algorithms for the vehicle routing and scheduling problem with time windows constraints. Operations Research 35, 254–265 (1987), http://w.cba.neu.edu/~msolomon/problems.htm
15. Salamon, P., Sibani, P., Frost, R.: Facts, Conjectures and Improvements for Simulated Annealing. SIAM, Philadelphia (2002)
16. Tan, K.C., Lee, L.H., Zhu, Q.L., Ou, K.: Heuristic methods for vehicle routing problem with time windows. In: Artificial Intelligence in Engineering, pp. 281–295. Elsevier, Amsterdam (2001)
Modified R-MVB Tree and BTV Algorithm Used in a Distributed Spatio-temporal Data Warehouse

Marcin Gorawski and Michal Gorawski

Silesian University of Technology, Institute of Computer Science, Akademicka 16, 44-100 Gliwice, Poland
{Marcin.Gorawski, Michal.Gorawski}@polsl.pl
Abstract. Structural and software modifications of the MVB-tree (reverse pointers, aggregations) make it possible, in exchange for higher space consumption, to answer timestamp and time-aggregated queries in a fast and easy way. The software extensions are new algorithms that accelerate query processing. This paper contains a brief description of temporal data and of the ways of handling them in the modified R-MVB tree, and presents the distributed system in which the above-mentioned index was tested, along with the load balancing algorithm used in this solution.

Keywords: spatio-temporal indexing, R-MVB, load balancing.
1 Introduction
Standard (relational) database systems accelerate query performance using one-dimensional indexes from the B-tree family. Those indexes were optimized for one-dimensional processing, where objects are identified by a single alphanumerical value. Current research concentrates on: (a) spatial (multidimensional) indices, where the key is a vector of alphanumerical values, and (b) temporal indices, where object parameters are characterized by keys that change in time. When one-dimensional indexes are used for spatio-temporal queries, it is very hard to create a good query execution plan, because separate columns are described by separate indexes (the selectivity of every column depends not only on the query range but also on the data). Extensive research on dedicated indexes has been conducted for many years, and many different access methods have been developed as a result.

This paper focuses on one of the temporal indexes, namely the MVB-tree (Multi-version B-tree) [1,2,3,4], and its usage in a distributed spatio-temporal telemetric data warehouse, which is also presented in the paper. Many modifications were made to the original MVB-tree algorithm, including its connection with an R-tree index. The newly obtained structure is called a modified R-MVB tree. The modifications made to the original MVB-tree were essential for efficient use of the algorithm in the B-DSTDW(t) system (section 4), which is the main subject of our research.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 199–208, 2008. © Springer-Verlag Berlin Heidelberg 2008
The paper is structured as follows: section 2 gives the characteristic of spatial and temporal data, section 3 describes the MVB-tree and its modifications, and section 4 presents the B-DSTDW(t) system, an on-line balancing algorithm named BTV [11] used in this system, and a spatio-temporal version of the MVB-tree, which is the modified R-MVB tree. Section 5 shows the results of the tests performed in the B-DSTDW(t) system, section 6 summarizes the paper and section 7 gives a quick glance at the authors' current research.
2 Spatial and Temporal Data Characteristic
Spatial data are multidimensional, i.e. their key consists of a number of unique parts. Data connected to objects with different coordinates do not maintain a linear order and are characterized by a predominantly dissimilar distribution of objects. Effective access to a large amount of spatial data is assured by indexing. The multidimensional character of spatial data makes indexing methods designed for one-dimensional indexes (for example the B-tree family) unsuitable. When one-dimensional indexes are used, obtaining a good query execution plan in spatio-temporal solutions is extremely hard, because separate indexes are placed on consecutive columns (the selectivity of each column depends not only on the query range but also on the data itself). A spatial index is a structure which groups spatial data in a manner that speeds up the execution of spatial queries.

After an analysis of the character of telemetric data, a decision was made to extend spatial indexing with a temporal dimension. This allows efficient handling of not only spatial queries but also spatio-temporal queries. Including the temporal dimension in a Spatial Data Warehouse (SDW) enabled the creation of a Spatio-Temporal Data Warehouse (STDW); with the use of telemetric data and data distribution across several system nodes, the resulting system can be called a Distributed (Parallel) Spatio-Temporal Telemetric Data Warehouse (D(P)STDW(t)) [5,6,7].

The following section presents the idea of the MVB-tree. It is well suited to servicing temporal data and, with the modifications made, it is an efficient solution to the temporal data indexing problem in the Balanced Distributed Spatio-Temporal Telemetric Data Warehouse (B-DSTDW(t)) system. Its variation, the R-MVB tree, is presented as a complex solution for spatio-temporal data servicing.
3 MVB-Tree and Its Modifications
The original MVB-tree algorithm is presented in [2,8]. During the use of the original MVB-tree certain drawbacks were observed. The modifications described below make the MVB-tree better suited to our needs presented in section 4. The MVB-tree is a modification of a B-tree prepared for storing temporal (multi-version) data [9]. The main goal was to create an index that has the same performance for multi-version data as indexes for single-version (that is, static) data, i.e. an asymptotically optimal solution. The authors of [2] proved that the MVB-tree fulfills those criteria. Strictly speaking, the MVB-tree is not a 'tree' but a hierarchical structure based on a directed list of roots. Figure 1 shows an outline which should give a better view of this definition. Nevertheless, according to convention, this structure will still be called 'a tree'.
Fig. 1. MVB-tree structure outline. [Diagram: a list of root descriptors with the life spans 0-5, 5-22, 22-37 and 38-*.]
The tree structure consists of the following elements:
– Tree root descriptor: an object that contains information about the time range in which the root is valid, and the root identifier.
– List of tree descriptors: the descriptors of all roots in the tree are connected in a list (or another structure) ordered by increasing, disjoint life spans.
– Middle nodes and leaves.

Nodes in the MVB-tree are divided into two groups, middle (directory) nodes and leaves. This division is similar to that made in every balanced hierarchical index. Every entry in a node has its life span (that is, validity time range), in which the information contained in the entry is valid. Middle nodes redirect the search to lower levels; they use the key ranges of their subtrees (minKey/maxKey). Leaves contain individual data objects connected to specific keys.

Fig. 2. An example of MVB-tree
Figure 2 shows an MVB-tree after the following series of operations: Add(2), Add(3), Add(2), Add(5), Del(5), Add(2), Add(7), Add(9), Add(11), Del(7), Del(11), Add(9). Adding a non-existing object results in its creation. Adding an existing object results in the creation of a new version of this object. The structural modifications presented in the following sections are autonomous. Implementations can use both of them, one or none without impact on the correctness of the solution. Extending the MVB-tree with those modifications is relatively simple, but results in one inconvenience, which is higher disc space consumption [3].
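The effect of the operation series above on object life spans can be reproduced with a flat multi-version store. This is a sketch only: it is a plain list of versioned entries rather than an MVB-tree with roots and nodes, the class name `VersionedStore` is hypothetical, and it assumes that each operation takes place at the next integer time starting from 1.

```python
import math

OPEN = math.inf  # stands for '*': the version is still alive


class VersionedStore:
    """Flat list of [key, start, end] entries with life span [start, end)."""

    def __init__(self):
        self.entries = []
        self.clock = 0

    def _close(self, key):
        """End the life span of the currently alive version of `key`, if any."""
        for entry in self.entries:
            if entry[0] == key and entry[2] == OPEN:
                entry[2] = self.clock

    def add(self, key):
        self.clock += 1
        self._close(key)  # adding an existing object creates a new version
        self.entries.append([key, self.clock, OPEN])

    def delete(self, key):
        self.clock += 1
        self._close(key)

    def alive_at(self, t):
        """Timestamp query: keys whose life span contains time t."""
        return sorted(k for k, start, end in self.entries if start <= t < end)

    def versions(self, key):
        return [(start, end) for k, start, end in self.entries if k == key]


store = VersionedStore()
ops = [("add", 2), ("add", 3), ("add", 2), ("add", 5), ("del", 5), ("add", 2),
       ("add", 7), ("add", 9), ("add", 11), ("del", 7), ("del", 11), ("add", 9)]
for op, key in ops:
    store.add(key) if op == "add" else store.delete(key)
```

After the twelve operations, object 2 has the versions [1,3), [3,6) and [6,*), and a timestamp query at time 9 returns the keys 2, 3, 7, 9 and 11.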
3.1 Reverse Pointers
The MVB-tree structure was originally designed for timestamp query processing. The first proposed extension overcomes this limitation by providing a convenient mechanism for time-range query processing. Our proposal is to keep, for every entry, a reverse pointer to the previous version of the given object. These pointers are stored in a one-directional list in every object. Storing the pointers is very simple: when an object is inserted, it is checked whether a previous version of the object exists (i.e. whether the operation is an update or a 'pure' insertion). If it does, a reverse pointer to the leaf containing the previous version is set in the new entry.

The approach has at least one significant drawback (which depends on the characteristics of the system employing the MVB-tree). Namely, if at some point in time an object was deleted, then the reverse pointers list ends there. An object inserted later will not be connected to the deleted version. If such a connection is required, then an additional effort must be made during insertion (only for objects which appear to be new) by searching for the latest active version of the object. Such an operation can be time-consuming. Another solution is to use a separate B+ tree structure to keep the last object versions. The second solution has two advantages: (1) it can be used for fast processing of queries concerning the present time, (2) keeping all reverse pointers is very fast and simple. In our implementation the problem has marginal significance, because in the implemented system no object is inserted again after its deletion.

Figure 3 shows an MVB-tree with reverse pointers. The reverse pointers list is presented for the object with ID=2. As we can see, regardless of entry duplication (for example the entry 2 [6,*) for ROOT2 and ROOT3), this approach points directly to the previous version of the object.
It is possible to get the full history of an object A with at most H+UPD(A) node accesses, where H is the maximum height of the MVB-tree and UPD(A) is the number of updates of object A. Some new ideas targeting the reduction of the disk space used by reverse pointers are being tested. Probably only a few reverse pointers are required for an entire leaf; they could be kept in a table, with every leaf entry assigned to a specific entry in that table.
Fig. 3. Reverse pointers in MVB-tree structure (three roots, ROOT 1, ROOT 2 and ROOT 3, covering the version intervals [-*,6), [6,9) and [9,*))
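The reverse pointer chain can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions: the names `Entry`, `insert_version` and `history` are hypothetical, a plain dictionary stands in for the lookup of the latest version, and the pointer refers to the previous entry itself rather than to the leaf that contains it. The version intervals follow the object with ID=2 from Fig. 3.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """One version of an object (hypothetical layout of a leaf entry)."""
    key: int
    start: int                        # version at which the entry became valid
    end: Optional[int]                # None stands for '*', i.e. still alive
    prev: Optional["Entry"] = None    # reverse pointer to the previous version


def insert_version(latest: dict, key: int, start: int) -> Entry:
    """Insert a new version; `latest` stands in for finding the current version."""
    old = latest.get(key)             # None means a 'pure' insertion
    if old is not None:
        old.end = start               # an update closes the previous version
    entry = Entry(key, start, None, prev=old)
    latest[key] = entry
    return entry


def history(entry: Optional[Entry]) -> list:
    """Follow the reverse pointers from a version back to the first one."""
    versions = []
    while entry is not None:
        versions.append((entry.start, entry.end))
        entry = entry.prev
    return versions


latest = {}
for t in (1, 3, 6):                   # object 2 is inserted, then updated twice
    insert_version(latest, 2, t)
print(history(latest[2]))             # [(6, None), (3, 6), (1, 3)]
```

Each step of the loop in `history` corresponds to one node access in the real structure, which is where the UPD(A) term of the H+UPD(A) bound comes from.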
3.2 Temporal Aggregates
There are applications in which access to detailed data is not necessary, and sometimes even undesirable. In data warehousing and data processing, the emphasis is put on fast retrieval of aggregates and summaries rather than on retrieving detailed results for specific objects. Dependencies between keys (if any exist) are not visible at the tree level, so one way to use aggregates is the time aggregation approach. The approach assumes aggregating the values of every object over its history. The MVB-tree does not provide a way to answer such a query, and without reverse pointers, evaluating such a query is very difficult. The simplest example of the approach is a system storing values from water meters. Every meter has its ID, and values are updated by sending the change in value since the previous measurement. In the basic version of the system, such an aggregate query is quite difficult to process; however, adding an aggregate field named SUM makes it easy. The detailed data are not lost, so it is still possible to get a detailed answer. Another measured feature can be the number of updates counted since the first registration of the object. This feature provides knowledge about the change rate of objects. These aggregations can be used not only for retrieving measurements for the entire history. By submitting two queries concerning the two extreme points of a time range, it is possible to easily retrieve an answer for such a range query. Figure 4 shows the structure of an MVB-tree extended with aggregations. The tree keeps the last measure update as a value, SUM is the sum of values (the current value of the meter), and COUNT contains the number of updates (the number of versions in the history).
Fig. 4. Example of MVB-tree with aggregation fields in leaves
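The effect of the SUM and COUNT fields can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' implementation: the class `AggregatedHistory` and its methods are hypothetical, a sorted list stands in for the tree, and updates are assumed to arrive in increasing time order. The sample data reproduce the versions of object 2 in Fig. 4, read here as interval:value:COUNT:SUM.

```python
import bisect


class AggregatedHistory:
    """Version history of one object with running COUNT and SUM per version."""

    def __init__(self):
        self.starts = []     # start time of every version, ascending
        self.records = []    # (value, COUNT, SUM) for every version

    def update(self, t, value):
        """Register a change `value` reported at time `t`."""
        _, count, total = self.records[-1] if self.records else (0, 0, 0)
        self.starts.append(t)
        self.records.append((value, count + 1, total + value))

    def at(self, t):
        """Timestamp query: the record valid at time `t`."""
        i = bisect.bisect_right(self.starts, t) - 1
        return self.records[i] if i >= 0 else (0, 0, 0)

    def between(self, t1, t2):
        """Range query answered with two timestamp queries.

        Returns the sum and the number of updates that happened after t1
        and not later than t2.
        """
        _, count1, sum1 = self.at(t1)
        _, count2, sum2 = self.at(t2)
        return sum2 - sum1, count2 - count1


meter = AggregatedHistory()
for t, change in [(1, 1), (3, 4), (6, -2)]:
    meter.update(t, change)

print(meter.records)         # [(1, 1, 1), (4, 2, 5), (-2, 3, 3)]
print(meter.between(2, 7))   # (2, 2)
```

The range answer needs only the two border records, so its cost does not depend on the number of versions inside the range.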
4 MVB Structure Usage in Distributed Spatio-temporal Telemetric Data Warehouse (B-DSTDW(t)) - The BTV Balancing Algorithm
Load balancing in spatio-temporal data warehouse systems is used to minimize the average response time. The average response time is defined as the average interval between the moment when a query is sent to the DSTDW and the moment when a response is received. Response time is reduced by adequately distributing data between the DSTDW system nodes (servers) and making operations on them as parallel as possible [10]. In the case of B-DSTDW systems based on computers with the same parameters, balancing comes down to loading every node equally (here the node performance is the bottleneck). However, when a B-DSTDW system is based on computers with various performance characteristics, an adequate load balancing algorithm should be used. Moreover, an adequate data allocation schema must be used to ensure proper participation of every node in the process of response generation.

The algorithm of direct balancing based on threshold values (BTV), applied to the B-DSTDW system, uses vertical data partitioning based on the range partitioning algorithm, in which a relation is divided into partitions delimited by partition boundaries [11]. Each node administers one range of partitioning attribute values; a node whose range is empty is assumed to administer an empty partition. Nodes administering adjacent ranges are called neighbors. The load of a node is defined as the number of tuples stored in it. It is assumed that the coordinator (in PSDW it is the managing module, which performs distributed data loading) has access to the information on partition ranges and points to the adequate nodes needed by query, insert and delete operations. After every insert or delete operation, the BTV algorithm is activated. The BTV algorithm is based on two fundamental operations:
– When a certain node becomes responsible for too large a data set, it moves part of the data to its least loaded neighbor, attempting to balance the load. This operation is called NBRADJUST.
– When a node is greatly overloaded compared to the least loaded node, but the loads of its neighboring nodes do not differ enough, the REORDER operation is performed.
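The NBRADJUST operation described above can be sketched as follows. This is only a rough Python illustration under assumptions that the text does not state: the function name `nbr_adjust`, the fixed `threshold` parameter and the rule of moving half of the load difference are all hypothetical, and REORDER is not covered.

```python
def nbr_adjust(partitions, i, threshold):
    """Move part of node i's data to its least loaded neighbor.

    partitions: list of sorted key lists, one per node, in range order.
    i:          index of the node that has just received an insert.
    threshold:  load above which the node is considered overloaded
                (an assumed, fixed value; BTV computes its own thresholds).
    Returns True if any tuples were moved.
    """
    load = len(partitions[i])
    if load <= threshold:
        return False
    neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(partitions)]
    if not neighbors:
        return False
    j = min(neighbors, key=lambda k: len(partitions[k]))
    surplus = (load - len(partitions[j])) // 2   # assumed: equalize the two loads
    if surplus <= 0:
        return False
    if j < i:
        # The left neighbor takes over the smallest keys of node i.
        partitions[j].extend(partitions[i][:surplus])
        del partitions[i][:surplus]
    else:
        # The right neighbor takes over the largest keys of node i.
        partitions[j][:0] = partitions[i][-surplus:]
        del partitions[i][-surplus:]
    return True


nodes = [[1, 2], [10, 11, 12, 13, 14, 15, 16, 17], [20, 21, 22, 23]]
moved = nbr_adjust(nodes, 1, threshold=6)
print(moved, [len(p) for p in nodes])   # True [5, 5, 4]
```

Because only keys at the edge of a range are handed over, every node still administers one contiguous range after the adjustment, and only the boundary between the two neighbors changes.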
The original online balancing described in [10] is based on an algorithm designed for peer-to-peer systems. For efficient usage of this algorithm in the B-DSTDW(t) system (BTV), certain limitations had to be addressed:
– the analysis of node load considers only the amount of data stored in the nodes; server performance is not taken into consideration,
– loading data rows into the data warehouse one by one considerably lengthens the data loading process,
– the assumed way of calculating successive threshold values is adjusted for systems consisting of a large number of nodes (the original definition of threshold values allows considerable system imbalance ratios, which is not acceptable in the DSDW(t) system),
– the algorithm allows for only one partitioning attribute (in the PSDW(t) system, data partitioning is based on two attributes: the X, Y coordinates of the meter location, which is the source of the measures).
In the BTV algorithm those limitations were removed by the following modifications:
– the introduction of new methods of calculating the imbalance ratio for the system that take the node performance characteristics into consideration,
– the insertion of tuples in packets of a size determined by the system administrator,
– the creation of one partitioning attribute for the measurements table based on the X, Y coordinates of the meter location.

4.1 Balanced Distributed Spatio-temporal Telemetric Data Warehouse (B-DSTDW(t)) System
Regulation of the energy sector in the European Union has created a new market: the media (electricity, gas, heat and water) recipient market. The main problem that emerges is the automated reading of hundreds of thousands of recipients' meters and the critically fast analysis of terabyte-sized sets of meter data. In the case of electricity suppliers, reading, demand forecasting and decision-making are subject to the energy balance closure regime (e.g. within 30 minutes). A condition for achieving this goal is the use of Integrated Meter Reading (IMR) and Distributed Spatial Telemetric Data Warehouse (DSDW(t)) technology (Fig. 5). The IMR system sends data from media meters to a database system via a cellular telephony network (GPRS technology). IMR is a transactional system which services four types of meters: electrical energy, water, gas and heat. Data stored in the telemetric server are in a raw state and need to be adequately formatted. The DSDW(t) system gathers data from telemetric servers in the extraction process via a network using the TCP/IP protocol. In the distributed DSTDW(t) system, additional partitioning is performed during the extraction process, followed by data loading into the system nodes.
Fig. 5. IMR/DSTDW(t) system architecture (Integrated Meter Reading system with local radio communication, GSM network, AIUT GSM telemetric server, TCP/IP link, ETL process with data partitioning and load balancing, data loading into the DSTDW servers, and a DSTDW client issuing spatio-temporal queries over client-server RMI communication)
Existing research on DSDW systems and spatio-temporal DW systems (STDW) has concentrated on architectures indexed with the aR-tree [5]. In current research we test new structures capable of servicing spatio-temporal queries. One of the proposed structures is the R-MVB index presented below, which combines the MVB-tree and the R-tree.

4.2 Modified R-MVB Structure
The modified R-MVB index (Fig. 6) consists of two indexes: an R-tree and a modified MVB-tree. The first index handles responses to spatial queries (the list of spatial objects in a range query). The second index calculates aggregates for a given spatial object at a given time. The R-tree is optimized for spatial range search. Such a combination ensures efficient support of spatio-temporal queries. In this structure, data loading is independent of the data characteristics. This means that no prior knowledge of the data is needed to build the index.

Fig. 6. Modified R-MVB-tree (figure labels: data set, hashtable pointing to the adequate R-tree, hashtable pointing to the adequate MVB-tree, spatial result + temporal query)
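The two-stage evaluation described above (spatial filtering first, temporal aggregation second) can be outlined in Python. This is a sketch under heavy simplification: a linear scan stands in for the R-tree, a sorted list of running sums per object stands in for the modified MVB-tree, and all function names and sample data are hypothetical.

```python
import bisect


def spatial_query(points, window):
    """Stand-in for the R-tree: ids of the objects inside the window."""
    x1, y1, x2, y2 = window
    return [oid for oid, (x, y) in points.items()
            if x1 <= x <= x2 and y1 <= y <= y2]


def temporal_sum(history, t):
    """Stand-in for the modified MVB-tree: running SUM of one object at time t.

    history: list of (start_time, running_sum) pairs, ascending by start_time.
    """
    starts = [start for start, _ in history]
    i = bisect.bisect_right(starts, t) - 1
    return history[i][1] if i >= 0 else 0


def spatio_temporal_query(points, histories, window, t1, t2):
    """Total change of SUM after t1 and up to t2 for the objects in the window."""
    return sum(temporal_sum(histories[oid], t2) - temporal_sum(histories[oid], t1)
               for oid in spatial_query(points, window))


points = {2: (1, 1), 3: (5, 5), 9: (2, 3)}
histories = {
    2: [(1, 1), (3, 5), (6, 3)],
    3: [(2, 3)],
    9: [(8, 6), (12, 0)],
}
total = spatio_temporal_query(points, histories, (0, 0, 3, 3), 2, 10)
print(total)   # 8
```

The dictionary `histories` plays the role of the hashtable that leads from a spatial result to the temporal structure of the matching object.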
5 R-MVB Structure Tests
The test environment consisted of three B-DSTDW(t) server nodes and two B-DSTDW(t) client nodes. The tests below show the results of the modified R-MVB tree in comparison with another index (STCAT) [3] used in B-DSTDW(t).
Fig. 7. Indexes' building time (index building time [s] versus data set size, from 120000 to 2000000, for STCAT and R-MVB)
The modified R-MVB tree has a very good response time to spatio-temporal queries; however, its building time (Fig. 7) is much longer compared to the other indexes (in this case the STCAT index test results are shown). This puts the R-MVB at a certain disadvantage, but thanks to its ability to store the complex history of records and its superior response time, it is still worth using. The STCAT index is a good alternative to the modified R-MVB tree; however, as shown in Fig. 8, the response time of the STCAT index increases with the data set size, whereas the R-MVB tree has a constant response time regardless of the data set size.
Fig. 8. Spatio-temporal queries response time (response time to spatio-temporal queries [ms] versus data set size, from 120000 to 2000000, for STCAT and R-MVB)
6 Conclusion
In this paper, the modifications of the MVB-tree are described thoroughly, along with its spatio-temporal version, namely the R-MVB index. This structure is a valuable asset for increasing the effectiveness of the B-DSTDW(t) system. The previous SDW(t) systems based on an aR-tree supported only spatial queries, while the modified R-MVB allows efficient responses not only to spatial queries but also to spatio-temporal queries. Certain disadvantages, such as long building time and large hard disk usage, show that there is still a need for new indexing structures. New indexes are currently being upgraded and developed; however, the modified R-MVB is still the best index available in this respect.
7 Future Works
Apart from new indexing structures, current research also focuses on a new GDWSA(t) system [12,13] (Grid Data Warehouse with Software Agents system for telemetric data). The grid is a form of distributed computation that coordinates and shares calculations, applications, data, memory and network resources. This is obtained through a dynamically changing and geographically distributed organization. Grid technology can be the key to solving highly complex computational problems. The Software Agents methodology is based on grid technology. A single node has the ability to configure, to connect to a remote catalogue, and to connect directly to other Agents. Tasks are divided and executed in parallel on many workstations. This results in a considerable speed-up. Built-in functions that measure the efficiency of a local node allow the performance of the whole system to be evaluated. The system is adapted to a multi-user mode, and a good balancing mechanism provides equal use of all system nodes. It is essential for full usage of system capabilities in an unknown environment. The balancing goal is to omit slower nodes and to send more tasks to the fastest nodes. To achieve this goal, a new genetic balancing algorithm [14], designed especially for the GDWSA(t) system, is currently being implemented.
References
1. Manolopoulos, Y., Nanopoulos, A., Papadopoulos, A.N., Theodoridis, Y.: R-Trees have grown everywhere. In: Proceedings of the 23rd VLDB Conference, Greece, pp. 13–14 (1997)
2. Tao, Y., Papadias, D.: Range aggregate processing in spatial databases. IEEE Transactions on Knowledge and Data Engineering (TKDE) 16(12), 1555–1570 (2004)
3. Gorawski, M., Faruga, M.: STAH-tree, Hybrid Index for Spatio Temporal Aggregation. In: 9th International Conference on Enterprise Information System (ICEIS), Portugal (2007)
4. Papadias, D., Kalnis, P., Zhang, J., Tao, Y.: APN 2001. LNCS. Springer, Heidelberg (2001)
5. You, B., Lee, D., Eo, S., Lee, J., Bae, H.: Hybrid Index for Spatio-temporal OLAP operations. In: Proceedings of the ADVIS Conference, Izmir, Turkey (2006)
6. Gorawski, M., Malczok, R.: Materialized aR-tree in Distributed Spatial Data Warehouse. An International Journal Intelligent Data Analysis 10(4), 361–377 (2006)
7. Gorawski, M., Malczok, R.: On Efficient Storing and Processing of Long Aggregate Lists. In: Tjoa, A.M., Trujillo, J. (eds.) DaWaK 2005. LNCS, vol. 3589, pp. 190–199. Springer, Heidelberg (2005)
8. Gorawski, M., Chechelski, R.: Parallel Telemetric Data Warehouse Balancing Algorithm. In: 5th International Conference on Intelligent Systems Design and Applications, Wrocław, Poland, September 8-10, 2005, pp. 387–392. IEEE CS, Los Alamitos (2005)
9. Tao, Y., Papadias, D.: Spatial Queries in Dynamic Environment. ACM Transactions on Database Systems 28(2) (June 2003)
10. Becker, B., Gschwind, S., Ohler, T., Seeger, B., Widmayer, P.: An asymptotically optimal multiversion B-Tree. VLDB Journal 5(4), 264–275 (1996)
11. Gorawski, M., Kamiński, M.: On-Line Balancing of Horizontally-Range-Partitioned Data in Distributed Spatial Telemetric Data Warehouse. In: 3rd International Workshop on Grid and Peer-to-Peer Computing Impacts on Large Scale Heterogeneous Distributed Database Systems (GLOBE 2006), DEXA 2006, September 4-8, 2006, pp. 273–277. IEEE CS, Los Alamitos (2006)
12. Ganesan, P., Bawa, M., Garcia-Molina, H.: Online Balancing of Range-Partitioned Data with Applications to Peer-to-Peer Systems. Stanford University, Stanford 94305
13. Gorawski, M., Bakowski, S., Gorawski, M.: The Software Agents in a Database Grid. In: 2nd Workshop on Large Scale Computations on Grids (LaSCoG 2006), XXII Autumn Meeting of Polish Information Processing Society, Wisła, Poland, November 6-10, 2006, pp. 259–266. IMCSIT (2006)
14. Polat, F., Alhajj, R.: A multi-agent tuple-space based problem solving framework. Middle East Technical University, 06531 Ankara, Turkey (1998)
Towards Stream Data Parallel Processing in Spatial Aggregating Index
Marcin Gorawski and Rafal Malczok
Silesian University of Technology, Institute of Computer Science, Akademicka 16, 44-100 Gliwice, Poland
{Marcin.Gorawski, Rafal.Malczok}@polsl.pl
Abstract. Data processing computer systems store and process large volumes of data. The volumes tend to grow very quickly, especially in data warehouse systems. A few years ago data warehouses were used only for supporting strictly business decisions, but nowadays they find application in many domains of everyday life. A new and very demanding field is stream data warehousing. Car traffic monitoring, cell phone tracking and integrated utility meter reading systems generate stream data. In a stream data warehouse the ETL process is a continuous one. Stream data processing poses many new challenges to memory management and data processing algorithms. The most important aspects concern the efficiency and scalability of the designed solutions. In this paper we present an example of a stream data warehouse and then, based on the presented example and the results of our previous work, we discuss a solution for parallel processing of stream data. We also show how to integrate the presented solution with a spatial aggregating index.
Keywords: stream processing, stream data warehouse, parallel algorithms.
1 Introduction
Not so long ago data warehouses were used only for supporting decision making processes in strictly business systems. Nowadays, when more and more human activities produce huge volumes of data, data warehouse systems find application in many domains of everyday life. Recently, stream data and stream data warehouses have become an active field of research. There are many examples of systems generating stream data: car traffic monitoring, utilities consumption monitoring or cell phone tracking. Stream data warehouse designers must face many problems which do not exist in standard data warehouses. The ETL process is not executed periodically in batch mode but runs continuously, transforming and loading data generated by the data sources. The data must be stored and efficiently processed to provide up-to-date data-mining results. Processing large amounts of data is a highly time- and resource-consuming process. To make the data processing efficient and scalable, parallel or distributed computing is often applied.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 209–218, 2008. © Springer-Verlag Berlin Heidelberg 2008
In [1] we presented the Materialized Aggregates List (MAL), a solution that is able to process long lists of aggregates. We applied multithreading and partial aggregates materialization to make the solution applicable to large volumes of data. In this paper we first show how to use MAL to process stream data and then we address the problem of integrating MAL with a spatial aggregating index. The final outcome of our research is a spatial aggregating index which not only can answer a range query, but is also able to build and integrate streams of aggregates in parallel. The remaining part of the paper is organized as follows: in the next section we first present an example of a stream data warehouse and then, based on the results of our previous work, we discuss the details of a solution for stream data processing. Section 3 presents the streams merging operation and addresses the memory management problems. In section 4 we present the results of efficiency tests. Finally, in section 5 we conclude the paper and present our future plans.
2 Stream Data Warehouse
We start by describing the example motivating our research: a stream data warehouse system. The system monitors the consumption of utilities such as water, natural gas and electrical energy. The meters are located in a region where a telemetric installation exists. The meters send the readings via radio waves to collecting points called nodes. The nodes, using a standard TCP/IP network, transfer the readings to telemetric servers. The ETL process gathers the data from the telemetric servers. The operation of reading the utility consumption from a meter can be executed according to two different models. The first model assumes that there is a signal that triggers the meter to send the reading to the collecting point. In the second model no signal is required and the meters send the readings at certain time intervals. The intervals depend on the telemetric system configuration. In the discussed example the second model is applied. In addition to the meter readings, the data warehouse also stores additional information: the geographical location of the meters and collecting points, weather conditions in the region encompassed by the telemetric installation, and a brief description of the meter users. This information is used when the utilities consumption is analyzed.

Stream Source Definition. Considering the operation of the telemetric system, we assume that every single meter is an independent source generating an endless stream of readings. A single reading generated by a meter contains the unique identifier of the meter that generated the reading, a precise timestamp defining the date and time when the reading was generated, and a list of values storing the meter's counter values for the zones monitored by the meter. By a data stream we understand an endless sequence of elements of a given type. Each element exists in a stream at a well-defined moment of time. For a stream there is no concept of a stream end. As the stream beginning we consider the first element generated by the source of the stream (a utility meter). The intensity of a stream (the number of elements occurring in the stream in a given time unit) depends on the telemetric system configuration parameters defining how often the meters send the readings to the collecting point. In figure 1 we present an example of a stream generated by a utility meter (depicted as a triangle). The stream elements (readings) are marked with dots.

2.1 Materialized Aggregate List
It is very common for data warehouses to process aggregated data instead of raw data. There are two main reasons for this: (1) in most cases end users are more interested in general trends existing in the data than in detailed analysis results, and (2) aggregating data reduces data volume and, as a consequence, shortens processing times. To efficiently process stream data, raw data streams are aggregated as well. Aggregates are calculated for a time interval called the aggregate time window $T_{aw}$. In the case of utilities consumption, the values stored in the aggregate can be interpreted as the approximate average consumption in some time period (the aggregate time window). Figure 1 presents an aggregate time window beginning at the moment $t_{b1}$ and ending at the moment $t_{b2}$. Aggregate values are calculated in the following way:
1. The meter readings at the moments $t_{b1}$ and $t_{b2}$ are approximated using linear interpolation ($v_n$ denotes the meter reading at the moment $t_n$). For the scenario presented in figure 1, the meter reading at the moment $t_{b1}$ is calculated as $v_{b1} = v_{n-1} + \frac{t_{b1} - t_{n-1}}{t_n - t_{n-1}} (v_n - v_{n-1})$. Similarly, for the moment $t_{b2}$: $v_{b2} = v_{n+2} + \frac{t_{b2} - t_{n+2}}{t_{n+3} - t_{n+2}} (v_{n+3} - v_{n+2})$.
2. The aggregate value for a given time window is calculated as the difference of the approximated readings at the borders of the aggregate time window: $v_{b2} - v_{b1}$.
Fig. 1. An aggregates stream is generated from a readings data stream
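The two-step aggregate calculation (interpolate the readings at the window borders, then subtract) can be written as a short Python sketch. The function names and the sample readings are hypothetical, and a single counter value per reading is assumed.

```python
def interpolate(t, t_a, v_a, t_b, v_b):
    """Linear interpolation of a meter reading at time t, with t_a <= t <= t_b."""
    return v_a + (t - t_a) / (t_b - t_a) * (v_b - v_a)


def window_aggregate(readings, t_b1, t_b2):
    """Aggregate for the window [t_b1, t_b2].

    readings: list of (timestamp, counter value) pairs sorted by timestamp.
    """
    def reading_at(t):
        # Find the two neighboring readings that enclose the window border.
        for (t_a, v_a), (t_b, v_b) in zip(readings, readings[1:]):
            if t_a <= t <= t_b:
                return interpolate(t, t_a, v_a, t_b, v_b)
        raise ValueError("window border outside the available readings")

    return reading_at(t_b2) - reading_at(t_b1)


readings = [(0, 100.0), (10, 110.0), (20, 130.0), (30, 160.0)]
print(window_aggregate(readings, 5, 25))   # 40.0
```

In this example the border readings are approximated as 105.0 and 145.0, so the aggregate for the window is 40.0.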
The outcome of raw data stream aggregation is a stream of aggregates calculated for a given (user-defined) aggregate time window. Figure 1 shows that the number of elements in the aggregates stream is significantly smaller than the number of elements in the raw data stream, i.e. the raw data stream is reduced. The wider the aggregate time window, the fewer elements in the resulting aggregates stream.
Processing raw data streams and calculating aggregates is very time-intensive, especially when a few streams are to be processed at the same time. The Materialized Aggregate List can be applied for creating and processing aggregates streams. A single item stored in MAL is an aggregate A. The aggregate consists of a timestamp TS and a set of values $V_A = \{V_i\}$ ($A = [TS, V_A]$). Every element $V_i \in V_A$ is of a defined type $t_{V_i}$ (integer or floating point number). MAL bases its operation on the concepts of list and iterator. Mechanisms implemented in the list allow parallel generation and optional materialization of the calculated aggregates. Iterators are used for browsing the generated data and communicating with the list. Aggregates created by the list are stored in a static table located in the iterator. The table is a set of logical parts called pages. Each page stores the same number of aggregates. The logical division of the table into pages is used when the table is being filled with aggregates. In [1] we presented and compared three multithreaded page-filling algorithms. Static iterator tables coupled with multithreaded algorithms allow efficient processing of data streams of any length without imposing memory limitations. MAL also supports a materialization mechanism which significantly speeds up the process of aggregates recreation. The list can be used as a tool for retrieving aggregates directly from a database (e.g. for a single telemetric object) and also as a component of indexing structure nodes (the aggregates are created based on the data from lower levels). MAL can work in one of the following aggregate-creating modes:
1. In the first mode MAL creates aggregates by retrieving data directly from a database. This mode is used when MAL processes the data of a single object (a single telemetric meter).
2. In the second mode MAL also creates aggregates by retrieving data from a database, but the aggregates are created for more than one object. This mode finds its application when MAL manages the aggregates of a node located on the lowest level of the indexing structure. The objects for which the MAL creates aggregates are appropriately marked and their readings are aggregated when MAL creates aggregates for the node.
3. In the third mode MAL creates aggregates based on the aggregates of other MALs. This mode is used when MAL is a component of a node located on an intermediate level of the indexing structure.
3 Managing Aggregates Streams in Spatial Aggregating Index
The basic functionality of the presented stream data warehouse system is to provide an answer to a range query. A range query is defined as a set of regions R encompassing one or more utility meters. The answer generated by the system consists of two parts. The first part contains information about the number of various kinds of meters located in the query region. The second part is a merged stream (or streams) of aggregated readings coming from the meters encompassed by the query region. There are many spatial indexes which can answer any region query. The best known and most popular are indexes based on the R-Tree [2] spatial index. The main idea behind the R-Tree index is to use a hierarchical indexing structure in which the regions of index nodes on the higher levels encompass the regions of the nodes on the lower levels of the index. A detailed review of spatial indexing structures based on the R-Tree index can be found in [3]. To calculate the second part of the answer we need to apply an aggregating spatial index whose nodes store aggregates concerning the objects located in the nodes' regions. The first proposed solution to this problem was the aR-Tree [4] (aggregation R-Tree). aR-Tree index nodes located on the higher levels of the hierarchy store the number of objects in the nodes located on the lower levels of the hierarchy. The functionality of this solution can be easily extended by adding any kind of aggregated information stored in the nodes. The idea of partial aggregates was used in the solution presented in [5], where the authors use it for creating a spatio-temporal index. Most of the solutions presented in the literature assume that the size of the aggregated data is known and small enough to be stored in the main computer memory. In the case of stream data such an assumption cannot be made. In this paper we focus on the problems resulting from integrating MAL into the nodes of a spatial aggregating index. We show the algorithm we use to make the resulting solution efficient and scalable.

3.1 Aggregates Streams Merging
When MAL is used in an aggregating spatial index, the lists stored in upper level nodes merge the aggregates streams created by the lower level nodes. Prior to defining the details of the streams merging operation, we need to define the requirements which must be satisfied to add two aggregates. Each aggregate has a timestamp and a list of values. The length of the values list is not limited and can be adapted to system requirements. The aggregate type is defined by the cardinality of the $V_A$ list and the types of the elements of $V_A$. Two aggregates are of equal type if their values lists are of equal cardinality and the values located under the same indexes are of the same type $t_V$. Two aggregates can be added if and only if they have equal timestamps and they are of equal type. The result of adding aggregates is an aggregate whose timestamp equals the timestamps of the added aggregates and whose values list is created by adding the values of the added aggregates (see fig. 2). The aggregates streams merging operation merges two or more aggregates streams, creating one aggregates stream. The merging operation is performed by adding aggregates with equal timestamps. For aggregates streams merging there is a special moment in time when the operation begins. All merged streams must contain all aggregates used during the adding operation. The merging operation cannot be performed for streams in which some required aggregates are missing.
Fig. 2. Example of aggregates adding and aggregates streams merging. Two streams $S_i$ and $S_j$ are merged into one resulting stream $S_k$. In the depicted example, for the timestamps 2007-01-01 00:30 and 2007-01-01 01:00, the values 23.345, 3.2, 10.7 and 4.05 of $S_i$ and the values 8.13, 0.5, 6.27 and 0.24 of $S_j$ add up to 31.475, 3.7, 16.97 and 4.29 in $S_k$.
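The adding rule and the merging operation defined above can be expressed as a small Python sketch. It is an illustration only: the names `Aggregate`, `same_type`, `add` and `merge` are hypothetical, the streams are assumed to start at the same timestamp and to contain no gaps, and the sample values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Aggregate:
    ts: str          # timestamp TS
    values: tuple    # values list V_A


def same_type(a, b):
    """Equal cardinality and equal value types under the same indexes."""
    return (len(a.values) == len(b.values)
            and all(type(x) is type(y) for x, y in zip(a.values, b.values)))


def add(a, b):
    """Add two aggregates; allowed only for equal timestamps and equal types."""
    if a.ts != b.ts or not same_type(a, b):
        raise ValueError("aggregates cannot be added")
    return Aggregate(a.ts, tuple(x + y for x, y in zip(a.values, b.values)))


def merge(*streams):
    """Lazily merge aggregates streams that start at the same timestamp."""
    for group in zip(*streams):
        result = group[0]
        for aggregate in group[1:]:
            result = add(result, aggregate)
        yield result


s_i = [Aggregate("2007-01-01 00:30", (10, 0.5)),
       Aggregate("2007-01-01 01:00", (4, 0.25))]
s_j = [Aggregate("2007-01-01 00:30", (3, 0.25)),
       Aggregate("2007-01-01 01:00", (6, 0.5))]
s_k = list(merge(s_i, s_j))
print(s_k)
```

Because `merge` is a generator, it never needs a whole stream in memory, which matches the requirement that streams have no defined end.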
3.2 Query Answering and Memory Management
MAL iterator uses a static table for browsing and managing aggregate stream. All tables used by the system are stored in form of a resource pool. It is possible for an external process to use MAL iterator only if there is at least one free table in the pool. Defining the number of tables stored in the pool one can very easily control the amount of memory consumed by the part of system responsible for processing data streams. In most cases the number of MALs involved in answer stream generation process is significantly greater than the number of available tables. To make the process of answer stream generation efficient we defined an algorithm that assigns the available tables to appropriately chosen MALs. Finding Nodes Answering the Query. For every window from the query region set the indexing structure is recursively browsed in order to find the nodes which provides answer to the query. The browsing starts from the tree root and proceeds towards tree leaves. The tree browsing algorithm checks the relation of the window region O and the node region N . The node is skipped if its region share no part with the window region (O ∩ N = ∅). If the window region entirely encompasses the node region (O ∩ N = N ), the node is added to F AN set (Full Access Nodes – nodes which entirely participates in answer stream generation with all theirs aggregates). The last case is when a window region and node region share a part (O ∩ N = O ). The algorithm performs a recursive call to lower structure levels, passing parameter O as an aggregate window. When traversing to lower tree levels it is possible that the algorithm reaches the node on the lowest hierarchy level. In this case, the algorithm executes a query searching for encompassed objects. A set of found objects is marked with a letter M . The objects in the M set are marked with a so called query mark which is then used when the answer stream is being generated. Sorting Elements Answering the Query. 
The process of finding nodes and single objects involved in query answer stream generation creates two sets: F AN and M . The first operation executed before iterator tables assignment is to check the M set if it contains all elements encompassed by some lowest level node. Such a scenario may occur when the query contains many small windows of which
none encompasses the entire node region, but merging a few windows results in a window whose region encompasses the entire node region. If such windows are found, then the single objects encompassed by those windows are removed from the M set, and the node whose region is encompassed by the merged window region is added to the FAN set. An example can be found in Figure 3.
[Figure 3 graphic: the query regions laid over the region of node E, shown alongside the contents of the M set and the FAN set.]
Fig. 3. An example showing a few query windows which, when merged, encompass the entire node region. The region of node E encompasses objects O1–O7. There are three windows which, when merged, entirely encompass the region of node E. The objects O1–O7 are removed from the M set, and the node E is added to the FAN set.
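The node search and the merge check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' Java implementation: regions are axis-aligned rectangles, objects are points, and the names `Node`, `common_part`, `find_answering` and `promote_covered_leaves` are hypothetical. The merge check is simplified to "the M set already holds every object of a lowest-level node".

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    region: tuple                                  # (xmin, ymin, xmax, ymax)
    children: list = field(default_factory=list)   # lower-level nodes
    objects: dict = field(default_factory=dict)    # lowest level only: name -> (x, y)


def common_part(a, b):
    """Return the rectangle shared by a and b, or None if they are disjoint."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return (x1, y1, x2, y2) if x1 <= x2 and y1 <= y2 else None


def find_answering(node, window, fan, m):
    """Classify one node against one window and update the FAN and M sets."""
    shared = common_part(window, node.region)
    if shared is None:                 # O ∩ N = ∅: skip the node
        return
    if shared == node.region:          # O ∩ N = N: full access node
        fan.add(node.name)
        return
    if node.children:                  # partial overlap: recurse with the common part
        for child in node.children:
            find_answering(child, shared, fan, m)
    else:                              # lowest level: collect single objects
        for name, (x, y) in node.objects.items():
            if shared[0] <= x <= shared[2] and shared[1] <= y <= shared[3]:
                m.add(name)


def promote_covered_leaves(leaves, fan, m):
    """Move a leaf to FAN when M already contains all of its objects."""
    for leaf in leaves:
        names = set(leaf.objects)
        if names and leaf.name not in fan and names <= m:
            m.difference_update(names)
            fan.add(leaf.name)


# Hypothetical index: three narrow windows that together cover leaf E.
leaf_e = Node("E", (0, 0, 10, 10), objects={"O1": (2, 2), "O2": (5, 5), "O3": (8, 8)})
leaf_f = Node("F", (11, 0, 20, 10), objects={"O8": (15, 5)})
root = Node("root", (0, 0, 20, 10), children=[leaf_e, leaf_f])

fan, m = set(), set()
for window in [(0, 0, 4, 10), (4, 0, 7, 10), (7, 0, 10, 10)]:
    find_answering(root, window, fan, m)
before = (set(fan), set(m))
promote_covered_leaves([leaf_e, leaf_f], fan, m)
```

In this toy run no single window encompasses leaf E, so all three objects first land in M; the promotion step then replaces them with the node E in FAN, mirroring the situation of Figure 3.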
In the next step the nodes in the FAN set are sorted according to the following criteria:
1. The amount of materialized data available for a given node. The more materialized data, the faster a stream is recreated and the higher the position of the node. Only the materialized data that can be used during answer stream generation are taken into account.
2. The possibility of materializing the generated stream. One of the data warehouse operating parameters is the border level, starting from which the streams generated by the nodes are materialized. For the level set to 0, all streams generated by the nodes are materialized. For the level set to -1, not only the streams of nodes are materialized, but also the streams generated by single objects. The materialization possibility is a binary criterion: a node either has it or does not. A node with the materialization possibility is always placed higher than a node whose stream will not be materialized.
3. The number of objects encompassed by the node's region. The more encompassed objects, the higher the position of the node.

The elements of the M set are sorted in the same way as the elements of the FAN set. In the case of the M set, the only sorting criterion is the amount of available materialized data which can be used during answer stream generation.

Assigning Iterator Tables. After sorting the FAN and M sets the algorithm starts the table assignment process. Let P denote the iterator table pool, and |P| the number of tables available in the pool. If ((|FAN| = 1 and |M| = 0) or
(|FAN| = 0 and |M| = 1)) (the answer is a single stream generated by a node or a single object), then one table is taken from the pool and assigned to the element generating the answer stream. The condition that must be satisfied is |P| ≥ 1 (there is at least one table in the pool). In the other case (the answer consists of more than one stream) the algorithm requires an additional structure called GCE (Global Collecting Element). The GCE element is of type MAL and it is used for merging the streams generated by other elements (nodes and objects). The GCE element requires one iterator table. If, after assigning a table to the GCE element, there are some free tables left in the pool, they are assigned to the other elements involved in the answer stream generation process. The tables are assigned first to the elements of the FAN set (according to the order set by the sorting operation). Then, if there are still some free tables, they are assigned to single objects from the M set (also in the sorted order). A few table assignment scenarios are possible:
– |P| ≥ |FAN| + |M| – every element of FAN and M is assigned a separate table,
– |P| ≥ |FAN| and |P| < |FAN| + |M| – every element of FAN is assigned a separate table, and the streams of some objects of M are merged at the creation stage,
– |P| < |FAN| – all streams of the elements from the M set and some streams of the elements from the FAN set are merged into one stream during the process of aggregates creation.

A stream can be materialized only if its generating element is assigned a separate iterator table. If a stream of element aggregates is merged with the streams of other elements at an early generation stage, the materialization is not performed, because it cannot be explicitly determined for which element the stream was generated. In general, partial answer streams, for example streams merged in the GCE, are not materialized. The stream generated by the GCE is not materialized either, because it changes with every single query.
If the system materialized every indirect stream, the amount of materialized data would grow very fast. Figure 4 presents an example of the table assignment algorithm in operation. In this example the number of tables in the pool is 4. The answer stream generating elements are: nodes 4, 6 and 7 (FAN set) and objects a, b and c (M set). The elements were sorted into the order presented in the params table (GCE element). Because |FAN| + |M| > 1, we need to reserve one iterator table for the GCE element. The remaining tables are assigned to node 7, to node 4, and to the merged elements: node 6 and the single objects a, b and c. If the materialization level is ≤ 0, the streams generated by nodes 7 and 4 will be materialized.
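The sorting and table assignment steps can be illustrated with a short Python sketch. It is a minimal model under several assumptions that the text does not fix: the three FAN criteria are applied in the listed order as a lexicographic key, the last free table is shared by all elements that do not get a table of their own (which reproduces the Figure 4 example), and the names `Element`, `sort_fan`, `sort_m` and `assign_tables` as well as the materialized-data amounts in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    name: str
    materialized: int = 0          # amount of reusable materialized data
    can_materialize: bool = False  # stream would be materialized at the current border level
    n_objects: int = 1             # number of objects encompassed


def sort_fan(fan):
    """Order FAN nodes by criteria 1-3 (assumed lexicographic priority)."""
    return sorted(fan, key=lambda e: (-e.materialized, not e.can_materialize, -e.n_objects))


def sort_m(m):
    """Order single objects by the amount of reusable materialized data."""
    return sorted(m, key=lambda e: -e.materialized)


def assign_tables(fan, m, pool_size):
    """Return one list of element names per iterator table used."""
    ordered = sort_fan(fan) + sort_m(m)
    if not ordered:
        return []
    if len(ordered) == 1:                      # single stream: no GCE needed
        if pool_size < 1:
            raise ValueError("at least one table is required")
        return [[ordered[0].name]]
    free = pool_size - 1                       # one table is reserved for the GCE
    if free < 1:
        raise ValueError("assumption: a merged answer needs at least two tables")
    if free >= len(ordered):                   # every element gets its own table
        groups = [[e.name] for e in ordered]
    else:                                      # the tail shares the last free table
        groups = [[e.name] for e in ordered[:free - 1]]
        groups.append([e.name for e in ordered[free - 1:]])
    return [["GCE"]] + groups


# Figure 4 scenario: pool of 4 tables, nodes 4, 6, 7 and objects a, b, c.
fan = [Element("4", 20, True, 25), Element("6", 10, True, 25), Element("7", 30, True, 25)]
m = [Element("a", 3), Element("b", 2), Element("c", 1)]
figure4 = assign_tables(fan, m, 4)
```

With these hypothetical amounts the sorted order is 7, 4, 6, a, b, c, so nodes 7 and 4 receive separate tables and the remaining elements are merged on the last one. Only elements that appear alone in a group would be candidates for materialization.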
4 Test Results
In this section we present experimental test results. We implemented the presented solution in Java. For the tests we used a machine equipped with an Intel Core 2 T7200 processor, 2 GB of RAM and a 100 GB, 7200 rpm HDD. The software environment was Windows XP, Sun Java 1.5 and an Oracle 9i database. The sorting algorithm used for the FAN and M sets has guaranteed n log(n) performance.
[Figure 4 graphic: the params table of the GCE element listing the sorted elements, with MAL1, MAL2 and MAL3 feeding the merged answer stream.]
Fig. 4. Operation of the algorithm assigning iterator tables to answer stream generating nodes and objects. Elements are stored in the params table of the GCE element.
The test model contained 1000 meters. Every meter generated a reading every 30 to 60 minutes. The aggregation window width was 120 minutes. The database contained readings from a period of one year (over 12 million readings). We tested the system by first defining a set of query windows and then calculating the answer stream. The generated stream was browsed without processing the aggregates. The times presented in the charts are sums of the answer calculation and stream browsing times. In all tested cases the answer calculation time was negligibly small compared to the stream browsing time. The measured times concerned answer stream generation and browsing for a query encompassing 100 objects. In order to investigate various ways of defining a query we distinguished two query variants:
1. a query encompassing 2 index nodes (each node encompassed 25 objects) and 50 single objects (marked as 2*25+50),
2. a query encompassing 3 index nodes and 25 single objects (marked as 3*25+25).

Figure 5 shows the answer stream generation times for the above queries for the first run (no materialized data) and the second run (materialized data available). We can observe that stream materialization has a strong positive influence on the query answering process. The chart also shows that the stream browsing time depends on the number of MALs used for answer stream generation. The fewer MALs are used, the shorter the stream browsing times during the first run. On the other hand, even though the first run times slightly increase with the number of MALs, during the second run the times decrease significantly (thanks to materialization). The second investigated aspect concerned system scalability. Figure 6 presents the answer generation and stream browsing times for various numbers of objects encompassed by the query. The experiments show that the stream browsing time depends linearly on the number of encompassed objects and on the aggregation period length.
Thanks to the linear dependency, knowing the number of objects and the aggregation period length, we can easily predict the answer stream generation time.
[Figure 5 chart: time [s] against the number of available iterator tables (2, 5, 15, 30) for the queries 2*25+50 and 3*25+25, first and second run.]
[Figure 6 chart: time [s] against the number of encompassed elements (25 to 200), first and second run.]
Fig. 5. Answer calculation times for various number of available iterator tables
Fig. 6. Answer calculation time as a function of the number of elements
5 Conclusions and Future Plans
In this paper we presented a continuation of our research on efficient stream data processing. We integrated a previously presented solution with a spatial aggregating index and provided a memory managing algorithm. The algorithm, using a set of criteria, sorts all the elements involved in the answer stream calculation process and assigns the available iterator tables. The most important criterion is the amount of available materialized data that can be used in the answer stream generation process. The presented test results show that increasing the number of encompassed objects and extending the aggregation period length results in linear growth of the answer stream calculation time. In the near future we want to verify the necessity of the materialization process. We suppose that leaving some of the streams generated by the lowest-level nodes unmaterialized will not harm the answer generation time but will significantly reduce the amount of materialized data stored in the database.
References
1. Gorawski, M., Malczok, R.: Multi-thread Processing of Long Aggregates Lists. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 59–66. Springer, Heidelberg (2006)
2. Guttman, A.: R-trees: a dynamic index structure for spatial searching. In: Proceedings of the SIGMOD Conference, Boston, MA, June 1984, pp. 47–57 (1984)
3. Manolopoulos, Y., Nanopoulos, A., Papadopoulos, A.N., Theodoridis, Y.: R-trees: Theory and Applications. Springer, Heidelberg (2005)
4. Papadias, D., Kalnis, P., Zhang, J., Tao, Y.: APN 2001. LNCS. Springer, Heidelberg (2001)
5. You, B., Lee, D., Eo, S., Lee, J., Bae, H.: Hybrid Index for Spatio-temporal OLAP operations. In: Proceedings of the ADVIS Conference, Izmir, Turkey (2006)
On Parallel Generation of Partial Derangements, Derangements and Permutations
Zbigniew Kokosiński
Cracow University of Technology, Faculty of Electrical and Computer Eng., ul. Warszawska 24, 31-155 Kraków, Poland
[email protected]
Abstract. The concept of a partial derangement is introduced and a versatile representation of partial derangements is proposed with permutations and derangements as special cases. The representation is derived from a representation of permutations by iterative decomposition of the symmetric permutation group Sn into cosets. New algorithms are proposed for generation of partial set derangements in t. The control sequences produced by the generation algorithms appear either in lexicographic or reverse lexicographic order, while the output sequences representing partial derangements are obtained from the control sequences in corresponding linear orders. A parallel hardware implementation of the generator of partial derangements is described.
Keywords: derangement, partial derangement, permutation, derangement generation, permutation generation.
1 Introduction
Combinatorial generation is one of the basic problems in computer science [11,17]. Combinatorial objects are involved as test or problem instances in numerous important application areas. A great many generation algorithms have been developed for combinatorial objects such as n-tuples, combinations, permutations, numerical and set partitions, trees, graphs etc. The focus of this paper is on generation of important classes of n-permutations with a forbidden set of k constant points, 0 ≤ k ≤ n. One specific class, permutations with no constant points (no 1-cycles), are derangements. Combinatorial properties of derangements are described in depth in [4,6]. Several methods for generation of all set derangements, sequentially or in the parallel linear array model, are published in the literature [1,2,3,5,14]. In the present paper a new concept of a partial derangement is introduced and a versatile representation of partial derangements is proposed with permutations and derangements as special cases. No generation algorithm for such a class of combinatorial objects is known so far. The representation of partial derangements is derived from a representation of permutations by iterative decomposition of the symmetric permutation group Sn into cosets [12]. Some particular properties of partial derangements as permutations are also established.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 219–228, 2008. © Springer-Verlag Berlin Heidelberg 2008
The algorithms for generation of partial set derangements are developed in the parallel counter model. The control sequences are produced in O(1) average time per generated object. The output sequences are then obtained from the control sequences in O(n) time. Following [12,13] we propose a parallel hardware implementation of the generation algorithms with the help of a cellular permutation array that makes the generation process time efficient. The representation and some properties of permutations and partial derangements are described in section 2. In the next section two generation algorithms are presented. Finally, a hardware implementation of the generator of partial derangements is described in detail in section 4. The last section contains concluding remarks.
2 Representation of Partial Derangements
The representation of permutations is derived from an iterative decomposition of the symmetric permutation group $S_n$ into cosets [12]. In this representation set permutations are described by integer sequences called choice functions of indexed families of sets. Let us first introduce the two coset representations of permutations.

Let $\langle A_i \rangle_{i \in I}$ denote an indexed family of sets $A_i = A$, where $A = I = \{1, \ldots, n\}$. Any mapping $f$ which "chooses" one element from each set $A_1, \ldots, A_n$ is called a choice function (or a system of representatives, or a transversal) of the family $\langle A_i \rangle_{i \in I}$. If for every $i \neq j$ the supplementary condition $a_i \neq a_j$ is satisfied, then any choice function $\alpha = \langle a_i \rangle_{i \in I}$ that belongs to the indexed family $\langle A_i \rangle_{i \in I}$ is called an n-permutation of the set $A$. The set of all such choice functions represents the set of all permutations of the n-element set. Let us denote any permutation $\pi$ of the n-element set $A = \{1, \ldots, n\}$ by the sequence $\langle \pi(1), \pi(2), \ldots, \pi(n) \rangle$. The set of all permutations of $A$ is called the symmetric group $S_n$.

Theorem 1. Let $\langle P_i \rangle_{i \in I}$ be an indexed family of sets $P_i \subseteq A$, where $P_i = \{1, \ldots, i\}$. Any choice function $\alpha = \langle p_i \rangle_{i \in I}$ that belongs to the Cartesian product $\times_{i \in I} P_i$ represents a permutation of $A$.

Proof. In [7] Hall and Paige have proposed two partitions of the symmetric group $S_n$ on the finite set $A = \{1, \ldots, n\}$ into right and left cosets of $S_{n-1}$ in $S_n$:

$$S_n = S_{n-1}\tau_n^1 + S_{n-1}\tau_n^2 + \ldots + S_{n-1}\tau_n^n \tag{1}$$

$$S_n = \tau_n^1 S_{n-1} + \tau_n^2 S_{n-1} + \ldots + \tau_n^n S_{n-1} \tag{2}$$

where $+$ denotes the union of disjoint sets and $\tau_i^j$ denotes the transposition $(i\,j)$ (in particular $\tau_i^i$ is the identity permutation). The complete iterative decomposition of $S_n$ into right cosets resulting from equation (1) is given below:

$$\begin{aligned}
S_n &= S_{n-1}\tau_n^1 + S_{n-1}\tau_n^2 + \ldots + S_{n-1}\tau_n^n \\
S_{n-1} &= S_{n-2}\tau_{n-1}^1 + S_{n-2}\tau_{n-1}^2 + \ldots + S_{n-2}\tau_{n-1}^{n-1} \\
&\;\;\vdots \\
S_3 &= S_2\tau_3^1 + S_2\tau_3^2 + S_2\tau_3^3 \\
S_2 &= \{\tau_2^1, \tau_2^2\}
\end{aligned} \tag{3}$$

In a similar way the complete iterative decomposition of $S_n$ into left cosets is obtained from (2):

$$\begin{aligned}
S_n &= \tau_n^1 S_{n-1} + \tau_n^2 S_{n-1} + \ldots + \tau_n^n S_{n-1} \\
S_{n-1} &= \tau_{n-1}^1 S_{n-2} + \tau_{n-1}^2 S_{n-2} + \ldots + \tau_{n-1}^{n-1} S_{n-2} \\
&\;\;\vdots \\
S_3 &= \tau_3^1 S_2 + \tau_3^2 S_2 + \tau_3^3 S_2 \\
S_2 &= \{\tau_2^1, \tau_2^2\}
\end{aligned} \tag{4}$$

Any set $P_i = \{\tau_i^1, \ldots, \tau_i^j, \ldots, \tau_i^i\}$, for $i \in I = \{1, \ldots, n\}$, is a system of representatives of right cosets (left cosets) and is called the complete right (left) transversal of $S_{i-1}$ in $S_i$. Moreover, $|P_i| = i$ and $P_k \cap P_l = \emptyset$ for every $k \neq l$, $k, l \in I$. By substituting $\tau_i^j = j$, we obtain $P_i = \{1, \ldots, i\}$. Depending on the decomposition scheme, any choice function $\alpha = \langle p_i \rangle_{i \in I}$ belonging to the Cartesian product $\times_{i \in I} P_i$ corresponds to one of the two sequences $\pi_1 = \langle \pi_1(1), \ldots, \pi_1(n) \rangle$ or $\pi_2 = \langle \pi_2(1), \ldots, \pi_2(n) \rangle$. The sequences $\pi_1$ and $\pi_2$ are obtained by performing on the elements of the set $\{1, \ldots, n\}$ the transposition sequence $\tau_1^{p_1}, \tau_2^{p_2}, \ldots, \tau_n^{p_n}$ (according to the decomposition scheme (1)) or $\tau_n^{p_n}, \ldots, \tau_2^{p_2}, \tau_1^{p_1}$ (according to the decomposition scheme (2)), respectively. The element $\tau_1^{p_1}$, which denotes the identity transposition, may be omitted.

Exemplary sequences $\pi_1 = \langle \pi_1(1), \ldots, \pi_1(n) \rangle$ and $\pi_2 = \langle \pi_2(1), \ldots, \pi_2(n) \rangle$ are represented by the choice functions $\alpha$ in lexicographic and reverse lexicographic orders, respectively, as shown for $n = 4$ in Tables 1 and 2. Let us now define permutations with forbidden positions, partial derangements and derangements.

Definition 1. A permutation $\pi$ of the n-element set $A = \{1, \ldots, n\}$ with a forbidden position $i$ is the sequence $\langle \pi(1), \pi(2), \ldots, \pi(n) \rangle$, where $\pi(i) \neq i$, for some $1 \le i \le n$.

Lemma 1. Let $\langle P_i \rangle_{i \in I}$ be an indexed family of sets $P_i \subseteq A$, where $P_i = \{1, \ldots, i\}$, $1 \le i < n - 1$, and $P_n = P_{n-1}$. Any choice function $\alpha = \langle p_i \rangle_{i \in I}$ that belongs to the Cartesian product $\times_{i \in I} P_i$ represents a permutation of $A$ with a forbidden position $i$ if and only if

$$(p(i) \neq i) \vee [(p(i) = i) \Rightarrow \exists j : (i < j \le n) \wedge p(j) = i]. \tag{5}$$
Proof. There are two cases to be considered. 1. If $p(i) \neq i$ then $p(i) = j$, $j < i$, and the transposition $(i\,j)$ is performed. Hence, $i$ is not a fixed point in the permutation $\pi$. 2. If $p(i) = i$ then the identity transposition $(i\,i)$ is performed. If no other transposition $(j\,i)$ exists, $i < j \le n$, then $i$ is a fixed point in the permutation $\pi$.
Definition 2. Any n-permutation with $k$ forbidden positions $i \neq \pi(i)$, $0 \le k \le n$, $1 \le i \le n$, is called a partial derangement $\delta(n, k)$ and the set of all forbidden positions is called the forbidden set $F$.

Theorem 2. Let $\langle D_i \rangle_{i \in I}$ be an indexed family of sets $D_i \subseteq A$, where $D_i = P_i$, $1 \le i < n$, and let $F = \{f_1, f_2, \ldots, f_k\}$, $0 \le k \le n$, be the forbidden set. Any choice function $\delta(n, k) = \langle d_i \rangle_{i \in I}$ that belongs to the Cartesian product $\times_{i \in I} D_i$ represents a partial derangement of $A$, i.e. an n-permutation with $k$ forbidden positions from the set $F$, if and only if

$$\forall i \in F : (d(i) \neq i) \vee [(d(i) = i) \Rightarrow \exists j : (i < j \le n) \wedge d(j) = i]. \tag{6}$$

Proof. The proof follows directly from Lemma 1.

There are two special cases: 1. if $k = 0$ and $F = \emptyset$ then we have a permutation $\alpha = \delta(n, 0)$; 2. if $k = n$ and $F = \{1, 2, \ldots, n\} = A$ then we have a derangement $\delta(n, n)$. Otherwise, for $k \notin \{0, n\}$, we have a proper partial derangement $\delta(n, k)$.

Definition 3. Any permutation with $k$ forbidden positions $i \neq \pi(i)$, $1 \le i \le k < n$, is called a prefix partial derangement $\delta_p(n, k)$ with $F = \{1, \ldots, k\}$.

Definition 4. Any permutation with $k$ forbidden positions $i \neq \pi(i)$, $1 < k < n$, $n - k + 1 \le i \le n$, is called a suffix partial derangement $\delta_s(n, k)$ with $F = \{n - k + 1, \ldots, n\}$.

Lemma 2. The number of partial derangements $PD(n, k)$, partial prefix derangements $PD_p(n, k)$ and partial suffix derangements $PD_s(n, k)$ is given by

$$PD(n, k) = PD_p(n, k) = PD_s(n, k) = D(n) + \sum_{i=1}^{n-k} \binom{n-k}{i} D(n - i), \tag{7}$$

where $D(i) = PD(i, i)$ denotes the number of derangements of an i-element set.

Proof. A partial derangement leaves exactly $i$ of the $n - k$ non-forbidden positions fixed, for some $0 \le i \le n - k$, and deranges the remaining $n - i$ positions. Summing over $i$ and over the choice of the fixed positions gives (7).

It is well known that $\lim_{n \to \infty} D(n)/P(n) = e^{-1}$ [4]. Therefore, $1 > \lim_{n \to \infty} PD(n, k)/P(n) > e^{-1}$.

The expected number of fixed points in an element $\delta(n, 0)$ is 1 [4]. Since the set of derangements $\delta(n, n)$ has no constant points, the average number of fixed points in permutations $\delta(n, k)$, $0 \le k < n$, is $\lim_{n \to \infty} \frac{P(n)}{P(n) - D(n)} = \frac{e}{e - 1} \cong 1.582$.
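The representation can be checked by brute force for small n. The Python sketch below decodes a choice function into the two permutations, tests the fixed-point condition of Theorem 2 directly on the choice function, and counts partial derangements. The function names are hypothetical, and the swap convention in `decode` (exchange values for π1, exchange positions for π2) is an assumption chosen to agree with the first rows of Table 1; the two results are always inverse permutations of each other, so they share the same fixed points.

```python
from itertools import product
from math import comb, factorial


def decode(alpha):
    """Map a choice function (p_1, ..., p_n), 1 <= p_i <= i, to (pi1, pi2)."""
    n = len(alpha)
    pi1 = list(range(1, n + 1))
    pi2 = list(range(1, n + 1))
    for i, p in enumerate(alpha, start=1):
        if p != i:
            a, b = pi1.index(i), pi1.index(p)      # exchange the values i and p
            pi1[a], pi1[b] = pi1[b], pi1[a]
            pi2[i - 1], pi2[p - 1] = pi2[p - 1], pi2[i - 1]  # exchange positions i and p
    return tuple(pi1), tuple(pi2)


def satisfies(alpha, forbidden):
    """Condition (6): every forbidden position i has p_i != i or some p_j = i with j > i."""
    n = len(alpha)
    return all(
        alpha[i - 1] != i or any(alpha[j - 1] == i for j in range(i + 1, n + 1))
        for i in forbidden
    )


def choice_functions(n):
    """All n! choice functions in lexicographic order."""
    return product(*(range(1, i + 1) for i in range(1, n + 1)))


def derangements(n):
    """D(n) from the recurrence D(n) = (n - 1) * (D(n - 1) + D(n - 2))."""
    d = [1, 0]
    for m in range(2, n + 1):
        d.append((m - 1) * (d[m - 1] + d[m - 2]))
    return d[n]


def count_by_search(n, forbidden):
    return sum(1 for alpha in choice_functions(n) if satisfies(alpha, forbidden))


def count_by_formula(n, k):
    """Choose which of the n - k free positions stay fixed, derange the rest."""
    return sum(comb(n - k, i) * derangements(n - i) for i in range(n - k + 1))


counts = [count_by_search(4, f) for f in (set(), {1, 2}, {3, 4}, {1, 3}, {2, 4}, {1, 2, 3, 4})]
ratio = factorial(12) / (factorial(12) - derangements(12))   # close to e / (e - 1)
```

For n = 4 the search finds 24 permutations, 14 partial derangements for each of the four two-element forbidden sets, and 9 derangements, which matches the column lengths of Tables 1 and 2.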
3 The Generation Algorithms
In [15] the authors invented a derangement generation algorithm that interchanges at most four elements, with roughly 2.980 elements exchanged on average. The first Gray code and a constant average time generation algorithm are described in [3]. In this algorithm the next derangement is obtained from its
predecessor by one or two transpositions or a rotation of three elements. It can also be used for generation of some classes of partial derangements. Our generation methods are much simpler than the above-mentioned solutions. The algorithms PDGENLEX and PDGENREVLEX are shown in Figures 1 and 2. The control sequences are produced by the algorithms in lexicographic or reverse lexicographic order, respectively, and the corresponding partial derangements appear in one of the two linear orders depending on the decomposition scheme used.

Input: n – size of the set, k – the number of forbidden positions in partial derangements, F – the forbidden set.
Output: Table DA with the consecutive partial derangements.
Method: The first function α in the table PD is obtained in step 1. The generation method is based on a counting process in the table PD. In step 4.3 the output is produced if condition (6) is fulfilled. Computations run until the last choice function α is generated.

/1-2 initialization phase/
1. for i:=1 to n do PD[i]:=I[i]:=1; MAX[i]:=i+1;
2. i:=n;
3. output(PD);
4. repeat
   4.1. PD[i]:=(PD[i]+1) mod MAX[i];
   4.2. if PD[i]=0 then
        4.2.1. repeat
               4.2.1.1. PD[i]:=PD[i]+1;
               4.2.1.2. i:=i-1;
               4.2.1.3. PD[i]:=(PD[i]+1) mod MAX[i];
               until PD[i] ≠ 0;
        4.2.2. i:=n;
   4.3. if condition (6) is satisfied then output(PD);
   until MAX-PD=I;
output(PD)
/conversion and output/
1. case of decomposition scheme
   left coset: permute the set DA={1, ... , n} according to transpositions in PD;
   right coset: permute the set DA={1, ... , n} according to transpositions in PD;
2. output DA;

Fig. 1. The algorithm PDGENLEX
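The counting process of Figure 1 can be mimicked in software with a mixed-radix counter. The Python sketch below is not a transcription of the pseudocode: it replaces the mod-based wrap-around with an explicit carry loop, and the names `control_sequences`, `satisfies` and `pdgen` are hypothetical. The `reverse` flag switches the fastest-varying digit from position n to position 2, which is the only difference between the lexicographic and the reverse lexicographic variant.

```python
def control_sequences(n, reverse=False):
    """Yield all choice functions (p_1, ..., p_n) with 1 <= p_i <= i.

    reverse=False: position n varies fastest (lexicographic order).
    reverse=True:  position 2 varies fastest (reverse lexicographic order).
    """
    pd = [1] * n
    order = range(2, n + 1) if reverse else range(n, 1, -1)
    while True:
        yield tuple(pd)
        for i in order:
            if pd[i - 1] < i:      # digit i can still be incremented
                pd[i - 1] += 1
                break
            pd[i - 1] = 1          # digit i overflows: reset it and carry on
        else:
            return                 # every digit overflowed: last sequence was emitted


def satisfies(alpha, forbidden):
    """Condition (6) evaluated on a control sequence."""
    n = len(alpha)
    return all(
        alpha[i - 1] != i or any(alpha[j - 1] == i for j in range(i + 1, n + 1))
        for i in forbidden
    )


def pdgen(n, forbidden, reverse=False):
    """Yield only the control sequences that represent partial derangements."""
    for alpha in control_sequences(n, reverse):
        if satisfies(alpha, forbidden):
            yield alpha


lex_derangements = list(pdgen(4, {1, 2, 3, 4}))
revlex_prefix = list(control_sequences(4, reverse=True))[:7]
```

Like the original algorithm, the sketch visits all n! control sequences and filters them, so the cost per emitted object depends on how many sequences are rejected.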
Since condition (6) can be examined sequentially in O(kn) time and computing the next DA takes O(n) time, the overall complexity of the algorithm PDGENLEX is O(D(n)kn). In the parallel implementation described in Section 4, computing PD and DA in parallel takes O(1) and O(n) time, respectively, and condition (6) is examined in parallel in O(n) time with a very low constant factor. Although not optimal in the sense of asymptotic time and space complexity, the parallel hardware generator provides fast generation of derangements in roughly 2.718 steps (clock periods) per object on average, and of partial derangements at an even better rate. Given a constant size of the generator, producing the output derangements DA takes constant time per object. Both VLSI and FPGA implementations are feasible. Two exemplary permutation sequences π1 = <π1(1), . . . , π1(n)> and π2 = <π2(1), . . . , π2(n)> (represented by choice functions α for n = 4) generated by the algorithm PDGENLEX are shown in Table 1.

Table 1. Sequences of partial (n,k)-derangements generated by algorithm PDGENLEX (n=4, k=0,2,4)

No.  α=δ(4,0)  δ(4,4)       δp(4,2)  δs(4,2)  δ(4,2)   δ(4,2)   π1    π2
     F=∅       F={1,2,3,4}  F={1,2}  F={3,4}  F={1,3}  F={2,4}
 1   1111      1111         1111     1111     1111     1111     2341  4123
 2   1112      1112         1112     1112     1112     1112     4312  3421
 3   1113      1113         1113     1113     1113     1113     2413  3142
 4   1114      -            1114     -        1114     -        2314  3124
 5   1121      1121         1121     1121     1121     1121     3421  4312
 6   1122      1122         1122     1122     1122     1122     3142  2413
 7   1123      1123         1123     1123     1123     1123     4123  2341
 8   1124      -            1124     -        1124     -        3124  2314
 9   1131      -            1131     -        -        1131     4132  2431
10   1132      -            1132     -        -        1132     2431  4132
11   1133      1133         1133     1133     1133     1133     2143  2143
12   1134      -            1134     -        -        -        2134  2134
13   1211      -            -        1211     1211     -        3241  4213
14   1212      1212         1212     1212     1212     1212     3412  3412
15   1213      -            -        1213     1213     -        4213  3241
16   1214      -            -        -        1214     -        3214  3214
17   1221      1221         1221     1221     1221     1221     4321  4321
18   1222      -            -        1222     -        1222     1342  1423
19   1223      -            -        1223     -        1223     1423  1342
20   1224      -            -        -        -        -        1324  1324
21   1231      -            -        -        -        -        4231  4231
22   1232      -            -        -        -        1232     1432  1432
23   1233      -            -        1233     -        -        1243  1243
24   1234      -            -        -        -        -        1234  1234

Input: n – size of the set, k – the number of forbidden positions in partial derangements, F – the forbidden set.
Output: Table DA with the consecutive partial derangements.
Method: The first function α in the table PD is obtained in step 1. The generation method is based on a counting process in the table PD. In step 4.3 the output is produced if condition (6) is fulfilled. Computations run until the last choice function α is generated.

/1-2 initialization phase/
1. for i:=1 to n do PD[i]:=I[i]:=1; MAX[i]:=i+1;
2. i:=2;
3. output(PD);
4. repeat
   4.1. PD[i]:=(PD[i]+1) mod MAX[i];
   4.2. if PD[i]=0 then
        4.2.1. repeat
               4.2.1.1. PD[i]:=PD[i]+1;
               4.2.1.2. i:=i+1;
               4.2.1.3. PD[i]:=(PD[i]+1) mod MAX[i];
               until PD[i] ≠ 0;
        4.2.2. i:=2;
   4.3. if condition (6) is satisfied then output(PD);
   until MAX-PD=I;
output(PD)
/conversion and output/
1. case of decomposition scheme
   left coset: permute the set DA={1, ... , n} according to transpositions in PD;
   right coset: permute the set DA={1, ... , n} according to transpositions in PD;
2. output DA;

Fig. 2. The algorithm PDGENREVLEX
The computational complexity of the algorithm PDGENREVLEX is the same as that of PDGENLEX. Two exemplary permutation sequences π1 = <π1(1), . . . , π1(n)> and π2 = <π2(1), . . . , π2(n)> (represented by choice functions α for n = 4) generated by the algorithm PDGENREVLEX are shown in Table 2.

Table 2. Sequences of partial (n,k)-derangements generated by algorithm PDGENREVLEX (n=4, k=0,2,4)

No.  α=δ(4,0)  δ(4,4)       δp(4,2)  δs(4,2)  δ(4,2)   δ(4,2)   π1    π2
     F=∅       F={1,2,3,4}  F={1,2}  F={3,4}  F={1,3}  F={2,4}
 1   1111      1111         1111     1111     1111     1111     2341  4123
 2   1211      -            -        1211     1211     -        3241  4213
 3   1121      1121         1121     1121     1121     1121     3421  4312
 4   1221      1221         1221     1221     1221     1221     4321  4321
 5   1131      -            1131     -        -        1131     4132  2431
 6   1231      -            -        -        -        -        4231  4231
 7   1112      1112         1112     1112     1112     1112     4312  3421
 8   1212      1212         1212     1212     1212     1212     3412  3412
 9   1122      1122         1122     1122     1122     1122     3142  2413
10   1222      -            -        1222     -        1222     1342  1423
11   1132      -            1132     -        -        1132     2431  4132
12   1232      -            -        -        -        1232     1432  1432
13   1113      1113         1113     1113     1113     1113     2413  3142
14   1213      -            -        1213     1213     -        4213  3241
15   1123      1123         1123     1123     1123     1123     4123  2341
16   1223      -            -        1223     -        1223     1423  1342
17   1133      1133         1133     1133     1133     1133     2143  2143
18   1233      -            -        1233     -        -        1243  1243
19   1114      -            1114     -        1114     -        2314  3124
20   1214      -            -        -        1214     -        3214  3214
21   1124      -            1124     -        1124     -        3124  2314
22   1224      -            -        -        -        -        1324  1324
23   1134      -            1134     -        -        -        2134  2134
24   1234      -            -        -        -        -        1234  1234

4 A Parallel Hardware Implementation

The circuit described in this section can be used for generation of partial derangements in parallel processing systems [9]. For the hardware implementation the algorithm PDGENREVLEX has been selected. Both decomposition schemes (3) and (4) have been shown to be group-theoretic representations of triangular permutation networks [16]. The triangular permutation network and its control circuit (a complex parallel counter) are the main components of the hardware generator. The hardware complexity of the generator is $O(n^2)$, and the network propagation delay is $O(n)$. For practical applications the network size is limited and the propagation delay can be considered constant. The triangular permutation network is built of two-state cells (2-permuters) [12,13,16]. Each cell requires a separate control signal.

The control circuit is organized in the following way. With every ith column of the triangular network ($1 \le i \le n$) the ith ring counter is associated, with the initial state from the "1-out-of-i" code. All column counters form the parallel counter with $n!$ different states. The clock enable signal for the ith ring counter is a product of the carry signals (overflows) from all ring counters preceding it. An asynchronous setup of each ring counter and a global reset for all ring counters are provided. If the jth bit of the ith ring counter $b_i^j = 1$, for $1 \le j \le (i - 1)$, then in the ith column of the network only one cell, denoted by $C[i, j]$, is activated to perform the corresponding transposition $\tau_i^j$. If $b_i^i = 1$, $1 \le i \le n$, then all cells in the ith column are in the "identity" state. After the initial state of the network has been set, the control circuit generates consecutive states of the network in constant time (one clock period) and the permutation network generates subsequent configurations representing permutations. Valid partial derangements are detected by a logic function $V$ checking whether condition (6) is satisfied:

$$V = \prod_{i=1}^{n-1} \overline{FV(i)\,\overline{v_i}} = \prod_{i=1}^{n-1} \overline{FV(i)\,\overline{\Big(\overline{b_i^i} + \sum_{j=i+1}^{n} b_j^i\Big)}}, \tag{8}$$

where: $FV$ is a binary forbidden set vector, with $FV(i) = 1$ iff position $i$ is forbidden and $FV(i) = 0$ otherwise; $v_i$ is the function detecting whether condition (5) for the forbidden position $i$ is satisfied. In fact, for technological reasons, $v_i$ should be rewritten in the form:

$$v_i = (\overline{b_i^i} + (((\ldots(b_{i+1}^i + b_{i+2}^i) + \ldots) + b_{n-1}^i) + b_n^i)). \tag{9}$$
The above logic functions can be computed in O(n) time, which matches the network propagation delay. Because the size of the network is limited and the constant factor hidden in O(n) is very low, for most applications we may assume that consecutive network configurations are generated in constant time. It is expected that generation of a single partial derangement takes on average at most e clock periods when the whole set of permutations is generated. The proposed hardware generator of partial derangements for n = 32 has been implemented in a Xilinx Spartan 3 FPGA. For the project we have used VHDL, Xilinx Foundation ISE 8.2i software and a Digilent board with the XC3S200-4ft256 programmable device.
5 Conclusions
The concept of partial derangements can gain much attention when application problems involve irregular permutation problems with partially forbidden positions. In that case known methods for generation of permutations and derangements are not directly applicable. The correctness of both generation algorithms has been verified by computer simulation. In addition the algorithm PDGENREVLEX has been implemented in VHDL and tested in a real FPGA device. It is not likely that the hardware generator will be used for enumeration of full sets of partial derangements for large values of n. However, fast generation of derangements, partial derangements and permutations can be of practical value for many specific tasks involving generation of subsequences of such objects.
Acknowledgements. This work was supported by the research grant No. E–3/112/BW/07 from Cracow University of Technology.
References
1. Akl, S.G.: A new algorithm for generating derangements. BIT 20, 2–7 (1980)
2. Akl, S.G., Calvert, J.M., Stojmenovic, I.: Systolic generation of derangements. In: Proc. Int. Workshop on Algorithms and Parallel VLSI Architectures II, pp. 59–70. Elsevier, Amsterdam (1992)
3. Baril, J.-L., Vajnovszki, V.: Gray code for derangements. Discrete Applied Mathematics 140, 207–221 (2004)
4. Erickson, M.J.: Introduction to Combinatorics, vol. 83, pp. 119–120. Wiley Interscience, Chichester (1996)
5. Gupta, P., Bhattacharjee, G.P.: A parallel derangement generation algorithm. BIT 29, 14–22 (1989)
6. Graham, R.L., Knuth, D.E., Patashnik, O.: Concrete Mathematics, 2nd edn., pp. 194–195. Addison-Wesley Publishing Company, Reading (1994)
7. Hall, M., Paige, J.L.: Complete mapping of finite groups. Pacific J. Math., 541–549 (1955)
8. Hassani, M.: Derangements and applications. Journal of Integer Sequences 6, 1–8 (2003)
9. Kapralski, A.: Sequential and parallel processing in depth search machines. World Scientific, Singapore (1994)
10. Kim, D., Zeng, J.: A new decomposition of derangements. J. Comb. Theory, Ser. A 96, 192–198 (2001)
11. Knuth, D.: The art of computer programming, Pre-fascicles of ch. 7.2. Addison-Wesley, Reading (2001–2004)
12. Kokosiński, Z.: On generation of permutations through decomposition of symmetric groups into cosets. BIT 30, 583–591 (1990)
13. Kokosiński, Z.: Circuits generating combinatorial configurations for sequential and parallel computer systems. Cracow University of Technology, Cracow, Poland, Monograph 160 (1993)
14. Korsh, J.F., LaFolette, P.: Constant time generation of derangements. IPL 1990, pp. 181–186 (2004)
15. Mikawa, K., Semba, I.: Generating derangements by interchanging at most four elements. Systems and Computers in Japan 35(12), 25–31 (2004), available online from: www.interscience.wiley.com
16. Oruç, A.Y., Oruç, A.M.: Programming cellular permutation networks through decomposition of symmetric groups. IEEE Trans. on Computers 36, 802–809 (1987)
17. Ruskey, F.: Combinatorial generation, Working version (1j–CSC 425/520) (2003), available at: www.edu/~ruskey/book.pdf
18. Sedgewick, R.: Permutation generation methods. Computing Survey 9, 137–164 (1977)
Parallel Simulated Annealing Algorithm for Graph Coloring Problem

Szymon Łukasik1, Zbigniew Kokosiński2, and Grzegorz Świętoń2

1 Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, 01-447 Warsaw, Poland
[email protected]
2 Department of Automatic Control, Cracow University of Technology, ul. Warszawska 24, 31-155 Cracow, Poland
Abstract. The paper describes an application of Parallel Simulated Annealing (PSA) to one of the most studied NP-hard optimization problems, the Graph Coloring Problem (GCP). A synchronous master-slave model with periodic solution update is used. The paper contains a description of the method, recommendations for optimal parameter settings and a summary of the results obtained during the algorithm's evaluation. A comparison of our novel approach with a PGA metaheuristic proposed in the literature is given. Finally, directions for further work in the subject are suggested.

Keywords: graph coloring, parallel simulated annealing, parallel metaheuristic.
1 Introduction
Let G = (V, E) be a given graph, where V is a set of |V| = n vertices and E is a set of |E| = m edges. The Graph Coloring Problem (GCP) [1,2] is the task of finding an assignment of k colors to vertices, c : V → {1, ..., k}, k ≤ n, such that there is no conflict of colors between adjacent vertices, i.e. ∀(u, v) ∈ E : c(u) ≠ c(v), and the number of colors k used is minimal (such k is called the chromatic number χ(G) of the graph). A number of GCP variants are used for testing printed circuits, frequency assignment in telecommunication, job scheduling and other combinatorial optimization tasks. The problem is known to be NP-hard [3].

Intensive studies of the problem have resulted in a large number of approximate and exact solving methods. GCP was the subject of the Second DIMACS Challenge [4] held in 1993 and of the Computational Symposium on Graph Coloring and Generalizations organized in 2002. The graph instances [5] and the reported research results are frequently used in the development of new coloring algorithms and for reference purposes. Most algorithms designed for GCP are iterative heuristics [11], such as genetic algorithms [6], simulated annealing [7,8], and tabu or local search techniques [9], which minimize selected cost functions. At the time of this writing, the only parallel metaheuristic for GCP is the parallel genetic algorithm [10,12,13,14,15].

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 229–238, 2008. © Springer-Verlag Berlin Heidelberg 2008
The purpose of the paper is to present a new algorithm capable of solving GCP, developed on the basis of the Parallel Simulated Annealing (PSA) method [16]. Recent years have brought a rapid development of PSA techniques. Classical Simulated Annealing [17] has been adapted to parallel processing environments in various ways. The most popular approaches involve parallel moves, where a single Markov chain is evaluated by multiple processing units calculating possible moves from one state to another. The other method uses multiple threads for computing independent chains of solutions and exchanging the obtained results on a regular basis. Broad studies of both techniques can be found in [18,19]. The PSA scheme for GCP proposed by the authors includes both of the above strategies, with the rate of current solution update used as a distinctive control parameter.

The paper is organized as follows. The next section describes the proposed PSA algorithm. Besides its general structure, details of the cooling schedule, the cost function and the neighborhood generation procedure are given. The subsequent part of the paper presents the results of the algorithm's experimental evaluation. The final part gives general comments on the performance of the PSA algorithm and possible directions for future work.
2 Parallel Simulated Annealing
The PSA algorithm for GCP introduced in this paper uses multiple processors working concurrently on individual chains and agreeing on the current solution at fixed iteration intervals. The aim of the routine is to minimize the chosen cost function while storing the best solution found. The algorithm is coordinated in a master-slave model: one of the processing units is responsible for collecting solutions, choosing the current one and distributing it among the slave units. The exchange interval ei is a parameter which decides which PSA scheme is used. Setting ei = 1 is equivalent to producing a single chain of solutions using the multiple moves strategy. Increasing the interval ei leads to creating semi-independent chains on all slave processors, with each of the concurrent rounds starting from the same established solution. Setting ei to infinity results in performing independent simulated annealing runs. The general scheme of the algorithm is presented below:

Parallel Simulated Annealing for Graph Coloring Problem

    if proc=master
      best_cost:=infinity;
    if proc=slave
      // generate initial solution randomly at each slave
      solution[proc]:=Generate_Initial_Solution();
    for iter:=1 to iter_no do
      if proc=slave
        // each slave generates a new solution
        neighbor_solution[proc]:=Generate_Neighbor_Sol(solution[proc]);
        // and accepts it as a current one according to SA methodology
        solution[proc]:=Anneal(neighbor_solution[proc],solution[proc],T);
        // all solutions are then gathered at master
        Gather_at_master(solution[proc]);
      if proc=master
        current_cost:=infinity;
        // find solution with minimum cost and set it as a current one
        for j:=1 to slaves_no do
          if Cost(solution[j]) < current_cost
            current_solution:=solution[j];
            current_cost:=Cost(solution[j]);
        // update best solution found (if applicable)
        if current_cost
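The scheme above can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the authors' MPI implementation: the slaves are simulated by a list of solutions, the helper names (`cost`, `neighbor`, `anneal`, `psa`) and the parameters `t0`, `step_len` and `seed` are hypothetical, the number of colors is approximated by the largest color index in use, and `ei=None` stands for an infinite exchange interval.

```python
import math
import random

def cost(coloring, edges):
    """2 per conflicting edge, +1 if any conflict exists, + number of colors used."""
    conflicts = sum(1 for u, v in edges if coloring[u] == coloring[v])
    return 2 * conflicts + (1 if conflicts else 0) + len(set(coloring))

def neighbor(coloring, edges, rng):
    """Recolor one vertex; a conflicting vertex is preferred if any exists."""
    new, k = list(coloring), max(coloring)  # assumption: k = largest color in use
    bad = sorted({w for u, v in edges if coloring[u] == coloring[v] for w in (u, v)})
    if bad:
        new[rng.choice(bad)] = rng.randint(1, k + 1)
    else:
        new[rng.randrange(len(new))] = rng.randint(1, k)
    return new

def anneal(candidate, current, edges, temperature, rng):
    """Metropolis rule: always accept improvements, sometimes accept worse moves."""
    delta = cost(candidate, edges) - cost(current, edges)
    if delta <= 0 or rng.random() < math.exp(-delta / temperature):
        return candidate
    return current

def psa(edges, n, k0, slaves_no, iter_no, ei, t0=2.0, alpha=0.95, step_len=50, seed=0):
    """Simulate the master-slave loop; ei=None means independent runs."""
    rng = random.Random(seed)
    solutions = [[rng.randint(1, k0) for _ in range(n)] for _ in range(slaves_no)]
    best, best_cost, temperature = None, math.inf, t0
    for it in range(1, iter_no + 1):
        # slave part: one annealing move per simulated slave
        solutions = [anneal(neighbor(s, edges, rng), s, edges, temperature, rng)
                     for s in solutions]
        # master part: gather, pick the cheapest solution, update the best one
        current = min(solutions, key=lambda s: cost(s, edges))
        current_cost = cost(current, edges)
        if current_cost < best_cost:
            best, best_cost = list(current), current_cost
        # every ei iterations all slaves continue from the current solution
        if ei is not None and it % ei == 0:
            solutions = [list(current) for _ in range(slaves_no)]
        if it % step_len == 0:
            temperature *= alpha  # exponential cooling
    return best, best_cost
```

Calling `psa(edges, n, k0, slaves_no, iter_no, ei=1)` corresponds to the multiple moves strategy, while `ei=None` gives independent annealing runs whose results are only compared by the master.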
Detailed information about our Simulated Annealing algorithm, such as the cooling scheme, the representation of the solution, the method of neighborhood generation and the cost calculation, is given in the following subsections.

2.1 Cooling Schedule
Choosing a proper temperature schedule is crucial for an algorithm based on the Simulated Annealing methodology, since it influences the acceptance probability of positive transitions (i.e. those for which the cost difference Δcost,i between the generated neighbor and the initial solution is positive), given by Metropolis [20]:

$P(\Delta_{cost,i}) = e^{-\Delta_{cost,i} / T_i}$. (1)
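As a brief illustration of Eq. (1), the acceptance test can be sketched as follows. The function names are hypothetical, and accepting every non-worsening move with probability 1 is the usual convention rather than something stated in the text.

```python
import math
import random

def acceptance_probability(delta_cost, temperature):
    """Probability of accepting a move that changes the cost by delta_cost."""
    if delta_cost <= 0:
        return 1.0  # assumption: non-worsening moves are always accepted
    return math.exp(-delta_cost / temperature)  # Eq. (1)

def accept(delta_cost, temperature, rng=random):
    """Draw a uniform number in [0, 1) and compare it with the probability."""
    return rng.random() < acceptance_probability(delta_cost, temperature)
```

At a fixed cost difference the probability grows with the temperature, which is why the cooling schedule controls how often worsening moves are accepted.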
As a result of intensive studies in this area, multiple cooling strategies have been developed [21]. The cooling schedule used here is the exponential one:

$T_{i+1} = \alpha T_i$, (2)

where α is the cooling rate (usually set at the 0.80–0.99 level [22]) for each cooling step. Every SA step consists of Mi iterations. For the exponential schedule the following holds:

$M_{i+1} = \beta M_i$. (3)
In order to gradually extend SA runs at lower temperature levels, the constant β is usually chosen from the range [1.01, 1.20]. In addition to a proper cooling schedule, one has to choose a correct initial temperature T0. The authors used the most common method, which involves calculating the average cost difference Δcost,0 from a set of pilot runs consisting of positive transitions from an initial state. The preliminary temperature assuring the desired initial acceptance probability P(Δcost,0) can be calculated afterwards from the equation:

$T_0 = -\frac{\Delta_{cost,0}}{\ln P(\Delta_{cost,0})}$. (4)
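Equations (2)-(4) can be combined into a short sketch of the schedule. The names `initial_temperature` and `exponential_schedule`, the initial step length `m0` and the stopping rule based on a final temperature are assumptions made for illustration; the text does not give the value of M0.

```python
import math

def initial_temperature(mean_positive_delta, p0):
    """Eq. (4): T0 = -mean_delta / ln(p0), valid for 0 < p0 < 1."""
    return -mean_positive_delta / math.log(p0)

def exponential_schedule(t0, m0, alpha, beta, t_final):
    """Yield (temperature, iterations) per cooling step, following Eqs. (2) and (3)."""
    t, m = t0, float(m0)
    while t > t_final:
        yield t, int(round(m))
        t *= alpha  # Eq. (2): T_{i+1} = alpha * T_i
        m *= beta   # Eq. (3): M_{i+1} = beta * M_i

# Example with the settings reported later in the paper (alpha = 0.95, beta = 1.05);
# the mean cost difference 2.0 and m0 = 100 are illustrative values only.
t0 = initial_temperature(2.0, 0.7)
steps = list(exponential_schedule(t0, 100, 0.95, 1.05, 0.05 * t0))
```

Substituting the resulting T0 back into Eq. (1) with the same average cost difference returns the requested initial acceptance probability, which is a quick consistency check on Eq. (4).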
An alternative approach could follow the universal method for initial temperature selection introduced in [23].

2.2 Solution Representation and Neighborhood Generation
A graph coloring c is represented by a sequence of natural numbers c = ⟨c[1], ..., c[n]⟩, c[i] ∈ {1, ..., k}, which is equivalent to a set partition representation with exactly k non-empty blocks. A rule for the generation of a neighbor solution can be selected from a wide range of existing methods [24]. For the purpose of the presented algorithm the following form of restricted 1-exchange neighborhood is used:

Restricted 1-exchange neighborhood for Graph Coloring Problem

    // check for vertices with color conflicts
    conflict_vertices:=Find_Conflicting_Vertices(c);
    if sizeof(conflict_vertices)>0 do
      // if conflicts were found choose a random conflicting vertex
      vertex_to_change:=random(conflict_vertices);
      // and replace its color randomly with one of the colors {1, ... ,k+1}
      c[vertex_to_change]:=random(k+1);
    else
      // if no conflicts are found choose random vertex
      vertex_to_change:=random(n);
      // and replace its color randomly with one of the colors {1, ... ,k}
      c[vertex_to_change]:=random(k);
    return c;
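The pseudocode above translates almost line by line into Python. The following is a sketch under a few assumptions: vertices are numbered from 0, the coloring is a list of colors 1..k, the edge list and k are passed in explicitly, and the function names are hypothetical.

```python
import random

def conflicting_vertices(coloring, edges):
    """Vertices that share a color with at least one neighbor."""
    bad = set()
    for u, v in edges:
        if coloring[u] == coloring[v]:
            bad.update((u, v))
    return sorted(bad)

def restricted_one_exchange(coloring, edges, k, rng=random):
    """Return a neighbor of `coloring` that differs in at most one vertex."""
    neighbor = list(coloring)  # leave the input coloring untouched
    conflicts = conflicting_vertices(neighbor, edges)
    if conflicts:
        # recolor a random conflicting vertex with a color from {1, ..., k+1}
        vertex = rng.choice(conflicts)
        neighbor[vertex] = rng.randint(1, k + 1)
    else:
        # no conflicts: recolor any vertex with a color from {1, ..., k}
        vertex = rng.randrange(len(neighbor))
        neighbor[vertex] = rng.randint(1, k)
    return neighbor
```

Allowing color k+1 only when conflicts exist lets the search escape infeasible colorings by temporarily using one more color, while conflict-free colorings are never pushed to a larger palette.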
2.3 Cost Assessment
As a quality measure of a selected coloring c the following cost function was used [13]:

$f(c) = \sum_{(u,v) \in E} q(u, v) + d + k$, (5)

where q is a penalty function:

$q(u, v) = \begin{cases} 2 & \text{when } c(u) = c(v) \\ 0 & \text{otherwise} \end{cases}$ (6)

d is a coefficient for solutions with conflicts:

$d = \begin{cases} 1 & \text{when } \sum_{(u,v) \in E} q(u, v) > 0 \\ 0 & \text{when } \sum_{(u,v) \in E} q(u, v) = 0 \end{cases}$ (7)

and k is the number of colors used.
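A direct transcription of Eqs. (5)-(7) is given below as a minimal sketch. It assumes that the coloring is a list indexed by vertex, that edges are given as pairs of vertex indices, and that k is counted as the number of distinct colors present in the coloring; the function name is hypothetical.

```python
def coloring_cost(coloring, edges):
    """Cost f(c) of Eq. (5) for a coloring given as a list indexed by vertex."""
    conflicts = sum(1 for u, v in edges if coloring[u] == coloring[v])
    penalty = 2 * conflicts        # sum of q(u, v) over all edges, Eq. (6)
    d = 1 if conflicts > 0 else 0  # Eq. (7): one extra unit for any infeasible coloring
    k = len(set(coloring))         # number of colors actually used
    return penalty + d + k

# Triangle graph: a proper 3-coloring costs 3, a coloring with one conflict costs 5.
triangle = [(0, 1), (1, 2), (0, 2)]
proper = coloring_cost([1, 2, 3], triangle)
one_conflict = coloring_cost([1, 1, 2], triangle)
```

For a conflict-free coloring the cost equals the number of colors used, so a fractional average cost such as 11.32 for a graph with χ(G) = 11 indicates that some runs ended with extra colors or with conflicts.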
3 Experimental Evaluation
For testing purposes an implementation of the algorithm based on the Message Passing Interface was prepared. All experiments with simulated parallelism were carried out on an Intel® Xeon™ machine. Standard DIMACS graphs, obtained from [5], were used as test instances. For the experiments the following values of the SA control parameters were chosen: α = 0.95 and β = 1.05. The initial temperature was determined from a pilot run consisting of 1% (relative to the overall iteration number) positive transitions. The termination condition was either achieving the optimal solution or reaching the required number of iterations. Due to space limitations only the most representative results are presented in the paper. The full set of simulation data can be found on the first author's web site (http://www.pk.edu.pl/~szymonl).

3.1 SA Parameter Settings
First, the optimal values of the SA parameters were investigated. The essential results of those experiments are gathered in Table 1.

Table 1. Influence of Simulated Annealing parameters on algorithm's performance

    Graph G(V,E)    Description   P(Δcost,0)          Tf                    k0
                                  Results   Best P    Results   Best Tf     Results   Best k0
    anna,           best f(c)     11.12     70%       11.00     0.04·T0     11.12     χ(G)
    χ(G) = 11
    |V| = 138       avg. f(c)     11.32               11.14                 11.17
    |E| = 493       σf(c)         0.29                0.21                  0.05

    queen8_8,       best f(c)     11.33     60%       10.60     0.06·T0     11.33     χ(G)
    χ(G) = 9
    |V| = 64        avg. f(c)     11.46               11.84                 11.41
    |E| = 728       σf(c)         0.10                1.26                  0.05

    mulsol.i.4,     best f(c)     38.23     80%       31.03     0.2·T0      37.63     χ(G) − 5
    χ(G) = 31
    |V| = 197       avg. f(c)     38.66               33.65                 38.22
    |E| = 3925      σf(c)         0.46                1.64                  0.36

    myciel7,        best f(c)     12.95     60%       8.00      0.2·T0      11.38     χ(G) − 5
    χ(G) = 8
    |V| = 191       avg. f(c)     13.22               9.47                  12.93
    |E| = 2360      σf(c)         0.34                1.60                  0.96

An opening set of 500 runs with final temperature Tf = 0.1, iter_no = 10000, a randomly generated initial solution with k = χ(G), and the initial probability varied within the range 10%–90% showed that the best performance of the algorithm, measured primarily by the minimum average cost function (the second criterion was the iteration number), is achieved for high initial probabilities, with an optimum found at about 60%–80%. It was observed, however, that the exact choice of P(Δcost,0) in this range is not very significant for the overall performance of the algorithm. The results confirmed the hypothesis that for more complex problems it is advisable to use higher values of the initial temperature.

In the next experiment the optimal final temperature (relative to T0) was examined. For fixed P(Δcost,0) = 70%, iter_no = 10000 and k = χ(G), the best results were obtained for Tf ∈ [0.01, 0.2]·T0. Again, higher solution quality for more complex graph instances was achieved with increased temperature ratios.

The influence of the initial number of colors k0 on the solution quality was also determined experimentally. The range [χ(G) − 5, χ(G) + 5] was considered, with P(Δcost,0) = 70%, Tf = 0.05·T0 and iter_no = 10000. It was observed that using an initial number of colors slightly different from the chromatic number does not significantly affect the algorithm's performance. For some graph instances it is even advisable to start with colorings with k0 lower than χ(G).

Finally, it should be noted that the statements presented above are to be treated as overall guidelines for SA parameter settings, obtained from a relatively small set of graphs. The exact values of those parameters depend largely on the considered class of graph instances.

3.2 Influence of Parallelization Schemes
The second stage of the computing experiments involved examining the algorithm's performance with different parallelization schemes and comparing it with the results obtained with the sequential Simulated Annealing algorithm. For PSA, configurations with ei ∈ {1, 2, 4, 6, 8, 10, ∞} and a number of slaves from 2 to 18 were tested on various graph instances. To examine the effect of parallelization on the processing time, the same number of iterations iter_no = 100000 was set for both the sequential SA and the PSA algorithms (in PSA each slave performs only iter_no/slaves_no iterations). For the temperature schedule the following settings were applied: P(Δcost,0) = 70% and Tf = 0.05·T0. The results include mean values of the cost function, the number of conflict-free/optimal solutions, the number of iterations needed to find an optimal coloring (if applicable), and the algorithm's execution time t[s] (until the best solution was found). A summary of the results is presented in Table 2. The best and worst parallel configurations, in terms of average f(c) and processing time, are reported together with the obtained results. As a reference, the average performance of the algorithm is given as well.

Table 2. Experimental evaluation of the PSA algorithm for GCP

    Graph G(V,E)   Description           SA        PSA results                    PSA config.
                                         Results   Best      Worst     Average    Best        Worst
    games120       avg. f(c)             9         9         9         9          7 slaves    18 slaves
    χ(G) = 9       c.-f. c / opt. c      100/100   100/100   100/100   100/100    ei = 1      ei = ∞
    |V| = 120      avg. iter. / opt. c   477       78        258       176
    |E| = 638      avg. t[s] / best c    0.72      0.05      0.40      0.14

    anna           avg. f(c)             11        11        11.32     11.02      4 slaves    18 slaves
    χ(G) = 11      c.-f. c / opt. c      100/100   100/100   100/72    100/98     ei = 1      ei = 1
    |V| = 138      avg. iter. / opt. c   5821      199       1177      462
    |E| = 493      avg. t[s] / best c    1.31      0.08      1.73      0.31

    myciel7        avg. f(c)             8         8         8.66      8.05       3 slaves    18 slaves
    χ(G) = 8       c.-f. c / opt. c      100/100   100/100   100/43    100/95     ei = 1      ei = 1
    |V| = 191      avg. iter. / opt. c   7376      797       1524      1539
    |E| = 2360     avg. t[s] / best c    1.85      0.20      1.83      1.03

    miles500       avg. f(c)             20        20        20.1      20.01      6 slaves    17 slaves
    χ(G) = 20      c.-f. c / opt. c      100/100   100/100   100/90    100/98     ei = 1      ei = 1
    |V| = 128      avg. iter. / opt. c   38001     544       422       2842
    |E| = 1170     avg. t[s] / best c    4.71      0.21      0.58      1.06

    mulsol.i.4     avg. f(c)             31.04     31.19     38.24     34.74      2 slaves    17 slaves
    χ(G) = 31      c.-f. c / opt. c      100/96    100/81    100/0     100/1      ei = ∞      ei = 1
    |V| = 197      avg. iter. / opt. c   19007     15908     -         13451
    |E| = 3925     avg. t[s] / best c    4.30      1.83      2.64      2.08

    queen8_8       avg. f(c)             9.97      9.81      10.05     9.97       5 slaves    16 slaves
    χ(G) = 9       c.-f. c / opt. c      100/3     100/19    100/0     100/3      ei = 1      ei = ∞
    |V| = 64       avg. iter. / opt. c   66488     8831      -         8820
    |E| = 728      avg. t[s] / best c    1.66      0.64      3.82      1.65

    le450_15b      avg. f(c)             18.58     17.39     21.79     18.71      9 slaves    18 slaves
    χ(G) = 15      c.-f. c / opt. c      100/0     100/0     100/0     100/0      ei = 1      ei = ∞
    |V| = 450      avg. iter. / opt. c   -         -         -         -
    |E| = 8169     avg. t[s] / best c    42.88     3.54      6.47      4.99

PSA clearly outperforms sequential Simulated Annealing in terms of computation time. Moreover, applying parallelization can improve the quality of the obtained solution (e.g. for queen8_8 and le450_15b). It can be seen, though, that it is important to select a proper configuration of the PSA algorithm to achieve high efficiency. For most problem instances it is advisable to use the multiple moves strategy with an optimal, relatively small, number of slaves. There is one exception to this statement: for the class of mulsol.i graphs significantly better results were obtained with fully independent SA runs. The worst results of Parallel Simulated Annealing were obtained when a high number of slaves was involved in the computations and the parallelization scheme was far from the optimal one.

3.3 Comparison with Parallel Genetic Algorithm
The last stage of the testing procedure involved a comparison of the time efficiency of the PSA algorithm and the Parallel Genetic Algorithm introduced in [12]. The implementation of the PGA for GCP used in [13] was applied. Both algorithms were executed on the same machine for selected DIMACS graph instances, and the computation time needed to find an optimal coloring was reported. PGA was executed with 3 islands, subpopulations consisting of 60 individuals, migration rate 5, migration size 5 with the best individuals being distributed, initial number of colors 4, and the operators CEX crossover (with probability 0.6) and First-Fit mutation (with probability 0.1). For PSA, 3 slaves were used, with P(Δcost,0) = 70%, Tf = 0.05·T0 and iter_no = 300000 (for the instance mulsol.i.1, runs of 1500000 iterations were needed to find an optimal solution). For most instances the multiple moves strategy was applied. One exception was the mulsol.i class of graphs where, according to earlier observations, independent SA runs were executed. The results of the experiments, collected in Table 3, demonstrate that PSA performance is comparable to that achieved by the PGA. For some graph instances, like book graphs and miles500, the proposed algorithm was found to be superior. On the other hand, there exists a group of problems that are relatively easy to solve by PGA and, at the same time, difficult to solve by PSA (like mulsol.i.1).

Table 3. Comparison of time efficiency of PGA and PSA algorithms applied for GCP

    Graph G(V,E)                                  t[s]
                                                  PSA     PGA
    anna, χ(G) = 11, |V| = 138, |E| = 493         0.23    0.35
    myciel7, χ(G) = 8, |V| = 191, |E| = 2360      0.34    0.25
    miles500, χ(G) = 20, |V| = 128, |E| = 1170    0.48    18.0
    mulsol.i.4, χ(G) = 31, |V| = 185, |E| = 3946  2.89    1.99
    mulsol.i.1, χ(G) = 49, |V| = 197, |E| = 3925  14.9    4.47
    games120, χ(G) = 9, |V| = 120, |E| = 638      0.20    0.34

4 Conclusion
In the paper a new Parallel Simulated Annealing algorithm for GCP was introduced and evaluated. The first experiments revealed that its performance depends on choosing a cooling schedule and a method of generating the initial coloring suitable for the considered problem. Some general guidelines were derived for the algorithm's settings that ensure a better solution quality. Further research in the subject could concern adaptive cooling schedules and generation of the initial solution by means of an approximate method. Choosing an optimal number of processing units and a parallelization scheme for the PSA was also considered. We found that the problem specification substantially influences the proper choice of these elements. However, it can be stated as a general remark that the highest efficiency of the master-slave PSA algorithm is achieved for an optimal, relatively small number of slaves. During the performance evaluation the PSA algorithm proved to be an effective tool for solving the Graph Coloring Problem. The experiments showed that it achieves a performance level similar to that of PGA, and the comparison showed that neither method is superior. This encourages efforts to develop new hybrid metaheuristics which would benefit from the advantages of both the PGA and PSA approaches. The overall concept of a hybrid algorithm could implement the idea presented in [25] and include some other improvements of the standard PSA scheme as proposed in [26].
References

1. Kubale, M. (ed.): Graph colorings. American Mathematical Society (2004)
2. Jensen, T.R., Toft, B.: Graph coloring problems. Wiley Interscience, New York (1995)
3. Garey, R., Johnson, D.S.: Computers and intractability. A guide to the theory of NP–completeness. W. H. Freeman, New York (1979)
4. Johnson, D.S., Trick, M.A. (eds.): Cliques, Coloring and Satisfiability. DIMACS Series in Discrete Mathematics and Theoretical Science, vol. 26 (1996)
5. Graph coloring instances: http://mat.gsia.cmu.edu/COLOR/instances.html
6. Fleurent, C., Ferland, J.A.: Genetic and Hybrid Algorithms for Graph Coloring. Annals of Operations Research 63, 437–461 (1996)
7. Chams, M., Hertz, A., de Werra, D.: Some Experiments with Simulated Annealing for Coloring Graphs. European Journal of Operational Research 32, 260–266 (1987)
8. Johnson, D.S., Aragon, C.R., McGeoch, L., Schevon, C.: Optimization by Simulated Annealing: An Experimental Evaluation; Part II, Graph Coloring and Number Partitioning. Operations Research 39(3), 378–406 (1991)
9. Galinier, P., Hertz, A.: A Survey of Local Search Methods for Graph Coloring. Computers & Operations Research 33, 2547–2562 (2006)
10. Alba, E. (ed.): Parallel Metaheuristics. John Wiley & Sons, New York (2005)
11. Szyfelbein, D.: Metaheuristics in graph coloring. In: Kubale, M. (ed.) Discrete optimization. Models and methods for graph coloring, WNT, Warszawa, pp. 26–52 (2002) (in Polish)
12. Kokosiński, Z., Kołodziej, M., Kwarciany, K.: Parallel genetic algorithm for graph coloring problem. In: Bubak, M., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2004. LNCS, vol. 3036, pp. 215–222. Springer, Heidelberg (2004)
13. Kokosiński, Z., Kwarciany, K., Kołodziej, M.: Efficient Graph Coloring with Parallel Genetic Algorithms. Computing and Informatics 24(2), 109–121 (2005)
14. Łukasik, S.: Parallel Genetic Algorithms using Message Passing Paradigm for Graph Coloring Problem. Journal of Electrical Engineering 56(12s), 123–126 (2005)
15. Kokosiński, Z., Kwarciany, K.: On sum coloring of graphs with parallel genetic algorithms. In: Beliczynski, B., Dzielinski, A., Iwanowski, M., Ribeiro, B. (eds.) ICANNGA 2007. LNCS, vol. 4431, pp. 211–219. Springer, Heidelberg (2007)
16. Azencott, R. (ed.): Simulated Annealing: Parallelization Techniques. John Wiley & Sons, New York (1992)
17. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.: Optimization by Simulated Annealing. Science 220, 671–680 (1983)
18. Diekmann, R., Lüling, R., Simon, J.: Problem Independent Distributed Simulated Annealing and its Applications. Working Paper, University of Paderborn, Germany
19. Lee, S.-Y., Lee, K.G.: Synchronous and Asynchronous Parallel Simulated Annealing with Multiple Markov Chains. IEEE Transactions on Parallel and Distributed Systems 7(10), 993–1008 (1996)
20. Metropolis, N.A., Rosenbluth, A., Teller, A., Teller, E.: Equation of State Calculations by Fast Computing Machines. Journal of Chemical Physics 21, 1087–1092 (1953)
21. Ingber, A.L.: Simulated Annealing: Practice versus Theory. Journal of Mathematical Computation Modelling 18(11), 29–57 (1993)
22. Sait, S.M., Youssef, H.: Iterative computer algorithms with applications in engineering. IEEE Computer Society, Los Alamitos (1999)
23. Ben-Ameur, W.: Computing the Initial Temperature of Simulated Annealing. Computational Optimization and Applications 29, 369–385 (2004)
24. Avanthay, C., Hertz, A., Zufferey, N.: A Variable Neighborhood Search for Graph Coloring. European Journal of Operational Research 151, 379–388 (2003)
25. Tomoyuki, H., Mitsunori, M., Ogura, M.: Parallel Simulated Annealing using Genetic Crossover. Science and Engineering Review of Doshisha University 41(2), 130–138 (2000)
26. Bożejko, W., Wodecki, M.: The New Concepts in Parallel Simulated Annealing Method. In: Rutkowski, L., Siekmann, J.H., Tadeusiewicz, R., Zadeh, L.A. (eds.) ICAISC 2004. LNCS (LNAI), vol. 3070, pp. 853–859. Springer, Heidelberg (2004)
Parallel Algorithm to Find Minimum Vertex Guard Set in a Triangulated Irregular Network

Masoud Taghinezhad Omran

Sharif University of Technology
[email protected]
Abstract. This paper presents a new serial algorithm for selecting a nearly minimum number of vertex guards so that all parts of a geographical surface modeled by a TIN (Triangulated Irregular Network) are covered. On average, our algorithm selects fewer guards than the best existing algorithms. Based on this approach, a new coarse-grain parallel algorithm for this problem is proposed. It is shown that the upper bound for the total number of guards selected by this algorithm is 2n/3, where n is the number of vertices in the TIN. Average-case analysis and implementation results show that in real TINs even fewer than n/2 guards (the proved upper bound on the number of guards needed in the worst case) are selected by our serial and parallel algorithms.

Keywords: Parallel Computing, Art Gallery Problem, Guarding Problem, MPI, Graph Partitioning, Approximation Algorithms.
1 Introduction
Finding the minimum number of light sources to illuminate a polygon is NP-hard [6]. However, it has been shown that illuminating an n-sided simple polygon needs at most n/3 light sources, and this number is sometimes necessary [5,8]. For different kinds of guarding problems in two dimensions, several algorithms have been proposed [12,16,17]. The problem has also been considered for 2½-dimensional surfaces [7]. Bose et al. [3] showed that the upper bound for the number of guards needed to cover the entire surface of an n-vertex terrain, when it is triangulated and guards are placed on vertices, is n/2, and in some situations this number of guards is necessary. A Triangulated Irregular Network is a good model for representing a geographical terrain, and guarding such a terrain has important applications. The amount of data for such a problem is huge, and thus the use of parallel processing is justifiable. In this paper, a new approach to the guarding problem in TINs is proposed that behaves better on average than the best existing algorithm. Based on this approach, a parallel algorithm for this problem is proposed. The parallel algorithm has been implemented on a 24-node PC cluster running MPI. The experimental data are presented in this paper.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 239–248, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Guarding 3D Terrain
The problem of guarding a polyhedral terrain is to find the smallest set of points on the terrain such that every point on the terrain is visible from at least one point of the set. By a terrain we usually mean a geographic surface in 3D space on which no two different points share the same x and y values (so each pair (x, y) has a single height z), where ⟨x, y, z⟩ are the Cartesian coordinates of a point. These surfaces are called 2½-dimensional surfaces. To be processable by a computer, the data representing the terrain must be given in a discrete form. Therefore, the 3D region is approximated by small triangles, which form a Triangulated Irregular Network (TIN). To reduce the complexity of the guarding problem, the height of the nodes is ignored. As a result, each guard placed on a vertex of the TIN can only see its directly adjacent triangles. The upper bound on the number of guards needed in this case is the same as in the case where height is considered, and equals n/2.
3 Serial Algorithms for Guarding TINs
For the guarding problem on TINs, two different approximation algorithms have been proposed; we briefly go through them here.

Guarding Algorithm Using Graph Coloring: It has been shown that every planar graph is five-colorable. According to this theorem, it is possible to color all vertices of a planar graph using five colors in linear time.

Algorithm: Guarding the TIN using 5-coloring
1. Color the TIN graph using five colors.
2. Among the five assigned colors, select the three least used colors and place a guard on every vertex colored with one of these colors.

Placing guards as described above ensures that every triangle has at least one guard on its vertices: the three vertices of a triangle receive three distinct colors, and only two colors are left unselected. It has been shown that the time needed for both phases is O(n). This algorithm uses at most 3n/5 guards for an n-vertex graph. More details of this algorithm are available in [3].

Worst-Case-Optimal Guarding Algorithm: In every TIN, the upper bound for the total number of guards needed in the worst case is n/2. Bose et al. proposed a worst-case-optimal algorithm which selects at most n/2 guards [2]. This algorithm is based on finding a maximum matching on TINs, for which a complex linear-time algorithm exists [1].

Algorithm: Worst-Case-Optimal Guarding Algorithm
1. Find the dual graph G* of the TIN graph G.
2. Calculate a maximum matching M* in G*.
3. Construct G′ by eliminating the edges dual to M* from G. As G′ is a bipartite graph, its nodes can be partitioned into two separate sets, A and B.
4. Between A and B, select the smaller set as the set of guards.

Each phase of this algorithm takes linear time, and it is clear that the upper bound on the total number of guards selected by this algorithm is n/2.
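The second step of the coloring-based algorithm is easy to sketch once a coloring is available. The code below is illustrative only: it assumes that a proper coloring with colors drawn from a fixed palette of five is supplied as input (computing such a coloring in linear time is the hard part and is not shown), and the function names are hypothetical.

```python
from collections import Counter

def guards_from_coloring(color, palette=(0, 1, 2, 3, 4)):
    """Place guards on all vertices that carry one of the three least used colors.

    color: dict mapping vertex -> color, assumed to be a proper coloring
    that only uses colors from `palette`.
    """
    usage = Counter(color.values())  # unused colors count as zero
    chosen = set(sorted(palette, key=lambda c: (usage[c], c))[:3])
    return {v for v, c in color.items() if c in chosen}

def covers(triangles, guards):
    """True if every triangle has at least one guarded vertex."""
    return all(any(v in guards for v in t) for t in triangles)

# Two triangles sharing the edge (1, 2), with a proper coloring.
triangles = [(0, 1, 2), (1, 2, 3)]
color = {0: 0, 1: 1, 2: 2, 3: 0}
guards = guards_from_coloring(color)
```

Because exactly two palette colors stay unselected and each triangle carries three distinct colors, every triangle keeps at least one guarded vertex, and the three least used colors account for at most 3n/5 vertices.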
4 New Serial Algorithm for Guarding TINs
Average-case analysis and implementation results show that both serial guarding algorithms mentioned in the previous section select close to n/2 guards in real TINs, even in conditions where fewer than n/2 guards are sufficient for guarding a TIN. In contrast to their upper bounds, the first algorithm usually selects fewer guards than the worst-case-optimal algorithm. Hence, the existing serial guarding algorithms are not very efficient on real TINs. We suggest a new guarding algorithm that selects fewer guards than the existing algorithms on average. The main idea of the new algorithm is to select a vertex at each step and put a guard on it. By placing a guard on a vertex, all triangles around the selected vertex are covered and can be removed from the TIN. Each vertex that has no triangle around it is removed from the TIN.

Algorithm: New Serial Guarding Algorithm
1. In each step, if there exists a vertex with one or two surrounding triangles in the TIN, then
   (a) If A is a vertex with one triangle around it, and B and C are its adjacent vertices: if B and C have the same number of surrounding triangles, select either of them as a vertex guard; otherwise, select the one with more surrounding triangles. Remove the selected vertex along with its surrounding triangles from the TIN.
   (b) If A is a vertex with two surrounding triangles, and B, C, and D are its adjacent vertices: among B, C, and D, select the vertex adjacent to both triangles around A as a vertex guard, and remove it along with its surrounding triangles from the TIN.
2. If there is no vertex with one or two surrounding triangles in the TIN, select a vertex that has at least four surrounding triangles and remove it from the TIN along with its surrounding triangles. Removing the vertex and triangles should not create a hole in the TIN. Continue from the first phase.
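The two phases can be sketched as a simple loop over a set of triangles. This is a simplified illustration, not the paper's implementation: it assumes integer vertex identifiers, recomputes the surrounding triangles in every round instead of maintaining them incrementally, and replaces phase two by "take the vertex with the most surrounding triangles" without the check that no hole is created, so the 2n/3 bound is not claimed for it.

```python
def greedy_vertex_guards(triangles):
    """Return a list of guard vertices covering every triangle (simplified sketch)."""
    remaining = {frozenset(t) for t in triangles}
    guards = []
    while remaining:
        # map every vertex to the triangles that still surround it
        around = {}
        for t in remaining:
            for v in t:
                around.setdefault(v, []).append(t)
        guard = None
        # Phase 1: look for a vertex with one or two surrounding triangles.
        for a in sorted(around):
            tris = around[a]
            if len(tris) == 1:
                # case (a): guard the neighbor with more surrounding triangles
                b, c = sorted(tris[0] - {a})
                guard = c if len(around[c]) > len(around[b]) else b
                break
            if len(tris) == 2:
                # case (b): guard the neighbor shared by both triangles, if any
                shared = sorted((tris[0] & tris[1]) - {a})
                if shared:
                    guard = shared[0]
                    break
        # Phase 2 (simplified, no hole check): take the busiest vertex.
        if guard is None:
            guard = max(sorted(around), key=lambda v: len(around[v]))
        guards.append(guard)
        remaining -= set(around[guard])
    return guards
```

On a fan of six triangles around a central vertex, any rim vertex has two surrounding triangles that share the center, so the sketch guards the center and removes the whole fan in one step.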
To prove that the new algorithm guards the whole TIN properly, we should show that it continues until all triangles are removed from the TIN. First, we should show that when there exists no vertex satisfying the conditions of phase one, it is possible to find a vertex satisfying the conditions of phase two. Also, we should show that when there exists no vertex satisfying the conditions of phase two, the algorithm can continue to the end through phase one. In the following theorems we show that these conditions are satisfied.

Theorem 1. While there exists a vertex in the TIN having at least four surrounding triangles, it is possible to remove a vertex with at least four surrounding triangles in such a way that no hole is created in the TIN.
Fig. 1. Different cases of the boundary node (j) and the interior node (i)
Fig. 2. Case that the boundary node (j) has three surrounding triangles
Proof. Suppose that there exist one or more vertices in the TIN that have at least four surrounding triangles, and that it is not possible to remove a vertex with at least four surrounding triangles without creating a hole in the TIN. Consider the case in which a boundary vertex (j) is connected to an interior vertex (i). The different cases are as follows. If vertex j has at least four surrounding triangles, then it can be removed from the TIN, which contradicts our supposition (Figure 1 A). If vertex j has three surrounding triangles (Figure 1 B) and vertex i or k has more than three triangles around it, then after removing either of them the boundary still remains connected, which is a contradiction. So i and k must have fewer than four surrounding triangles. As a result, a type of TIN is produced that we call a B-type TIN (Figure 2 A). The produced TIN can extend only through vertices m and n (Figure 2 B). In the case that vertex j has two triangles around it (Figure 1 C), if vertex i has more than three surrounding triangles, this again contradicts our supposition. If vertex i has fewer than four surrounding triangles, then, as i is an interior vertex, it must have exactly three surrounding triangles. We refer to this kind of TIN as an A-type TIN (Figure 3 A). A B-type TIN can also be produced when n and m have three surrounding triangles, two of which are in common (Figure 3 B). The produced TIN can extend only through n and m (Figure 3 C). As a result, if we cannot remove a vertex with at least four surrounding triangles, then no such vertex exists in the TIN, which contradicts our supposition.

Proposition 1. In triangulated simple polygons in which each vertex has fewer than four surrounding triangles, at least three vertices have fewer than three (1 or 2) surrounding triangles.
Parallel Algorithm to Find Minimum Vertex Guard Set
Fig. 3. Case that the boundary node (j) has two surrounding triangles
Theorem 2. When only vertices with fewer than four surrounding triangles remain in the TIN, the new guarding algorithm can continue until all triangles are removed.
Proof. As a result of Theorem 1, after removing the vertices with more than three surrounding triangles, only some A-type and B-type TINs and some simple polygons remain in the TIN. These items may share a common point with each other, but they cannot form a hole in the TIN because both phases of the algorithm prevent holes from being created. Let each Pi be an A-type TIN, a B-type TIN, or a simple polygon. Since there is no hole in the TIN, there is a chain of Pi's with at least two ending heads. We consider one of these heads and name it Ph. Each Pi has at least two vertices with fewer than three surrounding triangles (by the shape of A-type and B-type TINs and by Proposition 1). So Ph has at least one vertex with fewer than three surrounding triangles. This ensures that the algorithm can begin by removing a vertex. Moreover, by removing a vertex as described in the algorithm, either the whole of Ph is removed or a simple polygon remains. In both cases the algorithm can continue with the next Pi, which is the new ending head of the chain. Consequently, the algorithm can continue until the whole TIN is removed.
Theorem 3. The presented guarding algorithm completely covers the TIN with the selected guards.
Proof. By Theorems 1 and 2, our new guarding algorithm correctly covers the whole TIN with the assigned guards.
Theorem 4. The new guarding algorithm selects at most $2n/3$ guards to cover the whole TIN.
Proof. We consider three cases:
1) If a vertex with one surrounding triangle exists and it belongs to a single triangle on the plane, then removing a vertex removes three vertices and one triangle from the TIN in total. When the adjacent vertices have more than one surrounding triangle, at least two vertices and two triangles are removed.
2) When a vertex with two surrounding triangles exists, removing that vertex removes at least two triangles and two vertices from the TIN.
M. Taghinezhad Omran
3) When a vertex is removed according to phase two of the algorithm, at least one vertex and four surrounding triangles are removed from the TIN.
Consider the following definitions:
i: number of vertices whose removal removes four or more triangles.
j: number of vertices whose removal removes one triangle.
k: number of vertices whose removal removes two or three triangles.
We know that in every TIN $T = 2n - t - 2$, where $T$ is the total number of triangles, $t$ is the number of boundary vertices, and $n$ is the total number of vertices. Then:
\[ i + 3j + 2k \le n \;\Rightarrow\; 2i + 6j + 4k \le 2n \tag{1} \]
\[ 4i + j + 2k \le 2n - t - 2 \;\Rightarrow\; 4i + j + 2k \le 2n \tag{2} \]
\[ (1) + (2) \;\Rightarrow\; 6i + 7j + 6k \le 4n \;\Rightarrow\; 6i + 6j + 6k \le 4n \;\Rightarrow\; i + j + k \le \frac{2n}{3} \]
One guard is assigned for each vertex of kind i, j, or k. So the new algorithm selects at most $2n/3$ guards in every TIN. The new algorithm runs in linear time, which can be inferred directly from the algorithm. Each vertex is visited once, either at the beginning when the data structure is constructed or in the middle of the algorithm when other vertices are removed. Also, a vertex is removed either because a guard is put on it or because a guard is put on one of its neighbouring vertices. So each vertex is inserted into and removed from the data structure only once. By using the DCEL scheme to represent the TIN, the algorithm determines the vertex guards in linear time. Experimental results show that our new serial guarding algorithm selects fewer guards on average than other existing algorithms. We executed our implementation on different TINs, and in all cases we obtained better results than the older guarding algorithms. For example, the TINs on which we tested our algorithm needed fewer than $2n/5$ guards to be covered. While other algorithms selected nearly $n/2$ guards, our new algorithm assigned about $2n/5$ guards to cover the TINs. Implementations show that the new algorithm runs in phase one most of the time: when a vertex satisfies the conditions of phase one, removing the proper vertex from the TIN makes it very probable that other vertices with one or two surrounding triangles appear in the TIN. As a result, in nearly all steps of the algorithm at least two vertices are removed from the TIN. There are also many cases in which more than two vertices are removed in one step of the algorithm (e.g. a single A-type TIN on the plane). This ensures that, on average, fewer than $n/2$ guards are selected by this algorithm.
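The counting argument in the proof of Theorem 4 relies on the relation T = 2n − t − 2 for a hole-free TIN. As a minimal sketch, the following Python snippet counts n, t and T for a triangulation given as a list of vertex-index triples and checks the relation. The function name `tin_counts`, the input encoding and the sample triangulation are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def tin_counts(triangles):
    """Return (n, t, T) for a triangulation given as vertex-index triples.

    n = number of vertices, t = number of boundary vertices,
    T = number of triangles. An edge used by exactly one triangle is a
    boundary edge, and its endpoints are boundary vertices.
    """
    edge_use = Counter()
    vertices = set()
    for tri in triangles:
        vertices.update(tri)
        for a, b in combinations(sorted(tri), 2):
            edge_use[(a, b)] += 1
    boundary = {v for edge, count in edge_use.items() if count == 1 for v in edge}
    return len(vertices), len(boundary), len(triangles)


# Hypothetical sample: a square 0-1-2-3 fanned around one interior vertex 4.
sample = [(0, 1, 4), (1, 2, 4), (2, 3, 4), (3, 0, 4)]
n, t, T = tin_counts(sample)
assert T == 2 * n - t - 2  # 4 == 2*5 - 4 - 2
```

The sketch only checks the counting relation; it does not implement the guarding algorithm itself.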
5 Parallel Algorithm for Guarding TINs
One of the most important applications of the guarding problem is finding guards on geographic regions that have been modeled by TINs. In such problems, the data are so large that processing on a single processor becomes difficult.
Based on our new serial algorithm, we present a new coarse-grained parallel algorithm that identifies guards in less time than the serial algorithm and avoids the processing difficulties that arise when a single processor is used. In this coarse-grained algorithm, our new serial algorithm runs on each processor in parallel with the other processors. To prevent conflicts between the guards assigned by different processors, we use a preprocessing phase to separate the partitions of the TIN. So no extra work is needed at the end of the algorithm to merge the results.
Algorithm: New Parallel Guarding Algorithm
1. Partitioning Phase. One of the processors (e.g. processor P0) partitions the TIN graph into P parts, where P is the number of processors. In this partitioning scheme, a set of connected vertices is selected as graph partitioner nodes, and no hole should be created in the TIN.
2. Preprocessing Phase. In each step, processor P0 removes a partitioner vertex that satisfies the following two conditions, until no such vertex exists in the TIN: 1) removing the vertex does not create a hole in the plane; 2) removing it causes at least four triangles to be removed from the TIN.
3. Parallel Phase. Processor P0 sends the remaining triangle data to the relevant processors. Each processor finds guards in its part using the new serial guarding algorithm and sends the result back to processor P0.
4. Merging Phase. Processor P0 puts the data received from each processor together to obtain the total result.
After phase two of the algorithm, each processor selects guards on the remaining data according to our new serial guarding algorithm. Theorem 3 shows that the new serial algorithm correctly covers the whole TIN with the selected guards. So Theorem 5 can be stated as follows:
Theorem 5. The new parallel algorithm correctly covers the whole TIN with the selected guards.
Now we show that the upper bound on the number of guards selected by the parallel algorithm is the same as for the new serial algorithm, namely $2n/3$.
Proposition 2. No two directly adjacent interior vertices of a TIN both have exactly three surrounding triangles; at least one of them must have more than three surrounding triangles.
Theorem 6. For every partitioning path in the graph of the TIN that begins at a boundary node and ends at another boundary node, it is possible to remove the vertices of the path that have at least four surrounding triangles, beginning from a boundary node and continuing until the boundary node on the other side is reached. This can be done in a way that does not create a hole in the graph of the TIN.
Proof. Suppose that i is a boundary node of the TIN and j is an interior node connected to it, both on the partitioning path. The following cases are possible:
Fig. 4. Different cases of a boundary node (i) connected to an interior node (j)
Fig. 5. Different types of partitioning path
In the case that vertex i has more than three surrounding triangles, removing it creates no hole in the TIN (Figure 4 A). If vertex i has exactly three surrounding triangles (Figure 4 B) and j has at least four surrounding triangles, then removing j removes more than three triangles and the boundary remains connected. Otherwise, if vertex j has three surrounding triangles, the partitioning path can extend through m or n. If it extends through the boundary vertex m, the different conditions have to be checked again starting from this vertex. Otherwise, if j is connected to the interior node n, then according to Proposition 2 vertex n must have more than three surrounding triangles, so removing it leaves the boundary connected. When two triangles surround vertex i (Figure 4 C) and vertex j has more than three surrounding triangles, removing j removes more than three triangles without creating a hole in the TIN. In the case that j has three surrounding triangles, the partitioning path must continue through boundary vertex m or n, so the different conditions have to be checked again starting from these nodes. Considering the cases above, in each step it is possible to remove a vertex with at least four surrounding triangles on the partitioning path without creating a hole in the TIN, until the boundary node on the other side is reached.
Theorem 7. After removing the vertices with at least four surrounding triangles along the partitioning path, the P sub-graphs (P being the number of processors) have at most P common points, and at most three vertices can take part in each common point.
Proof. The partitioning path can have one of the following forms:
1) The partitioning path begins and ends at a boundary vertex (Figure 5 A).
2) The partitioning path begins at a boundary vertex and ends at a vertex of the partitioning path (Figure 5 B).
Fig. 6. Two sub-graphs having common point
3) The partitioning path begins and ends at a vertex of the partitioning path (Figure 5 C).
At the end points of the partitioning path, some vertices with fewer than four surrounding triangles may remain from the partitioning path. As we always begin from a boundary vertex, at most P common points are created among the sub-graphs after the qualified vertices are removed from the TIN. Now we show that at most three vertices can participate in each common point. Suppose that two sub-graphs have more than three common vertices (Figure 6). If i is a boundary node at the end of the partitioning path, no other vertex near i on the path can be a boundary vertex, because there would then be a shorter partitioning path, which we assumed does not exist. So vertices j and m must be interior nodes (Figure 6). This is not possible, because according to Proposition 2 one of these vertices must have at least four surrounding triangles, which is a contradiction. So sub-graphs can have at most three common vertices at their common point. Comparing our new serial and parallel algorithms, it is clear that a vertex removed in phase two of the parallel algorithm is of the same kind as a vertex removed in phase two of the serial algorithm. So we can regard the vertices removed in phase two of the parallel algorithm as a subset of those removed in phase two of the serial algorithm. The remaining part of the parallel algorithm is the same as its serial counterpart, carried out on separate sections. We can also ignore the common points and the extra guards they sometimes add because, according to Theorem 7, their total is bounded by a constant. Now we can present the following theorem.
Theorem 8. The upper bound on the number of guards selected by the new parallel algorithm is $2n/3$ in all TINs.
We have implemented our new parallel algorithm using the MPI (Message Passing Interface) environment. In MPI, communication between processors is done by sending messages through an interconnection network.
More details on MPI and on programming with MPI can be found in [13,15]. The parallel algorithm was implemented on a 24-node PC cluster running MPI. Each computer had a 2.4 GHz P4 processor and 512 MB of physical memory. Different graph partitioning schemes exist [9]; among them we used the METIS [11] mesh partitioner to partition the input TIN. Figure 7 shows the speedup obtained by multiple runs on different numbers of processors.
Fig. 7. The speedup obtained by the new parallel algorithm
References
1. Biedl, T., Bose, P., Demaine, E., Lubiw, A.: Efficient algorithms for Petersen's Matching Theorem. In: Proc. of Symp. on Discrete Algorithms, pp. 130–139 (1999)
2. Bose, P., Kirkpatrick, D., Li, Z.: Worst-Case-Optimal Algorithms for Guarding Planar Graphs and Polyhedral Surfaces (1999)
3. Bose, P., Shermer, T., Toussaint, G., Zhu, B.: Guarding polyhedral terrains. Computational Geometry - Theory and Applications 7(3), 173–185 (1997)
4. Chiba, N., Nishizeki, T., Saito, N.: A linear five coloring algorithm for planar graphs. Journal of Algorithms (1981)
5. Chvátal, V.: A combinatorial theorem in plane geometry. Journal of Comb. Theory Ser. B 18, 39–41 (1975)
6. Cole, R., Sharir, M.: Visibility Problems for Polyhedral Terrains. Journal of Symbolic Computation 7, 11–30 (1989)
7. de Floriani, L., Falcidieno, B., Pienovi, C., Allen, D., Nagy, G.: A visibility-based model for terrain features. In: Proc. 2nd International Symposium on Spatial Data Handling, pp. 235–250 (1986)
8. Fisk, S.: A short proof of Chvátal's watchman theorem. Journal of Comb. Theory Ser. B 24, 374 (1978)
9. Fjällström, P.O.: Algorithms for Graph Partitioning: A Survey. Linköping Electronic Articles in Computer and Information Science (1998)
10. Heath, M.T., Ranade, A., Schreiber, R.S.: Algorithms for parallel processing. Springer, Heidelberg (1998)
11. Karypis, G., Kumar, V.: METIS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Version 4.0. University of Minnesota, September 20 (1998)
12. O'Rourke, J.: Art Gallery Theorems and Algorithms. Oxford University Press, Oxford (1987)
13. Pacheco, P.S.: Parallel Programming with MPI. Morgan Kaufmann, San Francisco (1997)
14. Parhami, B.: Introduction to Parallel Processing: Algorithms and Architectures. Springer, Heidelberg (1999)
15. Quinn, M.J.: Parallel Programming in C with MPI and OpenMP. McGraw-Hill Professional, New York (2004)
16. Shermer, T.: Recent results in art galleries. Proc. IEEE, Special Issue on Comp. Geom. (1992)
17. Urrutia, J.: Art Gallery and Illumination Problems. In: Handbook on Comp. Geom. Elsevier, Amsterdam (2000)
JaCk-SAT: A New Parallel Scheme to Solve the Satisfiability Problem (SAT) Based on Join-and-Check
Daniel Singer and Anthony Monnet
LITA - EA 3097, Université Paul Verlaine - Metz, Île du Saulcy, 57 045 Metz cedex, France
Abstract. This paper presents and investigates for the first time a new track for the parallel solving of the Satisfiability problem, based on a simple and efficient structural decomposition heuristic. A new Joining and model Checking scheme (JaCk-SAT) is introduced. The main goal of this methodology is to recursively cut the variable-set into two subsets of about equal size. On the one hand, in contrast with recent proposals [12,16] for sequential resolution, we do not use sophisticated hypergraph decomposition techniques such as Tree Decomposition, which are very likely infeasible. On the other hand, in contrast with all the current proposals [27] for parallel resolution, we make use of a structural decomposition (of the problem) instead of a search space decomposition. The very first preliminary results of this new approach are presented.
1 Introduction
The propositional Satisfiability problem (SAT) is certainly the most studied problem in computer science, since it was the first problem proven to be NP-complete, by S. Cook in 1971. Nowadays, the Satisfiability problem is of great practical importance in a wide range of disciplines, including hardware verification, artificial intelligence and cryptography, and it is especially important in the area of Electronic Design Automation (EDA). There is an increasing demand in industry for high-performance SAT-solving algorithms to solve huge and harder problems that show an exponential explosion of the search space. Unfortunately, most state-of-the-art solvers are sequential and few are parallel. In our view, finding “good decompositions” for parallel resolution is the real current challenge for researchers handling SAT applications. There is still a gap between our current ability to handle instances with hundreds of variables and real instances with hundreds of thousands of variables! This article introduces a new way to decompose a SAT problem with a simple and efficient methodology that does not need a complex hypergraph decomposition. It should be noted that this general framework can also be applied to Constraint Satisfaction Problems (CSP).
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 249–258, 2008. © Springer-Verlag Berlin Heidelberg 2008
The remainder of this paper is organized as follows. Section 2 briefly introduces the related work and the motivations for this new approach. Section 3 presents the main steps of the JaCk-SAT scheme based on Joining and model-Checking. Section 4 briefly describes the main methods proposed to parallelize the core sequential algorithms and gives the very first experimental results of this new track for the parallel resolution of SAT. Finally, Section 5 summarizes the whole proposal and points to some perspectives.
2 Related Work and Motivations
Decomposition methods have been applied to a variety of combinatorial problems such as CSP, especially for parallel resolution [14], but very little work exists in this direction for SAT. [2] used decompositions based on some approximation of the most constrained subproblems (those with the least number of solutions). Complex algorithms to find Minimum Vertex Separators or 0-1 fractional programming optimization decompose a formula into a tree of partitions (subproblems). Then a DPLL algorithm¹ runs with a branching heuristic based on this decomposition, solving the most constrained subformulas first. Finally, it performs a compilation procedure for each subproblem and joins the solutions to answer the initial problem. Unfortunately, we are not aware of any experimental results for these proposals. [6,15,16] study the applicability of the Tree Decomposition method to SAT, but only in the sequential case. Tree decomposition [7] is a graph-theoretic concept that captures the topological structure of the formula represented as a hypergraph. Formulas that have bounded treewidth can be checked for satisfiability in linear time [8]. [16] recognizes the current infeasibility of the Tree Decomposition method for SAT, but [6] considers its integration into modern DPLL solvers, mainly to guide the variable ordering process. Notably, it does not perform as well in terms of runtime. [12,20] also use hypergraph decomposition methods to obtain a balanced clause-set bipartition or an unbalanced tripartition, but only to guide the SAT solver inside the subproblems. A state-of-the-art hypergraph partitioning tool is used and preliminary results are presented. More recently, [5] presents a connected-components decomposition, but only to improve sequential performance. Certainly because of the number of limitations of the structural approach, all the current proposals for parallel resolution of SAT use a Search Space Decomposition scheme.
This simple methodology makes it possible to conduct independent searches that may cooperate (see [27] for a recent review of this approach). Here, we introduce a new methodology based on a different but feasible Structural Decomposition of the underlying problem graph. It yields much simpler subproblems that can be solved independently, at the cost of an additional Joining and Model Checking phase.
¹ Davis-Putnam-Logemann-Loveland procedure [9].
3 The New Decomposition Scheme
Let V = {v1, v2, …, vn} be a set of n boolean variables. A (partial) truth assignment τ for V is a (partial) function V → {True, False}. Corresponding to each variable v are two literals, v and ¬v, called the positive and negative literals. A clause C is a set of literals interpreted as a disjunction. A formula F is a set of clauses interpreted as a Conjunctive Normal Form (CNF) of a formula of the propositional calculus. A truth assignment τ satisfies a formula F (τ is a solution) iff it satisfies every clause in F. A truth assignment τ satisfies a clause C iff it satisfies at least one literal in C.
Definition 1. The Satisfiability Problem (SAT):
- Input: A set of Boolean variables V and a set of clauses C over V.
- Output: Yes (with a satisfying truth assignment τ for C, if one exists) or No.
Here we present the new Joining and model Checking scheme (JaCk-SAT) for parallel resolution of the SATisfiability problem². The main goal of this methodology is to divide the variable-set V (with |V| = n) into two subsets V1, V2 of about equal size (|V1| ∼ |V2| ∼ n/2). In contrast with the recent proposals [12,16] (for sequential resolution), we do not use sophisticated hypergraph decompositions such as Tree Decomposition, which are very likely infeasible or not yet convincing (“they are far too expansive” [15]). The two fundamental differences are the following. First, remembering that the inherent exponential behaviour of SAT depends on n, we do not search for decompositions of the clause-set but of the variable-set. Second, we do not impose a bipartition of the clause-set, which in general induces a large joining variable-set; instead, we define a tripartition of the clause-set. A recursive use of this scheme yields more than two subproblems. One advantage of this new scheme is its great flexibility to fit the target parallel architecture: a single multi-processor (multi-core) machine, a cluster of dozens of machines, or a grid with thousands of machines.
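The definitions of literal, clause, formula and satisfaction given above can be written down compactly. The following Python sketch is only an illustration: the signed-integer encoding of literals (+i for v_i, -i for ¬v_i) and the function names are assumptions of this sketch, not notation from the paper.

```python
def satisfies_clause(assignment, clause):
    """True iff at least one literal of the clause is true under the assignment.

    A literal is a nonzero int: +i stands for v_i, -i for the negation of v_i.
    `assignment` maps a variable index to a bool and must cover the clause.
    """
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)


def satisfies_formula(assignment, formula):
    """True iff every clause of the CNF formula is satisfied."""
    return all(satisfies_clause(assignment, clause) for clause in formula)


# Hypothetical two-clause formula: (v1 or not v2) and (v2 or v3).
formula = [{1, -2}, {2, 3}]
tau = {1: True, 2: True, 3: False}
assert satisfies_formula(tau, formula)
```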
Let P = (V, C) be the initial SAT problem. The decomposition step builds the two subproblems P1 = (V1, C1), P2 = (V2, C2) and a residual clause-set C3. We have V = V1 ∪ V2, but not V = V1 ⊕ V2, meaning that V1 ∩ V2 ≠ ∅. We denote by Vjs = V1 ∩ V2 the Join variable-set; it is not necessarily a cutset of V. The clause-sets are
C1 = {c ∈ C | ∀x ∈ c, x ∈ V1},
C2 = {c ∈ C | ∀x ∈ c, x ∈ V2},
C3 = C\(C1 ∪ C2) ⊆ {c ∈ C | ∃x ∈ c, x ∈ V1 and ∃y ∈ c, y ∈ V2}.
S (resp. S1, S2) represents the possibly empty set of solutions of P (resp. P1, P2). This decomposition results in searching independently for all the solutions of both subproblems P1 and P2, then trying to join them to obtain global solutions of P, while checking the satisfiability of the residual clauses C3. The major drawbacks of this method to be overcome are two-fold. First, it searches for all the solutions of problems with n/2 variables, instead of one solution of a problem with n variables. Second, the additional Join-and-Check step is polynomial in the number of solutions but may be exponential in terms of n.
² In fact it solves #SAT, finding all solutions of the problem.
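The clause-set tripartition defined above can be sketched as follows, assuming clauses are collections of signed integers (+i for v_i, -i for ¬v_i) and V1, V2 are sets of variable indices. The names `variables_of` and `tripartition` are hypothetical helpers introduced for this illustration.

```python
def variables_of(clause):
    """Set of variable indices occurring in a clause of signed-integer literals."""
    return {abs(lit) for lit in clause}


def tripartition(clauses, v1, v2):
    """Split `clauses` into (C1, C2, C3) for the variable subsets v1 and v2.

    C1 (resp. C2) holds the clauses whose variables all lie in v1 (resp. v2);
    a clause over the join variables only belongs to both. C3 holds the
    residual clauses that are in neither C1 nor C2.
    """
    c1 = [c for c in clauses if variables_of(c) <= v1]
    c2 = [c for c in clauses if variables_of(c) <= v2]
    c3 = [c for c in clauses
          if not (variables_of(c) <= v1 or variables_of(c) <= v2)]
    return c1, c2, c3


# Hypothetical toy instance with V1 = {1, 2}, V2 = {2, 3} and Vjs = {2}.
toy = [(1, -2), (2, 3), (1, 3), (-2,)]
c1, c2, c3 = tripartition(toy, {1, 2}, {2, 3})
```

In this toy instance the unit clause on v2 lands in both C1 and C2, and only the clause mixing v1 and v3 is residual.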
1. Decompose(P, P1, P2, Vjs, C3)
2. //SearchAllSol((P1, S1), (P2, S2))
3. //Join-and-Check(S1, S2, Vjs, S, C3)
Fig. 1. Main Algorithm of JaCk-SAT
3.1 The Decomposition Step
The main goal of this decomposition step is to extract an approximation of the most constrained subproblem P1 = (V1, C1) with |V1| ∼ n/2. This makes our decomposition into P1 and P2 non-symmetric, but it is justified by the possibility of detecting the unsatisfiability of the formula early, and by the complexity of the subsequent Joining step, which depends on the number of solutions of the subproblems. We have experimented with different heuristics to obtain the most constrained subproblem P1 with |V1| ∼ n/2 variables and to control the key parameters, which are |Vjs|, |C3|, |S1| and |S2|. Moreover, applying this decomposition scheme recursively makes it possible to fit the desired parallel granularity. A recursion depth of 1 produces 2 subproblems P1 and P2, a depth of 2 applies the algorithm to P1 and P2, giving 4 subproblems, and so on. Heuristics may be variable- or literal-oriented, and may be statically or dynamically defined. Each heuristic consists of the following five steps (see Fig. 2 below). The main idea is to start (step 1) with the selection of (n ∗ mos) variables or literals, called kernel variables or literals, in order to initialize V1. A simplified example is presented in Sect. 3.3, and a first experimental study of the different parameters of the sequential algorithm was reported in [28].
1. Select (statically or dynamically) the (n ∗ mos) kernel variables (or literals) with the maximal number of occurrences to initialize V1
2. Complete (statically or dynamically) V1 until |V1| = n ∗ (1 + js)/2 with variables (or literals) with the maximal number of co-occurrences with V1 (this then gives C1)
3. Initialize V2 with V\V1
4. Complete (statically or dynamically) V2 until |V2| = n ∗ (1 + js)/2 with variables (or literals) with the maximal number of co-occurrences with V2 (this then gives C2); these added variables are those of Vjs
5. C3 is C\(C1 ∪ C2)
Fig. 2. Decomposition Algorithm
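Steps 1 to 4 of the decomposition algorithm can be sketched in Python for the static, variable-oriented case. This is a minimal sketch under stated assumptions, not the authors' implementation: clauses are tuples of signed integers, co-occurrences are counted once against the initial set (the "static" reading), ties are broken by the smallest variable index, and the sizes n ∗ mos and n ∗ (1 + js)/2 are rounded to the nearest integer. All function names are hypothetical.

```python
def variables_of(clause):
    """Set of variable indices occurring in a clause of signed-integer literals."""
    return {abs(lit) for lit in clause}


def cooccurrences(var, group, clauses):
    """Number of clauses containing `var` together with a variable of `group`."""
    return sum(1 for c in clauses
               if var in variables_of(c) and variables_of(c) & group)


def complete(seed, candidates, target_size, clauses):
    """Statically extend `seed` with the candidates that co-occur most with it."""
    ranked = sorted(candidates,
                    key=lambda v: (-cooccurrences(v, seed, clauses), v))
    return seed | set(ranked[:max(0, target_size - len(seed))])


def decompose_variables(n, clauses, mos, js):
    """Return (V1, V2) for variables 1..n; Vjs is then V1 & V2."""
    variables = set(range(1, n + 1))
    occurrences = {v: sum(1 for c in clauses if v in variables_of(c))
                   for v in variables}
    kernel_size = round(n * mos)          # step 1: number of kernel variables
    target = round(n * (1 + js) / 2)      # common size of V1 and V2
    kernel = set(sorted(variables,
                        key=lambda v: (-occurrences[v], v))[:kernel_size])
    v1 = complete(kernel, variables - kernel, target, clauses)  # step 2
    v2 = complete(variables - v1, v1, target, clauses)          # steps 3 and 4
    return v1, v2
```

Applied to the twelve clauses of Sect. 3.3 with mos = js = 0.33, this sketch returns V1 = {v3, v4, v5, v6} and V2 = {v1, v2, v4, v6}, in agreement with the worked example. The clause-sets C1, C2 and C3 then follow directly from their definitions.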
3.2 The Parallel Join-and-Check Step
Since the invention of relational database systems, tremendous effort has been undertaken in order to develop efficient join algorithms. Starting from a simple nested-loop, introduction of sorting, hashing and index structures are the main improvements (see [22]). The key challenge of this approach is clearly the
memory explosion obtained for S1 and S2 to be joined. It is important to note that parallelizing the resolution with //join divides this explosion as much as possible, thus minimizing this major obstacle. After trying a number of versions without success, because of the inherent time and space limitations of the sequential prototyping study presented in [28], we developed an algorithm and data structures largely inspired by the work of M. Bamha on Parallel Join Algorithms in [3]. In order to obtain a better overall performance, step 3 of the main algorithm (Fig. 1) performs the Joining and Checking steps at the same time. Thus S1 and S2 are filtered by the residual clause-set C3 to be satisfied. More precisely, the variables of Vjs are first considered as a single compacted attribute, and each assignment of these variables is one single vector (in the C++ terminology). At this point, it is important to note that (to the best of our knowledge) there is no join algorithm dedicated to boolean-valued attributes, which is the reason why we had to make this encoding step. Let Solution-Set be an object containing the solution-vector sol and an index Index that associates with each possible value of the joining attribute (Vjs) the corresponding set of solution numbers. This index is systematically updated each time a new solution is added to Solution-Set. Before the Join-and-Check step begins, we get the solution-sets S1 and S2 from the previous step 2. Fig. 3 below gives a simplified version of the //Join-and-Check Algorithm, which is illustrated by the example in the next Sect. 3.3. We do not develop here all the specific parallel methodology and refer to [23] for more details.
1. Compute locally the histograms hist1 and hist2 for S1 and S2 respectively from (hash)Index, then compute their intersecting joining histogram: joinHist
2. Compute sols1Constraints (resp. sols2Constraints), the set of clauses from C3 restricted to variables of V1 (resp. V2)
   /* a joining solution of sol1 ∈ S1 and sol2 ∈ S2 satisfies C3 iff ∀i ∈ [1...|C3|] */
   /* sol1 satisfies sols1Constraints[i] or sol2 satisfies sols2Constraints[i] */
3. forall// joinValue ∈ joinHist do
     forall index i ∈ S1.Index[joinValue] do
       constr ← ∅ /* a set of integers */
       for k from 1 to |C3| do
         If Checking(S1[i], sols1Constraints[k]) = False Then constr ← constr + k
       forall index j ∈ S2.Index[joinValue] do
         If ∀k ∈ constr, Checking(S2[j], sols2Constraints[k]) = True Then S ← S + S1[i] ⋈ S2[j]
Fig. 3. //Join-and-Check Algorithm
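The Join-and-Check loop of Fig. 3 can be sketched sequentially in Python. This is a simplified illustration and not the authors' C++/MPI code: solutions are assumed to be dictionaries mapping a variable index to 0/1, clauses are tuples of signed integers, and a plain dictionary stands in for the hashed Index. The outer loop over join values is the part that the paper runs in parallel; here it is an ordinary loop. All names are hypothetical.

```python
from collections import defaultdict


def satisfies(assignment, clause):
    """True iff some literal of `clause` is true in the (partial) assignment.

    Literals over unassigned variables never count as satisfied, which
    restricts the clause to the variables the assignment covers.
    """
    return any(assignment.get(abs(lit)) == (lit > 0) for lit in clause)


def build_index(solutions, join_vars):
    """Map each value of the join attribute to the positions of its solutions."""
    index = defaultdict(list)
    for pos, sol in enumerate(solutions):
        index[tuple(sol[v] for v in join_vars)].append(pos)
    return index


def join_and_check(s1, s2, join_vars, c3):
    """Join solutions agreeing on `join_vars`, keeping those that satisfy C3."""
    index1 = build_index(s1, join_vars)
    index2 = build_index(s2, join_vars)
    result = []
    # joinHist: only join values present on both sides can produce a join.
    for join_value in index1.keys() & index2.keys():
        for i in index1[join_value]:
            # Residual clauses not already satisfied by the S1 part alone.
            pending = [c for c in c3 if not satisfies(s1[i], c)]
            for j in index2[join_value]:
                if all(satisfies(s2[j], c) for c in pending):
                    result.append({**s1[i], **s2[j]})
    return result
```

Computing the pending clauses once per S1 solution mirrors the `constr` set of Fig. 3: each candidate partner from S2 is checked only against the residual clauses that the S1 part leaves open.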
3.3 A Simplified Example
CNF problem P with |V| = n = 6 and |C| = 12:

c1 : v1 ∨ ¬v2 ∨ ¬v4            c7  : ¬v1 ∨ v2 ∨ ¬v3 ∨ v4 ∨ ¬v5 ∨ v6
c2 : v3 ∨ ¬v6                  c8  : v1 ∨ ¬v3
c3 : ¬v2 ∨ v4 ∨ v6             c9  : v3 ∨ ¬v5 ∨ ¬v6
c4 : v2 ∨ ¬v4 ∨ ¬v5 ∨ ¬v6      c10 : ¬v1 ∨ v6
c5 : v3 ∨ v4 ∨ v5              c11 : ¬v4 ∨ ¬v6
c6 : v1 ∨ v5                   c12 : ¬v3 ∨ v5
The Decomposition Step parameters are : static and variable oriented heuristic, mos = 33% (|Vjs | = 2) and js = 33% (|V1 | = |V2 | = 4).
1. Selection of kernel variables:

variable   occurrences
v1         5   (c1, c6, c7, c8, c10)
v2         4   (c1, c3, c4, c7)
v3         6   (c2, c5, c7, c8, c9, c12)
v4         6   (c1, c3, c4, c5, c7, c11)
v5         6   (c4, c5, c6, c7, c9, c12)
v6         7   (c2, c3, c4, c7, c9, c10, c11)

V1 ← {v3, v6}

2. Completion of V1:

variable   co-occurrences with v3 and/or v6
v1         3   (c7, c8, c10)
v2         3   (c3, c4, c7)
v4         5   (c3, c4, c5, c7, c11)
v5         5   (c4, c5, c7, c9, c12)

V1 ← V1 ∪ {v4, v5} = {v3, v4, v5, v6}
C1 ← {c | ∀v ∈ c, v ∈ V1} = {c2, c5, c9, c11, c12}

3. Initialization of V2:

V2 ← V\V1 = {v1, v2}

4. Completion of V2:

variable   co-occurrences with v1 and/or v2
v3         2   (c7, c8)
v4         4   (c1, c3, c4, c7)
v5         3   (c4, c6, c7)
v6         4   (c3, c4, c7, c10)

V2 ← V2 ∪ {v4, v6} = {v1, v2, v4, v6}
C2 ← {c | ∀v ∈ c, v ∈ V2} = {c1, c3, c10, c11}
Vjs ← V1 ∩ V2 = {v4, v6}

5. The residual clause-set:

C3 ← C\(C1 ∪ C2) = {c4, c6, c7, c8}
//Solving of P1

S1      (v3, v4, v5, v6)   join value (v4, v6)
sol11   (1, 0, 1, 0)       (0,0)
sol12   (1, 0, 1, 1)       (0,1)
sol13   (1, 1, 1, 0)       (1,0)
sol14   (0, 0, 1, 0)       (0,0)
sol15   (0, 1, 0, 0)       (1,0)
sol16   (0, 1, 1, 0)       (1,0)

(v4, v6)   hist1   Index
(0, 0)     2       {1,4}
(0, 1)     1       {2}
(1, 0)     3       {3,5,6}
(1, 1)     0       {}

//Solving of P2

S2      (v1, v2, v4, v6)   join value (v4, v6)
sol21   (0, 0, 0, 0)       (0,0)
sol22   (0, 0, 0, 1)       (0,1)
sol23   (0, 0, 1, 0)       (1,0)
sol24   (1, 0, 0, 1)       (0,1)
sol25   (0, 1, 0, 1)       (0,1)
sol26   (1, 1, 0, 1)       (0,1)

(v4, v6)   hist2   Index
(0, 0)     1       {1}
(0, 1)     4       {2,4,5,6}
(1, 0)     1       {3}
(1, 1)     0       {}

//Join-and-check

join value (v4, v6)   solution        (v1, v2, v3, v4, v5, v6)   verifies C3?
(0,0)                 sol11 ⋈ sol21   (0, 0, 1, 0, 1, 0)         no (c8)
(0,0)                 sol14 ⋈ sol21   (0, 0, 0, 0, 1, 0)         yes
(0,1)                 sol12 ⋈ sol22   (0, 0, 1, 0, 1, 1)         no (c8)
(0,1)                 sol12 ⋈ sol24   (1, 0, 1, 0, 1, 1)         yes
(0,1)                 sol12 ⋈ sol25   (0, 1, 1, 0, 1, 1)         no (c8)
(0,1)                 sol12 ⋈ sol26   (1, 1, 1, 0, 1, 1)         yes
(1,0)                 sol13 ⋈ sol23   (0, 0, 1, 1, 1, 0)         no (c8)
(1,0)                 sol15 ⋈ sol23   (0, 0, 0, 1, 0, 0)         no (c6)
(1,0)                 sol16 ⋈ sol23   (0, 0, 0, 1, 1, 0)         yes
Conclusion: the problem is satisfiable with 4 distinct satisfying assignments.
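This conclusion can be cross-checked independently of the decomposition by enumerating all 2^6 assignments of the example. The sketch below assumes a signed-integer encoding of the twelve clauses (+i for v_i, -i for ¬v_i); the function name `all_models` is hypothetical.

```python
from itertools import product

# Clauses c1..c12 of the example; +i stands for v_i and -i for its negation.
CLAUSES = [
    (1, -2, -4), (3, -6), (-2, 4, 6), (2, -4, -5, -6), (3, 4, 5), (1, 5),
    (-1, 2, -3, 4, -5, 6), (1, -3), (3, -5, -6), (-1, 6), (-4, -6), (-3, 5),
]


def all_models(clauses, n):
    """Every assignment of v_1..v_n (as a 0/1 tuple) satisfying all clauses."""
    models = []
    for bits in product((0, 1), repeat=n):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            models.append(bits)
    return models


models = all_models(CLAUSES, 6)
print(len(models))  # prints 4
```

The four models found this way are exactly the four joined solutions marked "yes" in the Join-and-check table.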
4 Parallel Resolution of SAT: Preliminary Results
This section presents the first experimental results of this new decomposition scheme for solving the Satisfiability problem in parallel. Our main objective here is only to show the feasibility of the general approach and to point out the challenging efforts that remain to be made in order to deal in the future with problems having thousands of variables. One major advantage of this approach is that it does not depend on the solver, and we mention below some experience with two state-of-the-art solvers (for all-solutions), kcnfs [11] and relsat [4]. The different decomposition strategies in step 1 of the main algorithm (Fig. 2) can be characterized by the variable selection mode: dynamic or static, and variable-oriented or literal-oriented. Different variants have been defined and tested for each mode. The Join-Set size parameter (js) is crucial to obtain a “good decomposition”, and tuning it is not easy, as the results show a great variability of behaviour depending on the problem structure and size, the decomposition strategy and the solver.
We present below some results for well-known benchmark problems used by the SAT community that can be found in SATLIB [1]. uuf75 and dubois30 are hard unsatisfiable instances. hole9 is a structured unsatisfiable problem from the pigeon-hole series and pret60 is a SAT encoding of a graph-colouring problem from the Pretolani series. Early results found a dynamic and variable-oriented strategy (DynVar) to be the best one. Note the very interesting result it obtained for the pret60 instances, where it has seemingly discovered a separating-variable set (V_js) of the underlying graph to be coloured. For the instances hole9 and uuf75, which have more variables, the memory explosion of |S1| and |S2| led to unmanageably large sets of solutions, which explains the poor results. In the uuf75 instance we have |S1| = 393,926 and |S2| = 31,774,621, and 11,044,382 solutions have been joined-and-checked!
The decomposition depth and solvers study
Table 1 shows comparative results for both solvers kcnfs and relsat for different depths of decomposition, as argued in Section 3.1. The direct column corresponds to direct solving by the sequential solver on a single processor. The depth1 seq and depth2 seq columns correspond to sequential solving after a decomposition into 2 or 4 subproblems, and the depth1 par and depth2 par columns correspond to parallel solving after the same decompositions. Parallel solving with a depth i decomposition corresponds to running with 2^i processors. The parallel algorithms are implemented in C++ (except the solvers) under (LAM)MPI and run on a Linux cluster of 4 PCs with 512 MB of RAM at 2.4 GHz and a 100 Mbit Ethernet internal network.
Table 1. Solving of various problems using a DynVar strategy with various depths (join size: 20% except for uuf75-073: 30%, mos: 10%)
prob       vars  clauses  solver  direct     depth1 seq  depth1 par  depth2 seq  depth2 par
pret60_25  60    160      kcnfs   3.56       1.56        1.22        0.24        0.16
                          relsat  0.0        2.94        0.74        0.24        0.16
pret60_40  60    160      kcnfs   3.58       1.62        1.22        0.24        0.16
                          relsat  0.0        3.02        0.72        0.24        0.16
pret60_60  60    160      kcnfs   3.58       1.59        1.21        0.25        0.16
                          relsat  0.0        2.95        0.76        0.25        0.15
pret60_75  60    160      kcnfs   3.52       1.61        1.2         0.24        0.15
                          relsat  0.0        2.95        0.72        0.26        0.15
dubois30   54    140      kcnfs   2,748.880  88.03       52.3        7.24        2.62
                          relsat  0.0        111.89      28.97       7.94        2.6
hole9      90    415      kcnfs   4.13       553.53      541.49      513.94      271.19
                          relsat  14.0       1,167.78    500.54      524.28      268.05
uuf75-073  75    325      kcnfs   0.0        310.72      862.95      601.47      455.55
                          relsat  0.0        3,671.05    1,373.31    602.26
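As a reading aid for Table 1, the gain from parallel execution at a given depth can be expressed as the ratio of the sequential to the parallel time after the same decomposition, and related to the 2^i processors used at depth i. The sketch below does this for three kcnfs rows copied from the table. The speedup and efficiency definitions are the usual ones and are our assumption, not measures reported by the authors.

```python
# kcnfs timings copied from Table 1:
# (depth1 seq, depth1 par, depth2 seq, depth2 par)
TIMES = {
    "pret60_25": (1.56, 1.22, 0.24, 0.16),
    "dubois30": (88.03, 52.3, 7.24, 2.62),
    "hole9": (553.53, 541.49, 513.94, 271.19),
}


def speedup_and_efficiency(t_seq, t_par, processors):
    """Speedup = t_seq / t_par; efficiency = speedup / processors."""
    speedup = t_seq / t_par
    return speedup, speedup / processors


# RESULTS[name] = ((speedup, efficiency) at depth 1, same at depth 2)
RESULTS = {}
for name, (d1_seq, d1_par, d2_seq, d2_par) in TIMES.items():
    depth1 = speedup_and_efficiency(d1_seq, d1_par, 2)  # 2^1 processors
    depth2 = speedup_and_efficiency(d2_seq, d2_par, 4)  # 2^2 processors
    RESULTS[name] = (depth1, depth2)
    print(f"{name}: depth1 x{depth1[0]:.2f} (eff. {depth1[1]:.2f}), "
          f"depth2 x{depth2[0]:.2f} (eff. {depth2[1]:.2f})")
```

For dubois30, for instance, this gives a speedup of roughly 1.68 on 2 processors and 2.76 on 4, whereas hole9 barely benefits at depth 1.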
5 Conclusion and Future Work
This paper presents and investigates for the first time a new track for parallel solving of the Satisfiability problem based on a simple and efficient structural decomposition. It defines a general Join-and-Check schema and puts forward preliminary proposals for a parallel prototype. Moreover, this new general approach can also be applied to Constraint Satisfaction Problems (CSP). Early results of an experimental evaluation show the feasibility and interest of this approach, which justifies a deeper study of its scalability. They also make clear that a number of substantial improvements should be made, the first being to define better solvers that efficiently generate all solutions, as mentioned in [19]. Next, using and developing better algorithms and data structures for the crucial join step, especially in a parallel and distributed execution framework, should significantly improve performance. Last but not least, we hope that this work provides enough motivation for other researchers to follow this new direction.
Acknowledgments. The authors would like to thank M. Bamha from Orléans University (LIFO) for a number of useful discussions on the join algorithms that are intensively studied in the database area. We also want to thank G. Dequen from Amiens University (LARIA) for adapting kcnfs to enumerate all solutions.
References
1. http://www.cs.ubc.ca/~hoos/SATLIB/
2. Amir, E., McIlraith, S.: Solving Satisfiability using Decomposition and the Most Constrained Subproblem. In: IJCAI 2001 Workshop on Distributed Constraint Reasoning, online http://www.cs.berkeley.edu/~eyal/paper.html
3. Bamha, M.: An optimal skew-insensitive join and multi-join algorithm for distributed architecture. In: Andersen, K.V., Debenham, J., Wagner, R. (eds.) DEXA 2005. LNCS, vol. 3588, pp. 616–625. Springer, Heidelberg (2005)
4. Bayardo Jr., R.J., Pehoushek, J.D.: Counting Models Using Connected Components. In: Proc. 17th. Nat. Conf. on AI (AAAI 2000) (2000)
5. Biere, A., Sinz, C.: Decomposing Sat Problems into Connected Components. Journal on Satisfiability, Boolean Modeling and Computation 2, 201–208 (2006)
6. Bjesse, P., Kukula, J., Damiano, R., Stanion, T., Zhu, Y.: Diagnosis with Tree Decomposition. In: Giunchiglia, E., Tacchella, A. (eds.) SAT 2003. LNCS, vol. 2919, Springer, Heidelberg (2004)
7. Bodlaender, H.: A Tourist Guide through Treewidth. Acta Cybernetica 11 (1993)
8. Darwiche, A.: Compiling knowledge into decomposable negation normal form. In: Proc. of 15th. IJCAI, pp. 284–289 (1999)
9. Davis, M., Logemann, G., Loveland, D.: A machine program for Theorem Proving. Com. of the ACM 5(7) (1962)
10. del Val, A.: Simplifying Binary Propositional Theories into Connected Components Twice as Fast. In: Nieuwenhuis, R., Voronkov, A. (eds.) LPAR 2001. LNCS (LNAI), vol. 2250, pp. 389–403. Springer, Heidelberg (2001)
11. Dequen, G., Dubois, O.: kcnfs: an efficient solver for random k-SAT formulae. In: Giunchiglia, E., Tacchella, A. (eds.) SAT 2003. LNCS, vol. 2919, pp. 486–501. Springer, Heidelberg (2004)
12. Durairaj, V., Kalla, P.: Exploiting Hypergraph Partitioning for Efficient Boolean Satisfiability. In: Proc. 9th. IEEE Int. High Level Design Validation and Test Workshop, pp. 141–146 (2004)
13. Gummadi, R., Narayanaswamy, N.S., Venkatakrishnan, R.: Algorithms for Satisfiability Using Independent Sets of Variables. In: Hoos, H.H., Mitchell, D.G. (eds.) SAT 2004. LNCS, vol. 3542, p. 2005. Springer, Heidelberg (2005)
14. Habbas, Z., Krajecki, M., Singer, D.: Decomposition Techniques for Parallel Resolution of Constraint Satisfaction Problems in Shared Memory: a Comparative Study. Int. Jour. of Computational Science and Engineering (IJCSE) 1(2-4), 192–206 (2005)
15. Herwig, P.R.: Decomposing satisfiability problems. Master’s Thesis, Delft University of Technology (2006)
16. Heule, M., Kullmann, O.: Decomposing clause-sets: Integrating DLL algorithms, tree decompositions and hypergraph cuts for variable and clause-based graph representations of CNF’s. TR. CSR 2-2006, Univ. Wales Swansea (2006)
17. Hicks, I.V., Koster, A., Kolotoglu, E.: Branch and Tree Decomposition Techniques for Discrete Optimization. TutORials in O.R., Informs (2005)
18. Karypis, G.: Multilevel Hypergraph partitioning. In: Cong, J., Shinnerl, J. (eds.) Multilevel Optimization Methods for VLSI, ch. 6, Kluwer Ac. Pub., Dordrecht (2002)
19. Khurshid, S., Marinov, D., Shlyakhter, I., Jackson, D.: A Case for Efficient Solution Enumeration. In: Giunchiglia, E., Tacchella, A. (eds.) SAT 2003. LNCS, vol. 2919, Springer, Heidelberg (2004)
20. Li, W., Van Beek, P.: Guiding Real-World Sat Solving with Dynamic Hypergraph Separator Decomposition. In: Proc. 16th. IEEE Int. Conf. on Tools for AI (ICTAI 2004) (2004)
21. Marques Silva, J., Oliveira, A.: Improving Satisfiability Algorithms with Dominance and Partitioning. In: Int. Workshop on Logic Synthesis (IWLS) (1997)
22. Mishra, P., Eich, M.H.: Join Processing in Relational Databases. ACM Computing Surveys 24(1), 63–113 (1992)
23. Monnet, A.: JaCk-SAT: un nouvel algorithme de résolution parallèle du problème SAT par décomposition structurelle. Master Thesis (in French), Université Paul Verlaine-Metz (2007)
24. Park, T.J., Van Gelder, A.: Partitioning Methods for Satisfiability Testing on Large Formulas. Information and Computation 162, 179–184 (2000)
25. Parkes, A.J.: Exploiting Solutions Clusters for Coarse-Grained Distributed Search. In: Workshop on Dist. Constraints Reasoning, IJCAI 2001 (2001)
26. Schloegel, K., Karypis, G., Kumar, V.: Graph Partitioning for High Performance Scientific Simulations. In: Dongarra, J., et al. (eds.) CRPC Parallel Computing Handbook, Morgan Kaufmann, San Francisco (2000)
27. Singer, D.: Parallel Resolution of the Satisfiability Problem: a Survey. In: Talbi, E.G. (ed.) Parallel Combinatorial Optimization, Wiley and Sons, Chichester (2006)
28. Singer, D., Monnet, A.: A New Decomposition Scheme for Parallel Resolution of the Satisfiability Problem (SAT). In: 14th. RCRA 2007, Roma (2007)
29. Singer, D., Vagner, A.: Parallel Resolution of the Satisfiability Problem with OpenMP and MPI. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, Springer, Heidelberg (2006)
30. Snir, M., Otto, S.W., Huss-Lederman, S., Walker, D.W., Dongarra, J.: MPI: the complete reference. The MIT Press, Cambridge (1996)
Designing Service-Based Resource Management Tools for a Healthy Grid Ecosystem
Erik Elmroth, Francisco Hernández, Johan Tordsson, and Per-Olov Östberg
Dept. of Computing Science and HPC2N, Umeå University, SE-901 87 Umeå, Sweden
{elmroth,hernandf,tordsson,p-o}@cs.umu.se
Abstract. We present an approach to the development of Grid resource management tools, where we put into practice internationally established high-level views of future Grid architectures. The approach addresses fundamental Grid challenges and strives towards a future vision of the Grid where capabilities are made available as independent and dynamically assembled utilities, enabling run-time changes in the structure, behavior, and location of software. The presentation is made in terms of design heuristics, design patterns, and quality attributes, and is centered around the key concepts of co-existence, composability, adoptability, adaptability, changeability, and interoperability. The practical realization of the approach is illustrated by five case studies (recently developed Grid tools), highlighting the most distinct aspects of these key concepts for each tool. The approach contributes to a healthy Grid ecosystem that promotes a natural selection of “surviving” components through competition, innovation, evolution, and diversity. In conclusion, this environment facilitates the use and composition of components on a per-component basis.
1 Introduction
In recent years, the vision of the Grid as the general-purpose, service-oriented infrastructure for provisioning of computing, data, and information capabilities has started to materialize in the convergence of Grid and Web services technologies. Ultimately, we envision a Grid with open and standardized interfaces and protocols, where independent Grids can interoperate, virtual organizations co-exist, and capabilities be made available as independent utilities. However, there is still a fundamental gap between the technology used in major production Grids and recent technology developed by the Grid research community. While current research directions focus on user-centric and service-oriented infrastructure design for scenarios with millions of self-organizing nodes, current production Grids are often more monolithic systems with stronger inter-component dependencies.
This research was conducted using the resources of the High Performance Computing Center North (HPC2N). Financial support has been provided by The Swedish Research Council (VR) under contract 621-2005-3667.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 259–270, 2008. © Springer-Verlag Berlin Heidelberg 2008
We present an approach to Grid infrastructure component development, where internationally established high-level views of future Grid architectures are put into practice. Our approach addresses the future vision of the Grid, while enabling easy integration into current production Grids. We illustrate the feasibility of our approach by presenting five case studies. The outline of the rest of the paper is as follows. Section 2 gives further background information, including our vision of the Grid, a characterization of competitive factors for Grid software, and a brief review of internationally established conceptual views of future Grid architectures. Section 3 presents our approach to Grid infrastructure development, which complies with these views. The realization of this approach for specific components is illustrated in Section 4, with a brief presentation of five tools recently developed within the Grid Infrastructure Research & Development (GIRD) project [26]. These are Grid tools or toolkits for resource brokering [9,10,11], job management [7], workflow execution [8], accounting [16,24], and Grid-wide fairshare scheduling [6].
2 Background and Motivation
Our approach to Grid infrastructure development is driven by the need and opportunity for a general-purpose infrastructure. This infrastructure should facilitate flexible and transparent access to distributed resources, dynamic composition of applications, management of complex processes and workflows, and operation across geographical and organizational boundaries. Our vision is that of a large evolving system, realized as a Service-Oriented Architecture (SOA) that enables provisioning of computing, data, and information capabilities as utility-like services serving business, academia, and individuals. From this point of departure, we elaborate on fundamental challenges that need to be addressed to realize this vision.
2.1 Facts of Life in Grid Environments
The operational context of a Grid environment is harsh, with heterogeneity in resource hardware, software, ownerships, and policies. The Grid is distributed and decentralized by nature, and any single point of control is impossible not only for scalability reasons but also since resources are owned by different organizations. Furthermore, as resource availability varies, resources may at any time join or leave the Grid. Information about the set of currently available resources and their status will always to some extent be incomplete or outdated. Actors have different incentives to join the Grid, resulting in asymmetric resource sharing relationships. Trust is also asymmetric, which, in scenarios with cross trust-domain orchestration of multiple resources that interact beyond the client-server model, gives rise to complex security challenges. Demand for resources typically exceeds supply, with contention for resources between users as a consequence. The Grid user community at large is disparate in requirements and knowledge, necessitating the development of wide ranges of user interfaces and access mechanisms. All these complicating factors add up to an environment where errors are the rule rather than the exception.
2.2 A General-Purpose Grid Ecosystem
Recently, a number of organizations have expressed views on how to realize a single and fully open architecture for the future Grid. To a large extent, these expressions conform to a single view of a highly dynamic service-oriented infrastructure for general-purpose use.
One such view proposes the model of a healthy ecosystem of Grid components [25], where components occupy niches in the ecosystem and are designed for component-by-component selection by developers, administrators, and end-users. Components are developed by the Grid community at large and offer sensible functionality, available for easy integration in high-level tools or other software. In the long run, competition, innovation, evolution, and diversity lead to natural selection of “surviving” components, whereas other components eventually fade out or evolve into different niches.
European organizations, such as the Next Generation Grids expert group [12] and NESSI [23], have focused on a common architectural view for Grid infrastructure, possibly with a more emphasized business focus compared to previous efforts. Among their recommendations is a strong focus on SOAs where services can be dynamically assembled, thus enabling run-time changes in the structure, behavior, and location of software. The view of services as utilities includes directly and immediately usable services with established functionality, performance, and dependability. This vision goes beyond that of a prescribed layered architecture by proposing a multi-dimensional mesh of concepts, applying the same mechanisms along each dimension across the traditional layers.
Common to these views are, for example, a focus on composable components rather than monolithic Grid-wide systems, as well as a general-purpose infrastructure rather than application- or community-specific systems. Examples of usage range from business and academic applications to individuals' use of the Grid. These visions also address some common issues in current production Grid infrastructures, such as interoperability and portability problems between different Grids, as well as limited software reuse.
Before detailing our approach to Grid software design, which complies with the views presented above, we elaborate on key factors for software success in the Grid ecosystem.
2.3 Competitive Factors for Software in the Grid Ecosystem
In addition to component-specific functional requirements, which obviously differ for different types of components, we identify a set of general quality attributes (also known as non-functional requirements) that successful software components should comply with. The success metrics considered here are the number of users and the sustainability of software. In order to attract the largest possible user community, usability aspects such as availability, ease of installation, understandability, and quality of documentation and support are important. With the dynamic and changing nature of Grid environments, flexibility and the ability to adapt and evolve are vital for the survival of a software component. Competitive factors for survival include changeability, adaptability, portability, interoperability, and integrability. These factors, along with mechanisms used to improve software quality with respect to them, are further discussed in Section 3. Other criteria, relating to sustainability, include the track record of both components and developers as well as the general reputation of the latter in the user community. Quality attributes such as efficiency (with emphasis on scalability), reliability, and security also affect the software success rate in the Grid ecosystem. These attributes are however not further discussed herein.
3 Grid Ecosystem Software Development
In this section we present our approach to building software well-adjusted to the Grid ecosystem. The presentation is structured into five groups of software design heuristics, design patterns, and quality attributes that are central to our approach. All definitions are adapted to the Grid ecosystem environment, but are derived from, and conform to, the ISO/IEC 9126-1 standard [20].
3.1 Co-existence – Grid Ecosystem Awareness
Co-existence is defined as the ability of software to co-exist with other independent software in a shared resource environment. The behavior of a component well adjusted to the Grid ecosystem is characterized by non-intrusiveness, respect for niche boundaries, replaceability, and avoidance of resource overconsumption. When developing new Grid components, we identify the purpose and boundaries of the corresponding niches in order to ensure the components’ place and role in the ecosystem. By stressing non-intrusiveness in the design, we strive to ensure that new components do not alter, hinder, or in any other way affect the function of other components in the system. While the introduction of new software into an established ecosystem may, through fair competition, reshape, create, or eliminate niches, it is still important for the software to be able to cooperate and interact with neighboring components. By the principle of decentralization, it is crucial to avoid making assumptions of omniscient nature and not to rely on global information or control in the Grid. By designing components for a user-centric view of systems, resources, component capabilities, and interfaces, we emphasize decentralization and facilitate component co-existence and usability.
3.2 Composability – Software Reuse in the Grid Ecosystem
Composability is defined as the capability of software to be used both as individual components and as building blocks in other systems. As systems may themselves be part of larger systems, or make use of other systems’ components, composability becomes a measure of usefulness at different levels of system design. Below, we present some design heuristics that we make use of in order to improve software composability. By designing components and component interactions in terms of interfaces rather than functionality, we promote the creation of components with well-defined responsibilities and provision for module encapsulation and interface abstraction. We strive to develop simple, single-purpose components achieving a distinct separation of concerns and a clear view of service architectures. Implementation of such components is faster and less error-prone than more complex designs. Autonomous components with minimized external dependencies make composed systems more fault tolerant as their distributed failure models become simpler. Key to designing composable software is to provision for software reuse rather than reinvention. Our approach, leading to generic and composable tools well adjusted to the Grid ecosystem, encourages a model of software reuse where users of components take what they need and leave the rest. Being decentralized and distributed by nature, SOAs have several properties that facilitate the development of composable software.
3.3 Adoptability – Grid Ecosystem Component Usability
Adoptability is a broad concept enveloping aspects such as end-user usability, ease of integration, ease of installation and administration, level of portability, and software maintainability. These are key factors for determining the deployment rate and niche impact of a piece of software. As high software usability can both reduce end-user training time and increase productivity, it has significant impact on the adoptability of software. We strive for ease of system installation, administration, and integration (e.g., with other tools or Grid middlewares), and hence reduce the overhead imposed by using the software as stand-alone components, end-user tools, or building blocks in other systems. Key adoptability factors include quality of documentation and client APIs, as well as the degree of openness, complexity, transparency and intrusiveness of the system. Moreover, high portability and ease of migration can be deciding factors for system adoptability.
3.4 Adaptability and Changeability – Surviving Evolution
Adaptability, the ability to adapt to new or different environments, can be a key factor for improving system sustainability. Changeability, the ability for software to be changed to provide modified behavior and meet new requirements, greatly affects system adaptability. By providing mechanisms to modify component behavior via configuration modules, we strive to simplify component integration and provide flexibility in, and ease of, customization and deployment. Furthermore, we find that policy plug-in modules, which can be provided and dynamically updated by third parties, are an efficient way of making systems adaptable to changes in operational contexts. By separating policy from mechanism, we make it easier for developers to use system components in other ways than originally anticipated, and software reuse can thus be increased.
3.5 Interoperability – Interaction within the Grid Ecosystem
Interoperability is the ability of software to interact with other systems. Our approach includes three different techniques for making our components available, making them able to access other Grid resources, and making other resources able to access our components, respectively. Integration of our components typically only requires the use of one or two of these techniques. Whenever feasible, we leverage established and emerging Web and Grid services standards for interfaces, data formats, and architectures. Generally, we formulate integration points as interfaces expressing required functionality rather than reflecting internal component architecture. Our components are normally made available as Grid services, following these general principles. For our components to access resources running different middlewares, we combine the use of customization points and design patterns such as Adapter and Chain of Responsibility [15]. Whenever possible, we strive to embed the customization points in our components, simplifying component integration with one or more middlewares. In order to make existing Grid software able to access our components, we strive to make external integration points as few, small, and well-defined as possible, as these modifications need to be applied to external software.
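The combination of customization points with the Adapter and Chain of Responsibility patterns described in Section 3.5 can be pictured roughly as follows. This is a minimal Python sketch for illustration only, not code from the tools discussed in this paper: the two client classes are toy stand-ins for middleware-specific libraries, and all class names, method names, and the `a://` and `b://` resource schemes are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Optional


class ToyClientA:
    """Hypothetical middleware client with its own call style."""

    def create_job(self, endpoint: str, description: str) -> str:
        return f"A:{description}@{endpoint}"


class ToyClientB:
    """Second hypothetical client with a different, incompatible interface."""

    def send(self, request: dict) -> dict:
        return {"id": f"B:{request['job']}@{request['cluster']}"}


class JobSubmitter(ABC):
    """Integration point: states the required functionality only."""

    @abstractmethod
    def submit(self, resource: str, job: str) -> Optional[str]:
        """Return a job id, or None if this handler does not apply."""


class AdapterA(JobSubmitter):
    """Adapter: maps the common interface onto ToyClientA."""

    def __init__(self, client: ToyClientA) -> None:
        self._client = client

    def submit(self, resource: str, job: str) -> Optional[str]:
        if not resource.startswith("a://"):
            return None
        return self._client.create_job(resource, job)


class AdapterB(JobSubmitter):
    """Adapter: maps the common interface onto ToyClientB."""

    def __init__(self, client: ToyClientB) -> None:
        self._client = client

    def submit(self, resource: str, job: str) -> Optional[str]:
        if not resource.startswith("b://"):
            return None
        return self._client.send({"cluster": resource, "job": job})["id"]


class SubmitterChain(JobSubmitter):
    """Chain of Responsibility: the first handler that applies wins."""

    def __init__(self, handlers: Iterable[JobSubmitter]) -> None:
        self._handlers = list(handlers)

    def submit(self, resource: str, job: str) -> Optional[str]:
        for handler in self._handlers:
            job_id = handler.submit(resource, job)
            if job_id is not None:
                return job_id
        return None


chain = SubmitterChain([AdapterA(ToyClientA()), AdapterB(ToyClientB())])
print(chain.submit("a://site1", "job1"))
print(chain.submit("b://site2", "job2"))
```

In such a design, supporting an additional middleware amounts to writing one more adapter and adding it to the chain, while callers keep using the single `submit` interface.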
4 Case Studies
We illustrate our approach to software development by brief presentations of five tools or toolkits recently developed in the GIRD project [26]. The presentations describe the overall tool functionality and highlight the most significant characteristics related to the topics discussed in Section 3. All tools are built to operate in a decentralized Grid environment with no single point of control. They are furthermore designed to be non-intrusive and can coexist with alternative mechanisms. To enhance adoptability of the tools, user guides, administrator manuals, developer APIs, and component source code are made available online [26]. As these adoptability measures are common for all projects, the adoptability characteristics are left out of the individual project presentations. The use of SOAs and Web services naturally fulfills many of the composability requirements outlined in Section 3. The Web service toolkit used is the Globus Toolkit 4 (GT4) Java WS Core, which provides an implementation of the Web Services Resource Framework (WSRF). Notably, the fact that our tools are made available as GT4-based Web services should not be interpreted as their being built primarily for use in GT4-based Grids. On the contrary, their design is focused on generality and ease of middleware integration.
4.1 Job Submission Service (JSS)
The JSS is a feature-rich, standards-based service for cross-middleware job submission, providing support, e.g., for advance reservations and co-allocation. The service implements a decentralized brokering policy, striving to optimize the job performance for individual users by minimizing the response time for each submitted job. In order to do this, the broker makes an a priori estimation of the whole, or parts of, the Total Time to Delivery (TTD) for all resources of interest before making the resource selection [9,10,11].
Co-existence: The non-intrusive decentralized resource broker handles each job in isolation from the jobs of other users. It can provide quality of service to end-users despite the existence of competing job submission tools.
Composability: The JSS is composed of several modules, each performing a well-defined task in the job submission process, e.g., resource discovery, reservation negotiation, resource selection, and data transfer.
Changeability and adaptability: Users of the JSS can specify additional information in job request messages to customize and fine-tune the resource selection process. Developers can replace the resource brokering algorithms with alternative implementations.
Interoperability: The architecture of the JSS is based on (emerging) standards such as JSDL, WSRF, WS-Agreement, and GLUE. It also includes customization points, enabling the use of non-standard job description formats, Grid information systems, and job submission mechanisms. The latter two can be interfaced despite differences in data formats and protocols. By these mechanisms, the JSS can transparently submit jobs to and from GT4, NorduGrid/ARC, and LCG/gLite.
4.2 Grid Job Management Framework (GJMF)
The GJMF [7] is a framework for efficient and reliable processing of Grid jobs. It offers transparent submission, control, and management of jobs and groups of jobs on different middlewares.
Co-existence: The user-centric GJMF design provides a view of exclusive access to each service and enforces a user-level isolation which prohibits access to other users’ information. All services in the framework assume shared access to Grid resources. The resource brokering is performed without use of global information, and includes back-off behaviors for Grid congestion control on all levels of job submission.
Composability: Orchestration of services with coherent interfaces provides transparent access to all capabilities offered by the framework. The functionality for job group management, job management, brokering, Grid information system access, job control, and log access is separated into autonomous services.
Changeability and adaptability: Configurable policy plug-ins in multiple locations allow customization of congestion control, failure handling, progress monitoring, service interaction, and job (group) prioritizing mechanisms. Dynamic service orchestration and fault tolerance are provided by each service being capable of using multiple service instances. For example, the job management service is capable of using several services for brokering and job submission, automatically switching to alternatives upon failures.
Interoperability: The use of standardized interfaces such as JSDL as job description format, OGSA BES for job execution, and OGSA RSS for resource selection improves interoperability and replaceability.
4.3 Grid Workflow Execution Engine (GWEE)
The GWEE [8] is a light-weight and generic workflow execution engine that facilitates the development of application-oriented end-user workflow tools. The engine is light-weight in that it focuses only on workflow execution and the corresponding state management. This project builds on experiences gained while developing the Grid Automation and Generative Environment (GAUGE) [19,17].
Co-existence: The engine operates in the narrow niche of workflow execution. Instead of attempting to replace other workflow tools, the GWEE provides a means for accessing advanced capabilities offered by multiple Grid middlewares. The engine can process multiple workflows concurrently without them interfering with each other. Furthermore, the engine can be shared among multiple users, but only the creator of a workflow instance can monitor and control that workflow.
Composability: The main responsibilities of the engine, managing task dependencies, processing tasks on Grid resources, and managing workflow state, are performed by separate modules.
Adaptability and Changeability: Workflow clients can monitor executing workflows both by synchronous status requests and by asynchronous notifications. Different granularities of notifications are provided to support specific client requirements – from a single message upon workflow completion to detailed updates for each task state change.
Interoperability: The GWEE is made highly interoperable with different middlewares and workflow clients through the use of two types of plug-ins. Currently, it provides middleware plug-ins for execution of computational tasks in GT4 and in the GJMF, as well as GridFTP file transfers. It also provides plug-ins for transforming workflow languages into its native language, as has currently been done for the Karajan language. The Chain of Responsibility design pattern allows concurrent usage of multiple implementations of a particular plug-in.
4.4 SweGrid Accounting System (SGAS)
SGAS allocates Grid capacity between user groups by coordinated enforcement of Grid-wide usage limits [24,16]. It employs a credit-based allocation model where Grid capacity is granted to projects via Grid-wide quota allowances. The
Grid resources collectively enforce these allowances in a soft, real-time manner. The main SGAS components are a Bank, a logging service (LUTS), and a quota-aware authorization tool (JARM), the latter to be integrated on each Grid resource.
Co-existence: SGAS is built as stand-alone Grid services with minimal dependencies on other software. Normal usage is not only non-intrusive to other software but also to usage policies, as resource owners retain ultimate control over local resource policies, such as strictness of quota enforcement.
Composability: There is a distinct separation of concerns between the Bank and the LUTS, for managing usage quotas and logging usage data, respectively. They can each be used independently.
Changeability and adaptability: The Bank can be used to account for any type of resource consumption and with any price-setting mechanism, as it is independent of the mapping to the abstract “Grid credit” unit used. The Bank can also be changed from managing pre-allocations to accumulating costs for later billing. The JARM provides customization points for calculating usage costs based on different pricing models. The tuning of the quota enforcement strictness is facilitated by a dedicated customization point.
Interoperability: The JARM has plug-in points for middleware-specific adapter code, facilitating integration with different middleware platforms, scheduling systems, and data formats. The middleware integration is done via a SOAP message interceptor in GT4 GRAM and via an authorization plug-in script in the NorduGrid/ARC GridManager. The LUTS data is stored in the OGF Usage Record format.
4.5 Grid-Wide Fairshare Scheduling System (FSGrid)
FSGrid is a Grid-wide fairshare scheduling system that provides three-party QoS support (user, resource-owner, VO-authority) for enforcement of locally and globally scoped share policies [6]. The system allows local resource capacity as well as global Grid capacity to be logically divided among different groups of users. The policy model is hierarchical and sub-policy definition can be delegated so that, e.g., a VO can partition its share among its projects, which in turn can divide their shares among users.
Co-existence: The main objective of FSGrid is to enable distributed resources to collaboratively schedule jobs for Grid-wide fairness. FSGrid is non-intrusive in the sense that resource owners retain ultimate control of how to perform the scheduling on their local resources.
Composability: FSGrid includes two stand-alone components with clearly separated concerns: one maintains a policy tree and the other logs usage data. In fact, the logging component in current use is the LUTS originally developed for SGAS, illustrating the potential for reuse of that component.
Changeability and adaptability: A customizable policy engine is used to calculate priority factors based on a runtime policy tree with information about resource pre-allocations and previous usage. The priority calculation can be customized, e.g., in terms of length, granularity, and rate of aging of usage history. The administration of the policy tree is flexible as sub-policy definition can be delegated to, e.g., VOs and projects. Interoperability: Besides the integration of the LUTS (see Section 4.4), FSGrid includes a single external point of integration, as a fair-share priority factor call-out to FSGrid has to be integrated in the local scheduler on each resource.
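The hierarchical policy tree and the aged usage history described above can be illustrated with a small sketch. It is not the FSGrid policy engine: the class `PolicyNode`, the exponential half-life aging, and the "target share minus used share" factor are all assumptions chosen only to make the idea concrete.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyNode:
    """One node of a (hypothetical) policy tree."""
    name: str
    share: float                                   # fraction of the parent's capacity
    children: list = field(default_factory=list)   # sub-policies (VOs, projects, users)
    usage: list = field(default_factory=list)      # usage per time bin, newest last


def decayed_usage(node, half_life_bins):
    """Aged usage of a subtree: a bin that is `half_life_bins` old counts half."""
    newest = len(node.usage) - 1
    own = sum(u * 0.5 ** ((newest - i) / half_life_bins)
              for i, u in enumerate(node.usage))
    return own + sum(decayed_usage(c, half_life_bins) for c in node.children)


def priority_factors(root, half_life_bins=4.0):
    """Map each leaf to (target share - used share); positive means under-served."""
    total = decayed_usage(root, half_life_bins)
    factors = {}

    def visit(node, target):
        if not node.children:
            used = decayed_usage(node, half_life_bins)
            factors[node.name] = target - (used / total if total > 0 else 0.0)
            return
        for child in node.children:
            # A child's Grid-wide target is its share of the parent's target.
            visit(child, target * child.share)

    visit(root, root.share)
    return factors
```

In this sketch the bin length corresponds to the granularity of the usage history, the number of bins to its length, and `half_life_bins` to the rate of aging, which are the three customization knobs the text mentions.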
5 Related Work
Despite the large number of Grid-related projects to date, just a few of these have shared their experiences regarding software design and development approaches. Some of these projects have focused on software architecture. In a survey by Finkelstein et al. [13], existing data-Grids are compared in terms of their architectures, functional requirements, and quality attributes. Cakic et al. [2] describe a Grid architectural style and a light-weight methodology for constructing Grids. Their work is based on a set of general functional requirements and quality attributes, from which they derive an architectural style that includes information, control, and execution. Mattmann et al. [22] analyze software engineering challenges for large-scale scientific applications, and propose a general reference architecture that can be instantiated and adapted for specific application domains. We agree on the benefits obtained with a general architecture for Grid components to be instantiated for specific projects; however, our focus is on the inner workings of the components making up the architecture. The idea of software that evolves due to unforeseen changes in the environment also appears in the literature. In the work by Smith et al. [3], the way software is modified over time is compared with Darwinian evolution. In this work, the authors discuss the best-of-breed approach, where an organization collects and assembles the most suitable software component from each niche. The authors also construct a taxonomy of the “species” of enterprise software. A main difference between this work and our contribution is that our work focuses on software design criteria. Other high-level visions of Grid computing include that of interacting autonomous software agents [14]. One of the characteristics of this vision is that software engineering techniques employed for software agents can be reused with little or no effort if the agents encompass the service’s vision [21].
A different view on agent-based software development for the Grid is that of evolution based on competition between resource brokering agents [4]. These projects differ from our contribution as our tools have a stricter focus on functionality (being well-adjusted to their respective niches). Finally, it is also important to note that there are a number of tools that simplify the development of Grid software. These tools facilitate, for example, implementation [18], unit testing [5], and automatic integration [1].
6 Concluding Remarks
We explore the concept of the Grid ecosystem, with well-defined niches of functionality and natural selection (based on competition, innovation, evolution, and diversity) of software components within the respective niches. The Grid ecosystem facilitates the use and composition of components on a per-component basis. We discuss fundamental requirements for software to be well-adjusted to this environment and propose an approach to software development that complies with these requirements. The feasibility of our approach is demonstrated by five case studies. Future directions for this work include further exploration of processes and practices for development of Grid software.
Acknowledgements. We acknowledge Magnus Eriksson for valuable feedback on software engineering standardization matters.
References
1. Bégin, M.-E., Diez-Andino, G., Di Meglio, A., Ferro, E., Ronchieri, E., Selmi, M., Zurek, M.: Build, configuration, integration and testing tools for large software projects: ETICS. In: Guelfi, N., Buchs, D. (eds.) RISE 2006. LNCS, vol. 4401, pp. 81–97. Springer, Heidelberg (2007)
2. Cakic, J., Paige, R.F.: Origins of the Grid architectural style. In: Engineering of Complex Computer Systems. 11th IEEE Int. Conference, IECCS 2006, pp. 227–235. IEEE CS Press, Los Alamitos (2006)
3. Smith David, J., McCarthy, W.E., Sommer, B.S.: Agility – the key to survival of the fittest in the software market. Commun. ACM 46(5), 65–69 (2003)
4. Dimou, C., Mitkas, P.A.: An agent-based metacomputing ecosystem, visited October 2007, http://issel.ee.auth.gr/ktree/Documents/RootFolder/ISSEL/Publications/BiogridAnAgent-basedMetacomputingEcosystem.pdf
5. Duarte, A., Cirne, W., Brasileiro, F., Machado, P.: GridUnit: software testing on the Grid. In: Anderson, K.M. (ed.) Software Engineering. 28th Int. Conference, ICSE 2006, pp. 779–782. ACM Press, New York (2006)
6. Elmroth, E., Gardfjäll, P.: Design and evaluation of a decentralized system for Grid-wide fairshare scheduling. In: Stockinger, H., et al. (eds.) First International Conference on e-Science and Grid Computing, pp. 221–229. IEEE CS Press, Los Alamitos (2005)
7. Elmroth, E., Gardfjäll, P., Norberg, A., Tordsson, J., Östberg, P.-O.: Designing general, composable, and middleware-independent Grid infrastructure tools for multitiered job management. In: Priol, T., Vaneschi, M. (eds.) Towards Next Generation Grids, pp. 175–184. Springer, Heidelberg (2007)
8. Elmroth, E., Hernández, F., Tordsson, J.: A light-weight Grid workflow execution engine enabling client and middleware independence. In: Wyrzykowski, R., et al. (eds.) Parallel Processing and Applied Mathematics. 7th Int. Conference, PPAM 2007. Lecture Notes in Computer Science, Springer, Heidelberg (2007)
9. Elmroth, E., Tordsson, J.: An interoperable, standards-based Grid resource broker and job submission service. In: Stockinger, H., et al. (eds.) First International Conference on e-Science and Grid Computing, pp. 212–220. IEEE CS Press, Los Alamitos (2005)
10. Elmroth, E., Tordsson, J.: A standards-based Grid resource brokering service supporting advance reservations, coallocation and cross-Grid interoperability. Concurrency and Computation: Practice and Experience (submitted, 2006)
11. Elmroth, E., Tordsson, J.: A Grid resource brokering algorithms enabling advance reservations and resource selection based on performance predictions. In: Future Generation Computer Systems. The International Journal of Grid Computing: Theory, Methods and Applications (to appear, 2008)
12. Expert Group on Next Generation Grids 3 (NGG3). Future for European Grids: Grids and service oriented knowledge utilities. Vision and research directions 2010 and beyond (2006), visited October 2007, ftp://ftp.cordis.lu/pub/ist/docs/grids/ngg3 eg final.pdf
13. Finkelstein, A., Gryce, C., Lewis-Bowen, J.: Relating requirements and architectures: a study of data-grids. J. Grid Computing 2(3), 207–222 (2004)
14. Foster, I., Jennings, N.R., Kesselman, C.: Brain meets brawn: why Grid and agents need each other. In: Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems, vol. 1, pp. 8–15. IEEE CS Press, Los Alamitos (2004)
15. Gamma, E., Helm, R., Johnson, R., Vlissides, J.: Design Patterns. In: Elements of Reusable Object-Oriented Software, Addison-Wesley, Reading (1995)
16. Gardfjäll, P., Elmroth, E., Johnsson, L., Mulmo, O., Sandholm, T.: Scalable Grid-wide capacity allocation with the SweGrid Accounting System (SGAS). Concurrency and Computation: Practice and Experience (accepted, 2007)
17. Guan, Z., Hernández, F., Bangalore, P., Gray, J., Skjellum, A., Velusamy, V., Liu, Y.: Grid-Flow: a Grid-enabled scientific workflow system with a petri-net-based interface. Concurrency Computat.: Pract. Exper. 18(10), 1115–1140 (2006)
18. Hastings, S., Oster, S., Langella, S., Ervin, D., Kurc, T., Saltz, J.: Introduce: an open source toolkit for rapid development of strongly typed Grid services. J. Grid Computing 5(4), 407–427 (2007)
19. Hernández, F., Bangalore, P., Gray, J., Guan, Z., Reilly, K.: GAUGE: Grid Automation and Generative Environment. Concurrency Computat.: Pract. Exper. 18(10), 1293–1316 (2006)
20. ISO/IEC. Software engineering - Product quality - Part 1: Quality model. International standard ISO/IEC 9126-1 (2001)
21. Leong, P., Miao, C., Lee, B.-S.: Agent oriented software engineering for Grid computing. In: Cluster Computing and the Grid. 6th IEEE Int. Symposium, CCGRID 2006, IEEE CS Press, Los Alamitos (2006)
22. Mattmann, C.A., Crichton, D.J., Medvidovic, N., Hughes, S.: A software architecture-based framework for highly distributed and data intensive scientific applications. In: Anderson, K.M. (ed.) Software Engineering. 28th Int. Conference, ICSE 2006, pp. 721–730. ACM Press, New York (2006)
23. Networked European Software and Services Initiative (NESSI), visited October 2007, http://www.nessi-europe.com
24. Sandholm, T., Gardfjäll, P., Elmroth, E., Johnsson, L., Mulmo, O.: A service-oriented approach to enforce Grid resource allocations. International Journal of Cooperative Information Systems 15(3), 439–459 (2006)
25. The Globus Project. An “ecosystem” of Grid components, visited October 2007, http://www.globus.org/grid software/ecology.php
26. The Grid Infrastructure Research & Development (GIRD) project. Umeå University, Sweden (visited October 2007), http://www.gird.se
BC-MPI: Running an MPI Application on Multiple Clusters with BeesyCluster Connectivity
Paweł Czarnul
Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, Poland
[email protected]
http://fox.eti.pg.gda.pl/~pczarnul
Abstract. A new software package, BC-MPI, which allows an MPI application to run on several clusters with various MPI implementations, is presented. It uses vendor MPI implementations for communication inside clusters and exploits the multithreaded MPI_THREAD_MULTIPLE mode for handling inter-cluster communication in additional threads of the MPI application. Furthermore, a BC-MPI application can be automatically compiled and started by the BeesyCluster middleware. The latter allows users to manage and use cluster accounts via a single BeesyCluster account and WWW or Web Services. The middleware connects to clusters via SSH and does not require any software installation on the clusters. Results of various latency and bandwidth tests for intra- and inter-cluster communication are presented for BC-MPI using OpenMPI and LAM/MPI and Infiniband or TCP within clusters.
Keywords: WAN-aware MPI, threads and MPI, grid middleware.
1 Introduction
In recent years we have observed a growth of interest in both highly parallel software solutions and bridging clusters using grid technology. The former included progress towards thread-safe MPI implementations (OpenMPI), focusing on transparent checkpointing for MPI applications (LAM/MPI/BLCR or MPICH-V), failure-proof parallel algorithms in view of large parallel machines like IBM’s BlueGene, mixed shared memory and MPI programming etc. The latter resulted in releases of high-level grid systems like CrossGrid ([1]), CLUSTERIX ([2]) and also MPI implementations for WANs like MPICH-G2 ([3]) based on grid middleware Globus Toolkit ([4]). Still, shortcomings can be identified, especially regarding:
– difficulty of installation, complex configuration and version compatibility of grid middlewares,
– limited availability of high level grid systems for a broader community,
– a usually complex process for middleware configurations, setup, distribution of account credentials, etc.
Partially covered by MNiI grant No N516 035 31/3499.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 271–280, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Related Work and Motivations
The author proposes a software package, BC-MPI, that allows running an MPI application over several distributed clusters, using different MPI implementations on various clusters if needed. The package consists of a library, a compilation script replacing mpicc to preprocess and replace MPI_* calls with ones shipped with BC-MPI, as well as (optional) TCP forwarders for inter-cluster communication.
There are MPI implementations available that allow an MPI application to span several clusters. Examples include MPICH-G2 ([3]) using Globus ([4]) for job control (startup, monitoring, termination) and TCP messaging ([5]). PACX-MPI ([6]) and LAM/MPI ([7]) can use Globus to couple remote clusters and start the application. Interoperable MPI ([8]) defines a protocol for communication between MPI implementations. Compared to other MPIs for WANs like MPICH-G2 or PACX-MPI, benefits of the proposed solution include:
1. BC-MPI is designed to exploit multithreading (MPI_THREAD_MULTIPLE model, [9]), as additional threads are used for handling inter-cluster communication in MPI processes. In BC-MPI each MPI process is multithreaded and one per cluster serves as a proxy for inter-cluster communication, unlike in PACX-MPI, where additional processes acting as cluster proxies are created. This is discussed in Section 3.
2. BC-MPI can use any MPI implementations on cluster sides. This allows exploiting additional features of the particular MPI application, if necessary.
3. The BC-MPI application can be started using the BeesyCluster middleware. Although the package is also meant as standalone software with manual startup on remote clusters (much like PACX-MPI in [6]), it can benefit from automatic and secure startup via BeesyCluster ([10]), which is a J2EE application and middleware to clusters exposing WWW and Web Service interfaces. Contrary to using Globus, this approach does not require installation of any parts of the middleware on remote clusters, since BeesyCluster accesses user accounts on such clusters via the JSch SSH library. As such, it only requires that the BeesyCluster user provides system logins/passwords to accounts on remote clusters they would like to use through BeesyCluster. This allows starting an MPI application on several clusters via a single account in BeesyCluster to which several registered user accounts from clusters are mapped. Complex middlewares like Globus are more prone to modifications while plain Web Services are well established standards. [5] lists specific Globus versions MPICH-G2 does not operate with.
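The preprocessing step performed by the compilation script, replacing MPI_* calls with the versions shipped with BC-MPI, can be pictured with the sketch below. The paper does not name the replacement functions or describe how the script works, so the `BCMPI_` prefix and the regular-expression approach are assumptions made purely for illustration.

```python
import re

# Hypothetical prefix: the paper does not state how BC-MPI names its wrappers.
WRAPPER_PREFIX = "BCMPI_"

# Match MPI function calls such as MPI_Send( or MPI_Comm_rank(, but not
# constants such as MPI_COMM_WORLD, which are not followed by "(".
MPI_CALL = re.compile(r"\bMPI_([A-Z][A-Za-z_]*)\s*\(")


def rewrite_mpi_calls(source: str) -> str:
    """Return C source text with every MPI_*() call redirected to a wrapper."""
    return MPI_CALL.sub(lambda m: f"{WRAPPER_PREFIX}{m.group(1)}(", source)
```

A purely textual rewrite like this would also touch calls that appear inside comments or string literals; a real preprocessor would need to handle those cases.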
3 Architecture and Design of the Proposed Solution
3.1 Architecture of a BC-MPI Application
The architecture of the BC-MPI application is depicted in Figure 1. BC-MPI uses multithreaded MPI processes, one of which is a proxy for a cluster, rather than distinguished processes like e.g. in PACX-MPI ([6]). If the cluster configuration requires access via a dedicated node then a TCP forwarder process can be launched there. In the case MPI is used to forward/receive from the proxy, communication links are equivalent to PACX-MPI. However, inter-proxy communication is potentially faster in BC-MPI since it requires only TCP communication and not MPI-TCP-MPI like in PACX-MPI. Application threads use MPI for communication within one cluster. BC-MPI does not require any changes in the MPI application source code.
Fig. 1. Architecture of BC-MPI Application [Diagram not reproduced. Each MPI process is drawn with an app thread plus BC-MPI send and recv threads; link labels: vendor MPI, TCP or MPI, TCP; an optional TCP forwarder is shown.]
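The message paths implied by this architecture can be sketched as a small routing function. The `Layout` type and `route` function are hypothetical names, not part of BC-MPI; the sketch only encodes what the text states: vendor MPI inside a cluster, MPI or TCP between a process and its cluster proxy, and plain TCP between proxies. The optional TCP forwarder is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    cluster_of: dict  # MPI rank -> cluster name
    proxy_of: dict    # cluster name -> rank of the proxy process


def route(layout, src, dst):
    """List the hops (sender, receiver, transport) a message would take."""
    src_cluster = layout.cluster_of[src]
    dst_cluster = layout.cluster_of[dst]
    if src_cluster == dst_cluster:
        return [(src, dst, "vendor MPI")]
    src_proxy = layout.proxy_of[src_cluster]
    dst_proxy = layout.proxy_of[dst_cluster]
    hops = []
    if src != src_proxy:
        hops.append((src, src_proxy, "MPI or TCP"))  # forward to own proxy
    hops.append((src_proxy, dst_proxy, "TCP"))        # inter-cluster link
    if dst != dst_proxy:
        hops.append((dst_proxy, dst, "MPI or TCP"))  # deliver inside target cluster
    return hops
```

With ranks 0 to 3 on one cluster and ranks 4 and 5 on another, and ranks 0 and 4 acting as proxies, a message from rank 1 to rank 5 takes three hops, while a message between the two proxies needs only the TCP hop.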
3.2 BeesyCluster as a Middleware for BC-MPI
BeesyCluster, installed at the Academic Computer Center in Gdansk, Poland (at https://beesycluster2.eti.pg.gda.pl/ek/Main), can be seen as an access portal/middleware to clusters/supercomputers/PCs with WWW (Figures 2, 3) and Web Service ([10]) interfaces. The user can access and use many accounts on various clusters through one account in BeesyCluster (single sign-on). Users can run any commands on clusters, edit and copy files and directories between clusters, queue or run tasks interactively, and publish actions, such as running a parallel or sequential application (run interactively or queued on clusters) or editing a file, as services visible to other users via WWW or Web Services. For the use of services, users-providers earn points which can be spent on running services published by others. Services can also be offered free of charge and can be combined into workflows as presented by the author in [11]. BeesyCluster only presumes that user accounts on clusters are accessible via SSH and does not require installation of any software to run or publish services. Further information can be found in [10] or at the aforementioned web site.
Fig. 2. BeesyCluster’s File Manager in Web Browser
Fig. 3. Task’s Results in BeesyCluster
In such an environment, in the context of a BC-MPI application the user can:
1. Register existing accounts in a BeesyCluster account in seconds without assistance of remote clusters’ administrators.
2. Launch a WAN-aware MPI application (using BC-MPI) on these clusters via BeesyCluster. BeesyCluster can be used for uploading application sources to target clusters, compiling them, and starting processes of the application on the clusters. In particular, BC-MPI’s MPI_Init() can use BeesyCluster’s Web Services to launch processes of the application on other clusters. BeesyCluster’s Web Services can be called in a secure way using SOAP/HTTPS and use client authentication/authorization. Specifically, BeesyCluster’s Web Services require an initial call to the method
String[] auth = port.logIn(new String[] {"", "<password>", "loginAgentID", "signerID"})
to log in and then allow calling any of the following ([10]):
– runCommand(auth,cluster,command) for running a command,
– enqueueJob(auth,cluster,jobPath,minCPU,maxCPU,resultPath,email) for queueing a task using a queueing system on cluster (queueing details are handled transparently by BeesyCluster),
– retrFile(auth,cluster,remoteFileName,localFileName) for retrieving a file from cluster,
– sendFile(auth,cluster,remoteFileName,localFileName) for sending a file to cluster.
The latter two can be used to download the sources of a BC-MPI application from one cluster and upload them to others, and the first one to compile the source code. Processes on other clusters could then be started using one of the first two services. Figure 4 shows the startup sequence (download/upload omitted). One process of an MPI application will act as a proxy waiting for socket connections and forwarding traffic to other clusters. Communication between clusters in a BC-MPI application will use TCP.
3. Optionally consume services made available by other BeesyCluster users from their accounts if proper rights have been set by the owners. This potentially extends the MPI application with the ability to run external functionality using SOAP/HTTPS.
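The order of Web Service calls described above (log in, fetch the sources, distribute them, compile, start) can be sketched with an in-memory stand-in for the service port. Only the method names and argument order come from the text; `RecordingPort`, `deploy_and_start`, and the command strings passed by the caller are hypothetical, and nothing here talks to a real BeesyCluster server.

```python
class RecordingPort:
    """In-memory stand-in for a BeesyCluster Web Service port (hypothetical).

    It performs no network traffic and merely records each call.
    """

    def __init__(self):
        self.calls = []

    def logIn(self, credentials):
        self.calls.append(("logIn",))
        return ["fake-session"]  # stands in for the String[] auth result

    def runCommand(self, auth, cluster, command):
        self.calls.append(("runCommand", cluster, command))

    def retrFile(self, auth, cluster, remote_name, local_name):
        self.calls.append(("retrFile", cluster, remote_name))

    def sendFile(self, auth, cluster, remote_name, local_name):
        self.calls.append(("sendFile", cluster, remote_name))


def deploy_and_start(port, credentials, origin, others, source, build_cmd, start_cmd):
    """Distribute one source file, build it everywhere, then start everywhere."""
    auth = port.logIn(credentials)
    port.retrFile(auth, origin, source, source)       # download from origin cluster
    for cluster in others:
        port.sendFile(auth, cluster, source, source)  # upload to the other clusters
    for cluster in [origin, *others]:
        port.runCommand(auth, cluster, build_cmd)     # compile on each cluster
    for cluster in [origin, *others]:
        port.runCommand(auth, cluster, start_cmd)     # start processes on each cluster
```

A queued start would use enqueueJob in place of the final runCommand calls; the sketch keeps to interactive commands for brevity.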
4 Multithreaded Implementation
Calling MPI functions in threads other than the main one, for forwarding to or receiving from the proxy process for inter-cluster communication, requires proper threading support from the MPI implementation (possibly different in various clusters). Thus, for performance tests in this paper, in terms of multithreading the author tested three versions of the code:
1. OpenMPI in the MPI_THREAD_MULTIPLE mode,
2. LAM/MPI in the MPI_THREAD_SERIALIZED mode with the BC-MPI code including special synchronization to avoid deadlocks but only for the exemplary code,
Fig. 4. Start-up of a BC-MPI Application using BeesyCluster [Diagram not reproduced. Elements: BeesyCluster’s client, BeesyCluster server(s) (Sun), clusters fox, parowiec and holk; step labels: “1. start app on fox (WWW)”, “1. start app on fox (SSH)”, “2. start app on parowiec and holk (Web Service)”, “3. start app (SSH)”; hardware labels: 288 IA-64 procs, Infiniband; 16 Xeon procs, Fast Ethernet; link label: BC-MPI communication (TCP).]
3. BC-MPI’s TCP – fully multithreaded (can be used with any MPI implementation), using sockets for forwarding/receiving to/from the proxy for inter-cluster communication. MPI is still used for communication within one cluster.
The following threads are used in BC-MPI, apart from the main MPI application threads:
1. Receive threads – used to listen both to internal communication (MPI or TCP) from MPI processes and to communication from external clusters (only in the proxy process) via TCP.
2. Sending thread – forwarding to the proxy process or TCP communication to another cluster is done in another thread. This allows the client side to continue and implement e.g. MPI_I*send modes, as well as to continue receiving when forwarding data in proxy processes. Data is set in proper structures and a call to pthread_cond_signal instructs the sending thread to flush the send buffer.
BC-MPI currently incorporates the following optimizations:
– receiving to user buffer when a matching receive has already been posted – if a message has not already arrived, an MPI_Recv() call inserts a receive request (with the destination buffer specified by the user) into a table and waits on a condition; when the message arrives, it is received into the user buffer,
– partitioning and forwarding packets of long messages – the proxy process receives the message it should forward in packets; the received packet is forwarded immediately in another thread while receiving of next packets continues.
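The sending thread described above, where the caller places data in a shared structure and signals a condition so that a separate thread flushes the send buffer, can be sketched as follows. BC-MPI itself relies on pthreads (the text names pthread_cond_signal); this sketch uses Python's `threading.Condition` as an analogue, and the class `SendThread` with its `transmit` callback is a hypothetical illustration rather than BC-MPI code.

```python
import threading


class SendThread:
    """Background sender: callers enqueue data and return immediately."""

    def __init__(self, transmit):
        self._transmit = transmit          # callable(dest, payload), e.g. a socket write
        self._buffer = []                  # pending (dest, payload) pairs
        self._cond = threading.Condition()
        self._closed = False
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def post(self, dest, payload):
        """Queue a message; the caller does not wait for the transfer."""
        with self._cond:
            self._buffer.append((dest, payload))
            self._cond.notify()            # analogue of pthread_cond_signal

    def close(self):
        """Flush what is left and stop the sending thread."""
        with self._cond:
            self._closed = True
            self._cond.notify()
        self._thread.join()

    def _run(self):
        while True:
            with self._cond:
                while not self._buffer and not self._closed:
                    self._cond.wait()
                pending, self._buffer = self._buffer, []
                closed = self._closed
            # Transmit outside the lock so that callers can keep posting.
            for dest, payload in pending:
                self._transmit(dest, payload)
            if closed:
                return
```

Because `post` returns as soon as the data is queued, the calling thread can go on computing or receiving, which is the property the text attributes to the separate sending thread.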
5 Performance Tests
For performance tests, an MPI example for benchmarking point-to-point performance from [12] was used, but modified to test latency/bandwidth between any MPI ranks. For each test described below, a pair of clusters was used to benchmark point-to-point times between MPI processes in separate clusters. Cluster holk features 288 Itanium2 processors with Infiniband, parowiec contains 16 Pentium Xeon processors with Fast Ethernet, and fox is an AthlonXP PC node, all running Linux. We use a single MPI application compiled with BC-MPI, with processes of ranks 0 to 3 on one cluster and rank 4 (and 5 in the last case) on the second cluster. Note that the inter-departmental link between parowiec and fox runs at only 10 Mbit/s.
5.1 Inter-cluster Communication and Communication within a Node in Cluster
Figure 5 shows latency times and Figure 6 bandwidth for communication between ranks 1 and 4 of the MPI application. Communication occurs in configuration: 1 (parowiec) - MPI (shared memory or TCP) - 0 (proxy process on parowiec) - inter-cluster inter-departmental TCP (10 Mbit/s) - 4 (proxy process on fox). Processes 0 and 1 run on the same node. For communication between 0 and 1 both shared memory and TCP modes in MPI were used for comparison. It can be seen that LAM/MPI slightly outperforms OpenMPI (as also reported in [13]). The times are obviously dominated by the inter-cluster communication times.
Fig. 5. Latency for Inter-cluster Communication + Communication in a Node in Cluster [Plot not reproduced. Axes: inter-cluster latency [us] vs. message size [bytes], 0 to 256 bytes. Series: intercluster communication (TCP) 0-4; 1-OMPI shm (THREAD_MULTIPLE)-0–TCP-4; 1-LAMMPI shm (THREAD_SERIALIZED)-0–TCP-4; 1-OMPI tcp-0–TCP-4; 1-LAMMPI tcp-0–TCP-4.]
Fig. 6. Bandwidth for Inter-cluster Communication + Communication in a Node in Cluster [Plot not reproduced. Axes: inter-cluster bandwidth [MB/s] vs. message size [bytes], 0 to 16384 bytes. Series as in Fig. 5.]
5.2 Inter-cluster Communication and Communication between Nodes in Cluster
Figure 7 shows latency times and Figure 8 bandwidth for communication between ranks 1 and 4. Communication occurs in configuration: 1 (parowiec) - MPI (TCP) or BC-MPI’s TCP - 0 (proxy process on parowiec) - inter-cluster inter-departmental TCP (10 Mbit/s) - 4 (proxy process on fox). Processes 0 and 1 run on separate nodes in cluster parowiec using Fast Ethernet. For communication between 0 and 1 both vendor MPI and BC-MPI’s TCP were used for comparison. It can be seen that, as in Figure 5 (communication between 0 and 1 on a single node), LAM/MPI offers slightly better latency times than OpenMPI (for TCP). Furthermore, BC-MPI’s TCP implementation is slightly faster than the code using LAM/MPI with MPI_THREAD_SERIALIZED. This may be in part due to the synchronization added to BC-MPI in the latter case so that LAM/MPI can be called in many threads without deadlocks. This synchronization is specific and would work only for the MPI application used for tests. OpenMPI was tested with MPI_THREAD_MULTIPLE, which allows MPI to be called in many threads without deadlocks.
Fig. 7. Latency for Inter-cluster Communication + Communication between Nodes in Cluster [Plot not reproduced. Axes: inter-cluster latency [us] vs. message size [bytes], 0 to 256 bytes. Series: 1-OMPI tcp (THREAD_MULTIPLE)-0–TCP-4; 1-LAMMPI tcp (THREAD_SERIALIZED)-0–TCP-4; 1-BC-MPI’s TCP-0–TCP-4.]
Fig. 8. Bandwidth for Inter-cluster Communication + Comm. between Nodes in Cluster [Plot not reproduced. Axes: inter-cluster bandwidth [MB/s] vs. message size [bytes], 0 to 16384 bytes. Series as in Fig. 7.]
5.3 Data Partitioning and Forwarding
For larger messages the author tested the impact of partitioning messages and immediately forwarding received data in rank 0 while continuing to receive. Figure 9 shows results for the configuration as in the previous case. It indicates that for large messages the latency of sending a message of size 8 MBytes between processes 1 and 0 (around 0.7 s between separate nodes in cluster parowiec) can be practically hidden.
Fig. 9. Latency for Inter-cluster Communication with and without Message Partitioning [Plot not reproduced. Axes: inter-cluster communication times [s] vs. message size [Kbytes], 0 to 8192 Kbytes. Series: forwarding whole messages; partitioning when forwarding.]
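Why partitioning hides the intra-cluster transfer can be seen from an idealized two-link model. The sketch below is not a reproduction of the measurements: it ignores latency and protocol overhead, assumes nominal link speeds (100 Mbit/s Fast Ethernet inside the cluster and the 10 Mbit/s inter-departmental link), and the 64 KiB packet size is an arbitrary assumption, since the paper does not state the packet size used.

```python
def store_and_forward_time(size, bw_in, bw_out):
    """Proxy receives the whole message, then forwards it (bytes, bytes/s)."""
    return size / bw_in + size / bw_out


def pipelined_time(size, packet, bw_in, bw_out):
    """Proxy forwards each packet as soon as it has been received in full."""
    recv_done = 0.0   # time at which the current packet has arrived at the proxy
    send_done = 0.0   # time at which the outgoing link becomes free again
    remaining = size
    while remaining > 0:
        chunk = min(packet, remaining)
        recv_done += chunk / bw_in
        # Forwarding starts when the packet is here and the link is free.
        send_done = max(send_done, recv_done) + chunk / bw_out
        remaining -= chunk
    return send_done


MSG = 8 * 1024 * 1024        # 8 MBytes
FAST_ETHERNET = 100e6 / 8    # bytes per second, nominal
INTER_CLUSTER = 10e6 / 8     # bytes per second, nominal

whole = store_and_forward_time(MSG, FAST_ETHERNET, INTER_CLUSTER)
split = pipelined_time(MSG, 64 * 1024, FAST_ETHERNET, INTER_CLUSTER)
```

Under these assumptions the intra-cluster transfer of the full message takes about 0.67 s, in line with the roughly 0.7 s quoted in the text, and with pipelining only the first packet's transfer (a few milliseconds) is not overlapped with the slower inter-cluster link.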
5.4 Testing Infiniband vs. TCP and TCP Forwarding
In this configuration communication occurs between processes 1 and 5 in configuration: 1 (parowiec) - BC-MPI’s TCP or MPI (TCP) - 0 (proxy process on parowiec) - inter-cluster TCP - additional TCP forwarder on access node karawela - inter-cluster TCP - 4 (proxy process on holk) - BC-MPI’s TCP or MPI (Infiniband) - 5 (holk). Processes 1 and 0, as well as 4 and 5, run on separate nodes in clusters parowiec and holk respectively. Figures 10 and 11 show differences in communication times for both small and large message sizes when using Infiniband within holk compared to TCP (Ethernet).
Fig. 10. Inter-cluster Communication Latency with TCP vs Infiniband on holk: Short Messages [Plot not reproduced. Axes: inter-cluster latency [us] vs. message size [bytes], 0 to 2048 bytes. Series: BC-MPI’s TCP within both clusters; MPI (TCP) within parowiec and MPI (IB) within holk; a linear regression line for each.]
Fig. 11. Inter-cluster Communication Times with TCP vs Infiniband on holk [Plot not reproduced. Axes: inter-cluster communication times [s] vs. message size [Kbytes], 0 to 4096 Kbytes. Series: BC-MPI’s TCP within both clusters; MPI (TCP) within parowiec and MPI (IB) within holk.]
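Figure 10 includes linear regression lines through the latency measurements. The paper does not say how they were computed; a common approach, sketched below as an assumption, is an ordinary least-squares fit of the model t(n) = alpha + n / beta, where alpha is the start-up latency and beta the asymptotic bandwidth. The function name and the synthetic samples are hypothetical and are not measured data.

```python
def fit_latency_bandwidth(samples):
    """Least-squares fit of t = alpha + n / beta.

    samples: iterable of (message_size_bytes, time_seconds) pairs.
    Returns (alpha in seconds, beta in bytes per second).
    """
    samples = list(samples)
    count = len(samples)
    mean_n = sum(n for n, _ in samples) / count
    mean_t = sum(t for _, t in samples) / count
    sxx = sum((n - mean_n) ** 2 for n, _ in samples)
    sxy = sum((n - mean_n) * (t - mean_t) for n, t in samples)
    slope = sxy / sxx                 # seconds per byte
    alpha = mean_t - slope * mean_n   # intercept: zero-byte latency
    return alpha, 1.0 / slope


# Synthetic, noise-free points generated from alpha = 400 us, beta = 1 MB/s.
synthetic = [(n, 400e-6 + n / 1e6) for n in (0, 512, 1024, 1536, 2048)]
alpha, beta = fit_latency_bandwidth(synthetic)
```

Comparing the fitted intercepts of two configurations isolates the difference in start-up latency, while the slopes compare the effective bandwidth of the slowest link on the path.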
6 Summary and Future Work
In this paper, a new software package for bridging MPI applications on clusters was presented. While it can be used as standalone software with MPI implementations, it can benefit from the BeesyCluster middleware for automatic startup of the MPI application spanning clusters as well as spawning BeesyCluster services in parallel. Its architecture can exploit multithreaded features of MPI implementations, e.g. MPI_THREAD_MULTIPLE of OpenMPI. Various performance tests of point-to-point communication between clusters were presented using BC-MPI’s TCP, LAM/MPI and OpenMPI. Future work includes further implementation of BC-MPI including collective operations, performance comparisons with other systems like MPICH-G2, PACX-MPI, LAM/MPI etc., as well as tests of TCP communication with encryption. The author will also focus on building an ontology for HPC computing and using it for intelligent searching of HPC services within BeesyCluster.
Acknowledgments. Calculations were carried out at the Academic Computer Center in Gdansk, Poland.
References
1. Official Crossgrid Information Portal. Supported by Grant No. IST-2001-32243 of the European Commission, http://www.crossgrid.org/main.html
2. CLUSTERIX: The National Linux Cluster, http://clusterix.pcz.pl
3. Karonis, N., Toonen, B., Foster, I.: MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface. Journal of Parallel and Distributed Computing (JPDC) 63(5), 551–563 (2003)
4. Sotomayor, B.: The Globus Toolkit 4 Programmer’s Tutorial (November 2005), http://www.casasotomayor.net/gt4-tutorial/
5. Karonis, N., Toonen, B.: MPICH-G2, http://www3.niu.edu/mpi/
6. Keller, R., Müller, M.: The Grid-Computing library PACX-MPI: Extending MPI for Computational Grids, www.hlrs.de/organization/amt/projects/pacx-mpi/
7. LAM/MPI Parallel Computing, http://www.lam-mpi.org/
8. National Institute of Standards and Technology: Interoperable MPI, http://impi.nist.gov/
9. Lusk, E., et al.: MPI-2: Extensions to the Message-Passing Interface: MPI and Threads, http://www-unix.mcs.anl.gov/mpi/mpi-standard/mpi-report-2.0/node162.htm#Node162
10. Czarnul, P., Bajor, M., Fraczak, M., Banaszczyk, A., Fiszer, M., Ramczykowska, K.: Remote Task Submission and Publishing in BeesyCluster: Security and Efficiency of Web Service Interface. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, Springer, Heidelberg (2006)
11. Czarnul, P.: Integration of Compute-Intensive Tasks into Scientific Workflows in BeesyCluster. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3993, Springer, Heidelberg (2006)
12. Gropp, W., Lusk, E.: MPI Tutorial: Benchmarking point to point performance, http://www-unix.mcs.anl.gov/mpi/tutorial/mpiexmpl/src3/pingpong/C/main.html
13. Barrett, B.: Open MPI User’s Mailing List Archives, http://www.openmpi.org/community/lists/users/2006/04/1076.php
Managing Distributed Architecture with Extended WS-CDL Konrad Dusza and Henryk Krawczyk Gdansk University of Technology, Narutowicza 11/12, 80-952 Gdansk, Poland [email protected], [email protected]
Abstract. The paper presents problems with WS-CDL in providing a central point of service layer management in a distributed system architecture. A proposal of CDLExt, a model extension to WS-CDL that addresses its drawbacks in service management, is given. Features of the CDLExt model include describing dependencies between services and other IT artifacts, and the QoS attributes introduced by these dependencies. Furthermore, a toolset supporting service layer management with the new model is presented. Future directions for CDLExt development are also discussed.
Keywords: WS-CDL, choreography, Web services, QoS.
1 Introduction
One of the latest trends in distributed systems architecture is service orientation. Widely known as SOA [1], it promises more scalable, composable and business-oriented software. To support this new architectural paradigm, two complementary approaches to service layer management have been developed: choreography and orchestration. Orchestration can be defined as programming in-the-large in WSBPEL [2] or other programming languages in order to compose web services according to modeled business processes. Choreography, on the other hand, focuses on describing the observable behavior of the system from the perspective of an external observer that is not part of the system. At present, the most recognized approach for describing web services choreography is the Web Services Choreography Description Language (WS-CDL) developed by W3C [3]. Although WS-CDL provides a set of constructs that seem sufficient for choreography description, its scope does not cover certain aspects of service layer management, including describing QoS characteristics of service interactions and modeling dependencies between services and other components existing in an organization's IT infrastructure.
The purpose of this paper is to present CDLExt, a model proposal for service layer management based on WS-CDL extended with additional elements describing important service layer characteristics. The design of the WS-CDL extension is based on an analysis of the WS-CDL feature set from a service architect's point of view. Features introduced by CDLExt are to enable service architects to achieve better control of the service layer by supporting a more detailed view of the IT infrastructure in a service-oriented organization than WS-CDL itself provides.
Ideas and scenarios are presented in this paper within the context of an organization which is transitioning its distributed, legacy architecture towards a service-oriented one. Moreover, we assume that the organization maintains a description of the current state of the service layer in the form of a choreography described in WS-CDL. By a mature SOA we understand the definition provided by the OASIS SOA Technical Committee [1].
The remainder of this paper is organized as follows. The next section presents an overview of concepts and issues related to service layer management. Section 3 gives a more in-depth analysis of areas which are not covered by WS-CDL and their importance in service layer management. Sections 4 and 5 cover details of CDLExt and the tool supporting this solution. Section 6 discusses future plans and areas for CDLExt improvement. The final section of the paper presents conclusions.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 281–290, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Managing the Service Layer
Service orientation, meant as the encapsulation of an organization's business-related functionalities in the form of services, implies logical and often physical distribution of system architecture components. Effective SOA requires high-level management in order to ensure that service contracts negotiated between the organization and its service consumers are fulfilled. Service layer management involves various tasks, including creating, modifying and removing services from the layer, monitoring service activities and the risk of failing a service contract, service-enabling of existing legacy components, etc. Many of those tasks are supported in a variety of ways by a plethora of available tools and standards. Moreover, even the initial versions of web services standards were designed with support for service management in mind. One of the first standards related to web services was WSDL [4], the language for specifying the functional capabilities of services. Later, the concept of service registries that would help in discovering needed services was introduced in the form of UDDI [5]. UDDI-compliant registries can store references to services deployed in distributed locations, granting authorized parties access to service registry data via a single point of entry. With the advancement of web services development, the concept of service composability was reintroduced in the form of business process modeling and mapping of business processes onto web service invocations. Orchestration engines delivered the functionality of interpreting such scripted web service invocations, together with support for some non-functional aspects of the service layer, including logging, authorization and authentication management, etc. An extension of the concept of an orchestration engine, the Enterprise Service Bus [6] (ESB), provided a single point of entry into the IT infrastructure by integrating its various technologies (CORBA [7], Web Services, etc.).
Each of the aforementioned concepts addresses certain aspects of service layer management. However, the solutions they provide are either fragmentary (like service listing in the form of a registry) or too detailed and orchestration-centered (e.g. ESB). The lack of a generalized, top-level insight into the service layer state hinders reliable evaluation of an organization's capability of fulfilling negotiated service contracts. On the other hand, the capabilities of WS-CDL, which include describing both the static side of the service layer (e.g. enumeration of services, providing service descriptions, etc.) and the dynamic side (describing the expected observable behavior on a service layer), gave us reason to pick WS-CDL as the foundation for further research on a lightweight and complete solution for service layer management. WS-CDL capabilities and potential areas for extending its service layer model are discussed in the next section.
3 Reasons for Extending WS-CDL
The most common scenarios with regard to building a service layer in an organization can be described with the keywords create, compose and use. The create scenario represents the encapsulation of a business-relevant functionality in the form of a service. The compose scenario is a special case of create that involves combining existing services into a service offering new functionality from the business point of view. The use scenario describes the customer side of a create/compose situation: adapting the organization to consuming a service provided by an external provider. Apart from implementation and deployment, all the aforementioned situations usually involve formalization of the service contract agreed between the consumer and the provider, identification of service dependencies, and arranging the service layer so that service invocations are possible, effective, reliable and compliant with the service contract. To conduct all these tasks, a service architect (the person in charge of the service layer) needs a top-level view of the organization's service layer.
The main purpose of WS-CDL is to provide means to describe the sequence and conditions in which messages are exchanged [8], with regard to Web Services. This is to be achieved from the perspective of a third-party observer that does not interfere in any of the message exchanges. Although WS-CDL and WSDL offer a wide array of elements for modeling service functionalities and their interactions, they do not provide sufficient support for other activities related to service layer management. Examples of such important features are given in the following paragraphs.
Firstly, effective management of the service layer, being the front line for providing business functionalities, needs proper identification and verification of service dependencies due to their impact on the organization's ability to fulfill service contracts. For instance, it could happen that two independently designed and implemented components, e.g. a service and a legacy component, access the same database in a manner that introduces a race condition (see Fig. 1). In such a case, one of the components is unable to work properly, which will affect the organization's workflow. Moreover, such a problem could be discovered only after production deployment of the service, and therefore introduces a severe risk for the organization: violation of a service contract.
Secondly, if SOA is to be the core of the organization's business, both the functional and the non-functional side of a service contract must be properly served. To define the former, the WSDL standard has been developed, and it has now reached an acceptable degree of maturity. Even though the latter, named quality of service (QoS), has gained recognition in the web services world [9] [10] and was discussed within W3C [11], it has not become a part of any service contract-related specification. As an example of the importance of QoS in a service contract, we can use a modified scenario from the previous paragraph (see Fig. 1). Having resolved the possible race condition (e.g. through transactions), the system architect could face a situation in which the legacy component overloads the database with queries, so that the service is unable to respond to customers within the maximum response time specified in the service contract. Once again, a problem with fulfilling the contract by the service provider has arisen. In our opinion, separation of the service layer from its dependencies in the existing IT architecture and from QoS aspects makes it difficult for a service architect to manage the layer effectively. Whereas WS-CDL provides sufficient means to model interactions, it should be reinforced with additional constructs to support other service layer management tasks.
Fig. 1. Example of access conflict that could disrupt a service execution
4 Model of the CDLExt Extension
Analysis of the problems related to service layer management described in section 3 resulted in the CDLExt model presented in Fig. 2. The choreography of a service layer is represented by the Package element and its subelements, according to the WS-CDL specification. The Artifacts element represents the IT infrastructure layer of an organization. In order to allow different types of IT infrastructure artifacts to be modeled, an abstract Artifact element and its derivative types were created (Business/Database/Service/Other). The distinction between various types of IT artifacts makes it possible to enrich each artifact type with type-specific attributes. For example, in the CDLExt model, the Service artifact was extended with the RoleRef attribute, which can contain a reference to a WS-CDL RoleType element instance. The Dependency element serves the purpose of dependency modeling, whereas the Load element enables modeling the load which is introduced by a particular dependency. Among various QoS characteristics, the load introduced by dependencies was chosen as the most important QoS aspect regarding the ability of an organization to fulfill a service contract. A detailed description of CDLExt elements, with the shapes related to dependency modeling, is given in Table 1. Notably, the CDLExt syntax has been described in an XSD schema, which can be used for validation of a CDLExt document.
Fig. 2. CDLExt model structure. The package element is the WS-CDL root, the Artifacts element is the root of the extension.
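To make the structure of the extension more concrete, the following minimal Python sketch mirrors the Artifact, Dependency and Load elements and their attributes as listed in Table 1. It is an illustration only: the in-memory representation, the helper shared_targets and the sample identifiers (db1, svc1, leg1, OrderRole) are assumptions made for this example and are not part of the CDLExt XSD schema or the toolset. The helper lists artifacts on which more than one other artifact depends, which is the situation behind the access conflict of Fig. 1.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Load:
    amount: float   # volume of load introduced by the dependency
    timeunit: str   # time unit the amount refers to (hypothetical value: "second")
    type: str       # kind of load (hypothetical value: "query")


@dataclass
class Dependency:
    id: str
    to: str                       # ID of the artifact that is depended on
    load: Optional[Load] = None   # optional QoS data attached to the dependency


@dataclass
class Artifact:
    id: str
    name: str
    kind: str = "Other"              # Business / Database / Service / Other
    role_ref: Optional[str] = None   # only meaningful for Service artifacts
    dependencies: List[Dependency] = field(default_factory=list)


def shared_targets(artifacts: List[Artifact]) -> Dict[str, List[str]]:
    """Map each artifact ID that several artifacts depend on to its dependents."""
    dependents = defaultdict(list)
    for artifact in artifacts:
        for dep in artifact.dependencies:
            dependents[dep.to].append(artifact.id)
    return {target: ids for target, ids in dependents.items() if len(ids) > 1}


# Hypothetical instance: a service and a legacy component share one database.
database = Artifact("db1", "Orders database", kind="Database")
service = Artifact("svc1", "Order service", kind="Service", role_ref="OrderRole",
                   dependencies=[Dependency("d1", "db1", Load(50, "second", "query"))])
legacy = Artifact("leg1", "Legacy billing", kind="Business",
                  dependencies=[Dependency("d2", "db1", Load(200, "second", "query"))])

conflicts = shared_targets([database, service, legacy])
```

A shared target is only a hint that a conflict may exist; whether it actually leads to a race condition or overload still has to be judged by the service architect.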
5 CDLExt Toolset
For the proposed extension to be effective and to truly support a service architect's tasks of managing the organization's service layer, a special toolset has been developed. In the following subsections, three main development steps are presented.
5.1 Design
Toolset design was based on the Model-View-Controller (MVC) pattern, with a miscellaneous tool package added to the toolset. The general architecture of the CDLExt toolset is shown in Fig. 3. The term metamodel is used in the diagram because it represents the model for CDLExt instances, i.e. models of a service layer.
Table 1. Description of CDLExt elements and attributes

| Element | Attributes | Description | Shape |
|---|---|---|---|
| CDLExt | | root element | - |
| Package | as in ws-cdl package element | root element for containing the choreography definition | - |
| Artifacts | | root element for extension part | - |
| Artifact | ID, name | abstraction representing artifacts in IT architecture | - |
| Database | Artifact-derived | database tier artifact | cuboid |
| Business | Artifact-derived | business tier artifact | round. rect. |
| Other | Artifact-derived | other kind of IT artifact | rect. |
| Service | Artifact-derived, roleRef | service layer artifact, allows referencing a role from ws-cdl part | circle |
| Dependency | ID, to | dependency between artifacts; the artifact containing the dependency is dependent on the artifact pointed to by the to attribute | arrow |
| Load | amount, timeunit, type | QoS-related element; data on the load that the containing dependency element enforces on the depended-on artifact | rect. linked to dep. |
Metamodel structure: the core part of CDLExt support. It represents the implementation of the CDLExt metamodel, including the syntax and semantic rules of WS-CDL and of the extension part as described in section 4 (see Table 1).
Metamodel logic: the logic of model editing. It is based on the strategy/policy and command patterns [12], and enables undo/redo for operations on a CDLExt instance, validation of editing operations, etc. Examples of operations on the model include adding a valid element to the CDLExt instance (e.g. a WS-CDL RoleType), removing an element, etc.
Metamodel editing: graphical editors for editing different parts of a CDLExt model (e.g. modeling dependencies and QoS, modeling static aspects of WS-CDL, etc.).
Support: tools supporting tasks related to web service management but unrelated to CDLExt, such as code generation, editors for web services-related languages, etc.
The choice of shapes representing model elements in the toolset's graphical editors (like a circle for services, etc.) is arbitrary. Shapes relevant to dependency modeling are presented in Table 1. To represent static choreography elements from WS-CDL, a similar notation was chosen (i.e. rectangles and lines). However, if any of the graphical notation elements prove unclear or more appropriate shapes are found, the graphical notation will be subject to change.
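The undo/redo behaviour attributed above to the metamodel logic can be illustrated with a minimal Python sketch of the command pattern. This is not the toolset's code, which is generated with EMF and GMF; the class names AddElement, RemoveElement and CommandStack are hypothetical, and the model is reduced to a plain dictionary keyed by element name. Validation is limited to rejecting duplicate names, whereas the toolset is described as using separate strategy/policy objects for that purpose.

```python
class AddElement:
    """Command that adds a named element to the model."""

    def __init__(self, model, name, element):
        self.model, self.name, self.element = model, name, element

    def execute(self):
        # Minimal validation: element names must be unique in the model.
        if self.name in self.model:
            raise ValueError(f"duplicate element: {self.name}")
        self.model[self.name] = self.element

    def undo(self):
        del self.model[self.name]


class RemoveElement:
    """Command that removes a named element and remembers it for undo."""

    def __init__(self, model, name):
        self.model, self.name = model, name
        self.removed = None

    def execute(self):
        self.removed = self.model.pop(self.name)  # raises KeyError if absent

    def undo(self):
        self.model[self.name] = self.removed


class CommandStack:
    """Keeps executed commands so that they can be undone and redone."""

    def __init__(self):
        self.done, self.undone = [], []

    def run(self, command):
        command.execute()
        self.done.append(command)
        self.undone.clear()  # a new edit invalidates the redo history

    def undo(self):
        command = self.done.pop()
        command.undo()
        self.undone.append(command)

    def redo(self):
        command = self.undone.pop()
        command.execute()
        self.done.append(command)


# Hypothetical editing session on an initially empty model.
model = {}
stack = CommandStack()
stack.run(AddElement(model, "Customer", {"type": "RoleType"}))
stack.run(AddElement(model, "Store", {"type": "RoleType"}))
stack.undo()                                  # "Store" is removed again
stack.redo()                                  # "Store" is restored
stack.run(RemoveElement(model, "Customer"))
stack.undo()                                  # "Customer" is restored
```

Keeping each operation as an object with its own undo method is what allows an editor to offer an unbounded undo history without knowing the details of individual operations.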
Fig. 3. CDLExt toolset architecture diagram. Main elements are shown within the dashed rectangle labeled MVC, auxiliary tools like syntax coloring are symbolized by the Support element.
5.2 Implementation
The aforementioned toolset has been implemented using Eclipse 3.3 and supporting technologies: the Eclipse Modeling Framework [13], the Graphical Modeling Framework [14] and other minor Eclipse Foundation projects on which EMF and GMF depend. The most important part of toolset development was to implement the CDLExt metamodel described in section 4 (metamodel structure) with the use of the Eclipse Modeling Framework. The EMF implementation of the CDLExt metamodel (in EMF such a metamodel is called ecore) should reflect the syntax and semantics of WS-CDL and the extension. In order to achieve that, the WS-CDL XSD schema had to be enriched with additional EMF attributes according to the semantic rules described in the WS-CDL specification (for instance, a WS-CDL RoleRef element may point only to existing Role elements). The next steps of implementation focused on fine-tuning the EMF- and GMF-generated editors (metamodel logic and metamodel editing) and enriching them with additional features (support). The current version of the CDLExt toolset (0.8) delivers the following functionalities: CDLExt document editing, graphical modeling of artifacts and dependencies, graphical modeling of the static WS-CDL part (roles, participants, relationship types, etc.), validation of CDLExt documents, and others. At present, toolset development is focused on improving the editors' look and feel (colors, icons, etc.) and on visualization of activity-related WS-CDL parts in a form similar to UML sequence diagrams.
5.3 Testing
The purpose of the testing process was to evaluate whether CDLExt addresses the problems discussed in section 3, to verify the usability of the metamodel and its toolset, and to discover areas for further improvement in CDLExt and its toolset.
Fig. 4. One of scenarios used for testing of the CDLExt metamodel. Customer represents a service consumer, whereas other entities (Store, Warehouse, Pricing) represent services, with Store being a composite one.
We have tested the CDLExt toolset in five service layer scenarios. Each scenario included the creation of at least one composite service (two or more services composed) or three simple services. One of the test scenarios is presented in Fig. 4. The service layer configuration description capabilities (enumeration of services and their interactions) provided by WS-CDL have proven sufficient for describing different service interaction scenarios, including request, request/response and notification cases. CDLExt features, i.e. IT artifacts and dependency modeling, were found useful in the early discovery of risks to service contract fulfillment introduced by concurrent access of services to a shared resource. Furthermore, tests confirmed that, given the general view provided by the toolset, a service architect can draw up service contracts that are more realistic and therefore less risky to fulfill. Without visualization of service dependencies, it is very hard to evaluate whether a service with a given set of parameters can be delivered under the specified contract and budget. In such a case, dependency visualization and modeling is in fact the recognition of service contract risks. However, the existing QoS modeling capabilities proved unsatisfactory for one scenario, in which the customer demanded a high load level for 8 hours a day and a low load level for the rest of the day. Modeling such a distribution is impossible with the present, flat average characteristic.
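The limitation of a flat average load can be shown with a short calculation. The Python sketch below uses purely hypothetical numbers (300 requests per minute during 8 hours, 30 requests per minute during the remaining 16 hours, and a capacity of 200 requests per minute); none of these values come from the tested scenarios. It shows that the daily average may stay below the capacity of a dependency even though the peak segment exceeds it.

```python
def flat_average(profile):
    """Time-weighted average load of a profile given as (hours, load) segments."""
    total_hours = sum(hours for hours, _ in profile)
    return sum(hours * load for hours, load in profile) / total_hours


def peak(profile):
    """Highest load level occurring in any segment of the profile."""
    return max(load for _, load in profile)


# Hypothetical daily profile: 8 h of high load, 16 h of low load (requests/minute).
profile = [(8, 300.0), (16, 30.0)]
capacity = 200.0  # hypothetical sustainable load of the depended-on artifact

average = flat_average(profile)           # (8*300 + 16*30) / 24 = 120.0
looks_safe_on_average = average <= capacity
overloaded_at_peak = peak(profile) > capacity
```

With these numbers the flat average (120 requests per minute) suggests that the contract can be met, while the 8-hour peak (300 requests per minute) would overload the artifact, which is why a segment-based load description is needed.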
The most important areas for further extension of the metamodel are: extending the dependency QoS modeling capabilities to enable modeling of more complex load distributions, and revising the WS-CDL interaction description constructs, including verifying the usability of channel variables for service layer management tasks. Future improvements of the toolset should follow the changes made to the metamodel.
6 Future Plans
Establishment of the core part of CDLExt and the toolset provided a starting point for future extensions and improvements. These features can be categorized in the following way.
Automatic verification of new constructs against the choreography: implementing formal verification of programmed services (e.g. in WSBPEL) against the choreography by projecting them onto a common process algebra such as FSP [15], or another one specifically designed for that purpose [16], and verifying whether the programmed service composition fits the organization's interaction model described in the choreography. Such a feature would increase the value of a properly maintained choreography in the organization by ensuring early recognition of choreography breaches by new services (create/compose/use scenarios).
Extending graphical editing capabilities for the CDLExt metamodel: providing a graphical editor for modeling WS-CDL interactions would certainly ease WS-CDL editing tasks, and therefore make the use of CDLExt more efficient and productive.
Extending QoS features of the CDLExt metamodel: complementing CDLExt with additional elements for more adequate modeling of the IT infrastructure characteristics that influence the organization's capability of fulfilling the non-functional aspects of service contracts, e.g. reliability. For example, an attribute holding the upper estimate of downtime per year (in seconds) could be added to the Artifact element to model the availability of an IT artifact. It could then be used to calculate the availability of a service for a contract, based on the availability values of its dependencies.
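The availability calculation suggested in the last item can be sketched as follows. The paper does not prescribe a formula; the Python example below assumes that a service is available only when the service itself and all of its dependencies are available, and that their failures are independent, so that availabilities multiply. The function names and the downtime figures are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # non-leap year


def availability(downtime_seconds_per_year):
    """Fraction of the year during which an artifact is expected to be up."""
    return 1.0 - downtime_seconds_per_year / SECONDS_PER_YEAR


def service_availability(own_downtime, dependency_downtimes):
    """Availability of a service that needs all of its dependencies.

    Assumes independent failures, so the individual availabilities multiply.
    """
    result = availability(own_downtime)
    for downtime in dependency_downtimes:
        result *= availability(downtime)
    return result


# Hypothetical upper estimates: the service is down at most 1 h per year,
# its database at most 2 h per year.
service_only = availability(3600)
with_database = service_availability(3600, [7200])
```

Under these assumptions the availability that can be promised in a contract is always lower than that of the weakest artifact in the dependency chain, which is the kind of risk the extended model is meant to expose.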
7 Summary
We have proposed a WS-CDL based solution for top-level service layer management. In our opinion, an effective transition of IT infrastructure to SOA will be impossible without providing a single point of insight into the current service layer state and related architectural dependencies. The functionalities of service registries should be extended to support listing of services, their interactions (wires), QoS capabilities and their dependencies. Otherwise, the transition to SOA will be fragmentary and will not realize its full potential, risking breaches of service contracts and reducing the business value of services. The proposed CDLExt model is the first step towards eliminating such discrepancies.
References
1. OASIS SOA Reference Model TC: OASIS Reference Model for Service Oriented Architecture V 1.0, http://www.oasis-open.org/committees/download.php/19679/soa-rm-cs.pdf
2. OASIS WSBPEL Technical Committee: Web Services Business Process Execution Language Specification, http://www.oasis-open.org/committees/download.php/18714/wsbpel-specification-draft-May17.htm
3. W3C: Web Services Choreography Description Language Version 1.0, http://www.w3.org/TR/2005/CR-ws-cdl-10-20051109/
4. W3C: Web Services Description Language, http://www.w3.org/TR/wsdl
5. UDDI Spec Technical Committee: Universal Description Discovery and Integration (UDDI), http://uddi.org/pubs/uddi_v3.htm
6. Keen, M., Bishop, S., Hopkins, A., Milinski, S., Nott, C., Robinson, R., Adams, J., Verschueren, P., Acharya, A.: Patterns: Implementing an SOA using an Enterprise Service Bus. IBM Redbook (2004), http://www.redbooks.ibm.com/abstracts/SG246346.html?Open
7. Object Management Group: Common Object Request Broker Architecture (CORBA), http://www.omg.org/technology/documents/formal/corba_iiop.htm
8. W3C: WS Choreography Model Overview, http://www.w3.org/TR/2004/WD-ws-chor-model-20040324/
9. Menasce, D.A.: Composing Web Services: A QoS View. IEEE Internet Computing, XI-XII, IEEE (2004)
10. Mani, A., Nagarajan, A.: Understanding quality of service for Web services. IBM developerWorks (2002), http://www-128.ibm.com/developerworks/library/ws-quality.html
11. W3C Working Group: QoS for Web Services: Requirements and Possible Approaches (November 25, 2003), http://www.w3c.or.kr/kr-office/TR/2003/ws-qos/
12. Gamma, E., Helm, R., Johnson, R., Vlissides, J.M.: Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, Reading (1995)
13. Budinsky, F., Steinberg, D., Merks, E., Ellersick, R., Grose, T.J.: Eclipse Modeling Framework. Addison-Wesley, Reading (2003)
14. Eclipse Graphical Modeling Framework project page, http://www.eclipse.org/gmf/
15. Foster, H., Uchitel, S., Magee, J., Kramer, J.: Model-Based Analysis of Obligations in Web Service Choreography. In: Advanced International Conference on Telecommunications and International Conference on Internet and Web Applications and Services (AICT-ICIW 2006), p. 149 (2006)
16. Busi, N., Gorrieri, R., Guidi, C., Lucchi, R., Zavattaro, G.: Choreography and Orchestration: a synergic approach for system design. In: Benatallah, B., Casati, F., Traverso, P. (eds.) ICSOC 2005. LNCS, vol. 3826, pp. 228–240. Springer, Heidelberg (2005)
REVENTS: Facilitating Event-Driven Distributed HPC Applications
Dawid Kurzyniec, Vaidy Sunderam, and Magdalena Sławińska
Dept. of Math and Computer Science, Emory University, 400 Dowman Drive, Atlanta, GA 30322, USA
{dawidk,vss,magg}@mathcs.emory.edu
Abstract. Modern scientific applications that need to share geographically scattered resources and dynamically adapt to changes in the environment pose challenges to traditional parallel and distributed programming paradigms. Distributed component frameworks attempt to address the demands of contemporary HPC applications by enabling coarse-grained decomposition and loose coupling. Nonetheless, components usually communicate via synchronous RPC, which is not suitable for interactive applications. This paper introduces a novel distributed event notification system, called REVENTS, which enables both synchronous and decentralized asynchronous component interactions. The REVENTS system is based on a topic-list publisher-subscriber model. It integrates and enhances common technologies for messaging, events, and group communication. The article introduces the REVENTS API, its reference implementation, and its application in the H2O metacomputing framework. Presented experimental results confirm REVENTS’ usability in distributed HPC scenarios. Keywords: distributed event system, middleware, loosely coupled system.
1 Introduction
Seamless resource aggregation across administrative domains is one of the objectives of grid computing. However, metacomputing over geographically distributed machines introduces numerous challenges related to resource and network heterogeneity, unreliability, and dynamicity. In grid environments, resources appear and disappear dynamically, load changes unpredictably, and nodes exhibit non-uniform performance characteristics. Additionally, scientific applications are becoming increasingly sophisticated. Unlike traditional, monolithic, uni-modal parallel codes executed in batch mode on supercomputing clusters, modern applications often consist of multiple coupled simulation kernels, and support interactive steering and visualization.
Traditional parallel programming paradigms are not designed to cope with these new levels of complexity. For instance, the Single-Program-Multiple-Data (SPMD) model, commonly used in conjunction with the message passing communication paradigm on massively parallel machines, assumes a static collection of uniform and reliable computing nodes linked with a low-latency interconnect. Performance of SPMD applications degrades dramatically when these assumptions are not met, as is the case with grid environments. Support for interaction and coupling of simulation kernels is difficult to implement within SPMD. Clearly, new programming approaches are needed to allow modern scientific applications to take advantage of the cumulative distributed computational power of grid environments.
Distributed component frameworks have been proposed as a deployment technology that might be better suited to address these emerging application requirements. The component-based approach facilitates coarse-grained, heterogeneous application decomposition, which maps naturally to widely distributed systems. While individual components may still internally encapsulate parallel solvers requiring tightly-coupled, co-located resources, they externally present coarse-grained and opaque interfaces and service endpoints. In this model, applications are constructed by deploying necessary components and linking their communication interfaces.
Existing distributed component frameworks typically use a variant of the Remote Procedure Call (RPC) or Remote Method Invocation (RMI) paradigms to implement component interactions. RPC (or RMI) is appropriate for controlling the standard, sequential execution flow. However, interactive applications that need to respond to asynchronous events, such as user requests or resource availability changes, may be better served by a distributed event notification paradigm.
This paper describes REVENTS, an API and a reference implementation of a distributed event notification middleware system based on the topic-list publisher-subscriber model. REVENTS combines features provided individually by various event, messaging, and group communication technologies [1,2,3,4,5], augmenting and extending them as necessary to apply in the context of distributed event notification. The REVENTS API is partially based on the Java Message Service API (JMS) [1], a popular messaging technology for enterprise middleware. We give an overview of the REVENTS API and its unique features, discuss its use case in the H2O framework [6], and briefly evaluate its performance characteristics, assessing its applicability for scientific applications.
Research, publication, and presentation supported in part by DOE grant DE-FG0206ER25729, NSF grant CNS-720761, and a faculty travel grant from The Institute of Comparative and International Studies, Emory University.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 291–301, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Related Work
Event notification is a communication paradigm commonly used in conjunction with remote procedure calls in distributed component- and service-oriented systems [3,5,7,8,9]. This section discusses REVENTS in the context of the most important related projects.
ECho [10] is an event-based middleware library, providing flexible event typing functionality and efficient binary transport. It supports group communication by allowing multiple (distributed) sources to publish events to a common channel, and uses dynamic code generation to propagate subscription filters from event sinks to event sources. In comparison, REVENTS is agnostic with respect to the communication protocol, enabling interoperability with third-party event systems. It refrains from supporting group communication facilities at the low level, stressing full decentralization. It utilizes an SQL-based expression syntax for specifying event filters, avoiding the security and logistical complications related to dynamic code generation. Finally, it supports initial state notification for late-joining subscribers, which is not addressed by ECho.
Java Message Service (JMS) [1] is a messaging standard that allows JEE applications to create, send, receive, and read messages. Although REVENTS derives many abstractions from JMS, including topic-based publish-subscribe APIs, message types and metadata, filter expressions, and expiration semantics, it targets different application scenarios. Contrary to JMS, REVENTS is an asymmetric API, distinguishing event sources from event consumers. It does not require deployment of dedicated middleware infrastructure except on the directly involved (event source and event sink) machines. It provides APIs for dynamic topic creation and deletion, and for subscription listing and notification. It supports features specific to event (rather than message) delivery, such as initial state notification and event overriding.
WS-Messenger [8] is an event broker for grid systems, based on Web Services technologies. It delegates the event delivery semantics and implementation to the underlying messaging middleware (e.g. JMS). WS-Messenger is intended for handling infrastructure events rather than for use as an application runtime library.
3 REVENTS Overview
The REVENTS library provides an API and a reference implementation for distributed event notification based on the publisher-subscriber model with a topic list, as shown in Figure 1. In this model, events are generated by publishers associated with specific topics, and remote subscribers register subscriptions on their topics of interest. A subscription may be considered a proxy for its remote subscriber; events are propagated through the subscription to the subscriber via an event channel providing transport and session control mechanisms.
Fig. 1. The architecture of REVENTS
Channel Implementation. Asynchronous messaging middleware systems usually mandate deployment of designated, centralized services for storing queued messages en route from the publisher to the subscriber. Such an approach enables loose coupling, allowing message exchange between parties that are not simultaneously online. In contrast, event channels in REVENTS link clients and servers directly. Publishers insert events into the channel's server-side queue, and subscribers pull them from the client-side queue, as shown in Figure 1. The channel is responsible for transferring events between the queues, which requires both parties to be online simultaneously. This design choice has the advantage of decentralization, enabling ad-hoc application composition without the need to deploy additional dedicated services. If needed, though, decoupling can be enabled through a dedicated proxy.
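The two-queue channel design described above can be illustrated with a small, single-process Python sketch. It is not the REVENTS API (which is a Java library); the class EventChannel and its methods publish, transfer and poll are hypothetical names introduced only for this example, and network transport is replaced by an in-memory hand-over between the two queues.

```python
import queue


class EventChannel:
    """Toy model of a direct channel with a server-side and a client-side queue."""

    def __init__(self):
        self.server_queue = queue.Queue()   # filled by publishers
        self.client_queue = queue.Queue()   # drained by the subscriber
        self.connected = True               # both parties are online

    def publish(self, event):
        """Publisher side: enqueue an event on the server-side queue."""
        self.server_queue.put(event)

    def transfer(self):
        """Move pending events to the client side; needs both parties online."""
        if not self.connected:
            raise ConnectionError("publisher and subscriber must both be online")
        moved = 0
        while not self.server_queue.empty():
            self.client_queue.put(self.server_queue.get())
            moved += 1
        return moved

    def poll(self):
        """Subscriber side: return the next event, or None if none has arrived."""
        try:
            return self.client_queue.get_nowait()
        except queue.Empty:
            return None


channel = EventChannel()
channel.publish("resource-added")
channel.publish("resource-removed")
before_transfer = channel.poll()   # nothing has reached the client side yet
moved = channel.transfer()
first = channel.poll()
```

Because no intermediate store exists between the two queues, events published while the parties are disconnected cannot be handed over, which is the trade-off for not deploying a central message service.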
D. Kurzyniec, V. Sunderam, and M. Sławińska
Event Structure. Each event contains (1) a header, consisting of common fields such as an expiration date and priority, (2) a set of properties (metadata), which can be used to further filter events using selectors (described below), and (3) a body (payload) depending on the event type. Access to header fields is provided via dedicated methods in the base Event interface, and access to properties via appropriate getter and setter methods. Access to the body is provided via subinterfaces, separately for each event type.
REVENTS supports two groups of events: control and ordinary. The former are generated automatically by the system and inform the subscriber about important meta-occurrences such as deletion of a publication topic or disconnection of an event channel. The latter are created by the application at the server side via event factories, and published via topic publishers. Header fields are initialized upon publication. In particular, a publication ID, unique within the topic publisher, is assigned to every event. Published events become read-only and remain in read-only mode at the subscriber side; however, the event payload may be reused by cloning the event before publishing.
Hierarchical Topic List. In REVENTS, topics are organized into a tree structure, allowing certain operations to be applied to subtrees rather than single nodes. Publishers have a scope attribute, which can be either SELF or SUBTREE. Whereas events generated by a SELF-scoped publisher are delivered only to subscriptions linked to that publisher’s topic, events from a SUBTREE-scoped publisher are delivered to all subscriptions on the subtree. For instance, an event published on the root of the topic tree by a SUBTREE-scoped publisher will be broadcast to all subscriptions. By symmetry, a subscription is also associated with a scope attribute. A SELF-scoped subscription receives only the events published specifically on its topic.
A SUBTREE-scoped subscription additionally receives events posted on subtopics of its main topic. For example, a SUBTREE subscription on the root node receives all posted events. As these examples illustrate, scoping enables and promotes natural grouping of event topics, allowing event broadcasting and monitoring within hierarchical sub-groups. Additionally, the API provides chroot-like mechanisms for projecting subtrees as if they were separate topic trees, enabling multiplexing of virtual topic trees through a single event channel. Priorities, Selector Filters, Expiration, and Overriding. Event publishers have associated priorities, assigned to them upon creation. Events from a single publisher, or from multiple publishers of the same priority, are always delivered to clients in their publication order. Events from publishers of different priorities, on the other hand, may be subject to en-route reordering, applied at server- and client-side queues. The exact ordering of such event streams is non-deterministic and may be non-uniform between subscribers, as it depends on timing of subscribers consuming the events versus the publishers producing them. It is guaranteed, however, that the only departures from the FIFO order stem from pulling higher-priority events ahead of lower-priority ones. Subscribers indicate their notification interests by creating subscriptions on selected topics. REVENTS allows the subscription scope to be refined via event selectors. The selectors are expressions (based on SQL expression syntax, borrowed from JMS [1]) applied at the publisher’s side and evaluated against the event’s header and metadata in order to decide if it is a match for the subscriber’s interest. Events pending delivery may have limited periods of validity. They may become outdated either upon a predefined timeout or when more recent events obsolete them.
REVENTS: Facilitating Event-Driven Distributed HPC Applications
Canonical example scenarios include progress- or status-tracking mechanisms, in which each event often carries complete status information. In these cases, pending events may be safely dropped when a new event is enqueued. REVENTS supports event overriding by allowing publishers to assign override selectors to events upon publication. The event’s override selector is evaluated when the event is inserted into a queue, and specifies criteria for events (originating from the same publisher) that should be dropped from the queue. In addition to override selectors, events may be associated with an expiration timeout, which limits the total time the event is allowed to spend in queues (see footnote 1).
Late Subscription and Welcome Events. Events are often used for state change notification. In simple cases, each event fully and independently describes the complete state (e.g., absolute mouse pointer position, or percentage of completion of an activity). However, when the state is complex, the change event is often reduced to a mere notification, requiring the subscriber to issue an explicit query in order to obtain the updated state. This approach is appropriate in non-distributed event frameworks such as GUI subsystems. In distributed settings, in order to avoid frequent additional remote calls, it is common to embed complete state change information within each event. This allows the current state at any time to be fully determined by (and reconstructible from) an initial state followed by the sequence of state change events. As long as the changes are small and localized, this approach saves bandwidth while minimizing the number of RPC calls. In addition to reliable event delivery, however, this model also requires that the subscriber knows the initial state preceding the first change event it receives.
Since subscriptions are created dynamically, a newly created subscriber may have missed some number of events, say n; therefore, the initial state from its point of view is the state between events n (which it missed) and n+1 (which it is going to receive). The REVENTS API supports generating and propagating such initial state events (called welcome events). When a new subscription is created, a request is issued to all relevant publishers asking them to generate a welcome event dedicated to the new subscription. The library guarantees that the welcome event is the first that the subscriber receives from a given publisher, and that the correct FIFO semantics are preserved for subsequently received events. Pluggable Transport Layer. In the grid computing arena, event notification technologies tend to focus on communication protocols and interoperability [7,8]. REVENTS exemplifies a complementary approach, concentrating on APIs and semantics of event subscription and delivery. In fact, REVENTS is protocol-agnostic. The APIs for the client and for the server are defined separately. The library isolates the transport layer from server- and client-side event handling. The transport layer is responsible merely for implementing channel functionality, which includes (1) transferring events from server-side queues to client-side queues and (2) propagating subscription-related requests from clients to servers. Thus, in principle, interoperability of REVENTS with third-party distributed event systems could be achieved by implementing an appropriate transport layer to couple them at the communication protocol level. 1
The expiration timeout does not account for the communication overhead, i.e., time needed to transfer events between server- and client-side queues, as the API has no means of measuring it.
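The overriding and expiration semantics described earlier in this section (including the caveat in footnote 1) can be illustrated with a minimal sketch. It is not the REVENTS API, which is a Java library: the `Event` and `EventQueue` names, the Python predicate standing in for an SQL-style override selector, the single queue, and the explicit `now` argument are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Event:
    publisher: str                     # hypothetical: name of the originating publisher
    properties: dict                   # metadata that an override selector may inspect
    ttl: Optional[float] = None        # expiration timeout in seconds (None = no expiry)
    # Stand-in for an override selector: a predicate over already-queued events.
    override: Optional[Callable[["Event"], bool]] = None
    enqueued_at: float = 0.0           # set by the queue on insertion


class EventQueue:
    """Single FIFO queue; time is passed in explicitly to keep the sketch deterministic."""

    def __init__(self) -> None:
        self._items: List[Event] = []

    def put(self, event: Event, now: float) -> None:
        event.enqueued_at = now
        if event.override is not None:
            # Evaluate the override selector on insertion: drop matching
            # pending events that originate from the same publisher.
            self._items = [
                old for old in self._items
                if not (old.publisher == event.publisher and event.override(old))
            ]
        self._items.append(event)

    def get(self, now: float) -> Optional[Event]:
        # Skip events whose time spent in the queue exceeds their timeout.
        while self._items:
            event = self._items.pop(0)
            if event.ttl is None or now - event.enqueued_at <= event.ttl:
                return event
        return None


pending = EventQueue()
pending.put(Event("progress", {"task": "t1", "pct": 10}), now=0.0)
pending.put(Event("other", {"task": "t1", "pct": 99}), now=0.0)
# A newer status event for task t1 obsoletes the pending one from the same publisher.
pending.put(Event("progress", {"task": "t1", "pct": 50},
                  override=lambda old: old.properties["task"] == "t1"), now=1.0)
pending.put(Event("progress", {"task": "t2", "pct": 5}, ttl=2.0), now=1.0)

delivered = []
while True:
    event = pending.get(now=10.0)
    if event is None:
        break
    delivered.append((event.publisher, event.properties["pct"]))
```

In this run the 10% progress event is overridden, the event from the unrelated publisher survives, and the event with a 2-second timeout expires before it is consumed. As in footnote 1, transfer time between queues is not modeled.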
The default transport implementation in REVENTS is based on RMIX [11,12] – a flexible multiprotocol RMI library. Subscription-related requests are sent from clients to servers as RMI calls. To transfer events from servers to clients, the default implementation uses RMI-based blocking polling. The client repeatedly sends RMI requests to fetch all available events from the server-side queue and append them to the client-side queue. If the server-side queue is empty upon the invocation, the call blocks, waiting (with a timeout) for the next event to be published. Server-side blocking eliminates spin-waiting, while timeouts allow clients to distinguish a no-events condition from a silent network failure. If the client does not receive any response within triple the timeout duration, it assumes disconnection and signals it to the application. In case no events are generated for a prolonged period, the resulting timeouts can be viewed as heartbeats sent from the server and acknowledged by the client through subsequent poll requests. By using RMIX as the event transport layer, REVENTS can take advantage of its multi-protocol nature and configurable communication stacks, enabling configuration of the communication channel. Events may be encoded using various protocols such as SOAP, ONC RPC, or JRMP, and transmitted over a variety of transport layers, including SSL and JXTA. Moreover, when used in an RMIX-based distributed system (like H2O described in Section 4), REVENTS can leverage the existing communication infrastructure (including facilities for authentication, authorization, encryption, compression, or firewall tunneling) without introducing new semantics at the communication level.
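The blocking-poll transport described above can be sketched in a few lines. This is an in-process simulation under stated assumptions, not the RMIX-based implementation: the remote RMI call is replaced by a direct method call run in a thread, and `ServerQueue`, `HungServer`, and `poll_once` are hypothetical names.

```python
import queue
import threading
import time
from typing import List


class ServerQueue:
    """Server-side queue whose fetch() stands in for the remote polling call."""

    def __init__(self) -> None:
        self._events = queue.Queue()

    def publish(self, event: str) -> None:
        self._events.put(event)

    def fetch(self, timeout: float) -> List[str]:
        # Block until an event arrives or the timeout elapses. An empty
        # reply means "no events" and doubles as a heartbeat.
        try:
            batch = [self._events.get(timeout=timeout)]
        except queue.Empty:
            return []
        # Drain whatever else is already queued without blocking again.
        while True:
            try:
                batch.append(self._events.get_nowait())
            except queue.Empty:
                return batch


class HungServer:
    """Simulates a silent network failure: the reply arrives far too late."""

    def fetch(self, timeout: float) -> List[str]:
        time.sleep(10 * timeout)
        return []


def poll_once(server, client_queue: List[str], timeout: float) -> bool:
    """One polling round; returns False if the channel is considered broken."""
    reply: list = []
    call = threading.Thread(
        target=lambda: reply.append(server.fetch(timeout)), daemon=True
    )
    call.start()
    call.join(3 * timeout)  # no reply within triple the timeout => disconnected
    if not reply:
        return False
    client_queue.extend(reply[0])  # append fetched events to the client-side queue
    return True


server = ServerQueue()
client_queue: List[str] = []
server.publish("e1")
server.publish("e2")
got_events = poll_once(server, client_queue, timeout=0.1)
got_heartbeat = poll_once(server, client_queue, timeout=0.1)
channel_alive = poll_once(HungServer(), client_queue, timeout=0.05)
```

The first round transfers both pending events, the second returns an empty reply after the server-side timeout (the heartbeat case), and the third gives up after three timeout periods and reports a broken channel.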
4 Use Case: The H2O Distributed Platform
H2O is a lightweight, component-oriented resource sharing framework. In H2O, resource providers execute lightweight Java-based containers, called kernels, providing a hosting environment for remotely deployable components, called pluglets. Similar to Java servlets, pluglets execute in the server context; however, they may be dynamically plugged in by external entities such as clients, the provider, or third-party resellers. Subsequently, the deployed pluglets can be utilized and assembled by users. Role separation between container providers, software deployers, and users distinguishes H2O from other component frameworks, enabling resource sharing. Providers can supply raw resources (CPU, storage) that are further reconfigured by external deployers or clients. As illustrated in Figure 2, providers share resources autonomously and independently. Aggregation is assumed by specialized DVM-enabling pluglets that communicate in order to implement an abstraction of a Distributed Virtual Machine (DVM). Pluglets interact using two basic communication paradigms: RMI, including asynchronous and one-way RMI (RMIX), and distributed event notification (REVENTS).
Fig. 2. H2O collaborative metacomputing model
Kernel Events. The H2O API allows clients to register event listeners to monitor kernel, session and pluglet state and to track pluglet loading progress. The kernel can generate the following events:
– NewSessionEvent: fired when a new session is created at the kernel (e.g. when a new remote user logs in), given that the client is authorized to see the new session.
– SessionStateEvent: fired when the session state changes (e.g. when a remote user logs out). The event carries information about the old and the new state.
– DeployEvent: fired when a new pluglet is deployed.
– PlugletStateEvent: fired when the pluglet state changes. The event carries information about the old and new states. Two subclasses supply more information in specific cases: (1) PlugletLoadedEvent, fired after the main pluglet class is loaded into the kernel JVM, and (2) PlugletFailedEvent, fired when the pluglet fails to initialize or start due to an exception.
– JarDownloadingEvent: fired during pluglet loading, when a new chunk of data has been read by the class loader. This event can be used to track the progress of pluglet class loading. (It is particularly useful for pluglets deployed from the network.)
H2O uses REVENTS to deliver kernel events to remote clients. The kernel uses a single topic tree for all event-handling purposes, exploiting its hierarchical structure. Kernel-related events (NewSessionEvent and DeployEvent) are scoped within the /system/kernel/ topic, while session- and pluglet-related events (SessionStateEvent, PlugletStateEvent, JarDownloadingEvent) are scoped within the /system/session/{ID}/ and /system/pluglet/{ID}/ subtopics, respectively.
Each H2O session is associated with a dedicated event channel through which the events are propagated. H2O relies on REVENTS transport facilities (Section 3) to monitor session liveness. From the kernel’s perspective, if the client does not poll for events within a predefined timeout, which is equivalent to a missing heartbeat acknowledgment, the client is considered disconnected; the kernel then closes the session and releases all its resources. From the client’s perspective, if the REVENTS transport reports a broken channel (due to the lack of response from the server, equivalent to three missing heartbeats), the H2O library assumes disconnection, invalidates all the client API objects associated with it, and attempts a best-effort, asynchronous session close request to the server.
Pluglet Events. In addition to propagating kernel events, H2O enables pluglets to publish events of their own. Each pluglet is assigned a subtree within the main topic tree, anchored at /user/pluglets/{ID}/, serving as the application-level event scope. Pluglet events, along with kernel events, are tunneled through the client session’s event channel. From the API point of view, however, pluglets and their clients are provided with dedicated channel abstractions. A pluglet can obtain an event provider API object, allowing it to create subtopics and publishers and to publish events, as well as the current session’s event channel server, allowing the pluglet to register event subscriptions in response to clients’ RMI requests. At the client side, the API provides the application with a virtual event channel instance through which the subscribers can be managed. These virtual channels are mapped onto the underlying session channel through the multiplexing mechanisms described in Section 3.
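To make the use of the hierarchical topic tree concrete, the SELF/SUBTREE scoping rules from Section 3 can be sketched as a single matching predicate applied to H2O-style topic paths. The sketch rests on assumptions: the function names are hypothetical, the numeric IDs are invented, and combining the publisher-side and subscription-side rules as a union is one reading of the text rather than a statement of how REVENTS implements delivery.

```python
from enum import Enum
from typing import Tuple


class Scope(Enum):
    SELF = "SELF"
    SUBTREE = "SUBTREE"


def parts(topic: str) -> Tuple[str, ...]:
    """Split a topic path such as '/system/kernel/' into its components."""
    return tuple(p for p in topic.split("/") if p)


def is_within(topic: str, ancestor: str) -> bool:
    """True if `topic` equals `ancestor` or lies in the subtree below it."""
    t, a = parts(topic), parts(ancestor)
    return t[:len(a)] == a


def delivers(pub_topic: str, pub_scope: Scope,
             sub_topic: str, sub_scope: Scope) -> bool:
    """Decide whether an event reaches a subscription (assumed union of both rules)."""
    if parts(pub_topic) == parts(sub_topic):
        return True  # same topic: always delivered
    if pub_scope is Scope.SUBTREE and is_within(sub_topic, pub_topic):
        return True  # publisher broadcasts down its subtree
    if sub_scope is Scope.SUBTREE and is_within(pub_topic, sub_topic):
        return True  # subscription monitors its whole subtree
    return False


# A SUBTREE subscription on /system/ sees events of session 42 (hypothetical ID).
monitor_all = delivers("/system/session/42/", Scope.SELF, "/system/", Scope.SUBTREE)
# A SELF subscription on /system/kernel/ does not.
kernel_only = delivers("/system/session/42/", Scope.SELF, "/system/kernel/", Scope.SELF)
# A SUBTREE publisher at the root reaches every subscription.
broadcast = delivers("/", Scope.SUBTREE, "/user/pluglets/7/", Scope.SELF)
```

Under this reading, a client interested in all kernel, session and pluglet state changes needs only one SUBTREE-scoped subscription on /system/.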
5 Performance Evaluation
We have conducted a series of experiments in order to assess the performance of REVENTS. We focused on measuring throughput, in terms of the number of events generated and delivered per second, under various load conditions. All experiments were conducted on a network of 32 Sun Blade 2500 machines, each with two 1.2 GHz CPUs and 2 GB RAM, interconnected via a Gigabit Ethernet switch. In the experimental setup, one of the machines was used as the event publisher, while a varying number of the remaining machines were event subscribers. We ran four separate experiments, each for a different event type: (1) a basic text event, containing a single string, (2) a map event, containing 10 key-value pairs, (3) a medium-size bytes event containing a 1 KB byte array, and (4) a large-size bytes event containing a 1 MB byte array. Within each experiment, and for each count of subscribers, a series of 5-second event sequences was generated with increasing requested throughput, until we observed saturation indicating that a further throughput increase was not possible. We report two quantities: the cumulative sustained throughput, representing the number of events per second that the system is able to deliver to subscribers, and the burst throughput, indicating how fast the publisher can generate events, even if they cannot be processed by subscribers at that rate. The justification for these two quantities is that while the former shows how much processing capacity the system has at a steady rate, the latter indicates the potential for generating short bursts of events that can be enqueued at a faster rate without causing the publisher to block. Both quantities are computed as the number of events generated per time interval multiplied by the number of subscribers, and thus represent the count of events per second that have actually been pushed through all event queues at the server side.
Given that representation, in an ideal case of a perfectly scalable system, the reported throughput would remain constant irrespective of the number of subscribers.
The experimental results are shown in Figure 3. For single-string events (Fig. 3a), the maximum sustained throughput of about 45000 events/s is obtained for a subscriber count of 5; each subscriber observes a throughput of about 9000 events/s in this case. As the number of subscribers increases beyond 5, sustained throughput becomes limited by burst throughput, indicating CPU saturation at the server side. For map events (Fig. 3b), which are still relatively small yet somewhat more complex, the maximum sustained throughput of about 22000 events/s is achieved for a subscriber count in the 5-9 range, while the server-side CPU saturation can be observed for 7 and more subscribers. For 1 KB events (Fig. 3c), characterized by moderate size and structural simplicity, further flattening of the throughput curve and a shift of the server-side CPU saturation point can be observed. The maximum sustained throughput of about 13000 events/s (equivalent to 13 MB/s of payload transmitted over the network) is achieved for a subscriber count between 2 and 17, while the server saturates at 13 subscribers. In all three tests, the sustained throughput at a load level of 30 subscribers is approximately equal to 8000 events/s, corresponding to about 260 events/s perceived by each subscriber. Finally, for 1 MB events (Fig. 3d), the sustained throughput quickly reaches 25 events/s and remains in the 25-30 range, corresponding to 25-30 MB/s of payload transmitted over the network. The communication capacity is limited by the networking layer in this case; we were unable to measure the maximum burst throughput as it exceeded limits imposed by available memory.
Fig. 3. Maximum throughput measured for various event types, as a function of subscriber count: (a) text event containing a single string; (b) map event with 10 key-value pairs; (c) bytes event, 1 KB payload; (d) bytes event, 1 MB payload. The throughput is normalized with respect to subscriber count (i.e. shows events that were actually enqueued rather than published), so that in the ideal case, a constant line would be expected.
These results suggest that REVENTS can offer satisfactory performance on general-purpose networks (see footnote 2). Using moderately powerful hardware, it can achieve network utilization at the level of 7-30 MB/s, which significantly exceeds the capacities of typical wide-area networks for which it is primarily designed.
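The relationship between the cumulative (normalized) throughput, the rate seen by each subscriber, and the payload bandwidth can be checked with a short calculation. The function names are hypothetical, and the sketch assumes decimal units (1 KB = 1000 bytes, 1 MB = 10^6 bytes), which the text does not state.

```python
def cumulative_throughput(per_subscriber_events_per_s: float, subscribers: int) -> float:
    """Events per second pushed through all server-side queues."""
    return per_subscriber_events_per_s * subscribers


def per_subscriber_throughput(cumulative_events_per_s: float, subscribers: int) -> float:
    """Events per second perceived by a single subscriber."""
    return cumulative_events_per_s / subscribers


def payload_rate_mb_per_s(cumulative_events_per_s: float, payload_bytes: int) -> float:
    """Payload bandwidth in MB/s, assuming 1 MB = 1,000,000 bytes."""
    return cumulative_events_per_s * payload_bytes / 1_000_000


text_peak = cumulative_throughput(9000, 5)           # text events, 5 subscribers
rate_at_30 = per_subscriber_throughput(8000, 30)     # any small event type, 30 subscribers
kb_bandwidth = payload_rate_mb_per_s(13000, 1000)    # 1 KB events at peak
mb_bandwidth = payload_rate_mb_per_s(25, 1_000_000)  # 1 MB events
```

These reproduce the figures quoted above: 45000 events/s for text events, 13 MB/s for 1 KB events, and 25 MB/s for 1 MB events. The per-subscriber rate at 30 subscribers comes out near 267 events/s, in line with the rounded value of about 260 given in the text.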
2 REVENTS does not aspire to be a replacement for MPI group communication primitives that offer greater scalability by exploiting static network topologies.
6 Conclusions and Future Work
This paper describes REVENTS – an experimental distributed event notification middleware system. REVENTS derives many of its features from popular event, messaging, and group communication libraries, and enhances them with hierarchical topic lists, event publisher prioritization, and support for late subscriptions, providing flexible and comprehensive mechanisms to facilitate development of modern interactive distributed applications. By separating the transport layer from event handling, and focusing on the API and event subscription and delivery semantics, REVENTS can take advantage of popular communication protocols, enabling interoperability with third-party distributed event systems. REVENTS has been used to implement notification facilities in the H2O distributed component framework, both for internal infrastructure events and user-defined application-level events. The H2O distribution [13] includes further usage examples. The conducted experiments show that REVENTS achieves satisfactory performance in the general-purpose network settings for which it is designed. Our current work concentrates on examining the usability of REVENTS within the context of adaptive applications and resource discovery in dynamically changing and collaborative grid environments.
References
1. Hapner, M., Sharma, R., Burridge, R., Fialli, J., Haase, K.: Java Message Service API tutorial and reference: messaging for the J2EE platform. Addison-Wesley Longman Publishing Co., Inc, Boston (2002)
2. Gamma, E., Helm, R., Johnson, R., Vlissides, J.: Design Patterns: Elements of Reusable Object-Oriented Software. Professional Computing Series. Addison-Wesley, Reading (1995)
3. Microsoft Corporation: .NET framework developer’s guide: Handling and raising events, http://msdn.microsoft.com/library/default.aspurl=/library/cpguide/html/cpconEvents.asp
4. Ban, B.: JGroups – a toolkit for reliable multicast communication, http://www.jgroups.org/
5. Goetz, B.: Java theory and practice: Be a good (event) listener. IBM developerWorks, http://www-128.ibm.com/developerworks/java/library/j-jtp07265/index.html
6. Kurzyniec, D., Wrzosek, T., Drzewiecki, D., Sunderam, V.: Towards self-organizing distributed computing frameworks: The H2O approach. Parallel Processing Letters 13(2), 273–290 (2003)
7. Huang, Y., Gannon, D.: A comparative study of Web Services-based event notification specifications. In: ICPPW 2006: Proceedings of the 2006 International Conference Workshops on Parallel Processing, Washington, DC, USA, pp. 7–14. IEEE Computer Society Press, Los Alamitos (2006), Available at: http://doi.ieeecomputersociety.org/10.1109/ICPPW.2006.5
8. Huang, Y., Slominski, A., Herath, C., Gannon, D.: WS-Messenger: A Web Services-based messaging system for service-oriented grid computing. In: CCGRID 2006: Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid, Washington, DC, USA, pp. 166–173. IEEE Computer Society Press, Los Alamitos (2006), Available at: http://doi.ieeecomputersociety.org/10.1109/CCGRID.2006.109
9. van Hoof, J.: How EDA extends SOA and why it is important. Thoughts on Service Oriented Architecture and Event-Driven Architecture (blog) (November 2006), http://soa-eda.blogspot.com/2006/11/how-eda-extends-soa-and-why-it-is.html
10. Eisenhauer, G., Bustamante, F.E., Schwan, K.: Event services in high performance systems. Cluster Computing: The Journal of Networks, Software Tools, and Applications 4(3), 243–252 (2001)
11. Kurzyniec, D., Wrzosek, T., Sunderam, V., Slomiński, A.: RMIX: A multiprotocol RMI framework for Java. In: Proc. of the International Parallel and Distributed Processing Symposium (IPDPS 2003), Nice, France, pp. 140–146. IEEE Computer Society, Los Alamitos (2003)
12. Wrzosek, T., Kurzyniec, D., Sunderam, V.S.: Performance and client heterogeneity in service-based metacomputing. In: Proc. of Heterogeneous Computing Workshop, in conjunction with IPDPS 2004, Santa Fe, New Mexico, USA. IEEE Computer Society, Los Alamitos (2004)
13. DCL: H2O Home Page, http://www.mathcs.emory.edu/dcl/h2o/
Empowering Automatic Semantic Annotation in Grid
Michal Laclavík, Marek Ciglan, Martin Šeleng, and Ladislav Hluchý
Institute of Informatics, Slovak Academy of Sciences, Dúbravská cesta 9, 845 07 Bratislava, Slovakia
[email protected]
http://ikt.ui.sav.sk/
Abstract. Capturing knowledge in ontological structures is currently one of the primary focuses of Semantic Web research. To bring the knowledge contained in the vast quantity of existing unstructured natural-language texts into ontologies, tools for automatic semantic annotation (ASA) are badly needed. In this paper, we present the ASA tool Ontea and describe how Grid technology is used to increase its performance, which helps us deliver formalized semantic data in a shorter time. We have adjusted the Ontea annotation algorithm to be executable in a distributed grid environment. We also give a performance evaluation of the Ontea algorithm and experimental results from the cluster and grid implementation.
Keywords: semantic annotation, grid, Ontea.
1 Introduction
Automated annotation of Web documents is a key challenge of the Semantic Web [1] effort. Web documents are structured, but their structure is understandable only to humans, which is the major problem of the Semantic Web. In this paper, we present a tool for automatic semantic annotation named Ontea. We describe the methods implemented in the tool and present the results of an experimental evaluation. Although Ontea’s results are promising, we face a performance problem, as the Ontea annotation process is rather time-consuming. This led us to adopt computational grid technology, which allows us to reduce the delivery time of annotation results. The distributed version of Ontea is described in the paper, and an experimental evaluation of the time savings is presented.
1.1 Related Work
Annotation solutions can be divided into manual and semi-automatic methods; the choice of strategy depends on the intended use of the annotation. There are a number of annotation tools and approaches, such as CREAM [2] or Magpie [3], which follow the idea of providing users with useful visual tools for manual annotation, web page navigation, reading semantic tags and browsing [4], or which provide infrastructure and protocols for manually stamping documents with semantic tags, such as Annotea [5], Ruby [6] or RDF annotation [7]. Semi-automatic solutions focus on creating semantic metadata for further computer processing, using semantic data in knowledge management [8] or in information extraction applications. Semi-automatic approaches are based on natural language processing [9] [10], document structure analysis [11], or learning that requires training sets or supervision [12]. Moreover, other annotation approaches exist, e.g. KIM [14], which uses information extraction, or pattern-based semi-automatic solutions such as PANKOW and C-PANKOW [13], which use the Google API for automatic annotation. One relevant automatic semantic annotation solution, and the only one which runs on a distributed architecture, is SemTag [15]. It uses the Seeker [15] information retrieval platform to support annotation tasks. SemTag annotates web pages using the Stanford TAP ontology [16].
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 302–311, 2008. © Springer-Verlag Berlin Heidelberg 2008
1.2 Ontea
One of the pattern-based solutions is Ontea [17] [18], which works on text in a particular domain described by a domain ontology and uses regular expression patterns for semi-automatic semantic annotation. Ontea detects or creates ontology elements/individuals within the existing application/domain ontology model according to defined patterns. Several cross-application patterns are defined, but in order to achieve good results, new patterns need to be defined for each application. The main difference from, e.g., SemTag is that Ontea can also create, not only detect, ontology individuals within the defined ontology model. This functionality makes the Ontea algorithm slower than SemTag [15], but suitable for a wider scope of applications where the semantics of documents is needed. Ontea is similar to C-PANKOW [13] annotation, with better annotation results and better performance. With the Ontea annotation engine we want to achieve the following objectives:
– Detecting/creating metadata from text
– Preparing improved structured data for later computer processing
– Structured data based on an application ontology model
1.3 Use of Information Retrieval Techniques
Ontea also uses RFTS (Rich Full-Text Search) [19], a tool for document indexing and document search. Ontea uses RFTS functionality when creating a new ontology individual, in order to evaluate the relevance of the newly created instance. Ontea identifies the part of the text related to the semantic context and matches the subsequent sequence of characters to create an instance of the concept. Let us denote the sequence of words related to the semantic context by C and the word sequence identified as a candidate instance by I. We evaluate the relevance of the new instance by computing the ratio of the number of close occurrences of C and I to the number of occurrences of I in the whole collection of documents: close occurrence(C, I) / occurrence(I).
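The ratio above can be illustrated with a minimal sketch over a small in-memory collection. It does not use RFTS; the tokenizer, the function names, and the definition of a "close" occurrence as one within `window` words of the context are assumptions made for illustration.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lower-cased word tokens; a deliberately simple stand-in for real indexing."""
    return re.findall(r"\w+", text.lower())


def find_positions(tokens: List[str], phrase: List[str]) -> List[int]:
    """Start positions at which the word sequence `phrase` occurs in `tokens`."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


def relevance(documents: List[str], context: str, instance: str, window: int = 5) -> float:
    """close_occurrence(C, I) / occurrence(I) over the whole collection."""
    c, inst = tokenize(context), tokenize(instance)
    if not inst:
        return 0.0
    close = total = 0
    for doc in documents:
        tokens = tokenize(doc)
        context_positions = find_positions(tokens, c)
        for p in find_positions(tokens, inst):
            total += 1
            # Count the occurrence as "close" if the context is within the window.
            if any(abs(p - q) <= window for q in context_positions):
                close += 1
    return close / total if total else 0.0


docs = [
    "Location: Bratislava, full time position.",
    "Our office in Bratislava is growing.",
    "Location: Vienna. Travel to Bratislava is required occasionally for all staff.",
]
score = relevance(docs, context="Location", instance="Bratislava", window=3)
```

Here the candidate instance occurs three times in the collection but only once close to the context word, so the computed relevance is 1/3.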
The Ontea annotation method can also be used with the Lucene [20] information retrieval library. When connected with RFTS indexing, Ontea asks for relevance based on word distance. When connected with Lucene, Ontea asks for the ratio of occurrences of the matched regular expression pattern to occurrences of the detected element represented by a word. An example is Google, Inc. matched by the pattern for company search, [\s]+([-A-Za-z0-9][ ]*[A-Za-z0-9]*),[ ]*Inc[\.\s]+, where relevance is computed as the number of occurrences of Google, Inc. divided by the number of occurrences of Google. The RFTS indexing tool also supports this type of query; however, Lucene can achieve better performance. The use of Lucene or RFTS corresponds to the case Ontea creation IR in the evaluation.
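The pattern-based relevance from the Google, Inc. example can be sketched with Python's standard `re` module in place of Lucene. The company pattern is copied from the text; the function name, the whole-word counting rule, and the sample documents are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Dict, List

# Company pattern as given in the text.
COMPANY = re.compile(r"[\s]+([-A-Za-z0-9][ ]*[A-Za-z0-9]*),[ ]*Inc[\.\s]+")


def pattern_relevance(documents: List[str]) -> Dict[str, float]:
    """For each matched name: pattern matches / whole-word occurrences of the name."""
    matched = Counter()
    for doc in documents:
        matched.update(m.group(1) for m in COMPANY.finditer(doc))
    result = {}
    for name, hits in matched.items():
        word = re.compile(r"\b" + re.escape(name) + r"\b")
        total = sum(len(word.findall(doc)) for doc in documents)
        if total:  # guard against names the word-boundary rule cannot count
            result[name] = hits / total
    return result


docs = [
    "We are hiring at Google, Inc. in Zurich.",
    "Use Google to search for jobs.",
    "Google maps and Google mail are popular.",
    "Contact Google, Inc. for details.",
]
scores = pattern_relevance(docs)
```

In this collection the pattern matches twice while the word Google occurs five times, giving a relevance of 0.4.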
2 Evaluation
In this section we describe the evaluation of the Ontea annotation success rate as well as a performance evaluation on a single machine. Performance evaluation on the grid infrastructure is not yet available but will be at the time of the conference. We also describe the test set of documents.
2.1 Test Set of Documents
As reference test data, we used 500 job offers downloaded from the web using a wrapper, which provided us with some structured data. These data were converted to a defined ontology, then manually checked and edited according to the 500 HTML documents representing the reference job offers. Ontea processed the reference HTML documents using the reference ontology, resulting in new ontology metadata consisting of 500 job offers, which were automatically compared with the reference, manually checked job offer ontology metadata.
2.2 Target Ontological Concepts for Identification
In this test, Ontea used simple regular expressions matching from 1 to 4 words starting with a capital letter. This experiment is referred to as Ontea in the following sections. In the second case we used domain-specific regular expressions which identified locations and company names in the text of job offers, and Ontea also created individuals in the knowledge base, whereas in the first case Ontea did not create new property individuals and only searched for relevant individuals in the knowledge base. This second case is referred to as Ontea creation. The third case additionally used the previously described RFTS indexing tool or Lucene to decide whether it is feasible to create a new individual, using the relevance techniques described earlier. This case is referred to as Ontea creation IR. To sum up, we conducted our experiments in 3 cases:
– Ontea: searching for relevant concepts in the knowledge base (KB) according to generic patterns
– Ontea creation: creating new individuals of concrete application-specific objects found in the text
– Ontea creation IR: similar to the previous case, with feedback from RFTS or Lucene providing a relevance value computed from word occurrence. Individuals were created only when the relevance was above a defined threshold, which was set to 10%
We used the following regular expressions:
– Generic expression matching one or more words in text; this was used only to search for concepts in the KB:
([A-Z][-A-Za-z0-9]+[\s]+ [-a-zA-Z]+)
– Identifying a geographical location in the text; if it was not found in the KB, an individual was created:
Location:[\s]*([A-Z][-a-zA-Z]+[ ]*[A-Za-z0-9]*) used for English
[0-9]{3}[ ]*[0-9]{2}[ ]+([A-Z][^\s,\.]+[ ]*[^0-9\s,\.]*)[ ]*[0-9\n,]+ used for Slovak text, where the settlement name is usually next to the ZIP code
– Identifying a company in the text; this was also used with other abbreviations such as "Ltd" or, for the Slovak language, "a.s." and "s.r.o.":
[\s]+([-A-Za-z0-9][ ]*[A-Za-z0-9]*),[ ]*Inc[\.\s]+ for English
[\s]+([A-Z][^\s,\.]+[ ]*[^\s,\.]*[ ]*[^\s,\.]*)[, ]*s\.r\.o\.[\s]+ used for Slovak texts
2.3 Success Rate of the Ontea Algorithm
In this section we discuss the evaluation of the algorithm and its success rate. To evaluate the success of annotation, we used the standard recall, precision and F1 measures. Recall is defined as the ratio of correct positive predictions made by the system to the total number of positive examples. Precision is defined as the ratio of correct positive predictions made by the system to the total number of positive predictions made by the system. Recall and precision reflect different aspects of annotation performance. Usually, if one of the two measures increases, the other decreases. These measures were first used to evaluate IR (information retrieval) systems by Cleverdon [21]. To obtain a single measure of performance, we use the F1 measure (first introduced by van Rijsbergen [22]), which combines precision and recall, with equal importance, into a single parameter for optimization. The F1 measure is the harmonic mean of precision P and recall R: $F_1 = 2PR / (P + R)$.

2.4 Experimental Results of Annotation Success Rate
Experimental results using the precision, recall and F1 measures are given in the table below. In the table we compare our results with other semantic annotation approaches and also list some advantages and disadvantages. The relevance column contains the F1 measure in the case of Ontea; for other methods relevance may be evaluated by other techniques, and there is usually no common measure. For example, for C-PANKOW relevance refers to recall. The rows relevant to our annotation approach are shown in grey; they give the success rate of the three evaluation cases mentioned in the previous section.
Table 1. Annotation experimental results

Method | Description | Relevance % | Precision % | Recall % | Disadvantages | Advantages
Ontea | regular expressions, search in knowledge base (KB) | 71 | 64 | 83 | high recall, lower precision | high success rate, generic solution, solved duplicity problem, fast algorithm
SemTag | disambiguity check, searching in KB | high | high | - | works only for TAP KB | fast and generic solution
Ontea creation | regular expressions (RE), creation of individuals in KB | 41 | 28 | 81 | application-specific patterns are needed, low precision | support for Slovak language
Ontea creation IR | RE, creation of individuals in KB + RFTS or Lucene relevance | 62 | 53 | 79 | some good results are killed by relevance identification | disambiguities are found and not annotated, good results
Wrapper | document structure | high | high | - | zero success with unknown structure | high success with known structure
PANKOW | pattern matching | 59 | - | - | low success rate | generic solution
C-PANKOW | POS tagging and pattern matching, Qtag library | 74 | - | 74 | suitable only for English, slow algorithm | generic solution
Hahn et al. | semantic and syntactic analysis | 76 | - | - | works only for English, not Slovak | -
Evans | clustering | 41 | - | - | low success rate | -
Human | manual annotation | high | high | high | problem with creation of duplicate individuals, inaccuracy | high recall and precision
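The way precision, recall and F1 interact can be illustrated with a small sketch. It scores raw matches of the generic capitalized-word pattern listed earlier against a hand-made gold set. The sample sentence, the gold set and the helper names are assumptions made for illustration only, and the knowledge-base lookup that Ontea performs on every match is omitted.

```python
import re

# Generic pattern as quoted in the paper: a capitalized word followed by one more word.
GENERIC = re.compile(r"([A-Z][-A-Za-z0-9]+[\s]+[-a-zA-Z]+)")


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(predicted, gold):
    """Return (precision, recall, F1) of predicted annotations against a gold set."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Hypothetical job-offer fragment and hypothetical gold annotations.
text = "Location: San Francisco. We seek a Java developer for Acme Systems, Inc."
gold = {"San Francisco", "Acme Systems"}

found = GENERIC.findall(text)
precision, recall, f1 = evaluate(found, gold)
print(found, precision, recall, round(f1, 3))
```

On this toy input the generic pattern finds every gold annotation but also several spurious pairs of words, which mirrors the "high recall, lower precision" remark in the Ontea row.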
The Ontea creation IR row is the most important one in the evaluation, because it combines information retrieval (IR) and annotation techniques. This combination allowed us to eliminate some incorrectly annotated results. For example, using the regular expression [Cc]ompany[:\s]*([A-Z][-A-Za-z0-9][ ]*[A-Za-z0-9]*) in the second case, we identified and created companies such as This position or International company, which were identified as not relevant in the third case with the use of IR. Similarly, Ontea creation also identified companies such as Microsoft or Oracle, which is correct, but in combination with IR these were eliminated. This decreases recall and increases precision. It may seem that the IR case is not successful here, but the opposite is true: in many texts Microsoft appears as part of a product name, e.g. Microsoft Office, and if we annotate more text it is better not to annotate Microsoft as a company and to accept lower recall. If we annotated Microsoft as a company in other texts, where it is used in the context of Microsoft Office, we would decrease the precision of annotation. The presented annotation technique is therefore very powerful in combination with indexing in applications where precision needs to be high.

2.5 Performance Evaluation on Single Machine
In this section we evaluate the performance of our solution on the same document set as described above. Annotation was executed on a Centrino notebook with a 1.66 GHz CPU and 1 GB RAM, running Linux. The algorithm is written in Java and uses the Sesame [23] semantic engine. The average size of documents in the test set was between 2000 and 4000 bytes. We measure the average time needed to annotate a document and the duration of a single ontology query to the knowledge base. The latter depends on the size of the ontology and knowledge base, but in our case this can be neglected. In a future evaluation on a larger set of documents we will need to take it into account, since query duration will increase with each individual added to the ontology.
First we ran the test on an empty ontology with only those regular expressions which can create individuals (see the Ontea creation case). If the individual returned by a regular expression exists in the ontology, the system continues with the next regular expression result (query); if it does not exist in the ontology, a new individual is created by the system. Statistics after annotating 500 documents are given in the following table.

Table 2. Performance in Ontea creation case

Total time duration | Average time duration per document | Total executed queries | Average executed queries per document | Total created individuals | Average created individuals per document
220 s | 439 ms | 1533 | 3 | 1241 | 2.482

The next test was run on the ontology resulting from the previous case. Here we search for ontology individuals within the knowledge base and do not create them (see the Ontea case).

Table 3. Performance in Ontea case

Total time duration | Average time duration per document | Total executed queries | Average executed queries per document | Total found expressions | Average found expressions per document
2399 s | 4797 ms | 103570 | 207 | 4984 | 10
In the tables above we can see performance on document set on one machine. Both tables describe annotation tasks, which can be put to Grid environment. Annotation of one document takes approximately 5 seconds.
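The per-document averages follow directly from the reported totals and the 500 annotated documents. The short sketch below recomputes them as a consistency check; the dictionary keys and function name are hypothetical, and the figures are the ones reported for the Ontea creation and Ontea cases.

```python
N_DOCS = 500  # documents annotated in both single-machine tests

# Totals as reported for the two single-machine tests.
creation = {"seconds": 220, "queries": 1533, "individuals": 1241}
search = {"seconds": 2399, "queries": 103570, "found": 4984}


def per_document(total, n_docs=N_DOCS):
    """Average of a total quantity over the annotated documents."""
    return total / n_docs


creation_ms = per_document(creation["seconds"]) * 1000
search_ms = per_document(search["seconds"]) * 1000
slowdown = search["seconds"] / creation["seconds"]

print(round(creation_ms), round(search_ms))
print(per_document(creation["individuals"]), round(per_document(search["queries"])))
print(round(slowdown, 1))
```

The recomputed averages agree with the tables to within rounding, and the ratio of the two total times shows how much more expensive the query-heavy Ontea case is than the creation case.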
3 Putting Ontea into Grid
While Ontea yields interesting results, its annotation method is rather time consuming, and annotating a large number of documents or periodically re-annotating an updated document collection is highly impractical when the computation is performed on a single server. The Ontea computation is easily parallelized, and parallel threads can run independently with only two synchronization steps needed. This allows us to use distributed computing technology to speed up the annotation process for large document collections. In this section, we describe the technique we use to distribute the Ontea annotation method, exploiting existing grid computing infrastructure.
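The independence of the per-subset computation can be sketched as follows. This is a minimal illustration under stated assumptions, not the Ontea implementation: a thread pool stands in for grid jobs, only the English Location pattern is applied, the class URI and all function names are hypothetical, and identifiers are derived from a hash of the class URI and the detected text, as described for the distributed version in Section 3.1, so that the same individual receives the same identifier in every job.

```python
import hashlib
import re
from concurrent.futures import ThreadPoolExecutor

# English location pattern quoted in the paper.
LOCATION = re.compile(r"Location:[\s]*([A-Z][-a-zA-Z]+[ ]*[A-Za-z0-9]*)")
CLASS_URI = "http://example.org/onto#Region"  # hypothetical ontology class URI


def individual_id(class_uri, text):
    """Node-independent ID: hash of class URI and detected text, no timestamp."""
    digest = hashlib.sha1(f"{class_uri}|{text}".encode("utf-8")).hexdigest()
    return f"{class_uri}_{digest[:12]}"


def annotate_subset(docs):
    """One independent job: map individual ID to detected text for its subset."""
    found = {}
    for doc in docs:
        for match in LOCATION.findall(doc):
            text = match.strip()
            found[individual_id(CLASS_URI, text)] = text
    return found


def split(docs, n_jobs):
    """Distribute documents round-robin into n_jobs subsets."""
    return [docs[i::n_jobs] for i in range(n_jobs)]


def run_distributed(docs, n_jobs):
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        partial_results = list(pool.map(annotate_subset, split(docs, n_jobs)))
    merged = {}
    for partial in partial_results:  # synchronization step: integrate job results
        merged.update(partial)
    return merged


docs = ["Location: San Francisco", "Location: Bratislava", "Location: San Francisco"]
print(sorted(run_distributed(docs, 2).values()))
```

Because the identifiers do not depend on the node or on the time of creation, duplicates found by different jobs collapse into one entry during integration.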
3.1 Annotation Process Distribution
Ontea semantic annotation is performed in the following stages:
1. In the first stage, instances of ontological concepts are created from the input text collection based on regular pattern matching (case Ontea creation). This stage produces OWL ontology files which need to be integrated on a central machine.
2. After integration, the instances created in the first stage are evaluated by computing their relevance using IR techniques. The instances with a relevance value above a given threshold are identified as relevant and stored in the resulting domain ontology OWL file (case Ontea creation IR).
3. The domain ontology OWL file is used in the third stage of the process to search for annotation tags within the annotated text, similarly to step one but using general keyword matching patterns. This results in more ontology queries being executed and thus in more time being consumed (case Ontea in the evaluation).
4. The last stage integrates the produced semantic metadata into one knowledge base represented by an OWL file.

The distribution of the computation is straightforward: it is sufficient to split the document collection and run the annotation process in a distributed manner, where different computation jobs process distinct subsets of the collection. Distribution is possible for the first and third stages of the annotation process; the second and fourth stages are dedicated to the integration of job results and are performed on a single node. In the first step, ontology individuals are created. They are created with a unique RDF ID based on the detected text and a timestamp (e.g. http://...inst#region San Francisco 1179239158496). When the annotation algorithm is distributed, the same individuals can be created on different nodes. Because of the final integration, we had to change the ID creation so that it yields the same ID on all nodes. The timestamp value was therefore replaced by a hash of the detected text and the ontology class URI.

3.2 Data Management Aspects
The following data have to be managed for running Ontea text annotation:
– executables and libraries
– document collection for text annotation
– domain ontology OWL files

In the first stage, the jobs for pattern matching are submitted to the grid. The subsets of the document collection must be replicated from central storage to the grid sites where the jobs will be executed. After all of the first stage jobs have finished, the results are transferred to a central server, where the integration takes place. To reduce the transfer of collection data in the infrastructure, it would be advantageous to submit the third stage jobs to the same grid sites where the first stage jobs were executed. However, limiting the sites for third stage jobs only to those where document collection subsets are replicated might be impractical when those sites are heavily loaded and the jobs would be queued for a long time. We address this problem with a job management routine which first submits the jobs only to the given subset of grid sites; after a defined timeout it cancels those that are still in the scheduled state and resubmits them without restriction on the grid sites. In addition, the OWL ontology files have to be replicated to the execution sites of the first and third annotation stage jobs.

3.3 Evaluation Environment
We have developed a distributed version of Ontea using the grid infrastructure, and the approach was evaluated in the Gilda testbed [26], the training and test grid infrastructure of the EGEE [27] project. The grid testbed used for running Ontea is powered by gLite middleware, and the gridified version of Ontea is tailored to this middleware. Worker nodes were Intel(R) Pentium(R) 4 CPU 2.40GHz PCs. As the sites in the Gilda testbed are usually lightly loaded, we did not experience any wait times in the batch system queues for sub-jobs of the Ontea computation. We experienced an average overhead of 10 minutes from the grid middleware. The overhead includes submitting the job via the resource broker, scheduling the job from the resource broker to the execution site, job processing in the local batch system, transfer and unpacking of the input data set, notification of the resource broker about the state of processing, and retrieval of the results from the worker node after the processing is done.

Table 4. Performance on the Grid

Total time duration on Grid | Average time duration per document | Total time on single node | Grid overhead | Results integration overhead | Number of documents
173 min | 8.1 sec | 678 min | 10 min | 25 min | 5000
The evaluation was performed on the set of 5000 documents, split to five groups of approximately same size; five jobs were submitted to the grid infrastructure, each for one input data set. The same setting was used for 10 experiment runs. Average result delivery time for the whole input set was 173 minutes, with approximately 10 minutes grid middleware overhead and 25 minutes for results integration. The single machine processing of the whole input set took 678 minutes. The annotation of one document took approximately 8 seconds. This is longer than evaluation on a single machine due to a slower machine in testbed as well as changes in the code when allowing it for a grid. The utilization of the grid infrastructure for Ontea data processing brings us near linear speed-up of the results delivery, especially if used in real application where higher number of documents is annotated.
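The reported figures allow a rough decomposition of the speed-up. The sketch below assumes that the 173 minutes include the 10 minutes of middleware overhead and the 25 minutes of integration, and that neither overlaps with the parallel computation; both are assumptions about the measurement, and the variable names are hypothetical.

```python
SINGLE_NODE_MIN = 678   # whole input set processed on one machine
GRID_TOTAL_MIN = 173    # average result delivery time on the grid
OVERHEAD_MIN = 10       # grid middleware overhead
INTEGRATION_MIN = 25    # results integration on a single node
JOBS = 5                # number of submitted jobs


def speedup(sequential, parallel):
    """Ratio of sequential to parallel running time."""
    return sequential / parallel


end_to_end = speedup(SINGLE_NODE_MIN, GRID_TOTAL_MIN)

# Assumed split: what remains after overhead and integration is parallel work.
parallel_part = GRID_TOTAL_MIN - OVERHEAD_MIN - INTEGRATION_MIN
compute_only = speedup(SINGLE_NODE_MIN, parallel_part)

print(round(end_to_end, 2), parallel_part, round(compute_only, 2))
print(round(end_to_end / JOBS, 2), round(compute_only / JOBS, 2))
```

Under these assumptions the annotation phase itself scales almost linearly with the five jobs, while the end-to-end figure is lower because overhead and integration are serial. Since those serial parts grow more slowly than the annotation work, this is consistent with the expectation that larger collections come closer to linear speed-up.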
4 Conclusion and Future Work
In this paper we discussed how the automatic semantic annotation solution Ontea can benefit from a Grid environment to deliver semantic annotation results faster. Ontea is quite a powerful method [17], with a success rate and performance acceptable for knowledge management applications or for applications in organizations where the semantics of documents is needed [8]. When applied to bigger document bases, its performance is not acceptable and needs to be improved. To the best of our knowledge, no annotation solution yet benefits from Grid infrastructure, and only the SemTag [15] annotation used a distributed architecture to perform annotation tasks. We have shown that semantic annotation can be gridified and deliver its results much faster. Semantic web research can be successful only when the number of annotated web documents reaches a critical mass; we believe this can happen only when annotation solutions benefit from parallel and distributed computing, which is widely used in the information retrieval field. In our future work we would like to parallelize the Ontea algorithm using MapReduce [24] and its open source implementation Hadoop [25].

Acknowledgments. This work is supported by projects NAZOU SPVV 1025/2004, VEGA 2/6103/6, VEGA 2/7098/27, EGEE-II EU 6FP RI-031688 and RAPORT APVT-51-024604.
References
1. Berners-Lee, T., Hendler, J., Lassila, O.: Semantic web. Scientific American 1, 68–88 (2000)
2. Handschuh, S., Staab, S.: Authoring and annotation of web pages in cream. In: WWW 2002: Proceedings of the 11th international conference on World Wide Web, pp. 462–473. ACM Press, New York (2002)
3. Domingue, J., Dzbor, M.: Magpie: supporting browsing and navigation on the semantic web. In: IUI 2004: Proceedings of the 9th international conference on Intelligent user interface, pp. 191–197. ACM Press, New York (2004)
4. Uren, V., Motta, E., Dzbor, M., Cimiano, P.: Browsing for information by highlighting automatically generated annotations: a user study and evaluation. In: K-CAP 2005: Proceedings of the 3rd international conference on Knowledge capture, pp. 75–82. ACM Press, New York (2005)
5. Annotea Project (2001), http://www.w3.org/2001/Annotea/
6. W3C, Ruby Annotation (2001), http://www.w3.org/TR/ruby/
7. The Institute for Learning and Research Technology, RDF Annotations (2001), http://ilrt.org/discovery/2001/04/annotations/
8. Uren, V., Cimiano, P., Iria, J., Handschuh, S., Vargas-Vera, M., Motta, E., Ciravegna, F.: Semantic annotation for knowledge management: Requirements and a survey of the state of the art. Journal of Web Semantics: Science, Services and Agents on the World Wide Web 4(1), 14–28 (2005)
9. Madche, A., Staab, S.: Ontology learning for the semantic web. IEEE Intelligent Systems 16(2), 72–79 (2001)
10. Charniak, E., Berland, M.: Finding parts in very large corpora. In: Gómez-Pérez, A., Benjamins, V.R. (eds.) EKAW 2002. LNCS (LNAI), vol. 2473, pp. 358–372. Springer, Heidelberg (2002)
11. Glover, E., Tsioutsiouliklis, K., Lawrence, S., Pennock, D., Flake, G.: Using web structure for classifying and describing web pages. In: Proceedings of the Eleventh International Conference on World Wide Web, pp. 562–569. ACM Press, New York (2002)
12. Reeve, L., Han, H.: Survey of semantic annotation platforms. In: SAC 2005: Proceedings of the 2005 ACM symposium on Applied computing, pp. 1634–1638. ACM Press, New York (2005)
13. Cimiano, P., Ladwig, G., Staab, S.: Gimme the context: context-driven automatic semantic annotation with c-pankow. In: WWW 2005: Proceedings of the 14th international conference on World Wide Web, pp. 332–341. ACM Press, New York (2005)
14. Kiryakov, A., Popov, B., Terziev, I., Manov, D., Ognyanoff, D.: Semantic Annotation, Indexing, and Retrieval. Elsevier's Journal of Web Semantics 2(1) (2005), http://www.ontotext.com/kim/semanticannotation.html
15. Dill, S., Eiron, N., et al.: A Case for Automated Large-Scale Semantic Annotation. Journal of Web Semantics (2003)
16. Guha, R., McCool, R.: Tap: Towards a web of data, http://tap.stanford.edu/
17. Laclavik, M., Seleng, M., Gatial, E., Balogh, Z., Hluchy, L.: Ontology based Text Annotation OnTeA; Information Modelling and Knowledge Bases XVIII. In: Duzi, M., Jaakkola, H., Kiyoki, Y., Kangassalo, H. (eds.) Frontiers in Artificial Intelligence and Applications, vol. 154, pp. 311–315. IOS Press, Amsterdam (2007)
18. NAZOU Project Website (2006), http://nazou.fiit.stuba.sk/
19. Ciglan, M.: Documents Content Indexing for Supporting Knowledge Acquisition Tools. In: Navrat, P., et al. (eds.) Tools for Acquisition, Organisation and Presenting of Information and Knowledge, pp. 49–63. Vydavatelstvo STU, Bratislava (2006) ISBN 80-227-2468-8
20. Hatcher, E., Gospodnetic, O.: Lucene in Action, Manning (2005) ISBN: 1932394281
21. Cleverdon, C.W., Mills, J., Keen, E.M.: Factors determining the performance of indexing systems, vol. 1-2. College of Aeronautics, Cranfield, U.K (1966)
22. Van Rijsbergen, C.J.: Information Retrieval. Butterworth-Heinemann, Newton, MA (1979)
23. OpenRDF.org, Sesame RDF Database (2006), http://www.openrdf.org/
24. Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters. Google, Inc. In: OSDI 2004, San Francisco, CA (2004)
25. Wiki, L.: HadoopMapReduce (2007), http://wiki.apache.org/lucene-hadoop/HadoopMapReduce
26. GILDA, Grid Infn Laboratory for Dissemination Activities (2007), https://gilda.ct.infn.it/
27. EGEE, Enabling Grids for E-sciencE (2007), http://www.eu-egee.org/
Fault Tolerant Record Placement for Decentralized SDDS LH*
Grzegorz Łukawski and Krzysztof Sapiecha
Department of Computer Science, Kielce University of Technology, Poland
{g.lukawski,k.sapiecha}@tu.kielce.pl
Abstract. Scalable Distributed Data Structures (SDDS) is a file of records distributed to servers of a multicomputer. Faults of record placement in SDDS file may lead the file to crash. In this paper a fault model for decentralized SDDS LH* (DSDDS LH*) is given and, for such a model, a fault tolerant record placement architecture is presented. Client message forwarding is supplemented with client message backwarding. Split token (used in DSDDS LH* for controlling the file scalability) is supplemented with token tracing protocol and so called token level. These additional mechanisms allow for regeneration of DSDDS LH* file damaged by transient control faults.
1 Introduction
The cheapest way to High Performance Computing (HPC) is multicomputing [1]. For example, a multicomputer may be built from desktop or server PCs, connected through fast Ethernet and supervised by a Linux-based operating system. Scalable Distributed Data Structures (SDDS) [2] consist of two components dynamically spread across a multicomputer: records belonging to a file and a mechanism controlling record placement in the file space. The record placement mechanism is spread between SDDS servers and their clients. In a multicomputer there are many CPUs, memories and other hardware and software components cooperating with each other. Hence, there are many sources of operational faults that may disturb the expansion of an SDDS file. Fault-tolerant record placement architectures for centralized SDDS LH* were presented and evaluated in [4,5]. In this paper a fault-tolerant record placement architecture for decentralized SDDS LH* (DSDDS LH*) is introduced. In section 2 a brief outline of the DSDDS LH* architecture is presented. Possible DSDDS LH* failures are analyzed in section 3. In section 4 a fault tolerant record placement architecture for DSDDS LH* is presented. The paper ends with conclusions.
2 Decentralized Scalable Distributed Data Structures with LH*
The smallest SDDS component is a record. Each record is equipped with a unique key. Records with keys are stored in buckets. Each bucket's capacity is limited.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 312–320, 2008. © Springer-Verlag Berlin Heidelberg 2008
If a bucket's load reaches some critical level, it performs a split: a new bucket is created and half of the data from the splitting bucket is moved into the new one. A client is another SDDS file component. It is a front-end for accessing data stored in the SDDS file and may be part of an application. One or more clients may operate on the file simultaneously. The client may be equipped with a so-called file image used for bucket addressing. The file image does not always reflect the actual file state, so the client may commit an addressing error. An incorrectly addressed bucket forwards the message to the correct one and sends an Image Adjustment Message (IAM) to the client, updating its file image so that it never commits the same addressing error again [2].

All the SDDS file components are connected through a network. Usually one multicomputer node maintains a single SDDS bucket or client, but a single node may maintain more components. In the extreme case all SDDS buckets, clients and additional components may run on a single PC.

For record addressing, modified Linear Hashing (LH*) [2] is used. For distributing records between buckets LH* uses a hashing function, such as simple division modulo N:

$h_i(C) = C \bmod 2^i$ (1)

where C is a record key and i is a file/bucket level. Because the number of buckets in a DSDDS file need not be a power of 2, two hashing functions with levels i and i + 1 are used for bucket and record addressing. The value i is called the file level; each bucket has its own bucket level, denoted j (Fig. 1). After a successful bucket split, its level j increases.

LH* bucket splits must be performed in a specific order. A so-called split token is used for split synchronization. Only the bucket currently holding the token may create a new bucket instance and split. After a successful split, bucket number m sends the split token to bucket number m':

$m' = (m + 1) \bmod 2^j$ (2)

where j denotes the level of the splitting bucket. Fig. 1 shows an example of decentralized LH* file evolution. At the beginning there is only one bucket, with number 0 (Fig. 1a); as the hashing function uses this bucket's level (j = 0) for record addressing, it accepts all records. The lone bucket overloads and, because it possesses the split token, a split is initiated. About half of its content is moved to the new bucket number 1 (Fig. 1b), and bucket 0 keeps the split token. After more successful insert operations, bucket 0 overloads again and bucket number 2 is initiated (Fig. 1c). If the number of buckets is a power of two (Fig. 1a, b, d), a single bucket level applies to all active buckets; otherwise two consecutive levels are used (Fig. 1c).
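The addressing rule (1) and the token passing rule (2) can be traced with a short sketch. It replays the first three splits of the file evolution described above. The dictionary layout, the function names and the sample keys are hypothetical, and numbering the new bucket as m + 2^j is an assumption taken from standard linear hashing; it agrees with the bucket numbers mentioned for the figures.

```python
def h(level, key):
    """Hash function (1): h_level(key) = key mod 2**level."""
    return key % (2 ** level)


def next_token_holder(m, j):
    """Rule (2): bucket m with level j passes the token to (m + 1) mod 2**j."""
    return (m + 1) % (2 ** j)


def split(buckets, m):
    """Split bucket m and return the number of the next token holder.

    buckets maps bucket number -> {"level": j, "keys": [...]}.
    Assumption: the new bucket is numbered m + 2**j, as in linear hashing.
    """
    j = buckets[m]["level"]
    new = m + 2 ** j
    keys = buckets[m]["keys"]
    # Rehash with the next level: every key either stays in m or moves to new.
    buckets[m] = {"level": j + 1, "keys": [k for k in keys if h(j + 1, k) == m]}
    buckets[new] = {"level": j + 1, "keys": [k for k in keys if h(j + 1, k) == new]}
    return next_token_holder(m, j)


buckets = {0: {"level": 0, "keys": [3, 8, 13, 22]}}  # hypothetical record keys
token = 0
for _ in range(3):
    token = split(buckets, token)

print(token, {m: b["keys"] for m, b in sorted(buckets.items())})
```

After the first split the token stays at bucket 0, since (0 + 1) mod 2^0 = 0, which matches the remark that bucket 0 keeps the split token.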
3 Decentralized SDDS LH* Fault Model
As far as record placement is concerned faulty behaviors of both the client and the server (the bucket) were considered.
Fig. 1. Example DSDDS LH* file evolution (bucket capacity 4 records)

3.1 Client
The DSDDS file client does not pose any real problem for correct DSDDS file operation. If the client's file image or addressing mechanism is damaged, the client may send its queries to invalid buckets. In such a situation, three scenarios may take place:
– A query is sent beyond the file space (to nodes where no buckets are stored at all).
– A query is sent to a bucket whose number is too small. In this case the message will be properly forwarded to the correct bucket after at most two additional send operations, and the client will receive an Image Adjustment Message [2].
– A query is sent to a bucket whose number is too big. In such a situation the client's query may not reach the correct bucket, as the basic SDDS LH* architecture is not aware of such a possibility [2].

3.2 Bucket
A damaged bucket obviously means loss of data, but the split token may be in danger, too. The bucket recovery algorithm defined in [3] for data fault-tolerant architectures could be applied in the case of transient or permanent faults concerning one or more buckets. However, these architectures were developed for centralized LH* and no procedure for token recovery was given. Split token failures may cause the following:
– No split token in the whole SDDS file: the file evolution is stuck, as no bucket is able to perform a split.
– Multiple split tokens in the file: this may happen after a failed bucket recovery (if the recovery procedure generated additional unnecessary tokens) or directly after bucket or message damage. The SDDS file structure may be in real danger. Invalid splits may be performed, leading to an invalid file structure, addressing problems and/or data loss.
– Wandering token: one or many buckets are damaged (e.g. invalid address tables) and the token is sent between buckets with no respect to LH* rules. This means that invalid splits may take place, as in the case of multiple tokens.
4 Fault Tolerant Record Placement for DSDDS LH*
For SDDS file analysis under faulty conditions, a specialized software-implemented fault injector called SDDSim was developed [6]. Simulation experiments proved the high vulnerability of DSDDS LH* to operational faults [4,5]. Additional mechanisms for the client and the bucket (split token) are necessary for correct DSDDS LH* file operation. They concern so-called backwarding, the token tracing protocol and the token level.

4.1 Backwarding
In the original SDDS LH*, expansion of the SDDS file makes a message forwarding algorithm necessary [2]. However, a failure could cause a client query to be sent incorrectly not only to a lower-addressed bucket but to a higher-addressed one, too. Therefore, client message forwarding should be supplemented with client message backwarding. This additional mechanism is contained in Algorithm 1 (where a is the number of the destination bucket according to the client's file image, and a' is the number of the actual bucket). The last condition of the algorithm ensures that a message sent incorrectly to a bucket whose number is too big will eventually reach the correct bucket.

Algorithm 1. Client message forwarding and backwarding
a' ← h_j(c);
if a' ≠ a then
  a'' ← h_{j−1}(c);
  if a < a'' and a'' < a' then
    a' ← a'';
  end if
  if a'' < a and a < a' then { Backwarding phase }
    a' ← a'';
  end if
end if
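One possible reading of Algorithm 1 is sketched below. The primes distinguishing a, a' and a'' are not legible in the printed algorithm, so the two conditions used here are an assumption: the forwarding test follows the usual LH* rule, and the backwarding test redirects a message when the lower-level address lies below the receiving bucket while the higher-level address lies above it. The bucket levels describe a hypothetical six-bucket file, and all function names are illustrative.

```python
def h(level, key):
    """h_level(key) = key mod 2**level."""
    return key % (2 ** level)


def next_hop(a, j, c):
    """Bucket a with level j decides where key c goes; returns a to accept it."""
    a1 = h(j, c)                  # a' in Algorithm 1
    if a1 != a:
        a2 = h(j - 1, c)          # a''
        if a < a2 < a1:           # forwarding phase (usual LH* rule)
            a1 = a2
        if a2 < a < a1:           # backwarding phase (assumed reading)
            a1 = a2
    return a1


def route(levels, start, c, max_hops=8):
    """Follow hops from bucket start until some bucket accepts key c."""
    a, hops = start, 0
    while hops <= max_hops:
        if a not in levels:
            raise LookupError(f"message for key {c} left the file space at {a}")
        target = next_hop(a, levels[a], c)
        if target == a:
            return a, hops
        a, hops = target, hops + 1
    raise RuntimeError("too many hops")


# Hypothetical file with buckets 0..5: buckets 0 and 1 have already split.
levels = {0: 3, 1: 3, 2: 2, 3: 2, 4: 3, 5: 3}

print(route(levels, 0, 6))   # query sent to a bucket number that is too small
print(route(levels, 5, 6))   # query sent to a bucket number that is too big
```

Without the backwarding test, bucket 5 would pass key 6 on to bucket 6, which does not exist in this file. With it, the message is redirected to bucket 2.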
4.2 Token Tracing Protocol
To ensure that the token will not be lost in the case of a bucket failure, the bucket currently holding the split token is periodically checked by its previous holder. The token tracing protocol allows the split token to be recovered in its correct place without the danger of the file evolution getting stuck [4]. Algorithm 2 is performed by bucket k, the previous holder of the split token¹. Bucket k should also run this algorithm after data fault recovery if the actual token holder is bucket k + 1. Moreover, bucket k + 1 should return its current level $j_{k+1}$ along with the request reply message. Supplied with this information, bucket k may verify the state of k + 1 by comparing bucket levels:
– Bucket k + 1 holds the token and $j_{k+1} = j_k - 1$, or $j_{k+1} = j_k$ (if k + 1 = 0): correct file state, k + 1 is a valid token holder;
– Bucket k + 1 has no token and $j_{k+1} = j_k$, or $j_{k+1} = j_k + 1$ (if k + 1 = 0): correct file state; k + 1 has performed a split and the token was sent to the next bucket;
– Bucket k + 1 has no token and $j_{k+1} = j_k - 1$, or $j_{k+1} = j_k$ (if k + 1 = 0): the split token is lost, another one should be generated and sent to bucket k + 1.

Algorithm 2. Token tracing protocol
– bucket k (previous token holder) waits for a predefined time-out;
– k sends to bucket k + 1 a query requesting information about the current token state;
if k gets a positive answer then
  k + 1 is in good condition and still holds the token, restart;
else if k gets a negative answer then
  k + 1 no longer possesses the token as it was sent to k + 2; k + 1 will run the token tracing protocol with k + 2, end of algorithm.
else if k gets no answer then { k + 1 is probably damaged, do one of the following steps: }
  – transient fault – send another token to k + 1, restart;
  – persistent fault – skip bucket k + 1, send a new token to k + 2, restart and start guarding bucket k + 2;
  – persistent fault – recover the damaged bucket with number k + 1, restart;
end if
Besides token loss, an invalid file state with multiple split tokens may also be detected. If the guarded bucket k + 1 holds the split token and $j_{k+1} = j_k$, or $j_{k+1} > j_k$ (if k + 1 = 0), an invalid extra token is detected. Moreover, if bucket k, detecting the extra token in k + 1, has its own split token, at least one extra invalid token circulates in the file. After such a detection it is possible to remove the invalid token, but this is not recommended. Because in a decentralized LH* file there is no single centralized component monitoring the bucket state (like the Split Coordinator in centralized LH* [2]), such an action may be dangerous and may lead to further file damage.

¹ This procedure may be applied if the SDDS file consists of at least three active buckets. Otherwise the previous and the current token holder are the same bucket (Fig. 1a, b).
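The level comparisons performed by the previous token holder can be collected into one decision function. The sketch below is an illustration of the rules listed in this section rather than the authors' implementation; the function name, parameter names and returned labels are hypothetical.

```python
def check_next_holder(j_k, j_next, next_holds_token, next_is_bucket_zero):
    """Classify the state of bucket k + 1 as seen by bucket k.

    j_k: level of the checking bucket k (it has already split)
    j_next: level reported by bucket k + 1
    next_holds_token: whether k + 1 reports holding the split token
    next_is_bucket_zero: True when k + 1 wraps around to bucket 0
    """
    # Level that k + 1 is expected to have before its own split in this round.
    unsplit = j_k if next_is_bucket_zero else j_k - 1

    if next_holds_token:
        if j_next == unsplit:
            return "valid holder"
        extra = j_next > j_k if next_is_bucket_zero else j_next == j_k
        if extra:
            return "extra token"
    else:
        if j_next == unsplit + 1:
            return "token passed on"
        if j_next == unsplit:
            return "token lost"
    return "inconsistent"


print(check_next_holder(3, 2, True, False))    # k + 1 still has to split
print(check_next_holder(3, 2, False, False))   # nobody holds the token
print(check_next_holder(3, 3, True, False))    # token held after the split
```

Only the "token lost" outcome calls for generating a new token; as noted above, removing a detected extra token is possible but not recommended.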
4.3 Token Level
The message used for transferring the split token between buckets may be supplemented with a token level, additional information concerning the levels of splitting buckets. A token with level h permits a bucket to split, but also defines the resulting bucket level j after a successful split. The token level is increased every time the split token visits bucket number 0. In practice, the token level is equal to the bucket level after a successful split. A bucket may perform a split only if:
– For bucket 0: its current level j is lower than or equal to the token level h.
– For bucket ≠ 0: its current level j is lower than the token level h.

Fig. 2 shows an example of LH* file evolution using a split token supplemented with a token level. Bucket 1 holds a correct split token allowing this bucket to reach level 2 (Fig. 2a). After overload, bucket 1 splits and sends the split token to bucket number 0, where the token level h is increased to 3 (Fig. 2b). Consecutive buckets split, reaching level 3 according to the token level h (Fig. 2c, d).
Fig. 2. DSDDS LH* file with token level
To make the SDDS LH* file tolerant to token control faults, every bucket verifies the token status after receiving a token message from another bucket (Algorithm 3). Multiple tokens, even valid ones, are ignored. A bucket accepts the split token only if the acceptance rules shown above are fulfilled. Otherwise the token is instantly sent to the next bucket. If such an invalid token is received by bucket number 0, its level is increased before it is sent to another bucket, which introduces a unique possibility of file structure regeneration (Fig. 3). The split token itself becomes a critical file component, but it may easily be made fault tolerant with one of the well-known error detecting and/or correcting codes.
Algorithm 3. Token verification
{ Split token with level h was received by bucket m with level j; }
if m has no token then { Ignoring multiple tokens }
  if m = 0 then { Bucket number 0 }
    if h ≥ j then
      Accept token;
    else
      Send token with level h + 1 to the next bucket;
    end if
  else if m > 0 then { Bucket number > 0 }
    if h > j then
      Accept token;
    else
      Send token with level h to the next bucket;
    end if
  end if
end if
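Algorithm 3 translates almost line by line into code. The following sketch returns the decision instead of sending messages; the function name and the returned labels are hypothetical, and dropping a token that arrives at a bucket already holding one reflects the statement that multiple tokens are ignored.

```python
def verify_token(m, j, h, has_token):
    """Decide what bucket m (level j) does with an incoming token of level h.

    Returns (action, token_level), where action is "accept", "forward" or "drop".
    """
    if has_token:
        # Multiple tokens are ignored, even valid ones.
        return ("drop", None)
    if m == 0:
        if h >= j:
            return ("accept", h)
        # Bucket 0 raises the level of a rejected token before passing it on.
        return ("forward", h + 1)
    if h > j:
        return ("accept", h)
    return ("forward", h)


print(verify_token(m=0, j=2, h=2, has_token=False))
print(verify_token(m=0, j=3, h=2, has_token=False))
print(verify_token(m=1, j=2, h=2, has_token=False))
```

The second call shows the regeneration step: bucket 0 rejects the token but raises its level, so that a later bucket can accept it.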
Fig. 3. SDDS LH* file with token level, extra valid token and successful file regeneration
An example of LH* file evolution with an extra valid token is shown in Fig. 3. Suppose that in a correct LH* file (Fig. 3a), where a token with level h = 2 was currently in use, another valid token with level h = 3 was generated (Fig. 3b) as an effect of a transient fault. In the case of overload bucket 2 splits, because $j_2 = 2$. A new bucket 6 is initialized and the token goes to bucket 0 (Fig. 3c). Because the extra token was verified as valid, bucket 0 also splits and a new bucket 4 is initialized (Fig. 3d). Bucket 1, already holding a valid token, drops the extra token obtained from bucket 0.
The extra token disappears but the file structure is still invalid. In case of an overload of bucket 1, another split is performed and bucket 3 is initialized (Fig. 3e). The split token goes back to bucket 0 and this time it is verified as invalid. According to Algorithm 3, the level of such a token is increased and it is instantly sent to the next bucket (Fig. 3f). Then bucket 1 performs a split, bucket 5 is created and the file structure is valid again. However, the split token is in the wrong bucket (Fig. 3g). Bucket 2 verifies the token as invalid and it is instantly sent to bucket number 3. As a result, the token is at last verified as valid and the file structure is successfully regenerated (Fig. 3h). Unfortunately, during the file evolution shown in Fig. 3 many operations requested by clients were not finished and many messages were lost (due to addressing problems occurring while the file structure was invalid). Many different scenarios are possible; most of them should lead to correct file regeneration.
5 Example DSDDS LH* File Regeneration
Fig. 4 shows sample experimental results from SDDSim. The vertical axis shows how much of the network throughput is taken by messages of a given type. For the number of buckets, the vertical axis spans up to the maximum of 128 buckets supported by SDDSim. The horizontal axis represents the progress in time, expressed as the total number of messages sent through the simulated network. An additional valid token was injected into a fault tolerant DSDDS LH* file (Fig. 4, point A on the horizontal axis). Many insert operations were then not finished, and many lost messages (sent beyond the file space) were detected. The structure of the file was temporarily damaged. Next, thanks to the token tracing and token level mechanisms, the file was successfully regenerated and, after some splits, the additional token was dropped (Fig. 4, point B on the horizontal axis).
Fig. 4. SDDS LH* file with token level, extra valid token and successful file regeneration
6 Conclusions
Simulation experiments showed that DSDDS LH* is highly vulnerable to operational faults [4,5]. To make DSDDS LH* tolerant to data faults, the bucket recovery algorithm defined in [3] could be applied. Backwarding, the token tracing protocol and the token level were introduced here to tolerate a variety of LH* control faults. These extra architectural mechanisms were experimentally evaluated with the help of the SDDSim software fault injector. The faults were tolerated: the DSDDS LH* file structure was successfully regenerated after serious split token control faults. The token tracing protocol requires some additional circuitry (bucket timers) and additional messages to be sent through the network. The number of extra messages strongly depends on specific implementation details (network connection speed, time-out chosen, etc.). In practice, the efficiency decrease caused by the additional messages used for token tracing is negligible. For token level checking no extra messages are needed at all. The token level value, even with additional redundant data, requires a few bytes at most. Expanding the token transfer message with such extra data costs practically nothing, and the benefit from using such an extended split token may be priceless.
References
1. Dongarra, J., Sterling, T., Simon, H., Strohmaier, E.: High-Performance Computing: Clusters, Constellations, MPPs, and Future Directions. IEEE Computing in Science and Engineering (March 2005)
2. Litwin, W., Neimat, M.-A., Schneider, D.: LH*: A Scalable Distributed Data Structure. ACM Transactions on Database Systems ACM-TODS (December 1996)
3. Litwin, W., Neimat, M.-A.: High-Availability LH* Schemes with Mirroring. In: Intl. Conf. on Coope. Inf. Syst. COOPIS 1996, Brussels (1996)
4. Sapiecha, K., Łukawski, G.: Fault-tolerant Control for Scalable Distributed Data Structures. Annales Universitatis Mariae Curie-Skłodowska, Informatica (2005)
5. Sapiecha, K., Łukawski, G.: Fault-tolerant Protocols for Scalable Distributed Data Structures. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, Springer, Heidelberg (2006)
6. Łukawski, G., Sapiecha, K.: Software Functional Fault Injector for SDDS. GI-Edition. In: ARCS 2006. Lecture Notes in Informatics (LNI) (2006)
Grid Services for HSM Systems Monitoring
Darin Nikolow 1, Renata Slota 1, and Jacek Kitowski 1,2
1 Institute of Computer Science, AGH-UST, al. Mickiewicza 30, 30-059 Kraków, Poland
2 Academic Computer Center CYFRONET-AGH, ul. Nawojki 11, 30-950 Kraków, Poland
{darin,rena,kito}@agh.edu.pl
Abstract. As HSM systems are commonly used in data intensive grids to store huge amounts of data economically, a need for monitoring them arises. This paper describes an approach to creating a common model for describing the state of an HSM system. An implementation of a grid service for HSM monitoring and test results are presented.
1 Introduction
Data intensive grid applications often require access to huge amounts of data. Storage systems dealing with such amounts of data can either use expensive high performance disk arrays or use less expensive tape storage managed by HSM software. HSM systems are used when the ability to store huge amounts of data at low cost matters most and the access time to data is not a primary issue. The access time to data stored on HSM systems can vary a lot depending on the location of the data and the current state of the HSM system. For example, if the data is cached on disk the access time will be short, but if the data is only on tape the access time will be longer. For some applications and middleware functionalities it is desirable to have some a priori information about the access time of the data that is about to be accessed. Optimization of data access for replicated data sets is one example of such functionality [1]. In this case the middleware layer should be able to choose the most appropriate replica for downloading. In order to estimate access time for HSM systems, a monitoring service running along with the HSM system is needed. The monitoring service should gather information about the state of the HSM system and provide it to requesters. There are many different HSM systems available, with quite different architectures, APIs or command line interfaces. Some of the most popular HSM systems are DiskXtender [6], CASTOR [7], HPSS [8], FSE [9] and TSM [10]. To be practical, the service should provide the monitoring data via a common, well defined interface. The structure of these data should also be well defined and should be able to describe the state of any HSM system, using a state model of a general HSM system that covers the functionalities and peculiarities of all HSM systems.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 321–330, 2008. © Springer-Verlag Berlin Heidelberg 2008
Grid services are based on the OGSA standard [11] and can provide a method for interoperability between heterogeneous computer systems. Grid services can be implemented as stateful services using WSRF [12] and thus provide more flexibility than traditional web services. For example, the data describing a storage system can be kept in the internal resources of the grid service instead of being retrieved from the storage system each time the service is called, which allows for lower system response times. This article describes an attempt to design and implement a set of grid services for efficient monitoring of storage systems, with special attention to HSM systems. The rest of the paper is organized as follows. The following section presents the state of the art. The third section presents the architecture of a monitoring and estimation system for storage systems. The fourth section proposes a monitoring model of a general HSM system. An implementation of the monitoring utility as a grid service is presented in the fifth section. The sixth section presents test results and the last section concludes the paper.
2 State of the Art
Replication is a common method used to increase the performance of data access, the data availability or the level of protection [2,3]. Utilities for monitoring and access time estimation are needed in such replication systems to decide properly where to create a new replica or which replica to choose for data access. Data access time estimation for tertiary storage devices has been studied in [4]. In our previous work a system for estimating access time for the DiskXtender HSM system was proposed [5] and tested [18]. Our experience with data access cost estimation for the CASTOR HSM system was presented in [17]. Shriver et al. [15] describe several performance analysis techniques for storage systems. The GGF (Global Grid Forum) Performance Working Group has defined the GMA (Grid Monitoring Architecture), intended to provide a minimal specification supporting infrastructure monitoring and allowing interoperability [13]. The Grid Resource Monitoring (GridRM) project [14] aims to provide a generic open-source resource monitoring architecture designed specifically for the Grid. The DiskXtender HSM software provides a graphical administration interface which can monitor many aspects of the system interactively [6], but it is not grid-ready and supports only that one HSM system.
3 Architecture of Distributed Monitoring and Estimation System for Storage Systems
Our vision of a system for monitoring and data access time estimation of storage resources in a grid environment is presented in Fig. 1. The idea is that every storage node is equipped with a Storage System monitoring grid service called SSmon. The implementation of this service depends on the underlying storage system, because different storage systems, and especially HSM systems, have quite different graphical user interfaces (GUI), command line interfaces (CLI) or application programming interfaces (API). On the other hand, the services for different storage systems provide the same interface to the other components of the grid installation. SSmon provides information about the state of the monitored storage system. The state is defined as a set of values of parameters concerning the performance, capacity and behavior of the storage system.
The storage system estimation service, SSest, can use the current state of the storage system provided by SSmon to estimate the access time of data residing on the given storage system. The implementation of the SSest grid service can differ depending on the adopted approach (statistical, rule-based, simulated). SSest provides information about the expected latency and transfer time of a given data access request. This information can be used by a client to optimize data access. One typical client is the replication service, which is responsible for choosing the best replica for downloading or the best location for uploading a new replica. A typical information flow when requesting a data access estimation is depicted with dashed lines (see Fig. 1). The client calls one of the available SSest grid services, giving the URI of the file of interest. The estimation service then calls the appropriate SSmon and gathers the data necessary to calculate the estimated access time. When the results are ready they are returned to the client.
[Figure: storage nodes (SN) with DiskXtender HSM, CASTOR HSM, a disk array and a local filesystem, each running SSmon; SSest services; a client (replication service); all connected through the network, with information flow steps numbered 1 to 4.]
Legend: CN - Computing Node, SN - Storage Node, SSmon - Storage System monitoring service, SSest - Storage System estimation service
Fig. 1. Architecture of distributed monitoring and estimation system for storage systems
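The information flow described above (client to SSest, SSest to SSmon, result back to the client) can be sketched in Java as follows. This is a minimal illustration, not the authors' implementation: the StorageMonitor interface, the RuleBasedEstimator class, the fixed latency values and the example URIs are all assumptions introduced here, and the rule used (a cached file is fast, a tape-only file is slow) is the simplest possible rule-based approach.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical view of what an SSmon service reports about one file. */
interface StorageMonitor {
    /** @return true if the file is cached on disk, false if it is only on tape */
    boolean isCached(String fileUri);
}

/** Hypothetical SSest: a trivial rule-based latency estimator. */
final class RuleBasedEstimator {
    private final StorageMonitor monitor;
    private final double cachedLatencySeconds; // assumed constant
    private final double tapeLatencySeconds;   // assumed constant

    RuleBasedEstimator(StorageMonitor monitor, double cachedLatencySeconds, double tapeLatencySeconds) {
        this.monitor = monitor;
        this.cachedLatencySeconds = cachedLatencySeconds;
        this.tapeLatencySeconds = tapeLatencySeconds;
    }

    /** Asks the monitor for the file state and maps it to an expected latency. */
    double estimateLatencySeconds(String fileUri) {
        return monitor.isCached(fileUri) ? cachedLatencySeconds : tapeLatencySeconds;
    }
}

final class EstimationFlowDemo {
    /** Replication-service role: pick the replica with the lowest estimated latency. */
    static String bestReplica(Map<String, RuleBasedEstimator> estimatorByReplicaUri) {
        String best = null;
        double bestLatency = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, RuleBasedEstimator> e : estimatorByReplicaUri.entrySet()) {
            double latency = e.getValue().estimateLatencySeconds(e.getKey());
            if (latency < bestLatency) {
                bestLatency = latency;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, RuleBasedEstimator> replicas = new LinkedHashMap<>();
        // Site A holds the file only on tape, site B has it in the disk cache.
        replicas.put("hsm://siteA/data.dat", new RuleBasedEstimator(uri -> false, 1.0, 120.0));
        replicas.put("hsm://siteB/data.dat", new RuleBasedEstimator(uri -> true, 1.0, 120.0));
        System.out.println(bestReplica(replicas));
    }
}
```

Because the estimator only depends on the StorageMonitor interface, the same client code works regardless of which storage system the monitoring service sits on, which is the point of giving all SSmon implementations a common interface.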
4 Monitoring Model of General HSM System
Generally, HSM systems consist of tertiary storage hardware (typically tape libraries) attached to one or more servers on which HSM software is running. Additional disk array volumes are often attached to the server, acting as a cache for the data kept on the tertiary storage. The HSM storage space is usually accessed via traditional filesystem access methods or via a special API provided by the vendor. The requests are served according to the implemented queuing strategy, which is generally FIFO.
Table 1. Static parameters describing HSM system
Parameter name            Type    Unit  Description
HSMVendorString           String  n/a   A string identifying the HSM software vendor, version etc.
LibVendorString           String  n/a   A string identifying the automated media library: vendor, model, firmware version.
DriveVendorString         String  n/a   A string identifying the drive hardware vendor, model, firmware version.
TotalCapacity             float   TB    The total capacity of the HSM system; it can be the capacity in the license or the capacity of the tertiary storage hardware.
MountDir                  String  n/a   The mount point of the HSM filesystem.
nrOfLibraries             int     n/a   Number of automated media libraries attached to the HSM system.
nrOfDrvInLib              int     n/a   Number of drives in each library.
nrOfSlotsInLib            int     n/a   Number of slots in each library.
TypeOfDrvInLib            String  n/a   Type of drives in each library.
avgMountTime              float   s     Average time for the robot to move a medium from a slot to a drive.
avgDismountTime           float   s     Average time for the robot to move a medium from a drive to a slot.
avgLoadTime               float   s     Average time for the drive to bring the medium online.
avgUnloadTime             float   s     Average time for the drive to eject a medium.
avgPosTime                float   s     Average time to position a head.
avgDiskCacheTransferRate  float   MB/s  Average transfer rate for the disk cache.
maxNatDrvTransferRate     float   MB/s  Maximal native transfer rate of a drive.
mediumNativeCapacity      int     GB    The native capacity of a medium.
The proposed model consists of a set of parameters which can be used to describe the current state of a given HSM system. The model is general and should be suitable for describing any HSM system. The parameters describing the HSM model can be divided into two classes:
– static parameters, which don't change often; these are usually configuration parameters,
– dynamic parameters, which change frequently.
The static parameters describing our model are listed in Table 1. These parameters should be set once, when the service is started for the first time. They give a general view of the HSM system capabilities, like the overall capacity, disk cache capacity, the number of libraries and drives. Some performance characteristics can also be found here, like the average mount and dismount times, average load and unload times and the average positioning time.
Table 2. Dynamic parameters describing HSM system
Parameter name   Type       Unit  Description
FreeCapacity     float      GB    Free storage space in the HSM system.
MaxTransferRate  float      MB/s  Maximal transfer rate of the HSM system for local file transfers, i.e. the data consumer process runs on the same machine as the HSM system process responsible for the transfer of files. This means that when the client is remote the actual transfer rate could be less than or equal to this value, depending on the network bandwidth.
CurTransferRate  float      MB/s  The last measured transfer rate for local transfers.
HSMLoad          int        n/a   The number of requests being served.
HSMQueue         aggregate  n/a   List of requests being served. Each item consists of the following attributes describing a request: tapeID, start block, end block.
HSMDriveState    aggregate  n/a   Describes the state of each drive: usage (empty, idle, read, write), block position, tapeID.
The dynamic parameters are listed in Table 2. These parameters describe the current state of the HSM system and its actual performance characteristics. Some of the parameters are aggregates, which means arrays of structures (in C jargon).
Table 3. File and media related parameters describing HSM system
Parameter name  Type       Unit  Description
HSMFileInfo     aggregate  n/a   Describes the state of a given file (file name, isCached, tapeID, start block, end block, size).
HSMTapeInfo     aggregate  n/a   Describes the state of a given medium (tapeID, mediaType, usage, lastMountTime, blockSize, capacity, capacityUsage, copyNumber).
There are also dynamic parameters (not listed in Table 2) related to a given file or medium, which should be provided only on request because of their large total size. These parameters are kept in the internal database of the HSM system and are listed in Table 3.
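The aggregate parameters (arrays of structures) could be represented in Java roughly as below. This is a sketch under assumptions: the class names, field types and the isTapeInDrive helper are illustrative and are not taken from the implemented service; only the attribute lists follow Table 2.

```java
import java.util.Arrays;
import java.util.List;

/** One HSMQueue item: a request identified by tape and block range (Table 2). */
final class QueuedRequest {
    final String tapeId;
    final long startBlock;
    final long endBlock;

    QueuedRequest(String tapeId, long startBlock, long endBlock) {
        this.tapeId = tapeId;
        this.startBlock = startBlock;
        this.endBlock = endBlock;
    }
}

/** One HSMDriveState item: usage, block position and the tape in the drive (Table 2). */
final class DriveState {
    enum Usage { EMPTY, IDLE, READ, WRITE }

    final Usage usage;
    final long blockPosition;
    final String tapeId; // null when the drive is empty (an assumption of this sketch)

    DriveState(Usage usage, long blockPosition, String tapeId) {
        this.usage = usage;
        this.blockPosition = blockPosition;
        this.tapeId = tapeId;
    }
}

final class HsmStateDemo {
    /** True if some non-empty drive currently holds the given tape. */
    static boolean isTapeInDrive(List<DriveState> drives, String tapeId) {
        for (DriveState d : drives) {
            if (d.usage != DriveState.Usage.EMPTY && tapeId.equals(d.tapeId)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<QueuedRequest> queue = Arrays.asList(
                new QueuedRequest("T001", 100, 250),
                new QueuedRequest("T007", 0, 40));
        List<DriveState> drives = Arrays.asList(
                new DriveState(DriveState.Usage.READ, 120, "T001"),
                new DriveState(DriveState.Usage.EMPTY, 0, null));

        // HSMLoad corresponds to the number of requests being served.
        System.out.println("HSMLoad = " + queue.size());
        for (QueuedRequest r : queue) {
            System.out.println(r.tapeId + " mounted: " + isTapeInDrive(drives, r.tapeId));
        }
    }
}
```

A consumer such as an estimation service could use a check like isTapeInDrive to decide whether a request needs a mount operation before any data can be read.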
5 Implementation of HSM Monitoring Grid Service
As part of our research a grid service for HSM monitoring has been implemented. The grid service uses the model described in the previous section in order to supply (via a well defined interface) information about the state of the HSM system being monitored. The service is implemented in the Java language and uses the WSRF technology provided with the Globus Toolkit version 4. The class diagram of the service is presented in Fig. 2. In the current version the service provides the following methods:
– getStorageSystemInfo() - provides general information about the storage system, like the type of storage, identification of vendor, model and version, storage capacity, performance characteristics, and available data (or file) access methods.
– getHSMFileInfo() - provides information about the requested file. This information includes the availability of the file (in cache or on removable medium) and the location of the file data on the medium (medium id, start block, end block).
– getHSMState() - provides information about the state of the HSM system. The state is described by the queue of files being requested and the state of the removable media drives. The state of a drive is specified by the medium in the drive and its current block position.
– getLibraryInfo() - provides information about the automated media libraries attached to the HSM system. The information includes the inventory (which media belong to the library), the number and type of drives, and the times of mount and dismount operations.
– getHSMTapeMap() - provides information about the media, like media type, block size used, capacity and time of last mount.
A simple client with a GUI has also been implemented to allow testing of the service.
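The idea of keeping static information in the service's own resources, so that only the first call has to query the storage system, can be illustrated with the following sketch. It is an assumption-laden illustration rather than the actual service code: CachedValue, HsmMonitorSketch, the String return type and the backendQueries counter are hypothetical, and no Globus or WSRF classes are used.

```java
import java.util.function.Supplier;

/** Loads a value on first use and then serves it from memory. */
final class CachedValue<T> {
    private final Supplier<T> loader;
    private T value;
    private boolean loaded;
    private int loads;

    CachedValue(Supplier<T> loader) {
        this.loader = loader;
    }

    synchronized T get() {
        if (!loaded) {
            value = loader.get(); // the expensive query happens only once
            loaded = true;
            loads++;
        }
        return value;
    }

    synchronized int loadCount() {
        return loads;
    }
}

/** Hypothetical service facade exposing one of the methods named in the text. */
final class HsmMonitorSketch {
    private final CachedValue<String> storageSystemInfo;

    HsmMonitorSketch(Supplier<String> slowBackendQuery) {
        this.storageSystemInfo = new CachedValue<>(slowBackendQuery);
    }

    String getStorageSystemInfo() {
        return storageSystemInfo.get();
    }

    int backendQueries() {
        return storageSystemInfo.loadCount();
    }

    public static void main(String[] args) {
        HsmMonitorSketch service = new HsmMonitorSketch(() -> "static HSM description");
        service.getStorageSystemInfo();
        service.getStorageSystemInfo();
        System.out.println("backend queries: " + service.backendQueries()); // backend queries: 1
    }
}
```

The same pattern would not suit the dynamic parameters, whose values change between calls and therefore need to be refreshed rather than cached indefinitely.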
6 Test Results
The HSM monitoring grid service has been tested on the following system configuration:
– HP D class server with 2 processors, 640MB RAM with HP-UX 11.11
– tape library ATL 7100 with DLT7000 tape drives
– DiskXtender HSM software with 10GB disk cache
Two kinds of tests were performed: time performance tests and tests of the influence on the HSM system performance. The test procedures and the results obtained are presented below.
Fig. 2. Class diagram of HSM monitoring grid service
6.1 Performance Tests
During these tests the execution time of the service methods was measured. Each method was called 10 times and the wall-clock response time was recorded. For the getHSMFileInfo() method a random file was chosen from the files residing on the HSM system. Table 4 presents the obtained data. The second column presents the initial response time, measured when a method was called for the first time after starting the service. The third column presents the average response time of the second and subsequent calls. We can see that the initial call of the getStorageSystemInfo() method takes about half a minute, but the next calls take less than a second. This is because service resources are used to keep the static information, which is gathered only once, when the method is called for the first time. A similar effect can be observed for the getHSMTapeMap() and getHSMFileInfo() methods, but the speedup is not as impressive.
Table 4. Execution time performance test for DiskXtender
Service method          Initial response time [s]  Average response time [s]
getStorageSystemInfo()  30.80                      0.85
getHSMState()           0.79                       0.60
getLibraryInfo()        1.44                       1.05
getHSMTapeMap()         6.15                       2.54
getHSMFileInfo()        2.93                       1.08
6.2 Influence Test
This test for DiskXtender was conducted as follows:
1. A random list of files residing in the disk cache is created. The total size is about 500MB and the file sizes range from 5MB to 100MB.
2. The DiskXtender filesystem is unmounted and mounted again to flush the filesystem buffers.
3. The files from the list are read sequentially and the time of the whole operation is registered. This time is further referred to as tnogs.
4. The DiskXtender filesystem is remounted again.
5. The HSM monitoring grid service is started.
6. A client generating a pattern of typical requests is started, with the time interval between requests set to 5 seconds.
7. The files from the list are read sequentially and the time of the whole operation is registered. This time is further referred to as tgs.
The above procedure was repeated 10 times to obtain average values.
Table 5. Service influence test for DiskXtender
nr   tnogs [s]  tgs [s]  Overhead abs [s]  Overhead rel [%]  Transfer rate nogs [MB/s]  Transfer rate gs [MB/s]
1    82.13      87.5     5.37              6.54              6.09                       5.71
2    81.29      83.5     2.21              2.72              6.15                       5.99
3    83.29      90.55    7.26              8.72              6.00                       5.52
4    83.34      97.6     14.26             17.11             6.00                       5.12
5    81.23      87.04    5.81              7.15              6.16                       5.74
6    78.56      82.83    4.27              5.44              6.36                       6.04
7    78.36      84.06    5.7               7.27              6.38                       5.95
8    77.43      92.55    15.12             19.53             6.46                       5.40
9    77.48      98.75    21.27             27.45             6.45                       5.06
10   77.4       83.74    6.34              8.19              6.46                       5.97
Avg  80.05      88.81    8.76              11.01             6.25                       5.65
The results are presented in Table 5. We can see that the running service causes roughly 10% performance degradation of the DiskXtender HSM system (11.01% on average over the ten runs).
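The overhead columns of Table 5 follow from the two measured times: the absolute overhead is tgs − tnogs, the relative overhead is that difference divided by tnogs, and the transfer rate is the total size read divided by the elapsed time. The sketch below recomputes them from the measured values; the class and method names are illustrative, and the total size of 500 MB is the approximate figure given in the test procedure.

```java
import java.util.Locale;

/** Recomputes the derived columns of Table 5 from the measured times. */
final class InfluenceOverhead {

    static double absoluteOverhead(double tNogs, double tGs) {
        return tGs - tNogs;
    }

    static double relativeOverheadPercent(double tNogs, double tGs) {
        return 100.0 * (tGs - tNogs) / tNogs;
    }

    static double transferRate(double totalMegabytes, double seconds) {
        return totalMegabytes / seconds;
    }

    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        // Measured times from Table 5 (runs 1 to 10), in seconds.
        double[] tNogs = {82.13, 81.29, 83.29, 83.34, 81.23, 78.56, 78.36, 77.43, 77.48, 77.4};
        double[] tGs = {87.5, 83.5, 90.55, 97.6, 87.04, 82.83, 84.06, 92.55, 98.75, 83.74};

        double[] rel = new double[tNogs.length];
        for (int i = 0; i < tNogs.length; i++) {
            rel[i] = relativeOverheadPercent(tNogs[i], tGs[i]);
        }

        System.out.println(String.format(Locale.ROOT,
                "run 1: abs = %.2f s, rel = %.2f %%, rate without service = %.2f MB/s",
                absoluteOverhead(tNogs[0], tGs[0]), rel[0], transferRate(500.0, tNogs[0])));
        System.out.println(String.format(Locale.ROOT,
                "average: tnogs = %.2f s, tgs = %.2f s, rel = %.2f %%",
                mean(tNogs), mean(tGs), mean(rel)));
    }
}
```

Note that the average relative overhead in the table is the mean of the per-run percentages, which is slightly different from the ratio of the averaged times (8.76 / 80.05 ≈ 10.9%).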
7 Conclusions
In this paper a common model describing the state of a general HSM system has been proposed. The model is intended to be able to describe any HSM system. As part of our research an HSM monitoring grid service has been implemented. The test results showed moderate overhead under rather heavy loading of the monitoring service. Our future work will focus on adding support for the FSE [9] and TSM [10] HSM systems. Performance tests using modern computer architectures and the Linux operating system will be conducted in order to study the influence of different system components on performance. We also plan to make certain modifications to our implementation in order to make it compliant with GMA.
Acknowledgments
The work described in this paper was supported by an AGH-UST grant. Thanks also go to the implementation team: Daniel Żmuda, Dariusz Antolak, Krzysztof Trubalski, Łukasz Wiktor and Wojciech Wątroba.
References
1. Slota, R., Nikolow, D., Skital, L., Kitowski, J.: Optimizing of Data Access Using Replication Technique. In: Bubak, M., Turala, M., Wiatr, K. (eds.) Proceedings of Cracow Grid Workshop - CGW 2004, Kraków, ACC-Cyfronet UST, December 13-15 2004, pp. 215–220 (2004)
2. Kunszt, P., Laure, E., Stockinger, H., Stockinger, K.: File-based replica management. Future Generation Computer Systems 21, 115–123 (2005)
3. Lamehamedi, H., Szymanski, B., Deelman, E.: Data Replication Strategies in Grid Environments. In: Proceedings of the 5th International Conference on Algorithms and Architectures for Parallel Processing, ICA3PP 2002, Beijing, China, October 2002, pp. 378–383. IEEE Computer Science Press, Los Alamitos (2002)
4. Sandstå, O., Midstraum, R.: Low-Cost Access Time Model for Serpentine Tape Drives. In: Proc. of 16th IEEE Symp. on Mass Storage Systems, the 7th NASA Goddard Conf. on Mass Storage Systems and Technologies, San Diego, California, USA, March 1999, pp. 116–127 (1999)
5. Nikolow, D., Slota, R., Kitowski, J.: Data Access Time Estimation for HSM Systems in Grid Environment. In: Proc. of the Cracow Grid Workshop, Cracow, Poland, December 11-14, pp. 209–216. ACK Cyfronet-AGH, Cracow (2002)
6. Legato Systems, Inc. - DiskXtender Unix/Linux, http://www.legato.com/products/diskxtender/diskxtenderunix.cfm
7. The CASTOR Project, http://castor.web.cern.ch/castor
8. HPSS - High Performance Storage System, http://www.hpss-collaboration.org/
9. HP StorageWorks File System Extender Software - Overview & Features, http://h18006.www1.hp.com/products/storageworks/fse/index.html
10. IBM Tivoli Storage Manager, http://www-306.ibm.com/software/tivoli/products/storage-mgr/
11. Towards Open Grid Services Architecture, http://www.globus.org/ogsa/
12. Globus: WSRF - The WS-Resource Framework, http://www.globus.org/wsrf/
13. Grid Monitoring Architecture, http://www-didc.lbl.gov/GGF-PERF/GMA-WG/
14. GridRM: Resource monitoring for the Grid, http://gridrm.org/
15. Shriver, E., Hillyer, B.K., Silberschatz, A.: Performance Analysis of Storage Systems. In: Reiser, M., Haring, G., Lindemann, C. (eds.) Dagstuhl Seminar 1997. LNCS, vol. 1769, pp. 33–51. Springer, Heidelberg (2000)
16. Slota, R., Skital, L., Nikolow, D., Kitowski, J.: Algorithms for Automatic Data Replication in Grid Environment. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 707–714. Springer, Heidelberg (2006)
17. Kuta, M., Nikolow, D., Slota, R., Kitowski, J.: Data Access Time Estimation for the CASTOR HSM System. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 148–155. Springer, Heidelberg (2006)
18. Nikolow, D., Slota, R., Kitowski, J.: Gray Box Based Data Access Time Estimation for Tertiary Storage in Grid Environment. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds.) PPAM 2004. LNCS, vol. 3019, pp. 182–188. Springer, Heidelberg (2004)
The Vine Toolkit: A Java Framework for Developing Grid Applications
Michael Russell, Piotr Dziubecki, Piotr Grabowski, Michal Krysiński, Tomasz Kuczyński, Dawid Szjenfeld, Dominik Tarnawczyk, Gosia Wolniewicz, and Jaroslaw Nabrzyski
Poznań Supercomputing and Networking Center, Poznań, Poland
{russell,piotr.dziubecki,piotrg,mich,docentt,dejw,dominikt,gosiaw,naber}@man.poznan.pl
http://www.man.poznan.pl
Abstract. The Vine Toolkit is a modular, extensible Java library that offers developers an easy-to-use, high-level API for Grid-enabling applications. It supports a wide variety of application environments and Grid middleware platforms.
Keywords: Java, Grid, framework, application, web, portal, service.
1 Introduction
The Grid [1] landscape has changed quite a bit since the early days of the Globus Toolkit [2]. Whereas the Globus Toolkit was once the de facto Grid middleware platform on the open-source market, and remains so in the United States, Europe and other regions have seen a proliferation of Grid platforms and middleware technologies in the last 7 years. While UNICORE [3] has been around for nearly as long as the Globus Project, EGEE's gLite [4] platform, GRIA [5], CROWN [6] and other platforms have now acquired substantial user bases. Moreover, many auxiliary services are available today that address particular problem areas, such as GridSAM [7] for job submission, OGSA-DAI [8] for federated heterogeneous database access, the Storage Resource Broker (SRB) [9] for managing files logically across distributed filesystems and the Virtual Organization Management Service (VOMS) [10] for managing VOs. Some of these auxiliary services are designed to build upon or integrate with particular platforms; others may be designed to be platform agnostic but support only a limited set of middleware for production use. Deciding which technologies to adopt is not trivial. Considerable effort is needed to install, configure, maintain and support Grid infrastructures, so care must be taken to choose a platform that will meet an organization's needs for years to come. Even when the choice of platform is pre-determined, such as when there is a need to interoperate with other collaborating institutes that have existing infrastructures, building applications and services on top of just one platform is usually not trivial either. There are several risks to consider when doing so.
– Portability: Writing to one platform may preclude the use of another sometime in the future. The growing support for industry-wide standards that expose relevant features to applications may alleviate problems one day, but that day has not yet arrived.
– Maintenance: The rate at which a particular platform evolves may result in hundreds of hours of work spent on maintaining software. This is even more so for applications that are meant to be used by the general community and thus need to be supported for multiple versions of particular middleware services.
– Functionality: Not all Grid platforms are created equal. All have their strengths and weaknesses. In the course of implementing a solution, one may find that the desired functionality does not yet exist, isn't fully supported, or requires many "work-arounds" to achieve at the application level.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 331–340, 2008. © Springer-Verlag Berlin Heidelberg 2008
1.1 Standards and Their Importance to Applications
Today, most Grid platforms are Web-service based. In practice, this essentially means there now exists a standard way of issuing calls to services. This is not enough for true interoperability between infrastructures and therefore does not solve the application portability problem. Semantic Grid [11] approaches, or some other methodology involving meta-data to describe resources, may offer a means to dynamically discover how to use a particular service for a particular purpose. However, even here, higher-level standards will be needed for describing types of services. The authors of the Vine Toolkit can testify that the need for standards that encapsulate the major areas of functionality addressed by most Grid platforms is real. Though the Vine Toolkit, as will be discussed later, hides the problem of developing applications for different Grid middleware behind an application programming interface (API), this has only shifted the burden onto the Vine Project, and other software projects that provide similar solutions, to maintain support for all that middleware. For example, consider the task of maintaining job submission support for 4 different evolving middleware platforms, as the Vine Project has done, as opposed to having to support job submission to only one specification. Even if that specification changes over time, the workload is considerably reduced (and more so if user authentication and data staging are likewise standardized!). The Job Submission Description Language (JSDL) 1.0 [13] adopted by the Open Grid Forum (OGF) and the evolving Basic Execution Service (BES) [14] specification, though still somewhat immature, are steps in the right direction. Another approach to standardization led within OGF is the Simple API for Grid Applications (SAGA) [15]. The SAGA Core Working Group (SAGA-CORE-WG) [16] has produced a nearly completed language-independent specification that describes what many application experts and Grid experts perceive to be the basic and most relevant capabilities of Grid platforms to expose to application programmers. The main idea is to solve the portability problem at the API level as opposed to the infrastructure level. The SAGA specification outlines programming interfaces, entities and basic behavior that should be implemented by any programming language "binding" written for the specification. It also outlines how support for third party libraries and services should be plugged into a SAGA implementation. Some members of the SAGA effort are actively involved in developing language bindings and reference implementations of those bindings. Nearly completed SAGA-C++ and SAGA-Java binding and implementation efforts already exist, with support for various Grid platforms.
1.2 Our Position
The Vine Project strongly agrees with the SAGA approach but also stresses the need for middleware-level standards for the reasons cited above and feels these two approaches are, in fact, complementary. The Vine Project intends to work with SAGA peers to implement the evolving Java SAGA 1.0, by “wrapping” the Vine Toolkit, and will encourage application programmers to consider using SAGA for basic Grid programming needs instead of the Vine API directly. However, the authors of the Vine Toolkit contend that the Vine API, as it is applied to the Java programming language and target applications for Java, is more robust and extensible. Additionally, Vine is more than just a Grid API: it provides several areas of functionality that facilitate proper integration with web applications and services.
2 The Vine Toolkit
The Vine Toolkit, like the upcoming SAGA Java reference implementation or the Intel Grid Programming Environment [20], is a Java-based framework that offers developers an easy-to-use, high-level API for Grid-enabling applications. Vine began as an evolution of the GridSphere GridPortlets Project [22]. GridSphere [21] is a Portlet 1.0 [23] compliant portal framework for building portal websites and hosting collections of portlet web applications. GridPortlets was designed to provide a framework for using the Grid in portlet applications. Like the evolving Java SAGA 1.0 specification, GridPortlets is supported by a “pluggable” architecture for implementing its API to support different middleware. While its API is generally perceived to be very useful, GridPortlets is also recognized for several shortcomings. In practice, GridPortlets has only ever been distributed with support for Globus Toolkit 2 and 4 because extending it to support other middleware is overly complicated. Moreover, it is dependent on elements of the GridSphere Portal Framework and needs to be managed at runtime by a Portlet 1.0 API compliant container. Therefore, GridPortlets cannot be used or even tested outside of a web application server. Vine was designed from the outset to address these issues and more. Now in its alpha stages and available for public use, Vine includes an API and model
of the Grid that greatly improves upon the original GridPortlets API while retaining its best features. More importantly, Vine is a general library that can be deployed for use in desktop, Java Web Start [24], Java Servlet 2.3 [25] and Java Portlet 1.0 environments with very little effort on the part of the application programmer. This is because Vine provides an easy-to-use, extensible build system for deploying, configuring and bundling Vine for use with many different types of applications. In addition, taking into account the explicit needs of web application developers, Vine is designed to provide all the necessary entry points for “attaching” Vine to web portals, web services and other “container” environments. Several of these entry points are discussed below.
– User account management: Vine provides the ability to plug into the user account management mechanisms of its container environment, as well as to manage user accounts in standalone applications and services.
– User registration with third-party libraries and services: Vine simplifies the process of generating credentials, registering those credentials with Grid middleware or creating accounts on remote systems with provider-defined registration modules. This process can be attached to Vine’s account management system.
– Single sign-on: Vine defines its own extensible authentication module system that can also be plugged into any container that supports extensions to its sign-on mechanisms, say during user login to a portal.
– Authorization: The resources Vine makes accessible to end-users are completely configurable. Authorization in Vine can be linked to its container as well as to third-party authorization services.
– User session management: In the Vine Toolkit, all user activity is managed within sessions. Vine sessions can naturally be attached to environments that support user session management.
– Per-request behavior: Vine defines what is called a service context. Service contexts have a well-defined lifecycle. When applied to web applications, service contexts can be attached to HTTP requests, for example. When a request has completed, its associated service context will be destroyed, thereby cleaning up resources. This is useful in the context of connections to remote services and databases.
– Startup and shutdown behavior: When Vine is first invoked by its container, it can execute one or more startup services to, say, begin polling remote information services or monitor remote activity. Similarly, one can define shutdown services for use with Vine.
– Multi-threaded behavior: Vine inherently supports multi-threaded applications and contains built-in mechanisms for performing tasks in separately managed threads.
– Persistence: No extra work is required by the application programmer to manage persistence of information registered with Vine or tasks performed with it.
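The per-request behavior in the list above can be illustrated with a small sketch. It is written in Python rather than Java and does not use the Vine API; the names `ServiceContext`, `service_context` and `handle_request` are hypothetical and only mimic the described lifecycle, in which a context is created for a request and destroyed when the request completes, releasing whatever was registered with it.

```python
from contextlib import contextmanager


class ServiceContext:
    """Hypothetical stand-in for a per-request service context (not the Vine API)."""

    def __init__(self, name):
        self.name = name
        self.resources = []   # cleanup callbacks for open connections, etc.
        self.destroyed = False

    def register(self, cleanup):
        """Remember a zero-argument cleanup callback for a resource."""
        self.resources.append(cleanup)

    def destroy(self):
        """Release resources in reverse order of registration."""
        while self.resources:
            self.resources.pop()()
        self.destroyed = True


@contextmanager
def service_context(name):
    """Create a context for one request and always destroy it afterwards."""
    ctx = ServiceContext(name)
    try:
        yield ctx
    finally:
        ctx.destroy()


def handle_request(path, log):
    """Simulate one HTTP request that opens a remote connection."""
    with service_context(path) as ctx:
        log.append("open connection for " + path)
        ctx.register(lambda: log.append("close connection for " + path))
        log.append("handle " + path)
    return ctx
```

Tying cleanup to the end of the request means that a handler which fails part-way through still releases its connections, which is presumably the benefit the authors have in mind for remote services and databases.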
2.1 Vine Deliverables
Vine is not developed in a vacuum, but rather to meet the specific needs of the projects that fund it, in addition to the needs of the community-at-large. Begun in the HPC-Europa Project [17] at the Poznań Supercomputing and Networking Center, Vine is currently funded by the OMII-Europe Project [18] and BEinGRID [19]. Both of these projects are driven by the need for more common components on the market that promote interoperability and portability across different Grid middleware. OMII-Europe places particular emphasis on interoperability through Web and Grid standards adopted by the OGF and elsewhere. In fact, OMII-Europe and its partner institutes lead many of the OGF standards efforts currently underway. Moreover, OMII-Europe sponsors re-engineering activities in the US, Europe and China to enable today’s leading Grid platforms to support relevant standards. BEinGRID places particular emphasis on the need for common components that meet the needs of industry as well as support the middleware and services that have established user communities, including but not limited to:
– gLite
– Globus Toolkit
– GRIA
– OGSA-DAI
– SRB
– UNICORE
– VOMS
The Vine Project actively supports the technologies cited above to “Grid-enable” web applications through generic, reusable Web 2.0 [26] user interfaces for use with GridSphere and other Portlet 1.0 compliant containers.
3 Vine Architecture
The Vine Toolkit consists of a core project that defines a base API and programming model upon which sub projects are built. Each sub project addresses a particular problem area (Fig. 2). Some, like the Grid Vine, build upon core Vine to define more general concepts and extensible elements. Others, like the Globus Toolkit 4 Vine, are concerned with adding support for particular third party libraries and services. Each project conforms to a particular file structure that defines how source code is built as well as how third party libraries and configuration files are packaged and deployed. Users can select the specific Vine projects they require for their applications. Naturally, there are dependencies between certain projects that must also be taken into account (Fig. 3). When Vine is deployed, Vine’s build system will deploy and package only those source files that are relevant to the target environment. Source files that are included in the main source tree for each project are deployed to all types of target environments, while source files contained in a project’s web source tree
Fig. 1. Vine can be deployed as one or more portlet web applications to a servlet container, such as Apache Tomcat, for use with the GridSphere Portal Framework
are deployed for use only with web applications (Fig. 4). Typically, web source trees include user interfaces developed in one or more web UI frameworks, such as the Google Web Toolkit [27] or Adobe Flex [28], as well as Java servlets, portlets and any web services that are intended for use by that project.

Project | Purpose
Grid Vine | Provides Grid API and base classes
Group Vine | Provides Groupware API and base classes
gLite 3 Vine | Adds gLite 3 client support to the Vine Toolkit
GT 4 Vine | Adds GT4 client support to the Vine Toolkit
UNICORE 6 Vine | Adds UNICORE 6 client support to the Vine Toolkit
VOMS Vine | Adds VOMS security support to the Vine Toolkit
BES Vine | Adds BES job submission support to the Vine Toolkit
SRB Vine | Adds SRB file management support to the Vine Toolkit
RUS Vine | Adds RUS information gathering support to the Vine Toolkit

Fig. 2. Some of the projects that are actively supported in the Vine Toolkit
3.1 Key Concepts
In addition to the base session management, persistence management and other mechanisms described above, the Root Vine defines several key concepts upon which all other Vine projects are built.
Fig. 3. Dependencies between Vine projects
Fig. 4. How files are organized in the Vine Toolkit
Resources are perhaps the most important concept modeled in the Vine Toolkit. Vine defines a resource as anything that can be utilized. A computer, an application, a software library, a person: these are all resources in Vine. In fact, in Vine, resources define the application just as they define grids. At their most basic level, grids are collections of resources with policies describing how to use those resources. Vine extends the resource concept to include tasks.

3.2 Advanced Concepts
More advanced interfaces and classes are provided in the Group Vine and the Grid Vine (Fig. 7). Group Vine defines the user account management and registration concepts described earlier. Grid Vine defines concepts specifically related to supporting grid application development. In addition to modeling basic concepts such as computers, software and services, Grid Vine models information services, file management services and job management services. Projects that extend Grid Vine implement part or all of the Grid Vine API. The Storage Resource Broker Vine, for example, implements the file aspects of the Grid Vine API.
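The idea that an extending project may implement only part of a larger API can be sketched as follows. This is a Python illustration, not Vine's Java API: the interface names, the `Registry` class and the in-memory provider are all hypothetical, chosen only to mirror the statement that a storage-oriented project covers the file aspects while leaving job management to other projects.

```python
from abc import ABC, abstractmethod


class FileManagementService(ABC):
    """Hypothetical file-management part of a grid API."""

    @abstractmethod
    def list_files(self, path):
        ...


class JobManagementService(ABC):
    """Hypothetical job-management part of a grid API."""

    @abstractmethod
    def submit(self, executable):
        ...


class InMemoryFileService(FileManagementService):
    """Toy provider that implements only the file aspect."""

    def __init__(self, files):
        self._files = files

    def list_files(self, path):
        return sorted(f for f in self._files if f.startswith(path))


class Registry:
    """Maps each capability (interface) to the provider that offers it."""

    def __init__(self):
        self._providers = {}

    def register(self, provider):
        # A provider is registered under every capability it implements.
        for capability in (FileManagementService, JobManagementService):
            if isinstance(provider, capability):
                self._providers[capability] = provider

    def lookup(self, capability):
        # Returns None when no plugged-in project covers the capability.
        return self._providers.get(capability)


registry = Registry()
registry.register(InMemoryFileService(["/home/a.dat", "/home/b.dat", "/tmp/c"]))
files = registry.lookup(FileManagementService).list_files("/home")
jobs = registry.lookup(JobManagementService)
```

Here the application asks for a capability rather than for a specific middleware, so the lookup for job management simply comes back empty until a project that implements it is plugged in.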
Fig. 5. Core interfaces defined in Vine
Fig. 6. Main task interfaces in the Vine Toolkit
Fig. 7. Group Vine and Grid Vine add base functionality to the Vine Toolkit
4 Next Steps
Vine 1.0 Alpha was recently released and put in the hands of early adopters. Plans are underway for a second release before the end of 2007. The current focus is on solidifying the information gathering features of the Vine Toolkit as well as adding new Web 2.0 user interfaces to the web source trees of projects. The long-term future for Vine is wide open. In addition to adding support for more Grid middleware and platforms, Vine can be used to build higher-level web services, applied to P2P applications and much more. The only limits are those set by the projects that fund its development.

Acknowledgments. The authors would like to extend their sincerest gratitude to Roman Wyrzykowski and of course to the Poznań Supercomputing and Networking Center. We would also like to thank the OMII-Europe Project, BEinGRID, Gridipedia [29], HPC-Europa and other European Union funded organizations that have made this work possible.
References
1. Foster, I.: What is the Grid? A Three Point Checklist. GRIDToday, July 20 (2002)
2. Foster, I.: Globus Toolkit Version 4: Software for Service-Oriented Systems. In: Jin, H., Reed, D., Jiang, W. (eds.) NPC 2005. LNCS, vol. 3779, pp. 2–13. Springer, Heidelberg (2005)
3. Erwin, D.W., Snelling, D.F.: UNICORE: A Grid Computing Environment. In: Sakellariou, R., Keane, J.A., Gurd, J.R., Freeman, L. (eds.) Euro-Par 2001. LNCS, vol. 2150, Springer, Heidelberg (2001)
4. Laure, E., et al.: Programming the Grid with gLite. Computational Methods in Science and Technology 12(1), 33–45 (2006)
5. Surridge, M., Taylor, S., De Roure, D., Zaluska, E.: Experiences with GRIA Industrial Applications on a Web Services Grid. In: First Int. Conf. e-Science and Grid Computing Proceedings (2005)
6. Sun, H., Zhu, Y., Hu, J.H., Liu, Y., Li, J.: Early Experience of Remote and Hot Service Deployment with Trustworthiness in CROWN Grid. In: Cao, J., Nejdl, W., Xu, M. (eds.) APPT 2005. LNCS, vol. 3756, Springer, Heidelberg (2005)
7. Lee, W., McGough, A.S., Darlington, J.: Performance Evaluation of the GridSAM Job Submission and Monitoring System. In: UK e-Science All Hands Meeting 2005, Proceedings (2005)
8. Antonioletti, M., et al.: The design and implementation of Grid database services in OGSA-DAI. Concurrency and Computation: Practice and Experience 17(2-4), 357–376 (2005)
9. Rajasekar, A., et al.: Storage Resource Broker - Managing Distributed Data in a Grid. Computer Society of India Journal, Special Issue on SAN (2003)
10. Alfieri, R., et al.: From gridmap-file to VOMS: managing authorization in a Grid environment. Future Generation Computer Systems 21(4), 549–558 (2005)
11. De Roure, D., Jennings, N.R., Shadbolt, N.R.: The Semantic Grid: Past, Present, and Future. Proceedings of the IEEE 93(3), 669–681 (2005)
12. Open Grid Forum, http://www.ogf.org
13. Job Submission Description Language (JSDL) Specification, Version 1.0. Global Grid Forum, GFD.56 (November 2005), http://www.ogf.org/documents/GFD.56.pdf
14. OGSA Basic Execution Service 1.0. Open Grid Forum, GFD.108 (July 2006), http://forge.gridforum.org/projects/ogsa-bes-wg
15. Goodale, T., et al.: SAGA: A Simple API for Grid Applications. High-level application programming on the Grid. Computational Methods in Science and Technology 12(1), 7–20 (2006)
16. Simple API for Grid Applications Core Working Group (SAGA-CORE-WG), http://www.ogf.org
17. HPC-Europa Project, http://www.hpc-europa.org
18. OMII-Europe Project, http://www.omii-europe.org
19. BEinGRID Project, http://www.beingrid.eu
20. Intel Grid Programming Environment, An Overview, http://gpe4gtk.sourceforge.net/GPE-Whitepaper.pdf
21. Novotny, J., Russell, M., Wehrens, O.: GridSphere: a portal framework for building collaborations. Concurrency and Computation: Practice and Experience 16(5), 503–513 (2004)
22. Russell, M., Novotny, J., Wehrens, O.: The Grid Portlets Web Application: A Grid Portal Framework. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 691–698. Springer, Heidelberg (2006)
23. Java Portlet 1.0 API, http://jcp.org/aboutJava/communityprocess/final/jsr168/
24. Java Web Start, http://java.sun.com/products/javawebstart/
25. Java Servlet 2.3 API, http://jcp.org/en/jsr/detail?id=53
26. Miller, P.: Web 2.0: Building the New Library, http://www.ariadne.ac.uk/issue45/miller/
27. Google Web Toolkit, http://code.google.com/webtoolkit/
28. Adobe Flex, http://www.adobe.com/products/flex/
29. Gridipedia, http://www.gridipedia.eu
Enhancing Productivity in High Performance Computing through Systematic Conditioning

Magdalena Sławińska, Jarosław Sławiński, and Vaidy Sunderam

Dept. of Math and Computer Science, Emory University, 400 Dowman Drive, Atlanta, GA 30322, USA
{magg,jaross,vss}@mathcs.emory.edu
Abstract. In order to take full advantage of high-end computing platforms, scientific applications often require modifications to source codes, and to their build systems that generate executable files. The ever-increasing emphasis on productivity in HPC makes the efficiency of the build process an extremely important issue. Our work is motivated by the need for a systematic approach to the HPC life-cycle that encompasses build and porting tasks. In this paper we briefly present the design of the Harness Workbench Toolkit (HWT) that addresses the porting issue through a user-assisted conversion process. We briefly describe our adaptation capability model that includes routine conversions (string substitutions) as well as more advanced transformations such as 32-64 bit changes. Next, we show and discuss results of an experiment with a production source code (the CPMD application) that examines the effort of adapting the baseline code (the Linux distribution) to specific high-end machines (IBM SP4, Cray X1E, Cray XT3/4) in terms of the number of necessary conversions. Based on the conversion capability model, we have implemented conversion assistant modules that were used in the experiment. The experimental results are promising and demonstrate that our approach takes a step towards improving the overall productivity of scientific applications on high-end machines.

Keywords: porting parallel programs, productivity, legacy codes, conditioning, automatic code transformations, software tools.
1 Introduction

High performance computing is well-established as a mainstream methodology for conducting science. The efficacy of applications, and the resulting scientific advances, are directly related to “productivity” in HPC – traditionally measured by raw performance, problem-size, and machine efficiency. This and related notions of productivity form the focus of several recent efforts, including the DARPA High Productivity Computing Systems (HPCS) [1] program and Cray’s adaptive supercomputing [2] vision. Whereas the HPCS initiative focuses on developing a new generation of economically viable systems that follow the rate of underlying technology improvement in delivering increased value
Research, publication, and presentation supported in part by DOE grant DE-FG0206ER25729, NSF grant CNS-720761, and a faculty travel grant from The Institute of Comparative and International Studies, Emory University.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 341–350, 2008. © Springer-Verlag Berlin Heidelberg 2008
to end-users [3], Cray aims to provide the next productivity breakthrough by developing a single adaptable HPC platform that integrates multiple processing technologies. However, HPC productivity is also greatly influenced by several other less tangible factors, the most significant among them being the preparatory stages of application execution. Thus another strategy to improve HPC productivity is to reduce the effort involved in scientific application development. Research shows that in terms of productivity, the build process, i.e., compilation, linking, installation, deployment, and staging, may consume up to 30% of the development effort [4]. This large effort results mainly from the variety of software packages, program-building tools, and hardware architectures, together termed the build problem [5]. Adaptation and deployment of scientific codes on high-performance platforms is particularly challenging due to new and unique architectures and the legacy nature of applications that have evolved over several decades with continual tweaking. Modern scientific codes are usually developed on common Unix-like architectures and need to be ported to run on specific HPC machines that are controlled by specialized (often lightweight) kernels. In order to obtain efficient executables, modifications are necessary that depend on the vendor’s compilers and libraries. Some are relatively easy to define, such as function renaming (e.g. POSIX functions), removing unsupported system calls and references to signals, taking care of pointer sizes, and handling the absence of the TCP protocol (sockets). In addition to these more or less straightforward source code changes, more challenging modifications, such as fundamental algorithm reimplementation (e.g. vectorization) or changing communication patterns to reflect specific hardware, might be necessary.
Our project aims to partially relieve site administrators, porting specialists, and computational scientists from the burdens of the build process by providing a systematic, integrated, and streamlined approach, in contrast to current ad-hoc practices. In order to accomplish this we propose a toolkit that facilitates conditioning, i.e., adaptation, of HPC applications to target platforms. In this paper we focus on the source code application adaptation aspect. Our methodology is based on a toolkit-assisted approach that can facilitate routine tasks. Discussions with application scientists and experts at Oak Ridge National Laboratory (ORNL), summarized in Section 3, show that conditioning is often tedious and requires cross-domain expertise. Furthermore, there is an acute lack of dedicated toolkits that can support this large and complex process. Although it is likely that conditioning HPC applications cannot be fully automated and requires human intervention, results of the conducted research and experiments show that certain patterns can be observed. In Section 4 we define the adaptation capability model based on the analysis of a few ORNL application source codes and demonstrate the productivity improvement with developed conversion assistants (Section 5). Finally, Section 6 concludes the paper.
2 Related Work

As mentioned, execution of scientific codes on HPC platforms requires dealing with adaptation of scientific source codes, the build system and conditioning the target environment. There are many ways of tackling different aspects of these phases. One
way to approach conditioning issues is to learn from HPC community experiences that often are documented as guidelines or case studies. For instance, HPC vendors or compiler providers support their users with porting guidelines, compiler optimization flags, optimized libraries [6,7], or even with documentation on how to build popular scientific applications [8]. Clearly, this approach, albeit valuable, is very general, and in a specific situation a well-documented use case may be more desirable. Recent porting cases regard technologies that have gained popularity such as multi-core processing or reconfigurable supercomputers based on Field Programmable Gate Array (FPGA) technology. They report on portable frameworks [9,10], guidelines for specific applications [11,12,13], or algorithms [14,15]. Although addressing multi-core or high-performance reconfigurable computing platform porting issues is not our project’s main focus, our methodology can be applied to facilitate conditioning on such machines. We propose to encapsulate system- and application-specific knowledge into reusable, community-shared descriptions called profiles. Profiles instruct our conditioning assistant modules to retrieve appropriate parameters for conditioning tasks such as environment variable settings, compilation suites (compilers, libraries, optimization flags, etc.) or code snippets. Another way to deal with some conditioning issues is to utilize standards. For instance, portable run-time environments such as the Open Run-Time Environment (ORTE) [16] provide a unified interface to interprocess communication, resource discovery and allocation, and process launching across heterogeneous platforms. Our project intends to benefit from the unification provided by ORTE, e.g. with respect to application launching. Another notable example is GNU Autotools, which helps overcome the build problem by standardizing the compilation, linking, and installation process.
However, GNU Autotools does not fully address all the aspects characteristic of the HPC domain such as cross-compilation, use of restricted microkernels, and tuning to the hardware environment. In addition, it introduces its own compatibility issues such as the requirement for compatible versions at the user’s and developer’s side [17]. Our approach supports different build systems (Makefile-based as well as GNU Autotools-based) by providing a unified interface and encapsulating specific features into profiles. Providing tools that facilitate porting or building processes takes conditioning a step further. For instance, Environment Modules facilitates management of configuration information regarding software packages and libraries installed on a given computer system [18]. Although in the past there have been efforts to develop tools for automatic porting [19], currently a supportive approach is preferred due to application complexity. The most advanced software solutions provide integrated programming environments that facilitate conditioning [20,21]. However, they are usually restricted to popular operating systems or very specific architectures. Instead of developing a new IDE for conditioning, our project aims to provide uniform access to build tools through a tool-virtualization methodology and by storing expert knowledge regarding specific tool configuration requirements, parameters, or options into profiles. This knowledge would be described once and repeatedly applied. In the HPC arena, Eclipse PTP [22] provides an interesting option for GUI-oriented development and a run-time interface to heterogeneous systems. Our project can be perceived as complementary to Eclipse PTP in terms of supplying conditioning Eclipse plugins.
3 A New Approach to Conditioning

The traditional conditioning model is presented in Figure 1. End-users connect from their client systems to designated front-end nodes through remote (secure) connections.
Fig. 1. Traditional and proposed conditioning model
The security policy depends on the high-end system and may vary, e.g., static passwords and private-key authentication methods or one-time passwords. In fact, the manner of the user-supercomputer interaction is forced by the front-end node environment. Fortunately, front-end nodes are often controlled by the Linux operating system, albeit usually modified by the vendor. Despite the popularity of Linux, and given that scientists often need to use several computational centers, adjusting to a new environment can be inconvenient for researchers and may distract them from their actual work, especially in the context of the build problem. Analysis of current practices at large scientific computing establishments (e.g. DOE labs) suggests that application building is usually accomplished via GNU Autotools and Makefile-based systems, sometimes through proprietary scripts, and occasionally without any provided build system. But even when standard tools are employed, user modifications are often needed due to, e.g., cross-compilation issues. These factors contribute to lost productivity and, worse, to inconsistent or even incorrect results due to misuse of compiler flags or wrong library versions. Moreover, executing applications on different high-end platforms usually involves different preparatory steps that lead to further inefficiencies or inaccuracies. In order to address the above issues, we propose a shift in the conditioning model, as depicted in Figure 1. In our model the user works locally, instead of working directly
on a remote front-end node. We propose the concept of build-tool virtualization to provide a common and unified interface to conditioning tools. In our approach the end-users interact with the Harness Workbench Toolkit (HWT) and issue a virtualized build command. The building assistant (Figure 2), guided by information stored in declarative profiles, orchestrates and runs the relevant plugin modules (e.g. for environment preconditioning, staging, and compiling) and passes the resulting command to the execution assistant. The execution assistant (also steered by profiles) generates and executes actual target-specific commands.
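The flow from a virtualized build command to target-specific commands can be sketched as below. This is only an illustration under stated assumptions, not the HWT implementation: the profile contents, the target names and the command strings (including the module and Makefile names) are placeholders invented for the example, and `building_assistant` and `execution_assistant` are hypothetical functions that merely echo the roles described in the text.

```python
# Hypothetical profiles: each target maps virtual steps to concrete commands.
# The command strings are illustrative placeholders, not verified site settings.
PROFILES = {
    "linux-pc": {
        "precondition": [],
        "compile": ["make -f Makefile.linux"],
    },
    "cray-xt": {
        "precondition": ["module load PrgEnv-pgi"],
        "compile": ["make -f Makefile.xt CC=cc FC=ftn"],
    },
}


def building_assistant(virtual_command, target, profiles=PROFILES):
    """Expand a virtualized command into an ordered list of plugin steps."""
    if virtual_command != "build":
        raise ValueError("unsupported virtual command: " + virtual_command)
    profile = profiles[target]
    steps = [("precondition", c) for c in profile["precondition"]]
    steps += [("compile", c) for c in profile["compile"]]
    return steps


def execution_assistant(steps, run=None):
    """Turn steps into target-specific commands; run them if a runner is given."""
    commands = [command for _, command in steps]
    if run is not None:
        for command in commands:
            run(command)
    return commands


def virtual_build(target):
    """The single command the user issues, regardless of the target."""
    return execution_assistant(building_assistant("build", target))
```

The user-facing call stays the same for every machine (`virtual_build("cray-xt")` versus `virtual_build("linux-pc")`), while everything machine-specific lives in the profile data.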
Fig. 2. Harness Workbench Toolkit
The HWT architecture is presented in Figure 2. The HWT consists of three layers designed to be pluggable to support toolkit extensibility. The HWT behavior is configurable and tunable through declarative profiles that incorporate expert knowledge. Profiles, described in more detail in our previous paper [23], embody target platform specific knowledge, and may inherit from or override other descriptions. In addition, dynamic recursive resolution allows profile elements to cross-refer, and thus enables switching among predefined settings to select an appropriate suite of, e.g., compiler flags provided by a vendor. The other very important aspect of conditioning, as mentioned earlier, is adaptation of the application source code to the target platform. In the HWT, the porting assistant layer is dedicated to source code adaptation tasks. Its merits are based on the adaptation (conversion) capability model that aims to systematize required and commonly performed application source code conversions.
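Profile inheritance, overriding and recursive cross-references can be illustrated with a minimal sketch. The actual profile format is described in the authors' earlier paper [23] and is not reproduced here; the dictionary layout, the `inherits` key, the `${name}` reference syntax and all setting names and values are assumptions made purely for this example.

```python
import re

# Hypothetical profile store; "inherits" names a parent profile.
PROFILES = {
    "base": {"fc": "f90", "opt": "-O2", "fflags": "${opt}"},
    "vendor-x": {"inherits": "base", "fc": "ftn", "opt": "${opt_fast}",
                 "opt_fast": "-O3", "opt_safe": "-O1"},
}

_REF = re.compile(r"\$\{(\w+)\}")


def flatten(name, profiles):
    """Merge a profile with its ancestors; child entries override parents."""
    profile = profiles[name]
    merged = flatten(profile["inherits"], profiles) if "inherits" in profile else {}
    merged.update((k, v) for k, v in profile.items() if k != "inherits")
    return merged


def resolve(key, settings, _seen=()):
    """Recursively expand ${name} references inside a setting."""
    if key in _seen:
        raise ValueError("circular reference: " + " -> ".join(_seen + (key,)))
    return _REF.sub(lambda m: resolve(m.group(1), settings, _seen + (key,)),
                    settings[key])


settings = flatten("vendor-x", PROFILES)
```

With these toy profiles, `fflags` is inherited from `base` but resolves through the overridden `opt`; pointing `opt` at `${opt_safe}` instead switches the whole suite of flags in one place, which is the kind of switching among predefined settings the paragraph describes.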
4 Adaptation Capability Model

Porting can be defined as a set of tasks required to launch and correctly execute the application on a target machine. Whereas it usually involves source code modifications, the application semantics needs to be preserved. In particular, the uniqueness of high-end machines and pursuit of peak performance makes this process exceptionally challenging in the HPC domain. We propose a toolkit-assisted approach that can facilitate routine tasks. In order to determine the routine activities that porting specialists deal with we have examined eight scientific codes in ORNL production use from a wide spectrum of computational science (chemistry, biology, fusion, computer science, and climate). We focused on those applications with available baseline and ported source codes, relevant to ORNL computing systems (Cray X1E (vector machine), Cray XT3/4, IBM SP4 (PowerPC processors)). A PC Linux distribution of a scientific application served as a baseline code. The results of our analysis are presented in the form of porting conversion categories shown in Figure 3.
Fig. 3. Conversion capability model
In general, we can distinguish between two main code conversion categories, namely automatic and manual. The former refers to the set of conversions that to some extent may be automated, although user steering and input is still necessary. The simplest automatic conversions are substitutions that play a similar role to name refactoring (e.g. adding the prefix PXF for POSIX functions on Cray machines, or identifier mangling conventions). More advanced conversions concern pattern mapping. As an illustration, consider the different parameter passing conventions, e.g., the PGI Fortran compiler CALL FREE(PTR) and the IBM Fortran compiler CALL FREE(%VAL(PTR)), or time functions that may differ in semantics on various high-end machines. The other example relates to library incompatibilities such as a new version of the same library that has not been ported to a machine yet (e.g. FFTW 3.x → 2.x), or highly vendor-optimized
library counterparts. In addition to converting patterns into patterns, there are conversions that track dependencies outside the matched patterns. For instance, consider the code adaptation from the 32-bit to the 64-bit compilation model for Fortran on Cray X1E. It involves tracking variables’ declarations and the existence of the compilation parameter ’-sdefault64’ in the build system. The other very important example of a tracking template conversion is loop unrolling. This technique attempts to optimize loop execution time (e.g. by utilizing the maximum number of available CPU registers) and is commonly used for fast computations. Apart from mapping conversions there are cases where a given HPC system does not support certain features such as signals, threads, some system calls, sockets, or a synthetic file system (i.e., /proc). Detection conversions intend to deal with such situations and inform the user about non-portabilities. In general, detections trigger manual code adaptations that usually require expert knowledge of the hardware (architecture), system software in terms of compiler switches, usage of relevant libraries or versions, and application algorithms. For instance, in order to utilize a streaming feature such as SSE or 3DNow, the algorithm must be implemented in assembly code. Another example concerns code vectorization to fully exploit vector processors, manual loop unrolling, or performance optimization and tuning. Based on the described conversion capability model, we have developed conversion assistant modules for the HWT porting assistant layer in order to perform the porting experiment with production scientific code.
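Three of the automatic conversion kinds described above (substitution, basic pattern mapping and detection) can be sketched with regular expressions. This is a simplified illustration, not the HWT conversion assistants: the function names and the list of detected features are assumptions, the substitution only renames identifiers without adjusting argument lists, and a real tool would need a proper Fortran or C parser rather than line-oriented patterns.

```python
import re


def substitute(source, names, prefix="PXF"):
    """Substitution: add a prefix to whole-word occurrences of given names."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b",
                         re.IGNORECASE)
    return pattern.sub(lambda m: prefix + m.group(1), source)


# Matches CALL FREE(<arg>) unless the argument is already wrapped in %VAL.
_FREE = re.compile(r"\bCALL\s+FREE\s*\(\s*(?!%VAL)([^()]+?)\s*\)", re.IGNORECASE)


def map_free_call(source):
    """Basic template: CALL FREE(PTR) -> CALL FREE(%VAL(PTR))."""
    return _FREE.sub(lambda m: "CALL FREE(%VAL(" + m.group(1) + "))", source)


# Hypothetical patterns for features a target system may not support.
UNSUPPORTED = {
    "signals": re.compile(r"\bsignal\s*\(", re.IGNORECASE),
    "sockets": re.compile(r"\bsocket\s*\(", re.IGNORECASE),
    "procfs": re.compile(r"/proc/"),
}


def detect(source):
    """Detection: report (line number, feature) pairs for manual review."""
    findings = []
    for number, line in enumerate(source.splitlines(), start=1):
        for feature, pattern in UNSUPPORTED.items():
            if pattern.search(line):
                findings.append((number, feature))
    return findings
```

Both rewriting functions are written to be idempotent, so running a conversion twice over already converted code leaves it unchanged; detection deliberately changes nothing and only points the user at lines that need manual adaptation.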
5 CPMD Experiment
We have implemented examples of substitution, template, and detection conversions of the conversion capability model (Figure 3) as appropriate porting assistant modules (Figure 2). The goal of this experiment was to compare the number of modifications of the baseline source code that can be supported by our conversion modules with the total number of modifications necessary to successfully build and execute an application on a given high-end machine. As target platforms we chose HPC systems relevant to ORNL, i.e., IBM SP4, Cray X1E, and Cray XT3/4. From the analysis standpoint, the most convenient approach would be a comparison of original source codes with their ported counterparts. Unfortunately, obtaining original source codes may encounter considerable obstacles due to their constant adaptation over time. To overcome this difficulty, a preprocessor can be used to generate architecture-specific versions of application source codes. Therefore, we chose the CPMD application [24] since it supports hundreds of configuration sets including the architectures of interest to us. We assumed the baseline code is the CPMD PC Linux distribution. CPMD is a molecular dynamics scientific code consisting of about 700 files implemented in Fortran (and an occasional C file). The CPMD build system is based on a shell script that generates the appropriate Makefile. The results of our experiment are presented in Table 1. The Cray X1E turned out to be the most challenging of all of the examined high-end machines in terms of the number of required source code modifications. This includes manual as well as automatic conversions and results from the vector
M. Sławińska, J. Sławiński, and V. Sunderam

Table 1. CPMD experimental results

Conversion   Specific conversion         Target HPC system
category                                 IBM SP4   Cray X1E   Cray XT3/4
-----------------------------------------------------------------------
manual       changing FFT → ESSL FFT        12         7        N/A
             optimization & tuning           4        95          9
             unknown                         2         0          7
             Total                          18       102         16
automatic    substitution                   11        11          0
             detection                       8        21          6
             basic template                  4         1          2
             tracking template               1        43         18
             Total                          24        76         26
architecture of Cray X1E and the non-vectorized CPMD baseline code. The CPMD application contains its own FFT library. However, in the case of IBM SP4 and Cray X1E the vendor provides an optimized version of FFT. Therefore, we distinguished a named optimization subgroup in Table 1. Other optimization and tuning conversions we identified concern manual loop unrolling (often combined with pragmas that enable aggressive code optimization, e.g. by indicating which variables can be shared), changing the 32-bit to the 64-bit compilation model, and zero-initializing memory (on IBM SP4, for performance reasons). Finally, there are cases that require expert knowledge for their explanation, which is why we classified them into the ’unknown’ category. With regard to automatic conversions we did not identify any substitutions for Cray XT3/4. This results from the existence of the same compiler for Cray XT3/4 and for Linux and of appropriate configuration files in the CPMD distribution, so there is no need for substitutions. In general, the number of required substitutions measures the portability of a given routine. The number of substitutions would be greater in the case of changing calculation precision. This usually implies exchanging the function suite and, in consequence, often many substitutions and tracking conversions (the number of arguments and their types often need to be adjusted to the counterpart function calls). For instance, changing calculation precision for the CPMD code with regard to the BLAS library requires exchanging a suite of 58 function names. Not surprisingly, due to its vector nature, Cray X1E leads in the detection conversion group, which included detection of unsupported signals and of the synthetic file system. The results shown in Table 1 indicate that tracking conversions are dominant (for Cray X1E and Cray XT3/4) over the basic templates and suggest that more advanced conversions may reduce the porting effort to a greater degree.
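As a small worked check of Table 1, the following sketch recomputes the column totals and the share of modifications that fell into the automatic category on each system. The numbers are copied from the table; treating the N/A entry as zero is an assumption made here only so that the totals can be recomputed.

```python
SYSTEMS = ("IBM SP4", "Cray X1E", "Cray XT3/4")

# Counts copied from Table 1; None stands for the N/A entry (assumed to count as 0).
MANUAL = {
    "changing FFT -> ESSL FFT": (12, 7, None),
    "optimization & tuning": (4, 95, 9),
    "unknown": (2, 0, 7),
}
AUTOMATIC = {
    "substitution": (11, 11, 0),
    "detection": (8, 21, 6),
    "basic template": (4, 1, 2),
    "tracking template": (1, 43, 18),
}


def column_totals(rows):
    """Sum each target-system column, treating None (N/A) as zero."""
    return tuple(sum(row[c] or 0 for row in rows.values()) for c in range(len(SYSTEMS)))


manual_total = column_totals(MANUAL)
automatic_total = column_totals(AUTOMATIC)

for name, m, a in zip(SYSTEMS, manual_total, automatic_total):
    print(f"{name}: {a} of {m + a} modifications automatic ({a / (m + a):.1%})")
```

Under these assumptions the automatic share is roughly 57% on IBM SP4, 43% on Cray X1E, and 62% on Cray XT3/4, which is consistent with the observation that Cray X1E required the most manual work.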
The obtained results demonstrate that our methodology is promising and may even contribute much more to porting applications not as well prepared for this process as CPMD (CPMD is portability-oriented, e.g. porting-sensitive routines or functions such as malloc() or open() are wrapped into proprietary functions). The outcome of this experiment shows that although manual conversions are much more cumbersome, require substantial effort and knowledge, and cannot be eliminated, we identified conversions that can be supported by a tool, and in this regard improve productivity of scientists involved in the porting process.
6 Conclusions and Future Work Despite maturing in many ways, high-end computing systems continue to require substantial effort in terms of application and environment adaptation to execute a scientific code on a target HPC platform. Conditioning needs will intensify in the future as new technologies emerge and gain popularity (e.g. multicore processors, reconfigurable supercomputers). Utilization of legacy codes is inevitable due to their tested reliability and demonstrated performance. In this paper we introduce the Harness Workbench Toolkit that supports conditioning of the environment and adaptation of source code to a particular high-end machine. We focus on the porting aspect of the conditioning, namely the HWT porting assistant layer. We describe the conversion capability model that classifies source code adaptation conversions. Our experiment with developed conversion assistants performed on the production code (the CPMD application) demonstrates that our approach is feasible and may improve productivity of computational scientists by limiting their involvement only to those porting tasks that cannot be supported by software (i.e., optimization and tuning). In addition, the results of the CPMD experiment show that conversion assistants can be parametrized according to a machine architecture. Therefore, our future work will concentrate on including architecture-specific information into profiles. For instance, the set of detection conversion assistants for a given high-end machine would be determined by the machine’s profile. We also plan to develop conversion assistants as lexical analyzers. Currently, they are implemented as regular expressions and were sufficient for the experiment. In addition, we intend to perform similar experiments with scientific applications implemented in C and mixed (Fortran/C) programming languages.
References 1. Feldman, M.: HPC, Thy Name is Productivity. HPCwire (March 2007), http://www.hpcwire.com/ 2. Snell, A., Wu, J., Willard, C.G., Joseph, E.: Bridging the Capability Gap: Cray Pursues Adaptive Supercomputing Vision. White Paper (February 2007), http://www.cray.com/downloads/IDC-AdaptiveSC.pdf 3. Kepner, J.: HPC Productivity: An Overarching View. The International Journal of High Performance Computing Applications 18(4), 393–397 (2004) 4. Kumfert, G.K., Epperly, T.G.W.: Software in the DOE: The hidden overhead of the build. Technical Report UCRL-ID-147343, Lawrence Livermore National Laboratory (2002) 5. Dubois, P.F., Kumfert, G.K., Epperly, T.G.W.: Why Johnny can’t build. Computing in Science and Engineering 5(5), 83–88 (2003) 6. Cray, Inc.: Craydoc (2007), http://docs.cray.com/ 7. Sun Microsystems, Inc.: Porting UNIX Applications to the Solaris Operating Environment (1999), http://developers.sun.com/solaris/articles/portingUNIXapps.html 8. PathScale, LLC: Building popular codes with PathScale (August 2007), https://www.pathscale.com/building code/index.html 9. Saunders, R., Jeffery, C., Jones, D.T.: A Portable Framework for High-Speed Parallel Producer/Consumers on Real CMP, SMT and SMP Architectures. In: Proc. 21st Intl. Parallel and Distr. Processing Symp. (IPDPS 2007), Long Beach, CA (March 2007)
10. Bader, D., Kanade, V., Madduri, K.: SWARM: A Parallel Programming Framework for Multicore Processors. In: Proc. 21st Intl. Parallel and Distr. Processing Symp (IPDPS 2007), Long Beach, CA (March 2007) 11. Olivier, S., Prins, J., Derby, J., Vu, K.: Porting the GROMACS Molecular Dynamics Code to the Cell Processor. In: Proc. 21st Intl. Parallel and Distr. Processing Symp (IPDPS 2007), Long Beach, CA (March 2007) 12. Petrini, F., Fossum, G., Fernandez, J., Varbanescu, A.L., Kistler, M., Perrone, M.: Multicore Surprises: Lessons Learned from Optimizing Sweep3D on the Cell Broadband Engine. In: Proc. 21st Intl. Parallel and Distr. Processing Symp. (IPDPS 2007), Long Beach, CA (March 2007) 13. Kindratenko, V., Pointer, D.: A Case Study in Porting a Production Scientific Supercomputing Application to a Reconfigurable Computer. In: IEEE Symp. Field-Programmable Custom Computing Machines (FCCM 2006), pp. 13–22. IEEE CS Press, Los Alamitos (2006) 14. Villa, O., Scarpazza, D.P., Petrini, F., Peinador, J.F.: Challenges in Mapping Graph Exploration Algorithms on Advanced Multi-core Processors. In: Proc. 21st Intl. Parallel and Distr. Processing Symp (IPDPS 2007), Long Beach, CA (March 2007) 15. Brunner, R., Kindratenko, V., Myers, A.: Developing and Deploying Advanced Algorithms to Novel Supercomputing Hardware. In: NASA Science Technology Conference – NSTC 2007 (2007) 16. Castain, R.H., Woodall, T.S., Daniel, D.J., Squyres, J.M., Barrett, B., Fagg, G.E.: The open run-time environment (openrte): A transparent multi-cluster environment for high-performance computing. In: Proc. 12th European PVM/MPI Users Group Meeting, Sorrento, Italy (September 2005) 17. Doar, M.B.: 5. In: Practical Development Environments, O’Reilly, Sebastopol (2005) 18. Furlani, J.L., Osel, P.W.: Environment Modules Project (2005), http://modules.sourceforge.net/ 19. Muppidi, S., Krawetz, N., Beedubail, G., Marti, W., Pooch, U.: Distributed computing environment (DCE) porting tool.
In: Proceedings of the IFIP/IEEE International Conference on Distributed Platforms: Client/Server and Beyond: DCE, CORBA, ODP and Advanced Distributed Applications, pp. 115–129 (1996) 20. Sun Microsystems, I.: Sun Studio 12 C, C++ & Fortran Compilers and Tools (2007), http://developers.sun.com/sunstudio/ 21. SRC Computers, I.: SRC’s Carte Programming Environment (2007), http://www.srccomp.com/SoftwareElements.htm 22. The Eclipse Foundation: Parallel Tools Platform (2007), http://www.eclipse.org/ptp 23. Sławińska, M., Sławiński, J., Kurzyniec, D., Sunderam, V.: Enhancing Portability of HPC Applications across High-end Computing Platforms. In: Proc. 21st Intl. Parallel and Distr. Processing Symp (IPDPS 2007), Long Beach, CA (March 2007) 24. CPMD Consortium: Car-Parrinello Molecular Dynamics – CPMD 3.11 (March 2006), http://www.cpmd.org/
A Formal Model of Multi-agent Computations
Maciej Smolka
Institute of Computer Science, Jagiellonian University, Kraków, Poland
[email protected]
Abstract. The paper extends a formal model of a multi-agent computing system, developed in previous publications, to a more general system state. We also provide a more detailed treatment of the model in the case of a homogeneous hardware environment. The model provides us with a precise definition of the optimal task scheduling together with results on the existence and characterization of optimal scheduling strategies.
1 Introduction
The application of the multi-agent paradigm together with local, diffusion-based agent scheduling strategies provides us with a relatively simple decentralized management of large-scale distributed computations. Multi-agent systems based on the idea of task diffusion (cf. [1]) dedicated to large-scale computations have been developed for the last several years [2]. A formal model for such systems was introduced in [3] and has been further developed in [4,5,6]. It provides us with a precise definition of the optimal task scheduling as well as results on the existence of optimal scheduling strategies and optimality conditions. In this paper the model is extended to consider a more general system state. We also provide more detailed modelling of a simple but realistic case of a homogeneous network of identical machines. Finally, results on the existence and characterization of optimal scheduling strategies are presented.
2 Architecture of a Computing MAS
The principles of the MAS architecture which form the basis for our mathematical model were defined in [2,7] and have been further developed in [8,9]. The full description as well as the details of realization and test results can be found therein. Here let us recall only some crucial points. The suggested architecture of the system is composed of a computational environment (MAS platform) and a computing application being a collection of mobile agents called Smart Solid Agents (SSA). The computational environment is a triple $(\mathbf{N}, B_H, perf)$, where: $\mathbf{N} = \{P_1, \dots, P_N\}$, where $P_i$ is a Virtual Computational Node (VCN). Each VCN can maintain a number of agents.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 351–360, 2008. © Springer-Verlag Berlin Heidelberg 2008
$B_H$ is the connection topology $B_H = \{N_1, \dots, N_N\}$, where $N_i \subset \mathbf{N}$ is the immediate neighborhood of $P_i$ (including $P_i$ itself). $perf = \{perf_1, \dots, perf_N\}$, $perf_i : \mathbb{R}_+ \to \mathbb{R}_+$, is a family of functions where $perf_i$ describes the relative performance of VCN $P_i$ with respect to the total memory request $M^i_{total}$ of all agents allocated at the node. The MAS platform is responsible for maintaining the basic functionalities of the computing agents. Namely, it delivers the information about the local load concentration $\mathcal{L}_j$ and $Q_j$ (see (2) and (3) below), it performs agent destruction, partitioning and migration between neighboring VCNs, and finally it supports transparent communication among agents. We shall denote an SSA by $A_i$, where the index $i$ stands for an unambiguous agent identifier. Each $A_i$ contains its computational task and all data necessary for its computations. Every agent is also equipped with a shell which is responsible for the agent logic. At any time $A_i$ is able to report the pair $(E_i, M_i)$, where $E_i$ is the estimated remaining computation time measured in units common for all agents of an application and $M_i$ is the agent’s RAM requirement in bytes. An agent may autonomously undertake one of the following actions: continue executing its internal task, migrate to a neighboring VCN, or decide to be partitioned, which results in creating two child agents $\{A_{i_j} = (T_{i_j}, S_{i_j})\}$, $j = 1, 2$. We assume that in the case of agent partitioning the following conditions hold: $E_i > E_{i_j}$, $M_i > M_{i_j}$, $j = 1, 2$. The parent SSA disappears after such a partition. A computing application may be characterized by the triple $(A_t, G_t, Sch_t)$, $t \in [0, +\infty)$, where $A_t$ is the set of application agents active at the time $t$ and $G_t$ is the tree representing the history of agents’ partitioning until $t$. All agents active until $t$ constitute the set of its nodes $\bigcup_{s=0}^{t} A_s$, while the edges link parent agents to their children.
All information on how to rebuild $G_t$ is spread among all agents such that each of them knows only its neighbors in the tree. $\{Sch_t\}_{t \in [0,+\infty)}$ is the family of functions such that $Sch_t : A_t \to \mathbf{N}$ is the current schedule of application agents among the MAS platform servers. The function is defined by the sets $\omega_j$ containing indices of agents allocated on each $P_j \in \mathbf{N}$. Every $\omega_j$ is locally stored and managed by $P_j$. Each server $P_j \in \mathbf{N}$ periodically asks all local agents (allocated on $P_j$) for their requirements and computes the local load concentration
$$L_j = \frac{E^j_{total}}{perf_j(M^j_{total})}, \quad \text{where } E^j_{total} = \sum_{i \in \omega_j} E_i \text{ and } M^j_{total} = \sum_{i \in \omega_j} M_i. \tag{1}$$
Then $P_j$ communicates with neighboring servers and establishes
$$\mathcal{L}_j = \left\{ (L_k, E^k_{total}, M^k_{total}, perf_k) : P_k \in N_j \right\} \tag{2}$$
as well as the set of node indices
$$Q_j = \{ k \neq j : P_k \in N_j,\ L_j - L_k > 0 \}. \tag{3}$$
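The quantities (1) and (3) can be sketched in a few lines of code. The performance function `perf`, its `ram` parameter, and the sample data are hypothetical; only the formulas for $L_j$ and $Q_j$ come from the text.

```python
def load_concentration(agents, perf):
    """L_j = E_total / perf_j(M_total) for the agents (E_i, M_i) on one node, as in (1)."""
    e_total = sum(e for e, _ in agents)
    m_total = sum(m for _, m in agents)
    return e_total / perf(m_total)


def less_loaded_neighbors(j, neighborhood, loads):
    """Q_j = {k != j : P_k in N_j, L_j - L_k > 0}, as in (3)."""
    return {k for k in neighborhood[j] if k != j and loads[j] - loads[k] > 0}


def perf(m_total, ram=1024.0):
    """Hypothetical performance model: full speed until the memory request exceeds RAM."""
    return 1.0 if m_total <= ram else ram / m_total


# Hypothetical platform: three nodes, agents given as (E_i, M_i) pairs.
agents_on = {0: [(10.0, 512.0), (6.0, 256.0)], 1: [(4.0, 128.0)], 2: []}
neighborhood = {0: {0, 1, 2}, 1: {0, 1}, 2: {0, 2}}  # each N_j includes P_j itself

loads = {j: load_concentration(agents, perf) for j, agents in agents_on.items()}
print(loads)
print(less_loaded_neighbors(0, neighborhood, loads))
```

In this toy configuration node 0 is the most loaded, so both of its neighbors belong to $Q_0$, while $Q_1$ and $Q_2$ are empty.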
The mentioned papers as well as [4] describe migration and partitioning strategies which make use of the above-defined quantities. These strategies are related to the physical phenomenon of the molecular diffusion in crystals.
3 System State Revisited
Next let us recall the features of a mathematical model of multi-agent computations, which is based on the presented architectural principles. It has already been presented in [3,4,5,6]. Here we shall propose a generalization of the model, namely we shall consider a more general form of the system state. Let us recall that we have considered the set $\mathbf{A}$ of all possible agents of an application. Because of problems with the determination of such a set, the crucial idea is to avoid considering the evolution of a single agent and to study the dynamics of a whole computing application instead. We observe the system state at discrete time moments. Let us recall the notion of the vector weight of an agent, which is the mapping $w : \mathbb{N} \times \mathbf{A} \to \mathbb{R}^M_+$. Note that, unlike in our previous papers, we allow $M$ to be greater than 2. We assume that the dependency of the total weight of child agents after partition upon their parent’s weight before partition is well known and linear, i.e. there is a matrix $P \in \mathbb{R}^{M \times M}_+$ such that in the case of partition $A \to \{A_1, A_2\}$ we have
$$w_{t+1}(A_1) + w_{t+1}(A_2) = P\, w_t(A). \tag{4}$$
This time we relax the assumption of componentwise dependency; if the dependency is componentwise, $P$ is diagonal. Next let us recall the notion of the total weight of all agents allocated on a virtual node $P$ at any time $t$, i.e.
$$W_t(P) = \sum_{Sch_t(A) = P} w_t(A).$$
This notion is crucial in our search for a global description of the system dynamics, since it is a global quantity describing the state of the system in a way that is appropriate for our purposes. As for the components of $w$, it is straightforward to choose $E_i$ and $M_i$ defined in Sec. 2, as we did in previous papers. Then obviously $E^i_{total}$ and $M^i_{total}$ are the components of $W$. In general it may be convenient to find other state variables (see Sec. 5). In any case both $E^i_{total}$ and $M^i_{total}$ should remain observables of our system. In the sequel we shall assume that the number of virtual nodes $N$ (the cardinality of the node set $\mathbf{N}$) is fixed. Let us introduce the notation $W^j_t = W_t(P_j)$ for $j = 1, \dots, N$. Then $W_t$ may be interpreted as a vector from $\mathbb{R}^{MN}_+$ or, if it is more convenient, as a nonnegative $M \times N$ matrix. The equality $W_t = 0$ means that at the time $t$ there is no computational activity. Thus 0 is the target of the application’s evolution: once our computing MAS reaches this state we want it to stay there forever.
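To make the total weight $W_t(P)$ and the partition rule (4) concrete, here is a minimal sketch. The agent names, the weights, and the matrix `P` are hypothetical; the two weight components are taken to be $(E_i, M_i)$ as suggested in the text.

```python
def total_weights(schedule, weights, nodes):
    """W_t(P) = sum of w_t(A) over all agents A with Sch_t(A) = P."""
    dim = len(next(iter(weights.values())))
    totals = {p: [0.0] * dim for p in nodes}
    for agent, node in schedule.items():
        for m, value in enumerate(weights[agent]):
            totals[node][m] += value
    return totals


def children_total_weight(P, parent_weight):
    """Right-hand side of (4): the matrix-vector product P w_t(A)."""
    dim = len(parent_weight)
    return [sum(P[r][c] * parent_weight[c] for c in range(dim)) for r in range(len(P))]


# Hypothetical data: weights are (remaining time E, memory M).
weights = {"A1": [8.0, 100.0], "A2": [4.0, 50.0], "A3": [2.0, 30.0]}
schedule = {"A1": "P1", "A2": "P1", "A3": "P2"}
P = [[1.0, 0.0], [0.0, 1.5]]  # assumed: time is conserved, memory grows by half

print(total_weights(schedule, weights, ["P1", "P2"]))
print(children_total_weight(P, weights["A1"]))
```

With a diagonal `P`, as here, each component of the children's total weight depends only on the same component of the parent's weight, which is the componentwise case mentioned above.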
4 Equations of Evolution
As said before we shall consider $W_t$ as a state of the computing application. Now we shall formulate the equations of evolution of $W_t$. Since they contain a stochastic term (see below), $W_t$ turns out to be a discrete stochastic process. The following state equations are an adaptation of the equations presented in [3] or [5] to the generalized state. First consider three simple cases.

'Established' evolution (when there are neither partitions nor migrations). Then we assume that the state equation has the form
$$W_{t+1} = F(W_t, \xi_t) \tag{5}$$
where $F$ is a given mapping and $(\xi_t)_{t=0,1,\dots}$ is a given sequence of random variables representing the background load influence. We assume that $\xi_t$ are mutually independent, identically distributed and have a common finite set of values $\Xi$, which is justified in many natural situations [6]. Since we want $W_t$ to stay at 0 (i.e. we want 0 to be an absorbing state of $W_t$), we need to assume that for every $t$
$$F(0, \xi_t) = 0 \tag{6}$$
with probability 1.

Partition at node $j$. Then we have
$$\begin{cases} W^j_{t+1} = \left(I - \mathrm{diag}(u^{jj}_t(W_t))\right) W^j_t + P\, \mathrm{diag}(u^{jj}_t(W_t))\, W^j_t \\ W^i_{t+1} = W^i_t \quad \text{for } i \neq j \end{cases} \tag{7}$$
where the components of $u^{jj}_t : \mathbb{R}^{MN}_+ \to [0,1]^M$ are the proportions of the weight components of splitting agents to the corresponding components of the total weight of all agents at node $j$ at time $t$. By $\mathrm{diag}(v)$ we denote the square matrix obtained by putting the elements of the vector $v$ on the diagonal and 0's outside the diagonal.

Migration from $j$ to $k$. In this case the state equations have the form
$$\begin{cases} W^j_{t+1} = \left(I - \mathrm{diag}(u^{jk}_t(W_t))\right) W^j_t \\ W^k_{t+1} = W^k_t + \mathrm{diag}(u^{jk}_t(W_t))\, W^j_t \\ W^i_{t+1} = W^i_t \quad \text{for } i \notin \{j, k\} \end{cases} \tag{8}$$
with $u^{jk}_t$ analogous to $u^{jj}_t$.

In reality all three cases usually appear simultaneously. Therefore the final state equations shall be a combination of them which reduces to a particular 'simple' case when there are no activities in the system related to the other cases. Here we present such a combination, namely we propose the following form of the state equations
$$\begin{cases} W^i_{t+1} = g^i\!\left( F^i(\widetilde{W}_t, \xi_t),\ P\, \mathrm{diag}(u^{ii}_t(W_t))\, W^i_t,\ \sum_{j \neq i} \mathrm{diag}(u^{ji}_t(W_t))\, W^j_t \right) \\ W^i_0 = \widehat{W}^i \end{cases} \tag{9}$$
for $i = 1, \dots, N$, where $\widetilde{W}^i_t = \left(I - \sum_{k=1}^{N} \mathrm{diag}(u^{ik}_t(W_t))\right) W^i_t$ and $\widehat{W}$ is a given initial state.

Remark 1. In our previous papers we considered only $g(s, p, m) = s + p + m$. A more general assumption on $g$ could be $g(s, 0, 0) = s$, $g(0, p, 0) = p$, $g(0, 0, m) = m$.

It follows that $W_t$ is a controlled stochastic process with a control strategy
$$\pi = (u_t)_{t \in \mathbb{N}}, \quad u_t : \mathbb{R}^{MN}_+ \to U. \tag{10}$$
The control set $U$ contains such elements $\alpha$ from $[0,1]^{M(N \times N)}$ that satisfy at least the following conditions for $m = 1, \dots, M$:
$$\alpha^{ij}_m \cdot \alpha^{ji}_m = 0 \ \text{ for } i \neq j, \qquad \alpha^{i1}_m + \dots + \alpha^{iN}_m \le 1 \ \text{ for } i = 1, \dots, N. \tag{11}$$
In fact quite often the conditions imposed on $U$ shall be more restrictive (see next section). The first equation in (11) can be interpreted in the following way: at a given time migrations between two nodes may happen in only one direction. The second condition, an inequality, says that the number of agents leaving a node must not exceed the number of agents present at the node just before the migration.

Remark 2. It is easy to see that the control set $U$ defined by the conditions (11) is compact (and so are of course its closed subsets).

In the most general case one might want to take the whole $\mathbb{R}^{MN}_+$ for the state space of the stochastic process $W_t$. But, on the other hand, in the most common situation its components represent some resources which are naturally bounded and quantized (see [5, Sec. 2] for an analysis of a special case). Therefore we shall assume that the state space is a finite subset of $\mathbb{N}^{MN}$ containing 0. Let us call the elements of this finite set $s_i$, i.e. $S = \{s_0 = 0, s_1, \dots, s_K\}$. This analysis of the state space has some consequences. First of all we have to assume that $F$ and $g$ have values in $S$. Likewise, the condition (11) is not sufficient for the equations (9) to make sense. Namely, we have to assume that for any $t \in \mathbb{N}$ and $W \in S$
$$u_t(W) \in U_W = \{ \alpha \in U : G(W, \alpha, \xi) \in S \text{ for } \xi \in \Xi \}$$
with
$$G^i(W, \alpha, \xi) = g^i\!\left( F^i\!\left( \Big(I - \sum_{k=1}^{N} \mathrm{diag}(\alpha^{ik})\Big) W^i, \xi \right),\ P\, \mathrm{diag}(\alpha^{ii})\, W^i,\ \sum_{j \neq i} \mathrm{diag}(\alpha^{ji})\, W^j \right) \tag{12}$$
denoting the right hand side of (9). The above equality implies that $G(W, 0, \xi) = F(W, \xi) \in S$ for any $W$ and $\xi$, which means that $0 \in U_W$; therefore $U_W$ is nonempty for every $W \in S$. Another consequence of (12) is that $G(0, \alpha, \xi) = F(0, \xi) = 0$, i.e. 0 remains an absorbing state of $W_t$ even if we apply some agent operations.

Remark 3. Given (9) it is easy to see that $W_t$ is a controlled Markov chain with transition probabilities $p_{ij}(\alpha) = \Pr(G(s_i, \alpha, \xi_0) = s_j)$ for $i, j = 0, \dots, K$, $\alpha \in U_{s_i}$. The transition matrix for the control $u$ is $P(u) = [p_{ij}(u(s_i))]_{i,j=0,\dots,K}$.
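The combined state equation (9) and the admissibility conditions (11) can be sketched as follows. This is only an illustration under explicit assumptions: $g(s, p, m) = s + p + m$ as in Remark 1, a default $F$ that ignores the background load (identity), and nested lists `alpha[i][k][m]` standing for $\alpha^{ik}_m$. None of these choices is prescribed by the model.

```python
def satisfies_conditions_11(alpha, tol=1e-12):
    """Check (11): one-directional migration and at most the whole weight leaving."""
    n, dim = len(alpha), len(alpha[0][0])
    for m in range(dim):
        for i in range(n):
            if sum(alpha[i][k][m] for k in range(n)) > 1.0 + tol:
                return False
            for j in range(n):
                if i != j and alpha[i][j][m] * alpha[j][i][m] != 0.0:
                    return False
    return True


def step(W, alpha, P, F=lambda W_stay, xi: W_stay, xi=None):
    """One step of (9) with g(s, p, m) = s + p + m; F defaults to the identity (assumption)."""
    n, dim = len(W), len(W[0])
    # Weight that neither splits nor migrates (the argument of F in (9)).
    stay = [[(1.0 - sum(alpha[i][k][m] for k in range(n))) * W[i][m]
             for m in range(dim)] for i in range(n)]
    s = F(stay, xi)
    W_next = []
    for i in range(n):
        split = [alpha[i][i][m] * W[i][m] for m in range(dim)]
        p = [sum(P[r][c] * split[c] for c in range(dim)) for r in range(dim)]
        arriving = [sum(alpha[j][i][m] * W[j][m] for j in range(n) if j != i)
                    for m in range(dim)]
        W_next.append([s[i][m] + p[m] + arriving[m] for m in range(dim)])
    return W_next


zero = [0.0, 0.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
W = [[8.0, 4.0], [0.0, 0.0]]
alpha = [[zero, [0.5, 0.5]], [zero, zero]]  # half of node 0's weight migrates to node 1
print(satisfies_conditions_11(alpha), step(W, alpha, identity))
```

With only the migration control active the step reduces to (8), and with only `alpha[i][i]` nonzero it reduces to (7), which mirrors the requirement that (9) specialize to the simple cases.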
5 Analysis of a Special Case
In this section we shall present more details on the state variables and the state equations in a simple but nontrivial special case. Let us assume that we use a homogeneous computational environment, i.e. VCNs are deployed on different identical physical nodes. Furthermore assume that the evolution of $E_i$ for each agent, when there are no migrations, partitions or delays related to a higher background load, is as shown in Fig. 1, i.e. linearly decreasing from the time of the agent’s creation $t_s$ until the end of computations. For each agent the maximal value $\bar{E}$ of $E_i$ may be different. Assume also that $M_i$ is equal to a positive constant throughout the agent’s life, and that before its creation and after its destruction $M_i$ is equal to 0 (Fig. 1). Of course not all computational agents behave in the presented way (e.g. HGS agents [9] do not), but there is an important class of agents related to CAE computations (e.g. SBS linear solver agents [7]) whose evolution can be modeled as above.

Fig. 1. An agent’s remaining time of computations $E$ (in common units, left picture, maximal value $\bar{E}$) and memory requirement $M$ (in bytes, right picture, constant value $\bar{M}$), both plotted against time $t$ between $t_s$ and $t_s + \bar{E}$

In order to find appropriate state variables we divide the agents into generations according to their place in the partition history, i.e. the generation 0 consists of agents never split and the last generation (with number $G$) contains agents too small to be partitioned. We assume that within each generation all agents have the same memory requirement. Then the memory requirement becomes a strictly decreasing function of a generation $M : \{0, \dots, G\} \to \mathbb{N}$. Denote by $N^j_{k,t}$ the number of agents in the generation $k$ at the node $j$ at the time $t$ and by $E^j_{k,t}$ the total remaining time (cf. (1)) of all agents in the generation $k$ at the node $j$ at the time $t$. Using these notations we define the state variable
$$W^j_t = \left( E^j_{0,t}, \dots, E^j_{G,t}, N^j_{0,t}, \dots, N^j_{G,t} \right)^T$$
($W^T$ denotes the transposed matrix for $W$). Note that both $E^j_{total} = \sum_{k=0}^{G} E^j_k$ and $M^j_{total} = \sum_{k=0}^{G} M(k) N^j_k$ are observables of our system, however they are not state variables. Next let us consider the evolution of $W_t$. To this end let us assume that during the partition an agent is split into two equal children in such a way that both
get a half of the parent’s remaining time of computations. Then the equations of the partition for one agent belonging to the generation $k$ ($k < G$) have the form
$$\begin{aligned} E^j_{k,t+1} &= E^j_{k,t} - e, & N^j_{k,t+1} &= N^j_{k,t} - 1, \\ E^j_{k+1,t+1} &= E^j_{k+1,t} + e, & N^j_{k+1,t+1} &= N^j_{k+1,t} + 2, \end{aligned}$$
where $e$ is the splitting agent’s remaining time. Thus the partition matrix has the following form
$$P = \begin{pmatrix} B & 0 \\ 0 & 2B \end{pmatrix}$$
where $B$ is the matrix which has 1's right under the diagonal and 0's elsewhere. It means that the system evolution follows (7). Similarly it is easy to see that (8) describes the dynamics of migration in this situation as well. Therefore let us concentrate on the 'established' case. Then the evolution equation of variables $E$ is the following
$$E^j_{k,t+1} = E^j_{k,t} - N^j_{k,t} + D^j_{k,t}$$
where $D^j_{k,t} \in \{0, \dots, N^j_{k,t}\}$ is a random number of agents delayed due to the high background load. In order to drop the dependency of the set of values of $D$ on the current state we shall rewrite it in the following way: $D^j_{k,t} = [N^j_{k,t}\, \xi^j_{k,t}]$, where $[x]$ stands for the whole part of $x$ and $\xi^j_{k,t}$ is a sequence of random variables with values in $[0, 1]$. The equations of evolution of variables $N$ are slightly more complicated, namely we have
$$N^j_{k,t+1} = \begin{cases} N^j_{k,t} - D^j_{k,t} & \text{if } E^j_{k,t} \ge 2N^j_{k,t} \\ E^j_{k,t} - N^j_{k,t} - \widetilde{D}^j_{k,t} & \text{if } N^j_{k,t} < E^j_{k,t} < 2N^j_{k,t} \\ E^j_{k,t+1} & \text{if } E^j_{k,t} = N^j_{k,t} \end{cases} \tag{13}$$
where $\widetilde{D}^j_{k,t} \in \{0, \dots, N^j_{k,t} - D^j_{k,t} - 1\}$ is another random variable. It represents the uncertainty about whether agents within a generation are at the same point of computations: the smaller the value of $\widetilde{D}^j_{k,t}$, the better balanced the generation. The equality $\widetilde{D}^j_{k,t} = 0$ means that all agents within the generation are exactly at the same point. The main reason for a possible poor balancing is agent-wanderers, which do not perform any computations and migrate over the network instead for some time, and finally end up at the original node at a very early stage of computations in comparison to agents which have stayed 'home' and worked hard. $\widetilde{D}^j_{k,t}$ enables us to consider such situations in a probabilistic way. As before we put $\widetilde{D}^j_{k,t} = [(N^j_{k,t} - D^j_{k,t})\, \zeta^j_{k,t}]$, where $\zeta^j_{k,t}$ is a sequence of random variables with values in $[0, 1)$. Thus the interpretation of the equations (13) in the case of the perfect balancing of computations within generations is the following. The first means that when there is sufficiently much work to do the number of agents is constant. The second and third mean that at the end of computations we have a number of agents and each of them is to finish its task in a time unit. To sum up, in the presented case we have obtained the state equations, which have the form (5). This shows what $F$ may look like. In general we need to consider migrations and partitions as well, ending up with equations like (9).
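The following sketch traces the special case for a single generation on a single node. It covers only the deterministic situation discussed in the interpretation of (13), i.e. no delays and perfect balancing (both random terms equal to zero); the function names and sample numbers are hypothetical.

```python
def established_step(E, N):
    """One 'established' step for one generation on one node, assuming no delays
    and perfect balancing (both random terms in the evolution equations are 0)."""
    if E < N:
        raise ValueError("state not covered by (13): E must be at least N")
    E_next = E - N              # every agent advances by one time unit
    if E >= 2 * N:
        N_next = N              # enough work left: the number of agents is constant
    elif E > N:
        N_next = E - N          # some agents finish within this time unit
    else:
        N_next = E_next         # E == N: all agents finish, so both become 0
    return E_next, N_next


def partition_one(E, N, k, e):
    """Split one agent of generation k with remaining time e into two children
    of generation k + 1 (E and N are lists indexed by generation)."""
    E, N = list(E), list(N)
    E[k] -= e
    N[k] -= 1
    E[k + 1] += e
    N[k + 1] += 2
    return E, N


state = (10, 4)
trajectory = [state]
while state != (0, 0):
    state = established_step(*state)
    trajectory.append(state)
print(trajectory)
print(partition_one([8, 0], [1, 0], k=0, e=8))
```

Starting from total remaining time 10 shared by 4 agents, the generation keeps 4 agents for one step, shrinks to 2, and then empties, which matches the reading that agents near the end of computations finish within one time unit.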
6 Optimal Scheduling Problem
Now let us recall the definition of the optimal scheduling for a computing MAS. We shall follow the steps of [3] and put our problem into the stochastic optimal control framework. The general form of the cost functional for controlled Markov chains of our type is (cf. [10])
$$V(\pi; s) = E\left[ \sum_{t=0}^{\infty} k(W_t, u_t(W_t)) \right] \tag{14}$$
where $\pi$ is a control strategy (10) and $s$ is the initial state of $W_t$, i.e. $W_0 = s$. Since 0 is an absorbing state we shall always assume that remaining at 0 has no cost, i.e. $k(0, \cdot) = 0$. This condition guarantees that the overall cost can be finite. Let us define the following set of admissible control strategies
$$\mathcal{U} = \{ \pi : u_t(W) \in U_W,\ t \in \mathbb{N} \}.$$
Now we are in a position to formulate the optimal scheduling problem. Namely, given an initial configuration $\widehat{W}$, we look for a control strategy $\pi^* \in \mathcal{U}$ such that
$$V(\pi^*; \widehat{W}) = \min \left\{ V(\pi; \widehat{W}) : \pi \in \mathcal{U},\ W_t \text{ is a solution of (9)} \right\}. \tag{15}$$
Consequently, an optimal scheduling for $W_t$ is a control strategy $\pi^*$ realizing the minimum in (15). Changing the cost functional we obtain various criteria for the optimality of the scheduling. Let us present here some cost functionals of type (14) which are appropriate for multi-agent computations. The first one is the expected total time of computations
$$V_T(\pi; s) = E\left[ \inf\{ t \ge 0 : W_t = 0 \} - 1 \right]. \tag{16}$$
In other words, in this case a scheduling is optimal if it is expected to finish the computations in the shortest time. The second functional promotes good mean load balancing over time. It has the following form
$$V_L(\pi; s) = E\left[ \sum_{t=0}^{\infty} \sum_{i=1}^{N} (L^i_t - \bar{L}_t)^2 \right], \quad \bar{L}_t = \frac{1}{N} \sum_{i=1}^{N} L^i_t \tag{17}$$
where $L^i_t$ is the load concentration (1) at the $i$-th VCN at the time $t$. These quantities are well defined because we have assumed that $E^i_{total}$ and $M^i_{total}$ are observables of our system. Neither of the above examples contains an explicit dependency on the control. Generalizing them a little allows us to penalize migrations. Namely, take $\varphi : S \to \mathbb{R}_+$, $a \ge 0$ and $\mu^{ij}_m : [0, 1] \to \mathbb{R}_+$ nondecreasing and such that $\mu^{ij}_m(0) = 0$, and put
$$V_M(\pi; s) = E\left[ \sum_{t=0}^{\infty} \left( \varphi(W_t) + a \sum_{m=1}^{M} \sum_{i \neq j} \mu^{ij}_m\big(u^{ij}_{m,t}(W_t)\big) \right) \right]. \tag{18}$$
In the above expression $\varphi$ can have the form of the term under the expected value in (16) or (17), $\mu^{ij}_m$ allows us to penalize the distance between the $i$-th and $j$-th VCN, and $a$ is a tuning factor (the greater the value of $a$, the greater the influence of the 'migration' term).
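The quantities under the expectations in (16) and (17) can be evaluated on a single realized trajectory as sketched below. The function names and the sample trajectories are hypothetical; estimating the expectations themselves would require averaging over many simulated trajectories, which is not shown here.

```python
def hitting_time(trajectory):
    """inf{t >= 0 : W_t = 0} for one realized trajectory, or None if 0 is never reached.
    The functional (16) subtracts 1 from this value before taking the expectation."""
    for t, state in enumerate(trajectory):
        if all(x == 0 for node in state for x in node):
            return t
    return None


def load_imbalance(load_trajectory):
    """Sum over t and i of (L_t^i - mean of L_t)^2 for one realized trajectory of loads,
    i.e. the quantity under the expectation in (17)."""
    total = 0.0
    for loads in load_trajectory:
        mean = sum(loads) / len(loads)
        total += sum((load - mean) ** 2 for load in loads)
    return total


# Hypothetical trajectory: two nodes, state components (E, N) per node.
states = [[[4, 2], [0, 0]], [[2, 1], [2, 1]], [[0, 0], [0, 0]]]
loads = [[4.0, 0.0], [2.0, 2.0], [0.0, 0.0]]
print(hitting_time(states), load_imbalance(loads))
```

In this example the only imbalance occurs at $t = 0$; after the migration the loads are equal and contribute nothing to the sum.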
7 Existence and Characterization of Optimal Strategies
Let us first consider the existence of solutions for problem (15). To this end let us denote by $R(u) = [p_{ij}(u(s_i))]_{i,j=1,\dots,K}$ the 'probably not absorbing' part of the transition matrix for a control $u$. The following proposition is the main existence result.

Proposition 1. Assume that (R1) $R(u)^K$ is a contraction for every $u$ such that $u(s) \in U_s$, or (R2) $R(u)^n$ is a contraction for some $n \ge 1$ and $u$ as above, but additionally $k(s_j, \alpha) \ge \varepsilon > 0$ for $j \neq 0$, $\alpha \in U_{s_j}$. Then there exists the unique optimal solution of (15).

Proof. It is sufficient to notice that $U_s$ are finite so they are compact and $k(s_j, \cdot)$ is obviously continuous. It means that the assumptions (A1)–(A3) from [10, Chap. 4] hold. Thus the thesis is a straightforward consequence of [10, Theorem 4.2].

Corollary 1. Consider our example cost functionals $V_T$, $V_L$ and $V_M$. 1. Problem (15) for $V_T$ has the unique solution provided (R2) holds. 2. Problem (15) for $V_L$ has the unique solution provided (R1) holds. 3. Existence of the solution for (15) for $V_M$ depends on the assumption on $\varphi$. If the latter is separated from 0 we need (R2), otherwise (R1).

Now we shall present some Bellman-type optimality conditions for (15), which are another consequence of [10, Theorem 4.2] and its proof. We shall formulate them in the following proposition.

Proposition 2. Assume (R1) or (R2). Then the optimal solution of (15) is a stationary strategy $\pi^* = u^\infty = (u, u, \dots)$ and it is the unique solution of the equation
$$V(\pi^*; s_i) = \min_{\alpha \in U_{s_i}} \left[ \sum_{j=1}^{K} p_{ij}(\alpha)\, V(\pi^*; s_j) + k(s_i, \alpha) \right], \quad i = 1, \dots, K. \tag{19}$$
The solution of (19) exists and is the optimal solution of (15).

The simple but important consequence of this proposition is that in order to find the optimal scheduling we need to consider only stationary strategies. Note that diffusion strategies defined in [8,4] are stationary, thus we can try to verify their optimality (or quasi-optimality) by means of a variant of the equation (19).
On the other hand this equation also allows us to compute the optimal strategy by means of some iterative procedures like Gauss-Seidel [10].
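As an illustration of how equation (19) could be solved iteratively, the following Python sketch applies Gauss-Seidel sweeps to a small Bellman equation of that form. It is a minimal sketch only: the function name, the nested-list layout of the transition probabilities `p` and costs `k`, and the toy data are assumptions made for this example and do not come from the paper or from [10].

```python
def gauss_seidel_bellman(p, k, tol=1e-10, max_sweeps=10_000):
    """Iterate V[i] = min_a ( sum_j p[i][a][j] * V[j] + k[i][a] ).

    p[i][a][j]: probability of moving from non-absorbing state i to
                non-absorbing state j under control a (a row may sum to
                less than 1; the remainder is the absorption probability).
    k[i][a]:    one-step cost of control a in state i.
    Both layouts are hypothetical choices for this sketch.
    """
    n = len(p)
    values = [0.0] * n
    policy = [0] * n
    for _ in range(max_sweeps):
        delta = 0.0
        for i in range(n):
            # Evaluate every admissible control in state i.
            best_a, best = min(
                ((a, k[i][a] + sum(p[i][a][j] * values[j] for j in range(n)))
                 for a in range(len(k[i]))),
                key=lambda pair: pair[1],
            )
            delta = max(delta, abs(best - values[i]))
            # Gauss-Seidel: update in place, so later states in the same
            # sweep already use the new value.
            values[i] = best
            policy[i] = best_a
        if delta < tol:
            break
    return values, policy


# Toy problem with two non-absorbing states (made-up numbers).
p = [
    [[0.0, 0.5], [0.0, 0.0]],  # state 0: control 0 -> state 1 w.p. 0.5, control 1 -> absorbed
    [[0.0, 0.0]],              # state 1: single control, absorbed at once
]
k = [
    [1.0, 3.0],
    [2.0],
]
values, policy = gauss_seidel_bellman(p, k)
print(values, policy)  # [2.0, 2.0] [0, 0]
```

The returned `policy` is a stationary control, in line with Proposition 2; convergence of the sweeps relies on a contraction assumption of the kind stated in (R1) or (R2).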
8
Conclusions
The presented MAS architecture, accompanied by diffusion-based agent scheduling strategies, makes a convenient and efficient environment for large-scale distributed computations. The presented mathematical model based on stochastic optimal control theory provides us with a new precise definition of optimal scheduling in such an environment. It also enables us to obtain some results on the existence of optimal scheduling strategies (Proposition 1 and Corollary 1) as well as the optimality conditions (Proposition 2). These results show in particular that any optimal scheduling must belong to the class of stationary strategies, which have been utilized during tests [8,9]. In this paper the formal model has been extended to allow a more general system state. This in turn has enabled us to provide a more detailed description of the generalized model in an important special case (Sec. 5), which forms the basis for experiments designed to verify the model. Such experiments are being undertaken, with results soon to come.
References
1. Luque, E., Ripoll, A., Cortés, A., Margalef, T.: A distributed diffusion method for dynamic load balancing on parallel computers. In: Proceedings of EUROMICRO Workshop on Parallel and Distributed Processing, San Remo, Italy, pp. 43–50. IEEE Computer Society Press, Los Alamitos (1995)
2. Uhruski, P., Grochowski, M., Schaefer, R.: Multi-agent computing system in a heterogeneous network. In: Proceedings of the International Conference on Parallel Computing in Electrical Engineering (PARELEC 2002), Warsaw, Poland, pp. 233–238. IEEE Computer Society Press, Los Alamitos (2002)
3. Smolka, M., Grochowski, M., Uhruski, P., Schaefer, R.: The dynamics of computing agent systems. In: Sunderam, V.S., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2005. LNCS, vol. 3516, pp. 727–734. Springer, Heidelberg (2005)
4. Grochowski, M., Smolka, M., Schaefer, R.: Architectural principles and scheduling strategies for computing agent systems. Fundamenta Informaticae 71(1), 15–26 (2006)
5. Smolka, M.: Optimal scheduling problem for computing agent systems. Inteligencia Artificial 9(28), 101–106 (2005)
6. Smolka, M., Schaefer, R.: Computing MAS dynamics considering the background load. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3993, pp. 799–806. Springer, Heidelberg (2006)
7. Grochowski, M., Schaefer, R., Uhruski, P.: An agent-based approach to a hard computing system - Smart Solid. In: Proceedings of the International Conference on Parallel Computing in Electrical Engineering (PARELEC 2002), Warsaw, Poland, pp. 253–258. IEEE Computer Society Press, Los Alamitos (2002)
8. Grochowski, M., Schaefer, R., Uhruski, P.: Diffusion based scheduling in the agent-oriented computing systems. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds.) PPAM 2004. LNCS, vol. 3019, pp. 97–104. Springer, Heidelberg (2004)
9. Momot, J., Kosacki, K., Grochowski, M., Uhruski, P., Schaefer, R.: Multi-agent system for irregular parallel genetic computations. In: Bubak, M., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2004. LNCS, vol. 3038, pp. 623–630. Springer, Heidelberg (2004)
10. Kushner, H.: Introduction to Stochastic Control. Holt, Rinehart and Winston (1971)
An Approach to Distributed Fault Injection Experiments
Janusz Sosnowski, Andrzej Tymoczko, and Piotr Gawkowski
Institute of Computer Science, Warsaw University of Technology, ul. Nowowiejska 15/19, Warsaw 00-665, Poland
[email protected]
Abstract. The software implemented fault injection technique is gaining much interest in evaluating system dependability. For complex software applications fault injection experiments take a lot of time. In this paper we present an innovative approach to fault injection in which the experiments are performed in a LAN-distributed environment. The paper presents the system architecture, task scheduling and the analysis of its effectiveness. The presented considerations are illustrated with practical results.
Keywords: distributed systems, processor intensive workloads, task scheduling, load balancing, performance evaluation.
1
Introduction
Fault injection experiments are widely used to check system dependability (susceptibility to faults) [1,2,3]. We have developed fault injectors which simulate faults in a real IBM PC system by disturbing the states of processor registers and memory cells during the execution of the analysed program [4,5]. In the performed experiments we automatically generate many faults and analyse their effects (e.g. incorrect results, system exceptions, time-outs, and correct results). For every injected fault (test) the program has to be executed, so experiments usually need a long simulation time. To speed up the experiments we have developed the DInjector system, which manages and distributes fault injection tasks in a LAN environment. In the literature only single-node fault injectors were described, so we are pioneers in distributed fault injection. While developing DInjector we assumed that many users can perform experiments in parallel with the normal workload in the LAN. From the variety of parallel computation models (e.g. [6,9,11]), farm processing is relatively close to the specificity of fault injection experiments. Nevertheless, its adaptation to a heterogeneous environment, multi-access and performance enhancements were challenging tasks. This required some studies on task decomposition, load balancing in nodes, task scheduling, etc. In the literature these issues are usually analysed separately and under abstract assumptions (e.g. [8,9,10,12]). In our project we consider them in a comprehensive way, taking into account some specific problems (neglected in the literature), in particular: automatic adaptation to system configuration changes, and dealing with various erroneous situations caused by injected faults as well as faults generated by other LAN users. Another challenge was the verification of the developed system, its effectiveness and performance [7]. For this purpose we have embedded some original monitoring modules. To sum up, the paper presents a case study of a sophisticated system which integrates various technologies and is used in practice. In section 2 we give an outline of the system architecture. Schemes of distributing simulation tasks and experiment management are presented in section 3. Section 4 describes the methodology of evaluating system effectiveness and performance (illustrated with practical results).
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 361–370, 2008. © Springer-Verlag Berlin Heidelberg 2008
2
System Architecture
Fault injection experiments consist of several tasks:
– ME - extraction of program modules for the analysis,
– GR - collecting reference information (golden run log, GRL) from an undisturbed application execution,
– MS - selection of fault injection moments,
– EX - fault injection tests.
The user specifies the experiment configuration, the number, location and classes of injected faults, etc. [4,5]. Based on the specifications delivered by the users, the DInjector system creates and distributes various tasks for the computer nodes, monitors their execution and collects the results. These processes are realised by the following functional objects:
– User interface - assures user interaction with the system; implemented in the JSP (JavaServer Pages) technology running on a Tomcat HTTP server.
– Database - collects the information provided by the user and the DInjector system. It co-operates with a data warehouse (based on MS SQL) comprising aggregation and data mining capabilities [6,10,11].
– DICoordinator - responsible for the distribution and management of computational tasks assigned to a subnetwork of homogeneous workstations. Coordinators are implemented in C++ (system services in Win32 or daemons in the Linux environment). Communication channels use the CORBA technology, which simplifies many low-level aspects.
– Fault injection cores (FICs) - services or daemons (implemented in C++) realising all computational tasks (ME, GR, etc.) in the workstations (managed by the corresponding instance of the DICoordinator).
All tasks of the performed experiments are written into the DInjector database. The sequence of tasks corresponds to the experiment model and is guaranteed by the user interface (WWW application) and database transactions. The scheduling of the tasks (section 3) takes into account the assigned user credits and priority as well as node loading.
All system components (databases, coordinators and simulation services) are independent and can work on different computers as well as on the same workstation. This allows scaling the system and using different workstation subnetworks as parts of DInjector. There is no limit on the number of coordinators in the system, so many subnetworks (Linux or Win32 based) can be connected through coordinators to serve the system. The subsystem composed of a DICoordinator and its related processing nodes (DINodes) is called a DICluster. Different DIClusters cooperate with the same database. The DICoordinator is composed of three co-working but independent parts (they can be spread within one subnetwork of homogeneous workstations):
– Distribution coordinator (DIManager) selects tasks, assigns them to nodes, and manages and monitors their execution at the workstations. Moreover, it controls resource and merging servers (defined below) and assigns nodes to them,
– Resource server (DIRServer) is responsible for the distribution of needed files (e.g. GRL, application code and environment) across the network of processing nodes (DINodes),
– Merging coordinator (DIMerger) collects results from DINodes, writes them into the database and signals the termination of tasks to the DIManager.
These three components are compiled within one application (they can be switched on or off). DIManager comprises two subcomponents: one used for the preliminary phases (ME, GR, MS) and the second one responsible for the testing phase (EX). This functional decomposition assures good system scalability and eliminates possible bottlenecks of the system (e.g. by installing multiple resource servers within a network). An increased number of processing nodes results in a higher speed of results generation. In a single instantiation of the system we can create several DIMergers operating on different machines and writing results in parallel to several databases. Fault injection subsystems can use the same resource server (or several of them). One DIManager should be associated with at least one DIMerger and DIRServer. One DIMerger is associated with only one DIManager. One DIRServer can be associated with many DIManagers. Hence, two DICoordinators can share one DIRServer. Each DIManager has its own group of associated processing nodes. In the case of homogeneous fault injectors one DIManager can control the preliminary phases and the second one the test phase (EX). The preliminary phases of the experiment, i.e. ME, GR and MS, are executed by simple assignment of each task of this type to a single processing node.
In the case of tests (phase EX) they are partitioned into subsets which can be assigned to different nodes. Each subset of tests is denoted as task TS, which is composed of so-called parcels (subtask PC). A parcel may comprise one or more tests. Each correct execution of a task (ME, GR, MS or TS) is signalled by the fault injection core (FIC in DINode) to DIMerger. The node signals its readiness for further processing to DIManager, and in the meantime DIMerger initiates writing the results to the database. The termination of this process is signalled to DIManager. System operation is described in section 3.
3
System Operation and Task Scheduling
In the DInjector system we have two levels of task identification. The higher level relates to tasks defined by the user (ME, GR, MS, EX, PC). At the lower level we have so-called monitored tasks, which are operational units of the DICluster. These tasks are correlated with the appropriate resource and the operation to be executed. They control the allocation of the resources as well as their deallocation in the case of an unsuccessful operation. Within the DIManager we have the task monitor, which checks the time-out for each initiated task. When a time-out expires, the allocated resources are restored. Each monitored task handles the following events: correct task termination, an error during task execution (task cancelling) and task time-out expiration. The most important monitored tasks are the allocation task and the task status monitoring task. The allocation task relates to a processing node and the tasks performed by this node. Termination and closing of a task moves the node from the set of busy nodes to the set of free nodes. A time-out results in deleting the node from the set of registered nodes. The identifiers of monitored tasks are assigned by DIManager.

The experiment is distributed through the coordinators by dividing the set of all tests. A single test requires the specification of the fault (location and type) and its injection time instant (fault triggering). The Cartesian product of these two spaces defines the test space for the experiment. To improve the processing efficiency, tests are grouped into parcels. The parcel is a unit treated in a transactional manner (if all tests in the parcel are executed, the whole parcel is committed). The coordinator locks a set of parcels in the database (for a specified period of time) and sends it to an available free node (within its network). The results of the tests are successively sent back to the merging coordinator, which collects single test results and writes the results of the whole parcel into the database, committing the transaction. The node keeps the application environment for further experiments as long as there are more tasks of the analysed application. Such caching minimises the communication overhead while setting up the node for a task. Every simulation node registers itself with the local coordinator and signals its readiness for serving.
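The bookkeeping of registered, free and busy nodes described above can be pictured with a short Python sketch. It is only an illustration under our own assumptions: the class `NodeRegistry` and its method names are hypothetical and do not reflect the actual C++ implementation of DIManager.

```python
class NodeRegistry:
    """Hypothetical bookkeeping of registered, free and busy nodes."""

    def __init__(self):
        self.free = set()   # nodes ready for a new task
        self.busy = {}      # node id -> identifier of its monitored task

    @property
    def registered(self):
        # Every registered node is either free or busy.
        return self.free | set(self.busy)

    def register(self, node):
        """A node registers itself and signals readiness for serving."""
        if node not in self.busy:
            self.free.add(node)

    def allocate(self, task_id):
        """Start an allocation task on some free node; return the node or None."""
        if not self.free:
            return None
        node = self.free.pop()
        self.busy[node] = task_id
        return node

    def on_terminated(self, node):
        """Task termination or cancelling: busy node goes back to the free set."""
        if node in self.busy:
            del self.busy[node]
            self.free.add(node)

    def on_timeout(self, node):
        """Time-out: the node is deleted from the registered nodes.

        Returns the identifier of the interrupted task (or None) so that
        the caller can reschedule it on another node.
        """
        self.free.discard(node)
        return self.busy.pop(node, None)


registry = NodeRegistry()
registry.register("node-a")
registry.register("node-b")
worker = registry.allocate(task_id=1)
registry.on_terminated(worker)          # both nodes are free again
late = registry.allocate(task_id=2)
lost_task = registry.on_timeout(late)   # node dropped, task 2 can be rescheduled
```

Keeping the interrupted task identifier as the return value of `on_timeout` is a design choice of this sketch; it mirrors the idea that tasks of unavailable nodes become available for execution by other nodes.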
The whole system sets itself up automatically: nodes register with coordinators in their networks, and coordinators connect to the system database. Two kinds of time-outs are defined, long and short. The first one is applied to task parcels at the level of coordinators. If the result is not committed after the long period of time, the task parcel assignment is cancelled and restarted. Short time-outs are used by the management coordinator to monitor the task activity at the node level. In the case of a single task time-out at the node level, the node is forced to cancel its execution. Similarly, in the case of a node or coordinator failure, tasks are rescheduled to other available nodes in the system. In the case of a short break of the communication between the database and coordinators, or between coordinators and nodes, all the tasks assigned to unavailable stations will be unlocked after the time-outs and become available for execution by other nodes. The workstations can be turned on or off independently (dynamic configuration).

Tests (EX task) are executed by distributing parcels (with test specifications) to the nodes. All tests related to the analysed application are defined by triggering moments and fault locations. A single test is defined by a pair consisting of a triggering moment and a fault location. For such a specification the fault is simulated in the node by generating a fault mask (e.g. at random) and then disturbing an appropriate bit (or bits) in accordance with the mask specification and fault type (e.g. bit flip) [5]. In an experiment we usually simulate thousands or millions of faults. So, an important issue is an appropriate coding of test specifications for the whole space of tests defined by the generated fault triggering moments (e.g. chosen in a pseudorandom way in the dynamic image of the application code) and the specified locations. Hence, DIManager holds both lists of specifications. Then test parcels are created and numbered according to the following principle: each parcel is assumed to comprise the same number of tests (NTP), and all tests are numbered in such a way that we take the first triggering moment and define subsequent tests by taking subsequent locations from the location list. Then we take the second triggering moment and assign subsequent tests for all the locations, etc. Parcel 0 comprises the first NTP tests from the list, parcel 1 the second portion of NTP tests, etc. DIManager sends to the nodes only the set of the numbers of test parcels, the set of related fault triggering moments, the fault locations and the size of the parcel. The fault triggering moments are assigned to the appropriate code addresses during the selection phase (MS task). The list of locations comprises pairs of a location and its cardinality, where the location cardinality defines the number of requested fault injections. Such a compressed list of locations is expanded in the node to specify individual fault injections. Hence the total test specification data sent in the network of nodes is of the complexity O(M + N*L), where M is the number of all triggering moments, N is the number of used processing nodes and L is the number of all considered different locations (without repetitions).

We have admitted simultaneous access to the system by many users. The preliminary tasks (ME, GR, MS) are selected by the related executor in two phases: first we select the user, then the user's tasks. Selection of the user will be described later on (fair access is assured).
User tasks are selected in the sequence ME, GR and MS; this is a natural sequence which assures the fastest interaction with the user. The preliminary phase executor performs the following operations in a loop: reservation of free nodes, selection of tasks for execution, assignment of the tasks to the nodes (unused nodes are returned as free nodes), and task delegation to the nodes. In the experiment phase the appropriate executor first selects the user from the set of users with experiments ready for execution. For the selected user we select the experiment with the highest priority, and for this experiment test parcels are defined and distributed to the nodes. Let us assume that we have n parcels waiting for execution and m idle nodes ready for operation, while the other registered nodes (N - m) are used for other tasks. The executor will try to assign a unique parcel to each free node if the number of idle nodes is sufficient. If the number of free nodes is lower than n, then we assign n*m/N parcels to each free node.

User scheduling is based on the pre-assigned usage share (quota), e.g. for 3 users we can have usage scores 50, 20 and 10 (total 80), which results in the system usage percentages 50/80 = 62.5%, 20/80 = 25% and 10/80 = 12.5%. The real usage share may fluctuate to some extent. For example, after the execution of some tasks the real execution times for the tasks of the considered users may be 200 s, 150 s and 50 s. This leads to the real usage 50%, 37.5% and 12.5%, which is different from the targeted one. The first user has some deficiency, the second one some surplus (excess) in the usage as compared with the targeted one. The scheduling algorithm used converges so as to assure a real usage close to the pre-assigned target. For this purpose, in each new task assignment cycle we take into account the parameter: scaled real usage = (usage score)*(real usage percentage)/(targeted usage percentage). So for the considered example we get 50*50%/62.5% = 40, 20*37.5%/25% = 30 and 10*12.5%/12.5% = 10. The total scaled real usage is 40 + 30 + 10 = 80 (the targeted total usage is also 80).

In the initial phase of the scheduling algorithm we have the so-called initial working list comprising user ids and the corresponding assigned target usage scores. In fig. 1 we have 6 users with assigned scores: 40, 20, 20, 14, 5, and 1. In the N-th cycle of user selection we take into account the targeted usage score (U), the cumulated (from the beginning) processing time (M) of the specified user and the scaled real usage score (R). If there is no entry in a row, it means that the related user has no tasks ready for execution. Between subsequent cycles we give the additional processing time of the user (Δ). This value is added to the processing time (M) in the N-th cycle and gives the usage time in cycle N+1. Users 4 and 6 finished their tasks and are not included in cycle N+1. A new task of user 6 appears before cycle N+2. In italic we specify the parameters of users that were absent in the previous cycle (lack of ready tasks). Here we artificially assume a value of M that assures R = U (lack of history). This is denoted with * (user 6, cycle N+2). In cycle N all users have tasks for execution. The users 2, 3, 4, 5 and 6 have no usage surplus (i.e. U ≥ R, bolded figures), so in this order they are included in the newly created list. In cycle N+1 only users 1, 2, 3 and 5 have tasks for execution. Among them 3 and 5 are selected due to usage deficiency. In cycle N+2 user 6 reappears. In this cycle users 1, 2 and 6 will be selected. In cycle N+3 users 1 and 5 will be selected. To sum up, we have obtained the following lists of selected users in subsequent cycles: {2,3,4,5,6}, {3,5}, {1,2,6}, {1,5}.

Users         cycle N                 cycle N+1               cycle N+2               cycle N+3
Id     U      M       R       Δ       M       R       Δ       M       R       Δ       M       R
1      40     440s    44.0            440s    40.7            440s    38.2    +20s    460s    37.6
2      20     200s    20.0    +30s    230s    21.3            230s    20.0    +20s    250s    20.5
3      20     190s    19.0    +20s    210s    19.4    +40s    250s    21.7            250s    20.5
4      14     130s    13.0    +5s
5      5      30s     3.0     +10s    40s     3.7     +20s    60s     5.2             60s     4.9
6      1      10s     1.0     +7s                             11s     1.0*    +20s    31s     2.5
Total  100    1000s   100.0           920s    85.0            991s    86.0            1051s   86.0

Fig. 1. Illustration of the task scheduling algorithm

The effectiveness of the algorithm was verified in simulation experiments. For tasks with a uniform random distribution of execution times from the range <50s, 150s> the usage scores converged to the targeted ones after 20-30 scheduling cycles.
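The user selection rule of this section can be checked numerically with the following Python sketch. It is an illustration only, with hypothetical function names. It uses the fact that the scaled real usage (usage score)*(real usage percentage)/(targeted usage percentage) simplifies algebraically to (total score)*(M of the user)/(total M), where the totals run over the users that currently have ready tasks.

```python
def scaled_real_usage(scores, times):
    """Scaled real usage R for every user with ready tasks.

    scores: user -> targeted usage score U
    times:  user -> cumulated processing time M (same users as in scores)
    """
    total_score = sum(scores.values())
    total_time = sum(times.values())
    return {u: total_score * times[u] / total_time for u in scores}


def select_users(scores, times):
    """Users without usage surplus, i.e. with U >= R.

    The comparison is cross-multiplied so that integer inputs are
    compared exactly, without floating point rounding.
    """
    total_score = sum(scores.values())
    total_time = sum(times.values())
    return [u for u in scores if scores[u] * total_time >= total_score * times[u]]


# Three-user example from the text: scores 50, 20, 10 and times 200 s, 150 s, 50 s.
print(scaled_real_usage({"A": 50, "B": 20, "C": 10}, {"A": 200, "B": 150, "C": 50}))
# {'A': 40.0, 'B': 30.0, 'C': 10.0}

# Cycle N of fig. 1: all six users have ready tasks.
cycle_n = select_users(
    {1: 40, 2: 20, 3: 20, 4: 14, 5: 5, 6: 1},
    {1: 440, 2: 200, 3: 190, 4: 130, 5: 30, 6: 10},
)
print(cycle_n)  # [2, 3, 4, 5, 6]
```

Feeding the remaining columns of fig. 1 into `select_users` yields the lists {3,5}, {1,2,6} and {1,5} given in the text for cycles N+1, N+2 and N+3.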
4
Testing and Evaluating System Performance
The developed system has been systematically tested in a LAN environment covering several laboratories comprising various workstations. To simplify and automate the configuration of nodes in the network we have developed the DIDeployer and DIDeployManager applications. DIDeployer has to be installed once on every workstation intended to be used in experiments. The system installer accesses DIDeployManager and selects the appropriate text file comprising the list of used node addresses. This activates communication with the DIDeployers in the specified nodes. Then the installer selects the appropriate fault injection core (FIC) and initiates broadcasting it to the nodes. DIDeployer deinstalls the previous version of the FIC, creates new file structures, and installs and initiates the new FIC version. DIDeployManager also provides other services such as reading log files from the nodes, interrupting node operation, and sending a new configuration file. Testing the whole system needed some supplementary tools collecting operational logs in every node. In the developed logging system we can specify the required level of data collection (more or less detailed). Data is stored by invoking the logging function from the application components. For the highest accuracy of logging we have observed 1 GB of data generated by the DICoordinator logging subsystem per 24 hours (for 20 operating nodes). Each node (FIC application) generated 20-30 MB of logs per 24 hours. The high accuracy logging was needed in the first phases of the system testing. Later we selected rough logging. The lowest level of logging is restricted to error messages, decisions on selecting tasks for execution and the moments of their termination. In this case the log volume increase for DICoordinator is about several MB per 24 hours. Higher levels of logging include issued SQL commands, images of task queues and many other specific events. By analysing events in the log we can identify incorrect system operation or prove its correctness.
The system has been verified for many test scenarios involving various applications for fault injection and several users. For the executed tests we verified not only the final task results but also analysed the system operation and its performance. During the experiment we can monitor its progress, in particular the number of executed parcels, the number of processed parcels, the number of incorrectly executed parcels, the distribution of executed tasks or parcels within nodes, task queues, etc. Moreover, we can visualise the times of registering tasks in coordinators, the time of initiating task execution, the time to the execution deadline, the time of staying in specified system states, the execution time of messages, etc. The rich monitoring capabilities allowed us to eliminate system bottlenecks and optimise its performance. In the first phase of experiments many parcels were denoted as incorrect or many time-outs were generated. The first situation related mostly to existing bugs in the fault injector (systematically revealed and corrected). Time-outs to a large extent appeared during student laboratories and mostly resulted from manual computer resetting. In the case of a clean system shutdown in the node the operating system signalled this fact to the application. The application signalled this to DIManager, which delegated the remaining tasks to other nodes (instead of trying to repeat them in the reset node).

The performed experiments proved the functionality of the system and its effectiveness. We illustrate this in tab. 1, which gives the experiment execution time of three applications (Qsort, LZW compression and HTML conversion). Parameter N specifies the number of computer nodes used for fault injection. All nodes were used only for the simulation experiment (not loaded with other applications). The results show nearly linear scalability; this effect was also observed for other applications. Some time fluctuation is related to system background processes, hence for LZW we obtained a speed-up (S) higher than 10 for 10 nodes. It is worth noting that the distribution of test parcels processed by the nodes fluctuates due to some differences in the execution time of the disturbed application, generated time-outs and automatic load balancing of nodes by the system. Tab. 2 shows some statistics for HTML conversion: the mean number of processed parcels (M) in nodes, the standard deviation (D), as well as the difference between the maximal and minimal numbers (Δ). The presented results relate to 33%, 60% and 100% test progress, respectively (100% means execution of all planned tests).

Table 1. Experiment (phase EX) execution time (in hours and minutes) and relative speed-up (S)

N     Qsort - sorting            LZW compression         HTML conversion
1     30 h 15 min (S = 1)        4 h 36 min (S = 1)      112 h 05 min (S = 1)
10    3 h 05 min (S = 9.8)       27 min (S = 10.2)       11 h 10 min (S = 9.96)
28    1 h 05 min (S = 27.9)      10.5 min (S = 26.2)     4 h 03 min (S = 27.8)
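As a worked check of the speed-up column, the relative speed-up can be computed as the single-node execution time divided by the N-node execution time. The Python sketch below does this for the Qsort column of Table 1; the helper names are ours, and the parallel efficiency S/N is an additional quantity that is not reported in the paper.

```python
def to_minutes(hours, minutes):
    """Convert an 'h min' entry of Table 1 to minutes."""
    return 60 * hours + minutes


def speedup(t_single, t_parallel):
    """Relative speed-up S = T(1) / T(N)."""
    return t_single / t_parallel


def efficiency(t_single, t_parallel, nodes):
    """Parallel efficiency S / N (not reported in the paper)."""
    return speedup(t_single, t_parallel) / nodes


qsort_1 = to_minutes(30, 15)    # 1815 min on a single node
qsort_10 = to_minutes(3, 5)     # 185 min on 10 nodes
qsort_28 = to_minutes(1, 5)     # 65 min on 28 nodes

print(round(speedup(qsort_1, qsort_10), 1))         # 9.8
print(round(speedup(qsort_1, qsort_28), 1))         # 27.9
print(round(efficiency(qsort_1, qsort_28, 28), 3))  # 0.997
```

Values recomputed this way from the rounded times in the table may differ slightly from the printed S for other columns, since the table entries are themselves rounded.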
The impact of the assumed time-out limit (To) on the experiment execution time (TE) is not negligible. This is illustrated in tab. 2 for the HTML application (columns To and TE relate to complete tests - 100%). Here we show the experiment execution time TE (in minutes) as a function of the assumed time-out (To) in the nodes (for 28 nodes). We give the execution time for all tests and the execution time related to tests which generated time-outs (in brackets). For the subsequent values of To, 1, 5 and 10 seconds, the time-out tests contributed 19%, 55% and 71%, respectively, to the experiment execution time (TE). This is the reason for the increase of TE. The experiments involved injections of single bit flip faults in the processor EAX register (pseudorandomly in time and space). The time-out should be selected carefully for the analysed application. Too long a time-out slows down the experiment execution. On the other hand, too short a time-out may prevent obtaining valid test results. While optimising the performance of the system we analysed the impact of the parcel size and of the distribution of system components over the network nodes. For the 100 Mb/s transmission speed in the network the optimal parcel size is 10 tests.
Table 2. Statistics of experiment progress (phase EX) for HTML conversion

Processed parcels                         Execution time
Test progress    M      D       Δ         To [s]    TE [min]
33%              100    6.22    23        1         65 (18)
60%              149    6.84    26        5         135 (89)
100%             297    8.56    39        10        234 (178)
Longer parcels may result in more losses due to unsuccessful simulation (and the need to repeat the whole parcel). To eliminate the bottleneck of the access to the database it is recommended to install DICoordinator and the MS SQL database server on separate computers. Performing tests for 3 applications, Qsort, LZW compression and HTML conversion (over 2 million tests), on 26 nodes we obtained a 17% speed-up in this configuration (as compared with a single computer hosting both DICoordinator and the database). In the optimisation process we used not only the event logging program but also a standard performance monitor, which allowed us to trace transmissions, resource bottlenecks, system errors, etc. We also analysed the impact of other jobs executed in the computing nodes. In practice, student laboratories on programming had a negligible impact on simulation speed. Using special programs simulating CPU and disc loading showed that high CPU loading results in a fair access to the CPU processing power (close to 50% for the simulation and 50% for the loading application), while disc loading (continuous copying of files) had a low impact (about 5% slow-down of the simulation processes).
5
Conclusion
The main benefit of distributing fault injection experiments is speeding up their execution and performing them in a well-controlled environment. Moreover, it creates the possibility of multi-access as well as better usage of available computer resources. The developed system adapts to unpredictable fluctuations of the available computation power. While designing the system we optimised task management, data formats and communication overhead. High system effectiveness was proved in many experiments. Fault injection experiments fit well into a distributed environment; however, their specificity required resolving some special problems. In particular, the performed tests may result in node crashes. On the other hand, running the experiments in parallel with other tasks (e.g. student laboratories) created some risks related to node reloading, restarting, operating system changes, etc. So it was necessary to embed special mechanisms dealing with these problems. This was quite a challenging problem which required extensive studies of various situations covering different simulation task schemes, node loading and interactions of other users. Hence, in the testing and validation phases we used specially developed monitoring mechanisms related to collecting various task activity reports as well as some standard performance parameters. The developed distributed manager is quite universal and we plan to use it for other fault injectors, which are being developed. Moreover, it will be directly connected to our new data warehouse to facilitate exploring test results [4,13].
References
1. Arlat, J., et al.: Comparison of physical and software implemented fault injection technique. IEEE Trans. on Computers 52(8), 1115–1133 (2003)
2. Benso, A., Prinetto, P.: Fault injection techniques and tools for embedded systems reliability evaluation. Kluwer Academic Publishers, Dordrecht (2003)
3. Gawkowski, P., Sosnowski, J., Radko, B.: Analyzing the effectiveness of fault hardening procedures. In: Proc. of the 11th IEEE Int’l On-Line Testing Symp., pp. 14–19 (2005)
4. Gawkowski, P., Sosnowski, J.: Experiences with software implemented fault injection. In: Int. Conf. on Architecture of Computing Systems, Workshop Proceedings, pp. 73–80. VDE Verlag GMBH (2007)
5. Sosnowski, J., Gawkowski, P., Lesiak, A.: Software implemented fault inserters. In: Proc. of IFAC PDS 2003 Workshop, Pergamon, pp. 293–298 (2003)
6. Tanenbaum, A.S., van Steen, M.: Distributed systems, principles and paradigms. Prentice Hall, Englewood Cliffs (2002)
7. Herruzo, E., et al.: Distributed Architecture system for computing performance testing. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 140–147. Springer, Heidelberg (2006)
8. Bonorden, O., et al.: Load balancing strategies in a web computing environment. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 839–846. Springer, Heidelberg (2006)
9. Wagner, A.S., et al.: Performance models for the processor farm paradigm. IEEE Trans. on Parallel and Distributed Systems 5(5), 475–486 (1997)
10. Xiao, L., et al.: Dynamic cluster resource allocations for jobs with known and unknown memory demands. IEEE Trans. on Parallel and Distributed Systems 13(3), 223–240 (2002)
11. Bosque, J.L., Pastor, L.: A parallel computational model for heterogeneous clusters. IEEE Trans. on Parallel and Distributed Systems 17(12), 1390–1400 (2006)
12. Dhakal, S., et al.: Dynamic load balancing in distributed systems in the presence of delays. IEEE Trans. on Parallel and Distributed Systems 18(4), 485–497 (2007)
13. Sosnowski, J., et al.: Enhancing fault injection testbench. In: Proc. of Int. Conf. on Dependability of Comp. Systems, DepCoS, pp. 76–83 (2006)
Parallel Solution of Nonlinear Parabolic Problems on Logically Rectangular Grids
Andrés Arrarás, Laura Portero, and Juan Carlos Jorge
Departamento de Ingeniería Matemática e Informática, Edificio Las Encinas, Universidad Pública de Navarra, Campus de Arrosadía s/n, 31006 Pamplona (Spain)
{andres.arraras,laura.portero,jcjorge}@unavarra.es
Abstract. This work deals with the efficient numerical solution of nonlinear transient flow problems posed on two-dimensional porous media of general geometry. We first consider a spatial semidiscretization of such problems by using a cell-centered finite difference scheme on a logically rectangular grid. The resulting nonlinear stiff initial-value problems are then integrated in time by means of a fractional step method, combined with a decomposition of the flow domain into a set of overlapping subdomains and a linearization procedure which involves suitable Taylor expansions. The proposed algorithm reduces the original problem to the solution of several linear systems per time step. Moreover, each one of such systems can be directly decomposed into a set of uncoupled linear subsystems which can be solved in parallel. A numerical example illustrates the unconditionally convergent behaviour of the method in the last section of the paper. Keywords: Domain Decomposition, Fractional Step Method, Linearly Implicit Method, Logically Rectangular Grid, Nonlinear Parabolic Problem, Support-Operator Method.
1 Introduction
Darcian water flow through non-swelling isothermal soils has been shown to obey Richards' equation [5,2]. A simplified version of this equation, together with suitable initial and boundary conditions, gives rise to nonlinear parabolic initial-boundary value problems of the following form: Find $\psi : \Omega \times [0,T] \to \mathbb{R}$ such that
$$
\begin{cases}
\dfrac{\partial \psi(x,t)}{\partial t} = \operatorname{div}(K(\psi)\,\operatorname{grad}\psi(x,t)) + g(\psi) + f(x,t), & (x,t) \in \Omega \times (0,T],\\[4pt]
\psi(x,0) = \psi_0(x), & x \in \Omega,\\[4pt]
\psi(x,t) = \psi_D(x,t), & (x,t) \in \partial\Omega \times (0,T],
\end{cases}
\tag{1}
$$
This research is partially supported by the Spanish Ministry of Science and Education under Research Project MTM2004-05221 and FPU Grant AP2003-2621 and by Government of Navarre under Research Project CTP-05/R-8.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 371–380, 2008. c Springer-Verlag Berlin Heidelberg 2008
where $\psi(x,t)$ denotes the pressure head, $K(\psi)$ is a symmetric positive-definite tensor of the form
$$
K(\psi) = \begin{pmatrix} K^{11}(\psi) & K^{12}(\psi) \\ K^{12}(\psi) & K^{22}(\psi) \end{pmatrix},
$$
which represents nonlinear hydraulic conductivity, $g(\psi)$ is a smooth nonlinear function which may describe, for instance, root water uptake in soil profiles (cf. [10]) and $f(x,t)$ is a source/sink term. The flow domain is assumed to be a bounded open set $\Omega \subseteq \mathbb{R}^2$ with boundary $\partial\Omega$, where we have considered Dirichlet boundary conditions $\psi_D(x,t)$.
This paper is devoted to the design of an efficient numerical algorithm for solving problems of type (1). The construction of such a scheme is carried out by using a two-stage discretization process (space/time), described in sections 2 and 3. The last section includes a numerical experiment which illustrates the main advantages of the proposed method.
2 Spatial Semidiscretization
The semidiscrete scheme related to (1) is obtained by using a finite difference spatial discretization based on the support-operator method. This technique, initially developed in [6] and subsequently discussed in [7], provides a methodology for constructing discrete analogs of the invariant first-order differential operators which appear in (1) (i.e. divergence and gradient). The standard support-operator method was designed for the case in which the self-adjoint operator $\operatorname{div}(K(x)\operatorname{grad}\psi)$ is linear. However, the nonlinear nature of the conductivity tensor $K(\psi)$ involved in Richards' equation makes it necessary to combine the original technique with a bivariate interpolation method. This section describes the general basis of the discretization scheme, introducing the specifics of our proposal which make it possible to solve problem (1).
Let us first discretize $\Omega$ by means of a logically rectangular grid $\Omega_h$, where $h$ denotes the spatial mesh size. The structure of such a grid is indexed as follows: if $N_x$ and $N_y$ are positive integers, then the $(i,j)$-node is given by the coordinates $(\tilde x_{i,j}, \tilde y_{i,j})$, for $i \in \{1,2,\dots,N_x\}$ and $j \in \{1,2,\dots,N_y\}$. Moreover, the quadrangle defined by the nodes $(i,j)$, $(i+1,j)$, $(i,j+1)$ and $(i+1,j+1)$ is called the $(i,j)$-cell and its center is given by the coordinates $(x_{i,j}, y_{i,j})$, which can be obtained as
$$
x_{i,j} = 0.25\,(\tilde x_{i,j} + \tilde x_{i+1,j} + \tilde x_{i,j+1} + \tilde x_{i+1,j+1}), \qquad
y_{i,j} = 0.25\,(\tilde y_{i,j} + \tilde y_{i+1,j} + \tilde y_{i,j+1} + \tilde y_{i+1,j+1}),
$$
for $i \in \{1,2,\dots,N_x-1\}$ and $j \in \{1,2,\dots,N_y-1\}$. Within this framework, the support-operator method considers cell-centered approximations for the scalar functions $\psi(x,t)$, $g(\psi)$ and $f(x,t)$, denoted by $\psi_h(t)$, $g_h(\psi_h)$ and $f_h(t)$, respectively. On the other hand, vector functions $w(x,t) \equiv (w^x(x,t), w^y(x,t))$ are nodally discretized by means of $\tilde w_h(t) \equiv (\tilde w_h^x(t), \tilde w_h^y(t))$.
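The cell-center formula above can be sketched in a few lines of Python. This is only an illustration: the function name `cell_centers` and the nested-list storage (0-based indices `[i][j]`) are assumptions made here, not part of the original scheme.

```python
def cell_centers(xn, yn):
    """Return cell-center coordinates of a logically rectangular grid.

    xn, yn: nested lists with the node coordinates, indexed [i][j]
    (0-based, so the paper's node (i, j) is xn[i-1][j-1]).
    Each center is the average of the four nodes of its cell.
    """
    nx, ny = len(xn), len(xn[0])

    def average(c, i, j):
        return 0.25 * (c[i][j] + c[i + 1][j] + c[i][j + 1] + c[i + 1][j + 1])

    xc = [[average(xn, i, j) for j in range(ny - 1)] for i in range(nx - 1)]
    yc = [[average(yn, i, j) for j in range(ny - 1)] for i in range(nx - 1)]
    return xc, yc


# A single distorted cell with nodes (0,0), (1,0.1), (0,1), (1.2,1.1).
xn = [[0.0, 0.0], [1.0, 1.2]]
yn = [[0.0, 1.0], [0.1, 1.1]]
xc, yc = cell_centers(xn, yn)
```

A grid with Nx by Ny nodes yields (Nx - 1) by (Ny - 1) cell centers, which matches the index ranges given in the text.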
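As described in [9], it is natural to use the divergence as the first-order prime operator for this approximation scheme. Based on the invariant definition of the divergence, we can derive a discrete analog $\operatorname{div}_h$ of $\operatorname{div}$ such that
$$
\operatorname{div}_h : \tilde V_h \times \tilde V_h \to V_h, \qquad \tilde v_h \mapsto \operatorname{div}_h \tilde v_h,
$$
where $\tilde v_h \equiv (\tilde v_h^x, \tilde v_h^y)$. Both $\tilde V_h$ and $V_h$ are finite-dimensional spaces of discrete functions defined on the nodes and the cell centers of $\Omega_h$, respectively. Hence, the expression of the discrete divergence acting on a semidiscrete vector $\tilde w_h(t)$ has the following form
$$
\begin{aligned}
(\operatorname{div}_h \tilde w_h(t))_{i,j} = \frac{1}{2\sigma_{i,j}} \Big(
& ((\tilde w_h^x)_{i+1,j+1} - (\tilde w_h^x)_{i,j})(\tilde y_{i,j+1} - \tilde y_{i+1,j}) \\
& - ((\tilde w_h^x)_{i,j+1} - (\tilde w_h^x)_{i+1,j})(\tilde y_{i+1,j+1} - \tilde y_{i,j}) \\
& - ((\tilde w_h^y)_{i+1,j+1} - (\tilde w_h^y)_{i,j})(\tilde x_{i,j+1} - \tilde x_{i+1,j}) \\
& + ((\tilde w_h^y)_{i,j+1} - (\tilde w_h^y)_{i+1,j})(\tilde x_{i+1,j+1} - \tilde x_{i,j}) \Big),
\end{aligned}
\tag{2}
$$
for $i \in \{1,2,\dots,N_x-1\}$ and $j \in \{1,2,\dots,N_y-1\}$, where $\sigma_{i,j}$ is the area of the $(i,j)$-cell and $(\tilde w_h^z)_{i,j}$ denotes the $((i-1)(N_y-1)+j)$-th component of $\tilde w_h^z(t)$, which approximates $w^z(x_{i,j}, y_{i,j}, t)$ for $z = x, y$. Considering a discrete version of Gauss' theorem, together with equation (2), we shall construct the derived operator $\operatorname{grad}_h$ as the discrete analog of $\operatorname{grad}$ as follows
$$
\operatorname{grad}_h : V_h \to \tilde V_h \times \tilde V_h, \qquad u_h \mapsto \operatorname{grad}_h u_h.
$$
The components of the discrete gradient acting on the semidiscrete function $\psi_h(t)$ can be obtained as
$$
\begin{aligned}
(\operatorname{grad}_h^x \psi_h(t))_{i,j} = \frac{1}{2\eta_{i,j}} \big(
& (\tilde y_{i,j+1} - \tilde y_{i+1,j})(\psi_h)_{i,j} + (\tilde y_{i-1,j} - \tilde y_{i,j+1})(\psi_h)_{i-1,j} \\
& + (\tilde y_{i+1,j} - \tilde y_{i,j-1})(\psi_h)_{i,j-1} + (\tilde y_{i,j-1} - \tilde y_{i-1,j})(\psi_h)_{i-1,j-1} \big), \\
(\operatorname{grad}_h^y \psi_h(t))_{i,j} = -\frac{1}{2\eta_{i,j}} \big(
& (\tilde x_{i,j+1} - \tilde x_{i+1,j})(\psi_h)_{i,j} + (\tilde x_{i-1,j} - \tilde x_{i,j+1})(\psi_h)_{i-1,j} \\
& + (\tilde x_{i+1,j} - \tilde x_{i,j-1})(\psi_h)_{i,j-1} + (\tilde x_{i,j-1} - \tilde x_{i-1,j})(\psi_h)_{i-1,j-1} \big),
\end{aligned}
\tag{3}
$$
for the internal values $i \in \{2,\dots,N_x-1\}$ and $j \in \{2,\dots,N_y-1\}$, where $\eta_{i,j} = 0.25\,(\sigma_{i,j} + \sigma_{i-1,j} + \sigma_{i,j-1} + \sigma_{i-1,j-1})$ and $(\psi_h)_{i,j}$ denotes the $((i-1)(N_y-1)+j)$-th component of $\psi_h(t)$, which approximates $\psi(x_{i,j}, y_{i,j}, t)$. It is easy to see that the previous equations can be extended to the boundaries if we introduce the following fictitious nodes
$$
\begin{aligned}
\tilde x_{i,0} &= \tilde x_{i,1}, & \tilde y_{i,0} &= \tilde y_{i,1}, & \tilde x_{i,N_y+1} &= \tilde x_{i,N_y}, & \tilde y_{i,N_y+1} &= \tilde y_{i,N_y}, \\
\tilde x_{0,j} &= \tilde x_{1,j}, & \tilde y_{0,j} &= \tilde y_{1,j}, & \tilde x_{N_x+1,j} &= \tilde x_{N_x,j}, & \tilde y_{N_x+1,j} &= \tilde y_{N_x,j},
\end{aligned}
$$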
for $i \in \{1,2,\dots,N_x\}$ and $j \in \{1,2,\dots,N_y\}$, as well as the evaluations of the Dirichlet boundary condition $\psi_D(x,t)$ at the centers of the boundary segments, i.e.
$$
(\psi_h)_{0,j} = \psi_D(\hat x_{1,j}, \hat y_{1,j}, t), \qquad (\psi_h)_{N_x,j} = \psi_D(\hat x_{N_x,j}, \hat y_{N_x,j}, t),
$$
where $\hat x_{1,j} = 0.5\,(\tilde x_{1,j} + \tilde x_{1,j+1})$, $\hat y_{1,j} = 0.5\,(\tilde y_{1,j} + \tilde y_{1,j+1})$, $\hat x_{N_x,j} = 0.5\,(\tilde x_{N_x,j} + \tilde x_{N_x,j+1})$ and $\hat y_{N_x,j} = 0.5\,(\tilde y_{N_x,j} + \tilde y_{N_x,j+1})$, for $j \in \{1,2,\dots,N_y-1\}$, and
$$
(\psi_h)_{i,0} = \psi_D(\hat x_{i,1}, \hat y_{i,1}, t), \qquad (\psi_h)_{i,N_y} = \psi_D(\hat x_{i,N_y}, \hat y_{i,N_y}, t),
$$
where $\hat x_{i,1} = 0.5\,(\tilde x_{i,1} + \tilde x_{i+1,1})$, $\hat y_{i,1} = 0.5\,(\tilde y_{i,1} + \tilde y_{i+1,1})$, $\hat x_{i,N_y} = 0.5\,(\tilde x_{i,N_y} + \tilde x_{i+1,N_y})$ and $\hat y_{i,N_y} = 0.5\,(\tilde y_{i,N_y} + \tilde y_{i+1,N_y})$, for $i \in \{1,2,\dots,N_x-1\}$.
Let us next explain the spatial discretization for the tensor $K(\psi)$. In the linear case studied in [9] (i.e. when $K \equiv K(x)$ does not depend on $\psi$), the discrete equations (3), together with the nodal evaluations of the components of $K$, denoted by $(\tilde K_h^{11})_{i,j}$, $(\tilde K_h^{12})_{i,j}$ and $(\tilde K_h^{22})_{i,j}$, lead us to
$$
(\tilde K_h \operatorname{grad}_h \psi_h(t))_{i,j} =
\begin{pmatrix}
(\tilde K_h^{11})_{i,j}\,(\operatorname{grad}_h^x \psi_h(t))_{i,j} + (\tilde K_h^{12})_{i,j}\,(\operatorname{grad}_h^y \psi_h(t))_{i,j} \\
(\tilde K_h^{12})_{i,j}\,(\operatorname{grad}_h^x \psi_h(t))_{i,j} + (\tilde K_h^{22})_{i,j}\,(\operatorname{grad}_h^y \psi_h(t))_{i,j}
\end{pmatrix}.
\tag{4}
$$
Now, using (2)-(4), it is straightforward to obtain the discrete linear operator $\operatorname{div}_h(\tilde K_h \operatorname{grad}_h)$. In this case, the local stencil for $(\operatorname{div}_h(\tilde K_h \operatorname{grad}_h \psi_h))_{i,j}$ involves the cell-centered approximations $(\psi_h)_{i-1,j-1}$, $(\psi_h)_{i,j-1}$, $(\psi_h)_{i+1,j-1}$, $(\psi_h)_{i-1,j}$, $(\psi_h)_{i,j}$, $(\psi_h)_{i+1,j}$, $(\psi_h)_{i-1,j+1}$, $(\psi_h)_{i,j+1}$ and $(\psi_h)_{i+1,j+1}$, as well as the evaluations of the components of $K$ at the nodes $(i,j)$, $(i+1,j)$, $(i,j+1)$ and $(i+1,j+1)$.
In the nonlinear case, the nodal discretization of $K(\psi)\operatorname{grad}\psi$ is similar to the one given by (4). However, as the conductivity tensor depends on the unknown $\psi$, its discrete analog will involve approximations of this unknown at the grid nodes. Let us denote these approximations by $\tilde\psi_h$. Now, combining this discretization with equations (2) and (3), we shall obtain an approximation for $\operatorname{div}(K(\psi)\operatorname{grad}\psi)$. The local stencil for $(\operatorname{div}_h(\tilde K_h(\tilde\psi_h)\operatorname{grad}_h \psi_h))_{i,j}$ involves, as in the linear case, the nine cell-centered approximations listed above; moreover, it includes the nodal approximations given by $(\tilde\psi_h)_{i,j}$, $(\tilde\psi_h)_{i+1,j}$, $(\tilde\psi_h)_{i,j+1}$ and $(\tilde\psi_h)_{i+1,j+1}$. Such values will be obtained by means of a bivariate interpolation method as linear combinations of the nine values of $\psi_h(t)$ at the cell centers, i.e.
$$
\begin{aligned}
(\tilde\psi_h)_{i,j} &= \sum_{k,\ell=-1}^{1} c_{k,\ell}^{i,j}\,(\psi_h)_{i+k,j+\ell}, &
(\tilde\psi_h)_{i+1,j} &= \sum_{k,\ell=-1}^{1} c_{k,\ell}^{i+1,j}\,(\psi_h)_{i+k,j+\ell}, \\
(\tilde\psi_h)_{i,j+1} &= \sum_{k,\ell=-1}^{1} c_{k,\ell}^{i,j+1}\,(\psi_h)_{i+k,j+\ell}, &
(\tilde\psi_h)_{i+1,j+1} &= \sum_{k,\ell=-1}^{1} c_{k,\ell}^{i+1,j+1}\,(\psi_h)_{i+k,j+\ell}.
\end{aligned}
$$
Figure 1 shows the structure of the local nine-cell stencil corresponding to $(\operatorname{div}_h(\tilde K_h(\psi_h)\operatorname{grad}_h \psi_h))_{i,j}$.

[Figure: the nine cell-centered values $(\psi_h)_{i+k,j+\ell}$, $k,\ell \in \{-1,0,1\}$, surrounding the $(i,j)$-cell, whose corners are the nodes $(i,j)$, $(i+1,j)$, $(i,j+1)$ and $(i+1,j+1)$.]
Fig. 1. Nine-cell stencil for $(\operatorname{div}_h(\tilde K_h(\psi_h)\operatorname{grad}_h \psi_h))_{i,j}$
Using this discretization for the diffusion term, we can approach the original problem by solving a nonlinear stiff initial-value problem of the form: Find $\psi_h : [0,T] \to V_h$ such that
$$
\begin{cases}
\dfrac{d\psi_h(t)}{dt} = \operatorname{div}_h(\tilde K_h(\psi_h)\operatorname{grad}_h \psi_h(t)) + g_h(\psi_h) + f_h(t), & t \in (0,T],\\[4pt]
\psi_h(0) = r_h(\psi_0) = \psi_{0h},
\end{cases}
\tag{5}
$$
where $r_h$ denotes the restriction to the cell centers of $\Omega_h$.
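To make the discrete divergence of this section concrete, the following Python sketch evaluates it on a single quadrilateral cell. It is a minimal illustration under stated assumptions: the function name, the argument layout and the node ordering A = (i, j), B = (i+1, j), C = (i+1, j+1), D = (i, j+1) are choices made here, and the signs are those obtained by applying the trapezoidal rule to the boundary flux of the cell, for which the operator reproduces div(x, y) = 2 exactly on any quadrilateral.

```python
def discrete_divergence(nodes, w):
    """Discrete divergence on one quadrilateral cell.

    nodes: [(x, y)] of the corners A=(i,j), B=(i+1,j), C=(i+1,j+1), D=(i,j+1)
    w:     [(wx, wy)] nodal values of the vector field, in the same order
    """
    (xa, ya), (xb, yb), (xc, yc), (xd, yd) = nodes
    (wxa, wya), (wxb, wyb), (wxc, wyc), (wxd, wyd) = w

    # Signed cell area from the two diagonals (positive for counterclockwise A, B, C, D).
    sigma = 0.5 * ((xc - xa) * (yd - yb) - (xd - xb) * (yc - ya))

    # Boundary flux of w, written in terms of differences along the diagonals.
    flux = ((wxc - wxa) * (yd - yb) - (wxd - wxb) * (yc - ya)
            - (wyc - wya) * (xd - xb) + (wyd - wyb) * (xc - xa))
    return flux / (2.0 * sigma)


quad = [(0.0, 0.0), (1.0, 0.1), (1.2, 1.1), (0.0, 1.0)]
div_identity = discrete_divergence(quad, quad)                    # w = (x, y)
div_shear = discrete_divergence(quad, [(y, 0.0) for _, y in quad])  # w = (y, 0)
```

Checking the operator on linear fields such as these is a cheap way to catch sign or index slips before assembling the full nine-cell stencil.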
3 Time Integration
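Let us consider $\Omega$ decomposed into the union of $m$ overlapping subdomains, where each one of them consists of a certain number of disjoint connected components, i.e.
$$
\Omega = \bigcup_{i=1}^{m} \Omega_i, \quad \text{where } \Omega_i = \bigcup_{j=1}^{m_i} \Omega_{ij} \text{ such that } \Omega_{ij} \cap \Omega_{ik} = \emptyset \text{ if } j \neq k.
$$
Next, we define a smooth partition of unity consisting of $m$ functions $\{\rho_i(x)\}_{i=1}^{m}$, where each function $\rho_i : \Omega \to [0,1]$ is defined as follows
$$
\rho_i(x) =
\begin{cases}
0, & \text{if } x \in \Omega \setminus \Omega_i, \\[2pt]
h_i(x), & \text{if } x \in \bigcup_{j=1,\, j\neq i}^{m} (\Omega_i \cap \Omega_j), \\[2pt]
1, & \text{if } x \in \Omega_i \setminus \bigcup_{j=1,\, j\neq i}^{m} (\Omega_i \cap \Omega_j),
\end{cases}
$$
for $0 \le h_i(x) \le 1$ and $\sum_{i=1}^{m} h_i(x) = 1$ for all $x \in \bigcup_{j=1,\, j\neq i}^{m} (\Omega_i \cap \Omega_j)$.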
By using this partition of unity, we shall define the following splittings for the nonlinear discrete operator $A_h(\cdot) \equiv \operatorname{div}_h(\tilde K_h(\cdot)\operatorname{grad}_h \cdot)$ and the semidiscrete function $f_h(t)$ (cf. [3])
$$
A_h(\cdot) = \sum_{i=1}^{m} A_{i,h}(\cdot), \qquad f_h(t) = \sum_{i=1}^{m} f_{i,h}(t).
\tag{6}
$$
Here, we denote $A_{i,h}(\cdot) \equiv \operatorname{div}_h(\tilde K_{i,h}(\cdot)\operatorname{grad}_h \cdot)$, considering $\tilde K_{i,h}(\psi_h)\operatorname{grad}_h \psi_h$ as the discretization of $K_i(x,\psi)\operatorname{grad}\psi$, where $K_i(x,\psi) \equiv \rho_i(x) K(\psi)$. On the other hand, $f_{i,h}(t) \equiv r_h(\rho_i(x) f(x,t))$. Considering the splittings given by (6), the variant of the fractional implicit Euler method with $m$ internal stages introduced in [1] reduces the nonlinear stiff problem (5) to the following set of nonlinear systems (one per internal stage)
$$
\begin{cases}
\psi_{h,0} = \psi_{0h}, \\[2pt]
\psi_{h,n}^{k} = \psi_{h,n} + \tau \displaystyle\sum_{\ell=1}^{k} \big( A_{\ell,h}(\psi_{h,n}^{\ell}) + f_{\ell,h}(t_{n+1}) \big) + \tau\, g_h(\psi_{h,n}), & \text{for } k \in \{1,2,\dots,m\}, \\[2pt]
\psi_{h,n+1} = \psi_{h,n}^{m}, \\[2pt]
\text{for } n \in \{0,1,\dots,N_T\},
\end{cases}
\tag{7}
$$
where $N_T \equiv [T/\tau] - 1$. The discrete solution $\psi_{h,n+1}$ approximates $\psi_h(t_{n+1})$, where $t_{n+1} = (n+1)\,\tau$ and $\tau$ denotes the constant time step. The choice of the fractional implicit Euler scheme is motivated by the fact that this method is stable even when combined with an operator splitting which considers an arbitrary number of terms that do not necessarily commute (cf. [4]). This is the case for the discrete operators $A_{i,h}(\cdot)$ involved in (6). Note that (7) also entails an explicit treatment of the nonlinear discrete function $g_h(\psi_{h,n})$.
In order to linearize (7), we approximate $A_{\ell,h}(\psi_{h,n}^{\ell})$ by the first two terms of its Taylor expansion around $\psi_{h,n}$, i.e.
$$
A_{\ell,h}(\psi_{h,n}^{\ell}) \approx A_{\ell,h}(\psi_{h,n}) + \left.\frac{dA_{\ell,h}(\psi_h)}{d\psi_h}\right|_{\psi_h = \psi_{h,n}} (\psi_{h,n}^{\ell} - \psi_{h,n}).
\tag{8}
$$
If we denote by $(\psi_{h,n}^{\ell})_{i,j}$ the $((i-1)(N_y-1)+j)$-th component of $\psi_{h,n}^{\ell}$, for $i \in \{1,2,\dots,N_x-1\}$ and $j \in \{1,2,\dots,N_y-1\}$, then $(A_{\ell,h}(\psi_{h,n}^{\ell}))_{i,j}$ depends nonlinearly on nine unknowns: $(\psi_{h,n}^{\ell})_{i-1,j-1}$, $(\psi_{h,n}^{\ell})_{i,j-1}$, $(\psi_{h,n}^{\ell})_{i+1,j-1}$, $(\psi_{h,n}^{\ell})_{i-1,j}$, $(\psi_{h,n}^{\ell})_{i,j}$, $(\psi_{h,n}^{\ell})_{i+1,j}$, $(\psi_{h,n}^{\ell})_{i-1,j+1}$, $(\psi_{h,n}^{\ell})_{i,j+1}$ and $(\psi_{h,n}^{\ell})_{i+1,j+1}$. Therefore, the $((i-1)(N_y-1)+j)$-th row of the Jacobian matrix $J_{\ell,h}(\psi_{h,n}) \equiv dA_{\ell,h}(\psi_h)/d\psi_h|_{\psi_h=\psi_{h,n}}$ will contain nine non-zero elements representing the derivatives of $(A_{\ell,h}(\psi_{h,n}^{\ell}))_{i,j}$ with respect to each one of the previous unknowns.
Inserting (8) into (7), we obtain the totally discrete scheme for problem (1)
$$
\begin{cases}
\check\psi_{h,0} = \psi_{0h}, \\[2pt]
\big(I - \tau J_{k,h}(\check\psi_{h,n})\big)\,\check\psi_{h,n}^{k} = \check\psi_{h,n} + \tau \displaystyle\sum_{\ell=1}^{k-1} \big( A_{\ell,h}(\check\psi_{h,n}) + J_{\ell,h}(\check\psi_{h,n})(\check\psi_{h,n}^{\ell} - \check\psi_{h,n}) + f_{\ell,h}(t_{n+1}) \big) \\
\qquad\qquad + \tau \big( A_{k,h}(\check\psi_{h,n}) - J_{k,h}(\check\psi_{h,n})\,\check\psi_{h,n} + f_{k,h}(t_{n+1}) \big) + \tau\, g_h(\check\psi_{h,n}), \quad \text{for } k \in \{1,2,\dots,m\}, \\[2pt]
\check\psi_{h,n+1} = \check\psi_{h,n}^{m}, \\[2pt]
\text{for } n \in \{0,1,\dots,N_T\},
\end{cases}
\tag{9}
$$
where $\check\psi_{h,n+1}$ is an approximation of $\psi_h(t_{n+1})$ which preserves the same order of accuracy as $\psi_{h,n+1}$. Note that the domain decomposition splitting chosen for $A_h(\cdot)$ makes each internal stage consist of a linear system which involves the unknowns lying just on one of the subdomains $\{\Omega_i\}_{i=1}^{m}$. Moreover, since each subdomain $\Omega_i$ comprises $m_i$ disjoint connected components, this system can be immediately decomposed into $m_i$ uncoupled subsystems which allow a straightforward parallelization. For those points lying outside subdomain $\Omega_i$, we have that $J_{i,h}(\check\psi_{h,n}) \equiv 0$ (remember that $\rho_i(x) = 0$ if $x \in \Omega \setminus \Omega_i$) and so, in this case, the solution of the $i$-th internal stage in (9) simply requires an explicit evaluation of the right-hand side. In contrast to classical domain decomposition methods, artificial boundary conditions are not required on each subdomain and, hence, no Schwarz iterative procedures are needed in the computation. Finally, the uncoupled linear subsystems arising at each internal stage are solved by the Gauss-Seidel method.
In the case when $K(\psi)$ is a diagonal tensor and $\Omega$ is a rectangular domain discretized by means of a rectangular grid $\Omega_h$, we can derive efficient algorithms for the solution of (1) by combining a finite difference spatial discretization with the time integration procedure described in this section. In such a case, we have two options for decomposing operator $A_h(\cdot)$: the domain decomposition splitting explained before or a classical alternating direction splitting (cf. [1]). In the latter type, we obtain an essentially one-dimensional linear system at each internal stage which is tridiagonal and can be easily decomposed into several subsystems whose solution may be parallelized.
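The structure of the linearly implicit fractional step scheme of this section can be illustrated on a scalar toy problem. The Python sketch below is not the authors' implementation: it assumes a scalar unknown, so each "linear system" is a single division, and it uses a hypothetical test equation ψ' = -ψ³ - ψ with the cubic term split into two parts (weights 0.3 and 0.7, playing the role of a partition of unity) and the term g(ψ) = -ψ treated explicitly. The exact solution used for comparison follows from the substitution u = ψ⁻².

```python
import math


def fractional_step(psi_n, tau, parts, g):
    """One time step of the linearized fractional step scheme (scalar case).

    parts: list of (A_k, J_k) pairs, J_k being the derivative of A_k.
    g:     term treated explicitly.
    Each stage solves (1 - tau*J_k(psi_n)) * psi_k
        = psi_{k-1} + tau * (A_k(psi_n) - J_k(psi_n) * psi_n),
    starting from psi_0 = psi_n + tau * g(psi_n).
    """
    psi = psi_n + tau * g(psi_n)
    for A, J in parts:
        a, jac = A(psi_n), J(psi_n)
        psi = (psi + tau * (a - jac * psi_n)) / (1.0 - tau * jac)
    return psi


def integrate(tau, t_end=1.0, psi0=1.0):
    weights = (0.3, 0.7)  # hypothetical splitting weights, summing to 1
    parts = [(lambda p, w=w: -w * p ** 3, lambda p, w=w: -3.0 * w * p ** 2)
             for w in weights]
    psi = psi0
    for _ in range(round(t_end / tau)):
        psi = fractional_step(psi, tau, parts, g=lambda p: -p)
    return psi


# Exact solution of psi' = -psi**3 - psi with psi(0) = 1, evaluated at t = 1.
exact = 1.0 / math.sqrt(2.0 * math.exp(2.0) - 1.0)
err_coarse = abs(integrate(0.01) - exact)
err_fine = abs(integrate(0.005) - exact)
ratio = err_coarse / err_fine
```

Halving the time step should roughly halve the error, which is the first-order behaviour expected from this kind of scheme.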
4 Numerical Experiment
In this section, we test the behaviour of the numerical algorithm on a set of pseudo-random logically rectangular grids. A similar test is shown in [8] for a classical implicit Euler scheme, combined with the support-operator technique, in the solution of linear parabolic problems.
[Figure: three panels showing the grids on the unit square, with axes x and y ranging from 0 to 1: (a) N = 17, (b) N = 33, (c) N = 65.]
Fig. 2. Pseudo-random logically rectangular grids described in the numerical experiment
Let us consider an equation of type (1) posed on $\Omega \times (0,T] \equiv \{x = (x,y) \in \mathbb{R}^2 : 0 < x < 1,\ 0 < y < 1\} \times (0, 0.01]$. The hydraulic conductivity $K(\psi)$ is a full nonlinear tensor defined as $K(\psi) = Q(\theta)\,D(\psi)\,Q(\theta)^T$, where $Q(\theta)$ is a $2 \times 2$ rotation matrix with angle $\theta = \pi/4$ and $D(\psi)$ is a $2 \times 2$ diagonal matrix whose diagonal entries are $1 + \psi^2$ and $1 + 8\psi^2$. The nonlinear function is chosen to be $g(\psi) = 1/(1 + \psi^3)$, whereas the source/sink term $f(x,y,t)$ and both initial and Dirichlet boundary conditions are defined in such a way that $\psi(x,y,t) = e^{-2\pi^2 t}\sin(\pi x)\sin(\pi y)$ is the exact solution of the problem.
The spatial semidiscretization is based on the finite difference method described in section 2. The flow domain $\Omega$ is first discretized by means of a pseudo-random logically rectangular grid $\Omega_h \equiv \{(x_{i,j}, y_{i,j})\}_{i,j=1}^{N}$ with coordinates $x_{i,j} = (i-1)\,h - 0.25\,h + 0.5\,h\,R_x$ and $y_{i,j} = (j-1)\,h - 0.25\,h + 0.5\,h\,R_y$, where $h = 1/(N-1)$ and $R_x$, $R_y$ are random numbers generated on the interval $(0,1)$. Figure 2(a) shows an example of this type of grid for $N = 17$. In order to study the asymptotic behaviour of the error, we successively refine the original pseudo-random grid by using the following procedure: starting from a given grid, we add the lines which connect, on each cell, the centers of the opposite sides. Figures 2(b) and 2(c) show the first two refinements of the grid displayed in Figure 2(a).
Let us now consider a decomposition of $\Omega$ into $m = 4$ overlapping subdomains $\{\Omega_i\}_{i=1}^{m}$, each of which consists of $m_i = 4$ disjoint connected components, for $i \in \{1,2,3,4\}$. In particular, if we denote $I_1 \equiv \left(0, \tfrac14 + d\right) \cup \left(\tfrac12 - d, \tfrac34 + d\right)$ and $I_2 \equiv \left(\tfrac14 - d, \tfrac12 + d\right) \cup \left(\tfrac34 - d, 1\right)$, the four subdomains are given by $\Omega_1 \equiv I_1 \times I_1$, $\Omega_2 \equiv I_2 \times I_1$, $\Omega_3 \equiv I_1 \times I_2$ and $\Omega_4 \equiv I_2 \times I_2$. Note that the width of the overlapping regions is $2d$, where $d$ is chosen to be $1/16$.
Next, we define a smooth partition of unity consisting of four functions $\{\rho_i(x)\}_{i=1}^{4}$ which are related to the previous domain decomposition. For that purpose, we start by introducing
$$
h(x, x_0, d) = \exp\!\left( d\,\exp(1/d)\,\log(2)\; \frac{\exp\!\left(\dfrac{-1}{x - x_0 + d}\right)}{x - x_0 - d} \right),
$$
which is subsequently used to define the function
$$
i_1(x) =
\begin{cases}
1, & \text{if } x \in \left[0, \tfrac14 - d\right] \cup \left[\tfrac12 + d, \tfrac34 - d\right], \\[2pt]
0, & \text{if } x \in \left[\tfrac14 + d, \tfrac12 - d\right] \cup \left[\tfrac34 + d, 1\right], \\[2pt]
h(x, \alpha, d), & \text{if } x \in [\alpha - d, \alpha + d] \text{ with } \alpha = \tfrac14, \tfrac34, \\[2pt]
1 - h(x, \tfrac12, d), & \text{if } x \in \left[\tfrac12 - d, \tfrac12 + d\right],
\end{cases}
$$
and $i_2(x) = 1 - i_1(x)$. Finally, if we consider suitable products of $i_1(x)$ and $i_2(x)$, we can construct the following non-negative $C^\infty$-functions: $\rho_1(x,y) = i_1(x)\,i_1(y)$, $\rho_2(x,y) = i_2(x)\,i_1(y)$, $\rho_3(x,y) = i_1(x)\,i_2(y)$ and $\rho_4(x,y) = i_2(x)\,i_2(y)$.
In order to obtain the totally discrete scheme (9), we use the fractional step method given by (7), with four internal stages ($m = 4$), in combination with the linearization procedure described in (8). The solution of our numerical scheme provides vectors $\psi_{h,n} \in \mathbb{R}^{(N-1)\times(N-1)}$, for $n = 1, 2, \dots, N_T + 1$, whose components are approximations to the exact solution $\psi(x, t_n)$ at the cell centers of $\Omega_h$. Owing to the domain decomposition splitting considered for $A_h(\cdot)$, the linear system obtained at each internal stage reduces to a set of four smaller uncoupled subsystems which can be easily solved in parallel. Therefore, the number of unknowns involved in the computation will decrease from $(N-1)\times(N-1)$ to a number between $n_1 \times n_1$ and $n_2 \times n_2$, where $n_1 = (N-1)(1/4 + d)$ and $n_2 = (N-1)(1/4 + 2d)$. From a practical point of view, as the number of available processors increases, each subdomain can be decomposed into a greater number of disjoint components in order to reduce the actual execution time.
Finally, we include two tables which contain the global errors, $E_{N,\tau}$ (upper row), and numerical orders of convergence, $p_{N,\tau}$ (lower row), obtained for different values of $N$ and $\tau$ when using the maximum norm in time and the $L^2$-norm in space, i.e. $\|\cdot\|_{L^\infty(0,T;L^2(\Omega))}$. The method shows unconditional convergence of first order in time (see Table 1) and second order in space (see Table 2).

Table 1. Global errors and numerical orders of convergence for N = 129

τ          τ0 = 10^-3   τ0/2       τ0/4       τ0/8       τ0/16      τ0/32
E_{N,τ}    3.703E-2     2.577E-2   1.613E-2   9.416E-3   5.246E-3   2.827E-3
p_{N,τ}    0.5230       0.6759     0.7766     0.8439     0.8919     −

Table 2. Global errors and numerical orders of convergence for τ = 10^-7

N          17         33         65         129        257
E_{N,τ}    5.312E-3   1.922E-3   4.693E-4   1.167E-4   2.893E-5
p_{N,τ}    1.4667     2.0340     2.0077     2.0122     −
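The partition of unity used in this experiment can be checked numerically with a short Python sketch. The function names below are chosen here for illustration, the transition function is coded with the exponent d·exp(1/d)·log(2)·exp(-1/(x - x0 + d))/(x - x0 - d), and its limiting values at the two ends of the transition interval (1 on the left, 0 on the right) are imposed explicitly to avoid a division by zero.

```python
import math

D = 1.0 / 16.0  # half-width of the overlapping regions


def h(x, x0, d=D):
    """Smooth transition from 1 (at x0 - d) to 0 (at x0 + d); equals 1/2 at x0."""
    if x <= x0 - d:
        return 1.0
    if x >= x0 + d:
        return 0.0
    exponent = (d * math.exp(1.0 / d) * math.log(2.0)
                * math.exp(-1.0 / (x - x0 + d)) / (x - x0 - d))
    return math.exp(exponent)


def i1(x, d=D):
    """One-dimensional cut-off function equal to 1 on the core of I1."""
    if x <= 0.25 - d or 0.5 + d <= x <= 0.75 - d:
        return 1.0
    if 0.25 + d <= x <= 0.5 - d or x >= 0.75 + d:
        return 0.0
    if x < 0.25 + d:
        return h(x, 0.25, d)
    if x < 0.5 + d:
        return 1.0 - h(x, 0.5, d)
    return h(x, 0.75, d)


def i2(x, d=D):
    return 1.0 - i1(x, d)


def rho(x, y):
    """Values (rho_1, rho_2, rho_3, rho_4) at the point (x, y)."""
    return (i1(x) * i1(y), i2(x) * i1(y), i1(x) * i2(y), i2(x) * i2(y))


samples = [(0.1, 0.1), (0.25, 0.25), (0.3, 0.52), (0.7, 0.77), (0.9, 0.4)]
sums = [sum(rho(x, y)) for x, y in samples]
```

Since the four functions are products of i1 and i2 = 1 - i1 in each direction, they add up to (i1 + i2)² = 1 at every point, which is what the sampled sums verify.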
References
1. Arrarás, A., Jorge, J.C.: An alternating-direction finite difference method for three-dimensional flow in unsaturated porous media. Mathematical Modelling and Analysis, 57–64 (2005)
2. Celia, M.A., Bouloutas, E.T., Zarba, R.L.: A general mass-conservative numerical solution for the unsaturated flow equation. Water Resour. Res. 26, 1483–1496 (1990)
3. Portero, L., Bujanda, B., Jorge, J.C.: A combined fractional step domain decomposition method for the numerical integration of parabolic problems. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds.) PPAM 2004. LNCS, vol. 3019, pp. 1034–1041. Springer, Heidelberg (2004)
4. Portero, L.: Fractional Step Runge-Kutta Methods for Multidimensional Evolutionary Problems with Time-Dependent Coefficients and Boundary Data. Ph.D. Thesis, Universidad Pública de Navarra, Pamplona (2007)
5. Richards, L.A.: Capillary conduction of liquids through porous mediums. Physics 1, 318–333 (1931)
6. Samarskiĭ, A., Tishkin, V., Favorskiĭ, A., Shashkov, M.: Operational finite-difference schemes. Differ. Equ. 17, 854–862 (1981)
7. Shashkov, M.: Conservative Finite-Difference Methods on General Grids. CRC Press, Boca Raton (1996)
8. Shashkov, M., Steinberg, S.: Solving diffusion equations with rough coefficients in rough grids. J. Comput. Phys. 129, 383–405 (1996)
9. Shashkov, M., Steinberg, S.: The numerical solution of diffusion problems in strongly heterogeneous non-isotropic materials. J. Comput. Phys. 132, 130–148 (1997)
10. Šimůnek, J., Hopmans, J.W., Vrugt, J.A., van Wijk, M.T.: One-, two- and three-dimensional root water uptake functions for transient modeling. Water Resour. Res. 37, 2457–2470 (2001)
Provenance Tracking in the ViroLab Virtual Laboratory
Bartosz Baliś1,2, Marian Bubak1,2, and Jakub Wach2
1 Institute of Computer Science, AGH, Poland
{balis,bubak}@agh.edu.pl, [email protected]
2 Academic Computer Centre – CYFRONET, Poland
Abstract. Provenance describes the process which led to the creation of a piece of data. Tracking provenance of experiment results is essential in modern environments which support conducting of in silico experiments. We present a provenance tracking approach developed as part of the virtual laboratory of the ViroLab project. The applied provenance solution is motivated by the Semantic Grid vision as an infrastructure for e-Science. Provenance data is represented in XML and modeled as ontologies described in the OWL knowledge representation language. The provenance tracking system, PROToS, has been designed and implemented to address important stages of the knowledge management lifecycle. Keywords: e-Science, Grid, ontology, provenance, ViroLab.
1 Introduction
The term 'e-Science' [6] was introduced by John Taylor to denote a new type of scientific research based on collaboration within a number of scientific areas, enabled by a next generation infrastructure. The infrastructure in question is usually identified with Grid systems, which offer at least two benefits important for loosely-coupled cross-institution research and collaboration: the virtualization and sharing of resources [7], and the building of virtual organizations [3]. Several initiatives have emerged to develop existing Grid infrastructures towards higher-level functionalities useful for scientists. Examples of such initiatives are myGrid [9], Taverna [8] and Triana [12]. Sometimes the term virtual laboratory is used to denote an environment that supports conducting in silico experiments. Provenance of a piece of data is defined as the process that led to that data [5], also known as the derivation path of a piece of data. Provenance is essential for a scientist as it provides information on how a particular result was obtained, which might be as important as the result itself. The scientist would use the provenance data for various purposes, for example to assess the quality of the result, to repeat the whole or part of the experiment, to validate the results, etc. [4]. The goal of ViroLab¹ [11] is to set up a virtual organization and develop a virtual laboratory² for infectious diseases in order to support medical doctors and scientists in
This work is supported by the European Union through the IST-027446 project ViroLab and by the Foundation for Polish Science within the Domestic Grant for Young Scientists.
¹ http://www.virolab.org
² http://virolab.cyfronet.pl
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 381–390, 2008. c Springer-Verlag Berlin Heidelberg 2008
their daily work. The primary scenario that drives ViroLab is the HIV drug resistance. Medical doctors are provided with a decision support system which helps them rank drugs for particular patients. Virologists can use data integrated from a number of institutions, and various tools, virtualized and perhaps composed into scripts (similar to workflows), to perform data mining, statistical analysis, as well as molecular dynamics and cellular automata simulations. This paper presents the provenance tracking approach in the ViroLab Project. We describe our motivation (Section 2), the provenance model in ViroLab (Section 3), and the architecture of our provenance tracking system – PROToS (Section 4). Section 5 presents the implementation of PROToS in more detail. Section 6 contains an overview of related work.
2 Motivation
Our motivation in designing the provenance solution was not only to support recording and browsing of provenance logs but also to enable complex queries over provenance data that can help find interesting properties. Examples of useful queries are as follows:
– Which rule sets were used most frequently to obtain drug rankings?
– Return all experiments performed by John Doe within the last two weeks.
– Return all input sequences of experiments of type Drug Ranking Support whose input parameter threshold was between 0.03 and 0.04.
Our work is also driven by the vision of the semantic grid as a future infrastructure for e-Science, as presented in [10]. In that article, three types of services (data services, information services, and knowledge services) are introduced, where data is understood as an uninterpreted sequence of bits (e.g. an integer value), information is data associated with meaning (e.g. a temperature), while knowledge is understood as "information applied to achieve a goal, solve a problem or enact a decision". Provenance is an important part of an e-Science infrastructure providing both information and knowledge services. However, adequate knowledge management has several aspects concerning the various stages of the knowledge lifecycle: acquisition, modelling, retrieval, reuse, publishing and maintenance [10]. Consequently, our aim is to design the provenance data model and representation, as well as the provenance tracking system, with those requirements in mind.
With the provenance service built as a knowledge service, its use can go beyond that of a scientist's personal logbook documenting the experiments. Sufficiently rich information stored in provenance records can enable other uses. For example, records of multiple invocations of particular services, equipped with timing information, can easily be used for instance-based learning to optimize future resource brokering or scheduling decisions. In the future, we anticipate integration with ViroLab's brokering/scheduling component for this purpose.
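The kind of queries listed in this section can be illustrated with a small Python sketch over in-memory records. Everything in it is hypothetical: the record fields (`type`, `user`, `started`, `threshold`, `input_sequence`), the sample values and the function names are assumptions made for illustration only and do not reflect the actual PROToS data model or query interface.

```python
from datetime import datetime, timedelta

# Hypothetical provenance records; all field names are illustrative only.
RECORDS = [
    {"id": "exp-1", "type": "Drug Ranking Support", "user": "John Doe",
     "started": datetime(2007, 9, 3), "threshold": 0.035,
     "input_sequence": "seq-A"},
    {"id": "exp-2", "type": "Drug Ranking Support", "user": "Jane Roe",
     "started": datetime(2007, 9, 5), "threshold": 0.050,
     "input_sequence": "seq-B"},
    {"id": "exp-3", "type": "Subtyping", "user": "John Doe",
     "started": datetime(2007, 7, 1), "threshold": None,
     "input_sequence": "seq-C"},
]


def experiments_by_user(records, user, now, days=14):
    """Ids of experiments started by `user` within the last `days` days."""
    since = now - timedelta(days=days)
    return [r["id"] for r in records
            if r["user"] == user and since <= r["started"] <= now]


def input_sequences(records, exp_type, low, high):
    """Input sequences of experiments of a given type with low <= threshold <= high."""
    return [r["input_sequence"] for r in records
            if r["type"] == exp_type
            and r["threshold"] is not None
            and low <= r["threshold"] <= high]


now = datetime(2007, 9, 10)
recent = experiments_by_user(RECORDS, "John Doe", now)
sequences = input_sequences(RECORDS, "Drug Ranking Support", 0.03, 0.04)
```

In a semantic setting the same questions would be phrased against ontology concepts rather than record fields, which is the motivation for the model described in the next section.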
3 Applied Provenance Model
Fig. 1 presents a simplified architecture of ViroLab's virtual laboratory. The virtual laboratory provides an Experiment Planning Environment (EPE) in which users develop experiment scripts that can subsequently be executed for different input data.
Both data and computation are virtualized as so-called Grid Objects. A unified data layer (accessible through the Data Access Client, DAC) provides data integration over multiple data sources. Grid Objects represent services with one or more methods which can be programmatically accessed from scripts in a uniform way, regardless of the underlying technology (web services, components, jobs). Information about Grid Objects is stored in a registry (Grid Resources Registry, GRR). Grid Objects are invoked by the Computation Access component. Currently several tools, such as HIV sequencing, HIV subtyping or HIV drug ranking, are virtualized within the virtual laboratory and available as Grid Objects. The enactment engine (GSEngine) orchestrates the execution of scripts. Most virtual laboratory components feed a monitoring system (GEMINI [1]) with monitoring events which are aggregated, translated to ontological models, and published into the provenance tracking system, PROToS.
[Figure: block diagram connecting the Experiment Planning Environment (EPE), Grid Resources Registry (GRR), Computation Access (Invoker & Optimizer), GSEngine (enactment engine), Data Access Client (DAC), Computing Resource, Monitoring System (GEMINI), Information Aggregator, Provenance Tracking System (PROToS) and Experiment Information store. Labelled flows: look up Grid Objects, read Grid Object properties, execute experiment, instantiate and invoke Grid Objects, execute remote processing, data request, events, monitoring data (events), publish information (translated to ontological representation), store information.]
Fig. 1. Architecture of the virtual laboratory in ViroLab
Provenance in ViroLab is process-oriented, i.e. complete traces of experiments conducted in the virtual laboratory are captured and stored in a repository. Provenance data is generated as a set of events produced in distributed, instrumented components of the virtual laboratory runtime services and application services. The events are collected, correlated and aggregated, and then translated into an ontology representation. Provenance data is represented as XML documents in a native XML database. The advantages of using XML to represent data exchanged across services have been pointed out many times. XML is self-describing, interoperable, reusable, flexible and extensible [2]. Our model of provenance data is based on a comprehensive ontology-based description, modeled in the knowledge representation language OWL³. The basic conceptualization of the provenance is based on a generic experiment ontology supplied with provenance-related concepts. Part of this ontology is depicted in Fig. 2. Its main concept is an
OWL Web Ontology Language Reference, http://www.w3.org/TR/owl-ref
Experiment, which is composed of Execution Stages (invocations of Grid Operations or Data Access calls). Virtualized computation is represented as GridObjects which have one or more GridOperations. Physical computing resources involved in the execution are represented by other concepts, such as Hosts and Containers. Many important pieces of information are modeled as properties of the concepts. Those generic concepts are connected with domain ontologies which, in our case, are ontologies describing the applications and data of the ViroLab virtual laboratory. The concepts shown in the diagram, ViroLabEvent and ViroLabDataEntity, are top-level concepts of domain ontologies which enhance the semantic descriptions of applications (a ViroLab event represents a well-defined stage of an application, e.g. 'computation of drug rankings') and of data, virological and medical in our case.
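The relationships between the generic concepts named above can be pictured with a few Python data classes. This is only a rough sketch under stated assumptions: the class and attribute names mirror the concept names in the text, but the fields, the `hosts` helper and the sample values are hypothetical and are not taken from the ViroLab OWL ontology, which is considerably richer.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GridOperation:
    name: str


@dataclass
class GridObject:
    name: str
    operations: List[GridOperation] = field(default_factory=list)


@dataclass
class ExecutionStage:
    operation: GridOperation  # the Grid Operation invoked in this stage
    host: str                 # physical resource that executed it


@dataclass
class Experiment:
    name: str
    stages: List[ExecutionStage] = field(default_factory=list)

    def hosts(self):
        """Distinct hosts involved in the experiment, sorted by name."""
        return sorted({stage.host for stage in self.stages})


# Hypothetical example: one Grid Object with a single operation, run twice.
rank = GridOperation("rankDrugs")
ranking_tool = GridObject("DrugRanking", [rank])
experiment = Experiment("Drug Ranking Support",
                        [ExecutionStage(rank, "node-b"),
                         ExecutionStage(rank, "node-a")])
involved_hosts = experiment.hosts()
```

In the ontology these links are expressed as properties between concepts, so that domain concepts can be attached without changing the generic part.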
Fig. 2. Conceptualization of provenance data
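The relationships between the generic concepts named above can be pictured with a small data model. The following Python sketch is only an illustration: the class names follow the concepts in the text, while every field name (kind, target, host, container) and the helper hosts_used are assumptions made for this example, not part of the ViroLab ontology.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GridOperation:
    name: str


@dataclass
class GridObject:
    name: str
    operations: List[GridOperation]

    def __post_init__(self):
        # The text states that a GridObject has one or more GridOperations.
        if not self.operations:
            raise ValueError("a GridObject needs at least one GridOperation")


@dataclass
class ExecutionStage:
    kind: str        # hypothetical values: "GridOperationInvocation" or "DataAccessCall"
    target: str      # name of the operation or data source involved
    host: str        # physical resource (Host concept)
    container: str   # physical resource (Container concept)


@dataclass
class Experiment:
    name: str
    stages: List[ExecutionStage] = field(default_factory=list)

    def hosts_used(self):
        """Return the sorted, de-duplicated hosts touched by the experiment."""
        return sorted({stage.host for stage in self.stages})
```

A query such as "which hosts took part in this experiment" then becomes a traversal over the stages, which is the kind of question the provenance repository is meant to answer.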
In a virtual laboratory, we deal with a number of systems which collect and share information and are maintained for a long time. In such an environment, data schemas can easily evolve over time. The use of a self-describing and interoperable data representation enriched with semantic information enables easy, or even seamless and automatic, integration of services in such cases. The benefits of modeling provenance as semantic, ontology-based data are as follows:
1. The ontology will enable the creation of a semantic description of each experiment performed in the virtual laboratory. This description can reflect different levels of detail and different points of view on the experiment.
2. The generic ontology can be extended with domain ontologies which reflect the conceptual model of a particular research domain (e.g. bioinformatics) as well as a particular application (e.g. drug resistance). In this way the basic experiment description can be easily extended with more detailed, domain-specific information.
3. The ontologies denote the space of provenance event types that are relevant for an experiment. Thus, they can be used for automatic generation of instrumentation helper tools that generate event representations. They also suggest which actors should be instrumented to obtain a particular piece of information.
4. Ontologies enhance the usability of provenance data in both a manual and an automatic manner. First, they enable the construction of powerful and user-friendly query tools which operate on the level of ontology concepts and relationships instead of those of the particular data model. This is very helpful in constructing complex queries over provenance and actual data, ultimately aimed at the extraction of knowledge for researchers and clinicians who are not familiar with the technical details of the underlying data representations. Second, a repository of experiment traces with well-defined semantics could be used for machine learning purposes by virtual laboratory runtime services, e.g. to improve scheduling and brokering decisions.
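Point 3 above can be illustrated with a minimal sketch of an instrumentation helper that accepts only event types known to the ontology and emits an XML representation of the event. The event type names, the element and attribute names, and the function event_to_xml are all hypothetical; the actual event schema used in ViroLab is not given in the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical event types; in a real system these would be derived
# from the experiment and domain ontologies rather than hard-coded.
KNOWN_EVENT_TYPES = {"ExperimentStarted", "GridOperationInvoked", "DataAccessCall"}


def event_to_xml(event_type, experiment_id, properties):
    """Serialize one provenance event as an XML string.

    Rejects event types outside the space defined by the ontology.
    """
    if event_type not in KNOWN_EVENT_TYPES:
        raise ValueError("unknown event type: " + event_type)
    root = ET.Element("event", {"type": event_type, "experiment": experiment_id})
    for name in sorted(properties):
        child = ET.SubElement(root, "property", {"name": name})
        child.text = str(properties[name])
    return ET.tostring(root, encoding="unicode")
```

Restricting the helper to ontology-defined types means that a malformed event is rejected at the instrumentation point instead of reaching the repository.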
4 Provenance Tracking System – PROToS
The Provenance TRacking System (PROToS) is developed on the basis of the provenance model presented in Section 3. Fig. 3 presents a simplified component diagram of PROToS and its context environment. The main PROToS components are as follows:
– PROToS Core provides the core functionalities: provenance data gathering (consumer interface) and retrieval (producer interface). Incoming information, in the form of events, is provided by the Monitoring System. Provenance data can be queried in one of the supported languages (XQuery, RDQL, SPARQL, etc.) by any component, such as a QUaTRO tool, the Application Optimizer or an application-specific portlet (part of the ViroLab portal). The external interfaces exposed by the core are provided as stateless Web Services to enable a high level of interoperability with other system components. Internally, components are interconnected by RMI (Remote Method Invocation) middleware for performance and security reasons.
– Query Translation Tools (QUaTRO). This component consists of a set of tools that enable VL end-users to define and build provenance queries in an automated, easy way, without touching technical details. To achieve this goal, QUaTRO tools will make use of the semantic descriptions existing in the VL: those of data, applications and experiments. Simple Data Mining and Natural Language Processing techniques will also be employed by more sophisticated tools. This component is crucial for enabling provenance use by a typical user who has no knowledge of the ontologies used, the provenance data structure, or query languages such as XQuery or SQL.
– Provenance Data Repository (PDR) stores provenance data and the ontologies used (data, application, experiment) in the XML-RDF representation of the ontology language OWL. Storage is hierarchical to enlarge storage capacity, and distributed to achieve high query performance. PDR also takes care of balancing provenance data in a way that ensures high performance and high storage capacity.
– Ontology Store contains all Domain Ontologies used by the Core and QUaTRO components, namely the experiment, application and data ontologies. Currently, the Ontology Store is implemented with Sesame⁴ and is accessible through a programmatic (Java) interface and the HTTP protocol.
Fig. 3. Structure of PROToS – PROvenance Tracking System in ViroLab (component diagram showing the Query Translation Tools (QUaTRO), the external Data Access Service, the Ontology Store, the PROToS Core with its producer and consumer interfaces, the Provenance Data Repository, and the Monitoring System supplying events)
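The role of QUaTRO described above, turning a selection made at the level of ontology concepts into a query the repository understands, can be sketched as follows. This is a simplified illustration under stated assumptions: the function name build_xquery, the collection path and the element names are hypothetical, and the real tool is GUI-based and considerably richer.

```python
def build_xquery(collection, concept, conditions):
    """Build an XQuery FLWOR expression selecting `concept` elements.

    `conditions` maps child element names to required string values.
    Names are assumed to be valid XML element names.
    """
    def quote(value):
        # In XQuery string literals a double quote is escaped by doubling it.
        return '"' + str(value).replace('"', '""') + '"'

    clauses = ["$x/%s = %s" % (prop, quote(conditions[prop]))
               for prop in sorted(conditions)]
    query = 'for $x in collection("%s")//%s' % (collection, concept)
    if clauses:
        query += " where " + " and ".join(clauses)
    return query + " return $x"
```

For example, build_xquery("/db/provenance", "Experiment", {"owner": "alice"}) yields a query over all Experiment elements whose owner child equals "alice", so the end user never has to write XQuery by hand.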
5 Implementation of PROToS
From the technical point of view, PROToS is organized as a set of applications, possibly and preferably deployed on different nodes. The PROToS core is deployed as a Java Web Application (WAR) built on the Spring framework⁵, and requires a servlet container such as Apache Tomcat or Jetty to operate. We have chosen Spring because it offers full support for such technologies as JMX, RMI and XFire-based Web Services. It is also a very well supported industry standard. Internally, the core runs an eXist database⁶ instance for storing Domain Ontologies. This is necessary to provide such functionalities as quick validation of incoming provenance events and maintenance of the PDR. The communication between PDR elements and the PROToS core is implemented with Java RMI for performance and security reasons.
The PDR itself may be composed of many Storage Nodes (SNs): groups of physical nodes that store and manage one or more domain ontologies with an associated set of individuals (called a register). Each Storage Node has one central node (called the Storage Super Node) which stores the domain ontology and provides other functionalities such as group management/configuration and ontology reasoning with Pellet⁷ or another Jena⁸-compliant reasoner. It also performs other work associated with ontology processing, such as validating the consistency and correctness of individuals with respect to a Domain Ontology. Moreover, queries in languages other than XQuery will have to be handled by the ontology-core running on the SN central node. We have chosen Jena as our framework for ontology processing. It includes its own implementation of the SPARQL query language, called ARQ, with a query engine and a rule-based reasoner. Jena allows other reasoners to be deployed and provides many features required for our ontology-core. Jena also natively supports reading and storing OWL in XML format, which is crucial as we have chosen an XML database for storage.
The rest of the physical nodes will be used as Storage Peers: components that run an eXist instance with the sole purpose of storing part of the register of individuals. We plan to divide the register among more than one Storage Peer, because registers of individuals tend to grow to very large sizes, especially with ontology reasoning enabled. Currently, we plan to provide at least two implementations of the Storage Peer component: a simple one with an embedded Jetty container, and a fully configurable one implemented with PicoContainer⁹. Pico was chosen because it is much more lightweight than Spring, and we believe that all available node resources should be devoted to storage functionality.
The hierarchical, multi-node PDR architecture described above was developed to meet the performance and storage requirements of PROToS. Our investigation of typical provenance use in VL applications convinced us that most queries will pertain to a single Domain Ontology. In our scheme, such queries will be handled by one Storage Node, so that no merging or post-processing of the results will be needed. Fig. 4 presents an example core and PDR deployment.
Fig. 4. PROToS example deployment diagram (diagram labels: PROToS Core, Storage Nodes, Storage Super Nodes, Storage Peers)
As mentioned before, the PROToS core provides maintenance of the PDR. This task includes ontology-based routing mechanisms and algorithms for distributing and balancing ontology data on Storage Nodes. Other important roles of the core include choosing the Storage Nodes on which a new query should be executed, performing the query, and integrating the results from all SNs involved. The PROToS core runtime configuration allows for adding and removing Storage Nodes as well as Storage Peers, managing Domain Ontologies and rearranging the PDR structure. All configuration options and methods available for both the core and the Storage Nodes are remotely accessible via JMX. This is very important since performance greatly depends on such configuration options as whether reasoning is enabled and which reasoner is used. JMX is highly standardized and provides out-of-the-box tools for building modular, dynamic solutions for managing systems such as PROToS. The use of this technology saved us much time and made our system cleaner and more robust.
⁴ Sesame Ontology Store, http://www.openrdf.org
⁵ Spring framework (Dependency Injection container), http://www.springframework.org
⁶ eXist XML native DB, http://exist.sourceforge.net
⁷ Pellet OWL reasoner, http://www.mindswap.org/2003/pellet
⁸ Jena Semantic Framework, http://jena.sourceforge.net
⁹ PicoContainer (Dependency Injection container), http://www.picocontainer.org
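The ontology-based routing described in this section can be pictured with a short sketch: the core maps each Domain Ontology to one Storage Node, and the node spreads its register of individuals over its Storage Peers. The text does not specify the balancing algorithm, so the hash-based placement below is purely an assumption made for illustration, as are the class and method names.

```python
import zlib


class StorageNode:
    """One Storage Node: a domain ontology plus a register split over peers."""

    def __init__(self, ontology, peer_count):
        self.ontology = ontology
        # Each dict stands in for one Storage Peer holding part of the register.
        self.peers = [dict() for _ in range(peer_count)]

    def _peer_for(self, individual_id):
        # Assumed placement policy: a stable hash of the identifier.
        return zlib.crc32(individual_id.encode("utf-8")) % len(self.peers)

    def store(self, individual_id, document):
        self.peers[self._peer_for(individual_id)][individual_id] = document

    def lookup(self, individual_id):
        return self.peers[self._peer_for(individual_id)].get(individual_id)


class Core:
    """Routes requests to the single Storage Node owning a domain ontology."""

    def __init__(self):
        self.nodes = {}

    def add_node(self, node):
        self.nodes[node.ontology] = node

    def route(self, ontology):
        return self.nodes[ontology]
```

Because a query that pertains to a single Domain Ontology is routed to exactly one Storage Node, no merging of partial results across nodes is needed, which is the property the hierarchical design aims for.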
6 Related Work
A number of provenance tracking systems have been built. The provenance approach presented in [14] is based on recording all data sets (serialized as XML) and their transformations. Provenance data is stored in a relational database in a separate repository. In myGrid, workflow templates, data, and metadata including provenance records are stored in a common information repository [16]. Provenance records are generated based on the events from workflow enactment engines. In a postprocessing phase, provenance logs are annotated with ontological concepts taken from the semantic descriptions of the services involved in a workflow [16]. The Karma provenance framework [13] was designed for data-centric workflows in a SOA, and collects three forms of provenance: workflow provenance, describing the services called; process provenance, describing the activities that took place in a single service invocation; and data provenance, describing the services that produced or used a piece of data (across all recorded workflows). The Virtual Data System (VDS) allows users to define and use procedures that perform data analysis resulting in new, derived data. The authors of [15] define the provenance of a data object as a functional procedure that was used to produce it and can be used to reproduce it. Based on this definition, they modeled a logical virtual data schema for provenance recording that allows storing the information needed for a complete provenance record. For storing information about operation arguments, such as their type, the VDS model simply uses string labels. Datasets for workflows are also stored by name (as strings).
The described solutions have certain limitations. Only myGrid focuses seriously on the semantics of provenance records, and even there it is limited to annotations. The schema concept of [15] is similar to our generic experiment ontology but, because of its focus on the VDS definition, much more limited. The example provenance queries presented in that work are fully supported by the PROToS provenance model. Moreover, coupled with specific application ontologies, PROToS will support more complicated queries, similar to mining over provenance data. We use a semantic model of the data used in experiments which features mappings to the database schema. That enables ontology-guided construction of queries combined with transparent browsing of the databases themselves, if necessary.
7 Summary and Future Work
We have presented PROToS, the PROvenance Tracking System developed in the ViroLab project. Currently, the first prototype of PROToS has been released. It supports the main functionalities: the storage of provenance events collected from Monitoring as well as XQuery-based queries. PDR is limited to one Storage Node, with basic ontology reasoning enabled. The QUaTRO component currently consists of a single tool which enables end-user oriented, GUI-based, ontology-driven construction of queries which are subsequently transformed into XQuery.
Future work includes the development of the main ontologies for the primary application scenarios in ViroLab, and extensions to the PROToS architecture and functionalities. PDR will be extended into a multi-node, hierarchical storage. This is a major task, because enabling PROToS for new VL applications will generate tremendous amounts of data which will certainly exceed the capability of a centralized storage. An important challenge is to address the inevitable evolution of VL Domain Ontologies (DOs): not only application DOs, but also data DOs. Obviously, the user does not want to lose the whole registry of individuals when the domain ontologies are changed.
Long-term goals involve extending the PROToS ontology reasoning capabilities. As mentioned before, the current, Jena-based implementation requires all individuals to be loaded into main memory. This will of course not be possible in a production environment with millions of individuals from one Domain Ontology. To overcome this limitation, we are closely evaluating such solutions as distributed reasoning with e-connections and automatic loading of individuals on demand from a database.
Apart from the PROToS core and PDR, we plan to develop a Language-to-Query Translation tool as a part of our QUaTRO tool suite. This tool will enable a more sophisticated approach to the problem of user-provenance interaction. In our concept, it would analyze queries expressed by the user in a natural language to produce queries in XQuery. We are going to use text mining techniques and algorithms to achieve this. Of course, this tool will also use VL Domain Ontologies to speed up the process of query analysis.
Acknowledgements. We thank Peter M. A. Sloot, Maciej Malawski, Tomek Gubala and participants of the ViroLab Meeting in Dagstuhl in May 2007 for valuable discussions.
References
1. Balis, B., Bubak, M., Dziwisz, J., Truong, H.-L., Fahringer, T.: Integrated Monitoring Framework for Grid Infrastructure and Applications. In: Innovation and the Knowledge Economy. Issues, Applications, Case Studies, Ljubljana, Slovenia, October 2005, pp. 269–276. IOS Press, Amsterdam (2005)
2. Bean, J.: XML for Data Architects: Designing for Reuse and Integration. Morgan Kaufmann, San Francisco (2003)
3. Foster, I., Kesselman, C., Tuecke, S.: The anatomy of the grid: Enabling scalable virtual organizations. International Journal of High Perform. Comput. Applications 15(3), 200–222 (2001)
4. Goble, C.: Position statement: Musings on provenance, workflow and (semantic web) annotations for bioinformatics. In: Data provenance/derivation workshop (October 2002)
5. Groth, P., Jiang, S., Miles, S., Munroe, S., Tan, V., Tsasakou, S., Moreau, L.: D3.1.1: An Architecture for Provenance Systems. Technical report, University of Southampton (2006)
6. Hey, A.J.G., Trefethen, A.E.: The UK e-Science Core Programme and the Grid. Future Generation Computer Systems 18(8), 1017–1031 (2002)
7. Nemeth, Z., Sunderam, V.: Virtualization in grids: A semantical approach. In: Rana, O.F., Cunha, J.C. (eds.) Grid Computing: Software Environments and Tools, Springer, Heidelberg (2006)
8. Oinn, T., Li, P., Kell, D.B., Goble, C., Goderis, A., Greenwood, M., Hull, D., Stevens, R., Turi, D., Zhao, J.: Taverna / myGrid: aligning a workflow system with the life sciences community. In: Taylor, I., Deelman, E., Gannon, D., Shields, M. (eds.) Workflows for e-Science, pp. 300–319. Springer, New York (2007)
9. Robinson, A., Stevens, R.D., Goble, C.A.: myGrid: Personalised Bioinformatics on the Information Grid. In: Proc. 11th International Conference on Intelligent Systems in Molecular Biology (June 2003)
10. de Roure, D., Jennings, N.R., Shadbolt, N.: The Semantic Grid: a Future e-Science Infrastructure, Grid Computing–Making the Global Infrastructure a Reality, pp. 437–470. Wiley, Chichester (2003)
11. Sloot, P.M., Tirado-Ramos, A., Altintas, I., Bubak, M., Boucher, C.: From Molecule to Man: Decision Support in Individualized E-Health. Computer 39(11), 40–46 (2006)
12. Taylor, I., Shields, M., Wang, I., Harrison, A.: The Triana Workflow Environment: Architecture and Applications. In: Taylor, I., Deelman, E., Gannon, D., Shields, M. (eds.) Workflows for e-Science, pp. 320–339. Springer, New York (2007)
13. Simmhan, Y.L., Plale, B., Gannon, D., Marru, S.: Performance Evaluation of the Karma Provenance Framework for Scientific Workflows. In: Moreau, L., Foster, I. (eds.) IPAW 2006. LNCS, vol. 4145, Springer, Heidelberg (2006)
14. Szomszor, M., Moreau, L.: Recording and reasoning over data provenance in web and grid services. In: Meersman, R., Tari, Z., Schmidt, D.C. (eds.) CoopIS 2003, DOA 2003, and ODBASE 2003. LNCS, vol. 2888, pp. 603–620. Springer, Heidelberg (2003)
15. Zhao, Y., Wilde, M., Foster, I.: Applying the Virtual Data Provenance Model. In: Moreau, L., Foster, I. (eds.) IPAW 2006. LNCS, vol. 4145, Springer, Heidelberg (2006)
16. Zhao, J., Goble, C., Stevens, R.: Semantically linking and browsing provenance logs for e-science. In: Proc. of the 1st International Conference on Semantics of a Networked World, Paris, France. LNCS, Springer, Heidelberg (2004)
Efficiency of Interactive Terrain Visualization with a PC-Cluster
Dariusz Dalecki, Jacek Lebiedź, Krzysztof Mieloszyk, and Bogdan Wiszniewski
Gdańsk University of Technology, Faculty of Electronics, Telecommunications and Informatics, ul. Narutowicza 11/12, 80-952 Gdańsk
{jacekl,krzymi,bowisz}@eti.pg.gda.pl
Abstract. The paper presents the final results of a project aimed at interactive visualisation of terrain based on real spatial data, which was one of the tasks of the CLUSTERIX project. Practical aspects of various performance issues are addressed, in particular to verify the adequacy of using PC-clusters in graphical applications.
Keywords: mesh simplification, dynamic 3D scene, parallel sector simplification.
1 Introduction
Interactive visualization of real terrain based on spatial data has been one of the topics investigated in the recently concluded CLUSTERIX (National Cluster of Linux Systems) Project [6]. In particular, the research addressed two problems that were identified during the development of one of this project's applications, Vis3D, for visualizing real terrain with a PC-cluster¹:
– information represented by spatial data may be incomplete, inconsistent, or in a format incompatible with high performance computing, and
– data must be processed on time to enable real-time interaction of observers with a dynamic 3D scene.
The former problem was investigated by the authors and the results were published in their previous PPAM 2005 paper [3]. This paper presents the results of the research on the latter problem, namely an assessment of the major performance characteristics of the Vis3D application.
¹ This work has been supported in part by the Polish Ministry of Science and Information Society Technologies under grant 6T11 2003C/06098.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 391–399, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Terrain Visualization
Reliable visualization of real terrain requires large volumes of spatial data. Publicly available data ('hgt' files) of a digital Earth terrain model, provided by the Shuttle Radar Topography Mission (SRTM) project [7], include hypsometric values measured every three arc seconds of longitude and latitude. SRTM data are grouped in sectors (tiles) of a size corresponding to 1×1 degree of longitude and latitude. For the Gdansk region (see Figure 1) one tile holds 1442401 (1201×1201) 16-bit values representing 2880000 triangles covering a 65 km × 110 km area of 7150 km². The number of triangles estimated here corresponds more or less to the resolution of contemporary images. It is worth mentioning that in order to visualize a scene from higher altitudes many sectors, each of nearly three million triangles, may be needed. Confronting such a huge number of triangles with the often limited capabilities of graphics cards indicates the need to simplify a visualized scene, especially since the limited resolution of displays would not allow the details of distant sectors to be visualized anyway.
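The tile figures quoted above follow directly from the sampling interval, as the short sketch below shows. The byte layout used in parse_heights (big-endian signed 16-bit integers stored row by row) is the commonly documented layout of 'hgt' files and is an assumption here, since the text itself does not spell it out; the function names are likewise chosen only for this example.

```python
import struct

# One degree sampled every 3 arc seconds gives 3600 / 3 = 1200 intervals,
# hence 1201 samples per side when both edges are included.
SAMPLES_PER_SIDE = 1201


def tile_statistics(samples_per_side=SAMPLES_PER_SIDE):
    """Return (height samples, triangles, size in bytes) for one tile."""
    samples = samples_per_side ** 2
    cells = (samples_per_side - 1) ** 2   # squares between neighboring samples
    triangles = 2 * cells                 # each square splits into two triangles
    size_bytes = 2 * samples              # 16-bit values
    return samples, triangles, size_bytes


def parse_heights(raw, samples_per_side):
    """Decode raw tile bytes into rows of heights (assumed big-endian int16)."""
    count = samples_per_side ** 2
    if len(raw) != 2 * count:
        raise ValueError("unexpected tile size")
    values = struct.unpack(">%dh" % count, raw)
    return [list(values[row * samples_per_side:(row + 1) * samples_per_side])
            for row in range(samples_per_side)]
```

Calling tile_statistics() reproduces the 1442401 samples and 2880000 triangles given in the text, which makes it easy to see why a scene built from many such tiles has to be simplified before rendering.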
2.1 Triangle Mesh Simplification
For a moving observer the perceived level of detail changes for different sectors of the visualized terrain. Details deteriorate when close sectors are left behind, and new details emerge when getting closer to sectors that previously lay far away. For each observation point certain "subareas" of various levels of detail may be distinguished. They take the form of concentric rings surrounding the observer. Since it is easier to represent terrain as rectangular sectors with a uniform distribution of visible detail, rings are most often represented as a composition of such sectors. Note in Figure 1a that sectors of the considered Gdansk area exhibit various levels of detail, measured by the number of triangles. They are shaded from light gray, corresponding to over 100 triangles, down to dark gray, corresponding to no more than 20 triangles.
Fig. 1. Intensity of computations in sectors around a moving observer: (a) a cluster view (b) an aerial view
Mesh simplification may be implemented in a couple of ways. For example, the Real-time Optimally Adapting Meshes (ROAM) algorithm [4] processes a regular mesh of triangles by splitting and merging pairs of triangles, as shown in Figure 2a. Alternatively, the View-Dependent Progressive Mesh (VDPM) method [1] targets an irregular mesh of triangles by performing ecol (edge collapse) and vsplit (vertex split) operations on triangle nodes, as shown in Figure 2b. The Vis3D application uses yet another approach, known as Sequential Greedy Insertion (SGI) [5] (see Figure 3).
Fig. 2. Operations on a triangle mesh: (a) ROAM (split and merge), (b) VDPM (vsplit and ecol)
Fig. 3. Greedy insertion of nodes
The algorithm starts from a maximally simplified mesh (two triangles per sector) and gradually inserts nodes into the mesh, in each step at the place where the greatest error is observed. An inserted node implies a reorganization of the triangles around it: any existing triangle whose circumscribed circle contains the new node is eliminated from the mesh. The resulting "hole" is then filled with triangles determined by the new node and pairs of neighboring nodes that lie on the hole's border.
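The two ingredients of this procedure, picking the point of greatest error and testing which triangles a new node invalidates, can be sketched in a few lines. The first function below is deliberately a one-dimensional analogue (a height profile approximated by line segments instead of a surface approximated by triangles), chosen to keep the example short; it is not the Vis3D implementation. The second function is the standard circumscribed-circle test for a counter-clockwise triangle. All names and the max_error parameter are assumptions for this illustration.

```python
import bisect


def greedy_insert_profile(heights, max_error):
    """1D analogue of greedy insertion on a height profile.

    Starts from the coarsest approximation (the two endpoints) and keeps
    inserting the sample with the largest interpolation error until every
    error is at most max_error. Returns the sorted indices kept.
    """
    selected = [0, len(heights) - 1]
    while True:
        worst_err, worst_idx = 0.0, None
        for a, b in zip(selected, selected[1:]):
            for i in range(a + 1, b):
                t = (i - a) / (b - a)
                approx = heights[a] + t * (heights[b] - heights[a])
                err = abs(heights[i] - approx)
                if err > worst_err:
                    worst_err, worst_idx = err, i
        if worst_idx is None or worst_err <= max_error:
            return selected
        bisect.insort(selected, worst_idx)


def in_circumcircle(a, b, c, p):
    """True if p lies strictly inside the circle through a, b, c.

    The triangle (a, b, c) must be given in counter-clockwise order.
    """
    ax, ay = a[0] - p[0], a[1] - p[1]
    bx, by = b[0] - p[0], b[1] - p[1]
    cx, cy = c[0] - p[0], c[1] - p[1]
    det = ((ax * ax + ay * ay) * (bx * cy - cx * by)
           - (bx * bx + by * by) * (ax * cy - cx * ay)
           + (cx * cx + cy * cy) * (ax * by - bx * ay))
    return det > 0
```

In the two-dimensional case every triangle for which in_circumcircle returns True for the new node would be removed, and the hole re-triangulated around that node, exactly as the paragraph above describes.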
2.2 Parallel Simplification of Sectors
Simplification of data representing a wide geographical area involves a rather large number of loosely coupled parallel tasks computing the respective sectors. Therefore it is quite natural to parallelize the simplification of sectors, provided that the simplification tasks are physically distributed [2]. It may be expected that the total scene computation time will shorten proportionally to the number of available processing nodes. Partitioning the visualized area into sectors, e.g. 1 km × 1 km, allows for an equal distribution of data among processing nodes. This may, however, limit the simplification of sectors observed from a greater distance, since the simplest sector cannot include fewer than two triangles. Therefore distant sectors should be grouped into larger units, called hypersectors, each to be simplified by one task. See for example the varying sizes of sectors in Figure 1a. Assigning sectors to the respective tasks (or computing nodes) could be done in many ways. The simplest way is to assume a uniform "complexity" of the visualized geographical area. In such a case tasks receive sector/hypersector data distributed equally with regard to the number of sectors. The sector simplification algorithm then takes into account only the position of the observer reported by the visualization station. This solution, illustrated in Figure 4a, allows many visualization stations to be connected to the application to view a dynamic scene in a panoramic fashion.
Fig. 4. Vis3D architecture: (a) without supervision (b) with supervision (diagram labels: cluster, geospatial data, sector data, processes 1..N, server and clients 1..N, MPI, observer's position, TCP/IP multicast, graphical workstation)
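The unsupervised variant, in which sectors are spread evenly under the assumption of uniform terrain complexity, amounts to a static assignment that can be sketched as below. Round-robin is used here only as one plausible way to obtain an equal distribution; the text does not say which scheme Vis3D applies, and the function name is hypothetical.

```python
def distribute_sectors(sector_ids, worker_count):
    """Assign sectors to workers so that loads differ by at most one sector."""
    if worker_count < 1:
        raise ValueError("at least one worker is required")
    assignment = [[] for _ in range(worker_count)]
    for position, sector in enumerate(sector_ids):
        assignment[position % worker_count].append(sector)
    return assignment
```

With the 60 sectors needed at the lowest altitude and 8 workers, four workers receive 8 sectors and four receive 7. Such a static split ignores how hard each sector is to simplify, which is the limitation the supervised variant with dynamic mapping addresses.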
Various tests of the Vis3D application indicated that data may sometimes be lost when sent simultaneously to the graphical station for the final rendering of a scene view. To eliminate this phenomenon the notion of a token was introduced. Only the current owner of the token can send out data upon completion of its sector simplification. In order to guarantee fair use of the token, an additional token server was introduced. It acts as an arbiter resolving conflicting requests from the tasks simplifying sectors of the scene. Communication between the simplifying tasks and the token server was implemented with MPI (see Figure 4b). The server also takes responsibility for the dynamic mapping of sectors onto the processing nodes.
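The arbitration performed by the token server can be modeled, independently of MPI, as a queue of pending requests with a single owner at any time. The sketch below is a sequential simulation under the assumption of a first-come, first-served policy; the text only states that the server guarantees fair use, so the policy, the class name and the method names are illustrative choices.

```python
from collections import deque


class TokenServer:
    """Grants a single send token to requesting tasks in arrival order."""

    def __init__(self):
        self.waiting = deque()
        self.owner = None

    def request(self, task_id):
        """Queue a request; returns the current owner after the call."""
        self.waiting.append(task_id)
        return self._grant()

    def release(self, task_id):
        """Give the token back; returns the next owner, or None if idle."""
        if self.owner != task_id:
            raise RuntimeError("only the token owner may release it")
        self.owner = None
        return self._grant()

    def _grant(self):
        # Hand the token to the oldest waiting task if nobody holds it.
        if self.owner is None and self.waiting:
            self.owner = self.waiting.popleft()
        return self.owner
```

Since only the owner may send sector data, two tasks can never transmit to the graphical station at the same moment, which is the condition under which data loss was observed.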
3 Performance Testing Experiments
Experiments were run on a PC cluster (Intel Itanium II 1.2 GHz, 16 GB RAM) in different configurations ranging from 2 up to 32 nodes, and in a standard lab of two to four PCs (Intel Pentium IV 3.6 GHz, 1 GB RAM) with a Fast Ethernet LAN. In each case the graphical station was a Dell Latitude D50 laptop with a specially written terrain visualisation program which, besides scene rendering, was capable of measuring and displaying various parameters characterizing the simplification of the scene. Some of them may be seen in Figure 5 as horizontal bars representing the numbers of sectors received for visualization from each client. Their similar length (as shown in Figure 5b) indicates a uniform distribution of the sector computation load among processing nodes. As mentioned before, the number of processed sectors equals the number of processes simplifying the mesh. The visualization program also provides an insight into the structure of the simplified terrain (see Figure 1b).
Fig. 5. Visualized terrain: (a) observer's view, (b) sectors of the area
During the tests of various configurations involving the cluster and LAN based versions of the Vis3D application, hypsometric data of a 65 km × 110 km area around Gdansk were used. This terrain was found representative for the tests, as it includes flat surfaces to the North (the Bay of Gdansk), slightly elevated coastal areas in the center, and moraine hills surrounding the bay area from the South West (see Figure 5b). Measurements included the total number of triangles in a scene, their reduction rate, the scene generation time, the average scene update time, the maximum allowed speed of the observer and the average single node computation time in three scenarios: an observer moving along the same route at altitudes of 1 km, 4 km and 10 km over the area shown in Figure 5. For these altitudes the simplification of sectors required 60, 174 and 286 tasks respectively (one per sector).
3.1 Cluster Based Visualization
Basic performance characteristics collected during the experiments with the cluster based Vis3D application included, among others:
– scene generation time, defined as the time from sending a request from the graphical workstation to receiving all sectors required for rendering. Each sector was calculated from scratch up to the level of detail implied by its distance from the initial position of the observer (Figure 6a);
– average scene update time, i.e. the average time required to compute a new level of detail of a sector changing its distance from the observer (Figure 6b);
– maximum speed of the observer that cannot be exceeded due to delays in computing the new level of detail of changing sectors (Figure 6c);
– average single node computation time, indicating the scalability of the Vis3D application (Figure 6d).
Experimental configurations of cluster nodes are ordered in Figure 6 according to the increasing degree of parallelism, taking into account their double-core processors, i.e. from 2×1 (two nodes with a single task each), through 2×2 (two nodes with two tasks each), up to a 16×2 configuration. It can be seen from Figure 6a that computations with a lower degree of parallelism require considerably more time to compute the initial scene view; however, a configuration with four nodes is already sufficient and a further increase of computational power is not needed. One exception is the highest altitude of 10 km: there the level of detail does not matter much even for the closest sectors, so they may practically be calculated on single or double processor machines. While computation of the initial scene view may be demanding for weaker configurations, scene updates are much faster (tenths of a second, as shown in Figure 6b), especially for higher altitudes, as at 4 km and above the terrain flattens significantly. However, the performance of updates at lower altitudes becomes less predictable, as the terrain shape may change rapidly due to abrupt maneuvers of the observer. Computations of particular sectors may dominate individual nodes and slow down the completion of the entire scene update. Note the "saw"-like shape of this characteristic for the altitude of 1 km in Figure 6b. Its shape will certainly vary significantly with the exact route taken by the observer, the observer's speed and orientation, as well as the shape of the terrain below. This observation is reinforced by analyzing the maximum allowable speeds in Figure 6c. At the lowest altitude of 1 km the speed varies for the same reason as mentioned before, while for the higher altitudes it does not depend on the applied degree of parallelism. Surprisingly, the maximum speed at 10 km is lower than at 4 km, which may be explained by the quadratic increase of the number of visible sectors to compute with increasing altitude versus the only linear decrease of the number of details in each sector. In any case, the basic requirement for the Vis3D system to allow for the whole range of subsonic speeds [3] is met, as the maximum speeds are always well within the upper 500-1000 km/h range.
Fig. 6. Performance characteristics of cluster based Vis3D: (a) scene generation time, (b) scene update time, (c) max observer's speed, (d) single node computation time (plots over configurations 2x1 to 16x2 on the "nodes x tasks" axis; vertical axes in seconds for (a), (b), (d) and in km/h for (c); series: 1 km, 60 sectors; 4 km, 174 sectors; 10 km, 286 sectors)
Finally, the task mapping algorithm in Vis3D is scalable, since for each altitude the average single node computation time decreases in inverse proportion to the number of nodes, as shown in Figure 6d.
3.2 LAN Based Visualization
The computational power of the processors of the PC-cluster used in the experiments was estimated at nearly 4000 BogoMIPS (a measure based on the number of null instructions performed in 1 second). Given the almost doubled power of the processors in the PC lab (the LAN based configuration of Vis3D) compared to the processors in the cluster nodes, it was expected that the cluster based configuration of Vis3D would perform worse. Surprisingly, it performed better, probably owing to the architecture of the Itanium II processor (64-bit, double core) used by the cluster and its 16 times larger RAM. For comparison, some measured results of the total computation time of all sectors in a scene (over 1 million triangles) are shown in Figure 7.
[Figure 7: horizontal axis shows nodes x tasks (2x1, 2x2, 4x1, 4x2); vertical axis shows seconds (0 to 160); data series: 1 km, 60 sectors, LAN; 10 km, 286 sectors, LAN; 1 km, 60 sectors, cluster; 10 km, 286 sectors, cluster.]
Fig. 7. Computation time of all sectors
The total time shown in Figure 7 was measured as the sum of local execution times of all tasks involved in computing the initial scene view. In other words, it equals the time required to calculate the scene sequentially with only a single processor. It can be seen that the LAN configuration processors were not able to cope well with more than 1 task per node for the relatively difficult “low altitude” sectors. Sector computations stabilize only for the less detailed “high altitude” sectors (see 10 km, 286 sectors, LAN based configuration in Figure 7). On the other hand, performance of the cluster based configuration remains stable in this regard for all altitudes and improves with increasing altitude. We skip other characteristics of the LAN based configuration, as they have shapes similar to their cluster based counterparts shown in Figure 6. One difference is that the respective curves are placed higher, as the LAN based configuration generally performed worse. In particular, the average single node computation time characteristic indicated almost linear scalability of the Vis3D application in the LAN based configuration, as in its cluster based counterpart.
4 Conclusions
Spatial data used for performance testing of the Vis3D application were optimized before the experiments. This was possible because terrain data remained static during execution of all scenarios. Experiments with unoptimized data were carried out in parallel to the ones described earlier in the paper. Their results, when compared to the “optimized” case, indicate that all scenarios began to perform very well at a rather low degree of parallelism (typically four nodes were enough) only for optimized data. Incidentally, four (or more) processing nodes are quite a common feature of high-end graphical workstations, which typically have 8-16 processors. In that sense, using clusters for realistic terrain visualization might be considered “using a steam hammer to kill an ant”. However, results obtained for the latter case (unoptimized data) indicate that simplification of sectors in real time may consume considerably more of the computational power offered by a PC-cluster. While for the optimized hypsometric data of the Gdansk area about one million triangles could be effectively processed by no more than four processors (see Figure 6a), unoptimized data involved 2-3 times as many triangles, depending on the particular altitude, and required considerably more nodes to achieve the same rate of reduction as for the optimized case. For example, altitudes of 1 and 4 km required over 32 and 16 nodes, respectively, to observe the same effect as in Figure 6a for four nodes processing optimized data. This is an important result for interactive visualization of dynamic data (acquired on-line) that cannot be optimized prior to the interactive visualization session. These may include data modelling of meteorological phenomena, such as weather fronts or building-up clouds, dynamic surfaces, like sea waves, or physical processes, like smoke, forest fires, etc. Such dynamic data could be produced by specialized simulators used in future interactive visualization exercises with the Vis3D application. Further research on this topic, inspired by the CLUSTERIX project, is currently under way.
References
1. Hoppe, H.: Smooth View-Dependent Level-of-Detail Control and its Application to Terrain Rendering. research.microsoft.com/hoppe/svdlod.pdf
2. Lebiedź, J., Mieloszyk, K.: Real Terrain Visualisation on the Basis of GIS Data. In: Bolc, L., Michalewicz, Z., Nishida, T. (eds.) IMTCI 2004. LNCS (LNAI), vol. 3490, pp. 40–49. Springer, Heidelberg (2005)
3. Lebiedź, J., Mieloszyk, K., Wiszniewski, B.: Real Terrain Visualisation with a Distributed PC-Cluster. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 349–356. Springer, Heidelberg (2006)
4. Duchaineau, M., Wolinsky, M., Sigeti, D.E., Miller, M.C., Aldrich, C., Mineev-Weinstein, M.B.: ROAMing Terrain: Real-time Optimally Adapting Meshes, http://www.llnl.gov/graphics/ROAM
5. Garland, M., Heckbert, P.S.: Fast Polygonal Approximation of Terrains and Height Fields. gpraphics.cs.uiuc.edu/garland/CMU/scape/scape.pdf
6. Wyrzykowski, R., Meyer, N., Stroiński, M.: Clusterix: National Cluster of Linux Systems (2002), http://www.clusterix.pl
7. Shuttle Radar Topography Mission. NASA, http://www2.jpl.nasa.gov/srtm
Implementing Commodity Flow in an Agent-Based Model E-Commerce System

Maria Ganzha (1,2), Maciej Gawinecki (2), Pawel Kobzdej (2), Marcin Paprzycki (2,3), and Tomasz Serzysko (4)

1 Elblag University of Humanities and Economy, Elblag, Poland
2 Systems Research Institute Polish Academy of Science, Warsaw, Poland
3 Warsaw Management Academy, Poland
4 Warsaw University of Technology, Poland
Abstract. In our work we are developing a complete agent-based e-commerce system. Thus far we have been focusing on interactions between clients and shops (C2B relationships). In this work we discuss how the proposed system can be augmented with a logistic subsystem and discuss specifics of its implementation.
1 Introduction
In our work, we are developing a complete model agent-based e-commerce system [7, 3]. Thus far we have considered interactions between clients and shops (C2B relationships) and assumed that in each shop there exists a warehouse where products, to be sold through price negotiations, are stored. We did not address questions like: where these products come from, how they are restocked, etc. Only recently have we proposed an initial design of the logistics subsystem [8]. While the processes involved in a client purchasing a product are very similar to those involved when stores restock their warehouses, there are also important differences concerning: product demand prediction, offer selection criteria, interactions with wholesalers (including B2B portals as infomediaries), methods of price negotiations, trust management, etc. These differences, which underline the need for a special logistics subsystem, as well as assumptions about the business functions that shape it, have been presented in [8] and are omitted here. The aim of this paper is to discuss how the logistics subsystem has been implemented in the JADE agent environment. Specifically, we use two scenarios (failed and successful purchase) to illustrate the flow of messages between logistics agents. We discuss the type and content of messages exchanged in agent interactions, while utilizing JADE’s Sniffer to illustrate operation of the system. Before proceeding let us observe that the proposed logistics subsystem is not “stand alone” (e.g. similar to those considered in [6, 1, 10]). Instead, it has been created within the context of a specific agent system, which has directly influenced its design. Furthermore, the focus of this work is on agent implementation and interactions. Therefore, we omit important topics like: how sale forecasts are derived and transformed into purchasing orders, how agents evaluate wholesaler offers, etc. Currently we have implemented rudimentary mechanisms that support these functions and encapsulated them into replaceable modules. Therefore, readers should assume that, for instance, when we write that “received offers are evaluated,” their favorite evaluation method has been utilized (i.e. an appropriate module was replaced with one containing that evaluation method).

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 400–408, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 System Description
Let us start from the Use Case diagram of the system (Figure 1). This diagram contains functions involved in logistics and is conceptualized on a slightly higher level than the diagrams included in [7, 3]. Therefore these diagrams, as well as the description of non-logistics related functions presented there, should be consulted for all additional details.

Our system models a distributed marketplace in which e-shops sell products to incoming buyers. Specifically, User-Clients are represented by (1) the Client Agent that orchestrates actions involved in purchasing a product, (2) the Client Decision Agent that is responsible for data analysis and decision making in support of User-Client requests, and (3) a pool of Buyer Agents (BA) that represent the User-Client in price negotiations. In Figure 1 we can also see two Client Information Centers: central repositories where matchmaking information (which e-store sells which product, and which wholesaler supplies which product) is stored (see [9] for analysis of approaches to matchmaking). Let us note that the Logistics CIC is represented “outside of the system.” This is to indicate that its services could be provided, for instance, by a B2B portal. Finally, the Shop Agent (SA) is the counterpart of the Client Agent and orchestrates all functions taking place in the e-store. Decisions made in the store are the result of data processing performed by the Shop Decision Agent (SDA). The SA is supported by the Gatekeeper Agent (GA) that is responsible for admitting incoming BAs into price negotiations, management of a pool of Seller Agents, and negotiation preparation. The role of the GA ends when a group of BAs is released to negotiate prices with a Seller Agent.

Agents involved in logistics are led by the Warehouse Agent (WA), which is responsible for: (1) handling product reservations [3], and (2) managing the warehouse; the latter is the function we are interested in here. To ensure appropriate supply of products the WA follows the forecast delivered by the SDA and utilizes (1) the Logistics Agent (LA) that is the “brain” behind ordering products, and (2) a pool of Ordering Agents which are responsible for handling individual purchasing orders.

[Figure 1: Use Case diagram; labels include User-Client, User-Seller, Client, Buyer, Seller, Shop, Gatekeeper, CIC, Logistics CIC, Client Decision Agent, Shop Decision Agent, Wholesale Agent, Warehouse Agent, Logistic Agent, Ordering Agent, Registration, Product Registration, Creating list of Shops, Creating list of suppliers, Ordering purchase, Ordering product, Organization of buying process, Admitting to negotiations, Preparing negotiation, Negotiation, Selling, Sale finalization, Client Decision Making, Shop Decision Making, Ensuring product levels, Stock management.]

Fig. 1. Use Case diagram of the proposed system
3 Product Restocking Process
To illustrate the restocking process we use two scenarios and output generated by the JADE Sniffer Agent [2]. The first scenario, depicted in Figure 2, represents a process that ended in a failure. In the description we use lowercase agent names to match the naming convention of Figure 2; we denote messages as {mn}, where n is the message number. Note that Figure 2 should be considered together with the sequence diagram included in [8].

Fig. 2. Messages in the first restocking scenario as observed by the Sniffer Agent

Let us assume that the sda has prepared a forecast of the required stock level of the Canon EOS 10D camera for the period between February 1st and February 28th. It states that the sales will be 4 items per week, with a deviation of 2 items. Furthermore, the sda requested that the purchase price stay below US$ 1,500. This prediction is communicated to the wa as a FIPA Inform ACL message {m1} with the following value of the content slot:

    (SDAPrediction
      :predictionDescription
        (PredictionDescription
          :globalProductID CanonEOS10D-566782
          :priceMax 1500
          :predictionDeviation 2
          :predictionPeriod
            (Period
              :from 20070201T0000000+02:00
              :to 20070228T0000000+02:00)
          :predictionAmount 4))
Upon receiving this message, the wa checks the stock level and decides to purchase between 1 and (preferably) 5 cameras. Furthermore, the cameras have to be delivered before 4 p.m. on January 31st; however, an offer with a promised delivery time of January 29th at 10:00 a.m. would be ideal. Based on these constraints, the wa sends to the la a message {m2} specifying the requested purchase:

    (action
      (agent-identifier :name la@beethoven:1099/JADE)
      (OrderRequest
        :orderDescription
          (OrderDescription
            :deliveryTimeRequired 20070131T1600000+02:00
            :priceMax 1500
            :amountRequired 1
            :deliveryTimeAllowed 20070129T1000000+02:00
            :amountPreferred 5
            :globalProductID CanonEOS10D-566782)))
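The content of message {m2} is a nested expression of concept names and named slots. The following minimal sketch, which is not part of the described JADE implementation, shows how such a content string could be assembled from nested data. The helper `to_sl` and its data layout are hypothetical and introduced only for illustration; the slot names and values are taken from the listing above.

```python
# Hypothetical helper, not a JADE API: renders nested Python data as an
# SL-like content string of the kind shown for message {m2}.

def to_sl(term):
    """Render a (concept_name, slots) pair, a list, or an atom."""
    if isinstance(term, tuple):
        name, slots = term
        parts = [name]
        for key, value in slots.items():
            parts.append(f":{key} {to_sl(value)}")
        return "(" + " ".join(parts) + ")"
    if isinstance(term, list):
        # Lists are rendered with the 'sequence' keyword used in the listings.
        return "(sequence " + " ".join(to_sl(t) for t in term) + ")"
    return str(term)


# Slot names and values copied from the OrderRequest in message {m2}.
order_description = ("OrderDescription", {
    "deliveryTimeRequired": "20070131T1600000+02:00",
    "priceMax": 1500,
    "amountRequired": 1,
    "deliveryTimeAllowed": "20070129T1000000+02:00",
    "amountPreferred": 5,
    "globalProductID": "CanonEOS10D-566782",
})
order_request = ("OrderRequest", {"orderDescription": order_description})

content = to_sl(order_request)
print(content)
```

In a real agent the content would be produced by the ontology support of the agent platform rather than by string concatenation; the sketch only makes the nesting of the message explicit.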
The la asks the cic agent (an instance of the Logistics CIC; see Figure 1) about suppliers of the Canon EOS 10D camera, using a FIPA QUERY-REF message {m3} (see [5] for details concerning the product ontology and querying the CIC). The cic replies with a FIPA INFORM message {m4} containing a list of Wholesale Agents. This response could have the following content:

    (all ?supplier (Supplies ?supplier CanonEOS10D-566782))
    (set
      (agent-identifier
        :name wha2@beethoven:1099/JADE
        :addresses (sequence http://beethoven:7778/acc))
      (agent-identifier
        :name wha1@beethoven:1099/JADE
        :addresses (sequence http://beethoven:7778/acc)))
When the list is non-empty, the la “prunes” all suppliers that have their trust value below a threshold (trust information is stored in the LA database; see also [4]). The list of remaining wholesalers is supplemented with trust information (to be used to rank offers) and sent {m5} to one of the free OAs (here oa1). Note that the message contains the IssueOrder action with the OrderDescription and the suppliers list and thus contains all information necessary to execute an order.

    (action
      (agent-identifier :name oa1@beethoven:1099/JADE)
      (IssueOrder
        :orderDescription
          (OrderDescription
            :deliveryTimeRequired 20070131T1600000+02:00
            :deliveryTimeAllowed 20070129T1000000+02:00
            :amountRequired 1
            :amountPreferred 5
            :priceMax 1500
            :globalProductID CanonEOS10D-566782
            :suppliers
              (sequence
                (SupplierDescription
                  :trust 0.98
                  :name
                    (agent-identifier
                      :name wha2@beethoven:1099/JADE
                      :addresses (sequence http://beethoven:7778/acc)))
                (SupplierDescription
                  :trust 0.7
                  :supplier
                    (agent-identifier
                      :name wha1@beethoven:1099/JADE
                      :addresses (sequence http://beethoven:7778/acc)))))))
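The pruning step performed by the la before message {m5} is sent can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function `prune_suppliers`, the threshold value 0.5, the agent `wha3` and the treatment of unknown suppliers as having zero trust are all hypothetical; only the trust values 0.98 and 0.7 come from the listing above.

```python
from dataclasses import dataclass


@dataclass
class Supplier:
    name: str
    trust: float


def prune_suppliers(candidates, trust_db, threshold):
    """Drop suppliers whose stored trust is below the threshold.

    Assumption: a supplier missing from the trust database gets trust 0.0.
    The survivors carry their trust value so that it can be attached to
    the supplier descriptions sent to the Ordering Agent.
    """
    kept = []
    for name in candidates:
        trust = trust_db.get(name, 0.0)
        if trust >= threshold:
            kept.append(Supplier(name, trust))
    return kept


# wha1 and wha2 use the trust values of message {m5}; wha3 is hypothetical.
trust_db = {"wha1": 0.7, "wha2": 0.98, "wha3": 0.2}
suppliers = prune_suppliers(["wha2", "wha1", "wha3"], trust_db, threshold=0.5)
print([(s.name, s.trust) for s in suppliers])
```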
After obtaining the request from the la, the oa engages in FIPA ContractNet Protocol interactions with the Wholesale Agents (WhA) from the list (here wha1 and wha2). It sends the following FIPA CallForProposal message {m6, m7}, containing the Sell action with the initial contract requirements.

    (action
      (agent-identifier :name wha2@beethoven:1099/JADE)
      (Sell
        :globalProductID CanonEOS10D-566782
        :deliveryTimeRequired 20070131T1600000+02:00
        :deliveryTimeAllowed 20070129T1000000+02:00
        :amountRequired 1
        :amountPreferred 5))
The WhAs evaluate the CFP and, if the terms contained there are acceptable, respond by sending FIPA Propose messages; e.g. the wha2 agent sends a message proposing the sale of 4 cameras, at $1450 each, to be delivered by January 30th at 14:00 {m8}:

    (Proposition
      :amountSpecific 4
      :priceSpecific 1450
      :supplier
        (agent-identifier
          :name wha2@beethoven:1099/JADE
          :addresses (sequence http://beethoven:7778/acc))
      :deliveryTimeSpecific 20070130T1100000+14:00)
The oa evaluates the received offers and, if there is at least one that satisfies its criteria, accepts the best one (here that of wha1) by sending an ACL message {m10}:

    (PropositionAccepted
      :proposition
        (Proposition
          :amountSpecific 4
          :priceSpecific 1450
          :supplier
            (agent-identifier
              :name wha1@beethoven:1099/JADE
              :addresses (sequence http://beethoven:7778/acc))
          :deliveryTimeSpecific 20070130T1100000+02:00))
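Since the offer evaluation method is encapsulated in a replaceable module, the following sketch shows only one possible, hypothetical evaluation rule: an offer is acceptable if it respects the maximum price, the minimum amount and the hard delivery deadline, and among acceptable offers the cheapest one wins, with earlier delivery as the tie-breaker. The class and function names are assumptions made for illustration; trust information is deliberately ignored here. The example values mirror messages {m2}, {m8} and {m10}.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Order:
    price_max: float
    amount_required: int
    delivery_required: datetime  # hard deadline


@dataclass
class Proposition:
    supplier: str
    amount: int
    price: float
    delivery: datetime


def acceptable(offer, order):
    """Check an offer against the hard constraints of the order."""
    return (offer.price <= order.price_max
            and offer.amount >= order.amount_required
            and offer.delivery <= order.delivery_required)


def best_offer(offers, order):
    """Return the best acceptable offer, or None if there is none.

    Hypothetical ranking: lowest price first, then earliest delivery.
    """
    candidates = [o for o in offers if acceptable(o, order)]
    if not candidates:
        return None
    return min(candidates, key=lambda o: (o.price, o.delivery))


order = Order(price_max=1500, amount_required=1,
              delivery_required=datetime(2007, 1, 31, 16, 0))
offers = [
    Proposition("wha2", 4, 1450, datetime(2007, 1, 30, 14, 0)),
    Proposition("wha1", 4, 1450, datetime(2007, 1, 30, 11, 0)),
]
winner = best_offer(offers, order)
print(winner.supplier)
```

Under this particular rule the two offers tie on price and the earlier delivery decides, which is consistent with wha1 being accepted in the scenario; a different module could of course rank the offers differently.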
The wha1 confirms the order with a FIPA Inform message {m11} containing the PropositionConfirmed predicate and a unique orderID. This re-confirmation is required by the Contract Net Protocol and has the following form:

    (PropositionConfirmed
      :proposition
        (Proposition
          :amountSpecific 4
          :priceSpecific 1450
          :supplier
            (agent-identifier
              :name wha1@beethoven:1099/JADE
              :addresses (sequence http://beethoven:7778/acc))
          :deliveryTimeSpecific 20070130T1100000+02:00)
      :orderID 4904904)
Next, the oa forwards the confirmation to the la (a success message {m12}). The la checks the delivery time and waits. If the promised delivery time has passed, no delivery notification has been received from the wa, and there is still time before the hard deadline, the la contacts {m13} an available OA (here oa1 again), and a FIPA Request message reminding about the order is sent {m14}. Agent oa1 contacts the supplier by sending a reminder containing the ID of the order:

    (action
      (agent-identifier :name oa1@beethoven:1099/JADE)
      (Deliver :orderID 4904904))
In our scenario the wha1 re-confirms {m15} the order with a new expected delivery time of January 31st, 1 p.m. This information is again forwarded to the la ({m16}), which checks the delivery time and waits again. After the promised time has passed, if no delivery notification has been received from the wa and there is still time before the deadline, the la contacts an available OA (the oa1 again). This time, however, a reminder has already been sent to this supplier. Therefore the wha1 is removed from the list of potential suppliers, and a new order request is sent ({m17}). The oa1 initiates the FIPA ContractNet Protocol with the last remaining supplier (the wha2), which wins; its offer is accepted and confirmed ({m18 − m21}). The confirmation is forwarded to the la ({m22}). The la checks the delivery time and waits yet again. When the promised time passes and no delivery notification has been received from the wa, the deadline has been crossed, so the ordering process fails and a proper notification is sent to the wa:

    (Result
      (action
        (agent-identifier :name la@beethoven:1099/JADE)
        (OrderRequest
          :orderDescription
            (OrderDescription
              :deliveryTimeRequired 20070131T1600000+02:00
              :priceMax 1500
              :amountRequired 1
              :deliveryTimeAllowed 20070129T1000000+02:00
              :amountPreferred 5
              :globalProductID CanonEOS10D-566782)))
      (Failure :reason SupplierDidNotDelivered))
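The follow-up behaviour of the la in the failed scenario (wait, remind once, switch supplier, give up at the deadline) can be condensed into a small decision function. The sketch below is an interpretation of the narrative above rather than the authors' implementation; the function name `next_step` and the returned action labels are hypothetical.

```python
from datetime import datetime


def next_step(now, promised, deadline, delivered, reminded):
    """Decide what the la does next for one outstanding order.

    now       -- current time
    promised  -- delivery time promised by the current supplier
    deadline  -- hard deadline taken from the order description
    delivered -- True if the wa has reported the delivery
    reminded  -- True if this supplier has already been reminded once
    """
    if delivered:
        return "report-success"
    if now < promised:
        return "wait"
    if now >= deadline:
        return "report-failure"
    if not reminded:
        return "send-reminder"
    return "drop-supplier-and-reorder"


deadline = datetime(2007, 1, 31, 16, 0)

# First promise of wha1 (January 30th, 11:00) has passed: remind.
step1 = next_step(datetime(2007, 1, 30, 12, 0), datetime(2007, 1, 30, 11, 0),
                  deadline, delivered=False, reminded=False)
# Second promise of wha1 (January 31st, 1 p.m.) has passed: switch supplier.
step2 = next_step(datetime(2007, 1, 31, 14, 0), datetime(2007, 1, 31, 13, 0),
                  deadline, delivered=False, reminded=True)
# Deadline crossed without delivery: the ordering process fails.
step3 = next_step(datetime(2007, 1, 31, 16, 30), datetime(2007, 1, 31, 15, 0),
                  deadline, delivered=False, reminded=False)
print(step1, step2, step3)
```

The concrete check times (12:00, 14:00, 16:30) and the promise of 15:00 in the last call are illustrative values chosen to exercise each branch; they do not appear in the paper.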
While the process depicted in Figure 2 ended in a failure, in Figure 3 we depict a restocking process that ended in success. Now we focus our attention only on those messages that differ from the process described before. Therefore, we start from the moment when the la has successfully finished ordering a product and communicates this event to the wa (as a FIPA Agree message {m13}):

    ((action
       (agent-identifier :name la@beethoven:1099/JADE)
       (OrderRequest
         :orderDescription
           (OrderDescription
             :deliveryTimeRequired 20070131T1600000+02:00
             :priceMax 1500
             :amountRequired 1
             :deliveryTimeAllowed 20070129T1000000+02:00
             :amountPreferred 5
             :globalProductID CanonEOS10D-566782)))
     (Order :orderID 4904904))

Fig. 3. Messages in the second restocking scenario as observed by the Sniffer Agent
When the ordered product is delivered to the warehouse, the wa confirms it to the la by sending a FIPA Inform message {m14}, and in turn the la informs the wa about the success of the delivery process {m15}. This last message may seem spurious, as the wa already “knows” that the Canon EOS cameras have arrived. However, we send it to complete the “communication loop” that started with the wa sending out the original purchasing order. By sending this last message we close the case of this order without disrupting the communication protocol.
4 Concluding Remarks
In this paper we have described in some detail the way in which we have implemented the logistics subsystem that was recently added to our model agent-based e-commerce system. Our description was based on two actual runs of the JADE-implemented subsystem. These two runs represented two basic scenarios of product restocking: one ending with a success, and one ending with a failure. We have focused our description on message types, forms and content. Currently the logistics subsystem is being integrated with the core of the system and we expect this process to be completed shortly. The next steps in our research will be (1) to enhance the ways in which sale predictions are made, ordering requests are issued (on the basis of these predictions) and offers are evaluated; (2) to extend the scope of negotiations to include items like: price of insurance, multiple options and prices of product delivery, etc. We will report on our progress in subsequent reports.
References
1. Agentis Software (2007), http://www.agentissoftware.com/
2. JADE: Java Agent DEvelopment framework. TILab (2007), http://jade.tilab.com/
3. Bădică, C., Bădită, A., Ganzha, M., Paprzycki, M.: Implementing rule-based automated price negotiation in an agent system. Journal of Universal Computer Science 13(2), 244–266 (2007)
4. Bădică, C., Ganzha, M., Gawinecki, M., Kobzdej, P., Paprzycki, M.: Towards trust management in an agent-based e-commerce system: initial considerations. In: Zgrzywa, A. (ed.) Proc. of the MISSI 2006 Conference, pp. 225–236 (2006)
5. Bădică, C., Ganzha, M., Gawinecki, M., Kobzdej, P., Paprzycki, M., Scafes, M., Popa, G.-G.: Managing information and time flow in an agent-based e-commerce system. In: Petcu, D., et al. (eds.) Proceedings of the Fifth International Symposium on Parallel and Distributed Computing, pp. 352–359. IEEE Computer Society Press, Los Alamitos (2006)
6. Butler, C.A., Eanes, J.T.: Software agent technology for large scale, real-time logistics decision support. US Army Research Report ADA392670, 23 pages (2001)
7. Ganzha, M., Paprzycki, M., Bădică, C., Bădită, A.: Developing a Model Agent-based E-commerce System. In: E-Service Intelligence: Methodologies, Technologies and Applications, pp. 555–578. Springer, Berlin (2007)
8. Serzysko, T., Gawinecki, M., Kobzdej, P., Ganzha, M., Paprzycki, M.: Introducing commodity flow to an agent-based model e-commerce system. In: Proceedings of the 2007 IAT Conference (in press, 2007)
9. Trastour, D., Bartolini, C., Preist, C.: In WWW 2002: Proc. of the 11th international conference on World Wide Web, pp. 89–98. ACM Press, New York (2002)
10. Ying, W., Dayong, S.: Multi-agent framework for third party logistics in e-commerce. Expert Systems with Applications 29(2), 431–436 (2005)
MPI and OpenMP Computations for Nuclear Waste Deposition Models

Ondřej Jakl, Roman Kohut, and Jiří Starý

Institute of Geonics, Academy of Sciences of the Czech Republic
[email protected], [email protected], [email protected]
Abstract. In this paper, we introduce one important source of high-performance computations, namely mathematical modelling of deep geological repositories of spent nuclear fuel, and describe two real concepts of such repositories. Mathematical modelling is practically the only way to predict the behaviour of such facilities over their long-term existence. We present a simplified mathematical model that considers the thermo-mechanical behaviour of the repositories, and the corresponding in-house solver. This solver is analyzed as a parallel application with both MPI and OpenMP realizations. Using the two repositories and the related demanding computations as examples, we develop a case study focused on a practical comparison of those two paradigms of parallel processing.

Keywords: deep geological repository, mathematical model, thermoelasticity, parallel solver, OpenMP, MPI.
1 Motivation and Introduction
The disposal of radioactive waste is an urgent environmental problem of today. Especially the spent nuclear fuel, produced by nuclear reactors from uranium fuel in nuclear power plants, is very difficult to handle. One of the most promising methods of its final disposal is storage in deep geological repositories (DGR). Many countries are investigating appropriate technology for the underground disposal. Studied phenomena include (geo-)mechanical behaviour, heat transfer, chemical processes, water flows, etc. and their interactions in the long-term DGR operation. That is why mathematical modelling plays an indispensable role in the design. Complex multiphysics simulations are a great challenge in all their phases, starting from the mathematical formulation and ending at the (parallel) computer realization. Our work contributes to this kind of practical mathematical modelling. First, let us introduce two concrete concepts of DGR.

1.1 SKB
Sweden is one of the leading countries not only in the production of nuclear energy, but also in dealing with its consequences. Already in 1983, SKB (Swedish Nuclear Fuels and Waste Management Co.) presented the KBS-3 Report, describing a disposal concept for Swedish spent fuel, which still remains the basic reference design. Under the KBS-3V technology, the repository would consist of a number of parallel tunnels in bedrock at a depth of about 500 m, connected by a central tunnel for transportation and communication. Vertical holes with room for one canister in each were to be drilled from the floor of the tunnels. Copper canisters would then be emplaced and surrounded by compacted bentonite. In the Äspö Hard Rock Laboratory and its Prototype Repository [5], this model is tested in full scale. Situated 450 m below the earth surface, its central part is constructed as a 65 m long tunnel having two sections with four and two deposition holes, 1.75 m in diameter and 8 m deep (Fig. 1 left).

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 409–418, 2008. © Springer-Verlag Berlin Heidelberg 2008
Fig. 1. The illustration of the SKB’s Prototype Repository (left). The finite element mesh of the corresponding SKB model (right).
Based on the data available, we created a simplified mathematical model using our GEM3 finite element software. This SKB model was set up as a not fully coupled thermo-elasticity problem with thermal load caused by the radioactive waste in the deposition holes, with exponential decay of the heat source. The computational domain (see Fig. 1 right) of 158×57×115 m was discretized by linear tetrahedral finite elements with 2 586 465 degrees of freedom (DOF) for the heat conduction and 7 759 395 DOF for the elasticity computations. The time interval of interest was one hundred years. The main goal was to verify the correctness of the GEM3 thermal algorithms/solvers. See [1] for details.

1.2 SURAO
In the Czech Republic, the final DGR is just in the early phases of planning. According to SURAO, the state authority responsible for the safe disposal of radioactive waste, the Czech DGR should consist of several access shafts and tunnels and of a large network of corridors for storing, ventilation, drainage and communication (see Fig. 2 left). The deposition technology would correspond to the solutions that are accepted in Sweden (cf. above) and most other countries. A detailed description of SURAO’s reference design can be found in [6]. Note that in this case there is no real prototype facility available.
Fig. 2. A view of the SURAO’s DGR concept (left). The finite element mesh of the SURAO model (right).
In the SURAO case, our thermo-elasticity model covered just a small fraction of the whole design (32×58×100 m), namely three depository drifts, each connecting four deposition holes 1.32 m in diameter and 4.77 m deep, and one access gallery, for which 145×105×197 finite element nodes (2 999 325 DOF for the thermal part) were generated (cf. Fig. 2 right). The main objective of this modelling was to provide data for the assessment of how the distance between the deposition holes affects the thermo-elastic response of the rock [2].
2 Thermo-elasticity Problem
In this section, let us provide the basic information about the mathematical formulation of the SKB and SURAO models. More background can be found e.g. in [1], [3] and [2]. We consider thermo-elasticity problems which are not fully coupled, having just a one-directional coupling through a thermal expansion term in the constitutive relations, since we suppose that the deformations are very slow and do not influence the temperature fields. Thus, we can proceed as follows: First, we determine the temperature distribution by solving the nonstationary heat equation for a set of time points. Second, we solve the linear elasticity problem for some subset of those time points in a post-processing procedure.

2.1 Formulation
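The thermo-elasticity problem is mathematically formulated as follows: Find the temperature $\tau = \tau(x, t)$ and the displacement $u = u(x, t)$,

\[ \tau : \Omega \times (0, T) \to R, \qquad u : \Omega \times (0, T) \to R^3, \]

that fulfill the equations

\[ \kappa\rho \, \frac{\partial \tau}{\partial t} = k \sum_i \frac{\partial^2 \tau}{\partial x_i^2} + Q(t) \quad \text{in } \Omega \times (0, T), \]

\[ -\sum_j \frac{\partial \sigma_{ij}}{\partial x_j} = f_i \quad (i = 1, \ldots, 3) \quad \text{in } \Omega \times (0, T), \]

\[ \sigma_{ij} = \sum_{kl} c_{ijkl} \left[ \varepsilon_{kl}(u) - \alpha_{kl} (\tau - \tau_0) \right] \quad \text{in } \Omega \times (0, T), \]

\[ \varepsilon_{kl}(u) = \frac{1}{2} \left( \frac{\partial u_k}{\partial x_l} + \frac{\partial u_l}{\partial x_k} \right) \quad \text{in } \Omega \times (0, T), \]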
together with the corresponding boundary and initial conditions specified below. The four expressions represent the heat conduction equation, the equations of equilibrium, Hooke’s law and the tensor of small deformations, respectively, with symbols having the following meaning: $\kappa$ is the specific heat, $\rho$ is the density of material, $k$ are the coefficients of the heat conductivity, $Q$ is the density of the heat source, $\sigma_{ij}$ is the Cauchy stress tensor, $\varepsilon_{kl}$ is the tensor of small deformations, $f$ is the density of the volume (gravitational) forces, $c_{ijkl}$ are the elastic moduli, $\alpha_{kl}$ are the coefficients of the heat expansion and $\tau_0$ is the reference (initial) temperature. For the heat conduction, we use the boundary conditions

\[ \tau(x, t) = \hat{\tau}(x, t) \quad \text{on } \Gamma_0 \times (0, T), \]

\[ -k \sum_i \frac{\partial \tau}{\partial x_i} n_i = q \quad \text{on } \Gamma_1 \times (0, T), \]

\[ -k \sum_i \frac{\partial \tau}{\partial x_i} n_i = H (\tau - \hat{\tau}_{out}) \quad \text{on } \Gamma_2 \times (0, T), \]
where $\Gamma = \Gamma_0 \cup \Gamma_1 \cup \Gamma_2$. These conditions prescribe the temperature, the heat flow through the surface heat flux $q$ and the heat transfer to the surrounding medium with the temperature $\hat{\tau}_{out}$. The symbol $H$ denotes the heat transfer coefficient. For the elasticity part, we apply the boundary conditions

\[ u_n = \sum_i u_i n_i = 0 \quad \text{on } \tilde{\Gamma}_0 \times (0, T), \]

\[ \sigma_t = 0 \quad \text{on } \tilde{\Gamma}_0 \times (0, T), \]

\[ \sum_j \sigma_{ij} n_j = g_i \quad (i = 1, \ldots, 3) \quad \text{on } \tilde{\Gamma}_1 \times (0, T), \]

which set the displacement, stresses and surface loading. Here, $\Gamma = \tilde{\Gamma}_0 \cup \tilde{\Gamma}_1$. The initial temperature is specified by the condition

\[ \tau(x, 0) = \hat{\tau}_0(x) \quad \text{in } \Omega. \]
2.2 The Time-Stepping Algorithm
After the variational formulation, the thermo-elasticity problem is discretized by finite elements in space and by finite differences in time. Employing the linear finite elements and the so-called backward Euler time discretization, this leads to the computation of vectors τ^j, u^j of nodal temperatures and displacements at the time points t_j (j = 1, . . . , N) with the time steps Δt_j = t_j − t_{j−1}. We get the time-stepping algorithm in Fig. 3.

    find τ^0 : M_h τ^0 = τ_0
    find u^0 : A_h u^0 = b^0 = b_h(τ^0)
    for j = 1, . . . , N :
        compute d^j = (M_h − (1 − ϑ) Δt_j K_h) τ^{j−1} + ϑ Δt_j q_h^j + (1 − ϑ) Δt_j q_h^{j−1}
        find τ^j : (M_h + ϑ Δt_j K_h) τ^j = d^j
        at predefined time points:
            find u^j : A_h u^j = b^j = b_h(τ^j)
    end for
Fig. 3. The time-stepping algorithm for the thermo-elasticity problem
Here, $M_h$ is the capacitance matrix, $K_h$ is the conductivity matrix, $A_h$ is the stiffness matrix, $q_h$ represents the heat sources and $b_h$ comes from volume and surface forces including a thermal expansion term. The parameter $\vartheta \in [0, 1]$ sets the time scheme of the computation. We obtain the explicit Euler scheme for $\vartheta = 0$, the Crank-Nicolson scheme for $\vartheta = \frac{1}{2}$ and the backward Euler scheme for $\vartheta = 1$.

2.3 Incremental Form
01  set $B^0 = M_h$, $\tau^0 = 0$, $\Delta t_0 = 0$
02  for $j = 1, \dots, N$:
03      set $\Delta t_j$, then $c = \Delta t_j - \Delta t_{j-1}$
04      compute $B^j = B^{j-1} + c K_h$
05      compute $f = K_h \tau^{j-1}$
06      compute $g = \Delta t_j (q_h^j - f)$
07      find $\Delta\tau^j$: $B^j \Delta\tau^j = g$
08      compute $\tau^j = \tau^{j-1} + \Delta\tau^j$
09      at predefined time points: find $u^j$: $A_h u^j = b^j = b_h(\tau^j)$
10  end for

Fig. 4. The modified time-stepping algorithm

For our implementation, the general time-stepping algorithm (Fig. 3) was made more concrete and developed further. In Fig. 4 its incremental form for $\Delta\tau$ based on the
O. Jakl, R. Kohut, and J. Starý
substitution $\tau^j = \tau^{j-1} + \Delta\tau^j$ is shown. Moreover, the backward Euler scheme ($\vartheta = 1$), which offers the highest stability, is applied. This approach allows further optimization of the main loop of the time-stepping algorithm through an adaptive time-stepping scheme, which is based on a local comparison of the backward Euler and Crank-Nicolson time steps. By evaluating an easily computable ratio, we can control the subsequent time step $\Delta t_j$ (row 03) and adjust its size if the variation is too small or too large. For more details on this point see [3] and the references therein.
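As a rough illustration of rows 01 to 08 of Fig. 4, the following Python sketch runs the incremental backward Euler loop on small dense matrices. It is not the GEM3 solver: the function name, the callable `q_of_t` that supplies the heat-source vector, and the dense `numpy.linalg.solve` call standing in for the PCG solve of row 07 are assumptions made for the example, and the elasticity solve of row 09 is omitted.

```python
import numpy as np

def incremental_backward_euler(M, K, q_of_t, dts):
    """Heat-conduction part of Fig. 4 (rows 01-08) on dense matrices."""
    B = M.astype(float).copy()      # row 01: B^0 = M_h
    tau = np.zeros(M.shape[0])      # row 01: tau^0 = 0
    dt_prev, t = 0.0, 0.0           # row 01: dt_0 = 0
    for dt in dts:                  # row 02
        c = dt - dt_prev            # row 03
        B = B + c * K               # row 04: now B^j = M_h + dt_j * K_h
        t += dt
        f = K @ tau                 # row 05
        g = dt * (q_of_t(t) - f)    # row 06
        dtau = np.linalg.solve(B, g)  # row 07 (dense solve instead of PCG)
        tau = tau + dtau            # row 08
        dt_prev = dt
    return tau
```

Since each step adds only $c K_h$ to the previous matrix, $B^j$ equals $M_h + \Delta t_j K_h$ for every $j$, so the loop reproduces the plain backward Euler recursion while rebuilding the system matrix cheaply when the time step changes.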
3 Solver and Its Parallelization
A thermo-elasticity solver based on the time-stepping algorithm described in the previous section, together with some accompanying codes, has been implemented in the framework of the in-house finite element package GEM3. This section provides more details, especially regarding the parallelization.

3.1 Parallel Solution of Large Linear Systems
Most of the computational work in the modified time-stepping algorithm (Fig. 4) goes into the repeated solution of the heat-conduction linear system of the form $B\,\Delta\tau = q$ (row 07), solved in each time step to obtain the temperature distribution over time, and into the solution of the elasticity linear system $Au = b$ (row 09), which is solved only at time points of interest to obtain the displacements in the mesh nodes under the given temperature. The preconditioned conjugate gradient (PCG) method is suitable for the solution of both systems.

The parallelization is based on the spatial partitioning of the modelled domain $\Omega$ into $m$ non-overlapping subdomains $\Omega_k$ by means of a 1-D data decomposition along the vertical Z axis. These are further extended so that adjacent subdomains overlap by two or more layers of elements. Such partitioning results in a "nice" (block) decomposition of the data (matrices, vectors) in the time-stepping algorithm. Thus, when the parallel solver starts one process per subdomain, the $k$-th of $m$ concurrent processes can simply follow the time-stepping algorithm on $\Omega_k$, working on the data blocks pertaining to its subproblem. Interaction (communication) between the parallel processes is necessary only in the matrix-vector multiplication (row 05) and in the PCG solution (row 07).

Let us pause at the PCG algorithm in row 07. The tricky point here may be the choice of the preconditioner. However, when we use the one-level additive Schwarz method, the preconditioning step can be expressed as

$$g = Gr = \sum_{k=1}^{m} I_k B_k^{-1} R_k r,$$

where $B_k$ are the finite element matrices corresponding to the subproblems on $\Omega_k$ and $I_k$, $R_k = I_k^T$ are the interpolation and restriction matrices, respectively. If $B$ denotes the finite element matrix of the whole problem, then $B_k = R_k B I_k$.¹ As a consequence, during the parallel PCG iterations the processes need to communicate only locally with their neighbours, and the amount of data transferred is quite small, proportional to the size (minimal in practice) of the overlapped region. The parallelization is therefore well placed to be efficient and scalable.

Finally, let us note that in the parabolic problem of heat conduction (row 07), when reasonable assumptions hold, numerical scalability can be maintained without the help of a coarse grid correction in the preconditioner, which may be necessary in the case of elliptic elasticity problems (row 09); see [3] for details. This is favourable for parallel processing since the set of parallel processes remains homogeneous and additional load balancing is not required.

¹ The local subproblems are solved inexactly: the matrices $B_k$ are replaced by their incomplete factorizations.

3.2 MPI and OpenMP Implementations
We conceived the realization of the parallel thermo-elasticity solver, based on the time-stepping algorithm, as an opportunity to make a practical comparison of the two main paradigms in parallel programming, message passing and shared memory, of the corresponding standards MPI and OpenMP, and of some of their implementations. We wrote the solver in Fortran in two variants, the first one using MPI and the second one using OpenMP. Both variants of the solver have the same logic, i.e. they follow the same algorithm and apply the same data decomposition, described above, while respecting the different character and formalism of the MPI and OpenMP standards. Thus, they provide the same results up to rounding errors. The codes can be scaled in the sense that the number of generated subproblems matches the number of available processors.
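The data decomposition shared by both variants, together with the one-level additive Schwarz step $g = \sum_k I_k B_k^{-1} R_k r$ of Sect. 3.1, can be illustrated by a small serial Python sketch. It is only a toy model under stated assumptions: one unknown per layer along Z, exact dense local solves instead of incomplete factorizations, and hypothetical function names; in the actual solvers each subdomain would be handled by one MPI process or OpenMP thread.

```python
import numpy as np

def overlapping_layers(n_layers, m, extend=1):
    """Split layers 0..n_layers-1 into m contiguous blocks along Z and
    extend each block by `extend` layers on both sides (neighbours then
    share 2*extend layers)."""
    bounds = np.linspace(0, n_layers, m + 1).astype(int)
    return [np.arange(max(bounds[k] - extend, 0),
                      min(bounds[k + 1] + extend, n_layers))
            for k in range(m)]

def additive_schwarz(B, subdomains, r):
    """Apply g = sum_k I_k B_k^{-1} R_k r with B_k = R_k B I_k."""
    g = np.zeros(len(r))
    for idx in subdomains:                      # one subdomain per process
        Bk = B[np.ix_(idx, idx)]                # local matrix B_k
        g[idx] += np.linalg.solve(Bk, r[idx])   # local solve, prolonged by I_k
    return g
```

With a single subdomain the operator reduces to $B^{-1}$; with several overlapping subdomains it stays symmetric positive definite, which is what allows it to be used inside PCG.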
4 DGR Computations
This section is devoted to the numerical experiments. Although the main goal of developing the thermo-mechanical codes was of course to obtain results valuable for progress in the DGR design, in this article we concentrate on the performance and comparison of the MPI and OpenMP solvers, without considering the resulting values themselves. Given our long-term experience with solvers for elasticity problems, the experiments focused on the solution and validation of the thermal part.

4.1 Parallel Systems
Recall that OpenMP requires shared-memory parallel hardware, whereas the message passing of MPI is supported and generally available on all parallel architectures. Thus, the best parallel architecture on which to compare the MPI and OpenMP codes is a symmetric multiprocessor. Our solvers were developed on a small in-house IBM e-server xSeries 455 shared memory system (x455 for short) consisting of two interconnected boxes with four Itanium 2 (1.3 GHz) processors each. Unfortunately, this interconnection made the whole system non-uniform, so our first tests made use of just one box as a true symmetric multiprocessor with 8 GB of shared memory. The x455 ran the SuSE Linux Enterprise Server 8 operating system. The MPI solver was compiled with Intel Fortran Compiler 8.0 using MPICH 1.2.6; the OpenMP code was compiled with the same OpenMP-aware compiler.

By courtesy of the Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX, [7]), the solvers could be ported to a much larger shared-memory platform, called Simba. The relevant technical parameters of this Sun Fire E 15000 machine (installed in 2001) were 48 UltraSPARC-III (900 MHz) processors, 48 GB of shared memory and the Sun Fireplane system interconnect. Simba was a "virtual server" on this system with 36 CPUs and 36 GB of main memory assigned. Under the Solaris 9 operating system we could use the Sun Studio 11 compiler suite for Fortran compilation (with OpenMP support) and Sun HPC ClusterTools 5 for MPI message passing.

4.2 Experiments and Observations
Table 1 summarizes the performance characteristics of the OpenMP and MPI solvers on a one-year time step (i.e. just one solution of the linear system in row 07 of the time-stepping algorithm) of the SKB model, computed on the x455 system employing 1, 2 and 4 processors.

Table 1. SKB solution on x455. The number of PCG iterations (#It), the wall-clock time (T[s]) and the relative speedup (S) for varying number of processors are shown.

          1 processor         2 processors        4 processors
          #It   T[s]   S      #It   T[s]   S      #It   T[s]   S
OpenMP    40    98     –      40    52     1.88   40    28     3.50
MPI       39    115    –      40    60     1.95   40    31     3.76
The table shows fairly comparable results for both codes. The OpenMP solver achieved slightly better execution times, while its MPI counterpart scaled better, which promised better performance on a greater number of processors.

On Simba, we could carry out more thorough computations, e.g. run computations of both models, consider the whole time period and test the scalability of the parallel codes on up to 32 processors. In the SKB case (Tab. 2 left), the heat conduction computation covers 100 years adaptively divided into 47 time steps. In each of these, the solution of the linear system (row 07) starts with the initial guess taken from the previous step and continues up to the relative residual accuracy $10^{-6}$. The SURAO model (Tab. 2 right) was treated similarly, with a time period of 200 years and 46 time steps. Both tables show the performance (wall-clock time and the speedup relative to the one-processor execution time) of the OpenMP and MPI solvers depending on the number of processors.

Table 2. SKB (left) and SURAO (right) models computed on Simba. The total number of PCG iterations (#It), processing time (T[s]) and relative speedup (S) for varying number of processors (#P) are presented.

SKB model
             OpenMP           MPI
#P    #It    T[s]    S        T[s]    S
1     1341   9044    –        5931    –
2     1423   4782    1.89     3624    1.64
4     1426   2818    3.21     1823    3.25
8     1514   1594    5.67     967     6.13
12    1580   1298    6.97     704     8.42
16    1617   1076    8.41     545     10.88
20    1689   1022    8.85     484     12.25
24    1709   948     9.54     407     14.57
28    1714   901     10.04    356     16.66
32    1819   951     9.51     364     16.29

SURAO model
             OpenMP           MPI
#P    #It    T[s]    S        T[s]    S
1     4530   33378   –        25821   –
2     5621   21280   1.55     15923   1.62
4     5626   12018   2.78     7845    3.29
8     5988   6925    4.82     4229    6.11
12    6978   5928    5.63     3284    7.86
16    7430   5134    6.50     2683    9.62
20    7777   4551    7.33     2176    11.87
24    8364   4660    7.16     2046    12.62
28    8694   4372    7.63     1671    15.45
32    9382   5670    5.89     1749    14.76

These tables show quite different results in comparison with the x455 and Tab. 1. On Simba, where the character of the processor-time dependency is quite similar for both models, the MPI solver outperforms the OpenMP code both in speedup and computation time, with the fastest execution being more than 2.5 times shorter. Note however that for 2 and 4 processors the speedups are quite similar to the speedup recorded on the x455, and that on the larger SURAO model OpenMP loses even more.

Without having access to more detailed system information, we can explain this behaviour if we realize that the Fireplane interconnect design makes Simba in fact a NUMA (Non-Uniform Memory Access) machine, in contrast to the x455 symmetric multiprocessor. With OpenMP, we have no data alignment directives to impose data locality², whereas MPI forces data locality by default. Thus, with increasing size of the data and number of processors employed, poor data locality may degrade the performance, especially in the OpenMP case: its average memory access time grows and its efficiency decreases more quickly. Thus, on 28 processors, the relative efficiency of the MPI solver is much better than that of the OpenMP solver, especially with the SURAO model (0.55 vs. 0.27).

28 processors is also the limit up to which both solvers improve their execution time, proving quite good scalability. This limit is the consequence of the 1-D decomposition along the vertical Z axis: for both models, the "height" of the largest subdomain, which determines the parallel execution time, stays constant between 28 and 32 processors. Let us add that in OpenMP it is much simpler to implement a 2-D or even 3-D decomposition in the PCG solver, which could bring more scalability to the codes. We are working on this topic now.
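The relative efficiencies quoted above follow directly from the SURAO speedups of Table 2 as $E = S / p$. A short Python check, with the dictionary layout and function name chosen only for this illustration:

```python
# Speedups S of the SURAO model, copied from Table 2 (#P -> S)
surao_speedup = {
    "OpenMP": {2: 1.55, 4: 2.78, 8: 4.82, 12: 5.63, 16: 6.50,
               20: 7.33, 24: 7.16, 28: 7.63, 32: 5.89},
    "MPI":    {2: 1.62, 4: 3.29, 8: 6.11, 12: 7.86, 16: 9.62,
               20: 11.87, 24: 12.62, 28: 15.45, 32: 14.76},
}

def relative_efficiency(speedup, processors):
    """Relative efficiency E = S / p."""
    return speedup / processors

for solver, table in surao_speedup.items():
    print(solver, round(relative_efficiency(table[28], 28), 2))
# prints "OpenMP 0.27" and "MPI 0.55"
```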
² Data should be local to the processor that needs it.

5 Conclusion

Our contribution dealt with one important source of high-performance computations, namely mathematical models of deep geological repositories of spent nuclear fuel. We have shown that even under fairly simplifying assumptions, when the complex multiphysics and multiscale problem is reduced to thermo-elastic behaviour with one-directional coupling, we get algorithms that, applied to a real DGR, give rise to computational demands for which parallel processing is highly appropriate. We made use of two such DGR concepts to introduce our parallel code for the solution of thermo-elastic models. Since the solver was implemented in two variants, using either the OpenMP or the MPI standard for parallel programming, we could make some practical comparisons of those programming paradigms. Although we experienced rather opposite results on the two shared-memory systems available, on a large NUMA machine we registered considerably better performance of the MPI code. This supports the prevalent idea that in most cases MPI surpasses OpenMP in efficiency and that for large problems like DGR modelling, MPI would be more appropriate.

Acknowledgement. This work is supported by the grant No. 1ET400300415 of the Academy of Sciences of the Czech Republic.
References
1. Blaheta, R., Byczanski, P., Kohut, R., Kolcun, A., Šňupárek, R.: Large-Scale Modelling of T-M Phenomena from Underground Reposition of the Spent Nuclear Fuel. In: Konečný, P., et al. (eds.) EUROCK 2005, Impact of Human Activity on Geological Environment, pp. 49–55. A.A. Balkema, Leiden (2005)
2. Kohut, R., Starý, J., Kolcun, A.: The evaluation of the thermal behaviour of an underground repository of the spent nuclear fuel. In: Proceedings of the Sixth International Conference on Large-Scale Scientific Computing LSSC 2007 held in Sozopol. Springer, Berlin (to appear, 2007)
3. Kohut, R., Starý, J., Blaheta, R., Krečmer, K.: Parallel Computing of Thermoelasticity Problems. In: Lirkov, I., Margenov, S., Waśniewski, J. (eds.) LSSC 2005. LNCS, vol. 3743, pp. 671–678. Springer, Heidelberg (2006)
4. Smith, B., Bjørstad, P., Gropp, W.: Domain decomposition. Parallel multilevel methods for Elliptic Partial Differential Equations. Cambridge University Press, New York (1996)
5. Svemar, C., Pusch, R.: Prototype Repository - Project description. IPR-00-30, SKB, Stockholm (2000)
6. Vavřina, V., et al.: Referenční projekt podzemních a nadzemních částí hlubinného úložiště. SURAO 23-8024-51-001/EGPI444-990 009 (1999)
7. UPPMAX home page (April 28, 2007), http://www.uppmax.uu.se
A Pipelined Parallel Algorithm for OSIC Decoding

Francisco-Jose Martínez-Zaldívar¹, Antonio-Manuel Vidal-Maciá², and Pedro Alonso²

¹ Departamento de Comunicaciones
[email protected]
² Departamento de Sistemas Informáticos y Computación
Universidad Politécnica de Valencia, Camino de Vera s/n, 46022 Valencia, Spain
[email protected], [email protected]
Abstract. This paper describes and compares two algorithms for the Ordered Successive Interference Cancellation (OSIC) decoding procedure proposed in V-BLAST wireless MIMO systems. They are based on algorithms that solve the Recursive Least Squares (RLS) problem, and are derived from the square root version of the Kalman Filter and the square root version of the Information Filter, respectively. For OSIC decoding, the latter is a novel and faster solution than the former because a matrix multiplication and some rotation applications are avoided. The algorithm has been formulated as a block algorithm, observing an optimum block size that minimizes execution time. It has been parallelized as a pipeline, and so is very efficient.
1 Introduction

Multiple Input Multiple Output (MIMO) systems have been extensively studied in recent years in the context of wireless communications. The original proposal by Foschini [1], known as BLAST (Bell Labs Layered Space-Time), has generated a family of architectures that use multiple antenna arrays to transmit and receive information with the aim of increasing the capacity and reliability of the links. Some of these architectures are D-BLAST (Diagonal BLAST) [1] and Turbo-BLAST [2], both with highly complex implementations, and V-BLAST (Vertical BLAST), with more practical complexity at the expense of diversity [3]. This paper focuses on the suboptimal, but more practical, V-BLAST family, where nearly optimal decoders such as sphere decoding [4] can be used, or linear decoders such as Zero Forcing or MMSE and its ordered version OSIC (Ordered Successive Interference Cancellation) [5]. Moreover, in some applications, such as multicarrier systems [6], the dimension of the problem may reach several thousands. This paper describes a novel algorithm to solve the OSIC decoding problem. It offers better performance than those reported in the literature, and its parallelization can produce a highly efficient result.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 419–428, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 OSIC Decoding Procedure
In a basic approach, we need to solve the typical perturbed system $y = Hx + v$, where the full rank matrix $H \in \mathbb{C}^{m \times n}$ represents the channel matrix and any manipulation of the symbols before transmission, $x$ is a vector whose components belong to a discrete symbol set, and $v$ is the process noise. The use of MMSE (Minimum Mean Square Error) estimation yields:

$$\hat{x} = \begin{pmatrix} H \\ \sqrt{\alpha} I_n \end{pmatrix}^{\dagger} \begin{pmatrix} y \\ 0 \end{pmatrix} = H_\alpha^{\dagger} y, \qquad (1)$$

where $\alpha^{-1}$ may be thought of as the signal-to-noise ratio, $H_\alpha^{\dagger}$ denotes the first $m$ columns of the pseudoinverse of $(H^*, \sqrt{\alpha} I_n)^*$, and the asterisk superscript $(\cdot)^*$ denotes the complex conjugate transpose. In OSIC, the signal components $x_i$, $i = 1, \dots, n$, are decoded from the strongest (with the highest signal-to-noise ratio) to the weakest, cancelling the contribution of each decoded signal component from the received signal and then repeating the process with the remaining signal components. The estimation of every component requires a mapping of the complex, or real, result into the symbol set.

Let $P$ be the estimation error covariance matrix (Hermitian positive definite)

$$P = E\{(x - \hat{x})(x - \hat{x})^*\} = (\alpha I_n + H^* H)^{-1}.$$

Let $P^{-1} = P^{-*/2} P^{-1/2}$ be the Cholesky factorization of $P^{-1} = \alpha I_n + H^* H$; then the square root factor $P^{-1/2}$ and its inverse $P^{1/2}$ are upper triangular matrices. If $P_{jj} = \min\{\mathrm{diag}(P)\}$, then $\hat{x}_j$ is the component of $\hat{x}$ with the highest signal-to-noise ratio, which can be decoded as $\hat{x}_j = H_{\alpha,j}^{\dagger} y$, where $H_{\alpha,j}^{\dagger}$ is the $j$th row of $H_\alpha^{\dagger}$ (the $j$th nulling vector). If $H = (h_1, h_2, \dots, h_n)$, the contribution of the estimated $\hat{x}_j$ to the received signal is cancelled as $y' = y - h_j \hat{x}_j$. The procedure should be repeated for the deflated channel matrix $H'$, which is obtained by deleting the $j$th column of $H$. This implies the recomputation of the pseudoinverse of the deflated channel matrix, followed by the selection of the new strongest signal component by applying the same calculations. The computation of the iteratively deflated channel matrix pseudoinverses, until all the signal components are decoded, is computationally expensive.

This recomputation is avoided in [5]. If the following QR factorization is computed:

$$\begin{pmatrix} H \\ \sqrt{\alpha} I_n \end{pmatrix} = QR,$$

where $R \in \mathbb{C}^{n \times n}$ and the columns of $Q \in \mathbb{C}^{(m+n) \times n}$ are orthonormal, and $Q_\alpha = (q_{\alpha,1}, q_{\alpha,2}, \dots, q_{\alpha,n})$ is defined as the first $m$ rows and $n$ columns of $Q$, then $R = P^{-1/2}$ can be identified, verifying that

$$H_\alpha^{\dagger} = P^{1/2} Q_\alpha^*. \qquad (2)$$

Let us suppose that the rows of the upper triangular matrix $P^{1/2}$ are sorted in decreasing Euclidean norm order (this could otherwise be obtained using a permutation matrix $\Pi$ and a unitary transformation $\Sigma$ in $\Pi P^{1/2} \Sigma \Sigma^* Q_\alpha^*$, maintaining the upper triangular structure in $\Pi P^{1/2} \Sigma$; the permutation should be applied to the solution vector as well); then the diagonal entries of $P = P^{1/2} P^{*/2}$ are sorted in decreasing value order and, consequently, the symbols in the $\hat{x}$ vector are sorted in increasing signal-to-noise ratio order. Hence, the first component to decode would be $\hat{x}_n$. If this sorting criterion is satisfied, it is shown in [5] how the nulling vector $H_{\alpha,j}^{\dagger}$ can be obtained from (2) in a simpler way as $H_{\alpha,j}^{\dagger} = p_j^{1/2} q_{\alpha,j}^*$, where $p_j^{1/2}$ is the $j$th diagonal entry of $P^{1/2}$. An algorithm that summarizes these ideas is shown below:

Input: $P^{1/2}$, $Q_\alpha$, $y$
Output: $\hat{x}$
$y^{(n)} = y$
for $j = n : -1 : 1$ do
    $H_{\alpha,j}^{\dagger} = p_j^{1/2} q_{\alpha,j}^*$;
    $\hat{x}_j = \mathrm{MAP}(H_{\alpha,j}^{\dagger} y^{(j)})$;
    $y^{(j-1)} = y^{(j)} - h_j \hat{x}_j$;
end for

where the MAP function represents the mapping of the result into the symbol set, and $y^{(j)}$ represents the received signal without the contribution from the decoded signal components $\hat{x}_{j+1}, \dots, \hat{x}_n$. Therefore, only $P^{1/2}$ and $Q_\alpha$ are required to find the solution to the OSIC problem. In [5], $P^{1/2}$ and $Q_\alpha$ are computed, and so the equation $\hat{x} = H_\alpha^{\dagger} y = P^{1/2} Q_\alpha^* y$ is solved, as a Recursive Least Squares (RLS) problem in a special way.

2.1 The Square Root Kalman Filter for OSIC
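As a reference point for the filters discussed next, the plain OSIC procedure of Section 2 can be sketched in Python. This is the expensive baseline that recomputes the MMSE nulling vector for every deflated channel matrix, not the square-root method of [5]; the function name and the nearest-symbol rule used for MAP are assumptions made for this illustration.

```python
import numpy as np

def osic_decode(H, y, alpha, symbols):
    """Plain OSIC: recompute the nulling vector for each deflated channel."""
    n = H.shape[1]
    active = list(range(n))               # columns of H not yet decoded
    x_hat = np.zeros(n, dtype=complex)
    y = np.array(y, dtype=complex)        # working copy of the received signal
    while active:
        Ha = H[:, active]                 # deflated channel matrix
        P = np.linalg.inv(alpha * np.eye(len(active)) + Ha.conj().T @ Ha)
        k = int(np.argmin(np.real(np.diag(P))))   # smallest P_jj: strongest component
        nulling = P[k, :] @ Ha.conj().T   # k-th row of the deflated H_alpha^dagger
        estimate = nulling @ y
        symbol = symbols[np.argmin(np.abs(symbols - estimate))]  # MAP(.)
        j = active.pop(k)                 # original index of the decoded component
        x_hat[j] = symbol
        y = y - H[:, j] * symbol          # cancel the decoded contribution
    return x_hat
```

Each pass inverts a matrix whose order shrinks by one, which is the cost that the square-root formulations below are designed to avoid.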
From (2), $Q_\alpha = H_\alpha^{\dagger *} P^{-*/2}$. This matrix is propagated along the iterations of the square root Kalman Filter, devised initially to solve a Recursive Least Squares (RLS) problem. A block version of the algorithm reported in [5] is reproduced below for OSIC; it will be termed SRKF-OSIC.

Input: $H = (H_0^*, H_1^*, \dots)^*$, $P_{(0)}^{1/2} = \frac{1}{\sqrt{\alpha}} I_n$ and $Q_{\alpha,(0)} = 0$
Output: $Q_\alpha = Q_{\alpha,(m/q)}$, $P^{1/2} = P_{(m/q)}^{1/2}$
for $i = 0, \dots, m/q - 1$ do
    Calculate $\Theta_{(i)}$ and apply it in such a way that:

$$E_{(i)} \Theta_{(i)} = \begin{pmatrix} I_q & H_i P_{(i)}^{1/2} \\ 0 & P_{(i)}^{1/2} \\ -\Gamma_{(i+1)} & Q_{\alpha,(i)} \end{pmatrix} \Theta_{(i)} = \begin{pmatrix} R_{e,(i)}^{1/2} & 0 \\ K_{p,(i)} & P_{(i+1)}^{1/2} \\ Z & Q_{\alpha,(i+1)} \end{pmatrix} = F_{(i)}$$

end for
where $Z = -\left(\Gamma_{(i+1)}^* - H_i H_{\alpha,(i+1)}^{\dagger}\right)^* R_{e,(i)}^{-*/2}$; $q$ is the number of consecutive rows of $H$ processed in a block, so $H_i \in \mathbb{C}^{q \times n}$; an iteration index subscript enclosed in parentheses denotes that the variable is updated iteratively: $Q_\alpha$ and $P^{1/2}$ are the values $Q_{\alpha,(i+1)}$ and $P_{(i+1)}^{1/2}$ in the last iteration $i = m/q - 1$; $\Gamma_{(i+1)} = \left(0^T_{iq \times q}, I_q, 0^T_{(m-q(i+1)) \times q}\right)^T \in \mathbb{R}^{m \times q}$; $R_{e,(i)}$ and $K_{p,(i)}$ are variables of the Kalman Filter whose meaning is described in [7], and $H_{\alpha,(i+1)}^{\dagger}$ appears implicitly in $Z$.

Arithmetic costs. The costs of one iteration are a matrix multiplication $H_i P_{(i)}^{1/2}$ and the application of a sequence of Givens rotations $\Theta_{(i)}$, exploiting and maintaining the triangular structure of $P_{(i)}^{1/2}$ along the iterations. Let $w_{TRMM}(q, n)$ denote the cost of the $H_i P_{(i)}^{1/2}$ matrix multiplication, where $q$ and $n$ are the dimensions of the result, and $w_{ROT}(z)$ the cost of applying a Givens rotation to a pair of vectors of $z$ components. The cost can be approximated as:

$$\sum_{i=0}^{m/q-1} \left( w_{TRMM}(q, n) + \sum_{r=1}^{q} \sum_{c=1}^{n} \left[ w_{ROT}(q - r + 1) + w_{ROT}(n - c + 1 + [i+1]q) \right] \right)$$
$$= \sum_{i=1}^{m/q} w_{TRMM}(q, n) + q \sum_{i=1}^{m/q} \sum_{c=1}^{n} w_{ROT}(c + iq) + n \sum_{i=1}^{m/q} \sum_{r=1}^{q} w_{ROT}(r),$$

where $r$, $c$ and $i$ denote the indices of rows, columns and iterations, respectively. If we assume that $w_{TRMM}(q, n) = qn^2$ flops and $w_{ROT}(z) = 6z$ flops [8], the cost is approximately $4n^2 m + 3nm^2$ flops.

2.2 The Square Root Information Filter for OSIC
The Square Root Information Filter variation to solve this problem is shown below. This algorithm version will be denoted as SRIF-OSIC and is based on a modified version of the extended square root Information Filter algorithm that can be found in [7]. Let $V_{(i)}^{(a)}$ and $W_{(i)}^{(a)}$ be:

$$V_{(i)}^{(a)} = \begin{pmatrix} P_{(i)}^{-*/2} & H_i^* \\ P_{(i)}^{1/2} & 0 \end{pmatrix}, \qquad W_{(i)}^{(a)} = \begin{pmatrix} P_{(i+1)}^{-*/2} & 0 \\ P_{(i+1)}^{1/2} & -K_{p,(i)} \end{pmatrix}.$$

It can be verified from [7] that if we find a unitary matrix $\Theta_{(i)}$ such that $V_{(i)}^{(a)} \Theta_{(i)} = W_{(i)}^{(a)}$, then $V_{(i)}^{(a)} V_{(i)}^{(a)*} = W_{(i)}^{(a)} W_{(i)}^{(a)*}$. Let the next augmented matrices be:

$$V_{(i)}^{(b)} = \begin{pmatrix} P_{(i)}^{-*/2} & H_i^* \\ P_{(i)}^{1/2} & 0 \\ A & B \end{pmatrix}, \qquad W_{(i)}^{(b)} = \begin{pmatrix} P_{(i+1)}^{-*/2} & 0 \\ P_{(i+1)}^{1/2} & -K_{p,(i)} \\ L & M \end{pmatrix},$$

so, in order to propagate $H_{\alpha,(i)}^{\dagger *} P_{(i)}^{-*/2}$, if we force $A = H_{\alpha,(i)}^{\dagger *} P_{(i)}^{-*/2} = Q_{\alpha,(i)}$ and $L = H_{\alpha,(i+1)}^{\dagger *} P_{(i+1)}^{-*/2} = Q_{\alpha,(i+1)}$, and evaluate $V_{(i)}^{(b)} V_{(i)}^{(b)*} = W_{(i)}^{(b)} W_{(i)}^{(b)*}$, one solution can be obtained for $B$ and $M$ as: $B = \Gamma_{(i+1)}$ and $M = \left(\Gamma_{(i+1)} - H_{\alpha,(i)}^{\dagger *} H_i^*\right) R_{e,(i)}^{-*/2}$. The $P_{(i+1)}^{1/2}$ and $-K_{p,(i)}$ matrices are of no interest here, so they will be eliminated from $V_{(i)}^{(b)}$ and $W_{(i)}^{(b)}$, and the new matrices will be denoted as $V_{(i)}$ and $W_{(i)}$. The final algorithm is shown below.

Input: $H = (H_0^*, H_1^*, \dots)^*$, $P_{(0)}^{-1/2} = \sqrt{\alpha} I$, $Q_{\alpha,(0)} = 0$
Output: $Q_\alpha = Q_{\alpha,(m/q)}$, $P^{-*/2} = P_{(m/q)}^{-*/2}$
for $i = 0, \dots, m/q - 1$ do
    Compute $\Theta_{(i)}$ in such a way that:

$$V_{(i)} \Theta_{(i)} = \begin{pmatrix} P_{(i)}^{-*/2} & H_i^* \\ Q_{\alpha,(i)} & \Gamma_{(i+1)} \end{pmatrix} \Theta_{(i)} = \begin{pmatrix} P_{(i+1)}^{-*/2} & 0 \\ Q_{\alpha,(i+1)} & M \end{pmatrix} = W_{(i)} \qquad (3)$$

end for
Arithmetic cost. Zeros must be obtained in the positions of the submatrix $H_i^*$ in $V_{(i)}$. This can be achieved by means of Householder transformations or Givens rotations, or both, applied to $V_{(i)}$ from the right. Suppose that Givens rotations are used: for every row $r = 1, \dots, n$ of $H_i^*$, $q$ Givens rotations must be applied to a pair of vectors of $(n - r + 1) + [i + 1]q$ components, so the cost of the $i$th iteration is:

$$W_{sec,i} = \sum_{r=1}^{n} q\, w_{ROT}((n - r + 1) + [i + 1]q) = 3qn^2 + 6q^2 n[i + 1] + 3qn \qquad (4)$$

flops, and the total algorithm cost is:

$$W_{sec} = \sum_{i=0}^{m/q-1} \sum_{r=1}^{n} q\, w_{ROT}((n - r + 1) + [i + 1]q) = q \sum_{i=1}^{m/q} \sum_{r=1}^{n} w_{ROT}(r + iq) \approx 3n^2 m + 3nm^2 = \Theta(n^2 m) + \Theta(nm^2) \qquad (5)$$

flops, so this version can be about 16% faster than the version based on the square root Kalman Filter when $n \approx m$ (for $n = m$ the two counts are approximately $7n^3$ and $6n^3$ flops), because neither the matrix multiplication nor some of the rotation applications are necessary. This speedup can be even greater if Householder transformations are used (in this algorithm, the greater the value of $q$, the greater the speedup, tending asymptotically to 25% when $n \approx m$).
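The block update (3) can be checked numerically with a short Python sketch. It is a hedged illustration rather than the authors' implementation: the unitary $\Theta_{(i)}$ is obtained here from a dense full QR factorization (`numpy.linalg.qr` with `mode="complete"`) instead of the structured Givens or Householder sequences whose flop counts are given above, the function name is hypothetical, and $q$ is assumed to divide $m$.

```python
import numpy as np

def srif_osic(H, alpha, q):
    """Propagate L = P^{-*/2} (lower triangular) and Q_alpha block by block, as in (3)."""
    m, n = H.shape
    assert m % q == 0, "sketch assumes that q divides m"
    L = np.sqrt(alpha) * np.eye(n, dtype=complex)    # P_(0)^{-*/2}
    Qa = np.zeros((m, n), dtype=complex)             # Q_alpha,(0) = 0
    for i in range(m // q):
        Hi = H[i * q:(i + 1) * q, :]                 # block of q rows of H
        Gamma = np.zeros((m, q))
        Gamma[i * q:(i + 1) * q, :] = np.eye(q)      # Gamma_(i+1)
        top = np.hstack([L, Hi.conj().T])            # (P^{-*/2}, H_i^*)
        # Full QR of top^* yields a unitary Theta with top @ Theta = (R^*, 0)
        Theta, _ = np.linalg.qr(top.conj().T, mode="complete")
        L = (top @ Theta)[:, :n]                     # P_(i+1)^{-*/2}
        Qa = (np.hstack([Qa, Gamma]) @ Theta)[:, :n] # Q_alpha,(i+1)
    return L, Qa
```

After the last block, `L @ L.conj().T` equals $\alpha I_n + H^* H$ up to rounding, and solving with `L.conj().T` against `Qa.conj().T` recovers $H_\alpha^{\dagger} = P^{1/2} Q_\alpha^*$ from (2).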
3 Parallel Algorithm

3.1 Data Partition
For reasons of clarity let us suppose that, as an example, there are $p = 2$ processors: $P_0$ and $P_1$. A matrix enclosed within square brackets with a processor subscript will denote that part of the matrix belongs to that processor. If it is enclosed within parentheses, then it denotes that the entire matrix is in that processor. Let $C_{(i)}$, $D_{(i)}$ and $V_{(i)}$ be:

$$C_{(i)} = \begin{pmatrix} P_{(i)}^{-*/2} \\ Q_{\alpha,(i)} \end{pmatrix}, \qquad D_{(i)} = \begin{pmatrix} H_i^* \\ \Gamma_{(i+1)} \end{pmatrix}, \qquad V_{(i)} = \left( C_{(i)}, D_{(i)} \right).$$

The $n$ columns of $C_{(i)}$ will be distributed among the processors ($n_0$ columns belong to $P_0$ and $n_1$ columns to $P_1$, with $n_0 + n_1 = n$) and this assignment will not change during the parallel algorithm execution (it could change in an adaptive load balance algorithm version). $D_{(i)}$ will be manipulated in a pipelined way by all the processors, so the subscript will change accordingly (it will be initially in $P_0$). To better understand the evolution of the algorithm, $H_i^*$ in $D_{(i)}$ will be divided into $p$ groups of consecutive rows. A left superscript $n_j$ will denote the number of rows in the group. The initial data partition will be given by:

$$V_{(i)} = \left( \begin{bmatrix} P_{(i)}^{-*/2} \\ Q_{\alpha,(i)} \end{bmatrix}_{P_0} \begin{bmatrix} P_{(i)}^{-*/2} \\ Q_{\alpha,(i)} \end{bmatrix}_{P_1} \begin{pmatrix} {}^{n_0}[H_i^*] \\ {}^{n_1}[H_i^*] \\ \Gamma_{(i+1)} \end{pmatrix}_{P_0} \right) = \left( [C_{(i)}]_{P_0} \;\; [C_{(i)}]_{P_1} \;\; (D_{(i)})_{P_0} \right).$$

3.2 Processor Tasks
Let us suppose that $P_0$ obtains zeros in the $n_0$ rows of ${}^{n_0}[H_i^*]$ by applying a sequence of unitary transformations denoted by $\Theta_{(i),P_0}$ (an apostrophe $(\cdot)'$ denotes the updating of a matrix):

$$V_{(i)}' = V_{(i)} \Theta_{(i),P_0} = \left( \begin{bmatrix} P_{(i)}^{-*/2} \\ Q_{\alpha,(i)} \end{bmatrix}_{P_0} \begin{bmatrix} P_{(i)}^{-*/2} \\ Q_{\alpha,(i)} \end{bmatrix}_{P_1} \begin{pmatrix} {}^{n_0}[H_i^*] \\ {}^{n_1}[H_i^*] \\ \Gamma_{(i+1)} \end{pmatrix}_{P_0} \right) \Theta_{(i),P_0}$$
$$= \left( \begin{bmatrix} P_{(i+1)}^{-*/2} \\ Q_{\alpha,(i+1)} \end{bmatrix}_{P_0} \begin{bmatrix} P_{(i)}^{-*/2} \\ Q_{\alpha,(i)} \end{bmatrix}_{P_1} \begin{pmatrix} {}^{n_0}[0] \\ {}^{n_1}[H_i^*]' \\ \Gamma_{(i+1)}' \end{pmatrix}_{P_0} \right).$$

It can be observed that data not belonging to $P_0$ are not involved in the computations. With the application of $\Theta_{(i),P_0}$, $[P_{(i)}^{-*/2}]_{P_0}$ is converted into $[P_{(i+1)}^{-*/2}]_{P_0}$ and only the first $(i + 1)q$ rows of the matrices $\Gamma_{(i+1)}$ and $[Q_{\alpha,(i)}]_{P_0}$ are updated, due to the structure of $\Gamma_{(i+1)}$ and the zero initial value of $Q_{\alpha,(0)}$ (this will be exploited to save computations). It is also important to note that $[C_{(i+1)}]_{P_0}$ (the first $n_0$ columns of the result $V_{(i)}'$) are the first $n_0$ columns of the matrix $V_{(i+1)}$. This is useful to obtain a pipelined behaviour in the processing work, with minimum data movement from one iteration to the next.

Now, if $P_0$ transfers ${}^{n_1}[H_i^*]'$ and the nonzero part of $\Gamma_{(i+1)}'$ to $P_1$, then $P_1$ can obtain zeros in the $n_1$ rows of ${}^{n_1}[H_i^*]'$ with the application of the unitary transformation sequence $\Theta_{(i),P_1}$. Simultaneously, $P_0$ could work with $[C_{(i+1)}]_{P_0}$ if the new input data is available (pipelined behaviour):

$$V_{(i)}'' = V_{(i)}' \Theta_{(i),P_1} = \left( \begin{bmatrix} P_{(i+1)}^{-*/2} \\ Q_{\alpha,(i+1)} \end{bmatrix}_{P_0} \begin{bmatrix} P_{(i)}^{-*/2} \\ Q_{\alpha,(i)} \end{bmatrix}_{P_1} \begin{pmatrix} {}^{n_0}[0] \\ {}^{n_1}[H_i^*]' \\ \Gamma_{(i+1)}' \end{pmatrix}_{P_1} \right) \Theta_{(i),P_1}$$
$$= \left( \begin{bmatrix} P_{(i+1)}^{-*/2} \\ Q_{\alpha,(i+1)} \end{bmatrix}_{P_0} \begin{bmatrix} P_{(i+1)}^{-*/2} \\ Q_{\alpha,(i+1)} \end{bmatrix}_{P_1} \begin{pmatrix} {}^{n_0}[0] \\ {}^{n_1}[0] \\ M \end{pmatrix}_{P_1} \right) = W_{(i)}.$$

Then, the identity in (3) is obtained in a pipelined way by means of: $V_{(i)} \Theta_{(i)} = V_{(i)} \Theta_{(i),P_0} \Theta_{(i),P_1} = W_{(i)}$. Let $n_j$ be the number of columns of $C_{(i)}$ assigned to $P_j$, and $r_{0j}$ the index of the first of them. The arithmetic cost in processor $P_j$ for the $i$th iteration is:

$$W_{P_j,i} = \sum_{r=1}^{n_j} q\, w_{ROT}(n - r_{0j} + 2 - r + [i + 1]q) = [6q(n - r_{0j} + 2 + [i + 1]q) - 3q] n_j - 3q n_j^2 \qquad (6)$$

flops. It can be verified that the parallelization arithmetic overhead is zero.

3.3 Load Balance
Due to the lower triangular structure of $P_{(i)}^{-*/2}$ in $C_{(i)}$, every column (or group of consecutive columns) of this matrix has a different associated cost, so if the same number of columns of $C_{(i)}$ is assigned to each processor, the parallel algorithm is clearly unbalanced. It is advisable to balance the workload at iteration level in order to minimize processor waiting time, so the optimum number of columns $n_j$ assigned to processor $P_j$ can be calculated by solving a second order equation from $W_{P_j,i} = \frac{1}{p} W_{sec,i}$ using (4) and (6), beginning with $n_0$, with $r_{00} = 1$, up to $n_{p-1}$. This result depends on the iteration index $i$, so if a static load balance scheme is desired, the load can be balanced for the worst case, $i = m/q - 1$. An alternative way to balance the workload is to assign columns to a processor in a symmetrical way, but the number of transfers will nearly double.

3.4 Communications and Scalability
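Before turning to communication, the static balancing rule of Section 3.3 can be made concrete with a small Python sketch. It solves the second order equation $W_{P_j,i} = W_{sec,i}/p$ from (4) and (6) processor by processor and rounds to whole columns. The function names and the rounding and clamping policy are assumptions for this illustration, not part of the original algorithm description.

```python
import math

def w_sec(n, q, i):
    """Eq. (4): sequential cost of iteration i in flops."""
    return 3 * q * n**2 + 6 * q**2 * n * (i + 1) + 3 * q * n

def w_proc(n, q, i, r0, nj):
    """Eq. (6): cost of nj consecutive columns starting at column r0."""
    return (6 * q * (n - r0 + 2 + (i + 1) * q) - 3 * q) * nj - 3 * q * nj**2

def balanced_columns(n, q, i, p):
    """Column counts n_0..n_{p-1} that roughly equalize the cost (6)."""
    target = w_sec(n, q, i) / p
    counts, r0 = [], 1
    for j in range(p - 1):
        b = 6 * q * (n - r0 + 2 + (i + 1) * q) - 3 * q
        disc = max(b * b - 12 * q * target, 0.0)
        nj = (b - math.sqrt(disc)) / (6 * q)   # smaller root of 3q*x^2 - b*x + target = 0
        # round, and leave at least one column for each remaining processor
        nj = max(1, min(round(nj), n - r0 + 1 - (p - 1 - j)))
        counts.append(nj)
        r0 += nj
    counts.append(n - r0 + 1)                  # last processor takes the rest
    return counts
```

For the worst case $i = m/q - 1$ the later processors receive more columns than the earlier ones, because the cost per column in (6) decreases with the column index.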
The organization of the parallel algorithm requires that processor $P_j$ transfers the nonzero part of the updated $(D_{(i)})_{P_j}$ matrix to $P_{j+1}$, $0 \le j < p - 1$, so the communication pattern is that of a unidirectional array of processors. The number of elements transferred to $P_{j+1}$ is $(i + 1)q + \sum_{k=j+1}^{p-1} q n_k$. Let us suppose that the time to transfer all this information from $P_j$ to $P_{j+1}$ in the $i$th iteration can be modeled as a linear function of the number of elements to transfer. The worst case takes place when there is no overlap of computation and communication, and all the transfers are performed serially. For this worst case, the communication time is:

$$T_C = \sum_{i=0}^{m/q-1} \sum_{j=0}^{p-1} T_{C,P_j,i} = \Theta(mnp) + \Theta(mp^2) + \Theta\!\left(\frac{pm^2}{q}\right). \qquad (7)$$

The scalability of the parallel system based on the isoefficiency function [9] can be evaluated by calculating how the workload must be increased with the number of processors to obtain a certain efficiency. This can be achieved by comparing the sequential time (5) with the total parallel overhead time. In the studied case, the only theoretical source of overhead is the communication time (7), so the total parallel overhead time is $pT_C$. If $W_{sec} t_w = pT_C$, where $t_w$ is the time per elementary arithmetic operation, the workload must be increased as the worst case of $n = \Theta(p^2)$, $nm = \Theta(p^3)$ or $nmq = \Theta(p^2)$. If $n = \Theta(m)$, then the worst case is $n = \Theta(p^2)$.

3.5 Experimental Results
The tests have been run on a ccNUMA architecture multiprocessor running a 64-bit Linux operating system with up to 16 processors available to a single user. Each processor is a 1.3 GHz Itanium 2. The programs have been coded in Fortran using a proprietary MPI communications library. Figure 1 shows the ratio of the sequential execution times of the SRKF-OSIC and SRIF-OSIC (Givens) versions for several values of $q$, where it can be observed that the proposed SRIF-OSIC algorithm is faster than SRKF-OSIC. The sequential execution time for several values of $q$ can be observed in Figure 2. The curves show execution times that decrease as $q$ increases, except for $q = 50$. Therefore, it can be deduced that there is an optimum value of $q$ that results in a minimum execution time. The approximate operation count (5) does not depend on $q$, so there is no algorithmic reason for such a minimum; the processor architecture is the cause of this behavior. Figure 3 shows the efficiency of the proposed parallel algorithm for $q = 20$ and $m = 6000$, depicting a good level of efficiency in the results for $p = 2$, 4, 8 and 16 processors.

Fig. 1. Execution time ratio of SRKF-OSIC and SRIF-OSIC (Givens) versions for m = 6000 (plot of the ratio versus n, curves for q = 1, 5, 20, 50)

Fig. 2. SRIF-OSIC (Givens) sequential execution time for m = 6000 (plot of time in seconds versus n, curves for q = 1, 5, 20, 50)

Fig. 3. Efficiency of the SRIF-OSIC (Givens) parallel algorithm for q = 20 and m = 6000 (plot of efficiency versus n, curves for p = 2, 4, 8, 16)
4 Conclusions
A novel algorithm to solve the OSIC decoding problem, based on the Information Filter and with better performance than the Kalman Filter-based reference, is proposed. The improvement lies in the fact that a matrix multiplication and the application of some rotations are unnecessary in the new algorithm. The same conclusions apply to the highly efficient parallel version. Moreover, thanks to the pipelined parallelization, implementation in VLSI systems is also feasible.

Acknowledgements. This work has been supported by the Spanish Government MEC and FEDER under grant TIC 2003-08238-C02.
References
1. Foschini, G.J.: Layered space-time architecture for wireless communications in a fading environment when using multiple antennas. Bell Labs Technical Journal 1, 41–59 (1996)
2. Sellathurai, M., Haykin, S.: Turbo-BLAST for wireless communications: theory and experiments. IEEE Transactions on Signal Processing 50(10), 2538–2546 (2002)
3. Wolniansky, P.W., Foschini, G.J., Golden, G.D., Valenzuela, R.A.: V-BLAST: An architecture for realizing very high data rates over the rich-scattering wireless channel. In: Proc. IEEE ISSSE 1998, pp. 295–300 (1998)
4. Viterbo, E., Boutros, J.: A universal lattice decoder for fading channels. IEEE Trans. Inf. Theory 45(5) (July 1999)
5. Hassibi, B.: An efficient square-root algorithm for BLAST. In: IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. II737–II740 (2000)
6. Choi, Y.-S., Voltz, P.J., Cassara, F.A.: On channel estimation and detection for multicarrier signals in fast and selective Rayleigh fading channels. IEEE Transactions on Communications 49(8) (August 2001)
7. Sayed, A.H., Kailath, T.: A state-space approach to adaptive RLS filtering. IEEE Signal Processing Magazine 11(3), 18–60 (1994)
8. Golub, G.H., Van Loan, C.F.: Matrix Computations. Johns Hopkins University Press, Baltimore, MD, USA (1996)
9. Kumar, V., Grama, A., Gupta, A., Karypis, G.: An Introduction to Parallel Computing: Design and Analysis of Algorithms, ch. 4, 2nd edn. Addison-Wesley, Harlow, England (2003)
A Self-scheduling Scheme for Parallel Processing in Heterogeneous Environment: Simulations of the Monte Carlo Type

Grzegorz Musial1, Lech Dębski1, Dorota Jeziorek-Kniola1, and Krzysztof Gołąb2

1 Institute of Physics, A. Mickiewicz University, ul. Umultowska 85, 61-614 Poznań, Poland
[email protected]
2 Technische Universität München, Fakultät für Informatik, Boltzmannstr. 3, 85748 Garching
Abstract. A parallelization scheme that drives processing in simulations of the Monte Carlo type and is suitable for a highly heterogeneous, general-purpose computer system is proposed. Message passing is applied and the MPI library is used. For testing, the 2D Ising model in a magnetic field is taken. The dependence of the speedup on the number of parallel processes is studied, showing that the scheme works well in different parallel computer systems. The condition for the best speedup in these simulations is explained. The possibility of parallel use of any computing power available in the surroundings is also indicated.

Keywords: parallelization of processing, heterogeneous environment, Monte Carlo simulations.
1 Introduction
There are several factors that today stimulate the evolution of computing systems towards parallel ones. The finite speed of light and the effectiveness of heat dissipation impose physical limits on the speed of a single computer. Moreover, the cost of an advanced single-processor computer increases more rapidly than its computing power. A significant reduction of the price-to-performance or price-to-throughput ratio is achieved by utilizing networks of PCs or workstations as parallel computers. The list of the sites operating the 500 most powerful computer systems [1] clearly demonstrates this tendency. One should also note the step-by-step growth of wide-area networks that may span the globe in a more or less distant future.

Symmetric multiprocessing (SMP) is a subject of many studies dealing with parallel computing. Usually SMP is performed on clusters of computers with CPUs of the same type. Using a queue system, a user program has at its exclusive disposal a number of CPUs in a computer system (a multiprocessor, a multicomputer, a commodity cluster, a network of workstations or PCs), by direct access or within a grid. Here we look for computing power beyond such systems.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 429–438, 2008. © Springer-Verlag Berlin Heidelberg 2008
This paper argues for the use of any computers, not necessarily dedicated to computing, such as PCs, even highly heterogeneous ones, as a parallel computer system. Such a computer system can concurrently execute many processes of any kind, sequential and parallel, at the same time. We present here a self-scheduling scheme for parallel processing in simulations which can be used effectively in such computer systems and is able to utilize any computing power of the available computers. There is no need to restrict their use for other purposes. We answer the interesting question of how effective this scheme is in simulations of the Monte Carlo type.

We have used the MPI library [2] to parallelize the computational process in these simulations. The main reason for this choice is that message passing is the most effective and universal of the parallel computational models, as it works effectively in every computer system with distributed, shared or mixed memory and suits separate processors connected by a fast or slow communication network. The most powerful computer systems now have a combined memory structure: the system is built of distributed groups of processors (usually 2, 4 or 8) which share their local memory, and these groups are connected by interconnection networks that keep up with the speeds of advanced single processors. For many applications, however, including our simulations, Ethernet is sufficient as the communication environment.

We take the phase transitions in an Ising model on a square lattice in a magnetic field as an example of simulations, as it is simple, reflects the main features of typical systems simulated by a Monte Carlo method and still finds many interesting new applications. In the concluding remarks we place the results of these simulations in a wider context, namely the interpretation of the Falicov-Kimball model [3,4].
2 The Simulations of the Monte Carlo Type
The simulation of the Monte Carlo type is a kind of experiment performed to predict the behavior of a macroscopic system (i.e. one with a large number of degrees of freedom) given the laws governing its microscopic behavior. A convenient starting point for the latter is the energy operator H (Hamiltonian), as it allows a description of the behavior of a system on a microscopic level using the laws of quantum mechanics. We take an Ising model on a square lattice in a magnetic field, as mentioned above. The effective Hamiltonian of this model is of the form

$$-\frac{H}{k_B T} = \sum_{[i,j]} K s_i s_j + h \sum_i s_i, \qquad (1)$$
where $s_i$ is the Ising degree of freedom residing on each lattice site, which can take only the two values +1 or −1, [i, j] denotes the summation over pairs of nearest-neighbor lattice sites i and j, $K = -J/(k_B T)$, J is the coupling of the nearest-neighbor interaction between the Ising degrees of freedom (i.e. further interactions are neglected in this model), $h = B/(k_B T)$, B is the intensity of the applied static and homogeneous magnetic field, $k_B$ is the Boltzmann constant, and T is the temperature.

In the simulations we generate equilibrium microstates (i.e. possible configurations of all degrees of freedom in the system) of finite-size square samples of size L × L for fixed values of the model parameters, using the Metropolis algorithm, i.e. trying to reverse single spins. We have chosen this algorithm because of its simplicity and its typical structure. There are other algorithms of similar structure which flip clusters of spins, leading to a significant reduction (or even elimination) of the critical slowing down [5]. Recently an interesting algorithm has been published [6] which uses a random walk in the energy space to estimate the density of states of the system studied.

Before starting the simulations one has to drive the system into the equilibrium state. For this purpose, thermalization consisting of $10^6$ Monte Carlo steps (MCS) is applied. As usual, an MCS is completed when each of the lattice sites has been visited once. An MC run is split into k segments consisting of about $10^6$ MCS, and one partial average (PA) of each measured quantity is calculated from the microstates in each segment to estimate the statistical scatter of the results. Only every s-th MC step, with s = 10, contributes to the calculation of the partial averages, to avoid correlations between the sampled microstates of spins in the system and to sample microstates with the Gibbs probability distribution. In our simulations periodic boundary conditions were imposed and 48- and 64-bit random number generators were used.
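The single-spin-flip procedure described above can be sketched as follows. This is an illustrative Python sketch, not the authors' Fortran program: it assumes the reduced Hamiltonian $-H/(k_B T) = K\sum s_i s_j + h\sum s_i$ with periodic boundaries, visits the sites in a fixed order, and uses hypothetical names (`metropolis_sweep`, `partial_average`, `stride` for the sampling interval s). Magnetization per site is used as the measured quantity purely as an example.

```python
import math
import random


def metropolis_sweep(spins, L, K, h, rng):
    """One Monte Carlo step (MCS): every lattice site is visited once."""
    for x in range(L):
        for y in range(L):
            s = spins[x][y]
            # Sum of the four nearest neighbors with periodic boundaries.
            nn = (spins[(x + 1) % L][y] + spins[(x - 1) % L][y]
                  + spins[x][(y + 1) % L] + spins[x][(y - 1) % L])
            # Change of -H/(k_B T) caused by the trial flip s -> -s.
            delta = -2.0 * s * (K * nn + h)
            if delta >= 0.0 or rng.random() < math.exp(delta):
                spins[x][y] = -s


def partial_average(spins, L, K, h, mcs, stride, rng):
    """Mean magnetization per site over one segment of `mcs` steps,
    sampling only every `stride`-th step (assumes mcs >= stride)."""
    total, count = 0.0, 0
    for step in range(1, mcs + 1):
        metropolis_sweep(spins, L, K, h, rng)
        if step % stride == 0:
            total += sum(map(sum, spins)) / (L * L)
            count += 1
    return total / count


if __name__ == "__main__":
    L = 8
    rng = random.Random(12345)            # a separate generator per process in practice
    spins = [[1] * L for _ in range(L)]
    for _ in range(200):                  # short thermalization, for illustration only
        metropolis_sweep(spins, L, 0.3, 0.1, rng)
    print(partial_average(spins, L, 0.3, 0.1, mcs=200, stride=10, rng=rng))
```

The run lengths in the usage example are far shorter than the $10^6$ MCS quoted in the text and are chosen only so that the sketch finishes quickly.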
The phase diagram was determined from the analysis of the behavior of the curves $Q_L$ [7,8,9]

$$Q_L = \frac{\langle M^2 \rangle_L^2}{\langle M^4 \rangle_L} \qquad (2)$$

versus K at a fixed value of h, where $\langle M^n \rangle_L$ denotes the n-th power of the order parameter of the spins s, averaged over an assembly of samples of size L × L, each in an independent microstate. The quantity Q is the so-called Binder cumulant. It allows phase transition points to be located from the common intersection point of the curves $Q_L$ for different values of L ([7] and the papers cited therein).
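As an illustration of Eq. (2), the cumulant can be estimated from a list of sampled order-parameter values. The sketch below is an assumption-laden example rather than the authors' code: the name `binder_cumulant` is hypothetical, and plain arithmetic means over the samples stand in for the averages $\langle M^2 \rangle_L$ and $\langle M^4 \rangle_L$.

```python
def binder_cumulant(magnetizations):
    """Estimate Q_L = <M^2>^2 / <M^4> from sampled order-parameter values."""
    n = len(magnetizations)
    m2 = sum(m ** 2 for m in magnetizations) / n   # estimate of <M^2>_L
    m4 = sum(m ** 4 for m in magnetizations) / n   # estimate of <M^4>_L
    return m2 * m2 / m4


if __name__ == "__main__":
    # Fully ordered samples: every |M| is 1, so the ratio is exactly 1.
    print(binder_cumulant([1.0, -1.0, 1.0, -1.0]))
    # Samples with a spread of |M| give a value below 1.
    print(binder_cumulant([0.5, 1.0]))
```

Repeating this estimate for several lattice sizes L over a range of K, at fixed h, gives the family of curves whose common intersection point the text uses to locate the transition.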
3 The Self-scheduling Scheme for Parallel Processing
Usually such simulations are performed by running many sequential jobs, but there are at least two reasons for their parallelization. First, these are large-scale simulations [7,10,11]. The second reason is the very long time needed to obtain a single PA when the linear size L of a system becomes large. Importantly, the larger the samples considered, the better the analysis of the results. For higher L it is worth performing the calculations in parallel to reduce the execution time, and parallel calculations become practically necessary for systems with L ≥ 100.
Table 1. The scheme of organization of symmetric parallel processing in the simulations of the Monte Carlo type

Main activities in the parallel program              (comments in parentheses)
-----------------------------------------------------------------------------------------------
Initialize the parallel MPI environment
Get the number p of parallel processes                (that the user specified at start)
Get the rank r of the current parallel process
In 0th process:                                       (i.e. in the master process)
    Input of the model parameters                     (here these are L, K and h)
    Input of the simulation parameters                (these are i, m and s)
    Send the model and simulation parameters
    to the other parallel processes
Produce r*i*m*L^2 random numbers                      (to use the different sets of random
                                                      numbers for each PA)
Calculate i partial averages and send them            (the main part of the program)
to 0th process
0th process collects all partial averages,            (here these results are the values of
calculates and prints the results and their           cumulant Q_L, magnetization and their
standard deviations                                   uncertainties)
Finalize the parallel MPI environment
However, the results of the simulations are reliable only if certain conditions are satisfied. First of all, we have to drive the system into the equilibrium state independently in each parallel process, as presented in Table 1. For this purpose, thermalization of the initial microstates with a length of the order of $10^6$ Monte Carlo steps (MCS) is applied, using completely different sets of random numbers in each parallel process. Moreover, only the use of different sets of random numbers for each parallel process ensures that the partial averages can be reliably regarded as independent.

Here we briefly recall that in symmetric processing [11], after driving the system to thermodynamic equilibrium independently in each of the p parallel processes, as mentioned above, the different processes of the parallelized job calculate their parts of the k partial averages of the moments of the order parameter M, as presented in Table 1. As usual, the speedup u of such a job is defined as $u = t_{ser}/t_{par}$, where $t_{ser}$ and $t_{par}$ denote the sequential and parallel execution time, respectively. For simplicity, in this part we have used only those numbers p which are integer divisors of k = 90, the number of partial averages calculated. In this way the same number of partial averages is calculated by each parallel process and each MC run has approximately the same amount of computational work. Thus, assuming k = ip (the same workload on each parallel process), the speedup u in the ideal case, i.e. when there are no latencies between parallel processes (the same workload and communication for each CPU), should be equal to

$$u = \frac{t_0 + i p t_1}{t_0 + i t_1}, \qquad (3)$$

where $t_0$ is the computing time for leading the system to thermodynamic equilibrium, and $t_1$ is the computing time for one partial average of the moments of the order parameter M. Usually $t_0 = t_1$, and then we obtain

$$u = \frac{1 + ip}{1 + i}. \qquad (4)$$
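The ideal speedup can be tabulated directly for k = 90 partial averages. The snippet below is a small worked example in Python; the function name `ideal_speedup` and the choice of time units are illustrative, and $t_0 = t_1$ is used by default as in the simplified formula.

```python
def ideal_speedup(p, i, t0=1.0, t1=1.0):
    """Ideal speedup u = (t0 + i*p*t1) / (t0 + i*t1) for p processes,
    each computing i partial averages after its own thermalization."""
    return (t0 + i * p * t1) / (t0 + i * t1)


if __name__ == "__main__":
    k = 90  # total number of partial averages
    for p in (1, 2, 5, 10, 15, 30, 45, 90):      # integer divisors of k
        i = k // p                               # partial averages per process
        u = ideal_speedup(p, i)
        print(f"p={p:3d}  i={i:3d}  u={u:6.2f}  efficiency={u / p:5.3f}")
```

For p = 45 each process computes only i = 2 partial averages, so u = 91/3 ≈ 30.3, about two thirds of the value u = p that is approached when i is much larger than 1.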
When the number i of partial averages per process satisfies i ≫ 1, we obtain the ideal speedup u = p.

For processing in heterogeneous computer systems we have to use the master process only as a kind of resource broker which distributes small portions of work among slave processes and collects their results. The scheme should work smoothly and significantly reduce the latencies in a situation where it is difficult to predict how much work particular slaves can execute. Obviously, the amount of work which can be done by a slave depends not only on hardware and software, but also on the workload of the particular computer.

The self-scheduling scheme which we propose for parallel processing in such a heterogeneous, general-purpose computer system is presented in Table 2. We implemented it in Fortran, so the basic MPI operations are subroutine calls. The master process (with r = 0) reads the model and simulation parameters, broadcasts them to the slave processes and sends one of the first PA order numbers to each slave process, then enters a loop of k cycles to receive the results from the slaves. The MPI_ANY_SOURCE and MPI_ANY_TAG parameters are used in the MPI_RECV subroutine call, so the master process always first receives the PA calculated by the currently fastest process. After receiving a PA, the master sends its sender the next PA order number to calculate, or MPI_BOTTOM and 0 as TAG when all PA numbers have already been sent. Here the slave processes do not communicate with one another, only with the master process, which calculates no PA itself, in contrast to the scheme for symmetric processing in Table 1. It is worth noting that the scheme in Table 2 will be less effective in symmetric processing than the scheme presented in Table 1. The scheme in Table 2 can utilize any available amount of computing power.
With such an organization of the parallel program, one only has to install the MPICH environment [2] (or possibly a similar one like LAM/MPI or Open MPI [12]) on each available computer and to set up the communication between them. Each node can also be used for any other purpose, and our program will consume only the remaining computing power. As long as the computers are not overloaded, an interactive user will hardly notice the execution of the additional program. The use of the MPICH environment alone is a compromise reached with the owners of the PCs that contribute to our parallel computer, as they do not want to have a bigger specialized parallel environment installed, like the recently developed HeteroMPI [13], mpC [14] or NetSolve [15]. That is why we underline that the computers used are general-purpose ones, i.e. not dedicated to parallel processing. Moreover, this is a proposal for solving many problems not covered by grants.

Table 2. The self-scheduling scheme for parallel processing in a heterogeneous computer system of a general purpose

Main activities in the parallel program              (comments)
-----------------------------------------------------------------------------------------------
The first row as in Table 1
In the master process:                                (i.e. with r = 0)
    Input of the model and simulation parameters      (model parameters: L, K, h;
    and send them to the slave processes              simulation parameters: i, m, s)
    Send one of the lowest order numbers of PAs
    to each slave process
    Do j=1,k
        Receive and store a PA from any slave
        If there is any PA number to send Then
            send the lower one to the sender
        Else
            send MPI_BOTTOM and the number 0          (to stop a slave)
        EndIf
    EndDo
In each slave process:                                (i.e. with r ≠ 0)
    Receive model and simulation parameters
    before=0; flag=0                                  (No of a previously calculated PA)
    10: Receive the PA order number n
    If (n==0) go to finish                            (the stop condition)
    Generate (n-1-before)*m*L^2 random numbers        (use different random numb. for diff. PAs)
    If (flag==0) Then
        execute m*L^2 MCS; flag=1                     (thermalization)
    EndIf
    Calculate a PA and send it to master
    before=n
    Go to 10
The last two rows as in Table 1
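The control flow of the self-scheduling scheme can be imitated without MPI. The sketch below is an illustration only: Python threads and queues stand in for the MPI processes and messages, `compute_pa` is a hypothetical placeholder for the calculation of one partial average, and the number 0 plays the role of the stop message. It is not the authors' Fortran implementation.

```python
import queue
import threading


def compute_pa(n):
    """Hypothetical stand-in for calculating the partial average with order number n."""
    return float(n * n)


def slave(rank, inbox, results):
    """Slave loop: receive an order number, compute the PA, send it back; stop on 0."""
    while True:
        n = inbox.get()
        if n == 0:                      # the stop condition
            break
        results.put((rank, n, compute_pa(n)))


def self_schedule(k, p):
    """Master: hand out k PA order numbers to p slaves, one at a time, on demand."""
    results = queue.Queue()             # plays the role of receiving from any source
    inboxes = [queue.Queue() for _ in range(p)]
    threads = [threading.Thread(target=slave, args=(r, inboxes[r], results))
               for r in range(p)]
    for t in threads:
        t.start()

    next_n = 1
    for r in range(p):                  # initial distribution of the lowest order numbers
        if next_n <= k:
            inboxes[r].put(next_n)
            next_n += 1
        else:
            inboxes[r].put(0)           # more slaves than PAs: stop the extra slave

    store = {}
    for _ in range(k):                  # loop of k cycles, as in the master process
        rank, n, value = results.get()  # whichever slave finishes first
        store[n] = value
        if next_n <= k:
            inboxes[rank].put(next_n)   # send the next order number to the sender
            next_n += 1
        else:
            inboxes[rank].put(0)        # nothing left: stop this slave

    for t in threads:
        t.join()
    return store


if __name__ == "__main__":
    averages = self_schedule(k=20, p=4)
    print(sorted(averages))
```

Because each slave asks for new work only after returning a result, faster slaves automatically receive more order numbers, which is the property the scheme relies on in a heterogeneous system.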
4 Tests of the Method and the Results
For comparison we have first tested the scheme of symmetric parallel processing presented in Table 1. For this test, we have fixed the parameters of the simulation: the size of the square L = 60, the number of all partial averages k = 90, and the number of MCS used for driving the system to thermodynamic equilibrium (thermalization) and the number of MCS for the calculation of one PA, both equal to $m = 5 \cdot 10^5$. We have calculated the dependence of the speedup u on the number p of parallel processes on the SUN multicomputer (built from units with two dual-core AMD64 Opteron 2 GHz CPUs, MPICH environment). The results are presented in Fig. 1 with a solid line. For comparison, the dotted line represents the ideal speedup u = p.

Fig. 1. The dependence of the speedup u versus the number p of slave processes for parallel Monte Carlo jobs. The computer system, the number of CPUs (cores) and the linear size L of the simulated Ising lattice are listed in the legend box. (Legend: SMP, 1 CPU/slave, L=60; system1: 8 CPUs, L=60; system2: 9 CPUs, L=60; system3: 10 CPUs, L=50.)

Although the solid line in Fig. 1 resembles Amdahl's law, the main reason for such a course of u(p) is of a different nature. It follows from the structure of the algorithm of MC simulations, namely that part of the computing time in each parallel process is used for independent thermalization of the system. One can see from the course of the solid line in Fig. 1 that the latencies were small during these computations, because the reduction of the speedup compared with the ideal case is mainly the effect of the decrease in the ratio of the number of MCS used for the calculation of PAs in a process to the number of MCS used for thermalization. This is clearly seen at the point p = 45, for which only two PAs per process are calculated: thermalization takes 1/3 of all MCS in a process, and the solid line deviates from the ideal speedup by approximately the same fraction. We can conclude from these considerations that for the best speedup one should keep the number of partial averages per process i ≫ 1 during the simulations, as follows from the theoretical considerations presented in the previous section.

With this conclusion in mind, for testing our program with the self-scheduling scheme we have divided the calculations into k = 450 PAs with the same global number of MCS, i.e. here $m = 10^5$. Thus, even when using tens of parallel processes, the condition k/p ≫ 1 for good speedup is still fulfilled. The remaining system and simulation parameters are the same as above.
Fig. 1 also presents the speedup u as a function of the number p of parallel processes in Monte Carlo simulations performed on three different parallel computer systems:
– system1: the heterogeneous commodity cluster containing 2 PCs (each with 1 dual-core AMD Athlon 64 3,6GHz CPU) and 2 PCs (each with 2 AMD64 Opteron 1.8GHz CPUs) connected by Gigabit Ethernet, MPICH;
– system2: the heterogeneous network of 9 PCs (each with 1 Pentium III CPU, 4 of them with an 850MHz clock and 5 of them with a 600MHz clock) connected by Fast Ethernet, MPICH;
– system3: the symmetric system of 10 CPUs (Intel Itanium2 Montecito dual-core 1.6GHz) in the multicomputer with Gigabit Ethernet, MPICH.

It is evident that for symmetric processing this scheme is not as effective as that presented in Table 1 (solid line in Fig. 1), as expected, but it works properly and u increases with an increasing number p of slave processes as long as p does not exceed the number of processors (i.e. cores). One should pay attention to the number of CPUs and to L, which differ between the curves and are listed in the legend box. For the heterogeneous systems we also observe a proper increase in u as p increases until it reaches the number of CPUs (cores). The scatter of the results is not only the effect of scheduling within the program; for systems 1 and 2 it also reflects the current use of the particular computers as PCs (system 1) and student activity (system 2). Only system 3 is dedicated to computing.

This figure also shows that by running two processes per CPU (core) we still shorten the program execution time, as each process alternates between CPU and I/O bursts. Thus, running two processes on one CPU improves processor utilization, but adding a third process per CPU does not significantly shorten the execution time of our program.
5 Concluding Remarks
Our previously used parallelization scheme [11], summarized in Table 1, suits SMP well. In Table 2 we have proposed a parallelization scheme which drives processing in a heterogeneous, general-purpose computer system. The component computers are not dedicated to parallel processing and we have no permission to install a bigger specialized environment like HeteroMPI [13], mpC [14] or NetSolve [15]; we have only MPICH at our disposal. This is a solution for many problems not covered by grants. The basic idea is to use the master process only as a kind of resource broker, whereas the main computational work is done by the slave processes. Such scheduling is analyzed in simulations of the Monte Carlo type by comparing the speedups of jobs with the same number of MCS. It is demonstrated on the example of the 2D Ising model in a magnetic field that the scheme works well; however, the idea is of a more general character and may be applied to many types of simulations, as most of them depend on the size of the system.
The simultaneous use of more CPUs permits low-cost simulations of very large samples, and they can be executed on any kind of computing system, with both shared and distributed memory, thanks to the message-passing parallelization model applied. By using the MPI library to parallelize the computational processes, we simultaneously ensure efficiency, full portability and functionality of the application. We also specify the condition for effective parallelization of such simulations, i.e. without a significant increase in the total CPU time (or a similar decrease in the speedup u or the efficiency E = u/p): the number of partial averages per process should be much greater than 1.

The method is tested by comparing the speedup of calculations, and we demonstrate that it works well in different parallel computer systems. The scalability of simulations scheduled in this way, and the ability to execute them on any distributed system, even on a relatively cheap Ethernet network of PCs, make them very attractive, as most users are not really interested in high-performance computing, but rather in increasing the throughput of their computer system by harnessing as much computing power as is available. Using more and more CPUs, one can consider larger physical systems, which leads to increasingly credible results and to a more effective use of the PCs in the surroundings. An interactive user will hardly notice the execution of our additional processes provided that we do not overload the computers. The MPICH or LAM/MPI environments enable such control. This scheduling can be used nearly everywhere, in contrast to the one based on Open Mosix. Although the latter offers much better load balancing [16], it requires identical processors, and modifying the kernel results in a rather tightly coupled cluster of computers.
In addition, we would like to note that the Ising model in a static and homogeneous magnetic field, which is the test system in our paper, still finds new interesting applications. By performing these simulations we were able to easily construct the phase diagram which we have used to prove the existence of the interesting phenomenon of strip ordering at finite temperatures [10] in the Falicov-Kimball model [3,4].

Acknowledgements. This work was supported in part by the European Commission under the project MAGMANet NMP3-CT-2005-515767. Numerical calculations were also carried out on the platforms of the Poznań Supercomputing and Networking Center and Leibniz-Rechenzentrum München.
References
1. TOP500 Supercomputer Sites, http://www.top500.org/
2. MPI Forum Home Page, http://www.mpi-forum.org/
3. Falicov, L.M., Kimball, J.C.: Phys. Rev. Lett. 22, 997 (1969)
4. Lenański, R., Wojtkiewicz, J.: phys. stat. sol. (b) 236, 408 (2003)
5. Swendsen, R.H., Wang, J.-S.: Phys. Rev. Lett. 58, 86 (1987); Wolff, U.: Phys. Rev. Lett. 62, 361 (1989)
6. Wang, F., Landau, D.P.: Phys. Rev. E64, 056101 (2001)
7. Musial, G.: Phys. Rev. B69, 024407 (2004)
8. Binder, K., Heerman, D.W.: Monte Carlo Simulation in Statistical Physics. Springer Series in Solid State Physics, vol. 80. Springer, Berlin (1988)
9. Binder, K., Landau, D.P.: Phys. Rev. B30, 1877 (1984)
10. Wojtkiewicz, J., Musial, G., Dębski, L.: phys. stat. sol. (c) 3, 199 (2006)
11. Musial, G., Dębski, L.: Lect. Notes in Comp. Scie, vol. 2328, p. 535 (2002)
12. LAM/MPI and Open MPI Home Page, http://www.lam-mpi.org/
13. Lastovetsky, A., Reddy, R.: J. Parallel Distrib. Comput. 66, 197 (2006)
14. Lastovetsky, A.: Parallel Computing 28, 1369 (2002)
15. NetSolve/GridSolve Home Page, http://icl.cs.utk.edu/netsolve/
16. Dębski, L., Musial, G., Rogiers, J.: In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds.) PPAM 2004. LNCS, vol. 3019, p. 455. Springer, Heidelberg (2004)
Asynchronous Parallel Molecular Dynamics Simulations

Jarosław Mederski1, Łukasz Mikulski1, and Piotr Bała1,2

1 Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Chopina 12/18, 87-100 Toruń, Poland
2 Interdisciplinary Center for Mathematical and Computational Modelling, Warsaw University, Pawińskiego 5a, 02-106 Warsaw, Poland
{mastem,frodo,bala}@mat.umk.pl
Abstract. The dynamic development of parallel computers makes them a standard tool for large simulations. These technological achievements are not matched by progress in scalable code design; molecular dynamics is a good example. In this paper we present a novel approach to molecular dynamics based on a new asynchronous parallel algorithm inspired by novel computer architectures. We also present an implementation of the algorithm written in Java. The presented code is object-oriented, multithreaded and distributed. Performance data are also provided.
1 Introduction
Molecular dynamics (MD) is widely used to investigate the structure and function of biomolecular systems. Due to the large sizes and long time scales involved, computer simulations are a very important tool. Biomolecular complexes consisting of components such as proteins, lipids, DNA and RNA, and solvent are typically large in simulation terms. The growing interest in investigating complex systems such as solvated protein complexes leads to molecular systems with millions of atoms [1]. The details of the MD algorithm can be found elsewhere [2,3]. In general, the MD method provides a numerical solution of the classical (Newtonian) equations of motion. Well-known algorithms such as leap-frog [4] and Verlet [5] are used to calculate new positions and velocities. The leading computational component of the MD calculation involves the nonbonded forces, a calculation generally quadratic in the number of atoms that can be reduced to a nearly linear dependence with the cutoff radius approximation coupled with the strategic use of a pairlist [6,7].

Molecular dynamics of large systems requires significant computational resources, especially aggregated CPU time. Over recent years, the architecture of the computers providing the largest computational resources has been changing from vector and scalable multiprocessor computers to clusters and massively distributed systems. As a result, highly scalable codes are required in order to utilize the available computational resources. Most of the codes run on large systems are based on trivial parallelism, which cannot be achieved for molecular dynamics of large systems.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 439–446, 2008. © Springer-Verlag Berlin Heidelberg 2008

Numerous MD parallelizations have been described in the literature, ranging from the easy-to-implement replicated algorithm [8] to the more difficult to implement spatial decomposition [9], which is generally more scalable. However, data-parallel approaches have been found to be problematic due to the irregularity inherent in molecular dynamics. The most successful parallel implementation of molecular dynamics is NAMD2 [10]. NAMD2 is written in Charm++ and has been proved to scale to thousands of processors. However, NAMD2 is not ready for the next-generation parallel machines with hundreds of thousands of processors, due to the limited parallelization exploited in the application.

The parallelization of molecular dynamics is difficult because the widely used MD algorithms have built-in synchronization after each time step. The increasing computational speed of processors leads to a situation where the numerical calculations performed in a single step are relatively short. Therefore synchronization events are very frequent and the work performed between them is relatively small.

In this paper we present work towards a new algorithm of molecular dynamics which removes the barriers to highly scalable code. We propose a new algorithm which removes the implicit synchronization of MD. The code is implemented in a fully object-oriented manner using Java. This allows it to benefit from multithreading on SMP nodes. Java RMI has been used as the main communication layer.
2 Parallelization Strategy

2.1 Asynchronous Algorithm of Molecular Dynamics
The asynchronous molecular dynamics algorithm is based on two-level parallelization. At the first level, the spatial simulation cell is subdivided into cuboids (also called Cubes), one for each node in the parallel system, following the spatial decomposition algorithm [11]. The second level of parallelization takes place within a single cuboid. In our algorithm each atom has its own computation loop and can therefore be treated separately. In each iteration an atom computes the next coordinates of its own trajectory and sends the data only to those cuboids that lie within the cutoff radius of the atom. Hence communication between nodes takes place only when it is needed.

Every atom stores a cutoff-based neighbor list (see [12]). The list is used for a number of time steps to calculate all force interactions and the consecutive coordinates of the trajectory, and it is recalculated every given number of time steps. In the traditional approach the elements of the list are created using global information about the coordinates of all atoms; in more advanced approaches the information is limited to the spatially close subdomains. Such an approach leads to a high communication cost.
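The rule that an atom sends its coordinates only to cuboids within its cutoff radius can be illustrated with a small sketch. It is not taken from the authors' Java code: the function names `cube_of` and `cubes_within_cutoff` are hypothetical, and the sketch assumes a rectangular simulation cell with its origin at zero, a regular grid of cuboids and no periodic boundary conditions.

```python
from itertools import product

def cube_of(pos, box, grid):
    """Index (ix, iy, iz) of the cuboid that contains position pos."""
    return tuple(min(int(p / (b / g)), g - 1) for p, b, g in zip(pos, box, grid))

def cubes_within_cutoff(pos, box, grid, cutoff):
    """Cuboids that have at least one point within `cutoff` of pos.

    The home cuboid of the atom is always included (its distance is zero).
    Assumption: no periodic images are considered.
    """
    edges = [b / g for b, g in zip(box, grid)]
    targets = []
    for idx in product(*(range(g) for g in grid)):
        dist2 = 0.0
        for p, i, e in zip(pos, idx, edges):
            lo, hi = i * e, (i + 1) * e
            d = max(lo - p, 0.0, p - hi)   # distance to the slab [lo, hi]
            dist2 += d * d
        if dist2 <= cutoff * cutoff:
            targets.append(idx)
    return targets
```

For example, in a 10 × 10 × 10 cell split into 2 × 2 × 2 cuboids, an atom at (4.5, 2.0, 2.0) with cutoff 1.0 reaches only its home cuboid and the neighbor across the x = 5 face, so only one remote node needs to be contacted.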
Asynchronous Parallel Molecular Dynamics Simulations
In the presented approach we have made the neighbor list creation hierarchical. An extended cutoff, larger than the traditional one, has been introduced. Each cuboid stores an extended neighbor list: a list of atom pairs within the extended cutoff radius, limited to the pairs that involve an atom located in the particular cuboid. The extended neighbor list is calculated from the information about all atoms in the system, and it is updated less frequently than the traditional pair list. In some sense the extended pair list is an analog of the repartitioning technique used in traditional parallel molecular dynamics algorithms. Please note that the extended pair list can also be updated by exchanging coordinates between neighboring cuboids only.

In addition to the extended pair list, the traditional pair list is used and updated with the usual frequency. This update, however, is based on the extended neighbor list stored at each cuboid and does not require information exchange between different cuboids. Therefore no communication is involved in the updates of the traditional neighbor list.

The main asynchronicity is a consequence of the concept of parallel loops for atoms. Each atom calculates its trajectory independently as long as it has access to the corresponding coordinates of its neighbor atoms. If the coordinates are not available, the particular interaction is skipped and the next atom pair is considered. When all atom pairs have been scanned, the loop starts again from the beginning and only the missing interactions are considered. This is repeated until all interactions for the pairs in the neighbor list have been calculated, after which the calculations for the next time step start. This is possible because atom coordinates are calculated asynchronously. Please note that the proposed scheme extends the traditional Verlet algorithm.
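The two-level neighbor list described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `extended_pairs` and `refine_pairs` are hypothetical, coordinates are plain tuples, and periodic boundaries are ignored.

```python
import math

def extended_pairs(coords, local_ids, r_ext):
    """Pairs (i, j) within r_ext where atom i belongs to this cuboid.

    Uses the coordinates of all atoms, so it corresponds to the
    infrequent update. A pair of two local atoms appears in both
    orders, because every atom keeps its own list.
    """
    pairs = []
    for i in local_ids:
        for j in range(len(coords)):
            if j != i and math.dist(coords[i], coords[j]) <= r_ext:
                pairs.append((i, j))
    return pairs

def refine_pairs(coords, ext_pairs, r_cut):
    """Traditional pair list, built only from the stored extended list."""
    return [(i, j) for i, j in ext_pairs
            if math.dist(coords[i], coords[j]) <= r_cut]
```

The refinement step never looks at atoms outside the extended list, which is why it needs no communication. As with any Verlet-type list, this is safe only while no atom has moved more than half the difference between the two cutoffs since the extended list was built.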
A number of time steps are calculated simultaneously, whereas in the traditional approach an implicit synchronization is performed after each time step. The obvious consequence of the approach proposed here is an increase in the memory required to store coordinates. In the traditional approach one stores the coordinates for the current time step and the next one, while in our approach the atom coordinates for a number of steps must be stored. Once the coordinates for a particular time step have been updated for all atoms, they can be used to calculate various global characteristics of the system (total energy, kinetic energy, etc.) and are stored on disk. Then the coordinates for that time step can be removed from memory. This mechanism is called staging here, and it is the basic concept that removes the implicit global synchronization of atom coordinates after each time step. The size of the staging, i.e. the number of simultaneously stored coordinate sets, can be monitored, and in typical situations it does not exceed 10. Staging increases the memory requirements of the application, but this is not a problem for present-day computational infrastructure. One should note that the presented approach targets massively parallel systems, and therefore the cuboid size will be in the range of a few hundred atoms rather than thousands or millions. The available memory easily allows the coordinates for several time steps to be stored simultaneously.
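The staging mechanism can be pictured as a small buffer keyed by time step. The class below is a sketch under stated assumptions: the name `Staging` and its methods are hypothetical, and a stage is released as soon as every atom has written its coordinates for that step, following the description above. A real implementation would presumably also keep a stage until no atom still needs it for its force calculation.

```python
class Staging:
    """Coordinates of several time steps kept in memory at once (a sketch)."""

    def __init__(self, n_atoms):
        self.n_atoms = n_atoms
        self.stages = {}      # step -> {atom_id: position}
        self.max_size = 0     # largest number of stages held simultaneously

    def put(self, step, atom_id, position):
        """Store one atom's position; return the stage if the step is now complete."""
        stage = self.stages.setdefault(step, {})
        stage[atom_id] = position
        self.max_size = max(self.max_size, len(self.stages))
        if len(stage) == self.n_atoms:
            # All atoms reached this step: hand the stage over for analysis
            # and output, and free the memory.
            return self.stages.pop(step)
        return None

    def get(self, step, atom_id):
        """Position of an atom at a step, or None if it is not yet available."""
        return self.stages.get(step, {}).get(atom_id)
```

The `max_size` counter corresponds to the monitored size of the staging mentioned in the text.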
2.2 Implementation Details
The object-oriented algorithm described in the previous section has been implemented in Java. The communication mechanism used is Remote Method Invocation (RMI) (see footnote 1). The application consists of two parts, a client and servers. The client is a single object, ParallelMD, which reads the initial coordinates and starts the computation. The coordinates are distributed among the servers according to the spatial decomposition, which is known to reduce communication between processors (see [15]). The initial atom coordinates, velocities and simulation parameters are broadcast by the client to the servers. Each server is an instance of the class Cube; a dedicated Cube object is created for each computing node.
Fig. 1. Schema of the communication between Cubes
Each atom in the system is represented by a separate object (an instance of the class Atom) which runs in a separate thread. In our model the numerical calculations required for molecular dynamics are performed by the objects which represent atoms. The large number of objects allows for easy and scalable distribution of the computations. Additionally, because each atom constitutes a separate thread, fine-grained parallelism is available on multicore systems. Please note that the large number of objects, even taking into account the performance overhead caused by object handling, is not a problem for present-day computers. The most important problems are the communication time and the frequency of synchronization events.

Footnote 1: In the development process we also used other packages designed for the development of distributed applications, such as JavaSymphony [13] and ProActive [14]. The results, however, were not satisfactory.
Every atom builds a list which contains all its neighbors, i.e. the atoms located within the cutoff distance. The list is created from the extended neighbor list as described before. After that, in subsequent steps, the atom thread computes the partial forces between itself and each neighboring atom. The positions of atoms are taken from the data stored in the local Cube object, so no communication with other Cubes is necessary for this part of the calculations. When some coordinates are not yet available locally, the atom thread goes on to compute another partial force and comes back to the skipped calculation later. This is handled by a special dynamic list of interaction pairs. When the end of the list is reached, the atom thread sleeps for a while and then loops over the pair interactions once more, omitting the previously visited neighbors. Finally, the atom thread sends its new position to all neighboring Cubes. The communication process is therefore asynchronous and overlaps with the calculations.
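The loop over interaction pairs with deferred entries can be sketched as below. The sketch is in Python rather than the authors' Java, and everything in it is an assumption made for illustration: the function name `accumulate_forces`, the `lookup` callback that returns `None` for coordinates that have not arrived yet, the `pair_force` callback, and the `nap` and `max_sweeps` parameters.

```python
import time

def accumulate_forces(neighbors, lookup, pair_force, nap=0.001, max_sweeps=1000):
    """Sum partial forces over `neighbors`, deferring those not yet available.

    lookup(j) returns the neighbor's position, or None if it has not arrived;
    pair_force(position) returns the partial force as a 3-tuple.
    """
    total = [0.0, 0.0, 0.0]
    pending = list(neighbors)          # dynamic list of interaction pairs
    for _ in range(max_sweeps):
        missing = []
        for j in pending:
            pos = lookup(j)
            if pos is None:
                missing.append(j)      # skip now, come back in a later sweep
                continue
            for k, f in enumerate(pair_force(pos)):
                total[k] += f
        if not missing:
            return tuple(total)
        pending = missing              # previously visited neighbors are omitted
        time.sleep(nap)                # sleep for a while before the next sweep
    raise TimeoutError("coordinates never arrived for atoms %s" % pending)
```

Each sweep visits only the neighbors that were skipped before, so no partial force is counted twice, and the thread makes progress on whatever data is already present instead of blocking on the slowest neighbor.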
3 Results
The implementation has been tested on several parallel systems with different architectures. In particular, we have used multicore SMP systems, a closely coupled cluster of workstations and a homogeneous network of workstations (Table 1).

Table 1. Configuration of the test systems

                      System 1              System 2        System 3
Nodes                 8 nodes               32 nodes        2 SMP nodes
Processors per node   1                     1               4
Processor             AMD Athlon 64 3400+   Pentium IV      AMD Athlon 64 3400+
Memory                1 GB per node         1 GB per node   32 GB per node
Network               Gigabit Ethernet      Fast Ethernet   Gigabit Ethernet

An important problem we faced was the system restrictions on running multithreaded Java RMI code. On the described systems, because of operating system restrictions, it was difficult to run more than a thousand threads at the same time. Another limit, caused by the Java RMI implementation, was the number of open files. As a result, the size of the systems used for the tests was limited to about 500 atoms per node. The code has been tested for systems with 2000-5000 atoms. One should note that the existing limits could be removed in the future by a redesign of the Java RMI and thread implementations. This would increase the number of atom threads which can be run on a single computational node.

The tests have been performed for a system of argon atoms interacting through the Lennard-Jones potential. Short-range interactions were not present, but they involve only neighboring atoms and their implementation is straightforward. The initial atom velocities were drawn from the Boltzmann distribution.

Fig. 2. Efficiency tested on the 8-node cluster (System 1) for various data sizes. The lines connect the results corresponding to the shortest simulation time for a given number of atoms.

Fig. 3. Efficiency tested for 2729 atoms on the different computer systems described in Table 1. Systems 2a and 2b differ in connection architecture: 2a is connected to one router, 2b is connected to two different routers. The lines are drawn for the results corresponding to the shortest simulation time for a given system.

The total time of the simulation of 200 steps of molecular dynamics for different numbers of atoms is presented in Figure 2. For the largest investigated system good scaling is achieved; for smaller numbers of atoms, the time of the calculations performed on a single node becomes comparable to the communication time and the scaling is not as good. Some irregularities are observed for 6 nodes, which reflects a less optimal spatial decomposition. The presented data confirm that the developed code scales well with an increasing number of processors.

Figure 3 presents the simulation time obtained on different parallel hardware. The data were obtained for the largest investigated system, containing 2729 atoms. The simulation time differs because of the different computational speeds of the processors. In all cases the scaling is similar and does not depend on the cluster type or the connection hardware. This confirms that the speedup is not limited by the communication, which takes place asynchronously and overlaps with the computations.
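For reference, the pair interaction used in the argon test system of this section has the standard Lennard-Jones form. The sketch below evaluates it in reduced units (epsilon = sigma = 1), which is an assumption made here for illustration; the paper does not state the parameter values it used, and the function name `lj_energy_force` is hypothetical.

```python
def lj_energy_force(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones pair energy and radial force at separation r.

    U(r) = 4*eps*((sigma/r)**12 - (sigma/r)**6) and F(r) = -dU/dr,
    so a positive force means repulsion.
    """
    sr6 = (sigma / r) ** 6
    energy = 4.0 * epsilon * (sr6 * sr6 - sr6)
    force = 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r
    return energy, force
```

At r = sigma the energy crosses zero, and at r = 2^(1/6) sigma the force vanishes and the energy reaches its minimum of -epsilon. Because the interaction decays as r^-6, truncating it at a cutoff radius is what makes the neighbor list approach of Section 2.1 applicable.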
4 Conclusions
We have presented a new algorithm for parallel molecular dynamics simulations and its implementation written in Java. The algorithm allows for asynchronous communication that overlaps with the computations. As a result, good scaling of the parallel code has been achieved on different hardware, in particular on PC clusters. The implementation is object-oriented and allows for efficient utilization of multicore systems through Java threads. The presented data confirm that the implementation scales well on the most popular parallel architectures. The efficiency and scalability of the proposed approach are still limited by the available implementations of Java threads and RMI; however, we expect that these difficulties will be removed in the near future.

Acknowledgements. We thank Terry Clark for discussions which motivated this work. We would like to acknowledge preliminary work performed by Magdalena Lisiecka and Dorota Urbańska. The computations were partially performed at the ICM, Warsaw University.
References

1. Sanbonmatsu, K.Y., Tung, C.-S.: High performance computing in biology: Multimillion atom simulations of nanoscale systems. Journal of Structural Biology 157, 470–480 (2007)
2. Allen, M.P., Tildesley, D.J.: Computer simulation of liquids. Clarendon, Oxford (1987)
3. McCammon, J.A., Harvey, S.: Dynamics of proteins and nucleic acids. Cambridge Univ. Press, Cambridge (1989)
4. van Gunsteren, W.F., Berendsen, H.J.C.: A leap-frog algorithm for stochastic dynamics. Molecular Simulation 1(3), 173–182 (1988)
5. van Gunsteren, W.F., Berendsen, H.J.C.: Algorithms for Brownian dynamics. Molecular Physics 45, 637–647 (1982)
6. Verlet, L.: Computer experiments on classical fluids. I. Thermodynamical properties of Lennard-Jones molecules. Physical Review 159(1), 98 (1967)
7. Eastwood, J.W., Hockney, R.W.: Computer Simulation Using Particles. Cambridge University Press, Cambridge (1987)
8. Clark, T.W., McCammon, J.A.: Parallelization of a molecular dynamics nonbonded force algorithm for MIMD architecture. Computers & Chemistry 14(3), 219–224 (1990)
9. Clark, T., von Hanxleden, R., McCammon, J.A., Scott, L.R.: Parallelizing molecular dynamics using spatial decomposition. In: Scalable High Performance Computing Conference, Knoxville, May 1994, pp. 95–102. IEEE Computer Society, Los Alamitos (1994), Available via anonymous ftp from: softlib.rice.edu as pub/CRPCTRs/reports/CRPC-TR93356-S
10. Kalé, L., Skeel, R., Brunner, R., Bhandarkar, M., Gursoy, A., Krawetz, N., Phillips, J., Shinozaki, A., Varadarajan, K., Schulten, K.: NAMD2: Greater scalability for parallel molecular dynamics. J. Comput. Phys. 151(1), 283–312 (1999)
11. Plimpton, S.: Fast parallel algorithms for short-range molecular dynamics. J. Comput. Phys. 117(1), 1–19 (1995)
12. Verlet, L.: Computer "Experiments" on Classical Fluids. II. Equilibrium Correlation Functions. Phys. Rev. 165, 201–214 (1968)
13. Fahringer, T., Jugravu, A.: JavaSymphony: A new programming paradigm to control and to synchronize locality, parallelism, and load balancing for parallel and distributed computing. Concurrency and Computation: Practice and Experience 17(7–8), 1005–1025 (2005)
14. Baude, F., Baduel, L., Caromel, D., Contes, A., Huet, F., Morel, M., Quilici, R.: Programming, Deploying, Composing, for the Grid. In: Cunha, J.C., Rana, O.F. (eds.) GRID COMPUTING: Software Environments and Tools. Springer, Heidelberg (2006)
15. Bing, W., Jiwu, S., Weimin, Z., Jinzhao, W., Min, C.: Hybrid Decomposition Method in Parallel Molecular Dynamics Simulation Based on SMP Cluster Architecture. Journal of Tsinghua Science and Technology 10(2), 183–188 (2005)
Parallel Computing of GRAPES 3D-Variational Data Assimilation System

Xiaoqian Zhu, Weimin Zhang, and Junqiang Song

National Laboratory for Parallel and Distributed Processing, Changsha, Hunan, 410073, China
[email protected]
Abstract. Three-dimensional variational assimilation (3D-Var) is currently the most commonly used technique for generating an analysis that provides more consistent initial conditions for numerical weather prediction (NWP). The Global and Regional Assimilation Prediction System (GRAPES) is a new-generation NWP system in China, in which 3D-Var is one of the main components and plays an important role in the direct assimilation of non-conventional observations. In this study, the principal theory and serial implementation of GRAPES 3D-Var are introduced first. The distributed parallel computing algorithm of GRAPES 3D-Var is then discussed in detail, including the data partitioning strategies, the data communication strategies and the stagger parallelization strategies. Finally, parallel experimental results on a 16-CPU cluster platform are presented; they show that the parallel strategies can be combined to achieve considerable load balancing and good performance.

Keywords: variational data assimilation, parallel computing.
1 Introduction
In the meteorological community, an analysis is an accurate image of the true state of the atmosphere at a given time, which can be used as the initial conditions for numerical weather prediction (NWP). Many data assimilation techniques have been developed to produce a good analysis. In 1963, Gandin gave a theoretical derivation of the Optimal Interpolation (OI) method, which minimizes the expected variance of the analysis using a priori knowledge of the error characteristics of both the observations and the background. At present, the most commonly used data assimilation technique is the variational approach (VAR), which assimilates observations under the constraint of the balance equations (3D-Var) or even of the numerical forecast model itself (4D-Var). 3D-Var has been considered a necessary prerequisite to the ultimate goal of 4D-Var in the evolution of variational data assimilation systems.

The Global and Regional Assimilation Prediction System (GRAPES) is a research and development project launched in 2001 to develop a new-generation NWP system in China. One of the four main components of the project is the development of variational data assimilation systems with emphasis on the direct assimilation of satellite and radar data. GRAPES 3D-Var is a large-scale minimization problem and its practical implementation is complicated. This paper presents a relatively simple but effective parallel computing scheme with two claims of innovation: 1) the parallel scheme is efficient, requires little communication and is easy to implement in a parallel computing environment; 2) it parallelizes in the vertical direction, a different point of view from the existing parallelization in the horizontal direction.

The remainder of this paper is laid out as follows. Section 2 introduces the principle and formulation of GRAPES 3D-Var. Section 3 describes the serial algorithm of GRAPES 3D-Var. Section 4 discusses the parallel scheme and its practical implementation in detail. Section 5 presents the parallel experiments and their results. Section 6 concludes the study and mentions further work.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 447–456, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Principle and Formulation
The goal of a VAR system is to find the analysis as an approximate solution to the equivalent minimization problem defined by the cost function. According to the GRAPES 3D-Var design scheme, a scalar cost function $J(x)$ is defined as in Eq. (1); the most probable estimate of the analysis $x_a$ is obtained by minimizing Eq. (1) with respect to $x$:

\[ J(x) = \frac{1}{2}(x - x_b)^T B^{-1}(x - x_b) + \frac{1}{2}(H(x) - y_o)^T R^{-1}(H(x) - y_o) \qquad (1) \]
In Eq. (1), the solution $x = x_a$ represents the a posteriori maximum likelihood estimate of the true state of the atmosphere given two sources of a priori data: the background $x_b$ and the observations $y_o$. $B$ and $R$ are the background and observation error covariance matrices, respectively. The observation operator $H$ transforms the analysis $x$ to observation space, $y = H(x)$, for comparison with the observations.

2.1 Incremental Method
Given a model state $x$ with $n$ degrees of freedom, the background term $J_b$ of Eq. (1) requires $\sim O(n^2)$ calculations. For a typical NWP model with $n \sim 10^6$ to $10^7$, direct solution is computationally expensive. To reduce the computational expense, a low-resolution increment $\delta x = x - x_b$ is introduced, and the cost function is minimized at the low resolution of $\delta x$. Assuming that the observation operator $H$ can be linearized in the vicinity of $x_b$, i.e. $H(x) = H(x_b + \delta x) \approx H(x_b) + \mathbf{H}\,\delta x$, Eq. (1) can be rewritten in the incremental formulation of Eq. (2):

\[ J(\delta x) = \frac{1}{2}\delta x^T B^{-1} \delta x + \frac{1}{2}(\mathbf{H}\delta x - d)^T R^{-1}(\mathbf{H}\delta x - d) \qquad (2) \]

where $d = y_o - H(x_b)$ is called the innovation vector (O − B), and the linearized observation operator $\mathbf{H}$ is defined as the differential of $H$ in the vicinity of $x_b$.
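The step from Eq. (1) to Eq. (2) can be written out explicitly. The following short derivation uses only the definitions given above; the bold $\mathbf{H}$ is the linearized observation operator.

\[
\begin{aligned}
x - x_b &= \delta x, \\
H(x) - y_o &\approx H(x_b) + \mathbf{H}\,\delta x - y_o
            = \mathbf{H}\,\delta x - \bigl(y_o - H(x_b)\bigr)
            = \mathbf{H}\,\delta x - d .
\end{aligned}
\]

Substituting the first line into the background term and the second line into the observation term of Eq. (1) gives the two terms of Eq. (2).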
2.2 Preconditioning

In Eq. (2), the background error covariance matrix $B$ is enormous: it has $n^2$ elements, typically $10^6 \times 10^6$, which is much too large to be specified explicitly. At the same time, the condition number of $J$ is usually large, which makes the minimization of Eq. (2) difficult. Therefore a preconditioning via a control variable transform $v = L^{-1}\delta x$ is performed, where the transform $L$ is chosen to approximately satisfy the relationship $B = LL^T$. Substituting $\delta x = Lv$ gives $\delta x^T B^{-1} \delta x = v^T L^T (LL^T)^{-1} L v = v^T v$, so Eq. (2) may be rewritten as:

\[ J(v) = J_b + J_o = \frac{1}{2}v^T v + \frac{1}{2}(\mathbf{H}Lv - d)^T R^{-1}(\mathbf{H}Lv - d) \qquad (3) \]

In Eq. (3), the background term is essentially diagonalized, which reduces the calculations required to evaluate $J_b$ from $O(n^2)$ to $O(n)$. Generally, the condition number of Eq. (3) is smaller than that of Eq. (2). Differentiating the scalar $J(v)$ with respect to the vector $v$ produces the gradient $\nabla J$ in control variable space, see Eq. (4):

\[ \nabla J = \nabla J_b + \nabla J_o = v + L^T \mathbf{H}^T R^{-1}(\mathbf{H}Lv - d) \qquad (4) \]

3 Serial Algorithm
In GRAPES 3D-Var, the initial approximation to the analysis is simply the background, $x_0 = x_b$. The GRAPES 3D-Var assimilation system uses three primary sources of input data, $x_b$, $y_o$ and $B$, to generate the analysis increments $\delta x$, which are combined with the background $x_b$ to produce the analysis $x_a$.

3.1 Algorithm Implementation
The GRAPES 3D-Var computing procedure can be summarized as the following sequential steps:

step 1. Input the three data sources $x_b$, $y_o$ and $B$;
step 2. Calculate the innovation vector O − B: $d = y_o - H(x_b)$;
step 3. Initialize the control variable $v = 0$ and start the minimization;
step 4. Calculate the residual vector O − A: re = $\mathbf{H}Lv - d$, which is based on the control variable transform $\delta x = Lv$;
step 5. Calculate the gradient in observation space: grad_y = $R^{-1}(\mathbf{H}Lv - d)$;
step 6. Evaluate the cost function: $J = \frac{1}{2}v^T v + \frac{1}{2}(\mathbf{H}Lv - d)^T R^{-1}(\mathbf{H}Lv - d)$, which makes use of the results of steps 4 and 5;
step 7. Based on the result of step 5 and the adjoint operators of $\mathbf{H}$ and $L$, calculate the gradient in control space: grad_v = $L^T \mathbf{H}^T$ grad_y = $L^T \mathbf{H}^T R^{-1}(\mathbf{H}Lv - d)$;
step 8. Evaluate the gradient of the cost function: $\nabla J = v + L^T \mathbf{H}^T R^{-1}(\mathbf{H}Lv - d)$;
step 9. With the results for $J$ and $\nabla J$ (steps 6 and 8), use the limited-memory BFGS method to minimize the cost function and find the optimal control variable $v$;
step 10. After the optimal control variable $v$ has been obtained, perform the final transform to the analysis increments, $\delta x = Lv$. The increments $\delta x$ are then added to the background $x_b$ to produce the analysis $x_a$.
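The ten steps can be traced on a toy problem. The sketch below is written in Python rather than the Fortran 90 of GRAPES and makes several simplifying assumptions that are not in the paper: the observation operator is linear (a small matrix `H`), R is diagonal (`r_inv` holds the inverse variances), L is given as an explicit matrix, and plain fixed-step gradient descent stands in for the limited-memory BFGS minimizer of step 9. All function names are hypothetical.

```python
def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def cost_and_gradient(v, L, H, r_inv, d):
    """Steps 4 to 8 for a linear H and a diagonal R."""
    dx = matvec(L, v)                                    # delta x = L v
    re = [a - b for a, b in zip(matvec(H, dx), d)]       # step 4: H L v - d
    grad_y = [w * r for w, r in zip(r_inv, re)]          # step 5
    J = 0.5 * sum(x * x for x in v) \
        + 0.5 * sum(r * g for r, g in zip(re, grad_y))   # step 6
    grad_v = matvec(transpose(L), matvec(transpose(H), grad_y))  # step 7
    grad_J = [a + b for a, b in zip(v, grad_v)]          # step 8
    return J, grad_J

def analyse(xb, yo, L, H, r_inv, rate=0.1, iters=500):
    """Steps 2, 3, 9 and 10; gradient descent replaces L-BFGS here."""
    d = [y - h for y, h in zip(yo, matvec(H, xb))]       # step 2
    v = [0.0] * len(xb)                                  # step 3
    for _ in range(iters):                               # step 9 (simplified)
        _, g = cost_and_gradient(v, L, H, r_inv, d)
        v = [vi - rate * gi for vi, gi in zip(v, g)]
    dx = matvec(L, v)                                    # step 10
    return [b + i for b, i in zip(xb, dx)]
```

With a background of 0, a single observation of 2 and unit error variances for both, the analysis lands halfway at 1, as expected when background and observation are trusted equally. The fixed step size `rate` converges only when it is small relative to the largest eigenvalue of the Hessian, which is one reason a quasi-Newton method is used in practice.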
The ten steps above are the essential parts of the GRAPES 3D-Var algorithm, and the minimization of the cost function in steps 4 to 9 proceeds iteratively. Among all GRAPES 3D-Var routines, the control variable transform $L$ and its adjoint $L^T$ account for over 99% of the time spent in calculating the cost function and its gradient (steps 4 to 8), and these routines are amenable to parallelization.

3.2 Flow Chart of GRAPES 3D-Var
The GRAPES 3D-Var code has been written in a modular style using Fortran 90. The source code comprises about 50,000 lines and 600 subroutines, with hundreds of variables and derived data types. Fig. 1 illustrates the flow chart of the GRAPES 3D-Var procedure.
[Figure 1: flow chart. Inputs: namelist, x_b, B, y_o. Stages: Read Namelist, Setup Background, Setup Background Errors, Setup Observations, Calculate Innovation Vector, Minimize Cost Function, Compute Analysis, Output Analysis. Outputs: diagnosis, δx_a and x_a.]

Fig. 1. The module "Minimize Cost Function" consists of the evaluations of the cost function and its gradient, the minimization algorithm, etc., and consumes about 96% of the elapsed time of GRAPES 3D-Var. Within the "Minimize Cost Function" module, the iterative execution of Calculate J (which calculates $J$ and $\nabla J$) accounts for over 93% of the total runtime. The control variable transform $L$ and its adjoint $L^T$ make up the main computational load of Calculate J (see Table 1).
4 Parallelization
The GRAPES 3D-Var problem can be summarized as the iterative solution of the cost function Eq. (1), including the incremental formulation, preconditioning, the control variable transform and physical/balance transform, the observation operator (with its tangent linear and adjoint operators), the minimization algorithm, etc., and it is highly computation- and data-intensive. A typical "full observations" 3D-Var operational run requires about 40 minutes. Parallelization is an effective way to achieve high performance in scientific computing and to reduce the elapsed time. In this section we discuss the key strategies in the design of the GRAPES 3D-Var parallel scheme, which is implemented with the Message Passing Interface (MPI) 2.0.
Table 1. Runtime of the Calculate J components. The implementation of δx = Lv (TransfWToDxa) and its adjoint grad_v = L^T H^T grad_y (TransfWToDxaAdj) consume almost the entire runtime (over 99%) of Calculate J; these routines are the part of the code most amenable to parallelization.

Calculate J component        Percentage of runtime
Jb=0.5*dot_product(w,w)      0.03%
TransfWToDxa                 42.50%
TransfDxaToYi                0.31%
Residual                     0.04%
Calculate Jo and GradY       0.07%
TransfDxaToYiAdj             0.26%
TransfWToDxaAdj              56.70%
j_grad=w+jo_grad             0.07%

4.1 Partitioning Strategies
GRAPES 3D-Var employs four control variables (CVs): the stream function psi, the unbalanced velocity potential chi, the unbalanced geopotential height phi and the relative humidity hum. The computations for the individual control variables are independent of each other (except in the physical/balance transform). To reduce the complexity of the parallelization and the data dependence, we assume that all processors are divided into four processor groups (PGs), and each PG carries out the computations for one control variable. It is therefore unnecessary to communicate among PGs during the horizontal and vertical transforms.

Observations partitioning. To keep the load balanced when computing the observation error cost and its gradient, each type of observations (e.g. observations of type q) is distributed equally or almost equally among the processors within a PG according to the number of observations of type q, rather than according to the compute areas of the forecast model, because the numbers of observations in the individual compute areas cannot be guaranteed to be equal. The observations partitioning algorithm is as follows (Fig. 2a):

For observations of type q (total number NObs_q), in PG_n (n = 1, 2, 3, 4), execute:
  for each processor PE_i (i = 1...I) within PG_n do
    process part of the observations of type q
    (the number of observations of type q distributed to PE_i is about NObs_q / I)
  done

Since the observations are not connected in the way the data items of the model domains are, and the locations of the observations are random, each processor within a PG must maintain a whole copy of the background to process the observations distributed to it.

Control variables partitioning. As all processors are divided into four PGs and each PG deals with only one control variable, most of the computing in the vertical direction is independent in the control variable transform. To simplify communication, the control variables (3D arrays) are distributed equally or almost equally among all processors within a PG according to the number of vertical levels. The control variable partitioning algorithm is as follows (Fig. 2b):

Assume PG_n processes control variable CV_n with L_n vertical levels (n = 1, 2, 3, 4):
  for each processor PE_i (i = 1...I) within PG_n do
    process part of the vertical levels of CV_n
    (the number of levels processed by PE_i is about L_n / I)
  done

[Figure 2: a. Observations partitioning: in each of PG1 to PG4, proc. 0 holds Obs. 1 ~ m/n, proc. 1 holds Obs. m/n+1 ~ 2m/n, ..., proc. n-1 holds Obs. (n-1)m/n+1 ~ m (m: total number of observations; n: number of processors in a PG). b. Control variables partitioning in a PG: proc. 0 holds levels 1 ~ i+1, proc. j-1 holds levels ij+j-i ~ ij+j, proc. j holds levels ij+j+1 ~ ij+j+i, proc. n-1 holds levels m-i+1 ~ m (n: number of processors in a PG; m: number of levels; i = m/n, j = mod(m,n)).]

Fig. 2. Data partitioning strategies of GRAPES 3D-Var

4.2 Data Communication Strategies
The control variable transform implements the computation from control variable space to analysis variable space; it is the most time-consuming part of GRAPES 3D-Var and involves complicated communications. It consists of three main computing procedures: horizontal spectral filters, vertical EOF projection and physical transform. We adopt different data partitioning and communication strategies in the different procedures. Although keeping a full copy of the preparatory data can reduce communications in the parallel scheme, the communication strategies in the physical transform are sophisticated in order to keep the load balanced, and the details of the physical transform implementation are discussed below.

After the horizontal spectral filters and the vertical EOF projection, the control variables have been computed by their respective PGs, and each processor within a PG holds the data of different vertical levels. The purpose of the physical transform is to produce the analysis variables (u-wind, v-wind, geopotential height and relative humidity) from the control variables. It includes the following procedures:
TransfPsiChiToUV: Uses psi and chi to produce u-wind and v-wind; only the control variables psi and chi are involved. In order to use all four PGs for the computation, psi and chi are each divided into two parts. The latter part of psi is transmitted from PG1 to PG3 and the latter part of chi is transmitted from PG2 to PG4 (Fig. 3a). After the communications are complete, PG1 contains the whole psi and the former part of chi, PG2 contains the former part of psi and the whole chi, and PG3 and PG4 contain the latter part of psi and the latter part of chi. Then PG1 and PG3 can compute the different parts of u-wind, and PG2 and PG4 the different parts of v-wind. After the computation, u-wind and v-wind are gathered by PG1 and PG2, respectively (Fig. 3b).

[Figure 3: a. data preparation of psi and chi among PG1, PG2, PG3 and PG4; b. data gathering of u and v]

Fig. 3. Data communications of TransfPsiChiToUV

balance global: Uses psi to produce the balanced phi_b. Because of the data exchange already performed for TransfPsiChiToUV, no further communication of psi is needed to compute phi_b. After the computation, the partial phi_b calculated by PG1, PG2 and PG4 are gathered by PG3.

Compute mss: Uses phi and phi_b to produce the analysis variable mss (geopotential height); no communication is needed. The resulting mss is held by PG3.

Compute hum: The analysis variable hum (relative humidity) can be evaluated directly, and no communication is needed. The resulting hum is held by PG4.

4.3 Stagger Parallelization Strategies
The GRAPES 3D-Var parallelization covers data preparation, innovation vector calculation, the control variable transform, the computation of the cost function and its gradient, minimization, analysis calculation, etc. Depending on the data distribution, stagger parallelization strategies are adopted in the different computing steps to keep the load balanced.
Data preparation. The input data of GRAPES 3D-Var include: namelist file, first guess (backgound) xb , observations yo , background error covariances B. Each processor reads namelist file and yo . Only the ”monitor” processor reads xb and B, and then broadcasts the fields via MPI to other processors. Innovation vector calculation (O-B). For each type of observations, use the observations partitioning strategies (section 4.1) to distribute observations to each processor within a PG. Each PG calculates the innovation vector of one of all four analysis variables respectively. Because of the random distribution of locality of all observations, each processor should maintain the whole copy of background xb to calculate the innovation vector d = yo − H(xb ). Control variable transform. In horizontal spectral filters, each PG deals with one control variable. Within a PG, the control variable partitioning strategies (section 4.1) is adopted to distribute the control variable of different vertical levels to processors. In vertical EOF projection, each PG deals with one control variable. Each processor within a PG needs the whole fields of all vertical levels to calculate EOF projection, so before EOF projection, the fields in every processor should be broadcasted to all other processors within a PG. The communications in physical transform are carefully designed to keep load balancing, and the data communication strategies described in section 4.2 are used to implement the parallelization of physical transform. Minimization. Each PG computes the cost function of one of the four control variables respectively, and each processor within a PG allocates the control array only on the local-processor domain, then reshaping it from the 1D form into its corresponding 3D variable arrays in preparation of the control variable transform. After each processor computes part of the cost function, MPI ALLREDUCE is called to sum the partial results into a full cost function. 
Calculate analysis. Each processor performs the final transform of the analysis increments. Each PG computes one of the four analysis variables, and then adds the increments to the background values to produce the analysis.
Output analysis. The "monitor" processor gathers the partial analyses and analysis increments from the other processors via MPI communication and then writes them to files.
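The reduction in the minimization step can be illustrated without an MPI installation. The following is a minimal sketch in plain Python, not the GRAPES implementation: processors are simulated as list chunks, the helper names (partition, partial_cost, allreduce_sum) are hypothetical, the innovation values are made up, and the cost is assumed to be the simplified quadratic form 0.5 * sum(d_i^2), i.e. an observation term with identity error covariance.

```python
def partition(values, n_procs):
    """Split a list into n_procs contiguous, nearly equal chunks."""
    base, extra = divmod(len(values), n_procs)
    chunks, start = [], 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)
        chunks.append(values[start:start + size])
        start += size
    return chunks


def partial_cost(innovations):
    """Local share of the simplified cost 0.5 * sum(d_i^2) (assumed form)."""
    return 0.5 * sum(d * d for d in innovations)


def allreduce_sum(partials):
    """Stand-in for MPI_ALLREDUCE with a sum: every rank receives the total."""
    total = sum(partials)
    return [total for _ in partials]


# Made-up innovation values d = yo - H(xb), spread over 4 simulated processors.
innovations = [0.5, -1.0, 2.0, 1.5, -0.5, 1.0]
chunks = partition(innovations, 4)
results = allreduce_sum([partial_cost(c) for c in chunks])
```

After the reduction, every simulated processor holds the same full cost value, which is what each needs in order to take the same minimization step.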
5
Experiments
The performance of the parallel GRAPES 3D-Var system using the above parallel scheme is presented in this section. All experiments were carried out on an 8-node cluster. Each node has 2 CPUs (Intel IA64 Itanium-2, 1.5 GHz), interconnected by InfiniBand with a bandwidth of about 800 MB per second.
Parallel Computing of GRAPES 3D-Variational Data Assimilation System
To evaluate the parallel algorithm of the control variable transform, its three modules (horizontal spectral filters, vertical EOF projection and physical transform) are tested separately. We choose two horizontal resolutions of the control variable space to verify the parallel efficiency: a high resolution of 320 × 160 grid points and a low resolution of 160 × 80 grid points (Fig. 4).
[Figure 4: bar charts of speedup on 4, 8 and 16 CPUs at low and high resolution. Panels: a. horizontal spectral filter; b. vertical EOF projection; c. physical transform.]
Fig. 4. The parallel performance of the three modules of control variable transform. The speedup of the three modules at different resolution is tested respectively.
To benefit from parallel computing, the problem size must be sufficiently large. Generally speaking, the speedup of the GRAPES 3D-Var parallelization increases with the problem size (see Fig. 4). On the other hand, the speedup trends show that performance is quite good for a small number of processors and degrades somewhat as the number of processors increases, mainly because the cost of communications grows with the number of processors.
[Figure 5: speedup (vertical axis) plotted against the number of CPUs (horizontal axis, 1 to 16).]
Fig. 5. Speedup of parallelization of GRAPES 3D-Var on 1-CPU, 4-CPU, 8-CPU and 16-CPU respectively
6
Conclusion
The GRAPES 3D-Var parallel scheme discussed above is relatively simple, and the idea of processor groups and parallelization in the vertical direction greatly reduces the complexity of communications, so it is easy to implement on a parallel computing platform. Experimental results show that the parallel
implementation achieves considerable load balancing and parallel efficiency. At the same time, the parallel implementation scheme has a disadvantage: its scalability is constrained. The maximum number of processors is limited to the number of control variables (4 in GRAPES 3D-Var) multiplied by the number of vertical levels, an amount that is always no more than 100. Data partitioning strategies based on horizontal area could partly resolve the problem, though the corresponding communications are more sophisticated than those based on vertical levels. Considerable further work remains to meet operational requirements.
References
1. Bouttier, F., Courtier, P.: Data assimilation concepts and methods. Meteorological Training Course Lecture Series (January 2001)
2. Jishan, X., Shiyu, Z., Guofu, Z., Hua, Z.: Scientific Design Scheme of GRAPES 3D-Var System. Chinese Academy of Meteorological Sciences (2001)
3. Barker, D., Huang, W., Guo, Y.-R., Bourgeois, A.I.: A Three-Dimensional Variational (3D-Var) Data Assimilation System For Use With MM5. NCAR Tech Note, NCAR/TN453+STR, p. 68 (2003)
4. Barker, D.M., Bourgeois, A.J., Guo, Y.-R., Huang, W.: A Three-Dimensional Variational Data Assimilation System for MM5: Implementation And Initial Results. Mon. Wea. Rev. 132(4), 897–914 (2004)
5. Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Mathematical Programming 45, 503–528 (1989)
6. Grama, A., Gupta, A., Karypis, G., Kumar, V.: Introduction to Parallel Computing, 2nd edn. Pearson Education Limited, London (2003)
The Effects of Heterogeneity on Asynchronous Panmictic Genetic Search
Boleslaw K. Szymanski, Travis Desell, and Carlos Varela
Department of Computer Science, Rensselaer Polytechnic Institute, Troy NY 12180, USA
{szymansk,deselt,cvarela}@cs.rpi.edu
http://wcl.cs.rpi.edu/
Abstract. Research scientists increasingly turn to large-scale heterogeneous environments such as computational grids and Internet-based facilities to satisfy their rapidly growing computational needs. The increasing complexity of scientific models and the rapid collection of new data are drastically outpacing the advances in processor speed, while the cost of supercomputing environments remains relatively high. However, the heterogeneity and unreliability of these environments, especially the Internet, make scalable and fault-tolerant search methods indispensable to effective scientific model verification. An effective search method for these types of environments is asynchronous genetic search, where a population continuously evolves based on asynchronously generated and received results. However, it is unclear what effect heterogeneity has on this type of search. For example, results received from slower workers may turn out to be obsolete or less beneficial than results calculated by faster workers. This paper examines the effect of heterogeneity on asynchronous panmictic (single population) genetic search for two different scientific applications, one used by astronomers to model the Milky Way galaxy and another by particle physicists to determine the existence of theory-predicted, yet unobserved particles such as missing baryons. Results show that for both applications, results received from slower workers, while overall less beneficial, are still useful. Additionally, a modification of asynchronous genetic search shows that different parameter generation strategies change their effectiveness over the course of the search.¹
1
Introduction
The rate of increase in CPU performance does not nearly match the rapidly increasing rates of data acquisition in all scientific disciplines. This is leading to significantly long, if not intractable, turn around times between the development of a scientific model and its verification using traditional computing environments. Testing current scientific models can involve processing terabytes of data
¹ This work has been partially supported by the following grants: NSF CAREER Award No. CNS-0448407 and NSF OISE Grant No. 0334667. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 457–468, 2008. c Springer-Verlag Berlin Heidelberg 2008
B.K. Szymanski, T. Desell, and C. Varela
using computationally intense modeling techniques, which results in program execution times of weeks to months on a single high-end computer to evaluate only a single parameter set for a scientific model. Even the best scientific model verification search methods require the evaluation of thousands of parameter sets of a scientific model over the data set. This makes large-scale computing environments, such as computational grids and the Internet, highly desirable platforms for performing scientific model verification. The processing power these environments provide enables them to compute these models in a short amount of time, which in turn allows scientists to gather results more quickly and improve their models and understanding.
Grids and the Internet introduce additional challenges in comparison to homogeneous large-scale computing environments such as supercomputers. In addition to scalability and heterogeneity concerns, the reliability of the host nodes comes into question, especially in the case of Internet computing architectures such as BOINC [4], where computing nodes can disconnect at random and for computationally significant amounts of time. Most search methods used in scientific model verification are iterative (or synchronous) in nature [7], and therefore not well suited to heterogeneous and unreliable computing environments. Additionally, it is uncertain what effect heterogeneity will have on asynchronous search.
A software framework for distributed scientific model evaluation and search (GMLE [8]) was extended with an asynchronous distributed evaluation framework. GMLE was used to implement an asynchronous panmictic (single population) genetic search, and a modification to it which improves the rate of convergence to a solution. Two different scientific applications, one used for astronomical modeling and the other for particle physics modeling, were used to evaluate the convergence rates of the asynchronous genetic searches.
To compare the effect of heterogeneity on the search, the applications were run using Rensselaer’s CCNI BlueGene as a high-performance homogeneous testing environment and the Rensselaer Grid as a heterogeneous testing environment. Results show that the asynchronous genetic search (AGS) with continuous update converges in a half to a third of the evaluations needed by traditional iterative genetic search (IGS) on the BlueGene. Additionally, AGS on the Rensselaer Grid still converges within half the number of evaluations needed by IGS. For a heterogeneous environment, it is shown that while fitness evaluations from slower workers are not as effective in improving the population, they still provide benefit. This means that over large-scale heterogeneous environments, any additional processors, even slow ones, can still improve the performance of an AGS. The benefit of different types of parameter generation, in addition to the benefit of results with different calculation times, is shown to change over the course of the application. This means it may be possible to develop modifications to AGS which reduce the effect of heterogeneous calculation times. Such modifications may be combined with improvements in convergence resulting from adaptively determining what parameter generation methods to use.
The paper proceeds as follows. Section 2 discusses related parallel genetic search methods and frameworks for large-scale scientific evaluation. Section 3
describes the GMLE architecture, the asynchronous distributed evaluation framework and the genetic searches used. The different searches are evaluated in Section 4. Lastly, conclusions and future work are discussed in Section 5.
2
Related Work
A wide range of parallel genetic algorithms (PGAs) have been examined for different distributed computing environments. Generally, there are three types of parallel genetic algorithms: single population (panmictic, coarse-grained), multi-population (island, medium-grained), or cellular (fine-grained) [7]. Typically, these approaches are synchronous. Panmictic GAs create a population, evaluate it in parallel, and use the results to generate the next population. Island [3,5] approaches evaluate local populations for a certain number of iterations, then exchange the best members with other islands. Cellular algorithms [2,9] evaluate individual parameter sets, then update these individual sets based on the fitness of their neighbors. Hybrid approaches [14,18] have also been examined. P-CAGE [11] is a peer-to-peer (P2P) implementation of a hybrid multi-island genetic search built using the JXTA protocol [12], which is also designed for use over the Internet. Each individual processor (a member of the P2P network) acts as an island (a subpopulation of the whole) and evolves its subpopulation cellularly. Every few iterations, it exchanges exterior neighbors of its population with its neighbors.
Different approaches have also been taken in developing PGAs for computational grids. Imade et al. have studied synchronous island genetic algorithms on grid computing environments for bioinformatics [13]. Lim et al. provide a framework for distributed calculation of genetic algorithms and an extended API and meta-scheduler for resource discovery [15]. Both approaches use synchronous island-style GAs. Nimrod/O [16] is a tool that provides different optimization algorithms for use on grids and has been used to develop the EPSOC algorithm [14], which is a mixture of a cellular and a traditional GA. Populations are generated synchronously, but the elimination of bad members and the mutation of good ones are done locally. Dorronsoro et al. have already shown that asynchronous cellular GAs can perform competitively, and discuss how update rate and different population shapes affect the convergence rate [10].
In this paper, we introduce a novel approach (to the best of our knowledge) that evaluates asynchronous panmictic GAs. This approach is well suited to both Internet and Grid computing infrastructures, because it easily facilitates scalability, provides fault tolerance without redundancy, and does not require inter-worker communication.
3
Distributed Search on Heterogeneous Environments
The GMLE framework has been designed to facilitate collaboration between researchers in machine learning, distributed computing and experts with different scientific domain knowledge who are interested in distributed model verification
Fig. 1. GMLE with an asynchronous distributed evaluation framework
or parameter optimization. GMLE has previously used a synchronous distributed evaluation framework for performing maximum likelihood evaluation with astronomical and particle physics applications on an IBM BlueGene supercomputer and the Rensselaer Grid [8]. The framework partitions data across a set of processors that perform partial evaluations of the model in parallel, after which the results are composed into the final result. This has been shown to be efficient for both supercomputing and grid environments; however, it does not work well on highly heterogeneous and unstable environments like the BOINC infrastructure and some grids.
GMLE was extended with an asynchronous distributed evaluation framework (see Figure 1). Evaluators request work from a master, process that work and return the result, repeating as necessary. Work requests and results are all processed asynchronously by the master, which performs the different search methods. The master does not need to wait for, or depend on, the results of the different evaluators, which makes evaluator failures easy to ignore and reduces the need for redundant, wasted computations. The rest of this section details the different search methods and how they are parallelized.
Iterative Genetic Search. Algorithm 1 shows pseudocode for the IGS algorithm. In this algorithm, an initial population of parameter sets is generated randomly and the fitness of the model for each of those parameter sets is calculated. The iterative genetic search repeatedly calculates a new population based on the previous one using selection, reproduction and mutation. Selection takes the best members of the previous population and moves them to the new population. Reproduction takes two randomly selected members of the previous population and generates a new parameter set that is their average.
Mutation takes a randomly selected member of the previous population and creates a new parameter set which is equal to the selected member except that one value is mutated to a
Algorithm 1. Iterative Genetic Search (IGS)
Data: X /*Best to keep*/, Y /*Number of Reproductions*/, Z /*Number of Mutations*/
Result: Converged Population
for p ∈ P[1] ... P[X+Y+Z] do
    p.params = random_params()
evaluate(P)
while not converged(P) do
    for p ∈ P’[1] ... P’[X] do
        p = P.get_next_best()
    for p ∈ P’[X+1] ... P’[X+Y] do
        p = reproduce(P[random()], P[random()])
    for p ∈ P’[X+Y+1] ... P’[X+Y+Z] do
        p = mutate(P[random()])
    P = P’
    evaluate(P)
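Algorithm 1 can be sketched as runnable Python. This is an illustrative sketch under stated assumptions rather than the GMLE implementation: fitness is minimised, the convergence test is replaced by a fixed number of iterations, parameters are drawn uniformly from a hypothetical range [low, high], and the sphere test function is made up for the example.

```python
import random


def iterative_genetic_search(fitness, n_params, x, y, z, iterations,
                             low=-10.0, high=10.0, seed=0):
    """Minimise `fitness` using x selections, y reproductions, z mutations."""
    rng = random.Random(seed)

    def random_params():
        return [rng.uniform(low, high) for _ in range(n_params)]

    def reproduce(p1, p2):
        # The child is the average of the two parents.
        return [(a + b) / 2 for a, b in zip(p1, p2)]

    def mutate(p):
        # Copy the parent and replace one randomly chosen value.
        child = list(p)
        child[rng.randrange(n_params)] = rng.uniform(low, high)
        return child

    population = [random_params() for _ in range(x + y + z)]
    for _ in range(iterations):  # stands in for "while not converged(P)"
        ranked = sorted(population, key=fitness)  # evaluate(P), best first
        new_population = ranked[:x]               # selection
        new_population += [reproduce(rng.choice(population), rng.choice(population))
                           for _ in range(y)]
        new_population += [mutate(rng.choice(population)) for _ in range(z)]
        population = new_population
    return min(population, key=fitness)


def sphere(params):
    """Hypothetical test fitness with its minimum at the origin."""
    return sum(v * v for v in params)


best = iterative_genetic_search(sphere, n_params=2, x=5, y=10, z=5, iterations=50)
```

Because the x best members are always carried over, the best fitness in the population never gets worse from one iteration to the next.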
Algorithm 2. Asynchronous Report Work
Data: P /*Population*/, max /*Maximum Population Size*/, R /*Result*/
Result: Updated Population
if P.size < max then
    P.insert(R)
else if R.fitness > worst(P).fitness then
    P.insert(R)
    P.remove(worst(P))
new randomly selected value. In this way, iterative genetic search will converge to minima using reproduction and use mutation to avoid becoming stuck in a local minimum. The population size, S, is typically kept constant, so S = X + Y + Z, where X is the number of selections, Y is the number of reproductions, and Z is the number of mutations.
Asynchronous Genetic Search. AGS is similar to IGS in that it keeps a population of parameters and generates reproductions and mutations based on it. However, instead of using a parallel model of concurrency like IGS, it uses a master-worker approach. Instead of iteratively generating new populations, new members of the population are generated when a worker requests work, and the population is updated when a worker reports work to the master. The AGS algorithm consists of two phases and uses two asynchronous message handlers (see Algorithms 2 and 3). The server can be processing either a request work or a report work message and cannot process multiple messages at the same time. In the first phase of the algorithm (while the population size is less than the maximum population size) the server is being initialized and a random population is generated. When a request work message is processed, a random parameter set is generated, and when a report work message is processed, the population is updated with the parameters and the fitness of that evaluation. When enough report work messages have been processed, the algorithm proceeds into the second phase, which actually performs the genetic search. In the second phase, report work will insert the new parameters and their fitness into the population but only if they are better than the worst current
Algorithm 3. Asynchronous Request Work
Data: P /*Population*/, C /*Reproduction Probability*/, max /*Maximum Population Size*/
Result: New Parameters to Evaluate
if P.size < max then
    return random_params()
else
    if random() < C then
        p1 = P[random()]
        p2 = P[random()], where p1 != p2
        return reproduce(p1, p2)
    else
        return mutate(P[random()])
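The two message handlers of Algorithms 2 and 3 can be sketched as a small Python class. This is a hedged illustration, not the GMLE code: the class and method names are hypothetical, workers are simulated by a sequential loop, parameters come from an assumed range [low, high], and lower fitness is treated as better (the comparison in report_work would be flipped for a search that maximises fitness).

```python
import random


class AsyncGeneticSearch:
    """Master-side state for asynchronous genetic search (minimisation)."""

    def __init__(self, n_params, max_size, reproduction_prob,
                 low=-10.0, high=10.0, seed=0):
        self.n_params = n_params
        self.max_size = max_size
        self.c = reproduction_prob
        self.low, self.high = low, high
        self.rng = random.Random(seed)
        self.population = []  # list of (fitness, params) pairs

    def _random_params(self):
        return [self.rng.uniform(self.low, self.high) for _ in range(self.n_params)]

    def request_work(self):
        """Return a parameter set for a worker to evaluate (Algorithm 3)."""
        if len(self.population) < self.max_size:
            return self._random_params()  # first phase: random population
        if self.rng.random() < self.c:
            # Reproduction: average two distinct members.
            (_, p1), (_, p2) = self.rng.sample(self.population, 2)
            return [(a + b) / 2 for a, b in zip(p1, p2)]
        # Mutation: copy one member and replace one value.
        _, parent = self.rng.choice(self.population)
        child = list(parent)
        child[self.rng.randrange(self.n_params)] = self.rng.uniform(self.low, self.high)
        return child

    def report_work(self, params, fitness):
        """Update the population with a result (Algorithm 2); True if inserted."""
        if len(self.population) < self.max_size:
            self.population.append((fitness, params))
            return True
        worst = max(self.population, key=lambda member: member[0])
        if fitness < worst[0]:  # better than the worst current member
            self.population.remove(worst)
            self.population.append((fitness, params))
            return True
        return False


def sphere(params):
    """Hypothetical test fitness with its minimum at the origin."""
    return sum(v * v for v in params)


search = AsyncGeneticSearch(n_params=2, max_size=20, reproduction_prob=0.5)
for _ in range(2000):  # each pass simulates one worker round trip
    work = search.request_work()
    search.report_work(work, sphere(work))
best_fitness = min(f for f, _ in search.population)
```

Since the master handles one message at a time, no locking is needed in this sketch; a real deployment would receive the two message types from remote workers in arbitrary order.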
Algorithm 4. Double Shot Reproduce
Data: Member m1, Member m2
Result: Reproduced parameters
Member[] result
result[0].params = (m1.params + m2.params)/2
diff = result[0].params - m1.params
result[1].params = m1.params - diff
result[2].params = m2.params + diff
return result
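A small worked example may make the geometry of double shot reproduction concrete. The sketch below follows the prose description, in which the child outside the first parent lies as far beyond that parent as the average lies inside it; the function name and the parent values are hypothetical.

```python
def double_shot_reproduce(m1, m2):
    """Return [average, child beyond m1, child beyond m2] for two parents."""
    average = [(a + b) / 2 for a, b in zip(m1, m2)]
    # Half the separation between the parents, per parameter.
    diff = [avg - a for avg, a in zip(average, m1)]
    beyond_m1 = [a - d for a, d in zip(m1, diff)]  # outside the first parent
    beyond_m2 = [b + d for b, d in zip(m2, diff)]  # outside the second parent
    return [average, beyond_m1, beyond_m2]


children = double_shot_reproduce([0.0, 4.0], [2.0, 8.0])
# average   -> [1.0, 6.0]
# beyond m1 -> [-1.0, 2.0]
# beyond m2 -> [3.0, 10.0]
```

All three children lie on the line through the two parents, so if both parents sit on a slope, one of the outer children continues further down it.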
member, and remove the worst member to keep the population size the same; otherwise the parameters and the fitness are discarded. Processing a request work message will return either a mutation or a reproduction from the population.
Asynchronous Double-Shot Genetic Search. The AGS algorithm was extended with the double shot method, on the observation that for the astronomy model (along with many other scientific modeling applications), the parameter space is not well formed. In this case, when a reproduction is generated from two parameter sets, they often both lie on a slope, so using the average of the two points typically does not improve the fitness. The AGS double shot (AGS-DS) algorithm improves AGS by generating three children when doing a reproduction (see Algorithm 4). One child is the average of its parents, but the other two children lie outside the parent parameters. One child is equally distant from the average outside the first parent, and the other child is equally distant from the average outside the second parent. This allows the population to travel down gradients much faster, leading to improved convergence times.
Genetic Search Distributed Evaluation. There are three ways that IGS can be parallelized: (1) the fitness of each member in the population can be evaluated in parallel, (2) the fitness calculation can be done in parallel, and (3) the fitness calculation can be done in parallel as well as the population being evaluated in parallel. The first approach can scale to a number of processors equal to the population size, while the scalability of the second approach is dependent on how much of
the fitness calculation can be done in parallel. The third approach can scale to a number of processors equal to the first times the second; however, it is the most complex to implement. All three approaches suffer from the scalability limitation imposed by the population size, the scalability of the fitness calculation, or both. None perform well on heterogeneous environments without intelligent partitioning. In the first case, the algorithm will only progress as fast as the slowest fitness calculation, while in the second case, the algorithm will only progress as fast as the slowest calculation of part of the fitness. The third case suffers from both, making partitioning the most difficult.
AGS and AGS-DS can be distributed in two ways: (1) workers request and report work individually and asynchronously, and (2) all workers calculate fitness collectively based on parameters generated by the current population, which are then reported, and this process repeats iteratively. The first approach has significant benefits in heterogeneous environments because the calculation of fitness can be done by each worker concurrently and independently of the others. The algorithm progresses as fast as work is received, and faster workers can process multiple request work messages, in the style of Cilk’s work stealing [6], without waiting on slow workers. However, the second approach can be better on homogeneous environments because new parameter sets are always generated from the newest (and best) population.
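The cost of waiting for the slowest worker can be made concrete with a little arithmetic. The sketch below is a simplified model with made-up numbers, not a measurement: it ignores communication and the time the master spends handling messages, and the function names and per-evaluation times are hypothetical.

```python
def synchronous_evaluations(eval_times, horizon):
    """Evaluations finished when every worker waits for the slowest each round."""
    rounds = int(horizon // max(eval_times))
    return rounds * len(eval_times)


def asynchronous_evaluations(eval_times, horizon):
    """Evaluations finished when each worker requests new work as soon as it is done."""
    return sum(int(horizon // t) for t in eval_times)


times = [1, 1, 1, 10]  # hypothetical time per evaluation for four workers
sync_done = synchronous_evaluations(times, 100)    # 10 rounds * 4 workers = 40
async_done = asynchronous_evaluations(times, 100)  # 100 + 100 + 100 + 10 = 310
```

Under this model a single worker that is ten times slower reduces the synchronous scheme to the pace of that worker, while the asynchronous scheme loses only that worker's share of the throughput.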
4
Results
Test Applications. The physics application uses data from partial wave analysis (PWA) to determine the existence of theory-predicted, but unobserved particles (missing baryons) [19]. PWA observes particle states and measures their quantum spin and parity using a beam of mesons formed in an accelerator. This beam strikes a liquid hydrogen target, causing some pions to interact with a proton at the target, which can result in a spray of particles. After this, some particles will live long enough to create trails in a particle detector and be observed. Missing baryons decay after an extremely short time (10^-23 seconds) and do not travel a measurable distance. A scientific model with 10 to 100 fit parameters is used to calculate the occurrence of missing baryons based on the observed data. The genetic search finds values for these fit parameters that most closely match the data.
The astronomy application uses data from the Sloan Digital Sky Survey [1], which is measuring the positions and other data about all the stars in the sky. Currently, over 10TB of data has been collected. This data is used to calculate the accuracy of 3-dimensional models of the Milky Way galaxy [17]. Any given model consists of the background (stars uniformly dispersed in the galaxy) and different streams of stars formed when other galaxies have come close to the Milky Way, were ripped apart and spread around it. The genetic search finds values for parameters describing the background and different star streams which most closely match the observed sky survey data.
Fig. 2. Minimum, median and maximum population values for the astronomy application on the BlueGene with IGS (upper left), AGS (upper right), AGS-DS (lower left), and AGS-DS on the Rensselaer Grid (lower right)
Test Environments. Various test environments were used to evaluate the different types of asynchronous genetic search. Rensselaer’s CCNI BlueGene was used as a homogeneous high-performance test environment. A 512-node partition was used in virtual mode for a total of 1024 processors, each a 700MHz PowerPC 440 with 1GB of RAM, connected by a 3-dimensional torus with 175MBps in each direction and 1.5μsec latency. The Rensselaer Grid was used as a heterogeneous test environment, consisting of four different clusters. The Solaris cluster (SOL) consists of four single-core, dual-processor SunBlade 1000 Sun Solaris machines, running at 800MHz. The AIX cluster (AIX) consists of four quad-processor, single-core PowerPC machines running at 1.7GHz. Two Opteron clusters were also used. The first (OP1) consists of 8 quad-processor, single-core machines, and the second (OP2) consists of 2 quad-processor, dual-core machines with each core running at 2.2GHz. Inter-cluster communication is over Rensselaer’s wide-area network (WAN).
Convergence. Figure 2 shows the convergence rates of the different algorithms on the BlueGene and Rensselaer Grid. The convergence rates of the IGS, AGS and AGS-DS algorithms on a homogeneous environment were tested on the CCNI BlueGene, in part because of the expensive fitness calculation of the astronomy application: 5 to 25 minutes on any of the processors in the Rensselaer Grid. Similar results were obtained by the physics application. The known optimal fitness for the sample astronomy data set used was approximately 3.026. IGS
Fig. 3. Utility of results based on their calculation time for the astronomy application (upper left) and physics application (upper right), and utility of results based on how they were generated for the astronomy application (lower left) and physics application (lower right)
had not converged to the optimum even after 50,000 evaluations, while AGS took approximately 30,000 evaluations and AGS-DS took 18,000 evaluations. Both AGS and AGS-DS quickly converged to a local minimum in the data set (at a fitness of approximately 3.1). AGS-DS converged faster to both minima (the local and the optimal) because the double shot technique allows the algorithm to travel down gradients more quickly. AGS-DS was also run on the Rensselaer Grid to evaluate the effect of heterogeneity on the search. The convergence rate was not as fast, but still better than IGS, converging at around 30,000 evaluations. Compared to the homogeneous evaluation of AGS and AGS-DS, the population had more variation over the entire execution. Again, the physics application performed similarly, with heterogeneous AGS-DS converging faster than IGS, but not as fast as AGS and AGS-DS on a homogeneous environment.
Evaluation Utility. The utility of the different evaluations performed by the search was also examined (see Figure 3). Utility was calculated as the number of results inserted into the population divided by the total number of results received for that speed. While results that were calculated faster had a higher chance of being inserted into the population, those with slower calculation rates were still useful. Initially, faster results tended to be much more useful than slow results; as the search began to converge, the utility of the results for all speeds decreased. Interestingly, after approximately 10,000 evaluations in the
physics application, and 16,000 evaluations in the astronomy application, the utility rate started to increase again. For the physics application, the slower results gained the most benefit, while in the astronomy application the faster results improved the most. However, for both applications, results of all speeds improved their utilities. One possible explanation for this effect is that once the populations had converged closely to a minimum, the population was not changing as drastically, so the chance of a result being useful increased.
The utility of the different parameter generation strategies was also calculated for both applications. The benefit of mutation is initially very good and then tapers off sharply, afterwards not adding much benefit. The average method of generating new parameters is the strongest of all three approaches, with the lower (the parameters generated outside the more fit parent) and higher (the parameters generated outside the less fit parent) methods being less effective. For astronomy, as with calculation time, after 16,000 evaluations the average and lower methods started to improve in their ability to return beneficial results; however, the improvement did level off as the search converged. Likewise, with the physics application, after 10,000 evaluations the average, lower and higher methods began to improve, but they also tapered off as the search converged.
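The utility measure defined above, results inserted divided by results received, can be computed per group with a few lines of Python. This is a minimal sketch: the function name, the group labels and the log entries are hypothetical, and the same function applies whether results are grouped by calculation speed or by generation strategy.

```python
from collections import defaultdict


def utility_by_group(results):
    """Map each group to inserted / received for (group, was_inserted) records."""
    received = defaultdict(int)
    inserted = defaultdict(int)
    for group, was_inserted in results:
        received[group] += 1
        if was_inserted:
            inserted[group] += 1
    return {group: inserted[group] / received[group] for group in received}


# Made-up log: each entry records the worker speed class of a result and
# whether that result was inserted into the population.
log = [("fast", True), ("fast", False), ("fast", True), ("fast", True),
       ("slow", True), ("slow", False), ("slow", False), ("slow", False)]
utility = utility_by_group(log)  # {"fast": 0.75, "slow": 0.25}
```

Computing the ratio over successive windows of evaluations, rather than over the whole run, would show how utility changes as the search converges.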
5
Discussion
This paper examines two different types of asynchronous panmictic (single population) genetic search using the GMLE distributed modeling and search package. AGS is a desirable search technique for large-scale and heterogeneous environments due to its inherent scalability and fault tolerance. Asynchronous genetic search (AGS) was evaluated with two different scientific applications, one used in astronomy and the other in physics. Traditional iterative genetic search (IGS) was compared to continuously updated asynchronous genetic search on an IBM BlueGene supercomputer and to asynchronous genetic search on a heterogeneous grid environment. AGS is shown to improve the convergence rate over IGS in all cases, with continuously updated AGS performing the best.
The effects of heterogeneity on AGS were also measured. Results have shown that the utility of a result, measured by its improvement of the population, is partially dependent upon how long it takes to be computed. Results which take longer to calculate are generated from older populations with less fitness, and thus have less chance to improve the current population. However, even results which are received from very slow workers still improve the population, providing some benefit to the search.
The utility of different types of parameter generation methods for the genetic search was also tested. Interestingly, it was shown that while the asynchronous double shot algorithm has the fastest convergence rate, the additional types of parameter generation used (higher and lower) were less likely to improve the population. Since the double shot algorithm nonetheless converges faster, the results generated by the higher and lower methods must provide greater benefit to the population when they do improve it.
It was also shown that the utility of results based on calculation time and parameter generation type changes over the course of program execution. In future work, more types of parameter generation could be developed to ameliorate the effect of slowly evaluated fitnesses, reducing the impact of heterogeneity. Additionally, an adaptive search could be developed which dynamically chooses which types of parameter generation to use based on the speed of the processor and past performance.
Further future work will involve extending GMLE to work with the BOINC framework. This will allow AGS to be evaluated on a very large-scale and heterogeneous environment. Additionally, this work evaluated single population, or panmictic, versions of asynchronous genetic search. As the number of available workers increases, asynchronous island (multi-population) genetic search will be of interest, especially if multiple servers are required to handle the load from a large BOINC community.
This work shows that while heterogeneity does have a negative effect on the convergence rate of AGS, it is not excessive, even when the evaluation time of workers differs by an order of magnitude. Additionally, with different types of parameter generation strategies and an adaptive search, it may be possible to reduce the impact of heterogeneity even more, allowing AGS to be done over very heterogeneous and large-scale environments.
References 1. Adelman-McCarthy, J.e.a.: The 6th Sloan Digital Sky Survey Data Release. ApJS, arXiv/0707.3413 (in press) (July 2007), http://www.sdss.org/dr6/ 2. Alba, E., Dorronsoro, B.: The exploration/exploitation tradeoff in dynamic cellular genetic algorithms. IEEE Transactions on Evolutionary Computation 9, 126–142 (2005) 3. Alba, E., Troya, J.M.: Analyzing synchronous and asynchronous parallel distributed genetic algorithms. Future Generation Computer Systems 17, 451–465 (2001) 4. Anderson, D.P., Korpela, E., Walton, R.: High-performance task distribution for volunteer computing. In: e-Science, pp. 196–203. IEEE Computer Society, Los Alamitos (2005) 5. Berntsson, J., Tang, M.: A convergence model for asynchronous parallel genetic algorithms. In: IEEE Congress on Evolutionary Computation (CEC 2003), vol. 4, pp. 2627–2634 (December 2003) 6. Blumofe, R.D., Leiserson, C.E.: Scheduling Multithreaded Computations by Work Stealing. In: Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS 1994), Santa Fe, New Mexico, pp. 356–368 (November 1994) 7. Cantu-Paz, E.: A survey of parallel genetic algorithms. Calculateurs Paralleles, Reseaux et Systems Repartis 10(2), 141–171 (1998) 8. Desell, T., Cole, N., Magdon-Ismail, M., Newberg, H., Szymanski, B., Varela, C.: Distributed and generic maximum likelihood evaluation. In: 3rd IEEE International Conference on e-Science and Grid Computing (eScience 2007), Bangalore, India, p. 8 (December 2007) (to appear) 9. Dorronsoro, B., Alba, E.: A simple cellular genetic algorithm for continuous optimization. In: IEEE Congress on Evolutionary Computation (CEC 2006), pp. 2838–2844 (July 2006)
10. Dorronsoro, B., Alba, E., Giacobini, M., Tomassini, M.: The influence of grid shape and asynchronicity on cellular evolutionary algorithms. In: IEEE Congress on Evolutionary Computation (CEC 2004), vol. 2, pp. 2152–2158 (June 2004) 11. Folino, G., Forestiero, A., Spezzano, G.: A JXTA based asynchronous peer-to-peer implementation of genetic programming. Journal of Software 1, 12–23 (2006) 12. Gong, L.: Jxta: A network programming environment. IEEE Internet Computing 5, 88–95 (2001) 13. Imade, H., Morishita, R., Ono, I., Ono, N., Okamoto, M.: A grid-oriented genetic algorithm framework for bioinformatics. New Generation Computing: Grid Systems for Life Sciences 22, 177–186 (2004) 14. Lewis, A., Abramson, D.: An evolutionary programming algorithm for multiobjective optimisation. In: IEEE Congress on Evolutionary Computation (CEC 2003), vol. 3, pp. 1926–1932 (December 2003) 15. Lim, D., Ong, Y.-S., Jin, Y., Sendhoff, B., Lee, B.-S.: Efficient hierarchical parallel genetic algorithms using grid computing. Future Generation Computer Systems 23, 658–670 (2007) 16. Peachey, T., Abramson, D., Lewis, A.: Model optimization and parameter estimation with Nimrod/O. In: International Conference on Computational Science, University of Reading, UK (May 2006) 17. Purnell, J., Magdon-Ismail, M., Newberg, H.J.: A probabilistic approach to finding geometric objects in spatial datasets of the Milky Way. In: Foundations of Intelligent Systems, vol. 3488, pp. 485–493. Springer, Heidelberg (2005) 18. Sinha, A., Goldberg, D.E.: A survey of hybrid genetic and evolutionary algorithms. Technical Report No. 2003004, Illinois Genetic Algorithms Laboratory (IlliGAL) (2003) 19. Wang, W., Maghraoui, K.E., Cummings, J., Napolitano, J., Szymanski, B., Varela, C.: A middleware framework for maximum likelihood evaluation over dynamic grids. In: Second IEEE International Conference on e-Science and Grid Computing, Amsterdam, Netherlands, p. 8 (December 2006)
A Parallel Sensor Selection Technique for Identification of Distributed Parameter Systems Subject to Correlated Observations
Przemyslaw Baranowski¹ and Dariusz Uciński²
¹ University of Zielona Góra, Computer Centre
² University of Zielona Góra, Institute of Control and Computation Engineering
ul. Podgórna 50, 65-246 Zielona Góra, Poland
[email protected], [email protected]
Abstract. The paper considers the problem of determining optimal sensor locations so as to estimate unknown parameters in a class of distributed parameter systems when the measurement errors are correlated. Given a finite set of possible sensor positions, the problem is formulated as the selection of the gaged sites so as to maximize the log-determinant of the Fisher information matrix associated with the estimated parameters. The search for the optimal solution is performed using a GRASP method combined with a multipoint exchange algorithm. In order to alleviate the problem of excessive computational costs for large-scale problems, a parallel version of the GRASP solver is developed, aimed at computations on a Linux cluster of PCs. The resulting numerical scheme is validated on a simulation example.
Keywords: Distributed parameter systems, parameter estimation, GRASP, parallel computations.
1 Introduction
A crucial problem underlying parameter identification of distributed parameter systems (DPSs) is the selection of sensor locations. This problem comprises the arrangement of a limited number of measurement transducers over the spatial domain in such a way as to obtain the best estimates of the system parameters. The location of sensors is not necessarily dictated by physical considerations or by intuition and, therefore, some systematic approaches should still be developed in order to reduce the cost of instrumentation and to increase the efficiency of identifiers. The importance of sensor planning has already been recognized in many application domains, e.g., air quality monitoring systems, groundwater-resources management, recovery of valuable minerals and hydrocarbon, model calibration in meteorology and oceanography, chemical engineering, hazardous environments and smart materials [1,2,3]. Over the past years, they have stimulated laborious R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 469–478, 2008. c Springer-Verlag Berlin Heidelberg 2008
research on the development of strategies for efficient sensor placement (for reviews, see [4,5]). Nevertheless, in most cases correlations between observations are neglected, although it is obvious that several measurement devices situated close to one another frequently do not give much more information than a single sensor. On the one hand, such a simplification is very convenient as it leads to elegant theoretical results, but on the other hand, when the observations are correlated, many classical optimum experimental design algorithms cannot be directly implemented. It is well known that even simple correlation structures [5,6] may substantially complicate the solution of the sensor-location problem. This is because the corresponding information matrix is then no longer a sum of information matrices corresponding to individual sensors. Consequently, publications on optimum sensor location for correlated observations are rather scarce and limited to simple settings [5,7]. The main objective of this work is to propose a technique based on a combination of the parallel GRASP method with an extremely fast multipoint exchange algorithm, which makes it possible to select the best sites from among more than a thousand admissible locations.
2 System Description
Consider a bounded spatial domain $\Omega \subset \mathbb{R}^d$ with sufficiently smooth boundary $\Gamma$, a bounded time interval $T = (0, t_f]$, and a distributed parameter system (DPS) whose scalar state at a spatial point $x \in \bar{\Omega} \subset \mathbb{R}^d$ and time instant $t \in \bar{T}$ is denoted by $y(x, t)$. Mathematically, the system state is governed by the partial differential equation (PDE)

\[ \frac{\partial y}{\partial t} = \mathcal{F}(x, t, y, \theta) \quad \text{in } \Omega \times T, \tag{1} \]

where $\mathcal{F}$ is a well-posed, possibly nonlinear, differential operator which involves first- and second-order spatial derivatives and may include terms accounting for forcing inputs specified a priori. The PDE (1) is accompanied by the appropriate boundary and initial conditions

\[ \mathcal{B}(x, t, y, \theta) = 0 \quad \text{on } \Gamma \times T, \tag{2} \]
\[ y = y_0 \quad \text{in } \Omega \times \{t = 0\}, \tag{3} \]

respectively, $\mathcal{B}$ being an operator acting on the boundary $\Gamma$ and $y_0 = y_0(x)$ a given function. Conditions (2) and (3) complement (1) such that the existence of a sufficiently smooth and unique solution is guaranteed. We assume that the forms of $\mathcal{F}$ and $\mathcal{B}$ are given explicitly up to an $m$-dimensional vector of unknown constant parameters $\theta$ which must be estimated using observations of the system. The implicit dependence of the state $y$ on the parameter vector $\theta$ will be reflected by the notation $y(x, t; \theta)$. In what follows, we consider the discrete-continuous observations provided by $n$ stationary pointwise sensors, namely

\[ z_m^{\ell}(t) = y(x^{\ell}, t; \theta) + \varepsilon(x^{\ell}, t), \quad t \in T, \tag{4} \]
where $z_m^{\ell}(t)$ is the scalar system output and $x^{\ell} \in X$ stands for the location of the $\ell$-th sensor ($\ell = 1, \dots, n$), $X$ signifies the part of the spatial domain $\Omega$ where the measurements can be made and $\varepsilon(x^{\ell}, t)$ denotes the measurement disturbance satisfying

\[ \mathrm{E}\big[\varepsilon(x, t)\big] = 0, \tag{5} \]
\[ \mathrm{E}\big[\varepsilon(x, t)\,\varepsilon(\chi, \tau)\big] = q(x, \chi, t)\,\delta(t - \tau), \tag{6} \]

$q(\,\cdot\,, \cdot\,, t)$ being a known continuous spatial covariance kernel and $\delta$ the Dirac delta function.
3 Optimum Experimental Design Problem
A customary performance index quantifying the optimality of sensor configurations is the D-optimality criterion defined as the determinant of the Fisher information matrix (FIM) [5,8,9,10] associated with the parameter estimates. In our setting, the FIM is given by

\[ M(x^1, \dots, x^n) = \int_0^{t_f} F(t)\, C^{-1}(t)\, F^{T}(t)\, dt, \tag{7} \]

where the elements of the matrix $F$ are usually called the sensitivity coefficients,

\[ F(\,\cdot\,) = \big[\, f(x^1, \cdot\,) \;\; \dots \;\; f(x^n, \cdot\,) \,\big], \tag{8} \]

where

\[ f(x^i, \cdot\,) = \left( \frac{\partial y(x^i, \cdot\,; \theta)}{\partial \theta} \right)^{T} \Bigg|_{\theta = \theta^0}, \tag{9} \]

$\theta^0$ being a prior estimate to the unknown parameter vector $\theta$. The quantity $C(\,\cdot\,)$ is the $n \times n$ covariance matrix with elements

\[ c_{ij}(t) = q(x^i, x^j, t), \quad i, j = 1, \dots, n. \tag{10} \]

Introducing the matrix $W(t) = [w_{ij}(t)] = C^{-1}(t)$, we can rewrite the FIM as follows:

\[ M(x^1, \dots, x^n) = \sum_{i=1}^{n} \sum_{j=1}^{n} \int_0^{t_f} w_{ij}(t)\, f(x^i, t)\, f^{T}(x^j, t)\, dt. \tag{11} \]
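The equivalence of the matrix form (7) and the double-sum form (11) can be checked numerically. The following is a minimal sketch under stated assumptions: the time integral is replaced by a Riemann sum, the covariance is taken as time-invariant and generated by an exponential kernel, and the sensitivities are random numbers rather than solutions of a sensitivity problem. The function names `exp_kernel`, `fim` and `fim_double_sum` are hypothetical and are not taken from the paper.

```python
import numpy as np

def exp_kernel(points, rho):
    """Covariance matrix c_ij = exp(-rho * ||x^i - x^j||) for the given sites."""
    diff = points[:, None, :] - points[None, :, :]
    return np.exp(-rho * np.linalg.norm(diff, axis=-1))

def fim(sens, cov, dt):
    """Riemann-sum approximation of M = int F(t) C^{-1} F(t)^T dt, cf. (7).

    sens has shape (T, m, n): one m x n sensitivity matrix per time step.
    cov is an n x n covariance matrix, assumed time-invariant in this sketch.
    """
    w = np.linalg.inv(cov)
    return sum(f @ w @ f.T for f in sens) * dt

def fim_double_sum(sens, cov, dt):
    """The same quantity written as a double sum over sensor pairs, cf. (11)."""
    w = np.linalg.inv(cov)
    m, n = sens.shape[1], sens.shape[2]
    total = np.zeros((m, m))
    for f in sens:
        for i in range(n):
            for j in range(n):
                total += w[i, j] * np.outer(f[:, i], f[:, j])
    return total * dt

rng = np.random.default_rng(0)
sites = rng.random((4, 2))               # n = 4 hypothetical sensor sites
sens = rng.standard_normal((20, 3, 4))   # T = 20 steps, m = 3 parameters (made-up values)
cov = exp_kernel(sites, rho=3.0)
m_matrix_form = fim(sens, cov, dt=0.05)
m_sum_form = fim_double_sum(sens, cov, dt=0.05)
print(np.allclose(m_matrix_form, m_sum_form))
```

The double sum makes explicit why the FIM for correlated observations is not a sum of single-sensor contributions: the off-diagonal weights $w_{ij}$ couple every pair of sites.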
For notational convenience, in what follows, we shall use the following compact notation for a solution to our problem:

\[ \xi = \{ x^1, \dots, x^n \}. \tag{12} \]

Each $\xi$ will then be called a discrete (or exact) design and its elements support points. Since all measurements are supposed to be taken in a single run of the process (1)–(3) so that replications are not allowed, the admissible designs must
satisfy the condition $x^i \neq x^j$ for $i, j = 1, \dots, n$ and $i \neq j$. Thus, from among all discrete designs we wish to select the one which maximizes $\Psi[M(\xi)]$. We call such a design a D-optimum design. In what follows, we shall focus our attention on the framework in which the set of admissible locations $X$ consists of $N > n$ elements. Thus we are faced with a combinatorial problem whose solution demands an approach avoiding the exhaustive search of the solution space.
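To give a feeling for the size of this combinatorial problem, the short sketch below counts the admissible designs for a 41 × 41 grid of candidate sites (the grid size reported with the computation times in the experiments) and for n = 30 and n = 50 sensors. The helper name `num_designs` is hypothetical.

```python
import math

def num_designs(n_sites, n_sensors):
    """Number of ways to choose n_sensors distinct sites out of n_sites."""
    return math.comb(n_sites, n_sensors)

N = 41 * 41  # admissible sites on a 41 x 41 grid
for n in (30, 50):
    count = num_designs(N, n)
    # Report only the order of magnitude; the exact integer is very long.
    print(f"n = {n}: about 10^{len(str(count)) - 1} candidate designs")
```

Even the smaller case is far beyond what exhaustive enumeration could cover, which is the motivation for a heuristic search.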
4 Numerical Construction of Optimum Designs
The GRASP (Greedy Randomized Adaptive Search Procedure) algorithm is a well-known metaheuristic method implemented for many optimization problems [11,12]. In general, GRASP is an iterative method which consists of two stages: a construction phase and a local search. In the first stage the algorithm constructs an acceptable, greedy solution, which is then explored in the local search phase. The best solution among all iterations is returned as an optimum. Algorithm 1 displays the scheme of the GRASP method implemented to determine D-optimum sensor locations, i.e., for $\Psi[M] := \det(M)$.

Algorithm 1. GRASP - general scheme
 1: procedure GRASP
 2:   $\Psi[M(\xi^{\star})] = -\infty$
 3:   while {stopping criterion not satisfied} do
 4:     Construction Phase: build a greedy solution $\xi'$;
 5:     Local search for an optimum $\xi_i$, starting from $\xi'$;
 6:     if $\Psi[M(\xi_i)] > \Psi[M(\xi^{\star})]$ then
 7:       $\Psi[M(\xi^{\star})] = \Psi[M(\xi_i)]$;
 8:       $\xi^{\star} = \xi_i$;
 9:     end if
10:   end while
11:   return($\xi^{\star}$);
12: end procedure

 1: procedure Local search for an optimum $\xi^{k} = \{x^1, \dots, x^n\}$
 2:   $k = 0$;
 3:   loop
 4:     Determine: $F^{(k)}(\xi^k)$, $W^{(k)}(\xi^k)$, $M^{(k)}(\xi^k)$, $\det M^{(k)}(\xi^k)$, $D(\xi^k) = M^{-1}(\xi^k)$;
 5:     Determine: neighborhood of $\xi^k$;
 6:     Find: $(X^{\star}_{\mathrm{rem}}, X^{\star}_{\mathrm{add}}) = \arg\max_{(X_{\mathrm{rem}}, X_{\mathrm{add}}) \in \xi \times X} \Delta(X_{\mathrm{rem}}, X_{\mathrm{add}})$;
 7:     where $\Delta(X_{\mathrm{rem}}, X_{\mathrm{add}}) = \{\det M(\xi^{(k)}_{X_{\mathrm{rem}} \rightleftarrows X_{\mathrm{add}}}) - \det M(\xi^{(k)})\} / \det M(\xi^{(k)})$;
 8:     if $\Delta(X^{\star}_{\mathrm{rem}}, X^{\star}_{\mathrm{add}}) \le \delta$ then
 9:       STOP loop;
10:     else
11:       $\xi^{k+1} = \xi^{k}_{X^{\star}_{\mathrm{rem}} \rightleftarrows X^{\star}_{\mathrm{add}}}$;
12:       $k = k + 1$;
13:     end if
14:   end loop
15:   return($\xi^k$);
16: end procedure

The construction phase returns a design consisting of $n$ sites selected from among $N$ admissible sites forming a grid $\tilde{X}$ in $X$. It starts from a random initial $m$-point design $\xi^0$, where $m$ is the number of parameters to be estimated. This guarantees the nonsingularity of the information matrix. In each iteration of this phase the algorithm adds one location to the current design. This site is chosen from a list of the best candidates, which is built through a full search of $\tilde{X} \setminus \xi^0$. The design obtained in the first GRASP phase is likely to be close to an optimal one, so in the second phase only a local search is performed. Local optimization consists in a full search only in a list of locations which are qualified as a neighborhood of the design produced in the construction phase. This list is much smaller than the number of all admissible sites. A location is qualified as a neighbor of point $x$ if its distance on the grid to $x$ is less than 2 units. In the local search phase (Algorithm 1) the current design is improved by exchanging its elements with those in its neighbourhood. It is possible to swap many sites during one iteration, but due to the combinatorial nature of the problem we decided to exchange at most two points in one step.
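The two GRASP phases described above can be sketched in a few lines. This is an illustration under stated assumptions and not the authors' implementation: the model is static (no time integral), the sensitivity vectors are made up, the covariance uses an exponential kernel, only one point is exchanged per step (the paper exchanges up to two), and the stopping test uses the gain in log-determinant instead of the relative determinant increase. All function names and the parameter `rcl_size` (length of the candidate list) are hypothetical.

```python
import numpy as np

def criterion(design, sens, sites, rho):
    """log det of M = F C^{-1} F^T for the sites listed in `design` (static toy model)."""
    idx = list(design)
    pts = sites[idx]
    cov = np.exp(-rho * np.linalg.norm(pts[:, None] - pts[None, :], axis=-1))
    f = sens[:, idx]
    sign, logdet = np.linalg.slogdet(f @ np.linalg.solve(cov, f.T))
    return logdet if sign > 0 else -np.inf

def construct(n, m, rcl_size, score, n_sites, rng):
    """Construction phase: random m-point start, then greedy randomized additions."""
    design = [int(s) for s in rng.choice(n_sites, size=m, replace=False)]
    while len(design) < n:
        rest = [s for s in range(n_sites) if s not in design]
        ranked = sorted(rest, key=lambda s: score(design + [s]), reverse=True)
        design.append(int(rng.choice(ranked[:rcl_size])))  # pick among the best candidates
    return design

def local_search(design, score, neighbours, delta=1e-6):
    """Exchange one design point for a neighbouring site while the gain exceeds delta."""
    best = score(design)
    while True:
        moves = [(score(design[:i] + [s] + design[i + 1:]), i, s)
                 for i, x in enumerate(design)
                 for s in neighbours(x) if s not in design]
        if not moves:
            return design, best
        value, i, s = max(moves)
        if not value > best + delta:
            return design, best
        design, best = design[:i] + [s] + design[i + 1:], value

def grasp(n, m, sens, sites, side, rho, iters=5, rcl_size=3, seed=0):
    rng = np.random.default_rng(seed)
    score = lambda d: criterion(d, sens, sites, rho)

    def neighbours(k):
        # Sites whose grid distance to site k is below 2 units (the 8 surrounding nodes).
        r, c = divmod(k, side)
        return [rr * side + cc
                for rr in range(max(r - 1, 0), min(r + 2, side))
                for cc in range(max(c - 1, 0), min(c + 2, side))
                if (rr, cc) != (r, c)]

    best_design, best_value = None, -np.inf
    for _ in range(iters):
        start = construct(n, m, rcl_size, score, len(sites), rng)
        design, value = local_search(start, score, neighbours)
        if value > best_value:
            best_design, best_value = design, value
    return best_design, best_value

side = 6
grid = np.array([(i / (side - 1), j / (side - 1)) for i in range(side) for j in range(side)])
# Made-up sensitivity vectors f(x) for m = 3 parameters; not the model used in the paper.
sens = np.vstack([np.ones(len(grid)), grid[:, 0], grid[:, 0] * grid[:, 1]])
design, value = grasp(n=5, m=3, sens=sens, sites=grid, side=side, rho=3.0)
print(sorted(design), value)
```

The sketch recomputes the criterion from scratch for every candidate, which is affordable only for a toy grid; avoiding exactly this cost is the purpose of fast update formulas for the matrices involved.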
5 Parallelization Concepts
In spite of the simplification resulting from applying the GRASP algorithm, the optimization problem is still complex, so we decided to use parallel computations combined with an efficient exchange-type algorithm. The algorithm consists in exchanging the worst sensor locations in the current design for the best locations from a list of other admissible sites. In our parallel implementation we used a hierarchical master-worker strategy, which was proposed in [13]. The iterative scheme of GRASP and the exchange algorithm contains many separate tasks, so the problem was split into subproblems which can be solved by different processors. The main tasks of the master node in our implementation are submitting jobs to workers and aggregating their results. The master node maintains an array of workers which are supposed to take part in computations. In this array the master records which nodes are active and which are idle (waiting for a job). Based on this, the master process decides how to redistribute the jobs (Fig. 1). The master has some additional arrays in which the orders and data for workers are stored. Each order, which is sent to a worker right before the data, triggers the execution of the appropriate procedure on a processor. This method ensures that different nodes can execute different procedures for different data. In particular, each worker explores a local part of the neighborhood list received from the master, and after the computation it sends the best location back to the coordinator. When the master has received the best locations from all workers, it compares the increases in the determinant of the FIM and decides which sites are to be added to the current design.
Fig. 1. Master-Slave tasks
The easiest way to parallelize the GRASP algorithm is to employ a multistart strategy, where each worker starts its own complete GRASP procedure.
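The division of labour between master and workers described in this section can be illustrated with a small sketch. It is only an analogy under stated assumptions: a thread pool from the Python standard library stands in for the MPI processes of the original Fortran program, and `toy_gain` is a hypothetical placeholder for the increase in the determinant of the FIM obtained from a candidate site.

```python
from concurrent.futures import ThreadPoolExecutor

def best_in_chunk(chunk, gain):
    """Worker task: scan a local part of the candidate list, return (gain, site)."""
    return max((gain(site), site) for site in chunk)

def master_step(candidates, gain, n_workers):
    """Master task: split the list, submit one job per worker, aggregate the results."""
    chunks = [candidates[w::n_workers] for w in range(n_workers)]
    chunks = [c for c in chunks if c]          # drop empty jobs
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        local_best = list(pool.map(lambda c: best_in_chunk(c, gain), chunks))
    return max(local_best)                     # compare the workers' best locations

# Hypothetical gain function standing in for the determinant increase.
toy_gain = lambda site: -(site - 17) ** 2
best_gain, best_site = master_step(list(range(40)), toy_gain, n_workers=4)
print(best_site, best_gain)
```

Each worker returns only its single best candidate, so the amount of data sent back to the master does not grow with the length of the neighborhood list.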
6 Implementation Details
The speed of the local search procedure (Algorithm 1) can be highly improved by reduction of time-consuming operations, such as inversions of matrices (lines 6–7). With no loss of generality, we assume that we are always to replace $s$ points $X_{\mathrm{rem}} = \{x_{\mathrm{rem}_1}, \dots, x_{\mathrm{rem}_s}\} \in \xi^{(k)}$ for $X_{\mathrm{add}} = \{x_{\mathrm{add}_1}, \dots, x_{\mathrm{add}_s}\} \in X$. Indeed, interchanging points in the current design should be followed by swapping some columns in $F^{(k)}$, as well as interchanging the appropriate rows and columns in $W^{(k)}$. Such a replacement greatly simplifies the resulting formulas and makes the implementation easier. This requires two stages: the removal of $\{x_{\mathrm{rem}_1}, \dots, x_{\mathrm{rem}_s}\}$ from $\xi^{(k)}$ and the augmentation of the resulting design by $\{x_{\mathrm{add}_1}, \dots, x_{\mathrm{add}_s}\}$. Clearly, both stages imply changes in all the matrices corresponding to the current design. Applying the Frobenius theorem [5, pages 254–255], we can derive the forms of the necessary updates which guarantee the extreme efficiency of the calculations.

Stage 1: Deletion of $\{x_{\mathrm{rem}_1}, \dots, x_{\mathrm{rem}_s}\}$ from $\xi^{(k)}$. Write $F^{(k)}(t)$ as

\[ F^{(k)}(t) = \big[\, F_r(t) \;\; f_r(t) \,\big], \tag{13} \]

where $F_r(t) \in \mathbb{R}^{m \times (n-s)}$ and $f_r(t) \in \mathbb{R}^{m \times s}$. Deletion of $\{x_{\mathrm{rem}_1}, \dots, x_{\mathrm{rem}_s}\}$ from $\xi^{(k)}$ implies removing $f_r(t)$ from $F^{(k)}(t)$, so that we then have $F_r(\,\cdot\,)$ instead of $F^{(k)}(\,\cdot\,)$. But some changes are also necessary in matrices $W^{(k)}(\,\cdot\,)$ and $M^{(k)}$. Namely, decomposing the symmetric matrix $W^{(k)}(t)$ into

\[ W^{(k)}(t) = \begin{bmatrix} V_r(t) & b_r(t) \\ b_r^{T}(t) & \gamma_r(t) \end{bmatrix}, \tag{14} \]

where $V_r(t) \in \mathbb{R}^{(n-s) \times (n-s)}$, $b_r(t) \in \mathbb{R}^{(n-s) \times s}$, $\gamma_r(t) \in \mathbb{R}^{s \times s}$, we set
\[ g_r(t) = \operatorname{chol}\{\gamma_r(t)\}^{-1} F_r(t)\, b_r(t) + \operatorname{chol}\{\gamma_r(t)\}\, f_r(t), \tag{15} \]

where $\operatorname{chol}\{\gamma_r(t)\}$ is the upper triangular Cholesky factor of the matrix $\gamma_r(t)$. Then calculate, respectively, the following counterparts of the matrices $W^{(k)}(\,\cdot\,)$ and $M^{(k)}$:

\[ W_r(t) = V_r(t) + V_r(t)\, b_r(t) \big[ \gamma_r(t) - b_r(t)^{T} V_r(t)\, b_r(t) \big] b_r(t)^{T} V_r(t), \tag{16} \]
\[ M_r = M^{(k)} - \int_0^{t_f} g_r(t)\, g_r^{T}(t)\, dt. \tag{17} \]
Stage 2: Inclusion of $\{x_{\mathrm{add}_1}, \dots, x_{\mathrm{add}_s}\}$ into the design resulting from Stage 1. We construct

\[ F_a(t) = \big[\, F_r(t) \;\; f_a(t) \,\big], \tag{18} \]

where $f_a(t) = f(x, t)$. Such an extension of the matrix of sensitivity coefficients influences the form of matrices $W_r(\,\cdot\,)$ and $M_r$ obtained at Stage 1. In order to determine their respective updated versions $W_a(\,\cdot\,)$ and $M_a$, we define

\[ v_a(t) = \operatorname{col}\big[ q(x, x^1, t), \dots, q(x, x^n, t) \big], \tag{19} \]
\[ \gamma_a(t) = \big[ q(x, x, t) - v_a^{T}(t)\, W_r(t)\, v_a(t) \big]^{-1}, \tag{20} \]
\[ b_a(t) = -\gamma_a(t)\, W_r(t)\, v_a(t), \tag{21} \]
\[ g_a(t) = \operatorname{chol}\{\gamma_r(t)\}^{-1} F_r(t)\, b_a(t) + \operatorname{chol}\{\gamma_r(t)\}\, f_a(t). \tag{22} \]

Then we have

\[ W_a(t) = \begin{bmatrix} W_r(t) + \gamma_a(t)^{-1} b_a(t)\, b_a^{T}(t) & b_a(t) \\ b_a^{T}(t) & \gamma_a(t) \end{bmatrix}, \tag{23} \]
\[ M_a = M_r + \int_0^{t_f} g_a(t)\, g_a^{T}(t)\, dt. \tag{24} \]
Completion of exchanging means that the design points which guarantee the largest increase in the D-optimality criterion for the current iteration have been found. At this juncture, we may update the matrices as follows:

\[ F^{(k+1)} = F_a, \qquad W^{(k+1)} = W_a, \qquad M^{(k+1)} = M_a, \tag{25} \]
where the quantities on the right-hand sides are those of (18), (23) and (24). The relatively simple form of the above formulas guarantees the efficiency of the algorithm which thus becomes very similar to Fedorov’s exchange algorithm for determining exact designs, cf. [14].
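The augmentation stage can be checked numerically against a direct inversion. The sketch below is a simplified illustration under stated assumptions: a single point is added (s = 1), the model is static so that the time integral disappears, the covariance comes from an exponential kernel, and the sensitivities are random numbers. For one added point the Cholesky factor reduces to a scalar square root; the sketch takes the square root of γ_a, which is an assumption of this example and is the choice for which the rank-one identity holds in the check. The function name `add_sensor` is hypothetical.

```python
import numpy as np

def add_sensor(f_r, w_r, m_r, f_a, v_a, q_aa):
    """Augment a design by one site without inverting the enlarged covariance matrix."""
    gamma_a = 1.0 / (q_aa - v_a @ w_r @ v_a)           # cf. (20)
    b_a = -gamma_a * (w_r @ v_a)                        # cf. (21)
    # Scalar case: sqrt(gamma_a) plays the role of the Cholesky factor (assumption).
    g_a = (f_r @ b_a) / np.sqrt(gamma_a) + np.sqrt(gamma_a) * f_a
    w_a = np.block([[w_r + np.outer(b_a, b_a) / gamma_a, b_a[:, None]],
                    [b_a[None, :], np.array([[gamma_a]])]])   # cf. (23)
    m_a = m_r + np.outer(g_a, g_a)                      # cf. (24), no time integral here
    return np.column_stack([f_r, f_a]), w_a, m_a

rng = np.random.default_rng(1)
sites = rng.random((5, 2))                              # 4 current sites + 1 candidate
cov = np.exp(-3.0 * np.linalg.norm(sites[:, None] - sites[None, :], axis=-1))
sens = rng.standard_normal((3, 5))                      # made-up sensitivities, m = 3

f_r, w_r = sens[:, :4], np.linalg.inv(cov[:4, :4])
m_r = f_r @ w_r @ f_r.T
f_new, w_new, m_new = add_sensor(f_r, w_r, m_r, sens[:, 4], cov[:4, 4], cov[4, 4])

# Reference values obtained by inverting the full 5 x 5 covariance matrix.
w_direct = np.linalg.inv(cov)
m_direct = sens @ w_direct @ sens.T
print(np.allclose(w_new, w_direct), np.allclose(m_new, m_direct))
```

The update costs only matrix-vector products with the already available $W_r$, whereas the reference computation inverts the enlarged covariance matrix from scratch.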
The efficiency of the computations at both stages results from convenient formulas for the inverse of block-partitioned matrices. It is possible to prove in the same way as in [5] that the updates for the respective matrices can be written as

\[ F^{(k)}(t)\, W^{(k)}(t)\, \big(F^{(k)}(t)\big)^{T} = F_r(t)\, W_r(t)\, F_r^{T}(t) + g_r(t)\, g_r^{T}(t), \tag{26} \]

which clearly forces (17). Similarly, (24) can be rewritten as

\[ F_a(t)\, W_a(t)\, F_a^{T}(t) = F_r(t)\, W_r(t)\, F_r^{T}(t) + g_a(t)\, g_a^{T}(t). \tag{27} \]

7 Experimental Results
Consider simultaneous advection and diffusion of an air pollutant over an urban area. We take into account an active source of pollution and reaction, which leads to changes in the pollutant concentration $y = y(x, t)$. The evolution of $y$ over the normalized observation interval $T = (0, 1]$ is described by the following advection-diffusion equation:

\[ \frac{\partial y(x, t)}{\partial t} + \nabla \cdot \big( \upsilon(x)\, y(x, t) \big) = \nabla \cdot \big( a(x) \nabla y(x, t) \big) + g(x) \quad \text{in } \Omega \times T \tag{28} \]

subject to the boundary and initial conditions:

\[ \frac{\partial y(x, t)}{\partial n} = 0 \quad \text{on } \Gamma \times T, \tag{29} \]
\[ y(x, 0) = 0 \quad \text{in } \Omega, \tag{30} \]

where the term $g(x) = \exp\big(-50 \|x - c\|^2\big)$ represents an active source of the pollutant located at point $c = (0.7, 0.7)$, and $\partial y / \partial n$ stands for the partial derivative of $y$ with respect to the outward normal to the boundary $\Gamma$. The solution to (28)–(30) is presented in Fig. 2, where the process dynamics can be easily observed. It can be seen that the cloud of pollutant spreads over the entire domain, thereby reflecting the complex combination of diffusion and advection processes, and follows the direction of the wind, which is the dominant transport factor. The observations are assumed to be corrupted by zero-mean correlated noise with covariance kernel

\[ q(x^i, x^j, t) = \exp\big( -\rho\, \| x^i - x^j \| \big) \tag{31} \]
Fig. 2. Concentration of the pollutant at consecutive time instants: (a) t = 0, (b) t = 0.25, (c) t = 0.6, (d) t = 1 (axes: x1, x2)

Fig. 3. D-optimum sensor configurations for 50 sensors and different correlations: (a) ρ = 40, (b) ρ = 3.0, (c) ρ = 1.0 (axes: x1, x2)
where $\| \cdot \|$ denotes the Euclidean norm and $\rho$ is the coefficient which controls the intensity of the correlation. In particular, the $\rho$ values of 1.0, 3.0 and 40 were chosen as representative of considerable, medium and small correlations, respectively. The program used for the parallel solution of the discussed problem was written completely in Fortran 95 using the Intel Fortran Compiler, the Math Kernel Library for matrix and vector operations, and MPI for message passing [15]. Computations were performed on a Linux cluster at the University of Zielona Góra, being part of the national CLUSTERIX project [16]. The D-optimal sensor configurations for 50 allocated sensors and various correlations are shown in Fig. 3. Each time the GRASP procedure was started from a randomly generated 3-point design. The obtained locations of sensors perfectly retain the symmetry of the problem with respect to the line $x_2 = x_1$ and tend to form a pattern reflecting the areas of greatest changes in the pollutant concentration. The speedup of computational time subject to the number of processors is shown in Table 1. Of course, due to communication between the nodes, the speedup does not scale linearly with the number of nodes, but nevertheless the gain is substantial. The average time of program execution for 30 correlated ($\rho = 1.0$) sensors was about a dozen minutes. Surprisingly, the measurements in the closest vicinity of the pollution source are not very attractive for parameter estimation. The intuition fails in this case and it is very difficult to predict the solution when armed only with the experimenter's experience.

Table 1. Computation times [hh:mm:ss] for a 41 × 41 grid of admissible sites

  n      Number of nodes
         2          4          6          8
  30     00:28:47   00:11:27   00:08:20   00:07:06
  50     01:24:52   00:32:16   00:25:13   00:19:59
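The entries of Table 1 can be turned into relative speedups with a few lines of arithmetic. The sketch below uses only the times reported in the table; since no single-node time is given, the 2-node run is taken as the reference, which is a choice of this example. The helper names `to_seconds` and `relative_speedup` are hypothetical.

```python
def to_seconds(hms):
    """Convert an hh:mm:ss string into seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return 3600 * h + 60 * m + s

nodes = [2, 4, 6, 8]
times = {
    30: ["00:28:47", "00:11:27", "00:08:20", "00:07:06"],
    50: ["01:24:52", "00:32:16", "00:25:13", "00:19:59"],
}

def relative_speedup(row):
    """Speedup of each run relative to the 2-node run of the same row."""
    secs = [to_seconds(t) for t in row]
    return [secs[0] / s for s in secs]

for n, row in times.items():
    ratios = ", ".join(f"{p} nodes: {r:.2f}" for p, r in zip(nodes, relative_speedup(row)))
    print(f"n = {n}: {ratios}")
```

Going from 2 to 8 nodes shortens the run by a factor of about 4.05 for n = 30 and about 4.25 for n = 50.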
8 Conclusions
The paper discussed the problem of parallel computations performed on a homogeneous Linux cluster in order to determine D-optimum experimental designs for correlated observations. Using a heuristic GRASP method to allocate
sensors has two goals. The first is to find an optimum or near-optimum solution. The second is to arrive at such a solution with a minimal amount of computational effort. Most combinatorial optimization problems are classified as intractable and have huge solution spaces, and the one discussed here is no exception, so it is very often ineffective to apply the brute force technique of exhaustive enumeration. Work is in progress regarding implementation of the presented approach on a grid of computers.
Acknowledgments. The work of P. Baranowski was supported by the Integrated Regional Operational Programme (Measure 2.6: Regional innovation strategies and the transfer of knowledge) co-financed from the European Social Fund.
References
1. van de Wal, M., de Jager, B.: A review of methods for input/output selection. Automatica 37, 487–510 (2001)
2. Sun, N.Z.: Inverse Problems in Groundwater Modeling. In: Theory and Applications of Transport in Porous Media, Kluwer Academic Publishers, Dordrecht (1994)
3. Uciński, D.: Measurement Optimization for Parameter Estimation in Distributed Systems. Technical University Press, Zielona Góra (1999)
4. Kubrusly, C.S., Malebranche, H.: Sensors and controllers location in distributed systems: A survey. Automatica 21(2), 117–128 (1985)
5. Uciński, D.: Optimal Measurement Methods for Distributed-Parameter System Identification. CRC Press, Boca Raton (2005)
6. Uciński, D., Atkinson, A.: Experimental design for time-dependent models with correlated observations. Studies in Nonlinear Dynamics and Econometrics 8(2), 14 (2004), http://www.bepress.com/snde
7. Patan, M.: Optimal Observation Strategies for Parameter Estimation of Distributed Systems. University of Zielona Góra Press, Zielona Góra (2004)
8. Fedorov, V.V., Hackl, P.: Model-Oriented Design of Experiments. Lecture Notes in Statistics. Springer, New York (1997)
9. Pukelsheim, F.: Optimal Design of Experiments. In: Probability and Mathematical Statistics, John Wiley & Sons, New York (1993)
10. Walter, É., Pronzato, L.: Identification of Parametric Models from Experimental Data. In: Communications and Control Engineering, Springer, Berlin (1997)
11. Feo, T., Resende, M.: Greedy randomized adaptive search procedures. Journal of Global Optimization 6, 109–133 (1995)
12. Festa, P.: Greedy randomized adaptive search procedures. AIROnews 7(4), 7–11 (2003)
13. Goux, J., Kulkarni, S., Linderoth, J., Yoder, M.: An enabling framework for master-worker applications on the computational grid. In: Proc. 9th IEEE Symposium on High Performance Distributed Computing (HPDC9) (2000)
14. Fedorov, V.V.: Theory of Optimal Experiments. Academic Press, New York (1972)
15. Pacheco, P.S.: Parallel Programming with MPI. Morgan Kaufmann, San Francisco (1997)
16. Wyrzykowski, R., Meyer, N., Stroiński, M.: Clusterix: National cluster of Linux systems. In: Proc. 2nd European Across Grids 2004 Conf. (2004)
Distributed Segregative Genetic Algorithm for Solving Fuzzy Equations
Octav Brudaru¹,² and Octavian Buzatu²
¹ Department of System Engineering and Management, Technical University "Gh. Asachi", D. Mangeron 53, 700050 Iasi, Romania
² Institute of Computer Science, Romanian Academy, Carol I 22A, 700505 Iasi, Romania
[email protected], [email protected]
Abstract. This paper presents a genetic algorithm for solving fuzzy equations that evolves many sub-populations of solutions. The individuals are clustered into groups using a features based similarity measure. A distributed implementation of this segregative GA containing a communication mechanism that enforces the clustering structure of the subpopulations is described. The results of the experimental investigation of the ability to find multiple accurate roots as well as the effect of alternative distributed models and communication schemes are reported.
1 Introduction
Fuzzy systems and genetic algorithms have proved their versatility and robustness in modeling and solving real-world problems. In conjunction with these techniques, distributed computing offers effective ways to reduce the run time of the computationally intensive tasks involved in solution search processes. In this context, solving fuzzy equations based on discrete fuzzy arithmetic is a very intensive search process. It appears as an independent computational task [5] or in connection with the solving of inverse problems in approximating fuzzy dependencies [2]. In [4] some techniques for solving fuzzy linear equations are given and an evolutionary algorithm able to solve fuzzy eigenvalue problems involving triangular shaped fuzzy numbers is described. A genetic algorithm able to solve fuzzy equations with arbitrary fuzzy functions and sampled fuzzy numbers is described in [3]. It is able to find one solution at a time, and if the equation has many roots they should be found by eliminating the already found roots from the current search process. Unfortunately, executing a deflation process is neither possible nor recommended due to the specifics of fuzzy arithmetic. Therefore, this way of finding many roots of a fuzzy equation is not feasible. This paper proposes a new evolutionary approach [8] for finding multiple roots of fuzzy equations defined by arbitrary fuzzy functions operating with sampled fuzzy numbers. This technique uses a natural similarity measure between solutions and creates and maintains a clustering structure of the evolving population [1]. A specific root search process is developed within each cluster. The communication between clusters is done by transferring individuals according to the
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 479–488, 2008. © Springer-Verlag Berlin Heidelberg 2008
similarity measure used in clustering. A natural distributed implementation of this segregative GA is proposed. Section 2 presents the basic elements of fuzzy arithmetic and states the problem. Section 3 describes the components of the segregative GA. Section 4 describes the distributed segregative GA, whereas its performance is analyzed in Section 5. Some concluding remarks end the paper.
2 Problem Formulation
The basic tools for operating with fuzzy numbers are described below. A k-sampled fuzzy number is given by $x = ((x_1, p_1), \dots, (x_k, p_k))$, where $x_1, \dots, x_k$ are the realizations of $x$ and $p_1, \dots, p_k \in [0, 1]$ represent the corresponding membership function values. It is supposed that $x_1 < x_2 < \dots < x_k$ and the x-values are separated by equally spaced points. The arithmetic of k-sampled fuzzy numbers is defined in [7]. If $x = ((x_1, p_1), \dots, (x_k, p_k))$ and $y = ((y_1, q_1), \dots, (y_k, q_k))$ are two sampled fuzzy numbers and $\otimes \in \{+, -, \cdot, /\}$ denotes an arithmetic operation, then $z = x \otimes y$, with $z = ((z_1, r_1), \dots, (z_k, r_k))$, is given by:

Fuzzy arithmetic algorithm
– Compute $\rho_{ij} = x_i \otimes y_j$ and $m_{ij} = \min\{p_i, q_j\}$, $i, j = 1, \dots, k$;
– Take $a_0 = \min_{i,j} \rho_{ij}$, $a_k = \max_{i,j} \rho_{ij}$, $t = (a_k - a_0)/k$, $a_h = a_0 + h \cdot t$, $h = 0, 1, \dots, k - 1$, and define the intervals $I_{h+1} = [a_h, a_{h+1}]$, $h = 0, 1, \dots, k - 1$;
– For $h = 1, \dots, k$ determine $M_h = \{m_{ij} \mid \rho_{ij} \in I_h\}$. If $M_h \neq \emptyset$ then compute $m_{i^*j^*} = \max\{m_{ij} \mid \rho_{ij} \in I_h\}$, take $z_h = \rho_{i^*j^*}$ and $r_h = m_{i^*j^*}$. Otherwise take $z_h = (a_{h-1} + a_h)/2$ and $r_h = 0$.

The fuzzy arithmetic is computationally intensive due to its own complexity and the frequency of invoking it during the search process. Further, if $a = ((a_1, \alpha_1), \dots, (a_k, \alpha_k))$ and $b = ((b_1, \beta_1), \dots, (b_k, \beta_k))$ are fuzzy numbers, then a measure of the difference between $a$ and $b$ is represented by

\[ d(a, b) = \sum_{h=1}^{k} \big[ (a_h - b_h)^2 + (\alpha_h - \beta_h)^2 \big]. \tag{1} \]

The defuzzification associates to a fuzzy number $x$ its representative value. Further, the defuzzification is done with the center of gravity defined by

\[ g(x) = \sum_{h=1}^{k} x_h \cdot p_h \Big/ \sum_{h=1}^{k} p_h \tag{2} \]

and it is used to define a feature based similarity measure between fuzzy numbers in the clustering algorithm. Now, consider two sets $A$ and $B$ of k-sampled fuzzy numbers and a function $F : A \to B$. Suppose that the fuzzy number $y \in B$ is known and the equation

\[ F(x) = y \tag{3} \]
is to be solved for the unknown fuzzy number $x$. If (3) has many solutions, only those having distinct defuzzified values are of practical interest. Actually, the problem can be stated as finding the roots $\tilde{x}$ of (3) having distinct defuzzified values so that $F(\tilde{x})$ is close enough to $y$. Unfortunately, no technique used for solving nonlinear equations can be adopted for solving (3) due to the discrete nature of the fuzzy arithmetic.
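The fuzzy arithmetic algorithm and the center of gravity (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [7]: a fuzzy number is a list of (value, membership) pairs, the range of results is cut into k equal intervals that are half-open except for the last one (so that a boundary value is counted once), and ties in membership are resolved in favour of the first pair encountered. Division by a sample equal to zero is not handled. The function names are hypothetical.

```python
import operator

def fuzzy_op(x, y, op):
    """Combine two k-sampled fuzzy numbers given as lists of (value, membership) pairs."""
    k = len(x)
    # All pairwise results rho_ij with memberships m_ij = min(p_i, q_j).
    pairs = [(op(xi, yj), min(pi, qj)) for xi, pi in x for yj, qj in y]
    a0 = min(r for r, _ in pairs)
    ak = max(r for r, _ in pairs)
    t = (ak - a0) / k
    result = []
    for h in range(1, k + 1):
        lo = a0 + (h - 1) * t
        hi = ak if h == k else a0 + h * t
        # Half-open intervals, except the last one, which includes its right end.
        inside = [(r, m) for r, m in pairs if lo <= r < hi or (h == k and r == hi)]
        if inside:
            result.append(max(inside, key=lambda rm: rm[1]))  # highest membership wins
        else:
            result.append(((lo + hi) / 2, 0.0))               # empty interval: midpoint
    return result

def centroid(x):
    """Center of gravity of a sampled fuzzy number, cf. (2)."""
    return sum(v * p for v, p in x) / sum(p for _, p in x)

x = [(1, 0.5), (2, 1.0), (3, 0.5)]
y = [(10, 0.5), (20, 1.0), (30, 0.5)]
z = fuzzy_op(x, y, operator.add)
print(z, centroid(z))
```

Each operation evaluates all k² pairwise results, which illustrates why the arithmetic dominates the cost of evaluating a candidate solution of (3).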
3 Segregative Genetic Algorithm

The basic components of the segregative GA are described next. Some components are identical or similar to those used in [2] for designing fuzzy rational approximators. For the sake of completeness, all the components are briefly described.

3.1 Solution Representation
Consider a fuzzy number $x = ((x_1, p_1), \ldots, (x_k, p_k))$. From the above assumption of equally spaced realizations, it follows that $x_h \in [\theta + (h-1) \cdot \varepsilon, \theta + h \cdot \varepsilon]$, $h = 1, \ldots, k$, for some appropriate constants $\theta$ and $\varepsilon > 0$. Thus $\theta$, $\varepsilon$ and $h$ determine the interval $[\theta + (h-1) \cdot \varepsilon, \theta + h \cdot \varepsilon]$ that contains $x_h$. The exact position of $x_h$ within this interval is given by a non-negative integer $v_h$ represented by a sequence of $s$ bits. One obtains $x_h = \theta + (h-1) \cdot \varepsilon + v_h \cdot \varepsilon / 2^s$ and $\theta + (h-1) \cdot \varepsilon \le x_h \le \theta + h \cdot \varepsilon - \varepsilon / 2^s$, $h = 1, \ldots, k$. The error in locating $x_h$ is $\varepsilon / 2^s$. On the other hand, the membership function value $p_h$ is represented by a sequence of 8 bits that encodes an integer $w_h \in \{0, \ldots, 255\}$. Then $p_h = w_h / 255 \in \{0, \frac{1}{255}, \ldots, \frac{254}{255}, 1\}$ and the error in estimating the membership function is $\frac{1}{255}$. Therefore, $x$ can be represented by the list $L(x) = (\theta, \varepsilon; v_1, w_1, v_2, w_2, \ldots, v_k, w_k)$. This list contains two floating-point numbers and $k$ pairs of binary sequences having $s$ and 8 bits, respectively. In what follows, $L(x)$ denotes a candidate solution to the fuzzy equation (3). Sometimes the representation $L(x)$ of $x$ will be denoted simply by $x$.

3.2 Initial Population
If no information concerning the solution domain is available, the initial population is randomly generated. This is done by considering an interval $[m, M]$ large enough to contain the defuzzified values of all roots. An individual $x$ is obtained starting from a random number $\xi \in [m, M]$, to which a fuzzification operator is applied. This operator generates a number of samples of a random variable that follows a Gaussian distribution $N(\xi, 1)$. These samples are used to construct $x$ in a way similar to that applied in the fuzzy arithmetic algorithm.

3.3 Genetic Operators
The individuals in the current population undergo mutation with probability $\pi_m$. If $L(x)$ is the chromosome under mutation, then the components $\theta$ and $\varepsilon$
O. Brudaru and O. Buzatu
undergo a non-uniform mutation [8]. The binary sequences $(v_h, w_h)$, $h = 1, \ldots, k$, are modified at bit level. Denote by $p_m(nb, j, t)$ the probability of inverting bit $j$ in a sequence of length $nb$ at evolution stage $t \ge 1$. During the experiments, $p_m(nb, j, t) = 2 f(j) g(t) - f(j) - g(t) + 1$, where $f(j) = j/nb$ and $g(t) = \log t / (\log t + 1)$, $t \ge 1$, produced very good results. The effect of using this probability is that during the first evolution stages the most significant part of each binary sequence is changed with a greater probability than the final bits, whereas the situation is reversed in the final stages. This ensures a good trade-off between exploration and exploitation. The mating pool consists of the best 40% of the individuals. Let $L' = (\theta', \varepsilon', v'_1, w'_1, \ldots, v'_k, w'_k)$ and $L'' = (\theta'', \varepsilon'', v''_1, w''_1, \ldots, v''_k, w''_k)$ be two parents and denote by $L^{(j)} = (\theta^{(j)}, \varepsilon^{(j)}, v_1^{(j)}, w_1^{(j)}, \ldots, v_k^{(j)}, w_k^{(j)})$ the offspring $j$, $j = 1, 2$. Then $\theta^{(1)} = \lambda \theta' + (1 - \lambda) \theta''$ and $\varepsilon^{(1)} = \lambda \varepsilon' + (1 - \lambda) \varepsilon''$, with $\lambda \in [0.2, 0.3]$. All the binary sequences are cut at the same random position. $L^{(1)}$ gets the prefixes from $L'$ and the suffixes from $L''$, the source and destination sequences having the same rank in their chromosomes. The other offspring is obtained by interchanging the roles of the parents. A couple is selected with probability $\pi_c$.

3.4 Fitness Function
If $x$ is a chromosome, then the fitness function is defined as $fit(x) = d(F(x), y)$. The goal is to find chromosomes that minimize the fitness function. Whenever a new individual is produced, its fitness value is computed. Computing the fitness is time consuming.

3.5 Population Management
The management of the entire population within the segregative GA consists of creating the initial groups based on similarity, and of maintaining and adapting this clustering structure to the changes produced during the evolution process.

Creating the pre-clustered population. Consider the GA constructed with the components described in Sections 3.1-3.4. The role of this basic genetic algorithm is to create the individuals entering the clustering process. It uses an elitist strategy to select the survivors forming the next generation. As soon as the new individuals are scored, a survivor $x$ must satisfy $fit(x) \le \gamma_1$, where $\gamma_1$ is a small positive threshold. In this stage, the mating pool extends to the best 80% of the individuals. The population is allowed to grow until the upper limit of the population size is reached. Then the basic GA is stopped and $g(x)$ is computed for each individual $x$. Let $X$ be the population created in this way.

Population clustering. Let $k$ be the estimated number of solutions to (3). If $x$ and $y$ are fuzzy numbers, then $|g(x) - g(y)|$ defines a similarity measure between $x$ and $y$. The c-means algorithm is applied to split $X$ into $k$ clusters [9], but other simple clustering techniques could equally be used. One obtains the
partition $\{X_1, \ldots, X_k\}$ of $X$ and the centroids $c_1, \ldots, c_k$ of these clusters. The centroid $c_i$ is viewed as a representative of $\{g(x) \mid x \in X_i\}$. The current accuracy threshold $\gamma_{crt}$ is set to $\gamma_1$.

Clusters maintenance. Each sub-population $X_i$ evolves by applying essentially the standard GA, with the modifications described below. In order to keep the coherence inside each cluster, whenever a new individual $x$ is produced, if $fit(x) > \gamma_{crt}$ then one determines $m$ so that $|g(x) - g(c_m)| \le |g(x) - g(c_i)|$, $i = 1, \ldots, k$. The winning cluster $m$ is the destination of $x$. The individuals of the current cluster compete with the chromosomes coming from other clusters. A tournament selection is applied to avoid exceeding the population limit in each cluster and, afterwards, the center of each cluster is updated to reflect the new sub-population structure: the old centroid is replaced by the arithmetic mean of $g(x)$, $x \in X_i$. The new values of the centroids are sent to all clusters. Since the clustering is not very sensitive to the exact values of the centroids, the transmission could be done only if the changes of the centroids exceed a prescribed value. The new value of the accuracy threshold is taken as $\gamma_{crt} = \max_{i=1,\ldots,k} \max_{x \in X_i} fit(x)$, and a new evolution stage can begin.

3.6 Stop Condition
The average of the fitness function is computed for each sub-population at the end of each generation. The search made in a given sub-population ends when this average does not change significantly for a prescribed number of generations or the number of stages reaches a given limit. The whole search process ends when each sub-population sends a termination signal.
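The local ending test of one sub-population can be sketched as below. The parameter names (`window` for the prescribed number of generations, `tol` for what counts as a significant change, `max_stages` for the stage limit) are assumptions; the text does not give their values.

```python
def should_stop(avg_history, stage, window, tol, max_stages):
    """Ending test of one sub-population.

    avg_history holds the average fitness recorded at the end of each generation.
    """
    if stage >= max_stages:
        return True                      # stage limit reached
    if len(avg_history) <= window:
        return False                     # not enough generations observed yet
    recent = avg_history[-(window + 1):]
    # stop when the average stayed within tol over the last `window` generations
    return max(recent) - min(recent) <= tol
```

In the distributed setting described in the next section, the returned Boolean would correspond to the true/false value each slave process sends to the master process.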
4 Distributed Implementation

The segregative GA has two stages. The first stage creates the pre-clustered population and runs the clustering algorithm. Since it represents less than 5% of the global computational effort and the clustering algorithm requires intensive communication, this stage is implemented sequentially and is allocated to the master process $P_0$. After the first stage ends, each slave process $P_i$ runs a copy of the basic GA on its own sub-population $X_i$, $i = 1, \ldots, k$, while the activity of the master process reduces to deciding the end of the global search. From now on, each slave process works with the same values of the control parameters and executes the same sequential procedure for cluster maintenance. At the end of each evolution stage, each slave process executes its own ending test and sends a true/false value to $P_0$. When $P_0$ has received true values from all slave processes, it requests the best solution found from each slave and ends the entire activity. The main streams of data circulate only among $P_1, \ldots, P_k$. Within each slave process, the so-called segregative operator controls the data movement. This operator receives the new individuals produced in the current evolution stage and creates packets of chromosomes that either remain in the current sub-population
or are sent to specific clusters. The strategy adopted was to send a packet as soon as it is produced, regardless of whether it contains few or many chromosomes. This accelerates the convergence to better solutions. After sending the packets, the current process continues its activity and integrates its own packet and the chromosomes received from other processes into the next generation. Each receiving process checks for new packets only when it has to determine the population entering the next generation.
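The packet-building part of the segregative operator can be illustrated with the following Python sketch. It is a simplification under stated assumptions: centroids are taken to be plain numbers (mean defuzzified values), `g` is any defuzzification function supplied by the caller, the accuracy-threshold test is omitted, and the actual message passing (done with MPI in the reported implementation) is left out.

```python
def build_packets(new_individuals, centroids, g):
    """Group new individuals by the index of the nearest cluster centroid."""
    packets = {i: [] for i in range(len(centroids))}
    for x in new_individuals:
        gx = g(x)
        # winning cluster: the centroid closest to the defuzzified value of x
        m = min(range(len(centroids)), key=lambda i: abs(gx - centroids[i]))
        packets[m].append(x)
    return packets
```

A process handling cluster `i` would keep `packets[i]` locally and send every other non-empty packet to the process that owns the corresponding cluster.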
5 Performance of the Segregative GA

The segregative GA was implemented on a cluster running the NT-MPICH platform with the MPI-1 standard. Four IBM PC computers connected by a local Ethernet 100Base-TX network were used. Each computer had an Athlon XP 2100+ processor, 512 MB SDRAM and an Ethernet 100Base network card, with Windows XP as the operating system. The goal of the experimental investigation was to determine whether the segregative GA works better than its mono-population counterpart with respect to the ability to find single and multiple solutions. Other experimental objectives were to identify appropriate values of the control parameters and to evaluate the gain of the distributed implementation over a sequential one. To illustrate different facts, some fuzzy equations are formed using the fuzzy numbers given in Table 1 (each constant is given by its realizations and the corresponding membership values).

Table 1. Fuzzy constants 1F, 2F, 0F, 1.2F, 1.5F

Constant   Realizations          Membership values
0F         -0.01  0     0.01     0.1  0.9  0.1
0.5F       0.49   0.5   0.51     0.1  0.9  0.1
1F         0.99   1     1.01     0.1  0.9  0.1
2F         1.99   2     2.01     0.1  0.9  0.1
1.2F       1.09   1.1   1.11     0.1  0.9  0.1
1.5F       1.49   1.5   1.51     0.1  0.9  0.1

5.1 Absence of Roots Positioning Information
The closer the roots are, the more difficult it is to discriminate between them. Table 2 gives, for each equation, the percentage of runs in which the exact roots are found. The interval $[-10000, 10000]$ was used for generating the initial population. The discrimination can be improved by reducing the difference $M - m$ and by increasing the density of the initial candidates. The first remedy needs additional information, while the second requires a large pre-clustered population. In general, a large interval $[m, M]$ reduces the chances of discriminating between roots with different defuzzified values, but favors capturing a root by placing a random chromosome in its vicinity. The density of the initial population, the accuracy of the clustering technique and the available computing resources are key factors. As an example, the success rate for the last two equations in Table 2 becomes 100% if the density of the initial individuals is tripled.
Table 2. Success rate for different equations

Equation               Roots       % Runs
(X − 1F)(X − 2F)       1F, 2F      100%
(X − 1F)(X − 1.5F)     1F, 1.5F    94%
(X − 1F)(X − 1.1F)     1F, 1.1F    86%

5.2 Suitable Values for Probabilities
The minimal values of the mutation and crossover probabilities that ensure a good solution are $\pi_m = 0.1$ and $\pi_c = 0.1 \div 0.2$. As an example, for the equation
$$(x + 2F)(x - 1F)(x + 0.5F) = 0F \qquad (4)$$
the obtained results are given in Table 3.

Table 3. Results obtained for $\pi_c = 0.2$ and $\pi_m = 0.2$

                 Subpop 1.              Subpop 2.              Subpop 3.
Realizations     -2.1   -2     -1.99    -0.51  -0.5   -0.49    0.99   1      1.11
Membership       0.12   0.93   0.04     0.17   0.94   0.15     0.27   0.98   0.23
fit.             0.05                   0.01                   0.07
#iter.           368                    341                    356
sec.             126                    106                    115

5.3 Multi-population vs. Mono-population
The segregative GA dominates the mono-population GA with regard to the ability to find multiple solutions. The segregative GA can find many roots in one run, whilst the mono-population GA must be run many times and needs to activate a number of filters that eliminate from the search the vicinities of the already found solutions. Moreover, the segregative GA gives more accurate solutions. For the same equation, the fitness values given by the segregative GA are 10%-20% smaller than those given by the mono-population algorithm. As an example, for equation (4), the mono-population GA found two roots with a success rate of less than 10%, even though 30 times more iterations were executed.

5.4 Multi-population for a Unique Root
Whenever the segregative GA is run on an equation that actually has one root, all sub-populations move to the vicinity of that root or to the vicinity of other local minima of the fitness function. In the second case, the search process converges but the fitness value may remain large, as illustrated in Table 5 for the equation $P(x) = y$, where $P$ is a fuzzy polynomial whose coefficients are given in Table 4. Table 5 shows that the second sub-population tends to the root 1F, while the first and the third processes have final fitness values of 1.5 and 0.3, compared to 0.04 for the second process. Thus, it is easy to recognize a good solution using the corresponding fitness values.
Table 4. Fuzzy equation of degree 3

                 Realizations             Membership values
Coef. of X^3     8.9    9      9.1        0.1   0.9   0.2
Coef. of X^2     5.9    6      6.1        0.15  0.9   0.1
Coef. of X^1     6.9    7      7.1        0.1   0.9   0.15
Coef. of X^0     9.9    10     10.1       0.1   0.9   0.15
x                0.9    1      1.1        0.1   0.9   0.1
y = P(x)         27.01  29.21  32         0.1   0.1   0.9

Table 5. Roots found with three sub-populations

                 Subpop 1.              Subpop 2.              Subpop 3.
Realizations     1      0.79   0.61     0.88   0.98   1        0.88   1      1.18
Membership       0.49   0.44   0.2      0.13   0.58   1        0.16   0.82   0.17
fit.             1.51                   0.04                   0.3
#iter.           117                    117                    117
sec.             117                    116                    115

5.5 Data Interchange
Experimental investigation shows that the number of chromosomes sent from a process $P_i$ to a process $P_j$, $j \neq i$, decreases along the evolution. This is because the search within each population becomes more focused on the best individuals. On the other hand, as the evolution advances, the mutation operator creates less variability because the less significant bits of the representation are changed with the highest probability. For equation (4), the bilateral traffic between processes is illustrated in Fig. 1. The average bilateral traffic between processes 2-3, 1-3 and 1-2 is 202, 55 and 200, respectively. Another feature of the communication is that the shorter the distance between the centers of two clusters, the higher the number of interchanged chromosomes. This is due to the genetic operators, which lead to offspring situated in overlapping vicinities centred on different but nearby clusters. This is illustrated
Fig. 1. Data interchange between processes along the evolution
in Table 6, which contains the cumulated unidirectional traffic values for the previous example along 16 epochs (1 epoch = 20 generations). $P_1$, $P_2$ and $P_3$ converge to $-2F$, $-0.5F$ and $1F$, respectively. The highest bi-directional traffic is between $P_2$ and $P_3$ (203/epoch) and between $P_1$ and $P_2$ (200/epoch). The sparsest is between $P_1$ and $P_3$ (56/epoch).

Table 6. Cumulated values of the data interchanges

Process   ->1     ->2     ->3
1->       -       3316    936
2->       3372    -       3426
3->       934     3361    -
Fig. 2 shows the evolution of the best and medium fitness when the distributed segregative GA is applied to solve equation (4). The parameter $\gamma_{crt}$ varies from $\gamma_1 = 92831$ to 0.06 in about 100 generations.
Fig. 2. Best (a) and medium (b) fitness variation
In conclusion, the experimental investigation shows that the proposed segregative GA is an efficient tool for finding multiple roots of fuzzy equations. It gives good approximations of the roots, while the distributed implementation shortens the run time, compensating for the computationally intensive processing required by the fuzzy arithmetic.
6 Final Remarks

A distributed segregative genetic algorithm has been presented for finding multiple roots of fuzzy equations. This algorithm outperforms the serial mono-population GA and the island parallel genetic algorithm versions with respect to both the ability to find several roots simultaneously and the accuracy/cost ratio. Work on better heuristics for locating the roots of fuzzy equations and on the extension to systems of fuzzy equations is in progress.
Acknowledgments. This research has been partially supported by CEEX Project 76-2/2006 Ministry of Education and Research, Romania.
References

1. Affenzeller, M.: A New Approach to Evolutionary Computation: Segregative Genetic Algorithms (SEGA), Institute of Systems Science Systems Theory and Information Technology, Johannes Kepler University, Linz, Austria (2005)
2. Brudaru, O., Buzatu, O.: Distributed genetic algorithm for finding fuzzy rational approximators. In: The Fourth International Conference on Parallel Computing in Electrical Engineering, PARELEC 2004, September 7-10, 2004, pp. 394–397. IEEE Computer Society Press, University of Technology Dresden, Germany (2004)
3. Brudaru, O., Leon, F., Buzatu, O.: Genetic Algorithm for Solving Fuzzy Equations. In: SACCS 2004 - International Symposium on Automatic Control and Computer Science, Iasi, Romania, October 22-23 (2004)
4. Buckley, J.J., Feuring, T., Hayashi, Y.: Solving fuzzy equations using evolutionary algorithms and neural. Soft Computing 6(2) (April 2004)
5. Buckley, J.J., Feuring, T.: Fuzzy and Neural: Interactions and Applications. Physica Verlag, Heidelberg (1999)
6. Cantu-Paz, E.: Implementing Fast and Flexible Parallel Genetic Algorithms. In: Practical Handbook of Genetic Algorithms, vol. 3, CRC Press, Boca Raton (1999)
7. Hirota, T.: A Digital Representation of Fuzzy Numbers and its Application. In: Proceedings of IIZUKA 1990, pp. 527–529 (1990)
8. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Program, 2nd edn. Springer, Berlin (1994)
9. Rojas, R.: Neural Networks - A Systematic Introduction. Springer, Berlin (1996)
Solving Channel Borrowing Problem with Coevolutionary Genetic Algorithms

Krzysztof Gajc 1 and Franciszek Seredynski 1,2,3

1 University of Podlasie, Department of Computer Science, ul. Sienkiewicza 51, 08-110 Siedlce, Poland
2 Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008 Warsaw, Poland
3 Institute of Computer Science, Polish Academy of Sciences, Ordona 21, 01-237 Warsaw, Poland
Abstract. With the increasing number of cellular network users, the optimal use of the available radio frequencies becomes a serious problem. Some base stations have free frequencies, while users in other stations are blocked because radio resources have run out. In this paper we propose to apply a coevolutionary genetic algorithm to find an optimal channel borrowing scheme, and we compare the results with those obtained with a standard genetic algorithm.
1 Introduction

Nowadays we can observe an increasing number of mobile phone users. Each of them must have access to mobile services at any time. Unfortunately, the capacity of mobile networks is limited, mainly by the restricted availability of radio frequency resources. For this reason radio frequencies must be managed effectively. A number of problems related to the management of mobile networks can be reduced to combinatorial optimization problems. The most important currently recognized problems are the following: the location management problem, the frequency assignment problem and the channel borrowing problem. In this paper we deal with the last one. The aforementioned problems are complex optimization problems, and a number of metaheuristics have been proposed to solve them. In particular, algorithms based on genetic algorithms [9][12], cellular automata [11] or neural networks [10] have been proposed for solving the channel borrowing problem. In this paper we propose to solve the channel borrowing problem with the use of coevolutionary algorithms, which represent a relatively new paradigm in evolutionary computing. This approach makes it possible to offer a distributed algorithm corresponding to the distributed nature of a cellular system. A standard genetic algorithm, which is a very popular tool for solving complex optimization problems, does not have this feature. The paper is organized as follows: the next section contains an introduction to the radio frequency problem in mobile phone networks. The problem of channel borrowing is presented in Section 3. Basic concepts of genetic algorithms and coevolutionary techniques are described in Section 4. Section 5 contains the description of the experiments and the results. The last section contains the conclusions and a discussion of further possibilities for developing the presented approach.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 489–498, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Use of Radio Frequencies in Mobile Systems

A cellular network is a network in which radio frequency resources are managed in cells, two-dimensional areas limited by the range of radio waves [1]. The central point of each cell is a base station. Base stations are connected by fixed networks. The cells are organised in clusters (see Fig. 1). Each cell in a cluster must work with a different radio frequency band to reduce interference between the signals of base stations. We assume that the i-th cell is characterized by two parameters: x, the number of currently active mobile users, and y, the total number of channels allocated to the cell. In the cluster shown in Fig. 1a, the cell #7, for example, currently has 3 active users and its total number of assigned channels is equal to 10. As the number of users requesting service in a cell increases, the number of available radio frequencies decreases. At some moment the number of available radio frequencies reaches zero and new users are blocked: the base station cannot assign them a channel (see, e.g., the cell #1 in Fig. 1a). As a measure of the quality of service we can take the number of users that are blocked in time T. This number must be minimized. At the network design stage one must ensure that regions with a high load have enough resources to serve the users. Adding new radio frequencies to an existing cellular network is very expensive and sometimes impossible. This is why the available resources must be exploited very effectively. There exist a few methods of optimising the use of radio frequencies. The simplest but expensive one is to divide a cell in which users are blocked into a number of smaller cells. In this case new base stations need to be built. Another method assumes a division of cells into a few classes: microcells with a relatively small range and macrocells with a large range [3]. Macrocells cover groups of microcells. When the available radio frequencies in a microcell run out, the macrocell serves the new users.
3 Channel Borrowing Problem

Another method of using radio frequencies effectively is to borrow frequencies that are not in use at a given moment from neighbouring cells with a low load. The borrowing operation can be performed in two ways [1]. It is possible to use a frequency that is still used by a neighbour, but with lower power, covering a smaller territory, to reduce the danger of interference. The other method is borrowing with frequency blocking: the cell that lends a frequency stops using it, and the cell that borrows the frequency uses it at full power. The situation before and after the cell #1 borrows a channel from the cell #7 is shown in Fig. 1.
Fig. 1. Cluster of cells: before borrowing by cell #1 from cell #7 (a), after borrowing (b)
As in the frequency assignment problem, there are constraints on reusing channels that arise from the nature of radio waves [4]:
– Co-channel constraints: a pair of interfering transmitters must not be assigned the same frequency;
– Co-site frequency constraints: the channel separation of any pair of frequencies assigned to transmitters belonging to the same site must be more than a fixed value;
– Adjacent-channel constraints: the channel separation of any pair of frequencies assigned to the same transmitter must be less than a fixed value.

Before a channel is borrowed, a decision must be taken from which cell the borrowing should take place. In this paper we construct and study a channel borrowing algorithm with blocking that reduces the total number of blocked users in a mobile network. We minimize the function
$$f = \sum_{i \in S} b_i \qquad (1)$$
where $b_i$ is the number of blocked users in cell $i$, and $S$ is the set of network cells. To solve this combinatorial optimization problem we apply two evolutionary algorithms: a standard genetic algorithm and a coevolutionary genetic algorithm.
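The objective (1) can be written down directly. The sketch below assumes, for illustration only, that every active user needs exactly one channel, so that the number of blocked users in a cell is the excess of demand over available channels; the function names and the dictionary layout are hypothetical.

```python
def blocked_users(demand, channels):
    """b_i, assuming each user needs one channel (an illustrative simplification)."""
    return max(0, demand - channels)

def total_blocked(cells):
    """Objective (1): f = sum of b_i over all cells.

    cells maps a cell id to a (demand, channels) pair.
    """
    return sum(blocked_users(d, c) for d, c in cells.values())
```

For instance, a cell with 12 requesting users and 10 channels contributes 2 blocked users, while a cell with 3 users and 10 channels contributes none.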
4 Genetic Algorithms and Coevolutionary Genetic Algorithms

Genetic Algorithms (GA) [5] are a class of optimization methods inspired by natural evolution. They give good results when one attempts to find an optimal
or near-optimal solution in a large space of potential solutions. We want to use this class of algorithms to find a borrowing scheme that minimizes the number of blocked users in a cellular network. We compare two GAs: a classical (standard) GA and a coevolutionary GA. One of the crucial problems when using a GA is choosing a method of coding a potential solution into an individual of the GA population. In our case, every cell in the network is assigned an array of numbers corresponding to the numbers of channels to borrow from the neighbouring cells. Fig. 2a shows the cell #1 with its cluster neighbourhood, and Fig. 2b presents the corresponding part of a GA individual. One can see that the cell #1 borrows 0 channels from the cell #2, 2 channels from the cell #3, etc. The whole individual for a cellular network consisting of $N$ cells has the length $N \cdot r$, where $r$ is the number of neighbouring cluster cells. The classical GA works as follows:
Fig. 2. Cell #1 in cluster (a) and corresponding part of an individual of GA (b)
1. Initialize an initial population, a set of potential solutions.
2. While the stop condition is not satisfied:
   – Calculate the fitness function of each individual in the population.
   – Apply the selection operator (proportional selection with an elite is used).
   – Apply the crossover and mutation operators to create a new population.

The best individual from the last generation is the solution of the problem. The algorithm runs for a number of generations $L$. As the fitness function we use
$$F = C - f \qquad (2)$$
where $C$ is a constant and $f$ is the function (1).
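The encoding and the fitness (2) can be illustrated by the following Python sketch. It rests on several assumptions that are not taken from the paper: `neighbours[i]` lists the `r` neighbours of cell `i` in the order used by the chromosome, each user needs one channel, and a gene asking for more channels than the lender still holds is silently truncated.

```python
def decode(chromosome, n_cells, r):
    """Split a flat individual of length n_cells * r into one borrow array per cell."""
    assert len(chromosome) == n_cells * r
    return [chromosome[i * r:(i + 1) * r] for i in range(n_cells)]

def fitness(chromosome, demand, channels, neighbours, C):
    """F = C - f, Eq. (2), after applying the encoded borrowing with blocking."""
    n, r = len(demand), len(neighbours[0])
    avail = list(channels)
    for i, borrow in enumerate(decode(chromosome, n, r)):
        for j, amount in zip(neighbours[i], borrow):
            amount = min(amount, avail[j])  # assumed repair: lender cannot go below zero
            avail[j] -= amount              # lender stops using the borrowed channels
            avail[i] += amount              # borrower uses them at full power
    f = sum(max(0, d - a) for d, a in zip(demand, avail))
    return C - f
```

With two mutually neighbouring cells, demands (12, 3) and 10 channels each, the individual `[2, 0]` removes both blocked users of the first cell, so its fitness equals `C`.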
A number of papers present solutions of the channel borrowing problem with the use of GA [9][12]. The main problem with a method like GA is the assumption that there exists a central unit that receives full information from all cells about their needs and proposes a solution. While this approach may offer good theoretical solutions, it is impractical. The cellular mobile network is a distributed system, and distributed algorithms are needed to provide a high efficiency of the whole system. For this reason we propose in this paper a novel solution based on a coevolutionary algorithm. Coevolution is a relatively new paradigm in the field of evolutionary computation. In contrast to GA, where a population consists of individuals of a single species, coevolutionary algorithms assume the existence of two or more species that evolve in subpopulations and interact in the process of searching for a solution. A number of coevolutionary algorithms have been proposed recently [2][6]. In this paper we apply the Loosely Coupled GA (LCGA) proposed in [7][8]. In contrast to GA, we assume that a subpopulation of individuals is related to each cell of the mobile system. The individuals related to one cell represent a single species. The number of subpopulations (and the number of species) is equal to the number of cells. In LCGA we use another representation of an individual in a subpopulation. An individual related to the i-th cell has the length $r$ ($r$ is the neighbourhood size of a cell) and represents the numbers of channels proposed for lease to the neighbouring cells. Fig. 3a shows a fragment of a network, a cluster around the cell #1, and Fig. 3b shows a corresponding individual of the subpopulation related to this cell.
Fig. 3. Cell #1 in cluster (a) and individual for CGA (b)
With a given subpopulation $j$ ($j = 1, \ldots, N$) a local function $f_j = b_j$ is associated, i.e. each cell wants to minimize its own number of blocked users. However, as in the case of GA, we also observe a global characteristic of the system, i.e. the function (1), which should be minimized. As the fitness function of an individual of the j-th subpopulation a more detailed function is used:
$$f_j^* = C - (a_1 \cdot b_j + a_2 \cdot lc_j + a_3 \cdot s_j) \qquad (3)$$
where $b_j$ is the number of blocked users in the cell #j, $lc_j$ is the number of channels leased to neighbours, $s_j$ is the number of channels leased to and used by neighbours, and $a_1$-$a_3$ are coefficients. LCGA works similarly to GA. The main difference is that there is not one population but a number of subpopulations, which evolve in parallel. LCGA works in the following way:
1. Create an initial subpopulation for every cell in the network.
2. Calculate the fitness function for every individual:
   – From every subpopulation corresponding to a cell take randomly one individual; the set of such individuals creates a single solution (a channel borrowing proposition); realize the channel borrowing scheme corresponding to these individuals.
   – For these individuals calculate the fitness function depending on the situation in the cluster.
3. Repeat step 2 until every individual in each subpopulation has an assigned fitness value.
4. In each subpopulation perform locally the genetic operators of selection, crossover and mutation, and create a new subpopulation.
5. While the stop condition is not satisfied, go to step 2.

The solution of the problem is the collection of the best individuals, one from each subpopulation.
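Steps 2 and 3 of LCGA, together with the local fitness (3), can be sketched as follows. The sketch is hedged in several ways: `score_team` is a hypothetical caller-supplied function that realizes one borrowing proposition and returns one numeric fitness value per cell, and individuals that have not been scored yet are preferred when a team is assembled, which is an assumption added here so that the loop ends after a bounded number of teams (the text only says that individuals are taken randomly).

```python
import random

def local_fitness(b_j, lc_j, s_j, C, a1, a2, a3):
    """Fitness (3) of an individual of the j-th subpopulation."""
    return C - (a1 * b_j + a2 * lc_j + a3 * s_j)

def evaluate_all(subpopulations, score_team, rng=None):
    """Steps 2-3: assemble random teams until every individual has a fitness value."""
    rng = rng or random.Random()
    fitness = [[None] * len(sp) for sp in subpopulations]
    while any(v is None for row in fitness for v in row):
        picks = []
        for row in fitness:
            # assumption: prefer individuals that have not been scored yet
            pending = [i for i, v in enumerate(row) if v is None]
            picks.append(rng.choice(pending) if pending else rng.randrange(len(row)))
        team = [sp[i] for sp, i in zip(subpopulations, picks)]
        scores = score_team(team)  # one numeric fitness value per cell
        for cell, i in enumerate(picks):
            fitness[cell][i] = scores[cell]
    return fitness
```

After `evaluate_all` returns, selection, crossover and mutation (step 4) can be applied inside each subpopulation independently, which is what allows the algorithm to run in a distributed manner.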
5 Experimental Results

In our experiments we compare the efficiency of the two evolutionary algorithms presented above. We also use two additional techniques: no borrowing, and a simple heuristic algorithm in which a channel is borrowed from the neighbour with the largest number of free channels. As the optimization criterion we use the number of blocked users (see the function (1)). Simulation experiments were conducted for a mobile cellular network with the size of 10x10 cells. It was assumed that users inside the network can move from one cell to another. We use the following scenarios of moving users:
– Highway: a big group of users moves together along a straight line;
– City: a big group of users starts from a central point of the network and moves in random directions with a random speed.

In each of these schemes there is a small number of users that do not move. The conditions concerning users and the network were generated once and are the same for every borrowing scheme. Each experiment was repeated several times for 128 and 256 users in the network. Table 1 shows the parameters of the evolutionary methods used in the experiments. As shown, the same parameters are used for both GAs.

Table 1. Parameters used in experiments for evolutionary algorithms

Parameter                   GA       LCGA
Population size             150      150
Probability of crossover    0.8      0.8
Probability of mutation     0.005    0.005
Elite size                  5        5
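The simple heuristic baseline used in these experiments, in which a channel is borrowed from the neighbour with the largest number of free channels, can be sketched as below. The details are assumptions made for illustration: channels are borrowed one at a time, a channel counts as free when it exceeds the lender's own demand, and `neighbours` is a hypothetical list of neighbour lists.

```python
def simple_borrow(cell, demand, channels, neighbours):
    """Borrow channels one at a time from the neighbour with the most free channels."""
    avail = list(channels)
    while demand[cell] > avail[cell]:
        # free channels of every neighbour after serving its own users
        free = {j: avail[j] - demand[j] for j in neighbours[cell]}
        lender = max(free, key=free.get)
        if free[lender] <= 0:
            break                 # no neighbour can lend without blocking its own users
        avail[lender] -= 1        # borrowing with blocking: lender gives the channel up
        avail[cell] += 1
    return avail
```

The function returns the channel allocation after borrowing; users that remain above `avail[cell]` are the blocked users of that cell.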
The results are presented in Table 2. As shown, for both scenarios the best results are given by the classical GA, but the results obtained with LCGA are also acceptable and are better than the solutions obtained with simple borrowing. It is worth mentioning that LCGA, in contrast to GA, works in a fully distributed manner, which corresponds to the distributed nature of cellular systems.

Table 2. Total number of blocked users in different borrowing methods

Scenario              Without borrowing   Simple borrowing   GA     LCGA
256 users, highway    220                 200                167    180
256 users, city       65                  54                 28     39
Fig. 4 shows how the number of blocked users changes over simulation time, depending on the borrowing method, in the highway scenario. In this scenario the number of potentially blocked users is almost constant. The results obtained with LCGA are better than those given by simple borrowing but worse than the GA results. Fig. 5 shows an example LCGA run and how the number of blocked users changes during it. Fig. 6 shows how the value of the LCGA fitness function for the entire network changes while the subpopulations evolve.
[Figure: blocked users (160 to 240) versus iteration step (10 to 100) for the highway model with 256 users; curves: no borrowing, simple borrowing, GA, LCGA]
Fig. 4. Number of blocked users in simulation for different borrowing methods in highway scenario
[Figure: number of blocked users (190 to 220) versus generation (20 to 120) for one run]
Fig. 5. Change in the number of blocked users in one LCGA run
[Figure: best and average fitness value (0 to 60) versus generation (20 to 120) for the entire network]
Fig. 6. Fitness values for one LCGA run for entire network
6 Conclusions
In this paper two evolutionary algorithms, GA and LCGA, were used to solve the channel borrowing problem. Preliminary experimental results have shown that both evolutionary algorithms solve the problem, but GA offers better quality results. The results of LCGA are acceptable, and the main advantage of this approach is that it solves the borrowing problem in a fully distributed way. This approach seems to be much more realistic than the GA approach, which assumes the existence of one central control unit. Our current research is oriented towards increasing the performance of LCGA and studying cellular networks of much more realistic sizes.
References
1. Agrawal, D.P., Qing-An, Z.: Introduction to Wireless and Mobile Systems, Thomson (2006)
2. Barbosa, H.J.C.A.: Coevolutionary Genetic Algorithm for Constrained Optimization. In: Proceedings of the 1999 Congress of Evolutionary Computation, vol. 3, pp. 1605–1611. IEEE Press, Los Alamitos (1999)
3. Bassiouni, M., Fang, C.: Dynamic Channel Allocation for Real-time Connections in Highway Macrocellular Networks. Journal of Wireless Personal Communications 19(2), 121–128 (2001)
4. Crisan, C., Mühlenbein, H.: The Breeder Genetic Algorithm for Frequency Assignment. In: Eiben, A.E., Bäck, T., Schoenauer, M., Schwefel, H.-P. (eds.) PPSN 1998. LNCS, vol. 1498, pp. 897–906. Springer, Heidelberg (1998)
5. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading (1989)
6. Lohn, J., Kraus, W.F., Haith, G.L.: Comparing a Coevolutionary Genetic Algorithm for Multiobjective Optimization. In: 2002 Congress on Evolutionary Computation (CEC 2002), pp. 1157–1162. IEEE Press, Piscataway (2002)
7. Seredynski, F.: Competitive Coevolutionary Multi-Agents Systems: The Application to Mapping and Scheduling Problems. Journal of Parallel and Distributed Computing, 39–57 (1997)
8. Seredynski, F.: Loosely Coupled Distributed Genetic Algorithms. In: Davidor, Y., Männer, R., Schwefel, H.-P. (eds.) PPSN 1994. LNCS, vol. 866, pp. 514–523. Springer, Heidelberg (1994)
9. Somnath, S.M.P., Kousik, R., Sarthak, B., Deo, P.V.: Improved Genetic Algorithm for Channel Allocation with Channel Borrowing in Mobile Computing. IEEE Transactions on Mobile Computing 5(7), 884–892 (2006)
10. Wilmes, E.J., Erickson, K.T.: Two methods of neural network controlled dynamic channel allocation for mobile radio systems. In: Vehicular Technology Conference. Mobile Technology for the Human Race, IEEE 46th, pp. 746–750 (1996)
11. Yener, A., Rose, C.: Genetic Algorithms Applied to Cellular Call Admission Problem: Local Policies. IEEE Transactions on Vehicular Technology 46(1), 72–79 (1997)
12. Zomaya, A.Y., Wright, A.: Observations of Using Genetic Algorithms for Channel Allocation in Mobile Computing. IEEE Transactions on Parallel and Distributed Systems 13(9), 948–962 (2002)
Balancedness in Binary Sequences with Cryptographic Applications
Candelaria Hernández-Goya¹ and Amparo Fúster-Sabater²
¹ DEIOC, University of La Laguna, 38271 La Laguna, Tenerife, Spain, [email protected]
² C.S.I.C., Serrano 144, 28006 Madrid, Spain, [email protected]
Abstract. An efficient algorithm to compute the degree of balancedness in sequence generators of cryptographic application has been developed. The computation is carried out by means of simple logic operations on bit strings. An MPI-based implementation of the algorithm is also described. Emphasis is placed on the computational features of this algorithm, concluding that results of cryptographic interest may be obtained within affordable time and memory consumption. The procedure checks the deviation of balancedness from standard values for key-stream cryptographic generators in a completely deterministic approach.
Keywords: Cryptography, balancedness, binary sequences, stream ciphers.
1 Introduction
Confidentiality of sensitive information makes use of an encryption function, called a cipher, that converts the plaintext into the ciphertext. Symmetric key ciphers are usually divided into two large classes: stream ciphers and block ciphers. Stream ciphers are the fastest among the encryption procedures, so they are implemented in many technological applications, e.g. the A5 algorithms in GSM communications [1] or the encryption function E0 in the Bluetooth specifications [2]. Stream ciphers try to imitate the one-time pad cipher and are designed to generate a long sequence of seemingly random bits [3]. This key-stream sequence is then XORed with the plaintext in order to obtain the ciphertext. Most generators of key-stream sequences are based on Linear Feedback Shift Registers (LFSRs) [4]. Thus, the generated sequence is just the image of a nonlinear Boolean function in the LFSR stages. Balancedness in the output sequence is a necessary condition that every cryptographic generator must satisfy. Roughly speaking, a binary sequence is balanced if it has approximately the same number of 1s as 0s. Due to the long
Work supported by Ministerio de Educación y Ciencia under project SEG2004-04352-C04-03 and by HPC-Europa programme, funded under the European Commission's Research Infrastructures activity of the Structuring the European Research Area programme, contract number RII3-CT-2003-506079.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 499–508, 2008. © Springer-Verlag Berlin Heidelberg 2008
period of the generated sequence ($T \simeq 10^{38}$ bits in current cryptographic applications), it is infeasible to produce an entire cycle of such a sequence and then count the number of 1s and 0s. Therefore, in practice, portions of the output sequence are chosen randomly and statistical tests such as the monobit test specified in FIPS 140-1 are applied to such subsequences [5]. Nevertheless, passing these tests merely provides probabilistic evidence that the generator produces a balanced sequence. Checking the balancedness of LFSR-based generators in a deterministic way is the main contribution of the present work. In fact, an efficient bit-string algorithm allows us to compute the exact number of 1s (or of 0s) in the output sequence without producing the whole sequence. The number of 1s (0s) obtained is compared with the expected value (half the period ± a tolerance interval). If they disagree, the LFSR-based generator must be rejected, since this bias may be exploited for the cryptanalysis of the generator.
1.1 Basic Concepts and Notation
A Boolean function of L variables $(a_1, a_2, \ldots, a_L)$ can be represented as:
(i) Algebraic Normal Form (ANF), which expresses the function $F_I$ as the exclusive-OR sum of logic products in the L variables:
$$F_I(a_1, a_2, \ldots, a_L) = \bigoplus_{i=1}^{N_I} \Big( \prod_{j \in \alpha_i} a_j \Big), \quad \alpha_i \subseteq \{1, 2, \ldots, L\}$$
(ii) Minterm representation, which expresses the function as a linear combination of its minterms (logic products of the L variables in either true or complementary form):
$$F_M(a_1, a_2, \ldots, a_L) = \bigoplus_{i=1}^{N_I} A_{\alpha_i}$$
Here, $A_{\alpha_i}$ represents the minterm that includes the subset of variables $\alpha_i \subseteq \{1, 2, \ldots, L\}$ in their true form while the rest of the variables are in complementary form. Henceforth, we will use $F_I$ to denote the ANF form of a function $F$ while $F_M$ will stand for the minterm representation of the function $F_I$. Next, the minterm function $F_M$ associated to a Boolean function is introduced. In fact, $F_M$ is an L-variable Boolean function such that, given $F_I$ in ANF, $F_M$ substitutes each term of $F_I$ by its corresponding minterm. Moreover, it can be proved [6] that the composition function $F_O = F_M \circ F_M$ coincides with $F_I$ ($F_O$ stands for the output function built by the algorithm) once it is expanded out and expressed in its ANF form. An LFSR-based generator is a nonlinear Boolean function $F_I$ given in ANF whose input variables ($a_i$) are the binary contents of the LFSR stages. At each new clock pulse, the new binary contents will be the new input variables. In this way, the generator produces the successive bits of the output sequence. Every
LFSR-based generator can be expressed as a linear combination of its minterms, and each minterm provides the output sequence with a unique 1, see [6]. Therefore, the aim of this work is to express the ANF of a function $F_I$ in terms of its minterm representation; the number of minterms in $F_O$ then gives the exact number of 1s in the output sequence.
2 Conversion from ANF to Minterm Representation
For the sake of simplicity, in the description of the implementation every minterm $A_\alpha$, $\alpha \subseteq \{1, 2, \ldots, L\}$, represents the minterm in which the variables contained in $\alpha$ are in their true form. It can be interpreted as an L-bit string numbered 1, 2, ..., L from right to left. If the n-th index is in the set $\alpha$ ($n \in \alpha$), then the n-th bit of the string takes the value 1. Otherwise, the value is 0. In addition, $d(\alpha)$ equals the number of 1s in the L-bit string that represents $A_\alpha$. So, $d(\alpha)$ coincides with the Hamming weight of the string associated to $A_\alpha$. The procedure of conversion from ANF to minterm representation [6] is considered next.
Input: A nonlinear Boolean function given with $N_I$ terms in its ANF form.
– Step 1: Compute $F_M(a_1, a_2, \ldots, a_L)$ by substituting each term $\prod_{j \in \alpha_i} a_j$ by the corresponding minterm $A_{\alpha_i}$, $i \in \{1, 2, \ldots, N_I\}$.
– Step 2: Carry out the two following substeps:
2.1) Expand out each minterm and cancel common terms
$$F_M(a_1, a_2, \ldots, a_L) = \bigoplus_{i=1}^{N_I} \Big( \prod_{k=1}^{L} b_{ik} \Big), \quad b_{ik} = \begin{cases} a_k & \text{if } k \in \alpha_i \\ (1 + a_k) & \text{if } k \notin \alpha_i \end{cases}$$
getting as result:
$$F_M(a_1, a_2, \ldots, a_L) = \bigoplus_{i=1}^{N_M} \Big( \prod_{j \in \alpha_i} a_j \Big), \quad \alpha_i \subseteq \{1, 2, \ldots, L\}$$
2.2) Compute $F_O = F_M \circ F_M$:
$$F_O(a_1, a_2, \ldots, a_L) = \bigoplus_{i=1}^{N_M} A_{\alpha_i}$$
– Step 3: Compute the number of minterms ($N_M$) in the minterm representation of the function $F_O$.
Output: The number of minterms in the nonlinear Boolean function $F_O$.
Once the function $F_I$ has been expressed in terms of its minterms, balancedness in the output sequence can be easily analyzed.
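The expansion in substep 2.1 can be illustrated with a small Python sketch. It is not taken from the paper: the function name `expand_minterm` and the use of Python integers as bit masks (bit n-1 set when index n belongs to α) are assumptions made for illustration. Expanding the factors (1 + a_k) of a minterm yields one monomial for every superset of α, so the minterm contributes 2^(L-d(α)) monomials.

```python
def expand_minterm(alpha, L):
    """Return the ANF monomials (as bit masks) obtained by expanding A_alpha.

    alpha is an L-bit mask of the variables in true form; every variable
    outside alpha contributes a factor (1 + a_k), so each superset of alpha
    appears exactly once in the expansion.
    """
    free = ((1 << L) - 1) & ~alpha      # variables in complementary form
    monomials = []
    sub = free
    while True:                         # enumerate all subsets of `free`
        monomials.append(alpha | sub)
        if sub == 0:
            break
        sub = (sub - 1) & free
    return sorted(monomials)


# Hypothetical usage with L = 3: A_12 and A_13 as in the notation above.
a12 = set(expand_minterm(0b011, 3))     # a1a2 + a1a2a3
a13 = set(expand_minterm(0b101, 3))     # a1a3 + a1a2a3
# XOR of the two expansions: the shared monomial a1a2a3 cancels.
print(sorted(a12 ^ a13))                # prints [3, 5], i.e. a1a2 + a1a3
```

Representing monomials as integers keeps the cancellation of common terms a plain set symmetric difference, which is only practical for small L and is meant as an illustration of the conversion, not as an efficient method.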
3 The Algorithm
The first task carried out by the algorithm is the conversion procedure described above. Afterwards, it computes the number of ones in the keystream from the number of minterms in the function $F_O$. We define the maximum common development of two minterms $A_\alpha$ and $A_\beta$, denoted $MD(A_\alpha, A_\beta)$, as the minterm $A_\chi$ where $\chi = \alpha \cup \beta$. Under the L-bit string representation, $MD$ can be realized by means of a bit-wise OR operation between the binary strings of both minterms. Finally, the computation of minterms is reduced to the computation of 1s in the resulting L-bit strings. If two minterms $A_\alpha$ and $A_\beta$ are added, $A_\alpha \oplus A_\beta$, then the terms corresponding to their $MD$ are cancelled. This occurs because $MD$ represents the terms that both minterms have in common when they are expanded out. Thus, the total number of terms in $A_\alpha \oplus A_\beta$ is the number of terms in $A_\alpha$ plus the number of terms in $A_\beta$ minus twice the number of terms in $MD$, that is,
$$2^{L-d(\alpha)} + 2^{L-d(\beta)} - 2 \cdot 2^{L-d(\alpha \cup \beta)}.$$
The following example clarifies this concept:
$$A_{12} \oplus A_{13} = (a_1 a_2 \oplus a_1 a_2 a_3) \oplus (a_1 a_3 \oplus a_1 a_2 a_3) = a_1 a_2 \oplus a_1 a_3$$
and
$$MD(A_{12}, A_{13}) = 011 \ \mathrm{OR} \ 101 = 111 = a_1 a_2 a_3 = A_{123}.$$
According to the previous considerations, the algorithm can be stated as follows:
Input: A nonlinear function $F_I$ of $N_I$ terms given in ANF.
– Step 1: Define the bit-strings $A_{\alpha_i}$ from the $N_I$ terms of $F_I$. Initialize the bit-string $H_0$ with a null value, $H_0 = \emptyset$. $H_i$ refers to the minterm representation of the function $F_O$ once the term i of the function $F_I$ is considered.
– Step 2: Run this loop from $i = 1$ to $i = N_I$: update $H_i = H_{i-1} + A_{\alpha_i} - 2 \cdot MD(A_{\alpha_i}, H_{i-1})$.
– Step 3: From the final form of $H_{N_I} = \sum_j s_j A_{\beta_j}$, compute the number of 1s in the generated sequence by means of the expression $U_{F_I} = \sum_j s_j \cdot 2^{L-d(\beta_j)}$ with $s_j \in \mathbb{Z}$.
Output: The number of 1s in the generated sequence, $U_{F_I}$.
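Steps 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' ANSI C implementation: the names `count_ones` and `brute_force_ones` are hypothetical, each term is a Python integer used as an L-bit mask, and a dictionary stands in for the list H_i by mapping each bit string β_j to its signed coefficient s_j. The brute-force helper evaluates the ANF on all 2^L inputs and is only there to check the result on small cases.

```python
def count_ones(terms, L):
    """Number of 1s produced by the ANF given as a list of bit masks."""
    H = {}                                    # bit string beta_j -> coefficient s_j
    for alpha in terms:
        md = {}                               # MD(A_alpha, H_{i-1})
        for beta, s in H.items():
            key = beta | alpha                # bit-wise OR realizes alpha U beta
            md[key] = md.get(key, 0) + s
        H[alpha] = H.get(alpha, 0) + 1        # H_{i-1} + A_alpha
        for key, s in md.items():             # ... - 2 * MD(A_alpha, H_{i-1})
            H[key] = H.get(key, 0) - 2 * s
        H = {b: s for b, s in H.items() if s != 0}   # drop cancelled entries
    # Step 3: U = sum_j s_j * 2^(L - d(beta_j)), d = Hamming weight
    return sum(s * 2 ** (L - bin(b).count("1")) for b, s in H.items())


def brute_force_ones(terms, L):
    """Reference count: evaluate the ANF on every L-bit input (small L only)."""
    total = 0
    for x in range(2 ** L):
        value = 0
        for alpha in terms:
            if x & alpha == alpha:            # monomial is 1 iff all its variables are 1
                value ^= 1
        total += value
    return total


# Hypothetical usage: F_I = a1a2 XOR a1a3 with L = 3.
print(count_ones([0b011, 0b101], 3))          # prints 2
```

For this example the final dictionary holds 011 and 101 with coefficient 1 and 111 with coefficient -2, so the count is 2 + 2 - 2 = 2, in agreement with the truth table of a1a2 ⊕ a1a3.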
4 Implementation and Practical Results
In this section the details of the serial and parallel implementations (written in ANSI C) are described together with the experiments carried out.
4.1 General Considerations
The data structure chosen to simulate the function FI is a linked list where each element contains:
– field value: an array of unsigned integers of length 64 bits (type long long integer) representing the corresponding minterm.
– field coefficient: a signed integer that records how many times the current minterm has been added or cancelled during the execution of the algorithm up to the current stage.
This data structure allows us to reduce the operations specified by the algorithm to bit-wise OR operations among the strings $A_{\alpha_i}$. As can be seen from the experiments described in the next section, the order in which the terms of the function $F_I$ are examined influences the amount of resources consumed by the algorithm. Actually, it is the number of 1s in the examined minterms that determines the behavior of the algorithm. That is the reason why the Hamming weight is used as the first sorting parameter when building the list $F_I$. In addition to the structure defined for the nodes of the linked list, and in order to save time, an auxiliary vector (Ham) of length L is introduced. The i-th component of this vector stores the position (address) where the first element of Hamming weight i+1 lies. In this way, when a minterm with Hamming weight j is examined, only the portion of the list holding the elements with the same Hamming weight has to be searched, which saves time. Based on these ideas, four different ways of building the list $F_I$ were developed. These procedures differ in the order in which the minterms are inserted:
– Ascending order: The minterms with lower Hamming weight at field value are inserted at the beginning of the list.
– Descending order: In this case, the first elements in the list are those with higher Hamming weight. The same sorting criterion is used in the sublists.
– Inserting at the beginning: Each new minterm to be examined is placed at the head of the list.
– Inserting at the end: This insertion method places the current minterm at the tail of the linked list.
At each new iteration, the actions described in step 2 require the construction of a new list corresponding to $MD(A_{\alpha_i}, H_{i-1})$. Afterwards, this list is linked to $H_{i-1}$, the list obtained at the previous iteration. The minterms that form part of $MD(A_{\alpha_i}, H_{i-1})$ have a field value greater than or equal to the ones associated with the terms already included in $H_{i-1}$. That is the reason why these lists are always generated in ascending order according to the Hamming weight and the minterm value. When the efficiency of the algorithm is evaluated from the worst-case complexity point of view, the order obtained is $O(N_I \cdot 2^{N_I})$. However, this analysis is not realistic when applied to LFSR-based generators, since such a case would mean that the generated sequences do not even fulfill the short-term statistical properties.
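The ordering by Hamming weight and the auxiliary vector Ham can be sketched as follows. This is a hedged illustration in Python, not the linked-list code of the paper: a plain Python list replaces the linked list, list positions replace node addresses, and the names `order_terms` and `build_ham_index` are hypothetical.

```python
def hamming_weight(x):
    """Number of 1s in the bit string x, i.e. d(alpha)."""
    return bin(x).count("1")


def order_terms(terms, descending=True):
    """Sort bit masks by Hamming weight first and by minterm value second."""
    return sorted(terms, key=lambda t: (hamming_weight(t), t), reverse=descending)


def build_ham_index(ordered_terms, L):
    """ham[i] = position of the first element of Hamming weight i + 1.

    Entries stay None when no element of that weight is present
    (an assumption; the paper does not say how empty weights are marked).
    """
    ham = [None] * L
    for pos, term in enumerate(ordered_terms):
        w = hamming_weight(term)
        if w >= 1 and ham[w - 1] is None:
            ham[w - 1] = pos
    return ham


# Hypothetical usage with L = 3.
terms = [0b111, 0b001, 0b011, 0b100]
desc = order_terms(terms, descending=True)    # [7, 3, 4, 1]
print(build_ham_index(desc, 3))               # prints [2, 1, 0]
```

With such an index, the search for an element of Hamming weight j can start at position ham[j-1] instead of at the head of the list, which is the time saving that the auxiliary vector is meant to provide.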
4.2 Serial Computational Experience
This subsection describes the results obtained in the experiments carried out with the serial version of the algorithm. Several trials were performed. The first one was run on a notebook (Pentium IV CPU at 3.06 GHz, 512 MB RAM). This first experiment was twofold. On the one hand, the growth in the algorithm's requirements was studied for a set of nonlinear functions with different numbers of terms $N_I$. These functions were randomly generated as "dense" functions, i.e. functions whose number of terms was greater than the LFSR's length. We explored time and memory consumption for lengths ranging from L = 4 up to L = 32 (Table 1). It can be noticed that the growth in time consumption is slower when descending order in the input list is considered.
Table 1. Serial version: Exploring lengths in the interval [4,32] for dense functions
            Insertion methods, time (s)
L    N_I    Descending     Ascending      Beginning       End
4    9      0.000015       0.000037       0.000015        0.000034
11   10     0.000117       0.000117       0.000041        0.000124
13   19     0.002834       0.001576       0.004513        0.002116
15   26     0.072567       0.065289       0.090434        0.083481
17   22     0.019869       0.025606       0.054372        0.026987
19   29     0.515094       0.424093       0.779927        0.475481
21   43     11.654250      6.251525       613.753521      6.639547
23   61     1039.345658    446.719947     1743.768206     2077.803223
32   50     1206.056401    2460.806611    41902.840635    3650.988031
On the other hand, the second part of this preliminary experiment consisted in analyzing the resource consumption when the length was fixed at L = 32 while the number of terms $N_I$ varied from $N_I = 17$ up to $N_I = 50$. Figure 1 and Table 2 show how resource consumption depends on the order in which the terms were analyzed for this fixed length. The least steep profile in the figure is associated with the insertion method using descending order. This is evidence that this method is the most suitable when resources are restricted. A simple memory consumption analysis is shown in Figure 2. The function chosen for this experiment has a minterm $A_{\alpha_{50}}$ containing an index in position 50 which appears for the first time in the function. The figure compares the memory requirements when this minterm is analyzed using the four insertion methods for building the list $F_I$. The values represented on the vertical axis ($\Delta(M_i)$) correspond to the partial memory spent by the algorithm during the i-th minterm evaluation. Contrary to what might be expected, the function $\Delta(M_i)$ is not monotonically increasing. In fact, when a new minterm $A_{\alpha_i}$ is examined, the length of the list
Fig. 1. Serial version: Time expenses when L = 32
Fig. 2. Serial version: Memory requirements
Table 2. Serial version: Exploring $N_I \in [17, 50]$ when length is fixed to L = 32
       Insertion methods, time (s)
N_I    Descending     Ascending       Beginning      End
17     0.060123       0.051201        0.065889       0.069289
20     0.164954       79.838514       0.249549       0.473879
25     3.115604       0.323327        5.034812       8.409885
30     43.499306      79.838514       66.706362      291.443199
35     314.682024     757.167563      808.720591     2611.555953
40     1083.218336    1761.116686     1783.359994    3030.704070
45     3167.012708    5755.229837     5013.538995    7962.885570
50     2460.806611    10247.150844    613.753521     41902.840635
$H_i$ may not change or may even decrease. In the profile of the function represented, the sections with maximum slopes are associated with those minterms whose indexes do not appear in the elements analyzed so far. Hence, on analyzing the results (see the last row in Table 2), the conclusion that the order in which the minterms are analyzed affects the number of comparisons is confirmed. In fact, when the list $MD(A_{\alpha_i}, H_{i-1})$ is built and highly frequent indexes appear, a great many repeated binary strings are obtained, whereas the minterms with rare indexes generate many more elements to be inserted in the list. If the rare strings are inserted at the beginning of the process, then the auxiliary list $H_i$ grows quickly and such strings have to be carried throughout the calculations. This was the reason why we adopted the Hamming weight as sorting criterion and introduced it in our second experiment. In this way, the time consumption of the algorithm was analyzed a) when the minterms with higher Hamming weights were examined first and b) when they were examined last. The results obtained are given in Table 3. The second experiment was carried out on a machine (Tarja) located at the Computing Research Support Service of the University of La Laguna (64 Intel Itanium 2 processors at 1.5 GHz and 256 GB RAM). The instances evaluated were also randomly generated following the indications given in [4]. The following recommendations were considered: the length of the LFSRs was a prime number, the number of minterms was fixed as L/2 ($N_I = L/2$), and the distribution of the minterms was as follows: one term of order L/2 and several terms of low orders. Following these hints improves the linear complexity as well as the confusion of the generated sequence. Finally, twenty of these functions were randomly generated for each length.
The figures contained in Table 3 correspond to the average time calculated for each method and each length. In order to prove that the difference among the times consumed by each method was not negligible, statistical tests were carried out. These tests showed that the difference in time consumption is statistically significant. Note the reduction
Table 3. Serial version: Time and memory consumption in different insertion methods
      Insertion methods, time (s)                                      Standard       Memory
L     Ascending       Descending     End             Beginning         deviation      (bytes)
17    0.002463        0.001591       0.002041        0.001643          0.000406       436.60
19    0.005373        0.003362       0.003892        0.003790          0.002259       713.00
23    0.039821        0.023242       0.027873        0.025270          0.007426       2164.20
29    0.747996        0.387202       0.442608        0.401662          0.170376       10976.20
31    0.577718        0.234695       0.335746        0.372657          0.144012       8641.40
37    8.893148        2.858814       3.708103        4.027641          2.725812       31672.60
41    28.927472       9.958281       12.096550       16.733530         8.484202       64410.80
43    95.173627       31.611891      60.556840       36.701542         39.675053      97994.20
47    1081.090696     432.942094     553.082902      522.977275        293.516883     426630.40
53    2988.857453     834.996001     1409.526998     1473.830173       920.644763     550728.71
59    16960.045435    4263.622550    7061.612248     13200.287895      5764.040705    1296742.00
61    27585.195783    8501.225097    11422.367676    9787.686239       8921.166474    1716769.60
in time consumption obtained when the descending order was chosen as the criterion for examining the minterms.
4.3 Parallel Computational Experience
As a natural continuation of this computation, we evaluated the possibility of developing a parallel version of the proposed algorithm. Designing a direct parallelization turned out to be a hard question, since there are strong dependencies among the lists the procedure has to build. Finally, an MPI implementation using the master-slave paradigm was designed. The parallel algorithm is described next. There is a root processor ($t_0$) in charge of splitting the list $H_{i-1}$, $i = 1, 2, \ldots, N_I$, among the rest of the tasks. Afterwards, each task ($t_j$, $j = 0, 1, \ldots, n-1$), including the root task, builds locally the chunk of the list $MD_{ij}$ associated with it, and the results are delivered back to the master processor. The master gathers the partial results, obtaining the global list $MD_i$ for the corresponding iteration. From this step onwards the root processor is in charge of calculating the new list $H_i$. In the last iteration each processor stores a chunk of the final list $H_{N_I}$, hence each processor may calculate the number of ones in the corresponding chunk. Finally, the master processor obtains the global number of ones in the key-stream sequence by combining these partial results. The machine on which the experiments were carried out was Lomond (Sun Fire E15k server, with 52 UltraSPARC III processors at 900 MHz and 1 GB RAM). The computational experience with the parallel version basically repeated the one described in Table 3 using from 1 up to 32 processors. So far, we have gathered results for the first insertion method (ascending order). These new results are shown in Table 4. A preliminary analysis of these results does not show improvements when comparing them with the ones
Table 4. Parallel version: Time expenses when using ascending order
      Processors, time (s)
L     1             2              4             8             16             32
17    0.005820      0.013789       0.037452      0.014012      0.017335       –
19    0.009257      0.021465       0.044650      0.021070      0.027242       –
23    0.037276      0.081716       0.073809      0.076270      0.060847       –
29    0.520710      0.820401       0.766395      0.676239      0.589137       –
31    0.320188      0.558836       0.471122      0.445524      0.475716       –
37    2.854579      4.119178       7.220491      7.060415      6.080605       6.960721
41    23.429382     28.360481      24.203191     21.312362     28.572274      42.151505
43    74.297054     87.315359      81.454178     74.787643     70.918817      22.156432
47    831.633594    4181.882724    523.623818    824.268878    1584.118183    222.977813
obtained in the sequential version. Some speed-up is obtained when using eight and sixteen processors, but not the optimum one. Apart from that, the statistical tests carried out to confirm the differences among the executions with different numbers of processors state that there is a certain time reduction in 80% of the cases. These results do not diminish the relevance of the proposed algorithm, since the serial version provides competitive results even for lengths of cryptographic interest.
5 Conclusions
An easy and efficient method of computing the degree of balancedness in the output sequence of key-stream generators has been presented. By handling bit-strings, it is possible to derive the exact number of 1s in the output sequence of an LFSR-based generator. An extensive computational experience and analysis has been developed for the implementations carried out. This analysis has allowed us to confirm the dependency between the size of the $H_i$ lists and the insertion method chosen. In summary, for generators in a range of cryptographic application this algorithm is a deterministic tool for analyzing balancedness that provides accurate results while consuming an affordable amount of resources.
References
1. GSM: Global Systems for Mobile Communications
2. Bluetooth, Specifications of the Bluetooth system, Version 1.1 (2001)
3. eSTREAM, the ECRYPT Stream Cipher Project, Call for Primitives (2004)
4. Golomb, S.: Shift Register-Sequences. Aegean Park Press (1982)
5. Filiol, É.: A new statistical testing for symmetric ciphers and hash functions. In: Deng, R.H., Qing, S., Bao, F., Zhou, J. (eds.) ICICS 2002. LNCS, vol. 2513, pp. 342–353. Springer, Heidelberg (2002)
6. Fúster-Sabater, A., García-Mochales, P.: On the balancedness of nonlinear generators of binary sequences. Information Processing Letters 85, 111–116 (2003)
A Cost-Benefit-Based Adaptation Scheme for Multimeme Algorithms
Wilfried Jakob
Forschungszentrum Karlsruhe GmbH, Institute for Applied Computer Science, P.O. Box 3640, 76021 Karlsruhe, Germany
[email protected]
Abstract. Memetic Algorithms are the most frequently used hybrid of Evolutionary Algorithms (EA) for real-world applications. This paper will deal with one of the most important obstacles to their wide usage: compared to pure EA, the number of strategy parameters which have to be adjusted properly is increased. A cost-benefit-based adaptation scheme suited for every EA will be introduced, which leaves only one strategy parameter to the user, the population size. Furthermore, it will be shown that the range of feasible sizes can be reduced drastically.
1 Motivation
Almost all practical applications of Evolutionary Algorithms use some sort of hybridisation with other algorithms like heuristics or local searchers, frequently in the form of a Memetic Algorithm (MA)¹. MAs integrate local search in the offspring production part of an EA and thus introduce additional strategy parameters controlling, among others, the frequency and intensity of the local search [2,3]. The benefit of this approach is a speed-up of the resulting hybrid, usually by a considerable factor. The drawback is the greater number of strategy parameters which have to be adjusted properly [2,3,4]. The necessary tuning of strategy parameters is one of the most important obstacles to the broad application of EAs to real-world problems. This can be summarised by the following statement: despite the wide scope of application of Evolutionary Algorithms, they are not widely applied. To overcome this situation, a reduction of the MA strategy parameters to be adjusted manually is urgently required.
2 Introduction
To enhance the applicability of MAs, they should either adapt their strategy parameters or the parameters should be fixed to the greatest extent possible. Another point is the usage of application-independent local searchers (LS), as they maintain the general usability of the MA. In this paper a cost-benefit-based
¹ See [1] and reports about real-world applications at the PPSN, ICGA, or GECCO conference series.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 509–519, 2008. © Springer-Verlag Berlin Heidelberg 2008
adaptation scheme suited for every EA is introduced. It is applied to an MA for parameter optimisation, which uses two application-independent local searchers to maintain generality. The idea of cost-benefit-based adaptation was first published in 2004 [4,5] and in a more elaborate form in 2006 [6]. This paper presents an improved version of the adaptation scheme together with a more detailed analysis of the experimental results than was possible in [6]. To make room for this, related work is only summarised here and the interested reader is referred to the detailed discussion in [6]. Other researchers dealt with dynamic adaptation which uses some sort of co-evolution to adjust the strategy parameters [7,8], or they worked in other application fields, especially combinatorial optimisation [8,9]. As it is not known a priori which local searcher or meme suits best or is feasible at all, some researchers also tackled the problem of adaptive meme construction [7] or the usage of several memes [7,8,10]. In [6] it is shown in detail how the work presented here fits into the gap that previous work leaves and why co-evolution is not considered the method of first choice. In Sect. 3 the cost-benefit-based adaptation scheme and its extension with respect to the one used in [6] are introduced. Section 4 contains a short introduction to the basic algorithms used for the experiments. In Sect. 5 the test cases are described briefly, the old and new strategy parameters are compared, and the experimental results are discussed in detail. From this, a common parameterisation is derived. The paper concludes with a summary and an outlook.
3 Concept of Cost-Benefit-Based Adaptation
The basic idea is to use the costs, measured in the evaluations caused by an LS run, and the benefit, measured in the fitness gain obtained from it, to control the selection of an LS out of a given set, the intensity of its search, and the frequency of its usage. Suited LS must therefore have an externally controllable termination parameter like an iteration limit or, even better, a convergence-based termination threshold. Firstly, the adaptive mechanism is described for the parameter adjustment. For the fitness gain a relative measure is used, because a certain amount of fitness improvement is much easier to achieve at the beginning of a search than at its end. The relative fitness gain rfg is based on a normalised fitness function in the range from 0 to $f_{max}$, which turns every task into a maximisation problem. rfg is the ratio between the achieved fitness improvement $(f_{LS} - f_{evo})$ and the maximum possible one $(f_{max} - f_{evo})$, as shown in (1), where $f_{LS}$ is the fitness obtained by the LS and $f_{evo}$ the fitness of the offspring as produced by the evolution.

\[ rfg = \frac{f_{LS} - f_{evo}}{f_{max} - f_{evo}} \tag{1} \]
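The relative measure of (1) can be illustrated with a minimal Python sketch. The function name, the guard for an offspring that is already optimal, and the numbers in the usage note (a normalised fitness with $f_{max}$ = 100,000) are assumptions made for illustration only; they are not taken from the experiments.

```python
def relative_fitness_gain(f_ls: float, f_evo: float, f_max: float) -> float:
    """Relative fitness gain rfg of Eq. (1) for a normalised, maximised fitness.

    f_ls  : fitness after the local search run
    f_evo : fitness of the offspring as produced by the evolution
    f_max : upper bound of the normalised fitness scale
    """
    if f_max <= f_evo:
        # Assumption: an offspring that is already optimal cannot gain anything.
        return 0.0
    return (f_ls - f_evo) / (f_max - f_evo)


# The same absolute gain of 1,000 counts much more near the end of a search.
early = relative_fitness_gain(11_000, 10_000, 100_000)   # 1,000 of 90,000 possible
late = relative_fitness_gain(99_000, 98_000, 100_000)    # 1,000 of 2,000 possible
```

In this hypothetical case the late gain is rated at one half of what was still achievable, whereas the early one is rated at roughly one percent, which is the effect the relative measure is meant to capture.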
For each parameter $P_n$ a set of levels is defined, each of which has a probability p and a value v containing for each level an appropriate value of that particular parameter. Three consecutive levels are always active, i.e. have a probability p greater than zero. For each active level the required evaluations eval and the obtained rfg are calculated per LS usage and summed up. The probabilities of the levels are adjusted if either each level was used at least $usage_{L,min}$ times or they all have been used $usage_{L,max}$ times in total since the last adjustment. The new relation among the active levels L1, L2, and L3 is calculated as shown in (2).

\[ \frac{\sum_i rfg_{P_n,L1,i}}{\sum_i eval_{P_n,L1,i}} : \frac{\sum_j rfg_{P_n,L2,j}}{\sum_j eval_{P_n,L2,j}} : \frac{\sum_k rfg_{P_n,L3,k}}{\sum_k eval_{P_n,L3,k}} \tag{2} \]

The sums are reset to zero after the adjustment, such that the adaptation is faster. If the probability of the lowest or highest active level exceeds a threshold value of 0.5, the next lower or higher level, respectively, is added. The level at the opposite end is dropped and its probability is added to its neighbour. The new level is given a probability of 20% of the sum of the probabilities of the other two levels. This causes a move of the three consecutive levels along the scale of possible ones according to their performance determined by the achieved fitness gain and the required evaluations. To ensure mobility in both directions, none of the three active levels may have a probability below 0.1. An example of an upward movement of the three active levels is shown in Fig. 1. It is done because p of level 5 has become too large (first row). Level 3 is deactivated (p=0) and level 4 inherits its probability, now totalling p=0.4 (second row). Finally, the new level 6 receives 20% of the probabilities of the two others as shown in the third row.

Fig. 1. The three phases of a level movement. Active levels are marked by a gray background.

The same procedure can be used to adjust the probabilities of the involved LSs. Again, the required evaluations eval as well as rfg are calculated and summed up for each LS usage. The new relation of the LS probabilities is computed in the same way as for the active levels, if either each LS was used at least $usage_{LS,min}$ times or there have been $matings_{max}$ matings in total since the last adjustment. If the probability of one LS drops below $P_{min}$ for three consecutive alterations, it is ignored from then on.
To avoid premature deactivation, the probability is set to $P_{min}$ the first time it falls below $P_{min}$. For the experiments $P_{min}$ was set to 0.1. This simple LS adaptation scheme was used for the investigations reported in [6]. As erroneous deactivations of an LS were observed, the adaptation speed had to be reduced by comparably high values for the re-adjustment thresholds. As a consequence, the extended LS adaptation procedure takes the old distribution into account by summing up one third of the old probabilities and two thirds of the newly calculated ones, thus resulting in a new probability for each LS.

For EAs that create more than one offspring per mating, such as the one used here, a choice must be made between locally optimising the best offspring only (called best-improvement) or a fraction of up to all of these offspring (called all-improvement). This is controlled adaptively as follows: the best offspring always undergoes LS improvement, and for its siblings the chance of being treated by the LS is adaptively adjusted as described before, with the following peculiarities. After having processed all selected offspring, $f_{LS}$ is estimated as the fitness of the best locally improved child and $f_{evo}$ as that of the best offspring from pure evolution.
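The level adaptation of this section can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the HyGLEAM implementation: all function names are hypothetical, levels are numbered from 1, and two details the text leaves open are filled with assumed rules, namely how the lower bound of 0.1 is enforced and how the remaining probabilities are rescaled after a new level has received its 20% share.

```python
from typing import List, Sequence, Tuple

P_MIN = 0.1       # minimum probability of an active level (from the text)
THRESHOLD = 0.5   # probability that triggers a level movement (from the text)
NEW_SHARE = 0.2   # share given to a newly activated level (from the text)


def apply_floor(probs: Sequence[float], p_min: float = P_MIN) -> List[float]:
    """Raise entries below p_min to p_min and rescale the others (assumed rule)."""
    probs = list(probs)
    fixed = set()
    while True:
        low = [i for i, p in enumerate(probs) if i not in fixed and p < p_min]
        if not low:
            return probs
        fixed.update(low)
        free = [i for i in range(len(probs)) if i not in fixed]
        free_mass = sum(probs[i] for i in free)
        target = 1.0 - p_min * len(fixed)
        for i in fixed:
            probs[i] = p_min
        for i in free:
            probs[i] *= target / free_mass


def adjust_levels(rfg_sums: Sequence[float], eval_sums: Sequence[float]) -> List[float]:
    """Relation (2): probabilities proportional to sum(rfg) / sum(eval)."""
    ratios = [r / e if e > 0 else 0.0 for r, e in zip(rfg_sums, eval_sums)]
    total = sum(ratios)
    if total <= 0.0:
        # Assumed fallback when no level achieved any gain.
        return [1.0 / len(ratios)] * len(ratios)
    return apply_floor([x / total for x in ratios])


def move_levels(active: Sequence[int], probs: Sequence[float],
                n_levels: int) -> Tuple[List[int], List[float]]:
    """Shift the three active levels up or down if an outer one exceeds 0.5."""
    lo, mid, hi = probs
    if hi > THRESHOLD and active[2] < n_levels:      # move upwards
        kept, new_first = [lo + mid, hi], False
        levels = [active[1], active[2], active[2] + 1]
    elif lo > THRESHOLD and active[0] > 1:           # move downwards
        kept, new_first = [lo, mid + hi], True
        levels = [active[0] - 1, active[0], active[1]]
    else:
        return list(active), list(probs)
    new_p = NEW_SHARE * sum(kept)
    scale = (1.0 - new_p) / sum(kept)   # assumed: rescale so that the total stays 1
    kept = [p * scale for p in kept]
    return levels, ([new_p] + kept if new_first else kept + [new_p])


def blend_ls_probabilities(old: Sequence[float], new: Sequence[float]) -> List[float]:
    """Extended LS adaptation: one third old plus two thirds newly calculated."""
    return [o / 3.0 + 2.0 * n / 3.0 for o, n in zip(old, new)]
```

With hypothetical starting probabilities of 0.2, 0.2, and 0.6 for levels 3, 4, and 5, `move_levels([3, 4, 5], [0.2, 0.2, 0.6], 9)` first merges level 3 into level 4 (0.4, as in the second row of Fig. 1) and then assigns 0.2 to the new level 6; under the assumed rescaling, levels 4 and 5 end up with 0.32 and 0.48.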
4 Basic Algorithms
As EA, GLEAM (General Learning Evolutionary Algorithm and Method) [11] is used. It is an EA of its own, combining aspects from Evolution Strategy and Genetic Algorithms with the concept of abstract data types for the easy formulation of genes and chromosomes. GLEAM uses ranking-based selection, elitist offspring acceptance, and a structured population based on a neighbourhood model [12] that causes an adaptive balance between exploration and exploitation and avoids premature convergence. Hence, GLEAM can be regarded as a more powerful EA than simple ones, which makes it harder to reach an improvement by adding local search. On the other hand, if an improvement can be achieved by adding and applying memes adaptively, then at least the same advantage, if not a greater one, can be expected when a simpler EA is used.

As local searchers, two well-known procedures from the sixties, the Rosenbrock and the Complex algorithms, are used, since they are known as powerful local search procedures. As they are derivative-free and able to handle restrictions, they maintain general usability. The implementation is based on Schwefel [13]; the algorithms will be abbreviated by R and C.

GLEAM and the two LS form HyGLEAM (Hybrid General-purpose Evolutionary Algorithm and Method). Apart from the basic procedures, HyGLEAM contains a simple MA (SMA), consisting of one meme (local searcher), and an adaptive Multimeme Algorithm (AMMA) using both LS. Earlier research showed that Lamarckian evolution, where the chromosomes are updated according to the LS improvement, performs better than evolution without updates [3,4,5]. The danger of premature convergence, which was observed by other researchers, is avoided by the neighbourhood model used. Hence, Lamarckian evolution was used for the experiments with the AMMA.
5 Experimental Results

5.1 Test Cases
Five test functions taken from the GENEsYs collection [14] and two real-world problems [15,16] were used, see Table 1. Due to the lack of space, they shall be described very briefly only, and the interested reader is referred to the given literature. Rotated versions of Shekel’s Foxholes and the Rastrigin function were employed in order to make them harder, see [4,5]. The scheduling task is solved largely by assigning start times to production batches so that the combinatorial aspect is limited to solving conflicts arising from assignments of the same time.
Table 1. Important properties of the test cases used. fi: function numbers of [14].

| Test Case | Parameter | Modality | Impl. Restr. | Range | Target Value |
|---|---|---|---|---|---|
| Schwefel's Sphere [14, f1] | 30 real | unimodal | no | [-5*10^6, 5*10^6] | 0.01 |
| Shekel's Foxholes [14, f5] | 2 real | multimodal | no | [-500, 500] | 0.998004 |
| Gen. Rastrigin f. [14, f7] | 5 real | multimodal | no | [-5.12, 5.12] | 0.0001 |
| Fletcher & Powell f. [14, f16] | 5 real | multimodal | no | [-3.14, 3.14] | 0.00001 |
| Fractal f. [14, f13] | 20 real | multimodal | no | [-5, 5] | -0.05 |
| Design optimisation [15] | 3 real | multimodal | no | - | - |
| Scheduling+resource opt. [16] | 87 int. | multimodal | yes | - | - |

5.2 Strategy Parameters
The AMMA controls meme selection and five strategy parameters which affect the intensity of local search ($th_R$, $limit_R$, and $limit_C$) and the frequency of meme application ($all\text{-}impr_R$ and $all\text{-}impr_C$) as shown in Table 2. $th_R$ is a convergence-based termination threshold value of the Rosenbrock procedure, while $limit_R$ and $limit_C$ simply limit the LS iterations. The two all-impr parameters control the probabilities of applying the corresponding meme to the siblings of the best offspring in case of adaptive all-improvement, see also Sect. 3.

Table 2. Adaptively controlled strategy parameters of the Multimeme Algorithm

| Strategy Parameter | Values Used for the Experiments |
|---|---|
| $th_R$ | $10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}, 10^{-7}, 10^{-8}, 10^{-9}$ |
| $limit_R$, $limit_C$ | 100, 200, 350, 500, 750, 1000, 1250, 1500, 1750, 2000 |
| $all\text{-}impr_R$, $all\text{-}impr_C$ | 0, 0.2, 0.4, 0.6, 0.8, 1.0 |
The usage and matings parameters control the adaptation speed as shown in Table 3. The experiments reported in [6] motivated separate adaptation for different fitness ranges. An analysis of these experiments showed that the number of these ranges has only a minor influence. Consequently, the effects of unseparated adaptation and of separate adaptation using three fitness ranges (0-40%, 40-70%, and 70-100% of $f_{max}$), called common and separated adaptation, respectively, are compared in the experiments reported here. Together with the already introduced strategy parameters, this results in four new strategy parameters as shown in Table 4, and a crucial question for the experiments is whether they can be set to common values without any relevant loss of performance.

Table 3. Strategy parameter settings for the adaptation speed

| Adaptation Speed | Meme Selection: $usage_{LS,min}$ | Meme Selection: $matings_{max}$ | Parameter Adaptation: $usage_{L,min}$ | Parameter Adaptation: $usage_{L,max}$ |
|---|---|---|---|---|
| fast | 3 | 15 | 3 | 12 |
| medium | 5 | 20 | 4 | 15 |
| slow | 8 | 30 | 7 | 25 |
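The settings of Table 3 and the three fitness ranges of separated adaptation can be written down as a small configuration sketch in Python. The dictionary keys and function names are hypothetical, and the assignment of the boundary values (exactly 40% or 70% of $f_{max}$) to the upper range is an assumption, since the text does not specify it.

```python
from bisect import bisect_right
from typing import Sequence

# Values taken from Table 3; the key names are illustrative only.
ADAPTATION_SPEED = {
    "fast":   {"usage_ls_min": 3, "matings_max": 15, "usage_l_min": 3, "usage_l_max": 12},
    "medium": {"usage_ls_min": 5, "matings_max": 20, "usage_l_min": 4, "usage_l_max": 15},
    "slow":   {"usage_ls_min": 8, "matings_max": 30, "usage_l_min": 7, "usage_l_max": 25},
}

# Separated adaptation: 0-40%, 40-70%, and 70-100% of f_max.
FITNESS_RANGE_BOUNDS = (0.4, 0.7)


def fitness_range_index(fitness: float, f_max: float) -> int:
    """Index (0, 1, or 2) of the fitness range used for separated adaptation."""
    return bisect_right(FITNESS_RANGE_BOUNDS, fitness / f_max)


def level_adjustment_due(usages: Sequence[int], speed: str) -> bool:
    """True if the level probabilities of one parameter should be re-adjusted.

    usages holds the usage counts of the three active levels since the last
    adjustment: either every level reached usage_l_min or the total reached
    usage_l_max.
    """
    cfg = ADAPTATION_SPEED[speed]
    return min(usages) >= cfg["usage_l_min"] or sum(usages) >= cfg["usage_l_max"]
```

Under separated adaptation, one set of level probabilities per parameter would be kept for each value returned by `fitness_range_index`, whereas common adaptation would use a single set.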
Table 4. Strategy parameters of the SMA and the AMMA. The new ones are in italic.

| Strategy Parameter | Relevance and Treatment: SMA | Relevance and Treatment: AMMA |
|---|---|---|
| population size μ | manual | manual |
| Lamarckian or Baldwinian evolution | manual | Lamarckian evolution |
| best- or static all-improvement | manual | - |
| best- or adaptive all-improvement | - | manual |
| LS selection | manual | adaptive |
| LS iteration limit | 5000 | adaptive |
| Rosenbrock termination threshold $th_R$ | manual | adaptive |
| probability of the adaptive all-improvement | - | adaptive |
| adaptation speed | - | manual |
| simple or extended LS adaptation | - | manual |
| separate or common adaptation | - | manual |

5.3 Experimental Results
An algorithm together with a setting of its strategy parameters is called a job, and the comparisons are based on the average number of evaluations from one hundred runs per job. Where necessary, a t-test (95% confidence) is used to distinguish significant differences from stochastic ones. Only jobs whose runs are all successful are taken into account, i.e. jobs whose runs all reach the target values of Table 1 or a given solution quality in case of the real-world problems. Table 5 compares the results for the basic algorithms (see footnote 2) and Table 6 shows the results of the two SMAs. An important result is that the wide range of best population sizes of the EA, ranging from 20 to 11,200, is narrowed down by the best SMAs to 5 to 70. Further results are that the choice of the best SMA as well as good values for $th_R$ are application-dependent and that static all-improvement is better in two cases. Moreover, all-improvement is crucial to success on the Rastrigin function. In all cases an improvement can be achieved in the magnitude of factors. In this respect the sphere function and the design optimisation are exceptional, as they are solved during the first generation by the SMA-R or SMA-C, respectively. As most real-world applications are of multimodal nature, the efforts are geared to this case, and the sphere function is chosen for checking the obtained results with a challenging unimodal test function. Thus, these results are not bad. But for the design optimisation, this outcome means that this task is of limited value for evaluating the adaptation scheme.

Footnote 2: The results presented here differ from those reported in [3,4,5,6] in two cases. To better show the effects of adaptation, the sphere function is now parameterised in such a way that GLEAM can just solve it and the Rosenbrock procedure misses the 100% success rate with common values for $th_R$. Secondly, a different GLEAM job is used for Shekel's Foxholes as a reference, because it has a much better confidence interval and is nearly as good as the old job.
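Table 5. Best jobs of the basic algorithms together with the population size μ and $th_R$. Abbreviations: CI: confidence interval for 95% confidence, Succ.: success rate.

| Test Case | GLEAM: μ | GLEAM: Evaluations | GLEAM: CI | Rosenbrock Proc.: $th_R$ | Rosenbrock Proc.: Succ. (%) | Rosenbrock Proc.: Eval. | Complex-Alg.: Succ. (%) | Complex-Alg.: Eval. |
|---|---|---|---|---|---|---|---|---|
| Sphere | 120 | 37,964,753 | 974,707 | $10^{-8}$ | 100 | 4,706 | 0 | 5,000 |
| Fletcher | 600 | 483,566 | 191,915 | $10^{-8}$ | 22 | 5,000 | 26 | 658 |
| Scheduling | 1,800 | 5,376,334 | 330,108 | 0.6 | 0 | 3,091 | 0 | 473 |
| Foxholes | 300 | 108,435 | 8,056 | $10^{-4}$ | 5 | 133 | 0 | 95 |
| Rastrigin | 11,200 | 3,518,702 | 333,311 | - | 0 | - | 0 | 5,000 |
| Fractal | 20 | 195,129 | 20,491 | - | 0 | - | 0 | 781 |
| Design | 210 | 5,773 | 610 | $10^{-6}$ | 15 | 891 | 12 | 102 |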
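Table 6. Best jobs of both SMAs. Confidence intervals are given for jobs better than GLEAM only. Abbreviations: b/a: best- or static all-improvement, L: rate of Lamarckian evolution.

| Test Case | SMA-R: μ | SMA-R: $th_R$ | SMA-R: b/a | SMA-R: L | SMA-R: Eval. | SMA-R: CI | SMA-C: μ | SMA-C: b/a | SMA-C: L | SMA-C: Eval. | SMA-C: CI |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Sphere | 20 | $10^{-6}$ | b | 100 | 5,453 | 395 | - | - | - | - | - |
| Fletcher | 10 | $10^{-4}$ | b | 100 | 13,535 | 1,586 | 5 | b | 100 | 4,684 | 812 |
| Scheduling | 5 | 0.6 | b | 100 | 69,448 | 10,343 | - | - | - | - | - |
| Foxholes | 30 | $10^{-2}$ | a | 100 | 10,710 | 1,751 | 20 | b | 0 | 8,831 | 2,073 |
| Rastrigin | 70 | $10^{-2}$ | a | 100 | 315,715 | 46,365 | 150 | a | 100 | 3,882,513 | - |
| Fractal | 5 | $10^{-2}$ | b | 100 | 30,626 | 3,353 | 10 | b | 100 | 1,065,986 | - |
| Design | 10 | $10^{-4}$ | b | 5 | 4,222 | 724 | 5 | b | 100 | 1,041 | 141 |

SMA-R: Rosenbrock-MA, SMA-C: Complex-MA.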
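The statement that the SMAs improve on GLEAM "in the magnitude of factors" can be checked with a short Python calculation on the evaluation counts of Tables 5 and 6. The sketch assumes that the improvement factor is the ratio of the mean evaluations of GLEAM to those of the better of the two SMAs per test case; the variable names are illustrative.

```python
# Mean evaluations of the best GLEAM jobs (Table 5).
gleam = {
    "Sphere": 37_964_753, "Fletcher": 483_566, "Scheduling": 5_376_334,
    "Foxholes": 108_435, "Rastrigin": 3_518_702, "Fractal": 195_129,
    "Design": 5_773,
}

# Mean evaluations of the better of SMA-R and SMA-C per test case (Table 6).
best_sma = {
    "Sphere": 5_453, "Fletcher": 4_684, "Scheduling": 69_448,
    "Foxholes": 8_831, "Rastrigin": 315_715, "Fractal": 30_626,
    "Design": 1_041,
}

# Improvement factor: how many times fewer evaluations the best SMA needs.
improvement = {case: gleam[case] / best_sma[case] for case in gleam}

for case, factor in improvement.items():
    print(f"{case:<11} {factor:10.1f}")
```

Under this reading every test case yields a factor above one, ranging from single digits for the design optimisation to several thousand for the sphere function.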
Next, the best AMMA and SMA jobs are compared. Two effects can be expected from adaptation: on the one hand, improvement because of better tuned strategy parameters and, on the other hand, impairment, because adaptation means learning and learning costs. And, in fact, both effects can be observed, as shown in Fig. 2. The Rosenbrock procedure turns out to be the absolutely dominating meme in the end phase of the runs for all test cases except for the Foxholes function (0.78%) and the design optimisation (0.49%). Table 6 shows that the improved test cases all use a $th_R$ value of 0.01, while the worsened ones require lower values. As the adaptation starts with values of 0.1 to 0.001, a longer phase of adaptation is required to decrease $th_R$ and to increase $limit_R$, which is required to benefit from lower $th_R$ thresholds. This explanation of the observed impairments is checked for the most drastic case, the sphere function, by starting with medium levels for both parameters: the impairment factor is reduced to 3.3.

Fig. 2. Comparison between the best jobs of the SMA and two variants of the AMMA based on the average efforts. For a better presentation impairments are shown as negative factors.

The effect of the extended LS adaptation (cf. Sect. 3) is also shown in Fig. 2. Unfortunately, the only two cases with significant differences show a different behaviour. In all other cases, a minimal, but insignificant improvement is observed. A more detailed analysis of the results shows that the danger of erroneous deactivation of a local searcher is considerably lowered when the extended LS adaptation is used. Thus, the adaptation speed can be increased, resulting in more adaptations and a faster and better parameter adaptation. To interpret Fig. 2 correctly, it must be kept in mind that the performance of the SMAs is the result of time-consuming tuning by hand.

Up to now the comparisons have been based on both all- and best-improvement, although tasks as multimodal as the Rastrigin function require all-improvement. Hence, a generally applicable AMMA must use it. Fig. 3 shows the differences of best-, static, and adaptive all-improvement. The latter outperforms the two others with two exceptions: Fletcher's function, where the differences are not significant, and the unimodal Sphere function, where all-improvement is not suited at all. Fig. 3 shows that adaptive all-improvement is even superior to best-improvement in three cases, namely Shekel's Foxholes, the fractal function, and in particular the Rastrigin function. But it can also be oversized, as in the case of Fletcher's function or the scheduling task. In conclusion, the usage of adaptive all-improvement is not only motivated by its need for the Rastrigin function.

Fig. 3. Comparison of best- and static all-improvement (SMA) and adaptive all-improvement (AMMA). For a better overview the results of the fractal and the Sphere function are scaled by 0.1, those of the Rastrigin function and the scheduling task by 0.01.
Besides the performance aspects, the robustness of the results must also be considered. Jobs where only one or a few population sizes yield successful runs will be omitted as robustness-lacking jobs in the comparison of the remaining strategy parameters, which shows the following: Using common adaptation (cf. Sect. 5.2), there is at least one test case with jobs lacking robustness for all adaptation speeds. Applying separated adaptation according to three fitness ranges instead and extended LS adaptation (cf. Sect. 3), best results are reached with fast adaptation at a good robustness. Together with adaptive all-improvement, this is the recommended common parameterisation. Leaving the exceptional cases of the Sphere function and the design task aside, 88.3% of the performance of the best hand-tuned AMMA can be reached on average, and 75.7% when considering all test cases. Table 7 and Fig. 4 summarise these results.

Table 7. Results for best and recommended AMMA. Abbr.: see Table 5

| Test Case | Best AMMA: μ | Best AMMA: Eval. | Best AMMA: CI | Recommended AMMA: μ | Recommended AMMA: Eval. | Recommended AMMA: CI |
|---|---|---|---|---|---|---|
| Sphere | 5 | 54,739 | 3,688 | 30 | 194,653 | 17,094 |
| Fletcher | 10 | 11,451 | 1,644 | 5 | 12,599 | 2,486 |
| Scheduling | 20 | 251,917 | 22,453 | 20 | 306,843 | 51,595 |
| Foxholes | 30 | 5,467 | 1,086 | 50 | 5,673 | 916 |
| Rastrigin | 120 | 257,911 | 31,818 | 70 | 300,551 | 73,152 |
| Fractal | 10 | 19,904 | 2,801 | 5 | 23,075 | 2,562 |
| Design | 5 | 1,383 | 210 | 5 | 1,581 | 384 |

Fig. 4. Comparison of best SMA and AMMA jobs. Empty fields indicate an insufficient success rate (below 100%), while flat ones denote the fulfillment of the task but with greater effort than the basic EA. The given improvements are the ratio between the means of the evaluations of the basic EA and the compared algorithms.
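One way to read the 88.3% figure is sketched below in Python. It assumes that "performance" means the ratio of the mean evaluations of the best AMMA to those of the recommended AMMA per test case (Table 7), and that these ratios are averaged arithmetically. Both points are assumptions, as the text does not spell out the aggregation; the aggregation behind the 75.7% figure for all test cases is likewise not specified and is not reproduced here.

```python
# Mean evaluations from Table 7.
best_amma = {
    "Sphere": 54_739, "Fletcher": 11_451, "Scheduling": 251_917,
    "Foxholes": 5_467, "Rastrigin": 257_911, "Fractal": 19_904,
    "Design": 1_383,
}
recommended_amma = {
    "Sphere": 194_653, "Fletcher": 12_599, "Scheduling": 306_843,
    "Foxholes": 5_673, "Rastrigin": 300_551, "Fractal": 23_075,
    "Design": 1_581,
}

# Share of the best AMMA's performance reached by the recommended settings.
ratio = {case: best_amma[case] / recommended_amma[case] for case in best_amma}

# Leave the exceptional cases (Sphere function and design task) aside.
regular_cases = [case for case in ratio if case not in ("Sphere", "Design")]
mean_regular = sum(ratio[case] for case in regular_cases) / len(regular_cases)

print(f"mean share without Sphere and Design: {100 * mean_regular:.1f}%")
```

Under these assumptions the five remaining test cases average to about 88.3%, which matches the value stated in the text.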
6 Conclusions and Outlook
A common cost-benefit-based adaptation procedure for Memetic Algorithms was introduced, which controls both meme selection and the balance between global and local search. The latter is achieved by adapting the intensity of local search and the frequency of its usage. A good common parameterisation was given for the new strategy parameters of the adaptation scheme, leaving only one parameter to be adjusted manually: the population size μ. In the experiments the useful range of μ could be narrowed down from between 20 and 11,200 to between 5 and 70. A value of 20 can be recommended to start with, provided that the task is assumed to be not very complex and to have not too many suboptima. Further investigations will take more test cases into account and aim at the extension of the approach to combinatorial tasks, namely, job scheduling and resource allocation in the context of grid computing.
References

1. Davis, L. (ed.): Handbook of Genetic Algorithms. Van Nostrand Reinhold, NY (1991)
2. Hart, W.E., Krasnogor, N., Smith, J.E. (eds.): Recent Advances in Memetic Algorithms. Studies in Fuzziness and Soft Computing, vol. 166. Springer, Berlin (2005)
3. Jakob, W.: HyGLEAM – An Approach to Generally Applicable Hybridization of Evolutionary Algorithms. In: Guervós, J.J.M., Adamidis, P.A., Beyer, H.-G., Fernández-Villacañas, J.-L., Schwefel, H.-P. (eds.) PPSN 2002. LNCS, vol. 2439, pp. 527–536. Springer, Heidelberg (2002)
4. Jakob, W.: A New Method for the Increased Performance of Evolutionary Algorithms by the Integration of Local Search Procedures. In German, PhD thesis, Dept. Mech. Eng., University of Karlsruhe, FRG, FZKA 6965 (March 2004), http://www.iai.fzk.de/~jakob/HyGLEAM/main-gb.html
5. Jakob, W., Blume, C., Bretthauer, G.: Towards a Generally Applicable Self-Adapting Hybridization of Evolutionary Algorithms. In: Deb, K., et al. (eds.) GECCO 2004. LNCS, vol. 3102, pp. 790–791. Springer, Heidelberg (2004)
6. Jakob, W.: Towards an Adaptive Multimeme Algorithm for Parameter Optimisation Suiting the Engineers' Needs. In: Runarsson, T.P., Beyer, H.-G., Burke, E.K., Merelo-Guervós, J.J., Whitley, L.D., Yao, X. (eds.) PPSN 2006. LNCS, vol. 4193, pp. 132–141. Springer, Heidelberg (2006)
7. Krasnogor, N.: Studies on the Theory and Design Space of Memetic Algorithms. PhD thesis, Faculty Comput., Math. and Eng., Univ. West of England (2002)
8. Smith, J.E.: Co-evolving Memetic Algorithms: A Learning Approach to Robust Scalable Optimisation. In: Conf. Proc. CEC 2003, pp. 498–505. IEEE Press, Los Alamitos (2003)
9. Bambha, N.K., Bhattacharyya, S.S., Zitzler, E., Teich, J.: Systematic Integration of Parameterized Local Search Into Evolutionary Algorithms. IEEE Trans. on Evolutionary Computation 8(2), 137–155 (2004)
10. Ong, Y.S., Keane, A.J.: Meta-Lamarckian Learning in Memetic Algorithms. IEEE Trans. on Evolutionary Computation 8(2), 99–110 (2004)
11. Blume, C., Jakob, W.: GLEAM – An Evolutionary Algorithm for Planning and Control Based on Evolution Strategy. In: Cantú-Paz, E. (ed.) GECCO – 2002, vol. Late Breaking Papers, pp. 31–38 (2002)
12. Gorges-Schleuter, M.: Genetic Algorithms and Population Structures - A Massively Parallel Algorithm. PhD thesis, Dept. Comp. Science, Univ. of Dortmund (1990)
13. Schwefel, H.-P.: Evolution and Optimum Seeking. John Wiley & Sons, NY (1995)
14. Bäck, T.: GENEsYs 1.0 (1992), ftp://lumpi.informatik.uni-dortmund.de/pub/GA/
15. Sieber, I., Eggert, H., Guth, H., Jakob, W.: Design Simulation and Optimization of Microoptical Components. In: Bell, K.D., et al. (eds.) Proceedings of Novel Optical Systems and Large-Aperture Imaging, SPIE, vol. 3430, pp. 138–149 (1998)
16. Blume, C., Gerbe, M.: Deutliche Senkung der Produktionskosten durch Optimierung des Ressourceneinsatzes. atp 36, 5/94, pp. 25–29. Oldenbourg Verlag (1994)
Optimizing the Shape of an Impeller Using the Differential Ant-Stigmergy Algorithm

Peter Korošec¹,², Jurij Šilc¹, Klemen Oblak³, and Franc Kosel⁴

¹ Jožef Stefan Institute, Computer Systems Department, Jamova cesta 19, SI-1000 Ljubljana, Slovenia, {peter.korosec, jurij.silc}@ijs.si, http://csd.ijs.si
² University of Primorska, Faculty of Mathematics, Science and Information Technologies, Glagoljaška 8, SI-6000 Koper, Slovenia, [email protected]
³ Domel Ltd., Electromotors and household devices, Otoki 21, SI-4228 Železniki, Slovenia, [email protected]
⁴ University of Ljubljana, Faculty of Mechanical Engineering, Aškerčeva cesta 6, SI-1000 Ljubljana, Slovenia, [email protected]
Abstract. A metaheuristic optimization algorithm for solving multiparameter optimization problems is presented. The algorithm is applied to a real-world problem, where the aerodynamic power efficiency of the radial impeller of a vacuum cleaner is optimized. Here, the radial impeller is represented using parametric modeling. Due to the large number of parameters and, consequently, the enormous search space, an efficient metaheuristic approach is required. Therefore, the so-called Differential Ant-Stigmergy Algorithm, which is an extension of Ant-Colony Optimization for a continuous domain, is applied. As a result, the aerodynamic power of the radial impeller is increased by twenty percent.

Keywords: ant-colony optimization, black-box optimization, metaheuristics, numerical simulation, stigmergy.
1 Introduction
Optimization problems often arise in practice, and they can be formulated as follows (assume minimization): given a real-valued cost function $f : \mathbb{R}^D \to \mathbb{R}$, find a global minimum,

\[ x^* = \arg\min_{x \in S} f(x), \]

where S is the parameter space, usually a compact set in $\mathbb{R}^D$. The cost function in practice is often nonlinear and may have multiple local optima. Furthermore, a priori knowledge about the objective function is usually very limited and such problems are termed "black-box" optimizations.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 520–529, 2008. © Springer-Verlag Berlin Heidelberg 2008
Many engineering optimization problems are black-box optimizations (see Fig. 1), in which the cost function’s form as well as its derivatives are unknown. Normally, this occurs when the cost function is computed using a complex simulation about which the optimization algorithm has no information (e.g., to evaluate a candidate impeller shape we could simulate its operations using a computational fluid-dynamics software package). Executing a black-box simulation in order to evaluate a candidate solution is usually very expensive and can take up to several minutes. This is particularly problematic because optimization algorithms for black-box problems are necessarily blind search algorithms that must repeatedly sample points in a solution space, evaluate them by running the simulation, and apply various heuristics in order to choose the next points to sample.
[Figure: in the simulation part, the input parameters x are passed to the simulation program, which acts as a "black box" and returns the simulation output $\hat{f}(x)$; in the optimization part, an experimental design method finds the best next search point, new x, based on the information about the cost function f(x) gathered so far.]

Fig. 1. The "black-box" optimization
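The loop of Fig. 1 can be sketched in Python with the simulation treated as an opaque callable. The sketch uses pure random sampling as the simplest possible rule for choosing the next point; it is not the method proposed in this paper, and the names `blind_search`, `simulate`, `bounds`, and `budget` are illustrative assumptions.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple


def blind_search(simulate: Callable[[List[float]], float],
                 bounds: Sequence[Tuple[float, float]],
                 budget: int,
                 seed: int = 0) -> Tuple[Optional[List[float]], float]:
    """Minimise a black-box cost function with a fixed number of simulations.

    simulate : the black box; only its output value is observed
    bounds   : (lower, upper) search interval for every parameter
    budget   : number of (expensive) simulation runs allowed
    """
    rng = random.Random(seed)
    best_x, best_f = None, float("inf")
    for _ in range(budget):
        # Optimization part: choose the next point (here: uniformly at random).
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        # Simulation part: evaluate the candidate without using any internals.
        fx = simulate(x)
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f
```

Because every call of `simulate` may take minutes, the budget of evaluations rather than the bookkeeping of the optimizer dominates the run time, which is why the rule for choosing the next point matters so much.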
In recent years numerous optimization algorithms have been proposed to solve "black-box" optimization problems, such as controlled random search [14], real-parameter genetic algorithms [19], evolution strategies [3], differential evolution [17], particle swarm optimization [8], classical methods such as the quasi-Newton method [15], other non-evolutionary methods such as simulated annealing [9], tabu search [7] and most recently ant-colony optimization (ACO) algorithms [5]. However, the direct application of an ACO for solving a real-parameter optimization problem is difficult. The first algorithm designed for continuous function optimization was a continuous ant-colony optimization [1], which comprises two levels: global and local. The algorithm uses the ant-colony framework to perform local searches, whereas the global search is handled by a genetic algorithm. Until now there have been a few other adaptations of the ACO algorithm to black-box optimization problems: the continuous interacting ant colony [6], the extended ACO for continuous and mixed-variables [16], and the aggregation pheromone system [18].
2 The Differential Ant-Stigmergy Algorithm
In the following, a new approach to the black-box optimization problem using an ACO-based algorithm is presented. It uses pheromonal trail laying (a case of stigmergy) as a means of communication between the ants.

2.1 The Fine-Grained Discrete Form of Continuous Domain

Let $x'_i$ be the current value of the i-th parameter. During the search for the optimal parameter value, the new value, $x_i$, is assigned to the i-th parameter as follows:

\[ x_i = x'_i + \delta_i . \tag{1} \]

Here, $\delta_i$ is the so-called parameter difference and is chosen from the set $\Delta_i = \Delta_i^- \cup \{0\} \cup \Delta_i^+$, where

\[ \Delta_i^- = \{\delta_{i,k}^- \mid \delta_{i,k}^- = -b^{k+L_i-1},\ k = 1, 2, \ldots, d_i\} \]

and

\[ \Delta_i^+ = \{\delta_{i,k}^+ \mid \delta_{i,k}^+ = b^{k+L_i-1},\ k = 1, 2, \ldots, d_i\}. \]

Here $d_i = U_i - L_i + 1$. Therefore, for each parameter $x_i$, the parameter difference, $\delta_i$, has a range from $b^{L_i}$ to $b^{U_i}$, where b is the so-called discrete base, $L_i = \lfloor \lg_b(\epsilon_i) \rfloor$, and $U_i = \lfloor \lg_b(\max(x_i) - \min(x_i)) \rfloor$. With the parameter $\epsilon_i$, the maximum precision of the parameter $x_i$ is set. The precision is limited by the computer's floating-point arithmetic.

2.2 Graph Representation
From all the sets $\Delta_i$, $1 \le i \le D$, where D represents the number of parameters, the so-called differential graph $G = (V, E)$ with a set of vertices, V, and a set of edges, E, between the vertices is constructed. Each set $\Delta_i$ is represented by the set of vertices, $V_i = \{v_{i,1}, v_{i,2}, \ldots, v_{i,2d_i+1}\}$, and $V = \bigcup_{i=1}^{D} V_i$. Then we have that

\[ \Delta_i = \{\delta_{i,d_i}^-, \ldots, \delta_{i,d_i-j+1}^-, \ldots, \delta_{i,1}^-, 0, \delta_{i,1}^+, \ldots, \delta_{i,j}^+, \ldots, \delta_{i,d_i}^+\} \]

is equal to

\[ V_i = \{v_{i,1}, \ldots, v_{i,j}, \ldots, v_{i,d_i+1}, \ldots, v_{i,d_i+1+j}, \ldots, v_{i,2d_i+1}\}, \]

where $v_{i,j} \xrightarrow{\delta} \delta_{i,d_i-j+1}^-$, $v_{i,d_i+1} \xrightarrow{\delta} 0$, $v_{i,d_i+1+j} \xrightarrow{\delta} \delta_{i,j}^+$, and $j = 1, 2, \ldots, d_i$.

To enable a more flexible movement over the search space, the weight $\omega$ is added to Eq. 1:

\[ x_i = x'_i + \omega\delta_i , \tag{2} \]

where $\omega = \mathrm{RandomInteger}(1, b - 1)$. Each vertex of the set $V_i$ is connected to all the vertices that belong to the set $V_{i+1}$. Therefore, this is a directed graph, where each path $\nu$ from the start vertex to any of the ending vertices is of equal length and can be defined with $v_i$ as $\nu = (v_1 v_2 \ldots v_i \ldots v_D)$, where $v_i \in V_i$, $1 \le i \le D$. The optimization task is to find a path $\nu$, such that $f(x) < f(x')$, where $x'$ is currently the best solution, and $x = x' + \Delta(\nu)$ (using Eq. 1). Additionally, if the objective function $f(x)$ is smaller than $f(x')$, then the $x'$ values are replaced with the $x$ values.
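The discretisation and the decoding of a path can be sketched in Python as follows. The function names are hypothetical, the floor in $L_i$ and $U_i$ is assumed so that $d_i$ is an integer, and the example interval and precision in the usage note are chosen for illustration only.

```python
import math
import random
from typing import List, Sequence


def floor_log(value: float, base: int) -> int:
    """floor(log_base(value)), robust against rounding errors at exact powers."""
    v = math.log(value, base)
    r = round(v)
    return r if abs(v - r) < 1e-9 else math.floor(v)


def difference_set(x_min: float, x_max: float, eps: float, b: int = 10) -> List[float]:
    """Ordered set Delta_i: [-b^U, ..., -b^L, 0, b^L, ..., b^U] with 2*d + 1 entries."""
    lower = floor_log(eps, b)              # L_i
    upper = floor_log(x_max - x_min, b)    # U_i
    d = upper - lower + 1                  # d_i
    positive = [float(b) ** (k + lower - 1) for k in range(1, d + 1)]
    negative = [-p for p in reversed(positive)]
    return negative + [0.0] + positive


def apply_path(x_current: Sequence[float], delta_sets: Sequence[Sequence[float]],
               path: Sequence[int], b: int, rng: random.Random) -> List[float]:
    """Decode a path (one 0-based vertex index per parameter) with Eq. (2)."""
    new_x = []
    for x_i, deltas, j in zip(x_current, delta_sets, path):
        omega = rng.randint(1, b - 1)      # RandomInteger(1, b - 1)
        new_x.append(x_i + omega * deltas[j])
    return new_x
```

For a parameter in the hypothetical interval [-5.12, 5.12] with precision 0.001 and base 10, `difference_set(-5.12, 5.12, 1e-3)` yields eleven differences from -10 to 10, so each parameter contributes eleven vertices to the differential graph; the middle vertex carries the difference 0 and leaves the parameter unchanged.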
2.3 The Algorithm
The optimization consists of an iterative improvement of the currently best solution, $x'$, by constructing an appropriate path $\nu$, that uses Eq. 2 and returns a new best solution. This is done using the following search algorithm:

Algorithm Differential Ant-Stigmergy Algorithm (DASA):

Step 1: A solution $x'$ is manually set or randomly chosen.

Step 2: A search graph is created and an initial amount of pheromone, $\tau^0_{V_i}$, is deposited on all the vertices from the set $V_i \subset V$, $1 \le i \le D$, according to a Gaussian probability density function

\[ \mathrm{Gauss}(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \]

where $\mu$ is the mean, $\sigma$ is the standard deviation, and $\mu = 0$, $\sigma = 1$.

Step 3: There are m ants in a colony, all of which begin simultaneously from the start vertex. Ants use a probability rule to determine which vertex will be chosen next. More specifically, ant $\alpha$ in step i moves from a vertex in set $V_{i-1}$ to vertex $v_{i,j} \in \{v_{i,1}, \ldots, v_{i,2d_i+1}\}$ with a probability given by

\[ p_j(\alpha, i) = \frac{\tau(v_{i,j})}{\sum_{1 \le k \le 2d_i+1} \tau(v_{i,k})}, \]

where $\tau(v_{i,k})$ is the amount of pheromone on the vertex $v_{i,k}$. The ants repeat this action until they reach the ending vertex. For each ant, the solution x is constructed (see Eq. 2) and evaluated with a calculation of $f(x)$. The best solution, $x^b$, out of m solutions is compared to the currently best solution $x'$. If $f(x^b)$ is better than $f(x')$, then the $x'$ values are replaced with the $x^b$ values. Furthermore, in this case the amount of pheromone is redistributed according to the associated path $\nu^b = (v_1^b \ldots v_{i-1}^b\, v_i^b \ldots v_D^b)$. New probability-density functions have maxima on the vertices $v_i^b$ and the standard deviations are inversely proportional to the improvements of the solutions.

Step 4: Pheromone evaporation is defined by some predetermined percentage $\rho$ on each probability-density function as follows: $\mu^{NEW} = (1 - \rho)\mu^{OLD}$ and $\sigma^{NEW} = (1 + \rho)\sigma^{OLD}$ if $(1 + \rho)\sigma^{OLD} < \sigma_{max}$, otherwise $\sigma^{NEW} = \sigma_{max}$. Pheromone dispersion has a similar effect to pheromone evaporation in a classical ACO algorithm.

Step 5: The whole procedure is then repeated from Step 3 until some ending condition is met.

Through the iterations of the algorithm we slowly decrease the maximum standard deviation, $\sigma_{max}$, and with it we improve the convergence (an example of daemon action).
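Steps 1 to 5 can be condensed into the following Python sketch. It is a simplified illustration under several assumptions that are not taken from the paper: the vertices of each $V_i$ are placed evenly on the interval [-4, 4] with the zero-difference vertex at 0, the new standard deviation after an improvement is 1/gain clamped to an assumed interval, $\sigma_{max}$ shrinks by 1% per iteration, and the ending condition is a fixed number of iterations. Bounds handling is omitted, and all names are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def gauss(x: float, mu: float, sigma: float) -> float:
    """Gaussian probability density function used as pheromone distribution."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))


def dasa_sketch(f: Callable[[List[float]], float], x_start: Sequence[float],
                delta_sets: Sequence[Sequence[float]], b: int = 10, ants: int = 10,
                rho: float = 0.2, iterations: int = 100,
                seed: int = 0) -> Tuple[List[float], float]:
    rng = random.Random(seed)
    dims = len(x_start)
    # Assumed layout: vertices of V_i evenly spaced on [-4, 4]; delta = 0 sits at 0.
    pos = [[-4.0 + 8.0 * j / (len(ds) - 1) for j in range(len(ds))] for ds in delta_sets]
    mu, sigma = [0.0] * dims, [1.0] * dims          # Step 2: Gauss(x, 0, 1)
    sigma_min, sigma_max = 0.5, 4.0                 # assumed limits
    best_x, best_f = list(x_start), f(list(x_start))   # Step 1
    for _ in range(iterations):                     # Step 5: fixed iteration count
        trials = []
        for _ant in range(ants):                    # Step 3: every ant walks a path
            path, x = [], []
            for i in range(dims):
                tau = [gauss(p, mu[i], sigma[i]) for p in pos[i]]
                j = rng.choices(range(len(tau)), weights=tau)[0]   # p_j = tau_j / sum(tau)
                omega = rng.randint(1, b - 1)
                path.append(j)
                x.append(best_x[i] + omega * delta_sets[i][j])     # Eq. (2)
            trials.append((f(x), x, path))
        fx, x, path = min(trials, key=lambda t: t[0])
        if fx < best_f:                             # redistribute the pheromone
            gain = best_f - fx
            best_x, best_f = x, fx
            for i in range(dims):
                mu[i] = pos[i][path[i]]
                sigma[i] = min(sigma_max, max(sigma_min, 1.0 / gain))   # assumed rule
        for i in range(dims):                       # Step 4: evaporation
            mu[i] *= 1.0 - rho
            sigma[i] = min((1.0 + rho) * sigma[i], sigma_max)
        sigma_max = max(sigma_min, 0.99 * sigma_max)   # daemon action (assumed schedule)
    return best_x, best_f
```

The difference sets passed as `delta_sets` are expected to be ordered as in Sect. 2.2, i.e. negative differences, zero, positive differences. Since the best solution is replaced only on a strict improvement, the returned cost can never be worse than that of the starting point.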
3 Shape Optimization of an Impeller
Using the DASA algorithm described in the previous section, we optimized the radial impeller of a vacuum cleaner. Radial air impellers are the basic components of many turbomachines. In the following we will concentrate on relatively small impellers and subsonic speeds. Our main aim was to find an impeller shape that has a higher aerodynamic power efficiency than the one currently used in production.
Fig. 2. Parametric modeling: (a) top view; (b) 3D view; (c) side view
3.1 Modeling
An impeller is constructed from blades, an upper and a lower side. The sides enclose the blades and keep them together. The blades, which are all the same, were the main part of the optimization. The geometry of a blade is shown in Fig. 2, where the gray color represents the blade. The method of modeling is as follows: we construct the points at specific locations, draw the splines through them and spread the area on the splines. Once a blade is made, an air channel must be constructed in a similar way. In Fig. 2(a) the point 1 has two parameters: the radius r1 and the angle ϕ1r . Similarly, the points 2, 5 and 6 have the parameter pairs r2 , ϕ2r ; r5 , ϕ5r and r6 , ϕ6r . The points 3 and 4 are fixed on the x axis. This is because the impeller must have a constant outer radius, rout , and the outer side of the blade must be parallel to the z axis. On the other hand, the outer angle of the blade, ϕout , and the angle of the spline at points 3 and 4, can be varied. Analogously, the angles ϕ1in and ϕ6in are the inner-blade angles for the upper and lower edges of the blade at the input, respectively. In Fig. 2(c) the points 1, 2, and 3 form the
upper spline, and the points 4, 5, and 6 form the lower spline. Between the points 1 and 6 is the point 7, which defines the spline of the input shape of the blade. In this figure, the points 1, 2, 5, and 6 have the parameters h1, h2, h5, and h6, respectively, describing their heights. Point 3 stays on the x axis and point 4 has a constant height, hout. In other words, the designer of the impeller must know at least the outer radius, rout, and the height, hout. The parameters ϕ1h and ϕ6h describe the input angles of the lower and upper parts of the blade with respect to the r-z plane. Similarly, the parameters ϕ3h and ϕ4h describe the outer blade angle with respect to the same plane. In Fig. 2(b) the meaning of point 7 is explained more precisely. The parameters r7, h7, and ϕ7r define the radius, the height, and the angle, respectively. The radius and the angle dictate where the point should appear with respect to the x-y plane, and the height dictates where it should appear with respect to the r-z plane. Similarly, the angles β1u, β2u, β1d, and β2d are needed to define the starting and ending angles of the spline constructed between the points 1, 7, and 6. If we look closely at Fig. 2(c), we can see the contour surrounding the blade. This is the air channel with the following parameters: the inner radius, rs (see Fig. 2(a)), which is needed for a smooth construction of the hexahedral mesh; the air intake radius, rup; the air outflow radius, rp; the bolt radius, r10; the bolt height, hdwn; and the impeller height, hup. These parameters are fixed during the optimization, because the geometry of the vacuum-cleaner motor stays unchanged. In this way we have modeled the impeller geometry with 26 variable parameters. For each parameter we have a predefined search interval with a given discrete step. Therefore, the size of the search space can be obtained as the product of the number of possible settings over all the parameters.
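The product over the parameters can be sketched in a few lines of Python. The three intervals and steps below are hypothetical values chosen for illustration; they are not the intervals used for the impeller.

```python
def settings_count(lower, upper, step):
    """Number of discrete settings in [lower, upper] for the given step."""
    return int(round((upper - lower) / step)) + 1


def search_space_size(parameters):
    """Product of the per-parameter setting counts."""
    size = 1
    for lower, upper, step in parameters:
        size *= settings_count(lower, upper, step)
    return size


# Hypothetical parameters: two radii in mm and one angle in degrees.
example = [(10.0, 20.0, 0.5), (30.0, 40.0, 0.5), (0.0, 45.0, 1.0)]
size = search_space_size(example)
print(size)
```

Because the counts multiply, each additional parameter scales the search space by its own number of settings, which is why 26 parameters lead to an astronomically large space.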
It turns out that there are approximately $3 \times 10^{34}$ possible solutions.

3.2 Estimation of the Results
The estimation is made with the use of CFD (Computational Fluid Dynamics). The air channel (Fig. 3(a)), which is meshed with hexahedral elements (Fig. 3(b)), represents only one period of the model. Since all the blades have the same shape, as mentioned above, this reduces the calculation time. The boundary conditions (BCs) at the influx and outflux of the air channel (see Fig. 2(c)) are the intake velocity, $v_{in}$, and the reference pressure, $p_{ref}$, respectively. The other BCs are zero velocities at the walls and at the top and bottom sides of the blades. We do not give the theoretical background of CFD here; it can be found in [2,4,13]. In our case, for the CFD we used the ANSYS® FLOTRAN™ package [10]. The velocity $v_{in}$ is increased five times and for each different velocity a numerical analysis is performed. As an output of the analysis, the intake pressure $p_{in}$ is taken at the influx area A. Now, the cost function, f(x), can be defined as:

$$P_{aer} = v_{in} A (p_{ref} - p_{in}) = Q (p_{ref} - p_{in}),$$
Fig. 3. Air channel: (a) geometry, (b) hexahedral mesh
where Q is a flux. The angular velocity ω is 30,000 rpm, and this stays unchanged during the analysis. In contrast, the velocity changes in such a way that the flux Q takes the values 25, 30, 35, 40, and 45 l/s. Therefore, we get five points to describe the graph Q–Paer (see left-hand sides of Figs. 4, 5, 6). Such graphs usually have a convex shape and our criterion is to get a maximum of the graph at Q ≈ 35 l/s. One can say that only three points are needed to achieve this, but the numerical results can contain errors, so it is better to use two more points. As we shall see in the next section, the measured results have a maximum at Q ≈ 30 l/s. The difference arises because in the CFD geometry the thickness of the blades is zero, while in the real world they have a constant thickness (e.g., 1 mm in our case).

3.3 Results
The DASA settings were m = 10, ρ = 0.2, with dependent on the discrete step of each parameter.

Table 1. Optimized impeller's aerodynamic power after 2,000 CFD calculations (Q = 35 l/s, ω = 30,000 rpm)

                         Classical    Optimized impeller
                         impeller     Worst     Mean      Best
Aerodynamic power [W]    160.13       185.34    191.84    210.34
The optimization method was run 10 times and each run consisted of 2,000 CFD calculations. A single CFD calculation takes approximately seven minutes. The obtained results, in terms of aerodynamic power, are presented statistically in Table 1. In Figs. 4, 5, and 6 the CFD results are compared with the measured results, and the corresponding shapes of the classical and optimized impellers are shown. These results indicate that the selected CFD can be used for evaluation purposes.
Fig. 4. Aerodynamic power distribution of the classical impeller at ω = 30,000 rpm (left) and geometry (right). Plot axes: Q [l/s], Paer [W].

Fig. 5. Aerodynamic power distribution of the optimized impeller 1 at ω = 30,000 rpm (left) and geometry (right). Plot axes: Q [l/s], Paer [W].
Although the CFD results differ from the measured ones by up to 7 %, they all deviate at higher values of Paer. This is the consequence of the zero thickness of the blades. It is clear that the CFD and the measured results of the optimized impellers are approximately 20 % higher than those of the classical impeller. The optimized geometries of the blades are more spatially curved than the classical ones. However, the line connecting points 3 and 4 (see Fig. 2) remains parallel to the z axis, so the manufacturing process is not affected. Namely, the blades stay stable when the sides are being assembled.
Fig. 6. Aerodynamic power distribution of the optimized impeller 2 at ω = 30,000 rpm (left) and geometry (right). Plot axes: Q [l/s], Paer [W].
4 Discussion and Conclusion
In this paper we introduced a new ACO-based metaheuristic called the Differential Ant-Stigmergy Algorithm (DASA) for continuous global optimization. As seen in Sec. 2, the DASA is generally applicable to global optimization problems. In addition, it makes use of neither derivative nor a-priori information, making it an ideal solution method for black-box problems. The DASA was compared with a number of evolutionary optimization algorithms [11,12]. The results obtained indicate a promising performance for the new approach. The DASA almost always converges to the global optimum in a reasonable time. The paper shows the ability of the DASA to solve optimization problems with real-world applications, thus making it a well-suited approach for solving global optimization problems from many fields of the applied sciences.
References

1. Bilchev, G., Parmee, I.C.: The ant colony metaphor for searching continuous design spaces. LNCS, vol. 993, pp. 25–39. Springer, Heidelberg (1995)
2. Chung, T.J.: Computational Fluid Dynamics. Cambridge University Press, Cambridge (2002)
3. Deb, K., Anand, A., Joshi, D.: A computationally efficient evolutionary algorithm for real-parameter optimization. Evol. Comput. 10, 371–395 (2002)
4. Dixon, S.L.: Fluid Mechanics and Thermodynamics of Turbomachinery, 4th edn. Elsevier, Amsterdam (1998)
5. Dorigo, M., Stützle, T.: Ant Colony Optimization. The MIT Press, Cambridge (2004)
6. Dréo, J., Siarry, P.: A new ant colony algorithm using the heterarchical concept aimed at optimization of multiminima continuous functions. In: Dorigo, M., Di Caro, G.A., Sampels, M. (eds.) Ant Algorithms 2002. LNCS, vol. 2463, pp. 216–227. Springer, Heidelberg (2002)
7. Glover, F., Laguna, M.: Tabu Search. Kluwer Academic Publishers, Dordrecht (1997)
8. Kennedy, J., Eberhart, R.C.: Particle swarm optimization. In: Proc. IEEE International Conference on Neural Networks, Perth, Australia, November/December 1995, pp. 1942–1948 (1995)
9. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing. Science 220, 671–680 (1983)
10. Kohnke, P. (ed.): ANSYS, Inc. Theory Reference, ANSYS Release 9.0. ANSYS, Inc. (2004)
11. Korošec, P., Šilc, J.: Real-parameter optimization using stigmergy. In: Proc. Second International Conference on Bioinspired Optimization Methods and their Applications, Ljubljana, Slovenia, October 2006, pp. 73–84 (2006)
12. Korošec, P., Šilc, J., Oblak, K., Kosel, F.: The differential ant-stigmergy algorithm: An experimental evaluation and a real-world application. In: Proc. IEEE Congress on Evolutionary Computation, Singapore, September 2007, pp. 157–164 (2007)
13. Kundu, P.K., Cohen, I.M.: Fluid Mechanics. Academic Press, London (2002)
14. Price, W.L.: Global optimization by controlled random search. J. Optimization Theory Appl. 40, 333–348 (1978)
15. Reklaitis, G.V., Ravindran, A., Ragsdell, K.M.: Engineering Optimization Methods. Wiley, Chichester (1983)
16. Socha, K.: ACO for continuous and mixed-variable optimization. In: Dorigo, M., Birattari, M., Blum, C., Gambardella, L.M., Mondada, F., Stützle, T. (eds.) ANTS 2004. LNCS, vol. 3172, pp. 25–36. Springer, Heidelberg (2004)
17. Storn, R., Price, K.V.: Differential evolution – A simple and efficient heuristic for global optimization over continuous spaces. J. Global Optim. 11, 341–359 (1997)
18. Tsutsui, S.: An enhanced aggregation pheromone system for real-parameter optimization in the ACO metaphor. In: Dorigo, M., Gambardella, L.M., Birattari, M., Martinoli, A., Poli, R., Stützle, T. (eds.) ANTS 2006. LNCS, vol. 4150, pp. 60–71. Springer, Heidelberg (2006)
19. Wright, A.H.: Genetic algorithms for real parameter optimization. In: Proc. 1st Workshop on Foundations of Genetic Algorithms, Bloomington, IN, July 1990, pp. 205–218 (1990)
Parallel Algorithm for Simulation of Circuit and One-Way Quantum Computation Models

Marek Sawerwain

Institute of Control & Computation Engineering, University of Zielona Góra, ul. Podgórna 50, Zielona Góra 65-246, Poland
[email protected]
Abstract. In this paper we present software to simulate the circuit and one-way quantum computation models in parallel environments built from PC workstations connected by a standard Ethernet network. We describe the main vector state transformation and its application to the one-qubit and multi-qubit gate application process. We also show the realisation of the measurement process in non-standard bases. We present a benchmark result for the calculation of the Quantum Inverse Fourier Transformation.

Keywords: quantum computation, circuit simulation, one-way quantum computation simulation, parallel vector state transformation.
1 Introduction
In the field of quantum computation there are several different approaches to the quantum computation model (QCM). The best known today are the quantum circuit model, quantum Turing machines, and quantum automata. These quantum computation models are counterparts of classical notions. The aforementioned models use unitary evolution as the main mechanism of information processing. The quantum mechanical measurement is used at the end of the processing, to convert the quantum information to the classical world.

Today it is believed (and confirmed by some experiments, see Chuang et al. [1]) that a quantum computer could solve a number of non-trivial problems in polynomial time using only a linear amount of resources. These include the factorisation problem and the discrete logarithm problem. A quantum computer allows the simulation of physical systems in polynomial time and can probably solve a few other esoteric problems which today we do not even try to formulate. Feynman argued in [2] that classical computers will never allow a simulation of a quantum system in polynomial time. However, quantum computation simulators are the only widely available tools for testing quantum algorithms. Unfortunately, full simulations of the QCM can be executed only for small registers, which contain 20–26 qubits on typical PC hardware or about 36 qubits on clusters or supercomputers (de Raedt et al. [17]).

Today there are many specialised software packages for quantum simulation (see Quantiki [14] for a list of such packages). The best known are QDD [13], OpenQubit [11], QDensity [7], and QCL [10]. The mentioned software uses the complex number representation (similar to the software presented in this work). Moreover, QCL tries to establish a high-level quantum imperative programming language.

Generally speaking, there are two methods of solving a computationally intensive problem. The first is to use a supercomputer (e.g. see Prusa et al. [12] or Guarracino et al. [4]). The second is to use a cluster of simple workstations. The major advantage of using clusters is the cost effectiveness of this solution. This, however, requires a parallel version of the tool used to solve the problem. Currently only a few parallel QCM software packages exist. The first example of such software is the massively parallel simulator of de Raedt et al. [17]. Another example is given by Niwa et al. [9], where the authors also analysed the robustness of quantum circuits in the presence of decoherence and operational errors. Another example is the extended version of QC-lib (the library used in QCL), written by Glendinning and Ömer [3]. It runs on distributed-memory parallel computers.

The aforementioned systems do not fulfil all the requirements of current research. The answer to this problem is the newly developed quantum computer simulator (QCS). The QCS is capable of simulating general quantum computation models using many different approaches, such as the CHP model (implemented for the QCS by Teszner [19]) and computation models based on pure states and mixed states, and it also allows simulation of the one-way computation model. No other software allows the simulation of such a model. Most systems listed above concentrate on a single computational model only. Some of the listed software packages for quantum computation models are not publicly accessible, which is important for educational purposes.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 530–539, 2008. © Springer-Verlag Berlin Heidelberg 2008

1.1 Short Resume of the One-Way Quantum Computation Model
In this section we briefly introduce one-way quantum computation (1WQC for short), in the literature also called cluster state computation. The 1WQC is a model of quantum computation in which measurement plays a key role. The one-way quantum computation model was introduced by Raussendorf and Briegel in [15] and extensively discussed by Raussendorf, Browne and Briegel in [16]. First, we recall the Pauli operators (I, X, Y, Z) and the Hadamard gate, denoted by H:

$$I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},\quad X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},\quad Y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix},\quad Z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix},\quad H = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}. \tag{1}$$

We use the Pauli operators as gates in simulations of the 1WQC to correct errors which can appear in the measurement process of qubits in a cluster. Two more ingredients are important in the first step of the 1WQC: the Hadamard basis states and the CPhase gate. The Hadamard basis appears at the beginning of computations in the one-way quantum computation model and plays an important role. This basis is represented as

$$|+\rangle = \frac{1}{\sqrt{2}} (|0\rangle + |1\rangle), \qquad |-\rangle = \frac{1}{\sqrt{2}} (|0\rangle - |1\rangle). \tag{2}$$

The third ingredient is the controlled phase gate (CPhase, or CZ for short). It is the 4 × 4 unitary transformation which flips the phase of the second qubit if the first qubit is in state $|1\rangle$ and does nothing if the first qubit is in $|0\rangle$. The CPhase gate is given in matrix form as

$$\mathrm{CPhase} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & -1 \end{pmatrix}. \tag{3}$$

The major difference between the one-way and the standard quantum gate array computation model is that in the former we consider a rectangular two-dimensional grid of several qubits in state $|+\rangle$, connected by applying the CZ gate to the nearest-neighbour pairs in the horizontal and vertical directions. After applying the CZ gate, the obtained state is fully entangled and is called the cluster state or the graph state. An appropriate way of measuring this state (termed pattern measurement) allows us to simulate all gates which belong to the universal set. An example of such a measurement is depicted in Fig. 1 (see also Raussendorf, Browne and Briegel [16]). In the computation process we mainly use a measurement procedure in different bases (including the standard basis). In many applications of the 1WQC (for example, see Jozsa [6]) we use a basis constructed as

$$M(\theta) = \{ |0\rangle \pm e^{i\theta} |1\rangle \} \tag{4}$$

for some real θ; additionally, θ = 0 corresponds to the X basis, i.e. the Hadamard basis $\{|+\rangle, |-\rangle\}$ of (2). The results of a measurement are always labelled zero or one, regardless of the basis used. Results of measurements are always uniformly random. The QCS simulates measurements in any basis. Often in the 1WQC the arbitrary basis is considered (note that (5) is a generalisation of (4))

$$B(\varphi) = \left\{ \frac{|0\rangle + e^{i\varphi} |1\rangle}{\sqrt{2}},\ \frac{|0\rangle - e^{i\varphi} |1\rangle}{\sqrt{2}} \right\}. \tag{5}$$

The basis states of all possible measurements (5) lie on the equator of the Bloch sphere, i.e. on its intersection with the x-y plane. This fact allows the measurement parameter to be specified as a single argument, see equation (4).
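The definitions in Eqs. (2), (3) and (5) can be checked numerically with a small, dependency-free Python sketch. It is only an illustration and not part of the QCS: all function names are hypothetical, and the CPhase action is implemented from the textual definition (the phase is flipped only when both qubits are in state |1>).

```python
import cmath
import math


def basis_b(phi):
    """The two states of B(phi) from Eq. (5), as amplitude pairs (a0, a1)."""
    s = 1.0 / math.sqrt(2.0)
    phase = cmath.exp(1j * phi)
    return (s, s * phase), (s, -s * phase)


def inner(a, b):
    """Inner product <a|b> of two single-qubit states."""
    return sum(x.conjugate() * y for x, y in zip(a, b))


def cphase(state):
    """CPhase on a two-qubit state (a00, a01, a10, a11): only a11 changes sign."""
    a00, a01, a10, a11 = state
    return (a00, a01, a10, -a11)


def first_qubit_probability(state, b):
    """Probability of projecting the first qubit of a two-qubit state onto |b>."""
    a00, a01, a10, a11 = state
    amp0 = b[0].conjugate() * a00 + b[1].conjugate() * a10
    amp1 = b[0].conjugate() * a01 + b[1].conjugate() * a11
    return abs(amp0) ** 2 + abs(amp1) ** 2


# Smallest cluster state: CPhase applied to |+>|+> = (1/2)(1, 1, 1, 1).
cluster = cphase((0.5, 0.5, 0.5, 0.5))

# Measure the first qubit in B(phi) for an arbitrarily chosen angle.
plus, minus = basis_b(math.pi / 3)
p_plus = first_qubit_probability(cluster, plus)
p_minus = first_qubit_probability(cluster, minus)
print(cluster, p_plus, p_minus)
```

For this two-qubit cluster state both outcomes come out equally likely for any angle, which is one concrete instance of the uniformly random measurement results mentioned above.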
2 Quantum Computing Simulator
The quantum computing simulator (Sawerwain [18]) has been under development at the University of Zielona Góra since 2004. Several open source tools are used in the development process: the GNU Compiler Collection v4.x, the Python scripting language v2.4.x, the SWIG v1.3.x wrapper generator, specialised LAPACK and BLAS linear algebra libraries, and an MPI library for the parallel implementation.
Fig. 1. Example of measurement patterns for a basic set of gates: CNot, Hadamard and π/2 phase gate. The pattern for a general rotation gate is specified by the Euler angles ξ, η, ζ.
The core library of the QCS system is written in the pure ANSI C programming language, which simplifies porting to many other platforms. The QCS library can be used from the Python language, and a port for Java can be easily generated using the SWIG tool. The MPI version offers an additional simple language similar to an assembler. This language is called QASM (quantum assembler) and is compatible with the Knill Quantum Random Access Machine [5]. QASM makes executing scripts in a parallel environment easier. The QCS was tested on the Windows XP SP2 and Linux 2.6.xx operating systems in both 32-bit and 64-bit environments.
Fig. 2. General architecture of parallel quantum assembler interpreter
The QCS system simulates several types of quantum computation models including the PQC – pseudo quantum circuits which consist of Not and CNot gates, CHP – quantum circuits composed of CNot, Hadamard and Phase change gates. This mode allows single qubit measurements. The main models considered in this paper are the standard quantum circuit model and the one-way quantum computation model. The system contains all commonly used density matrix operations e.g.: fidelity (and other metrics for quantum states), eigenvector and eigenvalue functions and Schmidt decomposition.
3 Parallel Algorithm for Simulation of the Quantum Computation Model
The most important problem in the classical simulation of a quantum computation model is the high computational complexity. Generally, the problem of simulating the QCM belongs to the exponential class of complexity $O(2^n)$. The heart of our algorithm for simulation of the QCM is the vector state transformation (VST) by a unitary matrix u. In pseudo-code this algorithm has a very simple description, shown in Fig. 3. The parallel version of this algorithm can be easily constructed by splitting the vector state q_reg evenly among several different nodes. The algorithm runs on distributed-memory parallel computers, using the MPICH implementation [8] of the message passing interface (MPI). The general architecture is depicted in Fig. 2. To achieve a reasonable computation speed we assume that the number of nodes, denoted by M, is given by the following relation:

$$N = \log_M 2^L \tag{6}$$

where L denotes the number of qubits. Unfortunately, the total parallel time is still exponential, which is shown by the following proposition:

Proposition 1. The computational cost of simulation of the QCM on a parallel machine is given by:

$$T_{QCM}(L) = \frac{2^L}{N} \tag{7}$$

Proof. The proof is a direct consequence of constructing the structure of the vector state by the tensor product of the qubits or qudits which belong to the quantum register.

It means that the VST algorithm is not optimal for parallel computations and does not belong to Nick's parallel complexity class.

    m = pow( 2, n - 1 );
    step = pow( 2, n - t ) - 1;      /* distance between paired rows, minus one */
    p = pow( 2, t - 1 );             /* number of blocks */
    vstep = pow( 2, n ) / p;         /* block length */
    irow = 0;
    for( ip = 0 ; ip < p ; ip++ ) {
        for( i = 0 ; i < step + 1 ; i++ ) {
            r1 = irow + i;
            r2 = irow + i + step + 1;
            oper_on_rows(q_reg, r1, r2, u);
        }
        irow = irow + vstep;
    }

Fig. 3. Pseudo-code for an in-place algorithm of the vector transformation by unitary matrix u for pure states
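For readers who want to run the loop of Fig. 3, the following Python sketch transcribes it directly. Two points are assumptions rather than facts from the paper: the body of oper_on_rows is taken to be the usual 2 × 2 update of the amplitude pair (r1, r2), and qubits are numbered t = 1..n starting from the most significant bit, which is what the index arithmetic of Fig. 3 implies.

```python
import math


def apply_single_qubit_gate(q_reg, u, t):
    """In-place transformation of the state vector q_reg by the 2x2 matrix u on qubit t.

    Returns the number of row-pair updates performed.
    """
    n = len(q_reg).bit_length() - 1      # len(q_reg) == 2**n
    step = 2 ** (n - t) - 1
    p = 2 ** (t - 1)
    vstep = 2 ** n // p
    calls = 0
    irow = 0
    for _ in range(p):
        for i in range(step + 1):
            r1 = irow + i
            r2 = irow + i + step + 1
            # Assumed body of oper_on_rows: 2x2 matrix times the amplitude pair.
            a, b = q_reg[r1], q_reg[r2]
            q_reg[r1] = u[0][0] * a + u[0][1] * b
            q_reg[r2] = u[1][0] * a + u[1][1] * b
            calls += 1
        irow += vstep
    return calls


s = 1.0 / math.sqrt(2.0)
hadamard = [[s, s], [s, -s]]
pauli_x = [[0.0, 1.0], [1.0, 0.0]]

reg_h = [1.0] + [0.0] * 7                # |000>
calls_h = apply_single_qubit_gate(reg_h, hadamard, 1)

reg_x = [1.0] + [0.0] * 7                # |000>
calls_x = apply_single_qubit_gate(reg_x, pauli_x, 3)
print(reg_h, calls_h, reg_x, calls_x)
```

In this transcription every call touches each amplitude exactly once, so it performs 2^(n-1) row-pair updates regardless of the target qubit; only the distance between the paired rows changes with t.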
The complexity of the algorithm shown in Fig. 3 can be calculated as follows: let $t_c$ denote the cost of execution of elementary operations on the variables r1 and r2, and let $t_{fnc}$ represent the computational cost of the oper_on_rows function (see Fig. 3). After simple calculations we obtain the exponential class of complexity $O(2^n)$:

$$T_{VST}(n) = \sum_{ip=0}^{2^{n-1}} \sum_{i=0}^{2^{n-t}} (t_c + t_{fnc}) = (1 + 2^{n-1} + 2^{n-t} + 2^{2n-t-1})\,(t_c + t_{fnc}) = O(2^n) \tag{8}$$

An additional important problem in the parallel implementation of the algorithm depicted in Fig. 3 is the transfer of a part of the quantum register to another node. This problem is encountered when one-qubit or multi-qubit gates are simulated. It is explained by the following example. Consider a simple three-qubit register divided into two parts. The matrix u represents the unitary gate applied to qubit zero, one and two, which can be written as u ⊗ I ⊗ I, I ⊗ u ⊗ I and I ⊗ I ⊗ u, respectively (where ⊗ denotes the tensor product):

$$\begin{pmatrix}
\alpha & \cdot & \cdot & \cdot & \beta & \cdot & \cdot & \cdot \\
\cdot & \alpha & \cdot & \cdot & \cdot & \beta & \cdot & \cdot \\
\cdot & \cdot & \alpha & \cdot & \cdot & \cdot & \beta & \cdot \\
\cdot & \cdot & \cdot & \alpha & \cdot & \cdot & \cdot & \beta \\
\gamma & \cdot & \cdot & \cdot & \delta & \cdot & \cdot & \cdot \\
\cdot & \gamma & \cdot & \cdot & \cdot & \delta & \cdot & \cdot \\
\cdot & \cdot & \gamma & \cdot & \cdot & \cdot & \delta & \cdot \\
\cdot & \cdot & \cdot & \gamma & \cdot & \cdot & \cdot & \delta
\end{pmatrix}
\begin{pmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \\ \alpha_4 \\ \alpha_5 \\ \alpha_6 \\ \alpha_7 \end{pmatrix}
=
\begin{pmatrix} \alpha\alpha_0 + \beta\alpha_4 \\ \alpha\alpha_1 + \beta\alpha_5 \\ \alpha\alpha_2 + \beta\alpha_6 \\ \alpha\alpha_3 + \beta\alpha_7 \\ \gamma\alpha_0 + \delta\alpha_4 \\ \gamma\alpha_1 + \delta\alpha_5 \\ \gamma\alpha_2 + \delta\alpha_6 \\ \gamma\alpha_3 + \delta\alpha_7 \end{pmatrix} \tag{9}$$

The calculations on the first part of the quantum register need data held by the second node to complete the gate simulation. A similar situation appears during the operation I ⊗ u ⊗ I, but in the operation I ⊗ I ⊗ u no parts need to be transferred to the second node. Much better results are obtained for multi-qubit gates. Some gates change only a fragment of the quantum register. For example, consider the CNot gate where qubit zero is the control qubit and qubit two is the target. The first four amplitudes, held by the first node, are not changed. We modify only the amplitudes in the second part of the register. If we divide the quantum register into two equal parts, the exchange of any part of this register between the nodes is not necessary. The computational complexity of the VST algorithm for the ideal case (when no part of the register needs to be exchanged) is given by

$$T_{VST}(L) = \frac{2^{L - qc}}{N} \tag{10}$$

where qc is the number of control qubits. The parallel version of the in-place vector state transformation can be accomplished by distributing calculations over all available nodes.
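Whether a gate forces an exchange between nodes can be worked out from the index arithmetic alone. The sketch below is an illustration under stated assumptions, not the QCS code: the 2^n amplitudes are assumed to be split into equal contiguous blocks, one per node, qubits are numbered t = 1..n from the most significant bit as in Fig. 3, and the function names are hypothetical. Its condition is equivalent to the `vstep <= part_size` test used by the slave_qubit pseudo-code, since vstep is twice the pairing distance.

```python
def needs_exchange(n, t, nodes):
    """True if a gate on qubit t pairs amplitudes stored on different nodes."""
    part_size = 2 ** n // nodes
    distance = 2 ** (n - t)          # index distance between paired amplitudes
    return distance >= part_size


def owner_and_partner(index, n, t, nodes):
    """Node holding amplitude `index` and node holding the amplitude paired with it."""
    part_size = 2 ** n // nodes
    partner = index ^ (2 ** (n - t))
    return index // part_size, partner // part_size


# Three-qubit register on two nodes, as in the example around Eq. (9).
first_qubit = needs_exchange(3, 1, 2)    # u (x) I (x) I
last_qubit = needs_exchange(3, 3, 2)     # I (x) I (x) u
pair = owner_and_partner(0, 3, 1, 2)     # amplitude 0 is paired with amplitude 4
print(first_qubit, last_qubit, pair)
```

Under these assumptions the gate on the most significant qubit pairs amplitude 0 on the first node with amplitude 4 on the second, matching the products in Eq. (9), while the gate on the least significant qubit only pairs neighbouring amplitudes within a node.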
The pseudo-code has the following form:

    slave_qubit(QType, QNumber) {
        u = get_matrix_for(QType);
        calculate m, p, step, vstep;
        min_idx = node_from_index;
        max_idx = node_to_index;
        if (vstep <= part_size)
            other_part = -1;
        else
            other_part = calculate partner node number;
        if (other_part != -1) {
            receive remote block from other_part;
            send local block to other_part;
        }
        if (other_part != -1) {
            in-place transform on local block with additional data;
        } else {
            in-place transform on local block;
        }
    }

3.1 Measurement Implementation
The main problem in simulating the one-way quantum computation model is the realisation of the measurement process in both standard and non-standard bases. The current version of the QCS uses the standard von Neumann measurement with projection operators. Let $|\psi_m\rangle$ denote the states of the measurement basis; the projection operators $P_m$ are defined as

$$P_m = |\psi_m\rangle\langle\psi_m| \quad \text{and} \quad P_{m_i} P_{m_j} = \delta_{i,j} P_{m_i}. \tag{11}$$

Obviously, we observe that $\sum_m P_m = I$ (where I denotes the identity matrix). Let $p_m$ denote the probability that a quantum state $|\psi\rangle$ collapses to the state $|\psi_m\rangle$ after measurement, $p_m = \langle\psi|P_m|\psi\rangle$. The post-measurement state is given by

$$|\psi'\rangle = \frac{P_m |\psi\rangle}{\sqrt{p_m}}. \tag{12}$$

For $P_0 = |0\rangle\langle 0|$, $P_1 = |1\rangle\langle 1|$ we obtain the standard measurement basis, and for $P_0 = |+\rangle\langle +|$, $P_1 = |-\rangle\langle -|$ we obtain the Hadamard measurement basis. In both cases the probabilities are equal to $p_0 = \langle\psi|P_0|\psi\rangle$ and $p_1 = \langle\psi|P_1|\psi\rangle$, respectively. A measurement in a non-standard basis (for example B(ϕ) of (5)) can be easily performed by using the appropriate projection operators defined in (11). To measure a given qubit i in the quantum register, the following algorithm (termed QMA) can be used:

(1) compute the probabilities p0 and p1 using the operators P0 and P1 of (11)
(2) if p0 > p1, apply the P0 projector to qubit i
(3) if p0 < p1, apply the P1 projector to qubit i
(4) if p0 = p1, randomly select P0 or P1 and apply the selected projector to qubit i
(5) normalise the obtained post-measurement state
Parallel implementation of this simple algorithm is easy when we use the vector transformation algorithm shown in Fig. 3. The first step of the QMA does not modify the vector state. The parallel complexity of this step is similar to the transformation of the register using a quantum gate. The computational complexity of this step is $O(2^n)$. To calculate the complexity of the second step (points (2) to (4) of the QMA algorithm), recall that this step transforms the vector with the projector. The complexity of this step is hence equal to $O(2^n)$, as shown in (8). Before step (5) of the QMA the obtained vector may not be normalised. Therefore we must apply the normalisation procedure. The pseudo-code of the normalisation procedure has the following form (|qr[i]| denotes the modulus of the complex number and sqrt denotes the square root):

    for (i = 0; i < 2^L; i++) in parallel do {
        fmod = | qr[i] |
        if ( fmod != 0 )
            qr[i] = 1 / sqrt(probe) * qr[i]
    }

The normalisation does not require transferring any additional data between nodes. The complexity of this part of the QMA is equal to (where N is the number of nodes):

$$T(L) = \frac{2^L}{N} \tag{13}$$
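The five QMA steps, together with the normalisation loop, can be sketched sequentially in Python for the standard basis. This is an illustrative single-process version with hypothetical names, not the QCS implementation; it assumes qubits are numbered 1..n from the most significant bit and follows the QMA rule of keeping the more probable outcome. A simulator that samples outcomes would instead draw the result with probabilities p0 and p1.

```python
import math
import random


def measure_qubit(state, i, n, rng):
    """QMA-style measurement of qubit i in the standard basis.

    Returns (outcome, new_state); the input list is left unchanged.
    """
    mask = 1 << (n - i)
    # Step (1): p0 and p1 from the projectors P0 = |0><0| and P1 = |1><1| on qubit i.
    p0 = sum(abs(a) ** 2 for k, a in enumerate(state) if not k & mask)
    p1 = sum(abs(a) ** 2 for k, a in enumerate(state) if k & mask)
    # Steps (2)-(4): keep the more probable outcome, break ties at random.
    if p0 > p1:
        outcome = 0
    elif p0 < p1:
        outcome = 1
    else:
        outcome = rng.choice((0, 1))
    prob = p0 if outcome == 0 else p1
    # Apply the projector, then step (5): scale the survivors by 1/sqrt(prob).
    scale = 1.0 / math.sqrt(prob)
    new_state = [
        a * scale if bool(k & mask) == bool(outcome) else 0.0
        for k, a in enumerate(state)
    ]
    return outcome, new_state


# Two-qubit state 0.6|00> + 0.8|11>, measured on the first qubit.
outcome, new_state = measure_qubit([0.6, 0.0, 0.0, 0.8], 1, 2, random.Random(0))
print(outcome, new_state)
```

Each list element is processed independently once p0 and p1 are known, which is why the projection and the normalisation need no further data exchange between nodes beyond combining the partial probability sums.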
4 Performance Benchmarks
To show the effectiveness of the QCS, a set of benchmark problems was simulated on a cluster of nine PCs (1 GB RAM, Pentium 4 2.4 GHz) connected by a 100 Mbit Ethernet network. To avoid transferring large amounts of data between nodes over the slow network, a compression technique was used. The compression algorithm was fast and thus did not significantly affect the overall simulation time. The compression and decompression time for simulating a quantum circuit (Inverse Quantum Fourier Transformation, IQFT) containing 25 qubits was approximately 2.7 sec. The compression time for the other simulated circuits was similar. Even better execution times can be obtained when the nodes in the cluster are connected with a faster network, e.g. 1 Gbit Ethernet, for MPI message communication. In such cases the use of compression might not be beneficial (at least for the considered circuits). The whole program for simulation of the IQFT for 25 qubits, written in the QASM language, contains 325 instructions, of which 25 represent the Hadamard gate and 300 represent the controlled alpha rotation gate. The execution of the whole IQFT on three nodes takes about half an hour. This shows that the use of a cluster is necessary for simulating programs containing many quantum instructions.
Table 1. Results of executing script of IQFT of 25 qubits register

Number of nodes n:                 3             5            9
1q gate simulation avg. (sec.):    6             3.2          1.6
Mq gate simulation avg. (sec.):    3             1.7          0.8
Original (KB):                     130 · 1024    67 · 1024    33 · 1024
Compression (KB):                  130           67           33
Compress time (sec.):              2.7           1.3          0.7
Decompression time (sec.):         0.5           0.3          0.1
Total time (sec.):                 1123          650          302
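The figures above can be cross-checked with a short calculation. The sketch below is an illustration only: it assumes the textbook gate count of a Fourier transform circuit (n Hadamard gates and n(n-1)/2 controlled rotations), which matches the 25 + 300 instructions reported in the text, and it computes speed-up relative to the three-node run using the total times from Table 1.

```python
def iqft_instruction_count(n):
    """Hadamard and controlled-rotation counts for an n-qubit (inverse) QFT circuit."""
    hadamards = n
    controlled_rotations = n * (n - 1) // 2
    return hadamards, controlled_rotations


# Total times in seconds from Table 1, keyed by the number of nodes.
total_time = {3: 1123.0, 5: 650.0, 9: 302.0}
base_nodes = 3

speedups = {}
for nodes in sorted(total_time):
    # Speed-up relative to the three-node run (no single-node time is reported).
    speedups[nodes] = total_time[base_nodes] / total_time[nodes]
    node_ratio = nodes / base_nodes
    print(nodes, round(speedups[nodes], 2), round(node_ratio, 2))

counts = iqft_instruction_count(25)
print(counts, sum(counts))
```

Comparing each speed-up with the corresponding node ratio shows how close the measured scaling is to linear over this range of cluster sizes.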
5 Conclusions and Further Work
In this article, we presented specialised software, called the QCS, to simulate two main quantum computational models: the circuit model and the one-way quantum computational model. The QCS can be run on any computer, from a single-processor PC to a heterogeneous cluster of multiprocessor machines. A significant advantage of the QCS is that it is written in pure ANSI C with MPI messaging, and hence can be run on any modern operating system. The simulator allows switching between the one-way computation model and the quantum circuit model. The QCS also allows the application of any type of gate and measurement in a non-standard basis. Up to 30 qubits can be handled on a cluster of ten PCs connected with a 100 Mbit Ethernet network. Obviously, many other features can be implemented in our system. Future work will include other models of quantum computation: the adiabatic computation and the topological computational model. We plan to add support for measurement in a finite number of bases. This constraint significantly reduces the memory usage of the QCS. Additionally, it will allow encoding values in the quantum register as binary codes. Another feature which will be implemented in the QCS is the possibility of simulating circuits containing general d-level qudits. The parallel version of the QCS presented in this paper may also find application as a special type of benchmark for new multiprocessor workstations connected by a computer network.

Acknowledgments. I acknowledge useful discussions on the QCS with the QINFO group at the Institute of Control and Computation Engineering of the University of Zielona Góra, Poland. I would like to thank Roman Gielerak for useful comments and his supervision and feedback. The author would also like to thank Łukasz Hładowski and Przemysław Ratajczak for useful comments and suggestions which improved the readability of this paper.
References

1. Chuang, I.L., Vandersypen, L.M.K., Zhou, X., Leung, D.W., Lloyd, S.: Experimental realization of a quantum algorithm. Nature 393, 143–146 (1998)
2. Feynman, R.P.: Simulating physics with computers. Int. J. Theoretical Physics 21(6/7), 467–488 (1982)
3. Glendinning, I., Ömer, B.: Parallelization of the General Single Qubit Gate and CNOT for the QC-lib Quantum Computer Simulator Library. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds.) PPAM 2004. LNCS, vol. 3019, pp. 461–468. Springer, Heidelberg (2004)
4. Guarracino, M.R., Perla, F., Zanetti, P.: A parallel block Lanczos algorithm and its implementation for the evaluation of some eigenvalues of large sparse symmetric matrices on multicomputers. Int. J. Applied Mathematics and Computer Science 16(2), 241–249 (2006)
5. Knill, E.: Conventions for quantum pseudocode. Technical Report LAUR-96-2724, Los Alamos National Laboratory (1996), http://citeseer.ist.psu.edu/knill96conventions.html
6. Jozsa, R.: An introduction to measurement based quantum computation, arXiv:quant-ph/0508124
7. Juliá-Díaz, B., Burdis, J.M., Tabakin, F.: QDENSITY – A Mathematica Quantum Computer simulation. Computer Physics Communications 174(11), 914–934 (2006)
8. MPICH the MPI implementation, http://www-unix.mcs.anl.gov/mpi/mpich/
9. Niwa, J., Matsumoto, K., Imai, H.: General-purpose parallel simulator for quantum computing. Phys. Rev. A 66, 062317 (2002)
10. Ömer, B.: Quantum Programming in QCL. Master's thesis, Institute of Information Systems, Technical University of Vienna (2000)
11. OpenQubit, http://www.ennui.net/~quantum/index.shtml
12. Prusa, J.M., Smolarkiewicz, P.K., Wyszogrodzki, A.A.: Simulations of Gravity Wave Induced Turbulence Using 512 PE Cray T3E. Int. J. Applied Mathematics and Computer Science 11(4), 883–897 (2001)
13. QDD, http://thegreves.com/david/QDD/qdd.html
14. Quantiki, List of available quantum computation models simulators, http://www.quantiki.org/wiki/index.php?title=List_of_QC_simulators
15. Raussendorf, R., Briegel, H.J.: A one-way quantum computer. Phys. Rev. Lett. 86, 5188–5191 (2001)
16. Raussendorf, R., Browne, D.E., Briegel, H.J.: Measurement-based quantum computation with cluster states. Phys. Rev. A 68, 022312 (2003), arXiv:quant-ph/0301052
17. de Raedt, K., Michielsen, K., De Raedt, H., Trieu, B., Arnold, G., Richter, M., Lippert, T., Watanabe, H., Ito, N.: Massive Parallel Quantum Computer Simulator. Computer Physics Communications 176, 127–136 (2007)
18. Sawerwain, M.: Quantum Computing Simulator. In: Proc. Int. Conf. Computer Methods and Systems, CMS 2005, Kraków, Poland, vol. 2, pp. 185–190 (2005) (in Polish)
19. Teszner, M.: Effectively simulable quantum systems. Master thesis, University of Zielona Góra (2006) (in Polish)
Modular Rough Neuro-fuzzy Systems for Classification

Rafał Scherer1,2, Marcin Korytkowski1,3, Robert Nowicki1,2, and Leszek Rutkowski1,2

1 Department of Computer Engineering, Częstochowa University of Technology, al. Armii Krajowej 36, 42-200 Częstochowa, Poland, http://kik.pcz.pl
2 Department of Artificial Intelligence, Academy of Humanities and Economics in Lodz, ul. Rewolucji 1905 nr 64, Łódź, Poland, http://www.wshe.lodz.pl
3 Olsztyn Academy of Computer Science and Management, ul. Artyleryjska 3c, 10-165 Olsztyn, Poland, http://www.owsiiz.edu.pl
[email protected], {marcink, rnowicki, lrutko}@kik.pcz.czest.pl
Abstract. In the paper we propose a new class of modular systems for classification in the case of missing features. We incorporate the rough set theory into construction of neuro-fuzzy systems which create the modular structure. The AdaBoost algorithm is combined with the gradient algorithm to train the whole system. We illustrate the performance of our approach on typical benchmarks.
1 Introduction

Classification consists in assigning an object described by a set of features to a class. There are many methods for classifying data. The traditional statistical classification procedures [14] apply Bayesian decision theory and assume knowledge of the posterior probabilities. Unfortunately, in practical situations we have no information about the underlying probability model and the Bayes formula cannot be applied. Over the years, numerous classification methods have been developed based on neural networks, fuzzy systems [13][14][16], support vector machines, rough sets and other soft computing techniques. These methods do not need information about the probability model. Yet they usually fail to classify correctly in the case of missing data (features). Generally, there are two ways to solve the problem of missing data:

- Imputation - the unknown values are replaced by estimated ones. The estimated value can be set to the mean of the known values of the same feature in other instances. Another idea is to apply the nearest neighbor algorithm based on instances with a known value of the same feature. Statistical methods can also be used.
- Marginalisation - the features with unknown values are ignored. In this way the problem comes down to classification in a lower-dimensional feature space.

Fuzzy classifiers are frequently used thanks to their ability to use knowledge in the form of intelligible fuzzy rules. Classifiers can be combined to improve accuracy. By combining intelligent learning systems, the model robustness and accuracy are nearly always improved compared to single-model solutions. Popular methods are bagging and boosting, which are meta-algorithms for learning different classifiers. Boosting assigns weights to learning samples according to how earlier classifiers in the ensemble performed on them, so the subsystems are trained with different data sets. In this paper we combine fuzzy methods with rough set theory [9] [10] [11] and classifier ensemble methods. Neuro-fuzzy systems are trained with the AdaBoost algorithm and backpropagation [5]. Then the rules from these systems are used in a rough set classifier.

This work was supported in part by the Foundation for Polish Science (Professorial Grant 2005-2008) and the Polish Ministry of Science and Higher Education (Special Research Project 2006-2009) and by science funds for 2007-2010 as research project Nr N N516 1155 33.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 540–548, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Modular Systems Generated by Boosting Algorithm

We use Mamdani-type fuzzy systems as the single classifiers $h_t$ in the ensemble:

$$h_t = \operatorname{sgn}\left(\frac{\sum_{r=1}^{N_t} \bar{y}_t^r \cdot \tau_t^r}{\sum_{r=1}^{N_t} \tau_t^r}\right), \tag{1}$$

where $\tau_t^r = \mathop{T}_{i=1}^{n} \mu_{A_i^r}(\bar{x}_i)$ is the activation level of the fuzzy rule $r = 1, \ldots, N_t$ in classifier $h_t$, $t = 1, \ldots, T$ (the operator $T$ applied over $i$ denotes a t-norm), and $\mu_{A_i^r}(\bar{x}_i)$ is the membership function of the fuzzy set $A_i^r$. The structure described by (1) is shown in Figure 1. To build the ensemble we use the AdaBoost algorithm, which is the most popular boosting method [2],[6],[15]. Let us denote the $l$-th learning vector by $z^l = [x_1^l, \ldots, x_n^l, y^l]$, where $l = 1, \ldots, m$ is the number of the vector in the learning sequence, $n$ is the dimension of the input vector $x^l$, and $y^l$ is the learning class label. The weights $D^l$ assigned to the learning vectors have to fulfill the following conditions:

$$\text{(i)}\ 0 < D^l < 1, \qquad \text{(ii)}\ \sum_{l=1}^{m} D^l = 1. \tag{2}$$
The weight $D^l$ indicates how well the classifiers learned the input vector $x^l$ in consecutive steps of the algorithm. The weights of all input vectors are initialized according to the equation

$$D_t^l = \frac{1}{m} \quad \text{for } t = 0, \tag{3}$$
where $t$ is the number of a boosting iteration (and the number of a classifier in the ensemble). Let $\{h_t(x) : t = 1, \ldots, T\}$ denote the set of hypotheses obtained in consecutive steps $t$ of the algorithm being described. For simplicity we limit our problem to binary classification (dichotomy), i.e. $y \in \{-1, 1\}$ or $h_t(x) = \pm 1$. As with the weights of the learning vectors, we assign a weight $c_t$ to every hypothesis such that

$$\text{(i)}\ \sum_{t=1}^{T} c_t = 1, \qquad \text{(ii)}\ c_t > 0. \tag{4}$$
Now in the AdaBoost algorithm we repeat steps 1-4 for $t = 1, \ldots, T$:

1. Create hypothesis $h_t$ and train it with the data set with respect to the distribution $D_t$ over the input vectors (with $D_t(z^l) = D_t^l$).
2. Compute the classification error $\varepsilon_t$ of the trained classifier $h_t$ according to the formula

$$\varepsilon_t = \sum_{l=1}^{m} D_t(z^l)\, I(h_t(x^l) \neq y^l), \tag{5}$$

where $I$ is the indicator function

$$I(a \neq b) = \begin{cases} 1 & \text{if } a \neq b \\ 0 & \text{if } a = b \end{cases}. \tag{6}$$

If $\varepsilon_t = 0$ or $\varepsilon_t \geq 0.5$, stop the algorithm.
3. Compute the value

$$\alpha_t = 0.5 \ln \frac{1 - \varepsilon_t}{\varepsilon_t}. \tag{7}$$

4. Modify the weights of the learning vectors according to the formula

$$D_{t+1}(z^l) = \frac{D_t(z^l) \exp\{-\alpha_t I(h_t(x^l) = y^l)\}}{N_t}, \tag{8}$$

where $I(h_t(x^l) = y^l)$ equals 1 for a correctly classified vector and 0 otherwise, and $N_t$ is a constant such that $\sum_{l=1}^{m} D_{t+1}(z^l) = 1$.

To compute the overall output of the ensemble of classifiers trained by the AdaBoost algorithm, the following formula is used:

$$f(x) = \sum_{t=1}^{T} c_t h_t(x), \tag{9}$$

where

$$c_t = \frac{\alpha_t}{\sum_{t=1}^{T} |\alpha_t|} \tag{10}$$

is the classifier importance for a given training set. The AdaBoost algorithm is a meta-learning algorithm and does not determine the way of learning for the classifiers in the ensemble.
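The boosting loop of steps 1-4 and the aggregation (9)-(10) can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the weak learner `train_stump` is a hypothetical single-feature threshold rule standing in for the neuro-fuzzy system, and the toy data are invented.

```python
import math

def train_stump(X, y, D):
    """Hypothetical weak learner: the single-feature threshold rule with the
    lowest weighted error under the distribution D."""
    best_err, best_h = None, None
    for i in range(len(X[0])):
        for thr in sorted({x[i] for x in X}):
            for s in (1, -1):
                h = lambda x, i=i, thr=thr, s=s: s if x[i] >= thr else -s
                err = sum(d for x, t, d in zip(X, y, D) if h(x) != t)
                if best_err is None or err < best_err:
                    best_err, best_h = err, h
    return best_h

def adaboost(X, y, T):
    m = len(X)
    D = [1.0 / m] * m                                   # Eq. (3)
    hyps, alphas = [], []
    for _ in range(T):
        h = train_stump(X, y, D)                        # step 1
        eps = sum(d for x, t, d in zip(X, y, D) if h(x) != t)   # Eq. (5)
        if eps == 0 or eps >= 0.5:                      # stopping rule of step 2
            break
        alpha = 0.5 * math.log((1 - eps) / eps)         # Eq. (7)
        # Eq. (8): scale down correctly classified vectors, then normalize.
        D = [d * math.exp(-alpha * (1 if h(x) == t else 0))
             for x, t, d in zip(X, y, D)]
        norm = sum(D)
        D = [d / norm for d in D]
        hyps.append(h)
        alphas.append(alpha)
    total = sum(abs(a) for a in alphas)
    c = [a / total for a in alphas] if alphas else []   # Eq. (10)
    return hyps, c

def ensemble_output(hyps, c, x):
    return sum(ct * h(x) for ct, h in zip(c, hyps))     # Eq. (9)

# Invented one-dimensional toy data that no single threshold separates.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
hyps, c = adaboost(X, y, T=3)
```

Because the weight update only shrinks the weights of correctly classified vectors before normalizing, misclassified vectors gain relative weight in the next round, which is what drives each new hypothesis toward the examples the earlier ones got wrong.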
Fig. 1. Single Mamdani neuro-fuzzy system

3 Rough Neuro-fuzzy Systems

The concept of using rough sets and fuzzy sets together comes from Dubois and Prade [3], [4]. They proposed two approaches to combining both theories. The first one leads to the definition of the rough fuzzy set, where lower and upper approximations of a fuzzy set are defined. The second one leads to the definition of the fuzzy rough set, where the lower and upper approximations of usual (not fuzzy) sets are fuzzy. We apply the first approach. In the definition of rough sets, as well as of rough fuzzy sets, the notion of the equivalence class $[x]_R$ is very important. It is defined as follows:

Definition 1 (Equivalence class). The equivalence class $[\hat{x}]_R$ is the set of elements $x \in X$ which are related to the object $\hat{x}$ by the relation $R$. It is expressed as follows:

$$[\hat{x}]_R = \{x \in X : \hat{x} R x\}, \tag{11}$$

where $R$ is an equivalence relation, i.e. any relation satisfying the reflexivity, symmetry and transitivity conditions [12]. The definition of the rough fuzzy set can be written as follows:
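Definition 2 (Rough fuzzy set). The rough fuzzy set is a pair $(\underline{R}A, \overline{R}A)$ of fuzzy sets. $\underline{R}A$ is an $R$-lower approximation and $\overline{R}A$ is an $R$-upper approximation of the fuzzy set $A \subseteq X$. The membership functions of $\underline{R}A$ and $\overline{R}A$ are defined as follows:

$$\mu_{\underline{R}A}(\hat{x}) = \inf_{x \in [\hat{x}]_R} \mu_A(x), \tag{12}$$

$$\mu_{\overline{R}A}(\hat{x}) = \sup_{x \in [\hat{x}]_R} \mu_A(x), \tag{13}$$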
where $[\hat{x}]_R$ is an equivalence class [3], [4], [12]. Now we have to define the relation $R$, which depends on the available and missing features. When we denote the set of available features by $P$, we can define the $P$-indiscernibility relation, which describes the indiscernibility of vectors taking into consideration only the features that belong to the set $P$, i.e. those that are available; the remaining features are ignored. Formally, we can write this as follows:

$$x P \hat{x} \Leftrightarrow \forall i : x_i \in P;\ x_i = \hat{x}_i. \tag{14}$$

Using the relation $P$ instead of $R$ in Definition 2, we obtain a rough fuzzy set that depends on the available and missing features. When we use it to build a neuro-fuzzy system, we obtain a rough neuro-fuzzy system, which can also work in the case of missing features [7], [8]. Its two answers, adapted to the modular system defined in Section 2, can be defined as follows:

$$\underline{h}_t(x) = \operatorname{sign}\left(\frac{2 \sum_{r=1,\ r:\,\bar{y}_t^r = 1}^{N} \mu_{\underline{P}A_t^r}(x)}{\sum_{r=1,\ r:\,\bar{y}_t^r = 1}^{N} \mu_{\underline{P}A_t^r}(x) + \sum_{r=1,\ r:\,\bar{y}_t^r = -1}^{N} \mu_{\overline{P}A_t^r}(x)} - 1\right), \tag{15}$$

$$\overline{h}_t(x) = \operatorname{sign}\left(\frac{2 \sum_{r=1,\ r:\,\bar{y}_t^r = 1}^{N} \mu_{\overline{P}A_t^r}(x)}{\sum_{r=1,\ r:\,\bar{y}_t^r = 1}^{N} \mu_{\overline{P}A_t^r}(x) + \sum_{r=1,\ r:\,\bar{y}_t^r = -1}^{N} \mu_{\underline{P}A_t^r}(x)} - 1\right), \tag{16}$$

where the constants 2 and −1 in Eq. (15) and Eq. (16) rescale the values from the interval $[0, 1]$ to $[-1, 1]$. Thus the final answer of classifier $h_t$ in the modular system is defined in the following way:

$$h_t(x) = \begin{cases} \underline{h}_t(x) & \text{if } \underline{h}_t(x) = \overline{h}_t(x) \\ 0 & \text{if } \underline{h}_t(x) \neq \overline{h}_t(x) \end{cases}. \tag{17}$$
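To make the lower and upper answers concrete, the following sketch evaluates (15)-(17) for a two-rule classifier when a feature may be missing. Everything specific here is an assumption made for illustration and is not taken from the paper: Gaussian memberships of the form exp(-((x - c)/s)^2), a product t-norm, a bounded domain `[lo, hi]` over which a missing feature may vary, the convention of returning 0 when a denominator vanishes, and the invented rules and domain.

```python
import math

def gauss(x, c, s):
    # Assumed membership shape: exp(-((x - c) / s)^2).
    return math.exp(-((x - c) / s) ** 2)

def membership_bounds(x, c, s, lo, hi):
    """(inf, sup) of the membership over the equivalence class of one feature.
    x is None when the feature is missing and may lie anywhere in [lo, hi]."""
    if x is not None:
        m = gauss(x, c, s)
        return m, m
    ends = (gauss(lo, c, s), gauss(hi, c, s))
    return min(ends), (1.0 if lo <= c <= hi else max(ends))

def rule_bounds(x, antecedent, domain):
    lower = upper = 1.0
    for xi, (c, s), (lo, hi) in zip(x, antecedent, domain):
        a, b = membership_bounds(xi, c, s, lo, hi)
        lower *= a          # product t-norm (assumption)
        upper *= b
    return lower, upper

def sign(v):
    return (v > 0) - (v < 0)

def rescaled(pos, neg):
    total = pos + neg
    return sign(2.0 * pos / total - 1.0) if total > 0 else 0

def classify(x, rules, domain):
    pos_low = pos_up = neg_low = neg_up = 0.0
    for antecedent, y in rules:
        low, up = rule_bounds(x, antecedent, domain)
        if y == 1:
            pos_low += low
            pos_up += up
        else:
            neg_low += low
            neg_up += up
    h_lower = rescaled(pos_low, neg_up)            # Eq. (15)
    h_upper = rescaled(pos_up, neg_low)            # Eq. (16)
    return h_lower if h_lower == h_upper else 0    # Eq. (17)

# Invented rules: (center, width) per feature and the consequent (+1 or -1).
rules = [([(2.0, 1.0), (4.0, 10.0)], 1), ([(6.0, 1.0), (4.0, 10.0)], -1)]
domain = [(0.0, 8.0), (0.0, 8.0)]
```

With the second feature missing, `classify((2.0, None), rules, domain)` still gives a decision because the first feature separates the rules clearly, whereas `classify((4.0, None), rules, domain)` falls between the rules, so the lower and upper answers disagree and the classifier abstains.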
4 Modular Rough-Neuro-fuzzy Classifier

We used the rules from the ensemble of neuro-fuzzy systems (Section 2) in the rough-neuro-fuzzy classifier (Section 3). We changed the way the hypotheses are aggregated from (9) to the following formula:

$$f(x) = \begin{cases} \dfrac{\sum_{t=1,\ t:\,h_t(x) \neq 0}^{T} c_t h_t(x)}{\sum_{t=1,\ t:\,h_t(x) \neq 0}^{T} c_t} & \text{if } \sum_{t=1,\ t:\,h_t(x) \neq 0}^{T} c_t > 0 \\[2ex] 0 & \text{if } \sum_{t=1,\ t:\,h_t(x) \neq 0}^{T} c_t = 0 \end{cases} \tag{18}$$

$$H(x) = \operatorname{sign}(f(x)) \tag{19}$$

Fig. 2. Rough-neuro-fuzzy classifier

Interpretation:
– H(x) = 1 - the object belongs to the class,
– H(x) = 0 - unknown,
– H(x) = −1 - the object does not belong to the class.
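The aggregation (18)-(19) simply renormalizes the importance coefficients over the subsystems that did not abstain. A minimal sketch follows; the function names are illustrative, and the coefficients in the usage lines are the values 0.38, 0.26 and 0.34 quoted in Section 5.

```python
def aggregate(c, h):
    """Eq. (18): weighted vote over the subsystems whose answer is non-zero."""
    num = sum(ct * ht for ct, ht in zip(c, h) if ht != 0)
    den = sum(ct for ct, ht in zip(c, h) if ht != 0)
    return num / den if den > 0 else 0.0

def decide(c, h):
    """Eq. (19): H(x) = sign(f(x)), with 0 meaning 'unknown'."""
    f = aggregate(c, h)
    return (f > 0) - (f < 0)

c = [0.38, 0.26, 0.34]
print(decide(c, [1, 0, -1]))   # second subsystem abstains
print(decide(c, [0, 0, 0]))    # every subsystem abstains
```

In the first call the vote is decided by the first and third subsystems only, so the slightly larger coefficient of the first one wins; in the second call no subsystem answers and the result is 0.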
5 Experimental Results

The Iris flower data set [1] is a common benchmark in classification and pattern recognition studies. Four features (v1 - sepal length in cm, v2 - sepal width in cm, v3 - petal length in cm, v4 - petal width in cm) describe each of 150 instances. The flowers are divided into three species: iris setosa (ω1), iris versicolor (ω2) and iris virginica (ω3). Each species is represented by 50 flowers. In our experiments we merged the first two classes into one class. The data are randomly divided into a learning sequence (105 sets) and a testing sequence (45 sets). We use rules from three neuro-fuzzy classifiers trained by the AdaBoost algorithm. Each classifier has three fuzzy rules and the whole set of these rules is used in the rough-neuro-fuzzy classifier implementation. The rules are shown in Tables 1-3. The classifier importance coefficients ct are 0.38, 0.26 and 0.34, respectively. Table 1 lists seven rules because Rule 2 had a consequent value equal to "-5" instead of ±1; it therefore appears five times, which weights it accordingly. The fuzzy sets in the rules in the tables are of Gaussian type and are labeled by the values of their center and width.

Table 1. Rules in 1st rough neuro-fuzzy subsystem

| Antecedent | Consequent |
| (x1,67.23,73.51) (x2,25.88,32.75) (x3,57.79,63.46) (x4,22.97,23.16) | (y1,yes) |
| (x1,58.22,62.71) (x2,32.34,28.20) (x3,38.56,47.58) (x4,4.22,10.27) | (y1,no) |
| (x1,58.22,62.71) (x2,32.34,28.20) (x3,38.56,47.58) (x4,4.22,10.27) | (y1,no) |
| (x1,58.22,62.71) (x2,32.34,28.20) (x3,38.56,47.58) (x4,4.22,10.27) | (y1,no) |
| (x1,58.22,62.71) (x2,32.34,28.20) (x3,38.56,47.58) (x4,4.22,10.27) | (y1,no) |
| (x1,58.22,62.71) (x2,32.34,28.20) (x3,38.56,47.58) (x4,4.22,10.27) | (y1,no) |
| (x1,50.65,53.94) (x2,35.46,40.63) (x3,15.44,28.27) (x4,3.65,7.48) | (y1,yes) |

Table 2. Rules in 2nd rough neuro-fuzzy subsystem

| Antecedent | Consequent |
| (x1,54.37,11.88) (x2,31.27,2.66) (x3,24.23,11.29) (x4,8.33,1.51) | (y1,no) |
| (x1,70.66,2.44) (x2,32.83,1.43) (x3,38.49,5.16) (x4,16.31,1.15) | (y1,no) |
| (x1,63.99,13.08) (x2,36.70,11.68) (x3,62.15,3.99) (x4,26.83,3.61) | (y1,yes) |

Table 3. Rules in 3rd rough neuro-fuzzy subsystem

| Antecedent | Consequent |
| (x1,53.11,10.17) (x2,26.61,6.29) (x3,24.75,14.19) (x4,7.08,2.78) | (y1,no) |
| (x1,65.37,7.88) (x2,32.31,3.55) (x3,37.60,8.67) (x4,15.95,1.11) | (y1,no) |
| (x1,66.89,13.05) (x2,37.47,9.02) (x3,57.22,13.23) (x4,23.48,3.52) | (y1,yes) |

Table 4. Results of iris flower classification by the rough-neuro-fuzzy system (learning seq.); entries are correct/rejected/wrong

| Known features | 1st system [%] | 2nd system [%] | 3rd system [%] | modular system [%] |
| x1, x2, x3, x4 | 96/0/4 | 99/0/1 | 99/0/1 | 99/0/1 |
| x1, x2, x3 | 0/100/0 | 28/72/0 | 0/100/0 | 28/72/0 |
| x1, x2, x4 | 81/17/2 | 29/71/0 | 58/42/0 | 80/17/2 |
| x1, x3, x4 | 89/9/2 | 83/16/1 | 70/30/0 | 96/2/2 |
| x2, x3, x4 | 95/2/3 | 91/8/1 | 72/28/0 | 96/1/3 |
| 3 features | 66/32/2 | 57/42/1 | 50/50/0 | 75/23/2 |
| x1, x2 | 0/100/0 | 0/100/0 | 0/100/0 | 0/100/0 |
| x1, x3 | 0/100/0 | 14/86/0 | 0/100/0 | 14/86/0 |
| x1, x4 | 70/29/1 | 0/100/0 | 49/51/0 | 70/29/1 |
| x2, x3 | 0/100/0 | 28/72/0 | 0/100/0 | 28/72/0 |
| x2, x4 | 78/21/1 | 28/72/0 | 49/51/0 | 78/21/1 |
| x3, x4 | 87/11/2 | 69/31/0 | 64/36/0 | 90/8/2 |
| 2 features | 39/60/1 | 23/77/0 | 27/73/0 | 47/53/1 |
| x1 | 0/100/0 | 0/100/0 | 0/100/0 | 0/100/0 |
| x2 | 0/100/0 | 0/100/0 | 0/100/0 | 0/100/0 |
| x3 | 0/100/0 | 8/92/0 | 0/100/0 | 8/92/0 |
| x4 | 66/34/9 | 0/100/0 | 49/51/0 | 66/34/0 |
| 1 feature | 17/83/0 | 2/98/0 | 12/88/0 | 19/81/0 |
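As an illustration of how such Gaussian rules are evaluated with all features known, the sketch below applies Eq. (1) to the three rules of Table 2. The membership form exp(-((x - c)/width)^2), the product t-norm, the encoding yes = +1 and no = -1, and the two input vectors (expressed on the same scale as the rule centers) are assumptions made for this example rather than details confirmed by the paper.

```python
import math

# Rules of the 2nd subsystem (Table 2): (center, width) per feature, consequent.
RULES = [
    ([(54.37, 11.88), (31.27, 2.66), (24.23, 11.29), (8.33, 1.51)], -1),
    ([(70.66, 2.44), (32.83, 1.43), (38.49, 5.16), (16.31, 1.15)], -1),
    ([(63.99, 13.08), (36.70, 11.68), (62.15, 3.99), (26.83, 3.61)], 1),
]

def activation(x, antecedent):
    """Rule activation: product (assumed t-norm) of Gaussian memberships."""
    tau = 1.0
    for xi, (c, width) in zip(x, antecedent):
        tau *= math.exp(-((xi - c) / width) ** 2)
    return tau

def classify(x, rules=RULES):
    """Eq. (1): sign of the activation-weighted mean of the rule consequents."""
    taus = [activation(x, antecedent) for antecedent, _ in rules]
    den = sum(taus)
    if den == 0:
        return 0            # no rule fires at all (convention of this sketch)
    v = sum(y * tau for (_, y), tau in zip(rules, taus)) / den
    return (v > 0) - (v < 0)

# Illustrative inputs, not taken from the paper.
large_petals = (63, 33, 60, 25)
small_petals = (51, 35, 14, 2)
```

Here `classify(large_petals)` is dominated by the third rule and `classify(small_petals)` by the first one, so the two calls return opposite signs.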
Table 5. Results of iris flower classification by the rough-neuro-fuzzy system (testing seq.); entries are correct/rejected/wrong

| Known features | 1st system [%] | 2nd system [%] | 3rd system [%] | modular system [%] |
| x1, x2, x3, x4 | 94/0/6 | 95/0/5 | 95/0/5 | 95/0/5 |
| x1, x2, x3 | 0/100/0 | 23/77/0 | 0/100/0 | 23/77/0 |
| x1, x2, x4 | 81/16/3 | 26/74/0 | 52/48/0 | 81/16/3 |
| x1, x3, x4 | 85/11/3 | 77/21/2 | 58/42/0 | 92/3/5 |
| x2, x3, x4 | 94/0/6 | 85/11/3 | 65/35/0 | 94/0/6 |
| 3 features | 65/32/3 | 53/46/1 | 44/56/0 | 72/24/4 |
| x1, x2 | 0/100/0 | 0/100/0 | 0/100/0 | 0/100/0 |
| x1, x3 | 0/100/0 | 15/85/0 | 0/100/0 | 15/85/0 |
| x1, x4 | 71/27/2 | 0/100/0 | 39/61/0 | 71/27/2 |
| x2, x3 | 0/100/0 | 23/77/0 | 0/100/0 | 23/77/0 |
| x2, x4 | 76/23/2 | 23/77/0 | 39/61/0 | 76/23/2 |
| x3, x4 | 84/13/3 | 68/32/0 | 48/52/0 | 85/11/3 |
| 2 features | 39/60/1 | 22/78/0 | 21/79/0 | 45/54/1 |
| x1 | 0/100/0 | 0/100/0 | 0/100/0 | 0/100/0 |
| x2 | 0/100/0 | 0/100/0 | 0/100/0 | 0/100/0 |
| x3 | 0/100/0 | 5/95/0 | 0/100/0 | 5/95/0 |
| x4 | 66/34/0 | 0/100/0 | 39/61/0 | 66/34/0 |
| 1 feature | 17/83/0 | 1/99/0 | 10/90/0 | 18/82/0 |

Tables 4 and 5 show the simulation results on the learning sequence (Table 4) and the testing sequence (Table 5). The tables show the percentage of correct (correct classification), rejected (no classification) and wrong (incorrect classification) cases.
6 Conclusions

In the paper we used fuzzy rules from the ensemble of neuro-fuzzy systems in a rough set classifier. The ensemble consisted of three Mamdani neuro-fuzzy systems with three fuzzy rules. The systems were trained by the backpropagation algorithm in combination with the AdaBoost meta-learning. The rules from the trained systems were used to build the knowledge base for a rough-neuro-fuzzy classifier. The classification accuracy of the rough set system is relatively high, and the greatest advantage of the proposed solution is the ability to work in the case of missing data. In such a case its confidence decreases but it still works, in contrast to other learning systems like neural networks or fuzzy systems.
References

1. Blake, C.L., Merz, C.J.: UCI Repository of machine learning databases, Irvine, University of California, Department of Information and Computer Science (1998), www.ics.uci.edu/~mlearn/MLRepository.html
2. Breiman, L.: Bias, variance, and arcing classifiers, Technical Report 460, Statistics Department, University of California (1997)
3. Dubois, D., Prade, H.: Rough fuzzy sets and fuzzy rough sets. Internat. J. General Systems 17(2-3), 191–209 (1990)
4. Dubois, D., Prade, H.: Putting rough sets and fuzzy sets together. In: Słowiński, R. (ed.) Intelligent Decision Support: Handbook of Applications and Advances of the Rough Sets Theory, pp. 203–232. Kluwer, Dordrecht (1992)
5. Korytkowski, M., Rutkowski, L., Scherer, R.: On Combining Backpropagation with Boosting. In: 2006 International Joint Conference on Neural Networks, IEEE World Congress on Computational Intelligence, Vancouver, BC, Canada (2006)
6. Meir, R., Rätsch, G.: An Introduction to Boosting and Leveraging. In: Mendelson, S., Smola, A.J. (eds.) Advanced Lectures on Machine Learning. LNCS (LNAI), vol. 2600, pp. 118–183. Springer, Heidelberg (2003)
7. Nowicki, R.: Rough Sets in the Neuro-Fuzzy Architectures Based on Monotonic Fuzzy Implications. In: Rutkowski, L., Siekmann, J.H., Tadeusiewicz, R., Zadeh, L.A. (eds.) ICAISC 2004. LNCS (LNAI), vol. 3070, pp. 510–517. Springer, Heidelberg (2004)
8. Nowicki, R.: Rough-Neuro-Fuzzy System with MICOG Defuzzification. In: Proc. 2006 IEEE International Conference on Fuzzy Systems, IEEE World Congress on Computational Intelligence, Vancouver, BC, Canada, pp. 9090–9097 (2006)
9. Pawlak, Z.: Rough sets. International Journal of Information and Computer Science 11(341), 341–356 (1982)
10. Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning About Data. Kluwer, Dordrecht (1991)
11. Pawlak, Z.: Rough sets, decision algorithms and Bayes' theorem. European Journal of Operational Research 136, 181–189 (2002)
12. Polkowski, L.: Rough Sets. Mathematical Foundation. Physica-Verlag, A Springer-Verlag Company, Heidelberg, New York (2002)
13. Rutkowski, L., Cpałka, K.: Flexible Neuro-Fuzzy Systems. IEEE Trans. Neural Networks 14(3), 554–574 (2003)
14. Rutkowski, L., Cpałka, K.: Designing and Learning of Adjustable Quasi-Triangular Norms With Applications to Neuro-Fuzzy Systems. IEEE Trans. Fuzzy Systems 13(1), 140–151 (2005)
15. Schapire, R.E.: A brief introduction to boosting. In: Proc. of the Sixteenth International Joint Conference on Artificial Intelligence 1999, pp. 1401–1406 (1999)
16. Wang, L.X.: Adaptive Fuzzy Systems and Control. PTR Prentice Hall, Englewood Cliffs (1994)
Tracing SQL Attacks Via Neural Networks

Jaroslaw Skaruz1, Franciszek Seredynski1,2,3, and Pascal Bouvry4

1 Institute of Computer Science, University of Podlasie, Sienkiewicza 51, 08-110 Siedlce, Poland, [email protected]
2 Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008 Warsaw
3 Institute of Computer Science, Polish Academy of Sciences, Ordona 21, 01-237 Warsaw, Poland, [email protected]
4 Faculty of Sciences, Technology and Communication, University of Luxembourg, 6 rue Coudenhove Kalergi, Luxembourg, [email protected]
Abstract. In the paper we present a new approach, based on the application of neural networks, to detecting SQL attacks. SQL attacks are those attacks that take advantage of SQL statements submitted for execution. The problem of detecting this class of attacks is transformed into a time series prediction problem. SQL queries are used as a source of events in a protected environment. To differentiate between normal SQL queries and those sent by an attacker, we divide SQL statements into tokens and pass them to our detection system, which predicts the next token, taking into account previously seen tokens. In the learning phase tokens are passed to a recurrent neural network (RNN) trained by the backpropagation through time (BPTT) algorithm. Training data at the output of the RNN are shifted by one token forward in time in relation to the input. An additional rule is defined to interpret the RNN's output. Experiments were conducted on Jordan and Elman networks, and the results show that the Jordan network outperforms the Elman network, predicting queries correctly with higher efficiency. Moreover, our results lead to the form of the rule, which can be successfully applied to the subset of SQL statements taken into consideration in this study.
1 Introduction

A large number of Web applications, especially those deployed by companies for e-business purposes, depend on data integrity and confidentiality. Such applications are written in script languages like PHP embedded in HTML, which allow them to establish connections to databases, retrieve data and put them on a WWW site. Security violations consist in unauthorized access to and modification of data in the database. SQL is one of the languages used to manage data in databases. Its statements can be one of the sources of events for potential attacks.

In the literature there are some approaches to intrusion detection in Web applications. In [1] the authors developed an anomaly-based system that learns the profiles of the normal database access performed by web-based applications using a number of different models. A profile is a set of models, to which parts of an SQL statement are fed in order to train the set of models or to generate an anomaly score. During the training phase the models are built based on training data and an anomaly score is calculated. For each model, the maximum anomaly score is stored and used to set an anomaly threshold. During the detection phase, an anomaly score is calculated for each SQL query. If it exceeds the maximum anomaly score evaluated during the training phase, the query is considered to be anomalous. Decreasing the number of false positive alerts involves creating models for custom data types for each application to which this system is applied.

Besides that work, there are some other works on detecting attacks on a Web server, which constitutes a part of the infrastructure for Web applications. In [2] the detection system correlates the server-side programs referenced by client queries with the parameters contained in these queries. Its approach to detection is similar to that of the previous work. The system analyzes HTTP requests and builds a data model based on the attribute length of requests, attribute character distribution, structural inference and attribute order. In the detection phase the built model is used for comparing the requests of clients. In [3] the logs of a Web server are analyzed to look for security violations. However, the proposed system is prone to high rates of false alarms. To decrease them, some site-specific information should be taken into account, which is not portable.

In this work we present a new approach to intrusion detection in Web applications. Rather than building profiles of normal behavior, we focus on the sequence of tokens within SQL statements observed during normal use of an application. Two architectures of RNN are used to encode the stream of such SQL statements. The paper is organized as follows. The next section discusses SQL attacks. In section 3 we present two architectures of RNN. Section 4 shows the training and testing data used for experiments. Next, section 5 contains experimental results. The last section summarizes the results and shows possible future work.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 549–558, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 SQL Attacks

2.1 SQL Injection

An SQL injection attack consists in manipulating an application that communicates with a database in such a way that a user gains access to, or is able to modify, data for which he has no privileges. In most cases Web forms are used to inject a part of an SQL query. By typing SQL keywords and control signs, an intruder is able to change the structure of the SQL query developed by a Web designer. If the variables used in an SQL query are under the control of a user, he can modify the SQL query and thereby change its meaning. Consider the example of poor quality code written in PHP presented below.

    $connection = mysql_connect();
    mysql_select_db("test");
    // User input is copied into the query without any validation.
    $user = $HTTP_GET_VARS['username'];
    $pass = $HTTP_GET_VARS['password'];
    $query = "select * from users where login='$user' and password='$pass'";
    $result = mysql_query($query);
    if (mysql_num_rows($result) == 1)
        echo "authorization successful";
    else
        echo "authorization failed";

The code is responsible for authorizing users. User data typed in a Web form are assigned to the variables user and pass and then passed to the SQL statement. If the retrieved data include exactly one row, it means that the user filled in the form with the same login and password as those stored in the database. Because the data sent by the Web form are not analyzed, a user is free to inject any strings. For example, an intruder can type ' or 1=1 -- in the login field, leaving the password field empty. The structure of the SQL query will be changed as presented below.

    $query = "select * from users where login='' or 1=1 --' and password=''";

The two dashes comment out the text that follows. The Boolean expression 1=1 is always true, and as a result the user will be logged in with the privileges of the first user stored in the table users.

2.2 Proposed Approach
The way we detect intruders can be easily transformed into a time series prediction problem. According to [4], a time series is a sequence of data collected from some system by sampling a system property, usually at regular time intervals. One of the goals of the analysis of time series is to forecast the next value in the sequence based on the values that occurred in the past. The problem can be more precisely formulated as follows:

$$s_{t-2}, s_{t-1}, s_t \longrightarrow s_{t+1}, \tag{1}$$

where $s$ is any signal, which depends on the problem being solved, and $t$ is the current moment in time. Given $s_{t-2}, s_{t-1}, s_t$, we want to predict $s_{t+1}$. In the problem of detecting SQL attacks, each SQL statement is divided into some signals, which we further call tokens. The idea of detecting SQL attacks is based on their key feature. SQL injection attacks involve the modification of an SQL statement, which means that the sequence of tokens extracted from a modified SQL statement is different from the sequence of tokens derived from a legal SQL statement. Various techniques have been used to analyze time series [5,6]. Besides statistical methods, RNNs have been widely used for that problem. In the study presented in this paper we selected the Jordan network and the Elman network.
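The two ideas of this subsection, that an injected statement yields a different token sequence and that prediction works on sliding windows as in (1), can be illustrated with a short sketch. The tokenizer below is a deliberately crude, hypothetical one (a handful of keywords, with literals collapsed to `string`, `number`, `name` and `comment`); it is not the 54-token scheme used in the paper.

```python
import re

KEYWORDS = {"select", "from", "where", "and", "or", "insert", "into"}
OPERATORS = {"=", "*", "<", ">"}

def tokenize(sql):
    """Crude illustrative tokenizer, not the paper's token set."""
    tokens = []
    for match in re.finditer(r"'[^']*'|--.*|\d+|\w+|[=*<>]", sql):
        text = match.group().lower()
        if text.startswith("'"):
            tokens.append("string")
        elif text.startswith("--"):
            tokens.append("comment")
        elif text.isdigit():
            tokens.append("number")
        elif text in KEYWORDS or text in OPERATORS:
            tokens.append(text)
        else:
            tokens.append("name")
    return tokens

def windows(tokens, k=3):
    """Pairs ((s_{t-2}, s_{t-1}, s_t), s_{t+1}) as in formula (1)."""
    return [(tuple(tokens[i:i + k]), tokens[i + k])
            for i in range(len(tokens) - k)]

legal = tokenize("select * from users where login='bob' and password='secret'")
attack = tokenize("select * from users where login='' or 1=1 --' and password=''")
print(legal)
print(attack)
print(windows(legal)[0])
```

The two sequences agree up to the first string literal and diverge afterwards, which is exactly the kind of deviation a next-token predictor trained only on legal statements is expected to miss.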
3 Recurrent Neural Networks

3.1 RNN Architectures

There are some differences between the Elman and the Jordan networks. The first is that the input signal for the context layer neurons comes from different layers (the hidden layer in the Elman network, the output layer in the Jordan network), and the second is that the Jordan network has an additional feedback connection in the context layer. While in the Elman network the size of the context layer is the same as the size of the hidden layer, in the Jordan network the context layer has the same size as the output layer. In both networks the recurrent connections have a fixed weight equal to 1.0. The networks were trained by BPTT; the appropriate equations are presented in [8].

3.2 Training
The training process of an RNN is performed as follows. The tokens of an SQL statement become the input of a network. The activations of all neurons are computed. Next, the error of each neuron is calculated. These steps are repeated until the last token has been presented to the network. Next, all weights are updated and the activations of the context layer neurons are set to 0. For each input, the teaching data are shifted by one token forward in time in relation to the input. We consider the following tokens: keywords of the SQL language, numbers, strings and combinations of these elements. We used the collection of SQL statements to define 54 distinct tokens. Each token has a unique index. Table 1 shows selected tokens and their indexes. The indexes are used for the preparation of input data for the neural networks. For example, the index of the keyword WHERE is 7. The index 28 points to a combination of the keyword FROM and any string. The token with index 36 relates to a grammatical link between SELECT and any string. Finally, when any string is compared to any number within an SQL query, the index of the token equals 47. Figure 1 presents an example of an SQL statement, its representation in the form of tokens and the four related binary input vectors of a network.

Table 1. A part of a list of tokens and their indexes

| token | index |
| ... | ... |
| WHERE | 7 |
| ... | ... |
| FROM string | 28 |
| ... | ... |
| SELECT string | 36 |
| ... | ... |
| string=number | 47 |
| ... | ... |
| INSERT INTO | 54 |

An SQL statement is encoded as k vectors, where k is the number of tokens constituting the statement (see figure 1). The number of neurons in the input layer is the same as the number of defined tokens. The networks have 55 neurons in the output layer. 54 neurons correspond to the tokens, as in the input layer, and neuron 55 is included to indicate that the input vector currently being processed is the last one within an SQL query. The training data, which are compared to the output of the network, have a value equal to either 0.1 or 0.9. If neuron number n in the output layer has a small value, the next token to be processed cannot have index n. On the other hand, if output neuron number n has the value 0.9, the next token in the sequence should have an index equal to n. At the beginning, the SQL statement is divided into tokens.

a)
SELECT user_password FROM nuke_users WHERE user_id = 2
token 36 (SELECT user_password), token 28 (FROM nuke_users), token 7 (WHERE), token 47 (user_id = 2)

b)
vector 1  000000000000000000000000000000000001000000000000000000
vector 2  000000000000000000000000000100000001000000000000000000
vector 3  000000100000000000000000000100000001000000000000000000
vector 4  000000100000000000000000000100000001000000000010000000

Fig. 1. Preparation of input data for a neural network: analysis of a statement in terms of tokens (a), input neural network data corresponding to the statement (b)

The indexes of the tokens are: 36, 28, 7 and 47. Each row is an input vector for the RNN (see figure 1). In figure 1 the first token that appears is 36. As a consequence, in the first step of training the output signal of all neurons in the input layer is 0 except for neuron number 36, which has the value 1. The subsequent input vectors indicate the index of the current token together with the indexes of the tokens that have already been processed by the RNN. The next token in the sequence has an index equal to 28. It follows that only neurons 36 and 28 have an output signal equal to 1. The next token index is 7, which means that neurons 36, 28 and 7 send 1 and all remaining neurons send 0. Finally, neurons 36, 28, 7 and 47 have an activation signal equal to 1. At that moment the weights of the RNN are updated and the next SQL statement is considered.
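The encoding just described can be reproduced with a short sketch. The input part follows Fig. 1 directly; the target part is one reading of the text (0.9 at the index of the next token, 0.1 elsewhere, with neuron 55 marking the last vector of a query) and should be treated as an assumption, as should the function names.

```python
N_TOKENS = 54    # input neurons, one per token
N_OUTPUTS = 55   # output neurons; the extra one marks the end of a query

def input_vectors(indexes, n=N_TOKENS):
    """Cumulative binary vectors as in Fig. 1: bit i is set once token i was seen."""
    seen = [0] * n
    vectors = []
    for index in indexes:
        seen[index - 1] = 1
        vectors.append("".join(str(bit) for bit in seen))
    return vectors

def target_vectors(indexes, n_out=N_OUTPUTS, low=0.1, high=0.9):
    """Targets shifted one token ahead (assumed reading of the text)."""
    targets = []
    for next_index in indexes[1:] + [n_out]:
        target = [low] * n_out
        target[next_index - 1] = high
        targets.append(target)
    return targets

statement = [36, 28, 7, 47]   # token indexes of the statement in Fig. 1
for row in input_vectors(statement):
    print(row)
```

The printed rows match the four vectors of Fig. 1(b). Note that a cumulative encoding of this kind cannot distinguish a token that occurs twice in one statement from a token that occurs once.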
4 Training and Testing Data

4.1 Data Set I

Initially, we evaluated our system using data collected from the PHP Nuke portal [7]. Similarly to [1], we installed version 7.5 of this portal, which is susceptible to some SQL injection attacks. A function of the portal related to executing SQL statements was modified. Each time a Web page is downloaded by a browser, SQL queries are sent to a database and simultaneously logged to a file. During operation of the portal we collected nearly 100000 SQL statements. The set of all SQL queries was divided into 12 subsets, each containing SQL statements of a different length. 80% of each data set was used for training and the remaining data were used for examining generalization. The data with attacks are the same as reported in [1].

4.2 Data Set II
To generalize the classification decision process, we conducted experiments using synthetic data collected from an SQL statement generator (see section 5). We considered the same subset of SQL keywords as the subset gathered from the PHP Nuke portal. The generator randomly takes a keyword from the selected subset of SQL keywords, data types and mathematical operators to build an SQL query. We generated 300000 SQL statements. Next, the identical statements were deleted. Finally, our data set contained 7512 attack-free SQL queries and 6871 SQL attacks. The set of all SQL queries was divided into 17 subsets, each containing SQL statements of a different length.
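The data preparation described in this section (removing identical statements, grouping by statement length, and holding out part of each group) might be sketched as follows. The function name, the whitespace-based length measure and the fixed random seed are assumptions made for this sketch; the paper measures length in its own tokens and does not describe how the split was drawn.

```python
import random
from collections import defaultdict

def prepare(statements, train_fraction=0.8, seed=0):
    """Deduplicate, group by length, and split each group into train/test."""
    unique = list(dict.fromkeys(statements))      # drop duplicates, keep order
    by_length = defaultdict(list)
    for statement in unique:
        # Assumption: length = number of whitespace-separated pieces.
        by_length[len(statement.split())].append(statement)
    rng = random.Random(seed)
    subsets = {}
    for length, group in sorted(by_length.items()):
        rng.shuffle(group)
        cut = int(round(train_fraction * len(group)))
        subsets[length] = (group[:cut], group[cut:])
    return subsets

statements = [f"select c{i} from t" for i in range(5)] * 2
statements.append("select c0 from t where x = 1")
subsets = prepare(statements)
```

Each key of `subsets` is a statement length and each value is a `(training, testing)` pair, so one network per subset can be trained on the first part and checked for generalization on the second.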
5 Experimental Results

5.1 Detecting Attacks by RNNs

In all experiments whose results we present in this subsection, we used data set I. In the first experimental stage, we evaluated the best parameters of both RNNs and of the learning algorithm. We ran the experiments 10 times and averaged the results. In the second phase of the experimental study we trained 12 RNNs, one for each training data subset, using the values from the first stage. Figure 2 shows how the error of the networks changes for all subsets of SQL queries and how well the networks are verified. Here, a statement is considered well predicted if, for all input vectors, all neurons in the output layer have values according to the training data. All values presented in the figures are averaged over 10 runs of the RNNs. One can see that for nearly all data subsets the Jordan network outperforms the Elman one. Only for data subsets 11 and 12 is the error of the Jordan network greater than the error of the Elman network. Despite this, the Jordan network is better than the Elman network in terms of the percentage of wrongly predicted SQL queries. Verification states how well a network is trained. In terms of attack detection, the better the verification, the fewer false alarms the system raises. The Jordan network predicts all tokens of statements of length 10 (20.6% false alarms).

Fig. 2. Error and number of wrong predicted SQL queries for each subset of data. Jordan network (a), Elman network (b). Panel titles: Jordan's network performance (a), Elman's network performance (b). Axes: index of training data subset (horizontal); RMS and % no. of wrong predicted SQL queries (vertical).

In the third part of the experiments we checked whether the RNNs correctly detect attacks. Each experiment was conducted using the trained RNNs from the second stage. Figure 3a) presents the typical RNN output when an attack is performed. The left column depicts the number of the input vector for the RNN, while the right column shows the number of cases in which the index of the token indicated by the network output is different from the index of the next token processed by the RNN. What is common for each network is that nearly every output vector of a network has a few errors. This phenomenon is present for all attacks used in this work. Figure 3b) shows the RNN output for verification (the 2nd column) and generalization (the 3rd column). Verification means that the data from the training phase were used as input, while generalization means that the input consisted of data that were not used in the training stage.

a) SQL statement - attack

| index of input vector | num. of errors |
| 1 | 7 |
| 2 | 2 |
| 3 | 1 |
| 4 | 2 |
| 5 | 2 |
| 6 | 1 |
| 7 | 2 |
| 8 | 0 |

b) legal SQL statement

| index of input vector | num. of errors-ver | num. of errors-gen |
| 1 | 0 | 0 |
| 2 | 1 | 1 |
| 3 | 1 | 2 |
| 4 | 0 | 1 |
| 5 | 0 | 1 |
| 6 | 1 | 1 |
| 7 | 1 | 1 |
| 8 | 1 | 0 |

Fig. 3. RNN output for an attack (a), RNN output for known and unknown SQL statement (b)

It is easy to see (figures 3a) and 3b)) that the sum of errors for a legal SQL statement differs strongly from that for an attack. Moreover, when an attack is performed, the number of vectors without errors is lower than the number of error-free vectors constituting a legal SQL query. This easily noticeable difference between an attack and normal activity allows us to re-evaluate the results presented in figure 2.
To distinguish between an attack and a legitimate SQL statement we define the following rule for the Jordan network: an attack occurred if the average number of errors per output vector is not less than 2.0 and at least 80% of the output vectors contain an error. For the Elman network the corresponding coefficients are 1.6 and 90%, respectively. Applying these rules ensures that all attacks are detected by both RNNs. In most cases the Jordan network outperforms the Elman network. Only for data subsets containing statements of 11 and 12 tokens is the Elman network slightly better than the Jordan network. For nearly all training data, the detection error was below 4%. An important outcome of the defined rules is that both RNNs learned all training statements, and only a few legitimate statements that were not in the training set were detected as attacks.
5.2 The Rule Rationale
In this part of the experimental study we examine whether the rule can be applied to the other data set, described in Section 4.2. Here only the Jordan network is used, since it outperforms the Elman network. We trained 17 RNNs, one for each training data subset. Each network was then used to detect attacks and to check whether it correctly classifies legal SQL statements. The rule is composed of two coefficients. With the coefficient values defined in the previous subsection, the results are very poor: in most cases over 70% of attacks are identified as normal activity, while all legal SQL queries are classified correctly. Given such results, it is clear that applying the given values of the two coefficients leads to overtraining, which means that nearly all activity is treated as normal. The rule therefore has to be re-evaluated. Table 2 presents the details of the 2nd training data set and the obtained results. The first column gives the number of tokens constituting the SQL queries and the next column shows the number of SQL statements used in the training phase. The third and fourth columns contain the numbers of attack-free queries and SQL attacks used in the testing phase, respectively. The next two columns give the average number of errors per output vector and the fraction of output vectors that contain an error when the inputs were legal SQL queries. The last two columns give the same two coefficients when attacks were input to the network. The values of the two coefficients clearly differ between attacks and legal data. This feature was also present in the previous experiments, so it is legitimate to apply a discriminating rule to interpret the network output. The only concern is to define appropriate values of the two coefficients.
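The two-coefficient rule can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, "80% of output vectors" is read as "at least 80%", and the error counts are those read from Fig. 3. The text does not state which network produced Fig. 3, so applying the Jordan thresholds to these counts is for illustration only.

```python
def is_attack(errors_per_vector, min_avg_errors, min_error_fraction):
    """Flag a statement as an attack when both coefficients reach their thresholds.

    errors_per_vector holds, for each output vector, the number of wrongly
    predicted token indexes.
    """
    n = len(errors_per_vector)
    avg_errors = sum(errors_per_vector) / n                              # 1st coefficient
    error_fraction = sum(1 for e in errors_per_vector if e > 0) / n      # 2nd coefficient
    return avg_errors >= min_avg_errors and error_fraction >= min_error_fraction


# Thresholds quoted in Section 5.1: (average errors, fraction of vectors with an error).
JORDAN_RULE = (2.0, 0.80)
ELMAN_RULE = (1.6, 0.90)

# Error counts as read from Fig. 3 (an interpretation of the figure).
attack_errors = [7, 2, 1, 2, 2, 1, 2, 0]
legal_verification = [0, 1, 1, 0, 0, 1, 1, 1]
legal_generalization = [0, 1, 2, 1, 1, 1, 1, 0]

for name, errors in [("attack", attack_errors),
                     ("legal (verification)", legal_verification),
                     ("legal (generalization)", legal_generalization)]:
    print(name, is_attack(errors, *JORDAN_RULE))
```

Re-evaluating the rule for another data set then amounts to choosing a different pair of thresholds while keeping the same two coefficients.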
We re-evaluated these values; the new results are presented in Table 3. The estimated values are 1.05 and 0.55 for the first and the second coefficient, respectively. The number of undetected attacks (false negatives) is greater for shorter SQL statements. Conversely, the number of false alarms raised when there is no attack (false positives) is greater for longer SQL queries. Since the most frequent SQL queries contain from 5 to 14 tokens, our detection system is a good proposal for detecting intruders. A deeper analysis of the experimental results for each data subset would lead to a length-specific rule, which should result in a much lower rate of false alarms.

Table 2. Training data and experimental results
length of data subset | SQL queries (training) | legal SQL (testing) | attack SQL (testing) | 1st val. (legal) | 2nd val. (legal) | 1st val. (attacks) | 2nd val. (attacks)
4 | 80 | 20 | 14 | 0.112 | 0.0625 | 3.1 | 0.65
5 | 181 | 46 | 20 | 0.369 | 0.113 | 3.8 | 0.76
6 | 400 | 100 | 136 | 0.245 | 0.136 | 2.32 | 0.658
7 | 400 | 100 | 200 | 0.28 | 0.135 | 2.186 | 0.68
8 | 400 | 100 | 500 | 0.45 | 0.225 | 1.953 | 0.726
9 | 400 | 100 | 500 | 0.487 | 0.222 | 1.815 | 0.732
10 | 400 | 100 | 500 | 0.524 | 0.324 | 1.699 | 0.73
11 | 400 | 100 | 500 | 0.687 | 0.398 | 1.627 | 0.788
12 | 400 | 100 | 500 | 0.66 | 0.421 | 1.746 | 0.822
13 | 400 | 100 | 500 | 0.71 | 0.467 | 1.487 | 0.8
14 | 400 | 100 | 500 | 0.626 | 0.43 | 1.667 | 0.871
15 | 400 | 100 | 500 | 1.054 | 0.499 | 1.91 | 0.848
16 | 400 | 100 | 500 | 0.752 | 0.499 | 1.652 | 0.861
17 | 400 | 100 | 500 | 0.795 | 0.539 | 1.635 | 0.871
18 | 396 | 100 | 500 | 0.877 | 0.511 | 1.713 | 0.911
19 | 289 | 72 | 500 | 0.741 | 0.547 | 1.685 | 0.886
20 | 230 | 58 | 500 | 0.736 | 0.574 | 1.648 | 0.923

Table 3. Effectiveness of attack detection
length of SQL query | false negative | false positive
4 | 14.28 % | 0 %
5 | 0 % | 2.1 %
6 | 17.64 % | 1 %
7 | 1 % | 0 %
8 | 7.2 % | 1 %
9 | 7.2 % | 1 %
10 | 8.6 % | 2 %
11 | 11.2 % | 6 %
12 | 2.6 % | 12 %
13 | 6.8 % | 9 %
14 | 2.8 % | 8 %
15 | 1 % | 29 %
16 | 1.8 % | 13 %
17 | 3.4 % | 21 %
18 | 1 % | 19 %
19 | 1.6 % | 9.7 %
20 | 1.8 % | 12.06 %
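The percentages in Table 3 relate to the test set sizes in Table 2 by simple arithmetic, which the following sketch illustrates. The helper names are hypothetical, and the counts of misclassified statements are back-calculated from the reported rates; they are an assumption, not figures reported in the paper.

```python
def rate_percent(misclassified, total):
    """Percentage of misclassified items in a test set."""
    return 100.0 * misclassified / total


def implied_count(rate, total):
    """Number of misclassified items implied by a reported percentage."""
    return round(rate * total / 100.0)


# (attack queries tested, reported false negative %) from Tables 2 and 3
# for statement lengths 4, 6 and 8.
attacks = {4: (14, 14.28), 6: (136, 17.64), 8: (500, 7.2)}

for length, (n_attacks, fn_rate) in attacks.items():
    missed = implied_count(fn_rate, n_attacks)  # back-calculated, not reported in the paper
    print(length, missed, round(rate_percent(missed, n_attacks), 2))
```

For the small subsets (lengths 4 and 5) a single misclassified statement changes the rate by several percentage points, which should be kept in mind when comparing rows of Table 3.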
6 Conclusions
In this paper we have presented results of ongoing research on intrusion detection in databases. The detection problem was transformed into a time series prediction problem, and two RNNs were examined to show their potential use for this class of attacks. It turned out that the Jordan network is easily trained by the BPTT algorithm. Despite the large RNN architecture used, the network is able to predict sequences of length up to ten with an acceptable error margin. A deeper analysis of the experimental results led to the definition of rules for distinguishing between an attack and a legitimate statement. When these rules are applied, both networks are completely trained for all SQL queries included in all training subsets. The accuracy of the results depends very strongly on the rules. The large number of SQL queries in the 2nd data set used in the experimental study allows us to assume that the re-evaluated rule can be widely used. An advisable next step is to apply the defined rule to another data set covering a greater part of the SQL language, which could confirm the efficiency of the proposed approach to detecting SQL attacks. Our future plans also include the development of classification algorithms that could replace the rule with higher efficiency.
Optimization of Parallel FDTD Computations Using a Genetic Algorithm
Adam Smyk¹ and Marek Tudruj¹,²
¹ Polish-Japanese Institute of Information Technology, 86 Koszykowa Str., 02-008 Warsaw, Poland
² Institute of Computer Science, Polish Academy of Sciences, 21 Ordona Str., 01-237 Warsaw, Poland
{asmyk,tudruj}@pjwstk.edu.pl
Abstract. In this paper, we discuss the optimization of numerical computations for the FDTD problem in multiprocessor environments. We present the use of a genetic algorithm to find the best partition of the program macro data flow graph (MDFG) for a given FDTD problem executed by a set of processors. Different sub-graph merging actions are applied successively in each step of the merging algorithm, which starts from a program data flow graph representation. A special kind of chromosome represents the consecutive steps of the graph partitioning algorithm to be applied to the current version of the macro data flow graph. To compare the quality of individuals, we estimate the total execution time of each output MDFG obtained after applying the actions specified by the algorithm that the individual represents. To estimate the efficiency of computations we used an architectural model that can represent parallel computations with 3 different communication protocols (MPI, RDMA RB, SHMEM). Experimental results obtained by simulation are presented.
1 Introduction
There are many numerical applications that can be described by irregular data patterns or that use irregular data structures [9]. Good examples are simulation applications, some linear algebra problems and VLSI unit design applications [2]. In order to execute them efficiently in a parallel system, we usually need to decompose the relevant data [7] and perform computations on separate computational nodes. This decomposition has a big influence on the quality (execution time) of the computations. Many techniques exist for performing this operation well; finding an optimal decomposition is an NP-complete problem [4]. Generally, they can be divided into two groups: direct techniques (based on min-cut optimization) [6] and iterative techniques [5,6,7,8], which are usually based on the Kernighan-Lin algorithm [7], later developed by Fiduccia and Mattheyses [3]. The algorithm presented in this paper is partially based on the last two methods, but it exploits a new technique in the form of a genetic algorithm. Genetic algorithms (GA) are a well-known and popular method widely used for search and optimization problems in many scientific and engineering areas [8,12].
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 559–569, 2008. © Springer-Verlag Berlin Heidelberg 2008
We used this idea in a specific way to find an optimal partitioning of the parallel program graph for a given computational system. More precisely, the genetic algorithm does not directly transform the program graph representation. Instead, it helps us to find an optimal graph transformation algorithm that is able to partition an input program data flow graph corresponding to a given instance of the application problem. This requires an explicit definition of the rules describing each step performed during the graph partitioning. Of course, all genetic operators usually used by a GA, such as selection, crossover and mutation, must also be defined. We used this GA-based approach to find an optimal partitioning of the data flow graph that represents computations for the Finite Difference Time Domain (FDTD) method. This method enables simulation of high-frequency electromagnetic wave propagation by solving Maxwell's equations [11].

[Figure 1: an irregular computational area covered by a mesh of physical points (discrete simulation points and computational cells forming the computational mesh), and the corresponding macro data flow graph with macro nodes Mp0 to Mp7 shown for Step i and Step i+1.]
Fig. 1. Irregular computational area with its MDFG for two computational steps
The simulated area is a two-dimensional, irregular shape (see Fig. 1). Before simulation, the whole computational area (CA) is transformed into a physical mesh. In a two-dimensional FDTD problem, the mesh consists of a specified number of points that alternately contain the electric component Ez of the electromagnetic field and one of the two magnetic components, Hx or Hy (depending on the coordinates). After transformation of two Maxwell equations into a differential form [11], we can formulate the dependencies between the Ez, Hx and Hy components. The value of Ez depends on the values of the four nearest magnetic field components (Hx and Hy). To calculate the value of a magnetic component, we need the two nearest electric field components. The computational process is divided into a given number of steps. Each step is divided into two substeps: first the values of all Ez are computed, and then the values of Hx and Hy are computed. If the shape of the computational area is regular (e.g. rectangular), the whole computational process can be easily parallelized (e.g. by stripe or block partitioning of the computational area). For an irregular computational area such a decomposition is more complicated, because we have to take into account both a proper load of all processing nodes and a minimal number of data transmissions. To solve this problem, we have implemented the FDTD method program according to a macro data flow paradigm. First, we created a data flow graph of the computations to show the basic data dependencies in the program. Next, we used a data flow node merging algorithm to define macro nodes assigned to separate processors (Mpn, where p is a processor number and n is a simulation sub-area identifier). It is described in detail in [11]. The computation macro data flow graph (for two computational steps) for the hypothetical simulation area is shown in Fig. 1. Each computational step in each sub-area is represented by one macro node. A macro node can be executed only if all external input data have arrived at the physical processor to which this macro node has been mapped. Data dependencies are given by edges that connect macro nodes. All edges carry weights that give the amount of data sent from one macro node to another. This amount depends directly on the length of the boundary line between the two adjacent sub-areas that correspond to Mpx and Mpy. Each Mp produces results that are sent to other Mps when its computations are finished.
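The data dependencies of one computational step (first all Ez values, each from its four nearest magnetic components, then all Hx and Hy values, each from its two nearest electric components) can be sketched as follows. This is a minimal illustration on a small rectangular grid: the staggered array layout, the function names and the single update coefficient `c` are assumptions made for the example and do not reproduce the implementation described in [11].

```python
def make_grid(nx, ny):
    """Allocate zero-initialised field arrays on a staggered rectangular grid."""
    ez = [[0.0] * ny for _ in range(nx)]
    hx = [[0.0] * (ny - 1) for _ in range(nx)]   # hx[i][j] lies between ez[i][j] and ez[i][j+1]
    hy = [[0.0] * ny for _ in range(nx - 1)]     # hy[i][j] lies between ez[i][j] and ez[i+1][j]
    return ez, hx, hy


def fdtd_step(ez, hx, hy, c=0.5):
    """One computational step, split into the two substeps described in the text."""
    nx, ny = len(ez), len(ez[0])
    # Substep 1: every interior Ez uses its four nearest magnetic components.
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            ez[i][j] += c * ((hy[i][j] - hy[i - 1][j]) - (hx[i][j] - hx[i][j - 1]))
    # Substep 2: every magnetic component uses its two nearest Ez values.
    for i in range(nx):
        for j in range(ny - 1):
            hx[i][j] -= c * (ez[i][j + 1] - ez[i][j])
    for i in range(nx - 1):
        for j in range(ny):
            hy[i][j] += c * (ez[i + 1][j] - ez[i][j])


ez, hx, hy = make_grid(5, 5)
ez[2][2] = 1.0          # a single excited point in the middle of the mesh
fdtd_step(ez, hx, hy)
```

After one step only the four magnetic components adjacent to the excited point are non-zero, which is exactly the locality that makes the boundary length between sub-areas determine the amount of data exchanged between macro nodes.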
2 Partitioning Using a Genetic Algorithm
In this part, the main assumptions of the partitioning algorithm are described. The general idea of using the genetic algorithm in MDFG optimization is presented in Fig. 2. Before the genetic algorithm starts, we prepare an initial MDFG (IMDFG) for the given FDTD computation. It represents the computations and the communication pattern in the given computational area, and it is obtained from an initial data flow graph. First, we identify leader nodes for further node merging. The initial macro nodes are created by merging data flow nodes around the leaders. The number of leaders is usually much bigger than the assumed number of processors in the given computational system. The genetic algorithm is used to find the best algorithm for transforming the IMDFG into an optimal MDFG, i.e. one that provides the minimal execution time of the FDTD program for a given number of processors and a given type of data communication. In the first step of our optimization algorithm, we create an initial population that contains an assumed number of individuals. Each individual (chromosome) represents an instance of the algorithm for merging nodes in the macro data flow graph of the simulation and constitutes a sequence of node merging actions. At the beginning of each iteration, the same IMDFG is assigned to all individuals in the population. Each individual works on the same data, but different merging rules are applied. In our case, a single chromosome (Fig. 3) contains integer values that are identifiers of the merging rules described in Table 1. The execution of a merging rule merges two chosen macro data flow nodes (according to the criteria implemented in the rule). If two or more pairs of macro nodes fulfil the selection criteria of the applied rule, the algorithm randomly chooses one pair. This is done to reduce the complexity of the algorithm.
To obtain the assumed cardinality of the MDFG partition, the length of a chromosome (that is, the total number of merging steps) is given by the following formula, since every merging step reduces the number of macro nodes by one:
ChromosomeLength = InitialNumberOfPartitionsInIMDFG - GivenNumberOfProcessors
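How a chromosome of this length drives the partitioning can be sketched as follows. This is only an illustrative sketch under stated assumptions: the graph is a hypothetical chain of six macro nodes with made-up loads, only two rules are implemented, "the two least loaded adjacent nodes" is interpreted as the adjacent pair with the smallest combined load, and ties are broken deterministically here rather than randomly as in the text.

```python
def merge(load, adj, a, b):
    """Merge macro node b into macro node a: sum the loads, join the neighbour sets."""
    load[a] += load.pop(b)
    for n in adj.pop(b):
        if n != a:
            adj[n].discard(b)
            adj[n].add(a)
            adj[a].add(n)
    adj[a].discard(b)


def rule_least_loaded_pair(load, adj):
    """Assumed reading of MR0: the adjacent pair with the smallest combined load."""
    pairs = [(load[a] + load[b], a, b) for a in adj for b in adj[a] if a < b]
    _, a, b = min(pairs)
    return a, b


def rule_most_with_least(load, adj):
    """Assumed reading of MR1: the most loaded node and its least loaded neighbour."""
    a = max((n for n in adj if adj[n]), key=lambda n: load[n])
    b = min(adj[a], key=lambda n: load[n])
    return a, b


RULES = {0: rule_least_loaded_pair, 1: rule_most_with_least}


def apply_chromosome(chromosome, load, adj):
    """Run the chain of merging rules on a private copy of the initial graph."""
    load = dict(load)
    adj = {n: set(neigh) for n, neigh in adj.items()}
    for gene in chromosome:
        a, b = RULES[gene](load, adj)
        merge(load, adj, a, b)
    return load, adj


# Hypothetical IMDFG: a chain of six macro nodes with their computational loads.
initial_load = {0: 4, 1: 1, 2: 2, 3: 3, 4: 1, 5: 5}
initial_adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
processors = 2
chromosome_length = len(initial_load) - processors   # 6 - 2 = 4 merging steps
chromosome = [0, 0, 1, 0]

final_load, final_adj = apply_chromosome(chromosome, initial_load, initial_adj)
```

Because every individual starts from a copy of the same initial graph, different chromosomes can be compared directly on the partitions they produce.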
[Figure 2: flow chart of the method. START; initial computational area with fine-grain partitioning; population of initial or selected individuals (from the previous iteration); merging operations according to the given chromosomes (e.g. 114151176771...232, 1020102100110...114, 176771114151...002), yielding computational areas with coarse partitioning; fitness value evaluation for each individual; next iteration? If yes: selection of individuals for the new population, crossover operation, mutation operation. If no: STOP.]
Fig. 2. General overview of the GA partitioning method
As we can see, the chromosome length defines the number of merging operations (rules) performed in one iteration of the GA. A chromosome thus determines a chain of merging operations for each individual. The initial values of a chromosome can be set in several different ways:
– randomly: each merging rule can be used with the same probability in one iteration,
– a single value is set for the whole chromosome: one chosen rule is used during a whole iteration,
– one, two or three values are set for the whole chromosome.
The most important part of the GA is the evaluation of the fitness values of the individuals and, simultaneously, of the fitness of the whole population for solving the given problem. Fitness must be computed at the end of each iteration; it enables selection (we used tournament selection) of the best candidates for the reproduction process. Examples of the fitness functions that have been used in our GA are presented below:
– the execution time of the partitioned macro data flow graph: this requires an architectural model that can be used to perform a simulated graph execution, and it favours individuals that give a shorter execution time,
– the difference between the maximally and minimally loaded macro data flow nodes: it favours candidates with the most equally loaded macro nodes,
– the number of cut edges in the obtained DFG partition: it favours individuals that represent a partitioning with the shortest borders.
In the last step of each iteration, the algorithm enters a genetic operations phase. We used parallel genetic operators (crossover and mutation) to choose and reproduce the best candidates (according to the fitness value).

Table 1. Merging rules
Id | Rule priority | Description
MR0 | Computational load balancing | The two least loaded adjacent nodes will be merged
MR1 | Computational load balancing | The most loaded node will be merged with the least loaded adjacent node
MR2 | Computational load balancing | The least loaded node will be merged with the most loaded adjacent node
MR3 | Communication optimisation - edge cut reduction | The two most loaded adjacent nodes will be merged
MR4 | Communication optimisation - edge cut reduction | The node with the biggest communication volume will be merged with its neighbour with the biggest communication volume
MR5 | Communication optimisation - edge cut reduction | The node with the smallest communication volume will be merged with its neighbour with the biggest communication volume
MR6 | Communication optimisation - edge cut reduction | The node with the smallest communication volume will be merged with its neighbour with the smallest communication volume
MR7 | Communication optimisation - edge cut reduction | The node with the biggest communication volume will be merged with its neighbour with the biggest communication volume
MR8 | Computational load balancing with edge cut reduction | The least loaded node will be merged with the adjacent node with the biggest communication volume
MR9 | Computational load balancing with edge cut reduction | The least loaded node will be merged with the adjacent node with the lowest communication volume
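Two of the fitness functions listed above, the fitness of the whole population and tournament selection can be sketched as follows. This is a minimal sketch, not the authors' code: the function names are hypothetical, lower fitness is assumed to be better, and the tournament size `k` is an assumption since the text does not give it.

```python
import random


def load_imbalance(load):
    """Difference between the maximally and minimally loaded macro nodes."""
    return max(load.values()) - min(load.values())


def cut_edges(edges, part_of):
    """Number of data flow edges whose end points lie in different macro nodes."""
    return sum(1 for u, v in edges if part_of[u] != part_of[v])


def population_fitness(fitness_values):
    """Fitness of the whole population: the sum of the individual fitness values."""
    return sum(fitness_values)


def tournament_select(population, fitness_values, rng, k=2):
    """Return the best (lowest fitness) of k randomly drawn individuals."""
    contestants = rng.sample(range(len(population)), k)
    winner = min(contestants, key=lambda i: fitness_values[i])
    return population[winner]


# Small hypothetical example.
load = {0: 7, 5: 9}                          # loads of two macro nodes after merging
edges = [(0, 1), (1, 2), (2, 3)]             # data flow edges
part_of = {0: "A", 1: "A", 2: "B", 3: "B"}   # macro node of every data flow node
population = [[0, 0, 1], [1, 1, 0], [0, 1, 0]]
fitness_values = [3.0, 1.0, 2.0]

parent = tournament_select(population, fitness_values, random.Random(1))
```

The execution-time fitness is not shown because it depends on the architectural model used for simulated graph execution.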
[Figure 3: a single chromosome drawn as a sequence of integers (e.g. 1 0 2 0 1 0 2 1 0 0 1 1 0 ... 1 1 4 1 5 1 1 7 6 7 7 1) indexed from 0; each value is a merging rule signature that selects one of Merging rule 0 to Merging rule N.]
Fig. 3. Description of single chromosome
3 Experimental Results
The algorithms described above have been implemented and tested for different shapes of the simulation area and for nine system configurations. We introduced a simple architectural model [1] for a homogeneous multiprocessor system, which is described by two values: CompSpeed (processor computational speed) and CommSpeed (communication performance). In our experiments we defined 9 types of systems based on combinations of CompSpeed, determined by the processor performance, and CommSpeed, determined by the type of communication facility used in the system (see Table 2).

Table 2. Architectural model parameters
Symbol | CompSpeed of a single node | CommSpeed between two nodes
FF | Fast (1 GFlops) | Fast (Shared memory)
FM | Fast (1 GFlops) | Medium (RDMA RB)
FS | Fast (1 GFlops) | Slow (MPI)
MF | Medium (0.3 GFlops) | Fast (Shared memory)
MM | Medium (0.3 GFlops) | Medium (RDMA RB)
MS | Medium (0.3 GFlops) | Slow (MPI)
SF | Slow (0.02 GFlops) | Fast (Shared memory)
SM | Slow (0.02 GFlops) | Medium (RDMA RB)
SS | Slow (0.02 GFlops) | Slow (MPI)
We performed several experiments to test our genetic algorithm. The description of each test is presented in Table 3. Our goal was to check the convergence of the algorithm and the quality of the partitioning produced for a given shape of the computational area. As described in the previous section, we applied several fitness functions to estimate the quality of each individual, but in our experiments the best results were obtained for the fitness defined as the difference between the maximally and minimally loaded macro data flow nodes of the individual. The fitness values of all individuals are added up to give a fitness value for the whole tested population (FFP). It is an indicator of the convergence of the genetic algorithm (see Fig. 4). As shown later, the FFP value has a significant impact on the total execution time of the MDFG obtained during the partitioning phase. All experiments were done for a constant population size (50 individuals) and a constant number of iterations (600). Any increase of the population size lengthened the execution time without any significant improvement in the quality of the obtained partitioning.

Table 3. Test descriptions
Test symbol | Description
Test1 | Test performed for the following (manually chosen) merging rules: MR0, MR1, MR2, MR3
Test2 | Test performed for all merging rules
Test3 | Test performed for all merging rules with random re-selection of the best 5% of individuals
Test4 | Test performed for all merging rules with random re-selection of the best 5% of individuals; additionally, another 5% of individuals are replaced by the best individual
Test5 | Test performed for all merging rules with random re-selection of the best 1% of individuals; additionally, another 1% of individuals are replaced by the best individual
Test6 | Test performed for all merging rules with re-selection: 5% of individuals are replaced by the best individual in the population
Test7 | Test performed for merging rules MR0, MR1, MR4, MR7, MR8 (manually chosen) with re-selection: 5% of individuals are replaced by the best individual in the population
Test8 | Test performed for merging rules MR0, MR1 (manually chosen) with re-selection: 5% of individuals are replaced by the best individual in the population
The genetic algorithm starts from a totally randomly generated population. During the selection operation we chose the best 50% of individuals and performed crossover (one-point and two-point crossover operators were used) and mutation operations (mutation probabilities were set from 1/10 to 1/5). Additionally, for all tests we measured two other parameters of the resulting partitioned MDFGs: the CutMin parameter, which represents the number of edges that connect two computational nodes assigned to different macro nodes after partitioning (see Fig. 5, left), and the execution time for a chosen architectural model (see Fig. 5, right). Our first test (Test1) was performed only for the first four merging rules (see Table 3). As we can see from Fig. 4 (left), the GA convergence in this case is very slow (especially after iteration 350). When we applied the other rules as well (Test2, which takes into consideration both computational and communication load), the convergence was slightly better, but in both cases the execution time was similar. An additional interesting observation is that although the execution time is better for Test2, Test2 is characterized by a higher CutMin value. This can be explained by the fact that, for the tested computational area, proper load balancing is much more important than communication optimization.

[Figure 4: two plots of the fitness value for the whole population (FFP) against the iteration number (0 to 600); left: Test1 to Test4, right: Test5 to Test8.]
Fig. 4. Genetic algorithm convergence results
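The genetic operators mentioned above (one-point and two-point crossover, and mutation) can be sketched for chromosomes stored as lists of merging rule identifiers. This is a minimal sketch under stated assumptions: the function names are hypothetical, and applying the mutation probability independently to every gene is an assumption, since the text only gives the range of probabilities (1/10 to 1/5).

```python
import random


def one_point_crossover(p1, p2, rng):
    """Swap the tails of two parents after one random cut point."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]


def two_point_crossover(p1, p2, rng):
    """Swap the middle segment between two random cut points."""
    i, j = sorted(rng.sample(range(1, len(p1)), 2))
    return p1[:i] + p2[i:j] + p1[j:], p2[:i] + p1[i:j] + p2[j:]


def mutate(chromosome, n_rules, p_mut, rng):
    """Replace each gene by a random rule identifier with probability p_mut."""
    return [rng.randrange(n_rules) if rng.random() < p_mut else g
            for g in chromosome]


rng = random.Random(0)
parent_a = [0] * 8          # hypothetical parents: all MR0 and all MR1
parent_b = [1] * 8
child_a, child_b = one_point_crossover(parent_a, parent_b, rng)
child_c, child_d = two_point_crossover(parent_a, parent_b, rng)
mutant = mutate(child_a, n_rules=10, p_mut=0.1, rng=rng)
```

Both crossover variants keep the chromosome length unchanged, so every offspring still encodes exactly the required number of merging steps.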
To speed up the GA convergence we introduced a modification. If the value of the fitness function has not changed during the last 6 iterations, we apply a more aggressive change to the population obtained after the selection operation: a re-selection operation. We choose 5% (or 1%, depending on the test) of individuals and replace them either by another 5% (1%) of randomly generated individuals, or by copies of one of the best individuals in this iteration. This is done to push the GA out of a stable state that does not bring any better results for the current population. As we can see for Test3 and Test4, the first re-selection operation is executed after 100 and 200 iterations, respectively. In the first case (Test3, with random re-selection), the behavior of the GA is very disappointing and we can observe divergence from the expected solution. The execution time in this case is much worse than for any other test. When we applied a mix of the two re-selection methods (random and best-individual re-selection) in Test4, the results were much better, but still only comparable to Test1 and Test2. Slightly better results were obtained when the level of aggression of the re-selection operation was reduced to 1% (see Test5). The best results, however, were obtained (for Test6, Test7 and Test8) when we totally eliminated the random re-selection operation and kept only the replacement of 5% of individuals by the best individual. The convergence was significantly better in comparison to the other tests. At the same time, as we can see in Fig. 5, for the last 3 tests we obtained the minimal CutMin value and the minimal execution time.

[Figure 5: two bar charts over Test1 to Test8; left: CutMin value, right: execution time.]
Fig. 5. Results of the CutMin value (left) and the execution time (right) for all tests

[Figure 6: stacked bar chart of the percentage use of merging rules MR0 to MR9 in Test1 to Test8.]
Fig. 6. The percentage use of the merging rules in each test
In Fig. 6 we present the percentage use of the merging rules (see Table 1) in all tests. We measured how often each rule was used in the last iteration of the GA. An interesting observation can be made for the last three tests. The result (CutMin) for Test7 is better than for Test6, but the execution time of the MDFGs produced in Test6 and Test7 is almost the same. We have to remember that Test6 was performed for all available merging rules, whereas Test7 used only those rules that are intuitively expected to bring the best results. This shows that the GA can find a good solution (in terms of execution time) and eliminate the influence of the “wrong” merging rules. If we compare Test7 and Test8, we can see that the execution time of the MDFG produced in Test8 is much shorter, but the CutMin value is better for the MDFG from Test7. This means that for the tested computational area and the assumed number of processors in the computational system (4 in this case), correct load balancing of all partitions is much more important than communication optimization. The CutMin value for Test7 is definitely better than for Test8 because the merging rules applied in Test7 also included communication-oriented rules, whereas in Test8 we focused only on the rules concerning computational load balancing. As we can see from Fig. 6, the most utilized rule was MR0, which merges the two least loaded adjacent nodes. It is the dominant rule in all performed tests. This shows that this kind of merging (MR0) is the most efficient way to find the MDFG partitioning. However, as our tests have shown, to improve the GA convergence and to shorten the execution time of the FDTD application program for the obtained partitioning, the MR0 rule must be supported by other rules (e.g. MR1, MR4, MR7, MR8).
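The re-selection operator used in this section can be sketched as follows. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, lower fitness is assumed to be better, the replaced individuals are drawn at random (the text does not say how they are chosen), and "not changed during the last 6 iterations" is read as seven equal consecutive fitness values.

```python
import random


def stagnated(history, window=6):
    """True if the fitness value has not changed during the last `window` iterations."""
    return len(history) > window and len(set(history[-(window + 1):])) == 1


def reselect(population, fitness_values, rng, fraction=0.05, mode="best", n_rules=10):
    """Replace a fraction of the population by copies of the best individual
    (mode="best") or by randomly generated individuals (mode="random")."""
    n_replace = max(1, int(len(population) * fraction))
    best_index = min(range(len(population)), key=lambda i: fitness_values[i])
    best = population[best_index]
    replaced = rng.sample(range(len(population)), n_replace)
    new_population = [list(ind) for ind in population]
    for idx in replaced:
        if mode == "best":
            new_population[idx] = list(best)
        else:
            new_population[idx] = [rng.randrange(n_rules) for _ in best]
    return new_population


# Hypothetical population of 50 two-gene chromosomes; the last one is the best.
population = [[i % 10, (i // 10) % 10] for i in range(50)]
fitness_values = list(range(50, 0, -1))
history = [120.0] * 7                      # unchanged fitness over the last 6 iterations

if stagnated(history):
    population_after = reselect(population, fitness_values, random.Random(0))
```

With a population of 50 and a 5% fraction, two individuals are replaced per re-selection, which matches the best-individual variant used in Test6 to Test8 only in spirit, since the exact selection of replaced individuals is assumed here.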
4 Conclusions
In this paper we have presented a methodology for designing and optimizing an MDFG partitioning algorithm for FDTD computations. The main element of our method is a genetic algorithm, which finds a proper sequence of merging rules to be applied during consecutive steps of the graph partitioning algorithm. We tested our genetic algorithm with 10 different merging rules and discussed the quality of the resulting MDFGs. Our results have shown that the genetic algorithm can find a proper solution, but to increase its efficiency and speed up its convergence, some modifications are needed. Besides standard genetic operators such as selection, crossover and mutation, we also used a re-selection operator, which moves the genetic algorithm out of local optima and significantly increases the quality of the solution.
Modular Type-2 Neuro-fuzzy Systems

Janusz Starczewski (1,2), Rafał Scherer (1,2), Marcin Korytkowski (1,3), and Robert Nowicki (1,2)

1 Department of Computer Engineering, Częstochowa University of Technology, al. Armii Krajowej 36, 42-200 Częstochowa, Poland, http://kik.pcz.pl
2 Department of Artificial Intelligence, Academy of Humanities and Economics in Lodz, ul. Rewolucji 1905 nr 64, Łódź, Poland, http://www.wshe.lodz.pl
3 Olsztyn Academy of Computer Science and Management, ul. Artyleryjska 3c, 10-165 Olsztyn, Poland, http://www.owsiiz.edu.pl
{jasio,marcink,rnowicki}@kik.pcz.czest.pl, [email protected]
Abstract. In this paper we study a modular system which can be converted into a type-2 neuro-fuzzy system. The rule base of such a system consists of triangular type-2 fuzzy sets. The modular structure is trained using the backpropagation method combined with the AdaBoost algorithm. The trained modular structure is then converted into a compressed form, namely a type-2 neuro-fuzzy system. This allows us to overcome the training problem of type-2 neuro-fuzzy systems. An illustrative example is given to show the efficiency of our approach in classification problems.
1 Introduction

In order to improve accuracy, classifiers can be combined into modular structures. By combining adaptive intelligent systems, the model robustness and accuracy are nearly always improved compared with single-model solutions. For the purpose of tuning a modular structure, well-known learning meta-algorithms such as bagging and boosting are used. The key is that these algorithms assign weights to learning data according to how well the earlier tuned classifiers performed on them. Therefore, all subsystems in the ensemble are tuned with different data sets. Fuzzy Logic Systems (FLSs) give good results in classification. In the last decade FLSs have been extended to describe a higher order of fuzziness by the use of type-2 fuzzy sets [2],[5],[6]. In contrast to the classical fuzzy set (of type-1), which is a set of pairs of elements and their membership grades, the type-2 fuzzy set has membership grades which are themselves ordinary type-1 fuzzy sets in the unit interval [0, 1]. Type-2 fuzzy sets are thus characterized by an imprecise or uncertain membership function. Type-2 FLSs can effectively solve the class of problems in which the knowledge used to generate rules is imprecise, incomplete or uncertain.
This work was supported in part by the Foundation for Polish Science (Professorial Grant 2005-2008) and the Polish Ministry of Science and Higher Education (Special Research Project 2006-2009) and by science funds for 2007-2010 as research project Nr N N516 1155 33.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 570–578, 2008. © Springer-Verlag Berlin Heidelberg 2008
In this paper we will combine type-2 fuzzy logic with modular classifier structures. Fuzzy logic subsystems are trained with the AdaBoost algorithm and the backpropagation method. Then the ensemble of classifiers is aggregated to a form of triangular type-2 FLS. Owing to this conversion we bypass the complex and computationally expensive problem of tuning type-2 fuzzy logic systems with non-interval fuzzy sets.
2 Type-2 Fuzzy Logic Systems

First, we formalize a type-2 fuzzy set as a generalization of a fuzzy set whose membership grades are fuzzy subsets of the unit interval. The fuzzy set of type-2, $\tilde{A}$, in the real line $\mathbb{R}$, is a set characterized by a membership function $\mu_{\tilde{A}} : \mathbb{R} \to F([0,1])$, where $F([0,1])$ is the set of all classical fuzzy sets in the unit interval $[0,1]$. That is, each $x \in \mathbb{R}$ is associated with a secondary membership function $f_x \in F([0,1])$, which is a mapping $f_x : [0,1] \to [0,1]$. In this paper, secondary memberships are expressed by triangle functions. Extending the notion of FLSs to type-2, the type-2 FLS consists of:
– a type-2 fuzzy rule base,
– an inference engine modified to deal with type-2 fuzzy sets,
– a type reducer,
– a type-1 defuzzifier,
– and optionally a type-2 fuzzifier of inputs.
In the following we describe one of the simplest type-2 FLSs, which does not have a fuzzification component and whose rule base is composed of type-2 fuzzy antecedents and singleton fuzzy consequents. Thus the rule base contains $K$ rules in the form of fuzzy relations:

$\tilde{R}_k$: IF $x_1$ is $\tilde{A}_{k,1}$ and $x_2$ is $\tilde{A}_{k,2}$ and $\cdots$ and $x_N$ is $\tilde{A}_{k,N}$ THEN $y$ is $y_k$

where $x_n$ is the $n$-th input variable, $\tilde{A}_{k,n}$ is the $n$-th antecedent fuzzy set of type-2, and $y_k$ is the $k$-th rule consequent, $n = 1, \ldots, N$, $k = 1, \ldots, K$. Fuzzy rule relations are usually expressed by the intersection operator, which in this case is realized by an extended t-norm. A description of methods for calculating extended t-norms can be found in [2],[6],[8]. Assuming no fuzzification and singleton consequents $y_k$, the inference engine produces the type-2 fuzzy conclusion according to the following formula:

$$\tilde{\tau}_k(y_k) = \tilde{T}_{n=1}^{N}\, \mu_{\tilde{A}_{k,n}}(x_n) \tag{1}$$
The well known Center Average (CA) defuzzification has been extended according to Zadeh’s principle [9] to deal with type-2 fuzzy sets. This extended defuzzification, called type-reduction [2],[6], transforms type-2 fuzzy conclusions into a type-1 fuzzy set, which can be finally defuzzified into a crisp output value. The neuro-fuzzy realization of the triangular type-2 FLS is presented in [8].
3 Modular Fuzzy Logic Systems

We use Mamdani-type fuzzy systems as a single classifier $t$ in the ensemble

$$h_t = \frac{\sum_{k=1}^{K_t} y_{t,k}\, \tau_{t,k}}{\sum_{k=1}^{K_t} \tau_{t,k}} \tag{2}$$

where $\tau_{t,k} = T_{n=1}^{N}\, \mu_{A_{k,n}}(x_n)$ is the activation level of the fuzzy rule $k = 1, \ldots, K_t$ in classifier $t = 1, \ldots, T$. To build the modular FLS we use the AdaBoost algorithm, which is the most popular boosting method [4],[7]. Let us denote the $l$-th learning vector by $z^l = [x_1^l, \ldots, x_n^l, y^l]$, where $l = 1, \ldots, m$ is the number of a vector in the learning sequence, $n$ is the dimension of the input vector $x^l$, and $y^l$ is the learning class label. Weights assigned to learning vectors have to fulfill the following conditions

$$\text{(i)}\ \ 0 < D^l < 1, \qquad \text{(ii)}\ \ \sum_{l=1}^{m} D^l = 1. \tag{3}$$
The weight $D^l$ carries information on how well the classifiers learned the given input vector $x^l$ in consecutive steps of the algorithm. The vector $D$ for all input vectors is initialized according to the following equation

$$D_t^l = \frac{1}{m}, \quad \text{for } t = 0, \tag{4}$$

where $t$ is the number of a boosting iteration (and the number of a classifier in the ensemble). Let $\{h_t(x) : t = 1, \ldots, T\}$ denote the set of hypotheses obtained in consecutive steps $t$ of the algorithm being described. For simplicity we limit our problem to binary classification (dichotomy), i.e. $y \in \{-1, 1\}$ or $h_t(x) = \pm 1$. Similarly to the weights of the learning vectors, we assign a weight $c_t$ to every hypothesis, such that

$$\text{(i)}\ \ \sum_{t=1}^{T} c_t = 1, \qquad \text{(ii)}\ \ c_t > 0. \tag{5}$$
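Now in the AdaBoost algorithm we repeat steps 1-4 for $t = 1, \ldots, T$:

1. Create hypothesis $h_t$ and train it with a data set with respect to a distribution $D_t$ for input vectors.

2. Compute the classification error $\varepsilon_t$ of the trained classifier $h_t$ according to the formula

$$\varepsilon_t = \sum_{l=1}^{m} D_t(z^l)\, I(h_t(x^l) \neq y^l), \tag{6}$$

where $D_t(z^l)$ is the weight $D_t^l$ of the learning vector $z^l$ and $I$ is the indicator function

$$I(a \neq b) = \begin{cases} 1 & \text{if } a \neq b, \\ 0 & \text{if } a = b. \end{cases} \tag{7}$$

If $\varepsilon_t = 0$ or $\varepsilon_t \geq 0.5$, stop the algorithm.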
3. Compute the value

$$\alpha_t = 0.5 \ln \frac{1 - \varepsilon_t}{\varepsilon_t}. \tag{8}$$

4. Modify the weights of the learning vectors according to the formula

$$D_{t+1}(z^l) = \frac{D_t(z^l) \exp\{-\alpha_t\, I(h_t(x^l) = y^l)\}}{N_t}, \tag{9}$$

where $I(h_t(x^l) = y^l)$ equals 1 for a correctly classified vector and 0 otherwise, and $N_t$ is a constant such that $\sum_{l=1}^{m} D_{t+1}(z^l) = 1$.

To compute the overall output of the ensemble of classifiers trained by the AdaBoost algorithm, the following formula is used

$$h(x) = \sum_{t=1}^{T} c_t\, h_t(x), \tag{10}$$

where

$$c_t = \frac{\alpha_t}{\sum_{t=1}^{T} |\alpha_t|} \tag{11}$$

is the classifier importance for a given training set. The AdaBoost algorithm is a meta-learning algorithm and does not determine the way of learning for classifiers in the ensemble.
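The boosting loop of Eqs. (4)-(11) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weak learner `train_stump` is a hypothetical single-feature threshold rule standing in for the neuro-fuzzy classifiers trained by backpropagation, and the toy data set is invented for the example.

```python
import math

def train_stump(X, y, D):
    """Hypothetical weak learner (not the paper's neuro-fuzzy system):
    the single-feature threshold rule with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for sign in (1, -1):
                err = sum(d for d, x, t in zip(D, X, y)
                          if (sign if x[j] >= thr else -sign) != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    _, bj, bthr, bsign = best
    return lambda x: bsign if x[bj] >= bthr else -bsign

def adaboost(X, y, T):
    m = len(X)
    D = [1.0 / m] * m                                    # Eq. (4)
    hyps, alphas = [], []
    for _ in range(T):
        h = train_stump(X, y, D)                         # step 1
        eps = sum(d for d, x, t in zip(D, X, y) if h(x) != t)  # Eq. (6)
        if eps == 0 or eps >= 0.5:                       # stopping rule of step 2
            break
        alpha = 0.5 * math.log((1.0 - eps) / eps)        # Eq. (8)
        # Numerator of Eq. (9): only correctly classified vectors are down-weighted.
        D = [d * math.exp(-alpha * (1.0 if h(x) == t else 0.0))
             for d, x, t in zip(D, X, y)]
        N = sum(D)                                       # normalizing constant N_t
        D = [d / N for d in D]
        hyps.append(h)
        alphas.append(alpha)
    total = sum(abs(a) for a in alphas)
    c = [a / total for a in alphas] if alphas else []    # Eq. (11)
    return hyps, c

def ensemble_output(hyps, c, x):
    return sum(ct * h(x) for ct, h in zip(c, hyps))      # Eq. (10)

# Invented one-dimensional toy problem that no single stump can solve.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [1, 1, -1, -1, 1, 1]
hyps, c = adaboost(X, y, T=3)
```

Because every retained classifier has $0 < \varepsilon_t < 0.5$, each $\alpha_t$ is positive, so the weights $c_t$ computed this way satisfy both conditions in (5).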
4 From Modular to Type-2 Fuzzy Logic Systems

After almost ten years of using type-2 FLSs, the problem of tuning the parameters of such systems is still important. Neither gradient methods nor evolutionary strategies give learning convergence as satisfactory as in type-1 FLSs. This paper presents another alternative for parameter tuning. Since tuning a modular type-1 FLS gives good results, we can compose a type-2 FLS from a modular system. In the sequel we present the procedure for such a composition.

4.1 Choice of Principal Subsystem

Suppose we have four subsystems of a modular FLS with weights $c_t$ collected in the following vector form:

$$c = [0.5, 0.3, 0.1, 0.1]^T. \tag{12}$$

Then we have to rescale all coefficients into fuzzy weights such that the maximal one is equal to one,

$$q = \frac{c}{\max(c)} = [1, 0.6, 0.2, 0.2]^T. \tag{13}$$

The weight of maximal certainty, i.e., $q_t = 1$, indicates the principal subsystem $h_{pr}$, which will be the base for the construction of a triangular type-2 FLS in the following subsections. We assume that there exists only one principal subsystem; otherwise the procedure would lead to the construction of a trapezoidal type-2 FLS. We also assume that the modular subsystems have singleton consequents.
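As a small illustration of Eqs. (12)-(13), the rescaling and the choice of the principal subsystem might be coded as follows. The function names `fuzzy_weights` and `principal_index` and the tolerance `tol` are assumptions made for this sketch only.

```python
def fuzzy_weights(c):
    """Rescale classifier weights so that the largest equals one, Eq. (13)."""
    c_max = max(c)
    return [ct / c_max for ct in c]

def principal_index(q, tol=1e-12):
    """Index of the unique subsystem with q_t = 1 (the principal subsystem)."""
    winners = [t for t, qt in enumerate(q) if abs(qt - 1.0) <= tol]
    if len(winners) != 1:
        # Several maximal weights would correspond to the trapezoidal case,
        # which this sketch does not cover.
        raise ValueError("expected exactly one principal subsystem")
    return winners[0]

c = [0.5, 0.3, 0.1, 0.1]      # Eq. (12)
q = fuzzy_weights(c)          # approximately [1, 0.6, 0.2, 0.2]
pr = principal_index(q)       # 0, i.e. the first subsystem
```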
4.2 Consequents’ Neighborhood of Principal Subsystem

In subsystem $s$, we have to find the closest neighbors of the $k$-th rule consequent $y_{pr,k}$ of the principal subsystem $pr$. In the ideal situation there might be only one closest consequent, of the same value as $y_{pr,k}$. In the other, usual cases there are two closest consequents: $y_{s,left}$ and $y_{s,right}$. We may assign to the left consequent and the right one the weights $\alpha_k$ and $1 - \alpha_k$, respectively, such that $y_{pr,k}$ is the weighted average of the neighbor consequents. Therefore,

$$\alpha_k = \frac{y_{pr,k} - y_{s,right}}{y_{s,left} - y_{s,right}}. \tag{14}$$

Fig. 1 illustrates the calculation of $\alpha_k$ for neighbor consequents of the second subsystem. Obviously, when only one closest neighbor exists, $\alpha_k = 1$.

Fig. 1. Calculation of $\alpha_k$ for second subsystem neighbours
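A minimal sketch of the neighbor search and of Eq. (14) is given below. The function name `neighbour_weights` and the numeric values are illustrative, not taken from Fig. 1. The case in which $y_{pr,k}$ lies outside the range of the subsystem's consequents is not discussed in the text; treating the nearest consequent as the only neighbor there is an assumption of this sketch.

```python
def neighbour_weights(y_pr, consequents):
    """Return (y_left, y_right, alpha) for the consequents of subsystem s
    that are closest to the principal consequent y_pr, following Eq. (14)."""
    left = [y for y in consequents if y <= y_pr]
    right = [y for y in consequents if y >= y_pr]
    if not left or not right:
        # Assumption: y_pr outside the consequent range, use the nearest one.
        nearest = min(consequents, key=lambda y: abs(y - y_pr))
        return nearest, nearest, 1.0
    y_left, y_right = max(left), min(right)
    if y_left == y_right:
        # Only one closest neighbour, of the same value as y_pr.
        return y_left, y_right, 1.0
    alpha = (y_pr - y_right) / (y_left - y_right)
    return y_left, y_right, alpha

# Hypothetical consequents of the second subsystem and a principal consequent.
y_left, y_right, alpha = neighbour_weights(6.0, [1.0, 4.0, 9.0])
```

With these values the weighted average $\alpha_k\, y_{s,left} + (1 - \alpha_k)\, y_{s,right}$ reproduces the principal consequent, which is the defining property of $\alpha_k$.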
The calculation has to be carried out for all rules of the principal subsystem, $k = 1, \ldots, K$.

4.3 Antecedents’ Neighborhood of Principal Subsystem

The coefficients $\alpha_k$ and $1 - \alpha_k$ act as weights for the second subsystem from the point of view of a particular rule. The weighted average of the second subsystem's rules is the one closest to the principal subsystem rule. Consequently, we have to use the same weights for the antecedent part of the rule base. The averaged antecedent membership function has the following form:

$$\mu_{s,k,n}(x_n) = \alpha_k\, \mu_{s,left,n} + (1 - \alpha_k)\, \mu_{s,right,n}, \tag{15}$$

where $x_n$ is the $n$-th input, $\mu_{s,k,n}$ is the averaged antecedent membership function of the $s$-th subsystem for the $k$-th principal rule and $n$-th input, $\mu_{s,left,n}$ is the membership function of the antecedent corresponding to the left consequent neighbor of $y_{pr,k}$, and $\mu_{s,right,n}$ is the membership function of the antecedent corresponding to the right consequent neighbor of $y_{pr,k}$. The averaged antecedent membership function is presented in Fig. 2.
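Equation (15) is a pointwise convex combination of two membership functions, which can be sketched as below. The Gaussian shape and the helper names `gaussian` and `averaged_membership` are assumptions for illustration, and the centers and widths are invented.

```python
import math

def gaussian(m, sigma, h=1.0):
    """Gaussian membership function of height h, center m and width sigma."""
    return lambda x: h * math.exp(-0.5 * ((x - m) / sigma) ** 2)

def averaged_membership(alpha, mu_left, mu_right):
    """Averaged antecedent membership function, Eq. (15)."""
    return lambda x: alpha * mu_left(x) + (1.0 - alpha) * mu_right(x)

# Hypothetical antecedents attached to the left and right neighbour rules.
mu_left = gaussian(8.0, 2.0)
mu_right = gaussian(12.0, 2.0)
mu_avg = averaged_membership(0.5, mu_left, mu_right)
value_at_10 = mu_avg(10.0)    # both components equal exp(-1/2) at x = 10
```

Note that the average of two Gaussians with different centers is in general no longer Gaussian, which is consistent with the bold dashed curve described in the caption of Fig. 2.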
Fig. 2. Averaging of the antecedent membership function with the use of $\alpha_k$; the antecedent membership function for the principal subsystem (solid line); left and right antecedent membership functions for the subsystem of fuzzy weight $q_t = 0.6$ (thin dashed lines); the averaged membership function (bold dashed line)

Fig. 3. The principal membership function and averaged antecedent membership functions of single input for one rule and for all subsystems
Calculations of $\mu_{s,k,n}(x_n)$ have to be repeated for all inputs, for all rules and for all subsystems of the modular FLS. An example of the principal membership function and the averaged antecedent membership functions of a single input for one rule and for all subsystems is presented in Fig. 3. Fuzzy weights of subsystems are indicated on the vertical axis.

4.4 Construction of Type-2 Antecedent Membership Function

In Fig. 4, the membership function of the principal subsystem and the corresponding averaged membership functions of the other subsystems are presented. These functions serve for the construction of a triangular type-2 fuzzy set, whose secondary membership functions are triangles. For this purpose we take into consideration three points:
– the point of maximal value of the principal membership function,
– two additional points of the Gaussian principal membership function.

Fig. 4. The principal membership function and averaged antecedent membership functions

Fig. 5. Linear regression for secondary membership function at $x_i = m_{pr} = 10$

With regard to the maximal principal membership grade, we propose a vertical linear regression of subsystem weights. Using the case presented in Fig. 4, we choose all membership grades along the line $x_i = 10$. Then, assigning to each membership grade its subsystem weight, we determine a linear regression with a constraint at point $(1, 1)$. The resultant secondary membership function evidently has to be bounded in the truth interval $[0, 1]$. Let the minimal nonzero membership value be denoted by $h_{pr}$. With regard to the other two points, horizontal linear regressions of subsystem weights can be determined. Using the case presented in Fig. 4, we choose all values of $x_i$ along the line $\mu = h \exp\left(-\frac{1}{2}\right)$. Then, we plot four lines of regression with constraints where the secondary memberships reach 1.
Fig. 6. Linear regression in vertical direction

Fig. 7. Triangular type-2 antecedent

Note that a Gaussian membership function of height $h$ is expressed by $\mu(x) = h \exp\left(-\frac{1}{2}\left(\frac{x - m}{\sigma}\right)^2\right)$. For $\mu(x) = h \exp\left(-\frac{1}{2}\right)$, there are two solutions, $x_1 = m + \sigma$ and $x_2 = m - \sigma$, which may be considered as additional characteristic points of the Gaussian membership function. Based on the triangle functions presented in Figs. 5 and 6, we describe a triangular type-2 fuzzy set, as in Fig. 7. This type-2 fuzzy set is bounded from above and from below by piecewise Gaussian functions. This ends the construction of the triangular type-2 FLS.
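The text does not spell out how the constrained regression is computed. One plausible reading, sketched below as an assumption, is an ordinary least-squares line forced through the anchor point $(1, 1)$, fitted to pairs (membership grade, subsystem weight). The names `slope_through_anchor`, `triangle_foot` and `gaussian_characteristic_points`, as well as the sample data, are hypothetical.

```python
def slope_through_anchor(points, anchor=(1.0, 1.0)):
    """Least-squares slope b of the line y = y0 + b * (x - x0) that is
    constrained to pass through the anchor point (x0, y0)."""
    x0, y0 = anchor
    num = sum((x - x0) * (y - y0) for x, y in points)
    den = sum((x - x0) ** 2 for x, _ in points)
    if den == 0.0:
        raise ValueError("all points coincide with the anchor abscissa")
    return num / den

def triangle_foot(slope, anchor=(1.0, 1.0)):
    """Grade at which the fitted line reaches secondary membership 0,
    clipped to the truth interval [0, 1]. A non-positive slope never
    reaches zero below the anchor; returning 0.0 then is an assumption."""
    x0, y0 = anchor
    if slope <= 0.0:
        return 0.0
    return min(1.0, max(0.0, x0 - y0 / slope))

def gaussian_characteristic_points(m, sigma):
    """The two solutions of mu(x) = h * exp(-1/2) for a Gaussian."""
    return m - sigma, m + sigma

# Invented (membership grade, fuzzy weight) pairs of non-principal subsystems.
b = slope_through_anchor([(0.5, 0.0), (0.75, 0.5)])
foot = triangle_foot(b)
x2, x1 = gaussian_characteristic_points(10.0, 2.0)
```

The same fitting routine could be reused for the horizontal regressions by swapping the roles of the coordinates and moving the anchor to the point where the secondary membership reaches 1.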
5 Numerical Experiment

We test classification accuracy on the Glass Identification problem [1]. The goal is to classify 214 instances of glass into window and non-window classes based on 9 numeric inputs. We set aside 43 instances as a validation set. Using the AdaBoost algorithm we obtained 5 neuro-fuzzy structures, summarized in Table 1. All subsystems had 2 rules, with the implication and the Cartesian product calculated by the algebraic product. By combining the ensemble of subsystems we obtained 94.99% classification accuracy, while after the transformation into the triangular type-2 neuro-fuzzy system the accuracy diminished only to 94.49%.

Table 1. Numerical Results of Glass Identification

                          FLS 1   FLS 2   FLS 3   FLS 4   FLS 5   Modular FLS   Type-2 FLS
No. of epochs             20      30      50      100     100
c_t                       0.13    0.01    0.17    0.34    0.35
classification rate [%]   88.37   76.74   90.06   97.67   97.75   94.99         94.49
6 Conclusion

The paper presents a method for converting modular neuro-fuzzy systems into triangular type-2 fuzzy logic systems. Through this conversion we reduced the modularity of the system. From another point of view, we applied a new technique for training type-2 neuro-fuzzy systems, bypassing gradient methods, which are hard to apply to type-2 fuzzy logic systems. The Glass Identification experiment shows superior performance of the proposed approach compared with results reported in the literature.
References

1. Blake, C.L., Merz, C.J.: UCI Repository of machine learning databases. Irvine, University of California, Department of Information and Computer Science (1998), www.ics.uci.edu/~mlearn/MLRepository.html
2. Karnik, N.N., Mendel, J.M., Liang, Q.: Type-2 Fuzzy Logic Systems. IEEE Trans. on Fuzzy Systems 7, 643–658 (1999)
3. Korytkowski, M., Rutkowski, L., Scherer, R.: On Combining Backpropagation with Boosting. In: 2006 International Joint Conference on Neural Networks, IEEE World Congress on Computational Intelligence, Vancouver, BC, Canada (2006)
4. Meir, R., Rätsch, G.: An Introduction to Boosting and Leveraging. In: Mendelson, S., Smola, A.J. (eds.) Advanced Lectures on Machine Learning. LNCS (LNAI), vol. 2600, pp. 118–183. Springer, Heidelberg (2003)
5. Mendel, J.M.: Advances in type-2 fuzzy sets and systems. Information Sciences 177, 84–110 (2007)
6. Mendel, J.M.: Uncertain Rule-Based Fuzzy Logic Systems: Introduction and New Directions. Prentice Hall PTR, Upper Saddle River (2001)
7. Schapire, R.E.: A brief introduction to boosting. In: Proc. of the Sixteenth International Joint Conference on Artificial Intelligence, pp. 1401–1406 (1999)
8. Starczewski, J.T.: A Triangle Type-2 Fuzzy Logic System. In: 2006 International Joint Conference on Neural Networks, IEEE World Congress on Computational Intelligence, Vancouver, BC, Canada (2006)
9. Zadeh, L.A.: The Concept of a Linguistic Variable and its Application to Approximate Reasoning - I. Information Sciences 8, 199–249 (1975)
Evolutionary Viral-type Algorithm for the Inverse Problem for Iterated Function Systems

Barbara Strug (2), Andrzej Bielecki (1), and Marzena Bielecka (3)

1 Institute of Computer Science, Jagiellonian University, Nawojki 11, 30-072 Krakow, Poland
2 Faculty of Physics, Astronomy and Applied Computer Science, Jagiellonian University, Reymonta 4, 30-059 Krakow, Poland
3 Department of Geoinformatics and Applied Computer Science, Faculty of Geology, Geophysics and Environmental Protection, AGH University of Science and Technology, Al. Mickiewicza 30, 30-059 Krakow, Poland
[email protected], [email protected], [email protected]
Abstract. In this article we discuss the possibility of enriching evolutionary algorithms with a mechanism characteristic of the replication of influenza viruses. The genetic material of the influenza type A virus consists of eight separate segments. In some types of tasks such a structure of a genome can be more adequate than a representation that consists of only one sequence. If different influenza virus strains infect the same cell, their RNA segments can mix freely, producing progeny viruses. Furthermore, mistakes leading to new mutations are common. An evolutionary algorithm for solving the inverse problem for iterated function systems (IFSes) for a two-dimensional image is proposed. Four patterns are considered as examples, and the results of a preliminary statistical analysis are also presented.
1 Introduction

Genetic algorithms were first researched by Holland [6]. He named a number of elements that must be present in such a system: a search space (simulating the environment), a population of solutions (individuals), a method for changing the individuals in the population and a measure of the quality of each solution (fitness function). Holland used two main operators to transform populations. In recent years the evolutionary approach has been applied to many fields, and new ideas in evolutionary computation are being sought. This includes the search for new operators that model the evolutionary process more closely. In this paper a new type of crossover operator, based on the mechanism of viral replication known as reassortment, is proposed. This operator leads to a hierarchical evolutionary algorithm. This paper is a continuation of our previous research [4,5,3,2] on finding an Iterated Function System (IFS) that can be used to generate a given two-dimensional target image, the so-called inverse problem. Such a problem may be considered a search problem, where the search space consists of all possible IFSs. The main difficulty lies in the size of such a search space.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 579–588, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Theoretical Basis

2.1 Iterated Function Systems
The first systematic accounts of iterated function schemes seem to be those of Hutchinson and Barnsley [1,7]. The theory is based on contractive mappings on metric spaces. Here we briefly recall the essential definitions and results. Let $(X, d)$ be a metric space. A transformation $g : X \to X$ is called a contractive mapping if there exists a constant $s \in [0, 1)$ such that $d(g(x), g(y)) \leq s \cdot d(x, y)$ for each $x, y \in X$. The number $s$ is called the contractive factor of $g$. If the point $x \in X$ is such that $g(x) = x$, then it is called a fixed point of the mapping $g$. Let $(H(X), h)$ be the space of nonempty compact subsets of the space $X$ with the Hausdorff metric $h$ on it, generated by the metric $d$. Let $g_i : X \to X$, $i = 1, \ldots, n$, be a set of contractive mappings on $(X, d)$. Let us define $F_i : H(X) \to H(X)$, $i = 1, \ldots, n$, as $F_i(A) = \{g_i(p) : p \in A\}$. It can be easily shown that if $g_i$ is a contractive mapping on the space $X$ with a contractive factor $s_i$, then $F_i$ is contractive on $H(X)$ with the same contractive factor. The Barnsley operator $F : H(X) \to H(X)$ is defined in the following way:

$$F(A) = \bigcup_{i=1}^{n} F_i(A).$$
If every $g_i$, $i = 1, \ldots, n$, and as a consequence every $F_i$, is a contractive mapping with the contractive factor $s_i$, then the Barnsley operator $F$ is contractive on $(H(X), h)$ with the contractive factor $s = \max\{s_1, \ldots, s_n\}$. Thus, for an IFS defined as a finite set of contractive mappings on a metric space, IFS := $\{(X, d), g_1, \ldots, g_n\}$, the Barnsley operator is uniquely determined and is continuous, being a contractive mapping. The Barnsley operator on the Hausdorff space satisfies the assumptions of the Banach Contraction Principle. Therefore, the operator has a unique fixed point, say $A^*$, and the sequence of iterations of the operator converges to $A^*$ for every starting point $A \in H(X)$. Formally, $F^0(A) := A$, $F^n(A) := F(F^{n-1}(A))$ and $\lim_{n \to \infty} F^n(A) = A^*$ for each $A \in H(X)$. Thus the fixed point $A^*$ is the global attractor of $F$. An important property of any IFS, especially in graphical applications, is the fact that the attractor is independent of the starting point $A$. This means that the attractor $A^*$ is fully defined by the set of mappings $g_i$ constituting the operator $F$ and can be generated iteratively by the Barnsley operator from any starting point, including a one-element set $\{x\} \in H(X)$. By the continuity of the Barnsley operator, its attractor depends continuously on the operator. This means in turn that small changes in the parameters of the mappings $g_i$ result in small changes of the attractor. Finding an IFS whose attractor is a good approximation of a given computer image is a non-trivial task. For images which can be easily decomposed into separate parts, such that each one is an image of the whole attractor transformed by one of the mappings $g_i$, the Collage Theorem can be effectively applied [1].
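The iteration $F^n(A)$ and its independence of the starting set can be illustrated with a short sketch. The three maps below (similarities with factor 0.5 that generate a Sierpinski triangle) and all function names are assumptions chosen for illustration. The Hausdorff distance is computed by brute force, which is adequate only for small finite sets.

```python
import math

def make_contraction(cx, cy, s=0.5):
    """Similarity g(p) = s*p + (1-s)*(cx, cy), with contractive factor s
    and fixed point (cx, cy)."""
    return lambda p: (s * p[0] + (1.0 - s) * cx, s * p[1] + (1.0 - s) * cy)

def barnsley_step(A, maps):
    """One application of the Barnsley operator: the union of the images
    of the finite point set A under all mappings."""
    return {g(p) for g in maps for p in A}

def iterate(A, maps, n):
    for _ in range(n):
        A = barnsley_step(A, maps)
    return A

def hausdorff(A, B):
    """Hausdorff distance between two finite point sets (brute force)."""
    def directed(P, Q):
        return max(min(math.dist(p, q) for q in Q) for p in P)
    return max(directed(A, B), directed(B, A))

maps = [make_contraction(0.0, 0.0),
        make_contraction(1.0, 0.0),
        make_contraction(0.0, 1.0)]

start_a, start_b = (0.3, 0.7), (0.9, 0.1)
A5 = iterate({start_a}, maps, 5)
B5 = iterate({start_b}, maps, 5)
gap = hausdorff(A5, B5)   # bounded by 0.5**5 times the initial distance
```

After five iterations the two point sets, started from different single points, differ by at most $s^5$ times the initial distance, which is how contractivity makes the attractor independent of the starting set.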
If the analyzed image is more complex and cannot be decomposed in this way, no effective algorithm able to find an IFS for such an image is known. This paper deals with such a problem. By adjusting parameters, we can move closer to an IFS whose attractor is similar enough to a desired image. The existence of such an IFS is guaranteed by the Collage Theorem [1]. In computer graphics and image processing, IFSes defined on the Euclidean plane with affine functions are used most often:

$$g_i(p) = g_i(x_1, x_2) = (a_i x_1 + b_i x_2 + e_i,\; c_i x_1 + d_i x_2 + f_i) = \begin{bmatrix} a_i & b_i \\ c_i & d_i \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} e_i \\ f_i \end{bmatrix}.$$

In such a case both the Barnsley operator and its attractor are fully defined by the set of parameters $\{a_i, b_i, c_i, d_i, e_i, f_i,\ i = 1, \ldots, n\}$.

2.2 Mechanism of Replication of Influenza Type A Virus
Following [10], let us describe the mechanism of replication of an influenza A virus. The life cycle and genomic structure of the influenza A virus allow it to evolve and exchange genes very easily. Unlike in classical genetic algorithms, in which genomes consist of a single segment of DNA ([6]), the viral genetic material consists of eight separate RNA segments encased in a lipid membrane. To reproduce, the virus enters a living cell, where it commandeers cellular machinery, inducing it to manufacture new viral proteins and additional copies of viral RNA. As no proof-reading mechanism exists to ensure that the RNA copies are accurate, mistakes leading to new mutations are common. If two different influenza virus strains infect the same cell, their RNA segments can mix freely there, producing progeny viruses that contain a combination of genes from both original viruses. Such a mechanism, called reassortment of viral genes, is important for generating diverse new strains. The following properties of the presented replication process should be distinguished as key ones when considering the possibility of its application to evolutionary algorithms.
– The genetic material consists of eight separate fragments, not of one long segment.
– The fragments are mutually independent and unordered.
– No fragment has any special meaning.
– A new virus can be created by sampling eight genetic segments from various viruses.
– The probability of mutation is high.

The above mentioned mechanism would be effective for a class of problems showing a number of properties:
1. The phenotype has a hierarchical structure: it consists of a set of elements. The order of these elements has no importance. An individual element is coded as a single string, and a genotype consists of a set of such strings.
2. Different types of phenotypes may exist, each one having a different number of elements. The elements belonging to the phenotype are of the same type. Thus the length of strings is constant for all elements.
3. It is not known a priori which type of genotype contains a solution. Thus during the run of the evolutionary process different types of genotypes (of different degree), and phenotypes, can appear.
4. The exchange of genetic material is possible on different levels of hierarchy.

An inverse problem for IFS has all the properties described above. The IFS is represented by a number of affine mappings, the order of which has no importance. The number of mappings can be different for different IFSs, and we do not know a priori how many mappings are needed to describe an IFS for a given image. Moreover, the genetic information can be exchanged on at least three different levels:
1. the exchange of whole mappings between different individuals,
2. the exchange of coefficients of mappings,
3. the crossover of the coded representation of coefficients.
3 Evolutionary Algorithms

3.1 Initialization

The process starts by randomly generating the pool of singels. Then the first population is produced by choosing singels from the pool (universum). This process is based on the vector of probabilities of degrees $VD = \{p_1, p_2, \ldots, p_n\}$, where $p_i$ is the probability that a produced individual is of degree $i$ and $n$ is the maximum degree allowed in the representation.

3.2 Evaluation

In this paper the fitness function of an IFS is based on the relative point coverage of the target image. To evaluate how well the attractor covers an image we count two kinds of differences: the number of points that are in the image but not in the attractor, $N_{ND}$ (points not drawn), and the number of points present in the attractor but not in the image, $N_{NN}$ (points not needed). If we denote by $N_A$ the number of points in the attractor and by $N_I$ the number of points in the image, then $RC = N_{ND}/N_I$ is the relative coverage of the attractor and $RO = N_{NN}/N_A$ measures how many points of the attractor are outside the image. Thus the smaller each of these values, the better the solution. To combine both values we use the fitness function defined as $(1 - RC) + (1 - RO)$. This function has to be maximized. The points produced by an IFS outside the image rectangle are also classified as points not needed, which leads to low fitness and fast elimination of such functions. In this fitness function equal importance is given to the number of points drawn correctly and the number of points drawn outside. To first eliminate IFSes that draw too many points outside the image and only then concentrate on improving point coverage, a modified version $p_{rc} \cdot (1 - RC) + p_{ro} \cdot (1 - RO)$ has to be used, with the parameters $p_{rc}$ and $p_{ro}$ changed accordingly.
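The fitness computation described above can be sketched on sets of pixel coordinates. The representation of the image and the attractor as Python sets of integer pairs, the function name `fitness`, and the treatment of an empty attractor (taking $RO = 1$) are assumptions of this sketch. With `p_rc = p_ro = 1` it reduces to the unweighted form $(1 - RC) + (1 - RO)$.

```python
def fitness(attractor, image, p_rc=1.0, p_ro=1.0):
    """Weighted fitness p_rc*(1 - RC) + p_ro*(1 - RO) of an attractor
    against a non-empty target image, both given as sets of pixels."""
    n_nd = len(image - attractor)      # points not drawn
    n_nn = len(attractor - image)      # points not needed
    rc = n_nd / len(image)
    # Assumption: an empty attractor is treated as lying fully outside.
    ro = n_nn / len(attractor) if attractor else 1.0
    return p_rc * (1.0 - rc) + p_ro * (1.0 - ro)

# Invented 2x2 target image and an attractor that misses two pixels
# and draws one pixel outside the image rectangle.
image = {(0, 0), (0, 1), (1, 0), (1, 1)}
attractor = {(0, 0), (0, 1), (5, 5)}
score = fitness(attractor, image)          # 0.5 + 2/3
perfect = fitness(set(image), image)       # 2.0, the maximum for p_rc = p_ro = 1
```

Pixels drawn outside the image rectangle need no special handling here, because they are simply members of the attractor set that are absent from the image set.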
3.3 Genetic Operators
The next generation in the evolutionary process is produced by applying genetic operators to elements of the current population. In order to use the proposed new representation we have to appropriately modify traditional operators like mutation and recombination (crossover). As the mutation operator a traditional mutation is used but it is applied to each singel independently. For our representation different types of crossover can be defined. We use a so called arithmetic crossover which takes two selected IFSes - parents p1 and p2 and produces two offsprings c1 = p1 · (1 − a) + p2 · a, c2 = p1 · a + p2 · (1 − a), where a ∈ [0, 1]. The second recombination operator used in our algorithm is a vector one-point (u) (u) (u) crossover. It takes two vectors from parent population u = g1 , g2 , . . . gk (v) (v) (v) (q) (q) (q) (q) (q) (q) (q) and v = g1 , g2 , . . . , gk , where gi = ai , bi , ci , di , ei , fi and q ∈ {u, v}, i ∈ {1, . . . , k} and exchanges the parts divided at a random point in a way based on standard one-point crossover in GA. Both above mentioned crossover operators are applied to the IFSes consisting of the same number of singels (functions) i.e. belonging to the same ”species”, where by species all IFSes with the same number of functions are meant. In addition to these two operators we also use an inter-species operator that is used to recombine individuals consisting of different number of singels. The crossover operator used here is similar to the traditional one-point crossover but it takes into account the variable number of singels in each individual’s representation. Such an operator, called generalized crossover operator or inter-species crossover, takes two individuals, p1 and p2 and produces two offsprings, c1 and c2 . It starts by selecting one singel from p1 and one from p2 , then these singels are crossed over by a one-point crossover and resulting ones are added to the respective offsprings. 
This step is repeated until the offspring have the same degrees as their parents. The offspring are then placed in the new population. Note that this operator does not create individuals of species other than those to which the parents belong. The number of offspring generated by each of these operators is controlled by global parameters tuned experimentally.

In addition to these modified versions of traditional operators, new operators can be introduced that make full use of the representation. Here we propose a reassortment operator modelling the viral replication described in section 2.2. It first selects two individuals $e_1$ and $e_2$ of degree $d_1$ and $d_2$, respectively. Randomly selected singels of the two individuals are then exchanged to produce two offspring, one of degree $d_1$ and the other of degree $d_2$.

An operator of self-creation is also proposed. It adds to the new population a number of individuals generated from the singels in the pool in the same way as during the initialization process, but using the updated vector VD. The proportion of individuals added by this operator depends on the overall fitness of the current population: the higher the population fitness, the fewer individuals are created from the universum.
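The operators of this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: an IFS is assumed to be a list of singels, each singel a tuple of the six coefficients $(a, b, c, d, e, f)$; the arithmetic crossover is assumed to pair singels by position; and the number of singels exchanged by the reassortment is an assumption, since the text only says that randomly selected singels are exchanged. All function names are hypothetical.

```python
import random

# An IFS is modelled as a list of singels; a singel is a tuple (a, b, c, d, e, f).

def arithmetic_crossover(p1, p2, a):
    """c1 = p1*(1-a) + p2*a and c2 = p1*a + p2*(1-a), coefficient by coefficient.

    ASSUMPTION: parents have the same degree and singels are paired by position.
    """
    assert len(p1) == len(p2)
    c1 = [tuple(x * (1 - a) + y * a for x, y in zip(s1, s2)) for s1, s2 in zip(p1, p2)]
    c2 = [tuple(x * a + y * (1 - a) for x, y in zip(s1, s2)) for s1, s2 in zip(p1, p2)]
    return c1, c2

def vector_one_point_crossover(u, v, rng=random):
    """Exchange the tails of two equal-length singel vectors at a random cut point."""
    assert len(u) == len(v) and len(u) >= 2
    cut = rng.randint(1, len(u) - 1)  # cut strictly inside the vector
    return u[:cut] + v[cut:], v[:cut] + u[cut:]

def reassortment(e1, e2, rng=random):
    """Swap randomly chosen singels between two individuals of possibly different degree.

    The offspring keep the degrees of e1 and e2.
    ASSUMPTION: the number of swapped singels is drawn uniformly from 1..min(d1, d2).
    """
    c1, c2 = list(e1), list(e2)
    n = rng.randint(1, min(len(c1), len(c2)))
    idx1 = rng.sample(range(len(c1)), n)
    idx2 = rng.sample(range(len(c2)), n)
    for i, j in zip(idx1, idx2):
        c1[i], c2[j] = c2[j], c1[i]
    return c1, c2
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when the proportions of the operators are compared experimentally.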
B. Strug, A. Bielecki, and M. Bielecka
3.4
Modified Evolutionary Algorithm
Besides operators and other elements, a number of parameters must be set for an evolutionary algorithm. Let $U$ be the size of the universum, i.e. the number of singels generated at the beginning of the process. Offspring can be created by four different processes: by genetic recombination using crossover within a species, by the inter-species crossover, by the self-creation operator and by the reassortment operator. The number of new individuals generated by recombination is denoted by $N_1$, by the self-creation operator by $N_2$, by the inter-species crossover by $N_3$ and by the reassortment operator by $N_4$. Moreover, a vector $VC = \{c_1, c_2, \ldots, c_r\}$ keeps the number of individuals belonging to each species; $c_i$ denotes the number of individuals of degree $i$ in the current population.

As there are two within-species recombination operators, the arithmetic crossover and the vector crossover, the probability of offspring being generated by each of them has to be set. Thus $p_a$ denotes the probability of the arithmetic crossover and $p_v$ the probability of the vector crossover. Moreover, $p_m$ denotes the probability of mutation.

Elitist selection usually preserves the best individual of a population. As we have individuals of different degrees, seen as different species, we allow more than one individual to be preserved: the best individual of each species is selected and its fitness is compared to a predefined threshold $Th$. In the inter-species crossover we also use a distance from species $s_1$ to $s_2$, defined as the difference between the numbers of singels that individuals of $s_1$ and $s_2$ consist of. The aim of this feature is to make crossing over individuals from different species less likely the bigger the distance between their species.
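The per-species elitism and the distance-dependent choice of a partner species might be sketched as follows. The text does not give the exact weighting, so the formula used here (the VD probability of a species divided by one plus its distance from the first species) is purely an assumption, as are the function names and the representation of VD as a dictionary from degree to probability.

```python
import random

def species_distance(degree_a, degree_b):
    """Distance between two species: the difference in their numbers of singels."""
    return abs(degree_a - degree_b)

def elite_per_species(population, fitness, threshold):
    """Best individual of each degree, kept only if its fitness is above the threshold.

    An individual is a list of singels, so its degree is len(individual).
    """
    best = {}
    for ind in population:
        d = len(ind)
        if d not in best or fitness(ind) > fitness(best[d]):
            best[d] = ind
    return [ind for ind in best.values() if fitness(ind) > threshold]

def second_species_weights(first_degree, vd):
    """Selection weights for the partner species in the inter-species crossover.

    ASSUMPTION: weight = VD probability / (1 + distance), so that more distant
    species are chosen less often. The source does not state the formula.
    """
    return {deg: p / (1 + species_distance(first_degree, deg))
            for deg, p in vd.items() if deg != first_degree}

def pick_second_species(first_degree, vd, rng=random):
    """Draw the partner species; requires at least two species in vd."""
    weights = second_species_weights(first_degree, vd)
    degrees = list(weights)
    return rng.choices(degrees, weights=[weights[d] for d in degrees], k=1)[0]
```

Any other monotonically decreasing function of the distance would serve the same stated purpose; the reciprocal form is chosen here only because it needs no extra parameter.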
Moreover, as we do not know the number of functions needed to generate a given image, the process starts by generating a population with a number of species of low degree and possibly a few individuals of higher degree. Then, depending on its average fitness and its size, a species may be removed. When a species is removed, a new one is created. The degree of the species to be added is determined by the fitness of the existing species and the distance.

Algorithm
1. Generate a genetic universum, i.e. an initial set of singels.
2. Generate the first population using an initial VD and the self-creation operator.
3. For each individual in the current population calculate its fitness function.
4. Adapt the vector of probability distribution VD proportionally to the fitness value of the best individual of each degree.
5. Generate the new empty population.
6. Put into the new population the best individual of each degree if its fitness value is above the threshold $Th$.
7. Generate $N_1$ individuals in the following way:
   – select a species according to vector VD;
   – select two individuals according to fitness proportional selection;
   – select the operator to produce offspring (either arithmetic or vector crossover);
   – produce two offspring by the selected operator;
   – put the produced offspring into the new population.
8. Generate $N_2$ individuals by the self-creation operator using the adapted VD and add them to the new population.
9. Generate $N_3$ individuals in the following way:
   – select a species according to vector VD;
   – select a second species according to vector VD and the distance from the first species;
   – select two individuals according to fitness proportional selection, one from each selected species;
   – produce two offspring by the inter-species crossover operator;
   – put the produced offspring into the new population.
10. Generate $N_4$ individuals in the following way:
   – select two individuals according to fitness proportional selection;
   – produce two offspring by the reassortment operator;
   – put the produced offspring into the new population.
11. For each element of the new population perform mutation with probability $p_m$.
12. Update the VC vector.
13. Remove the weakest species if they fall below a threshold.
14. Remove species with a population below 5% of the total.
15. Create new species.
16. If the stop condition is not satisfied go to 3.

4
Results
We carried out a number of experiments for four different patterns for which IFSes are known. Figs. 1a to 1d show these patterns and Figs. 2a to 2d the best patterns obtained in our experiments. The fern pattern is built by an IFS consisting of four functions, the tree1 pattern of five functions, the snow pattern of six functions and the tree2 pattern of seven functions. In all experiments we used a population of 200 elements and ran the evolutionary process for 5000 generations. For each pattern and each set of parameter settings there were five experiments starting from different randomly generated initial populations. In these experiments the values $N_1$, $N_3$ and $N_4$ were changed to observe the influence of the different operators, especially the reassortment and inter-species crossover. The exact proportions of each type of recombination are presented in Table 1. The results suggest that better results are achieved when a higher proportion of offspring is created by the reassortment and inter-species crossover operators. Thus the use of reassortment seems to be beneficial. Table 2 shows the fitness value of the best pattern found in each of the experiments for each pattern and the number of the generation in which this
Fig. 1. Target patterns: a) a fern pattern, b) a tree1 pattern, c) a snow pattern and d) a tree2 pattern

Table 1. The data for experiments

no.  cross-over  reassortment  inter-species crossover
1    60%         20%           20%
2    40%         30%           30%
3    40%         50%           10%
4    10%         60%           30%
5    5%          80%           15%
value first appeared. From the data presented in Table 2 we can observe that individuals with the same fitness value are reached faster (in a smaller number of generations) when the reassortment operator is used with higher probability. Other experiments we carried out suggest that it would be useful to change the proportion of individuals generated by the inter-species crossover over time so that it generates more individuals at the beginning of the evolutionary process than at later stages. This possibility requires further research, and more experiments are needed to confirm it. The number of experiments carried out is high enough to attempt a statistical analysis, but not sufficient to regard this analysis as very precise. Student's t-test was used to test the hypothesis of the equality of the mean values of fitness and speed of learning with respect to the different distributions of crossover operators (as listed in Table 1). The speed of learning is defined in our tests as the number of the generation in which the fitness value passes the threshold of 1.5 (the fitness can take values from [0, 2]). The analysis has shown that for the third set of parameters compared to the first set, the value of the t statistic is equal to 2.24, which means that at the 95% confidence level (significance level 0.05) the hypothesis of the equality of the mean values of the speed of learning is rejected.
Fig. 2. Best patterns generated for: a) a fern pattern, b) a tree1 pattern, c) a snow pattern and d) a tree2 pattern
Table 2. The fitness function of the best individual for each parameter setting

no.  fern                 tree1                snow                 tree2
     fitness  generation  fitness  generation  fitness  generation  fitness  generation
1    1.99828  4908        1.69325  4865        1.86754  4820        1.79566  4879
2    1.99886  4936        1.94354  4933        1.94093  4945        1.86845  4960
3    1.98291  4848        1.99887  4980        1.99716  4973        1.86195  4880
4    1.99899  4781        1.99900  4638        1.99899  4872        1.99899  4829
5    1.99898  4640        1.99899  4710        1.99900  3188        1.99899  3150
The analysis was carried out with 20 trials (five for each of the four patterns) for each set of parameters. These statistical results should be treated as preliminary. A more reliable statistical analysis will be possible when a higher number of experimental results is available, in particular for patterns coded by a higher number of functions; in such cases the reassortment mechanism should play a greater role in finding the IFS coding a given image. The parameters of the fitness function were not tested in these experiments and were set to the following values: prc = 2 and pro = 3. Although the best images in each experiment are not exactly equal to the target image, in most cases they may be considered satisfactory approximations.
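As a rough cross-check of the trend discussed in this section, the best fitness values of Table 2 can be averaged over the four patterns for each parameter setting. The sketch below uses only the numbers printed in Tables 1 and 2; it is not the t-test reported above, which was based on all 20 trials per setting, and the variable names are illustrative.

```python
# Best fitness per parameter setting (rows 1..5) and pattern, copied from Table 2.
best_fitness = {
    1: {"fern": 1.99828, "tree1": 1.69325, "snow": 1.86754, "tree2": 1.79566},
    2: {"fern": 1.99886, "tree1": 1.94354, "snow": 1.94093, "tree2": 1.86845},
    3: {"fern": 1.98291, "tree1": 1.99887, "snow": 1.99716, "tree2": 1.86195},
    4: {"fern": 1.99899, "tree1": 1.99900, "snow": 1.99899, "tree2": 1.99899},
    5: {"fern": 1.99898, "tree1": 1.99899, "snow": 1.99900, "tree2": 1.99899},
}

# Share of offspring produced by reassortment, copied from Table 1.
reassortment_share = {1: 0.20, 2: 0.30, 3: 0.50, 4: 0.60, 5: 0.80}

def mean(values):
    """Arithmetic mean of a non-empty collection of numbers."""
    values = list(values)
    return sum(values) / len(values)

# Mean of the best fitness over the four patterns, per parameter setting.
mean_fitness = {no: mean(row.values()) for no, row in best_fitness.items()}

for no in sorted(mean_fitness):
    print(f"setting {no}: reassortment {reassortment_share[no]:.0%}, "
          f"mean best fitness {mean_fitness[no]:.5f}")
```

The averages rise from setting 1 to setting 4 and are practically identical for settings 4 and 5, which agrees with the qualitative statement that a higher share of reassortment tends to give better results.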
5
Concluding Remarks
The main idea proposed in this paper is, in a way, similar to the one presented in [8] and [9]. In both approaches the genome has an internal structure. However, significant differences can be observed as well. First of all, in earlier papers
the elements of a representation for an individual are of different meaning and structure, whereas in the approach presented here all elements of an individual are identical. Moreover, as has already been mentioned, the notion of a chromosome assumes that each individual is represented by the same number of elements, and very often there is some kind of ordering of elements in such a representation. In our approach the order of the singels in the representation is not important, but the number of singels may differ between individuals. Additionally, a new operator of reassortment is proposed here. It is also possible to define both mutation and crossover operators that would make use of the hierarchical genetic representation. For example, such an operator could consist in exchanging parts of different singels within one individual.
References
1. Barnsley, M.F.: Fractals Everywhere. Academic Press, London (1988)
2. Bielecki, A., Strug, B.: Finding an Iterated Function Systems based Representation for Complex Visual Structures Using an Evolutionary Algorithm. Machine Graphics & Vision (in print)
3. Bielecki, A., Strug, B.: A Viral Replication Mechanism in Evolutionary Algorithms. In: International Conference on Artificial Intelligence and Soft Computing, ICAISC 2006 (2006)
4. Bielecki, A., Strug, B.: An Evolutionary Algorithm for Solving the Inverse Problem for Iterated Function Systems for a Two-Dimensional Image. In: Kurzyński, M., Puchała, E., Woźniak, M., Żołnierek, A. (eds.) Computer Recognition Systems. Advances in Soft Computing, pp. 347–354. Springer, Heidelberg (2005)
5. Bielecki, A., Strug, B.: Evolutionary Approach to Finding Iterated Function Systems for a Two-Dimensional Image. In: Proc. of ICCVG 2004, Computational Imaging and Vision, vol. 32, pp. 516–521. Springer, Heidelberg (2006)
6. Holland, J.H.: Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor (1975)
7. Hutchinson, J.E.: Fractals and self-similarity. Indiana Univ. Math. J. 30, 713–747 (1981)
8. Mayer, H.A., Spitzlinger, M.: Multi-Chromosomal Representation and Chromosome Shuffling in Evolutionary Algorithm. In: Congress on Evolutionary Computations, pp. 1145–1149. IEEE, Los Alamitos (2003)
9. Pierrot, H.J., Hinterding, R.: Using Multichromosomes to Solve a Simple Mixed Integer Problem. In: Sattar, A. (ed.) Canadian AI 1997. LNCS, vol. 1342, pp. 137–146. Springer, Heidelberg (1997)
10. Taubenberger, J.K., Reid, A.H., Fanning, T.G.: Capturing a Killer Flu Virus. Scientific American 292, 62–71 (2005)
Tackling the Grid Job Planning and Resource Allocation Problem Using a Hybrid Evolutionary Algorithm

Karl-Uwe Stucky, Wilfried Jakob, Alexander Quinte, and Wolfgang Süß

Forschungszentrum Karlsruhe GmbH, Institute for Applied Computer Science, P.O. Box 3640, 76021 Karlsruhe, Germany
{uwe.stucky,wilfried.jakob,alexander.quinte,wolfgang.suess}@iai.fzk.de
Abstract. This paper presents results of new experiments with the Global Optimising Resource Broker and Allocator GORBA for grid systems. The scheduling algorithm is based on the Evolutionary Algorithm GLEAM (General Learning Evolutionary Algorithm and Method) and several heuristics. The task of planning grid resource allocation is compared to pure NP-complete job shop scheduling and it is shown in which way it is of greater complexity. Two different gene models and two repair methods are described in detail and assessed by the experimental results. Based on the analysis of the experimental results, directions of further work and improvements will be outlined.
1
Introduction
In a global grid environment a number of users share grid resources of different types according to the requirements of their applications. Grid users will expect an automatic allocation of resources and may have time and budget constraints together with application requirements. Resource providers may have other expectations regarding workloads or the prices of resource provision. Considering the requirements of numerous users and providers in a grid, the importance of planning and optimising strategies for grid resource scheduling is evident. To meet especially the optimising aspect, the utilisation of hybrid Evolutionary Algorithms (EA) was suggested [1]. While developing the Resource Broker GORBA with the Evolutionary Algorithm GLEAM [2], it became clear that a grid component for optimising grid resource schedules should be a core component of grid application management [3]. Three main interfaces connect the broker to the applications, to resource information, and to execution management. The first tests of GORBA in a simulated environment demonstrated the general applicability of the approach. Subsequent work concentrated on improving applicability and results of the planning and optimising process. Three strategies are pursued:
– The architecture of GORBA must be extended to allow for the implementation of a variety of optimising methods. Different algorithms should be selectable and combinable according to application characteristics and the time frame available for planning. Present work focuses on different heuristics and ant algorithms [4].
– GORBA is transferred to CampusGrid [5], a grid project at Forschungszentrum Karlsruhe, so that real grid resources will come into focus. The success of schedule optimisation depends on resource and grid characteristics and requires parameterisation for flexible and automatic method adaptation; a kind of meta optimising system is envisaged.
– The third strategy aims at improving the performance and the quality of optimisation results, which is essential for Quality of Service.

This paper concentrates on the third strategy applied to the currently implemented GORBA algorithms. It is structured as follows: Section 2 explains the application model, describes resource allocation and scheduling as an optimisation problem, and compares it to job shop scheduling. Section 3 presents features of GLEAM and concentrates on the gene models, allocation heuristics, and repair mechanisms used. Corresponding experimental results are described in Sect. 4. A brief summary and an outlook conclude the paper in Sect. 5.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 589–599, 2008. © Springer-Verlag Berlin Heidelberg 2008
2
Planning Grid Resource Utilisation – An Optimisation Problem
The reason for the development of GORBA is the observation that grid scheduling ranges from simply allocating the best available resource to the next requesting job to planning resource allocation in advance for all known grid jobs. This range has been described in [6] by contrasting queuing and planning systems as the two end points in a classification of resource management systems. Planning grid resource allocation serves a special purpose where user interaction is involved. For example, a grid user may simply want to be informed in advance about the expected processing of his jobs, with processing time and costs probably being the most interesting quantities. When the user interacts with the grid system, alternatives can be presented to him, enabling him to make decisions like accepting higher costs for gaining results earlier or rejecting certain resources for individual reasons. The vision of a world-wide grid as a global network of resources and of users and applications utilising them leads to systems that need automatic resource allocation, and this planning process can be regarded as an optimisation problem.

Grid users have problems that can be solved by grid applications and they want "the Grid" to solve them without being involved in details more than absolutely necessary. In general, application structures may be arbitrarily complex. The application model builds on workflow techniques and describes a complete application as a so-called application job. It consists of elementary grid jobs with predecessor-successor relations and resource requirements. Real applications will require a fully dynamic workflow model. At present, GORBA handles static workflows described by Directed Acyclic Graphs (DAG). Dynamic workflows can be decomposed to static ones. Several other solutions for grid resource
scheduling are based on a workflow representation. A comparable example is a genetic algorithm examined by Prodan et al. [7]. It handles DGs by decomposing them to static DAGs. In [8] it is compared to heuristic algorithms and the applicability is extended to dynamic workflows. Besides the complexity of applications, the heterogeneity of the resource environment in a global grid is the second factor which leads to optimisation methods in resource scheduling. Firstly, there are different types of resources like computing power, software, data, or storage capacity. In general, more than one resource type is required for a single grid job and this leads to the problem of co-scheduling. In GORBA, computing power is being handled together with requirements for software that has to be co-scheduled. Moreover, below the level of resource types, resources of the same type may be characterised by a variety of parameters, the values of which are relevant to the execution of a grid job and the assessment of a schedule. Currently, GORBA distinguishes computing power resources by a performance factor and both resource types by cost that may vary with time. Once the application workflow together with its resource specifications has been transferred to the grid, the details of resource selection should be decided upon by the resource broker. A different approach to coping with the resource part of the problem can be found in [9]. Based on a Service Level Agreement (SLA) architecture an SLA manager provides resource management support by negotiating resources, placing reservations, and improving performance by adapting the application to current resource availabilities. The optimisation problem is a multi-objective scheduling problem, since objectives from the users as well as from the resource providers come into play. For GORBA, minimisation of processing times and costs, compliance with due dates, and maximisation of workloads are included in the present version. 
This optimisation task is called the GORBA problem, as it is more complex than pure job shop scheduling, i.e. the allocation of n jobs to m machines taking precedence relations into account. The main differences of the GORBA problem are:
– co-scheduling of resources with alternatives
– processing times are resource-specific
– the existence of earliest start times and due dates
– costs of resources and their availability may vary over time
– the fitness function complexity, which comprises more than just makespan
A scheduling problem of comparable complexity could not be found in the literature; see [10,11] for a classification and a comprehensive description of known scheduling problems. There are some similarities to resource-constrained project scheduling problems [4]. Since the job shop scheduling problem is NP-complete [10], the same is true for the GORBA problem because it contains the former as a special case. Consequently, no exact solution can be expected and one has to be content with approximations as long as they are better than the results of pure queuing systems and satisfy the time and cost constraints imposed by the users. EA have proven their feasibility for tackling scheduling problems under these conditions, see [7,12,13] as examples. Due to the short time, in the order of a few minutes, available for planning in a grid environment, the usage of a
pure EA will not be satisfying. Therefore, heuristics are added to improve the performance of the EA, as will be described in the following section.
3
The Hybrid Evolutionary Algorithm
In the following, the GORBA-relevant features of GLEAM will be presented. An up-to-date description of the complete GLEAM concept can be found in [2]. One of the GLEAM properties that makes it easy to adapt to new problems is the flexible gene model. Depending on application requirements, different types of genes can be defined. They are distinguished by number and type of parameters as well as by boundary values for the parameters. Real and integer parameters and even genes with no parameters at all are possible. A chromosome is constructed from typed genes in different ways according to the application at hand. In GORBA, all chromosomes are built from genes that each have a different type, and the genes are ordered in a sequence that is phenotypically relevant. Based on the gene type definition, neutral genetic operators are defined and implemented, which take the parameter ranges into account. Depending on the chromosome type only a subset of these operators may be applicable, e.g. mutations that add or delete genes do not make sense with GORBA gene models.

Moreover, there is a meta structure for the chromosomes, called segments, which distinguishes GLEAM from most other EAs. A segment consists of a gene sequence, and the segment boundaries are subject to evolutionary change by special genetic operators like boundary shifts, merging or splitting segments. The standard 1- and n-point crossover operators are based on these segment boundaries and there are special segment mutations like altering the gene parameters of a segment, shifting a complete segment or inverting the gene sequence of a segment. The idea is that good partial solutions can be grouped by the evolution and treated as a whole. For the crossover operators, there is a genetic repair available for the gene models used, which ensures that both children in the end have exactly one gene per gene type.
The fitness function is the weighted sum of normalised values for the makespan, completion times and costs of each application job, and the workload. They are calculated relative to estimated lower and upper bounds derived from the heuristic planning and form an application-load-neutral fitness function as required for an automated planning process of various amounts of jobs and resources. Penalty functions are applied for precedence violations and cost or time overruns [14].

3.1
Two Gene Models
The workflow, i.e. the application job and its grid jobs with their resource requirements, is completed by a user's specification of several additional parameters that control the allocation procedure:
– the earliest starting time of the whole application job
– the latest end time of the application job
– the maximum costs
– weighting factors of time and costs corresponding to the user's preference
– an estimation of the runtime of each grid job on standard hardware. Together with the performance factor of the resources, this allows for more precise planning, comparable to the performance guiding described in [8].

For the experiments described here, two different gene models are used (GM1 and GM2, see the following boxes). Both define one gene type per grid job.
GM1 defines a search space comprising both the grid job sequence and the resource allocation. Every possible schedule can be described by this gene model.
GM2 needs more runtime than GM1, especially when the number of alternatively usable resources increases. In contrast to GM1, not all possible schedules can be described by it, because all grid jobs must use the same resource preference. For RAH1 and 2, this holds for all grid jobs, but for RAH3, the restriction is relaxed to all grid jobs of the same application job. The experiments will show whether this price is too high for the reduction of the search space or not. The selection of the resource allocation heuristics is controlled by a gene, and hence there is an element of co-evolution associated with GM2: the grid job sequence and the selection of the heuristic are evolved separately, although encoded on the same chromosome. To support this, the crossover operators were modified such that the offspring inherit the RAH from the better parent. Additionally, a special mutation operator aims at just altering the RAH gene on a chromosome, where no other parameter mutation is required.

3.2
Genetic and Phenotypic Repair
The introduced resource allocation heuristics guarantee the adherence to the precedence relations as long as no gene is located before the gene of its preceding grid job. The genetic operators may construct schedules with precedence violations and there are several possibilities to solve this problem. Here, the construction of the schedule is completed despite missing predecessors, the violations are handled by penalty functions, and repair mechanisms are introduced.

The genetic repair searches for all genes of grid jobs for which the genes of the preceding grid jobs are not located earlier on the chromosome. Such a gene is shifted towards the end of the chromosome until all genes of preceding grid jobs are on prior positions. As a result, the mechanism may hamper meaningful steps of shifting genes. This explains the outcome of experiments earlier than those reported here, which produced poor results for repairing all offspring. They showed that applying the genetic repair to a fraction of about 20% of the offspring performs best. Using genetic repair therefore requires the application of the corresponding penalty function.

The phenotypic repair is aimed at a correct interpretation of a chromosome rather than altering it. If the processing of a gene tries to schedule a grid job whose predecessors have not all been scheduled yet, it simply suspends the scheduling until all predecessors have been scheduled. The advantage of this approach is that intermediate steps of shifting genes, which themselves may be faulty, are now allowed to occur and hopefully to result in a better schedule.

3.3
Heuristic Seeding of the Start Population
As described in [1], GORBA firstly produces some solutions heuristically, which are used to seed the start population of the subsequent GLEAM run. At present, two heuristics are used for the generation of an ordered list of grid jobs: grid jobs of the application job with the shortest due date first and grid jobs of the application job with the shortest processing time first. The two job sequences are processed with the three resource allocation heuristics described in Sect.
3.1 delivering a total of up to six different schedules. They can be computed very fast, and the best serves as a basic solution and a reference for the experiments. These six schedules go into the start population and the remainder is generated randomly with the following peculiarities. For both gene models, all generated individuals undergo genetic repair. In case of GM2, all individuals are assessed using the three allocation heuristics and the best one is stored in the RAH gene.
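The two repair mechanisms of Sect. 3.2, the first of which is also applied to the randomly generated part of the start population, can be sketched as follows. This is a simplified illustration and not the GLEAM implementation: a chromosome is assumed to be a plain list of grid job identifiers (one gene per grid job), `preds` is a hypothetical mapping from each grid job to the set of its direct predecessors, and resource allocation is left out. Both repairs use the same deferral rule here; they differ only in whether the chromosome itself is rewritten.

```python
import random

def repaired_order(genes, preds):
    """Return the genes in an order that respects all precedence relations.

    A gene whose predecessors are not all placed yet is deferred, i.e. shifted
    towards the end, and placed as soon as its last predecessor has been placed.
    """
    placed, order, deferred = set(), [], []
    for job in genes:
        deferred.append(job)
        progress = True
        while progress:
            progress = False
            for cand in deferred:
                if set(preds.get(cand, ())) <= placed:
                    deferred.remove(cand)
                    order.append(cand)
                    placed.add(cand)
                    progress = True
                    break  # rescan from the oldest deferred gene
    if deferred:
        raise ValueError("cyclic or unknown predecessors: %r" % deferred)
    return order

def genetic_repair(chromosome, preds, repair_rate=0.2, rng=random):
    """Rewrite the chromosome itself, but only for a fraction of the offspring.

    The default of 0.2 follows the 'about 20%' reported in Sect. 3.2.
    """
    if rng.random() < repair_rate:
        chromosome[:] = repaired_order(chromosome, preds)
    return chromosome

def phenotypic_repair(chromosome, preds):
    """Leave the chromosome untouched; only its interpretation is corrected."""
    return repaired_order(chromosome, preds)
```

For example, with `preds = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}` the chromosome `["d", "a", "b", "c"]` is interpreted as the scheduling order `a, b, c, d` by `phenotypic_repair`, while the stored gene sequence stays as it was and remains available to later genetic operators.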
4
Experiments
For all GORBA experiments, a set of benchmarks has been developed and the resource broker has been tested in a simulated grid environment. This paper presents results obtained with the standard benchmark set [14]. It consists of four classes of application jobs, each class representing different values of the following two application job characteristics: the degree of grid job dependencies and the degree of freedom of resource selection. They are denoted by s (small) and l (large) for the values and R and D for resources and dependencies, resulting in the four abbreviations sRsD, sRlD, lRsD, and lRlD for the four classes. As the amount of grid jobs is another measure of complexity, a benchmark containing 50, 100, and 200 grid jobs was defined using all the same set of resources for every class. A fourth benchmark again consists of 200 grid jobs, but with a doubled set of resources available (abbreviated by 200d in the figures). Besides the structure and amount of jobs and resources, the tightness of time and cost restrictions is another essential characteristic of the complexity and complicacy of a scheduling task. It is intentionally set so that cost and time
Fig. 1. Results for gene models GM1 (left) and GM2 (right) using genetic repair. 200d stands for the 200 grid jobs benchmark but with a doubled number of resources. Dotted bars indicate that this result was obtained for one population size only. If the bars are slightly below 100%, the exact number is given for better distinction.
overruns are provoked for the heuristic methods running before the EA, i.e. they are always invoking penalty functions. Thus, a measure of success is defined for the EA: can schedules be found that adhere to the given restrictions and if so, always or to what fraction of the runs? This fraction is denoted as success rate and the first and crucial measure of the experiments. The second is the fitness function which is a much more abstract figure, because it is based on the weighted sum of the relative fulfilment of different criteria, see Sect. 3 and [14]. All combinations of the two gene models GM1 and GM2 and the two repair mechanisms were investigated and the results are based on 50 runs per combination and benchmark at the minimum. For every run, the time limit was set to three minutes, because it is regarded that this is a reasonable time frame for planning. For every setting, different population sizes in the range of 200 to 600 were used. Fig. 1 shows the best results for both gene models and genetic repair. In all cases, GM2 yields better results than GM1. As expected, experiments with longer runtimes yielded the opposite results: the reduced search space of GM2 allows for a faster progress of the evolution, but cannot deliver as good results as the GM1 (complete search space) in the long run. Fig. 2 again compares the two gene models, but with phenotypic repair, and the outcome is the same: GM2 performs much better within the short term. For comparing the effect of genetic and phenotypic repair, the left and the right parts of both figures must be compared separately. For GM1, the results are a little bit ambiguous, but for GM2, phenotypic repair clearly yields the better results. In this best configuration of GLEAM only three benchmark scenarios remain for further improvement and the rest is solved properly. Fig. 3 compares the two gene models based on fitness differences between the best heuristic planning ignoring the penalties and the best GLEAM results.
Fig. 2. Results for gene models GM1 (left) and GM2 (right) using phenotypic repair. Explanations see Fig. 1.
Fig. 3. Fitness change between the best heuristic planning without penalty functions and the best GLEAM results for both gene models and phenotypic repair
As this compares the aggregated quality values of the schedules while time and cost restrictions are ignored, fitness reductions can occur, as with GM1 in Fig. 3. GM2 produces fewer improvements than GM1, but does improve the schedules for all test cases. This again supports the conclusion that GM2 causes a faster optimisation in the beginning of the evolutionary search, but cannot reach the quality GM1 yields later. This figure also shows that the benchmarks with the greater resource alternatives are harder to solve, which is not surprising. The number of evaluations that can be accomplished within three minutes strongly depends on the numbers of grid jobs and resource alternatives, see Fig. 4 for GM2 and phenotypic repair. The situation is comparable for the other cases. The bad news from the figure is that with increasing complexity, especially in the form of more grid jobs, the number of evaluations decreases. Hence, the ability of solving the problem within the given time frame is limited mainly by the number of grid jobs. As the implementation of the allocation matrix management is not tuned for speed and great numbers of jobs, there is room for improvements, such that the limits for processable jobs and resources will be increased in the future.
Fig. 4. Evaluations within the given time frame of three minutes for GM2 and phenotypic repair
5 Conclusion and Future Work
The problem of planning grid jobs with precedence relations and a co-allocation of inhomogeneous resources with different performances and varying costs over time was compared to the pure NP-complete job shop scheduling task and found to be of greater complexity. Two different gene models and repair mechanisms were introduced and compared using a set of benchmark-based experiments. The kind of planning investigated so far has been the development of a new plan from scratch. This is good for performance analysis but unrealistic, because in practice a schedule usually already exists and re-planning is required for reasons such as a new application job, the cancellation or crash of an application job, the availability of new resources, or the shut-down of used ones. Consequently, one of the next steps will be the enhancement of GORBA in terms of re-planning. Earlier experience in another field of application that includes combinatorial elements revealed that evolutionary re-planning can be much faster than the development of a new plan, even if the new task is largely dissimilar to the old one [15].
References
1. Jakob, W., Quinte, A., Süß, W., Stucky, K.-U.: Optimised Scheduling of Grid Resources Using Hybrid Evolutionary Algorithms. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 406–413. Springer, Heidelberg (2006)
2. Blume, C., Jakob, W.: GLEAM – An Evolutionary Algorithm for Planning and Control Based on Evolution Strategy. In: Cantú-Paz, E. (ed.) GECCO 2002, vol. LBP, pp. 31–38 (2002)
3. Süß, W., Jakob, W., Quinte, A., Stucky, K.-U.: GORBA: Resource Brokering in Grid Environments using Evolutionary Algorithms. In: 17th IASTED Int. Conf. on Parallel and Distributed Computing Systems (PDCS), Phoenix, AZ, pp. 19–24 (2005)
4. Schmeck, H., Merkle, D., Middendorf, M.: Ant Colony Optimization for Resource-Constrained Project Scheduling. In: Whitley, D., et al. (eds.) Conf. Proc. GECCO 2000, pp. 893–900. Morgan Kaufmann, San Francisco (2000)
5. Schmitz, F., Schneider, O.: The CampusGrid test bed at Forschungszentrum Karlsruhe. In: Sloot, P.M.A., Hoekstra, A.G., Priol, T., Reinefeld, A., Bubak, M. (eds.) EGC 2005. LNCS, vol. 3470, pp. 1139–1142. Springer, Heidelberg (2005)
6. Hovestadt, M., Kao, O., Keller, A., Streit, A.: Scheduling in HPC Resource Management Systems: Queuing vs. Planning. In: Feitelson, D.G., Rudolph, L., Schwiegelshohn, U. (eds.) JSSPP 2003. LNCS, vol. 2862, pp. 1–20. Springer, Heidelberg (2003)
7. Prodan, R., Fahringer, T.: Dynamic Scheduling of Scientific Workflow Applications on the Grid Using a Modular Optimisation Tool: A Case Study. In: 20th Symposium of Applied Computing, SAC 2005, pp. 687–694. ACM Press, New York (2005)
8. Wieczorek, M., Prodan, R., Fahringer, T.: Comparison of Workflow Scheduling Strategies on the Grid. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 792–800. Springer, Heidelberg (2006)
9. Padgett, J., Djemame, K., Dew, P.: Grid Service Level Agreements Combining Resource Reservation and Predictive Run-time Adaptation. In: Proc. of the UK e-Science All Hands Meeting, Nottingham, UK (September 2005)
10. Brucker, P.: Scheduling Algorithms. Springer, Heidelberg (2004)
11. Brucker, P.: Complex Scheduling. Springer, Heidelberg (2006)
12. Di Martino, V., Mililotti, M.: Sub optimal scheduling in a grid using genetic algorithms. Parallel Computing 30, 553–565 (2004)
13. Gao, Y., Rong, H.Q., Huang, J.Z.: Adaptive grid job scheduling with genetic algorithms. Future Generation Computer Systems 21, 151–161 (2005)
14. Stucky, K.-U., Jakob, W., Quinte, A., Süß, W.: Solving Scheduling Problems in Grid Resource Management Using an Evolutionary Algorithm. In: Meersman, R., Tari, Z. (eds.) OTM 2006. LNCS, vol. 4276, pp. 1252–1262. Springer, Heidelberg (2006)
15. Jakob, W., Gorges-Schleuter, M., Blume, C.: Application of Genetic Algorithms to Task Planning and Learning. In: Männer, R., Manderick, B. (eds.) Conf. Proc. PPSN II, pp. 291–300. North-Holland, Amsterdam (1992)
Evolutionary Algorithm with Forced Variation in Multi-dimensional Non-stationary Environment
Dariusz Wawrzyniak and Andrzej Obuchowicz
Institute of Control and Computation Engineering, University of Zielona Góra, ul. Podgórna 50, 65-246 Zielona Góra, Poland
[email protected], [email protected]
http://www.issi.uz.zgora.pl
Abstract. The paper deals with an evolutionary algorithm which uses new methods for controlling the range of mutation. In order to significantly increase the efficiency of finding the optimum, it discovers and exploits knowledge about the state of the population in the environment in every generation. This allows the solution to be found both quickly and accurately. By dividing the population into objects assigned to different optimization functions (exploitation or exploration), it can simultaneously explore and exploit the solution space. These abilities also increase the algorithm's efficiency in multi-dimensional environments.
1 Introduction
One of the basic problems connected with using evolutionary algorithms is reconciling two mutually conflicting aims, using the best solution available and searching the whole accessible solution space as thoroughly as possible, that is, maintaining the balance between exploitation and exploration [4,5,6]. Researchers have tried to solve this problem in various ways. Many different techniques have been applied, including domination and diploidy mechanisms, adaptation or self-adaptation of parameters [8], maintenance of population diversity [11], co-evolutionary genetic algorithms, the Learnable Evolution Model [7], etc. The cardinal assumptions of the new way of solving the problem are:
– Division of the population: assigning the abilities of exploitation or exploration to individual objects.
– Discovering and exploiting knowledge of the state of the population in an environment.
The main objective of the proposed solution is to enable the algorithm to reconcile two mutually conflicting aims: exploitation and exploration. Thanks to this, the optimum can be found as quickly as possible while fulfilling the requirements concerning accuracy. The paper is organized as follows. The proposed evolutionary algorithm is described in Section 2. Our experiments and their results are presented in Section 3. Section 4 concludes the paper.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 600–607, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Proposed Algorithm
Standard evolutionary algorithms use historically acquired knowledge: they use the features of the best solutions found so far. In contrast, our proposal uses knowledge about the current location of every individual in the solution space in every generation. Considering the standard evolutionary algorithm ESSS, the basic parameter controlling the evolutionary process is the one which controls the range of changes taking place during recombination (in the ESSS case, during mutation). Small values of the parameter strengthen exploitation and large values strengthen exploration. Our idea is to make one part of the population deal with exploitation and the other part deal with exploration. The question is how and to what extent these functions should be assigned to individual objects. The maximum effect can be reached if objects in the neighborhood of local or global optima deal with exploitation, and objects located in less promising regions of the solution space deal with exploration. The next problem is how to obtain the necessary information about the current location of an individual in the solution space. Such information, delivered in every generation, is the value of the fitness function of particular individuals. After the evaluation stage, in most kinds of evolutionary algorithms, the population is sorted according to the fitness function, which makes any kind of selection easier to perform in later stages. Considering a population sorted in such a way, we can notice that the best adapted chromosomes are probably located closer to global or local optima. It can also be assumed (with utmost certainty) that worse adapted objects are located further from optima.
It seems obvious that individuals located closer to an optimum should try to find better solutions in their near neighborhood (exploitation), whereas objects located far from an optimum have a much smaller chance of finding the best solution. So it is reasonable to use them to find other optima (exploration). Moreover, research clearly shows that even small populations find an optimum effectively, and it would be wasteful to use the whole population to exploit a single optimum (probably one of many). We can achieve the above with a simple modification of the algorithm. In every consecutive generation, an individual parameter δ determining the range of mutation [eq. (1)] must be assigned to every individual depending on its current fitness. We tested some versions of the ESSS-FV algorithm [12] with various versions of parameter assignment (e.g. linear, logistic, simple), but the best method turned out to be a linear assignment of parameters to the population sorted in descending order of fitness, in the range from δmin to δmax:

$$\delta(i) = \delta_{min} + i \cdot \frac{\delta_{max} - \delta_{min}}{\eta - 1} \qquad (1)$$

where η is the population size and i = 0, ..., η−1 is the object index in the sorted population. Thanks to this simple operation, the mutation range is strongly related to the current location of an individual. The information is transferred from the environment to the population directly and immediately. The proposed solution causes progressive dispersion of the population around certain optima, whereas the population produced by the standard algorithm is clustered (Fig. 1). It should be noted that the computational cost of this modification of the algorithm is very small.
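The linear assignment of Eq. (1) can be illustrated with a minimal Python sketch. The function names `mutation_ranges` and `assign_mutation_ranges` are hypothetical and not taken from the paper, and the sketch assumes a maximisation problem in which a larger fitness value is better.

```python
def mutation_ranges(pop_size, delta_min, delta_max):
    """Eq. (1): delta(i) = delta_min + i * (delta_max - delta_min) / (eta - 1)."""
    if pop_size < 2:
        raise ValueError("Eq. (1) needs at least two individuals")
    step = (delta_max - delta_min) / (pop_size - 1)
    return [delta_min + i * step for i in range(pop_size)]


def assign_mutation_ranges(fitness, delta_min, delta_max):
    """Return one delta per individual, in the original (unsorted) order.

    Assumption: higher fitness is better, so rank 0 is the fittest individual
    and receives delta_min (exploitation); the worst receives delta_max.
    """
    order = sorted(range(len(fitness)), key=lambda k: fitness[k], reverse=True)
    ranges = mutation_ranges(len(fitness), delta_min, delta_max)
    deltas = [0.0] * len(fitness)
    for rank, k in enumerate(order):
        deltas[k] = ranges[rank]
    return deltas


if __name__ == "__main__":
    # Three individuals; the second one is the fittest and gets the smallest range.
    print(assign_mutation_ranges([0.2, 0.9, 0.5], delta_min=0.1, delta_max=0.3))
```

Because the ranges depend only on the rank, the extra cost per generation is one pass over an already sorted population, which matches the remark about the low computational cost.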
Fig. 1. Algorithm's dispersion: (a) ESSS, (b) ESSS-FV
One might object that it is necessary to set up the parameters δmin and δmax, so that we have introduced two parameters to tune instead of one. However, our research showed that, apart from setting these parameters according to the kind or nature of the environment, there is no need to change them during the evolutionary process. Nor is there any need to choose their values precisely: a small change in the parameter values does not affect the search result significantly. The parameter δmin influences the accuracy of the solutions found. It should be set at a level that depends on how accurately we want to find the optimal solution. The parameter δmax should be set in such a way that a chromosome is able to obtain any value from the acceptable solution space during mutation. As in the case of a standard algorithm, it is also possible to set this parameter at a level that allows crossing the saddle in the local optimum trap.
3 Illustrative Examples
We used a dynamic problem generator [2] similar to DF1 [11], which can generate a wide variety of complex dynamics through the application of more than one of the available types of dynamics. Our version uses a random generator instead of the logistic function:

$$f(X, t) = \max_{i=1,\dots,N} \left[ H_i(t) - R_i(t) \cdot \left\| X - X_i(t) \right\| \right] \qquad (2)$$

where $H_i$ is the peak height, $R_i$ the peak slope and $X_i$ the peak location. The algorithm was examined for three basic kinds of dynamic problems: environments with moving, oscillating and random peaks. In each case we created 4-, 8-, 16-, 32-, 64- and 128-dimensional environments with adiabatic changes. Between successive changes in the environment we performed 50 generations of the evolutionary algorithm. Our goal was to keep the solution as close to the optimum as possible, therefore we used the adaptation performance measure, evaluated according to the formula [3]:

$$\mu_\phi = \frac{1}{t_{max}} \sum_{t=1}^{t_{max}} \frac{\Phi_{best}(t)}{\Phi_{opt}(t)} \qquad (3)$$

where $t_{max}$ is the length of the entire search process, $\Phi_{best}(t)$ is the fitness of the best individual in the population at time t, and $\Phi_{opt}(t)$ is the optimum fitness in the search space at time t. Another value that is helpful for comparison is the closeness to the optimum during the search process. We used the following measure [4]:

$$\mu_{tr} = \frac{1}{t_{max}} \sum_{t=1}^{t_{max}} \rho\left( x_{opt}(t), x_0(t) \right) \qquad (4)$$

where $x_0(t)$ is the best point of the population at time t, $x_{opt}(t) = \arg\max_{x \in D} \Phi(x, t)$, and $\rho(a, b)$ is a distance measure in D, e.g. if $D \subset R^n$ then $\rho(a, b) = \|a - b\|$. The following algorithms were compared:
– ESSS: Evolutionary Search with Soft Selection [8].
– ESSS-FV: ESSS with forced variation, the proposed algorithm [12],[13].
First, the best parameter values (δ, δmin, δmax, elite group percentage, random immigrants group percentage) were found for the algorithms and environments, and only then were the algorithms compared. The parameter values are shown in Table 1. The initial population was created randomly in the whole solution space.
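As an illustration of the multi-peak landscape in Eq. (2), the following Python sketch evaluates the function at one fixed time step. The name `peaks_fitness` is hypothetical, the distance is assumed to be the Euclidean norm, and the way the peaks change over time (moving, oscillating or random) is left out because the text does not specify it in detail.

```python
import math


def peaks_fitness(x, heights, slopes, centers):
    """Eq. (2) at a fixed time t: max_i [H_i - R_i * ||x - X_i||].

    heights, slopes and centers hold H_i, R_i and X_i for every peak.
    The Euclidean norm is an assumption of this sketch.
    """
    return max(
        h - r * math.dist(x, c)
        for h, r, c in zip(heights, slopes, centers)
    )


if __name__ == "__main__":
    # Two cone-shaped peaks in a 2-dimensional environment.
    heights = [10.0, 5.0]
    slopes = [1.0, 2.0]
    centers = [(3.0, 4.0), (0.0, 0.0)]
    print(peaks_fitness((3.0, 4.0), heights, slopes, centers))  # top of the first peak
    print(peaks_fitness((0.0, 0.0), heights, slopes, centers))
```

A dynamic environment could then be obtained by updating `heights`, `slopes` or `centers` between evaluations, for example every 50 generations as in the experiments.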
3.1 Results
The results for the presented algorithm in stationary environments were presented in [12]. The behavior of ESSS-FV in non-stationary environments was presented in [13]. In subsequent papers we presented results for enhanced versions of ESSS-FV (with memory, with optimum location prediction, with change monitoring, etc.), also in dynamic environments. Finally, we decided to test our solution in multi-dimensional dynamic environments. All experiments were repeated 500 times. Below we present the results averaged over all repetitions. The results for both measures in environments with moving peaks are shown in Fig. 2. Fig. 3 shows the results for environments with oscillating peaks and Fig. 4 for environments with random peaks.

Table 1. Parameter values
Parameter's name    ESSS-FV                      ESSS
Population count    Number of dimensions · 10 (both algorithms)
δ                   0.1                          0.1
δmin                0.00001                      -
δmax                0.3                          -
Elite group         0.04                         -
Random group        0.04                         -
Selection method    RandomRestWithRepetitions    Roulette
Fig. 2. Adaptation performance for environments with moving peaks: (a), (b) for location, (c) for fitness
All figures show that the ESSS-FV algorithm is better than the ESSS algorithm in each environment. Additionally, the decrease in efficiency with a growing number of dimensions is clearly much smaller than in the case of ESSS. There are no significant differences between the results for the particular environment types.
Fig. 3. Adaptation performance for environments with oscillating peaks: (a), (b) for location, (c) for fitness
Fig. 4. Adaptation performance for environments with random peaks: (a), (b) for location, (c) for fitness
Note that the adaptation performance measure for fitness has a maximum value of 1.
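The two measures of Eqs. (3) and (4) can be sketched in Python as follows. The function names are hypothetical, the distance ρ is assumed to be Euclidean, and the bound of 1 for the fitness measure holds under the additional assumption that the optimum fitness is positive and never exceeded by the best individual.

```python
import math


def fitness_performance(best_fitness, optimum_fitness):
    """Eq. (3): mean over time of Phi_best(t) / Phi_opt(t)."""
    ratios = [b / o for b, o in zip(best_fitness, optimum_fitness)]
    return sum(ratios) / len(ratios)


def tracking_distance(best_points, optimum_points):
    """Eq. (4): mean over time of rho(x_opt(t), x_0(t)), Euclidean rho assumed."""
    dists = [math.dist(o, b) for b, o in zip(best_points, optimum_points)]
    return sum(dists) / len(dists)


if __name__ == "__main__":
    # Two time steps: the optimum is missed at t=1 and hit exactly at t=2.
    print(fitness_performance([5.0, 10.0], [10.0, 10.0]))
    print(tracking_distance([(0.0, 0.0), (1.0, 1.0)], [(3.0, 4.0), (1.0, 1.0)]))
```

A run that tracks the optimum perfectly gives a fitness measure of 1 and a location measure of 0, so larger is better for the first measure and smaller is better for the second.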
4 Conclusions
Previous research has shown that the described algorithm is much more efficient than the standard algorithm in both stationary and non-stationary 2-dimensional environments. The present work shows that the new algorithm can also be successfully applied to multi-dimensional non-stationary problems. The proposed model of the evolutionary algorithm works very well. Its high efficiency is due, first of all, to the fact that it allows knowledge about the current state of the population in the environment to be acquired and exploited. The second very important reason for the algorithm's efficiency is the combination of the exploitation and exploration abilities in every generation of the evolutionary algorithm. Including the elitist mechanism in the presented algorithm causes a large increase in the exploitation ability without the risk of falling into the local optimum trap. The low computational cost of this modification should also be noted. In summary, the proposed algorithm is characterized by the following features:
– Ability to discover and exploit knowledge about the state of the population in an environment.
– Combination of the exploitation and exploration abilities in every generation.
– Higher exploitation and exploration abilities. The global optimum can be found faster and more precisely.
– The algorithm does not undergo premature convergence.
– There is no possibility of falling into the trap of a local optimum, because some part of the population always explores the solution space.
– Thanks to the above, the selection pressure can be increased without the risk of falling into the local optimum trap.
– The influence of increasing the number of dimensions on the algorithm's efficiency is much smaller.
References
1. Bendtsen, C.N.: Optimization of Non-Stationary Problems with Evolutionary Algorithms and Dynamic Memory. Aarhus Universitet (2001)
2. Bendtsen, C.N., Krink, T.: Dynamic memory model for non-stationary optimization. In: Proc. IEEE Congress on Evolutionary Computation, CEC 2002, vol. 1, pp. 145–150 (2002)
3. Branke, J.: Memory-enhanced evolutionary algorithms for dynamic optimization problems. In: Proc. IEEE Congress on Evolutionary Computation, CEC 1999, vol. 3, pp. 1875–1882 (1999)
4. Cotta, C., Shaefer, R. (eds.): Evolutionary Computation. Int. J. Appl. Math. & Comp. Sci. (Special Issue) 14 (2004)
5. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison–Wesley, Reading (1989)
6. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. Springer, Heidelberg (1996)
7. Michalski, R.S., Cervone, G., Kaufman, K.: Speeding Up Evolution through Learning: LEM. In: Proc. Ninth Int. Symp. Intelligent Information Systems (2000)
8. Obuchowicz, A.: Evolutionary Algorithms for Global Optimization and Dynamic System Diagnosis. Lubuskie Scientific Society Press, Zielona Góra (2003)
9. Obuchowicz, A., Wawrzyniak, D.: Evolutionary Adaptation in Non-stationary Environments: a Case Study. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 439–446. Springer, Heidelberg (2006)
10. Rosa, A.C., Ramos, V., Fernandes, C.: Societal Implicit Memory and his Speed on Tracking Extrema in Dynamic Environments using Self-Regulatory Swarms. Technical University of Lisbon (2006)
11. Trojanowski, K.: Evolutionary Algorithm with Redundant Genetic Material for Non-stationary Environments. Polish Academy of Science, Warszawa (2003)
12. Wawrzyniak, D., Obuchowicz, A.: New Approach To Fast And Precise Optimization With Evolutionary Algorithm. In: Arabas, J. (ed.) Evolutionary Computation and Global Optimization, Warsaw University of Technology Press (series: Electronics), vol. 156, pp. 397–404 (2006)
13. Wawrzyniak, D., Obuchowicz, A.: New Approach To Optimization With Evolutionary Algorithm in Dynamic Environment. In: Proceedings of Artificial Intelligence Studies, vol. 3, pp. 187–196. University of Podlasie Press (2006)
Hybrid Flowshop with Unrelated Machines, Sequence Dependent Setup Time and Availability Constraints: An Enhanced Crossover Operator for a Genetic Algorithm
Victor Yaurima1, Larisa Burtseva2, and Andrei Tchernykh3
1 CESUES Superior Studies Center, San Luis R.C., Mexico [email protected]
2 Autonomous University of Baja California, Mexicali, Mexico [email protected]
3 CICESE Research Center, Ensenada, Mexico [email protected]
Abstract. This paper presents a genetic algorithm for a scheduling problem frequent in printed circuit board manufacturing: a hybrid flowshop with unrelated machines, sequence dependent setup time and machine availability constraints. The proposed genetic algorithm is a modified version of previously proposed genetic algorithms for the same problem. Experimental results show the advantages of using a new crossover operator. Furthermore, statistical tests confirm the superiority of the proposed variant over the state-of-the-art heuristics.
1 Introduction
In this paper, a scheduling problem for a hybrid flowshop (HFS) is considered. The HFS problem is a general case of the simple flowshop problem [1]. In simple flowshop problems, each machine operation center includes just one machine. In an HFS, more than one machine is available for processing in at least one machine center or stage. The HFS differs from the flexible flow line [2] and the flexible flowshop [3] problems, although many authors do not distinguish these three terms. In a flexible flow line as well as in a flexible flowshop, the machines available at each stage are identical. An HFS does not have this restriction [4, 5, 6, 7, 8]. This paper deals with the HFS scheduling problem with unrelated parallel machines at the stages, sequence dependent setup time and availability constraints of the machines. The HFS was introduced in 1971 by Arthanary and Ramaswamy [9], who proposed a branch and bound algorithm (B&B) for a flexible flow shop with only two stages. One of the earliest works dealing with problems with m stages was presented by Salvador [10] in 1973, where dynamic programming algorithms for the no-wait flow shop with multiple processors were proposed.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 608–617, 2008. © Springer-Verlag Berlin Heidelberg 2008
Ruiz and Maroto [11] and Vignier et al. [12] present comprehensive surveys on the HFS. Many papers deal with simplified cases of the HFS with only two or three stages. Six studies on unrelated parallel machines with multiple stages [11,13,14,15,16,17] are known. Three of them take sequence dependent setup times into account, and only [11] considers availability constraints. Adler et al. [17] developed BPSS (Bagpak Production Scheduling System). The production problem contains setup times and, in some stages, unrelated parallel machines. The system is specific to the considered problem and is based on the application of priority rules. Aghezzaf et al. [14] proposed several methods to solve a problem in the carpet manufacturing industry, where three stages and sequence dependent setup times are considered. The solution is based on problem decomposition, heuristics, and mixed-integer programming models. Gourgand et al. [16] presented several Simulated Annealing (SA)-based algorithms for the HFS problem, using a specific neighborhood. The methods were applied to a real industrial problem. Allaoui and Artiba [13] dealt with the HFS scheduling problem under maintenance constraints to optimize several objectives based on flow time and due date. In this model, the authors also take setup, cleaning and transportation times into consideration. The HFS scheduling problem with machine setup times dependent on job sequences, but without availability constraints, is considered in the paper of Lixin Tang and Yanyan Zhang [18]. In their model, all jobs follow the same route in the HFS, there is at least one identical machine at each stage, and at least one stage has more than two machines. A modification of the traditional Hopfield network formulation and an improving strategy are proposed. Zandieh et al. [15] proposed an immune algorithm for the considered problem. In their model, identical machines at each stage are considered. The results obtained are compared with those of the Random Key Genetic Algorithm (RKGA) [19], and the immune algorithm is shown to outperform RKGA. Ruiz and Maroto [11] developed a genetic algorithm for a complex generalized flowshop scheduling problem with unrelated parallel machines at each stage, sequence dependent setup times and availability constraints, applied to the production of textiles and ceramic tiles. The proposed algorithm was shown to be more effective than all the others. To the best of our knowledge, only the latter method has been proposed for the HFS with unrelated parallel machines, sequence dependent setup time and availability constraints.
In this paper, a genetic algorithm for the HFS problem with unrelated parallel machines, sequence dependent setup time and machine availability constraints is introduced. Makespan (Cmax) minimization is considered as the optimization criterion. This algorithm is a variant of the genetic algorithm for the HFS proposed by Ruiz and Maroto [11]. The restart conditions, crossover operations and stopping criterion are modified. It is shown that the variant achieves better results.
The rest of the paper is organized as follows. In Section 2, the problem statement is given. In Section 3, the proposed genetic algorithm variant is introduced. The experimental setup and evaluation results are presented in Section 4. Finally, Section 5 summarizes the paper and points out ideas for future research.
2 Problem Statements
Let a set N of n jobs, N = {1, ..., n}, be processed on a set M of m stages, M = {1, ..., m}. At every stage $i \in M$, a set $M_i = \{1, ..., m_i\}$ of unrelated parallel machines that can process the jobs is given, where $|M_i| \ge 1$. Every job has to pass through all stages and must be processed by exactly one machine at every stage. Let $p_{il,j}$ be the processing time of job $j \in N$ on machine $l \in M_i$ at stage i. A machine-based sequence dependent setup time is also considered. Let $S_{il,j,k}$ be the setup time on machine l at stage i when processing job $k \in N$ after processing job j. For stage i, $E_{ij}$ is the set of eligible machines that can process job j, $1 \le |E_{ij}| \le m_i$. Gourgand et al. [16] showed that for the given problem the total number of possible solutions is $n! \left( \prod_{i=1}^{m} m_i \right)^n$. Moreover, Gupta [6] showed that the flexible flowshop problem with only two stages (m = 2) is NP-hard even when one of the two stages contains a single machine. Since the HFS is a general case of the flexible flowshop, we can conclude that the HFS is also NP-hard. Using the well-known three-field notation $\alpha | \beta | \gamma$ for scheduling problems and its extension for the HFS proposed by Vignier et al. [12], the problem considered here can be viewed as $FHm, ((RM^{(i)})_{i=1}^{(m)}) \mid S_{sd}, M_j \mid C_{max}$. The $C_{max}$ is calculated as follows:

$$C_{i,\pi(j)} = \min_{l=1}^{m_i} \left\{ \max\left\{ C_{i,L_{il}} + S_{il,L_{il},\pi(j)} ;\; C_{i-1,\pi(j)} \right\} + p_{il,\pi(j)} \right\} \qquad (1)$$

where π is a permutation or sequence and π(j) is the job in the jth position of the sequence, $j \in N$. Every job has to be processed at every stage, so m tasks per job are considered. $C_{i,\pi(j)}$ is the completion time of job π(j) at stage i, where $i \in M$. $L_{il}$ is the last job that was assigned to machine l within stage i, $l \in M_i$. $S_{il,L_{il},\pi(j)}$ represents the setup time of machine l at stage i when processing job π(j) after having processed the previous job assigned to this machine ($L_{il}$). Once all jobs are assigned to machines at all stages, the makespan is calculated as follows:

$$C_{max} = \max_{j=1}^{n} \left\{ C_{m,\pi(j)} \right\} \qquad (2)$$

3 Genetic Algorithm
3.1 Encoding
The sequence of jobs represents an individual, which is called a chromosome. In this encoding, a string of n integers that is a permutation of the set {1, 2, ..., n} is used, where each integer represents a job number. The following is the GABC procedure.
Input: The population of Psize individuals.
Output: A candidate sequence of length n.
Generate population
while not stopping_criterion do
    for i=0 to Psize
        generate_individual(i)
        evaluate_objective_function(i)
    select individuals by the tournament selection
    keep the best individual found
    if minimum_makespan_has_not_changed = 25 then
        regenerate_population
        if regenerate = 10 then
            stopping_criterion = true
        else
            increase regenerate
            minimum_makespan_has_not_changed = 0
    else if actual_minimum_makespan = previous_minimum_makespan then
        increase minimum_makespan_has_not_changed
    crossover with probability Pc
    mutation with probability Pm
3.2 Initialization and Evaluations
The population is formed by Psize individuals generated randomly. Many authors separate sequencing and assignment decisions in HFS problems [20, 21]. We follow the approach proposed by Ruiz and Maroto [11], where the assignment of jobs to machines at every stage is done by the evaluation function. In an HFS with no setup times and no availability constraints, assigning a job to the first available machine would also result in the earliest completion time of the job. In an HFS with unrelated parallel machines, the first available machine may be very slow for a given job, so assigning the job to this machine can result in a later completion time than assignment to another machine. With the addition of setup times, this problem can become even worse. To solve it, each job is assigned to the machine that can finish it at the earliest time at a given stage, taking into consideration different processing speeds, setup times and machine availability.
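The earliest-completion-time assignment described above, together with Eqs. (1) and (2), can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `makespan` and the nested-list layout of `proc` and `setup` are hypothetical, jobs are numbered from 0, the same permutation is used at every stage, the first job on a machine needs no setup, and machine eligibility and availability windows are ignored.

```python
def makespan(perm, proc, setup):
    """Evaluate a job permutation following Eqs. (1) and (2).

    proc[i][l][j]     : processing time of job j on machine l of stage i
    setup[i][l][j][k] : setup time on machine l of stage i when job k follows job j
    Assumptions of this sketch: no setup before the first job on a machine,
    every machine is eligible for every job and is available from time 0.
    """
    done_prev = {j: 0 for j in perm}           # C_{i-1, j}; zero before stage 1
    for i in range(len(proc)):
        n_machines = len(proc[i])
        free_at = [0] * n_machines             # C_{i, L_il} for each machine l
        last_job = [None] * n_machines         # L_il for each machine l
        done_here = {}
        for j in perm:
            best = None
            for l in range(n_machines):
                s = 0 if last_job[l] is None else setup[i][l][last_job[l]][j]
                finish = max(free_at[l] + s, done_prev[j]) + proc[i][l][j]
                if best is None or finish < best[0]:
                    best = (finish, l)         # keep the earliest finishing machine
            finish, l = best
            free_at[l] = finish
            last_job[l] = j
            done_here[j] = finish
        done_prev = done_here
    return max(done_prev.values())             # Eq. (2)


if __name__ == "__main__":
    # Two jobs, stage 0 with two unrelated machines, stage 1 with one machine.
    proc = [[[3, 2], [5, 1]], [[2, 2]]]
    unit = [[0, 1], [1, 0]]                    # setup of 1 between different jobs
    setup = [[unit, unit], [unit]]
    print(makespan([0, 1], proc, setup), makespan([1, 0], proc, setup))
```

In the small example, the first available machine is not always chosen: a job goes to whichever machine finishes it earliest once speed and setup are taken into account, which is the point made in the paragraph above.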
3.3 Restart, Generational Scheme and Stopping Criterion
A restart mechanism based on the scheme proposed in [22], with modifications, is applied.
– At each generation i, store the minimum makespan, mak_i.
– If mak_i = mak_{i−1} then countmak = countmak + 1. Otherwise countmak = 0.
– If countmak > Gr (the number of generations without improvement) then apply the following procedure:
  • Sort the population in ascending order of Cmax.
  • Skip the first 20% of the individuals in the sorted list (the best individuals).
  • Of the remaining 80% of the individuals, 50% are replaced by simple SHIFT mutations of the best individual and 50% are replaced by newly generated random schedules.
  • regeneration = regeneration + 1.
  • countmak = 0.
The generational scheme replaces some individuals in a new generation by individuals from the previous generation. As shown in [23], a steady state genetic algorithm in which offspring replace the worst individuals in the population yielded much better results than regular genetic algorithms. The evaluations stop when the minimum makespan has not changed 25 times (Gr = 25) and after regeneration = 10. Finally, the algorithm is executed 2 times.
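The regeneration step of the restart mechanism can be sketched as below. The names `shift_mutation` and `restart` are hypothetical, and SHIFT mutation is assumed here to mean removing one randomly chosen job and reinserting it at a random position; the text does not define it further. The counters countmak and regeneration are left to the surrounding loop.

```python
import random


def shift_mutation(perm, rng):
    """Assumed SHIFT move: take one random job out and reinsert it elsewhere."""
    child = list(perm)
    job = child.pop(rng.randrange(len(child)))
    child.insert(rng.randrange(len(child) + 1), job)
    return child


def restart(population, cost, rng):
    """Keep the best 20%; rebuild the remaining 80% as described in the list."""
    ranked = sorted(population, key=cost)      # ascending C_max
    n_keep = len(ranked) // 5                  # best 20% are skipped (kept)
    n_rest = len(ranked) - n_keep
    n_shift = n_rest // 2                      # half: SHIFT mutations of the best
    best = ranked[0]
    shifted = [shift_mutation(best, rng) for _ in range(n_shift)]
    randoms = [rng.sample(best, len(best)) for _ in range(n_rest - n_shift)]
    return ranked[:n_keep] + shifted + randoms


if __name__ == "__main__":
    rng = random.Random(0)
    pop = [rng.sample(range(6), 6) for _ in range(10)]
    # Toy cost used only for the demonstration; a real run would use C_max.
    toy_cost = lambda p: sum(pos * job for pos, job in enumerate(p))
    print(restart(pop, toy_cost, rng))
```

In a full algorithm, `cost` would be the makespan evaluation and `restart` would be called whenever the no-improvement counter exceeds Gr.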
3.4 Selection, Crossover and Mutation
For parent selection, tournament selection is used. It is one of the classical selection schemes [24, 25] and the best for this kind of problem [11]. Mutation is incorporated into genetic algorithms to avoid convergence to a local optimum and to reintroduce lost genetic material and variability into the population. Three mutation operators widely used in the literature (insertion, swap and switch) [11, 25, 26] are considered. Crossover generates new sequences by combining two other sequences. The goal is to generate new solutions with better Cmax values. Five operators from the literature are considered: OBX (Order Based Crossover) [26], PPX (Precedence Preservative Crossover) [27], OSX (One Segment Crossover) [5, 13], SB2OX (Similar Block 2-Point Order Crossover) [11] and TP (Two Point) [25]. In this paper, three new crossover operators are proposed: TPI, OBSTX and ST2PX.

TPI (Two Point Inverse). It is similar to the Two Point crossover. It chooses two points randomly. Elements from position 1 to the first point are copied from father 2. The positions from the first point to the second one are copied from father 1. The positions from the second point to the last one are copied from father 2.

OBSTX (Order Based Setup Time Crossover). It is derived from the original order based crossover and takes sequence dependent setup times into consideration according to a binary mask. A mask value of 1 indicates that the corresponding element of father 1 is copied to the son. A mask value of 0 indicates that an element of father 2 is copied according to the minimal sequence dependent setup time of a machine chosen randomly at the first stage.

ST2PX (Setup Time Two Point Crossover). It chooses two crossover points randomly. Elements from position 1 to the first point are copied from father 1. Elements from the second point to the last one are also copied from father 1. Elements between the first point and the second one are copied from father 2 according to the minimal sequence dependent setup time of a machine chosen randomly at the first stage. Figure 1 shows how this operation is performed. Let us assume that the first point is position 3 and the second point is position 8 (Fig. 1A). The genes at positions 1 to 3 of father 1 are copied to the child. The genes at positions 8 to 9 (the last position) are also copied from father 1 (Fig. 1B). The remaining positions of the child are filled with the best elements from father 2, taking into account the sequence dependent setup times (Fig. 1C). For instance, to fill position 4 of the child (after gene 9, which is at position 3 of the child), four setup times (50, 45, 42, 25) are compared. The gene with the minimal value, gene 8, is copied to position 4. A gene is equivalent to a job and is represented as an element of each individual.

A) First point: position 3. Second point: position 8.
   Position   1  2  3  4  5  6  7  8  9
   Parent 1   7  1  9  2  8  4  6  5  3
   Parent 2   5  2  8  1  6  3  7  9  4
B) Genes copied from parent 1.
   Position   1  2  3  4  5  6  7  8  9
   Child      7  1  9  .  .  .  .  5  3
C) Sequence dependent setup times of a machine from the first stage
   (rows: job to be processed on the machine; columns: job processed earlier on the machine).
         1   2   3   4   5   6   7   8   9
   1     0  39  46  28  43  45  34  38  34
   2    49   0  45  42  27  50  39  43  50
   3    26  45   0  29  48  37  28  48  36
   4    44  44  35   0  42  40  32  31  45
   5    38  45  38  32   0  40  39  30  47
   6    47  40  29  27  32   0  30  35  42
   7    47  43  36  50  27  47   0  42  33
   8    41  36  34  36  27  47  26   0  25
   9    42  35  28  42  36  31  41  41   0
   After job 9: min[50(2), 25(8), 42(6), 45(4)] = 25, then job 8 is chosen.
   After job 8: min[43(2), 35(6), 31(4)] = 31, then job 4 is chosen.
   After job 4: min[42(2), 27(6)] = 27, then job 6 is chosen.
   After job 6: min[50(2)] = 50, finally job 2 is copied.
   Position   1  2  3  4  5  6  7  8  9
   Child      7  1  9  8  4  6  2  5  3
Fig. 1. Setup Time Two Point Crossover
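The ST2PX operator can be written down compactly. The sketch below is an illustration rather than the authors' code: the function name `st2px`, the 1-based crossover points, and the convention that `S[job - 1][prev - 1]` holds the setup time of `job` when it follows `prev` are assumptions chosen so that the data of Fig. 1 can be used directly. Ties are broken in favour of the job that appears first in father 2, which the paper does not specify.

```python
def st2px(father1, father2, point1, point2, setup_after):
    """Setup Time Two Point Crossover on job permutations.

    Positions 1..point1 and point2..n (1-based) are copied from father1.
    The middle is filled with the unused jobs of father2, always choosing
    the job with the smallest setup time after the previously placed job.
    Assumes point1 >= 1 so that a previous job always exists.
    """
    head = list(father1[:point1])
    tail = list(father1[point2 - 1:])
    used = set(head) | set(tail)
    remaining = [job for job in father2 if job not in used]
    child = head
    while remaining:
        prev = child[-1]
        nxt = min(remaining, key=lambda job: setup_after(job, prev))
        remaining.remove(nxt)
        child.append(nxt)
    return child + tail


# Setup times of Fig. 1C: row = job to process, column = job processed before.
S = [
    [0, 39, 46, 28, 43, 45, 34, 38, 34],
    [49, 0, 45, 42, 27, 50, 39, 43, 50],
    [26, 45, 0, 29, 48, 37, 28, 48, 36],
    [44, 44, 35, 0, 42, 40, 32, 31, 45],
    [38, 45, 38, 32, 0, 40, 39, 30, 47],
    [47, 40, 29, 27, 32, 0, 30, 35, 42],
    [47, 43, 36, 50, 27, 47, 0, 42, 33],
    [41, 36, 34, 36, 27, 47, 26, 0, 25],
    [42, 35, 28, 42, 36, 31, 41, 41, 0],
]

parent1 = [7, 1, 9, 2, 8, 4, 6, 5, 3]
parent2 = [5, 2, 8, 1, 6, 3, 7, 9, 4]
child = st2px(parent1, parent2, 3, 8, lambda job, prev: S[job - 1][prev - 1])
print(child)  # [7, 1, 9, 8, 4, 6, 2, 5, 3], the child shown in Fig. 1
```

The greedy choice inside the loop reproduces the four comparisons listed in Fig. 1C (jobs 8, 4, 6 and 2 in that order).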
3.5 Experimental Setup
The following parameters are used to calibrate the GABC algorithm:
– Population size (Psize): 50, 80, 100, 150 and 200;
– Crossover type: OBX, PPX, OSX, TP, SB2OX, TPI, OBSTX, ST2PX;
– Crossover probability (Pc): 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9;
– Mutation type: Insert, Swap, Switch;
– Mutation probability (Pm): 0.0, 0.005, 0.01, 0.015, 0.02, 0.03, 0.05, 0.1, 0.2, 0.4.
Hence, 5 · 8 · 10 · 3 · 10 = 12000 different setups are considered. For each one, 360 problems, which are part of the full set of problems presented in [11], are solved. The number of stages (m) is set to 5, 10, and 20. The number of jobs is 20. For each configuration, 10 different problems, in which the processing times are uniformly distributed in the range 1 to 99, are considered. Three main groups of instances are defined by the number of machines per stage. In the first group, the number of machines per stage is uniformly distributed between one and three. In the second group, there are two machines per stage. In the third group, there are three machines per stage. Inside each group there are four subgroups of problems with different configurations of the sequence dependent setup times. The setup times are drawn from the uniform distributions U[1, 9], U[1, 49], U[1, 99] and U[1, 124], which correspond to 10%, 50%, 100% and 125% of the average processing time. Thus, 12 different sets (three groups by four subgroups) with 30 problems each are used. The largest setup is the following: 20 jobs and 20 stages with three machines per stage. In this case a 20 × 60 matrix of processing times and 60 matrices of 20 × 20 setup times are used.

The condition for termination is 25 iterations without improvement of the objective function. The regeneration procedure is then executed 10 times, and the same parameter combination is applied 2 times, taking the minimum makespan. The advantage of the proposed algorithm is calculated by the following formula: [(Heu_sol − Best_sol) / Best_sol] · 100, where Heu_sol is the value of the objective function obtained by the considered algorithm and Best_sol is the value obtained by the best known one. Each instance for Best_sol is evaluated 500,000 times with the standard parameters proposed in [11]: ranking selection [24, 25], OP crossover (one point order crossover [25]), Pc = 0.6, Pm = 0.01, Psize = 50 and Gr = 50. Here, Pc is the crossover probability, Pm is the mutation probability, Psize is the population size, and Gr is the number of generations without improvement.
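As a small worked example of the relative advantage formula, the following sketch computes the percentage for single instances and its average over a set. The function names and the makespan values are hypothetical and serve only to show the sign convention: a negative value means the considered algorithm found a smaller makespan than the best known one.

```python
def relative_advantage(heu_sol, best_sol):
    """[(Heu_sol - Best_sol) / Best_sol] * 100, in percent."""
    return 100.0 * (heu_sol - best_sol) / best_sol


def average_advantage(pairs):
    """Average relative advantage over (Heu_sol, Best_sol) pairs."""
    values = [relative_advantage(heu, best) for heu, best in pairs]
    return sum(values) / len(values)


# Hypothetical makespans (heuristic, best known) for three instances.
pairs = [(950, 1000), (1030, 1000), (990, 1000)]
per_instance = [relative_advantage(h, b) for h, b in pairs]
mean_value = average_advantage(pairs)
print(per_instance, mean_value)
```

For these made-up values the per-instance results are -5%, 3% and -1%, so the average is -1%.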
4 Experimental Results
In this section, the computational results of the calibrated GABC algorithm are presented. The calibrated algorithm uses the ST2PX crossover, swap mutation, Pc = 0.8, Pm = 0.4, Psize = 200, binary tournament selection, and the stopping criterion described in Section 3.3. It is important to note that ST2PX is a new crossover for this problem. As shown in Section 1, the only algorithm proposed for this problem, GAH, is introduced in [11], where 9 variants of metaheuristics [23, 28, 29, 30, 31, 32, 33, 34] for the permutation flowshop problem are compared and GAH is found to be the best for solving the problem. GAH is compared with the GABC algorithm in Table 1. It can be seen that, beyond a certain threshold of problem complexity, GAH does not show satisfactory results as the problem size increases. Negative numbers in the table indicate an improvement over the best known result. On average, the GABC algorithm improves on the best known result by 0.16% to 13.35% for most problem cases.
Table 1. Average percentage of relative advantage over the best known solution

              SSD10 P13         SSD10 P2          SSD10 P3
Instance     GABC     GAH      GABC     GAH      GABC     GAH
20x5         0.18    0.06     -0.41    2.64     -1.25    3.06
20x10       -0.10    0.16     -0.72    1.76     -0.45    2.34
20x20       -0.55    0.22      0.09    1.23     -0.54    1.03
Average     -0.16    0.15     -0.35    1.88     -0.74    2.14

              SSD50 P13         SSD50 P2          SSD50 P3
Instance     GABC     GAH      GABC     GAH      GABC     GAH
20x5        -0.03    0.99     -5.38    3.10     -7.81    4.22
20x10       -2.40    0.16     -6.48    1.38     -6.05    2.86
20x20       -2.81    0.63     -4.58    1.13     -5.17    1.13
Average     -1.75    0.59     -5.48    1.87     -6.34    2.74

              SSD100 P13        SSD100 P2         SSD100 P3
Instance     GABC     GAH      GABC     GAH      GABC     GAH
20x5        -1.50    1.25     -9.77    4.09    -12.76    5.89
20x10       -5.92    1.13    -12.42    2.40    -11.86    2.20
20x20       -7.25    1.19    -10.13    1.64    -10.02    1.81
Average     -4.89    1.19    -10.77    2.71    -11.55    3.30

              SSD125 P13        SSD125 P2         SSD125 P3
Instance     GABC     GAH      GABC     GAH      GABC     GAH
20x5        -2.83    1.80    -11.44    3.96    -13.55    5.69
20x10       -8.05    1.56    -14.67    2.07    -14.35    3.67
20x20      -10.10    0.64    -12.66    1.46    -12.13    1.65
Average     -6.99    1.33    -12.92    2.50    -13.35    3.67

Average     -3.45    0.82     -7.38    2.24     -7.99    2.96

5 Conclusions
An effective genetic algorithm for an HFS with sequence dependent setup times, unrelated parallel machines at each stage, and machine availability constraints is presented. A crossover operator is introduced to improve the solution quality of a recently proposed genetic algorithm. Experimental results obtained on the benchmark data set show that the proposed algorithm can handle complex problems, yielding high quality solutions. Computational experiments are performed to compare the algorithm with another metaheuristic approach, the previous best known genetic algorithm. The results are better than the previous ones: on average over all instances, the algorithm achieved a 6.27% better value of the objective function. Statistical tests applied to the results of these algorithms confirmed the superiority of the new variant of the genetic algorithm.

The results are not meant to be complete, but they give an overview of the methodology and some interesting relations. These results motivate further exploration of the reasons for the success of the algorithm, as well as of its robustness to larger problem sizes, especially in the real industrial environment of television production.
References
1. Morita, H., Shio, N.: Hybrid Branch and Bound Method with Genetic Algorithm for Flexible Flowshop Scheduling Problem. JSME International Journal, Series C 48(1), 46–52 (2005)
2. Kochhar, S., Morris, R.: Heuristic Method for Flexible Flow Line Scheduling. J. Manufacturing Systems 6(4), 299–314 (1987)
3. Santos, D., Hunsucker, J., Deal, D.: Global Lower Bounds for Flow Shops with Multiple Processors. European J. Operational Research 80, 112–120 (1995)
4. Aghezzaf, E., Artiba, A.: Aggregate Planning in Hybrid Flowshops. Int. J. Production Research 36(9), 2463–2477 (1998)
5. Guinet, A., Solomon, M.: Scheduling Hybrid Flowshops to Minimize Maximum Tardiness or Maximum Completion Time. Int. J. Production Research 34(6), 1643–1654 (1996)
6. Gupta, J., Tunc, E.: Minimizing Tardy Jobs in a Two-Stage Hybrid Flowshop. Int. J. Production Research 36(9), 2397–2417 (1998)
7. Portmann, M., Vignier, A.: Branch and Bound Crossed with GA to Solve Hybrid Flowshops. European J. Operational Research 107, 389–400 (1998)
8. Riane, F., Artiba, A., Elmaghraby, S.: A Hybrid Three-Stage Flowshop Problem: Efficient Heuristics to Minimize Makespan. European J. Operational Research 109, 321–329 (1998)
9. Arthanary, L., Ramaswamy, K.: An Extension of Two Machine Sequencing Problem. OPSEARCH. The Journal of the Operational Research Society of India 8(4), 10–22 (1971)
10. Salvador, M.: A solution to a special case of flow shop scheduling problems. In: Elmaghraby, S.E. (ed.) Symposium of the Theory of Scheduling and Applications, pp. 83–91. Springer, New York (1973)
11. Ruiz, R., Maroto, C.: A genetic algorithm for hybrid flowshops with sequence dependent setup times and machine eligibility. European J. Operational Research 169, 781–800 (2006)
12. Vignier, A., Billaut, J., Proust, C.: Les Problèmes d'Ordonnancement de Type Flow-Shop Hybride: État de l'Art. RAIRO Recherche opérationnelle 33(2), 117–183 (1999)
13. Allaoui, H., Artiba, A.: Integrating simulation and optimization to schedule a hybrid flow shop with maintenance constraints. Computers & Industrial Engineering 47, 431–450 (2004)
14. Aghezzaf, E., Artiba, A., Moursli, O., Tahon, C.: Hybrid flowshop problems, a decomposition based heuristic approach. In: Proceedings of the International Conference on Industrial Engineering and Production Management, IEPM 1995, Marrakech. FUCAM – INRIA, pp. 43–56 (1995)
15. Zandieh, M., Fatemi Ghomi, S., Moattar Husseini, S.: An immune algorithm approach to hybrid flow shops scheduling with sequence-dependent setup times. Applied Mathematics and Computation 180, 111–127 (2006)
16. Gourgand, M., Grangeon, N., Norre, S.: Metaheuristics for the deterministic hybrid flow shop problem. In: Proceedings of the International Conference on Industrial Engineering and Production Management, IEPM 1999, Glasgow. FUCAM – INRIA, pp. 136–145 (1999)
17. Adler, L., Fraiman, N., Kobacker, E., Pinedo, M., Plotnicoff, J., Wu, T.P.: BPSS: A Scheduling Support System for the Packaging Industry. Operations Research 41(4), 641–648 (1993)
18. Tang, L., Zhang, Y.: Heuristic Combined Artificial Neural Networks to Schedule Hybrid Flow Shop with Sequence Dependent Setup Times. In: Wang, J., Liao, X.F., Yi, Z. (eds.) ISNN 2005. LNCS, vol. 3496, pp. 788–793. Springer, Heidelberg (2005)
19. Ryan, E., Azad, R.M.A., Ryan, C.: On the Performance of Genetic Operators and the Random Key Representation. In: Genetic Programming. LNCS, vol. 2003/2004, pp. 162–173. Springer, Heidelberg (2004)
20. Sherali, H., Sarin, S., Kodialam, M.: Models and algorithms for a two-stage production process. Production Planning and Control 1, 27–39 (1990)
21. Rajendran, C., Chaudhuri, D.: A multi-stage parallel processor flowshop problem with minimum flowtime. European J. Operational Research 57, 11–122 (1992a)
22. Alcaraz, J., Maroto, C., Ruiz, R.: Solving the multi-mode resource-constrained project scheduling problem with genetic algorithms. Journal of the Operational Research Society 54, 614–626 (2003)
23. Reeves, C.: A genetic algorithm for flowshop sequencing. Computers & Operations Research 22(1), 5–13 (1995)
24. Goldberg, D.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading (1989)
25. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs, 3rd edn. Springer, Heidelberg (1996)
26. Gen, M., Cheng, R.: Genetic algorithms & engineering optimization, p. 512. John Wiley & Sons, New York (1997)
27. Bierwirth, C., Mattfeld, D., Kopfer, H.: On permutation representations for scheduling problems. In: Ebeling, W., Rechenberg, I., Voigt, H.-M., Schwefel, H.-P. (eds.) PPSN 1996. LNCS, vol. 1141, pp. 310–318. Springer, Heidelberg (1996)
28. Nawaz, M., Enscore Jr., E., Ham, I.: A heuristic algorithm for the m-machine, n-job flow-shop sequencing problem. OMEGA, The International Journal of Management Science 11(1), 91–95 (1983)
29. Osman, I., Potts, C.: Simulated annealing for permutation flow-shop scheduling. OMEGA, The International Journal of Management Science 17(6), 551–557 (1989)
30. Widmer, M., Hertz, A.: A new heuristic method for the flowshop sequencing problem. European J. Operational Research 41, 186–193 (1989)
31. Chen, C., Vempati, V., Aljaber, N.: An application of genetic algorithm for flow shop problems. European J. Operational Research 80, 389–396 (1995)
32. Murata, T., Ishibuchi, H., Tanaka, H.: Genetic algorithms for flowshop scheduling problems. Computers and Industrial Engineering 30(4), 1061–1071 (1996)
33. Aldowaisan, T., Allahverdi, A.: New heuristics for no-wait flowshops to minimize makespan. Computers & Operations Research 30, 1219–1231 (2003)
34. Rajendran, C., Ziegler, H.: Ant-colony algorithms for permutation flowshop scheduling to minimize makespan/total flowtime of jobs. European J. Operational Research 155, 426–438 (2004)
The Relevance of New Data Structure Approaches for Dense Linear Algebra in the New Multi-Core / Many Core Environments

Fred G. Gustavson
IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA
[email protected]
Abstract. For about ten years now, Bo Kågström's Group in Umeå, Sweden, Jerzy Waśniewski's Team at the Danish Technical University in Lyngby, Denmark, and I at IBM Research in Yorktown Heights have been applying recursion and New Data Structures (NDS) to increase the performance of Dense Linear Algebra (DLA) factorization algorithms. Later, John Gunnels, and later still, Jim Sexton, both now at IBM Research, also began working in this area. For about three years now, almost all computer manufacturers have been dramatically changing their computer architectures to designs they call Multi-Core (MC). It turns out that these new designs give poor performance for the traditional designs of DLA libraries such as LAPACK and ScaLAPACK. Recent results of Jack Dongarra's group at the Innovative Computing Laboratory in Knoxville, Tennessee have shown how to obtain high performance for DLA factorization algorithms on the Cell architecture, an example of an MC processor, but only when NDS were used. In this talk we will give some reasons why this is so.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 618–621, 2008. © Springer-Verlag Berlin Heidelberg 2008

1 Introduction

Multi-core/Many Core (MC) can be considered a revolution in computing. In many of my papers I have talked about the fundamental triangle of Algorithms, Architectures and Compilers [7] or the Algorithms and Architecture approach [1,12,11]. MC is a revolution in Architectures. The fundamental triangle concept says that all three areas are inter-related. This means that Compilers and Algorithms must change, and probably in a radical way. In late 2004, the LAPACK and ScaLAPACK projects, under the direction of Jim Demmel and Jack Dongarra, started an enhancement of LAPACK and ScaLAPACK. Then in 2006, Jack Dongarra started another project, called PLASMA, which was directed at the effect MC would have on LAPACK. His initial findings were that traditional BLAS based LAPACK will need substantial changes. So, what appeared at first to be enhancements of these libraries now appears to be directed at more basic structural changes.

For about 10 years now, Bo Kågström's Group in Umeå, Sweden, Jerzy Waśniewski's Team at the Danish Technical University in Lyngby, Denmark, and I at IBM Research in Yorktown Heights have been applying recursion and New Data Structures (NDS) to increase the performance of Dense Linear Algebra (DLA) factorization algorithms. More recently, John Gunnels and Jim Sexton of IBM Research have also become somewhat involved. It turns out that the results of our research are also relevant to MC processors. A useful result of our research was the introduction of NDS [9,12,10,2,3,15,14].

The essence of MC is many cores on a single chip. The Cell BE (Broadband Engine) is an example. Cell is a heterogeneous chip consisting of a single traditional PPE (a PowerPC processor), 8 SPEs (Synergistic Processing Elements) and a novel memory system interconnect. Each SPE core can be thought of as a processor and a "cache memory". Because of this, "cache blocking" is still very important. Cache blocking was first invented by my Group at IBM in 1985 [16] and the Cedar project at the University of Illinois. The advent of the Cray 2 was a reason for the introduction of level 3 BLAS [6], followed by the introduction of the LAPACK [4] library in the early 1990s. Now, according to some preliminary results of the PLASMA project, this level 3 BLAS approach is no longer adequate to produce high performance LAPACK codes for MC. Nonetheless, it can be argued that the broad idea of "cache blocking" is still mandatory, as data in the form of matrix elements must be fed to the SPEs so that they can be processed. Equally important is the arrangement in memory of the matrices that are being processed. So, this is what we will call "cache blocking" here. My invited lecture at PPAM07 addressed "cache blocking" as it relates to dense linear algebra factorization algorithms (DLAFA).
It repeated some of my early work on this subject and it sketched a proof that DLAFA can be viewed as just doing matrix multiply, by adopting the linear transformation approach of applying equivalence transformations to a set of matrix equations $Ax = b$ to produce an equivalent (simpler) form of these equations, $Cx = d$. Examples of C are $LU = PA$ for Gaussian elimination, $LL^T = A$ for Cholesky factorization, and $QR = A$ for Householder's factorization. I adopted this view to show a general way to produce a whole collection of DLAFA, as opposed to the commonly accepted way of describing the same collection as a set of distinct algorithms [8]. A second reason was to indicate that for each linear transformation I performed I was invoking the definition of matrix multiplication. Here is the gist of the proof as it applies to $LU = PA$.
1. Perform n = N/NB rank NB linear transformations on A to get U.
2. Each of these n composed rank NB linear transformations is matrix multiply by definition.
3. By the principle of equivalence we have $Ax = b$ if and only if $Ux = L^{-1}Pb$.

Matrix multiplication clearly involves "cache blocking". Around the mid 1990s I noticed, see page 739 of [9], that the API for Level 3 BLAS GEMM could hurt performance. In fact, this API is also the API for 2-D arrays in Fortran and C. LAPACK and ScaLAPACK also use this API for full arrays. On the other hand, high performance implementations of GEMM do not use this API, as doing so would lead to sub-optimal performance. In fact, some amount of data copy is usually done by most high performance GEMM implementations. Now, level 3 BLAS are called multiple times by DLAFA. This means that multiple data copies will usually occur in DLAFA that use standard level 3 BLAS. The full NDS are storage formats that are good for GEMM. This was another main message of my invited talk at PPAM07.

DLAFA can be expressed in terms of scalar elements $a_{i,j}$, which are one by one block matrices. Alternatively, they can be expressed in terms of partitioned submatrices $A(I : I + NB - 1, J : J + NB - 1)$ of order NB. See [8] for a definition of colon notation. The algorithms are almost identical. However, the latter description automatically incorporates "cache blocking" into a DLAFA. Take the scalar statement $c_{i,j} = c_{i,j} - a_{i,k} b_{k,j}$, representing matrix multiply as a fused multiply-add. The corresponding statement for partitioned submatrices becomes a kernel routine for level 3 BLAS GEMM. However, it is imperative to store the order NB square blocks (SBs) as contiguous blocks of matrix data, as this is what level 3 BLAS GEMM does internally. We remark that this is not possible with the standard Fortran and C API. This was another main message of my talk at PPAM07, which emphasized the importance of storing the submatrices of DLAFA as contiguous blocks of storage.

An essence of NDS for full matrices is to store their submatrices as contiguous blocks of storage. The simple format of full NDS has each rectangular block (RB) or square block (SB) in standard column major (CM) or standard row major (RM) format; see [12,11] for more details. These formats are the same as block data layout (BDL). BDL is described in [17], and that paper shows that BDL is the format that leads to minimal L1, L2 and TLB misses for matrix operations that treat rows and columns equally. This result is true because the cache mapping of an SB is the identity mapping for most cache designs.
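The difference between the standard 2-D array layout and a square block layout can be made concrete with a short sketch. It is an illustration only, not code from the cited papers: the function names, the requirement that NB divides n, and the choice of ordering the blocks column by column with column major storage inside each block are assumptions made for the example.

```python
def cm_block_addresses(ib, jb, nb, lda):
    """Offsets of block (ib, jb) inside a column major array with leading dimension lda."""
    return [(jb * nb + j) * lda + (ib * nb + i)
            for j in range(nb) for i in range(nb)]


def to_square_blocks(a_cm, n, lda, nb):
    """Copy an n-by-n column major matrix into contiguous NB-by-NB blocks.

    Blocks are laid out block column by block column, and each block is
    itself stored in column major order, so it occupies nb*nb adjacent
    locations.
    """
    assert n % nb == 0, "sketch assumes NB divides n"
    nblk = n // nb
    out = [0.0] * (n * n)
    for jb in range(nblk):
        for ib in range(nblk):
            base = (jb * nblk + ib) * nb * nb
            for k, src in enumerate(cm_block_addresses(ib, jb, nb, lda)):
                out[base + k] = a_cm[src]
    return out


# 4-by-4 example with A(i, j) = 10*i + j stored in column major order.
n, nb = 4, 2
a_cm = [10 * i + j for j in range(n) for i in range(n)]
blocked = to_square_blocks(a_cm, n, n, nb)
print(cm_block_addresses(0, 0, nb, n))  # [0, 1, 4, 5]: strided by lda in the 2-D array
print(blocked[:4])                      # [0, 10, 1, 11]: the same block, now contiguous
```

In the standard layout the elements of one block are separated by the leading dimension, whereas in the blocked layout they are adjacent, which is the property the text attributes to the full NDS formats.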
The last part of my invited lecture spoke about the recent results of the PLASMA project as they relate to the Linpack benchmark $LU = PA$ when running on the Cell processor. According to Dongarra's Team it was crucial that NDS be used as the matrix format. Using the standard API did not yield good performance results. I close this extended abstract of my invited lecture by mentioning recent results obtained by considering the new IBM Blue Gene/L computers [5]. The simple format of full NDS needs to be re-arranged internally to take into account "cache blocking" for the L0 cache. The L0 cache is a new term defined in [14]; it refers to the register file of the FPU that is attached to the L1 cache. Full details are given in [14].
References
1. Agarwal, R.C., Gustavson, F.G., Zubair, M.: Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms. IBM Journal of Research and Development 38(5), 563–576 (1994)
2. Andersen, B.S., Gustavson, F.G., Waśniewski, J.: A Recursive Formulation of Cholesky Factorization of a Matrix in Packed Storage. ACM TOMS 27(2), 214–244 (2001)
3. Andersen, B.S., Gunnels, J.A., Gustavson, F.G., Reid, J.K., Waśniewski, J.: A Fully Portable High Performance Minimal Storage Hybrid Cholesky Algorithm. ACM TOMS 31(2), 201–227 (2005)
4. Anderson, E., Bai, Z., Bischof, C., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Ostrouchov, S., Sorensen, D.: LAPACK Users' Guide Release 3.0. SIAM, Philadelphia (1999), http://www.netlib.org/lapack/lug/lapack_lug.html
5. Chatterjee, S., et al.: Design and Exploitation of a High-performance SIMD Floating-point Unit for Blue Gene/L. IBM Journal of Research and Development 49(2-3), 377–391 (2005)
6. Dongarra, J.J., Du Croz, J., Hammarling, S., Duff, I.: A Set of Level 3 Basic Linear Algebra Subprograms. ACM TOMS 16(1), 1–17 (1990)
7. Elmroth, E., Gustavson, F.G., Jonsson, I., Kågström, B.: Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software. SIAM Review 46(1), 3–45 (2004)
8. Golub, G., Van Loan, C.: Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore and London (1996)
9. Gustavson, F.G.: Recursion Leads to Automatic Variable Blocking for Dense Linear-Algebra Algorithms. IBM Journal of Research and Development 41(6), 737–755 (1997)
10. Gustavson, F.G., Jonsson, I.: Minimal Storage High Performance Cholesky via Blocking and Recursion. IBM Journal of Research and Development 44(6), 823–849 (2000)
11. Gustavson, F.G.: New Generalized Data Structures for Matrices Lead to a Variety of High Performance Linear Algebra Algorithms. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds.) PPAM 2001. LNCS, vol. 2328, pp. 418–436. Springer, Heidelberg (2002)
12. Gustavson, F.G.: High Performance Linear Algebra Algorithms using New Generalized Data Structures for Matrices. IBM Journal of Research and Development 47(1), 31–55 (2003)
13. Gustavson, F.G.: New Generalized Data Structures for Matrices Lead to a Variety of High performance Dense Linear Algorithms. In: Dongarra, J., Madsen, K., Waśniewski, J. (eds.) PARA 2004. LNCS, vol. 3732, pp. 11–20. Springer, Heidelberg (2006)
14. Gustavson, F.G., Gunnels, J., Sexton, J.: Minimal Data Copy For Dense Linear Algebra Factorization. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 540–549. Springer, Heidelberg (2007)
15. Gustavson, F.G., Waśniewski, J.: LAPACK Cholesky routines in rectangular full packed format. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 570–579. Springer, Heidelberg (2007)
16. IBM: IBM Engineering and Scientific Subroutine Library for AIX Version 3, Release 3. IBM Pub. No. SA22-7272-00 (February 1986)
17. Park, N., Hong, B., Prasanna, V.: Tiling, Block Data Layout, and Memory Hierarchy Performance. IEEE Trans. Parallel and Distributed Systems 14(7), 640–654 (2003)
Three Versions of a Minimal Storage Cholesky Algorithm Using New Data Structures Gives High Performance Speeds as Verified on Many Computers

Jerzy Waśniewski (1) and Fred G. Gustavson (2)
(1) Informatics & Mathematical Modeling, Technical University of Denmark, DK-2800 Lyngby, Denmark
[email protected]
(2) IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA
[email protected]
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 622–627, 2008. © Springer-Verlag Berlin Heidelberg 2008

Extended Abstract

Three versions of a High Performance Minimal Storage Cholesky Algorithm which uses New Data Structures are briefly described. The formats are Recursive Packed (RP), Rectangular Full Packed (RFP), and Block Packed Hybrid. They all exhibited high performance on the many computers on which we have done timing studies.

Summary: We describe three new data formats for storing triangular, symmetric, and Hermitian matrices. The standard two dimensional arrays of Fortran and C (also known as full format) that are used to store triangular, symmetric, and Hermitian matrices waste nearly half the storage space but provide high performance via the use of level 3 BLAS [5]. Standard packed format arrays fully utilize storage (array space) but provide low performance, as there are no level 3 packed BLAS; packed storage uses level 2 BLAS [4]. We combine the good features of packed and full storage in our new formats, which obtain high performance by using level 3 BLAS. Also, these new formats require exactly the same minimum storage as the LAPACK [3] packed format. In all of our timing studies, these new formats obtain about the same or better performance than the LAPACK [3] full format routines.

Table 1. Performance in Mflops of Cholesky Factorization on SUN UltraSPARC-IV computer, dual-core CPUs (1350 MHz/ 8 MB/core L2-cache), Sun BLAS Library, Real long precision

Subtable A
                  New                     lapack
            rfp         hpp          po          pp
    n      u     l     u     l     u     l     u     l
   50    819   909  1317  1321   910   800   431   617
  100   1419  1533  1954  1953  1268  1389   591   806
  200   1757  1807  1996  1997  1712  1925   708   377
  400   2177  2200  2289  2316  2219  2314   793   257
  500   2170  2206  2377  2415  2354  2350   812   251
  800   2446  2409  2574  2635  2549  2416   800   242
 1000   2512  2459  2644  2711  2617  2498   702   228
 1600   2584  2531  2759  2849  2753  1950   627   217
 2000   2692  2624  2804  2899  2795  2638   629   216
 4000   2337  2753  2842  2944  2834  2700   509    97

Subtable B
                        New                     lapack
            rfp         hpp         rpp          pp
    n      u     l     u     l     u     l     u     l
   50    830   921  1319  1321   256   285   432   619
  100   1416  1531  1950  1952   535   601   591   806
  200   1760  1815  2005  2006   968  1060   708   376
  400   2182  2213  2294  2321  1527  1632   793   257
  500   2168  2205  2386  2422  1827  1845   812   251
  800   2451  2408  2579  2640  2076  2142   784   237
 1000   2513  2472  2644  2717  2361  2313   702   227
 1600   2589  2538  2765  2853  2489  2497   627   216
 2000   2654  2633  2810  2903  2454  2417   617   216
 4000   2448  2762  2845  2942  2847  2814   492    94

Table 2. Performance in Mflops of Cholesky Inversion (subtable A) and Solution (subtable B, nrhs = max(100,n)) on SUN UltraSPARC-IV computer, dual-core CPUs (1350 MHz/ 8 MB/core L2-cache), Sun BLAS Library, Real long precision

Subtable A
                  rfp                     lapack
         no trans      trans         po          pp
    n      u     l     u     l     u     l     u     l
   50    705   680   681   685   546   534   486   532
  100   1166  1158  1108  1121  1171  1143   680   706
  200   1744  1727  1720  1737  1785  1766   824   818
  400   2266  2266  2242  2260  2272  2276   930   885
  500   2283  2282  2285  2302  2376  2397   951   898
  800   2530  2539  2501  2510  2453  2473   887   770
 1000   2544  2570  2608  2578  2523  2524   739   533
 1600   2561  2613  2623  2560  2233  2542   638   424
 2000   2541  2610  2587  2524  2590  2601   584   407
 4000   2598  2414  2396  2506  2340  2424   421   145

Subtable B
                        new                     lapack
            rfp         hpp         rpp          pp
    n      u     l     u     l     u     l     u     l
   50   1773  1763  1567  1568   669   693   559   556
  100   2051  2051  2086  2085  1036  1066   718   713
  200   2590  2591  2447  2448  1538  1565   693   827
  400   2765  2725  2659  2660  2038  2045   733   887
  500   2731  2683  2734  2732  2295  2277   742   900
  800   2867  2865  2803  2814  2462  2464   642   692
 1000   2918  2912  2831  2836  2610  2606   537   542
 1600   2944  2893  2942  2941  2760  2767   445   446
 2000   3023  3031  2915  2853  2718  2910   429   419
 4000   2821  2832  2883  2897  2908  2921   179   175

Recursive Packed Format: A new compact way to store a symmetric or triangular matrix, called Recursive Packed Format (RPF), is described in [2], as are novel ways to transform RPF to and from standard packed format. A new algorithm, called Recursive Packed Cholesky (RPC) [2], that operates on the RPF format is presented there. RPC operates almost entirely by using level-3 BLAS GEMM [5] but requires variants of the algorithms TRSM and SYRK [5] that are designed to work on RPF. We call these RPTRSM and RPSYRK [2] and find that they do most of their FLOPS by calling GEMM [5]. It follows that most of the execution time of the RPC algorithm is spent in calls to GEMM. The advantages of this storage scheme compared to traditional packed and full storage are now briefly mentioned. First, the RPF storage format uses the minimum amount of storage required for symmetric, triangular, or Hermitian matrices. Second, the RPC algorithm is a level-3 implementation of Cholesky factorization. Finally, RPF requires no block size tuning parameter.
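To make the storage comparison between full and packed format concrete, here is a minimal sketch of the standard packed format for the lower triangle, stored column by column. This is the storage baseline of n(n + 1)/2 elements that the text says the new formats match; it is not an implementation of RPF itself. The function names are illustrative, and the 0-based index formula follows from summing the lengths of the preceding columns.

```python
def packed_index(i, j, n):
    """0-based offset of A[i][j], i >= j, in lower packed column major storage."""
    return i + j * (2 * n - j - 1) // 2


def pack_lower(a):
    """Copy the lower triangle of a square matrix into packed storage."""
    n = len(a)
    ap = [0.0] * (n * (n + 1) // 2)
    for j in range(n):
        for i in range(j, n):
            ap[packed_index(i, j, n)] = a[i][j]
    return ap


def storage(n):
    """Return (full, packed) element counts for an order n triangular matrix."""
    return n * n, n * (n + 1) // 2


a = [[4, 0, 0],
     [2, 5, 0],
     [1, 3, 6]]
ap = pack_lower(a)
print(ap)             # [4, 2, 1, 5, 3, 6]
print(storage(1000))  # (1000000, 500500)
```

For n = 1000 the full array holds 1,000,000 elements while the packed array holds 500,500, which is the "nearly half" saving mentioned in the summary.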
Table 3. Performance in Mflops of Cholesky Factorization on IBM Power4 computer, 1300 MHz, caches: L1 128KB, L2 1.5MB, L3 32MB, ESSL BLAS Library, Real long precision

Subtable A

          rfp             new hpp         lapack po       lapack pp
     n      u       l       u       l       u       l       u       l
    50   1087    1077     422     421    1500     331     437     629
   100   1908    1862     840     842    2165     900     790     862
   200   2606    2433    1776    1740    2624    1660    1102     907
   400   2877    2740    2574    2602    2825    2359    1331     959
   500   2823    2658    2759    2787    2814    2530    1344     888
   800   2972    2881    2935    2976    2862    2799    1115     501
  1000   2941    2813    2972    3010    2794    2813    1061     448
  1600   3085    2952    3050    3034    3034    2844     961     364
  2000   2777    2792    3082    3082    2930    2867     984     342
  4000   3069    2975    3017    2918    2836    2646     761     304

Subtable B

          rfp             new hpp         rpp             lapack pp
     n      u       l       u       l       u       l       u       l
    50   1092    1085     406     401     135     133     437     617
   100   1908    1852     846     844     357     328     808     827
   200   2598    2408    1765    1740     831     780    1114     886
   400   2839    2677    2602    2627    1584    1470    1299     988
   500   2787    2621    2705    2713    1833    1633    1333     844
   800   2862    2730    2862    2650    2174    1994    1073     418
  1000   2742    2642    2845    2850    2306    2083     980     354
  1600   2592    2452    2592    2301    2340    2167     758     284
  2000   2914    2601    2930    2914    2614    2480     857     308
  4000   3091    2814    2724    2871    2954    2803     797     320
Table 4. Performance in Mflops of Cholesky Factorization on (subtable A) SUN UltraSPARC-IV computer, dual-core CPUs (1350 MHz/ 8 MB/core L2-cache), Sun BLAS Library; and on (subtable B) IBM Power4 computer, 1300 MHz, caches: L1 128KB, L2 1.5MB, L3 32MB, ESSL BLAS Library. Both are in complex long precision

Subtable A (SUN UltraSPARC-IV)

          rfp, no trans   rfp, trans      lapack po       lapack pp
     n      u       l       u       l       u       l       u       l
    50   1450    1550    1642    1436    1313    1310     893    1347
   100   2046    2001    2072    1858    1564    1944    1292    1368
   200   2343    2285    2346    2220    2123    2389    1657     543
   400   2688    2607    2524    2602    2570    2653    2006     482
   500   2794    2708    2636    2729    2707    2750    2068     478
   800   2912    2794    2750    2862    2895    2246    1430     443
  1000   2951    2834    2824    2880    2925    2852    1319     435
  1600   2994    2900    2495    2902    3070    1499    1256     430
  2000   3039    2969    2965    3033    3091    2989    1206     423
  4000   3036    2942    2980    3068    3092    2144     596     146

Subtable B (IBM Power4)

          rfp, no trans   rfp, trans      lapack po       lapack pp
     n      u       l       u       l       u       l       u       l
    50   1826    1766    1693    1807    2302     727    1140    1479
   100   2483    2407    2471    2467    2435    1484    1725    1743
   200   2804    2663    2684    2776    2821    2189    2189    1907
   400   2936    2807    2754    2844    2955    2669    2230    1431
   500   2941    2850    2869    2922    2902    2742    2043    1028
   800   2986    2913    2931    2986    2913    2878    1750     734
  1000   3030    2898    2963    2996    2930    2852    1698     694
  1600   2758    2664    3034    3051    2829    2690    1521     579
  2000   3100    3056    2954    2890    2829    2996    1664     708
  4000   3104    3000    3077    3034    2968    2867    1465     574
Rectangular Full Packed (RFP) Format: We describe a new data format for storing triangular, symmetric, and Hermitian matrices called Rectangular Full Packed (RFP) [7]. Each full or packed symmetric, Hermitian, or triangular routine becomes a single new RFP routine. We present LAPACK routines for Cholesky factorization, inverse and solution computation using RFP format to illustrate this new format and to describe its performance on the IBM and SUN platforms. The performance of the RFP routines is about the same as that of the LAPACK full routines, for both serial and SMP parallel processing, while using half the storage. Compared with the LAPACK packed routines, the RFP routines are faster by a factor ranging from roughly 1 to 33 for serial processing and from roughly 1 to 100 for SMP parallel processing. Existing LAPACK routines and vendor LAPACK routines were used in the serial and the SMP parallel study, respectively. In both studies vendor Level-3 BLAS [5] were used.

Table 5. Performance Times and MFLOPS of Cholesky Factorization on an IBM Power 4 computer using SMP parallelism on 1, 5, 10 and 15 processors. Here vendor codes for Level 2 and 3 BLAS and POTRF are used, ESSL library version 3.3. PPTRF is LAPACK code. UPLO = 'L'. IBM Power4 1300 MHz, caches: L1 128KB, L2 1.5MB, L3 32MB

                          Times
                          rfptrf   in rfptrf                      lapack
     n  n proc  Mflops             potrf   trsm   syrk  potrf    potrf   pptrf
  1000       1    2695     0.12     0.02   0.05   0.04   0.02     0.12    0.94
             5    7570     0.04     0.01   0.02   0.01   0.01     0.03    0.32
            10   10699     0.03     0.01   0.01   0.01   0.00     0.02    0.16
            15    9114     0.04     0.01   0.02   0.01   0.01     0.02    0.16
  2000       1    2618     1.02     0.13   0.38   0.38   0.13     0.97    8.74
             5   10127     0.26     0.04   0.10   0.09   0.04     0.24    3.42
            10   17579     0.15     0.02   0.06   0.05   0.03     0.12    1.65
            15   23798     0.11     0.02   0.04   0.04   0.01     0.13    1.11
  3000       1    2577     3.49     0.45   1.33   1.28   0.44     3.40   30.42
             5   11369     0.79     0.11   0.28   0.30   0.11     0.71   11.76
            10   19706     0.46     0.06   0.19   0.16   0.05     0.38    6.16
            15   29280     0.31     0.05   0.12   0.10   0.04     0.26    4.28
  4000       1    2664     8.01     1.01   2.90   3.09   1.01     7.55   75.72
             5   11221     1.90     0.26   0.68   0.72   0.24     1.65   25.73
            10   21275     1.00     0.13   0.39   0.36   0.12     0.86   13.95
            15   31024     0.69     0.09   0.28   0.24   0.08     0.59   10.46
  5000       1    2551    16.34     2.04   6.16   6.10   2.04    15.79  154.74
             5   11372     3.66     0.45   1.37   1.44   0.40     3.27   47.76
            10   22326     1.87     0.25   0.78   0.62   0.22     1.73   28.13
            15   32265     1.29     0.17   0.53   0.45   0.14     1.16   20.95

Block Packed Hybrid [1]: We consider an efficient implementation of the Cholesky solution of symmetric positive-definite full linear systems of equations using packed storage. We take the same starting point as that of LINPACK [6] and LAPACK [3], with the upper (or lower) triangular part of the matrix being stored by columns. Following LINPACK [6] and LAPACK [3], we overwrite the given matrix by its Cholesky factor. We consider the use of a hybrid format in which blocks of the matrices are held contiguously and compare this format to using conventional full format storage and the RPF for the algorithms we consider in our timing studies [1]. We mention that this format has become the format of choice for multi-core processors.
Table 6. Performance in Times and Mflops of Cholesky Factorization on SUN UltraSPARC-IV computer with a different number of Processors, testing the SMP Parallelism. PPTRF does not show any SMP parallelism. UPLO = 'L'. SUN UltraSPARC-IV, 1300 MHz, caches: L1 64KB, L2 8MB

                          Times
                          rfptrf   in rfptrf                      lapack
     n  n proc  Mflops             potrf   trsm   syrk  potrf    potrf   pptrf
  1000       1    1587     0.21     0.03   0.09   0.07   0.03     0.19    1.06
             5    4762     0.07     0.02   0.02   0.02   0.02     0.07    1.13
            10    5557     0.06     0.01   0.01   0.02   0.02     0.06    1.12
            15    5557     0.06     0.02   0.01   0.01   0.02     0.06    1.11
  2000       1    1668     1.58     0.22   0.63   0.52   0.22     1.45   11.20
             5    6667     0.40     0.07   0.13   0.13   0.07     0.38   11.95
            10    8602     0.31     0.06   0.07   0.11   0.07     0.25   11.24
            15    9524     0.28     0.06   0.06   0.08   0.08     0.23   11.66
  3000       1    1819     4.95     0.62   1.98   1.72   0.63     4.86   45.48
             5    6872     1.31     0.20   0.42   0.48   0.20     1.38   55.77
            10   12162     0.74     0.14   0.22   0.21   0.16     0.76   46.99
            15   12676     0.71     0.14   0.16   0.30   0.16     0.61   45.71
  4000       1    1823    11.70     1.52   4.62   4.01   1.55    11.86  112.52
             5    7960     2.68     0.40   0.94   0.92   0.42     2.74  112.77
            10   14035     1.52     0.26   0.47   0.49   0.30     1.61  112.53
            15   17067     1.25     0.24   0.37   0.35   0.29     1.29  111.67
  5000       1    1843    22.61     2.92   8.76   8.00   2.93    23.60  218.94
             5    8139     5.12     0.77   1.81   1.80   0.74     5.45  221.58
            10   14318     2.91     0.50   0.97   0.93   0.51     3.11  214.54
            15   17960     2.32     0.45   0.72   0.68   0.47     2.40  225.08
Conclusions: All three formats permit the use of Level-3 BLAS for the Cholesky computational phases of factorization, scaling and Schur complement update. The same remarks apply to the inverse and solution phases of positive definite symmetric and Hermitian matrices. They all use minimum storage, whereas the full format needs almost twice as much memory. The performance studies indicate that optimized kernels of the hybrid format give slightly better performance than the other formats do using Level-3 BLAS.

Performance tables: There are six performance tables. Tables 1 – 4 each have two subtables, A and B. The performance numbers in subtables A and B were calculated in different runs. The six tables are described by their captions and the names of the Cholesky algorithm versions. The Cholesky algorithm versions are abbreviated: rfp – Rectangular Full Packed, hpp – Hybrid Blocked Packed, rpp – Recursive Packed, po – LAPACK Full, and pp – LAPACK Packed. Tables 1 – 4 show uni-processor speeds. Tables 5 and 6 show the parallel SMP speeds. Performance results for two computers, IBM Power4 and SUN UltraSPARC-IV, are presented in this paper. More results can be seen in [1,2,7].
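As a rough cross-check of how the times and Mflops columns of Tables 5 and 6 relate, the following sketch recomputes a rate from a time. It assumes the customary n^3/3 flop count for Cholesky factorization; the tables do not state the count that was used, so this is an assumption, and the function names are illustrative only.

```python
def cholesky_mflops(n, seconds):
    """Mflop rate of a Cholesky factorization, ASSUMING n^3/3 floating point operations."""
    return (n ** 3 / 3.0) / seconds / 1.0e6


def packed_elements(n):
    """Number of stored elements in minimal (packed, RFP, hybrid) storage."""
    return n * (n + 1) // 2


def full_elements(n):
    """Number of stored elements in full format storage."""
    return n * n


# Values taken from Table 5, n = 5000, one processor.
n = 5000
t_rfptrf = 16.34   # time of the RFP factorization
t_pptrf = 154.74   # time of the LAPACK packed factorization

rate = cholesky_mflops(n, t_rfptrf)                      # close to the tabulated 2551 Mflops
speedup = t_pptrf / t_rfptrf                             # RFP versus LAPACK packed
storage_ratio = full_elements(n) / packed_elements(n)    # "almost twice as much memory"
```

Under this assumption the recomputed rate agrees with the tabulated value up to the rounding of the printed time, and the storage ratio approaches 2 as n grows.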
Acknowledgments. The results in this paper were obtained on two computers, an IBM and SUN. The IBM machine belongs to the Center for Scientific Computing at Aarhus, the SUN machine to the Danish Technical University. We would like to thank Bernd Dammann for consulting on the SUN system and Niels Carl W. Hansen on the IBM.
References
1. Andersen, B.S., Gunnels, J., Gustavson, F.G., Reid, J.K., Waśniewski, J.: A fully portable high performance minimal storage hybrid format Cholesky algorithm. TOMS 31, 201–227 (2005)
2. Andersen, B.A., Gustavson, F.G., Waśniewski, J.: A recursive formulation of Cholesky factorization of a matrix in packed storage. TOMS 27(2), 214–244 (2001)
3. Anderson, E., et al.: LAPACK Users Guide Release 3.0. SIAM, Philadelphia (1999), http://www.netlib.org/lapack/
4. Dongarra, J.J., Du Croz, J., Hammarling, S., Hanson, R.J.: An Extended Set of Fortran Basic Linear Algebra Subroutines. ACM Trans. Math. Soft. 14(1), 1–17 (1988)
5. Dongarra, J.J., Du Croz, J., Duff, I.S., Hammarling, S.: A Set of Level 3 Basic Linear Algebra Subprograms. ACM Trans. Math. Soft. 16(1), 1–17 (1990)
6. Dongarra, J.J., Bunch, J.R., Moler, C.B., Stewart, G.W.: LINPACK Users Guide. SIAM, Philadelphia (1979)
7. Gustavson, F.G., Waśniewski, J.: Rectangular Full Packed Format for LAPACK Algorithms, Timings on Several Computers. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, Springer, Heidelberg (2007)
Hardware-Oriented Implementation of Cache Oblivious Matrix Operations Based on Space-Filling Curves

Michael Bader, Robert Franz, Stephan Günther, and Alexander Heinecke

Dept. of Informatics, TU München, 80290 München, Germany
Abstract. We will present hardware-oriented implementations of block-recursive approaches for matrix operations, esp. matrix multiplication and LU decomposition. An element order based on a recursively constructed Peano space-filling curve is used to store the matrix elements. This block-recursive numbering scheme is changed into a standard row-major order, as soon as the respective matrix subblocks fit into level-1 cache. For operations on these small blocks, we implemented hardware-oriented kernels optimised for Intel's Core architecture. The resulting matrix-multiplication and LU-decomposition codes compete well with optimised libraries such as Intel's MKL, ATLAS, or GotoBLAS, but have the advantage that only comparably small and well-defined kernel operations have to be optimised to achieve high performance.
1 Introduction

Standard linear algebra tasks, such as matrix multiplication or LU decomposition, are at the core of many codes in scientific computing. A lot of effort is therefore invested in the development of highly-optimised libraries for these tasks, Intel's MKL [9], the ATLAS project [11], or GotoBLAS [7] being well-known representatives. All of them are based on heavily hardware-aware implementations and optimisations, and are either designed for a specific hardware architecture or at least automatically tuned to given hardware, as in ATLAS. In contrast, approaches that are oblivious of the given hardware need to use algorithmic patterns that are inherently optimal on a large range of architectures. As the efficient use of cache memory, especially for dense matrix operations, is one of the main bottlenecks on current hardware, cache oblivious approaches, i.e. algorithmic schemes that strive for an inherently cache-friendly access pattern to memory, are an active focus of research. In particular, block-recursive approaches are promising; for an overview of such approaches, both cache oblivious and cache aware, see for example [6] and [8]. Also, a recent work by Yotov et al. [12] raised the question whether cache oblivious approaches can actually be competitive with hardware-aware approaches, a question we also address in this article.

In [2,3,4], we presented a cache oblivious approach for matrix multiplication and LU decomposition, which uses a block-recursive numbering scheme based on Peano space-filling curves, and exploits their excellent locality properties to achieve a highly local and inherently cache-friendly access to memory. So far, we focused on locality properties and cache efficiency of the algorithms. Performance results, however, showed that for operations on very small subblocks, cache efficiency is no longer the major bottleneck, and the algorithms have to be tuned to CPU-specific properties instead. In this article, we therefore present an extended approach, where the block-recursive scheme is modified to use optimised kernels for operations on small matrix blocks. The numbering scheme, described in section 2 together with the cache oblivious algorithms, is changed such that the respective small blocks are placed in memory in standard row-major order. The block size was chosen to fit the size of the L1 cache, and the kernel operations were implemented in hand-optimised assembler (see section 3). Our performance results, presented in section 4, show that the resulting codes compete well with the currently fastest libraries, Intel MKL and GotoBLAS.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 628–638, 2008. © Springer-Verlag Berlin Heidelberg 2008
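2 Cache Oblivious Matrix Operations Using a Peano-Curve Block-Numbering Scheme

Our cache oblivious algorithm for matrix multiplication is a block-recursive extension of the simple multiplication scheme for two 3 × 3 matrices, as given in equation (1). There, the indices of the matrix elements indicate the order in which the elements are stored in memory.

\[
\underbrace{\begin{pmatrix} a_0 & a_5 & a_6 \\ a_1 & a_4 & a_7 \\ a_2 & a_3 & a_8 \end{pmatrix}}_{=:A}
\underbrace{\begin{pmatrix} b_0 & b_5 & b_6 \\ b_1 & b_4 & b_7 \\ b_2 & b_3 & b_8 \end{pmatrix}}_{=:B}
=
\underbrace{\begin{pmatrix} c_0 & c_5 & c_6 \\ c_1 & c_4 & c_7 \\ c_2 & c_3 & c_8 \end{pmatrix}}_{=:C}
\tag{1}
\]

Equation (2) shows that the operations to compute the elements $c_r$ of the result matrix can be executed in an inherently cache-friendly order: from each operation to the next, the involved matrix elements are either reused or one of their direct neighbours in memory is accessed:

\[
\begin{array}{l}
c_0 \mathrel{+}= a_0 b_0 \;\to\; c_1 \mathrel{+}= a_1 b_0 \;\to\; c_2 \mathrel{+}= a_2 b_0 \;\to \\
c_2 \mathrel{+}= a_3 b_1 \;\to\; c_1 \mathrel{+}= a_4 b_1 \;\to\; c_0 \mathrel{+}= a_5 b_1 \;\to \\
c_0 \mathrel{+}= a_6 b_2 \;\to\; c_1 \mathrel{+}= a_7 b_2 \;\to\; c_2 \mathrel{+}= a_8 b_2 \;\to \\
c_3 \mathrel{+}= a_8 b_3 \;\to\; c_4 \mathrel{+}= a_7 b_3 \;\to\; c_5 \mathrel{+}= a_6 b_3 \;\to \\
c_5 \mathrel{+}= a_5 b_4 \;\to\; c_4 \mathrel{+}= a_4 b_4 \;\to\; c_3 \mathrel{+}= a_3 b_4 \;\to \\
c_3 \mathrel{+}= a_2 b_5 \;\to\; c_4 \mathrel{+}= a_1 b_5 \;\to\; c_5 \mathrel{+}= a_0 b_5 \;\to \\
c_6 \mathrel{+}= a_0 b_6 \;\to\; c_7 \mathrel{+}= a_1 b_6 \;\to\; c_8 \mathrel{+}= a_2 b_6 \;\to \\
c_8 \mathrel{+}= a_3 b_7 \;\to\; c_7 \mathrel{+}= a_4 b_7 \;\to\; c_6 \mathrel{+}= a_5 b_7 \;\to \\
c_6 \mathrel{+}= a_6 b_8 \;\to\; c_7 \mathrel{+}= a_7 b_8 \;\to\; c_8 \mathrel{+}= a_8 b_8
\end{array}
\tag{2}
\]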
The block-recursive extension of the element order used in equation (1) results in a numbering scheme that is equivalent to the iterations of a Peano space-filling curve (see section 2.1). Based on this numbering scheme, we can derive block-recursive, inherently cache-efficient algorithms for matrix multiplication (cf. section 2.2), and for LU decomposition (cf. section 2.3).
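The locality claim for the 3 × 3 scheme can be checked mechanically. The following sketch encodes the 27 updates of equation (2) as index triples, applies them to matrices stored in the element order of equation (1), and tests that consecutive updates change each of the three indices by at most one. The names `OPS`, `to_peano`, `peano_multiply` and `is_local` are illustrative and not taken from the authors' code.

```python
# Storage index of the element at (row, col), following equation (1).
INDEX = [[0, 5, 6],
         [1, 4, 7],
         [2, 3, 8]]

# The 27 updates c[r] += a[p] * b[q] of equation (2), as (r, p, q) triples.
OPS = [(0, 0, 0), (1, 1, 0), (2, 2, 0), (2, 3, 1), (1, 4, 1), (0, 5, 1),
       (0, 6, 2), (1, 7, 2), (2, 8, 2), (3, 8, 3), (4, 7, 3), (5, 6, 3),
       (5, 5, 4), (4, 4, 4), (3, 3, 4), (3, 2, 5), (4, 1, 5), (5, 0, 5),
       (6, 0, 6), (7, 1, 6), (8, 2, 6), (8, 3, 7), (7, 4, 7), (6, 5, 7),
       (6, 6, 8), (7, 7, 8), (8, 8, 8)]


def to_peano(matrix):
    """Copy a 3x3 matrix (list of rows) into a 9-element list in Peano order."""
    stored = [0] * 9
    for i in range(3):
        for j in range(3):
            stored[INDEX[i][j]] = matrix[i][j]
    return stored


def peano_multiply(a, b):
    """Multiply two Peano-ordered 3x3 matrices using the update order of equation (2)."""
    c = [0] * 9
    for r, p, q in OPS:
        c[r] += a[p] * b[q]
    return c


def naive_multiply(x, y):
    """Reference product of two 3x3 matrices given as lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def is_local(ops):
    """True if every index moves by at most one position between consecutive updates."""
    return all(abs(u - v) <= 1
               for step, nxt in zip(ops, ops[1:])
               for u, v in zip(step, nxt))
```

For any pair of 3 × 3 matrices, `peano_multiply(to_peano(A), to_peano(B))` should equal `to_peano(naive_multiply(A, B))`, and `is_local(OPS)` should hold.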
2.1 Element Order Based on Peano Curves
Figure 1 illustrates the recursive storage scheme for the matrix elements. It is an extension of the simple element order used in equation (1), and is equivalent to the so-called iterations of a Peano curve. Four different block numbering patterns (P, Q, R, and S; P being the initial block scheme) are recursively combined and lead to a contiguous storage scheme of matrix blocks. The recursion is stopped once the matrix blocks are small enough, such that at least two of them fit into L1 cache. On those blocks, we use a simple row-major order and obtain a hybrid numbering scheme. The row-major order is necessary for hardware-optimised implementation of the respective block operations.

Fig. 1. Recursive construction of the Peano numbering scheme; the patterns P, Q, R, and S define the numbering scheme for the matrix subblocks. The recursive block-subdivision is given for patterns P and Q, only.
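The recursive construction can be sketched in a few lines. The version below is an illustration under stated assumptions, not the authors' implementation: it treats Q, R and S as mirrored copies of P (Q mirrored left-right, R mirrored top-bottom, S mirrored both ways), which is one reading of Fig. 1 that reproduces the element order of equation (1); it recurses down to single elements rather than switching to row-major blocks; and the function name `peano_order` is hypothetical.

```python
# Order in which pattern P visits its 3x3 sub-blocks (block row, block column):
# down the first column, up the second, down the third, as in equation (1).
BASE = [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (0, 1), (0, 2), (1, 2), (2, 2)]


def peano_order(k, flip_rows=False, flip_cols=False):
    """Return the (row, col) positions of a 3^k x 3^k matrix in Peano storage order.

    flip_rows / flip_cols select the mirrored patterns:
    P = (False, False), Q = (False, True), R = (True, False), S = (True, True).
    """
    if k == 0:
        return [(0, 0)]
    size = 3 ** (k - 1)
    order = []
    for i, j in BASE:
        # Position of the sub-block inside the (possibly mirrored) pattern.
        block_row = 2 - i if flip_rows else i
        block_col = 2 - j if flip_cols else j
        # Sub-blocks in the middle column are mirrored top-bottom, those in the
        # middle row left-right; mirroring of the parent pattern is combined by XOR.
        sub = peano_order(k - 1, flip_rows ^ (j == 1), flip_cols ^ (i == 1))
        order.extend((block_row * size + r, block_col * size + c) for r, c in sub)
    return order
```

A useful sanity check is that consecutive positions in `peano_order(k)` are always direct neighbours in the matrix, which is the property the locality arguments rely on.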
2.2 A Block-Recursive Scheme for Matrix Multiplication

Equation (3) shows the blockwise matrix multiplication for matrices stored according to the blockwise numbering scheme. Each matrix block is named with respect to its numbering scheme and indexed with the name of the global matrix and the position within the storage scheme:

\[
\underbrace{\begin{pmatrix} P_{A0} & R_{A5} & P_{A6} \\ Q_{A1} & S_{A4} & Q_{A7} \\ P_{A2} & R_{A3} & P_{A8} \end{pmatrix}}_{=:A}
\underbrace{\begin{pmatrix} P_{B0} & R_{B5} & P_{B6} \\ Q_{B1} & S_{B4} & Q_{B7} \\ P_{B2} & R_{B3} & P_{B8} \end{pmatrix}}_{=:B}
=
\underbrace{\begin{pmatrix} P_{C0} & R_{C5} & P_{C6} \\ Q_{C1} & S_{C4} & Q_{C7} \\ P_{C2} & R_{C3} & P_{C8} \end{pmatrix}}_{=:C}
\tag{3}
\]

For the block operations, we derive an execution order that follows equation (2) and is, hence, inherently cache friendly. In two subsequent block multiplications only blocks that are identical or immediate neighbours to the previously accessed matrix blocks are addressed. The first operations of this scheme are:

\[
P_{C0} \mathrel{+}= P_{A0} P_{B0} \;\to\; Q_{C1} \mathrel{+}= Q_{A1} P_{B0} \;\to\; P_{C2} \mathrel{+}= P_{A2} P_{B0} \;\to\; \dots
\tag{4}
\]

(cf. equation (2) for the further operations). In [2], we proved the following excellent spatial and temporal locality properties of the resulting algorithm:

– The element traversal of all three involved matrices can be achieved by increment and decrement operations on the indices, only. Hence, we will nearly always stay within a current cache line during traversal.
– Any sequence of $k^3$ floating point operations is executed on only $O(k^2)$ contiguous elements in each matrix. This is obviously the optimal ratio we can achieve in matrix multiplication, and guarantees optimal re-use of data.
– As a result, the number of cache misses for the multiplication of two n × n matrices is of order $O\left(\frac{n^3}{L\sqrt{M}}\right)$ ($M$ being the number of available cache lines, and $L$ the length of each cache line), which is asymptotically optimal.

However, once the accessed data fits into L1 cache, improving data locality even further does not improve the performance any further. On the contrary, the algorithm's many nested recursive calls compromise the efficiency of important features of modern CPUs, such as SIMD extensions, vectorisation of operations, and pipelining. Hence, in our final algorithm, we switch to row-major order once the respective matrix blocks fit in L1 cache, and replace the block-recursive calls from equation (4) by a hardware-optimised multiplication kernel (see section 3).

2.3 A Block-Recursive Scheme for LU-Decomposition
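In [4], we presented a block-recursive algorithm for LU decomposition:

\[
\underbrace{\begin{pmatrix} P_{L0} & 0 & 0 \\ Q_{L1} & S_{L4} & 0 \\ P_{L2} & R_{L3} & P_{L8} \end{pmatrix}}_{=:L}
\underbrace{\begin{pmatrix} P_{U0} & R_{U5} & P_{U6} \\ 0 & S_{U4} & Q_{U7} \\ 0 & 0 & P_{U8} \end{pmatrix}}_{=:U}
=
\underbrace{\begin{pmatrix} P_{A0} & R_{A5} & P_{A6} \\ Q_{A1} & S_{A4} & Q_{A7} \\ P_{A2} & R_{A3} & P_{A8} \end{pmatrix}}_{=:A}
\tag{5}
\]

($P_{L0}$ and $P_{U0}$ denote lower and upper triangular matrices, respectively). The respective, locality-preserving sequence of block operations is:

1) $P_{L0} P_{U0} = P_{A0}$ – LU decomp.
2) $Q_{L1} P_{U0} = Q_{A1}$ – solve for $Q_{L1}$
3) $P_{L2} P_{U0} = P_{A2}$ – solve for $P_{L2}$
4) $P_{L0} P_{U6} = P_{A6}$ – solve for $P_{U6}$
5) $P_{L0} R_{U5} = R_{A5}$ – solve for $R_{U5}$
6) $R_{A3} \mathrel{-}= P_{L2} R_{U5}$ – matr. mult.
7) $S_{A4} \mathrel{-}= Q_{L1} R_{U5}$ – matr. mult.
8) $S_{L4} S_{U4} = S_{A4}$ – LU decomp.
9) $Q_{A7} \mathrel{-}= Q_{L1} P_{U6}$ – matr. mult.
10) $P_{A8} \mathrel{-}= P_{L2} P_{U6}$ – matr. mult.
11) $R_{L3} S_{U4} = R_{A3}$ – solve for $R_{L3}$
12) $S_{L4} Q_{U7} = Q_{A7}$ – solve for $Q_{U7}$
13) $P_{A8} \mathrel{-}= R_{L3} Q_{U7}$ – matr. mult.
14) $P_{L8} P_{U8} = P_{A8}$ – LU decomp.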
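The order of the 14 steps can be traced on the smallest possible case, in which every block is a single number. The sketch below is an illustration only: it assumes 1 × 1 blocks, a unit diagonal in L, and no pivoting, so each "LU decomp." step reduces to L = 1, U = a and each "solve" to one division. The function name `block_lu_3x3` is hypothetical. Blocks are addressed by (row, column); the comments give the Peano indices used in equation (5).

```python
from fractions import Fraction


def block_lu_3x3(matrix):
    """LU-decompose a 3x3 matrix without pivoting, following steps 1) to 14).

    Each "block" is a scalar here, which keeps the step order visible.
    """
    a = [[Fraction(x) for x in row] for row in matrix]
    L = [[Fraction(0)] * 3 for _ in range(3)]
    U = [[Fraction(0)] * 3 for _ in range(3)]

    L[0][0], U[0][0] = Fraction(1), a[0][0]   # 1)  L0 U0 = A0   (LU decomp.)
    L[1][0] = a[1][0] / U[0][0]               # 2)  L1 U0 = A1   (solve for L1)
    L[2][0] = a[2][0] / U[0][0]               # 3)  L2 U0 = A2   (solve for L2)
    U[0][2] = a[0][2] / L[0][0]               # 4)  L0 U6 = A6   (solve for U6)
    U[0][1] = a[0][1] / L[0][0]               # 5)  L0 U5 = A5   (solve for U5)
    a[2][1] -= L[2][0] * U[0][1]              # 6)  A3 -= L2 U5
    a[1][1] -= L[1][0] * U[0][1]              # 7)  A4 -= L1 U5
    L[1][1], U[1][1] = Fraction(1), a[1][1]   # 8)  L4 U4 = A4   (LU decomp.)
    a[1][2] -= L[1][0] * U[0][2]              # 9)  A7 -= L1 U6
    a[2][2] -= L[2][0] * U[0][2]              # 10) A8 -= L2 U6
    L[2][1] = a[2][1] / U[1][1]               # 11) L3 U4 = A3   (solve for L3)
    U[1][2] = a[1][2] / L[1][1]               # 12) L4 U7 = A7   (solve for U7)
    a[2][2] -= L[2][1] * U[1][2]              # 13) A8 -= L3 U7
    L[2][2], U[2][2] = Fraction(1), a[2][2]   # 14) L8 U8 = A8   (LU decomp.)
    return L, U
```

Multiplying the returned factors reproduces the input matrix, provided no zero pivot occurs. In the block version each scalar operation is replaced by the corresponding block operation.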
In contrast to matrix multiplication, we have to obey certain precedence rules. Thus, the scheme is no longer strictly memory-local, but still minimises cache misses to a large extent. For steps 2–5, 11 and 12, additional block-recursive schemes are derived to solve for a right/left hand side matrix, where an upper/lower triangular matrix is given.

2.4 Blockwise Pivoting for LU-Decomposition
It is commonly known that pivoting, or at least partial pivoting, is necessary to make LU decomposition numerically stable. However, partial pivoting is an inherently non-local operation: the search for the largest element within the current column, for example, and also the exchange of two entire rows of a matrix, immediately affect the entire matrix. Moreover, pivoting is an inherently sequential procedure. Future pivot elements directly depend on previous row operations and, hence, on the choice of previous elements. Therefore, pivoting is inherently cache-inefficient, and in direct conflict with any block-structured approach that splits blocks in both column and row direction. In our nested-recursive algorithm, pivoting is also very tedious to implement (and, in fact, current work). Hence, up to now, we use a simplified, blockwise pivoting scheme that does not completely destroy the locality properties. It improves (but, of course, does not guarantee) the numerical stability of the LU decomposition, and is intended as a first step towards better pivoting schemes. Equation (6) formulates LU decomposition with our current blockwise pivoting, where pivoting is only applied within the smallest matrix blocks:

\[
\begin{pmatrix} P_1 & 0 \\ 0 & P_4 \end{pmatrix}
\begin{pmatrix} A_1 & A_2 \\ A_3 & A_4 \end{pmatrix}
=
\begin{pmatrix} L_1 & 0 \\ L_3 & L_4 \end{pmatrix}
\begin{pmatrix} U_1 & U_2 \\ 0 & U_4 \end{pmatrix}
\tag{6}
\]

To simplify the presentation, we demonstrate this blockwise pivoting scheme for a 2 × 2 block recursion (instead of 3 × 3). Hence, the matrix blocks $L_4$, $U_4$, $P_4$, and $A_4$ combine four (2 × 2) connected blocks in the Peano numbering scheme (and $L_3$, $U_2$, $A_2$, $A_3$ represent two blocks, accordingly). Performing the LU multiplication in equation (6), we obtain the following block-recursive scheme:

(1) $L_1 U_1 = P_1 A_1$: LU decomposition with pivoting within the first block; computes matrices $L_1$, $U_1$, and $P_1$.
(2) $L_1 U_2 = P_1 A_2$: Execute the respective pivot row exchanges on $A_2$, and perform a triangular solve to obtain $U_2$.
(3) $\tilde{L}_3 U_1 = A_3$: Triangular solve to obtain the block matrix $\tilde{L}_3 := P_4^T L_3$.

In step (3), we used that $L_3 U_1 = P_4 A_3 \Leftrightarrow P_4^T L_3 U_1 = P_4^T P_4 A_3$, and hence $\tilde{L}_3 U_1 = A_3$, because $P_4^T = (P_4)^{-1}$. For step (4), we observe that

\[
L_4 U_4 = P_4 A_4 - L_3 U_2 = P_4 A_4 - P_4 P_4^T L_3 U_2 = P_4 (A_4 - \tilde{L}_3 U_2).
\]

Hence, the remaining steps of the scheme are:

(4) $\tilde{A}_4 = A_4 - \tilde{L}_3 U_2$: Requires block multiplication.
(5) $L_4 U_4 = P_4 \tilde{A}_4$: LU decomposition with blockwise pivoting; computes matrices $L_4$, $U_4$, and $P_4$.
(6) $L_3 = P_4 \tilde{L}_3$: Execute row exchanges on $\tilde{L}_3$ to obtain the result $L_3$.

In our current implementation, we execute the pivot row exchanges in steps (3) and (6) as row exchanges on the entire matrix; however, the presented scheme also makes it possible to perform these row exchanges block-by-block. Remember that in the final recursive scheme (with several block-recursion steps), $L_3$ and $U_2$ actually consist of an entire row or column, resp., of matrix blocks, which are not computed in direct sequence. Hence, postponing row exchanges until the respective matrix block is processed might further improve cache efficiency in an improved implementation.

To avoid some trivial situations where this blockwise pivoting fails (for example, if a block of zeros occurs as a diagonal matrix block), we also adopt an initial pivoting step, where we sort the matrix rows according to the size of their diagonal elements. The resulting pivot strategy is sufficient for a certain class of "good-natured" matrices, but of course fails if a diagonal matrix block is singular or near-singular. Still, it is a first step towards more sophisticated pivoting strategies such as incremental pivoting, which is able to update a previous blockwise pivoting efficiently during the later triangular solves [10], or a-priori pivoting strategies based, for example, on maximum weighted matching [5].
3 Hardware-Oriented Implementation
For operations on the smallest matrix blocks, where the block-recursive numbering scheme is replaced by a simple row-major order, the respective operation kernels need to be tuned to exploit specific processor features, such as SSE operations on Intel hardware or use and re-use of floating point (vector) registers. In particular, we implemented optimised kernel routines for Intel's Core architecture. In this section, we briefly outline the main ideas of the assembler implementation of the matrix multiplication kernel; very similar approaches were adopted earlier (and described in more detail), for example, in the work by Aberdeen and Baxter [1]. For the triangular solves within blockwise LU decomposition, we also did an assembler implementation, following a similar approach. The block LU decompositions themselves only sum up to a minor part of the total computational work, so we just used a standard C++ implementation.

The multiplication kernel implements the multiplication of two square matrix blocks stored in row-major order. Block sizes are always multiples of 4, because the Intel SSE registers can store and process 4 float variables in parallel. The assembler implementation of the block multiplication was motivated by the following main principles:

– To avoid stride-k access (k being the block size), the blocks of matrix B are transposed before the respective block multiplication.
– The elements of the result matrix block C are computed via scalar products of a row vector $a^T$ of A and respective column vectors $b_i$ of B. Respective SSE operations are used to calculate these scalar products for subvectors of 4 elements at a time (2 for double precision).
– To improve data reuse, the calculation of four elements of C is intertwined: each row vector $a^T$ is concurrently multiplied with four column vectors of B (see the illustration in figure 2).

Thus, each 4-element subvector of $a^T$ may be kept in its SSE register (xmm1) for four SSE operations, saving three potential memory accesses. Moreover, the registers storing the intermediate values of the C-elements would usually require one stall cycle each before their results can be used for further computations. These are avoided by nesting four such computations.
Fig. 2. Nested computation of the scalar products of a row vector $a^T$ with four column vectors $b_1$ to $b_4$; the scalar product is computed by summing up 4 elements at a time

    movaps xmm1, [esi]        ; load 4 elements of a^T into SSE unit
    movaps xmm2, [edi]        ; load 4 elements of b1 into SSE unit
    mulps  xmm2, xmm1         ; SSE operation; compute (a^T)_i (b1)_i for i = 1, ..., 4
    movaps xmm3, [edi+288]    ; load 4 elements of b2 into SSE unit
    addps  xmm7, xmm2         ; sum up: (xmm7)_i += (a^T)_i (b1)_i for i = 1, ..., 4
    mulps  xmm3, xmm1         ; SSE operation; compute (a^T)_i (b2)_i for i = 1, ..., 4
    movaps xmm2, [edi+576]    ; load 4 elements of b3 into SSE unit
    addps  xmm6, xmm3         ; sum up: (xmm6)_i += (a^T)_i (b2)_i for i = 1, ..., 4
    mulps  xmm2, xmm1         ; SSE operation; compute (a^T)_i (b3)_i for i = 1, ..., 4
    mulps  xmm1, [edi+864]    ; SSE operation; compute (a^T)_i (b4)_i for i = 1, ..., 4
    addps  xmm5, xmm2         ; sum up: (xmm5)_i += (a^T)_i (b3)_i for i = 1, ..., 4
    addps  xmm4, xmm1         ; sum up: (xmm4)_i += (a^T)_i (b4)_i for i = 1, ..., 4

Fig. 3. Code section for the scalar products of row vector $a^T$ of matrix block A with four column vectors $b_1$ to $b_4$ of B; only the operations for subvectors of length 4 are shown. These partial sums are accumulated in registers xmm7, xmm6, xmm5, and xmm4
A resulting typical code segment is shown in figure 3. It computes part of the scalar product of aT with four column vectors b1 to b4 . The code segment only covers the multiplication of subvectors of length 4. The entire computation of the scalar products is organised as a (fully unrolled) loop over this code segment. The achieved performance proved to be quite sensitive to the size of the blocks. For single precision, a block size of 72 × 72 was found to be optimal; for double precision, the optimal block size was 52 × 52.
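The structure of the kernel can be mirrored in a scalar sketch, which may help in reading the assembler segment. The following is only an illustration in plain Python, not a translation of the authors' kernel: the function name `kernel_multiply` is hypothetical, the four 4-lane accumulators merely play the role of registers xmm7 to xmm4, and none of the performance properties carry over.

```python
def kernel_multiply(A, B, C):
    """Compute C += A * B for square blocks whose size is a multiple of 4.

    Mirrors the three principles of the kernel: B is transposed first, scalar
    products are formed from 4-element subvectors, and four columns of B are
    processed together so that each subvector of a row of A is reused four times.
    """
    n = len(A)
    assert n % 4 == 0, "block size must be a multiple of 4"
    # Transpose B so that its columns are stored contiguously.
    Bt = [list(column) for column in zip(*B)]
    for i in range(n):
        row = A[i]
        for j in range(0, n, 4):              # four columns b1..b4 at a time
            # One 4-lane accumulator per column (role of xmm7, xmm6, xmm5, xmm4).
            acc = [[0] * 4 for _ in range(4)]
            for k in range(0, n, 4):          # loop over 4-element subvectors
                a_sub = row[k:k + 4]          # loaded once (role of xmm1)
                for c in range(4):
                    b_sub = Bt[j + c][k:k + 4]
                    for lane in range(4):     # one mulps followed by one addps
                        acc[c][lane] += a_sub[lane] * b_sub[lane]
            for c in range(4):
                # Horizontal sum of the four lanes completes the scalar product.
                C[i][j + c] += sum(acc[c])
```

Called on blocks given as lists of rows, the function updates `C` in place. In the assembler version the loop over `k` is fully unrolled and the lane loop is a single SSE instruction.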
4 Performance Analysis
In this section, we present performance tests for TifaMMy (“TifaMMy isn’t the fastest Matrix Multiplication, yet”), our hardware-optimised implementation of the presented block-recursive matrix multiplication and LU decomposition. All tests were performed on an Intel Core2 Duo processor (Conroe, 2.4 GHz, 32 KB L1 data-cache, 4 MB L2 cache). Only one processor core was used, because the assembler code of TifaMMy is not optimised for multi-core architectures, yet. We compared TifaMMy with the respective library functions offered by GotoBLAS [7] (v1.0) and Intel MKL [9] (v8.1).
Fig. 4. GFLOP/s achieved by different implementations of matrix multiplication

Table 1. Matrix multiplication: Hitrates for L1 and L2 data cache, data translation-lookaside-buffer (DTLB), and branch prediction

                      single precision                    double precision
                      TifaMMy   Intel MKL   GotoBLAS      TifaMMy   Intel MKL   GotoBLAS
  L1 data cache       99.73 %   99.38 %     96.33 %       99.60 %   99.06 %     96.31 %
  L2 cache            99.96 %   99.87 %     99.95 %       99.94 %   99.92 %     99.95 %
  DTLB               100.00 %   99.99 %     99.99 %      100.00 %  100.00 %     99.99 %
  Branch prediction   99.90 %   99.98 %     99.99 %       99.94 %   99.99 %     99.99 %
Figure 4 shows the achieved GFLOP/s rates for the different matrix multiplication implementations for matrix sizes from 500×500 up to 6000×6000 (in steps of size 10). For both single and double precision, TifaMMy is able to outperform Intel's MKL, but lags behind GotoBLAS. The zig-zag behaviour of TifaMMy's performance curve results from the fact that we still use zero-padding to fill the smallest, row-numbered matrix blocks. Due to the bigger block size (72 × 72 vs. 52 × 52), this effect is much stronger for single precision. Table 1 compares the cache hitrates of the three implementations. TifaMMy obviously profits from its cache-oblivious construction, and exceeds both MKL and GotoBLAS with respect to cache hitrates. TifaMMy shows a slightly worse branch prediction rate, which is mainly a result of the recursive calls throughout the block recursion. Hence, it should not severely affect performance. It is often claimed that GotoBLAS obtains its superior performance via a rigorous minimisation of TLB misses. However, our measurements did not show any substantial differences in the TLB hitrates. We rather believe that GotoBLAS gains its performance advantage via an even better assembler implementation, which would also explain the larger advantage for single precision floats.

Table 2. LU decomposition: Hitrates for L1 and L2 data cache, and branch prediction; results are given for matrices H and D, each of size $4212 = 52 \cdot 3^4$

                      Matrix H (pivoting)                 Matrix D
                      TifaMMy   Intel MKL   GotoBLAS      TifaMMy   Intel MKL   GotoBLAS
  L1 data cache       99.53 %   98.06 %     96.30 %       99.58 %   98.06 %     96.31 %
  L2 cache            99.94 %   99.90 %     99.92 %       99.95 %   99.89 %     99.94 %
  Branch prediction   99.93 %   99.99 %     99.97 %       99.97 %   99.98 %     99.97 %

Fig. 5. GFLOP/s rates (double precision) achieved by TifaMMy, GotoBLAS, and MKL (v 9.0) for the LU decomposition of the matrices D (diagonal dominant) and H (Hilbert). Matrix sizes are multiples of 52 (size of the row-major numbered blocks).

The performance comparison between TifaMMy's implementation of LU decomposition and those provided by GotoBLAS and MKL is shown in table 2 and figure 5. For our tests, we used two different matrices: a "rotated" Hilbert matrix H, with elements $h_{ij} := (2n - i - j + 1)^{-1}$, and an artificial diagonal dominant matrix D with elements $d_{ii} := 10^5$, and $d_{ij} = 1$ for $i \neq j$. The computation of the LU decomposition of H requires lots of pivot row exchanges, while there are no pivot exchanges at all for matrix D. Hence, the two matrices serve as contrary examples for the influence of pivoting. Table 2 presents the cache hitrates measured during LU decomposition of matrices H and D. Figure 5 compares the GFLOP/s rates for the different implementations of LU decomposition, again for matrices H and D of different dimension. Overall, we obtain similar performance results as for matrix multiplication, which is not surprising, because all three implementations are based on their respective multiplication codes. Again, TifaMMy is faster than MKL but slightly slower than GotoBLAS. For matrix H, we observe a slight decay of TifaMMy's and GotoBLAS' performance, which is obviously due to pivoting and results from the search for pivot elements and the respective exchanges of matrix rows. Due to the relatively low number of memory accesses during pivoting, a respective performance decay cannot be observed in the cache hitrates.
However, we measured a substantial increase of the absolute number of cache misses, which explains that the performance decay is a bit larger than what would be expected from the additional operations due to pivot exchanges, alone. Hence, finding cache-friendly pivoting strategies will be an important task to maintain TifaMMy’s performance for LU decomposition with pivoting.
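The two test matrices used in this section are easy to generate from their definitions. The sketch below builds them with exact rational entries and checks where the first partial-pivoting search would find its pivot; the function names are illustrative, and 1-based indices i, j = 1, ..., n are assumed for the definition of H.

```python
from fractions import Fraction


def rotated_hilbert(n):
    """Matrix H with h_ij = 1 / (2n - i - j + 1), for i, j = 1, ..., n."""
    return [[Fraction(1, 2 * n - i - j + 1) for j in range(1, n + 1)]
            for i in range(1, n + 1)]


def diagonal_dominant(n):
    """Matrix D with d_ii = 10^5 and d_ij = 1 for i != j."""
    return [[10 ** 5 if i == j else 1 for j in range(n)] for i in range(n)]


def first_pivot_row(matrix):
    """Row index (0-based) of the largest absolute value in the first column."""
    return max(range(len(matrix)), key=lambda i: abs(matrix[i][0]))
```

For H the largest entry of the first column sits in the last row, so a row exchange is needed right at the first elimination step, whereas for D the diagonal element is already the largest entry of its column.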
5 Conclusion
We demonstrated that a block-structured, cache-oblivious approach for matrix multiplication and LU decomposition can be turned into a competitive alternative to existing high-performance libraries for matrix multiplication, if hardware-optimised kernel operations are integrated for small matrix blocks. We see the main advantage of our approach in the fact that only a small and well-defined part of the code has to be adapted to different hardware, while the major part of the code is efficient in its cache-oblivious, general form. In LU decomposition, pivoting noticeably affects performance, though asymptotically its $O(n^2)$ complexity should not be a major runtime factor. Finding suitable blockwise pivot strategies that match our block-recursive strategy will therefore be a focus of our future work, especially as pivoting might become an even stronger performance obstacle for parallel (multi-core) implementations.
References
1. Aberdeen, D., Baxter, J.: Emmerald: a fast matrix-matrix multiply using Intel’s SSE instructions. Concurrency Computat.: Pract. Exper. 13 (2001)
2. Bader, M., Zenger, C.: Cache oblivious matrix multiplication using an element ordering based on a Peano curve. Linear Algebra Appl. 417(2–3) (2006)
3. Bader, M., Zenger, C.: A cache oblivious algorithm for matrix multiplication based on Peano’s space filling curve. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, Springer, Heidelberg (2006)
4. Bader, M., Mayer, C.: Cache oblivious matrix operations using Peano curves. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, Springer, Heidelberg (2007)
5. Duff, I.S., Koster, J.: The design and use of algorithms for permuting large entries to the diagonal of sparse matrices. SIAM J. Matrix Anal. Appl. 20(4) (1999)
6. Elmroth, E., Gustavson, F., Jonsson, I., Kågström, B.: Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review 46(1) (2004)
7. GotoBLAS, Texas Advanced Computing Center, http://www.tacc.utexas.edu/resources/software/
8. Gustavson, F.G.: Recursion leads to automatic variable blocking for dense linear-algebra algorithms. IBM Journal of Research and Development 41(6) (1997)
9. Intel Math Kernel Library (2005), http://intel.com/cd/software/products/asmo-na/eng/perflib/mkl/
10. Joffrain, T., Quintana-Orti, E.S., van de Geijn, R.: Updating an LU factorization and its application to scalable out-of-core
11. Whaley, R.C., Petitet, A., Dongarra, J.J.: Automated empirical optimization of software and the ATLAS project. Parallel Computing 27(1–2) (2001)
12. Yotov, K., Roeder, T., Pingali, K., Gunnels, J., Gustavson, F.: Is cache oblivious DGEMM a viable alternative. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, Springer, Heidelberg (2007)
Parallel Tiled QR Factorization for Multicore Architectures
Alfredo Buttari¹, Julien Langou², Jakub Kurzak¹, and Jack Dongarra¹,³,⁴
¹ Computer Science Dept., University of Tennessee Knoxville, USA
² Department of Mathematical Sciences, University of Colorado at Denver and Health Sciences Center, Colorado, USA
³ Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
⁴ University of Manchester, UK
Abstract. As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features of these new processors. Fine grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. Compared to the standard approach, say with LAPACK, this may result in an out of order execution of the tasks which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithm for QR factorization, where parallelism can only be exploited at the level of the BLAS operations.
1 Introduction
In the last twenty years, microprocessor manufacturers have been driven towards higher performance rates only by the exploitation of higher degrees of Instruction Level Parallelism (ILP). Based on this approach, several generations of processors have been built where clock frequencies were higher and higher and pipelines were deeper and deeper. As a result, applications could benefit from these innovations and achieve higher performance simply by relying on compilers that could efficiently exploit ILP. Due to a number of physical limitations (mostly power consumption and heat dissipation) this approach cannot be pushed any further. For this reason, chip designers have moved their focus from ILP to Thread Level Parallelism (TLP), where higher performance can be achieved by replicating execution units (or cores) on the die while keeping the clock rates in a range where power consumption and heat dissipation do not represent a problem. It is easy to imagine that multicore technologies will have a deep impact on the High Performance Computing (HPC) world, since supercomputers have a very high number of processors and, thus, multicore technologies can help reduce the power consumption. As a consequence, all the applications that were not explicitly coded to be run on parallel architectures must be rewritten with parallelism in mind. Also, those applications that could exploit parallelism may need considerable rework in order to take advantage of the fine-grain parallelism features provided by multicores.

The current set of multicore chips from Intel and AMD consists for the most part of multiple processors glued together on the same chip. There are many scalability issues with this approach, and it is unlikely that this type of architecture will scale up beyond 8 or 16 cores. Even though it is not yet clear how chip designers are going to address these issues, it is possible to identify some properties that algorithms must have in order to match high degrees of TLP:

fine granularity: cores are (and probably will be) associated with relatively small local memories (either caches or explicitly managed memories, as in the case of the STI Cell [1] architecture or the Intel Polaris [2] prototype). This requires splitting an operation into tasks that operate on small portions of data in order to reduce bus traffic and improve data locality.

asynchronicity: as the degree of TLP grows and the granularity of the operations becomes smaller, the presence of synchronization points in a parallel execution seriously affects the efficiency of an algorithm.

The LAPACK [3] and ScaLAPACK [4] software libraries represent a de facto standard for high performance dense Linear Algebra computations and have been developed, respectively, for shared-memory and distributed-memory architectures. In essence, both LAPACK and ScaLAPACK implement sequential algorithms that rely on parallel building blocks, i.e. parallel BLAS operations.

As multicore systems require finer granularity and higher asynchronicity, considerable advantages may be obtained by reformulating old algorithms or developing new algorithms in a way that their implementation can be easily mapped on these new architectures. A number of approaches along these lines have been proposed in [5,6,7,8,9]; block partitioning and hybrid data structures have been studied and significant gains can be obtained on more conventional processors either in shared or distributed memory environments. More recent work has focused on how the operations of the standard LAPACK algorithms for some common factorizations can be broken into sequences of smaller tasks in order to achieve finer granularity and higher flexibility in the scheduling of tasks to cores. The importance of fine granularity algorithms is also shown in [10]. The rest of this document shows how this can be achieved for the QR factorization. Section 2 describes the tiled QR factorization that provides both fine granularity and a high level of asynchronicity; performance results for this algorithm are shown in Section 3.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 639–648, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Tiled QR Factorization
The kernels (e.g. BLAS operations) on which common Linear Algebra operations are based can be broken into smaller tasks to achieve finer granularity. These tasks can thus be scheduled according to a dynamic, graph driven approach that leads to an out-of-order, asynchronous execution. These ideas are not new and have been proposed a number of times in the past [11,12,13,14]. More recent work in this direction shows how this idea can be applied to common Linear Algebra operations such as SYRK (symmetric rank-K update), Cholesky, block LU, and block QR factorizations [5,7,6] for multicore. In the case of SYRK and CHOL, no algorithmic change is needed, since both these operations can be naturally “tiled” to achieve a very fine granularity. A good summary of recursive and blocked approaches can be found in [8,15] (Sections 5 and 6). Kurzak et al. [5] recently showed that the application of this approach to the standard LAPACK algorithms for LU and QR is limited by the granularity that can be obtained by simply “tiling” the elementary operations that these two factorizations are based on. In order to achieve finer granularity in the LU and QR factorizations, a major algorithmic change is needed. The algorithmic change we propose is actually well-known and takes its roots in updating factorizations [16,17]. Using updating techniques to tile the algorithms was first¹ proposed by Yip [18] for LU to improve the efficiency of out-of-core solvers, and was recently reintroduced in [19,20] for LU and QR, once more in the out-of-core context. A similar idea has also been proposed in [21] for Hessenberg reduction in the parallel distributed context. All of these approaches use the idea of manipulating and operating on the coefficient matrix by referencing small blocks of the matrix [22]. The block organization is a convenient way of expressing and moving parts of the matrix through the memory hierarchy.
The originality of this paper is to study the effectiveness of these algorithms in the context of multicore architectures where they can be used to achieve a fine granularity, high degree of parallelism and asynchronous execution.

¹ To our knowledge.

2.1 A Fine-Grain Algorithm for QR Factorization

The tiled QR factorization will be constructed based on the following four elementary operations:

DGEQT2. This subroutine was developed to perform the unblocked factorization of a diagonal block $A_{kk}$ of size b × b. This operation produces an upper triangular matrix $R_{kk}$, a unit lower triangular matrix $V_{kk}$ that contains b Householder reflectors and an upper triangular matrix $T_{kk}$ as defined by the “WY” technique for accumulating the transformations (see [23], [24] and [25] for details). Note that both $R_{kk}$ and $V_{kk}$ can be written on the memory area that was used for $A_{kk}$ and, thus, no extra storage is needed for them. A temporary work space is needed to store $T_{kk}$.
\[ H_1 H_2 \cdots H_b = I - V T V^T \]
where V is an n-by-b matrix whose columns are the individual vectors $v_1, v_2, \ldots, v_b$ associated with the Householder matrices $H_1, H_2, \ldots, H_b$, and T is an upper triangular matrix of order b. Thus, DGEQT2($A_{kk}$, $T_{kk}$) performs
\[ A_{kk} \leftarrow V_{kk}, R_{kk}, \qquad T_{kk} \leftarrow T_{kk} \]

DLARFB. This LAPACK subroutine will be used to apply the transformation ($V_{kk}$, $T_{kk}$) computed by subroutine DGEQT2 to a block $A_{kj}$. Thus, DLARFB($A_{kj}$, $V_{kk}$, $T_{kk}$) performs
\[ A_{kj} \leftarrow (I - V_{kk} T_{kk} V_{kk}^T) A_{kj} \]
DTSQT2. This subroutine was developed to perform the unblocked QR factorization of a matrix that is formed by coupling an upper triangular block $R_{kk}$ with a square block $A_{ik}$. This subroutine will return an upper triangular matrix $\tilde{R}_{kk}$, which will overwrite $R_{kk}$, and b Householder reflectors, where b is the block size. Note that, since $R_{kk}$ is upper triangular, the resulting Householder reflectors can be represented as an identity block I on top of a square block $V_{ik}$. For this reason no extra storage is needed for the Householder vectors, since the identity block need not be stored and $V_{ik}$ can overwrite $A_{ik}$. Also a matrix $T_{ik}$ is produced for which storage space has to be allocated. See Figure 1 for a graphical representation. Then, DTSQT2($R_{kk}$, $A_{ik}$, $T_{ik}$) performs
\[ \begin{pmatrix} R_{kk} \\ A_{ik} \end{pmatrix} \leftarrow \begin{pmatrix} I \\ V_{ik} \end{pmatrix}, \tilde{R}_{kk}, \qquad T_{ik} \leftarrow T_{ik} \]

DSSRFB. This subroutine was developed to apply the transformation computed by DTSQT2 to a matrix formed by coupling two square blocks $A_{kj}$ and $A_{ij}$. Thus, DSSRFB($A_{kj}$, $A_{ij}$, $V_{ik}$, $T_{ik}$) performs
\[ \begin{pmatrix} A_{kj} \\ A_{ij} \end{pmatrix} \leftarrow \left( I - \begin{pmatrix} I \\ V_{ik} \end{pmatrix} \cdot (T_{ik}) \cdot \begin{pmatrix} I & V_{ik}^T \end{pmatrix} \right) \begin{pmatrix} A_{kj} \\ A_{ij} \end{pmatrix} \]

All of these elementary operations rely on BLAS subroutines to perform internal computations. Assuming a matrix A of size pb × qb, where b is the block size and each $A_{ij}$ is of size b × b, the QR factorization can be performed as in Algorithm 1. The operation count for Algorithm 1 is 25% higher than that of the LAPACK algorithm for QR factorization; specifically, the tiled algorithm requires $\frac{5}{2} n^2 (m - n/3)$ floating point operations compared to the $\frac{4}{2} n^2 (m - n/3)$ for the LAPACK algorithm. Details of the operation count of the parallel tiled algorithm are reported in [26]. Performance results in Section 3 will demonstrate that it is worth paying this cost for the sake of scaling.
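The WY identity $H_1 H_2 \cdots H_b = I - V T V^T$ used by these kernels can be checked numerically on a tiny example. The sketch below is not the DGEQT2 code: it assumes reflectors of the form $H_j = I - \tau_j v_j v_j^T$ and builds T column by column with the standard forward recurrence (new column $-\tau_j T V^T v_j$, diagonal entry $\tau_j$); the vectors and helper names are made up for illustration.

```python
def matmul(a, b):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def householder(v, tau):
    """H = I - tau * v v^T."""
    n = len(v)
    return [[(1.0 if i == j else 0.0) - tau * v[i] * v[j] for j in range(n)]
            for i in range(n)]


def build_t(vs, taus):
    """Upper triangular T such that H_1 ... H_b = I - V T V^T."""
    b = len(vs)
    t = [[0.0] * b for _ in range(b)]
    for j in range(b):
        # w = V^T v_j restricted to the first j reflectors.
        w = [sum(x * y for x, y in zip(vs[i], vs[j])) for i in range(j)]
        for i in range(j):
            t[i][j] = -taus[j] * sum(t[i][k] * w[k] for k in range(j))
        t[j][j] = taus[j]
    return t


# Illustrative reflector vectors (unit lower trapezoidal pattern), n = 4, b = 3.
vs = [[1.0, 0.5, -0.25, 0.75],
      [0.0, 1.0, 0.5, -0.5],
      [0.0, 0.0, 1.0, 0.25]]
taus = [2.0 / sum(x * x for x in v) for v in vs]
n = len(vs[0])

product = identity(n)
for v, tau in zip(vs, taus):
    product = matmul(product, householder(v, tau))

v_mat = transpose(vs)  # n-by-b matrix whose columns are the v_j
vtv = matmul(matmul(v_mat, build_t(vs, taus)), transpose(v_mat))
eye = identity(n)
max_diff = max(abs(product[i][j] - (eye[i][j] - vtv[i][j]))
               for i in range(n) for j in range(n))
print("max |H1 H2 H3 - (I - V T V^T)| =", max_diff)
```

Storing only V and the small triangular T is what lets the update kernels apply b reflectors at once through matrix-matrix products.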
Algorithm 1. The block algorithm for QR factorization.
for k = 1, 2, ..., min(p, q) do
    DGEQT2(A_kk, T_kk);
    for j = k + 1, k + 2, ..., q do
        DLARFB(A_kj, V_kk, T_kk);
    end for
    for i = k + 1, k + 2, ..., p do
        DTSQT2(R_kk, A_ik, T_ik);
        for j = k + 1, k + 2, ..., q do
            DSSRFB(A_kj, A_ij, V_ik, T_ik);
        end for
    end for
end for
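The loop nest of Algorithm 1 can be unrolled into an explicit list of tasks, and recording which data each task reads and writes is enough to derive dependencies between them. The following is a rough sketch under stated assumptions: the read and write sets are our reading of the kernel descriptions, V, R and T are tracked as separate data items from the not-yet-factored tile A, and only "last writer" dependencies are recorded. It is not the authors' scheduler.

```python
def tiled_qr_tasks(p, q):
    """Unroll Algorithm 1 into (name, reads, writes) triples.

    Data items are labelled (kind, row, col). Treating V, R and T as
    items separate from the tile A is a modelling assumption.
    """
    tasks = []
    for k in range(1, min(p, q) + 1):
        tasks.append((f"DGEQT2({k},{k})",
                      {("A", k, k)},
                      {("V", k, k), ("R", k, k), ("T", k, k)}))
        for j in range(k + 1, q + 1):
            tasks.append((f"DLARFB({k},{j})",
                          {("V", k, k), ("T", k, k), ("A", k, j)},
                          {("A", k, j)}))
        for i in range(k + 1, p + 1):
            tasks.append((f"DTSQT2({i},{k})",
                          {("R", k, k), ("A", i, k)},
                          {("R", k, k), ("V", i, k), ("T", i, k)}))
            for j in range(k + 1, q + 1):
                tasks.append((f"DSSRFB({i},{j},{k})",
                              {("V", i, k), ("T", i, k),
                               ("A", k, j), ("A", i, j)},
                              {("A", k, j), ("A", i, j)}))
    return tasks


def dependencies(tasks):
    """Map each task to the tasks that last wrote any data item it touches."""
    last_writer = {}
    deps = {}
    for name, reads, writes in tasks:
        deps[name] = {last_writer[item] for item in reads | writes
                      if item in last_writer}
        for item in writes:
            last_writer[item] = name
    return deps


tasks = tiled_qr_tasks(3, 3)
deps = dependencies(tasks)
for name, _, _ in tasks:
    print(name, "<-", sorted(deps[name]))
```

A task whose predecessors have all completed may run on any free core, which is the property the graph driven execution relies on.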
Figure 1 gives a graphical representation of one repetition (with k = 1) of the outer loop in Algorithm 1 with p = q = 3. The red, thick borders show which blocks in the matrix are being read and the light blue fill shows which blocks are being written in a step. The $T_{kk}$ matrices are not shown in this figure for clarity purposes.

2.2 Graph Driven Asynchronous Execution
Following the approach presented in [5,6], Algorithm 1 can be represented as a Directed Acyclic Graph (DAG) where nodes are elementary tasks that operate on b × b blocks and where edges represent the dependencies among them. Figure 2 shows the DAG when Algorithm 1 is executed on a matrix with p = q = 3. It can be noted that the DAG has a recursive structure and, thus, if $p_1 \geq p_2$ and $q_1 \geq q_2$ then the DAG for a matrix of size $p_2 \times q_2$ is a subgraph of the DAG for a matrix of size $p_1 \times q_1$. This property also holds for most of the algorithms in LAPACK. Once the DAG is known, the tasks can be scheduled asynchronously and independently as long as the dependencies are not violated. A critical path can be identified in the DAG as the path that connects all the nodes that have the highest number of outgoing edges. Based on this observation, a scheduling policy can be used where higher priority is assigned to those nodes that lie on the critical path. Clearly, in the case of our block algorithm for QR factorization, the nodes associated with the DGEQT2 subroutine have the highest priority, and then three other priority levels can be defined for DTSQT2, DLARFB and DSSRFB in descending order. This dynamic scheduling results in an out of order execution where idle time is almost completely eliminated, since only very loose synchronization is required between the threads. The graph driven execution also provides some degree of adaptivity, since tasks are scheduled to threads depending on the availability of execution units.

2.3 Block Data Layout
Fig. 1. Graphical representation of one repetition of the outer loop in Algorithm 1 on a matrix with p = q = 3. As expected the picture is very similar to the out-of-core algorithm presented in [20].

Fig. 2. The dependency graph of Algorithm 1 on a matrix with p = q = 3

The major limitation of performing very fine grain computations is that BLAS libraries generally have very poor performance on small blocks. This situation can be considerably improved by storing matrices in Block Data Layout (BDL) instead of the Column Major Format that is the standard storage format for FORTRAN arrays. In BDL a matrix is split into blocks and each block is stored in contiguous memory locations. Each block is stored in Column Major Format and blocks are stored in Column Major Format with respect to each other. As a result, the access pattern to memory is more regular and BLAS performance is considerably improved. The benefits of BDL have been extensively studied in the past, for example in [22], and recent studies like [7] demonstrate how fine-granularity parallel algorithms can benefit from BDL. It is important to note that both [22] and [7] focus on algorithms that are based on the same approach presented here.
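The Block Data Layout described above amounts to a simple index mapping. The sketch below computes the linear offset of element (i, j) with 0-based indices, assuming square b × b blocks and that b divides both matrix dimensions; the function name and the 4 × 4 example are illustrative.

```python
def bdl_offset(i, j, m, n, b):
    """Linear offset of element (i, j), 0-based, of an m-by-n matrix in BDL.

    Blocks are b-by-b, stored column-major internally, and the blocks
    themselves are ordered column-major with respect to each other.
    """
    if m % b or n % b:
        raise ValueError("this sketch assumes b divides both m and n")
    block_row, row_in_block = divmod(i, b)
    block_col, col_in_block = divmod(j, b)
    blocks_per_column = m // b
    block_index = block_col * blocks_per_column + block_row
    return block_index * b * b + col_in_block * b + row_in_block


m, n, b = 4, 4, 2
offsets = [[bdl_offset(i, j, m, n, b) for j in range(n)] for i in range(m)]
for row in offsets:
    print(row)
```

All b² entries of a block occupy one contiguous range of offsets, so a kernel working on a single tile touches a compact region of memory instead of b strided columns.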
3 Performance Results
The performance of the tiled QR factorization with dynamic scheduling of tasks has been measured on the systems listed in Table 1 and compared to the performance of the fork-join approach, i.e., the standard algorithm for block QR factorization of LAPACK associated with multithreaded BLAS.

Table 1. Details of the systems used for the following performance results

                   8-way dual Opteron             2-way quad Clovertown
Architecture       Dual-Core AMD Opteron™ 8214    Intel® Xeon® CPU X5355
Clock speed        2.2 GHz                        2.66 GHz
# cores            8 × 2 = 16                     2 × 4 = 8
Peak performance   70.4 Gflop/s                   85.12 Gflop/s
Memory             62 GB                          16 GB
Compiler suite     Intel 9.1                      Intel 9.1
BLAS libraries     MKL-9.1                        MKL-9.1
Figures 3 and 4 report the performance of the QR factorization for both the block algorithm with dynamic scheduling and the LAPACK algorithm with multithreaded BLAS. A block size of 200 has been used for the block algorithm, while the block size for the LAPACK algorithm² has been tuned in order to achieve the best performance for all the combinations of architecture and BLAS library. In each graph, two curves are reported for the block algorithm with dynamic scheduling; the solid curve shows its relative performance when the operation count is assumed equal to that of the LAPACK algorithm (i.e. $\frac{4}{2} n^2 (m - n/3)$), while the dashed curve shows its “raw” performance, i.e. the actual flop rate computed with the exact operation count for this algorithm, which is $\frac{5}{2} n^2 (m - n/3)$. As already mentioned, the “raw performance” (dashed curve) is 25% higher than the relative performance (solid curve).

² The block size in the LAPACK algorithm sets the width of the so-called panel factorization, which determines the ratio between Level-2 and Level-3 BLAS operations.
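The relation between the two curves follows directly from the two operation counts. As a small illustration, the sketch below converts one run time into a relative and a raw flop rate; the matrix size and the run time of 10 seconds are hypothetical placeholders, not measurements from the paper.

```python
def lapack_qr_flops(m, n):
    """Operation count 4/2 * n^2 * (m - n/3) of the LAPACK algorithm."""
    return 2.0 * n * n * (m - n / 3.0)


def tiled_qr_flops(m, n):
    """Operation count 5/2 * n^2 * (m - n/3) of the tiled algorithm."""
    return 2.5 * n * n * (m - n / 3.0)


def gflops(flop_count, seconds):
    return flop_count / seconds / 1.0e9


m = n = 5000        # hypothetical problem size
seconds = 10.0      # hypothetical run time of the tiled algorithm

relative_rate = gflops(lapack_qr_flops(m, n), seconds)  # solid curve
raw_rate = gflops(tiled_qr_flops(m, n), seconds)        # dashed curve
ratio = raw_rate / relative_rate
print(f"relative: {relative_rate:.2f} Gflop/s, raw: {raw_rate:.2f} Gflop/s")
print(f"raw / relative = {ratio:.2f}")
```

The relative rate is the one to compare against LAPACK when judging time to solution, since both are then normalised by the same operation count.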
The graphs on the left part of each figure show the performance measured using the maximum number of cores available on each system as a function of the problem size. The graphs on the right part of each figure show the weak scalability, i.e. the flop rates versus the number of cores when the local problem size is kept constant (nloc = 5,000) as the number of cores increases. Figures 3 and 4 show that, despite the higher operation count, the block algorithm with dynamic scheduling is capable of completing the QR factorization in less time than the LAPACK algorithm when the parallelism degree is high enough that the benefits of the asynchronous execution overcome the penalty of the extra flops. For lower numbers of cores, in fact, the fork-join approach has a good scalability and completes the QR factorization in less time than the block algorithm because of the lower flop count. Note that the actual execution rate of the block algorithm for QR factorization with dynamic scheduling (i.e., the dashed curves) is always higher than that of the LAPACK algorithm with multithreaded BLAS, even for low numbers of cores. The actual performance of the block algorithm, even if considerably higher than that of the fork-join one, is still far from the peak performance of the systems used for the measurements. This is mostly due to two factors: the nature of the BLAS operations involved and the performance of BLAS routines on small size blocks.

Fig. 3. Comparison between the performance of the block algorithm with dynamic scheduling using MKL-9.1 on an 8-way dual Opteron system. The dashed curve reports the raw performance of the block algorithm with dynamic scheduling, i.e., the performance as computed with the true operation count $\frac{5}{2} n^2 (m - n/3)$.

Fig. 4. Comparison between the performance of the block algorithm with dynamic scheduling using MKL-9.1 on a 2-way quad Clovertown system. The dashed curve reports the raw performance of the block algorithm with dynamic scheduling, i.e., the performance as computed with the true operation count $\frac{5}{2} n^2 (m - n/3)$.
4 Conclusions
By adapting known algorithms for updating the QR factorization of a matrix, we have derived a fine-granularity implementation scheme of the QR factorization for multicore architectures based on dynamic scheduling and block data layout. Although the proposed algorithm performs 25% more floating point operations than the regular algorithm, the gain in flexibility allows an efficient dynamic scheduling which enables the algorithm to scale almost perfectly when the number of cores increases.
References
1. Pham, D., Asano, S., Bolliger, M., Day, M.N., Hofstee, H.P., Johns, C., Kahle, J., Kameyama, A., Keaty, J., Masubuchi, Y., Riley, M., Shippy, D., Stasiak, D., Suzuoki, M., Wang, M., Warnock, J., Weitzel, S., Wendel, D., Yamazaki, T., Yazawa, K.: The design and implementation of a first-generation CELL processor. In: IEEE International Solid-State Circuits Conference, pp. 184–185 (2005)
2. Teraflops research chip, http://www.intel.com/research/platform/terascale/teraflops.htm
3. Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Croz, J.D., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK User’s Guide, 3rd edn. SIAM, Philadelphia (1999)
4. Choi, J., Demmel, J., Dhillon, I., Dongarra, J., Ostrouchov, S., Petitet, A., Stanley, K., Walker, D., Whaley, R.C.: ScaLAPACK: A portable linear algebra library for distributed memory computers - design issues and performance. Computer Physics Communications 97, 1–15 (1996) (also as LAPACK Working Note #95)
5. Kurzak, J., Dongarra, J.: Implementing linear algebra routines on multi-core processors with pipelining and a look ahead. LAPACK Working Note 178 (September 2006), also available as UT-CS-06-581
6. Buttari, A., Dongarra, J., Kurzak, J., Langou, J., Luszczek, P., Tomov, S.: The impact of multicore on math software. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 1–10. Springer, Heidelberg (2007)
7. Chan, E., Quintana-Orti, E.S., Quintana-Orti, G., van de Geijn, R.: Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. In: SPAA 2007: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, pp. 116–125. ACM Press, New York (2007)
8. Elmroth, E., Gustavson, F., Jonsson, I., Kågström, B.: Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review 46(1), 3–45 (2004)
9. Gustavson, F., Karlsson, L., Kågström, B.: Three algorithms for Cholesky factorization on distributed memory using packed storage. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 550–559. Springer, Heidelberg (2007)
10. Kurzak, J., Buttari, A., Dongarra, J.: Solving systems of linear equations on the CELL processor using Cholesky factorization. Technical Report UT-CS-07-596, Innovative Computing Laboratory, University of Tennessee Knoxville (April 2007)
11. Lord, R.E., Kowalik, J.S., Kumar, S.P.: Solving linear algebraic equations on an MIMD computer. J. ACM 30(1), 103–117 (1983)
12. Dongarra, J.J., Hiromoto, R.E.: A collection of parallel linear equations routines for the Denelcor HEP 1(2), 133–142 (December 1984)
13. Agarwal, R.C., Gustavson, F.G.: Vector and parallel algorithms for Cholesky factorization on IBM 3090. In: Supercomputing 1989: Proceedings of the 1989 ACM/IEEE conference on Supercomputing, pp. 225–233. ACM Press, New York (1989)
14. Agarwal, R.C., Gustavson, F.G.: A parallel implementation of matrix multiplication and LU factorization on the IBM 3090. In: Proceedings of the IFIP WG 2.5 Working Group on Aspects of Computation on Asynchronous Parallel Processors, Stanford CA, August 22-26, 1988, North Holland, Amsterdam (1988)
15. Elmroth, E., Gustavson, F.G.: Applying recursion to serial and parallel QR factorization leads to better performance. IBM Journal of Research and Development 44(4), 605 (2000)
16. Golub, G., Van Loan, C.: Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore (1996)
17. Stewart, G.W.: Matrix Algorithms, 1st edn., vol. 1. SIAM, Philadelphia (1998)
18. Yip, E.L.: FORTRAN Subroutines for Out-of-Core Solutions of Large Complex Linear Systems. Technical Report CR-159142, NASA (November 1979)
19. Quintana-Orti, E., van de Geijn, R.: Updating an LU factorization with pivoting. Technical Report TR-2006-42, The University of Texas at Austin, Department of Computer Sciences (2006), FLAME Working Note 21
20. Gunter, B.C., van de Geijn, R.A.: Parallel out-of-core computation and updating of the QR factorization. ACM Trans. Math. Softw. 31(1), 60–78 (2005)
21. Berry, M.W., Dongarra, J.J., Kim, Y.: A parallel algorithm for the reduction of a nonsymmetric matrix to block upper-Hessenberg form. Parallel Comput. 21(8), 1189–1211 (1995)
22. Gustavson, F.G.: New generalized data structures for matrices lead to a variety of high performance algorithms. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds.) PPAM 2001. LNCS, vol. 2328, pp. 418–436. Springer, Heidelberg (2002)
23. Bischof, C., van Loan, C.: The WY representation for products of Householder matrices. SIAM J. Sci. Stat. Comput. 8(1), 2–13 (1987)
24. Schreiber, R., van Loan, C.: A storage-efficient WY representation for products of Householder transformations. SIAM J. Sci. Stat. Comput. 10(1), 53–57 (1989)
25. Bischof, C., van Loan, C.: The WY representation for products of Householder matrices. SIAM J. Sci. Stat. Comput. 8(1), 2–13 (1987)
26. Buttari, A., Langou, J., Kurzak, J., Dongarra, J.: Parallel Tiled QR Factorization for Multicore Architectures. Technical Report UT-CS-07-598, University of Tennessee (2007), LAPACK Working Note 190
Application of Rectangular Full Packed and Blocked Hybrid Matrix Formats in Semidefinite Programming for Sensor Network Localization
Jacek Błaszczyk, Ewa Niewiadomska-Szynkiewicz, and Michał Marks
Institute of Control and Computation Engineering, Faculty of Electronics and Information Technology, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665 Warsaw, Poland
{J.Blaszczyk,E.Szynkiewicz}@ia.pw.edu.pl, [email protected]
Abstract. This paper¹ addresses issues associated with the reduction of memory usage for a semidefinite programming (SDP) relaxation based method and its application to the position estimation problem in ad-hoc wireless sensor networks. We describe two new CSDP solvers (semidefinite programming in C) using two algorithms for Cholesky factorization that implement the RFP and BHF matrix storage formats, together with different implementations of the BLAS/LAPACK libraries (Netlib’s BLAS/LAPACK, sequential and parallel versions of ATLAS, Intel MKL and GotoBLAS). The numerical results given and discussed in the final part of the paper show that using either the RFP or the BHF data format preserves the high numerical performance of the LAPACK full data format while using half the computer storage.
Keywords: semidefinite programming, Cholesky factorization, novel matrix data structures, wireless sensor networks, localization methods.
1 Introduction to Sensor Network Localization
Recent advances in technology have enabled the development of low-cost, low-power and multifunctional computation devices. Networked wirelessly, these devices bring the idea of wireless sensor networks (WSNs) into reality. WSNs are deployed in various environments and are used in a large number of practical applications, such as environmental information, traffic or health monitoring, intrusion detection, etc. A typical sensor network consists of a large number of nodes (densely deployed sensors). Each sensor is a miniature device containing a sensing unit, a battery unit, a CPU and a radio transceiver. Transceivers used in this kind of network are usually characterized by short radio range, low transmission rate and very low power consumption. The computing resources (CPU and memory) are limited. Nodes networked wirelessly must gather local data and communicate with other nodes. The information sent by a given sensor is relevant only if we know what location it refers to. Location estimation allows the application of geographic-aware routing, multicasting and energy conservation algorithms. This makes self-organization and localization capabilities among the most important requirements in sensor networks.

The simplest way to determine a node location is to equip this node with a global positioning system (GPS) or install it at a point with known coordinates. Because of the cost, the size of sensors and constraints on energy consumption, most sensors usually do not know their locations; only a few nodes, called "anchors", are equipped with GPS adapters. The locations of the other nodes, called "non-anchors", are unknown. In such a model, techniques that estimate the locations of non-anchors based on information about the positions of anchors are used.

Wireless sensor network localization is a complex problem that can be solved in different ways [1]. Generally, the proposed solutions are based on signal processing and on algorithms that transform measurements into distances between two neighboring nodes, and then into coordinates of the nodes in the network. Distance based localization algorithms have recently become very popular. They use inter-sensor distance measurements in a sensor network to estimate the locations of nodes. In this paper we consider the centralized approach, in which a central processor collects the distance data provided by all nodes and calculates the locations of all non-anchor nodes. Three main approaches to designing centralized distance based algorithms are described in the literature: multidimensional scaling (MDS) [2,3], semidefinite programming (SDP) [4] and stochastic optimization [1].

We focus on the semidefinite programming relaxation based method for WSN localization. The efficiency of this method is quite satisfactory (see [4]). Only a few anchor nodes are required to precisely estimate the locations of all the unknown nodes in a network, and the localization errors are minimal even when the anchor nodes are not properly placed within a network or the distance measurements are noisy. The numerical tests that we have performed confirm the results discussed in the literature. Problems may appear when one considers large scale networks. The difficulty is computing power, i.e., speed and memory requirements. The objective of our research was to develop a modified version of an SDP solver that reduces memory usage and provides high performance.

¹ This work is supported by Warsaw University of Technology Research Program grant.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 649–658, 2008. © Springer-Verlag Berlin Heidelberg 2008
2
Semidefinite Programming for WSNs Localization
P. Biswas and Y. Ye [4] present a quadratic programming formulation of the position estimation problem and its transformation into a standard SDP problem. Let us consider a network with $m$ known points (anchors) $a_k \in \mathbb{R}^2$, $k = 1, \ldots, m$, and $n$ unknown points (non-anchors) $x_j \in \mathbb{R}^2$, $j = 1, \ldots, n$. For simplicity, all sensor points are placed on a plane. For each pair of points we introduce a Euclidean distance upper bound $\bar{d}_{kj}$ and lower bound $\underline{d}_{kj}$ between $a_k$ and $x_j$, or an upper bound $\bar{d}_{ij}$ and lower bound $\underline{d}_{ij}$ between $x_i$ and $x_j$. Then the quadratic model of the localization problem may be defined as:

\[
\begin{aligned}
\min \quad & \sum_{i,j} \alpha_{ij} + \sum_{k,j} \alpha_{kj} \\
\text{s.t.} \quad & (\underline{d}_{ij})^2 - \alpha_{ij} \le \|x_i - x_j\|^2 \le (\bar{d}_{ij})^2 + \alpha_{ij}, \quad \alpha_{ij} \ge 0, \quad \forall i \ne j, \\
& (\underline{d}_{kj})^2 - \alpha_{kj} \le \|a_k - x_j\|^2 \le (\bar{d}_{kj})^2 + \alpha_{kj}, \quad \alpha_{kj} \ge 0, \quad \forall k, j.
\end{aligned}
\tag{1}
\]
As we can see, this model minimizes the sum of errors in sensor positions for fitting the distance measures. Let $X = [x_1 \; x_2 \; \ldots \; x_n]$ be the $2 \times n$ matrix that needs to be determined. Our problem may be formulated in matrix form, but unfortunately it is nonconvex. The following approach converts the nonconvex quadratic distance constraints into linear constraints by introducing a relaxation that removes the quadratic term from the formulation. Let $Y = X^T X$. This constraint can be relaxed to $Y \succeq X^T X$, which is equivalent to the linear matrix inequality (e.g., [5]):

\[
Z = \begin{pmatrix} I & X \\ X^T & Y \end{pmatrix} \succeq 0,
\]

where the relation $Z \succeq 0$ means that $Z$ is positive semidefinite, i.e., $y^T Z y \ge 0$ for all $y \in \mathbb{R}^{n+2}$. Then problem (1) can be formulated as a standard SDP problem:

\[
\begin{aligned}
\min \quad & \sum_{i,j} \alpha_{ij} + \sum_{k,j} \alpha_{kj} \\
\text{s.t.} \quad & (1; 0; \mathbf{0})^T Z (1; 0; \mathbf{0}) = 1, \\
& (0; 1; \mathbf{0})^T Z (0; 1; \mathbf{0}) = 1, \\
& (1; 1; \mathbf{0})^T Z (1; 1; \mathbf{0}) = 2, \\
& (\underline{d}_{ij})^2 - \alpha_{ij} \le (\mathbf{0}; e_{ij})^T Z (\mathbf{0}; e_{ij}) \le (\bar{d}_{ij})^2 + \alpha_{ij}, \quad \alpha_{ij} \ge 0, \quad \forall i \ne j, \\
& (\underline{d}_{kj})^2 - \alpha_{kj} \le (a_k; e_j)^T Z (a_k; e_j) \le (\bar{d}_{kj})^2 + \alpha_{kj}, \quad \alpha_{kj} \ge 0, \quad \forall k, j, \\
& Z \succeq 0,
\end{aligned}
\tag{2}
\]

where $e_{ij}$ is the vector with 1 at the $i$th position, $-1$ at the $j$th position and zero everywhere else, and $e_j$ is the vector of all zeros except $-1$ at the $j$th position. Finally, the WSN localization problem (2) is formulated in matrix form as an SDP problem and can be solved effectively by the interior-point SDP solvers available to the optimization research community, such as SeDuMi, DSDP and CSDP. Thus the effort required to implement the localization algorithm is minimal.
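The role of the vectors $(\mathbf{0}; e_{ij})$ and $(a_k; e_j)$ in (2) can be checked numerically. The following is a minimal sketch in plain Python, assuming the unrelaxed case $Y = X^T X$; the helper names `build_Z` and `quad_form` and the sample coordinates are hypothetical and chosen only for illustration. In that case the quadratic forms in (2) reduce to the squared distances that appear in (1).

```python
def build_Z(X):
    """Build Z = [[I, X], [X^T, X^T X]] for a 2 x n position matrix X.

    Assumption of this sketch: Y is set to X^T X (no relaxation).
    """
    n = len(X[0])
    Z = [[0.0] * (n + 2) for _ in range(n + 2)]
    Z[0][0] = Z[1][1] = 1.0                      # identity block I
    for j in range(n):
        for r in range(2):
            Z[r][2 + j] = Z[2 + j][r] = X[r][j]  # blocks X and X^T
        for i in range(n):
            Z[2 + i][2 + j] = X[0][i] * X[0][j] + X[1][i] * X[1][j]  # Y
    return Z


def quad_form(Z, v):
    """Return v^T Z v for a square matrix Z stored as a list of rows."""
    m = len(v)
    return sum(v[r] * Z[r][c] * v[c] for r in range(m) for c in range(m))


# Hypothetical data: three sensors (columns of X) and one anchor a.
X = [[0.1, 0.7, 0.4],
     [0.2, 0.3, 0.9]]
a = (0.5, 0.5)
Z = build_Z(X)

# (0; e_ij) with i = 1, j = 3 (1-based): gives ||x_1 - x_3||^2.
d2_sensor = quad_form(Z, [0.0, 0.0, 1.0, 0.0, -1.0])

# (a; e_j) with j = 2: gives ||a - x_2||^2.
d2_anchor = quad_form(Z, [a[0], a[1], 0.0, -1.0, 0.0])

# The three equality constraints of (2) pin the top-left block to I.
c1 = quad_form(Z, [1.0, 0.0, 0.0, 0.0, 0.0])
c2 = quad_form(Z, [0.0, 1.0, 0.0, 0.0, 0.0])
c3 = quad_form(Z, [1.0, 1.0, 0.0, 0.0, 0.0])
```

Because every constraint is linear in the entries of $Z$, an SDP solver can treat $Z$ as the unknown instead of $X$, which is the point of the relaxation.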
3
CSDP Library and Its Modifications
CSDP, written by Brian Borchers [6], is a public domain library of routines that implements a predictor–corrector variant of the semidefinite programming
primal–dual interior point algorithm of Helmberg, Rendl, Vanderbei, and Wolkowicz [7]. The method is therefore known as the HRVW method. The main advantages of this code are that it is written to be used as a callable subroutine, and that it is written in C for efficiency. The current version (6.0.1) of the CSDP code can run in parallel on shared memory multi-processor systems [8]; typical speedups for larger problems are between 2 and 3 on four processors. CSDP uses OpenMP directives in the C source code to express parallelism. The code is designed to make use of highly optimized linear algebra routines from the LAPACK and ATLAS-BLAS libraries. For solving SDP problems written in the SDPA format [9], as was the case in our computations, the CSDP library includes a stand-alone solver program. An interface to MATLAB is also provided, and the MATLAB routine can be used to solve problems that are in the SeDuMi format [10]. The CSDP software is now available as an open source project through the COIN-OR repository (https://projects.coin-or.org/Csdp/). Two of the most important steps in the HRVW algorithm are the construction of the $k \times k$ Schur complement matrix, which requires $O(k^2)$ storage, and its Cholesky factorization, which takes $O(k^3)$ computation time (for details see [8]). For WSN localization problems the number of constraints $k$ is considerably larger than the number of variables $n$, and thus the storage required by the Schur complement matrix often limits the size of problem (2) that can be solved by CSDP. Since the Schur complement matrix is typically dense, it becomes very difficult to store as the number of constraints $k$ increases. For example, on a workstation with 1024 megabytes of RAM, and using double precision arithmetic, a practical limit on the size of the Schur complement matrix is about $k = 8000$.
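The storage figures behind this limit can be reproduced with a few lines of arithmetic. The sketch below assumes 8-byte double precision words and counts only the matrix entries themselves; the function names are hypothetical. It compares full storage ($k^2$ words) with the minimum storage for a symmetric matrix ($k(k+1)/2$ words).

```python
def full_storage_bytes(k, word=8):
    """Bytes needed for a dense k x k matrix in full (two-dimensional) storage."""
    return k * k * word


def min_symmetric_storage_bytes(k, word=8):
    """Bytes needed to keep only one triangle, k(k+1)/2 entries."""
    return k * (k + 1) // 2 * word


k = 8000
full_mib = full_storage_bytes(k) / 2**20              # about 488 MiB
tri_mib = min_symmetric_storage_bytes(k) / 2**20      # about 244 MiB
```

For $k = 8000$ the full matrix alone takes roughly half of a 1024-megabyte machine, which is consistent with the practical limit quoted above once the solver's other work arrays are taken into account.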
Therefore, new dense matrix formats with minimum memory requirements that preserve the good numerical performance of the full format, together with an efficient, possibly parallel, implementation of the Cholesky factorization routine, may be helpful in this situation. They may lead to a significant speed improvement of the whole CSDP algorithm and a reduction of memory usage. Each iteration of the CSDP algorithm requires a Cholesky factorization of the positive definite Schur complement system. Profiling shows that for the WSN localization problems defined in (2), the time spent factoring the Schur complement matrix dominates the CPU time used by the CSDP solver (40%–90% of the entire runtime of CSDP). The original CSDP code uses the LAPACK procedure DPOTRF for this purpose. Our modification of the CSDP code replaces the calls to the DPOTRF Cholesky factorization procedure with calls to the DPFTRF procedure, designed for matrices stored in Rectangular Full Packed (RFP) format, or alternatively with calls to the DHPPTF procedure for matrices stored in blocked hybrid format (BHF). The RFP matrix format [11] is a new data format for storing triangular and symmetric matrices. The widely known full matrix format, based on the standard two-dimensional arrays of Fortran and C, wastes half the storage space when used to store triangular and symmetric matrices, but provides high performance through the use of level 3 BLAS. On the other hand, there is the packed matrix format that
fully utilizes storage but provides low performance because there is no level 3 packed BLAS. The RFP format combines the advantages of the full and packed formats for dense triangular and symmetric matrices: the good numerical performance of the full format, through the use of level 2 and 3 BLAS procedures, and the minimum memory requirement of packed storage. The RFP format has already been accepted for the new version of LAPACK. In our modification of the CSDP solver we use a preliminary version of the LAPACK routine for Cholesky factorization in the RFP format (and the procedures needed for matrix conversion from full to RFP format and back) contributed by Fred Gustavson from the IBM Watson Research Center. For comparison with the RFP matrix format we use the Fortran 95 subroutines for Cholesky factorization of a positive definite symmetric matrix in blocked hybrid format from the TOMS collection [12]. The BHF matrix format, proposed in [13], can exploit cache memory: the matrix is packed into $k(k+1)/2$ real variables, with each block held contiguously by rows or columns, which allows level 3 BLAS to be used. The speed is usually better than that of the LAPACK algorithm that uses full storage ($k^2$ variables). The block size parameter nb affects the performance of the Cholesky factorization code for the BHF format. Based on numerical experiments with the block_hybrid_speed.f90 program, included in the TOMS 865 package and compiled with the ATLAS library, we chose for our Intel Core2 Quad Q6600 processor a block size of nb = 400 for single-core and nb = 1000 for multi-core computations. For the remaining versions of the BLAS/LAPACK libraries we chose nb = 200.
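To make the storage comparison concrete, the sketch below shows the standard LAPACK packed layout for the upper triangle (the layout that DPPTRF operates on), written in plain Python with 0-based indices. It does not reproduce the RFP or BHF layouts, which rearrange the same $k(k+1)/2$ entries differently; the function names and the sample matrix are hypothetical.

```python
def pack_upper(A):
    """Pack the upper triangle of a symmetric k x k matrix column by column.

    With 0-based indices, entry (i, j) with i <= j is stored at
    position i + j*(j+1)/2 of a one-dimensional array.
    """
    k = len(A)
    ap = [0.0] * (k * (k + 1) // 2)
    for j in range(k):
        for i in range(j + 1):
            ap[i + j * (j + 1) // 2] = A[i][j]
    return ap


def packed_get(ap, i, j):
    """Read entry (i, j) of the symmetric matrix from packed storage."""
    if i > j:
        i, j = j, i          # use symmetry to reach the stored triangle
    return ap[i + j * (j + 1) // 2]


A = [[4.0, 1.0, 2.0],
     [1.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
ap = pack_upper(A)           # 6 words instead of 9
```

Columns of the packed triangle have different lengths, so a submatrix is not a regular two-dimensional array. That is the reason packed storage cannot be passed directly to level 3 BLAS, and it is the problem that the RFP and BHF formats address.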
4
Numerical Results
In order to evaluate the performance of the CSDP solver with our modifications, numerical tests on sensor network localization problems were performed. Sensor networks with 50, 100 and 200 nodes with randomly generated positions in the square region [0, 1] × [0, 1] were considered. To distribute the information evenly in the network, even when the number of anchors is small, and also to minimize the localization errors, the positions of the first five anchors were set to (0.2, 0.2), (0.2, 0.8), (0.8, 0.2), (0.8, 0.8) and (0.5, 0.5). The remaining anchors and all non-anchor nodes were distributed in the network randomly and uniformly, subject to a constraint on the minimal allowable distance between two neighboring nodes. In our numerical experiments the number of anchor nodes was 10% of all nodes and the transmission range (radio range) was equal to 0.3, 0.18 and 0.15 for the 50-, 100- and 200-node networks, respectively. To minimize the number of constraints in the SDP formulation of a WSN localization problem, we limited the number of edges connected to any sensor point; the limit was equal to 5 in our numerical tests. The SDP problems in the SDPA format for our three WSN localization problems of different sizes were generated by the MATLAB script file ESDP.m developed by Ye (http://www.stanford.edu/~yyye) and solved using the CSDP solver configured with different combinations of the BLAS/LAPACK libraries and Cholesky factorization procedures.
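The test setup described above can be sketched as follows. This is only an illustration in plain Python and not the ESDP.m script: the minimal distance value `min_dist`, the nearest-first rule for choosing which links to keep, and all function names are assumptions of this sketch, since the text does not specify them.

```python
import math
import random

# Positions of the first five anchors, as given in the text.
FIXED_ANCHORS = [(0.2, 0.2), (0.2, 0.8), (0.8, 0.2), (0.8, 0.8), (0.5, 0.5)]


def generate_network(n_nodes, radio_range, min_dist=0.02, max_edges=5, seed=0):
    """Return (points, n_anchors, edges) for a random test network.

    Nodes 0 .. n_anchors-1 are anchors. min_dist and the nearest-first
    edge selection are hypothetical choices made for this sketch.
    """
    rng = random.Random(seed)
    n_anchors = max(len(FIXED_ANCHORS), n_nodes // 10)   # 10% of all nodes
    points = list(FIXED_ANCHORS)
    # Rejection sampling: uniform points in the unit square that are
    # not closer than min_dist to any point placed so far.
    while len(points) < n_nodes:
        p = (rng.random(), rng.random())
        if all(math.dist(p, q) >= min_dist for q in points):
            points.append(p)
    # Candidate links: pairs within radio range, skipping anchor-anchor pairs.
    candidates = []
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            if j < n_anchors:
                continue
            d = math.dist(points[i], points[j])
            if d <= radio_range:
                candidates.append((d, i, j))
    candidates.sort()
    # Keep the shortest links first, at most max_edges per node.
    degree = [0] * n_nodes
    edges = []
    for d, i, j in candidates:
        if degree[i] < max_edges and degree[j] < max_edges:
            edges.append((i, j, d))
            degree[i] += 1
            degree[j] += 1
    return points, n_anchors, edges


points, n_anchors, edges = generate_network(50, 0.3)
```

Each retained edge contributes one pair of distance bounds to problem (2), so the degree limit directly controls the number of constraints $k$.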
The calculations were carried out on a quad-core Intel Core2 Quad Q6600 machine with a 2.4 GHz clock and 2 GB of RAM, running a 64-bit Linux operating system. We compared the numerical performance and scalability of different implementations of the Cholesky factorization procedure for triangular and symmetric positive definite matrices, needed during the iterations of the interior-point algorithm of the CSDP solver. The CSDP code was compiled with the Intel C/C++ compiler (version 10.0). The Fortran 77 code of the RFP subroutines and the Fortran 95 code of the BHF subroutines were compiled with the Intel Fortran Compiler (version 10.0). To compare the performance of the Cholesky factorization routines implemented for different matrix formats we used the existing system BLAS/LAPACK libraries (version 3.1.1 from Netlib) and the highly optimized serial and parallel BLAS/LAPACK routines for our quad-core processor provided by ATLAS (latest stable version, 3.8.0), the Intel Math Kernel Library (Intel MKL, version 10.0.05) and GotoBLAS (version 1.19). Table 1 and Figures 1 and 2 give performance results for the LAPACK, RFP and BHF Cholesky factorization routines. The LAPACK routines DPOTRF (full) and DPPTRF (packed) are compared with the DPFTRF and DHPPTF routines for the RFP and BHF formats, respectively. In all cases double precision arithmetic was used and the Cholesky factorization procedures were called for the upper triangular matrix (uplo='U'). The table presents the time of a single Cholesky factorization of the Schur complement matrix in the CSDP algorithm, the obtained parallel speedup, the Mflops rate achieved by this factorization, the total runtime of the CSDP solver, the total time of all Cholesky factorizations and, finally, the total time cost of converting the full matrix format used by the CSDP solver to the RFP, BHF or packed matrix formats and back. All times are in seconds.

4.1 Sequential Implementations of Libraries
As we can see, the performance of the RFP Cholesky factorization is about the same as that of the optimized LAPACK full-storage Cholesky factorization subroutine provided by the ATLAS, Intel MKL and GotoBLAS libraries, while RFP uses half the storage (see Figs. 1 and 2). With GotoBLAS, the RFP Cholesky factorization is up to a factor of about 8 faster than the LAPACK packed routine while using the same storage. The performance results for the BHF format are similar to, but slightly worse than, those obtained for the RFP format. The optimized LAPACK Cholesky factorization subroutines provided by the ATLAS, Intel MKL and GotoBLAS libraries, as well as the optimized Cholesky factorization subroutines for matrices stored in the RFP and BHF formats, are many times faster than the Cholesky factorization for the LAPACK packed data format (see Fig. 1 and Table 1). The highly optimized LAPACK DPOTRF procedure from the Intel MKL and GotoBLAS libraries clearly outperforms the Cholesky factorizations in the RFP and BHF formats, but it uses twice as much memory. Our experiments have shown that using optimized dense linear algebra routines can speed up CSDP by a factor of 3 to 5 for the GotoBLAS library with one OpenMP thread on a typical quad-core machine (compared with Netlib's full-storage BLAS/LAPACK library).
[Figure 1 plot: bar chart of total computation time (vertical axis, 0 to 3000) for each library (horizontal axis: NETLIB, ATLAS-1 to ATLAS-4, MKL-1 to MKL-4, GOTO-1 to GOTO-4). Series: LAPACK - DPOTRF, RFP - DPFTRF, BHF - DHPPTF, LAPACK - DPPTRF.]
Fig. 1. Total runtime of the CSDP solver for WSN localization problem with 200 nodes (different implementations of BLAS/LAPACK libraries and data formats)
[Figure 2 plot: time of the first Cholesky factorization and Cholesky factorization storage for GOTO-1 to GOTO-4. Series: LAPACK - DPOTRF, RFP - DPFTRF, BHF - DHPPTF, LAPACK - DPPTRF.]
Fig. 2. Time and memory usage of Cholesky factorization in first iteration of the CSDP solver for WSN localization problem with 200 nodes (GotoBLAS library)
LAPACK – DPOTRF (full format)
Nodes (k)   Library (threads)    Tit     P   Mflps     Ttot    Tchol
50 (1532)   Netlib              0.87  1.00    1385    15.93    14.71
            ATLAS (1)           0.23  1.00    5202     5.17     3.85
            ATLAS (2)           0.17  1.34    6985     4.48     3.16
            ATLAS (3)           0.23  1.01    5234     4.50     3.16
            ATLAS (4)           0.16  1.43    7458     4.28     2.95
            Intel MKL (1)       0.18  1.00    6681     4.33     3.04
            Intel MKL (2)       0.11  1.66   11077     3.07     1.80
            Intel MKL (3)       0.12  1.55   10359     2.76     1.49
            Intel MKL (4)       0.08  2.23   14907     2.55     1.28
            GotoBLAS (1)        0.17  1.00    7173     4.30     2.84
            GotoBLAS (2)        0.10  1.75   12524     3.11     1.63
            GotoBLAS (3)        0.08  2.18   15626     2.84     1.33
            GotoBLAS (4)        0.07  2.40   17245     2.74     1.20
100 (2601)  Netlib              4.16  1.00    1412    82.73    78.96
            ATLAS (1)           1.04  1.00    5661    23.69    19.54
            ATLAS (2)           0.66  1.56    8845    16.79    12.62
            ATLAS (3)           0.59  1.76    9946    15.57    11.41
            ATLAS (4)           0.63  1.63    9250    14.73    10.56
            Intel MKL (1)       0.84  1.00    7009    19.90    15.89
            Intel MKL (2)       0.47  1.78   12456    12.90     8.89
            Intel MKL (3)       0.37  2.28   16000    10.86     6.87
            Intel MKL (4)       0.31  2.68   18757     9.84     5.82
            GotoBLAS (1)        0.76  1.00    7701    18.92    14.48
            GotoBLAS (2)        0.43  1.78   13676    12.67     8.17
            GotoBLAS (3)        0.34  2.24   17246    11.04     6.48
            GotoBLAS (4)        0.30  2.58   19869    10.22     5.65
200 (6696)  Netlib             70.32  1.00    1423  1360.39  1336.14
            ATLAS (1)          15.90  1.00    6295   329.21   302.00
            ATLAS (2)           8.75  1.82   11441   193.97   166.68
            ATLAS (3)           6.74  2.36   14858   153.81   126.56
            ATLAS (4)           6.02  2.64   16633   140.20   112.89
            Intel MKL (1)      13.79  1.00    7255   288.57   262.06
            Intel MKL (2)       7.47  1.85   13401   168.24   141.76
            Intel MKL (3)       5.41  2.55   18496   129.47   102.96
            Intel MKL (4)       4.42  3.12   22638   110.56    84.06
            GotoBLAS (1)       12.22  1.00    8192   261.18   232.33
            GotoBLAS (2)        6.56  1.86   15262   153.57   124.65
            GotoBLAS (3)        4.84  2.53   20690   120.99    92.03
            GotoBLAS (4)        4.00  3.05   25015   105.13    76.08

RFP – DPFTRF
Nodes (k)   Library (threads)    Tit     P   Mflps     Ttot    Tchol  Tconv
50 (1532)   Netlib              0.48  1.00    2506     9.87     8.25   0.38
            ATLAS (1)           0.24  1.00    5034     5.69     3.98   0.39
            ATLAS (2)           0.21  1.12    5646     4.90     3.19   0.38
            ATLAS (3)           0.18  1.30    6546     4.78     3.07   0.38
            ATLAS (4)           0.17  1.38    6964     4.78     3.07   0.38
            Intel MKL (1)       0.18  1.00    6486     4.79     3.12   0.38
            Intel MKL (2)       0.11  1.65   10711     3.49     1.84   0.37
            Intel MKL (3)       0.09  2.04   13214     3.17     1.51   0.37
            Intel MKL (4)       0.09  2.08   13467     3.07     1.38   0.39
            GotoBLAS (1)        0.17  1.00    7084     4.71     2.88   0.39
            GotoBLAS (2)        0.10  1.78   12616     3.46     1.61   0.38
            GotoBLAS (3)        0.07  2.31   16396     3.20     1.28   0.38
            GotoBLAS (4)        0.07  2.49   17652     3.10     1.16   0.38
100 (2601)  Netlib              2.67  1.00    2195    56.23    51.21   1.23
            ATLAS (1)           1.04  1.00    5661    24.98    19.58   1.24
            ATLAS (2)           0.63  1.64    9300    17.89    12.50   1.23
            ATLAS (3)           0.61  1.71    9679    16.59    11.19   1.23
            ATLAS (4)           0.59  1.75    9899    16.29    10.89   1.22
            Intel MKL (1)       0.82  1.00    7113    20.89    15.65   1.22
            Intel MKL (2)       0.46  1.80   12829    13.87     8.64   1.21
            Intel MKL (3)       0.37  2.25   15973    12.22     6.99   1.21
            Intel MKL (4)       0.33  2.47   17577    11.47     6.26   1.22
            GotoBLAS (1)        0.78  1.00    7508    20.52    14.84   1.25
            GotoBLAS (2)        0.42  1.84   13840    13.82     8.09   1.23
            GotoBLAS (3)        0.34  2.33   17498    12.16     6.40   1.22
            GotoBLAS (4)        0.30  2.63   19776    11.48     5.64   1.23
200 (6696)  Netlib             45.36  1.00    2206   898.84   865.93   8.59
            ATLAS (1)          16.05  1.00    6235   340.57   304.73   8.63
            ATLAS (2)           8.88  1.81   11275   204.91   169.01   8.61
            ATLAS (3)           6.71  2.39   14916   163.51   127.64   8.61
            ATLAS (4)           5.90  2.72   16968   148.95   113.03   8.62
            Intel MKL (1)      13.32  1.00    7511   288.24   253.09   8.60
            Intel MKL (2)       7.12  1.87   14058   170.23   135.14   8.60
            Intel MKL (3)       5.53  2.41   18105   140.00   104.95   8.58
            Intel MKL (4)       4.77  2.79   20979   125.60    90.49   8.60
            GotoBLAS (1)       12.40  1.00    8069   273.23   235.84   8.64
            GotoBLAS (2)        6.56  1.89   15250   161.96   124.45   8.61
            GotoBLAS (3)        4.80  2.58   20856   128.87    91.38   8.61
            GotoBLAS (4)        4.05  3.06   24702   114.61    76.99   8.62

BHF – DHPPTF
Nodes (k)   Library (threads)    Tit     P   Mflps     Ttot    Tchol  Tconv
50 (1532)   Netlib              0.87  1.00    1380    16.81    14.77   0.79
            ATLAS (1)           0.23  1.00    5111     6.00     3.97   0.70
            ATLAS (2)           0.22  1.05    5379     5.66     3.77   0.55
            ATLAS (3)           0.21  1.10    5611     5.59     3.69   0.55
            ATLAS (4)           0.22  1.07    5460     5.52     3.63   0.55
            Intel MKL (1)       0.18  1.00    6806     5.07     2.99   0.79
            Intel MKL (2)       0.10  1.71   11648     3.83     1.76   0.78
            Intel MKL (3)       0.12  1.48   10106     3.72     1.64   0.80
            Intel MKL (4)       0.08  2.33   15875     3.30     1.21   0.80
            GotoBLAS (1)        0.17  1.00    7038     5.13     2.89   0.79
            GotoBLAS (2)        0.10  1.69   11902     3.93     1.65   0.78
            GotoBLAS (3)        0.08  2.13   15019     3.65     1.36   0.78
            GotoBLAS (4)        0.07  2.48   17421     3.53     1.17   0.79
100 (2601)  Netlib              4.24  1.00    1382    87.43    80.75   2.89
            ATLAS (1)           1.03  1.00    5667    26.01    19.62   2.23
            ATLAS (2)           0.81  1.27    7203    21.28    15.09   2.02
            ATLAS (3)           0.74  1.40    7952    19.75    13.57   2.02
            ATLAS (4)           0.66  1.56    8845    18.97    12.76   2.04
            Intel MKL (1)       0.81  1.00    7210    22.39    15.47   2.88
            Intel MKL (2)       0.47  1.73   12472    15.85     8.91   2.93
            Intel MKL (3)       0.45  1.81   13017    15.34     8.46   2.86
            Intel MKL (4)       0.32  2.58   18620    12.78     5.83   2.92
            GotoBLAS (1)        0.81  1.00    7284    22.57    15.26   2.89
            GotoBLAS (2)        0.44  1.82   13279    15.76     8.38   2.88
            GotoBLAS (3)        0.35  2.28   16602    14.18     6.73   2.89
            GotoBLAS (4)        0.31  2.64   19212    13.37     5.90   2.88
200 (6696)  Netlib             72.14  1.00    1387  1416.40  1370.68  21.01
            ATLAS (1)          16.06  1.00    6230   347.46   305.56  14.61
            ATLAS (2)           9.63  1.67   10394   227.01   184.62  15.06
            ATLAS (3)           8.33  1.93   12015   202.86   160.58  15.05
            ATLAS (4)           7.66  2.10   13060   188.19   145.86  15.06
            Intel MKL (1)      13.15  1.00    7608   297.70   250.29  20.90
            Intel MKL (2)       7.39  1.78   13549   187.57   140.16  20.91
            Intel MKL (3)       7.22  1.82   13859   183.75   136.43  20.87
            Intel MKL (4)       4.61  2.85   21717   133.77    86.32  20.90
            GotoBLAS (1)       13.19  1.00    7587   300.48   250.82  20.93
            GotoBLAS (2)        7.02  1.88   14247   183.20   133.46  20.94
            GotoBLAS (3)        5.40  2.44   18527   152.76   102.33  20.93
            GotoBLAS (4)        4.44  2.97   22542   133.95    84.00  20.93

LAPACK – DPPTRF (packed format)
Nodes (k)   Library (threads)    Tit     P   Mflps     Ttot    Tchol  Tconv
50 (1532)   Netlib              1.05  1.00    1142    19.42    17.85   0.33
            ATLAS (1)           1.27  1.00     947    23.13    21.47   0.33
            ATLAS (2)           1.26  1.00     949    23.13    21.47   0.33
            ATLAS (3)           1.26  1.00     950    23.11    21.46   0.33
            ATLAS (4)           1.26  1.00     948    23.08    21.43   0.33
            Intel MKL (1)       0.94  1.00    1273    17.71    16.10   0.33
            Intel MKL (2)       0.95  0.99    1260    17.87    16.27   0.33
            Intel MKL (3)       0.95  0.99    1255    17.98    16.37   0.33
            Intel MKL (4)       0.94  1.00    1271    17.78    16.18   0.33
            GotoBLAS (1)        0.80  1.00    1495    15.49    13.74   0.33
            GotoBLAS (2)        0.81  0.99    1486    15.52    13.72   0.33
            GotoBLAS (3)        0.81  0.99    1485    15.51    13.67   0.33
            GotoBLAS (4)        0.80  1.00    1498    15.54    13.59   0.33
100 (2601)  Netlib              5.44  1.00    1079   108.15   103.29   1.06
            ATLAS (1)           6.65  1.00     882   131.44   126.22   1.06
            ATLAS (2)           6.65  1.00     882   131.43   126.20   1.06
            ATLAS (3)           6.64  1.00     883   131.47   126.24   1.06
            ATLAS (4)           6.64  1.00     883   131.58   126.34   1.06
            Intel MKL (1)       7.17  1.00     818   141.67   136.58   1.06
            Intel MKL (2)       7.15  1.00     821   141.39   136.31   1.06
            Intel MKL (3)       7.20  1.00     815   141.93   136.83   1.06
            Intel MKL (4)       7.21  0.99     813   141.74   136.64   1.06
            GotoBLAS (1)        5.21  1.00    1125   104.51    99.00   1.06
            GotoBLAS (2)        5.21  1.00    1126   104.52    98.95   1.06
            GotoBLAS (3)        5.21  1.00    1126   104.78    99.13   1.06
            GotoBLAS (4)        5.21  1.00    1126   104.72    99.02   1.06
200 (6696)  Netlib             93.56  1.00    1070  1809.39  1777.89   7.02
            ATLAS (1)         113.99  1.00     878  2200.59  2165.77   7.02
            ATLAS (2)         113.98  1.00     878  2200.08  2165.58   7.03
            ATLAS (3)         113.99  1.00     878  2201.33  2166.47   7.03
            ATLAS (4)         113.98  1.00     878  2200.46  2165.63   7.03
            Intel MKL (1)     134.45  1.00     744  2594.16  2560.35   7.03
            Intel MKL (2)     134.77  1.00     743  2599.29  2565.55   7.03
            Intel MKL (3)     134.97  1.00     741  2597.77  2564.03   7.04
            Intel MKL (4)     134.64  1.00     743  2602.90  2569.10   7.03
            GotoBLAS (1)       96.05  1.00    1042  1862.26  1826.51   7.03
            GotoBLAS (2)       96.26  1.00    1040  1865.21  1828.85   7.04
            GotoBLAS (3)       96.06  1.00    1042  1861.90  1825.69   7.02
            GotoBLAS (4)       96.05  1.00    1042  1863.95  1827.70   7.03
Table 1. Time of the Cholesky factorization in the first iteration of the CSDP solver (Tit), parallel speedup (P), achieved Mflops rate (Mflps), total computation time (Ttot), total Cholesky factorization time (Tchol) and total matrix conversion time (Tconv) of the CSDP solver linked with different implementations of the BLAS/LAPACK libraries (Netlib's BLAS/LAPACK, sequential and parallel versions of ATLAS, Intel MKL and GotoBLAS) and used to solve three WSN localization problems with different numbers of nodes
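The derived columns of Table 1 can be related to the measured times with a short calculation. The sketch below assumes the usual leading-order operation count of $k^3/3$ floating point operations for a Cholesky factorization of a $k \times k$ matrix, and takes the speedup as the ratio of the one-thread time to the multi-thread time of the same library; the function names are hypothetical. The sample values are the GotoBLAS entries for LAPACK DPOTRF with 200 nodes ($k = 6696$).

```python
def cholesky_mflops(k, seconds):
    """Approximate Mflops rate of one Cholesky factorization.

    Assumes k^3/3 floating point operations (leading-order term only).
    """
    return (k ** 3 / 3.0) / seconds / 1e6


def speedup(t_one_thread, t_many_threads):
    """Parallel speedup P relative to the one-thread run."""
    return t_one_thread / t_many_threads


k = 6696
t1, t4 = 12.22, 4.00            # Tit for GotoBLAS (1) and GotoBLAS (4)
rate4 = cholesky_mflops(k, t4)  # close to the tabulated 25015
p4 = speedup(t1, t4)            # close to the tabulated 3.05
```

Small differences from the tabulated values are expected, because the times in the table are rounded to two decimal places.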
It should be pointed out that in the CSDP implementation discussed here it was necessary, before and after each Cholesky factorization, to convert between the full matrix format used in the C code of the CSDP library and the RFP or BHF matrix format, using suitable transformation procedures: DTRTTF and DTFTTR for the RFP matrix format, and DPPHPP and DHPPPP for the BHF matrix format, together with our auxiliary code for conversion between the packed and full matrix storage formats. The total matrix format conversion time needed for the RFP and BHF procedures was noticeable in the total runtime of the CSDP solver. With the fastest BLAS library (GotoBLAS) and larger matrices, the format conversions consume less than 10% of the total computation time, while the Cholesky factorizations take up to 90% of the entire runtime of the CSDP solver, so reimplementing the solver to use the RFP or BHF matrix format throughout would be fully justified by the memory savings.

4.2 Parallel Implementations of Libraries
To speed up the calculations we used the POSIX threads and/or OpenMP support included in the parallel versions of the optimized BLAS/LAPACK libraries. We compared the results obtained using the parallel versions of the ATLAS, Intel MKL and GotoBLAS libraries with those of their sequential implementations. On our multi-core machine the Cholesky factorization algorithms for the RFP and BHF formats parallelize no worse than the LAPACK full-storage DPOTRF procedure. For Intel MKL and GotoBLAS the speedup of the native Cholesky factorization for larger matrices is very good (P = 1.9, 2.5, 3.1 for 2, 3 and 4 cores, respectively), while for the ATLAS library it is noticeably worse (P = 1.8, 2.3, 2.6). For the LAPACK packed subroutine DPPTRF, using the parallel BLAS/LAPACK libraries had no effect. Next we compared the results obtained for the parallel ATLAS, Intel MKL and GotoBLAS libraries with Netlib's BLAS/LAPACK routines. With GotoBLAS and four OpenMP threads, the Cholesky factorization is up to a factor of about 24 faster than the LAPACK packed routine while using the same storage. Finally, our experiments have shown that using the parallel version of GotoBLAS on a quad-core machine we can speed up CSDP by up to a factor of 13 compared with Netlib's LAPACK full-storage routine.
5
Summary and Conclusions
In this paper we described a modified version of the CSDP semidefinite programming solver that was applied to estimate the locations of nodes in a wireless sensor network. In general, the Cholesky factorization procedures for the RFP and BHF matrix formats, used by the modified code of the CSDP solver, preserve the high numerical performance and scalability of the optimized level 3 BLAS and LAPACK full format routines, and save almost half the space compared with the LAPACK full matrix format. Thus the RFP or BHF matrix formats can replace both LAPACK matrix formats, full and packed, for symmetric and triangular matrices. Moreover, it may be profitable to reimplement the whole CSDP solver code to store all
symmetric and triangular matrices in the RFP matrix format. That modification would make it possible to solve larger WSN localization problems on the same machine without losing the efficiency and scalability of LAPACK full matrix storage.
Acknowledgment The authors would like to thank Jerzy Wasniewski and Fred Gustavson for their constructive comments and suggestions which helped in improving the quality of this paper.
References
1. Kannan, A.A., Mao, G., Vucetic, B.: Simulated annealing based localization in wireless sensor network. In: Proceedings of the 30th Annual IEEE Conference on Local Computer Networks, pp. 513–514
2. Shang, Y., Ruml, W., Zhang, Y., Fromherz, M.P.J.: Localization from connectivity in sensor networks. IEEE Transactions on Parallel and Distributed Systems 15(11), 961–974 (2004)
3. Ji, X.: Sensor positioning in wireless ad-hoc sensor networks with multidimensional scaling. In: Proceedings of IEEE INFOCOM 2004 (2004)
4. Biswas, P., Ye, Y.: Semidefinite programming for ad hoc wireless sensor network localization. In: Proceedings of the Third International Symposium on Information Processing in Sensor Networks, pp. 46–54
5. Boyd, S., Ghaoui, L.E., Feron, E., Balakrishnan, V.: Linear Matrix Inequalities in System and Control Theory. Studies in Applied Mathematics, vol. 15
6. Borchers, B.: CSDP, a C library for semidefinite programming. Optimization Methods & Software 11-2(1-4), 613–623 (1999)
7. Helmberg, C., Rendl, F., Vanderbei, R.J., Wolkowicz, H.: An interior-point method for semidefinite programming. SIAM Journal on Optimization 6(2), 342–361
8. Borchers, B., Young, J.G.: Implementation of a primal-dual method for SDP on a shared memory parallel architecture. Computational Optimization and Applications 37(3), 355–369 (2007)
9. Fujisawa, K., Kojima, M., Nakata, K., Yamashita, M.: SDPA (semidefinite programming algorithm) users manual - version 6.00. Technical Report B–308, Tokyo Institute of Technology (1995)
10. Sturm, J.F.: Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods & Software 11-2(1-4), 625–653 (1999)
11. Gustavson, F.G., Waśniewski, J.: Rectangular full packed format for LAPACK algorithms timings on several computers. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, Springer, Heidelberg (2007)
12. Gustavson, F.G., Reid, J.K., Waśniewski, J.: Algorithm 865: Fortran 95 subroutines for Cholesky factorization in blocked hybrid format. ACM Transactions on Mathematical Software 33(1) 8
13. Andersen, B.S., Gunnels, J.A., Gustavson, F.G., Reid, J.K., Waśniewski, J.: A fully portable high performance minimal storage hybrid format Cholesky algorithm. ACM Transactions on Mathematical Software 31(2), 201–227
New Data Structures for Matrices and Specialized Inner Kernels: Low Overhead for High Performance
José R. Herrero
Computer Architecture Department, Universitat Politècnica de Catalunya, Barcelona, Spain
[email protected]
Abstract. Dense linear algebra codes are often expressed and coded in terms of BLAS calls. This approach, however, achieves suboptimal performance due to the overheads associated with such calls. Taking as an example the dense Cholesky factorization of a symmetric positive definite matrix, we show that the potential of non-canonical data structures for dense linear algebra can be better exploited with the use of specialized inner kernels. The use of non-canonical data structures together with specialized inner kernels has low overhead and can produce excellent performance.
Keywords: Specialized inner kernels, new data structures, dense linear algebra, low overhead, high performance.
1
Introduction
Linear algebra codes often make use of Level 3 Basic Linear Algebra Subroutines (BLAS) [1]. Performance portability is achieved through high performance libraries such as ATLAS [2], Goto BLAS [3] or those provided by machine vendors. Operations are often expressed in terms of matrix-matrix multiplication operations (calls to routine GEMM) [4]. Thus, a great effort is devoted to the implementation of this routine [5,6,7,8,9,10,11]. Data precopying [12] can be very useful to exploit locality and facilitate data streaming. However, such copies introduce overhead [13] since they are repeated in every call to BLAS routines [9,14]. Figure 1 shows the performance of several codes using version 3.7.11 of the ATLAS library (see footnote 1). We can observe that the performance obtained by the matrix multiplication routine (DGEMM) is substantially higher than that obtained by the Cholesky factorization routine (DPOTRF). This is mainly due to the overhead mentioned above.
Footnote 1: This work was supported by the Ministerio de Educación y Ciencia of Spain with grants TIN2004-07739-C02-01 and TIN2007-60625. Experiments have been conducted on an Intel Itanium 2 processor running at 1.3 GHz with a theoretical peak performance of 5.2 GFlops (indicated by the dashed line at the top of some graphs).
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 659–667, 2008. © Springer-Verlag Berlin Heidelberg 2008
Fig. 1. Performance of matrix multiplication and Cholesky factorization using ATLAS
New Data Structures (NDS) have been introduced in dense linear algebra codes as an alternative to canonical storage [15,16,17,18,19,20,21,22,23,24,25]. The main goals are improving locality, avoiding data copies, and obtaining reduced (or even minimum) storage requirements for triangular and symmetric matrices while achieving performance similar to that of codes which work on full storage. When NDS are used, data copies can be avoided: instead of calling BLAS routines, simpler kernels can be defined and used. Several papers reflect the need for and/or the creation of such kernels [10,14,19,21,23,24,26,27,28,29,30,31]. If BLAS calls are used, however, unnecessary overhead is paid. The two curves at the bottom of Figure 1 show the performance of a Cholesky factorization routine (DPSTRF) based on a Square Blocked Packed Format (SBPF) [19] for two block sizes (nb=200 and nb=100) when calls to ATLAS BLAS routines are performed. The lower triangle of the matrix is used in our implementation. We can observe that performance drops for the smaller block size. This happens because the overhead is multiplied by the larger number of calls.
For several years Goto's BLAS library has been the reference library due to its great performance. Recently, changes have been introduced in Goto's library to produce even higher performance for BLAS3 operations [3]. The authors have modified the copying procedure to avoid redundant packing and, therefore, reduce the overhead and increase performance for operations such as the Symmetric Rank-K update (SYRK) and Triangular Solve (TRSM), amongst others. Since the Cholesky operation is implemented with calls to these operations, together with DGEMM, the performance of both DPOTRF and DPSTRF is excellent. Figure 2 shows the performance obtained with these two routines. We have used Goto's BLAS version 1.15 in our experiments. The curve at the top corresponds to a direct call to DPOTRF, while the rest correspond to calls to DPSTRF with different block sizes.
Although performance drops for smaller block sizes, it remains remarkable. Thus, the avoidance of redundant packing together with the high performance matrix multiplication kernels [11] are very effective and seem good examples to follow.

Fig. 2. Performance of Cholesky factorization for several block sizes using Goto's library

Our goal is to analyze the potential of NDS combined with specialized inner kernels. Thus, we first need to determine the best candidate for our study. Recursion in dense linear algebra has been explored for several years [23,32,33,34], but recently it has been shown that iterative codes with proper block sizes can achieve better performance than recursive codes [35,36,37]. For this reason we focus our study on a dense Cholesky factorization using an iterative approach. We have chosen to work with a Square Blocked Lower Packed Format [19] with the TRANS parameter fixed to 'T'. Although this format exceeds the minimum storage provided by other formats such as the Hybrid Full Packed (HFP) format [22], it is rather simple, it allows stride-one data accesses in the inner kernels, and it is easy to adapt to our inner kernels. This allows us to obtain upper bounds on the performance of this approach. In the same vein, we compare only with the best combinations for DGEMM ($A^T \times B$) and DPOTRF (UPLO='U'), assuming Fortran column-major storage.
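The storage trade-off of a square blocked packed layout can be illustrated with a small sketch. The code below is plain Python written for this purpose and is not the format definition from [19]: the ordering of the blocks, the row-major order inside each block, the zero padding at the matrix edge and the function names are all assumptions. It stores only the square blocks on or below the block diagonal, each as one contiguous array.

```python
def sbp_pack_lower(A, nb):
    """Pack the lower triangle of a symmetric n x n matrix into nb x nb blocks.

    Returns a list of (block_row, block_col, data), where data is a flat
    list of nb*nb values. Entries past the matrix edge are zero padding.
    """
    n = len(A)
    nblk = -(-n // nb)                      # ceil(n / nb)
    blocks = []
    for bi in range(nblk):
        for bj in range(bi + 1):            # blocks on or below the diagonal
            data = []
            for r in range(bi * nb, (bi + 1) * nb):
                for c in range(bj * nb, (bj + 1) * nb):
                    data.append(A[r][c] if r < n and c < n else 0.0)
            blocks.append((bi, bj, data))
    return blocks


def sbp_words(n, nb):
    """Number of stored words: full square blocks for the lower block triangle."""
    nblk = -(-n // nb)
    return nblk * (nblk + 1) // 2 * nb * nb


A = [[4.0, 1.0, 2.0, 0.5],
     [1.0, 5.0, 3.0, 1.5],
     [2.0, 3.0, 6.0, 2.5],
     [0.5, 1.5, 2.5, 7.0]]
blocks = sbp_pack_lower(A, 2)
words = sbp_words(4, 2)      # 12 words, versus 10 minimum and 16 for full storage
```

Because the diagonal blocks are stored as full squares, the layout uses slightly more than the minimum $n(n+1)/2$ words, but every block is a regular contiguous array that an inner kernel can traverse with stride one.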
2
Specialized Inner Kernels
In this section we briefly state our approach to the creation of high performance specialized kernels.

Profiling. Optimization efforts must be applied to those parts of code which take up most computation time. In this case, for instance, this means focusing on the optimization of the matrix-matrix multiplication routine first.

A bottom-up approach. We drive the creation of the structure from the bottom: the inner kernel fixes the size of the data submatrices [37]. Then the rest of the
data structure is produced in conformance. We do this because the performance of the inner kernel has a dramatic influence on the overall performance of the algorithm. Thus, our first priority is to use the best inner kernel at hand. Afterwards, we can adapt the rest of the data structure and/or the computations.

Specialization. Code specialization is commonly used to optimize generic routines which are otherwise difficult to optimize. Specialization can simplify the code and, at the same time, allow the use of other optimization techniques such as memoization [38]. Simple codes are easier to optimize, both automatically and manually.

Inner kernel based on our Small Matrix Library (SML). In previous papers [29,31] we presented our work on the creation of a Small Matrix Library (SML): a set of routines, written in Fortran, specialized in the efficient operation on matrices which fit in the first level cache. The advantage of our method lies in the ability to generate very efficient inner kernels by means of a good compiler. Working on regular codes for small matrices, most of the compilers we have used on different platforms create very efficient inner kernels for regular codes such as matrix-matrix multiplication. Applying this approach to the Cholesky factorization, we first create the inner kernel for the matrix multiplication operation (GEMM). This results in a block size nb = 124 for the Itanium 2 machine. Once the block size is fixed, we apply the same approach to the other operations (TRSM, SYRK, and POTRF). We use the resulting routines, which we store within the SML, as the inner kernels of our general NDS linear algebra codes.
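How the four inner kernels fit together in an iterative blocked Cholesky factorization can be sketched as follows. This is a readable plain Python illustration and not the SML code: it works on a full list-of-lists matrix rather than on a blocked packed format, it assumes the matrix size is a multiple of the block size, and the kernel names and signatures are hypothetical. It computes the lower factor $L$ with $A = L L^T$ in place.

```python
import math


def potrf(A, k0, nb):
    """Unblocked Cholesky of the nb x nb diagonal block starting at (k0, k0)."""
    for j in range(k0, k0 + nb):
        s = A[j][j] - sum(A[j][p] ** 2 for p in range(k0, j))
        A[j][j] = math.sqrt(s)
        for i in range(j + 1, k0 + nb):
            t = A[i][j] - sum(A[i][p] * A[j][p] for p in range(k0, j))
            A[i][j] = t / A[j][j]


def trsm(A, i0, k0, nb):
    """Overwrite block (i0, k0) with the solution X of X * L_kk^T = A_ik."""
    for r in range(i0, i0 + nb):
        for j in range(k0, k0 + nb):
            t = A[r][j] - sum(A[r][p] * A[j][p] for p in range(k0, j))
            A[r][j] = t / A[j][j]


def gemm_nt(A, i0, j0, k0, nb):
    """Off-diagonal update: A_ij -= A_ik * A_jk^T."""
    for r in range(i0, i0 + nb):
        for c in range(j0, j0 + nb):
            A[r][c] -= sum(A[r][p] * A[c][p] for p in range(k0, k0 + nb))


def syrk(A, i0, k0, nb):
    """Diagonal update: lower triangle of A_ii -= A_ik * A_ik^T."""
    for r in range(i0, i0 + nb):
        for c in range(i0, r + 1):
            A[r][c] -= sum(A[r][p] * A[c][p] for p in range(k0, k0 + nb))


def blocked_cholesky(A, nb):
    """Right-looking blocked Cholesky; the lower triangle of A becomes L."""
    n = len(A)
    assert n % nb == 0, "this sketch assumes n is a multiple of nb"
    for k0 in range(0, n, nb):
        potrf(A, k0, nb)
        for i0 in range(k0 + nb, n, nb):
            trsm(A, i0, k0, nb)
        for i0 in range(k0 + nb, n, nb):
            for j0 in range(k0 + nb, i0, nb):
                gemm_nt(A, i0, j0, k0, nb)
            syrk(A, i0, k0, nb)
    return A


# Build a test matrix A = L0 * L0^T from a known lower triangular factor.
L0 = [[2.0, 0.0, 0.0, 0.0],
      [1.0, 3.0, 0.0, 0.0],
      [0.0, 1.0, 2.0, 0.0],
      [1.0, 0.0, 1.0, 1.0]]
A = [[sum(L0[i][p] * L0[j][p] for p in range(4)) for j in range(4)]
     for i in range(4)]
L = blocked_cholesky(A, 2)   # the strictly upper part keeps the old values of A
```

Every kernel call touches only nb x nb blocks, so once the block size is fixed by the GEMM kernel the remaining three kernels can be specialized for exactly that size.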
3 Results
We present the performance of the Cholesky factorization of a matrix into an upper triangular matrix using routine DPOTRF in both Goto's library and ATLAS, and of DPSTRF working on a lower triangular matrix in SBP format, which calls the inner kernels in SML. We use the DPSTRF routine in a similar way as Gustavson does in [21], the only differences being that we use a block size of 124 × 124 and that SML routines are called in our case.

We also present results for matrix multiplication. The matrix multiplication performed is $C = C - A^T \times B$. We show results of DGEMM corresponding to ATLAS [2], Goto [9] and our code based on SB format and the SML [37]. Goto BLAS are known to obtain excellent performance on Intel platforms. They are coded in assembler and targeted to each particular platform.

Figure 3 shows the performance of these six codes on an Intel Itanium 2. The dashed line at the top of the plot shows the theoretical peak performance of the processor. We can observe that the performance of our codes based on NDS and SML is similar to that of ATLAS for matrix multiplication and exceeds that of ATLAS for Cholesky factorization. Goto's codes are the best in all cases. The merit of the combination of NDS and SML codes lies in the fact that they are the only ones which are not based on codes written in assembly language.
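To make the role of the four inner kernels concrete, the following is a minimal sketch of a blocked Cholesky factorization over square blocks. It is not the DPSTRF routine of [21] nor SML code: the right-looking loop order, the dictionary of blocks standing in for the SBP format, and all function names are assumptions made for this illustration. Each kernel operates on whole nb × nb blocks, which is the property that lets a specialized kernel be called directly on the stored data.

```python
import math


def potrf(a):
    """In-place Cholesky of one dense block; the lower triangle receives L."""
    nb = len(a)
    for j in range(nb):
        a[j][j] = math.sqrt(a[j][j] - sum(a[j][p] ** 2 for p in range(j)))
        for i in range(j + 1, nb):
            s = sum(a[i][p] * a[j][p] for p in range(j))
            a[i][j] = (a[i][j] - s) / a[j][j]


def trsm(l, b):
    """b := b * l^{-T}, with l lower triangular (one block each)."""
    nb = len(l)
    for row in b:
        for j in range(nb):
            s = sum(row[p] * l[j][p] for p in range(j))
            row[j] = (row[j] - s) / l[j][j]


def gemm_nt(a, b, c):
    """c := c - a * b^T on square blocks (acts as SYRK when a is b)."""
    nb = len(a)
    for i in range(nb):
        for j in range(nb):
            c[i][j] -= sum(a[i][p] * b[j][p] for p in range(nb))


def block_cholesky(blocks, nblk):
    """Factor a matrix stored as blocks[(i, j)], i >= j, into L * L^T."""
    for k in range(nblk):
        potrf(blocks[k, k])                       # diagonal block
        for i in range(k + 1, nblk):
            trsm(blocks[k, k], blocks[i, k])      # panel below the diagonal
        for j in range(k + 1, nblk):
            for i in range(j, nblk):
                # Trailing update: SYRK on the diagonal (i == j), GEMM elsewhere.
                gemm_nt(blocks[i, k], blocks[j, k], blocks[i, j])
```

Only the lower block triangle is stored, so the storage is close to that of packed format while every kernel still sees a full, contiguous square block.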
Fig. 3. Performance of several matrix multiplication and Cholesky factorization codes
Fig. 4. Performance of Cholesky factorization relative to matrix multiply
We mentioned in the introduction that calls to BLAS routines usually introduce some overhead (although it has been reduced in Goto's library). This can be observed more clearly in Figure 4, which shows the performance of Cholesky factorization relative to the corresponding matrix multiplication subroutine. ATLAS represents the traditional approach of implementing linear algebra codes in terms of BLAS routines. This results in poor performance of DPOTRF relative to DGEMM. Goto's library, however, has a considerably higher ratio due to the modifications discussed in Section 1. We can observe that the ratio DPSTRF/DGEMM SB is very high. This is due to the low overhead present in the calls to SML routines from the NDS code. We note that the relative performance drops for matrix dimensions of 2000 and larger. The reason for this is that we have implemented multilevel orthogonal block forms (MOB) [5,37] in our matrix multiplication code based on the SB format. However, we have not used any
additional blocking technique on the SBPF Cholesky. Thus, locality is not fully exploited for larger matrices within the DPSTRF routine.

In [24] the authors provide some performance results for Packed Recursive (PR) and Packed Hybrid (PH) formats. To get an approximate idea of their performance we have taken the values reported for an Itanium 2 running at 1 GHz. The theoretical peak performance of that machine is 4 GFlops. In order to compare results we present performance relative to the theoretical peak. Although this is not exactly the same as running the programs on the same machine, it gives a hint of the relative performance of these codes. We present the case where the upper matrix and a block size nb = 200 are used. Figure 5 presents these results. The curves labeled with a final + sign include the time necessary to rearrange the data from canonical storage into the NDS tested. The cost of the rearrangement is O(N^2) [14,31,39]. We can observe that the SBP format with calls to SML routines provides higher performance even though its block size is smaller (nb = 124).

Fig. 5. Performance of several NDS Cholesky factorization codes with and without rearrangement

Finally, Figure 6 presents the performance of all variants of Cholesky factorization tested. This figure includes the combination of DPSTRF with calls to BLAS from Goto's library with a block size nb = 124. We can observe that the performance obtained is slightly better than that corresponding to the case where SML routines are called. Both cases indicate that it is possible to have reduced storage and very high performance simultaneously.

Fig. 6. Performance of Cholesky factorization codes
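The O(N^2) cost of the rearrangement mentioned above follows from the fact that every stored element is copied exactly once. The sketch below illustrates this for the lower triangle of a matrix held in canonical column-major storage; the function name to_square_blocks, the dictionary of blocks, and the restriction that N be a multiple of nb are assumptions made to keep the example short, not a description of the codes measured in Figure 5.

```python
def to_square_blocks(a, n, nb):
    """Copy the lower triangle of a column-major n x n array into nb x nb blocks.

    Returns a dict mapping (block_row, block_col), block_row >= block_col, to a
    contiguous column-major list of nb*nb entries. n must be a multiple of nb.
    """
    if n % nb != 0:
        raise ValueError("n must be a multiple of nb in this sketch")
    nblk = n // nb
    blocks = {}
    for bj in range(nblk):
        for bi in range(bj, nblk):
            blk = [0.0] * (nb * nb)
            for c in range(nb):
                # Start of the column segment inside the canonical array.
                src = bi * nb + (bj * nb + c) * n
                blk[c * nb:(c + 1) * nb] = a[src:src + nb]
            blocks[bi, bj] = blk
    return blocks
```

With nblk = N/nb block rows, nblk(nblk+1)/2 blocks of nb^2 elements are copied, i.e. about N(N+nb)/2 elements in total, which is small compared with the roughly N^3/3 floating-point operations of the factorization itself.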
4 Conclusions
The specialization of the inner kernels avoids performing unnecessary operations repeatedly. In addition, it simplifies the code, allowing for more opportunities for automatic optimization. Working with simple square blocks, it is possible to produce efficient inner kernels with the help of an optimizing compiler. When these kernels are called directly from linear algebra codes which store matrices using non-canonical data structures, the overhead is very low. This happens because there are no costs associated with copying data and checking certain parameters. The performance obtained from the resulting Cholesky factorization is better than that of several previous implementations and approaches that of a hand-optimized implementation in which the most representative parts of the code are written in assembly code and data is packed for efficient use of the register file (Goto BLAS).
References

1. Dongarra, J., Du Croz, J., Duff, I., Hammarling, S.: A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Soft. 16, 1–17 (1990)
2. Whaley, R.C., Dongarra, J.J.: Automatically tuned linear algebra software. In: Supercomputing 1998, pp. 211–217. IEEE Computer Society, Los Alamitos (1998)
3. Goto, K., van de Geijn, R.: High-performance implementation of the level-3 BLAS. ACM Transactions on Mathematical Software (TOMS) (to appear)
4. Kågström, B., Ling, P., van Loan, C.: GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark. ACM Transactions on Mathematical Software (TOMS) 24, 268–302 (1998)
5. Navarro, J.J., Juan, A., Lang, T.: MOB forms: A class of Multilevel Block Algorithms for dense linear algebra operations. In: Proceedings of the 8th International Conference on Supercomputing, pp. 354–363. ACM Press, New York (1994)
6. Agarwal, R.C., Gustavson, F.G., Zubair, M.: Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms. IBM J. Res. Dev. 38, 563–576 (1994)
7. Navarro, J.J., García, E., Herrero, J.R.: Data prefetching and multilevel blocking for linear algebra operations. In: Proceedings of the 10th international conference on Supercomputing, pp. 109–116. ACM Press, New York (1996)
8. Gunnels, J.A., Henry, G., van de Geijn, R.A.: A family of high-performance matrix multiplication algorithms. In: International Conference on Computational Science, vol. (1), pp. 51–60 (2001)
9. Goto, K., van de Geijn, R.: On reducing TLB misses in matrix multiplication. Technical Report CS-TR-02-55, Univ. of Texas at Austin (2002)
10. Chatterjee, S., Bachega, L.R., Bergner, P., Dockser, K.A., Gunnels, J.A., Gupta, M., Gustavson, F.G., Lapkowski, C.A., Liu, G.K., Mendell, M., Nair, R., Wait, C.D., Ward, T.J.C., Wu, P.: Design and exploitation of a high-performance SIMD floating-point unit for Blue Gene/L. IBM Journal of Research and Development 49, 377–391 (2005)
11. Goto, K., van de Geijn, R.A.: Anatomy of a high-performance matrix multiplication. ACM Transactions on Mathematical Software 34 (2007)
12. Lam, M., Rothberg, E., Wolf, M.: The cache performance and optimizations of blocked algorithms. In: Proceedings of ASPLOS 1991, pp. 67–74 (1991)
13. Temam, O., Granston, E.D., Jalby, W.: To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts. In: Supercomputing, pp. 410–419 (1993)
14. Gustavson, F.G., Gunnels, J.A., Sexton, J.C.: Minimal data copy for dense linear algebra factorization. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 540–549. Springer, Heidelberg (2007)
15. McKellar, A.C., Coffman Jr., E.G.: Organizing matrices and matrix operations for paged memory systems. Communications of the ACM 12, 153–165 (1969)
16. Frens, J.D., Wise, D.S.: Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code. In: Proc. 6th ACM SIGPLAN Symp. on Principles and Practice of Parallel Program, SIGPLAN Notices, vol. 32, pp. 206–216 (1997)
17. Gustavson, F., Henriksson, A., Jonsson, I., Kågström, B.: Recursive blocked data formats and BLAS's for dense linear algebra algorithms. In: Kågström, B., Elmroth, E., Waśniewski, J., Dongarra, J. (eds.) PARA 1998. LNCS, vol. 1541, pp. 195–206. Springer, Heidelberg (1998)
18. Chatterjee, S., Jain, V.V., Lebeck, A.R., Mundhra, S., Thottethodi, M.: Nonlinear array layouts for hierarchical memory systems. In: Proceedings of the 13th international conference on Supercomputing, pp. 444–453. ACM Press, New York (1999)
19. Gustavson, F.G.: New generalized data structures for matrices lead to a variety of high-performance algorithms. In: Engquist, B. (ed.) Simulation and visualization on the grid: Parallelldatorcentrum, Kungl. Tekniska Högskolan, 7th annual conference, Stockholm, Sweden, vol. 13, pp. 46–61. Springer, Heidelberg (2000)
20. Andersen, B.S., Waśniewski, J., Gustavson, F.G.: A recursive formulation of Cholesky factorization of a matrix in packed storage. ACM Transactions on Mathematical Software (TOMS) 27, 214–244 (2001)
21. Gustavson, F.G.: High-performance linear algebra algorithms using new generalized data structures for matrices. IBM J. Res. Dev. 47, 31–55 (2003)
22. Gunnels, J.A., Gustavson, F.G.: A new array format for symmetric and triangular matrices. In: Dongarra, J., Madsen, K., Waśniewski, J. (eds.) PARA 2004. LNCS, vol. 3732, pp. 247–255. Springer, Heidelberg (2006)
23. Elmroth, E., Gustavson, F., Jonsson, I., Kågström, B.: Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review 46, 3–45 (2004)
24. Andersen, B.S., Gunnels, J.A., Gustavson, F.G., Reid, J.K., Waśniewski, J.: A fully portable high performance minimal storage hybrid format Cholesky algorithm. ACM Transactions on Mathematical Software 31, 201–227 (2005)
25. Bader, M., Mayer, C.: Cache oblivious matrix operations using Peano curves. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 521–530. Springer, Heidelberg (2007)
26. Siek, J.G., Lumsdaine, A.: A rational approach to portable high performance: The basic linear algebra instruction set (BLAIS) and the fixed algorithm size template (FAST) library. In: Demeyer, S., Bosch, J. (eds.) ECOOP 1998 Workshops. LNCS, vol. 1543, pp. 468–469. Springer, Heidelberg (1998)
27. Wise, D.S., Frens, J.D.: Morton-order matrices deserve compilers' support. Technical Report TR 533, Computer Science Department, Indiana University (1999)
28. Valsalam, V., Skjellum, A.: A framework for high-performance matrix multiplication based on hierarchical abstractions, algorithms and optimized low-level kernels. Concurrency and Computation: Practice and Experience 14, 805–839 (2002)
29. Herrero, J.R., Navarro, J.J.: Automatic benchmarking and optimization of codes: an experience with numerical kernels. In: Int. Conf. on Software Engineering Research and Practice, pp. 701–706. CSREA Press (2003)
30. Kågström, B.: Management of deep memory hierarchies - recursive blocked algorithms and hybrid data structures for dense matrix computations. In: Dongarra, J., Madsen, K., Waśniewski, J. (eds.) PARA 2004. LNCS, vol. 3732, pp. 21–32. Springer, Heidelberg (2006)
31. Herrero, J.R., Navarro, J.J.: Compiler-optimized kernels: An efficient alternative to hand-coded inner kernels. In: Gavrilova, M.L., Gervasi, O., Kumar, V., Tan, C.J.K., Taniar, D., Laganá, A., Mun, Y., Choo, H. (eds.) ICCSA 2006. LNCS, vol. 3984, pp. 762–771. Springer, Heidelberg (2006)
32. Gustavson, F.G.: Recursion leads to automatic variable blocking for dense linear-algebra algorithms. IBM J. Res. Dev. 41, 737–756 (1997)
33. Toledo, S.: Locality of reference in LU decomposition with partial pivoting. SIAM J. Matrix Anal. Appl. 18, 1065–1081 (1997)
34. Andersen, B.S., Gustavson, F.G., Karaivanov, A., Marinova, M., Waśniewski, J., Yalamov, P.Y.: LAWRA: Linear algebra with recursive algorithms. In: Sørevik, T., Manne, F., Moe, R., Gebremedhin, A.H. (eds.) PARA 2000. LNCS, vol. 1947, pp. 38–51. Springer, Heidelberg (2001)
35. Park, N., Hong, B., Prasanna, V.K.: Tiling, block data layout, and memory hierarchy performance. IEEE Trans. Parallel and Distrib. Systems 14, 640–654 (2003)
36. Gunnels, J., Gustavson, F., Pingali, K., Yotov, K.: Is cache-oblivious DGEMM viable? In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 919–928. Springer, Heidelberg (2007)
37. Herrero, J.R., Navarro, J.J.: Using non-canonical array layouts in dense matrix operations. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 580–588. Springer, Heidelberg (2007)
38. Michie, D.: Memo functions and machine learning. Nature 218, 19–22 (1968)
39. Gustavson, F.G.: Algorithm Compiler Architecture Interaction Relative to Dense Linear Algebra. Technical Report RC23715 (W0509-039), IBM, T.J. Watson (2005)
The Implementation of BLAS for Band Matrices

Alfredo Remón, Enrique S. Quintana-Ortí, and Gregorio Quintana-Ortí

Depto. de Ingeniería y Ciencia de Computadores, Universidad Jaume I, 12.071–Castellón, Spain
{remon,quintana,gquintan}@icc.uji.es
Abstract. In this paper we evaluate the performance of several implementations of the routines in BLAS involving band matrices. The results on two different platforms show that not enough attention has been paid to the efficient implementation of these operations. We also present implementations for two level 3 BLAS-like operations that are not included in the current specification of BLAS: the product of a band matrix times a general matrix and the product of two band matrices.

Keywords: Linear algebra, BLAS, band linear systems, high performance.
1 Introduction
Linear algebra operations involving band matrices appear, among others, in static and dynamic analyses and linear equations in structural engineering, finite element analysis in structural mechanics, and domain decomposition methods for partial differential equations in civil engineering [2,3,4,5]. Exploiting the band structure of the matrix in these problems yields huge savings in the number of computations and storage. This is partially recognized in the BLAS specification, which currently includes 5 routines that operate on banded matrices (band BLAS): gbmv, hbmv, sbmv and tbmv for the matrix-vector product with general, Hermitian, symmetric and triangular band matrices, respectively; and tbsv for the solution of linear systems with triangular band coefficient matrix. These all correspond to operations from the BLAS-2. Our experience with the solution of model reduction problems involving band state-space matrices has shown the situation to be unsatisfactory in two aspects. First, current implementations of the band BLAS-2 do not fully optimize the routines that operate on band matrices. Second, the functionality of the band BLAS falls short by not including routines for BLAS-3 operations like, e.g., the matrix-matrix product when one of the matrices is banded, or the solution of triangular band linear systems with multiple right-hand sides. The only option in these two cases is to perform multiple invocations of the corresponding band BLAS-2. Thus, the inefficiency carries over to the band BLAS-3.
This research was supported by the CICYT project TIN2005-09037-C02-02 and FEDER, and the DAAD programme Acciones Integradas HA2005-0081.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 668–677, 2008. © Springer-Verlag Berlin Heidelberg 2008
In this paper we present an evaluation of several implementations of the band BLAS-2 (including some of our own), as well as a proposal of two BLAS-3 operations on band matrices. The experiments on Intel Pentium Xeon and Itanium2 processors show that implementations of BLAS that are well known for their performance on BLAS-3 operations, e.g., MKL and Goto BLAS, are not so thoroughly optimized for all band BLAS operations and bandwidths.

The paper is structured as follows. In Section 2 we choose the symmetric band matrix-vector product to describe how to obtain different implementations of the routines in the band BLAS-2. In Section 3 we develop a portable extension of band BLAS-3 using the symmetric band matrix-matrix product (sbmm) as an example. In Section 4 we report an experimental evaluation of the codes. Finally, Section 5 offers a few concluding remarks.
2 The Implementation of sbmv
Given a symmetric band matrix $A \in \mathbb{R}^{n \times n}$, with bandwidth $k_d$, and a pair of vectors $x, y \in \mathbb{R}^n$, the matrix-vector product operation is defined as $y := \alpha A x + \beta y$, where $\alpha$ and $\beta$ are scalars. For simplicity, we consider hereafter that $\alpha = \beta = 1$. There are several ways to orchestrate the operations in the matrix-vector product, resulting in different implementations. In the following, it is important to realize that the routines in the band BLAS (as well as the band LAPACK) employ the packed symmetric format to store symmetric band matrices.

2.1 Reference Implementation: sbmv_ref
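The reference implementation of BLAS (http://www.netlib.org/blas) proceeds as follows. Consider the partitionings

$$
A \rightarrow \left(\begin{array}{c|c} A_{TL} & \star \\ \hline A_{BL} & A_{BR} \end{array}\right), \quad
x \rightarrow \left(\begin{array}{c} x_T \\ \hline x_B \end{array}\right), \quad
y \rightarrow \left(\begin{array}{c} y_T \\ \hline y_B \end{array}\right), \qquad (1)
$$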
where $A_{TL}$ is a square block with the same number of rows/columns as elements in $x_T$ and $y_T$. (The "$\star$" symbol in $A$ denotes the symmetric quadrant (block) of the matrix and will not be referenced.) At a given iteration of routine sbmv_ref, the elements in $A_{TL}$, $A_{BL}$, and $x_T$ have already participated in the update of the corresponding entries of $y$, while the elements in $A_{BR}$ and $x_B$ are yet to be used. Thus, the thick lines separate those parts that have participated in the update ($A_{TL}$, $A_{BL}$, and $x_T$) from those that still have to do so. We next resort to the repartitionings
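$$
\left(\begin{array}{c|c} A_{TL} & \star \\ \hline A_{BL} & A_{BR} \end{array}\right) \rightarrow
\begin{pmatrix} A_{00} & & & \\ a_{10}^T & \alpha_{11} & & \\ A_{21} & a_{21} & A_{22} & \\ & & A_{32} & A_{33} \end{pmatrix}, \quad
\left(\begin{array}{c} x_T \\ \hline x_B \end{array}\right) \rightarrow \begin{pmatrix} x_0 \\ \chi_1 \\ x_2 \end{pmatrix}, \quad
\left(\begin{array}{c} y_T \\ \hline y_B \end{array}\right) \rightarrow \begin{pmatrix} y_0 \\ \psi_1 \\ y_2 \end{pmatrix}, \qquad (2)
$$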
where $\alpha_{11}$, $\psi_1$, $\chi_1$ are scalars. To proceed with the computation, during the next iteration of sbmv_ref the following operations are carried out:

$$\psi_1 := \psi_1 + \alpha_{11} \chi_1 + a_{21}^T x_2, \qquad (3)$$
$$y_2 := y_2 + a_{21} \chi_1. \qquad (4)$$
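Updates (3) and (4) can be sketched in Python as a single loop per column. This is an illustration rather than the netlib code: the function name sbmv_ref and the list-of-columns layout, in which ab[j][d] holds the entry of A in row j+d and column j of the lower band (d = 0, ..., kd), are assumptions chosen to mirror column-by-column lower band storage, and alpha = beta = 1 as in the text.

```python
def sbmv_ref(n, kd, ab, x, y):
    """y := y + A x for a symmetric band matrix A in lower band storage.

    ab[j][d] holds A[j + d][j] for d = 0..kd; entries that would fall
    below row n-1 are never read.
    """
    for j in range(n):
        chi1 = x[j]
        col = ab[j]                         # column j of the band: alpha11, a21
        acc = col[0] * chi1                 # alpha11 * chi1
        for d in range(1, min(kd, n - 1 - j) + 1):
            i = j + d
            y[i] += col[d] * chi1           # (4): y2 := y2 + a21 * chi1
            acc += col[d] * x[i]            # (3): accumulate a21^T x2
        y[j] += acc                         # (3): psi1 := psi1 + ...
```

Each stored entry of the band is read once and used twice, once for the row update and once for the column update, which is how the symmetric part that is not stored is accounted for.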
Then, in preparation for the next iteration, $\alpha_{11}$, $a_{21}$, and $\chi_1$ are aggregated to the parts of $A$ and $x$ that have already participated in the update of $y$. For further details on this notation see, e.g., [1,6]. The reference implementation performs the operations in (3) and (4) using a single loop. The physical storage of $A$ thus determines that the matrix is accessed by columns in a single pass. The actual performance of this implementation strongly depends on the ability of the compiler to tune the reference code, incorporating high-performance strategies such as loop unrolling, software pipelining, prefetching, etc.

2.2 High-Quality Implementations: sbmv_mkl and sbmv_goto
There exist highly tuned implementations of BLAS that target specific architectures. Well-known examples for the Intel architectures include the vendor Math Kernel Library, MKL (http://www.intel.com), and the Goto BLAS (http://www.tacc.utexas.edu/resources/software). While these implementations focus on the performance of the matrix-matrix product and, as a consequence, of all BLAS-3, they are also expected to include fair implementations of sbmv and other BLAS-1 and (band) BLAS-2 routines. The performance of sbmv in these high-quality implementations (hereafter, sbmv_mkl for MKL and sbmv_goto for Goto BLAS) depends on the skills of the developers, who manually transform the assembly codes to reuse data in registers, reduce cache misses, avoid pipeline stalls, etc. The same goals are pursued by compilers automatically, though at a higher level.

2.3 Implementation Based on BLAS-1: sbmv_b1
We can also perform the operations in (3) and (4) by calling kernels from the BLAS-1. Thus, the routine that results from this approach, sbmv_b1, computes the operations in $\psi_1 := \bar{\psi}_1 + a_{21}^T x_2 = (\psi_1 + \alpha_{11} \chi_1) + a_{21}^T x_2$ as a call to the BLAS-1 kernel dot, and the update $y_2 := y_2 + a_{21} \chi_1$ by invoking the BLAS-1 kernel axpy. As in the reference implementation, matrix $A$ is accessed by columns, but this time in a double pass for each column. The overhead of invoking kernels to perform these simple operations and the second pass over the matrix can be counterbalanced by the use of highly tuned implementations of the dot and axpy kernels.

2.4 Implementation Based on BLAS-2: sbmv_b2
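Alternatively, the code for sbmv can be reorganized to perform the updates in terms of kernels from the BLAS-2. In particular, consider now the partitionings

$$
A \rightarrow \begin{pmatrix} A_{TL} & & \\ A_{ML} & A_{MM} & \\ & A_{BM} & A_{BR} \end{pmatrix}
\rightarrow \begin{pmatrix} A_{00} & & & & \\ A_{10} & A_{11} & & & \\ A_{20} & A_{21} & A_{22} & & \\ & A_{31} & A_{32} & A_{33} & \\ & & A_{42} & A_{43} & A_{44} \end{pmatrix},
$$

$$
y \rightarrow \begin{pmatrix} y_T \\ y_M \\ y_B \end{pmatrix} \rightarrow \begin{pmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix}, \quad
x \rightarrow \begin{pmatrix} x_T \\ x_M \\ x_B \end{pmatrix} \rightarrow \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}, \qquad (5)
$$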
where $A_{11}, A_{33} \in \mathbb{R}^{n_b \times n_b}$, and $y_1$, $y_3$, $x_1$, $x_3$ all have $n_b$ elements, and all diagonal blocks of $A$ are square (and, therefore, symmetric). Here $n_b$ is the algorithmic block size which, for simplicity, we assume to be an exact multiple of $k_d$. We note here that, with the previous partitionings, $A_{31}$ is upper triangular. In this implementation, sbmv_b2, during the current iteration we perform the following updates:

$$
\begin{array}{lll}
y_1 := y_1 + A_{11} x_1 + A_{21}^T x_2 + A_{31}^T x_3, & & \text{(symv+gemv+trmv)} \\
y_2 := y_2 + A_{21} x_1, & & \text{(gemv)} \\
y_3 := y_3 + A_{31} x_1. & & \text{(trmv)}
\end{array} \qquad (6)
$$
Given the special structure of some of the blocks of the matrix, $A_{11} x_1$ involves a symmetric block and can be performed by calling BLAS-2 kernel symv, while $A_{31}^T x_3$ and $A_{31} x_1$ are both triangular matrix-vector products which can be carried out using BLAS-2 kernel trmv. The remaining two matrix-vector products, $A_{21}^T x_2$ and $A_{21} x_1$, present no special structure in $A_{21}$ and can be computed with BLAS-2 kernel gemv. After the updates in this iteration, the computation proceeds forward by $n_b$ columns and rows of $A$, and $n_b$ elements of $y$, $x$. With this algorithm, $A$ is traversed by blocks of columns, with each subdiagonal block being accessed twice. The performance will depend on the efficiency of the implementation of the BLAS-2 kernels that are used.
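The block updates in (6) can be sketched as follows. For brevity the sketch reads the lower band from a dense list-of-lists A instead of band storage, uses plain Python loops in place of the BLAS-2 kernels symv, gemv and trmv, and assumes 1 <= nb <= kd with blocks clipped at the matrix boundary; these choices and the function names are assumptions of this illustration, not the authors' code.

```python
def symv_lower(A, r0, r1, x, y):
    """y[r0:r1] += S x[r0:r1], S symmetric, reading only its lower triangle."""
    for i in range(r0, r1):
        y[i] += A[i][i] * x[i]
        for j in range(r0, i):
            y[i] += A[i][j] * x[j]
            y[j] += A[i][j] * x[i]


def gemv(A, r0, r1, c0, c1, x, y, trans=False, upper=False):
    """y += B x (or y += B^T x if trans) for the block B = A[r0:r1, c0:c1].

    With upper=True only the upper triangle of B is read (trmv-like case).
    """
    for i in range(r0, r1):
        start = c0 + (i - r0) if upper else c0
        for j in range(start, c1):
            if trans:
                y[j] += A[i][j] * x[i]
            else:
                y[i] += A[i][j] * x[j]


def sbmv_b2(A, kd, x, y, nb):
    """y := y + A x, A symmetric band with bandwidth kd, block size nb."""
    assert 1 <= nb <= kd
    n = len(A)
    for j0 in range(0, n, nb):
        j1 = min(j0 + nb, n)        # rows/columns of A11
        j2 = min(j0 + kd, n)        # rows of A21 end here
        j3 = min(j0 + kd + nb, n)   # rows of A31 end here
        symv_lower(A, j0, j1, x, y)                            # y1 += A11 x1
        gemv(A, j1, j2, j0, j1, x, y, trans=True)              # y1 += A21^T x2
        gemv(A, j2, j3, j0, j1, x, y, trans=True, upper=True)  # y1 += A31^T x3
        gemv(A, j1, j2, j0, j1, x, y)                          # y2 += A21 x1
        gemv(A, j2, j3, j0, j1, x, y, upper=True)              # y3 += A31 x1
```

As in the text, each subdiagonal block (A21 and A31) is traversed twice per iteration, once transposed and once not; only entries inside the lower band are ever read.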
3 Portable Implementation of sbmm
Consider now the generic formulation of the matrix-matrix product $C := \alpha A B + \beta C$, where $A \in \mathbb{R}^{n \times n}$ is a symmetric band matrix with bandwidth $k_d$, and $C, B \in \mathbb{R}^{n \times m}$. For simplicity, we assume that $\alpha = \beta = 1$. In order to compute the matrix-matrix product, we can employ a blocked generalization of the band BLAS-2 matrix-vector product, with the blocks of elements of $y$/$x$ replaced by blocks of $C$/$B$. In particular, consider the partitioning of $A$ in (5) together with block row partitionings of $C$ and $B$ as
$$
C \rightarrow \begin{pmatrix} C_T \\ C_M \\ C_B \end{pmatrix} \rightarrow \begin{pmatrix} C_0 \\ C_1 \\ C_2 \\ C_3 \\ C_4 \end{pmatrix}, \quad
B \rightarrow \begin{pmatrix} B_T \\ B_M \\ B_B \end{pmatrix} \rightarrow \begin{pmatrix} B_0 \\ B_1 \\ B_2 \\ B_3 \\ B_4 \end{pmatrix}, \qquad (7)
$$
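where $C_1, C_3, B_1, B_3 \in \mathbb{R}^{n_b \times m}$. During the current iteration, the algorithm performs the following matrix-matrix products:

$$
\begin{array}{lll}
C_1 := C_1 + A_{11} B_1 + A_{21}^T B_2 + A_{31}^T B_3, & & \text{(symm+gemm+trmm)} \\
C_2 := C_2 + A_{21} B_1, & & \text{(gemm)} \\
C_3 := C_3 + A_{31} B_1. & & \text{(trmm)}
\end{array} \qquad (8)
$$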
Each of the operations that compose the products is annotated to the right with the BLAS-3 kernel that is employed for its computation: symm, gemm, and trmm for the symmetric, general and triangular matrix-matrix product, respectively. We also note that in computing these operations, $C$ and $B$ can also be partitioned into blocks of columns, using a block size $m_b$. The algorithm resulting from the previous organization of the operations employs kernels from the BLAS-3. Therefore, we can expect its performance to be much higher than that of computing the matrix-matrix product as a sequence of matrix-vector products, one per column of $C$ and $B$. We note that the same ideas described for sbmm carry over to the product of a general/triangular/Hermitian band matrix times a dense matrix, as well as to the solution of a triangular band linear system with multiple right-hand sides.
4 Experimental Results
In this section we evaluate the performance of different implementations of several operations of the band BLAS-2 and BLAS-3 in terms of MFLOPS/s (millions of floating-point arithmetic operations per second). The results reported next were obtained for matrices of order n = 5,000 with bandwidths ranging up to 1,200. Provided the bandwidth is much smaller than n, the performance of the routines is only determined by the band size and, for the blocked routines, the block size $n_b$. In the evaluation of the blocked routines, for each bandwidth we also employed values of $n_b$ from 1 to 200 to determine the optimal block size, $n_b^{opt}$, but only those results corresponding to $n_b^{opt}$ are shown. All experiments were performed using double-precision floating-point arithmetic on two different platforms, xeon and itanium, representative of current desktop platforms; see Table 1.

4.1 Level 2 Band BLAS
For the symmetric band matrix-vector product, Figure 1 reports the performance of the reference implementation (sbmv_ref), the implementations in MKL and
Table 1. Architectures (top) and software (bottom) employed in the evaluation

Platform  Architecture        Frequency (GHz)  L2 cache (KBytes)  L3 cache (MBytes)  RAM (GBytes)
xeon      Intel Pentium Xeon  2.4              512                –                  1
itanium   Intel Itanium2      1.5              256                4096               4

Platform  Compiler   Optimization flags  BLAS              Operating System
xeon      g77 3.5.0  -O3                 Goto BLAS 0.96st  Linux 2.6.13
xeon      icc 9.0    -O3                 MKL 8.1st         Linux 2.4.21
itanium   icc 9.0    -O3                 Goto BLAS 1.00    Linux 2.4.21
itanium   icc 9.0    -O3                 MKL 8.0           Linux 2.4.21
[Figure: two panels, "Performance vs bandwidth (n=5000) on Intel Xeon" and "Performance vs bandwidth (n=5000) on Intel Itanium2"; x-axis: Bandwidth; y-axis: MFLOPS/s; curves: SBMV_REF+ifc, SBMV_MKL, SBMV_GOTO, SBMV_B1_MKL, SBMV_B1_GOTO]
Fig. 1. Performance of the different implementations of sbmv
Goto BLAS (sbmv_mkl and sbmv_goto, respectively) as well as an implementation that performs the operations in (3) by calling the BLAS-1 kernel for the inner product in MKL or Goto BLAS, sbmv_b1_mkl or sbmv_b1_goto, respectively. In the latter implementation, the operations corresponding to (4) are
[Figure: two panels, "Performance vs bandwidth (n=5000) on Intel Xeon" and "Performance vs bandwidth (n=5000) on Intel Itanium2"; x-axis: Bandwidth; y-axis: MFLOPS/s; curves: TBSV_REF+ifc, TBSV_MKL, TBSV_GOTO, TBSV_B2_MKL, TBSV_B1_GOTO]
Fig. 2. Performance of the different implementations of tbsv
performed elementwise via an explicit loop, as is done in the reference implementation. For this operation, the implementation based on BLAS-2 in general attained lower performance than those reported in the figure.

For the triangular band linear system solve, Figure 2 reports the performance of the implementations in the reference BLAS, MKL, and Goto BLAS (tbsv_ref, tbsv_mkl, and tbsv_goto, respectively). Our two implementations in this case replace parts of the code in the reference BLAS by calls to the BLAS-1 kernels dot and axpy, and the BLAS-2 kernel trsv. The latter codes are then linked with the implementations of BLAS in MKL and Goto BLAS, resulting in routines tbsv_b1_mkl, tbsv_b2_mkl, tbsv_b1_goto, and tbsv_b2_goto; among these four, the figure only shows results for the combination which in general delivered the highest performance on each platform.

From these experiments we can conclude that the performance of the band BLAS-2 is highly dependent on the bandwidth and that, given an operation, matrix size, and width of the band, there is high variability in which implementation is optimal. Similar results have been obtained or are to be expected from other band BLAS-2 routines such as tbmv, gbmv, and hbmv, or in those operations where the band matrix appears transposed.
4.2 Level 3 Band BLAS
Figure 3 illustrates the performance of our level 3 BLAS-based implementation of the product C := AB + C of a symmetric band matrix A and a general matrix B with m = 4 and m = 10 columns. For comparison, we also include the performance of sbmv_mkl when this routine is used m times to compute the result of the product. The results show that the new routine clearly outperforms the only current possibility in BLAS (that of performing a symmetric band matrix-vector product multiple times).

Our last experiment, in Figure 4, shows the performance of two routines for the matrix product C := AB + C where both A and B are general band matrices with the same upper and lower bandwidths. One of these routines is based on the BLAS-2 kernels (bbmm_b2_mkl and bbmm_b2_goto, depending on whether MKL or Goto BLAS was linked) while the other one casts the bulk of the computations in terms of BLAS-3 kernels (bbmm_b3_mkl and bbmm_b3_goto, depending on the underlying BLAS). For each architecture and problem size we also report the optimal block size in the same figure. The results reflect clear benefits from using the BLAS-3 kernels.
[Figure: four panels, "Performance vs bandwidth (n=5000) on Intel Xeon", "Optimal blocksize vs bandwidth (n=5000) on Intel Xeon", "Performance vs bandwidth (n=5000) on Intel Itanium2" and "Optimal blocksize vs bandwidth (n=5000) on Intel Itanium2"; x-axis: Bandwidth; y-axes: MFLOPS/s and Optimal Blocksize; curves: SBMV_MKL x m, SBMM_B3_MKL(m=4), SBMM_B3_GOTO(m=4), SBMM_B3_MKL(m=10), SBMM_B3_GOTO(m=10)]
Fig. 3. Performance of the different implementations of sbmm and optimal block size
[Figure: two panels, "Performance vs bandwidth (n=5000) on Intel Xeon" and "Performance vs bandwidth (n=5000) on Intel Itanium2"; x-axis: Bandwidth; y-axis: MFLOPS/s; curves: BBMM_B2_GOTO, BBMM_B3_GOTO, BBMM_B2_MKL, BBMM_B3_MKL]
Fig. 4. Performance of the different implementations of bbmm
Similar results have been obtained or are to be expected from other band BLAS-3 routines, such as the matrix product with a general band matrix or the triangular band linear system solve with multiple right-hand sides.
5 Conclusions
We have evaluated several implementations of a significant part of the operations in the (level 2) BLAS that involve band matrices, including the reference BLAS, MKL, Goto BLAS, and our own proposals. The results show that no implementation is globally optimized for all bandwidths and matrix dimensions.

We have also developed our own implementations of two level 3 band BLAS-like routines for the product of a symmetric band matrix times a general matrix, and the product of two general band matrices. This functionality is not available in the current specification of BLAS, though we believe it can be of general interest. The first matrix product is necessary, e.g., in the solution of sparse Lyapunov equations via the LR-ADI iteration, with applications in model reduction of large-scale dynamical systems. The second class of matrix product appears in the solution of symmetric eigenvalue problems via invariant subspace decomposition algorithms.
Parallel Solution of Band Linear Systems in Model Reduction

Alfredo Remón, Enrique S. Quintana-Ortí, and Gregorio Quintana-Ortí

Depto. de Ingeniería y Ciencia de Computadores, Universidad Jaume I, 12.071–Castellón, Spain
{remon,quintana,gquintan}@icc.uji.es
Abstract. In this paper we present two parallel routines, targeted at SMP architectures, for the LU factorization of band matrices arising in model reduction problems. The special properties of these problems often allow the elimination of pivoting during the factorization, which results in a higher efficiency of the parallel routines. Also, the routines aggregate operations during the iteration, exposing a coarser-grain parallelism than their LAPACK counterpart. Experimental results on two different parallel platforms show the benefits of the new approach.

Keywords: Model reduction, band linear systems, LU factorization, multithreaded BLAS, symmetric multiprocessors (SMP).
1 Introduction
Consider the dynamical linear system in generalized state-space form

\[
\begin{aligned}
E\dot{x}(t) &= A x(t) + B u(t), & t &> 0,\\
y(t) &= C x(t) + D u(t), & t &\ge 0,
\end{aligned}
\tag{1}
\]

and the associated transfer function matrix (TFM) $G(s) = C(sE - A)^{-1}B + D$. Here, $A, E \in \mathbb{R}^{n\times n}$, $B \in \mathbb{R}^{n\times m}$, $C \in \mathbb{R}^{p\times n}$, $D \in \mathbb{R}^{p\times m}$, $x(0) = x_0 \in \mathbb{R}^{n}$ is the initial state, and $n$ is the order of the system. In model order reduction (MOR) we are interested in finding an alternative system

\[
\begin{aligned}
\hat{E}\dot{\hat{x}}(t) &= \hat{A}\hat{x}(t) + \hat{B}u(t), & t &> 0,\\
\hat{y}(t) &= \hat{C}\hat{x}(t) + \hat{D}u(t), & t &\ge 0,
\end{aligned}
\tag{2}
\]

of order $r$, with $r \ll n$, and TFM $\hat{G}(s) = \hat{C}(s\hat{E} - \hat{A})^{-1}\hat{B} + \hat{D}$ which "approximates" $G(s)$ [2].

(This research was supported by the CICYT project TIN2005-09037-C02-02 and FEDER, and the DAAD programme Acciones Integradas HA2005-0081.)
(R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 678–687, 2008. © Springer-Verlag Berlin Heidelberg 2008)

Model reduction of large-scale systems of the form (1) is needed, e.g., in control of multibody (mechanical) systems, manipulation of fluid flow, circuit simulation, VLSI chip design, and weather forecast [3,2]. In these cases, MOR is frequently
used to replace the system model by one of much smaller order that enables simulation in an adequate time.

Here we will only consider MOR methods based on system balancing, because of their appealing properties [2]. However, the computational cost of these methods tends to be quite high, and their application benefits from the use of high performance techniques, in particular, parallel computing.

There exist various methods for MOR which aim at balancing the system [2]. The core computation in many of these is the solution of two Lyapunov equations [5]. The iterative LR-ADI method for the solution of this class of equations [9] requires, at a given iteration $j$, the solution of a linear system of the form

\[
U_{j+1} := \hat{A}_j^{-1} U_j = (A + \gamma_j E)^{-1} U_j, \tag{3}
\]

where $A$ and $E$ are the matrices of the system (1), $\{\gamma_j\}_{j=0}^{\infty}$ are scalars with periodicity $t_s$ (i.e., $\gamma_j = \gamma_{j+t_s}$) and negative real part, and the matrices $\{U_j\}_{j=0}^{\infty}$ all have a small number of columns. Thus, the linear systems in iterations $j$ and $j + t_s$ share the same coefficient matrix $\hat{A}_j$, and the use of direct solvers is highly recommended. For further details on the LR-ADI iteration and the procedure to compute the shifts, see [9,6].

The coefficient matrices of the linear systems (3) arising in MOR problems often present a band structure (or can be transformed to that form), allowing the use of the band solvers in LAPACK [1]. In this paper we describe how specialized band solvers can be designed to efficiently exploit the properties and structure of these matrices. In particular, we describe two parallel algorithms for the LU factorization without pivoting that considerably outperform the LAPACK routine xgbtrf, while yielding the same level of numerical accuracy for the linear systems arising in model reduction. Similar algorithms have been reported for the LU factorization with pivoting in [10] and the Cholesky factorization in [11] though, in both cases, the performance improvements are modest.
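The reuse of factorizations implied by the periodicity of the shifts in (3) can be sketched as follows. This is only an illustrative sketch under simplifying assumptions: the matrices are small dense lists of lists, the shifts are real, $U_j$ is a single vector, and the helper names (`lu_nopiv`, `lu_solve`, `shifted_solves`) are hypothetical and not taken from the paper or from LAPACK.

```python
def lu_nopiv(a):
    """LU factorization without pivoting; L (unit diagonal) and U share one matrix."""
    n = len(a)
    lu = [row[:] for row in a]
    for k in range(n):
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]              # multiplier of the Gauss transformation
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu


def lu_solve(lu, rhs):
    """Solve (L U) x = rhs by forward and backward substitution."""
    n = len(lu)
    x = rhs[:]
    for i in range(n):                        # L y = rhs (unit diagonal)
        for k in range(i):
            x[i] -= lu[i][k] * x[k]
    for i in reversed(range(n)):              # U x = y
        for k in range(i + 1, n):
            x[i] -= lu[i][k] * x[k]
        x[i] /= lu[i][i]
    return x


def shifted_solves(A, E, shifts, u0, iterations):
    """Apply U_{j+1} := (A + gamma_j E)^{-1} U_j, factorizing each distinct shift once."""
    n = len(A)
    ts = len(shifts)                          # period of the shift sequence
    factors = {}                              # cache: shift index -> LU factors
    u = u0[:]
    for j in range(iterations):
        idx = j % ts
        if idx not in factors:
            g = shifts[idx]
            shifted = [[A[i][k] + g * E[i][k] for k in range(n)] for i in range(n)]
            factors[idx] = lu_nopiv(shifted)
        u = lu_solve(factors[idx], u)
    return u, len(factors)
```

With a period of $t_s$ shifts, only $t_s$ factorizations are computed regardless of the number of iterations; every later iteration pays only for the triangular solves.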
Combined with a multithreaded implementation of BLAS, the codes allow the parallel solution of large-scale linear systems on current multicore and SMP architectures in reasonable time. Experimental results on Intel® Xeon™ and Itanium2™ processors provide evidence in support.

The paper is structured as follows. The blocked factorization routine xgbtrf in LAPACK is reviewed in Section 2. Our new routines are then presented in Section 3. The experiments on two SMP architectures in Section 4 show the benefits of this approach. Finally, some concluding remarks are given in Section 5.
2 LAPACK Blocked Routine for the Band LU Factorization
Given a matrix $A \in \mathbb{R}^{n\times n}$, with lower and upper bandwidth $k_l$ and $k_u$, respectively, routine xgbtrf computes the LINPACK-style LU factorization with partial pivoting

\[
L_{n-2}^{-1} \cdot P_{n-2} \cdots L_{1}^{-1} \cdot P_{1} \cdot L_{0}^{-1} \cdot P_{0} \cdot A = U \tag{4}
\]
Left (band matrix):

  α00 α01
  α10 α11 α12
  α20 α21 α22 α23
      α31 α32 α33 α34
          α42 α43 α44 α45
              α53 α54 α55

Center (packed storage):

   *   *   *   0   0   0
   *   *   0   0   0   0
   *  α01 α12 α23 α34 α45
  α00 α11 α22 α33 α44 α55
  α10 α21 α32 α43 α54  *
  α20 α31 α42 α53  *   *

Right (result of the factorization):

   *   *   *  υ03 υ14 υ25
   *   *  υ02 υ13 υ24 υ35
   *  υ01 υ12 υ23 υ34 υ45
  υ00 υ11 υ22 υ33 υ44 υ55
  μ10 μ21 μ32 μ43 μ54  *
  μ20 μ31 μ42 μ53  *   *

Fig. 1. 6 × 6 band matrix with lower and upper bandwidths kl = 2 and ku = 1, respectively (left); packed storage scheme used in LAPACK (center); result of the LU factorization, where υi,j and μi,j stand, respectively, for the entries of the upper triangular factor U and the multipliers of the Gauss transformations (right)
where $P_0, P_1, \ldots, P_{n-2} \in \mathbb{R}^{n\times n}$ are permutation matrices, $L_0, L_1, \ldots, L_{n-2} \in \mathbb{R}^{n\times n}$ are Gauss transformations, and $U \in \mathbb{R}^{n\times n}$ is upper triangular with upper bandwidth $k_l + k_u$. In order to reduce the storage needs, the lower triangular factor of the (traditional) LAPACK-style LU factorization with partial pivoting

\[
PA = LU \tag{5}
\]
is not explicitly constructed. Instead, the multipliers corresponding to the Gauss transformations are stored overwriting the subdiagonal entries of $A$. This corresponds to the permutation matrices not being applied to the Gauss transformations, as in (4). Figure 1 illustrates the packed storage scheme used for band matrices in LAPACK and how this scheme accommodates the result of the LU factorization with pivoting. In particular, $A$ is stored with $k_l$ additional superdiagonals initially set to zero, to accommodate fill-in due to pivoting.

In order to describe routine xgbtrf, for simplicity, we will assume that the algorithmic block size, $n_b$, is an exact multiple of both $k_l$ and $k_u$. Consider now the partitionings

\[
A =
\begin{pmatrix}
A_{TL} & A_{TM} & \\
A_{ML} & A_{MM} & A_{MR} \\
 & A_{BM} & A_{BR}
\end{pmatrix}
\rightarrow
\begin{pmatrix}
A_{00} & A_{01} & A_{02} & & \\
A_{10} & A_{11} & A_{12} & A_{13} & \\
A_{20} & A_{21} & A_{22} & A_{23} & A_{24} \\
 & A_{31} & A_{32} & A_{33} & A_{34} \\
 & & A_{42} & A_{43} & A_{44}
\end{pmatrix},
\tag{6}
\]

where $A_{TL}, A_{00} \in \mathbb{R}^{k\times k}$, $A_{11}, A_{33} \in \mathbb{R}^{n_b\times n_b}$, and $A_{22} \in \mathbb{R}^{l\times u}$, with $l = k_l - n_b$ and $u = \bar{k}_u - n_b = (k_u + k_l) - n_b$. With this partitioning, $A_{02}$, $A_{13}$, and $A_{24}$ are lower triangular. Routine xgbtrf corresponds to what is usually known as a right-looking algorithm; that is, an algorithm where, at a certain iteration, $A_{TL}$ has been already factorized, $A_{ML}$ and $A_{TM}$ have been overwritten, respectively, by the multipliers and the corresponding block of $U$, and $A_{MM}$ has been updated correspondingly. In order to move forward in the computation by $n_b$ rows/columns, the following operations are performed during the current iteration of the routine
(the annotations to the right of some of these operations correspond to the name of the BLAS routine that is used):

1. Obtain $W_{31} := \mathrm{triu}(A_{31})$, a copy of the upper triangular part of $A_{31}$; compute the LU factorization with partial pivoting

\[
P_1 \begin{pmatrix} A_{11} \\ A_{21} \\ W_{31} \end{pmatrix}
= \begin{pmatrix} L_{11} \\ L_{21} \\ L_{31} \end{pmatrix} U_{11}. \tag{7}
\]

   The blocks of $L$ and $U$ overwrite the corresponding blocks of $A$ and $W_{31}$. (In the actual implementation, the copy $W_{31}$ is obtained as this factorization is being computed.)

2. Apply the permutations in $P_1$ to the remaining columns of the matrix:

\[
\begin{pmatrix} A_{12} \\ A_{22} \\ A_{32} \end{pmatrix} := P_1 \begin{pmatrix} A_{12} \\ A_{22} \\ A_{32} \end{pmatrix} \qquad \text{(xlaswp)} \tag{8}
\]

   and

\[
\begin{pmatrix} A_{13} \\ A_{23} \\ A_{33} \end{pmatrix} := P_1 \begin{pmatrix} A_{13} \\ A_{23} \\ A_{33} \end{pmatrix}. \tag{9}
\]

   A careful application of permutations is needed in the second expression as only the lower triangular structure of $A_{13}$ is physically stored. As a result of the application of permutations, $A_{13}$, which initially equals zero, may become lower triangular. No fill-in occurs in the strictly upper part of this block.

3. Compute the updates:

\[
\begin{aligned}
A_{12}\,(= U_{12}) &:= L_{11}^{-1} A_{12}, & \text{(xtrsm)} \qquad & (10)\\
A_{22} &:= A_{22} - L_{21} U_{12}, & \text{(xgemm)} \qquad & (11)\\
A_{32} &:= A_{32} - L_{31} U_{12}. & \text{(xgemm)} \qquad & (12)
\end{aligned}
\]

4. Obtain a copy of the lower triangular part of $A_{13}$, $W_{13} := \mathrm{tril}(A_{13})$; compute the updates

\[
\begin{aligned}
W_{13}\,(= U_{13}) &:= L_{11}^{-1} W_{13}, & \text{(xtrsm)} \qquad & (13)\\
A_{23} &:= A_{23} - L_{21} W_{13}, & \text{(xgemm)} \qquad & (14)\\
A_{33} &:= A_{33} - L_{31} W_{13}; & \text{(xgemm)} \qquad & (15)
\end{aligned}
\]

   and copy back $A_{13} := \mathrm{tril}(W_{13})$.

5. Undo the permutations on $\left(L_{11}^T, L_{21}^T, W_{31}^T\right)^T$ so that these blocks store the multipliers used in the LU factorization in (9) and $W_{31}$ is upper triangular; copy back $A_{31} := \mathrm{triu}(W_{31})$.
In our notation, after these operations are carried out, the part that has been already factorized grows by $n_b$ rows/columns so that

\[
A =
\begin{pmatrix}
A_{TL} & A_{TM} & \\
A_{ML} & A_{MM} & A_{MR} \\
 & A_{BM} & A_{BR}
\end{pmatrix}
\leftarrow
\begin{pmatrix}
A_{00} & A_{01} & A_{02} & & \\
A_{10} & A_{11} & A_{12} & A_{13} & \\
A_{20} & A_{21} & A_{22} & A_{23} & A_{24} \\
 & A_{31} & A_{32} & A_{33} & A_{34} \\
 & & A_{42} & A_{43} & A_{44}
\end{pmatrix}
\tag{16}
\]

in preparation for the next iteration.

From this particular implementation we observe that, in Step 1, the LAPACK-style LU factorization of $\left(A_{11}^T, A_{21}^T, W_{31}^T\right)^T$ is computed. This allows the use of BLAS-3 in the application of the triangular factors to the blocks to the right. At the end, in Step 5, the permutations are undone on $\left(L_{11}^T, L_{21}^T, L_{31}^T\right)^T$ so that these blocks store the multipliers from the LINPACK-style LU factorization. In this way, $L_{31}$ recovers the upper triangular form, and no additional space is needed to store it when it overwrites the corresponding block of $A$.
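As a recap of the packed storage scheme used in this section (Fig. 1, center), the following sketch copies a dense band matrix into the LAPACK band layout expected by xgbtrf, in which entry $(i, j)$ is stored in row $k_l + k_u + i - j$ of column $j$ (0-based), and the first $k_l$ rows are reserved for fill-in. The function name `to_band_storage` is hypothetical, and the unused positions (the `*` entries of Fig. 1) are simply set to zero here.

```python
def to_band_storage(a, kl, ku):
    """Pack a dense n x n band matrix into LAPACK band storage for xgbtrf.

    Row (kl + ku + i - j) of column j holds a[i][j] (0-based indices).
    The result has 2*kl + ku + 1 rows; the first kl rows stay zero and
    provide room for the fill-in caused by pivoting.
    """
    n = len(a)
    ldab = 2 * kl + ku + 1
    ab = [[0.0] * n for _ in range(ldab)]
    for j in range(n):
        # Only rows inside the band of column j are copied.
        for i in range(max(0, j - ku), min(n, j + kl + 1)):
            ab[kl + ku + i - j][j] = a[i][j]
    return ab
```

For the 6 × 6 example of Fig. 1 (kl = 2, ku = 1) this yields six rows: two zero rows for fill-in, one row for the superdiagonal, one for the main diagonal, and two for the subdiagonals.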
3 New Algorithms for the Band LU Factorization of Linear Systems Arising in Model Reduction
The exposition of the algorithm underlying routine xgbtrf in the previous section reveals two sources of inefficiency: pivoting and the fragmentation of operations. In this section we propose solutions to these two problems.

3.1 Pivoting
For some unsymmetric linear systems, pivoting can be eliminated without affecting the accuracy [4,8]. We next investigate whether this is the case for the linear systems arising in model reduction. By dropping pivoting, we not only avoid a process that is costly for the cache memories, but we also reduce the upper bandwidth of the triangular factor $U$ from $k_l + k_u$ to just $k_u$. Thus, the number of operations to obtain the factorization is diminished, and so is the cost of solving with this triangular factor.

The results in this subsection were obtained on an Intel Xeon processor using MATLAB® 7.0.4 (IEEE double-precision arithmetic). In the experiment we use three unsymmetric examples from the Oberwolfach model reduction collection¹:

¹ Visit http://www.imtek.uni-freiburg.de/simulation/benchmark

Butterfly gyroscope (gyro). The matrices in this example model a vibrating micro-mechanical gyro with application in inertial navigation [7]. The model corresponds to a second-order system of the form

\[
M\ddot{x}(t) + E\dot{x}(t) + Kx(t) = Bu(t), \qquad y(t) = Cx(t),
\]

where $M$, $K$, and $E$ are, respectively, the mass, stiffness, and damping matrices. This formulation is transformed into a first-order system at the expense of doubling the dimension of the system matrices and the loss of the symmetric structure.

Thermal flow model (chip and flow). These two examples appear in a design corresponding to a 3-D model of a chip cooled by forced convection that is used in the thermal simulation of heat exchange between a solid body (the chip) and a fluid flow.

In all these examples, we generate a random solution vector $x$ and construct the right-hand side vector $b$ as $b := \hat{A}_j x = (A + \gamma_j E)\,x$, using values for $\gamma_j$ from the shifts that are used in practice during the LR-ADI iteration [9]. Table 1 reports the relative errors $\|x - x^*\|_F / \|x\|_F$ for the approximate solutions $x^*$ computed via the LU factorization with and without pivoting. No significant difference can be observed between the solutions obtained with these two factorizations. Insignificant differences were also found for other practical values of the shifts.

Table 1. Relative errors for the solutions computed via the LU factorization with and without pivoting

Example   With pivoting           W/out pivoting
gyro      0.19170641595546e-11    0.18990151409212e-11
chip      0.26786234659609e-12    0.26786234659609e-12
flow      0.19170641595546e-11    0.18990151409212e-11

3.2 Fragmentation of Operations
To advance the computation by $n_b$ rows/columns, due to the storage pattern used for band matrices in LAPACK, routines xtrsm and xgemm are invoked repeatedly for the updates in Steps 3 and 4. As parallelism is extracted in the routine from calling multithreaded implementations of BLAS, the fragmentation of the computations performed in a single iteration of the algorithm into several small operations (and a large one) is potentially harmful for the efficiency of the codes. This is especially the case on SMP platforms, where the threshold between what is considered "small" and "large" is considerably high.

Now, as the algorithms presented next perform no pivoting, there is no fill-in, the upper bandwidth of the resulting upper triangular factor $U$ equals that of $A$, and in the partitionings (6) and (16) it is sufficient to consider $A_{22} \in \mathbb{R}^{l\times u}$, with $l = k_l - n_b$ and $u = k_u - n_b$.

We next describe how to merge the operations corresponding to Steps 2–4 so that higher performance is likely to be obtained on SMP architectures. Our first algorithm requires additional storage space in the data structure containing $A$ so that $n_b$ rows of zeros are present at the bottom of the matrix. By doing some extra copies and manipulation of the matrix blocks, the second algorithm does not require this workspace.

Routine xdbtrf+M. Consider that the data structure containing $A$ (see Fig. 1, center) is padded with $n_b$ rows at the bottom, with all the entries in this additional space initially set to zero. Then, Steps 1–4 in xdbtrf can be simplified as follows:

1. In the first step, the LU factorization (without pivoting)

\[
\begin{pmatrix} A_{11} \\ A_{21} \\ A_{31} \end{pmatrix}
= \begin{pmatrix} L_{11} \\ L_{21} \\ L_{31} \end{pmatrix} U_{11} \tag{17}
\]

   is computed and the blocks of $L$ and $U$ overwrite the corresponding blocks of $A$. There is no longer need for the workspace $W_{31}$ nor copies to/from it, as the additional rows at the bottom accommodate the elements in the strictly lower triangle of $L_{31}$.

2. Compute the updates:

\[
\begin{aligned}
(A_{12}, A_{13})\,(= (U_{12}, U_{13})) &:= L_{11}^{-1}\,(A_{12}, A_{13}) & \text{(xtrsm)}, \qquad & (18)\\
\begin{pmatrix} A_{22} & A_{23} \\ A_{32} & A_{33} \end{pmatrix} &:=
\begin{pmatrix} A_{22} & A_{23} \\ A_{32} & A_{33} \end{pmatrix}
- \begin{pmatrix} L_{21} \\ L_{31} \end{pmatrix} (U_{12}, U_{13}) & \text{(xgemm)}. \qquad & (19)
\end{aligned}
\]
   The lower triangular system in (18) returns a lower triangular block in $A_{13}$.

Routine xdbtrf+C. The previous approach, though efficient in the sense of grouping as many computations as possible in coarse-grain blocks, requires a non-negligible workspace. Therefore we propose the following algorithm, which also aims at clustering blocks but does not require storage for $n_b$ rows:

1. Obtain $W_{31} := \mathrm{stril}(A_{31})$, a copy of the physical space that would be occupied by the strictly lower triangular part of $A_{31}$, and set $\mathrm{stril}(A_{31}) := 0$; compute the LU factorization (without partial pivoting)

\[
\begin{pmatrix} A_{11} \\ A_{21} \\ A_{31} \end{pmatrix}
= \begin{pmatrix} L_{11} \\ L_{21} \\ L_{31} \end{pmatrix} U_{11}. \tag{20}
\]

   The blocks of $L$ and $U$ overwrite the corresponding blocks of $A$, but a copy of the elements in the physical storage overwritten by this factorization is kept in $W_{31}$.

2. Obtain $W_{13} := \mathrm{striu}(A_{13})$, a copy of the physical space that would be occupied by the strictly upper triangular part of $A_{13}$, and set $\mathrm{striu}(A_{13}) := 0$; compute the updates:

\[
\begin{aligned}
(A_{12}, A_{13})\,(= (U_{12}, U_{13})) &:= L_{11}^{-1}\,(A_{12}, A_{13}) & \text{(xtrsm)}, \qquad & (21)\\
\begin{pmatrix} A_{22} & A_{23} \\ A_{32} & A_{33} \end{pmatrix} &:=
\begin{pmatrix} A_{22} & A_{23} \\ A_{32} & A_{33} \end{pmatrix}
- \begin{pmatrix} L_{21} \\ L_{31} \end{pmatrix} (U_{12}, U_{13}) & \text{(xgemm)}. \qquad & (22)
\end{aligned}
\]

3. Restore $\mathrm{stril}(A_{31})$ and $\mathrm{striu}(A_{13})$ from $W_{31}$ and $W_{13}$, respectively.
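The idea of merging the updates into one triangular solve and one matrix product per iteration can be illustrated with the following sketch. It is not the authors' implementation: for readability it assumes that the band matrix is held in ordinary dense storage (a list of rows) rather than in packed form, so no workspace or padding is needed, and the plain loops marked "trsm" and "gemm" stand in for the single xtrsm and xgemm calls of (18) and (19). The function name `blocked_band_lu_nopiv` is hypothetical.

```python
def blocked_band_lu_nopiv(a, kl, ku, nb):
    """Right-looking blocked LU without pivoting of a band matrix (dense storage).

    On return, the strictly lower part of `a` holds the multipliers (L has a
    unit diagonal) and the upper part holds U. Without pivoting there is no
    fill-in, so only entries inside the original band are touched.
    """
    n = len(a)
    for k in range(0, n, nb):
        kb = min(nb, n - k)                  # width of the current panel
        r_end = min(n, k + kb + kl)          # rows reached by the panel multipliers
        c_end = min(n, k + kb + ku)          # columns reached by the panel rows of U
        # Step 1: unblocked LU of the column panel (A11; A21; A31).
        for p in range(k, k + kb):
            for i in range(p + 1, min(n, p + kl + 1)):
                a[i][p] /= a[p][p]
                for j in range(p + 1, k + kb):
                    a[i][j] -= a[i][p] * a[p][j]
        # Step 2a ("trsm"): (A12, A13) := L11^{-1} (A12, A13), one merged solve.
        for p in range(k, k + kb):
            for i in range(p + 1, k + kb):
                for j in range(k + kb, c_end):
                    a[i][j] -= a[i][p] * a[p][j]
        # Step 2b ("gemm"): trailing block -= (L21; L31) (U12, U13), one merged product.
        for i in range(k + kb, r_end):
            for j in range(k + kb, c_end):
                acc = 0.0
                for p in range(k, k + kb):
                    acc += a[i][p] * a[p][j]
                a[i][j] -= acc
    return a
```

Each iteration therefore issues one panel factorization, one triangular solve, and one matrix product, which is the coarser granularity that a multithreaded BLAS can exploit. The sketch is only meaningful for matrices that do not require pivoting, e.g., diagonally dominant ones.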
4 Experimental Results
In this section we report the performance of the routines for the LU factorization of band matrices of order $n = 6000$ with $k_l = k_u$. The matrices were generated as diagonally dominant so that pivoting during the factorization is unnecessary. All the experiments were performed using IEEE double-precision (real) arithmetic. In the evaluation we include the LAPACK routine, dgbtrf, and the two new routines: ddbtrf+M and ddbtrf+C. To separate the benefits contributed by the elimination of pivoting from those due to the aggregation of operations, we also include two more routines, dgbtrf+M and dgbtrf+C, which only address the fragmentation problem but still perform pivoting [10]. In the evaluation, for each bandwidth dimension and routine, we tried block sizes from 1 to 200 to determine the optimal block size, $n_b^{opt}$; only the results corresponding to $n_b^{opt}$ are shown in the following.

We report the performance of the routines on two different SMP architectures, with 2 and 4 processors; see Table 2. Two threads were employed on xeon and 4 on itanium. As the efficiency of the kernels in BLAS is crucial, for each platform we use the implementations listed in Table 3.

Figure 2 illustrates the performance of all 5 routines and the speed-ups that routines dgbtrf+M, dgbtrf+C, ddbtrf+M, and ddbtrf+C attain with respect to the LAPACK routine. The results show that a notable reduction of the execution time was obtained by routines ddbtrf+M and ddbtrf+C on xeon for matrices of moderate bandwidth: around 20% and up to 40% for $k_l = k_u \le 100$. The improvements of the new routines on itanium are even higher. As expected, the performance of routine ddbtrf+C is slightly lower than that of ddbtrf+M, and in general these routines outperform their pivoted counterparts dgbtrf+C and dgbtrf+M.

Table 2. SMP architectures employed in the evaluation

Platform  Architecture    #Proc.  Frequency (GHz)  L2 cache (KBytes)  L3 cache (MBytes)  RAM (GBytes)
xeon      Intel Xeon      2       2.4              512                –                  1
itanium   Intel Itanium2  4       1.5              256                4                  4

Table 3. Software employed in the evaluation

Platform  BLAS             Compiler   Optimization Flags  Operating System
xeon      GotoBLAS 0.96mt  gcc 3.3.5  -O3                 Linux 2.4.27
itanium   MKL 8.0          icc 9.0    -O3                 Linux 2.4.21

[Figure 2 plots, all titled "Performance vs bandwidth (m=n=6000)", x-axis: bandwidth, ku=kl. For each platform, one panel shows MFLOPS for DGBTRF, DDBTRF+M, DGBTRF+M, DDBTRF+C, and DGBTRF+C, and a second panel shows the speed-up of DDBTRF+M, DGBTRF+M, DDBTRF+C, and DGBTRF+C.]
Fig. 2. Performance and speed-up attained by the new routines on xeon (top) and itanium (bottom) platforms
5 Conclusions
We have presented two new routines for computing the LU factorization without pivoting of a band matrix that reduce the number of calls to BLAS per iteration, so that coarser-grain parallelism is exposed and higher performance can be obtained on SMP architectures. The routines can be employed in the solution of linear systems when no pivoting for stability is necessary as, e.g., in the LR-ADI iteration for the solution of Lyapunov equations arising in some model reduction problems, or in the solution of unsymmetric diagonally dominant linear systems. The new routines considerably outperform the LAPACK routine for the LU factorization with pivoting.
Acknowledgments

We thank Peter Benner for his scientific help during the preparation of this manuscript.
References

1. Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users' Guide, 3rd edn. SIAM, Philadelphia (1999)
2. Antoulas, A.C.: Lectures on the Approximation of Large-Scale Dynamical Systems. SIAM, Philadelphia (2005)
3. Cheng, C.-K., Lillis, J., Lin, S., Chang, N.H.: Interconnect Analysis and Synthesis. John Wiley & Sons, NY (2000)
4. Golub, G.H., Van Loan, C.: Unsymmetric positive definite linear systems. Linear Algebra Appl. 28, 85–97 (1979)
5. Lancaster, P., Rodman, L.: The Algebraic Riccati Equation. Oxford University Press, Oxford (1995)
6. Li, J.-R., White, J.: Low rank solution of Lyapunov equations. SIAM J. Matrix Anal. Appl. 24(1), 260–280 (2002)
7. Lienemann, J., Billger, D., Rudnyi, E.B., Greiner, A., Korvink, J.G.: MEMS compact modeling meets model order reduction: Examples of the application of Arnoldi methods to microsystem devices. In: Technical Proceedings of the 2004 Nanotechnology Conference and Trade Show, Nanotech 2004 (2004)
8. Mathias, R.: Matrices with positive definite Hermitian part: Inequalities and linear systems. SIAM J. Matrix Anal. Appl. 13(2), 640–654 (1992)
9. Penzl, T.: A cyclic low rank Smith method for large sparse Lyapunov equations. SIAM J. Sci. Comput. 21(4), 1401–1418 (2000)
10. Remón, A., Quintana-Ortí, E.S., Quintana-Ortí, G.: Parallel LU factorization of band matrices on SMP systems. In: Gerndt, M., Kranzlmüller, D. (eds.) HPCC 2006. LNCS, vol. 4208, pp. 110–118. Springer, Heidelberg (2006)
11. Remón, A., Quintana-Ortí, E.S., Quintana-Ortí, G.: Cholesky factorization of band matrices using multithreaded BLAS. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 608–616. Springer, Heidelberg (2007)
Evaluating Linear Recursive Filters Using Novel Data Formats for Dense Matrices

Przemyslaw Stpiczyński

Department of Computer Science, Maria Curie–Sklodowska University, Pl. M. Curie-Sklodowskiej 1, 20-031 Lublin, Poland
[email protected]
Abstract. The aim of this contribution is to show that the performance of a recently developed high performance algorithm for evaluating linear recursive filters can be increased by using the new generalized data structures for dense matrices introduced by F. G. Gustavson. The new implementation is based on vectorized algorithms for banded triangular Toeplitz matrix-vector multiplication and on the algorithm for solving linear recurrence systems with constant coefficients. The results of experiments performed on Intel Itanium 2 and Cray X1 are also presented and discussed.
1 Introduction: Linear Recursive Filters
Let us consider the following problem of evaluating linear recursive filters, which are very important in signal processing [14]. For a given sequence of real numbers $x_j$, $j = 1, \ldots, n$ (input signals), we have to evaluate an output sequence $y_j$ satisfying

\[
y_k = \sum_{j=1}^{m} b_j y_{k-j} + \sum_{j=0}^{m} a_j x_{k-j}, \tag{1}
\]

where $m$ is a small even integer (up to 6), $x_k = 0$, $y_k = 0$ for $k \le 0$, and the coefficients $a_j$, $b_j$ are calculated using z-transforms [14]. Because of the recursive nature of our problem, simple algorithms based on (1) achieve poor performance, since they do not fully utilize the underlying hardware, i.e., memory hierarchies, vector extensions, and multiple processors. The problem is important in science and engineering, so efficient high performance algorithms for solving it should be designed. Some earlier results for pipeline processors and SIMD architectures can be found in [12,13]. Various algorithms for computing linear recurrences have been designed for parallel and vector computers (see [1,8,10,11,15,19] for more references). However, these algorithms, like cyclic reduction, Wang's method, and recursive doubling, lead to a substantial increase in the number of floating-point operations, which makes them unattractive on classical serial systems or parallel computers with a limited number of processors [4]. On the other hand, it is well known that reducing the cost of memory access is crucial for achieving good performance of numerical software [5].

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 688–697, 2008. © Springer-Verlag Berlin Heidelberg 2008
In [2] new algorithms for solving first- and second-order linear recurrences by loop raking have been introduced. The main innovation of the algorithms is to store data in two-dimensional arrays. Following this observation, we have introduced a new fully vectorized version of Wang's method for solving linear recurrences with constant coefficients [18] using two-dimensional arrays. Numerical tests have shown that it is faster than the solvers for narrow banded lower triangular systems implemented in Cray SciLib. In [17] we have proposed a new high performance BLAS-based algorithm for evaluating linear recursive filters, which comprises two main stages obtained by splitting (1) into two separate formulas, namely

\[
f_k = \sum_{j=0}^{m} a_j x_{k-j} \tag{2}
\]

and the following linear recurrence system with constant coefficients

\[
y_k = f_k + \sum_{j=1}^{m} b_j y_{k-j}. \tag{3}
\]
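The splitting of (1) into (2) and (3) can be checked with a short sketch. It is a plain scalar reference, not the vectorized algorithm of [17]; the function names are hypothetical, indices are 0-based, `a` holds $a_0, \ldots, a_m$, `b` holds $b_1, \ldots, b_m$, and all values before the start of the signal are taken as zero.

```python
def filter_direct(x, a, b):
    """Evaluate (1) directly: y_k = sum_j b_j y_{k-j} + sum_j a_j x_{k-j}."""
    m = len(b)
    y = [0.0] * len(x)
    for k in range(len(x)):
        acc = 0.0
        for j in range(0, m + 1):          # non-recursive part, a_0 .. a_m
            if k - j >= 0:
                acc += a[j] * x[k - j]
        for j in range(1, m + 1):          # recursive part, b_1 .. b_m
            if k - j >= 0:
                acc += b[j - 1] * y[k - j]
        y[k] = acc
    return y


def fir_stage(x, a):
    """Stage (2): f_k = sum_{j=0}^{m} a_j x_{k-j}."""
    return [sum(a[j] * x[k - j] for j in range(len(a)) if k - j >= 0)
            for k in range(len(x))]


def recurrence_stage(f, b):
    """Stage (3): y_k = f_k + sum_{j=1}^{m} b_j y_{k-j}."""
    y = []
    for k in range(len(f)):
        y.append(f[k] + sum(b[j - 1] * y[k - j]
                            for j in range(1, len(b) + 1) if k - j >= 0))
    return y
```

Calling `recurrence_stage(fir_stage(x, a), b)` gives the same output as `filter_direct(x, a, b)` up to rounding; the first stage has no data dependences between different values of $k$, while the second stage carries the whole recursion.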
The main idea of this algorithm is to store the first $r \cdot s$, $s > m$, input signals in a two-dimensional array $X$ using standard Fortran column-major order instead of a one-dimensional vector. Thus the columns of $X = (\mathbf{x}_1, \ldots, \mathbf{x}_r) \in \mathbb{R}^{s\times r}$ satisfy

\[
\mathbf{x}_j = (x_{(j-1)s+1}, \ldots, x_{js})^T. \tag{4}
\]

Then equation (2) can be treated as a matrix-vector multiplication, while (3) is a linear recurrence system that can be evaluated using level 2 and level 3 BLAS routines (matrix-vector multiplication and matrix-matrix multiplication, respectively) [16]. Unfortunately, although the algorithm is machine independent, it achieves really high performance for large values of $m$ and rather poor performance for the most important cases $m \le 4$. This is caused by cache misses which occur during the matrix-vector multiplication and by the fact that higher level BLAS routines are not efficient for very small blocks (namely $2 \times 2$). Some improvements can be obtained when we store $X^T$ in column-major order. However, it is well known that standard Fortran/C arrays do not map nicely into L1 cache [6,7]. Thus the only way to increase the performance of the algorithm is to improve the data layout to utilize cache memory. Then, no conventional level 3 BLAS routines are required [6]. Following this observation, we have decided to implement a new algorithm for solving (1) using the fully vectorized, divide and conquer solver for linear recurrence systems with constant coefficients [18] and the new square blocked full column-major order [7]. Also, the level 2 BLAS-based algorithm for finding (2) can be rewritten in terms of vector operations.

The aim of this contribution is to show how to store input signals using the new square blocked full column-major order and to rewrite the vectorized algorithms using this data layout. The results of experiments performed on Intel Itanium 2 and Cray X1 will be presented and discussed.
2 Divide and Conquer Vector Algorithms
Now let us briefly describe the two-stage algorithm for evaluating (1), based on two vectorized algorithms for evaluating $f_k$ and $y_k$ defined by (2) and (3), respectively. For more details see [17] and [18]. For the sake of simplicity, let us assume that there exist two positive integers $r$ and $s$ such that $rs = n$. However, this assumption can be easily dropped: we can apply the algorithms for finding $f_1, \ldots, f_{rs}$ and $y_1, \ldots, y_{rs}$, respectively, and then use (2) and (3) to compute $f_{rs+1}, \ldots, f_n$ and $y_{rs+1}, \ldots, y_n$.

2.1 Narrow-Banded Triangular Toeplitz Matrix-Vector Multiplication
Let us observe that equation (2) can be rewritten in the following matrix-vector form

\[
\begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ \vdots \\ f_{n-1} \\ f_n \end{pmatrix}
=
\begin{pmatrix}
a_0 & & & & & \\
a_1 & \ddots & & & & \\
\vdots & \ddots & \ddots & & & \\
a_m & \cdots & a_1 & a_0 & & \\
 & \ddots & & \ddots & \ddots & \\
 & & a_m & \cdots & a_1 & a_0
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ \vdots \\ x_{n-1} \\ x_n \end{pmatrix}.
\tag{5}
\]

Let us define vectors $\mathbf{x}_j = (x_{(j-1)s+1}, \ldots, x_{js})^T$ and $\mathbf{f}_j = (f_{(j-1)s+1}, \ldots, f_{js})^T \in \mathbb{R}^{s}$, where $j = 1, \ldots, r$, and the following Toeplitz matrices

\[
L = \begin{pmatrix}
a_0 & & & & \\
\vdots & \ddots & & & \\
a_m & \cdots & a_0 & & \\
 & \ddots & & \ddots & \\
0 & & a_m & \cdots & a_0
\end{pmatrix},
\qquad
U = \begin{pmatrix}
a_m & \cdots & a_1 \\
 & \ddots & \vdots \\
 & & a_m \\
0 & &
\end{pmatrix}
\in \mathbb{R}^{s\times s},
\]

where the nonzero entries of $U$ occupy its upper right corner. It is clear that $\mathbf{f}_1 = L\mathbf{x}_1$ and $\mathbf{f}_j = L\mathbf{x}_j + U\mathbf{x}_{j-1}$ for $j = 2, \ldots, r$. When we define matrices $F = (\mathbf{f}_1, \ldots, \mathbf{f}_r)$, $X = (\mathbf{x}_1, \ldots, \mathbf{x}_r) \in \mathbb{R}^{s\times r}$, then we can find the vectors $\mathbf{f}_j$ as follows. First we perform the operation $F \leftarrow LX$ and next we update the first $m$ entries of each vector $\mathbf{f}_j$, $j = 2, \ldots, r$, using $U_{1:m,\,s-m+1:s}\, X_{s-m+1:s,\,j-1}$, where

\[
U_{1:m,\,s-m+1:s} = \begin{pmatrix}
a_m & a_{m-1} & \cdots & a_1 \\
0 & a_m & \cdots & a_2 \\
\vdots & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & a_m
\end{pmatrix} \in \mathbb{R}^{m\times m}.
\]
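The two steps above ($F \leftarrow LX$ followed by the corner update with $U_{1:m,\,s-m+1:s}$) can be sketched with row-wise, AXPY-like updates as follows. This is a minimal illustration, not the implementation of [17]: the function name `toeplitz_matvec_blocked` is hypothetical, indices are 0-based, `a` holds $a_0, \ldots, a_m$, and the $s \times r$ arrays are stored as lists of rows so that each innermost loop runs over all $r$ columns at once.

```python
def toeplitz_matvec_blocked(x, a, s):
    """Compute f_k = sum_{q=0}^{m} a_q x_{k-q} using the s x r array layout."""
    m = len(a) - 1
    n = len(x)
    r = n // s
    assert r * s == n and s > m
    # X[i][j] = x[j*s + i]: column j of X is the j-th chunk of s input signals.
    X = [[x[j * s + i] for j in range(r)] for i in range(s)]
    F = [[0.0] * r for _ in range(s)]
    # F <- L X: row i of F accumulates a_q times row i-q of X (AXPY-like update).
    for i in range(s):
        for q in range(min(m, i) + 1):
            for j in range(r):
                F[i][j] += a[q] * X[i - q][j]
    # Corner update: first m rows of columns 2..r use the last m rows of the
    # previous column; entry (i, c) of the m x m corner block is a_{m+i-c}, c >= i.
    for i in range(m):
        for c in range(i, m):
            coeff = a[m + i - c]
            for j in range(1, r):
                F[i][j] += coeff * X[s - m + c][j - 1]
    # Flatten back in column-major order to obtain f_1, ..., f_n.
    return [F[i][j] for j in range(r) for i in range(s)]
```

All innermost loops run over independent columns, which is what makes the scheme amenable to vectorization.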
Note that this algorithm can also be expressed in terms of AXPY-like vector updates instead of matrix multiplications, which corresponds to the observation that level 3 BLAS routines are not required to obtain level 3 BLAS performance when using novel data structures [6].

2.2 Vectorized Solver for Linear Recurrence Systems
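The basic idea of this stage of the main algorithm is to rewrite (3) in terms of linear systems exploiting BLAS operations. Instead of finding all numbers $y_1, \ldots, y_n$ sequentially, we define the matrix $Y = (\mathbf{y}_1, \ldots, \mathbf{y}_r) \in \mathbb{R}^{s\times r}$ and compute its entries "at once". Let us define the auxiliary matrices

\[
L_b = \begin{pmatrix}
1 & & & & & \\
-b_1 & \ddots & & & & \\
\vdots & \ddots & \ddots & & & \\
-b_m & \cdots & -b_1 & 1 & & \\
 & \ddots & & \ddots & \ddots & \\
 & & -b_m & \cdots & -b_1 & 1
\end{pmatrix} \in \mathbb{R}^{s\times s}
\]

and $U_b \in \mathbb{R}^{s\times s}$ of the following form

\[
U_b = \begin{pmatrix}
-b_m & \cdots & -b_1 \\
 & \ddots & \vdots \\
 & & -b_m \\
0 & &
\end{pmatrix}
= -\sum_{k=1}^{m} \sum_{l=k}^{m} b_{m+k-l}\, \mathbf{e}_k \mathbf{e}_{s-m+l}^T,
\]

where the nonzero entries occupy the upper right corner. Then the numbers $y_1, \ldots, y_{rs}$, organized as vectors $\mathbf{y}_j = (y_{(j-1)s+1}, \ldots, y_{js})^T$, $j = 1, \ldots, r$, satisfy

\[
\mathbf{y}_1 = L_b^{-1}\mathbf{f}_1, \qquad
\mathbf{y}_j = L_b^{-1}\mathbf{f}_j - L_b^{-1}U_b\,\mathbf{y}_{j-1} \quad \text{for } j = 2, \ldots, r.
\]

This yields

\[
\begin{cases}
\mathbf{y}_1 = \mathbf{z}_1, \\
\mathbf{y}_j = \mathbf{z}_j + \displaystyle\sum_{k=1}^{m} \alpha_j^k\, \mathbf{t}_k & \text{for } j = 2, \ldots, r,
\end{cases}
\tag{6}
\]

where $L_b \mathbf{z}_j = \mathbf{f}_j$, $L_b \mathbf{t}_k = \mathbf{e}_k$, $k = 1, \ldots, m$, and all coefficients $\alpha_j^k$ satisfy

\[
\alpha_j^k = \sum_{l=k}^{m} b_{m+k-l}\; y_{(j-1)s-m+l}. \tag{7}
\]

Note that to compute the vectors $\mathbf{t}_k$ we need to find the solution of the system $L_b \mathbf{t}_1 = \mathbf{e}_1$, namely $\mathbf{t}_1 = (1, t_2, \ldots, t_s)^T$. Then we can form the vectors $\mathbf{t}_k$ as follows

\[
\mathbf{t}_k = (\underbrace{0, \ldots, 0}_{k-1}, 1, t_2, \ldots, t_{s-k+1})^T. \tag{8}
\]

The fully vectorized algorithm (with possibility of parallelization) proceeds as follows. First we define the matrix $F = (\mathbf{f}_1, \ldots, \mathbf{f}_r, \mathbf{e}_1) \in \mathbb{R}^{s\times(r+1)}$ and find the solution of the system $L_b Z = F$, using

\[
F_{k,*} \leftarrow
\begin{cases}
0 & \text{for } k \le 0, \\
F_{k,*} + \displaystyle\sum_{j=1}^{m} b_j Z_{k-j,*} & \text{for } 1 \le k \le s,
\end{cases}
\tag{9}
\]

overwriting $F$ by the matrix $Z$. Next we find the $m$ last entries of each vector $\mathbf{y}_j$, $j = 2, \ldots, r$, using (6) and (7). Finally, we find the $s - m$ first entries of $\mathbf{y}_2, \ldots, \mathbf{y}_r$, using (6). Note that (7) can be vectorized, namely one can perform the following operation

\[
\begin{pmatrix} \alpha_j^1 \\ \alpha_j^2 \\ \vdots \\ \alpha_j^{m-k+1} \end{pmatrix}
\leftarrow
\begin{pmatrix} \alpha_j^1 \\ \alpha_j^2 \\ \vdots \\ \alpha_j^{m-k+1} \end{pmatrix}
+ b_{m+1-k}
\begin{pmatrix} y_{(j-1)s-m+k} \\ y_{(j-1)s-m+k+1} \\ \vdots \\ y_{(j-1)s} \end{pmatrix}
\tag{10}
\]

for $k = 1, \ldots, m$.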
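The three phases just described can be sketched as follows. This is an illustrative scalar rendering under stated assumptions, not the vectorized code of [18]: the function name `solve_recurrence_blocked` is hypothetical, indices are 0-based, `b` holds $b_1, \ldots, b_m$, and the $s \times (r+1)$ array is stored as a list of rows so that the inner loops over columns correspond to the vector operations.

```python
def solve_recurrence_blocked(f, b, s):
    """Solve y_k = f_k + sum_{j=1}^{m} b_j y_{k-j} via (6)-(9) on an s x r array."""
    m, n = len(b), len(f)
    r = n // s
    assert r * s == n and s > m
    # Rows of F = (f_1, ..., f_r, e_1); the extra last column will become t_1.
    Z = [[f[j * s + i] for j in range(r)] + [1.0 if i == 0 else 0.0]
         for i in range(s)]
    # Phase 1, eq. (9): solve L_b Z = F row by row, all r+1 columns at once.
    for k in range(s):
        for j in range(1, min(m, k) + 1):
            for c in range(r + 1):
                Z[k][c] += b[j - 1] * Z[k - j][c]
    t1 = [Z[i][r] for i in range(s)]          # solution of L_b t_1 = e_1
    Y = [row[:r] for row in Z]                # column 0 already equals y_1
    alpha = [None] * r

    def add_correction(i, j):
        # Eq. (6) with t_k from (8): t_k[i] = t1[i - (k-1)] for i >= k-1, else 0.
        for k in range(1, min(m, i + 1) + 1):
            Y[i][j] += alpha[j][k - 1] * t1[i - (k - 1)]

    # Phase 2 (sequential in j): coefficients (7) and the last m entries of y_j.
    for j in range(1, r):
        alpha[j] = [sum(b[m + k - l - 1] * Y[s - m + l - 1][j - 1]
                        for l in range(k, m + 1))
                    for k in range(1, m + 1)]
        for i in range(s - m, s):
            add_correction(i, j)
    # Phase 3 (independent across columns): the first s-m entries of y_2, ..., y_r.
    for i in range(s - m):
        for j in range(1, r):
            add_correction(i, j)
    return [Y[i][j] for j in range(r) for i in range(s)]
```

Only the second phase is inherently sequential, and it touches just $m$ entries per column; the first and third phases operate on whole rows of the array.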
3 Novel Data Format
Now let us consider the implementation of the algorithms introduced in the previous section using a matrix representation based on the novel data format for dense matrices [6,7]. Suppose that we have an $m \times n$ matrix $A$. The matrix can be partitioned as follows

\[
A = \begin{pmatrix}
A_{11} & \ldots & A_{1 n_g} \\
\vdots & & \vdots \\
A_{m_g 1} & \ldots & A_{m_g n_g}
\end{pmatrix} .
\tag{11}
\]

Each block $A_{ij}$ contains a submatrix of $A$ and is stored as a square $n_b \times n_b$ block which occupies a contiguous block of memory (Figure 1). The value of the blocksize $n_b$ should be chosen so that blocks map nicely into L1 cache [6,7]. Note that when $n_g n_b \ne n$ or $m_g n_b \ne m$, the blocks $A_{1 n_g}, \ldots, A_{m_g n_g}$ or $A_{m_g 1}, \ldots, A_{m_g n_g}$, respectively, are not square and some memory will be wasted ('*' in Figure 1). As mentioned in [6], the novel data format can be represented by using a four-dimensional array where A(i,j,k,l) refers to the (i,j) element within the (k,l) block.

         1   5   9  13 |  33  37  41  45 |  65  69  73  77 |  97 101   *   *
         2   6  10  14 |  34  38  42  46 |  66  70  74  78 |  98 102   *   *
         3   7  11  15 |  35  39  43  47 |  67  71  75  79 |  99 103   *   *
         4   8  12  16 |  36  40  44  48 |  68  72  76  80 | 100 104   *   *
    A = -------------------------------------------------------------------
        17  21  25  29 |  49  53  57  61 |  81  85  89  93 | 105 109   *   *
        18  22  26  30 |  50  54  58  62 |  82  86  90  94 | 106 110   *   *
        19  23  27  31 |  51  55  59  63 |  83  87  91  95 | 107 111   *   *
         *   *   *   * |   *   *   *   * |   *   *   *   * |   *   *   *   *

Fig. 1. Square blocked full column major order of a 7 × 14 matrix

Our algorithm used for evaluating (1) can be easily adapted to the novel data format described above. For simplicity, we assume that $s = n_b m_g$. Then the matrix $X$ defined by (4) can be partitioned as in (11), and all blocks (possibly except $X_{1 n_g}, \ldots, X_{m_g n_g}$) are square. The matrix-vector multiplication $F \leftarrow LX$ and the update of the first $m$ entries of each vector $\mathbf{f}_j$, $j = 1, \ldots, r$ (namely the first stage of the main algorithm, see Figure 2) can be done locally on each block column. Note that this stage of the algorithm can also be parallelized (each column of blocks can be evaluated independently), thus each processor can be responsible for computing a set of block columns. The main loop of this stage, which operates on block columns, can be easily parallelized using the OpenMP 'parallel do' directive.

[Figure 2: diagram not reproduced. It shows Steps 1 and 2 of the first stage, with block columns assigned to processor 0 and processor 1.]

Fig. 2. Stage 1 using square blocked data format

[Figure 3: diagram not reproduced. It shows Steps 1, 2 and 3 of the second stage, with block columns assigned to processor 0 and processor 1.]

Fig. 3. Stage 2 using square blocked data format
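The address computation behind the square blocked format can be sketched as follows. This is only an illustration under stated assumptions: indices are 0-based, every block (including padded edge blocks) occupies a full nb × nb area, blocks are ordered column by column, and elements inside a block are stored in column-major order. The name `blocked_offset` is hypothetical and does not come from [6,7].

```python
def blocked_offset(i, j, m, nb):
    """0-based storage offset of element (i, j) of an m-row matrix
    kept in square blocked column-major order with blocksize nb."""
    mg = -(-m // nb)                    # number of block rows, ceil(m / nb)
    block = (j // nb) * mg + (i // nb)  # blocks are ordered column by column
    inside = (j % nb) * nb + (i % nb)   # column-major order inside a block
    return block * nb * nb + inside
```

For the 7 × 14 example with nb = 4, the offsets of elements (0,0), (4,0), (0,4) and (6,11) are 0, 16, 32 and 94, which correspond to the 1-based positions 1, 17, 33 and 95 shown in Figure 1. The numbering of the last, partial block column in the figure may follow a different padding convention than the one assumed here.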
Similarly we can implement the second stage of the algorithm, namely solution of the linear recurrence system (see Figure 3), however it is more complicated because it consists of three steps: the first and the third can be done in parallel, while the second is sequential. To reduce the number of cache misses, blocks should be computed in the appropriate order. During the first step, columns of blocks should be computed from right to left and ‘top-down’ within each block column. Then during the second step we update elements in the lower row of blocks and finally (the third step) we update blocks from right to left and ‘bottom-up’ within each column of blocks.
4 Results of Experiments
The method has been implemented in Fortran 95 with OpenMP [3] and all vector operations have been implemented using array section assignments. The experiments have been carried out on a dual processor Itanium 2 (1.3 GHz, 3 MB cache, approximately 5 Gflops peak performance, running under Linux) using the Intel Fortran Compiler (ver. 8.0) and on 16 SSPs of Cray X1 [9]. Each SSP consists of a fast vector processor and a very slow (400 MHz) scalar processor for scalar operations. The theoretical peak performance of one SSP is 3.2 Gflop/s. Sixteen SSPs (i.e. one node) operate in the symmetric multiprocessing mode, so they have access to shared memory.

We have measured the performance of the algorithm for various n, m and nb, and the speedup against the simple algorithm based on (1) for optimal nb and various numbers of processors, to find out how the use of advanced computer hardware can improve on the performance of the simple scalar code found in the literature (see [14]). Sample results are presented in Figures 4 and 5.

[Figure 4: plots not reproduced. Left panels: performance (Mflops) versus blocksize for m = 2, 4, 6 with p = 2, for n = 10^6 and n = 67*10^6. Right panels: speedup versus m for p = 1 and p = 2, for n = 10^6 and n = 67*10^6.]

Fig. 4. Itanium 2: performance for various n, m and nb (left) and speedup for optimal nb and various numbers of processors (right)

[Figure 5: plots not reproduced. Left panels: performance (Mflops) versus blocksize for m = 2, 4, 6 with p = 16, for n = 10^6 and n = 67*10^6. Right panels: speedup versus m for p = 4, 8 and 16, for n = 10^6 and n = 67*10^6.]

Fig. 5. Cray X1: performance for various n, m and nb (left) and speedup for optimal nb and various numbers of processors (right)

The results of experiments can be summarized as follows. The performance of the algorithm depends on the chosen blocksize nb, so it should be chosen carefully. On Itanium the optimal blocksize ranges from nb = 32 (for m = 2) to nb = 128 (for larger values of m). On Cray the optimal blocksize ranges from nb = 32 to nb = 64. However, when the blocksize nb is too large, the performance of the algorithm decreases dramatically because blocks do not fit into L1 cache and cache misses occur. For optimal blocksizes the algorithm achieves reasonable speedup on Itanium and good speedup on Cray X1 because computations within local blocks are vectorized. Moreover, in the case of the simple algorithm based on (1), the fast vector processors cannot be used without manual optimization of the simple scalar code. Although all optimization switches were turned on, the Cray compiler produced rather slow output for the scalar code. The values of the parameters r and s should be chosen to minimize the number of flops required by the algorithm; however, r = s is a rather good choice. The use of multiple processors is profitable even for smaller values of n and m = 2. Finally, comparing the performance of the new algorithm which utilizes the novel data format with the performance of the BLAS-based algorithm which uses standard Fortran two-dimensional arrays with the column major storage order (see [17]), we can observe that the new algorithm is about 50% faster than the BLAS-based one.
5 Conclusions
We have shown that the performance of the recurrence computations can be highly improved by the use of the simple vectorized algorithm which operates on novel data formats for dense matrices. The algorithm can also be parallelized, thus it should be useful on novel multicore architectures.
Acknowledgements

The use of Cray X1 from the Interdisciplinary Center for Mathematical and Computational Modeling (ICM) of the Warsaw University is kindly acknowledged. The author would like to thank the anonymous referees for valuable discussions and suggestions.
References

1. Bario, R., Melendo, B., Serrano, S.: On the numerical evaluation of linear recurrences. J. Comput. Appl. Math. 150, 71–86 (2003)
2. Blelloch, G., Chatterjee, S., Zagha, M.: Solving linear recurrences with loop raking. Journal of Parallel and Distributed Computing 25, 91–97 (1995)
3. Chandra, R., Dagum, L., Kohr, D., Maydan, D., McDonald, J., Menon, R.: Parallel Programming in OpenMP. Morgan Kaufmann Publishers, San Francisco (2001)
4. Dongarra, J., Duff, I., Sorensen, D., Van der Vorst, H.: Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia (1991)
5. Dongarra, J., Hammarling, S., Sorensen, D.: Block reduction of matrices to condensed form for eigenvalue computations. J. Comp. Appl. Math 27, 215–227 (1989)
6. Gustavson, F.G.: New generalized data structures for matrices lead to a variety of high performance algorithms. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds.) PPAM 2001. LNCS, vol. 2328, pp. 418–436. Springer, Heidelberg (2002)
7. Gustavson, F.G.: High-performance linear algebra algorithms using new generalized data structures for matrices. IBM J. Res. Dev. 47, 31–56 (2003)
8. Larriba-Pey, J.L., Navarro, J.J., Jorba, A., Roig, O.: Review of general and Toeplitz vector bidiagonal solvers. Parallel Comput. 22, 1091–1125 (1996)
9. Network Computer Services Inc.: The AHPCRC Cray X1 primer, http://www.ahpcrc.org/publications/Primer.pdf
10. Paprzycki, M., Stpiczyński, P.: Solving linear recurrence systems on the Cray YMP. In: Waśniewski, J., Dongarra, J. (eds.) PARA 1994. LNCS, vol. 879, pp. 416–424. Springer, Heidelberg (1994)
11. Paprzycki, M., Stpiczyński, P.: Parallel solution of linear recurrence systems. Z. Angew. Math. Mech. 76, 5–8 (1996)
12. Parhi, K.K., Messerschmitt, D.: Pipeline interleaving and parallelism in recursive digital filters, part II: Pipelined incremental block filtering. IEEE Trans. Acoust., Speech Signal Processing ASSP-37, 1118–1135 (1989)
13. Robelly, J.P., Cichon, G., Seidel, H., Fettweis, G.: Implementation of recursive digital filters into vector SIMD DSP architectures. In: Proc. International Conference on Acoustics, Speech and Signal Processing, ICASSP 2004, Montreal, Canada, May 17-21 (2004)
14. Smith, S.W.: The Scientist and Engineer's Guide to Digital Signal Processing. California Technical Publishing, San Diego (1997)
15. Stpiczyński, P.: Parallel algorithms for solving linear recurrence systems. In: Bougé, L., Robert, Y., Trystram, D., Cosnard, M. (eds.) CONPAR 1992 and VAPP 1992. LNCS, vol. 634, pp. 343–348. Springer, Heidelberg (1992)
16. Stpiczyński, P.: Solving linear recurrence systems using level 2 and 3 BLAS routines. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds.) PPAM 2004. LNCS, vol. 3019, pp. 1059–1066. Springer, Heidelberg (2004)
17. Stpiczyński, P.: Evaluating recursive filters on distributed memory parallel computers. Comm. Numer. Meth. Engng. 22, 1087–1095 (2006)
18. Stpiczyński, P., Paprzycki, M.: Fully vectorized solver for linear recurrence systems with constant coefficients. In: Palma, J.M.L.M., Dongarra, J., Hernández, V. (eds.) VECPAR 2000. LNCS, vol. 1981, pp. 541–551. Springer, Heidelberg (2001)
19. van der Vorst, H., Dekker, K.: Vectorization of linear recurrence relations. SIAM J. Sci. Stat. Comput. 10, 27–35 (1989)
Application of Fusion-Fission to the Multi-way Graph Partitioning Problem

Charles-Edmond Bichot

Laboratoire d'Optimisation Globale, École Nationale de l'Aviation Civile/Direction des Services de la Navigation Aérienne, 7 av. Edouard Belin, 31055 Toulouse, France
[email protected]
http://www.recherche.enac.fr/~bichot
Abstract. This paper presents an application of the Fusion-Fission method to the multi-way graph partitioning problem. The Fusion-Fission method was first designed to solve the normalized cut partitioning problem. Its application to the multi-way graph partitioning problem is very recent, so the Fusion-Fission algorithm has not yet been optimized. As a result, it is slower than the state-of-the-art graph partitioning packages. The Fusion-Fission algorithm is compared with JOSTLE, METIS, CHACO and PARTY for four partition cardinal numbers (2, 4, 8 and 16) and three partition balance tolerances (1.00, 1.01 and 1.03). Results show that, up to two thirds of the time, the partitions returned by Fusion-Fission are of higher quality than those returned by state-of-the-art graph partitioning packages.
1 Introduction
For ten years, the state-of-the-art method for solving the multi-way graph partitioning problem has been the multilevel method. The multilevel method often uses a graph growing algorithm for the partitioning task and a Kernighan-Lin type refinement algorithm. This method was introduced in [1,2,3]. It is both very efficient and very fast. It consists in reducing the number of vertices of the graph, which is sometimes very high (more than 100,000 vertices), by coarsening them. Then, a partition of the coarsest graph (fewer than 100 vertices) is built, generally with a graph growing algorithm [4]. After that, the vertices of the partition are successively un-coarsened and the partition is refined with a Kernighan-Lin algorithm [5,6] or a helpful-set algorithm [7].

Graph partitioning has many applications. The most famous of them are parallel computing, VLSI design and engineering computation. Thus, graph partitioning is an important combinatorial optimization problem. Because of its great number of applications, there are different graph partitioning problems. The aim of this paper is to study the most classical of them, the multi-way graph partitioning problem, also called the k-way graph partitioning problem [8]. The other graph partitioning problems, such as the Normalized-Cut partitioning problem [9,10] or the Ratio-Cut partitioning problem [11], are not presented in this paper.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 698–707, 2008. © Springer-Verlag Berlin Heidelberg 2008
This paper presents an application of the Fusion-Fission method to the multi-way graph partitioning problem. The Fusion-Fission method was first created to solve the Normalized-Cut graph partitioning problem [12,13]. Because the multi-way graph partitioning problem and the normalized-cut graph partitioning problem are strongly related, it is interesting to evaluate the efficiency of the same method on both problems, as has been done for the multilevel method [11]. The work presented in this paper is the first adaptation of Fusion-Fission to the multi-way graph partitioning problem. Thus, not all components of Fusion-Fission presented in [13] appear in this preliminary adaptation. However, the partitions found by this adaptation are quite good compared with the partitions returned by state-of-the-art graph partitioning packages.
2 Graph Partitioning
The multi-way graph partitioning problem consists in finding a partition of the vertices of a graph into parts of the same size, while minimizing the number of edges between parts. It is well known that the multi-way graph partitioning problem is NP-complete. The difficulty of the task is to keep the sizes of the parts equal while minimizing the edge-cut. The value which represents the difference between the sizes of the parts is named the balance of the partition. Because a small difference between the sizes of the parts may lead to a lower edge-cut [14], many results are presented for partitions that are not perfectly balanced.

Definition 1 (Partition of the vertices of a graph). Let $G = (V, E)$ be an undirected graph, with $V$ its set of vertices and $E$ its set of edges. A partition of the graph into $k$ parts is a set $P_k = \{V_1, \ldots, V_k\}$ of subsets of $V$ such that:
– No element of $P_k$ is empty.
– The union of the elements of $P_k$ is equal to $V$.
– The intersection of any two elements of $P_k$ is empty.
The number of parts $k$ of the partition $P_k$ is named the cardinal number of the partition.

Assume that the graph $G$ is weighted. For each vertex $v_i \in V$, let $w(v_i)$ be its weight. For each edge $(v_i, v_j) \in E$, let $w(v_i, v_j)$ be its weight. Then, the weight of a set of vertices $V' \subseteq V$ is the sum of the weights of the vertices of $V'$:

\[ w(V') = \sum_{v \in V'} w(v) . \]

Definition 2 (Balance of a partition). Let $P_k = \{V_1, \ldots, V_k\}$ be a partition of a graph $G = (V, E)$ into $k$ parts. The average weight of a part is $W_{average} = \frac{w(V)}{k}$. The balance of a partition is defined as the maximum weight of all parts divided by the average weight of a part:

\[ balance(P_k) = \frac{\max_{V_i \in P_k} w(V_i)}{W_{average}} = \frac{k}{w(V)} \max_{V_i \in P_k} w(V_i) . \]
The objective function to minimize is the cut function. It is defined as the sum of the weights of the edges between the parts. More formally, let $V_1$ and $V_2$ be two elements of $P_k$:

\[ cut(V_1, V_2) = \sum_{u \in V_1,\, v \in V_2} w(u, v) . \]

Then, the cut objective function is defined as:

\[ cut(P_k) = \sum_{V_i, V_j \in P_k,\ i < j} cut(V_i, V_j) . \]
The partition which has the lowest cut value is the solution of the multi-way graph partitioning problem. However, because of the size of the graphs to partition (several thousands of vertices) and because of the combinatorial nature of the problem, the partition with the lowest cut value cannot in general be found in practice. Thus, combinatorial optimization methods are used to find good approximate solutions.
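The two quantities defined in this section can be computed directly from their definitions. The following is a minimal sketch, assuming that a partition is given as a list of vertex sets, vertex weights as a dictionary, and edges as a dictionary mapping a vertex pair to its weight; the function names and this representation are illustrative choices, not part of any partitioning package.

```python
def part_weight(part, w):
    """Weight of a set of vertices: the sum of its vertex weights."""
    return sum(w[v] for v in part)


def balance(parts, w):
    """Maximum part weight divided by the average part weight."""
    total = sum(part_weight(p, w) for p in parts)
    return max(part_weight(p, w) for p in parts) * len(parts) / total


def cut(parts, edges):
    """Sum of the weights of the edges whose endpoints lie in different parts."""
    owner = {v: i for i, p in enumerate(parts) for v in p}
    return sum(wt for (u, v), wt in edges.items() if owner[u] != owner[v])
```

For a unit-weight graph on four vertices with edges (0,1), (1,2), (2,3), (0,3) and (0,2), the partition {0,1}, {2,3} has balance 1.0 and cut 3, while the partition {0,1,2}, {3} has balance 1.5 and cut 2. This illustrates why relaxing the balance constraint can lower the edge-cut.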
3 The Fusion-Fission Adaptation to Multi-way Graph Partitioning
Because the principles of Fusion-Fission are described in [13], this paper presents the method only briefly. Fusion-Fission is based on the nuclear force between nucleons. This force is responsible for binding protons and neutrons into atomic nuclei. In nature, the fifty-six particles of an iron nucleus are more tightly bound together than those of any other element. Thus, the Fusion-Fission optimization process consists in splitting and merging atoms to create atoms of maximum binding energy. Nucleons of big atoms are merged into atoms with few nucleons.

An analogy with graph partitioning is easy. Let the nucleons be the vertices of the graph and the atoms the parts of the partition. The binding energy between two nucleons is the edge weight between the corresponding vertices. According to the Fusion-Fission process, parts of the partition are successively merged and split. Then, the cardinal number of the partition changes during the process. The atoms resulting from the Fusion-Fission process should be atoms of the same size, which means that the final partition is perfectly balanced.

To be as close as possible to the process described above, the Fusion-Fission application to multi-way graph partitioning is an iterative process which works as follows: at each step of the process, a new partition $P_{l'}^{t+1}$ is created based on the preceding partition $P_l^t$. The fission process consists in splitting each part of the partition $P_l^t$ into $l'$ parts. Because of its efficiency, the multilevel method has been chosen for the splitting. The fusion process consists in merging the $l \cdot l'$ parts into a partition $P'$ of $l'$ parts. The merging can be viewed as a graph partitioning problem where the vertices of the graph are the $l \cdot l'$ parts. Thus, a multilevel method has been chosen for the merging too. Then, the partition $P'$ is refined using a Kernighan-Lin type algorithm (KL). The resulting partition is the partition $P_{l'}^{t+1}$. The initial partition, $P_k^0$, is provided by the multilevel method. Algorithm 1 presents the Fusion-Fission application to multi-way graph partitioning.

The number of parts of the new partition, $l'$, changes at each iteration. We decided to force it to follow a binomial distribution centered at $k$. A list of numbers which follow this binomial distribution is constructed at the beginning of the Fusion-Fission algorithm. Then, each iteration starts by selecting a new number of parts $l'$ from this list. The multilevel method and the Kernighan-Lin type algorithm (KL in Algorithm 1) used are those of the pMETIS software and are both described in [4]. Here pMETIS does not refer to the parallel implementation of METIS; it is the name of the recursive bisection program implemented in the serial METIS package.

The particularity of the Fusion-Fission algorithm is that it finds several partitions of different cardinal numbers. Moreover, for each partition found during an iteration of the algorithm, refinement is a four-step process. The partition is first refined for a balance of 1.00, then for a balance of 1.01, and afterwards for balances of 1.03 and 1.05. This four-step refinement process greatly increases the computation time of the algorithm. Since the algorithm code is not optimized as much as the multilevel software packages, its computation time is less relevant than partition quality.

Algorithm 1. Fusion-Fission

    procedure FusionFission(G = (V, E), k, n, pMETIS, KL)
        l ← k
        P ← pMETIS(G, k)
        P_k^0 ← P = {P_1, ..., P_k}
        for t = 1 to n do
            choose a new number of parts l'
            P_l^t = {P_1, ..., P_l}
            for j = 1 to l do
                V_l' ← pMETIS(P_j, l')
                V' ← V' ∪ V_l'
            end for
            make a graph G' based on the set of parts V'
            P_l'^(t+1) ← pMETIS(G', l')
            P' ← KL(P_l'^(t+1))
            if l' = k and cut(P') < cut(P) then
                P ← P'
            end if
        end for
        return P
    end procedure
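The step "make a graph based on the set of parts" is not detailed here. One natural reading, offered only as an assumption, is that the fusion step works on the quotient graph: each part produced by the fission becomes one vertex, and the weight of an edge between two such vertices is the total weight of the original edges joining the two parts. The sketch below uses the hypothetical name `quotient_graph` and the same list-of-sets and edge-dictionary representation as above; it is not the pMETIS API.

```python
def quotient_graph(parts, edges):
    """Collapse each part into one vertex; the weight of coarse edge (a, b), a < b,
    is the sum of the weights of the original edges between parts a and b."""
    owner = {v: i for i, p in enumerate(parts) for v in p}
    coarse = {}
    for (u, v), wt in edges.items():
        a, b = owner[u], owner[v]
        if a != b:  # edges inside a part disappear in the quotient graph
            key = (min(a, b), max(a, b))
            coarse[key] = coarse.get(key, 0) + wt
    return coarse
```

Under this reading, the total edge weight of the quotient graph equals the cut of the partition it was built from, so partitioning the quotient graph amounts to deciding which fission parts to merge.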
4 Comparison with State-of-the-Art Graph Partitioning Packages

4.1 Benchmark Graphs
The performance of the Fusion-Fission adaptation to multi-way graph partitioning is evaluated on a wide range of test graphs arising in different application domains. These test graphs have been chosen among classical benchmarks in the graph partitioning literature. Some of these benchmarks have been tested in recent papers [15,16,17,11]. In these graphs, both vertices and edges are unweighted. The characteristics of these graphs are described in Table 1.

Table 1. Benchmark graphs characteristics
                 Size                 Degree
Graph name    |V|      |E|     min  max    avg   Description (source)
add20        2395     7462       1  123   6.23   20-bit adder (Motorola)
data         2851    15093       3   17  10.59
3elt         4720    13722       3    9   5.81   2D nodal graph (NASA/RIACS)
uk           4824     6837       1    3   2.83   2D dual graph
add32        4960     9462       1   31   3.82   32-bit adder (Motorola)
bcsstk33     8738   291583      19  140  66.74   3D stiffness matrix (Boeing)
whitaker3    9800    28989       3    8   5.92   2D nodal graph (NASA/RIACS)
crack       10240    30380       3    9   5.93   2D nodal graph
wing_nodal  10937    75488       5   28  13.80   3D nodal graph
fe_4elt2    11143    32818       3   12   5.89
vibrobox    12328   165250       8  120  26.81   Sparse matrix
bcsstk29    13992   302748       4   70  43.27   3D stiffness matrix (Boeing)
4elt        15606    45878       3   10   5.88   2D nodal graph (NASA/RIACS)
fe_sphere   16386    49152       4    6   6.00
cti         16840    48232       3    6   5.73   3D semi-structured matrix
memplus     17758    54196       1  573   6.10   Memory circuit (Motorola)
cs4         22499    43858       2    4   3.90   3D dual graph
bcsstk30    28924  1007284       3  218  69.65   3D stiffness matrix (Boeing)
bcsstk31    35588   572914       1  188  32.20   3D stiffness matrix (Boeing)
bcsstk32    44609   985046       1  215  44.16   3D stiffness matrix (Boeing)
t60k        60005    89440       2    3   2.98   2D dual graph
wing        62032   121544       2    4   3.92   3D dual graph
brack2      62631   366559       3   32  11.71   3D nodal graph (NASA/RIACS)
All of these benchmark graphs can be downloaded from the University of Greenwich graph partitioning archive (May 2007): http://staffweb.cms.gre.ac.uk/~c.walshaw/partition/

All the experiments in this paper were performed on an Intel Pentium IV 3.0 GHz processor with 1 GB of memory, running a GNU/Linux Debian operating system.
4.2 Some Graph Partitioning Packages
The quality of the partitions produced by the Fusion-Fission algorithm is compared with those generated on the same computer by several public domain graph partitioning packages:
– The CHACO software [18]. This software includes multilevel and spectral algorithms. Because it is more efficient than the spectral algorithm, only the multilevel algorithm of CHACO, described in [1], is compared with Fusion-Fission.
– The JOSTLE software [19]. It is based on a multilevel multi-way partitioning algorithm [20].
– The METIS package [21]. This package provides both the pMETIS and the kMETIS programs. kMETIS is a direct multi-way partitioning algorithm [8]. pMETIS uses a recursive bisection algorithm [4].
– The PARTY software [22]. This software is based on a multilevel algorithm and a helpful-sets refinement algorithm [7].
Of these packages, two have a balance parameter: JOSTLE and CHACO (with KL_IMBALANCE). The other two find partitions with a variable balance.

4.3 Comparisons between Graph Partitioning Software
To be compared with the other algorithms, the Fusion-Fission algorithm has been limited to 2,000 iterations. Its runtime is then between one minute and one hour. This computation time is quite long compared with that of the graph partitioning packages, which is often less than a second. There are some explanations for this deficiency: the Fusion-Fission algorithm has not been optimized, and it makes four refinement steps instead of one (see Section 3). However, the Fusion-Fission algorithm is not slow in comparison with metaheuristics applied to graph partitioning [15,16], which have computation times of several hours to several days.

Tables 2 and 3 present comparisons between the public graph partitioning packages presented in Section 4.2 and the Fusion-Fission algorithm. Four cardinal numbers have been chosen: k = 2, 4, 8 and 16. CHACO naturally finds perfectly balanced partitions. Its results are compared with those of JOSTLE and Fusion-Fission for balance = 1.00. pMETIS (labeled pM. in Tables 2 and 3) finds partitions for a balance of 1.01, thus it is compared with JOSTLE and Fusion-Fission for this imbalance. kMETIS and PARTY are compared with JOSTLE and Fusion-Fission for balance = 1.03. When an algorithm does not find a partition for the given balance, the result is marked not available (N/A in Tables 2 and 3).

In Tables 2 and 3, the rows labeled "Best" give the number of times each algorithm found the best partition quality over the 23 graphs of this benchmark, relative to the results of the other algorithms for the same balance. Results show that Fusion-Fission outperforms the other algorithms in all cases except for k = 8 and k = 16 with a balance of 1.00. In these last two cases, the Fusion-Fission algorithm does as well as the JOSTLE software. The Fusion-Fission algorithm has not been constrained to find perfectly balanced partitions, even if it tries to do so. Thus, in a few cases it does not find perfectly balanced partitions. The Fusion-Fission algorithm is particularly good for the two smallest cardinal numbers, k = 2 and k = 4. It can be noticed that the JOSTLE software generally does at least as well as the other existing packages, except when it is compared with pMETIS for k = 16 and balance = 1.01.

Table 2. Comparisons between algorithms for cardinal numbers k = 2 and k = 4
Graph             balance = 1.00          balance = 1.01              balance = 1.03
             JOSTLE   CHACO      FF  JOSTLE     pM.      FF  JOSTLE     kM.   PARTY      FF
k = 2
add20           734     740     699     729     725     677     721     788     750     666
data            241     279     204     241     218     203     241     244     233     196
3elt             95      92      90      95     108      90      95     129     136      88
uk               33      34      21      28      23      21      25      41      29      19
add32            12      28      11      10      21      10      10      50      23      10
bcsstk33      12621   10224   10175   12616   10205   10175   12409   14655   12071   10069
whitaker3       136     132     127     136     135     126     135     152     132     126
crack           207     209     186     196     187     186     199     278     222     186
wing_nodal     1739    1828    1790    1741    1820    1748    1724    2054    1782    1735
fe_4elt2        130     130     130     130     130     130     130     143     130     130
vibrobox      11436   10346   11866   11424   12427   11552   11511   18245   11975   11440
bcsstk29       2898    2917    2843    2898    2843    2818    2997    3904     N/A    2818
4elt            151     179     143     151     154     139     157     258     159     138
fe_sphere       466     422     386     466     440     386     468     568     386     384
cti             347     410     334     342     334     318     342     688     366     318
memplus        6141    6861    5816    6096    6337    5816    6047    6519    7604    5574
cs4             455     421     414     435     414     414     406     613     418     414
bcsstk30       6456    6447    6407    6599    6458    6345    6599    8297    6696    6275
bcsstk31       4044    4020    2805    4060    3638    2718    4191    4950     N/A    2698
bcsstk32       6764    5507    4782    7027    5672    4747    6888    5527    5520    4747
t60k            108     101     100     107     100      83      98     103      93      78
wing            956     952     950     908     950     908     896    1562     927     908
brack2          754     752     738     751     738     714     715     845     947     691
Best              2       2      21       5       2      21       5       0       1      20
k = 4
add20          1238    1357    1292    1229    1292    1211    1255    1387    1281    1202
data            448     433     459     447     480     430     425     505     511     420
3elt            212     219     212     210     231     210     201     265     243     208
uk               69      63      61      67      67      55      67      85      62      51
add32            45      79      36      40      42      33      41     107      62      33
bcsstk33      22130   26191   23066   22293   23131   22652   21590   25493   22445   21853
whitaker3       417     398     406     403     406     397     406     575     399     396
crack           442     479     382     431     382     378     413     589     476     371
wing_nodal     4073    3992    3720    4048    4000    3659    4048    4832     N/A    3659
fe_4elt2        396     356     359     375     359     351     368    1780     437     351
vibrobox      21761   21087   20282   22156   21471   19940   21844   36206     N/A   19825
bcsstk29       9833    8831    8826    9122    8826    8692    9122   10851     N/A    8523
4elt            498     405     378     485     406     351     434     425     364     342
fe_sphere       825     868     844     818     872     825     806    1103     819     818
cti            1355    1016    1049    1357    1113    1029    1329    2294    1089     976
memplus       10696   11532   10596   10550   10559   10436   10470   10640   11406   10182
cs4            1194    1132    1154    1177    1154    1102    1162    1599     N/A    1089
bcsstk30      25825   17013   17443   25865   17685   16816   25438   24151     N/A   16767
bcsstk31      10190   10184    8201   10066    8770    7812   10134   15279     N/A    7812
bcsstk32      14890   14946   12205   14887   12205   11340   14887   16215   13333    9924
t60k            240     290     255     229     255     255     242     279     272     227
wing           1922    2161    1937    1840    2086    1937    1824    3454     N/A    1900
brack2         3222    3356    3705    3144    3250    3109    2999    4129     N/A    2935
Best              6       6      12       5       0      19       4       0       0      19
Table 3. Comparisons between algorithms for cardinal numbers k = 8 and k = 16

Graph             balance = 1.00          balance = 1.01              balance = 1.03
             JOSTLE   CHACO      FF  JOSTLE  pMETIS      FF  JOSTLE  kMETIS   PARTY      FF
k = 8
add20          1853    1881    1907    1894    1907    1907    1836    2130    2018    1907
data            798     763     842     800     842     758     756     N/A     791     714
3elt            462     393     388     433     388     364     418     527     432     356
uk              114     130     101     104     101     101     106     168     148     101
add32           120     139     N/A     105      81      81     106     351     N/A      72
bcsstk33      36106   41951   40070   36269   40070   35579   35961   44681   39071   34919
whitaker3       716     712     719     710     719     692     706    1047     759     687
crack           809     806     773     779     773     721     751    1047     808     721
wing_nodal     6070    6152    6070    6033    6070    6070    5965    8335    6284    5976
fe_4elt2        713     651     654     688     654     646     681     801     707     641
vibrobox      30103   33410   28696   31032   28177   26162   30247   43334     N/A   25796
bcsstk29      17391   16887   17534   16742   16555   15181   17234   24525     N/A   15043
4elt            674     701     635     612     635     604     656     827     722     583
fe_sphere      1351    1337    1330    1280    1330    1302    1274    1643    1277    1294
cti            2257    1886     N/A    2158    2110    2076    2086    3888    2482    2005
memplus       12866   13956   13110   12684   13110   13110   12540     N/A   13119   13110
cs4            1703    1808    1746    1673    1746    1746    1588    2733    1721    1746
bcsstk30      39271   35647     N/A   38746   36357   36357   38228   41052   48539   35668
bcsstk31      15360   17553   16012   17094   16012   14754   19849   20647     N/A   14754
bcsstk32      29281   25810   23601   26655   23601   23601   25343   39817     N/A   23601
t60k            556     593     561     532     561     561     530    1309     581     561
wing           3028    3221    3205    2918    3205    3205    2911    5748     N/A    3205
brack2         8007    8061    7844    8037    7844    7844    7757   10171     N/A    7844
Best              9       6       9       7       5      16       8       0       0      15
k = 16
add20          2555    2269    2504    2532    2504    2504    2565     N/A    2510    2504
data           1299    1279    1309    1299    1370    1278    1263    4857    1475    1224
3elt            645     641     665     621     665     607     603     969     754     598
uk              211     211     N/A     190     189     189     180     384     220     189
add32           269     217     128     239     128     128     180     N/A     N/A     128
bcsstk33      59884   61800   59791   61505   59791   58694   57553  123044   61630   58183
whitaker3      1172    1241    1237    1138    1237    1180    1147    1436    1277    1165
crack          1245    1277    1255    1212    1255    1197    1191    2296    1263    1187
wing_nodal     9083    9327    9290    9091    9290    8962    8947   10097    9006    8890
fe_4elt2       1194    1095    1152    1146    1152    1083    1140    1500    1236    1076
vibrobox      35447   42634   37441   36233   37441   36398   34521     N/A     N/A   35809
bcsstk29      28294   26239     N/A   28062   28151   26422   28338  147143     N/A   25417
4elt           1081    1099    1056    1034    1056    1048    1012    5077    1282    1015
fe_sphere      1918    2061    2030    1759    2030    1952    1741    2495     N/A    1943
cti            3402    3122    3181    3345    3181    3181    3262    4760     N/A    3181
memplus       14510   15654     N/A   15085   14942   14942   13958   15804   14831   14942
cs4            2518    2550    2538    2519    2538    2538    2477    3614    2488    2538
bcsstk30      87824   79046     N/A   87472   77293   77293   81764   93834     N/A   76791
bcsstk31      27897   29364   27180   27954   27180   27180   27388  189562     N/A   27180
bcsstk32      46954   46266   43371   48162   43371   43371   48395   50660     N/A   43371
t60k            977    1027     N/A     969     998     998     984    1347    1097     998
wing           4761    4806    4666    4623    4666    4666    4681    7712     N/A    4666
brack2        13318   13117   12655   13337   12655   12655   13164  150514     N/A   12655
Best              9       7       8       7       9      16       9       0       0      14

5 Conclusion
A new multi-way graph partitioning method has been presented in this paper. This method, named Fusion-Fission, is based on our previous work on the normalized cut graph partitioning problem [12,13]. The adaptation of Fusion-Fission to the multi-way graph partitioning problem uses the pMETIS multilevel algorithm and its Kernighan-Lin refinement algorithm.
This method has been compared with four state-of-the-art graph partitioning packages: JOSTLE, METIS, CHACO and PARTY. Classical benchmarks have been used. The partitions sought have 2, 4, 8 and 16 parts, with a balance of 1.00, 1.01 and 1.03. Results show that, in up to two thirds of the cases, the partitions returned by Fusion-Fission are of better quality than those returned by the state-of-the-art graph partitioning packages. Since Fusion-Fission takes much longer than these packages, it may be difficult to use it for parallel matrix applications. However, it can be used to advantage in fields where run time is less of a concern, such as VLSI layout or air traffic management problems.
References
1. Hendrickson, B., Leland, R.W.: A multilevel algorithm for partitioning graphs. In: Proceedings of Supercomputing (1995)
2. Karypis, G., Kumar, V.: Analysis of multilevel graph partitioning. In: Proceedings of Supercomputing (1995)
3. Alpert, C.J., Huang, J.H., Kahng, A.B.: Multilevel circuit partitioning. In: Proceedings of the ACM/IEEE Design Automation Conference, pp. 530–533 (1997)
4. Karypis, G., Kumar, V.: A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing 20(1), 359–392 (1998)
5. Kernighan, B.W., Lin, S.: An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal 49(2), 291–307 (1970)
6. Fiduccia, C.M., Mattheyses, R.M.: A linear-time heuristic for improving network partitions. In: Proceedings of 19th ACM/IEEE Design Automation Conference, pp. 175–181 (1982)
7. Diekmann, R., Monien, B., Preis, R.: Using helpful sets to improve graph bisections. In: Proceedings of the DIMACS Workshop on Interconnection Networks and Mapping and Scheduling Parallel Computations, pp. 57–73 (1995)
8. Karypis, G., Kumar, V.: Multilevel k-way partitioning scheme for irregular graphs. Journal of Parallel and Distributed Computing 48(1), 96–129 (1998)
9. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000)
10. Dhillon, I.S., Guan, Y., Kulis, B.: Kernel k-means, spectral clustering, and normalized cuts. In: Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining (KDD), pp. 551–556 (2004)
11. Dhillon, I.S., Guan, Y., Kulis, B.: Weighted graph cuts without eigenvectors: A multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence (to appear, 2007)
12. Bichot, C.E.: A metaheuristic based on fusion and fission for partitioning problems. In: Proceedings of the 20th IEEE International Parallel and Distributed Processing Symposium (2006)
13. Bichot, C.E.: A new method, the fusion fission, for the relaxed k-way graph partitioning problem, and comparisons with some multilevel algorithms. Journal of Mathematical Modeling and Algorithms (JMMA) 6(3), 319–344 (2007)
14. Simon, H.D., Teng, S.H.: How good is recursive bisection? SIAM Journal on Scientific Computing 18(5), 1436–1445 (1997)
15. Baños, R., Gil, C., Ortega, J., Montoya, F.: Multilevel heuristic algorithm for graph partitioning. In: Proceedings of the European Workshop on Evolutionary Computation in Combinatorial Optimization, pp. 143–153 (2003)
16. Soper, A.J., Walshaw, C., Cross, M.: A combined evolutionary search and multilevel optimisation approach to graph-partitioning. Journal of Global Optimization 29, 225–241 (2004)
17. Korošec, P., Šilc, J., Robič, B.: Solving the mesh-partitioning problem with an ant-colony algorithm. Parallel Computing 30(5-6), 785–801 (2004)
18. Hendrickson, B., Leland, R.: The Chaco User’s Guide. Sandia National Laboratories. 2.0 edn. (1995)
19. Walshaw, C.: The serial JOSTLE library user guide. University of Greenwich. 3.0 edn. (July 2002)
20. Walshaw, C., Cross, M.: Mesh partitioning: A multilevel balancing and refinement algorithm. SIAM Journal on Scientific Computing 22, 63–80 (2000)
21. Karypis, G., Kumar, V.: Metis: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices. University of Minnesota. 4.0 edn. (September 1998)
22. Preis, R., Diekmann, R.: The Party Partitioning Library, User Guide. University of Paderborn. 1.99 edn. (October 1998)
A Parallel Approximation Algorithm for the Weighted Maximum Matching Problem
Fredrik Manne1 and Rob H. Bisseling2
1 Department of Informatics, University of Bergen, Norway, [email protected]
2 Department of Mathematics, Utrecht University, The Netherlands, [email protected]
Abstract. We consider the problem of computing a weighted edge matching in a large graph using a parallel algorithm. This problem has application in several areas of combinatorial scientific computing. Since an exact algorithm for the weighted matching problem is both fairly expensive to compute and hard to parallelise we instead consider fast approximation algorithms. We analyse a distributed algorithm due to Hoepman [8] and show how this can be turned into a parallel algorithm. Through experiments using both complete as well as sparse graphs we show that our new parallel algorithm scales well using up to 32 processors.
1 Introduction
A matching in an undirected graph G = (V, E) is a pairing of adjacent vertices such that each vertex is matched with at most one other vertex, the objective being to match as many vertices as possible or to maximise the sum of the weights of the matched edges. One application of matchings in scientific computation is when using pivoting in the direct solution of a system of equations Ax = b. Once a pivot element aij has been chosen no other element in row i or column j can be used as a pivot again. The typical strategy is then to choose pivots in a greedy fashion. But as was shown by Duff and Koster [5,6], this can lead to sub-optimal results and they demonstrate that better results can be achieved by modelling the pivoting problem as computing a weighted matching in a bipartite graph. This is done by viewing A as a bipartite graph G(V1 , V2 , E) where there is one vertex in V1 for each row of A, one vertex in V2 for each column of A, and the weight of edge (i, j) is equal to |aij |. Then any selection of pivots is equivalent to computing a perfect matching in G (i.e., with all vertices matched). The matching objective of Duff and Koster is to maximise the product of the edge weights; in contrast, in the present work we try to maximise their sum.
The authors wish to thank HPC-Europe, the Dutch supercomputing centre SARA, NCF, the BSIK/BRICKS MSV1-2 program, and the NFR funded Parcomb project for financial support.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 708–717, 2008. © Springer-Verlag Berlin Heidelberg 2008
Even though a maximum-weight matching can be computed in polynomial time, the time complexity of doing this is still high. For this reason, fast approximation algorithms are attractive for the matching problem. The sequential greedy matching (GM) algorithm, where eligible edges are chosen by declining weight, obtains an approximation ratio of 1/2 but requires a global sorting of the edges, thus resulting in a running time of O(|E| log |E|) [1]. Preis [12] has shown how one can avoid the sorting in the GM algorithm, thus getting the time complexity down to O(|E|). The algorithm by Preis depends on finding dominating edges, i.e., edges that are heavier than their incident edges, and adding these to the matching. Since the work of Preis, linear-time algorithms that get the approximation ratio up to 2/3 − ε have been introduced [4,11]. Common to these algorithms is that they are inherently sequential. Thus, they are impractical to use if the graph is distributed, as is often the case for scientific applications running on parallel computers.
In the current paper, we present a new parallel matching algorithm that lends itself well to distributed-memory computers. Our algorithm is based on a distributed matching algorithm due to Hoepman [8]. This algorithm builds on the work by Preis and returns the same solution as the algorithm by Preis. We present and analyse the algorithm by Hoepman and show that it can in fact be phrased as a variant of Luby’s parallel algorithm [10] for finding an independent set in a graph. Next, we explain how it is possible to turn Hoepman’s algorithm into an efficient parallel algorithm suitable for distributed-memory computers. This algorithm has been implemented, and extensive tests on both complete and sparse graphs show that the algorithm scales well using up to 32 processors.
The remainder of the paper is organised as follows. In Section 2, we present and analyse the Hoepman algorithm. In Section 3, we describe how this can be parallelised. We present results from experiments in Section 4 and conclude in Section 5.
2 Distributed Matching
We will throughout this presentation assume that there exists a global ordering on the weights of the edges. If this is not the case to begin with, we can impose such an ordering by using the relative numbering of the involved vertices to break ties. As an example, if w(u, v) = w(w, x) then we rank (u, v) before (w, x) if and only if max{u, v} > max{w, x} or max{u, v} = max{w, x} and min{u, v} > min{w, x}. The purpose of this ordering is to ensure that two edges of the same weight, incident on the same vertex, can always be ordered. Let NS (v) denote all vertices in the set S that are adjacent to the vertex v in G. If S = V we will just write N (v). Let further HS (v) be the vertex in NS (v) such that the edge (v, HS (v)) is of maximum weight among all edges (v, w) where w ∈ NS (v). We say that an edge (u, v) is dominating if HS (u) = v and HS (v) = u. Let δ(v) be the degree of v in G and Δ = maxv∈V δ(v).
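The tie-breaking order and the notion of a dominating edge can be illustrated with a minimal Python sketch. The names `edge_key`, `heaviest_neighbour` and `dominating_edges`, and the dictionary-based graph representation, are assumptions made for this illustration; the sketch also assumes that an edge ranked "before" another is the one treated as heavier.

```python
def edge_key(u, v, w):
    """Sort key for edge (u, v) of weight w: weight first, then the larger
    endpoint, then the smaller endpoint (assumed reading of the tie-break)."""
    return (w, max(u, v), min(u, v))


def heaviest_neighbour(v, candidates, weight):
    """H_S(v): the vertex in `candidates` joined to v by the top-ranked edge,
    or None when `candidates` is empty."""
    return max(
        candidates,
        key=lambda x: edge_key(v, x, weight[frozenset((v, x))]),
        default=None,
    )


def dominating_edges(adj, weight):
    """All edges (u, v) with H(u) = v and H(v) = u, taking S = N(v)."""
    H = {v: heaviest_neighbour(v, adj[v], weight) for v in adj}
    return {
        frozenset((v, H[v]))
        for v in adj
        if H[v] is not None and H[H[v]] == v
    }


# A triangle whose three edges all have the same weight.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
weight = {frozenset(e): 1.0 for e in [(0, 1), (1, 2), (0, 2)]}
triangle_dominating = dominating_edges(adj, weight)
```

Even though all three weights are equal, the vertex numbers rank the edges, so exactly one edge of the triangle is dominating.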
2.1 The Hoepman Algorithm
We now present the distributed algorithm due to Hoepman [8]. This algorithm assumes that each vertex has associated with it a computing entity with its own memory and the ability to communicate with its neighbours through message passing. Algorithm 1 gives the algorithm that is run on each vertex. The main idea of the algorithm is to locate dominating edges and add these to the matching. Once a dominating edge has been found, all adjacent edges are discarded from use in the matching and the process continues. The set S in Algorithm 1 is used to maintain the possible candidates that a vertex v can match with. It is initially set to N (v). Also, the variable c is used to store the prime candidate that v wants to match with. It is set to HS (v) at the start of the algorithm and a req (request) message is sent to c to indicate that v wants to match with c. When the algorithm terminates, the c-values of all the vertices define the matching. For the rest of the algorithm, v will be processing incoming messages. All req messages are stored in a set R. If v receives a req message from c (meaning that c and v mutually prefer each other), then v will send drop messages to all other vertices in S to indicate that it is now matched with c. If v receives a drop message from u, then u will be removed from S since u has now matched with another vertex. Furthermore, if u = c then v must pick a new candidate in S to match with. Thus, v sets c = HS (v) and, if c ≠ null, sends a req message to c. It is shown in [8] that when the algorithm terminates the c-values of the vertices define the same matching as produced by the GM algorithm. Note that this is under the assumption that ties are broken in the same manner in both algorithms. If this is not the case, then it is easy to show that Algorithm 1 will still produce a matching of weight at least 0.5 of the optimal one. Also, note the importance of having a deterministic tie breaking scheme for edges incident on the same vertex. 
Without such a scheme, Algorithm 1 could easily deadlock as the following example shows. Consider a graph with three vertices x, y, and z and edges (x, y), (y, z), and (z, x) where each edge is of the same weight. If each vertex individually chooses which of its incident edges it wants to use for a matching then x could send a req message to y, while y sends a req message to z, and z sends a req message to x. The algorithm would then be deadlocked with each vertex waiting for a response message.
2.2 Running Time
Next, we consider efficiency issues in implementing Algorithm 1. In [8] it is shown that at most two messages will be sent along any edge of the graph. That is, a vertex v will at most send one req message along any incident edge (v, w) and after this at most one more message will be sent from w to v. If this is a req message, then v and w both know that they prefer to match with each other. If this message is a drop message then w has matched and v will not consider w as a candidate for the rest of the algorithm. Thus, the total number of messages sent is at most 2|E|.
Algorithm 1. The algorithm by Hoepman [8].
procedure DistributedMatching(v, N(v))
    R ← ∅
    S ← N(v)
    c ← HS(v)
    if c ≠ null then send req to c
    while S ≠ ∅ do
        receive m from some u ∈ N(v)
        if m = req then R ← R ∪ {u}
        else if m = drop then
            S ← S \ {u}
            if u = c then
                c ← HS(v)
                if c ≠ null then send req to c
        if c ≠ null and c ∈ R then
            forall w ∈ S \ {c} send drop to w
            S ← ∅
    return c
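As a rough illustration of Algorithm 1, the following Python sketch simulates all vertices in one process and delivers messages from a single FIFO queue. The queue-based delivery, the function names (`rank`, `hoepman_simulation`, `greedy_matching`) and the example graph are assumptions of this sketch and not part of the original algorithm, which runs one process per vertex.

```python
from collections import deque


def rank(u, v, weight):
    """Global edge order: weight first, ties broken by vertex numbers."""
    return (weight[frozenset((u, v))], max(u, v), min(u, v))


def hoepman_simulation(adj, weight):
    """Simulate Algorithm 1 on every vertex; return (matching, messages sent)."""
    S = {v: set(adj[v]) for v in adj}   # remaining candidates of each vertex
    R = {v: set() for v in adj}         # vertices that have sent v a req
    c = {}                              # current prime candidate of each vertex
    queue, sent = deque(), 0

    def best(v):                        # H_S(v), or None when S is empty
        return max(S[v], key=lambda x: rank(v, x, weight), default=None)

    def send(kind, src, dst):
        nonlocal sent
        queue.append((kind, src, dst))
        sent += 1

    for v in adj:
        c[v] = best(v)
        if c[v] is not None:
            send("req", v, c[v])

    while queue:
        kind, u, v = queue.popleft()
        if not S[v]:                    # v has already left its while-loop
            continue
        if kind == "req":
            R[v].add(u)
        else:                           # drop message
            S[v].discard(u)
            if u == c[v]:
                c[v] = best(v)
                if c[v] is not None:
                    send("req", v, c[v])
        if c[v] is not None and c[v] in R[v]:
            for w in S[v] - {c[v]}:
                send("drop", v, w)
            S[v] = set()

    matching = {frozenset((v, c[v])) for v in adj if c[v] is not None}
    return matching, sent


def greedy_matching(weight):
    """Sequential GM algorithm using the same edge order, for comparison."""
    M, used = set(), set()
    order = sorted(weight, key=lambda e: (weight[e], max(e), min(e)), reverse=True)
    for e in order:
        if not (e & used):
            M.add(e)
            used |= e
    return M


# Hypothetical example: a cycle on five vertices with distinct weights.
weight = {frozenset(e): w for e, w in
          [((0, 1), 4), ((1, 2), 5), ((2, 3), 3), ((3, 4), 6), ((0, 4), 1)]}
adj = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {0, 3}}
matching, sent = hoepman_simulation(adj, weight)
```

On this example the simulated matching coincides with the greedy one, and the number of messages stays within the 2|E| bound discussed in the text.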
We first note an interesting relationship between Algorithm 1 and one of the classical algorithms in parallel and distributed computing, namely the algorithm by Luby for computing an independent set [10]. To do so, we first need to introduce the notion of a line graph. A line graph L(G) is a graph where each vertex of L(G) represents an edge of G; two vertices of L(G) are adjacent if and only if their corresponding edges in G share a common endpoint. It is well known, and not hard to see, that a matching in a graph G is equivalent to an independent set in its line graph L(G). If we construct L(G) from G assigning the same weights to the vertices of L(G) as are assigned to the corresponding edges of G, we can interpret Algorithm 1 as being run on L(G) to find an independent set. Instead of locating dominating edges we now locate dominating vertices (i.e., vertices that are heavier than their remaining neighbours) and for each such vertex we add it to the independent set and remove its neighbours from further consideration. The resulting independent set in L(G) is then equivalent to the matching found in G by Algorithm 1. The algorithm on L(G) is a variant of the well-known Luby algorithm [10] for computing independent sets in parallel. The Luby algorithm is best viewed as operating in synchronous rounds, where a round starts with each remaining vertex v determining if it is dominating. If so, v is entered as a member of the independent set and a message is sent to any remaining neighbor of v before v exits the algorithm. These neighbors will in turn send a message to each of their remaining neighbors that they are also exiting the algorithm before the algorithm continues with the next round. If
the weights are assigned in a random fashion to the vertices of L(G) then the expected number of rounds of Luby’s algorithm is O(log |VL(G) |) [7,10]. Algorithm 1 can also be viewed as executing in synchronous rounds with communication only taking place once in each iteration of the main while-loop and with all remaining vertices participating. Doing so, we get the following observation.
Lemma 1. If the edge weights are assigned randomly then Algorithm 1 is expected to terminate in O(log |E|) rounds.
We next consider the time spent on each vertex in running Algorithm 1. The only non-trivial decision that has to be made in Algorithm 1 is how to maintain the set S and how to implement the HS (v) function efficiently. The easiest way of doing this is to initially presort the edges incident on each vertex by decreasing weight. One can then find HS (v) in O(1) time. Also, when removing a vertex from S one can maintain this list by an O(1) update. The total time spent computing on a vertex v will then be dominated by the sorting: O(δ(v) log δ(v)) = O(Δ log Δ) where Δ is the maximum degree in the graph. The accumulated work performed on all vertices is given by Σ_{vi ∈ V} O(δ(vi ) log δ(vi )) = O(|E| log Δ). Comparing this with the GM algorithm we see that the work has decreased from O(|E| log |E|) although it is still not linear.
We next show that if the probability of each edge incident on a vertex being removed is uniform then it is not difficult to get the expected accumulated cost of the algorithm down to linear. The way to do this is by keeping S as an unordered linked list for each vertex v ∈ V . In this list, we also store the weight w(v, x) with the vertex x ∈ S. In addition, we maintain a pointer rv that points to the vertex in S such that w(v, rv ) is the maximum over all vertices in S. Then we can determine HS (v) in O(1) time but whenever the vertex pointed to by rv is deleted from S we must perform a linear scan of the remaining vertices in S to find the new value of rv . But as the following result shows this is not expected to happen too often.
Lemma 2. Let v be a vertex in G. If in each round of Algorithm 1 the probability is uniform that a particular element of S will be removed next, then the expected amount of time needed to maintain the value of rv throughout the algorithm is O(δ(v)).
Proof. Note first that if v associates a local numbering from 1 through |N (v)| with the nodes in N (v), then given a vertex u ∈ N (v) it is possible to locate its position in S in time O(1), since the linked list representing S can be stored in an array of length |N (v)|. In the worst-case scenario, either a vertex v will not be matched until it has only one incident edge remaining or it will not be matched at all. Before this happens, every time the vertex pointed to by rv is removed from S, a cost proportional to the current size of S is incurred; if a different vertex is removed, the cost is only O(1). Let Ck be the expected cost of maintaining the value of
rv for a vertex of degree k. Since at any point we are equally likely to delete any of the remaining edges, the value of Ck is given by the recursion
$$C_k = \frac{1}{k}\,(k + C_{k-1}) + \frac{k-1}{k}\,(1 + C_{k-1}),$$
where k = δ(v) and C1 = 1. Rearranging and expanding Ck−1 , we get
$$\begin{aligned}
C_k &= 1 + \frac{k-1}{k} + C_{k-1} && (1)\\
    &= 2 - \frac{1}{k} + C_{k-1} && (2)\\
    &= 2k - H_k, && (3)
\end{aligned}$$
where Hk is the kth harmonic number, Hk = 1 + 1/2 + · · · + 1/k. We note that Lemma 2 does not constitute a formal proof that the cost of the algorithm is linear when the edge weights are assigned in a random fashion. The probability that a particular vertex is removed from some set S depends on the relative size of the associated edge. A heavier edge is less likely to be dominated by one of its adjacent edges while a lighter edge is more likely to be dominated. Thus, one would expect that the edge (v, rv ) is in fact less likely to be dominated than any of the other edges incident on v.
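The closed form in the proof of Lemma 2 can be checked numerically. The short Python sketch below iterates the recursion for Ck with exact rational arithmetic and compares the result with 2k − Hk; the function names `expected_cost` and `harmonic` are chosen for this illustration only.

```python
from fractions import Fraction


def expected_cost(k):
    """Iterate C_j = (1/j)(j + C_{j-1}) + ((j-1)/j)(1 + C_{j-1}) from C_1 = 1."""
    C = Fraction(1)
    for j in range(2, k + 1):
        C = Fraction(1, j) * (j + C) + Fraction(j - 1, j) * (1 + C)
    return C


def harmonic(k):
    """k-th harmonic number H_k = 1 + 1/2 + ... + 1/k as an exact fraction."""
    return sum(Fraction(1, j) for j in range(1, k + 1))


# Compare the recursion with the closed form 2k - H_k for small degrees.
closed_form_holds = all(
    expected_cost(k) == 2 * k - harmonic(k) for k in range(1, 51)
)
```

Since Hk grows only logarithmically, the closed form confirms that the expected cost is linear in the degree k.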
3 Parallelizing the Hoepman Algorithm
For any realistic data set and parallel computer, one would expect that the number of processors p is far less than the number of vertices in the graph. Thus, in a parallel algorithm each processor must handle several vertices of the graph. It would be possible, although not very practical, to let each processor simulate several processes such that one could keep Algorithm 1 unchanged. Instead, we first develop a sequential version of Algorithm 1 that each processor will run on its allocated vertices and then separately look at how to handle communication between the processors.
3.1 A Sequential Algorithm
The sequential version is shown in Algorithm 2. The algorithm now uses an indexed variable c(v) to point to the current best match of vertex v and if c(c(v)) = v then v and c(v) are considered to be matched and the edge (v, c(v)) is added to the set M of matched edges. Similarly to Algorithm 1, we use a set Sv initialised to N (v) to hold the neighbours of v that might still be candidates to match with. The algorithm starts by finding all the edges that are dominating in the initial graph. The endpoints of each such edge are added to a set D while the edge itself is added to M . Note that we avoid adding endpoints twice to D because c(c(v)) = null the first time a dominant edge is considered for inclusion.
Algorithm 2. The sequential matching algorithm.
procedure SequentialMatching(G = (V, E))
    for each v ∈ V do
        c(v) = null
    D ← ∅
    M ← ∅
    for each v ∈ V do
        Sv ← N(v)
        c(v) ← HSv(v)
        if c(c(v)) = v then
            D = D ∪ {v, c(v)}
            M = M ∪ {(v, c(v))}
    while D ≠ ∅ do
        v ← some vertex from D
        D = D \ {v}
        for each x ∈ Sv \ {c(v)} where (x, c(x)) ∉ M do
            Sx ← Sx \ {v}
            c(x) ← HSx(x)
            if c(c(x)) = x then
                D = D ∪ {x, c(x)}
                M = M ∪ {(x, c(x))}
    return M
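Algorithm 2 might be sketched in Python roughly as follows. The dictionary-based graph representation, the helper names (`sequential_matching`, `best`, `try_match`), the explicit guards against null candidates and the example graph are assumptions of this sketch; HS(v) is computed here by a plain scan rather than by the data structures of Section 2.

```python
def sequential_matching(adj, weight):
    """Sketch of Algorithm 2: return the set M of matched edges."""
    S = {v: set(adj[v]) for v in adj}   # S_v: remaining candidates of v
    c = {v: None for v in adj}          # c(v): current best match of v
    D, M = [], set()

    def best(v):                        # H_{S_v}(v), or None when S_v is empty
        return max(
            S[v],
            key=lambda x: (weight[frozenset((v, x))], max(v, x), min(v, x)),
            default=None,
        )

    def try_match(v):                   # record (v, c(v)) if it is dominating
        u = c[v]
        if u is not None and c[u] == v:
            D.extend((v, u))
            M.add(frozenset((v, u)))

    # Find the edges that are dominating in the initial graph.
    for v in adj:
        c[v] = best(v)
        try_match(v)

    # Inform unmatched neighbours of matched vertices and look for new matches.
    while D:
        v = D.pop()
        for x in S[v] - {c[v]}:
            if c[x] is not None and frozenset((x, c[x])) in M:
                continue                # x is already matched
            S[x].discard(v)
            c[x] = best(x)
            try_match(x)
    return M


# Hypothetical example: a cycle on five vertices with distinct weights.
weight = {frozenset(e): w for e, w in
          [((0, 1), 4), ((1, 2), 5), ((2, 3), 3), ((3, 4), 6), ((0, 4), 1)]}
adj = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {0, 3}}
M = sequential_matching(adj, weight)
```

On this example the two heaviest non-adjacent edges, (3, 4) and (1, 2), are both dominating in the initial graph, so the while-loop only has to notify vertex 0 that its neighbours are no longer available.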
Now that the initial dominating edges have been found, we must for each vertex v ∈ D inform the unmatched neighbours of v that v is no longer a candidate for matching. Thus we traverse the unmatched neighbours of v (stored in Sv ) and for each such neighbour x which has not yet been matched, we remove v from Sx indicating that v is no longer a candidate to match with. We then update c(x) and if this results in a new matching, that is, if c(c(x)) = x, then we add x and c(x) to D. It is fairly straightforward to see that Algorithm 2 produces exactly the same matching as Algorithm 1; therefore, we omit a formal proof. Also, the implementation details of HS (v) that were discussed in Section 2 also apply to Algorithm 2.
3.2 A Parallel Algorithm
We now outline our parallel algorithm. As stated, each processor is responsible for a block of vertices. Each processor then holds information about its own vertices and all edges incident on these. It also has information about on which processors its adjacent vertices reside. We note that the incident edges of each vertex v can be ordered based only on information from N (v). In the case of a complete graph we use a regular block partition where each processor gets n/p vertices and in the case of a sparse graph we use the Metis graph partitioning library [9] to achieve an even partition where the number of crossing edges is kept small.
In our implementation, we have used ghost-vertices to make handling of crossing edges easier. Thus, if a vertex v is assigned to processor i and has neighbours w1 , w2 , . . . , wk that reside on processor j where i ≠ j, we add a ghost vertex v′ on processor j and edges (v′, wl ) also on processor j for 1 ≤ l ≤ k.
Once the graph has been partitioned and distributed across the processors, each processor will start to run Algorithm 2 on its regular vertices. This will run until no more dominant edges can be found. At this stage, each boundary vertex x that has become unavailable because it has matched will send a message to its own corresponding ghost vertices {x′1 , x′2 , . . . , x′k } and inform them that it is no longer available and that they should also make themselves unavailable for matching. In the case that x has changed status and now wants to match with a ghost vertex y′ residing on its processor, a message will be sent to the corresponding ghost vertex x′ residing on the same processor as y, to instruct x′ to try to match with y. If this results in a matching being discovered, the associated vertices x′ and y are added to D while (x′, y) is added to M . Note that in this case the edge (x, y′) will also be added to M . The while-loop of Algorithm 2 is then run again on each processor. This interleaving of communication with local matching is continued until the set D is empty on each processor.
We note that this separation of computation and communication into distinct, rather than intermingled, stages results in a BSP-type algorithm [2]. The parallelism in the algorithm is obtained from the assumption that each processor will have a large number of local matches to perform between the communication rounds. However, it is not difficult to come up with examples that would sequentialise the algorithm. 
For example, using two processors and a graph that is a straight line with increasing edge weights where the even vertices are assigned to Processor 1 and the odd vertices are assigned to Processor 2, would require that only one edge could be matched in each round.
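The effect of such a worst-case input can be seen in a simplified Python sketch that counts synchronous rounds, where every round matches all edges that are currently dominating. This is an idealised model (no processors, ghost vertices or partitioning are simulated), and the function name `matching_rounds` and the two example paths are assumptions of this illustration.

```python
def matching_rounds(weight):
    """Return the edge sets matched in each synchronous round, where a round
    matches every edge that is dominating among the remaining vertices."""
    adj = {}
    for e in weight:
        u, v = tuple(e)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    alive = set(adj)                    # vertices that are still unmatched
    rounds = []
    while True:
        best = {}
        for v in alive:
            best[v] = max(
                adj[v] & alive,
                key=lambda x: (weight[frozenset((v, x))], max(v, x), min(v, x)),
                default=None,
            )
        dominating = {
            frozenset((v, best[v]))
            for v in alive
            if best[v] is not None and best[best[v]] == v
        }
        if not dominating:
            return rounds
        rounds.append(dominating)
        for e in dominating:
            alive -= e                  # both endpoints leave the graph


# Path on 8 vertices with increasing edge weights: one match per round.
increasing = {frozenset((i, i + 1)): i + 1 for i in range(7)}
# Same path with alternating heavy and light edges: all matches in one round.
alternating = {frozenset((i, i + 1)): (5 if i % 2 == 0 else 1) for i in range(7)}

rounds_increasing = matching_rounds(increasing)
rounds_alternating = matching_rounds(alternating)
```

With increasing weights only the heaviest remaining edge is dominating in each round, so the four matched edges are found one at a time, whereas the alternating weights let all four be found at once.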
4 Experiments
We have performed a set of experiments on an SGI Origin 3800 using up to 32 processors. For our input data we have used complete graphs with random weights on the edges as well as sparse graphs from the University of Florida sparse matrix collection [3]. The top chart of Figure 1 displays the running time for a complete graph on 5000 vertices with random edge weights, as different numbers of processors are applied. As one can see, the running time decreases evenly as more processors are applied. The only exception that is observed is when going from one to two processors, where the running time increases by about 50%. This is due to the extra overhead incurred by the algorithm. In the bottom chart of Figure 1, one can see the running time for the graph crankseg_1 from the University of Florida sparse matrix collection [3]. This is a sparse graph of 52,804 vertices and 5,280,703 edges. For this graph we display both the running time when the graph is partitioned using Metis and when using a block partitioning. As one can observe, there is a significant effect when partitioning the graph using Metis. This is due to far fewer crossing edges when using Metis than when using block partitioning. As a consequence, the algorithm also requires fewer rounds with Metis.
[Figure 1: two line charts of running time (Time) against Number of processors; the bottom chart shows one curve for the Metis partitioning and one for the Block partitioning.]
Fig. 1. Running time in seconds for a complete graph with 5000 vertices (top) and for a sparse graph with about 50000 vertices and 5 million edges (bottom)
5 Conclusion
In this work, we have shown that the distributed matching algorithm by Hoepman [8] lends itself well to execution on a parallel computer. Moreover, we have shown that this algorithm is closely related to Luby’s maximal independent set algorithm. In the future, we intend to report more detailed results from our experiments including incorporations of our code in existing linear solvers. We will also look more closely at extending the algorithm by using short augmenting paths to improve the approximation ratio of the matching even further.
References
1. Avis, D.: A survey of heuristics for the weighted matching problem. Networks 13, 475–493 (1983)
2. Bisseling, R.H.: Parallel Scientific Computation: A Structured Approach Using BSP and MPI. Oxford University Press, Oxford (2004)
3. Davis, T.: University of Florida sparse matrix collection. NA Digest 92(42) (October 16, 1994), NA Digest 96(28) (July 23, 1996), NA Digest 97(23) (June 7, 1997), http://www.cise.ufl.edu/research/sparse/matrices
4. Drake, D.E., Hougardy, S.: A linear-time approximation algorithm for weighted matchings in graphs. ACM Transactions on Algorithms 1, 107–122 (2005)
5. Duff, I.S., Koster, J.: The design and use of algorithms for permuting large entries to the diagonal of sparse matrices. SIAM J. Matrix Anal. Appl. 20, 889–901 (1999)
6. Duff, I.S., Koster, J.: On algorithms for permuting large entries to the diagonal of a sparse matrix. SIAM J. Matrix Anal. Appl. 22, 973–996 (2001)
7. Grama, A., Gupta, A., Karypis, G., Kumar, V.: Introduction to Parallel Computing, 2nd edn. Addison-Wesley, Reading (2003)
8. Hoepman, J.-H.: Simple distributed weighted matchings, arXiv:cs/0410047v1 (2004)
9. Karypis, G., Kumar, V.: Metis, a software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices. Version 4.0 (1998)
10. Luby, M.: A simple parallel algorithm for the maximal independent set problem. SIAM J. Comput. 15, 1036–1053 (1986)
11. Pettie, S., Sanders, P.: A simpler linear time 2/3 − ε approximation for maximum weight matching. Inf. Process. Lett. 91, 271–276 (2004)
12. Preis, R.: Linear time 1/2-approximation algorithm for maximum weighted matching in general graphs. In: Meinel, C., Tison, S. (eds.) STACS 1999. LNCS, vol. 1563, pp. 259–269. Springer, Heidelberg (1999)
Heuristics for a Matrix Symmetrization Problem
Bora Uçar
CERFACS, 42 Avenue Gaspard Coriolis, 31057, Toulouse, Cedex 1, France
[email protected]
Abstract. We consider the following problem: given a square, nonsymmetric, (0, 1)-matrix, find a permutation of its columns that yields a zero-free diagonal and maximizes the symmetry. The problem is known to be NP-hard. We propose a fast iterative-improvement based heuristic and evaluate the performance of the heuristic on a large set of matrices.
Keywords: unsymmetric sparse matrix, bipartite matching, matrix symmetrization.
1 Introduction
We consider the matrix symmetrization problem for square (0, 1)-matrices. A square matrix A is symmetrizable if its columns can be permuted to yield a symmetric matrix. Deciding whether a given matrix is symmetrizable is NP-complete [4]. The variation of the problem which restricts the permutations such that the permuted matrix has a zero-free diagonal is also NP-complete [2]. In this work, we are interested in the optimization version of the matrix symmetrization with zero-free diagonal problem, i.e., given a square (0, 1)-matrix, find a permutation of the columns that yields a zero-free diagonal and maximizes the symmetry. We assume that there is at least one permutation that yields a zero-free diagonal. The problem arises in a preprocessing phase of some other algorithms. For example, when a given sparse matrix A has an unsymmetric pattern, most of the graph partitioning and ordering algorithms are applied to the pattern of the symmetric completion A + A^T (ignoring numerical cancellation); see discussions in [8,10,15]. A remark which usually accompanies using the pattern of the symmetric completion is that this trick would be appropriate only if the matrix is nearly symmetric. The techniques proposed in this work can be used to make a given matrix more symmetric and obtain a sparser symmetric completion. In other words, the proposed techniques can help improve the running time of the aforementioned algorithms and the quality of their solutions. Since the decision problem is NP-complete, and hence the optimization version that we are interested in is NP-hard, we propose a heuristic algorithm. We test the heuristic on a large set of matrices. We also report encouraging experiments on a certain matrix ordering problem.
This work was supported by “Agence Nationale de la Recherche”, ANR-06-CIS6-010.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 718–727, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Method Description
In this section, A is a square (0, 1)-matrix of size n × n. As is common, we associate a bipartite graph G = (R∪C, E) with the matrix A, where R and C are the two sets in the vertex bipartition, and E is the set of edges. Here, the vertices in R and C correspond, respectively, to the rows and the columns of A such that (ri , cj ) ∈ E if and only if aij = 1. Although the edges are undirected, we will always specify an edge e ∈ E as e = (ri , cj ); the first vertex will always be a row vertex and the second one will always be a column vertex. An edge e = (ri , cj ) is said to be incident on the vertices ri and cj . We have the following notation: N (ri ) denotes the neighbors of a row vertex ri , i.e., N (ri ) = {cj : (ri , cj ) ∈ E}; similarly N (cj ) = {ri : (ri , cj ) ∈ E} denotes the neighbors of a column vertex cj ; for a set s, |s| denotes its cardinality; d(·) denotes the degree of a vertex, e.g., d(ri ) = |N (ri )|; and wij denotes the weight of an edge (ri , cj ). We recall some standard definitions and well-known results. An even cycle contains an even number of vertices. A set of edges M is a matching if no two edges in M are incident on the same vertex. Given a matching M, an M-alternating cycle is a simple cycle whose edges are alternately in M and not in M. A matching is called perfect if for any vertex v in G, there is an edge in M incident on v. If the edges are weighted, then the weight of a matching w(M) is equal to the sum of the weights of its edges, i.e., w(M) = Σ_{(ri , cj )∈M} wij . A maximum weight perfect matching on a weighted graph is a perfect matching with maximum weight. Both the perfect matching problem and the maximum weight perfect matching problem are efficiently solvable [7,13]. We use mate(v) to denote the vertex matched to the vertex v in a matching M. That is, if (ri , cj ) ∈ M, then mate(ri ) = cj and mate(cj ) = ri . 
It is well known that perfect matchings in the bipartite graph $G$ correspond to permutations which yield zero-free diagonals; see, for example, [7]. A matching edge $(r_i, c_j)$ is used to permute the column $c_j$ to the $i$th position, yielding a zero-free diagonal. This is achieved by defining the permutation matrix $M$ as

$$m_{ij} = \begin{cases} 1 & \text{if } (r_i, c_j) \in \mathcal{M}, \\ 0 & \text{otherwise,} \end{cases}$$

and then by multiplying $A$ on the right by $M$, i.e., by forming $AM$. We will use calligraphic letters for perfect matchings, e.g., $\mathcal{M}$, and the corresponding italic, Roman letters, e.g., $M$, for the associated permutation matrices. For a given square (0, 1)-matrix $A$, we define the symmetry score as

$$S(A) = \sum_{a_{ij} \neq 0} a_{ij} a_{ji}. \qquad (1)$$

As seen from the formula, each nonzero entry $a_{ij}$ contributes either 0 or 1 to the score. Hence, for a symmetric matrix $A$, $S(A)$ is equal to the number of its nonzeros. For a given column permutation $M$, $S(AM)$ measures the symmetry score of the permuted matrix.
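As a small illustration of Eq. (1), the following Python sketch evaluates the symmetry score of a dense 0-1 pattern before and after a column permutation. The function names `symmetry_score` and `permute_columns`, the 0-based indexing, and the list `mate_of_row` (entry `i` holds the column matched to row `i`) are assumptions made for this example; they do not come from the paper.

```python
def symmetry_score(a):
    """S(A): count nonzeros a[i][j] whose mirror entry a[j][i] is also nonzero."""
    n = len(a)
    return sum(1 for i in range(n) for j in range(n) if a[i][j] and a[j][i])


def permute_columns(a, mate_of_row):
    """Form AM: column mate_of_row[i] of A is moved to position i."""
    n = len(a)
    return [[a[r][mate_of_row[i]] for i in range(n)] for r in range(n)]


# A 3 x 3 pattern with a zero on its diagonal (0-based indices).
A = [[0, 1, 1],
     [1, 1, 1],
     [1, 0, 1]]
mate_of_row = [1, 2, 0]          # perfect matching (r0,c1), (r1,c2), (r2,c0)
AM = permute_columns(A, mate_of_row)

print(symmetry_score(A))         # 6
print(symmetry_score(AM))        # 7: every nonzero of AM has a symmetric partner
```

In this example the permuted pattern is symmetric, so its score equals its number of nonzeros, as stated above.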
B. Uçar
2.1 The Heuristic
We propose an iterative-improvement-based heuristic. The proposed heuristic works on the bipartite graph representation of a given matrix. It starts with a perfect matching to guarantee a zero-free diagonal, and then iteratively improves the current matching to increase the symmetry while maintaining a perfect matching at all times.

Algorithm 1. Compute the symmetry score
Input: a bipartite graph $G = (R \cup C, E)$ corresponding to an $n \times n$ matrix $A$
Input: a perfect matching $\mathcal{M}$
Output: score $= S(AM)$
1: mark($r$) ← 0 for all $r \in R$
2: score ← 0
3: for each $(r_i, c_j) \in \mathcal{M}$ do
4:     for each $c \in N(r_i)$ do
5:         mark(mate($c$)) ← $j$        ▷ mark $r_i$ too
6:     for each $r \in N(c_j)$ do
7:         if mark($r$) $= j$ then
8:             score ← score + 1        ▷ increase by one for $(r_i, c_j)$, also for a symmetric entry $(r_k, c_j) \notin \mathcal{M}$ with $r_k =$ mate($c$) for a $c \in N(r_i)$
Given a perfect matching $\mathcal{M}$, the symmetry score of the permuted matrix $AM$ can be computed as shown in Algorithm 1. The algorithm runs in $O(|E|)$ time, since the edges incident on a vertex $v$ are visited only when the matching edge incident on $v$ is processed at line 3. Note that for a matching edge $(r_i, c_j) \in \mathcal{M}$, the test at line 7 holds, and the matching edge (a diagonal entry in the permuted matrix $AM$) contributes one to the score. Consider two matching edges $(r_i, c_j)$ and $(r_k, c_l)$ such that $(r_i, c_l) \in E$ and $(r_k, c_j) \in E$. These four edges form an $\mathcal{M}$-alternating cycle of length four, and herald two off-diagonal symmetric entries in addition to the two diagonal entries in $AM$. The score is incremented by 2 for those two off-diagonal symmetric entries in two steps: by 1 when the “for loop” at line 3 is processing $(r_i, c_j)$, and by 1 when it is processing $(r_k, c_l)$. Let $C4$ be the set of unique alternating cycles of length four; then the score will be

$$S(AM) = n + 2 \times |C4|. \qquad (2)$$
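The marking scheme of Algorithm 1 and the identity in Eq. (2) can be checked on a small example. The sketch below is one possible Python transcription, not the author's C implementation. It assumes 0-based indices, a pattern given as row adjacency lists `row_adj`, and a matching given as `mate_of_row`; these names, and the helper `count_c4`, are hypothetical. Because indices start at 0, the sentinel for an unmarked row is -1 rather than the 0 used in the listing.

```python
def score_by_marking(row_adj, mate_of_row):
    """Algorithm 1: S(AM) for the matching (r_i, c_{mate_of_row[i]})."""
    n = len(row_adj)
    col_adj = [[] for _ in range(n)]
    for i, cols in enumerate(row_adj):
        for c in cols:
            col_adj[c].append(i)
    mate_of_col = [0] * n
    for i, j in enumerate(mate_of_row):
        mate_of_col[j] = i
    mark = [-1] * n                      # -1 means "not marked yet"
    score = 0
    for i, j in enumerate(mate_of_row):  # line 3: each matching edge (r_i, c_j)
        for c in row_adj[i]:             # lines 4-5
            mark[mate_of_col[c]] = j
        for r in col_adj[j]:             # lines 6-8
            if mark[r] == j:
                score += 1
    return score


def count_c4(row_adj, mate_of_row):
    """Number of alternating cycles of length four with respect to the matching."""
    n = len(row_adj)
    adj = [set(cols) for cols in row_adj]
    return sum(1 for i in range(n) for k in range(i + 1, n)
               if mate_of_row[k] in adj[i] and mate_of_row[i] in adj[k])


row_adj = [[1, 2], [0, 1, 2], [0, 2]]    # pattern of a 3 x 3 matrix
for mate in ([1, 2, 0], [2, 1, 0]):      # two perfect matchings of that pattern
    s = score_by_marking(row_adj, mate)
    print(mate, s, 3 + 2 * count_c4(row_adj, mate))
```

Both matchings satisfy Eq. (2): the first has two alternating four-cycles (score 7), the second has one (score 5).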
Note that all perfect matchings will result in a symmetry score of at least $n$; therefore, the number of alternating cycles of length four is the important term. Let $\mathcal{M}^*$ and $\mathcal{M}$ be, respectively, an optimal perfect matching maximizing the symmetry score, and another perfect matching on the bipartite graph $G$. Then the symmetric difference $\mathcal{M}^* \oplus \mathcal{M} = (\mathcal{M}^* \setminus \mathcal{M}) \cup (\mathcal{M} \setminus \mathcal{M}^*)$ contains only isolated vertices and even cycles. This is because a given vertex is incident on exactly one edge of $\mathcal{M}^*$ and one edge of $\mathcal{M}$; if those edges are the same, then the vertex becomes isolated, otherwise it becomes a part of a cycle (intrinsically an even cycle as the graph is bipartite). Note that each such cycle is an $\mathcal{M}$-alternating cycle. Therefore, an optimal solution $\mathcal{M}^*$ can be obtained from any perfect matching $\mathcal{M}$ by finding $\mathcal{M}$-alternating cycles and then reversing the membership of the edges along those cycles. Note that this does not imply an efficient algorithm, as there are combinatorially many alternating cycles with respect to a given matching.

Having observed the role of the alternating cycles in improving a given matching, and having noted Eq. 2, we propose to work on the alternating cycles of length four to improve a given perfect matching. Similar to some other iterative-improvement-based heuristics, such as that in [12], we organize the refinement process in passes. At each pass, we build the set $C4$ of unique alternating cycles of length four using an algorithm much like Algorithm 1. Then, we visit the unique alternating cycles of length four in a random order. Among these cycles, those whose vertices are not affected by a previous operation and whose reversal has a nonnegative effect on the symmetry score are reversed. That is, in a four cycle, the matching edges are replaced by the non-matching ones. Algorithm 2 shows the actions taken within a pass.

Algorithm 2. Refine a perfect matching
Input: a bipartite graph $G = (R \cup C, E)$ corresponding to an $n \times n$ matrix $A$
Input: a perfect matching $\mathcal{M}^{(0)}$
Output: another perfect matching $\mathcal{M}^{(1)}$ where $S(AM^{(1)}) \geq S(AM^{(0)})$
1: $\mathcal{M}^{(1)}$ ← $\mathcal{M}^{(0)}$
2: $C4$ ← $\{(r_1, c_1, r_2, c_2) : (r_1, c_1) \in \mathcal{M}^{(1)}$ and $(r_2, c_2) \in \mathcal{M}^{(1)}$ and $(r_1, c_2) \in E$ and $(r_2, c_1) \in E\}$
3: while $C4 \neq \emptyset$ do
4:     pick a cycle $C = (r_1, c_1, r_2, c_2) \in C4$
5:     if isReversible($C$) and gain($C$) $\geq 0$ then
6:         $\mathcal{M}^{(1)}$ ← $\mathcal{M}^{(1)} \oplus C$
7:     remove the cycle $C$ from $C4$
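One refinement pass in the spirit of Algorithm 2 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the author's implementation. It assumes 0-based indices; the names `adjacency`, `local_score`, and `refine_pass` are hypothetical; the reversibility test is done by checking that both rows still have their initial mates; and the cycle set is built by a simple quadratic scan over row pairs instead of an Algorithm 1 style traversal. The gain is taken as the difference of the Algorithm 1 inner-loop counts for the two affected matching edges before and after the reversal.

```python
import random


def adjacency(row_adj):
    """Return (row neighbor sets, column neighbor lists) for a square pattern."""
    n = len(row_adj)
    row_set = [set(cols) for cols in row_adj]
    col_adj = [[] for _ in range(n)]
    for i, cols in enumerate(row_adj):
        for c in cols:
            col_adj[c].append(i)
    return row_set, col_adj


def local_score(rows, mate, row_set, col_adj):
    """Inner loops of Algorithm 1, run only for the matching edges of `rows`."""
    return sum(1 for i in rows for r in col_adj[mate[i]] if mate[r] in row_set[i])


def refine_pass(row_adj, mate_of_row, seed=0):
    """One pass: reverse unaffected alternating 4-cycles with nonnegative gain."""
    n = len(row_adj)
    row_set, col_adj = adjacency(row_adj)
    mate = list(mate_of_row)
    # Line 2: alternating 4-cycles with respect to the initial matching.
    cycles = [(r1, mate[r1], r2, mate[r2])
              for r1 in range(n) for r2 in range(r1 + 1, n)
              if mate[r2] in row_set[r1] and mate[r1] in row_set[r2]]
    random.Random(seed).shuffle(cycles)          # visit in a random order
    for r1, c1, r2, c2 in cycles:
        if mate[r1] != c1 or mate[r2] != c2:     # isReversible(C) is false
            continue
        before = local_score((r1, r2), mate, row_set, col_adj)
        mate[r1], mate[r2] = c2, c1              # tentatively reverse the cycle
        if local_score((r1, r2), mate, row_set, col_adj) < before:
            mate[r1], mate[r2] = c1, c2          # negative gain: undo
    return mate


row_adj = [[1, 2], [0, 1, 2], [0, 2]]
start = [2, 1, 0]
row_set, col_adj = adjacency(row_adj)
refined = refine_pass(row_adj, start)
print(local_score(range(3), start, row_set, col_adj))    # 5
print(refined, local_score(range(3), refined, row_set, col_adj))
```

Calling `local_score` with all rows reproduces the full symmetry score, which makes it easy to confirm that a pass never decreases the score.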
The test isReversible($C$) returns true if none of the vertices $\{r_1, c_1, r_2, c_2\}$ has been moved before. In other words, this test returns true if the current matching contains the edges $(r_1, c_1)$ and $(r_2, c_2)$. The gain computations are done by using the main “for loop” of Algorithm 1 for the edges $(r_1, c_1)$ and $(r_2, c_2)$, and then for the edges $(r_1, c_2)$ and $(r_2, c_1)$. The difference between the returned scores gives the gain of reversing the edges in the cycle $C$.

In the worst case, when all gain computations return negative values, Algorithm 2 evaluates the gain of all alternating cycles of length four. We first note that a vertex $v$ can be in at most $d(v) - 1$ alternating cycles of length four, because one of its neighbors in all such cycles should be its mate. For each such cycle, the edges incident on $v$ are visited during gain computations. Therefore, the gain computation operations can spend at most $O(d(v)^2)$ time on a vertex $v$. Hence the worst case total time can be bounded as $O\left(\sum_{r \in R} d(r)^2 + \sum_{c \in C} d(c)^2\right)$. In practice, a running time lower than this bound can be expected, for two reasons. First, there are negative terms adding up to $O(|E|)$ that we omit. Second, the number of length four alternating cycles containing a vertex $v$ cannot be larger than $d(\text{mate}(v)) - 1$.

We make the following observations: (1) the set of alternating cycles of length four, $C4$, is constructed according to the initial matching; (2) each cycle in $C4$ is considered at most once; (3) due to the nonnegative gain requirement, the algorithm cannot escape from a local optimum. Having observed these deficiencies, we designed two alternatives. The first alternative starts with the same set $C4$ as in Algorithm 2 but uses more involved data structures and operations. It maintains $C4$ as a priority queue using the gain of a cycle as its key; tentatively modifies the current matching along the best cycle, even along those with a negative effect; and, at the end, realizes the most profitable prefix of modifications. This approach obtained better results than Algorithm 2, with increases in running time. The second alternative visits the row vertices in a random order, computes the best length four alternating cycle containing that vertex, and modifies the current matching along that cycle if the gain is nonnegative. Algorithm 2 outperformed this alternative in terms of solution quality.

Another point worth mentioning is that computing the block triangular form (btf) of a matrix can help reduce the running time of the algorithm. It is well known that the entries outside the diagonal blocks of a btf cannot belong to a perfect matching; see, for example, [6, Chapter 6] and [14]. Therefore, the corresponding edges in the bipartite graph $G$ cannot belong to an alternating cycle. Those edges can be discarded to reduce the running time without any effect on the solution quality of the proposed algorithm.

2.2 Upper Bounds and Initial Matching
Consider a row $r_i$ and a column $c_j$ where $a_{ij} = 1$. For any matching $\mathcal{M}$ with $(r_i, c_j) \in \mathcal{M}$, the contribution of the matching edge $(r_i, c_j)$ to $S(AM)$ is limited by $\min\{d(r_i), d(c_j)\}$, or, equivalently, by the smaller of the number of nonzeros in the corresponding row and column of $A$. Consider a maximum weight perfect matching $\mathcal{M}_{B1}$, subject to edge weights $w_{ij} = \min\{d(r_i), d(c_j)\}$, in the bipartite graph $G$. The weight $w(\mathcal{M}_{B1})$ defines an upper bound, referred to as UB1, on the attainable symmetry score.

We further obtain an improved upper bound by observing that all neighbors of the vertex $r_i$ may not be matchable to the neighbors of $c_j$ (or vice versa). In fact, two vertices $r_i$ and $c_j$ can contribute at most by the weight of the maximum weight matching $\mathcal{M}_{ij}$, subject to unit edge weights, in the induced subgraph $G_{ij} = (N(c_j) \cup N(r_i),\ E \cap (N(c_j) \times N(r_i)))$. Consider a maximum weight perfect matching $\mathcal{M}_{B2}$, subject to edge weights $w_{ij} = w(\mathcal{M}_{ij})$, in $G$. The weight $w(\mathcal{M}_{B2})$ defines an upper bound, referred to as UB2, on the attainable symmetry score.

Both of the perfect matchings $\mathcal{M}_{B1}$ and $\mathcal{M}_{B2}$ can be used as an initial solution. Note that we have $w(\mathcal{M}_{B2}) \leq w(\mathcal{M}_{B1})$. However, we do not have any relation between the resulting symmetry scores. The CPU cost of finding $\mathcal{M}_{B2}$ can be very high. Therefore, we prefer using $\mathcal{M}_{B1}$ as an initial solution.
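For very small patterns, the two bounds can be evaluated by brute force, which may help in checking an implementation. The Python sketch below is only an illustration: it enumerates all perfect matchings with `itertools.permutations` instead of solving a weighted matching problem as the paper does with mc64, so it is usable only for tiny $n$. The names `max_matching_size` and `upper_bounds` are hypothetical, indices are 0-based, and the pattern is assumed to admit at least one perfect matching (otherwise `max` raises an error).

```python
from itertools import permutations


def max_matching_size(left_rows, right_cols, row_set):
    """Maximum cardinality matching between the given rows and columns,
    found with simple augmenting paths."""
    match_col = {}                       # column -> row currently matched to it

    def augment(r, seen):
        for c in right_cols:
            if c in row_set[r] and c not in seen:
                seen.add(c)
                if c not in match_col or augment(match_col[c], seen):
                    match_col[c] = r
                    return True
        return False

    return sum(1 for r in left_rows if augment(r, set()))


def upper_bounds(row_adj):
    """Return (UB1, UB2) by enumerating every perfect matching (tiny n only)."""
    n = len(row_adj)
    row_set = [set(cols) for cols in row_adj]
    col_adj = [[i for i in range(n) if j in row_set[i]] for j in range(n)]
    w1, w2 = {}, {}
    for i in range(n):
        for j in row_adj[i]:
            w1[i, j] = min(len(row_adj[i]), len(col_adj[j]))
            # Induced subgraph G_ij: rows N(c_j) against columns N(r_i).
            w2[i, j] = max_matching_size(col_adj[j], row_adj[i], row_set)

    def best(w):
        return max(sum(w[i, p[i]] for i in range(n))
                   for p in permutations(range(n))
                   if all(p[i] in row_set[i] for i in range(n)))

    return best(w1), best(w2)


print(upper_bounds([[1, 2], [0, 1, 2], [0, 2]]))   # (7, 7)
```

For this pattern, which is a column permutation of a symmetric pattern with seven nonzeros, both bounds equal the optimal symmetry score.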
3 Experiments
We present results with two sets of square matrices from the University of Florida Sparse Matrix Collection [5]. Matrices in the first set originally have a symmetric nonzero pattern and full sparse rank, and they satisfy the following assertions regarding the order $n$ and the number of nonzeros nnz: $n \geq 1000$, nnz $\leq 10^6$, nnz $\geq 3 \times n$. There were a total of 420 such matrices (out of 1840) at the time of writing. We ensured that the original matrices have a zero-free diagonal. Then, we permuted the rows and columns of those matrices randomly and created unsymmetric matrices. If $B$ is the pattern of an original matrix, then the corresponding matrix $A$ in this set has the pattern of $P(B + I)Q$, where $I$ is the $n \times n$ identity matrix, and $P$ and $Q$ are random permutation matrices. That is, for a matrix $A$ in this set, the optimal symmetry score is equal to the number of nonzeros in $A$.

The matrices in the second set are the 28 public domain matrices used in [7,15]. These matrices are highly unsymmetric. There are four matrices with an original symmetry score greater than $0.4 \times$ nnz; however, there are zeros in the main diagonal. For these 28 matrices, we do not know the optimal symmetry score. We computed the upper bound UB2 for these matrices.

In the experiments, the initial matching $\mathcal{M}^{(0)}$ is chosen to be $\mathcal{M}_{B1}$, the perfect matching that defines the upper bound UB1. We use mc64 (described in [7]) from the mathematical software library HSL [9] with job=4 to compute $\mathcal{M}_{B1}$. The worst case time complexity of mc64 on an $n \times n$ matrix with $\tau$ nonzeros is $O(n(\tau + n) \log_2 n)$. However, in practice it behaves much faster; see our experiments below and some others in [1] and [7]. We present the performance of the proposed algorithm by normalizing the symmetry score by the upper bound. In what follows, $\mathcal{M}^{(p)}$ is the final perfect matching obtained at the end of the $p$th refinement pass.
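The construction of the first test set described above can be mimicked on a toy pattern. The sketch below is a hedged illustration, not the author's test generator. It assumes patterns stored as 0-based row adjacency lists; the names `scramble_pattern` and `best_score_brute_force`, and the use of Python's `random` module, are choices made for this example. The brute-force search over all column permutations is feasible only for tiny $n$ and is used here just to confirm that the optimal score equals the number of nonzeros.

```python
import random
from itertools import permutations


def scramble_pattern(b_rows, seed=0):
    """Row adjacency lists of P(B + I)Q for a symmetric pattern B."""
    n = len(b_rows)
    rng = random.Random(seed)
    row_perm = list(range(n))
    rng.shuffle(row_perm)      # new row k is old row row_perm[k]          (P)
    col_pos = list(range(n))
    rng.shuffle(col_pos)       # old column j moves to position col_pos[j] (Q)
    with_diag = [set(cols) | {i} for i, cols in enumerate(b_rows)]
    return [sorted(col_pos[j] for j in with_diag[row_perm[k]]) for k in range(n)]


def best_score_brute_force(rows):
    """Largest symmetry score over all column permutations (tiny n only)."""
    n = len(rows)
    sets = [set(r) for r in rows]
    return max(sum(1 for k in range(n) for m in range(n)
                   if p[m] in sets[k] and p[k] in sets[m])
               for p in permutations(range(n)))


B = [[1], [0, 2], [1, 3], [2]]            # symmetric pattern of a path graph
A = scramble_pattern(B, seed=1)
nnz = sum(len(r) for r in A)
print(nnz, best_score_brute_force(A))     # both equal 10
```

The optimum is attained by the column permutation that undoes $Q$ and applies $P^T$, since $P(B + I)P^T$ is symmetric.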
Table 1 displays the minimum, the average, and the maximum normalized symmetry scores of $\mathcal{M}^{(0)} = \mathcal{M}_{B1}$ and $\mathcal{M}^{(p)}$ for $p \in \{1, 3, 5, 7\}$ on 420 originally symmetric matrices. As seen from the average normalized scores, most of the improvements are obtained within the first few passes. In the remainder, we limit the number of refinement passes to five, i.e., we apply Algorithm 2 five times. As seen from the table, the average symmetry score 0.628 of $\mathcal{M}^{(0)}$ is improved by around 38% in five refinement passes, resulting in an average symmetry score of 0.872 for $\mathcal{M}^{(5)}$.

Figure 1 displays the histogram of the normalized symmetry scores for the initial matching $\mathcal{M}^{(0)}$ and the final matching $\mathcal{M}^{(5)}$. The final matching $\mathcal{M}^{(5)}$ obtained the optimal symmetry score for 89 matrices. The initial matching $\mathcal{M}^{(0)}$ obtained the optimal symmetry score only for 4 matrices. There are only 7 instances below 0.628 (the average for $\mathcal{M}^{(0)}$) among the normalized symmetry scores of $\mathcal{M}^{(5)}$.

We argue that the initial choice of using the perfect matching $\mathcal{M}_{B1}$ which attains the upper bound UB1 is an important part of the proposed algorithm. We show this by experimenting with two other initial perfect matchings. The first one is an arbitrary perfect matching $\mathcal{M}_a$ on the bipartite graph $G$ (found using mc64 with job=1). The second one is the perfect matching $\mathcal{M}_{B2}$ which attains the upper bound UB2. For the originally symmetric matrices, the upper bound UB1 is exact. Hence, UB2 and the associated matching coincide with UB1 and its matching. With an arbitrary initial perfect matching, the proposed algorithm obtained an average normalized symmetry score of 0.656 (minimum 0.073 and maximum 0.995), well below the average 0.872 given in the $S(AM^{(5)})$ column of Table 1.

Table 1. Statistics regarding the symmetry scores normalized by the optimal score on 420 matrices. S(A) is the symmetry score without any column permutation.

       S(A)    S(AM^(0))  S(AM^(1))  S(AM^(3))  S(AM^(5))  S(AM^(7))
min    0.000   0.307      0.407      0.500      0.556      0.556
avg    0.005   0.628      0.754      0.834      0.872      0.891
max    0.125   1.000      1.000      1.000      1.000      1.000

Fig. 1. Histogram of the normalized symmetry scores for the initial matching $\mathcal{M}^{(0)}$ (white bars) and the final matching $\mathcal{M}^{(5)}$ (black bars)

For the originally unsymmetric matrices, the two upper bounds are different; therefore, using the three initial matchings makes sense. The average normalized (with respect to UB2) scores after five refinement passes starting from these initial matchings are 0.593 (with $\mathcal{M}_a$), 0.655 (with $\mathcal{M}_{B1}$), and 0.699 (with $\mathcal{M}_{B2}$). Using $\mathcal{M}_{B2}$ as an initial matching also resulted in a better average score with the alternative refinement method which uses priority queues (mentioned towards the end of Section 2.1). However, on some matrices, it took hours to compute $\mathcal{M}_{B2}$. That is, $\mathcal{M}_{B1}$ leads to results similar to those of $\mathcal{M}_{B2}$, and it is affordable.

For reproducibility of the results on originally unsymmetric matrices, we present the upper bound UB2 and our results in Table 2. The results are obtained by starting from the initial matching $\mathcal{M}^{(0)} = \mathcal{M}_{B1}$. On average, the symmetry score of the initial matching is improved by 20% at the end of the fifth refinement pass.

We implemented the algorithm in C, compiled with gcc using option -O3, and performed the tests on a Pentium IV 2.80 GHz PC with 2GB main memory and 1MB cache. The running time of the proposed algorithm (with five refinement
passes) is, on average, 1.6 seconds for those matrices with a normalized initial symmetry score, $S(AM_{B1})/\mathrm{UB2}$, less than 0.9. For the others, it takes much longer. The largest running time was obtained for the matrix ASIC_100k ($n = 99340$, nnz $= 940696$) having an initial normalized score of 0.912 (final score is 0.998). For this matrix, the running time is almost an hour. The running time of the HSL subroutine mc64 is, on average, 0.9 seconds. Therefore, we suggest computing the perfect matching $\mathcal{M}_{B1}$, checking its symmetry score, and proceeding with 3-to-5 refinement passes if the initial result is not satisfactory.

Table 2. Symmetry scores for originally unsymmetric matrices. Matrices in the top block are from [7]; those in the bottom block are from [15]. The upper bounds (UB2) on the symmetry score are not guaranteed to be attainable. The matrices originally have zeros on the main diagonal, therefore we do not display their original symmetry score.

matrix           n       nnz       UB2      S(AM^(0))  S(AM^(5))
av41092          41092   1683902   618198   210806     247112
bayer01          57735   277774    160968   98995      101217
gemat11          4929    33185     31851    26054      31851
goodwin          7320    324784    221955   81901      171292
lhr01            1477    18592     10819    8385       8963
lhr02            2954    37206     21640    16706      17960
lhr14c           14270   307858    173826   146361     151670
lhr71c           70304   1528092   862800   728203     752616
mahindas         1258    7682      3974     2357       2364
onetone1         36057   341088    275064   139182     197513
onetone2         36057   227628    189331   106215     135745
orani678         2529    90158     28550    13875      15509
west1505         1505    5445      3547     1742       1747
west2021         2021    7353      4784     2354       2363

bayer03          6747    56196     28124    13809      14597
circuit_3        12127   48137     39956    30526      34543
extr1            2837    11407     8039     3119       3255
fidapm11         22294   623554    623554   258775     562958
g7jac200sc       59310   837936    554972   228093     266580
hydr1            5308    23752     14450    7254       7518
impcol_d         425     1339      840      507        507
jan99jac020sc    6774    38692     20166    7562       7832
mark3jac140      64089   399735    271990   135451     145771
poli_large       15575   33074     18077    15619      15627
radfr1           1048    13299     8734     6184       6782
rdist1           4134    94408     56804    46238      48152
sinc15           11532   568526    322484   203810     302922
Zhao2            33861   166453    158305   84060      144795

We have tested the effects of the proposed heuristic on the problem of ordering a matrix to doubly bordered block diagonal form. Given a matrix and an integer $K$, the aim is to permute the matrix into $(K + 1) \times (K + 1)$ block form such that there is no nonzero in blocks $(k, \ell)$ and $(\ell, k)$ for $1 \leq k < \ell \leq K$, and the diagonal blocks $(k, k)$ for $1 \leq k \leq K$ are square and have almost equal order. The cost is the size of the border, i.e., the order of the $(K + 1, K + 1)$st block. The problem is NP-hard [3]. A common tool used for this task is MeTiS [11]. MeTiS finds the block form using symmetric permutations. The heuristics used in MeTiS work on symmetric matrices. Therefore, the trick of using the pattern of $A + A^T$ (no numerical cancellation) instead of $A$ is applicable here. We tested MeTiS on the matrices given in Table 2 with $A + A^T$ and $AM^{(5)} + (AM^{(5)})^T$ for $K = 4, 8, 16, 32, 64$. Using $AM^{(5)}$ resulted in, on average, 43% smaller border size than using $A$, with, on average, 36% reduction in the running time. In 110 instances out of 140, using $AM^{(5)}$ resulted in better results than using $A$. The best reduction, 114 versus 8288, is obtained for bayer01 with $K = 4$. The worst increase, 6726 versus 3997, was obtained for sinc15 with $K = 64$; this was an odd case, as MeTiS gave almost the same border size for $K = 4, 8, 16, 32, 64$ when using $A$.
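The symmetrization trick mentioned above, and its link to the symmetry score, can be sketched as follows. The pattern of $A + A^T$ has exactly $2\,\mathrm{nnz}(A) - S(A)$ nonzeros, because the symmetric entries are the ones shared by $A$ and $A^T$; a higher symmetry score therefore yields a sparser symmetrized pattern. The function names and the 0-based row adjacency list format are assumptions made for this illustration.

```python
def symmetrized_pattern(row_adj):
    """Row adjacency lists of the pattern of A + A^T (no numerical cancellation)."""
    sym = [set(cols) for cols in row_adj]
    for i, cols in enumerate(row_adj):
        for j in cols:
            sym[j].add(i)            # entry (i, j) of A is entry (j, i) of A^T
    return [sorted(s) for s in sym]


def symmetry_score(row_adj):
    """S(A) for a pattern stored as row adjacency lists."""
    sets = [set(cols) for cols in row_adj]
    return sum(1 for i, cols in enumerate(row_adj) for j in cols if i in sets[j])


A = [[1, 2], [0, 1, 2], [0, 2]]      # 7 nonzeros, symmetry score 6
P = symmetrized_pattern(A)
print(P)                             # [[1, 2], [0, 1, 2], [0, 1, 2]]
print(sum(len(r) for r in P), 2 * 7 - symmetry_score(A))   # 8 8
```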
4 Conclusion
We proposed a heuristic to solve the matrix symmetrization with zero-free diagonal problem. The heuristic starts from a judiciously chosen initial solution and iteratively improves it. We presented experiments on two sets of matrices. We know the optimal solution for the matrices in one of the sets. The solutions found by the proposed heuristic are, on average, around 0.87 of the exact solutions for the 420 matrices in this set. We do not know an optimal solution for the matrices in the other set. Therefore, we compared the solution quality with respect to an upper bound described in the paper. The proposed heuristic achieved solutions, on average, within 0.70 of the upper bounds for the matrices in the second set. Compared to the average achieved for the matrices in the first set, we think that the latter results are lower because the upper bound is loose.
Acknowledgments. I thank M. Benzi, I. S. Duff, O. Parekh, and S. Pralet for helpful discussions, and an anonymous referee for constructive comments which helped improve the presentation of Section 3.
References
1. Benzi, M., Haws, J.C., Tůma, M.: Preconditioning highly indefinite and nonsymmetric matrices. SIAM Journal on Scientific Computing 22, 1333–1353 (2000)
2. Boros, E., Gurvich, V., Zverovich, I.: Neighborhood hypergraphs of bipartite graphs. Technical Report RRR 12-2006, RUTCOR, Piscataway, New Jersey (2006)
3. Bui, T.N., Jones, C.: Finding good approximate vertex and edge partitions is NP-hard. Information Processing Letters 42, 153–159 (1992)
4. Colbourn, C.J., McKay, B.D.: A correction to Colbourn’s paper on the complexity of matrix symmetrizability. Information Processing Letters 11, 96–97 (1980)
5. Davis, T.: University of Florida sparse matrix collection. NA Digest 92/96/97 (1994/1996/1997), http://www.cise.ufl.edu/research/sparse/matrices
6. Duff, I.S., Erisman, A.M., Reid, J.K.: Direct Methods for Sparse Matrices. Oxford University Press, London (1986)
7. Duff, I.S., Koster, J.: On algorithms for permuting large entries to the diagonal of a sparse matrix. SIAM Journal on Matrix Analysis and Applications 22, 973–996 (2001)
8. Hendrickson, B., Kolda, T.G.: Partitioning rectangular and structurally unsymmetric sparse matrices for parallel processing. SIAM Journal on Scientific Computing 21, 2048–2072 (2000)
9. HSL: A collection of Fortran codes for large-scale scientific computation (2004), http://www.cse.scitech.ac.uk/nag/hsl
10. Hu, Y.F., Scott, J.A.: Ordering techniques for singly bordered block diagonal forms for unsymmetric parallel sparse direct solvers. Numerical Linear Algebra with Applications 12, 877–894 (2005)
11. Karypis, G., Kumar, V.: MeTiS: A software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices, version 4.0. University of Minnesota, Department of Computer Science/Army HPC Research Center, Minneapolis, MN 55455 (1998)
12. Kernighan, B.W., Lin, S.: An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal 49, 291–307 (1970)
13. Lawler, E.: Combinatorial Optimization: Networks and Matroids. Dover, Mineola, New York (unabridged reprint of Combinatorial Optimization: Networks and Matroids, originally published by New York: Holt, Rinehart, and Winston, c1976) (2001)
14. Pothen, A., Fan, C.-J.: Computing the block triangular form of a sparse matrix. ACM Transactions on Mathematical Software 16, 303–324 (1990)
15. Reid, J.K., Scott, J.A.: Reducing the total bandwidth of a sparse unsymmetric matrix. SIAM Journal on Matrix Analysis and Applications 28, 805–821 (2006)
A Supernodal Out-of-Core Sparse Gaussian-Elimination Method
Sivan Toledo and Anatoli Uchitel
Tel-Aviv University
Abstract. We present an out-of-core sparse direct solver for unsymmetric linear systems. The solver factors the coefficient matrix $A$ into $A = PLU$ using Gaussian elimination with partial pivoting. It assumes that $A$ fits within main memory, but it stores the $L$ and $U$ factors on disk (that is, in files). Experimental results indicate that on small to moderately-large matrices (whose factors fit or almost fit in main memory), our code achieves high performance, comparable to that of SuperLU. In some of these cases it is somewhat slower than SuperLU due to overheads associated with the out-of-core behavior of the algorithm (in particular the fact that it always writes the factors to files), but not by a large factor. In other cases it is faster than SuperLU, probably due to more efficient use of the cache. It is also able to factor matrices whose factors are much larger than main memory, although at lower computational rates.
1 Introduction
We present an out-of-core sparse direct solver for unsymmetric linear systems. The solver factors the coefficient matrix $A$ into $A = PLU$ using Gaussian elimination with partial pivoting. It assumes that $A$ fits within main memory, but it stores the $L$ and $U$ factors on disk (that is, in files).

The availability of computers with large main memories has reduced the need for out-of-core solvers. Personal computers can be fitted today with 2-4 GB of main memory for a reasonable cost. This allows users to factor large sparse matrices using high-performance in-core solvers on widely-available machines. But the need to factor matrices that are too large to factor in core still arises occasionally, and new out-of-core algorithms are still being actively developed [11].

The solver that we present here combines ideas from two earlier sparse LU factorization algorithms. The overall structure of our algorithm follows that of SuperLU, a high-performance in-core code by Li et al. [4]. We also use techniques from an earlier (and much slower) out-of-core algorithm by Gilbert and Toledo [10].
This research was supported by an IBM Faculty Partnership Award, by grant 848/04 from the Israel Science Foundation (founded by the Israel Academy of Sciences and Humanities), and by grant 2002261 from the United-States-Israel Binational Science Foundation.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 728–737, 2008. © Springer-Verlag Berlin Heidelberg 2008
Experimental results indicate that on small to moderately-large matrices (whose factors fit or almost fit in main memory), our code achieves high performance, comparable to that of SuperLU. In some of these cases it is somewhat slower than SuperLU due to overheads associated with the out-of-core behavior of the algorithm (in particular the fact that it always writes the factors to files), but not by a large factor. In other cases it is faster than SuperLU, probably due to more efficient use of the cache. Our method is able, of course, to factor matrices whose factors are much larger than main memory, although at lower computational rates.

The paper is organized as follows. Section 2 describes the new algorithm. We discuss the efficiency of the algorithm in Section 3. Section 4 presents our experimental results and Section 5 our conclusions.
2 The Algorithm and Its Data Structures
This section describes our algorithm and the data structures that our new factorization code uses. Due to lack of space, we do not provide pseudo-code for the algorithm, but such a description is available in the second author’s MSc thesis.

The algorithm partitions the matrix into sets of consecutive columns that we call panels. The algorithm first factors the columns of panel 1, then the columns of panel 2, and so on. The algorithm is left looking (between panels); when it starts processing panel $p$, the columns of the panel have not been updated at all. During the factorization of the panel, these columns are updated by the columns of panels 1 to $p - 1$, and then the panel itself is factored. Once the columns of the panel are factored, they are written to files to make space in memory for the factorization of the next panel.

As we factor a panel, we partition it into supernodes, smaller sets of consecutive columns with similar nonzero structure in $L$. As in SuperLU [4], our algorithm builds supernodes incrementally: after a column is factored, its nonzero structure is inspected to determine whether it should be added to the current supernode. If its nonzero structure is similar to that of the supernode, it is added to it and the algorithm continues to the next column. If it is not similar to the structure of the supernode, the current supernode is finalized and the column starts a new supernode.

External-Memory Data Structures

Our code assumes that the input matrix is stored in main memory in a compressed-column format, although it can be easily converted to use an out-of-core input matrix. The code stores $L$ and $U$, as well as the pivoting permutation, in a file or files. It uses the matrix-oriented input-output library developed by Rotkin and Toledo [12]. This library allows the code to write out to files arbitrary rectangular arrays of floating point or integer data and to retrieve such matrices from files.
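For readers unfamiliar with the compressed-column format and the panel partition mentioned above, the following Python sketch shows one common way to represent them. It is a generic illustration, not the layout used by the authors' code: the array names `col_ptr`, `row_idx`, and `values`, the helper names, and the fixed panel width are assumptions for this example.

```python
def to_compressed_column(dense):
    """Compressed-column storage: the nonzeros of column j are stored in
    positions col_ptr[j] .. col_ptr[j + 1] - 1 of row_idx and values."""
    col_ptr, row_idx, values = [0], [], []
    for j in range(len(dense[0])):
        for i, row in enumerate(dense):
            if row[j] != 0:
                row_idx.append(i)
                values.append(row[j])
        col_ptr.append(len(row_idx))
    return col_ptr, row_idx, values


def panels(n_cols, width):
    """Split columns 0 .. n_cols - 1 into consecutive panels of at most `width`."""
    return [range(start, min(start + width, n_cols))
            for start in range(0, n_cols, width)]


dense = [[4, 0, 1],
         [0, 3, 0],
         [2, 0, 5]]
print(to_compressed_column(dense))   # ([0, 2, 3, 5], [0, 2, 1, 0, 2], [4, 2, 3, 1, 5])
print([list(p) for p in panels(5, 2)])   # [[0, 1], [2, 3], [4]]
```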
A set of arrays is associated with a symbolic name such as mymatrix.L. These arrays can be stored in one file or, if the data grows beyond one gigabyte, in multiple files. The splitting of the logical data into multiple files is transparent to the code. We store $U$ column by column in such a file. The columns are compressed, of course. We store $L$ in supernodes, also compressed. The nonzeros of the supernode (padded with zeros if the columns of the supernode do not have identical nonzero structure) are stored in rowwise order, and the rows are ordered in their original order (not in pivot order).

Data Structures

The algorithm uses several data structures. The main data structure stores the columns of the current panel and information about them. Another data structure, a priority queue $q$, is associated with the entire panel. The algorithm also uses integer vectors that map columns to their pivot rows ($\pi$) and back, and columns to supernodes and back.

The panel data structure consists of an $n$-by-$w$ array of floating-point numbers, where $w$ is the number of columns in the panel. This is a dense array that stores a sparse submatrix, so it does not use memory as efficiently as possible, but the dense array allows for fast access. The panel is stored in this array in columnwise order (columns are contiguous). The row ordering of panel $p$ reflects all the row exchanges that were made during the factorization of panels 1 through $p - 1$. That is, the first row in the array is the pivot row of column 1 and so on.

We associate with each column $j$ in the panel three integer vectors. The vector $r_j$ stores the indices of nonzero rows in column $j$. The indices are compressed toward the beginning of the vector and an integer stores the number of nonzero rows. The row indices are not sorted. We allocate $r_j$ to be large enough to store all the nonzero indices in column $j$, using the precomputed bounds on the column counts in $L$ and $U$. The bound is usually much smaller than $n$.
An $n$-vector $\ell_j$ stores the location of indices in $r_j$. If column $j$ has a zero in a particular row, then the corresponding position of $\ell_j$ is $-1$. That is, if row $i$ in the column is nonzero, then element $i$ of $\ell_j$ stores the position in $r_j$ that contains $i$, and it contains $-1$ otherwise. A third vector, $h_j$, contains a binary heap (a priority queue) of the nonzero row indices in column $j$. The priority order is the pivoting order. That is, if rows $i_1$ and $i_2$ are in the heap, and if $i_1$ was used as a pivot row before $i_2$, then $i_1$ will precede $i_2$ in the priority ordering. Rows that are nonzero in column $j$ but were not yet used as pivots are not in the queue.

The panel-wide priority queue $q$ stores at most one nonzero row index from every column in the panel. If column $j$ contains nonzeros in rows that were already used as pivots, then $q$ contains a pair $(i, j)$, where $i$ is the minimal row index in $j$ in the pivot order. (We actually do not store $(i, j)$ but $(s, j)$, where $s$ is the supernode in which $i$ was used as pivot; this simplifies the code a bit, but complicates the notation here, so we view the contents of the queue as $(i, j)$; $i$ can easily be mapped to $s$.) If a column contains no nonzero rows that were used as pivots, it is not represented in $q$.
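The three per-column vectors described above can be pictured with a small Python sketch. This is a simplified model for illustration only, not the authors' C data structure: the class name `PanelColumn`, its method names, and the dictionary `pivot_order` (mapping an already-used pivot row to its position in the pivot sequence) are assumptions, and Python lists stand in for the preallocated integer vectors.

```python
import heapq


class PanelColumn:
    """Index structures for one panel column: r_j, l_j and the heap h_j."""

    def __init__(self, n):
        self.rows = []             # r_j: unsorted indices of nonzero rows
        self.loc = [-1] * n        # l_j: position of row i in r_j, or -1
        self.heap = []             # h_j: (pivot order, row) for rows used as pivots

    def add_nonzero(self, i, pivot_order):
        """Record that row i is nonzero in this column (idempotent)."""
        if self.loc[i] == -1:
            self.loc[i] = len(self.rows)
            self.rows.append(i)
            if i in pivot_order:   # only rows already used as pivots enter h_j
                heapq.heappush(self.heap, (pivot_order[i], i))

    def min_pivot_row(self):
        """Row of this column that comes first in the pivoting order, if any."""
        return self.heap[0][1] if self.heap else None


pivot_order = {4: 0, 1: 1}         # row 4 was the first pivot, row 1 the second
col = PanelColumn(6)
for i in (1, 5, 4, 5):
    col.add_nonzero(i, pivot_order)
print(col.rows, col.loc, col.min_pivot_row())
```

The constant-time lookup in `loc` is what lets an update test whether it creates a new nonzero in the column without searching `rows`.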
The Algorithm

The factorization proceeds in phases that each factors one panel. When phase $p$ starts, panels 1 through $p - 1$ have already been factored, and the supernodes that they have generated in $L$ are stored in a file. The $n$-by-$w$ array for the panel contains only zeros and the $\ell_j$ vectors contain only $-1$’s.

Phase $p$ starts by loading the nonzeros of the columns of the panel from the input-matrix data structure into the panel. We traverse the compressed representation of each column of the panel. We store each nonzero $A_{ij}$ in the $n$-by-$w$ array, add its row index to $r_j$, note the location in $r_j$ in the $i$th position of $\ell_j$, and if $i$ has been used as a pivot row, we append it to $h_j$ (ignoring the heap structure that $h_j$ should have). Once all the nonzeros of column $j$ have been processed, we build the heap structure of $h_j$ (see [2]; the cost of building the heap this way is linear in its size, faster than multiple heap insertions). To finish the pre-processing of column $j$, we test whether $h_j$ contains any indices. If it does, we append $(\min h_j, j)$ to the panel-wide priority queue, which we will build as a heap later. The total cost of this panel pre-processing step is linear in the number of nonzeros in the panel in the input matrix $A$.

Next, the algorithm updates the panel, performing all the column-column operations involving a column from panels 1 to $p - 1$ and a column in panel $p$ (the operations are performed in a blocked fashion, as we explain below). This is done as follows. We repeatedly extract a pair $(i, j)$ from $q$. This tells us that column $\pi^{-1}(i)$ (the column in which row $i$ was the pivot row) must be scaled and subtracted from column $j$. But in order to perform block operations, we delay the subtraction. Immediately after extracting $(i, j)$ from $q$, we extract the minimal row index $i'$ in $h_j$ and insert $(i', j)$ into $q$, to maintain its invariant property.
We remember the pair (i, j) in an auxiliary vector and continue to extract pairs from q until the row index in the minimal pair in q belongs to a different supernode than row i, the first row that we extracted from q. Then we stop. We now read the supernode s of L that contains π^{-1}(i) from external memory. We copy the columns of panel p whose indices we extracted from q into a compressed array B; that is, the columns that we copy are the column indices in the row-column pairs that we extracted. We only copy the rows that appear as row indices in supernode s. In other words, B contains the elements of p that belong to columns that s updates and to rows that appear in s. These numbers are compressed in B, in rowwise order, and the rows have the same ordering as in the compressed supernode s. We can now perform the triangular solve and the matrix-matrix multiplication that constitute the column-column operations involving s and p. Hence, all of these numerical operations are performed by two calls to the level-3 BLAS [8,7]. The supernode s is not needed any more during the processing of panel p, so it can be evicted from memory. The contents of B are now copied back into the main data structure of the panel, updating the r, ℓ, and h vectors where necessary. When all the column-column operations involving panels 1 through p − 1 have been applied to p, we factor p itself. At this point the heaps h_j are empty.
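The way queue entries are batched by supernode before each blocked update can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: rows are identified by their position in the pivoting order, `supernode_of` is a hypothetical map from that position to a supernode label, and `heaps[j]` stands for h_j. The sketch reads the text as "remove the extracted row from h_j, then re-insert the new minimum of h_j into q"; the numerical update itself (triangular solve and matrix multiplication) is omitted.

```python
import heapq

def next_supernode_batch(q, heaps, supernode_of):
    """Pop from q every (rank, j) pair whose pivot row lies in the same
    supernode as the first pair; return (supernode, list of pairs)."""
    if not q:
        return None, []
    s = supernode_of[q[0][0]]              # supernode of the minimal pair
    batch = []
    while q and supernode_of[q[0][0]] == s:
        rank, j = heapq.heappop(q)
        batch.append((rank, j))
        heapq.heappop(heaps[j])            # rank was the minimum of h_j
        if heaps[j]:                       # keep column j represented in q
            heapq.heappush(q, (heaps[j][0], j))
    return s, batch


# Pivot ranks 0-2 belong to supernode "A", ranks 3-4 to supernode "B".
supernode_of = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B"}
heaps = {0: [0, 1, 4], 1: [1, 2], 2: [4]}          # h_j for three panel columns
q = [(h[0], j) for j, h in heaps.items()]
heapq.heapify(q)

s1, batch1 = next_supernode_batch(q, heaps, supernode_of)
cols1 = sorted({j for _, j in batch1})             # columns that "A" updates
s2, batch2 = next_supernode_batch(q, heaps, supernode_of)
```

In this toy run the first batch collects every pending pair for supernode "A", so the supernode would be read from disk once and applied to columns 0 and 1 together; the second batch does the same for "B".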
S. Toledo and A. Uchitel
The factorization of the panel is supernodal and uses a mixture of left-looking and right-looking updating rules. We first explain the updating rules. Suppose that the next column to be factored is column j and that the factorization of the previous columns in the panel generated supernodes s_x, ..., s_y, s_z. If the nonzero structure of column j of L is similar enough to the nonzero structure of s_z, j will join s_z. Therefore, we view s_z as a pending supernode, which we store in a somewhat different data structure than the finalized supernodes s_x, ..., s_y. A finalized supernode in the panel is stored in the same data structure as a supernode on disk: compressed in rowwise order (rows are contiguous). A pending supernode is stored in a large two-dimensional array. Its rows are contiguous, but there are gaps between them. That is, a pending supernode is stored in a two-dimensional array that has both more rows and more columns than the supernode. This array is partitioned into a U area and an L area. This data structure allows us to add more columns and more nonzero rows to the supernode and to use it as an argument to BLAS subroutines.
When the algorithm reaches the factorization of column j, updates from all the finalized supernodes have already been applied to it (right-looking updates from supernodes within the panel and left-looking updates from earlier panels). We explain below how this happens. But there may be column-column operations involving the pending supernode that still need to be performed on column j. To find them, we inspect h_j. If it is empty, we do not need to update column j, so we proceed to factor it. If h_j is not empty, we empty it (the indices in it must belong to the pending supernode s_z) and copy the rows in column j that are nonzero in s_z to the compressed buffer B.
We then use two level-2 BLAS subroutines [6,5] to update column j and we copy it back to the panel's main data structure, updating r_j and ℓ_j if the column gained new nonzeros.
To factor column j, we use r_j to locate its nonzeros, to find the maximal one (in absolute value), and to scale them. We denote the row index of this maximal element by π(j). We then exchange rows j and π(j) in all the panel's data structures. We traverse rows j and π(j) in the two-dimensional array that stores the panel and swap elements (traversing only columns that have not yet been factored). If both elements (j, k) and (π(j), k) are nonzero in some column k, then the only other operation that we need to perform on column k is to insert j into h_k. If only element (j, k) is nonzero, we swap it with the zero in (π(j), k) and change the index j somewhere in r_k to π(j); this is what we need ℓ_k for, to locate this index with a constant number of operations. If only element (π(j), k) is nonzero, we swap it with the zero in (j, k), change π(j) to j in r_k, and insert j into h_k. In the special case of j = π(j), no rows are exchanged, of course, but we still need to traverse row j and, for every nonzero element (j, k), to insert j into h_k.
For every column k in the panel, we keep a boolean variable that indicates whether the column is represented in the panel-wide queue q. Whenever we insert an index into h_k we inspect this indicator, and if it is false, we insert the pair (min h_k, k) into q. Since we add indices to h_k in pivot order, if k is already represented in q then (min h_k, k) is already in q. (The scheme can be simplified a bit, but this
description keeps the intra-panel updating rules similar to the inter-panel ones, so it is simpler.) The traversal of the pivot row in the full two-dimensional array storing the panel is somewhat inefficient; we discuss this issue later.
Now that column j has been factored, we compare its nonzero structure to that of the pending supernode. If the structures are similar enough (we describe the criteria below), we add j to s_z. This may require adding rows to the nonzero structure of the pending supernode and possibly moving row π(j) (if it was already nonzero in s_z) from the L area of the supernode's array to the U area. If the structures of s_z and j are dissimilar or if j is the last column in the panel, we finalize s_z. To finalize s_z, we first copy it into a compressed two-dimensional array. We then extract from the panel-wide queue q the indices of all the columns in the panel that s_z must update. We copy these columns to the compressed array B and use two level-3 BLAS subroutine calls to update them. This operation is identical to updates from supernodes from earlier panels. The finalized supernode can now be written to disk and erased from memory.
The criteria that we use to decide whether a column should be added to a pending supernode are as follows. If the supernode has fewer than 8 columns, we always add the column (to avoid very narrow supernodes). Similarly, we limit the size of supernodes to 80 columns. If the supernode has 8 to 79 columns, we add the column only if it is the parent of the previous column in the column elimination tree and the total number of nonzero rows in the supernode is less than 8192.
This concludes the description of the algorithm. We omit discussion of how the panel's data structures are cleared in preparation for the processing of the next panel; this is done in a conventional way in time proportional to the number of nonzero elements in the panel.
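The supernode-membership criteria stated above translate into a short predicate. The following Python sketch uses the thresholds from the text; the function name, the argument names, and the constant names are assumptions made for illustration.

```python
MIN_COLS = 8      # narrower supernodes always accept the next column
MAX_COLS = 80     # a supernode never grows beyond this many columns
MAX_ROWS = 8192   # bound on nonzero rows when the width is 8 to 79

def should_join(num_cols, num_nonzero_rows, is_parent_of_previous):
    """Decide whether the newly factored column joins the pending supernode.

    num_cols              -- current number of columns in the supernode
    num_nonzero_rows      -- total number of nonzero rows in the supernode
    is_parent_of_previous -- True if the column is the parent of the previous
                             column in the column elimination tree
    """
    if num_cols < MIN_COLS:
        return True                       # avoid very narrow supernodes
    if num_cols >= MAX_COLS:
        return False                      # size limit reached
    return is_parent_of_previous and num_nonzero_rows < MAX_ROWS
```

For example, `should_join(3, 10000, False)` accepts the column because the supernode is still narrow, whereas `should_join(20, 100, False)` rejects it because the column is not the parent of its predecessor.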
3 Discussion
The following points summarize the algorithm.
– The columns are statically partitioned into panels, and each panel is dynamically partitioned into supernodes (groups of consecutive columns with similar nonzero structure in L).
– During the factorization of a panel, each factored supernode that updates the panel participates in one supernode-panel update. This operation is performed by two calls to the level-3 BLAS. This rule applies both to supernodes from earlier panels and to supernodes from the same panel.
– Before a supernode is finalized (before the set of columns that belong to it has been determined), the supernode updates one column at a time using supernode-column operations involving two calls to level-2 BLAS subroutines.
– The algorithm uses priority queues to determine which updating operations must be performed.
– The algorithm uses two n-by-w arrays to store the elements of the current panel, as well as a few smaller data structures; one array stores the nonzero values and the other stores the pointer vectors ℓ_j.
There are two places where our algorithm gives up sparsity to obtain higher performance. The first is the use of the n-by-w arrays to store the current panel. This involves a one-time Θ(nw) cost to set up these arrays initially (to clear them), and it means that the algorithm consumes large amounts of memory if w is large. The second is the traversal of the full rows j and π(j) of the remaining columns of the panel after the factorization of column j. If the panel is very wide (large w), these rows can be fairly sparse. This may cause our algorithm to perform a significant amount of work on zeros. Asymptotically, the total cost of these row scans is Θ((n/w) w^2) = Θ(nw). Therefore, as long as we use a full array to hold the panel, the row scans are not a dominant cost in the algorithm, because the cost of initializing the full array is of the same order.
We did not explore the possibility of compressing the entire panel. Our goal in this research has been to achieve high performance through the use of the level-3 BLAS, and we did not find a way to do this with a compressed panel without a large overhead.
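The Θ(nw) bound on the row scans can be spelled out step by step. The count below is a sketch under the assumption that the n columns are split into roughly n/w panels of w columns each, and that factoring each column scans two rows (j and π(j)) of at most w entries:

```latex
\underbrace{\frac{n}{w}}_{\text{panels}}
\cdot
\underbrace{w}_{\text{columns per panel}}
\cdot
\underbrace{2w}_{\text{entries scanned per column}}
\;=\; 2nw
\;=\; \Theta\!\left(\frac{n}{w}\, w^{2}\right)
\;=\; \Theta(nw).
```

Under these assumptions the scans cost a constant multiple of the Θ(nw) already attributed to clearing the full arrays.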
4 Experimental Results
This section presents experimental results that explore the performance of our new algorithm. All the experiments that we report on were performed on a 3.2 GHz Intel Pentium 4 computer with 2 GB of main memory running Linux version 2.6.17. We used GCC version 4.04 to compile all the codes and linked them with the same high-performance implementation of the BLAS, ATLAS version 3.6. We used colamd to order the columns of the matrices prior to factoring them. All the codes used strict partial pivoting.
Our out-of-core code stored the factors on a Maxtor 160 GB serial-ATA disk. The factors were stored in files on an ext2 file system that occupies an 80 GB partition on the disk. The same disk also hosts a 4 GB swap partition, but nothing else (the rest of the disk was not used during these runs). The machine was dedicated to these experiments and did not run any other significant process during the experiments.
We used for the experiments the 22 matrices that are described in Table 1. The matrices are all available from Tim Davis's sparse matrix collection (http://www.cise.ufl.edu/research/sparse/matrices/). The table shows the size of the matrices, the size of the computed factors, the maximal size of panels that our code chose (automatically), and the factorization and solve times. The table shows that the code was able to factor matrices whose factors are much larger than main memory. For sparsine, for example, the L and U factors each have more than 500 million nonzeros. Just the values of the nonzeros in L
Table 1. The matrices that we used for the experiments. Nonzero counts are in thousands; OOC denotes our out-of-core code.

 #  Name          Order    Nonzeros  Nonzeros in  Max columns  OOC factor  OOC factor  OOC solve
                           in A      L and U      in panel     time (sec)  Mflop/s     time (sec)
 1  zhao2           33861      166       20627        1470          26.4       449          0.7
 2  twotone        120750     1224       30134         413          30.8       275          1.7
 3  fidapm11        22294      626       29107        2232          31.2       732          0.6
 4  wang4           26068      177       27543        1909          41.2       781          0.6
 5  cage10          11397      151       27399        4366          52.8       964          0.5
 6  bbmat           38744     1772       53961        1285          63.9       705          1.0
 7  av41092         41092     1684       47863        1211          83.2       892          1.0
 8  mark3jac140     64089      400       53891         777          95.6       361          1.5
 9  xenon1          48600     1181       66641        1024         107.2       828          1.3
10  g7jac200        59310      838       65948         839         119.4       483          1.6
11  li              22695     1350       71274        2193         167.8       429          1.1
12  ecl32           51993      380       82588         957         171.0       789          1.7
13  gupta3          16783     9323       82202        2965         227.6       759          1.2
14  xenon2         157464     3867      387103         316        1644.9       719        323.6
15  ldoor_full     952203    42494     1053055          53        2924.3       361       1375.4
16  cage11          39082      560      285699        1274        3620.4       511        156.8
17  pre2           659033     5834      806898          76        3842.2       315       1078.2
18  ohne2          181343     6870      723379         275        3952.6       624        610.0
19  conesh22       837967    43039     1527770          60        8114.7       363       1706.6
20  hamrle3       1447360     5514      636482          35       14823.4        81        861.6
21  af_shell10    1508065    52260     1778122          33       17878.2       178       2540.0
22  sparsine        50000     1549     1105720         425      128598.0       131        599.4
take more than 4 GB to store, twice the amount of main memory on the machine that factored the matrix. Clearly, when the factorization performs a large amount of I/O, it slows down. The lowest computation rate that we observed, 81 Mflop/s on hamrle3, is lower by more than a factor of 10 than the highest observed rate, 964 Mflop/s. But on several other large matrices, the code performed well in spite of the size of the factors.
Our code chose the panel size as follows. It broke the column elimination tree into subtrees. The size of each subtree was limited by two constraints: a maximal number of columns and a maximal number of predicted nonzeros in the corresponding columns of L and U. The limit on the number of columns was chosen so that the total size of the n-by-w array is at most 3/16 of the available memory. The total number of predicted nonzeros in a subtree was limited to 820,000. This is a fairly small number compared to the total amount of memory. We selected this number experimentally to achieve high performance. This is the main tunable parameter in the code.
Figure 1 (left) compares the running times of our new out-of-core code to those of three other sparse LU factorization codes that all perform partial pivoting. One code is the out-of-core code of Gilbert and Toledo [10]. Another is an
[Figure 1. Left panel: factorization time (sec, logarithmic scale) versus input matrix number (1 to 22) for TSOOC, Gilbert-Toledo, SuperLU, and MultiLU. Right panel, titled Xenon2: factorization rate (MFLOP/sec) versus panel size (in columns).]
Fig. 1. Left: The factorization times of our new code (TSOOC), along with the factorization times of several existing codes: the out-of-core code of Gilbert and Toledo, the in-core unsymmetric multifrontal code in TAUCS (denoted MultiLU in the graphs), and SuperLU. The ordering of the matrices is according to the ordering in Table 1. Right: Performance of our algorithm as a function of the width of the panel on the matrix Xenon2.
in-core unsymmetric-pattern multifrontal code [1] which is now part of TAUCS, a library of sparse linear solvers that our group has developed. This code, denoted MultiLU below, is based on the UMFPACK 4.0 algorithm [3]. The last code that we use in this comparison is SuperLU [4], also an in-core algorithm. Our new code is denoted in the figures by TSOOC.
The results show that our new code is usually slower than SuperLU and MultiLU when these codes can factor the matrix, but not by a large factor. In many cases the performance of our code is similar to that of SuperLU, but in two cases SuperLU is more than twice as fast as our out-of-core code. In these experiments, MultiLU is usually faster than SuperLU, and on one matrix it is five times faster than our code. But on most matrices our code is within a factor of 2.5 of MultiLU and within a factor of 2 of SuperLU. On one matrix SuperLU was much slower than our code and on another MultiLU was much slower, but these behaviors are likely to have been caused by excessive paging. Both codes failed to factor some of the larger matrices due to lack of sufficient memory to factor them in core. The code of Gilbert and Toledo managed to factor some of the matrices that were too large for SuperLU and MultiLU, but it was slower than the other codes by a large factor (more than a factor of 8 slower than our new out-of-core code), and it also failed on matrices that the other codes managed to factor in core.
Figure 1 (right) shows the influence of the panel width on the performance of our code on one matrix; experiments on other matrices led to graphs with the same structure, sometimes with some fluctuations near the performance peak. Performance first rises sharply when the panel size grows, and it eventually drops slowly as it becomes very large.
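The memory arithmetic used in this section (the 3/16 memory budget for the n-by-w array and the size of the factor values) can be reproduced with a few lines of Python. This is a sketch under assumptions that the text does not state: 8-byte array entries, 2 GiB of available memory, and rounding down. The function names are hypothetical, and the code's second limit (820,000 predicted nonzeros per subtree) is not modeled, so the result is only an upper bound on the panel width.

```python
def max_panel_columns(n, available_bytes, bytes_per_entry=8, fraction=3 / 16):
    """Largest w such that an n-by-w array of bytes_per_entry-sized
    entries fits in `fraction` of the available memory (at least 1)."""
    budget = int(available_bytes * fraction)
    return max(1, budget // (n * bytes_per_entry))

def value_storage_bytes(num_nonzeros, bytes_per_value=8):
    """Bytes needed to store only the numerical values of a factor."""
    return num_nonzeros * bytes_per_value


GIB = 2 ** 30
w_limit = max_panel_columns(n=1_508_065, available_bytes=2 * GIB)  # order of af_shell10
l_bytes = value_storage_bytes(500_000_000)                         # 500 million values
```

With these assumptions, 500 million 8-byte values occupy 4 × 10^9 bytes, which is consistent with the statement that the values of L for sparsine need more than 4 GB.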
5 Conclusions
We have presented a new out-of-core sparse LU factorization algorithm. The performance of the new algorithm degrades gracefully as the problem gets larger. The code factored some matrices that in-core codes failed to factor at computational rates of over 500 Mflop/s, comparable to the rates it achieved on small matrices. On a few large matrices the performance of the algorithm declined considerably (to 81 Mflop/s in one case) due to a large amount of I/O.
References
1. Avron, H., Shklarski, G., Toledo, S.: Parallel unsymmetric-pattern multifrontal sparse LU with column preordering. ACM Transactions on Mathematical Software 34(2) (2008)
2. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn. MIT Press and McGraw-Hill, Cambridge (2001)
3. Davis, T.A.: A column pre-ordering strategy for the unsymmetric-pattern multifrontal method. ACM Transactions on Mathematical Software 30(2), 165–195 (2004)
4. Demmel, J.W., Eisenstat, S.C., Gilbert, J.R., Li, X.S., Liu, J.W.H.: A supernodal approach to sparse partial pivoting. SIAM Journal on Matrix Analysis and Applications 20, 720–755 (1999)
5. Dongarra, J.J., Cruz, J.D., Hammarling, S., Duff, I.: Algorithm 656: An extended set of Fortran basic linear algebra subprograms. ACM Transactions on Mathematical Software 14, 18–32 (1988)
6. Dongarra, J.J., Cruz, J.D., Hammarling, S., Duff, I.: An extended set of Fortran basic linear algebra subprograms. ACM Transactions on Mathematical Software 14, 1–17 (1988)
7. Dongarra, J.J., Cruz, J.D., Hammarling, S., Duff, I.: Algorithm 679: A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software 16(1), 18–28 (1990)
8. Dongarra, J.J., Cruz, J.D., Hammarling, S., Duff, I.: A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software 16(1), 1–17 (1990)
9. Gilbert, J.R., Peierls, T.: Sparse partial pivoting in time proportional to arithmetic operations. SIAM Journal on Scientific and Statistical Computing 9, 862–874 (1988)
10. Gilbert, J.R., Toledo, S.: High-performance out-of-core sparse LU factorization. In: Proceedings of the 9th SIAM Conference on Parallel Processing for Scientific Computing, San-Antonio, Texas, 10 pages on CDROM (1999)
11. Reid, J.K., Scott, J.A.: The design of an out-of-core multifrontal solver for the 21st century. In: Proceedings of the Workshop on State-of-the-Art in Scientific and Parallel Computing, Umea, Sweden (June 2006)
12. Rotkin, V., Toledo, S.: The design and implementation of a new out-of-core sparse Cholesky factorization method. ACM Transactions on Mathematical Software 30, 19–46 (2004)
A Large-Scale Semantic Grid Repository
Marian Babik and Ladislav Hluchy
Intelligent and Knowledge-oriented Technologies Group, Department of Parallel and Distributed Computing, Institute of Informatics, Slovak Academy of Sciences
Abstract. The Semantic Grid is a recent initiative to expose semantically rich information associated with Grid resources in order to build more intelligent Grid services [2]. Recently, several projects have embraced this vision, and there are several successful applications that combine the strengths of the Grid and of semantic technologies [5,11,4]. However, the Semantic Grid still lacks a technology that provides the needed scalability and integration with existing infrastructure. In this paper we present our ongoing work on a semantic grid repository, which is capable of handling complex schemas and answering queries over ontologies with a large number of instances. We present the details of our approach and describe the underlying architecture of the system. We conclude with a performance evaluation, which compares current state-of-the-art reasoners with our system.
1 Introduction
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 738–745, 2008. © Springer-Verlag Berlin Heidelberg 2008
The Semantic Grid is a recent effort that tries to extend current Grid technologies by giving a well-defined meaning to services and information, thus enabling computers and people to work in cooperation [2]. The Semantic Web is seen as a possible infrastructure that can provide an environment for hosting and managing both grid and web services. The use of Semantic Web technologies in Grids is not novel. It has already proved useful in providing scalable solutions for the planning and resource optimization of large-scale scientific workflows [9,8,6]. However, as the number of resources and the amount of data increase in the scientific communities, the creation and management of heterogeneous and dynamic resources is the key to future analysis and scientific computations. The need for semantically enabled technologies was recognized in several scientific applications such as bioinformatics, chemistry and environmental sciences [4,10,5]. These applications require support for dynamic and complex scientific workflows, which are based on processing and sharing large amounts of heterogeneous data. Such scientific workflows can be based on the composition of a large number of grid-based jobs and can benefit from AI planning and semantic description of data, as described in [6,9]. Others can be based on the composition of, and interoperability between, grid and web services. Examples of such systems are [5,4,11]. Such environments often require support for the discovery, description, composition and execution of grid and web services. One
of the major obstacles for such systems is the lack of scalability of the existing semantic repositories, which are based on Semantic Web standards such as RDF/RDFS or OWL. In this paper we present a novel semantic repository, which supports a subset of the OWL standard and can thus address service discovery and composition based on the semantic description of services.
2 Motivation: Semantic Description of Services
The Semantic Web is making available technologies that support and to some extent automate knowledge sharing. In particular, there are several existing initiatives (OWL-S, WSMO, WSFS, SAWSDL; see footnote 1), which provide evidence that ontologies, with their ability to interweave human understanding of symbols with machine-processability, can play a key role in automating service discovery and composition in the Semantic Web. This is also supported by the numerous successful extensions of these technologies to the Grid environment [5,11,4]. Although there are various approaches to service discovery and composition [12], there are only a few typical query types that the underlying semantic repository has to support:
1. keyword search or relational database queries (e.g., SQL queries);
2. subsumption queries, i.e. queries over the taxonomy (schema) of the ontology, which take into consideration the semantics of the underlying logical model;
3. conjunctive queries over ontologies, which ask for the instances of the ontological classes, while taking into consideration the taxonomy of classes and properties (schema).
The current Semantic Web infrastructure offers a range of existing ontological repositories, which can provide support for ontological querying and reasoning. However, there are several shortcomings in the current implementations:
– The well-known repositories such as Sesame, Kowari and Jena [1] are mostly based on RDF/RDFS and thus provide only models with limited expressivity and no or very restricted support for OWL ontologies.
– Conjunctive queries over ontologies with a large number of instances are only available on certain systems.
– There is limited support for performing ontological queries over relational databases. This is a major obstacle to integrating semantic repositories with existing relational-based systems (such as MDS, OGSA-DAI, etc.).
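The difference between the second and third query types can be illustrated with a toy example. The Python sketch below is not an OWL reasoner: the class names, instance names, and the `hostedOn` property are hypothetical, and subsumption is decided only by following explicitly stated subclass links, whereas a real reasoner must take the full semantics of the logical model into account.

```python
# Toy schema: child class -> parent class (hypothetical names).
SUBCLASS_OF = {"GridService": "Service",
               "WebService": "Service",
               "DataService": "GridService"}
# Toy instance data: class assertions and one property, hostedOn.
INSTANCE_OF = {"s1": "DataService", "s2": "WebService", "s3": "GridService"}
HOSTED_ON = {("s1", "siteA"), ("s2", "siteA"), ("s3", "siteB")}

def is_subclass(sub, sup):
    """Subsumption query: does `sup` subsume `sub` in the taxonomy?"""
    while sub is not None:
        if sub == sup:
            return True
        sub = SUBCLASS_OF.get(sub)        # climb one level, None at the root
    return False

def grid_services_on(site):
    """Conjunctive query: all x with GridService(x) AND hostedOn(x, site)."""
    return sorted(x for x, cls in INSTANCE_OF.items()
                  if is_subclass(cls, "GridService") and (x, site) in HOSTED_ON)
```

The conjunctive query returns `s1` for `siteA` even though `s1` is only asserted to be a `DataService`; the answer depends on the taxonomy, which is why the repository must combine schema reasoning with instance retrieval.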
In summary, our goal is to provide a scalable ontological repository for services and data, which would support both subsumption and conjunctive queries as well as coupling with legacy metadata stored in relational databases.
1 OWL-S: http://www.daml.org/services/owl-s/1.1, WSMO: http://www.wsmo.org, SWSL: http://www.daml.org/services/swsl, SAWSDL: www.w3.org/2002/ws/sawsdl
3 Approach
The Semantic Web proposes a standard for ontological descriptions in the form of the Web Ontology Language (OWL). There are currently two algorithms for reasoning with OWL ontologies, namely the tableaux decision procedure [13] and reasoning in the framework of resolution [14]. The former approach benefits from numerous optimizations in the area of classification and satisfiability, which allow tableaux-based reasoners to classify ontological schemas quickly and to answer complex queries over schemas. The latter is based on the translation of description logics into disjunctive datalog and benefits from its relation to deductive databases. In the following we denote the tableaux procedure by TB and resolution-based reasoning by DD. To support a storage and inference system for large-scale OWL ontologies on top of relational databases, we have developed a new approach with the following characteristics:
– Our method combines the existing description logics reasoners for computing taxonomies (TBoxes), i.e. TB, with rule-based reasoners for reasoning with a large number of instances (ABoxes), i.e. DD.
– Based on the proposed combination we can re-use the existing optimizations (i.e. classification and satisfiability techniques) of the description logics reasoners to perform fast classification of complex schemas. Further, we can exploit the optimizations of the rule-based systems (i.e., join-order and magic sets) to perform queries over ontologies with a large number of instances. Since deductive databases are designed to perform queries over existing relational databases, it is possible to integrate our system with existing RDBMS-based grid registries.
Every ontology is composed of a set of axioms, which can be divided into sets of so-called terminological axioms (axioms describing the schema) and assertional axioms (axioms describing instances of classes and properties of the schema).
The set of terminological axioms is often called a TBox and the set of assertional axioms an ABox. Together, the two sets form a knowledge base, denoted KB. In our approach we initially split the KB into a set of TBox and a set of ABox assertions and feed these assertions to different reasoners as follows. The overall TBox is loaded into a tableaux-based reasoner, i.e. TB(TBox). This makes it possible to ask complex schema queries and to classify the TBox without considering the ABox, which is quite costly for the tableaux methods. Subsequently, the overall KB is loaded into the disjunctive datalog engine, i.e. DD(KB). This makes it possible to check the consistency of the overall KB while leaving the TBox-specific tasks to the tableaux reasoner. The subsequent classification and taxonomy queries can be forwarded to TB(TBox), while the conjunctive queries and the consistency of the KB can be handled by DD(KB). Since the decomposition of the KB into the TBox and ABox has to be performed anyway, there is no significant performance overhead during the initialization of the reasoners.
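The division of labour between the two reasoners can be sketched as follows. This is an illustrative Python outline, not the system's interface: `Axiom`, `StubReasoner`, `build_reasoners`, and `route` are hypothetical names, and the stub only records which axioms it was given instead of performing any reasoning.

```python
from dataclasses import dataclass, field

@dataclass
class Axiom:
    kind: str        # "tbox" (schema axiom) or "abox" (assertion about instances)
    text: str

@dataclass
class StubReasoner:
    """Stand-in for a real reasoner; it only stores the axioms it receives."""
    name: str
    axioms: list = field(default_factory=list)

    def load(self, axioms):
        self.axioms = list(axioms)

def build_reasoners(kb):
    """Split the knowledge base and initialise TB(TBox) and DD(KB)."""
    tbox = [a for a in kb if a.kind == "tbox"]
    tb = StubReasoner("TB")
    tb.load(tbox)                 # tableaux reasoner sees the schema only
    dd = StubReasoner("DD")
    dd.load(kb)                   # datalog engine sees TBox and ABox together
    return tb, dd

# Which reasoner answers which kind of task, following the text above.
ROUTES = {"classification": "TB", "subsumption": "TB",
          "conjunctive": "DD", "kb_consistency": "DD"}

def route(task, tb, dd):
    """Return the reasoner responsible for the given task."""
    return tb if ROUTES[task] == "TB" else dd


kb = [Axiom("tbox", "DataService subClassOf GridService"),
      Axiom("abox", "s1 type DataService"),
      Axiom("abox", "s1 hostedOn siteA")]
tb, dd = build_reasoners(kb)
```

Keeping the routing in a single table makes the design choice explicit: schema-level tasks never touch the instance data, while anything that depends on instances goes to the engine that holds the whole knowledge base.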
4 Architecture
The overall architecture of the system is based on extending the existing tableaux reasoner Pellet [15] with optimizations for conjunctive query answering and with a database backend. Fig. 1 shows the main components of the system. The core of the system is composed of two reasoners, a tableaux reasoner and a disjunctive datalog engine. The aim of the tableaux reasoner is to check the consistency of the TBox and to compute its classification. The disjunctive datalog engine is based on KAON2 [14]; its aim is to check the consistency of the knowledge base KB (see footnote 2) and to perform conjunctive queries over the ABox.
Fig. 1. An overview of the architecture
2 The consistency of the knowledge base can only be decided based on the overall TBox and ABox.
The OWL ontologies are loaded into the reasoners after parsing and validation. The validation ensures that all resources are valid and that the actual expressivity is within the boundaries of our method. During the loading phase, axioms about classes and properties are put into the TBox and assertions about instances are stored in the ABox. The TBox axioms are then preprocessed and fed into the tableaux reasoner. Additionally, the TBox is preprocessed for the resolution method and, together with the ABox, loaded into the disjunctive datalog engine. The engine performs the necessary preprocessing and clausification of the KB. The outcome of this process is a disjunctive datalog program, which can be used to answer conjunctive queries. Although the transformation to DD(KB) is quite complex, it is performed only once, during the initialization of the KB. The set of rules of DD(KB) can be saved for later re-use. Alternatively, ABox assertions can be stored in a relational database (RDBMS): by mapping ontology entities to database tables, the disjunctive datalog engine can query the database on the fly during reasoning. The system provides a KB interface, which can be used to programmatically access the parsing, validation, tableaux and resolution-based methods for inferencing. Additionally, the system provides a SPARQL query interface, which
Fig. 2. Experimental results for the LUBM benchmark (left) and OWL-S (right) ontology. The different versions of ontologies contain an increasing number of instances (for OWL-S ranging from 50000-120000; for LUBM ranging from 67464-319714).
translates the SPARQL queries to the disjunctive datalog reasoner. It is also possible to use standard OWL API functions such as consistency, classification and subsumption queries.
5 Evaluation
The evaluation of our approach is based on comparing the performance of our system with that of existing description logics reasoners. We compared our system (denoted Hires) with KAON2, Racer and Pellet [14,16,15]. KAON2 is a resolution-based reasoner; Racer and Pellet are tableaux-based reasoners. We did not consider other reasoners, as they either do not provide up-to-date implementations or lack the needed expressivity and features. Since the actual reasoning algorithms are quite complex and involve many different optimizations, we took the following measures to keep the comparison fair. We started a new instance of the reasoner for each query. We did not consider any caching and materialization approaches, as this would penalize KAON2, which currently does not implement any caching or materialization. All tests were performed on a laptop computer (T60) with a 1.8 GHz processor and 1 GB of RAM, running Linux kernel 2.6.20-1. For the Java-based reasoners (Pellet, KAON2) we used Java runtime 1.5.0 Update 6 with virtual memory restricted to 800 MB. We ran each reasoning task five times and plotted the average of the set. Each task had a time limit of 5 minutes. Tests that ran out of either memory or time are denoted with time 300000. We ran our experiments on three ontologies: the Wine ontology (http://www.schemaweb.info/schema/SchemaDetails.aspx?id=62), the OWL-S ontology (http://www.daml.org/services/owl-s/1.1) and the Lehigh University Benchmark (LUBM) ontology (http://swat.cse.lehigh.edu/projects/lubm/index.htm). These
Fig. 3. Experimental results for the Wine ontology (left) and benchmark showing the load times for different ontologies (right)
ontologies provide a very good mix of different expressivity of the TBox, while containing a large number of instances in the ABox. For all the ontologies we measured the time to answer a conjunctive query (query time Q) and the time to compute the classification of the ontology (classification time C). The queries used follow the directions of the respective benchmarks. The results of our evaluation are shown in Fig. 2 for the LUBM and OWL-S benchmark ontologies and in Fig. 3 for the Wine benchmark. It can be seen that our system performs classification and conjunctive query answering with performance comparable to that of the specialized reasoners. Furthermore, the combined approach of our system can improve the overall performance of query answering.
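The measurement protocol described in this section (five runs per task, the average reported, a five-minute limit, and the value 300000 for failed runs) can be sketched in Python as follows. This is not the harness used for the experiments: the function name and parameters are hypothetical, the unit of 300000 is assumed to be milliseconds, and the sketch only checks the limit after a run finishes rather than interrupting it. To mirror the setup above, the `task` callable would create a fresh reasoner instance on every call.

```python
import time

TIME_LIMIT_MS = 300_000   # 5 minutes; also the value reported for failed runs
RUNS = 5

def measure(task, runs=RUNS, limit_ms=TIME_LIMIT_MS, clock=time.perf_counter):
    """Average wall-clock time of `task` in milliseconds over `runs` runs.

    Returns limit_ms if any run raises MemoryError or exceeds the limit.
    """
    total = 0.0
    for _ in range(runs):
        start = clock()
        try:
            task()
        except MemoryError:
            return limit_ms               # out of memory counts as a failure
        elapsed_ms = (clock() - start) * 1000.0
        if elapsed_ms > limit_ms:
            return limit_ms               # out of time counts as a failure
        total += elapsed_ms
    return total / runs
```

The `clock` parameter exists only so that the function can be exercised with a fake clock; in normal use the default `time.perf_counter` is sufficient.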
6 Related Work
Existing OWL reasoners can be divided into tableaux-based and resolution-based reasoners. Among the tableaux-based reasoners, the most prominent are Racer, Pellet and FaCT++ [16,15,18]. Most of them, however, suffer from insufficient performance when answering conjunctive queries over a large number of instances. Some of them, e.g. FaCT++, do not provide support for ABox reasoning at all. Our approach extends such reasoners with the capability of scalable conjunctive query answering. Similar to our approach, InstanceStore is an extension of Racer that aims to support reasoning with a large number of instances, which can be stored in a database. However, unlike our approach, it is limited to ABoxes with no instances of properties. An alternative OWL reasoner for a subset of OWL is KAON2, which implements resolution-based reasoning [14]. Unlike KAON2, we can provide fast classification of the TBox and the ability to answer complex TBox queries.
Our approach was largely motivated by our previous work on the Grid Organizational Memory (GOM) [3]6, which provides support for grid-level ontological management and semantic metadata. There are also several other Grid-based semantic repositories, such as Tupelo7, Grimoires [7] and VEGA-KG. Unlike these systems, our approach can handle more expressive ontologies, while being able to perform conjunctive queries over ontologies with a large number of instances.
7 Conclusion
We have described the design and development of a scalable grid repository. Currently, we are working on the evaluation of the proposed system on a real-life application, following the use case that we developed during the Knowledge Workflow Grid project8. In the future we would like to address the caching and materialization aspects of the semantic storage.
6 http://gom.kwfgrid.net/web/space/Welcome
7 http://dlt.ncsa.uiuc.edu/wiki/index.php/Main_Page
8 A captured demo session can be found at http://www.gridworkflow.org/kwfgrid/distributions/movie-09-ffsc-red.avi

References
1. Lee, R.: Scalability Report on Triple Store Applications, http://simile.mit.edu/reports/stores/
2. Goble, C., De Roure, D.: The Semantic Grid: Myth Busting and Bridge Building. In: Proceedings of the 16th European Conference on Artificial Intelligence (ECAI 2004), Valencia, Spain (2004)
3. Kryza, B., Slota, R., Majewska, M., Pieczykolan, J., Kitowski, J.: Grid organizational memory-provision of a high-level Grid abstraction layer supported by ontology alignment. Future Generation Computer Systems 23(3), 348–358 (2007)
4. Wroe, C., Goble, C.A., Greenwood, M., Lord, P., Miles, S., Papay, J., Payne, T., Moreau, L.: Automating Experiments Using Semantic Data on a Bioinformatics Grid. IEEE Intelligent Systems 19, 48–55 (2004)
5. The Knowledge-based Workflow System for Grid Applications FP6 IST project, http://www.kwfgrid.net
6. Gil, Y., Ratnakar, V., Deelman, E., Mehta, G., Kim, J.: Wings for Pegasus: Creating Large-Scale Scientific Applications Using Semantic Representations of Computational Workflows. In: Proceedings of the 19th Annual Conference on Innovative Applications of Artificial Intelligence (IAAI), Vancouver, British Columbia, Canada, July 22-26 (to appear, 2007)
7. Wong, S.C., Tan, V., Fang, W., Miles, S., Moreau, L.: Grimoires: Grid registry with metadata oriented interface: Robustness, efficiency, security. IEEE Distributed Systems Online 6(10) (2005)
8. Oinn, T., Greenwood, M., Addis, M., Alpdemir, M.N., Ferris, J., Glover, K., Goble, C., Goderis, A., Hull, D., Marvin, D., Li, P., Lord, P., Pocock, M.R., Senger, M., Stevens, R., Wipat, A., Wroe, C.: Taverna: Lessons in creating a workflow environment for the life sciences. Concurrency and Computation: Practice and Experience, Grid Workflow Special Issue 18(10), 1067–1100 (2005)
9. Altintas, I., Berkley, C., Jaeger, E., Jones, M., Ludäscher, B., Mock, S.: Kepler: An Extensible System for Design and Execution of Scientific Workflows, system demonstration. In: 16th Intl. Conf. on Scientific and Statistical Database Management (SSDBM 2004), Santorini Island, Greece, June 21-23 (2004)
10. Sudholt, W., Altintas, I., Baldridge, K.K.: A Scientific Workflow Infrastructure for Computational Chemistry on the Grid. In: 1st International Workshop on Computational Chemistry and Its Application in e-Science in conjunction with ICCS 2006 (2006)
11. Alper, P., Corcho, O., Kotsiopoulos, I., Missier, P., Bechhofer, S., Kuo, D., Goble, C.: S-OGSA as a Reference Architecture for OntoGrid and for the Semantic Grid. In: GGF16 Semantic Grid workshop (2006)
12. Rao, J., Su, X.: A Survey of Automated Web Service Composition Methods. In: Cardoso, J., Sheth, A.P. (eds.) SWSWPC 2004. LNCS, vol. 3387, Springer, Heidelberg (2005)
13. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F. (eds.): The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, Cambridge (2003)
14. Kazakov, Y., Motik, B.: A Resolution-Based Decision Procedure for SHOIQ. In: Furbach, U., Shankar, N. (eds.) IJCAR 2006. LNCS (LNAI), vol. 4130, pp. 662–667. Springer, Heidelberg (2006)
15. Sirin, E., Parsia, B.: Pellet: An OWL DL Reasoner. In: Haarslev, V., Möller, R. (eds.) Proceedings of the 2004 International Workshop on Description Logics (DL 2004). CEUR Workshop Proceedings, CEUR-WS.org, Whistler, British Columbia, Canada, June 6-8, 2004, p. 104 (2004)
16. Haarslev, V., Möller, R., Wessel, M.: Querying the Semantic Web with Racer + nRQL. In: Proceedings of the KI 2004 International Workshop on Applications of Description Logics (ADL 2004), Ulm, Germany, September 24 (2004)
17. Bechhofer, S., Horrocks, I., Turi, D.: Implementing the Instance Store. University of Manchester, Computer Science preprint CSPP-29 (August 2004)
18. Tsarkov, D., Horrocks, I.: FaCT++ description logic reasoner: System description. In: Furbach, U., Shankar, N. (eds.) IJCAR 2006. LNCS (LNAI), vol. 4130, pp. 292–297. Springer, Heidelberg (2006)
Scientific Workflow: A Survey and Research Directions
Adam Barker and Jano van Hemert
National e-Science Centre, University of Edinburgh, United Kingdom
{a.d.barker,j.vanhemert}@ed.ac.uk
Abstract. Workflow technologies are emerging as the dominant approach to coordinate groups of distributed services. However, with a space filled with competing specifications, standards and frameworks from multiple domains, choosing the right tool for the job is not always straightforward. Researchers are often unaware of the range of technology that already exists and focus on implementing yet another proprietary workflow system. As an antidote to this common problem, this paper presents a concise survey of existing workflow technology from the business and scientific domains and makes a number of key suggestions towards the future development of scientific workflow systems.
1 Introduction
Service-oriented architectures are a popular architectural paradigm for building software applications from a number of loosely coupled, distributed services. By loosely coupled we mean that the client of a service is essentially independent from the service itself. When a client (which can be another service) invokes a remote service, it does not need to concern itself with the inner workings (for example, what language the service is written in) to take advantage of its functionality. Loosely coupled services are often more flexible than traditional tightly coupled applications. In a tightly coupled architecture, the different components are bound to one another, sharing semantics, libraries and often state, which makes it difficult to evolve the application. As services are independent from one another, they offer a greater degree of flexibility and scalability for evolving applications. Although the concept of service-oriented architectures is not a new one, this paradigm has seen widespread adoption through the Web services approach, which makes use of a suite of standards such as XML, WSDL and SOAP to facilitate service interoperability. Web services in their vanilla form provide a simple solution to a simple problem; things become more complex when a group of services needs to coordinate to achieve a shared task or goal. This coordination is often achieved through the use of workflow technologies. As defined by the Workflow Management Coalition [1], a workflow is the automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant (a resource; human or machine) to another for action, according to a set of procedural rules.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 746–753, 2008. © Springer-Verlag Berlin Heidelberg 2008
Workflow provides the glue for distributed services, which are owned and maintained by multiple organisations. Although originally a concept applied to automate repetitive tasks in business, there is currently much interest in the scientific domain in automating distributed experiments. The plethora of competing workflow specifications, standards, frameworks and toolkits from both the business and scientific communities leads researchers and engineers to decide that it is safer to reinvent the wheel by implementing yet another proprietary workflow language than to invest time in existing workflow technologies. In response to this recurring problem, this paper provides a concise survey of existing workflow technology from the business and scientific domains, and uses it to make a number of key suggestions.
2 Business Workflow Technology
As workflow technology was first adopted by the business community, the space has become crowded with competing specifications from rival companies, some of which have risen to the top, superseding others. Before discussing the standards, it is important to note that there are two main architectural approaches to implementing workflow: service orchestration and service choreography. Service orchestration refers to an executable business process that may interact with both internal and external services. Orchestration describes how services can interact at the message level, with an explicit definition of the control and data flow. Orchestrations can span multiple applications and/or organisations and result in long-lived, transactional processes. A central process always acts as a controller to the involved services, and the services themselves have no knowledge of their involvement in a higher-level application. The Business Process Execution Language (BPEL) [2] is an executable business process modelling language and currently the de-facto standard for orchestrating Web services. It has broad industrial support from companies such as IBM, Microsoft and Oracle. Industrial support brings concrete implementations, tools and training. Recent efforts from the Open Middleware Infrastructure Institute UK (OMII-UK) have resulted in an open-source graphical editor called BPEL Designer [3]. Other languages have been developed but have not been as widely adopted by the community. Yet Another Workflow Language (YAWL) [4] is based on the rigorous analysis of workflow patterns, a particular type of design pattern. YAWL aims to support all (or most) of the workflow patterns and has a formal underpinning based on Petri nets. The language is supported by an open-source implementation [5] and has some industrial support.
XML Process Definition Language (XPDL) is a format standardised by the Workflow Management Coalition (WfMC) to interchange business process definitions between different workflow products, such as modelling tools and workflow engines. WfMOpen [6] is an open-source, J2EE-based implementation of a workflow engine as proposed by the WfMC and the Object Management Group (OMG). WfMOpen uses XPDL as input.
Service choreography, on the other hand, is more collaborative in nature. A choreography model describes a collaboration between a collection of services in order to achieve a common goal. Choreography describes interactions from a global perspective, meaning that all participating services are treated equally, in a peer-to-peer fashion. Each party involved in the process describes the part it plays in the interaction. Choreography focuses on message exchange; all involved services are aware of their partners and of when to invoke operations. Orchestration differs from choreography in that it describes a process flow between services from the perspective of one participant (centralised control). Choreography, in contrast, tracks the sequence of messages involving multiple parties (decentralised control, no central server), where no one party truly owns the conversation. The Web Services Choreography Description Language (WS-CDL) [7] is an XML-based language that can be used to describe the common and collaborative observable behaviour of multiple services that need to interact in order to achieve a shared goal. WS-CDL describes this behaviour from a global or neutral perspective rather than from the perspective of any one party. WS-CDL is designed to sit on top of the Web services interface language, WSDL. WSDL focuses on capturing message types, while WS-CDL is about capturing behaviour. A user models a choreography from a global perspective; each service then has to be programmed by a developer in such a way that the services talk to one another and, in doing so, enforce the constraints of the choreography. WS-CDL supersedes the Web Service Choreography Interface (WSCI). Although the language is defined by a W3C specification, at the time of writing no implementations exist and interest in the specification has dwindled.
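The difference between the two approaches can be illustrated with a small sketch. It is a toy model under stated assumptions, not BPEL or WS-CDL: the services `fetch` and `analyse`, the `orchestrate` function and the `Peer` class are hypothetical names introduced only for this illustration, and plain function calls stand in for Web service invocations.

```python
# Two hypothetical services; neither knows anything about the other.
def fetch(query):
    return f"data({query})"


def analyse(data):
    return f"result({data})"


# Orchestration: a central process holds the control and data flow and
# invokes the services, which are unaware of the higher-level application.
def orchestrate(query):
    data = fetch(query)
    return analyse(data)


# Choreography: each peer knows its partner and forwards messages itself;
# no single party owns the conversation.
class Peer:
    def __init__(self, name, action, partner=None):
        self.name = name
        self.action = action
        self.partner = partner

    def receive(self, message, trace):
        trace.append(self.name)           # record who handled the message
        outcome = self.action(message)
        if self.partner is None:
            return outcome
        return self.partner.receive(outcome, trace)


analyser = Peer("analyser", analyse)
fetcher = Peer("fetcher", fetch, partner=analyser)
trace = []
choreographed = fetcher.receive("q", trace)
```

Both variants compute the same result; what differs is where the knowledge of the interaction lives. In `orchestrate` it sits in one controller, whereas in the choreographed version it is spread across the peers.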
3 Scientific Workflow Technology
The concepts of workflow have recently been applied to automating large-scale science (or e-Science), coining the term scientific workflow [8]. Business workflow tools look more like traditional programming languages, and are in general pitched at the wrong level of abstraction for scientists to take advantage of. Instead, scientists require higher-level tools, which enable them to plug together problem-solving components to prove a scientific hypothesis. A scientific workflow attempts to capture a series of analytical steps, which describe the design process of computational experiments. Scientific workflow systems provide an environment to aid the scientific discovery process through the combination of scientific data management, analysis, simulation and visualisation. Scientific and business workflows began from the same common ground. Both communities have overlapping requirements; however, each has its own domain-specific requirements and therefore needs separate consideration. Scientific work is centred around conducting experiments; a scientific workflow system should therefore mirror a scientist's conventional work patterns by allowing them to apply their methodology over distributed resources. The workflow system should allow the same information to be shown at various levels of abstraction, depending on who is using the system. A high level of abstraction should
be presented to the scientist, who we assume knows nothing about the underpinnings of service composition. The elements of the workflow should be in the context of the appropriate scientific domain and allow the scientist to validate a hypothesis. A workflow that achieves this validation will generally be built in an incremental manner, as opposed to the business-oriented approach, where a workflow is designed and then implemented. It is therefore essential that the workflow language and scientific workflow system can support this kind of user-driven, incremental, prototypical approach to workflow composition. As the validation of scientific hypotheses depends on experimental data, scientific workflow tends to have an execution model that is dataflow-oriented, whereas business workflow places an emphasis on control-flow patterns and events. Workflows in the scientific community involve the transportation and analysis of large quantities of data between distributed repositories. Scientists have to schedule and fund the use of expensive resources and cannot afford for a workflow to fail halfway through the execution cycle. It is therefore desirable that the systems that support them are robust and dependable. In addition, they should support incremental workflow construction and must be able to run detached from the user console. As a consequence of the lengthy, iterative design process, workflows become a valued commodity and a source of intellectual capital. The output of workflows, or the workflows themselves, may be used as a basis for future research, either by the scientists who generated the data or by colleagues in a related field. This methodology is consistent with the usual practice of non-computational labs. These workflows should be reused, refined over time, and shared with other scientists in the field. Scientific workflows must be fully reproducible.
In order for a workflow to be reproduced, provenance information must be recorded that indicates where the data originated, how it was altered, and which components and parameter settings were used. This allows other scientists to re-conduct the experiment and confirm the results. As in the business domain, the space of scientific workflow systems has become crowded with different languages and frameworks allowing scientists to automate tasks through a workflow. Below we provide a survey of the most popular workflow systems. Taverna is an open-source, Grid-aware workflow management system; it provides a set of transparent, loosely coupled, semantically enabled middleware to support scientists who perform data-intensive in-silico [9] experiments on distributed resources. Taverna is implemented as a service-oriented architecture, based on Web service standards. Provenance [10] plays an integral part in Taverna, allowing users to capture and inspect details such as who conducted the experiment, what services were used, and what results the services provided. Taverna uses a proprietary language, the Simple Conceptual Unified Flow Language, or SCUFL [11] for short. SCUFL is a high-level, XML-based conceptual language allowing a user to define a workflow through groups of local or remote services which are connected with data links (providing data flow)
and control links (allowing coordination of services not connected through data flow). The Taverna workbench depends on the FreeFluo engine [9]. Kepler [12] is an open-source scientific workflow engine with contributors from a range of application-oriented research projects. Kepler is built upon the Ptolemy II system [13], based at the University of California at Berkeley, which is a mature dataflow-oriented workflow architecture. In Kepler, the focus is on actor-oriented design. Actors are re-usable, independent blocks of computation, such as Web services or database calls. They consume data from a set of inports and write data to a set of outports. A group of actors can then be wired together by introducing a mapping from outports to inports. A novel feature in Kepler allows the actor communication (dataflow) concerns to be separated from the overall workflow coordination, which is defined in a separate component called a director. This separation allows a workflow model to be run with different execution semantics, such as synchronous dataflow and process networks. Kepler provides a large variety of computational models inherited from the Ptolemy II system and uses the proprietary Modelling Markup Language. Triana [14] is an open-source problem-solving environment and a test application for the GridLab project [15]. It is designed to define, process, analyse, manage, execute and monitor workflows. The toolkit allows users to compose workflows graphically by dragging programming components called units or tools onto a workspace; connectivity is achieved by wiring components together using data and control links. Triana can distribute sections of a workflow to remote machines through a connected peer-to-peer network. Triana supports multiple languages by allowing different workflow readers/writers to be plugged in, including Web Services Flow Language (WSFL), Directed Acyclic Graph (DAG), Business Process Execution Language (BPEL) and Petri net formats.
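The actor and director separation described for Kepler above can be illustrated with a toy model. This is a minimal sketch under stated assumptions and does not use Kepler's or Ptolemy II's actual (Java) API: the classes `Actor` and `SequentialDirector`, the port names and the `wires` mapping are all hypothetical. Actors only declare ports and a computation; the director alone decides when actors fire and how tokens move along the wires.

```python
class Actor:
    """A reusable block of computation with named inports and one outport."""

    def __init__(self, name, func, inports=(), outport="out"):
        self.name = name
        self.func = func
        self.inports = tuple(inports)
        self.outport = outport


class SequentialDirector:
    """Hypothetical director: fires each actor once, in the order given."""

    def run(self, actors, wires):
        tokens = {}  # (actor name, port name) -> data token
        for actor in actors:
            args = [tokens[(actor.name, port)] for port in actor.inports]
            value = actor.func(*args)
            tokens[(actor.name, actor.outport)] = value
            # Copy the token along every wire leaving this outport.
            for target in wires.get((actor.name, actor.outport), ()):
                tokens[target] = value
        return tokens


# Wiring: a mapping from outports to the inports they feed.
source = Actor("source", lambda: [3, 1, 2])
sorter = Actor("sorter", sorted, inports=["in"])
wires = {("source", "out"): [("sorter", "in")]}
tokens = SequentialDirector().run([source, sorter], wires)
```

Because the actors contain no scheduling logic, the same `actors` and `wires` could in principle be handed to a different director class with other execution semantics, which is the point of the separation.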
Planning for Execution in Grids, or Pegasus [16], is a framework which maps scientific workflows onto distributed resources such as a Grid. Abstract workflows designed by a domain scientist are independent of any resources they will be executed on; this allows a scientist to focus on workflow design rather than having to decide which physical resources to use. Pegasus then attempts to find a mapping of the tasks to the available resources for execution at runtime through the use of Artificial Intelligence planning techniques. At the abstract level, Pegasus uses the proprietary language DAX, which is an XML representation of a DAG. GridNexus [17] is a graphical system for creating and executing scientific workflows in a Grid environment. GridNexus allows the user to assemble complex processes involving data retrieval, analysis and visualisation by building a directed acyclic graph (DAG) in a visual environment. The graphical user interface (GUI) of GridNexus, like that of Kepler, is based on Ptolemy II from UC Berkeley. Once a scientist has designed a workflow using the GUI editor, it is translated into the proprietary XML-based language JXPL. Importantly, GridNexus separates the GUI from the execution of the workflow; hence, once constructed, a workflow (described using JXPL) can be executed locally or remotely. DiscoveryNet [18] is an EPSRC-funded project to build a platform for scientific knowledge discovery from the data generated by a wide variety of high
throughput devices at Imperial College, London. DiscoveryNet takes a service-oriented view and provides a concrete implementation, firstly for scientists to plan, manage and share knowledge discovery, and secondly for service providers to publish data mining and data analysis software components. The Discovery Process Markup Language (DPML) is a proprietary XML-based language which describes processes as dataflow graphs; a user can compose a workflow by wiring together nodes (which represent datasets and functions) as a directed acyclic graph. Other projects worth referencing are the Bioinformatics Workflow Builder Interface (BioWBI) [19], an IBM Web-based environment for constructing and executing workflows in the Life Sciences community; GridBus [20], a Grid-aware workflow execution environment; Imperial College e-Science Networked Infrastructure (ICENI) [21]; and Magenta [22], an open-source, decentralised Web services composition tool. The Workflow Enactment Engine Project (WEEP) aims to implement an easy-to-use workflow enactment engine for WS-I/WSRF services, using WS-BPEL 2.0 (Web Services Business Process Execution Language) as a formalism to process and execute workflows, which currently focus on data mining tasks [23].
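Several of the systems surveyed above (Pegasus, GridNexus, DiscoveryNet) describe a workflow as a directed acyclic graph of tasks connected by data dependencies. A minimal sketch of executing such a graph is shown below. It assumes Python 3.9 or later for the standard-library `graphlib` module; the function `run_dag` and the task names are hypothetical and do not correspond to DAX, JXPL or DPML.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, dependencies):
    """Run every task after the tasks it depends on.

    tasks:        name -> callable taking a dict of upstream results
    dependencies: name -> set of upstream task names
    """
    results = {}
    order = []
    # static_order() yields each node only after all of its predecessors.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = tasks[name](upstream)
        order.append(name)
    return results, order


# Hypothetical three-step dataflow: load -> square -> total.
tasks = {
    "load": lambda up: [1, 2, 3],
    "square": lambda up: [x * x for x in up["load"]],
    "total": lambda up: sum(up["square"]),
}
dependencies = {"square": {"load"}, "total": {"square"}}
results, order = run_dag(tasks, dependencies)
```

The graph says nothing about where each task runs. Mapping tasks to concrete resources, as Pegasus does, would be a separate step applied to the same abstract description.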
4 Discussion
Through our experience of evaluating workflow frameworks, this paper makes the following observations for improvements and future research concerning usability, sustainability and tool adoption:

Collaboration is key. With so many tools available, it is essential that researchers from multiple domains form collaborative groups to meet and discuss the common, overlapping requirements. Collaboration is necessary in order to prevent implementing highly specific, hardly used features, as well as preventing reinvention of any wheels.

Use a conventional scripting language instead. There is increased interest and hype surrounding workflow technology and often it is oversold to domain scientists. Existing scripting languages (Perl etc.) have received much investment, both in the form of training and in the development of features such as debugging, extensive add-on libraries and integrated editors. These features have yet to be included in workflow systems. Workflow systems essentially have to start from scratch, re-implementing each feature for a specific framework.

Do not implement another workflow language. As demonstrated by our survey of workflow technology, implementing yet another workflow language is the last thing that researchers should be doing. There are many well-developed and well-supported frameworks, which researchers should first investigate.

Abstract is not abstract enough. "Research your domain" may seem like an obvious statement, but it is a fact that the day-to-day tools that scientists use differ vastly from one domain to the next. Experience with working alongside domain scientists has taught us that wet laboratory biologists are uncomfortable using even the most abstract workflow tools currently available. Problem solving in terms of services is outside the normal pattern of thinking. On the other end of
the spectrum, physics researchers are comfortable with command-line tools and are used to thinking about problems in terms of programming. Tools need to be tailored to the domain instead of being built by computer scientists for computer scientists. Research into intelligent, abstract editors should be a priority for the scientific community; preliminary examples can be seen in the Taverna and Pegasus frameworks.

Stick to standards. It is still unclear whether the processes required by a scientific domain can be captured using business workflow technology such as BPEL. To this end, does it make sense to create languages specifically for scientific workflow, like SCUFL, which is used in Taverna? BPEL has become the de-facto standard workflow language, supported by industrial-strength software implementations from major vendors. With this level of support, from the perspective of the user it makes sense to stick to standards instead of tying oneself to a proprietary workflow language. If software development and tool support terminates on one of the proprietary frameworks, workflows will need to be re-implemented from scratch. Instead, standards need to be viewed in a different way: research needs to be targeted at providing powerful abstraction mechanisms for languages such as BPEL and at providing integrated tool support. These mechanisms should allow scientists to model workflows at higher levels of abstraction and automatically translate them into BPEL processes that will run on any BPEL workflow engine. BPEL Designer, which is part of the OMII-UK software stack, is an initial step towards providing these abstractions.

Portal-based access. In our experience, even if workflow tools are made available to domain scientists, they often cannot or will not download and install them. These tools need to be made available through portals, taking advantage of thin-client technology such as AJAX.
It is clear from our survey of existing workflow technology that the space is starting to get crowded with competing alternatives. Business workflow specifications and standards are the result of negotiations between rival companies. With these concrete standards comes industrial-strength software. Scientific workflow, on the other hand, is a relatively new area of interest; each project has its own unique set of requirements and often develops yet another proprietary language and workflow execution engine. To encourage the adoption of existing tools, this paper has presented a concise survey of business and scientific workflow technologies. Furthermore, we have pointed out important research directions that would make the existing technologies more suitable within the context of scientific workflow.
References
1. Hollingsworth, D.: The Workflow Reference Model. Workflow Management Coalition. Document Number tc00-1003 edn. (January 1995)
2. The OASIS Committee: Web Services Business Process Execution Language (WS-BPEL) Version 2.0 (April 2007)
3. Wassermann, B., et al.: Sedna: A BPEL-based environment for visual scientific workflow modelling. Workflows for eScience - Scientific Workflows for Grids (December 2006)
4. van der Aalst, W., ter Hofstede, A.: Yet another workflow language. Information Systems 30(4), 245–275 (2005)
5. van der Aalst, W.M.P., et al.: Design and Implementation of the YAWL system. In: Persson, A., Stirna, J. (eds.) CAiSE 2004. LNCS, vol. 3084, Springer, Heidelberg (2004), http://sourceforge.net/projects/yawl/
6. Lipp, M.: The Danet Workflow Component V2.1 (2007), http://wfmopen.sourceforge.net/
7. Kavantzas, N., et al.: Web Services Choreography Description Language Version 1.0 (November 2005)
8. Deelman, E., Gil, Y.: Workshop on the Challenges of Scientific Workflows. Technical report, Information Sciences Institute, University of Southern California (May 2006)
9. Oinn, T., et al.: Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics 20(17), 3045–3054 (2004)
10. Zhao, J., et al.: Annotating, linking and browsing provenance logs for e-Science. In: 1st Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data, Sanibel Island, Florida, USA (October 2003)
11. Oinn, T., et al.: Delivering Web Service Coordination Capability to Users. In: WWW 2004, New York, pp. 438–439 (2004)
12. Ludäscher, B., et al.: Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience 18(10), 1039–1065 (2005)
13. Buck, J.T., Ha, S., Lee, E.A., Messerschmitt, D.G.: Ptolemy: A framework for simulating and prototyping heterogeneous systems. Journal of Computer Simulation 4, 155–182 (1994)
14. Taylor, I.J., et al.: Distributed P2P Computing within Triana: A Galaxy Visualization Test Case. In: 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), pp. 16–27. IEEE Computer Society, Los Alamitos (2003)
15. Allen, G., et al.: Enabling Applications on the Grid: A GridLab Overview. International Journal of High Performance Computing Applications: Special Issue on Grid Computing: Infrastructure and Applications 17(4), 449–466 (2003)
16. Deelman, E., et al.: Pegasus: A framework for Mapping Complex Scientific Workflows onto Distributed Systems. Scientific Programming Journal 13(3), 219–237 (2005)
17. Brown, J.L., et al.: GridNexus: A Grid Services Scientific Workflow System. International Journal of Computer Information Science (IJCIS) 6(2), 72–82 (2005)
18. Rowe, A., et al.: The Discovery Net System for High Throughput Bioinformatics. Bioinformatics 19(1), 225–231 (2003)
19. Siepel, A.C., et al.: An integration platform for heterogeneous bioinformatics software components. IBM Systems Journal 40(2), 570–591 (2001)
20. Buyya, R., Venugopal, S.: The Gridbus Toolkit for Service Oriented Grid and Utility Computing: An Overview and Status Report. In: Proceedings of the First IEEE International Workshop on Grid Economics and Business Model, pp. 19–36 (April 2004)
21. Mayer, A., et al.: Meaning and Behaviour in Grid Oriented Components. In: Parashar, M. (ed.) GRID 2002. LNCS, vol. 2536, pp. 100–111. Springer, Heidelberg (2002)
22. Walton, C., Barker, A.D.: An Agent Based e-Science Experiment Builder. In: Proceedings of The 1st International Workshop on Semantic Intelligent Middleware for the Web and the Grid, European Conference on Artificial Intelligence (ECAI) (August 2004)
23. Janciak, I., Kloner, C.: Workflow enactment engine project v1.0 (2007), http://weep.gridminer.org/
A Light-Weight Grid Workflow Execution Engine Enabling Client and Middleware Independence
Erik Elmroth, Francisco Hernández, and Johan Tordsson
Dept. of Computing Science and HPC2N, Umeå University, SE-901 87 Umeå, Sweden
{elmroth, hernandf, tordsson}@cs.umu.se
Abstract. We present a generic and light-weight Grid workflow execution engine made available as a Grid service. A long-term goal is to facilitate the rapid development of application-oriented end-user workflow tools, while providing a high degree of Grid middleware-independence. The workflow engine is designed for workflow execution, independent of client tools for workflow definition. A flexible plugin-structure for middleware-integration provides a strict separation of the workflow execution and the processing of individual tasks, such as computational jobs or file transfers. The light-weight design is achieved by focusing on the generic workflow execution components and by leveraging state-of-the-art Grid technology, e.g., for state management. The current prototype is implemented using the Globus Toolkit 4 (GT4) Java WS Core and has support for executing workflows produced by Karajan. It also includes plugins for task execution with GT4 as well as a high-level Grid job management framework.
1 Introduction
Motivated by the tedious work required to develop end-user workflow tools and the lack of generic tools to facilitate such development, this contribution focuses on a light-weight and Grid-interoperable workflow execution engine made available as a Grid service. As a point of departure, we identify important and generic capabilities supported by well-recognized complete workflow systems [18,15,9,14,1,10,2] (e.g., workflow design, workflow repositories, information management, workflow execution, workflow scheduling, fault tolerance, and data management). However, many of these projects provide similar functionality and much work is overlapping, as the systems have been developed independently [18]. The tool presented here is not proposed as an alternative to these more complete workflow systems, but as a core component for developing new end-user tools and problem solving environments. The aim is to offer a generic workflow execution engine that can be employed for building new high-level tools as
This research was conducted using the resources of the High Performance Computing Center North (HPC2N). Financial support has been provided by The Swedish Research Council (VR) under contract 621-2005-3667.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 754–761, 2008. © Springer-Verlag Berlin Heidelberg 2008
well as to provide support for both processing individual tasks on multiple Grid middlewares and accepting different workflow languages as input. The engine is light-weight as it focuses only on workflow execution (i.e., selecting tasks that are ready to execute) and its corresponding state management. The engine is developed with a strict focus on Grid resources for task processing and makes efficient use of state-of-the-art Web and Grid services technology. The current prototype is implemented using the Java WS Core from the Globus Toolkit 4 (GT4) [7]. The service has support for executing workflows expressed either in its native workflow language or the Karajan [16] format. It includes plugins for arbitrary Grid tasks, e.g., for execution of computational tasks in GT4 and in the high-level Grid Job Management Framework (GJMF) [3,5], as well as GridFTP file transfers.
2 System and Design Requirements
The general system requirements follow directly from the aim and motivation for the proposed workflow engine. As it is developed with a general aim to provide an efficient and reusable tool for managing workflows in Grid environments, overall requirements include client and middleware independence, modularity, customizability, and separation of concerns [4]. A set of high-level design requirements for Grid workflow systems includes the following.
– The workflow execution should be separated from the workflow definition. The former must be done by the engine, the latter can be done, e.g., by an application specific GUI or a Web portal. Furthermore, workflow repositories and application specific information should not be managed by the service.
– The workflow engine should be independent of the Grid middleware used to execute the tasks, with middleware-specific interactions performed by plugins. The plugins should in turn be unaware of the context (the workflow) to which individual Grid jobs belong.
– The design can and should to a large extent leverage state-of-the-art Grid technology and emerging standards, e.g., by making use of general features of the Web Services Resource Framework (WSRF) [13] instead of implementing their workflow-specific counterparts.
– The engine should have a clean separation between the state management and the handling of task dependencies.
In addition to the high-level design requirements, the following specific system requirements are highlighted. The workflow system should:
– provide support for executing workflows, managing workflow state, and pausing and resuming execution. This enables restart of partially completed workflows stored on disk, and provides a foundation for fault tolerant workflow execution.
– provide support for both abstract (resources unspecified) and concrete workflows (resources specified on a per-task level) as well as arbitrary nestings of workflows.
– provide support for dynamic workflows, i.e., making it possible to modify an already executing workflow, by pausing the execution before modification.
– provide support for workflow monitoring, both synchronously and by asynchronous notifications.
– provide support for notifications of different granularity, e.g., enabling asynchronous status updates on both a per workflow and a per task basis.
These requirements agree with and extend the requirements for Grid workflow engines presented in [6]. How the requirements are mapped to the actual implementation is presented in Section 3.
3 Design and Implementation
The design requirements of customizability and ability for integration with different client tools and middlewares are met by use of appropriate plugin points. The chain-of-responsibility design pattern allows concurrent usage of multiple implementations of a particular plugin. The three main responsibilities of the workflow service, namely management of task dependencies (i.e., deciding the task execution order), execution of workflow tasks on Grid resources, and management of workflow state, are each performed by separate modules. Reuse, in a broad sense, is a key issue in the design. The workflow service reuses ideas from an architecture for interoperable Grid components [5] and builds on a framework for managing stateful Web services and notifications [13]. Exploiting the capabilities offered by GT4 Java WS Core (e.g., security and persistency) also simplifies the design and implementation of the service.
3.1 Modelling Workflows with the WSRF
The workflow engine uses the tools provided by GT4 Java WS Core to make the engine available as a Grid service and to manage the workflow state. Building the engine on top of Java WS Core does not mean that it is intended primarily for GT4-based Grids. Integration with different middlewares is provided by middleware-specific plugins which are independent of the workflow execution. By careful design, the service can handle arbitrarily many workflows concurrently without these interfering with each other. Multiple users can share the same workflow service, but only the creator of a workflow instance can monitor and control that workflow.
Each workflow is modelled as a WS-Resource and all information about a workflow, including task descriptions, inter-task dependencies and workflow state, is stored as WS-ResourceProperties. The default behavior is to store each WS-Resource in a separate file, although alternative implementations such as persistency via database can be added easily. Reuse of the Java WS Core persistency mechanisms makes workflow state handling trivial. Workflow state management enables the control of long-running workflows and the recovery of workflows, e.g., upon service failures. The states handled include default, ready, running, and completed, which apply to both tasks and (sub)workflows. Tasks can also be failed whereas
workflows can be disabled. All newly created tasks and workflows have the default state. A task/workflow is ready to be started when all tasks on which it depends are completed. Running tasks are processed by some Grid resource until they become either completed or failed. A running workflow has at least one task that is not completed and no failed task, whereas completed workflows only contain completed tasks/subworkflows. A workflow becomes disabled either if a task fails or if the user requests the workflow to be paused. No new tasks are initiated for disabled workflows. A resume request from the user is required to make a disabled workflow running again. If the workflow becomes disabled due to task failure, the user must modify the workflow (to correct the failed task) before issuing the resume request.
3.2 Architecture of the Workflow Engine
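As background for the operations discussed in this section, the state rules of Section 3.1 can be summarized in a few lines of code. The following is a minimal Python sketch and not the GT4-based implementation: the names State and workflow_state are hypothetical, the ready state is left out because it depends on the enclosing workflow, and treating a workflow whose tasks are all in the default state as default is our own simplifying assumption.

```python
from enum import Enum


class State(Enum):
    DEFAULT = "default"
    READY = "ready"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"      # applies to tasks only
    DISABLED = "disabled"  # applies to workflows only


def workflow_state(task_states, paused=False):
    """Derive a workflow state from the states of its tasks (hypothetical helper)."""
    states = list(task_states)
    # A failed task or a pause request disables the workflow.
    if paused or State.FAILED in states:
        return State.DISABLED
    # Completed workflows contain only completed tasks/subworkflows.
    if states and all(s is State.COMPLETED for s in states):
        return State.COMPLETED
    # Assumption: nothing has been started yet, so the workflow is still default.
    if all(s is State.DEFAULT for s in states):
        return State.DEFAULT
    # Otherwise at least one task is not completed and none has failed.
    return State.RUNNING
```

For example, a workflow with one completed and one failed task evaluates to the disabled state, and stays there until the user corrects the task and issues a resume request.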
The workflow service implements operations to (i) create a new workflow, (ii) suspend the execution of a workflow, (iii) resume execution of a workflow, (iv) modify a workflow, and (v) cancel a workflow. The service also supports monitoring of workflows, either by explicit status requests or by asynchronous notifications of updates. To support a wide range of client requirements, different granularities of notifications are available, ranging from a single message upon workflow completion to detailed updates every time a task changes its state. As Java WS Core contains mechanisms for managing WS-Resources (in this case workflows), the monitoring functionality as well as operations (iv) and (v) are trivial to implement (using WS-Notifications [8], and WSRF [13], respectively).
The architecture of the workflow engine is shown in Figure 1. User credentials are delegated from clients to the workflow service to be used when interacting with Grid resources. This requires the Web service interface to perform authentication and authorization of clients. All incoming requests are forwarded to the Coordinator, which organizes and manages the execution of tasks (and subworkflows) in the workflow and handles workflow state. When a new workflow is requested, the Coordinator uses the Input Converter plugin(s) to translate the input workflow description from the native format specified by the client to the internal workflow language. However, the Input Converters typically do not translate the individual task descriptions, as these are only to be read by the Grid Executor plugin(s), which the Coordinator invokes to process one (or more) tasks. The Grid Executor interface defines operations to initiate new tasks, to reconnect to already initiated tasks after service restart, and to cancel tasks, corresponding to the create, resume and cancel operations in the Web service interface.
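The three executor operations, combined with the chain-of-responsibility pattern mentioned in Section 3, might be sketched as follows. This is an illustrative Python sketch only: the class and method names (GridExecutor, accepts, dispatch, EchoExecutor) and the format strings are hypothetical and do not reproduce the Java interface of the prototype.

```python
from abc import ABC, abstractmethod


class GridExecutor(ABC):
    """Hypothetical plugin interface with the three operations named in the text."""

    @abstractmethod
    def accepts(self, task):
        """Return True if this plugin can process the task description."""

    @abstractmethod
    def initiate(self, task):
        """Start a new task and return a handle to it."""

    @abstractmethod
    def reconnect(self, handle):
        """Re-attach to an already initiated task, e.g., after a service restart."""

    @abstractmethod
    def cancel(self, handle):
        """Cancel an initiated task."""


class EchoExecutor(GridExecutor):
    """Stand-in executor that only records handles; it contacts no middleware."""

    def __init__(self, fmt):
        self.fmt = fmt
        self.active = set()

    def accepts(self, task):
        return task.get("format") == self.fmt

    def initiate(self, task):
        handle = f"{self.fmt}:{task['name']}"
        self.active.add(handle)
        return handle

    def reconnect(self, handle):
        return handle in self.active

    def cancel(self, handle):
        self.active.discard(handle)


def dispatch(executors, task):
    """Chain of responsibility: the first plugin that accepts the task runs it."""
    for executor in executors:
        if executor.accepts(task):
            return executor.initiate(task)
    raise LookupError(f"no executor for format {task.get('format')!r}")
```

Under these assumptions, dispatch([EchoExecutor("rsl"), EchoExecutor("jsdl")], {"name": "t1", "format": "jsdl"}) skips the first plugin and returns a handle from the second, which is how several executor implementations can coexist in one service.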
There is however no operation to pause a running task, as this functionality generally is not supported by Grid middlewares. Computational Grid Executors also ensure that tasks’ input and output files are transferred in compliance with the data dependencies in the workflow, but are unaware of the context (the workflow) to which each task belongs. This type of Grid Executor only requires a basic job submission mechanism, e.g., WS-GRAM [7], but can also make use of sophisticated frameworks, e.g., the GJMF [3] for resource brokering and fault tolerant job execution,
[Figure: clients (e.g., client APIs, end-user tools, application portals) access the Workflow Service through its Web Service Interface; the Workflow Service comprises the Coordinator, the Dependency Manager, Input Converter plugins, and Grid Executor plugins; the Grid Executors interact with the Grid middleware(s).]
Fig. 1. Overview of the workflow service architecture
should such functionality be available. Scheduling is performed on a per-task basis by the Grid Executor plugins. However, tools for planning or pre-scheduling of workflows (e.g., Pegasus [2]) can be employed if such functionalities are required. Moreover, support for abstract and concrete workflows is granted via the Executor plugins and external tools respectively. Before the Coordinator can invoke the Grid Executor(s) in order to start new tasks, the Dependency Manager is used to select which task(s) to execute. This module keeps track of dependencies between tasks (and subworkflows) in a workflow, and determines when a task (or subworkflow) is ready to start. The Coordinator invokes the Dependency Manager to get a list of tasks available for execution when a new workflow is started, when a task in an existing workflow completes, and when a paused workflow is resumed.
3.3 Properties of the Workflow Language
In the workflow service, workflows are described in a data flow language, defined using XML schema. In this language, users specify task dependencies, not task execution order. This removes the burden of figuring out which tasks can execute in parallel as this is the responsibility of the Dependency Manager. The workflow language supports arbitrary nesting of tasks and subworkflows within a workflow. Each task (or workflow) specifies a set of input and output ports. A (sub)workflow contains a set of links, where each link connects an output port of one task/workflow with an input port of another. The task description contains a field to specify how to perform the task. By having this field generic,
the usage of multiple Grid task description formats is possible. Different formats for individual task descriptions may even be used within the same workflow. This design also enables support for new task types, e.g., database queries and Web service invocations, to be added by implementing Grid Executor plugins rather than extending the workflow language.
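To make the role of ports and links concrete, the following minimal Python sketch derives the set of startable tasks from link declarations alone, which is the decision attributed to the Dependency Manager above. The tuple encoding of links and the function name ready_tasks are assumptions made for illustration; the actual language is defined by an XML schema that is not reproduced here, and nested subworkflows are not modelled.

```python
def ready_tasks(tasks, links, completed, started):
    """Return the tasks whose upstream tasks have all completed.

    tasks:     iterable of task names
    links:     iterable of ((source_task, output_port), (target_task, input_port))
    completed: set of completed task names
    started:   set of task names that are currently running
    """
    upstream = {task: set() for task in tasks}
    for (source, _out_port), (target, _in_port) in links:
        upstream[target].add(source)
    return sorted(
        task for task in tasks
        if task not in completed
        and task not in started
        and upstream[task] <= completed
    )


# Hypothetical four-task workflow: one producer, two parallel consumers, one join.
tasks = ["prepare", "sim_a", "sim_b", "merge"]
links = [
    (("prepare", "out"), ("sim_a", "in")),
    (("prepare", "out"), ("sim_b", "in")),
    (("sim_a", "out"), ("merge", "left")),
    (("sim_b", "out"), ("merge", "right")),
]
```

With nothing completed only prepare is ready; once it completes, sim_a and sim_b become ready together, so the parallelism follows from the links without the user ever stating an execution order.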
4 Analysis and Comparison with Other Systems
One of the main objectives of this work is to provide independence not only of Grid middleware but also of input representation, the latter achieved by converters that translate different workflow languages to the service’s internal data flow language. How difficult these translations are depends on the style of the original client’s language and the amount and type of information that can be expressed in that language. Data flow languages with similar input/output port structure are simple to translate. Control flow languages can also be translated by specifying ports that represent flow of control rather than data transfers. For example, the subset of the Karajan language [16] that performs basic interactions with Grid resources (job submissions, file transfers, and sequential and parallel definition of tasks) has been translated as described above. Petri net languages pose more difficulties. Places and transitions representing data flows can easily be translated to the service’s internal data flow language. However, there is no equivalent concept for representing loops in the service’s language. Finally, it can also be hard to translate languages that do not have all the information encoded in the workflow description but rely on the runtime system to obtain the missing information (e.g., a workflow system that dynamically queries a repository to obtain the input/output structure of workflow tasks).
While several workflow projects have been built to interact with Grid systems [15,14,1], many of them have not been designed for exclusive use of Grid resources for workflow execution. Nevertheless they are integrated solutions with sophisticated graphical environments, workflow repositories, and fault management mechanisms. Our work does not attempt to replace those systems, but to provide a means for accessing advanced capabilities offered by multiple Grid middlewares.
These benefits are obtained by the separation of the workflow execution from its definition and by making use of well-established protocols. Furthermore, implementing the workflow engine as a stateful WSRF service facilitates the management and control (including fault recovery) of long-running workflows which are common in Grid computing. The P-GRADE portal [12] and Karajan [16] also focus on the use of resources from different Grids within the same workflow. P-GRADE offers a collaborative environment in which multiple users define workflows through a client application, and control and manage workflows through a portal. The workflows can access resources from multiple Globus-based virtual organizations. Our work goes beyond this functionality by adding the capability of using other middlewares besides Globus and also offering independence of input language. Karajan also provides a level of interoperability between different execution mechanisms
(mainly GT2, GT4, Condor, and the SSH protocol) through the use of providers that allow selection of middleware at runtime. However, while Karajan has a stronger focus on the interaction between users and workflows, our work focuses on handling the workflow state, delegating the interaction with users to clients that have access to the workflow service. A few projects use WSRF to facilitate the construction of workflow services. The Grid Workflow Execution Service (GWES) [11] uses a Petri net language to define and control Grid workflows. Besides the differences in workflow language type, the main difference between GWES and our work is the ability to use multiple input representations offered by our contribution. The Workflow Enactment Engine Project (WEEP) [17] provides a BPEL engine for Grid workflows. The engine is accessible as a WSRF service running in a GT4 container. However, WEEP is focused on Web service invocations and not on interfacing with Grid middleware.
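The translation of control flow constructs discussed earlier in this section, where ports represent flow of control rather than data transfers, can be illustrated with a small sketch. The Python code below is hypothetical and does not parse Karajan: nested ("seq", [...]) and ("par", [...]) tuples stand in for sequential and parallel constructs, the port names done and start are invented, and every construct is assumed to have at least one child.

```python
def translate(node):
    """Translate a nested seq/par structure into control links.

    Returns (entry_tasks, exit_tasks, links), where each link has the form
    ((source_task, "done"), (target_task, "start")).
    """
    if isinstance(node, str):  # a single task
        return [node], [node], []
    kind, children = node
    parts = [translate(child) for child in children]
    links = [link for _, _, child_links in parts for link in child_links]
    if kind == "par":
        # Parallel children are not linked to each other.
        entries = [t for child_entries, _, _ in parts for t in child_entries]
        exits = [t for _, child_exits, _ in parts for t in child_exits]
        return entries, exits, links
    if kind == "seq":
        # Each child starts only after every exit task of its predecessor is done.
        for (_, prev_exits, _), (next_entries, _, _) in zip(parts, parts[1:]):
            links += [((p, "done"), (n, "start"))
                      for p in prev_exits for n in next_entries]
        return parts[0][0], parts[-1][1], links
    raise ValueError(f"unknown construct: {kind}")
```

A sequence of a stage-in task, two parallel jobs, and a stage-out task yields four control links, which a data flow engine can then schedule in the same way as ordinary data dependencies. Loops have no counterpart in this scheme, in line with the limitation noted for Petri net languages.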
5 Concluding Remarks
The goal of this research is to investigate how to design a light-weight workflow engine that can be reused by different high-level tools. General requirements for portability and interoperability are supported by the use of an appropriate plugin-structure for workflow language formats and for interacting with different Grid middlewares. Scalability is obtained by handling multiple workflows and by supporting large hierarchical workflows. The workflow service performs monitoring, state management, and fault recovery, and it uses appropriate security mechanisms to achieve user isolation. The Executor plugins handle data movement, job submission, information retrieval, and just-in-time scheduling. External tools can be employed for planning and pre-scheduling of workflows. We finally note that much of the supported functionality is obtained with little or no effort by appropriate use of the WSRF.
Acknowledgements
We thank P-O Östberg for fruitful discussions on workflow system design and language constructs, and for collaboration in the integration of the GJMF [3]. We are also grateful to the anonymous referees for their constructive comments.
References
1. Altintas, I., Birnbaum, A., Baldridge, K., Sudholt, W., Miller, M., Amoreira, C., Potier, Y., Ludaescher, B.: A framework for the design and reuse of Grid workflows. In: Herrero, P., S. Pérez, M., Robles, V. (eds.) SAG 2004. LNCS, vol. 3458, pp. 119–132. Springer, Heidelberg (2005)
2. Deelman, E., Singh, G., Su, M., Blythe, J., Gil, Y., Kesselman, C., Mehta, G., Vahi, K., Berriman, G.B., Good, J., Laity, A., Jacob, J.C., Katz, D.S.: Pegasus: a framework for mapping complex scientific workflows onto distributed systems. Scientific Programming 13(3), 219–237 (2005)
3. Elmroth, E., Gardfjäll, P., Norberg, A., Tordsson, J., Östberg, P.-O.: Designing general, composable, and middleware-independent Grid infrastructure tools for multitiered job management. In: Priol, T., Vaneschi, M. (eds.) Towards Next Generation Grids, pp. 175–184. Springer, Heidelberg (2007)
4. Elmroth, E., Hernández, F., Tordsson, J., Östberg, P.-O.: Designing service-based resource management tools for a healthy Grid ecosystem. In: Wyrzykowski, R., et al. (eds.) Parallel Processing and Applied Mathematics. 7th Int. Conference, PPAM 2007. LNCS, Springer, Heidelberg (2007)
5. Elmroth, E., Tordsson, J.: An interoperable, standards-based Grid resource broker and job submission service. In: Stockinger, H., et al. (eds.) First International Conference on e-Science and Grid Computing, pp. 212–220. IEEE CS Press, Los Alamitos (2005)
6. Eswaran, S., Del Vecchio, D., Wasson, G., Humphrey, M.: Adapting and evaluating commercial workflow engines for e-Science. In: Second IEEE International Conference on e-Science and Grid Computing, IEEE CS Press, Los Alamitos (2006)
7. Foster, I.: Globus toolkit version 4: Software for service-oriented systems. In: Jin, H., Reed, D., Jiang, W. (eds.) NPC 2005. LNCS, vol. 3779, pp. 2–13. Springer, Heidelberg (2005)
8. Graham, S., Hull, D., Murray, B.: Web Services Base Notification 1.3 (WSBaseNotification) (May 2007), http://docs.oasis-open.org/wsn/wsn-ws_base_notification-1.3-spec-os.pdf
9. Guan, Z., Hernández, F., Bangalore, P., Gray, J., Skjellum, A., Velusamy, V., Liu, Y.: Grid-Flow: a Grid-enabled scientific workflow system with a petri-net-based interface. Concurrency Computat.: Pract. Exper. 18(10), 1115–1140 (2006)
10. Hernández, F., Bangalore, P., Gray, J., Guan, Z., Reilly, K.: GAUGE: Grid Automation and Generative Environment. Concurrency Computat.: Pract. Exper. 18(10), 1293–1316 (2006)
11. Hoheisel, A.: User tools and languages for graph-based Grid workflows. Concurrency Computat.: Pract. Exper. 18(10), 1101–1113 (2006)
12. Kacsuk, P., Sipos, G.: Multi-grid and multi-user workflows in the P-GRADE Grid portal. J. Grid Computing 3(3-4), 221–238 (2006)
13. OASIS. OASIS Web Services Resource Framework (WSRF) TC (May 2007), http://www.oasis-open.org/committees/wsrf/
14. Oinn, T., Addis, M., Ferris, J., Marvin, D., Senger, M., Greenwood, M., Carver, T., Glover, K., Pocock, M.R., Wipat, A., Li, P.: Taverna: A tool for the composition and enactment of bioinformatics workflows. Bioinformatics 20(17), 3045–3054 (2004)
15. Taylor, I., Shields, M., Wang, I., Harrison, A.: The Triana workflow environment: architecture and applications. In: Taylor, I., et al. (eds.) Workflows for e-Science, pp. 320–339. Springer, Heidelberg (2007)
16. von Laszewski, G., Hategan, M.: Workflow concepts of the Java CoG Kit. J. Grid Computing 3(3-4), 239–258 (2005)
17. WEEP. The Workflow Enactment Engine Project (May 2007), http://weep.gridminer.org
18. Yu, J., Buyya, R.: A taxonomy of workflow management systems for Grid computing. J. Grid Computing 3(3-4), 171–200 (2006)
Supporting NAMD Application on the Grid Using GPE
Rafal Kluszczyński¹ and Piotr Bala¹,²
¹ Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, ul. Chopina 12/18, 87-100 Toruń, Poland
² Interdisciplinary Center for Mathematical and Computational Modelling, Warsaw University, ul. Pawińskiego 5a, 02-106 Warsaw, Poland
{klusi,bala}@mat.umk.pl
Abstract. Molecular simulations play an important role in understanding the mechanisms of all organisms at the microscopic level. The increasing computing power available on single computers is still not enough for molecular simulations. In this case, grid middleware, which has the ability to distribute calculations in a seamless and secure way over different computing systems, becomes an immediate solution. In this paper we present a GridBean developed for the NAMD application, which is commonly used for molecular simulations. The GridBean is designed according to the standards of the Grid Programming Environment (GPE), currently being developed and implemented on the basis of the latest versions of the Globus and UNICORE middlewares.
1 Introduction
The main idea of the Grid is to enable the use of multiple resources combined together into one application to work cooperatively. The first definition of “Grid Computing” was introduced in 1999 [5] and evolved over the next few years [4]. Since then, there has been good progress in defining standards implemented in most of today’s Grid middlewares [6]. Grids have been successfully used in 3D graphics, quantum chemistry, molecular modeling and bioinformatics [3].
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 762–769, 2008. © Springer-Verlag Berlin Heidelberg 2008
Over recent years, biology has become an important field for computers and computational methods. New interdisciplinary and biology-related research areas have emerged, including computational biology and bioinformatics [10]. In particular, computational biology addresses theoretical and experimental questions in biology. It uses mathematical modeling, quantum chemical calculations and all-atom simulations including molecular dynamics (MD). All of them are computationally demanding. Molecular dynamics methods are used for determining the equilibrium and transport properties of biomolecular systems. They are now widely used in disease research and in vaccine and drug design. Simulated molecules are usually composed of many thousands of atoms. Atomic coordinates of the proteins, nucleic
acids, lipids, water, ions and other structures are usually obtained from crystallographic experiments. Next, an empirical energy function is used to describe atom interactions. Based on the value of the energy function, Newtonian equations of motion are solved to determine new atom positions. Such simulations can provide data on how biomolecules interact with each other.
A molecular dynamics application simulates the behavior of the molecular system. The CPU power required for such simulations can be obtained only with a multiprocessor system. One of the parallel molecular dynamics applications is NAMD (Not Another Molecular Dynamics) [13]. This object-oriented program is designed specifically for the simulation of large biomolecular systems using parallel architectures [11,12]. NAMD scales well on both massively parallel supercomputers and clusters of workstations [9]. Currently it can be used to investigate systems consisting of more than 2 million atoms [16]. The NAMD application offers scientists many useful features and methods for studying biomolecular systems (see [2,13] for a detailed description). The scalability of the parallel version allows large molecular systems to be investigated in reasonable time, although this is still not sufficient for all systems of interest. Therefore there is strong interest in putting molecular dynamics applications on the Grid. The purpose of this paper is to show how Grid technology can be used to speed up molecular dynamics simulations. We describe here the integration of the NAMD molecular dynamics code with the grid middleware.
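The molecular dynamics procedure outlined above, in which forces derived from an energy function are used to integrate Newton's equations and obtain new positions, can be illustrated with a textbook integrator. The sketch below uses the velocity Verlet scheme for a single particle in one dimension with a harmonic potential; this choice, the function names, and all parameter values are assumptions made for illustration and say nothing about the integrator or force field that NAMD itself uses.

```python
def velocity_verlet(x, v, force, mass, dt, steps):
    """Integrate Newton's equations with the velocity Verlet scheme (1D sketch)."""
    f = force(x)
    for _ in range(steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
    return x, v


def harmonic_force(x, k=1.0):
    """Force for the illustrative energy function E(x) = k * x**2 / 2."""
    return -k * x


def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy of the illustrative system."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


x, v = velocity_verlet(1.0, 0.0, harmonic_force, mass=1.0, dt=0.01, steps=1000)
```

Every step requires a new force evaluation, and in an all-atom simulation that evaluation involves interactions among thousands to millions of atoms, which is where the computational demand described above originates.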
2 Grid Programming Environment (GPE)
Grid middleware has been developed over recent years. However, the service oriented architecture has resulted in new standards and implementations. One of them is the Grid Programming Environment (GPE), which is being developed in accordance with the grid services standards. The GPE is designed to establish a stable interface between different Grid middlewares and the user. It builds significantly on the success of the UNICORE client, which provides access to distributed resources [17]. Over time, the UNICORE client has become a very powerful and sophisticated tool allowing users to interact with the Grid. Based on that experience, GPE [14] provides different client applications depending on the type of the user:
– Expert Users will use a Java application running on their workstation. It is the successor of the UNICORE Client with all its functionality, such as workflow and task construction, as well as management of different user identities.
– Application Users are usually not so experienced with grids. They aim to use one application at a time. The client depicted in Fig. 1 is easy and intuitive to use for scientists; however, it has limited functionality.
– Unaware Users are those who would like to manage their tasks and Grid resources through a web browser. The Web client is based on a portal solution and is developed in line with the JSR 168 specification, allowing further integration with already existing portals.
Fig. 1. GPE Application Client interface dedicated to users interested in running just one application at a time. The Task menu is shown, from which the user can download job results or reconstruct field values from one of the previous jobs.
A major advantage of the GPE framework is the GridBean concept, which is a follow-up of the UNICORE plug-in approach. During the development of the UNICORE middleware it turned out that the concept of plug-ins as extensions to the client was a very good and well-accepted solution. The plug-ins should not depend on the client application and should be easy to develop, so that a new program can be brought to the Grid as soon as it is needed. That is why GPE offers developers and users the GridBean approach. GridBeans are a new generation of plug-ins for web service based Grid infrastructures. They hide all server specific issues behind a graphical interface. In the GPE framework, a dedicated GridBean Service has been introduced to store information about available GridBeans and to provide clients with them. According to [14], the GridBean approach has 4 major advantages:
– GridBeans are easy to distribute and update: the user may at any time download the latest available version.
– Through the GridBean Service the user can get a clearly presented list of applications actually supported on the Grid.
– Once implemented, a GridBean can be used with all different types of client applications.
Fig. 2. First panel of the GridBean’s GUI. It contains parameters required for running NAMD simulation.
– The main effort during the implementation is to organize the graphical components responsible for the variety of program parameters and to implement construction of the job description separately from the GUI.
The GridBean API allows for easy and quick implementation of a user interface for applications. In order to actually run the application on the Grid, a description of the application invocation on the target system has to be prepared. The GridBean has access to such data and is able to create a job description containing proper data and parameters. The GPE Client application can contact an available Target System Service on the Grid independently of the hosting environment used there. This provides interoperability and allows the same GridBeans to be used to prepare jobs which can be run using either UNICORE or Globus server side infrastructures [15].
3 NAMD GridBean for GPE
Scientists run their simulations on different computing centers simultaneously. Managing many remote tasks is very time-consuming. Recently NAMD-G, an infrastructure for executing NAMD simulations within the context of a Grid [7] was presented. In this section we propose another solution by designing a
Fig. 3. NAMD GridBean’s panel containing additional parameters for the simulation. Subpanels group the options responsible for the same feature of the NAMD program.
GridBean for the NAMD application. It makes it possible to run MD simulations on already existing GPE grids, including such middlewares as Globus and UNICORE, which have been successfully used in many scientific research projects so far. NAMD has many features which can affect the way the simulation is performed and the methods used in calculations. This implies a large number of parameters, whose detailed description can be found in the user’s guide [2]. To increase the clarity of the visual interface for scientists, parameters responsible for a similar feature are grouped into subpanels which are organized into five panels:
– The main panel presented in Fig. 2 is dedicated to entering the job name, the number of CPUs to use and other mandatory parameters required to start the simulation.
– The second panel provides parameter fields which determine the file formats to be used when loading and/or saving the molecular system and the force field description.
– The next panel contains components related to basic simulation parameters.
– Additional and more sophisticated simulation parameters can be set in the GridBean’s fourth panel depicted in Fig. 3.
Supporting NAMD Application on the Grid Using GPE
767
– In the last panel there is a text area for the scientists to enter more advanced and rarely used parameters in NAMD configuration file format. The panels (Input Panels) provide a nice, graphical interface where users can easily set the parameters values without worrying about the syntax of NAMD configuration file. It is also important to mention that GridBean has the ability to contact application-related services available on the grid. In such a way there can be obtained a list of possible values for specific parameters. Another advantage standing for the flexibility of the GridBean approach is, that designed plug-in can also have Output Panels. Their aim is to present task results downloaded from the Grid in much more attractive form than the standard output text. Such panels are usually strictly connected with the application the GridBean is designed for.
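To make concrete what the Input Panels hide from the user, the following minimal Python sketch renders a set of parameter values as "keyword value" lines in the style of a NAMD configuration file. It is an illustration only and not the GridBean's actual (Java) implementation; the function name render_namd_config, the file names and the chosen values are hypothetical.

```python
def render_namd_config(params, extra_lines=""):
    """Render GUI-style parameter values as NAMD-style 'keyword value' lines.

    `params` maps a NAMD keyword to its value; `extra_lines` plays the role
    of the free-text area of the last panel and is appended verbatim.
    """
    lines = []
    for keyword, value in params.items():
        if isinstance(value, bool):
            value = "on" if value else "off"   # NAMD switches are written on/off
        lines.append(f"{keyword} {value}")
    if extra_lines.strip():
        lines.append(extra_lines.strip())
    return "\n".join(lines) + "\n"


# Hypothetical values, as a user might enter them in the panels.
config = render_namd_config(
    {
        "structure": "mol.psf",
        "coordinates": "mol.pdb",
        "paraTypeCharmm": True,
        "parameters": "par_all27.prm",
        "temperature": 310,
        "timestep": 2.0,
        "numsteps": 1000,
    },
    extra_lines="langevin on",
)
print(config)
```

Keeping the rendering step separate from the widgets mirrors the separation between GUI and job description construction described above.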
4
Implementation
The GPE concepts introduce different types of researchers who use the Grid. Besides them, there is another very important category of users: Grid application developers, who bring existing applications to a Grid or implement new ones. For them, GPE provides the GridBean application programming interface, which makes implementing service-oriented plug-ins faster and easier than with previous versions of the middleware. The developer's guide presented in [8] describes a GridBean as an object responsible for generating job descriptions for grid services and for providing a graphical user interface for input and output data. A typical plug-in is therefore composed of a job description generation module and a graphical user interface module. The first one implements two interfaces: IGridBeanModel, which stores the named GridBean parameters with their values, and IGridBean, which generates a job description from those stored parameters. As in the case of the NAMD GridBean (Fig. 4), this can also be done by extending the abstract class AbstractGridBean, which already implements both interfaces. In addition, the authors had to implement the methods presented in that figure. The client application requests a job description before submission and obtains it from the GridBean by calling IGridBean.setupJobDefinition(Job). This method determines the parameter names and their corresponding values and generates a JSDL description [1]. Another method, IGridBean.parseJobDefinition(Job), is invoked when the user wants to restore the field values in the GridBean GUI from earlier tasks. Some fields may also be declared as input or output parameters. They are used to declare input and output files or other arbitrary data obtained from or transferred to the local machine or other jobs. The NAMD GridBean contains such parameters: the files containing the coordinates and the structure of a molecule before and after the simulation.
To be usable in a GPE client, the plug-in must provide a graphical interface. There may be more than one such module, depending on the client application type. For GPE standalone clients, a class implementing the interface IGridBeanPlugin must be provided in order to create the graphical area. The authors have implemented it in the NAMDPlugin class, which provides methods for setting input and output panels, loading and storing data from and to a GridBean model, and validating values entered by the user. Each input and output panel should implement the interface IGridBeanPanel. In particular, an input panel can be implemented as an extension of the class GridBeanPanel, as has been done for all panels of the NAMD GridBean. A translator, a validator and a description object can be linked with every graphical component. They are responsible for translating values into the GridBean's internal representation, validating entered values and describing the field semantics when a value is invalid.
Fig. 4. Pseudocode of NAMD GridBean module responsible for storing the parameters values and generating job description submitted further on the Grid
The authors have successfully designed a GridBean for the NAMD application that enables MD simulations to be run on GPE grids. This plug-in can also be used as an element of more sophisticated workflow tasks, which can be constructed with the Expert Client application. The NAMD GridBean was developed on the basis of the GridBean API within the UNICORE project [17] and has been successfully tested with the GPE4Unicore Application Client (release 6.0.0). NAMD version 2.6, available at the University of Illinois website [12], was used.
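The roles of the translator, validator and description objects linked to each graphical component can be pictured with a small Python analogy. This is not the GridBean API (which is Java); the names Field and load_fields and the two example fields are hypothetical and serve only to show how raw text from a form might be translated, validated and, on failure, reported with a description.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Field:
    name: str
    translate: Callable[[str], Any]   # raw GUI text -> internal value
    validate: Callable[[Any], bool]   # is the translated value acceptable?
    description: str                  # shown to the user when the value is invalid


def load_fields(fields, raw_values):
    """Return (model, errors): accepted values and per-field error messages."""
    model, errors = {}, {}
    for field in fields:
        try:
            value = field.translate(raw_values[field.name])
        except (KeyError, ValueError):
            errors[field.name] = field.description
            continue
        if field.validate(value):
            model[field.name] = value
        else:
            errors[field.name] = field.description
    return model, errors


fields = [
    Field("numsteps", int, lambda v: v > 0, "number of steps: a positive integer"),
    Field("timestep", float, lambda v: v > 0, "timestep: a positive number"),
]
model, errors = load_fields(fields, {"numsteps": "1000", "timestep": "-1"})
print(model, errors)
```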
5
Conclusions
In this paper, the authors have presented a plug-in that allows the NAMD application to be used on GPE grids. The NAMD GridBean makes use of Grid technology, which is becoming increasingly popular. New service-oriented standards described in the GPE concepts allow existing Grid middleware to interoperate, which is likely to accelerate interest in Grids. The NAMD GridBean enables scientists working in biology-related fields to run their biomolecular simulations on many computing systems in a seamless and secure way. Thanks to the graphical interface, scientists not familiar with the NAMD program can easily run their own calculations without learning the structure of the configuration file. This suggests that the GridBean approach is a promising way of bringing applications to Grids and accessing them there.
Acknowledgements. This work has been performed with support from the EU-IST-033437 grant.
References
1. Anjomshoaa, A., Brisard, F., Drescher, M., Fellows, D., Ly, A., McGough, S., Pulsipher, D., Savva, A.: Job Submission Description Language (JSDL). Specification, version 1.0 (2005)
2. Bhandarkar, M., Brunner, R., Chipot, C., Dalke, A., Dixit, S., Grayson, P., Gullingsrud, J., Gursoy, A., Hardy, D., Hénin, J., Humphrey, W., Hurwitz, D., Krawetz, N., Kumar, S., Nelson, M., Phillips, J., Shinozaki, A., Zheng, G., Zhu, F.: NAMD User's Guide. Theoretical Biophysics Group at University of Illinois and Beckman Institute (2006)
3. Borcz, M., Kluszczyński, R., Bala, P.: BLAST Application on the GPE/UnicoreGS Grid. In: Lehner, W., Meyer, N., Streit, A., Stewart, C. (eds.) Euro-Par Workshops 2006. LNCS, vol. 4375, pp. 244–252. Springer, Heidelberg (2007)
4. Foster, I.: What is the Grid? A Three Point Checklist. Grid Today, Argonne National Laboratory & University of Chicago, vol. 1(6) (2002)
5. Foster, I., Kesselman, C. (eds.): The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, San Francisco (1999)
6. Global Grid Forum home page: http://www.ggf.org/
7. Gower, M., Cohen, J., Phillips, J., Kufrin, R., Schulten, K.: Managing Biomolecular Simulations in a Grid Environment with NAMD-G. In: Proceedings of the 2006 TeraGrid Conference (2006)
8. GPE4GTK project home page: http://gpe4gtk.sourceforge.net/
9. Hein, J., Reid, F., Smith, L., Bush, I., Guest, M., Sherwood, P.: On the performance of molecular dynamics applications on current high-end systems. Phil. Trans. R. Soc. A 363, 1987–1998 (2005)
10. Huerta, M., Haseltine, F., Liu, Y., Downing, G., Seto, B.: NIH Working Definition of Bioinformatics and Computational Biology (2000)
11. Kalé, L., Skeel, R., Bhandarkar, M., Brunner, R., Gursoy, A., Krawetz, N., Phillips, J., Shinozaki, A., Varadarajan, K., Schulten, K.: NAMD2: Greater scalability for parallel molecular dynamics. Journal of Computational Physics 151, 283–312 (1999)
12. NAMD application home page: http://www.ks.uiuc.edu/Research/namd/
13. Phillips, J.C., Braun, R., Wang, W., Gumbart, J., Tajkhorshid, E., Villa, E., Chipot, Ch., Skeel, R.D., Kalé, L., Schulten, K.: Scalable molecular dynamics with NAMD. Journal of Computational Chemistry 26, 1781–1802 (2005)
14. Ratering, R.: Grid Programming Environment (GPE) Concepts. Intel Corporation, GPE documentation (2005)
15. Ratering, R., Lukichev, A., Riedel, M., Mallmann, D., Vanni, A., Cacciari, C., Lanzarini, S., Benedyczak, K., Borcz, M., Kluszczyński, R., Bala, P., Ohme, G.: GridBeans: Support e-Science and Grid Applications. In: Proceedings of the 2nd IEEE International Conference on e-Science and Grid Computing, p. 45 (2006)
16. Sanbonmatsu, K.Y., Tung, C.-S.: High Performance computing in biology: Multimillion atom simulations of nanoscale systems. Journal of Structural Biology 157, 470–480 (2007)
17. UNICORE project home page: http://www.unicore.eu/
A Grid Advance Reservation Framework for Co-allocation and Co-reservation Across Heterogeneous Local Resource Management Systems
Changtao Qu
IT Research Division, NEC Laboratories Europe, NEC Europe Ltd., Rathausallee 10, D-53757 Sankt Augustin, Germany
[email protected]
Abstract. Co-allocation and co-reservation are key capabilities of Grid schedulers for supporting complex Grid applications such as workflows. The chief enabling technology for co-allocation and co-reservation is Advance Reservation (AR), which is typically implemented by local Resource Management Systems (RMSs). At present only a limited number of RMSs support AR, and most of them use their own interface formats, so it is rather difficult for Grid schedulers to manage Grid ARs across heterogeneous RMSs, including AR-incapable ones, through a uniform interface. In this paper we propose a Grid AR framework that addresses this issue by means of a Grid AR manager, which externalizes the AR functionality from local RMSs. An advanced Grid AR algorithm is implemented in the Grid AR manager, and a local AR API and a Grid AR API are defined to standardize the interaction between the Grid AR manager and local RMSs, and between high-level Grid scheduler components and the Grid AR manager, respectively. Based on a plug-in architecture, the Grid AR framework can incorporate different types of local RMSs to implement Grid AR functionalities, irrespective of whether the local RMSs support AR.
Keywords: Grid, Advance Reservation, Co-Allocation, Co-Reservation.
1
Introduction
Advance Reservation (AR) is a chief enabling technology for co-allocation (i.e., the simultaneous use of Grid resources across multiple sites) and co-reservation (i.e., the coordinated use of Grid resources in sequence across multiple sites), which are deemed key capabilities of Grid schedulers for supporting complex Grid applications such as workflows. Typically, AR is provided by local Resource Management Systems (RMSs). As an advanced RMS functionality, however, AR is at present supported by only a limited number of RMSs, e.g., Maui Cluster Manager/Moab Workload Manager [1], Platform LSF [2], LoadLeveler [3], and PBS Professional [4]. As these RMSs generally use their own interface formats to implement the AR functionality, there is no generic framework that lets Grid schedulers manage Grid ARs across heterogeneous local RMSs through a uniform interface. As a result, most current Grid schedulers, e.g., Moab Cluster Manager [1] and LSF Multi-Cluster [2], can only manage Grid ARs within a homogeneous environment, i.e., an environment consisting of RMSs of the same type that are all AR-capable. A few Grid schedulers, such as CSF (Community Scheduler Framework) [5], have promised to support Grid ARs across heterogeneous RMSs, but they are still far from that goal.
Given the heterogeneity to be expected of RMSs in future computational Grids, we cannot reasonably assume that all RMSs in a computational Grid are AR-capable, let alone that they use a uniform AR interface and support the advanced AR functionalities that are crucial or even indispensable to co-allocation and co-reservation as envisioned in [6],[7],[8], e.g., the two-phase commit, advanced query support, and AR accounting. While designing a Grid scheduler for the EU IST FP6 project NextGRID (http://www.nextgrid.org), we took the heterogeneity of local RMSs in the next-generation Grid into account and therefore propose a Grid AR framework that can incorporate different types of RMSs to implement Grid AR functionalities, irrespective of whether the local RMSs support AR. The kernel component of the framework is the Grid AR manager, which externalizes the AR functionality from local RMSs and interacts with local systems through a uniform interface and a plug-in architecture. Two sets of APIs are defined in the Grid AR framework: the local AR API provides a uniform interface between the Grid AR manager and local RMSs, while the Grid AR API exposes the full set of Grid AR functionalities to high-level Grid scheduler components, e.g., the co-allocation/co-reservation planner (cf. Fig. 1).
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 770–779, 2008. © Springer-Verlag Berlin Heidelberg 2008
2
Design of the Grid Advance Reservation Framework
Fig. 1 illustrates the design of the Grid AR framework. The framework differentiates between two types of ARs: (a) Grid ARs, which are generated by the co-allocation/co-reservation planner and then distributed to local RMSs through the Grid AR manager; and (b) local ARs, which are submitted directly to AR-capable RMSs through local schedulers. Grid jobs and local jobs may also be involved, but they are not handled directly in the Grid AR framework. In the first instance, we deal with Grid ARs in computational Grids. A Grid AR is thus modeled as:
Ar = (id, ts, te, j, Cr)    (1)
where id is the collection of identity information, ts is the required start time, te is the latest end time, j is the job bound to Ar, and Cr is the required compute resources, defined as Cr = (n, p, m, d, s). Here n is the required number of compute nodes, p is the number of processors on each node, and m, d, and s are the sizes of the required memory, disk, and swap, respectively. For each accepted Ar, the Grid AR manager issues a unique handle Har.
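The Grid AR model of Eq. (1) maps naturally onto a pair of record types. The sketch below is one possible Python rendering; the class and field names, the megabyte units and the sanity check are assumptions made for illustration and are not part of the framework itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeResources:
    """Cr = (n, p, m, d, s); the megabyte units are an assumption."""
    nodes: int            # n: required number of compute nodes
    procs_per_node: int   # p: processors on each node
    memory_mb: int        # m: required memory
    disk_mb: int          # d: required disk
    swap_mb: int          # s: required swap

    @property
    def total_processors(self):
        return self.nodes * self.procs_per_node


@dataclass(frozen=True)
class GridAR:
    """Ar = (id, ts, te, j, Cr)."""
    identity: str                 # id: identity information
    start: float                  # ts: required start time
    end: float                    # te: latest end time
    job: str                      # j: job bound to the reservation
    resources: ComputeResources   # Cr

    def __post_init__(self):
        # Sanity check added for the sketch: a reservation must have positive length.
        if self.end <= self.start:
            raise ValueError("te must be later than ts")


cr = ComputeResources(nodes=4, procs_per_node=2, memory_mb=2048, disk_mb=10000, swap_mb=1024)
ar = GridAR("user-42", 100.0, 200.0, "namd-job", cr)
print(ar.resources.total_processors)
```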
Fig. 1. Design of the Grid AR framework for co-allocation and co-reservation
2.1
Grid Advance Reservation Manager
In order to externalize the AR functionality from local RMSs, the Grid AR manager itself implements a Grid AR algorithm and maintains a Grid AR list for each local RMS, which records the Grid ARs distributed to that local system. If the local RMS is AR-capable, Grid ARs are also managed by the local scheduler after the two-phase commit. Otherwise, Grid ARs are essentially independent of the local management. Fig. 2 shows a UML 2.0 activity diagram for submitting a Grid AR through the Grid AR manager.
Fig. 2. Activity diagram for submitting a Grid AR through the Grid AR manager
The Grid AR manager can implement the full set of Grid AR functionalities based on different levels of interaction with local RMSs. In particular, it can implement those advanced AR functionalities that are crucial or even indispensable to co-allocation and co-reservation but not yet commonly implemented in local RMSs, e.g., the two-phase commit. Depending on the AR capabilities of the local systems, the interactions between the Grid AR manager and local RMSs vary from non-intrusive to intrusive, with different impacts on local ARs and jobs.
On local RMSs with full AR support, such as Torque/Maui [1], the interaction is non-intrusive in the sense that local jobs and local ARs always have priority over Grid ARs. Grid ARs are accepted only if they do not collide with local jobs and local ARs. As shown in Fig. 2, the final decision on accepting a Grid AR is made by the local scheduler according to its local scheduling algorithms and policies. Grid ARs are thus not expected to have a dramatic influence on local systems.
On local RMSs with partial AR support, such as SGE [9], the interaction is "limited" intrusive in the sense that only local jobs with assignment reservations have priority, and only under certain conditions. Although the final decision on accepting a Grid AR is still made by the local scheduler, the decision may be incorrect, as the local scheduler cannot predict the workload for a time slot beyond the assignment reservation scope. As a result, accepted Grid ARs may still seriously slow down local jobs, even those with assignment reservations. This is because Grid ARs are essentially outside the control of local schedulers, even though the local schedulers take part in the decision-making process.
On local RMSs without AR support, such as Torque/pbs_sched [1], the interaction is intrusive in the sense that Grid ARs always have priority over local jobs. The decision on accepting a Grid AR is made entirely by the Grid AR manager. In order to guarantee the start time and the required resources of Grid ARs, the Grid AR manager may suspend local jobs, which implies that Grid ARs may dramatically slow down local jobs.
2.2
Local Advance Reservation API
The local AR API is the uniform interface between the Grid AR manager and local RMSs, and is typically implemented in the form of local RMS plug-ins. The local AR API can logically be divided into two subsets, as listed in Table 1.
Local scheduler API. This subset is typically implemented by AR-capable RMSs to enable the Grid AR manager to operate Grid ARs directly on local schedulers. Through this API, the Grid AR manager may submit Grid ARs to local schedulers to verify their acceptance, or instruct local schedulers to remove Grid ARs, either because the two-phase commit fails or at the request of the co-allocation/co-reservation planner. The Grid AR manager may also need to query local schedulers to implement advanced query functions, as detailed in Section 2.4.
Local Job Manager (JM) API. This subset is typically implemented by AR-incapable local RMSs to enable the Grid AR manager to operate Grid ARs directly on local systems, just like local jobs. It is the key to incorporating AR-incapable RMSs into the Grid AR framework. Through this API, the Grid AR manager may start or terminate Grid ARs as local jobs according to their start and end times, or suspend and resume local jobs in order to guarantee the start time and required resources of Grid ARs. In a sense, the Grid AR manager acts as an AR scheduling add-on that works on top of local schedulers/job managers through the local JM API to add AR support to local systems.
Table 1. Local advance reservation API
Operation                                 Function
Local scheduler API
submit Grid AR to local scheduler         submitReservation(Ar): returns a local reservation ID Lar on success.
cancel Grid AR on local scheduler         cancelReservation(Lar): returns true or false.
query local reservations                  queryReservation(): returns "stable" local ARs (cf. Section 2.4).
Local Job Manager (JM) API
submit Grid AR as local job to local JM   submitJob(Ar): returns a local job ID J.
cancel Grid AR as local job on local JM   cancelJob(J): returns true or false.
suspend local jobs                        suspendLocalJobs(Cr): suspends local jobs to obtain the required resources Cr; returns the set of suspended local job IDs Ji.
resume local jobs                         resumeLocalJobs(Ji): resumes the suspended local jobs.
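The local scheduler subset of Table 1 can be pictured as a small plug-in interface. The following Python sketch is an illustration under stated assumptions, not the framework's actual code: the method names are Python-style renderings of the table entries, a refusal is assumed to be signalled by returning None, ToyScheduler is a hypothetical in-memory stand-in for an RMS, and co_reserve is a simplified all-or-nothing submission in the spirit of removing Grid ARs when a commit fails, not the two-phase commit protocol itself.

```python
from abc import ABC, abstractmethod


class LocalSchedulerPlugin(ABC):
    """Local scheduler subset of Table 1, with Python-style method names."""

    @abstractmethod
    def submit_reservation(self, ar):
        """submitReservation(Ar): local reservation ID, or None if refused (assumption)."""

    @abstractmethod
    def cancel_reservation(self, local_id):
        """cancelReservation(Lar): True or False."""

    @abstractmethod
    def query_reservation(self):
        """queryReservation(): the "stable" local ARs."""


class ToyScheduler(LocalSchedulerPlugin):
    """Hypothetical in-memory RMS that accepts ARs while nodes remain free."""

    def __init__(self, free_nodes):
        self.free_nodes = free_nodes
        self.reservations = {}   # local reservation ID -> reserved node count
        self._counter = 0

    def submit_reservation(self, ar):
        if ar["nodes"] > self.free_nodes:
            return None
        self._counter += 1
        local_id = f"L{self._counter}"
        self.reservations[local_id] = ar["nodes"]
        self.free_nodes -= ar["nodes"]
        return local_id

    def cancel_reservation(self, local_id):
        nodes = self.reservations.pop(local_id, None)
        if nodes is None:
            return False
        self.free_nodes += nodes
        return True

    def query_reservation(self):
        return []   # this toy RMS has no local ARs of its own


def co_reserve(schedulers, ar):
    """Submit `ar` to every scheduler; cancel everything if any of them refuses."""
    accepted = []
    for scheduler in schedulers:
        local_id = scheduler.submit_reservation(ar)
        if local_id is None:
            for other, other_id in accepted:
                other.cancel_reservation(other_id)   # roll back earlier acceptances
            return None
        accepted.append((scheduler, local_id))
    return [local_id for _, local_id in accepted]


site_a, site_b = ToyScheduler(4), ToyScheduler(2)
print(co_reserve([site_a, site_b], {"nodes": 3}))   # site_b refuses, site_a is rolled back
print(co_reserve([site_a, site_b], {"nodes": 2}))
```

A plug-in for an AR-incapable RMS would implement the Job Manager subset of Table 1 in the same manner.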
Note that while the definition of the local AR API enables both AR-capable and AR-incapable RMSs to participate in the Grid AR framework, it also sets the entry threshold for AR-incapable RMSs. As the priority of Grid ARs has to be ensured on local systems, local RMSs must at least support queue priority/job pre-emption and must grant the Grid AR manager sufficient authority to operate the local systems. In addition, to minimize the intrusive influence of Grid ARs on local systems, some advanced job management functionalities, in particular job checkpointing and migration, are advantageous, though not indispensable.
2.3
Grid Advance Reservation API
The Grid AR API is a high-level API implemented by the Grid AR manager to expose the full set of Grid AR functionalities. Its purpose is to shield high-level Grid scheduler components, e.g., the co-allocation/co-reservation planner, from the heterogeneity of local systems. In the Grid AR API, besides the usual submit/cancel/modify operations, we further define a set of advanced query functions, as listed in Table 2. For effective co-allocation and co-reservation, we argue that some complex scheduling queries are of particular importance, e.g., "what is the earliest start time if I submit a job immediately (or in 2 hours)?", "what are all possible timeframes for executing a job after a specific time point?", and "how many resources are available within a specific timeframe?". Whereas these advanced query functions are commonly missing from currently available AR API proposals and implementations (cf. Section 4), they are included in our Grid AR API design and are guaranteed to be implementable through the Grid AR manager.
Table 2. Grid advance reservation API
Operation                                 Function
submit a Grid AR to a target local RMS    submitReservation(Rt, Ar): Rt is the target local RMS. On success, returns a handle Har for the commit.
commit a Grid AR                          commitReservation(Har): returns true or false.
cancel a Grid AR                          cancelReservation(Har): returns true or false.
modify a Grid AR                          modifyReservation(Har, Ar): Har is the handle of the accepted AR, Ar is the new AR. On success, a new Har is returned; otherwise the old Har is returned.
query Grid AR                             queryReservation(Har): returns the description of an accepted AR.
                                          queryTimeframe(Rt, ts, tduration, Cr): tduration is the duration for which Cr is required. Returns a list of timeframes {ti, tj}, where ts ≤ ti ≤ tj, for the required Cr after the specified ts. The possible AR start time should be within {ti, tj}. The earliest timeframe also indicates the earliest start time.
                                          queryFreeResource(Rt, ts, te): returns the amount of free resources within the specified timeframe.
2.4
Grid Advance Reservation Algorithm
The Grid AR algorithm is the key to implementing the Grid AR API in the Grid AR manager. As presented in [10],[11],[12], a general approach to AR algorithm design is to use a so-called "slot table", in which each AR is considered a slot of time. As the slot table has to be scanned every time a reservation or query is made, the AR algorithm cannot achieve a complexity below O(N) in both time and space [12], where N is the number of ARs. Such complexity is generally not satisfactory for co-allocation and co-reservation, where AR "probing" is critical and query operations are therefore much more frequent than ordinary AR operations. In our Grid AR algorithm, we use a new AR table structure, which can limit the algorithm to linear time complexity and constant space complexity. Since query operations are interested in the number of free resources at each time point, the Grid AR algorithm uses an AR list structure in which the number of free resources at each time point is stored directly. The AR list thus consists of a set of tokens {ti, Cri}, where the ti are the endpoints of reservation timeframes and Cri denotes the number of free resources in the timeframe between ti-1 and ti. The number of tokens is at most 2*N, where N is the total number of ARs. In the Grid AR algorithm, the usual AR operations such as submit/cancel/modify have somewhat more overhead than in "slot table" based AR algorithms, as each such operation has to update all tokens within the required time range. However, the algorithm is especially advantageous for query operations: as the free resources at all time points are stored directly in the AR list, queries can be answered more efficiently. The operations defined in the Grid AR API (Table 2) use and change the Grid AR list as follows. For the explanation, we define a function ti.next() that denotes the nearest time point following ti in the Grid AR list.
1. queryFreeResource(Rt, ts, te): return the number of free resources Cfree = min(Cri), where Cri is taken from the tokens {ti, Cri} with ts.next() ≤ ti ≤ te.next(). If te.next() does not exist in the Grid AR list, simply use ts.next() ≤ ti. If ts.next() does not exist, return Cfree = Ctotal, where Ctotal is the total configurable resources on Rt.
2. queryTimeframe(Rt, ts, tduration, Cr): return the set of time points {ti, tj}, where ts ≤ ti ≤ tj, Cfree = queryFreeResource(Rt, ti, tj + tduration), and Cr ≤ Cfree. The first {ti, tj} indicates the earliest start time.
3. submitReservation(Rt, Ar): Ar = (id, ts, te, j, Cr) is acceptable if Cfree = queryFreeResource(Rt, ts, te) and Cr ≤ Cfree. An accepted AR then updates the Grid AR list according to the following algorithm:
    update the token set {ti, Cri} to {ti, Cri − Cr} where ts < ti < te
    if ts does not exist in the AR list then
        if tm = ts.next() exists then insert {ts, Crm};
        else insert {ts, Ctotal};
    else do nothing
    if te does not exist in the AR list then
        if tn = te.next() exists then insert {te, Crn − Cr};
        else insert {te, Ctotal − Cr};
    else update {te, Cre} to {te, Cre − Cr};
4. cancelReservation(Har) and modifyReservation(Har, Ar): the update to the AR list is similar to that of submitReservation(Rt, Ar), so we omit the details here.
With the Grid AR algorithm, each submit, cancel, modify, and query operation can be completed in a single pass through the AR list. As the length of the AR list is at most 2*N, the Grid AR algorithm is deemed to have linear time complexity (at most O(2*N)) and constant space complexity. Note that we also need to consider local ARs or assignment reservations in the Grid AR algorithm if Grid AR operations target RMSs with full or partial AR support. This is, however, relevant only to query operations. For submit/modify operations, the final decision on accepting a Grid AR is made by the local RMS anyway, so the local workload need not be considered in the Grid AR algorithm. For query operations, we take into account only relatively "stable" local ARs, such as administrative reservations and standing reservations on Torque/Maui. As Grid schedulers have essentially no control over local RMSs, user-settable local ARs are invisible to the Grid AR algorithm. Queries against RMSs with full or partial AR support are therefore bound to return "inaccurate" results, which mainly reflect the Grid AR workload.
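The token-based AR list can be sketched in a few lines of Python. This is a simplified reading of the description in this section and not the author's implementation: resources are reduced to a single integer (e.g., a node count), a reservation is taken to occupy the timeframe (ts, te] with ts < te, and the boundary tokens are inserted before the free counts are decremented, which reorders the update listing slightly. The names ARList, query_free, submit and cancel are hypothetical.

```python
import bisect


class ARList:
    """Tokens (times[i], free[i]): free[i] is the free capacity on (times[i-1], times[i]]."""

    def __init__(self, total):
        self.total = total   # Ctotal
        self.times = []      # sorted time points ti
        self.free = []       # Cri stored with each ti

    def query_free(self, ts, te):
        """Minimum free capacity over the timeframe (ts, te]; assumes ts < te."""
        lo = bisect.bisect_right(self.times, ts)   # first token with ti > ts
        hi = bisect.bisect_left(self.times, te)    # first token with ti >= te
        values = self.free[lo:hi + 1]
        if hi >= len(self.times):                  # timeframe runs past the last token
            values = values + [self.total]
        return min(values)

    def _ensure_token(self, t):
        """Insert a token at t if missing, copying the value of the interval it splits."""
        i = bisect.bisect_left(self.times, t)
        if i == len(self.times) or self.times[i] != t:
            value = self.free[i] if i < len(self.times) else self.total
            self.times.insert(i, t)
            self.free.insert(i, value)

    def _adjust(self, ts, te, delta):
        """Add delta to every token with ts < ti <= te."""
        lo = bisect.bisect_right(self.times, ts)
        hi = bisect.bisect_right(self.times, te)
        for i in range(lo, hi):
            self.free[i] += delta

    def submit(self, ts, te, cr):
        """Accept the reservation if cr fits into (ts, te]; return True or False."""
        if cr > self.query_free(ts, te):
            return False
        self._ensure_token(ts)
        self._ensure_token(te)
        self._adjust(ts, te, -cr)
        return True

    def cancel(self, ts, te, cr):
        """Release a previously accepted reservation (tokens are kept, not merged)."""
        self._adjust(ts, te, cr)


ars = ARList(total=8)
ars.submit(0, 10, 3)
ars.submit(5, 15, 4)
print(ars.times, ars.free)
print(ars.query_free(10, 20))
```

In this sketch a query touches only the tokens inside the requested timeframe, while submit and cancel additionally update each of those tokens, which corresponds to the trade-off between query and update cost described above.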
3
Discussions
It is well known that (local) ARs unavoidably introduce a high job drop rate and resource fragmentation within local RMSs [12],[13],[14]. These drawbacks become more pronounced when a local RMS participates in the Grid AR framework, as Grid ARs are introduced in addition to local ARs, with different impacts on local systems. On local RMSs with full AR support, Grid ARs have lower priority than local jobs and local ARs, so local ARs and local jobs are not dramatically slowed down by Grid ARs. In contrast, on local RMSs without AR support, all local jobs have lower priority than Grid ARs and can be dramatically slowed down by them. In the worst case, local jobs may be pre-empted by Grid ARs at any time, leading to unpredictable execution times. An essential assumption of the proposed Grid AR framework is that local RMSs can achieve extra gain from Grid ARs in comparison to local jobs and local ARs. This is expected to be the case in particular for local RMSs that are AR-incapable but still willing to accept intrusive Grid ARs. In a Grid computing environment, such extra gains have to be guaranteed through certain mechanisms. In NextGRID, for example, we adopt SLAs (Service Level Agreements) as such a mechanism, which can effectively manage the lifecycle of Grid ARs, including their creation, negotiation, and payment [15]. Moreover, as Grid ARs can influence the utilization of local resources even before they take effect (e.g., on AR-capable RMSs), committed Grid ARs need to be charged irrespective of whether they are actually enforced afterwards. Although the Grid AR framework is able to incorporate local RMSs with partial or even no AR support, the preferred use case of the framework is to incorporate fully AR-capable RMSs. In this case, the framework can provide value-added Grid AR functionalities, in particular the two-phase commit and advanced query functions, with the least influence on local systems. For AR-incapable RMSs, there are also several ways to minimize the intrusive influence of Grid ARs on local systems. For example, local RMSs may set aside only part of their resources for Grid ARs and low-priority local jobs, or they may adopt advanced job checkpointing and migration mechanisms. In addition, AR-incapable RMSs may choose to use external AR-capable schedulers or AR add-ons such as Maui or PluS [10] in order to become AR-capable.
4
Related Work
OGF (Open Grid Forum, http://www.ogf.org) has published an experimental document [7] defining an AR API, which is essentially derived from the GARA project [11]. This AR API defines basic AR functionalities but does not include an advanced query API. We view our work as an extension of the OGF proposal that additionally addresses advanced query functions for co-allocation and co-reservation and takes the heterogeneity of local RMSs into account. Several commercial Grid schedulers, such as Moab Cluster Manager [1] and LSF Multi-Cluster [2], support Grid ARs across homogeneous local RMSs. As these Grid schedulers mostly focus on multi-cluster management and monitoring, they do not sufficiently address co-allocation and co-reservation requirements and thus lack AR functionalities such as the two-phase commit and advanced query functions. GridSim [12] is a Grid AR simulation environment that implements all the Grid AR functionalities of our Grid AR framework except for the advanced query functions. The same holds for another AR-capable Grid scheduler, the Grid resource broker [16]. CSF [5] is a Grid scheduler that mainly focuses on dispatching Grid ARs to local RMSs. It therefore has neither its own AR algorithm implementation nor two-phase commit or advanced query support. CSF does intend to support Grid ARs across heterogeneous local RMSs based on a plug-in architecture, but at present it only provides a plug-in for LSF. The Globus Alliance (http://www.globus.org/) also plans to add Grid AR support to GT4 (Globus Toolkit) GRAM (Grid Resource Allocation & Management). A first design draft was proposed in October 2006, but no further details are available so far. Their AR work is expected to be an extension of the GARA project [11].
5
Conclusions and Future Work
The proposed Grid AR framework makes two major contributions: (a) an advanced Grid AR algorithm, which is advantageous for Grid AR query operations; and (b) a Grid AR API and a local AR API, which hide the heterogeneity of local RMSs so that Grid AR functionalities can be implemented across different types of local systems. We believe these two contributions can effectively address co-allocation and co-reservation requirements in the next-generation Grid. In the next project phase, we plan to integrate more RMSs into the testbed to validate the framework and to provide local RMS plug-ins for popular RMSs such as LSF [2], LoadLeveler [3], and PBSPro [4]. The research will also focus on the co-allocation/co-reservation planner, which currently conducts Grid AR planning based on a simple performance model of local RMSs. We plan to take advantage of semantic technologies in the planner to address more complex Grid AR scheduling issues such as resource contention.
Acknowledgments. This work was supported in part by the EU IST FP6 project NextGRID under contract number IST-2004-511563.
References
1. Cluster Resources Inc.: Maui, Moab, and Torque, http://www.clusterresources.com/pages/products.php
2. Platform Computing Inc.: LSF Product Suite, http://www.platform.com/Products/
3. IBM Corp.: LoadLeveler, http://www-03.ibm.com/systems/clusters/software/
4. Altair Engineering: PBS Professional, http://www.altair.com/software/pbspro.htm
5. Community Scheduler Framework, http://sourceforge.net/projects/gcsf
6. Kuo, D., Mckeown, M.: Advance Reservation and Co-Allocation Protocol for Grid Computing. In: 1st IEEE International Conference on e-Science and Grid Computing, IEEE Computer Society Press, Los Alamitos (2005)
7. Roy, A., Sander, V.: Advance Reservation API, http://www.gridforum.org/documents/GFD.5.pdf
8. Schwiegelshohn, U., Yahyapour, R.: Attributes for Communication between Scheduling Instances, http://www.gridforum.org/documents/GFD.6.pdf
9. Grid Engine Project, http://gridengine.sunsource.net/
10. Nakada, H., Takefusa, A., Ookubo, K., Kishimoto, M., Kudoh, T., Tanaka, Y., Sekiguchi, S.: Design and Implementation of a Local Scheduling System with Advance Reservation for Co-Allocation on the Grid. In: IEEE International Conference on Computer and Information Technology (CIT 2006), IEEE Computer Society Press, Los Alamitos (2006)
11. Roy, A., Sander, V.: GARA: A Uniform Quality of Service Architecture. In: Nabrzyski, J., Schopf, J.M., Weglarz, J. (eds.) Grid Resource Management: State of the Art and Future Trends, Kluwer Academic Publishers, Dordrecht (2003)
12. Sulistio, A., Buyya, R.: A Grid Simulation Infrastructure Supporting Advance Reservation. In: 16th International Conference on Parallel and Distributed Computing and Systems (PDCS 2004), ACTA Press, Calgary (2004)
13. Heine, F., Hovestadt, M., Kao, O., Streit, A.: On the Impact of Reservations from the Grid on Planning-Based Resource Management. In: Sunderam, V.S., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2005. LNCS, vol. 3516, Springer, Heidelberg (2005)
14. Smith, W., Foster, I., Taylor, V.: Scheduling with Advanced Reservations. In: International Parallel and Distributed Processing Symposium 2000. LNCS, vol. 1800, Springer, Heidelberg (2000)
15. Hasselmeyer, P., Qu, C., Schubert, L., Koller, B., Wieder, P.: Towards Autonomous Brokered SLA Negotiation. In: eChallenges 2006, IOS Press, Amsterdam (2006)
16. Elmroth, E., Tordsson, J.: An Interoperable Standards-based Grid Resource Broker and Job Submission Service. In: 1st IEEE Conference on e-Science and Grid Computing, IEEE Computer Society Press, Los Alamitos (2005)
Using HLA and Grid for Distributed Multiscale Simulations
Katarzyna Rycerz1, Marian Bubak1,2, and Peter M.A. Sloot2
1 Institute of Computer Science, AGH, al. Mickiewicza 30, 30-059 Kraków, Poland
2 Faculty of Sciences, Section of Computational Science, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
{kzajac,bubak}@agh.edu.pl, [email protected]
Phone: (+48 12) 617 39 64; Fax: (+48 12) 633 80 54
Abstract. Combining simulations of different scales in one application is a non-trivial issue. This paper proposes a solution that supports the complex time interactions that can appear between the elements of such applications. We show that the High Level Architecture (HLA), especially its time management service, can be used efficiently to distribute multiscale components and to let them communicate. The Grid HLA Management System (presented in our previous work [10]) is used to run the HLA-based distributed simulation system on the Grid. The example application is built from simulation modules taken from the Multiscale Multiphysics Scientific Environment (MUSE) [8], a sequential simulation system designed for calculating the behaviour of dense stellar systems such as globular clusters and galactic nuclei.
Keywords: multiscale simulation, Grid computing, HLA, distributed simulation.
1 Introduction
Multiscale simulation is an important and interesting field of research. Examples include a multiphysics model of capillary growth [11] and the modeling of colloidal dynamics [3]. Another example is the Multiscale Multiphysics Scientific Environment (MUSE) [8] for simulating dense stellar systems such as globular clusters and galactic nuclei. MUSE currently consists of a Python scheduler and three simulation modules of different time scales: stellar evolution (macro scale), stellar dynamics (N-body simulation, meso scale) and hydrodynamics (simulation of collisions, micro scale).

Combining simulations of different scales in one application is a complex and non-trivial issue [4,5]. In particular, it requires advanced and flexible time management techniques (e.g. the ability to join simulations that use different internal time management). From that point of view, it would be useful to adapt one of the existing solutions for distributed interactive simulations that fulfills this requirement. One important standard is the High Level Architecture (HLA) [6], which provides various services needed for this kind of application, such as time management with the ability to connect time-driven and event-driven as well as optimistic and conservative simulations. It also takes care of data distribution management and allows all application components to see the entire application data space in an efficient way. There are many implementations of the HLA standard, both open source and closed source [12]. In this paper we show how multiscale simulation can benefit from HLA time management services. As an example, we use simulation modules from the MUSE environment.

Another important aspect worth addressing is the flexible and transparent creation of a distributed multiscale simulation system according to user needs. This includes the requirements of interoperability, composability and reusability of existing models created by other researchers. For this purpose we propose to use Grid technology, as it is oriented towards joining geographically distributed communities of scientists working on similar problems; this will allow users working on multiscale simulations to exchange already created models more easily. Therefore, integrating HLA with the new possibilities given by the Grid is a promising approach for multiscale simulations with time management supported by HLA. As there is already much work on improving HLA functionality for the conditions of a Grid environment (a good example is the implementation based on Grid Services [9]), we do not plan to develop our own implementation. Instead, we would like to show how using HLA on the Grid can be beneficial for multiscale simulations. To run the simulation we use the Grid HLA Management System (G-HLAM) described in [10], which also contains a detailed analysis of the challenges of integrating HLA with the Grid. In the future, we also plan to use a component approach to extend support for composability of simulation models.

This paper is organized as follows: in Section 2 we briefly describe HLA time management; in Section 3 we analyze the typical time interactions between multiscale components, based on MUSE modules [8]. In Section 4 we describe our first attempts at running an HLA-based distributed multiscale simulation on the Grid and the performance results. In Section 5 we present conclusions and plans for future work.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 780–787, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Overview of Time Management in HLA
The High Level Architecture (HLA) standard [6] defines an infrastructure for developing distributed interactive simulations. In HLA terminology, each component of a distributed simulation is called a federate and can form a federation with other federates. In HLA, time management is concerned with the mechanisms for controlling the advancement of each federate along the federation time axis. The perception of the current time may differ among participating federates. So-called regulating federates regulate the progress in time of federates that are designated as constrained. A federation may be comprised of federates with any combination of time management models. Regulating federates are able to send events or data objects with time stamps, and constrained federates are able to receive such objects in time stamp order. The maximal point in time that a constrained federate can reach at a given moment is calculated dynamically according to the position of the regulating federates on the time axis. A constrained federate can ask the HLA runtime infrastructure to proceed to the next point in time, calculated either by adding a time step to the current time (for time-driven simulations) or as the time of the next event received (for event-driven simulations). Additionally, optimistic federates can ask to receive all events that have been sent in the federation execution, regardless of time stamp ordering. Messages received with a time stamp lower than that of messages already sent may invalidate the previous messages. For this case, HLA provides a retraction mechanism.
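The relation between regulating and constrained federates can be illustrated with a small toy model. The sketch below is not the HLA RTI API; the class `Federate` and the functions `max_grantable_time` and `request_time_advance` are hypothetical names introduced only for illustration. It assumes that each regulating federate promises not to send events earlier than its current time plus a lookahead value, and it simplifies the exact boundary rule that a real runtime infrastructure applies.

```python
from dataclasses import dataclass


@dataclass
class Federate:
    """Toy stand-in for an HLA federate (hypothetical, not an RTI object)."""
    name: str
    time: float = 0.0
    # Assumption: a regulating federate sends no event earlier than time + lookahead.
    lookahead: float = 0.0


def max_grantable_time(regulating):
    """Upper bound on the time a constrained federate may reach right now."""
    return min(f.time + f.lookahead for f in regulating)


def request_time_advance(constrained, step, regulating):
    """Time-driven request: advance by `step` only if the bound allows it."""
    target = constrained.time + step
    if target <= max_grantable_time(regulating):
        constrained.time = target
        return True   # advance granted
    return False      # the constrained federate has to wait


evolution = Federate("evolution", time=0.0, lookahead=1.0)  # regulating
dynamics = Federate("dynamics")                             # constrained

# Five requests of 0.25 time units against a bound of 1.0.
granted = [request_time_advance(dynamics, 0.25, [evolution]) for _ in range(5)]

# Once the regulating federate moves on, the bound moves with it.
evolution.time = 1.0
granted_after = request_time_advance(dynamics, 0.25, [evolution])
```

In this toy run the constrained federate is stopped as soon as its requested time exceeds the bound given by the regulating federate, and it can continue only after the regulating federate has advanced.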
3 HLA Support for Types of Time Interactions in Multiscale Simulations
We have adapted three simulation modules of different time scales taken from MUSE [8]: stellar evolution (macro scale), stellar dynamics (N-body simulation, meso scale) and hydrodynamics (simulation of collisions, micro scale). The main multiscale simulation elements are shown in Fig. 1. It shows two steps of evolution and four steps of dynamics (the number of steps is chosen for simplicity; actually there are more steps of dynamics within the time of one evolution step). The simulation of collisions is seen by evolution and dynamics as a point in time. A collision is triggered by dynamics, and data are sent from collision to both dynamics and evolution. Apart from that, evolution sends data to dynamics. We identified three needed types of time interactions between multiscale elements.

Fig. 1. Multiscale simulation elements and their interactions

Meso scale triggers micro scale and waits for the results. In the simplest case (shown in Fig. 2), the simulation of stellar dynamics (meso scale) can detect the situation when two stars become close. Then the collision simulation (micro scale) should be performed (triggered). As the collision takes place on a smaller time scale than dynamics, and the computed data are sent from micro scale to meso scale, dynamics should wait until the collision finishes.

Fig. 2. Interaction between dynamics and collision

Macro scale and meso scale running concurrently - conservative simulation. In this case (shown in Fig. 3), evolution (macro scale) and dynamics (meso scale) can run concurrently. Data are sent from macro scale to meso scale, as star evolution data (change of star mass) are needed by dynamics. No data are needed from dynamics by evolution. A single step of the dynamics simulation is shorter in terms of simulation time units than a step of evolution, so dynamics needs to take more steps to reach the same point in time. Also, the complexity of a single step of dynamics (and the related execution time) is greater than that of evolution. Therefore, it is unlikely that the dynamics simulation will frequently wait for its data until evolution reaches the next step. As shown in Fig. 3, point A1 is earlier on the wallclock time axis than point B1. (For simplicity, the steps of both simulations are equal in Fig. 3. In the real experiment, described in Section 4, the dynamics part has to calculate more than 1000 simulation steps to get to point B1, which can take as long as 25 seconds of wallclock time, while evolution performs one step to get to point A1 in 1.1 milliseconds.) Therefore, we propose the conservative type of interaction between these two simulations, as it allows the two simulations to run concurrently and should not generate frequent idle time in the waiting simulation. Also, in an optimistic solution the whole state of dynamics would have to be rolled back if evolution was late, as data sent by evolution affect the whole dynamics simulation. The simulation system has to make sure that dynamics gets the update from evolution before it actually passes the appropriate point in time. The time management mechanism in which a regulating federate (evolution) controls the time flow of a constrained federate (dynamics) is very useful here. The maximal point in time that the constrained federate can reach at a given moment is calculated dynamically according to the position of the regulating federate on the time axis.

Fig. 3. Interaction between evolution and dynamics. A1: point in time when evolution sends data to dynamics. B1: maximum time by which dynamics should receive the data (if not, it should wait for the data). For simplicity, single steps of both simulations have equal calculation time (wallclock time).

Macro scale and micro scale running concurrently - optimistic simulation. In this case (Fig. 4), the collision simulation (micro scale) affects part of the evolution data (macro scale). The start time of a collision is independent of evolution. A collision means that the set of simulated stars changes (two stars disappear, and one star of a different mass appears in their place). If evolution has already passed the point in time of the collision (e.g. in Fig. 4 point A2 is later than B2 on the simulation time axis), it has to roll back (in Fig. 4, evolution has to roll back from point C2 to B2'). As the data from micro scale affect only part of the macro scale simulation (the evolution of each star is calculated independently of other stars), optimistic simulation can be a reasonable solution. HLA time management [6] also offers support for this kind of simulation, e.g. the ability to check messages with future time stamps and to retract messages that were sent based on the future data.

Fig. 4. Interaction between evolution and collision. A2: point in time when collision sends data to evolution. B2: maximum time by which evolution should receive the data (if not, it should roll back from C2 to B2').

Interaction summary. The presented types of time interactions have different features. In the first and second interactions, the data flow affects the whole receiving simulation, while in the third case it affects only a part (evolution calculates every star separately, so data from a collision have no effect on the calculations for stars that do not collide). The first type of interaction is sequential (dynamics triggers new collisions and waits for the results), while the other two are parallel: the second one is conservative and the third one is optimistic. In HLA all forms of time management may be linked together, and conservative simulations can interact with optimistic simulations. Therefore, HLA allows the three scenarios of time interaction described above to be linked in a natural way and is a good choice for this kind of complex simulation system.
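The reason why the optimistic scheme is affordable for evolution, namely that each star is evolved independently, can be sketched in a few lines. The code below is only an illustration under stated assumptions: `evolve_mass` is a made-up placeholder for stellar evolution (1% mass loss per time unit), and `OptimisticEvolution` is a hypothetical class, not MUSE code and not the HLA retraction API. It assumes that a collision result arrives with a time stamp not later than the local time of evolution.

```python
def evolve_mass(mass, dt):
    """Hypothetical placeholder for stellar evolution: 1% mass loss per time unit."""
    return mass * (1.0 - 0.01 * dt)


class OptimisticEvolution:
    """Toy optimistic federate: runs ahead and repairs its state on late input."""

    def __init__(self, masses, step=1.0):
        self.step = step
        self.time = 0.0
        self.mass = dict(masses)  # star id -> mass at self.time
        self.recomputed = []      # ids that had to be re-evolved after a rollback

    def advance(self):
        """Optimistically evolve every star by one step."""
        for star in self.mass:
            self.mass[star] = evolve_mass(self.mass[star], self.step)
        self.time += self.step

    def receive_collision(self, when, a, b, merged_id, merged_mass):
        """Apply a collision result whose time stamp lies at or before self.time."""
        elapsed = self.time - when
        if elapsed < 0:
            raise ValueError("this sketch handles only collisions in the local past")
        # The two colliding stars disappear ...
        del self.mass[a], self.mass[b]
        # ... and only the merged star is rolled back to the collision time
        # and re-evolved up to the current local time.
        self.mass[merged_id] = evolve_mass(merged_mass, elapsed)
        if elapsed > 0:
            self.recomputed.append(merged_id)


evo = OptimisticEvolution({"s1": 1.0, "s2": 2.0, "s3": 5.0})
evo.advance()
evo.advance()                  # evolution is now at time 2.0
mass_s3_before = evo.mass["s3"]
evo.receive_collision(1.5, "s1", "s2", "s12", 3.0)  # late message from collision
```

Only the star produced by the collision is recomputed; the state of the star that did not collide is left untouched, which is the property that makes rollback cheap in this interaction type.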
4 HLA-Based Distributed MUSE on the Grid
G-HLAM. In our previous work, we developed the G-HLAM system, which allows for efficient execution of HLA-based applications on the Grid [10]. One of the main goals was to support legacy HLA applications, so the actual communication between simulation elements (federates) is done through the HLA communication bus. However, management of the application (performance monitoring and migration) is done at the Grid Services level. The group of main G-HLAM services consists of a Broker Service, which coordinates management of the simulation; a Performance Decision Service, which decides when the performance of any of the federates is not satisfactory and migration is therefore required; and a Registry Service, which stores information about the location of local services. On each Grid site supporting HLA there are local services for performing migration commands on behalf of the Broker Service as well as for monitoring of federates and benchmarking. The HLA-Speaking Service is one of the local services; it interfaces the federates running on its site to the G-HLAM system.

Performance results. We have used G-HLAM to run an HLA-based distributed multiscale application on the Grid. For this experiment we have chosen two modules, dynamics and evolution, to show the second type of time interaction described in Section 3 (macro scale and meso scale running concurrently, conservative simulation). This experiment is a good example of using HLA time management features when one simulation controls the time of the other. We compared the performance of these modules when running them first in the legacy MUSE environment and second as an HLA-based distributed simulation on the Grid. The experiments were performed on DAS2: legacy MUSE and the dynamics part of the HLA application were run at the grid node at Leiden, the evolution part of the HLA application was run at Delft, and the RTIexec HLA control process was run at Amsterdam. All grid nodes have the same architecture (clusters of dual 1-GHz Pentium-III nodes connected with an internal Myrinet-2000 network). Fast ethernet (10Gbps) is used as the external network between Grid nodes. The application setup and the actual submission of HLA components were done by G-HLAM (with Globus Toolkit 3.2 and HLA RTI-1.3NGv5). We have calculated average values from 10 runs.

Table 1. Comparison of time of actions for legacy MUSE and for HLA-based multiscale simulation on the Grid (sum of 10 steps)

action                                            average, sec   σ, sec
Actions independent of the number of simulation steps (HLA-based distributed multiscale on the Grid)
  job submission (GRAM and local job manager)     15.2           1
  HLA initial actions                             7.3            1.1
  HLA quit actions                                0.07           1.1
Actions dependent on the number of simulation steps (legacy MUSE)
  dynamics                                        45.8           0.1
  evolution                                       0.001          0.00005
  update from evolution to dynamics               0.061          0.001
  total loop time                                 45.9           0.2
Actions dependent on the number of simulation steps (HLA-based distributed multiscale)
  dynamics                                        49             3
  synchronization with evolution                  0.016          0.001
  total loop time                                 49             3

The actions independent of the number of simulation steps (time of submission, HLA setup and finalizing functions) are shown in the upper part of Tab. 1. The rest of Tab. 1 shows the execution time of the actual simulation loops. The experiments were done on a 100-star system (first 10 simulation steps). The middle part of Tab. 1 shows the performance results of the legacy MUSE environment. It is a sequential execution consisting of running dynamics, running evolution and updating dynamics from evolution in each step. The lower part of Tab. 1 shows the results for the same modules running concurrently on the Grid and using HLA time management. As we observed, the execution time of evolution is much shorter than that of dynamics, so evolution does not delay the execution of dynamics (although dynamics is controlled by evolution, as described in Section 3). As both modules were running concurrently and the evolution part finished earlier, we show the results of the longer part (dynamics). The results include the actual calculation time (which is similar to that of the legacy execution) and the synchronization time (receiving data from evolution, getting permission to advance simulation time). It should be noted that synchronization (which is in fact the most important overhead of HLA distribution, as it is repeated in the loop) does not take much time, because evolution is quicker and sends data in advance, so dynamics only has to process what was delivered before. However, as the evolution execution time is much shorter than that of dynamics, the results of the legacy and distributed simulations are comparable, because the dynamics time totally dominates the sequential execution. We can conclude that the distributed solution would be beneficial for modules that both contribute significantly to the sequential execution time, provided that the regulating module (in our case evolution) is quicker than the constrained module (in our case dynamics).
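The size of the overhead can be read directly from Table 1. The following short calculation uses only the averages reported there; the variable names are introduced here for illustration, and the derived ratios are simple arithmetic on the published numbers, not additional measurements.

```python
# Averages copied from Table 1 (seconds, sum of 10 simulation steps).
legacy_loop = 45.9   # total loop time, legacy MUSE
hla_loop = 49.0      # total loop time, HLA-based distributed version
hla_sync = 0.016     # synchronization with evolution, HLA-based version

# Actions that do not depend on the number of simulation steps.
one_time = {
    "job submission": 15.2,
    "HLA initial actions": 7.3,
    "HLA quit actions": 0.07,
}

loop_overhead = (hla_loop - legacy_loop) / legacy_loop  # relative increase of loop time
sync_share = hla_sync / hla_loop                        # repeated HLA cost within the loop
one_time_total = sum(one_time.values())                 # paid once per run

print(f"loop time increase: {loop_overhead:.1%}")
print(f"synchronization share of HLA loop: {sync_share:.3%}")
print(f"one-time Grid/HLA cost: {one_time_total:.2f} s")
```

Note that the difference between the two loop times (about 3 s) is of the same order as the standard deviation reported for the HLA-based loop (3 s), so the relative increase should be read as a rough indication only.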
5 Conclusions and Future Work
In this paper we have shown how multiscale simulations can benefit from HLA, especially its time management facility. We have described and analyzed the typical time interactions between multiscale components, based on modules taken from MUSE [8], which was originally a sequential simulation environment designed for dense stellar systems. For the experiment we have chosen one of the described types of time interaction, as it is the most typical example of using HLA time management features when one simulation controls the time of the other. However, we plan to add the other interaction types in the near future. We have shown how the multiscale components can be distributed using HLA and benefit from it. The G-HLAM system was used to run the simulation on the Grid, which allows for better usage of the available resources needed by such applications.
The performance results have shown that the overhead of HLA-based distribution on the Grid (especially its repeated part, the synchronization between multiscale elements) is small, so such distribution can be beneficial for multiscale simulations. The results presented in this paper are a good starting point for building a Grid component framework for HLA-based simulations that will support a user in the dynamic setup of a multiscale simulation system comprised of chosen components.

Acknowledgments. The authors wish to thank Simon Portegies Zwart for valuable discussions on MUSE and Maciej Malawski for discussions on component models. The support from the Polish Foundation for Science (FNP) is acknowledged. This research was also partly funded by the EU IST Project CoreGRID and the Polish State Committee for Scientific Research SPUB-M.
References
1. Armstrong, R., et al.: The CCA component model for high-performance scientific computing. Concurr. Comput.: Pract. Exper. 18(2), 215–229 (2006)
2. Chen, X., Cai, W., Turner, S.J., Wang, Y.: SOAr-DSGrid: Service-Oriented Architecture for Distributed Simulation on the Grid. Principles of Advanced and Distributed Simulation (PADS), 65–73 (2006)
3. Dzwinel, W., Yuen, D.A., Boryczko, K.: Bridging diverse physical scales with the discrete-particle paradigm in modeling colloidal dynamics with mesoscopic features. Chemical Engineering Sci. 61, 2169–2185 (2006)
4. Hoekstra, A.G., Lorenz, E., Falcone, J.-L., Chopard, B.: Towards a Complex Automata Framework for Multi-Scale Modeling: Formalism and the Scale Separation Map. In: Shi, Y., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2007. LNCS, vol. 4487, pp. 922–930. Springer, Heidelberg (2007)
5. Hoekstra, A.G., Portegies Zwart, S., Bubak, M., Sloot, P.M.A.: Towards Distributed Petascale Computing. In: Bader, D. (ed.) Petascale Computing: Algorithms and Applications, Chapman & Hall / CRC Press, Taylor and Francis Group (2007) (expected publication December 2007)
6. High Level Architecture specification - IEEE 1516
7. Malawski, M., Bubak, M., Placek, M., Kurzyniec, D., Sunderam, V.: Experiments with distributed component computing across grid boundaries. In: Proceedings of the HPC-GECO/CompFrame workshop in conjunction with HPDC 2006, Paris, France (2006)
8. MUSE Web page, http://muse.li/
9. Pan, K., Turner, S.J., Cai, W., Li, Z.: A Service Oriented HLA RTI on the Grid. Accepted by Principles of Advanced and Distributed Simulation (PADS) (2007)
10. Rycerz, K.: Grid-based HLA Simulation Support. PhD thesis, University of Amsterdam, June, promoter: P.M.A. Sloot, copromoter: M. Bubak (2006)
11. Szczerba, D., Székely, G., Kurz, H.: A Multiphysics Model of Capillary Growth and Remodeling. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3992, pp. 86–93. Springer, Heidelberg (2006)
12. Wikipedia - list of HLA implementations, http://pl.wikipedia.org/wiki/High Level Architecture
The OpenCF: An Open Source Computational Framework Based on Web Services Technologies
Adrian Santos1, Francisco Almeida1, and Vincente Blanco1
1 Dpto. Estadística, I.O. y Computación, Universidad de La Laguna, Spain
[email protected]
Abstract. Web Services-based technologies have emerged as a technological alternative for computational web portals. Facilitating access to distributed resources through web interfaces while simultaneously ensuring security is one of the main goals in most of the currently existing manifold tools and frameworks. OpenCF, the Open Source Computational Framework that we have developed, shares these objectives and adds others, like enforced portability, generality, modularity and compatibility with a wide range of High Performance Computing Systems. OpenCF has been implemented using lightweight technologies (Apache + PHP), resulting in a robust framework ready to run out of the box that is compatible with standard security requirements.
1 Introduction
A widespread difficulty among scientists who wish to use the computational power provided by High Performance Computer Systems (HPCS) is the significant barrier (technological and learning) they face when trying to access such services. Parallel machines usually run UNIX-based operating systems, which users access through terminal-based connections. Once the code is compiled by the user, it is run as a batch job through the queue system. Typical researchers, who may be familiar with the code or library used to run experiments in their field of expertise, have to confront their inexperience with terminals and UNIX/Linux commands. An additional source of complexity is introduced by queue systems, which vary from system to system and use their own commands and manipulation rules. The type of user we are dealing with is not particularly interested in learning new tools or knowing the details of parallel computing, and yet the effort required from researchers who are only interested in faster processing times is still high.

Solution strategies founded on Client/Server-based applications have emerged as alternatives to overcome this barrier. A friendly graphical interface enables access to non-local resources where parallel codes can be executed remotely on geographically distributed parallel machines, or where heterogeneous devices can interact with each other to solve a problem in parallel. The approach opens the system to a larger number of users, including those without any knowledge of sequential or parallel programming. Details of the parallel architecture are transparent to the user, and parallel codes can be efficiently executed in the parallel system. A paradigmatic example of this approach is found in [1] and in many other Client/Server applications that even use web interfaces in the clients [2,3,4,5].

As Web Services (WS) progress toward an industry standard, they have attracted the attention of the parallel computing community and proved to be a technological alternative for implementing this kind of system. WS technology eases the implementation of a web-accessed computing system. It also makes for better interoperability between the different systems and tools that comprise a High Performance Computing environment. The Open Computational Framework (OpenCF) presented here uses WS technologies to achieve these objectives with a lightweight and portable implementation that is compatible with the security requirements of many HPCS sites. OpenCF is freely downloadable from [6], licensed under the GPL, and has been successfully used by the Pelican project [7], from which some snapshots have been taken for illustrative purposes in this paper.

This paper is structured as follows. In Section 2 we introduce the background on Web Services for computing environments and the main motivation for this work. Section 3 describes the software architecture and technology used in the modular design of the OpenCF system. Several security-related issues involving WS and HPCS are discussed in Section 4, where a description of our implementation in OpenCF can be found. Finally, in Section 5 we present some concluding remarks and future lines of work.

This work has been partially supported by the EC (FEDER) and the Spanish MEC (Plan Nacional de I+D+I, TIN2005-09037-C02).

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 788–797, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Motivation and Related Work
Web Service (WS) technology facilitates the development of remote access tools which have the ability to communicate through any common firewall security measure without requiring changes to the firewall filtering rules. The solutions presented vary from static ad-hoc WS for specific applications on specific parallel machines [8], to more general and complete solutions. For example, Gateway [9,10,11] implements a WS portal based on Java and CORBA technologies. In the GridSam [12] project we can find a WS infrastructure that is also implemented using Java, and includes JSDL descriptions for job submissions and a command-line interpreter for access to the GridSAM system. OpenDSP [13] (DRMAA Service Provider) is an open architecture implementation of SOAP Web Service for multi-user access and policy-based job control routines by various Distributed Resource Management (DRM) systems. This API follows the DRMAA [14] specification for the submission and control of jobs to one or more DRM systems. The scope of this specification is the high-level functionality necessary for an application to consign a job to a DRM system, including common operations for jobs like termination or suspension.
In these environments the user is provided with web interfaces or other software infrastructures that supply easy and secure access to several resources, including different architectures and applications. However, the use of frameworks based on Java (JavaBeans, JavaServlets, . . . ) and CORBA is sometimes criticized for delivering heavyweight solutions. The Open Computational Framework (OpenCF) that we present here belongs to this group of projects that provide generic solutions using Web Services-based technologies. While ideas, concepts and goals are shared by these projects, in practice the differences between them lie in their specific implementations. In our case, OpenCF is based on PHP technology, providing a lightweight and portable implementation.

Although in many cases the development effort has been technologically impressive and the demand for this kind of tool is still high, none of the generic frameworks based on computational web services has achieved enough popularity in the wider parallel programming community to be accepted as a standard framework enjoying widespread use. In many cases they have not been used beyond the institutions where they were developed. Some of the reasons for this are as follows:
– The technology used can itself be a handicap when the effort demanded from the system managers is high.
– Limited compatibility with the security policies and resource managers of the HPCS sites.
– Many of the solutions provided are non-standard, non-generic (based on specific technologies), non-portable and not open source.

With OpenCF we intend to build a generic Open Source Computational Framework based on WS technologies and implemented according to the W3C (World Wide Web Consortium) recommendations for WS development [15]:
– OpenCF is compact and highly efficient, since it is based on a modular design that has been implemented using the Apache HTTP server, a set of PHP libraries and a Perl script. This combination results in a lightweight package with low resource requirements.
– The system is easy to install, requiring from the system manager no more knowledge than that needed to install a web mail application.
– Generality is enforced through the use of portable technologies that are compatible with the resource management systems running on most parallel systems, and the security requirements have been implemented in such a way that they are independent of and compatible with the security measures of HPCS sites. This means that OpenCF, out of the box and without any modifications, meets the security requirements of HPCS sites.
– The flexibility to enlarge the service with new computational proposals was an important issue considered in our design strategy.

Although we are not using new concepts, our main design effort has been to bring all these ideas together and make them work as a successful tool that we hope will be of great use to the scientific community.
Fig. 1. The OpenCF architecture
3 The OpenCF Architecture
For OpenCF, we identified the following core services to be implemented in the portal: secure identification and authorization for users and inter-communication services, information services for accessing descriptions of available host computers, applications and users, job submission and monitoring through the queue system, file transfer, and facilities for user and resource management. The above services are typically implemented in many computational web portals; however, we also list some of the design features that are not always included in this kind of tool and which, in our opinion, are required to successfully implement a web computing framework: – Modular design, whereby modules can be easily added, replaced, updated or independently used with minimum development and management effort.
A. Santos, F. Almeida, and V. Blanco
– Generality so that the requirements of as many users, applications and HPCS as possible can be met.
– Portability.
– The use of standard technology so as to reach a widespread community.
– Independence from hardware requirements.
– Lightweight.
– Open Source as a requirement to allow the project to be enhanced in the future with new ideas and developments.
– Ease of use and installation both for end users and system managers.
– Ready to use.
– No modifications required in the original source code by the end user.
Figure 1 depicts the OpenCF software architecture. In keeping with a modular design, two modules make up the OpenCF package, namely, the server (side) module and the client (side) module. Modules can be independently extended or even replaced to provide new functions without altering the other system components. The client and the server implement the three lower-level layers of a WS stack: Service Description, XML Messaging, and Service Transport. The fourth level, Service Discovery, has not been implemented for security reasons. Thus, system managers still control the client services accessing the clusters via traditional authentication techniques.
The client provides an interface for the end user and translates the requests into queries for the servers. The server receives queries from authenticated clients and transforms them into jobs for the queue manager. These modules, in turn, are also modularized. Access Control, Query Processors and Collector components can be found on both the server and client sides. The client also holds a database to better manage the information generated by the system. The server includes elements for the scripting and launching of jobs under the queue system. The following subsections describe the functionalities and technology used in each of the aforementioned modules, except for access control modules and security issues, which will be detailed in a separate section.
3.1 The Client
The client is the interface between the end user and the system. Users are registered in the system through a form. Some of the information requested is used for security purposes, while the rest is needed for job management. This information is stored in the client’s database. The sub-modules of the client module are listed next.
– The client DataBase stores information on users, servers, jobs, input/output files, etc. It has been implemented as a relational MySQL database and it is accessed through PHP scripts.
– The client Query Processor consists of a web interface through which the user can access lists of available applications. Each entry in the lists shows a brief description of the routine. Tasks are grouped according to the servers supporting them. When execution of a routine is requested, the target platform is implicitly selected. An XHTML form for inputting parameters is dynamically generated according to the job description.
– The client Collector manages the job’s server-generated output. The service notifies the user via e-mail when the job is finished. The state of the jobs submitted for execution can also be checked through a web interface (see Fig. 2), and the results can be stored in disk files.
Fig. 2. Job status under the IDEWEP project
3.2 The Server
The server manages all job-related issues, making them available at the service and controlling their state and execution. When Apache receives a new query from the client, it allocates a new independent execution thread to create a new instance of the server module.
– The Query Processor module consists of a set of PHP scripts responsible for distributing the job among the different components. Queries addressed to the computational system are dispatched to the Queue Manager Interface, and the rest of the queries are served by the Query Processor itself. The web service is also generated and served by this module.
– The Queue Manager Interface handles the interaction with the HPCS Queue System. The server needs to know how a job will be executed and how to query the status of a job being executed on the server supporting it. To do so, two methods of the PHP class OpenCFJob have to be overridden. These methods enable the job to be executed (under the queue system) and the job status to be checked. In addition, an XML description of each available routine is needed to specify the job. As an example, Listing 1.1 shows the XML description for a simple “Hello World” code.
Listing 1.1. Describing a Problem
    <?xml version="1.0" encoding="UTF-8"?>
    <job xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation="job.xsd">
      <name>Hello World</name>
      <servicename>hello</servicename>
      <binary>bin/hello</binary>
      <description>Simple Hello World application</description>
      <argument type="integer">
        <name>greetings_num</name>
        <sdesc>Number of greetings</sdesc>
        <ldesc>The number of greetings to output.</ldesc>
      </argument>
      <argument type="string">
        <name>name</name>
        <sdesc>Person to greet</sdesc>
        <ldesc>The name of the person to greet.</ldesc>
      </argument>
    </job>
– The Scripts Generator produces the scripts needed for the job execution under the various queue systems. The Scripts Generator is composed of a set of templates plus a processing engine to create the script. A different template is needed for each of the queue managers supported. The template is instantiated into a functional script by the processing engine by the substitution of a fixed set of fields. These fields are obtained from the input data arguments for the job, from the XML job description document, and from the user data stored by the client DataBase.
– The Launcher is the interface between OpenCF and the operating system. A privileged user, the cluster access account (www-data in Debian systems), forks the process to be executed, returns its identification and unlocks the thread handling the client query.
– The Collector is the client interface which delivers the output data produced by an executed job. Once a job has finished, the queue system automatically sends an e-mail to the user, and moves the output data files to a temporary directory until they are downloaded from the client Collector.
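The path from an XML job description to a queue script, as described for the Scripts Generator, can be illustrated with a short sketch. This is not the OpenCF implementation (which is written in PHP and Perl) but a Python illustration under stated assumptions: the element names follow the Hello World description of Listing 1.1 as far as it can be read, the PBS-style template and the helper names `parse_job` and `generate_script` are hypothetical, and the e-mail address stands in for the user data held by the client DataBase.

```python
import shlex
import xml.etree.ElementTree as ET
from string import Template

# Trimmed job description in the spirit of Listing 1.1 (element names assumed).
JOB_XML = """
<job>
  <name>Hello World</name>
  <servicename>hello</servicename>
  <binary>bin/hello</binary>
  <argument type="integer"><name>greetings_num</name></argument>
  <argument type="string"><name>name</name></argument>
</job>
"""

# Hypothetical template for one queue manager. Its fixed set of fields is
# job_name, email, binary and args.
PBS_TEMPLATE = Template(
    "#!/bin/sh\n"
    "#PBS -N $job_name\n"
    "#PBS -M $email\n"
    "$binary $args\n"
)


def parse_job(xml_text):
    """Extract the fields the script generator needs from a job description."""
    root = ET.fromstring(xml_text)
    return {
        "service": root.findtext("servicename"),
        "binary": root.findtext("binary"),
        "arguments": [
            (arg.findtext("name"), arg.get("type"))
            for arg in root.findall("argument")
        ],
    }


def generate_script(job, values, email, template=PBS_TEMPLATE):
    """Instantiate the template by substituting its fixed set of fields."""
    args = []
    for name, arg_type in job["arguments"]:
        value = values[name]
        if arg_type == "integer":
            value = int(value)  # non-numeric input raises ValueError
        args.append(shlex.quote(str(value)))  # keep user input shell-safe
    return template.substitute(
        job_name=job["service"],
        email=email,
        binary=job["binary"],
        args=" ".join(args),
    )


if __name__ == "__main__":
    job = parse_job(JOB_XML)
    user_input = {"greetings_num": "3", "name": "Ada"}
    print(generate_script(job, user_input, "user@example.org"))
```

Supporting another queue manager would, in this sketch, only require passing a different `Template` object, which mirrors the one-template-per-queue-manager design described in the text.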
4 Security in OpenCF
Security is an important consideration when dealing with standard web services. Typically, in web services technologies, two levels of security are required: client-server and server-backend transactions must be authenticated, authorized, and encrypted. In our approach, since the architecture in OpenCF has two independent modules, a WS consumer (the client module) and a WS provider (the server module), a third level of security must be introduced, user-client security, for the transactions between the end users and the client module.
Fig. 3. The OpenCF security implementation
Furthermore, in our particular case, the WS security requirements must be compatible with those of the computational resource. HPCS services are usually restricted-access resources due to their high complexity and capacity, especially considering that many of these systems are used in the data centers of government institutions and companies with strong security controls. The common requirements are:
– User authentication to access the system through the (username, password) pair.
– The authorization to control access to the resources, usually based on services provided by the operating system and the queue managers.
– The privacy typically provided through secure connections like SSH for remote transactions of the end users.
Higher levels of security can be found in computational sites where more specific access controls are needed, for example using Kerberos.
The WSS [16] document provides a standard mechanism for authenticating and authorizing the sender of a SOAP message, and also for encrypting the message. Some of the supported technologies are: Kerberos (both for the authentication and the encryption), X.509 certificates, SAML (Security Assertion Markup Language, developed by OASIS for the interchange of data for authentication and authorization between security domains using XML documents), SSL, etc.
The OpenCF approach to security (Fig. 3) is based on implementations of the standard technologies present in time-tested scientific and commercial applications. We adopt simple solutions to comply with the above-stated security requirements. The end users register at the client web page by using a form. The information required consists of an e-mail address where results and system notifications are sent, the username and password to be authenticated at the client, and some information about the user’s organization to help the system manager decide whether or not to accept the registration.
Once this information has been submitted, the account will not be activated until the organization, through a privileged user (the system manager), decides whether the request is accepted or not. At that time the user is notified of the resolution.
The client database module stores information about the users registered. The password is encrypted for security reasons and a PHP module is used to authenticate end users when logging into the system. Users with management privileges can be added. These users may access the management options at the client and are responsible for adding and removing users and servers. Clients are also authenticated at the server using a username and password provided by the system manager of the HPCS where the server is installed. Apache enforces authentication between client and server through the mod_auth module that controls the access to resources through the (username, password) pair. This pair is generated by the server manager using Apache tools when installing the system. The authentication between the server and the cluster is enforced through a cluster access account. This approach is compatible with the recommendation of the WSS committee.
To make OpenCF compatible with the security requirements on the cluster where the server will be running, OpenCF security measures are implemented using the security technologies of the cluster system. For instance, in most cases, the cluster access account is a predefined user at the system that executes the Apache server. New access accounts are not created to access the cluster, thus minimizing requirements.
Once the end user has been authenticated she may access all the services provided by the client. The authenticated client is also authorized to launch all the services provided by the server. The system manager at the HPCS where the server is installed authorizes the cluster access account to submit the jobs, launched by the WS, to the queue manager according to her access control policy.
In keeping with the OASIS recommendation, privacy between end user-client and client-server is enforced through the use of HTTPS to encrypt messages. Privacy between the server and the cluster is delegated to the cluster’s privacy facilities so as to comply with compatibility requirements inside the cluster.
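The client-server part of this scheme (a username and password pair checked by Apache, carried over HTTPS) can be sketched as follows. The sketch assumes that the pair is verified through HTTP Basic authentication, which the text does not state explicitly. The function name `build_query`, the example URL and the credentials are hypothetical, and the request is only constructed, not sent.

```python
import base64
import urllib.request
from urllib.parse import urlsplit


def build_query(url, username, password, payload):
    """Build an authenticated client-to-server query without sending it."""
    # Privacy between client and server relies on HTTPS, so anything else
    # is rejected before credentials are attached.
    if urlsplit(url).scheme != "https":
        raise ValueError("client-server queries must use HTTPS")
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(url, data=payload, method="POST")
    request.add_header("Authorization", f"Basic {token}")
    return request


if __name__ == "__main__":
    # Hypothetical server endpoint and credentials issued by the HPCS manager.
    query = build_query(
        "https://hpcs.example.org/opencf/server",
        "client01",
        "secret",
        b"<query/>",
    )
    print(query.get_method(), query.full_url)
```

Rejecting non-HTTPS URLs in the helper keeps the credentials from ever travelling unencrypted, which is the property the text attributes to the use of HTTPS.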
5 Conclusion and Future Work
We have presented OpenCF, an open source computational framework based on web services technologies. An important design consideration was the ability to provide the community with a modular, lightweight, standard tool ready to run out of the box. Security in OpenCF was implemented according to established standards which are compatible with the security policies of many HPCS sites. A valuable feature is the simplicity of the approach. The robustness of OpenCF was successfully tested in the Pelican [7] project. We plan to extend OpenCF with the discovery layer of the Web Service stack, which will provide new functionalities in terms of interoperability, as well as present new challenges in terms of security.
References
1. NetSolve project web site, http://icl.cs.utk.edu/netsolve
2. Klotz, G.A., Harvey, N., Stacey, D.A.: Connecting researchers to hpcs through web services. In: HPCS, p. 14. IEEE Computer Society, Los Alamitos (2006)
3. Thomas, M., Mock, S., Boisseau, J.: Development of web toolkits for computational science portals: The npaci hotpage. hpdc 00, 308 (2000)
4. Petcu, D.: Between web and grid-based mathematical services. iccgi 0, 41 (2006)
5. Yang, X., Hayes, M., Usher, A., Spivack, M.: Developing web services in a computational grid environment. scc 00, 600–603 (2004)
6. OpenCF project webpage, http://opencf.pcg.ull.es
7. Pelican project webpage, http://pelican.pcg.ull.es
8. e-HTPX E–Science Resource for High Throughput Protein Crystallography, http://clyde.dl.ac.uk/e-htpx/index.htm
9. Pierce, M.E., Youn, C.H., Fox, G.: The gateway computational web portal. Concurrency and Computation: Practice and Experience 14(13-15), 1411–1426 (2002)
10. Pierce, M.E., Youn, C.H., Fox, G.: The gateway computational web portal: Developing web services for high performance computing. In: Sloot, P.M.A., Tan, C.J.K., Dongarra, J., Hoekstra, A.G. (eds.) ICCS-ComputSci 2002. LNCS, vol. 2329, pp. 503–512. Springer, Heidelberg (2002)
11. Youn, C.: Web Service Based Architecture in Computational Web Portals. PhD thesis, Graduate School of Syracuse University (2003)
12. GridSAM project web site, http://gridsam.sourceforge.net/
13. OpenDSP: DRMAA Service Provider open implementation, http://sourceforge.net/projects/opendsp
14. DRMAA: Distributed Resource Management Application API, http://www.drmaa.org/
15. World wide web consortium, http://www.w3.org/
16. OASIS Web Services Security (WSS) TC, http://www.oasis-open.org/committees/wss/
Service Level Agreement Metrics for Real-Time Application on the Grid
Łukasz Skital², Maciej Janusz¹, Renata Slota¹, and Jacek Kitowski¹,²
¹ Institute of Computer Science, AGH University of Science and Technology, 30-059 Cracow, Poland
² ACK Cyfronet AGH, 30-950 Cracow, Poland
Abstract. Highly demanding applications running on grids need carefully prepared environments. The real-time High Energy Physics (HEP) application from the Int.eu.grid project is a good example of an application with requirements that are difficult to fulfill in typical grid environments. In the paper we present Service Level Agreement metrics which are used by the application’s dedicated virtual organization (HEP VO) to sign SLAs with service providers. With signed SLAs, the HEP VO is able to guarantee sufficient service quality for the application. These SLAs are enforced using the presented VO Portal.
1 Introduction
A Service Level Agreement (SLA) is a formal agreement between a service provider and a customer or between service providers. An SLA establishes a common understanding about a service and all its relevant aspects, such as performance, availability and responsibility. It precisely states the customer’s needs, without specifying how a service provider should meet them. To describe the service level, an SLA uses metrics.
With production grid environments, quality of service has become an urgent issue. This is even more important for real-time or interactive applications. The Int.eu.grid project [1] aims to adapt some interactive and real-time applications to the Grid environment. One of these demanding applications is the High Energy Physics (HEP) application, which requires a highly available computational environment with some real-time constraints. To run this kind of application on the Grid, a dedicated Virtual Organization (VO) is created. This VO, the HEP VO, is responsible for providing the run-time environment for the application. However, the HEP VO, as a grid virtual organization, has to rely on the sites which support it. To guarantee the required service quality, formal agreements in the form of SLAs are necessary. In the case of the Grid, these agreements are signed between the sites and the VO. Sites act as service providers for the VO. The VO represents the users (members of the VO) responsible for the application.
In the paper we propose a set of metrics for the SLA between the VO for the HEP application and service providers. Section 2 presents the current state of the art in the area of SLA, with a focus on grid environments, with EGEE as a leading grid infrastructure. The HEP application with its dedicated VO is introduced in Section 3, while Section 4 describes the metrics used for the HEP VO SLA. Section 5 describes the SLA enforcement methods used in the project. Section 6 concludes the paper.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 798–806, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 State of the Art
The Service Level Agreement (SLA), according to [2], is a legal contract between a service provider and a customer that specifies, in measurable terms, what service level guarantees the service provider will furnish, and it defines the consequences (penalties) if the service provider fails to deliver on these guarantees per the specified conditions. There is no general set of metrics fitting all domains, so an appropriate set of metrics should be defined for each domain. Principles that help to select the metrics for an SLA are defined in [3]. A set of metrics should be small enough to be easy to control and big enough to cover all essential application areas. An excessive set of metrics and ambiguity should be avoided because they can lead to SLA inefficiency, misunderstanding or abuse. Metrics should be easy to measure and collect to allow effective SLA enforcement. All metrics should be within the service provider’s control.
A service provider can outsource some services to other providers. It signs SLAs with the other providers and can then sign SLAs with clients. This situation is described as “SLA inheritance”. The client’s SLA is based on the level of services provided by the other providers. If some of the SLA metrics are not fulfilled by one of the underlying providers, the clients’ claims can be delegated to this provider.
The GEMSS project described in [4] supports SLAs by means of agents on the client’s and the provider’s side. These agents negotiate a common SLA based on application requirements and the site’s free resources, or they are used to reserve resources before they are really needed. The paper [5] defines monitoring of grid applications using SLAs, and a protocol for SLA negotiation is defined in [6]. Solutions like [7] or [8] have one agent as part of the resource broker and monitoring agents on sites.
This makes it possible to select the best-fitting site, which will provide the proper quality of service for the user’s demand and, if the user adds this kind of constraint, also to minimize the cost of job submission, e.g. by selecting a cheaper provider which guarantees sufficient quality of service. It also provides monitoring to easily check the site’s state and simplifies accounting. Service providers can allocate their resources to fulfill as many SLAs as possible and as a result increase their income. In [9] commonly used SLA metrics are divided into groups: hardware, software, network, storage and service desk related.
EGEE [10], as a production grid environment, needs to use SLAs in many areas and on many levels between users, virtual organizations, sites, external service providers and EGEE operational units. The most important areas where SLAs are needed are contracts between EGEE and partners (sites) and contracts between partners and virtual organizations. Currently EGEE does not have a general SLA which could be used to qualify a site’s operations. However, a network SLA and some basic metrics are defined, which can be used for an SLA between a site and a VO. The work on the general EGEE SLA is in progress [11]. A draft of this kind of SLA has already been prepared in the SEEGRID-2 [12] project, which is closely related to EGEE. The SEEGRID-2 project’s SLA is divided into the following sections: hardware and connectivity, level of support, level of expertise, VO support and conformance to operational metrics. The SEEGRID-2 SLA also defines some SLA enforcement methods, which involve Project Steering Committee decisions. EGEE has already defined an SLA in the network area [13] and this SLA is being implemented. The SLA covers aspects related to connectivity between all partners including GEANT2, National Research and Education Networks and MAN/Campus/Institution networks.
An SLA between resource providers and virtual organizations should be prepared by each VO to conform to its specific requirements. The VO SLA should be consistent with the site SLA (work in progress), which should already assure a basic level of services. EGEE provides very basic metrics for VO service requirements. These metrics are: number of users in the VO, job CPU limit, job wall clock limit, job scratch space, per-SE storage space, RAM, minimum number of job slots. A VO can also specify some additional requirements, such as requirements beyond the normal information system parameters used, preferred core services and others. These metrics provide information about VO requirements, which can be used to create a basic SLA between a site and a VO.
3 VO and HEP Application
The High Energy Physics (HEP) application is dedicated to analyzing data from the ATLAS [14,15] experiment, which is located at CERN. The HEP application is designed to perform on-line analysis of events during the experiment run. This imposes some real-time constraints, which have to be fulfilled by the application’s run-time environment. These constraints concern both performance and stability. The ATLAS experiment requires computational power which is estimated at 3500 CPUs. Originally all of these resources had to be located at CERN. However, production grid environments open new possibilities, offering low-cost, remote computational power. Nevertheless, a considerable effort to align the real-time application with current grid environments is necessary. This work is carried out in the int.eu.grid project. To solve this non-trivial problem a dedicated VO for the HEP application is introduced: the HEP VO [16]. This VO provides a stable and sufficient run-time environment, which is equipped with some additional services necessary to fulfill the application requirements:
– VO Portal – used by the VO Manager, users and site admins,
– Real Time Dispatcher (RTD) [17,18] – responsible for real-time distribution of computational data (events) to processing tasks running on the Grid; additionally provides performance data (computation time and data transfer time) presented in the VO Portal and used for SLA enforcement,
– JMX¹-based Infrastructure Monitoring System (JIMS) [19] – provides crucial monitoring data, like available network bandwidth,
– Grid-enabled OMIS²-Compliant Monitoring system (OCM-G) [20] – for monitoring the operation of processing tasks.
¹ Java Management Extensions.
4 HEP VO SLA Metrics
Using grid resources for real-time applications requires high quality of service, which has to be assured by legal contracts. From the technical point of view, the most important part of these contracts is the SLA. The HEP application has global requirements on CPU count, but it does not require a minimal number of CPUs per site. However, the CPU count should be reasonable to justify the administrative work [16] necessary to run the application on a site. Moreover, the network performance should be adequate to feed the available CPUs with computational data. Storage requirements of the application are not essential, therefore storage metrics are not discussed. While the performance issues are easy to address, high availability of resources and connectivity is more important for the HEP application. One of the most important requirements regarding network parameters is the ability to handle incoming connections from the experiment to the site’s worker nodes. This is dictated by CERN’s security policy. Currently, a site which does not allow inbound connectivity cannot support the HEP VO.
Just before the experiment starts, jobs are submitted to the Grid. These jobs allocate resources and establish a communication channel. During the experiment run, events are transferred to remote computational resources using HEP VO services as well as directly to CERN’s local computer farm. This process is presented in Figure 1. With 3500 events per second³ and 1.5 Mb for each event, the application is very sensitive to any malfunction. While occasional failures of computational nodes are acceptable and handled, insufficient computational resources or network bandwidth, as well as frequent failures, lead to buffer overload and loss of valuable data. Each problem reported to a site should be answered within a definite time period (e.g. 2 hours) and solved as soon as possible, but again within a defined time period. Optimally, all problems should be solved before the next experiment run.
All of these aspects should be covered by the SLA, which uses metrics that make it easy to determine in advance whether a site is able to fulfill the HEP application requirements. The SLA metrics used by the HEP VO SLA are divided into the following sections:
– Performance metrics – characterize the site’s computational resources and network bandwidth available for the VO,
– Availability and connectivity metrics – describe the availability of the site’s resources for the VO,
– Support and expertise metrics – control the site’s responsiveness to reported problems.
² On-line Monitoring Interface Specification.
³ Estimated event throughput during normal experiment run.
Fig. 1. Event distribution process in HEP Application
The HEP VO SLA introduces two kinds of constraints: soft and hard, which, as distinct from functional and non-functional requirements, are more suitable for the VO management task. Inability to satisfy soft constraints causes warnings to be sent to the site and eventually renegotiation of the contract, while inability to satisfy hard constraints causes immediate suspension of the site’s operation in the HEP VO. Soft constraints are related to all performance issues which do not influence application stability; these include CPU count, CPU performance, network bandwidth and single job failures. Hard constraints are related to site availability and other critical requirements like network connectivity, site unresponsiveness or massive job failures. The metrics used in the HEP VO SLA are listed below. Please note that CPU means a single-core CPU or one core of a multi-core CPU.
Performance metrics
– General CPU count (soft) – total number of CPUs ($C_{cpu}$) which are available for the HEP VO (dedicated or shared with other VOs).
– Dedicated CPU count (soft) – number of CPUs which are dedicated to the HEP VO.
– CPU architecture (hard) – required processor architecture. Currently only the x86 architecture is supported.
– CPU performance (soft) – CPU performance ($T_{avg}$) as the time needed to compute one model event (a lower value means better performance).
– Network bandwidth (soft) – inbound ($B_{in}$) and outbound ($B_{out}$) network bandwidth; optimal values are given by the following equations:
$$B_{in} = \frac{C_{cpu} \cdot S_{event}}{T_{avg}} \tag{1}$$
$$B_{out} = \frac{C_{cpu} \cdot S_{reply}}{T_{avg}} \tag{2}$$
where: $C_{cpu}$ – General CPU count, $S_{event}$ – event size, $S_{reply}$ – reply size, $T_{avg}$ – CPU performance.
Availability and connectivity metrics
– Job success rate (soft/hard) – percentage of jobs which end with success; this parameter has two values: warning (soft) and critical (hard).
– Grid job start-up (soft) – time needed to start a HEP VO job, measured from the moment it reaches the site’s Computing Element until the job runs; this time should only be guaranteed for dedicated CPUs.
– Job wall-clock time (soft) – how long a single job can occupy a CPU.
– Job CPU time (soft) – how much CPU time can be used by a single job.
– Inbound connectivity (hard) – inbound connectivity from CERN to the site’s worker nodes is required by CERN security policy.
Support and expertise metrics
– Problem response time (soft) – time from the problem report to the site’s response.
– Time to solve non-critical problem (soft) – time needed to solve a non-critical problem, i.e. a problem where a soft SLA metric is not fulfilled.
– Time to solve critical problem (hard) – time needed to solve a critical problem, i.e. a problem where a hard SLA metric is not fulfilled.
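Equations (1) and (2) can be evaluated directly once a site declares its CPU count and measured CPU performance. The following is a minimal sketch: the function name `optimal_bandwidth` and the sample values (100 CPUs, a reply size of 0.25, 2 seconds per event) are hypothetical, and only the 1.5 event size is taken from the text. The result is expressed in the size unit used for the events, per unit of time used for $T_{avg}$.

```python
def optimal_bandwidth(cpu_count, event_size, reply_size, t_avg):
    """Return (B_in, B_out) following equations (1) and (2).

    cpu_count  -- general CPU count, C_cpu
    event_size -- size of one event, S_event
    reply_size -- size of one reply, S_reply
    t_avg      -- time needed to compute one model event, T_avg
    """
    if cpu_count < 0:
        raise ValueError("cpu_count must not be negative")
    if t_avg <= 0:
        raise ValueError("t_avg must be positive")
    # Each CPU consumes one event every t_avg time units, so the site as a
    # whole processes cpu_count / t_avg events per time unit.
    events_per_time_unit = cpu_count / t_avg
    b_in = events_per_time_unit * event_size
    b_out = events_per_time_unit * reply_size
    return b_in, b_out


if __name__ == "__main__":
    # Hypothetical site: 100 CPUs, 2 s per event, events of size 1.5,
    # replies of size 0.25 (same unit as the event size).
    b_in, b_out = optimal_bandwidth(100, 1.5, 0.25, 2.0)
    print(b_in, b_out)  # 75.0 12.5
```

Because both equations share the factor $C_{cpu} / T_{avg}$, a faster CPU (lower $T_{avg}$) raises the bandwidth a site needs in order to keep its CPUs fed with events.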
5 SLA Enforcement for HEP VO
A defined and signed SLA should guarantee that services will be provided at the desired level. However, without an effective method of enforcement, a customer is not able to confirm whether the SLA is fulfilled. This problem is very important for the HEP application, where a slight infringement of the SLA, if not detected in time, can lead to significant data loss. The HEP VO is equipped with a simple management mechanism which is based on the VO Portal. This portal is used for the following tasks:
– site certification [16],
– submission of the application’s processing tasks to the Grid and monitoring of their execution (using the included Grid-job submission service), see Figs. 2 and 3,
– gathering monitoring data and creating statistics to allow easy SLA enforcement.
Fig. 2. HEP VO portal - monitoring of submitted jobs
Fig. 3. HEP VO portal - site’s job statistics
The portal is used during the site certification process to gather information about the site and its resource declarations, which are then included in the SLA. Using data from the available information sources, the portal provides a quick overview of the SLA metrics and makes it easy to judge whether the site is fulfilling the SLA. The portal gathers information from the RTD and the Grid-job submission service. Based on the data gathered and presented by the portal, the VO Manager can decide whether the site is working correctly or should be temporarily suspended. In case of critical problems the site is automatically suspended, and a notification is sent to the VO Manager and the site administrators. Suspended sites are not considered during job submission; jobs are submitted only to sites which are in the certified state.
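The warn-or-suspend rule that follows from the soft and hard constraints can be sketched as below. This is an illustration only, not the VO Portal's code: the class and function names are hypothetical, and the threshold values used in the example are invented, since the paper does not give concrete warning and critical levels.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SOFT = "soft"
    HARD = "hard"


class Action(Enum):
    NONE = "none"
    WARN = "warn"
    SUSPEND = "suspend"


@dataclass(frozen=True)
class Violation:
    metric: str
    severity: Severity


def check_job_success_rate(rate, warning_level, critical_level):
    """Job success rate has two thresholds: warning (soft), critical (hard)."""
    if rate < critical_level:
        return Violation("job success rate", Severity.HARD)
    if rate < warning_level:
        return Violation("job success rate", Severity.SOFT)
    return None


def decide_action(violations):
    """Any hard violation suspends the site; soft ones only trigger a warning."""
    if any(v.severity is Severity.HARD for v in violations):
        return Action.SUSPEND
    if violations:
        return Action.WARN
    return Action.NONE


if __name__ == "__main__":
    # Hypothetical thresholds: warn below 95 %, suspend below 80 %.
    found = [check_job_success_rate(0.90, 0.95, 0.80)]
    found = [v for v in found if v is not None]
    print(decide_action(found))  # Action.WARN
```

Keeping the severity on the violation rather than on the action lets further metrics be added without touching `decide_action`.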
6 Summary and Conclusions
Real-time applications require services to be provided at a strictly defined level, which should be guaranteed by the service provider. An SLA is the way to establish a common understanding of services and to state the customer’s requirements. The HEP application is a customer of selected grid sites; these grid sites have to follow the certification procedure described in [16] and, as a consequence, sign an SLA. Sites with a signed SLA should be able, from the technical point of view, to match the requirements of the application. A signed SLA gives the VO a formal basis to run the HEP application and to enforce the declared service level.
In the paper we proposed metrics for the SLA between the HEP VO and the sites supporting the VO. These metrics are tailored to match the requirements of the real-time HEP application against the grid infrastructure. They are defined in an easy-to-understand way and can be automatically collected using the available application services. All measurement data are available through the VO portal, which provides a quick overview of the application environment. With well-defined metrics it is possible to perform automatic SLA enforcement. The HEP VO will be fitted with a component which will react to any problem covered by the SLA and, depending on its severity, will send a warning or suspend the affected site’s operations in the VO.
Acknowledgment
The work has been supported by the EU IST 031857 int.eu.grid project. The AGH-UST grant is also acknowledged.
References
1. Interactive European grid project homepage, http://www.interactive-grid.eu/
2. Khater, G.: SLA Management, Networld Interop May 7 (2001)
3. Hayes, I.S.: Metrics for IT Outsourcing Service Level Agreements, Clarity Consulting (2004), http://www.clarity-consulting.com/MetricsforIToutsourcing.pdf
4. Benkner, S., Engelbrecht, G., Middleton, E.S., Surridge, M.: Supporting SLA Negotiation for Grid-based Medical Simulation Services. In: Workshop on State-Of-The-Art in Scientific and Parallel Computing, Umea, Sweden, June 18-21 (2006)
5. Sahai, A., Graupner, S., Machiraju, V., van Moorsel, A.: Specifying and Monitoring Commercial Grids through SLA. In: CCGrid (May 2003)
6. Czajkowski, K., Foster, I., Kesselman, C., Sander, V., Tuecke, S.: SNAP: A Protocol for Negotiation of Service Level Agreements and Coordinated Resource Management in Distributed Systems. In: Feitelson, D.G., Rudolph, L., Schwiegelshohn, U. (eds.) JSSPP 2002. LNCS, vol. 2537, Springer, Heidelberg (2002)
7. Dumitrescu, C.L., Foster, I.: GRUBER: A Grid Resource Usage SLA Broker. In: Cunha, J.C., Medeiros, P.D. (eds.) Euro-Par 2005. LNCS, vol. 3648, pp. 465–474. Springer, Heidelberg (2005)
8. Sandholm, T.: Service Level Agreement Requirements of an Accounting-Driven Computational Grid, Royal Institute of Technology, Technical Report TRITA-NA0533 (2005)
9. Paschke, A., Schnappinger-Gerull, E.: A Categorization Scheme for SLA Metrics. In: Multi-Conference Information Systems (MKWI 2006), Passau, Germany (2006)
10. Enabling Grids for E-sciencE (EGEE), http://www.eu-egee.org/
11. EGEE SA1 SLA Working Group, https://twiki.cern.ch/twiki/bin/view/EGEE/SA1 SLA WG
12. South Eastern European GRid-enabled eInfrastructure Development SLA, http://wiki.egee-see.org/index.php/SG SLA
13. EGEE Network Service Level Agreement, http://egee-sa2.web.cern.ch/egee-sa2/SLA.html
14. The ATLAS Experiment, http://atlas.web.cern.ch/Atlas/index.html
15. ATLAS Collaboration, Technical Proposal, CERN/LHCC/94-43, LHCC/P2 (December 15, 1994)
16. Skital, L., Dutka, L., Korcyl, K., Janusz, M., Slota, R., Kitowski, J.: Virtual Organization approach for running HEP applications in Grid Environment. In: Cracow Grid Workshop, Cracow, Poland, October 15-18 (2006)
17. Dutka, L., Korcyl, K., Zielinski, K., Kitowski, J., Slota, R., Funika, W., Balos, K., Skital, L., Kryza, B.: Interactive European Grid Environment for HEP Application with Real Time Requirements. Cracow Grid Workshop, Cracow, Poland, October 15-18 (2006)
18. Pieczykolan, J., Korcyl, K., Dutka, L., Kitowski, J.: The Real-Time Dispatcher – A Solution for Running Applications on Grid with QoS requirements. In: GRID 2007, Austin, Texas, USA, September 19-21 (submission, 2007)
19. Balos, K., Zielinski, K.: JMX-based Grid Management Services, Broadnets, San Jose, California, USA, October 25-29 (2004)
20. Balis, B., Bubak, M., Funika, W., Wismueller, R., Radecki, M., Szepieniec, T., Arodz, T., Kurdziel, M.: Grid Environment for On-line Application Monitoring and Performance Analysis. Scientific Programming 12(4), 239–251 (2004)
Dynamic Control of Grid Workflows through Activities Global State Monitoring
Marek Tudruj1,2, Damian Kopanski1, and Janusz Borkowski1
1 Polish-Japanese Institute of Information Technology, 86 Koszykowa Str., Warsaw
2 Institute of Computer Science, Polish Academy of Sciences, 21 Ordona Str., Warsaw
{tudruj,damian,janb}@pjwstk.edu.pl
Abstract. This paper presents a method for designing dynamic execution control in Grid workflows, based on predicates defined on global states of constituent activities. The method introduces user-defined, control-dedicated processes called synchronizers. Co-operating synchronizers provide dynamic features in Grid workflow functionality. The synchronizer-based control infrastructure also provides means for the structured design of complicated control in Grid-level application programs. This control method has been embedded in a graphical parallel program design system, which facilitates the programming process. The principles of the proposed Grid programming environment are illustrated with a representative example of a dynamic workflow implementation.
1 Introduction
Parallel Grid-level applications consist of parallel programs executed in cluster environments. The execution of these programs must be coordinated. In our design, a special global control infrastructure is used for this purpose. Each cluster-level parallel program has an associated control process called a synchronizer. Constituent processes report their local states to local synchronizers. The synchronizers compute program-level global states [1,2] and evaluate control predicates on them. Based on the evaluation result, control signals can be sent to processes to communicate control decisions. On reception of a control signal, actions defined by a programmer can be activated, influencing program behavior in an asynchronous manner. This control method decouples the control aspects, which are handled using a high-level abstraction, from the computational code. The PS-GRADE graphical program design environment [2], developed as an enhancement of the original P-GRADE system [4], supports the described control method. Parallel applications, implemented using the PS-GRADE system and run in clusters, can be executed in a coordinated fashion to form a Grid-level multi-application [3]. To achieve this, Grid-level synchronizers are introduced. They gather state reports from application synchronizers to construct Grid-level global states, evaluate predicates on them and send control signals to Grid applications.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 807–816, 2008. © Springer-Verlag Berlin Heidelberg 2008
We propose a generalized workflow design system, which extends the basic workflow control paradigms [5,6,7,8]. The system is constructed with the use of synchronizers, which allow for dynamic behavior of the resulting workflows. Many researchers have investigated dynamic features in workflows. Insertion and deletion of activities during workflow execution was described in [11]. In [12] the focus is on maintaining the structural correctness of the workflow graph when modifying it. Interoperation of workflows by dynamic binding was covered in [13,14]. In our approach, dynamic workflows allow for modification of the control flow between constituent activities and for changing the functionality of activities (which is normally completely opaque). The rest of the paper is composed of 3 parts. The synchronizer-based Grid application control is described in part 2. Implementation principles for the proposed Grid environment are described in part 3. Part 4 discusses features of the proposed dynamic workflow control, including some examples.
Fig. 1. Control flow in a synchronizer [flowchart: start; collect a new state report; if a new SCGS is detected, register the state and evaluate the starting conditions of predicates 1, ..., k; for each starting condition that holds, evaluate the corresponding predicate and, if it is true, send control signals]
2 Application Control Based on Synchronizers
Our approach to the control of Grid applications relies on monitoring the states of constituent parallel applications. Processes report their local states to monitors, which define global or regional (partial) application states. A user formulates predicates whose arguments are selected application states. Control signals are sent to application processes when a predicate is fulfilled. To prevent passive waiting for control decisions, the signals are handled asynchronously. They can activate some predefined parts of the program code, or they can break the current program execution to cancel the execution of some further part of the program code. Signals can carry data, which can be used by the activated code, and they can activate code that is introduced into the execution of current computations.
The application states we refer to are Strongly Consistent Global States (SCGS), i.e. sets of parallel process states for which it can be stated that they occurred at the same time. This approach requires that timestamps determined with a known accuracy are attached to process state reports. In many cases, the use of timestamps (and the necessary local clock synchronization) can be omitted and Observed Global States can be used instead of SCGS. This situation occurs in workflow control.
The PS-GRADE graphical parallel program design system incorporates the described control mechanisms [2]. A synchronizer process has special communication ports to receive state reports and to send control signals. It works according to the control scheme shown in Fig. 1. Each synchronizer collects state reports from processes. Based on them, it constructs user-predefined consistent/observed global/regional states. It then checks which predicates should be evaluated on the encountered states (based on the evaluation of predicate starting conditions) and evaluates them. A predicate is expressed by a user in the form of a control flow diagram of a similar type to that used for process code.
To design Grid-level parallel applications, we provide special extensions in PS-GRADE that make it possible to define Grid-level programs containing many component applications. In the case of workflows, the component applications are treated as workflow activities. An additional layer of Grid-level synchronizers has been introduced. Standard application-level synchronizers can report selected application states to Grid-level synchronizers. The Grid-level synchronizers create global or regional application states. Then, they compute control predicates on these states and send signals to influence the application behaviour. This influence can be exemplified by blocking/activating a control path in an application control flow graph or by modifying internal application functionality. The main advantage of our approach over a usual workflow system is that modifications of the control of Grid applications (including workflows) can be managed externally to the applications’ computational components, using the synchronizer-based control infrastructure.
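The control scheme of a synchronizer (Fig. 1) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions and not the PS-GRADE implementation: it works on Observed Global States (no timestamps), treats a global state as the latest report from every process, and represents predicates as plain Python callables instead of control flow diagrams. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

State = Dict[str, str]  # process name -> last reported local state


@dataclass
class Predicate:
    name: str
    starting_condition: Callable[[State], bool]  # cheap guard checked first
    evaluate: Callable[[State], bool]            # the full predicate
    signal: str                                  # control signal sent when true


class Synchronizer:
    def __init__(self, processes: List[str], predicates: List[Predicate]):
        self.processes = processes
        self.predicates = predicates
        self.latest: State = {}
        self.last_global: Optional[State] = None

    def on_report(self, process: str, local_state: str) -> List[str]:
        """Handle one state report; return the control signals to send."""
        self.latest[process] = local_state
        # Assumption: a global state exists once every process has reported.
        if set(self.latest) != set(self.processes):
            return []
        global_state = dict(self.latest)
        if global_state == self.last_global:   # no new global state detected
            return []
        self.last_global = global_state        # register the state
        signals = []
        for p in self.predicates:
            if p.starting_condition(global_state) and p.evaluate(global_state):
                signals.append(p.signal)
        return signals
```

Keeping the starting condition separate from the predicate body mirrors the two-step check in Fig. 1: the guard is evaluated on every new state, while the possibly more expensive predicate runs only when the guard holds.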
3 Implementation of Grid Workflow Control
The general architecture of the Grid PS-GRADE environment is shown in Fig. 2. GRADE Globus Web Services (GGWSs) constitute user-defined Grid service repositories used in the Grid PS-GRADE environment. These services implement such actions as: data communication between local synchronizers (Synchr(i)) and Grid-level synchronizers (GLS), management of SOAP message transfers of state reports and control signals, management of timestamps attached to state reports (used in SCGS detection), management of the time synchronization of the involved Grid centers, and monitoring of the execution of PS-GRADE applications using message passing (PVM) communication inside Globus Grid Centers. Communication between the GGWSs and application-level synchronizers is performed through the intermediation of GRADE Server Processes (GSPs), using message passing (PVM) and TCP sockets.
Fig. 2. Grid PS-GRADE general architecture [diagram labels: Globus Grid Centers 1-3; PS-GRADE applications (PVM domains) with processes P1,1 ... P1,k and P3,1 ... P3,g; local synchronizers Synchr(i); GSP(i); GGWS(i); GLS; PVM messaging; TCP socket messaging; Web-Services SOAP messaging; links to other Grid centers; GGMS (GRADE Globus Maintenance Center)]
The global execution management of a Grid PS-GRADE workflow application is provided by the GRADE Globus Maintenance Service (GGMS), whose control flow diagram is shown in Fig. 3. Its basic functions are to initiate the execution of applications based on the data-flow starting predicates and to activate applications on requests from Grid synchronizers. The GGMS creates and updates a Grid-level application execution map, which contains information on all running workflow applications, i.e. the assignments and states of their activities. The GGMS distributes the execution map among all the GGWSs. It also handles the global clock synchronization and distributes clock synchronization signals to all GGWSs. The GGMS activates GridFTP services for sending input/output data files to/from activities and for transferring program code files. The GGMS is specified by the user in terms of user-defined Globus services [9,10]. It executes GRAM services to handle job management and uses the SOAP protocol to communicate with the Grid-level synchronizers via the GGWSs.
Application workflow schemes are described in XML. The workflow scheme descriptions, formulated by the user as XML files, are sent to the GGMS. The GGMS analyzes the received workflow schemes and first evaluates the basic initial workflow predicates (following the simple data-flow principle). According to them, it handles the initial activation of activities. It continues to handle the activation and deactivation of workflow activities during the entire workflow run. If required, the GGMS also activates and deactivates activities on requests coming from Grid-level synchronizers (GLSs) as a result of predicates evaluated on global or regional Grid application states.
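To make the data-flow starting predicates concrete, the following Python sketch parses a small workflow description and lists the activities whose input files are all available. The XML element and attribute names are invented for this illustration, since the text does not give the actual PS-GRADE workflow schema; the readiness rule (all inputs present, activity not yet started) is one plausible reading of the "simple data-flow principle".

```python
import xml.etree.ElementTree as ET

# Hypothetical workflow scheme; the real XML format is not given in the text.
WORKFLOW_XML = """
<workflow>
  <activity name="A"><output file="a.out"/></activity>
  <activity name="B"><input file="a.out"/><output file="b.out"/></activity>
  <activity name="D"><input file="b.out"/><input file="a.out"/></activity>
</workflow>
"""


def parse_inputs(xml_text):
    """Map each activity name to the set of input files it needs."""
    root = ET.fromstring(xml_text)
    return {
        act.get("name"): {i.get("file") for i in act.findall("input")}
        for act in root.findall("activity")
    }


def ready_activities(inputs, available_files, started):
    """Data-flow starting predicate: all inputs present and not yet started."""
    available = set(available_files)
    return sorted(
        name for name, needed in inputs.items()
        if name not in started and needed <= available
    )
```

For example, with no files available only activity A (which has no inputs) is ready; once a.out exists and A has been started, B becomes ready.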
Fig. 3. The GGMS block diagram
4 Dynamic Workflows Based on Synchronizers
The proposed control mechanisms are general purpose and extend the standard workflow features towards efficient implementation of more complicated control flow patterns. We can subdivide known workflow instances into static workflows and dynamic workflows. Static workflows contain activities that are atomic, which means that their functionality does not change during workflow execution. Workflow control activities are performed at the boundaries of the activities. On the completion of activities their final states are examined and control flow decisions in the workflow are taken.
Fig. 4. Static (left) and dynamic (right) control in workflows with synchronizers
The aim of dynamic workflows is to enable the control flow and the functional contents of activities to be at least partially modified during workflow execution. It should be possible to make such modifications depending on the states of sets of activities in a workflow. Such dynamic workflows can be useful for arranging workflow control when execution takes place in a distributed system. In our approach, dynamic workflow control is inherently based on global state analysis of sets of workflow activities, since it uses the control infrastructure supplied by synchronizers. The main features of our approach to dynamic workflow control in the Grid, as opposed to static control, are illustrated in Fig. 4.
In standard static workflows, activities have fixed behavior and the control patterns in which they are placed are supervised by workflow engines. In the environment that we propose, synchronizers and/or Grid services are used to organize the flow of control in static workflows, but the control decisions are taken based on the output data files of executed activities. The GGMS is used both to perform the necessary control actions and to organize transfers of input/output data files. For workflow schemes with more complicated control patterns, such as the N-out-of-M split and join shown in Fig. 4 (left), the necessary control predicates are programmed and evaluated inside synchronizers. The synchronizers organize the dispatching of input data files for activities that are ready to execute and then request their activation by the GGMS. The GGMS directs output files from the completed activities to be stored in local result repositories. Before the execution of any activities starts, the respective Grid synchronizers are activated as a result of sending the Grid execution maps to the GGWSs of the synchronizers. The synchronizers evaluate control predicates based on state reports and on the output data of completed activities stored in the local repositories.
Dynamic workflows implemented in the environment of synchronizers, Fig. 4 (right), enable modifying both the external control flow and the internal functionality of activities at workflow run-time. The modifications are controlled by predicates defined in synchronizers on global/partial consistent/observed states of running activities. Dynamic insertions (spawnings) of new activities are not yet enabled; however, the internal modifications of activities are asynchronous with respect to the activities’ starting time. To provide some control of this asynchronism in the face of possible network delays, the reactions to control signals from synchronizers are additionally governed by user-defined sensitivity regions in the code.
At the level of computers working inside clusters, the reports on activity states are initially collected by local (application) synchronizers. The synchronizers evaluate the respective predicates on them, and then control signals, which correspond to reports on application states, can be sent via the GGWSs to Grid-level synchronizers. Based on these reports, the Grid synchronizers can generate Grid-level application states (global/regional, consistent/observed). Then, they can evaluate predicates on these application states and send control signals down to local synchronizers to request the necessary reactions inside applications. If the reaction to the evaluation of Grid-level predicates concerns the activation/deactivation of entire workflow activities (Grid applications), the Grid synchronizers call the respective GGMS services to execute the required actions.
The control of the dynamic N-out-of-M split and join workflow scheme presented in Fig. 4 (right) is shown in more detail in Fig. 5. The workflow is composed of the activities A, B.1, ..., B.M, D, which are entire parallel applications executed on the Grid. There are two Grid synchronizers, SYNCH1 and SYNCH2, which execute the basic control for this workflow scheme. The SYNCH1 synchronizer activates M (or fewer) child activities. The SYNCH2 synchronizer then performs partial synchronization: it waits for the completion of N activities. The dynamic workflow control is based on an additional Grid synchronizer called Grid-SYN. Grid-SYN receives the state messages from the activities B.1, ..., B.m, m ≤ M, which were activated by SYNCH1. The structure of B.M is presented in the B.M part of Fig. 5. It consists of 5 processes, B1, ..., B5, and the local synchronizer LSynchB. Processes B1, ..., B5 send reports on their states to LSynchB, which transfers the B.M state reports to the Grid synchronizers Grid-SYN and SYNCH2. For the dynamic workflow, Grid-SYN receives state messages from B.1, ..., B.m, which contain the activities’ loads and the states of their computations. The received state messages are used for the detection of strongly consistent global states (SCGS). Based on the detected SCGSs, Grid-SYN evaluates control predicates that estimate the current progress of computations and the need to activate additional activities B.i. As a result, Grid-SYN decides whether it should send a control message to SYNCH1 with the information that it should increase the number of working activities B.i. SYNCH1 then activates a number of new B.i applications on the Grid, using services provided by the GGMS center. SYNCH1 also sends a control signal to SYNCH2 with information about the current number of working activities, m. SYNCH2 performs the partial synchronization of the running activities B.1, ..., B.m. The ACT-A part of Fig. 5 illustrates the activation of an activity. The synchronizer SYNCH1 sends an activating signal to the GGMS. The GGMS performs the condition check and activates the application B.M.
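The SCGS detection step mentioned above can be illustrated with a small Python sketch. The text only states that timestamps with a known accuracy are attached to state reports; the interval-intersection test below is one common way to use such timestamps and is an assumption here, not the algorithm of [1]. Each local state is taken to hold between two local timestamps, and `eps` is the assumed clock accuracy.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class LocalState:
    value: str
    start: float   # local timestamp at which the state began
    end: float     # local timestamp at which the state ended


def strongly_consistent(
    states: Dict[str, LocalState], eps: float
) -> Optional[Tuple[float, float]]:
    """Return the common time window if the states surely coexisted, else None."""
    # Shrink every interval by the clock accuracy, then intersect the results.
    lo = max(s.start + eps for s in states.values())
    hi = min(s.end - eps for s in states.values())
    return (lo, hi) if lo < hi else None
```

If the shrunken intervals share a non-empty window, the local states held simultaneously regardless of the clock error, which matches the informal definition of an SCGS given in Section 2.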
Fig. 5. Dynamic activation of activities in workflows with synchronizers
Another part of Fig. 5, ACT-S, shows how the monitoring of the completion condition for all running applications B.i is implemented. When an application completes, its local synchronizer calls the proper GGWS service, which sends the respective report to the SYNCH2 synchronizer as a state message.
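The partial synchronization performed by SYNCH2 can be sketched as a simple counter over completion reports. This is a minimal sketch with hypothetical names: completion reports are modelled as method calls, and the control signal from SYNCH1 that carries the current number of working activities m is modelled by `update_running`.

```python
class PartialJoin:
    """Sketch of SYNCH2: fire once when N of the running activities complete."""

    def __init__(self, n_required: int, m_running: int):
        self.n_required = n_required
        self.m_running = m_running
        self.completed = set()
        self.fired = False

    def update_running(self, m_running: int) -> None:
        # Models the control signal from SYNCH1 with the current value of m.
        self.m_running = m_running

    def on_completion(self, activity: str) -> bool:
        """Record a completion report; return True when the join fires."""
        self.completed.add(activity)
        if not self.fired and len(self.completed) >= self.n_required:
            self.fired = True
            return True
        return False

    def still_running(self) -> int:
        # Activities that were started but have not reported completion.
        return self.m_running - len(self.completed)
```

Storing completed activities in a set makes duplicate completion reports harmless, and the `fired` flag ensures that the successor activity is triggered only once.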
5 Conclusion
This paper has presented how to design dynamic execution control in Grid workflows with the use of user-defined, control-dedicated processes called synchronizers. The synchronizers evaluate predicates defined on global states of constituent workflow activities. They provide dynamic features in both the structure of Grid workflows and the functionality of constituent activities. The synchronizers provide a useful control infrastructure, which facilitates the structured design of complicated control in Grid-level application programs. The proposed workflow control design method has been embedded in a graphical parallel program design system, which makes programming easier and less error-prone. It is currently under implementation inside the Grid PS-GRADE graphical parallel program design system in co-operation with the SZTAKI Institute of the Hungarian Academy of Sciences.
References
1. Borkowski, J.: Parallel Program Control Based on Hierarchically Detected Consistent Global States. In: PARELEC 2004, Dresden, Germany, pp. 328–333. IEEE, Los Alamitos (2004)
2. Tudruj, M., Borkowski, J., Kopanski, D.: Graphical Design of Parallel Programs with Control Based on Global Application States Using an Extended P-GRADE System. In: Distributed and Parallel Systems, Cluster and GRID Comp., vol. 777. Kluwer (2004)
3. Kopanski, D., Borkowski, J., Tudruj, M.: Co-ordination of Parallel Grid Applications Using Synchronizers. In: Proceedings of PARELEC 2004, Dresden, Germany, pp. 323–327. IEEE, Los Alamitos (2004)
4. Kacsuk, P., Dózsa, G., Lovas, R.: The GRADE Graphical Parallel Programming Environment. In: Kacsuk, P., Cunha, J.C., Winter, S.C. (eds.) The book: Parallel Program Development for Cluster Computing: Methodology, Tools and Integrated Environments, ch. 10, pp. 231–247. Nova Science Publishers, New York (2001)
5. Lovas, R., Dózsa, G., Kacsuk, P., Podhorszki, N., Drótos, D.: Workflow Support for Complex Grid Applications: Integrated and Portal Solutions. In: Proceedings of 2nd European Across Grids Conference, Nicosia, Cyprus (2004)
6. Kiepuszewski, B.: Expressiveness and Suitability of Languages for Control Flow Modelling in Workflows, Ph. D. Thesis, Queensland University of Technology (2003)
7. Van der Aalst, W.M.P., et al.: Workflow Patterns. Distributed and Parallel Databases 14, 5–51
8. White, S.A.: Process Modelling Notations and Workflow Patterns, BPTrends (2004), http://www.omg.org/bp-corner/bp-files/Process_Modeling_Notations.pdf
9. Globus Toolkit, http://www.globus.org/toolkit
10. Open Grid Services Infrastructure, https://forge.Gridforum.org/projects/ogsi-wg
11. Reichert, M., Dadam, P.: A Framework for Dynamic Changes in Workflow Management Systems. In: Proc. 8th Int. Workshop on database and Expert Systems Applications, pp. 42–48. IEEE, Los Alamitos (1997)
12. Reichert, M., Dadam, P.: ADEPTflex-Supporting Dynamic Changes of Workflows Without Losing Control. J. Intell. Inf. Syst. 10(2), 93–129 (1998)
13. Kwak, M., Han, D., Shim, J.: A Framework Supporting Dynamic Workflow Interoperation and Enterprise Application Integration. In: Proc. of 35th Hawaii Int. Conf. on System Sciences, IEEE, Los Alamitos (2002)
14. Fakas, G.J., Karakostas, B.: A peer to Peer (P2P) architecture for dynamic workflow management. Information and Software Technology 46, 423–431 (2004)
Transparent Access to Grid-Based Compute Utilities
Constantino Vázquez, Javier Fontán, Eduardo Huedo, Rubén S. Montero, and Ignacio M. Llorente
Departamento de Arquitectura de Computadores y Automática, Facultad de Informática, Universidad Complutense de Madrid, Spain
{tinova, jfontan, ehuedo}@fdi.ucm.es, {rubensm, llorente}@dacya.ucm.es
Abstract. There is normally the need for different grid architectures depending on the constraints: from enterprise infrastructures to partner or outsourced ones. We present the technology that enables the creation and combination of different grid architectures, offering the possibility to create federated grids that can be easily deployed as compute utilities on grid infrastructures based on the Globus Toolkit. Then, taking advantage of the uniform and standard interface of the proposed grid-based compute utilities, we present a way to enable transparent access to virtualized grids from a local computing infrastructure managed by Sun Grid Engine, although similar approaches could be followed for other resource managers.
1 Introduction
Grid computing is concerned with “coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations”, as stated by Ian Foster in [1]. Therefore, due to its multi-institutional nature, a grid is built on top of potentially heterogeneous resources. Furthermore, due to the socio-political landscape, there is normally the need for different grid architectures depending on the constraints. We can differentiate three kinds of arrangements. Firstly, there are enterprise grids, contained within the boundaries of one organization. They improve internal collaboration and enable a better return on IT investment, coming from more optimal resource utilization. Secondly, there are partner grids, created using the resources of two or more organizations and/or scientific communities. Partner grids allow access to higher computing performance to satisfy peak demands and also provide support for collaborative projects. As a third architecture we find outsourced grids. Although they are not a reality right now, it has been predicted that dedicated service providers will organize and maintain cyberinfrastructures to supply resources on demand over the Internet. This business model will be feasible when network capacity has grown enough to allow businesses and consumers to draw their computing resources from outsourced grids in addition to enterprise and partner grids [2].
These different grid perspectives need technology that ensures their deployment in a way that preserves the autonomy of each organization, while at the same time offering overall security (critical if organizations are planning to outsource sensitive data) and scalability (an inherent requirement of grid infrastructures). In this paper we present the technology that enables the creation and combination of different grid architectures, i.e. we offer the possibility to create federated grids that can be easily deployed as compute utilities on grid infrastructures based on the Globus Toolkit (GT). Furthermore, taking advantage of the uniform and standard interface of these grid-based compute utilities, we present a way to enable transparent access to virtualized grids from a local computing infrastructure managed by Sun Grid Engine, although similar approaches could be followed for other resource managers.
The rest of this paper is organized as follows. Section 2 introduces and explains the “grid for utility” concept. Section 3 presents the GridGateWay solution for building federated grid infrastructures based on the Globus Toolkit and the GridWay Metascheduler. It also describes a testbed for a federated grid and provides some results. In Section 4, we present a complementary component to enable access to virtualized grids using the Sun Grid Engine Transfer Queue to Globus. Finally, Section 5 presents some conclusions and our plans for future work.
This research was supported by Consejería de Educación of Comunidad de Madrid, Fondo Europeo de Desarrollo Regional (FEDER) and Fondo Social Europeo (FSE), through BioGridNet Research Program S-0505/TIC/000101, by Ministerio de Educación y Ciencia, through research grant TIN2006-02806, and by European Union, through BEinGRID project EU contract IST-2005-034702.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 817–824, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Grid for Utility
Originally, the concept of grid was developed from the power grid. In short, the idea behind a grid is to be able to plug in a computing resource so that it gets added to the total computing power with the minimum possible configuration. It should also be possible to use the computing capability of the grid as a single computer, transparently to the user, in the same way that an electric device can be plugged into a power socket to draw electricity from it. Utility computing is a service provisioning model that will provide adaptive, flexible and simple access to computing resources, enabling a pay-per-use model for computing similar to traditional utilities such as water, gas or electricity. It pursues a scenario where computing power is given or taken (consumer/producer paradigm) by the different components that make up the grid, based on the supply of and demand for this computing power at a given time. This poses different problems, ranging from accounting and billing to scalability and security. One desired outcome of the “grid for utility” idea is to allow the creation of the outsourced grids identified above. It will indeed provide means to sell computing power to anyone interested in it. Obviously, security is essential, as sensitive data may be outsourced. Several efforts have been made to tackle this problem [3,4].
3 The GridGateWay Solution for Grid Federation
In order to leverage the practical implementation of the concepts of the previous section we provide means to build grid federations, although admittedly only in a somewhat still crude form. A GridGateWay provides the standard functionality required to implement a gateway to a federated grid. The main innovation of our model is the use of Globus Toolkit services to recursively interface to the services available in a federated Globus-based grid. Such a combination allows the required virtualization technology to be created in order to provide a powerful abstraction of the underlying grid resource management services. The GridGateWay acts as a utility computing service, providing a uniform standard interface based on Globus protocols and services for the secure and reliable submission and control of jobs, including file staging, on grid resources. A GridGateWay offers the possibility of encapsulating a virtualized grid inside a WS (Web Service) GRAM (Globus Resource Allocation and Management), using GridWay as the underlying LRMS (Local Resource Management System). To interface GridWay through GRAM, a new scheduler adapter has been developed along with a scheduler event generator. Also, a scheduler information provider has been developed in order to feed MDS (Monitoring and Discovery Service) with scheduling information. The grid hierarchy in our federation model is clear. An enterprise grid, managed by the IT Department, includes a GridGateWay to an outsourced grid, managed by the service provider. The outsourced grid provides on-demand or pay-per-use computational power when local resources are overloaded. This hierarchical grid organization may be extended recursively to federate a higher number of partner or outsourced grid infrastructures with consumer/provider relationships. Figure 1 shows one of the many grid infrastructure hierarchies that can be deployed with GridGateWay components. 
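The recursive character of this federation model can be illustrated with a toy Python model. It is only a sketch of the idea that a whole grid is exposed as a single resource behind a uniform interface; it does not use or imitate the Globus WS GRAM or GridWay APIs, and the class names, the selection policy (most free nodes first) and the returned path strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Resource:
    name: str
    free_nodes: int

    def free(self) -> int:
        return self.free_nodes

    def submit(self, job: str) -> Optional[str]:
        """Run the job here if a node is free; return where it was placed."""
        if self.free_nodes == 0:
            return None
        self.free_nodes -= 1
        return self.name


@dataclass
class Gateway:
    """A whole grid exposed as one resource (toy model of a GridGateWay)."""
    name: str
    members: list = field(default_factory=list)  # Resources or nested Gateways

    def free(self) -> int:
        # Only aggregate status is visible from outside the gateway.
        return sum(m.free() for m in self.members)

    def submit(self, job: str) -> Optional[str]:
        # Assumed local policy: pick the member with the most free nodes.
        best = max(self.members, key=lambda m: m.free(), default=None)
        if best is None or best.free() == 0:
            return None
        return f"{self.name}/{best.submit(job)}"
```

Because `Gateway` offers the same `free` and `submit` operations as `Resource`, a gateway can be listed as a member of another gateway, which is the property that allows the hierarchy to be extended recursively.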
This hierarchical solution, which resembles the architecture of the Internet (characterized by the end-toend argument [5] and the IP hourglass model [6]), involves some performance overheads, mainly higher latencies, which have been quantified elsewhere [7]. The access to resources, including user authentication, across grid boundaries is under control of the GridGateWay service and is transparent to end users. In fact, different policies for job transfer and load balancing can be defined in the GridGateWay. The user and resource accounting and management could be performed at different aggregation levels in each infrastructure. There are several characteristics of this architecture that we identify as advantages. First one, it is completely based on standard protocols and uses a de facto standard grid middleware as Globus Toolkit is. Secondly, it doesn’t require for extra deployment for clients as it uses a GRAM interface. And, as a third
Fig. 1. GridWay encapsulated under a Globus WS GRAM service
advantage, which is also its main objective, it allows for a complex grid federation based on a recursive architecture.
In order to test the GridGateWay we used the following testbed. On one server (cepheus) we have an instance of GridGateWay that virtualizes a partner grid, in this case the fusion Virtual Organization (VO) of EGEE. Also, we have a local cluster managed by GridWay, which also has access to the EGEE partner grid. As can be seen in Figure 2, cepheus appears as having a GW (GridWay) LRMS and displays only the number of total and free nodes, with no further configuration details, since it is in fact virtualizing the partner grid. Users do not know the details of the grid behind cepheus. Then again, they do not really need to know, but if needed a host information provider could be developed.
tinova@draco:~$ gwhost
HID OS              ARCH  MHZ  %CPU MEM(F/T) DISK(F/T)     N(U/F/T)   LRMS HOSTNAME
0   Linux2.6.16-2-6 x86   3216 0    831/2027 114644/118812 0/0/1      Fork cygnus.dacya.ucm.e
1   Linux2.6.16-2-a x86_6 2211 100  671/1003 76882/77844   0/2/2      SGE  aquila.dacya.ucm.e
2   Linux2.6.16-2-6 x86   3215 0    153/2027 114541/118812 0/0/1      Fork draco.dacya.ucm.es
3   Linux2.6.16-2-a x86_6 2211 100  674/1003 76877/77844   0/2/2      PBS  hydrus.dacya.ucm.e
4   NULLNULL        NULL  0    0    0/0      0/0           6/665/1355 GW   cepheus.dacya.ucm.
Fig. 2. Resources in UCM (enterprise grid) as provided by GridWay gwhost command. cepheus hosts a GridGateWay and appears as another resource with GW (GridWay) as LRMS (Local Resource Management System), only providing information about its status (used, free and total nodes).
Figure 3 shows the throughput achieved with this configuration. We used a small parameter sweep application composed of 100 tasks, each one needing about 10 seconds to execute on a 3.2 GHz Pentium 4 and about 10 KB of input/output data. UCM resources executed 73 jobs, while EGEE resources executed only 27, because their response time was much higher than that of UCM resources (the first job executed on EGEE completed 7.45 minutes after the experiment started). This was due in part to network latencies and the use of the GridGateWay, but mainly to the heavy load of production jobs on EGEE resources, which caused high queue wait times for the submitted jobs. UCM and EGEE contributed 297.96 and 121.65 jobs/hour, respectively, to the aggregated throughput of 408.16 jobs/hour. This configuration has been shown to give close to optimal performance, according to predictions made using a performance model for federated grid infrastructures [8].
Fig. 3. Throughput achieved in UCM when provisioning resources from EGEE (fusion VO) through a GridGateWay
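The way a gateway appears to its parent grid as a single resource that reports only used, free and total nodes (as cepheus does in Fig. 2) can be illustrated with a small recursive sketch. The class names Host and Gateway are hypothetical and are not part of GridWay or the Globus Toolkit; the sketch only assumes that a gateway reports the sum of the node counts of everything behind it.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Host:
    """A leaf resource (hypothetical), e.g. a cluster front-end."""
    name: str
    used: int
    free: int
    total: int

    def nodes(self) -> Tuple[int, int, int]:
        return (self.used, self.free, self.total)


@dataclass
class Gateway:
    """Hypothetical stand-in for a gateway hiding a whole grid behind one name."""
    name: str
    members: List[Union["Host", "Gateway"]] = field(default_factory=list)

    def nodes(self) -> Tuple[int, int, int]:
        # Sum used/free/total over every member, recursing into nested
        # gateways so that deeper federation levels stay hidden.
        used = free = total = 0
        for member in self.members:
            u, f, t = member.nodes()
            used, free, total = used + u, free + f, total + t
        return (used, free, total)


# A partner grid seen through a gateway, itself a member of an enterprise grid.
partner = Gateway("partner", [Host("a", 1, 3, 4), Host("b", 2, 0, 2)])
enterprise = Gateway("enterprise", [Host("local", 0, 2, 2), partner])
print(partner.nodes(), enterprise.nodes())
```

Because a Gateway exposes the same nodes() call as a Host, the parent never needs to know how many federation levels lie behind a member, which is the property the recursive architecture relies on.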
4 Sun Grid Engine Transfer Queue to Globus
An interesting application of the GridGateWay solution is the outsourcing of computing power to sustain peak demands. This can be done using a pure GridWay-based solution, where a local GridWay instance manages the local resources and uses a GridGateWay to satisfy the demand for computing power when the local cluster/grid is overloaded. We are aware that in the real, non-academic world (and even in academia) there is a huge inertia that makes changes difficult to adopt. With that in mind, we are committed to providing solutions for a wide range of current configurations, so as to allow our concept of grid federation without any major change in the available infrastructure (i.e. neither the grid middleware nor the underlying LRMS). GridWay already offers the technology to construct a grid out of heterogeneous sub-grids, based on different flavors of Globus (pre-WS and WS) [9,10]. On top of that, we now offer the possibility of accessing a federated grid not only from GridWay, but also from an LRMS like Sun Grid Engine (SGE).
Fig. 4. Grid federation achieved through GridGateWay and accessible through a local LRMS, in this case, SGE
Sun Grid Engine¹ is an open source batch-queuing system, supported by Sun Microsystems². This LRMS offers the possibility to define custom queues through the development of scripts to submit, kill, pause and resume a job. A Transfer Queue to Globus [11] is already available from the vendor, but only for GT2, with pre-WS interfaces. As stated above, GridGateWay uses the WS GRAM interface, so a Transfer Queue to GT4 has been developed³. What we obtain from this component is the ability to submit jobs to a virtualized grid (utility computing service) from a local, well-known LRMS in a way that is transparent to the user. Figure 4 shows SGE with one transfer queue configured. In the SGE domain we find the normal SGE local queues and a special grid queue that enables SGE to submit jobs to a WS GRAM service or to GridWay. This WS GRAM service can, in turn, virtualize a cluster (with several known LRMSs) or encapsulate a GridWay instance (thus forming a GridGateWay) that gives access to a Globus-based outsourced infrastructure. The special grid queue can be configured to be available only under certain special conditions, such as peak demand or resource outage.
¹ http://gridengine.sunsource.net
² http://www.sun.com
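The idea of making the grid queue available only under peak demand can be sketched as a simple routing rule. This is an illustration of the policy only, not SGE configuration syntax; the names QueueState, choose_queue and max_pending are hypothetical, and the overload criterion (no free slot and a backlog above a threshold) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class QueueState:
    """Snapshot of the local queues; field names are illustrative."""
    free_slots: int
    pending_jobs: int


def choose_queue(local: QueueState, grid_enabled: bool, max_pending: int = 10) -> str:
    """Return "local" or "grid" for the next job.

    The grid (transfer) queue is chosen only when it is enabled and the
    local queues have no free slot and a backlog larger than max_pending.
    """
    overloaded = local.free_slots == 0 and local.pending_jobs > max_pending
    if grid_enabled and overloaded:
        return "grid"
    return "local"


print(choose_queue(QueueState(free_slots=4, pending_jobs=0), grid_enabled=True))
print(choose_queue(QueueState(free_slots=0, pending_jobs=25), grid_enabled=True))
```

Keeping the decision in one function makes it easy to swap the overload criterion, for instance to react to a resource outage instead of a backlog.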
5 Conclusions and Future Work
In this paper we have shown a utility computing solution that can be deployed on any grid infrastructure based on existing Globus Toolkit components. This solution enables the creation of enterprise, partner and outsourced grids using automated methods on top of de facto grid standards in a simple, flexible and adaptive way. Moreover, we have developed components intended to make the deployment as simple as possible, ideally requiring minimal changes to the existing local infrastructure. As of now we support just one LRMS, but it is within the scope of the Grid4Utility project⁴ to provide a wider range of components to handle as broad a set of pre-existing configurations as possible. Other aims of the project are the development of new components for scheduling, negotiation, service level agreement, credential management and billing.
Acknowledgments. This work makes use of results produced by the Enabling Grids for E-sciencE project, a project co-funded by the European Commission (under contract number INFSO-RI-031688) through the Sixth Framework Programme. EGEE brings together 91 partners in 32 countries to provide a seamless grid infrastructure available to the European research community 24 hours a day. Full information is available at http://www.eu-egee.org. We want to acknowledge all institutions belonging to the fusion VO⁵ of the EGEE project for giving us access to their resources.
³ Available at http://www.grid4utility.org/#SGE
⁴ http://www.grid4utility.org
⁵ http://grid.bifi.unizar.es/egee/fusion-vo
References
1. Foster, I.: The Anatomy of the Grid: Enabling Scalable Virtual Organizations. Intl. J. High Performance Computing Applications 15(3), 200–222 (2001)
2. Llorente, I.M., Montero, R.S., Huedo, E., Leal, K.: A Grid Infrastructure for Utility Computing. In: Proc. 15th IEEE Intl. Workshops on Enabling Technologies: Infrastructures for Collaborative Enterprises (WETICE 2006), IEEE CS Press, Los Alamitos (2006)
3. Humphrey, M., Thompson, M.: Security Implications of Typical Grid Computing Usage Scenarios. Cluster Computing 5(3), 257–264 (2002)
4. Azzedin, F., Maheswaran, M.: Towards Trust-Aware Resource Management in Grid Computing Systems. In: 2nd IEEE/ACM International Symp. Cluster Computing and the Grid (CCGrid 2002), pp. 452–457 (2002)
5. Carpenter, E.B.: Architectural Principles of the Internet. RFC 1958 (June 1996)
6. Deering, S.: Watching the Waist of the Protocol Hourglass. In: 6th IEEE International Conference on Network Protocols (1998)
7. Vázquez, T., Huedo, E., Montero, R.S., Llorente, I.M.: Evaluation of a Utility Computing Model Based on the Federation of Grid Infrastructures. In: Kermarrec, A.-M., Bougé, L., Priol, T. (eds.) Euro-Par 2007. LNCS, vol. 4641, Springer, Heidelberg (2007)
8. Vázquez, T., Huedo, E., Montero, R.S., Llorente, I.M.: A Performance Model for Federated Grid Infrastructures. In: The 16th Euromicro International Conference on Parallel, Distributed and network-based Processing (PDP 2008) (in press) (Submitted)
9. Vázquez-Poletti, J.L., Huedo, E., Montero, R.S., Llorente, I.M.: Coordinated Harnessing of the IRISGrid and EGEE Testbeds with GridWay. J. Parallel and Distributed Computing 66(5), 763–771 (2006)
10. Huedo, E., Montero, R.S., Llorente, I.M.: A Modular Meta-Scheduling Architecture for Interfacing with Pre-WS and WS Grid Resource Management Services. Future Generation Computer Systems 23(2), 252–261 (2007)
11. Seed, T.: Transfer-queue Over Globus (TOG): How To. Technical Report EPCCSUNGRID-TOG-HOW-TO 1.0, EPCC, The University of Edinburgh (July 2003), Available at: http://gridengine.sunsource.net/download/TOG/tog-howto.pdf
Towards Secure Data Management System for Grid Environment Based on the Cell Broadband Engine
Roman Wyrzykowski and Lukasz Kuczynski
Institute of Computational and Information Sciences, Czestochowa University of Technology
{roman,lkucz}@icis.pcz.pl
Abstract. Nowadays grid applications process large volumes of data. This creates the need for effective distributed data management solutions. For the ClusteriX grid project, the CDMS (ClusteriX Data Management System) has been developed. The analysis of user requirements and of existing implementations of data management systems in grids has been the foundation for its creation. Special attention has been paid to making the system user-friendly and efficient. In this paper, we propose to use the innovative Cell Broadband Engine to implement a set of measures which are necessary to fulfill the security and availability requirements of grid data management systems. Also, we discuss how this goal can be achieved in the CDMS2, an improved version of the CDMS.
Keywords: grid environment, data management system, security, cryptography, availability, erasure codes, Cell Broadband Engine.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 825–834, 2008. © Springer-Verlag Berlin Heidelberg 2008
1 Introduction
Data management issues are among the most important in modern grid environments [1]. As the applications being run on grids become more real-life oriented, they generate or depend on data sets of growing importance and confidentiality. One of the principal goals of data management systems in grids is to provide transparent and efficient access to globally distributed data [1]. Among the most important issues that need to be solved are: optimization of data transfers over the WAN, reliability and security of data access, and ease of use [11]. The most frequently encountered approach to solving these problems is based on the application of metadata and a data replication mechanism [19], [21]. Metadata are used, e.g., for translation of a logical filename to its physical location. The replication mechanism should provide optimization of data access and reliability.
An example of a data management system for grid environments is the CDMS (ClusteriX Data Management System) [13], [15], created for the management of data storage in the Polish national grid project ClusteriX [5], [23]. The CDMS has been developed based on the analysis of user requirements and existing implementations of such systems. When designing the CDMS, special attention has been paid to making the system highly productive and user-friendly.
There is a rapid increase in sensitive data, such as biomedical records [2] or financial data. Protecting such data while in transit as well as while at rest is crucial [14]. Distributed data storage systems have different security concerns than traditional file systems. Rather than being concentrated in one place, data are now spread across multiple hosts. The chances of a security breach are much greater. Consequently, suitable techniques and tools (e.g. cryptographic ones [6]) should be applied to fulfill such key security requirements as confidentiality, integrity, and availability for data management systems in distributed/grid environments.
The Cell Broadband Engine (Cell BE for short) [9] is the first implementation of the Cell Broadband Engine Architecture, which is based on heterogeneous chip multiprocessing. The Cell BE integrates nine processor elements (cores) of two types: the Power processor element (PPE) is optimized for control tasks, while the eight synergistic processor elements (SPEs) provide an execution environment optimized for data-intensive processing. Each SPE supports vector processing on 128-bit words. The element interconnect bus (EIB) connects all the processor elements with a high-performance communication subsystem. The PPE is fully compliant with the 64-bit Power Architecture specification, and can run 32-bit and 64-bit operating systems and applications. The SPEs are independent processors, each running an independent application thread. Each SPE includes a private local store for fast instruction and data access. Both types of processor cores share coherent access to a common address space, which includes main memory and address ranges corresponding to each SPE’s local store, control registers, and I/O devices.
In this paper, we propose to use the impressive computational power of the Cell Broadband Engine, coupled with its advanced security architecture (see Section 3), to implement a set of measures which are necessary to fulfill the security and availability requirements of modern grid data management systems. Also, we discuss how this goal can be achieved in the CDMS2, an improved version of the CDMS.
2 CDMS: Concept, Architecture, and Implementation
A modern data management system should be implemented with the following features in mind: transparent access, reliability, security and safety of transferred and stored data, access control, possibility of transparent data compression, and access optimization. The development of an intuitive and effective data access and system administration toolkit was seen as an equally important task. It is particularly necessary when the end-user is not expected to be aware of the low-level mechanisms. In the CDMS a Virtual File System (VFS) abstraction layer was implemented, creating an illusion of working with a local file system.
Fig. 1. Architecture of CDMS
2.1 CDMS Architecture
The architecture of the ClusteriX Data Management System was introduced in [12]. It has a modular design and consists of (Fig. 1):
– Main Management Module (CDMS Core)
– Global Data Catalogue (GDC)
– Local Data Catalogue (LDC)
– Transport Subsystem
– Synchronization Module
– Statistic Module
– Optimization Module
– Replication Module
The main part of the system is the CDMS Core, responsible for data collections management, data coherence, running the Optimizer and Replicator processes, and data transfer initialization. Using data stored in the Global Data Catalogue, the Main Management Module performs the mappings of logical filenames to the Storage Elements holding the data. Proper functioning of the GDC is crucial for reliable operation of the CDMS, which makes replication of this data vital for the entire system.
The responsibility of the Replication Module is to perform data replication at the request of the CDMS Core. It currently allows for initial and automatic replication. The main task of the Optimization Module is determining the best data location from the available replicas. The application of this module decreases delays in data access and balances the load between Storage Elements. The Optimization Module uses statistical information such as network throughput and performance of Storage Elements, as well as measured values like current network load, system load, and available disk space on Storage Elements.
The Transport Subsystem has been introduced to increase the CDMS performance during data transfers. It consists of the Proxy Transport Agents (PTAs) and the Transport Agents (TAs). The PTA is responsible for transferring data between the user and the CDMS. It runs as a standalone process, accepting data transfer requests from the CDMS Core. Such a solution allows the CDMS Core to select the agent located closest (in networking terms) to the served user. The main task of a TA is transferring data between the Storage Element and Proxy Transport Agents. Data sent by the user to a PTA are directed to a suitable TA. The CDMS Core asks the Optimization Module for the suggested data locations, and then it requests the proper TA to perform the required operations. An important feature of the Transport Subsystem is parallel data transfer between the Proxy Agent and the Transport Agents. It enables data replication at the very moment the data enter the Data Management System.
2.2 CDMS Implementation
The CDMS has been implemented in the C language (the Optimization and Replication Modules excepted) using the gSOAP package [10]. Adequate data transmission security, X.509 certificate infrastructure and gsiftp protocol support have been achieved using Globus Toolkit 3.x libraries [8]. In the grid infrastructure based on the Globus Toolkit, users are identified using X.509 certificates, which have unique subjects. This fact is the foundation of the user namespaces introduced in the CDMS, which are named after the subject of a user certificate. This approach eliminates the possibility of collisions in file and directory names. Every user in the CDMS system has his own file system root (/) located in his namespace. The Uniform Resource Locator (URL) for the CDMS system is defined as follows: cdms://[user-namespace]/url-path.
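The URL scheme above can be illustrated with a short parsing sketch. It is not taken from the CDMS code base (which is written in C); the names CdmsLocation and parse_cdms_url are hypothetical, and the sketch assumes that the user namespace is an opaque token without slashes, so a namespace derived from a certificate subject would have to be encoded first.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class CdmsLocation(NamedTuple):
    namespace: str  # the [user-namespace] part of the URL
    path: str       # the url-path part, always starting with "/"


def parse_cdms_url(url: str) -> CdmsLocation:
    """Split cdms://[user-namespace]/url-path into its two components."""
    parts = urlsplit(url)
    if parts.scheme != "cdms":
        raise ValueError(f"not a cdms URL: {url!r}")
    if not parts.netloc:
        raise ValueError("missing user namespace")
    # An empty path refers to the root of the user's file system.
    return CdmsLocation(namespace=parts.netloc, path=parts.path or "/")


print(parse_cdms_url("cdms://alice/results/run1.dat"))
```

Since every user has a private root, two users can hold the same path (for example /results/run1.dat) without collision; only the namespace component differs.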
3 Motivation for CDMS2: Secure Data Management System Using Cell BE
Providing data security becomes a key issue when designing a modern data management system. Unlike local file systems, the CDMS is a distributed data management system, where data are spread among Storage Elements (SEs) in the grid environment. These SEs are distributed among different locations (organizations), which feature different levels of security. Taking control over a single SE could lead to compromising the whole system and to the loss of some sensitive data. Consequently, providing an adequate level of such security services [14] as confidentiality and integrity, as well as authentication and authorization, is considered to be the priority when designing the architecture of CDMS2.
Another key aspect of security services is availability. Failures in server hardware or the computer network are very difficult to prevent. A system that embeds strong cryptographic techniques but does not ensure availability, backup and recovery is of little use [14]. Typically, systems are made fault tolerant by replicating data or entities that are considered a single point of failure [11]. However, replication could incur a high cost of maintaining consistency between replicas, and be space inefficient. A more efficient approach, which is proposed to provide increased data availability in CDMS2, is based on erasure codes [17]. This approach provides redundancy without the overhead of strict replication. Erasure codes divide an object into k fragments and recode them into n fragments, where n ≥ k. A code with the encoding rate r = n/k increases the storage cost by a factor of r. The key property of erasure codes is that the original object can be restored from any k (or slightly more) fragments. To recover the data without noticeable delays, high-performance computing platforms have to be adopted.
3.1 Security Architecture of the Cell Broadband Engine
As computers and other electronic devices become more connected, platform security becomes increasingly important for everyone. Within this context, the Cell BE offers a processor security architecture that provides [3] a robust foundation for the security platform. Instead of relying on software-implemented approaches, the Cell BE architecture is designed with the goal of providing hardware-based security features that are intrinsically less vulnerable than software solutions. The advantage of such an approach is that even if the operating system or other supervisory software is compromised by an attack, the hardware security features can still protect the application and its important data. To achieve this, the security architecture of the Cell BE offers [20] four basic features:
1. Using the hardware Secure Processing Vault feature, an application can execute in isolation from the rest of the software environment.
2. With the Runtime Secure Boot feature, an application can check itself before it is executed to verify that it has not been modified and compromised. Secure Boot is normally done only at power-on time, but the Cell BE processor can Secure Boot an application thread multiple times at runtime.
3. With the Hardware Root of Secrecy feature, access to secrets such as various keys can be protected, based on the robustness of hardware.
4. The hardware random number generator is important for efficient implementation of some cryptographic algorithms [6].
The impressive computational power of the Cell Broadband Engine [4], coupled with its advanced security architecture, makes it a promising platform for implementing a set of measures which are necessary to fulfill the security and availability requirements of modern grid data management systems.
Fig. 2. Architecture of Secure CDMS Broker
4 CDMS2 Versus CDMS: New Features and Functionality
While preserving the basic assumptions and functionality of the CDMS, the architecture of CDMS2 has been considerably modified to support new features and functionality. Fig. 2 and Fig. 3 show the architectures of the Data Broker and Storage Element, respectively.
4.1 CDMS2 File System
The important difference between the first and second versions of CDMS is the metadata storage method. Instead of a relational database, a dedicated filesystem, cdmsfs, is proposed to improve the security and fault tolerance of the whole system. This solution also allows us to group the components and mechanisms responsible for performing replications, processing statistics, and improving security and reliability. Functionally, the cdmsfs can be divided into the following components:
– FUSE Interface
– Cryptographic Layer
– Statistic Mechanism
– Replication Mechanism
– Metadata Container
– Data Container
Fig. 3. The architecture of Secure CDMS Storage Element
The implementation of the cdmsfs is based on FUSE (Filesystem in Userspace) [7]. This solution allows a fully functional filesystem to be implemented without the need to introduce changes in the kernel of the operating system. In addition, it opens the possibility of utilizing a wide spectrum of utility libraries which are not available on the kernel side. Further advantages of using FUSE are better portability among different versions of the Linux kernel, as well as the possibility of porting the developed software to other operating systems.
4.2 Statistic and Replication Mechanisms
Unlike the previous version, the Data Broker of the CDMS2 does not contain an independent statistic module. This functionality is now embedded into the cdmsfs. Each I/O operation results in updating statistical information, which is placed directly in the metadata. This solution eliminates the dependence on an external database, and allows for increased performance and reliability. As in the previous version, the decision about replicating the user’s data among SEs is taken by the external replication module. The novelty in the CDMS2 consists in introducing a replication mechanism into the cdmsfs. This mechanism is responsible for replicating data, metadata, and statistical information without the participation of additional services. In case of a Primary Data Broker failure [13], the Secondary Data Broker, which takes over the Primary Broker functions, holds a coherent image of the whole data management system.
4.3 Key Manager Module
As in the previous version, a user accesses the system resources via a WebService interface, using the SOAP protocol and a communication channel secured by SSL. When a user connects to the system for the first time, a separate namespace is created based on the user’s name contained in his certificate. Such a solution eliminates the possibility of collisions in file and directory names. Furthermore, a set of cryptographic keys (a so-called keyring) is generated to provide data encryption/decryption. The Key Manager Module is responsible for the storage and management of cryptographic keys in the CDMS2. In order to provide a high level of security, the keyring of each user is stored encrypted in the Key Manager Module, using the user’s private key. Before performing a Read or Write operation, the user has to decrypt the corresponding keyring, using his key.
4.4 Cryptographic Layer
The Cryptographic Layer is responsible for providing data security. Before being stored in the filesystem, the data and metadata are encrypted using symmetric cryptography. The encryption keys are stored in the Key Manager Module in an encrypted form. The scenarios for dealing with cryptographic keys differ depending on whether the cdmsfs is a part of the Data Broker or of a Storage Element.
In the case of the Data Broker, the Key Manager Module contains cryptographic keys for each Storage Element, keys of all users, and system keys for the Data Broker. When starting the Data Broker, the Key Manager gets the primary key from an external device (e.g., a USB flash drive or a smart card reader) or from a file. This key is used to decrypt both the keys for the available Storage Elements and the key for the cdmsfs. The decrypted key is transferred to the Cryptographic Layer via FUSE, to mount the cdmsfs correctly. Such an approach allows for secure storage of cryptographic keys for each component of the system.
While starting, the first activity performed by a Storage Element is its registration in the Data Broker. During this activity the Storage Element receives the cryptographic keys which are used to mount the cdmsfs correctly. To prevent access to data by non-authorized users, the keys used for encryption are stored in a RAM area; after the Storage Element is disconnected from the network, this area is cleared. When the network connection is recovered, the registration of the Storage Element in the Broker is repeated automatically, allowing the keys to be received from the Broker again. The use of such a mechanism makes it possible to separate the keys, which are stored in the Broker, from the data, as well as to decrease the probability of data being decrypted by a third party.
4.5 Using Erasure Codes for Increased Data Availability
A classical concept of building fault-tolerant systems consists in replicating data on several servers. Erasure codes can improve the availability of distributed storage by splitting up the data into fragments, encoding them redundantly and distributing the fragments over various servers. As was shown in [22], systems employing erasure codes have a mean time to failure many orders of magnitude higher than replicated systems with similar storage and bandwidth requirements.
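The storage side of this comparison can be made concrete with the factor r = n/k from Section 3. The sketch below is only an illustration with hypothetical function names; it assumes an ideal code in which any k of the n fragments suffice, and it treats plain replication with c copies as the special case k = 1, n = c.

```python
from fractions import Fraction


def storage_factor(k: int, n: int) -> Fraction:
    """Storage cost relative to the original object, r = n/k."""
    if not 1 <= k <= n:
        raise ValueError("require 1 <= k <= n")
    return Fraction(n, k)


def tolerated_losses(k: int, n: int) -> int:
    """Number of fragments that may be lost, assuming any k fragments suffice."""
    if not 1 <= k <= n:
        raise ValueError("require 1 <= k <= n")
    return n - k


# Three full replicas versus a (k=4, n=6) erasure code.
for label, (k, n) in {"3 replicas": (1, 3), "k=4, n=6": (4, 6)}.items():
    print(label, float(storage_factor(k, n)), tolerated_losses(k, n))
```

Under these assumptions both configurations survive the loss of two fragments, but the erasure-coded one stores 1.5 times the object size instead of 3 times.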
More precisely, an erasure code works in the following way. A file F of size |F| is partitioned into k fragments. Then these fragments are encoded into n fragments F1, F2, ..., Fn, each of size |F|/k, where n ≥ k. Encoded fragments are stored on different nodes to reduce failure correlations among fragments. Later on, the original file can be restored from any k (or slightly more) of the encoded fragments. This operation should be performed on the fly in such a way that partial failures of Storage Elements are not visible to the end-user. There are many ways erasure codes can be generated. A standard way is the use of Reed-Solomon (RSE) codes [17]. Unfortunately, RSE codes are intrinsically limited by the size of the Galois field they use. This constrains the size of the stripes necessary to distribute data blocks among multiple storage elements. In this respect, more interesting variants of erasure codes are Digital Fountain codes [16] and large block FEC (Forward Error Correction) codes [18]. The choice of the code to be used in the CDMS2 will be the subject of further investigations.
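The encode and restore steps can be illustrated with the simplest possible erasure code, a single XOR parity fragment (n = k + 1). This is a sketch for illustration only: it is neither a Reed-Solomon nor a fountain code and not the scheme selected for the CDMS2, and the function names are hypothetical. It does show the defining property on a small scale, since the file is recovered from any k of the n fragments.

```python
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally long byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal fragments (zero-padded) plus one XOR parity fragment."""
    if k < 1:
        raise ValueError("k must be positive")
    size = max(1, -(-len(data) // k))  # ceil(len(data) / k), at least one byte
    padded = data.ljust(size * k, b"\0")
    fragments = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = bytes(size)
    for frag in fragments:
        parity = xor_bytes(parity, frag)
    return fragments + [parity]


def decode(fragments: List[Optional[bytes]], length: int) -> bytes:
    """Restore the data from n = k + 1 fragments, at most one of which is None."""
    missing = [i for i, frag in enumerate(fragments) if frag is None]
    if len(missing) > 1:
        raise ValueError("a single parity fragment can repair only one loss")
    frags = list(fragments)
    if missing:
        present = [frag for frag in frags if frag is not None]
        rebuilt = bytes(len(present[0]))
        for frag in present:
            rebuilt = xor_bytes(rebuilt, frag)  # XOR of the rest equals the lost one
        frags[missing[0]] = rebuilt
    # Drop the parity fragment and the zero padding.
    return b"".join(frags[:-1])[:length]


original = b"grid data!"
stored = encode(original, k=3)  # 4 fragments, e.g. one per Storage Element
stored[1] = None                # simulate the failure of one Storage Element
print(decode(stored, len(original)))
```

Each fragment has size ceil(|F|/k), so the storage cost here is (k + 1)/k. Tolerating more than one lost fragment requires a stronger code such as those discussed above.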
References
1. Allcock, B., et al.: Data Management and Transfer in High Performance Computational Grid Environments. Parallel Computing Journal 28(5), 749–771 (2002)
2. Blanchet, C., Mollon, R., Deleage, G.: Building an Encrypted File System on the EGEE Grid: Application to Protein Sequence Analysis. In: Proc. First Int. Conf. on Availability, Reliability and Security (ARES 2006), IEEE Computer Society, Los Alamitos (2006)
3. Cell BE: Security SDK 3.0 Installation and User’s Guide, http://www.ibm.com/developerworks/power/cell
4. Costigan, S.: Accelerating SSL using the vector processors in IBM Cell Broadband Engine for Sony Playstation 3. In: SPEED: Software Performance Enhancement for Encryption and Decryption, Amsterdam, pp. 65–76 (2007)
5. ClusteriX Project Home Page, http://clusterix.pcz.pl
6. Denis, T.S., Johnson, S.: Cryptography for Developers. Syngress Publishing, Rockland (2007)
7. FUSE: Filesystem in Userspace, http://fuse.sourceforge.net/
8. GLOBUS Project Home Page, http://www.globus.org
9. Gschwind, M., et al.: Synergistic Processing in Cell’s Multicore Architecture. IEEE Micro. 2, 10–24 (2006)
10. gSOAP Project Home Page, http://www.cs.fsu.edu/~engelen/soap.html
11. Karczewski, K., Kuczynski, L., Wyrzykowski, R.: Secure Data Transfer and Replication Mechanisms in Grid Environments. In: Proc. Cracow Grid Workshop - CGW 2003, Cracow, pp. 190–196 (2003)
12. Karczewski, K., Kuczynski, L., Wyrzykowski, R.: CDMS - ClusteriX Data Management System. In: Proc. Cracow Grid Workshop - CGW 2004, Cracow, pp. 241–247 (2004)
13. Karczewski, K., Kuczynski, L.: ClusteriX Data Management System (CDMS) Architecture and Use Cases. In: Talia, D., Bilas, A., Dikaiakos, M.D. (eds.) Knowledge and Data Management in Grids, pp. 99–115. Springer, Heidelberg (2006)
14. Kher, V., Kim, Y.: Securing Distributed Storage: Challenges, Techniques, and Systems. In: Proc. 2005 ACM workshop on Storage Security and Survivability, Fairfax, VA, USA, pp. 9–25 (2005)
15. Kuczynski, L., Karczewski, K., Wyrzykowski, R.: ClusteriX Data Management System and Its Integration with Applications. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 240–248. Springer, Heidelberg (2006)
16. MacKay, D.J.C.: Fountain codes. IEE Proc.-Commun. 152(6), 1062–1068 (2005)
17. Rizzo, L.: Effective Erasure Codes for Reliable Computer Communication Protocols. ACM Computer Comm. Review 27(2), 24–36 (1997)
18. Roca, V., Neumann, C.: Design, Evaluation and Comparison of Four Large Block FEC Codes, LDPC, LDGM, LDGM Staircase and LDGM Triangle, plus a Reed-Solomon Small Block FEC Codes. Tech. Rep. No.5225, INRIA Rhone-Alpes (2004)
19. SDSC Storage Resource Broker, http://www.sdsc.edu/srb/
20. Shimizu, K.: The Cell Broadband Engine processor security architecture, http://www-128.ibm.com/developerworks/power/library/pa-cellsecurity/
21. Slota, R., Skital, L., Nikolow, D., Kitowski, J.: Algorithms for Automatic Data Replication in Grid Environment. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 707–714. Springer, Heidelberg (2006)
22. Weatherspoon, H., Kubiatowicz, J.D.: Erasure Coding vs. Replication: A Quantitative Comparison. In: Druschel, P., Kaashoek, M.F., Rowstron, A. (eds.) IPTPS 2002. LNCS, vol. 2429, Springer, Heidelberg (2002)
23. Wyrzykowski, R., et al.: Meta-computations on the CLUSTERIX Grid. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 489–500. Springer, Heidelberg (2007)
Ontology Alignment for Contract Based Virtual Organizations Negotiation and Operation
Joanna Zieba1, Bartosz Kryza2, Renata Slota1, Lukasz Dutka2, and Jacek Kitowski1,2
1 Institute of Computer Science, AGH-UST, Cracow, Poland {rena, kito}@icsr.agh.edu.pl
2 Academic Computer Centre CYFRONET-AGH, Cracow, Poland {bkryza, dutka}@icsr.agh.edu.pl
Abstract. The cooperation of diverse partners carried out in Grid Virtual Organizations (VOs) requires clear specification of each party’s capabilities and obligations. The negotiation and enactment of contracts, underpinned by Service Level Agreements (SLAs), is essential for establishing a VO. This paper describes how semantic technologies can simplify this task by employing ontology alignment methods. The application of these techniques is placed in the context of the European Union’s Gredia Project and its journalism-related pilot application.
1
Introduction
Virtual Organizations play a key role in the Grid computing paradigm. They are dynamic collections of individuals, institutions and resources [1], governed by rules that clearly define how the members’ resources should be shared so that they are used in the most effective and sufficiently secure way. Shared resources can be very diverse: computational cycles, databases, software components, scientific instruments or network capacity. Collaboration inside the Virtual Organization, dynamic as it is, must nevertheless conform to the organizational policies of the VO participants. Different organizations may have different rules regarding resource access, sharing and network security. To ensure that all such internal requirements are fulfilled, a contract has to be signed. The contract states how each stakeholder’s resources can be exploited, what quality of service they provide and how resource usage should be accounted for. The technical specification of the quality of service is expressed in the Service Level Agreement (SLA).
A Virtual Organization does not keep the same members throughout its lifecycle. Participants may join and leave a VO dynamically; in most cases, the approval of a VO administrator is necessary for a new organization to become a member. Registration in a Virtual Organization requires the new member to specify the resources it can share with others and the conditions under which they can be shared.
Semantic technologies, which play an increasingly important role in the Grid landscape (for publications on the Semantic Grid community efforts, see [2]), can be helpful in the resource description and contract negotiation activities. The purpose of this paper is to present the semantic heterogeneity problem arising in the novel approach to Virtual Organizations based on contract negotiation and realization, together with a possible solution based on ontology alignment methods.
The paper is organized as follows: Section 2 describes the current state of the art in the area of Virtual Organizations’ contracts and ontology alignment. In Section 3, we present the context of the Gredia project pilot applications, which strongly influences the requirements for the proposed Virtual Organization framework. Section 4 focuses on ontology alignment, which can be used to reconcile different semantic descriptions of resources shared in the VO. The paper is summarized and conclusions are drawn in Section 5.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 835–842, 2008. © Springer-Verlag Berlin Heidelberg 2008
2
Contracts for Virtual Organizations
Automating contract negotiation is one of the research issues pursued by multiple European Grid projects. The OntoGrid project takes an agent-based approach, bringing Grid and agent technologies together in order to enable efficient provisioning and management of services [3]. In the OntoGrid framework, Rubinstein’s alternating offers protocol is adapted for web services negotiation; XML schemas expressing offers and actions (e.g. agree, reject) have been developed, together with a set of negotiation strategies. NextGrid project publications (e.g. [4]) point out the importance of Service Level Agreement negotiation for e-business and foresee the adoption of Semantic Web and agent technologies in this area. Currently, they propose a brokered approach, where the negotiation service is outsourced to a third-party intermediary, yet without any semantic support. Finally, BREIN (Business objective driven REliable and Intelligent grids for real busiNess) [5], one of the newly established 6th Framework Programme projects, aims to develop a business-oriented, user-centric Grid framework, which also covers the issue of contract negotiation. To sum up, all the mentioned Grid projects acknowledge the importance of Service Level Agreements and view semantic technologies as a means of automating SLA development and negotiation.
The described Grid frameworks rely upon specifications from the Service Oriented Architecture/Web Services community, namely WS-Negotiation [6] and WS-Agreement [7]. The semantics of a contract and its negotiation should be machine-understandable, and this is where ontologies come into play. Enhancements to WS-Agreement, which make extensive use of ontology engineering and reasoning to automate SLA establishment, are presented in [8]. Also, several Quality of Service ontologies have been proposed lately, e.g. [9], to facilitate precise specification of a contract between negotiating parties.
There is also an initiative to create a unified QoS ontology (under the name of OWL-QoS, formerly DAML-QoS) and to submit it to the World Wide Web Consortium (W3C) for standardization [10].
Since the ontologies used by various organizations will usually be heterogeneous, some of their concepts should be aligned so that the contract can refer to resources in every organization using concepts understandable for the Virtual Organization middleware [12], [13]. Zhuge [14] mentions multiple kinds of semantic links between semantic nodes (concepts, instances, or even whole ontologies); here the most relevant ones are: similar-to, equal-to, part-of and subtype. These relations between concepts in two different ontologies can be discovered by means of ontology alignment methods. Ontology alignment uses various similarity algorithms to determine which pairs of concepts correspond to each other (either as equivalent or as similar ones), thus forming a mapping [15]. Existing similarity measures can be divided into several groups [16]: lexical, structural, instance-based and others. Ontology alignment methods often use multiple measures at a time, to take into account more characteristics of the compared ontology entities. There is a distinction between local and global ontology alignment: the local approach analyzes all the ontology concepts and relations separately during the search for their equivalents in the other ontology, while the global one handles the ontology graph as a whole, which leads to a greedy algorithm, e.g. Similarity Flooding. The local approach is often used in existing ontology alignment tools such as COMA++ [18], OLA [19] and Cupid [20].
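The local, lexical flavour of alignment described above can be illustrated with a minimal sketch. It is not the method of any of the cited tools: the similarity measure (the standard library's `difflib.SequenceMatcher`), the threshold of 0.5 and all concept names are assumptions chosen only for illustration.

```python
from difflib import SequenceMatcher


def lexical_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align(source, target, threshold=0.5):
    """Local alignment: every source concept is compared independently
    with all target concepts; the best match at or above the threshold
    is recorded together with the kind of semantic link."""
    mapping = {}
    for s in source:
        best = max(target, key=lambda t: lexical_similarity(s, t))
        score = lexical_similarity(s, best)
        if score >= threshold:
            relation = "equal-to" if score == 1.0 else "similar-to"
            mapping[s] = (best, relation, round(score, 2))
    return mapping


# Hypothetical concept names of two organizations (illustration only).
org_a = ["DataFile", "ComputeService", "Printer"]
org_b = ["File", "Service", "StorageElement"]

for concept, match in align(org_a, org_b).items():
    print(concept, "->", match)
```

A realistic aligner would combine this lexical score with structural and instance-based measures, as noted above; the sketch only shows how per-concept comparisons turn into a mapping.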
3
Virtual Organizations in Gredia Project
The EU GREDIA project [21] aims at the development of a Grid application platform, providing high-level support for the implementation of Grid business applications through a flexible graphical user interface. This platform is generic in order to combine both existing and arising Grid middleware, and to facilitate the provision of business services, which mainly demand access to and sharing of large quantities of distributed annotated numerical and multimedia content. Furthermore, GREDIA enables mobile devices to exploit Grid technologies seamlessly by providing mobile access to distributed annotated numerical and multimedia content. GREDIA also focuses on supporting Virtual Organizations in handling the complexity of allocating heterogeneous resources in the Grid. A significant contribution of GREDIA is to hide the complexity of heterogeneous resources and to support dynamic Virtual Organizations of business entities through a special semantic framework called the Grid Virtual Organization Semantic Framework (GVOSF), which provides a unified access layer to all elements of the Virtual Organization. Figure 1 presents an example of a Virtual Organization managed by the GVOSF framework. Each organization participating in this Virtual Organization publishes its resources through an instance of the GVOSF component; the instances are connected by a P2P overlay. The GVOSF framework is based on the Grid Organizational Memory knowledge base [23].
Fig. 1. Sample Virtual Organization managed by GVOSF framework
The potential effects of the platform are being validated through two pilot applications, servicing the vital sectors of media (news) and banking. The goal of the first application is the provision of a Gredia Grid Online service for freelance journalists, where they share their articles and media files and collaborate on the preparation of articles for publishing. Here, the VO comprises media companies, content brokers, news agencies and freelance journalists. The distinctive characteristic of this application is that it requires Virtual Organizations to be created dynamically on demand and to support hierarchical Virtual Organizations created for short-term activities, such as writing an article. In particular, the freelance journalists will be able to find content and share it from their mobile devices with other members of the Virtual Organization immediately after the content is produced.
The EasyLoan application is related to the banking sector, where several organizations cooperate in order to provide bank clients with fast and sensible credit assessments. The Virtual Organization consists of such organizations as banks, interbanking systems, credit risk departments and others. In this application, the Virtual Organizations have a rather static nature and very strict security constraints. Bank clients themselves are not considered members of the VO, since all their requests submitted to the bank portal are processed by assigned bank clerks.
4
Applying Ontology Alignment in Virtual Organization Life-Cycle
Because different organizations participating in a Virtual Organization have different conceptualizations of their resources, ontology mapping should be applied during various phases of the VO life-cycle.
4.1
VO Inception
The inception of a Virtual Organization involves several actions, among them the contract negotiation process between the participating organizations. Each organization provides information about its resources in the form of ontologies which describe the resources. However, in the real world different organizations use different vocabularies and ontologies to define the resources they want to share within the Virtual Organization. Let us consider an example of two organizations willing to create a Virtual Organization. Each of them publishes some information about its resources for the participants of the negotiation process (see Fig. 2).
Fig. 2. Matching concepts from ontologies of the two organizations negotiating the contract
Organization_A uses concepts from the Common Information Model ontology, while Organization_B uses generic ontologies developed within the EU-IST K-Wf Grid project (see [22]). For Organization_A and Organization_B to reach agreement in the form of a contract, a mapping between the ontologies describing their resources is essential. Fig. 2 shows how such a mapping allows the contract definition to refer to various resources in each organization and to disambiguate concepts from each of the ontologies. Each statement in the contract must use generic concepts such as Resource, Commitment or Actor. These concepts must then be instantiated with individuals from the ontologies of the organizations, which describe particular resources, e.g. Service_1. Since the concepts used in each of the organizations to define a service are different, only ontology mapping can remove the ambiguity between the two conceptualizations.
4.2
VO Deployment
During the deployment phase of the Virtual Organization life-cycle, all the abstract statements from the contract must be used to configure the underlying VO middleware. This includes the security infrastructure, low-level Virtual Organization management systems such as VOMS, and the monitoring middleware that enforces the QoS defined in the contract through the Service Level Agreements. Ontology mapping is used in this phase to translate concepts from the contract into the appropriate configuration of the Grid layer, e.g. ACL or SAML statements.
4.3
VO Operation
In this phase, the processes of the Virtual Organization take place in order to realize its goal. Mapping in this phase is essential mostly for resource and service discovery as well as for workflow composition. A member of one organization, which uses one ontology to describe its resources, may need to find a resource provided by another organization with a different conceptualization of the domain. Ontology mapping gives such users transparent access to many more resources than they would otherwise be able to discover. In the example shown in Fig. 3, a user belongs to Organization_A, whose underlying conceptualization of resources is an ontology based on the Common Information Model, so the main concept for a file is CIM_DataFile. The user wants to find a file in the Virtual Organization which contains a picture from the World Press Photo contest in PNG format (for simplicity, other parameters are omitted). Such a file is available in Organization_B; however, a direct query will always fail, because the concept used in Organization_B to refer to files is simply File. The underlying middleware must therefore map CIM_DataFile of Organization_A to the File concept of Organization_B; only then can it return to the user a reference to the requested data.
Fig. 3. Example of a semantic query which finds resources from different organizations
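The discovery scenario above can be sketched as a small query-rewriting step. This is only an illustration under stated assumptions: the catalogue layout, the `find` and `translate` helpers and the file names are hypothetical; only the concept names CIM_DataFile and File and the two organization names come from the text.

```python
# Mapping produced by ontology alignment. Only CIM_DataFile and File are
# taken from the text; the data layout itself is a hypothetical choice.
CONCEPT_MAP = {
    ("Organization_A", "CIM_DataFile"): {"Organization_B": "File"},
}

# Toy resource catalogues, one per organization (hypothetical entries).
CATALOGUES = {
    "Organization_A": [
        {"concept": "CIM_DataFile", "name": "report.txt", "format": "TXT"},
    ],
    "Organization_B": [
        {"concept": "File", "name": "wpp_winner.png", "format": "PNG"},
        {"concept": "File", "name": "notes.txt", "format": "TXT"},
    ],
}


def translate(concept, source_org, target_org):
    """Return the target organization's name for a concept, or None."""
    if source_org == target_org:
        return concept
    return CONCEPT_MAP.get((source_org, concept), {}).get(target_org)


def find(concept, source_org, use_mapping=True, **constraints):
    """Search every catalogue for resources of the requested concept."""
    hits = []
    for org, resources in CATALOGUES.items():
        # Without mapping, the requester's own concept name is used everywhere.
        local = translate(concept, source_org, org) if use_mapping else concept
        for res in resources:
            if res["concept"] == local and all(
                res.get(key) == value for key, value in constraints.items()
            ):
                hits.append((org, res["name"]))
    return hits


direct = find("CIM_DataFile", "Organization_A", use_mapping=False, format="PNG")
mapped = find("CIM_DataFile", "Organization_A", format="PNG")
print("direct query:", direct)
print("mapped query:", mapped)
```

The direct query finds nothing because no resource in Organization_B is typed as CIM_DataFile, while the mapped query is rewritten to File for that organization and returns the PNG file.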
5
Summary and Conclusions
In this paper, we have proposed an application of semantic technologies to problems related to contract negotiation and enactment in Grid Virtual Organizations. We have described how ontology alignment can help to configure a VO’s resources according to the contract. We have also presented current work in this domain of Grid and VO research, together with the specific requirements of the Gredia project media pilot application. As proposed in this paper, ontology engineering can bring substantial improvements to the Virtual Organizations field. Various activities in the Virtual Organization’s lifecycle, such as discovery of VO partners, contract negotiation or contract realization, may be fully or partially automated, thus making VO administration simpler and faster. This is particularly important when partnerships are very dynamic, as in the described scenario of media workers collaborating on an article. Eventually, introducing such semantic-based enhancements to classical Grid Virtual Organizations can result in their wider adoption in both research and business organizations.
Acknowledgements
The authors want to acknowledge the support of the EU Gredia Project (IST-FP6-034363) and AGH University of Science and Technology.
References
1. Foster, I., Kesselman, C., Tuecke, S.: The anatomy of the grid: Enabling scalable virtual organizations. International Journal of Supercomputer Applications 15(3) (2001)
2. Semantic Grid Community Portal, http://www.semanticgrid.org/
3. Wooldridge, M., Paurobally, S., Tamma, V.: Cooperation and Agreement between Semantic Web Services. In: W3C Workshop on Frameworks for Semantics in Web Services
4. Hasselmeyer, P., Qu, C., Schubert, L., Koller, B., Wieder, P.: Towards Autonomous Brokered SLA Negotiation. In: Proceedings of eChallenges e-2006 Conference, Barcelona (October 2006)
5. BREIN project website, http://www.eu-brein.com/
6. Hung, P., Li, H., Jeng, J.: WS-Negotiation: An Overview of Research Issues. In: Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS 2004) - Track 1, IEEE Computer Society, Washington (2004)
7. Andrieux, A., Czajkowski, K., Dan, A., Keahey, K., Ludwig, H., Nakata, T., Pruyne, J., Rofrano, J., Tuecke, S., Xu, M.: Web Services Agreement Specification (WS-Agreement), https://forge.gridforum.org/projects/graap-wg/
8. Oldham, N., Verma, K., Sheth, A., Hakimpour, F.: Semantic WS-Agreement Partner Selection. In: Proceedings of the 15th International Conference on World Wide Web (Edinburgh, Scotland, May 23 - 26, 2006). WWW 2006, pp. 697–706. ACM Press, New York (2006)
9. Dobson, G., Sanchez-Macian, A.: Towards Unified QoS/SLA Ontologies. In: Proceedings of the IEEE Services Computing Workshop (SCW 2006), August 2006, pp. 169–174 (2006)
10. Zhou, C., Chia, L.-T., Lee, B.-S.: Web services discovery with DAML-QoS ontology. International Journal of Web Services Research (JWSR) 2(2), 44–67 (2005)
11. Bouquet, P., Euzenat, J., Franconi, E., Serafini, L., Stamou, G., Tessaris, S.: Specification of a common framework for characterizing alignment. Deliverable 2.2.1, IST Knowledge web NoE, Knowledge web NoE, p. 21 (June 2004)
12. Slota, R., Majewska, M., Kitowski, J., Lambert, S., Laclavik, M., Hluchy, L., Viano, G.: Evaluation of Experience-based Support for Organizational Employees. In: Lu, J., Ruan, D., Zhang, G. (eds.) E-Service Intelligence. Methodologies, Technologies and Applications, Series: Studies in Computational Intelligence, vol. 37, pp. 411–434. Springer, Heidelberg (2007)
13. Laclavik, M., Babik, M., Balogh, Z., Hluchy, L.: AgentOWL: Semantic Knowledge Model and Agent Architecture. Computing and Informatics 25(5), 419–437 (2006)
14. Zhuge, H.: Autonomous Semantic Link Networking Model for the Knowledge Grid, Technical Report KGRC-2006-1 (January 2006), http://www.knowledgegrid.net/TR
15. Ehrig, M., Sure, Y.: FOAM - Framework for Ontology Alignment and Mapping. In: Proceedings of the Workshop on Integrating Ontologies, vol. 156, pp. 72–76. CEUR-WS.org (October 2005)
16. Euzenat, J., et al.: State of the art on ontology alignment. Deliverable 2.2.3, IST Knowledge web NoE, Knowledge web NoE, p. 79 (August 2004)
17. Slota, R., Zieba, J., Kryza, B., Kitowski, J.: Knowledge Evolution Supporting Automatic Workflow Composition. In: Proceedings of the 2nd IEEE Conference on e-Science and Grid Computing, Amsterdam (2006)
18. Aumuller, D., Do, H., Massmann, S., Rahm, E.: Schema and Ontology Matching with COMA++. In: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pp. 906–908 (2005)
19. OLA (OWL Lite Alignment) website, http://www.iro.umontreal.ca/owlola/index.html
20. Madhavan, J., Bernstein, P.A., Rahm, E.: Generic Schema Matching with Cupid. In: Proceedings of the 27th International Conference on Very Large Data Bases, Rome, Italy, pp. 49–58 (2001)
21. Gredia project website, http://www.gredia.eu
22. Grid Organizational Memory website, http://gom.kwfgrid.net
23. Kryza, B., Slota, R., Majewska, M., Pieczykolan, J., Kitowski, J.: Grid organizational memory provision of a high-level Grid abstraction layer supported by ontology alignment. Future Generation Computer Systems, Grid Computing: Theory, Methods & Applications 23(3), 348–358 (2007)
On Service-Oriented Symbolic Computing
Alexandru Cârstea1, Marc Frîncu1, Alexander Konovalov2, Georgiana Macariu1, and Dana Petcu3
1 Institute e-Austria Timişoara, România [email protected]
2 University of St Andrews, St Andrews, Scotland [email protected]
3 Western University of Timişoara, România [email protected]
Abstract. Exposing computer algebra systems as computing services allows their further development and integration in complex service-oriented architectures. While existing standards may be used for service deployment and interaction, the particularities of the services to be built require more specialized solutions. We present a recently implemented technical approach that aims to integrate legacy computational algebra software into modern service-oriented and Grid architectures.
Keywords: service-oriented architecture, symbolic computing, wrappers for legacy software.
1
Introduction
Symbolic computation is one of the computationally demanding fields that can benefit from the capabilities offered by Grid-based infrastructures. Recent tools and software components for Grid-based infrastructures, among them the latest Globus Toolkit [1], rely on service-oriented architecture concepts. A Grid service (called simply a service in what follows) is a Web service compliant with the WSRF standard [2].
Computer Algebra Systems (CASs) are currently the main tools for symbolic computation. Usually, CASs are designed to be used in the isolated context of the local machine on which they are installed. Their ability to interact with the external world is mostly limited to a command line user interface, to file-based I/O operations and, rarely, to communication over TCP/IP sockets. These facts entitle us to include CASs in the category of legacy software systems.
The problem of integrating legacy software into modern distributed systems has two obvious solutions [3,4]: reengineering the legacy software and creating wrapper components. In this paper we consider the second approach. General wrapper tools make it possible to expose the functionality of legacy components; in this category we can include Soaplab [5] and JACAW [6]. CAS functionality is exposed by several special-purpose tools: GENSS [7], SCORUM [8], GridSolve [9], MathLink [10], MathGridLink [11], Geodise [12], Nimrod/G [13] and JavaMath [14]. With these tools several CASs can be accessed remotely but, unfortunately, the majority of CASs are still not remotely available. Additionally, almost none of the tools mentioned above uses a standard data model for interactions with CASs.
The aim of the EU Framework VI SCIEnce project (Symbolic Computation Infrastructure for Europe, http://www.symbolic-computation.org) is to improve integration between CAS developers and application experts. The project includes developers from four major CASs: GAP [15], Maple [16], MuPAD [17] and KANT [18]. The research is primarily concerned with parallel, distributed, Web and Grid-based symbolic computations. A framework for symbolic computation on the Grid was designed and presented in [19]. It allows multiple invocations of symbolic computing applications to interact via the Grid and it is designed to support the specific needs of symbolic computation, including computational steering, complex data structures, and domain-specific computational patterns.
In the context of the SymGrid implementation, the present paper explores the available possibilities for invoking legacy CAS software through services and introduces the CAS Server architecture. The main functionality of a CAS Server is to enable virtually any CAS to have its own functions remotely accessible. Several CASs can be exposed through the same CAS Server at the same time.
Section 2 gives an overview of the CAS Server architecture. The integration of legacy software in service-oriented architectures must consider three major issues: data encoding, exposing technology, and wrapping the legacy system. Sections 3, 4 and 5 deal with these issues. The proposed architecture was implemented and a specific tool has been built. The tool is implemented using the Java 5 SDK [20] with supporting tools and software from Globus Toolkit, Apache Axis [21], Apache Tomcat 5 [22] and the RIACA OpenMath Java API [23]. The generic services implemented in the CAS Server architecture invoke CAS functions that are applied to OpenMath [24] objects. Additionally, the tool makes it possible to limit the number of functions exposed for a CAS and to query the CAS Server about the functions exposed. Section 6 covers implementation issues, while Section 7 gives some efficiency results. Finally, Section 8 draws some conclusions.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 843–851, 2008. © Springer-Verlag Berlin Heidelberg 2008
2
General Overview of the Architecture
The architecture that we propose aims to expose CAS functionality as services. The interaction between these services can lead to gains in computing efficiency when solving complex symbolic problems. Unfortunately, the different data encodings and algorithms used within distinct CASs to solve the same problem hinder the integration process. We assume that it is the client’s responsibility to choose the right functions needed for a computation, just as if the CAS were installed on the local machine. Alternatives such as predefined workflows or patterns for service composition are the subject of later developments (see [19] for a description of the patterns).
Fig. 1. CAS-wrapper service architecture
In order to fulfill the above aim, several issues must be addressed. The first and most important one is to find a means of exposing CAS functions so that they become available through service invocation. Other issues that must be taken care of are related to user assistance tools and security.
A significant number of existing CASs were not conceived to be remotely accessible, but most of them allow interaction with the external world through a command line. Only a few of them have special facilities such as socket connections or multi-threading. In general, redesigning these systems to allow modern communication with the outside world cannot be achieved easily. For these reasons, we have considered the wrapper approach for integrating CASs. Wrapping legacy software and exposing its functionality using service technologies involves the creation of a three-level architecture on the server side, as shown in Figure 1. The server has the role of receiving the calls from the client, resolving them using the underlying legacy software and returning the result of the computation to the client.
The architecture intends to expose functions implemented by several CASs in a secure manner. To achieve this goal, the simple wrapper architecture is augmented with several features. Setting up the architecture on a server machine requires that all necessary software tools and components are available on the machine, namely the service tools and the CASs. A simple tool should allow the administrator of the machine to register in a CAS registry the CAS functions that he wants to expose. Every function exposed by a CAS will have an entry in the registry. Thus, a function that does not appear in this registry is considered inaccessible to remote invocations (for example, the system function available in different CASs). The remote clients of the system may query the system in order to find out the available functions and their signatures.
The signatures of the functions should not differ in any way from the original signatures of the functions as provided by a CAS. The remote invocation of a function should be reduced to the invocation of a remote execute(CAS ID,call object), where CAS ID is the unique identifier of the CAS and call object represents an OpenMath object, as described in the following section. Additionally, several other related operations should be available: finding the list of the CASs that are installed on the CAS Server machine or finding the list of the available functions that were exposed.
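The registry idea can be sketched in a few lines. This is a hedged illustration only, not the CAS Server implementation: the class name `CasRegistry`, its methods and the registered signatures are hypothetical placeholders.

```python
class CasRegistry:
    """Registry of the CAS functions the administrator chose to expose.
    Class and method names are hypothetical, for illustration only."""

    def __init__(self):
        self._functions = {}  # CAS identifier -> {function name: signature}

    def register(self, cas_id, name, signature):
        """Expose one function of one CAS."""
        self._functions.setdefault(cas_id, {})[name] = signature

    def list_cas(self):
        """Identifiers of all CASs that expose at least one function."""
        return sorted(self._functions)

    def list_functions(self, cas_id):
        """Exposed function names and their signatures for one CAS."""
        return dict(self._functions.get(cas_id, {}))

    def is_exposed(self, cas_id, name):
        """Anything that was never registered is treated as inaccessible."""
        return name in self._functions.get(cas_id, {})


registry = CasRegistry()
# Illustrative entries; the signatures are written in the CAS's own style.
registry.register("GAP", "Factorial", "Factorial(n)")
registry.register("Maple", "ifactor", "ifactor(n)")

print(registry.list_cas())
print(registry.list_functions("GAP"))
print(registry.is_exposed("GAP", "Exec"))  # a system-level call, never registered
```

Keeping the registry as an allow-list means that dangerous functions are blocked by default rather than having to be enumerated.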
Fig. 2. A generic call (OpenMath object encoding procedure(Arg1, Arg2))
3
Parameter Encoding Level
One important aspect of the interaction between the client and the legacy system is the model of the data exchange. The data model used by the client must be mapped to the internal data model of the legacy software. Currently, the available CASs use a variety of encoding standards, from plain text to XML structured documents. Because of the benefits it brings to the representation and manipulation of data in a structured manner, XML was adopted as the natural choice for machine-to-machine interaction. Mathematical objects can be represented in XML by means of MathML and OpenMath. While the former is well suited for representing mathematical objects in Web browsers, the latter is more appropriate for describing mathematical objects with semantic context. Most of the interaction between CASs and the external world has been based on non-standard data representations meaningful only for a certain CAS, or even for a certain version of a CAS. Efforts to enable parsing of OpenMath objects are being conducted for several CASs. A standard encoding for CAS invocations is described in [25].
Since most of the CASs do not currently support OpenMath, we considered an intermediate approach for immediate implementation purposes. The best solution would have been to translate a function procedure(Arg1,Arg2) into a corresponding OpenMath object such as the one shown in Figure 2. The parser uses the information encapsulated in the OpenMath object to create the appropriate CAS command. Note that the internal OMOBJ objects in Figure 2 should be replaced with the actual OpenMath encoding of the corresponding mathematical objects, either atoms or compound objects, but without nested OMOBJ tags. For the sake of compatibility, since currently not all CASs are ready to parse OpenMath objects, it should be possible to encapsulate the generic representation of Arg1 and Arg2 using OMSTR atoms, provided that the CAS is able to convert the encapsulated string to a CAS object.
The greatest advantage of using a unique description language for mathematical objects is the simplified communication mechanism between applications.
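The string-encapsulation variant described above can be sketched as follows. The element names OMOBJ, OMA, OMS and OMSTR are standard OpenMath XML elements, but the content dictionary name `hypothetical_cd`, the helper names and the exact layout of the call object are assumptions; the sketch also omits the OpenMath XML namespace for brevity and is not the encoding defined in [25].

```python
import xml.etree.ElementTree as ET


def build_call(cd, procedure, *args):
    """Build OMOBJ > OMA > (OMS, OMSTR, OMSTR, ...) for a procedure call.
    Each argument is wrapped as a plain string in an OMSTR atom."""
    obj = ET.Element("OMOBJ")
    app = ET.SubElement(obj, "OMA")
    ET.SubElement(app, "OMS", {"cd": cd, "name": procedure})
    for arg in args:
        ET.SubElement(app, "OMSTR").text = str(arg)
    return obj


def to_cas_command(obj):
    """Turn the call object back into a plain-text command, as a wrapper
    would do before handing it to a CAS that cannot parse OpenMath."""
    app = obj.find("OMA")
    name = app.find("OMS").get("name")
    args = [node.text for node in app.findall("OMSTR")]
    return f"{name}({','.join(args)})"


# "hypothetical_cd" is a placeholder content dictionary name.
call = build_call("hypothetical_cd", "procedure", "Arg1", "Arg2")
xml_text = ET.tostring(call, encoding="unicode")
print(xml_text)
print(to_cas_command(ET.fromstring(xml_text)))
```

The round trip through `ET.fromstring` mimics what happens on the server: the call object arrives as XML text and is converted into the command string expected by the CAS.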
4
Exposing Technology
In a service-oriented architecture, loosely coupled systems can be combined in a modular way, and complicated tasks can be organized in workflows. Alternatives to Web services are technologies that permit the creation of remotely accessible objects; remote objects can be implemented, for example, using CORBA [26]. Still, these solutions are not as versatile and reliable, and they do not offer the independence needed to integrate with Grid technologies.
In what follows we argue that exposing the full functionality of the CASs is difficult due to the high number of functions that CASs implement. Another issue is security, since certain exposed functions could represent a security gap.
The first approach to exposing the functions of a CAS that one might consider is a one-to-one approach: for every function of a CAS, a corresponding operation of a service is implemented. The experience gained by constructing the CAGS tool [27] leads us to the conclusion that this approach is not feasible for a large number of operations. A service that exposes many operations makes dynamic creation and invocation of the service impossible. Additionally, the WSDL 2.0 [28] standard explicitly forbids operations with the same name within the same service definition, while in a CAS functions from different libraries can have the same name.
The second approach (also considered in the GENSS platform) is to implement a generic operation execute(function name, parameters). In this call, function name represents the name of the CAS function and parameters represents an encoding of the parameters of the function. In this case, a Java method with the name function name can be dynamically invoked using reflection mechanisms. Afterwards, this method has to invoke the function exposed by the CAS. This solution also has some drawbacks.
Deploying such services into a service container is not efficient and obtaining the list of the exposed functions and assuring access to them on a per user basis is not trivial. The solution that we have considered in this paper uses the second approach as a start point. We have created a registry mechanism that allows the administrator of the server to register CAS functions into a database. The general execution schema associated with this approach is composed from several steps: 1. The client invokes an execute(CAS ID,call object) operation on the service. 2. The server verifies that the CAS identified by CAS ID is available on the server and that the function encoded in the call object is available (for details regarding the call object parameter see Section 3). 3. The server returns a unique job identifier that identifies the job and starts the execution. 4. At a latter moment the client will use the job identifier to retrieve information about the status of the job and the results of the computation. As mentioned above, the interaction between the client and the server is carried out in an asynchronous manner. Additional functionality is available using this approach, such as: the list of the CASs exposed, the list of functions of a certain CAS that are exposed, the signature of functions, and so on.
5 Wrapper Interaction with the Legacy System
CASs were meant to run on a single machine, or sometimes on a cluster, in order to solve very specific computational tasks. Usually the interaction with these systems involves a command line interface. According to [29], software can be encapsulated at several levels: job level, transaction level, program level, module level and procedure level. The way that the wrapper interacts with the legacy software component depends on the native technology of the legacy component. The wrapper may use TCP/IP sockets, redirection of the I/O streams of a process, input and output files, or JNI encapsulation. As noted in [30], the system may use the encapsulated software as a black box, knowing only the input/output parameters, or it may have to use white box encapsulation at the lowest level possible.
The communication model we use depends on the CAS that we want to communicate with. As a rule, we communicate with the CASs by redirecting I/O streams in a program-level encapsulation style. For GAP and KANT we have used the RunManager component described in [27]. The interaction with Maple was implemented using the Java API that it provides. The WS-GRAM service offered by the Globus Toolkit makes it possible to run command-line batch jobs. As an alternative to the RunManager component, we implemented the interaction with the CAS by making appropriate calls to the WS-GRAM service. An efficiency comparison of the two approaches is presented in Section 7.
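Program-level encapsulation by I/O redirection can be illustrated as follows. The sketch is an assumption-laden stand-in: it is written in Python rather than Java, it is not the RunManager component, and instead of a real CAS it launches a small child Python process that adds the integers it reads from standard input.

```python
import subprocess
import sys


def run_via_streams(command, script, timeout=10):
    """Write `script` to the child's stdin and return what it prints on stdout."""
    completed = subprocess.run(
        command,
        input=script,          # redirected standard input
        capture_output=True,   # redirected standard output and error
        text=True,
        timeout=timeout,
        check=True,            # raise if the child exits with an error code
    )
    return completed.stdout.strip()


# Stand-in for a command-line CAS (hypothetical): sums the integers on stdin.
STAND_IN_CAS = [
    sys.executable,
    "-c",
    "import sys; print(sum(int(x) for x in sys.stdin.read().split()))",
]

answer = run_via_streams(STAND_IN_CAS, "2 3 4\n")
```

The wrapper treats the child as a black box: it knows only what to write to the input stream and how to read the output stream, which matches the program-level style described above.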
6 Implementation Details
The CAS server that we implemented exposes two main categories of service operations. The first category includes operations that offer information about the CASs exposed on that server and the functions that the client is able to call. The second category refers to the generic operation that allows remote invocation of CAS functions.
The first operation category includes: getCASList(), getFunctionList(CAS_ID), getNrFunction(CAS_ID), getFunctionsByIndex(CAS_ID, startIndex, endIndex), getFunctionsMatch(CAS_ID, stringToMatch). These operations allow the client to discover the CASs installed on the server machine and, for the exposed CASs, the functions being exposed. The parameter CAS_ID uniquely identifies a CAS. The identifiers of the CASs can be obtained by calling the getCASList() operation. Since the number of functions exposed on the CAS server can be large, we created convenience operations that filter the list of function signatures returned by a getFunctionList(CAS_ID) call. For example, we can display functions with a certain name, or we can display a limited number of functions that match a certain criterion.
The actual execution of a command is achieved by calling the generic operation execute(). The operation String execute(CAS_ID, call_object) returns a unique identifier for the submitted job. This identifier is used to retrieve information about the status and the result of the job.
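The discovery operations can be pictured with the following Python sketch. It mirrors the operation names listed above in snake_case, but the data and the exact semantics are assumptions made for illustration: the function signatures are made-up placeholders, the index range is taken to be zero-based with an exclusive end, and matching is a plain substring test.

```python
# Hypothetical registry contents; real signatures would come from the server's database.
FUNCTIONS = {
    "demo_cas": ["Factorial(n)", "Factors(n)", "Gcd(a, b)", "IsPrime(n)"],
}


def get_cas_list():
    """Identifiers of all CASs exposed by the server."""
    return sorted(FUNCTIONS)


def get_function_list(cas_id):
    """All function signatures exposed for one CAS."""
    return list(FUNCTIONS[cas_id])


def get_nr_function(cas_id):
    """Number of exposed functions, so that a client can page through them."""
    return len(FUNCTIONS[cas_id])


def get_functions_by_index(cas_id, start_index, end_index):
    """One page of signatures (assumed zero-based, end index exclusive)."""
    return FUNCTIONS[cas_id][start_index:end_index]


def get_functions_match(cas_id, string_to_match):
    """Signatures containing the given string (assumed substring match)."""
    return [s for s in FUNCTIONS[cas_id] if string_to_match in s]
```

A client that does not want to download a long list would typically call get_nr_function first and then fetch pages with get_functions_by_index, or narrow the list with get_functions_match.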
The registry of the exposed functions can be populated by the system administrator using a registry editing tool that we implemented. CAS-related information stored in the registry includes, but is not restricted to, the identifier of the CAS, the path where the CAS is installed and a brief description of the CAS. Function-related information includes the function name, the signature and a brief description. If the CAS allows socket connections, it can reside on another machine, and the specific path can include the shell command that opens the connection to the machine that runs the CAS.
As a first step of the development, the component that implements the operation execute() uses the wrapper approach. As the functionality of the CASs permits it, a different way of communicating with the CASs will be considered. We expect a new OpenMath-based standard protocol for CAS inter-communication to emerge in the near future as a SCIEnce deliverable. It will offer the support needed for a CAS to communicate with the outside world. Efforts to implement a protocol that offers this functionality are already being carried out by several major CAS development teams (GAP, KANT, MuPAD, Maple).
7 Efficiency
As described above, the tool that we implemented invokes a CAS using two approaches: the RunCommand component and the Globus WS-GRAM service. The RunCommand component does not offer any feature besides running jobs. The invocation of the WS-GRAM service requires an extra Web service call, so using WS-GRAM is expected to induce some overhead. The test machine is an HP ProLiant DL-385 with 2 x CPU AMD Opteron 2.4 GHz, dual core, 1 MB L2 cache per core, 4 GB DDRAM, and 2 network cards of 1 Gb/s. Our test was meant to investigate the time needed for a job to be submitted for execution by the Web service wrapper. More precisely, we have considered the execution time of the Web service execute() operation. The results obtained when RunCommand was used to run GAP and Maple tasks show an average of 51 milliseconds, while the WS-GRAM approach showed an average overhead of 678 milliseconds. The difference between the two approaches is significant when multiple invocations are needed, as in the case of combining several CAS functions.
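To see why the difference matters for combined invocations, the two reported averages can be scaled by the number of calls. The short Python calculation below assumes that both figures are per-call submission times and that calls are issued one after another so that the costs simply add up; the paper does not state either assumption explicitly, and the function names are hypothetical.

```python
RUNCOMMAND_MS = 51   # average per execute() call reported in the text
WS_GRAM_MS = 678     # average per execute() call reported in the text


def submission_cost_ms(per_call_ms, calls):
    """Total submission time for a sequence of calls (assumed to be additive)."""
    return per_call_ms * calls


def extra_cost_ms(calls):
    """Additional time spent when WS-GRAM is used instead of RunCommand."""
    return submission_cost_ms(WS_GRAM_MS, calls) - submission_cost_ms(RUNCOMMAND_MS, calls)


ratio = round(WS_GRAM_MS / RUNCOMMAND_MS, 1)  # how many times slower one submission is
extra_for_ten_calls = extra_cost_ms(10)       # a workflow combining ten CAS functions
```

Under these assumptions each call costs 627 ms more through WS-GRAM, so a workflow of ten calls spends more than six additional seconds on submission alone.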
8 Conclusions
A service-oriented architecture for exposing computer algebra system functionality through computing services was introduced and discussed. It is composed of several CAS servers based on symbolic computing services that are registered with them. The services wrap legacy codes. The novelty of this approach compared with the similar ones identified in this study is that it is based on current standards for Web and Grid services; the only nonstandard technique is the use of a software and function registry, which proved to be more efficient than the standard approach of a special service call.
Acknowledgements. This research is partially supported by European Framework 6 grant RII3-CT-2005-026133 SCIEnce: Symbolic Computation Infrastructure for Europe and Romanian project GridMOSI CEEX-I 95/03.10.2005.
References
1. Globus Toolkit, http://globus.org/toolkit/
2. WSRF, http://www.globus.org/wsrf/
3. Denemark, J., Kulshrestha, A., Allen, G.: Deploying Legacy Applications on Grids. In: Procs. 13th Annual Mardi Gras Conference, Frontiers of Grid Applications and Technologies, Baton Rouge, pp. 29–34 (2005)
4. Solomon, A.: Distributed Computing for Conglomerate Mathematical Systems. In: Joswig, M., et al. (eds.) Integration of Algebra & Geometry Software Systems (2002)
5. Senger, M., Rice, P., Oinn, T.: Soaplab - a unified Sesame Door to Analysis Tools. In: Cox, S.J. (ed.) Procs. UK e-Science, All Hands Meeting 2003, pp. 509–513 (2003)
6. Huang, Y., Taylor, I., Walker, D.W., Davies, R.: Wrapping Legacy Codes for Grid-Based Applications. In: IPDPS workshop, pp. 139–147. IEEE Computer Society, Los Alamitos (2003)
7. GENSS, http://genss.cs.bath.ac.uk/
8. Naifer, M., Kasem, A., Ida, T.: A System of Web Services for Symbolic Computation. In: Asian Workshop on Foundation of Software, Xiamen (2007) (in print)
9. Casanova, H., Dongarra, J.: NetSolve: a Network Server for Solving Computational Science Problems. Inter. J. Supercomputer Appls. & HPC 11(3), 212–223 (1997)
10. Wolfram Research, MathLink, http://www.wolfram.com/solutions/mathlink/
11. Tepeneu, D., Ida, T.: MathGridLink - A bridge between Mathematica and the Grid. In: Proc. JSSST, pp. 74–77 (2003)
12. Eres, M.H., Pound, G.E., Jiao, Z., et al.: Implementation of a Grid-enabled Problem Solving Environment in Matlab. In: Proc. WCPS (2003)
13. Abramson, D., Giddy, J., Kolter, L.: High Performance Parametric Modeling with Nimrod/G: A Killer Application for the Global Grid. In: Proc. IPDPS, pp. 520–528 (2000)
14. Solomon, A., Struble, C.A.: JavaMath - an API for Internet Accessible Mathematical Services. In: Procs. Asian Symposium on Computer Mathematics (2001)
15. GAP Group, Groups, Algorithms & Programming, http://www.gap-system.org
16. Maple, http://www.maplesoft.com
17. MuPAD, http://www.mupad.de
18. KANT, http://www.math.tu-berlin.de/~kant/kash.html
19. Hammond, K., Al Zain, A., Cooperman, G., Petcu, D., Trinder, P.: SymGrid: a Framework for Symbolic Computation on the Grid. In: Kermarrec, A.-M., Bougé, L., Priol, T. (eds.) Euro-Par 2007. LNCS, vol. 4641, pp. 447–456. Springer, Heidelberg (2007)
20. Java 5 SDK, http://java.sun.com
21. Apache Axis, http://ws.apache.org/axis/
22. Apache Tomcat, http://tomcat.apache.org/
23. OMLib, http://www.riaca.win.tue.nl/products/openmath/lib/
24. OpenMath, http://www.openmath.org/
25. Konovalov, A., Linton, S.: Symbolic Computation Software Composability Protocol Specification. CIRCA preprint, 2007/5, University of St Andrews, http://www-circa.mcs.st-and.ac.uk/preprints.html
26. Comella-Dorda, S., Wallnau, K., Seacord, R.C., Robert, J.: A Survey of Legacy System Modernization Approaches (2000), http://www.sei.cmu.edu/pub/documents/00.reports/pdf/00tn003.pdf
27. Cârstea, A., Frîncu, M., Macariu, G., Petcu, D., Hammond, K.: Generic Access to Web and Grid-based Symbolic Computing Services: the SymGrid-services Framework. In: Procs. ISPDC 2007, pp. 143–150. IEEE CS Press, Los Alamitos (2007)
28. WSDL 2.0, http://www.w3.org/TR/wsdl20/
29. Sneed, H.: Encapsulating Legacy Software for Reuse in Client/Server Systems. In: Proceedings of WCRE-96, IEEE Press, Monterey (1996)
30. Sneed, H.: Encapsulation of Legacy Software: A Technique for Reusing Legacy Software Components. Annals of Software Engineering, pp. 293–313. Springer, Heidelberg (2000)
CPPC-G: Fault-Tolerant Applications on the Grid
Daniel Díaz, Xoán C. Pardo, María J. Martín, Patricia González, and Gabriel Rodríguez
Computer Architecture Group, University of A Coruña, Spain
{ddiaz,pardo,mariam,pglez,grodriguez}@udc.es
Abstract. The Grid community has made an important effort in developing middleware to provide different functionalities, such as resource discovery, resource management, job submission, and execution monitoring. As part of this effort, this paper addresses the design and implementation of a service-based architecture (CPPC-G) to manage the execution of fault-tolerant applications on Grids. The CPPC (Controller/Precompiler for Portable Checkpointing) framework is used to insert checkpoint instrumentation into the application code. The designed services are in charge of submitting and monitoring the execution of the application, managing checkpoint files, and detecting and automatically restarting failed executions.
Keywords: Fault-Tolerance, Grid Computing, Globus, MPI, Checkpointing.
1 Introduction
As parallel platforms increase their number of resources, so does the failure rate of the global system. This is not a problem while the mean time to complete an application execution remains well under the mean time to failure (MTTF) of the underlying hardware, but that is not always true for long-running applications, where users and programmers need a way to ensure that not all of the computation done is lost on machine failures. Checkpointing has become a widely used technique to obtain fault tolerance in such environments. It periodically saves the computation state to stable storage, so that the application execution can be resumed by restoring such state. A number of solutions and techniques have been proposed [1], each having its own pros and cons.
CPPC (Controller/Precompiler for Portable Checkpointing) [2,3] is a checkpointing tool focused on the insertion of fault tolerance into long-running MPI applications (of which sequential applications are a particular case). It is designed to allow for execution restart on different architectures and/or operating systems, also supporting checkpointing over heterogeneous systems, such as the Grid. It uses portable code and protocols, and generates portable checkpoint files while avoiding traditional solutions which add an unscalable overhead, such as process coordination or message-logging.
This paper introduces CPPC-G, a set of new Grid services¹ implemented on top of Globus 4 [4] using the Java API, which is able to manage the execution of CPPC-instrumented fault-tolerant applications (CPPC applications from now on). The designed set of services is in charge of submitting and monitoring the execution, as well as of managing and replicating the state files generated during the execution. It is widely accepted that the performance of an MPI application on the Grid remains a problem, caused by the communication bottleneck on wide area links. To overcome this performance problem, in this work it is assumed that all processes of an MPI application are executed on the same computing resource (e.g. a cluster or an MPP machine with MPI installed). Upon a failure, the Grid services described in this work restart the application from the last consistent state, in a completely transparent way. At the present stage of CPPC-G development, only failures in the computing resource where the CPPC application is executed are being considered.
The structure of the paper is as follows. Section 2 gives an overview of the CPPC framework. Section 3 describes CPPC-G, its architecture, implementation details and deployment. The operation of the CPPC-G tool is shown in Section 4. Section 5 concludes the paper.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 852–859, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 The CPPC Framework
CPPC² appears to the user as a runtime library containing checkpoint-supporting routines, together with a compiler tool which automates the use of the library. There are several issues to be solved in implementing checkpointing solutions, such as portability, memory requirements, or transparency. Besides, for parallel applications checkpoint consistency must be taken into account.
To manage portable checkpoint files, CPPC recovers non-portable state by means of the re-execution of the code responsible for creating such opaque state in the original execution. Moreover, in CPPC the effective data writing is performed by the selected writing plugin, using its own format. This enables the restart on different architectures, as long as a portable dumping format is used for code variables. Currently, a writing plugin based on HDF5 [5] is provided.
Checkpointing of large-scale real applications may lead to a great amount of stored state, the cost being so high as to become impractical. CPPC reduces the amount of data to be saved by including, in its compiler, a live variable analysis in order to identify those variable values that are needed upon restart. Besides, a compressed format based on the ZLib library [6] is included. A multithreaded dumping option is also provided to improve performance when working with large datasets.
¹ With the term Grid service we denote a Web service that complies with the Web Services Resource Framework (WSRF) specifications.
² CPPC is an open source project, and its current release can be downloaded at http://cppc.des.udc.es
The user must insert only one compiler directive into the original application (the cppc checkpoint pragma) to mark points in the code where the relevant state will be dumped to stable storage into a checkpoint file. The compiler performs a source-to-source transformation, automatically identifying both the variables to be dumped to the checkpoint file and the non-portable code to be re-executed upon restart, and inserting the necessary CPPC functions as well as flow control code. For MPI applications, checkpoints are taken at the same relative code points by all the processes (assuming SPMD codes), where it is guaranteed that there are neither in-transit nor ghost messages. These points are called safe points, and checkpoint files generated at such points are strongly consistent [7].
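The effect of a checkpoint directive placed inside a loop can be imitated with a small Python sketch. It is an analogy under explicit assumptions, not CPPC itself: the live variables are chosen by hand instead of by compiler analysis, Python's pickle stands in for the portable HDF5-based writing plugin, and the function names and the fail_at parameter are hypothetical. Only the zlib compression step corresponds directly to a library named in the text.

```python
import os
import pickle
import tempfile
import zlib


def dump_checkpoint(path, state):
    """Write the compressed state, renaming at the end so no partial file is left."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(zlib.compress(pickle.dumps(state)))
    os.replace(tmp, path)


def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.loads(zlib.decompress(f.read()))


def run(path, steps, checkpoint_every, fail_at=None):
    """Sum 0..steps-1, checkpointing the live variables (i, total) periodically."""
    state = load_checkpoint(path) if os.path.exists(path) else {"i": 0, "total": 0}
    i, total = state["i"], state["total"]
    while i < steps:
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated machine failure")
        total += i
        i += 1
        if i % checkpoint_every == 0:  # the marked checkpoint location
            dump_checkpoint(path, {"i": i, "total": total})
    return total


ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt.bin")
try:
    run(ckpt_path, steps=10, checkpoint_every=3, fail_at=7)  # first attempt fails
except RuntimeError:
    pass
saved = load_checkpoint(ckpt_path)                        # state of the last checkpoint
final_total = run(ckpt_path, steps=10, checkpoint_every=3)  # restart resumes from it
```

The restarted run does not repeat the first six iterations: it reloads the saved values and continues, which is the behaviour that the inserted flow control code provides in an instrumented application.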
3 CPPC-G Design
This section describes the design and implementation of the proposed set of Grid services for remote execution of CPPC applications.
3.1 Some Preliminary Questions
Before going into details of the proposed CPPC-G architecture, some general questions that are common to all its services must be introduced. They are related to the management of long operations, security issues, chained invocations among services and resource creation requests.
Using a simple invocation of a Grid service to implement an operation that takes a long time to complete lacks flexibility, since there is no way to cancel or interrupt it. Instead, an approach based on the factory pattern is preferred. In this approach an operation is started by invoking a factory service that creates an instance resource in an instance service, using a given resource description as template. In the following, the terms resource and service will be used to refer to resource and service instances. The newly created resource is responsible for tracking the progress of the operation, and it is possible to query its state or subscribe to it in order to receive notifications when it changes. Furthermore, resources created in this way can be explicitly destroyed at any time, or be automatically destroyed when a termination time specified by the user expires. It is the responsibility of the user to extend the resource lifetime if it was not long enough to complete the operation. Resource lifetimes are a means to ensure that resources will eventually be released if communication with the user is lost.
The developed services depend on the underlying GSI security infrastructure of Globus for authentication, authorization, and credential delegation. Credential delegation can be performed directly by the user, or automatically by a service that itself has access to a delegated credential. The standard Globus services follow the principle of always making the user delegate the credentials himself beforehand, never resorting to automatic delegation. This is more cumbersome for the user, but allows greater control over the credentials. The developed CPPC-G services also try to follow that principle whenever possible. However, an exception is made in situations in which the service to be invoked is not known beforehand because it is dynamically selected. Automatic delegation has been used in these cases.
Most operations are implemented by invoking a service that itself invokes other services. It is usual to end up with several levels of chained invocations. With more than two levels, some difficulties arise with the delegation of credentials. In Globus, services that require user credentials publish the delegation factory service EPR (endpoint reference) as a resource property of their corresponding factory service. The user must use that EPR to delegate his credentials before being allowed to create resources in the service. Services that invoke other services that also require the delegation of credentials must publish one delegation factory service EPR for each invoked service. In the following, the term delegation set will be used to refer to the set of delegation factory service EPRs where the user must delegate his credentials before using a service. Once the user has delegated his credentials to the proper delegation services, delegated credential EPRs are passed to invoked services as part of resource creation requests, which will be explained later in this section. In the following, the term credential set will be used to refer to the set of delegated credential EPRs.
When a large number of services are involved in a chained invocation, the use of delegation sets becomes complicated for users and administrators. From the user's point of view, the delegation set is a confusing array of delegation EPRs to which he must delegate his credentials before invoking a service. To help users, XML entities have been defined to be used in the service WSDL file to describe the delegation sets in a hierarchical fashion. From the administrator's point of view, all the EPRs in the delegation set of a service must be specified in its configuration files, which is an error-prone task.
To avoid this problem, a technique based on queries to build delegation sets dynamically has been implemented.
In order to create a resource, factory services take as parameters the following creation request datatypes: the initial termination time of the resource; the resource description; a credential set, that is, the delegated credential EPR to be used by the resource, plus the delegated credential EPRs to be used in invocations to other services; and a refresh specification for each of the services the resource is expected to invoke. This specification has two components: a lifetime refresh period and a ping period. The lifetime refresh period is the number of seconds by which the lifetime of resources in invoked services is extended; the ping period is the frequency with which the resources in invoked services are checked to be up and reachable.
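The creation request and the lifetime mechanism can be summarised in a compact data model. The Python sketch below is a hypothetical rendering for illustration only: the class and field names are not taken from the CPPC-G schemas, EPRs are represented as opaque strings, and time is a plain number supplied by the caller instead of a real clock.

```python
from dataclasses import dataclass, field


@dataclass
class RefreshSpec:
    lifetime_refresh_s: float  # seconds added to the invoked resource's lifetime
    ping_period_s: float       # how often the invoked resource is checked


@dataclass
class CreationRequest:
    termination_time: float    # initial termination time of the resource
    description: dict          # resource description used as template
    credential_set: list       # delegated credential EPRs (opaque strings here)
    refresh: dict = field(default_factory=dict)  # invoked service name -> RefreshSpec


class Resource:
    """Tracks one long-running operation and its lifetime."""

    def __init__(self, request):
        self.request = request
        self.termination_time = request.termination_time
        self.destroyed = False

    def extend_lifetime(self, seconds):
        self.termination_time += seconds

    def expire_if_due(self, now):
        """Destroy the resource once its termination time has passed."""
        if now >= self.termination_time:
            self.destroyed = True
        return self.destroyed


def create_resource(request):
    """Factory operation: build a resource from a creation request."""
    return Resource(request)


request = CreationRequest(
    termination_time=100.0,
    description={"executable": "app"},
    credential_set=["epr-own", "epr-for-CkptJob"],
    refresh={"CkptJob": RefreshSpec(lifetime_refresh_s=60.0, ping_period_s=10.0)},
)
resource = create_resource(request)
alive_before = not resource.expire_if_due(now=99.0)
resource.extend_lifetime(request.refresh["CkptJob"].lifetime_refresh_s)
alive_after_refresh = not resource.expire_if_due(now=150.0)
expired_later = resource.expire_if_due(now=160.0)
```

If the refresh stops, for instance because the invoking service has failed, the termination time is eventually reached and the resource is released without any explicit destroy call.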
3.2 System Architecture
Figure 1 shows the proposed CPPC-G architecture, which comprises a set of five new services that interact with the Globus RFT and WS-GRAM services. A FaultTolerantJob service is invoked to start the fault-tolerant execution of a CPPC application. CkptJob services provide remote execution functionalities. Available computing resources hosting a CkptJob service are obtained from a SimpleScheduler service. StateExport services are responsible for tracking local checkpoint files periodically stored by the CPPC application. Finally, CkptWarehouse services maintain metadata and consistency information about stored checkpoint files. In the following, the functionality of each service is described in depth.
Fig. 1. System Architecture. Components shown: FaultTolerantJob, SimpleScheduler, CkptJob, StateExport, CkptWarehouse, WS-GRAM, RFT, and the computing resource with the application, the CPPC library and CP Data. Interactions shown: USERlaunch, FTquery, FTlaunch, FTpoll, CJregister, CJlaunch, CJquery, CJnotify, GRAMlaunch, SEquery, SEnotify, SEtransfer, CWdelete, RFTdelete, RFTtransfer, CPPCread, CPPCwrite.
CPPC applications running in a computing resource can store checkpoint files in locations not accessible to services hosted in different computing resources (e.g. the local filesystem of a cluster node). The StateExport service is responsible for tracking these local checkpoint files and moving them to a remote site that can be accessed by other services. There must be a StateExport service for every computing resource where a CPPC application could be executed. To help find checkpoint files, the CPPC library has been slightly modified: processes of CPPC applications now write, besides checkpoint files, metadata files in a previously agreed location in the computing resource filesystem. StateExport resources periodically check (by using GridFTP) for the existence of the next metadata file in the series. When a new one is found, the resource parses it, locates the newest checkpoint file using the extracted information, and replicates it via RFT in a previously agreed backup site. When the replication is finished, a notification is sent to the CkptJob service.
The CkptJob service provides remote execution of CPPC applications. This service coordinates with StateExport and CkptWarehouse to extend WS-GRAM, adding the needed checkpointing functionality. Job descriptions used by the CkptJob service are those of WS-GRAM, augmented with the EPR of a CkptWarehouse service and a StateExport resource description element that is used as a template to create StateExport resources. The CkptJob service can be configured to register useful information with an MDS index service in the form of a sequence of tags. These tags are specified by the administrator in a configuration file and used to indicate particular properties (e.g. that MPI is installed in the computing resource).
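The polling behaviour of a StateExport resource can be sketched as follows. Everything specific in this Python illustration is an assumption: the metadata file naming scheme and its JSON layout are invented, a local file copy stands in for the RFT transfer, a directory listing stands in for the GridFTP check, and a Python callback stands in for the notification sent to CkptJob.

```python
import json
import os
import shutil
import tempfile


def metadata_path(directory, index):
    """Hypothetical naming scheme for the series of metadata files."""
    return os.path.join(directory, f"ckpt-{index}.meta.json")


def export_new_checkpoints(local_dir, backup_dir, next_index, notify):
    """Export every checkpoint whose metadata file has appeared.

    Returns the index of the next metadata file to look for.
    """
    while os.path.exists(metadata_path(local_dir, next_index)):
        with open(metadata_path(local_dir, next_index)) as f:
            meta = json.load(f)
        source = os.path.join(local_dir, meta["checkpoint_file"])
        shutil.copy(source, backup_dir)  # stand-in for replication via RFT
        notify(meta)                     # stand-in for the notification to CkptJob
        next_index += 1
    return next_index


# One polling round after an application process has written a checkpoint.
local_dir, backup_dir = tempfile.mkdtemp(), tempfile.mkdtemp()
with open(os.path.join(local_dir, "state-0.bin"), "wb") as f:
    f.write(b"checkpoint bytes")
with open(metadata_path(local_dir, 0), "w") as f:
    json.dump({"checkpoint_file": "state-0.bin", "process": 0}, f)

notifications = []
next_index = export_new_checkpoints(local_dir, backup_dir, 0, notifications.append)
```

Looking only for the next metadata file in the series keeps each polling round cheap, since the service never has to list or compare all checkpoint files in the directory.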
The CkptWarehouse service maintains metadata about checkpoint files generated by running CPPC applications. Each time a checkpoint file is successfully exported, the proper resource in the CkptWarehouse service is notified; this resource is responsible for composing sets of globally consistent checkpoint files. When a new globally consistent state is composed, checkpoint files belonging to the previous state (if they exist) become obsolete and can be discarded (they are deleted by using RFT).
The SimpleScheduler service keeps track of available computing resources hosting a CkptJob service. The service subscribes to an MDS index service that aggregates the information published by registered CkptJob services. In particular, the sequences of tags published by CkptJob services are used to select a computing resource that satisfies some given scheduling needs. For now, the only supported scheduling need is the required presence of a given tag, but this mechanism could be used in future versions to support more complicated selections.
The FaultTolerantJob service is the one that the user invokes to start the execution of a CPPC application. One resource is created for each application; it is responsible for starting the application and monitoring it by periodically checking its execution state. In case of failure, the execution is restarted automatically. Computing resources needed for executing the application are obtained by querying a SimpleScheduler service, so it is not possible to know beforehand where the application will be executed. As a consequence, credential delegation has to be deferred until the precise CkptJob service to be invoked to execute the application is known.
3.3 Component Deployment
As already mentioned, it is assumed that all processes of a CPPC application will be executed in the same computing resource for performance reasons. In a typical configuration of CPPC-G, a CkptJob service and a StateExport service will be present in that resource (as will the Globus WS-GRAM and RFT services). The CkptWarehouse, SimpleScheduler and FaultTolerantJob services will reside in other computing resources. It is enough that one instance of each of these last three services exists in a Grid for the system to work. CkptWarehouse is not required to reside on the same computing resource where the checkpoints are going to be replicated. It must be noted that the use of SimpleScheduler and FaultTolerantJob is optional. They must be present only if automatic restart of failed executions is wanted. The rest of the services can be used on their own if only remote execution of CPPC applications is necessary. In this case it will be the responsibility of the user to manually restart a failed execution.
4 CPPC-G Operation
To initiate the execution of an application the following steps are taken in sequence. It is assumed that, before the user submits an application, all available CkptJob services are already registered with a SimpleScheduler service (CJregister in Figure 1):
1. In order to prepare the application submission, the user must create in advance an instance of the CkptWarehouse service and a credential set. This information will be included as part of the resource creation request used to submit the application.
2. The user submits the application to a FaultTolerantJob service (USERlaunch in Figure 1).
3. The FaultTolerantJob service invokes a SimpleScheduler service (FTquery in Figure 1), asking for an available computing resource.
4. The FaultTolerantJob service invokes the CkptJob service on the selected computing resource to get its delegation set. The CkptJob service builds the delegation set dynamically by querying the services hosted in the computing resource (i.e. the StateExport service, WS-GRAM and RFT). With this delegation set and one of the delegated credentials of the user, the FaultTolerantJob service creates a credential set that will be included as part of the resource creation request used to start the CPPC execution.
5. The FaultTolerantJob service invokes the CkptJob service on the selected computing resource to start a CPPC execution (FTlaunch in Figure 1).
6. The CkptJob service queries the CkptWarehouse service to obtain the last consistent set of checkpoint files (CJquery in Figure 1). Checkpoint files are moved, as part of the staging of input files, by the WS-GRAM service on the selected computing resource when the application is started.
7. The CkptJob service invokes WS-GRAM to initiate the application execution (CJlaunch in Figure 1).
8. The CkptJob service also invokes the StateExport service to initiate the exporting of checkpoints.
When a process of the CPPC application generates a checkpoint file (CPPCwrite in Figure 1) the following steps are taken in sequence:
9. The corresponding StateExport resource detects the presence of the newly created checkpoint file (SEquery in Figure 1) by using the technique based on metadata files already explained in Section 3.2.
10. The StateExport service uses RFT to export the checkpoint file to a previously agreed backup site (SEtransfer in Figure 1).
11. Once the transfer is finished, the StateExport service notifies the CkptJob service (SEnotify in Figure 1) about the existence of a new checkpoint file.
12. After receiving the notification, the CkptJob service notifies the CkptWarehouse service in turn (CJnotify in Figure 1). The notification includes information about the process that generated the checkpoint file.
13. When, upon arrival of more recent checkpoint files, a new consistent state is composed, the CkptWarehouse service deletes obsolete files by using RFT (CWdelete in Figure 1).
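Steps 12 and 13 can be sketched as a small bookkeeping class. The Python code below rests on an assumption that the paper does not spell out in this form: a state is treated as globally consistent once every process has exported the checkpoint with the same index, which is plausible because checkpoints are taken at the same safe points by all processes. The class layout and the delete_file callback (a stand-in for deletion through RFT) are hypothetical.

```python
class CkptWarehouse:
    """Composes consistent checkpoint sets from per-process notifications."""

    def __init__(self, num_processes, delete_file):
        self.num_processes = num_processes
        self.delete_file = delete_file  # stand-in for deletion through RFT
        self.pending = {}               # checkpoint index -> {process rank: file}
        self.consistent = None          # (index, {rank: file}) of the newest full set

    def notify(self, index, rank, file):
        """Record an exported file; return True if a new consistent state results."""
        files = self.pending.setdefault(index, {})
        files[rank] = file
        newest = self.consistent[0] if self.consistent else -1
        if len(files) < self.num_processes or index <= newest:
            return False
        previous = self.consistent
        self.consistent = (index, self.pending.pop(index))
        if previous is not None:
            for obsolete in previous[1].values():
                self.delete_file(obsolete)  # the previous state is now obsolete
        return True


deleted = []
warehouse = CkptWarehouse(num_processes=2, delete_file=deleted.append)
events = [
    warehouse.notify(0, 0, "p0-c0"),
    warehouse.notify(0, 1, "p1-c0"),  # completes state 0
    warehouse.notify(1, 0, "p0-c1"),
    warehouse.notify(1, 1, "p1-c1"),  # completes state 1, state 0 becomes obsolete
]
```

Files of the previous state are deleted only after the newer state is complete, so at every moment at least one full set of checkpoint files is available for a restart.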
Currently two general types of execution failures are being considered: failures in the CPPC application execution, or failures in the computing resource. In both cases the FaultTolerantJob service is finally aware of the failed execution and a new restart is initiated going back to step 3. The process ends when the execution terminates successfully. All resources are released when the user acknowledges the finished execution.
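The restart behaviour described above amounts to a loop around steps 3 to 8. The following Python sketch shows that loop under several assumptions: resources and their tags are held in a dictionary, the scheduler simply returns the first resource that publishes the required tag, the launch callback condenses steps 4 to 8 and returns whether the execution succeeded, and the max_restarts bound is an addition of this sketch that the paper does not mention.

```python
def select_resource(resources, required_tag):
    """SimpleScheduler stand-in: first registered resource that publishes the tag."""
    for name, tags in resources.items():
        if required_tag in tags:
            return name
    raise LookupError(f"no resource with tag {required_tag!r}")


def run_fault_tolerant(resources, required_tag, launch, max_restarts=5):
    """Launch the job, going back to the scheduler query after each failure."""
    attempts = []
    for _ in range(max_restarts + 1):
        resource = select_resource(resources, required_tag)  # step 3
        attempts.append(resource)
        if launch(resource):                                  # steps 4 to 8
            return attempts
    raise RuntimeError("job did not finish within the restart budget")


# Hypothetical Grid with two registered CkptJob hosts and a job that fails twice.
grid = {"cluster-a": {"mpi"}, "cluster-b": {"mpi", "hdf5"}}
outcomes = iter([False, False, True])
attempts = run_fault_tolerant(grid, "mpi", launch=lambda resource: next(outcomes))
```

Each new attempt starts from the last consistent set of checkpoint files held by the CkptWarehouse service, so repeated failures cost only the work done since the most recent consistent state.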
5 Conclusions and Future Work
The aim of this work is to provide a set of new Grid services for remote execution of fault-tolerant applications. CPPC is used to save the state in a portable manner. The functionality of already existing Globus services is harnessed whenever possible. The new Grid services automatically ask for the necessary resources; start and monitor the execution; make backup copies of the checkpoint files; detect failed executions; and restart the application automatically. At the moment, the CPPC-G architecture is not fault-tolerant itself. In the future it is planned to use replication techniques for the FaultTolerantJob, SimpleScheduler and CkptWarehouse services. Another future direction is to automate the finding of potential checkpoint backup repositories over the Grid by querying an MDS index service.
Acknowledgments. This work has been supported by the Ministry of Education and Science (TIN-2004-07797-C02 and FPU grant AP-2004-2695), Galician Government (PGIDIT04TIC105004PR) and CYTED Project (506PI0293).
References
1. Elnozahy, E.N., Alvisi, L., Wang, Y.M., Johnson, D.B.: A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys 34(3), 375–408 (2002)
2. Rodríguez, G., Martín, M.J., González, P., Touriño, J.: Controller/Precompiler for Portable Checkpointing. IEICE Transactions on Information and Systems E89-D(2), 408–417 (2006)
3. Rodríguez, G., Martín, M.J., González, P., Touriño, J., Doallo, R.: Portable checkpointing of MPI applications. In: Proceedings of the 12th Workshop on Compilers for Parallel Computers (CPC 2006), A Coruña, Spain, pp. 396–410 (January 2006)
4. Foster, I.T.: Globus toolkit version 4: Software for service-oriented systems. Journal of Computer Science and Technology 21(4), 513–520 (2006)
5. National Center for Supercomputing Applications: HDF-5: File Format Specification, http://hdf.ncsa.uiuc.edu/HDF5/doc/
6. Gailly, J., Adler, M.: ZLib Home Page, http://www.gzip.org/zlib/
7. Hélary, J., Netzer, R., Raynal, M.: Consistency issues in distributed checkpoints. IEEE Transactions on Software Engineering 25(2), 274–281 (1999)
Garbage Collection in Object Oriented Condensed Graphs
Sunil John and John P. Morrison
Centre for Unified Computing, Dept. of Computer Science, National University of Ireland, University College Cork, Cork, Ireland
{s.john,j.morrison}@cs.ucc.ie
http://www.cuc.ucc.ie
Abstract. Even though Object Orientation has been proven to be an effective programming paradigm for software development, it has not been shown to be an ideal solution for the development of large scale parallel and distributed systems. There are a number of reasons for this: the parallelism and synchronisation in these systems have to be explicitly managed by the programmer; few Object Oriented languages have implicit support for Garbage Collection in parallel applications; and the state of a system of concurrent objects is difficult to determine. In contrast, the Condensed Graph model provides a way of explicitly expressing parallelism but with implicit synchronisation; its implementation in the WebCom system provides for automatic garbage collection, and the dynamic state of the application is embodied in the topology of the Condensed Graph. These characteristics free programmers from the difficult and error prone process of explicitly managing parallelism and thus allow them to concentrate on expressing a solution to the problem rather than on its low level implementation. Object Oriented Condensed Graphs are a computational paradigm which combines Condensed Graphs with object orientation, and this unified model leverages the advantages of both paradigms. This paper illustrates the Garbage Collection mechanism of Object Oriented Condensed Graphs as well as its basic concepts.

Keywords: Condensed Graphs, Object Oriented Systems, Software Engineering, Distributed and Parallel Computing.
1 Introduction
The support of large scale software systems has long been a focus of research in software engineering. Object Oriented Systems have attracted wide attention due to their many desirable properties, such as enhanced maintainability and reusability, which aid software development. With the support of characteristics such as inheritance, modularity, polymorphism and encapsulation, this paradigm can help the development of complex software programs [15]. However, the development of parallel and distributed applications is not significantly simplified by Object Orientation. The onus is still on the programmer to explicitly manage each parallel component and to ensure proper synchronisation. The interaction of parallel components in a large system can be complex and this complexity is compounded in sophisticated environments such as heterogeneous clusters and computational grids. Moreover, the encapsulation concept in OO is complex in a parallel environment. OO does not impose constraints upon invocation of an object's attributes or member functions. This complicates the relationship among objects when there are several method invocations. Similarly, memory management is a major concern. Most current Garbage Collection methodologies work sequentially; only a few OO languages support automatic garbage collection in a parallel system. In addition, parallelism poses additional challenges for access control mechanisms and object state determination.

The Condensed Graph (CG) model of computing is based on Directed Acyclic Graphs (DAGs). This model is language independent and it unifies the imperative, lazy and eager computational models [1]. Due to features including Implicit Synchronisation and Implicit Garbage Collection, this computational model can be effectively deployed in a parallel environment. CGs have been employed in a spectrum of application domains, from FPGAs to the Grid [5,8,2]. The most advanced reference implementation of the model is the WebCom abstract machine [6]. WebCom is being used as the basis of Grid-Ireland's middleware development and deployment [2,3,4].

This paper addresses the concept of Object Oriented Condensed Graphs (OOCG). OOCG is a unified model that combines the Object Oriented paradigm with the Condensed Graph methodology. This unified model helps to leverage the advantages of both paradigms.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 860–869, 2008. © Springer-Verlag Berlin Heidelberg 2008
Similar research in the area of modelling languages includes Object Oriented Petri Nets (OOPN) [9] and Object Petri Nets (OPN) [10], in which Object Orientation principles have been combined with those of Petri Nets [12]. Other notable Petri Net modelling languages embodying OO concepts are CPN [11], HOOPN [13] and CO-OPN [14].

This paper is organised as follows: Section 2 provides an overview of Condensed Graphs, and Section 3 presents the basics of Object Oriented Condensed Graphs. Finally, Section 4 introduces the Automatic Garbage Collection employed by the OOCG.
2 Condensed Graphs
Like classical dataflow, the CG model is graph-based and uses the flow of entities on arcs to trigger execution. In contrast, CGs are directed acyclic graphs in which every node contains not only operand ports, but also an operator and a destination port. Arcs incident on these respective ports carry other CGs representing operands, operators and destinations. Condensed Graphs are so called because their nodes may be condensations, or abstractions, of other CGs. (Condensation is a concept used by graph theoreticians for exposing meta-level information from a graph by partitioning its vertex set, defining each subset of the partition
to be a node in the condensation, and by connecting those nodes according to a well-defined rule.) Condensed Graphs can thus be represented by a single node (called a condensed node) in a graph at a higher level of abstraction. The number of possible abstraction levels derivable from a specific graph depends on the number of nodes in that graph and the partitions chosen for each condensation. Each graph in this sequence of condensations represents the same information but at a different level of abstraction. It is possible to navigate between these abstraction levels, moving from the specific to the abstract through condensation, and from the abstract to the specific through a complementary process called evaporation.

The basis of the CG firing rule is the presence of a CG in every port of a node. That is, a CG representing an operand is associated with every operand port, an operator CG with the operator port and a destination CG with the destination port. This way, the three essential ingredients of an instruction are brought together (these ingredients are also present in the dataflow model; only there, the operator and destination are statically part of the graph). Any CG may represent an operator. It may be a condensed node, a node whose operator port is associated with a machine primitive (or a sequence of machine primitives) or it may be a multi-node CG. The present representation of a destination in the CG model is as a node whose own destination port is associated with one or more port identifications. Fig. 1 illustrates the congregation of instruction elements at a node and the resultant rewriting that takes place.
Fig. 1. CGs congregating at a node to form an instruction
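The firing rule described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the `Node` class, its field names and its `fire` method are hypothetical and do not come from WebCom; operators are modelled as plain Python callables, and a destination as a (node, port) pair. The example mirrors the operand 5 and the nodes f and g of Fig. 1, with arbitrary operators chosen for f and g.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical CG node with operand, operator and destination ports."""
    name: str
    arity: int                                      # number of operand ports
    operator: Optional[Callable[..., Any]] = None   # operator port
    destinations: List[Tuple["Node", int]] = field(default_factory=list)
    operands: Dict[int, Any] = field(default_factory=dict)

    def fireable(self) -> bool:
        # Firing rule: something must be present on every port.
        return (self.operator is not None
                and bool(self.destinations)
                and len(self.operands) == self.arity)

    def fire(self) -> List["Node"]:
        """Execute the instruction, forward the result, and
        return the destination nodes that became fireable."""
        args = [self.operands[i] for i in range(self.arity)]
        result = self.operator(*args)
        self.operands.clear()          # strict operands are consumed
        ready = []
        for node, port in self.destinations:
            node.operands[port] = result
            if node.fireable():
                ready.append(node)
        return ready


# Operand 5 flows into f, whose result flows into g, then into the exit node.
x_node = Node("X", arity=1)                      # no operator: never fires here
g = Node("g", 1, operator=lambda v: v * 2, destinations=[(x_node, 0)])
f = Node("f", 1, operator=lambda v: v + 1, destinations=[(g, 0)])

f.operands[0] = 5
worklist = [f]
while worklist:
    worklist.extend(worklist.pop().fire())

print(x_node.operands[0])   # 12
```

In this sketch the operator and destinations are fixed when the graph is built, so the arrival of operands drives execution; supplying the operator or the destination late instead would delay firing in the same way.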
Strict operands are consumed in an instruction execution but non-strict operands may be either consumed or propagated. The CG operators can be divided into two categories: those that are value-transforming and those that only move CGs from one node to another in a well-defined manner. Value-transforming operators are intimately connected with the underlying machine and can range from simple arithmetic operations to the invocation of software functions or components that form part of an application. In contrast, CG moving instructions are few in number and are architecture independent. By statically constructing a CG to contain operators and destinations, the flow of operand CGs sequences the computation in a dataflow manner. Similarly, by constructing a CG to statically contain operands and operators, the flow of destination CGs will drive the computation in a demand-driven manner. Finally, by constructing CGs to statically contain operands and destinations, the flow of
operators will result in a control-driven evaluation. This latter evaluation order, in conjunction with side-effects, is used to implement imperative semantics. The power of the CG model results from being able to exploit all of these evaluation strategies in the same computation, and to move dynamically between them using a single, uniform formalism.
3 OOCG Concepts

3.1 Object Annotations
The definition of a Condensed Graph may be viewed as a class from which instances can be created. Such instances are analogous to objects. A class can contain nested graphs representing individual methods and attributes.
Fig. 2. A Node to represent Class X. The member functions form child nodes within that node.
In the CG model, Graph instance creation can occur implicitly when an appropriate CG node is invoked. Alternatively, explicit instance creation will give rise not only to an appropriate graph instance but also to a CG node which can be used to represent that instance. In effect, these nodes are coherent names of dynamically created objects. OOCG specifies two kinds of visibility for its member functions and attributes. As is typical in an Object Oriented language, private members belonging to an Object can be accessed only within that Object. In contrast, a public member has a broad visibility. It can be accessed from inside as well as from outside of that Object. In other words, it has a Global Scope. Some of these scenarios are shown in Figs. 2 and 3.

3.2 Inheritance in Condensed Graphs
Fig. 3. Invoking Method A of Object X with Operands. Sending the result to Y.

Object Oriented Condensed Graphs incorporate single inheritance. A class can inherit from another class. In this way, a sub class can inherit the properties of its super class. The sub class can introduce new properties or can override the inherited properties of its super class. As shown in Figs. 4 and 5, subclass Class B extends superclass Class A. By default, public members of A are available in Class B and these can be used for operations within B. The class definitions can be instantiated as Objects. The characteristics of Polymorphism can also be observed in the class instances. By overriding a parent class operation in the child class, the behaviour can be changed in the sub class. In the above example, we can redefine the member function func1 in the sub class so that the behaviour of the whole operation is redefined. This also gives rise to dynamic binding. During compile time the contexts between the operations can be switched.

Fig. 4. The parent super class A

Fig. 5. The child sub class B. The child class inherits the public contents of class A.

3.3 Concurrency and Synchronisation
OOCG adheres to the standard OO principle of Encapsulation. The public methods and attributes are visible and accessible from outside the Object. A method within an object can be subject to several invocations. These invocations can be intra-Object or inter-Object.

Condensed Graphs are inherently parallel by design. The definition of the graph dictates the parallel execution. The node dependencies within the graph explicitly specify the order of execution. Cross Cutting Condensed Graphs [7] is a methodology that caters for inter-graph synchronisation between concurrently executing graphs. This methodology helps to overcome the model's basic criterion that values exit a graph only through its X Node. In the cross-cutting version, values may exit via a special construct, thus helping to coordinate independent graphs.

Supporting access to shared resources is illustrated in Fig. 6. Many operands may converge on the E node of this graph. A priority based queuing mechanism is employed in the E node to handle these accesses and a local semaphore is used to enforce sequential access to the underlying graph. Typically it sets a flag when it allows the invocation of the graph. When the X node of the graph fires, this flag reverts to its previous value.

Fig. 6. The Condensed Nodes refer to the Resource Graph. The E node of the Resource Graph contains a local semaphore which restricts access to the resource to one invocation at a time.
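The shared-resource access just described (a priority queue at the E node, a flag that is set on entry and reset when the X node fires) can be sketched in Python as follows. This is a hedged illustration, not the WebCom implementation: the `ResourceGraph` class and its method names are hypothetical, and the convention that a lower number means a higher priority is an assumption made for the example.

```python
import heapq
import itertools


class ResourceGraph:
    """Hypothetical resource graph guarded at its E node."""

    def __init__(self):
        self.busy = False              # flag set by the local semaphore
        self.current = None            # operand currently inside the graph
        self._queue = []               # heap of (priority, arrival, operand)
        self._arrival = itertools.count()

    def enter(self, operand, priority=0):
        """An operand arrives at the E node and is queued by priority."""
        heapq.heappush(self._queue, (priority, next(self._arrival), operand))
        self._admit()

    def _admit(self):
        # Let one queued operand in, but only if the graph is not in use.
        if not self.busy and self._queue:
            _, _, self.current = heapq.heappop(self._queue)
            self.busy = True

    def exit(self):
        """The X node fires: reset the flag and admit the next operand."""
        finished, self.current = self.current, None
        self.busy = False
        self._admit()
        return finished


resource = ResourceGraph()
resource.enter("o1", priority=2)   # graph is free, so o1 is admitted at once
resource.enter("o2", priority=1)   # queued behind the running invocation
resource.enter("o3", priority=0)   # queued with the highest priority

order = [resource.exit(), resource.exit(), resource.exit()]
print(order)                       # ['o1', 'o3', 'o2']
```

The arrival counter only breaks ties between operands of equal priority, so that such operands are served in arrival order.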
4 Object Creation and Automatic Garbage Collection
Memory management is a major priority for large scale software development. Languages such as Lisp and Java support Automatic Garbage Collection or Implicit destruction of Objects. However, most modern Garbage Collector mechanisms work only sequentially, which is cumbersome when they are deployed across parallel computer architectures. In this section we discuss the theory behind Object Creation and Automatic Garbage Collection in OOCG. Efficient Object Creation and efficient Garbage Collection are critical features of OOCG that help in developing large scale parallel applications.

Condensed Graphs support Object Creation and Implicit Object Destruction or Automatic Garbage Collection. Object Creation can be performed by invoking the Create instruction. This instruction takes the Class Definition as an input Operand and creates an Object as output. The Object produced will be a pointer to the Class, which itself is a Condensed Graph. Fig. 7 illustrates some scenarios of Object creation. Typically, when an Object's X Node fires, the Object is collected automatically. When an Object is created, the corresponding Condensed Graph is pushed onto the Object Stack. Similarly, when an Object is destroyed, the Condensed Graph is popped from the Object Stack. A sample scenario is depicted in Fig. 8, in which an Object A* and its corresponding entry in the stack frame are deleted. Construction and Destruction of an Object are handled by a special Administrative Area within each stack frame. Objects which undergo nested creation appear as a nested path through the Object Stack (Fig. 8). Typically, when the X node of an Object fires, that object as well as all nested objects below it in the Object Stack are automatically garbage collected.

Fig. 7. In this Condensed Graph realisation, object B* has been created and is used to invoke the function func1 with operands a, b and c

Fig. 8. Removing objects from the Stack. Nested association can be observed among the objects.

However, an object is not always deallocated automatically when its own X node fires. In Fig. 9, P represents the parent of A. Since A was created by the Create node, it is not destroyed by A's X Node by default. The reason is that many Condensed Nodes (A*) may exist in the creating parent and, when they fire, their operands will be sent to the same A definition frame. A is only destroyed if no reference to it (i.e., no Condensed Node A*) passes through the X Node of A's parent when that X Node fires. This is also true if A gives rise to further child stack frames. In Fig. 10, C may have to be maintained as Persistent depending on whether or not it will be available to subsequent invocations of B. In this latter case, it too will only be destroyed by A. We can learn from these scenarios that long chains of
stack frames may have to be maintained and that, in this nested association, if a reference to an object passes through its parent, the object can be deleted completely only by its parent.

Fig. 9. Object Creation and the relation of a Stack Frame to its Parent Frame. There could be many A*'s within the Parent Frame.

Fig. 10. The data structure C is linked with its parent B and its grandparent A. It cannot be deleted on its own since its reference has been passed through its parents. It is deallocated only when the X node of the grandparent A fires.

OOCG occasionally supports Object Creation in the Heap. The Creation of an Object in the Heap is different from its Creation in the stack. Objects are created in the Heap on the assumption that they need to be reused frequently or that the value of a condensed node is needed elsewhere in the program. Since Objects created in the Heap are free from the association constraints, Object deletion is comparatively straightforward. The Heap also maintains an array counter, which makes the deletion of an array of Objects easier.

4.1 Explicit Deletion of an Object
So far we have discussed the Implicit Deletion of an Object. Even though Condensed Graphs strongly encourage Implicit Deletion, they support Explicit Deletion as well. This Deletion can normally be performed by invoking the Triple Manager instruction Release on the object. Explicit Deletion of an object is entirely the responsibility of the application developer. The characteristic of this approach is Explicitly Imperative.
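The stack-based collection rules of Section 4 can be summarised in a small Python sketch. It is a simplified interpretation under stated assumptions, not the WebCom mechanism: the `Frame` and `ObjectStack` classes and the `escape` and `x_fires` methods are hypothetical names, and each frame records an "owner", meaning the frame whose X node firing destroys it. A created object is initially owned by its creating parent; when a reference to it passes through the owner's X node, responsibility moves one level up.

```python
class Frame:
    """Hypothetical stack frame for one object."""

    def __init__(self, name, owner=None):
        self.name = name
        self.owner = owner      # frame whose X node firing destroys this one


class ObjectStack:
    """Hypothetical Object Stack with implicit collection."""

    def __init__(self):
        self.frames = []

    def create(self, name, parent=None):
        """Create instruction: push a frame owned by the creating parent."""
        frame = Frame(name, owner=parent)
        self.frames.append(frame)
        return frame

    def escape(self, frame):
        """A reference to `frame` passes through its owner's X node,
        so the owner's own owner becomes responsible for it."""
        if frame.owner is not None and frame.owner.owner is not None:
            frame.owner = frame.owner.owner

    def _depends_on(self, frame, target):
        owner = frame.owner
        while owner is not None:
            if owner is target:
                return True
            owner = owner.owner
        return False

    def x_fires(self, target):
        """The X node of `target` fires: collect every frame it is
        responsible for, plus `target` itself if nothing owns it."""
        doomed = [f for f in self.frames if self._depends_on(f, target)]
        if target.owner is None:
            doomed.append(target)
        self.frames = [f for f in self.frames if f not in doomed]
        return sorted(f.name for f in doomed)


# Scenario of Fig. 10: C was created in B, B in A, and C's reference escapes B.
stack = ObjectStack()
a = stack.create("A")
b = stack.create("B", parent=a)
c = stack.create("C", parent=b)
stack.escape(c)

print(stack.x_fires(b))   # []  (C persists, B is still owned by A)
print(stack.x_fires(a))   # ['A', 'B', 'C']
```

Without the `escape` call, firing B's X node would collect C immediately, which corresponds to the default case in which nested objects are collected together with the enclosing object.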
5 Conclusions and Future Work
The contribution of this paper is the introduction of an enhanced version of Condensed Graphs, Object Oriented Condensed Graphs. OOCG is a unified model that combines the Condensed Graphs methodology with object orientation and
this leverages the advantages of both paradigms. This unified computational model is beneficial for the development of Large Scale Parallel Systems since the CG aspects allow the developer to think about parallelism and the Object Oriented aspects allow the developer to address the large scale concepts. OOCG provides implicit support for Synchronisation and Garbage Collection, which helps the development of concurrent software. The Object concept and features such as Encapsulation and Inheritance enhance the Reusability and Maintainability of OOCG. The existing implementation of Condensed Graphs has been extended to implement these features. Future work includes the integration of the OOCG model into the existing Condensed Graph Integrated Development Environment, WebCom-GIDE [3], which is a development tool that enables visual application development and optimisation. This would help to implement and deploy concurrent software systems in an integrated environment.

Acknowledgments. This work is supported by the Enterprise Ireland Informatics Research Initiative and by Science Foundation Ireland.
References
1. Morrison, J.P.: Condensed Graphs: Unifying Availability-Driven, Coercion-Driven and Control-Driven Computing. PhD Thesis, Eindhoven (1996)
2. Morrison, J.P., Clayton, B., Power, D.A., Patil, A.: WebCom-G: Grid Enabled Metacomputing. The Journal of Neural, Parallel and Scientific Computation. Special issue on Grid Computing 2004(12), 419–438 (2004). Guest Editors: Arabnia, H.R., Gravvanis, G.A., Bekakos, M.P.
3. Morrison, J.P., John, S., Power, D.A., Cafferkey, N., Patil, A.: A Grid Application Development Platform for WebCom-G. In: IEEE Proceedings of the International Symposium of the Cluster and Grid Computing, CCGrid 2005, Cardiff, United Kingdom (2005)
4. Morrison, J.P., John, S., Power, D.A.: Supporting Native Applications in WebCom-G. In: Juhasz, Z., et al. (eds.) Distributed and Parallel Systems Cluster and Grid Computing Series: The Kluwer International Series in Engineering and Computer Science, vol. 777 (September 2004)
5. Morrison, J.P., Kennedy, J.J., Power, D.A.: A Condensed Graphs Engine to Drive Metacomputing. In: Proceedings of the international conference on parallel and distributed processing techniques and applications (PDPTA 1999), Las Vegas, Nevada, June 28 - July 1 (1999)
6. Morrison, J.P., Power, D.A., Kennedy, J.J.: An Evolution of the WebCom Metacomputer. The Journal of Mathematical Modelling and Algorithms: Special issue on Computational Science and Applications 2, 263–276 (2003)
7. Mulcahy, B.P., Foley, S.N., Morrison, J.P.: Cross Cutting Condensed Graphs. In: International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), Las Vegas, Nevada, USA, June 27–30, vol. 3, pp. 965–973 (2005)
8. Morrison, J.P., Healy, P.D., O'Dowd, P.J.: Architecture and Implementation of a Distributed Reconfigurable Metacomputer. In: International Symposium on Parallel and Distributed Computing, Ljubljana, Slovenia, October 13-17 (2003)
9. Niu, J., Zou, J., Ren, A.: OOPN: Object-oriented Petri Nets and Its Integrated Development Environment. In: Proceedings of the Software Engineering and Applications, SEA 2003, Marina del Rey, USA (2003)
10. Lakos, C.A.: Object Oriented Modelling with Object Petri Nets. Advances in Petri Nets. LNCS. Springer, Berlin (1997)
11. English, S.L.: Colored Petri Nets for Object Oriented Modeling. Ph.D. Dissertation, University of Brighton (June 1993)
12. Murata, T.: Petri Nets: Properties, Analysis and Applications. Proc. of the IEEE 77(4), 541–580 (1989)
13. Hong, J.E., Bae, D.H.: HOONets: Hierarchical Object-Oriented Petri Nets for System Modeling and Analysis. KAIST Technical Report CS/TR-98-132 (November 1998)
14. Buchs, D., Guelfi, N.: CO-OPN: A Concurrent Object Oriented Petri Net approach. In: 12th Int. Conf. on Application and Theory of Petri Nets, Aarhus, pp. 432–454 (1991)
15. Meyer, B.: Object-Oriented Software Construction, 2nd edn. Prentice Hall, Englewood Cliffs (1997)
MASIPE: A Tool Based on Mobile Agents for Monitoring Parallel Environments
David E. Singh, Alejandro Miguel, Félix García, and Jesús Carretero
Universidad Carlos III de Madrid, Computer Science Department, Spain
Abstract. In this work MASIPE, a tool for monitoring parallel applications, is presented. MASIPE is a distributed tool that gives support to user-defined mobile agents, including functionalities for creating and transferring these agents through different compute nodes. In each node, the mobile agent can access the node information as well as the memory space of the parallel program that is being monitored. In addition, MASIPE includes functionalities for managing and graphically displaying the agent data. In this work, its internal structure is detailed and an example of a monitored scientific application is shown. We also perform a study of the MASIPE requirements (in terms of CPU and memory) and we evaluate its overhead during the program execution. Experimental results show that MASIPE can be efficiently used with minimum impact on the program performance.
1 Introduction
Nowadays, the increasing computational power of commodity computers and interconnection networks allows parallel applications to be executed efficiently on low-cost platforms. Commodity computers can be organized into clusters, which are usually specifically designed for CPU-intensive computing. For these configurations, there are several tools [1,2,3,4] that provide solutions for system management. However, commodity computers can also be organized in labs or computer rooms, usually as single-user, stand-alone computers. In this scenario, CPU-intensive computing is a secondary task (habitually performed during nighttime). Under these circumstances, the use of cluster management tools is complicated, given that they can interfere with the installed software. For example, it would be necessary to explicitly start the monitoring tools in each compute node or to perform a system reboot. Consequently, in this kind of environment it is more interesting to use non-intrusive tools for monitoring parallel programs.

The work presented in this paper addresses this problem by means of a new monitoring tool called Mobile Agent Systems Integration into Parallel Environments (MASIPE). This tool is designed for monitoring parallel applications by means of mobile agents. MASIPE provides a platform for executing agent-based programs in each computing node where the parallel program is being executed. The agent is executed sharing the same memory space as the parallel program; thus it is able to read program variables and modify their values. In addition, the mobile agent includes code for executing user-defined operations on each compute node. All these operations are performed asynchronously without interrupting or modifying the normal program execution. The agent is able to obtain information about the compute node (memory and CPU consumption, network use, etc.). All this information (related both to each process of the parallel program and to each compute node) is collected in the agent's private memory space, transferred along the nodes of the agent itinerary and finally stored and visualized in the front-end GUI. MASIPE can be integrated with any kind of C, C++ or Fortran parallel program for distributed memory (MPI) or shared memory (OpenMP) architectures. Our tool is freely distributed. More information about MASIPE can be found in [5].

The structure of this paper is as follows. In Section 2 a description of the internal structure of each element of MASIPE is given. Section 3 presents an example of its use for monitoring a parallel application that simulates the emission, transport and deposition of pollutants produced by a power plant. Next, Section 4 evaluates the impact of MASIPE on the program performance, taking into account both memory and CPU requirements. Section 5 analyzes the related work and finally, Section 6 presents the main conclusions of this work.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 870–879, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 MASIPE Internal Structure
Due to the massive expansion of Internet usage, distributed computation has become one of the most common programming paradigms. New alternatives have appeared; some examples are grid architectures, peer-to-peer systems, remote method invocation, web services, network services and mobile agent systems. MASIPE is a distributed mobile agent system that provides support for transferring and executing mobile agents in different compute nodes in an autonomous way. The communications between the compute nodes are implemented using the CORBA middleware. More specifically, we have adopted the MASIF [6] specification proposed by OMG. This specification contains a description of the methods and interfaces of the agent-based architecture. Figure 1 shows a diagram of the MASIPE architecture as well as the relationships between its elements.

Fig. 1. Diagram of MASIPE distributed and internal structure

MASIPE consists of three main components:

– Agent System (AS). This module receives the mobile agent, deserializes it, extracts its data and executes its associated code. When the agent execution concludes, AS collects and packs the agent code plus the processed data, serializes it and sends it to the following compute node (where another AS is running). AS is attached to each parallel program process, sharing its memory space. This property allows the mobile agent to read or modify the data structure contents (variables or arrays) of the parallel program that is being monitored.
– Register Service (RS). This component has the task of registering AS and agents as well as providing their location, that is, the IP address of each element. Using this information, when a mobile agent arrives at an AS, the agent itinerary (stored with its data) is read in order to obtain the following AS to which the agent can be transferred.
– User Application (UA). This module consists of three different elements: control unit, ending AS and GUI. The control unit creates the agent, which consists of a header and a payload. The header contains the control data (including the agent itinerary). The payload contains the user-defined code (which performs user-defined operations) and its associated data. The control unit packs all this information (both code and data) and sends it to the first AS. Then, the agent is transferred along all the AS included in its itinerary. Finally, the ending AS receives the agent and unpacks all the collected data. These data are shown to the user by means of the GUI. The User Application includes the following additional functionalities: it allows defining the system topology (agent itinerary); it performs agent tracking; it supports sending different types of agents periodically, at a defined frequency; it receives, manages and displays the agent data; and it includes policies for log generation.

All the communication operations described above (with the AS, RS and UA) are performed using pre-defined interfaces that are declared according to the MASIF specification [6]. Despite this complex structure, the use of MASIPE is quite simple. Both the RS and UA modules are executed independently as stand-alone programs. The AS module is provided as a library that is linked with the parallel program to be monitored. In addition, an entry point has to be defined in the parallel program. This point represents a call to the AS routine that starts executing the monitoring service. Figure 2 shows an example of its use in a Fortran program.

        First stage: add parameters
    L1  call AS (0, 0, myrank)
    L2  call AS (0, 1, it)
    L3  call AS (0, 2, acum_time)
        Second stage: start agent system
    L4  call AS (1, 0, 0)

Fig. 2. Example of an Agent System call in a Fortran program

In the first stage, the user provides the program data structures that will be accessed by the agent. This task is performed in lines L1, L2 and L3, where each call to the AS routine receives a pointer to one of three different program variables (see footnote 1). In this example, the variables myrank, it and acum_time are monitored by the agent. A zero value in the first argument indicates that the AS routine has to store the given pointer value (third argument) in an internal data structure, using the integer given by the second argument as an index. In the second stage, the Agent System is started by calling the AS routine with a value of one in the first argument (line L4). Internally, a new thread is started and the code of the AS routine is executed concurrently with the original process.
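The two-stage protocol of the AS routine (register variables by index, then start a concurrent service thread) can be sketched in Python. This is a hedged illustration only: the `AgentSystem` class, its `inbox` and `outbox` queues and the `None` shutdown marker are assumptions made for the example and are not part of the MASIPE library. One-element lists stand in for the Fortran variables passed by reference, and an agent is modelled as a plain callable that receives the registry.

```python
import queue
import threading


class AgentSystem:
    """Hypothetical stand-in for the AS library routine."""

    def __init__(self):
        self.registry = {}            # index -> shared mutable cell
        self.inbox = queue.Queue()    # agents arriving at this node
        self.outbox = queue.Queue()   # data collected by executed agents
        self.thread = None

    def call(self, mode, index, cell):
        if mode == 0:
            # First stage: remember a reference to a program variable.
            self.registry[index] = cell
        elif mode == 1:
            # Second stage: run the service concurrently with the program.
            self.thread = threading.Thread(target=self._serve, daemon=True)
            self.thread.start()
        else:
            raise ValueError("mode must be 0 (register) or 1 (start)")

    def _serve(self):
        while True:
            agent = self.inbox.get()
            if agent is None:         # shutdown marker (an assumption)
                break
            self.outbox.put(agent(self.registry))


# One-element lists play the role of variables passed by reference.
myrank, it, acum_time = [3], [0], [0.0]

agent_system = AgentSystem()
agent_system.call(0, 0, myrank)
agent_system.call(0, 1, it)
agent_system.call(0, 2, acum_time)
agent_system.call(1, 0, None)

it[0] = 7                             # the program keeps updating its data


def read_all(registry):
    """A trivial agent that copies every registered value."""
    return {index: cell[0] for index, cell in registry.items()}


agent_system.inbox.put(read_all)
snapshot = agent_system.outbox.get(timeout=5)
agent_system.inbox.put(None)
agent_system.thread.join(timeout=5)
print(snapshot)                       # {0: 3, 1: 7, 2: 0.0}
```

Because the agent receives references rather than copies, it observes the value of `it` at the moment it runs, and it could equally overwrite a cell to modify the monitored program's state.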
3 Case Study: Monitoring STEM-II Parallel Application
In this section we give an overview of the MASIPE functionalities when used in a real parallel computing environment. More specifically, our proposal was used for monitoring the STEM-II application. STEM-II is a 3D grid-based model that simulates SOx/NOx/RHC multiphase chemistry, long-range transport, and dry plus wet acid deposition. The application calculates the distribution of pollutants in the atmosphere from specified emission sources, such as cities, power plants or forest fires, under particular meteorological scenarios. Predicting the behavior of atmospheric pollutants requires simulating a large set of phenomena, including diffusion, chemical transformations, advection, emission and deposition processes. This model was successfully used to control the emissions of pollutants produced by the Endesa power plant of As Pontes (Spain). In addition, STEM-II was chosen as a case study in the European CrossGrid project, which shows its relevance for the scientific community from an industrial point of view as well as its suitability for high performance computing. The STEM-II code structure is complex, consisting of more than 150 functions and 13,000 lines of Fortran code. STEM-II performs an iterative simulation and has a multiply nested loop structure. The outermost loop is the temporal loop that controls the simulation time. Inner loops traverse the mesh dimensions, computing for each cell (associated with a 3D volume element) the chemical processes and the transport of the pollutants. This task is performed in parallel, using the MPI standard library for implementing communications. A detailed description of the parallel program structure can be found in [7].

¹ Note that in Fortran an argument variable of a routine is internally treated as a pointer. In C programs, pointers should be used instead.

D.E. Singh et al.

For this scenario, two different agents were designed. The first type of agent (called A-Type) is used for tracking different STEM-II parameters. More specifically, we have considered the following parameters:
1. Rank identification (myrank). This parameter is the MPI rank associated with each process.
2. Iteration (it). It corresponds to the iteration number of the outermost loop. Each loop iteration corresponds to one minute of simulation time.
3. Accumulated time (acum_time). It accumulates the execution time of the rxn routine, which performs the chemical simulation in each mesh node. rxn is the most time-consuming routine of the program (70% of the complete program execution). Given that STEM-II is iterative, this routine is executed in each time step (outermost loop). The acum_time variable accumulates the execution time of rxn across iterations.
An A-Type agent includes logic for reading, checking and modifying these parameters. In the case of acum_time, the mobile agent includes a threshold (with a value specified and initialized by the agent). When the agent is executed, it tests whether acum_time has reached this threshold. If so, the agent resets the variable; otherwise, its value is kept. In addition, the values of all these parameters are stored with the agent data and later transferred to the UA for visualization. A second agent, called B-Type, was designed for collecting information about the compute nodes.
Several features were collected, for example the memory, CPU and disk-swap usage. In our experiments, we executed STEM-II in a computer room of AMD Athlon(tm) XP 1700+ computers with 1.5 GB of RAM, interconnected by a 100 Mbps Fast Ethernet network. The operating system is Linux Sarge 2.6.13. STEM-II was compiled with the LAM 7.1.1 MPI-2 library and was executed using seven compute nodes. Regarding MASIPE, we used the Mico 2.3.12 CORBA distribution. We used one extra node for hosting the CORBA name service and the Register Service. Another extra node was used for executing the User Application component. Taking advantage of the MASIPE infrastructure, A-Type and B-Type agents were periodically sent for monitoring both program and computer parameters. Figure 3(a) shows the it value captured by the A-Type agent. The X axis represents the agent execution number (agent iteration); that is, each A-Type agent is assigned a new iteration number when it completes its assigned itinerary. In each agent iteration, seven STEM-II it values are captured (one for each compute node). Note that these values are read asynchronously by the agent, which monitors the parallel application without interfering with the program execution. Figure 3(b) shows the memory usage values captured by the B-Type agent. In this case, the agent captures this information in each compute node.

MASIPE: A Tool Based on Mobile Agents

Fig. 3. Examples of captured data using A-Type and B-Type agents. (a) STEM-II iteration (it) value for Nodes 0 to 6, plotted against the A-Type agent iteration number. (b) Memory usage in Mbytes (total, free and used memory; total and free swap) for each compute node.

Fig. 4. Example of data capture and visualization using MASIPE GUI. The plot shows the STEM-II acum_time value against the A-Type agent iteration number.

Fig. 5. Evaluation of MASIPE overhead on STEM-II execution time. The plot shows the normalized acum_time of Case B and Case C against the STEM-II iteration number.
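As an illustration of the kind of node information a B-Type agent gathers, the sketch below extracts the memory and swap figures plotted in Figure 3(b) from text in the Linux /proc/meminfo format. The paper does not say how MASIPE obtains these values, so this data source is an assumption, as are the function name parse_meminfo and the sample numbers (which are invented, not measured).

```python
def parse_meminfo(text):
    """Return selected /proc/meminfo fields converted from kB to MB."""
    wanted = {"MemTotal", "MemFree", "SwapTotal", "SwapFree"}
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name in wanted:
            # Lines look like "MemTotal:   1536000 kB".
            values[name] = int(rest.split()[0]) / 1024.0
    values["MemUsed"] = values["MemTotal"] - values["MemFree"]
    return values


# Invented sample in the /proc/meminfo layout, used only to exercise the parser.
SAMPLE = """MemTotal:        1536000 kB
MemFree:          512000 kB
SwapTotal:       1024000 kB
SwapFree:        1024000 kB
"""
report = parse_meminfo(SAMPLE)
```

On a real Linux node the same function could be fed the contents of /proc/meminfo; an agent would then attach the resulting dictionary to its data before moving to the next node.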
All values shown in Figure 3 were stored on disk by the User Application (after the agents' execution) and subsequently read and displayed graphically. In addition, the User Application includes a GUI for visualizing the captured data. Figure 4 shows an example of this visualization for the acum_time variable. Again, an A-Type agent was sent periodically, so the X axis represents the agent iteration number. In this case, however, the acum_time variable is modified according to a given threshold (with a value of 15). In the figure we can observe that the variable is reset when it reaches the threshold. Note that each processor takes a different time to execute the rxn routine, so acum_time increases at a different rate in each node.
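The read, check and reset behaviour of the A-Type agent can be sketched in a few lines. This is a simplified Python model, not the agent code: the function visit, the dictionary standing in for the program memory, and the per-node growth rates are hypothetical; only the threshold value of 15 comes from the text.

```python
THRESHOLD = 15.0  # value quoted in the text for Figure 4


def visit(node_vars, threshold=THRESHOLD):
    """One agent visit: record acum_time, then reset it if the threshold is reached."""
    observed = node_vars["acum_time"]
    if observed >= threshold:
        node_vars["acum_time"] = 0.0
    return observed


# Hypothetical growth of acum_time on three nodes between two agent visits.
rates = [2.0, 3.0, 5.0]
nodes = [{"acum_time": 0.0} for _ in rates]
history = []
for _ in range(6):                  # six agent iterations
    for node, rate in zip(nodes, rates):
        node["acum_time"] += rate   # the program accumulates rxn time
    history.append([visit(node) for node in nodes])
```

Each row of history holds the values one agent iteration would report. Since the nodes accumulate time at different rates, they reach the threshold and are reset at different iterations, which gives the sawtooth pattern described for Figure 4.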
4
Performance Evaluation
We used the computing environment described in the previous section to evaluate the impact of MASIPE on the performance of the parallel program. To cover the relevant situations, three different scenarios were considered:
1. Case A: STEM-II is executed as a stand-alone program. The MASIPE tool is not used. This is the reference scenario.
2. Case B: STEM-II is executed in combination with MASIPE, but no agents are executed. This scenario evaluates the impact of the Agent System on STEM-II performance.
3. Case C: STEM-II is executed in combination with MASIPE. In addition, an agent is sent periodically with the maximum frequency that the system can achieve. This scenario evaluates a fully operational MASIPE execution environment.
In all these cases, STEM-II is executed on seven compute nodes. In Case C we use an A-Type agent that performs an endless round trip across all the ASs. We modified STEM-II so that the performance measurements could be collected directly from the program in all the scenarios. Figure 5 shows the impact of MASIPE on the rxn subroutine. Results are normalized by Case A. We can see that using MASIPE without agents implies an increment of around 10% in the rxn execution time. When agents are executed periodically, this overhead reaches up to 20% of the rxn time. Note that we consider only the rxn routine because, firstly, it is a compute-intensive routine with no I/O or communication operations (we want to evaluate the CPU overhead) and, secondly, it is the most time-consuming part of the program. Figure 6 shows the one-minute load average for the three cases. This magnitude is the average number of processes in the Linux run queue marked as running or uninterruptible during the last minute.
The load average differs from the CPU percentage in two significant ways: 1) the load average measures the trend in CPU utilization, not only an instantaneous snapshot, as the percentage does, and 2) the load average includes all demands for the CPU, not only how much was active at the time of measurement. In this figure we can see that the MASIPE load average overhead is minimal when no mobile agents are employed. Otherwise, the load average increases by one unit, which means that there are two CPU-consuming processes in execution². Figure 7 shows the number of active processes (including non-CPU-demanding processes) for each of the considered cases. Again, the use of the MASIPE infrastructure increases this number by two processes. In contrast, when mobile agents are used continuously, the number of active processes increases from 114 up to 142. This increment is not related to the MASIPE or mobile agent implementation but to the CORBA internal structure, in which multiple threads are created automatically. The memory requirements of STEM-II (Case A) are 350 MB in each compute node. Starting the ASs (Case B) consumes 8 MB of extra memory (an increment of about 2%). When a mobile agent is fully operational (Case C), this extra memory rises to 57 MB (a 16% increment with respect to Case A). Taking into account the amount of memory installed in commodity computers, the memory requirements of MASIPE for this case study are small.

² Note that although the number of processes in execution increases, the whole application execution time is not doubled, as we can see in Figure 5.

Fig. 6. Evaluation of MASIPE overhead on load average. The plot shows the one-minute load average of Cases A, B and C against the STEM-II iteration number.

Fig. 7. Evaluation of MASIPE overhead on number of active processes. The plot shows the number of active processes of Cases A, B and C against the STEM-II iteration number.
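The percentages quoted in this section follow from simple ratios against the Case A baseline. The short Python check below reproduces the memory figures and shows the normalization used for Figure 5; the helper names and the timing values passed to normalise are illustrative assumptions, while 350, 8 and 57 MB are the values reported in the text.

```python
def percent_increase(extra, baseline):
    """Extra consumption expressed as a percentage of the baseline."""
    return 100.0 * extra / baseline


def normalise(value, reference):
    """Express a measurement relative to the Case A reference."""
    return value / reference


BASELINE_MB = 350.0                            # STEM-II alone (Case A)
case_b = percent_increase(8.0, BASELINE_MB)    # ASs started, no agents
case_c = percent_increase(57.0, BASELINE_MB)   # agent fully operational

# Hypothetical rxn timings: a 10% slowdown gives a normalized value of 1.1.
normalised_b = normalise(11.0, 10.0)
```

The results, about 2.3% and 16.3%, agree with the rounded figures of 2% and 16% given above.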
5
Related Work
There are many cluster administration tools. One well-known example is Rocks [1], a management tool that performs a complete operating system installation and its subsequent administration and monitoring. Other relevant schemes for system monitoring are the Ka-admin project [2], Vampir [3] and Paraver [4]. However, none of them includes functionalities for monitoring parallel applications by accessing their memory space. Examples of agent systems are Jade [8] and Mobile-C [9]. Agent mobility in Jade is achieved through Java object serialization, but the Java implementation limits the integration with C/C++/Fortran codes. Mobile-C uses ACL messages for mobile agent migration, which is an alternative to our CORBA-based implementation. In [10] an instrumentation library for message-passing parallel applications, called GRM, is presented. GRM collects trace data from parallel applications in a way similar to our scheme. However, there are important differences in the tool architecture: in GRM, all the ASs send the data to the User Application, which introduces contention risks at this point. The integration with Mercury Monitor presented in that paper does not solve this drawback. In contrast, we use a mobile agent programming paradigm that reduces these risks, given that the agent traverses the different ASs. The agent itinerary is user-defined and can be changed dynamically (by adding decision logic to the agent). In addition, with MASIPE it is possible to introduce newly designed agents without changing (or even restarting) the tool. This strength gives MASIPE a broad range of uses. Finally, CORBA allows our tool to be used on heterogeneous platforms.
6
Conclusions
MASIPE is a distributed application that supports user-defined mobile agents, including functionalities for creating the mobile agent and transferring it along the compute nodes of its itinerary. In each node, the mobile agent can access the node information as well as the memory space of the parallel program. In addition, MASIPE includes functionalities for managing the agent data. MASIPE allows the integration of user-defined code (included in the mobile agent) with a generic parallel or sequential program. This process requires only a single function call to the ASs (provided as a dynamic library). The mobile agent can not only read the program data but also modify it. This opens a wide range of possibilities for dynamic interaction between the mobile agent and the program: checkpointing (the mobile agent collects and stores the critical data of the program), convergence checking (the mobile agent evaluates the numerical convergence of iterative methods and can even introduce corrections), evaluation of the program and system performance, etc. These interactions are performed asynchronously, without interfering with the monitored program. The tool was designed using MICO CORBA and Java IDL, following the MASIF specification, which strengthens its usability on heterogeneous platforms (including computer rooms, clusters and grid environments). The installation requirements are minimal: only a single dynamic library (the AS component) is needed in each compute node. No administrator privileges are needed for its use. MASIPE was successfully applied to monitor the STEM-II parallel application. Several tests were made, including reads and writes in the application memory space as well as capturing information from each compute node. Experimental results show that the MASIPE overhead is small in terms of both CPU and memory consumption. As future work we plan to develop more MASIPE functionalities, such as new database functionalities for managing the data collected by the mobile agents (especially in checkpoint operations), improving the GUI for dynamically visualizing the distributed data (obtaining a snapshot of the simulation), introducing new commands for controlling the program behavior (I/O operations, starting and resuming the program), and applying the tool to grid platforms.
References
1. Sacerdoti, F.D., Chandra, S., Bhatia, K.: Grid systems deployment & management using Rocks. In: CLUSTER 2004: Proceedings of the 2004 IEEE International Conference on Cluster Computing, pp. 337–345 (2004)
2. Augerat, P., Martin, C., Stein, B.: Scalable Monitoring and Configuration Tools for Grids and Clusters. In: 10th Euromicro Workshop on Parallel, Distributed and Network-Based Processing, pp. 147–153 (2002)
3. Vampir, http://www.vampir-ng.de
4. Paraver: the flexible analysis tool, http://www.cepba.upc.edu/paraver
5. Mobile Agent Systems Integration into Parallel Environments (MASIPE), http://www.arcos.inf.uc3m.es/~masipe
6. OMG Mobile Agent Facility Specification (2007), http://www.omg.org/technology/documents/formal/mobile_agent_facility.htm
7. Martín, M.J., Singh, D.E., Mouriño, J.C., Rivera, F.F., Doallo, R., Bruguera, J.D.: High performance air pollution modeling for a power plant environment. Parallel Computing 29, 11–12 (2003)
8. Bellifemine, F., Poggi, A., Rimassa, G.: Developing Multi-agent Systems with JADE. In: Castelfranchi, C., Lespérance, Y. (eds.) ATAL 2000. LNCS (LNAI), vol. 1986, pp. 42–47. Springer, Heidelberg (2001)
9. Chen, B., Cheng, H.H., Palen, J.: Mobile-C: A mobile agent platform for mobile C/C++ agents. Software: Practice and Experience 36, 1711–1733 (2006)
10. Podhorszki, N., Balaton, Z., Gombás, G.: Monitoring Message-Passing Parallel Applications in the Grid with GRM and Mercury Monitor. In: European Across Grids Conference, pp. 179–181 (2004)
Geovisualisation Service for Grid-Based Assessment of Natural Disasters
Peter Slížik and Ladislav Hluchý
Institute of Informatics, Slovak Academy of Sciences, Dúbravská cesta 9, 845 07 Bratislava, Slovakia
[email protected], [email protected]
Abstract. The Medigrid Geovisualisation Service brings together the power of Grid computation and the flexibility of web mapping. The Service provides visualisation of the simulations of natural disasters, such as floods, forest fires, and soil erosion. It was developed for the Medigrid project and can be used from the project web portal. The user is able to zoom to the specific area and watch the progress of the event in time. Being web-based, the Service is universal and user friendly.
1
Introduction
Data visualisation, with its ability to provide a rich and vivid user experience, is an inherent part of many computational applications. Geographical systems are an obvious example. Virtually every computation dealing with spatially referenced data produces its outputs in the form of maps. This paper describes the Geovisualisation Service developed for the Medigrid project. After a short introduction to the project, the paper discusses the architecture of the service, the visualisation engine, and the supported data formats. The paper presents some example visualisations provided by the Geovisualisation Service. For aesthetic reasons, they have been scattered throughout the text rather than collected at its end. However, they are not referenced from the text.
2
The Medigrid Project
Medigrid¹ is a European RTD project (2004–2006) whose objective was to develop a framework for multi-risk assessment of natural disasters and to integrate models developed in previous projects [1]. The antecedent projects provided several excellent simulation models: forest fires, floods, landslides, and soil erosion [2]. However, some of the models were interactive-only (i.e., incapable of being run in a batch environment); others were simply not grid-aware. In order to be able to run in the Grid infrastructure, they had to be gridified. The task was made more difficult by the different requirements of the applications: as they had been developed independently of one another, some are sequential and some parallel, and they run on both Unix and Windows operating systems. The state-of-the-art Grid infrastructure (particularly the GridFTP service) supports data exchange between Unix applications only; data communication with Windows-based applications had to be implemented from scratch. The project team consisted of six partner institutions from France, Greece, Portugal, Slovakia, Spain, and the United Kingdom. Pilot areas in Spain, France, Portugal, and Greece were specified for the purposes of developing and testing the software. Built on web service technology, the Medigrid project can be viewed as a set of cooperating, loosely coupled services. The following services form the core of Medigrid; they are accompanied by custom services that launch the user applications.
Data Management Service provides means for platform-independent management of data file sets.
Data Transfer Service makes it possible to copy data between the cooperating services. Basically, it provides the same functionality as GridFTP, with the advantage of supporting the Windows platform.
Metadata Catalogue Service supports publishing, discovery, and access to metadata for large-scale data sets.
Job Management Service provides automated platform-independent management of the applications running in the system.
Geovisualisation Service presents the simulation results in the form of maps. The Geovisualisation Service is the subject of the rest of this paper.
An essential part of the system is its Distributed Data Repository, a decentralised storage for both the input digital maps and the outputs produced by the simulations. The repository provides data to all models, using data conversions where necessary. The Medigrid Web Portal is the central project management tool. It enables the users to access the different parts of the system from any place with an Internet connection.

¹ MEDIGRID: EU 6FP RTD SustDev Project, Mediterranean Grid of Multi-Risk Data and Models (2004–2006), GOCE-CT-2003-004044 (Call FP6-2003-Global-2, STREP).

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 880–887, 2008. © Springer-Verlag Berlin Heidelberg 2008
3
Architecture Overview
The web service architecture on which the Medigrid project is built allows seamless connection of services implementing different tasks, thus creating complex computational chains. The services can be joined automatically or manually, depending on the task. The only limitation is the data dependencies. Visualisation services are usually the last ones in the typical chain of a simulation system. The Medigrid Geovisualisation Service brings together the advantages of Grid computing and geospatial web services, as defined by the Open Geospatial Consortium [3]. The Consortium's WMS and WFS specifications define the standards for web mapping services, i.e., they describe how web applications should communicate with mapping servers. The Geovisualisation Service is built on a client-server architecture. The service consists of two relatively independent parts, tied together only by data dependencies. This split is necessary to ensure a quick response of the interface to the user's actions. The first, non-interactive part is responsible for preparing the simulation outputs (which can be in application-dependent formats) for the rendering part. It parses the input files, performs the necessary conversions, prepares colour palettes, generates templates, etc. It can be launched from the Medigrid Portal in the same way as the other services. The second, interactive part provides the user interface. It consists of two components: a server-side component responsible for rendering the data, and a client-side component. The client is invoked by the user from a web browser. The standard web client is a part of the Medigrid portal.

Fig. 1. The visualisation of a flood simulation on the Váh river, Slovakia. The water has poured out of the river bed and endangers the village. The intensity of colour (originally in shades of blue) shows the depth of the water. The colour turns to black in the river bed.
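To make the client-server exchange more concrete, the sketch below builds a WMS GetMap request of the kind a web client could send to the rendering component. The query parameters are the standard ones of WMS 1.1.1; the server address, the layer names and the idea of exposing one layer per simulation time step are hypothetical and are not taken from the Medigrid implementation.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def getmap_url(base_url, layers, bbox, width, height,
               srs="EPSG:4326", image_format="image/png"):
    """Build a WMS 1.1.1 GetMap URL for the given layers and extent."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": ",".join(layers),
        "STYLES": "",                            # default style for every layer
        "SRS": srs,
        "BBOX": ",".join(str(v) for v in bbox),  # minx,miny,maxx,maxy
        "WIDTH": str(width),
        "HEIGHT": str(height),
        "FORMAT": image_format,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical server and layer names, used for illustration only.
url = getmap_url("http://example.org/cgi-bin/mapserv",
                 ["terrain", "flood_step_03"],
                 (17.5, 48.5, 18.0, 49.0), 800, 600)
query = parse_qs(urlsplit(url).query, keep_blank_values=True)
```

Zooming and panning then amount to requesting a different BBOX, and stepping through the simulation could be done by asking for a different time-step layer, so the client itself can stay thin.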
4
The Choice of a Mapping Platform
Geographic Information Systems (GIS) are software systems for displaying, managing, and editing spatially referenced data. Some GISes are capable of complex data analyses. In contrast, so-called map servers focus only on data rendering. Designed for embedding their outputs into web portals, map servers communicate with clients using the HTTP protocol. The choice of clients is rich, ranging from simple web forms, through JavaScript applications, to stand-alone clients. The choice of a mapping engine for the project was a decision of profound importance, since nearly every part of the system is influenced by it. As the number of possible alternatives is high, the candidates were assessed against a set of requirements. Among others, the following properties were considered: implementation platform; complexity of installation, configuration, and maintenance; user and developer support; and availability of client software. The following four systems made the shortlist:
MapServer is an open-source, multi-platform framework for the development of spatially enabled web applications [4]. Its functionality is configured in a single configuration file called a mapfile. Many vector and raster data formats are supported through the third-party OGR [5] and GDAL [6] libraries. The package also supports the WMS, WFS, and WCS specifications of the Open Geospatial Consortium [3].
Deegree toolkit [7] is a reference implementation of the WMS, WFS, WCS, and WTS services. Deegree aims to be the most complete Java implementation of the Open Geospatial Consortium standards. Its support for different data formats is slightly limited compared to the other packages.
GeoTools is an open-source Java toolkit for developing OGC-compliant GIS solutions [8]. It is a library providing methods for the manipulation of geospatial data; its modular architecture allows extra functionality to be added or removed easily. It can be used to develop both desktop- and server-based geospatial applications and services. However, the effort to make the library extensible in every possible way has led to a rather complex architecture.
GeoServer is an initiative aiming to provide an open-source mapping solution written in Java [9]. Built on top of the GeoTools library, it profits from its robust design and richness of features. GeoServer is one of the most complete WMS, WFS, and WCS implementations, with full support for SLDs and for the most common data formats.
All listed applications except MapServer are written in Java. Java-based solutions are preferred, as the whole Medigrid project is written in Java. The installation complexity also slightly disfavours MapServer, since it depends on a number of support libraries. Ease of configuration strongly favours MapServer with its single-file configuration. The complex architectural model of GeoTools makes the development of small and medium-sized applications exceedingly awkward. Community support for developers seems to be an issue with the Deegree project. Among the evaluated solutions, the biggest support community has formed around MapServer. The availability of client software does not seem to be an issue here, as all the considered servers support the WMS and WFS standards. Ease of configuration and a simple programming model, together with strong support and good documentation, finally prevailed and led to the choice of MapServer as the mapping engine.
Fig. 2. The visualisation of a forest fire simulation. The different shades of the background represent the different types of vegetation (grass, shrubs, trees, etc.). The small circles show the progress of the fire. The spread of the fire corresponds to the areas of grass cover.
5
Input and Output Data
The data that the service works with differ according to the simulations being computed; however, they share a common core. Typically, each visualisation uses a background image that serves as a reference to which other map elements are related. The outputs of the simulations are rendered on this background, sometimes assisted by another reference mesh.
Terrain texture is an image (usually an orthophotomap) used as a background for referencing other map elements.
Reference structure is a mesh of polygons or a set of points that the simulation algorithm actually works with. Sometimes it is useful to draw this reference structure on the map, for instance to show which of the considered elements were affected by the disaster. Even if it is not rendered, the visualisation service may rely on this structure to produce the final maps.
Simulation results are the actual (numerical) results of the simulations. They almost always represent time steps showing the development of the disaster in time. The outputs of proprietary models are often delivered in proprietary data formats, which has to be taken into account.
Terrain model is not required by the Geovisualisation Service, but it is mentioned here because it can be used as a grey-shaded background (texture), and it is also very important for 3D visualisations. Terrain models are usually obtained by remote sensing technology. They are referred to by the acronyms DEM (Digital Elevation Model) and DTM (Digital Terrain Model).
5.1
Data Formats
The following paragraphs describe the data formats that the Geovisualisation Service uses for exchanging data, both with the rest of the Medigrid software and internally between its interactive and non-interactive parts. Thanks to the multitude of data formats supported by MapServer, the list can easily be extended to support other simulation models.
Vector Data Formats. Vector data formats represent geographical entities as sets of points, lines, and polygons in a resolution-independent way [10]. MapServer's default data format is the ESRI Shapefile; other vector data formats are either built in or supported by third-party libraries.
ESRI Shapefile (SHP) was created by the Environmental Systems Research Institute [11]. It consists of three files: a main file (describing the shapes), an index file, and a dBASE table (for attributes). The ESRI Shapefile is one of the most popular geographical formats, and many applications provide built-in access to it. The complete description of the format is available in [12].
In addition to the ESRI Shapefile, the Service supports a few proprietary formats used by the flood and forest fire models. Though usually profoundly different in terms of syntax, the custom formats share some common features. Since they are required to show the progress of a disaster simulation, they are divided into time steps. The individual numeric values show which places have been affected by the disaster.
Raster Data Formats. Raster data formats are used as coverages, i.e., layers showing some aspect of the data that varies continuously in space. Raster files consist of rows and columns of cells, where each cell stores a single value. Examples of such values are temperature, precipitation, or air circulation. Raster data can also be used as a background for vector data layers. The supported models use the following two raster formats; they were chosen because they have become de facto standards as a result of their simplicity, flexibility, and widespread use. Since they are single-band files only, they cannot be used to represent imagery data.
ARC/INFO ASCIIGRID format was created for the ESRI ARC/INFO software. An ARC/INFO file consists of a header that specifies the geographic coordinates and dimensions of the region, followed by the actual grid cell values. The values are usually integers, but floating-point values are also acceptable [13], [14].
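The header-plus-values layout of an ARC/INFO ASCII grid is simple enough to read with a few lines of code. The following Python sketch is one possible reader, not the parser used by the Service; it assumes the common header keywords (ncols, nrows, xllcorner, yllcorner, cellsize, NODATA_value) and one grid row per text line, and the sample data are invented.

```python
def parse_ascii_grid(text):
    """Split an ARC/INFO ASCII grid into a header dict and rows of cell values."""
    header, rows = {}, []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0][0].isalpha():
            # Header lines have the form "<keyword> <value>".
            header[tokens[0].lower()] = float(tokens[1])
        else:
            rows.append([float(t) for t in tokens])
    if len(rows) != int(header["nrows"]):
        raise ValueError("row count does not match header")
    if any(len(r) != int(header["ncols"]) for r in rows):
        raise ValueError("column count does not match header")
    return header, rows


# Invented 3 x 2 grid with one missing cell.
SAMPLE = """ncols 3
nrows 2
xllcorner 100.0
yllcorner 200.0
cellsize 10.0
NODATA_value -9999
0 1 2
3 -9999 5
"""
header, rows = parse_ascii_grid(SAMPLE)
```

The first row in the file is the northernmost one, so a renderer that draws from the top of the image downwards can consume the rows in file order.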
Fig. 3. The picture shows the Marseille industrial port with adjacent country. The small picture in the lower left corner shows the same area with the gray shades representing the potential fire intensity.
GRASS ASCII is a raster data format introduced by the GRASS geographic information system [16]. The format is very similar to ARC/INFO ASCIIGRID; the only difference is that the GRASS format specifies the coordinates of the four borders instead of giving a reference point and dimensions. To support imagery data, general image formats such as JPEG, GIF, PNG, and TIFF are used. This is standard practice in the GIS community, taking advantage of their wide portability.
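The relation between the two raster headers can be written down directly. The sketch below converts the reference point and dimensions of an ARC/INFO header into the four borders used by a GRASS ASCII header; the function name is hypothetical, and the code assumes the xllcorner/yllcorner variant of the ARC/INFO header, where the reference point is the lower-left corner of the grid.

```python
def arc_to_grass_bounds(ncols, nrows, xllcorner, yllcorner, cellsize):
    """Derive GRASS ASCII header fields from ARC/INFO ASCII grid header fields."""
    return {
        "north": yllcorner + nrows * cellsize,  # top edge of the grid
        "south": yllcorner,                     # bottom edge
        "east": xllcorner + ncols * cellsize,   # right edge
        "west": xllcorner,                      # left edge
        "rows": nrows,
        "cols": ncols,
    }


bounds = arc_to_grass_bounds(ncols=3, nrows=2,
                             xllcorner=100.0, yllcorner=200.0, cellsize=10.0)
```

For a grid of 3 columns and 2 rows with 10-unit cells anchored at (100, 200), the borders come out as north 220, south 200, east 130 and west 100, while the cell values themselves carry over unchanged.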
6
Conclusions
The Medigrid Geovisualisation Service has met the requirements set by the project proposal by providing means for the visualisation of computational simulations of natural disasters. The Service is able to visualise both the spatial and the temporal progress of a catastrophic event. The Service is scalable, i.e., it is able to display simulations consisting of either a few or a few hundred time steps. The Service supports zooming and panning; map layers can be switched on and off. From the scientific point of view, the Service has brought together the power of Grid computing and the flexibility of web mapping. Future research will focus on generalising this approach in order to create a solution usable in a general Grid infrastructure. In the long term, the Service is planned to be adapted for cooperation with 3D visualisation platforms.
Acknowledgements. This work has been supported by projects MEDIGRID EU 6FP RTD GOCE-CT-2003-004044 and VEGA No. 2/6103/6. The authors would also like to thank the members of the Medigrid Consortium for providing their advice and expert knowledge during the development of the project.
References
1. Hluchý, L., Habala, O., Nguyen, G., Šimo, B., Tran, V., Babík, M.: Grid Computing and Knowledge Management in EU RTD Projects of IISAS. In: Proceedings of 1st International Workshop on Grid Computing for Complex Problems (GCCP 2005), VEDA, 2006, Bratislava, Slovakia, November – December 2005, pp. 7–19 (2005) ISBN 80-969202-1-9
2. Šimo, B., Ciglan, M., Slížik, P., Mališka, M., Hluchý, L.: Core Services of Heterogeneous Distributed Framework for Multi-Risk Assessment of Natural Disasters. In: International Conference on Computational Science (ICCS 2006), May 28–31 (2006)
3. OGC: Open Geospatial Consortium, Inc., http://www.opengeospatial.org/
4. Minnesota MapServer, http://mapserver.gis.umn.edu/
5. OGR Simple Feature Library, http://www.gdal.org/ogr/
6. GDAL: Geospatial Data Abstraction Library, http://www.gdal.org/
7. Deegree: Free Software for Spatial Data Infrastructures, http://www.deegree.org/
8. GeoTools: The Open Source Java GIS Toolkit, http://geotools.codehaus.org/
9. GeoServer: Open Gateway for Geospatial Data, http://docs.codehaus.org/display/GEOS/Home
10. Vector Data Access: A Comprehensive Reference Guide to Using Different Vector Data Formats with MapServer, MapServer Online Reference, http://mapserver.gis.umn.edu/docs/reference/vector_data/referencemanual-all-pages
11. ESRI: Environmental Systems Research Institute, http://www.esri.com/
12. ESRI Shapefile Technical Description, An ESRI White Paper, Environmental Systems Research Institute, Inc. (July 1998), http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf
13. ARC/INFO ASCIIGRID File Format Description, http://www.climatesource.com/format/arc_asciigrid.html
14. ESRI ARC/INFO ASCII Raster File Format Description (GRASS manual), http://grass.itc.it/grass63/manuals/html63_user/r.in.arc.html
15. Geographic Resources Analysis Support System GRASS, http://grass.itc.it
16. GRASS ASCII Raster File Format Description, http://grass.itc.it/grass63/manuals/html63_user/r.in.ascii.html
Web Portal to Make Large-Scale Scientific Computations Based on Grid Computing and MPI
Assel Zh. Akzhalova and Daniar Y. Aizhulov
Kazakh National University, Mechanics and Mathematics Faculty, Computer Science Department, Masanchi street 39/47, 050012 Almaty, Kazakhstan
ak [email protected],[email protected]
Abstract. The real and specific problem that underlies the Grid concept is coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations. This concept is realized in the most popular problem solving environments (PSEs), such as NetSolve and WebPDELab. These systems combine computational grid techniques with Web browser interfaces to back-end computing resources. The aim of this work is to build a PSE, implemented as a Web portal, that allows clients to choose the most appropriate services for solving problems by means of matrix algebra and numerical methods based on MPI techniques. In addition, the library of the system can be extended by loading up-to-date problem-solving algorithms. The proposed system thus allows rapid prototyping of ideas, detailed analysis, and higher productivity.
Keywords: grid, web portal, online problem solving, MPI.
1 Introduction
Progress in the IT industry and in manufacturing produces ever more complex problems, most of which require precise results. For several years to come, the most relevant commercial applications will be in financial services, life sciences, and manufacturing. These applications include seismic analysis, statistical analysis, risk analysis, mechanical engineering, weather analysis, drug discovery, and digital rendering. Solving the problems that arise in these areas requires a large infrastructure consisting of supercomputers, advanced algorithms, and programming tools. The most appropriate tool for realizing this infrastructure is Grid computing. The third generation of grid computing introduced a service-oriented approach, leading to commercial projects in addition to scientific ones.

The real and specific problem that underlies the Grid concept is coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations. This concept is realized in the most popular problem solving environments (PSEs), such as NetSolve and WebPDELab. These systems combine computational grid techniques with Web browser interfaces to back-end computing resources.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 888–893, 2008. © Springer-Verlag Berlin Heidelberg 2008

Grid systems can be classified by usage as computational grids, data grids, and service grids. Examples include TeraGrid, an NSF-funded grid linking 5 major research sites at 40 Gb/s (www.teragrid.org); the European Union Data Grid, a grid for applications in high energy physics, environmental science, and bioinformatics (www.eu-datagrid.org); Access Grid, a collaboration system using commodity technologies (www.accessgrid.org); and the Network for Earthquake Engineering Simulations Grid, a grid for earthquake engineering (www.nees.org).

Network Enabled Solvers (NetSolve) is an example of a grid-based hardware/software server. Its original goal was to free domain scientists from the tedious tasks involved in using numerical software, particularly on multiple platforms. Today it is one of the leading grid systems. NetSolve follows a client/agent/server model [1]. All requests from clients are accepted by the agents; the agents manage the requests and allocate servers to service them; and the servers receive the inputs for the problem, do the computation, and return the output parameters to the client. NetSolve can be invoked via C, FORTRAN, MATLAB, or Mathematica interfaces. Fault tolerance and load balancing are supported within the system.

Another example of a PSE is WebPDELab. WebPDELab is a Web server that provides access to PELLPACK [2], a sophisticated problem-solving environment for partial differential equation (PDE) problems. The WebPDELab interface supports users with a collection of case studies such as flow, heat transfer, electromagnetism, and conduction. Any registered user can upload PELLPACK problem-definition files, such as .e files, mesh files, and solution files from previous sessions. WebPDELab is implemented with the help of virtual network computing (VNC).
VNC consists of a server that runs the application and generates the display, a viewer that draws the display on the client screen, and a TCP/IP connection between the server and the viewer. WebPDELab provides security for the user and the server, as well as payment for computing services.

Despite the existence of these systems, more and more scientists face a lack of computational resources. There are certainly several powerful high-performance computers located in developed countries. However, most supercomputers are local, and gaining access to them is not trivial. Sometimes a scientist has to overcome a number of bureaucratic obstacles before getting access to a supercomputer to solve a problem. Moreover, usually only very large problems are admitted for execution on such resources. At the same time, there are many problems that scientists around the world could solve with the help of open computational resources. The emergence of networked computers has made it possible to use networks connected through the Internet for parallel programming. In this article we propose a system that allows different scientific problems to be solved at any time through Internet technologies. The system offers an attractive alternative to expensive supercomputers and parallel computer systems for high-performance computing.
2 Concept
The main idea is to merge server technologies and MPI programming tools in one system. The basic scheme of the system can be presented as a three-level infrastructure, shown in Figure 1. Data flows from the third level through the second level to the first one. The second level is middleware. The third, highest level is the client interface.

Fig. 1. Three levels model of proposed system

The client interface gives the client a web form through which input data can be uploaded to the server. There are two ways to input the data: completing the web form or uploading a text file to the server. The second way is better suited to large amounts of data. Predefined templates make it possible to structure the data before uploading. In addition, the third level offers Flash and JavaScript scripts to check the data and to display graphics.

The second level is responsible for several tasks. First, the information is processed and converted to data according to a template that depends on the problem being solved. The second level also provides security. The server scripts validate the data for conformity and check whether the user intends to use the parallel computer for relevant purposes. The server is connected to a database that stores all information about previously executed tasks. The user can review and analyze results with the help of diagrams and tables. Special agents find resources, monitor execution time, and store it in the database. In fact, the database supports all needs, from authorization to load-balancing control of the parallel computer.

In our case, the parallel computer is a set of workstations connected by a bus in a P2P topology. The parallel computations are performed using MPI techniques. The main program reads the input data and calls the appropriate task. Each task includes the mpi.h header to implement communication and synchronization between processes. To implement parallel numerical calculations there is an additional library of ready-made numerical methods. Synchronization is done by non-blocking send and receive routines [3]. If the same task is called by many users at once, the main program finds free workstations and sends them equal portions of the work. There are different techniques for managing a shared task set, such as cooperative list manipulation and a parallel access queue using fetch&add [4]. In this work, a centralized load balancing technique is used. Figure 2 shows the architecture of the proposed PSE.

Fig. 2. The architecture of proposed PSE

The proposed PSE has some dependability and reliability features. These features are realized on the second level. For instance, more than one request may be made simultaneously. This leads to the problem
of sharing resources between the clients. To resolve this problem, the system combines two strategies: 1) blocking the server until the problem of one client has been executed; 2) redistributing resources between clients at regular intervals and supplying them with full statistics about the number of clients that are online at the moment. Each task is analyzed according to two criteria: how much time is needed to execute it and how many processes it requires.

The connection between the third and second levels is based on the HTTP protocol. There is no need for a persistent connection between client and server, because the client performs only two actions: sending the input data and retrieving the output data. Consequently, the browser must periodically check whether the parallel computer has finished its work. After the data has been posted to the server, the server validates all received information: it authenticates the user, creates a session, validates the data, and checks the workload of the parallel computer. Once all these actions have been completed, the server is ready to communicate with the parallel computer. Communication is carried out by reading and writing files. This process does not take much time, because reading and writing are done only once for each computation.

The server then starts the execution of the chosen problem. The agent starts an appropriate application (the “calculator”) on accessible host machines. The “calculator” is one program from the set of sample programs installed on the host machines. The agent passes two arguments to the “calculator”: the number of processes and the input data. It calls MPIRun with the required arguments and a parallel program [5]. The program code contains compiler directives and calls to MPI library functions that spawn the required number of processes and start the computation. After the calculations terminate, the output is saved in a special log file that is subsequently presented in the client's browser.
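The launch step described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' agent: the function name `build_launch_command`, the file names, and the convention of passing the input file path as the calculator's only argument are hypothetical. The form `mpirun -np N program args` is the common launcher syntax in MPICH and Open MPI, and we assume it corresponds to the MPIRun call mentioned in the text.

```python
import shlex


def build_launch_command(calculator, num_processes, input_file):
    """Build the argument list an agent could pass to the MPI launcher.

    Assumes the common `mpirun -np N program args...` form; the calculator
    receives the path of the input file written by the server script.
    """
    if num_processes < 1:
        raise ValueError("at least one process is required")
    return ["mpirun", "-np", str(num_processes), str(calculator), str(input_file)]


# Hypothetical example: 4 processes running a matrix-multiplication calculator.
command = build_launch_command("./calc_matmul", 4, "task_17.in")
command_line = shlex.join(command)  # printable form for the log file
# An agent could now start the job with subprocess.run(command, check=True).
```

Keeping the command as a list rather than a single string avoids shell quoting problems when file names come from user uploads.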
The client can also download the output data as a file. All input and output data are saved in the database. All data are written to a text file. In case of failure, an error report is written to the log file. The server script opens the log file with the results, reads it, and writes the contents to the database. In addition, the server script generates a graphical representation of the obtained results. When the server script reads data from the log file, it uses a system of flags. The flag equals 0 while the “calculator” is still working and 1 once the “calculator” has finished. The flag is kept in the output file. In effect, this provides a termination detection service. The library of the system can be extended by loading up-to-date problem-solving algorithms. The system makes it possible to view the obtained results, save them in different text and graphics formats, and analyze them.
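The flag-based termination detection described above can be illustrated with a short Python sketch. The file layout is an assumption, since the text does not specify it: here the first line of the output file holds the flag and the remaining lines hold the results. The function name `read_status` and the file name are hypothetical.

```python
import tempfile
from pathlib import Path


def read_status(output_file):
    """Return (finished, results) from an output file.

    Assumed layout (not specified in the text): the first line holds the
    flag, "0" while the calculator is running and "1" once it has finished;
    the remaining lines hold the results.
    """
    path = Path(output_file)
    if not path.exists():
        return False, []
    lines = path.read_text().splitlines()
    if not lines or lines[0].strip() != "1":
        return False, []
    return True, lines[1:]


# Simulate one polling cycle with a temporary output file.
with tempfile.TemporaryDirectory() as workdir:
    out = Path(workdir) / "task_17.out"
    out.write_text("0\n")               # calculator still running
    running = read_status(out)
    out.write_text("1\n3.5 4.0\n")      # calculator finished, results follow
    done = read_status(out)
```

A server script could call such a function each time the browser asks whether the computation has finished, which matches the non-persistent HTTP connection described earlier.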
3 Conclusion
Using a Web portal has a number of well-known advantages over specially designed multiprocessor systems: high-performance workstations and PCs are readily available at low cost, the latest processors can easily be incorporated into the system as they become available, and existing software can be used or modified. The Grid infrastructure gives end users interactive control, access to remote storage systems at any time, and a tool for solving computational problems. In the future, we expect to add a manager of the agents to support the reliability of the system. In conclusion, a Web portal was created for solving applied problems using MPI on the basis of the three-level platform described above. Any registered user can access the portal online with any web browser. The proposed system makes it possible to build Internet services that solve current applied problems.
References
1. Dongarra, J.: Sourcebook of parallel computing, 2nd edn. Morgan Kaufmann Publishers, San Francisco (2003)
2. Catlin, A.C., Weerawarana, S., Houstis, E.N., Gaitatzes, M.: The PELLPACK User Guide. Department of Computer Sciences, Purdue University, Lafayette, IN (2000)
3. Jordan, H., Alaghband, G.: Fundamentals of parallel processing. Pearson Education, Inc., New Jersey (2003)
4. Wilkinson, B., Allen, M.: Parallel programming: techniques and applications using networked workstations and parallel computers. Prentice-Hall, Inc., Englewood Cliffs (1999)
5. Nemnjugin, S.A., Stesik, O.L.: Parallel programming for high-efficiency multiprocessing systems. St.-Petersburg (2002)
The GSI Plug-In for gSOAP: Building Cross-Grid Interoperable Secure Grid Services
Massimo Cafaro1, Daniele Lezzi2, Sandro Fiore2, Giovanni Aloisio1, and Robert van Engelen3
1 University of Salento, Lecce, Italy {massimo.cafaro, giovanni.aloisio}@unile.it
2 Euro Mediterranean Center for Climate Change, Lecce, Italy {daniele.lezzi, sandro.fiore}@unile.it
3 Computer Science Department, Florida State University, USA [email protected]
Abstract. Increasingly, grid computing is becoming the paradigm of choice for building large-scale complex scientific applications. These applications are characterized as being computationally and/or data intensive, requiring computational power and storage resources well beyond the capability of a single computer. Grid environments provide distributed, geographically spread computing and storage resources made available to scientists belonging to Virtual Organizations; resource sharing is tightly controlled across multiple administrative domains through established service-level agreements. The adoption of Service-Oriented Architectures leads to grid environments characterized by grid services built using Web Services technologies that can be composed as needed to create arbitrarily complex workflows. In this context, security is a key issue that must be taken into account; another concern is interoperability among grids, a fundamental building block to develop grid-aware applications that can benefit from multiple grid environments. We present the GSI plug-in for gSOAP, an open source solution to the problem of securing Web Services in grid environments providing full interoperability between grid environments based on the Globus Toolkit and gLITE middleware.
1 Introduction
Increasingly, grid computing [1] is becoming the paradigm of choice for building large-scale complex scientific applications. These applications are characterized as being computationally and/or data intensive, requiring computational power and storage resources well beyond the capability of a single computer. Grid environments provide distributed, geographically spread computing and storage resources made available to scientists belonging to Virtual Organizations [2]; resource sharing is tightly controlled across multiple administrative domains through established service-level agreements.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 894–901, 2008. © Springer-Verlag Berlin Heidelberg 2008
The adoption of Service-Oriented Architectures [3] leads to grid environments characterized by grid services [4] built using Web Services technologies that can be composed as needed to create arbitrarily complex workflows. In this context, security is a key issue that must be taken into account; another concern is interoperability among grids, a fundamental building block to develop grid-aware applications that can benefit from multiple grid environments.

The Globus Toolkit [5] and gLITE [6] are the two most widely used middleware packages for building distributed grid infrastructures; both leverage the Globus Security Infrastructure (GSI) [7] to secure applications. Even though many of the features provided by the GSI are also available using Transport Layer Security (TLS), the GSI handles important features that are extremely relevant to grid computing applications, e.g. support for delegation of credentials. The GSI provides an implementation of the Generic Security Services (GSS) API [8], enhanced to work in grid environments. These extensions are being standardized through the Open Grid Forum (OGF).

The Web Services framework is the de facto approach to Service-Oriented Architectures; indeed, it provides an established, clean and widely available interface for wrapping existing legacy applications and for developing new ones. Among the many Web Services tools available, Apache Axis [9] provides an open source solution for development of Java and C/C++ Web services; the Java version is currently utilized by the Globus Toolkit v4 Java WS Core. The gSOAP toolkit [10,11] is another open source solution for seamless implementation of C/C++ Web Services; it provides native support for TLS and a plug-in based modular architecture, which makes it easy to add new functionality. Since grid services have strict security requirements including full GSI support, we have been developing a GSI plug-in for the gSOAP Web services toolkit [12].
The middleware, initially developed in the context of the GridLab project [13], an EU-funded development effort to provide high-level services and grid-aware libraries (the GridLab Application Toolkit), is actively maintained and enhanced at the University of Salento, Italy. In this paper we describe the evolution of this software: its latest version fully supports interoperability between Globus Toolkit and gLITE based grids, the gLITE VOMS service (Virtual Organization Membership Service) [14] and the delegation of delegated credentials. Overall performance has also improved, owing to careful code inspection, rewriting and refactoring. The remainder of this paper is organized as follows. Section 2 presents the gSOAP Web services toolkit and its evolution, while Section 3 describes the GSI plug-in for gSOAP in more detail. We present related work in Section 4 and draw our conclusions in Section 5.
2 The gSOAP Web Services Toolkit
The open-source gSOAP middleware includes several software components and tools for developing stand-alone XML Web services and client applications, and for integrating existing applications into Web services. The software provides a runtime web server engine and a set of code generators. To support application integration, the toolkit generates service interfaces for C and C++ applications, which allows legacy C and C++ code to communicate with other SOAP/XML services. The interface automatically translates C/C++ data structures into XML and vice versa, while ensuring compliance with Web services industry standards. The gSOAP software is compliant with widely used standards, including:
– SOAP 1.1/1.2 RPC encoding
– SOAP 1.1/1.2 document/literal style (WS-I Basic Profile 1.0a compliant)
– WSDL 1.1 (consume and generate)
– WS-Security (message authentication and integrity) and WS-Addressing
– HTTP 1.1 and HTTPS (with session caching)
The toolkit utilizes plug-ins to extend its functionality, such as a WSSE plug-in for WS-Security message integrity. Because core functions of gSOAP can be replaced with a plug-in's routines, integration is more tightly controlled. By contrast, other toolkits use handlers or layered stacks that make it impossible to develop solutions that require sessions and negotiation, e.g. when a component must be actively engaged in a transaction. On the other hand, developing specialized plug-ins can be more complex without layering.
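The idea of a plug-in replacing core routines, rather than being stacked around them, can be illustrated with a toy Python sketch. gSOAP itself is a C/C++ toolkit and nothing below is its API: the classes `Engine` and `ReversingPlugin`, their methods, and the string-reversal "transformation" are hypothetical stand-ins chosen only to show the mechanism.

```python
class Engine:
    """Toy messaging engine whose send/receive routines can be replaced."""

    def __init__(self):
        self.wire = []                    # stands in for the network
        self.send = self._plain_send      # replaceable core routine
        self.recv = self._plain_recv      # replaceable core routine

    def _plain_send(self, message):
        self.wire.append(message)

    def _plain_recv(self):
        return self.wire.pop(0)

    def register_plugin(self, plugin):
        # The plug-in takes over the core routines instead of being
        # stacked as a layer around them.
        plugin.attach(self)


class ReversingPlugin:
    """Hypothetical plug-in: transforms data on the wire (a stand-in for a
    security transformation) and keeps its own session state."""

    def attach(self, engine):
        self.engine = engine
        self.messages_sent = 0            # session state owned by the plug-in
        engine.send = self.send
        engine.recv = self.recv

    def send(self, message):
        self.messages_sent += 1
        self.engine.wire.append(message[::-1])

    def recv(self):
        return self.engine.wire.pop(0)[::-1]


engine = Engine()
plugin = ReversingPlugin()
engine.register_plugin(plugin)
engine.send("hello")
on_wire = engine.wire[0]
received = engine.recv()
```

Because the plug-in owns both the replaced routines and its session state, it can take part in a multi-step exchange such as a security negotiation, which is the kind of scenario the text has in mind.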
3 The GSI Plug-In for gSOAP
The middleware presented in this paper allows seamless development of C/C++ GSI enabled clients and servers, providing the following features:
– support for mutual authentication;
– support for authorization;
– support for the gLITE VOMS service;
– support for delegation of credentials;
– support for delegation of delegated credentials;
– support for IPv4 and IPv6;
– support for connection caching;
– support for multi-threaded servers;
– based on the GSS API for improved performance and reliability;
– extensive error reporting related to GSS functions;
– debugging framework.
Mutual authentication between client and server is switched on by default for security reasons, and allows full interoperability using transport-level security between Globus Toolkit and gLITE-based clients and servers. Because gLITE uses the Globus Toolkit v2.4.3 GSI libraries, the middleware allows developing a gLITE client that fully interoperates with a server based on the Globus Toolkit. The other direction works with default proxies only if the client is built with the Globus Toolkit v2.4.3, while versions 3.x and 4.x require a proxy generated with the -old option of the grid-proxy-init command line interface, in order to generate a legacy format proxy compatible with the GSI libraries available in gLITE. It is worth noting that our package also provides full interoperability with the Globus Java CoG (which provides a Java implementation of GSI) and Apache Axis for Java (the current Globus SOAP engine).

The authorization process can be carried out on the basis of the IP address and port used by the client for the connection, the distinguished name in the X509v3 digital certificate used by the client and, for gLITE based environments, the VOMS FQANs (Fully Qualified Attribute Names). The gLITE middleware provides an authorization framework based on the VOMS concepts of role, group and capabilities. These attributes are embedded within the user's credentials (as an X509v3 certificate extension), and set to specify the principal's privileges. The syntax is as follows: <group>/Role=<role>/Capability=NULL. Capabilities are now a deprecated feature of VOMS and as such should not be used in new applications.

We propose, on the basis of the VOMS mechanism, a two step authorization process for gLITE based grid services: (i) global, through VOMS extensions, and (ii) local, by means of a grid service local authorization framework. Each step produces, as an intermediate result, an authorization mask; these are then appropriately combined in order to infer the final User Privileges Mask (UPM). A wide range of authorization scenarios is available, including:
– user policies defined only on the VOMS server (global mode, coarse grain approach);
– user policies defined only on the server (local mode, fine grain approach);
– user policies defined both on VOMS and server (combined mode).
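To make the attribute syntax concrete, the following Python sketch splits an FQAN into its group, role and capability parts. It is an illustration only: the function name `parse_fqan`, the returned dictionary, and the example VO name `myvo` are hypothetical, and a production service would presumably obtain FQANs through the VOMS libraries rather than by parsing strings by hand.

```python
def parse_fqan(fqan):
    """Split a VOMS FQAN such as /myvo/analysis/Role=admin/Capability=NULL."""
    if not fqan.startswith("/"):
        raise ValueError("an FQAN starts with '/'")
    group_parts = []
    role = None
    capability = None
    for part in fqan.strip("/").split("/"):
        if part.startswith("Role="):
            role = part[len("Role="):]
        elif part.startswith("Capability="):
            capability = part[len("Capability="):]
        else:
            group_parts.append(part)
    # "NULL" is the conventional spelling of "no value" in an FQAN.
    if role == "NULL":
        role = None
    if capability == "NULL":
        capability = None
    return {
        "group": "/" + "/".join(group_parts),
        "role": role,
        "capability": capability,
    }


# Hypothetical FQAN for a user with the admin role in group /myvo/analysis.
example = parse_fqan("/myvo/analysis/Role=admin/Capability=NULL")
```

In global mode, the authorization mask would then be derived from the parsed group and role alone.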
The coarse grain approach requires acquiring the user's credentials through the voms-proxy-init command-line interface, which issues a proxy certificate with an X509v3 certificate extension containing role and group information; the UPM is inferred only from the available VOMS FQANs. The main advantage of this approach is its easy and fast setup procedure: for instance, the authorization policies of many grid services related to the same Virtual Organization can be defined just once on the VOMS server. A related advantage is its intrinsic scalability, which makes it feasible for production-level grid environments.

The fine grain approach requires acquiring the user's credentials as usual through the grid-proxy-init command-line interface; in this case the UPM is derived directly from the local grid service. The drawback is the need to carry out the setup procedure on each grid service. As a consequence, this approach scales poorly if many instances of the same grid service need to be deployed on different sites. However, the case of a single instance of a grid service poses no problems.

When combining the coarse and fine grain approaches, a user's credential must be acquired through voms-proxy-init; the UPM is constructed by joining related information on access policies, available in the X509 certificate extension and the local grid service authorization database.

We conclude our discussion on authorization by noting the difference between setting policies on VOMS and on grid services. With VOMS, a VO admin is allowed to (i) grant or (ii) revoke a grid role. A specific role can therefore be in one of two possible states: enabled or disabled. A grid service, however, can locally manage the same role by setting, undefining or unsetting (three states) the user's data access policies. For each policy, a possible meaning of the three states is the following:
1. setting: the policy is added to the UPM independently of VOMS settings (the local authorization confirms or extends the global one);
2. undefining: the VOMS setting is left unchanged (the local authorization does not affect the global one);
3. unsetting: the policy is switched off independently of VOMS settings (the local authorization confirms or reduces the global one).
From a design perspective, a three state approach is beneficial, owing to the different levels of priority it induces between local (high) and global (low) settings (a grid service configuration can, locally, override the VOMS one and no VOMS roles managing user policies on the grid service are locally available). Coarse and fine grained authorizations play complementary roles in increasing access control flexibility and provide the best mix to manage production grids, preserving, when needed, site local policies. This authorization approach has been successfully implemented in the GRelC (Grid Relational Catalog) project [15] Data Access Service (DAS) [16].

Delegation of credentials through a proxy certificate allows a user, agent, or service to act on another user's behalf.
This is a common need of many grid services, exemplified by the following situation: a client A contacts a remote service B delegating its credentials; in turn, B contacts another service C utilizing the credentials belonging to A, actually acting on behalf of A. For instance, a grid scheduler needs delegated credentials in order to perform job submission on Globus Toolkit based grids, since the GRAM (Globus Resource Allocation Manager) component requires that the user's credentials be handed over to the gatekeeper daemon. The new version of the GSI plug-in adds support for delegation of delegated credentials. This allows arbitrarily complex scenarios in which an intermediate grid service can delegate the credentials it has received to another grid service. In the simplest scenario, a client A delegates its credentials to an intermediate service B which, in turn, delegates A's credentials to a service C that finally uses them to contact a service D on behalf of A.

Support for protocol independent networking (IPv4/IPv6) and connection caching is strictly tied to the gSOAP implementation. Connection caching gives better performance by exploiting the gSOAP HTTP keep-alive mechanism. Thread safety is achieved by using reentrant functions and by avoiding sharing the plug-in local data among threads through a deep copy mechanism. GSI security is implemented by leveraging the GSS APIs for improved performance and reliability; additionally, when a GSS function fails, an extensive error report is generated to inform developers of what went wrong. Debugging has also been made simpler through an improved debug framework that provides developers with multiple debug levels, with different information associated with each level.

The latest v3.x plug-in API retains backward compatibility with previous v2.x releases. We have updated the plug-in functions as needed and added the following ones: (i) gsi_acquire_delegated_credential, (ii) gsi_get_credential_data and (iii) gsi_get_voms_data. The function gsi_acquire_delegated_credential allows a server to use as its GSI credential the delegated credential received during the GSS context establishment; this way the server can delegate the delegated credential just received to another server if needed. gsi_get_credential_data retrieves the peer certificate and certificate chain from the peer credential in the GSS context and, finally, gsi_get_voms_data retrieves VOMS attributes from the peer certificate and certificate chain. The middleware is open-source, covered by the Apache License version 2 and freely available from the web site of one of the authors.
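The three-state combination of global (VOMS) and local policies described in the authorization discussion earlier in this section can be sketched in Python as follows. This is a minimal sketch, assuming that authorization masks are sets of policy names and that the three local states are called SET, UNDEFINED and UNSET; these names, the policy names, and the function `user_privileges_mask` are hypothetical and do not come from the plug-in or from GRelC.

```python
from enum import Enum


class LocalState(Enum):
    SET = "set"              # grant locally, regardless of VOMS
    UNDEFINED = "undefined"  # defer to the VOMS (global) setting
    UNSET = "unset"          # deny locally, regardless of VOMS


def user_privileges_mask(global_mask, local_policies):
    """Combine the global mask with local three-state policies into the UPM.

    global_mask: set of policy names granted through VOMS FQANs.
    local_policies: dict mapping policy name -> LocalState; policies not
    listed are treated as UNDEFINED.
    """
    upm = set(global_mask)
    for policy, state in local_policies.items():
        if state is LocalState.SET:
            upm.add(policy)
        elif state is LocalState.UNSET:
            upm.discard(policy)
        # UNDEFINED leaves the global decision untouched.
    return upm


voms_mask = {"read", "write"}
local = {
    "write": LocalState.UNSET,       # local setting reduces the global one
    "admin": LocalState.SET,         # local setting extends the global one
    "read": LocalState.UNDEFINED,    # global setting is kept
}
upm = user_privileges_mask(voms_mask, local)
```

With an empty `local_policies` dictionary the function reduces to the global (coarse grain) mode, and with an empty `global_mask` it reduces to the local (fine grain) mode, so the same routine covers all three scenarios under these assumptions.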
4 Related Work
The Globus Toolkit 4.0.5 C WS Core is a C implementation of WSRF (Web Services Resource Framework). It currently provides support for development of grid services secured either by GSI or WS-Security. However, it lacks support for development of multi-threaded services and a usable authorization framework. Even though it is feasible to develop WSRF grid services in C, it is not actually possible to deploy production level services for the following two reasons: (i) WS-Security based grid services do not perform well enough and (ii) grid services deployed in the C WS Core container can only use the default SELF authorization scheme (a client is allowed to use a grid service only if the client's identity is the same as the service's identity), which is useless for a production service. Unfortunately, the globus-wsc-container program does not have options to handle different authorization schemes. A possible solution could be the development of a customized service that uses the globus_service_engine API functions to run an embedded container. Doing so requires setting the GLOBUS_SOAP_MESSAGE_AUTHZ_METHOD_KEY attribute on the engine to GLOBUS_SOAP_MESSAGE_AUTHZ_NONE to omit the authorization step and then using the client's distinguished name to perform authorization. However, with the Globus Toolkit 4.0.5 it is not possible for a C grid service to retrieve the distinguished name of a client contacting it, so this is not a viable option, and users will have to wait for the next stable release of the Globus Toolkit (4.2), which should provide major enhancements to the C WS Core, including a usable authorization framework.

The CGSI plug-in developed by Ben Couturier and Akos Frohner of CERN in the context of the European EGEE project is similar to our GSI plug-in. CGSI has been extended to support VOMS, and was recently made freely available under the EGEE license.
The main difference from our implementation lies in its design, which clearly separates the development of clients from that of servers. This results in a convoluted API, whereas our plug-in provides a transparent, easy-to-use and clean interface for development of clients and servers. It also enables mixed modes: servers can simultaneously be clients of GSI enabled Web services when needed. From a technical perspective, the CGSI plug-in is not thread-safe, because plug-in local data is not deeply copied as needed to ensure thread safety. Moreover, data is never encrypted when sent over the network; only data integrity is applied. While this provides better performance, it also results in no data confidentiality at all. In our implementation, developers can freely decide how to set the communication channel security properties, including data confidentiality, data integrity, protection against replay attacks, detection of out of sequence packets, and delegation of credentials.
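The channel properties listed above correspond to standard GSS-API context request flags. The Python sketch below models them with the numeric values defined in the GSS-API C bindings (RFC 2744); the class `GssFlag`, the helper `request_flags` and its keyword arguments are hypothetical and only illustrate how such properties can be combined into a single bit mask. The sketch does not show the plug-in's own configuration interface.

```python
from enum import IntFlag


class GssFlag(IntFlag):
    """Context request flags, with the values from the GSS-API C bindings."""
    DELEG = 1       # GSS_C_DELEG_FLAG: delegate credentials to the peer
    MUTUAL = 2      # GSS_C_MUTUAL_FLAG: mutual authentication
    REPLAY = 4      # GSS_C_REPLAY_FLAG: replay detection
    SEQUENCE = 8    # GSS_C_SEQUENCE_FLAG: out-of-sequence detection
    CONF = 16       # GSS_C_CONF_FLAG: confidentiality (encryption)
    INTEG = 32      # GSS_C_INTEG_FLAG: integrity protection


def request_flags(confidentiality=True, integrity=True, replay=True,
                  sequence=True, delegation=False):
    """Build a request mask; mutual authentication is always requested."""
    flags = GssFlag.MUTUAL
    if confidentiality:
        flags |= GssFlag.CONF
    if integrity:
        flags |= GssFlag.INTEG
    if replay:
        flags |= GssFlag.REPLAY
    if sequence:
        flags |= GssFlag.SEQUENCE
    if delegation:
        flags |= GssFlag.DELEG
    return flags


full_protection = request_flags()
integrity_only = request_flags(confidentiality=False, replay=False, sequence=False)
with_delegation = request_flags(delegation=True)
```

The `integrity_only` mask mirrors the behaviour attributed to CGSI in the text (integrity without encryption), while `full_protection` adds confidentiality at some cost in performance.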
5 Conclusion
We described the latest version of the GSI plug-in for gSOAP, an open source solution for developing secure, GSI-enabled Web services suitable for grid environments. We have introduced key new features, providing full support for interoperable grid services built using the two most widely deployed grid middleware packages: the Globus Toolkit and gLITE. We have also proposed and discussed an authorization scheme based on the gLITE VOMS service and local authorization policies managed on local grid services. Finally, we have reported on the implementation of a mechanism for delegating delegated credentials, which enables arbitrarily complex scenarios. The plug-in enables seamless integration of both legacy and new applications in grid environments by wrapping them as GSI enabled Web services; it is distributed under the Apache License version 2 and is freely available. Future work will focus on adding support for message-level security and WSRF (Web Services Resource Framework).
References
1. Berman, F., Fox, G., Hey, T.: Grid Computing: Making the Global Infrastructure a Reality. John Wiley & Sons, Chichester (2003)
2. Foster, I., Kesselman, C., Tuecke, S.: The Anatomy of the Grid: Enabling Scalable Virtual Organizations. International Journal Supercomputer Applications 15(3), 200–222 (2001)
3. Marks, E., Bell, M.: Service Oriented Architecture: A Planning and Implementation Guide for Business and Technology. John Wiley & Sons, Chichester (2006)
4. Foster, I., Kesselman, C., Nick, J., Tuecke, S.: Grid Services for Distributed System Integration. Computer 35(6), 37–46 (2002)
5. Foster, I., Kesselman, C.: Globus: A Metacomputing Infrastructure Toolkit. Intl J. Supercomputer Applications 11(2), 115–128 (1997)
6. gLite: Lightweight Middleware for Grid Computing, http://glite.web.cern.ch/glite/
7. Foster, I., Kesselman, C., Tsudik, G., Tuecke, S.: A Security Architecture for Computational Grids. In: Proceedings of ACM Conference on Computers and Security, pp. 83–91. ACM, New York (1998)
8. Linn, J.: Generic Security Service Application Program Interface, Version 2. INTERNET RFC 2078 (1997)
9. Apache Axis web site, http://ws.apache.org/axis
10. van Engelen, R., Gallivan, K.: The gSOAP Toolkit for Web Services and Peer-to-Peer Computing Networks. In: Proceedings of the 2nd IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2002), Berlin, Germany, May 21-24, 2002, pp. 128–135. IEEE, Los Alamitos (2002)
11. van Engelen, R., et al.: Developing Web Services for C and C++. IEEE Internet Computing Journal, 53–61 (March 2003)
12. Aloisio, G., Cafaro, M., Epicoco, I., Lezzi, D., van Engelen, R.: The GSI plug-in for gSOAP: Enhanced Security, Performance, and Reliability. In: Proceedings of Information Technology Coding and Computing, vol. I, pp. 304–309. IEEE Press, Los Alamitos (2005)
13. Aloisio, G., Cafaro, M., Epicoco, I., Nabrzyski, J.: The EU GridLab Project: A Grid Application Toolkit and Testbed. In: Yang, L.T., Dongarra, J., Hoisie, A., Martino, B.D., Zima, H. (eds.) Engineering the Grid: Status and Perspective, Nova Science Publisher (2005)
14. Alfieri, R., Cecchini, R., Ciaschini, V., dell'Agnello, L., Frohner, R., Lőrentey, K., Spataro, F.: From gridmap-file to VOMS: managing authorization in a Grid environment. Future Generation Comp. Syst. 21(4), 549–558 (2005)
15. GRelC project web site, http://grelc.unile.it
16. Fiore, F., Cafaro, M., Negro, A., Vadacca, S., Aloisio, G., Barbera, R., Giorgio, E.: GRelC DAS: a Grid-DB Access Service for gLite Based Production Grids. In: Proceedings of the Fourth International Workshop on Emerging Technologies for Next-generation GRID (ETNGRID 2007), Paris (France), June 18-20 (to appear, 2007)
Implementing Effective Data Management Policies in Distributed and Grid Computing Environments
Luisa Carracciuolo (1), Giuliano Laccetti (2), and Marco Lapegna (2)
(1) Institute of Chemistry and Technology of Polymers (ICTP), National Research Council (CNR), c/o Department of Chemistry, via Cintia Monte S. Angelo, 80126 Naples, Italy, [email protected]
(2) Department of Mathematics and Applications, University of Naples Federico II, via Cintia Monte S. Angelo, 80126 Naples, Italy, {giuliano.laccetti,marco.lapegna}@dma.unina.it
Abstract. A common programming model in distributed and grid computing is the client/server paradigm, where the client submits requests to several geographically remote servers, which execute already deployed applications on the client's data. In this context it is mandatory to avoid unnecessary data transfers, because the data sets can be very large. This work addresses the problem of implementing a data management strategy for the case of data dependencies among subproblems of the same application. To this end, some minor changes have been introduced into the Distributed Storage Infrastructure of the NetSolve distributed computing environment.
1 Introduction
As stated in [8], a computational grid is a system that coordinates resources that are not subject to centralized control, using standard, open, general-purpose protocols and interfaces, to deliver nontrivial qualities of service. This means that a Grid infrastructure is built on top of a collection of disparate and distributed resources (computers, databases, networks, software, storage) and offers functionality greater than the simple sum of its parts [9]. The added value is a software architecture that virtualizes scattered computing and data resources to create a single computing system image, granting users and applications seamless access to vast IT capabilities. The hardware of this single computing system is often characterized by slow and non-dedicated Wide Area
This work has been carried out as part of the project ”Sistema Cooperativo Distribuito ad Alte Prestazioni per elaborazioni Scientifiche Multidisciplinari (S.CO.P.E.)”, supported by the Italian Ministry of Education, University and Research (MIUR), PON 2000-2006.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 902–911, 2008. © Springer-Verlag Berlin Heidelberg 2008
Networks (WAN) connecting very fast and powerful processing nodes (which may also be supercomputers or large clusters) scattered over a wide geographical area, whereas the operating system (the grid middleware) is responsible for finding and allocating resources to the scientists' applications, taking into account the status of the whole grid. Many papers focus on this aspect of grid computing, addressing topics such as resource brokering (e.g. [4], [14]), performance contract definition and monitoring (e.g. [10], [12]) and migration of applications in case of contract violations (e.g. [15]). On the other hand, common scientific applications are characterized by very large input data and by dependencies among subproblems. It is therefore not sufficient to choose the most powerful computational resources in order to achieve good performance: it is essential to define suitable methodologies to distribute application data onto the grid components, overlapping communication and computation, and to provide tools that eliminate unnecessary data transfers. Only a small number of papers are available in this research area (e.g. [7]). However, because a computational grid can naturally be viewed as a single computational resource, it is possible to borrow ideas and methodologies commonly used for traditional systems and to adapt them to the new environments.
This paper is structured as follows: in Section 2 we describe our caching methodology for distributing data onto the grid components when there are significant dependencies among the subproblems of a scientific application; in Section 3 we describe the software environment we use, together with some minor modifications we introduced to better adapt it to our aims; in Section 4 we report the experiments we carried out to validate our methodology; finally, in Section 5 we draw conclusions and outline our future work.
2 Data Management in Distributed Environments
For many years the dominant style in parallel computing has been the Single Program Multiple Data (SPMD) programming model, where all processors run the same program, each on its own data. Algorithms based on this model are designed for a static and dedicated computing environment, such as an MPP supercomputer. They also need frequent data communication and synchronization among homogeneous nodes in a systolic fashion, where it is also possible to overlap communication and computation. A notable example is the ScaLAPACK library [3], where the data are mapped onto a 2D grid of processors. A computational grid is a computing environment that differs substantially from the one just described. Here the heterogeneity and the shared nature of the resources make the client/server programming model more promising than the SPMD approach, as it eliminates the need for data communication among the server nodes. Probably the most famous project based on this model is SETI@home, designed to search for extraterrestrial intelligence, where a central system sends a huge number of independent tasks to the participating servers for computation [13]. In the client/server programming model the data are stored on the client, from where they are sent in chunks to the
servers for computation; after the computation the results are returned to the client. The data movement between client and servers in a grid is similar to the data transfer between memories and processing units in a single Non Uniform Memory Access (NUMA) machine. A memory-aware model of the grid is shown in Fig. 1, where the computational units (the servers) can retrieve data from registers and caches, main memory and secondary storage, as well as from remote clients. Fast and small memories are positioned at the higher levels, whereas slower memories, which are usually accessed through geographic networks, are located at the lower ones. In this model the servers' secondary storage level can be represented by the disks of each server as well as by a data repository external to the server, provided that it is close enough for its access time to be negligible compared with that of the remote client.
Fig. 1. A memory-aware model of a computational grid
Fig. 2 shows typical peak bandwidth and latency of the three lowest memory levels when accessed from the server. The values refer to a common workstation of the kind usually available in a distributed computing environment and are not representative of leading-edge technology. For the server main memory the values for a DDR2 memory interface running at 400 MHz are reported; for the server secondary storage the values for a Parallel ATA disk adapter are reported; for the remote client the values for both a Local Area Network (Fast Ethernet) and a Wide Area Network (e.g. a metropolitan area network) are reported. Note, however, that the LAN bandwidth is shared among all the data transfers on the network, so that the actual bandwidth is very sensitive to the network traffic and can be significantly below the peak bandwidth. This performance reduction is much less evident for the disk transfer rate, because of the optimization of the disk scheduling algorithms in current operating systems. It is commonly acknowledged that the key strategy to achieve high performance with a NUMA machine is an extensive use of caching methodologies at
                            Bandwidth         Latency
Server main memory          10 GByte/sec      2-10 ns
Server secondary storage    100 MByte/sec     5 ms
Remote client (LAN)         12.5 MByte/sec    10 ms
Remote client (WAN)         < 1 MByte/sec     100 ms
Fig. 2. Typical values for bandwidth and latency in Fig. 1
each level of the memory hierarchy, in order to provide the computing elements with data taken from the fast memories at the higher levels and to avoid unnecessary data transfers toward the lowest levels. Compilers or problem-oriented libraries, like the BLAS library for numerical linear algebra [6], are usually in charge of managing the higher levels of the memory hierarchy, but the lowest levels have to be managed by means of suitable programming methodologies and software tools. Scientific applications can rarely be divided into totally independent tasks, and some data dependencies are generally present among them; thus the definition of methodologies and the development of software tools for an effective data distribution among the components of a grid assume a key role in grid computing. As an example, consider an application composed of three tasks with dependencies in the form of three pipelined stages, as shown in Fig. 3. In this example the output data from stage 1 are the input data for stage 2, and the output data from stage 2 are, in turn, the input data for stage 3.
Fig. 3. A pipelined three stages application
A raw implementation of this application with the client/server programming model is depicted in Fig. 4, where three servers compute the three stages of the application. In this implementation the output data from stages 1 and 2 are sent back to the client and then sent again to a new server for the computation of the next stage. In this case the input data for stages 2 and 3 are located at the lowest level of the memory hierarchy of Fig. 1 when accessed by the servers. In Fig. 5, the use of the server secondary storage as a cache for the intermediate results places them at a higher level in the memory hierarchy and avoids unnecessary data transfers toward the client memory. Furthermore, by keeping intermediate data in higher level memories it is possible to overlap data
Fig. 4. Raw implementation of the application in Fig. 3
Fig. 5. Implementation of the application in Fig. 3 with caching of the intermediate results
communication and stage computation if the entire sequence has to be repeated several times. A similar approach to data management in distributed environments is described in [7], where the server main memory is used as a cache in place of the server secondary storage. The main advantage of the approach described in this section is the larger amount of space available to cache the intermediate data, while the access time to the cache remains negligible with respect to that of the client memory.
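The benefit of caching intermediate results can be estimated with a simple back-of-the-envelope model. The sketch below assumes that one transfer costs its latency plus its size divided by the peak bandwidth, takes the LAN and disk values from Fig. 2, and counts only the movement of the intermediate results between the stages of Fig. 3. The function names and the 100 MByte result size are illustrative assumptions and are not part of the experiments reported in this paper.

```python
# Peak bandwidth (bytes/s) and latency (s) taken from Fig. 2.
LEVELS = {
    "server_disk": (100e6, 5e-3),    # server secondary storage
    "client_lan": (12.5e6, 10e-3),   # remote client over Fast Ethernet
}

def transfer_time(level, nbytes):
    """Idealized cost of one transfer: latency + size / peak bandwidth."""
    bandwidth, latency = LEVELS[level]
    return latency + nbytes / bandwidth

def intermediate_cost(level, nbytes, stages=3):
    """Cost of moving intermediate results between consecutive stages.

    Each of the (stages - 1) intermediate results is written once by the
    producing stage and read once by the consuming stage.
    """
    return 2 * (stages - 1) * transfer_time(level, nbytes)

size = 100e6  # hypothetical size of each intermediate result (100 MByte)
raw = intermediate_cost("client_lan", size)      # Fig. 4: via the client
cached = intermediate_cost("server_disk", size)  # Fig. 5: via server storage
print(f"raw: {raw:.2f} s, cached: {cached:.2f} s")
```

Under these idealized assumptions the intermediate data cost roughly 32 s when routed through the client and roughly 4 s when kept on the server storage; real transfers over a shared LAN would usually be slower than the peak figures suggest.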
3 Software Tools for Caching Data in NetSolve
One of the best known software environments for the on-demand approach to distributed and grid computing is NetSolve 2.0 [1]. It is based on a client-agent-server paradigm and provides transparent and inexpensive access to remote hardware and software resources. In this environment a key role is played by the agent, which collects information about the hardware performance and the available software of the servers during the environment setup phase, as well as dynamic information about the workload of the resources. When the agent is contacted by the client, by means of the NetSolve client library linked to the user application, it selects the most suitable server on the basis of the stored information and notifies the client. The client can then send the data directly to the selected server, which performs the computation by using code generated from a Problem Description File that acts as an interface between NetSolve and the legacy software. Finally, the result is sent directly back to the client. This data exchange protocol is executed for every request to NetSolve; therefore, in case of dependencies among multiple tasks in the same application, the execution looks like that in Fig. 4, with superfluous network traffic. NetSolve, however, includes two tools to manage data more efficiently: Request Sequencing and the Distributed Storage Infrastructure.
– Request Sequencing is realized by means of a NetSolve construct that builds a Directed Acyclic Graph (DAG) in which the nodes represent the tasks and the arcs represent the data dependencies among them. The main limitation of this approach is that currently the entire DAG sequence is executed by the same server, even if there are independent tasks in the sequence that could be executed in parallel by multiple servers.
– The Distributed Storage Infrastructure (DSI) is an attempt to overcome the limitation of Request Sequencing in NetSolve, by allowing the client to place data in a storage infrastructure that can be accessed by the servers. If the storage resource is sufficiently close to the servers, it is possible to reduce the network traffic in case of data dependencies among tasks. Currently NetSolve implements this storage service by means of the Internet Backplane Protocol (IBP) [2], [11], a middleware for managing and using remote storage resources. Fig. 6 shows how the network traffic is reduced in case of multiple accesses to the stored data, by sending them only once from the client to the computational space formed by the servers and the IBP storage. However, this approach currently also has a drawback: the servers are able to read data
from the IBP storage but they appear to be unable to write to it, so that the implementation in Fig. 5 can be only partially achieved. In order to fully implement a caching methodology like the one shown in Fig. 5, it has been necessary to modify the NetSolve DSI implementation to some extent. More precisely, the DSI infrastructure defines the new data type DSI_OBJECT as a data structure describing the location of the IBP storage area that can be used as a cache, together with further information about it (dimension, access permissions, access mode, etc.). In a typical NetSolve session, a DSI_OBJECT is generated by the client and sent to the servers by means of the NetSolve API, so that they can access the IBP storage. Among the information in a DSI_OBJECT there are the read/write/management capabilities of the IBP storage, i.e. unique character strings used as keys to correctly access the data on the storage.
Fig. 6. Implementation of the Distributed Storage Infrastructure in NetSolve
The main changes therefore concern the DSI functions for reading and writing data on the IBP storage, so that they can also be used by the servers. In fact, at present the servers cannot call these DSI functions directly, because the Problem Description Files used to generate the server codes are unable to handle a DSI_OBJECT. For this reason, in the modified implementation of the DSI infrastructure, the APIs of the DSI functions for accessing the IBP storage take the capabilities of the IBP storage in place of the DSI_OBJECT. The character type used for the IBP capabilities is handled by the Problem Description File, so the servers can use the DSI functions for reading and writing on the data storage without restriction. As an example, consider the API of the current DSI function to read a vector from the IBP storage:
    int ns_dsi_read_vector(DSI_OBJECT* dsi_obj, void* data, int count, int data_type)
It has been modified into:
    int ns_new_read_vector(char* ibp_read_cap, void* data, int count, int data_type)
where ibp_read_cap is the IBP capability to read data in the storage area. Similar changes have been made to the functions ns_dsi_write_vector(),
ns_dsi_read_matrix(), ns_dsi_write_matrix() and ns_dsi_close(). As a consequence, it has been necessary to introduce some minor changes in the related code, and the software infrastructure obtained with the modified functions has been used in place of the DSI in the NetSolve architecture.
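The idea behind the modified interface, namely that a plain character string is enough to grant access to a storage area, can be illustrated with a toy model. The sketch below is a hypothetical in-memory store written only for illustration: the class CapabilityStore and its methods are assumptions that mimic the role of the IBP read and write capabilities, and they are not the IBP or NetSolve API.

```python
import secrets

class CapabilityStore:
    """Toy in-memory storage area addressed only by capability strings."""

    def __init__(self):
        self._areas = {}       # area id -> stored list of values
        self._read_caps = {}   # read capability -> area id
        self._write_caps = {}  # write capability -> area id

    def allocate(self):
        """Create a storage area and return its (read, write) capabilities."""
        area = len(self._areas)
        self._areas[area] = []
        read_cap, write_cap = secrets.token_hex(8), secrets.token_hex(8)
        self._read_caps[read_cap] = area
        self._write_caps[write_cap] = area
        return read_cap, write_cap

    def write_vector(self, write_cap, data):
        """Store a copy of data in the area named by a write capability."""
        self._areas[self._write_caps[write_cap]] = list(data)

    def read_vector(self, read_cap, count):
        """Return the first count items of the area named by a read capability."""
        return self._areas[self._read_caps[read_cap]][:count]

store = CapabilityStore()
read_cap, write_cap = store.allocate()          # done once by the client
store.write_vector(write_cap, [1.0, 2.0, 3.0])  # a server caches a result
print(store.read_vector(read_cap, 2))           # another server reads it back
```

Because the capabilities are ordinary strings, they can be passed through any interface that handles character data, which is the property the modified DSI functions rely on.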
4 Computational Experiments
As a test bed for the software infrastructure described in the previous section, a block matrix multiplication algorithm has been chosen, because it is a basic linear algebra computational kernel that is representative of other similar computations; moreover, it involves a large amount of data movement, so that minimizing the communication overhead becomes a challenging task. More precisely, the client/server ijk form of the block matrix algorithm has been used, because it incurs a smaller synchronization overhead than the other forms [5]. For simplicity, assume that A, B and C are square matrices of order n, divided into square blocks C(I, J), A(I, K) and B(K, J) of order r, with n divisible by r, so that the number of blocks in each dimension is NB = n/r.

    for I=1, NB (in parallel)
      for J=1, NB (in parallel)
        choose a server
        for K=1, NB
          send C(I,J), A(I,K), B(K,J) to the server
          receive C(I,J) from the server
        endfor
      endfor
    endfor

    client algorithm

    receive C(I,J), A(I,K), B(K,J) from the client
    C(I,J) = C(I,J) + A(I,K) * B(K,J)
    send C(I,J) to the client

    server algorithm
Fig. 7. The client server ijk form of the block matrix multiplication algorithm (Algorithm 1)
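The data movement of Algorithm 1 can be made explicit with a small executable sketch. The code below runs the client and server parts sequentially in a single process, which is an assumption made only for illustration (in the paper the tasks over I and J run in parallel on remote servers); blocks are plain lists of rows, and the function names and the transfer counter are hypothetical.

```python
def matmul_add(c, a, b):
    """Return c + a*b for square blocks stored as lists of rows."""
    r = len(c)
    return [[c[i][j] + sum(a[i][k] * b[k][j] for k in range(r))
             for j in range(r)] for i in range(r)]

def server(c, a, b):
    """Server side of Algorithm 1: one block update C = C + A*B."""
    return matmul_add(c, a, b)

def client_algorithm_1(A, B, C, nb):
    """Client side of Algorithm 1, run sequentially for illustration.

    A, B, C are nb x nb grids of blocks. Returns the number of blocks
    that crossed the client/server link.
    """
    transfers = 0
    for i in range(nb):          # independent tasks over I and J
        for j in range(nb):
            for k in range(nb):  # successive K values are synchronized
                transfers += 3   # send C(I,J), A(I,K), B(K,J)
                C[i][j] = server(C[i][j], A[i][k], B[k][j])
                transfers += 1   # receive C(I,J)
    return transfers

# 2 x 2 grid of 1 x 1 blocks, i.e. an ordinary 2 x 2 matrix product.
A = [[[[1]], [[2]]], [[[3]], [[4]]]]
B = [[[[5]], [[6]]], [[[7]], [[8]]]]
C = [[[[0]], [[0]]], [[[0]], [[0]]]]
moved = client_algorithm_1(A, B, C, nb=2)
print(moved, C)
```

Each block C(I, J) causes 4 NB block transfers over the client link, so the whole product moves 4 NB^3 blocks, 32 in this small example.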
In Algorithm 1 (Fig. 7) the client can manage $NB^2$ independent tasks over the indices I and J, and synchronizations occur only between two successive values of K in the same task. Furthermore, the computation of a single matrix product in the server can be performed through the sequential Level 3 BLAS routine DGEMM [6]. Now let $T_s$ and $T_r$ be, respectively, the access times to the server secondary storage and to the remote client memory. The communication cost for the complete computation of each block C(I, J) with Algorithm 1 is then
$$T_1 = 4\,NB\,T_r\,r^2 .$$
However, in Algorithm 1 it is evident that, in the innermost loop of the client algorithm, the same block C(I, J) is received and sent again to the server
for successive values of the index K. This is an example of unnecessary data movement between client and server that can be avoided by storing intermediate results in the server secondary storage. A more efficient solution is therefore Algorithm 2 (Fig. 8). From the bandwidth values in Fig. 2 we have $T_s < T_r$, so the communication cost for the computation of each block C(I, J) with Algorithm 2 is
$$T_2 = 2\,NB\,r^2\,(T_s + T_r) < T_1 .$$
To test the software infrastructure needed to implement the data management policy described in Section 2, some experiments have been carried out on a cluster of 2.4 GHz PCs, each provided with a Parallel ATA disk adapter with a peak transfer rate of 100 MByte/sec, and connected through a 100 Mbit/s switch. The operating system running on the PCs was Linux 2.4. Algorithm 1 and Algorithm 2 have been implemented using the NetSolve 2.0 computing environment with the DSI infrastructure modified as described in Section 3. The IBP 1.0.4.2 software infrastructure has been installed on the servers of the NetSolve system in order to support the modified DSI infrastructure.

    for I=1, NB (in parallel)
      for J=1, NB (in parallel)
        choose a server
        store C(I,J) in the server secondary storage
        for K=1, NB
          send A(I,K), B(K,J) to the server
        endfor
        retrieve C(I,J) from the server
      endfor
    endfor

    client algorithm

    retrieve C(I,J) from the secondary storage
    receive A(I,K), B(K,J) from the client
    C(I,J) = C(I,J) + A(I,K) * B(K,J)
    store C(I,J) in the secondary storage

    server algorithm
Fig. 8. The client server ijk form of the block matrix multiplication algorithm with caching of intermediate results in the server secondary storage (Algorithm 2)
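The effect of the cache in Algorithm 2 can be sketched in the same spirit. The code below is a sequential, single-process illustration in which a dictionary stands in for the server secondary storage; the function names, the counters and the per-element times derived from the peak rates of Fig. 2 are assumptions, while t1 and t2 simply evaluate the expressions for T1 and T2 given in the text.

```python
def matmul_add(c, a, b):
    """Return c + a*b for square blocks stored as lists of rows."""
    r = len(c)
    return [[c[i][j] + sum(a[i][k] * b[k][j] for k in range(r))
             for j in range(r)] for i in range(r)]

def client_algorithm_2(A, B, C, nb):
    """Algorithm 2 run sequentially in one process for illustration.

    Returns the number of blocks moved over the client link and over
    the link between the server and its secondary storage.
    """
    client_link = storage_link = 0
    cache = {}                      # stands in for the server secondary storage
    for i in range(nb):
        for j in range(nb):
            cache[i, j] = C[i][j]   # client stores C(I,J) once
            client_link += 1
            for k in range(nb):
                client_link += 2    # send A(I,K), B(K,J)
                c = cache[i, j]     # server retrieves C(I,J) from storage
                cache[i, j] = matmul_add(c, A[i][k], B[k][j])  # and stores it back
                storage_link += 2
            C[i][j] = cache[i, j]   # client retrieves the final C(I,J)
            client_link += 1
    return client_link, storage_link

def t1(nb, r, tr):
    """Cost per C block for Algorithm 1, as in the text: 4 NB Tr r^2."""
    return 4 * nb * tr * r**2

def t2(nb, r, ts, tr):
    """Cost per C block for Algorithm 2, as in the text: 2 NB r^2 (Ts + Tr)."""
    return 2 * nb * r**2 * (ts + tr)

A = [[[[1]], [[2]]], [[[3]], [[4]]]]
B = [[[[5]], [[6]]], [[[7]], [[8]]]]
C = [[[[0]], [[0]]], [[[0]], [[0]]]]
links = client_algorithm_2(A, B, C, nb=2)
print(links, C)

# Assumed per-element times for 8-byte values at the peak rates of Fig. 2.
tr, ts = 8 / 12.5e6, 8 / 100e6
print(t1(4, 50, tr), t2(4, 50, ts, tr))
```

In this sketch the client link carries 2 NB + 2 blocks per C block instead of 4 NB; the expression for T2 appears to neglect the two extra transfers (the initial store and the final retrieval), which become irrelevant as NB grows. The difference T1 - T2 equals 2 NB r^2 (Tr - Ts), which is positive whenever Ts < Tr.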
Fig. 9 shows the total execution times in seconds of Algorithm 1 and Algorithm 2 for matrices of order n = 50, 100, 200 and 300 with square blocks of order r = 50. In order to minimize the impact of traffic fluctuations in the network, the reported values are the average times over 10 runs. The results show a significant reduction of the total execution time for the computation of the entire matrix multiplication. Similar results are also obtained with larger problems: Fig. 10 shows the total execution times of Algorithm 1 and Algorithm 2 for matrices of order n = 500, 1000 and 2000 with square blocks of order r = 500.
                  Algorithm 1    Algorithm 2
n=50 (NB=1)       0.08           0.08
n=100 (NB=2)      0.41           0.36
n=200 (NB=4)      9.5            5.2
n=300 (NB=6)      14.38          8.5

Fig. 9. Execution times for a block matrix multiplication with square blocks of order r = 50

                  Algorithm 1    Algorithm 2
n=500 (NB=1)      2.46           1.81
n=1000 (NB=2)     17.7           12.2
n=2000 (NB=4)     66.6           44.4

Fig. 10. Execution times for a block matrix multiplication with square blocks of order r = 500
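The size of the improvement can be read off the reported figures with a few lines of arithmetic. The sketch below only post-processes the execution times of Fig. 9 and Fig. 10; the dictionary layout and the function name are illustrative assumptions, and no new measurements are introduced.

```python
# Execution times in seconds from Fig. 9 and Fig. 10:
# n -> (Algorithm 1, Algorithm 2).
TIMES = {
    50: (0.08, 0.08), 100: (0.41, 0.36), 200: (9.5, 5.2), 300: (14.38, 8.5),
    500: (2.46, 1.81), 1000: (17.7, 12.2), 2000: (66.6, 44.4),
}

def reduction(t_alg1, t_alg2):
    """Relative reduction of the execution time achieved by Algorithm 2."""
    return (t_alg1 - t_alg2) / t_alg1

for n, (t_alg1, t_alg2) in TIMES.items():
    print(f"n={n}: speedup {t_alg1 / t_alg2:.2f}, "
          f"reduction {reduction(t_alg1, t_alg2):.0%}")
```

For NB = 1 with r = 50 the two algorithms take the same time, while for the larger cases with NB > 1 Algorithm 2 saves roughly a third or more of the execution time of Algorithm 1.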
5 Conclusions and Future Works
In this work we addressed the problem of implementing a data management strategy for the case of data dependencies among subproblems of the same application. The aim of this paper is twofold. On the one hand, an effective methodology for the placement of data among the resources of a distributed environment with the client/server programming model has been described. The methodology is based on the observation that a computational grid is a large NUMA machine, where the client is at the lowest memory level and the server resources are at the highest level. Therefore an effective client/server implementation requires the application data to be kept as close as possible to the servers. On the other hand, it has been necessary to modify to some extent the Distributed Storage Infrastructure, part of the NetSolve distributed computing system, in order to implement the described methodology. The computational experiments confirm the expectations, showing a significant reduction of the execution times when the intermediate data are kept in the secondary storage of the servers. Future work will be devoted to integrating the resulting software infrastructure into the grid environment of an ongoing project conducted by the University of Naples Federico II and funded by the Italian Ministry of University and Research. The main aim of the project is to solve multidisciplinary applications, arising from the research of Naples scientists, on a new powerful grid infrastructure to be integrated into large national and European grids.
References
1. Arnold, D., Agrawal, S., Blackford, S., Dongarra, J., Miller, M., Seymour, K., Sagi, K., Shi, Z., Vadhiyar, S.: User's Guide to NetSolve V. 2.0. Univ. of Tennessee (2004), See also NetSolve home page, http://icl.cs.utk.edu/netsolve/index.html
2. Bassi, A., Beck, M., Moore, T., Plank, J.S., Swany, M., Wolski, R., Fagg, G.: The Internet Backplane Protocol: a study in Resource Sharing. Future Gener. Comput. Syst. 19, 551–561 (2003)
3. Blackford, L., Choi, J., Cleary, A., D'Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., Whaley, R.: ScaLAPACK Users Guide. SIAM, Philadelphia (1997)
4. Czajkowski, K., Foster, I., Karonis, N., Kesselman, C., Martin, S., Smith, W., Tuecke, S.: A Resource Selection Management Architecture for Metacomputing Systems. In: Feitelson, D.G., Rudolph, L. (eds.) IPPS-WS 1998, SPDP-WS 1998, and JSSPP 1998. LNCS, vol. 1459, pp. 62–82. Springer, Heidelberg (1998)
5. D'Amore, L., Laccetti, G., Lapegna, M.: Block matrix multiplication in a distributed computing environment: experiments with NetSolve. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 625–632. Springer, Heidelberg (2006)
6. Dongarra, J.J., Du Croz, J., Hammarling, S., Hanson, R.J.: A Proposal for an Extended Set of Fortran Basic Linear Algebra Subprograms. ACM SIGNUM Newsletter 20, 2–18 (1985)
7. Dongarra, J.J., Pineau, J.F., Robert, Y., Shi, Z., Vivien, F.: Revisiting Matrix Product on Master-Worker Platforms. In: APDCM workshop IPDPS 2007 Conference. A revised version is in International J. on Foundations of Computer Science (in press, 2007)
8. Foster, I.: What is the Grid? A three point checklist, http://www-fp.mcs.anl.gov/~foster/Article/WhatIsTheGrid.pdf
9. Foster, I., Kesselman, C.: The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco (1998)
10. Petitet, F., Blackford, S., Dongarra, J., Ellis, B., Fagg, G., Roche, K., Vadhiyar, S.: Numerical Libraries and the Grid: The GrADS Experiment with ScaLAPACK. University of Tennessee Technical Report UT-CS-01-460 (2001)
11. Plank, J., Bassi, A., Beck, M., Moore, T., Swany, M., Wolski, R.: Managing Data Storage in the Network. IEEE Internet Computing 5, 50–58 (2001)
12. Ribler, R., Vetter, J., Simitci, H., Reed, D.: Autopilot: Adaptive Control of Distributed Applications. In: The 7th IEEE High Performance Distributed Computing Conference, pp. 172–179. IEEE Computer Society, Los Alamitos (1998)
13. Seti@home home page: http://setiathome.ssl.berkeley.edu/
14. Vadhiyar, S., Dongarra, J.: A Metascheduler for the Grid. In: The 11th IEEE High Performance Distributed Computing Conference, pp. 343–351. IEEE Computer Society, Los Alamitos (2002)
15. Vadhiyar, S., Dongarra, J.: A performance oriented migration framework for the grid. In: CCGRID 2003 The 3rd International Symposium on Cluster Computing and the Grid, pp. 130–137. IEEE Computer Society, Los Alamitos (2003)
Data Mining on Desktop Grid Platforms
Valerie Fiolet (1), Richard Olejnik (1), Eryk Laskowski (2), Łukasz Masko (2), Marek Tudruj (2), and Bernard Toursel (1)
(1) Laboratoire d'Informatique Fondamentale de Lille (LIFL UMR CNRS 8022), Université des Sciences et Technologies de Lille (USTL), Lille, France, {fiolet, olejnik, toursel}@lifl.fr
(2) Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland, {tudruj, laskowsk, masko}@ipipan.waw.pl
Abstract. The very large data volumes and high computation costs of data mining applications justify the use of Grid-level massive parallelism. This paper concerns the Grid-oriented implementation of the DisDaMin (Distributed Data Mining) project, which proposes distributed knowledge discovery through parallelization of data mining tasks. DisDaMin solves data mining problems by using new distributed algorithms based on a special clustered data decomposition and on asynchronous task processing, which match the features of Grid computing. The DisDaMin algorithms are embedded inside the DG-ADAJ (Desktop-Grid Adaptative Application in Java) system, a middleware platform for Desktop Grids. It provides adaptive control of distributed applications written in Java for Grids or Desktop Grids. It allows an optimized distribution of applications on clusters of Java Virtual Machines, monitoring of application execution, and dynamic on-line balancing of processing and communication. Simulations were performed to demonstrate the efficiency of the proposed mechanisms. They were carried out using the French national project Grid'5000¹ (part of the CoreGrid project) and DG-ADAJ.
1 Introduction
The continuing development of data mining technology involves processing larger and larger amounts of data, frequently stored in databases or data warehouses. Knowledge extraction from data has been embedded in many kinds of time-consuming algorithms which, for large data sets, cannot be executed on single processors in satisfactorily short periods of time. Therefore, to meet quality of service requirements in data mining, parallel computing techniques have to be applied. This is true regardless of whether the data sets are physically dispersed or not. However, if the data are actually distributed over a set of independent data repositories, the natural approach is to apply
¹ Grid'5000 is a French research effort developing a large-scale nationwide infrastructure for Grid research. 17 laboratories are involved, sharing the common objective of providing the community of Grid researchers with a testbed that allows experiments at all software layers, from network protocols up to applications.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 912–921, 2008. © Springer-Verlag Berlin Heidelberg 2008
distributed Grid-oriented algorithms. They increase the parallelism of execution in data mining, optimize the data exchange involved, and improve the efficiency with which the engaged computing resources are used. The research reported in this paper concerns the DisDaMin project (Distributed Data Mining), in which distributed data mining algorithms oriented towards Grid computer networks have been proposed and examined [6]. They are based on data fragmentation on the Grid using a clustering approach and on asynchronous but collaborative execution of data mining tasks to solve the association rules problem. The development of the DisDaMin project has been supported by the DG-ADAJ project (Desktop Grid – Adaptative Distributed Application in Java) [2], [7]. DG-ADAJ supplies a middleware platform for the Grid that enables optimized execution of Java programs, based on tools for observing processor load and communication, with subsequent optimized initial placement of objects and possible further migration to eliminate potential load imbalance. It has been used for the implementation of the DisDaMin algorithms. The remainder of the paper is structured as follows. First, the motivations and the related work are presented. Then, the DG-ADAJ platform is described. Next, the DisDaMin project is outlined, together with its implementation on the Grid using DG-ADAJ. Finally, the experimental results, including optimization methods, are presented.
2
Motivations and Related Works
The exponential computational cost of traditional data mining methods motivates the search for new, less complex algorithms. This complexity limits the exploration of very large databases, which are increasingly common and keep growing: medical or genetic databases (e.g. gene analysis of pathologies), geographical databases (e.g. ozone layer analysis, climate) or marketing databases (e.g. analysis of customer practices from credit card data). Parallelization of the algorithms is a partial answer to the increasing complexity. Various parallel algorithms have been proposed to take advantage of aggregated computing power and storage capacity (see Park et al. [9], Srikant [10], Shintani et al. [11], Savasere et al. [12] and Zaki [14]). However, these solutions require frequent information exchange or synchronization between participating nodes, which introduces latency due to communication costs and waiting time. They are therefore suited to supercomputers such as CC-NUMA or SMP machines, with shared memory and a fast internal interconnection network (e.g. Parallel Data Miner for IBM-SP3). The capacity/cost ratio of a Grid is more attractive than that of a supercomputer or a dedicated cluster, and as a result Grids have become increasingly popular. Realizing a fully decentralized data mining algorithm that can exploit the capacity of each node without requiring many communications or synchronizations is an interesting challenge with many potential uses. Data mining on the Grid is particularly challenging because of the lack of shared memory in Grid computing, which calls for special attention to communication optimization. Grid data mining projects (Discovery Net, Knowledge Grid, see [13], GridMiner) provide mechanisms for the integration and deployment of classical algorithms on
V. Fiolet et al.
Grid, but not new Grid-specific algorithms. However, using an algorithm that was not designed for the Grid does not exploit the Grid's full capacity and limits the gains. The DisDaMin project proposes to distribute the process of knowledge discovery by parallelizing data mining tasks. In particular, we intend to give an effective solution for association rules search in very large databases. The exponential complexity of the association rules search problem forces the adaptation of existing algorithms. Specific parallelization techniques make it possible to obtain speedup both from parallel execution and from the reduction of time complexity. In the DisDaMin project, an intelligent data distribution is used for the association rules problem (a data fragmentation on which the distribution techniques depend). Computations performed on data fragments according to this intelligent distribution decrease the global processing complexity.
3
DG-ADAJ Platform
DG-ADAJ is a Java program execution environment that provides middleware-level mechanisms for dynamic and automatic adaptation of computations to the structure of the executive platform and to changes in resource availability. It ensures dynamic on-line load balancing based on object behavior monitoring and on algorithms that optimize the assignment of the program graph. It provides a single system image of the executive Grid platform. DG-ADAJ has been implemented on top of the ADAJ, JavaParty and Java/RMI platforms (Fig. 1 and Fig. 2). It has a multi-layer structure using several APIs. It is a Java-compliant system, which assumes an object-oriented program design methodology. At run time, DG-ADAJ influences the granularity of computations and the distribution of application components on the Desktop Grid, which frees programmers from these concerns. Irregular and unpredictable execution control can be efficiently designed with DG-ADAJ in heterogeneous applications. It provides a run-time environment that optimizes the dynamic placement of application objects on Java Virtual Machines running on the Desktop Grid. This placement is based on new mechanisms for the observation of object behavior
Fig. 1. The layered structure of the DG-ADAJ environment
Fig. 2. DG-ADAJ architecture [diagram: D-Grid Node Centers (desktop system, SMP, cluster), each running a Desktop Grid environment with the CFDA framework, CCA components and DG-ADAJ over RMI, connected through a LAN / WAN to the D-Grid Host Server (Desktop Grid server infrastructure: application deploying, resource monitoring & management, placement optimizer), to which a Java application is submitted]
and of mutual interactions between them. Initially, DG-ADAJ was conceived as an extension of the ADAJ system [8], built for cluster computing. It has been re-engineered to support larger-scale distributed computing and to introduce special security mechanisms that provide reliable execution. We have chosen a component-based architecture for the development and deployment of scientific applications over DG-ADAJ. DG-ADAJ provides special control mechanisms, based on the program component approach, that support programmers in designing parallel and distributed component-based applications. A component-based version of DG-ADAJ, built on the CCA (Common Component Architecture) standard [2], has been proposed as the CCADAJ project.
4
DisDaMin Project
DisDaMin has been designed to enable efficient analysis of large volumes of data and to maximize resource usage in terms of CPU time and memory occupancy. Specific Grid features (such as the lack of globally shared memory) force the decomposition of the database into fragments, so that these fragments can be handled in parallel in a way that is as close to optimal as possible. Because communication overhead in Grid environments is usually very high, parallel computations must be carried out so as to avoid communication and/or synchronization. This excludes maintaining a global image of the data set. As a result, standard data mining algorithms are not applicable on the Grid, and new algorithms and methods have to be designed. To achieve these goals, we propose a distributed algorithm, based on an intelligent database fragmentation, which solves the Association Rules Problem
Fig. 3. Collaborative Federator principle in DICCoop algorithm [diagram: a federator F connected to computing nodes, each running a task Ti on a data fragment Di; the nodes send local information to the federator F, and F sends global information to the computing nodes Ti]
(ARP) [1], [3]. Data fragmentation is the first step in the general methodology introduced in the DisDaMin project for ARP computation. This step involves preprocessing and fragmentation of the database before the association rules are computed on the resulting data fragments. To decrease the global complexity of the ARP algorithm, the fragmented computation can then be optimized by looking for similarities within the data fragments. The data fragmentation should fulfil the clustering rules, which require that data inside a fragment be as similar as possible, while data in different fragments be as dissimilar as possible. The discussion of such fragmentation applied to association rules computation in [4] confirms that it decreases the computational complexity of the method. We therefore propose the CDP (Distributed Progressive Clustering) algorithm for intelligent database fragmentation [5]. At the beginning of the clustering, the algorithm considers attribute subsets (according to the existing data distribution) instead of the whole attribute set. The results for the attribute subsets are then merged to create clusters pertaining to all attributes of the database. This process leads to a data fragmentation adjusted to the Grid structure. Once the data fragmentation has been completed, the association rules problem can be solved by the distributed algorithm DICCoop [6]. This algorithm processes data fragments in a collaborative way. It uses a federator (see Fig. 3), which merges the partial results pertaining to the data fragments to obtain the global results. The federator supervises collaboration while data fragments are processed and also includes a validation phase. The federator provides DICCoop components with global information that allows them to arrange, launch and cancel local computations. Communication between components and the federator is performed asynchronously.
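The merging role of the federator can be illustrated with a minimal sketch. It is not the DICCoop algorithm itself: it is synchronous, counts itemsets by brute force, and omits the validation phase and the feedback of global information to the nodes. The function names `local_counts` and `federate` and the example data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def local_counts(fragment, size):
    """Count every itemset of the given size in one data fragment."""
    counts = Counter()
    for transaction in fragment:
        for itemset in combinations(sorted(transaction), size):
            counts[itemset] += 1
    return counts


def federate(partial_counts, total_transactions, min_support):
    """Merge per-fragment counts and keep the globally frequent itemsets."""
    merged = Counter()
    for counts in partial_counts:
        merged.update(counts)          # Counter.update adds the counts
    threshold = min_support * total_transactions
    return {itemset: n for itemset, n in merged.items() if n >= threshold}


# Two fragments, each processed independently by a computing node.
fragments = [
    [{"a", "b"}, {"a", "c"}],
    [{"a", "b"}, {"b", "c"}],
]
partial = [local_counts(fragment, 2) for fragment in fragments]
print(federate(partial, total_transactions=4, min_support=0.5))
# {('a', 'b'): 2}
```

Only the small per-fragment count tables travel to the federator, not the fragments themselves, which is the property that makes this kind of merging attractive when communication is expensive.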
The two major phases of the algorithm (Fragmentation and Association Rules Computation) can be subdivided into several tasks: data deployment, discretization processing, partial data instances fragmentation, complete data instances fragmentation, data fragments placement, local DICCoop algorithms, global
Fig. 4. Workflow of tasks of DisDaMin Project
DICCoop algorithms. These tasks are deployed on three kinds of nodes in the Grid: storing nodes, which have direct access to the database(s) and are responsible for transmitting data to the computing nodes; computing nodes, which execute the tasks of the CDP and/or DICCoop algorithms; and federator node(s), responsible for collaboration between the computing nodes for the CDP and/or DICCoop algorithms. Fig. 4 presents the general workflow of processing in the DisDaMin project. Both the data discretization and the data instances fragmentation tasks need to access data, so the data deployment task (which distributes data over the Grid) must be completed first. The data discretization workflow and the data distribution workflow must be synchronized before the DICCoop process can start. The outputs of data discretization and of the fragmentation decision are used to obtain the final discretized data distribution. Fig. 5 describes the workflow in CCADAJ components (nodes) in the project. The initial data are transferred once, as are the discretized data (two sets resulting from the CDP algorithm). All other communication concerns the exchange of intermediate results and control information.
Fig. 5. Workflow of components in DisDaMin Project
Fig. 6. Acceleration factor on DiabCare database
5
Experiments on Grid and Optimizations
Experiments were carried out on the Grid’5000 infrastructure. Grid’5000 is an experimental Grid platform that differs from others mainly because it is a dedicated Grid. Consequently, a node can be used by only one user at a time and, barring a hard crash, reserved nodes will not go down. Experiments with the CDP algorithm (the intelligent fragmentation step) on a medical database (DiabCare, 180 attributes and 35000 instances) show an acceptable speedup for this method (see Fig. 6), close to linear for a reasonable number of Grid nodes. The first experiments used Java RMI components to ensure interoperability between components. At the beginning of the algorithm, the attributes of the database are distributed over the nodes and processed independently, so the speedup improves as the number of nodes increases. The next step of the algorithm sets up collaboration between nodes to cross results. If too many nodes are used relative to the number of attributes, many components rapidly become inactive in the crossing part of the calculations, which explains the limit on the measured speedup. The speedup is better when more instances are used, because for small numbers of instances communication and idle periods take a larger share of the execution time. The CDP algorithm was also deployed using CCADAJ components (Fig. 7). The distribution of the application components (data fragments and federators) among active Grid nodes should guarantee the highest possible efficiency of the overall data mining algorithm. A DisDaMin application distributed over the Grid is treated by the DG-ADAJ system in the same way as an ordinary distributed Java program. The optimization features available in DG-ADAJ, such as the initial task assignment optimization, which results in a shorter execution time, can therefore be applied to DisDaMin programs. The
Fig. 7. CCADAJ components for CDP algorithm
applied initial object placement optimization algorithm follows the pattern of the static parallelization method for multithreaded Java programs. It defines a decomposition of Java code into parallel threads distributed over a set of JVMs so as to reduce program execution time [7]. Control decisions determine which classes should be distributed and how objects and data components (fragments) should be mapped to JVM nodes in order to reduce direct inter-object communication and to balance the loads of the JVMs. When applied to a DisDaMin application, the algorithm determines an initial distribution of the DisDaMin objects over the Java Virtual Machines (JVMs) assigned to active Grid nodes, thus reducing the total execution time. The flow of actions during the execution of a data mining application in the proposed Grid-based environment is shown in Fig. 8. The first three blocks in the diagram determine an initial optimized placement of application objects and distribute the objects over the JVM nodes of the Grid. This part of the algorithm starts by executing the application on representative sample data. For that, the number of available JVM nodes on the Grid must be known. The number of method calls and spawned threads observed during execution is recorded. Next, the program method and thread dependence graphs are annotated with the recorded data. At the beginning of the object placement optimization algorithm, all objects are treated as remote objects (that is, all classes are distributed classes). Based on the recorded control data, the algorithm decides which classes should remain distributed and how the objects involved should be placed on the set of JVMs assigned to active nodes on the Grid. We use heuristics based on the following principles: the strong locality of method calls has to be preserved
begin
↓
Program method and thread dependence graphs are created. Application is executed using a sample data set. The number of method calls and thread spawns is registered.
↓
JVM reservation requests are sent to the D-Grid Centers. The number and features of JVMs available for execution of the application is obtained.
↓
The placement optimization algorithm is performed.
↓
The program is deployed among the reserved D-Grid Centers. "Logical" JVMs are mapped onto physical JVMs in the D-Grid. Application objects are placed on D-Grid reserved active nodes.
↓
The application is started. In parallel with the application, the observation and dynamic object redistribution tool (DG-ADAJ) is run.
↓
end
Fig. 8. The control flow of an application execution
inside each parallel thread, and the number of inter-thread calls that cross JVM boundaries has to be reduced. To fulfil these object placement requirements, we assign all calls inside a single thread to the same JVM and optimize the distribution of threads across the available JVMs by applying load balancing methods.
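The second principle can be sketched with a simple greedy heuristic. The text does not say which load balancing method DG-ADAJ applies, so the longest-first assignment below is an assumption chosen for illustration. The function `place_threads` and the per-thread load figures (for instance, recorded method call counts) are hypothetical, and the sketch ignores the cost of inter-thread calls.

```python
import heapq


def place_threads(thread_load, jvm_count):
    """Assign whole threads, heaviest first, to the currently least loaded JVM."""
    heap = [(0.0, jvm) for jvm in range(jvm_count)]   # (accumulated load, JVM id)
    heapq.heapify(heap)
    placement = {}
    # sorted() is stable, so threads with equal load keep their input order.
    for thread, load in sorted(thread_load.items(), key=lambda kv: -kv[1]):
        total, jvm = heapq.heappop(heap)
        placement[thread] = jvm        # every call inside the thread stays on this JVM
        heapq.heappush(heap, (total + load, jvm))
    return placement


loads = {"t1": 50, "t2": 30, "t3": 20, "t4": 20}
print(place_threads(loads, jvm_count=2))
# {'t1': 0, 't2': 1, 't3': 1, 't4': 0}
```

Because a thread is never split, method calls inside a thread remain local to one JVM, which matches the first principle stated above.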
6
Conclusion
We have presented a solution for solving data mining problems on a Grid, taking into account several aspects of the algorithms and the characteristics of the supporting platform. The distributed implementation of the DisDaMin project on a Grid, which deals with the association rules problem and with data fragmentation by clustering, has been discussed. The intelligent fragmentation included in DisDaMin performs partial optimization without the need for synchronized processing. In addition, the collaborative processing can be adjusted to the characteristics of the executive system. The DG-ADAJ project provides a middleware environment for the deployment of DisDaMin applications on a Grid. At this stage, the implementation of the DisDaMin algorithms benefits from the DG-ADAJ mechanisms and exploits pipelining and asynchronous processing patterns during computations. The experimental results presented in this paper demonstrate that
the distributed implementation of data mining algorithms in the described system provides acceptable speedup thanks to an optimized use of the executive platform. Ongoing experiments make use of the DG-ADAJ observation tools and aim to optimize the DisDaMin methods according to observation results on load balancing and on the optimized placement of data fragments and components on the Grid.
References
1. Agrawal, R., Imielinski, T., Swami, A.: Database mining: A performance perspective. IEEE Trans. on Knowledge and Data Engineering: Special issue on learning and discovery in knowledge-based databases 5(6), 914–925 (1993)
2. Alshabani, I., Olejnik, R., Toursel, B.: Parallel Tools for a Distributed Component Framework. In: 1st International Conference on Information & Communication Technologies: from Theory to Applications (ICTTA 2004), Damascus, Syria (April 2004)
3. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: Proc. of the 20th Int. Conf. on Very Large Data Bases (VLDB 1994), pp. 478–499 (September 1994)
4. Fiolet, V., Toursel, B.: Distributed Data Mining. Scalable Computing: Practice and Experiences 6(1), 99–109 (2005)
5. Fiolet, V., Toursel, B.: Progressive Clustering for Database Distribution on a Grid. In: Proc. of ISPDC 2005, July 2005, pp. 282–289. IEEE Computer Society, Los Alamitos (2005)
6. Fiolet, V., Lefait, G., Olejnik, R., Toursel, B.: Optimal Grid Exploitation Algorithms for Data Mining. In: Proc. of ISPDC 2006, July 2006, pp. 246–252. IEEE Computer Society, Los Alamitos (2006)
7. Olejnik, R., Toursel, B., Tudruj, M., Laskowski, E., Alshabani, I.: Application of DG-ADAJ environment in Desktop Grid. Future Generation Computer Systems 23(8), 977–982 (2007)
8. Olejnik, R., Bouchi, A., Toursel, B.: Object observation for a Java adaptative distributed application platform. In: Intl. Conference on Parallel Computing in Electrical Engineering PARELEC 2002, Warsaw, Poland, pp. 171–176 (September 2002)
9. Park, J.S., Chen, M.-S., Yu, P.S.: Efficient parallel data mining for association rules. In: Proc. of the 4th Int. Conf. on Information and Knowledge Management, pp. 31–36 (1995)
10. Srikant, R.: Fast algorithms for mining association rules and sequential patterns. PhD thesis, University of Wisconsin (1996)
11. Shintani, T., Kitsuregawa, M.: Hash-Based Parallel Algorithms for Mining Association Rules. In: Proc. of the Int. Conf. on Parallel and Distributed Information Systems (1996)
12. Savasere, A., Omiecinski, E., Navathe, S.: An efficient algorithm for mining association rules in large databases. In: Proc. of the 21st VLDB Int. Conf. (VLDB 1995), pp. 432–444 (September 1995)
13. Congiusta, A., Talia, D., Trunfio, P.: Distributed data mining services leveraging WSRF. Future Generation Computer Systems 23(1), 34–41 (2007)
14. Zaki, M.J.: Parallel and Distributed Association Mining: A survey. IEEE Concurrency, special issue on Parallel Mechanisms for Data Mining 7(4), 14–25 (1999)
Distributed Resources Reservation Algorithm for GRID Networks
Matviy Il’yashenko
Zaporozhye National Technical University, Computer Systems and Networks Department, Zhukovskogo 64, 69068 Zaporozhye, Ukraine
[email protected]
Abstract. A new distributed resource reservation algorithm for GRID networks is presented. A set of preliminary conditions based on the wave expansion of graphs, which reduce the complexity of the combinatorial part of the algorithm, is proposed. The algorithm can handle any set of additional limits and types of reserved resources in addition to processor productivity and network connection bandwidth. A method for splitting high-dimensional problems into several smaller problems for large networks is also proposed.
Keywords: distributed resources reservation, graphs matching, GRID networks, wave expansion.
1
Introduction
Computational GRID resource allocation algorithms follow one of two paradigms: centralized resource control [1,2] or localized application control [3,4]. Each paradigm has advantages and disadvantages. Centralized resource control manages resources at the global level and, as a result, finds globally optimal solutions, but it may not scale in terms of either execution efficiency or fault resilience. The second approach, on the other hand, can lead to unstable resource assignments and does not allocate resources efficiently in terms of the average load of GRID resources. Each approach has its own area of use. Local application control is better suited to networks with frequent resource reallocation and a large number of small running tasks, whereas centralized resource control works better in networks with a small number of large, long-running tasks or in networks that require globally optimized resource allocation. Moreover, many resource allocation strategies do not handle distributed resource allocation (co-allocation), which is required to launch parallel programs on a GRID. The most important resources for parallel programs are usually processor productivity, the amount of memory available on each computational node, and the bandwidth or latency of the communication network hardware. Global resource allocation should be able to handle all types of resources that influence program productivity and can be reserved, at a minimum processor productivity and network connection bandwidth. In this paper, a graph-analytical approach is used to solve the distributed resource reservation problem. Recent progress in
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 922–931, 2008. © Springer-Verlag Berlin Heidelberg 2008
exact graph matching algorithms [5,6,7,8] makes it possible to develop new global optimization and distributed resource allocation algorithms that can handle real-size GRID networks. The resource allocation algorithm presented in this paper is based on a subgraph matching algorithm for weighted graphs [9], modified to allow several parallel application processes to be mapped onto a single processor.
2
Problem Formalization
Consider two undirected graphs $G_N$ and $G_T$. The graph $G_N = (E_N, V_N, I_N, J_N)$ represents the network, where $E_N$ is the set of physical network connections, $V_N$ is the set of computational nodes (processors), $I_N$ is the set of weights assigned to the edges, representing network connection bandwidth in Mbps, and $J_N$ is the set of weights assigned to the vertices, representing the computational productivity of each processor in MIPS or MFLOPS. The graph $G_T = (E_T, V_T, I_T, J_T)$ represents the resource requirements of the distributed application, which must be allocated before calculations start, where $E_T$ is the set of application network communications, $V_T$ is the set of application computational processes, $I_T$ gives the bandwidth required for each network connection in Mbps, and $J_T$ gives the productivity required of each processor in MIPS or MFLOPS. In these terms, the distributed resource reservation problem is formalized as follows. It is necessary to find a substitution $\varphi : V_T \rightarrow V_N$ such that $\forall (v_i, v_j) \in E_T \Rightarrow (\varphi(v_i), \varphi(v_j)) \in E_N$, $\forall (i_i, i_j) \in I_T \Rightarrow (i_i, i_j) \le (\varphi(i_i), \varphi(i_j)) \in I_N$, and $\forall j_i \in J_T \Rightarrow j_i \le \varphi(j_i) \in J_N$.
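The three conditions can be made concrete with a small checker that tests whether a given mapping is a valid reservation. This is a sketch under assumptions: the dictionary-based graph representation, the helper `undirected` and the function `is_valid_reservation` are hypothetical, and the code checks only the three stated conditions (it does not treat the case of several processes sharing one processor).

```python
def undirected(weights):
    """Store undirected edge weights under order-independent keys."""
    return {frozenset(pair): w for pair, w in weights.items()}


def is_valid_reservation(phi, task_nodes, task_edges, net_nodes, net_edges):
    """Check the conditions of the formalization for a mapping phi.

    task_nodes / net_nodes: vertex -> required / available productivity (J_T, J_N)
    task_edges / net_edges: frozenset({u, v}) -> required / available bandwidth (I_T, I_N)
    """
    for v, demand in task_nodes.items():
        if v not in phi or phi[v] not in net_nodes:
            return False
        if demand > net_nodes[phi[v]]:        # productivity: j_i <= phi(j_i)
            return False
    for edge, demand in task_edges.items():
        image = frozenset(phi[v] for v in edge)
        if image not in net_edges:            # (phi(v_i), phi(v_j)) must be in E_N
            return False
        if demand > net_edges[image]:         # bandwidth bound
            return False
    return True


net_nodes = {"a": 1000, "b": 2000, "c": 500}
net_edges = undirected({("a", "b"): 100, ("b", "c"): 10})
task_nodes = {"p": 800, "q": 1500}
task_edges = undirected({("p", "q"): 50})

print(is_valid_reservation({"p": "a", "q": "b"},
                           task_nodes, task_edges, net_nodes, net_edges))  # True
print(is_valid_reservation({"p": "b", "q": "c"},
                           task_nodes, task_edges, net_nodes, net_edges))  # False
```

The second mapping fails because processor `c` offers 500 units of productivity while process `q` requires 1500.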
3
Algorithm
3.1
Overview
As follows from the problem formalization, the problem is similar to the subgraph isomorphism problem on weighted graphs. For this reason, a subgraph isomorphism algorithm for weighted graphs is used as the basis for the distributed resource allocation algorithm. Edge and vertex weights are used as a set of upper bounds when searching for the final substitution. In addition, it is possible to map several edges of graph $G_T$ onto a single edge of graph $G_N$. It is convenient to describe the algorithm in terms of the state-space search method. Each state $s$ corresponds to a partial substitution $\varphi(s)$ that contains only the vertices that have already been superposed. The algorithm consists of two main parts: preliminary and combinatorial.
3.2
Preliminary Part of Algorithm
The main goal of this part of the algorithm is to perform all calculations that need to be done only once per run. In this part we also sort the vertices of the graph and generate the matrix of possible superpositions.
The central item of the described algorithm is the matrix of possible superpositions. It is a binary table $M_{i,j}$ in which each cell indicates whether the pair of vertices $V_{N,i}$ and $V_{T,j}$ may be superposed. The values of $M$ are generated by preliminary tests performed on both graphs: $M_{i,j}$ is "true" if the preliminary tests give no reason not to try superposing the vertices $V_{N,i}$ and $V_{T,j}$, and "false" otherwise. The main idea of the matrix $M$ is to have a single value for each pair of vertices indicating whether it is reasonable to try to superpose them. The presented algorithm implements several basic tests that build the matrix $M$. If a test condition holds, the element of $M$ corresponding to the pair of vertices $V_{N,i}$ and $V_{T,j}$ is marked as "false". Let $|V_X|$ be the number of edges incident to vertex $V_X$. In this case the comparison condition is
\[ |V_{N,i}| < |V_{T,j}| \rightarrow M_{i,j} = \text{false} \tag{1} \]
The vertex weight comparison condition is
\[ J_{N,i} < J_{T,j} \rightarrow M_{i,j} = \text{false} \tag{2} \]
Let $V^{in}_{N,i}$ be the subset of incoming edges incident to vertex $V_i$. The condition based on incoming edges is then
\[ |V^{in}_{N,i}| < |V^{in}_{T,j}| \rightarrow M_{i,j} = \text{false} \tag{3} \]
Let $I^{in}_{N,i}$ be the subset of weights corresponding to $V^{in}_{N,i}$. The condition based on the weights of incoming edges is
\[ I^{in}_{N,i} < I^{in}_{T,j} \rightarrow M_{i,j} = \text{false} \tag{4} \]
Let $V^{out}_{N,i}$ be the subset of outgoing edges incident to vertex $V_i$. The condition based on outgoing edges is then
\[ |V^{out}_{N,i}| < |V^{out}_{T,j}| \rightarrow M_{i,j} = \text{false} \tag{5} \]
Let $I^{out}_{N,i}$ be the subset of weights corresponding to $V^{out}_{N,i}$. The condition based on the weights of outgoing edges is
\[ I^{out}_{N,i} < I^{out}_{T,j} \rightarrow M_{i,j} = \text{false} \tag{6} \]
These are basic conditions that do not require complicated computations, but such conditions often do not reduce the complexity of the state-space search part of the algorithm sufficiently. In this paper we propose a set of conditions based on the wave expansion of graphs, which is described in more detail in [10]. The wave expansion technique generates a lot of structural information about a graph that can be used to build more efficient conditions. The most important part of wave expansion used in the presented algorithm is the vertex surrounding subgraph. It is the subgraph consisting of the vertices at distance
$k$ from the given vertex, where $k$ is a variable parameter that determines how many neighbouring vertices are used to generate the vertex surrounding subgraph. For graphs with low edge density it is better to use a larger $k$. Another important element of the wave expansion technique is the set of vertices corresponding to each wave of the expansion. Let $W_{X,Y,k}$ be the set of vertices in the wave expansion of graph $X$ starting from vertex $Y$ with maximal edge distance $k$ between vertices. The condition based on this set is
\[ |W_{N,i,k}| < |W_{T,j,k}| \rightarrow M_{i,j} = \text{false}, \text{ where } k = 1 \dots |W_{T,j}| \tag{7} \]
Let $U_{X,Y,k}$ be the set of vertex weights corresponding to the vertices in $W_{X,Y,k}$. Then
\[ U_{N,i,k} < U_{T,j,k} \rightarrow M_{i,j} = \text{false}, \text{ where } k = 1 \dots |W_{T,j}| \tag{8} \]
Let $Q_{X,Y,k}$ be the subset of edges that are part of the wave expansion of graph $X$ starting from vertex $Y$ with maximal edge distance $k$ between vertices, and let $|Q_{X,Y,k}|$ be the number of edges in $Q_{X,Y,k}$. In this case we can add the condition
\[ |Q_{N,i,k}| < |Q_{T,j,k}| \rightarrow M_{i,j} = \text{false}, \text{ where } k = 1 \dots |Q_{T,j}| \tag{9} \]
Let $R_{X,Y,k}$ be the set of edge weights corresponding to the edges in $Q_{X,Y,k}$. Then
\[ R_{N,i,k} < R_{T,j,k} \rightarrow M_{i,j} = \text{false}, \text{ where } k = 1 \dots |Q_{T,j}| \tag{10} \]
The conditions based on the wave expansion of graphs can be evaluated for a single fixed value of $k$, or for several or all possible values of $k$. In the latter case the test has higher computational complexity but is also more effective in reducing the complexity of the combinatorial part of the algorithm. As all tests in the preliminary part have polynomial computational complexity, for large graphs it is usually reasonable to perform all the tests listed. The additional time spent in the preliminary part saves much more time in the combinatorial part of the algorithm. Another important action performed in the preliminary part of the proposed algorithm is vertex sorting. The performance of the algorithm can depend on the order of the vertices, because the conditions in the combinatorial part are based on the pairs of vertices already superposed. The main idea is to order the vertices so that vertices with a higher density of edges among themselves are stored first. Only the vertices of graph $G_T$, which represents the application requirements for distributed resources, are rearranged; the vertices of graph $G_N$ are not affected and are substituted for the vertices of graph $G_T$ in the combinatorial part of the algorithm. Let $T_{T,i}$ be the number of edges incident to vertices with index lower than $i$, and let $P_{T,i}$ be the number of possible substitutions available for vertex $V_{T,i}$ of graph $G_T$. The order of the vertices in $G_T$ is then determined by the following conditions:
\[ V_{T,i} \leftrightarrow V_{T,k}, \text{ where } T_{T,k} = \min(T_{T,j}) \text{ for } j = (i+1) \dots |V_T| \tag{11} \]
Here $V_{T,i} \leftrightarrow V_{T,k}$ means that the two vertices are swapped. When $T_{T,i} = T_{T,k}$, another condition is applied:
\[ V_{T,i} \leftrightarrow V_{T,k}, \text{ where } P_{T,k} = \min(P_{T,i}, P_{T,j}) \tag{12} \]
3.3
Combinatorial Part of Algorithm
The combinatorial part of the algorithm searches for an exact full substitution $\varphi$. It is implemented as a recursive state-space search function that generates the new partial substitution $\varphi_{i+1}(s)$ from the previous partial substitution $\varphi_i(s)$ by adding one vertex to $\varphi_i(s)$. The initial state is represented by the empty partial substitution $\varphi_0(s) = \emptyset$. At each step $i$, the function enumerates all the vertices of the graph $G_N$ that are marked "true" in the possible superpositions matrix $M$ for the vertex $V_{T,i}$. For each vertex $V_{N,j}$ taken from $V_N$, the algorithm performs several tests to verify that all the conditions required for adding the current vertex to the next partial substitution $\varphi_{i+1}(s)$ are satisfied. The conditions used at each step are described in more detail below. The most important is the condition based on the possible superpositions matrix $M$:
\[ M_{i,j} = \text{true} \tag{13} \]
Let $T_{N,i}$ be the number of edges incident both to vertex $V_{N,i}$ and to some vertex that belongs to the partial substitution $\varphi_i(s)$. Then
\[ T_{N,i} \ge T_{T,j} \tag{14} \]
The condition on the edges that are part of the current partial substitution is
\[ (v_i, v_k) = (\varphi_i(v_i), \varphi_i(v_k)), \text{ where } k = 1 \dots i \tag{15} \]
The condition on edge weights is
\[ (i_i, i_k) \le (\varphi_i(i_i), \varphi_i(i_k)), \text{ where } k = 1 \dots i \tag{16} \]
If all the conditions above hold for the next pair of vertices, that pair is added to the partial substitution $\varphi_{i+1}(s)$. In this list, conditions (15) and (16) restrict the partial substitution generated at each step to be part of the final exact substitution. When these conditions are applied to all partial substitutions $\varphi_i(s)$ for $i$ from 1 to $|V_T|$, the resulting combined condition corresponds to the problem formalization and guarantees that the full substitution obtained satisfies the original problem formulation. The states are enumerated using a depth-first search strategy.
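The two parts of the algorithm can be sketched together as follows. This is an illustrative simplification under stated assumptions, not the implementation evaluated in the paper: it applies only the degree test, the vertex weight test and the wave size test, omits vertex sorting, and maps at most one process to each processor. The adjacency-dictionary representation and the names `wave_sizes`, `possible_matrix` and `find_reservation` are hypothetical.

```python
from collections import deque


def wave_sizes(graph, start, k_max):
    """Number of vertices within edge distance k of start, for k = 1..k_max."""
    dist = {start: 0}
    queue = deque([start])
    while queue:                       # breadth-first "wave" expansion
        u = queue.popleft()
        if dist[u] == k_max:
            continue
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return [sum(1 for d in dist.values() if 0 < d <= k)
            for k in range(1, k_max + 1)]


def possible_matrix(net, net_w, task, task_w, k_max=2):
    """Preliminary part: M[(n, t)] is False when a cheap test rules the pair out."""
    net_waves = {n: wave_sizes(net, n, k_max) for n in net}
    matrix = {}
    for t in task:
        t_waves = wave_sizes(task, t, k_max)
        for n in net:
            matrix[(n, t)] = (
                len(net[n]) >= len(task[t])                          # degree test
                and net_w[n] >= task_w[t]                            # vertex weight test
                and all(a >= b for a, b in zip(net_waves[n], t_waves))  # wave size test
            )
    return matrix


def find_reservation(net, net_w, task, task_w, k_max=2):
    """Combinatorial part: depth-first search over partial substitutions."""
    matrix = possible_matrix(net, net_w, task, task_w, k_max)
    order = list(task)                 # vertex sorting is omitted in this sketch
    phi = {}

    def consistent(t, n):
        # Each task edge to an already mapped vertex needs a network edge
        # with at least the required bandwidth.
        for t2, demand in task[t].items():
            if t2 in phi:
                n2 = phi[t2]
                if n2 not in net[n] or net[n][n2] < demand:
                    return False
        return True

    def extend(i):
        if i == len(order):
            return True
        t = order[i]
        for n in net:
            if matrix[(n, t)] and n not in phi.values() and consistent(t, n):
                phi[t] = n
                if extend(i + 1):
                    return True
                del phi[t]             # backtrack
        return False

    return dict(phi) if extend(0) else None


# Adjacency dictionaries: vertex -> {neighbour: bandwidth}.
net = {"a": {"b": 100, "c": 50}, "b": {"a": 100, "c": 10}, "c": {"a": 50, "b": 10}}
net_w = {"a": 1000, "b": 2000, "c": 500}
task = {"p": {"q": 40}, "q": {"p": 40}}
task_w = {"p": 800, "q": 400}
print(find_reservation(net, net_w, task, task_w))
```

With one process per processor, a task vertex and the vertices within distance $k$ of it must map to distinct network vertices within distance $k$ of its image, so the wave size test never discards a feasible pair in this restricted setting.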
4 Experimental Results
At present, most network configurations belong to one of the standard topologies (star, common bus, or ring), but wireless technologies are developing
very fast, and they make more complicated topologies possible, because computers that are far apart can communicate directly when they share the same communication space. This type of communication can produce rather heterogeneous topologies, and resource allocation in such networks can be difficult. For these reasons, randomly generated graphs were used for practical tests of the algorithm's performance. As a first approximation, such graphs can describe wireless communication environments.

For this research, heterogeneous computational networks were generated that consist of computers of different performance and network connections of different bandwidth. The computational performance of processors was generated in the range between 200 and 3000 points (MHz), and the bandwidth of the randomly generated network connections lay between 10 and 100 Mbps. Task graphs were generated from network graphs by removing part of the vertices and edges and by applying a factor to the vertex and edge weights. The following factors were used: $k_{nv}$, the vertex reduction factor ($|V_T| = k_{nv} |V_N|$); $k_r$, the edge reduction factor ($|E_T| = k_r |E_N|$); $k_w$, the weight reduction factor ($J_{T,i} = k_w J_{N,i}$, $I_{T,i,j} = k_w I_{N,i,j}$).

The results described below were obtained on a computer with an AMD Athlon XP 1700+ processor and 512 MB of RAM. Each point represents the average reservation time for 100 pairs of random graphs. As follows from Fig. 1 and Fig. 2, for graphs with an edge density of 50% and a task graph size of 20% of the network size, the algorithm spends less than 1 second to find a resource reservation for graphs with up to 200 vertices. For graphs with an edge density of 30% and a task graph size of 10% of the network size, the algorithm finds a resource reservation in less than 1 second for graphs with up to 450 vertices.
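The test-graph generation described above can be sketched as follows. This is only an illustration: the function names and the dictionary-based graph encoding are hypothetical, weights are drawn uniformly (the source does not state the distribution), and the edge reduction factor is applied to the edges that survive vertex removal, which is one possible reading of $|E_T| = k_r |E_N|$.

```python
import random
from typing import Dict, Tuple

# A graph is a pair (vertex weights, edge weights); edges are keyed by (a, b) with a < b.
Graph = Tuple[Dict[int, float], Dict[Tuple[int, int], float]]


def random_network(n: int, density: float, rng: random.Random) -> Graph:
    """Random network: processor weights in 200..3000, link bandwidths in 10..100."""
    vertices = {v: rng.uniform(200, 3000) for v in range(n)}
    edges = {}
    for a in range(n):
        for b in range(a + 1, n):
            if rng.random() < density:
                edges[(a, b)] = rng.uniform(10, 100)
    return vertices, edges


def derive_task(network: Graph, knv: float, kr: float, kw: float,
                rng: random.Random) -> Graph:
    """Derive a task graph by dropping vertices and edges and scaling the weights."""
    net_vertices, net_edges = network
    n_keep = max(1, round(knv * len(net_vertices)))
    keep = set(rng.sample(sorted(net_vertices), n_keep))
    vertices = {v: kw * w for v, w in net_vertices.items() if v in keep}
    # Only edges whose two endpoints were kept can remain in the task graph.
    candidates = [e for e in net_edges if e[0] in keep and e[1] in keep]
    kept_edges = rng.sample(candidates, round(kr * len(candidates)))
    edges = {e: kw * net_edges[e] for e in kept_edges}
    return vertices, edges
```

Because the task graph is carved out of the network graph with reduced weights, every generated pair is guaranteed to admit at least one valid reservation, which makes the pair suitable for timing the search.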
Fig. 1. Average reservation time for graphs with parameters: knv = 0.2, kr = 0.5, kw = 0.2, network graphs edges density 50%
Fig. 2. Average reservation time for graphs with parameters: knv = 0.1, kr = 0.5, kw = 0.2, network graphs edges density 30%
These results indicate that the presented algorithm can be used to solve real GRID resource reservation problems.
5 Modifications for Practical Use
5.1 Additional Restrictions
Modern parallel and distributed applications can impose restrictions on some parameters of the reserved distributed environment, such as the latency of network connections or the minimal amount of physical memory available on some processing nodes. The described algorithm can easily be modified to handle almost any restriction of this type: additional conditions are added to the preliminary part of the algorithm. For example, if there is a restriction on the minimal amount of memory for each computational node, an appropriate condition can be built from two parameters: $R_N$, additional weights added to each vertex of the graph representing the network, and $R_T$, additional weights added to each vertex of the graph $G_T$, representing the restrictions on the minimal amount of available physical memory. In this case the condition for the preliminary part is

$$R_{N,i} < R_{T,j} \rightarrow M_{i,j} = \text{false} \tag{17}$$
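Condition (17) amounts to one extra pruning pass over the possible substitutions matrix. The sketch below is illustrative only; the function and parameter names are hypothetical, and the matrix is assumed to be indexed as `possible[t][n]`, with `t` a task vertex and `n` a network vertex.

```python
from typing import List


def apply_memory_restriction(
    possible: List[List[bool]],     # matrix M, indexed [task vertex][network vertex]
    task_min_memory: List[float],   # R_T: minimal memory required by each task vertex
    node_memory: List[float],       # R_N: memory available on each network vertex
) -> None:
    """Mark as impossible every pair whose network node has too little memory."""
    for t, required in enumerate(task_min_memory):
        for n, available in enumerate(node_memory):
            if available < required:
                possible[t][n] = False
```

Other restrictions, such as a latency bound, would follow the same pattern with a different comparison; entries that are already false are never set back to true.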
5.2 Handling Very Large Networks
Because of its combinatorial nature, the presented algorithm can handle graphs of up to several hundred vertices, but modern GRID systems can have many more processing nodes. To handle such large networks we propose a multi-level resource reservation method. It groups locally located computers into virtual subnetworks and, at the high level, associates a single graph vertex with a whole group of computers; the weights of that vertex are the sums of the performance and of the network connection bandwidths of all computers that make up the group. Global resource reservation is performed for groups of computers and yields the resources to be reserved in each group. At the second level, each group of computers has an associated set of resources that should be reserved inside the group. This produces several local distributed resource reservation problems, one per group of computers, but with graphs of lower dimension.
Fig. 3. Multi-level resource reservation approach. The mainframe reserves resources globally for subnetworks at the top level, and a set of servers performs local resource reservation.
An example of multi-level resource reservation is shown in Fig. 3. The network consists of two levels: a mainframe computer at the top level, and several subnetworks, each with a server and a set of workstations, at the second level. For each subnetwork, the server reports to the mainframe the amount of resources available on the computers of the subnetwork. At the top level, the mainframe applies the resource reservation algorithm to the subnetworks and sends each server the amount of resources that should be reserved inside its network. Then, at the local level, the servers apply the same algorithm to reserve distributed resources on the computers of their subnetworks. In this way a large-scale problem can be solved, as the algorithm is applied several times but
to graphs of smaller size. This approach allows the problem to be solved on GRIDs of almost any size by adding further grouping levels when needed.
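The top-level graph used by the multi-level method can be built by summing weights per group, as in the following sketch. The function name and data layout are hypothetical, and links inside a group are assumed to be ignored at the top level because they are handled by the local reservation step.

```python
from typing import Dict, Tuple


def aggregate_groups(
    performance: Dict[str, float],              # weight of each computer
    bandwidth: Dict[Tuple[str, str], float],    # weight of each network connection
    group_of: Dict[str, str],                   # computer -> subnetwork (group) name
) -> Tuple[Dict[str, float], Dict[Tuple[str, str], float]]:
    """Collapse each group of computers into one vertex of the top-level graph."""
    group_weight: Dict[str, float] = {}
    for computer, value in performance.items():
        g = group_of[computer]
        group_weight[g] = group_weight.get(g, 0.0) + value

    group_links: Dict[Tuple[str, str], float] = {}
    for (a, b), value in bandwidth.items():
        ga, gb = group_of[a], group_of[b]
        if ga != gb:                            # intra-group links stay at the local level
            key = (ga, gb) if ga < gb else (gb, ga)
            group_links[key] = group_links.get(key, 0.0) + value
    return group_weight, group_links
```

The same reservation algorithm can then be run once on the aggregated graph and once per group on the original subgraphs.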
6 Conclusions
In this paper a new distributed resource reservation algorithm is presented. The algorithm uses the wave expansion of graphs approach to form preliminary conditions, and it can handle additional restrictions when reserving resources, either resource limits (both minimal and maximal) or additionally reserved characteristics. The proposed method of splitting large-scale problems into several problems of smaller size makes it possible to use this algorithm in real-size resource reservation systems. The algorithm can be implemented as part of a GRID management system, as a middleware component responsible for resource management (reservation, allocation and monitoring of distributed system resources). Its use should be desirable in systems that require globally optimized resource reservation, either because a small number of large tasks is running or because globally balanced network utilization is needed. Such a component is now under development with the support of a grant for developing GRID-enabling infrastructure in Ukraine. Future efforts will also be devoted to estimating the algorithm's performance characteristics and to further enhancements.
References
1. Foster, I., Roy, A., Sander, V.: A quality of service architecture that combines resource reservation and application adaptation. In: Proceedings of the 8th International Workshop on Quality of Service (IWQOS), Pittsburgh, pp. 181–188 (June 2000)
2. Sander, V.: A Metacomputer Architecture Based on Cooperative Resource Management. In: Proceedings of High Performance Computing and Networking Europe 1997 (HPCN 1997), Wien, April 28-30 (1997)
3. Berman, F., Wolski, R., Figueira, S., Schopf, J., Shao, G.: Application level scheduling on distributed heterogeneous networks. In: Proceedings of Supercomputing 1996 (1996)
4. Wandan, Z., Guiran, C., Dengke, Z., Xiuying, Z.: G-RSVPM: A Grid Resource Reservation Model. In: First International Conference on Semantics, Knowledge and Grid (SKG 2005), p. 79 (2005)
5. Cordella, L.P., Foggia, P., Sansone, C., Vento, M.: Performance evaluation of the VF Graph Matching Algorithm. In: Proc. of the 10th ICIAP, pp. 1172–1177. IEEE Computer Society Press, Los Alamitos (1999)
6. Bunke, H., Vento, M.: Benchmarking of graph matching algorithms. In: Proceedings of the 2nd Workshop on Graph-based Representations, Haindorf, pp. 109–114 (1999)
7. Cordella, L.P., Foggia, P., Sansone, C., Vento, M.: An improved algorithm for matching large graphs. In: Proceedings of the 3rd IAPR-TC-15 International Workshop on Graph-based Representations, Italy, pp. 149–159 (2001)
8. Foggia, P., Sansone, C., Vento, M.: A performance comparison of five algorithms for graph isomorphism. In: Proceedings of the 3rd IAPR-TC15 Workshop on Graph based Representation (GbR 2001), Italy (2001)
9. Il'yashenko, M.: Parallel subgraph isomorphism algorithm development and research. Radio electronics, Informatics, Management 1, 39–63 (2006)
10. Pinchuk, V.P.: Based on Wave Expansion System of the Invariants for Simple Graphs. Summ. of Accept. Com. of Intern. Conf. In: Applied Modelling & Simulation AMS 1993, Ukraine, Lviv, September 30–October 2 (1993)
A PMI-Aware Extension for the SSH Service
Giuliano Laccetti¹ and Giovanni Schmid²
¹ Dipartimento di Matematica ed Applicazioni - Università degli Studi di Napoli Federico II, Via Cintia, 80126 Napoli [email protected]
² ICAR - Istituto di Calcolo e Reti ad Alte Prestazioni - Sede di Napoli, Via P. Castellino n. 111, 80131 Napoli [email protected]
Abstract. Privilege Management Infrastructures (PMI), used in conjunction with PKIs, allow for an effective, efficient and scalable enforcement of access control in complex distributed systems like grids. We propose a PMI-aware extension for the SSH service, in order to obtain a certificate-based system entry service supporting the direct delegation functionality. Our design uses the PAM and NSS frameworks, so that such an extension could be easily generalized to encompass any other system entry service. Indeed, as detailed in a previous work, we look at it as a starting point of a fully integrated design, strictly adhering to modern computing security principles, in which distributed security-oriented OSs act as building blocks of grid-like architectures encompassing advanced resource-sharing and collaborative environments.
Keywords: Grid security; Access control; Authentication; Authorization.
1 Introduction
Modern grid implementations adhere to the ISO-IEC general access control framework introduced in [8], which considers an access control mechanism as composed of two independent functional modules: a Policy Decision Point (PDP) and a Policy Enforcement Point (PEP). A PDP is an application-independent function which makes authorization decisions on the basis of an application-independent policy. By contrast, a PEP is an application-dependent function which should be located in the same operating environment as the access target, and its task is to carry out the access decisions made by the PDP. In these grid implementations, PDPs are realized through standalone, grid-unaware policy decision engines, such as PERMIS [2] or any other X.509-compliant privilege management infrastructure, whilst PEPs are implemented in grid middleware, e.g. the Globus Security Infrastructure (GSI) of the Globus Toolkit. However, as detailed in a previous work [9], major interoperability problems exist between the grid and the operating system layer, which the access control approach described above could make even worse. Indeed, all the solutions
Partially supported by the SCoPE Italian PON Project.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 932–941, 2008. © Springer-Verlag Berlin Heidelberg 2008
proposed so far rely on grid middleware PEPs that must be interfaced with PEPs at the OS layer. Since resource access management is left to legacy OS kernel mechanisms, entity identification is realized through two separate namespaces: that of Distinguished Names (DNs) at the grid layer and that of user names at the OS layer. To meet two basic requirements of grids, namely single sign-on and coexistence with local security principles, these two namespaces must be explicitly correlated into a global domain name with a suitable mechanism. That imposes a trade-off between expressiveness and scalability, and prevents direct delegation from being implemented at the OS layer. Direct delegation is the capability of users on a given OS to delegate their rights concerning resource usage to people who are not registered on that system. It was introduced in [5] to support both transient and ad-hoc collaborations, which could greatly improve grids (and other complex distributed environments) as collaborative environments. The analysis in [9] shows that the above limitations could be eliminated by extending the authentication and authorization features at the OS layer, as system applications and services, in order to obtain full interoperability with X.509-based infrastructures at the grid layer. In this work, we propose a PMI-aware extension for the Secure Shell (SSH), in order to obtain a first, significant example of a certificate-based system entry service that supports the direct delegation functionality. In our design, we choose to rely on the Pluggable Authentication Module (PAM) and Name Service Switch (NSS) frameworks, available on virtually every modern Unix-like platform. Instead of extending the SSH protocol itself, this allows us to integrate it with the current state-of-the-art technologies for accessing and managing information about network-related entities in open, heterogeneous distributed systems.
2 Functional Requirements
SSH is an Internet standards track protocol for secure remote login and other secure network services, which is widely adopted for operating securely on a remote host through a command-line shell. SSH host-based authentication relies on asymmetric cryptography, and user authentication can optionally use it as well. In the current version 2 of the standard [14], both authentication protocols are designed to support public-key certificates (PKCs), although, to the authors' knowledge, none of the production SSH implementations supports PKC management; they rely only on key management through local databases. However, a PKI deployment eases the maintenance problem when there are many keys, and it is mandatory in complex environments, where clients and servers may belong to different administrative domains. A further limitation is that SSH interactions are possible only if the requester (i.e., the client agent) has a regular account on the remote host which offers the service, or otherwise knows the authentication credentials of another user on that host. This is a severe obstacle for some usage scenarios and
distributed environments; in particular, it is incompatible with basic grid functional requirements, and led the Globus designers to realize a grid-aware SSH server, that is, one supporting Globus proxy certificates [13]. Consider now a host H which runs an SSH service supporting both Public-Key Certificates (PKC) and Attribute Certificates (AC) [7], where PKCs and ACs are used to convey authentication and authorization information separately. Let U and S be a remote user (who connects to the service through a compatible SSH client) and a user with an account on H, respectively. Then the following two features can be implemented in the service:
1. U can authenticate herself on H, whether or not she has an account on H;
2. S can optionally entitle U to get a profiled shell on H and/or to use a restricted set of the applications and utilities on H for which he has execute permissions.
Item (2) is the direct delegation requirement, revisited for SSH, whilst (1) is itself a generalization of common UID-based authentication scenarios. If U has an account on H, then she can apply suitable restrictions to her remote login facilities, depending on the host or domain name from which the access request is sent. This allows the enforcement of the Least-Privilege Principle in the context of remote access; for example, an administrator might set up attribute certificates for the root UID of H in such a way that she can log in as a more or less powerful "root profile", depending on whether she makes the logon request from the same private LAN as H, from another host in the same enterprise or organization, or from a generic client on the Internet (current SSH implementations usually deny remote root access requests by default).
If U has no account on H, then (1) results in a "nobody"-access scenario, but now the administrator has the opportunity to set up one or more nobody profiles, depending on the information (host, domain name, etc.) conveyed through the DN of U and her related PKC. This is, for example, an important facility for grid environments, since it easily allows resource administrators to set upper bounds on resource usage for a small variety of grid user profiles, leaving the specification of finer-grained privileges, possibly for each single grid user, to administrators at the virtual organization layer. With the functionality stated in item (2), authorized but ordinary users of H are also granted the rank of "administrators of their own resources" with respect to users who have no account on H. Thus, (2) is a way of extending to distributed environments the discretionary access control policy as implemented in general-purpose OSs. If (2) applies, we say that S is a sponsor for U on H. It is assumed that U could have more than one sponsor on H, and that S could act as a sponsor for U only for specified periods of time and/or as a consequence of suitable conditions, as defined by S according to the security policy for H. Our main aim is to design the above extensions for the SSH service with respect to a general access control framework for grid-like environments, in which
distributed security-oriented OSs interact with each other through suitable system entry services playing a role similar to that of the gatekeeper in the Globus Toolkit [4]. This framework, introduced in [9], aims to make the access control model at the OS layer fully interoperable with X.509-based PKI and PMI at the grid layer, where PKCs and ACs are used to convey authentication and authorization information, respectively. This results in the following requirements for any compliant OS:
a) support for asymmetric authentication through X.509 PKCs, where a user is identified through an X.509 Distinguished Name (DN) composed of a login name and the Fully Qualified Name of a host, the login name being the account identifier of the user on that host;
b) support for resource access authorization information through X.509 ACs;
c) embedded functions of both Certification Authority and Source of Authority for its (registered) users;
d) provisioning of an execution environment for both registered and unregistered users, on the basis of the authorization information contained in the ACs issued for such users and checked as valid at runtime.
Although these requirements are a main concern in the design of a PMI-aware system entry service, they represent prerequisites upon which such a design has to be built, rather than objects of the design itself. Indeed, in modern Unix-like OSs, access control is a quite complex, multistage procedure which relies on different subsystems and technologies, in order to provide a layer of abstraction between a system entry service (in our case, specifically, the SSH service) and the implementation of any authentication, identification or authorization functions it needs to perform.
Thus, the discussion of the implementation issues for a fully compliant X.509 PKI and PMI SSH service consists mostly in detailing the way such a service affects the access control procedure, in terms of the different subsystems involved, their functions and their mutual relationships.
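The sponsor relation introduced in this section (S sponsors U on H, possibly only for specified periods of time) can be illustrated with a small data model. This is a hypothetical sketch: the class and field names are not taken from the paper, and a real deployment would carry this information in an X.509 attribute certificate rather than in a Python object.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Sponsorship:
    sponsor_login: str           # S: a user with an account on host H
    holder_dn: str               # U: distinguished name of the remote user
    host: str                    # H: host on which the delegation applies
    privileges: FrozenSet[str]   # restricted set of rights delegated by S
    not_before: datetime         # start of the validity period
    not_after: datetime          # end of the validity period

    def valid_at(self, when: datetime) -> bool:
        """True if the delegation is inside its validity period."""
        return self.not_before <= when <= self.not_after


def active_sponsorship(records: Iterable[Sponsorship], holder_dn: str,
                       sponsor_login: str, host: str,
                       when: datetime) -> Optional[Sponsorship]:
    """Return the first record by which S currently sponsors U on H, if any."""
    for record in records:
        same_parties = (record.holder_dn, record.sponsor_login, record.host) == (
            holder_dn, sponsor_login, host)
        if same_parties and record.valid_at(when):
            return record
    return None
```

Since U may have several sponsors on H, the lookup takes the sponsor as an explicit argument instead of guessing one.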
3 Implementation Issues
Our design refers to an NSS- and PAM-enabled implementation of SSH, so only slight modifications to the SSH code are required: they concern just the SSH protocol itself, not any authentication or authorization method. This yields a main advantage: our design can be quite easily adapted to any other NSS- and PAM-enabled system entry server. Another main design choice was to adopt the Lightweight Directory Access Protocol (LDAP) as a remote and distributed repository of information about PKCs and ACs. LDAP support for PKCs was discussed in [1] and is now a proposed standard [17], whereas, to the best of our knowledge, no support currently exists for ACs. Discussing such an extension is outside the scope of the present work, and will be treated in a future paper.
Finally, we decided to rely on the pull distribution mechanism for certificate sharing between the SSH client and server. We made this choice because it simplifies the add-on functions required for both the SSH protocol and the client user interface. In what follows, we detail some key functional aspects of the subsystems involved in the design, indicating how our design choices affect them. As a starting point, we need to recall briefly how access control takes place in modern Unix-like operating systems, and how to extend it in order to encompass direct delegation.
3.1 Access Control
Access control refers to the procedure of controlling access to a resource (target) by an entity making an access request (initiator), and in general-purpose OSs it is implemented as a pipeline composed of the authentication and authorization stages. The authentication stage verifies the identity of the initiator with respect to a userland namespace (e.g. the login-names namespace or the NIS+ namespace) and, if the verification succeeds, it returns handles for the kernel to operate with the initiator: a single user identifier (UID) and one or more group identifiers (GIDs). Given the identity of the initiator as returned by her authentication, the authorization stage returns a granting or denying response according to the access control policy in force on the system. In the classical Unix-like security model, authorization decisions strictly depend upon identification at the authentication stage: if UID=0, then access is always granted, otherwise it is checked against the access control lists (ACLs) for the target. Some UNIX flavors have moved away from this security model, allowing for the enforcement of more powerful access control policies, and offering process privilege models which detach authorization from authentication. A comprehensive, interesting approach was chosen for the privilege model in Sun Solaris [10], which has very useful features for meeting the direct delegation requirement (2). Detailing its implementation is outside the scope of this work; what is relevant here is that such an approach, through the specification of a suitable set of process privileges, makes it possible to enforce a login environment for U as a restricted, customized version of the login environment of her sponsor S. U can act on H as S, that is with his same UID, since all the processes she runs on H will be checked against her process privilege set.
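The difference between the classical model and a privilege-set model can be shown with a toy decision function. This is a deliberately simplified sketch, not the Solaris privilege API: the function names, the ACL encoding (UID to permitted operations) and the representation of a privilege set as a set of operation names are all assumptions made for illustration.

```python
from typing import Dict, Set


def classical_authorize(uid: int, target_acl: Dict[int, Set[str]],
                        operation: str) -> bool:
    """Classical Unix-like decision: UID 0 is always granted, others consult the ACL."""
    if uid == 0:
        return True
    return operation in target_acl.get(uid, set())


def delegated_authorize(sponsor_uid: int, target_acl: Dict[int, Set[str]],
                        operation: str, privilege_set: Set[str]) -> bool:
    """U acts with the UID of sponsor S, but is further limited by her privilege set."""
    if operation not in privilege_set:
        return False
    return classical_authorize(sponsor_uid, target_acl, operation)
```

In this model the delegated user can never do more than her sponsor, because the sponsor's own check is still applied after the privilege-set filter.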
In our design, the set of privileges defining the running profile of U on the host H is encoded as an AC issued by S for U under the authority of the operating system on H, which acts as a Source of Authority [7]. This allows them to be published in publicly available repositories, that is, through X.509 directory servers.
3.2 NSS and PAM-Enabled Services
The mapping from userland to kernel namespaces is called identification, and in our design it is required both in the authentication and authorization stages, in order to support direct delegation.
Identification is realized through a nameservice. Generally speaking, a nameservice is a local or remote (possibly distributed) repository of information about mappings from some namespaces to others, and vice versa. At its simplest, a nameservice may be implemented through a collection of files in the local filesystem which are opened and searched by the C library. Other common nameservices include the Network Information Service (NIS) and the Domain Name System (DNS). Like identification, the verification process might rely on different technologies (e.g. secret-key technologies such as the password-based one, or public-key technologies), and it is obviously preferable that different technologies are implemented as separate, loadable modules that can be combined in a configurable stack fashion. For these reasons, operating systems provide a layer of abstraction between a user application and the implementation of any identification or verification functions it needs to perform. A user application need only call functions exported by the UNIX C library. Based on user-level settings defined by the system administrator, different modules providing different mechanisms both for identification and verification might be used to perform the actual task.
Virtually every modern Unix-like system adopts the Name Service Switch (NSS) and the Pluggable Authentication Module (PAM) as the abstraction frameworks for identification and verification, respectively. Since our design refers to an NSS- and PAM-enabled implementation of SSH, it results in some extensions to both frameworks.
NSS Extensions. NSS was first introduced by Sun Microsystems in the C library of Solaris 2, and since then it has been picked up by almost all C/C++ library distributions for Unix-like systems, but generally with differing implementations. NSS allows user and system information to be obtained from different nameservices, such as NIS or DNS, through standard libc interfaces.
Its exact behaviour can be configured through the nsswitch.conf file, which specifies how the lookup process should work for each supported database, that is: the nameservices involved, their lookup order, and the reaction to each lookup result. Distinct nameservices are implemented through separate library modules, and quite recently an NSS module was introduced to support the NIS schema as well as user-defined schemas from LDAP directories [12]. However, some add-ons might be required in this framework in order to retrieve user information via X.509 PKCs and ACs.
PAM Extensions. In PAM [11], authentication mechanisms are implemented by dynamically linked shared libraries called PAM modules, and PAM-enabled applications know which modules to load through a configuration file named pam.conf. This allows system administrators to dynamically configure authentication schemes for all PAM-enabled system utilities and applications by editing the configuration file. Moreover, new authentication technologies can be added without changing any PAM-enabled application, only by developing ad-hoc PAM modules.
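The NSS lookup process described above (several nameservices tried in a configured order for each database) can be modelled in a few lines. This is a pure-Python illustration, not the libc interface: the function name, the dictionary standing in for nsswitch.conf and the in-memory sources are hypothetical, and only the usual default reaction is modelled (return on success, continue to the next source when the entry is not found).

```python
from typing import Callable, Dict, List, Optional

# A source maps a key (e.g. a login name) to an entry, or None when not found.
Source = Callable[[str], Optional[dict]]


def nss_lookup(database: str, key: str,
               switch: Dict[str, List[str]],
               sources: Dict[str, Source]) -> Optional[dict]:
    """Try each nameservice configured for `database`, in order, until one answers."""
    for name in switch[database]:
        entry = sources[name](key)
        if entry is not None:      # success: stop at the first source that knows the key
            return entry
    return None                    # no configured source knows the key


# Example configuration: local files first, then an LDAP-backed source.
local_files = {"root": {"uid": 0}}
ldap_directory = {"root": {"uid": 99}, "schmid": {"uid": 5001}}
example_switch = {"passwd": ["files", "ldap"]}
example_sources = {"files": local_files.get, "ldap": ldap_directory.get}
```

With this configuration a lookup of `root` is answered by the local files and never reaches the directory, while `schmid` falls through to the LDAP-backed source. A certificate-aware source of the kind the design calls for would be one more entry in the same list.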
Part of the power of PAM lies in the ability to combine (in PAM terminology, to stack) PAM authentication modules for a given task. For example, on a system in which rlogind has been PAM-enabled, the administrator can stack the standard UNIX authentication module with an rhosts-based authentication module to implement traditional rlogin-style authentication. The administrator can also choose to remove the rhosts-style authentication module from the stack, disallowing rhosts authentication, or can combine the first two modules with a third module to allow rhosts authentication only for non-root users.
Our design requires two new PAM modules, here denoted pam_X509_authE and pam_X509_authO, which operate in a stack to realize the features required for the SSH service, as described in Section 2. pam_X509_authE requires the X.509 DN of a user U, as prescribed in (a), and performs X.509 PKC-based authentication as follows:
– using the extensions previously described for NSS, it searches for U's public key $K_{pub}(U)$ in a local repository of X.509 PKCs (pkX509cert) and through an LDAP Directory User Agent;
– it compares $K_{pub}(U)$ with the public key sent by the client in its access request and, if they match, checks the signature.
pam_X509_authO performs X.509 AC-based authorization following successful authentication by pam_X509_authE. It operates as follows:
– it checks whether the user distinguished name contains a host name which differs from the one specified through the hostname argument and, if this is the case, asks the client for a sponsor login name <S>;
– using standard NSS mechanisms, it searches for <S> in a local repository (passwd) and through an LDAP Directory User Agent.
3.3 SSH Code Extensions
Along with the extensions described in the previous section, which do not affect the SSH code, some minor changes must be made at both the client and server sides of the application. Indeed, the new required functions are based upon certificate usage and, although the pull distribution mechanism avoids the need to support certificate transmission, the SSH protocol and the client user interface have to be augmented to support them.
The Client User Interface. Through the user interface, the client has to tell the server that it intends to perform a PKI- and PMI-based interaction. Current implementations of the SSH client provide the following synopsis for the specification of users and hosts:
ssh [opts] [-l logname] hostname | user@hostname [command],
so we leave it apparently unchanged, but add the parameter type X.509 distinguished name both for the -l option and for the argument user:
ssh [opts] [-l logname | X.509name] hostname | user@hostname [command].
X.509 distinguished names are typed in as dot-separated, DNS-style strings, as in:
ssh -l schmid.client.na.icar.cnr.it server.dma.unina.it
or, equivalently:
ssh [email protected].
Parsing the user name and selecting the right authentication method are left to the PAM API, thus avoiding modifications at the server side of the application. In the above example, the user distinguished name contains a host name which differs from the one specified through the hostname argument, so the service assumes that the user is asking to log in to the system thanks to a sponsor. Hence, a sponsor login name request is sent from the pam_X509_authO module to the client, which prompts the user to type in that information. Alternatively, the -S option can be used to supply the sponsor login name on the command line, as in:
ssh -S laccetti [email protected].
Finally, if the user distinguished name contains a host name which equals the one specified through the hostname argument, as in:
ssh [email protected],
then the service assumes by default that the user intends to perform X.509 PKC-based authentication on the server host, and that he has a standard account on it. However, we allow a user who has an account on a host to ask to log in to it through a sponsor. In that case, the command line must explicitly contain the option indicating the sponsor login name, as in:
ssh -S laccetti [email protected].
The SSH Protocol. One design goal of SSH protocol version 2 has been extensibility: the core architecture is simple, and algorithms, methods, formats, and extension protocols are identified with textual names that have a specific format. The SSH protocol consists of a stack of the following three major components [14]. The transport protocol ssh-trans [16] performs server host authentication, key exchange, encryption, and integrity protection.
The authentication protocol ssh-userauth [15] provides a suite of mechanisms that can be used to authenticate the client user to the server. Finally, the connection protocol ssh-connect [16] specifies a mechanism to multiplex multiple streams (channels) of data over the confidential and authenticated transport.
Our design leaves the above components unchanged, requiring only the addition of one extension protocol to ssh-userauth in order to specifically support direct delegation. Indeed, ssh-userauth provides a built-in authentication mechanism named publickey, on which we rely to authenticate the user by her public key. In this way, the SSH client sends the user's public key and a signature to the server, and the pam_X509_authE module on the server side has to check the validity of both to grant authentication. Publickey may optionally support the transmission of public-key certificates through the "public key blob" field of the SSH_MSG_USERAUTH_REQUEST message, but we do not make use of this facility because we adopt the pull distribution mechanism. Instead, as stated before, the user's public key is derived from a PKC, which in turn is retrieved by the SSH server using the NSS framework in conjunction with a local repository (pkX509cert) and LDAP DUAs. The extension protocol operates after ssh-userauth and before ssh-connect, and it is required to realize the sponsor request-response interaction. This protocol is very simple and comes into play only if the -S option is provided on the command line or if the user distinguished name contains a host name which differs from the one specified through the hostname argument.
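The rule that decides when the sponsor request-response interaction is needed can be sketched as follows. This is an illustration under stated assumptions: the names `AccessRequest` and `classify_request` are hypothetical, the dot-separated X.509 name is assumed to consist of a login name (the first label, which therefore may not contain a dot) followed by the fully qualified host name, and host names are compared case-insensitively.

```python
from typing import NamedTuple, Optional


class AccessRequest(NamedTuple):
    login: str               # login-name part of the X.509 name
    user_host: str           # host part of the X.509 name
    needs_sponsor: bool      # True if the sponsor interaction must take place
    sponsor: Optional[str]   # sponsor login name given with -S, if any


def classify_request(x509_name: str, hostname: str,
                     sponsor: Optional[str] = None) -> AccessRequest:
    """Split a dot-separated X.509 name and decide whether a sponsor is involved."""
    login, _, user_host = x509_name.partition(".")
    # A sponsor is involved if -S was given, or if the user's own host
    # differs from the host she is trying to reach.
    needs_sponsor = sponsor is not None or user_host.lower() != hostname.lower()
    return AccessRequest(login, user_host, needs_sponsor, sponsor)
```

For the command `ssh -l schmid.client.na.icar.cnr.it server.dma.unina.it`, the host parts differ, so the sponsor interaction is required even without `-S`. A name whose host part equals the target host (for example the made-up `schmid.server.dma.unina.it`) leads to plain PKC-based authentication unless `-S` is supplied.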
4 Conclusions
Direct delegation is a fairly recent functional requirement for grid-like distributed environments, which has emerged in some advanced scenarios such as Ad-hoc Collaboration, Multi-project Management and Resource Provider Protection. It is intended to allow users and applications on a given host to obtain services from another host without being registered on it, solely as a consequence of an authorization issued by one of its users or services, as specified by the local policy. In this work we propose a PMI-aware extension for the SSH service, in order to obtain a system entry service that is fully integrated with X.509 authentication and authorization infrastructures and that supports direct delegation. We envision this kind of entry service as the "glue" of grid-like architectures for advanced resource-sharing and collaborative environments, composed of gridware-unaware, distributed, security-oriented OSs. Instead of extending the SSH protocol itself, we chose to rely on the PAM and NSS frameworks, available on virtually every modern Unix-like platform. This approach has the following advantages:
– assuming the PAM-awareness of the SSH implementation, it only requires slight modifications to the SSH protocol and the client user interface;
– it could be easily generalized to encompass any other (PAM-aware) system entry service;
– following a main trend in the design of distributed environments, it promotes the adoption of the Lightweight Directory Access Protocol (LDAP) and its add-ons, a set of emerging Internet standards and proposed standards, for conveying user, host and application information.
A PMI-Aware Extension for the SSH Service
Acknowledgements
The authors would like to thank Domenico Minchella and Giuseppe Russo of Sun Microsystems for their useful comments and suggestions.
References
1. Boeyen, S., Howes, T., Richard, P.: Internet X.509 Public Key Infrastructure LDAPv2 Schema. RFC 2587 (1999)
2. Chadwick, D.W., Otenko, O.: The Permis X.509 Role Based Privilege Management Infrastructure. Future Gener. Comput. Syst. 19, 277–289 (2003)
3. Chadwick, D.W.: Authorization in Grid Computing. Information Security Technical Report 10, 33–40 (2005)
4. Globus Toolkit 4 Security Documentation, http://www.globus.org/toolkit/docs/4.0/security/index.html
5. Lorch, M., Kafura, D.: Supporting Secure ad hoc User Collaborations in Grid Environments. In: Parashar, M. (ed.) GRID 2002. LNCS, vol. 2536, Springer, Heidelberg (2002)
6. ISO-IEC Std. 9594-8 | ITU-T Rec. X.509 (1993)
7. ISO-IEC Std. 9594-8 | ITU-T Rec. X.509 (2001)
8. ISO-IEC Std. 10181-3 | ITU-T Rec. X.812 (1995)
9. Laccetti, G., Schmid, G.: A Framework Model for Grid Security. Future Gener. Comput. Syst. 23, 702–713 (2007)
10. Mauro, J., McDougall, R.: Solaris Internals, 2nd edn. Sun Microsystems Press (2005)
11. Samar, V., Lai, C.: Making Login Services Independent of Authentication Technologies. In: Proceedings of the SunSoft Developers Conference (1996)
12. nss_ldap module web page: http://www.padl.com/OSS/nss_ldap.html
13. Tuecke, S., Welch, V., Engert, D., Pearlman, L., Thompson, M.: Internet X.509 Public-Key Infrastructure (PKI) Proxy Certificate Profile. RFC 3820 (2004)
14. Ylonen, T.: The Secure Shell (SSH) Protocol Architecture. RFC 4251 (2006)
15. Ylonen, T.: The Secure Shell (SSH) Authentication Protocol. RFC 4252 (2006)
16. Ylonen, T.: The Secure Shell (SSH) Transport Layer Protocol. RFC 4253 (2006)
17. Zeilenga, K.: Lightweight Directory Access Protocol (LDAP) Schema Definitions for X.509 Certificates. RFC 4523 (2006)
An Integrated ClassAd-Latent Semantic Indexing Matchmaking Algorithm for Globus Toolkit Based Computing Grids
Raffaele Montella, Giulio Giunta, and Angelo Riccio
Dept. of Applied Science at University of Naples Parthenope - Italy
{raffaele.montella, giulio.giunta, angelo.riccio}@uniparthenope.it
Abstract. Effective and efficient allocation of computing resources is critical to achieving realistic weather, marine and soil flooding simulations and forecasts at high spatial and temporal resolution within an affordable computing time. Resource allocation relies on a matchmaking tool, which must be flexible and straightforward to deploy, configure and use. Using the Globus Toolkit 4 and our resource broker service, we compared the ClassAd matchmaking algorithm with our own matchmaking algorithm, and then we integrated the two algorithms in order to avoid their drawbacks.
Keywords: Grid Computing, Web Service, Resource Broker, Matchmaking, Latent Semantic Indexing, Globus Toolkit, ClassAd.
1 Introduction
In the grid computing field the term "resource" usually refers to something shared among Virtual Organization (VO) participants that can be discovered, selected and then used, basically for computing or storage purposes. The design of matchmaking and resource broking systems is therefore a technological challenge in the emerging technology of grid computing, affecting the overall performance of the specific grid-aware application and of the whole grid itself. The Condor Project's ClassAd [3] language is commonly adopted as a "lingua franca" for describing grid resources, but Condor itself does not make extensive use of Web Services. The Globus Toolkit version 4 (GT4) [1], on the other hand, is implemented using the web services resource framework and offers basic services for job submission, data replica and location, reliable file transfers and resource indexing, but it does not provide a resource broker and matchmaking service. In a previous paper we described the development of a Resource Broker Service (RBS) [2] based on the Web Services technology offered by the GT4. Looking for a new approach to the matchmaking problem, we developed from scratch a Latent Semantic Indexing (LSI) [6] based algorithm using concepts commonly applied to information retrieval and internet search engines. We compared the performance of the algorithms using the test component built into the framework, in order to analyze their behavior in different grid conditions. In this paper we describe the integration in the GT4 of the tools implementing the Resource Broker Service (RBS), focusing on interoperability issues. The integration leverages the GT4 Index Service and is implemented using a component that maps the native MDS resource representation to ClassAd, in order to design a matchmaking algorithm that integrates both approaches and achieves the best selection performance. The resource broker architecture and design, the matchmaking algorithm model, and the ClassAd and LSI implementations are described in section 2, while in section 3 we compare the performance of the algorithms using a self-developed test suite. In section 4, we describe an integrated ClassAd/LSI approach to matchmaking aimed at achieving the best fit between needs and resources. The final section contains some performance considerations about the developed algorithms, concluding remarks and plans for future work.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 942–950, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Multi-matchmaking Algorithm Resource Broker Service
The brokering service that we have developed is responsible for interpreting requests and enforcing virtual organization policies on resource access, hiding many of the details involved in locating suitable resources. The resource broking process can be divided into two parts. First, a matchmaking algorithm finds a set of matching resources using specific criteria. Then, an optimization algorithm selects the best available resource among them [8]. The RBS relies on a two-phase commit approach and enables users to query a specified virtual organization index service for a specific resource, and then mediates between the resource consumer and the resource producer(s) that manage the resources of interest. Resources register themselves with the resource broker by advertising their availability inside the VO index. Entities, represented by End Point References (EPR), are automatically classified as resource producers, stored in a GT4 oriented data structure for efficiency, and then mapped using the Condor [3] ClassAd [4] description language. Property processing is handled by the Collector component, whose fully configurable and customizable features include the creation of new properties valued in a user-defined way, lookup through a table of values, ranging between class intervals, and averaging and aggregation, as in the case of the capabilities of cluster computing nodes. This approach decouples the internal representation of the resource status from the external form used by a specific matchmaking algorithm or query system. We thus achieved our goal of implementing an extensible and customizable matchmaking framework that is also useful for testing and comparison purposes (Figure 1). We leveraged this feature to develop from scratch a Latent Semantic Indexing (LSI) [5] based matchmaking algorithm using a custom formalization of resource properties.
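The kind of property processing attributed to the Collector (table lookup, class intervals, averaging and aggregation over cluster nodes) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the raw property keys (cpu_mhz, mem_mb, arch), the lookup table and the class boundaries are hypothetical and do not come from the RBS code.

```python
from statistics import mean

# Hypothetical lookup table: raw architecture strings -> advertised value.
ARCH_LOOKUP = {"i686": "x86", "i386": "x86", "x86_64": "x86_64"}

# Hypothetical class intervals for CPU speed in MHz: (upper bound, label).
SPEED_CLASSES = [(1000, "slow"), (2500, "medium"), (float("inf"), "fast")]


def classify(value, classes):
    """Return the label of the first interval whose upper bound exceeds value."""
    for upper, label in classes:
        if value < upper:
            return label
    raise ValueError("value outside every class interval")


def collect_cluster(nodes):
    """Aggregate per-node raw properties into one cluster-level record."""
    if not nodes:
        raise ValueError("a cluster needs at least one node")
    avg_speed = mean(n["cpu_mhz"] for n in nodes)
    return {
        "NumNodes": len(nodes),                              # aggregation
        "CPUSpeed": avg_speed,                               # averaging
        "SpeedClass": classify(avg_speed, SPEED_CLASSES),    # class intervals
        "TotalMemoryMB": sum(n["mem_mb"] for n in nodes),    # aggregation
        "Arch": ARCH_LOOKUP.get(nodes[0]["arch"], "unknown"),  # table lookup
    }
```

The resulting record is independent of any matchmaker, which is the decoupling the paragraph above describes: a ClassAd mapper or an LSI-style matcher could each consume it in its own external form.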
Fig. 1. The matchmaker framework characterized by Collector and Matchmaker components
In order to apply LSI to grid computing resource matchmaking we had to map some concepts from one field to the other: resources, identified by EPRs, are characterized by named and typed properties in the same way as documents, located by a URL, are described by keywords. We simplified this model because resource properties, unlike document keywords, carry no intrinsic semantics arising from the aggregation of terms. Using an LSI based algorithm, each grid resource is placed in a hyperspace with as many dimensions as the number of properties characterizing the resource. In order to prevent an anisotropic deformation of the space, numeric property values have to be normalized through an adimensionalization process. The Euclidean distance between the query target and each available resource is then computed in the isotropic, a-dimensional space, and the resource at the shortest distance is chosen as the best one (Figure 2). In order to implement a GT4 resource oriented matchmaking algorithm using the ClassAd framework, we developed a mapping component that projects Index Service entries into the ClassAd resource representation in an automatic but configurable fashion. The ClassAd matchmaking implementation provided by the library performs only a very basic function: it matches two ClassAds, evaluating the compatibility of the strict requirements and, eventually, the rank value for each side. The selection of the best resource is entirely left to the developer, so we implemented two selectable approaches. In the simple one, the best resource is selected by maximizing both rank values. In the other, the LSI matchmaking algorithm is applied to the resources selected by ClassAd in order to find the one most similar to the requested one.
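The distance-based selection described above can be sketched in a few lines. The text does not say how the normalization is performed, so this sketch assumes min-max scaling over the available resource set; the function names and the property names in the usage note are hypothetical.

```python
import math


def normalize(resources, query):
    """Min-max scale every queried property using the resource set's range."""
    keys = sorted(query)
    lo = {k: min(r[k] for r in resources.values()) for k in keys}
    hi = {k: max(r[k] for r in resources.values()) for k in keys}

    def scale(props):
        # A property with zero spread contributes nothing to the distance.
        return [
            0.0 if hi[k] == lo[k] else (props[k] - lo[k]) / (hi[k] - lo[k])
            for k in keys
        ]

    scaled = {name: scale(props) for name, props in resources.items()}
    return scaled, scale(query)


def best_match(resources, query):
    """Return the identifier of the resource closest to the query."""
    scaled, target = normalize(resources, query)
    return min(scaled, key=lambda name: math.dist(scaled[name], target))
```

As a usage example, take resources A (CPUSpeed 3000, MemoryGB 1), B (2900, 8) and C (1000, 4) with a query for (3000, 8). On raw values A is nearest, because the MHz axis dominates the distance; after scaling, B is selected. This is the anisotropy effect that the normalization is meant to remove.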
On the client side, we developed a globus-rb client to submit ClassAd queries to the resource broker, as in the following example, where a ManagedJobFactoryService hosted on a cluster of at least 64 computing nodes running Linux on the Intel architecture is requested:
[ ImageSize=512;
  Rank=1/other.ComputingElement_PBS_WaitingJobs;
  Requirements= other.Type=="ManagedJobFactoryService" &&
    other.NumNodes>=64 &&
    other.Arch=="x86" &&
    other.OpSys=="Linux" ]
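To make the meaning of the query above concrete, its Requirements and Rank expressions can be mirrored by plain predicates. This is a hand-written imitation for illustration only; it does not use or reproduce the ClassAd library, and the function names are hypothetical.

```python
def requirements(other):
    """Mirror of the Requirements expression in the example query."""
    return (other.get("Type") == "ManagedJobFactoryService"
            and other.get("NumNodes", 0) >= 64
            and other.get("Arch") == "x86"
            and other.get("OpSys") == "Linux")


def rank(other):
    """Mirror of Rank=1/other.ComputingElement_PBS_WaitingJobs."""
    waiting = other["ComputingElement_PBS_WaitingJobs"]
    # Assumption of this sketch: an empty queue ranks highest. A real
    # ClassAd evaluator has its own rules for a division by zero.
    return float("inf") if waiting == 0 else 1 / waiting


def select(resources):
    """Discovery (strict requirements) followed by selection (maximum rank)."""
    candidates = [r for r in resources if requirements(r)]
    return max(candidates, key=rank, default=None)
```

The rank decreases as the PBS queue grows, so among the clusters that satisfy the requirements the one with the fewest waiting jobs is preferred.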
Fig. 2. Resource representation in a 2D anisotropic space (a) and isotropic space (b). A to M are resource identifiers; X is the query representation in the resource property space.
The resource broker client invokes the appropriate resource broker service method, specifying the query to be performed and, optionally, the matchmaking algorithm to be used. The resource broker service retrieves the matchmaker component managed by the stateful service resource, selects all unclaimed resources and runs the matchmaking algorithm on this resource set. The client writes the web service address or the resource EPR to the standard output or to a specified text file. In order to prevent deadlocks or starvation effects, the user can choose between two different behaviors: asking the service to be notified as soon as the requested resource becomes available, or receiving an error response because of its unavailability.
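The two behaviors offered to the user (be notified later, or fail immediately) can be sketched with a small in-memory broker. This is only an illustration under stated assumptions: the class and method names are hypothetical, resources are plain dictionaries keyed by an EPR string, and no GT4 notification mechanism is modeled.

```python
class ResourceUnavailable(Exception):
    """Raised when no unclaimed resource matches and no callback was given."""


class Broker:
    def __init__(self, resources):
        self.unclaimed = dict(resources)   # EPR -> properties
        self.waiting = []                  # (predicate, callback) pairs

    def request(self, predicate, on_available=None):
        """Claim a matching resource, or fail, or register for notification."""
        for epr, props in self.unclaimed.items():
            if predicate(props):
                del self.unclaimed[epr]    # claim it; we return right away
                return epr
        if on_available is None:
            raise ResourceUnavailable("no unclaimed resource matches")
        self.waiting.append((predicate, on_available))
        return None

    def release(self, epr, props):
        """Return a resource; hand it to the first waiter it satisfies."""
        for i, (predicate, callback) in enumerate(self.waiting):
            if predicate(props):
                del self.waiting[i]
                callback(epr)              # resource goes straight to the waiter
                return
        self.unclaimed[epr] = props
```

Serving waiters in arrival order is one simple way to limit starvation; the actual policy of the service is not stated in the text.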
3 Algorithms Evaluation
Fig. 3. LSI (continuous line), ClassAd (dotted line) and ClassAd-LSI (broken line) matchmaking algorithms evaluation. Time per Match from 10 to 100 resources (a) and from 10 to 5000 resources (b).
The resource broking framework we developed provides a configurable testing component that implements virtual grids, in order to compare different matchmaking algorithms under controlled conditions such as the number of selectable resources, the type and number of properties characterizing each resource, and the submitted query. We performed tests using virtual grids built from 10 to 5000 resources with different increment steps, the finest being in the range between 10 and 100. Each resource is identified by 10 integer properties, each valued by a random number between 0 and 99. The query uses 6 random values drawn from the same interval and applied to 3 properties for the strict selection characterizing the discovery phase and to 3 properties for the optimization in the selection phase. Each test is repeated 1000 times; in order to prevent starvation caused by resource claiming, claiming is disabled by default when the RBS is used in testing mode, since otherwise the number of repetitions would vary and undermine the testing procedure. Tests were performed on a dedicated Linux machine equipped with a [email protected] and 4GB of RAM.
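The test setup described above (10 integer properties in the range 0 to 99, a query built from 6 random values split between discovery and selection, 1000 repetitions) can be reproduced with a short harness. This is a sketch under assumptions: the function names and the matchmaker signature are hypothetical, and the text does not say whether the six query properties are distinct, which this sketch assumes.

```python
import random
import time

NUM_PROPERTIES = 10


def make_virtual_grid(size, rng):
    """Build `size` resources, each with 10 integer properties in 0..99."""
    return [[rng.randint(0, 99) for _ in range(NUM_PROPERTIES)]
            for _ in range(size)]


def make_query(rng):
    """Pick 6 distinct properties: 3 strict (discovery), 3 preferred (selection)."""
    props = rng.sample(range(NUM_PROPERTIES), 6)
    strict = {p: rng.randint(0, 99) for p in props[:3]}
    preferred = {p: rng.randint(0, 99) for p in props[3:]}
    return strict, preferred


def time_per_match(matchmaker, size, repeats=1000, seed=0):
    """Average wall-clock seconds per call of matchmaker(grid, strict, preferred)."""
    rng = random.Random(seed)
    grid = make_virtual_grid(size, rng)
    queries = [make_query(rng) for _ in range(repeats)]
    start = time.perf_counter()
    for strict, preferred in queries:
        matchmaker(grid, strict, preferred)   # resources are never claimed
    return (time.perf_counter() - start) / repeats
```

Because the grid is never modified between calls, every repetition sees the same resource set, which corresponds to disabling resource claiming in testing mode.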
As the charts in Fig. 3 show, the LSI matchmaking algorithm is faster than the ClassAd one, especially when few resources are available, as in small but geographically distributed department grids. In order to compare the effectiveness of the LSI and ClassAd algorithms in relative terms, we considered the Euclidean distance between the query, which represents the desired resource, and the resource selected by the matchmaker. Both are represented in the a-dimensional isotropic space by their numeric values. Our tests show that Latent Semantic Indexing is a valid approach to matchmaking; nevertheless, the ClassAd algorithm is more suitable because it represents a sort of standard in the grid computing world. In the current implementation of the ClassAd matchmaking algorithm, the discovery is performed using the Requirements property, while in the selection phase both self.Rank and other.Rank are maximized. With this approach, ranking resources can be tricky.
4 Integrating ClassAd and LSI Algorithms
In order to improve the overall effectiveness of the process, and leveraging the interest of the grid computing community in mapping the Index Service resource representation to ClassAd, we developed from scratch a matchmaking algorithm that uses the ClassAd notation to discover suitable resources and Latent Semantic Indexing to select the best one. Using the flexible Collector component we implemented, each resource is mapped to the ClassAd representation. In this way, the APIs provided by ClassAd are used to perform the discovery process using the Requirements property. In queries for the combined ClassAd-LSI matchmaking algorithm, the explicit use of the Preferences property is mandatory in order to specify the preferred property values, using the same notation as Requirements. In Preferences, the "~=" symbol is used with the meaning of "about", and new symbols for maximization and minimization are introduced. In the following example the consumer is looking for a ManagedJobFactoryService hosted on a cluster of at least 64 computing nodes running Linux on the Intel architecture, specifying a CPU speed as close as possible to 3GHz and minimizing the PBS job queue:
[ ImageSize=512;
  Rank=1;
  Preferences="other.ComputingElement_PBS_WaitingJobs=min && other.CPUSpeed~=3000";
  Requirements= other.Type=="ManagedJobFactoryService" &&
    other.NumNodes>=64 &&
    other.Arch=="x86" &&
    other.OpSys=="Linux" ]
The time performance of the ClassAd-LSI matchmaking algorithm is similar to that of the pure ClassAd one, because the LSI step has little weight: it involves only the discovered resources and not the complete available set. The distances between the queried and the selected resource, however, are lower (Figure 4).
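A minimal sketch of the combined scheme described above: strict requirements filter the resource set, and the Preferences terms then drive a distance-based choice among the survivors. The parser handles only the three forms that appear in the example (=min, =max, ~=value); the function names, the min-max scaling over the candidate set and the reading of min and max as the extreme values among the candidates are assumptions of this sketch, not the authors' implementation.

```python
import math
import re

_PREF = re.compile(r"other\.(\w+)\s*(~=|=)\s*(min|max|[\d.]+)$")


def parse_preferences(text):
    """Return {property: 'min' | 'max' | float} from a Preferences string."""
    prefs = {}
    for term in text.split("&&"):
        m = _PREF.match(term.strip())
        if m is None:
            raise ValueError(f"unsupported preference term: {term!r}")
        name, op, value = m.groups()
        # '=' goes with min/max, '~=' goes with a numeric target.
        if (op == "=") != (value in ("min", "max")):
            raise ValueError(f"unsupported preference term: {term!r}")
        prefs[name] = value if op == "=" else float(value)
    return prefs


def select(candidates, prefs):
    """Pick the candidate closest to the preferred point in scaled space."""
    def distance(props):
        total = 0.0
        for name, want in prefs.items():
            values = [c[name] for c in candidates.values()]
            lo, hi = min(values), max(values)
            target = lo if want == "min" else hi if want == "max" else want
            span = (hi - lo) or 1.0        # avoid dividing by zero
            total += ((props[name] - target) / span) ** 2
        return math.sqrt(total)
    return min(candidates, key=lambda epr: distance(candidates[epr]))


def classad_lsi(resources, requirements, preferences):
    """Discovery by strict requirements, then distance-based selection."""
    candidates = {e: p for e, p in resources.items() if requirements(p)}
    if not candidates:
        return None
    return select(candidates, parse_preferences(preferences))
```

Only the discovered candidates enter the distance computation, which matches the observation that the LSI step adds little time to the pure ClassAd discovery.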
Fig. 4. LSI (continuous line), ClassAd (dotted line) and ClassAd-LSI (broken line) matchmaking algorithms evaluation. Time per Match from 10 to 100 resources (a) and from 10 to 5000 resources (b).
Fig. 5. LSI (continuous line), ClassAd (dotted line) and ClassAd-LSI (broken line) matchmaking algorithms evaluation. Difference between ClassAd and ClassAd-LSI distances (a). Starvation event probability (b).
Figure 5a shows the difference between the ClassAd and ClassAd-LSI distances: the integrated algorithm selects better fitting resources, while behaving like LSI and ClassAd from the starvation point of view (Figure 5b). The evaluations are performed in an experimental environment because of the need for a large number of grid resources. Some tests on a real world grid application for the operational production of weather forecasts [7] demonstrate the convenience of the integrated ClassAd-LSI algorithm, due to its less restrictive resource selection, as shown in Fig. 6.
Fig. 6. A real world weather forecast grid application using different matchmaking algorithms. No Grid: all software runs on a single Beowulf cluster, with non-parallel jobs on the front-end node. No RB: all software pieces are distributed a priori on computing elements and no one else is requesting computational power from the grid. LSI/ClassAd/ClassAd-LSI: the same run is performed using different matchmaking algorithms in a real application scenario, with many users requesting computational and storage power from the department grid.
5 Conclusions and Future Work
We described the current state of our development of the resource broker as a key component of grid computing technology, with the aim of evolving our virtual laboratory for environmental simulations [9]. From the grid components point of view, our best result is the development of the framework for automatically publishing, collecting and classifying resource properties in the VO Index service. The same component realizes the autonomic, but fully customizable, mapping between GT4 resource properties and ClassAds. This component makes it possible to deal with any kind of resource (computing, storage, data, instruments) in a homogeneous and transparent way. The LSI algorithm is very effective and efficient, but the use of the ClassAd notation for resource representation and requests is a requirement in many grid computing application fields. The integration of the ClassAd and LSI algorithms improves the effectiveness of ClassAd at an acceptable cost in terms of performance. Our resource broker service lets the user choose the algorithm used to find the best resource, so the best choice can be made according to the behavior of the grid application. In this direction, we are interested in developing optimization based matchmaking algorithms on top of our framework. Our next goals focus on the development of grid components aimed at opening new application scenarios in the field of environmental computing science, with particular regard to ocean and atmospheric research [10].
References
1. Foster, I.: Globus Toolkit Version 4: Software for Service-Oriented Systems. Journal of Computational Science and Technology International Journal of High Performance Computing Applications 21(4), 523–530 (2006)
2. Montella, R.: Development of a GT4-based Resource Broker Service: an application to on-demand weather and marine forecasting. In: Cérin, C., Li, K.-C. (eds.) GPC 2007. LNCS, vol. 4459, Springer, Heidelberg (2007)
3. Thain, D., Tannenbaum, T., Livny, M.: Distributed Computing in Practice: The Condor Experience. Concurrency and Computation: Practice and Experience 17(2–4), 323–356 (2005)
4. Raman, R.: Matchmaking Frameworks for Distributed Resource Management. Ph.D. Dissertation (October 2000)
5. Dumais, S.T.: Using LSI for Information Retrieval, Information Filtering, and Other Things. In: Cognitive Technology Workshop (April 1997)
6. Deerwester, S., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A.: Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6), 391–407 (1990)
7. Montella, R., Giunta, G., Riccio, A.: An integrated ClassAd-Latent Semantic Indexing matchmaking algorithm for Globus Toolkit based computing grids. In: Proceedings of the Workshop on Models, Algorithms and Methodologies for Grid-enabled Computing Environments (2007)
8. Foster, I., Kesselman, C., Tuecke, S.: The Anatomy of the Grid: Enabling Scalable Virtual Organizations. Intl. J. High Performance Computing Applications 15(3), 200–222 (2001)
9. Ascione, I., Giunta, G., Montella, R., Mariani, P., Riccio, A.: A Grid Computing Based Virtual Laboratory for Environmental Simulations. In: Nagel, W.E., Walter, W.V., Lehner, W. (eds.) Euro-Par 2006. LNCS, vol. 4128, Springer, Heidelberg (2006)
10. Giunta, G., Montella, R., Mariani, P., Riccio, A.: Modeling and computational issues for air/water quality problems. A grid computing approach. Il Nuovo Cimento 28C(2) (2005)
A Grid Computing Based Virtual Laboratory for Environmental Simulations
Raffaele Montella¹, Giulio Giunta¹, and Giuliano Laccetti²
¹ Department of Applied Science, University of Naples Parthenope, Naples, Italy {raffaele.montella, giulio.giunta}@uniparthenope.it
² Department of Mathematics and Applications, University of Naples Federico II, Naples, Italy [email protected]
Abstract. Effective and efficient allocation of computing resources, based on a matchmaking tool that is flexible yet straightforward to deploy, configure and use, is a key component in the grand challenge of realistic, high resolution environmental simulations and forecasts within an affordable computing time. With the aim of building a virtual laboratory for computational environmental science, we developed a suite of Globus Toolkit 4 based web services that implement an interoperable service oriented architecture integrating matchmaking and resource broking, workflow management, data and metadata advertisement, and instruments integration.
Keywords: Grid Computing, Web Service, Resource Broker, Environmental Data Provider, Globus Toolkit, ClassAd.
1 Introduction
In the grid computing field the term "resource" usually refers to something shared among Virtual Organization (VO) participants that is discoverable, selectable and then usable for computing or storage purposes. Data stored in a distributed fashion over the grid can be considered a resource, especially if tagged using metadata. This point of view could change the use of the grid from a tool centered perspective, where the computer science expert is the main actor as the know-how leader, to an application centered one, where the specific domain expert, for example the environmental scientist, uses the grid as the source of the data needed by his application. In a data grid approach, software components, data and instruments, wrapped by web services, are advertised on the VO Index Service in a common, consistent and homogeneous way, thanks to developed guidelines and standard publishing tools. The resource broker collects grid status data and processes queries in which the needed resources are represented in a standard fashion. In our implementation we chose Globus Toolkit 4 (GT4) [8] because it exposes its features, including service persistence, stateful and stateless behavior, event notification, data element management and index services, via the web services resource framework (WSRF). In a previous paper [13] we described the architecture and the design of our resource broker service, focusing on the LSI-based (Latent Semantic Indexing) matchmaking algorithm. We then focused our attention on an environmental data content distribution network, developing a grid enabled, service based version of the GrADS Data Server that leverages automatic metadata publishing on the Index Service and the ClassAd representation [14]. In this paper we describe the integration in the GT4 of the tools implementing the Resource Broker Service (RBS), the GrADS Data Service (GDDS), the Instrument Service (IS) and the Job Flow Scheduler Service (JFSS), focusing on interoperability issues. The integration leverages the GT4 Index Service and is implemented using a component that maps the native MDS resource representation to the Condor Classified Advertisement (ClassAd) [18]. The user can dynamically glue all the developed components together using the JFSS and the RBS in order to build real grid aware applications that integrate data, instruments and models, with the feeling of interacting with a single distributed virtual machine.
Work partially supported by the S.Co.P.E. Italian PON project, funded by Ministero dell'Università e della Ricerca, PON 2000-2006.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 951–960, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Resource Broker Service Design and Development
The brokering service we developed is responsible for interpreting requests and enforcing virtual organization policies on resource access, hiding many of the details involved in locating suitable resources. The resource brokering process is performed in two parts: the matchmaking, where the resources that strictly satisfy the query criteria are discovered among the available ones, and the selection, where the best one is chosen by applying an optimization algorithm [10]. Resources register themselves with the resource broker by advertising their availability inside the VO index [19]. Entities, represented by End Point References (EPR), are automatically classified as resource producers, stored in a GT4 oriented data structure for efficiency, and then mapped using the ClassAd description language. Some property processing tasks, such as aggregation, lookup and ranging, are performed by the Collector component. This approach decouples the internal representation of the resource status from the external form used by a specific matchmaking algorithm or query system. We thus achieved our goal of implementing an extensible and customizable matchmaking framework that is also useful for testing and comparison purposes. We leveraged this feature to develop from scratch a Latent Semantic Indexing (LSI) [6] based matchmaking algorithm using a custom formalization of resource properties. Each grid resource is placed in a hyperspace with as many dimensions as the number of properties characterizing the resource. We perform an adimensionalization process in order to remove the units of numerical properties and to compute the Euclidean distance between the query target and each available resource in an isotropic space. Finally, we select the best resource by choosing the shortest distance. Thanks to the modular architecture of the matchmaking framework, we developed a Classified Advertisements based matchmaking algorithm and then compared it with the LSI one. In the evaluation the LSI based matchmaking algorithm leads in efficiency and effectiveness but, because ClassAds are regarded as a sort of "lingua franca" in the field of grid computing, we developed an integrated ClassAd-LSI matchmaking algorithm in order to balance benefits and drawbacks [15]. In the following example, using the integrated ClassAd-LSI matchmaking algorithm, the user is looking for an MM5Service, which provides weather forecast and simulation services wrapping the Mesoscale Model 5 (MM5) [11], hosted on a cluster of at least 81 nodes running Linux on the Intel architecture:
[ ImageSize=512;
  Rank=1;
  Preferences= other.CpuSpeed=max;
  Requirements= other.Type=="MM5Service" &&
    other.NumNodes>=81 &&
    other.Arch=="x86" &&
    other.OpSys=="Linux" ]
In order to prevent deadlocks or starvation effects, the user can choose between two different behaviors: asking the service to be notified as soon as the requested resource becomes available, or receiving an error response because of its unavailability.
3 Five Dimensions Data Provider Service
Matchmaking algorithms find an interesting use in distributed data systems. In the field of atmospheric and marine environmental science, with particular regard to weather and oceanographic applications, the Grid Analysis and Display System (GrADS) [7] is commonly used. It is a free and open source interactive desktop tool that provides easy access, manipulation and visualization of earth science data stored in various formats. GrADS uses a 5-dimensional data environment: longitude, latitude, vertical level, time and parameters (variables). Environmental data are distributed using the GrADS-DODS Server (GDS), which combines GrADS and OPeNDAP (the Open source Project for a Network Data Access Protocol), formerly DODS (Distributed Oceanographic Data System), to create an effective and efficient solution for serving distributed scientific data. The GDS can be accessed through its web interface or directly through client tools, but there is no native implementation using web or grid services, which results in a lack of security management, component interoperability and metadata advertisement. In order to provide our application with a data management and distribution component, we developed a GT4 web service enabled version of the GrADS Data Server, deeply integrated into the index service model and into the resource broker behavior: thanks to the mapping algorithms of the RB Collector component, the metadata published on the index service are automatically processed and exposed as ClassAds. The architectural design of our GrADS Data Service was a non-trivial task because we wanted to apply the "same binary" approach. From our point of view, developing a GrADS interface from scratch would be hard work (and useless, because it has already been done), with the risk of introducing bugs, unacceptable behaviors or idiosyncrasies into a legacy piece of software that works well. For the same reason, we also avoided reusing partially modified original source code, because this approach means freezing the currently available GDS source version and starting a new development branch, again with the risk of introducing bugs. We developed the GrADS Web Service using the factory/instance design pattern in order to manage multiple resources implementing the multi-stage analysis feature of the plain GDS. The development approach we followed allows the original GDS distribution packages to be reused without any source rebuild, leveraging Anagram [21] features. Anagram helps in the development of new servers that support the OPeNDAP subsetting protocol and act on different 5D environmental data formats. The most valuable feature of our implementation is the automatic publishing of metadata as named and typed resource properties in the VO Index Service: for each available data file, dimension (space, time) and attribute (name, type) information is published in the index service, then processed and mapped to ClassAd by the resource broker collector. In order to access a particular data set as a grid resource, the web service consumer creates GDDS instances through a factory service. Thanks to the resource broker integration and the powerful ClassAd representation, the available data are advertised using metadata.
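The mapping of published metadata to a ClassAd can be pictured as rendering named and typed properties in ClassAd text form. The sketch below is purely illustrative: the function names are hypothetical, the property names are borrowed from the query examples in this paper, and the literal syntax (quoted strings, true/false, brace-delimited lists) is written from general knowledge of the ClassAd language rather than taken from the RBS Collector.

```python
def to_classad_literal(value):
    """Render one Python value as ClassAd-style literal text."""
    if isinstance(value, bool):             # test bool before int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    if isinstance(value, (list, tuple)):
        return "{ " + ", ".join(to_classad_literal(v) for v in value) + " }"
    raise TypeError(f"unsupported property type: {type(value).__name__}")


def to_classad(properties):
    """Render a dict of named, typed properties as one ClassAd record."""
    body = "; ".join(f"{name} = {to_classad_literal(value)}"
                     for name, value in properties.items())
    return f"[ {body} ]"
```

For instance, a data file described by Type, Model, DX and Variables would be rendered as a single bracketed record that a ClassAd based matchmaker could evaluate against a query.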
These metadata can then be used in a data-oriented ClassAd query. In the following example the user is looking for a dataset composed of the u10 and v10 variables, produced by the MM5 mesoscale model over a specific domain expressed by latitude and longitude ranges, and covering a specified time interval; the user has no particular interest in the time step, but asks for a ground resolution of about 333 meters:

[Rank=1;
 Preferences=other.NetworkInterface_Bandwidth=max other.NetworkInterface_Latency=min;
 Requirements=other.Type="dataset";
 other.Time>='01:02:2007 00:00:00' && other.Time<='02:02:2007 00:00:00' &&
 other.Lat>=40.5 && other.Lat<=40.85 &&
 other.Lon>=14.14 && other.Lon<=14.25 &&
 other.DX==0.0015 && other.DY==0.0015 &&
 other.Variables= u10,v10; && other.Model=="mm5" ]

In this example, using the ClassAd-LSI matchmaking algorithm, data sets that can be reached quickly are preferred. From the client's point of view, requesting a given amount of computational power and requesting data are the same thing, because resources are represented in a homogeneous fashion. Asking the RBS to be notified when a query can be satisfied means that this component can act as an environmental data provider, as in the case of an operational weather forecast grid application. The spread of GDS is also due to the availability of third-party client products, such as Matlab and IDL, which are widely used in the scientific community. Because grid-enabled versions of this software are currently unavailable, we developed a custom version of the GDS that accepts connections only from the local machine and acts as a proxy service to the grid-enabled GDDS. In this way the use of the grid tools we developed is largely transparent.
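The data-oriented query of this section can be illustrated with a minimal Python sketch. It is not the ClassAd-LSI algorithm: it only filters advertisements on the stated requirements and then sorts them by the two preferences. The dictionary layout, the key names, the sample advertisements and the reading of the dates as day:month:year are all assumptions made for illustration.

```python
import math
from datetime import datetime

# Request mirroring the query in the text. The dictionary layout and key names
# are hypothetical; dates are read as day:month:year (an assumption).
REQUEST = {
    "model": "mm5",
    "variables": {"u10", "v10"},
    "time": (datetime(2007, 2, 1), datetime(2007, 2, 2)),
    "lat": (40.5, 40.85),
    "lon": (14.14, 14.25),
    "dx": 0.0015,
    "dy": 0.0015,
}


def covers(advertised, requested):
    """True if the advertised (low, high) range contains the requested one."""
    return advertised[0] <= requested[0] and requested[1] <= advertised[1]


def satisfies(ad, req):
    """Check the Requirements part of the query against one advertisement."""
    return (
        ad["type"] == "dataset"
        and ad["model"] == req["model"]
        and req["variables"] <= ad["variables"]
        and covers(ad["time"], req["time"])
        and covers(ad["lat"], req["lat"])
        and covers(ad["lon"], req["lon"])
        and math.isclose(ad["dx"], req["dx"])
        and math.isclose(ad["dy"], req["dy"])
    )


def select(ads, req):
    """Filter on requirements, then prefer high bandwidth and low latency."""
    matching = [ad for ad in ads if satisfies(ad, req)]
    return sorted(matching, key=lambda ad: (-ad["bandwidth"], ad["latency"]))


def make_ad(name, bandwidth, latency, model="mm5"):
    """Build a made-up dataset advertisement for the example."""
    return {
        "name": name, "type": "dataset", "model": model,
        "variables": {"u10", "v10", "t2"},
        "time": (datetime(2007, 1, 31), datetime(2007, 2, 3)),
        "lat": (40.0, 41.0), "lon": (14.0, 14.5),
        "dx": 0.0015, "dy": 0.0015,
        "bandwidth": bandwidth, "latency": latency,
    }


ADS = [
    make_ad("slow", 10.0, 80.0),
    make_ad("fast", 100.0, 5.0),
    make_ad("wrf-run", 1000.0, 1.0, model="wrf"),  # rejected: wrong model
]
RANKED = [ad["name"] for ad in select(ADS, REQUEST)]
```

Sorting lexicographically on bandwidth and latency is only one possible reading of the Preferences clause; a real matchmaker may weigh the two criteria differently.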
4 Integrating Instruments into the Grid Using GT4 Services
Integrating instruments into the grid environment is a strategically relevant challenge, because a better interaction with other kinds of computing elements allows instruments to be used more efficiently and effectively: it reduces the general overhead and maximizes throughput. Using grid technology to control instruments and retrieve acquired data implies the need for a standard interface methodology across different kinds of instrument hardware. The relevance of this challenge increases when considering that many instruments are not built from off-the-shelf components, but may be unique, with specific hardware and software characteristics. Our way to achieve this is the idea of a plug-in driver for any kind of sensor, group of sensors or instrument. We developed an Instrument Service [16], based on GT4, in order to integrate our virtual laboratory for computational environmental science with some data acquisition instruments. Among our design constraints, we considered the largest possible use of WSRF mechanisms, such as persistence, notification and index service advertisement, in order to develop components that integrate as much as possible with the GT4 middleware, which is maintained worldwide. We focus our attention on the service framework used to decouple instrument-specific control issues. The implementation achieves the primary goal of the best integration with the other developed services by leveraging the automatic mapping between the native GT4 index service and the ClassAd resource representation performed by the RBS. The user can query for instrument data using our Resource Broker Service and ClassAds, as for computational or storage needs. Different instruments interact with their proxy hardware in various ways, and sometimes the SOAP interaction is too slow or too latency-prone for the required data rate.
To face these different situations we used a layered approach, finding for each instrument the boundary between the strictly hardware-related interaction and the best place to group common characteristics and start virtualization. The Instrument component is the glue between the instrument interface driver and the WSRF-based Instrument Service: it implements the basic methods to explore the concrete instrument properties and converts them into WSRF-compliant resource properties. In this way, automatically and dynamically created resource properties are added to the VO index service, ready to be discovered by the RBS using, for example, a ClassAd query of the form:

[Type="Consumer"; Lat=40.82; Lon=14.22; Rank=1;
 Preferences= GIS.getDistance(Lat,Long,other.Lat,other.Long)=min;
 Requirements=other.Type=="Instrument" && other.Sensor==WindSensor,PressSensor ]
In this example the user is looking for a weather station, as near as possible to the specified coordinates, with both wind and barometric pressure sensors available. As shown in the ClassAd-LSI query, we are working on geographical features in the criteria used by the matchmaking to choose the best resource. The Instrument Service exposes operation providers to start and stop data streaming using a protocol chosen among the available ones (published, for each sensor, on the index service as a metadata field). In this case the standard SOAP message exchange between the web service consumer and the web service producer is used only to control the data transmission, while the actual data are transferred using a more efficient protocol.
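The instrument query above (nearest station offering both sensors) can be sketched as follows. This is an illustration only, not the RBS implementation: the great-circle (haversine) distance stands in for the GIS.getDistance function named in the query, whose actual definition is not given in the text, and the station advertisements are invented.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; a stand-in for GIS.getDistance (assumption)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_instrument(ads, lat, lon, required_sensors):
    """Return the closest instrument offering all required sensors, or None."""
    candidates = [
        ad for ad in ads
        if ad["Type"] == "Instrument" and required_sensors <= set(ad["Sensor"])
    ]
    if not candidates:
        return None
    return min(candidates,
               key=lambda ad: haversine_km(lat, lon, ad["Lat"], ad["Lon"]))


# Made-up advertisements; names, positions and sensor lists are hypothetical.
STATIONS = [
    {"Name": "A", "Type": "Instrument", "Lat": 40.85, "Lon": 14.27,
     "Sensor": ["WindSensor", "PressSensor"]},
    {"Name": "B", "Type": "Instrument", "Lat": 40.83, "Lon": 14.23,
     "Sensor": ["WindSensor"]},  # closest, but lacks the pressure sensor
    {"Name": "C", "Type": "Instrument", "Lat": 41.90, "Lon": 12.50,
     "Sensor": ["WindSensor", "PressSensor"]},
]
BEST = nearest_instrument(STATIONS, 40.82, 14.22, {"WindSensor", "PressSensor"})
```

Filtering before ranking keeps the distance computation away from stations that cannot satisfy the request, which mirrors the split between Requirements and Preferences in the query.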
5 Web Service Orchestration Using a Workflow Engine Service
We implemented a custom Job Flow Scheduler Service (JFSS) from scratch in order to enhance flexibility and to minimize the impact on grid-aware application development. Applications are described in a Job Flow Description Language (JFDL) file, which specifies job characteristics and activation order and offers global and local variables, job property access, expression evaluation, late binding, workflow branching and loops, resource broker interaction and logging features [2]. The grid concept has evolved from a job-oriented to a service-oriented approach thanks to technologies such as the WSRF, so we developed our JFSS with a generic web service invocation interface that can both submit jobs to computing nodes and invoke web service operation providers. The file describing the grid application can be divided into two parts. In the first part, each job or service belonging to the grid application is described by specifying its symbolic name, the related service EPR, the operation provider name and any needed arguments, either as literals or as EPRs. In the second part, the job activation order is described using a directed acyclic graph: each job node is characterized by references to all previous jobs and to all jobs that will be submitted when the current job finishes. The RBS can be invoked explicitly, as a common web service, or by specifying the query and the preferred matchmaking algorithm in the job definition, as in the following example:

<jfdl:resourceBroker classAlgorithm="it.uniparthenope.dsa.grid.ClassAdLSI">
[ Type="MM5Job"; ImageSize=512; Rank=1;
  Preferences=other.ComputingElement_PBS_WaitingJobs=min;
  Requirements= other.Type=="MM5Service" && other.Disk>=64 ]

The JFSS was implemented as a multi-instance service using a class framework that encapsulates all the described features, including JFDL parsing, graph setup and application runtime support. The most interesting component is the Job, which implements job submission using a clear, effective and efficient algorithm: when a job is to be started, it performs a thread join on each job it depends on, according to the previously defined dependency graph. In this way the component waits until all prerequisite data have been successfully produced. The job is then submitted to the grid by invoking the web service operation provider specified by the EPR (Fig. 1).
Fig. 1. High level representation of a JFSS workflow for on demand coupled weather, marine and air quality forecast or scenario analysis
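The thread-join submission strategy described in this section can be sketched in Python. This is a minimal illustration under stated assumptions, not the JFSS code: the Job class, the run_workflow helper and the job names are hypothetical, and a plain callable stands in for the invocation of a web service operation provider.

```python
import threading


class Job(threading.Thread):
    """One workflow node: wait for all prerequisite jobs, then run the action."""

    def __init__(self, name, action, prerequisites=()):
        super().__init__(name=name)
        self.action = action                      # stands in for the service call
        self.prerequisites = tuple(prerequisites)

    def run(self):
        # Join every job this one depends on before doing any work.
        for job in self.prerequisites:
            job.join()
        self.action()


def run_workflow(jobs):
    """Start jobs so that prerequisites are always started first, then wait."""
    started = set()

    def start(job):
        if job in started:
            return
        for dependency in job.prerequisites:
            start(dependency)   # a thread can only be joined once it is started
        started.add(job)
        job.start()

    for job in jobs:
        start(job)
    for job in jobs:
        job.join()


LOG = []
LOG_LOCK = threading.Lock()


def record(name):
    """Build an action that only records its own execution."""
    def action():
        with LOG_LOCK:
            LOG.append(name)
    return action


weather = Job("weather", record("weather"))
marine = Job("marine", record("marine"), [weather])
air_quality = Job("air_quality", record("air_quality"), [weather])
report = Job("report", record("report"), [marine, air_quality])
run_workflow([report, air_quality, marine, weather])
```

The sketch does not handle failed jobs: a prerequisite that raises an exception still lets its dependants proceed, so a real scheduler would also need to propagate job status.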
6 Real World Grid-Aware Applications at Operational and on Demand Meteo-marine Forecasts
We recently evolved our virtual laboratory for environmental modeling applications following a service-oriented approach, wrapping each component, such as models and model modules, with a GT4 web service. Using this technique, we coupled several environmental models, which can be classified as: weather simulation and forecast models, such as MM5 and the Weather Research and Forecasting model (WRF) [12]; air quality models, such as the Spatio-temporal distribution Emission Model (STdEM) [3], the Parallel Naples Airshed Model (PNAM) [4] and the Comprehensive Air quality Model with eXtensions (CAMx) [17]; sea-wave propagation models, such as WaveWatch III (WW3) [20] and SWAN [5]; and ocean circulation models, such as ROMS [1] and the Princeton Ocean Model (POM), which we enhanced by developing a parallel version with nesting capabilities (POMpn) [9]. Wrapping web services around legacy applications, as numerical models can be considered, brings some advantages, especially those related to the higher-level approach to componentization, such as interoperation, better security management and VO Index Service integration. On the other hand, there are some drawbacks, notably a lack of usage flexibility. Leveraging our previous experience in grid enabling, we developed a framework to simplify and standardize this task, following guidelines acceptable in a wide range of cases. By limiting the loss of flexibility caused by the interfaced access to model configuration files, while offering a complex operation-provider-based web service approach thanks to the GT4 WSRF technology, we obtained as a side effect the possibility of orchestrating the web services with any available tool. Using our grid components, we implemented real-world applications based on the close integration between environmental models, environmental data acquisition instruments and data distribution systems, glued together by grid computing technology. Since 2003 we have been operationally running a meteo-marine forecast grid application that integrates weather, sea-wave propagation, ocean circulation and air quality models. Thanks to a nesting approach, its products are focused on the Bay of Naples, where the ground resolution reaches 3000 meters, and are offered to citizens and local institutions via a portal interface or through the grid-oriented GDDS. We used this application as a test bed to evolve our laboratory up to its current state, a fully service-oriented architecture.
Fig. 2. Results produced by a grid-aware application for on demand high resolution weather forecast over the South of Italy and the inner part of the Bay of Naples with a maximum ground resolution of 333 meters
As the output resolution increases, the amount of produced data grows. Evolving our routinely run weather forecast application, we developed the On Demand High Resolution Weather Forecast System grid application, which relies on a set of dynamically located, telescopically nested domains centered on a given geographic position identified by latitude and longitude values. Using the RBS and GDDS interaction, the application asks the grid whether any of the needed domains has already been computed. If some data are found, the model simulates only the domains not yet computed. In this way storage and computing resources are used effectively and efficiently (Fig. 2). The products of this on-demand weather forecast grid application are available through the Globus-provided RLS, through our GDDS, or through tools provided by Google Earth and Google Maps.
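The reuse of previously computed domains can be pictured with a short Python sketch. It is only an illustration: the Domain key (center and resolution), the list of nested resolutions and the in-memory catalogue are assumptions, whereas the real application discovers existing data through the RBS and GDDS.

```python
from typing import NamedTuple


class Domain(NamedTuple):
    """Hypothetical key identifying one nested domain."""
    lat: float
    lon: float
    resolution_m: int


def telescopic_domains(lat, lon, resolutions_m):
    """Build the nested domains centered on one position, coarsest first."""
    return [Domain(lat, lon, resolution) for resolution in resolutions_m]


def domains_to_compute(needed, already_computed):
    """Keep only the needed domains that no earlier run has produced."""
    available = set(already_computed)
    return [domain for domain in needed if domain not in available]


# The resolution ladder is an assumption; the text only gives 3000 m and 333 m.
NEEDED = telescopic_domains(40.82, 14.22, [27000, 9000, 3000, 1000, 333])
CATALOGUE = NEEDED[:3]          # pretend the three coarsest domains exist
MISSING = domains_to_compute(NEEDED, CATALOGUE)
```

Exact equality on the center coordinates is a simplification; a real catalogue lookup would more likely compare bounding boxes and time intervals.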
7 Conclusions and Future Works
Because we need an efficient and effective use of the available resources in applications that demand very high computational power, such as weather forecasts, we focused our attention on matchmaking algorithms. We developed a matchmaking framework and then implemented from scratch an integrated ClassAd and LSI based matchmaking algorithm for resource discovery (ClassAd) and selection (LSI). On the grid components side, one of our most demanding results is the development of a framework for the autonomic publishing, collecting and classifying of resource properties through the VO Index Service. The same component implements the customizable mapping between GT4 resource properties and Classified Advertisements. This component makes it possible to deal with any kind of resource (computing, storage, data, instruments) in a homogeneous and transparent way. The use of JFSS and other grid computing technologies is convenient under our application conditions because of the high degree of loosely coupled parallel jobs and web service operation provider invocations involved. While our RBS, GDDS and Instrument Service are characterized by unique features, the JFSS is a kind of software that is becoming common in the grid computing world, so we plan to perform tests and evaluations with analogous components in order to choose the best one. Our next goals focus on the development of grid components aimed at opening new application scenarios in the field of environmental computing science, with particular regard to ocean and atmospheric research. In our plans, instrument grid integration is a key field in the evolution of grid computing, because it means a more efficient and effective perception of the state of the real world and hence a better way to understand it. For the same reasons, we plan to study the integration between grid computing technology, with particular regard to content distribution networks, and grid-enabled Geographical Information Systems. We follow the trend of making grid technology a transparent tool for a wide range of common users belonging to research and application domains that are different and far from computer science.
References
1. Arango, H.G., Moore, A.M., Miller, A.J., Cornuelle, B.D., Lorenzo, E.D., Neilson, D.J.: The ROMS Tangent Linear and Adjoint Models: A comprehensive ocean prediction and analysis system. IMCS, Rutgers Tech. Reports (2006)
2. Ascione, I., Giunta, G., Montella, R., Mariani, P., Riccio, A.: A Grid Computing Based Virtual Laboratory for Environmental Simulations. In: Nagel, W.E., Walter, W.V., Lehner, W. (eds.) Euro-Par 2006. LNCS, vol. 4128, pp. 1085–1094. Springer, Heidelberg (2006)
3. Barone, G., D'Ambra, P., di Serafino, D., Giunta, G., Montella, R., Murli, A., Riccio, A.: An Operational Mesoscale Air Quality Model for the Campania Region. In: Proc. 3th GLOREAM Workshop, Annali Istituto Universitario Navale (special issue), pp. 179–189 (2000)
4. Barone, G., D'Ambra, P., di Serafino, D., Giunta, G., Murli, A., Riccio, A.: Parallel software for air quality simulation in Naples area. J. Environ. Manag. and Health 10, 209–215 (2000)
5. Booij, N., Holthuijsen, L.H.: The SWAN wave model for shallow water. In: Proc. 25th Int. Conf. Coastal Eng., vol. 1, pp. 668–676 (1996)
6. Deerwester, S., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A.: Indexing by latent semantic analysis. Journal of the Society for Information Science 41, 391–407 (1990)
7. Doty, B.E., Kinter III, J.L.: Geophysical Data Analysis and Visualization using GrADS. In: Szuszczewicz, E.P., Bredekamp, J.H. (eds.) Visualization Techniques in Space and Atmospheric Sciences, pp. 209–219. NASA (1995)
8. Foster, I.: Globus Toolkit Version 4: Software for Service-Oriented Systems. Journal of Computational Science and Technology 21, 513–520 (2006)
9. Giunta, G., Mariani, P., Montella, R., Riccio, A.: pPOM: A nested, scalable, parallel and Fortran 90 implementation of the Princeton Ocean Model. Environmental Modelling & Software 22, 117–122 (2007)
10. Liu, C., Foster, I.: A Constraint Language Approach to Matchmaking. In: 14th International Workshop on Research Issues in Data Engineering (RIDE-WS-ECEG 2004), Web Services for E-Commerce and E-Government Applications, pp. 7–14. IEEE Computer Society, Los Alamitos (2004)
11. Michalakes, J., Canfield, T., Nanjundiah, R., Hammond, S., Grell, G.: Parallel Implementation, Validation, and Performance of MM5. In: Parallel Supercomputing in Atmospheric Science, World Scientific Publ. Company, Singapore (1994)
12. Michalakes, J., Dudhia, J., Gill, D., Henderson, T., Klemp, J., Skamarock, W., Wang, W.: The Weather Research and Forecast Model: Software Architecture and Performance. In: Zwieflhofer, W., Mozdzynski, G. (eds.) Use of High Performance Computing in Meteorology. Proceedings of the ECMWF Workshop on the Use of High Performance Computing in Meteorology, World Scientific Publ. Company, Singapore (2005)
13. Montella, R.: Development of a GT4-based Resource Broker Service: an application to on-demand weather and marine forecasting. In: Cérin, C., Li, K.-C. (eds.) GPC 2007. LNCS, vol. 4459, pp. 204–217. Springer, Heidelberg (2007)
14. Montella, R., Giunta, G., Riccio, A.: Using grid computing based components in on demand environmental data delivery. In: Proceedings of Use of P2P, GRID and Agents for the Development of Content Networks (UPGRADE-CN 2007) (2007)
15. Montella, R., Giunta, G., Riccio, A.: An integrated ClassAd-Latent Semantic Indexing matchmaking algorithm for Globus Toolkit based computing grids. These Proceedings
16. Montella, R., Riccio, A., Menna, M., Mastrangelo, D.: A Globus Toolkit 4 based Instrument Service for environmental data acquisition. Dept. of Applied Science at University of Naples Parthenope Tech Report (2007)
17. Morris, R.E., Yarwood, G., Emery, C., Wilson, G.: Recent Advances in CAMx Air Quality Modeling. A&WMA Annual Meeting and Exhibition, Orlando, FL (2001)
18. Raman, P.: Matchmaking Frameworks for Distributed Resource Management. Ph.D. Dissertation (2000)
19. Schopf, J.M., D'Arcy, M., Miller, N., Pearlman, L., Foster, I., Kesselman, C.: Monitoring and Discovery in a Web Services Framework: Functionality and Performance of the Globus Toolkit's MDS4. Argonne National Laboratory Tech Report ANL/MCS-P1248-0405 (2005)
20. Tolman, H.L.: A third-generation model for wind waves on slowly varying, unsteady and inhomogeneous depths and currents. J. Phys. Oceanogr. 21, 782–797 (1991)
21. Wielgosz, J.: Anagram - A modular java framework for high-performance scientific data servers. In: 20th International Conference on Interactive Information and Processing Systems (IIPS) for Meteorology, Oceanography, and Hydrology (2004)
Exploring the Behaviour of Fine-Grain Management for Virtual Resource Provisioning
Fernando Rodríguez-Haro, Felix Freitag, Leandro Navarro, and Rene Brunner
Polytechnic University of Catalonia, Jordi Girona 1-3, 08034 Barcelona, Spain
{frodrigu, felix, leandro, rbrunner}@ac.upc.edu
Abstract. The virtualization of resources in SMP machines with Xen offers automatic round-robin based coarse-grain assignment of VMs to physical processors and manual assignment via privileged commands. However, two problems can arise in Grid environments: (1) if Xen's sEDF scheduler assigns the VMs, then some processors could be over or underutilized and the VMs could receive more resources than specified, and (2) manual assignment is not feasible in a dynamic environment and also requires being aware of each node's heterogeneity. Our approach proposes an enhanced fine-grain assignment of SMP's virtualized resources for Grid environments by means of a local resource manager (LRM). We have developed a prototype which adds a partitioning layer of subsets of physical resources. Our experimental results show that our approach achieves a flexible assignment of resources. At the same time, due to the fine-grain access, a more efficient resource assignment is achieved compared to the original mechanism.
Keywords: Resource provisioning, local resource manager, virtual machine.
1 Introduction
Virtualization technologies have become an important research area for resource management in Grid computing [1]. Recent efforts have proposed and developed middleware that implements mechanisms to manage network and physical node virtualization. Most of these approaches leverage recent advances in Xen [2] and VMware [3] for LRMs, and VNET [4] for virtual networking. Current approaches commonly view a physical resource as a whole (e.g. an indivisible CPU) without considering the particular hardware configuration of each node. The resources are therefore assigned in a coarse-grain fashion, which leads to an unbalanced workload. In SMP nodes, for instance, Xen offers automatic coarse-grain assignment of VMs to physical processors: each VM created is assigned to a physical processor in a round-robin fashion. Thus, some tasks must be done by an administrator to ensure an expected
This work is supported in part by the European Union under Contracts SORMA EU IST-FP6-034286 and CATNETS EU IST-FP6-003769, the Ministry of Education and Science of Spain under Contract TIN2006-5614-C03-01, and the program PROMEP.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 961–970, 2008. © Springer-Verlag Berlin Heidelberg 2008
performance of the competing VMs. First, the SMP node architecture must be known. Second, the vcpu-pin command must be issued to assign VMs to a specific processor. Third, the weights needed by the simple Earliest Deadline First (sEDF) scheduler must be matched to user requirements, taking into account the VMs competing for the same processor. We proposed the fine-grain management of virtual resources [5] with a Multiple Dimension Slotting approach (MuDiS). MuDiS offers the following benefits. First, it acts on behalf of a custom policy defined by administrators. Second, resource providers are able to maximize the use of node resources through fine-grain assignment. Third, this fine-grain assignment allows service level agreements (SLAs) to be better fulfilled. Fourth, it allows Grid middleware to make better scheduling decisions by receiving accurate information about the internal resource usage of every resource provider. The rest of this paper is structured as follows. In Sect. 2 we describe work related to our approach. In Sect. 3 we explain the fine-grain resource management approach and the prototype implementation. In Sect. 4 we report on several experiments in which we study the achieved behaviour, both with the original approach and with the fine-grain resource management. Section 5 concludes and provides an outlook on future work.
2 Related Work
Current work that manages resources with the help of virtualization techniques enables the migration of VMs from over-utilized nodes to under-utilized nodes. This has led to new research challenges related to adaptation mechanisms in the scope of intra-node and inter-node resource managers. Our approach addresses intra-node adaptation by means of a local resource manager. K. Keahey et al. [6] introduce the concept of the virtual workspace (VW), an approach that aims at providing a customized and controllable remote execution environment for Grid resource isolation purposes. The underlying technologies that support the prototype are virtual machines for hardware virtualization (Xen and VMware) and the Globus Toolkit (GT). The interactions of resource provisioning follow VW descriptions, which define a VM state (running, shutdown, paused, etc.) and the main properties that a required VM must comply with, such as RAM size, disk size and network. Currently, however, there is no discussion of the performance impact of multiple VMs running at the same time or of the consequences when SLAs are not fulfilled. Nadyr Kiyanclar et al. [7] present Maestro-VC, a set of system software that uses virtualization (Xen) to multiplex the resources of a computing cluster for client use. The provisioning of resources is achieved by a scheduling mechanism split into two levels: an upper-level scheduler (GS), which manages VMs inside the virtual cluster, and an optional low-level scheduler (LS) per VM. The purpose is to incorporate information exchange between virtualized and native environments to coordinate resource assignment. The LS, however, is an optional mechanism and, if desired, must be explicitly supplied by the user.
David Irwin et al. [8] present Shirako, a system for on-demand leasing of shared networked resources in federated clusters. Shirako uses an implementation of Cluster-on-Demand for the cluster site manager component and SHARP [9] for the leasing mechanism. Leases are used as the mechanism for resource provisioning; thus intra-LRM adaptation cannot be made independently and is only possible in the long term. Our work follows the direction of Dongyan Xu et al. [10]. The authors propose the support of autonomic adaptation of virtual distributed environments (VDE) with a prototype of adaptive VIOLIN [11] based on Xen 3.0. They address the challenges of dynamic adaptation mechanisms and adaptation decision making. Inter-LRM adaptation is based on a dynamic cross-domain migration capability, while intra-LRM adaptation is achieved by adjusting the resource shares of physical nodes according to VM usage. The difference is that our approach addresses intra-LRM fine-grain adaptation mechanisms within single physical resources.
3 Fine-Grain Resource Management
3.1 The Management Component
Fine-grain resource management enhances the multiplexing mechanisms offered by Virtual Machine Monitors (VMMs). At the same time, it allows LRMs to manage the pinning mechanisms according to user requirements. The management component runs in the first VM (known as the privileged domain or domain0). It carries out the usage accounting of every single resource and groups single physical resources into subsets. These subsets, which can be changed dynamically, are seen by the upper level of the LRM as pseudo-physical machines with minimal interfering I/O interruptions between each other. Thus, physical resources with different characteristics can be exploited by an LRM. These subsets of resources can be assigned to VMs with different application profiles (e.g. CPU intensive, network intensive, SMP requirements, or disk I/O intensive). In fine-grain resource management the VMM sEDF scheduler works without changes in the VMs. With the help of pinning mechanisms, multiplexing occurs within subsets of resources. The difference from coarse-grain assignment is that the virtual division allows VMs, and hence the applications running in them, to behave according to the limitations of the assigned resources. Figure 1 shows an example of the partitioning of physical resources. In Fig. 1a we can see the partition into 3 subsets, each with different characteristics. These subsets are exposed to the LRM and each one can be sliced with traditional approaches (e.g. VM1 and VM2 could each have 50% of every resource of subset 1). The management component must map these subset shares and assign them to virtualized resources through the VMM. In Fig. 1b we can see an example of a new configuration after a certain time n. A usage pattern could have shown that some resources were under- or over-utilized. For better performance or load balancing of the VMs, the LRM could have regrouped these VMs taking into account the observed behaviour of the applications.
[Figure 1: two panels, (a) time t and (b) time t+1. Each panel shows the LRM with VM1 to VM5 on top of the multiple dimension slotting management, pinning frameworks and virtualization layers, which group the resources of the physical node (Processor1 to Processor4; IDE, SCSI and SATA disks; NIC1 10Mbs, NIC2 100Mbs, NIC3 1Gbs) into subsets.]
Fig. 1. Example of subsets of physical resources at time t in a), and at time t+n in b)
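The notion of a resource subset that is sliced among VMs and enforced through pinning can be sketched as a small data structure. This is a hypothetical model, not the prototype's code: the ResourceSubset class and its fields are invented, the grouping only loosely follows Fig. 1a, and the generated `xm vcpu-pin <domain> <vcpu> <cpus>` argument lists assume that each VM has a single VCPU numbered 0. The commands are only built, never executed.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceSubset:
    """A group of physical resources seen as one pseudo-physical machine."""
    name: str
    cpus: tuple                     # physical processor indices (0-based)
    disk: str
    nic: str
    shares: dict = field(default_factory=dict)   # VM name -> fraction of subset

    def assign(self, vm, share):
        """Give a VM a fraction of every resource in the subset."""
        used = sum(self.shares.values())
        if share <= 0 or used + share > 1.0 + 1e-9:
            raise ValueError(f"cannot give {share:.2f} of {self.name} to {vm}")
        self.shares[vm] = share

    def pin_commands(self):
        """Build (not run) one pinning command per VM; assumes VCPU 0 only."""
        cpu_list = ",".join(str(cpu) for cpu in self.cpus)
        return [["xm", "vcpu-pin", vm, "0", cpu_list] for vm in self.shares]


# Subset 1 of the example: one processor, the IDE disk and one NIC (assumed).
SUBSET1 = ResourceSubset("subset1", cpus=(0,), disk="IDE", nic="NIC2")
SUBSET1.assign("VM1", 0.5)
SUBSET1.assign("VM2", 0.5)
COMMANDS = SUBSET1.pin_commands()
```

Keeping the shares per subset rather than per node is what lets two VMs split one processor evenly while a third VM owns another processor outright.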
3.2 Prototype
To evaluate the proposed approach we have developed a prototype of the LRM, implemented in Python. The LRM aims to provide Grid middleware with the infrastructure for accessing a machine's subset of virtualized resources in a transparent way. It exposes a simple-to-use standard service interface for running jobs and hides the details of the managed local resources. The implementation of the proposed architecture leverages technologies that are currently in continuous development, i.e. virtualization based on Xen, and Tycoon [12]. Tycoon is a market-based system for managing compute resources in distributed environments. In the following, the main components of the LRM and their interrelations are explained in detail. The Local Resource Manager (LRM) component offers public services via XML-RPC interfaces. Part of the actions that can be performed by the LRM are encapsulated in a TycoonAPI component, which interfaces with Tycoon. The current prototype uses Tycoon for managing the creation, deletion, boot and shutdown of virtual machines, as well as for weighting the resources with its bidding mechanism. The TycoonAPI component is part of the LRM architecture and has been designed to allow migration to other high-level VM resource managers. One of the open issues of our prototype is the development of a high-level VM resource manager as an alternative to Tycoon. The design of the LRM includes other components that interface directly with Xen to perform tasks such as monitoring and CPU capacity management. The monitoring component traces VM performance to compute costs, and the CPU capacity management component offers fine-grain CPU specifications for VMs in SMP architectures.
A txtLRM client interface is used for the remote submission of jobs (among other actions, such as retrieving the output of executed jobs or querying the status of the resource provider). The steps involved in executing jobs in our prototype are as follows. First, the user defines a configuration file with the requirements of each job application:

    [JOB.1]
    slaCPU = 1000            #expressed in MHz
    slaCPUfloor=0.10         #slaCPU tolerable percent degradation
    slaBUDGETceil=2.5        #spendable budget per GHz per unit of time
    job = cputime.py         #application, source or binary
    output = cputime.csv     #output file
    necessary=True           #means schedule obligation
    [JOB.2]
    slaCPU = 1500
    ...

Second, the user invokes the action that parses and executes the job plan definitions:

    python txtLRM.py --lrm=147.83.30.203 --jobplan=job6fabrics

Third, if the user requirements can be fulfilled by the resource provider, then processes such as VM creation and booting, application deployment via ssh credentials, launching and monitoring are executed. Otherwise, the user receives information about the job requirements that cannot be fulfilled. Finally, the user can retrieve the output files; if the applications of the job plan have not yet finished, the user is informed accordingly.
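A job plan in the format of the first step can be read with Python's standard configparser module, as in the minimal sketch below. How txtLRM.py actually parses the file is not described in the text, so the parse_job_plan function and the keys of the returned dictionaries are assumptions made for illustration; only the first job section is reproduced.

```python
import configparser

# Job plan text following the format shown in the paper (first job only).
JOB_PLAN = """
[JOB.1]
slaCPU = 1000            #expressed in MHz
slaCPUfloor=0.10         #slaCPU tolerable percent degradation
slaBUDGETceil=2.5        #spendable budget per GHz per unit of time
job = cputime.py         #application, source or binary
output = cputime.csv     #output file
necessary=True           #means schedule obligation
"""


def parse_job_plan(text):
    """Turn each [JOB.n] section into a dictionary of typed requirements."""
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.optionxform = str          # keep the mixed-case option names
    parser.read_string(text)
    jobs = []
    for section in parser.sections():
        entry = parser[section]
        jobs.append({
            "name": section,
            "cpu_mhz": entry.getfloat("slaCPU"),
            "cpu_floor": entry.getfloat("slaCPUfloor"),
            "budget_ceil": entry.getfloat("slaBUDGETceil"),
            "job": entry.get("job"),
            "output": entry.get("output"),
            "necessary": entry.getboolean("necessary"),
        })
    return jobs


JOBS = parse_job_plan(JOB_PLAN)
```

The inline_comment_prefixes argument is needed because configparser ignores trailing comments by default and would otherwise keep them as part of each value.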
4 Experiments
The behaviour and performance of the LRM prototype are interesting to study in order to assess its design and to identify empirically the needs for future developments. The experimental results are therefore an important part of the presented work and support our main conclusions. In our experiments the resource provider node has a Pentium D 3.00 GHz (two processors), 1 GB of RAM, one hard disk (160 GB, 7200 rpm, SATA), and one network interface card (Broadcom NetXtreme Gigabit). The operating system is Fedora Core 4 and for virtualization we use Xen 3.0.2. The Tycoon client and auctioneer are version 0.4.1. In order to stress each virtual workspace, we use an application that behaves as a CPU-intensive job. The user application executes 500 transactions, where each transaction is composed of math operations. Two experiments are discussed to evaluate our approach for fine-grain resource management.
– In the first experiment we create four VMs and compare the expected end times of the executed jobs in two settings: (a) requesting through the LRM interface the creation of the four VMs in the original Tycoon-Xen way, and (b) using the fine-grain assignment made by the LRM.
– In the second experiment we request the creation of six VMs using fine-grain resource management.
In the first setting of experiment one, we use our LRM prototype to create four VMs. The request for the creation of a virtual workspace includes CPU requirements (slaCPU), which must be expressed in Hertz. In this case the values are 300MHz, 600MHz, 900MHz, and 1200MHz for fabric1, fabric2, fabric3, and fabric4, respectively. When the creation process of the four virtual machines ends, we proceed to upload the user application to each VM. Finally, the custom benchmark is launched in all four VMs at the same time. The CPU requirements are mapped (by the LRM prototype) to a percent share (0.0 – 1.0), and finally to Tycoon credits (or to a weight value for the sEDF scheduler if we were interfacing directly with the VMM). Therefore, external Grid users (or components) expect a performance of 0.1, 0.2, 0.3, and 0.4 share of physical CPU in fabric1, fabric2, fabric3, and fabric4, respectively (this will be achieved with the credit scheduler in future Xen versions). Furthermore, when we launch the user application (executing 500 transactions) in every VM, the external Grid components (e.g. a global Grid scheduler) would expect in principle that the order of completion should be fabric4, fabric3, fabric2, and finally fabric1. Figure 2 shows the utilization of each processor for experiment one (original coarse-grain approach). There are three important aspects to observe from this experiment. First, notice that virtual machines are assigned to each processor in a round-robin fashion. The assignment is affected by the order of virtual machine creation (which was fabric1, fabric2, fabric3 and fabric4). This is what the VMM sEDF scheduler is programmed to do by default. An LRM that is not aware of this will report incomplete or incorrect information about node capacity (such as the number of processors) to Grid middleware, as discussed later.
[Figure 2 plot: two panels, processor 0 (fabric1, fabric3, idle, dom0) and processor 1 (fabric2, fabric4, idle, dom0); axes: %cpu (0–100) versus time(s).]
Fig. 2. Experiment 1: four virtual workspaces, assigned by the original coarse-grain approach, executing the same CPU intensive job on a physical machine with two processors. The assignment of virtual machines is done in a round robin fashion by Xen VMM during creation. The four VMs are distributed on the two processors.
Second, a consequence of the first aspect is that the proportion of consumed physical resources does not correspond to the initial requirements. We observe from Fig. 2 that the share is translated to a new value in the domain of each processor. Applying proportional share (PS) we obtain the new weights for each fabric: 1/3 and 2/3 for fabric2 and fabric4, and 1/4 and 3/4 for fabric1 and fabric3, respectively. Third, the expected order of completion is not fulfilled. Even though fabric4 has the highest share, fabric3 ends first. This follows from the weights in each processor; the observed order of finishing is fabric3, fabric4, fabric2, fabric1 (instead of fabric4, fabric3, fabric2, fabric1). A closer look at the completion times is given in Fig. 3. We can see that as soon as a job ends (stops consuming resources of its VM share), at t=67.73 in processor 0 (fabric3) and at t=75.56 in processor 1 (fabric4), the share of the other job in the same processor changes. This is due to the nature of the sEDF scheduler.
[Figure 3 plot: transactions completed (0–500) versus time(s) (0–110) for Fabric1, Fabric2, Fabric3 and Fabric4; annotated completion times: 67.73, 75.56, 101.57 and 102.14 s.]
Fig. 3. Transactions completed during job execution (original coarse-grain management). The time consumed per transaction is measured by the benchmark application.
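The weights quoted for the coarse-grain setting can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: it assumes that each processor of the Pentium D offers 3000 MHz, and the function names are hypothetical and not part of the LRM, Tycoon or Xen.

```python
from fractions import Fraction

# Assumption: each processor of the Pentium D 3.00 GHz offers 3000 MHz.
PROCESSOR_MHZ = 3000

# slaCPU requests from experiment one, in creation order.
requests_mhz = {"fabric1": 300, "fabric2": 600, "fabric3": 900, "fabric4": 1200}


def requested_shares(requests, capacity_mhz=PROCESSOR_MHZ):
    """Map each Hertz request to a share (0.0 - 1.0) of one processor."""
    return {vm: Fraction(mhz, capacity_mhz) for vm, mhz in requests.items()}


def round_robin_placement(vms, n_processors=2):
    """Place VMs on processors in creation order, as observed in Fig. 2."""
    placement = {cpu: [] for cpu in range(n_processors)}
    for i, vm in enumerate(vms):
        placement[i % n_processors].append(vm)
    return placement


def effective_weights(shares, placement):
    """Renormalise the shares among the VMs competing on each processor."""
    weights = {}
    for vms in placement.values():
        total = sum(shares[vm] for vm in vms)
        for vm in vms:
            weights[vm] = shares[vm] / total
    return weights


shares = requested_shares(requests_mhz)
placement = round_robin_placement(list(requests_mhz))
weights = effective_weights(shares, placement)

for vm in requests_mhz:
    print(vm, "requested", shares[vm], "effective", weights[vm])
```

Under these assumptions fabric3 ends up with an effective weight of 3/4 on processor 0, which is larger than the 2/3 that fabric4 obtains on processor 1. This is in line with the observation that fabric3 finishes before fabric4.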
In the second setting of experiment one, we use the proposed fine-grain resource management with the LRM prototype to request the same VM requirements. Utilization of each processor is shown in Fig. 4. The assignment by the fine-grain approach in the LRM meets the Hertz required for each VM. It can be seen that all VMs are created in processor 0. Even though different policies can be applied for load balancing, in this experiment we have instructed the LRM to allocate VMs in one processor. In this case processor 1 is only used by Domain-0. The completion times of the benchmark in the VMs assigned by the fine-grain approach are shown in Fig. 5. If we compare the measured times at the completion of each job with the results of the original setting, we notice that each job takes more time to end. However, this is the expected behaviour for each VM that receives accurate proportions of physical processors. In fact, three of them take less than the expected execution time, since as soon as the job in the highest weighted VM finishes, the rest of the VMs use the unused share. Finally, we can see that with the fine-grain LRM the completion order, which is fabric4, fabric3, fabric2, fabric1, is according to the requested share.
Fig. 4. Experiment 1: four VMs assigned by fine-grain resource management through pinning mechanisms and sEDF scheduler. The four VMs are assigned to one processor.
[Figure 5 plot: transactions completed (0–500) versus time(s) (0–220) for fabric1, fabric2, fabric3 and fabric4; annotated completion times: 130.75, 156.52, 184.76 and 206.85 s.]
Fig. 5. Transactions completed during job execution (fine-grain approach). The sequence in the creation of VMs does not affect the expected performance; hence the execution time is according to requested share.
In the second experiment, we apply the fine-grain approach to assess resource assignment efficiency, and require the LRM to allocate six VMs. The CPU requirements are 300MHz, 600MHz, 900MHz, 1200MHz, 900MHz, and 2100MHz corresponding to fabric1, fabric2, fabric3, fabric4, fabric5, and fabric6, respectively. The LRM allocates four VMs, as in experiment one, to processor 0. The remaining VMs (fabric5 and fabric6) are assigned to processor 1. The results for this experiment are shown in Fig. 6. We can see that the completion order (fabric4, fabric3, fabric2, fabric1 in processor 0; fabric6, fabric5 in processor 1) is according to the initial requirements. Finally, the jobs executing in fabric1-4 do not affect those of fabric5-6 and vice versa.
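The allocation of the six VMs can be illustrated with a small placement routine. This is a minimal sketch and not the LRM code: it assumes a capacity of 3000 MHz per processor and a first-fit rule in creation order, and the name fill_capacity is hypothetical.

```python
# Assumption: each processor offers 3000 MHz and VMs are placed first-fit
# in creation order; the LRM's actual placement rule may differ.
PROCESSOR_MHZ = 3000


def fill_capacity(requests, n_processors=2, capacity_mhz=PROCESSOR_MHZ):
    """Assign each VM to the first processor that still has enough free MHz."""
    free = [capacity_mhz] * n_processors
    placement = {}
    for vm, mhz in requests:
        for cpu in range(n_processors):
            if mhz <= free[cpu]:
                free[cpu] -= mhz
                placement[vm] = cpu
                break
        else:
            raise ValueError(f"no processor has {mhz} MHz free for {vm}")
    return placement, free


requests = [
    ("fabric1", 300), ("fabric2", 600), ("fabric3", 900),
    ("fabric4", 1200), ("fabric5", 900), ("fabric6", 2100),
]
placement, free = fill_capacity(requests)
print(placement)  # fabric1-4 on processor 0, fabric5-6 on processor 1
print(free)       # remaining MHz per processor
```

With these requests the first four VMs use up processor 0 exactly (300 + 600 + 900 + 1200 = 3000 MHz), so fabric5 and fabric6 go to processor 1 and fill it as well (900 + 2100 = 3000 MHz).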
Fig. 6. Allocation of six VMs with LRM according to requested Hertz. The sequence in the creation of VMs does not affect the expected performance; hence the execution time is according to requested share.
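The completion order on each processor can be reasoned about with an idealised model. The sketch below assumes a work-conserving proportional-share scheduler (the share of a finished job is redistributed among the remaining ones according to their weights) and jobs of identical size; it is a simplification of sEDF behaviour, and completion_times is a hypothetical helper.

```python
def completion_times(weights, work=1.0):
    """Simulate work-conserving proportional sharing of one processor.

    weights: dict mapping a VM name to its weight (here the requested MHz).
    Every VM runs a job of the same size `work`, where 1.0 is the time the
    job would take with the whole processor. Returns (vm, finish_time) pairs
    in order of completion.
    """
    remaining = {vm: work for vm in weights}
    now = 0.0
    finished = []
    while remaining:
        total = sum(weights[vm] for vm in remaining)
        # Rate of each active VM: its weight renormalised over active VMs.
        rates = {vm: weights[vm] / total for vm in remaining}
        # Time until the next job completes at the current rates.
        step = min(remaining[vm] / rates[vm] for vm in remaining)
        now += step
        for vm in list(remaining):
            remaining[vm] -= rates[vm] * step
            if remaining[vm] <= 1e-9:
                finished.append((vm, now))
                del remaining[vm]
    return finished


processor0 = completion_times(
    {"fabric1": 300, "fabric2": 600, "fabric3": 900, "fabric4": 1200})
processor1 = completion_times({"fabric5": 900, "fabric6": 2100})

for vm, t in processor0 + processor1:
    print(f"{vm}: {t:.2f}")
```

In this model the order is fabric4, fabric3, fabric2, fabric1 on processor 0 and fabric6, fabric5 on processor 1. It also shows why the lower-weighted jobs finish earlier than their nominal share alone would suggest: fabric1, with a 0.1 share, would need 10 time units in isolation but finishes after about 4 once the other jobs release their shares.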
5 Conclusions and Outlook
We have presented an approach for fine-grain assignment of the virtualized resources of an SMP machine by means of a local resource manager (LRM). This approach is transparent to users, who do not need to know the detailed hardware configuration of the machine and can specify their resource needs in a general way. The LRM has been implemented taking advantage of the Xen virtualization tool and Tycoon. Our preliminary experimental results showed that with the LRM the local physical resources were properly assigned, such that the performance measured in transactions per second was accurate and fulfilled the agreed job plan requirements. We have observed two main benefits of our approach. First, the fine-grain approach fulfils certain subjective constraints given in the job plan requirements, like completion order, better than the coarse-grain approach. Second, a more efficient resource assignment can be achieved, since the LRM can flexibly assign virtualized resources according to different policies. For instance, we followed a fill-capacity per processor policy for the assignment of VMs in order to meet the required Hertz. We also obtain certainty about the expected completion times of the executed jobs. The weakness of the implemented prototype is that it offers only static partitioning. For this reason, we have identified additional features that would be beneficial to have in the LRM in order to comply with more features of the job plan definition. One of them is a mechanism that addresses the dynamic behaviour of the workload. This development could include, on the one hand, the adaptation of resources to changes in the workload composition and, on the other hand, the dynamic adaptation of the resources to a changing application profile.
Parallel Irregular Computations with Dynamic Load Balancing through Global Consistent State Monitoring

Janusz Borkowski1 and Marek Tudruj2

1 Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008 Warsaw, Poland
2 Institute of Computer Science, Polish Academy of Sciences, 21 Ordona Str., 01-237 Warsaw, Poland
{janb,tudruj}@pjwstk.edu.pl
Abstract. For efficient execution of parallel irregular computations, dynamic load balancing must be applied. If the computational work is associated with data sets that must be separately processed by an algorithm, then load balancing can be performed most efficiently by transferring the data sets between processes using application-level messages. Such a situation exists in parallel branch and bound (B&B) computations. A parallel B&B algorithm has been implemented in a novel parallel programming environment. This environment provides an infrastructure for parallel application control. Consistent global states of the application are continuously monitored. Control decisions are taken based on the monitored states, and the decisions are communicated to the application processes. This infrastructure has been used to implement a load balancing strategy in parallel B&B computations. An analysis of the characteristics of the control infrastructure and the application resulted in the choice of a global load balancing strategy that works with many simple and small steps executed frequently. Experiments have shown that this strategy works well. The chosen strategy is much more efficient (shortening the application runtime by more than 3 times) if the predicted results of load balancing decisions already taken are used for subsequent load balancing decisions.
1 Introduction
Parallel computations cannot be performed efficiently if some of the involved processors remain idle for longer periods of time. Properly implemented parallel computations divide the whole task and distribute the obtained fragments among all available processors. If a computational task can be split into a number of parts having a known or predictable computational cost, then the splitting and the assignment of parts to CPUs can be done before the computations start. Such a scheme is known as static load balancing. Unfortunately, many problems are dynamic in nature. They can be divided into subtasks only during runtime, because the subtasks are created dynamically and are not known a priori. Moreover, the computational cost of the obtained subtasks can only be estimated. To keep all the processors busy, dynamic load balancing must be applied. Dynamic load balancing takes into account the current load of computational nodes and tries to transfer some work to underloaded nodes to keep all the nodes busy. A more detailed classification of load balancing methods can be found in [1].
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 971–980, 2008. © Springer-Verlag Berlin Heidelberg 2008
Research on dynamic load balancing has a long history. In its simplest form, whole newly submitted serial tasks were allocated to the least loaded workstations [2]. Migration of running processes from overloaded nodes proved to be a more complicated task, inducing a large overhead and usually requiring special support from the operating system or a low-level (not portable) library. Examples of projects concerned with process migration are described in, e.g., [3,4]. Along with the development of object-oriented programming, load balancing methods targeted at object-oriented systems and able to allocate or migrate objects were developed, e.g. [5]. Since all the above-mentioned methods were implemented within an operating system or in a programming environment, load balancing actions could be performed automatically, without intervention of a programmer. Any application started in such an environment could benefit from load balancing. However, fully automatic load balancing has limited performance, as it can use only system-level information and usually does not exploit application-specific features. Therefore, semi-automatic load balancing has been introduced. For example, in an object-oriented system, a programmer can annotate the program code to indicate which objects should be created remotely for performance reasons [6]. Load balancing can be implemented at the application level, too. In such a case, specialized load balancing strategies can be employed that work best for a particular application.
At the application level, the programmer can utilize specific application features to accurately estimate the load of the computing nodes; e.g. in parallel adaptive integration the local integration error is proportional to the amount of awaiting work (more calculations are necessary to diminish a large error) [7]. Moreover, the load transfer can be realized more simply at the application level than at the system level. For example, in parallel adaptive integration an overloaded node can send just a description of the integration area to give some work to another node. Such a description is not visible below the application level. Application-level load balancing has been used in many projects where the highest efficiency was desired [8,9]. Unfortunately, application-level load balancing must be implemented separately for each application. The necessary effort can be reduced if a proper environment is available to the programmer. DAMPVM is an example [10]. DAMPVM provides an infrastructure to stop, move and restore a PVM process using only user-level message passing. A library for parallel solving of divide-and-conquer problems was successfully built on top of DAMPVM.
Our work was concerned with irregular computational problems requiring dynamic load balancing. We applied application-level load balancing implemented with the support of a convenient and general application control infrastructure. We report which load balancing strategies proved to be efficient in our environment. Section 2 describes the parallel programming environment we have designed and created, underlining its suitability for dynamic load balancing support. Section 3 presents the experiments we have performed and the results obtained. The Conclusions summarize the contribution of this paper.
2 Control Framework
In our approach, parallel application processes report their local states to monitors, and each report has a timestamp. The monitors use the reports to compute strongly consistent global states (SCGS) [11,12] based on the timestamps attached to these reports. To generate the timestamps, the computing nodes must be equipped with local real-time clocks, and the clocks must be synchronized with a known accuracy. An SCGS is represented internally as a vector in which each component represents the local state of a single process. User-defined predicates are evaluated by the monitors on the detected SCGSs by examining the computed vectors. If a predicate is satisfied, then control signals are dispatched to chosen application processes. The signals are handled asynchronously by the processes, as depicted in Fig. 1. Because the SCGSs can be detected on-the-fly with low-cost algorithms, and because asynchronous control signal handling makes it possible to avoid passive waiting, the resulting parallel/distributed program control method proved to be efficient [13].
Fig. 1. Process reactions for control signal arrival
The proposed program control method has been included in a graphical parallel program design environment, PS-GRADE [14]. In this environment a programmer defines how process states will be presented to the monitors. Important aspects of a process state can be easily coded as values of chosen variables, e.g. the value of the best solution found so far in a parallel search. The values must be communicated to the monitors when they change, by placing communication instructions in the code. An important feature of the proposed control framework is the separation of the code responsible for high-level application control from the main computational code. The control rules are contained in the monitors in the form of predicates, and in the processes in the form of separate code activated by control signals.
The framework is very well suited for load balancing implementation. Processes can watch their own load and report it to a monitor. A predicate evaluated on the resulting global states can verify whether the system is balanced. Under- or overloaded processes can be sent control signals to perform load balancing actions. The load balancing strategy can be easily changed or tuned by replacing the predicate in the monitor. Admittedly, the framework encourages the use of global and centralized load balancing strategies. Such strategies benefit from global knowledge, but can suffer from a lack of scalability. We show in the following section that the scalability problem is not serious in our case. It can be further reduced by using a hierarchical control scheme [15].
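The report, predicate and signal cycle described in this section can be sketched as follows. This is an illustrative toy and not the PS-GRADE implementation: the consistency test (a global state is accepted only if it lasts longer than twice the assumed clock accuracy eps) is a deliberately simplified stand-in for the SCGS detection algorithms of [11,12], and all names (Report, detect_states, run_monitor, imbalance_predicate) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Report:
    timestamp: float  # local clock reading when the state changed
    process: int      # index of the reporting process
    state: int        # reported local state, e.g. the current load


def detect_states(reports, n, eps):
    """Yield (time, vector) for global states that last longer than 2*eps.

    Simplifying assumption: clocks agree within eps, so a global state whose
    lifetime exceeds 2*eps is not merely an artefact of clock skew.
    """
    vector = [None] * n
    ordered = sorted(reports, key=lambda r: r.timestamp)
    for current, following in zip(ordered, ordered[1:] + [None]):
        vector[current.process] = current.state
        if None in vector:
            continue  # not every process has reported yet
        if following is None:
            lifetime = float("inf")
        else:
            lifetime = following.timestamp - current.timestamp
        if lifetime > 2 * eps:
            yield current.timestamp, list(vector)


def imbalance_predicate(threshold):
    """Build a predicate that signals the two extreme processes."""
    def predicate(vector):
        hi = max(range(len(vector)), key=lambda i: vector[i])
        lo = min(range(len(vector)), key=lambda i: vector[i])
        if vector[hi] - vector[lo] > threshold:
            return [(hi, "give"), (lo, "take")]
        return []
    return predicate


def run_monitor(reports, n, eps, predicate, send_signal):
    """Evaluate the predicate on every detected state and dispatch signals."""
    for _, vector in detect_states(reports, n, eps):
        for process, signal in predicate(vector):
            send_signal(process, signal)


reports = [
    Report(0.000, 0, 8), Report(0.001, 1, 0),
    Report(0.00105, 0, 7), Report(0.004, 1, 3),
]
sent = []
run_monitor(reports, n=2, eps=1e-4, predicate=imbalance_predicate(5),
            send_signal=lambda p, s: sent.append((p, s)))
print(sent)
```

Keeping the predicate as a separate function mirrors the separation of control rules from computational code: changing the load balancing strategy amounts to passing a different predicate to the monitor.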
3 Experiments
3.1 Load Balancing in Irregular Computations
Many computational problems have irregular data and/or control characteristics. Such irregular computations are difficult to implement efficiently in a parallel system [16]. In this paper, we deal with irregular computations in which a set of tasks is processed in such a way that each task potentially, but not necessarily, generates a number of new subtasks. As it is not possible to foresee which task can be solved quickly and which will produce subtasks, a dynamic task distribution among parallel processes must be applied. Examples of such irregular problems include branch and bound (B&B) computations [17] and adaptive integration [8]. In parallel implementations of these problems the same algorithm is executed by every computational node. The computational work is associated with data sets, which the algorithm must process. In B&B the data sets take the form of partial solutions, in adaptive integration the form of integration areas. Load balancing can be performed by transferring some data sets from an overloaded process to an underloaded one, using application-level message passing. The use of system- or middleware-level load balancing would require encapsulation of the data sets in a way perceivable and manageable by the operating system (or middleware), producing a large overhead compared with simple message passing. Therefore, we decided to use application-level load balancing in our implementation of irregular computations.
The experiments described in this paper are based on the Travelling Salesman Problem solved using a parallel B&B algorithm. To investigate this approach for a wide range of parameters and with adjustable performance features, we employed a dedicated simulator [18]. In the simulation, the processes were allocated to separate computing nodes. The network was based on a crossbar switch and had characteristics equal to those of Myrinet. The proposed control framework was fully supported by the simulator.
3.2 Coupling Load Balancing Strategy with the Control Framework
We employed the control infrastructure described in Section 2. Processes, allocated one per CPU, reported their load to the monitor. The load was measured as a weighted number of partial solutions waiting for further processing (more complete solutions had a smaller weight). Reporting was performed periodically and immediately after completion of each load balancing action. The monitor evaluated a properly constructed predicate on each detected global application state, looking for load imbalance. The evaluation resulted in decisions relating to the displacement of available work among constituent processes. The decisions were communicated to the processes in the form of control signals. The actual load transfer was implemented as standard message passing between the involved processes.
It seemed natural to use the whole information about the load of the processes, contained in the detected SCGS and readily available at the monitor. We could easily implement an optimal load balancing strategy to determine the load flow necessary to completely equalize process loads. However, such a strategy would have a number of disadvantages in our system:
– Predicates are evaluated on each detected SCGS. If the optimal load flow calculation for all constituent processes were included in the predicate evaluation, then the predicate evaluation cost could be high. We would not be able to evaluate the predicate often, and the rate of SCGS detection would have to be artificially limited.
– Process load is only estimated using a heuristic measure. It makes little sense to perform optimal load balancing using such a measure.
– The monitor learns about the application state with some delay. Control decisions are executed after a further delay, when control signals reach the processes. The calculated optimal flow may not be fully relevant to the current situation.
– The optimal flow may not be achievable, because processes may not be able to divide their loads precisely as requested.
– The optimal flow strategy creates bursty network traffic: when the optimal flow is calculated, the resulting commands must be dispatched to every process one by one. We noticed that such bursts can limit the performance of the system [18]. Among other reasons, if the process signaled first has to receive work from the process signaled last, the former must wait a long time until the expected data arrive.
Additionally, the relative effectiveness of simple (as opposed to optimal) load balancing strategies has already been shown in [19]. Taking the above arguments into account, we decided to drop the idea of adopting an optimal strategy and to use a simple one. In the proposed design, the control predicate, which estimates the need for work displacement, had to be evaluated at each detected SCGS at low cost. To fulfil this requirement, at each evaluation only the most loaded process and the least loaded process were selected using a sequential search on the vector representing a consistent application state. If their loads differed significantly, the processes were signaled and they communicated with each other to level their loads. A simple rule prevented the monitor from selecting the same pair of processes a few times in a row and sending the same orders repeatedly until the changed load was reported back from the signaled processes. Although a single load balancing action concerned only two processes, the action could be repeated frequently at each SCGS. The more processes took part in the computations, the more often the global load state changed and the more often SCGSs were detected. Due to that, the strategy was able to balance the load efficiently also in large systems using many small steps. The generated network traffic was distributed evenly in time, because only 3 messages (2 control signals and the following load transfer) could be generated at a time. Moreover, the chosen pair of processes received control signals almost simultaneously, reducing the waiting time of the receiving process.
3.3 Simulation Results
We performed detailed tests, described below, using the simulator, which provided a flexible and controlled testbed environment. We noticed that application speedup was poor for a large number of processors (512). This could be blamed partially on the inherent inefficiencies of parallel B&B [20]. However, an improvement of the load balancing strategy could help. The improvement had to retain the desired overall strategy features described in the preceding subsection. First of all, the computational cost of decision taking (predicate evaluation) should remain low. We checked four variants of the load balancing strategy:
STD “standard”, as described in the preceding subsection.
C1 “cycle 1st”: the sequential search examines the state vector not from index 0 to N, but from index last to (last+N) modulo N. The variable last is incremented after each search. Thanks to that, the search does not favour processes having small indexes in the vector (if many processes have a load equal to 0, the first one found is selected).
P “prediction”: when a load balancing decision is taken, the resulting load change can be reflected in the state vector only after the involved processes have reported their new load. Before that time, the monitor uses obsolete information for subsequent decisions. In the “prediction” variant, the monitor writes the supposed load change into the state vector immediately, as a prediction. The prediction is overwritten by actual process reports when they arrive.
C1P both the C1 and P improvements described above applied together.
Application parameters were chosen in such a way that differences in load balancing efficiency would be clearly visible. Branching and bounding a single partial solution lasted around 15μs. It was fast enough to change process load very quickly: a large pool of partial solutions could be solved or eliminated within a few milliseconds, or it could evolve into a larger number of prospective partial solutions. Prompt and accurate load balancing actions were necessary to retain good efficiency of computations.
Fig. 2 shows the measured CPU idle time expressed as a fraction of the application execution time. The results clearly show the differences between the tested strategy variants. For 8 CPUs the differences are not significant. For 512 CPUs the STD variant is much worse (73% idle) than the improved P and C1P variants (33% idle). CPU idle time should be small, but a small CPU idle time does not guarantee high efficiency of parallel B&B computations. The most practical measure is the application execution time. In Fig. 3 we see that again the improved P and C1P variants give the best results for 64 and for 512 CPUs (3.5 times better than the STD variant), while showing roughly the same performance as the other variants for 8 CPUs. It becomes clear that for a large number of CPUs the prediction feature is the most important one. For a large number of processes the global application state changes very frequently. In each state load balancing decisions are taken. Without the prediction, the monitor uses obsolete information many times before it gets an update about the current process load. The prediction lets the monitor differentiate between processes recently chosen for load balancing (and expected to have balanced load) and imbalanced processes not yet served.
Fig. 2. Normalized idle time
Fig. 3. Execution time
The importance of the prediction is further confirmed in Fig. 4. Without it, a load balancing strategy is not able to increase the number of load transfers when the number of CPUs grows from 64 to 512. A monitor chooses a pair of processes for load balancing, and this pair is chosen again and again until the monitor receives fresh state reports from the two involved processes. Because the same process pair is not allowed to do load balancing many times in a row, effectively no further load balancing is performed until the arrival of new state reports. Prediction eliminates this deficiency.
Fig. 4. Number of load transfers
The described load balancing strategy worked well in real computations performed on a limited scale with the use of the PS-GRADE system [13].
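The pairwise strategy of Section 3.2 and the C1 and P variants of Section 3.3 can be sketched in a few lines. The code is a minimal illustration under stated assumptions and not the authors' implementation: loads are integers, the predicted change is an equal split of the two loads, the repeat-prevention rule blocks a pair until one of its members reports again, and the names pick_pair and Monitor are hypothetical.

```python
def pick_pair(loads, start=0):
    """Return (most_loaded, least_loaded) indexes from one sequential scan.

    The scan starts at `start` and wraps around, so ties are broken in
    favour of the first process met after `start` (the C1 idea).
    """
    n = len(loads)
    hi = lo = start % n
    for k in range(1, n):
        i = (start + k) % n
        if loads[i] > loads[hi]:
            hi = i
        if loads[i] < loads[lo]:
            lo = i
    return hi, lo


class Monitor:
    def __init__(self, n, threshold, cycle=False, predict=False):
        self.loads = [0] * n
        self.threshold = threshold
        self.cycle = cycle      # C1 variant: rotate the scan start
        self.predict = predict  # P variant: write the supposed new loads
        self.last = 0
        self.pending = None     # pair signalled and not yet reported back

    def report(self, process, load):
        """A process reports its real load; this overwrites any prediction."""
        self.loads[process] = load
        if self.pending and process in self.pending:
            self.pending = None

    def decide(self):
        """Evaluate the predicate on the current state vector."""
        start = self.last if self.cycle else 0
        if self.cycle:
            self.last = (self.last + 1) % len(self.loads)
        hi, lo = pick_pair(self.loads, start)
        if self.loads[hi] - self.loads[lo] <= self.threshold:
            return None
        if (hi, lo) == self.pending:
            return None  # do not repeat the same order
        self.pending = (hi, lo)
        if self.predict:
            # Assumption: the two processes level their loads exactly.
            total = self.loads[hi] + self.loads[lo]
            self.loads[lo] = total // 2
            self.loads[hi] = total - total // 2
        return hi, lo


initial = [10, 0, 8, 0]
decisions = {}
for name, predict in (("STD", False), ("P", True)):
    m = Monitor(len(initial), threshold=2, predict=predict)
    for i, load in enumerate(initial):
        m.report(i, load)
    decisions[name] = [m.decide() for _ in range(3)]
print(decisions)
```

In this toy run, with no fresh reports arriving between decisions, the STD monitor can issue only one order, because it keeps finding the same pair. The P monitor moves on to the second imbalanced pair immediately, which corresponds to the effect attributed to prediction in the discussion of Fig. 4.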
4 Conclusions
Efficient parallel implementation of irregular computations requires load balancing. Load balancing performed at the application level should be most efficient for a class of computations in which the computational work is associated with data sets that an algorithm must process. Application-level messages can transfer such data sets between processes in the most efficient way. The novel parallel programming environment we propose provides a ready framework very well suited for the implementation of application-level global load balancing strategies. In this environment control decisions are taken based on a continuous analysis of application global states. We implemented the parallel Travelling Salesman Problem using a Branch and Bound algorithm in the proposed environment. A careful analysis of the programming environment properties and application properties has shown that we should get the best results by adopting a load balancing strategy that works using many simple small steps executed frequently. In each step we choose the pair of the most and the least loaded processes and we command them to equalize their loads. Our experiments have shown that this strategy works well. We have noticed that the chosen strategy is much more efficient (shortening the application runtime by more than 3 times) if the predicted results of load balancing decisions already taken are used for subsequent load balancing decisions. Also, some improvement can be gained if the search for the most and the least loaded processes is performed in such a way that, among a number of equally loaded processes, each can be selected with a similar probability by the search procedure. The proposed parallel programming environment has proved to be an efficient tool for the implementation of compound parallel computations.
References 1. Wu, J.: Distributed System Design. CRC Press, Boca Raton (1999) 2. Zhou, S., Ferrari, D.: An experimental study of load balancing performance. Technical Report Technical Report UCB/CSD 87/336, University of California, Berkeley (1987) 3. Phillips, I.W., Capon, P.C.: Experiments in distributed load balancing. In: Boyanov, K. (ed.) Proceedings of the Fourth Workshop on Parallel and Distributed Processing, WP&DP 1993, Sofia, Bulgaria, pp. 146–171. North-Holland, Amsterdam (1993) 4. Iskra, K., Linden, F., der Hendrikse, Z.V., Overeinder, B., Albada, G., Sloot, B.V.: The implementation of dynamite: An environment for migrating PVM tasks. Operating Systems Review 34(3), 40–55 (2000) 5. Corradi, A., Leonardi, L., Zambonelli, F.: Dynamic load distribution in massively parallel architectures: The parallel objects example. In: 1st Conference on Massively Parallel Computing Systems, May 1994, pp. 318–322. IEEE, Los Alamitos (1994) 6. Philippsen, M., Zenger, M.: JavaParty - transparent remote objects in java. Concurrency: Practice and Experience 9(11), 1225–1242 (1997) 7. http://www.cs.wmich.edu/~parint/ 8. Schürer, R.: Optimal communication interval for parallel adaptive integration. Parallel and Distributed Computing Practices 5 (3), 269–278 (2002) 9. Bahi, J.M., Contassot-Vivier, S., Couturier, R.: Dynamic load balancing and efficient load estimators for asynchronous iterative algorithms. IEEE Transactions on Parallel and Distributed Systems 16(4), 289–299 (2005) 10. Czarnul, P., Kranzlmüller, D., Kacsuk, P., Dongarra, J., Volkert, J.: Development and tuning of irregular divide-and-conquer applications in DAMPVM/DAC. In: Kranzlmüller, D., Kacsuk, P., Dongarra, J., Volkert, J. (eds.) PVM/MPI 2002. LNCS, vol. 2474, pp. 208–216. Springer, Heidelberg (2002) 11. Stoller, S.D.: Detecting global predicates in distributed systems with clocks. Distributed Computing 13(2), 85–98 (2000) 12. Borkowski, J.: Global predicates for on-line control of distributed applications. 
In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds.) PPAM 2004. LNCS, vol. 3019, pp. 269–277. Springer, Heidelberg (2004) 13. Borkowski, J., Kopanski, D., Tudruj, M.: Parallel irregular computations control based on global predicate monitoring. In: Proc. Of International Symposium on Parallel Computing in Electrical Engineering PARELEC 2006, Bialystok, Poland, pp. 233–238. IEEE, Los Alamitos (2006) 14. Borkowski, J., Tudruj, M., Kopanski, D.: Parallel program design tool with application control methods based on global states. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds.) PPAM 2004. LNCS, vol. 3019, pp. 338–343. Springer, Heidelberg (2004) 15. Borkowski, J.: Parallel program control based on hierarchically detected consistent global states. In: International Conference on Parallel Computing in Electrical Engineering (PARELEC 2004), pp. 328–333. IEEE, Los Alamitos (2004)
16. Biswas, R., Oliker, L., Shan, H.: Parallel computing strategies for irregular algorithms. Annual Review of Scalable Computing (April 2003) 17. Androulakis, I.P., Floudas, C.A.: Distributed branch and bound algorithms for global optimization. The IMA Volumes in Mathematics and its Applications, "Parallel Processing of Discrete Problems" 106, 1–37 (1999) 18. Borkowski, J., Tudruj, M.: Dual communication network in program control based on global application state monitoring. In: Proceedings of ISPDC 2007, Linz, Austria, IEEE, Los Alamitos (accepted for publication, 2007) 19. Eager, D., Lazowska, E., Zahorjan, J.: Adaptive load sharing in homogeneous distributed systems. IEEE Transactions on Software Engineering, 662–675 (May 1986) 20. Trienekens, H.: Parallel branch and bound and anomalies. Technical Report EURFEW-CS-89-01, Erasmus University Rotterdam (1989)
On-Line Partitioning for On-Line Scheduling with Resource Conflicts
Piotr Borowiecki
Department of Discrete Mathematics and Theoretical Computer Science, University of Zielona Góra, Poland
[email protected]
Abstract. In this paper we consider the problem of on-line partitioning of a sequence of jobs that compete for non-sharable resources. As a result of the partitioning we get subsets of jobs that form separate instances of the on-line scheduling problem. The objective is to generate a partition into the minimum number of instances such that the response time of any job in each instance is bounded by a given constant. Our research is motivated by applications in scheduling multiprocessor jobs on dedicated processors and channel assignment in WDM networks.
Keywords: On-line scheduling, resource conflicts, maximum response time, quality of service, graph partitioning, on-line algorithms.
1 Introduction
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 981–990, 2008. © Springer-Verlag Berlin Heidelberg 2008

In many real-life applications a set of resources has to be shared among jobs subject to a mutual exclusion constraint. Examples include time sharing in wireless networks, session scheduling in high-speed LANs [1], channel assignment in WDM optical networks [3] and managing processor partitions of massively parallel computer systems [5] (see Sect. 6 for more information). Each job Jj requires for its execution a set of non-sharable resources Rj, and a conflict arises between jobs Jj and Ji if they require the same resource, i.e. Rj ∩ Ri ≠ ∅. A pair of conflicting jobs may not be executed concurrently. We model the conflicts using a conflict graph, where vertices represent jobs and edges represent pairs of conflicting jobs. The well-known classical scheduling metrics, e.g. makespan and average completion time, represent the viewpoint of the system owner, while none of them seems suitable for representing the user's viewpoint, for which the time of finishing an individual job is more important.

The subject of scheduling with conflicts has been studied mainly in the off-line model. In the on-line setting the main concern of [6] was the makespan, while [9] investigated minimization of the maximum response time of unit jobs for interval and bipartite conflict graphs. The structure of the conflict graphs used in [9] was given in advance and vertices represented types of jobs having the same resource requirements. If two types of jobs demanded a common resource, there was an edge between the appropriate nodes of the graph. In our setting the nodes of the graph represent single jobs, and they are dynamically added to and deleted from the graph as jobs arrive at and depart from the system. Further, we let each job have its own requirement on the response time (quality of service requirement, QoS). It is not hard to imagine a sequence of jobs and a conflict graph that cause some job to wait for resources longer than it requested. We deal with this problem using a system consisting of independent subsystems Mi, each having its own replica of the resources. At the input of the system we get the sequence of jobs J = (J1, J2, ..., Jn) together with their resource and QoS requirements (see Fig. 1). Each job arriving at the system has to be assigned immediately (i.e. without knowledge of future jobs) to one of the subsystems. By Ji we denote the set of jobs assigned to subsystem Mi. Further reassignment of jobs is not allowed. Our main concern is to minimize the number of subsystems (replicas) while guaranteeing that the response time of each job in any subsystem does not exceed the time requested by the job. In the system we maintain two types of conflict graphs: a global conflict graph used for job-to-subsystem assignment and local conflict graphs used for on-line scheduling of the jobs assigned to subsystems.

Fig. 1. Scheme of the system. [Diagram: the input sequence J enters the on-line job-to-subsystem assignment, which routes the job sets J1, J2, ..., Ji to on-line scheduling in subsystems M1, M2, ..., Mi; an "on job leave" message is fed back from the subsystems to the assignment stage.]
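The conflict relation defined in the introduction (two jobs conflict when their resource sets intersect) can be illustrated with a minimal Python sketch. The job names and resource identifiers are hypothetical, and the adjacency-set representation is an assumption made for illustration only; the paper does not prescribe a data structure.

```python
from itertools import combinations

def build_conflict_graph(resources):
    """Return adjacency sets: two jobs are adjacent iff their resource sets intersect."""
    graph = {job: set() for job in resources}
    for a, b in combinations(resources, 2):
        if resources[a] & resources[b]:      # Rj ∩ Ri ≠ ∅
            graph[a].add(b)
            graph[b].add(a)
    return graph

# Hypothetical jobs and resource identifiers
resources = {
    "J1": {"r1", "r2"},
    "J2": {"r2", "r3"},
    "J3": {"r4"},
}
graph = build_conflict_graph(resources)
```

Here J1 and J2 share r2 and are therefore adjacent, while J3 conflicts with no other job.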
2 Mathematical Model
Within this section we state two closely related problems. First, we describe the problem of feasibility of the job sets for the scheduling algorithms used in the subsystems (subsystem perspective); then, from the system perspective, we state the main problem, i.e. the problem of on-line assignment of jobs to the minimum number of subsystems guaranteeing feasibility of the job sets within each of the subsystems. All times are assumed to be integral and, if not otherwise stated, we investigate unit jobs, identical subsystems and jobs having the same QoS requirements. Remarks on non-unit jobs, varying subsystems and QoS can be found in Sect. 5.

2.1 On-Line Scheduling Problem - Subsystems Perspective
In general, the statement of the on-line resource-constrained scheduling problem to be solved in each of our subsystems resembles the one commonly used in the literature (see e.g. [6]). Each job Jj has its release time rj, and jobs may arrive at different release times. A job may be assigned to a time slot starting not earlier than its release time. When Jj arrives, its resource requirements are known and the structure of the local conflict graph is updated, i.e. the vertex of the new job is added together with the edges to all vertices representing jobs in conflict with Jj and having release times at most rj. Released but not yet scheduled jobs are pending. For unit jobs, at the beginning of each time slot t the on-line scheduling algorithm selects a subset of non-conflicting pending jobs, i.e. an independent set in the local conflict graph. The selected jobs are scheduled in time slot t. At the beginning of the next time slot t + 1 the conflict graph structure is updated: vertices of jobs with release time t + 1 are added, while vertices corresponding to the scheduled jobs are removed. It is assumed that the time required to select the jobs to be scheduled is very small compared to the duration of the unit time slot and therefore not important. We do not assume any explicit bound on the number of jobs that can be executed concurrently; the only limitation comes from the structure of the resource conflicts. The response time (also referred to as the flow time) of the job Jj, denoted by Fj, is defined as the difference between the completion time Cj and the release time rj of the job. For the set Ji of jobs assigned to the subsystem Mi, by Si we denote any sequence of jobs from Ji arranged in non-descending order of their release times. For the sequence Si and on-line scheduling algorithm A, by A(Si) we denote the longest response time resulting from the schedule constructed by A. More formally,

$$A(S_i) = \max_{J_j \in S_i} (C_j - r_j).$$

Finally, we define the responsiveness of Ji with respect to the on-line scheduling algorithm A as the maximum of A(Si) taken over all possible sequences Si, namely

$$R_A(\mathcal{J}_i) = \max_{S_i} A(S_i).$$

From the perspective of the subsystem we are interested in giving the conditions to be satisfied by the job sets such that their responsiveness for algorithm A can be bounded by some given constant k.

Problem 1. What requirements should the jobs from Ji comply with to guarantee that, for any given integer k ≥ 1 and on-line scheduling algorithm A, the responsiveness RA(Ji) is not greater than k?

The set Ji of jobs complying with such requirements will be called k-feasible for on-line scheduling algorithm A.
2.2 On-Line Partitioning Problem - System Perspective
Recall that at the input of the system we get the sequence J of jobs and that each job arriving at the system has to be immediately and irrevocably assigned to one of the subsystems. When a job arrives at the system its resource requirements are known and can be taken into account during the subsystem assignment. The global conflict graph is maintained at the system level and is revealed as jobs arrive. Its structure is updated when a job arrives at the system and when it leaves its subsystem after execution (note the feedback in Fig. 1). From the perspective of the system we are interested in on-line algorithms for the job-to-subsystem assignment which generate the minimum number of k-feasible instances of the on-line scheduling problem (to be solved in the subsystems).

Problem 2. For a given integer k ≥ 1 and on-line algorithm A, partition on-line a sequence J into the minimum number of sets J1, J2, ..., Jm, each being k-feasible for A.
3 Feasibility of Unit Jobs
Within this section we investigate the problem of guaranteed responsiveness of unit job sets for the Greedy-FIFO (GF for short) scheduling algorithm. The makespan objective of a Greedy-FIFO-like algorithm for resource-constrained scheduling was analyzed in [6]. Algorithm GF works as follows. Let Vt denote the set of jobs pending at the beginning of time slot t. The algorithm selects a maximal set of non-conflicting pending jobs according to the FIFO rule, which among all possible choices guarantees the selection of a vertex with the minimum release time and thus avoids starvation. The selected vertices are scheduled in time slot t. The problem of scheduling unit jobs can be modeled as the classical (proper) vertex coloring of a conflict graph, where the colors assigned to vertices (jobs) correspond to the time slots. Let Gt be the local conflict graph induced by all jobs pending at the beginning of time slot t and let v ∈ V(Gt) be a vertex representing a job Jj that has just been released, i.e. rj = t. Since we are interested in the guaranteed response time of each job, the problem of the maximum response time of job Jj for the GF scheduler is equivalent to the problem of finding the maximum color that may be necessary to color vertex v of Gt using the MISF (maximal independent set FIFO) graph coloring algorithm. Similarly to the GF algorithm, MISF selects a maximal independent set, preferring vertices that represent jobs having minimum release times. Since a maximal independent set I selected by MISF is also a dominating set, all remaining vertices have at least one neighbor in I. It follows that all colorings generated by MISF are Grundy colorings. On the other hand, each Grundy coloring can be generated using the First-Fit (FF for short) coloring algorithm. Hence for any graph the number of colors required by MISF is never greater than for FF. Moreover, in the worst case, i.e. when the release dates of all vertices are the same, there always exists a choice of maximal independent sets such that MISF uses exactly the same number of colors as is required by FF.
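The slot-by-slot behaviour of GF described above can be sketched in Python as follows. This is only an illustration under stated assumptions: jobs are unit jobs, ties in release time are broken by arrival order (the insertion order of the `release` dictionary), and a job scheduled in slot t is taken to complete at time t + 1, so a job served immediately has response time 1. The function name and data layout are hypothetical.

```python
def greedy_fifo(release, conflicts):
    """Simulate a Greedy-FIFO scheduler for unit jobs; return {job: response time}.

    release:   {job: integer release time}; insertion order breaks ties (assumption)
    conflicts: {job: set of conflicting jobs}
    """
    order = sorted(release, key=lambda job: release[job])  # stable sort keeps arrival order
    pending, flow = [], {}
    t, nxt = min(release.values()), 0
    while pending or nxt < len(order):
        # Jobs released by the beginning of slot t become pending
        while nxt < len(order) and release[order[nxt]] <= t:
            pending.append(order[nxt])
            nxt += 1
        # Scan pending jobs in FIFO order and build a maximal independent set
        chosen = []
        for job in pending:
            if not any(other in conflicts[job] for other in chosen):
                chosen.append(job)
        for job in chosen:
            flow[job] = t + 1 - release[job]   # completion at the end of slot t
        pending = [job for job in pending if job not in flow]
        t += 1
    return flow

# Conflict graph: the path a - b - c - d, all jobs released at time 0
conflicts = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
flow_bad = greedy_fifo({"a": 0, "d": 0, "b": 0, "c": 0}, conflicts)   # arrival order a, d, b, c
flow_good = greedy_fifo({"a": 0, "b": 0, "c": 0, "d": 0}, conflicts)  # arrival order a, b, c, d
```

With the arrival order a, d, b, c the last job waits until the third slot, whereas the order a, b, c, d finishes every job within two slots. The maximum response time therefore depends on the sequence, which is why responsiveness is defined as a maximum over all sequences.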
Algorithm MISF;
BEGIN
  U ← V(G) sorted in non-decreasing order of release dates;
  k ← 0;
  WHILE U ≠ ∅ DO
  BEGIN
    k ← k + 1;
    I ← ∅;
    FOR each u ∈ U in order DO
      IF I ∪ {u} is independent THEN I ← I ∪ {u};
    Assign color k to all vertices in I;
    U ← U \ I;
  ENDWHILE
END.

Denoting by A(G) the maximum number of colors that may be required by algorithm A for graph G, we have

Lemma 1. For any graph G, MISF(G) ≤ FF(G).

In the following we use the notion of critical graphs. A graph G is called k-critical for algorithm A if A(G) = k but A(G − v) < k for any v ∈ V(G), where G − v denotes the graph obtained from G by removing the vertex v. Critical graphs play an important role in the coloring process; in fact, they determine the number of colors used by the algorithm. This can be stated as the following lemma.

Lemma 2. For any graph G, A(G) < k if and only if G does not contain an induced subgraph isomorphic to any of the graphs k-critical for algorithm A.

Using arguments very similar to those that led us to Lemma 1, one can argue that MISF would never use k colors on G if G did not contain a subgraph isomorphic to one of the graphs k-critical for FF. Forbidding the subgraphs isomorphic to graphs k-critical for FF defines the class $\mathcal{FF}(k)$ of graphs k-colorable with FF. For instance, if k = 2, then $\mathcal{FF}(2)$ is defined by forbidding induced subgraphs isomorphic to either K3 or P4 [2,8]. Let us assume that the subsystem Mi uses the GF algorithm to schedule the set Ji of unit jobs and let Gi = (Gt : t = 1, 2, ...) be the sequence of graphs, each representing the local resource conflicts in Mi at the beginning of the time slots t = 1, 2, ..., respectively. Then we have the following theorem.

Theorem 1. If none of the local conflict graphs Gt ∈ Gi contains an induced subgraph isomorphic to some graph k-critical for FF, then Ji is k-feasible for GF.

It is not hard to see that the theorem holds not only for GF but also for all Grundy schedulers, i.e. chromatic scheduling algorithms that generate Grundy colorings.
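The MISF procedure and its relation to First-Fit from Lemma 1 can be checked on small graphs with the following Python sketch. It assumes that all release dates are equal, so that the processing order is an arbitrary vertex ordering, and it computes the worst case by brute force over all orderings, which is feasible only for very small graphs. The function names are illustrative and not taken from the paper.

```python
from itertools import permutations

def misf_coloring(order, adj):
    """Repeatedly extract a maximal independent set, scanning vertices in the given order."""
    color, remaining, k = {}, list(order), 0
    while remaining:
        k += 1
        independent = []
        for u in remaining:
            if not any(v in adj[u] for v in independent):
                independent.append(u)
        for u in independent:
            color[u] = k
        remaining = [u for u in remaining if u not in color]
    return color

def first_fit_coloring(order, adj):
    """Give each vertex, in order, the smallest color not used by its colored neighbors."""
    color = {}
    for u in order:
        used = {color[v] for v in adj[u] if v in color}
        c = 1
        while c in used:
            c += 1
        color[u] = c
    return color

def worst_case(algorithm, adj):
    """Maximum number of colors over all vertex orderings (brute force, small graphs only)."""
    return max(max(algorithm(p, adj).values()) for p in permutations(adj))

# The path P4: a - b - c - d
p4 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
misf_worst = worst_case(misf_coloring, p4)      # evaluates to 3
ff_worst = worst_case(first_fit_coloring, p4)   # evaluates to 3
```

On P4 both procedures need three colors in the worst case, which agrees with P4 being one of the forbidden subgraphs for two-colorability by FF.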
One of the reasons for our interest in graphs critical for FF is the existence of their constructive characterization [2] and the possibility of deciding in polynomial time whether, for fixed k, a graph is k-colorable by FF or, equivalently, whether it belongs to $\mathcal{FF}(k)$ [8]. We use this fact in the construction of an on-line algorithm that assigns jobs to subsystems.
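For k = 2 the membership test reduces to the forbidden-subgraph condition quoted earlier: no induced K3 and no induced P4. A minimal Python sketch of that check is given below. It enumerates all vertex triples and quadruples, which is polynomial but not optimized, and the function names are hypothetical.

```python
from itertools import combinations

def induces_p4(nodes, adj):
    """True if the four given vertices induce a path (degree sequence 1, 1, 2, 2)."""
    degrees = sorted(sum(1 for v in nodes if v in adj[u]) for u in nodes)
    return degrees == [1, 1, 2, 2]

def in_ff2(adj):
    """True if the graph has neither an induced triangle K3 nor an induced path P4."""
    for a, b, c in combinations(adj, 3):
        if b in adj[a] and c in adj[a] and c in adj[b]:
            return False                      # induced K3 found
    return not any(induces_p4(quad, adj) for quad in combinations(adj, 4))

# Small hypothetical test graphs
p4 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
star = {"x": {"p", "q", "r"}, "p": {"x"}, "q": {"x"}, "r": {"x"}}
triangle = {"u": {"v", "w"}, "v": {"u", "w"}, "w": {"u", "v"}}
c4 = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
```

Among four-vertex graphs with three edges only the path has degree sequence 1, 1, 2, 2, so the degree test identifies an induced P4 without an explicit isomorphism check.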
4 On-Line Assignment of Jobs to Subsystems
Since there is a one-to-one correspondence between jobs from the input sequence J and vertices of the global conflict graph G, every partition (V1, V2, ..., Vm) of V(G) uniquely determines a partition (J1, J2, ..., Jm) of J into m sets, and each subgraph G[Vi] is isomorphic to the appropriate local conflict graph of Mi. However, not all partitions are feasible. Following Theorem 1, to guarantee for fixed k ≥ 1 that Ji is k-feasible for GF we have to generate a partition such that each G[Vi] belongs to $\mathcal{FF}(k)$. In order to state this more formally we use the P-partitioning framework. A class P of graphs is said to be induced hereditary if whenever G ∈ P and H is a vertex-induced subgraph of G, then H ∈ P. A class P is called additive if every graph G all of whose components are in P is itself in P. Let P1, P2, ..., Pm, m > 1, be any additive induced hereditary classes of graphs. A vertex (P1, P2, ..., Pm)-partition of the graph G is a partition (V1, V2, ..., Vm) of V(G) such that each induced subgraph G[Vi] of G belongs to the class Pi. The sets Vi are called the components of the partition. If P1 = P2 = ... = Pm = P, then the partition is simply called a P-partition of G. Note that a P-partition into independent sets (inducing edgeless graphs) is a proper vertex coloring. Using P-partitions, we can view Problem 2 for unit jobs and identical subsystems as the problem of on-line $\mathcal{FF}(k)$-partitioning of the global conflict graph with the objective of minimizing the number of partition classes (subsystems). In the following example we demonstrate how different partitioning strategies influence the response time of jobs.

Example 1. Let J = {J1, ..., J11} be the set of unit jobs at the input of the system and let G be the global conflict graph for J (see Fig. 2 (a)). Since we are interested in worst-case scenarios, without loss of generality all release times are assumed to be equal.
If all jobs were assigned to subsystem M1, i.e. J1 = J, then we could have the sequence S1 = (J6, J9, J11, J1, J5, J8, J10, J2, J7, J3, J4) forcing GF to schedule J3 and J4 such that F3 = F4 = 4, which implies RGF(J) ≥ 4 (see Fig. 2 (b)). On the other hand, we can significantly improve the responsiveness at the cost of using an additional subsystem. Below we give two partitions. In both cases jobs are assumed to arrive at the system in the order (J1, J2, ..., J11).

Partition 1. Assuming the objective of balancing the number of jobs assigned to both subsystems we get the following partition: J1 = {J1, J3, J5, J7, J9, J11}, J2 = {J2, J4, J6, J8, J10}, and there exist sequences S1 = (J9, J11, J1, J5, J7, J3) and S2 = (J6, J8, J10, J2, J4) such that F3 = F4 = 4. Thus RGF(Ji) ≥ 4, i = 1, 2 (see Fig. 2 (c)).

Partition 2. Algorithm FFP greedily assigns each job to the lowest subsystem whose local conflict graph remains in class P after the job is added (see [4] for details of FFP). Let us assume that we want to partition J into sets 2-feasible for FF, i.e. P = $\mathcal{FF}(2)$. Consequently, for jobs arriving in the above-mentioned order algorithm FFP gives two sets J1 = {J1, J2, J3, J4, J5, J8} and J2 = {J6, J7, J9, J10, J11}. It is not hard to see that the appropriate local conflict graphs are 2-colorable by FF, and from Theorem 1 it follows that there is no sequence of jobs that could force GF to schedule some job such that its response time is greater than 2. Thus RGF(Ji) ≤ 2, i = 1, 2 (see Fig. 2 (d)).

Fig. 2. Global conflict graph and its partitions. Subsystem M1 (dotted line) and M2 (dashed line). [Panels: (a) global conflict graph on J1, ..., J11; (b) schedule with all jobs in M1; (c) Partition 1; (d) Partition 2. Vertex labels in (b)-(d) are time slots.]
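The first-fit partitioning rule used in Partition 2 can be sketched in Python as follows. This is a simplified illustration rather than the FFP algorithm of [4]: it ignores job departures, takes the class test as a parameter, and uses the K3/P4 test for P = FF(2) together with a small hypothetical conflict graph, because the graph of Fig. 2 is not reproduced here.

```python
from itertools import combinations

def in_ff2(adj):
    """Class test for FF(2): no induced K3 and no induced P4."""
    for a, b, c in combinations(adj, 3):
        if b in adj[a] and c in adj[a] and c in adj[b]:
            return False
    for quad in combinations(adj, 4):
        degrees = sorted(sum(1 for v in quad if v in adj[u]) for u in quad)
        if degrees == [1, 1, 2, 2]:           # the four vertices induce a path
            return False
    return True

def ffp_partition(arrivals, adj, in_class):
    """Assign each arriving vertex to the lowest part whose induced graph stays in the class."""
    parts = []
    for v in arrivals:
        for part in parts:
            candidate = part | {v}
            induced = {u: adj[u] & candidate for u in candidate}
            if in_class(induced):
                part.add(v)
                break
        else:
            parts.append({v})                 # no existing part fits: open a new subsystem
    return parts

# Hypothetical global conflict graph: the path a - b - c - d
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
parts = ffp_partition(["a", "b", "c", "d"], adj, in_ff2)
```

The first three jobs fit into one subsystem, while adding the fourth would complete an induced P4, so it opens a second subsystem. Only edges between already assigned vertices are inspected, which matches the on-line setting in which the graph is revealed as jobs arrive.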
Algorithm FFP belongs to the important class SA of stingy algorithms, i.e. algorithms that always try to use one of the already used partition components (not necessarily the lowest one) before they introduce a new one. It is known that the performance guarantee function A(n), defined as the maximum of A(G)/OPT(G) taken over all n-vertex graphs, has a linear lower bound that holds independently of P for any stingy on-line P-partitioning algorithm A.

Theorem 2 ([4]). For any additive induced hereditary class of graphs P and any stingy on-line P-partitioning algorithm A we have A(n) > n/(4(α − 1)), where α > 1 is an integer constant dependent on P.

Although there exist graphs, e.g. forcing P-trees and the graphs used in the proof of Theorem 2, for which any stingy algorithm A has an arbitrarily large difference A(G) − OPT(G) between the worst and the optimum solution values, for some families of graphs there exists a function f such that A(G) can be bounded from above by f(OPT(G)).
For instance, if P = $\mathcal{FF}(1)$ we have to partition a graph into independent sets, thus any on-line algorithm for proper coloring can be used. We mention just a few bounds for this case, while readers interested in more details are referred to [3,11] for surveys on proper on-line coloring. For algorithm FFP an upper bound of 2OPT(G) − 1 holds for complements of chordal graphs and 3OPT(G)/2 for complements of bipartite graphs, while any split graph requires no more than OPT(G) + 1 colors [7] and no on-line algorithm can do better [2]. Any interval graph can be colored by FFP with at most 10OPT(G) colors [12]. It is also well known that for cographs FFP always gives optimal partitions. Despite the pessimistic bound of Theorem 2, FFP was proved to be the best stingy on-line P-partitioning algorithm, e.g. for the family of P-trees.

Theorem 3 ([4]). Let P be an arbitrary additive induced hereditary class of graphs and let $\mu(G) = \min_{A \in SA} A(G)$. Then for any P-tree T we have FFP(T) = μ(T).

Special attention should be given to graphs that can be optimally partitioned. The general idea of constructing, for any given P, the forbidden subgraph characterizations of graph families that are optimally P-partitionable with FFP may be found in [4]. In particular, for a given k ≥ 1 the construction can be applied to obtain the family X(k) of graphs optimally partitionable into graphs belonging to $\mathcal{FF}(k)$.

Theorem 4. Let P = $\mathcal{FF}(k)$. Then the on-line partitioning algorithm FFP gives an optimal P-partition of any graph G ∈ X(k).

We omit the details of constructing the forbidden subgraphs for X(k), which in the general case is very tedious and beyond the scope of this paper. Finally, we turn to probably the most frequently applied family of graphs, i.e. interval graphs. Let ω(G) denote the clique number of the graph G.

Theorem 5. Let P = $\mathcal{FF}(3)$. If G is an interval graph, then the on-line partitioning algorithm FFP gives a P-partition of G into at most ω(G) partition components.

Proof (sketch). Since the class of interval graphs is induced hereditary, each partition component Vi generated by FFP induces an interval graph. It was proved in [10] that for any on-line algorithm there exists an interval graph G requiring at least 3ω(G) − 2 colors. If FFP generated fewer than ω(G) partition components 3-colorable by FF, then putting our system into a black box we would have an on-line coloring algorithm using fewer than 3ω(G) − 2 colors. On the other hand, algorithm SCC [13] greedily groups vertices and within each of the groups assigns them to levels using the FF principle (the color is a pair consisting of the group and the level number). From the construction of graphs critical for FF [2] we know that graphs induced by groups generated by SCC belong to a subclass of $\mathcal{FF}(3)$. From the analysis of both algorithms it follows that FFP never assigns a vertex to a higher group (subsystem) than SCC does. The number of groups used by SCC is not greater than ω(G).
5 Non-unit Jobs, Varying Subsystems and Regularity
In the preceding sections we dealt with unit jobs only. Here, we remark on the problem of the guaranteed response time of non-unit jobs and identical subsystems. If pj denotes the processing time of job Jj, then each job can be modeled as a sequence of pj unit jobs (tasks for brevity). Consequently, we distinguish two ways of processing non-unit jobs. Either they must be processed entirely within the same subsystem, or a job may re-enter the system (without incurring any overhead) to execute its subsequent task, possibly in a different subsystem. The next task can be executed only if the preceding task of the same job has already finished. In both cases the response time of each job is guaranteed to be not greater than kpj.

Theorem 6. Let GF be the scheduling algorithm in subsystem Mi. If none of the local conflict graphs Gt ∈ Gi contains an induced subgraph isomorphic to some graph k-critical for FF, then for any job Jj ∈ Ji, Fj ≤ kpj.

The same approach can be used in the case of persistent conflicting jobs, i.e. jobs that do not leave the system after execution and have to be scheduled repeatedly. In [1] the authors were interested in regular schedules, i.e. schedules in which the periods of time between successive executions of a particular job are as evenly spaced as possible. The maximum response time was used as one of the two measures of regularity. The problems analyzed in [1] were off-line. Within our framework the conflict graphs change over time, and at the additional cost of replicating resources we get the possibility of achieving very high regularity of schedules. This approach seems to be very natural, especially for session scheduling in high-speed local-area networks. Finally, note that the P-partitioning approach is strong enough to be used in the case when each of the subsystems Mi uses its own on-line scheduling algorithm Ai and the required responsiveness of particular jobs is different. To guarantee for each subsystem that RAi(Ji) ≤ ki, we have to generate a (P1, P2, ..., Pm)-partition of V(G), where Pi is the class of graphs on-line ki-colorable with Ai.
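The bound Fj ≤ kpj of Theorem 6 can be traced with a short calculation. The sketch assumes, which the text does not state explicitly, that task l + 1 of job Jj is released exactly when task l completes, and that each unit task has response time at most k. The symbols $r_{j,l}$ and $C_{j,l}$ for the release and completion times of the l-th task are introduced here only for illustration.

$$F_j = C_{j,p_j} - r_{j,1} = \sum_{l=1}^{p_j} \left( C_{j,l} - r_{j,l} \right) \le \sum_{l=1}^{p_j} k = k\,p_j, \qquad \text{where } r_{j,l+1} = C_{j,l}.$$

The middle equality is a telescoping sum: every intermediate completion time cancels against the release time of the next task.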
6 Applications
Scheduling multiprocessor jobs on dedicated processors. Multiprocessor jobs require more than one processor at the same moment of time. Examples of multiprocessor jobs are parallel applications consisting of many concurrent threads, and redundant copies of the same program run simultaneously on different processors to gain a higher level of reliability (see [5] for a review). Let us consider a multicomputer system consisting of parallel computers M1, M2, ..., Mm and let Pi = {P1, P2, ..., Pmi} be the set of dedicated processors available on computer Mi. The jobs compete for subsets of dedicated processors, and two multiprocessor jobs are in resource conflict if the sets of processors they require have at least one element in common. Note that $\mathcal{FF}(k)$-partitioning of the global conflict graph resolves selected conflicts, guaranteeing a required level of feasibility of the job sets assigned to the appropriate parallel computers.
Communication channel assignment in WDM networks. A communication network can be modeled as a network graph whose vertices correspond to communicating nodes and whose edges represent communication links. When two nodes vs and vt want to communicate, a session (s, t) is established. After the route selection (represented as a path P = vs ... vt in the network graph) a communication channel is assigned. A resource conflict appears when the paths P1 = vs1 ... vt1 and P2 = vs2 ... vt2 representing sessions (s1, t1) and (s2, t2) share at least one edge. However, if there is no requirement of instant data transmission, then a suitable channel access protocol enables channel reuse. To provide high quality of service the maximum time of waiting for link access must be guaranteed. Using our framework to analyze channel assignments, we treat the communication sessions as jobs that have to be assigned to one of the subsystems (channels). The objective is to find an assignment of sessions to the minimum number of channels such that for each channel the maximum time a session has to wait for data transmission never exceeds a given constant.
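The session conflict rule above (two sessions conflict when their routes share a link) can be sketched in Python as follows. The session names and routes are hypothetical, links are treated as undirected, and the output format is an assumption made for illustration.

```python
from itertools import combinations

def path_links(route):
    """Undirected links used by a route given as a list of nodes."""
    return {frozenset(pair) for pair in zip(route, route[1:])}

def session_conflicts(routes):
    """Adjacency sets: two sessions conflict iff their routes share at least one link."""
    links = {s: path_links(r) for s, r in routes.items()}
    graph = {s: set() for s in routes}
    for s, t in combinations(routes, 2):
        if links[s] & links[t]:
            graph[s].add(t)
            graph[t].add(s)
    return graph

# Hypothetical sessions and routes in a small network
routes = {
    "s1": ["a", "b", "c"],
    "s2": ["c", "b", "d"],
    "s3": ["d", "e"],
}
wdm_graph = session_conflicts(routes)
```

Sessions s1 and s2 both use the link between b and c and therefore conflict, while s3 shares only the node d with s2 and is conflict-free. The resulting graph plays the role of the global conflict graph when sessions are assigned to channels.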
References
1. Bar-Noy, A., Mayer, A., Schieber, B., Sudan, M.: Guaranteeing fair service to persistent dependent tasks. SIAM Journal on Computing 27, 1168–1189 (1998)
2. Borowiecki, P.: Characterization of graphs critical for First-Fit graph coloring. In: 13. Workshop on Discrete Optimization, Burg, pp. 8–12 (1998)
3. Borowiecki, P.: On-line coloring of graphs. In: Kubale, M. (ed.) Graph colorings. Contemporary Mathematics 352, pp. 21–33. American Mathematical Society (2004)
4. Borowiecki, P.: On-line P-coloring of graphs. Discussiones Mathematicae Graph Theory 26, 389–401 (2006)
5. Drozdowski, M.: Scheduling multiprocessor tasks - an overview. European J. Oper. Res. 94, 215–230 (1996)
6. Even, G., Halldórsson, M.M., Kaplan, L., Ron, D.: Scheduling with conflicts: approximation algorithms and online algorithms (manuscript, 2007), http://www.eng.tau.ac.il/~danar/papers.html
7. Gyárfás, A., Lehel, J.: On-line and First-Fit coloring of graphs. J. Graph Theory 12, 217–227 (1988)
8. Gyárfás, A., Király, Z., Lehel, J.: On-line 3-chromatic graphs - II. Critical graphs. Discrete Math. 177, 99–122 (1997)
9. Irani, S., Leung, V.: Scheduling with conflicts on bipartite and interval graphs. Journal of Scheduling 6, 287–307 (2003)
10. Kierstead, H.A., Trotter, W.T.: An extremal problem in recursive combinatorics. Congressus Numerantium 33, 143–153 (1981)
11. Kierstead, H.A.: Coloring graphs on-line. In: Fiat, A. (ed.) Dagstuhl Seminar 1996. LNCS, vol. 1442, pp. 281–305. Springer, Heidelberg (1998)
12. Pemmaraju, S.V., Raman, R., Varadarajan, K.: Buffer minimization using max-coloring. In: Proc. 15th ACM-SIAM Symp. on Discrete Algorithms, pp. 562–571 (2004)
13. Ślusarek, M.: A coloring algorithm for interval graphs. In: Kreczmar, A., Mirkowska, G. (eds.) MFCS 1989. LNCS, vol. 379, pp. 471–480. Springer, Heidelberg (1989)
A Multiobjective Evolutionary Approach for Multisite Mapping on Grids
Ivanoe De Falco1, Antonio Della Cioppa2, Umberto Scafuri1, and Ernesto Tarantino1
1 ICAR–CNR, Via P. Castellino 111, 80131 Naples, Italy, {ivanoe.defalco,ernesto.tarantino,umberto.scafuri}@na.icar.cnr.it
2 DIIIE, University of Salerno, Via Ponte don Melillo 1, 84084 Fisciano (SA), Italy, [email protected]
Abstract. Grid systems, constituted by multisite and multi–owner time–shared resources, make a great amount of locally unemployed computational power accessible to users. To profitably exploit this power for processing computationally intensive grid applications, an efficient multisite mapping must be conceived. The mapping of cooperating and communicating application subtasks, already known to be NP–complete for parallel systems, becomes even harder in grid computing because the availability and workload of grid resources change dynamically, so evolutionary techniques can be adopted to find near–optimal solutions. In this paper a mapping tool based on a multiobjective Differential Evolution algorithm is presented. The aim is to reduce the execution time of the application by selecting among all the potential solutions the one which minimizes the degree of use of the grid resources and, at the same time, complies with Quality of Service requirements. The proposed mapper is assessed on some artificial problems differing in application sizes and workload constraints.
Keywords: Grid computing, mapping, Differential Evolution.
1 Introduction
The amount of average load of computing systems varies as a function of the user type. In fact, the applications and the timing they are submitted for the execution diversify when the user changes. Nevertheless, independently of the user, the quantity of computational resources really employed is only a fraction of that provided by the systems which, hence, result on average underused. Obviously, in grid environments [1] the resources unexploited could be usefully integrated and eventually used for the execution of weight– and power–sensitive applications. It is to note that, in case of time–shared systems and in absence of information about the communication timing, the co–scheduling of the communicating subtasks of the same parallel application must be guaranteed to avoid possible deadlock conditions [2]. Grid is just a decentralized heterogeneous multisite system which aggregates multi–owner resources (processing, network bandwidth and storage capacity) R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 991–1000, 2008. c Springer-Verlag Berlin Heidelberg 2008
992
I. De Falco et al.
spread across multiple domains. This system enables the virtualization of distributed computing, such as to create a single powerful collaborative problem– solving environment which can grant the execution of MPI computationally intensive applications [3]. Thus, given a grid constituted by an adequate number of sites, each of which contains a specified number of nodes, the MPI communicating subtasks of a parallel application could be conveniently assigned to the grid nodes which, on the basis of their characteristics and load conditions, turn out to be more suitable to execute them. However, maximizing parallel efficiency and minimizing influence on the local workload performance remains an open challenge when parallel jobs are executed on non–dedicated grid systems together with the local workload. Besides, when the execution of an MPI grid application must satisfy also user–dependent requirements [4,5], such as performance and Quality of Service (QoS) [6], single sites resources could be insufficient for fulfilling all these needs. Thus, a multisite mapping tool, able to obtain high throughput while matching applications demands with the networked grid computing resources, must be designed. Hypothesizing well known both the physical characteristics of the single grid system and its average load conditions in a given time span, our mapping solution distributes the subtasks among the nodes minimizing execution time and communication delays, and at the same time optimizing resource utilization. QoS issues are investigated only in terms of reliability. This means that our mapper maximizes reliability by preferring solutions which make use of devices, i.e. processors and links connecting the sites to internet, which only seldom are broken. As the mapping is an NP–complete problem [7], several evolutionary–based techniques have been used to face it in a heterogeneous or grid environment [8,9,10,11]. 
To provide the user with a set of possible mapping solutions, each with a different balance between use of resources and reliability, a multiobjective version of Differential Evolution (DE) [12] based on the Pareto method [13] is proposed here. Unlike other existing evolutionary methods, which simply search for a single site onto which to map the application, we adopt a multisite approach. Moreover, as a further distinction from other approaches presented in the literature [14], we consider the nodes making up the sites as the lowest computational unit, taking into account each node's reliability and actual load. The paper is structured as follows: Section 2 outlines our multiobjective mapper, Section 3 reports on the test problems and shows the findings achieved, and Section 4 contains conclusions.
2 DE for Mapping Problem

2.1 The Technique
Given a minimization problem with $q$ real parameters, DE tackles it starting from a randomly initialized population of $M$ individuals, each made up of $q$ real values. Then, the population is updated from one generation to the next by means of different transformation schemes. We have chosen the strategy referred to as DE/rand/1/bin. In it, for the generic $i$–th individual in the current population, three integer numbers $r_1$, $r_2$ and $r_3$ in $[1, \ldots, M]$, differing from one another and from $i$, are randomly generated. Furthermore, another integer number $s$ in the range $[1, q]$ is randomly chosen. Then, starting from the $i$–th individual, a new trial individual $i'$ is generated whose generic $j$–th component is given by $x'_{ij} = x_{r_3 j} + F \cdot (x_{r_1 j} - x_{r_2 j})$, provided that either a randomly generated real number $\rho$ in $[0.0, 1.0]$ is lower than the value of a parameter $CR$, in the same range as $\rho$, or the position $j$ under consideration is exactly $s$. If neither condition holds, a simple copy takes place: $x'_{ij} = x_{ij}$. $F$ is a real constant factor which controls the magnitude of the differential variation $(x_{r_1 j} - x_{r_2 j})$. This new trial individual $i'$ is compared against the $i$–th individual in the current population and, if fitter, replaces it in the next population; otherwise the old one survives and is copied into the new population. This basic scheme is repeated for a maximum number of generations $g$.
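The construction of a DE/rand/1/bin trial individual can be sketched in Python as follows. This is only an illustrative sketch: the function name, the list-of-lists population layout and the 0-based indices are assumptions made here, not part of the original description.

```python
import random

def de_rand_1_bin_trial(population, i, F=0.5, CR=0.8, rng=random):
    """Build the trial vector for individual i (0-based) under DE/rand/1/bin."""
    M = len(population)
    q = len(population[i])
    # r1, r2, r3: mutually distinct indices, all different from i
    r1, r2, r3 = rng.sample([r for r in range(M) if r != i], 3)
    # s: one position that is always perturbed
    s = rng.randrange(q)
    trial = list(population[i])  # default is a plain copy of x_ij
    for j in range(q):
        if rng.random() < CR or j == s:
            trial[j] = population[r3][j] + F * (population[r1][j] - population[r2][j])
    return trial
```

The caller would then evaluate the trial vector and keep it only if it is fitter than individual `i`, as described above. The population must contain at least four individuals for the three distinct indices to exist.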
2.2 Formalizations

To formulate the mapping problem on a grid we need information on the number and status of both accessible and demanded resources. We assume an application task subdivided into $P$ subtasks (demanded resources) to be mapped on $n$ nodes (accessible resources) with $n \in [1, \ldots, N]$, where $P$ is fixed a priori and $N$ is the number of grid nodes. We need to know a priori the number of instructions $\alpha_i$ computed per time unit on each node $i$. Furthermore, we assume knowledge of the communication bandwidth $\beta_{ij}$ between any pair of nodes $i$ and $j$. Note that $\beta_{ij}$ is the generic element of an $N \times N$ symmetric matrix $\beta$ with very high values on the main diagonal, i.e., $\beta_{ii}$ is the bandwidth between two subtasks on the same node. This information is supposed to be contained in tables based either on statistical estimations over a particular time span or on periodic tracking and dynamic forecasting of resource conditions [15,16]. In the Globus Toolkit [6], which is a standard grid middleware, the information is gathered by the Grid Index Information Service (GIIS) [16]. Since grids address non–dedicated resources, their local workloads must be considered when evaluating the computation time. Several prediction methods exist to face the challenge of non–dedicated resources [17,18]. For example, we suppose that the average load $\ell_i(\Delta t)$ of node $i$ over a given time span $\Delta t$ is known, with $\ell_i(\Delta t) \in [0.0, 1.0]$, where 0.0 means a completely unloaded node and 1.0 a node locally loaded at 100%. Hence $(1 - \ell_i(\Delta t)) \cdot \alpha_i$ represents the fraction of the power of node $i$ available for the execution of grid subtasks. As regards the resources requested by the application task, we assume that for each subtask $k$ we know the number of instructions $\gamma_k$ and the number of communications $\psi_{km}$ between the $k$–th and the $m$–th subtask, $\forall m \neq k$, to be executed. Obviously, $\psi_{km}$ is the generic element of a $P \times P$ symmetric matrix $\psi$ with all null elements on the main diagonal. All this information can be obtained by static program analysis, by using smart compilers, or by other tools which automatically generate it. For example, the Globus Toolkit includes an XML standard format to define application requirements [16]. Finally, information must be provided about the degree of reliability of every component of the grid. This is expressed in terms of the fraction of actual operativity $\pi_z$ for the processor $z$ and $\lambda_w$ for the link connecting to the internet the site $w$ to which $z$ belongs. Each of these values ranges in $[0.0, 1.0]$.

2.3 Encoding
In general, any mapping solution is represented by a vector $\mu$ of $P$ integers ranging in the interval $[1, N]$. To obtain $\mu$, the real values provided by DE in the interval $[1, N+1)$ are truncated before evaluation. The truncated value $\mu_i$ denotes the node onto which subtask $i$ is mapped. For all mapping problems in which the communications $\psi_{km}$ must also be taken into account, allocating a subtask on a given node may mean that the optimal mapping requires other subtasks to be allocated on the same node or in the same site, so as to decrease their communication times and thus their execution times, taking advantage of the higher communication bandwidths within a site compared to those between sites. Such a problem is a typical example of epistasis, i.e. a situation in which the value taken by a variable influences those of other variables. This situation is also deceptive, since a solution $\mu_1$ can be transformed into another solution $\mu_2$ with better fitness only by passing through intermediate solutions, worse than both $\mu_1$ and $\mu_2$, which would be discarded. To overcome this problem we have introduced a new operator, named site mutation, applied with a probability $p_m$ any time a new individual must be produced. When this mutation is carried out, a position in $\mu$ is randomly chosen; suppose its content refers to a node in site $C_i$. Then another site, say $C_j$, is randomly chosen. If the latter has $N_j$ nodes, the chosen position and the next $N_j - 1$ positions are filled with values representing consecutive nodes of $C_j$, starting from the first one of $C_j$. If the right end of the chromosome is reached, this process continues circularly. If $N_j > P$, the operator stops after modifying $P$ alleles. If site mutation does not take place, the classical DE transformations are applied.
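The decoding step and the site mutation operator can be sketched as follows. The function names and the representation of sites as a list of node lists are hypothetical choices made for illustration; the sketch also assumes that "another site" means a site different from the one of the chosen position.

```python
import random

def decode(x):
    """Truncate DE's real values in [1, N+1) to node indices in [1, N]."""
    return [int(v) for v in x]

def site_mutation(mu, sites, rng=random):
    """Return a copy of mu after one site mutation.

    sites: list of lists, sites[k] holding the node indices of site k
    in increasing order (an assumed layout).
    """
    P = len(mu)
    pos = rng.randrange(P)  # randomly chosen position in mu
    current = next(k for k, nodes in enumerate(sites) if mu[pos] in nodes)
    # pick a site different from the current one (assumption, see lead-in)
    target = rng.choice([k for k in range(len(sites)) if k != current])
    nodes = sites[target]
    out = list(mu)
    # fill N_j consecutive positions circularly, but never more than P alleles
    for offset in range(min(len(nodes), P)):
        out[(pos + offset) % P] = nodes[offset]
    return out
```

In a full mapper this operator would be applied with probability $p_m$ whenever a new individual is produced, falling back to the ordinary DE transformation otherwise.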
2.4 Fitness

Given the two goals described in Section 1, we have two fitness functions, one accounting for the time of use of resources and the other for their reliability.

Use of resources. Denoting by $\tau_{ij}^{comp}$ and $\tau_{ij}^{comm}$ respectively the computation and communication times required to execute subtask $i$ on the node $j$ it is assigned to, the generic element of the execution time matrix $\tau$ is computed as

$$\tau_{ij} = \tau_{ij}^{comp} + \tau_{ij}^{comm}.$$

In other words, $\tau_{ij}$ is the total time needed to execute subtask $i$ on node $j$. It is evaluated on the basis of the computation power and of the bandwidth which remain available once the local workload has been deducted. Let $\tau_j^s$ be the sum of $\tau_{ij}$ over all the subtasks assigned to the $j$–th node by the current mapping. This value is the time spent by node $j$ in executing the computations and communications of all the subtasks assigned to it by the proposed solution. Clearly, $\tau_j^s$ is equal to zero for all the nodes not included in the vector $\mu$. Considering that all the subtasks are co–scheduled, the time required to complete the application execution is given by the maximum value among all the $\tau_j^s$. Then, the fitness function is:

$$\Phi_1(\mu) = \max_{j \in [1, N]} \{\tau_j^s\} \qquad (1)$$
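A possible way to evaluate $\Phi_1$ is sketched below. The text does not spell out how $\tau_{ij}^{comp}$ and $\tau_{ij}^{comm}$ are obtained, so the sketch assumes that computation time is the instruction count divided by the available power $(1 - \ell_j)\alpha_j$, and that communication time is the exchanged volume divided by the bandwidth between the two hosting nodes. Function and parameter names are illustrative, and indices are 0-based.

```python
def phi1(mu, gamma, psi, alpha, load, beta):
    """Maximum per-node busy time for the mapping mu (0-based node indices).

    gamma[i]   : instructions of subtask i
    psi[i][m]  : communication volume between subtasks i and m
    alpha[j]   : instructions per time unit of node j
    load[j]    : local load of node j in [0, 1)
    beta[j][h] : bandwidth between nodes j and h
    """
    busy = {}
    P = len(mu)
    for i, node in enumerate(mu):
        available = (1.0 - load[node]) * alpha[node]
        t_comp = gamma[i] / available  # assumed cost model
        t_comm = sum(psi[i][m] / beta[node][mu[m]] for m in range(P) if m != i)
        busy[node] = busy.get(node, 0.0) + t_comp + t_comm
    # nodes not used by mu contribute zero, so the maximum is over used nodes
    return max(busy.values())
```

A node with load 1.0 has no available power and would cause a division by zero here; a real implementation would need to exclude or penalize such nodes.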
The goal of the evolutionary algorithm is to search for the smallest fitness value among these maxima, i.e. to find the mapping which minimizes the time of use of the most heavily exploited grid resource.

Reliability. In this case the fitness function is given by the reliability of the proposed solution. This is evaluated as:

$$\Phi_2(\mu) = \prod_{i=1}^{P} \pi_{\mu_i} \cdot \lambda_w \qquad (2)$$

where $\mu_i$ is the node onto which the $i$–th subtask is mapped and $w$ is the site this node belongs to. Note that the first fitness function should be minimized, while the second should be maximized. We tackle this two–objective problem by designing and implementing a multiobjective DE algorithm based on the Pareto–front approach. It is very similar to the DE scheme described in Sect. 2.1, apart from the way the new trial individual $i'$ is compared to the current individual $i$. In this case $i'$ is chosen if and only if it is not worse than $i$ in terms of both the fitness functions, and is better than $i$ for at least one of them. By doing so, a set of “optimal” solutions, the so–called Pareto optimal set, emerges in that none of them can be considered to be better than any other in the same set with respect to all the single objective functions.
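The reliability fitness and the Pareto–based replacement rule can be sketched as follows. The names `phi2`, `dominates` and `site_of` are illustrative, and $\Phi_2$ is computed as the product over all subtasks of the processor and link reliabilities.

```python
import math

def phi2(mu, pi, lam, site_of):
    """Reliability of mapping mu: product of pi (node) times lambda (its site)."""
    return math.prod(pi[node] * lam[site_of[node]] for node in mu)

def dominates(trial, current):
    """True if trial should replace current.

    Both arguments are (phi1, phi2) pairs; phi1 is minimized, phi2 maximized.
    """
    t1, t2 = trial
    c1, c2 = current
    not_worse = t1 <= c1 and t2 >= c2
    strictly_better = t1 < c1 or t2 > c2
    return not_worse and strictly_better
```

For instance, `dominates((50.0, 0.40), (60.0, 0.40))` holds, whereas a trial that is faster but less reliable than the current individual does not replace it.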
3 Experiments and Findings
We assume a multisite grid architecture composed of $N = 116$ nodes divided into five sites (Fig. 1). Hereinafter the nodes are indicated by means of the external numbers, so that 54 in Fig. 1 is the sixth of the 16 nodes in site C. Without loss of generality we suppose that all the nodes belonging to the same site have the same power $\alpha$, expressed in millions of instructions per second (MIPS). For example, all the nodes belonging to C have $\alpha = 2000$. We have considered for each node three communication bandwidths: the bandwidth $\beta_{ii}$ available when subtasks are mapped on the same node (intranode communication), the bandwidth $\beta_{ij}$ between nodes $i$ and $j$ belonging to the same site (intrasite communication) and the bandwidth $\beta_{ij}$ when nodes $i$ and $j$ belong to different sites (intersite communication). For the sake of simplicity we presume that $\beta_{ij} = \beta_{ji}$ and that all the $\beta_{ii}$ have the same very high value (100 Gbit/s), so that the related communication time is negligible with respect to intrasite and intersite communications (Table 1).

Fig. 1. The grid architecture

Table 1. Intersite and intrasite bandwidths expressed in Mbit/s

       A      B      C      D      E
A     10
B      2    100
C      6      3   1000
D      5     10      7    800
E      2      5      6      1    100
Moreover we assume that the local average load $\ell_i(\Delta t) \cdot \alpha_i$ of the available grid nodes is known and, as concerns reliability, we suppose that $\lambda_w = 0.99$ $\forall w \in \{A, B, \ldots, E\}$, while within each site the nodes have different $\pi$ values (Table 2).

Table 2. Reliability for the nodes

Site   nodes      π
A      1–12       0.99
A      13–32      0.96
B      33–40      0.97
B      41–48      0.99
C      49–58      0.97
C      59–64      0.99
D      65–72      0.99
D      73–84      0.97
E      85–100     0.98
E      101–116    0.96
Since a generally accepted set of heterogeneous computing benchmarks does not exist, to evaluate the effectiveness of our DE approach we have conceived and explored four scenarios of increasing difficulty: one without communications among subtasks, one in which communications are added, another in which local node loads are also considered, and a final one in which the reliability of a link is decreased. To test the behavior of the mapper also as a function of $P$, two cases have been dealt with: one with $P = 20$ and another with $P = 35$. Given the grid architecture, $P = 20$ has been chosen to investigate the ability of the mapper when the number of subtasks is lower than the number of nodes of some of the sites, while $P = 35$ has been considered to evaluate it when this number is greater than the number of nodes in any site. For the sake of brevity, the mapping solutions attained for $P = 20$ are not reported here, but they vary appropriately as a function of the four different scenarios and are in line with our expectations. After a preliminary tuning phase, the DE parameters have been set as follows: $M = 100$, $g = 1000$, $CR = 0.8$, $F = 0.5$, $p_m = 0.2$. All the tests have been run on a 1.5 GHz Pentium 4. For each problem 20 DE runs have been performed. Each execution takes about 1 minute for $P = 20$ and 2 minutes for $P = 35$. Henceforth we shall denote by $\mu_{\Phi_1}$ and $\mu_{\Phi_2}$ the best solutions found in terms of lowest maximal resource utilization time and of highest reliability, respectively.

Experiment 1. This concerns an application of $P = 35$ subtasks with $\gamma_k = 90$ Giga Instructions (GI), $\psi_{km} = 0$ $\forall k, m \in [1, \ldots, P]$, $\ell_i(\Delta t) = 0.0$ for all the nodes and $\lambda_w = 0.99$ $\forall w \in \{A, B, \ldots, E\}$. Every execution of the DE mapper finds several solutions, all non–dominated in the final Pareto front. As expected, the best mappings provided by our tool are:

$\mu_{\Phi_1}$ = {72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71} with $\Phi_1 = 52.94$ s, $\Phi_2 = 0.32$, which uses the most powerful nodes in C and D, and

$\mu_{\Phi_2}$ = {66, 59, 60, 56, 59, 41, 61, 62, 69, 71, 65, 63, 62, 72, 67, 68, 53, 72, 44, 61, 62, 66, 71, 47, 74, 66, 63, 65, 63, 68, 71, 70, 43, 61, 65} with $\Phi_1 = 158.82$ s, $\Phi_2 = 0.465$, which picks the most reliable nodes among the sites B, C and D. Furthermore, the system proposes some other non–dominated solutions which are better balanced in terms of the two goals.

Experiment 2. This is like the first, but we have added a communication $\psi_{km} = 10$ Mbit $\forall k, m \in [1, \ldots, P]$.
The best solutions are:

$\mu_{\Phi_1}$ = {81, 82, 83, 84, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80} with $\Phi_1 = 74.60$ s, $\Phi_2 = 0.322$ and

$\mu_{\Phi_2}$ = {60, 66, 73, 61, 61, 66, 69, 59, 66, 65, 69, 72, 70, 48, 60, 76, 61, 42, 71, 67, 59, 41, 67, 65, 70, 65, 65, 62, 70, 68, 64, 64, 60, 69, 54} with $\Phi_1 = 293.14$ s, $\Phi_2 = 0.465$.

The mapping $\mu_{\Phi_1}$ uses only nodes in sites C and D, i.e., the most powerful ones. In fact, in this case the increase in execution time due to communications between sites is lower than that which would be incurred by using just one site and mapping two subtasks on the same node.

Experiment 3. This is like the second, but we have taken the node loads into account as well, namely $\ell_i(\Delta t) = 0.9$ for all the nodes of sites B and D, while for site C we assume $\ell_i(\Delta t) = 0.8$ for $i \in [49, \ldots, 52]$ and $\ell_i(\Delta t) = 0.6$ for $i \in [53, \ldots, 64]$. This time we have:

$\mu_{\Phi_1}$ = {93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 59, 60, 61, 85, 86, 87, 88, 89, 90, 91, 92} with $\Phi_1 = 165.85$ s, $\Phi_2 = 0.257$ and

$\mu_{\Phi_2}$ = {1, 6, 63, 8, 9, 2, 5, 1, 2, 6, 7, 5, 6, 7, 8, 9, 10, 24, 12, 1, 2, 3, 9, 57, 4, 5, 6, 7, 8, 9, 10, 11, 64, 65, 66} with $\Phi_1 = 860.01$ s, $\Phi_2 = 0.47$.

In this case $\mu_{\Phi_1}$ maps 32 subtasks on site E and 3 on the nodes 59, 60 and 61 of C. This might seem strange, since nodes 53–64 in C, being loaded at 60%, still provide an available computation power of 800 MIPS, so they are in any case more powerful than those in E. Nonetheless, since $P = 35$, those 12 nodes would not be sufficient and 23 less powerful ones would be needed. The most powerful nodes after those in C are those in E and, in terms of $\Phi_1$, these two solutions are equivalent since the slowest resources utilized are in both cases nodes in E, yielding a resource use due to computation of $(90000/700) = 128.57$ s. Moreover, the intrasite communications of E are faster than the intersite communications between C and E, hence the $\mu_{\Phi_1}$ proposed. Note that a number of subtasks on E greater than 32 would imply that more than one subtask should be placed per node, which would heavily increase the computation time. On the contrary, a number of subtasks on C greater than 3 and lower than or equal to 12 would leave the computation time unchanged but would increase the amount of intersite communication time. As regards $\mu_{\Phi_2}$, instead, the suboptimal solution chooses all the nodes with $\pi = 0.99$ apart from node 24 with $\pi = 0.96$ and node 57 with $\pi = 0.97$.

Experiment 4. The scenario is the same as in the previous case, but we have supposed that the reliability $\lambda_A = 0.97$.
The best solutions found are:

$\mu_{\Phi_1}$ = {108, 109, 110, 111, 112, 113, 114, 115, 116, 60, 61, 63, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107} with $\Phi_1 = 165.85$ s, $\Phi_2 = 0.257$ and

$\mu_{\Phi_2}$ = {65, 57, 58, 59, 79, 61, 68, 62, 65, 65, 66, 67, 63, 69, 70, 71, 72, 73, 60, 62, 58, 48, 60, 61, 71, 63, 64, 41, 66, 38, 68, 69, 70, 71, 72} with $\Phi_1 = 1653.55$ s, $\Phi_2 = 0.437$.

Here $\mu_{\Phi_1}$ makes heavy use of site E and of some nodes from C. This solution is slightly different from that in the previous experiment, but has the same fitness value. The solution $\mu_{\Phi_2}$ instead uses the most reliable nodes in sites B, C and D and, as it should, avoids nodes from A because the reliability $\lambda_A$ of the link connecting this site to the internet has dropped to 0.97. Moreover, the five non–optimal nodes (38, 57, 58, 73 and 79) in the solution are not in E because E, for equal reliability, has inferior performance due to its lower intrasite bandwidth.
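Some of the reported figures can be re-derived from the data given in this section. The short Python check below recomputes $\Phi_2$ for the mapping $\mu_{\Phi_2}$ of Experiment 1 from Table 2, and the available power and computation time quoted in Experiment 3. The last value additionally assumes that communication time is the exchanged volume divided by the bandwidth of Table 1, which is an assumption of this sketch rather than something stated in the text; variable names are illustrative.

```python
import math

# Table 2: inclusive node ranges and processor reliability pi
PI_RANGES = [(1, 12, 0.99), (13, 32, 0.96), (33, 40, 0.97), (41, 48, 0.99),
             (49, 58, 0.97), (59, 64, 0.99), (65, 72, 0.99), (73, 84, 0.97),
             (85, 100, 0.98), (101, 116, 0.96)]

def pi_of(node):
    """Reliability of a node according to Table 2."""
    return next(p for lo, hi, p in PI_RANGES if lo <= node <= hi)

# mu_Phi2 reported for Experiment 1; every site link has lambda = 0.99
mu_phi2_exp1 = [66, 59, 60, 56, 59, 41, 61, 62, 69, 71, 65, 63, 62, 72, 67,
                68, 53, 72, 44, 61, 62, 66, 71, 47, 74, 66, 63, 65, 63, 68,
                71, 70, 43, 61, 65]
reliability = math.prod(pi_of(n) * 0.99 for n in mu_phi2_exp1)

# Experiment 3: a node of C (alpha = 2000 MIPS) loaded at 60%
available_c = (1 - 0.6) * 2000          # MIPS left for grid subtasks
t_comp_e = 90000 / 700                  # seconds on a node of E (90 GI = 90000 MI)

# Busy time of a C node in mu_Phi1 of Experiment 3 under the assumed cost
# model: 2 partners in C (1000 Mbit/s), 32 partners in E (6 Mbit/s), 10 Mbit each
busy_c = 90000 / available_c + 2 * 10 / 1000 + 32 * 10 / 6

print(round(reliability, 3), round(available_c), round(t_comp_e, 2), round(busy_c, 2))
```

Under these assumptions the recomputed values agree with the reported $\Phi_2 = 0.465$, the 800 MIPS and 128.57 s of Experiment 3, and $\Phi_1 = 165.85$ s.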
Table 3 shows for each test (Exp. no.) and for both fitnesses the best final value $\Phi^b$, the average of the final values over the 20 runs $\langle\Phi\rangle$ and the variance $\sigma_\Phi$. The tests have shown a high degree of efficiency of the proposed model in terms of quality for both resource use and reliability. In fact, efficient solutions have been provided independently of the working conditions (heterogeneous nodes differing in number, type and load) and of the kind of application task (computation or communication bound). Moreover, often, especially for the problems with $P = 20$ subtasks, $\sigma_{\Phi_1} = 0$, which means that in all the 20 runs the same final solution, i.e. the globally best one, has been found.

Table 3. Findings for each experiment

                          P = 20                              P = 35
Exp. no.           1       2       3       4          1       2       3       4
$\Phi_1^b$       52.94   53.17  130.47  130.47      52.94   74.60  165.85  165.85
$\langle\Phi_1\rangle$  52.94   53.17  130.47  130.47      93.61   94.93  224.33  165.85
$\sigma_{\Phi_1}$    0       0       0       0      17.86   15.02   31.44     0
$\Phi_2^b$       0.668   0.668   0.668   0.668      0.465   0.465   0.470   0.437
$\langle\Phi_2\rangle$  0.668   0.668   0.660   0.652      0.443   0.448   0.436   0.419
$\sigma_{\Phi_2}$    0       0     0.008   0.008      0.009   0.008   0.013   0.011

4 Conclusions
Co–scheduling and mapping are open questions when parallel jobs are executed in time–shared grid systems jointly with the local load. Our efforts are directed towards maximizing parallel efficiency and minimizing the impact on local performance in such systems by using a mapping which takes into account the node characteristics and the estimated local workload. In particular, this paper deals with the grid multisite mapping problem by means of a multiobjective DE optimizing two contrasting goals: the minimization of the degree of use of the grid resources and the maximization of the reliability of the proposed mapping. The promising findings show that DE is a feasible approach to the important problem of grid resource allocation.
References

1. Foster, I., Kesselman, C.: The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco (1999)
2. Mateescu, G.: Quality of service on the grid via metascheduling with resource co-scheduling and co-reservation. International Journal of High Performance Computing Applications 17(3), 209–218 (2003)
3. Snir, M., Otto, S., Huss-Lederman, S., Walker, D., Dongarra, J.: MPI: The Complete Reference. The MPI Core, vol. 1. The MIT Press, Cambridge (1998)
4. Khokhar, A., Prasanna, V.K., Shaaban, M., Wang, C.L.: Heterogeneous computing: Challenges and opportunities. IEEE Computer 26(6), 18–27 (1993)
5. Siegel, H.J., Antonio, J.K., Metzger, R.C., Tan, M., Li, Y.A.: Heterogeneous computing. In: Zomaya, A.Y. (ed.) Parallel and Distributed Computing Handbook, pp. 725–761. McGraw–Hill, New York (1996)
6. Foster, I.: Globus toolkit version 4: Software for service–oriented systems. In: Jin, H., Reed, D., Jiang, W. (eds.) NPC 2005. LNCS, vol. 3779, pp. 2–13. Springer, Heidelberg (2005)
7. Fernandez-Baca, D.: Allocating modules to processors in a distributed system. IEEE Transactions on Software Engineering 15(11), 1427–1436 (1989)
8. Wang, L., Siegel, J.S., Roychowdhury, V.P., Maciejewski, A.A.: Task matching and scheduling in heterogeneous computing environments using a genetic–algorithm–based approach. Journal of Parallel and Distributed Computing 47, 8–22 (1997)
9. Braun, T.D., Siegel, H.J., Beck, N., Bölöni, L.L., Maheswaran, M., Reuther, A.I., Robertson, J.P., Theys, M.D., Yao, B.: A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems. Journal of Parallel and Distributed Computing 61, 810–837 (2001)
10. Kim, S., Weissman, J.B.: A genetic algorithm based approach for scheduling decomposable data grid applications. In: International Conference on Parallel Processing (ICPP 2004), Montreal, Quebec, Canada, pp. 406–413 (2004)
11. Song, S., Kwok, Y.K., Hwang, K.: Security–driven heuristics and a fast genetic algorithm for trusted grid job scheduling. In: IPDP 2005, Denver, Colorado (2005)
12. Price, K., Storn, R.: Differential evolution. Dr. Dobb's Journal 22(4), 18–24 (1997)
13. Fonseca, C.M., Fleming, P.J.: An overview of evolutionary algorithms in multiobjective optimization. Evolutionary Computation 3(1), 1–16 (1995)
14. Dong, F., Akl, S.G.: Scheduling algorithms for grid computing: State of the art and open problems. Technical Report 2006–504, School of Computing, Queen (2006)
15. Fitzgerald, S., Foster, I., Kesselman, C., von Laszewski, G., Smith, W., Tuecke, S.: A directory service for configuring high-performance distributed computations. In: Sixth Symp. on High Performance Distributed Computing, Portland, OR, USA, pp. 365–375. IEEE Computer Society, Los Alamitos (1997)
16. Czajkowski, K., Fitzgerald, S., Foster, I., Kesselman, C.: Grid information services for distributed resource sharing. In: Tenth Symp. on High Performance Distributed Computing, San Francisco, CA, USA, pp. 181–194. IEEE Computer Society, Los Alamitos (2001)
17. Wolski, R., Spring, N., Hayes, J.: The network weather service: a distributed resource performance forecasting service for metacomputing. Future Generation Computer Systems 15(5–6), 757–768 (1999)
18. Gong, L., Sun, X.H., Watson, E.: Performance modeling and prediction of non–dedicated network computing. IEEE Trans. on Computers 51(9), 1041–1055 (2002)
Scheduling with Precedence Constraints: Mixed Graph Coloring in Series-Parallel Graphs

Hanna Furmańczyk^1, Adrian Kosowski^2, and Paweł Żyliński^{1,2}

1 University of Gdańsk, Poland, Institute of Computer Science, {hanna,pz}@math.univ.gda.pl
2 Gdańsk University of Technology, Poland, Department of Algorithms and System Modeling, [email protected]

Abstract. We consider the mixed graph coloring problem, which is used for formulating scheduling problems where both incompatibility and precedence constraints can be present. We give an $O(n^{3.376} \log n)$ algorithm for finding an optimal schedule for a collection of jobs whose constraint relations form a mixed series-parallel graph.

1 Introduction
Scheduling problems containing incompatibility constraints are often modeled by undirected graphs: every vertex corresponds to a job and two vertices are adjacent if their corresponding jobs cannot be processed in the same period of time. A vertex coloring of the graph then gives a feasible schedule with respect to the constraints. However, general scheduling problems often involve additional requirements, so the classical coloring model is too limited. Consider an exam schedule in which written exams have to be taken before oral ones, in addition to the usual constraint that no student can take two or more exams during the same period. This is an example of a scheduling problem containing precedence constraints: several pairs of jobs have to be executed in a given order. To handle such problems, a more general coloring model capable of taking these requirements into account was introduced: mixed graph coloring [13,14,15].

Mixed graph coloring. A mixed graph $G_M = (V, E, A)$ is a graph with vertex set $V$ containing edges (set $E$) and arcs (set $A$). An edge joining vertices $i$ and $j$ is denoted by $\{i, j\}$, while an arc with tail $p$ and head $q$ is denoted by $(p, q)$. The concept of mixed graphs was introduced for the first time in [16].^1 A $k$-coloring of $G_M$ is a function $\varphi : V \to \{1, 2, \ldots, k\}$ such that $\varphi(i) \neq \varphi(j)$ for $\{i, j\} \in E$ and $\varphi(p) < \varphi(q)$ for $(p, q) \in A$. Observe that the mixed graph $G_M$ must be acyclic, i.e., must not contain any directed circuit, otherwise no proper $k$-coloring exists. The smallest $k$ for which $G_M$ has a $k$-coloring is called the chromatic number of the mixed graph $G_M$ and is denoted by $\gamma(G_M)$. Clearly, if the arc set of $G_M$ is empty, then $\gamma(G_M) = \chi(G_M)$, the (classical) chromatic number of $G_M$. Moreover, by definition, $\gamma(G_M) \geq l(G_M)$, where $l(G_M)$ is the length of the longest directed path in the subgraph $(V, \emptyset, A)$ plus one. Notice that whereas classical graph coloring may be thought of as the problem of solving a given set of inequalities of the form $\varphi_i \neq \varphi_j$ with variables $\{\varphi_1, \ldots, \varphi_n\}$ in the range of values $\{1, 2, \ldots, k\}$, in the mixed graph coloring model we attempt to solve a more general set of inequalities, in which we allow additional conditions of the form $\varphi_i < \varphi_j$. There are not many known theoretical results concerning mixed graph coloring; some additional work has been done in fields of application, which we shall discuss later on. In [6], an $O(n^2)$ algorithm to color mixed trees optimally and bounds on the mixed chromatic number for arbitrary mixed graphs are given. In the same paper, the mixed chromatic number of a mixed bipartite graph $G_M$ is shown to be bounded from above by $l(G_M) + 1$, and hence can take only one of two values; however, the problem of deciding whether the mixed chromatic number is $l(G_M)$ or $l(G_M) + 1$ is NP-complete even for mixed bipartite planar graphs of maximum degree 3, as shown by Ries in [10]. Ries also provided an optimal $O(n^{k+2})$ algorithm for partial $k$-trees [11]. In [11,16] variants of mixed graph coloring were considered in which an arc $(p, q)$ implies the weak relation $\varphi(p) \leq \varphi(q)$ between the colors of its vertices. Finally, in [14] some branch-and-bound exact and approximate algorithms are discussed and tested on random graphs.

Supported by grant No. N516 029 31/2941.
^1 In some sense, the mixed graph is the same as the disjunctive graph, which was proposed by Roy and Sussman in 1964; see [12] for more details.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1001–1008, 2008. © Springer-Verlag Berlin Heidelberg 2008

Scheduling with precedence constraints.
The mixed graph coloring model can be used for formulating scheduling problems where both incompatibility and precedence constraints are present. Formally, let $T$ be a collection of jobs (with unit processing times). These jobs have to be processed taking into account the following constraints:

1. Precedence constraints. There is a set of ordered pairs of jobs $(i, j)$ such that $i$ must be processed before $j$.
2. Disjunctive constraints. For a family $I = \{I_1, \ldots, I_l\}$ of subsets of $T$, no two jobs in $I_\alpha$ can be processed simultaneously, $\alpha = 1, \ldots, l$.

Consider now a mixed graph $G_M = (V, E, A)$ obtained as follows:

1. With each job $j$ in $T$ we associate a vertex $j$ in $V$. $G_M$ currently has no other vertices, and no arcs or edges.
2. For each ordered pair $(i, j)$ of jobs we introduce an arc $(i, j)$ in $G_M$.
3. For each subset $I_\alpha$, $\alpha = 1, \ldots, l$, we introduce a clique associated with the jobs in $I_\alpha$. (If an edge is needed between vertices $i$ and $j$, we introduce it only if there was no previous arc or edge joining $i$ and $j$.)

It is easy to see that there is a one-to-one correspondence between feasible schedules in $k$ time units and $k$-colorings of the mixed graph $G_M$. It is also worth pointing out that the unit-time job-shop problem can be considered via mixed graph coloring. In this case, $(V, \emptyset, A)$ is the union of disjoint paths and $(V, E, \emptyset)$ is the union of disjoint cliques (see [13,14] for more details).

Our results. We design an $O(n^{3.376} \log n)$ time algorithm for the class of series-parallel mixed graphs which constructs an optimal coloring. The idea of the algorithm is based upon parse trees [8].
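The construction of $G_M$ from the constraints, and the check that a schedule is a proper coloring, can be sketched in Python as follows. The function names and the tuple-based representation of arcs and edges are illustrative assumptions.

```python
def build_mixed_graph(precedences, incompatible_sets):
    """Return (edges, arcs) of the mixed graph for the given constraints.

    precedences: iterable of ordered pairs (i, j), i before j
    incompatible_sets: iterable of sets of jobs that cannot overlap
    """
    arcs = set(precedences)
    edges = set()
    for group in incompatible_sets:
        members = sorted(group)
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                i, j = members[a], members[b]
                # add an edge only if no arc already joins i and j
                if (i, j) not in arcs and (j, i) not in arcs:
                    edges.add((i, j))
    return edges, arcs

def is_mixed_coloring(phi, edges, arcs):
    """phi maps each job to its time slot (color)."""
    return (all(phi[i] != phi[j] for i, j in edges)
            and all(phi[p] < phi[q] for p, q in arcs))
```

For the exam example, with written exams `W1` and `W2` preceding an oral exam `O1` taken by the same student, the schedule `{"W1": 1, "W2": 2, "O1": 3}` is a proper 3-coloring, while placing `W1` and `W2` in the same slot is rejected.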
2 Series-Parallel Graphs

A graph is series-parallel if it is $K_4$-minor-free [17]. The class of series-parallel graphs can also be defined recursively [7] and is highly relevant in many scheduling problems; see [1,2,3,5,9], to mention but a few.

Definition 1.
(1) An edge $\{u, v\}$ is a series-parallel graph with terminals $u$ and $v$.
(2) If $G_1$ and $G_2$ are series-parallel graphs with terminals $u_1, v_1$ and $u_2, v_2$, respectively, then we may form a series-parallel graph by making either a series connection or a parallel connection. To make a series connection, we simply take the graph $G_1 \cup G_2$ and identify the vertices $v_1$ and $u_2$; to make a parallel connection, we take the graph $G_1 \cup G_2$, identify the vertices $u_1$ and $u_2$, and identify the vertices $v_1$ and $v_2$ (see Fig. 1). A parallel connection cannot be made if multiple edges result.
(3) Only those graphs which can be constructed by a finite number of applications of series or parallel connections are series-parallel graphs.

A series-parallel graph $G$ can be represented by its binary parse tree $T(G)$. The leaf nodes of the parse tree correspond to edges of $G$, and each internal node is labeled either with S, which corresponds to a series connection of its two children, or with P, which corresponds to a parallel connection of its children. The root of the parse tree corresponds to the graph $G$. Kikuno et al. [8] gave an algorithm which, for a given series-parallel graph, produces its parse tree in linear time. It is easy to see that both the idea of parse trees and the algorithm of Kikuno et al. can be adapted for mixed series-parallel graphs as well, as illustrated in Fig. 2: all we need is to store in a leaf node the direction of an arc. Namely, for an arc $l$, the label $\overrightarrow{l}$ of its leaf node in the parse tree denotes that $l = (u, v)$, and the label $\overleftarrow{l}$ denotes that $l = (v, u)$, where $u$ and $v$ are the terminals of the arc (ref. Definition 1); the absence of a direction in a leaf node of the parse tree denotes that the node corresponds to an edge.

Fig. 1. (a) A series connection ($v_1 = u_2 = w$). (b) A parallel connection ($u_1 = u_2 = u$, $v_1 = v_2 = v$).

Fig. 2. (a) A series-parallel mixed graph and (b) its parse tree (P: parallel connection, S: series connection).

3 Algorithm
In this section we propose an $O(n^{3.376} \log n)$ time algorithm for determining the chromatic number $\gamma$ of a mixed series-parallel graph. First, we give an algorithm which, for a given series-parallel mixed graph $G_M$ and an integer $k$, produces the answer ‘yes’ if $G_M$ can be colored with $k$ colors, and the answer ‘no’ otherwise. The procedure consists of creating a binary matrix $A_x$ of size $k \times k$ for each node $x$ in the parse tree of $G_M$, where $A_x[i, j] = 1$ if and only if terminal $u_x$ of the series-parallel subgraph corresponding to $x$ can be colored with color $i$ while terminal $v_x$ is colored with color $j$. To achieve this, we use the recursive structure of the graph $G_M$.

Case 1: the node $x$ corresponds to an edge $\{u, v\}$. Then
$$A_x[i, j] = \begin{cases} 1 & \text{if } i \neq j; \\ 0 & \text{otherwise.} \end{cases}$$

Case 2: the node $x$ corresponds to an arc $(u, v)$. Then
$$A_x[i, j] = \begin{cases} 1 & \text{if } i < j; \\ 0 & \text{otherwise.} \end{cases}$$

Case 3: the node $x$ corresponds to an arc $(v, u)$. Then
$$A_x[i, j] = \begin{cases} 1 & \text{if } i > j; \\ 0 & \text{otherwise.} \end{cases}$$

Case 4: the node $x$ corresponds to a subgraph $G_x$ formed by a parallel connection of graphs $G_y$ and $G_z$. Then
$$A_x[i, j] = \begin{cases} 1 & \text{if } A_y[i, j] = 1 \text{ and } A_z[i, j] = 1; \\ 0 & \text{otherwise.} \end{cases}$$

Notice that the correctness of the above cases follows directly from the definition of the parse tree and the series connection. We are left with the last case.

Case 5: the node $x$ corresponds to a subgraph $G_x$ formed by a series connection of $G_y$ and $G_z$. Then we put
$$A_x[i, j] = \begin{cases} 1 & \text{if } (A_y \cdot A_z)[i, j] > 0; \\ 0 & \text{otherwise.} \end{cases}$$

The correctness of Case 5 follows from the following lemma.

Lemma 1. Let $G_x$ be a graph formed by a series connection of graphs $G_y$ and $G_z$, and assume that we have already computed the matrices $A_y$ and $A_z$. Then terminals $u_x$ and $v_x$ can be colored with colors $i$ and $j$, respectively, if and only if the product of the $i$-th row of $A_y$ and the $j$-th column of $A_z$ is greater than 0, that is,
$$\begin{bmatrix} A_y[i, 1] & \ldots & A_y[i, k] \end{bmatrix} \times \begin{bmatrix} A_z[1, j] \\ \vdots \\ A_z[k, j] \end{bmatrix} > 0. \qquad (1)$$

Proof. Let $u_y$ and $v_y$ be the terminals of graph $G_y$, and let $u_z$ and $v_z$ be the terminals of graph $G_z$. By the definition of the series connection, in $G_x$ we have either $u_y = u_x$, $v_y = u_z$, and $v_z = v_x$, or $u_z = u_x$, $v_z = u_y$, and $v_y = v_x$. Without loss of generality, assume the first case holds (the latter case can be handled with similar arguments). Then terminals $u_x$ and $v_x$ can be colored with colors $i$ and $j$, respectively, if and only if (a) terminal $u_y$ can be assigned color $i$ in $G_y$ and terminal $v_y$ can be assigned a color $t$ such that (b) terminal $u_z$ can be assigned color $t$ in $G_z$ and terminal $v_z$ can be assigned color $j$. In other words, bearing in mind the structure of the matrices $A_y$ and $A_z$, (a)-(b) imply that there must exist $t \in \{1, \ldots, k\}$ such that $A_y[i, t] \cdot A_z[t, j] = 1$, which is equivalent to inequality (1). □
Consequently, we are ready to formulate our algorithm (Algorithm 1). When computing matrix $A_x$ for a node $x$, most time will be spent on matrix multiplication. Thus this step requires at most $O(k^{2.376})$ time [4]. Since there are roughly the same number of internal vertices of the parse tree as leaves of the parse tree [8], and the number of leaves of the parse tree is equal to the number of edges of the series-parallel graph, and $m < 3n$, traversing the parse tree requires linear time. Therefore Algorithm 1 has time complexity $O(nk^{2.376})$, where $n$ is the number of vertices of a series-parallel mixed graph; recall that a parse tree can be computed in $O(n)$ time [17].

Algorithm 1
Input: A series-parallel mixed graph $G_M$, an integer $k$.
Output: An answer whether the graph $G_M$ can be colored with $k$ colors.
1. Construct the parse tree $T(G_M)$ for a series-parallel mixed graph $G_M$.
2. Traverse tree $T(G_M)$ in postorder. Fix matrix $A_x$ for each node $x$, as described earlier.
3. Let $r$ be the root of tree $T(G_M)$. If there exist $i, j$ such that $A_r[i,j] = 1$ then the answer is 'yes', otherwise the answer is 'no'.
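Steps 2 and 3 of Algorithm 1 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the nested-tuple encoding of the parse tree and the names `leaf_matrix`, `node_matrix` and `is_k_colorable` are hypothetical, colors are 0-based, the matrix of an edge leaf (Case 1) is assumed to have ones exactly off the diagonal, a series node is assumed to join the second terminal of its left child to the first terminal of its right child, and the Boolean product uses the schoolbook $O(k^3)$ rule instead of fast matrix multiplication [4].

```python
def leaf_matrix(kind, k):
    """0/1 matrix of a leaf; entry [i][j] refers to 0-based colors i, j."""
    if kind == "edge":        # Case 1 (assumed): terminals get different colors
        ok = lambda i, j: i != j
    elif kind == "arc_uv":    # Case 2: arc (u, v), so color(u) < color(v)
        ok = lambda i, j: i < j
    elif kind == "arc_vu":    # Case 3: arc (v, u), so color(u) > color(v)
        ok = lambda i, j: i > j
    else:
        raise ValueError(kind)
    return [[int(ok(i, j)) for j in range(k)] for i in range(k)]


def node_matrix(node, k):
    """Post-order evaluation of A_x for a parse-tree node.

    A node is either a leaf kind (str) or a tuple (op, left, right)
    with op == "P" (parallel) or op == "S" (series).
    """
    if isinstance(node, str):
        return leaf_matrix(node, k)
    op, left, right = node
    Ay, Az = node_matrix(left, k), node_matrix(right, k)
    if op == "P":             # Case 4: entrywise AND
        return [[Ay[i][j] & Az[i][j] for j in range(k)] for i in range(k)]
    if op == "S":             # Case 5: Boolean matrix product
        return [[int(any(Ay[i][t] and Az[t][j] for t in range(k)))
                 for j in range(k)] for i in range(k)]
    raise ValueError(op)


def is_k_colorable(tree, k):
    """Step 3: answer 'yes' iff the root matrix has a nonzero entry."""
    return any(any(row) for row in node_matrix(tree, k))
```

For instance, the tree `("S", ("P", ("S", "arc_uv", "edge"), "arc_vu"), "edge")`, which is consistent with the matrices of the example in Section 4, is reported as 3-colorable but not 2-colorable.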
Theorem 1. Given a series-parallel mixed graph $G_M$ and an integer $k$, the question whether $G_M$ is $k$-colorable can be answered in $O(nk^{2.376})$ time. □

The discussed approach answers only the question of the existence of a proper $k$-coloring. Thus, for a given series-parallel mixed graph $G_M$, if the answer is 'yes', it would be natural to construct such a coloring $\varphi$. This can easily be done in $O(nk)$ time by traversing the tree from its root to its leaves, in preorder, as follows. Let $i, j$ be colors such that $A_r[i,j] = 1$. Then we fix $\varphi(u_r) = i$ and $\varphi(v_r) = j$. The next colors are recursively fixed as follows. Let $x$ be a node corresponding to a subgraph $G_x$ with terminals $u_x, v_x$ which have received colors $i$ and $j$, respectively.
(1) If $G_x$ is a subgraph formed by a parallel connection of $G_y$ and $G_z$, then $u_y = u_z = u_x$ and $v_y = v_z = v_x$. Thus the colors for terminals of $G_y$ and $G_z$ are fixed.
(2) If $G_x$ is a subgraph formed by a series connection of $G_y$ and $G_z$, then $u_y = u_x$ and $v_z = v_x$. So we have $\varphi(u_y) = i$ and $\varphi(v_z) = j$, and we have only to fix a color for $v_y = u_z$. As $A_x[i,j] = 1$, by the definition of matrix $A_x$, the inequality (1) is satisfied, that is, there exists $t \in \{1, \dots, k\}$ such that $A_y[i,t] \cdot A_z[t,j] = 1$ (cf. Lemma 1). Such $t$ can be found in $O(k)$ time, and consequently, we set $\varphi(v_y) = \varphi(u_z) = t$.

Theorem 2. An optimal coloring of a given series-parallel mixed graph $G_M$ can be found in $O(n^{3.376} \log n)$ time.
[Figure 3: panels (a), (b) and (c); in the parse tree, P denotes a parallel connection and S a series connection.]
Fig. 3. (a) A series-parallel mixed graph $G_M$ with terminals $u$, $v$. (b) The parse tree of $G$, and the computed arrays. (c) A proper 3-coloring of graph $G$.
Proof. Since the set of colors used in any optimal mixed coloring is the discrete interval $\{1, 2, 3, \dots, \gamma(G_M)\}$, clearly $\gamma(G_M) \le n$. Consequently, by using the binary search technique, at most $O(\log n)$ calls of Algorithm 1 will result in an exact value of $\gamma(G_M)$. Then, by the above discussed remarks, an optimum coloring of graph $G_M$ can be constructed in $O(n^2)$ time, thus resulting in total $O(n^{3.376} \log n)$ time complexity. □
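The binary search used in the proof can be sketched as follows. This is a minimal illustration: `is_colorable` is a hypothetical callback standing in for one call of Algorithm 1, and the sketch assumes that the graph admits a proper coloring with at most $n$ colors and that colorability is monotone in $k$ (a coloring with $k$ colors is also a coloring with $k + 1$ colors).

```python
def smallest_number_of_colors(n, is_colorable):
    """Return the least k in 1..n with is_colorable(k) true.

    Assumes is_colorable(n) is true and that is_colorable is monotone,
    so about log2(n) calls of the decision procedure are made.
    """
    lo, hi = 1, n
    while lo < hi:
        mid = (lo + hi) // 2
        if is_colorable(mid):
            hi = mid          # mid colors suffice, try fewer
        else:
            lo = mid + 1      # mid colors are not enough
    return lo
```

With $k \le n$ in every call, each call costs $O(n \cdot n^{2.376})$, which gives the bound stated in Theorem 2.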
4 Example
Figure 3 illustrates an exemplary execution of Algorithm 1 for $k = 3$. As $A_{root}[2,2] = 1$ (or $A_{root}[2,3] = 1$), the positive answer is returned, and a proper 3-coloring $\varphi$ is constructed as follows:
− $A_{root}[2,2] = 1$, thus we set $\varphi(u) = 2$ and $\varphi(v) = 2$;
− series connection with $A_x[2,2] = 1$:
$$A_x = A_y \cdot A_z = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 0 & 0 \end{bmatrix}, \quad A_y = \begin{bmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad A_z = \begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix},$$
thus we have $t = 1$, and we set $\varphi(w) = 1$;
− parallel connection with $A_x[2,1] = 1$:
$$A_y = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad A_z = \begin{bmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 1 & 0 \end{bmatrix};$$
− series connection with $A_x[2,1] = 1$:
$$A_x = A_y \cdot A_z = \begin{bmatrix} 2 & 1 & 1 \\ 1 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad A_y = \begin{bmatrix} 0 & 1 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}, \quad A_z = \begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix},$$
thus we have $t = 3$, and at last we set $\varphi(s) = 3$ (see Fig. 3(c)).
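The choice of the middle color $t$ in the two series steps above can be reproduced with a short Python sketch of the $O(k)$ search used when constructing the coloring. The function name `find_middle_color` and the list-of-lists matrix representation are assumptions made for this illustration; colors are 1-based, as in the text.

```python
def find_middle_color(Ay, Az, i, j):
    """Return a 1-based color t with Ay[i, t] * Az[t, j] = 1, or None.

    Ay and Az are k x k 0/1 matrices given as lists of rows; i and j
    are the 1-based colors already fixed for the outer terminals.
    """
    k = len(Ay)
    for t in range(1, k + 1):
        if Ay[i - 1][t - 1] and Az[t - 1][j - 1]:
            return t
    return None


EDGE = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]

# Root series connection of the example, with A_x[2, 2] = 1.
t_root = find_middle_color([[0, 0, 0], [1, 0, 0], [0, 0, 0]], EDGE, 2, 2)

# Lower series connection of the example, with A_x[2, 1] = 1.
t_low = find_middle_color([[0, 1, 1], [0, 0, 1], [0, 0, 0]], EDGE, 2, 1)
```

Here `t_root` and `t_low` evaluate to 1 and 3, matching $\varphi(w) = 1$ and $\varphi(s) = 3$.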
References
1. Abdel-Wahab, H.M., Kameda, T.: Scheduling to minimize maximum cumulative cost subject to series-parallel precedence constraints. Operations Research 26(1), 141–158 (1978)
2. Bender, M.A., Rabin, M.O.: Online scheduling of parallel programs on heterogeneous systems with applications to Cilk. Theory of Computing Systems 35(3), 289–304 (2002)
3. Burns, R.N., Steiner, G.: Single machine scheduling with series-parallel precedence constraints. Operations Research 29(6), 1195–1207 (1981)
4. Coppersmith, D., Winograd, S.: Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation 9, 251–280 (1990)
5. Finta, L., Liu, Z., Milis, I., Bampis, E.: Scheduling UET-UCT series-parallel graphs on two processors. Theoretical Computer Science 162(2), 323–340 (1996)
6. Hansen, P., Kuplinsky, J., de Werra, D.: Mixed graph colorings. Mathematical Methods of Operations Research 45, 145–160 (1997)
7. Hedetniemi, S.T., Laskar, R., Pfaff, J.: Linear algorithms for independent domination in series-parallel graphs. Congressus Numerantium 45, 71–82 (1984)
8. Kikuno, T., Yoshida, N., Kakuda, Y.: A linear algorithm for the domination number of a series-parallel graph. Discrete Applied Mathematics 5, 299–311 (1983)
9. Möhring, R.H., Schäffter, M.W.: Scheduling series-parallel orders subject to 0/1-communication delays. Parallel Computing 25(1), 23–40 (1999)
10. Ries, B.: Coloring some classes of mixed graphs. Discrete Applied Mathematics 155, 1–6 (2007)
11. Ries, B., de Werra, D.: On two coloring problems in mixed graphs. Electronic Journal of Combinatorics (to appear)
12. Roy, B., Sussmann, B.: Les problèmes d'ordonnancement avec contraintes disjonctives, Note DS No. 9 bis, SEMA (1964)
13. Sotskov, Y.N.: Scheduling via mixed graph coloring. In: Operations Research Proceedings, pp. 414–418. Otto-von-Guericke University (2000)
14. Sotskov, Y.N., Dolgui, A., Werner, F.: Mixed graph coloring for unit-time job-shop scheduling. International Journal of Mathematical Algorithms 2(4), 289–323 (2001)
15. Sotskov, Y.N., Tanaev, V.S., Werner, F.: Scheduling problems and mixed graph colorings. Optimization 51(3), 597–624 (2002)
16. Sotskov, Y.N., Tanaev, V.S.: Chromatic polynomial of a mixed graph. Vestsi Akademii Navuk BSSR, Seriya Fiz.-Mat. Navuk 6, 597–624 (1976)
17. Valdes, J., Tarjan, R.E., Lawler, E.L.: The recognition of series parallel digraphs. SIAM Journal on Computing 11, 189–201 (1979)
A New Model of Multi-installment Divisible Loads Processing in Systems with Limited Memory
Maciej Drozdowski¹ and Marcin Lawenda²
¹ Institute of Computing Science, Poznań University of Technology, Piotrowo 2, 60-965 Poznań, Poland [email protected]
² Poznań Supercomputing and Networking Center, Noskowskiego 10, 61-704 Poznań, Poland [email protected]
Abstract. In this paper we study multi-installment divisible load processing in a heterogeneous distributed system with limited memory. The divisible load model applies to computations which can be arbitrarily divided into parts and performed independently in parallel. The initial waiting for the load may be shortened by sending many small chunks of load instead of one huge chunk. The load chunk sizes must be adjusted to the speeds of communication and computation, and to the memory sizes, such that the whole processing time is as short as possible. We propose a new realistic model of memory management, and formulate it as a mixed quadratic programming problem which is solved by a branch and bound algorithm. Since this problem is computationally hard, we propose heuristics and analyze their performance in a series of computational experiments.
Keywords: Scheduling, divisible loads, multiple installments, memory limitations.
1 Introduction
Many computational problems can be divided into pieces of arbitrary sizes which can be processed independently in parallel. This is the foundation of divisible load theory (DLT), a simple but realistic model of distributed computations. DLT originated in [3], where the authors studied a chain of intelligent sensors and the problem of reaching a balance between local and distributed processing. Local computing may last too long, while distributing the computations incurs communication delays. The problem consists in adjusting part sizes to the speeds of communication and computation such that the whole process is finished in the shortest time. Thanks to the above assumptions a generic methodology has been proposed which reduces the problem to solving a set of linear equations. DLT has been generalized to include various interconnection topologies, communication algorithms, computation costs, as well as applications in databases, measurement, video processing, parameter sweep calculations, and search for combinatorial objects. Surveys of DLT may be found in [1,4,9].

Distributing the load causes inevitable communication delays. To shorten them, the load may be sent to processors in many small pieces rather than in one long chunk. The computations then start earlier, and hence also finish earlier. This way of multi-installment or multi-round divisible load processing was proposed in [2]. Computer systems have limited memory, which imposes constraints on the sizes of the assigned load chunks. For the single-installment case the limited memory buffers were studied first in [7], where a fast heuristic was proposed. It was shown in [6] that this problem is NP-hard if a fixed startup time is required for the initiation of communications. A set of heuristics was proposed and evaluated in [6]. The problem of divisible load processing with both multiple installments and memory constraints has already been studied in [5]. In [2] a fixed iterative sequence of communications to the processors was assumed. In [5] an arbitrary communication sequence in a heterogeneous system was allowed; however, the memory management model oversimplified reality. We discuss it in more detail in the next section.

In this paper we study multi-installment divisible load processing in a heterogeneous system, with communication startup time, memory limits, and an arbitrary communication sequence, and we propose a new, more realistic model of memory usage. The rest of the paper is organized as follows. In the next section we formulate our problem and discuss memory management issues. Section 3 describes algorithms for solving our problem. In Section 4 we present results of computational experiments. The paper is concluded by Section 5.

Partially supported by Polish Ministry of Science and Higher Education.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1009–1018, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Problem Formulation
By a processor we mean here a processing element comprising a CPU, some memory, and a hardware network interface. The network interface can communicate while the CPU is computing, so communication and computation can proceed simultaneously. We analyze a star interconnection (also called a single level tree). In the center of the star (root of the tree) an originator processor $P_0$ is located. $P_0$ has some amount $V$ of load to process. The originator only distributes the load to worker processors $P_1, \dots, P_m$. Each worker processor $P_i$ is characterized by the parameters: $A_i$ - computing rate (inverse of speed), $B_i$ - size of available memory, $C_i$ - communication rate (inverse of bandwidth), $S_i$ - communication startup time. The words installment, chunk, message, and communication will be used interchangeably. The load is scattered in a sequence of communications. The communication sequence may be arbitrary: some processors may receive load more often than other processors, and some processors may be too inefficient and may be excluded from computations. The problem consists in finding a communication sequence, i.e. the schedule of communications from the originator to the worker processors, and the sizes of the load parts transferred in the consecutive messages. Let $n$ denote the number of communications in the sequence. Let $\sigma$ be some given communication sequence, and $\sigma(i)$ the number of the processor receiving the $i$-th message. We will denote by $\alpha_i$ the size of the $i$-th load chunk. We assume that the time of returning the results is negligible compared to the load distribution and computation times. There may be idle times between the communications when processor buffers are full, and no new load may be uploaded to any processor. There are no unnecessary idle times in the computations. Once a load chunk is fully received, it is processed immediately after the preceding load chunks (if any) are finished. The sequence of processing the chunks on a given processor is the same as the sequence in which they were received.

[Figure 1: three panels, each showing communications and memory usage (bounded by $B_i$) against time $t$.]
Fig. 1. Memory management: a) each chunk uses whole buffer, b) memory gradually released, c) block memory releases
Let us consider forms of memory management. We assume that memory is allocated from the operating system pool at the beginning of the communication. It is possible that only one chunk is held by a processor at a time. Hence, the chunk uses all the available memory (cf. Fig. 1a), and a constraint $\alpha_i \le B_{\sigma(i)}$ is imposed. This approach was used in [6,7]. Yet, it is insufficient in multi-installment processing, when many messages may arrive at the processor, the load can be gradually uploaded to the processor, or new and old buffers can be swapped "on the fly" without stopping the computations. In [5] it was assumed that the already received and still unprocessed messages together may not exceed memory size $B_i$. However, it was also assumed in [5] that memory is released to the operating system with very fine granularity equal to the load unit. Consequently, memory occupation was decreasing linearly during computations (cf. Fig. 1b). Thanks to such a simplification load chunk sizes could be calculated using linear programming for a given communication sequence $\sigma$. Still, this way of releasing memory seems rather unusual.

Here we assume that memory allocation and release have block nature (cf. Fig. 1c). When a message of size $\alpha_i$ is about to arrive at a processor, a block of $\alpha_i$ load units is requested from the operating system. This block exists in the memory pool of the application until the computation on chunk $i$ finishes. On completion of chunk $i$ a block of size $\alpha_i$ is released to the operating system. The total size of coexisting memory blocks cannot exceed the limit $B_i$. In other words, for each moment $t$, $\sum_{j \in H(i,t)} \alpha_j \le B_i$, where $H(i, t)$ is the set of chunks received by $P_i$ and not completed by time $t$. We are going to formulate our problem as a mixed quadratic program for some given communication sequence $\sigma$.
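The block memory rule can be illustrated by a small Python sketch that computes the peak of $\sum_{j \in H(i,t)} \alpha_j$ on one processor. The function name `peak_block_memory` and the triple representation of a chunk (allocation time, release time, size) are assumptions made for this illustration. A block released at the same instant at which another block is allocated is assumed not to coexist with it.

```python
def peak_block_memory(chunks):
    """Peak memory held on one processor under block allocation.

    chunks: iterable of (alloc_time, release_time, size), where a block
    is allocated when its communication starts and released when the
    computation on the chunk completes.
    """
    events = []
    for alloc, release, size in chunks:
        events.append((alloc, 1, size))      # allocation event
        events.append((release, 0, -size))   # release sorts first on ties
    events.sort()
    used = peak = 0
    for _, _, delta in events:
        used += delta
        peak = max(peak, used)
    return peak


# Three chunks on one processor; the memory limit B_i is respected
# if and only if the peak does not exceed B_i.
peak = peak_block_memory([(0, 5, 3), (2, 7, 4), (5, 9, 2)])
fits = peak <= 7
```

In this example the first two blocks coexist (3 + 4 units), while the third is allocated exactly when the first is released, so the peak is 7 units.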
Note that chunks can be numbered as they leave the originator (this will be called global numbering), or can be counted as they arrive at a certain processor (this will be called local numbering). Observe
that for a given sequence $\sigma$ we still do not know which chunks will overlap on a processor. In other words, given $\sigma$, it is not yet determined which chunks arrive before completion of which other chunks. This information is necessary to formulate constraints on the sizes of chunks which reside together on a processor. Since the chunks are computed in the sequence in which they were received, we can express chunk overlap by variables $x_{ijk} \in \{0, 1\}$. $x_{ijk} = 1$ means that on processor $i$ chunk $j$ is not fully processed at the time when the $k$-th communication to processor $i$ starts, where $k > j$. $x_{ijk} = 0$ denotes that message $j$ is finished before sending of message $k$ to processor $i$ starts. We will assume that the numbers of communications $j, k$ are local on processor $i$. In the following mathematical programming formulation we will use the following notation:
$\alpha_j$ - size of the $j$-th (global number) load chunk sent from the originator (variable);
$x_{ijk}$ - binary variable;
$n_i$ - number of chunks sent to processor $i$ (constant for a given $\sigma$);
$\rho(i, k)$ - global number of the $k$-th chunk received by processor $i$, i.e. a mapping from local chunk numbers on processor $i$ to the global numbers (constant for $\sigma$);
$M$ - big constant, greater than the schedule length, for example, $M \ge V(\max_{j=1}^{m} C_j + \max_{j=1}^{m} A_j) + n \max_{j=1}^{m} S_j$;
$T_{max}$ - schedule length (variable);
$t_i$ - time when sending of message $i$ (global number) starts (variable);
$f_{ik}$ - time when processing of message $k$ (local number) on processor $i$ finishes.

Our problem can be formulated in the following way: minimize $T_{max}$ subject to
$$t_1 = 0 \tag{1}$$
$$t_i \ge t_{i-1} + S_{\sigma(i-1)} + C_{\sigma(i-1)} \alpha_{i-1}, \quad i = 2, \dots, n \tag{2}$$
$$f_{ik} \ge t_{\rho(i,k)} + S_i + C_i \alpha_{\rho(i,k)} + A_i \alpha_{\rho(i,k)}, \quad i = 1, \dots, m,\ k = 1, \dots, n_i \tag{3}$$
$$f_{ik} \ge f_{i,k-1} + A_i \alpha_{\rho(i,k)}, \quad i = 1, \dots, m,\ k = 2, \dots, n_i \tag{4}$$
$$f_{ij} \ge t_{\rho(i,k)} - (1 - x_{ijk}) M, \quad i = 1, \dots, m,\ j = 1, \dots, n_i - 1,\ k = j + 1, \dots, n_i \tag{5}$$
$$f_{ij} \le t_{\rho(i,k)} + x_{ijk} M, \quad i = 1, \dots, m,\ j = 1, \dots, n_i - 1,\ k = j + 1, \dots, n_i \tag{6}$$
$$x_{ijk} \le x_{ilk}, \quad i = 1, \dots, m,\ j = 1, \dots, n_i - 1,\ k = j + 2, \dots, n_i,\ l = j + 1, \dots, k - 1 \tag{7}$$
$$x_{ijk} \ge x_{ijl}, \quad i = 1, \dots, m,\ j = 1, \dots, n_i - 1,\ k = j + 1, \dots, n_i,\ l = k + 1, \dots, n_i \tag{8}$$
$$\alpha_{\rho(i,j)} + \sum_{k=j+1}^{n_i} x_{ijk} \alpha_{\rho(i,k)} \le B_i, \quad i = 1, \dots, m,\ j = 1, \dots, n_i \tag{9}$$
$$V = \sum_{i=1}^{n} \alpha_i \tag{10}$$
$$T_{max} \ge f_{i n_i}, \quad i = 1, \dots, m \tag{11}$$
In the above formulation constraints (1), (2) determine the time when communication $i$ starts. Note that idle times between consecutive communications are allowed. Constraints (3), (4) determine the earliest time moment $f_{ik}$ when computation on chunk $k$ of processor $i$ finishes. According to the value of $x_{ijk}$, inequalities (5), (6) guarantee that processing of chunk $j$ is, or is not, finished before message $k$ starts. Only one of inequalities (5), (6) is active for given $i, j, k$. If $x_{ijk} = 0$ then chunk $j$ should be finished before the $k$-th communication to processor $i$ starts. In this case inequality (6) is active (and (5) is inactive), ensuring the requirement. If $x_{ijk} = 1$, then chunk $j$ is still unfinished when message $k$ is initiated. In this case inequality (5) is active ((6) is inactive), ensuring the overlap of chunks $j, k$. Inequalities (7) guarantee that if chunk $j$ is not processed when chunk $k$ arrives, then the chunks between $j$ and $k$ are also unprocessed. Inequalities (8) ensure that if chunk $j$ is finished before the arrival of some chunk $k$, then $j$ can no longer become unprocessed again. By inequalities (9) memory limits are observed. All the load is processed by (10). The schedule length is not shorter than the completion time on any processor by constraints (11). Formulation (1)-(11) is a mixed quadratic programming problem for a given activation sequence $\sigma$ because we have both binary variables ($x_{ijk}$) and continuous variables ($\alpha_i$, $f_{ik}$, $t_i$, $T_{max}$). Since variables are multiplied in constraint (9), it is a quadratic programming problem. However, for given $x_{ijk}$ it becomes a linear program (LP). Hence, our problem can be solved by a tandem of methods. The first solves the combinatorial part of the problem by selecting the activation sequence $\sigma$ and the chunk overlap $x_{ijk}$. The second solves the LP for the given $\sigma$ and $x_{ijk}$. We propose such methods in the next section.
In summary, let us observe that the possibility of chunk overlap makes the problem harder than earlier formulations of scheduling divisible loads with memory limits.
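To make the roles of constraints (1)-(4), (9) and (11) concrete, the following Python sketch evaluates a given sequence $\sigma$ with given chunk sizes $\alpha_i$. It is not the LP of the formulation: it assumes that every communication starts as early as possible (equality in (2)) and that every chunk finishes as early as (3), (4) allow, derives the overlap variables from the resulting times, and only reports whether (9) holds, whereas the LP may insert idle time or resize chunks to satisfy the memory limits. The name `evaluate_schedule`, the 0-based processor and chunk numbering, and the tolerance `1e-9` are assumptions made for this illustration.

```python
def evaluate_schedule(sigma, alpha, A, B, C, S):
    """Earliest-start evaluation of a communication sequence.

    sigma[g] is the (0-based) processor receiving global chunk g of
    size alpha[g]; A, B, C, S are the per-processor parameters.
    Returns (t, f, T_max, x, memory_ok), where x[(p, a, b)] is the
    overlap of local chunks a < b on processor p.
    """
    n = len(sigma)
    t = [0.0] * n                               # (1): t_1 = 0
    for g in range(1, n):                       # (2) taken with equality
        p = sigma[g - 1]
        t[g] = t[g - 1] + S[p] + C[p] * alpha[g - 1]

    rho = {}                                    # local -> global numbering
    for g, p in enumerate(sigma):
        rho.setdefault(p, []).append(g)

    f = [0.0] * n
    x = {}
    memory_ok = True
    for p, chunks in rho.items():
        prev_finish = 0.0
        for g in chunks:                        # (3) and (4), earliest finish
            received = t[g] + S[p] + C[p] * alpha[g]
            f[g] = max(received, prev_finish) + A[p] * alpha[g]
            prev_finish = f[g]
        for a, j in enumerate(chunks):          # overlaps and constraint (9)
            held = alpha[j]
            for b in range(a + 1, len(chunks)):
                k = chunks[b]
                x[(p, a, b)] = 1 if f[j] > t[k] else 0
                held += x[(p, a, b)] * alpha[k]
            if held > B[p] + 1e-9:
                memory_ok = False
    return t, f, max(f), x, memory_ok           # (11): T_max = max finish
```

For example, with two identical processors, `evaluate_schedule([0, 1, 0], [2, 2, 2], [1, 1], [3, 3], [1, 1], [0, 0])` yields a schedule of length 8 without overlap, while doubling $A$ on the first processor makes its two chunks overlap and violates its memory limit of 3 units.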
3 Algorithms
3.1 Branch and Bound Algorithm
A branch and bound algorithm (BB) is a standard technique used to solve combinatorial problems. In a BB algorithm a branching rule divides the set of possible solutions until single solutions are distinguished. The pruning (or bounding) rule eliminates sets of solutions which are certainly not better than some already known solution, or are infeasible. In our problem one has to determine a sequence of communications and the chunk overlapping. Then, the distribution of the load can be obtained from formulation (1)-(11) simplified to a linear program. Communication sequences were built by appending a new processor to an already constructed leading sequence. For example, sequence $\sigma = (P_a, \dots, P_z)$ represents all the solutions with the leading communication sequence $\sigma$. This set is divided by appending a communication to any processor from set $\{P_1, \dots, P_m\}$. Thus, the set of solutions represented by $\sigma$ is partitioned into subsets of solutions represented by leading sequences $(P_a, \dots, P_z, P_1), \dots, (P_a, \dots, P_z, P_m)$. For each admitted communication sequence all possible overlaps were enumerated.

Enumeration of possible solutions was pruned by two methods. For a given sequence $\sigma$ a lower bound $LB(\sigma)$ on the schedule length was calculated as follows. The startup times in $\sigma$ were summed up: $\tau_1 = \sum_{i=1}^{n} S_{\sigma(i)}$. If the communication consisted of startup times only, the maximum load $V_1$ that could be processed in interval $\tau_1$ would be
$$V_1 = \sum_{i \in \sigma} \Bigl(\tau_1 - \sum_{j=1}^{g(i)} S_{\sigma(j)}\Bigr) / A_i,$$
where $g(i)$ is the index of the first appearance of processor $P_i$ in $\sigma$. The load must be sent from the originator in time at least $\tau_2 = V \min_{i=1}^{m} \{C_i\}$. In parallel with this communication, at most $V_2 = \tau_2 \sum_{i=1}^{m} \frac{1}{A_i}$ units of load could be processed. If $V_3 = V - V_1 - V_2 > 0$, then this remaining load will be processed in time at least
$$\tau_3 = \frac{V_3}{\sum_{i=1}^{m} \frac{1}{A_i}}.$$
The lower bound is equal to $LB(\sigma) = \tau_1 + \tau_2 + \max\{0, \tau_3\}$. Let $T$ be the length of some already known solution. If $T \le LB(\sigma)$ then successors of $\sigma$ were discarded. Another mechanism used in sequence elimination was based on the maximum memory $MEM(\sigma) = \sum_{i=1}^{n} B_{\sigma(i)}$ which could possibly become available in $\sigma$. If $MEM(\sigma) < V$ then various overlap values were not enumerated for $\sigma$.

Observe that there are $O(m^n)$ communication sequences of length $n$ for $m$ processors, and for each processor the number of possible ways of overlapping the communication chunks is also exponential in $n_i$. Notice that $n$ is not a part of the input, and must be determined by the scheduling algorithm. For reasons of high computational complexity, in our implementation of BB we imposed an upper bound on $n$. Consequently, BB delivers optimum solutions only if the optimum number of communications is not greater than the imposed limit. In the opposite case BB was not able to deliver an optimum, or even a feasible solution.

3.2 Heuristics
We used two types of heuristics to construct the communication sequences and the overlap of the load chunks. Appender heuristics used some parameters of the processors to append them to the communication sequence. Random heuristics were used as a reference to verify the performance of the appender heuristics. The communication sequences and the overlap of the load chunks constructed by both types of heuristics were used to formulate linear programs solved for the optimum load chunk sizes ($\alpha_i$) and schedule length ($T_{max}$).

Appender Heuristics. For the purpose of constructing the communication sequence it was assumed that each chunk has a (phantom) size equal to the memory limit. After receiving the load chunk a processor computes it and then it becomes free again. Then new load was sent to the first free processor. The difference between the heuristics consisted in the way of selecting the first free processor. Heuristic appender A (AppA) sent the load first to the free processor with minimum $A_i$. Heuristic appender SBC (AppSBC) sent the load first to a free processor with minimum $S_i + C_i B_i$. The processors were appended to the communication sequence until the sent (phantom) load was at least three times greater than $V$. It was assumed that only pairs of consecutive communication chunks overlap, i.e. the chunks overlap by one message.

Random Heuristics. The first random heuristic (Rnd1) appended random processors to the communication sequence until the sum of their memory limits was at least as big as load size $V$. It was assumed that each chunk would carry load equal to the processor memory limit. Therefore, it was assumed that there was no overlap of consecutive load chunks. The second random heuristic (Rnd2) doubled the communication sequence of heuristic Rnd1, and applied overlap of the pairs of consecutive load chunks. The third random heuristic (Rnd3) expanded the communication sequence of Rnd1 by a random number of communications to random processors. The upper bound on the extension of the initial sequence was three times the original sequence length. The overlap of the load chunks was random. Since some chunks overlap and together cannot exceed the memory limits of the processors, the total memory collected on a certain processor may be smaller than its number of chunks multiplied by the memory size limit. Hence, due to the random overlapping of the load chunks some Rnd3 solutions could be infeasible.

[Figure 2: execution time t [s] (logarithmic scale) against n for Rnd1, Rnd2, Rnd3, AppA, AppSBC, BB m=2 and BB m=8.]
Fig. 2. Execution time vs n
[Figure 3: heuristic quality against n for Rnd1, Rnd2, Rnd3, AppA, AppSBC and O1.]
Fig. 3. Heuristic quality vs. n, m = 2
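One possible reading of the appender heuristics is sketched below in Python. The text does not fix every detail, so several points are assumptions made for this illustration: the originator sends one phantom chunk at a time, a processor is "free" once it has finished computing its previous phantom chunk, the originator waits when no processor is free, ties are broken by the lowest processor index, and every $B_i$ is positive. The function name `appender_sequence` and the `rule` argument are hypothetical. Only the communication sequence is built; the chunk sizes would still come from the linear program.

```python
def appender_sequence(A, B, C, S, V, rule="A"):
    """Build a communication sequence with phantom chunks of size B[i].

    rule == "A"   : prefer the free processor with minimum A[i] (AppA)
    rule == "SBC" : prefer the free processor with minimum S[i] + C[i]*B[i]
                    (AppSBC)
    Processors are numbered from 0; assumes every B[i] > 0.
    """
    if rule == "A":
        key = lambda i: A[i]
    elif rule == "SBC":
        key = lambda i: S[i] + C[i] * B[i]
    else:
        raise ValueError(rule)

    m = len(A)
    free_at = [0.0] * m        # time at which each processor is free again
    now = 0.0                  # time at which the originator can send
    sent = 0.0                 # phantom load sent so far
    sigma = []
    while sent < 3 * V:        # stop once the phantom load reaches 3V
        free = [i for i in range(m) if free_at[i] <= now]
        if not free:           # nobody is free: wait for the first one
            now = min(free_at)
            continue
        p = min(free, key=key)
        comm_end = now + S[p] + C[p] * B[p]
        free_at[p] = comm_end + A[p] * B[p]
        now = comm_end
        sent += B[p]
        sigma.append(p)
    return sigma
```

For two processors with $A = (1, 2)$, $B = (2, 2)$, $C = (1, 1)$, $S = (0, 0)$ and $V = 2$, the AppA rule produces the sequence `[0, 1, 0]`.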
4 Computational Experiments
Computational experiments were performed on a network of 16 workstations with Pentium 4 CPUs running Windows XP. The software was written in Borland C++ 6.0. The linear programs were solved by lp_solve 5.5.10 [8]. Each of the points shown in the following figures represents an average value over 100 randomly generated instances. Instance parameters $A_i$, $C_i$, $S_i$ were generated with uniform distribution from interval [0,1]. In experiments involving variable $m$, $n$ (cf. Fig. 2, 3, 4, 5), the problem size was $V = 20$, and the $B_i$ parameters were randomly generated with uniform distribution from $[0, V/2]$ to minimize interaction between $V$, $B_i$, $n$. In the experiments involving changing problem size $V$ (Fig. 6, 7), the $B_i$ were generated uniformly from [0, 10].

Fig. 2 presents execution time vs the maximum length $n$ of the communication sequence in the BB algorithm. As can be seen, BB has exponential running time in $n$. It takes days to solve even small instances. The execution time of the heuristics is much smaller. It is also independent of $n$ because the heuristics had no provisions for limiting the communication sequence length.

In Fig. 3 the quality of the heuristic solutions vs. $n$ is shown. The quality measure is the average ratio of the schedule length obtained by some heuristic to the best solution constructed either by BB or by any other heuristic. The lower the value in Fig. 3, the better the heuristic. The trend of degrading quality with growing $n$ is an artifact related to the limited communication sequence length of the BB algorithm. As the sequence is allowed to be longer, BB is able to construct better solutions and hence the relative quality of the heuristics worsens. Rnd1 and Rnd2 are noticeably better than the appender heuristics in the current setting. The lowest line in Fig. 3 is the quality of a 'heuristic' which uses the best communication sequence (found by BB or any other heuristic) and applies overlap by one message. Let us call this heuristic O1. As can be seen, O1 performs quite well, which means that overlap by one communication is often a very good solution. This situation repeats in the following experiments.

[Figure 4: execution time t [s] (logarithmic scale) against m for Rnd1, Rnd2, Rnd3, AppA, AppSBC and BB.]
Fig. 4. Execution time vs m, n = 8
[Figure 5: heuristic quality against m for Rnd1, Rnd2, Rnd3, AppA, AppSBC and O1.]
Fig. 5. Heuristic quality vs. m, n = 8

In Fig. 4 the execution time of the algorithms vs. $m$ is shown. Though the running time of BB grows polynomially in $m$ (for fixed $n$), the number of possible sequences and ways of message overlapping is still very big, so that the execution time of BB is unacceptably long. The execution time of the heuristics grows only slightly in the observed range of $m$. In the current setting (cf. Fig. 5) random heuristics are worse than the appender heuristics, which differs from the situation in Fig. 3.
This is a result of the number of available processors and the lengths of communication sequences. Appender heuristics append the first free processor to the communication sequence. This often results in a repetitive pattern of communications to processors selected for minimum $A_i$ or $S_i + C_i B_i$. Consequently, processors with big memory buffers may be selected less often (or not at all). This results in communication sequences longer than the randomly generated ones. Furthermore, in Fig. 3 the number of processors is small, and random heuristics have more chance to select a good processor to extend the communication sequence. On the other hand, for the instances in Fig. 5 the number of processors is bigger and random heuristics have more chance to choose a bad processor. As a result, appender heuristics are slightly better.

[Figure 6: execution time t [s] (logarithmic scale) against V for Rnd1, Rnd2, Rnd3, AppA, AppSBC and BB.]
Fig. 6. Execution time vs V, m = 4, n = 8
[Figure 7: heuristic quality against V for Rnd1, Rnd2, Rnd3, AppA, AppSBC and O1.]
Fig. 7. Heuristic quality vs. V, m = 4, n = 8

In Fig. 6 we show the execution time of the algorithms vs. problem size $V$. Intuitively, for fixed buffer sizes the communication sequence length should grow to accommodate growing problem size $V$. Therefore, the running time of BB grows initially until reaching the limit on the communication sequence length imposed for complexity reasons ($n = 8$). If the communication sequence $\sigma$ is too short then too little memory will be available for processing the load, and because $MEM(\sigma) < V$, the sequence will be pruned. Consequently, the search is pruned more often and BB has a shorter running time. It can be seen at $V = 20$ and $V = 50$ that the BB running time stops growing. On the other hand, with growing problem size $V$ all heuristics build longer communication sequences, which results in longer construction times and bigger linear programs which need more time to be solved.

The quality of the heuristics is depicted in Fig. 7. The seeming improvement in the heuristic quality with growing $V$ is an artifact resulting from the limitations of BB. The bigger the problem size, the longer the communication sequence needed, the more often BB fails to give a feasible solution, and the more often the best solution is constructed by a heuristic. This results in an artificial 'improvement' in the quality of heuristic solutions. The appender heuristics are moderately better than the random heuristics.
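The quality measure used in Figs. 3, 5 and 7 can be sketched in Python as follows. The function name `average_quality` and the data layout (one dictionary of schedule lengths per instance) are assumptions made for this illustration. How infeasible results entered the averages is not stated in the text; here an algorithm that returned no feasible solution for an instance (value `None`) is simply skipped for that instance, which is also an assumption.

```python
def average_quality(instances):
    """Average ratio of each algorithm's schedule length to the best one.

    instances: list of dicts mapping an algorithm name to the schedule
    length it obtained on one instance, or None if it found no feasible
    solution. Returns a dict mapping names to their mean ratio.
    """
    totals, counts = {}, {}
    for lengths in instances:
        feasible = {a: v for a, v in lengths.items() if v is not None}
        if not feasible:
            continue
        best = min(feasible.values())      # best of BB and all heuristics
        for name, value in feasible.items():
            totals[name] = totals.get(name, 0.0) + value / best
            counts[name] = counts.get(name, 0) + 1
    return {name: totals[name] / counts[name] for name in totals}


quality = average_quality([
    {"AppA": 10.0, "Rnd1": 12.0, "BB": None},   # BB found no solution
    {"AppA": 9.0, "Rnd1": 6.0, "BB": 6.0},
])
```

A value of 1 means that the algorithm always produced the best known schedule; in this toy data set AppA scores 1.25. The example also shows the artifact discussed above: when BB fails, the best solution comes from a heuristic, which lowers the heuristic ratios.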
5 Conclusions
In this work we proposed a method for scheduling divisible computations on a heterogeneous star with limited memory buffers. The load is processed in multiple rounds, and communication startup time is taken into account. Memory allocations and releases have block nature. Since this problem has both continuous and combinatorial nature, a combination of methods has been proposed: the first solves the combinatorial part by providing the communication sequence and message overlaps, the second finds the optimum load distribution for the given sequence and overlap. A branch and bound algorithm has been proposed which, unfortunately, has prohibitive running time. It seems possible to improve BB by providing better bounding techniques. Two fast heuristics were proposed with the intent of using processors with fast CPUs (AppA) or fast communication (AppSBC). These heuristics do not perform particularly well as far as the quality of the solutions is concerned, because in the studied range of parameters random heuristics give solutions of comparable quality. It seems that the quality of the solution depends on the coincidence of too many parameters, and these heuristics are too simple to capture such relationships. Therefore, further analysis of the relations between problem parameters is needed to propose better heuristic methods.
References
1. Bharadwaj, V., Ghose, D., Mani, V., Robertazzi, T.: Scheduling divisible loads in parallel and distributed systems. IEEE Computer Society Press, Los Alamitos (1996)
2. Bharadwaj, V., Ghose, D., Mani, V.: Multi-installment Load Distribution in Tree Networks with Delays. IEEE Transactions on Aerospace and Electronic Systems 31, 555–567 (1995)
3. Cheng, Y.-C., Robertazzi, T.G.: Distributed computation with communication delay. IEEE Transactions on Aerospace and Electronic Systems 24, 700–712 (1988)
4. Drozdowski, M.: Selected problems of scheduling tasks in multiprocessor computer systems. Monographs, vol. 321. Poznań University of Technology Press (1997), http://www.cs.put.poznan.pl/mdrozdowski/txt/h.ps
5. Drozdowski, M., Lawenda, M.: Multi-installment Divisible Load Processing in Heterogeneous Systems with Limited Memory. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 847–854. Springer, Heidelberg (2006)
6. Drozdowski, M., Wolniewicz, P.: Optimum divisible load scheduling on heterogeneous stars with limited memory. European Journal of Operational Research 172, 545–559 (2006)
7. Li, X., Bharadwaj, V., Ko, C.C.: Processing divisible loads on single-level tree networks with buffer constraints. IEEE Transactions on Aerospace and Electronic Systems 36, 1298–1308 (2000)
8. Lp solve reference guide (2007), http://lpsolve.sourceforge.net/5.5/
9. Robertazzi, T.: Ten reasons to use divisible load theory. IEEE Computer 36, 63–68 (2003)
Scheduling DAGs on Grids with Copying and Migration
Israel Hernandez and Murray Cole
Institute for Computing Systems Architecture, School of Informatics, University of Edinburgh
Abstract. Effective task scheduling mechanisms for Grids must address the dynamic and heterogeneous nature of the system by allowing rescheduling and migration of tasks in reaction to significant variations in resource characteristics. Migration may be invoked when the cost of the migration itself is outweighed by the global time saved due to execution at the new site. We extend our previous results in this area by considering the maintenance of a collection of reusable copies of the results of completed tasks. We show how to reuse such copies to improve overall application makespan. We present the Grid Task Positioning with Copying facilities (GTP/c) system. We compare the performance of GTP/c with our previous model GTP and with DLS/sr, an adaptive version of the static Dynamic Level Scheduling method.
1 Introduction
Our research focuses on the DAG scheduling problem on Grid-style resource pools. The core issue is that the availability and performance of grid resources, which are already by their nature heterogeneous, can be expected to vary dynamically, even during the course of an execution. Since any credible formalisation of the scheduling problem is NP-complete, a number of heuristics have been proposed in the literature. However, most approaches [3,6] assume that heterogeneous resources are dedicated and unchanging over time. Our previous model, the Grid Task Positioning (GTP) system [1], addressed this issue by allowing rescheduling and migration of tasks in response to significant variations in resource characteristics. We observed that in an execution with relatively frequent migration, it may be that, over time, the results of some task have been copied to several other sites, and so a subsequent migrated task may have several possible sources for each of its inputs. Some of these copies may now be more quickly accessible than the original, due to dynamic variations in communication capabilities. To exploit this observation, we have extended our model with a Copying Management (CM) function, resulting in a new version, the Grid Task Positioning with copying facilities (GTP/c) system. The idea is to reuse such copies in subsequent migrations of placed tasks in order to reduce the impact of migration cost on makespan.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1019–1028, 2008. © Springer-Verlag Berlin Heidelberg 2008
[Figure 1: block diagram linking the ITG, the static schedule and reschedule processes, the Grid, and the GRP and STG/c data (including copies), driven by application and Grid monitoring information. Legend: schedule generation (only once at the beginning), schedule evaluation (rescheduling), schedule feedback (only once at the end).]
Fig. 1. The GTP/c mapping method
In common with its predecessor, GTP/c is based on a list-scheduling heuristic approach. It assumes that an initial schedule is generated by a static algorithm and launched onto the Grid. We tackle the dynamic problem by introducing a fixed-period rescheduling cycle. In each cycle, our model updates the available information about the partial completion of tasks, the reusable copies generated and dynamic resource changes, and then considers whether rescheduling of some tasks would be beneficial. Our overall system is sketched in figure 1, in which ITG represents the task graph, STG/c contains dynamic information concerning the progress of the application, and GRP contains dynamic information concerning the performance of the Grid. We will define these structures more formally in section 2. We compare the performance of our new model with that of GTP [1] and that of DLS/sr [2,6], an adaptive version of the Dynamic Level Scheduling algorithm. Our results show that GTP/c tends to reduce the makespan of the application, particularly in those scenarios where resources vary considerably and the data transfers among tasks tend to be intensive.
The remainder of this paper is structured as follows. In Section 2 we introduce the DAG scheduling problem and define our terminology. In Section 3 we present the GTP/c scheduling method, and in Section 4 we define the cost estimation functions which drive it. Section 5 presents the results of some simulated executions of our method, compared to GTP and DLS/sr. Section 6 describes future work.
2 Problem Definition
2.1 Definition of the Grid Architecture
We represent Grid Resource Pools (GRP) with graphs GRP :: $(P, L, \mathit{avail}, \mathit{bandwidth})$, where $P$ is the set of available processors in the system, $p_i$ ($1 \le i \le |P|$).
$L$ is the set of communication links connecting pairs of distinct processors, $l_i$ ($1 \le i \le |L|$), such that $l(m, n) \in L$ denotes a communication link between $p_m$ and $p_n$. Our dynamic scheduling decisions will be based upon the latest available resource performance information (as returned by standard Grid monitoring tools such as NWS or Globus MDS). Thus, at time $t$ we assume knowledge of $\mathit{avail}_t :: P \to [0..1]$, capturing the availability of each CPU, and $\mathit{bandwidth}_t :: L \to \mathit{Float}$, capturing the available bandwidth on each link. We assume that the intra-processor communication cost ($p_m = p_n$) is negligible.
2.2 Definition of the Input Task Graph (ITG)
Static information about the DAG application is represented by an input task graph ITG :: $(V, E, \mathit{data}, W)$. $V$ is the set of tasks, $v_i$ ($1 \le i \le |V|$). $E \subseteq V \times V$ is the set of directed edges connecting pairs of distinct tasks, $e_i$ ($1 \le i \le |E|$), where $e(i, j) \in E$ denotes a direct dependency and data transfer from task $v_i$ to task $v_j$. For future convenience, we write $Pred(v_i)$ for the subset of tasks which directly precede $v_i$ and $Succ(v_i)$ for the subset of tasks which directly succeed $v_i$. We use $\mathit{data} :: V \times V \to \mathit{Int}$ to describe the size of data transfers in standard units, such that $\mathit{data}(i, j)$ denotes the amount of data to be transferred from $v_i$ to $v_j$. We represent computation times with $W :: V \times P \to \mathit{Int}$, where $W(i, m)$ denotes the execution time in standard units of task $v_i$ on heterogeneous processor $p_m$, when working at full availability (i.e. availability 1 in terms of function $\mathit{avail}$). In common with previous approaches, we assume the availability of complete cost information for the DAG. In practice this may be difficult to obtain, and concrete realizations of such systems may have to rely upon programmer estimates, information from previous runs and other ad-hoc methods. We will factor in the effect of dynamically affected processor availabilities and link bandwidths during execution.
2.3 Definition of the Situated Task Graph (STG/c)
We model dynamic information on the progress of the DAG execution by augmenting the static ITG to form a Situated Task Graph STG/c. This includes information on the current schedule of tasks, reusable copies generated, partial completion of tasks and partial completion of communications. This is necessary, together with monitored information on the availability of processors and links, to allow us to iteratively compute improved schedules, taking into account migration costs and resource availability changes. As in GTP, we use the concept of a placed task. A task is said to become placed on a processor once it has begun to gather its input data on that processor. A task which has merely been assigned to some processor by the current schedule is said to be non-placed. The distinction is important because of its impact on migration costs associated with data retransmission. The strategy of considering reusable copies will allow us to reduce the impact of migration cost when a placed task is migrated. The decision to migrate a non-placed task will incur no additional migration cost because retransmission of data is not needed.
We define STG/c :: $(V, E, \mathit{data}, W, \Pi, \kappa^c, \kappa^d, \Omega)$, where the first four components are taken directly from the corresponding ITG. We use $\Pi :: V \to P^+$ to represent placement information. $P^+$ represents $P$ augmented with the special value NONE. For placed tasks $v_i$, $\Pi(v_i)$ indicates the corresponding processor. For non-placed tasks $v_i$, $\Pi(v_i)$ = NONE. A placed task remains placed until migrated or until the whole application terminates, because even after task completion we will later need to retrieve (or re-retrieve in the case of migration) its results. We assume that information concerning the progress of computations and communications is made available by monitoring mechanisms at each rescheduling point. We use $\kappa^c :: V \to [0..1]$ to capture the proportion of a task's computation which has been completed, and similarly, $\kappa^d :: E \to [0..1]$ to capture the proportion of a data transfer which has been completed.
The key new concept is that of a reusable copy. A data transfer for a particular edge $e(i, j)$ is said to become a reusable copy on a processor once it has been totally transmitted ($\kappa^d(e(i, j)) = 1$) from $\Pi(v_i)$ to $\Pi(v_j)$. It is reusable because if, during the process, $v_j$ migrates to a different processor, the copy may be used as a source in subsequent scheduling decisions. The copy will remain reusable until task $v_j$ finishes execution. The adaptive nature of our model allows several reusable copies for a particular $e(i, j)$, since task $v_j$ can migrate at each rescheduling point if the benefits are substantial. We expect that reusable copies will help to reduce the impact of migration on makespan by avoiding unnecessary data transfer between tasks, exploiting the network link which offers the minimum data transfer cost according to the latest resource performance information. Thus, we need to keep the information about every reusable copy generated at time $t$ in our model.
We use $\Omega :: E \to \mathcal{P}(P)$ to describe the subset of $P$ where copies of the given (edge) data are available. The initial STG/c is effectively just the ITG with all completions and reusable copies equal to zero, and all task placements set to NONE.
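As a rough illustration of the structures defined in this section, the following Python sketch represents GRP, ITG and STG/c as plain data classes. The class, field and method names (`GRP`, `ITG`, `STG`, `kappa_c`, `kappa_d`, `copies`, `initial`, `record_copy`) are assumptions made for illustration and are not part of GTP/c; only the components themselves come from the definitions above.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple

Task = int                 # index i of task v_i
Proc = int                 # index m of processor p_m
Edge = Tuple[Task, Task]   # e(i, j)
Link = Tuple[Proc, Proc]   # l(m, n)


@dataclass
class GRP:
    """Grid Resource Pool as monitored at time t."""
    procs: Set[Proc]
    avail: Dict[Proc, float]       # avail_t(p) in [0, 1]
    bandwidth: Dict[Link, float]   # bandwidth_t(l)


@dataclass
class ITG:
    """Static input task graph (V, E, data, W)."""
    tasks: Set[Task]
    edges: Set[Edge]
    data: Dict[Edge, int]              # data(i, j)
    W: Dict[Tuple[Task, Proc], int]    # W(i, m) at full availability

    def pred(self, j: Task) -> Set[Task]:
        return {a for (a, b) in self.edges if b == j}

    def succ(self, i: Task) -> Set[Task]:
        return {b for (a, b) in self.edges if a == i}


@dataclass
class STG:
    """Situated task graph: the ITG plus dynamic progress information."""
    itg: ITG
    placement: Dict[Task, Optional[Proc]]   # Pi; None stands for NONE
    kappa_c: Dict[Task, float]              # completed share of each task
    kappa_d: Dict[Edge, float]              # completed share of each transfer
    copies: Dict[Edge, Set[Proc]]           # Omega: sites holding a reusable copy

    @classmethod
    def initial(cls, itg: ITG) -> "STG":
        # Initial STG/c: nothing completed, no copies, no task placed.
        return cls(
            itg=itg,
            placement={v: None for v in itg.tasks},
            kappa_c={v: 0.0 for v in itg.tasks},
            kappa_d={e: 0.0 for e in itg.edges},
            copies={e: set() for e in itg.edges},
        )

    def record_copy(self, edge: Edge) -> bool:
        """Register a reusable copy at Pi(v_j) once e(i, j) is fully transmitted."""
        site = self.placement[edge[1]]
        if site is None or self.kappa_d[edge] < 1.0:
            return False
        self.copies[edge].add(site)
        return True
```

The sketch does not model the removal of copies once $v_j$ finishes execution; a fuller version would clear the corresponding entry of `copies` at that point.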
3 The GTP/c System
Like GTP, GTP/c addresses the dynamicity of Grids with cyclic use of a static mapping method. At each rescheduling point the objective is to obtain an improved task schedule which minimises the anticipated makespan, given the current status of both application and resources. During remapping, the GRP structure is updated with the resource availability changes and the STG/c structure is updated with the reusable copies generated and details of partial completion of both tasks and data transfers. In keeping with the principles of list-scheduling, we maintain a list of unfinished tasks ordered by a rank, which is computed statically. Thus, for each such task $v_i$, GTP/c computes the Earliest Finished Time (EFT) value for scheduling $v_i$ on every processor $p_m \in P$ and remaps $v_i$ onto the processor which offers the smallest EFT. The task schedule generated at the checkpoint at time $t$ is represented as a function $\mu_t : V \to P$ such that $\mu_t(v_i) = p_m$ denotes that task $v_i$ is to be executed by processor $p_m$ at $t$. Notice that for placed tasks $v_i$ which have not been migrated by $\mu_t$, we will have $\mu_t(v_i) = \Pi(v_i)$.
3.1 Setting Task Ranks
There are several methods to statically set the priorities of tasks for a heterogeneous environment. We use $R_u(v_i)$ (also known as blevel), which is an upward rank computed from the exit node to $v_i$ and defined as the length of the critical path from $v_i$ to an exit node. $R_u(v_i)$ is computed recursively as

$$R_u(v_i) = W_i + \max_{v_j \in Succ(v_i)} \bigl(\mathit{data}(v_i, v_j) + R_u(v_j)\bigr) \tag{1}$$

where $W_i$ is the average execution cost of task $v_i$ across all processors, defined by

$$W_i = \frac{\sum_{m=1}^{|P|} W(v_i, p_m)}{|P|} \tag{2}$$
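A minimal Python sketch of Equations (1) and (2), assuming the graph is given as dictionaries. The function name `upward_ranks`, its argument layout and the small example DAG are illustrative assumptions, not taken from GTP/c.

```python
def upward_ranks(tasks, succ, data, W, procs):
    """Return {task: R_u(task)} following Equations (1) and (2).

    succ[i] lists the direct successors of task i, data[(i, j)] is the
    transfer size on edge e(i, j), W[(i, m)] the cost of task i on processor m.
    """
    # Equation (2): average execution cost of each task over all processors.
    avg_w = {i: sum(W[(i, m)] for m in procs) / len(procs) for i in tasks}
    ranks = {}

    def rank(i):
        # Equation (1); an exit node has no successors, so the max term is 0.
        if i not in ranks:
            tail = max((data[(i, j)] + rank(j) for j in succ[i]), default=0)
            ranks[i] = avg_w[i] + tail
        return ranks[i]

    for i in tasks:
        rank(i)
    return ranks


# Hypothetical diamond-shaped DAG: 0 -> {1, 2} -> 3, on two processors.
tasks = [0, 1, 2, 3]
procs = [0, 1]
succ = {0: [1, 2], 1: [3], 2: [3], 3: []}
data = {(0, 1): 1, (0, 2): 5, (1, 3): 2, (2, 3): 1}
W = {(0, 0): 2, (0, 1): 4, (1, 0): 3, (1, 1): 5,
     (2, 0): 1, (2, 1): 1, (3, 0): 2, (3, 1): 2}

ranks = upward_ranks(tasks, succ, data, W, procs)
# List-scheduling order: tasks sorted by decreasing upward rank.
priority_list = sorted(tasks, key=lambda i: -ranks[i])
```

The recursion is memoised, so each task is ranked once; for very deep graphs an iterative traversal in reverse topological order would avoid Python's recursion limit.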
Notice that the computation weight of a node is approximated by the average of its weights across all processors, following the approach of [3].
3.2 The Task Migration Model in GTP/c
The adaptive nature of the GTP/c model is illustrated in figure 2, from which we can observe the difference between the strategies used by GTP and GTP/c. In terms of our formalization, a placed task $v_i$ is migrated when it has been rescheduled onto a processor other than $\Pi(v_i)$. We recall that GTP uses a pessimistic model, in which the migrated task must be restarted from the very beginning, including regathering all inputs directly from the predecessors (see figure 2(a)). With GTP/c, in an execution with relatively frequent migration, it may be that, over time, the results of some task have been copied to several other sites, and so a subsequent migrated task may have several possible sources for each of its inputs. Some of these copies may now be more quickly accessible than the original, due to dynamic variations in communication capabilities. To illustrate this, in figure 2(b) at the rescheduling point $RP_n$, task $v_3$ could not be executed as it had only received the required data from task $v_1$. However, the idea behind the GTP/c model is that we now maintain the copy of the result generated by $v_1$ in the system in $\Omega_n(e(v_1, v_3))$, such that it may be used as an input in future migrations of $v_3$. Thus, at $RP_n$ and after considering the latest information about both resources and progress of the application, task $v_3$ is migrated from $p_4$ to $p_2$ and we observe that the required data from $v_1$ can be transmitted from the site $p_4$ storing the copy or from the site $p_1$ where $v_1$ was executed. The decision to select the site from which the data will be transmitted will depend upon the prediction of the minimum estimated finish time, which involves the estimated availability of the processors (which may have other tasks to complete first) and the estimated availability of input data (which may have to be transferred from other processors). Following the example, at $RP_{n+1}$, $v_3$ was not computed as it had only received data from $v_2$. This creates a new copy in the system, maintained in $\Omega_{n+1}(e(v_2, v_3))$ for future migrations of $v_3$. At $RP_{n+1}$ task $v_3$ is now migrated from $p_2$ to $p_4$, and we observe that there are several possible sources for each preceding task. At the end we observe that $v_3$ is finally executed, using the copy $\Omega_n(e(v_1, v_3))$ and a direct data transfer for $e(v_2, v_3)$.
[Figure 2: timelines of task placements and data transfers on processors P1 to P4 over rescheduling points n, n+1 and n+2, for (a) the GTP model and (b) the GTP/c model. Legend: data transfer completed, data transfer considered, copy of data (v1−v3), copy of data (v2−v3).]
Fig. 2. The GTP/c Migration model
4 Costing of Candidate Schedules
Our cost prediction approach is based upon redefinition of concepts drawn from the standard scheduling literature [3,6], together with some additional operations required by the dynamically heterogeneous nature of our target system.
4.1 Estimating Communication Cost
During (re)scheduling at time $t$, we need to predict how much time will be required to transfer data for various candidate assignments of tasks to processors. In general, this will depend upon the latest performance information of the link (bandwidth) associated with the processors involved, the reusable copies generated and any previous partial completion of the transfers. The estimated communication cost in standard units to transfer data associated with an edge $e(v_i, v_j)$ from $\mu_t(v_i) = p_m$ to $\mu_t(v_j) = p_n$ is defined by

$$C^t(v_i, p_m, v_j, p_n) = \mathit{StartUp} + \frac{\mathit{data}^t(v_i, v_j)}{\mathit{bandwidth}_t(p_m, p_n)} \tag{3}$$

StartUp is the system dependent fixed time $t_s$ taken between initiating a request for data and the beginning of the data transfer, and is therefore only applicable to transfers which have not already begun,

$$\mathit{StartUp} = \begin{cases} t_s, & \text{if } \kappa^d(v_i, v_j) = 0 \text{ or } \Pi(v_i) \ne p_m \text{ or } \Pi(v_j) \ne p_n \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

$\mathit{data}^t(v_i, v_j)$ denotes the remaining volume of data to transmit from task $v_i$ to task $v_j$ at time $t$ and is computed as

$$\mathit{data}^t(v_i, v_j) = \mathit{data}(v_i, v_j) \cdot \bigl(1 - \kappa^d(v_i, v_j)\bigr) \tag{5}$$

For an edge $e(i, j)$, the Copying Management ($CM^t$) function finds the cheapest way of getting a copy of the corresponding data to $\mu_t(v_j)$, the processor to which the subsequent computation is currently allocated. This is achieved by considering the original source and existing copies of the data in turn, in the light of the current bandwidth on the corresponding links

$$CM^t(v_i, v_j) = \min\{C^t(v_i, p, v_j, \mu_t(v_j))\} \tag{6}$$
where $p \in \{\mu_t(v_i)\} \cup \Omega(e(i, j))$.
4.2 Estimating Computation Cost
In estimating the value of candidate schedules we need to predict the time at which some task could begin execution on some processor and the time at which that execution will finish. These times depend upon the availability of the processors (which may have other tasks to complete first) and the availability of input data (which may have to be transferred from other processors). We must first define two mutually referential quantities. $EST^t(v_i, p_m)$ is the Estimated Start Time of task $v_i$ on processor $p_m$, where the estimate is made at time $t$. For tasks which have already begun (or even completed) on $p_m$ at $t$, EST will be $t$ (the effect of already completed work will be allowed for in EFT).

$$EST^t(v_i, p_m) = t, \quad \text{if } \mu_t(v_i) = p_m \text{ and } \kappa^c(v_i) > 0 \tag{7}$$

For other tasks it will be determined by the need for predecessors of $v_i$ to complete and send their data to $p_m$.

$$EST^t(v_i, p_m) = \max\{PA^t(p_m), DA^t(v_i)\} \tag{8}$$

where $PA^t(p_m)$ is a function which returns the latest estimated finish time among tasks already assigned to $p_m$ (i.e. the time at which the processor becomes available, having completed other tasks),

$$PA^t(p_m) = \max_{\{v_i \mid \mu_t(v_i) = p_m\}} \{EFT^t(v_i, p_m)\} \tag{9}$$

and $DA^t(v_i)$ is the estimated earliest time at which data from a predecessor task $v_j$ (mapped on $p_k = \mu_t(v_j)$) will be available at $p_m$.

$$DA^t(v_i) = \max_{v_j \in Pred(v_i)} \{EFT^t(v_j, p_k) + CM^t(v_j, v_i)\} \tag{10}$$

The max block in Equation 10 returns the estimated time of arrival of all data needed to execute task $v_i$ on processor $p_m$. This is calculated by considering the evolving status of each $v_j \in Pred(v_i)$, and any available copies of their results.
Similarly, $EFT^t(v_i, p_m)$ is the Estimated Finished Time of the computation of task $v_i$ on processor $p_m$. For already completed tasks (at $t$) we will have

$$EFT^t(v_i, p_m) = t, \quad \text{if } \kappa^c(v_i) = 1 \tag{11}$$

For other tasks it will be determined by the quantity of work outstanding and the availability of $p_m$.

$$EFT^t(v_i, p_m) = EST^t(v_i, p_m) + W^t(v_i, p_m) \tag{12}$$

where $W^t(v_i, p_m)$ denotes the amount of work still to be completed for task $v_i$ on processor $p_m$, defined by

$$W^t(v_i, p_m) = \frac{W(v_i, p_m) \cdot \bigl(1 - \kappa^c(v_i)\bigr)}{\mathit{avail}_t(p_m)} \tag{13}$$
As with communication cost prediction, migrated tasks must be costed for a restart from scratch (i.e. we reset $\kappa^d(v_i, v_j) = 0$). We note that our model ignores possible contention in communication by assuming an effectively unlimited number of links from $p_m$ to $p_n$. The discrepancy between real and predicted times is incorporated into our rescheduling as a result of the difference between the actual completion information ($\kappa^c$, $\kappa^d$) returned by monitoring and that which would have been expected at the preceding rescheduling point. Thus, the overall objective of minimising the real makespan of the DAG application is achieved by iteratively minimising the estimated makespan.
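The cost functions of this section can be summarised in a short Python sketch. It covers Equations (3) to (6) and (11) to (13) only; the function names, argument order and the numbers in the example are assumptions made for illustration, and the mutually recursive computation of EST (Equations (7) to (10)) is left out.

```python
def comm_cost(data, kappa_d, bandwidth, src, dst, placed_src, placed_dst, t_s):
    """Equations (3)-(5): estimated cost of moving one edge's data from src to dst."""
    if src == dst:
        return 0.0  # intra-processor communication is assumed negligible
    # Equation (4): start-up is charged unless this very transfer is already running.
    fresh = kappa_d == 0 or placed_src != src or placed_dst != dst
    start_up = t_s if fresh else 0.0
    remaining = data * (1.0 - kappa_d)  # Equation (5)
    return start_up + remaining / bandwidth[(src, dst)]


def cheapest_source(data, kappa_d, bandwidth, sources, dst,
                    placed_src, placed_dst, t_s):
    """Equation (6): cheapest site among the original source and the copies.

    Returns a (cost, site) pair.
    """
    return min(
        (comm_cost(data, kappa_d, bandwidth, p, dst, placed_src, placed_dst, t_s), p)
        for p in sources
    )


def remaining_work(W, kappa_c, avail):
    """Equation (13); avail must be positive."""
    return W * (1.0 - kappa_c) / avail


def eft(t, est, W, kappa_c, avail):
    """Equations (11) and (12)."""
    if kappa_c == 1:
        return t
    return est + remaining_work(W, kappa_c, avail)


# Hypothetical situation modelled on figure 2(b): v3 moves from p4 to p2, and the
# output of v1 is available at p1 (where v1 ran) and at p4 (reusable copy).
bandwidth = {(1, 2): 2.0, (4, 2): 10.0}
cost, site = cheapest_source(
    data=100, kappa_d=0.0, bandwidth=bandwidth, sources={1, 4}, dst=2,
    placed_src=1, placed_dst=4, t_s=1.0,
)
# Finish estimate for a task that is 25% done on a processor at 50% availability.
finish = eft(t=0.0, est=cost, W=40, kappa_c=0.25, avail=0.5)
```

In this example the copy held at p4 is selected because its link to p2 is faster than the link from p1, which is the situation the Copying Management function is meant to exploit.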
5 Performance Evaluation
We have selected the well known HEFT [3] algorithm, our previous model GTP [1] and DLS/sr against which to evaluate the performance of GTP/c. In [2], a selective rescheduling policy is proposed to reduce the frequency of rescheduling attempts. We use this approach to build an adaptive version (DLS/sr) of the well known Dynamic Level Scheduling (DLS) [6] algorithm. For our purposes, we use the spare time between tasks as the selective rescheduling policy: for a particular edge $e(v_i, v_j) \in E$, it denotes the maximal time that task $v_i$ can execute without affecting the start time of its successor $v_j$. It also includes the adjacent task of $v_i$ in the execution order of the assigned processor. Our research uses fixed-period rescheduling cycles to reschedule the application, and the benchmark enables us to focus on the importance that data reuse may have in minimizing the impact of migration on makespan. Our evaluation is conducted by simulation, since this allows us to generate repeatable patterns of resource performance variation. We have used the Simgrid simulator [4] for this purpose.
5.1 Comparison Metrics
We use the Normalized Schedule Length (NSL) to compare the performance of GTP/c with that of GTP and DLS/sr. The NSL metric is defined as the ratio of the schedule length (makespan) to the sum of the computational weights along the critical path and can be computed as

$$NSL = \frac{\mathit{Makespan}}{\sum_{v_i \in CPath} W_i}$$

Other metrics used to understand the behaviour of each model are the average number of migrated tasks, the average number of remappings and the average overhead cost.
5.2 DAG Applications
The shapes of the graphs considered in our experiments were taken from the Standard Task Graph Project (STDGP) [5]. Since STDGP considers the DAGs to be executed in a homogeneous environment, without communication cost, we had to add randomly (but repeatably) generated $W$ and $\mathit{data}$ information to produce our ITGs. The graph size (in number of tasks) was taken from the set {50, 100, 300, 500, 1000}. In keeping with the principles of schedule feedback, we assume the availability of the latest makespan of the application, and we set the fixed-period rescheduling cycle at 10% of the value of the makespan.
5.3 Simulation Results
We created a number of test scenarios to evaluate the performance of GTP/c. A scenario involves a sequence of randomly defined (but repeatable) events, each simulating a resource change in either processor or bandwidth availability. Our scenarios are distinguished by the bound placed on the maximum variation allowed in one event, expressed as a percentage of the peak performance of a resource. For example, in a scenario with a bound of 30%, any one event can cause the availability of a processor to decrease to no less than 70% of its peak performance, or of a link to decrease to no less than 70% of its maximum bandwidth. We experimented with bounds ranging from 0% to 90% in increments of 10%. Our graphs embrace the whole spectrum of bounds. For reasons of space we report here only the results for SCE20-1000 (20 processors and 1000 tasks). Note that scenarios with 0% variability (no resource variation) allow us to investigate the extent to which emerging discrepancies between real and predicted behaviour are handled by GTP/c.
The experimental results show that GTP/c outperforms HEFT, GTP and DLS/sr in most cases. Exceptions are limited to DAGs with few tasks (mainly 50 and 100 tasks) and low variability in Grid resources. GTP/c performs better particularly when the application is computation and data intensive (i.e. 500 and 1000 tasks). This is because the number of copies will tend to increase and the migrated tasks will have several possible sources from which to retrieve the information. In SCE20-1000, when the variability is 10%, the average NSL of GTP/c improves on that of HEFT by up to 15% and, as shown in figure 3, the improvement tends to increase considerably as the variability increases, due to the static nature of HEFT. GTP/c outperforms GTP by up to 7% and DLS/sr by up to 10%, with increasing improvement as variability increases.
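The scenario construction described at the start of this section can be sketched in Python as follows. The function name `make_scenario`, the uniform distributions and the event layout `(time, resource, level)` are assumptions for illustration; the text only states that events are random but repeatable and that the variation in one event is bounded.

```python
import random


def make_scenario(resources, bound, n_events, horizon, seed):
    """Generate a repeatable list of resource-change events.

    resources: list of processor or link identifiers
    bound:     maximum variation per event as a fraction of peak (0.3 for 30%)
    Each event sets one resource to a level in [1 - bound, 1] of its peak.
    """
    rng = random.Random(seed)  # a fixed seed makes the scenario repeatable
    events = []
    for _ in range(n_events):
        time = rng.uniform(0.0, horizon)
        resource = rng.choice(resources)
        level = rng.uniform(1.0 - bound, 1.0)
        events.append((time, resource, level))
    events.sort()  # replay the events in chronological order
    return events


# Ten scenarios with bounds 0%, 10%, ..., 90%, as in the experiments above.
resources = ["p1", "p2", "link_p1_p2"]  # hypothetical identifiers
scenarios = {
    pct: make_scenario(resources, pct / 100.0, n_events=20, horizon=1000.0, seed=pct)
    for pct in range(0, 100, 10)
}
```

With a bound of 0% every event leaves the resource at its peak level, which corresponds to the scenario without resource variation.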
In general terms, factors such as ignoring traffic contention, variability of resources, number of tasks and even the shape of the DAGs tend to affect the cost predictions of tasks. In our evaluation, DLS/sr is the dynamic model most affected by such factors, as they affect the prediction of the spare time of tasks. A natural consequence is that the number of remappings will increase, increasing the number of migrated tasks, increasing the overhead cost and finally increasing the makespan of the application. This can be observed in the graphs showing the average number of remappings, migrated tasks and overhead cost.
[Figure 3: four panels for SCE (20 processors, 1000 tasks), each plotted against % of variability in Grid resources (0 to 90): average NSL (HEFT, GTP/c, GTP, DLS/sr), average remappings, average migrated tasks and average overhead cost (GTP/c, GTP, DLS/sr).]
Fig. 3. Results for GTP/c, GTP, HEFT and DLS/sr
References
1. Hernandez, I., Cole, M.: Reactive Grid Scheduling of DAG Applications. In: Proceedings of 25th IASTED (PDCN), pp. 92–97. Acta Press (2007)
2. Sakellariou, R., Zhao, H.: A low-cost rescheduling policy for efficient mapping of workflows on grid systems. Scientific Prog. SPR 12(4), 253–262 (2004)
3. Topcuoglu, H.: Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing. Trans. Parallel Dist. Syst. 13(3), 260–274 (2002)
4. The Simgrid project, http://simgrid.gforge.inria.fr/
5. The Standard Graph Project, http://www.kasahara.elec.waseda.ac.jp/schedule/
6. Sih, G., Lee, E.: A compile-time scheduling heuristic for interconnection-constrained heterogeneous processor architectures. Trans. Parallel Dist. Syst. 4(2), 175–187 (1993)
Alea – Grid Scheduling Simulation Environment
Dalibor Klusáček, Luděk Matyska, and Hana Rudová
Faculty of Informatics, Masaryk University, Botanická 68a, 602 00 Brno, Czech Republic
{xklusac, ludek, hanka}@fi.muni.cz
Abstract. This work concentrates on the design of a system intended for the study of advanced scheduling techniques for planning various types of jobs in a Grid environment. The solution is able to deal with common problems of job scheduling in Grids, such as heterogeneity of jobs and resources, and dynamic runtime changes such as arrivals of new jobs. Our new simulator, called Alea, is based on the GridSim simulation toolkit, which we extended to provide a simulation environment that supports simulation of varying Grid scheduling problems. To demonstrate the features of the GridSim environment, we implemented an experimental centralised Grid scheduler which uses advanced scheduling techniques for schedule generation. So far, local search based algorithms and some dispatching rules have been tested. The scheduler is capable of handling both static and dynamic situations. In the static case, all jobs are known in advance, while in the dynamic situation jobs appear in the system during simulation. In the latter case the generated schedule changes over time as some jobs finish while new ones arrive. A comparison of FCFS, local search and dispatching rules is presented for both cases, and we demonstrate that the new local search based algorithm provides the best schedule while keeping the running time acceptable.
Keywords: Grid scheduling, Local search, Dispatching rules, Simulation with GridSim.
1 Introduction
Grid [9] is generally understood as a distributed and heterogeneous computing system with various types of resources. On the one hand it is important to maximise their usage; on the other hand it is also desirable to provide nontrivial quality of service to the users and their applications. Unlike frequently used ad-hoc simulators [2,12,24,3], we propose a complex and extensible simulation environment that is able to model various situations, such as different types of jobs and applications or different Grid topologies, and then evaluate proposed solutions and algorithms. The goal is to use the simulator to design Grid schedulers with differing scheduling techniques and test their behaviour in different but fully controlled conditions.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1029–1038, 2008. © Springer-Verlag Berlin Heidelberg 2008
We apply known advanced scheduling techniques [11,5,22] to the Grid scheduling problem [8]. Currently we are focusing on local search based algorithms [11] and dispatching rules [13,22]. Various techniques were applied in related problems such as job scheduling on a single machine [20,17], identical parallel machines [24,2] or heterogeneous parallel machines [21,3]. Unfortunately, these solutions are usually tested only in the static situation, i.e., when all the jobs are known to the scheduler before execution. One of our main interests is the dynamic aspect of scheduling [25,4,16], i.e., scheduling of jobs arriving during the system run [3,12], sometimes referred to as incremental scheduling [16] or on-line reasoning [4].
Using the extended GridSim environment we can simulate scheduling and execution of different types of non-preemptive jobs in both static and dynamic fashion on resources composed of parallel and heterogeneous machines. Administrators' demands on resource utilisation can be satisfied by makespan minimisation, and user requirements can be handled through optimisation of the total tardiness of all jobs. The simulation environment allows easy comparison of the scheduling algorithms. Among other algorithms we have implemented local search algorithms [15,14] for both the static and dynamic problems. The typical use of local search for static problems is associated with high computational cost. Our experiments show that this cost can be significantly reduced for dynamic problems while its optimisation performance is preserved. This seems to be very interesting since we are not aware of any relevant work in the area of local search for dynamic problems¹.
The structure of this paper is as follows. The next section describes the simulation toolkit we used as the basis for the Alea simulator and the extensions we have made, and presents the design of the scheduler. The following section describes the algorithms used to create and optimise the schedule. Next we present some experimental results, and the last section concludes our research and discusses future work.
2 Characteristics of the Alea Simulator
There are numerous Grid simulators that provide various functionality. Bricks [23] is designed for simulations of client–server architectures in Grids; SimGrid [18] is used for the simulation and development of distributed applications in heterogeneous and distributed environments. Simbatch [10] allows the evaluation of scheduling algorithms for batch schedulers, and MicroGrid [19] can be used for a systematic study of the dynamic behaviour of applications, middleware, resources, and networks. As the basis of our simulator we use the Java-based simulation toolkit GridSim [6]. This toolkit is flexible and universal, and it has very good documentation. It provides functionality to simulate the basic Grid environment and its behaviour. GridSim provides simple implementations of common entities such
¹ Our investigation, and questions addressed to the Institute on Resource Management and Scheduling of the European CoreGRID Network of Excellence seeking information about previous applications of local search-based algorithms to dynamic Grid scheduling problems, did not yield any relevant response.
Alea – Grid Scheduling Simulation Environment
as computational resources or users, and also allows the simulation of simple jobs, network topology, data storage and other useful functionality. However, the provided implementations are too simple, and it was necessary to extend these entities to fit more complex requirements. This is done by implementing a new Java class that inherits from the existing GridSim class. New functions can then be implemented or modified, and new parameters added, to provide the desired functionality of the new entity. Since all important components such as the Grid resource, the job, etc., are defined in separate classes, it is very easy to modify them or create new ones. We have developed new specialised entities such as the centralised scheduler, jobs with special parameters, and a submission system with dynamically appearing jobs, which allow us to build complex dynamic simulations of the Grid environment. A decentralised scheduler implemented in GridSim exists [1], but it does not allow dynamic behaviour of the system and, because it is built on an older version of GridSim that is incompatible with newer versions, it cannot simulate network topology and other recent features. The following text describes our solution and its main features.

2.1 Description
The Alea simulator² is modular, composed of independent entities which correspond to the real world. It consists of the centralised scheduler, the job submission system, and the Grid resources. These entities communicate with each other by message passing. Currently, Grid users are not directly simulated; instead, a job generator attached to the job submission system is used to simulate job arrivals.

The job submission system stores jobs before and after they are executed and communicates with the scheduler to get a scheduling strategy, which it then uses to select a resource to execute a job.

The job generator is attached to the job submission system and is used to simulate job arrivals. It generates new synthetic jobs that appear during the simulation run. The job arrival times follow the selected statistical distribution. Currently we support the uniform and normal distributions, but it is easy to add new distributions or real workload data.

The scheduler is responsible for schedule generation and further optimisation. In the dynamic situation this schedule may change over time as some jobs finish while new ones appear. Since it is a standalone entity, it has to be able to communicate with other entities, mainly with the job submission system. To keep the scheduler extensible, it was designed as a modular entity composed of three main parts. The first part is responsible for communication with the job submission system. The second part stores dynamic information about each Grid resource, such as the jobs currently being executed or the prepared schedule. It also implements functions that approximate the makespan, tardiness, and other values important for the scheduling process. This information is used by the third part of the scheduler, i.e., by the scheduling algorithms.

² Alea can be downloaded from http://www.fi.muni.cz/~xklusac/alea
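The job generator described above can be illustrated with a minimal sketch. It is not Alea's Java implementation: the function name, its parameters, and the choice to draw the gaps between consecutive arrivals (rather than the arrival times themselves) from the selected distribution are assumptions made for illustration.

```python
import random


def generate_arrival_times(n_jobs, distribution="uniform", seed=0,
                           low=1.0, high=10.0, mean=5.0, stddev=1.0):
    """Return n_jobs non-decreasing arrival times (hypothetical helper).

    Assumption: the gap between two consecutive arrivals is drawn from the
    selected distribution; all parameter names are illustrative only.
    """
    rng = random.Random(seed)
    now = 0.0
    arrivals = []
    for _ in range(n_jobs):
        if distribution == "uniform":
            gap = rng.uniform(low, high)
        elif distribution == "normal":
            # A normal sample may be negative; clamp it so time never runs backwards.
            gap = max(0.0, rng.gauss(mean, stddev))
        else:
            raise ValueError(f"unsupported distribution: {distribution}")
        now += gap
        arrivals.append(now)
    return arrivals
```

Adding another distribution in this sketch means adding one more branch; replaying real workload data would replace the sampling step with values read from a trace.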
Each Grid resource is responsible for job execution. The resource is selected by the scheduler, and the job is then submitted by the job submission system. Completed jobs are returned to the job submission system.

2.2 Communication Scheme
Figure 1 shows the common communication scheme between the job submission system, the scheduler, and one Grid resource. The job submission system submits job descriptions to the scheduler. The scheduler uses the list of all available Grid resources and their parameters, such as the number of CPUs and their rating. The scheduler is centralised; it therefore has information about, and access to, all available resources in the system. On the basis of this information the scheduler generates a separate schedule for each Grid resource. Using these schedules, together with information about the jobs currently in execution, the scheduler is able to approximate various parameters of the schedule, such as the makespan or the expected tardiness of the jobs, before their execution and completion. This allows the scheduler to compare two different schedules with respect to the optimisation criteria and select the better one.
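The comparison of two schedules described above can be sketched as follows. This is a simplified illustration, not Alea's code: each resource is assumed to run its jobs one after another at a speed given by its rating (multi-CPU resources are ignored), and the names `Job`, `completion_times`, `makespan`, `total_tardiness` and `better` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    length: float   # computational length
    release: float  # release date: the job cannot start earlier
    due: float      # due date: finishing later counts as tardiness


def completion_times(jobs, rating):
    """Estimate when each job finishes if the jobs run back to back."""
    now = 0.0
    result = []
    for job in jobs:
        now = max(now, job.release) + job.length / rating
        result.append((job, now))
    return result


def makespan(schedule, ratings):
    """Latest estimated completion time over all resources."""
    ends = [done for r, jobs in schedule.items()
            for _, done in completion_times(jobs, ratings[r])]
    return max(ends, default=0.0)


def total_tardiness(schedule, ratings):
    """Sum of the estimated lateness of every job past its due date."""
    return sum(max(0.0, done - job.due)
               for r, jobs in schedule.items()
               for job, done in completion_times(jobs, ratings[r]))


def better(schedule_a, schedule_b, ratings, criterion=total_tardiness):
    """Return whichever schedule scores lower on the chosen criterion."""
    if criterion(schedule_a, ratings) <= criterion(schedule_b, ratings):
        return schedule_a
    return schedule_b
```

Passing `makespan` as the `criterion` argument switches the comparison from the user-oriented objective to the administrator-oriented one without changing the rest of the sketch.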
Fig. 1. Communication scheme between the job submission system, the scheduler and the Grid resources
According to the constructed schedule and the current simulation time, the scheduler responds to the submission system with scheduling information, i.e., it indicates which resource was selected to execute the job. It is important to note that this communication is asynchronous. The job submission system does not wait for the response from the scheduler and sends job descriptions as new jobs become available from the job generator. The scheduler, in turn, sends the scheduling information to the job submission system according to the current load of the resources. The scheduler works with the
job description while the job itself stays with the job submission system. This saves the scheduler's network bandwidth and prevents the scheduler from becoming a bottleneck of the whole system. Once a job is finished, the job submission system sends an acknowledgement message to the scheduler. The scheduler then updates its internal information about the current resource load. This is also a trigger to check whether another job should be sent to the resource to prevent it from being idle.

2.3 Extensibility
Since the Alea simulator is modular and the main functionality of the scheduler is divided into separate parts, it is easy to simulate different types of jobs, scheduling algorithms or optimisation criteria by making small changes to the existing simulator. For example, to test a new scheduling algorithm we modify only the appropriate class. To schedule a different type of job we change only the job generator and possibly the corresponding objective function in the scheduler. The rest of the classes stay intact, so the experiments can be repeated with exactly the same setup. The changes are encapsulated and the results are easily comparable. Alea's behaviour is event-driven: the active entities such as the job submission system, the scheduler and the resources are simulated as independent interacting entities, which brings this solution closer to the real world. The simulation environment is therefore ready for future extensions, e.g., adding network simulation or fault tolerance.
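The idea of exchanging only the scheduling algorithm while the rest of the setup stays intact can be sketched as below. This is a hypothetical Python illustration, not Alea's Java class structure: the `QueueScheduler` class and the three policy functions are assumed names, and the policies shown are simple queue orderings by submission order (First Come First Served), release date, and due date.

```python
from typing import Callable, NamedTuple, Optional


class Job(NamedTuple):
    name: str
    arrival: int    # order in which the job was submitted
    release: float  # release date
    due: float      # due date


# A policy maps a job to a sort key; the job with the smallest key runs next.
Policy = Callable[[Job], float]


def fcfs(job: Job) -> float:
    return job.arrival


def erd(job: Job) -> float:
    return job.release


def edd(job: Job) -> float:
    return job.due


class QueueScheduler:
    """Queue whose ordering is decided entirely by the injected policy."""

    def __init__(self, policy: Policy):
        self.policy = policy
        self.queue = []

    def submit(self, job: Job) -> None:
        self.queue.append(job)

    def next_job(self) -> Optional[Job]:
        if not self.queue:
            return None
        job = min(self.queue, key=self.policy)
        self.queue.remove(job)
        return job
```

Because the policy is the only part that differs between runs, the same job set can be replayed under each policy and the results compared directly.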
3 Problem Description and Algorithms
The Grid scheduling problem [8] is generally defined by a set of resources (typically machines, storage, memory, network, etc.), a set of tasks, an optimality criterion, an environmental specification, and other constraints. The Grid is also a typical example of a dynamic environment. The problem complexity strongly depends on the machine, job, and environmental characteristics. We understand a machine as a computational resource with one or more CPUs. The machine environment may consist of a single machine, identical parallel machines, parallel machines with different speeds, etc. Machines may also become unavailable (breakdown). Job parameters are also very important. Jobs have a computational length (processing time), may require one or more CPUs, and may have various release dates, due dates, or priorities (weights). Some jobs may be migrated during execution from one resource to another (preemption) or may have the precedence constraints typical of workflows or parallel DAG applications. Some jobs require only a specific kind of machine (machine–job suitability). The problem complexity rises further with additional parameters such as Grid network topology, network bandwidth requirements, or fault tolerance. The goal of scheduling is to satisfy the demands of users and system administrators, e.g., to minimise the total or weighted tardiness of the jobs or to minimise the makespan. Currently we consider a system with no failures such as resource failure or job loss. The system is composed of parallel multiprocessor machines. Different
machines may have different CPU ratings. We also assume that the computational length of a job is known prior to its execution and that the job requires one or more CPUs for its run. We simulate the static situation, when all jobs are known in advance, as well as the dynamic situation, when new jobs appear in the system during execution, so that the generated schedule changes over time as some jobs finish while new ones arrive. In such a case the scheduler has to create the schedule incrementally. Our scheduler does not yet support job preemption or job migration. We have implemented various algorithms to minimise the makespan, the total tardiness, or the number of delayed jobs. Different queue-based policies such as First Come First Served (FCFS), Earliest Release Date (ERD), and Earliest Due Date (EDD) [13] were implemented. We also proposed and implemented various algorithms based on the construction of a global schedule, such as composite dispatching rules, and local search-based algorithms such as Tabu search [15,14], Hill climbing, and Simulated annealing [11]. We concentrate on some of the interesting results for total tardiness minimisation that demonstrate the flexibility of the simulator.

3.1 Dispatching Rules and Local Search
We developed composite dispatching rules [14] that select a proper resource in the first step and then use a common dispatching rule to put the new job into the existing schedule of this resource. The Minimum Tardiness Earliest Due Date (MTEDD) dispatching rule selects the resource where the overall tardiness rises the least after the new job is added. In that resource's schedule the job is placed according to its due date, so jobs with earlier due dates are executed earlier. Generally this leads to minimisation of the tardiness of this job. Another option is to apply the Minimum Tardiness Earliest Release Date (MTERD) dispatching rule, where the job is placed into the resource's schedule with respect to its release date.

When local search optimisation is used, an initial schedule is always required. The scheduler uses two steps to create the final schedule. The MTEDD (or MTERD) dispatching rule is used to generate the initial schedule; then the Tabu search [14] algorithm is applied to optimise it. The algorithm moves delayed jobs from one resource's schedule to another and checks whether the move is an improvement. The moved job is placed into the tabu list to prevent the algorithm from cycling. The Tabu search finishes when no job in the schedule is expected to be delayed. It also finishes after a fixed number of non-improving moves or when the upper bound on the number of iterations is reached. A detailed description of the dispatching rules and local search algorithms can be found in [15,14].
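The two steps above can be sketched in a simplified form. This is an illustration under stated assumptions, not the algorithm of [15,14]: each resource runs its jobs sequentially at a speed given by its rating, job names are unique, schedules are built only through `insert_edd`, and the search also stops when every delayed job is currently tabu. All function names and default parameter values are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    length: float   # computational length
    release: float  # release date (hard constraint)
    due: float      # due date (soft constraint)


def finish_times(jobs, rating):
    """Yield (job, estimated completion time) for jobs run back to back."""
    now = 0.0
    for job in jobs:
        now = max(now, job.release) + job.length / rating
        yield job, now


def tardiness(jobs, rating):
    return sum(max(0.0, done - job.due) for job, done in finish_times(jobs, rating))


def total_tardiness(schedule, ratings):
    return sum(tardiness(jobs, ratings[r]) for r, jobs in schedule.items())


def insert_edd(jobs, job):
    """Return a copy of an EDD-ordered list with job inserted by due date."""
    pos = sum(1 for other in jobs if other.due <= job.due)
    return jobs[:pos] + [job] + jobs[pos:]


def mtedd(schedule, ratings, job):
    """Add job to the resource whose tardiness rises least; return that resource."""
    def increase(r):
        return (tardiness(insert_edd(schedule[r], job), ratings[r])
                - tardiness(schedule[r], ratings[r]))
    target = min(schedule, key=increase)
    schedule[target] = insert_edd(schedule[target], job)
    return target


def tabu_search(schedule, ratings, max_iterations=100,
                max_non_improving=10, tabu_size=5):
    """Move delayed jobs between resources while total tardiness improves."""
    tabu = deque(maxlen=tabu_size)  # recently moved jobs may not move again
    non_improving = 0
    for _ in range(max_iterations):
        delayed = [(r, job) for r, jobs in schedule.items()
                   for job, done in finish_times(jobs, ratings[r])
                   if done > job.due and job not in tabu]
        if not delayed or non_improving >= max_non_improving:
            break
        source, job = delayed[0]
        best_target, best_value = None, total_tardiness(schedule, ratings)
        for target in schedule:
            if target == source:
                continue
            trial = dict(schedule)
            trial[source] = [j for j in schedule[source] if j != job]
            trial[target] = insert_edd(schedule[target], job)
            value = total_tardiness(trial, ratings)
            if value < best_value:
                best_target, best_value = target, value
        if best_target is None:
            non_improving += 1
        else:
            schedule[source] = [j for j in schedule[source] if j != job]
            schedule[best_target] = insert_edd(schedule[best_target], job)
            non_improving = 0
        tabu.append(job)
    return schedule
```

In this sketch every candidate move re-evaluates the total tardiness of the whole schedule, which is the kind of repeated calculation that makes a local search noticeably slower than a single dispatching-rule insertion.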
4 Results and Discussion
The experiments were performed on an Intel Pentium 4 2.6 GHz machine with 512 MB RAM. The tests were run for different numbers of available machines with different CPU ratings. Each machine has 4 CPUs. Tests were done for two
to eight machines available in the Grid. The single-machine situation was not tested because Tabu search only moves jobs between different schedules. We ran tests for 1000 jobs with release dates and due dates. The release date is taken as a hard constraint, in the sense that the job cannot start its execution earlier, while the due date is a soft constraint which may be violated. The data sets were generated synthetically and results were averaged over 20 different data sets. The scheduler optimises the total tardiness of all jobs. In the following, we compare the results of FCFS, the MTEDD dispatching rule, the MTERD dispatching rule, and their Tabu search optimisation. Figure 2 shows the total tardiness and the time required to generate the schedule when using FCFS, MTEDD, MTERD, or their Tabu search extensions in the static situation (all the jobs were known to the scheduler before the start of the scheduling process). In the dynamic case the simulated environment was identical to that of the static simulation, but the jobs were submitted to the system during execution. The time intervals between job arrivals were generated from the uniform distribution. In the present experiment, the Tabu search optimisation with a fixed number of iterations was performed in three different ways. In the first scenario the Tabu search was run for each newly arrived job. In the other two
Fig. 2. Total tardiness and total schedule generation time for FCFS, MTEDD, MTERD and Tabu search in static situation
Fig. 3. Total tardiness and total schedule generation time for FCFS, MTEDD, MTERD and Tabu search in dynamic situation
cases the Tabu search was run only after every five or ten jobs. Figure 3 shows the total tardiness and the time required to generate the schedule when using FCFS, MTEDD, MTERD, and Tabu search applied to the initial MTERD schedule.

4.1 Discussion
It is clear that the total tardiness strongly depends on the number of available machines. We can see that the Tabu search makes improving moves in all tested situations. On the other hand, the basic methods (FCFS, MTEDD and MTERD) are much faster than the Tabu search in both the static and the dynamic situation. The static MTERD schedule is not very good, but the Tabu search is able to achieve a more significant improvement in the value of the objective function. Figure 3 also shows that the total time required to create the schedule (i.e., the sum of all times required to include a new job into the existing schedule) rises when the Tabu search is used for each newly arriving job, in contrast to the situation when it is performed only after every five or ten new jobs. The Tabu search is much slower than MTEDD because it makes more complex changes to the existing schedule and computes the expected total tardiness. This value has to be calculated many times, which makes the Tabu search much more time-consuming than MTEDD. The Tabu search combined with MTERD was the best approach with respect to total tardiness minimisation in comparison with the other algorithms applied. When the Tabu search optimisation is run only after every five or ten jobs, the increase in total time is acceptable while the resulting schedule is still very good among all the compared schedules. As expected, the time required to perform Tabu search in the dynamic situation becomes stable with the growing number of available machines. We used the same number of jobs for all the simulations, i.e., with more machines the size of the schedule decreases because previously scheduled jobs are already finished and the algorithm works with a smaller number of jobs in each run. Naturally, in such a case the calculation of the expected tardiness of the yet unfinished jobs on one machine is faster, but on the other hand it has to be calculated for a higher number of machines.
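The effect of running the optimisation only after every five or ten arrivals can be made concrete with a small sketch. It only counts how often the optimiser is invoked; the `place` and `optimise` callbacks stand in for a dispatching rule and a Tabu search run, and all names are hypothetical rather than taken from Alea.

```python
def run_dynamic(arrivals, place, optimise, period=1):
    """Place every arriving job and optimise after each `period` arrivals.

    `place` and `optimise` are caller-supplied callbacks (assumed interface).
    Returns the number of optimiser runs that were triggered.
    """
    runs = 0
    for count, job in enumerate(arrivals, start=1):
        place(job)           # e.g. a dispatching rule inserts the new job
        if count % period == 0:
            optimise()       # e.g. one local search run over the schedule
            runs += 1
    return runs
```

For the 1000 jobs used in the experiments, a period of 1, 5 or 10 gives 1000, 200 or 100 optimiser runs respectively, which is consistent with the lower total scheduling time reported for the less frequent scenarios.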
5 Conclusion and Future Work
In our research we proposed an extensible simulation environment, Alea, useful for designing and testing scheduling algorithms for some typical Grid scenarios. Using this environment, a new centralised scheduler with different scheduling algorithms was implemented and its behaviour was evaluated. The tested Tabu search extensions of the MTEDD and MTERD methods demonstrated the added value of local search in dynamic cases. As part of the evaluation of the new simulation environment, the easy combination of scheduling algorithms in our environment allowed us to identify a promising direction in the development of new efficient local search algorithms applicable to Grid scheduling. Further study of local search-based algorithms designed to solve dynamic problems is the main subject of our current work, since it is a new, unexplored area [15,14].
Our future work will, on the one hand, focus on extensions of the simulation environment, e.g., the addition of network topology and support for different job types including preemptive and priority jobs, workflows, etc. We will also be able to model job migration or the estimated running time of a job. The environment will be further extended to simulate different failures (resource or job crash, network disconnection, etc.) to test the robustness of scheduling algorithms. On the other hand, we plan to use this environment to implement and evaluate different scheduling algorithms and technologies, such as Backfilling or Convergent Scheduling [7].

Acknowledgments. This work was supported by the Ministry of Education, Youth and Sports of the Czech Republic under the research intent No. 0021622419 and by the Grant Agency of the Czech Republic with grant No. 201/07/0205.
References
1. Abramson, D., Buyya, R., Murshed, M., Venugopal, S.: Scheduling parameter sweep applications on global Grids: A deadline and budget constrained cost-time optimisation algorithm. International Journal of Software: Practice and Experience (SPE) 35(5), 491–512 (2005)
2. Armentano, V.A., Yamashita, D.S.: Tabu search for scheduling on identical parallel machines to minimize mean tardiness. Journal of Intelligent Manufacturing 11, 453–460 (2000)
3. Baraglia, R., Ferrini, R., Ritrovato, P.: A static mapping heuristics to map parallel applications to heterogeneous computing systems: Research articles. Concurrency and Computation: Practice and Experience 17(13), 1579–1605 (2005)
4. Bidot, J.: A General Framework Integrating Techniques for Scheduling under Uncertainty. PhD thesis, Institut National Polytechnique de Toulouse, France (2005)
5. Brucker, P.: Scheduling Algorithms. Springer, Heidelberg (1998)
6. Buyya, R., Murshed, M.: GridSim: A toolkit for the modeling and simulation of distributed resource management and scheduling for Grid computing. The Journal of Concurrency and Computation: Practice and Experience (CCPE) 14, 1175–1220 (2002)
7. Capannini, G., Baraglia, R., Puppin, D., Ricci, L., Pasquali, M.: A job scheduling framework for large computing farms. In: SC 2007 International Conference for High Performance Computing, Networking, Storage and Analysis (to appear, 2007)
8. Fibich, P., Matyska, L., Rudová, H.: Model of Grid Scheduling Problem, Exploring Planning and Scheduling for Web Services, Grid and Autonomic Computing, Papers from the AAAI 2005 workshop. Technical Report WS-05-03, AAAI Press (2005)
9. Foster, I., Kesselman, C.: The Grid 2: Blueprint for a New Computing Infrastructure, 2nd edn. Morgan Kaufmann, San Francisco (2004)
10. Gay, J.-S., Caniou, Y.: Simbatch: an API for simulating and predicting the performance of parallel resources and batch systems. Research Report 6040, INRIA (2006)
11. Glover, F.W., Kochenberger, G.A. (eds.): Handbook of Metaheuristics. Kluwer, Dordrecht (2003)
12. He, X., Sun, X., von Laszewski, G.: QoS guided min-min heuristic for Grid task scheduling. Journal of Computer Science and Technology 18(4), 442–451 (2003)
13. Ho, N.B., Tay, J.C.: Evolving dispatching rules for solving the flexible job-shop problem. In: Proceedings of the IEEE Congress on Evolutionary Computation, pp. 2848–2855 (2005)
14. Klusáček, D., Matyska, L., Rudová, H.: Local Search for Deadline Driven Grid Scheduling. In: Third Doctoral Workshop on Mathematical and Engineering Methods in Computer Science (MEMICS 2007), pp. 74–81 (2007)
15. Klusáček, D., Matyska, L., Rudová, H.: Local Search for Grid Scheduling. In: Doctoral Consortium at the International Conference on Automated Planning and Scheduling (ICAPS 2007), Providence, RI, USA (2007)
16. Kocjan, W.: Dynamic scheduling state of the art report. Technical report, SICS Technical Report T2002:28 (2002)
17. Kolliopoulos, S.G., Steiner, G.: On minimizing the total weighted tardiness on a single machine. In: Diekert, V., Habib, M. (eds.) STACS 2004. LNCS, vol. 2996, pp. 176–186. Springer, Heidelberg (2004)
18. Legrand, A., Marchal, L., Casanova, H.: Scheduling distributed applications: the SimGrid simulation framework. In: Proceedings of the Third IEEE International Symposium on Cluster Computing and the Grid, pp. 138–145. IEEE Computer Society, Los Alamitos (2003)
19. Liu, X., Xia, H., Chien, A.: Validating and scaling the MicroGrid: A scientific instrument for Grid dynamics. The Journal of Grid Computing 2(2), 141–161 (2004)
20. Loganantharaj, R., Thomas, B.: An overview of a synergetic combination of local search with evolutionary learning to solve optimization problems. In: IEA/AIE 2000: Proceedings of the 13th international conference on Industrial and engineering applications of artificial intelligence and expert systems, pp. 129–138. Springer, Heidelberg (2000)
21. Mathirajan, M., Sivakumar, A.I.: Minimizing total weighted tardiness on heterogeneous batch processing machines with incompatible job families. The International Journal of Advanced Manufacturing Technology 28, 1038–1047 (2006)
22. Pinedo, M.: Planning and Scheduling in Manufacturing and Services. Springer, Heidelberg (2005)
23. Takefusa, A., Matsuoka, S., Aida, K., Nakada, H., Nagashima, U.: Overview of a performance evaluation system for global computing scheduling algorithms. In: HPDC 1999: Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing, USA, pp. 97–104. IEEE, Los Alamitos (1999)
24. van den Akker, J.M., Hoogeveen, J.A., van Kempen, J.W.: Parallel machine scheduling through column generation: Minimax objective functions. In: Diekert, V., Habib, M. (eds.) STACS 2004. LNCS, vol. 2996, pp. 648–659. Springer, Heidelberg (2004)
25. Verfaillie, G., Jussien, N.: Constraint solving in uncertain and dynamic environments: A survey. Constraints 10(3), 253–281 (2005)
Cost Minimisation in Unbounded Multi-interface Networks
Adrian Kosowski¹ and Alfredo Navarra²
¹ Department of Algorithms and System Modeling, Gdańsk University of Technology, Narutowicza 11/12, 80952 Gdańsk, Poland [email protected]
² Dipartimento di Matematica e Informatica, Università degli Studi di Perugia, Via Vanvitelli 1, 06123 Perugia, Italy [email protected]
Abstract. Given a graph G = (V, E) with |V| = n and |E| = m, which models a set of wireless devices (nodes V) connected by multiple radio interfaces (edges E), the aim is to switch on the minimum cost set of interfaces at the nodes in order to satisfy all the connections. A connection is satisfied when the endpoints of the corresponding edge share at least one active interface. Every node holds a subset of all the possible k interfaces. The problem is called Cost Minimisation in Unbounded Multi-Interface Networks, and in [1] the case with bounded k was studied. In this paper we generalise the model by considering the unbounded version of the problem, i.e., k is not set a priori but depends on the given instance. We distinguish two main variations of the problem by treating the cost of maintaining an active interface as uniform (i.e., the same for all interfaces) or non-uniform. In general, we prove that the problem is not approximable within O(log k), while it admits a $\min\{\frac{k+1}{2}, \frac{2m}{n}\}$-approximation for the uniform case and a $\min\{k-1, \sqrt{n(1+\ln n)}\}$-approximation for the non-uniform case. Next, we also provide hardness and approximation results for several classes of networks: graphs with bounded degree, trees, planar graphs and complete graphs.

Keywords: energy saving, wireless network, multi-interface network, approximation algorithm.
1 Introduction
Cost Minimisation in multi-interface networks was first addressed in [1,2]. It is motivated by the advances in, and the growing power of, present-day devices, which require special effort in managing new kinds of communication problems. Wireless devices, in fact, may hold multiple radio interfaces, allowing switching
The research was partially funded by the European project COST Action 293, “Graphs and Algorithms in Communication Networks” (GRAAL). The research was done while the author was a Post-Doc at the LaBRI-University of Bordeaux, France.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1039–1047, 2008. © Springer-Verlag Berlin Heidelberg 2008
A. Kosowski and A. Navarra
from one communication network to another according to the required connectivity and QoS. The selection of the “best” radio interface for a specific connection is a challenging issue and may depend on various factors, namely its availability in specific devices, the required communication bandwidth, the cost (in terms of energy consumption) of maintaining an active interface, the available neighbours, and so forth. Due to the wireless environment and the types of devices that may meet in such networks, one of the most important parameters to consider when setting up the right connections is energy consumption. Devices are usually battery-powered, and the network survivability might depend on their persistence in the network. This introduces a natural optimisation problem that must take several variables into account at the same time. Generally speaking, given a graph G = (V, E), where V represents the set of wireless devices and E the set of required connections according to the proximity of devices and the available interfaces that they may share, the problem can be stated as follows: which subset of the available interfaces at each node must be activated in order to satisfy (cover) all the connections described by E while minimising the overall cost? Note that a connection is satisfied when the endpoints of the corresponding edge share at least one active interface. Moreover, for each node v ∈ V there is a set of available interfaces, from now on denoted by I(v). The union $\bigcup_{v \in V} I(v)$ determines the set of all the possible interfaces available in the network, whose cardinality is denoted by k. An example of a feasible network is shown in Figure 1.
Fig. 1. The composed network according to available interfaces and proximities
The problem is called Cost Minimisation in Unbounded Multi-Interface Networks (CMI for short). In this paper, we study the complexity of CMI in various scenarios. CMI turns out to be a very hard problem in general, hence we also consider possible approximation algorithms. We deal with two main variations of the problem: the case in which the cost of activating an interface is the same for
each interface, and the more general case in which such a cost may be different. Indeed, the first model is equivalent to asking for the minimum total number of activated interfaces in the network needed to cover all the connections. We also consider different graph classes that are of interest from both theoretical and practical points of view, namely: graphs with bounded degree, since in real-world scenarios users are normally connected to a limited number of nodes; planar graphs, since the graph induced by users joining a network is likely to be planar; trees, since middleware strategies are heavily based on this kind of structure (see for instance [3]); and complete graphs, since this is one of the main structures used for modelling P2P networks (see for instance [4]). Despite its natural importance, to date there appear to be no known complexity results for CMI. In [1] only the bounded version of the problem was studied, where the number k of interfaces is bounded a priori. Since present-day devices support many different interfaces, it makes sense not to assume that the number of interfaces that may occur in a composed network is given. It might, in fact, depend on the number of nodes participating in the network. The problem originated from [2], where a slightly different model of CMI is introduced. That model also considers the possibility of mutually exclusive interfaces, i.e., interfaces that, if activated, preclude the activation of some other interfaces. The motivation is quite technical; for instance, the WiFi interface can operate in different modes: Infrastructure and Ad-Hoc. If a device activates WiFi in Infrastructure mode, it cannot satisfy connections that require Ad-Hoc mode, and vice versa. In this paper we do not introduce this further restriction, since the problem is already of practical relevance and not easily solvable. Table 1 summarises our results. As shown, some of the results follow directly from [1].
Table 1. Hardness and approximability of the CMI problem

Graph class | Complexity of CMI, non-uniform costs | Complexity of CMI, uniform costs
General graphs | $(k-1)$-approx [1]; $\sqrt{n(1+\ln n)}$-approx; not approx within $O(\log k)$ | $\frac{k+1}{2}$-approx [1]; $\frac{2m}{n}$-approx [1]; not approx within $O(\log k)$
Graphs of bounded Δ | Δ-approx [1]; APX-hard for Δ ≥ 5, k ≥ 3 [1] | $\frac{\Delta+1}{2}$-approx [1]; APX-hard for Δ ≥ 5, k ≥ 3 [1]
Planar graphs | 6-approx; not approx within 2 − ε | 6-approx; not approx within 2 − ε
Trees | 2-approx; not approx within 2 − ε | 2-approx; not approx within 2 − ε
Complete graphs | not approx within $O(\log k)$ | not approx within $O(\log k)$
Outline: The next section provides some definitions and notation in order to formally describe the CMI problem. Section 3 contains the results concerning the hardness, and Section 4 gives approximation algorithms for the CMI problem in
general graphs. Section 5 is devoted to the hardness and the approximation factors of CMI with respect to various classes of graphs. In particular, we consider graphs with bounded degree, planar graphs, graphs with bounded treewidth, and complete graphs. Finally, Section 6 contains concluding remarks and a discussion of interesting open problems.
2 Definitions and Notation
Unless otherwise stated, the network graph G = (V, E) is always assumed to be simple (i.e., without multiple edges), undirected and connected. Moreover, we always denote by n and m the cardinality of the sets V and E, respectively. The degree of node v ∈ V is denoted by $\Delta_v$, the set of its neighbours by N(v), and the set of its available interfaces by I(v). The minimum node degree of graph G is denoted by δ, and its maximum node degree by Δ. A global characterisation of the interfaces of the respective nodes from V is given in terms of an appropriate interface assignment function W, according to the following definition.

Definition 1. A function $W : V \to 2^{\{1,\ldots,k\}}$ is said to cover graph G = (V, E) if for each {u, v} ∈ E the set $W(u) \cap W(v) \neq \emptyset$, with $k = |\bigcup_{v \in V} I(v)|$.

The cost of activating an interface for a node is assumed to be identical for all nodes and given by the cost function $c : \{1, \ldots, k\} \to \mathbb{N}$, i.e., the cost of interface i is written as $c_i$. The considered CMI optimisation problem is formulated as follows.

CMI: Cost Minimisation in Unbounded Multi-Interface Networks
Input: A graph G = (V, E), an allocation of available interfaces $W : V \to 2^{\{1,\ldots,k\}}$ covering graph G, an interface cost function $c : \{1, \ldots, k\} \to \mathbb{N}$.
Solution: An allocation of active interfaces $W_A : V \to 2^{\{1,\ldots,k\}}$ covering graph G such that $W_A(v) \subseteq W(v)$ for all v ∈ V.
Goal: Minimise the total cost of the active interfaces, $c(W_A) = \sum_{v \in V} \sum_{l \in W_A(v)} c_l$.
In all further considerations, the problem instance is stated in the form of the triple (G, W, c).
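As an illustration of Definition 1 and of the CMI objective, the following minimal Python sketch checks whether an activation covers a graph and computes its total cost. The function names and the three-node instance are hypothetical and chosen only for this example; they do not come from the paper.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

Node = str
Interfaces = Dict[Node, FrozenSet[int]]


def covers(edges: Iterable[Tuple[Node, Node]], assignment: Interfaces) -> bool:
    """True if the endpoints of every edge share at least one interface."""
    return all(assignment[u] & assignment[v] for u, v in edges)


def total_cost(assignment: Interfaces, cost: Dict[int, int]) -> int:
    """c(W_A): sum of c_l over every active interface l at every node."""
    return sum(cost[l] for active in assignment.values() for l in active)


def is_feasible(edges, available: Interfaces, active: Interfaces) -> bool:
    """W_A is a solution if it covers G and W_A(v) is a subset of W(v)."""
    subset_ok = all(active[v] <= available[v] for v in available)
    return subset_ok and covers(edges, active)


# Hypothetical instance (G, W, c): a path a - b - c with k = 2 interfaces.
edges = [("a", "b"), ("b", "c")]
W = {"a": frozenset({1}), "b": frozenset({1, 2}), "c": frozenset({1, 2})}
c = {1: 3, 2: 1}

# Candidate activation: edge {a, b} uses interface 1, edge {b, c} uses interface 2.
W_A = {"a": frozenset({1}), "b": frozenset({1, 2}), "c": frozenset({2})}
```

On this instance, activating interface 1 everywhere would also be feasible, so the sketch can be used to compare the costs of the two activations.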
3 Hardness in General Graphs
Theorem 1. CMI is hard to approximate within a factor of $O(\log k)$, even when restricted to the unit cost interface case.

Proof. The proof proceeds by reduction from the Minimum Hitting Set problem. Let us recall that for a collection of non-empty subsets $C_1, C_2, \ldots, C_l \subseteq \{1, 2, \ldots, k\}$, a set $S \subseteq \{1, 2, \ldots, k\}$ is called a hitting set if $\forall_{1 \le i \le l}\; C_i \cap S \neq \emptyset$.
The problem of minimising the cardinality of the hitting set is as hard as the Minimum Set Cover problem [5], and consequently, hard to approximate within a factor of $O(\log k)$ [6]. Now we define the following instance of CMI with unit cost interfaces. Let graph $G$ be the complete bipartite graph, where $V = X \cup Y$, and all vertices from set $X = \{x_1, x_2, \ldots, x_l\}$ are connected with all vertices from set $Y = \{y_1, y_2, \ldots, y_p\}$. For all vertices from set $Y$ allocate the entire set of available interfaces, $W(y_i) = \{1, 2, \ldots, k\}$, while for all vertices from set $X$ allocate fixed subsets of interfaces to each vertex, $W(x_i) = C_i$.

Consider an arbitrary activation $W_A$ constituting a solution to CMI for this instance. For any fixed $j$, $1 \le j \le p$, we have $\forall_{1 \le i \le l}\; W_A(y_j) \cap W_A(x_i) \neq \emptyset$, and since $W_A(x_i) \subseteq W(x_i) = C_i$, obviously $\forall_{1 \le i \le l}\; W_A(y_j) \cap C_i \neq \emptyset$; thus $W_A(y_j)$ is a hitting set for the collection of sets $\{C_i\}$. Hence, given a solution to the considered CMI instance with a total cost of $c$, we can immediately show a hitting set $S$ for $\{C_i\}$ such that $|S| \le \frac{c}{p}$.

Conversely, given a hitting set $S$ for $\{C_i\}$, we can construct a valid solution to the CMI instance by putting $\forall_{1 \le j \le p}\; W_A(y_j) = S$, and activating appropriate one-element sets in set $X$ ($\forall_{1 \le i \le l}\; |W_A(x_i)| = 1 \wedge W_A(x_i) \subset S$). The cost of the obtained solution is then equal to $c = p|S| + l$; by putting $p = l$ in the definition of the graph, we have $|S| = \frac{c}{p} - 1$. The obtained bounds lead to the conclusion that any polynomial time $a$-approximation algorithm for CMI leads to a polynomial time $a$-approximation algorithm for the Minimum Hitting Set problem. The thesis follows from the stated hardness result for the approximation of Minimum Hitting Set [6].
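The reduction used in the proof of Theorem 1 can be sketched in Python as follows. The helper names and the small Hitting Set instance are assumptions made for illustration only; the sketch builds the complete bipartite CMI instance, turns a hitting set into an activation of cost $p|S| + l$, and recovers a hitting set from an activation.

```python
def build_cmi_instance(collection, k, p):
    """Complete bipartite instance: W(x_i) = C_i and W(y_j) = {1, ..., k}."""
    xs = [f"x{i}" for i in range(1, len(collection) + 1)]
    ys = [f"y{j}" for j in range(1, p + 1)]
    edges = [(x, y) for x in xs for y in ys]
    available = {x: frozenset(c) for x, c in zip(xs, collection)}
    available.update({y: frozenset(range(1, k + 1)) for y in ys})
    return xs, ys, edges, available


def activation_from_hitting_set(xs, ys, available, hitting_set):
    """Every y_j activates S; every x_i activates one interface of C_i in S."""
    s = frozenset(hitting_set)
    active = {y: s for y in ys}
    for x in xs:
        # The intersection is non-empty because S hits C_i.
        active[x] = frozenset({min(available[x] & s)})
    return active


def hitting_set_from_activation(ys, active):
    """The smallest W_A(y_j) is a hitting set of size at most c / p."""
    return min((active[y] for y in ys), key=len)


# Hypothetical Hitting Set instance with k = 4 and l = 3 sets; we put p = l.
collection = [{1, 2}, {2, 3}, {3, 4}]
xs, ys, edges, W = build_cmi_instance(collection, k=4, p=3)
W_A = activation_from_hitting_set(xs, ys, W, {2, 3})
cost = sum(len(a) for a in W_A.values())  # unit cost interfaces
S = hitting_set_from_activation(ys, W_A)
```

With unit costs the total cost is simply the number of activated interfaces, which is why `cost` is computed by counting.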
4 Approximation Algorithms
Theorem 2. Given a graph $G = (V, E)$, CMI is approximable within a factor of $\min\{k - 1, \sqrt{n}(1 + \ln n)\}$.

Proof. Since the $(k - 1)$-approximation directly follows from [1], we only have to provide the $(\sqrt{n}(1 + \ln n))$-approximation algorithm in order to conclude the proof. Consider an instance of the CMI problem in the form of the triple $(G, W, c)$. For any fixed interface $j$, $1 \le j \le k$, let $B_j \subseteq E$ denote the set of all edges of $G$ which are covered when interface $j$ is enabled at all possible nodes of graph $G$. Next, let $A_{e,j}$ be a family of one-element sets of edges, $A_{e,j} = \{e\}$, where index $e$ spans all edges of graph $G$, while index $j$ spans all network interfaces which can be used to cover edge $e$. We now consider the family of sets $F = \{B_j\} \cup \{A_{e,j}\}$ as an instance of the Weighted Minimum Set Cover problem. With each set we associate a weight $w$, corresponding to the cost of activating all interfaces within a given set: $w(B_j) = n_j \cdot c_j$, $w(A_{e,j}) = 2 \cdot c_j$, where $n_j$ denotes the number of vertices $v$ of the graph with available interface $j$ ($j \in W(v)$). The goal is to find a subfamily $\mathcal{S} \subseteq F$ such that $\bigcup_{S \in \mathcal{S}} S = E$ and the total cost $w = \sum_{S \in \mathcal{S}} w(S)$ is as small as possible.
First, observe that the existence of a solution to the considered instance of Weighted Minimum Set Cover with a total weight of $w$ implies the existence of a solution to the CMI instance with total cost $c \le w$; it suffices to activate the interfaces corresponding to the sets selected in the set cover. Conversely, suppose that there exists a solution to CMI with a total activation cost of $c = c_1|V_1| + c_2|V_2| + \ldots + c_k|V_k|$, where $V_j$ is the set of vertices with interface $j$ active in the solution. We now construct a solution $\mathcal{S}$ to Weighted Minimum Set Cover by performing the following augmentations for all interfaces $j$:
– if $|V_j| > \sqrt{n}$, then add $B_j$ to $\mathcal{S}$,
– otherwise, for all vertices $u, v \in V_j$ connected by an edge $e = \{u, v\} \in E$, add $A_{e,j}$ to $\mathcal{S}$.

It is easy to see that, since the solution to CMI covered all edges of $G$, the set $\mathcal{S}$ is a correct cover of the set of edges of the graph. The cost of solution $\mathcal{S}$ increases at every step. In the first of the considered cases we have $w(B_j) = n_j \cdot c_j \le n \cdot c_j \le \sqrt{n} \cdot |V_j| c_j$, while in the latter case, $\sum_e w(A_{e,j}) \le 2\binom{|V_j|}{2} c_j \le \sqrt{n} \cdot |V_j| c_j$. Summing over all $j$ we finally obtain $w \le \sqrt{n} \cdot c$.

The two proven inequalities ($c \le w$ for one side of the reduction and $w \le \sqrt{n} \cdot c$ for the other) imply that the existence of any $a$-approximation algorithm for Weighted Minimum Set Cover leads to an $(a\sqrt{n})$-approximation algorithm for CMI. The proof is complete when we take into account that Weighted Minimum Set Cover admits a $(1 + \ln n)$-approximation [7].
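The construction in the proof of Theorem 2 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the greedy rule (pick the set with the smallest weight per newly covered edge) is the usual greedy heuristic for weighted set cover, and the input allocation is assumed to cover the graph so that every edge belongs to some $B_j$.

```python
def build_family(edges, available, cost):
    """Family F = {B_j} u {A_{e,j}} as (weight, covered edges, j, nodes to activate)."""
    family = []
    interfaces = set().union(*available.values())
    for j in sorted(interfaces):
        holders = frozenset(v for v, w in available.items() if j in w)
        b_j = frozenset(e for e in edges if e[0] in holders and e[1] in holders)
        if b_j:
            # w(B_j) = n_j * c_j: interface j is switched on at all its holders.
            family.append((len(holders) * cost[j], b_j, j, holders))
        for e in b_j:
            # w(A_{e,j}) = 2 * c_j: interface j is switched on at both endpoints.
            family.append((2 * cost[j], frozenset({e}), j, frozenset(e)))
    return family


def greedy_cover(edges, family):
    """Greedy weighted set cover: smallest weight per newly covered edge first."""
    uncovered, chosen = set(edges), []
    while uncovered:
        best = min((s for s in family if s[1] & uncovered),
                   key=lambda s: s[0] / len(s[1] & uncovered))
        chosen.append(best)
        uncovered -= best[1]
    return chosen


def activation(nodes, chosen):
    """Translate the selected sets back into an activation W_A."""
    active = {v: set() for v in nodes}
    for _, _, j, to_activate in chosen:
        for v in to_activate:
            active[v].add(j)
    return active


# Hypothetical instance: path a - b - c, interface costs c_1 = 3 and c_2 = 1.
edges = [("a", "b"), ("b", "c")]
W = {"a": {1}, "b": {1, 2}, "c": {1, 2}}
c = {1: 3, 2: 1}
chosen = greedy_cover(edges, build_family(edges, W, c))
W_A = activation(W, chosen)
weight = sum(s[0] for s in chosen)
cmi_cost = sum(c[j] for act in W_A.values() for j in act)
```

The activation obtained from a cover never costs more than the weight of the cover, which is the inequality $c \le w$ used in the proof.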
5 Results for Various Graph Classes
5.1 Graphs with Bounded Maximum Degree
It is known that the CMI problem is computationally hard even for restricted graph classes.

Theorem 3 ([1]). CMI is APX-hard even when restricted to instances of maximum degree $\Delta = 5$, with any fixed value of $k \ge 3$, and unit cost interfaces.

Let us recall that a graph is called d-degenerate for some value of parameter $d$ if all its subgraphs have a node of degree at most $d$, i.e., $d \ge \max_{H \subseteq G} \delta(H)$.

Theorem 4. For any fixed value of $d$, CMI is approximable within $d + 1$ for d-degenerate graphs.

Proof. By a simple characterisation [8], in linear time it is possible to order the node set of a d-degenerate graph in the form of a sequence $s = (v_1, \ldots, v_n)$, such that $\forall_{1 \le i \le n}\; |\{v_1, \ldots, v_{i-1}\} \cap N(v_i)| \le d$. Consider an algorithm which assigns the set $W_A(v)$ to successive nodes of $V$ according to sequence $s$. Initially, all the sets $W_A(v)$ are empty; in the first step, set $W_A(v_1)$ remains empty. In the i-th step, for $i \ge 2$, we define $W_A(v_i) \subseteq W(v_i)$ in such a way that ∀1≤j
process can be performed optimally, since at this stage $W_A(v_i)$ will have at most $d$ elements. Trying all possibilities requires $O(k^d)$ time, where $d$ is a constant value by assumption. Moreover, for all nodes $v_j \in N(v_i)$, $1 \le j < i$, set $W_A(v_j)$ is augmented by at most one interface of minimum possible cost to guarantee that $W_A(v_j) \cap W_A(v_i) \neq \emptyset$. Once the process is complete, the obtained function $W_A$ is clearly a correct solution to CMI. Observe that in the i-th step, $i \ge 2$, set $W_A(v_i)$ is assigned a cost value not greater than that of the optimal interface assignment for node $v_i$, and that the cost of only at most $d$ other sets $W_A(v_j)$, $1 \le j < i$, is additionally increased by a value not exceeding the cost of $W_A(v_i)$. The cost of the obtained solution is thus not greater than $(d + 1)(C_{opt} - C_{opt}^{v_1})$, where $C_{opt}$ denotes the total cost of an optimal solution for graph $G$ and $C_{opt}^{v_1}$ is the cost of some optimal solution for node $v_1$, and the proposed algorithm is clearly a $(d + 1)$-approximation.
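The node ordering on which the proof of Theorem 4 relies can be sketched as follows. This is a simple quadratic-time illustration of a smallest-last ordering, not the linear-time procedure of [8]; the function names and the example graphs are hypothetical.

```python
def smallest_last_order(adj):
    """Return a sequence in which each node has few neighbours among earlier nodes.

    Repeatedly remove a node of minimum degree from the remaining graph; the
    reversed removal order is the required sequence.
    """
    remaining = {v: set(ns) for v, ns in adj.items()}
    removed = []
    while remaining:
        # Ties are broken by node name to keep the result deterministic.
        v = min(remaining, key=lambda u: (len(remaining[u]), u))
        for u in remaining.pop(v):
            remaining[u].discard(v)
        removed.append(v)
    return removed[::-1]


def max_back_degree(adj, order):
    """Largest number of neighbours that any node has among its predecessors."""
    pos = {v: i for i, v in enumerate(order)}
    return max(sum(pos[u] < pos[v] for u in adj[v]) for v in order)


# Hypothetical examples: a tree, a 4-cycle and the complete graph K4.
tree = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a"}, "d": {"b"}}
cycle = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
k4 = {v: {u for u in "abcd" if u != v} for v in "abcd"}
```

For a d-degenerate graph the value returned by `max_back_degree` on this ordering is at most $d$, which is the property used when assigning interfaces node by node.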
It is well known that after the removal of an appropriately chosen single edge, a graph of maximum degree $\Delta$ becomes $(\Delta - 1)$-degenerate [9]. Consequently, by a slight generalisation of the above theorem we obtain the following corollary.

Corollary 1. For any fixed value of $\Delta$, CMI is approximable within $\Delta$ for graphs of maximum degree $\Delta$.

Note that for the case of unit cost interfaces an improved approximation ratio is known.

Theorem 5 ([1]). CMI is approximable within $\frac{\Delta + 1}{2}$ for graphs of maximum degree $\Delta$, in the case of unit cost interfaces.

5.2 Planar Graphs and Graphs of Bounded Treewidth
Surprisingly, CMI remains hard to approximate even for restricted classes of trees, such as stars.

Theorem 6. For any $\varepsilon \in (0, 1]$, CMI is hard to approximate within a factor $2 - \varepsilon$ for stars, even when restricted to the unit cost interface case.

Proof. The star is a complete bipartite graph in which one of the partitions is of order 1. Indeed, the proof of Theorem 1 (hardness for bipartite graphs) can be retraced by putting $p = 1$, hence leading to the sought 2-non-approximability result.
On the positive side, CMI may be approximated on many popular graph classes using the general approach for d-degenerate graphs (Theorem 4). For instance, taking into account that trees are 1-degenerate and planar graphs are 5-degenerate, we have the following corollaries.

Corollary 2. CMI is approximable within a factor of 2 for trees.

Corollary 3. CMI is approximable within a factor of 6 for planar graphs.

It is interesting to ask whether there exist any better approximation algorithms for the class of graphs with bounded treewidth (which includes outerplanar graphs and series-parallel graphs), for instance making use of the dynamic programming approach.
5.3 Complete Graphs
Theorem 7. Given a complete graph $K_n$ of $n$ nodes, CMI is hard to approximate within a factor of $O(\log k)$ even when restricted to the unit cost interface case.

Proof. The proof can be obtained by a simple modification of the proof of Theorem 1. In the CMI instance used in the reduction, it suffices to create a new interface (labeled $k + 1$) available only for all vertices from set $X$, and another new interface (labeled $k + 2$) available only for all vertices from set $Y$. Graph $G$ is then transformed from the complete bipartite graph to the complete graph by adding all possible edges within sets $X$ and $Y$. The corresponding approximation ratios remain unaffected in terms of growth rate.
Despite the above result, it appears likely that unit cost CMI in complete graphs is easier to approximate than unit cost CMI in general graphs; however, no such results are known to date.
6 Conclusion and Future Work
Nowadays wireless devices hold multiple radio interfaces, allowing switching from one communication network to another according to required connectivity and related quality. The selection of the “best” radio interface for a specific connection might depend on various factors. In this paper we have considered the Cost Minimisation in Unbounded Multi-Interface Networks problem. We have concentrated our attention on problem hardness and approximation factors in general and more specific settings. The obtained results have shown that the problem is hard and approximation algorithms are highly relevant. This suggests the need for future study of better performing approximation algorithms or heuristics. Another very interesting issue would be to study the problem from a distributed point of view. What kind of considerations should be taken into account in the case when each device is aware only of its own connections? What information is crucial and how far should it be sent for the purpose of establishing a network? In such a setting the goal would be to minimise the maximum energy spent by a single device in order to prolong the lifetime of each device. Practical heuristics and experimental studies might be a first step in this direction. Which strategy should a selfish device apply in order to obtain its required connections? What happens, instead, in collaborative environments where devices provide their resources for the convenience of the whole network? As also mentioned in the introduction, other slightly different models of the problem are of main interest in further investigations. The mutual exclusion among interfaces might be one of them. Time constraints or bounded hops might also be considered for practical environments.
References
1. Klasing, R., Kosowski, A., Navarra, A.: Cost minimisation in multi-interface networks. In: Chahed, T., Tuffin, B. (eds.) NET-COOP 2007. LNCS, vol. 4465, pp. 276–285. Springer, Heidelberg (2007)
2. Caporuscio, M., Charlet, D., Issarny, V., Navarra, A.: Energetic Performance of Service-oriented Multi-radio Networks: Issues and Perspectives. In: Proceedings of the 6th International Workshop on Software and Performance (WOSP), pp. 42–45. ACM Press, New York (2007)
3. Caporuscio, M., Carzaniga, A., Wolf, A.L.: Design and evaluation of a support service for mobile, wireless publish/subscribe applications. IEEE Transactions on Software Engineering 29(12), 1059–1071 (2003)
4. Cilibrasi, R., Lotker, Z., Navarra, A., Perennes, S., Vitanyi, P.: About the lifespan of peer to peer networks. In: Shvartsman, M.M.A.A. (ed.) OPODIS 2006. LNCS, vol. 4305, pp. 288–302. Springer, Heidelberg (2006)
5. Ausiello, G., D’Atri, A., Protasi, M.: Structure preserving reductions among convex optimization problems. J. Comput. Syst. Sci. 21(1), 136–153 (1980)
6. Raz, R., Safra, S.: A sub-constant error-probability low-degree test, and a sub-constant error-probability PCP characterization of NP. In: STOC, pp. 475–484 (1997)
7. Chvátal, V.: A greedy heuristic for the set-covering problem. Mathematics of Operations Research 4(3), 233–235 (1979)
8. Matula, D.W., Beck, L.L.: Smallest-last ordering and clustering and graph coloring algorithms. Journal of the ACM 30(3), 417–427 (1983)
9. Brooks, R.L.: On coloring the nodes of a network. Proceedings of Cambridge Philosophical Society 37, 194–197 (1941)
Scheduling in Multi-organization Grids: Measuring the Inefficiency of Decentralization Krzysztof Rzadca LIG, Grenoble Universities, France Polish-Japanese Institute of Information Technology, Warsaw, Poland [email protected]
Abstract. We present a novel, generic model of the grid that emphasises the roles of individual organizations that form the system. The model allows us to study the global behaviour of the system without introducing external forms of recompense. Using game-theory and equitable multicriteria optimization, we study three diverse types of computational grids: an off-line system with dedicated uniprocessors, an on-line system with divisible load and an off-line system with parallel jobs. Results show that, unless strong assumptions are made, the complete decentralization leads to a significant loss of performance. Keywords: game theory, fairness, scheduling, grid, multi-objective optimization.
1 Introduction

Grids are systems that allow users to access resources belonging to different administrative entities [1]. Such an administrative decentralization imposes new requirements on resource management systems, which must not only optimize the efficiency of the whole system, but also ensure that all the parties are treated fairly. Otherwise, resulting conflicts may break the grid agreements. The goal of this paper is to present the generic multi-organizational model of the computational grid. The model allows us to study the problem of fair scheduling without the need to introduce any external forms of recompense, such as money. The analysis of the global behaviour of the system is thus fairly straightforward and does not require many out-of-model assumptions (such as supply-demand curves). At the same time, the model is general enough to be applied in a variety of systems. We use equitable optimization to study grids with strong central control, and game theory when central control is weak. The problem of grid scheduling was addressed in a number of papers. Grid economy approaches [2] introduce a free market economy. [3] uses multicriteria optimization in the context of divisible load scheduling. In [4], each job is scheduled independently by a broker. The global behaviour of the system cannot, however, be easily studied with those approaches.
This research was partly supported by the FP6 Network of Excellence CoreGRID funded by the European Commission (Contract IST-2002-004265).
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1048–1058, 2008. © Springer-Verlag Berlin Heidelberg 2008
This paper is organized as follows. In Section 2 we briefly present game theory and equitable optimization, tools used in analysis of our models. Section 3 presents the generic multi-organizational grid model and three approaches used for theoretical analysis. The following sections present example applications of the model: Section 4 in grids composed of dedicated uniprocessors; Section 5 in divisible load scheduling; Section 6 in parallel job scheduling.
2 Tools: Game-Theory and Equitable Optimization

Game theory [5] studies situations in which independent parties (players) make decisions (use strategies). For each player $P_k$, the outcome $u_k$ of the game is a function of his/her strategy $\sigma_k$, but also of strategies of other players $\sigma = [\sigma_1, \ldots, \sigma_N]$. The players are assumed to be selfish and rational, i.e. concerned only with maximizing their own outcomes. Furthermore, in strategic games (considered in this paper) players choose their strategies at the same time. Nash Equilibrium (NE) is a profile of players’ strategies such that no player has an incentive to unilaterally change his/her strategy. Given the NE strategies of other players, each player optimizes his/her outcome by playing a NE strategy. It is expected that a game will end in a NE. However, a NE does not necessarily result in the global optimum (usually defined as the optimal sum of players’ outcomes). Price of Anarchy (PoA) measures the inefficiency of a NE by computing the ratio between the worst NE and the global maximum. Game theory has been previously applied to the problem of scheduling. [6] considered selfish jobs competing for common infrastructure. [7] analyses a problem of electing one of selfish resources to execute a job.

Equitable optimization [8] incorporates the notion of distributive fairness to multicriteria optimization. In multi-criteria optimization, a solution is considered optimal if no outcome can be improved without worsening another outcome (Pareto optimality). Equitable optimization puts a further restriction. A transfer of any small amount from an outcome to any other relatively worse-off outcome results in a more equitable solution. We say that the latter solution equitably dominates the former (e.g. solution [3, 2, 1] equitably dominates [4, 2, 0]). An equitably optimal solution is a solution that is not equitably dominated by any other solution.
The notion of equitable optimality is broader than min max fairness: in equitable optimization [4, 2, 0] is as fair as [3, 2, 2], yet min max chooses the latter solution. An outcome is equitably-optimal iff it is not equitably dominated by any other outcome.
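The two numerical examples above can be checked with a short Python sketch. It assumes that outcomes are minimized and that equitable dominance is tested by comparing cumulative sums of the outcomes sorted from worst (largest) to best; this characterisation is an assumption of the sketch and is not spelled out in the text, and the function names are hypothetical.

```python
from itertools import accumulate


def cumulative_ordered(outcomes):
    """Cumulative sums of the outcomes sorted from worst (largest) to best."""
    return list(accumulate(sorted(outcomes, reverse=True)))


def equitably_dominates(a, b):
    """True if a equitably dominates b (minimized outcomes, equal lengths).

    Assumed test: every cumulative ordered sum of a is at most the matching
    sum of b, and the two cumulative vectors differ.
    """
    ca, cb = cumulative_ordered(a), cumulative_ordered(b)
    return ca != cb and all(x <= y for x, y in zip(ca, cb))
```

Under this test, [3, 2, 1] dominates [4, 2, 0], while [4, 2, 0] and [3, 2, 2] are incomparable: neither dominates the other, which agrees with the statement that they are equally fair in the equitable sense.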
3 The Generic, Multi-organizational Computational Grid Model In our model, a grid is an agreement between selfish, independent organizations to share their resources. Thus, the central notion of our model is that of an organization, an entity that groups a resource donated to the grid and local users willing to employ the whole system.
3.1 The Core of the Model

An organization (denoted as $O_k$) is an administrative entity such as a laboratory or a faculty. Each organization contributes its resource (denoted as $M_k$) to the grid. By contributing, an organization expects that its users will have access to other resources in a fair manner. $O$ denotes the set of all organizations, $O = \{O_1, \ldots, O_N\}$. Organizations are independent from each other. Thus, an organization is concerned only with the performance of the jobs produced by its members. Our notion of an organization differs from Virtual Organization, because we assume that an organization must own, and grant access to, a resource. As there are no external users, each job (denoted as $J_k^i$) is local to some organization: $J_k^i$ is the $i$-th job produced (and owned) by organization $O_k$, and $p_k^i$ denotes the job’s computation time. For $M_k$, jobs $J_k$ produced by the resource’s organization $O_k$ are called local jobs. The remaining jobs $J_{-k}$ assigned for execution on $M_k$ are called foreign jobs. We assume that there are no external means of recompense for accessing resources. An organization cannot explicitly “pay” another organization, either with some kind of money or through barter trade.

3.2 Additional Characteristics of the Model

In order to derive results, we make additional assumptions, commonly present in the theory of scheduling [9]. We assume that the exact size $p_k^i$ of every job submitted to the system is known. Preemption is not allowed. A job that has been started must be completed. We do not consider communication times. We also assume that the system is perfectly reliable. In order to assess the performance of the system, we measure the completion time of jobs [9]. $C_k^i$ denotes the completion (finish) time of job $J_k^i$. To measure the performance experienced by organization $O_k$, we compute two aggregated measures. The sum of completion times is the sum $C_k = \sum_i C_k^i$ of completion times of jobs $J_k$ owned by $O_k$.
The makespan (maximum completion time) is the time when the last job of $O_k$ finishes, $C_{\max}(O_k) = \max_i C_k^i$. On the system level, the global sum of completion times $\Sigma C$ is defined as the sum of completion times of all the jobs in the system, $\Sigma C = \sum_k \sum_i C_k^i$. The global makespan $C_{\max}$ is the time when the last job in the system finishes, $C_{\max} = \max_{i,k} C_k^i$. If a job $J_k^i$ cannot be started before a certain date (called release date $r_k^i$), it is usual to measure the flow time $F_k^i$, defined as the time job $J_k^i$ spends in the system, $F_k^i = C_k^i - r_k^i$. The aggregated measures are defined similarly.

3.3 Approaches for Optimization

We introduce a centralized, grid-level scheduler which proposes a schedule to each resource. However, the power of the centralized scheduler and, consequently, the kind of solutions it can impose on individual processors, depends heavily on the level of control the individual organizations have over their resources. We will study the problem from three perspectives, leading to three different approaches for optimization: multi-criteria optimization, game theory and constrained multi-criteria optimization.
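The performance measures of Section 3.2 can be computed as in the following minimal Python sketch. The data layout (a list of release date and completion time pairs per organization) and all names are assumptions made for this illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical input: for each organization O_k, one (release date, completion
# time) pair per job J_k^i.
Schedule = Dict[str, List[Tuple[float, float]]]


def org_sum_completion(jobs: List[Tuple[float, float]]) -> float:
    """C_k: sum of completion times of the jobs owned by one organization."""
    return sum(completion for _, completion in jobs)


def org_makespan(jobs: List[Tuple[float, float]]) -> float:
    """C_max(O_k): completion time of the organization's last job."""
    return max(completion for _, completion in jobs)


def org_sum_flow(jobs: List[Tuple[float, float]]) -> float:
    """Sum of flow times F_k^i = C_k^i - r_k^i over the organization's jobs."""
    return sum(completion - release for release, completion in jobs)


def global_measures(schedule: Schedule) -> Dict[str, float]:
    """Global sum of completion times and global makespan."""
    return {
        "sum_C": sum(org_sum_completion(jobs) for jobs in schedule.values()),
        "C_max": max(org_makespan(jobs) for jobs in schedule.values()),
    }


schedule = {"O1": [(0, 3), (1, 7)], "O2": [(0, 2), (4, 9)]}
```

Each organization evaluates only its own measures, whereas the grid scheduler may look at the global ones; the three approaches differ in how these per-organization values are traded off.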
Firstly, in the most restricted case, we assume that an organization is neither able to impose any schedule on its local resource, nor to quit the grid. The goal of the grid scheduler is to share the pool of available resources fairly among organizations. Consequently, the problem transforms into equitable multi-criteria optimization of performance measures of organizations. Secondly, each organization may have complete control over the schedule of the local resource. Each organization is tempted to locally modify the solution proposed by the grid scheduler, if the organization’s gain is increased. Consequently, such a problem must be analyzed with a game-theoretic approach. In the resulting game, the set of players is equal to the set of organizations O. Strategy σk of player Ok is a schedule of jobs on player’s local resource Mk . Finally, payoff function uk for player Ok is the performance of player’s local jobs Jk . Thirdly, we assume that each organization independently decides whether to join or to leave the grid. Once inside, the organization grants complete control over its resources to the grid scheduler. Yet, an organization will leave the system, if perceived performance is lower than the performance the organization could achieve being outside (called self-reliant performance). This problem is equitable constrained multi-criteria optimization of performance measures of respective organizations, with the constraints of the self-reliant performance. The three approaches defined above are far from being exhaustive to the problem of grid scheduling. However, we claim that they are sufficiently varied to cover a number of grid application scenarios. The multi-criteria optimization approach is suitable for classic systems where a number of resources must be shared between organizations. An example scenario is a supercomputer bought by a state agency, that is later shared between a number of public laboratories and universities. 
The game theoretic perspective concerns systems with almost no central control, in which independent parties try to maximize their own gain, with no motivation to optimize the performance of the whole system. We expect that highly distributed, peer-to-peer systems will behave in that manner. Finally, the constrained multi-criteria optimization concerns systems where individual goals are noticed, but not necessarily selfishly maximized. Some level of trust and social control can be maintained. This perspective models a number of grids where there are a few participating organizations and the participation is somehow limited (like in e.g. academic grids).
4 Resource Management in Dedicated Grids

In this section, we consider the grid as a tool for accessing specialized resources. Each job $J_{k,l}^i$ in the system must be computed on specific resource $M_l$, not necessarily the one belonging to owner $O_k$ of the job. The scheduling proposed by classic approaches is to execute jobs on each processor in order of their increasing computation times, regardless of owners of jobs (denoted as SPT). However, this solution may be unfair for organizations with very popular equipment. The following additional assumptions are made. The model is off-line. There are no release dates. Each resource has only one processor, therefore jobs are sequential. Each organization $O_k$ computes the sum of completion times $C_k$ of locally produced jobs $J_{k,\cdot} = \bigcup_l J_{k,l}$. As resources are uniprocessors, a schedule for resource $M_l$ is a permutation of jobs $J_{\cdot,l} = \bigcup_k J_{k,l}$. However, we may restrict our attention to schedules that order jobs of each organization in non-decreasing processing time order (called Shortest Processing Time, SPT). In a SPT schedule, for each organization $O_k$, if $p_{k,l}^i < p_{k,l}^j$, then $J_{k,l}^i$ is executed before $J_{k,l}^j$. Any non-SPT schedule is Pareto-dominated by a SPT schedule (the proof is by exchange argument on jobs in non-SPT order).

4.1 Optimization Approach

The scheduling problem remains hard, even if it is restricted to two organizations and one resource, $P1\,||\,(\Sigma C_i^A, \Sigma C_i^B)$ [10]. There can be an exponential number of Pareto-efficient schedules. The decision version of the problem is NP-complete. Equitable Walk (EW) [11] is a heuristic which produces a number of grid schedules by iterative modifications of the initial SPT schedule in order to improve the outcome of the disfavored organizations. The algorithm modifies the schedules by switching the order of two jobs executed one after another on the same resource. Although there is no guarantee about the optimality of the produced results, during experiments [11] EW delivered results close to optimal, running a few orders of magnitude faster than a reference exact algorithm.

4.2 Game-Theoretic Approach

Assertive organization $O_k$, which is able to impose a schedule for $J_{\cdot,k}$, would use a greedy My Jobs First (MJF) strategy, which schedules all the local jobs $J_{k,k}$ before any foreign job. Given any strategies of the rest of organizations, MJF strategy will reduce the total finish time $C_k$. By MJF we denote the profile of strategies in which every organization uses MJF. We define the payoff $u_k(\sigma)$ for each player $O_k$ as the gain over MJF for that player, $u_k(\sigma) = C_k(\mathrm{MJF}) - C_k(\sigma)$.

Proposition 1.
MJF is the only Nash equilibrium of the one round, non-cooperative grid scheduling game.

Proof. Assume that, for a particular instance $J = \{J_{k,l}\}$, the grid scheduler is able to produce schedule $\sigma^* = [\sigma_1^*, \ldots, \sigma_n^*]$, which results in non-negative payoff $u_k(\sigma^*) \ge 0$ for all players and a positive payoff for at least one player. Consequently, there must be at least one player $O_k$ for whom the proposed strategy $\sigma_k^*$ is different than MJF. Thus, in $M_k$’s schedule, there is at least one foreign job $J_{l,k}^j$ scheduled before a local job $J_{k,k}^i$. If $O_k$ decides to switch the order of execution of those two jobs, local job $J_{k,k}^i$ will be finished faster and, thus, player’s payoff $u_k$ will increase. At the same time payoff $u_l$ will decrease. It follows that the strategy maximizing $u_k$ is MJF, given that the others play any profile of strategies $\sigma_{-k}$. Additionally, if all the other players play MJF, the only strategy which guarantees non-negative $u_k$ for $O_k$ is to play MJF as well.

In [11] we have shown an example instance, in which MJF strategies resulted in a $O(n)$ increase of makespan of every organization. Thus, the price of anarchy is at least
linear with the number of jobs, and, consequently, the price the grid pays for the lack of control is considerable.

4.3 Constrained Optimization Approach

In dedicated grids, self-reliant performance corresponds to MJF strategies. Therefore, we can equitably optimize the gains $[u_k]$ defined in the previous section with a constraint $u_k \ge 0$ for each organization $O_k$. To produce such solutions, we use the Adjusted EW (AEW) [11] algorithm, which maximizes $u_k$ instead of minimizing $C_k$. AEW starts with a SPT schedule. However, SPT may not be feasible, and, generally, AEW may be unable to produce a feasible result. Nevertheless, during our experiments [11], AEW always returned at least one feasible schedule, if such existed. Moreover, the constraint did not cause a large loss of performance.
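The tension between SPT and MJF in the dedicated grid of this section can be illustrated with a small Python sketch. The two-organization instance, the tuple layout of a job and the function names are hypothetical; the sketch computes the sums of completion times $C_k$ when each resource orders its queue by a given policy.

```python
from collections import defaultdict

# A job is (owner k, required resource l, processing time p); resource l
# belongs to organization l, so a job is local when owner == resource.


def spt(job):
    """Shortest Processing Time first, regardless of the owner."""
    return job[2]


def mjf(job):
    """My Jobs First: local jobs before foreign ones, SPT within each group."""
    return (job[0] != job[1], job[2])


def sum_completion_times(jobs, policy):
    """C_k for every organization; policy maps a resource to its sort key."""
    queues = defaultdict(list)
    for job in jobs:
        queues[job[1]].append(job)
    totals = defaultdict(int)
    for resource, queue in queues.items():
        clock = 0
        for owner, _, p in sorted(queue, key=policy[resource]):
            clock += p
            totals[owner] += clock
    return dict(totals)


# Hypothetical instance: each organization has one long local job and one
# short job that must run on the other organization's resource.
jobs = [(1, 1, 4), (2, 1, 1), (2, 2, 3), (1, 2, 1)]
c_spt = sum_completion_times(jobs, {1: spt, 2: spt})
c_mjf = sum_completion_times(jobs, {1: mjf, 2: mjf})
c_dev = sum_completion_times(jobs, {1: mjf, 2: spt})  # only O_1 deviates
```

On this instance both organizations are better off under SPT than under the MJF profile, yet organization 1 improves its own $C_1$ by deviating unilaterally from SPT to MJF, which is the incentive exploited in the proof of Proposition 1.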
5 Load Balancing of Divisible Load

In this section, we consider the grid as a tool to compute divisible load jobs. Divisible load [12] models jobs that can be divided into a large number of fragments that can be computed independently in parallel. In our model, jobs may be computed in parallel on many resources with the consent of the owners of the non-local resources. We assume that the system must guarantee the latest finish time of every submitted job, which is announced to the user at the moment of job submission. We claim that such a guarantee gives better quality of service than the best-effort execution commonly used in distributed divisible load computing (e.g. in BOINC [13]). We consider an on-line model with release dates. Jobs are unknown before they are released. Jobs are processed in FIFO order. $O_k$ measures the flow time $F_k$ of locally produced jobs. As long as there are no local jobs, computing a foreign job is free. However, if a new local job is produced, such a foreign job is blocking the needed resources and thus delaying the newly produced local job. Note that once foreign load is accepted, in order to guarantee its finish time, it cannot be interrupted. Consequently, the performance of the owner of the resource is degraded. For the theoretical analysis, we also assume that the jobs are produced by a Poisson process. With this assumption, we analyze the expected result (and not the worst case) of the proposed algorithms and strategies.

5.1 Optimization Approach

In classic, centralized systems, a straightforward approach is to use the averaging load balancing algorithm (ALB). ALB sends parts of loads from overloaded to underloaded resources so that all resources finish at the same time. ALB appends such parts at the end of the schedule of an underloaded resource. With one organization and no communication costs, this strategy is optimal. It follows that, in our model, the expected result of ALB remains optimal¹.

¹ As in the worst case the on-line adversary produces a next local job every time some foreign load is accepted, the gain of the sender’s completion time is equal to the loss of the receiver’s completion time. Consequently, no algorithm that sends load can achieve better results than the local computation.
Proposition 2. ALB is an equitably-optimal strategy for a two-organizational grid when, on each resource, the load is composed of only one job.

Proof. We will start with computing the delay $g(r, L_k, \Phi_k)$ in the start time of the “next” local job caused by the foreign load. $g(r, L_k, \Phi_k)$ is a function of the next job’s unknown release time $r$, the length of the known local load $L_k$ and the length of the incoming foreign load $\Phi_k$ (assuming that foreign load comes at time $t = 0$). If $r \le L_k$ (the job is released before the local load is completed), $g(r, L_k, \Phi_k) = \Phi_k$ (the job is delayed by the size of the foreign load). If $r > L_k + \Phi_k$, $g(r, L_k, \Phi_k) = 0$ (the job is not delayed). Finally, if $L_k < r \le L_k + \Phi_k$, $g(r, L_k, \Phi_k) = L_k + \Phi_k - r$. Assuming that local jobs $J_k$ are produced by a Poisson process with known mean time between arrivals $\lambda_k$, we can compute the expected value of the delay as:

$EG(L_k, \Phi_k) = \Phi_k + \frac{e^{-\lambda_k L_k}}{\lambda_k}\left(e^{-\lambda_k \Phi_k} - 1\right)$.

Note that $EG(L_k, \Phi_k) < \Phi_k$ for all positive values of $\Phi_k$ and $L_k$. ALB, assuming two resources, one job on each resource, $p_1^1 > p_2^1$, sends half of the difference in loads to the less loaded resource. The gain in the completion time $C_1$ of the sender, $\Phi_2 = \frac{1}{2}(p_1^1 - p_2^1)$, is thus higher than the loss $EG(p_2^1, \Phi_2)$ of the receiver. The resulting load distribution thus optimizes both the worst utility ($C_1$) and the sum of utilities.

5.2 Game-Theoretic Approach

Even if ALB is optimal, a selfish organization holding full control over its resource will never accept incoming foreign load, as it may delay future local jobs. The expected delay $EG(L_k, \Phi_k)$ is positive for all positive values of $\Phi_k$ and $L_k$. Consequently, the only non-dominated action for the less-loaded resource is not to receive anything, which leads to significant loss of performance of the grid perceived as a whole.

5.3 Constrained Optimization Approach

As $EG$ is positive, a receiver loses when cooperating.
Thus, it is not possible to apply the constrained optimization to the original problem. However, we can modify the rules of the game, so that participating in the load-balancing algorithm becomes profitable even for less loaded organizations. Note that, through the following mechanisms, we cannot guarantee that each organization will gain from LB. We only increase the probability that, on average, organizations gain. Firstly, if the system forces the organizations to commit to their decisions for a longer period of time, the probability of being the receiver is similar to that of being the sender (assuming that the resources are similarly loaded). In our experiments [14], this mechanism was sufficient to make ALB the dominating strategy in grids composed of similarly loaded resources. Secondly, if the loads of the resources differ, we introduce two mechanisms in the load balancing algorithm, iterativeness and bounds, to distribute the gains from cooperation fairly among the participants. In iterative LB, the least-loaded resources are balanced first; then the remaining resources are iteratively added in the order of their local load. In bounded LB, each organization Ok declares its participation level lk. The algorithm
ensures that, if a resource receives some load, its local queue will not be extended beyond the declared participation level. To motivate organizations to declare lk > 0, an overloaded resource cannot send more load than its lk . In experiments [14], bounded, iterative LB in grids with one overloaded resource managed to improve Fi of underloaded resources, and thus to make load balancing the dominating strategy.
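The closed form for EG used in the proof of Proposition 2 can be sanity-checked numerically. The Python sketch below is illustrative only: the function names are ours, and it assumes that the release time r of the next local job is exponentially distributed with rate `lam` (mean time between arrivals 1/`lam`), which is the reading under which the closed form holds.

```python
import math
import random


def delay(r, L, phi):
    """Delay g(r, L, phi) of the next local job released at time r,
    given local load L and foreign load phi accepted at time 0."""
    if r <= L:
        return phi          # released before the local load completes
    if r > L + phi:
        return 0.0          # released after the foreign load completes
    return L + phi - r      # released while the foreign load is running


def expected_delay(L, phi, lam):
    """Closed form EG(L, phi); lam is assumed to be the arrival rate."""
    return phi + math.exp(-lam * L) / lam * (math.exp(-lam * phi) - 1.0)


def simulated_delay(L, phi, lam, samples=200_000, seed=0):
    """Monte Carlo estimate of the expected delay (assumed exponential r)."""
    rng = random.Random(seed)
    total = sum(delay(rng.expovariate(lam), L, phi) for _ in range(samples))
    return total / samples
```

For any positive L, phi and lam the closed form stays strictly between 0 and phi, which is the property used in Sections 5.1 and 5.2: the sender gains phi, while the receiver's expected loss is smaller than phi but still positive.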
6 Parallel Job Scheduling
Here, we extend to multiple organizations the classic model of scheduling parallel, rigid jobs on a multiprocessor resource in order to minimize the makespan. Each organization Ok owns resource Mk with m processors and minimizes the maximum completion time (makespan) Cmax(Ok) of the locally-produced jobs. A job Jki must be executed on qki processors of exactly one resource. The model is off-line. There are no release dates. Consequently, we may assume that a foreign job executed after all local jobs does not cause any cost.
6.1 Optimization Approach
Multi-organizational scheduling is an extension of scheduling sequential jobs on two processors, which is NP-hard [15]. For parallel jobs scheduled on one resource, the list scheduling (LS) algorithm is a $(2 - \frac{1}{m})$-approximation of $C^*_{\max}$ [16]. LS works in two phases. In the first phase, jobs are ordered into a list. In the second phase, the schedule is constructed by assigning jobs to processors in a greedy manner. Two lower bounds on the global makespan can be defined. Let us denote as $W = \sum_{k,i} p_{ki} q_{ki}$ the total surface of the jobs, and as $p_{\max} = \max_{k,i} p_{ki}$ the length of the longest job. Firstly, all the jobs must fit into available processors, so $C^*_{\max} \ge \bar{W} = \frac{W}{Nm}$. Secondly, the longest job must be executed, so $C^*_{\max} \ge p_{\max}$. We have assumed that a job cannot be executed in parallel on two resources. Consequently, we cannot treat N resources as one resource having Nm processors.
Proposition 3. LS is a 3-approximation of $C^*_{\max}$.
Proof. The proof is by contradiction. Let us assume that the last job finishes after $3C^*_{\max}$. It is thus started after $2C^*_{\max}$. At the moment the last job is started, all the other resources are busy (otherwise, the job would have been started earlier). Let us denote $LB = \max(\bar{W}, p_{\max})$. Consequently, on each resource $M_k$ we have $C_{\max}(M_k) \ge 2C^*_{\max} \ge 2LB$. On each resource $M_k$, we have [16] (as $LB \ge p_{\max}$):
$$u_k(t) + u_k(t + LB) \ge m \quad \text{for } 0 \le t \le \bar{W}.$$
After integrating this inequality, we get
$$\int_0^{\bar{W}} u_k(t)\,dt + \int_0^{\bar{W}} u_k(t + LB)\,dt \ge m \int_0^{\bar{W}} 1\,dt, \quad \text{i.e.} \quad \int_0^{\bar{W}} u_k(t)\,dt + \int_{LB}^{LB+\bar{W}} u_k(t)\,dt \ge m\bar{W}.$$
After adding the inequalities for every resource $M_k$, we get
$$\sum_{1 \le k \le N} \left( \int_0^{\bar{W}} u_k(t)\,dt + \int_{LB}^{LB+\bar{W}} u_k(t)\,dt \right) \ge Nm\bar{W} \ge W.$$
The left-hand side of the inequality is the surface of the jobs computed on all resources in periods $[0, \bar{W}]$ and $[LB, LB + \bar{W}]$. Those periods do not overlap. The surface computed is thus greater than the surface of all the tasks available, which leads to a contradiction.
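The two lower bounds and a greedy list schedule over several resources can be sketched in Python as follows. This is one possible reading of LS for the multi-resource case, not the authors' implementation: jobs are assumed to be given as (length, processors) pairs with positive length, already ordered into a list, and the function names are ours.

```python
import heapq


def lower_bound(jobs, n_resources, m):
    """LB = max(W / (N m), p_max) for rigid jobs given as (p, q) pairs."""
    total_surface = sum(p * q for p, q in jobs)
    w_bar = total_surface / (n_resources * m)
    p_max = max(p for p, _ in jobs)
    return max(w_bar, p_max)


def list_schedule(jobs, n_resources, m):
    """Greedy list scheduling on n_resources resources of m processors each.
    A job runs on exactly one resource. Returns the global makespan."""
    assert all(1 <= q <= m for _, q in jobs)
    free = [m] * n_resources     # idle processors per resource
    running = []                 # heap of (finish_time, resource, q)
    pending = list(jobs)
    t, makespan = 0.0, 0.0
    while pending:
        waiting = []
        for p, q in pending:     # scan the list in order at time t
            k = next((k for k in range(n_resources) if free[k] >= q), None)
            if k is None:
                waiting.append((p, q))
            else:
                free[k] -= q
                heapq.heappush(running, (t + p, k, q))
                makespan = max(makespan, t + p)
        pending = waiting
        if pending:              # advance to the next completion time
            t, k, q = heapq.heappop(running)
            free[k] += q
            while running and running[0][0] == t:
                _, k2, q2 = heapq.heappop(running)
                free[k2] += q2
    return makespan
```

On small instances the resulting makespan can be compared directly with `lower_bound`, which never exceeds the optimal makespan.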
Fig. 1. Globally-optimal solution (b) extends Cmax (O1 ) in comparison with the local solution (a). The best solution not extending O1 ’s makespan is (c).
6.2 Game-Theoretic Approach
Let us assume that each organization can control the schedule on the local resource, but not the allocation of jobs to resources, which is given. A strategy similar to MJF (Section 4.2) is as follows. An organization firstly schedules its local jobs. Then, foreign jobs are scheduled so that they do not delay any local job: either at the end of the schedule, or in gaps.
Proposition 4. MJF is a Nash equilibrium of the scheduling game.
Proof. (Sketch) Similarly to Proposition 1, given any profile of strategies of other players, MJF minimizes Cmax(Ok).
Proposition 5. The Price of Anarchy is at least 3/2.
Proof. Consider the instance in Figure 1. The MJF solution, depicted in (c), has Cmax = 3, whereas the global optimum has Cmax = 2.
6.3 Constrained Optimization Approach
We can use load-balancing techniques similar to those presented in the previous section to optimize the system-wide makespan, at the same time not worsening makespans of individual organizations. The Multi-Organizational Load Balancing Algorithm (MOLBA) [17] starts with scheduling jobs Jk on local resource Mk with LS in Highest First (HF) order (i.e. non-increasing number of required processors). The resulting Cmax(Ok) form the constraint for the rest of the algorithm. Then, for organizations with $C_{\max}(O_k) \ge 3\bar{W} + p_{\max}$, all the jobs are mixed and rescheduled in HF order by the LS algorithm on all resources (however, LS is adjusted so that on resources of organizations that only receive jobs, no local job is delayed). In [17], we proved that this algorithm is a 4-approximation of Cmax, at the same time not increasing any Cmax(Ok).
7 Conclusions The paper presented a new model of computational grid that emphasises its organizational heterogeneity. The notion of organization, that donates resources, but also consumes other resources, allows us to avoid using external forms of recompense that must be present in the classic, provider-consumer schemes. Through a series of applications we have demonstrated that the model is useful for theoretical analysis of various grids: working off-line or on-line, processing divisible load, sequential or parallel jobs. The
perspectives of analysis ranged from systems that are almost like classic supercomputers (optimization), through partly distributed ones (constrained optimization), to systems that are highly distributed (game theory). Such a spectrum allowed us to compare the performance and thus to measure the cost of the decreased control. We have shown that in highly distributed systems the loss of performance of the grid is significant (with the exception of the last model, which required a strong off-line assumption). However, we were able to guarantee good results for all organizations, if partial control was granted to a centralized grid scheduler. We conclude that grids cannot be fully distributed to achieve acceptable performance. A strong control of community must be present. In our future work we plan to, firstly, further investigate the scheduling of rigid, parallel jobs, and secondly, to apply the multi-organizational model in other contexts, such as data-intensive computation. In parallel, we plan to implement some of the presented ideas in real-world grid resource managers. Acknowledgements. The author would like to thank Fanny Pascual, Denis Trystram, and Adam Wierzbicki.
References 1. Foster, I.: What is the grid (2002), http://www-fp.mcs.anl.gov/∼foster/Articles/WhatIsTheGrid.pdf 2. Buyya, R., Abramson, D., Venugopal, S.: The grid economy. In: Special Issue on Grid Computing, vol. 93, pp. 698–714. IEEE Press, Los Alamitos (2005) 3. Marchal, L., Yang, Y., Casanova, H., Robert, Y.: A realistic network/application model for scheduling divisible loads on large-scale platforms. In: Proceedings of the International Parallel & Distributed Processing Symposium (IPDPS), vol. 01, p. 48b (2005) 4. Kurowski, K., Nabrzyski, J., Oleksiak, A., Weglarz, J.: Multicriteria aspects of grid resource management. In: Nabrzyski, J., Schopf, J.M., Weglarz, J. (eds.) Grid resource management: state of the art and future trends, pp. 271–293. Kluwer, Dordrecht (2004) 5. Osborne, M.J.: An Introduction to Game Theory, Oxford (2004) 6. Liu, J., Jin, X., Wang, Y.: Agent-based load balancing on homogeneous minigrids: Macroscopic modeling and characterization. IEEE TPDS 16(7), 586–598 (2005) 7. Kwok, Y.-K., Song, S., Hwang, K.: Selfish grid computing: Game-theoretic modeling and nash performance results. In: Proceedings of CCGrid (2005) 8. Kostreva, M.M., Ogryczak, W., Wierzbicki, A.: Equitable aggregations and multiple criteria analysis. EJOR 158, 362–377 (2004) 9. Brucker, P.: Scheduling Algorithms. Springer, Heidelberg (2004) 10. Agnetis, A., Mirchandani, P.B., Pacciarelli, D., Pacifici, A.: Scheduling Problems with Two Competing Agents. Operations Research 52(2), 229–242 (2004) 11. Rzadca, K., Trystram, D., Wierzbicki, A.: Fair game-theoretic resource management in dedicated grids. In: Proceedings of the IEEE International Symposium on Cluster Computing and the Grid (CCGRID 2007) (2007) 12. Robertazzi, T.G.: Ten reasons to use divisible load theory. Computer 36(5), 63–68 (2003) 13. Anderson, D.: BOINC: a system for public-resource computing and storage. In: Grid Computing, 2004. Proceedings. Fifth IEEE/ACM International Workshop on, pp. 4–10 (2004)
14. Rzadca, K., Trystram, D.: Promoting cooperation in selfish computational grids. European Journal of Operational Research (to appear, 2007) 15. Garey, M., Johnson, D.: Computers and Intractability: A Guide to the Theory of NP-Completeness. WH Freeman & Co, New York (1979) 16. Eyraud-Dubois, L., Mounie, G., Trystram, D.: Analysis of scheduling algorithms with reservations. In: Proceedings of IPDPS, pp. 1–8. IEEE Computer Society, Los Alamitos (2007) 17. Pascual, F., Rzadca, K., Trystram, D.: Cooperation in multi-organization scheduling. In: Kermarrec, A.-M., Bougé, L., Priol, T. (eds.) Euro-Par 2007. LNCS, vol. 4641, Springer, Heidelberg (2007)
Tightness Results for Malleable Task Scheduling Algorithms
Ulrich M. Schwarz
Institut für Informatik, Christian-Albrechts-Universität zu Kiel, Olshausenstraße 40, 24098 Kiel, Germany
[email protected]
Abstract. Malleable tasks are a way of modelling jobs that can be parallelized to get a (usually sublinear) speedup. The best currently known approximation algorithms for scheduling malleable tasks with precedence constraints are a) a 2.62-approximation for certain classes of precedence constraints such as series-parallel graphs [1], and b) a 4.72-approximation for general graphs via linear programming [2]. We show that these rates are tight, i.e. there exist instances that achieve the upper bounds.
1 Introduction
The concept of malleable tasks is a powerful way of modelling large-scale real-life applications because it permits two kinds of parallel execution: on the one hand, different jobs can be executed concurrently on different machines. On the other hand, each job itself is parallelizable in the sense that assigning more machines to it will result in faster execution. In the setting we consider here, data flow between the jobs is modeled by a directed acyclic graph (dag) of precedence constraints, but we do not impose communication delays. The goal is to minimize the makespan, i.e. the completion time of the job that finishes last.
Related work. The key works that this paper hinges upon are the 2.62-approximation¹ of Lepère et al. [1] and Jansen and Zhang’s 4.72-approximation [2] based on the procedure for solving the subproblem allotment given by Skutella [3]. However, the notion of malleable tasks is older, going back to Du and Leung [4] and Turek et al. [5]. For the simplest setting, no precedences at all, the best algorithms currently known are a (1.5 + ε)-approximation due to Mounié et al. [6] and an AFPTAS by Jansen [7]. Malleable tasks with tree constraints have also been studied on hierarchical machine architectures, such as SMP systems [8]. An overview of existing results on malleable task scheduling can be found in Chaps. 25 and 26 of [9] (where malleable tasks in the sense used in this paper are referred to as moldable), and more recent results appear in Chap. 45 of [10].
Research of the author was supported by EU research project AEOLUS: Algorithmic Principles for Building Efficient Overlay Computers, EU contract number 015964.
¹ All numerical values in this paper are rounded to three significant figures.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1059–1067, 2008. © Springer-Verlag Berlin Heidelberg 2008
Our contribution. We show that both the approximation rate of 2.62 given in [1] and the approximation rate of 4.72 given in [2] are tight. This paper is structured as follows: in section 2, we will formally define the problem and the two subordinate problems allotment and packing. Section 3 shows that the ratio proven for the packing step in [1] is tight if an optimal allotment is known. Section 4 shows how this argument can be extended to approximation algorithms for the allotment problem that have been proposed by Skutella and Jansen & Zhang.
2 Definitions
The problem MT-Scheduling can be defined as follows: we are given n jobs J1, ..., Jn and a set of m machines. Each job can be executed on any number of machines, but this number is fixed once the job has started. Hence, we may represent each job Ji by its execution time function pi : {1, ..., m} → N. The following constraints are made on the processing time function:
$$\forall k \in \{1, \dots, m-1\}: \quad p_i(k) \ge p_i(k+1) \tag{1}$$
$$\forall k \in \{1, \dots, m-1\}: \quad k\,p_i(k) \le (k+1)\,p_i(k+1), \tag{2}$$
modelling that speedup cannot be detrimental as the number of processors increases, but cannot be superlinear either. Furthermore, the jobs we consider are linked by precedence constraints. The objective is to find a schedule of minimal makespan. In the three-field notation of [11], our problem is P | fctn_j, prec | Cmax. The algorithms we consider here tackle the problem MT-Scheduling in two steps: first of all, the number of machines to be used by each job is fixed (Allotment). This fixes the shape of each job, and jobs can then be scheduled with algorithms for the problem P | size_j, prec | Cmax (Packing). Of course, it is desirable to find a “close-to-optimal” allotment, as can be defined by the following min-max problem:
$$LB := \min_{\alpha} \max_{P} \left\{ \frac{1}{m} \sum_{i=1}^{n} \alpha(i)\,p_i(\alpha(i)),\; \sum_{J_i \in P} p_i(\alpha(i)) \right\} \tag{3}$$
where α runs over all possible allotments and P over all execution paths in the precedence dag. This is in essence the usual lower bound on the makespan given by total area of the jobs and length of a critical path. The allotment problem, just as its generalization, the discrete time/cost tradeoff problem, is NP-hard. However, FPTAS exist if the precedence dag has a suitable structure, in particular if it is series-parallel, of bounded width, or all its prime modules in the modular decomposition tree are of bounded width [1,12,13]. The packing is then conducted with a straightforward extension of Graham’s List Scheduling algorithm as given in Fig. 1. We note that it is necessary to limit the maximal number of machines a single job can be allotted to some number μ ≤ (m+1)/2, which is the main parameter of the algorithm. Then, the following result holds:
Theorem 1 (Theorem 4 in [1]). Given an allotment α that is a q-approximate solution to (3), the algorithm creates a schedule of length at most
$$\max\left\{ \frac{m}{\mu},\; \frac{2m - \mu}{m - \mu + 1} \right\} q\,LB, \tag{4}$$
which is minimized at 2.62 for μ → 0.382m as m → ∞.

Input: instance J1, ..., Jn, ≺, allotment α, truncation parameter μ
1. for all jobs Ji, i = 1, ..., n: if α(i) > μ set α(i) := μ
2. while unscheduled jobs exist:
   (a) find the earliest point of time at which some job Jj may start (i.e. there are enough idle machines and all its predecessors are finished)
   (b) schedule Jj at that time
Fig. 1. Algorithm List Schedule
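The List Schedule algorithm of Fig. 1 can be written down in a few lines of Python. The sketch below is an illustration under stated assumptions rather than the implementation of [1]: jobs are indexed 0..n−1, `p[i]` is a callable returning the (positive) execution time of job i on k machines, `preds[i]` is the set of predecessors of job i, the precedence relation is acyclic, and μ ≤ m. Ties between ready jobs are broken by index, which is our choice.

```python
def list_schedule(p, preds, alpha, m, mu):
    """Truncate the allotment to mu, then start every ready job that fits,
    at each event time. Returns (start_times, makespan)."""
    n = len(p)
    alloc = [min(alpha[i], mu) for i in range(n)]      # step 1: truncation
    dur = [p[i](alloc[i]) for i in range(n)]
    start, finish = {}, {}
    t = 0
    while len(start) < n:                               # step 2
        # machines used by jobs that are still running at time t
        busy = sum(alloc[j] for j in start if finish[j] > t)
        for i in range(n):
            if i in start:
                continue
            ready = all(j in finish and finish[j] <= t for j in preds[i])
            if ready and alloc[i] <= m - busy:
                start[i], finish[i] = t, t + dur[i]
                busy += alloc[i]
        if len(start) < n:
            # jobs can only become startable when some job finishes
            t = min(f for f in finish.values() if f > t)
    return start, max(finish.values())
```

Because machines are freed and predecessors completed only at finish times, stepping through these event times finds the earliest feasible start of each job, as required by step 2(a).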
3 Tightness in the Packing Step
In this section, we will show that the rate of 2.62 given in [1] is tight, if we are given a specific allotment. This immediately gives a tightness result for the algorithms that calculate almost optimal allotments for special cases, if we set execution times as follows: consider a job J that we want to take up mj machines during a running time of pj(mj). We set
$$p_j(i) = \begin{cases} p_j(m_j) & i \ge m_j \\ \dfrac{m_j}{i}\,p_j(m_j) & i < m_j. \end{cases} \tag{5}$$
It is easily observed that an optimal allotment will w.l.o.g. assign mj machines. Hence, we need only consider packings of jobs of fixed size. It is a trivial observation that the bound of $\frac{m}{\mu} LB$ is tight for a single job with linear speedup. Hence, we will concentrate on the bound $\frac{2m-\mu}{m-\mu+1} LB$. One core idea of bad instances for precedence-constrained scheduling has remained since the very beginning [14]: construct an instance so that the following observations hold:
– there is an execution path of length close to $C^{OPT}_{\max}$ and other, “filler” jobs such that the total sum of processing times is close to $m\,C^{OPT}_{\max}$.
– in the optimal schedule, the critical path is started immediately, while in the algorithmic non-optimal schedule, it is delayed as far as possible.
Since the critical path takes up comparatively little area, the filler jobs cannot profit much from few extra machines available to them, and the overall approximation ratio is slightly less than 2. In our setting, there are additional effects we can exploit: jobs can be assigned a number of machines different from 1, and this assignment is done statically before the scheduling starts.
In particular, this allows machines to be idle even though jobs are available, because all available jobs want more machines than are currently idle. Since we truncate the number of machines per job to at most μ, the worst case is obviously that all jobs that are ready would use μ machines, but only μ − 1 machines are idle (and m − μ + 1 are busy). This seems to conflict with our previous description: if many machines are idle, it is difficult to delay the critical path unless this path itself takes up a lot of space, and this space is then not available to filler jobs. However, this is not so: since we consider precedence constrained jobs, the critical path does not necessarily consist of one job only, but of many, and each job is assigned machines separately. In particular, it is sufficient that the first job is assigned μ machines (or m in the optimal allotment, truncated to μ in the schedule’s allotment), while later jobs get few machines. This setting is shown in Fig. 2.
Fig. 2. Coarse structure of optimal and bad instances: (a) optimal schedule, (b) worst-case schedule
We will first analyze what happens in the ideal case:
Lemma 1. If there exists a set C of filler jobs that can either be scheduled on m − 1 machines or, with the same allotment, on m − μ + 1 machines, then the approximation rate of the scheduling algorithm is no less than (2m − μ)/(m − μ + 1).
Proof. We will consider only the case that the “blocking job” A takes one unit of time and is not changed by truncation. This is sufficient to obtain the claim as $C^{OPT}_{\max} \to \infty$. We note that in the optimal schedule, |B| = |C|, so we get $C^{OPT}_{\max} = 1 + |B|$, while the algorithm’s makespan is
$$C_{\max} = 1 + |B| + \frac{m-1}{m-\mu+1}|C| = 1 + |B| + \frac{m-1}{m-\mu+1}|B| \tag{6}$$
by volume arguments. As |B| → ∞, we now obtain
$$\frac{C_{\max}}{C^{OPT}_{\max}} = 1 + \frac{m-1}{m-\mu+1} \cdot \frac{|B|}{1+|B|} \to 1 + \frac{m-1}{m-\mu+1} = \frac{2m-\mu}{m-\mu+1}, \tag{7}$$
as claimed.
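The convergence in (7) and the choice μ ≈ 0.382m in Theorem 1 can be checked numerically. The following Python sketch only evaluates the formulas (4), (6) and (7); the function names are ours, and `best_mu` is a brute-force search over the admissible range μ ≤ (m+1)/2, added purely for illustration.

```python
def bound(m, mu, q=1.0):
    """Factor multiplying LB in the bound (4)."""
    return q * max(m / mu, (2 * m - mu) / (m - mu + 1))


def lemma1_ratio(m, mu, b):
    """C_max / C_max^OPT for the instance of Lemma 1 with |B| = |C| = b."""
    optimal = 1 + b
    algorithm = 1 + b + (m - 1) / (m - mu + 1) * b
    return algorithm / optimal


def best_mu(m):
    """Truncation parameter minimising the bound (4), by exhaustive search."""
    return min(range(1, (m + 1) // 2 + 1), key=lambda mu: bound(m, mu))
```

Asymptotically, both terms of (4) balance when x = μ/m solves x² − 3x + 1 = 0, i.e. x = (3 − √5)/2 ≈ 0.382, where the bound takes the value 1/x ≈ 2.62.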
It remains to demonstrate a suitable construction for the set of filler jobs C. This can be simplified further: since we want m−1 machines busy in one configuration
and m − μ + 1 in the other, much of the structure can be the same for either. The outline of our construction is shown in Fig. 3. We will explain the construction by focussing on (b). Fixing k such that m − 1 = kμ + r for some 0 ≤ r < μ, the filler set C consists of k + 1 different sets: the sets labelled Rj, L1,j, ..., Lk−1,j are simple: they contain chains of jobs (j = 1, ..., ℓ) that are executed in lockstep, i.e. all first jobs precede all second jobs and so on:
$$\forall\, 1 \le i, i' \le k:\ \forall\, 1 \le j < \ell: \quad R_j, L_{i,j} \prec R_{j+1}, L_{i',j+1} \tag{8}$$
The jobs forming set S are not as tightly integrated, but depend on the R and L jobs:
$$\forall\, 1 \le i \le k:\ \forall\, 1 \le j < \ell: \quad R_j, L_{i,j} \prec S_{j+1}. \tag{9}$$
All these jobs have the same length, but different numbers of machines assigned to them: the R jobs get r machines, the L jobs get μ machines and the S jobs get 2 machines. If all jobs without predecessors in set D need μ machines, the schedule outlined in Fig. 3(b) is indeed feasible for the algorithm: by selecting one job from each chain, (k − 1)μ + r + 2 = m − μ + 1 machines are used, only μ − 1 are idle, so no job of D would be able to start. Of course, there is a more efficient way of scheduling: delaying execution of all the Sj jobs. Keeping in mind that in the optimal schedule, we are already running job B on one machine, this means jobs B, R, L take up m − μ machines, so that we can schedule the jobs from set D.
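The machine counts used in this construction are easy to verify for concrete values. The short Python sketch below (function name and return layout are ours) computes k and r from m − 1 = kμ + r and returns the number of busy machines in the algorithm's configuration and in the optimal one.

```python
def filler_configuration(m, mu):
    """Machine usage of the filler set C for given m and mu."""
    k, r = divmod(m - 1, mu)            # m - 1 = k * mu + r, 0 <= r < mu
    # algorithm: one R job (r), k-1 L jobs (mu each) and one S job (2)
    busy_algorithm = (k - 1) * mu + r + 2
    # optimum: job B (1), one R job (r) and k-1 L jobs (mu each); S is delayed
    busy_optimal = 1 + r + (k - 1) * mu
    return {
        "k": k,
        "r": r,
        "busy_algorithm": busy_algorithm,
        "idle_algorithm": m - busy_algorithm,
        "free_for_D_in_optimum": m - busy_optimal,
    }
```

For every valid pair (m, μ) the algorithm's configuration leaves exactly μ − 1 machines idle, one too few for a job of D, whereas the optimal configuration leaves exactly μ machines for D.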
Fig. 3. Configurations of the “filler set” C: (a) on m − 1 machines, (b) on m − μ + 1 machines
So, we have reduced our problem to finding a suitable set D that can be packed either on μ machines (in the optimal solution) or on never less than m − μ + 1 machines to prevent job A from starting in the algorithm’s solution. We must keep in mind that the algorithm will have all m machines at its disposal, though, so we must be careful that no job in D may be executed earlier. This is fulfilled if we choose the construction depicted in Fig. 4. As we can see, the jobs fall into groups, and there are three kinds of group: the first kind occurs only once and consists of three large jobs to run on μ machines each, followed by r + 1 small jobs to run on one machine each. The second kind is similar, but there are only two large jobs, not three. Finally, the last kind consists of two large jobs only. Now, there are two ways of packing the jobs, as desired: in an optimal solution (where we want to pack into a strip of μ machines), we first pack all large jobs and then all small jobs. (We ideally choose the number of groups of the second kind such that the total number of small jobs is a multiple of μ.) The algorithm, with m machines available, may first pack the three large jobs of the first group. By the choice of μ, we know that 3μ ≥ m − μ + 1. Then, the algorithm may always pack the small jobs from one group with the large jobs from the next, which always keeps m − μ + 1 machines busy.
Fig. 4. The jobs of set D
Fig. 5. Possible schedules of set D
This construction has all the desired properties. The first-kind group of jobs in D uses slightly more area than anticipated, however, this is a constant loss whereas the number of jobs and makespan of the schedule can be increased without bounds by increasing the number of second-kind groups.
4 Tightness in the Linear-Programming Case
As noted above, it is in general not feasible to solve the allotment problem exactly, since it is NP-hard. A linear programming approach was first proposed by Skutella [3], and later refined by Jansen and Zhang [2] (which see for a detailed
description). Performance of the algorithm depends on a rounding parameter ρ ∈ (0, 1) in the following way:
Lemma 2 (Lemma 3.2, Theorem 7.1 in [3]). The rounding process increases the length of any job by a factor of at most $\rho^{-1}$ and the area covered by any job by a factor of at most $(1-\rho)^{-1}$.
Jansen and Zhang analyze a min-max problem to arrive at an overall approximation ratio (for allotment and scheduling combined) of 4.73 by asymptotically setting ρ = 0.43 and μ = 0.271m, improving over an earlier ratio of 5.24 with ρ = 0.5 and μ = 0.382m [1]. Intuitively, we might expect the following change in a bad instance based on the one in Fig. 2: in addition to all other effects, (i) the length of B might increase by a factor of $\rho^{-1}$, and (ii) the area of C might increase by a factor of $(1-\rho)^{-1}$. Then, (7) becomes
$$\frac{C_{\max}}{C^{OPT}_{\max}} \to \frac{1}{\rho} + \frac{1}{1-\rho} \cdot \frac{m-1}{m-\mu+1}, \tag{10}$$
which equals the upper bound of 4.72 for ρ and μ as defined above. First, we turn to point (i), creating a path B that will stretch by a factor of $\rho^{-1}$. We will have to consider three instances simultaneously: the fractional LP solution, the rounded LP solution, and an optimal integral solution. (After all, we want to compare the optimal integral solution to the rounded LP solution.) For ease of exposition, we will assume that the path consists of 100 jobs, each of which has linear speedup and a maximal length of 1. If the fractional LP solution assigns an execution time of ρ to each job (in particular, the longest subjob), rounding will stretch every job to a length of 1, as desired. However, an integral solution could assign two machines to 58 of the jobs and three machines to the other 42 jobs. Then, the total length of the path in this solution is
$$\frac{58}{2} + \frac{42}{3} = 43 = 100\rho, \tag{11}$$
just as in the fractional solution. As we can see in Fig. 6, our overall construction is such that the LP need not try to shorten the path any more, nor is the total amount of work used by the fractional solution important, because the critical path passes through a job E that is not malleable. For point (ii), let us assume we want a job J that takes up mj machines in the rounded LP solution. We define the job to have “almost-constant” execution time, i.e.
$$J(i) = c - i\varepsilon \tag{12}$$
for some suitably small ε and constant c. Then, we can consider m − 1 subjobs to be (almost) isomorphic, ranging from cost c for execution time 0 to cost 0 for execution time c. If the linear program now assigns a cost (1 − ρ)c to mj of the subjobs, this results in an execution time of c − mj ε. The rounding will increase the cost of the job to mj c, as desired. However, the optimal integral solution just needs to assign a single machine to J to (almost) match the execution time. This is sufficient since J is not on any critical path.
Fig. 6. Coarse structure of critical instance for LP programming: (a) good schedule, (b) bad schedule
This shows that we can re-use the construction of C and D given above; however, we now only need an area of (1 − ρ)(m − 1)|B| for this in the fractional solution. This means that asymptotically, we have an area of ρm|B| at our disposal in the fractional solution, which will be rounded to $\frac{\rho}{1-\rho}\,m|B| \approx 0.75\,m|B|$. Considering that we must keep m − μ + 1 = 0.73m machines busy at all times, we will just use a simple schedule for brevity, arriving at a rate of 1/ρ + (m − 1)/(m − μ + 1) + 1 = 4.69. A construction similar to that of D will add a factor of $\frac{0.75}{0.73}$ instead of 1, making our result tight, but has been omitted for space reasons.
5 Conclusion and Future Work
We have given instances that demonstrate tightness of approximation ratio for two two-phase algorithms for scheduling malleable tasks under precedence constraints. These are currently the best algorithms known for their respective classes of input. While we have only shown the construction for the concrete choice of parameters given in the literature, it is general enough to be extended to different choices of rounding and truncation parameters, so it is unlikely that parameter tweaking will improve the algorithms further. A natural next step is to suitably sort the list of jobs in the packing phase by a lower bound similar to (3). It turns out to be difficult to generate bad instances with such a sorting; on the other hand, LPT and critical path heuristics do not lead to asymptotical improvement of the worst-case bound in many cases, so whether this will result in improvements here remains an open question.
References 1. Lepère, R., Trystram, D., Woeginger, G.J.: Approximation algorithms for scheduling malleable tasks under precedence constraints. Int. J. Found. Comput. Sci. 13(4), 613–627 (2002) 2. Jansen, K., Zhang, H.: An approximation algorithm for scheduling malleable tasks under general precedence constraints. In: Deng, X., Du, D.-Z. (eds.) ISAAC 2005. LNCS, vol. 3827, pp. 236–245. Springer, Heidelberg (2005) 3. Skutella, M.: Approximation algorithms for the discrete time-cost tradeoff problem. Technical Report 514/1996, Technische Universität Berlin (1996) 4. Du, J., Leung, J.Y.T.: Complexity of scheduling parallel task systems. SIAM J. Disc. Math. 2(4), 473–487 (1989) 5. Turek, J., Wolf, J.L., Yu, P.S.: Approximate algorithms scheduling parallelizable tasks. In: Proceedings of SPAA 1992, pp. 323–332 (1992) 6. Mounie, G., Rapine, C., Trystram, D.: A 3/2-approximation algorithm for scheduling independent monotonic malleable tasks. SIAM J. Comp. 37(2), 401–412 (2007) 7. Jansen, K.: Scheduling malleable parallel tasks: An asymptotic fully polynomial time approximation scheme. Algorithmica 39(1), 59–81 (2004) 8. Dutot, P.F.: Hierarchical scheduling for moldable tasks. In: Cunha, J.C., Medeiros, P.D. (eds.) Euro-Par 2005. LNCS, vol. 3648, pp. 302–311. Springer, Heidelberg (2005) 9. Leung, J.Y.T. (ed.): Handbook of Scheduling. CRC Press, Boca Raton (2004) 10. Gonzalez, T.F. (ed.): Handbook of Approximation Algorithms and Metaheuristics. Chapman & Hall/CRC, Boca Raton (2007) 11. Graham, R.L., Lawler, E.L., Lenstra, J.K., Rinnooy Kan, A.H.G.: Optimization and approximation in deterministic sequencing and scheduling: a survey. Annals of Discrete Mathematics 5 (1979) 12. Grigoriev, A., Woeginger, G.J.: Project scheduling with irregular costs: complexity, approximability, and algorithms. Acta Informatica 41(2–3), 83–97 (2004) 13. Schwarz, U.M.: Design and analysis of approximation algorithms for certain scheduling problems. Diploma thesis, University of Kiel (2006) 14. Graham, R.L.: Bounds for certain multiprocessor anomalies. Bell Systems Tech. J. 45, 1563–1581 (1966)
Universal Grid Client: Grid Operation Invoker
Tomasz Bartyński1,2, Maciej Malawski1,2, Tomasz Gubala2,3, and Marian Bubak1,2
1 Institute of Computer Science, AGH, Mickiewicza 30, 30-059 Kraków, Poland
2 Academic Computer Centre – CYFRONET, Nawojki 11, 30-950 Kraków, Poland
3 Section Computational Science, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam
Abstract. In this paper we present a high-level approach to programming applications which use the Grid from the client side. This study is devoted to resolving the need for a language that would allow expressing the application logic in a precise way and combining it with the capability of remote access to powerful Grid resources and complex computational software. We introduce the concept of a universal Grid client - a Grid Operation Invoker (GOI). It provides a client-side interface to computational resources that use various middleware packages within a high-level scripting language. The system prototype is written in JRuby [1] which is a Java implementation of a popular object-oriented scripting language interpreter – Ruby [2]. We also present issues that have emerged in the course of work on GOI and which we have found challenging. Finally, we discuss Grid applications implemented in JRuby, proving that GOI can be used to solve highly complicated and computationally-intensive problems. Keywords: high level Grid programming, Grid middleware, operation invocation, JRuby.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1068–1077, 2008. © Springer-Verlag Berlin Heidelberg 2008

1 Introduction

Grid infrastructures offer a great amount of computational power that can be used to solve complex scientific problems. Moreover, a variety of software providing rich functionality is deployed on Grid infrastructures and in computational centers. In spite of the advanced middleware technologies that provide access to computational power and allow publishing services, it remains difficult and inconvenient to harness the capabilities of the Grid, because resources are distributed while middleware technologies are not compatible with one another and cannot cooperate. In our opinion, users should have a friendly tool providing a uniform interface to Grid resources and enabling rapid development of applications. Such a tool should not limit the user to creating simple workflow applications, but enable constructing Grid software containing full application logic. In addition, it needs to remain easy to master for a person having at least basic programming
experience and it should be based on a proven, reliable and widely supported technology. All these requirements could be met by using a simple but powerful modern scripting language [3]. Such languages provide full expressiveness in terms of creating software and enable fast application development. The language should be supported by a library allowing seamless access to Grid resources, hiding the complexity of underlying middleware and communication protocols.

In order to meet these requirements, we propose a solution called the Grid Operation Invoker (GOI). It provides a standardized interface for operation execution on Grid systems, thus enabling coherent usage of Grid resources. In addition, GOI adds extra programming features to operation invocation and provides a low-level API for direct access to specific resources. Since various Grid computations use different programming models, GOI should integrate multiple technologies to support these models: Web Services (for quick stateless tasks), Web Service Resource Framework (for stateful computations) and distributed components (for dynamic deployment and asynchronous interaction with computations). It will support established and reliable job-oriented middleware technologies, like gLite (LCG) [4] and Unicore [5], to enable access to powerful Grid infrastructures like EGEE [6] and DEISA [23]. Moreover, GOI must be easily extensible: adding support for emerging middleware technologies should require as little effort as possible.

This paper is organized as follows. First, we give the background on the difficulties in utilizing Grid systems due to their heterogeneity. Next, we briefly discuss middleware technologies and existing solutions. Subsequently, we describe the Grid Operation Invoker as the solution to the defined problem, followed by a discussion of the results of applying our tool to real problems. The last section summarizes our work.
2 Background – Middleware Technologies
The main problem which GOI addresses is the heterogeneity of middleware technologies available to Grid application developers. In this section we describe four technologies that provide access to computation in a heterogeneous environment with distinct interaction models.

Web Services (WS) is a well-defined, standardized technology [7], approved by both the industry and academic communities as a way of publishing remote APIs, integrating distributed systems and building Grid systems [8]. WS provides Remote Procedure Call (RPC) semantics for stateless service invocation using the SOAP protocol [9], as well as programming language and platform independence. Numerous WS frameworks exist and support publishing applications implemented in various programming languages as services. Besides, WSDL [10] enables standardized description of services, thus facilitating service discovery and integration of services into larger systems.

Web Service Resource Framework (WSRF) is a family of specifications published by OASIS [11]. It defines an open framework for modeling and
accessing stateful resources using Web Services. The state of the service is modeled as a stateful resource [12], which is described with Resource Properties. Of the diverse WSRF implementations, Apache WSRF [13] and Globus Toolkit 4 [14] are the most broadly accepted and used. There are also implementations in Python [15] and Perl [16].

Grid Components. The distributed component model is an interesting alternative to service-oriented middleware technologies. A component is described by its interfaces (called ports or facets), which define the functionality it provides as well as its dependencies. The component model is interesting for Grid applications due to its deployment and composition mechanisms. There is a wide range of standards for component-based systems, such as CCA [17], GridCCM [18] and CoreGRID GCM [19]. As our previous research suggests, the H2O [20] approach, which introduces the concept of lightweight service containers, combined with the CCA model can be used to build a component framework for metacomputing [21]. Therefore MOCCA [22] is the first component framework we decided to support.

Batch processing – Jobs. Job-oriented middleware packages are the most proven and broadly supported technologies in Grid computing. They provide reliable mechanisms to submit computations solving large-scale problems as jobs. Such packages are deployed on huge production Grid infrastructures. For instance, EGEE [6] uses LCG [4] middleware, while DEISA [23] uses UNICORE [5].
3 Overview of High-Level Grid Programming Toolkits
Numerous solutions exist that provide uniform access to computation. Most of them, however, are focused on one middleware technology. For instance, the Web Service Invocation Framework [24] provides a Java API for invoking services regardless of how the service is implemented and accessed. It enables interaction with abstract representations of Web services through their WSDL descriptions instead of working directly with the Simple Object Access Protocol APIs; thus developers can work with the same programming model.

The Grid Application Toolkit (GAT) [25], currently evolving into the Simple API for Grid Applications, provides an object-oriented, language-neutral invariant API to basic Grid use cases such as operations on files, monitoring and events, resources, jobs, information exchange, error handling and security.

NetSolve/GridSolve [26] is an RPC-based client/agent/server system. Client-side libraries query an agent for a list of servers and then delegate the execution of an operation to a selected server, providing input parameters. The server executes the appropriate service and returns output parameters or an error status to the client.

Two other interesting projects are GridSAM [27] and the Application Hosting Environment (AHE) [28]. The former provides a Web service for job submission and monitoring, while AHE is a lightweight hosting environment based on the ideas of WEDS [29]. It allows deploying applications onto computational Grids and representing them as stateful Web Services.
In spite of the diversity of existing solutions, the problems mentioned in Section 1 remain unsolved. None of the solutions satisfies all of the requirements listed therein. Some cannot adapt to dynamically changing Grid environments. What is more, most projects are focused on supporting one middleware technology only and cannot cooperate with other solutions. Systems like NetSolve/GridSolve require a certain infrastructure where parts of the system need to be deployed. Although the GAT approach is very interesting and has valuable concepts from our point of view, it does not support RPC semantics and it does not allow rapid development of high-level applications.
4 Grid Object as an Abstraction over Heterogeneous Environment
The described diversity of middleware technologies, and the fact that they do not interoperate with one another, calls for a special layer of virtualization that provides a uniform and abstract method of describing and accessing distinct technologies. For this purpose we introduce the concepts of Grid Object, Grid Object Class and Grid Object Instance.

Grid Object is a client-side abstract representative of a computational service, a component or an application available through a job submission infrastructure. Grid Objects offer programming features such as stateless or stateful interaction modes, synchronous or asynchronous invocation of operations, concurrent execution of operations and public or private sharing modes.

Grid Object Instance is a specific type of operational software that can be accessed remotely in its specific protocol.

Grid Object Class is a set of Grid Object Instances that provide exactly the same functionality through identical interfaces in terms of inputs and outputs, though not necessarily in terms of communication protocol. From the end user's point of view, instances of one class are identical. The semantics of these concepts can be compared to the concepts of objects and classes in object-oriented programming.

We believe that developers of Grid applications should focus on solving the problem and avoid the burden of interfacing with various middleware technologies. On the other hand, different middleware technologies are needed, as there is no single approach to fit all kinds of computations. The developer should therefore only specify which functionality is required, by choosing the appropriate Grid Object Class. Finding an optimal Grid Object Instance that satisfies the developer's needs and communicating with the selected instance in its specific protocol should be done automatically and should be transparent for the developer.
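The relationship between a Grid Object Class and its Grid Object Instances can be sketched in plain Ruby. This is only an illustration of the concepts: the GridObjectInstance and GridObjectClass names, the fields, the second instance identifier and the endpoint addresses are hypothetical and are not taken from GOI.

```ruby
# Hypothetical data model; names and fields are illustrative assumptions.
GridObjectInstance = Struct.new(:id, :technology, :endpoint)

# A Grid Object Class groups instances that offer the same operations,
# whatever protocol each instance speaks.
class GridObjectClass
  attr_reader :name, :operations, :instances

  def initialize(name, operations)
    @name = name
    @operations = operations
    @instances = []
  end

  def add_instance(instance)
    @instances << instance
    self
  end

  # Technologies through which the class is reachable.
  def technologies
    @instances.map(&:technology).uniq
  end
end

classifier_class = GridObjectClass.new('gridspace.weka.OneRuleClassifier',
                                       [:train, :classify])
classifier_class
  .add_instance(GridObjectInstance.new('instanceID#7', 'WebService',
                                       'http://example.org/ws/classifier'))
  .add_instance(GridObjectInstance.new('instanceID#8', 'Mocca',
                                       'mocca://example.org/classifier'))

puts classifier_class.technologies.inspect
```

Both instances expose the same operations, so a client that only names the class does not need to know which technology ends up serving the call.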
Using heterogeneous services distributed all over the world and invoking operations on them should be as simple and concise as in the sample code presented below:

    gObj = GObj.create('gridspace.weka.OneRuleClassifier')
    result = gObj.classify(data)

The developer should also have the opportunity to choose the concrete instance to be used, for example due to accounting issues or software quality and reliability factors. In such a case, a developer creates a Grid Object for the concrete instance:
    gObj = GObj.createInstance('instanceID#7')

In order to fulfill the requirements for programming environments specified in Section 1, the universal Grid client needs data that describes a Grid Object Class in terms of the functionality it provides and lists the members of the class, as well as data providing the technical specification of Grid Object Instances. As a result, the requirements for the Grid Operation Invoker can be summarized as follows:

– functional:
  • provide uniform and transparent access to computation, covering the complexity of handling invocations of operations in various protocols,
  • enable automatic selection of a Grid Object Instance, as well as choosing concrete instances,
  • add programming features such as asynchronous invocation of synchronous operations and concurrency between operations;
– nonfunctional:
  • be easy to use,
  • remain a lightweight client-side tool,
  • provide convenient mechanisms for adding support for emerging middleware technologies,
  • be easily configurable for use with various registries and optimizers.
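One way to picture "asynchronous invocation of synchronous operations" is a thin wrapper that runs a blocking call in a background thread and returns a handle to the pending result. The sketch below uses only Ruby's standard Thread class; AsyncResult and slow_square are hypothetical names, and nothing here is claimed to be how GOI implements the feature.

```ruby
# Hypothetical wrapper, not the GOI API: runs a blocking operation in a
# background thread and hands back a handle for the pending result.
class AsyncResult
  def initialize(&operation)
    @thread = Thread.new(&operation)
  end

  # Blocks until the operation has finished, then returns its result.
  def value
    @thread.value
  end

  # True once the background thread has terminated.
  def ready?
    !@thread.alive?
  end
end

# Stand-in for a synchronous remote operation.
slow_square = ->(x) { sleep 0.05; x * x }

# Both calls are started before either result is requested,
# so the two operations run concurrently.
pending = [3, 4].map { |x| AsyncResult.new { slow_square.call(x) } }
results = pending.map(&:value)
puts results.inspect
```

The caller decides when to block: it can start several operations, continue with local work, and ask for the values only when they are needed.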
5 Design and Implementation of Grid Operation Invoker
Following an analysis of the most commonly used modern scripting languages, such as Ruby [2], Python [30] and Perl [31], we have chosen to use the Ruby language. Ruby seems to be the best choice due to its simplicity, ease of creating complex applications, growing popularity, good support for distributed computing and clear syntax. In addition, Ruby allows dynamic execution of string expressions that contain programming commands, thus enabling metaprogramming. Furthermore, it provides good support for object-oriented programming and has an implementation in Java, called JRuby, which allows using Java classes within a Ruby script.

Our tool is a JRuby library that enriches the standard interpreter with mechanisms providing uniform and transparent access to heterogeneous Grid resources within an interpreted script. It contains both the Ruby code that implements the logic of our system, and Java libraries that are required for some technologies and that enable client-side communication with Grid Object Instances. The idea of using our system is depicted in Fig. 1.

Two kinds of data related to Grid Object Instances are required by GOI in order to realize its objectives. The first is the unique identification of the optimal instance that should be used. This identification is then used to retrieve the second, technology data that describes a concrete instance in terms of its communication protocol, endpoint address, interface and other technical details. These dependencies can be satisfied either by a simple local registry and optimizer or by remote systems.
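Ruby's metaprogramming facilities make it possible for a client-side object to accept operation names that are only known at run time. The following sketch illustrates the idea with method_missing rather than string evaluation; DynamicProxy and EchoTransport are hypothetical names, and the transport merely echoes the call instead of contacting a remote service.

```ruby
# Hypothetical proxy: forwards any supported operation name to a transport.
class DynamicProxy
  def initialize(transport)
    @transport = transport
  end

  # Called by Ruby when the proxy has no method with the given name.
  def method_missing(operation, *args)
    if @transport.supports?(operation)
      @transport.invoke(operation, args)
    else
      super
    end
  end

  # Keeps respond_to? consistent with method_missing.
  def respond_to_missing?(operation, include_private = false)
    @transport.supports?(operation) || super
  end
end

# Stand-in for a middleware-specific client; a real one would send a request.
class EchoTransport
  def supports?(operation)
    [:classify, :train].include?(operation)
  end

  def invoke(operation, args)
    "#{operation}(#{args.join(', ')})"
  end
end

proxy = DynamicProxy.new(EchoTransport.new)
puts proxy.classify('data')
```

Calling an operation the transport does not support still raises NoMethodError, so the proxy behaves like an ordinary Ruby object from the script's point of view.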
Fig. 1. Grid Operation Invoker system and Grid middleware
GOI has a modular architecture (see Fig. 2). The system is divided into functional blocks that communicate with each other via well-defined interfaces; therefore every component can be reused in other projects and the system is more customizable. Each component can be replaced by a different one that better suits the needs of end users, provided it implements the required interface. Moreover, the component-based design of the Grid Operation Invoker enables good extensibility of the system. GOI contains the following components:

– GObj provides users with an interface for creating client-side representatives of Grid Object Instances (Grid Objects). It is responsible for querying for the optimal instance that should be used, then for querying for technical information about it. Finally, GObj delegates requests for creating Grid Objects to the appropriate adapters. GObj is the only class that developers must interact with.
– RegistryClient delegates the query from GObj to a specific registry and propagates the result back to GObj.
– OptimizerClient delegates the query from GObj to a specific optimizer and propagates the result back to GObj.
– Adapters provide a set of adapter classes, each responsible for interaction with one concrete middleware technology. Each adapter produces representatives for Grid Object Instances in a specific technology, used inside the Grid application just like any other Ruby object.

Both the RegistryClient and the OptimizerClient can be easily replaced by other implementations, thus the Grid Operation Invoker system is customizable to work with various registries and optimizers. These four components, combined with the JRuby interpreter, constitute a customizable and extensible system that provides abstraction over the heterogeneous Grid environment. Developers use a simple interface to perform the following actions:
Fig. 2. Architecture of the Grid Operation Invoker
– to create a Grid Object of a certain class, in which case she/he writes

    gObj = GObj.create('gridspace.weka.OneRuleClassifier')

  In such a case GObj queries the optimizer to find an optimal instance published in any supported middleware technology. The optimizer returns a unique identification, which is used by GObj to query the registry for technical information. Once it is obtained, GObj checks the technology name, which determines which adapter will be used. Subsequently GObj verifies that the appropriate adapter is loaded, or loads it at runtime if necessary. Finally, the request for creation of a Grid Object is delegated to the adapter.
– to create a Grid Object for a concrete Grid Object Instance, in which case the code would be:

    gObj = GObj.createInstance('instanceID#7')

  In this scenario interaction with the optimizer is omitted, but the rest of the process is analogous to the one described above.

Once a Grid Object is created, the developer can use it in the same way as any Ruby object:

    prediction = gObj.classify(data)

In order to be easily extensible, the Grid Operation Invoker imposes a certain naming convention. Source file names and adapter class names must be based on a technology name concatenated with the appropriate suffix. Such a mechanism enables adding support for a new technology simply by placing the adapter's source file in the appropriate directory.
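The two creation scenarios and the naming convention can be sketched with local stand-ins for the optimizer and the registry. Everything below is an assumption made for illustration: the class names (SketchGObj, LocalOptimizer, LocalRegistry), the "Adapter" suffix, the second instance identifier and the endpoints are hypothetical, and the sketch resolves already defined classes by name instead of loading adapter source files at runtime.

```ruby
# All class names, identifiers and registry contents below are hypothetical.

# Adapters follow a naming convention: technology name + "Adapter".
class WebServiceAdapter
  def self.create(info)
    "WebService object for #{info[:endpoint]}"
  end
end

class MoccaAdapter
  def self.create(info)
    "Mocca object for #{info[:endpoint]}"
  end
end

# Local stand-in for an optimizer: maps a class name to an instance id.
class LocalOptimizer
  def initialize(choices)
    @choices = choices
  end

  def optimal_instance(class_name)
    @choices.fetch(class_name)
  end
end

# Local stand-in for a registry: maps an instance id to technical data.
class LocalRegistry
  def initialize(entries)
    @entries = entries
  end

  def technical_info(instance_id)
    @entries.fetch(instance_id)
  end
end

class SketchGObj
  def initialize(optimizer, registry)
    @optimizer = optimizer
    @registry = registry
  end

  # Class name -> optimal instance -> technical data -> adapter.
  def create(class_name)
    create_instance(@optimizer.optimal_instance(class_name))
  end

  # Concrete instance: the optimizer is skipped.
  def create_instance(instance_id)
    info = @registry.technical_info(instance_id)
    adapter = Object.const_get("#{info[:technology]}Adapter")
    adapter.create(info)
  end
end

registry = LocalRegistry.new({
  'instanceID#7' => { technology: 'WebService', endpoint: 'http://example.org/ws' },
  'instanceID#8' => { technology: 'Mocca', endpoint: 'mocca://example.org/c' }
})
optimizer = LocalOptimizer.new({ 'gridspace.weka.OneRuleClassifier' => 'instanceID#8' })
gobj = SketchGObj.new(optimizer, registry)

puts gobj.create('gridspace.weka.OneRuleClassifier')
puts gobj.create_instance('instanceID#7')
```

Because the adapter is looked up from the technology name alone, supporting a further technology in this sketch only requires defining one more class that follows the same naming pattern.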
6 Scientific Applications That Use GOI
So far, we have implemented an operational prototype of GOI. It uses the remote Grid Resource Registry and a Java Grid Application Optimizer. For now we support Web Service and MOCCA component technologies; prototype support for jobs on EGEE using gLite/LCG middleware is also implemented, and adapters for WSRF technology are scheduled for future work.

We have applied our system to constructing and executing real-life examples. In the code presented below we use two Grid Objects, dataProvider and classifier. The former provides a remote interface to retrieve integrated data from various databases in the ARFF format [32], and allows the user to split data into two parts and evaluate the similarity of two data sets. The latter is a data classifier; once trained with sample data, it predicts one attribute in the given data set using the One-Rule algorithm. The classifier uses the Weka [33] data mining library. In the script shown in Fig. 3, data is first retrieved and split into training and testing sets with dataProvider. Next, one set of data is used to train the classifier, which is then used to classify the other data set. Finally, dataProvider is used again to estimate the quality of classification. In this experiment, the dataProvider Grid Object is implemented as a Web Service, while the classifier is a stateful MOCCA component, dynamically deployed on the computing resource.
    require 'GridOperationInvoker/Core/g_obj'

    # Retrieve the data and split it into training and testing sets.
    dataProvider = GObj.create('gridspace.weka.WekaGem')
    A = dataProvider.loadDataFromDatabase(DATABASE, QUERY, USER, PASSWORD)
    B = dataProvider.splitData(A, 20)
    trainA = B.predictingData
    testA = B.testingData

    # Train the classifier on one set, then classify the other set.
    classifier = GObj.create('gridspace.weka.OneRuleClassifier')
    classifier.train(trainA, ATTRIBUTENAME)
    prediction = classifier.classify(testA)

    # Estimate the quality of the classification.
    predictionQuality = dataProvider.compare(testA, prediction, ATTRIBUTENAME)
Fig. 3. Sample data mining application script
The Grid Operation Invoker is also used as the core of the runtime system of the Virtual Laboratory (http://virolab.cyfronet.pl), developed for the ViroLab [34] project. The Virtual Laboratory is used for experiments such as the analysis of HIV genomic structure and the prediction of virus resistance to various types of drugs [35].
7 Summary
In this paper we discussed the need for a universal Grid client that would allow rapid Grid application development without the burden associated with using a heterogeneous environment. We defined requirements for such a system and introduced a unified approach to the problem. All these requirements can be
satisfied by our system, which is a JRuby interpreter enriched with our library providing uniform access to heterogeneous resources. The Grid Operation Invoker was applied to real-life problems, which demonstrated its usability for developing and running Grid applications. The experiment presented in Section 6 illustrates the ease of developing applications and shows the advantages of GOI over systems which only allow construction of simple workflows.
Acknowledgments This work has been made possible through the support of the European Commission ViroLab Project [34] Grant 027446. This research is also partly funded by EU IST Project CoreGRID and Polish SPUB-M grants. Maciej Malawski kindly acknowledges support from the Foundation for Polish Science.
References

1. JRuby: JRuby home (2007), http://jruby.codehaus.org/
2. Ruby: Ruby home (2007), http://www.ruby-lang.org/
3. Tate, B.A.: Beyond Java. O'Reilly, Sebastopol (2005)
4. CERN: LCG project (2006), http://www.cern.ch/lcg
5. Unicore Forum: Unicore (2004), http://www.unicore.org
6. EGEE Project: Website (2006), http://public.eu-egee.org/
7. W3C: Web services activity (2002), http://www.w3.org/2002/ws/
8. Foster, I., Berry, D., Djaoui, A., Grimshaw, A., Horn, B., Kishimoto, H., Maciel, F., Savva, A., Siebenlist, F., Subramaniam, R., Treadwell, J., Reich, J.V.: Open grid services architecture v1 draft document (2004), http://www.ggf.org/documents/Drafts/draft-ggf-ogsa-spec.pdf
9. W3C: SOAP version 1.2 W3C recommendation (2003), http://www.w3.org/TR/soap12-part0/
10. W3C: Web service description language (WSDL) 1.1 W3C note (2001), http://www.w3.org/TR/wsdl
11. OASIS: OASIS Web Service Resource Framework (2006), http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=wsrf
12. Foster, I., Frey, J., Graham, S., Tuecke, S., Czajkowski, K., Ferguson, D., Leymann, F., Nally, M., Sedukhin, I., Snelling, D., Storey, T., Vambenepe, W., Weerawarana, S.: Modeling stateful resources with Web services (2004), http://www.ibm.com/developerworks/library/ws-resource/ws-modelingresources.pdf
13. Apache foundation: Apache WSRF project (2005), http://ws.apache.org/wsrf/wsrf.html
14. Globus Alliance: The WS-resource framework (2007), http://www.globus.org/toolkit/
15. Globus Alliance: Python core (2007), http://dev.globus.org/wiki/Python_Core
16. Keown, M.M.: Wsrf::lite (2007), http://www.sve.man.ac.uk/Research/AtoZ/ILCT
17. Armstrong, R., et al.: The CCA component model for high-performance scientific computing. Concurr. Comput.: Pract. Exper. 18(2), 215–229 (2006)
18. Lacour, S., et al.: Deploying CORBA components on a computational grid. In: Emmerich, W., Wolf, A.L. (eds.) CD 2004. LNCS, vol. 3083, pp. 35–49. Springer, Heidelberg (2004)
19. CoreGRID Programming Model Virtual Institute: Basic features of the grid component model (assessed), Deliverable D.PM.04, CoreGRID (2006), http://www.coregrid.net
20. Kurzyniec, D., Wrzosek, T., Drzewiecki, D., Sunderam, V.: Towards self-organizing distributed computing frameworks: The H2O approach. Parallel Processing Letters 13(2), 273–290 (2003)
21. Malawski, M., Bubak, M., Placek, M., Kurzyniec, D., Sunderam, V.: Experiments with distributed component computing across grid boundaries. In: Proceedings of HPC-GECO/COMPFRAME Workshop in Conjunction with HPDC 2006, pp. 109–116 (2006)
22. Malawski, M.: Mocca homepage (2007), http://www.icsr.agh.edu.pl/mambo/mocca
23. DEISA: DEISA project (1999), http://deisa.org
24. The Apache Software Foundation: Web service invocation framework (2006), http://ws.apache.org/wsif/
25. IST: Grid application toolkit (2004), http://www.gridlab.org/WorkPackages/wp-1/
26. ICL: Netsolve/gridsolve (2006), http://icl.cs.utk.edu/netsolve/
27. OMII: Grid job submission and monitoring web service (2007), http://gridsam.sourceforge.net/
28. RealityGrid: Application hosting environment (2006), http://www.realitygrid.org/AHE/
29. RealityGrid: WSRF environment for distributed simulation (2003), http://www.realitygrid.org/WEDS/
30. Python community: Python programming language (2007), http://www.python.org/
31. perl.org: The perl directory (2007), http://www.perl.org/
32. The University of Waikato: Attribute-Relation File Format (2002), http://www.cs.waikato.ac.nz/~ml/weka/arff.html
33. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Francisco (2005)
34. ViroLab Project Consortium: ViroLab (2006), http://virolab.org
35. Sloot, P.M., Tirado-Ramos, A., Altintas, I., Bubak, M., Boucher, C.: From molecule to man: Decision support in individualized e-health. Computer 39(11), 40–46 (2006)
Divide-and-Conquer Parallel Programming with Minimally Synchronous Parallel ML

Radia Benheddi and Frédéric Loulergue

LIFO – University of Orléans, France
{radia.benheddi,frederic.loulergue}@univ-orleans.fr
Abstract. Minimally Synchronous Parallel ML (MSPML) is a functional parallel programming language. It is based on a small number of primitives on a parallel data structure. MSPML programs are written like ordinary sequential ML programs and use this small set of functions. MSPML is deterministic and deadlock free. The execution time of the programs can be estimated. Divide-and-conquer is a natural way of expressing parallel algorithms. MSPML is a flat language: it is not possible to split the parallel machine in order to implement divide-and-conquer parallel algorithms. This paper presents an extension of MSPML to deal with this kind of algorithm: a parallel composition primitive. Keywords: functional programming, divide-and-conquer parallelism, parallel composition.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1078–1085, 2008. © Springer-Verlag Berlin Heidelberg 2008

1 Introduction

Many problems require a large amount of computing resources to be solved. The use of massively parallel computers is mandatory in these cases. Yet programming such architectures is still difficult. The development of applications is hindered by errors that are easily made when writing parallel programs with message passing libraries such as MPI: non-determinism and deadlocks. Efficiency in computations is not sufficient; efficiency in the development process is also necessary. High-level parallel programming languages – algorithmic skeletons, parallel extensions of functional languages, parallel logic and constraint programming – have produced methods and tools that improve the price/performance ratio of parallel software, and broaden the range of target applications.

In the past years, we developed several libraries for the Objective Caml [1] language to ease the writing, performance prediction and validation of parallel programs. In particular, Bulk Synchronous Parallel ML (BSML) [2] allows implementing Bulk Synchronous Parallel (BSP) [3] algorithms using a small set of primitives on a polymorphic data structure. For efficiency reasons, it is not possible to use BSML to program metacomputers (clusters of parallel machines). We designed a two-tiered model and language called Departmental Meta-computing ML or DMML [4]: each node is seen as a BSP machine and programmed using BSML and an additional level,
minimally synchronous, is used for the coordination. The concepts introduced for this level can also be used independently in a parallel language. We call it Minimally Synchronous Parallel ML or MSPML. The primitives of MSPML are similar to those of BSML, but the execution models of the two languages are different. In BSML, communications are followed by a synchronisation barrier involving all the processors of the parallel machine. In MSPML, synchronisations occur only among processors involved in exchanges of data.

To take advantage of the asynchronism of MSPML, we chose to add a new parallel spatial composition primitive called juxtaposition. This primitive divides the machine into two independent sub-machines. Each sub-machine executes a different part of an MSPML program. This new primitive can be used to express divide-and-conquer algorithms very naturally and effectively.

This paper starts with a brief presentation of the MSPML language (section 2). Then we present the juxtaposition primitive from the user and implementation viewpoints (section 3). Related work and conclusions end the paper.
2 Flat Minimally Synchronous Parallel ML
2.1 The Message Passing Machine Model
The Message Passing Machine or MPM model [5] proposes an execution and cost model for programs run on distributed memory parallel machines. An MPM program is a sequence of m-steps. At each m-step, each processor performs a computation phase followed by a communication phase. During the communication phase, the processors concerned by the communication exchange the data needed for the next m-step. There are no synchronisation barriers, only synchronisations between processors which exchange data. For a given processor $i$ and an m-step $s$, $\Omega_{s,i}$ is the set containing processor $i$ and all processors communicating with it, called incoming partners, which send messages to processor $i$. The parallel machine is characterised by the following parameters: $p$, the number of processors; $g$, the communication gap; and $L$, the network latency. The execution time of an MPM program is bounded by

$$\Psi = \max\{\Phi_{R,j} \mid j \in \{0, 1, \ldots, p-1\}\}$$

where $\Phi_{s,i}$ is inductively defined by

$$\Phi_{1,i} = \max\{w_{1,j} \mid j \in \Omega_{1,i}\} + (g \times h_{1,i} + L)$$

$$\Phi_{s,i} = \max\{\Phi_{s-1,j} + w_{s-1,j} \mid j \in \Omega_{s,i}\} + (g \times h_{s,i} + L)$$

Here $w_{s,i}$ is the local computation time of processor $i$ during m-step $s$, and $h_{s,i} = \max\{h^{+}_{s,i}, h^{-}_{s,i}\}$, where $h^{+}_{s,i}$ (respectively $h^{-}_{s,i}$) is the number of words received (respectively sent) by processor $i$ during m-step $s$, with $i \in \{0, \ldots, p-1\}$, $s \in \{2, \ldots, R\}$, and $R$ the number of m-steps in the program. Experiments show that this model is suited to MSPML [6].
2.2 The Core Library
The MSPML library is based on the following primitives:

    p: unit → int
    g: unit → float
    l: unit → float
    mkpar: (int → α) → α par
    apply: (α → β) par → α par → β par
    get: α par → int par → α par
The functions p, g and l give access to the parameters of the parallel machine. In particular, the function p returns the static number of processors of the parallel machine. This value does not change during execution, as long as parallel composition is not introduced. There is also a polymorphic abstract type α par which represents the type of parallel vectors composed of p expressions of type α, one expression per processor. The nesting of parallel expressions is not allowed and can be prevented by a type system.

The parallel constructors operate on parallel vectors. These parallel vectors are created by the primitive mkpar. The expression (mkpar f) is a parallel or global expression. For example, mkpar (fun i → i) evaluates to the parallel vector $\langle 0, \ldots, p-1 \rangle$. In the MPM model, an algorithm is written as a combination of asynchronous local computation phases and communication phases. The asynchronous phases are programmed with mkpar and apply. The expression apply (mkpar f) (mkpar e) evaluates to (f i) (e i) at processor i. The communication phases are programmed using get and mget. The semantics of get, where % is the modulo, is given by:

$$\mathrm{get}\ \langle v_0, \ldots, v_{p-1} \rangle\ \langle i_0, \ldots, i_{p-1} \rangle = \langle v_{i_0 \% p}, \ldots, v_{i_{p-1} \% p} \rangle$$

The function mget is a generalisation of get. During mget, a given processor can receive (respectively send) several messages from (respectively to) different processors during the same m-step. The complete language also contains a global conditional, which is necessary to take into account data computed locally for the global control. For the sake of conciseness these two primitives are omitted here.

2.3 Costs
The execution time of a MSPML program is represented by a cost vector of execution time on each processor: c0 , . . . , cp−1 . If the evaluation of a ML expression e outside a mkpar, require time w then its evaluation as a MSPML program adds w to each component of the cost vector: c0 + w, . . . , cp−1 + w. The evaluation of a mkpar expression requires wi at processor i, the time required to evaluate (f i). For apply time required is that of the evaluation of the arguments giving the vectors f0 , . . . , fp−1 and v0 , . . . , vp−1 , then for processor i the time wi required for the evaluation of (fi vi ). The execution time of a communication primitive get, if the vector of costs after the evaluation of its arguments is c0 , . . . , cp−1 , is given by the cost vector c0 , . . . , cp−1 . It is defined by:
Divide-and-Conquer Parallel Programming
– the first and second arguments of get are, respectively, ⟨v0, ..., vp−1⟩ and ⟨i0, ..., ip−1⟩;
– Ωk = {j | ij = k} ∪ {k} is the set of the incoming partners of processor k.
The execution time is given by:
\[ c'_k = \max\{c_i \mid i \in \Omega_k\} + \max\Bigl\{ \sum_{j \in \Omega_k \setminus \{k\}} \#v_k,\ (\#v_{i_k} \text{ if } i_k \neq k) \Bigr\} \times g + L \]
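The semantics and the cost of get can be read together as a small sequential emulation. The following is only a sketch under stated assumptions: parallel vectors are modelled as plain C arrays, the function names get_values and get_cost are hypothetical, #v is taken to be a size in words supplied by the caller, and the requested indices are reduced modulo p before the partner sets are built.

```c
#include <stddef.h>

/* Result of get: processor k receives the value held by processor idx[k] % p. */
void get_values(const long *v, const size_t *idx, long *out, size_t p)
{
    for (size_t k = 0; k < p; k++)
        out[k] = v[idx[k] % p];
}

/* Cost vector after get (hypothetical helper, not part of MSPML).
   c[k]    : cost at processor k once the arguments of get are evaluated
   size[k] : #v_k, the size in words of the value held by processor k
   g, L    : communication parameters of the cost model */
void get_cost(const double *c, const size_t *size, const size_t *idx,
              double g, double L, double *c_out, size_t p)
{
    for (size_t k = 0; k < p; k++) {
        double latest = c[k];   /* maximum over Omega_k, which always contains k */
        size_t sent = 0;        /* words sent by k to its incoming partners */
        for (size_t j = 0; j < p; j++) {
            if (j != k && idx[j] % p == k) {
                if (c[j] > latest)
                    latest = c[j];
                sent += size[k];
            }
        }
        size_t src = idx[k] % p;
        size_t received = (src != k) ? size[src] : 0;  /* nothing if k asks itself */
        size_t h = (sent > received) ? sent : received;
        c_out[k] = latest + (double)h * g + L;
    }
}
```

In this emulation a processor that nobody requests from and that requests its own value pays only L on top of its current cost.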
2.4 Examples
The standard library of MSPML contains functions which are defined using only the primitives. For example, the function replicate creates a parallel vector which contains the same value everywhere:
  let replicate x = mkpar(fun pid → x)
It is also very convenient to apply the same sequential function to all the components of a parallel vector. That can be done by using the function parfun:
  let parfun f v = apply (replicate f) v
The semantics of the total exchange is given by:
  totex ⟨v0, ..., vp−1⟩ = ⟨f, ..., f, ..., f⟩ where ∀i. (0 ≤ i < p) ⇒ (f i) = vi
The code is presented below, where noSome is a function which removes the constructor Some and compose is the functional composition:
  let totex vv =
    parfun (compose noSome)
      (mget (parfun (fun v i → v) vv) (replicate (fun i → true)))
Its parallel cost is (p − 1) × s × g + L, where s is the size in words of the biggest value v held by a processor.
The bcast functions broadcast the value held by a given processor of a parallel vector to all the other processors. Their semantics is given by:
  bcast ⟨v0, ..., vp−1⟩ r = ⟨v(r % p), ..., v(r % p)⟩
The bcast_direct function is one possible implementation, using only one m-step. It could be written as follows:
  let bcast_direct root vv = get vv (replicate root)
Its parallel cost is (p − 1) × s × g + L, where s is the size in words of the value held by the root processor.
The standard library of MSPML contains a collection of such functions to ease the writing of parallel programs. Thus, writing MSPML programs is similar to writing programs with data-parallel skeletons [7,8]. But if the MSPML standard library does not provide the required function, it is possible to write new skeletons as higher-order functions, using the primitives described in Section 2.2.
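As an illustration of the broadcast semantics and of the cost quoted for totex and bcast_direct, the following sketch emulates them sequentially. It assumes parallel vectors are plain arrays of p values; the names bcast_direct_emul and one_step_cost are hypothetical and do not belong to the MSPML library.

```c
#include <stddef.h>

/* Emulates bcast_direct: every processor obtains the value held by
   processor root % p, as in the semantics of bcast. */
void bcast_direct_emul(const long *v, size_t root, long *out, size_t p)
{
    for (size_t k = 0; k < p; k++)
        out[k] = v[root % p];
}

/* Cost (p - 1) * s * g + L of a single m-step in which one processor
   exchanges values of at most s words with the p - 1 others. */
double one_step_cost(size_t p, size_t s, double g, double L)
{
    return (double)(p - 1) * (double)s * g + L;
}
```

For instance, with p = 4, s = 5, g = 2 and L = 10 the formula gives 3 × 5 × 2 + 10 = 40 time units.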
R. Benheddi and F. Loulergue
3 A Parallel Composition Primitive
3.1 Overview of the Parallel Juxtaposition
The spatial parallel composition, called “juxtaposition”, divides a parallel machine of p processors into independent sub-machines. The sub-machines independently evaluate different parallel programs at the same time on the same machine. The machine is divided into groups of contiguous processors. Juxtaposition can be applied recursively, i.e. a sub-machine can itself be partitioned into sub-machines.
The evaluation of the expression (juxta m E1 E2) proceeds as follows. The first m processors evaluate the expression E1 and the remaining p − m processors evaluate E2. These p − m processors are however renamed, processor m becoming 0 and processor (p − 1) becoming (p − 1) − m. The juxtaposition does not modify the MPM cost model. The result of the evaluation of the parallel juxtaposition is:
  juxta m ⟨v0, ..., vm−1⟩ ⟨v′0, ..., v′p−1−m⟩ = ⟨v0, ..., vm−1, v′0, ..., v′p−1−m⟩
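The splitting and renaming performed by juxtaposition can be emulated sequentially, which may help in reading the recursive programs of this section. The sketch below is an assumption-laden model, not the MSPML implementation: a "parallel program" is a C function acting on an array of p values, the names par_prog, juxta_emul, local_pid and scan_add are hypothetical, and the two sub-machines are run one after the other rather than concurrently. A recursive prefix sum is used as a typical divide-and-conquer client.

```c
#include <stddef.h>

/* A "parallel program" transforms the vector v held by a machine of p processors. */
typedef void (*par_prog)(long *v, size_t p);

/* Emulates juxta m e1 e2: e1 runs on processors 0..m-1 and e2 on the
   remaining p-m processors, which e2 sees renumbered from 0
   (passing v + m shifts the indices). */
void juxta_emul(size_t m, par_prog e1, par_prog e2, long *v, size_t p)
{
    e1(v, m);
    e2(v + m, p - m);
}

/* Each processor stores its own (local) processor number. */
void local_pid(long *v, size_t p)
{
    for (size_t i = 0; i < p; i++)
        v[i] = (long)i;
}

/* Divide-and-conquer prefix sum with addition: scan both halves, then
   add the last value of the first half to every value of the second half. */
void scan_add(long *v, size_t p)
{
    if (p <= 1)
        return;
    size_t mid = p / 2;
    juxta_emul(mid, scan_add, scan_add, v, p);
    long carry = v[mid - 1];
    for (size_t i = mid; i < p; i++)
        v[i] += carry;
}
```

Running local_pid on both sides of a machine of five processors split at m = 2 yields 0, 1, 0, 1, 2, which makes the renaming of the second sub-machine visible.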
In order to avoid the evaluation of the parallel arguments before the call to the juxtaposition, the type of the parallel composition is:
  juxta: int → (unit → α par) → (unit → α par) → α par
The following program is a divide-and-conquer version of the prefix sum. Its semantics is:
  scan ⊕ ⟨v0, ..., vp−1⟩ = ⟨v0, ..., v0 ⊕ v1 ⊕ ... ⊕ vp−1⟩
where ⊕ is an associative binary operation.
  let rec scan op vec =
    if p()=1 then vec
    else
      let mid = p()/2 in
      let vec' = juxta mid (fun() → scan op vec) (fun() → scan op vec)
      and msg vec = get vec (mkpar(fun i → if (i<mid) then i else mid−1))
      and parop = parfun2 (fun x y → match x with None → y | Some v → op v y) in
      parop (msg vec') vec'
The network is divided into two parts and the function scan is recursively applied to the two parts. The value at the last processor of the first part is broadcast to all the processors of the second part. Then this value and the local values computed by the recursive call are combined with the operation op on each processor of the second part.
3.2 The Communication Mechanism
The asynchronous nature of MSPML led us to design a communication mechanism which relies on storing the values that may be requested by other processors in a data structure called the communication environment. Each processor has its own communication environment, in which one value is stored per m-step.
During the execution of an MSPML program, for each process i, the system has a variable mstep_i containing an integer indicating the current m-step. For flat MSPML, all the processors perform the same number of m-steps. When the expression (get vv vi) is evaluated at a given process i:
1. mstep_i is increased by one;
2. the value that this process holds in the parallel vector vv is stored with the value of mstep_i in the communication environment;
3. the value j that this process holds in the parallel vector vi is the number of the process from which process i wants to receive a value. Thus process i sends a request to process j: it asks for the value at m-step mstep_i. When process j receives the request (threads are dedicated to handling requests, so the work of process j is not interrupted by the request), there are two cases:
   – mstep_j ≥ mstep_i: process j has already reached the same m-step as process i. Thus process j looks in its communication environment for the value associated with m-step mstep_i and sends it to process i;
   – mstep_j < mstep_i: nothing can be done until process j reaches the same m-step as process i.
If i = j, this third step is not performed.
Without juxtaposition all processes execute the same number of m-steps. This is no longer the case when juxtaposition is introduced. For example, in the following expression:
  (let this = mkpar(fun i → i) in juxta 2 (get this this) this)
only the first two processes increase their m-step counters by evaluating the get primitive. Thus we cannot rely on a simple numbering of m-steps by naturals in order to correctly exchange messages among processors. To distinguish the various messages from different sub-machines we introduce the following m-step numbering:
  step ::= (n, m) | step.R(n, m) | step.L(n, m)
where n and m are naturals. Each time a processor calls the juxtaposition primitive, its m-step counter grows: L (resp. R) indicates that the processor belongs to the sub-machine which evaluates the first (resp. second) parallel expression of a juxtaposition. The natural n is a counter used to know how many m-steps (uses of get, mget or at) have been performed in the given sub-network. When the call to the juxtaposition ends, the last pair of the m-step counter and the sub-machine indicator are removed. But it is possible to have successive non-nested uses of the juxtaposition:
  let e1 = (juxta m s1 s2) in let e2 = (juxta m' s'1 s'2) in e
Thus, if we counted only the number of m-steps in a given call to the juxtaposition, two values from two successive non-nested calls to the juxtaposition could have the same m-step counter, which would make the mechanism incorrect. For this reason the second natural in the pairs (n, m) gives the number of successive non-nested calls to the juxtaposition in the given sub-network. Figure 1 illustrates the m-step numbering on an example.
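The request-handling rule of the flat case (integer m-steps, no juxtaposition) can be sketched as follows. This is a simplified single-threaded model under explicit assumptions: the communication environment is a fixed-size array indexed by m-step, values are of type long, and the names process, begin_get and serve_request are hypothetical. The hierarchical numbering with L and R components is not modelled.

```c
#define MAX_MSTEPS 64

typedef struct {
    int  mstep;                /* current m-step of this process */
    long env[MAX_MSTEPS + 1];  /* env[s] is the value stored at m-step s */
} process;

/* Steps 1 and 2 of evaluating get: advance the m-step counter and store
   the local value in the communication environment.
   Returns 0 if the (assumed) fixed-size environment is full. */
int begin_get(process *self, long value)
{
    if (self->mstep >= MAX_MSTEPS)
        return 0;
    self->mstep += 1;
    self->env[self->mstep] = value;
    return 1;
}

/* Handling of a request for the value of m-step `requested`.
   Returns 1 and writes the value if the owner has reached that m-step,
   0 if the request has to wait. */
int serve_request(const process *owner, int requested, long *out)
{
    if (requested < 1 || owner->mstep < requested)
        return 0;
    *out = owner->env[requested];
    return 1;
}
```

Because values stay in the environment, a process that has already moved on to a later m-step can still answer a late request for an earlier one.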
[Figure 1: a machine of p processors, initially at step (n,0), is split by juxta m e e′ and then again by juxta m′ e e′; the sub-machines are labelled with m-step counters such as L(0,0).(n,1), R(0,0).(n,1) and R(0,0).R(0,1).(n,1).]
Fig. 1. M-step Numbering
4 Related Work
[9] showed that NESL [10] is more effective when the vector size is constant. Even when this is not the case, the majority of the operations of NESL can be implemented in MSPML. In particular, nested lists can be implemented as in [11]. From this point of view MSPML can seem lower level than NESL. But MSPML offers high-level functions, which is not the case for NESL.
[12] describes the mechanism of structural clocks, which allows the execution of data-parallel programs written in a small imperative SPMD language. The difficulty within this framework is that the number of communication phases can be different on each processor because there is a parallel composition. Our m-step numbering is thus similar to structural clocks.
[13] presents another way to divide and conquer in the framework of an object-oriented language. There is no formal semantics and no implementation so far. The same author advocates in [14] a new extension of the BSP model in order to ease the programming of divide-and-conquer BSP algorithms. It adds another level to the BSP model with new parameters to describe the parallel machine.
5 Conclusion and Future Work
The new MSPML primitive, called juxtaposition, makes it easy to write divide-and-conquer parallel programs. An implementation has been developed. It remains to experiment with the new possibilities offered by the juxtaposition, as well as to validate the cost model. In the current implementation, the management of the communication environments requires a global synchronisation from time to time. In the case of MSPML without parallel composition, a new mechanism was proposed which removes these global synchronisations. We should adapt it for MSPML with juxtaposition.
References
1. Leroy, X., Doligez, D., Garrigue, J., Rémy, D., Vouillon, J.: The Objective Caml System release 3.09 (2005), web pages at: www.ocaml.org
2. Loulergue, F., Gava, F., Billiet, D.: Bulk Synchronous Parallel ML: Modular Implementation and Performance Prediction. In: Sunderam, V.S., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2005. LNCS, vol. 3515, pp. 1046–1054. Springer, Heidelberg (2005)
3. Valiant, L.G.: A bridging model for parallel computation. Communications of the ACM 33(8), 103–111 (1990)
4. Gava, F., Loulergue, F.: A Functional Language for Departmental Metacomputing. Parallel Processing Letters 15(3), 289–304 (2005)
5. Roda, J.L., Rodríguez, C., Morales, D.G., Almeida, F.: Predicting the execution time of message passing models. Concurrency: Practice and Experience 11(9), 461–477 (1999)
6. Loulergue, F., Gava, F., Arapinis, M., Dabrowski, F.: Semantics and Implementation of Minimally Synchronous Parallel ML. International Journal of Computer and Information Science 5(3), 182–199 (2004)
7. Cole, M.: Algorithmic Skeletons: Structured Management of Parallel Computation. MIT Press, Cambridge (1989)
8. Pelagatti, S.: Structured Development of Parallel Programs. Taylor & Francis, Abington (1998)
9. Bamha, M.: L'implémentation d'un langage portable à parallélisme emboîté en processus statiques. Master thesis, University of Orléans, LIFO (1996)
10. Blelloch, G., Chatterjee, S., Hardwick, J., Sipelstein, J., Zagha, M.: Implementation of a portable nested data-parallel language. Journal of Parallel and Distributed Computing 21(1), 4–14 (1994)
11. Hu, Z., Takahashi, T., Iwasaki, H., Takeichi, M.: Segmented Diffusion Theorem. In: IEEE International Conference on Systems, Man and Cybernetics (SMC 2002), October 6-9, 2002, IEEE Press, Los Alamitos (2002)
12. Melin, E., Raffin, B., Rebeuf, X., Virot, B.: A Structured Synchronization and Communication Model Fitting Irregular Data Accesses. Journal of Parallel and Distributed Computing 50, 3–27 (1998)
13. Tiskin, A.: A New Way to Divide and Conquer. Parallel Processing Letters (4) (2001)
14. Martin, J.M.R., Tiskin, A.: BSP modelling of two-tiered parallel architectures. In: Cook, B.M. (ed.) WoTUG 1990, pp. 47–55 (1999)
Cloth Simulation in the SILC Matrix Computation Framework: A Case Study
Tamito Kajiyama¹,², Akira Nukada¹,², Reiji Suda¹,², Hidehiko Hasegawa³, and Akira Nishida¹,²
¹ CREST, Japan Science and Technology Agency, Saitama 332–0012, Japan
² The University of Tokyo, Tokyo 113–8656, Japan
³ University of Tsukuba, Ibaraki 305–8550, Japan
Abstract. This paper presents a case study of numerical simulations in an easy-to-use matrix computation framework named Simple Interface for Library Collections (SILC), which allows users to use matrix computation libraries in an environment- and language-independent manner. As a practical example of numerical simulations in SILC, we selected cloth simulation based on a mass-spring model and the implicit backward Euler method. We constructed two SILC-based versions of an existing cloth simulation code according to two proposed application styles of SILC. Experimental results showed that both versions achieved some performance gains, thereby demonstrating the feasibility of numerical simulations in SILC and the usability of the proposed application styles.
1 Introduction
Matrix computations such as solutions of linear systems and eigenvalue analyses are key components of numerical simulations being conducted in various scientific and industrial fields; thus, an increasing number of matrix computation libraries have been developed to facilitate the development of numerical simulation codes. However, the application programming interfaces (APIs) of the libraries are not generally uniform, which makes it quite costly to employ the libraries to develop and maintain simulation codes. It is burdensome for users to have to learn a number of different library-specific APIs in order to write simulation codes using these libraries. In addition, there are numerous reasons why users must switch from one library to another (e.g., in order to switch computing environments, to try out alternative libraries for better performance, and so on). In such cases, users are required to make considerable modifications to the simulation codes, since the codes usually depend on the particular libraries in use.
To relieve the burden of writing simulation codes based directly on library-specific APIs, the authors have been proposing an easy-to-use matrix computation framework named Simple Interface for Library Collections (SILC) [1]. In short, SILC is a piece of middleware that gives users access to matrix computation libraries in an environment- and language-independent manner. With the aim of setting some guidelines for SILC users, we have also been proposing two application styles for writing simulation codes within the SILC framework [2].
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1086–1095, 2008. © Springer-Verlag Berlin Heidelberg 2008
The purpose of the present paper is to verify the effectiveness of the proposed framework and its application styles through a case study in numerical simulation. As a practical example of numerical simulations in SILC, we have selected cloth simulation, an important technology widely used in many academic and industrial fields including computer graphics and the fashion industry. According to the proposed application styles, we have developed two SILC-based versions of an existing cloth simulation code written in C. In the rest of the paper, we describe how SILC was applied to the original cloth simulation code, and present some experimental results. We also make a brief survey of related work and finally draw some conclusions.
2 Overview of the SILC Matrix Computation Framework
Simple Interface for Library Collections (SILC) is a matrix computation framework that allows users to use various matrix computation libraries independently of particular libraries, computing environments, and programming languages. SILC is currently implemented based on a client-server architecture. Instead of using matrix computation libraries through library-specific APIs, user programs for SILC (i.e., simulation codes in the SILC framework) utilize libraries in the following three steps. First, the user programs deposit data such as matrices and vectors into a SILC server, together with names for later reference. Next, the user programs make requests for computation by means of mathematical expressions in the form of text. These computation requests are translated into calls to appropriate library functions and carried out on the server side. Finally, the user programs retrieve the results of the computation (if necessary) from the server by specifying the names of the computation results to be retrieved. The computation results are kept in the server unless they are explicitly deleted.
Figure 1 shows a user program written in C in the SILC framework. This program solves an initial value problem of a two-dimensional diffusion equation using the Crank-Nicolson method. Suppose that t_0 is the initial time and Δt > 0 is a constant time interval. The Crank-Nicolson method requires the solution of a linear system A x_k = C x_{k−1} for each time step t_k = t_{k−1} + Δt (k = 1, 2, 3, ...), where A and C are sparse matrices. The user program first deposits A, C, and the initial values x_0 at t_0 into a SILC server by three separate calls to the SILC_PUT routine. Then for each time step t_k, the program issues a request to solve A x_k = C x_{k−1} using the SILC_EXEC routine. The computation request results in calls to some library functions, which are carried out in the server. After that, the program fetches the solution x_k at t_k by the SILC_GET routine.
The primary benefit of using SILC is independence from matrix computation libraries, computing environments, and programming languages. User programs for SILC do not depend on particular libraries and their underlying computing environments, as illustrated by the user program in Fig. 1. Sequential user programs can automatically obtain performance gains by simply using a parallel SILC server. SILC is also independent of programming languages in the sense that the same mathematical expressions can be used to make computation requests from user programs in any programming language. Another benefit is ease of use. SILC makes it easy to write user programs that utilize matrix computation libraries, relieving users from the burden of using library-specific APIs that differ prominently in terms of data structures for matrices and vectors, parameters of library functions, compilation and linking procedures, and so on.
SILC comprises useful functionalities for matrix computations. Supported data types include dense, band, and sparse matrices and vectors. Mathematical expressions are composed of various math operators (such as arithmetic operators and the backslash operator for solving linear systems), built-in functions (e.g., function norm2 computes the 2-norm of a vector), and subscripts (for example, A[1:5, k:k+4] yields a 5 × 5 submatrix of A). There is no construct for loops and conditional branching in the mathematical expressions of SILC, since SILC is intended to be a replacement for library calls. Control flows are expressed by the languages in which user programs are written, as shown in Fig. 1.

  silc_envelope_t A, C, x;
  /* create matrices A and C, and the initial values x0 at time t0 */
  SILC_PUT("A", &A);
  SILC_PUT("C", &C);
  SILC_PUT("x", &x);   /* x0 */
  for (k = 1; k <= num_time_steps; k++) {
      SILC_EXEC("x = A \\ (C * x)");
      SILC_GET(&x, "x");   /* solution xk at time tk */
      /* output xk */
  }

Fig. 1. An example of a user program for SILC, written in C, which solves an initial value problem using the Crank-Nicolson method. The backslash operator for solving linear systems is represented by a backslash, which is used to escape special characters in string literals in C. Therefore, the operator is written as “\\” in the program.
3 Two Application Styles of SILC
As a few basic guidelines on writing user programs for numerical simulations in the SILC framework, we have been proposing two different application styles (see Table 1 for a comparison of the two application styles).
Limited Application Style. User programs in this application style realize the most time-consuming, computationally intensive part of the user programs by depositing data into a SILC server, making requests for computation, and retrieving the results of the computation from the server. Those computations that are hard to realize in terms of matrix computations are implemented in the user programs by fetching data from the server and sending the results of computation back to the server. The limited application style is easy to use, although it imposes some communication overheads due to frequent data transfer between the user programs and the server. In addition, the maximum amount of data that can be handled is largely restricted by the memory capacity of a user program.
Table 1. A comparison of the limited and comprehensive application styles

                                                    Limited   Comprehensive
  Ease of application                               Easy      Hard
  The amount of data maintained by a user program   Large     Small
  The amount of data maintained by a SILC server    Small     Large
  The amount of data transfer                       Large     Small
  The amount of parallelizable computation          Small     Large
Comprehensive Application Style. User programs in this application style first move all relevant data to a SILC server. After that, the user programs issue a series of computation requests to control the server-side computations, while having few data communications with the server in the middle of the simulations. The comprehensive application style can be difficult to employ since not all computations are easy to realize by means of SILC's mathematical expressions. On the other hand, the comprehensive application style imposes fewer communication overheads than the limited application style. The maximum amount of data mainly depends on the server's memory capacity, so that this application style allows a larger amount of data to be handled than the limited application style. Moreover, the amount of parallelizable computation is larger than in the limited application style, since most computations are done on the server side.
4 Cloth Simulation in SILC
With the aim of exemplifying the usability of SILC in numerical simulations, we applied the two application styles to an existing sequential cloth simulation code written in C. The simulation code employs a mass-spring model to represent the geometry of cloth and computes the motion of the cloth (governed by Newton's law of motion) based on the implicit backward Euler method [3].
The mass-spring model represents cloth as a mesh of n particles connected by springs. Let x_i ∈ R^3 be a position vector that specifies the location of particle i. We simply represent the geometry of the entire cloth by x ∈ R^{3n}. Similarly, we represent the velocity of particle i by v_i ∈ R^3 and those of all particles by v ∈ R^{3n}. Two particles are connected by a weightless spring k with a spring constant b_k, a damping constant h_k, and a natural length l_k.
In the implicit backward Euler method, we need to solve a linear system for each time step. Let x_0 and v_0 be the position and velocity of the cloth at the end of the previous time step. The main loop over time steps in the simulation code consists of the following three steps:
Step 1. Compute force f = f(x, v) and its derivatives ∂f/∂x and ∂f/∂v. The force f ∈ R^{3n} that acts on the cloth is calculated particle-wise as follows. Let P_i be the set of particles that are connected to particle i; then the force f_i ∈ R^3 that acts on particle i is defined as a sum of spring force f_{ij} and damping force d_{ij} between each pair of particles i and j connected by spring k:
\[ f_i = \sum_{j \in P_i} (f_{ij} + d_{ij}) \]
\[ f_{ij} = b_k \,\bigl(|x_j - x_i| - l_k\bigr)\, \frac{x_j - x_i}{|x_j - x_i|} \]
\[ d_{ij} = -h_k\,(v_i - v_j) \]
The derivatives ∂f/∂x and ∂f/∂v are Jacobian matrices [4], each of which consists of n^2 submatrices as follows:
\[ \frac{\partial f}{\partial x} = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial f_n}{\partial x_1} & \cdots & \frac{\partial f_n}{\partial x_n} \end{pmatrix}, \qquad \frac{\partial f}{\partial v} = \begin{pmatrix} \frac{\partial f_1}{\partial v_1} & \cdots & \frac{\partial f_1}{\partial v_n} \\ \vdots & & \vdots \\ \frac{\partial f_n}{\partial v_1} & \cdots & \frac{\partial f_n}{\partial v_n} \end{pmatrix} \]
Off-diagonal submatrices are defined as follows:
\[ \frac{\partial f_i}{\partial x_j} = b_k I - \frac{b_k l_k}{|x_j - x_i|} \left( I - \frac{(x_j - x_i)(x_j - x_i)^T}{|x_j - x_i|^2} \right), \qquad \frac{\partial f_i}{\partial v_j} = h_k I \]
Diagonal submatrices are defined in terms of off-diagonal ones as follows:
\[ \frac{\partial f_i}{\partial x_i} = -\sum_{j \in P_i} \frac{\partial f_i}{\partial x_j}, \qquad \frac{\partial f_i}{\partial v_i} = -\sum_{j \in P_i} \frac{\partial f_i}{\partial v_j} \]
Step 2. Solve a linear system AΔv = b to find a change in velocity Δv, where
\[ A = M - \Delta t^2 \frac{\partial f}{\partial x} - \Delta t \frac{\partial f}{\partial v}, \qquad b = \Delta t \left( f + \Delta t \frac{\partial f}{\partial x} v_0 \right) \]
and M is a diagonal matrix that represents the mass of particles. The linear system is solved by the Conjugate Gradient (CG) method [5] since A is sparse and symmetric positive definite.
Step 3. Update velocity v and position x as follows:
\[ v = v_0 + \Delta v, \qquad x = x_0 + v \Delta t \]
The most time-consuming, computationally intensive part of the original simulation code is the second step, where solving linear systems takes about 80% of the original code's execution time. Therefore, we developed a SILC-based code in the limited application style by rewriting the second step of the original code by means of the SILC framework. Figure 2 (a) is the part of the original code that calls lis_solve, a library function of the Lis iterative solvers library [6], to solve a linear system AΔv = b with the CG method. Figure 2 (b) is the same part of the SILC-based code in the limited application style, where the linear system is solved by depositing A and b into a SILC server, making a request for it to solve the linear system, and fetching the solution Δv from the server. In both codes, A is stored in the Compressed Row Storage (CRS) format [7].

(a) The original code
  LIS_MATRIX A;
  LIS_VECTOR b, dv;
  for (k = 1; k <= num_time_steps; k++) {
      /* 1. Compute f, ∂f/∂x, and ∂f/∂v */
      :
      /* 2. Solve AΔv = b */
      lis_solve(A, b, dv, lis_params, lis_options, lis_status);
      /* 3. Update velocity v and position x */
      :
  }

(b) The SILC-based code
  silc_envelope_t A, b, dv;
  for (k = 1; k <= num_time_steps; k++) {
      /* 1. Compute f, ∂f/∂x, and ∂f/∂v */
      :
      /* 2. Solve AΔv = b */
      SILC_PUT("A", &A);
      SILC_PUT("b", &b);
      SILC_EXEC("dv = A \\ b");
      SILC_GET(&dv, "dv");
      /* 3. Update velocity v and position x */
      :
  }

Fig. 2. The original code and the SILC-based code in the limited application style

We also developed a SILC-based code in the comprehensive application style, in which all relevant data is moved to a SILC server at the beginning of the simulation code and all computations are realized by means of SILC's mathematical expressions. The data transfer is performed in the initialization part of the code. Some computations are also carried out during the initialization, although most computations are concentrated in the main loop over time steps. Figure 3 shows the iterative part of the code in the comprehensive application style. The Jacobian ∂f/∂v, referred to as DfDv in the figure, is computed during the initialization since it is constant. A few additional constant vectors and matrices are also defined in the initialization part. All matrices involved in the code are sparse and stored in the CRS format.
The mathematical expressions in Fig. 3 have been written so that data parallelism can be exploited as much as possible. For example, the mathematical expressions in the first call to SILC_EXEC are requests for computing the distance between two particles connected by a spring. Let s be the number of springs; then matrix Y is a linear map for transforming x ∈ R^{3n} into p ∈ R^{3s} so that every three elements of p represent x_j − x_i. The expression p *@ p stands for elementwise multiplication, and X_T is another linear map from R^{3s} to R^s such that multiplying it by a vector sums up every three elements of the vector. Finally, function sqrt computes the square root of each element in a given vector. All these computations can be parallelized in a data-parallel manner.
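To make Steps 2 and 3 of the main loop concrete, the following sketch applies them to a deliberately reduced setting in which every matrix becomes a scalar: a single particle moving in one dimension, attached by one spring to a fixed anchor. This toy setting, the type state and the function implicit_euler_step are assumptions made for illustration; they are not part of the simulation code discussed in the paper, which works on a whole mesh and solves the system with the CG method.

```c
/* Position and velocity of the single particle. */
typedef struct {
    double x;
    double v;
} state;

/* One backward Euler step for a particle of mass m attached to a fixed
   anchor by a spring with stiffness b, damping h and natural length rest.
   Force model (1-D analogue of the spring and damping forces):
       f(x, v) = -b * (x - rest) - h * v
   so that df/dx = -b and df/dv = -h. */
state implicit_euler_step(state s0, double m, double b, double h,
                          double rest, double dt)
{
    double f0   = -b * (s0.x - rest) - h * s0.v;
    double dfdx = -b;
    double dfdv = -h;

    /* Step 2: A * dv = rhs, with A = M - dt^2 df/dx - dt df/dv
       and rhs = dt * (f + dt * df/dx * v0). */
    double A   = m - dt * dt * dfdx - dt * dfdv;
    double rhs = dt * (f0 + dt * dfdx * s0.v);
    double dv  = rhs / A;

    /* Step 3: update velocity first, then position with the new velocity. */
    state s1;
    s1.v = s0.v + dv;
    s1.x = s0.x + s1.v * dt;
    return s1;
}
```

With m = 1, b = 2, h = 1, rest = 0, Δt = 0.5 and an initial state x = 1, v = 0, the scalar system is 2Δv = −1, so the step yields v = −0.5 and x = 0.75.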
5 Numerical Experiments
We conducted numerical experiments to investigate the performance of the SILCbased simulation codes. Table 2 shows the computing environments used for the experiments. We compared the performance of the original code and the SILC-based codes by running them on the same PC (Dell Dimension 8400) and measuring their execution time for the first 20 time steps. The cloth used for the experiments consisted of 10,000 particles. The SILC-based codes were tested
1092
T. Kajiyama et al.
silc_envelope_t v, x; /* Compute force f and Jacobian ∂f /∂x */ SILC_EXEC("p = Y * x; z = sqrt(X_T * (p *@ p))"); SILC_EXEC("fij = p *@ (X * (K_stiff *@ (z - L) /@ z))"); SILC_EXEC("dij = (Y * v) *@ (X * K_damp)"); SILC_EXEC("f = Mg - Y_T * (fij + dij)"); SILC_EXEC("zhat = ones(s, 1) /@ z"); SILC_EXEC("pzhat = p *@ (X * zhat)"); SILC_EXEC("U_L = sparse(U_L_row, U_col, pzhat, 3*n, s)"); SILC_EXEC("U_R = sparse(U_R_row, U_col, pzhat, 3*n, s)"); SILC_EXEC("U = U_L - U_R"); SILC_EXEC("tmp = zhat *@ K_stiff *@ L"); SILC_EXEC("T2 = Y_T * diag(X * tmp) * -Y"); SILC_EXEC("T3 = -U * diag(tmp) * U’"); SILC_EXEC("DfDx = T1 - T2 + T3"); /* Solve AΔv = b */ SILC_EXEC("A = M - (dt * dt) * DfDx - dt * DfDv"); SILC_EXEC("b = dt * (f + dt * (DfDx * v))"); SILC_EXEC("dv = A \\ b"); /* Update velocity v and position x */ SILC_EXEC("v += dv *@ fixed; x += dt * v"); SILC_GET(&v, "v"); SILC_GET(&x, "x");
Fig. 3. The SILC-based code in the comprehensive application style. Only the code segment within the main loop of the simulation is shown. Table 2. The computing environments used for the experiments Name Dell Dimension 8400 SGI Altix 3700
Specifications Intel Pentium 4 3.4 GHz, 1 GB RAM, Windows XP SP2 Intel Itanium 2 1.3 GHz × 32, 32 GB RAM (cc-NUMA), Red Hat Linux Advanced Server 2.1
with two SILC servers, one in the same PC and another in SGI Altix 3700, both in the same Gigabit Ethernet LAN. The original code and local SILC server in the PC utilized a sequential version of the Lis iterative solvers library to solve linear systems, while the remote server in Altix employed an OpenMP-based parallel version of the same library. The execution time of the original code was 11.80 seconds; solving the linear systems took 81.70% of the execution time. Table 3 shows the performance results of the SILC-based code in the limited application style using the local SILC server in the same PC and the remote SILC server running on different numbers of threads. The time spent for client-side computations and the time for data transfer were almost constant regardless of the number of threads, while the time spent for server-side computations was significantly reduced by the multithreaded server. As a result, a total execution time of 8.33 seconds was achieved by using the remote SILC server running on 16 threads. It constituted a 1.90 times speedup compared to the execution time with the remote SILC server on one thread, and a 1.42 times speedup compared to the original code. Table 4 shows the performance results of the SILC-based code in the comprehensive application style using the local and remote SILC servers. There is
Cloth Simulation in the SILC Matrix Computation Framework
1093
Table 3. Performance results of the SILC-based code in the limited application style. The three rows of client-side computations, data transfer, and server-side computations show the breakdowns of the total execution times (in seconds). SILC server Number of threads Client-side computations Data transfer Server-side computations Total execution time Speedup
Local – 1.94 3.36 9.98 15.28 –
1 2.13 4.11 9.58 15.82 1.00
2 2.17 4.05 6.29 12.52 1.26
Remote 4 8 2.08 2.11 4.21 5.24 3.61 1.57 9.90 8.92 1.60 1.77
16 2.10 5.20 1.03 8.33 1.90
32 2.04 5.59 1.57 9.20 1.72
Table 4. Performance results of the SILC-based code in the comprehensive application style. The two rows of data transfer and server-side computations show the breakdowns of the total execution times (in seconds). SILC server Number of threads Data transfer Server-side computation Total execution time Speedup
Local – 1.66 434.91 436.57 –
1 2.35 335.09 337.44 1.00
2 1.71 238.49 240.20 1.40
Remote 4 8 1.23 1.05 114.67 56.46 115.90 57.51 2.91 5.87
16 1.06 32.23 33.28 10.14
32 1.40 23.27 24.67 13.68
no client-side computation since all data is maintained by a SILC server; only the velocity v and position x are retrieved from the server. The code showed good scalability, thanks to the mathematical expressions written so as to exploit data parallelism. However, the code was 2.09 times slower than the original code even with the remote SILC server running on 32 threads. This is mainly because of extra non-floating-point operations present in the SILC-based code. For example, the mathematical expressions in Fig. 3 include four matrix-matrix multiplications, a transposition (by the ' operator), and two calls to the sparse function. Each of these operations creates a sparse matrix in the CRS format as its result, which requires a number of non-floating-point operations such as counting the non-zero elements to be generated and packing them per row.
On the other hand, as described in Section 3, the comprehensive application style has certain advantages with regard to the amount of data on the client side and the amount of data transfer. The amount of data maintained by the SILC-based code in the limited application style is 16.8 MB, while the amount of data in the SILC-based code in the comprehensive application style is 3.82 MB when data is deposited into a SILC server at the beginning of the code and is reduced to 0.763 MB after the initialization. Similarly, the amount of data transfer per time step is 17.7 MB in the case of the limited application style, whereas it is 0.458 MB in the case of the comprehensive application style. These advantages of the comprehensive application style allow a larger piece of cloth to be simulated even on a PC with limited memory capacity, together with a remote SILC server running on a high-performance parallel computer.
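The counting and packing steps mentioned above can be illustrated with a minimal Python sketch. It is not SILC's implementation: the function name dense_to_crs and the dense list-of-rows input are assumptions made for illustration only. The first pass counts the non-zero elements per row and the second pass packs them, and neither pass performs floating-point arithmetic on the matrix entries.

```python
def dense_to_crs(dense):
    """Convert a dense matrix (list of rows) to CRS arrays.

    Hypothetical helper for illustration; returns (values, col_idx, row_ptr).
    """
    n_rows = len(dense)

    # Pass 1: count the non-zero elements of every row (integer work only).
    row_ptr = [0] * (n_rows + 1)
    for i, row in enumerate(dense):
        row_ptr[i + 1] = row_ptr[i] + sum(1 for x in row if x != 0)

    # Pass 2: pack the non-zero elements row by row.
    values = [0.0] * row_ptr[-1]
    col_idx = [0] * row_ptr[-1]
    for i, row in enumerate(dense):
        k = row_ptr[i]
        for j, x in enumerate(row):
            if x != 0:
                values[k] = x
                col_idx[k] = j
                k += 1
    return values, col_idx, row_ptr


if __name__ == "__main__":
    print(dense_to_crs([[1, 0, 2], [0, 0, 0], [0, 3, 0]]))
```

Every operation that produces a sparse result has to pay for both passes, which is consistent with the overhead attributed above to non-floating-point work.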
6
T. Kajiyama et al.
Related Work
Since SILC is a piece of middleware based on a client-server architecture, it is related to Grid RPC middleware such as Ninf-G [8] and NetSolve [9]. In these systems, numerical simulation codes are usually parallelized in a task-parallel manner. It is common in Grid computing to employ geographically distributed servers, which makes it prohibitive to exchange data among the servers to perform computations in a data-parallel manner. To address this issue, Tanaka et al. [10] proposed a hybrid programming model which combines Ninf-G and MPI. In this model, tasks are sent to multiple servers via Grid RPC, while each task is carried out within a server in a data-parallel manner based on MPI. In Ninf-G, however, it is the user’s responsibility to write the MPI-based parallel codes the servers perform. In SILC, computation requests are expressed by means of SILC’s mathematical expressions and automatically parallelized in a data-parallel manner, so that users can write numerical simulation codes without knowing the details of parallel computation on the server side. The use of mathematical expressions to express computation requests in an environment-independent manner is closely related to parallel implementations of Matlab. There are four major categories of parallel Matlab systems: (1) embarrassingly parallel, (2) message passing, (3) back-end support, and (4) Matlab compilers [11]. Among them, those systems in the second and third categories are relevant to SILC. A typical example in the second category is MatlabMPI [12], which allows users to write MPI-based parallel codes in Matlab. User programs for SILC can also be MPI-based parallel programs. In addition, SILC can be used in sequential programs; if that is the case, users can automatically gain the benefits of parallel computation by just using parallel SILC servers. The same approach is taken by parallel Matlab systems of the third category. 
A representative system in this category is Star-P [13,14] which enables a parallel back-end server to be interactively utilized in Matlab. The most significant difference from SILC is that in Star-P, local data in Matlab and remote data in the back-end server are seamlessly handled in such a way that the data on both sides can be referred to and used in one mathematical expression without any restriction. In SILC, data management in a SILC server is completely separated from user programs. This design decision has made SILC’s client-side API relatively simple, allowing the SILC framework to be used in a number of programming languages including C, Fortran, Java, Python, and GNU Octave. In addition, user programs for SILC can be executed in large-scale distributed parallel computing environments based on batch queuing systems [1]. In Star-P, the focus is on the seamless integration of a back-end server with Matlab and their interactive utilization, while SILC focuses on a higher degree of independence from computing environments and programming languages.
7
Concluding Remarks
This paper presented a case study of numerical simulations in the SILC framework. In this study, we adapted an existing cloth simulation code to the SILC
framework and obtained two SILC-based versions of the code according to the proposed application styles. Using a remote parallel SILC server, the SILC-based code in the limited application style outperformed the original code while the SILC-based code in the comprehensive application style achieved good scalability. These results demonstrate the feasibility of numerical simulations within the SILC framework and the usability of the proposed application styles. Our future work includes a performance evaluation of the SILC-based cloth simulation codes in MPI-based parallel computing environments, performance tuning of SILC servers for faster execution of user programs in the comprehensive application style, and further case studies with other types of numerical simulations such as computational fluid dynamics.
Acknowledgment. This research was supported by a grant-in-aid project [6] in the Core Research for Evolutional Science and Technology (CREST) program of the Japan Science and Technology Agency.
References
1. Kajiyama, T., Nukada, A., Suda, R., Hasegawa, H., Nishida, A.: Distributed SILC: An easy-to-use interface for MPI-based parallel matrix computation libraries. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 860–870. Springer, Heidelberg (2007), http://ssi.is.s.u-tokyo.ac.jp/silc/
2. Kajiyama, T., Nukada, A., Suda, R., Hasegawa, H., Nishida, A.: Numerical simulations in the SILC matrix computation framework. In: Proc. ICCM 2007 (2007)
3. Baraff, D., Witkin, A.: Large steps in cloth simulation. In: Proc. ACM SIGGRAPH 1998, pp. 43–54 (1998)
4. Lang, S.: Calculus of Several Variables, 2nd edn. Addison-Wesley, Reading (1979)
5. Hestenes, M.R., Stiefel, E.: Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards 49, 409–436 (1952)
6. Nishida, A., Kotakemori, H., Kajiyama, T., Nukada, A.: Scalable software infrastructure project. In: Löwe, W., Südholt, M. (eds.) SC 2006. LNCS, vol. 4089, Springer, Heidelberg (2006), http://ssi.is.s.u-tokyo.ac.jp/
7. Barrett, R., et al.: Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. SIAM, Philadelphia (1994)
8. Ninf Project: http://ninf.apgrid.org/
9. NetSolve: http://icl.cs.utk.edu/netsolve/
10. Tanaka, Y., Takemiya, H., Nakada, H., Sekiguchi, S.: Design and implementation of flexible, robust and efficient Grid-enabled hybrid QM/MD simulation. Computational Methods in Science and Technology 12, 79–87 (2006)
11. Choy, R., Edelman, A.: Parallel MATLAB: Doing it right. Proceedings of the IEEE 93, 331–341 (2005)
12. Kepner, J., Ahalt, S.: MatlabMPI. Journal of Parallel and Distributed Computing 64, 997–1005 (2004)
13. Shah, V., Gilbert, J.R.: Sparse matrices in MATLAB*P: Design and implementation. In: Bougé, L., Prasanna, V.K. (eds.) HiPC 2004. LNCS, vol. 3296, pp. 144–155. Springer, Heidelberg (2004)
14. Interactive Supercomputing, Inc.: http://www.interactivesupercomputing.com/
Computing the Irregularity Strength of Connected Graphs by Parallel Constraint Solving in the Mozart System
Adam Meissner, Magdalena Niwińska, and Krzysztof Zwierzyński
Institute of Control and Information Engineering, Poznań University of Technology, pl. M. Sklodowskiej-Curie 5, 60-965 Poznań, Poland
{Adam.Meissner, Magdalena.Niwinska, Krzysztof.Zwierzynski}@put.poznan.pl
Abstract. In this paper we show how the problem of computing the irregularity strength of a graph can be expressed in terms of CP(FD) programming methodology and solved by parallel computations in the Mozart system. We formulate this problem as an optimization task and apply the branch-and-bound method and iterative best-solution search in order to solve it. Both of these approaches have been evaluated in experiments. We also estimate the speedup obtained by parallel processing.
Keywords: irregularity strength, parallel constraint solving, Mozart.
1
Introduction
Computing the irregularity strength of a graph is a graph-theoretic problem that has attracted much attention since the pioneering work of Chartrand et al. [3]. For a simple graph G, it consists in assigning positive integers to the edges of G such that the labeling meets the following conditions. Let the value assigned to an edge be the weight of that edge, and let the weight of a vertex v be the sum of the weights of all edges incident to v. If the weights of all vertices of G are pairwise distinct, then the labeling is proper (admissible), and its maximum edge weight is called the strength of the labeling. The irregularity strength of a graph G, denoted s(G), is the minimum strength over all admissible labelings. It should be remarked that no such labeling exists for graphs containing a component isomorphic to K_2 or containing more than one isolated vertex; in this case s(G) is set to ∞. In this paper we restrict ourselves to connected, undirected graphs; hence the term graph hereinafter denotes this kind of graph by default. Analytic results for the irregularity strength have been obtained only for some particular classes of graphs, such as complete graphs, complete bipartite graphs, lattices, cycles and paths. For example, s(G) equals 3 for complete graphs (except K_1 and K_2) and for graphs K_{2k,2k}, where k ≥ 3 [3]. It has been proved that s(G) ≤ n − 1 for any graph G on n vertices with finite irregularity strength [2]. However, a general solution of the considered problem can be obtained by algorithmic methods. Among them, constraint programming over finite domains
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1096–1103, 2008. © Springer-Verlag Berlin Heidelberg 2008
Computing the Irregularity Strength of Connected Graphs
i.e. CP(FD) [1,6], appears to be particularly convenient. In this approach a problem is described by a set of formulas (constraints) containing variables whose domains are limited to finite sets of nonnegative integers. A solution of the problem is any variable assignment which satisfies all the constraints. In order to compute a solution two operations are performed, namely propagation and distribution. Propagation successively narrows the domains of the variables by eliminating values which do not satisfy the constraints. Distribution splits the domain of a given variable into disjoint subsets. This is done by adding new formulas to the constraint set. The rules stating which variable to select and how to split its domain form a distribution strategy. The computation process can be described by a search tree with nodes corresponding to the sets of constraints created during subsequent distribution steps. Every leaf of the tree either represents a solution or is a contradictory set of formulas. In the first case, the domain of every variable appearing in the constraints is restricted to a singleton set (i.e. the variable is determined). It should be noted that the computation process may be carried out in many ways, since there are many strategies for exploring the search tree. In [5] we showed how the irregularity strength of a graph can be computed by sequential constraint solving over finite domains. On the other hand, the search tree can be divided into subtrees and explored in parallel on distributed machines. This possibility is provided by the Mozart system [8], initially developed by the Mozart Consortium consisting of Universität des Saarlandes, the Swedish Institute of Computer Science and the Université catholique de Louvain. The system is a programming environment for the Oz language [7], which integrates popular programming methodologies such as declarative, imperative, object-oriented and constraint programming into a coherent whole.
In particular, it also enables parallel and distributed constraint solving. This task, namely the distributed exploration of the search tree, is performed by special objects called parallel search engines. An engine can be parametrized so that it computes the best solution with respect to a given order relation. In this paper we take advantage of this feature and formulate the considered graph problem as an optimization task in terms of CP(FD). Since the program is independent of the search strategy, it can be easily parallelized, which noticeably reduces the computation time. It should be remarked that we successfully used a similar method (excluding the best-solution search) to solve another graph problem [4].
The paper is organized as follows. In section 2 we express the problem of finding the irregularity strength of a graph in terms of CP(FD) in the Mozart system. We also present two methods which are applied to find the solution by parallel computations. In section 3 we discuss the results of computational experiments. Section 4 comprises some final remarks.
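The interplay of propagation and distribution outlined in this introduction can be made concrete with a small Python sketch. It is only an illustration under simplifying assumptions and not the Mozart implementation: constraints are given as (scope, predicate) pairs, propagation removes every value that has no supporting tuple in some constraint, and distribution picks the leftmost undetermined variable with the smallest domain and splits off its lowest value. The names propagate and solve are hypothetical.

```python
from itertools import product


def propagate(domains, constraints):
    """Narrow the domains in place until nothing changes.

    Returns False if some domain becomes empty (a contradictory node).
    """
    changed = True
    while changed:
        changed = False
        for scope, pred in constraints:
            for var in scope:
                pos = scope.index(var)
                # Values of var that occur in at least one satisfying tuple.
                supported = {
                    combo[pos]
                    for combo in product(*(domains[v] for v in scope))
                    if pred(*combo)
                }
                if supported != domains[var]:
                    domains[var] = supported
                    changed = True
    return all(domains.values())


def solve(domains, constraints):
    """Depth-first exploration of the search tree; returns one solution or None."""
    domains = {v: set(d) for v, d in domains.items()}
    if not propagate(domains, constraints):
        return None  # failed leaf
    undetermined = [v for v in domains if len(domains[v]) > 1]
    if not undetermined:
        return {v: next(iter(d)) for v, d in domains.items()}  # solution leaf
    # Distribution: smallest domain first, split off its lowest value.
    var = min(undetermined, key=lambda v: len(domains[v]))
    low = min(domains[var])
    for branch in ({low}, domains[var] - {low}):
        result = solve({**domains, var: branch}, constraints)
        if result is not None:
            return result
    return None


if __name__ == "__main__":
    doms = {"x": range(1, 5), "y": range(1, 5), "z": range(1, 5)}
    cons = [(("x", "y"), lambda x, y: x < y),
            (("x", "y", "z"), lambda x, y, z: x + y == z)]
    print(solve(doms, cons))
```

Each recursive call of solve corresponds to one node of the search tree. A parallel engine would hand different branches to different workers instead of exploring them one after another.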
2
Formulation of the Problem
Let G = ⟨V, E⟩ be a simple, connected graph consisting of a set of vertices V and a set of edges E, let E(v) ⊆ E denote the set of all edges incident to a vertex
A. Meissner, M. Niwińska, and K. Zwierzyński
v ∈ V, and let |.| symbolize the cardinality of a set. A function f : E → {1, 2, 3, . . .} is called an assignment or labeling of a graph G, a value f(e) is called a weight or a label of an edge e ∈ E, and a value f(v) = f(e1) + . . . + f(en), where n = |E(v)| and ei ∈ E(v) for i = 1, . . . , n, is called a weight or a label of a vertex v ∈ V. An assignment f is called valid iff f(u) ≠ f(v) for every two distinct vertices u and v of G. Let F(G) be the set of all valid assignments of a graph G. The irregularity strength s(G) of a graph G [3] is defined as

$s(G) = \min_{f \in F(G)} \max_{e \in E} f(e)$.

Three exemplary labelings of a given graph are shown in Figure 1. Each of the labelings is valid, since it assigns numbers to the graph edges in such a way that all vertex weights are pairwise distinct. Figure 1 also gives the strength of every labeling, that is to say, the maximal edge weight in the assignment. The irregularity strength of the exemplary graph, in turn, equals 2, because this is the minimal strength taken over all valid labelings (including the labelings not presented in the figure).
Fig. 1. Valid labeling of a simple graph G; s(G) = 2
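The definition of s(G) can be checked directly on very small graphs by exhaustive enumeration. The following Python sketch is an illustration only and not the method used in this paper: the function names and the edge-list representation are assumptions, and the cap max_label on the labels tried has to be supplied by the caller. The search cost grows exponentially with the number of edges.

```python
from itertools import product


def vertex_weights(n, edges, labels):
    """Sum the labels of the edges incident to each of the n vertices."""
    weights = [0] * n
    for (u, v), lab in zip(edges, labels):
        weights[u] += lab
        weights[v] += lab
    return weights


def irregularity_strength(n, edges, max_label):
    """Return s(G) found by brute force, or None if no valid labeling
    with labels in 1..max_label exists.

    Vertices are 0..n-1; edges is a list of (u, v) pairs.
    """
    for k in range(1, max_label + 1):
        for labels in product(range(1, k + 1), repeat=len(edges)):
            if max(labels, default=0) != k:
                continue  # strength below k was already ruled out
            if len(set(vertex_weights(n, edges, labels))) == n:
                return k
    return None


if __name__ == "__main__":
    path3 = [(0, 1), (1, 2)]
    triangle = [(0, 1), (0, 2), (1, 2)]
    print(irregularity_strength(3, path3, 5))
    print(irregularity_strength(3, triangle, 5))
```

For the triangle K_3 the vertex weights are pairwise distinct only if all three edge labels differ, which agrees with the value s(G) = 3 quoted for complete graphs in section 1.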
As said before, finding the irregularity strength of a graph can be expressed as a CP(FD) problem in the Oz language. For this purpose the function ValidLabs is defined. It creates the set of constraints representing any valid assignment of a given graph. More precisely, the function takes the adjacency matrix ATab of the graph together with the bound MaxLab and returns a unary procedure whose argument (Sol) is a list of variables symbolizing the labels of the edges. The symbol $ in lines 2 and 10 stands in place of a procedure name, so that the whole procedure definition is used as an anonymous procedure value.

   fun {ValidLabs ATab MaxLab}                                % 1
      proc {$ Sol}                                            % 2
         Constr VWeigs VertNo MinDeg MaxDeg                   % 3
      in                                                      % 4
         {MakeConstr ATab VertNo MinDeg MaxDeg Constr Sol}    % 5
         Sol ::: 1#MaxLab                                     % 6
         {FD.list VertNo MinDeg#MaxDeg*MaxLab VWeigs}         % 7
         {FD.distinct VWeigs}                                 % 8
         {List.forAllInd Constr                               % 9
          proc {$ I L}                                        % 10
             {FD.sum L '=:' {List.nth VWeigs I}}              % 11
          end}                                                % 12
         {FD.distribute ff Sol}                               % 13
      end                                                     % 14
   end                                                        % 15
At first, the procedure MakeConstr is called (line 5) in order to create the list Sol and, additionally, the list Constr. It also counts the number of vertices of the input graph (VertNo) and computes the minimal (MinDeg) and the maximal (MaxDeg) vertex degree. The list Constr consists of sublists, each grouping the labels (denoted by variables) of the edges incident to the same vertex. Hence, the length of the list Constr equals VertNo. The domains of the variables from the list Sol are restricted to positive integers not greater than the input argument MaxLab (line 6). The usage of this argument is described in the sequel. In the next step (line 7), the list VWeigs is created, whose elements symbolize the weights of the vertices. This is done by the predefined procedure FD.list. It should be noticed that the weight of any vertex cannot be lower than MinDeg and cannot exceed the product MaxDeg*MaxLab. According to the definition of a valid assignment of a graph, all weights of the vertices are pairwise distinct (line 8). The constraints defined in lines 9 - 12 state that the sum of the weights of all edges incident to the same vertex is equal to the weight of this vertex. In other words, the constraints define the connections between the elements of the list VWeigs and the variables forming the sublists of the list Constr. Finally, line 13 indicates the distribution strategy called first fail (abbreviated ff), which is predefined in the Mozart system. The strategy selects the leftmost undetermined variable from the list Sol whose current domain is the smallest. Furthermore, it uses the lowest element of the domain as the splitting constant. We observed that for the majority of input graphs the strategy ff generates smaller search trees than any other predefined strategy. The size of the search tree also depends on the order of the variables contained in the list Sol.
However, the issue of finding ordering rules which reduce the search tree for a given input graph is still open. As said before, a valid assignment represented by a valuation of the elements of the list Sol can be computed by Oz objects called parallel search engines. Below, we give an example of a command that creates the engine Eng.

   Eng = {New Search.parallel init(a:1#ssh b:2#ssh)}
In consequence, four computational processes are initiated, namely a manager and three workers: one on the remote machine a and two on the machine b. The manager controls the computations by finding work for idle workers and collecting the solutions, while the workers explore fragments of the search tree. All the processes communicate with one another using the remote command interpreter ssh. The engine executes the procedure returned by the function call {ValidLabs ATab {Dim ATab}-1} in order to find a valid assignment of a given graph. The restriction {Dim ATab}-1 imposed on the maximal edge label follows from the result cited in section 1. The expression {Dim ATab} denotes the number of vertices of the graph, which is equal to the dimension of the adjacency matrix ATab. Due to the technical demands of Oz, the procedure has to be passed to the engine via a functor [8], hereinafter named SolFun. The engine may look for any one solution (if there is no solution, the empty list nil is returned):

   {Eng one(SolFun Sol)}
or it may search for the best solution with respect to an order specified by some binary procedure defined in the functor SolFun:

   {Eng best(SolFun Sols)}
In the latter case the argument Sols is an increasingly ordered list of lists representing successive solutions. Thus, the best solution is the last element of the list Sols. The engine computes a better solution using a branch-and-bound search (BB) that works as follows. Every worker looks for a new solution that is better than the best solution found so far. When a new solution is found, it is passed to the manager, which checks whether it is really better. If so, the manager broadcasts the new best solution to all workers. It should be noticed that the maximal element of every list Sol represents the strength of the corresponding labeling. Therefore, if the list Sols is ordered so that each successive solution has a lower strength than its predecessor, then the last sublist of Sols yields the irregularity strength. The considered order can be expressed by the following binary procedure BetterSol defined inside the functor SolFun.

   proc {BetterSol OldSol NewSol}
      NewSol ::: 1#{MaxElem OldSol}-1
   end
The procedure constrains all variables from the list NewSol to the interval [1, {MaxElem OldSol} − 1], where the expression {MaxElem OldSol} represents the maximal element of the list OldSol. In other words, the procedure BetterSol makes the strength in the list NewSol lower by at least 1 than the strength in OldSol. After the search engine has constructed the list Sols, the irregularity strength of the given graph can be denoted as {MaxElem {List.last Sols}}, that is, the maximal element of the last sublist in Sols. The irregularity strength can also be computed by iterative best-solution search (IB). This simple method, sketched in [6], consists in finding one solution which acts as a base for building a new constraint. Then, the process is repeated with the extended set of constraints until no further solution is found. This approach is implemented by the following function.

   fun {IB Eng ATab}                                          % 1
      fun {IB1 Eng ATab MaxLab}                               % 2
         Sol                                                  % 3
      in                                                      % 4
         Export the procedure {ValidLabs ATab MaxLab-1}
            from the functor SolFun                           % 5
         {Eng one(SolFun Sol)}                                % 6
         if Sol == nil then MaxLab                            % 7
         else {IB1 Eng ATab {MaxElem Sol}} end                % 8
      end                                                     % 9
   in                                                         % 10
      {IB1 Eng ATab {Dim ATab}-1}                             % 11
   end                                                        % 12
The function IB applies the auxiliary function IB1 to compute the irregularity strength of a graph given by the adjacency matrix ATab. In order to skip some
technical details, line 5 is given in pseudocode. It contains the call of the function ValidLabs, which constructs a new set of constraints with the restriction MaxLab-1 imposed on the maximal edge label. The parameter MaxLab is initially set to {Dim ATab}-1 and in every subsequent iteration it is equal to the strength computed in the previous function call (line 8); if no other solution is found, it becomes the final result (line 7). The iterative best-solution search is in general less efficient than the branch-and-bound strategy. This follows from the fact that the first method reconstructs the search tree from scratch in every iteration, which is not necessary in the second approach. However, the iterative best-solution search may yield very good performance if the cost of searching for a solution is low in every iteration. In the next section we discuss the results of computational experiments concerning both of the presented methods.
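The difference between the two strategies can be sketched sequentially in Python. This is only an illustration under stated assumptions and not the Mozart search engine: the solver is a plain depth-first enumeration without propagation, the graph is given as an edge list with at least one edge, and, as in the function IB above, the initial bound max_lab is assumed to be a strength that some valid labeling attains. All function names are hypothetical.

```python
def weights_distinct(n, edges, labels):
    """True iff all n vertex weights induced by the edge labels differ."""
    weights = [0] * n
    for (u, v), lab in zip(edges, labels):
        weights[u] += lab
        weights[v] += lab
    return len(set(weights)) == n


def search_one(n, edges, max_lab):
    """Depth-first search for one valid labeling with labels in 1..max_lab."""
    def extend(labels):
        if len(labels) == len(edges):
            return labels if weights_distinct(n, edges, labels) else None
        for lab in range(1, max_lab + 1):
            found = extend(labels + [lab])
            if found is not None:
                return found
        return None
    return extend([])


def iterative_best(n, edges, max_lab):
    """IB: restart the whole search with a tighter bound after each solution."""
    while True:
        sol = search_one(n, edges, max_lab - 1)
        if sol is None:
            return max_lab
        max_lab = max(sol)


def branch_and_bound(n, edges, max_lab):
    """BB: a single traversal in which the bound tightens as solutions appear."""
    best = max_lab

    def extend(labels):
        nonlocal best
        if labels and max(labels) >= best:
            return  # this partial labeling cannot beat the current best
        if len(labels) == len(edges):
            if weights_distinct(n, edges, labels):
                best = max(labels)
            return
        lab = 1
        while lab <= best - 1:  # best is re-read, so the range shrinks
            extend(labels + [lab])
            lab += 1

    extend([])
    return best


if __name__ == "__main__":
    path4 = [(0, 1), (1, 2), (2, 3)]
    print(iterative_best(4, path4, 3), branch_and_bound(4, path4, 3))
```

In both functions the final step is an exhaustive search that finds nothing, which is the step reported in section 3 as dominating the running time.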
3
Computational Experiments
The experiments described in this section consisted in measuring the time of parallel computations performed on distributed machines and aimed at calculating the irregularity strength of exemplary graphs. The goals of the experiments were as follows:
– evaluation of iterative best-solution search and branch-and-bound search,
– estimation of the speedup achieved by parallel processing.
All graphs selected for the tests, depicted in Figure 2, belong to the class of 3-regular graphs on 12 vertices. The graphs were chosen under the general criterion that the time of computations executed on one machine should range from 30 to 300 sec. and should be different for every exemplary graph. The computational environment comprised six identical machines, each equipped with a Pentium P4D 3.4 GHz processor, 1 GB RAM and a 1 GB Ethernet adapter, running MS Windows 2000 Professional 5.00 and the Mozart system 1.3.2. One of the machines
Fig. 2. Test graphs with a valid assignment
Table 1. The computational time [sec.] for BB and IB search strategies

        1 worker          2 workers         3 workers       4 workers       5 workers
Graph   BB       IB       BB      IB        BB      IB      BB      IB      BB      IB
G1      33.59    33.46    17.28   17.23     11.95   11.94   9.12    9.16    7.48    7.59
G2      40.85    40.46    20.83   20.54     14.39   14.24   11.08   10.94   8.96    8.9
G3      75.89    74.92    38.39   37.85     26.11   25.81   19.94   19.75   16.06   16.09
G4      253.59   252.07   128.1   126.38    85.72   84.95   64.71   63.85   51.8    51.2
G5      147.41   145.84   74.24   73.69     50.14   49.93   38.08   37.81   30.65   30.17
was designated for running the manager while each of the other computers ran one worker. The results of the tests are collected in Table 1. The first header row contains the number of workers in the given variant of the computational environment. For example, the column marked by the label “3 workers” corresponds to the variant consisting of the manager and three machines which run the workers. The symbols in the second header row (i.e. BB or IB) denote the search strategy applied to calculate the irregularity strength. The computational time was taken from the system clock and every entry in the table is the arithmetic mean of seven runs. One can easily observe that the results obtained for the BB and IB strategies are nearly identical, which is a rather unexpected effect. However, this can be explained by experiments in which we measured the time of every iteration of the IB algorithm (analogous tests are impossible for the BB strategy since it is hardwired in the search engine). It turned out that in every case almost the whole computational time (i.e. more than 98% of it) was taken up by the last iteration, which is necessary for detecting that no more solutions can be found. This process demands the exploration of the whole search tree and, since it does not lead to any solution, the strategy of selecting the best solution has no impact on it. In Figure 3 we present the speedup of computations (for the BB strategy), defined as the quotient of the computational time measured in the environment with one worker and the time measured in an environment with a given, larger number of workers. Every bar depicts a speedup obtained
Fig. 3. Speedup of computations
for a particular graph; the shading of the bar indicates the number of workers in the computational environment, as given in the legend on the right side of the chart. One can observe that the speedup is almost linear for every graph. This demonstrates the efficiency of the method that the Mozart search engines use to partition a search tree.
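The speedup values behind Figure 3 can be recomputed from the BB columns of Table 1. The short Python sketch below does this; the parallel efficiency (speedup divided by the number of workers) is an additional quantity introduced here for illustration and is not reported in the paper.

```python
# BB times in seconds from Table 1, listed for 1, 2, 3, 4 and 5 workers.
bb_times = {
    "G1": [33.59, 17.28, 11.95, 9.12, 7.48],
    "G2": [40.85, 20.83, 14.39, 11.08, 8.96],
    "G3": [75.89, 38.39, 26.11, 19.94, 16.06],
    "G4": [253.59, 128.1, 85.72, 64.71, 51.8],
    "G5": [147.41, 74.24, 50.14, 38.08, 30.65],
}


def speedups(times):
    """Quotient of the one-worker time and the time for each worker count."""
    return [times[0] / t for t in times]


def efficiencies(times):
    """Speedup divided by the number of workers (1 would be ideal)."""
    return [s / (k + 1) for k, s in enumerate(speedups(times))]


if __name__ == "__main__":
    for graph, times in bb_times.items():
        print(graph, "  ".join(f"{s:.2f}" for s in speedups(times)))
```

With five workers the speedup is about 4.49 for G1 and about 4.90 for G4, so the larger instances come closer to the ideal value of 5.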
4
Final Remarks
We have shown that the irregularity strength of a graph can be successfully computed by means of CP(FD) programming and parallel processing in the Mozart system. The experiments showed that both search strategies applied to solve this problem, that is, branch-and-bound search and iterative best-solution search, result in nearly identical computational times for all the exemplary graphs. This follows from the fact that more than 98% of the time is consumed by the final exploration of the whole search tree, on which the strategy of finding the best solution has no influence. It remains to be examined whether this effect holds for other graphs, too. However, we expect that the efficiency of computations can be improved by defining a distribution strategy which minimizes the search tree with respect to the structure of a considered input graph.
References
1. Apt, K.R.: Principles of Constraint Programming. Cambridge University Press, Cambridge (2003)
2. Aigner, M., Triesch, E.: Irregular assignments of trees and forests. SIAM J. Discrete Math. 3(4), 439–449 (1990)
3. Chartrand, G., Jacobson, M.S., Lehel, J., Oellermann, O.R., Ruiz, S., Saba, F.: Irregular networks. Congressus Numerantium 64, 187–192 (1988)
4. Meissner, A., Zwierzyński, K.: Vertex-magic Total labeling of a Graph by Distributed Constraint Solving in the Mozart System. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 952–959. Springer, Heidelberg (2006)
5. Meissner, A., Niwińska, M.A., Zwierzyński, K.: Computing an irregularity strength of selected graphs. El. Notes in Discrete Mathematics 24, 137–144 (2006)
6. Schulte, Ch.: Programming Constraint Services, PhD thesis. University of Saarlandes (2000)
7. Smolka, G.: The Oz Programming Model. In: van Leeuwen, J. (ed.) Computer Science Today. LNCS, vol. 1000, pp. 324–343. Springer, Heidelberg (1995)
8. The Mozart Programming System (2004), http://www.mozart-oz.org
DPSKEL: A Skeleton Based Tool for Parallel Dynamic Programming
Ignacio Peláez, Francisco Almeida, and Fernando Suárez
Departamento de Estadística, I. O. y Computación, Universidad de La Laguna, c/ Astrofísico F. Sánchez s/n, 38271 La Laguna, Spain
[email protected], [email protected]
Abstract. Skeleton based libraries are considered one of the alternatives to reduce the distance between end users and parallel architectures. Algorithmic skeletons are based on general procedures describing the method to be implemented. Although a gap between general formalizations for dynamic programming and software components can be found, we develop a skeleton tool for dynamic programming problems. The design strategy is general enough to consider a wide range of dynamic programming recurrences. As usual in skeleton approaches, the parallelism is provided in a transparent manner, so that sequential users may access the system. A set of test problems representative of different classes of dynamic programming formulations has been used to validate the distributed memory implementation on an IBM-SP.
1
Introduction
Dynamic programming (DP) is an important problem-solving technique that has been widely used in various fields such as control theory, operations research, biology and computer science. We aim to develop an algorithmic skeleton devoted to DP. It is a well known fact that algorithmic skeletons are usually supported by general procedures describing the method to be implemented. These general procedures are usually derived from the formalization of the algorithmic technique. There exist numerous formalizations for dynamic programming, for example [1], [2]. Most of these works, however, stop short of presenting a single, generic program that solves all applications of their formalization at one stroke. The construction of such a generic program, and the appropriate means for expressing it, have been the subjects of [3] and [4], but the important class of polyadic recurrences [5], [6] is out of the scope of both approaches. In some other cases [7], the problem formulation is devoted to nonserial polyadic problems, while multistage serial problems are assumed to be obtained as particular cases. Although this is true from the theoretical point of view, in practice each class of DP problem must be considered as an individual case in order to be efficient.
This work has been partially supported by the EC (FEDER) and the Spanish MEC (Plan Nacional de I+D+I, TIN2005-09037-C025).
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1104–1113, 2008. © Springer-Verlag Berlin Heidelberg 2008
DPSKEL: A Skeleton Based Tool for Parallel Dynamic Programming
We conclude that generic approaches are either limited to classes of problems or not suitable to be adopted by a software component. A similar situation has been found in the parallel case. Parallelizations for specific DP problems have been presented by many authors [8], [9], [10], [11], [12], [13], [14], [15], and general parallel procedures are restricted to limited classes of recurrences. All of these cases suggest, however, parallel algorithms for problems in the same class. A unified general parallel approach was presented in [16] as an extension to the work of [3], but the strong theoretical effort introduced for the polyadic case dissuades us from using it as a model in our skeleton. At this point, and according to the former discussion, we claim that finding generic DP formalizations and software components as a whole remains an open question. This fact is a handicap for developing general software tools for DP, both sequential and parallel. For this reason, most of the software approaches devoted to dynamic programming [17], [18], [19], [20] are inefficient or are oriented just to particular recurrences. That is the case of the DP skeleton of the Mallba [21] library, a library of algorithmic skeletons to solve combinatorial optimization problems. That Mallba-DP release was restricted to multistage DP problems and the parallelization was focused only on clusters of PCs. The extensibility of that skeleton was constrained by the initial design, and new paradigms were not supported. We present here the message passing release of DPSKEL [22], which follows the general design strategy of the Mallba library, i.e., a common interface for the algorithm on sequential and parallel platforms, hiding the parallelism from non-expert users. The contribution is a general tool supporting most classes of recurrences appearing in DP problems that includes the mechanisms needed to extend the library when new recurrences are found.
The main motivations for the current design of the library are the remarkable absence of tools for DP and the existence of many scattered formalizations that are not general enough to be considered in an algorithmic skeleton. Many parallel platforms are supported: shared and distributed memory machines, hybrid architectures combining both, and heterogeneous platforms. Moreover, several parallelizations can be supplied on each platform for the same recurrence, and new approaches are easily introduced. Parallelizations can now be as efficient as ad hoc algorithms developed for DP programs. Here we deliver the MPI distributed memory implementation of the library and test it using different dynamic programming recurrences. The paper is structured as follows. Section 2 summarizes the DP method and the design strategy, and the skeleton is presented in Section 3. In Section 4 we show the computational results obtained on an IBM computer. The results show the efficiency of the approach and the advantages of the methodology. We close the paper with a section of conclusions and future lines of work.
2 The Dynamic Programming Methodology
We found a starting point for our methodology in [23], where sequential and parallel dynamic programming formulations are presented through case studies, and
I. Peláez, F. Almeida, and F. Suárez
each case represents a general class of recurrences. When a new problem needs to be solved, any of the former examples can be used as a guide. These samples suggest parallel algorithms for other problems in the same class. DPSKEL provides support for these well-known classes of recurrences. DP problems belonging to these classes can be implemented using a common interface for sequential and parallel executions. The design is flexible enough to deal with new DP problems. Next, we summarize some DP concepts needed to use the skeleton, using the notation common to most formalizations. DP represents the solution to an optimization problem as a recursive equation whose left side is an unknown quantity and whose right side is a minimization (or maximization) expression. Such an equation is called the functional equation or recurrence equation for DP. DP problems may be classified in terms of the functional equation. A functional equation that contains a single recursive term yields a monadic DP formulation. DP formulations whose cost functions contain multiple recursive terms are called polyadic formulations. The functional equations provide the mechanism to obtain optimal solutions to subproblems. The dependencies between subproblems in a DP formulation can be represented by a directed graph. Each node in the graph represents a subproblem; nodes are usually referred to as states. A directed edge from state i to state j indicates that the solution to the subproblem represented by state i is used to compute the solution represented by state j. The edges of the graph are commonly known as decisions. The optimal solution for a subproblem is obtained as a sequence of decisions (a policy) starting from the initial state. Usually the states of the graph are arranged in a two-dimensional matrix (the DP table) holding the information relevant to evaluating the functional equation and the optimal policy. The DP table is necessary to store precomputed states.
Once the table has been computed, a backward traversal of the table yields the optimal policy. Thus, solving a Dynamic Programming problem consists of obtaining the functional equation, evaluating the DP table and, if necessary, recovering the optimal policy. If the graph is acyclic, then the states of the graph can be organized into levels or stages such that subproblems at a particular level depend only on subproblems at previous stages. In this case, the DP formulation can be categorized as follows. If subproblems at all levels depend only on the results at the immediately preceding level, the formulation is called a serial or multistage DP formulation; otherwise, it is called a nonserial DP formulation. Based on the preceding classification criteria, four classes of DP recurrences can be defined: serial monadic, serial polyadic, nonserial monadic, and nonserial polyadic. These classes are not exhaustive; some DP formulations cannot be classified into any of these categories. Even so, our methodology may still be applicable, and the skeleton can easily be extended if required. Table 1 shows examples of DP problems in each of the classes. These examples will be used as test problems in Sections 3 and 4 to illustrate and validate the use of our skeleton. See [23] for a detailed description of these problems and formulations.
Table 1. Examples of DP recurrences

| Problem | Recurrence | DP Category |
|---|---|---|
| 0/1 Knapsack (KP) | $f_{k,c} = \max\{f_{k-1,c},\; f_{k-1,c-w_k} + p_k\}$ | Serial Monadic |
| Longest Common Subsequence (LCS) | $f_{i,j} = f_{i-1,j-1} + 1$ if $x_i = y_j$; $f_{i,j} = \max\{f_{i,j-1},\; f_{i-1,j}\}$ if $x_i \neq y_j$ | Nonserial Monadic |
| Floyd's All-Pairs Shortest-Paths (FSP) | $d^{k}_{i,j} = \min\{d^{k-1}_{i,j},\; d^{k-1}_{i,k} + d^{k-1}_{k,j}\}$ | Serial Polyadic |
| Optimal Matrix Parenthesization (MPP) | $c_{i,j} = \min_{i \le k < j}\{c_{i,k} + c_{k+1,j} + r_{i-1} r_k r_j\}$ | Nonserial Polyadic |
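As a concrete instance of the three steps named in Section 2 (functional equation, table evaluation, policy recovery), the following is a minimal stand-alone sketch for the 0/1 Knapsack recurrence of Table 1. It does not use DPSKEL; the names `KnapsackResult` and `solve_knapsack` are assumptions introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Optimal profit plus the policy (one 0/1 decision per item).
struct KnapsackResult {
    int profit;
    std::vector<int> taken;  // taken[k] == 1 if item k is in the optimal solution
};

// Evaluates f[k][c] = max{ f[k-1][c], f[k-1][c - w_k] + p_k } row by row,
// then walks the table backwards to recover the decisions.
KnapsackResult solve_knapsack(const std::vector<int>& weight,
                              const std::vector<int>& profit,
                              int capacity) {
    assert(weight.size() == profit.size() && capacity >= 0);
    const std::size_t n = weight.size();
    // Row 0 is the empty set of items; row k considers items 0..k-1.
    std::vector<std::vector<int>> f(n + 1, std::vector<int>(capacity + 1, 0));

    for (std::size_t k = 1; k <= n; ++k) {        // stages (rows)
        for (int c = 0; c <= capacity; ++c) {     // states within a stage
            f[k][c] = f[k - 1][c];                // decision 0: skip the item
            if (c >= weight[k - 1]) {
                int with_item = f[k - 1][c - weight[k - 1]] + profit[k - 1];
                if (with_item > f[k][c]) f[k][c] = with_item;  // decision 1
            }
        }
    }

    KnapsackResult result{f[n][capacity], std::vector<int>(n, 0)};
    int c = capacity;
    for (std::size_t k = n; k >= 1; --k) {        // backward traversal
        if (f[k][c] != f[k - 1][c]) {             // value changed: item was taken
            result.taken[k - 1] = 1;
            c -= weight[k - 1];
        }
    }
    return result;
}
```

For weights {2, 3, 4, 5}, profits {3, 4, 5, 6} and capacity 5, the table gives an optimal profit of 7, reached by taking the first two items.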
3 The Dynamic Programming Skeleton
Several reasons led us to the current design of DPSKEL: the notable absence of general software tools for DP (sequential and parallel) and the apparent gap between general methods and DP applications. We propose to bridge this gap while minimizing the user effort needed to manage the tool and keeping to standard methodologies. We enumerate important features that, in our opinion, should be covered by a software library oriented to the DP problem-solving paradigm:
– Coherence with a methodology. The tool should be close to the methodology used to derive the functional equations.
– Expressiveness to represent functional recurrences.
– Support for a wide range of problems in terms of classes of recurrences.
– Flexibility to add new classes of recurrences or solving strategies.
– Ease of use.
– Portability.
– Efficiency.
– Transparent use of parallel platforms for non-expert users.
The development of the software skeleton for DP involves analyzing which elements of the technique can be abstracted from particular cases and which depend on the application. We assume that the user has been able to obtain the functional equations by herself. The user supplies the structure of a state and its evaluation through the functional equations, and the DP table is abstracted as a table of states. DPSKEL provides the table and several methods to traverse it during the evaluation of the states. These methods allow different traversing modes (by rows, by columns, by diagonals), and the user picks the best one that does not violate the dependences of the functional equations. Some recurrences admit several traversing modes in terms of dependences, but the running time may vary from one mode to another.
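To illustrate that one recurrence may admit several traversing modes, the sketch below evaluates the LCS recurrence of Table 1 both by rows and by anti-diagonals and obtains the same table value. It is a stand-alone illustration rather than DPSKEL code; `evaluate`, `lcs_by_rows` and `lcs_by_diagonals` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using Table = std::vector<std::vector<int>>;

// LCS functional equation: state (i, j) reads (i-1, j-1), (i-1, j), (i, j-1).
void evaluate(Table& f, const std::string& x, const std::string& y,
              std::size_t i, std::size_t j) {
    if (x[i - 1] == y[j - 1]) f[i][j] = f[i - 1][j - 1] + 1;
    else f[i][j] = std::max(f[i][j - 1], f[i - 1][j]);
}

// Traversing mode "by rows": every row is completed before the next starts.
int lcs_by_rows(const std::string& x, const std::string& y) {
    const std::size_t n = x.size(), m = y.size();
    Table f(n + 1, std::vector<int>(m + 1, 0));
    for (std::size_t i = 1; i <= n; ++i)
        for (std::size_t j = 1; j <= m; ++j)
            evaluate(f, x, y, i, j);
    return f[n][m];
}

// Traversing mode "by diagonals": anti-diagonal d holds the states with
// i + j == d. They depend only on diagonals d-1 and d-2, so the states of
// one diagonal are mutually independent and could be evaluated in parallel.
int lcs_by_diagonals(const std::string& x, const std::string& y) {
    const std::size_t n = x.size(), m = y.size();
    Table f(n + 1, std::vector<int>(m + 1, 0));
    for (std::size_t d = 2; d <= n + m; ++d) {
        const std::size_t lo = (d > m) ? d - m : 1;   // keeps j = d - i <= m
        const std::size_t hi = std::min(n, d - 1);    // keeps j = d - i >= 1
        for (std::size_t i = lo; i <= hi; ++i)
            evaluate(f, x, y, i, d - i);
    }
    return f[n][m];
}
```

Both functions respect the dependences of the recurrence; only the diagonal mode exposes a set of independent states at each step, which is what a parallel solver can exploit.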
In the sequential case, the traversing mode indicates that the states of the row (column or diagonal, respectively) will be processed in sequence, while in the parallel case it means that the row (column or diagonal, respectively) is evaluated using the whole set of processors simultaneously. This approach makes it possible to introduce any of the parallelization strategies designed for DP algorithms. The dimensions of the DP table
depend on the instance and should be given at run time. The table is created dynamically with memory obtained from the system, so it can be very large (depending on the system's resources). Both object oriented languages and functional languages provide the high level of expressiveness required by the skeleton. Functional languages are sometimes criticized for being inefficient and far from most development standards, and parallelization under this paradigm can also be a drawback. Among the object oriented languages, C++ has proved to achieve acceptable levels of expressiveness at a low loss of efficiency. Parallelism can be easily introduced in C++ through the bindings provided by standard parallel libraries like MPI or OpenMP. DPSKEL has been developed in C++, which makes it portable to most sequential and parallel platforms. The DPSKEL skeleton follows the class model described in the Mallba library and adds the particular elements related to DP, while keeping the features of efficiency and ease of use. The concepts of State and Decision are abstracted into C++ classes that the user is required to supply. The user describes the problem and the solution, and the methods to evaluate a state (the functional equation) and to get the optimal solution. The classes provided by DPSKEL allocate and evaluate the DP table and supply the methods needed to return the solution. Implementation details are hidden from the user. Several solver engines are provided to manage sequential and parallel executions on shared, distributed, hybrid and heterogeneous platforms. Each solver implements different traversing modes of the DP table. We now show some basic classes of DPSKEL.
3.1 The Class State
The class State holds the information associated with a DP state. This class stores and computes the optimal value in the state and the decision associated with the optimal evaluation. The evaluation of a state involves access to the information of some other states in the DP table. DPSKEL provides an object of the class Table hidden in each instance of the class Solver. The methods GET_STATE(i, j) and PUT_STATE(i, j) of the Table get states from and insert states into the table. The code in Figure 1 defines the class State for our example. It defines a problem (pbm), a solution (sol), a decision (d) and the DP table (table). These variables may be considered generic, since they should appear in any problem to be solved. The variable value stores the optimal profit. One method of this class deserves particular mention: the method Evalua, which implements the functional equation. Evalua is assumed to receive the indices of a state in the DP table; any of the recurrences of Table 1 can be expressed under this prototype. If the functional equation for a specific problem requires a different prototype, the skeleton allows the method to be overloaded, relying on the polymorphism inherent to C++.
class State {
    const Problem &pbm;
    Solution &sol;
    int value;
    Decision d;
    Table &table;
public:
    ....  // Methods to manage the class
    void Evalua(int stage, int index);  // Evaluate the state: functional equation
};

void State::Evalua(int stage, int index) {
    State st(pbm, sol, table);
    State tmp(pbm, sol, table);
    .....
    // Decision 0: the item is left out, so the value comes from the previous stage
    st = table.GET_STATE(stage-1, index);
    st.set_decision(0);
    if (index >= pbm.weight(stage)) {
        // Decision 1: the item fits; compare against the value obtained by taking it
        tmp = table.GET_STATE(stage-1, index-pbm.weight(stage));
        if (st.get_value() < (tmp.get_value() + pbm.profit(stage))) {
            st.set_value(tmp.get_value() + pbm.profit(stage));
            st.set_decision(1);
        }
    }
    table.PUT_STATE(st, stage, index);
    .....
}
Fig. 1. Defining the class State. Implementation of the method Evalua of class State for 0/1 Knapsack Problem.
3.2 The Main() Function
Once all required classes have been defined, they can be used to solve the problem and get the results. To do so, the Main() function must be implemented. The code in Figure 2 illustrates an example implementation of this function. The header "KP.hh" is included; it holds the classes previously defined and the classes provided by the DP skeleton. A provided class (Setup) stores the instance dependent parameters needed by the solver engine, in our example the number of stages and the number of states used to allocate the DP table. An instance solver of the class Solver_seq is created, taking the instance to be solved as a parameter. This object is used as the solver for our problem through the method run_byrows(). When run_byrows() is invoked, the states of the DP table are automatically evaluated following the traversing mode by rows. A parallel execution on a distributed memory machine can be performed using the class Solver_distributedmemory: the same method run_byrows() then carries out the parallel evaluation of the rows of the DP table. Note that the interface for sequential and parallel execution is the same; only a different solver class has to be used.
3.3 The Class Solver
The class Solver provides solver engines for different platforms. This class contains the data structures and methods needed to perform a DP execution according to the specifications. In practice, it is a virtual (abstract) class and the solvers
#include "KP.hh"

int Main (int argc, char** argv) {
    Problem pbm;
    Solution sol(pbm);
    Setup setup;

    // Read the solver setup and the problem instance from the two input files
    ifstream f1(argv[1]);
    f1 >> setup;
    ifstream f2(argv[2]);
    f2 >> pbm;

    // Evaluate the DP table by rows and recover the optimal policy
    Solver_seq solver(pbm, setup);
    solver.run_byrows();
    solver.show_table();
    solver.get_policy(setup.get_Num_Stages() - 1, setup.get_Num_States());
    sol = solver.solution();
    cout << sol << endl;
    return 0;
}
Fig. 2. Implementation of the Main() function
provided are defined as subclasses of this main class. The current design considers the following set of Solvers:
– Solver_seq. The sequential solver.
– Solver_sharedmemory. The solver for shared memory systems.
– Solver_distributedmemory. The solver for distributed memory machines.
– Solver_hybrid. The solver for hybrid platforms combining both of the former.
– Solver_heterogen. The solver for heterogeneous environments, where the processing elements may have different computational capabilities and the interconnection network may differ between any pair of processing units.
– Solver_debug. The solver to run under debug mode.
– Solver_profile. The solver to profile and tune the application.
When an object of the Solver class is instantiated, the DP table is dynamically created according to the setup parameters. Figure 3 shows how the DP table is traversed by the method run_byrows of the class Solver_distributedmemory. Rows are evaluated in sequence, and the set of processors computes every row in parallel according to an SPMD programming model. The inner loop is parallelized and the DP table is replicated, so an allgather operation is needed after each iteration. The states are evaluated using the method Evalua supplied by the user. Note that each entry in the DP table stores all the information associated with a state; this information must be serialized before the allgather operation is performed, and the exchanged values must then be properly stored back in the DP table. These operations introduce an extra overhead in the parallel skeleton.
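The row-wise scheme described above (block of states per processor, replicated table, allgather after each row) can be mimicked in a single process without MPI. The sketch below is such a simulation for the knapsack recurrence, not the DPSKEL implementation: each "rank" keeps its own replica, and `block_first`, `block_last` and `run_byrows_simulated` are hypothetical helpers standing in for the partitioning and communication steps.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using Row = std::vector<int>;
using Table = std::vector<Row>;

// Hypothetical block partition: rank r of p owns states [first, last) of a row.
int block_first(int num_states, int rank, int numprocs) {
    return static_cast<int>(static_cast<long long>(num_states) * rank / numprocs);
}
int block_last(int num_states, int rank, int numprocs) {
    return block_first(num_states, rank + 1, numprocs);
}

// Knapsack functional equation for state (stage, c); reads only row stage-1.
int evaluate(const Table& t, const std::vector<int>& w,
             const std::vector<int>& p, int stage, int c) {
    int best = t[stage - 1][c];
    if (c >= w[stage - 1])
        best = std::max(best, t[stage - 1][c - w[stage - 1]] + p[stage - 1]);
    return best;
}

// Simulates the SPMD loop: compute phase per rank, then an allgather that
// leaves the complete row in every replica before the next row starts.
Table run_byrows_simulated(const std::vector<int>& w, const std::vector<int>& p,
                           int capacity, int numprocs) {
    assert(numprocs >= 1 && capacity >= 0 && w.size() == p.size());
    const int num_stages = static_cast<int>(w.size()) + 1;
    const int num_states = capacity + 1;
    std::vector<Table> replica(numprocs, Table(num_stages, Row(num_states, 0)));

    for (int i = 1; i < num_stages; ++i) {
        // Compute phase: every rank fills only its own block of row i.
        for (int r = 0; r < numprocs; ++r)
            for (int j = block_first(num_states, r, numprocs);
                 j < block_last(num_states, r, numprocs); ++j)
                replica[r][i][j] = evaluate(replica[r], w, p, i, j);

        // Allgather phase: concatenate the blocks and store the row everywhere.
        Row gathered(num_states, 0);
        for (int r = 0; r < numprocs; ++r)
            for (int j = block_first(num_states, r, numprocs);
                 j < block_last(num_states, r, numprocs); ++j)
                gathered[j] = replica[r][i][j];
        for (int r = 0; r < numprocs; ++r) replica[r][i] = gathered;
    }
    return replica[0];
}
```

With one rank the function degenerates to the sequential traversal, so comparing its table against a run with several ranks is a simple check that the partition covers every state exactly once. In a real MPI program the gather loop would be a collective call on serialized states, which is where the overhead mentioned in the text arises.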
4 Computational Results
This section validates the performance of the skeleton experimentally. Running times for sequential executions are compared with the execution times obtained with Solver_distributedmemory and Solver_sharedmemory. As test
void Solver_distributedmemory::run_byrows() {
    int i, j, first_state, last_state;
    int num_stages = setup.get_Num_Stages();
    int num_states = setup.get_Num_States();
    // Block of states of each row assigned to this process
    first_state = get_first_state(num_states, name, numprocs);  // Pseudocode
    last_state = get_last_state(num_states, name, numprocs);
    for (i = 0; i < num_stages; i++) {
        for (j = first_state; j < last_state; j++) {
            table.GET_STATE(i, j).Evalua(i, j);
            serialize_state(j);
        }
        // Exchange the serialized blocks and store the full row in the replicated table
        AllGather_row(i);
        insert_in_table_row(i);
    }
}
Fig. 3. Implementation of the function run_byrows in the Distributed Memory Solver
problems we consider the problems shown in Table 1. They represent four different classes of recurrences and can be used as samples for many other DP problems. Three series of randomly generated instances (small, medium and large) have been produced for each problem. Since the KP and FSP problems are serial problems, they have been solved using the traversing mode by rows, both sequentially and in parallel. The nonserial property of the LCS and MPP problems imposes the use of diagonal traversing modes. For the LCS, the diagonal mode starts the evaluation at the upper left corner of the DP table; for the MPP problem it begins the evaluation at the main diagonal and ends at the upper right corner. The computational experiments were carried out on an IBM RS-6000 SP with 8*16 Nighthawk Power3 processors @ 375 MHz (192 Gflop/s) and 64 GB of RAM. The nodes are connected through an SP Switch2 operating at 500 MB/s. In our computational experiments we used only one node of 16 processors. Table 2 shows the running times obtained for the KP, LCS, FSP and MPP problems (see Table 1) on a distributed memory architecture and on a shared memory architecture. Running times are all expressed in seconds. In the distributed memory case, the small instances (KP1, MPP1) do not always improve and MPP2 is marginally slower on two processors than on one; otherwise the running times decrease in all cases when the number of processors is increased. As expected, larger instances show better performance. The shared memory results in Table 2 demonstrate the advantages of the design methodology: the tool shows satisfactory performance in all the classes. The distributed memory results in Table 2 are not as good, especially for the first two classes of problems.
The reason for this lower performance is the need to serialize and communicate the data in the distributed memory architecture: when the time required to evaluate a DP state is small compared with the time needed for serialization and communication, the performance suffers. In the last two examples the state evaluation is more complex, so the performance is better.
Table 2. Running times for different DP problems
Running times in seconds. Columns D-p: distributed memory architecture with p processors. Columns S-p: shared memory architecture with p processors.

| Problem | D-1 | D-2 | D-4 | D-8 | S-1 | S-2 | S-4 | S-8 |
|---|---|---|---|---|---|---|---|---|
| KP1; N = 1600, C = 3200 | 19.93 | 13.53 | 10.79 | 11.71 | 2.379 | 1.222 | 0.640 | 0.359 |
| KP2; N = 3200, C = 6400 | 87.74 | 59.65 | 47.69 | 41.54 | 9.531 | 4.841 | 2.493 | 1.326 |
| KP3; N = 6400, C = 6400 | 179.32 | 124.94 | 98.32 | 89.33 | 18.975 | 9.627 | 4.944 | 2.632 |
| LCS1; N1 = 1000, N2 = 1000 | 3.41 | 2.32 | 1.83 | 1.69 | 0.895 | 0.477 | 0.255 | 0.148 |
| LCS2; N1 = 3000, N2 = 3000 | 32.66 | 21.87 | 16.85 | 14.66 | 10.049 | 4.761 | 2.289 | 1.233 |
| LCS3; N1 = 5000, N2 = 5000 | 92.79 | 62.57 | 47.64 | 41.57 | 30.647 | 13.379 | 6.391 | 3.336 |
| FSP1; N = 200 | 23.41 | 15.1 | 10.8 | 8.84 | 4.537 | 2.316 | 1.169 | 0.611 |
| FSP2; N = 300 | 79.63 | 74.97 | 36.2 | 29.46 | 15.706 | 7.823 | 3.917 | 1.973 |
| FSP3; N = 350 | 126.77 | 84.78 | 57.7 | 46.38 | 25.043 | 12.454 | 6.371 | 3.197 |
| MPP1; N = 100 | 0.11 | 0.12 | 0.09 | 0.12 | 0.0314 | 0.0185 | 0.0118 | 0.0102 |
| MPP2; N = 500 | 13.69 | 13.71 | 10.64 | 6.49 | 6.189 | 3.080 | 1.570 | 0.823 |
| MPP3; N = 1000 | 133.45 | 130.65 | 97.39 | 55.69 | 72.826 | 33.529 | 15.821 | 7.8610 |

5 Conclusions and Future Work
We have developed a skeleton tool for dynamic programming. Implementations for different architectures have been successfully tested and validated. The computational results confirm the advantages of the design methodology on several classes of Dynamic Programming functional equations. The parallelism is hidden from the users, and the skeleton is portable, efficient and easily applied to many applications. As future lines of work we plan to devise new Solvers for different platforms and to use the skeleton with other dynamic programming problems.
Acknowledgements. Thanks to the CEPBA for allowing us to use their machines.
References
1. Helman, P.: A common schema for dynamic programming and branch and bound algorithms. Journal of the ACM 36, 97–128 (1989)
2. Karp, R.M., Held, M.: Finite state process and dynamic programming. SIAM Journal on Applied Mathematics 15, 693–718 (1967)
3. Ibaraki, T.: Enumerative Approaches to Combinatorial Optimization, Part II. Annals of Operations Research 11, 1–4 (1988)
4. de Moor, O.: Dynamic programming as a software component. In: Mastorakis, N. (ed.): Proc. 3rd WSEAS Int. Conf. Circuits, Systems, Communications and Computers (1999)
5. Li, G., Wah, B.: Parallel processing of serial dynamic programming programs. In: Proc. of COMPSAC 1985, pp. 81–89 (1985)
6. Wah, B., Li, G., Fen, C.: Multiprocessing of combinatorial search problems 18, 93–108 (1985)
7. Gibbons, A., Rytter, W.: 3.6. In: Efficient parallel algorithms, Cambridge University Press, Cambridge (1988)
8. Bitz, F., Kung, H.: Path planning on the warp computer using a linear systolic array in dynamic programming. Inter. J. Computer Math. 25, 173–188 (1988)
9. Galil, Z., Park, K.: Parallel algorithms for dynamic programming recurrences with more than O(1) dependency. Journal of Parallel and Distributed Computing 21, 213–222 (1994)
10. Louka, B., Tchuente, M.: Dynamic programming on two-dimensional systolic arrays. Information Processing Letters 29, 97–104 (1988)
11. Miguet, S., Robert, Y.: Dynamic programming on a ring of processors. Hypercube and Distributed Computers, 19–33 (1989)
12. Rytter, W.: On efficient parallel computations for some dynamic programming problems. Theoretical Computer Science 59 (1988)
13. Rodríguez, C., González, D., Almeida, F., Roda, J., García, F.: Parallel algorithms for polyadic problems. In: Proceedings of the 5th Euromicro Workshop on Parallel and Distributed Processing, pp. 394–400 (1997)
14. Andonov, R., Balev, S., Rajopadhye, S., Yanev, N.: Optimal semi-oblique tiling and its application to sequence comparison. In: 13th ACM Symposium on Parallel Algorithms and Architectures (SPAA) (2001)
15. Andonov, R., Rajopadhye, S.: Optimal Orthogonal Tiling of 2-D Iterations. Journal of Parallel and Distributed Computing 45, 159–165 (1997)
16. Morales, D., Almeida, F., Rodríguez, C., Roda, J., Coloma, I., Delgado, A.: Parallel dynamic programming and automata theory. Parallel Computing (2000)
17. Eckstein, J., Phillips, C.A., Hart, W.E.: PICO: An object-oriented framework for parallel branch and bound. Technical report, RUTCOR (2000)
18. Cun, B.L.: Bob++ library illustrated by VRP. In: European Operational Research Conference (EURO 2001), Rotterdam, p. 157 (2001)
19. Lubow, B.C.: SDP: Generalized software for solving stochastic dynamic optimization problems. Wildlife Society Bulletin 23, 738–742 (1997)
20. Lohmander, P.: Deterministic and stochastic dynamic programming, http://www.sekon.slu.se/PLO/diskreto/dynp.htm
21. Alba, E., et al.: MALLBA: A library of skeletons for combinatorial optimisation (research note). In: Monien, B., Feldmann, R.L. (eds.) Euro-Par 2002. LNCS, vol. 2400, pp. 927–932. Springer, Heidelberg (2002)
22. Peláez, I., Almeida, F., González, D.: High level parallel skeletons for dynamic programming. Parallel Processing Letters (to appear, 2006)
23. Kumar, V., Grama, A., Gupta, A., Karypis, G.: Introduction to Parallel Computing: Design and Analysis of Algorithms. The Benjamin/Cummings Publishing Company, Inc. (1994)
SkelJ: Skeletons for Object-Oriented Applications
Joao L. Sobral
Departamento de Informática, Universidade do Minho, Braga, Portugal
[email protected]
Abstract. Development of parallel applications requires adequate languages to effectively modularise and reuse parallelisation patterns. In object oriented applications, parallelisation issues typically cut across multiple classes and become tangled with domain specific code, harming reusability. In this paper we propose a new skeleton-based language for object oriented applications, aiming to attain more modular, reusable, composable and (un)pluggable parallelisations. We propose a set of template based operators that implement common object oriented parallel patterns. The collection includes a set of low level parallel patterns that can be composed to develop higher-level patterns. The results obtained suggest that this is a feasible alternative to traditional approaches and that the performance penalty introduced by the approach has a minor impact on application scalability.
1 Introduction
The development of parallel applications includes all the complexity of sequential programming plus the additional burden of parallelism management, e.g., work partition and distribution among available resources. In traditional approaches these concerns are tangled in source code: when using MPI, for instance, the application's domain specific functionality (core functionality) is mixed with MPI primitives that deal with data partition and communication among processes. The send/receive paradigm has been criticised because sends and receives are scattered through the code [7]. Traditional parallel programming languages are not able to effectively modularise this cross-cutting nature of parallelism. Several approaches aim to improve the development of parallel applications. One approach introduces higher-level entities to express parallel computations, such as actors, active objects, one-way calls and futures [18] or object aggregates [3]. These approaches merge object and parallelism abstractions: parallelism is an intrinsic feature of the provided abstractions. Skeleton approaches [4] encapsulate parallelisation code into skeletons. Skeletons are usually implemented as higher-order functions, parametrised by core functionality. Using high level entities to express parallel computations requires an upfront investment in application redesign and does not promote a separation between domain specific functionality and parallelisation issues. Skeleton approaches promote this separation; however, they also require application redesign to fit into the skeleton structure, they suffer from extensibility problems (e.g., provided skeletons can only be extended in limited ways) and they have problems modelling more complex communication patterns, due to their intrinsic functional nature. This paper proposes a set of skeletons based on code templates to overcome the limitations of previous approaches. Our skeletons are based on aspect oriented techniques [10,11] to implement high level operators that accept classes and methods as parameters, similar to higher-order functions in traditional skeleton-based approaches. In this paper we describe a collection of generic, (un)pluggable, composable parallel skeletons that promote a stronger separation between the domain specific code and parallelisation issues and provide a smoother path from sequential to parallel programming. The rest of the paper is organised as follows. Section 2 presents our collection of parallel patterns and Section 3 presents several benchmarks. Section 4 compares this work with other related approaches and Section 5 concludes the paper.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1114–1121, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Skeleton Collection
This section presents a set of skeletons that were implemented as a set of parametrised aspects. The goal of these patterns is to provide means to (un)plug parallel code into sequential object oriented applications. The idea is close to the original work on skeletons, where a programmer would write "sequential-like" code that would run efficiently on a wide range of parallel systems. As in the original idea of skeletons, the goal is not to perform an automatic parallelisation but to modularise and reuse code that specifies the parallel behaviour. Skeletal programming environments provide efficient implementations of common parallel interaction patterns (e.g., Farm and Pipeline). In these environments the programmer selects the skeleton(s) best suited to the application and provides application specific behaviour. In our approach this is achieved by selecting and composing (i.e., nesting) pre-built parametrised aspects and specifying the skeleton specific parameters (i.e., target class and methods). Instantiation and nesting of parametrised aspects are activities aimed at framework users, who are expected to be non-specialists in parallel computing, as they do not require specific knowledge of aspect implementation details. Inheritance between parametrised aspects is a programming API aimed at framework developers and advanced users, as it gives access to inner features of aspect implementations and provides means to extend the set of pre-built aspects. The description of these extension points is beyond the scope of this paper. Our approach is based on a set of low level patterns, used as foundations for higher-level patterns. We start this section by describing these low level patterns and then show how to implement higher-level parallel patterns by composing them.
2.1 Low Level Patterns
The skeleton library is based on three object-oriented abstractions: separable objects, object aggregates and asynchronous methods [3,18]. Separable objects can be placed on remote nodes and can only be accessed by remote method invocations. An object aggregate is a group of objects that is type compatible with a single instance of the group. The idea is to transparently replace a single instance with a group of similar objects to support parallel processing among elements of the group. Asynchronous methods allow the client object to continue execution without waiting for the called method to finish. The Separate pattern transforms Class T into a type compatible class whose instances can be placed on remote nodes. We provide two types of separate patterns for finer control of object locality. The default Separate implementation can distribute objects across nodes of a local cluster. SeparateFar can distribute objects across cluster boundaries (e.g., inter-cluster). The Replicate pattern creates an object aggregate with n instances of class T (where n is defined at run time or given by the programmer). Object aggregates support additional patterns to control the way method calls are executed by the aggregate. Aggregate members can be referred to by their id (an index within the aggregate). By default, one special replica, called the aggregate representative, is in charge of executing all method calls. This behaviour ensures compatibility with the original code when no other patterns are applied. The Broadcast pattern broadcasts the method calls specified by pointcut P (an expression specifying a set of methods) to all aggregate members. The additional patterns Scatter and Reduce can be used, respectively, to provide a different parameter to each aggregate member and to reduce multiple results into a single value. The Delegate pattern can be used to redirect a method call to a specific aggregate member, given by a member function. Active makes instances of Class T active objects [18].
Active objects and asynchronous method calls can be used to introduce parallel processing between a client and a server. The Async pattern spawns a new thread to execute the method calls specified by P. An optional pointcut join can be used to specify the point in execution where the spawning thread waits for the spawned threads. This join point is provided automatically whenever a method returns a value, by means of Future values [2]. Synch provides functionality similar to the synchronized Java keyword. A broader set of concurrency patterns and primitives is available (including Guards, Readers/Writers and Barriers); their description is beyond the scope of this paper. The After skeleton can be used to specify additional actions in the parallelisation code after specific events. Table 1 illustrates the use of the skeleton collection. The domain specific code (core functionality) creates an instance of class Filter and calls its method mfilter. By applying the parallelisation code, an aggregate of filters is created (Replicate skeleton) and all aggregate elements asynchronously execute method mfilter (Broadcast and Async patterns). The After skeleton specifies that, after execution of the mfilter method, each aggregate element should print its aggregate id by calling the print method.
Table 1. Illustrative skeleton example

Core functionality:
public class Filter { void mfilter() { ... } }
Filter f = new Filter(); f.mfilter();

Parallelisation code:
Replicate
Broadcast
Async
void print() { ... };
After
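The combination of patterns illustrated in Table 1 can be imitated outside AspectJ. The following is only a minimal Python sketch of the idea, not the SkelJ implementation: the class Aggregate, its method broadcast_async and the body of Filter are hypothetical names chosen for illustration, and a thread pool stands in for the Async pattern.

```python
from concurrent.futures import ThreadPoolExecutor


class Filter:
    """Stand-in for the domain class of Table 1 (the body is hypothetical)."""

    def __init__(self):
        self.done = False

    def mfilter(self):
        self.done = True
        return "filtered"


class Aggregate:
    """Hypothetical analogue of Replicate: n instances behind one handle."""

    def __init__(self, cls, n):
        self.members = [cls() for _ in range(n)]
        self.pool = ThreadPoolExecutor(max_workers=n)

    def broadcast_async(self, method, after=None):
        """Call `method` on every member in a pool thread (Broadcast + Async).

        `after(member_id)` runs once a member's call has finished (After).
        One future per member is returned; reading a result is the join point.
        """
        def run(member_id, member):
            result = getattr(member, method)()
            if after is not None:
                after(member_id)
            return result

        return [self.pool.submit(run, i, m) for i, m in enumerate(self.members)]


printed = []                                  # ids recorded by the After action
f = Aggregate(Filter, 4)
futures = f.broadcast_async("mfilter", after=printed.append)
results = [fut.result() for fut in futures]   # blocks until each call has ended
f.pool.shutdown()
```

The client code still manipulates a single handle f, which mirrors the type compatibility between an aggregate and a single instance; the order in which ids are recorded depends on thread scheduling.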
Low level patterns can be composed to achieve more complex parallel patterns. For instance, Separate<Replicate<SomeClass>> creates an aggregate of objects of class SomeClass, whose elements can be placed on remote nodes. An aggregate of aggregates can be specified by Replicate<Replicate<SomeClass>>. These examples are our equivalents of traditional skeleton nesting. The current implementation of these low level skeletons relies on a source to source tool that generates AspectJ code.
2.2
High Level Patterns
To illustrate how low level patterns can be composed into higher level patterns, we present a pattern for a farm parallelisation, identical to the one provided by the JaSkel skeleton framework [6]. The farming parallelisation strategy divides a task into smaller tasks that are processed in parallel by several workers. The master splits the task into subtasks, sends one or several subtasks to each worker, collects the processed subtasks and joins the results. Our skeleton for a parallel farm has the following syntax: Farm. It takes some domain specific Class T and a pointcut compute, creates an aggregate of elements of the same class T (workers), intercepts the request to process a task (compute), splits the task into subtasks and sends a subtask to each worker. The next code presents the implementation of a Farm skeleton using the previously presented low-level skeletons.
01 02 03 04 05 06 07
Farm { Class aggregate = Replicate Scatter Reduce }
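As a rough illustration of the farm strategy (the master splits the task, each worker computes one subtask, the results are joined), here is a minimal Python sketch. It is not the SkelJ or JaSkel code: the names farm, Renderer, split and join are hypothetical, the worker class is a toy, and threads stand in for the remote workers.

```python
from concurrent.futures import ThreadPoolExecutor


def farm(worker_factory, compute, split, join, n_workers):
    """Return a function that processes one task with a farm of workers."""
    workers = [worker_factory() for _ in range(n_workers)]   # Replicate

    def run(task):
        subtasks = split(task, n_workers)                    # Scatter: one part each
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            futures = [pool.submit(compute, w, s)
                       for w, s in zip(workers, subtasks)]
            partial = [f.result() for f in futures]          # wait for all workers
        return join(partial)                                 # Reduce

    return run


class Renderer:
    """Toy worker class; it squares each index of a half-open interval."""

    def render(self, interval):
        lo, hi = interval
        return [i * i for i in range(lo, hi)]


def split(interval, n):
    """Cut [lo, hi) into n contiguous sub-intervals of near-equal size."""
    lo, hi = interval
    step, extra = divmod(hi - lo, n)
    parts, start = [], lo
    for k in range(n):
        end = start + step + (1 if k < extra else 0)
        parts.append((start, end))
        start = end
    return parts


def join(partial_results):
    """Concatenate the per-worker results in worker order."""
    return [x for part in partial_results for x in part]


parallel_render = farm(Renderer, Renderer.render, split, join, n_workers=4)
result = parallel_render((0, 10))
```

The caller only sees parallel_render, which has the same shape as a call to a single worker; this is the property the skeleton relies on to leave the core functionality untouched.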
The farm pattern is implemented by relying on the Replicate low level pattern (line 03) to create an aggregate of instances of type T. Line 04 uses the Scatter skeleton to call the compute method in all aggregate elements, using the split method to provide a different parameter to each aggregate element. Each request may be issued in a new thread to allow parallel processing among workers (code not provided, for simplicity). The Reduce skeleton joins the processed subtasks and returns the processed task, using the method join, when all workers have completed. The farm pattern has been applied to several farm based applications. Table 2 shows the farm pattern applied to the JGF RayTracer [13]. Class T was replaced by the RayTracer class and the compute join point becomes calls to the render method.

Table 2. Ray tracer parallelisation

Core functionality:
RayTracer rt = new RayTracer();
Interval intr = new Interval(0,500);
int[] result = rt.render(intr);

Farm parallelisation:
Vector split(Interval in) { ... // split in into sub-intervals }
int[] join(Vector Separate

3
Performance Evaluation
This section evaluates the overhead introduced by our parallel patterns. We start by presenting a benchmark of the core low-level pattern (Separate). Then we present benchmarks of the high level farm pattern, using the RayTracer parallel benchmark taken from the Java Grande Forum [13]. Presented results are the median of 10 runs and were collected on an 8-node cluster, with two Xeon 5130 per node (a total of 4 cores per node), 4 GB RAM, connected through Gbit Ethernet, running CentOS 4.0, Sun JDK 1.5.0 3 in client mode and AJDT 1.4.0 [1].
The first test measures the overhead of the Separate skeleton. Our current implementation is based on Java Nio sockets. This test is a ping-pong where a class T is remotely placed and receives and returns an array of integers. We compare execution times and bandwidths (Table 3) against a hand written version of the same class T, where communication primitives were directly introduced into the source code. For reference, we also provide execution times for C/MPI (gcc 4.0 and MPICH 1.5.2) and KaRMI (1.07). The overhead of our Separate skeleton relative to the manually written code is very low. These results also show that the well known RMI implementation (KaRMI) falls behind the efficiency of Java Nio, and that all Java implementations have higher latency than C/MPI. We also implemented the Separate skeleton on top of KaRMI, but since KaRMI is less efficient than Java Nio we use the latter in all high level benchmarks. However, since the high level parallel patterns are built upon the Separate skeleton it is trivial to use or to support other communication middleware.

Table 3. Latency and bandwidth of various communication middleware (latency in microseconds, bandwidth in Mbit/second)

             Latency   Bandwidth 1K   Bandwidth 16K   Bandwidth 1M
C/MPI           27         161            454             858
Java Nio        44         120            422             964
Separate        45         119            415             964
KaRMI           63          53            194             804

Table 4 compares speed-ups for the RayTracer (image size of 1000x1000) using the Farm skeleton and an equivalent manually implemented version. The RayTracer presents close to linear speed-up since it only requires communication at the beginning and at the end of the computation. Performance of both implementations is comparable (less than 1%).

Table 4. Raytracer speed-ups

                          4      8      16     32
Manual implementation     3.9    7.8    15.5   30.2
Farm skeleton             3.9    7.7    15.4   30.3
4
Related Work and Discussion
Several approaches provide high level constructs to unify concurrency and object management, where parallelism is implicit in the object semantics. Examples include CHARM++ [9], JavaParty [12] and ProActive [2]. These approaches do not promote the separation between the parallelisation strategy and the core functionality, as the strategy is embedded into high level constructs. Skeletons [4] have a purpose close to our approach, providing a separation between core functionality and the parallelisation strategy. One difference is in how parallel patterns and core functionality are composed to build a parallel application. With skeletons, the core functionality is decomposed into code fragments that fill the hooks provided by the skeleton. In our approach the core functionality must provide the required hook points to plug in parallel patterns. We believe that our approach requires fewer changes to the core functionality code (although it cannot completely avoid them), that it promotes a stronger separation and that the core functionality can remain more "sequential like". OpenMP and other annotation based frameworks support (un)pluggability of parallel behaviour. However, annotations are difficult to compose and extend.
Recent work showed how AOP techniques can modularise concurrency [5], distribution [17] and persistence [14]. An approach to decompose parallelisation issues into more fine grained aspects, namely partition, concurrency and distribution, was proposed in [15], and an aspect collection to support this approach was presented in [16]. We present a set of generic parallel patterns that promote and implement a separation of concerns, leveraging these previous approaches. Our approach differs from previous approaches by relying on parametrised aspects and on a model of aspect composition inspired by C++ template nesting to improve skeleton usability and composability.
5
Conclusion
This article presented a collection of skeletons for object oriented applications. These skeletons rely on generic aspects (i.e., template based) that can be composed by template nesting. The skeleton library delivers one of the first programming environments to address efficient execution on Multi-core, Cluster and Grid systems. This goal can only be achieved by providing highly modularised parallelisation patterns and a flexible composition scheme. The proposed collection can only be applied to object oriented applications. We believe that currently only object oriented applications can provide a sufficient degree of modularity to achieve a clean separation of parallelisation issues and to support the specification of "sequential like" code to implement the core functionality. This is due to the richer set of join points available to plug in parallel code in object oriented applications. Although our skeletons can be (un)plugged into sequential codes, the base code should be amenable to parallelisation, i.e., the amount of parallelism that can be introduced by parallel skeletons is limited by the dependencies in application tasks and data. Moreover, composition of parallel skeletons with core functionality requires a set of suitable join points; otherwise the source code must be refactored to expose the necessary join points. The current open issue is how to define design rules that overcome these limitations. One possible solution is to promote fine grained decompositions in object oriented applications. The current compiler can produce code compliant with a standard JVM and this paper showed that the performance penalty introduced can be very low. Current work includes the extension of low level patterns to support data distributions and the development of more high level patterns.
Acknowledgments This work was supported by PPC-VM project (Portable Parallel Computing Based on Virtual Machines, POSI/CHS/47158/2002) and by SeARCH (Services & Advanced Computing with HTC/HPC, CONC-REEQ/443/EEI/2005) both funded by Portuguese FCT (POSI) and European funds (FEDER).
References
1. http://www.eclipse.org/ajdt/
2. Caromel, D.: Towards a Method of Object-Oriented Concurrent Programming. Communications of the ACM 36(9) (1993)
3. Chien, A., Karamcheti, V., Plevyak, J., Zhang, X.: Concurrent Aggregates (CA) Language Report - Version 2.0, TR, Dep. Computer Science, University of Illinois, UC (November 1993)
4. Cole, M.: Algorithmic Skeletons: structured management of parallel computation. MIT Press, Cambridge (1989)
5. Cunha, C., Sobral, J., Monteiro, M.: Reusable Implementations of Concurrency Patterns and Mechanisms using Aspect-Oriented Programming. In: AOSD 2006, Bonn (March 2006)
6. Fernando, J., Sobral, J., Proenca, A.: JaSkel: A Java Skeleton-Based Framework for Structured Cluster and Grid Computing (CCGrid 2006), Singapore (May 2000)
7. Gorlatch, S.: Send-Receive Considered Harmful: Myths and Realities of Message Passing. ACM TOPLAS 26(1) (January 2004)
8. Hannemann, J., Kiczales, G.: Design Pattern implementation in Java and in AspectJ. In: OOPSLA 2002, Seattle, USA (November 2002)
9. Kale, L., Krishnan, K.: CHARM++: A Portable Concurrent Object Oriented System Based on C++. In: OOPSLA 1993, ACM SIGPLAN Notices, vol. 28, p. 10 (October 1993)
10. Kiczales, G., Hilsdale, E., Hugunin, J., Kersten, M., Palm, J., Griswold, W.: An Overview of AspectJ. In: Knudsen, J.L. (ed.) ECOOP 2001. LNCS, vol. 2072, Springer, Heidelberg (2001)
11. Kiczales, G., Lamping, J., Mendhekar, A., Maeda, C., Lopes, C., Loingtier, J., Irwin, J.: Aspect Oriented Programming. In: Aksit, M., Matsuoka, S. (eds.) ECOOP 1997. LNCS, vol. 1241, Springer, Heidelberg (1997)
12. Philippsen, M., Zenger, M.: JavaParty - transparent remote objects in Java. Concurrency: Practice and Experience 11 (November 1997)
13. Smith, A., Bull, J., Obdržálek, J.: A Parallel Java Grande Benchmark Suite. In: SC 2001 (November 2001)
14. Soares, S., Loureiro, L., Borba, P.: Implementing Distribution and Persistence Aspects With AspectJ. In: OOPSLA 2002 (November 2002)
15. Sobral, J.: Incrementally Developing Parallel Applications with AspectJ. In: IEEE IPDPS 2006, Rhodes, Greece (April 2006)
16. Sobral, J., Cunha, C., Monteiro, M.: Aspect-Oriented Pluggable Support for Parallel Computing. In: Daydé, M., Palma, J.M.L.M., Coutinho, A.L.G.A., Pacitti, E., Lopes, J.C. (eds.) VECPAR 2006. LNCS, vol. 4395, Springer, Heidelberg (2007)
17. Tilevich, E., Urbanski, S., Smaragdakis, Y., Fleury, M.: Aspectizing Server-Side Distribution. In: IEEE ASE 2003, Canada (October 2003)
18. Yonezawa, A., Tokoro, M. (eds.): Object-Oriented Concurrent Programming. MIT Press, Cambridge (1987)
Formal Semantics of DRMA-Style Programming in BSPlib
Julien Tesson and Frédéric Loulergue
LIFO – University of Orléans, France
{julien.tesson,frederic.loulergue}@univ-orleans.fr
Abstract. BSPlib is a programming library for C and Fortran which supports bulk synchronous parallelism (BSP). This paper is about a formal semantics for the DRMA programming style of the BSPlib library. The aim is to study the behavior of BSPlib programs and to propose some syntactic characterizations used to provide guarantees on semantic properties. This work is the basis for future tools dedicated to the validation of BSPlib programs. Keywords: BSP, formal Semantics, Parallel Programming.
1
Introduction
In the range of possibilities to program parallel architectures, from concurrent programming with an imperative language and a message passing library such as MPI [12] to sequential programming and parallelizing compilers, bulk synchronous parallelism or BSP [11] is an intermediate approach. It aims at maximizing the portability of performance by adding a notion of explicit processes to data parallelism. There are several libraries and languages which support bulk synchronous parallel programming: libraries to be used with imperative languages such as C and Fortran [6], with object oriented languages [5], or with functional languages [9,10].
Although execution speed matters in parallel programming, other aspects such as the ease of program development and the ease of program validation are also important. In the case of concurrent programming, the difficulty of these two tasks is confirmed by the high complexity of the related validation problems [1]. Moreover, since the semantics of a concurrent program is in general very complex, the time required to run it (related to its operational semantics) is also difficult to determine, which hinders the portability of performance. The structured parallelism of the BSP model eases both programming and validation. Performance prediction has been validated by experiments. For pure functional bulk synchronous parallel programming, the complexity of proofs is the same as for pure functional sequential programs. It is possible to use the Coq proof assistant to extract functional BSP programs from constructive proofs [4]. Other theories of the proof of BSP programs [7,13,3,8] are close in complexity to the sequential case.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1122–1129, 2008. © Springer-Verlag Berlin Heidelberg 2008
In this paper we focus on the semantics of imperative BSP programs in SPMD style. The proposed semantics models the BSPlib library subset which allows direct remote memory access (DRMA) communications. From this semantics we want to find properties on the syntax of programs which could guarantee some properties on the semantics of the programs. Our aim was not to set a priori constraints on the syntax to guarantee semantic properties, as done in [2] for data-parallelism. We aimed at modeling a widely used and practical library for BSP programming (BSPlib), to exhibit some undesirable behaviors and some ways to avoid them. In the next section we give a quick overview of BSPlib and of the model we designed, called BSP-IMP. In section 3 we present the rules of the formal semantics. Section 4 relates syntactic properties of BSP-IMP programs to semantic properties and gives an example. We end with a conclusion and future work in section 5. Omitted proofs and the complete semantics can be found in [14].
2
An Overview of BSPlib and BSP-IMP
BSPlib [6] is a library for bulk synchronous parallel (BSP) programming. In the BSP model, a computer is a set of uniform processor-memory pairs, a communication network allowing inter-processor delivery of messages and a global synchronization unit which executes collective requests for a synchronization barrier (for the sake of conciseness, we refer to [11] for more details). A BSP program is executed as a sequence of super-steps, each one divided into (at most) three successive and logically disjoint phases: (a) each processor uses its local data (only) to perform sequential computations and to request data transfers to/from other nodes; (b) the network delivers the requested data transfers; (c) a global synchronization barrier occurs, making the transferred data available for the next super-step.
BSPlib contains 20 basic operations and follows the SPMD paradigm. These operations are divided into two parts: one for direct remote memory access (DRMA) and one for bulk synchronous message passing (BSMP). BSPlib offers functions to start and to stop the parallel execution as well as functions to access the process identifier and the number of processes. The synchronization barrier is called with the bsp_sync function. In DRMA style, communications are performed by the bsp_put and bsp_get functions:
– bsp_put(dest, src, tgt, offset, nbytes) sends data to a remote memory location. dest is the identifier of the process where the data are to be stored, src and tgt are the locations where the data are to be read / stored, offset is a displacement in bytes from tgt where the data will be copied and nbytes is the amount of data to transfer.
– bsp_get(dest, rloc, offset, tgt, nbytes) requests data from a remote memory location. dest is the identifier of the process where the requested data are, rloc and tgt are the locations where the data are to be remotely read / locally stored, offset is a displacement in bytes from rloc from where the data will be copied and nbytes is the amount of data to transfer.
DRMA accesses are allowed only on registered memory locations: registration and unregistration are done using the bsp_push_reg and bsp_pop_reg functions.
In our model, called BSP-IMP, the program instructions consist of a small imperative subset and two DRMA communication instructions: put(dest, src, tgt) and get(dest, rloc, tgt), where dest and src are arithmetic expressions (as we only use integer values) and tgt and rloc are variables. Memory locations are not registered in BSP-IMP but this could be easily added to the semantics. The following grammars define respectively the set of arithmetic expressions Aexp, the set of boolean expressions Bexp and the set of programs or commands Com:
\[
\begin{array}{lrcl}
\mathit{Aexp}: & a & ::= & n \mid X \mid a + a \mid a - a \mid a \times a \mid \mathtt{This} \mid \mathtt{Nproc} \\
\mathit{Bexp}: & b & ::= & \mathtt{True} \mid \mathtt{False} \mid a = a \mid a \le a \mid \neg b \mid b_0 \wedge b_1 \mid b_0 \vee b_1 \\
\mathit{Com}: & c & ::= & c ; c \mid X := a \mid \mathtt{if}\ b\ \mathtt{then}\ c\ \mathtt{end} \mid \mathtt{while}\ b\ \mathtt{do}\ c\ \mathtt{end} \mid \mathtt{skip} \mid \mathtt{put}(a, a, X) \mid \mathtt{get}(a, X, X) \mid \mathtt{sync}
\end{array}
\]
where X is a variable (memory location) and n is an integer constant.
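To make the expression grammar concrete, the following Python sketch evaluates arithmetic and boolean expressions at processor i of a machine with p processors. It is only an illustration: the paper omits these rules and refers to [14,15], and the nested-tuple encoding and the names eval_aexp and eval_bexp are assumptions made here.

```python
def eval_aexp(a, sigma, i, p):
    """Evaluate an arithmetic expression at processor i of p processors.

    Hypothetical encoding: ("num", n), ("var", "X"), ("this",), ("nproc",),
    ("+", a0, a1), ("-", a0, a1), ("*", a0, a1).
    """
    tag = a[0]
    if tag == "num":
        return a[1]
    if tag == "var":
        return sigma[a[1]]          # look the variable up in the environment
    if tag == "this":
        return i                    # This: the processor identifier
    if tag == "nproc":
        return p                    # Nproc: the number of processors
    if tag in ("+", "-", "*"):
        left = eval_aexp(a[1], sigma, i, p)
        right = eval_aexp(a[2], sigma, i, p)
        if tag == "+":
            return left + right
        if tag == "-":
            return left - right
        return left * right
    raise ValueError(f"unknown arithmetic expression: {a!r}")


def eval_bexp(b, sigma, i, p):
    """Evaluate a boolean expression; same encoding style as eval_aexp."""
    tag = b[0]
    if tag == "true":
        return True
    if tag == "false":
        return False
    if tag == "=":
        return eval_aexp(b[1], sigma, i, p) == eval_aexp(b[2], sigma, i, p)
    if tag == "<=":
        return eval_aexp(b[1], sigma, i, p) <= eval_aexp(b[2], sigma, i, p)
    if tag == "not":
        return not eval_bexp(b[1], sigma, i, p)
    if tag == "and":
        return eval_bexp(b[1], sigma, i, p) and eval_bexp(b[2], sigma, i, p)
    if tag == "or":
        return eval_bexp(b[1], sigma, i, p) or eval_bexp(b[2], sigma, i, p)
    raise ValueError(f"unknown boolean expression: {b!r}")


# 2 <= This: a condition whose value depends on the processor.
depends_on_this = ("<=", ("num", 2), ("this",))
per_processor = [eval_bexp(depends_on_this, {}, i, 4) for i in range(4)]

# X = 2 * This, evaluated at processor 3 with X = 6.
check = eval_bexp(("=", ("var", "X"), ("*", ("num", 2), ("this",))), {"X": 6}, 3, 4)
```

The first example shows why This matters for the analysis of synchronization: the same boolean expression can take different values on different processors.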
3
Formal Operational Semantics
The operational semantics specifies, by means of a set of rules, how a program will be executed. In the BSP model the execution is a sequence of super-steps. In each super-step, the first phase of asynchronous computations is performed independently on each processor. These computations are described by a first set of rules which are called local rules because these rules describe the computation at a specific processor of the parallel machine. The communications and the synchronization barrier need the cooperation of all the processors. These phases of the super-steps are described by a second set of rules called global rules.
The first set of rules defines a relation $\longrightarrow^{i}_{p}$ between:
- A triple $\langle c, \sigma, r \rangle$ consisting of a program $c$ (an element of the set Com), an environment $\sigma$ which describes the memory state as a function from variables to values, and a communication requests queue $r$;
- A triple $\langle s, \sigma', r' \rangle$ consisting of an execution state $s$ being either Ok, Err or Wait($c'$), an environment and a communication requests queue. Ok refers to the final state of a process that ended well, Err to the state of a process ending with an error. Wait($c'$) means that the local process is waiting for a global synchronization; $c'$ is the sequence of commands that has to be executed after the synchronization.
This relation means "starting from an initial memory state $\sigma$ and a communication requests queue $r$, the program $c$ will evaluate at processor $i$ in a parallel machine with $p$ processors to the execution state $s$ with final memory state $\sigma'$ and final communication requests queue $r'$".
The second set of rules defines a relation $\longrightarrow_{p}$ between:
- A triple $\langle C, \Sigma, R \rangle$ of vectors of width $p$. $C$ is the vector of programs $[c_0, \ldots, c_{p-1}]_p$; as BSP-IMP follows the SPMD paradigm, initially we have the same program $c$ everywhere. $\Sigma$ is the vector of environments (one per processor) and $R$ is the vector of communication requests queues (one per processor). The environment (resp. queue) at processor $i$ is written $\Sigma[i]$ (resp. $R[i]$).
- A triple $\langle S, \Sigma', R' \rangle$ where $S$ is the final global execution state which can be either Ok or Err (it is not a vector). $\Sigma'$ and $R'$ are the final vectors of environments and queues.
3.1
Local Rules
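We omit here the rules for the evaluation of boolean and arithmetic expressions. They are similar to the ones in [15] and can be found in [14]. There are two special arithmetic expressions: This, which evaluates to the processor identifier, and Nproc, which evaluates to the number of processors. These two values are the ones given on the relation $\longrightarrow^{i}_{p}$. We focus here on the evaluation of commands. In the rules below the communication requests queue is written $r$ throughout.
Idle Command. The skip command does nothing. Its main purpose is to indicate that there is nothing to do after a synchronization.
\[
\langle \mathtt{skip}, \sigma, r \rangle \longrightarrow^{i}_{p} \langle \mathrm{Ok}, \sigma, r \rangle
\qquad (1)
\]
Sequence of Commands. For a sequence of commands $c_0 ; c_1$, if $c_0$ ends well then $c_1$ is evaluated in the new environment (rule 2); if $c_0$ raises an error, $c_1$ is not evaluated and the error is re-raised (rule 3); finally, if $c_0$ leads to a waiting state then $c_1$ is added to this state as remaining work (rule 4).
\[
\dfrac{\langle c_0, \sigma, r \rangle \longrightarrow^{i}_{p} \langle \mathrm{Ok}, \sigma', r' \rangle
\qquad \langle c_1, \sigma', r' \rangle \longrightarrow^{i}_{p} \langle s, \sigma'', r'' \rangle}
{\langle c_0 ; c_1, \sigma, r \rangle \longrightarrow^{i}_{p} \langle s, \sigma'', r'' \rangle}
\qquad (2)
\]
\[
\dfrac{\langle c_0, \sigma, r \rangle \longrightarrow^{i}_{p} \langle \mathrm{Err}, \sigma', r' \rangle}
{\langle c_0 ; c_1, \sigma, r \rangle \longrightarrow^{i}_{p} \langle \mathrm{Err}, \sigma', r' \rangle}
\qquad (3)
\]
\[
\dfrac{\langle c_0, \sigma, r \rangle \longrightarrow^{i}_{p} \langle \mathrm{Wait}(c_0'), \sigma', r' \rangle}
{\langle c_0 ; c_1, \sigma, r \rangle \longrightarrow^{i}_{p} \langle \mathrm{Wait}(c_0' ; c_1), \sigma', r' \rangle}
\qquad (4)
\]
Conditional Execution. In the evaluation of if b then c end, if the condition $b$ evaluates to True then $c$ is evaluated (rule 5), else there is nothing to do (rule 6).
\[
\dfrac{\langle b, \sigma, r \rangle \longrightarrow^{i}_{p} \mathrm{True}
\qquad \langle c, \sigma, r \rangle \longrightarrow^{i}_{p} \langle s, \sigma', r' \rangle}
{\langle \mathtt{if}\ b\ \mathtt{then}\ c\ \mathtt{end}, \sigma, r \rangle \longrightarrow^{i}_{p} \langle s, \sigma', r' \rangle}
\qquad (5)
\]
\[
\dfrac{\langle b, \sigma, r \rangle \longrightarrow^{i}_{p} \mathrm{False}}
{\langle \mathtt{if}\ b\ \mathtt{then}\ c\ \mathtt{end}, \sigma, r \rangle \longrightarrow^{i}_{p} \langle \mathrm{Ok}, \sigma, r \rangle}
\qquad (6)
\]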
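While Loop. In the evaluation of while b do c end, if the condition $b$ evaluates to False there is nothing to do (rule 7), else the body $c$ of the loop is evaluated. If the body evaluates to Err then the evaluation of the while loop is stopped (rule 8). If the body leads to a waiting state, the loop itself is added to this state as remaining work (rule 9). Otherwise while b do c end is evaluated again in the new environment obtained after the evaluation of the body of the loop (rule 10).
\[
\dfrac{\langle b, \sigma, r \rangle \longrightarrow^{i}_{p} \mathrm{False}}
{\langle \mathtt{while}\ b\ \mathtt{do}\ c\ \mathtt{end}, \sigma, r \rangle \longrightarrow^{i}_{p} \langle \mathrm{Ok}, \sigma, r \rangle}
\qquad (7)
\]
\[
\dfrac{\langle b, \sigma, r \rangle \longrightarrow^{i}_{p} \mathrm{True}
\qquad \langle c, \sigma, r \rangle \longrightarrow^{i}_{p} \langle \mathrm{Err}, \sigma', r' \rangle}
{\langle \mathtt{while}\ b\ \mathtt{do}\ c\ \mathtt{end}, \sigma, r \rangle \longrightarrow^{i}_{p} \langle \mathrm{Err}, \sigma', r' \rangle}
\qquad (8)
\]
\[
\dfrac{\langle b, \sigma, r \rangle \longrightarrow^{i}_{p} \mathrm{True}
\qquad \langle c, \sigma, r \rangle \longrightarrow^{i}_{p} \langle \mathrm{Wait}(c'), \sigma', r' \rangle}
{\langle \mathtt{while}\ b\ \mathtt{do}\ c\ \mathtt{end}, \sigma, r \rangle \longrightarrow^{i}_{p} \langle \mathrm{Wait}(c' ; \mathtt{while}\ b\ \mathtt{do}\ c\ \mathtt{end}), \sigma', r' \rangle}
\qquad (9)
\]
\[
\dfrac{\langle b, \sigma, r \rangle \longrightarrow^{i}_{p} \mathrm{True}
\qquad \langle c, \sigma, r \rangle \longrightarrow^{i}_{p} \langle \mathrm{Ok}, \sigma', r' \rangle
\qquad \langle \mathtt{while}\ b\ \mathtt{do}\ c\ \mathtt{end}, \sigma', r' \rangle \longrightarrow^{i}_{p} \langle s, \sigma'', r'' \rangle}
{\langle \mathtt{while}\ b\ \mathtt{do}\ c\ \mathtt{end}, \sigma, r \rangle \longrightarrow^{i}_{p} \langle s, \sigma'', r'' \rangle}
\qquad (10)
\]
Remote Memory Write. put(a1, a2, X) is a command which aims at writing the value of the expression $a_2$ in the memory location $X$ at the processor given by expression $a_1$. If the arithmetic expression $a_1$ evaluates to a value in the range $[0, p-1]$ then a communication request is added to the local queue (rule 11). The communication request $X@j \leftarrow n$ means that value $n$ should be written into memory location $X$ at processor $j$. If $a_1$ is not a valid processor identifier an error is raised (rule 12).
\[
\dfrac{\langle a_1, \sigma, r \rangle \longrightarrow^{i}_{p} j, \quad j \in [0, p-1]
\qquad \langle a_2, \sigma, r \rangle \longrightarrow^{i}_{p} n}
{\langle \mathtt{put}(a_1, a_2, X), \sigma, r \rangle \longrightarrow^{i}_{p} \langle \mathrm{Ok}, \sigma, r\,.\,X@j \leftarrow n \rangle}
\qquad (11)
\]
\[
\dfrac{\langle a_1, \sigma, r \rangle \longrightarrow^{i}_{p} j, \quad j \notin [0, p-1]}
{\langle \mathtt{put}(a_1, a_2, X), \sigma, r \rangle \longrightarrow^{i}_{p} \langle \mathrm{Err}_{\mathrm{put}}, \sigma, r \rangle}
\qquad (12)
\]
Remote Memory Read. Similar to remote memory write.
\[
\dfrac{\langle a_1, \sigma, r \rangle \longrightarrow^{i}_{p} j, \quad j \in [0, p-1]}
{\langle \mathtt{get}(a_1, Y, X), \sigma, r \rangle \longrightarrow^{i}_{p} \langle \mathrm{Ok}, \sigma, r\,.\,X@i \leftarrow Y@j \rangle}
\qquad (13)
\]
\[
\dfrac{\langle a_1, \sigma, r \rangle \longrightarrow^{i}_{p} j, \quad j \notin [0, p-1]}
{\langle \mathtt{get}(a_1, Y, X), \sigma, r \rangle \longrightarrow^{i}_{p} \langle \mathrm{Err}_{\mathrm{get}}, \sigma, r \rangle}
\qquad (14)
\]
Local Assignment. The local environment is modified by changing the value $\sigma(X)$ to the value of the arithmetic expression $a$.
\[
\dfrac{\langle a, \sigma, r \rangle \longrightarrow^{i}_{p} n}
{\langle X := a, \sigma, r \rangle \longrightarrow^{i}_{p} \langle \mathrm{Ok}, \sigma[X \mapsto n], r \rangle}
\qquad (15)
\]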
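The local rules can be read as a recursive evaluator. The Python sketch below follows rules (1) to (15), together with the sync rule (16) of the original text, under several assumptions made here for brevity: commands are nested tuples, expressions are passed as Python functions of (sigma, i, p) because their rules are omitted in the paper, and the states are the strings "Ok", "Err_put", "Err_get" or a pair ("Wait", c'). The name local_eval is hypothetical.

```python
def local_eval(c, sigma, r, i, p):
    """Evaluate command c at processor i of p; return (state, sigma', r')."""
    tag = c[0]
    if tag == "skip":                                        # rule 1
        return "Ok", sigma, r
    if tag == "seq":                                         # rules 2-4
        _, c0, c1 = c
        s, sigma1, r1 = local_eval(c0, sigma, r, i, p)
        if s == "Ok":
            return local_eval(c1, sigma1, r1, i, p)
        if isinstance(s, tuple):                             # Wait(c0')
            return ("Wait", ("seq", s[1], c1)), sigma1, r1
        return s, sigma1, r1                                 # error re-raised
    if tag == "if":                                          # rules 5-6
        _, b, body = c
        if b(sigma, i, p):
            return local_eval(body, sigma, r, i, p)
        return "Ok", sigma, r
    if tag == "while":                                       # rules 7-10
        _, b, body = c
        if not b(sigma, i, p):
            return "Ok", sigma, r
        s, sigma1, r1 = local_eval(body, sigma, r, i, p)
        if s == "Ok":
            return local_eval(c, sigma1, r1, i, p)
        if isinstance(s, tuple):                             # loop is remaining work
            return ("Wait", ("seq", s[1], c)), sigma1, r1
        return s, sigma1, r1
    if tag == "put":                                         # rules 11-12
        _, a1, a2, x = c
        j = a1(sigma, i, p)
        if not 0 <= j < p:
            return "Err_put", sigma, r
        return "Ok", sigma, r + [("put", x, j, a2(sigma, i, p))]   # X@j <- n
    if tag == "get":                                         # rules 13-14
        _, a1, y, x = c
        j = a1(sigma, i, p)
        if not 0 <= j < p:
            return "Err_get", sigma, r
        return "Ok", sigma, r + [("get", x, i, y, j)]        # X@i <- Y@j
    if tag == "assign":                                      # rule 15
        _, x, a = c
        return "Ok", {**sigma, x: a(sigma, i, p)}, r
    if tag == "sync":                                        # rule 16
        return ("Wait", ("skip",)), sigma, r
    raise ValueError(f"unknown command: {c!r}")


# put(This + 1, X, Y); sync; Z := X, evaluated at processor 1 of 4.
this_plus_one = lambda sigma, i, p: i + 1
read_x = lambda sigma, i, p: sigma["X"]
after_sync = ("assign", "Z", read_x)
program = ("seq", ("put", this_plus_one, read_x, "Y"),
           ("seq", ("sync",), after_sync))
state, sigma1, queue = local_eval(program, {"X": 10}, [], i=1, p=4)

# The same put at the last processor targets processor 4, which does not exist.
bad_state, _, _ = local_eval(("put", this_plus_one, read_x, "Y"), {"X": 10}, [], i=3, p=4)
```

The recursion mirrors rule (10) directly, so a long-running loop would exceed Python's recursion limit; an iterative formulation would be needed for anything beyond small examples.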
Synchronization Awaiting. The command sync requests a global synchronization. The synchronization barrier can only be global so this request can only be performed at the global level. Thus at the local level the sync command leads to a waiting state Wait(skip).
\[
\langle \mathtt{sync}, \sigma, r \rangle \longrightarrow^{i}_{p} \langle \mathrm{Wait}(\mathtt{skip}), \sigma, r \rangle
\qquad (16)
\]
3.2
Global Rules
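The global rules are used to perform the communication requests and the global synchronization barrier, or to end the computation globally. In the following rules, $P$ denotes the range of processor identifiers. There are four different cases.
Rule (17): All processes are in a waiting state. In this case data are exchanged, which is modeled by the $\mathcal{C}$ operation between the vector of memory states and the vector of communication requests queues. $\mathcal{C}$ could be either: (a) a relation to model the behavior of the BSPlib: in this case the semantics is non-deterministic because two processors could write different values in the same memory location of a third processor and the behavior is not specified; (b) a function to make the semantics deterministic. This could be done for example by giving a priority to each processor for remote memory writes, or by giving a binary commutative operator to combine the different values written to the same memory location by remote processors. It is also possible to add a rule to raise an error when two processors try to write different values to the same memory location. These various options are described in more detail in [14].
Rule (18): If at least one process ends ($\downarrow$), either in the Ok state or erroneously, while at least one other is requesting a global synchronization, then a global error $\mathrm{Err}_{\mathrm{sync}}$ is raised.
Rule (19): If all processes end well, the final global execution state is Ok.
Rule (20): If at least one local process ends with an error $\mathrm{Err}_{L} \in \{\mathrm{Err}_{\mathrm{get}}, \mathrm{Err}_{\mathrm{put}}\}$ and no other requests a global synchronization then the $\mathrm{Err}_{G}$ error is raised at the global level.
\[
\dfrac{\forall i \in P,\ \langle c_i, \Sigma[i], R[i] \rangle \longrightarrow^{i}_{p} \langle \mathrm{Wait}(c_i'), \Sigma'[i], R'[i] \rangle
\qquad \mathcal{C}(\Sigma', R', \Sigma'')
\qquad \langle [c_0', \ldots, c_{p-1}']_p, \Sigma'', \emptyset \rangle \longrightarrow_{p} \langle \downarrow, \Sigma''', R''' \rangle}
{\langle [c_0, \ldots, c_{p-1}]_p, \Sigma, R \rangle \longrightarrow_{p} \langle \downarrow, \Sigma''', R''' \rangle}
\qquad (17)
\]
\[
\dfrac{\exists i \in P,\ \langle c_i, \Sigma[i], R[i] \rangle \longrightarrow^{i}_{p} \langle \mathrm{Wait}(c_i'), \Sigma'[i], R'[i] \rangle
\qquad \exists j \in P,\ \langle c_j, \Sigma[j], R[j] \rangle \longrightarrow^{j}_{p} \langle \downarrow, \Sigma'[j], R'[j] \rangle}
{\langle [c_0, \ldots, c_{p-1}]_p, \Sigma, R \rangle \longrightarrow_{p} \langle \mathrm{Err}_{\mathrm{sync}}, \Sigma', \emptyset \rangle}
\qquad (18)
\]
\[
\dfrac{\forall i \in P,\ \langle c_i, \Sigma[i], R[i] \rangle \longrightarrow^{i}_{p} \langle \mathrm{Ok}, \Sigma'[i], R'[i] \rangle}
{\langle [c_0, \ldots, c_{p-1}]_p, \Sigma, R \rangle \longrightarrow_{p} \langle \mathrm{Ok}, \Sigma', \emptyset \rangle}
\qquad (19)
\]
\[
\dfrac{\exists i \in P,\ \langle c_i, \Sigma[i], R[i] \rangle \longrightarrow^{i}_{p} \langle \mathrm{Err}_{L}, \Sigma'[i], R'[i] \rangle
\qquad \forall j \in P,\ \langle c_j, \Sigma[j], R[j] \rangle \longrightarrow^{j}_{p} \langle \downarrow, \Sigma'[j], R'[j] \rangle}
{\langle [c_0, \ldots, c_{p-1}]_p, \Sigma, R \rangle \longrightarrow_{p} \langle \mathrm{Err}_{G}, \Sigma', \emptyset \rangle}
\qquad (20)
\]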
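The case analysis of the global rules, and one deterministic choice for the exchange operation, can be sketched in Python as follows. This is an illustration under assumptions that are not fixed by the text: local states are "Ok", an error string, or a pair ("Wait", c'); requests are the tuples ("put", X, j, n) and ("get", X, i, Y, j); conflicting writes raise an error (one of the options mentioned for rule 17); and every get reads the value held before the exchange. The names classify and communicate are hypothetical.

```python
def classify(states):
    """Decide which global rule applies to the vector of local states."""
    waiting = [isinstance(s, tuple) and s[0] == "Wait" for s in states]
    if all(waiting):
        return "exchange"      # rule 17: communicate, then continue
    if any(waiting):
        return "Err_sync"      # rule 18: some wait while others have ended
    if all(s == "Ok" for s in states):
        return "Ok"            # rule 19
    return "Err_G"             # rule 20: a local error and nobody waiting


def communicate(sigmas, queues):
    """Apply all requests; fail if two requests write different values."""
    writes = {}                                   # (processor, variable) -> value
    for queue in queues:
        for req in queue:
            if req[0] == "put":                   # X@j <- n
                _, x, j, n = req
            else:                                 # X@i <- Y@src
                _, x, i, y, src = req
                j, n = i, sigmas[src][y]          # value before the exchange
            if writes.setdefault((j, x), n) != n:
                raise ValueError(f"conflicting writes to {x}@{j}")
    new_sigmas = [dict(s) for s in sigmas]        # leave the inputs untouched
    for (j, x), n in writes.items():
        new_sigmas[j][x] = n
    return new_sigmas


sigmas = [{"X": 1, "Y": 0}, {"X": 2, "Y": 0}]
# Processor 0 puts 1 into Y@1; processor 1 gets X from processor 0 into Y.
queues = [[("put", "Y", 1, 1)], [("get", "Y", 1, "X", 0)]]
after = communicate(sigmas, queues)

outcome_all_wait = classify([("Wait", ("skip",)), ("Wait", ("skip",))])
outcome_mixed = classify([("Wait", ("skip",)), "Ok"])
```

Both requests target Y at processor 1 with the same value, so this choice of exchange accepts them; changing the put to write 5 would raise the conflict error instead.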
4
Synchronization Error Free Programs
An interesting property to check for a BSP-IMP program is the absence of synchronization errors. A program is free of such errors if each process reaches the same number of sync commands during the program evaluation. Due to the possible presence of sync in or after a loop, the problem is undecidable in general. Nevertheless we can decide it for a subset of BSP-IMP programs. We characterize those which have the replicate synchronization property. A program c ∈ Com is said to have the replicate synchronization property if for all "if b then c end" and "while b do c end" in which c contains sync, b evaluates to the same value at each processor in [0, Nproc − 1]. Of course, to evaluate each sync needed at the global level, each process has to be free of local errors that could break the normal program evaluation flow.
Theorem 1. A program Pr without local error, which terminates and for which the replicate synchronization property holds, is synchronization error free.
A variable which has the same value at all processors is called a replicated variable. It can be seen as a shared variable. A boolean expression will evaluate to the same value at all processors if all the variable occurrences are replicated. A subset Rep(Pr) of the replicated variables in a program Pr can be built from the variables that are not modified by a communication and that are only assigned expressions containing constants and replicated variables. Those assignments cannot be made inside while or if statements for which the condition does not evaluate identically over all the processors. Furthermore a value has to be assigned to the variable at least once in the program before it is used. Indeed the initial local environments are not in general identical over all processors, so uninitialized variables are not replicated.
We have here mutually dependent definitions of replicated variables and replicated boolean expressions, but Rep(Pr) can be built as the greatest fixed point of variables having the previous property. The following scan algorithm computes the parallel prefix sums:
Algorithm 1 (Scan)
i := 1;
while (2^(i−1) ≤ Nproc) do
  if (This ≥ 2^(i−1)) then get(This − 2^(i−1), X, Xin) end;
  sync;
  if (This ≥ 2^(i−1)) then X := Xin + X end;
  i := i + 1
end
It is an example of a program that can be shown to be synchronization error free using the previous characterization. We can easily prove that there is no error at the local level. Furthermore the only conditional component of the program which contains a sync is the main while loop, and its condition (2^(i−1) ≤ Nproc) contains only replicated variables. Nproc clearly has the same value over all processors and i satisfies the conditions previously described.
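The behavior of the scan algorithm can be checked with a small sequential simulation. The Python sketch below is an illustration only: the function name bsp_scan is hypothetical, the processors are simulated by list indices, and every get is assumed to read the value of X held before the barrier.

```python
def bsp_scan(values):
    """Simulate the scan program; values[k] is the initial X at processor k.

    Returns the final X of every processor and the number of super-steps
    that ended with a barrier.
    """
    p = len(values)                       # Nproc
    x = list(values)
    i = 1
    supersteps = 0
    while 2 ** (i - 1) <= p:              # replicated condition: same everywhere
        d = 2 ** (i - 1)
        # Local phase: processors with This >= d request X from This - d.
        requests = {k: k - d for k in range(p) if k >= d}
        # Barrier: each requested value is delivered into Xin.
        x_in = {k: x[src] for k, src in requests.items()}
        supersteps += 1
        # After the barrier: X := Xin + X on the processors that did a get.
        for k, received in x_in.items():
            x[k] = received + x[k]
        i += 1
    return x, supersteps


prefix_sums, steps = bsp_scan([1, 2, 3, 4, 5])
```

Every simulated processor executes the same number of loop iterations because the loop condition only involves i and Nproc, which is exactly the argument used above to show that the program is synchronization error free.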
5
Conclusion and Future Work
We proposed an operational semantics for a small bulk synchronous parallel imperative language. The BSP-IMP syntax and semantics closely model the behavior of the BSPlib programming library. With some additional conditions BSP-IMP programs are deterministic. Note that the BSPlib could easily be modified to follow the variant of the BSP-IMP semantics which raises an error when non-deterministic remote memory writes occur. We used this semantics to show how a subclass of BSP programs can be shown to be free of synchronization errors. The presented work is limited to the DRMA part of the BSPlib library. Future work includes the extension to the bulk synchronous message passing part (BSMP) of BSPlib. Other classes of programs will be studied. We also plan to develop tools for the analysis of BSPlib programs.
References
1. Apt, K.R., Olderog, E.-R.: Verification of sequential and concurrent programs, 2nd edn. Springer, Heidelberg (1997)
2. Bougé, L.: Le modèle de programmation à parallélisme de données: une perspective sémantique. RAIRO Technique et Science Informatiques 12(5) (1993)
3. Chen, Y., Sanders, W.: Top-Down Design of Bulk-Synchronous Parallel Programs. Parallel Processing Letters 13(3), 389–400 (2003)
4. Gava, F.: Formal Proofs of Functional BSP Programs. Parallel Processing Letters 13(3), 365–376 (2003)
5. Gu, Y., Lee, B.-S., Cai, W.: JBSP: A BSP programming library in Java. Journal of Parallel and Distributed Computing 61(8), 1126–1142 (2001)
6. Hill, J.M.D., McColl, W.F., et al.: BSPlib: The BSP Programming Library. Parallel Computing 24, 1947–1980 (1998)
7. Jifeng, H., Miller, Q., Chen, L.: Algebraic laws for BSP programming. In: Fraigniaud, P., Mignotte, A., Bougé, L., Robert, Y. (eds.) Euro-Par 1996. LNCS, vol. 1123, pp. 359–368. Springer, Heidelberg (1996)
8. Lecomber, D.S.: Methods of BSP Programming. PhD thesis, Oxford University Computing Laboratory (July 1998)
9. Loulergue, F., Gava, F., Billiet, D.: Bulk Synchronous Parallel ML: Modular Implementation and Performance Prediction. In: Sunderam, V.S., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2005. LNCS, vol. 3515, pp. 1046–1054. Springer, Heidelberg (2005)
10. Miller, Q.: BSP in a Lazy Functional Context. In: Trends in Functional Programming, vol. 3, Intellect Books (May 2002)
11. Skillicorn, D.B., Hill, J.M.D., McColl, W.F.: Questions and Answers about BSP. Scientific Programming 6(3), 249–274 (1997)
12. Snir, M., Gropp, W.: MPI the Complete Reference. MIT Press, Cambridge (1998)
13. Stewart, A., Clint, M., Gabarró, J.: Axiomatic Frameworks for Developing BSP-Style Programs. Parallel Algorithms and Applications 14, 271–292 (2000)
14. Tesson, J., Loulergue, F.: Formal Semantics for the DRMA programming style subset of the BSPlib library. Technical report, LIFO, University of Orléans (to appear, November 2007)
15. Winskel, G.: The Formal Semantics of Programming Languages. Foundations of Computing Series. MIT Press, Cambridge (1993)
A Container-Iterator Parallel Programming Model
Gerhard Zumbusch
Friedrich-Schiller-Universität Jena, Institut für Angewandte Mathematik, Ernst-Abbe-Platz 2, 07743 Jena, Germany
[email protected]
http://cse.mathe.uni-jena.de
Abstract. There are several parallel programming models available for numerical computations at different levels of expressibility and ease of use. For the development of new domain specific programming models, a splitting into a distributed data container and parallel data iterators is proposed. Data distribution is implemented in application specific libraries. Data iterators are directly analysed and compiled automatically into parallel code. Target architectures of the source-to-source translation include shared memory (pthreads, Cell SPE), distributed memory (MPI) and hybrid programming styles. A model application for grid based hierarchical numerical methods and an auto-parallelizing compiler are introduced.
Keywords: parallel programming models, automatic parallelization, domain specific code generation, parallel numerical methods, multigrid, MPI, Posix threads, Cell processor.
1 Introduction
Development of parallel code for numerical computations is still an issue, for several reasons. On the one hand, current computers are parallel computers with an increasing degree of parallelism. On the other hand, parallelism is still very visible in the code: current parallel programming models tend to be limited in their applicability, limited in their efficiency, or lead to a low level of parallel programming.

Standard parallel programming models include thread libraries like POSIX threads [1] and, on top of these, Intel's TBB [2], lightweight threads as in Cilk [3] and Concur (Microsoft) [4], and OpenMP [5] for shared memory computers, as well as two- and single-sided message passing for distributed memory computers as in MPI [6]. A higher level approach is represented by the loop- and array-parallel HPF [7], Co-Array Fortran [8] and related constructs in Fortran versions, to be generalized in projects like Chapel (Cray) [9] and Fortress (Sun) [10]. Distributed shared memory techniques are refined in OpenMP on clusters (Microsoft), in the current evolution of partitioned global address space languages like Unified Parallel C (UPC) [11], and in the X10 (IBM) project [12].

In addition to efficient general purpose parallel programming models, easier to use domain specific models are important. The simplicity of such a restricted computational model may have several advantages. For example, a single source code may compile to efficient sequential and parallel code for different types of parallel architectures, which benefits code development and the lifetime of the code. Higher level program constructs may be automatically analysed and compiled into parallel code, which is not possible for general programming models. Compiler analysis may further assist and verify parallelization of the code, such that common mistakes can be eliminated.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1130–1139, 2008. © Springer-Verlag Berlin Heidelberg 2008

The container-iterator programming model we propose here is intended to fill the gap between expressibility and simplicity, with numerical applications in mind. Several parallel Fortran extensions focus on parallel vector and array expressions. This is well suited for dense linear algebra and some finite difference methods, but may be less adequate for sparse matrices, non-uniform finite element grids, randomly distributed particles and other less structured numerical data. We attack this problem by splitting the code into an iterator, which handles the non-uniformity of data in a container, and a computational kernel applied to each element of the container. This is equivalent to parallel arrays for array-structured data with a slightly different syntax, but more flexible in other cases.

Shared memory programming models like OpenMP are not very well suited for distributed memory environments. The abstraction of a parallel loop with fixed bounds in early OpenMP revisions is a further severe restriction for applications. However, OpenMP is far simpler to use than the underlying thread programming model. We try to provide a programming model that can be compiled to both shared memory and distributed memory models. In the case of a loop of independent operations, container-iterator is equivalent to OpenMP parallel loops with a slightly different syntax.
Virtual shared memory programming models like UPC, which combine the ease of use of shared memory (of certain coherence) with distributed memory architectures, rely heavily on the underlying distributed shared memory engine and its coherence model. In the container-iterator model we try to create explicit message-passing code by a data dependence analysis in the compiler. This is possible due to the restriction to numerical applications and detailed knowledge of the container. However, it is independent of the computational kernel, as long as it really is a parallel algorithm.

Finally, parallel skeletons are based on an approach to parallel computing comparable to the container-iterator model. A collection of skeletons defines different patterns of algorithms like a linear loop or a divide and conquer tree, see e.g. [13]. The user code is derived from a specific skeleton. The parallelization is done by a compiler, which is able to transform each skeleton into parallel code. Some differences to the container-iterator model are: We start from a global persistent data decomposition transparent to the code, such that several parts of the algorithm can be applied to the container, one after the other. The user codes derived from an iterator may have different dependencies leading to different communication patterns, each of which can be detected by a compiler dependence analysis. Some communication patterns are currently not available in parallel skeletons. However, given appropriate compilation techniques and support libraries, the container-iterator approach could be interpreted as a flavour of parallel skeletons.

The compilation technique we propose here uses static analysis. It is incorporated in an experimental C++ compiler based on the current 4.2.1 GNU gcc release. The remaining parts of the system are pre- and post-processing facilities using the m4 macro processor and Perl scripts for code generation, and C++ run-time libraries. A previous version of the system [14] was based on a speculative run-time analysis performed with a standard C++ compiler and different scripts. Related projects include the Rose compiler [15] and expression template libraries like Pooma [16].

  ForEach(iteration_variables, iterator, code); : syntax.
  iteration_variables: C++ declaration of access variables to container elements, e.g. ‘int i, int j’ or ‘tree *b’.
  iterator: instance of an iterator, includes container specification (like bounds or root pointer) and type of iteration.
  code: regular C++ function body code. The iteration_variables are defined in the code and point to the current atom. Certain restrictions apply to the access of other atoms (dependency) and variables of outer scopes (read only or reduction).

Fig. 1. A formal parallel iterator
2 The Container-Iterator Programming Model
In the domain of numerical software, good strategies to parallelize a given sequential implementation are often known. Larger sets of data can often be decomposed and mapped to coupled parallel tasks efficiently, e.g. in many codes for the solution of partial differential equations, particle methods and other “local” methods which respect physical space. Some examples and many references can be found in [17]. Further, the sequential algorithm does not need to be changed, but often can be re-arranged according to the data decomposition.

We assume that data is organized in a large container of similar smaller units, called atoms. The sequential algorithms operate on all or parts of the atoms in a similar way and in a given order. Parallelization means to schedule operations on different atoms of an algorithmic step. This is a data parallel approach.

A formal description of the iterator can be found in figure 1. The syntax is similar to the C++ for-loop for(iteration variable declaration and initialization; upper bound; increment){code}, which is also used in the C++ STL [18] for iterators on arbitrary containers. We prefer to place the body code always right next to the loop iterator instead of in a separate class definition (as in TBB [2]), which can also be done using the C++ Boost library lambda expressions [19] as in Concur [4].

Note that the container-iterator model includes, but is not limited to, a contiguous data vector or an ascending access of totally ordered atoms. For example, the model also includes different ways to access leaves of a tree. However, the access order and the parallelism of the operation follow from the data dependency of the atoms. This can be detected by the compiler and need not be specified in the code. Note that we need more algorithmic patterns than just divide and conquer, see e.g. [20].

We restrict ourselves to numerical algorithms and a data container-iterator programming model. Further, we assume that a good domain specific data decomposition scheme is available. Now, numerical algorithms can be written as a sequence of (parallel) iterators and sequential operations. An iterator consists of a specification of the iteration space and a code fragment for the algorithmic atom. The atom will be executed for each element of the iteration space. The atom can be analysed using standard compilation techniques such as data dependency and flow analysis. The goal is to facilitate automatic parallelization.

Compilation targets are shared memory and different distributed (virtual shared) memory systems with their respective low level programming models. Hybrid architectures such as clusters of shared memory nodes and hierarchies of tightly and weakly coupled systems can be targeted with a mixture of message passing and multiple threads. Further, specialised architectures like the Cell processor [21] can be targeted, which will be described in more detail. Note that this programming model also covers cases of loop and array parallelism addressed in OpenMP and some Fortran versions. It does not contain explicit parallel commands, and it is not as expressive as UPC or low level parallel programming models.
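The split into an iteration space and a per-atom kernel described in this section can be illustrated in plain C++. The following is only a minimal sequential sketch under our own assumptions: the names IntervalIterator, LeafIterator, Node, visit_leaves, sum_of_squares and sum_of_leaves are hypothetical and are not part of the paper's library, and a lambda stands in for the code argument of ForEach.

```cpp
#include <cassert>
#include <vector>

// Hypothetical iterator over the index interval [first, last] of a 1-D grid.
struct IntervalIterator {
    int first, last;
    // Apply the kernel (the "atom" code) to every index of the iteration space.
    template <class Kernel>
    void for_each(Kernel kernel) const {
        for (int i = first; i <= last; ++i) kernel(i);
    }
};

// Hypothetical binary tree node used as an atom of a tree container.
struct Node {
    double value = 0;
    Node* child[2] = {nullptr, nullptr};
};

// Depth-first visit that calls the kernel on leaves only.
template <class Kernel>
void visit_leaves(Node* n, Kernel& kernel) {
    if (!n) return;
    if (!n->child[0] && !n->child[1]) { kernel(n); return; }
    visit_leaves(n->child[0], kernel);
    visit_leaves(n->child[1], kernel);
}

// Hypothetical iterator whose iteration space is the set of leaves of a tree.
struct LeafIterator {
    Node* root;
    template <class Kernel>
    void for_each(Kernel kernel) const { visit_leaves(root, kernel); }
};

// The same kernel style works for both containers: a reduction into an
// outer-scope variable.
inline double sum_of_squares(const std::vector<double>& y) {
    assert(!y.empty());
    double e = 0;
    IntervalIterator it{0, static_cast<int>(y.size()) - 1};
    it.for_each([&](int i) { e += y[i] * y[i]; });
    return e;
}

inline double sum_of_leaves(Node* root) {
    double s = 0;
    LeafIterator it{root};
    it.for_each([&](Node* b) { s += b->value; });
    return s;
}
```

The kernel never mentions how the container is traversed, which is the property a parallelizing translation can exploit; this sketch itself performs no dependence analysis and runs sequentially.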
3 Sample Containers and Iterators
As an illustrative example of the container-iterator programming style, we show what it looks like for two domain specific extensions of C++, see figures 2 and 3. We have constructed a source-to-source compilation system which translates the domain specific code into standard C++ code with a domain specific library and a standard parallel programming model.

Figure 2 gives a brief sketch of the different code stages of the source-to-source translation system. On top is a code sample of the container-iterator programming model. A one-dimensional grid, an iterator over an interval and a vector data structure are constructed. The ForEach loop features nearest neighbour access y(i-1), y(i+1) and a global reduction e. On the left, the code is transformed into a sequential for loop. On the right, a data dependence analysis, based on the extended GNU g++ compiler, lists load, store and reduction operations. This analysis is transformed into pthread code (left), MPI code (middle) and Cell processor code (right). Note that the loop is split into two parts for the pthread code and even into different files (for incompatible PPU and SPU binaries) for the Cell processor. Further, the necessary send and receive (MPI [6]) or get and put (Cell, see [26]) commands are generated using the data dependence analysis.

An application specific library provides data containers like uniform grids implemented as multi-dimensional arrays or a hierarchical decomposition of a set of particles implemented as quad-trees. In each case a geometric domain decomposition works well for parallelization. The applications in mind based on the containers include the solution of partial differential equations (finite elements, finite differences, multigrid solver, grid refinement), fast summation techniques (fast multipole expansion) and integral equations.

  application library:
    Grid1 *g = new Grid1(0, n+1);
    Grid1IteratorSub it(1, n, g);
    DistArray<double> x(g), y(g);
    double e = 0;

  domain language:
    ForEach(int i, it,
      x(i) += ( y(i+1) + y(i-1) )*.5;
      e += sqr( y(i) );
    )

  sequential code:
    for (int i=1; i

  analysis:
    load y(i-1), y(i), y(i+1)
    store x(i)
    reduce add e

  pthread parallel code:
    void *sub1(void *arg) { ...
      double e_local = 0;
      for (int i=n_local0; ie = e_local;
    }
    for (int p=0; pe; }

  MPI message passing parallel code:
    if (p_left) MPI_Send(&y(n_local0), 1, MPI_DOUBLE, p_left,...);
    if (p_right) {
      MPI_Recv(&y(n_local1+1), 1, MPI_DOUBLE, p_right,...);
      MPI_Send(&y(n_local1), 1, MPI_DOUBLE, p_right,...);
    }
    if (p_left) MPI_Recv(&y(n_local0-1), 1, MPI_DOUBLE, p_left,...);
    double e_local = 0;
    for (int i=n_local0; i

  Cell SPElib: SPU+PPU parallel code:
    int main(unsigned long long id, addr64 argp, addr64 envp) {
      mfc_get(...);
      mfc_read_tag_status_all();
      ...
      double e_local = 0;
      for (int i=n_local0; ie; }

Fig. 2. Source-to-source transformation of the domain language to several target programming models (Posix threads, MPI message passing, Cell processor SPE library). Vector/one dimensional grid example.

Array operations may look like figure 2. Particle methods, unstructured grids with refinement and fast summation techniques may require a programming style like in figure 3. Note that the operations on the tree nodes are no longer independent in some cases. However, a bottom-up or top-down iteration still offers enough parallelism for an efficient parallelisation. The basic access patterns for a fast multipole summation [22] are given for a binary tree, which easily generalises to quad- or oct-trees.

  class tree : public KAryTree   // generic binary tree provides tree* child(int);
  public:
    complex<double> m, l, f, x;

  tree *root = new tree;
  TreeIterator iter(root);

  ForEach(tree *b, iter, b->f = b->l; )

  ForEach(tree *b, iter, ‘
    for (int i=0; i<2; i++)
      if (b->child(i)) b->child(i)->l += b->l; ’)

  ForEach(tree *b, iter, ‘
    for (int i=0; i<2; i++)
      if (b->child(i)) b->m += b->child(i)->m; ’)

  Require( list inter, fetch );
  ForEach(tree *b, iter, ‘
    for (list::const_iterator i = b->inter.begin(); i != b->inter.end(); i++)
      b->l += log(abs(b->x - (*i)->x)) * (*i)->m; ’)

Fig. 3. Binary tree examples: Declaration, embarrassingly parallel, top-down (no communication with replicated root), bottom-up (communication upwards), top-down with neighbourhood communication defined by a relation ‘fetch’
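The loop splitting with a private reduction variable that the pthread stage of figure 2 refers to can be sketched in self-contained C++. This is an illustration under our own assumptions, not the generated code: plain std::vector replaces DistArray, the function names are hypothetical, and the chunks are executed one after another in a single thread, so only the decomposition and the combination of partial sums are shown.

```cpp
#include <cassert>
#include <vector>

// Sequential reference: x(i) += (y(i+1) + y(i-1)) * 0.5 and e += y(i)^2
// for the interior points i = 1..n of a grid with n+2 points.
inline double stencil_sequential(std::vector<double>& x,
                                 const std::vector<double>& y) {
    assert(x.size() == y.size() && y.size() >= 3);
    const int n = static_cast<int>(y.size()) - 2;
    double e = 0;
    for (int i = 1; i <= n; ++i) {
        x[i] += (y[i + 1] + y[i - 1]) * 0.5;
        e += y[i] * y[i];
    }
    return e;
}

// Work of one hypothetical worker: the sub-interval [lo, hi] with a private
// reduction variable e_local, so that workers never write the same location.
inline double stencil_chunk(std::vector<double>& x,
                            const std::vector<double>& y, int lo, int hi) {
    double e_local = 0;
    for (int i = lo; i <= hi; ++i) {
        x[i] += (y[i + 1] + y[i - 1]) * 0.5;
        e_local += y[i] * y[i];
    }
    return e_local;
}

// Split 1..n into contiguous chunks and combine the partial sums afterwards.
// In a threaded version each stencil_chunk call would run in its own thread.
inline double stencil_partitioned(std::vector<double>& x,
                                  const std::vector<double>& y, int workers) {
    assert(workers >= 1 && x.size() == y.size() && y.size() >= 3);
    const int n = static_cast<int>(y.size()) - 2;
    double e = 0;
    for (int p = 0; p < workers; ++p) {
        const int lo = 1 + n * p / workers;
        const int hi = n * (p + 1) / workers;
        e += stencil_chunk(x, y, lo, hi);  // chunks write disjoint x(i)
    }
    return e;
}
```

Because each chunk only reads y and writes its own part of x, the chunks are independent; the reduction is the only value that has to be combined after the loop. With general floating point data the partitioned sum may differ from the sequential one by rounding, since the order of additions changes.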
The analysis of the first ForEach in figure 3 gives a local read ‘l’ and a local write ‘f’, leading to an embarrassingly parallel loop over all atoms. The second ForEach gives a local read ‘l’ and a child read and write ‘l’. This only makes sense for a top-down tree traversal. For distributed memory, the variable ‘l’ must be provided in the ghost atoms. The third ForEach gives the reverse: a local read and write ‘m’ and a child read ‘m’, which only makes sense for a bottom-up traversal. Variable ‘m’ has to be sent in message passing. Finally, the last ForEach introduces indirect addressing. The dependence analysis includes the local variable ‘l’ and the remote variables ‘m’ and ‘x’. The user-provided relation ‘fetch’ gives a hierarchical hull of all candidates of the remote atoms. In the fast multipole and some finite element algorithms the relation is based on the geometric distance of atoms.
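The three traversal orders discussed above can be made concrete with a small sequential sketch. It is written under our own assumptions: Tree is a hypothetical stand-in for the tree class of figure 3, double replaces complex<double>, and plain recursion fixes the order that the translation system would derive from the dependence analysis.

```cpp
#include <cassert>

// Hypothetical binary tree atom with the field names used in figure 3.
struct Tree {
    double m = 0, l = 0, f = 0;
    Tree* child[2] = {nullptr, nullptr};
};

// Bottom-up: a node reads the m of its children, so children come first.
inline void upward(Tree* b) {
    if (!b) return;
    for (int i = 0; i < 2; ++i) {
        assert(b->child[i] != b);  // a node must not be its own child
        upward(b->child[i]);
        if (b->child[i]) b->m += b->child[i]->m;
    }
}

// Top-down: a node writes the l of its children, so the parent comes first.
inline void downward(Tree* b) {
    if (!b) return;
    for (int i = 0; i < 2; ++i) {
        if (b->child[i]) b->child[i]->l += b->l;
        downward(b->child[i]);
    }
}

// Embarrassingly parallel: only local data is touched, any order is valid.
inline void copy_local(Tree* b) {
    if (!b) return;
    b->f = b->l;
    copy_local(b->child[0]);
    copy_local(b->child[1]);
}
```

Reversing the order in upward or downward would use values that have not been accumulated yet, which is why each access pattern admits only one traversal direction.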
4 A Sample Compilation System
The sample compilation system translates the extended C++ code using the container-iterator parallel programming model into plain C++ code using standard parallel programming models according to figure 2. The source-to-source translation seems to enable more flexibility than a direct single-pass compiler and can be considered as code generation [23]. Currently it is implemented using a general code dependence analysis tool and container/iterator specific scripts. The tool uses the front end and some of the optimization stages of the back end of the GNU gcc compiler (C++ front end, current release version 4.2.1). It dumps data type and dependency information extracted from the internal tree-SSA (static single assignment) language independent code representation. This includes names and types of outer-scope variables which are read, written, or updated in the code. Special mechanisms exist for updates leading to reduction operations. The post-processing for different targets uses a sequence of Perl scripts. The code generation itself is performed by the m4 macro processor.

The idea of the parallelizing compiler is to do a certain global data dependence analysis. The user code is analysed for each atom separately, see code in figure 1. The ForEach loop is expanded into an iterator, which executes the atom for each node of the data container. For the parallel implementation, we control both global data distribution and the parallel iterator. Standard strategies include the use of distributed atoms with an “owner-computes” policy. Communication is implemented through a small amount of replicated atoms like ghost zones or a common replicated root. Whether replicated data is computed locally, updated by message passing, put into shared memory or discarded in an algorithmic step depends on the parallel iterator and the communication pattern. The data dependence analysis of each atom is passed to templates of parallel iterators.
For each parallel target, be it message passing, threading or a mixed model, and for each possible communication pattern there exists a different parallel implementation of the iterator. One part of the templates can be found in the domain specific library. The remaining parts, including the selection and invocation of the correct template and the variables to be transferred or updated, are located in the result of the source-to-source translation. The whole process can be considered as a compilation with a single coarse grain dependence analysis. However, clever constructions of expression templates or parallel skeletons may lead to comparable results for certain containers, see Pooma [16].

Note that the source-to-source transformation process creates the same number of communication operations as hand-coded examples. Hence message passing parallel code on distributed memory computers is expected to perform comparably. The performance of multi-threaded code on shared memory computers heavily depends on the data and memory layout. This is given by the application library, not the code transformation system. Hence, even in this case a competitive performance is expected.
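The “owner-computes” policy with ghost zones mentioned in this section can be sketched without any message passing library. The code below is an assumption-laden illustration, not the generated MPI code: the subdomains live in one address space, Subdomain, scatter, exchange_ghosts, compute and gather are hypothetical names, and a plain copy stands in for each send/receive pair.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One simulated process: owned values of y plus one ghost cell on each side.
struct Subdomain {
    std::vector<double> x, y;  // index 0 and index size-1 are ghost cells
};

// Distribute the interior points 1..n of the global vector y among the owners.
// Ghost cells start with stale values, except at the physical boundary.
inline std::vector<Subdomain> scatter(const std::vector<double>& y, int parts) {
    const int n = static_cast<int>(y.size()) - 2;
    assert(parts >= 1 && n >= parts);
    std::vector<Subdomain> d(parts);
    for (int p = 0; p < parts; ++p) {
        const int lo = 1 + n * p / parts, hi = n * (p + 1) / parts;
        d[p].y.assign(hi - lo + 3, 0.0);
        for (int i = lo; i <= hi; ++i) d[p].y[i - lo + 1] = y[i];
        d[p].x.assign(d[p].y.size(), 0.0);
    }
    d.front().y.front() = y.front();
    d.back().y.back() = y.back();
    return d;
}

// Stand-in for the send/receive pairs between neighbouring owners.
inline void exchange_ghosts(std::vector<Subdomain>& d) {
    for (std::size_t p = 0; p + 1 < d.size(); ++p) {
        std::vector<double>& left = d[p].y;
        std::vector<double>& right = d[p + 1].y;
        left.back() = right[1];                 // first owned value of the right neighbour
        right.front() = left[left.size() - 2];  // last owned value of the left neighbour
    }
}

// Owner computes: every subdomain updates only the atoms it owns.
inline void compute(std::vector<Subdomain>& d) {
    for (Subdomain& s : d)
        for (std::size_t i = 1; i + 1 < s.y.size(); ++i)
            s.x[i] += (s.y[i + 1] + s.y[i - 1]) * 0.5;
}

// Collect the owned x values in global order (interior points only).
inline std::vector<double> gather(const std::vector<Subdomain>& d) {
    std::vector<double> x;
    for (const Subdomain& s : d)
        x.insert(x.end(), s.x.begin() + 1, s.x.end() - 1);
    return x;
}
```

A typical call sequence is scatter, exchange_ghosts, compute, gather. Only one value per neighbour crosses a subdomain boundary, which corresponds to the nearest neighbour dependence y(i-1), y(i+1) of the loop body.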
5 Experiments
We consider a hierarchical, three-dimensional numerical grid/array code as a test example, with pthread and the Cell processor as target architectures. Message passing MPI results and results for numerical tree codes of the older (speculative analysis) container-iterator translation system can be found in [25] and [14], respectively. The NAS multigrid benchmark code Fapin [24] implements a geometric multigrid V(0,1)-cycle with one post-smoothing step for a Poisson equation on a set of nested three-dimensional Cartesian grids with constant coefficients. The Fortran77 code was ported to C++ using the distributed array classes and single precision floating point numbers, and was run on other data sets (7 levels, fine grid 129^3) than originally conceived, in order to fit into the main memory (256 Mbytes) of a Sony Playstation3 with a Cell processor.

First we consider the Posix thread (pthread) and the sequential versions of the code. Table 1 shows execution times for different platforms: a system with four dual-core AMD Opteron processors at 1.8GHz (Linux, gcc 3.4.6), a system with two dual-core Intel Xeon 5110 processors at 1.6GHz (Linux, g++ 3.4.6), a 1GHz six-core quad-issue Sun T1 processor (with limited floating point performance) (Solaris, SunStudio 11 CC) and a dual-core Intel Centrino Duo T2300 system at 1.66GHz (Windows XP, g++ 3.4.4). The pthread implementation, which imposes slight parallel overhead, runs almost as fast on a single processor core as the sequential version. Increasing the number of cores gives good speedup. Note that computational work on coarser grids is performed on a single core for efficiency reasons, and finer grids are distributed to all cores. However, the exact break-even point for the strategy has not been optimised for each architecture. Note that the dual-issue PowerPC processor of Cell also benefits from pthread parallelism, Windows seems to increase the threading overhead, the AMD numbers show large variations due to memory affinity, and the test case is relatively small for large servers.

Table 1. Execution times of the multigrid example, wall clock in sec. Sequential and pthread parallel versions.

no. of threads       1     1     2     3     4     5     6     7     8     12    18    24
6 cores Sun T1       57.3  85.8  44.4  30.0  22.5  18.3  16.1  17.3  15.3  13.0  13.1  13.1
4 * 2 cores AMD64    2.10  2.11  0.93  1.41  1.26  1.16  1.11  1.06  1.05
2 * 2 cores Xeon     1.61  1.59  1.07  0.83  0.77
2 cores Centrino     1.70  2.06  1.14

Now, we consider the genuine Cell processor implementation. Different programming models in C/C++ are possible, see [26]. The PPU (host) and SPU (worker) execute different binaries and can be considered as heterogeneous multi-threading. However, local memory size (256 kbytes) and main memory access (direct memory access DMA block transfer only) of the SPUs are limited, and the performance characteristics of both core types are different, see [21]. We chose the function off-load programming model. The main code and data reside on the host PPU. Each for loop is executed on SPUs. The current binary SPU code is downloaded to the SPU and executed, and the PPU waits for the SPU termination. The SPU features no direct access to main memory and no cache, but user-initiated DMA block transfers. Hence the SPU code loads a data block which fits into local memory, performs the for loop on the block and writes data back to main memory. This process is repeated until all data is processed. In order to hide the effect of DMA memory access, a double buffer strategy is employed. Read and write operate on data in one buffer, computations on data in the other buffer. Afterwards both buffers are swapped. Further difficulties arise from the fact that DMA memory access has to be aligned to host cache lines of 128 bytes size.

Table 2 again shows preliminary execution times for different numbers of cores. The Sony Playstation3 is based on a Cell processor featuring a dual-issue PowerPC (PPU) and 7 synergistic units (SPUs). Six SPUs are available to the user code. We present the numbers for sequential and pthread parallel PPU code for comparison.
The next line shows numbers for complete function off-load to the SPUs. Note that the number of SPUs involved on coarser grids is smaller for reasons of efficiency. We show numbers with a mixed SPU/PPU execution model, where coarse grid computations remain on the PPU, while fine grids are distributed over the SPUs. The wall clock times with Yellow Dog Linux and IBM xlC 8.2 SPU and PPU compilers show the performance of the PowerPC PPU, an SPU implementation with non-overlapping communication (single buffer) and a multi-buffer implementation without and with explicit Altivec parallel SIMD instructions on the SPU.

Table 2. Execution times of the multigrid example, wall clock in sec on Cell processor. PPU, SPU and SPU with Altivec SIMD instructions.

no. of threads        1     2     3     4     5     6
PPU                   3.12  2.56
SPU, single buffer    3.29  1.69  1.52  1.44  1.43  1.42
SPU                   2.00  1.31  1.15  1.07  1.03  1.01
SPU+Altivec           1.79  0.97  0.89  0.85  0.83  0.83

We obtain a decreasing execution time for larger numbers of cores. The single SPU performance on small data sets is slow compared to the PPU and its theoretical performance. Some additional experiments may explain this: The usual optimization uses SIMD vector operations for 4 floating point values per word, which in our case does pay off. Start-up times and synchronisation of 6 SPUs account for 0.43s, the sequential PPU computations for 0.19s. A strategy to fuse several loops into one SPU code to avoid repeated code load (code size of roughly 85kbytes per ForEach) has to be weighed against additional synchronisation. The remaining time is spent on SPU-initiated memory transfer and overlapped computations. Further, there is a strong influence of local buffer size. We expect larger local memory size to improve overall performance. However, preliminary results indicate a major bottleneck of main memory size. Larger problem sizes are expected to give better vector and parallel efficiency.
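The double buffer strategy described in this section can be sketched in portable C++. This is only an illustration under our own assumptions: std::copy stands in for the DMA get and put operations, the kernel is a simple pointwise scaling rather than the multigrid stencil, the names fetch_block and scale_double_buffered are hypothetical, and the transfers run synchronously here, whereas on the SPU the fetch of the next block would overlap with the computation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in for a DMA get: copy one block from "main memory" into a local buffer.
inline std::size_t fetch_block(const std::vector<float>& mem, std::size_t begin,
                               std::size_t block, std::vector<float>& buf) {
    const std::size_t len = std::min(block, mem.size() - begin);
    std::copy(mem.begin() + begin, mem.begin() + begin + len, buf.begin());
    return len;
}

// Scale all values in place, block by block, using two local buffers.
inline void scale_double_buffered(std::vector<float>& mem, float factor,
                                  std::size_t block) {
    assert(block > 0);
    if (mem.empty()) return;
    std::vector<float> compute_buf(block), transfer_buf(block);
    std::size_t len = fetch_block(mem, 0, block, compute_buf);  // first block
    for (std::size_t begin = 0; begin < mem.size(); begin += block) {
        const std::size_t next = begin + block;
        std::size_t next_len = 0;
        // Fetch the next block into the transfer buffer while the current
        // block is still held in the compute buffer.
        if (next < mem.size())
            next_len = fetch_block(mem, next, block, transfer_buf);
        for (std::size_t i = 0; i < len; ++i) compute_buf[i] *= factor;
        // Stand-in for the DMA put of the finished block.
        std::copy(compute_buf.begin(), compute_buf.begin() + len,
                  mem.begin() + begin);
        std::swap(compute_buf, transfer_buf);  // the buffers exchange roles
        len = next_len;
    }
}
```

The block size plays the role of the local buffer size: a larger block means fewer transfers, but both buffers together must fit into the local memory of the worker.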
6 Conclusions
A domain specific parallel programming model is presented, which allows for efficient automatic parallelization of numerical computations. However, it is designed as a set of domain dependent language extensions. An application specific library provides a (distributed) data container. Further, iterators are defined on the data containers. An application that is written as a sequence of iterators can be parallelized: the iteration code is compiled into source code using standard parallel programming models.
Acknowledgements I want to thank the anonymous referees for their helpful comments. This work was partially supported by DFG grant SFB/TR7 “gravitational wave astronomy”. Current versions of the compiler, libraries and benchmark codes are available at http://parallel-for.sourceforge.net.
References
1. Butenhof, D.R.: Programming with POSIX Threads. Addison-Wesley, Reading (1997)
2. Reinders, J.: Intel Threading Building Blocks. O'Reilly (2007)
3. Blumofe, R.D., Joerg, C.F., Leiserson, C.E., Randall, K.H., Zhou, Y.: Cilk: An efficient multithreaded runtime system. In: 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming PPoPP 1995, pp. 207–216. ACM, New York (1995)
4. Sutter, H.: The Concur project: Some experimental concurrency abstractions for imperative languages (2006), slides at: http://www.nwcpp.org/Downloads/2006/The Concur Project - NWCPP.pdf
5. Chandra, R., Menon, R., Dagum, L., Kohr, D., Maydan, D., McDonald, J.: Parallel programming in OpenMP. Morgan Kaufmann, San Francisco (2000)
6. Pacheco, P.: Parallel programming with MPI. Morgan Kaufmann, San Francisco (1996)
7. Koelbel, C.: The High Performance Fortran handbook. MIT Press, Cambridge (1993)
8. Numrich, R.W., Reid, J.: Co-array Fortran for parallel programming. ACM Fortran Forum 17(2), 1–31 (1998)
9. Chamberlain, B.L., Callahan, D., Zima, H.P.: Parallel programmability and the Chapel language. Int. J. high perf. computing 21(3), 231–312 (2007)
10. Steele, G.: Parallel programming and parallel abstractions in Fortress. In: Proc. 14th Int. Conf. on Parallel Architectures and Compilation Techniques, pp. 157–160. IEEE, Los Alamitos (2005)
11. Ghazawi, T.E., Carlson, W., Sterling, T.L.: Distributed shared-memory programming with UPC. Wiley, Chichester (2005)
12. Charles, P., Grothoff, C., Saraswat, V., Donawa, C., Kielstra, A., Ebcioglu, K., von Praun, C., Sarkar, V.: X10: An object-oriented approach to non-uniform cluster computing. In: Proc. 20th ACM conf. on object oriented programming, pp. 519–538. ACM Press, New York (2005)
13. Herrmann, C., Lengauer, C.: HDC: A higher-order language for divide-and-conquer. Parallel Proc. Let. 10(2/3), 239–250 (2000)
14. Zumbusch, G.: Data parallel iterators for hierarchical grid and tree algorithms. In: Nagel, W.E., Walter, W.V., Lehner, W. (eds.) Euro-Par 2006. LNCS, vol. 4128, pp. 625–634. Springer, Heidelberg (2006)
15. Quinlan, D., Schordan, M., Yi, Q., de Supinski, B.R.: Semantic-driven parallelization of loops operating on user-defined containers. In: Rauchwerger, L. (ed.) LCPC 2003. LNCS, vol. 2958, pp. 524–538. Springer, Heidelberg (2004)
16. Oldham, J.D.: POOMA. A C++ Toolkit for High-Performance Parallel Scientific Computing. CodeSourcery (2002)
17. Griebel, M., Knapek, S., Zumbusch, G.: Numerical simulation in molecular dynamics. Springer, Heidelberg (2007)
18. Austern, M.H.: Generic programming and the STL. Addison-Wesley, Reading (1999)
19. Järvi, J., Powell, G.: The lambda library: Lambda abstraction in C++. In: Proc. 2nd workshop on C++ Template Programming at OOPSLA 2001 (2001), http://www.oonumerics.org/tmpw01
20. Birken, K.: Semi-automatic parallelisation of dynamic, graph-based applications. In: Proc. Conf. ParCo 1997, pp. 269–276. Elsevier, Amsterdam (1998)
21. Kahle, J.A., Day, M.N., Hofstee, H.P., Johns, C.R., Maeurer, T.R., Shippy, D.: Introduction to the Cell multiprocessor. IBM J. Res. & Dev. 49(4/5), 589–604 (2005)
22. Warren, M.S., Salmon, J.K.: A portable parallel particle program. Comput. Phys. Commun. 87(1–2), 266–290 (1995)
23. Lengauer, C.: Program optimization in the domain of high-performance parallelism. In: Lengauer, C., Batory, D., Consel, C., Odersky, M. (eds.) Domain-Specific Program Generation. LNCS, vol. 3016, pp. 73–91. Springer, Heidelberg (2004)
24. Bailey, D.H., Barszcz, E., Barton, J.T., Browning, D.S., Carter, R.L., Dagum, L., Fatoohi, R.A., Frederickson, P.O., Lasinski, T.A., Schreiber, R.S., Simon, H.D., Venkatakrishnam, V., Weeratunga, S.K.: The NAS parallel benchmarks. Inter. J. Supercomp. Appl. 5(3), 63–73 (1991)
25. Zumbusch, G.: Data dependence analysis for the parallelization of numerical tree codes. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 890–899. Springer, Heidelberg (2007)
26. IBM: Cell Broadband Engine Programming Tutorial. 2.1 edn. (2007)
Semantic-Oriented Approach to Performance Monitoring of Distributed Java Applications
Wlodzimierz Funika, Piotr Godowski, and Piotr Pęgiel
Institute of Computer Science, AGH, ul. Mickiewicza 30, 30-059 Kraków, Poland
[email protected], {flash,pegiel}@student.agh.edu.pl
Abstract. In this paper, we present a new approach to semantic performance analysis in on-line monitoring systems. We have designed a novel monitoring system which uses an ontological description for all concepts exploited in distributed systems monitoring. We introduce a complete implementation of a robust system with semantics, which is not tied to any particular underlying “physical” monitoring system and gives the end user the power of intelligent monitoring features like automatic metrics selection and collaborative work.
1 Introduction
The design of distributed applications is in many cases a challenge to the developer ([1], [2], [3]). On the one hand, there are the limitations and performance issues of distributed programming platforms, so one of the most important tasks is to increase the performance and reliability of distributed applications. On the other hand, the developer must ensure that the application utilizes distributed resources efficiently. Therefore, understanding an application's behaviour through performance analysis and visualization is crucial. This is especially true now, when many distributed systems exploit the SOAP protocol. The functionality of the program is divided into pieces and implemented as Web Services. Monitoring the data flow between components could be very helpful for the user to find potential problems with a system.

The biggest problem one can face while using performance tools (especially those working “on-line”) is their complexity. Thus many users give up what these systems offer and instead use tools that are often less sophisticated but easier to use. So a very important task is to ease the user's interactions with the monitoring system. Certainly, simplifying the use of the program must not limit its functionality related to performance evaluation.

On the other hand, more and more software involves software agents which guide the user step by step. Such agents usually use a semantic description of the software's features and, through the analysis of the user's behaviour, suggest what he/she should do in order to achieve a desired result. A similar approach can be used in tools for performance monitoring in HPC. The first steps have already been taken (AutoPilot, PerfOnto), but the authors of these tools developed their own architectures for semantic description (mostly based on feedback from metrics), not using already existing solutions (like OWL/RDF).

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1140–1149, 2008. © Springer-Verlag Berlin Heidelberg 2008

The Semantic Web paradigm has introduced the concept of semantic description of resources (OWL/RDF, DAML) and semantic description of available services (mostly Web Services, OWL-S, DAML-S). We can leverage existing standards to develop a performance monitoring tool using some knowledge which describes performance metrics, and this is the primary goal of this paper.

The rest of this paper is organized as follows: Section 2 discusses the motivation and system use cases. The related work is discussed in Section 3, while in Section 4 we present our proposed ontology and system architecture for an on-line monitoring system with semantics, followed by Summary and Future Work in Section 5.
2 System Use Cases
In this paper we present a semantic-oriented monitoring infrastructure called SemMon. The architecture of the tool fits in the OMIS model [4] and is able to work with available monitoring systems (like J-OCM [6] and JMX¹). Semantics in the monitoring architecture should reuse as much as possible of existing solutions, libraries and tools, with special attention paid to Open Source software and solutions developed in European Grid projects (like GOM [7], developed in the K-Wf Grid [5] project). The following general use cases show the usability of the designed system.

The user should be able to:
– monitor the performance of a Java application running under control of the physical monitoring system
– use a GUI tool to visualize performance data, browse available performance metrics, combine graphs of one or more metrics in a single chart, and check which available metrics are semantically connected with the current metric
– be provided automatically with a set of metrics which are meaningful for the user and the desired result
– see information about the metrics that should be run in the next step after using the first metric
– browse semantic descriptions of metrics in a user-friendly form.

The system administrator should be able to:
– create, delete, and insert semantic descriptions of available metrics and elements of the monitored system
– manage historical performance data
– provide new metrics in a physical monitoring system, and describe them in a semantic way.

¹ Java Management eXtensions, http://java.sun.com/javase/technologies/core/mntr-mgmt/javamanagement/
W. Funika, P. Godowski, and P. Pęgiel
The system should use an XML format both for sending and for receiving requests about semantic descriptions. Developing the system as a Web Service (with a semantic description provided) leads to a universal solution which fits well into the Semantic Web paradigm. The system should be designed so that it works not only within the J-OCM system or any OMIS-compliant system, but also within any existing Grid monitoring solution.
3 Related Work
In this section, related work concerning the use of semantics in on-line monitoring systems is presented. We concentrate not on all available monitoring systems, but mainly on those where semantics have been introduced.

Autopilot [8] was developed in the Grid Application Development Software (GrADS) Project [10] and is responsible for adaptive control of distributed applications. The Autopilot architecture consists of performance sensors, software actuators and a decision control unit that uses fuzzy logic to analyse the data received from sensors and to prepare messages for the actuators. Autopilot is the very first example of exploiting some kind of semantics, or rather fuzzy logic, to help with monitoring and adaptation actions.

PerfOnto [9] is a new approach to performance analysis, data sharing and tools integration in Grids that is based on ontology. Basically, PerfOnto is an OWL ontology describing experiment-related and resource-related concepts. The experiment-related concept describes experiments and the associated performance data of applications. The structure of the concept is described as a set of definitions of classes and properties, like the Application, Version, Code region and Experiment classes. A region summary has performance metrics (property hasMetric) and subregion summaries (property hasChildRS). PerformanceMetric describes a performance metric, each metric having a name and a value. In addition to the detailed description of performance resources and experiments, the prototype PerfOnto system is able to search data in an ontological manner (i.e. using a knowledge base). For example, it is possible to find a region summary executed on a particular node with a specified metric value exceeding some threshold value.

Such queries can be used to monitor critical sections of an executing application and to trigger administration notifications, or they can even serve as a suggestion to a site scheduler to migrate an existing job to another node or to stop scheduling jobs on an overloaded node. PerfOnto gives a rich description of performance data, but does not provide any automation for using it. There are no adaptation algorithms yet; however, it gives a powerful architecture to start with.
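The kind of knowledge-base query described above can be illustrated with a small sketch. It does not use PerfOnto or any OWL tooling: the RegionSummary record, its fields, the sample values and the find_hot_regions helper are hypothetical names chosen for this illustration, and only the idea of a region summary carrying metrics (hasMetric) is taken from the text.

```python
from dataclasses import dataclass, field


@dataclass
class RegionSummary:
    """Hypothetical stand-in for a region summary individual."""
    region: str
    node: str
    has_metric: dict = field(default_factory=dict)  # metric name -> value


def find_hot_regions(summaries, node, metric, threshold):
    """Return the region summaries executed on `node` whose `metric` exceeds `threshold`."""
    return [
        s for s in summaries
        if s.node == node and s.has_metric.get(metric, float("-inf")) > threshold
    ]


# Made-up sample data, for illustration only.
SUMMARIES = [
    RegionSummary("init", "node1", {"wallclock_time": 0.4}),
    RegionSummary("solve", "node1", {"wallclock_time": 12.5}),
    RegionSummary("solve", "node2", {"wallclock_time": 3.1}),
]

HOT = find_hot_regions(SUMMARIES, node="node1", metric="wallclock_time", threshold=5.0)
```

A result list such as HOT could then feed a notification to an administrator or a hint to a scheduler, as discussed above.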
4 Overview of the SemMon System
The visualization of monitoring data in a user-friendly form is one of the key features provided by any monitoring system. Due to the great amount of gathered information, proper presentation and interpretation of observation results becomes a very hard and complicated task. A high-level overview of the SemMon architecture is shown in Figure 1.

[Figure 1 diagram. Labels: Ontology Subsystem; Core Subsystem; Ontology storage (Resource description, Metrics description); Metrics results database; Monitoring core computer/cluster; monitoring information over physical monitoring system, control information; computers/nodes with monitoring agents; computers with monitoring GUI tool (GUI Subsystem)]
Fig. 1. System architecture as distributed environment. Monitoring core computers include Core Subsystem and Ontology Subsystem.

The heart of the model is the set of computers that provide the primary system functionality, such as processing the ontology with Resource Capabilities and Metrics, or storing monitoring data. In order to support knowledge persistency, a database is needed. This functionality is implemented in the Ontology subsystem. Another part provides support for the "physical" monitoring system. This subsystem (the Core subsystem) has to provide functionality for registering monitoring agents as well as for processing monitoring data.

The second part of the system contains the computers with monitoring agents. The agents expose monitored resources to our monitoring system. All of them register with the Core subsystem, after which Core is able to introspect the resources that are exposed. The agents are programs on the nodes/computers that use the "physical" monitoring system, e.g. JMX or J-OCM.

The last part of the system consists of the computers with Graphical User Interfaces which are connected to the Core subsystem. All of the Core subsystem functionality (e.g. browsing monitored resources, running measurements) is accessible from within the GUI in a simple and user-friendly way. The GUI is designed for advanced users as well as for beginners. It provides functionality for collecting some of the knowledge base data, such as a metric usefulness score. The GUI is also an environment for collaborative work: users share metric ranks between different GUI instances in order to help other users make proper decisions. In the following we focus on a description of the subsystems used in the SemMon monitoring system.
4.1 Ontology Subsystem
The Ontology subsystem is the heart of the whole designed system. It contains methods for parsing, automatically interpreting, searching, creating, and, finally, sharing ontology data. The Ontology subsystem brings a unique feature to the designed system: the capability to interpret what is monitored, both for system users and (even more importantly) for the system itself. Using the knowledge deployed in the underlying ontology data, the system is aware of what is monitored and what should be monitored in the next step within the monitored application's lifetime.

Every type of resource accessible to the monitoring system is described in the OWL ontology and reflects a natural hierarchy of computing resources. It is possible to update part of the description or even the whole of it. The resources in question are: Resource classes (like Node, CPU, JVM), Resource instances (i.e. OWL instances of resources available in the underlying monitoring system, like CPU i386 node2 cluster1) and the measurable attributes of the resource instances. Each Resource class defines which measurable attributes are available for its instances. A measurable attribute, called ResourceCapability in this paper, may be either an atomic attribute (like LoadAvg1Min) or an OWL superclass for a set of ResourceCapabilities. This way a natural hierarchy of capabilities can be constructed. A special Resource property, hasResourceCapability, is the glue between Resources and ResourceCapabilities. Any type of Resource can contain any number of ResourceCapabilities. Figure 2 shows the Resources ontology class hierarchy while Figure 3 presents a fragment of the ResourceCapabilities ontology class hierarchy.

The Metrics ontology describes metric concepts as OWL classes or individuals that represent the metrics available for the user to execute. Metrics also form a hierarchy of their own in order to provide a rich description for ontology reasoners.
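The relation between Resource classes and the ResourceCapability hierarchy described above can be sketched with plain dictionaries instead of an OWL store. The class names come from the text and Fig. 3, but the parent links chosen here are assumptions made for this illustration (the exact edges of the diagram are not reproduced), and the helper names are hypothetical.

```python
# Child -> parent links between capability classes.
# Assumed edges, for illustration only; the real ontology may differ.
CAPABILITY_PARENT = {
    "HardwareCapability": "ResourceCapability",
    "CPUCapability": "HardwareCapability",
    "CPUUsageCapability": "CPUCapability",
    "LoadCapability": "CPUCapability",
    "LoadAvg1Min": "LoadCapability",
    "MemoryCapability": "HardwareCapability",
    "AvailableVirtualMemoryCapability": "MemoryCapability",
}

# hasResourceCapability: Resource class -> capability classes it exposes.
HAS_RESOURCE_CAPABILITY = {
    "CPU": ["CPUCapability"],
    "Memory": ["MemoryCapability"],
}


def descendants(capability):
    """All capabilities below `capability` in the hierarchy."""
    result = []
    for child, parent in CAPABILITY_PARENT.items():
        if parent == capability:
            result.append(child)
            result.extend(descendants(child))
    return result


def atomic_capabilities(resource_class):
    """Leaf (atomic) capabilities measurable on instances of `resource_class`."""
    leaves = set()
    for capability in HAS_RESOURCE_CAPABILITY.get(resource_class, []):
        below = descendants(capability)
        candidates = below if below else [capability]
        leaves.update(c for c in candidates if not descendants(c))
    return sorted(leaves)
```

With these assumed links, asking for the atomic capabilities of the CPU class walks from CPUCapability down to its leaves, which is the kind of traversal a reasoner performs over the superclass relationships.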
Metrics can be simple, meaning that the metric measures only one attribute, or custom, meaning that the metric may measure as many capabilities as required; it is even possible to provide custom implementations for metrics (user-defined metrics). The metrics ontology is built starting from a flat list of all the metrics to be considered by the monitoring system. However, with only a flat list and no hierarchy (specialization), no powerful reasoning is possible, because no generic-specific or "is related to" relationships are provided. The next step is therefore to find out which metrics in the flat list are generic and which are specific. Such relationships can be expressed in an ontology with the rdfs:subClassOf property. A sample superclass metric might be SoftwareMetric with its specific subclass JVMThreadCPUTimeMetric. As a result, metrics form a tree which can be used for a reasoning process.

[Figure 2 diagram. Class nodes: Resource, Software, Hardware, ThirdPartSoftware, JVM, ClassLoader, Thread, OperatingSystem, Class, Cluster, NetworkInterface, Node, CPU, Memory, Storage, HardDisk]
Fig. 2. Resources ontology diagram

[Figure 3 diagram. Class nodes: ResourceCapability, HardwareCapability, SoftwareCapability, NodeCapability, CPUCapability, CPUUsageCapability, LoadCapability, MemoryCapability, AvailableVirtualMemoryCapability, StorageCapability, NetworkingCapability, ThreadCapability, OperatingSystemCapability, JVMCapability, UptimeCapability]
Fig. 3. ResourceCapabilities ontology diagram

A special metric property, monitors, is the glue between the Metrics ontology and the Resources and ResourceCapabilities ontologies. The property monitors has its domain in the AbstractMetric class (and its subclasses) and its range in the ResourceCapability classes. Because the cardinality of this property is not limited, any type of AbstractMetric is able to monitor any number of capabilities. This means that the total number of measurements available in the system is not equal to the number of subclasses and individuals of the AbstractMetric class, but is the sum of the cardinalities of the monitors properties in the metrics ontology.

The metric property hasCustomImplementationClass informs the system that the metric is a custom metric, i.e. has its own implementation. This property points to the fully qualified name of the Java class implementing the Custom Metric interface. A Custom Metric has its own implementation rules, which determine exactly what is returned by the measurement process, as explained below. Since a Custom Metric can access the Core public API and the Resources registry, it can request any number of capabilities' values from the underlying monitoring system. The only contract that a Custom Metric must meet is to return a single number whenever it is requested.
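The three ideas of this subsection (the rdfs:subClassOf tree, the monitors property and the Custom Metric contract) can be sketched as follows. This is a Python analogue written for illustration, not the SemMon code, which is Java: the dictionaries, the measure method, NodeHealthMetric, CPUUsageMetric and the registry lookup are hypothetical, and placing SoftwareMetric directly under AbstractMetric is an assumption.

```python
from abc import ABC, abstractmethod

# rdfs:subClassOf links between metric classes (second link is an assumption).
SUBCLASS_OF = {
    "JVMThreadCPUTimeMetric": "SoftwareMetric",
    "SoftwareMetric": "AbstractMetric",
}

# monitors: metric -> capabilities it measures (illustrative assignments).
MONITORS = {
    "JVMThreadCPUTimeMetric": ["ThreadCapability"],
    "CPUUsageMetric": ["CPUUsageCapability"],
    "NodeHealthMetric": ["CPUUsageCapability", "AvailableVirtualMemoryCapability"],
}


def is_a(metric, ancestor):
    """True if `metric` equals `ancestor` or specialises it in the metric tree."""
    while metric is not None:
        if metric == ancestor:
            return True
        metric = SUBCLASS_OF.get(metric)
    return False


def available_measurements(monitors):
    """Total measurements: the sum of the cardinalities of all monitors properties."""
    return sum(len(capabilities) for capabilities in monitors.values())


class CustomMetric(ABC):
    """Analogue of the Custom Metric interface."""

    @abstractmethod
    def measure(self, registry) -> float:
        """Return a single number whenever a measurement is requested."""


class NodeHealthMetric(CustomMetric):
    """Hypothetical user-defined metric combining two capability values."""

    def measure(self, registry):
        cpu_busy = registry["CPUUsageCapability"]                 # percent
        mem_free = registry["AvailableVirtualMemoryCapability"]   # fraction
        return (100.0 - cpu_busy) * mem_free


# A plain dict stands in for the Core API and Resources registry.
SAMPLE_REGISTRY = {"CPUUsageCapability": 40.0, "AvailableVirtualMemoryCapability": 0.5}
```

Note that the three metrics above yield four measurements, because one of them monitors two capabilities; this is the counting rule stated in the text.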
4.2 Core Subsystem
The Core subsystem is responsible for connecting to and initialising the underlying monitoring system (using its protocol adapter mechanism), and for deploying, initialising and executing metrics (including user-defined metrics). It provides an interface to the Ontology subsystem and, last but not least, a public (remote) interface for GUI clients to connect to. Core also manages the GUI clients subscribed to the list of connected resources, running metrics, running metric values and alarms (i.e. conditional action metric notifications).

The Core subsystem consists of three components: Adapter, Resource Registry, and the Remote interface for GUI. The Adapter component follows the commonly used Adapter structural design pattern and "translates" all Core requests into requests specific to the underlying monitoring system (JMX, J-OCM, OCM-G, etc.). Because of the major differences and interface incompatibilities among the wide range of monitoring systems available on the market, a common interface called Protocol Adapter has been designed; it resolves the incompatibilities between different physical monitoring systems. Resource Registry is a service that couples the Core and Protocol Adapter subsystems. It holds all the resource instances found (with help from Protocol Adapter) in the underlying monitoring system and maps them to Core identifiers.

User-defined metrics have full access to the public Core API. Therefore a user-defined metric (implementing the Custom Metric interface) can introspect Resource Registry and, using the Protocol Adapter implementation, is capable of sending a specific query to the underlying monitoring system. This feature is useful when the underlying monitoring system has specific features not covered by the generic Protocol Adapter interface. The Remote interface for GUI allows remote GUI clients to connect to SemMon to enable collaborative work.
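The interplay of Protocol Adapter and Resource Registry can be sketched as below. This is a minimal illustration of the Adapter pattern under stated assumptions, not the SemMon API: the method names, the identifier scheme ("core-1", ...) and the FakeJmxAdapter with its canned values are all hypothetical, and no real JMX or J-OCM calls are made.

```python
from abc import ABC, abstractmethod


class ProtocolAdapter(ABC):
    """Common interface hiding the differences between monitoring systems."""

    @abstractmethod
    def list_resources(self):
        """Return the native identifiers of the resources currently exposed."""

    @abstractmethod
    def read_capability(self, native_id, capability):
        """Return the current value of `capability` for a native resource."""


class FakeJmxAdapter(ProtocolAdapter):
    """Toy adapter with canned data; a real one would query JMX, J-OCM, etc."""

    def __init__(self):
        self._data = {
            "jvm@node1": {"UptimeCapability": 120.0},
            "jvm@node2": {"UptimeCapability": 45.0},
        }

    def list_resources(self):
        return sorted(self._data)

    def read_capability(self, native_id, capability):
        return self._data[native_id][capability]


class ResourceRegistry:
    """Maps resources found through an adapter to Core identifiers."""

    def __init__(self, adapter):
        self._adapter = adapter
        self._native_by_core_id = {}

    def discover(self):
        """Register newly found resources and return all known Core identifiers."""
        known = set(self._native_by_core_id.values())
        for native_id in self._adapter.list_resources():
            if native_id not in known:
                core_id = f"core-{len(self._native_by_core_id) + 1}"
                self._native_by_core_id[core_id] = native_id
        return sorted(self._native_by_core_id)

    def read(self, core_id, capability):
        """Translate a Core request into a request on the underlying system."""
        native_id = self._native_by_core_id[core_id]
        return self._adapter.read_capability(native_id, capability)


RESOURCE_REGISTRY = ResourceRegistry(FakeJmxAdapter())
CORE_IDS = RESOURCE_REGISTRY.discover()
```

Supporting another monitoring system would then mean writing another ProtocolAdapter subclass, while metrics keep addressing resources by Core identifier.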
The following aspects of the system environment are addressed by this interface:
– interface for attached resources – a GUI client can subscribe to notifications about newly attached and detached monitoring systems and their resources,
– interface for running metrics – a GUI client can subscribe to notifications about measurements newly started and stopped on the Core subsystem,
– interface for running metric values – in addition to subscribing to running-metric notifications, GUI clients can subscribe to the values of selected metrics. After a successful subscription, metric values are sent to the client at the polling interval defined for the running metric,
– interface for alarms – alarms are conditional action metric notifications. When an action metric is running on the Core subsystem and its value exceeds the declared threshold action value, all the unconditional action metrics that are declared in the underlying ontology are sent as notifications to all subscribed GUI users. The user can then take an action to resolve the alarm (e.g. start a new metric from the list of metrics suggested by the system).

4.3 GUI Subsystem
One of the most important tasks for GUI is to enhance performance visualization capabilities and to make system usage much easier both for advanced users and for beginners.
Fig. 4. Performance visualisation window
The most important SemMon GUI component is the Visualisation manager, which creates Visualisation windows. This component allows the user to create, configure, show, and delete visualisations. It comprises two lists:
– List of running metrics – the user can select which started (running) metric he/she would like to add to the visualisation. It is possible to add more than one metric to one visualisation. In this form the user can also see the average metric rank (based on the ranks given by all users) and add his/her own rank.
– List of visualisations – a list of the currently created visualisations. This list is private to the user: unlike the running metrics list, it is not shared with other users.

The Visualisation component is responsible for managing and displaying a single visualisation. The term visualisation refers to the presentation of data provided by the launched metric(s). The presentation layer uses different types of charts and settings for these charts. A window with a sample visualisation is presented in Fig. 4. This visualisation involves two metrics, the CPU Usage metric and the Available Physical Memory metric. The visualisation is shown in panel 1.
The main features of the chart are:
– a separate Y axis can be created dynamically for each metric on the same chart;
– a new metric can be added to the chart at any time, before or after the visualisation starts, and a metric can likewise be removed even after the visualisation has started;
– when new data is dynamically added to the chart (or updated), the axis is scaled automatically (auto-scaling can be controlled with the settings marked 2). If auto-scale is disabled, the user can scroll the chart with scroll box 3.

The sample visualisation is only one of many features provided by the GUI. Other facilities included in the GUI are summarised below:
– GUI Logic Center provides a universal model for many different system aspects, such as the Resource structure, the Metric structure, and the Alarm notification model;
– GUI Alarm Manager is responsible for presenting to the user the actions that can be taken when an alarm occurs. It shows a list of available metrics that can be run in response to the alarm;
– Remote Adapter is a component which connects the GUI with the Core subsystem. It also works as a façade that hides communication details from the components that store GUI logic.
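The alarm flow, from a threshold being exceeded on the Core side to the list of suggested metrics shown by the GUI Alarm Manager, can be sketched as a simple publish/subscribe loop. The class and field names below are hypothetical and the suggested metric is made up; in SemMon the suggestions come from the ontology and notifications travel over the remote interface, neither of which is modelled here.

```python
from dataclasses import dataclass, field


@dataclass
class ActionMetric:
    """A running metric with a threshold and the metrics suggested on alarm."""
    name: str
    threshold: float
    suggested_metrics: list


@dataclass
class AlarmManager:
    """Keeps subscriber callbacks and notifies them when a threshold is exceeded."""
    subscribers: list = field(default_factory=list)

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def on_value(self, metric, value):
        """Notify every subscriber if `value` exceeds the metric's threshold."""
        if value <= metric.threshold:
            return False
        for notify in self.subscribers:
            notify(metric.name, value, list(metric.suggested_metrics))
        return True


# A list stands in for the GUI window that would display the alarm.
RECEIVED = []
MANAGER = AlarmManager()
MANAGER.subscribe(lambda name, value, suggestions: RECEIVED.append((name, value, suggestions)))

CPU_ALARM = ActionMetric("CPUUsageMetric", threshold=90.0,
                         suggested_metrics=["JVMThreadCPUTimeMetric"])
QUIET = MANAGER.on_value(CPU_ALARM, 42.0)   # below the threshold: no notification
FIRED = MANAGER.on_value(CPU_ALARM, 97.5)   # above the threshold: subscribers notified
```

Each subscriber receives the metric name, the offending value and the suggested metrics, which is the information a user needs to start a follow-up measurement.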
5 Summary and Future Work
The main objective of this paper was to present the design and implementation of a robust and flexible semantics-oriented monitoring system, SemMon. It seems to be one of the first complete approaches to joining the "worlds" of system monitoring and Semantic Web ideas. The SemMon system makes extensive use of ontologies to semantically describe all the concepts it uses. In addition, it is flexible: it picks up ontology changes automatically, assists with automatic metric selection, leverages the collaborative knowledge of its users, supports user-defined metrics, and offers extensible and clear visualisation options.

There is still room for further improvement. The main tasks for future work are: switching to pure SPARQL (instead of using Jena ARQ), using Java Web Start for the GUI, a case study of AJAX-based GUIs, and full integration with K-Wf's GOM. One of the most important tasks is to develop efficient algorithms for reasoning in the ontology frameworks. Reasoning is the key feature of the ontology, and so far it still works in a brute-force manner. Although there are some improvements in the query algorithms, they are based on an additional caching layer rather than on optimized algorithms. We expect that the future will bring great progress in this area, since many scientists and commercial research centres (HP Semantic Labs² and the W3C Semantic Web Working Group³ are perfect examples) are already working hard on Semantic Web development.

Until the anticipated progress in these domains is made, the current version of the implemented system should be considered an alpha version in terms of production quality: it meets its functional requirements, but there are still performance-related issues. Some open issues remain to be solved, but the robust, flexible and extensible engine is ready for use in further research.

² http://www.hpl.hp.com/semweb/
³ http://www.w3.org/2001/sw/

Acknowledgements. The research is partially supported by the EU IST int.eu.grid project 031857 and its corresponding SPUB-M.
References
1. Gerndt, M., Wismüller, R., Balaton, Z., Gombás, G., Kacsuk, P., Németh, Z., Podhorszki, N., Truong, H.-L., Fahringer, T., Bubak, M., Laure, E., Margalef, T.: Performance Tools for the Grid: State of the Art and Future. APART-2 Working Group, Technische Universitaet Muenchen, January 2004. Research Report Series, Lehrstuhl fuer Rechnertechnik und Rechnerorganisation (LRR-TUM), vol. 30, Shaker Verlag (2004) ISBN 3-8322-2413-0
2. Podhorszki, N., Kacsuk, P.: Presentation and Analysis of Grid Performance Data. In: Kosch, H., Böszörményi, L., Hellwagner, H. (eds.) Euro-Par 2003. LNCS, vol. 2790, pp. 119–126. Springer, Heidelberg (2003)
3. Reed, D.A., Ribler, R.L.: Performance Analysis and Visualization. In: Foster, I., Kesselman, C. (eds.) Computational Grids: State of the Art and Future Directions in High-Performance Distributed Computing, pp. 367–393. Morgan-Kaufman Publishers, San Francisco (1998)
4. Ludwig, T., Wismüller, R., Sunderam, V., Bode, A.: OMIS – On-line Monitoring Interface Specification (Version 2.0). LRR-TUM Research Report Series, vol. 9, Shaker Verlag, Aachen (1997)
5. Bubak, M., Fahringer, T., Hluchy, L., Hoheisel, A., Kitowski, J., Unger, S., Viano, G., Votis, K., Consortium, K.-W.: K-Wf Grid Knowledge based Workflow system for Grid Applications. In: Proceedings of the Cracow Grid Workshop 2004, p. 39, Academic Computer Centre CYFRONET AGH, Poland (2005) ISBN 83-915141-4-5
6. Funika, W., Bubak, M., Smętek, M., Wismüller, R.: An OMIS-based Approach to Monitoring Distributed Java Applications. In: Kwong, Y.C. (ed.) Annual Review of Scalable Computing, ch. 1, vol. 6, pp. 1–29. World Scientific Publishing Co. and Singapore University Press (2004)
7. Krawczyk, K., Slota, R., Majewska, M., Kryza, B., Kitowski, J.: Grid Organization Memory for Knowledge Management for Grid Environment. In: Proceedings of the Cracow Grid Workshop 2004, Poland, pp. 109–115. Academic Computer Centre CYFRONET AGH (2005) ISBN 83-915141-4-5
8. Ribler, R.L., Vetter, J.S., Simitci, H., Reed, D.A.: Autopilot: Adaptive Control of Distributed Applications. In: Proceedings of the High-Performance Distributed Computing Conference (July 1998)
9. Truong, H.-L., Fahringer, T.: Performance Analysis, Data Sharing and Tools Integration in Grids: New Approach based on Ontology. In: International Conference on Computational Science (ICCS 2004). LNCS, Springer, Krakow, Poland (2004)
10. Berman, F., Chien, A., Cooper, K., Dongarra, J., Foster, I., Dennis Gannon, L.J., Kennedy, K., Kesselman, C., Reed, D., Torczon, L., Wolski, R.: The GrAds project: Software support for high-level grid application development. Technical Report Rice COMPTR00-355, Rice University (February 2000)
Using Experimental Data to Improve the Performance Modelling of Parallel Linear Algebra Routines
Luis-Pedro García¹, Javier Cuenca², and Domingo Giménez³
¹ Servicio de Apoyo a la Investigación Tecnológica, Universidad Politécnica de Cartagena, 30203 Cartagena, Spain [email protected]
² Departamento de Ingeniería y Tecnología de Computadores, Universidad de Murcia, 30071 Murcia, Spain [email protected]
³ Departamento de Informática y Sistemas Informáticos, Universidad de Murcia, 30071 Murcia, Spain [email protected]
Abstract. The performance of parallel linear algebra routines can be improved automatically using different methods. Our technique is based on modelling the execution time of each routine, using information generated by routines from lower levels. However, sometimes the information generated at one level is not accurate enough to be used satisfactorily at higher levels. In that case, the routines are remodelled by appropriately applying polynomial regression. A remodelling phase is proposed and analysed with a parallel matrix multiplication.
Keywords: auto-tuning, performance modelling, self-optimisation, hierarchy of libraries.
1 Introduction
Important tuning systems exist that attempt to adapt software automatically to the conditions of the execution platform. These include FFTW for discrete Fourier transforms [1], ATLAS [2] for the BLAS kernel, sparsity [3] and [4], SPIRAL [5] for signal and image processing, MPI collective communications [6], linear algebra routines [7,8], etc. Furthermore, the development of automatically tuned software facilitates the efficient utilisation of the routines by non-expert users. For any tuning system, the main goal is to minimise the execution time of the routine being tuned, subject to the important constraint of not overly increasing the installation time.
This work has been partially supported by the Consejer´ıa de Educaci´ on de la Regi´ on de Murcia, Fundaci´ on S´eneca 02973/PI/05.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1150–1159, 2008. c Springer-Verlag Berlin Heidelberg 2008
A number of auto-tuning approaches focus on modelling the execution time of the routine to be optimised. Once the model has been obtained, theoretically and/or experimentally, it is used, for a given problem size and execution environment, to obtain the values of some adjustable parameters that minimise the execution time. The approach chosen by FAST [9] is an extensive benchmark followed by a polynomial regression to find optimal parameters for different routines in homogeneous and heterogeneous environments. Vuduc et al. [4] apply polynomial regression in their methodology to decide the most appropriate version among the variants of a routine. They also introduce a black-box pruning method to reduce the enormous implementation spaces. In the FIBER approach [10] the execution time of a routine is approximated by fixing one parameter (problem size) and varying the other (unrolling depth for an outer loop). A set of polynomial functions of degrees 1 to 5 is generated and the best is selected. The values provided by these functions for different problem sizes are used to generate another function where the second parameter is now fixed and the first is varied. Tanaka et al. [11] introduce a new method, Incremental Performance Parameter Estimation, in which the estimation of the theoretical model by polynomial regression starts from the fewest sampling points, and the number of points is increased dynamically to improve accuracy. Initially, they apply it on sequential platforms, seeking just one algorithmic parameter. Lastovetsky et al. [12] reduce the number of sampling points by starting from a previous shape of the curve that represents the execution time. They also introduce the concept of "speed band" as the natural way to represent the inherent fluctuations in speed due to changes in load. Our approach uses dense linear algebra software for message-passing systems as the routines to be tuned.

In our previous studies [13] a hierarchical approach to performance modelling was proposed. The basic idea is to exploit the natural hierarchy existing in linear algebra programs. The execution time models of lower-level routines are first constructed from code analysis. After that, to model a higher-level routine, the execution time is estimated by injecting into its model the information of the lower-level routines that are invoked by the higher-level routine. In this work, a technique for redesigning the model from a series of sampling points by means of polynomial regression has been added to our original methodology. The basic idea is to start from the hierarchical model, using the information from lower level routines to model the higher level ones without experimenting with the latter. However, if for a particular routine this information is not accurate enough, then its model, and/or the models of its set of lower level routines, is rebuilt from scratch, using a series of executions and appropriately applied polynomial regression.

The rest of the paper is organised as follows: in Section 2 the improved architecture of a self-optimised routine is analysed, then in Section 3 some experimental results are described with their special features, and finally, in Section 4 the conclusions are summarised and possible future research is outlined.
2 Creation and Utilisation of a Self-optimised Linear Algebra Routine
The design, installation and execution of a Self-Optimised Linear Algebra Routine (SOLAR) were shown step by step in [13]. In this section, this life-cycle is summarised and the modifications and extensions of the original proposal are described in detail. The main difference is the following. In our previous works, during the installation of a routine, its theoretical model and information from lower level routines were used to complete a theoretical-practical model. Now this process is extended with a testing sub-phase for this model, which compares the modelled time with the experimental time over a series of sampling points (different sets of problem sizes and algorithmic parameters). If the distance between them is not small enough, a remodelling sub-phase starts, in which a new model of the routine is built from scratch, using benchmarking and polynomial regression in various ways.

2.1 Design
This process is performed only once by the designer of the Linear Algebra Routine (LAR), when the LAR is being created. The tasks to be done are:

Create the LAR: The LAR is designed and programmed if it is new. Otherwise, the code of the LAR does not have to be changed.

Model the LAR theoretically: The complexity of the LAR is studied, obtaining an analytical model of its execution time as a function of the problem size, n, the System Parameters (SP) and the Algorithmic Parameters (AP):

$$T_{exec} = f(SP, AP, n) \qquad (1)$$

SP describe how the hardware and the lower level routines affect the execution time of the LAR. SP are the costs of performing arithmetic operations by using lower level routines, for example BLAS routines [14], and the communication parameters (start-up time, $t_s$, and word-sending time, $t_w$). AP represent the possible decisions to take when executing the LAR. These are the block size, b, in block based algorithms, and the parameters defining the logical topology of the process grid or the data distribution in parallel algorithms.

Select the parameters for testing the model: The most significant values of n and AP are written in the structure Model Testing Values by the LAR designer. They will be used to test the theoretical model during the installation process. These AP values will also be used at execution time to select the theoretical optimum values when the runtime problem size is known.

Create the SOLAR manager: The SOLAR manager is also programmed by the LAR designer. This subroutine is the engine of the SOLAR. It is in charge of managing all the information inside this software architecture.
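As an illustration of how a model of the form of Eq. (1) and the stored AP values can be used together, the sketch below evaluates a model for every candidate AP combination and keeps the one with the lowest predicted time. The cost expression in toy_model, the parameter names (t_arith, ts, tw, p, b) and all numeric values are assumptions invented for this example; they are not the model or the measurements of any routine in the paper.

```python
from itertools import product


def toy_model(sp, ap, n):
    """Hypothetical T_exec = f(SP, AP, n) on p processes with block size b."""
    p, b = ap["p"], ap["b"]
    arithmetic = sp["t_arith"] * n**3 / p
    communication = (p - 1) * (sp["ts"] * n / b + sp["tw"] * n * n / p)
    return arithmetic + communication


# Stand-in for the structure holding the most significant n and AP values.
MODEL_TESTING_VALUES = {"n": [512, 1024, 2048], "p": [1, 2, 4, 8], "b": [32, 64]}

# Made-up system parameters (seconds per operation, start-up, per word).
SP = {"t_arith": 1e-9, "ts": 1e-4, "tw": 1e-7}


def candidate_aps(values):
    """Every combination of the stored algorithmic parameter values."""
    return [{"p": p, "b": b} for p, b in product(values["p"], values["b"])]


def best_ap(model, sp, n, values):
    """AP combination with the lowest modelled execution time for size n."""
    return min(candidate_aps(values), key=lambda ap: model(sp, ap, n))


BEST_LARGE = best_ap(toy_model, SP, 2048, MODEL_TESTING_VALUES)
```

With these made-up parameters the communication terms dominate for very small problems, so the search prefers a single process there, while for n = 2048 it selects the largest number of processes.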
2.2 Installation
In the installation process of a SOLAR, the tasks performed by the SOLAR manager are:

Execute the LAR: It executes the LAR using the different values of n and AP contained in the structure Model Testing Values.

Test the model: It compares all these execution times with the theoretical times provided by the model for the same values of n and AP, using the SP values returned by the previously installed lower level SOLARs¹. If the average distance between theoretical and experimental times exceeds an error quota, the SOLAR manager remodels the LAR and/or its set of lower level LARs.

The new model of a LAR is built from the original model by designing a polynomial scheme for the different combinations of n and AP in the structure Model Testing Values. In order to determine the coefficients of each term of this polynomial scheme, different methods can be applied in the following order: FIxed Minimal Executions (FI-ME), VAriable Minimal Executions (VA-ME), FIxed Least Squares (FI-LS) and VAriable Least Squares (VA-LS). As soon as one of them provides enough accuracy, the remaining methods are discarded.

FI-ME: The experimental time function is approximated by a single polynomial. The coefficients are obtained with the minimum number of executions needed.

VA-ME: The experimental time function is approximated by a set of p polynomials corresponding to p intervals of combinations of (n, AP). For each of these intervals the above method is applied.

FI-LS: In this method, as in the FI-ME method, the experimental time function is approximated by a single polynomial. Here the coefficients are obtained with a least squares method that minimises the distance between the experimental and the theoretical time for a number of combinations of (n, AP).

VA-LS: As in the VA-ME method, the execution time function is divided into p intervals, and the FI-LS method is applied in each.

2.3 Execution
When a SOLAR is called to solve a problem of size nR, the following tasks are carried out:
Collecting system information: The SOLAR manager collects the information that the NWS [15] (or any other similar tool installed on the platform) provides about the current state of the system (CPU load and network load between each pair of nodes).
Tuning the model: The SOLAR manager tunes the Tested_Model according to the system conditions. Basically, the arithmetic parts of the model change inversely to the availability of CPU, and the communication parts change inversely to the availability of the network.
Selection of the Optimum AP: Using the Tuned_Model, with n = nR, the SOLAR manager looks for the combination of AP values that provides the lowest execution time.
Execution of the LAR: Finally, the SOLAR manager calls the LAR with the parameters (nR, Optimum AP).
¹ When a SOLAR receives a request from an upper level SOLAR about its execution time for a specific combination of (n, AP), it substitutes these values in its tested model, obtains the corresponding theoretical time, and sends it back to the SOLAR in question.
L.-P. García, J. Cuenca, and D. Giménez
3
Self-optimised Parallel Linear Algebra Routine Examples
In this section the procedure described in Section 2 is applied to one version of the parallel matrix-matrix multiplication. We evaluate the theoretical costs of the computation and the communications used in this algorithm. The idea is to show, through this particular application, that it is possible to automatically obtain the best possible execution time for given input parameters (n, AP), and to illustrate the scheme followed by a SOLAR manager to build (if necessary) the models of its set of lower level routines and/or a new model for a LAR.
3.1
Parallel Matrix-Matrix Multiplication
In the implemented version of the parallel matrix-matrix multiplication $C_{n \times n} = A_{n \times n} \times B_{n \times n}$, the matrix B is replicated onto the p processes and the matrix A is partitioned into p blocks of size $\frac{n}{p} \times n$. The process owning matrices A and B distributes the blocks of A and the whole matrix B to all the other processes. After the communication step, each process computes a submatrix $C_i = A_i \times B$, where $C_i$ and $A_i$ are of size $\frac{n}{p} \times n$ and B is of size $n \times n$. In order to build the analytical model of the execution time, we identify different parts of the algorithm with different basic computation and communication routines:
– A routine that uses point-to-point communications distributes the blocks of matrix A onto p − 1 processes.
– A broadcast communication routine is used to distribute the matrix B.
– The BLAS3 function DGEMM is used to compute the p matrix multiplications of size $\frac{n}{p} \times n \times n$.
Thus, the execution time for this algorithm can be modelled by the formula:
$$T(n, p) = t_{mult}\left(\frac{n^3}{p}\right) + (p - 1)\, t_{Send}\left(\frac{n^2}{p}\right) + t_{Bcast}\left(n^2\right) \qquad (2)$$
where $t_{mult}(\frac{n^3}{p})$ is the theoretical execution time for one matrix multiplication of size $\frac{n^3}{p}$, $t_{Send}(\frac{n^2}{p})$ is the theoretical time for a point-to-point message of size $\frac{n^2}{p}$, and $t_{Bcast}(n^2)$ is the theoretical time for a broadcast message of size $n^2$.
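As an illustration of how formula (2) can drive the choice of p, the following minimal Python sketch evaluates the model and picks the candidate with the lowest predicted time. The linear forms of the three cost functions and all constants (k3, ts, tw) are hypothetical placeholders chosen for the example, not values from the paper, which obtains these costs from the lower level SOLARs.

```python
# Sketch of Eq. (2): T(n, p) = t_mult(n^3/p) + (p-1) * t_Send(n^2/p) + t_Bcast(n^2).
# All constants below are hypothetical placeholders, not measured values.

def t_mult(work, k3=1.0e-9):
    """Assumed cost of a multiplication involving `work` = n^3/p operations."""
    return k3 * work

def t_send(words, ts=5.0e-5, tw=1.0e-8):
    """Assumed linear point-to-point model: start-up time plus per-word time."""
    return ts + tw * words

def t_bcast(words, ts=1.0e-4, tw=2.0e-8):
    """Assumed linear broadcast model with its own start-up and per-word time."""
    return ts + tw * words

def predicted_time(n, p):
    """Evaluate the analytical model of Eq. (2) for matrix size n and p processes."""
    return t_mult(n**3 / p) + (p - 1) * t_send(n**2 / p) + t_bcast(n**2)

def best_p(n, candidates):
    """Return the candidate number of processes with the lowest predicted time."""
    return min(candidates, key=lambda p: predicted_time(n, p))

if __name__ == "__main__":
    for n in (8, 4096):
        print(n, best_p(n, [2, 4, 8, 16]))
```

With these placeholder constants the start-up cost dominates for very small matrices, so fewer processes are preferred, while for large matrices the computation term dominates and more processes pay off.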
3.2
Experimental Results
The experiments have been performed on two different systems: Rosebud, two Intel Itanium 2 systems connected by Gigabit Ethernet, where each node is equipped with four dual-core 1.4 GHz Montecito processors, i.e. 8 processors per node; and Sol, two Intel Xeon systems connected by Gigabit Ethernet, where each node is equipped with two dual-core 3.0 GHz 5050 processors, i.e. 4 processors per node. Optimized versions of BLAS [14], provided by the manufacturers, have been used for the basic routine DGEMM. The library used for the communications is MPI: MPICH2 on Rosebud and LAM/MPI on Sol. In both cases, for problem sizes less than 3000, the experiments are performed twelve times, the minimum and maximum values are discarded and an average value is obtained.
Modelling the computations. In the experiments on Rosebud and Sol we found that a third order polynomial function was the best for the BLAS3 DGEMM routine, therefore we use this number of coefficients for its SOLAR. These coefficients can be calculated using any of the four methods described in Section 2, but on Rosebud and Sol good results were obtained using the FIxed Least Squares (FI-LS) method. The values of the coefficients are obtained by performing a set of multiplications of rectangular matrices, as in the algorithm. If the values are obtained with square matrices, the theoretical prediction is much worse. In addition, since Rosebud is a system with dual-core processors and shared memory access, it was necessary to perform several simultaneous executions of the routine and to use an average value for the coefficients of the polynomial that models DGEMM.
Modelling the communications. Initially the SOLAR could use a linear model for the communications, obtained from a ping-pong between two processes, and try to model all the communications with it; but the start-up, ts, and word-sending, tw, times differ between point-to-point communications and a broadcast.
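The linear communication model mentioned above, t(m) = ts + tw · m for a message of m words, can be fitted to ping-pong timings by ordinary least squares. The sketch below is a minimal pure-Python version under that assumption; the function name and the sample data in the usage note are hypothetical, and the same least-squares idea with more terms underlies the FI-LS method.

```python
# Least-squares fit of the linear communication model t(m) = ts + tw * m.
# Function name and data layout are illustrative assumptions.

def fit_linear_comm_model(sizes, times):
    """Return (ts, tw) minimising the squared error of ts + tw * m over the samples."""
    k = len(sizes)
    if k < 2 or k != len(times):
        raise ValueError("need at least two (size, time) samples of equal length")
    mean_m = sum(sizes) / k
    mean_t = sum(times) / k
    sxx = sum((m - mean_m) ** 2 for m in sizes)
    if sxx == 0:
        raise ValueError("message sizes must not all be equal")
    sxy = sum((m - mean_m) * (t - mean_t) for m, t in zip(sizes, times))
    tw = sxy / sxx             # word-sending time (slope)
    ts = mean_t - tw * mean_m  # start-up time (intercept)
    return ts, tw
```

For example, calling `fit_linear_comm_model([1000, 2000, 4000, 8000], times)` with timings generated from a known ts and tw recovers those two values up to rounding error.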
Since the values of ts and tw depend on the number of processes, a polynomial function was obtained for each case. For the routine that distributes the blocks of matrix A, it was necessary to consider the access scheme used before sending the data with the routine MPI_Send, i.e. a model was obtained for the whole routine and not only for MPI_Send. On Rosebud, for the broadcast routine MPI_Bcast, when the number of processes p was greater than twelve it was necessary to use the VAriable Least Squares (VA-LS) method. Table 1 compares the execution time provided by the model (mod.) and the experimental time (exp.) for the MPI_Bcast routine when using the FI-LS method and when using the VA-LS method, along with the deviation (dev.), $|t_{mod} - t_{exp}| / t_{exp}$, with p = 16 processes.
Discussion of experimental results. The numbers in Table 2 represent the preferential order in which the values of the parameter p are selected by the model (mod.), and the order obtained from the experiments (exp.), on Rosebud. The
Table 1. Comparison of the experimental and theoretical execution times (in seconds) using the FI-LS and VA-LS methods for the MPI_Bcast routine in Rosebud for p = 16 processes

n      exp.     FI-LS    dev. (%)   VA-LS    dev. (%)
512    0.0528   0.0495    6.13      0.0507   3.96
1536   0.4603   0.4490    2.46      0.4605   0.04
2048   0.8157   0.7986    2.10      0.8277   1.47
3072   2.0998   1.7973   14.40      1.9823   5.59
4096   3.4852   3.1956    8.31      3.5369   1.48
5120   5.4764   4.9933    8.82      5.5356   1.08
6144   8.0366   7.1905   10.53      7.9785   0.72
6656   9.0876   8.4390    7.14      9.3666   3.07
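The deviation column of Table 1 can be recomputed from the times with the formula |t_mod − t_exp| / t_exp. The following short Python sketch does so for the values copied from the table; small differences from the printed percentages are to be expected, because the table shows rounded times. The helper names are illustrative only.

```python
# Times in seconds copied from Table 1 (MPI_Bcast on Rosebud, p = 16).
# Each entry: n -> (experimental, FI-LS model, VA-LS model).
TABLE1 = {
    512: (0.0528, 0.0495, 0.0507),
    1536: (0.4603, 0.4490, 0.4605),
    2048: (0.8157, 0.7986, 0.8277),
    3072: (2.0998, 1.7973, 1.9823),
    4096: (3.4852, 3.1956, 3.5369),
    5120: (5.4764, 4.9933, 5.5356),
    6144: (8.0366, 7.1905, 7.9785),
    6656: (9.0876, 8.4390, 9.3666),
}

def deviation(model, experimental):
    """Relative deviation of the model from the experiment, in percent."""
    return abs(model - experimental) / experimental * 100.0

def mean_deviation(column):
    """Mean deviation over all sizes; column 1 is FI-LS, column 2 is VA-LS."""
    devs = [deviation(row[column], row[0]) for row in TABLE1.values()]
    return sum(devs) / len(devs)

if __name__ == "__main__":
    print("FI-LS mean deviation (%):", mean_deviation(1))
    print("VA-LS mean deviation (%):", mean_deviation(2))
```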
Table 2. Different processes selection, ordered by performance, in Rosebud

        p = 4                   p = 8                   p = 12                  p = 16
n       exp.  mod.  dev. (%)    exp.  mod.  dev. (%)    exp.  mod.  dev. (%)    exp.  mod.  dev. (%)
1024    2nd   1st   2.49        1st   2nd   2.49        3rd   3rd   0           4th   4th   0
2048    2nd   2nd   0           1st   1st   0           3rd   3rd   0           4th   4th   0
2560    2nd   2nd   0           1st   1st   0           3rd   3rd   0           4th   4th   0
4096    4th   4th   0           1st   1st   0           2nd   3rd   0.37        3rd   2nd   0.37
5120    4th   4th   0           1st   1st   0           3rd   3rd   0           2nd   2nd   0
5632    4th   4th   0           1st   1st   0           3rd   3rd   0           2nd   2nd   0
6144    4th   4th   0           1st   1st   0           3rd   3rd   0           2nd   2nd   0
6656    4th   4th   0           1st   1st   0           3rd   3rd   0           2nd   2nd   0
preferential order of p varies for different problem sizes, but with the model and with the inclusion of the VAriable Least Squares (VA-LS) method for different sizes of broadcast messages a satisfactory selection of p is made in almost all the cases. A value of 0 in the deviation (dev.) means that the value of p selected by our method coincides with the one obtained experimentally. In the cases where the selected value of p does not coincide with the experimental one, the deviation in the execution time is very low, and the mean of the deviations is 0.38%.
In Sol, the best selection is to execute the routine with 8 processes, but with problem sizes lower than 1024 it is advisable to execute the routine with 4 processes. Table 3 shows the experimental execution time, the theoretical execution time, and the deviation between the two for different problem sizes and numbers of processes. The optimum experimental and theoretical times are highlighted. The table shows that the model successfully predicts the number of processes with which the optimum execution times are obtained, except for matrix size 1024. In this case the execution time obtained with the values provided by the model is about 5.41% higher than the optimum experimental time.
Remodelling parallel matrix multiply. In the methodology proposed, the SOLAR manager could build a new model for a LAR. The method consists of
Table 3. Comparison of the experimental and theoretical execution times (in seconds) for different number of processes, in Sol.

       n          1024    2048    2560    3072    3584     4096
p=2    exp.       0.413   2.695   4.913   8.254   12.694   18.826
       the.       0.444   2.760   5.081   8.424   12.972   18.911
       dev. (%)   7.61    2.40    3.42    2.05    2.19     0.45
p=4    exp.       0.351   1.945   3.359   5.449   8.136    11.783
       the.       0.338   1.847   3.270   5.263   7.917    11.325
       dev. (%)   3.58    5.04    2.63    3.41    2.68     3.89
p=6    exp.       0.367   1.870   2.997   5.037   6.822    9.488
       the.       0.362   1.791   3.055   4.775   7.000    9.796
       dev. (%)   1.44    4.24    1.94    5.22    2.61     3.25
p=8    exp.       0.370   1.740   2.750   4.436   6.166    9.033
       the.       0.312   1.507   2.549   3.949   5.753    8.008
       dev. (%)   15.68   13.38   7.32    10.97   6.70     11.36
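The selection discussed for Sol can be reproduced from Table 3: for each problem size, take the p with the lowest theoretical time, compare it with the p that is experimentally best, and compute how much time is lost when they differ. The Python sketch below uses the values copied from the table; the function names are illustrative.

```python
# Times in seconds copied from Table 3 (Sol); lists follow the order of SIZES.
SIZES = [1024, 2048, 2560, 3072, 3584, 4096]
EXP = {
    2: [0.413, 2.695, 4.913, 8.254, 12.694, 18.826],
    4: [0.351, 1.945, 3.359, 5.449, 8.136, 11.783],
    6: [0.367, 1.870, 2.997, 5.037, 6.822, 9.488],
    8: [0.370, 1.740, 2.750, 4.436, 6.166, 9.033],
}
THE = {
    2: [0.444, 2.760, 5.081, 8.424, 12.972, 18.911],
    4: [0.338, 1.847, 3.270, 5.263, 7.917, 11.325],
    6: [0.362, 1.791, 3.055, 4.775, 7.000, 9.796],
    8: [0.312, 1.507, 2.549, 3.949, 5.753, 8.008],
}

def best_p(times, idx):
    """Number of processes with the lowest time for the size at position idx."""
    return min(times, key=lambda p: times[p][idx])

def selection_report():
    """Map n -> (p chosen by the model, experimentally best p, time penalty in %)."""
    report = {}
    for i, n in enumerate(SIZES):
        p_mod = best_p(THE, i)
        p_exp = best_p(EXP, i)
        penalty = (EXP[p_mod][i] - EXP[p_exp][i]) / EXP[p_exp][i] * 100.0
        report[n] = (p_mod, p_exp, penalty)
    return report

if __name__ == "__main__":
    for n, (p_mod, p_exp, penalty) in selection_report().items():
        print(n, p_mod, p_exp, round(penalty, 2))
```

The only size for which the two selections differ is n = 1024, where the penalty agrees with the figure of about 5.41% quoted in the text.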
defining a new polynomial function for the different combinations of n and AP. For this algorithm the polynomial function used is:
$$T(n, p) = a_{3,1} n^3 p + a_{3,0} n^3 + a_{3,-1} \frac{n^3}{p} + a_{2,1} n^2 p + a_{2,0} n^2 + a_{2,-1} \frac{n^2}{p} + a_{1,1} n p + a_{1,0} n + a_{1,-1} \frac{n}{p} + a_{0,1} p + a_{0,0} + a_{0,-1} \frac{1}{p} \qquad (3)$$
and twelve coefficients must be calculated with any of the methods indicated in Section 2; good results were obtained with the FIxed Least Squares (FI-LS) method. Table 4 shows the values obtained for the coefficients in Rosebud with p = {2, 4, 6} and n = {1000, 1500, 2000, 2500, 3000, 3500, 4000}. Now the SOLAR manager verifies the new Theoretical_Model for the parallel matrix multiply algorithm. The values for the structure Model_Testing_Values are:
– Algorithmic Parameters: number of processes p = {2, 4, 6, 8}
– Matrix sizes n = {1024, 2048, 3584, 4608, 5632, 6656}

Table 4. Coefficients for the new theoretical model in Rosebud

-8.921443e-01   6.421089e-01  -8.618004e-02
 1.483827e-03  -1.059608e-03   1.282712e-04
-7.001814e-07   5.088882e-07  -3.868396e-08
 4.377307e-10  -5.207178e-11   4.620444e-12
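The twelve-term polynomial (3) combines the powers n^i, i = 0..3, with the powers p^j, j = −1, 0, 1. A minimal Python sketch of its evaluation follows. The coefficient dictionary keyed by (i, j) is an assumption made for the example, and no attempt is made to map the values of Table 4 onto it, since the text does not state which coefficient each table entry corresponds to.

```python
# Evaluate Eq. (3): T(n, p) = sum over i in 0..3, j in {1, 0, -1} of a[i, j] * n^i * p^j.
# The (i, j) -> coefficient mapping is an illustrative data layout.

ALL_TERMS = [(i, j) for i in range(3, -1, -1) for j in (1, 0, -1)]

def remodelled_time(n, p, coeffs):
    """Return T(n, p) for coefficients given as a dict {(i, j): a_ij}."""
    return sum(a * n**i * float(p)**j for (i, j), a in coeffs.items())

if __name__ == "__main__":
    # Hypothetical coefficients: every a_ij set to 1, purely to exercise the formula.
    unit = {term: 1.0 for term in ALL_TERMS}
    print(remodelled_time(2, 2, unit))
```

In an FI-LS fit, each (n, p) combination in the training set contributes one row with these twelve term values, and the coefficients are the least-squares solution of the resulting overdetermined system.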
Table 5. Comparison of the experimental and theoretical execution times (in seconds) with remodelling for different number of processes and problem sizes, in Rosebud

       n          1024    2048    3584    4608     5632     6656
p=2    exp.       0.238   1.794   9.012   18.824   34.014   55.781
       the.       0.235   1.752   8.955   18.695   33.710   55.133
       dev. (%)   0.91    2.35    0.64    0.68     0.89     1.16
p=4    exp.       0.167   1.147   5.181   10.497   18.640   30.135
       the.       0.164   1.118   5.237   10.490   18.316   29.205
       dev. (%)   1.98    2.55    1.09    0.08     1.74     3.09
p=6    exp.       0.159   1.109   4.227   8.202    14.145   22.453
       the.       0.153   0.978   4.117   7.935    13.497   21.117
       dev. (%)   4.19    11.86   2.60    3.26     4.58     5.95
p=8    exp.       0.165   1.066   4.029   7.559    12.517   19.499
       the.       0.157   0.962   3.647   6.793    11.322   17.490
       dev. (%)   5.00    9.71    9.48    10.14    9.54     10.30
Table 5 shows the execution time provided by the model (the.) and the experimental time (exp.) for different problem sizes and numbers of processes, along with the deviation (dev.). The optimum experimental and theoretical times are highlighted. The table shows that the model successfully predicts the number of processes with which the optimum execution times are obtained. The relative error ranges from 0.08% to 11.86%. This means that the new Theoretical_Model could be the Tested_Model. The values of the algorithmic parameters vary for different systems and problem sizes, but with the model and with the possibility of remodelling, a satisfactory selection of the parameters is made in all the cases, enabling us to take the appropriate decisions about their values prior to the execution. Thus, we can conclude that the methodology proposed can be used to obtain execution times close to the optimum without user intervention.
4
Conclusions and Future Works
The modelling of a routine allows us to introduce information about its behavior in the tuning process, and so guides the process. In order to improve the theoretical model of the routines, different techniques of using experimental data have been studied in this work, and satisfactory results have been obtained. Currently, our research group is working on the inclusion of metaheuristic techniques in the modelling [16], and on applying the same methodology to other types of routines and algorithmic schemes [17] [18].
References
1. Frigo, M.: FFTW: An adaptive software architecture for the FFT. In: Proceedings of the ICASSP conference, vol. 3, pp. 1381–1384 (1998)
2. Whaley, R.C., Petitet, A., Dongarra, J.J.: Automated empirical optimizations of software and the ATLAS project. Parallel Computing 27(1-2), 3–35 (2001)
3. Dongarra, J., Eijkhout, V.: Self-adapting numerical software for next generation applications. In: ICL Technical Report, ICL-UT-02-07 (2002)
4. Vuduc, R., Demmel, J.W., Bilmes, J.: Statistical models for automatic performance tuning. In: Proceedings of the International Conference on Computational Science, ICCS, pp. 117–126 (2001)
5. Singer, B., Veloso, M.: Learning to predict performance from formula modeling and training data. In: Proceedings of the 17th International Conference on Mach. Learn., pp. 887–894 (2000)
6. Vadhiyar, S.S., Fagg, G.E., Dongarra, J.J.: Automatically tuned collective operations. In: Proceedings of Supercomputing 2000, pp. 3–13 (2000)
7. Chen, Z., Dongarra, J., Luszczek, P., Roche, K.: LAPACK for clusters project: An example of self adapting numerical software. In: Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS 2004) 90282.1 (2004)
8. Katagiri, T., Kise, K., Honda, H., Yuba, T.: ABCLib DRSSED: A parallel eigensolver with an auto-tuning facility. Parallel Computing 32, 231–250 (2005)
9. Caron, E., Desprez, F., Suter, F.: Parallel extension of a dynamic performance forecasting tool. Scalable Computing: Practice and Experience 6(1), 57–69 (2005)
10. Katagiri, T., Kise, K., Honda, H., Yuba, T.: FIBER: A generalized framework for auto-tuning software. In: Veidenbaum, A., Joe, K., Amano, H., Aiso, H. (eds.) ISHPC 2003. LNCS, vol. 2858, pp. 146–159. Springer, Heidelberg (2003)
11. Tanaka, T., Katagiri, T., Yuba, T.: d-spline based incremental parameter estimation in automatic performance tuning. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 3–13. Springer, Heidelberg (2007)
12. Lastovetsky, A., Reddy, R., Higgins, R.: Building the functional performance model of a processor. In: Biham, E., Youssef, A.M. (eds.) SAC 2006. LNCS, vol. 4356, pp. 23–27. Springer, Heidelberg (2007)
13. Cuenca, J., Giménez, D., González, J.: Architecture of an automatic tuned linear algebra library. Parallel Computing 30(2), 187–220 (2004)
14. Dongarra, J., Croz, J.D., Duff, I.S.: A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Software 14, 1–17 (1988)
15. Wolski, R., Spring, N.T., Hayes, J.: The network weather service: a distributed resource performance forecasting service for metacomputing. Journal of Future Generation Computing System 15(5–6), 757–768 (1999)
16. Martínez-Gallar, J.P., Almeida, F., Giménez, D.: Mapping in heterogeneous systems with heuristical methods. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, Springer, Heidelberg (2007)
17. Cuenca, J., Giménez, D., Martínez-Gallar, J.P.: Heuristics for work distribution of a homogeneous parallel dynamic programming scheme on heterogeneous systems. Parallel Computing 31, 735–771 (2005)
18. Carmo-Boratto, M.D., Giménez, D., Vidal, A.M.: Automatic parametrization on divide-and-conquer algorithms. In: Proceedings of International Congress of Mathematicians (2006)
Comparison of Execution Time Decomposition Methods for Performance Evaluation
Jan Kwiatkowski, Marcin Pawlik, and Dariusz Konieczny
Institute of Applied Informatics, Wroclaw University of Technology, Wybrzeże Wyspiańskiego 27, 50-370 Wroclaw, Poland
{jan.kwiatkowski, marcin.pawlik, dariusz konieczny}@.pwr.wroc.pl
Abstract. With the advent of multi-core processors and the growing popularity of local cluster installations, a better understanding of parallel application behavior becomes a necessity. It can be argued that the rising popularity of parallelization results in a dire need for methods and tools capable of automatic analysis and prediction of parallel application efficiency. Traditional methods of performance evaluation based on wall-clock time measurements require consecutive application executions or, when a detailed application profile is created, involve a time-consuming data analysis. In the paper an alternative approach is analyzed. Utilizing the execution time decomposition, a separate analysis of the computations and overhead time is performed to determine the efficiency of the analyzed application.
1
Introduction
With the advent of multi-core processors and the growing popularity of local cluster installations, a better understanding of parallel application behaviour becomes a necessity. The plans to build a sustainable HPC ecosystem for Europe will make the hardware installations needed for the execution of large-scale parallel applications even more accessible. It can be argued that the rising popularity of parallelization results in a dire need for methods and tools capable of automatic analysis and prediction of parallel application efficiency. Be it a small cluster or a pan-European grid installation, the possibility to determine the efficiency of parallel applications will always be welcome, if not required. Performance evaluation can be employed to raise the usefulness of all parallel and distributed processing environments. The results of the evaluation are suitable for resource allocation, load balancing, on-demand reservation of dynamic resources, quality of service negotiations, runtime and wait time predictions, etc.
During parallel program performance evaluation various metrics are used: the runtime, speedup and efficiency are all usually measured using the wall-clock time [6]. Although the wall-clock time is indisputably one of the most important factors to be measured, it can be shown that concentrating only on this single value can hide important information. Based on the decomposition of the execution time into the time devoted to the computations and the overhead time, the Execution Time Decomposition (ETD) [1] based analysis can be carried out, providing not only insight into the application execution but also the possibility to perform fast and accurate performance estimations. The accuracy of techniques based on internal and external measurements is evaluated and compared to the results achieved with the classical analysis.
The structure of the paper is as follows. Section 2 briefly describes different techniques and metrics used during performance evaluation. The next section is dedicated to the description of the ETD approaches analyzed in the paper. The fourth section discusses the experiments performed and the achieved results. The final section concludes the paper and presents the future work plans.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1160–1169, 2008. © Springer-Verlag Berlin Heidelberg 2008
2
Performance Evaluation Methods and Metrics
Performance evaluation constitutes an intrinsic part of every application development process. The additional degrees of freedom present in the realm of parallel computing make this analysis even more important for the development of correct and efficient parallel programs. The performance achieved by a parallel application depends not only on the application features but also on the interactions between the hardware and software resources of the system where the application is executed. Application characteristics such as algorithmic structure, input parameters and problem size influence these interactions by determining how the application exploits the available resources and the allocated processors. This means that when designing a parallel application the goal is not to optimize a single metric such as speed, which is sufficient in the case of sequential applications. A good design has to take into consideration a problem-specific function of execution time, memory requirements, implementation cost, and others.
In general, performance analysis can be carried out analytically or through experiments. The first case is called performance modeling, the second performance measurement. Performance modeling can be divided into simulation-based modeling and analytical modeling. There are three general classes of simulation modeling techniques. The first one is profile-based, static modeling, which requires an application execution profile that indicates how often each instruction is executed. The second technique is trace-driven simulation, which requires the sequence of instruction execution. The last and most sophisticated technique is execution-driven simulation, which enables a detailed simulation of the whole system.
In the case of analytical modeling, different approaches are used depending on the underlying theory, for example queuing models, Markov models, Petri net models, etc. Performance modeling is very useful during the design phase, but when the aim is to evaluate an existing solution it is better and easier to use performance measurement. A typical approach to performance measurement is based on instrumenting the application, monitoring its execution and finally analyzing its performance from the collected information with a dedicated tool. The instrumentation can use specific hardware features, for example the performance counters, as well as architectural features such as trap or breakpoint instructions. In performance measurement two different approaches can be distinguished: in the first, the time measurement is included as additional code in the evaluated application; in the second, a specific external tool, for example a profiler, is used for time measurement. The second approach can be further divided into two classes of tools: those that modify the existing operating system and those that are autonomous.
Independently of the evaluation method used, analytical or experimental, the same performance metrics are used [5]. The first one is the parallel run time. It is the time from the moment when the computation starts to the moment when the last processor finishes its execution. The parallel run time is composed, on average, of three different components: computation time, communication time and idle time. Therefore, the parallel run time depends not only on the size of the problem but also on the complexity of the interconnection network and the number of processors used. This definition is ambiguous, however: what is measured, wall-clock time, CPU time or something else? Moreover, different execution overheads, related mainly to the operating system, can appear during parallel as well as sequential program execution. The wall-clock time, which captures all of these overheads, should therefore be preferred during performance measurement. On the other hand, it can be shown that concentrating only on this single value can hide important information. The parallel wall-clock runtime is composed of two different components: the computation time and the overhead introduced by the parallel execution. The first represents the time when the CPU performs computations. The second captures the time when the processes perform communication or are idle, waiting for other processes to deliver necessary data.
The amount of time spent in these states depends not only on the utilized algorithm but also on various properties of the hardware and software environment. Different processor cache utilization patterns, unequal memory access speeds, variations in communication buffer sizes and similar characteristics can contribute to program behaviour that is clearly observable but not easy to understand and explain.
Most other metrics are related to the parallel run time. The next frequently used metric is speedup, which captures the relative benefit of solving a given problem using a parallel system. There are different speedup definitions. In general, the speedup is defined as the ratio of the time needed to solve the problem on a single processor to the time required to solve the same problem on a parallel system with p processors. Depending on the way in which the sequential time is measured, we can distinguish absolute, real and relative speedups. In the paper the relative speedup is used, with the sequential time defined as the time of executing the parallel program on one of the processors of the parallel computer. Typically, speedup cannot exceed the number of processors used during program execution; however, different speedup anomalies can be observed. In theory, a speedup anomaly occurs when the program does not execute in the way predicted by the performance model [3]. When the observed application changes its behaviour to one not consistent with the expectations, the usual solution is to perform a detailed analysis, switching from the rough measurements based on the wall-clock time to a performance evaluation based on careful examination of the execution profile. This approach leads to a precise explanation of the program behaviour, including the phenomena observed, but requires full and detailed knowledge not only of the algorithm that the examined application uses but also of the precise characteristics of the utilized hardware elements and detailed information about the executed machine-level instructions. This process, although indisputably fruitful, can sometimes be impossible to conduct.
Neither of the performance metrics discussed above takes into account the utilization of processors in the parallel system. While executing a parallel algorithm, processors spend some time on communication and some processors can be idle. The efficiency of a parallel program is therefore defined as the ratio of the speedup to the number of processors used, and represents the fraction of the time that the processors spend on program execution. In the ideal parallel system the efficiency is equal to one, but in practice it is between zero and one. The next measure, which is often used, is the cost of solving a problem by the parallel system. Usually it is defined as the product of the parallel run time and the number of processors. The next useful measure is the scalability of the parallel system. It is a measure of its capacity to increase speedup in proportion to the number of processors. One can say that a system is scalable when the efficiency stays the same for an increasing number of processors and an increasing size of the problem.
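The metrics defined in this section translate directly into code. The following minimal Python sketch computes the relative speedup, the efficiency and the cost from a sequential time, a parallel run time and the number of processors; the function names and the sample numbers in the usage example are illustrative only.

```python
# Basic parallel performance metrics as defined in the text.

def speedup(t_seq, t_par):
    """Ratio of the single-processor time to the parallel run time."""
    return t_seq / t_par

def efficiency(t_seq, t_par, p):
    """Speedup divided by the number of processors used."""
    return speedup(t_seq, t_par) / p

def cost(t_par, p):
    """Product of the parallel run time and the number of processors."""
    return t_par * p

if __name__ == "__main__":
    # Hypothetical run: 100 s on one processor, 25 s on 8 processors.
    print(speedup(100.0, 25.0), efficiency(100.0, 25.0, 8), cost(25.0, 8))
```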
3
ETD-Based Estimation Methods
The ETD-based estimation is built upon the analysis of the fundamentals of the application timeline behavior. To explain them, the theoretical and measured time values need to be compared. In the following equations the lower indices denote the type of the time value analyzed (computation, overhead, etc.). The upper index, in the form p,k, gives the number of all the processors the analysis is performed for (p) and the processor index (k). The time of a parallel algorithm execution can be defined as composed of the computations time ($t^{p,k}_{comp}$), the communication time ($t^{p,k}_{ovhd\_comm}$) and the time processors spend idle waiting for the others to finish their computations ($t^{p,k}_{ovhd\_idle}$). In the case of MPI-based applications two additional time periods can be introduced: the time before the MPI_Init() function call is finished ($t^{p,k}_{init}$) and the time from the start of the MPI_Finalize() call until the application end ($t^{p,k}_{final}$). The application runtime is thus described by the equation:
$$t^{p,k}_{wall} = t^{p,k}_{init} + t^{p,k}_{comp} + t^{p,k}_{ovhd\_comm} + t^{p,k}_{ovhd\_idle} + t^{p,k}_{final}$$
For the sake of the following analysis, the concept of the fundamental computations time ($t^{p,k}_{comp\_fund}$) is defined as the time of the computational operations present in the algorithm when it is executed on a single processor. This value describes the time of operations always present in the parallel algorithm, no matter how many processors are used. The parallelization of the algorithm can introduce additional computational overhead ($t^{p,k}_{ovhd\_comp}$). The computations time is thus described by the equation:
$$t^{p,k}_{comp} = t^{p,k}_{comp\_fund} + t^{p,k}_{ovhd\_comp}$$
The communication time ($t_{ovhd\_comm}$) is defined as the time processors spend sending the data to other processors or (if synchronous transfer is used) waiting for other parties to receive the data sent. The idle time ($t_{ovhd\_idle}$) is the time processors spend waiting for the needed data to arrive.
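The decomposition above can be written down as a small data structure holding the six time components of one process, with the computations time and the wall-clock time derived from them exactly as in the two equations. The class and field names in this Python sketch are illustrative, and the sample values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ProcessTimes:
    """Execution time components (in seconds) of process k out of p."""
    init: float       # time until MPI_Init() has finished
    comp_fund: float  # fundamental computations time
    ovhd_comp: float  # computational overhead introduced by parallelization
    ovhd_comm: float  # communication time
    ovhd_idle: float  # idle time spent waiting for data
    final: float      # time from the start of MPI_Finalize() to the end

    @property
    def comp(self):
        """t_comp = t_comp_fund + t_ovhd_comp."""
        return self.comp_fund + self.ovhd_comp

    @property
    def wall(self):
        """t_wall = t_init + t_comp + t_ovhd_comm + t_ovhd_idle + t_final."""
        return self.init + self.comp + self.ovhd_comm + self.ovhd_idle + self.final

if __name__ == "__main__":
    t = ProcessTimes(init=0.5, comp_fund=8.0, ovhd_comp=0.5,
                     ovhd_comm=1.0, ovhd_idle=1.5, final=0.5)
    print(t.comp, t.wall)
```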
tcomp fund
tovhd comp
tovhd comm
tovhd idle
tfinal
wall comp fund
MPI prof. wall MPI prof. comp fund
rsus wall rsus comp fund
- correctly omitted in measurement - correctly included in measurement - erroneously omitted in measurement - erroneously included in measurement - partially erroneously included in measurement - partially erroneously omitted in measurement
Fig. 1. Time measurements in the ETD methods
In the paper two important methods of efficiency measurement and approximation are described: the method based on external measurement of the CPU time (rsus) and the method based on results gathered from MPI profiling (MPI prof.). In both methods the wall-clock time ($t^{p,k}_{wall}$) is measured on every processor, and measurements are performed to estimate the amount of fundamental time ($t^{p,k}_{comp\_fund}$). The final efficiency value is estimated as:
$$E \approx \frac{\sum_{k=1}^{p} t^{p,k}_{comp\_fund}}{\sum_{k=1}^{p} t^{p,k}_{wall}}$$
In the case of popular MPI profilers (e.g. MPIP) the wall-clock time is measured with the exclusion of the initialization and finalization times and the fundamental time is approximated utilizing the difference between the wall-clock time and the time spent on MPI-related operations. CPU-time based measurements utilize complete wall-clock time measurements and the fundamental time is approximated by the CPU-time measured.
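The two approximations described above differ only in which wall-clock time and which estimate of the fundamental time they feed into the efficiency formula. The following Python sketch makes this explicit; the function names and the per-process input lists are assumptions made for illustration and do not correspond to the interfaces of any actual profiler or of the RSUS tool.

```python
# ETD efficiency estimate: E ~ sum_k t_comp_fund(k) / sum_k t_wall(k).
# Inputs are per-process lists of times in seconds (illustrative layout).

def etd_efficiency(fund_times, wall_times):
    """Ratio of the summed fundamental times to the summed wall-clock times."""
    return sum(fund_times) / sum(wall_times)

def profiler_estimate(profiled_wall, mpi_time):
    """MPI-profiler style estimate.

    profiled_wall excludes initialization and finalization; the fundamental
    time is approximated as the wall-clock time minus the time spent in MPI calls.
    """
    fund = [w - m for w, m in zip(profiled_wall, mpi_time)]
    return etd_efficiency(fund, profiled_wall)

def cpu_time_estimate(full_wall, cpu_time):
    """CPU-time style estimate: complete wall-clock time, fundamental time ~ CPU time."""
    return etd_efficiency(cpu_time, full_wall)

if __name__ == "__main__":
    # Hypothetical measurements for two processes.
    print(profiler_estimate([10.0, 10.0], [2.0, 3.0]))
    print(cpu_time_estimate([12.0, 12.0], [9.0, 9.0]))
```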
An MPI-based application has to call MPI_Init() before any other MPI-related function. At the end of the MPI-related section the application calls the MPI_Finalize() function. Usually this call is also the last one in the application, but it is theoretically possible to perform an arbitrary number of non-MPI operations after it. In the case of the MPI profiler, the measurements start after the MPI_Init() call and terminate before the MPI_Finalize() call, resulting in the inability to measure the initialization and finalization times ($t_{init}$ and $t_{final}$). The measurements erroneously include in the overhead time all the computational operations that may be present in MPI functions and should be attributed to the fundamental time. On the other hand, additional computations induced by parallelization and performed outside the MPI functions may be erroneously considered to be a part of the fundamental time. In the case of CPU-time based measurements, the wall-clock time is measured correctly. The fundamental time measurement erroneously includes additional operations such as additional computations induced by parallelization, a part of the operations needed during communication (included in $t_{ovhd\_comp}$ and $t_{ovhd\_idle}$), and computational operations occurring before the MPI_Init() call (part of $t_{init}$) and after the MPI_Finalize() call (part of $t_{final}$). The wall-clock time and fundamental time measurements as performed by the ETD methods are depicted in Figure 1. The first two lines denote the actual time values, while the next lines represent the values measured by MPI profilers and by CPU-time based measurement tools, respectively.
4 Case Studies
To confirm the theoretical analysis presented in the previous section, a series of experiments was performed. The tests were executed on two clusters: the Cumulus cluster located at the Wroclaw University of Technology and the Cacau cluster at the HLRS supercomputing center in Stuttgart. The Cumulus cluster [2] is equipped with 26 homogeneous, space-shared IBM PC Sempron 1.7 GHz nodes connected by a Fast Ethernet switch. The Cacau cluster utilizes 400 Intel Xeon 3.5 GHz CPUs connected by InfiniBand and Ethernet networks. The experiments were performed using the MPIP parallel profiler [4] and RSUS, a distributed CPU-time measurement system created specifically for low-overhead performance estimation measurements. RSUS is a distributed application whose measurement daemons monitor the application execution on all the processors involved in the computations. It offers a non-invasive, external mode of operation with low-overhead measurement of CPU-time consumption. As the third estimation method, a combined method was used, which pairs the wall-clock time measurement performed by the RSUS tool with the fundamental time approximation taken from the MPI profiler based method.
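The three estimation methods differ only in which measured quantities they pair up. The following C sketch makes that explicit for a single process. It assumes that efficiency is estimated as the ratio of fundamental time to wall-clock time; the struct, its field names, and the function names are illustrative and are not taken from MPIP or RSUS.

```c
#include <assert.h>

/* Per-process measurements; all names here are illustrative assumptions. */
typedef struct {
    double wall_full; /* complete wall-clock time, incl. initialization and finalization */
    double wall_mpi;  /* wall-clock time between MPI_Init() and MPI_Finalize() */
    double mpi_time;  /* time spent inside MPI calls, as reported by a profiler */
    double cpu_time;  /* CPU time consumed by the process */
} etd_sample;

/* Assumed definition: efficiency = fundamental time / wall-clock time. */
static double ratio(double fundamental, double wall) {
    assert(wall > 0.0);
    return fundamental / wall;
}

/* Profiler-based: both quantities exclude initialization and finalization. */
double eff_profiler(const etd_sample *s) {
    return ratio(s->wall_mpi - s->mpi_time, s->wall_mpi);
}

/* CPU-time based: complete wall-clock time, CPU time as fundamental time. */
double eff_cpu_time(const etd_sample *s) {
    return ratio(s->cpu_time, s->wall_full);
}

/* Combined: profiler's fundamental time over the complete wall-clock time. */
double eff_combined(const etd_sample *s) {
    return ratio(s->wall_mpi - s->mpi_time, s->wall_full);
}
```

With a sample of 100 s complete wall-clock time, 80 s between MPI_Init() and MPI_Finalize(), 20 s inside MPI calls and 90 s of CPU time, the three functions return 0.75, 0.9 and 0.6. The gap between the first and the last value comes entirely from the 20 s of initialization and finalization that the profiler does not see.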
J. Kwiatkowski, M. Pawlik, and D. Konieczny
[Figure: Cannon alg. size=8, Calvus - mpich. Efficiency versus number of processors for the series efficiency, rsus, profiler and combined; inset: stdev [%] versus number of processors.]
Fig. 2. Efficiency measurements - Cannon alg. on Calvus cluster
To compare the execution time decomposition methods, a selection of applications representative of HPC environments was made. After an analysis of typical production runs it was decided to perform the measurements for the programs from the NAS Parallel Benchmark suite [7], together with our own implementations of the Cannon parallel matrix multiplication algorithm and of parallel algorithms for the longest common subsequence (LCS) problem. To avoid execution time anomalies, the experiments were performed for data sizes sufficiently larger than the CPU cache size and smaller than the main memory limits [3]. Over eighty thousand application executions were measured for more than one hundred combinations of input data size and execution environment. On every processor involved in the computations the CPU time consumption was recorded ten times per second, leading to the acquisition of more than twenty-two million individual probes. Due to the lack of space, only the most important results are discussed in the following analysis. The illustrations present the efficiency in relation to the number of processors used. The chart line labelled efficiency represents the actual efficiency; the rsus, profiler and combined lines represent the efficiency estimations. A sub-chart in the upper right corner presents the standard deviation of the results as a percentage of the mean value.
[Figure: Cannon alg. size=8, Cacau - mpich. Efficiency versus number of processors for the series efficiency, rsus, profiler and combined; inset: stdev [%] versus number of processors.]
Fig. 3. Efficiency measurements - Cannon alg. on Cacau cluster
During the initial tests performed on the Calvus cluster (Fast Ethernet with the MPICH library) all the tested methods behaved in a similar manner, offering estimations closely matching the actual efficiency. A typical example of the obtained results can be seen in Fig. 2. The results obtained with the MPIP profiler almost ideally match the actual shape of the efficiency curve. The combined method slightly underestimates the efficiency. The identical test performed on the Cacau cluster with the Ethernet interconnect and the MPICH library brings different results for the profiler-based estimation. Because of the much larger initialization time (t_init) in this environment, the initialization and finalization times ignored by this method have a considerable impact on the final wall-clock time and the efficiency. When the wall-clock time measurement is corrected, as in the combined method, the efficiency estimation becomes accurate (Fig. 3). In the case of fast interconnects (e.g. InfiniBand) another source of estimation inexactness surfaces. The estimation performed with rsus assumes that the computational overhead is part of the fundamental computation time. That assumption leads to a serious overestimation of the application efficiency. In the case of the Cacau cluster, the erroneous measurement of startup and finalization time in the profiler-based method results in a similar efficiency overestimation as in the case of the Ethernet interconnect. The use of the combined method eliminates this problem (Fig. 4).
[Figure: NAS - BT alg. size=A, Cacau - ompi-mpip. Efficiency versus number of processors for the series efficiency, rsus, profiler and combined; inset: stdev [%] versus number of processors.]
Fig. 4. Efficiency measurements - NAS BT alg. on Cacau cluster
5 Conclusions and Future Work
The paper discusses how execution time decomposition can be employed to estimate the efficiency of parallel applications. The experiments performed show that time decomposition can be used to estimate the fundamental time of a parallel application. The estimation accuracy varies for different measurement techniques and different execution environments. In environments where the amount of busy-waiting performed in MPI-related functions is minimal, a simple measurement of CPU-time consumption is sufficient to offer good estimation accuracy. In environments exhibiting larger busy-waiting overheads, the estimation can be performed using application profiling results. The estimation accuracy can be further improved if profiling and CPU-time measurement results are analyzed jointly. The results gathered indicate that an efficiency estimation can be performed using only the data from a single parallel application execution. To improve the accuracy and applicability of the ETD methods, the estimation will be extended with an analysis of the CPU-time consumption data gathered on-line during the application execution.
Acknowledgements. This research was partially supported by European Commission funded project HPC-Europa, contract number 506079.
References
1. Pawlik, M., Kwiatkowski, J., Konieczny, D.: Parallel program performance evaluation with execution time decomposition. In: Proc. of the 16th International Conference on Systems Science, Wrocław, Poland (2007)
2. Kwiatkowski, J., Pawlik, M., Frankowski, G., Balos, K., Wyrzykowski, R., Karczewski, K.: Dynamic clusters available under Clusterix Grid. Lect. Notes. Comput. Sci., Springer, Heidelberg (2006)
3. Kwiatkowski, J., Pawlik, M., Konieczny, D.: Parallel Program Execution Anomalies. In: Proc. of First International Multiconference on Computer Science and Information, Wisła, Poland (2006)
4. Vetter, J., McCracken, M.: Statistical Scalability Analysis of Communication Operations in Distributed Applications. In: Proc. ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming (2001)
5. Cremonesi, P., Rosti, E., Serrazzi, G., Smirni, E.: Performance Evaluation of Parallel Systems. Parallel Computing 25 (1999)
6. Foster, I.: Designing and Building Parallel Programs. Addison-Wesley Pub., Reading (1995)
7. Bailey, D., et al.: The NAS Parallel Benchmarks, RNR Technical Report RNR-94007 (1994)
An Extensible Timing Infrastructure for Adaptive Large-Scale Applications
Dylan Stark¹, Gabrielle Allen¹, Tom Goodale¹, Thomas Radke², and Erik Schnetter¹
¹ Center for Computation & Technology, Louisiana State University, 216 Johnston Hall, Baton Rouge, LA 70803, USA [email protected] http://www.cct.lsu.edu/
² Max Planck Institute for Gravitational Physics (Albert Einstein Institute), Am Mühlenberg 1, D-14476 Golm, Germany
Abstract. Real-time access to accurate and reliable timing information is necessary to profile scientific applications, and crucial as simulations become increasingly complex, adaptive, and large-scale. The Cactus Framework provides flexible and extensible capabilities for timing information through a well designed infrastructure and timing API. Applications built with Cactus automatically gain access to built-in timers, such as gettimeofday and getrusage, system-specific hardware clocks, and high-level interfaces such as PAPI. We describe the Cactus timer interface, its motivation, and its implementation. We then demonstrate how this timing information can be used by an example scientific application to profile itself, and to dynamically adapt itself to a changing environment at run time.
1 Introduction
Profiling has long been an important part of application development. In the early days profiling was restricted to overall performance metrics such as wall-clock time for a particular calculation or routine, and optimisation was often limited to finding better algorithms, ones which would take fewer operations. Today it is possible to access hardware counters which give developers information on metrics such as cache behaviour and floating point performance, and many tools such as SGI Speedshop, Intel's VTUNE, or the Sun Studio Performance Analyzer are available which can provide a complete performance profile of an application down to individual source lines. These tools are excellent for application developers tuning their codes, but are not useful for adaptive tuning by applications themselves, and further are limited to particular platforms or operating systems. Self-tuning of applications is becoming increasingly important in today's world of massively networked, dynamic data-driven applications [1] and the Grid computing and peta-scale applications of tomorrow.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1170–1179, 2008. © Springer-Verlag Berlin Heidelberg 2008
For modern applications it is necessary to have a programming API which allows the application to query and analyze its own performance characteristics on-the-fly. It must be easy to create caliper points between which to measure performance and to query them, and it must be possible to access the wide range of different metrics available on modern hardware. In this paper we describe the approach taken within the Cactus framework. Cactus provides a rich timing API which can be used with basic timing metrics such as wall clock or user CPU time, or more sophisticated metrics available by plugging in libraries such as the Performance API (PAPI) [2] developed at the University of Tennessee.
The Cactus Framework [3,4,5] is an open source, modular, highly portable programming environment for collaborative high performance computing. Cactus has a generic parallel toolkit for scientific computing with modules providing parallel drivers, coordinates, boundary conditions, elliptic solvers, interpolators, reduction operators, and efficient I/O in different data formats. Also, generic interface definitions (e.g. an abstract elliptic solver API) make it possible to use external packages and improved modules, which are immediately available to users of the abstract interface. Although Cactus originated in the numerical relativity community, it is now used as an enabling HPC framework for applications in many disciplines including computational fluid dynamics, coastal modeling, astrophysics, and quantum gravity. Cactus has also been a driving application for computer science research, particularly in Grid and distributed computing. For example, the so-called "Cactus-Worm" application [6] used contracts based on runtime performance to trigger migration of an astrophysics application across distributed Grid resources. Another Cactus application used MPICH-G2 to distribute a single simulation across multiple machines connected by wide area networks. Using adaptive algorithms to tune communication patterns to available bandwidth, this application showed good overall scaling [7].
By using abstract interfaces for accessing meta-information about the system on which it is running, Cactus enables an application to be aware of its surroundings in a very portable, system-independent manner. This allows users to easily implement and experiment with dynamic scenarios, such as responding to increased delays in disk I/O times, adapting algorithmic parameters to changes in an AMR (adaptive mesh refinement) grid hierarchy, or postponing analysis methods from in-line to a post-processing step. In this paper, we describe the design and implementation of the Cactus timing infrastructure. Sec. 2 covers the timing infrastructure and clock API. Sec. 3 discusses how it can be used in different application scenarios, including a new use to adaptively control checkpointing intervals for large scale simulations. In Sec. 4 the results of a case study for the checkpointing scenario are presented, and Sec. 5 compares the Cactus timing infrastructure with other packages and libraries, and explains the benefits of profiling within the application code.
2 Cactus Timing Infrastructure
Code using the Cactus framework is divided into modules, or components, called thorns. Each thorn declares an interface to Cactus and schedules a number of routines. Cactus controls the execution of these routines, providing a natural place to put caliper points to time routines. The presence of timers in the scheduling mechanism obviates the need for developers to place explicit timers in code and allows any user or other routine to obtain timing statistics for any routine used in a particular simulation by querying the internal timer database.
Cactus provides a generic and extensible timing infrastructure. This infrastructure allows the code to access timers such as gettimeofday and getrusage, system-specific hardware clocks, and high-level interfaces such as PAPI, all in a portable manner. This timing information can be accessed programmatically at runtime through the Cactus timing API and made available to the user via online application monitoring interfaces integrated in Cactus. It is also logged semi-automatically for post-mortem review.
The Cactus timing infrastructure consists of two core concepts: timers, which are used to place caliper points around sections of code and can be switched on and off or reset, and clocks, which provide the actual timing measures, such as wall-clock time or number of floating point operations. Figure 1 shows the relationship between clocks and timers. Querying a timer returns the timing results for all clocks associated with that timer. Clocks themselves can be registered with the timing infrastructure using Cactus's standard registration techniques and thus can be provided by a thorn. This provides an extensible mechanism by which extra clocks, and hence timing metrics, can be used with no modification to any of the existing timing code.
Fig. 1. Left: The relationship between timers and clocks in Cactus. Clocks are low-level entities representing e.g. hardware counters, timers are used by application code. Right: Example wall time distribution onto different stages of a simulation.
Cactus thorns offer a wide variety of clocks. Some of these clocks are only available on certain architectures, or if certain libraries have been installed. A new clock can be easily added by providing callback functions to create, destroy, start, stop, read out, and reset the clock. Clocks are not restricted to measuring time; they can measure any kind of event, e.g. discrete events such as cache misses, I/O failures, or network packet losses. Table 1 lists clocks which are currently available. Table 2 describes the Cactus clock API. Clocks are usually not used directly; they are instead encapsulated in timers.
Table 1. Available clocks in Cactus. Some of these clocks are only available on certain architectures, or if certain libraries have been installed.
Clock name     Unit     Description
gettimeofday   sec      UNIX wall time
getrusage      sec      UNIX system time
MPI_Wtime      sec      Wall time
PAPI           counts   Many hardware counters, e.g. instructions or Flop
rdtsc                   Intel CPU time stamp counter
Table 2. Cactus clock API. A clock is an object which measures certain events. Several clocks of the same kind can exist and can be running at the same time, measuring potentially overlapping durations. A clock can measure several values at the same time, e.g. multiple PAPI counters. Clocks are not meant to be called by user thorns (although this is of course possible); instead, clocks are encapsulated in timers. See also Table 3.
Function   Description
create     Create a new clock, returning a pointer to it
destroy    Destroy the clock
start      Start this clock
stop       Stop this clock
reset      Reset this clock, i.e., set the accumulated time to zero
get        Get the clock's values
set        Set the clock's values
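The callback-based design summarised in Table 2 can be illustrated with a small, self-contained C sketch. It is not the Cactus implementation: the names clock_funcs, clock_register and demo_timer, the fixed-size registry, and the single double value per clock are all assumptions made for brevity. The example clock counts discrete events rather than time, to show that a timer needs no knowledge of what its clocks measure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical clock interface: one callback per operation in Table 2. */
typedef struct {
    const char *name;
    void  *(*create)(void);
    void   (*destroy)(void *state);
    void   (*start)(void *state);
    void   (*stop)(void *state);
    void   (*reset)(void *state);
    double (*get)(const void *state);
    void   (*set)(void *state, double value);
} clock_funcs;

/* Registry of available clocks (fixed capacity keeps the sketch short). */
#define MAX_CLOCKS 8
static const clock_funcs *registry[MAX_CLOCKS];
static size_t n_clocks = 0;

/* Returns the index of the new clock, or -1 if the registry is full. */
int clock_register(const clock_funcs *funcs) {
    if (n_clocks == MAX_CLOCKS) return -1;
    registry[n_clocks] = funcs;
    return (int)n_clocks++;
}

/* A timer owns one state object per clock registered at creation time. */
typedef struct {
    void  *state[MAX_CLOCKS];
    size_t n;
} demo_timer;

void demo_timer_free(demo_timer *t) {
    if (t == NULL) return;
    for (size_t i = 0; i < t->n; ++i) registry[i]->destroy(t->state[i]);
    free(t);
}

demo_timer *demo_timer_new(void) {
    demo_timer *t = calloc(1, sizeof *t);
    if (t == NULL) return NULL;
    for (size_t i = 0; i < n_clocks; ++i) {
        t->state[i] = registry[i]->create();
        if (t->state[i] == NULL) { demo_timer_free(t); return NULL; }
        t->n = i + 1;   /* count only clocks that were created successfully */
    }
    return t;
}

void demo_timer_start(demo_timer *t) {
    for (size_t i = 0; i < t->n; ++i) registry[i]->start(t->state[i]);
}

void demo_timer_stop(demo_timer *t) {
    for (size_t i = 0; i < t->n; ++i) registry[i]->stop(t->state[i]);
}

double demo_timer_get(const demo_timer *t, size_t i) {
    assert(i < t->n);
    return registry[i]->get(t->state[i]);
}

/* Example clock: counts discrete events reported by the application. */
static long event_count = 0;
void record_event(void) { ++event_count; }

typedef struct { long accumulated; long started_at; int running; } event_clock;

static void *ev_create(void) { return calloc(1, sizeof(event_clock)); }
static void ev_destroy(void *s) { free(s); }
static void ev_start(void *s) {
    event_clock *c = s;
    c->started_at = event_count;
    c->running = 1;
}
static void ev_stop(void *s) {
    event_clock *c = s;
    if (c->running) c->accumulated += event_count - c->started_at;
    c->running = 0;
}
static void ev_reset(void *s) {
    event_clock *c = s;
    c->accumulated = 0;
    c->started_at = event_count;
}
static double ev_get(const void *s) {
    return (double)((const event_clock *)s)->accumulated;
}
static void ev_set(void *s, double v) { ((event_clock *)s)->accumulated = (long)v; }

const clock_funcs event_clock_funcs = {
    "events", ev_create, ev_destroy, ev_start, ev_stop, ev_reset, ev_get, ev_set
};
```

After clock_register(&event_clock_funcs), every timer created with demo_timer_new() reports the number of record_event() calls that occurred between its start and stop calls. Adding a second clock requires only another clock_funcs table and one more registration call, with no change to the timer code.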
The Cactus timing API is the interface which can be used to time or profile events or regions of code. Timers are usually created at startup time (or the first time a routine is entered), and they are started and stopped before and after the events that should be measured. The values of the clocks associated with a timer can be output explicitly using timer calls, or by using a Cactus facility that periodically outputs all existing timers to a log file. Table 3 gives an example of using timers; the complete API is described in the reference manual [3]. The accuracy of the timing information is limited by the accuracy of the underlying clocks. Many clocks have accuracies measured in microseconds, and are hence not suitable for profiling very short events or routines. Other clocks, such as rdtsc, have nanosecond resolution and can measure with a very fine granularity. One has to keep in mind that measuring time changes the instruction flow through the CPU, with the timing instructions often acting as barriers, so that it is impossible to measure with sub-nanosecond accuracy on today's CPU architectures. The Cactus timer interface is a high performance interface. Creating and destroying timers typically requires allocating and freeing memory, so this should not be done in inner loops. Starting and stopping timers is as efficient as the underlying clocks allow, plus the overhead of indirect function calls. (The clocks' routines are called via function pointers.)
Table 3. Example source code using Cactus timers, illustrating how a timer is created, started, stopped, and printed

    /* Create timer (once; the handle is kept between calls) */
    static int handle = -1;
    if (handle < 0) {
      handle = CCTK_TimerCreate ("Poisson: Evaluate residual");
      if (handle < 0) CCTK_WARN (CCTK_WARN_ABORT, "Could not create timer");
    }
    ... other code ...
    CCTK_TimerStartI (handle);          /* Start timer */
    ... evaluate residual ...
    CCTK_TimerStopI (handle);           /* Stop timer */
    ... other code ...
    CCTK_TimerPrintDataI (handle, -1);  /* Output all clocks of this timer */
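The example in Table 3 depends on the Cactus API. For readers without a Cactus installation, the same start/stop/accumulate pattern can be sketched with the POSIX gettimeofday call alone. The type wall_caliper and the function names below are hypothetical and are not part of Cactus; the sketch also shows why such a clock cannot resolve events shorter than a microsecond, since struct timeval carries no finer unit.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical wall-clock caliper; accumulates over repeated start/stop pairs. */
typedef struct {
    struct timeval started;
    double accumulated;   /* seconds */
    int running;
} wall_caliper;

/* Difference between two timestamps in seconds (microsecond granularity). */
double timeval_diff(const struct timeval *from, const struct timeval *to) {
    return (double)(to->tv_sec - from->tv_sec)
         + (double)(to->tv_usec - from->tv_usec) * 1e-6;
}

void caliper_reset(wall_caliper *c) {
    c->accumulated = 0.0;
    c->running = 0;
}

void caliper_start(wall_caliper *c) {
    assert(!c->running);
    gettimeofday(&c->started, NULL);
    c->running = 1;
}

void caliper_stop(wall_caliper *c) {
    struct timeval now;
    assert(c->running);
    gettimeofday(&now, NULL);
    c->accumulated += timeval_diff(&c->started, &now);
    c->running = 0;
}
```

A caller would invoke caliper_reset once, wrap the region of interest in caliper_start and caliper_stop, and read the accumulated field afterwards. As in Table 3, the caliper object should be created outside inner loops.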
3 Use Cases
In this section we present use cases for application self-profiling. We give two examples: the current use of the timing infrastructure for automated report generation, and the use of profiling information to guide adaptive control of applications. We also suggest some other application scenarios which are possible given the above timing infrastructure.
3.1 Timer Report
Cactus automatically sets up timers for each scheduled routine, as described in Sec. 2. The information from these timers is dynamically available to the application through the timer API. This information is used to provide details about application performance while the simulation is running, reporting it e.g. to standard output, via a web-accessible HTTP interface¹, or via log files. The same mechanism can also be used to influence the behaviour of the application, allowing it to adapt itself to changes in the simulation or the environment. Timer reports are generated for any Cactus application by setting the parameter Cactus::print_timing_info="full". Figure 2 shows part of such a report for one of the runs of the use case presented below. The information in the report is collected by querying the timers periodically. In this case, the two clocks available to the simulation were gettimeofday and getrusage. Figure 1 shows a graphical representation of such a report.
¹ See http://cactus.cct.lsu.edu:5555/TimerInfo/index.html for timing information for the perpetual Cactus demonstration run.

Thorn        | Scheduled routine in time bin  | gettimeofday [secs] | getrusage [secs]
======================================================================================
CarpetIOHDF5 | Evolution checkpoint routine   |         79.76328000 |      13.66692200
--------------------------------------------------------------------------------------
             | Total time for CCTK_CHECKPOINT |         79.76328000 |      13.66692200
======================================================================================
AdaptCheck   | Adaptive checkpointing startup |          0.00001300 |       0.00000000
BSSN_MoL     | Register provided slicings     |          0.00000700 |       0.00000000
======================================================================================
             | Total time for simulation      |       1417.13730900 |    1305.43354400
======================================================================================
Fig. 2. Part of the standard timer report available for any Cactus application by setting a simple parameter. This report shows the time spent in some scheduled routines.
3.2 Adaptive Checkpointing
Checkpointing is often used by applications deployed on clusters and supercomputers to provide protection against hardware and software failures, to allow for simulations which require longer run times than available on batch queues, and more recently to enable different dynamic grid computing scenarios. Cactus provides application-level checkpointing which saves a snapshot of the running simulation by writing to file all active grid variables, parameters and other state information. The checkpoint file uses a platform independent file format, and the run can be restarted either on the same machine or on a completely different architecture using a different number of processors. The current checkpointing mechanism in Cactus allows for checkpointing after initial data generation, periodic checkpointing based on iteration count, and checkpointing on termination.
We developed a new thorn AdaptCheck which dynamically controls the checkpointing characteristics of a Cactus application using real-time profiling information provided through the timer infrastructure. Writing a checkpoint file for a simulation can take a relatively long time, depending on the number of state variables to be saved, the file system characteristics, and the efficiency of the I/O layer. The time needed can also vary over the lifetime of a simulation. For example, when using adaptive mesh refinement, the amount of data to be stored varies with the number of refinement levels, which itself depends on dynamic quantities such as the truncation error. Assuming that the run is allocated some fixed amount of wall time for which it can use a resource, a growing checkpoint time would necessarily take away from the actual time spent on the problem. AdaptCheck allows the user to specify the maximum percentage of time the simulation should spend checkpointing in order to best use the fixed amount of time available on some resource.
This is a weak upper bound: the thorn guarantees that no checkpoint will be started while the current percentage of time spent checkpointing is above the specified level, but it does not guarantee that a checkpoint, once performed, will leave the percentage at or below that level. The quality of the fault tolerance provided by checkpointing depends on the frequency of the snapshots. Cactus currently allows the user to specify a checkpointing interval in terms of iterations, independent of the runtime performance of the simulation. Adaptive checkpointing in the manner described above could result in long periods of time without checkpointing. In order to prevent this, AdaptCheck also respects an upper bound on the length of wall time a simulation will progress without checkpointing. This guarantees that checkpoints will be generated with some regularity with respect to wall time. The current implementation of AdaptCheck uses the gettimeofday clock for measuring both the simulation and checkpointing durations. This will be extended to allow for the use of other user-specified clocks. We also plan to incorporate a better prediction of the time required for the next checkpoint. This will then be used to remain closer to the user-specified maximum percentage of wall time. It will also make final checkpoints, which have to be finished before the queue time is used up, more reliable. We present in Sec. 4 some results from tests using AdaptCheck to control checkpointing for an AMR code.
3.3 Future Scenarios
The previous examples illustrate two scenarios where application-side real-time profiling is used by large scale applications. Taking advantage of the flexible, well-designed timing infrastructure in Cactus, many other uses are planned. Building on the basic timing report mechanisms described in Sec. 3.1, more advanced and informative reports can be generated, for example with the web interface providing graphical interpretation of results, or automated documents could be produced in a readable format that can be easily interpreted. The technique for adaptive checkpointing can be applied to Cactus analysis thorns, whose methods are called only when output is required. As with checkpointing, it is usual to output at regular iteration intervals; a more effective mechanism would involve choosing the output frequency dynamically based both on user requirements and performance for a particular analysis method. In Grid computing, as new capabilities become available on production resources, taking advantage of application-oriented APIs such as the Simple API for Grid Applications [8], previously prototyped scenarios such as simulation migration, adaptive distributed simulations, and task spawning will become more regularly used. Accurate information from applications will be needed to make decisions about when and how to use such services. As described in Sec. 5, current profiling services rely on a remote service discovering and interpreting information from applications; however, we believe scenarios will be more powerful and reliable when closely coupled with the application code. For peta-scale machines, currently being deployed in the US with tens or hundreds of thousands of processors, dynamic and real-time profiling will be essential, in particular profiling which is inherently tied into the application and automatically generated with little overhead during an application run. Current projects, in the D-Grid and US, are developing technologies for Cactus simulations to automatically produce and store profiling and application metadata from simulations. This information will then be used for analysis to lead to optimized codes and potentially improved parallel computing paradigms.
Fig. 3. Left: Percentage of time spent checkpointing during a run. The adaptive run keeps within the desired bound (5%). Right: Total time spent checkpointing during a run. The adaptive version checkpoints less frequently to keep close to the 5% bound. The dashed line indicates the increase in the number of grid points as new refinement levels are added.
4 Experiment and Analysis
To illustrate the advantages of application-side adaptivity using real-time profiling, a series of experiments was performed using the adaptive checkpointing described in Sec. 3.2. The application code used to test the adaptive checkpointing functionality was the Ccatie astrophysics code [9], which can simulate the collision of black holes. This compute and data intensive application solves Einstein's equations for general relativity in 3D, evolving over twenty partial differential equations using high order finite differences. We used the Carpet driver [10,11] to provide adaptive mesh refinement. Starting from a uniform grid with $40^3$ grid points, we added additional refined levels every 5120 iterations. Such a strategy is e.g. necessary to simulate the collapse of a stellar core in a supernova, where the central density increases considerably during the collapse. Note that for $L$ refinement levels, the computing time per iteration grows as $O(2^L)$, while the amount of data to be checkpointed grows at the same time as $O(L)$. The simulation runs for 1,737 seconds, spending 19% of the total time checkpointing. The original configuration checkpointed every 512 iterations. With the AdaptCheck thorn the maximum percentage of time spent checkpointing was restricted to 5% of the total wall time in the adaptive configuration. Figure 3 compares the actual percentage of time spent checkpointing for the original and the adaptive configuration. The results show that we are able to successfully bound this measure. This also yielded a 17% reduction in the total runtime. Each regridding increases the problem size by $40^3$ grid points. This increase means that, with additional levels, each iteration takes more time to compute, the time between periodic checkpoints increases, and the amount of time needed to checkpoint increases during the run. A common practice is to choose the checkpointing interval to be short enough at all times of the run. Unfortunately, this means that checkpointing occurs much too frequently early in the run. In another run using the AdaptCheck thorn, we bounded the interval between checkpoints independent of the performance of the run and the I/O system. This reduced the amount of time spent checkpointing from 319 s to 75 s; the total runtime was reduced by 20%. This functionality can be used to guarantee a certain level of fault tolerance when adapting the checkpointing based on the simulation's characteristics.
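The two bounds used in these experiments, a maximum fraction of wall time spent checkpointing and a maximum wall time without a checkpoint, can be combined into a single decision function. The C sketch below is one possible reading of the policy described in Sec. 3.2 and is not the AdaptCheck source. The struct and function names are hypothetical, and it is an assumption of this sketch that the interval bound takes precedence over the percentage bound when both apply.

```c
#include <assert.h>

/* User-specified limits (hypothetical parameter names). */
typedef struct {
    double max_fraction;  /* e.g. 0.05 for 5% of wall time */
    double max_interval;  /* longest wall time, in seconds, without a checkpoint */
} adapt_policy;

/* Measurements taken from timers at a candidate checkpoint iteration. */
typedef struct {
    double total_wall;       /* wall seconds since the start of the run */
    double checkpoint_wall;  /* wall seconds spent writing checkpoints so far */
    double since_last;       /* wall seconds since the last checkpoint finished */
} adapt_state;

/* Returns 1 if a checkpoint should be written now, 0 otherwise. */
int should_checkpoint(const adapt_policy *p, const adapt_state *s) {
    assert(s->total_wall > 0.0);
    /* Regularity bound: never go longer than max_interval without a checkpoint
       (assumed here to override the percentage bound). */
    if (s->since_last >= p->max_interval) return 1;
    /* Weak upper bound: skip the checkpoint while the fraction of time already
       spent checkpointing is above the limit. */
    return s->checkpoint_wall / s->total_wall <= p->max_fraction;
}
```

With a 5% limit and a 600 s interval bound, a run that has spent 60 s of its first 1000 s checkpointing skips the next candidate checkpoint, unless 600 s or more have passed since the last one. The bound is weak in the sense described in Sec. 3.2: the function only looks at the time already spent, so a single long checkpoint can still push the fraction above the limit.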
5 Related Work and Conclusions
The above results indicate how a scientific code can use a generic, self-contained timing infrastructure for runtime profiling and adaption, leading to significantly improved overall performance: in this case, increasing the time spent in computation and the fault tolerance of the run while reducing checkpointing time. The timing infrastructure was implemented in a highly portable manner in the Cactus Framework, and is easily available to users, either via parameter choices for higher level tools, or through an API for code developers. The infrastructure is able to use platform dependent clocks, as well as libraries such as PAPI.
A substantial amount of work has recently been seen in automated application profiling and adaption, motivated by new possibilities in Grid computing, and a growing realization of the new tools needed for peta-scale computing. Attention has focused on developing general libraries and tools for application profiling, adaption and steering (e.g. SciRun, GrADS, RealityGrid). For example, in the GrADS project, a program development framework has been developed which can encapsulate general applications as configurable object programs, and then optimize these for execution on a specific set of Grid resources [12]. GrADS uses the Autopilot system for real-time application monitoring and closed loop control. Autopilot sensors can be embedded in application code by developers or, as in the GrADS system, inserted by an automated mechanism. The Cactus timing infrastructure incorporates its own application profiling, adaption and steering. The design of the Cactus Framework also allows thorns to be easily written to connect to external packages when these provide an advantage, as in experiments with GrADS, Autopilot, and ongoing work with SciRun. A key advantage of the Cactus infrastructure, however, is that there is an intimate connection with the scientific application, even with no attention to application profiling.
Cactus applications are automatically enabled with steerable parameters and profiling at the level of thorn methods, thorns, schedule bins, as well as communication times and I/O times. Such an understanding of the application structure and scientific content is crucial for effective steering and control [13]. Higher level Cactus tools can build on these capabilities, and leverage current work on intelligent adaption in distributed environments (e.g. [14]), to provide powerful capabilities for analysis and control of scientific applications in HPC and Grid environments.
Acknowledgements We acknowledge contributions from the Cactus Team in the timing infrastructure implementation, in particular David Rideout, Thomas Schweizer, John Shalf,
Jonathan Thornburg, Andre Werthmann, and Steve White. We thank Ed Seidel for suggestions, and Elena Caraba for her help with preparing this manuscript. This work was partly supported by NSF Grant 540179 (DynaCode) and the German Federal Ministry of Education and Research (D-Grid 01AK804). Computing resources were provided by the Center for Computation & Technology at LSU.
References
1. Report from NSF DDDAS Workshop, Washington (January 2006), http://www.nsf.gov/cise/cns/dddas/2006 Workshop/wkshp report.pdf
2. The Performance API (PAPI), http://icl.cs.utk.edu/projects/papi/
3. Cactus Computational Toolkit, http://www.cactuscode.org/
4. Goodale, T., Allen, G., Lanfermann, G., Massó, J., Radke, T., Seidel, E., Shalf, J.: The Cactus framework and toolkit: Design and applications. In: Palma, J.M.L.M., Sousa, A.A., Dongarra, J., Hernández, V. (eds.) VECPAR 2002. LNCS, vol. 2565, pp. 197–227. Springer, Heidelberg (2003)
5. Allen, G., Goodale, T., Lanfermann, G., Radke, T., Rideout, D., Thornburg, J.: Cactus Users’ Guide (2004)
6. Allen, G., Angulo, D., Foster, I., Lanfermann, G., Liu, C., Radke, T., Seidel, E., Shalf, J.: The Cactus Worm: Experiments with dynamic resource discovery and allocation in a grid environment. Int. J. of High Performance Computing Applications 15(4) (2001)
7. Allen, G., Dramlitsch, T., Foster, I., Karonis, N., Ripeanu, M., Seidel, E., Toonen, B.: Supporting efficient execution in heterogeneous distributed computing environments with Cactus and Globus. In: Proceedings of Supercomputing 2001 (2001)
8. Goodale, T., Jha, S., Kaiser, H., Kielmann, T., Kleijer, P., von Laszewski, G., Lee, C., Merzky, A., Rajic, H., Shalf, J.: SAGA: A Simple API for Grid Applications – High-Level Application Programming on the Grid. Computational Methods in Science and Technology 8(2) (2005)
9. Alcubierre, M., Brügmann, B., Dramlitsch, T., Font, J.A., Papadopoulos, P., Seidel, E., Stergioulas, N., Takahashi, R.: Towards a stable numerical evolution of strongly gravitating systems in general relativity: The conformal treatments. Phys. Rev. D 62, 044034 (2000)
10. Schnetter, E., Hawley, S.H., Hawke, I.: Evolutions in 3D numerical relativity using fixed mesh refinement. Class. Quantum Grav. 21(6), 1465–1488 (2004)
11. Adaptive mesh refinement with Carpet, http://www.carpetcode.org/
12. Berman, F., et al.: New grid scheduling and rescheduling methods in the GrADS project. Int. J. Parallel Program. 33(2), 209–229 (2005)
13. Vetter, J.S.: Experiences with computational steering on existing scientific applications. In: Parallel Processing for Scientific Computing (1999)
14. Reed, D.A., Mendes, C.L.: Intelligent monitoring for adaptation in grid applications. Proceedings of the IEEE 93(2), 426–435 (2005)
End to End QoS Measurements of TCP Connections
Witold Wysota and Jacek Wytrebowicz
Institute of Computer Science, Warsaw University of Technology
Abstract. This paper presents selected issues in quality measurement of TCP connections, seen from the end user’s perspective and not from the service provider’s point of view. The main focus is on the selection of parameters whose values are worth gathering and can be gathered without disturbing end user applications. To verify the presented ideas, a measuring tool, QMP (QoS Measurement Plug In), was developed. QMP is placed inside the kernel of the Linux 2.6 operating system as an extension to the packet filter. The paper briefly describes the tool and some experiments performed with it.
1
Introduction
Quality of services provided via the Internet is not just a research subject; it is a crucial problem for Internet Service Providers (ISPs) and for Internet Content Providers (ICPs). The providers are able to maintain sufficient quality inside their networks for selected applications, e.g. IP telephony, using packet classification and bandwidth reservation for selected classes of packets. However, we observe that there are more and more Internet services that support a variety of contents and applications for network users. These services mainly use the HTTP protocol, which is usually carried as a best effort transfer. Thus the quality from a user's point of view depends on the bandwidth overbooking applied by ISPs inside their networks and on the bandwidth of the interconnections between them. Of course, in local networks or in corporate networks the quality for any service can be assured by a careful configuration of all switches, provided that ingress switches support a flow classification mechanism. However, the problem remains: how to deal with service quality in the open Internet? The general response “there is no service quality in the open Internet” is satisfactory neither for users nor for ICPs. In practice they can obtain better quality by:
– routing table configuration, in the case they have connections with several ISPs;
– changing an ISP;
– changing a service level offered by an ISP.
Unfortunately the end users have no good means of measuring the quality. They need methods and tools for quality measurement of connections in TCP/IP networks, which are easy to apply. The tools should work regardless of the kind of network service, whether or not it is supported by a packet classification mechanism.
Achieving Quality of Service in TCP/IP enabled networks has been the subject of a number of papers. Most of these focused mainly on aspects such as implementing QoS mechanisms in network switches and routers [1,2]. It is not easy to find scientific work dealing with the application view of network QoS, let alone work that introduces ways of measuring the quality of the whole route. However, similar issues were taken up in papers such as [3], [4] or [5]. The aim of the presented work was to select a set of metrics that can be used in a tool deployed at an end station connected to the Internet. Such a tool could be useful for quality measurement of services built over TCP, for the choice of Internet Service/Application Providers, and for selecting the placement of application servers, especially for distributed computing systems, as the ability to evaluate the quality of connections is vital for a proper allocation of tasks in the network grid.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1180–1189, 2008. © Springer-Verlag Berlin Heidelberg 2008
2
Metrics for QoS of TCP Connections
The IP Performance Metrics Workgroup has described quantities that allow measuring the quality of a service in the network layer (Internet Protocol) [6]. The workgroup has developed a framework that is independent of the type or characteristics of the traffic under test. The framework allows for the creation of a number of templates that can be used to instantiate metrics for different protocols and types of traffic. Based on those and other quantities we found fit, we created a set of metrics adapted to the transport layer, in particular the Transmission Control Protocol. We took into consideration quantities that should give distinctive results for different types of traffic and for different environment conditions, so that it is possible to decide which connection is better and which is worse for a given connection type. We have assumed that all measurements should be performed at the TCP level. Below we briefly describe each measure we have chosen.
2.1
One Way/Two Way Delay (Round Trip Time)
The first and, in our view, the most important quantity that can be measured is the time delay between sending and receiving a data segment. Here we can talk about either one-way delay (the time span of transmitting a segment from sender to receiver) or about two-way delay (the time span of transmitting a segment from sender to receiver and transmitting an acknowledgement back to the sender), also called Round Trip Time (RTT). Measuring one-way delay would give more precise information about a connection, but we find it very difficult to measure correctly. One reason is that the timestamp TCP option is not available in all TCP implementations. This option permits the receiver host to calculate the time difference between the system clocks on both hosts, and it allows the received timestamp to be corrected. On the other hand, TCP itself calculates the round trip time to control the retransmission timer. Under normal circumstances (when the gap between segment arrival and acknowledgement generation is very small) the round trip time is the span between putting the segment on the wire and receiving a packet that confirms the segment (recognized by an active ACK flag and an acknowledgement number larger than the largest octet number sent in the segment). Please note that we cannot calculate a one-way delay by dividing the round trip time by a factor of 2, because the connection may use different paths or different traffic policies in each direction. After gathering a set of RTT values we can calculate different aggregations and statistics of the data. Apart from the obvious values, the maximum and minimum delays that define the range of delays seen during the connection, we can also calculate several distributions. Guided by IPPM advice, we suggest calculating the Poisson distribution along with the percentage distribution and a median value [7,8]. When the median value is hard to calculate (due to limited data storage resources) we suggest calculating an average value instead. The percentage distribution complements the previously mentioned characteristics: it is less resource consuming to calculate than the median and gives more information about the ranges of values that were actually measured.
2.2
Time and Duration
The second data set we find interesting is the time of the beginning, the end and the duration of the connection. The importance of the duration is self-explanatory, but the time of day (and possibly the date) may need an explanation. Network traffic characteristics may change depending on the hour and day of the week: for example, people tend to use the Internet more during the day than during the night, and commercial services will get higher traffic during weekdays than during weekends. Therefore it is important to compare data sets obtained in similar conditions.
2.3
Connection Bandwidth
The average data flow per time unit, called the maximum flow capacity or the connection bandwidth, is calculated by dividing the volume of data transferred by the duration of the connection. The value obtained will be lower than the theoretical capacity of the link in perfect conditions and will denote the practical bandwidth used by a particular connection, which allows one to compare this result with the link capacity (bandwidth) advertised by the Internet Service Provider. When stations taking part in a series of measurements make use of the same congestion avoidance mechanism, the results are comparable independently of the service being measured [9]. One could try to normalize the results obtained with different congestion avoidance algorithms (for example by ignoring those parts of the transmission which are influenced by congestion), but we prefer to avoid such normalization so that the environment under test is exactly the same as the one experienced by the end user.
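As a minimal sketch of the calculation described above, the snippet below divides the transferred volume by the connection duration. The function name `average_rate` is hypothetical, and the use of integer (floor) division is an assumption, chosen because it reproduces the SEND RATE and RECV RATE values of the sample QMP output in Fig. 2 (675 B sent and 25598 B received over 8 s).

```python
def average_rate(bytes_transferred: int, duration_s: int) -> int:
    """Average throughput in bytes per second.

    Floor division is an assumption made for this sketch; it matches the
    rounding seen in the sample QMP log.
    """
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return bytes_transferred // duration_s


send_rate = average_rate(675, 8)    # 84 B/s, as in "SEND RATE: 84B/s"
recv_rate = average_rate(25598, 8)  # 3199 B/s, as in "RECV RATE: 3199B/s"
```

Very short connections (duration below one second) would need a finer time unit than whole seconds for this figure to be meaningful.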
2.4
TCP Options
It is very easy to gather information about the TCP options that are available, supported and used by a connection, as their presence is marked in the protocol header during the handshake phase and is also often reported in system structures associated with a particular network connection. The effort is minimal, therefore we suggest checking which options might be used and deciding whether they are helpful for the kind of service one wants to support. Options like Selective Acknowledgement (SACK) or Negative Acknowledgement (NAK) can reduce the number of retransmissions in the network and at the same time optimize the bandwidth use. The Explicit Congestion Notification (ECN) option allows informing about potential congestion in the network. Using the above-mentioned options we can gather statistics about how much was gained by using each of them.
2.5
Average Receiving Window Size
The size of the receiving window delivers information about the amount of traffic in the network and about how fast the receiver can process incoming data (if the data come in faster than an application reads them, the window size is decreased to prevent buffer overflow). The window size depends on the operating system and the congestion avoidance mechanism used; the window size may be reduced when IP packets get lost and the flow is interrupted. It would be most interesting to know how the window size changes during the whole connection time. Tracking the changes might require a significant amount of resources, so instead it may be preferable to store only some basic statistics and calculate the minimum, average and maximum window size for each end of the connection. These quantities may be very important in situations when a stable transmission speed is required.
2.6
Other
There are also other quantities that influence network connections and their quality, such as jitter (delay variation), retransmission timeout, number of retransmissions or maximum segment size. Jitter states how stable the connection is: changes in latency suggest that network conditions change frequently, which could make the link unsuitable for services that require constant bit rates. Measures associated with TCP retransmissions give information about packet losses in the network. The more packets are lost and the larger the retransmission timer timeout values, the more congestion one might expect. Longer timeouts also imply that retransmitted segments will reach their destination later, increasing the overall delay, which in turn could cause some real-time services to malfunction or fail. Maximum segment size is mostly important for short connections or when data comes in large bursts. If the protocol implementation uses the slow start algorithm, it will allow sending only a single segment of data at the beginning. If all the transmission data fits into a single segment, this allows for very swift and short connections. On the other hand, segment sizes that are too large cause longer latencies when the protocol holds the transmission, waiting for the segment to fill up with data. Most TCP implementations wait a short amount of time before sending data, to see if more data could be queued for sending, instead of transmitting a large number of very small segments.
2.7
Interpretation of Measurements
Having a single set of measurements or calculated aggregations or statistics doesn’t lead to many definite conclusions. Thus we suggest comparing the dataset with datasets from different measurements gathered from different paths, different time periods, etc. In that way we can select more suitable paths or time periods. A completely different issue is how to interpret the gathered statistics. For some of the metrics their meaning is obvious, whereas others may require some additional effort to get valuable information from them. In our studies we focused on the Poisson distribution suggested by IPPM. The Poisson distribution determines the number of occurrences of an event during a series of independent tries and is described by the lambda parameter. Taking RTT as an example, lambda can be obtained by inverting the average RTT value. Based on that, the Poisson distribution can be calculated. The distribution by itself may not give much useful information, but by transforming the distribution function or the cumulative distribution function one can obtain information such as the probability of a delay falling into a particular range, the probability of receiving no more than k segments per second, or the probability of a delay being smaller or bigger than a particular fraction of a second [10].
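The quantities listed above can be illustrated with a short sketch. It takes lambda as the inverse of the average RTT, as the text describes, and then makes one further assumption that is not spelled out in the paper: delays are treated as the exponentially distributed waiting times of a Poisson process with that rate. The function names are hypothetical, and the average RTT of 300718 µs is taken from the sample output in Fig. 2 purely as an example input.

```python
import math


def rate_from_mean(mean_rtt_s: float) -> float:
    """lambda = 1 / (average RTT), in events per second."""
    return 1.0 / mean_rtt_s


def prob_delay_at_most(t_s: float, lam: float) -> float:
    """P(delay <= t), assuming exponential waiting times: 1 - exp(-lambda*t)."""
    return 1.0 - math.exp(-lam * t_s)


def prob_delay_between(a_s: float, b_s: float, lam: float) -> float:
    """P(a < delay <= b) as a difference of two cumulative values."""
    return prob_delay_at_most(b_s, lam) - prob_delay_at_most(a_s, lam)


def prob_at_most_k_events(k: int, lam: float, window_s: float = 1.0) -> float:
    """Poisson CDF: P(no more than k events within the given time window)."""
    mu = lam * window_s
    return sum(math.exp(-mu) * mu ** i / math.factorial(i) for i in range(k + 1))


lam = rate_from_mean(0.300718)                    # average RTT in seconds
p_fast = prob_delay_at_most(0.1, lam)             # delay below 100 ms
p_range = prob_delay_between(0.1, 0.5, lam)       # delay between 100 and 500 ms
p_few = prob_at_most_k_events(3, lam)             # at most 3 events per second
```

Whether the exponential waiting-time model fits a given connection would have to be checked against the measured percentage distribution.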
3
QoS Measurement Plugin
The QoS Measurement Plugin (QMP) measurement tool is a proof of concept for the quantities we have selected for end-to-end QoS analysis of TCP connections. We have decided to apply the passive measurement approach in QMP, even though its implementation is more complex. In the passive approach, a tool sniffs the existing network traffic and extracts all the information it needs. In the active approach, a tool emits its own well-known segment stream into the network and analyses the related acknowledgements. Active measurement gives better control over the measurement process, because it makes it easy to mark and track each segment. Unfortunately, active methods influence the traffic itself, which can distort the results. Moreover, the end user wants to know the efficiency of the network services he or she actually uses. We have also decided to apply a single point measurement technique. The multi-point (mostly double or triple) technique means that traffic is generated on one machine and monitored on the destination machine or on network nodes. Using a multi-point technique influences machine performance less than using a single point technique. Unfortunately, deployment of a multi-point technique requires some cooperation with a network service provider. Because our tool is aimed at clients of an ISP, we have to apply a single point technique. QMP has to be as simple as possible while being able to conduct experiments that gather the above-mentioned data. We have designed QMP as an extension to the Linux packet filter, Netfilter.
3.1
Netfilter
Netfilter (also known as iptables) was introduced to the Linux kernel in version 2.4 replacing older firewalling systems — ipfwadm and ipchains. The concept of Netfilter is based around rule tables. These rules are matched against packets flowing through a network routing system in a few places where routing decisions are made.
Fig. 1. Netfilter hooks
If a packet matches a rule, an action associated with the rule is triggered. Figure 1 shows the available hooks: the ones in circles are associated with packet filtering and mangling, whereas PREROUTING and POSTROUTING are used mostly for NAT¹. At each point, the packet can be manipulated using rule actions if it matches a rule, or it is handled by the default policy associated with each table if no rules are matched. The actions hardwired into the kernel are ACCEPT, DROP, STOLEN, QUEUE and REPEAT. Each Netfilter action must use one of them to either accept the packet at the current stage, discard the packet silently from the system, remove the packet from the flow for further processing, queue the packet into userspace for external processing, or recheck the rule list again [11]. Using these decisions, Netfilter implements its own built-in actions of accepting the packet, rejecting it and explaining the reason to the other end of the communication, dropping the packet silently from the system, or redirecting it to another set of rules (table) for checking. It is also possible to send a packet to a custom target implemented as an extension to the packet filter; this is the feature that QMP uses to achieve its goal.
¹ Network Address Translation: changing the source or destination address of packets flowing through a router.
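The rule-matching behaviour described above can be modelled in a few lines. The following is a user-space toy model in Python, not kernel code and not the Netfilter API: the names `Verdict`, `run_chain` and `measure` are hypothetical, packets are plain dictionaries, and the measuring target returning ACCEPT is an assumption made for illustration. Only the five verdict names come from the text.

```python
from enum import Enum


class Verdict(Enum):
    """The five decisions named in the text."""
    ACCEPT = "accept"
    DROP = "drop"
    STOLEN = "stolen"
    QUEUE = "queue"
    REPEAT = "repeat"


def run_chain(packet, rules, policy=Verdict.ACCEPT):
    """Return the verdict of the first matching rule, else the default policy."""
    for matches, target in rules:
        if matches(packet):
            return target(packet)
    return policy


seen_ports = []


def measure(packet):
    """Custom target: record something about the packet, then let it pass."""
    seen_ports.append(packet["dport"])
    return Verdict.ACCEPT


rules = [
    (lambda p: p["proto"] == "tcp" and p["dport"] == 22, measure),
    (lambda p: p["proto"] == "udp", lambda p: Verdict.DROP),
]

verdict = run_chain({"proto": "tcp", "dport": 22}, rules)  # handled by measure
```

A packet that matches no rule, such as `{"proto": "icmp", "dport": 0}`, falls through to the default policy.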
3.2
QMP Architecture and Design
Extending Netfilter consists of two steps. The first step is to implement the decision making kernel module, which has to return one of the hardwired decisions based on the content of the packet passed to it. The second step is to make a plugin for iptables (a userspace tool used for programming sets of rules for Netfilter). The plugin tells iptables how to handle user provided rules that make use of a particular packet filter extension: command-line parsing and memory allocation for specific data structures are performed by the plugin. QMP works by monitoring TCP connections flowing through the filter (matching iptables rules) and building a set of containers according to TCP synchronisation segments, where each container (called a bucket) stores data associated with a single monitored TCP connection. QMP processes every TCP segment to extract interesting data and store them in the related bucket. When QMP finds a FIN segment (the TCP connection is closing), it writes the collected data to the system log and releases the bucket. Figure 2 shows an example of such a log.

QMP: SACK: Y WSCALE: Y TSTAMP: Y
QMP: DATA SENT: 675B RCVD: 25598B
QMP: START: 1150487082, STOP: 1150487090, DURATION 8s
QMP: SEND RATE: 84B/s, RECV RATE: 3199B/s
QMP: Local wnd size [B] > MIN: 1041 AVG: 32977 MAX: 64545
QMP: Remote wnd size [B] > MIN: 63750 AVG: 63750 MAX: 63750
QMP: Delay (RTT) [us] > MIN: 0 AVG: 300718 MAX: 4348707
QMP: SMSS: 1448 RMSS: 536

Fig. 2. Sample of QMP output
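MIN, AVG and MAX figures like those in the log above can be maintained without storing individual samples. The sketch below shows one way this could work: each outgoing data segment is remembered under the acknowledgement number that would confirm it, and the round trip time is the difference between the two observation times. It is a user-space illustration in Python, not the kernel module's code; all class and method names are hypothetical, and sequence number wrap-around and the ambiguity caused by retransmissions are deliberately ignored.

```python
class RunningStats:
    """Minimum, average and maximum of a stream in constant memory."""

    def __init__(self):
        self.count = 0
        self.total = 0
        self.minimum = None
        self.maximum = None

    def add(self, value):
        self.count += 1
        self.total += value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)

    @property
    def average(self):
        # Integer average: sum of samples divided by the number of samples.
        return self.total // self.count if self.count else 0


class RttTracker:
    """Pairs outgoing data segments with the ACKs that confirm them."""

    def __init__(self):
        self.pending = {}            # expected ACK number -> send time [us]
        self.rtt_us = RunningStats()

    def on_segment_sent(self, seq, payload_len, now_us):
        if payload_len > 0:
            # The confirming ACK carries seq + payload_len. Keep the first
            # send time if the same data is seen again.
            self.pending.setdefault(seq + payload_len, now_us)

    def on_ack_received(self, ack, now_us):
        sent_us = self.pending.pop(ack, None)
        if sent_us is not None:
            self.rtt_us.add(now_us - sent_us)
        # A cumulative ACK also covers earlier segments; forget them.
        for expected in [e for e in self.pending if e < ack]:
            del self.pending[expected]


tracker = RttTracker()
tracker.on_segment_sent(seq=1000, payload_len=500, now_us=10)
tracker.on_segment_sent(seq=1500, payload_len=500, now_us=20)
tracker.on_ack_received(ack=1500, now_us=310)  # RTT sample: 300 us
tracker.on_ack_received(ack=2000, now_us=420)  # RTT sample: 400 us
```

The `pending` table is bounded by the number of unacknowledged segments, so memory use stays small even though it is not strictly constant.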
Some additional data, like the sequence number of the segment and the kernel time when it was seen, is stored for each outgoing data segment so that it is possible to map segments to their acknowledgements. When a matching response is noticed, the latency (round trip time) can then be computed by subtracting the two timestamps. As the tool is meant to be just a proof of concept, a decision was made to make it very simple and straightforward. Because of that, the efficiency of the network stack is maintained and the complexity of the tool itself is very low. For example, QMP only stores the maximum, minimum and average value of a quantity being measured, as those values are extremely quick to process (the average is calculated as the sum of latency in milliseconds divided by the total number of entries) and use up a constant and limited amount of memory, so that the stability and efficiency of the whole OS are not affected. The downside is that only a limited amount of data can be extracted from the measurement: the tool doesn’t track the whole history of the measurement. Another thing is that the plugin always treats the side that does the active connect (client) as the local one. This is usually not a problem, as QMP is meant to be deployed on client stations, but it can cause strange output for services like FTP, where it can happen that it is the server which initialises the data connection. When this happens, one has to exchange the results for the local and remote ends. The asymmetry is present in other aspects of the tool too: it only tracks the latency of outgoing data, because measuring in the other direction would heavily depend on support for the segment timestamp option on the remote end, as it would be the only way to fetch values which could then help in calculating RTT. It would all make the tool much more complex and prone to problems.
3.3
Experiments Performed
We have made some experiments to test the tool. The main goal of the experiments was to check whether measuring different kinds of traffic yields distinctive results. Such results mean that the chosen measures really make it possible to spot differences in service quality as seen by the end user. All experiments had the same point of origin, a host where QMP was installed. We have executed them for different routes and different service types. We have selected the following remote ends: a domestic close proximity server (low number of hops required to reach it), a big commercial application server in the national network, a Microsoft FTP server situated somewhere in the USA, and a small private ADSL server placed in the same city. We have tested FTP and SSH connections, as they generate completely different types of network traffic. We have made all tests in the afternoon, local time, and then repeated them past midnight. Our motivation was to check how the tool recognizes connection quality changes caused by heavier or lighter network load observed at different times of day. Table 1 presents the results we have achieved, where columns A, B and C correspond respectively to the domestic, commercial and Microsoft servers, with FTP as the tested service, and column D corresponds to the small ADSL server running an SSH service.

Table 1. Test results

Quantity    Unit    A day    A night    B day    B night    C day    C night    D day        D night
duration    s       25       25         29       25         21       23         165          65
bandwidth   kB/s    122      123        106      123        146      133        0.183/5.9    0.173/5.7
Receiving TCP window size
minimum     B       181      181        181      181        2105     2105       2094/89      2094/4146
average     B       57079    57135      57131    57141      65064    65086      34k/20k      21k/5k
maximum     B       61654    61654      61654    61654      65407    65407      64k/61k      60k/61k
Round trip time
average     μs      (column alignment lost in the source; values in extraction order: 0∗ 0∗ 999 39 12 43545 35940 141215 257919 40020 >1M∗)
maximum     s       0.025    0.003      1.241    0.848      12.458   13.130     118.73       29.244
∗) possible measurement error

The results confirm that different quantities characterize different types of service. For example, for FTP the most important measure is the bandwidth, and for SSH it is the latency. Furthermore, we observe that the quality of connections is better at night. A remarkable result is that Microsoft’s FTP server scores much worse at night. The reason is clear: it resides in a different time zone than the test point, so when it is night in Poland, it is late afternoon in the US, with the heaviest traffic in US networks. The results from our experiments confirm that QMP is able to spot relations between connection quality and the selected route between a client and a server machine. The length (number of hops) and the bandwidth of a route remarkably influence the gathered results.
4
Conclusions
The idea of this paper was to point out the importance of connection quality measurement between end points, and to demonstrate how it should be realized. The recommended metrics for a network service that uses TCP are: connection time and duration, two-way delay, bandwidth, receiving window size, jitter, maximum segment size, and support for 3 TCP options. The selected options are: Selective Acknowledgement (SACK), Negative Acknowledgement (NAK) and Explicit Congestion Notification (ECN). They influence the efficiency of dealing with communication errors and congestion. For pragmatic reasons we recommend passive measurements from a single point of a connection (the client). Our tool is not a market ready application; it is just a proof of concept. However, it implements the most difficult part of such an application for a Linux operating system. Our tool is based on a widely used Linux firewall, Netfilter. The simple experiments we have described demonstrate the expected behavior of this tool. To make it a market ready application, a graphical interface should be developed on top of it. This interface should allow for some statistical analysis of the data gathered by our tool. Such an application will be useful for the IT staff of any enterprise that uses network services over the Internet or even over a VPN via an external MPLS network. MPLS providers guarantee CoS and support some quality measurements between end points of their network. However, an enterprise network administrator is responsible for quality assurance for connections between application servers and their end users. Thus he or she needs a tool such as the one presented in this paper. Distributed computing systems may also benefit from the solution, as it allows for better resource usage and higher efficiency due to a better allocation of available network bandwidth between nodes, which results from the ability to estimate the real traffic in the network.
References
1. Siler, M., Walrand, J.: Monitoring Quality of Service: Measurement & Estimation. In: Proc. IEEE CDC, Tampa Bay, FL (December 1998)
2. Telkamp, T., Fineberg, V., Xiao, X., Ni, L.: A Practical Approach for Providing QoS in The Internet Backbone. IEEE Communications Magazine 40, 56–62 (2002)
3. Huard, J., Lazar, A.: On End-to-End QoS Mapping. In: Proc. 5th International Workshop on Quality of Service, Columbia University, New York, USA, pp. 303–314 (1997)
4. Dressler, F.: A Metric for Numerical Evaluation of the QoS of an Internet Connection. In: 18th International Teletraffic Congress (ITC18), Berlin, Germany, vol. 5b, pp. 1221–1230 (August 2003)
5. Cao, Y., Sun, H.: Quality of Service: Delivering QoS on the Internet and in Corporate Networks. Computer Communications 22(10), 981 (1999)
6. IPPM WG: IP Performance Metrics (November 2005), http://www.ietf.org/html.charters/ippm-charter.html
7. Almes, G., Kalidindi, S., Zekauskas, M.: A One-way Delay Metric for IPPM. RFC 2679 (Proposed Standard) (1999)
8. Almes, G., Kalidindi, S., Zekauskas, M.: A Round-trip Delay Metric for IPPM. RFC 2681 (Proposed Standard) (1999)
9. Mathis, M., Allman, M.: A Framework for Defining Empirical Bulk Transfer Capacity Metrics. RFC 3148 (Informational) (2001)
10. Wysota, W.: Measuring Quality of Internet Connection Through a Plugin in TCP Implementation. Master’s thesis, Inst. of Computer Science, Warsaw University of Technology (June 2006)
11. Russell, R., Welte, H.: Linux netfilter hacking HOWTO, http://www.netfilter.org/documentation/HOWTO/netfilter-hacking-HOWTO.html
Performance Evaluation of Basic Linear Algebra Subroutines on a Matrix Co-processor
Ahmed S. Zekri and Stanislav G. Sedukhin
Graduate School of Computer Science and Engineering, The University of Aizu, Aizu-Wakamatsu City, Fukushima 965-8580, Japan
{d8062103, sedukhin}@u-aizu.ac.jp
Abstract. As increasing clock frequency approaches its physical limits, a good approach to enhance performance is to increase parallelism by integrating more cores as coprocessors to general-purpose processors in order to handle the different workloads of scientific and signal processing applications. Many kernels in these applications lend themselves to data-parallel architectures such as array processors. The basic linear algebra subroutines (BLAS) are standard operations to efficiently solve linear algebra problems on high performance and parallel systems. In this paper, we implement and evaluate the performance of some important BLAS operations on a matrix coprocessor. Our analytical model shows that the performance of the Level-3 BLAS, represented by the n×n matrix multiply-add operation, approaches the theoretical peak as n increases, since the degree of data reuse is high. However, the performance of Level-1 and Level-2 BLAS operations is low as a result of low data reuse. Fortunately, many applications are based on intensive use of Level-3 BLAS with a small percentage of Level-1 and Level-2 BLAS.
1
Introduction
As increasing clock frequency approaches its physical limits, a good approach to enhance performance is to increase parallelism by integrating more cores as coprocessors to general-purpose processors. Coprocessors are special units dedicated to leveraging performance on the compute-intensive parts of applications. For example, vector coprocessors and graphics processing units enhance floating-point and streaming applications, respectively [1], [2], [3]. The basic linear algebra subroutines (BLAS) are standard operations used to achieve portability, modularity, and efficiency in solving linear algebra applications on high performance and parallel computer systems [4], [5], [6]. Level-1 and Level-2 of the BLAS describe the basic vector-vector and matrix-vector operations, respectively. They involve O(n) and O(n²) computations operating on O(n) and O(n²) data, where n is the matrix/vector dimension. Level-3 BLAS describe the matrix-matrix operations and involve O(n³) computations and O(n²) data. To achieve good performance on high performance and parallel machines, large matrices are divided into blocks or sub-matrices that can be held in the upper levels of the memory hierarchy of each processor, and optimized matrix-matrix operations are performed on those blocks [7].
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1190–1199, 2008. © Springer-Verlag Berlin Heidelberg 2008
Performance Evaluation of BLAS on a Matrix Co-processor
One key factor that determines the effectiveness of coprocessors is the cost of moving data between main memory and the coprocessor. A good strategy for reducing this overhead is to reuse the loaded data as much as possible. In this paper, we introduce a matrix coprocessor consisting of an N×N torus array processor (AP) whose main operation is the optimal N×N matrix multiply-add (MMA). This operation has a high degree of data reuse, O(N), a property that can be exploited to overlap data load and/or store with processing. To further reduce the cost of data movement, the matrix coprocessor has a local memory tightly coupled with the processing elements (PEs), so that the data needed for processing move between this local memory and the registers of the PEs through load/store operations. We design algorithms for some basic operations from the three levels of BLAS, and we evaluate their performance analytically on the matrix coprocessor. The idea behind the algorithm design is that the matrices and vectors are divided into N×N blocks and N-element contiguous segments, respectively, that can be loaded into the AP; using proper alignment operations, the result is obtained by executing optimal N×N MMA operations on the matrix registers of the AP. We show that the performance of Level-3 BLAS, represented by the n×n MMA operation with n>N, approaches the peak as n increases because of the high degree of data reuse, while the performance of Level-1 and Level-2 BLAS is low as a result of low data reuse.
In Section 2, we introduce our matrix coprocessor model. In Section 3, we present optimal 2D data allocations to execute the N×N MMA operation on the N×N torus coprocessor. Algorithms to implement some BLAS operations are given in Section 4. In Section 5, we describe an analytical model of the matrix coprocessor and evaluate its performance on the selected BLAS operations. Conclusions are given in Section 6.
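The data-reuse argument made in this introduction can be illustrated with a few lines of Python. The sketch below is not part of the original analysis: the function name `reuse_ratio` is hypothetical, and the operation and data counts are the usual leading-order figures for one representative routine per level (dot product, matrix-vector product, matrix multiply-add), taken here as assumptions.

```python
# Leading-order counts of multiply-add operations and data words for one
# representative routine per BLAS level (assumed textbook figures).

def reuse_ratio(level, n):
    """Multiply-add operations per data word touched, for dimension n."""
    if level == 1:        # dot product: n multiply-adds on 2n words
        ops, words = n, 2 * n
    elif level == 2:      # y = y + A*x: n^2 multiply-adds on n^2 + 2n words
        ops, words = n * n, n * n + 2 * n
    elif level == 3:      # C = C + A*B: n^3 multiply-adds on 3n^2 words
        ops, words = n ** 3, 3 * n * n
    else:
        raise ValueError("level must be 1, 2, or 3")
    return ops / words

for n in (8, 64, 512):
    print(n, [round(reuse_ratio(level, n), 2) for level in (1, 2, 3)])
```

Under these counts the ratio stays below one for Level-1 and Level-2 routines regardless of n, while it grows linearly with n for Level-3, which is the property the coprocessor is designed to exploit.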
2
Coprocessor Model
Our matrix coprocessor consists of an AP coupled with a local memory that holds the data and results needed for processing on the PEs. The AP is composed of N×N simple PEs connected as a torus. The latency of the long wrap-around connections can be overcome by folding the torus so that all connections between PEs have equal length [8]. All PEs execute the same operation on different data, i.e., they follow the Single Instruction Multiple Data (SIMD) execution model. Each PE performs a fused 'multiply-add-roll' operation, in which the scalar multiply-add c←c+a×b (a, b, and c are scalars) is performed and the appropriate data (and possibly partial results) are rolled (cyclically shifted) to the appropriate neighbor PEs. Each PE can send and receive data simultaneously to/from its four neighbors to the North, East, West, and South. For example, it can simultaneously receive data from South and East and send data and/or partial results to North and West, respectively. We assume that each PE has a set of scalar registers to hold data and results during processing. Because the PEs work in lock-step mode, executing
A.S. Zekri and S.G. Sedukhin
Fig. 1. (Left) Block diagram of the proposed coprocessor, (Right) The elements of A, B, and C inside the 3D torus where k = [i + j] mod N , N = 4; two optimal 2D data allocations (from projection) to execute the N ×N MMA operation on the N ×N matrix coprocessor are shown
the same instruction on the same register in all PEs gives the view of matrix instructions operating on matrix registers (see Fig. 1). For the purpose of data manipulation and alignment, a few matrix registers are reserved to hold predefined 0-1 matrices such as the identity matrix and the all-ones matrix. In this paper, the matrix coprocessor is regarded as an accelerator attached to a node in a cluster, and it is invoked to execute the matrix-formulated kernels of applications. Before computing, the host processor moves the SIMD instructions to an instruction memory in the array controller and stores the data in the local memory of the coprocessor. During computing, the array controller simultaneously issues instructions to move data between the matrix registers and the local memory and to process data on the AP. Hence, the data load/store can be overlapped partially or fully with computing. Fig. 1 shows a block diagram of the matrix coprocessor.
3
Optimal N ×N Matrix Multiply-Add
The main operation of the matrix coprocessor is the N×N MMA operation C←C+A×B, where A = [a(i, k)], B = [b(k, j)], and C = [c(i, j)], with i, j, k ∈ {0, 1, ..., N−1}. In this operation three operands are input and one operand is output. The N×N×N cubical index space of the MMA operation is defined by ℑ = {(i, j, k)^T : i, j, k ∈ {0, 1, ..., N−1}}. At each index point p = (i, j, k)^T the computation c(i, j)←c(i, j)+a(i, k)×b(k, j) is performed. The computations inside ℑ must be scheduled so that the elements of A, B, and C meet at the correct index points within each scheduling step while data dependencies are preserved.
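As a point of reference for the schedules discussed next, the MMA operation can be written as a plain sequential loop over the whole index space. This is only an illustrative Python sketch; the function name `mma_reference` and the list-of-lists representation are assumptions made for the example.

```python
def mma_reference(A, B, C):
    """Return C + A*B for N x N lists of lists, visiting every index point (i, j, k)."""
    N = len(A)
    result = [row[:] for row in C]      # leave the caller's C untouched
    for i in range(N):
        for j in range(N):
            for k in range(N):
                # the computation attached to index point p = (i, j, k)
                result[i][j] += A[i][k] * B[k][j]
    return result

print(mma_reference([[1, 2], [3, 4]], [[5, 6], [7, 8]], [[1, 0], [0, 1]]))
```

The order in which the N^3 index points are visited is free as long as each c(i, j) accumulates all N products; the parallel schedules differ only in how they assign these points to PEs and time-steps.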
Fig. 2. Three optimal data allocations to perform the N ×N MMA operation C←C+A×B on the N ×N torus AP, N = 4
We seek a schedule that removes data broadcasting and allows high data reuse in order to reduce the gap between processing and memory speeds. To fulfill these goals, we use cyclical shifting of data (over the wrap-around connections), and all elements of matrices A, B, and C are reused in all scheduling steps. That is, at each step the elements a(i, k), b(k, j), and c(i, j) are rolled along the j-, i-, and k-axis, respectively. In this case, the index space is regarded as a 3D torus in which the cyclical nature of computing and data movement is described by 3D modular scheduling [9]. Using the projection method [10], we obtained all optimal 2D data allocations that perform the N×N MMA operation on the N×N torus AP in the minimum number of steps. Fig. 1 shows the active computation points in the 3D index space in one scheduling step. Projection along the i-, j-, and k-axis gives
Fig. 3. Data alignment to perform on the N ×N torus matrix coprocessor, N =4: (a) c←xT ·y, (b) y←y+s·x, (c) y←y+A·x, (d) C←C+y·z T
three optimal 2D data allocations that perform the N×N MMA operation on the 2D torus AP in N multiply-add-roll time-steps. The 2D data allocation resulting from projection along the k-axis is the initial distribution of the well-known Cannon's algorithm [11], while the other two projections (along the i- and j-axis) give other optimal allocations that are computationally equivalent to Cannon's allocation but differ in data layout and alignment [9]. Considering all possible 3D data rollings of A, B, and C inside the 3D torus index space and using the projection method, other optimal 2D data allocations are obtained. Fig. 2 shows the three optimal 2D data allocations used in this paper. In Allocation 1, which is Cannon's distribution, both A and B are initially aligned by skewing the rows and columns, respectively, using circular shifts. Then, during computing, matrix C remains stationary while matrices A and B are rolled westward and northward, respectively, after the current elements of A and B are multiplied inside each PE and the result is accumulated in the corresponding element of C. In Allocation 2, after aligning B and C, matrix A remains stationary while matrices B and C are rolled northward and westward, respectively, at each step. In Allocation 3, matrices A and C are initially aligned and matrix B remains stationary, while matrices A and C are rolled westward and northward, respectively.
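Allocation 1 can be simulated in software to check that N multiply-add-roll steps indeed produce C + A×B. The following Python sketch models each matrix register as a list of lists and each roll as a cyclic shift; the helper names (`roll_west`, `roll_north`, `mma_allocation1`) are hypothetical, and the lock-step execution of the PEs is emulated by an ordinary double loop.

```python
def roll_west(M):
    """Cyclically shift every row one position to the left (toward the west)."""
    return [row[1:] + row[:1] for row in M]

def roll_north(M):
    """Cyclically shift every column one position up (toward the north)."""
    return M[1:] + M[:1]

def mma_allocation1(A, B, C):
    """C + A*B on a simulated N x N torus: C stationary, A rolls west, B rolls north."""
    N = len(A)
    # Initial alignment (skew): row i of A is shifted left by i positions,
    # column j of B is shifted up by j positions.
    a = [[A[i][(i + j) % N] for j in range(N)] for i in range(N)]
    b = [[B[(i + j) % N][j] for j in range(N)] for i in range(N)]
    c = [row[:] for row in C]
    for _ in range(N):                          # N multiply-add-roll time-steps
        for i in range(N):
            for j in range(N):
                c[i][j] += a[i][j] * b[i][j]    # what PE(i, j) does in this step
        a, b = roll_west(a), roll_north(b)
    return c

print(mma_allocation1([[1, 2], [3, 4]], [[5, 6], [7, 8]], [[0, 0], [0, 0]]))
```

After t rolls, PE(i, j) holds a(i, k) and b(k, j) with k = (i + j + t) mod N, so over N steps every PE sees each value of k exactly once without any broadcast.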
4
Selected BLAS Operations on the Matrix Coprocessor
In this section, we describe the implementation of some important BLAS operations on the N×N matrix coprocessor. We assume that the matrices and vectors are divided into N×N blocks and N-element segments, respectively, and that the elements of each block/segment are stored sequentially in column-major format. We also assume that one matrix register can be loaded or stored during computing.
4.1
Level-1 BLAS
The two important operations in Level-1 BLAS are the vector reduction (or DOT) c←x^T·y and the vector scaling (or SAXPY) y←y+s·x, where x and y are n×1 vectors and c and s are scalars. To implement the vector reduction operation, N segments from each of the vectors x and y are packed or loaded into two matrix registers Rx and Ry. The matrix register Rx is transposed inside the AP using three N×N MMA operations [12]. Then, the N×N MMA operation Rc←Rc+Rx×Ry is executed by applying Allocation 1 without the initial alignment; Rc is initially zeroed, Rx is rolled westward, and Ry is rolled northward (Fig. 3(a)). After N multiply-add-roll time-steps, the result of reducing each row of Rx with the corresponding column of Ry is located in the corresponding diagonal element of the output matrix register Rc. Pseudo-code (i) in Fig. 4 describes the whole implementation of the vector reduction operation, where α = n/N^2. At the end, the accumulated partial results in Rc are skewed westward/northward to move the diagonal elements to column/row #0 of the torus AP. Then, the partial results are summed up to get the final result c←x^T·y
Pseudo-code (i):
  Load x_0;
  for I = 1 to α−2 do
    Transpose x_{I−1} & Load y_I;
    Compute Rc & Load x_{I+1};
  end
  Transpose x_{α−1} & Load y_{α−1};
  Compute Rc;
  Skew Rc;
  Reduce Rc into scalar c;
  Store c;
Pseudo-code (ii):
  Load s;
  Copy s & Load x_0;
  Load y_0;
  for I = 0 to α−2 do
    Compute y_I & Load y_{I+1};
    Store y_I;
    Load x_{I+1};
  end
  Compute y_{α−1};
  Store y_{α−1};
Pseudo-code (iii):
  Load x_0;
  for K = 1 to n_1−1 do
    Transpose and Skew x_{K−1} northward & Load x_K;
  end
  Transpose and Skew x_{n_1−1} northward & Load y_0;
  for I = 0 to m_1−1 do
    Skew y_I westward & Load A_{I,0};
    for K = 0 to n_1−2 do
      Compute y_I & Load A_{I,K+1};
    end
    Compute y_I;
    Skew y_I eastward & Load y_{(I+1) mod m_1};
    Store y_I;
  end
Pseudo-code (iv):
  Load z_0;
  for K = 0 to n_1−2 do
    Transpose and Skew z_K northward & Load z_{K+1};
  end
  Transpose and Skew z_{n_1−1} northward & Load C_{0,0};
  for I = 0 to m_1−1 do
    Load y_I;
    Skew y_I westward;
    for K = 0 to n_1−2 do
      Compute C_{I,K} & Load C_{I,K+1};
      Store C_{I,K};
    end
    Compute C_{I,n_1−1} & Load C_{(I+1) mod m_1, 0};
    Store C_{I,n_1−1};
  end
Fig. 4. Pseudo-code to implement: (i) vector reduction, (ii) vector scaling, (iii) matrix-vector multiplication, (iv) rank-one update, on the matrix coprocessor
inside PE(0,0). The summation (reduction) is done using the N×N MMA operation Rt←Rt+Rw×Rs, where Rt is an output matrix register holding the reduced scalar c and Rw is a predefined 0-1 matrix register that has 1's at the PEs of column #0 and 0's elsewhere. During computing, only register Rs is rotated vertically while Rt and Rw remain stationary. For the implementation of the vector scaling operation, the scalar s is loaded into the matrix register Rs at position PE(0,0). Then, s is copied to column #0 of the torus AP using the same N×N MMA operation used in the summation above. After copying scalar s, N segments from each of the vectors x and y are loaded into two matrix registers Rx and Ry, and then the N×N MMA operation Ry←Ry+Rt×Rx is executed, where Rx and Ry remain inside the PEs while Rt is rotated horizontally (Fig. 3(b)). Pseudo-code (ii) in Fig. 4 describes the whole procedure for scaling the entire vector y on the N×N torus matrix coprocessor.
4.2
Level-2 BLAS
We selected two Level-2 BLAS operations, the matrix-vector multiplication and the rank-one update, to implement on the torus coprocessor. In the matrix-vector multiplication y←y+A·x, where A is an m×n matrix, x is an n×1 vector, and y is an m×1 vector, each segment of y is calculated as a dot-product of one row of blocks of A and all the segments of vector x. Multiplying an N×N block A_{I,K} of matrix A with the corresponding N×1 segment x_K of the vector x is done using Allocation 2 to
reduce the alignment overhead. After loading A_{I,K} and x_K into the matrix registers Ra and Rx, respectively, Rx is transposed and skewed northward before computing. The Transpose and Skew operations can be merged and done together using two N×N MMA operations in 2N multiply-add-roll time-steps (see [12]). Fig. 3(c) shows the data layout inside the 4×4 torus AP. Now, a partial result of the segment y_I is computed by executing the N×N MMA operation Ry←Ry+Ra×Rx. All segments of vector y are computed in a similar manner. Pseudo-code (iii) in Fig. 4 shows the whole implementation of the matrix-vector product, where m_1 = m/N and n_1 = n/N. To implement the rank-one update C←C+y·z^T, where C is an m×n matrix, y is an m×1 vector, and z is an n×1 vector, matrix C is partitioned into m_1×n_1 square blocks, each of dimension N×N, and the two vectors y and z are divided into segments of length N. We assume that the segments of vector y or z can be loaded at the beginning into the matrix registers of the AP. Using Allocation 1, each block C_{I,K} is computed by applying the N×N MMA operation Ra←Ra+Ry×Rz, where the matrix registers Ra, Ry, and Rz hold the block C_{I,K}, the segment y_I, and the segment z_K, respectively. During computing, Ry and Rz are rolled westward and northward, respectively, while Ra remains stationary. Pseudo-code (iv) in Fig. 4 shows the implementation of the rank-one update on the torus coprocessor.
4.3
Level-3 BLAS
The most important routine in Level-3 BLAS is the matrix-matrix multiplication, since the bulk of the computation in many numerical linear algebra algorithms consists of matrix-matrix multiplications [13]. For the 2D N×N torus AP, we presented in [9] four versions of the GEneral N×N MMA operation (GEMM) D←D+op(A)×op(B), where op(X) = X or X^T. The four versions correspond to four different initial layouts of A and B inside the AP. Here, we describe the GEMM variant D←D+A×B, where A is an m×n matrix, B is an n×ℓ matrix, and n, m, ℓ > N. (The other three variants can be treated in a similar way.) Assume that a row/column of N×N blocks of A, B, or D can be loaded initially into the matrix registers of the AP and reused before being stored back to the local memory. We therefore have two ways to compute the blocks of D. In the first, a block row/column of D is loaded and Allocation 1 is applied, so that no skewing is needed for the D blocks. In the second, a block row/column of A (or B) is loaded initially and Allocation 2 (or 3) is applied to compute the D blocks as a dot-product of the loaded block row/column and the corresponding block column/row of B (or A). Since we assumed that one matrix register can be loaded or stored during computing, the second way of computing the D blocks seems more efficient. Pseudo-code (iii) in Fig. 4 can be augmented to compute matrix D by adding an outer loop over 0 ≤ J ≤ ℓ_1 − 1, where ℓ_1 = ℓ/N, and replacing the segments x_K and y_I by the blocks B_{K,J} and D_{I,J}, respectively.
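Several algorithms of this section reuse the roll schedule of Allocation 1 with modified alignment. As one example, the vector reduction of Section 4.1 relies on the fact that, without the initial skew, the diagonal element c(i, i) accumulates the product of row i of the first operand with column i of the second. The Python sketch below checks this on a simulated torus. It is a simplified illustration with assumed names (`unaligned_mma`, `dot_on_torus`): the transpose is done with ordinary Python instead of the MMA-based transpose of [12], the final Skew and Reduce steps are replaced by a direct sum of the diagonal, and the vector length is assumed to be exactly N×N so that one matrix register per vector suffices.

```python
def roll_west(M):
    """Cyclically shift every row one position to the left."""
    return [row[1:] + row[:1] for row in M]

def roll_north(M):
    """Cyclically shift every column one position up."""
    return M[1:] + M[:1]

def unaligned_mma(A, B):
    """N multiply-add-roll steps with no initial skew; only the diagonal is used here."""
    N = len(A)
    c = [[0] * N for _ in range(N)]
    a, b = A, B
    for _ in range(N):
        for i in range(N):
            for j in range(N):
                c[i][j] += a[i][j] * b[i][j]
        a, b = roll_west(a), roll_north(b)
    return c

def dot_on_torus(x, y, N):
    """Dot product of two vectors of length N*N, one matrix register per vector."""
    # Column s of each register holds segment s (N consecutive elements).
    Rx = [[x[s * N + r] for s in range(N)] for r in range(N)]
    Ry = [[y[s * N + r] for s in range(N)] for r in range(N)]
    RxT = [list(col) for col in zip(*Rx)]     # stand-in for the in-array transpose
    Rc = unaligned_mma(RxT, Ry)
    return sum(Rc[i][i] for i in range(N))    # stand-in for Skew + Reduce

x = list(range(9))
y = list(range(1, 10))
print(dot_on_torus(x, y, 3), sum(p * q for p, q in zip(x, y)))
```

Only N of the N^2 accumulators carry useful results in this schedule, which is consistent with the observation in Section 5 that Level-1 speed is bounded by N rather than N^2 multiply-add operations per cycle.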
5
Analytical Results
We measure the performance of the matrix coprocessor on the selected BLAS operations by the computing speed, i.e., the ratio between the amount of work (in multiply-add operations) in the sequential algorithm and the parallel execution time. In our model we assume that one multiply-add-roll time-step, τ, takes one clock cycle, using loop unrolling and software pipelining techniques. The total execution time in clock cycles is the sum of the times spent in computing, load/store operations, and data manipulation operations such as Transpose, Skew, Copy (one-to-all broadcast), and Reduce (all-to-one collect). To keep pace with the processing speed of the 2D N×N torus AP, we assume that the local memory is tightly coupled with the PEs. The time to load/store one memory word is assumed to be c_1 clock cycles. The time to load/store an N×N block of elements into/from one matrix register is c_2 clock cycles. The time to load/store one N-element segment into/from one matrix register is c_3 clock cycles. Moreover, since N segments must be loaded/stored into/from one matrix register in the Level-1 BLAS operations, we assume that this takes c_4 clock cycles.
5.1
Expected Times and Performance Evaluation
We calculated the expected execution times, in clock cycles, of the selected BLAS operations as follows, where α = n/N^2, m_1 = m/N, n_1 = n/N, and ℓ_1 is the corresponding block count for the third matrix dimension of the matrix multiply-add.
Vector reduction: $3N\tau + \alpha\cdot\max(c_4, 3N\tau) + (\alpha-1)\cdot\max(c_4, N\tau) + c_1 + c_4$
Vector scaling: $N\tau + \alpha\cdot\max(c_4, N\tau) + c_1 + 2\alpha c_4$
Matrix-vector multiplication: $[n_1\cdot\max(c_2, N\tau) + \max(c_3, N\tau) + N\tau + c_3]\cdot m_1 + n_1\cdot\max(c_3, 2N\tau) + c_3$
Rank-one update: $(n_1-1)\cdot\max(c_3, 2N\tau) + \max(c_2, 2N\tau) + [n_1\cdot\max(c_2, N\tau) + N\tau + n_1 c_2 + c_3]\cdot m_1 + c_3$
Matrix multiply-add: $[m_1 N + n_1\cdot\max(c_2, 2N\tau) + m_1(n_1+1)\cdot\max(c_2, N\tau) + (m_1+1)c_2]\cdot \ell_1$
Fig. 5 shows the performance of the BLAS operations as the bandwidth of data load/store from/to the local memory of the coprocessor changes. When the MMA operation is used to perform Level-1 BLAS, the speed cannot exceed N multiply-add operations/cycle. The speed of the vector reduction is lower than that of the vector scaling, even when the load/store bandwidth associated with c_4 is increased to one word/cycle, because of the overhead of the Transpose operation needed to align data inside the AP. In Fig. 5(c-d), the speed of the Level-2 BLAS approaches the expected maximum, N multiply-add operations/cycle. However, the matrix-vector operation converges faster because it uses Allocation 2, which keeps the current result block (the C operand of the MMA operation) in the matrix registers as long as possible before it is stored back to the local memory, whereas the rank-one update uses Allocation 1, which requires storing the block of C immediately after its update. In Fig. 5(e), the speed of the n×n MMA operation (i.e., Level-3 BLAS) approaches the peak of the coprocessor, N^2 multiply-add operations/cycle. The
Fig. 5. Performance of the matrix coprocessor, N = 8 on: (a-b) Level-1 BLAS, (c-d) Level-2 BLAS, (e) Level-3 BLAS
main reason is the high degree of data reuse compared to the other two BLAS levels and the hiding of the alignment overhead behind data load/store. Moreover, we see that decreasing the time of loading an N×N block, c_2, below N cycles does not degrade the performance, because the max(c_2, Nτ) terms of the execution time then stay at Nτ. However, increasing c_2 above N cycles has a great impact on the performance.
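The execution-time expressions of Section 5.1 can be evaluated numerically. The following Python sketch transcribes them under stated assumptions: the function names are hypothetical, the dimensions are assumed to be multiples of N, the work counts used for the speed (n, m·n, and m·n·l multiply-adds) and all parameter values are illustrative choices rather than figures from the paper, and the trailing factor of the matrix multiply-add expression is read as the number of block columns l/N of the second operand.

```python
def t_reduction(n, N, c1, c4, tau=1):
    alpha = n // N**2
    return (3*N*tau + alpha*max(c4, 3*N*tau)
            + (alpha - 1)*max(c4, N*tau) + c1 + c4)

def t_scaling(n, N, c1, c4, tau=1):
    alpha = n // N**2
    return N*tau + alpha*max(c4, N*tau) + c1 + 2*alpha*c4

def t_matvec(m, n, N, c2, c3, tau=1):
    m1, n1 = m // N, n // N
    return ((n1*max(c2, N*tau) + max(c3, N*tau) + N*tau + c3)*m1
            + n1*max(c3, 2*N*tau) + c3)

def t_rank_one(m, n, N, c2, c3, tau=1):
    m1, n1 = m // N, n // N
    return ((n1 - 1)*max(c3, 2*N*tau) + max(c2, 2*N*tau)
            + (n1*max(c2, N*tau) + N*tau + n1*c2 + c3)*m1 + c3)

def t_mma(m, n, l, N, c2, tau=1):
    # l is the number of columns of the second operand (assumed reading).
    m1, n1, l1 = m // N, n // N, l // N
    return (m1*N + n1*max(c2, 2*N*tau)
            + m1*(n1 + 1)*max(c2, N*tau) + (m1 + 1)*c2) * l1

N = 8                                   # array size used in Fig. 5
for n in (64, 256, 800):
    speed = n**3 / t_mma(n, n, n, N, c2=N)      # multiply-adds per cycle
    print(n, round(speed, 2), "of peak", N * N)
print("dot:", round(6400 / t_reduction(6400, N, c1=1, c4=N), 2))
print("matvec:", round(800 * 800 / t_matvec(800, 800, N, c2=N, c3=1), 2))
```

With c_2 = N the matrix multiply-add speed climbs toward N^2 as n grows, whereas the Level-1 and Level-2 speeds stay at or below N, in line with the trends reported for Fig. 5.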
6
Conclusions
In this paper, a coprocessor that optimally executes the matrix-formulated kernels in scientific, engineering, and signal/image applications has been introduced. We designed algorithms to implement some important BLAS operations and analytically evaluated their performance on the coprocessor. The main idea behind our design is that the matrices and vectors are divided, respectively, into N×N blocks and N-element segments that are stored in the local memory of the coprocessor and moved as contiguous blocks to/from the AP. Once loaded into the
matrix registers of the AP, appropriate N×N MMA operations can be applied to get the results. Compared to Level-1 and Level-2 BLAS, the Level-3 BLAS (represented by the n×n MMA operation) showed near-peak performance on the coprocessor because of the high degree of data reuse and the hiding of the alignment overhead behind data load/store. The performance of Level-1 BLAS is the lowest among the three levels because of the very low degree of data reuse and the cost of data alignment (transpose) before computing. Fortunately, many applications are based on intensive use of Level-3 BLAS with small percentages of Level-1 and Level-2 BLAS. Therefore, our matrix coprocessor is a good candidate to accelerate these applications.
Acknowledgments. The authors would like to thank the Telecommunications Advancement Foundation (TAF) for the partial support of this research.
References
1. Gustafson, J.L., Greer, B.S.: A hardware accelerator for the Intel Math Kernel. White paper, ClearSpeed Technology Inc. (2006), http://www.clearspeed.com
2. Chen, T., Raghavan, R., Dale, J.N., Iwata, E.: Cell broadband engine architecture and its first implementation - A performance view. IBM J. Res. Dev. 51(5) (2007)
3. The Berkeley Intelligent RAM (IRAM) Project, http://iram.cs.berkeley.edu
4. Lawson, C.L., Hanson, R.J., Kincaid, R.J., Krogh, F.T.: Basic linear algebra subprograms for FORTRAN usage. ACM Trans. Math. Software 5, 308–323 (1979)
5. Dongarra, J.J., Croz, J.D., Hammarling, S., Hanson, R.J.: An extended set of FORTRAN basic linear algebra subprograms. ACM Trans. Math. Software 14, 1–17 (1988)
6. Dongarra, J.J., Croz, J.D., Duff, I., Hammarling, S.: A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Software 16, 1–17 (1990)
7. Dongarra, J.J., Duff, I.S., Sorensen, D.C., van der Vorst, H.A.: Numerical Linear Algebra for High-Performance Computers. SIAM, Philadelphia (1998)
8. Dally, W., Towles, B.: Principles and Practices of Interconnection Networks. Elsevier, Amsterdam (2004)
9. Zekri, A.S., Sedukhin, S.G.: The general matrix multiply-add operation on 2D torus. In: The 20th IEEE IPDPS Symp., PDSEC 2006 Workshop, Rhodes Island, Greece. IEEE Computer Society, Los Alamitos (2006)
10. Kung, S.: VLSI Array Processors. Prentice-Hall, Englewood Cliffs (1988)
11. Cannon, L.: A Cellular Computer to Implement the Kalman Filter Algorithm. PhD thesis, Montana State University (1969)
12. Zekri, A.S., Sedukhin, S.G.: Matrix transpose on 2D torus array processor. In: The 6th IEEE International Conference on Computer and Information Technology, Seoul, Korea, p. 45. IEEE Computer Society, Los Alamitos (2006)
13. Golub, G.H., Van Loan, C.F.: Matrix Computations. Johns Hopkins, Baltimore, Maryland (1996)
High Throughput Comparison of Prokaryotic Genomes
Luciana Carota (1), Lisa Bartoli (2), Piero Fariselli (2), Pier L. Martelli (2), Ludovica Montanucci (2), Giorgio Maggi (3), and Rita Casadio (2)
(1) CNAF-INFN, National Institute of Nuclear Physics, Bologna, Italy, [email protected], http://www.infn.it/indexen.php
(2) Biocomputing Group, University of Bologna, Italy, http://www.biocomp.unibo.it
(3) BA-INFN, National Institute of Nuclear Physics, Bari, Italy
Abstract. This work addresses the optimization of grid computing performance for a data-intensive, high-"throughput" comparison of protein sequences. We borrow the word "throughput" from telecommunication science to mean the number of concurrent independent jobs in the grid. All the proteins of 355 completely sequenced prokaryotic organisms were compared to find common traits of prokaryotic life, producing in parallel tens of gigabytes of information to store, duplicate, check, and analyze. To support a large number of concurrent runs with data access on shared storage devices and a manageable data format, the output information was stored in many flat files according to a semantic logical/physical directory structure. As many concurrent runs can cause a reading bottleneck on the same storage device, we propose methods to optimize the grid computation based on the balance between wide data access and the emergence of reading bottlenecks. The proposed analytical approach has the following advantages: it not only optimizes the duration of the overall task, but also checks whether the estimated duration is compliant with the scientific requirements and whether the related grid computation is really advantageous compared to an execution on a local farm.
1
Introduction
The development of new technologies for genome sequencing has helped to disclose a large amount of knowledge about the complete set of genes and proteins in many different species, ranging from unicellular organisms to plants and animals. The availability of genomic data in public databases [1] allows comprehensive, large-scale comparative investigations that address many research areas in molecular biology and systems biology, such as functional and structural genomic annotation [2,3,4]. In this work, the completely sequenced bacterial genomes were considered in order to understand common traits of prokaryotic life. Several hundred genomes, corresponding to about one million protein sequences, were analyzed using grid technology. R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1200–1209, 2008. © Springer-Verlag Berlin Heidelberg 2008
Grid computing is based on flexible, secure, coordinated, and large-scale resource sharing among dynamic collections of individuals, institutions, and resources. In this context, a Virtual Organization (VO) is a community of people who share common computing resources and usage policies because they have the same scientific or economic aims. The grid middleware includes protocols, services, and application programming interfaces that are categorized according to their roles in enabling resource sharing [7,8]. We employed grid technology because it offers a huge number of parallel resources, which makes it more suitable for challenging genome comparisons than a local farm. 355 input files were extracted from the public database of the National Center for Biotechnology Information of the United States (NCBI, updated on 31st August 2006) [5], where each file contains all the protein sequences of a genome in FASTA format [6]. For the biological task, detailed in the next paragraph, the data are organized in about one million query files, each one containing one sequence. All the sequences are also included in an overall database. Input data download and management were accomplished outside the grid environment. The query files and the overall database were stored and replicated on grid storage devices (Storage Elements, SEs) according to a hierarchical name-space structure of Logical File Names (LFNs), mapped to the respective physical positions through a grid Logical File Catalog (LFC) [10]. Our task was organized in three main phases: the comparison of sequences by the BLAST program [11], the symmetry check of the BLAST output files, and the analysis of the results. In this work we describe how the grid infrastructure was used in the first and second phases in order to optimize its performance. The first and second phases are very challenging, given the highly parallel scale of the involved runs: about one million and one hundred million independent operations, respectively.
The high concurrency was reduced by efficiently grouping elementary computing steps inside a single shell script. A script corresponds to the elementary grid job, and it is sent through specialized grid elements towards the final element, named Worker Node (WN). The WNs are the grid elements where jobs run. Once the single scripts are defined, the grid middleware allows the users to manage one or many of them through a single grid-specific file, written in the Job Description Language (JDL) [13]. The Workload Management System (WMS), the grid element that accepts user jobs, assigns them to the most appropriate local grid site, records their status, and retrieves their output, supports JDL jobs of Collection and DAG type, where many nodes are grouped in a single structure [12]. The grid middleware used in this task is gLite 3.0 [9], where JDL files of Collection and DAG type show exactly the same performance when composed of unrelated nodes. Therefore, they are used interchangeably. The time needed to complete the run of a script (wall time) depended not only on the state of the grid, the computing power of the WN, and the number of runs included in a single script, but also on the presence of bottlenecks when accessing input data. We carefully analyzed the script wall times to check the
L. Carota et al.
presence of bottlenecks and, consequently, to propose a method for estimating the optimal amount of data to be stored in a single SE and the required amount of data replication. It is important to note that bottlenecks not only increase the duration of the considered task, but also slow down access to the shared SE for all grid users.
2
Methods on Grid
As each protein sequence was compared by BLAST to the overall sequence database, the distribution of input query files and output files on SEs was engineered to address the following requirements:
– user-friendly access that reflects the structure and organization of the original NCBI flat files;
– a data format suitable for BLAST comparisons;
– efficient recording of a high number of output files coming from independent and parallel BLAST runs.
A hierarchical name-space structure of Logical File Names (LFNs) was adopted to support the mapping between directories and original genomes. The correspondence between logical paths and physical paths is recorded in the grid File Catalog (LFC). SEs also store index files providing the mapping of the biological names of genomes and sequences to the directory names and files in the LFC logical name-space. The index file supports data retrieval during the analysis. Data were replicated to different SEs, and the corresponding replica logical file names were recorded in the LFC catalog. Figure 1 illustrates the adopted data structure. Once the input information was organized on an SE, the computing task was submitted from a grid User Interface (UI), a set of executables that give access to the grid resources. The UI sends the JDL files to the WMS, which sends the executable scripts to the local grid computing sites. The first step of BLAST comparisons consists of one million independent readings from the SE and a corresponding one million result writings. As already mentioned, the throughput of concurrent executables was optimized by grouping different elementary operations. In particular, 10^3 BLAST comparisons are executed in a loop on a single WN to cut down the number of executables from 10^6 to 10^3. This grouping does not reduce the number of SE data accesses, as each BLAST comparison still needs to download a single query sequence from the SE: grouping only implies the serial execution of elementary operations.
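The effect of grouping can be illustrated with a short Python sketch. It is only a schematic model: the function name `batch_queries` is hypothetical, queries are identified by consecutive integers for simplicity, and no grid or BLAST call is made.

```python
def batch_queries(num_queries, batch_size):
    """Split query indices 0..num_queries-1 into consecutive (start, stop) ranges."""
    return [(start, min(start + batch_size, num_queries))
            for start in range(0, num_queries, batch_size)]

jobs = batch_queries(10**6, 10**3)
print(len(jobs))                                   # number of grid executables
print(sum(stop - start for start, stop in jobs))   # query downloads from the SE
```

The number of executables drops by the batch size, while the total number of query downloads stays at one per comparison, which is why grouping alone does not relieve the SE.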
As shown in the following section, this data flow management can potentially give rise to reading bottlenecks, which increase the expected run duration. In the majority of cases this negative feature can be eliminated by compressing all the needed query files into one file and downloading only the compressed data from the SE. Nevertheless, for just 10^3 concurrent scripts, data compression does not give significant advantages for the grid computation. The second part of the task consisted of verifying, and if necessary correcting, the
Fig. 1. Data structure on Storage Elements (SEs) and the grid File Catalog (LFC). In the directory $LFC_PATH/user_input/, query sequences were recorded and organized into sub-directories named by the genome number (NumG). In addition, BLAST output files in user_output were stored according to the structure of the query sequences in user_input; NumG ranges from 1 to 355, while NumS is the sequence number. An index file maps biological names to tagging numbers.
symmetry of BLAST output files. Each file is composed of 100 rows on average, and each row contains several columns: the first column always holds the query sequence used in the comparison with the total database, the second column contains the extracted sequence that shows a good match, and the other columns hold the scores of the match. As the results must be symmetric, for each line of a BLAST output file we expect a different output file containing an equivalent line, with the same two protein sequences reported in reverse order. Therefore, the symmetrization process consists of checking the symmetry in the output file that comes from the comparison between the second sequence of the considered line and the overall database. As each output file contains 100 lines on average, this part of the task involves at least 100 × 10^6 independent computing steps (10^6 is the number of BLAST comparisons). Such elementary operations are grouped in a loop of 10 iterations to analyze a total of 10 × 100 lines in a single script, reducing the grid concurrency to 10^5 executables. The high throughput of the second phase does not allow good performance if it is based on many concurrent readings of BLAST output files. In fact, checking the symmetry of just one output file involves 100 additional data readings on the SE. In order to prevent a massive number of concurrent data accesses, the genome directories were compressed and downloaded just once during the run of a script on the WN. This optimization cuts down the number of data accesses within an executable to a maximum of 355, as checking the symmetry of all the sequences
included in a genome requires the download from the SE of the entire directory of the corresponding genome. In order to increase the amount of available computing resources, the high concurrency of grid jobs was also supported by running JDLs from different Virtual Organizations at the same time. In addition, reading bottlenecks were prevented by replicating all the data on different SEs. In the following section an analytical approach is proposed to verify the presence of bottlenecks and consequently to optimize the organization of the whole computing task.
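The symmetry check described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' script: the in-memory dictionary `hits`, which maps each query sequence to the set of sequences listed in its BLAST output file, and the function name `asymmetric_pairs` are hypothetical.

```python
def asymmetric_pairs(hits):
    """Return the (query, match) pairs whose reverse pair is missing.

    `hits` maps a query sequence identifier to the set of sequence
    identifiers listed in its BLAST output file (hypothetical layout).
    """
    missing = []
    for query, matches in hits.items():
        for match in matches:
            # The output file of `match` must list `query` in turn.
            if query not in hits.get(match, set()):
                missing.append((query, match))
    return sorted(missing)


# Toy data: s1 <-> s2 is symmetric, s1 -> s3 has no reverse line.
hits = {
    "s1": {"s2", "s3"},
    "s2": {"s1"},
    "s3": set(),
}
print(asymmetric_pairs(hits))
```

Each lookup of `hits[match]` stands for one of the additional readings mentioned in the text, which is why downloading a whole genome directory once per script reduces the number of accesses to the SE.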
3 Grid Performance Results
The wall time of individual jobs distributed on the production grid was recorded and analyzed during the BLAST comparison phase. An elementary grid job corresponds to a shell script with three input arguments: the first argument is the genome identifier, and the last two set the range of the analyzed sequences. The purpose of the study is to analyze grid behavior in order to estimate the advantages of concurrent execution and to identify bottlenecks during accesses to the shared SE. It is necessary to emphasize that the analyzed data have a very heterogeneous origin, as the WNs are installed on different hardware architectures and jobs of different owners can run at the same time on the same WN. Nevertheless, the following analysis shows that this random origin acts only as a noisy background that does not hide significant bottleneck behaviors. The distribution of durations for equivalent jobs, each one including $10^3$ BLAST runs, spans from one hour to about 20 hours. In order to discover general factors causing different durations, the grid behaviour of the simplest case of BLAST comparison was analyzed: figure 2 shows the statistical analysis of a large number of scripts that include just one BLAST comparison. It is important to note that a script of $10^3$ BLAST comparisons is not equivalent to $10^3$ scripts with one BLAST comparison each, because the run of an individual script always requires the download of the whole database, of the BLAST program itself and of additional configuration files. In order to check a possible correlation of run durations with WNs, in figure 2 the hostnames related to the faster executions are reported on a few histogram bins. The duration depends slightly on the host, but the host is not the principal cause of the different durations, as similar hosts of the same local farm can behave in completely different ways.
The tests therefore show that, in spite of the heterogeneity of the grid system, a probable explanation for the large span of wall times is the emergence of bottlenecks during accesses to the SE. Verifying this hypothesis requires an analytical study of grid dynamics, which is a challenging task given the heterogeneous nature of the grid environment. In figure 3 the time traces of the grid jobs executing a single BLAST
[Figure 2: histogram of job counts (vertical axis) against wall clock time on WN [sec] (horizontal axis), with Worker Node hostnames annotated on selected bins.]
Fig. 2. Duration histogram for equivalent scripts composed of one BLAST comparison. In order to understand whether the duration span depends on the Worker Nodes where jobs run, the names of a few of the faster hosts are reported on the related bins. The arrows point to hosts that show completely different behaviour, evidencing only a very slight dependence of the duration on the host. The hypothesis is that the longer durations are related to bottlenecks in the access to shared storage devices.
comparison (considered in figure 2) are shown. Each horizontal line plots the duration of one script; scripts are enumerated on the vertical axis and the horizontal axis represents the absolute time. Each number on the vertical axis identifies one and only one job. Some parts of figure 3 show a bursting, semi-periodic behaviour, probably due to queues of jobs which, after a waiting period, start to run in a serial progression. The queue duration is a time interval that starts when the job is created in the Local Resource Management System, a system that accepts shell scripts with control attributes, protects the job until it is run and delivers the output back to the submitter. The queue duration interval ends when the job starts to run on the WN. This complexity was analyzed by dividing the absolute time axis into small identical intervals. In order to allow a good approximation of instantaneous measures, the chosen unit interval is as long as the minimum job duration. In figure 4 the number of jobs running concurrently in the grid is counted in each unit time interval along the entire duration of the computing task (55 hours). On average 20 concurrent jobs run in the grid, because the queue times prevent all the submitted jobs (in this case about 4000) from starting to run within a short time interval. In figure 5 different measures are reported versus the number of concurrent jobs. Black points measure the service probability, defined as the number of job conclusions in a time interval divided by the total number of parallel jobs in the same
[Figure 3: job tag number (vertical axis) against time in minutes (horizontal axis).]
Fig. 3. Time traces of 3888 grid scripts executing a single BLAST comparison: each horizontal line corresponds to the duration of the script on a Worker Node; the scripts are enumerated on the vertical axis and the horizontal axis represents the absolute time.
interval. The gray line is the average time duration. Figure 5 shows that, as the number of concurrently running jobs increases, service probabilities decrease, whereas the mean run duration tends to increase. The duration increase is more evident and more reliably measured where the number of black points is higher (fewer than 40 concurrent jobs). At the same number of parallel jobs there are different trends of the black points, probably due to different rates of job conclusions, which depend on the features of the local sites. These different rates can be related to the bursting properties of figure 3. In summary, figure 5 helps to fix a few important points.
– The opposite trends of the service probability and of the mean duration demonstrate bottlenecks in SE data access.
– Through a linear interpolation of the gray line of figure 5, the relation between the mean duration and the number of concurrent jobs can be measured. The longest duration that preserves the advantages of running jobs in grid instead of on a local farm, and the corresponding number of concurrent jobs, can also be extracted.
Once an upper limit on the number of concurrent jobs that access a single SE has been experimentally measured against the longest accepted job duration in grid, the required high throughput is preserved by replicating all the data to other SEs as well. The replication number can be calculated as the absolute maximum number of concurrent jobs in figure 5 divided by the previously extracted upper limit of concurrent jobs.
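The replication estimate outlined above can be sketched in a few lines. The sample values below are invented for illustration and are not taken from figure 5; the function names `fit_line` and `replica_count` are hypothetical, and a plain least-squares line stands in for the linear interpolation mentioned in the text.

```python
import math


def fit_line(xs, ys):
    """Least-squares slope and intercept of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def replica_count(concurrency, mean_duration, longest_accepted):
    """Return (per-SE concurrency limit, number of replicas)."""
    slope, intercept = fit_line(concurrency, mean_duration)
    if slope <= 0:
        raise ValueError("duration does not grow with concurrency")
    # Largest number of concurrent jobs whose fitted duration is acceptable.
    limit = math.floor((longest_accepted - intercept) / slope)
    if limit < 1:
        raise ValueError("accepted duration is below the fitted minimum")
    # Replicas = absolute maximum concurrency / per-SE concurrency limit.
    return limit, math.ceil(max(concurrency) / limit)


# Invented sample: the mean duration grows by 10 s per extra concurrent job.
concurrency = [10, 20, 30, 40, 50, 60]
mean_duration = [300, 400, 500, 600, 700, 800]
limit, replicas = replica_count(concurrency, mean_duration, 500)
print(limit, replicas)
```

Rounding the ratio up is a design choice made here so that no SE is asked to serve more than the measured limit of concurrent jobs.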
Fig. 4. Counts of the number of concurrent jobs in grid. The data used for the histogram are the same as in figure 3. The horizontal axis shows the absolute time and the bin width is as long as the minimum wall time. On average there are about 20 concurrently running jobs.
Fig. 5. Various measures vs. number of concurrent jobs. The black points measure the service probability, defined as the number of concluded jobs in an interval divided by the total number of concurrent jobs in the same interval; the gray line shows the average time duration versus the number of concurrent jobs. The axis on the left is for the service probability (black points); the gray axis on the right is for the job duration; the horizontal axis is shared between the two measures.
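The quantities plotted in figures 4 and 5 can be computed from job start and end times roughly as follows. This is a sketch under assumptions: the `(start, end)` tuples, the overlap rule used to decide whether a job is running in an interval, and the name `interval_stats` are illustrative choices, not details given in the text.

```python
def interval_stats(jobs, width):
    """Per-interval concurrency and service probability.

    `jobs` is a list of (start, end) run times in seconds; `width` is the
    unit interval, chosen in the text as the minimum job duration.
    Returns a list of (running, finished, probability) tuples.
    """
    horizon = max(end for _, end in jobs)
    stats = []
    t = 0
    while t < horizon:
        t_next = t + width
        # A job counts as running if it overlaps [t, t_next).
        running = sum(1 for s, e in jobs if s < t_next and e > t)
        # A job concludes in the interval if its end falls in (t, t_next].
        finished = sum(1 for s, e in jobs if t < e <= t_next)
        prob = finished / running if running else 0.0
        stats.append((running, finished, prob))
        t = t_next
    return stats


# Invented traces; the shortest job lasts 100 s, which fixes the bin width.
jobs = [(0, 100), (0, 200), (100, 200), (150, 400)]
width = min(e - s for s, e in jobs)
stats = interval_stats(jobs, width)
for row in stats:
    print(row)
```

Grouping the resulting probabilities by their `running` value gives the kind of scatter shown by the black points of figure 5.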
Although the experiment reported in the previous figures is a good example for introducing an analytical approach to the study of the emergence of bottlenecks, it does not clearly show the advantages of the grid. In fact, considering the best case, where the minimum grid job duration (in this case 140 sec) corresponds to the duration on a non-grid host and the number of jobs in the computing task is about 4000, we would have a non-grid wall time of 151 hours, which is comparable with the grid task duration of 55 hours. This behaviour depends on the fact that the computing task was carried out in the context of a National Virtual Organization. A faster computation can be obtained by increasing the amount of computing resources: submission from a European VO makes the speed of successful jobs 10 times larger than in a National VO. In this case the grid duration of the whole task would therefore have been about 6 hours, demonstrating that such a challenging BLAST comparison must run only in the context of a large Virtual Organization.
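The arithmetic behind these estimates can be checked with a short calculation. The figure of 151 hours is reproduced if the job count is taken as the 3888 scripts of figure 3 rather than the rounded value of 4000; that reading is an assumption made here, not a statement from the text.

```python
jobs = 3888            # scripts traced in figure 3 (assumed job count)
min_duration_s = 140   # minimum grid job duration, from the text
grid_hours = 55        # measured duration of the whole task on the grid
european_factor = 10   # speed gain quoted for a European VO

# Best-case serial run on a single non-grid host.
serial_hours = jobs * min_duration_s / 3600
# Grid duration if successful jobs ran ten times faster.
european_hours = grid_hours / european_factor

print(round(serial_hours, 1), european_hours)
```

With these inputs the serial estimate is about 151 hours and the European VO estimate is 5.5 hours, which the text rounds to about 6 hours.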
4 Conclusions
Besides being an example of a large scale BLAST comparison that is achievable only in a grid environment, this work underlines the importance of a detailed analytical study of grid dynamics in order to optimize the total duration of any grid computing task. During high throughput tasks, data access bottlenecks have to be managed, and an analytical approach is proposed for this purpose. The proposed method consists first of dividing the task into independent equivalent scripts, and then of submitting a part of them to the grid as a test sample, in order to extract the average queue duration and to trace the relation between the job run interval and the number of concurrent jobs (as in figure 5). Once an average queue time has been measured on the test sample, it is possible to estimate a "reasonable" upper limit of the single job duration from a plot similar to figure 5, taking into account that a reasonable grid run duration must be equal to or longer than the related queue duration. In figure 5, the estimated upper limit of the job duration (vertical axis) also corresponds to an upper limit of concurrent jobs (horizontal axis), which can be used to estimate the necessary number of replicas on other SEs. In fact, the number of replicas of all the data on different SEs is the absolute maximum number of concurrent jobs (also extracted from figure 5) divided by the estimated upper limit of concurrent jobs.
In summary, although it is very difficult to predict execution times on the grid exactly, as they depend first of all on the grid occupation state and in general on resource availability, optimized performance is based on three principal points:
– an efficient definition of the single executable scripts, which in general implies a run duration equal to or longer than the queue duration;
– data replication, which avoids excessively long run durations caused by bottleneck behaviour;
– submitting jobs in the context of a VO with a large amount of resources, which makes the previous two requirements possible.
This experience opens the opportunity to fix standard grid operational procedures that enlarge the BLAST comparisons as new genomes are sequenced, periodically filling specific databases with the new results. The analytical study of grid dynamics helps to make these procedures faster, in step with the technological progress of genome sequencing.
Acknowledgments. Our work was funded by the Italian project LIBI - International Laboratory of Bioinformatics.
References
1. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Wheeler, D.L.: GenBank. Nucleic Acids Res. 34(Database issue), D16–D20 (2006)
2. Ivakhno, S.: From functional genomics to systems biology. FEBS J. 274(10), 2439–2448 (2007)
3. Marsden, R.L., Lewis, T.A., Orengo, C.A.: Towards a comprehensive structural coverage of completed genomes: a structural genomics viewpoint. BMC Bioinformatics 8, 86 (2007)
4. Bateman, A., Valencia, A.: Structural genomics meets computational biology. Bioinformatics 22(19), 2319 (2006)
5. http://www.ncbi.nlm.nih.gov/
6. http://www.ncbi.nlm.nih.gov/BLAST/fasta.shtml
7. Foster, I., Kesselman, C., Tuecke, S.: The Anatomy of the Grid: Enabling Scalable Virtual Organizations. International J. Supercomputer Applications 15(3) (2001), http://www.globus.org/alliance/publications/papers/anatomy.pdf
8. Foster, I., Kesselman, C., Tuecke, S.: The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration. Open Grid Service Infrastructure WG, Global Grid Forum (June 22, 2002), http://www.globus.org/alliance/publications/papers/ogsa.pdf
9. gLite - Lightweight Middleware for Grid Computing, http://glite.web.cern.ch/glite/
10. gLite User Guide. Data Management, 107:130, https://edms.cern.ch/file/722398/1.1/gLite-3-UserGuide.pdf
11. Altschul, S.F., Gish, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990)
12. EGEE User's Guide. WMS Proxy Service, https://edms.cern.ch/file/674643/1/WMPROXY-guide.pdf
13. EGEE User's Guide. JDL Attributes Specification (submission via WMS WMProxy), https://edms.cern.ch/file/590869/1/EGEE-JRA1-TEC-590869-JDLAttributes-v0-8.pdf
A Parallel Classification and Feature Reduction Method for Biomedical Applications
Mario R. Guarracino, Salvatore Cuciniello, and Davide Feminiano
High Performance Computing and Networking Institute, Italian Research Council
Abstract. Classification is one of the most widely used methods in data mining, with numerous applications in biomedicine. The scope and the resolution of data involved in many real life applications require very efficient implementations of classification methods, developed to run on parallel or distributed computational systems. In this study we describe SVD-ReGEC, a fully parallel implementation, for distributed memory multicomputers, of a classification algorithm with a feature reduction. The classification is based on Regularized Generalized Eigenvalue Classifier (ReGEC) and the preprocessing stage is a filter method algorithm based on Singular Value Decomposition (SVD), that reduces the dimension of the space in which classification is accomplished. The implementation is tested on random datasets and results are discussed using standard parameters.
Keywords: Binary classification, Generalized Eigenvalue Classifier, Feature transformation.
1 Introduction
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1210–1219, 2008. © Springer-Verlag Berlin Heidelberg 2008
In the last decade the amount of information created in business and science has become very challenging to work with. New technologies that emerge every day make it possible to acquire data at almost every scale, from micro to macro, in very high resolution. While storing such large-scale information is a tough task, discovering its knowledge content is much harder and requires very efficient computational data processing methods. Due to size and efficiency problems, such very large databases can only be processed or mined using a group of connected computers (multicomputers) that run in parallel and communicate among themselves. Special algorithms must be designed in order to exploit their strong computational infrastructure. There are a number of comprehensive surveys on parallel implementations of widely used data mining and knowledge discovery methods and their application spectrum [1].
In the parallel and distributed computation domain, widely used general data mining methods include classification, clustering, association rules and graph mining, among which classification is the most commonly used method, with many applications in biomedicine. In supervised classification, a computational system learns to discriminate between different classes of data, based on the features of the samples and their class membership. After the system is trained, data points whose class memberships are unknown can be classified as members of one of the classes.
Biomedical data have the unusual feature of comprising a very large number of variables. For example, publicly available datasets, which are updated regularly, contain gene expression data for tens of thousands of genes. In the coming years this tendency is going to bring the need for classification algorithms that can handle such complexity. To solve the problem, feature selection or transformation methods have been introduced. Principal Component Analysis (PCA) [2], Singular Value Decomposition (SVD) [3] and statistical feature selection methods [4] are examples of techniques to reduce the curse of dimensionality.
Support Vector Machines (SVMs) [5] are state-of-the-art supervised classification methods and they have been widely accepted in many application areas. Given a set of data such that each sample belongs to one of two classes, the SVM method finds a hyperplane that separates the points belonging to different classes, while maximizing the distance of the hyperplane from the convex hull of both sets of points. Finding such a hyperplane requires solving a Quadratic optimization Problem (QP), for which very efficient solution techniques are available. Small problems can be solved using general purpose QP solvers, whereas large-scale problems require more efficient methods like subset selection [6] and decomposition [7]. Parallel and distributed implementations of SVMs are also available [8], but they all share problems related to the efficiency of the algorithms and the fact that the solution may vary with the number of computing nodes involved in the computation.
Algorithms such as general proximal SVMs [9] exploit the special structure of the optimization problem by defining it as a generalized eigenvalue problem. This formulation differs from standard SVMs in that, instead of finding one separating hyperplane between the classes, it finds two hyperplanes, each of which approximates its own class by minimizing the distance from the elements of that class and maximizing the distance from those of the other class. The prior study by Mangasarian et al. [9] requires the solution of two different eigenvalue problems. The regularized general eigenvalue classifier (ReGEC) [10] reduces this to the solution of a single eigenvalue problem, and thus halves the execution time. In this study, a parallel implementation of the regularized general eigenvalue classifier, which makes use of SVD for feature reduction, is introduced. The proposed method is first compared with other standard classification methods and then tested on a very large synthetic dataset. Results regarding its efficiency are reported.
The remainder of the paper is organized as follows. In Section 2, we describe the SVD method. In Section 3, we define ReGEC and give its solution. Then we describe the classification accuracy of ReGEC and SVD-ReGEC to show the consistency of the methods. In Section 4, we introduce the parallel implementation of the proposed method in detail. In Section 5, we present the preliminary computational results of the parallel implementation on a large synthetic dataset, using different numbers of computing nodes. Finally, in Section 6, we conclude and address future work directions.
2 Singular Value Decomposition
The singular value decomposition is a linear algebraic method that factorizes a matrix as the product of three matrices. It reveals the rank of the matrix, thus allowing the detection of relevant features and the discarding of noise and redundancy. The SVD of a matrix $X \in \mathbb{R}^{m \times n}$ with rank $r \le \min\{m, n\}$ is:
$$X = U \Sigma V^T \qquad (1)$$
$$U^T X V = \Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_k) \in \mathbb{R}^{m \times n} \qquad (2)$$
with $\sigma_1 \ge \ldots \ge \sigma_k \ge 0$ called the singular values of $X$ and $k = \min\{m, n\}$. A significant relation about the SVD is:
$$X^{(l)} = \sum_{k=1}^{l} u_k \sigma_k v_k^T \qquad (3)$$
where $u_k$ and $v_k$ are the $k$-th columns of $U$ and $V$. Equation (3) is the closest rank-$l$ matrix approximation of $X$ and it allows SVD to be used for dimensional reduction. In fact, considering a matrix $X$ with features on columns, the right singular matrix contains an eigenvector basis for the space. Taking only the first $l$ rows of $V^T$, with $l \le m$, the representation of $X$ in the subspace spanned by $V_l^T$ is $V_l^T X$. The choice of the $l$ eigenvectors needed to represent the data variations can be made by evaluating the percentage of information, defined as the fraction of the singular values related to the selected $l$ eigenvectors and calculated by:
$$p_l = \sum_{j=1}^{l} \sigma_j^2 \Big/ \sum_{k=1}^{r} \sigma_k^2. \qquad (4)$$
In this way, the product between $V$ and $X$ returns a matrix of dimensions $n \times l$.
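The selection rule of equation (4) can be illustrated with a short sketch. The singular values below are invented, and the threshold of 0.9 together with the function names `retained_fraction` and `smallest_l` are assumptions made for the example; the text does not state which threshold was used.

```python
def retained_fraction(singular_values, l):
    """p_l from equation (4): share of the squared singular values kept."""
    squares = [s * s for s in singular_values]
    return sum(squares[:l]) / sum(squares)


def smallest_l(singular_values, threshold):
    """Smallest l whose retained fraction p_l reaches `threshold`."""
    for l in range(1, len(singular_values) + 1):
        if retained_fraction(singular_values, l) >= threshold:
            return l
    return len(singular_values)


# Invented singular values, sorted in decreasing order.
sigma = [4.0, 2.0, 1.0, 0.5]
print(retained_fraction(sigma, 1))
print(smallest_l(sigma, 0.9))
```

Singular values equal to zero contribute nothing to either sum, so passing all $k$ values gives the same result as summing up to the rank $r$.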
3 ReGEC Algorithm
The SVM algorithm basically finds a hyperplane that maximizes the margin between two classes of data, defined as the distance between the hyperplane and the closest data points of either class. This hyperplane is typically defined by a relatively small subset of data points from both classes, which are called support vectors. SVMs are supervised algorithms, therefore the separating hyperplane is found using the data points whose classes are known.
Let the points from the two classes be defined as two matrices $A \in \mathbb{R}^{n \times m}$ and $B \in \mathbb{R}^{k \times m}$, where each data point has $m$ features. Then, the optimal separating hyperplane can be obtained by solving the following quadratic optimization problem:
$$\min f(w) = \frac{w^T w}{2} \quad \text{s.t.} \quad (Aw + b) \ge e, \quad (Bw + b) \le -e, \qquad (5)$$
where $e$ is a vector of ones of appropriate dimension. An alternative approach, as proposed by Mangasarian et al. [9], is based on finding the hyperplane closest to the points of each class and the furthest from the points of the other one. The class membership of a new point can be found by comparing its distance from the two hyperplanes. The hyperplane for class A can be found by solving the following optimization problem:
$$\min_{w, \gamma \ne 0} \frac{\|Aw - e\gamma\|^2}{\|Bw - e\gamma\|^2}. \qquad (6)$$
In the nonlinear case, using a Gaussian kernel, the kernel matrix can be defined as
$$K(A, B)_{i,j} = e^{-\frac{\|A_i - B_j\|^2}{\sigma}}. \qquad (7)$$
Letting
$$C = [A^T \; B^T]^T,$$
problem (6) becomes
$$\min_{u, \gamma \ne 0} \frac{\|K(A, C)u - e\gamma\|^2}{\|K(B, C)u - e\gamma\|^2}. \qquad (8)$$
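The Gaussian kernel matrix of equation (7) can be written down directly. The sketch below uses plain Python lists for the rows of the matrices; the function name `gaussian_kernel` and the toy points are assumptions for illustration and are unrelated to the Matlab code used by the authors.

```python
import math


def gaussian_kernel(A, B, sigma):
    """K[i][j] = exp(-||A_i - B_j||^2 / sigma), as in equation (7)."""
    K = []
    for a in A:
        row = []
        for b in B:
            sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
            row.append(math.exp(-sq_dist / sigma))
        K.append(row)
    return K


# Two points of one class, one point of the other, two features each.
A = [[0.0, 0.0], [1.0, 0.0]]
B = [[0.0, 1.0]]
C = A + B                      # rows of A stacked on top of rows of B
K = gaussian_kernel(A, C, sigma=2.0)
for row in K:
    print(row)
```

With $n$ rows in `A` and $n + k$ rows in `C`, the result has the $n \times (n + k)$ shape that appears in problem (8).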
In this case the eigenvalue problem can be singular, since it involves a matrix of order n + k + 1 and rank at most m, and therefore a regularization is required. The conditions and the correctness of the solution can also be found in [10]. The proposed classification method is outlined in Fig. 1. Here, kernel(A, C, σ) is the kernel matrix whose entries are described in (7), where $A_i$ is the i-th row of matrix A and $C_j$ is the j-th row of matrix C, and σ is the shape parameter of the kernel. Function ones(nrow, ncol) is a matrix of size nrow × ncol with all entries 1, and diag(·) returns the main diagonal of a square matrix. To compare the classification accuracy of SVD-ReGEC with other methods, a sequential Matlab code has been implemented. The Matlab function eig for the solution of the generalized eigenvalue problem is used as the computational kernel of ReGEC. The Matlab functions svmtrain and svmclassify are used to compute SVM. The classification algorithms K-Nearest Neighbors (KNN) [11] and Kernel Linear Discriminant Analysis (KLDA) are included in the open-source software package MatlabArsenal [12]. Those algorithms have been tested on the following data sets: Breast cancer (Brca2) [13], High-grade glioma (Nutt) [14], Thyroid and Wisconsin Prognostic Breast Cancer (WPBC) [15].
Let $A \in \mathbb{R}^{n \times m}$ and $B \in \mathbb{R}^{k \times m}$ be the training points in each class.
Choose appropriate $\delta$ and $\sigma \in \mathbb{R}$
% Feature reduction with SVD
tot = n + k;
train = [A; B];
{U, S, V} = SVD(train);
train = V_n^T * train;
A = train(trainl == +1);
B = train(trainl == -1);
% Build G and H matrices
g = [kernel(A, C, σ), -ones(n, 1)];
h = [kernel(B, C, σ), -ones(k, 1)];
G = g' * g;
H = h' * h;
% Regularize the problem
G* = G + δ * diag(H);
H* = H + δ * diag(G);
% Compute the classification hyperplanes
[V, D] = eig(G*, H*);
Fig. 1. SVD-ReGEC algorithm
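The "Build G and H" and "Regularize" steps of Fig. 1 can be sketched on small matrices as follows. The sketch assumes that G and H are the products of g and h with their own transposes, and that diag(·) contributes a diagonal matrix holding the main diagonal of its argument; the helper names `gram` and `add_scaled_diagonal` and the numeric values are hypothetical.

```python
def gram(g):
    """Return g^T g for a matrix stored as a list of rows."""
    cols = len(g[0])
    return [[sum(row[i] * row[j] for row in g) for j in range(cols)]
            for i in range(cols)]


def add_scaled_diagonal(M, N, delta):
    """Return M + delta * D, where D keeps only the main diagonal of N."""
    size = len(M)
    return [[M[i][j] + (delta * N[i][i] if i == j else 0.0)
             for j in range(size)] for i in range(size)]


# Invented 2x2 stand-ins for the [kernel(., C, sigma), -ones] blocks.
g = [[1.0, -1.0], [0.5, -1.0]]
h = [[0.2, -1.0], [1.0, -1.0]]
G, H = gram(g), gram(h)

delta = 0.1                     # regularization parameter (assumed value)
G_star = add_scaled_diagonal(G, H, delta)
H_star = add_scaled_diagonal(H, G, delta)
print(G_star)
print(H_star)
```

Only the diagonal entries change, so the regularized matrices stay symmetric, which is what a symmetric generalized eigenvalue solver expects.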
Table 1 reports, for each data set, the name, the sizes of the training and test sets, the number of features m, the distribution of points between the two classes and the trivial (null) classification value in percent. Tests have been performed by training the algorithms on a random sample of 90% of the dataset and validating the accuracy on the remaining 10%. Tests have been repeated 100 times and mean accuracy results are shown. In Table 2, the classification accuracy percentage is evaluated for SVM, KNN, KLDA and ReGEC. The same algorithms have also been applied after using SVD to reduce the number of features in a preprocessing stage. For the methods without the preprocessing stage, results are better for datasets with fewer features. Indeed, a large number of features increases the space and time complexity, but it also leads to problems of overfitting and underfitting. When SVD is used, classification accuracy compares well with the previous results.

Table 1. Dataset description

Dataset   train  test  m      1   -1   null(%)
Brca2     20     2     3226   8   11   34
Nutt      45     5     12625  28  22   44
WPBC      99     11    32     41  69   31
Thyroid   140    75    5      65  150  30
Table 2. Comparison of classification accuracy (%)

Dataset   SVM    KNN    KLDA   ReGEC  SVD-SVM  SVD-KNN  SVD-KLDA  SVD-ReGEC
Brca2     83.00  75.50  62.50  73.00  62.50    80.50    62.50     86.00
Nutt      54.00  71.80  54.00  46.00  54.00    71.80    54.00     72.00
WPBC      60.27  50.82  63.64  60.18  60.27    47.82    60.27     60.27
Thyroid   94.72  95.64  70.25  92.76  94.72    95.64    70.25     92.76
Banana    89.15  86.36  73.34  84.44  89.15    86.36    73.34     84.44
In particular, on Brca2 and Nutt datasets, which have thousands of features, ReGEC accuracy improves. In the next section SVD-ReGEC parallel implementation for multicomputers is detailed. It is a straightforward implementation, since it is based on de facto standard software libraries.
4 Implementation Details
Our aim has been to realize an efficient, portable and scalable parallel implementation of SVD-ReGEC to be used on different Multiple Instruction Multiple Data (MIMD) distributed memory architectures. As is well known, these are multiprocessor computers in which each node has its own local memory and communicates with the others through message passing. Let us suppose that each processor executes the same program and the same operations on different data (Single Program Multiple Data - SPMD). Given the algorithm structure, a flexible connection topology is supposed to exist among the nodes, which enables point-to-point communication, as well as the broadcast and gather of data. Finally, we suppose that the processors are connected in a network with a mesh topology. Considering this environment, it is natural to develop a program in terms of loosely synchronous processes, executing the same operations on different data and synchronizing with each other through message passing. To simplify the exposition, we suppose that each node is driven by a single process. In Fig. 1, linear algebra operations are essentially used to execute the SVD, matrix-matrix multiplications and the solution of a generalized eigenvalue problem. In order to obtain an efficient, portable and scalable parallel implementation of SVD-ReGEC we decided to use standard message passing libraries, i.e. Basic Linear Algebra Communication Subprograms (BLACS) [16] and Message Passing Interface (MPI) [17], and de facto standard numerical linear algebra software, Parallel Basic Linear Algebra Subprograms (PBLAS) and Scalable Linear Algebra Package (ScaLAPACK) [18], which are also the same libraries on which Matlab routines are based. Since the matrices involved in the algorithm are distributed among the processing nodes, memory is used efficiently and no replication of data occurs.
On a single node, the use of optimized level 3 BLAS [19] and LAPACK [20] routines enables both its efficient use and a favorable computation/communication ratio. The routine used to compute the SVD is PDGESVD of ScaLAPACK, and the main routine of PBLAS used to evaluate matrix-matrix multiplications in the implementation of Fig. 1 is PDGEMM. The current model implementation of PBLAS assumes the matrix operands to be distributed according to the block scatter decomposition of PBLAS and ScaLAPACK. Routines for eigenvalue problems are not included in PBLAS, but they are covered by ScaLAPACK. The generalized eigenvalue problem $G^* x = \lambda H^* x$ has been solved using the routine PDSYGVX. We required machine precision in the computation of eigenvalues and dynamically allocated memory for the reorthogonalization of eigenvectors. The current version of ScaLAPACK does not permit eigenvectors to be orthogonalized against those stored in the memory of a different processor, which can lead to slightly different results with respect to the sequential computation. The auxiliary routines for parallel kernel computation and for operations on diagonal matrices have been derived from PDGEMM. To compute each element $K(A, B)_{ij}$ as in (7), we need to evaluate the exponential of the scalar $-\|A_i - B_j\|^2 / \sigma$. This value can be obtained by replacing, in a matrix times transposed-matrix product, the parallel scalar product between $A_i$ and $B_j$ with the parallel 2-norm of the difference of $A_i$ and $B_j$. The operation count of parallel ReGEC is exactly the same as the sequential one. Thanks to the computational characteristics of linear algebra kernels, the parallel implementation of the algorithm described in Fig. 1 has a computational complexity on $p$ nodes that is exactly $1/p$ of the sequential one, and a communication complexity one order of magnitude lower than the computational one. This is usually a target in parallel linear algebra kernels, because it assures scalable implementations.
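The block scatter decomposition mentioned above is commonly described as a two-dimensional block-cyclic distribution. The sketch below shows how global matrix entries map to processes under that scheme; it assumes zero-based indices and that the first block is stored on process (0, 0), and the block sizes and function names are illustrative choices rather than the settings used in the experiments.

```python
def owner(i, j, mb, nb, nprow, npcol):
    """Process coordinates owning global entry (i, j), zero-based.

    Assumes a 2D block-cyclic layout with mb x nb blocks on an
    nprow x npcol process grid, with the first block on process (0, 0).
    """
    return (i // mb) % nprow, (j // nb) % npcol


def local_counts(m, n, mb, nb, nprow, npcol):
    """Number of matrix entries stored by each process."""
    counts = {(p, q): 0 for p in range(nprow) for q in range(npcol)}
    for i in range(m):
        for j in range(n):
            counts[owner(i, j, mb, nb, nprow, npcol)] += 1
    return counts


# An 8 x 8 matrix in 2 x 2 blocks on a 2 x 4 process mesh.
counts = local_counts(8, 8, 2, 2, 2, 4)
print(owner(5, 3, 2, 2, 2, 4))
print(counts)
```

In this toy case every process holds the same number of entries, which illustrates why the layout balances both memory and work across the mesh.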
5 Performance Evaluation
The datasets used in this study are random, because biological research does not yet provide results with many samples and features. Each of the following tables covers different dataset dimensions, varying from 500 points with 2000 features to 1000 points with 6000 features; those dataset configurations provide significant case studies for the analysis of genomic sequences. Results concern performance in terms of execution time and efficiency. Execution times and the other accuracy results have been obtained on a Beowulf cluster of 8 Pentium 4 1.5 GHz nodes, with 512MB RAM, connected by a Fast Ethernet network. Each node runs Linux kernel 2.4.20, GNU C Compiler (gcc) 2.96, MPI for Heterogeneous Clusters (mpich) 1.2.5, BLACS 1.1, ScaLAPACK 1.7, LAPACK 3.0, and BLAS with Automatically Tuned Linear Algebra Software (ATLAS) optimization. Tests have been performed on idle workstations; the time refers to the wall clock time of the slowest node and has been measured with the function MPI_WTIME() provided by mpich. The maximum memory available on each node made it impossible to run some test cases on a small number of processors.
The execution times, parallel efficiency and speedup are shown in Tables 3, 4 and 5, for different numbers of training elements and CPUs. Tests have been performed on logical 2D meshes of 1 (1), 2 (1 × 2), 4 (2 × 2), and 8 (2 × 4) processors. In Table 4 the efficiency is calculated using the following formula:

$$\mathit{eff} = \frac{t_1}{\#\mathrm{cpu} \cdot t_{\#\mathrm{cpu}}}, \tag{9}$$
where $t_{\#}$ is the execution time using # CPUs. Results show that, for an increasing number of processors, the execution time decreases proportionally, provided the problem to be solved has sufficient computational complexity. Moreover, the time reduction increases for larger problems, with a consistent gain in performance. It is also clear that, with a fixed number of processors, performance increases with the problem dimension. An appreciable reduction in execution time is obtained when the number of processors increases. In Table 5 the speedup is calculated using the following formula:

$$\mathit{speedup} = \frac{t_1}{t_{\#\mathrm{cpu}}}, \tag{10}$$
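As a worked check of formulas (9) and (10), the short sketch below recomputes speedup and efficiency from the execution times reported for the 1000x2000 dataset in Table 3. The function names are illustrative; only the timings come from the paper.

```python
# Execution times for the 1000x2000 dataset, taken from Table 3,
# keyed by the number of CPUs.
TIMES_1000x2000 = {1: 136.34, 2: 94.65, 4: 55.57, 8: 44.11}

def speedup(times, cpus):
    """Formula (10): t_1 / t_#cpu."""
    return times[1] / times[cpus]

def efficiency(times, cpus):
    """Formula (9): t_1 / (#cpu * t_#cpu)."""
    return times[1] / (cpus * times[cpus])

if __name__ == "__main__":
    for p in (1, 2, 4, 8):
        print(p,
              round(speedup(TIMES_1000x2000, p), 2),
              round(efficiency(TIMES_1000x2000, p), 2))
```

For this dataset the recomputed values agree, after rounding to two decimals, with the corresponding entries of Tables 4 and 5.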
Table 3. Execution times

Dataset       1 CPU   2 CPUs   4 CPUs   8 CPUs
500x2000      11.13    11.74     8.79     8.99
750x2000      44.58    36.61    24.18    23.95
1000x2000    136.34    94.65    55.57    44.11
500x4000      11.05    11.74     8.83     8.98
750x4000      44.33    34.87    23.75    21.67
1000x4000    130.02    88.64    55.84    61.58
500x6000      10.99    11.79     8.82     9.03
750x6000      44.18    34.98    23.91    21.78
1000x6000    131.39    88.25    55.44    44.17
Table 4. Efficiency

Dataset       1 CPU   2 CPUs   4 CPUs   8 CPUs
500x2000       1       0.47     0.32     0.15
750x2000       1       0.61     0.46     0.23
1000x2000      1       0.72     0.61     0.39
500x4000       1       0.47     0.31     0.15
750x4000       1       0.64     0.47     0.26
1000x4000      1       0.73     0.58     0.26
500x6000       1       0.47     0.31     0.15
750x6000       1       0.63     0.46     0.51
1000x6000      1       0.74     0.59     0.37
Table 5. Speedup

Dataset       1 CPU   2 CPUs   4 CPUs   8 CPUs
500x2000       1       0.95     1.27     1.24
750x2000       1       1.22     1.84     1.86
1000x2000      1       1.44     2.45     3.09
500x4000       1       0.94     1.25     1.23
750x4000       1       1.27     1.87     2.05
1000x4000      1       1.47     2.33     2.11
500x6000       1       0.93     1.25     1.22
750x6000       1       1.26    18.48     2.03
1000x6000      1       1.49     2.37     2.97
Again, it is possible to see a proportional growth rate of speedup with an increasing problem complexity. We can conclude that parallel ReGEC is efficient and scalable on the target architecture.
6 Conclusion and Future Work
Classification algorithms constitute a major part of the core of data mining and knowledge discovery. New classification algorithms should respond to the demands of real-life problems, which typically require efficient processing of large-scale data sets. Parallel implementations of such algorithms provide time efficiency and computational accuracy. In this work we introduce SVD ReGEC, a parallel implementation of a recently developed regularized general eigenvalue classifier with a feature reduction filter method based on SVD. The proposed implementation is tested on a large-scale synthetic database. The preliminary results show the proposed implementation to be efficient and scalable. Future work will revolve around testing the implementation on large-scale data sets from different research areas, mostly in the biomedical domain, and comparing its performance with that of other parallel classification methods.
Applying SIMD Approach to Whole Genome Comparison on Commodity Hardware
Arpith Jacob1, Marcin Paprzycki2,3, Maria Ganzha2,4, and Sugata Sanyal5
1 Department of Computer Science and Engineering, Vellore Institute of Technology, Vellore, India [email protected]
2 Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland {Marcin.Paprzycki,Maria.Ganzha}@ibspan.waw.pl
3 Computer Science Department, Warsaw Management Academy, Warsaw, Poland
4 Department of Administration, Elblag University of Humanities and Economics, Elblag, Poland
5 School of Technology and Computer Science, Tata Institute of Fundamental Research, Mumbai, India [email protected]
Abstract. Whole genome comparison compares (aligns) two genome sequences assuming that analogous characteristics may be found. In this paper, we present an SIMD version of the Smith-Waterman algorithm utilizing Streaming SIMD Extensions (SSE), running on Intel Pentium processors. We compare two approaches, one requiring explicit data dependency handling and one built to automatically handle dependencies and establish their optimal performance conditions.
1 Introduction
Sequence similarity searches are frequently performed in Computational Biology. They identify closely related genetic sequences, on the assumption that a high degree of similarity often implies similar function or structure. To establish similarity, an alignment score is calculated. Exact algorithms for calculating the alignment score, based on dynamic programming, are very slow (even on the fastest workstations). Hence, heuristic alternatives are used, but they may not be able to detect distantly related sequences. Among exact methods, the Smith-Waterman algorithm is one of the most popular. However, due to its compute-intensive nature, it is rarely used for large-scale database searches. In this paper we describe an efficient implementation of the Smith-Waterman algorithm that exploits fine-grained parallelism. We use Intel's MMX/SSE2 SIMD extensions to speed up the algorithm within a single processor.
2 Related Work
Parallelization of the Smith-Waterman algorithm proceeds on two fronts: fine-grained and coarse-grained parallelism. In the fine-grained approach the pairwise comparison algorithm is parallelized and each processing element performs a part of the matrix calculation to help determine the optimal score. This approach was widely used on single instruction, multiple data (SIMD) parallel computers at the time when they were very popular. On multiple instruction, multiple data (MIMD) computers, subsets of the database are searched independently by the processing elements. Five architectures used for sequence comparison were described by Hughey [1]: (1) special purpose VLSI, (2) reconfigurable hardware, (3) programmable coprocessors, (4) supercomputers and (5) workstations. Special purpose VLSI provides the best performance but is limited to a single algorithm. The Biological Information Signal Processing (BISP) system was one of the first systolic arrays used for sequence comparison. Reconfigurable hardware is typically based on Field Programmable Gate Arrays (FPGAs). FPGAs are more versatile than special purpose VLSI and can be adapted to different algorithms. A number of FPGA systems, such as DeCypher [3], accelerate the Smith-Waterman algorithm, reporting speedups of several orders of magnitude. These systems can easily be ported to newer generations of FPGAs with only minimal redesign. Programmable coprocessors strive to balance the flexibility of reconfigurable hardware with the speed and high density of processing elements. Kestrel [19] is a 512-element array of 8-bit PEs that was used for sequence alignment. Among high performance computers, let us mention BLAZE [4], an implementation of the Smith-Waterman algorithm written for the SIMD MasPar MP1104 computer with 4096 processors. As far as workstations are concerned, Wozniak [5] presented an implementation that used the SIMD visual instruction set of Sun UltraSparc microprocessors to simultaneously calculate four rows of the dynamic programming matrix. Rognes and Seeberg [6] used the SIMD multimedia extension instructions on Intel Pentium microprocessors to produce one of the fastest implementations on workstations. Networks of workstations have also been used effectively by Strumpen [7], who utilized a heterogeneous environment consisting of more than 800 workstations, while Martins and colleagues [8] presented an event-driven multithreaded implementation of the sequence alignment algorithm on a Beowulf cluster consisting of 128 Pentium Pro microprocessors.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1220–1229, 2008. © Springer-Verlag Berlin Heidelberg 2008
3 The Smith-Waterman Algorithm
Initially, Needleman and Wunsch [9] and Sellers [10] introduced the global alignment algorithm based on the dynamic programming approach. Smith and Waterman [11] proposed an $O(M^2 N)$ algorithm to identify common molecular subsequences, which took into account evolutionary insertions and deletions. Later, Gotoh [12] modified this algorithm to run in $O(MN)$ time by considering affine gap penalties. These algorithms depended on saving the entire M × N matrix in order to recover the alignment. The large space requirement was addressed by Myers and Miller [13], who presented a quadratic time and linear space algorithm based on a divide and conquer approach. Finally, Aho, Hirschberg and Ullman [14] proved that algorithms restricted to comparing symbols (to see whether they are equal or not) have to take time proportional to the product of the string lengths.
Fig. 1. Comparison Matrix: Optimal score: 13, Match: 5, Mismatch: -4, Penalty: 0+7k. Optimal Alignment: A C A T A, A C - T A.
Let us now describe the Smith-Waterman algorithm (an example of its operation is depicted in Figure 1). Consider two genomic sequences A and B of length M and N respectively, to be compared using a substitution matrix ∂ and the affine gap weight model. The penalty for a gap of length k is given by $W_i + k W_e$, where $W_i > 0$ and $W_e > 0$. $W_i$ is the penalty for initiating the gap and $W_e$ is the penalty for extending it, so the penalty varies linearly with the length of the gap. The substitution matrix ∂ lists the probabilities of change from one "structure" into another in the sequence. There are two families of matrices used in the algorithm: the Percent Accepted Mutation (PAM) and the Block Substitution Matrices (BLOSUM). The optimum local alignment score is calculated by maximization according to the following recurrence relations (the highest value in the H matrix gives the optimal score):

$$E(i,j) = H(i,j) = F(i,j) = 0 \quad \text{for } i = 0 \text{ or } j = 0$$

$$E(i,j) = \max \left\{ \begin{array}{l} E(i-1,j) - W_e \\ H(i-1,j) - W_i - W_e \end{array} \right.$$

$$F(i,j) = \max \left\{ \begin{array}{l} F(i,j-1) - W_e \\ H(i,j-1) - W_i - W_e \end{array} \right.$$

$$H(i,j) = \max \left\{ \begin{array}{l} 0 \\ E(i,j) \\ F(i,j) \\ H(i-1,j-1) + \partial(A_i, B_j) \end{array} \right\}$$

These recurrences can be understood as follows: the E (F) matrix holds the score of an alignment that ends with a gap in the sequence A (B). When calculating the E(i, j) (F(i, j)) value, both extending an existing gap by one space and initiating a new gap are considered. The H(i, j) cell value holds the best score of a local alignment that ends at position $A_i$, $B_j$. Hence, alignments with gaps in either sequence, or the possibility of extending the alignment with a matched or mismatched pair, are considered. A zero term is added in order to discard negatively scoring alignments and restart the local alignment. One of possibly many optimal alignments can be retrieved by retracing the steps taken during the computation of matrix H, from the optimal score back to the zero term. To quantify the performance of dynamic programming algorithms, the measure millions of dynamic programming cell updates per second (MCUPS) has been defined. It represents the number of cells in the H matrix computed per second, and includes all memory operations and the corresponding E/F matrix cell evaluations.
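The recurrences above can be sketched directly in scalar form. The following minimal Python example is a reference implementation for illustration only, not the authors' SIMD assembly code: it computes the optimal local score without recovering the alignment, and uses a simple match/mismatch score in place of a full substitution matrix. The function names, parameter names and default values (match +5, mismatch -4, gap penalty 0 + 7k, as in Fig. 1) are choices made for this sketch. A small helper shows how an MCUPS figure is derived from the matrix size and a run time.

```python
def smith_waterman_score(a, b, match=5, mismatch=-4, w_i=0, w_e=7):
    """Optimal local alignment score with affine gap penalty w_i + k * w_e."""
    m, n = len(a), len(b)
    # Row 0 and column 0 of E, F and H are initialised to zero.
    H = [[0] * (n + 1) for _ in range(m + 1)]
    E = [[0] * (n + 1) for _ in range(m + 1)]
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Extend an existing gap or open a new one (vertical direction).
            E[i][j] = max(E[i - 1][j] - w_e, H[i - 1][j] - w_i - w_e)
            # Same for the horizontal direction.
            F[i][j] = max(F[i][j - 1] - w_e, H[i][j - 1] - w_i - w_e)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The zero term restarts the local alignment.
            H[i][j] = max(0, E[i][j], F[i][j], H[i - 1][j - 1] + s)
            best = max(best, H[i][j])
    return best

def mcups(query_len, db_len, seconds):
    """Millions of H-matrix cell updates per second."""
    return query_len * db_len / seconds / 1e6

if __name__ == "__main__":
    # The aligned strings of Fig. 1: A C A T A versus A C - T A.
    print(smith_waterman_score("ACATA", "ACTA"))
```

With the parameters of Fig. 1, aligning ACATA with ACTA gives four matches (20) minus one gap of length one (7), which reproduces the optimal score of 13 stated in the figure caption.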
4 SIMD-Based Approach
Multimedia extensions have been added to the Instruction Set Architectures (ISAs) of most microprocessors [15]. They exploit low-level parallelism, where computations are split into subwords, with independent units operating on them simultaneously (a form of SIMD parallelism). Intel introduced the Pentium MMX microprocessor [16] in 1997. The MMX (MultiMedia eXtensions) technology aliased the eight 64-bit MMX registers with the floating point registers of the x87 FPU, allowing up to eight byte operations to be performed in parallel. SIMD processing was enhanced with the addition of SSE2 (Streaming SIMD Extensions 2) in the Pentium 4 microprocessor. It allows sixteen simultaneous byte operations in 128-bit XMM registers. Note, however, that because of the smaller number of bits available per subword, overflows and underflows occur more frequently. They are handled by one of two methods: wraparound arithmetic, which truncates the most significant bit, and saturation arithmetic. In the latter case the result saturates at an upper or lower bound, that is, it is limited to the largest or smallest representable value. Hence, for unsigned integer data types of n bits, underflows are clamped to 0 and overflows to $2^n - 1$. Saturation arithmetic is advantageous because it offers a simple way to eliminate unneeded negative values and automatically limits results without causing errors, and it is therefore used in our implementation. Finally, let us note that compiler support for SIMD instructions is still somewhat rudimentary, and thus hand coding in assembly language using SIMD instructions is often required.
4.1 Challenges in Parallelizing the Smith-Waterman Algorithm
Parallelizing the dynamic programming algorithm is done by calculating multiple rows of the H matrix simultaneously. Figure 2 shows the data dependencies of each cell in the H matrix. The value in the (i, j)th cell depends on the (i − 1, j − 1)th, (i − 1, j)th and (i, j − 1)th cell values. Hence, before a cell can be computed, the cells immediately above, to the left and diagonally across must be available. This figure also shows why a systolic array would be ideal for this type of calculation. The alignment matrix H can be evaluated in parallel along rows (columns) or along anti-diagonals.
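The dependency pattern described above can be made concrete with a small sketch. The code below enumerates the cells of an M × N matrix one anti-diagonal at a time and checks that, in this order, the three neighbours of every cell have already been computed, so all cells on one anti-diagonal are mutually independent. The function names are illustrative and the sketch is independent of any SIMD instruction set.

```python
def anti_diagonals(m, n):
    """Yield, for d = 2 .. m + n, the cells (i, j) with i + j = d.

    Indices are 1-based, matching the recurrences; row 0 and column 0
    hold the zero boundary values.
    """
    for d in range(2, m + n + 1):
        lo = max(1, d - n)
        hi = min(m, d - 1)
        yield [(i, d - i) for i in range(lo, hi + 1)]

def wavefront_is_valid(m, n):
    """Check that every cell only needs cells from earlier anti-diagonals."""
    done = {(i, 0) for i in range(m + 1)} | {(0, j) for j in range(n + 1)}
    for cells in anti_diagonals(m, n):
        for (i, j) in cells:
            needed = {(i - 1, j), (i, j - 1), (i - 1, j - 1)}
            if not needed <= done:
                return False
        # Only now do the cells of this diagonal become available.
        done.update(cells)
    return True

if __name__ == "__main__":
    for cells in anti_diagonals(3, 4):
        print(cells)
    print(wavefront_is_valid(3, 4))
```

For a 3 × 4 matrix the diagonals contain 1, 2, 3, 3, 2 and 1 cells, which illustrates that the amount of available parallelism is smaller at the beginning and the end of the sweep.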
Fig. 2. Data dependencies in the similarity matrix
4.2 Diagonal Approach
When computation proceeds diagonally across the alignment matrix, the interdependencies are automatically handled. The main disadvantage is that substitution scores cannot be accessed linearly from memory, but have to be loaded independently for each diagonal cell. Symbols from the two sequences have to be read, and a look-up into the substitution table made, in order to calculate the corresponding match or mismatch score. This procedure has to be repeated for every element in the diagonal before parallel computation can proceed. The second disadvantage is that the size of the diagonal varies at the beginning and the end of the matrix sweep. For example, if parallel computation involves four cells at a time, the first three diagonals of the alignment matrix, with one, two and three cells respectively, do not contain enough cells to load the 4-way SIMD word. To solve this problem, three dummy symbols are added both at the beginning and at the end of the query sequence. Furthermore, appropriate entries must be added to the similarity table between the dummy symbol and each symbol in the alphabet (the dummy symbol included), with a score of zero, so that the optimal score remains unchanged. Computation then proceeds along successive diagonals through the entire length of the query sequence. The next four rows of the matrix are then computed in the same manner. If the database sequence length is not a multiple of the SIMD word length, the sequence must be padded at the end with an appropriate number of dummy symbols. This process is illustrated in Figure 3 (panel a). SIMD operations can be performed with signed or unsigned integers. To avoid reducing the maximum representable integer value, all elements in the similarity matrix are biased by a positive value (forcing them to remain positive). The bias is then subtracted without affecting the optimal score. In order to obtain optimum performance, a number of techniques have been used to speed up the code. (1) To optimize utilization of cache memory, arrays E and H are interleaved as an array of structures. (2) SIMD words in memory are aligned at appropriate boundaries: 64-bit memory access with MMX registers requires the target address to be aligned at 8-byte boundaries and 128-bit memory access with XMM registers requires alignment at 16-byte boundaries. (3) To reduce the effects of memory latency, memory references for subword access (one addition, one multiplication and two memory reference instructions in the outer loop, along with three memory read instructions in the inner loop) can be appropriately rearranged.
Fig. 3. Fine-grained parallelization of the Smith-Waterman algorithm using 4-way subword processing. a. Diagonal approach. b. Horizontal approach.
4.3 Horizontal Approach
When computation proceeds horizontally along the rows of the alignment matrix, the interdependencies are not resolved. To calculate the value of the (i, j)th cell in the H matrix, the values of the (i, j − 1)th cell in the F and H matrices are required, and thus direct parallel calculation of horizontal cells of H is impossible. An interesting empirical observation concerning the use of the Smith-Waterman algorithm for biological sequences was made by Phil Green (and implemented in the SWAT program [18]). In most cells of the E, F and H matrices, values are clamped to zero (when using saturated arithmetic) and thus do not contribute to H. Specifically, the (i, j)th cell value in the F matrix will remain zero if the (i, j − 1)th cell value is already zero, as long as $H(i, j-1) \le W_i + W_e$:

$$F(i,j) = \max \left\{ \begin{array}{l} F(i,j-1) - W_e \\ H(i,j-1) - W_i - W_e \end{array} \right.$$

If the H value is below this threshold, F will remain zero within that row. For example, when using a 4-way SIMD word, the F values can be ignored in the iteration if the four H values on which they depend are below the threshold $W_i + W_e$. If one or more of the H values exceeds this threshold, the F values must be recalculated sequentially. This effect depends on the threshold value. If the gap open and gap extend penalties are very small, most H values are above the threshold and there will be no speedup in the algorithm. On the other hand, if the threshold values are too large, results will be "incorrect" as useful information may be lost. An advantage of the horizontal method is that substitution scores can be loaded with a single memory read operation using a query sequence profile table. The query sequence profile table contains the substitution scores of the query sequence placed horizontally across the matrix, versus an imaginary sequence made up of all symbols in the alphabet, and is created once for the query sequence. This process is illustrated in Figure 3 (panel b). Note that most optimization techniques used in the diagonal method are relevant here. The query sequence profile table is computed once before the database comparison procedure and is usually small enough to fit in the first level cache of the microprocessor. The conditional loop presents a problem because it is cumbersome to implement using Intel's media processing ISA. Further, it increases the runtime because of the possibility of misprediction of the branch target address. Thus, the SIMD conditional loop is unrolled.
4.4 Experimental Results
The two algorithms were implemented using MMX and SSE2 technology and tested on a Pentium III 500 MHz with 128 MB RAM, running Windows 2000, and a Pentium 4 1.4 GHz with 128 MB RAM, running Windows NT. Finally, we ran two series of experiments on an Intel Pentium 4 2.80 GHz with 1 GB of RAM, running Windows XP. The user interface, file handling and memory allocation code was written in C and compiled using the Visual C/C++ 6.0 compiler. The Smith-Waterman algorithm was written in assembly language and assembled using the Netwide Assembler (NASM) 0.98.08. Timings were measured by reading the microprocessor timestamp counter (using the assembly mnemonic RDTSC) before and after completion of the target function and dividing the difference by the microprocessor clock speed in Hz. For each test, the total program runtime, total I/O overhead, total time spent in the Smith-Waterman function, MCUPS, and their averages were noted. In the tests, the local alignment score between two DNA sequences was calculated without recovering the alignment. The pam47 substitution matrix was used, which assigns a value of +5 for a match and -4 for a mismatch between two nucleotides. A bias value of +4 was used to eliminate negative elements from the substitution matrix. An affine function 0 + 7k was used for the gap open and gap extension penalties. Tests were performed using query sequences ranging in length from 100 to 1000 nucleotides, in steps of 100. We used the annotated Drosophila genome release 3.0 [17], containing 17,878 sequences with a total of 28,249,452 nucleotides. Plots of search times versus query lengths for different SIMD implementations are shown in Figure 4. The bulk of the program time (96-97%) is spent in the Smith-Waterman sequence comparison function. Only a small percentage of the time is spent as overhead for reading the sequences from disk. For a gap penalty of 0 + 7k, the diagonal method was found to be 1.30 to 1.87 times faster than the horizontal method. Using the 128-bit XMM registers on processors with SSE2 technology doubles the size of the SIMD word as compared to the 64-bit MMX registers of the older MMX technology. Theoretically, this should result in a two-fold speed increase. In practice, the speedups ranged from 1.17 to 1.40, because the increase in the SIMD word length is accompanied by a corresponding increase in clock cycles. Surprisingly, the horizontal approach using byte precision with SSE2 technology was slower than its MMX implementation by 14%. As expected, searches using 8-bit subwords in the SIMD word, as compared to searches using 16-bit subwords, were found to be faster by a factor of 1.31 to 1.79. Most comparison scores in sequence searches are well below the maximum value representable in 8 bits. Any score close to 255 represents an interesting match which is investigated by other means, irrespective of its actual score. Hence, in most cases, byte precision is sufficient for database searching. Another interesting observation is the scalability of the SIMD implementation between processors in the same family. The diagonal approach using MMX technology results in a performance boost of 3.68 and 3.91 for the byte and word precisions respectively. The horizontal approach achieved more modest speedups of 1.44 and 1.63 for the byte and word precisions. Finally, we found that the move from the 1.4 GHz Pentium 4 to the 2.8 GHz Pentium 4 caused the MCUPS rate (for query length 1000) to jump from 215 to 488 for the diagonal method, and from 165 to 373 for the horizontal approach. While we do not have a direct explanation for a jump that is more than two-fold, let us note that the two machines were running different operating systems and had substantially different amounts of available memory. What matters is that, for the algorithm in question, the clock speed of the processor (and thus its raw power) translates directly into performance.
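To make the byte-precision argument above concrete, the following minimal sketch emulates saturating unsigned 8-bit addition and subtraction in plain Python and applies them to a biased substitution score. The helper names are illustrative; on MMX/SSE2 hardware, operations of this kind are provided by packed saturating instructions such as PADDUSB and PSUBUSB, but which instructions the authors actually used is not stated in the text, so that mapping is an assumption.

```python
U8_MAX = 255  # largest value representable in an unsigned byte

def adds_u8(x, y):
    """Unsigned 8-bit addition that saturates at 255."""
    return min(x + y, U8_MAX)

def subs_u8(x, y):
    """Unsigned 8-bit subtraction that saturates at 0."""
    return max(x - y, 0)

def biased_step(h_diag, score, bias=4):
    """Add a possibly negative score to an unsigned cell value.

    The score is first shifted by `bias` so that it is non-negative
    (bias 4 matches the +4 used for the +5/-4 scoring above), then the
    bias is subtracted again. The saturating subtraction clamps
    negative results to zero, which is exactly the zero term of the
    H recurrence.
    """
    return subs_u8(adds_u8(h_diag, score + bias), bias)

if __name__ == "__main__":
    print(biased_step(3, 5))    # match: 3 + 5
    print(biased_step(3, -4))   # mismatch: clamped to zero
    print(adds_u8(250, 9))      # overflow saturates at 255
```

A score that reaches 255 is therefore only known to be at least 255, which is why the text treats such hits as candidates to be re-examined by other means.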
Fig. 4. Search times versus query lengths for different implementations
Fig. 5. Effect of SWAT optimization
To complete our investigation, in Figure 5 we depict the effect of gap penalties on the horizontal method. Searching the database using a query of length 900 with a gap penalty of 0 + 7k takes 174 seconds, while a search with a penalty of 40 + 2k takes only 43.9 seconds (however, with the change in the gap penalty the optimal scores are no longer equal). The speed of the horizontal method varies from 142 MCUPS with a gap penalty of 0 + 7k to its saturation point at approximately 580 MCUPS with a gap penalty of 40 + 2k. The diagonal method, on the other hand, does not incorporate the SWAT optimization and runs at constant speed for different gap penalties. Hence, database searching with a high gap penalty favors the horizontal method over the diagonal method (as long as the final results remain reliable).
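The SWAT optimization whose effect is measured above can be sketched in scalar form. The example below is a hedged illustration, not the authors' assembly code: it processes each row in groups of `width` columns (standing in for one SIMD word), first computes H without the F term, and only falls back to the sequential F recurrence when the threshold $W_i + W_e$ is exceeded. A plain row-by-row version with values clamped at zero is included for comparison. All names and default parameters are assumptions made for this sketch.

```python
def sw_plain(a, b, match=5, mismatch=-4, w_i=0, w_e=7):
    """Row-by-row local score with E, F and H clamped at zero."""
    n = len(b)
    h_prev, e_prev, best = [0] * (n + 1), [0] * (n + 1), 0
    for ch in a:
        h, e, f = [0] * (n + 1), [0] * (n + 1), 0
        for j in range(1, n + 1):
            e[j] = max(0, e_prev[j] - w_e, h_prev[j] - w_i - w_e)
            f = max(0, f - w_e, h[j - 1] - w_i - w_e)
            s = match if ch == b[j - 1] else mismatch
            h[j] = max(0, e[j], f, h_prev[j - 1] + s)
            best = max(best, h[j])
        h_prev, e_prev = h, e
    return best

def sw_lazy_f(a, b, match=5, mismatch=-4, w_i=0, w_e=7, width=4):
    """Same score, skipping the F recurrence for groups below the threshold.

    Returns (score, number_of_groups_for_which_F_was_skipped).
    """
    n, thr = len(b), w_i + w_e
    h_prev, e_prev, best, skipped = [0] * (n + 1), [0] * (n + 1), 0, 0
    for ch in a:
        h, e, f = [0] * (n + 1), [0] * (n + 1), 0
        for start in range(1, n + 1, width):
            cols = range(start, min(start + width, n + 1))
            # "Vector" step: no dependency between cells of the same row.
            for j in cols:
                e[j] = max(0, e_prev[j] - w_e, h_prev[j] - w_i - w_e)
                s = match if ch == b[j - 1] else mismatch
                h[j] = max(0, e[j], h_prev[j - 1] + s)
            # F stays zero if nothing feeding it can exceed the threshold.
            if f <= w_e and h[start - 1] <= thr and all(h[j] <= thr for j in cols):
                f = 0
                skipped += 1
            else:
                # Sequential correction of F and H within the group.
                for j in cols:
                    f = max(0, f - w_e, h[j - 1] - w_i - w_e)
                    h[j] = max(h[j], f)
            best = max(best, max(h[j] for j in cols))
        h_prev, e_prev = h, e
    return best, skipped

if __name__ == "__main__":
    print(sw_plain("ACATA", "ACTA"), sw_lazy_f("ACATA", "ACTA"))
```

In this sketch the skip test is exact, so both functions return the same score; the benefit of a larger gap penalty is that the threshold is exceeded less often and more groups avoid the sequential correction.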
5 Concluding Remarks
The aim of this work was to propose a fine-grained implementation of the Smith-Waterman algorithm utilizing multimedia extensions on Intel processors. Our experiments showed significant speedups on Pentium workstations. Since general-purpose microprocessors are constantly being updated with more advanced features allowing micro-parallelization of algorithms, it can be expected that features like the newly introduced Simultaneous Multi-threading (hyperthreading) technology will offer further potential for performance increases. However, note that this and other similar approaches rely heavily on assembly coding. As a result, the codes are not portable at all. For instance, our codes run only on 32-bit architectures and are useless on modern 64-bit processors.
Acknowledgments
We thank the manager of the Centre for Technical Support at T. S. Santhanam Computing Centre, Muthu G., and the laboratory personnel Mr. Ravikumar V. (GA), Mr. Elson Jeeva T. (MIS), Mr. Srinivasan V. and Ms Dharani (ME) for providing the equipment required to conduct the experiments.
References
1. Hughey, R.: Parallel hardware for sequence comparison and alignment. Computer Applications in the Biosciences 12(6), 473–479 (1996)
2. Yamaguchi, Y., Maruyama, T., Konagaya, A.: High Speed Homology Search with FPGAs. In: Pacific Symposium on Biocomputing, pp. 271–282 (2002)
3. http://www.timelogic.com
4. Brutlag, D.L., Dautricourt, J.P., Diaz, R., Fier, J., Moxon, B., Stamm, R.: BLAZE: An implementation of the Smith-Waterman Comparison Algorithm on a Massively Parallel Computer. Computers and Chemistry 17, 203–207 (1993)
5. Wozniak, A.: Using video-oriented instructions to speed up sequence comparison. Computer Applications in the Biosciences 13(2), 145–150 (1997)
6. Rognes, T., Seeberg, E.: Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics 16(8), 699–706 (2000)
7. Strumpen, V.: Parallel molecular sequence analysis on workstations in the Internet. Technical report, Department of Computer Science, University of Zurich (1993)
8. Martins, W.S., del Cuvillo, J.B., Useche, F.J., Theobald, K.B., Gao, G.R.: A multithreaded parallel implementation of a dynamic programming algorithm for sequence comparison. In: Proceedings of the Pacific Symposium on Biocomputing, pp. 311–322 (2001)
9. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. of Molecular Biology 48(3), 443–453 (1970)
10. Sellers, P.H.: On the theory and computation of evolutionary distances. SIAM J. of Applied Mathematics 26, 787–793 (1974)
11. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. of Molecular Biology 147(1), 195–197 (1981)
12. Gotoh, O.: An improved algorithm for matching biological sequences. Journal of Molecular Biology 162(3), 705–708 (1982)
13. Myers, E.W., Miller, W.: Optimal alignments in linear space. Computer Applications in the Biosciences 4(1), 11–17 (1988)
14. Aho, A.V., Hirschberg, D.S., Ullman, J.D.: Bounds on the complexity of the longest common subsequence problem. J. of the ACM 23(1), 1–12 (1976)
15. Lee, R.B.: Multimedia extensions for general-purpose processors. In: Proceedings IEEE Workshop on Signal Processing Systems, pp. 9–23 (1997)
16. Peleg, A., Wilkie, S., Weiser, U.: Intel MMX for multimedia PCs. Communications of the ACM 40(1), 25–38 (1997)
17. Berkeley Drosophila Genome Project (2003), http://www.fruitfly.org/sequence/sequence_db/na_wholegenome_CDS_dmel_RELEASE3.FASTA.gz
18. http://www.phrap.org/phredphrap/swat.html
19. Di Blas, A., Dahle, D.M., Diekhans, M., Grate, L., Hirschberg, J.D., Karplus, K., Keller, H., Kendrick, M., Mesa-Martinez, F.J., Pease, D., Rice, E., Schultz, A., Speck, D., Hughey, R.: The UCSC Kestrel Parallel Processor. IEEE Transactions on Parallel and Distributed Systems 16(1), 80–92 (2005)
Parallel Multiprocessor Approaches to the RNA Folding Problem
Étienne Ogoubi1, David Pouliot1, Marcel Turcotte2, and Abdelhakim Hafid1
1 Département d'informatique et de recherche opérationnelle, Université de Montréal, Pavillon André-Aisenstadt, C.P. 6128 succursale Centre-ville, Montréal, Québec H3C 3J7, Canada ogoubi,pouliot,[email protected]
2 School of Information Technology and Engineering, University of Ottawa, 800 King Edward Avenue, Ottawa, Ontario K1N 6N5, Canada [email protected]
Abstract. To quickly and efficiently fold long ribonucleic acid (RNA) sequences, fast computational models are needed. This paper compares two parallel multiprocessor computer architectures for the prediction of RNA secondary structure. We show promising experimental results using the OpenMP programming environment. This work is intended to be a testbed for the development of new approaches for the prediction of consensus RNA secondary structure from multiple sequences. Keywords: parallel computer architecture; message passing; RNA; secondary structure; folding.
1 Introduction
In recent years, the life sciences have seen not only an explosion in the volume of data but also a vast increase in the number and complexity of data sources. The study of ribonucleic acid (RNA) molecules is a research area that has seen tremendous growth. These molecules have been shown to be involved in a vast array of normal biological processes as well as pathological conditions [1]. In RNAs whose function depends on their native, folded three-dimensional shape (such as those of the ribosomes), the secondary structure, defined by the set of internal base-pair interactions, is more consistently conserved than the primary structure, defined by the sequence of nucleotides. Many algorithms have been developed for the inference and (database) similarity search of RNA secondary structure. However, the execution time (and memory requirement) is often polynomial with a degree as high as 6. This complexity limits the application of these algorithms. For instance, Baird et al. estimated that it would take 6 months to search for a single regulatory element in a database of 20,000 entries (untranslated regions) [2]. The main contribution of this paper to parallel computing in bioinformatics is to help understand how RNA folding algorithms can be computed in a parallel environment. The methods developed to parallelize the basic RNA folding algorithm will be enhanced and used to parallelize eXtended Dynalign, an application for the simultaneous alignment and secondary structure prediction of three RNA sequences [3]. Finally, these results will give us good indications of how eXtended Dynalign can be mapped to An Isometric on Chip Multiprocessor Architecture (ICMA) [4]. This paper is organized as follows: Section 2 gives a rundown of the method used to parallelize the RNA folding algorithm. Section 3 describes the implementation and experimental results. Section 4 concludes the paper and discusses future work.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1230–1239, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Method

2.1 The RNA Folding Algorithm
Herein, the simplified algorithm presented in [5] was implemented. This allows us to focus on the decomposition rather than on complex scoring systems. Specifically, this algorithm finds, in an input sequence of nucleotides, a secondary structure having the maximum number of (Watson-Crick) base pairs. See [6] for a review of the programs implementing the nearest neighbor model, as well as [7] for a review of the recent developments in this field.

This algorithm recursively calculates the optimal structure as shown in the following equations. Let N be the length of an arbitrary sequence of RNA nucleotides, numbered from 0 to N − 1, and let γ be an N × N upper triangular matrix γ(i, j), 0 ≤ i ≤ N − 1, i ≤ j ≤ N − 1. The entries of this matrix are given by the following recurrence equation, for 0 ≤ i < j < N:

\[
\gamma(i,j) = \max
\begin{cases}
\gamma(i+1,\,j) \\
\gamma(i,\,j-1) \\
\gamma(i+1,\,j-1) + \delta(i,j) \\
\max_{i<k<j}\,[\gamma(i,k) + \gamma(k+1,\,j)]
\end{cases}
\tag{1}
\]

where the function δ is defined as follows:

\[
\delta(i,j) =
\begin{cases}
1 & \text{if } i \text{ and } j \text{ can form a base pair} \\
0 & \text{otherwise.}
\end{cases}
\]

RNA sequences are strings over a 4-letter alphabet: A, C, G and U. The canonical base pairs are G:C, C:G, A:U and U:A. An additional base pair, the wobble, is often considered: G:U or U:G. The base cases are defined as follows: γ(i, i) = 0 for i = 0, …, N − 1; γ(i, i − 1) = 0 for i = 1, …, N − 1.
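As an illustration of recurrence (1), the following is a minimal sequential sketch in Python. The names `delta`, `fill` and the `allow_wobble` switch are assumptions made for this example and do not come from the original implementation; the matrix is stored as a full N × N list of lists for simplicity.

```python
# Minimal sketch of the fill stage for recurrence (1).
# All names here are illustrative assumptions, not the authors' code.
CANONICAL = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}
WOBBLE = {("G", "U"), ("U", "G")}


def delta(seq, i, j, allow_wobble=False):
    """Return 1 if nucleotides i and j can form a base pair, else 0."""
    pair = (seq[i], seq[j])
    return int(pair in CANONICAL or (allow_wobble and pair in WOBBLE))


def fill(seq, allow_wobble=False):
    """Fill gamma diagonal by diagonal; gamma[i][j] is the maximum
    number of base pairs in the subsequence seq[i..j]."""
    n = len(seq)
    # Zero initialisation covers the base cases gamma(i, i) and gamma(i, i-1).
    gamma = [[0] * n for _ in range(n)]
    for span in range(1, n):          # span = j - i, one diagonal at a time
        for i in range(n - span):
            j = i + span
            best = max(
                gamma[i + 1][j],
                gamma[i][j - 1],
                gamma[i + 1][j - 1] + delta(seq, i, j, allow_wobble),
            )
            for k in range(i + 1, j):  # bifurcation term of (1)
                best = max(best, gamma[i][k] + gamma[k + 1][j])
            gamma[i][j] = best
    return gamma
```

For instance, `fill("GGCC")[0][3]` evaluates to 2, corresponding to the two nested G:C pairs.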
É. Ogoubi et al.
Fig. 1. a) Filling the matrix γ for an arbitrary sequence AGGGUAGCGA. The recurrence equations are used to fill the upper half of the matrix. The computation of the element (1, 7) of the matrix γ needs the values of elements (1, 2) to (1, 6) of row 1, the values of elements (2, 7) to (6, 7) of column 7 and the value of the element (2, 6). b) The elements on a diagonal, shown by the diagonal lines, can be computed in parallel, starting from the most central diagonal as shown by the numbering.
The RNA folding algorithm consists of two steps: the fill stage of the matrix γ and its traceback. The fill stage is based on recurrence equation (1). The traceback identifies the base pairs that form an optimal structure in the folded RNA. Figure 1a gives a brief overview of how the matrix γ is filled. The explicit algorithm for the traceback is as follows. Let us define a stack S onto which we push the element (0, N − 1) of the matrix γ. The following is then repeated until the stack is empty.

– Pop(i, j)
– If i ≥ j then continue;
– else if γ(i + 1, j) = γ(i, j) then Push(i + 1, j);
– else if γ(i, j − 1) = γ(i, j) then Push(i, j − 1);
– else if γ(i + 1, j − 1) + δ(i, j) = γ(i, j) then:
  • i and j form a base pair
  • Push(i + 1, j − 1);
– else for k = i + 1 to j − 1, if γ(i, k) + γ(k + 1, j) = γ(i, j) then:
  • Push(i, k)
  • Push(k + 1, j)
  • end loop;

The matrix is filled using the dynamic programming technique. The fill stage takes more time, O(N^3), while the traceback time is O(N).
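The stack-based traceback above can be sketched in Python as follows. This is an illustrative sketch only: `can_pair`, `fill` and `traceback` are names chosen for this example, wobble pairs are ignored, and the compact `fill` merely repeats recurrence (1) so that the snippet is self-contained.

```python
# Illustrative sketch of the traceback; names are assumptions for this example.
CANONICAL = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}


def can_pair(a, b):
    return (a, b) in CANONICAL


def fill(seq):
    """Compact fill stage (recurrence (1)), included for self-containment."""
    n = len(seq)
    gamma = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(gamma[i + 1][j], gamma[i][j - 1],
                       gamma[i + 1][j - 1] + int(can_pair(seq[i], seq[j])))
            for k in range(i + 1, j):
                best = max(best, gamma[i][k] + gamma[k + 1][j])
            gamma[i][j] = best
    return gamma


def traceback(seq, gamma):
    """Return the sorted list of base pairs (i, j) of one optimal structure."""
    pairs = []
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if gamma[i + 1][j] == gamma[i][j]:
            stack.append((i + 1, j))
        elif gamma[i][j - 1] == gamma[i][j]:
            stack.append((i, j - 1))
        elif gamma[i + 1][j - 1] + int(can_pair(seq[i], seq[j])) == gamma[i][j]:
            pairs.append((i, j))              # i and j form a base pair
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):         # bifurcation case
                if gamma[i][k] + gamma[k + 1][j] == gamma[i][j]:
                    stack.append((k + 1, j))
                    stack.append((i, k))
                    break
    return sorted(pairs)
```

On the sequence "GGCC", the sketch returns the pairs (0, 3) and (1, 2); in general the number of reported pairs equals γ(0, N − 1).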
2.2 OpenMP
The OpenMP Application Programming Interface (API) is a multi-platform shared memory multiprocessing programming environment. OpenMP is a set of compiler directives, library routines and environment variables [8]. The foremost advantage of using OpenMP in this study is its relatively simple environment for programmers familiar with C and C++. The ability of all the processors to access the same pool of variables, with no communication overhead and no network transit time, makes OpenMP more suitable for our application than the Message Passing Interface (MPI). Herein, we tried to understand how multiple processors can be used to fill the matrix γ. This approach will then be applied to the more complex RNA secondary structure algorithms, such as eXtended Dynalign [3], and later to reconfigurable multiprocessor and multi-memory parallel computer architectures [4].

2.3 Decomposition and Parallel Computation
As presented in Section 2.1, the fill stage is the more expensive of the two steps. In accordance with Amdahl's law [9], speeding up the fill stage of the matrix must therefore yield a better overall computation time:

1. The fill stage of the RNA folding algorithm gives the asymptotic order of the overall computation time of the algorithm;
2. The traceback, as presented here, is fundamentally sequential and hence difficult to parallelize.

The fill stage offers a potential for parallel computation. We can see that each element on a diagonal can be computed in parallel with the other elements of the same diagonal. As shown in Figure 1b, each element γ(i, j) on a diagonal line can be computed independently of its neighbors γ(i + 1, j + 1) and γ(i − 1, j − 1). However, the diagonals must be computed in order, as shown by the numbering in Figure 1b. Since each element on a given diagonal is computed independently, many processors can be used to compute the elements on the same diagonal. However, two elements of the same diagonal may access the same variable in order to compute their own value. It is important to notice that the interdependence between elements gets more and more complex as we move from the first diagonal to the last element, at position (0, 8) for the 9 × 9 matrix γ, and as the number of elements on the diagonal diminishes. This observation gives an indication as to which parallel processing architecture we must choose between message passing (MPI) and shared memory. Based on the above observation, and in accordance with the specifications of the MPI and shared memory architectures, an implementation of the RNA folding algorithm on an MPI architecture will lead to a heavy communication bottleneck. Figure 2 introduces the proposed implementation for the fill stage of the RNA folding algorithm. It is written in pseudo-code for P processors and applies to both MPI and shared memory architectures.
for k = 1 to N − 1 do {Loop 1}
  for l = 1 to N − k do {Loop 2}
    for p = 0 to P − 1 do {Loop 3}
      if p = l mod P then
        i = l − 1; j = i + k
        a = γ(i + 1, j); b = γ(i, j − 1); c = γ(i + 1, j − 1) + δ(i, j); d = 0
        for n = i + 1 to j − 1 do {Loop 4}
          if γ(i, n) + γ(n + 1, j) > d then
            d = γ(i, n) + γ(n + 1, j)
        γ(i, j) = max(a, b, c, d)

Fig. 2. Computation of the fill step of the RNA folding algorithm for parallel computation
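The decomposition of Figure 2 can be sketched in Python with a pool of worker threads. This is only a sketch of the dependency structure, under several assumptions: the names `parallel_fill`, `worker` and `compute_cell` are invented for this example, workers are numbered 0 to P − 1, and a thread pool stands in for the OpenMP threads of the actual implementation. Because of the global interpreter lock of CPython, no real speed-up should be expected from this version.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch only: thread pool standing in for OpenMP threads (assumption).
CANONICAL = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}


def compute_cell(seq, gamma, i, j):
    """Body of Loop 3 in Fig. 2 for one cell (i, j)."""
    a = gamma[i + 1][j]
    b = gamma[i][j - 1]
    c = gamma[i + 1][j - 1] + int((seq[i], seq[j]) in CANONICAL)
    d = 0
    for n in range(i + 1, j):                      # Loop 4
        d = max(d, gamma[i][n] + gamma[n + 1][j])
    gamma[i][j] = max(a, b, c, d)


def worker(seq, gamma, k, p, num_workers):
    """Worker p handles the cells l of diagonal k with l mod P == p."""
    for l in range(1, len(seq) - k + 1):           # Loop 2
        if l % num_workers == p:
            compute_cell(seq, gamma, l - 1, l - 1 + k)


def parallel_fill(seq, num_workers=4):
    n = len(seq)
    gamma = [[0] * n for _ in range(n)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for k in range(1, n):                      # Loop 1: diagonals in order
            futures = [pool.submit(worker, seq, gamma, k, p, num_workers)
                       for p in range(num_workers)]
            for f in futures:                      # barrier before next diagonal
                f.result()
    return gamma
```

Cells of one diagonal only read values from earlier diagonals, so the only synchronization needed in this sketch is the barrier at the end of each diagonal.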
In Figure 2, Loop 3 can be computed in parallel. We now show that the fill stage of the RNA folding algorithm can be executed in O(N^2) time. This execution time is optimal, since the fill stage of an N × N square matrix takes at least O(N^2) time to access all the elements of the matrix. On top of the cost of the fill stage of the matrix γ, there is a communication overhead that varies significantly depending on the architecture on which the algorithm is implemented. Let us measure the speed-up and the efficiency. We define the efficiency here as the speed-up per processor. We will assume some simplifications. However, if we remain consistent in our computation, we will obtain, by approximation, the best architecture between distributed and shared memory for the parallel computation of the RNA folding algorithm.

Distributed Memory. The fill stage of the RNA folding algorithm, when computed sequentially, takes O(N^3) time. Let us assume that the cost of communication between processors is identical. Let us also find a way to distribute the elements of the matrix γ onto the memories associated with the processors. First of all, we need to decompose the matrix γ. The following rules show how the elements of the matrix γ are distributed over the different memories. For all P, processors p_0, …, p_{P−1} have their own memory. The number of memories necessary (and, by the same logic, the number of processors necessary) to accommodate the matrix γ is M = min{P, N}. The distribution of the elements onto the P processors and onto the M memories is:

– γ(i, j) is computed by processor p_k if i ≡ k mod M.

The elements of γ that a processor must have in memory in order to compute γ(i, j) are given by the following rule:

– γ(i, j) is assigned to the memory of processor p_k if i ≡ k mod M or if j ≥ k.

Let us call C the communication cost generated by the decomposition of the computation of the matrix γ:

\[
C = \sum_{0 \le i < j < N} \min\{i,\, P-1\}.
\]

If P ≥ N − 1, then

\[
C = \sum_{l=1}^{N-2} \sum_{k=1}^{l} k = \frac{(N-2)(N-1)N}{6} = O(N^3),
\]

while, if P < N − 1,

\[
\begin{aligned}
C &= \sum_{l=1}^{N-2} \sum_{k=1}^{l} k \;-\; \sum_{t=1}^{N-P-1} \sum_{s=1}^{t} s \\
  &= \tfrac{1}{6}\,\bigl[(N-2)(N-1)N - (N-P-1)(N-P)(N-P+1)\bigr] \\
  &= \tfrac{1}{6}\,\bigl[N^3 - 3N^2 + 2N - N^3 + 3N^2P - 3NP^2 + N + P^3 - P\bigr] \\
  &= O(PN^2).
\end{aligned}
\]

Based on the above equations, the statement C = O(PN^2) is true in both cases. Let us evaluate the computation cost without C; C will be added later in the process. Statements in the body of Loop 3, with the exception of Loop 4, take O(1) time (see Figure 2). Loop 4 is executed k times, and therefore requires O(k) time. Each iteration of Loop 3 thus takes O(k) time, and is computed by N − k processors simultaneously, up to P. As N − k ≥ P, the time needed to compute Loop 2 of Figure 2 is proportional to (N − k)/P, which leads to an execution time T_P of

\[
T_P = O\!\left(k\,\frac{N-k}{P}\right) = O\!\left(k\,\frac{N}{P}\right).
\]

By including Loop 1 of Figure 2, the total execution time is

\[
T_P = O\!\left(N^2\,\frac{N}{P}\right).
\]

By adding the total communication cost λC, where λ is the communication time between two processors, the total computation time is:

\[
T_P = O\!\left(N^2\left[\frac{N}{P} + P\lambda\right]\right).
\]

Shared Memory. Assume that λ' is the cost of communication between a processor and the shared memory. By analogy with the Distributed Memory calculation, we can assert that only Loop 4 has a non-zero weighted execution cost. This cost is represented by O(kλ'). Since the number of processors is less than the number of parallel threads, any time Loop 2 is executed, the cost becomes:

\[
T'_P = O\!\left(\frac{N}{P}\,k\lambda'\right).
\]

By including Loop 1, the execution time becomes:

\[
T'_P = O\!\left(N^2\,\frac{N}{P}\,\lambda'\right).
\]

Depending on the choice of the number of processors P, the computation time differs, as shown by the experimental results in the next section. Ideally, P is a function of N, P(N), such that P(N)/N → c as N → ∞, where c is a real number. Thus, T'_P = O(N^2) but T_P = O(N^3).
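As a sanity check on the closed form of the communication cost C, the following Python sketch compares it with a direct evaluation of the sum of min{i, P − 1} over 0 ≤ i < j < N. The function names are assumptions made for this example, and the upper limit j < N is the range used in recurrence (1).

```python
def comm_cost_bruteforce(n, p):
    """Direct evaluation of C = sum over 0 <= i < j < n of min(i, p - 1)."""
    return sum(min(i, p - 1) for j in range(n) for i in range(j))


def comm_cost_closed_form(n, p):
    """Closed form of C for the two cases P >= N - 1 and P < N - 1."""
    total = (n - 2) * (n - 1) * n // 6
    if p < n - 1:
        total -= (n - p - 1) * (n - p) * (n - p + 1) // 6
    return total
```

For example, with N = 4 the cost is 4 when P ≥ 3 and drops to 3 when P = 2. Both products are products of three consecutive integers, so the integer divisions by 6 are exact.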
Comparison and Choice of Architecture. We compare here two measurements: the speed-up and the efficiency of the Distributed and Shared Memory architectures for the parallel RNA folding algorithm. For the speed-up, let S_P be the speed-up for the Distributed Memory architecture and S'_P the speed-up for the Shared Memory architecture:

\[
S_P = \frac{T_1}{T_P} \approx \frac{N^3}{N^2\left(\frac{N}{P} + P\lambda\right)} = \frac{N}{\frac{N}{P} + P\lambda},
\]

\[
S'_P \approx \frac{N^3}{N^2\,\frac{N}{P}\,\lambda'} = \frac{N}{\frac{N}{P}\,\lambda'} \approx \frac{P}{\lambda'}.
\]

We can see that the speed-up is nonlinear for the Distributed Memory architecture (S_P) and linear for the Shared Memory architecture (S'_P). If N is fixed, S_P is limited to a maximum reached at P ≈ √(N/λ), which is found by setting the derivative of S_P with respect to P to zero. We can conclude at this stage that the communication cost has a more significant effect on S_P than on S'_P. For the efficiency, let E_P be the efficiency of the Distributed Memory architecture and E'_P the efficiency of the Shared Memory architecture:

\[
E_P = \frac{S_P}{P} = O\!\left(\frac{\frac{N}{P}}{\frac{N}{P} + \lambda P}\right) \approx O\!\left(\left(1 + \frac{\lambda P^2}{N}\right)^{-1}\right),
\]

\[
E'_P = \frac{S'_P}{P} = O\!\left(\frac{\frac{N}{P}}{\frac{N}{P}\,\lambda'}\right) \approx O\!\left(\frac{1}{\lambda'}\right).
\]

We can see that the efficiency of the Shared Memory architecture is constant, while the efficiency of the Distributed Memory architecture depends on P and N. Based on the above results, we conclude that the Shared Memory architecture is a better choice for the parallel computation of the RNA folding algorithm.
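The two speed-up models can be explored numerically with the short Python sketch below. The function names and the sample values N = 10000 and λ = 1 are assumptions chosen for illustration; constant factors hidden by the O-notation are ignored.

```python
import math


def speedup_distributed(n, p, lam):
    """S_P = N / (N/P + P*lambda), distributed memory model."""
    return n / (n / p + p * lam)


def speedup_shared(p, lam_shared):
    """S'_P = P / lambda', shared memory model."""
    return p / lam_shared


def best_processor_count(n, lam):
    """Processor count maximizing S_P: dS_P/dP = 0 gives P = sqrt(N/lambda)."""
    return math.sqrt(n / lam)


def efficiency_distributed(n, p, lam):
    """E_P = S_P / P = 1 / (1 + lambda * P^2 / N)."""
    return speedup_distributed(n, p, lam) / p
```

With N = 10000 and λ = 1, the distributed model peaks at P = 100 with a speed-up of 50, whereas the shared memory model keeps growing linearly in P.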
3 Experimental Results

3.1 Implementation Environment

The parallelized RNA folding algorithm was run on a Sun Enterprise 1000 server configured with 64 UltraSPARC II processors, running at 400 MHz, and a total of 60 Gbytes of RAM. The nodes are connected by an SBus running at 25 MHz. The server is running Solaris 2.8.

3.2 Results and Analysis
Execution Analyses. Figure 3 (a) and (b) show the execution time and the curves for the runs with N ∈ {500, 1000, 1500, 2000, 2500, 3000} nucleotides and P processors. Curves on both figures show the execution time as a function of the number of processors.
[Figure 3: two plots of execution time (seconds) against the number of processors. Panel (a): N = 500, 1000 and 1500 nucleotides. Panel (b): N = 2000, 2500 and 3000 nucleotides.]

Fig. 3. Curves for the observed execution time for a) 500, 1000 and 1500 nucleotide sequences, b) 2000, 2500 and 3000 nucleotide sequences

Table 1. Number of processors required for an optimal computation of N nucleotides sequences

N            500   1000   1500   2000   2500   3000
Processors     7      8     10     13     16     19
The relationship between the computation time and the number of nucleotides shows a steep, roughly cubic growth, as expected from the O(N^3) fill stage. For example, when the number of nucleotides N doubles from 500 to 1000, the execution time using one processor is seven times longer. For two processors, the execution time is five times longer. When N doubles from 1000 to 2000, the execution time with 4 processors is 2.201 seconds for N = 1000 and 16.07 seconds for N = 2000, that is to say 7.3 times longer. Table 1 shows the number of processors required for an optimal execution time for each sequence length N. The table reveals no evident rule from which we could infer the minimum number of processors needed for an optimal parallel computation of the RNA folding algorithm as the sequence length N grows. We cannot conclude on this aspect of our experimental results without further simulations with larger values of N.

Memory Space Analyses. Table 2 and Figure 4 show the size of the memory required for the execution of the algorithm on sequences of N nucleotides. The measurement is made for one processor. Each added processor creates a new thread that requires an additional 4 KB of memory. Analysis of the curve shows that it grows asymptotically as N^2. If N = 500 is the length of the input sequence, and an integer is 32 bits long, the memory space required to store the matrix is 500 KB. The 4400 KB memory space used here for the computation of the RNA folding algorithm includes all memory resources for the overall computation. The growth of the curve shows that even though the matrix grows as the square of N, the memory used by the program and environment variables does not grow by the same factor.
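The figures quoted above can be checked with a few lines of Python. The sketch below estimates the empirical scaling exponent from the two 4-processor timings given in the text, and the storage needed for the matrix under the assumption (not stated in the paper) that only the upper triangle, diagonal included, is stored with 4-byte integers. Function names are invented for this example.

```python
import math


def scaling_exponent(n1, t1, n2, t2):
    """Exponent e such that t2/t1 = (n2/n1)**e, from two timings."""
    return math.log(t2 / t1) / math.log(n2 / n1)


def triangular_matrix_kb(n, bytes_per_entry=4):
    """Size in KB of the upper triangle (diagonal included) of an n x n matrix.
    Storing only the triangle is an assumption made for this estimate."""
    entries = n * (n + 1) // 2
    return entries * bytes_per_entry / 1024


# Timings with 4 processors quoted in the text: N = 1000 and N = 2000.
exponent = scaling_exponent(1000, 2.201, 2000, 16.07)
matrix_kb = triangular_matrix_kb(500)
```

The exponent comes out slightly below 3, in line with a cubic fill stage, and the triangular matrix for N = 500 occupies a little under 500 KB, the same order as the value quoted in the text.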
[Figure 4: plot of the memory size in KB (axis scaled by 10^4) against the number of nucleotides, from 0 to 3000. Panel title: Memory size required for N nucleotide sequence.]

Fig. 4. Curve of observed memory size required for the execution of RNA folding algorithm on N nucleotide sequences

Table 2. Memory size needed for the computation of the RNA folding algorithm on N nucleotides sequences

N            500   1000   1500   2000   2500   3000
Memory/KB   4400   7344  12232  19072  27880  38616

4 Conclusion
The aim of this work was to compare two parallel multiprocessor computer architectures for the prediction of RNA secondary structure. Analyses show that on a shared memory parallel computer architecture with P processors, it is possible to achieve a factor N reduction for this algorithm. Hence, the fill stage requires N^2 time instead of N^3. Maximizing the number of base pairs is not a realistic folding criterion. Instead, elaborate free energy models have been developed [10]. Approaches based on minimum free energy have been shown to have an average accuracy of 70 % [11]. Much improvement has been seen using consensus approaches. One example is Dynalign, which combines the RNA folding problem and the sequence alignment problem into a single set of recurrence equations [12]. We recently extended this work to three input sequences [3], and showed promising results. However, the time and space complexity of these applications are huge. For eXtended Dynalign, the time scales as O(|N|^3 M^6), where M is a parameter that controls the maximum separation between aligned positions in the three input sequences, and N is the length of the shortest input sequence. During the fill stage, two six-dimensional matrices, V and W, must be filled:

for i = 1 to |S1| do
  for j = i + minloop to |S1| do
    for k = i + M downto i − M do
      for l = j − M to j + M do
        for m = i + M downto i − M do
          for n = j − M to j + M do
            fill V
            fill W

The work described here will help us better understand how eXtended Dynalign [3] can be decomposed to be computed in parallel on the ICMA architecture [4].
References

1. Costa, F.F.: Non-coding RNAs: Lost in translation? Gene 386(1-2), 1–10 (2007)
2. Baird, S.D., Turcotte, M., Korneluk, R.G., Holcik, M.: Searching for IRES. RNA 12(10), 1755–1785 (2006)
3. Masoumi, B., Turcotte, M.: Simultaneous alignment and structure prediction of three RNA sequences. International Journal of Bioinformatics Research and Applications 1(2), 230–245 (2005)
4. Ogoubi, E., Hafid, A.: ICMA: An isometric on chip multiprocessor architecture. Technical report, Université de Montréal (2006)
5. Eddy, S.R.: How do RNA folding algorithms work? Nature Biotechnology 22(11), 1457–1458 (2004)
6. Zuker, M.: Calculating nucleic acid secondary structure. Curr. Opin. Struct. Biol. 10(3), 303–310 (2000)
7. Reeder, J., Höchsmann, M., Rehmsmeier, M., Voss, B., Giegerich, R.: Beyond Mfold: recent advances in RNA bioinformatics. J. Biotechnol. 124(1), 41–55 (2006)
8. OpenMP Architecture Review Board: OpenMP application program interface (May 2005), http://www.openmp.org
9. Hennessy, J., Patterson, D.: Computer Architecture: a Quantitative Approach, 3rd edn. Morgan Kaufmann Publishers, San Francisco (2003)
10. Lu, Z.J., Turner, D.H., Mathews, D.H.: A set of nearest neighbor parameters for predicting the enthalpy change of RNA secondary structure formation. Nucleic Acids Res. 34(17), 4912–4924 (2006)
11. Doshi, K.J., Cannone, J.J., Cobaugh, C.W., Gutell, R.R.: Evaluation of the suitability of free-energy minimization using nearest-neighbor energy parameters for RNA secondary structure prediction. BMC Bioinformatics 5, 105 (2004)
12. Mathews, D.H., Turner, D.H.: Dynalign: An algorithm for finding the secondary structure common to two RNA sequences. Journal of Molecular Biology 317, 191–203 (2002)
Protein Similarity Search with Subset Seeds on a Dedicated Reconfigurable Hardware

Pierre Peterlongo¹, Laurent Noé², Dominique Lavenier¹, Gilles Georges¹, Julien Jacques¹, Gregory Kucherov², and Mathieu Giraud²

¹ Symbiose, IRISA, INRIA, CNRS, Université Rennes 1
² Sequoia/Bioinfo, LIFL, INRIA, CNRS, Université Lille 1
http://www.irisa.fr/remix/arc.html
Abstract. With a sharp increase of available DNA and protein sequence data, new precise and fast similarity search methods are needed for largescale genome and proteome comparisons. Modern seed-based techniques of similarity search (spaced seeds, multiple seeds, subset seeds) provide a better sensitivity/specificity ratio. We present an implementation of such a seed-based technique on a parallel specialized hardware embedding reconfigurable architecture (FPGA), where the FPGA is tightly connected to large capacity Flash memories. This parallel system allows large databases to be fully indexed and rapidly accessed. Compared to traditional approaches presented by the Blastp software, we obtain both a significant speed-up and better results. To the best of our knowledge, this is the first attempt to exploit efficient seed-based algorithms for parallelizing the sequence similarity search. Keywords: sequence, similarity search, spaced seeds, subset seeds, indexing, FPGA, reconfigurable architecture, dedicated hardware.
1 Introduction
Sequence similarity search is one of the fundamental tasks in genomic research. Its main goal is to locate similar regions in DNA or protein sequences which correspond to biologically relevant conserved (or homologous) regions. A typical task, for example, is to query a genomic databank with a newly discovered DNA sequence. Observed similarities with other known genes witness their putative common biological function and direct further investigations. With rapidly growing genomic databases, bioinformatics projects processing hundreds of gigabytes of data lead to computationally challenging tasks. Since searching for similarities between raw sequences is often the first step to more complex bioinformatics analysis, and since this process requires vast computational resources, there is great interest in optimizing these computations.

Different approaches have been studied to reduce the computation time of sequence similarity search while keeping the same sensitivity as Blast, a commonly used software based on a seed-based heuristic [1] (see section 2.1). All these approaches exploit parallelism, though at different levels. A fine-grained parallelism can be obtained through the use of SIMD instructions [2,3]. One immediate approach consists in splitting a genomic databank across a cluster of computers, as in the mpiBLAST implementation [4,5]. In that scheme, each processor performs an independent search on a part of the database. A final step merges the results. The efficiency of this coarse-grained parallelization is due to a small communication overhead between the involved computers. Another approach is to parallelize the algorithm itself on dedicated hardware (see section 2.2). We can exploit both a fine-grained parallelism (on a VLSI or FPGA) and a coarse-grained one (architecture with several boards). Such solutions can provide a lower cost and a better efficiency than a generic cluster. For example, a single dedicated hardware can be easier to administer than a 64-node cluster with the same computing power.

This paper presents an implementation of a recently proposed seed-based heuristic, called subset seeds, on a parallel hardware designed for indexing large volumes of data such as genomic banks. Two levels of parallelism can be considered: a coarse-grained level and a fine-grained level. Here, only the first one will be discussed: it makes a subtle use of subset seeds in order to simultaneously run several partial searches on large indexes stored in a Flash memory. The fine-grained level is similar to [6], where a Blast for DNA search was proposed.

The rest of the paper is organized as follows. The next section introduces a background for sequence similarity search. Section 3 describes our parallel strategy based on subset seeds. Section 4 presents performance results obtained for a large-scale biological application and draws conclusions.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1240–1248, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 Similarity Searches

2.1 Seed-Based Similarity Search
An alignment between two sequences is defined in terms of a scoring function minimizing possible substitutions, deletions and insertions needed to transform one sequence into the other. Given a set of scores assigned to those edit operations, dynamic programming (DP) equations compute the best local alignments between two sequences in quadratic time [7]. Some optimizations achieve a sub-quadratic complexity [8], but the computation time remains prohibitive for whole-genome comparisons. Most of the time, true alignments contain small patterns, called seeds, that are shared by the two sequences in an exact way. These seeds are used to reduce DP computations to small neighborhoods of seed occurrences. For example, the Blast [1] single-hit strategy proceeds in 3 stages (Figure 1):

– Stage 1: searching for words of size k (seeds) that occur in both strings,
– Stage 2: extending each seed by allowing a limited number of substitutions, and keeping only those with a score greater than a given threshold,
– Stage 3: applying the full DP algorithm to successfully extended seeds.

Fig. 1. Schematic view of the Blast 3-stage algorithm. Stage 1: identify exact seeds (black diagonals). Stage 2: compute seed extension allowing a small number of substitution errors (grey diagonals). Stage 3: perform a full DP computation (white squares) on remaining extended seeds. Here only seed (b) leads to an output alignment.

About five years ago, it was understood that instead of contiguous k-words, it is more advantageous for Stage 1 to use so-called spaced seeds that correspond to gapped diagonals in the DP matrix. The idea of using spaced seeds for biological sequence comparison was first proposed in the PatternHunter software [9] and then used in a more elaborate form in the YASS software [10]. Theoretical design and usage of better seeds is an active field of research [11,12,13,14,15]. For protein search, Stage 1 of Blastp looks for words in databank sequences that are sufficiently close (in terms of the scoring function) to the query word. This strategy is captured by the general concept of vector seeds proposed in [13]. Recent works on seed-based protein search [16,17] apply some extended definition of this concept. An important advantage of all those seed models is the possibility to design appropriate seeds according to sensitivity/selectivity criteria and the class of target alignments. Moreover, instead of using a single seed, one can use several seeds simultaneously (so-called multiple seeds), to further improve the sensitivity/selectivity trade-off.

2.2 Dedicated Hardware for Similarity Search
There have been many attempts to efficiently implement the DP technique of sequence comparison in a specialized hardware. Dynamic programming equations can be projected on 2D or 1D systolic arrays [18,19,20]. A backtracking phase follows the score computation phase to build the alignment [21]. Special edition scores significantly reduce the hardware resources [19]. Between 1990 and 2006, more than twenty different architectures were proposed on VLSI or FPGA circuits [22] (see [23] or [24] for some recent works). Hardware implementations for seed-based heuristics have been less studied. The first ASIC implementation of a seed-based heuristic was done in 1993 with the BioSCAN architecture [25]. Several FPGA implementations have been independently developed since 2003 (see [22] for a review). Some authors developed both a new algorithm (DASH) with better sensitivity than Blast as well as an FPGA implementation [26].
3 Parallel Implementation of Subset Seeds
To the best of our knowledge, no dedicated hardware has been proposed so far to efficiently implement modern seed-based sequence comparison methods. Moreover, some features of those methods are costly to implement at the software level, but can be easily implemented in hardware.

3.1 Subset Seeds for Protein Searches
The detection of an occurrence of a seed (a hit) in Stage 1 is done by first constructing an index for all keys corresponding to the seed. Very general seed models, such as vector seeds [13], lead to more expressive hit definitions but also to more complex and less cache-efficient implementations of this process. More specifically, whereas traditional seeds imply accessing one entry of the index for each query key (direct indexing), vector seeds require storing, for each key p, its neighborhood, i.e. the set of all keys that reach a given score threshold when compared to p. Therefore, this leads, for each key, to multiple accesses to the main index at non-contiguous positions, inducing a larger latency. For example, the 3-letter seed ### implies that for each 3-letter key (word) occurring at a query position, only one (identical) key should be looked up in the index. The single-hit Blastp strategy (seed ###^{≥11}) uses the same index, but should look up, for each query key, all possible keys that score at least 11 when compared with the query key. This strategy gives a theoretically expected number of 26 index look-ups for the Blosum-62 background distribution of amino acids.

In this work, we use the subset seeds model, first proposed in [27] for DNA similarity search. Subset seeds are more expressive than spaced seeds but less expressive than vector seeds. The main idea of subset seeds is that they use elements (seed letters) that distinguish between different types of mismatches. The main advantage of this model is that it provides a powerful seed definition and at the same time preserves the possibility of direct indexing. Consider the alphabet of amino acids Σ = {C, F, Y, W, M, L, I, V, G, P, A, T, S, H, Q, E, R, K, D, N}. A subset seed is defined as a word s_1 s_2 … s_m such that:

– each seed letter s_i denotes a partition of the alphabet Σ, grouping amino acids that can be exchanged at this position,
– a subset seed s_1 s_2 … s_m matches an alignment fragment (x_1, y_1)(x_2, y_2) … (x_m, y_m) ∈ (Σ^2)^m if, for each position i, amino acids x_i and y_i belong to the same set in the partition s_i.

Figure 2 provides an example of seed letters and a subset seed. The design of seed letters, i.e. partitions of the set of amino acids, will be the subject of a separate publication. Once seed letters are fixed, we use the approach proposed in [27] to estimate the performance of a given seed. Theoretical estimates for sensitivity and selectivity are computed on Bernoulli background and foreground models taken from the Blosum-62 matrix model (Blocks database version 5) using the original program of [28]. Seeds achieving the best sensitivity/selectivity ratios are selected for practical evaluation (see section 4).
b0 = {CFYWMLIVGPATSNHQEDRK}
b1 = {C, FYW, MLIV, G, P, ATS, HQERK, DN}
b2 = {C, FYW, ML, IV, G, P, A, TS, H, QE, RK, DN}
b3 = {C, F, Y, W, M, L, I, V, G, P, A, T, S, H, Q, E, R, K, D, N}

Fig. 2. Example of seed letters ranging from a don't care symbol (b0, the whole set of amino acids) to a match symbol (b3, the partition into singletons). With this alphabet, the subset seed s = b1 b3 b2 matches the alignment fragment (H, K)(L, L)(F, W).
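The matching rule for subset seeds, and the reason why direct indexing is preserved, can be sketched in Python as follows. The partitions are those of Fig. 2; the function names `seed_key`, `seed_matches` and `number_of_keys` are assumptions made for this example.

```python
# Partitions of Fig. 2; each string is one block of the partition.
B0 = ["CFYWMLIVGPATSNHQEDRK"]
B1 = ["C", "FYW", "MLIV", "G", "P", "ATS", "HQERK", "DN"]
B2 = ["C", "FYW", "ML", "IV", "G", "P", "A", "TS", "H", "QE", "RK", "DN"]
B3 = list("CFYWMLIVGPATSHQERKDN")


def letter_map(partition):
    """Map each amino acid to the index of its block in the partition."""
    return {aa: idx for idx, block in enumerate(partition) for aa in block}


def seed_key(seed, word):
    """Key of a word under a subset seed: one block index per position.
    Two words match under the seed exactly when their keys are equal,
    so a single index look-up per query key is enough (direct indexing)."""
    return tuple(letter_map(part)[aa] for part, aa in zip(seed, word))


def seed_matches(seed, fragment):
    """True if the seed matches the alignment fragment [(x1, y1), ...]."""
    xs = [x for x, _ in fragment]
    ys = [y for _, y in fragment]
    return seed_key(seed, xs) == seed_key(seed, ys)


def number_of_keys(seed):
    """Number of distinct keys of the seed (product of partition sizes)."""
    total = 1
    for part in seed:
        total *= len(part)
    return total


SEED = [B1, B3, B2]   # the seed s = b1 b3 b2 of Fig. 2
```

With this seed, the fragment (H, K)(L, L)(F, W) is matched, and the index has 8 × 20 × 12 = 1920 keys instead of the 20^3 = 8000 keys of the exact seed ###.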
3.2 Hardware Search Filter
As shown in Figure 3, the hardware prototype architecture, called ReMIX, is composed of several 64 GB Flash memory boards, each linked to an FPGA component. An implementation of a seed-based heuristic with fixed seeds was presented in [6]. The key point is that, in the index, each position of each seed key is stored together with its neighborhood (Figure 4), allowing both Stage 1 and Stage 2 to be computed without additional memory accesses. As we use a multiple seed (a set of several subset seeds), each seed requires a separate index of the database, and each index is stored on a separate memory board. At runtime, each seed is thus processed separately on a distinct (Flash memory board, FPGA) pair, which motivates the coarse-grained parallelism.

Fig. 3. Principle of the ReMIX architecture. Flash memory boards, each linked to an FPGA filter, are connected to a host computer. In this experiment, four boards are used.
More specifically, Algorithm 1 below shows how to find local alignments between a query and a database, closely following the heuristic described in section 2.1. During a preprocessing phase, the databank is indexed off-line with respect to the specified seed (Figure 4). The index points to the positions of keys in the database matched by the seed s (for Stage 1) and to their neighborhoods. This index is stored in the Flash memory. The Flash technology allows all those data to be quickly accessed at runtime. The latency of 20 μs for a random access can be hidden by a large number of successive calls. Moreover, as shown in the right part of Figure 4, for each query position, only one index look-up is needed, reducing the total latency, and thus the total filtering time.
Fig. 4. Each index line, on the ReMIX architecture, contains the position of the seed key (database pos.) and its neighborhood (20 amino acids on each side). On the left, the index used with the traditional ### and ###^{≥11} seeds gives all the database positions of occurrences of a given key. On the right, the index used with subset seeds uses fewer keys. There is more data for each key: the full database size remains the same.
Algorithm 1. Querying a database indexed with one seed
Input: database, seed s, query
1: index the database (with respect to s)
2: store the index into Flash memory
3: for each key of the query do
4:     using the index, focus on similar key occurrences in the database (Stage 1)
5:     for each key occurrence in the database do
6:         using the FPGA, filter out the neighborhoods of the occurrence (Stage 2)
7:     end for
8: end for
9: on host computer, perform DP computations on filtered sets of positions (Stage 3)
Output: local alignments of the query against the database
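The control flow of Algorithm 1 can be sketched in software as follows. This is only a toy model under loud assumptions: keys are exact k-mers rather than subset-seed keys, the Stage 2 score simply counts identical residues instead of using a substitution matrix, the Flash index is a Python dictionary, and Stage 3 is omitted. All function names are hypothetical.

```python
from collections import defaultdict

def neighborhood(seq, pos, k, flank):
    """Key at seq[pos:pos+k] plus `flank` residues on each side, padded with '-'."""
    padded = "-" * flank + seq + "-" * flank
    return padded[pos:pos + k + 2 * flank]

def build_index(db, k, flank):
    """Steps 1-2: map each key to (position, neighborhood) pairs, as in Fig. 4."""
    index = defaultdict(list)
    for pos in range(len(db) - k + 1):
        index[db[pos:pos + k]].append((pos, neighborhood(db, pos, k, flank)))
    return index

def ungapped_score(a, b):
    """Toy Stage 2 score (assumption): number of identical, non-padding residues."""
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")

def query_database(db, query, k=3, flank=4, threshold=6):
    index = build_index(db, k, flank)
    hits = []
    for qpos in range(len(query) - k + 1):                        # step 3
        q_nb = neighborhood(query, qpos, k, flank)
        for dpos, d_nb in index.get(query[qpos:qpos + k], []):    # steps 4-5
            if ungapped_score(q_nb, d_nb) >= threshold:           # step 6
                hits.append((qpos, dpos))
    return hits  # step 9 (DP on the surviving positions) is not modelled

hits = query_database("AAAAMKVLAAGCCCC", "MKVLAAG", k=3, flank=2, threshold=5)
```

Because each index entry already carries its neighborhood, the inner loop needs no further access to the database, which is the property the text attributes to the ReMIX index layout.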
Those neighborhoods are processed on the FPGA together with the neighborhoods of the query (Stage 2). Each FPGA filter can compute approximately 160 ungapped alignments simultaneously in 50 clock cycles. As a clock cycle is around 25 nanoseconds, up to 128 million ungapped alignments per board can be computed each second. Finally, the host computes the final set of alignments from the remaining set of filtered positions given by the FPGA (Stage 3). With our approach, using both dedicated hardware and subset seeds, Stage 3 is not a limiting factor, even when computed in software on the host.
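The throughput figure quoted above follows directly from the three numbers in the text, as this short check shows (the function name and parameter names are illustrative only):

```python
def alignments_per_second(batch=160, cycles=50, cycle_ns=25):
    """Ungapped alignments per second for one FPGA filter."""
    seconds_per_batch = cycles * cycle_ns * 1e-9   # 50 cycles of 25 ns = 1.25 microseconds
    return batch / seconds_per_batch

rate = alignments_per_second()
print(f"{rate / 1e6:.0f} million alignments per second")
```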
4 Performance Results and Conclusions
In our application, the database was extracted from the hard-masked human genome (UCSC Release hg18) translated according to the six possible reading frames. The query was a set of seven archaea and bacteria proteomes deriving from a study on mitochondrial diseases. The goal of this study was to detect potential insertions of mitochondrial genes in the human genome. We selected three sets consisting respectively of 1, 2, and 4 subset seeds among the sets with
Table 1. Comparison between different seeds. The fixed seed ### is given as a reference. The number n is the number of seeds of the set: the computation is distributed over n boards. Experimental values for sensitivity (third column) are obtained through comparison with Smith-Waterman alignments on human chromosomes 1 – 11. For the other columns, we focused on chromosome 1 (85×10^6 amino acids). All subset seeds used here were chosen to have a better sensitivity than ###^{≥11}. On this data, the ###^{≥11} seed looks up the index 17 times more often than any of the subset seeds (fourth column). The number of returned database positions (fifth column) estimates the selectivity, as most of them are false positives that will be filtered out in Stage 2. When several seeds and boards are used (n > 1), we show the results for the slowest one. In comparison, the usual single-hit Blastp implementation would take more than 3400 minutes on this dataset with a 3 GHz PC (data not shown).

Seed model              n   sensitivity   index calls (×10^6)   positions returned (×10^9)   filtering time (min:sec)
Fixed seed ###          1   92.87%        5.4                   24                           24:13
Blastp seed ###^{≥11}   1   99.09%        92.9                  246                          257:50
Subset seed n° 1        1   99.11%        5.4                   231                          216:25
Subset seeds n° 2       2   99.14%        5.4                   max: 109                     max: 103:51
Subset seeds n° 3       4   99.13%        5.4                   max: 69                      max: 65:35
the best sensitivity/specificity ratio and with a global sensitivity comparable to the Blastp seed ###^{≥11}. Using one or several boards, we performed tests parallelizing Algorithm 1, except for Stage 3, which was computed by the host computer on the merged results from all the boards. The computation time of Stage 3, lower than that of Stages 1 and 2, is hidden by successive calls of queries. Time and data results are shown in Table 1. On average, the FPGA filter took approximately 67 nanoseconds to process one index line, which corresponds to around 15 million index entries filtered each second. Even with traditional Blastp seeds, one ReMIX board provides a 13× speed-up over a conventional software implementation. A convenient speed-up is obtained by joining several PCI boards inside a host PC [6]. Moreover, the use of subset seeds gives an additional speed-up due to reduced access to the memory. Here, the best results (in proportion to the number of boards) are achieved with the set of 2 subset seeds, giving a 24% speed-up over the implementation of Blastp seeds. Thus a simple host computer equipped with 4 ReMIX boards with those subset seeds provides a 4 × 13 × 1.24 > 64× speed-up, which makes this host equivalent to a 64-node cluster processing traditional Blastp seeds. As the cost of the FPGA circuit and the Flash memory declines, this solution becomes more attractive than the cluster. One may argue that the hardware design enabling the fine-grained parallelism required additional development time. However, the use of subset seeds is fully transparent for the architecture, as the index in the memory sees only integer keys. A better algorithmic
design of seeds provides an additional speed-up with the very same fine-grained operators on the same FPGA architecture. One possible extension is to index only a part of the database positions rather than all of them. This would reduce the index size and speed up the search. The loss in sensitivity could be limited with specially designed seeds. The efficiency of such an approach remains to be studied. Another open question is whether a similar approach could be used to speed up DNA similarity searches.

Acknowledgments. Support for this work was provided by INRIA through the grant ARC Flash “Seed Optimisation and Indexing of Genomic Databases”.
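The speed-up factors quoted in Section 4 can be rechecked from the filtering times of Table 1 and the 3400-minute software figure. The sketch below is a plausible reading of how those factors were obtained, not a statement of the authors' exact computation; variable names are illustrative, and the software time is only a lower bound in the source.

```python
def minutes(mmss):
    """Convert a 'min:sec' string from Table 1 into minutes."""
    m, s = mmss.split(":")
    return int(m) + int(s) / 60

blastp_software = 3400.0            # minutes on a 3 GHz PC (stated as "more than")
blastp_remix = minutes("257:50")    # Blastp seed, one board
subset2_remix = minutes("103:51")   # set of 2 subset seeds, slowest of 2 boards

board_speedup = blastp_software / blastp_remix      # one board vs. software
seed_speedup = blastp_remix / (2 * subset2_remix)   # gain per board from subset seeds
total_speedup = 4 * 13 * 1.24                       # rounded factors used in the text

print(round(board_speedup, 1), round(seed_speedup, 2), round(total_speedup, 2))
```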
References
1. Altschul, S., Gish, W., Miller, W., Myers, W., Lipman, D.: Basic local alignment search tool. Journal of Molecular Biology 215(3), 403–410 (1990)
2. Rognes, T.: ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches. Nucleic Acids Research 29(7), 1647–1652 (2001)
3. Farrar, M.: Striped smith–waterman speeds database searches six times over other simd implementations. Bioinformatics 23(2), 156–161 (2007)
4. Darling, A., Carey, L., Feng, W.: The design, implementation, and evaluation of mpiBLAST. In: ClusterWorld Conference and Expo (CWCE 2003) (2003)
5. Thorsen, O., Smith, B., Sosa, C.P., Jiang, K., Lin, H., Peters, A., Fen, W.: Parallel genomic sequence-search on a massively parallel system. In: Int. Conference on Computing Frontiers (CF 2007), pp. 59–68 (2007)
6. Lavenier, D., Xinchun, L., Georges, G.: Seed-based genomic sequence comparison using a FPGA/FLASH accelerator. In: Field Programmable Technology (FPT 2006), pp. 41–48 (2006)
7. Smith, T., Waterman, M.: Identification of common molecular subsequences. Journal of Molecular Biology 147, 195–197 (1981)
8. Crochemore, M., Landau, G., Ziv-Ukelson, M.: A sub-quadratic sequence alignment algorithm for unrestricted cost matrices. In: Symposium On Discrete Algorithms (SODA 2002), pp. 679–688 (2002)
9. Ma, B., Tromp, J., Li, M.: PatternHunter: Faster and more sensitive homology search. Bioinformatics 18(3), 440–445 (2002)
10. Noé, L., Kucherov, G.: YASS: enhancing the sensitivity of DNA similarity search. Nucleic Acids Research 33, W540–W543 (2005)
11. Csürös, M., Ma, B.: Rapid homology search with two-stage extension and daughter seeds. In: Wang, L. (ed.) COCOON 2005. LNCS, vol. 3595, pp. 104–114. Springer, Heidelberg (2005)
12. Buhler, J., Keich, U., Sun, Y.: Designing seeds for similarity search in genomic DNA. Journal of Computer and System Sciences 70(3), 342–363 (2005)
13. Brejová, B., Brown, D., Vinar, T.: Vector seeds: An extension to spaced seeds. Journal of Computer and System Sciences 70(3), 364–380 (2005)
14. Li, M., Ma, M., Zhang, L.: Superiority and complexity of the spaced seeds. In: Symp. on Discrete Algorithms (SODA 2006), pp. 444–453 (2006)
15. Mak, D., Gelfand, Y., Benson, G.: Indel seeds for homology search. Bioinformatics 22(14), e341–e349 (2006)
16. Kisman, D., Li, M., Ma, B., Li, W.: tPatternhunter: gapped, fast and sensitive translated homology search. Bioinformatics 21(4), 542–544 (2005)
17. Brown, D.: Optimizing multiple seeds for protein homology search. IEEE Transactions on Computational Biology and Bioinformatics 2(1), 29–38 (2005)
18. Kung, H.T., Leiserson, C.: Algorithms for VLSI processors arrays. Addison-Wesley, Reading (1980)
19. Lipton, R., Lopresti, D.: In: Fuchs, H. (ed.) A systolic array for rapid string comparison, pp. 363–376. Computer Science Press, Rockville, MD (2004)
20. Chow, E., Hunkapiller, T., Peterson, J., Waterman, M.S.: Biological information signal processor. In: International Conference on Application Specific Array Processors (ASAP 1991), pp. 144–160 (1991)
21. Hoang, D.: Searching genetic databases on splash 2. In: IEEE Workshop on FPGAs for Custom Computing Machines (FCCM 1993), Napa, California, pp. 185–191 (1993)
22. Lavenier, D., Giraud, M.: Bioinformatics Applications. In: Gokhale, M.B., Graham, P.S. (eds.) Reconfigurable Computing, Springer, Heidelberg (2005)
23. Dydel, S., Bala, P.: Large scale protein sequence alignment using FPGA reprogrammable logic devices. In: Becker, J., Platzner, M., Vernalde, S. (eds.) FPL 2004. LNCS, vol. 3203, pp. 23–32. Springer, Heidelberg (2004)
24. Court, T.V., Herbordt, M.C.: Families of fpga-based accelerators for approximate string matching. Microprocessors and Microsystems 31(2), 135–145 (2007)
25. Singh, R.K., Tell, S.G., White, C.T., Hoffman, D., Chi, V.L., Erickson, B.W.: A scalable systolic multiprocessor system for analysis of biological sequences. In: Borrielo, G., Ebeling, C. (eds.) Symposium on Research on Integrated Systems, pp. 168–182 (1993)
26. Knowles, G., Gardner-Stephen, P.: A new hardware architecture for genomic and proteomic sequence alignment. In: IEEE Computational Systems Bioinformatics Conference (CSBC 2004) (2004)
27. Kucherov, G., Noé, L., Roytberg, M.: A unifying framework for seed sensitivity and its application to subset seeds. J. Bioinf. Comp. Biology 4(2), 553–569 (2006)
28. Henikoff, S., Henikoff, J.: Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89, 10915–10919 (1992)
Parallel DNA Sequence Alignment on the Cell Broadband Engine
Adrianto Wirawan1, Kwoh Chee Keong1, and Bertil Schmidt2
1 School of Computer Engineering, Nanyang Technological University, Singapore 639798 {adri0004, asckkwoh}@ntu.edu.sg
2 University of New South Wales, 1 Kay Siang Road, Singapore 248922 [email protected]
Abstract. Sequence alignment is one of the most important techniques in Bioinformatics. Although efficient dynamic programming algorithms exist for this problem, the alignment of very long DNA sequences still requires significant time on traditional computer architectures. In this paper, we present a scalable and efficient mapping of DNA sequence alignment onto the Cell BE multi-core architecture. Our mapping uses two types of parallelization techniques: (i) SIMD vectorization within a processor and (ii) wavefront parallelization between processors.
1 Introduction
Sequence alignment is a popular tool to determine the degree of similarity between nucleotide or amino acid sequences that are assumed to share an ancestral relationship. The optimal local alignment of a pair of sequences can be computed by the dynamic programming (DP) based Smith-Waterman algorithm [1]. However, this approach has O(mn) complexity, which is expensive in terms of time and memory cost. One technique to speed up this time-consuming task is to introduce heuristics in the search algorithm, e.g. BLAST [2]. The drawback of this approach is that the more efficient the heuristics, the worse the result. In other words, these algorithms sacrifice sensitivity for speed. Hence, more distant sequence relationships may not be detected. Another popular approach to reduce computational time without sacrificing sensitivity is to use High Performance Computing. Examples of parallel architectures that have been evaluated for this problem include FPGAs [3], GPUs [4] and SIMD arrays [5]. In this paper, we investigate how the Cell Broadband Engine, a recently introduced heterogeneous multi-core architecture, can be used as a computational platform to accelerate the Smith-Waterman algorithm. The rest of this paper is organized as follows. Section 2 highlights the important features of the Cell Broadband Engine. Section 3 describes the basic sequence alignment algorithm. Section 4 presents our mapping of the algorithm onto the Cell BE using both wavefront parallelization and SIMD vectorization. Experimental results are presented in Section 5. Section 6 concludes the paper.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1249–1256, 2008. © Springer-Verlag Berlin Heidelberg 2008
Fig. 1. Block diagram of the Cell BE Architecture
2 Cell Broadband Engine
The Cell Broadband Engine [6] (Cell BE) is a single-chip heterogeneous multi-core processor developed by Sony, Toshiba and IBM. The Cell BE has a general-purpose architecture, offering a unique assembly of thread-level and data-level parallelization options. It operates at the upper range of existing processor frequencies (3.2 GHz for current models) and is projected to run at more than 5 GHz in the near future. In addition, its power consumption is comparable to that of mobile processors. The Cell BE combines an IBM PowerPC Processor Element (PPE) and eight Synergistic Processor Elements (SPEs) [7]. An integrated high-bandwidth bus called the Element Interconnect Bus (EIB) connects the processors and their ports to external memory and I/O devices. The block diagram of the Cell BE is shown in Figure 1. The PPE is a 64-bit Power Architecture core and contains a 64-bit general purpose register set (GPR), a 64-bit floating point register set (FPR), and a 128-bit Altivec register set. It is fully compliant with the 64-bit Power Architecture specification and can run 32-bit and 64-bit operating systems and applications. Each SPE is able to run its own individual application programs. Each SPE consists of a processor designed for streaming workloads, a local memory, and a globally coherent DMA engine. The EIB is a 4-ring structure, and can transmit 96 bytes per cycle, for a bandwidth of 204.8 Gigabytes/second. The EIB can support more than 100 outstanding DMA requests. The most distinguishing feature of the Cell BE lies in the variety of processors it has, i.e. the PPE and the SPEs. Heterogeneous multi-core systems can lead to decreased performance if both the operating system and the application are unaware of the heterogeneity. The PPE is designed to run the operating system and, in many cases, the top-level control thread of an application, while the SPEs are optimized for compute-intensive applications and hence provide the bulk of the application performance.
The SPE can access RAM through direct memory access (DMA) requests. The DMA transfers are handled by the Memory Flow Controller (MFC). The MFC provides the interface, by means of the EIB, between the local storage of the SPE and main memory. The MFC supports DMA transfers as well as mailbox and signal-notification messaging between the SPE and the PPE and other devices. Data transferred between local storage and main memory must be 128-bit aligned. The size of each DMA transfer can be at most 16 KB. DMA-lists can be used for transferring large amounts of data (more than 16 KB). A list can have up to 2,048 DMA requests, each for up to 16 KB.
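The size limits stated above can be made concrete with a small planning sketch. It is not Cell SDK code: it only models the limits given in the text (16 KB per transfer, 2,048 requests per list, 128-bit alignment). Requiring the size to be a multiple of 16 bytes is a simplifying assumption of this sketch, and `plan_dma` is a hypothetical helper.

```python
MAX_DMA_BYTES = 16 * 1024    # largest single DMA transfer (from the text)
MAX_LIST_ENTRIES = 2048      # largest number of requests in one DMA list
ALIGN = 16                   # 128-bit alignment, expressed in bytes

def plan_dma(addr, size):
    """Split [addr, addr + size) into (address, length) requests of at most 16 KB."""
    if addr % ALIGN or size % ALIGN:
        raise ValueError("address and size must be multiples of 16 bytes")
    chunks = [(addr + off, min(MAX_DMA_BYTES, size - off))
              for off in range(0, size, MAX_DMA_BYTES)]
    if len(chunks) > MAX_LIST_ENTRIES:
        raise ValueError("transfer does not fit in a single DMA list")
    return chunks

plan = plan_dma(0, 40 * 1024)
print([length for _, length in plan])
```

Under these limits a single list covers at most 2,048 × 16 KB = 32 MB.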
3 Smith-Waterman Algorithm
The Smith-Waterman algorithm finds the optimal local alignment of two sequences by means of dynamic programming. Given two sequences, the algorithm will calculate their most similar subsequences. It compares two sequences by computing a distance that represents the minimal cost of transforming one segment into another. Two basic operations are used in the transformation, i.e. insertion/deletion and substitution. The distance between the segments is measured as the smallest number of operations required to change one segment into another.
Fig. 2. Sequence alignment of ACCT and GACCA
Consider two strings S1 and S2 with length m and n, respectively. The Smith-Waterman algorithm computes the similarity value M(i, j) of two sequences ending at position i and j of the two sequences S1 and S2, respectively. For affine gap penalties, i.e. α ≠ β, the computation of M(i, j), for 1 ≤ i ≤ m, 1 ≤ j ≤ n, is given as follows:
M(i, j) = max{M(i − 1, j − 1) + sbt(S1[i], S2[j]), E(i, j), F(i, j), 0},
E(i, j) = max{M(i, j − 1) − α, E(i, j − 1) − β},
F(i, j) = max{M(i − 1, j) − α, F(i − 1, j) − β},
where sbt is a character substitution cost table, α is the cost of the first gap, and β is the cost of the following gaps. Initialization values are given as follows: for 0 ≤ i ≤ m, 0 ≤ j ≤ n, M(i, 0) = M(0, j) = E(i, 0) = F(0, j) = 0. Each position of the matrix M is a similarity value. The two segments of S1 and S2 producing this value can be determined by a trace-back procedure.
Figure 2 illustrates an example of computing the local alignment between two DNA sequences ACCT and GACCA using the Smith-Waterman algorithm. The matrix M(i, j) is shown for the linear gap cost α = β = 1, and a substitution cost of +2 if the characters are identical and -1 otherwise. The highest score in the matrix (+6) is the optimal score for the alignment. The trace-back procedure, shown in the form of arrows, shows that the optimal local alignment is ACC aligned with ACC.
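The recurrences of this section and the example of Figure 2 can be reproduced with a direct, unoptimized Python sketch. It stores the full matrices and returns only the optimal score, without trace-back; the function name and default parameters are chosen here for illustration.

```python
def smith_waterman_score(s1, s2, alpha=1, beta=1, match=2, mismatch=-1):
    """Optimal local alignment score using the M, E, F recurrences of Section 3."""
    m, n = len(s1), len(s2)
    # Row 0 and column 0 stay at 0, matching the stated initialization.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    E = [[0] * (n + 1) for _ in range(m + 1)]
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            E[i][j] = max(M[i][j - 1] - alpha, E[i][j - 1] - beta)
            F[i][j] = max(M[i - 1][j] - alpha, F[i - 1][j] - beta)
            sbt = match if s1[i - 1] == s2[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + sbt, E[i][j], F[i][j], 0)
            best = max(best, M[i][j])
    return best

print(smith_waterman_score("ACCT", "GACCA"))  # 6, as in Figure 2
```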
4 Mapping onto the Cell BE Architecture
Our parallel algorithm employs a static load balancing strategy, which means that the work load is known at the start and distributed equally across processes and processors. The algorithm starts by reading the input dataset. The PPE then preprocesses the set of input sequences such that all the SPEs will have their respective sequence parts in their local memory. Consider two sequences, S1 and S2, of length m and n respectively. Assume that p SPEs, P1, ..., Pp, are used for the computation. S1 is broadcast to all SPEs, while S2 is divided into p pieces of size n/p, and each SPE Pi, 1 ≤ i ≤ p, receives the i-th piece of S2. Each SPE has to compute a non-overlapping m × n/p submatrix of the whole m × n DP matrix. This computation is performed in q + p − 1 rounds, where q = m/r and r denotes the number of consecutive rows calculated in one round. Hence, in each round, every active SPE computes an r × n/p submatrix in parallel with the others. The scheduling scheme follows a wavefront pattern and is illustrated in Figure 3.
Fig. 3. Block diagram of the parallel algorithm
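The wavefront pattern of Figure 3 can be enumerated with a few lines of Python. This is a sketch of the schedule only, with no computation or communication; `wavefront_schedule` and the pair representation are assumptions made for illustration. SPEs are numbered from 1 and rounds from 0.

```python
def wavefront_schedule(p, q):
    """For each round k, list the (SPE index i, row block) pairs that are active."""
    rounds = []
    for k in range(q + p - 1):
        # SPE Pi starts i - 1 rounds after P1 and then works through q row blocks.
        active = [(i, k - (i - 1)) for i in range(1, p + 1)
                  if 0 <= k - (i - 1) < q]
        rounds.append(active)
    return rounds

for k, active in enumerate(wavefront_schedule(p=3, q=4)):
    print(k, active)
```

With p = 3 and q = 4 the schedule has q + p − 1 = 6 rounds: only P1 is busy in round 0, all three SPEs are busy from round 2, and only P3 is busy in the last round.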
The notation k Pi denotes the sub-matrix computed by the SPE Pi at round k. Thus, P1 starts by computing 0 P1 at round 0. Then, P1 and P2 compute 1 P1 and 1 P2, respectively, at round 1; P1, P2 and P3 compute 2 P1, 2 P2 and 2 P3, respectively, at round 2, and so on. Because the local storage of an SPE is limited to 256 KB for both the program and the data, we implemented a linear space algorithm. Hence, in each k Pi, the similarity value M(i, j) at position i and j is computed by the following equations:
M(j) = max{M(j − 1) + sbt(S1[i], S2[j]), E, F(j), 0},
E = max{M(j − 1) − α, E − β},
F(j) = max{M(j) − α, F(j) − β}.

After computing the sub-matrix k Pi at round k, SPE Pi sends the elements of the rightmost column of k Pi to SPE Pi+1. Using this information, SPE Pi+1 can compute k+1 Pi+1. In the last of the q + p − 1 rounds, SPE Pp receives the necessary information from Pp−1, computes q+p−2 Pp, and finishes the entire alignment. During the entire computation, each SPE updates and stores its maximum local score. At the end of the computation, each SPE sends its maximum local score to the PPE through the mailbox function. The PPE uses the spe_stat_out_mbox function to fetch the status of the SPU outbound mailbox for each SPE thread and to read each maximum local score. Using those scores, the PPE then determines the global optimal score.

In order to further exploit the capabilities of the Cell Broadband Engine, our parallel implementation has been modified to use the Single Instruction Multiple Data (SIMD) registers of the SPEs, following the vectorization strategy for Smith-Waterman comparison proposed by Wozniak [8]. Due to the additional memory requirement of this method as well as the local storage limitation of the SPEs, each SPE can only compute a submatrix of size 128×128 in each round. Hence, with 8 SPEs, we can compute an overall DP matrix of size 2048×1024. This length, however, is quite short for real-life applications. Hence, we have extended the algorithm such that it can compute alignments of longer sequences. In this new approach, the computation is split into blocks of size 2048×1024. Each block is computed using 8 SPEs, with the larger 2048×1024 block divided into smaller sub-blocks of size 128×128. Once a 2048×1024 block has been computed, the local maximum is sent to the PPE through the mailbox function and the right-most column of this block is saved. The next 2048×1024 block is then offloaded to the SPEs to be computed. Due to the nature of the Smith-Waterman algorithm, the block directly below the current block is chosen as the next block (vertical priority). Once all the blocks in the current vertical column have been computed successfully, the concatenation of the right-most columns of the vertical blocks is sent and processed to compute the next batch of blocks. Pseudocode of the SIMD parallelization scheduling is illustrated in Figure 4. At the end of all block computations, the maximum of the local maximums collected by the PPE is determined as the global optimal score.

Throughout the entire computation, data is sent using direct SPE to SPE communication in order to avoid the latency of communicating through shared memory. Thus, synchronizing the communication between SPEs is crucial. Our implementation uses the MFC sendsignal command (mfc_sndsig) as the means of synchronization. The mfc_sndsig command requires the effective address of the target SPE signal-notification channel as well as a 32-bit signal value. The command increments the channel count of the target SPE's signal-notification channel by one. The SPE verifies that the previous value has been read by performing an MFC get command from the effective address of the target SPE signal-notification register and ensuring that it has been reset by a channel read on the target SPE. The target SPE uses a read-channel instruction on the signal-notification channel of interest to receive the 32-bit signal value. This read-channel
Input: Number of SPEs used num, sequences S1 and S2 with lengths m and n, respectively
Output: Global maximum score for the optimum local alignment of S1 and S2
SPE Pseudocode:
Initialize;
while (outerloop<(n/1024)){
    Fetch the right-most column of Pnum of the previous iteration from PPE through DMA transfer;
    Fetch part of S2 of the corresponding block from the PPE through DMA transfer;
    innerloop=0;
    while (innerloop<(m/2048)){
        Fetch part of S1 of the corresponding block from the PPE through DMA transfer;
        while (count<2048){
            if (i>0){
                Receive signal and data from Pi-1;
            }
            Compute sub-block for size 128x128;
            if (i
Fig. 4. Pseudocode of the SIMD parallelization scheduling
instruction will return immediately, reset any set bits in the signal-notification register, and reset the channel count if the associated signal-notification register has a waiting unread signal value. Otherwise, the read-channel instruction will cause the SPU to stall until a write to the signal-notification register happens.
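The vertical-priority traversal of the 2048×1024 blocks described in this section can be sketched as a simple generator. The loop nesting mirrors the outer and inner loops of Figure 4; the use of ceiling division for sequence lengths that are not multiples of the block size is an assumption of this sketch, and `block_order` is a hypothetical name.

```python
def block_order(m, n, block_rows=2048, block_cols=1024):
    """Yield (row_block, col_block) indices in vertical-priority order."""
    col_blocks = -(-n // block_cols)   # ceiling division (assumption)
    row_blocks = -(-m // block_rows)
    for c in range(col_blocks):        # outer loop: strips of S2
        for r in range(row_blocks):    # inner loop: move down S1 first
            yield (r, c)

print(list(block_order(4096, 2048)))
```

For m = 4096 and n = 2048 the two blocks of the first column strip are processed before the traversal moves one strip to the right, so the saved right-most columns are available when the next strip starts.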
5 Performance Evaluation
In this section, we analyze the performance of our parallel algorithm for varying numbers of SPEs and varying sequence lengths using artificial DNA data sets. The experiment has been conducted on the IBM Full System Simulator for the Cell Broadband Engine [9], which is a generalized simulator that can be configured to simulate a broad range of full-system configurations. The simulator supports full functional simulation and is able to simulate and capture many levels of operational details on instruction execution, cache and memory subsystems, communications, and other important system functions. Furthermore, it supports cycle-accurate simulation, which models not only functional accuracy but also timing. It considers internal execution and timing policies as well as the mechanisms of system components, such as arbiters, queues, and pipelines. The performance statistics measured from the simulator for the parallel algorithm are then converted to the following measurements: computational time, speed-up, cell updates per second (CUPS), and parallel efficiency. This is shown in Figure 5. The term l(r) denotes that the aligned sequences are of length l, and that r rows are sent from one SPE to another at one time. Figure 5(a) shows the computational time of our parallel algorithm on the above-mentioned datasets. By using 8 SPEs, our parallel algorithm managed to reduce the computational time of aligning sequences of length 2048 from 64.34 milliseconds to 9.47 milliseconds by sending 64 rows at a time. The speed-up of our parallel algorithm is shown in Figure 5(b). By using 8 SPEs, we managed to achieve a speed-up of up to
Fig. 5. Performance evaluation results of the parallel algorithm: (a) computational time (ms), (b) speed-up, (c) cell updates per second (MCUPS), and (d) efficiency, each plotted against the number of SPUs for the configurations 128(1), 128(2), 256(8), 512(16), 1024(32), and 2048(64).

Table 1. Performance evaluation results of the SIMD parallelization

Size        Computational Time (ms)   MCUPS
2048×1024   0.86                      2,448.65
2048×2048   1.49                      2,808.58
4096×4096   5.31                      3,158.31
8192×8192   21.34                     3,160.22
6.91 for aligning sequences of length 2048 by sending 64 rows at a time. Figure 5(c) shows the performance of our algorithm in terms of cell updates per second. By using 8 SPEs, our algorithm managed to achieve a performance of up to 450 MCUPS for sequence alignment of length 2048 by sending 64 rows at a time. Our algorithm also shows good scalability, as it achieves high efficiency, especially for datasets of longer sequences, as can be seen in Figure 5(d). Sequence alignment of length 2048 with 64 rows sent at a time provides 86.4% efficiency. For the SIMD parallelization, the performance statistics obtained from the simulator are converted to computational time and cell updates per second (CUPS). The use of larger blocks allows the alignment of longer sequences. In the experiment shown in Table 1, we have aligned sequences of length up to 8192. However, 8192 is not a length restriction of our algorithm but a limit imposed by the simulation time of the IBM Full System Simulator.
As shown in Table 1, our implementation achieves a performance of up to 3,160 MCUPS. Thus, our implementation is 4-5 times faster than the Smith-Waterman implementation using GLSL on a GeForce 7900 GTX presented in [4]. The FPGA implementation using Verilog presented in [3] on a Virtex-II XC2V6000 is about 1.5 times faster than ours. Although FPGAs are flexible, their configuration has to be changed for each single algorithm, which is in general more complicated than writing code for programmable architectures such as the Cell BE.
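The derived measures used in this section follow from the matrix size and the measured time. The sketch below recomputes them from the reported values; the helper names are illustrative, and small deviations from the tabulated MCUPS are expected because the reported times are rounded.

```python
def mcups(rows, cols, time_ms):
    """Million cell updates per second for a rows x cols DP matrix."""
    return rows * cols / (time_ms * 1e-3) / 1e6

def efficiency(speedup, processors):
    """Parallel efficiency as speed-up divided by the number of processors."""
    return speedup / processors

print(round(mcups(4096, 4096, 5.31)))    # SIMD version, Table 1 row 4096x4096
print(round(mcups(2048, 2048, 9.47)))    # wavefront version on 8 SPEs
print(f"{efficiency(6.91, 8):.1%}")      # 2048(64) configuration
```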
6 Conclusion
We have presented a parallel algorithm for sequence alignment on a heterogeneous multi-core system using both SIMD vectorization and wavefront parallelism. Our implementation on the Cell BE simulator shows almost linear speed-up and reduces the computational time for sequences of length 2048 to only 9.47 ms, achieving 450 MCUPS in the process. Furthermore, we have shown that by exploiting the SIMD feature of the Cell BE, we are able to align longer sequences with excellent performance. In aligning two sequences of length 8192, our implementation achieves almost 3.2 GCUPS.
References
1. Smith, T., Waterman, M.: Identification of common molecular subsequences. Journal of Molecular Biology 147, 195–197 (1981)
2. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. Journal of Molecular Biology 215, 403–410 (1990)
3. Oliver, T., Schmidt, B., Maskell, D.: Reconfigurable Architectures for Bio-sequence Database Scanning on FPGAs. IEEE Transactions on Circuits and Systems II 52(12), 851–855 (2005)
4. Liu, W., Schmidt, B., Voss, G., Mueller-Wittig, W.: Streaming Algorithms for Biological Sequence Alignment on GPUs. IEEE Transactions on Parallel and Distributed Systems (to appear, 2007)
5. Di Blas, A., et al.: The UCSC Kestrel Parallel Processor. IEEE Transactions on Parallel and Distributed Systems 16(1), 80–92 (2005)
6. Kahle, J.A., Day, M.N., Hofstee, H.P., Johns, C.R., Maeurer, T.R., Shippy, D.: Introduction to the Cell Multiprocessor. IBM Journal of Research & Development 49(4/5), 589–604 (2005)
7. Pham, D., Asano, S., Bolliger, M., Day, M.N., Hofstee, H.P., et al.: The Design and Implementation of a First-Generation CELL Processor. In: Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC 2005), San Francisco, CA, pp. 184–185 (2005)
8. Wozniak, A.: Using video-oriented instructions to speed up sequence comparison. Comput. Appl. Biosci. 13(2), 145–150 (1997)
9. IBM. Cell Broadband Engine Programming Tutorial v1.1. IBM developerWorks (2006)
Scalability and Performance Analysis of a Probabilistic Domain Decomposition Method
Juan A. Acebrón1 and Renato Spigler2
1 Departament d’Enginyeria Informàtica i Matemàtiques, Universitat Rovira i Virgili, Av. Països Catalans, 26 43007 Tarragona, Spain [email protected]
2 Dipartimento di Matematica, Università “Roma Tre”, 1, Largo S.L. Murialdo, 00146 Rome, Italy [email protected]
Abstract. In this paper, we analyze the scalability and performance of a probabilistic domain decomposition strategy for solving linear elliptic boundary-value problems. Such a strategy consists of a hybrid numerical scheme based on a probabilistic method along with a domain decomposition, and full decoupling can be accomplished. It is shown that such a method performs well for an arbitrarily large number of processors, while the classical deterministic approach is strongly affected by intercommunications. Therefore, the overall performance degrades dramatically for rather large number of processors. Furthermore, we find that the probabilistic method is scalable as the number of subdomains, i.e., the number of processors involved, increases. This fact is clearly illustrated by an example.
1 Introduction
Domain decomposition (DD) is considered one of the most natural ways to decouple boundary-value (BV) problems for partial differential equations (PDEs) into sub-problems, in order to exploit parallel architectures, thus allowing for high-performance scientific computing of large-scale problems. The idea, which in its essence goes back to the 1870 work of H. A. Schwarz, is to split the given domain into a number of subdomains, and then assign the task of the numerical solution on such separate subdomains to as many separate processors. The main drawback, however, is that the solution is needed on some interfaces internal to the domain, while solving BV problems for PDEs has a global character. That is, the solution cannot be obtained even at a single point inside the domain prior to solving the full problem. Consequently, certain iterations are required across the chosen (or prescribed) interfaces, in order to determine approximate values of the sought solution inside the original domain. Both overlapping and non-overlapping domains have been considered in the literature, see, e.g., [13,14]. In any case, some additional numerical work is required to accomplish this task, and it is doubtful whether scalability can be attained as the number of subdomains (hence, of processors) increases unboundedly [10]. Recently, a method has been developed which avoids the intercommunication problems inherent to any traditional DD approach [1,2,4]. The method is based on a probabilistically induced domain decomposition (PDD), and has proven to be rather successful on homogeneous parallel architectures. In Section 2, some generalities about the method are discussed. In Section 3, a numerical example is shown, where the performance in an MPI environment was tested. In the final section, we summarize the main points of the paper.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1257–1264, 2008. © Springer-Verlag Berlin Heidelberg 2008
2
The Probabilistic Method
The core of the probabilistic method consists in combining a probabilistic representation of solutions to elliptic or parabolic PDEs with a classical domain decomposition (DD) method. This algorithm can be referred to as a “probabilistic domain decomposition” (PDD, for short) method. This approach makes it possible to obtain the solution at some points internal to the domain without first solving the entire boundary-value problem. In fact, this can be done by means of the probabilistic representation of the solution. The basic idea is to compute only a few values of the solution by Monte Carlo simulations on certain chosen interfaces, and then interpolate to obtain continuous approximations of it on such interfaces. The latter can then be used as boundary values to decouple the problem into subproblems, see Fig. 1. Each such subproblem can then be solved independently on a separate processor. Clearly, neither communication among the processors nor iteration across the interfaces is needed. Moreover, the PDD method does not even require load balancing. In fact, after decomposing the domain into a number of subdomains, each problem to be solved on them will be totally independent of the others. Hence, each problem can be solved by a single host. Even though some hosts may end the computation much later than others, the results obtained from the faster hosts are correct, and can be used immediately, when necessary.

The implementation of the PDD algorithm can be accomplished through the following steps:
1. Compute only a few interfacial values by Monte Carlo simulations.
2. Interpolate on the corresponding nodes to obtain boundary values for the subdomains.
3. Compute the solution to the original problem in each subdomain by standard methods (e.g., finite differences or finite elements).

Therefore, the key idea is to generate only very few values of the sought solution, u, by a probabilistic method, the Monte Carlo method, see, e.g., [6].
In some sense, this approach makes it possible to obtain the solution at some points internal to the domain without first solving the entire problem. This can be done by means of the probabilistic representation of the solution,

$$u(x) = E_x^L\left[ g(\beta(\tau_{\partial\Omega}))\, e^{-\int_0^{\tau_{\partial\Omega}} c(\beta(s))\,ds} - \int_0^{\tau_{\partial\Omega}} f(\beta(t))\, e^{-\int_0^{t} c(\beta(s))\,ds}\,dt \right], \qquad (1)$$
see, e.g., [7], where β(t) is the (vector-valued) stochastic process associated with the elliptic operator L, which solves the system of (Itô-type) stochastic differential equations (SDEs)

$$d\beta = b(x)\,dt + \sigma(x)\,dW(t), \qquad (2)$$
where W(t) represents the 2-dimensional standard Brownian motion (also called Wiener process), and $\tau_{\partial\Omega}$ is the first passage (or hitting) time of the path β(t), started at the point x, to ∂Ω. The Monte Carlo approach is trivially parallelizable, since each of the N realizations can be run independently of the others. Stated in such terms, the method clearly appears to be scalable as well.

Even though the probabilistic algorithm we derived and considered here is scalable, naturally fault tolerant [8], and well suited to grid and heterogeneous computing [5], it suffers from a weakness, due to the inherently poor accuracy of all Monte Carlo methods. A considerable improvement, however, can be achieved using sequences of “quasi-random numbers” [6,12] instead of sequences of pseudorandom numbers. Pseudorandom numbers are the numbers obtained in practice when we try to generate truly random numbers; hence they are approximately characterized by a statistical distribution. Quasi-random numbers, instead, are deterministic, uniformly distributed numbers. Using the latter makes it possible to obtain an error (now deterministic) of order $O(N^{-1}\log^{d^*-1} N)$, $d^*$ representing a certain “effective” space dimension. It was shown in [3] that the underlying system of SDEs can indeed be solved numerically in a very efficient way.
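As an illustration of how representation (1) can be evaluated pointwise, the following minimal sketch estimates u at a single point for the Laplace equation on the unit square, where c = 0 and f = 0, so that (1) reduces to the average of g at the exit points. It is not the authors' implementation: the plain Euler-Maruyama step, the clamping of overshooting points onto the boundary, the pseudorandom generator, and all function and parameter names are assumptions made for this sketch. The boundary data g = x² − y² are those of the numerical example in Section 3.

```python
import math
import random


def boundary_value(x, y):
    """Dirichlet data g(x, y) = x^2 - y^2 (the example of Section 3)."""
    return x * x - y * y


def mc_point_estimate(x0, y0, n_paths=2000, dt=1e-3, seed=0):
    """Estimate u(x0, y0) for the Laplace equation on the unit square.

    Each path is a Brownian motion started at (x0, y0) and advanced with a
    fixed Euler-Maruyama step until it leaves (0, 1) x (0, 1). With c = 0
    and f = 0 the estimate is the sample mean of g at the exit points.
    """
    rng = random.Random(seed)
    step = math.sqrt(dt)
    total = 0.0
    for _ in range(n_paths):
        x, y = x0, y0
        while 0.0 < x < 1.0 and 0.0 < y < 1.0:
            x += step * rng.gauss(0.0, 1.0)
            y += step * rng.gauss(0.0, 1.0)
        # Crude boundary treatment (an assumption of this sketch): clamp the
        # overshooting point back onto the closed unit square.
        x = min(max(x, 0.0), 1.0)
        y = min(max(y, 0.0), 1.0)
        total += boundary_value(x, y)
    return total / n_paths


if __name__ == "__main__":
    estimate = mc_point_estimate(0.5, 0.5)
    print(f"estimate at (0.5, 0.5): {estimate:.4f} (exact value is 0)")
```

Since the paths are independent, the loop over `n_paths` could be split across processors without any communication, which is the property the text relies on. The fixed time step introduces a boundary bias that more careful schemes, such as the exponential timestepping of [9], are designed to reduce.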
Fig. 1. Schematic diagram illustrating the numerical method, splitting the initial domain Ω into four subdomains, Ω1, Ω2, Ω3, Ω4
3
Numerical Example
Here we present a numerical example aimed at comparing the performance and scalability achieved by a classical deterministic domain decomposition method and by the PDD method. For the purpose of illustration, we analyzed a prototype partial differential equation, the Laplace equation, on a rather elementary domain, the unit square. A code has been implemented in an MPI environment and run on the MareNostrum supercomputer located at the Barcelona Supercomputing Center (BSC).

In order to compare the relative performance of the PDD method against some deterministic method, we solved the same equation in both ways. The deterministic algorithm is taken from the numerical package pARMS, having chosen the overlapping Schwarz method with an FGMRES iterative method preconditioned with ILUT as local solver, see [15]. To split the linear algebraic system corresponding to the fully discretized problem into a number of subproblems, and solve them independently, in parallel, it is necessary to perform a mesh partitioning. This should be done so as to balance the overall computational load while minimizing the intercommunication among the various processors. We used ParMETIS [11] as a partitioner, configuring it according to the characteristics of the particular mesh we adopted. To make the comparison meaningful, we discretized the local problems within the PDD algorithm by finite differences, and solved the ensuing linear algebraic system by the same FGMRES iterative solver preconditioned with ILUT.

Consider the following Dirichlet problem for the Laplace equation in two dimensions,

$$u_{xx} + u_{yy} = 0 \quad \text{in } \Omega := (0,1)\times(0,1), \qquad (3)$$

$$u(x,y)|_{\partial\Omega} = g(x,y), \qquad (4)$$

where $g(x,y) := (x^2 - y^2)|_{\partial\Omega}$. The solution of such a problem is explicitly known, and is $u(x,y) = x^2 - y^2$ in $\bar{\Omega} = [0,1]\times[0,1]$.

In order to analyze the scalability of both domain decomposition methods, the initial domain was scaled as a function of the number of processors, p, involved. This has been done in order to keep the computational load per processor constant, the space discretization being fixed to $\Delta x = \Delta y = 1.25\times 10^{-3}$. Two cases have been studied so far. Firstly, the domain was scaled along the x dimension proportionally to the number of processors, p (in the following, this will be termed Case A), and secondly, both dimensions, x and y, were simultaneously rescaled (Case B). That is, the rescaled domain for case A becomes $[0,p]\times[0,1]$, while for case B it is $[0,\sqrt{p}]\times[0,\sqrt{p}]$.

For both cases, the parallel run time of the PDD method can be estimated. In the following, we focus only on estimating the time spent by the Monte Carlo part of the algorithm. This is mainly because once a continuous approximation of the solution on the interfaces is obtained, the problem is fully decoupled. Hence, the corresponding time spent by the local solver in each subdomain (all being of equal size) on separate processors is independent of the number of processors.
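A small check of the model problem (3)-(4) may be helpful here. The sketch below discretizes the Laplace equation on the unit square with the standard 5-point finite-difference stencil and compares the result with the known solution u = x² − y². Because the stencil is exact for quadratic polynomials, the discrete solution should agree with the exact one at the grid nodes up to solver tolerance. Note that a plain Gauss-Seidel iteration is used here as a stand-in; it is not the FGMRES/ILUT solver of the paper, and the coarse grid, sweep count and function names are assumptions of this sketch.

```python
def g(x, y):
    """Boundary data and exact solution of problem (3)-(4)."""
    return x * x - y * y


def solve_laplace_gauss_seidel(n, boundary, sweeps=400):
    """Solve the 5-point discretization of u_xx + u_yy = 0 on an n x n grid.

    Grid nodes are (i*h, j*h) with h = 1/(n-1); boundary nodes are fixed to
    the Dirichlet data, interior nodes start from zero.
    """
    h = 1.0 / (n - 1)
    u = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i in (0, n - 1) or j in (0, n - 1):
                u[i][j] = boundary(i * h, j * h)
    for _ in range(sweeps):
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                # Each interior value becomes the mean of its four neighbours.
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                  + u[i][j - 1] + u[i][j + 1])
    return u


def max_error(u, exact):
    """Largest nodal deviation of the grid function u from exact(x, y)."""
    n = len(u)
    h = 1.0 / (n - 1)
    return max(abs(u[i][j] - exact(i * h, j * h))
               for i in range(n) for j in range(n))


if __name__ == "__main__":
    u = solve_laplace_gauss_seidel(9, g)
    print("max nodal error:", max_error(u, g))
```

In the PDD setting, the same local solve would be applied to each subdomain, with the boundary data on internal interfaces supplied by the interpolated Monte Carlo values instead of the exact g.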
In [1], it was shown that the time $T_{MC}$ required for computing a single interfacial value of the solution by Monte Carlo or quasi-Monte Carlo is given by

$$T_{MC} \sim N^{1+\beta/\alpha}\, d\, f(d). \qquad (5)$$
Here α is the convergence order of the numerical scheme used to solve the underlying SDE, and β is equal to 1/2 or 1, depending on whether pseudorandom or quasi-random number sequences are chosen. N denotes the number of realizations, while d corresponds to the problem dimension. In particular, for this example, the function f(d) turns out to be constant when only one dimension is scaled, and proportional to p when both are scaled. This is related to how the mean first exit time depends on dimension, see [1].

For the PDD method, let us consider k nodal points on each interface. Note that the error due to interpolation depends on the length of the interfaces. Therefore, to keep this error bounded, the number of nodal points should be increased according to the number of processors involved. This should be done only for case B, because the length of the interfaces remains constant for case A. In fact, for case B the effective number of nodal points required to keep the interpolation error bounded should increase as $p^{1/2}$.

The global domain is partitioned into p subdomains. These are nonoverlapping squares (0, 1) × (0, 1). For case A, a 1D partitioning is applied, and the number of interfaces is p − 1. On the other hand, for case B, the number of interfaces is $2(\sqrt{p}-1)$. Therefore, for case A the overall CPU time spent by Monte Carlo would be of order $T_{MC}(p-1)$, while for case B it would be of order $2T_{MC}(\sqrt{p}-1)$. Thus, for case A the parallel run time for this part of the algorithm is as follows:

$$T_p = T_{MC}\,k\,\frac{p-1}{p} \sim T_{MC}\,k \quad \text{as } p \to \infty, \qquad (6)$$

while for case B it yields

$$T_p = 2k\,\tilde{T}_{MC}\,p\,\sqrt{p}\,\frac{\sqrt{p}-1}{p} = 2k\,\tilde{T}_{MC}\,(p-\sqrt{p}), \qquad (7)$$
where $\tilde{T}_{MC} = T_{MC}/f(d)$. Figures 2(a) and 2(b) show the parallel run time results for case A and case B, respectively. Note that the Monte Carlo part remains constant for case A, while for case B it increases linearly as a function of the number of processors. This agrees qualitatively with the theoretical estimates obtained above.

Figures 3(a) and 3(b) show contour plots of the pointwise numerical error made by the DD and PDD methods. Parameters were chosen so as to attain a comparable error for both methods. For the PDD method, k was kept fixed at 2. Note that the maximum error made in each subdomain is indeed attained on the corresponding boundary. The exponential timestepping [9] used to solve the underlying SDEs (2) was characterized by $\lambda = \langle\Delta t\rangle^{-1}$, $\Delta t$ being the random, exponentially distributed time step used in solving the SDEs, and the brackets denoting its average. Note that the maximum error is of order $10^{-2}$, which corresponds mainly to the statistical error obtained from the Monte Carlo method at the nodal points. The accuracy can be increased by increasing the sample size, or by resorting to sequences of “quasi-random” numbers while keeping the sample size fixed, see [2].

Figures 4(a) and 4(b) show the results corresponding to the deterministic domain decomposition method for cases A and B, respectively. Here the overall parallel run time has been decomposed into two parts, corresponding to the time spent by the partitioner and to the iteration part. Note that both of them increase when the number of processors grows, as expected. Moreover, the iteration part increases without bound for larger numbers of processors due to the heavy intercommunication, dramatically degrading the performance of the algorithm.

Fig. 2. Parallel run time of the PDD method versus the number of processors, where the time spent by Monte Carlo and the local solver is also detailed: (a) Case A, and (b) Case B

Fig. 3. Pointwise numerical error in: (a) the DD algorithm, and (b) the PDD algorithm

Fig. 4. Parallel run time of the DD method versus the number of processors, where the time spent by the partitioner and the iteration procedure is also detailed: (a) Case A, and (b) Case B
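The run-time estimates (6) and (7) for the Monte Carlo part of the PDD method can be tabulated with a few lines of code. The sketch below only encodes those two formulas; the function names and the sample values of k and of the single-value times are hypothetical and carry no measured data.

```python
import math


def pdd_mc_time_case_a(p, k, t_mc):
    """Estimate (6): T_p = T_MC * k * (p - 1) / p for the 1D partition."""
    return t_mc * k * (p - 1) / p


def pdd_mc_time_case_b(p, k, t_mc_tilde):
    """Estimate (7): T_p = 2 k T~_MC (p - sqrt(p)) for the 2D partition.

    t_mc_tilde stands for T_MC / f(d), the single-value time with the
    dependence on the domain size factored out.
    """
    return 2.0 * k * t_mc_tilde * (p - math.sqrt(p))


if __name__ == "__main__":
    k, t_unit = 2, 1.0  # hypothetical values, for illustration only
    for p in (4, 16, 64, 256):
        a = pdd_mc_time_case_a(p, k, t_unit)
        b = pdd_mc_time_case_b(p, k, t_unit)
        print(f"p = {p:4d}   case A: {a:8.3f}   case B: {b:8.3f}")
```

The table this prints makes the two regimes visible: the case A estimate saturates at k T_MC, while the case B estimate grows essentially linearly in p, which is the qualitative behavior reported for Fig. 2.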
4
Conclusions
The performance of a probabilistic method for domain decomposition, applied to the numerical solution of linear elliptic boundary-value problems in two dimensions, has been analyzed. The solution is generated only at very few points inside the domain, by Monte Carlo simulations of the associated stochastic differential equations. A Chebyshev interpolation using such points as nodes is then constructed, and a full splitting into several subdomains, to be handled by separate processors acting concurrently, is made. A comparison with a deterministic DD algorithm for different partitions of the scaled domain has been made here for the first time. This has been done in an MPI environment, which also makes it possible to test the effect of the processor intercommunications that beset all deterministic DD algorithms. Besides the competitive results observed in the numerical example, the PDD method is expected to be competitive with respect to scalability and fault tolerance. These are indeed key issues if one intends to run codes on machines working with hundreds of thousands of processors or more.
Acknowledgements. J.A.A. acknowledges support from the Ministerio de Ciencia y Tecnología (MEC) through the Ramón y Cajal programme. The authors thankfully acknowledge the computer resources, technical expertise and assistance provided by the Barcelona Supercomputing Center-Centro Nacional de Supercomputación.
References
1. Acebrón, J.A., Busico, M.P., Lanucara, P., Spigler, R.: Domain decomposition solution of elliptic boundary-value problems via Monte Carlo and quasi-Monte Carlo methods. SIAM J. Sci. Comput. 27, 440–457 (2005)
2. Acebrón, J.A., Busico, M.P., Lanucara, P., Spigler, R.: Probabilistically induced domain decomposition methods for elliptic boundary-value problems. J. Comput. Phys. 210, 421–438 (2005)
3. Acebrón, J.A., Spigler, R.: Fast simulations of stochastic dynamical systems. J. Comput. Phys. 208, 106–115 (2005)
4. Acebrón, J.A., Spigler, R.: Supercomputing applications to the numerical modeling of industrial and applied mathematics problems. J. Supercomputing 40, 67–80 (2007)
5. Acebrón, J.A., Durán, R., Rico, R., Spigler, R.: A new domain decomposition approach suited for grid computing. LNCS. Springer, Heidelberg (in press, 2007)
6. Caflisch, R.E.: Monte Carlo and quasi Monte Carlo methods. In: Acta Numerica, pp. 1–49. Cambridge University Press, Cambridge (1998)
7. Freidlin, M.: Functional integration and partial differential equations. Annals of Mathematics Studies no. 109. Princeton Univ. Press, Princeton (1985)
8. Geist, G.A.: Progress towards Petascale Virtual Machines. In: Dongarra, J., Laforenza, D., Orlando, S. (eds.) EuroPVM/MPI 2003. LNCS, vol. 2840, pp. 10–14. Springer, Heidelberg (2003)
9. Jansons, K.M., Lythe, G.D.: Exponential timestepping with boundary test for stochastic differential equations. SIAM J. Sci. Comput. 24, 1809–1822 (2003)
10. Keyes, D.E.: How scalable is domain decomposition in practice? In: The Eleventh International Conference on Domain Decomposition Methods (London, 1998) (electronic), DDM.org, Augsburg, pp. 286–297 (1999)
11. Karypis, G., Kumar, V.: ParMETIS: parallel graph partitioning and fill-reducing matrix ordering, http://glaros.dtc.umn.edu/gkhome/metis/parmetis/overview
12. Niederreiter, H.: Random number generation and quasi Monte-Carlo methods. SIAM, Philadelphia (1992)
13. Quarteroni, A., Valli, A.: Domain decomposition methods for partial differential equations. Oxford Science Publications, Clarendon Press (1999)
14. Toselli, A., Widlund, O.: Domain Decomposition Methods - Algorithms and Theory. Springer Series in Computational Mathematics, vol. 34. Springer, Heidelberg (2005)
15. Li, Z., Saad, Y., Sosonkina, M.: pARMS: a parallel version of the algebraic recursive multilevel solver. Numerical Linear Algebra with Applications 10, 485–509 (2003)
Scalability Analysis for a Multigrid Linear Equations Solver
Krzysztof Banaś
Department of Applied Computer Science and Modelling, AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków, Poland
Institute of Computer Modelling, Cracow University of Technology, Warszawska 24, 31-155 Kraków, Poland
[email protected]
Abstract. The paper presents a parallel performance analysis of an iterative solver for systems of linear equations arising from finite element approximation. The main topic is the choice of a parallel implementation strategy for a multigrid preconditioner of Krylov space methods. The scalability of the presented solutions is assessed and compared.
1
Scalability of Large Scale Computations
The scalability of computations is considered in the present article as the ability of a program to maintain high parallel efficiency when the size of the problem solved grows together with the number of processors performing computations [1], [2]. In general it is possible to consider the standard measures of parallel performance, speed-up, $S_p$, and efficiency, $E_p$, as functions of the number of processors $N_p$ and the size of the problem $N_s$:

$$S_p(N_p, N_s) = \frac{T_s(N_s)}{T_p(N_p, N_s)} \qquad (1)$$

$$E_p(N_p, N_s) = \frac{S_p(N_p, N_s)}{N_p} \cdot 100\% \qquad (2)$$
where $T_s(N_s)$ is the execution time for the best sequential algorithm solving the problem and $T_p(N_p, N_s)$ is the parallel execution time of the considered program. For the scalability analysis a suitable function $N_s(N_p)$ is sought, such that when substituted into (2) the efficiency remains far from zero ($E_p \ge \varepsilon > 0$) when the number of processors grows (in principle to infinity). However, to make a program practically scalable it is required that the size of the problem does not grow beyond the capabilities of the hardware, and these capabilities are, most often, defined by the available memory. The most common situation occurs when the memory size of a parallel system, either a supercomputer or a cluster, increases proportionally with the number of processors. So ideally we seek the case when the parallel efficiency $E_p$ remains bounded away from zero for a sequence of program executions in which the size of the problem grows with the number of processors in such a way that the memory requirements of the program remain a linear function of $N_p$.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1265–1274, 2008. © Springer-Verlag Berlin Heidelberg 2008

In the situation where the memory requirements of the program are proportional to the size of the problem, such that for sizes $N_s = N_p \cdot N_0$ the growth of memory requirements is linear, the sought efficiency can be expressed as

$$E_p(N_p, N_p \cdot N_0) = \frac{T_s(N_p \cdot N_0)}{T_p(N_p, N_p \cdot N_0) \cdot N_p} \cdot 100\% \qquad (3)$$
where $N_0$ is the problem size for uni-processor execution. If $T_s(N_p \cdot N_0) = N_p \cdot T_s(N_0)$, then (3) means that for the program to be scalable it suffices that the execution time, as a function of the number of processors, remains constant, $T_p(N_p, N_p \cdot N_0) \approx \mathrm{const}$ (or grows slowly in the considered range of parallel system sizes). Usually this scalability condition, adopted also in the remainder of this paper, is referred to as constant (or slowly increasing) execution time for constant problem size per processor.
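Definitions (1)-(3) translate directly into code. The following minimal sketch computes speed-up, efficiency and the scaled-problem efficiency of (3) under the stated assumption $T_s(N_p \cdot N_0) = N_p \cdot T_s(N_0)$; the function names and the timings in the usage example are hypothetical and serve only to illustrate the formulas.

```python
def speedup(t_seq, t_par):
    """Formula (1): S_p = T_s / T_p."""
    return t_seq / t_par


def efficiency(t_seq, t_par, n_proc):
    """Formula (2): E_p = S_p / N_p, expressed in percent."""
    return speedup(t_seq, t_par) / n_proc * 100.0


def scaled_efficiency(t_seq_unit, t_par, n_proc):
    """Formula (3), assuming T_s(N_p * N_0) = N_p * T_s(N_0).

    t_seq_unit is T_s(N_0), the sequential time for the uni-processor
    problem size; t_par is T_p(N_p, N_p * N_0).
    """
    return efficiency(n_proc * t_seq_unit, t_par, n_proc)


if __name__ == "__main__":
    # Hypothetical timings: 10 s sequentially for size N_0, 12.5 s in
    # parallel on 64 processors for the problem of size 64 * N_0.
    print(scaled_efficiency(10.0, 12.5, 64))
```

Under that assumption the scaled efficiency reduces to $T_s(N_0)/T_p \cdot 100\%$, so a constant parallel execution time is equivalent to constant efficiency, which is the scalability condition adopted in the text.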
2
Parallel Performance of Large Scale Adaptive Finite Element Simulations
The focus of the present paper is the scalability analysis of parallel multigrid solvers for systems of linear equations resulting from the finite element method. Hence, a sequence of finite element problems of increasing size is considered. For finite elements it is most natural to assume that the sequence is obtained by mesh adaptations. It is assumed that the main form of mesh adaptation is the decrease of element sizes. This may happen for h adaptive FEM, as well as for hp adaptive FEM with the degree of approximation p bounded. The total number of degrees of freedom for the approximation, $N_{DOF}$, is used as the measure of the problem size. In the simplest case of linear finite elements, degrees of freedom are associated with element vertices. For other variants (higher order, h, p, hp adaptive, etc.) the degrees of freedom may be associated with all mesh objects: vertices, edges, faces, element interiors. For scalability analysis with h adaptivity as the main mechanism used to increase the problem size, it can be assumed that the number of degrees of freedom is proportional to the number of elements $N_E$, independently of the particular version of approximation utilized, $N_{DOF} = \alpha N_E$.

The main technique for parallelization of finite element simulations is domain decomposition. Usually the whole domain is divided into subdomains, with the number of subdomains equal to the number of processors (the term processor will be used to denote a single processing unit, being a standard CPU or a core in a multi-core CPU). For the purpose of load balancing the sizes of subdomains may be made proportional to the performance of particular processors [3]. For solving in parallel systems of linear equations resulting from finite element discretization there are two variants of domain decomposition: overlapping and non-overlapping. Only the former is considered in the current paper. In its realization the subdomains coming from the original non-overlapping division of the computational domain are enlarged by one or more layers of elements. Each processor, owning a particular subdomain, stores data concerning mesh objects and the associated degrees of freedom not only for the owned subdomain, but also for the overlap. Thus, from the point of view of a single processor, all mesh objects and degrees of freedom may be divided into owned (internal), overlap and external.
3
A Strategy for Solving Finite Element Systems of Linear Equations
The analysis performed in the current paper concerns a popular strategy for solving systems of linear equations that consists of several ingredients, for each of which different final algorithms and methods can be used. The ingredients are the following: a Krylov space solver, a multigrid preconditioner, an additive Schwarz smoother and an approximate solver for subdomain problems. Parallelization of all these steps, based on domain decomposition, can be easily achieved by dividing all vectors and matrices appearing in the solver algorithm in a manner similar to the division of mesh objects and degrees of freedom. Each vector component and each row of the system matrix corresponds to a certain degree of freedom, which, in turn, corresponds to a certain mesh object. Analogously to the classification of mesh objects, from the point of view of a single processor, each row of the system matrix and each component of the solver vectors is classified as owned (internal), overlap or external.

As the main solver for systems of finite element linear equations a preconditioned Krylov space method is considered. The choice of the particular algorithm depends on the characteristics of the problem [4]. In any case the solution procedure consists of several iterations; in each iteration vector operations, matrix-vector products and preconditioning are performed.

Vector operations may be perfectly parallelizable (like summing or scaling) or may involve communication overhead. In the case of reduction operations (scalar product, norm) the standard realization consists in performing operations on owned components, followed by an all-to-all communication that on most system architectures can be performed in $\log N_p$ steps.

Parallel implementation of the matrix-vector product depends on the distribution scheme for the system matrix. In the assumed setting of domain decomposition each processor owns a set of rows of the system matrix. To achieve the proper result, each processor computes the scalar product of each owned row with the components of the vector corresponding to non-zero entries in that row. Each entry in the system matrix corresponds to a pair of degrees of freedom; non-zero values appear when the shape functions associated with both degrees of freedom are simultaneously non-zero on some domain of integration in the finite element formulation of the problem solved. For such degrees of freedom (and the corresponding mesh objects) it will be said that the neighborhood relation holds (which means that whether two mesh objects are neighbors depends not only on the mesh topology, but also on the approximation used and the problem solved).

The necessity to know, for the vector considered in the matrix-vector product, the values of all the components corresponding to non-zero entries in the owned matrix rows usually means that each processor should exchange with other processors data corresponding to the one-element-wide overlap around its owned subdomain. The amount of data exchanged with other processors during the matrix-vector product is determined by the size of the overlap. The number of messages in which the values are transferred depends upon the number of neighboring subdomains for the owned subdomain. Both of these parameters, the size of the overlap and the number of neighbors for each subdomain, are usually optimized by a proper domain decomposition (mesh partition) algorithm. For performance analysis it is important that for any subdomain the number of neighboring subdomains remains bounded by some small number, for all problems in the sequences considered for scalability assessment.
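The row-wise distribution of the matrix-vector product can be illustrated with a small serial simulation. The sketch below is not taken from the paper: it assumes a 1D model matrix with stencil (−1, 2, −1), contiguous row blocks as "subdomains", and it mimics the exchange of the one-element-wide overlap by copying one ghost value from each neighboring block. All function names are hypothetical.

```python
def tridiag_matvec(x):
    """Reference product y = A x for the 1D stencil (-1, 2, -1)."""
    n = len(x)
    y = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        y.append(2.0 * x[i] - left - right)
    return y


def partition(n, n_proc):
    """Split row indices 0..n-1 into n_proc contiguous owned ranges."""
    bounds = [n * r // n_proc for r in range(n_proc + 1)]
    return [(bounds[r], bounds[r + 1]) for r in range(n_proc)]


def distributed_matvec(x, n_proc):
    """Simulate the owned-rows product; returns (y, number of messages).

    Requires len(x) >= n_proc so that every block owns at least one row.
    """
    ranges = partition(len(x), n_proc)
    owned = [x[lo:hi] for lo, hi in ranges]  # data held by each "processor"
    messages = 0
    result = []
    for r in range(n_proc):
        # Exchange step: one ghost value from each existing neighbour.
        ghost_left, ghost_right = 0.0, 0.0
        if r > 0:
            ghost_left = owned[r - 1][-1]
            messages += 1
        if r < n_proc - 1:
            ghost_right = owned[r + 1][0]
            messages += 1
        local = [ghost_left] + owned[r] + [ghost_right]
        # Local computation: products for owned rows only.
        for i in range(1, len(local) - 1):
            result.append(2.0 * local[i] - local[i - 1] - local[i + 1])
    return result, messages


if __name__ == "__main__":
    x = [float(i * i % 7) for i in range(20)]
    y, msgs = distributed_matvec(x, 4)
    print(y == tridiag_matvec(x), msgs)
```

In this 1D setting each block has at most two neighbors and exchanges one value with each, so the number of messages per processor stays bounded as the number of blocks grows, which is the property the performance analysis requires.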
4
Multigrid Preconditioning
Parallel implementation of multigrid preconditioning for finite element simulations poses several problems. To assess scalability, sequences of problems with an increasing number of processors and growing problem size are solved. If, as was assumed at the beginning, the increase in problem size is obtained through h-adaptivity, then the elements of subsequent generations form natural candidates for forming mesh levels. In the geometric multigrid formulated in this way there are three important questions to be answered: how to construct meshes at different levels, how to organize projections between meshes of different levels and how to solve exactly the problem on the coarsest mesh.

The answer to the first question is obvious for meshes obtained by uniform refinements and non-trivial for adaptive processes. Several possible strategies can be designed, like starting from the final mesh and constructing coarser meshes over the whole computational domain by including “the father” for any element resulting from refinement [5] or performing smoothing at a given level only for elements of a single corresponding generation [6].

The results obtained on any mesh level have to be transferred, according to the V-cycle multigrid algorithm, to another mesh, one level higher or lower. These projections must ensure that all data necessary for smoothing on the other level are available. Difficulties appear when the additive Schwarz method is used for parallel smoothing, for which at any mesh level an overlap is considered. To ensure convergence the overlap should have the width of one or more elements; hence, if the width measured by the number of elements is constant, the width for coarser meshes is larger than for finer meshes. When starting the V-cycle multigrid algorithm by smoothing on the finest level, each processor possesses updated data for its owned extended subdomain (subdomain with overlap).
Before smoothing on the next, first coarser, level, each processor has to know data corresponding to the extended subdomain, but on the coarser level. There are several ways to achieve that:
A. The overlap on the finest level can be enlarged so as to cover the overlap on the coarser level. In such a case the extended subdomains at both levels coincide and each processor can perform the projection onto the coarser mesh, obtaining the necessary data without communication. Then, on the coarser mesh the overlap is adjusted to the next mesh, and this requires a round of communication between processors, to receive data corresponding to elements added during the adjustment.

B. This last communication step can be eliminated completely by adjusting overlaps at all levels to the overlap on the coarsest mesh. In such a case the width of the overlap at the finest level becomes proportional to the number of levels. This can improve convergence of the solver but significantly increases the communication during smoothing. To limit the width, the number of levels can be kept constant, but this could decrease the overall convergence rate of the solver.

C. The last considered approach consists in performing projections from finer to coarser meshes by each processor only for its own original non-overlapping subdomain. All data necessary at the next level are obtained through communication. In this approach overlaps are defined separately for each level and are not adjusted.

Concerning the last question, solving the problem on the coarsest mesh, it is assumed in the current analysis that the problem is large enough to be solved in parallel by the smoother algorithm and, on the other hand, small enough that the fact that it has to be solved exactly does not significantly increase the execution time. Other choices, especially when the coarsest problem is too small to be distributed over all the processors performing calculations, include, e.g., solving a separate copy of the coarsest problem on each processor or some other more elaborate strategies [7].

The multigrid algorithm is recursive in its nature.
Resolving the recurrence leads to linear complexity in terms of the number of unknowns (degrees of freedom). Hence, to assess the scalability, only computations on the finest mesh are taken into account (this relies also on the assumption of a fast solution of the problem on the coarsest mesh).
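The linear complexity can be made explicit with a short worked estimate. It rests on two assumptions that are not stated in the text: the number of unknowns decreases by a constant factor $r > 1$ from one level to the next coarser one (for instance $r = 4$ for uniform refinement in 2D and $r = 8$ in 3D), and the work per V-cycle on a level with $N$ unknowns is bounded by $cN$ for some constant $c$. Neglecting the coarsest-mesh solve, the work $W(N)$ of one V-cycle then satisfies

$$W(N) = cN + W(N/r) \;\;\Longrightarrow\;\; W(N) \le cN \sum_{l=0}^{\infty} r^{-l} = \frac{r}{r-1}\,cN .$$

Under these assumptions the total work exceeds the finest-level work by at most the factor $r/(r-1)$, that is $4/3$ in 2D and $8/7$ in 3D, which is why restricting the scalability assessment to the finest mesh changes the estimates only by a constant factor.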
5
Performance Analysis
For the purpose of quantitative performance analysis several further assumptions are made. The processors are assumed to be of equal computational power and domain decomposition is assumed to divide the computational domain into subdomains of equal size. The number of internal degrees of freedom per processor is, hence, equal to $\alpha N_E/N_p$. It is assumed that the number of elements in the overlap is equal to

$$\gamma\,(N_E/N_p)^{\beta} \cdot \delta \qquad (4)$$

where the parameters γ and β depend upon the space dimension, the character of elements and the properties of domain decomposition, while δ is the width of the overlap measured in the number of element layers. The values of β reflect the surface to volume ratio for a given domain decomposition and ideally are equal to 1/2 for 2D problems and 2/3 for 3D problems. Therefore the total number of degrees of freedom per processor is equal to

$$\alpha\left(N_E/N_p + \gamma\,(N_E/N_p)^{\beta} \cdot \delta\right) \qquad (5)$$

Computation time is assumed to be determined by the number of floating point operations (in fact, all considered operations are axpy combinations of multiplication and addition), each of which is assumed to take an average time $t_c$. For communication, a simple model is adopted in which sending a message consisting of d floating point numbers takes the time $t_s + d\,t_w$, where $t_s$ is the start-up time and $t_w$ is the transfer time for a single floating point number. Hence execution times for the most important parallel operations during the considered solution procedures can be estimated as follows:

Reduction Vector Operations

$$t = \alpha\,\frac{N_E}{N_p}\,t_c + \log N_p\,(t_s + t_w) \qquad (6)$$

Matrix-Vector Multiplication

$$t = \alpha\,\frac{N_E}{N_p}\,m\,t_c + \alpha\gamma\left(\frac{N_E}{N_p}\right)^{\beta} t_w + N_N\,t_s \qquad (7)$$
where $m$ is the average number of nonzero entries in a single row of the system matrix and $N_N$ is the average number of neighboring subdomains for a single subdomain (i.e. the number of processors with which any processor exchanges messages). It is assumed that all updated vector components are sent in a single point-to-point message.
Smoothing

$$t = \alpha\left[\frac{N_E}{N_p} + \delta_s\gamma\left(\frac{N_E}{N_p}\right)^{\beta}\right] m\,t_c + \alpha\,\delta'_s\gamma\left(\frac{N_E}{N_p}\right)^{\beta} t_w + N_N t_s \tag{8}$$

The presented formula corresponds to one smoothing iteration with the additive Schwarz domain decomposition and Gauss-Seidel, ILU(0) or another similar approximate subdomain solver. The complexity is close to that of the matrix-vector product, but the inclusion of overlap components, due to the additive Schwarz procedure, is taken into account. The widths $\delta_s$ and $\delta'_s$ depend upon the strategy of performing updates; they can be set independently. $\delta_s$ denotes the overlap included in subdomain computations, $\delta'_s$ the overlap for which data are exchanged between subdomains. In the small overlap setting, $\delta_s$ may be equal to zero and $\delta'_s = 1$, similarly to the case of the matrix-vector product. For large overlap, $\delta_s$ will be greater than zero, and $\delta'_s$ may be greater than one (e.g. to accelerate convergence, some components during smoothing may be obtained by some combination of values computed locally and obtained from other processors). Nevertheless, it is assumed that only one data exchange is performed during a single smoothing iteration (the strategy often referred to as RAS, the restricted additive Schwarz procedure [8]).
Projections between Meshes

$$t = \alpha\left[\frac{N_E}{N_p} + \delta_p\gamma\left(\frac{N_E}{N_p}\right)^{\beta}\right]\nu\,t_c + \alpha\,\delta'_p\gamma\left(\frac{N_E}{N_p}\right)^{\beta} t_w + N_N t_s \tag{9}$$

During local projections, for each fine grid vector component the number of axpy operations depends on the properties of the triangulation and discretization. In the considered setting of geometric multigrid, for any "parent-child" combination (with variations taking into account different division patterns and different combinations of degrees of approximation) the number is constant. So, for quantifying the time of projections, a constant number of operations $\nu$, averaged over the whole mesh, is taken. Local projections are cheaper than the exchange of values between processors; hence, whenever the data after smoothing are available, they are used. This makes the widths $\delta_p$ and $\delta'_p$ used in projections related to the widths $\delta_s$ and $\delta'_s$ for smoothing (in principle they can be different). Their values depend upon the choice of one of the strategies described in the previous section. For case B: $\delta_p = \delta_s + \delta'_s$ and $\delta'_p = 0$; for case C: $\delta_p = 0$ and $\delta'_p = 1$ (for prolongation; for restriction the time corresponds to the number of components on the coarser mesh and the formula becomes more complicated); for case A the widths are in between cases B and C.
Memory requirements analysis for all the phases of finite element simulations shows that, for h adaptive computations, these requirements are proportional to the number of elements [5]. Thus, the assumption of the memory requirements being proportional to the problem size is satisfied. Sequences of problems whose sizes (numbers of elements) grow proportionally to the number of processors can be used to assess scalability. According to the definitions presented in Sec. 1, the code would be scalable if the time of its execution remained constant for any sequence of problems for which the ratio $N_E/N_p$ remains constant.
When assessing the scalability of iterative codes, two issues can be separated: the computational scalability, obtained for a single iteration (or a constant number of iterations) and independent of the convergence of the iterative process, and the overall scalability (numerical scalability), which takes convergence into account. When the number of subdomains grows, the performance of the additive Schwarz smoothing deteriorates. This effect can be partially counterbalanced by an increase in the overlap size. This is one of the reasons for introducing different strategies for organizing multigrid cycles.
Concerning the computational scalability, the execution time estimates presented above for the different operations of the solution procedure suffice to assess the scalability of finite element simulations with the described solver strategies. The ratio $N_E/N_p$ appears in all the estimates. If all the remaining parameters were constant, the code would theoretically be computationally scalable. There are two reasons why this is not the case. First, there is the factor $\log N_p\,(t_s + t_w)$ in the estimate of the time of reduction vector operations. Second, the size of the overlap on the finest level can be increased to ensure better preconditioning and to counterbalance the effect of the growing number of subdomains. Nevertheless, the computational scalability of the solver, defined as the scalability in performing a single iteration, can be approached for strategy C of organizing multigrid calculations. This and the other strategies are compared, in terms of numerical scalability (taking convergence rates into account) and computational scalability, in a series of experiments described in the next section.
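The weak-scaling behaviour of estimates (6) and (7) can be illustrated with a minimal sketch. All numerical parameter values below are hypothetical placeholders, and the base of the logarithm, which the text does not state, is assumed to be 2.

```python
import math


def reduction_time(ne_per_proc, n_proc, alpha, t_c, t_s, t_w):
    """Estimate (6): local axpy work plus a reduction over n_proc processors.

    The base of the logarithm is an assumption (base 2).
    """
    return alpha * ne_per_proc * t_c + math.log2(n_proc) * (t_s + t_w)


def matvec_time(ne_per_proc, alpha, gamma, beta, m, n_neigh, t_c, t_s, t_w):
    """Estimate (7): local product plus one exchange of overlap values."""
    compute = alpha * ne_per_proc * m * t_c
    transfer = alpha * gamma * ne_per_proc ** beta * t_w
    return compute + transfer + n_neigh * t_s


# Hypothetical machine and discretization parameters (not taken from the paper).
PARAMS = dict(alpha=4.0, t_c=1e-9, t_s=1e-4, t_w=1e-7)

if __name__ == "__main__":
    for n_proc in (1, 8, 64, 512):
        # Weak scaling: the number of elements per processor stays fixed.
        red = reduction_time(100_000, n_proc, **PARAMS)
        mv = matvec_time(100_000, gamma=6.0, beta=2 / 3, m=20, n_neigh=6, **PARAMS)
        print(f"Np={n_proc:4d}  reduction={red:.6f} s  matvec={mv:.6f} s")
```

With $N_E/N_p$ fixed, the matrix-vector estimate does not change with $N_p$, while the reduction estimate grows by $t_s + t_w$ each time $N_p$ doubles, which is the first of the two departures from computational scalability named above.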
6 Computational Experiments
The strategies for organizing multigrid calculations described in Sec. 4 are compared for the case of a discontinuous Galerkin approximation of Laplace's equation in a simple rectangular domain (for more details of the code setting see [5]). Tetrahedral elements with linear shape functions are used for discretization. Three finite element solutions are considered; they differ only in the number of degrees of freedom (unknowns), equal to 391 168, 3 129 344 and 25 034 752. Each subsequent solution is obtained on the mesh resulting from the uniform h refinement of the previous mesh. The hardware setting consists of a set of PCs with dual-core processors connected by a standard 100 Mbit Ethernet switch. The details are not essential to the question of scalability, although the network is much slower than in dedicated clusters, so the results for larger numbers of processors are more pessimistic.
Table 1 compares the computational scalability of the different strategies for parallel preconditioning described in the paper. Four strategies are considered: single level additive Schwarz with block Gauss-Seidel within subdomains (SL) and three multigrid preconditioners, all using additive Schwarz smoothing with block Gauss-Seidel within subdomains. The first two correspond to strategy B: MB1 with the number of levels kept constant and equal to 3 (and hence with a growing size of the coarsest problem) and MB2 with a growing number of levels (and hence with a growing size of the overlap). Finally, the method MC corresponds to strategy C. For every strategy the time of performing a single iteration of the GMRES method is shown for a sequence of case pairs: in the first case of each pair the number of unknowns is equal to 391 168, and in the second case the number of unknowns is equal to 3 129 344 (eight times larger) and the number of processors is also eight times larger than in the first case. Full scalability would be obtained if the times in both cases of a given pair were equal. This is not the case, but several remarks are in order. First, the single level preconditioning is the closest to computational scalability; it would be the method of choice if it had proper numerical scalability. Second, strategy C is better than both variations of strategy B, the result predicted by the estimates from Section 5.

Table 1. Times (in seconds) for performing a single GMRES iteration for different parallel preconditioning strategies (SL, MB1, MB2, MC; description in text)

       NDOF/Np = 391168     NDOF/Np = 195584     NDOF/Np = 130389
       Np = 1    Np = 8     Np = 2    Np = 16    Np = 3    Np = 24
SL     0.154     0.158      0.077     0.088      0.052     0.069
MB1    0.588     0.656      0.299     0.420      0.209     0.344
MB2    0.588     0.646      0.299     0.420      0.209     0.381
MC     0.572     0.619      0.289     0.366      0.199     0.326

Table 2 shows results for a series of cases that does not allow the scalability of the methods to be assessed but allows the strategies to be compared. The series consists of three cases with the following characteristics. Case 1: NDOF = 391168, Np = 1; case 2: NDOF = 3129344, Np = 8; case 3: NDOF = 25034752, Np = 30. For each case and each preconditioner the number of GMRES iterations necessary to decrease the residual by a factor of $10^{-9}$ (#iter) and the total parallel execution time (T) are reported. The results demonstrate that, despite the best computational scalability, single level preconditioning does not guarantee numerical scalability, and that strategy C is the best among the multigrid methods.

Table 2. Numbers of iterations to achieve GMRES convergence and total execution times for different parallel preconditioning strategies (SL, MB1, MB2, MC; description in text)

       NDOF = 391168      NDOF = 3129344     NDOF = 25034752
       Np = 1             Np = 8             Np = 30
       #iter   T          #iter   T          #iter   T
SL     44      12.992     88      31.678     >200    >200
MB1    14      13.102     17      16.393     20      57.019
MB2    14      13.102     17      16.451     20      64.259
MC     14      12.570     17      15.739     20      52.796
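The distance from full scalability in Table 1 can be expressed as a weak-scaling efficiency, i.e. the ratio of the time on the smaller configuration to the time on the eight times larger one (a value of 1 would mean full scalability). The short sketch below only re-uses the times reported in Table 1; the function and variable names are illustrative and not part of the described code.

```python
# Single-iteration GMRES times from Table 1, as (time on Np, time on 8*Np) pairs.
TABLE1 = {
    "SL":  [(0.154, 0.158), (0.077, 0.088), (0.052, 0.069)],
    "MB1": [(0.588, 0.656), (0.299, 0.420), (0.209, 0.344)],
    "MB2": [(0.588, 0.646), (0.299, 0.420), (0.209, 0.381)],
    "MC":  [(0.572, 0.619), (0.289, 0.366), (0.199, 0.326)],
}


def weak_scaling_efficiency(t_small, t_large):
    """Ratio of the times of the two cases of a pair; 1.0 means full scalability."""
    return t_small / t_large


def efficiencies(table):
    """Efficiency of every strategy for every case pair of the table."""
    return {name: [weak_scaling_efficiency(a, b) for a, b in pairs]
            for name, pairs in table.items()}


if __name__ == "__main__":
    for name, values in efficiencies(TABLE1).items():
        print(name, " ".join(f"{v:.3f}" for v in values))
```

In every pair the single level method has the highest efficiency, and MC is above both MB1 and MB2, which restates the two remarks made about Table 1.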
7 Conclusions
It is not possible to obtain perfect scalability for the considered strategy of finite element simulations with a Krylov space linear equations solver and multigrid preconditioning. The factors that are not perfectly scalable include:
– reduction vector operations, with communication time independent of the problem size and growing proportionally to the logarithm of the number of processors,
– a decrease in the convergence rate of the iterative linear equations solver with an increasing number of processors (and hence subdomains).
The first of these two factors causes an increase in execution time that is usually tolerable for the ranges of processor numbers used in practice. To minimize the second factor, techniques for compensating the increase in the number of subdomains should be used. The paper shows that parallel multigrid preconditioning with overlapping subdomain smoothing and a proper strategy for parallel data exchange can give good results in terms of both computational and numerical scalability.
References
1. Kumar, V., Grama, A., Gupta, A., Karypis, G.: Introduction to Parallel Computing: Design and Analysis of Algorithms, 2nd edn. Addison-Wesley Longman Publishing Co., Inc., Boston (2002)
2. Scott, L.R., Clark, T., Bagheri, B.: Scientific Parallel Computing. Princeton University Press, Princeton (2005)
3. Banaś, K., Płażek, J.: Parallel h-adaptive simulations of inviscid flows by the finite element method. Mechanika Teoretyczna i Stosowana 35, 249–262 (1997)
4. Barrett, B., Berry, M., Chan, T., Demmel, J., Donato, J., Dongarra, J., Eijkhout, V., Pozo, R., Romine, C., van der Vorst, H.: Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. SIAM, Philadelphia (1994)
5. Banaś, K.: The application of the adaptive finite element method in large scale computations (in Polish). Wydawnictwo Politechniki Krakowskiej, Kraków (2004)
6. Bastian, P.: Load balancing for adaptive multigrid methods. SIAM Journal on Scientific Computing 19(4), 1303–1321 (1998)
7. Chow, E., Falgout, R., Hu, J., Tuminaro, R., Yang, U.: A survey of parallelization techniques for multigrid solvers. In: Heroux, M., Raghavan, P., Simon, H. (eds.) Parallel Processing for Scientific Computing. SIAM Series on Software, Environments, and Tools, vol. 205864, SIAM, Philadelphia (2006) UCRL-BOOK-205864
8. Cai, X., Sarkis, M.: A restricted additive Schwarz preconditioner for general sparse linear systems. SIAM Journal on Scientific Computing 21, 792–797 (1999)
A Grid-Enabled Lattice-Boltzmann-Based Modelling System
Gérard Dethier¹, Cyril Briquet¹, Pierre Marchot², and P.A. de Marneffe¹
¹ Department of Electrical Engineering and Computer Science
² Department of Chemical Engineering
University of Liège, B-4000 Liège
Abstract. Lattice-Boltzmann (LB) methods are a well-known technique in the context of computational fluid dynamics. By nature, they can easily be parallelized, but their adaptation to the Grid environment is not trivial due to hardware heterogeneity (CPU, memory, ...) in a Grid. A load balancing method to dynamically handle the differences in terms of CPU number and power among the machines of a Grid is presented. The CPU power is dynamically estimated using a benchmark. An estimation method of execution time is also given.
1 Introduction
In the context of complex fluid flow simulations, standard techniques include those from computational fluid dynamics (CFD), one branch of fluid mechanics. These techniques are based on solving the Navier-Stokes equations using numerical methods and algorithms. Lattice Boltzmann (LB) methods constitute one family among these techniques. They present several advantages, such as dealing with complex boundaries (i.e. solids with complex geometries like porous media), incorporating microscopic interactions and being easily parallelizable. However, these algorithms need a lot of computational and memory resources. They need very powerful machines or computational grids. The latter solution has the advantage of being more attractive economically. In addition, sufficiently large machines do not exist for extreme cases. However, adapting LB-based codes to Grid computing is not simple. The main obstacle is the high level of heterogeneity in terms of hardware, software and temporal availability of the machines that are to be used.
This paper presents LaBoGrid, a modelling system under development which is based on LB methods. It is currently able to take into account the CPU heterogeneity of the Grid machines (i.e. the number of CPUs and their speed). Hereafter, the concepts of Grid computing and the LB methods are outlined. A load balancing method and the LaBoGrid architecture are introduced. An estimation method of LB job execution time is given and, finally, results and conclusions are presented.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1275–1284, 2008. © Springer-Verlag Berlin Heidelberg 2008
1.1 Grid Computing
Grid computing can be defined as “coordinated resource sharing and problem solving in dynamic, multi-institutional collaborations” [1]. Another widely accepted definition comes from Ian Foster’s paper What is the Grid? [2]: “A Grid is a system that:
1. coordinates resources that are not subject to centralized control,
2. using standard, open, general-purpose protocols and interfaces,
3. to deliver nontrivial qualities of service.”
In practice, a Grid user submits a job composed of tasks to the Grid. The tasks are distributed among the available resources. In our context, a resource is a machine of the Grid characterized by its CPU(s) and memory. The machines of a Grid do not necessarily have the same amount of memory or the same number of cores. A resource can therefore be single- or multi-core (in the remainder of this paper, we consider the terms “core” and “CPU” as equivalent). Moreover, the CPUs do not necessarily exhibit the same speed and could be based on different architectures.
1.2 LB Methods
The LB methods model the displacements of fictitious particles along given velocity vectors on discrete lattices in a discrete time base. This means that space, velocities (i.e. velocity vectors) and time have been discretized. The particles are displaced from one site (a node of the lattice) to a neighbouring site. The neighbourhood of a site is defined by the velocity vectors that a particle is allowed to follow. In our case, 3D fluids are modelled. Each site has 19 neighbours. This situation is illustrated in Figure 1. This lattice model is named D3Q19, following the generic naming scheme DiQj where i is the number of dimensions of the lattice and j the number of neighbours of each site. The lattices considered in the remainder of this paper are all D3Q19. The state of each site is given by a vector of real values, one for each possible velocity vector. Each value represents the proportion of particles (usually called field in the LB context) moving along a given velocity. In this paper, the state of a site therefore consists of 19 real values. The time evolution of the sites is given by the following two equations [6], written according to a widely used formalism [3]:

$$\hat{f}_i(x + v_i\,\delta t,\ t) = f_i(x, t) \tag{1}$$

$$f_i(x, t + \delta t) = \hat{f}_i(x, t) + \frac{1}{\tau}\left(f_i^{eq}(x, t) - \hat{f}_i(x, t)\right) \tag{2}$$

where $v_i$ is the i-th velocity vector and $\delta t$ the time step. $f_i^{eq}(x, t)$ is the equilibrium state, calculated as a function of $\hat{f}_i(x, t)$. $\tau$ is the relaxation coefficient. Equation 1 is the propagation step, explaining how a site “exchanges” data with its neighbours. Equation 2 is the collision step, explaining how the new state of a given site is calculated as a function of its incoming data.
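Equations 1 and 2 can be sketched in a few lines of plain Python on a very small lattice. This is a minimal illustration under stated assumptions, not the LaBoGrid implementation: the equilibrium function is the standard BGK form with the usual D3Q19 weights (the text does not give it), the lattice wraps around at its borders, and all names are illustrative.

```python
import itertools

# D3Q19 velocity set: vectors in {-1, 0, 1}^3 with at most two non-zero components.
VELOCITIES = [v for v in itertools.product((-1, 0, 1), repeat=3)
              if sum(abs(c) for c in v) <= 2]
# Standard D3Q19 weights, indexed by the number of non-zero components (assumption).
WEIGHTS = [(1 / 3, 1 / 18, 1 / 36)[sum(abs(c) for c in v)] for v in VELOCITIES]


def propagate(f, shape):
    """Equation 1: move each value to the neighbouring site along v_i (wrapping)."""
    f_hat = [dict() for _ in VELOCITIES]
    for i, v in enumerate(VELOCITIES):
        for x, value in f[i].items():
            target = tuple((x[k] + v[k]) % shape[k] for k in range(3))
            f_hat[i][target] = value
    return f_hat


def equilibrium(values):
    """Standard BGK equilibrium of one site, computed from its 19 values."""
    rho = sum(values)
    u = [sum(val * v[k] for val, v in zip(values, VELOCITIES)) / rho
         for k in range(3)]
    usq = sum(c * c for c in u)
    result = []
    for w, v in zip(WEIGHTS, VELOCITIES):
        vu = sum(v[k] * u[k] for k in range(3))
        result.append(w * rho * (1 + 3 * vu + 4.5 * vu * vu - 1.5 * usq))
    return result


def collide(f_hat, tau):
    """Equation 2: relax every site towards its local equilibrium."""
    f_new = [dict() for _ in VELOCITIES]
    for x in f_hat[0]:
        values = [f_hat[i][x] for i in range(len(VELOCITIES))]
        for i, (val, eq) in enumerate(zip(values, equilibrium(values))):
            f_new[i][x] = val + (eq - val) / tau
    return f_new
```

Here `f[i]` maps a site coordinate `(x, y, z)` to the value associated with velocity `i`. One time step is `collide(propagate(f, shape), tau)`; the total of all values is preserved by both steps, which is a convenient sanity check.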
Fig. 1. Neighbourhood of a site in the D3Q19 model. The grey circles indicate the 19 neighbours. The site itself is also considered as a neighbour.

Table 1. LB algorithm
“Initialise the state of the fluid”;
i := 0;
do i < iterationsCount →
    “Translate particles proportions”;
    “Collide particles proportions”;
    i := i + 1
od
Table 1 shows a generic LB algorithm. The “Translate particles proportions” step implements Equation 1. The “Collide particles proportions” step implements Equation 2. In the “Translate particles proportions” step, the sites on the boundaries of the lattice have no neighbours along some velocity vectors. As a result, these sites have outgoing data but no incoming data. The most common workaround is to use circular boundary conditions: what goes out from one side comes in at the opposite side and vice versa. In general, the flow of a fluid through a given structure like a pipe or a porous medium is modelled. This structure is generally represented by a bitmap associating to each site its nature (solid or not). The nature of each site is taken into account in the collision step. If a site is of solid nature, the “bounce-back” [4] is applied instead of Equation 2. In the bounce-back, the values of opposite velocities are exchanged. The flow of the fluid is driven by input and output pressure differences [5]. This method implies a modification of the $\hat{f}_i$ (see Equation 1) at the “input” and “output” of the lattice before the collision. For a given modelling, the flow follows one direction (x, y or z). The input of the lattice is the plane at position 0 in the direction of the flow. The output is the opposite plane in the flow direction. Actually, these special planes do not need incoming data any more, because the incoming data are calculated by the pressure difference application.
This implies that the circular boundary conditions can be ignored for these planes. We can safely “ignore” the outgoing values.
1.3 Adapting LB Methods to the Grid
The LB algorithm previously described is decomposed into a set of tasks, i.e. a job. Only one task may be assigned to each resource. As resources can have several available CPUs, each task instantiates a number of threads equal to the number of available CPUs. The initial lattice is subdivided into sub-lattices. An example of a lattice decomposed into several sub-lattices can be seen in Figure 2. The sub-lattices are processed by the threads of the tasks. If the sub-lattices are all cuboids and if a sub-lattice shares a face or an edge with only one other sub-lattice per face or edge, the neighbourhood of a sub-lattice has the same definition as a lattice site’s neighbourhood (see Figure 1). A sub-lattice has therefore 19 neighbours, which are not necessarily different and can include the sub-lattice itself. In Figure 3, the neighbourhood of 2 sub-lattices is shown. Only one communication channel between these two sub-lattices is needed.
Fig. 2. 3D lattice decomposed into 18 sub-lattices
Fig. 3. 2 sub-lattices and their neighbourhood
The step “Translate particles proportions” of the algorithm presented in Table 1 must be adapted in order to take into account the data exchange between sub-lattices. Indeed, the data going out of a sub-lattice at each LB time iteration must be transmitted to the neighbouring sub-lattices. The step “Collide particles proportions” of the algorithm presented in Table 1 must also be adapted. Before the collision operator can be applied on the boundaries of a sub-lattice, all incoming proportions must have been received. The algorithm shown in Table 2 is run by the tasks of an LB job. It highlights the modifications presented previously. The “Copy received boundary data” step is a blocking operation which causes all tasks to be almost synchronized in their current iteration i.
Table 2. LB task algorithm
“Initialise the state of the fluid”;
i := 0;
do i < iterationsCount →
    “Send boundaries to neighbours”;
    “Translate non-boundary proportions”;
    “Collide non-boundary proportions”;
    “Copy received boundary data”; /* blocking operation */
    “Collide boundaries”;
    i := i + 1
od
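The boundary exchange of the task algorithm can be illustrated on a deliberately simplified case. The sketch below is an assumption-laden toy, not LaBoGrid code: it uses a one-dimensional ring with three values per site (at rest, moving right, moving left) instead of D3Q19, runs both sub-lattices in a single process, and only covers the propagation part. It shows that propagating two sub-lattices separately and handing over the outgoing boundary values gives the same result as propagating the whole lattice with circular boundary conditions.

```python
def propagate_global(f):
    """Propagate on the whole ring; f = [rest, right-moving, left-moving] lists."""
    rest, right, left = f
    return [rest[:], right[-1:] + right[:-1], left[1:] + left[:1]]


def propagate_sub(f_sub, from_left, from_right):
    """Propagate inside one sub-lattice; border values come from the neighbours."""
    rest, right, left = f_sub
    return [rest[:], [from_left] + right[:-1], left[1:] + [from_right]]


def step_two_sublattices(f):
    """Split the ring in two sub-lattices, exchange boundaries, then propagate."""
    half = len(f[0]) // 2
    a = [row[:half] for row in f]
    b = [row[half:] for row in f]
    # "Send boundaries to neighbours": the values that leave each sub-lattice.
    a_to_right, a_to_left = a[1][-1], a[2][0]
    b_to_right, b_to_left = b[1][-1], b[2][0]
    # On a ring of two sub-lattices, each one is both the left and the right
    # neighbour of the other ("Copy received boundary data").
    new_a = propagate_sub(a, from_left=b_to_right, from_right=b_to_left)
    new_b = propagate_sub(b, from_left=a_to_right, from_right=a_to_left)
    return [ra + rb for ra, rb in zip(new_a, new_b)]
```

In a distributed setting the four boundary values would travel over the network (or through memory when both sub-lattices belong to the same task), and the non-boundary sites could be processed while the transfer is in progress, as in Table 2.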
2 Load Balancing with LB Tasks
The LB tasks have to be moldable, i.e. they must be able to handle any number of sub-lattices and be dynamically re-configurable (receive new sub-lattices, remove others). Actually, all tasks are handled by a centralized controller which configures them. Obviously, tasks must first register themselves with the controller. Load balancing is simply achieved with “smart” task configurations, i.e. by choosing how to distribute m sub-lattices among the tasks. The present method is general, as it can be adapted to any lattice type (D2Q9, D3Q27, ...). Moreover, it does not require any adaptation of the base algorithm.
2.1 Homogeneous Tasks
If no balancing is done, each task receives the same number of sub-lattices. This leads to the situation where resources with fast CPUs wait for resources with slow CPUs. This is a consequence of the time synchronization of the tasks (see Section 1.3). A first step towards load balancing is to take into account the number of CPUs available to the tasks. Each task receives a number of sub-lattices proportional to its number of available CPUs. This configuration leads to what we call “homogeneous tasks”.
2.2 Scaled Tasks
The previous method can be enhanced by taking into account the speed of the CPUs. Indeed, with homogeneous tasks, each CPU receives the same amount of work. The fast CPUs wait for the slower ones at each LB iteration. As a result, the overall LB process is limited by the slowest CPU. Therefore, “more work” should be attributed to a fast CPU. For example, a CPU i twice as fast as a CPU j should receive twice as much work as CPU j. This configuration leads to what we call “scaled tasks”. A weight $w_i$ can be associated to each task i. It is calculated in the following way:

$$w_i = \frac{c_i\,p_i}{\sum_i c_i\,p_i} \tag{3}$$

where $c_i$ is the number of available CPUs and $p_i$ their power. If m is the number of sub-lattices to distribute among n tasks, each task i receives $\lfloor w_i m \rfloor$ sub-lattices. There remain $k = m - \sum_{i=1}^{n} \lfloor w_i m \rfloor$ sub-lattices to distribute. They are distributed to the most “underestimated” tasks, i.e. those that maximize $w_i m - \lfloor w_i m \rfloor$. The “scaled tasks” configuration requires the $p_i$ values to be known. These are estimated by using a benchmark. It is a small classical non-parallel LB algorithm with s LB sites. The LB algorithm is iterated l times. If it took $t_i$ seconds to execute the benchmark, $p_i = \frac{s \times l}{t_i}$. The power $p_i$ is given as a number of sites processed per second.
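The distribution rule for scaled tasks can be sketched as follows. The function names and the example values are illustrative assumptions; only the weight formula, the rounding down, the treatment of the k remaining sub-lattices and the benchmark formula come from the text.

```python
import math


def benchmark_power(sites, iterations, seconds):
    """p_i = s * l / t_i, in LB sites processed per second."""
    return sites * iterations / seconds


def scaled_distribution(cpus, powers, m):
    """Distribute m sub-lattices among tasks according to the weights w_i."""
    total = sum(c * p for c, p in zip(cpus, powers))
    weights = [c * p / total for c, p in zip(cpus, powers)]
    # Each task first receives floor(w_i * m) sub-lattices.
    counts = [math.floor(w * m) for w in weights]
    remaining = m - sum(counts)
    # The remaining sub-lattices go to the most "underestimated" tasks,
    # i.e. those with the largest fractional part w_i * m - floor(w_i * m).
    order = sorted(range(len(weights)),
                   key=lambda i: weights[i] * m - counts[i], reverse=True)
    for i in order[:remaining]:
        counts[i] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical grid: two single-CPU machines and one 4-CPU machine.
    print(scaled_distribution(cpus=[1, 1, 4], powers=[1.0, 2.0, 2.0], m=125))
```

For this hypothetical grid the weights are 1/11, 2/11 and 8/11, so the tasks first receive 11, 22 and 90 sub-lattices, and the two remaining sub-lattices go to the second and third tasks, giving 11, 23 and 91.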
3 Estimation of Execution Time
In this section, a model to estimate the execution time of an LB job is presented. It allows a theoretical upper bound on the gain obtainable with the presented load balancing technique to be estimated. The normalized execution time t of an LB job is the time (in seconds) needed by all LB tasks to execute one LB iteration. The normalized execution time $t_i$ of a task i is the time (in seconds) the task needs to execute one LB iteration. As all tasks are synchronized, all tasks have done one iteration after $t = \max_i t_i$ seconds. $t_i$ is given by $t_i^c + t_i^t$, where $t_i^c$ is the processing time and $t_i^t$ the transmission time (when exchanging inter-task data for one iteration). The processing time $t_i^c$ depends on the number of CPUs $c_i$ available to task i, their power $p_i$ (we consider that all CPUs of one resource have the same power) and $q_i$, the amount of work assigned to task i. It is calculated with the following expression:

$$t_i^c = \frac{q_i}{c_i\,p_i}$$

where $q_i$ is the number of LB sites that task i handles. If $p_i$ is given as the number of LB sites processed per second, $t_i^c$ is given in seconds. The transmission time $t_i^t$ depends on $d_i$, the amount of data exchanged each iteration, and BW, the network bandwidth. It is calculated with the following expression:

$$t_i^t = \frac{d_i}{BW}$$

where $d_i$ is the amount of data (in LB sites) exchanged each iteration. If BW is given as a number of LB sites transmitted per second, $t_i^t$ is given in seconds. The $d_i$ parameter depends on the sub-lattices assigned to task i. Sub-lattices of a given task that share interfaces (faces or edges) exchange the data associated to these interfaces through memory, not through the network. In the best case, the amount of data exchanged through the network is minimized by a good placement of sub-lattices ($d_i = \hat{d}_i^{min}$). In the worst case, no data are exchanged through memory ($d_i = d_i^{max}$). The $\hat{d}_i^{min}$ case is not simple to estimate. An approximation $d_i^{min}$ is therefore made. All the sites of the sub-lattices of a given task are artificially grouped into a single cubic sub-lattice. The cube minimizes the surface of the faces of a cuboid given a fixed volume. The amount of data exchanged by the sub-lattice, proportional to the surface of its faces, is therefore minimum. Actually, $d_i^{min} \le \hat{d}_i^{min}$ because sub-lattices cannot always be organized to form a cube. The BW parameter depends on the bandwidth b of the network (given as a number of bits transmitted per second) and the size of an LB site. Our model is D3Q19, so a site is “represented” by 19 floats. If a float is composed of 4 bytes, a site “weighs” 608 bits. BW is calculated with the following expression:

$$BW = \frac{b}{\mathrm{size}(\mathrm{site})}$$

4 LaBoGrid
LaBoGrid is a modelling platform based on LB methods and adapted to Grid computing. It is written in Java, so it can be deployed on any computer provided that a Java virtual machine is available. LaBoGrid’s implementation is based on the principles described so far.
Fig. 4. LaBoGrid components and their interconnections
LaBoGrid is based on an “in house” middleware providing a hierarchy of classes and interfaces that can be extended and implemented to gridify the LB modelling. The two main components of this middleware are the controller and the distributed agent. The distributed agents run on grid resources. The controller connects them together. A controller task can be associated to the controller. This special task instantiates the tasks of a job on the distributed agents. Figure 4 illustrates the LaBoGrid components and how they are connected.
There are two classes that need to be extended to adapt the middleware to a given context: the controller task (LBCT) and the distributed agent task (LBDAT). The LBCT instantiates an LBDAT on each resource. It then distributes the sub-lattices among the tasks. Depending on its configuration, the LBCT uses either “scaled tasks” or “homogeneous tasks” (both introduced in Section 2). With “scaled tasks”, the LBCT must instantiate a benchmark on each registered task before the instantiation of the LB tasks (see Section 2.2). The LBDAT runs the algorithm shown in Table 2.
5 Results
In this section, the estimated normalized execution times obtained with homogeneous tasks and with scaled tasks are compared. Estimated and observed job execution times are then compared. An LB modelling using an initial lattice of 176 × 176 × 176 LB sites divided into 125 sub-lattices is run on a small grid of 10 machines. There are 4 Celeron based PCs (1 CPU each), 4 Pentium IV based PCs (1 CPU each) and 2 Xeon based servers (4 cores each). All these machines are connected through a 100 Mbit/s Fast Ethernet network.
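The execution time model of Section 3 can be sketched as follows. Only the expressions for the processing time, the transmission time and BW come from the text; the function names are illustrative, the CPU powers in the example are hypothetical, and the cube-surface proxy for the minimum amount of exchanged data (whole sites on the six faces of a cube) is an assumption, since the text only states that this amount is proportional to the surface.

```python
def sites_per_second(bits_per_second, values_per_site=19, bytes_per_value=4):
    """BW = b / size(site); a D3Q19 site of 4-byte floats weighs 608 bits."""
    return bits_per_second / (values_per_site * bytes_per_value * 8)


def cube_surface_sites(q):
    """Assumed proxy for the minimum data: sites on the faces of a cube of q sites."""
    return 6 * q ** (2 / 3)


def task_time(q, c, p, d, bw):
    """Normalized task time: processing q / (c * p) plus transmission d / bw."""
    return q / (c * p) + d / bw


def job_time(tasks, bw):
    """Tasks are synchronized, so one iteration takes as long as the slowest task.

    Each task is a (q, c, p) tuple: sites handled, CPU count, power per CPU.
    """
    return max(task_time(q, c, p, cube_surface_sites(q), bw) for q, c, p in tasks)


if __name__ == "__main__":
    bw = sites_per_second(100e6)  # 100 Mbit/s network
    # Hypothetical tasks: (sites handled, CPUs, sites per second per CPU).
    tasks = [(300_000, 1, 1.0e6), (600_000, 1, 2.0e6), (2_400_000, 4, 2.0e6)]
    print(f"BW = {bw:.1f} sites/s, estimated job time = {job_time(tasks, bw):.3f} s")
```

On a 100 Mbit/s network the bandwidth is about 164 474 sites per second, so the transmission term is far from negligible compared with the processing term, which is why the placement of sub-lattices matters.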
Fig. 5. Estimated normalized task execution times
Figure 5 shows the estimated normalized execution times on each machine. In the case of homogeneous tasks, the difference between the execution times on the Celeron based machines and on the servers is large. This is because the Xeon processors of the servers are the fastest of the grid and the Celerons the slowest. With scaled tasks, this difference is much smaller when $d_i = d_i^{min}$. When $d_i = d_i^{max}$, there is again a large difference between the execution times on Celeron machines and Xeon machines, but in this case the Xeon machines are the slowest. This is because, as they have more sub-lattices to process than the Celeron machines, they have more data to exchange. Their transmission time is therefore increased. In the $d_i = d_i^{min}$ case, the transmission time difference is rather small. This is not the case when $d_i = d_i^{max}$. There are also small differences in processing times caused by the quantization error introduced by the sub-lattices (work can only be assigned in whole sub-lattices). The estimated normalized job execution times are represented by the horizontal dashed lines in Figure 5. On the experimental grid, a job executed with scaled tasks can run up to twice as fast as a job with homogeneous tasks. However, this holds in the case where $d_i = d_i^{min}$. With $d_i = d_i^{max}$ there is still some gain, but it is smaller.
Fig. 6. Comparison between estimated and observed job execution times
Actually, sub-lattices should be placed carefully among the tasks to minimize the network communications. In LaBoGrid, this is done with a simple partitioning algorithm based on some heuristics (with a complexity of $O(n^2)$ in the number of sub-lattices). The comparison between estimated and observed job execution times given in Figure 6 shows that it works rather well. Indeed, the observed execution time is near the estimate for the $d_i = d_i^{min}$ case.
6 Conclusion
After a theoretical presentation of Grid computing, LB methods and how they can be adapted to the Grid environment, a method to handle CPU heterogeneity among the machines of a Grid has been presented. Indeed, these machines do not necessarily have the same number of CPUs or the same CPU speeds. Using this load balancing method led to rather good results. A model to estimate the execution time of LB jobs has been given. Our implementation of a distributed 3D LB modelling system adapted to the Grid environment, LaBoGrid, has been outlined. This is a first step toward Grid enabling LaBoGrid. In the future, the scalability of such a system should be analysed. Indeed, we work in a rather favourable context, i.e. a few machines connected by a 100 Mbit/s network. Another important step towards an adaptation to the Grid environment is fault tolerance: presently, if a machine interrupts its work and goes down, the overall system is stopped and cannot continue. Even worse, the data of the failed machine are lost, so the global data cannot be reconstructed. Distributed checkpointing is a possible solution. Each task could save its state (actually, the data it processes) to some other resources. When a resource is lost, the data of its task can be retrieved and distributed to other tasks.
Acknowledgements
This work was performed in the frame of a research concerted action financed by the Communauté Française de Belgique.
References
1. Nabrzyski, J., Schopf, J., Weglarz, J.: Grid Resource Management: State of the Art and Future Trends. Kluwer Academic Publishers, Dordrecht (2003)
2. Foster, I.: What is the Grid? A three point checklist. Grid Today (2002)
3. Dupuis, A.: From a Lattice Boltzmann model to a parallel and reusable implementation of a virtual river. PhD Thesis, Université de Genève (2002)
4. Wolfram, S.: Cellular automaton fluids 1: Basic theory. J. Stat. Phys. 45, 471–526 (1986)
5. Zou, Q., He, X.: On pressure and velocity boundary conditions for the Lattice Boltzmann BGK model. 9, 1591–1598 (1997)
6. Qian, Y.H., d’Humières, D., Lallemand, P.: Lattice BGK models for Navier-Stokes equation. Europhys. Lett. 17(6), 470–484 (1992)
Parallel Bioinspired Algorithms in Optimization of Structures
Wacław Kuś¹ and Tadeusz Burczyński¹,²
¹ Department for Strength of Materials and Computational Mechanics, Silesian University of Technology, ul. Konarskiego 18a, 44-100 Gliwice, Poland [email protected], [email protected]
² Institute for Computer Modelling, Cracow University of Technology, ul. Warszawska 24, 31-155 Cracow, Poland
Abstract. Parallel versions of bioinspired algorithms are presented in the paper. Parallel evolutionary algorithms and artificial immune systems are described. Applications of bioinspired algorithms to the optimization of mechanical structures are shown. The numerical tests presented in the paper were computed using a grid based on the Alchemi framework.
1 Introduction
Optimization methods inspired by biological mechanisms have become very popular in the last few decades. Most of them give good results when optimizing multimodal functions. The paper describes two computational intelligence algorithms: evolutionary algorithms and artificial immune systems. The optimization of mechanical structures is a time-consuming process because it requires hundreds or even thousands of objective function evaluations. Each objective function evaluation involves the solution of a direct problem, which in most cases is solved by means of FEM [12]. The wall-clock computation time can be shortened when parallel algorithms are used [3][2]. Computational grids provide a sophisticated and user-friendly environment for performing time-consuming computations. Grids give the opportunity to use distributed resources in engineering optimization problems [6]. The Alchemi framework [1] is used in the paper. The shape optimization of the anvils in a two-stage forging process is considered as a benchmark problem.
2 Grid Based on Alchemi Framework
The Alchemi framework [1] was used to construct a grid. Alchemi is based on Windows .NET, which makes it usable only on hardware running the Windows operating system. Alchemi consists of a few elements: the Alchemi Manager, the central host with scheduling capabilities (one manager is needed for the grid or for a part of the grid); the Alchemi Executors, the hosts performing computations; and the Alchemi Cross Platform Manager, a web-services-based manager able to communicate with non-Alchemi parts of the grid. The Alchemi Manager host can also be used as an Executor. The security policy is based on usernames and passwords. Users can be grouped. The end-user, executor and administrator groups are available by default. Information about users, tasks, jobs and executor hosts is stored in a database connected to the Alchemi Manager. Communication between the Alchemi Manager and the Executors is performed over selected TCP/IP ports. These ports need to be reachable through firewalls. The architecture of a grid based on Alchemi is shown in Fig. 1 [1].
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1285–1292, 2008. © Springer-Verlag Berlin Heidelberg 2008
Fig. 1. Architecture of grid based on Alchemi framework[1]
The most important advantage of the Alchemi framework is the API provided for developers of grid applications. Grid applications are executed as remote threads running on executors. Communication between remote threads is prohibited. A task submitted to the grid consists of threads. The Alchemi framework also contains the Alchemi Console for monitoring tasks, threads and executors. The Alchemi Manager and Executor can run as system services.
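The execution model described above (a task split into independent threads that never talk to each other) can be illustrated with a minimal sketch. It does not use the Alchemi .NET API; Python's standard `concurrent.futures` thread pool stands in for the executors, and the function names `evaluate` and `evaluate_population` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate(design):
    """Stand-in objective function; in the paper this is an FEM simulation."""
    return sum(x * x for x in design)


def evaluate_population(population, workers=4):
    """Evaluate every design independently; results keep the input order.

    Each call to `evaluate` plays the role of one remote thread: it receives
    its input, returns its result, and never communicates with other threads.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate, population))


population = [[0.0, 1.0], [2.0, 2.0], [3.0, 0.5]]
fitness = evaluate_population(population)
print(fitness)
```

Because the evaluations share no state, the only synchronisation point is the collection of results by the submitting process, which is what makes population-based optimizers a natural fit for this kind of grid.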
3 Evolutionary Algorithms
Genetic and evolutionary algorithms (EA) [8] are based on mechanisms taken from the biological evolution of species. Selection based on individual fitness, mutation of chromosomes and crossover of individuals are adopted. Genetic algorithms operate on binary coded chromosomes. The term evolutionary algorithm is used more widely, for different modifications of genetic algorithms (including algorithms operating on genes that contain floating point numbers). Evolutionary algorithms operate on a population of individuals. In most cases an individual contains one chromosome. The following description concerns the evolutionary algorithm used in the numerical examples. The starting population is created randomly. Next, the fitness function values are computed for each chromosome. The selection chooses chromosomes for a new parent subpopulation taking into account the fitness function values. Evolutionary operators change the genes of chromosomes and create chromosomes for the offspring population. The uniform mutation, the Gaussian mutation and the simple crossover are chosen randomly to perform the chromosome changes. The new chromosomes are then evaluated. The algorithm works iteratively until the termination condition is fulfilled. The flowchart of the evolutionary algorithm is presented in Fig. 2 [7].
Fig. 2. The flowchart of the evolutionary algorithm
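The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the objective is minimised, selection is a binary tournament, the best individual is carried over unchanged (elitism), and the Gaussian step is 10% of the variable range. None of these details are given in the text, and the names `evolve` and `sphere` are hypothetical.

```python
import random


def evolve(fitness, bounds, pop_size=20, generations=100, seed=0):
    """Minimise `fitness` inside the box `bounds`; returns (best, score, history)."""
    rng = random.Random(seed)

    def uniform_mutation(c):
        # Replace one gene by a value drawn uniformly from its range.
        i = rng.randrange(len(c))
        child = c[:]
        child[i] = rng.uniform(*bounds[i])
        return child

    def gaussian_mutation(c):
        # Perturb one gene with Gaussian noise and clip it to its range.
        i = rng.randrange(len(c))
        lo, hi = bounds[i]
        child = c[:]
        child[i] = min(hi, max(lo, c[i] + rng.gauss(0.0, 0.1 * (hi - lo))))
        return child

    def simple_crossover(a, b):
        # One-point crossover: head of `a`, tail of `b`.
        cut = rng.randrange(1, len(a)) if len(a) > 1 else 0
        return a[:cut] + b[cut:]

    def tournament(pop, scores):
        i, j = rng.randrange(len(pop)), rng.randrange(len(pop))
        return pop[i] if scores[i] <= scores[j] else pop[j]

    population = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [fitness(c) for c in population]
    history = [min(scores)]
    for _ in range(generations):
        parents = [tournament(population, scores) for _ in range(pop_size)]
        offspring = []
        for p in parents:
            op = rng.choice((uniform_mutation, gaussian_mutation, simple_crossover))
            if op is simple_crossover:
                offspring.append(op(p, rng.choice(parents)))
            else:
                offspring.append(op(p))
        # This evaluation step is the part that would be distributed on a grid.
        new_scores = [fitness(c) for c in offspring]
        # Elitism (an assumption): the best parent replaces the worst child.
        best = min(range(pop_size), key=scores.__getitem__)
        worst = max(range(pop_size), key=new_scores.__getitem__)
        if scores[best] < new_scores[worst]:
            offspring[worst], new_scores[worst] = population[best], scores[best]
        population, scores = offspring, new_scores
        history.append(min(scores))
    best = min(range(pop_size), key=scores.__getitem__)
    return population[best], scores[best], history


def sphere(x):
    return sum(v * v for v in x)


best, score, history = evolve(sphere, [(-5.0, 5.0)] * 3)
```

With elitism the best objective value never gets worse from one generation to the next, which makes the `history` list a convenient convergence check.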
4 Artificial Immune Systems
Artificial immune systems are developed on the basis of mechanisms discovered in biological immune systems. An immune system is a complex system which contains distributed groups of specialized cells and organs. The main purpose of the immune system is to recognize and destroy pathogens: fungi, viruses, bacteria and improperly functioning cells. Lymphocytes play a very important role in the immune system. They are divided into several groups of cells. There are two main groups, B cells and T cells, both of which contain subgroups (like B-T dependent or B-T independent). The B cells contain antibodies, which can neutralize pathogens and are also used to recognize pathogens. There is a large diversity among the antibodies of the B cells, allowing recognition and neutralization of many different pathogens. The B cells are produced in the bone marrow of long bones. A B cell undergoes a mutation process to achieve a large diversity of antibodies. The T cells mature in the thymus; only T cells recognizing non-self cells are released to the lymphatic and blood systems. There are also other cells, like macrophages, with presenting properties: the pathogens are processed by a cell and presented using MHC (Major Histocompatibility Complex) proteins. The recognition of a pathogen is performed in a few steps. First, the B cells or macrophages present the pathogen to a T cell using MHC, and the T cell decides whether the presented antigen is a pathogen. The T cell gives a chemical signal to B cells to release antibodies. A part of the stimulated B cells goes to a lymph node and proliferates (clones). A part of the B cells changes into memory cells, and the rest of them secrete antibodies into the blood. The secondary response of the immune system in the presence of known pathogens is faster because of the memory cells. The memory cells created during the primary response proliferate and the antibodies are secreted into the blood. The antibodies bind to pathogens and neutralize them. Other cells, like macrophages, destroy pathogens. The number of lymphocytes in the organism changes: it increases in the presence of pathogens, but after an attack a part of the lymphocytes is removed from the organism.
Artificial immune systems (AIS) [5] take only a few elements from biological immune systems. The most frequently used are the mutation of the B cells, proliferation, memory cells, and recognition using the B and T cells. Artificial immune systems have been applied to optimization problems, classification and also the recognition of computer viruses. The cloning algorithm Clonalg presented by de Castro and Von Zuben [4] applies mechanisms similar to those of biological immune systems to global optimization problems. The unknown global optimum is the searched pathogen. The memory cells contain design variables and proliferate during the optimization process. The B cells created from memory cells undergo mutation. The B cells are evaluated and the better ones replace memory cells. In Wierzchoń's version of Clonalg [11] a crowding mechanism is used: diversity among the memory cells is enforced. If two memory cells have similar design variables, a new memory cell is randomly created and substitutes the old one. The crowding mechanism allows finding not only the global optimum but also other local ones. The presented approach is based on the algorithm presented in [11], with a changed mutation operator: the Gaussian mutation is used instead of the nonuniform mutation. The parallel artificial immune system (PAIS) was introduced by Watkins et al. [10] for classification problems.
The artificial immune system is implemented as one master process and a number of worker processes, which evaluate the objective functions for the B cells. The memory cells are created randomly. They proliferate and mutate, creating B cells. The number of clones created by each memory cell is determined by the objective function value of that memory cell. The objective functions for the B cells are evaluated. The selection process exchanges some memory cells for better B cells. The selection is performed on the basis of the geometrical distance between each memory cell and the B cells (measured using the design variables). The crowding mechanism removes similar memory cells. The similarity is also determined as the geometrical distance between memory cells. The process is repeated iteratively until the stop condition is fulfilled. The stop condition can be expressed as the maximum number of iterations. The master part of the algorithm runs on the central processor and the workers communicate only with the master. The Alchemi grid environment was used in the computations. The flowchart of the artificial immune system is shown in Fig. 3.
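The steps above (proliferation, mutation, distance-based selection, crowding) can be sketched as follows. This is a hedged illustration, not the authors' code. It assumes minimisation, a rank-based number of clones, Gaussian mutation of every design variable, and that the worse of two crowded memory cells is the one replaced. The text does not fix these details, and the name `immune_optimize` and all parameter names are hypothetical.

```python
import math
import random


def immune_optimize(objective, bounds, n_memory=6, max_clones=5,
                    iterations=50, sigma=0.3, crowd_radius=0.2, seed=0):
    """Clonal-selection sketch with crowding; minimises `objective`."""
    rng = random.Random(seed)

    def random_cell():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def gaussian_mutation(cell):
        return [min(hi, max(lo, x + rng.gauss(0.0, sigma)))
                for x, (lo, hi) in zip(cell, bounds)]

    memory = [random_cell() for _ in range(n_memory)]
    values = [objective(c) for c in memory]
    i_best = min(range(n_memory), key=values.__getitem__)
    best_cell, best_value = memory[i_best], values[i_best]
    history = [best_value]
    for _ in range(iterations):
        # Proliferation and mutation: better memory cells get more clones.
        ranked = sorted(range(n_memory), key=values.__getitem__)
        b_cells = []
        for rank, idx in enumerate(ranked):
            for _ in range(max(1, max_clones - rank)):
                b_cells.append(gaussian_mutation(memory[idx]))
        # Evaluation of B cells: the step handed to the workers in the paper.
        b_values = [objective(c) for c in b_cells]
        # Selection: a B cell competes with its geometrically nearest memory cell.
        for cell, value in zip(b_cells, b_values):
            near = min(range(n_memory), key=lambda i: math.dist(cell, memory[i]))
            if value < values[near]:
                memory[near], values[near] = cell, value
        i_best = min(range(n_memory), key=values.__getitem__)
        if values[i_best] < best_value:
            best_cell, best_value = memory[i_best], values[i_best]
        # Crowding: of two similar memory cells, the worse one is re-created.
        for i in range(n_memory):
            for j in range(i + 1, n_memory):
                if math.dist(memory[i], memory[j]) < crowd_radius:
                    k = i if values[i] > values[j] else j
                    memory[k] = random_cell()
                    values[k] = objective(memory[k])
        history.append(best_value)
    return best_cell, best_value, memory, history


def sphere(x):
    return sum(v * v for v in x)


best_cell, best_value, memory, history = immune_optimize(sphere, [(-5.0, 5.0)] * 2)
```

Because crowding keeps the memory cells apart, the returned `memory` list may hold several distinct local optima in addition to the best cell.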
Fig. 3. The flowchart of the artificial immune system
5 Numerical Example
The shape optimization of anvils in the first stage of two-stage forging is considered as a numerical example [7]. The forging in the second stage is performed using a flat die. The goal of the optimization is to find a shape of the anvils that gives a cylindrical product after the second stage. Without optimization, the product after forging has a barreled shape. The forging is simulated using the MSC.Marc [9] program. The finite element method is used to solve the direct problem. The preform is made from a highly nonlinear material. The material nonlinearities and the different shapes of the preform influence the fitness function computing time. The anvils were described using 8 parameters of the control polygon of a NURBS curve. The objective function was defined as the difference between the ideal cylindrical shape and the shape obtained after forging [7]. The optimization was performed using the evolutionary algorithm and the artificial immune system. The results obtained using both algorithms were very close to each other. The results of two-stage forging when only flat anvils were used are presented in Fig. 4. The barreling of the product can be easily observed. The results after optimization using both algorithms are presented in Fig. 5. The shape of the product is very close to the cylindrical one.
Fig. 4. The product after forging using flat anvils
Fig. 5. The product after forging by using the best found anvils: a) after first stage, b) after second stage of forging
The theoretical speedup of parallel population-based algorithms can be very close to the linear speedup. The speedup depends on the number of chromosomes or B cells (the number of objective functions to be evaluated in each iteration) and on the number of processors. Consider the number of individuals $n$ and the number of processors $n_p$. We assume that the time needed to compute a fitness function value is constant and equal to $t_f$. The time needed for communication between processors and the time needed to apply the evolutionary operators and selection are very small (compared to the time $t_f$ needed for a fitness function evaluation) and can be neglected. The wall time of one iteration of the optimization algorithm using one processor is equal to $t_1 = t_f \cdot n$. When $n_p$ processors are used, the wall time is equal to
$$ t_{n_p} = t_f \, (n \backslash n_p + r) \tag{1} $$
where $\backslash$ denotes integer division (the fractional part of the result is discarded), and $r$ is equal to:
$$ r = \begin{cases} 0 & \text{if } n \bmod n_p = 0 \\ 1 & \text{otherwise} \end{cases} \tag{2} $$
The speedup is equal to:
$$ s = \frac{t_1}{t_{n_p}} = \frac{n}{n \backslash n_p + r} \tag{3} $$
The speedup is equal to $n$ when $n_p > n$, since then $n \backslash n_p = 0$ and $r = 1$. The ideal situation is when the remainder of the integer division of the number of chromosomes by the number of processors is equal to zero; then the theoretical speedup reaches the linear value $s = n_p$ for $n_p \le n$. Experimental measurements of speedups for different numbers of processors were performed. The results are shown in Tab. 1. In each iteration 18 objective functions were computed. The theoretical speedups were computed using the average objective function computation time. In some cases the measured speedups are better than the theoretical ones. In the real experiment the times needed to compute the objective functions differ between individuals. In some cases a processor can therefore compute two individuals while another computes only one. This situation occurs rarely, but it explains why the real experiment can perform better than theoretically predicted.

Table 1. Theoretical and measured speedups
number of processors   measured speedup   theoretical speedup
 1                      1.00               1.00
 2                      1.98               2.00
 4                      3.67               3.60
 8                      6.09               6.00
16                      9.35               9.00
18                     14.01              18.00
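As a quick check, the theoretical column of Tab. 1 can be reproduced directly from Eqs. (1)-(3) with n = 18. The second function is a small illustration of why measured speedups can exceed the prediction when evaluation times differ. It assumes that each new evaluation is handed to the first processor that becomes free, which is an assumption about the scheduling and not something stated in the text. Both function names are hypothetical.

```python
import heapq


def theoretical_speedup(n, n_proc):
    """Eq. (3): n equal-cost evaluations on n_proc processors."""
    rounds = n // n_proc + (0 if n % n_proc == 0 else 1)
    return n / rounds


def greedy_wall_time(times, n_proc):
    """Wall time when each evaluation goes to the first free processor."""
    free_at = [0.0] * n_proc
    heapq.heapify(free_at)
    for t in times:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + t)
    return max(free_at)


for n_proc in (1, 2, 4, 8, 16, 18):
    print(n_proc, round(theoretical_speedup(18, n_proc), 2))

# Three evaluations with unequal costs on two processors: one processor
# finishes two short evaluations while the other works on the long one.
times = [2.0, 1.0, 1.0]
observed = sum(times) / greedy_wall_time(times, 2)
predicted = theoretical_speedup(len(times), 2)
print(observed, predicted)
```

In the small example the observed speedup is 2.0 while Eq. (3) predicts 1.5, which mirrors the effect described above for the rows with 4, 8 and 16 processors.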
6 Conclusions
The presented parallel evolutionary algorithm and artificial immune system, working in a grid environment based on the Alchemi framework, can be successfully used in optimization problems. Complicated problems like the optimization of anvils in two-stage forging can be solved using the presented approaches. The time costs of optimization using both algorithms are similar. Theoretical speedup computations and comparisons with experiment were shown in the paper.
Acknowledgement The research is financed from the Polish science budget resources in the years 2005-2008 as the research project.
References
1. Luther, A., Buyya, R., Ranjan, R., Venugopal, S.: Peer-to-Peer Grid Computing and a .NET-based Alchemi Framework. In: Yang, L., Guo, M. (eds.) High Performance Computing: Paradigm and Infrastructure, Wiley Press, New Jersey (2005)
2. Burczyński, T., Kuś, W., Długosz, A., Orantek, P.: Optimization and defect identification using distributed evolutionary algorithms. Engineering Applications of Artificial Intelligence 17(4), 337–344 (2004)
3. Burczyński, T., Kuś, W.: Optimization of structures using distributed and parallel evolutionary algorithms. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds.) PPAM 2004. LNCS, vol. 3019, pp. 572–579. Springer, Heidelberg (2004)
4. de Castro, L.N., Von Zuben, F.J.: Learning and Optimization Using the Clonal Selection Principle. IEEE Transactions on Evolutionary Computation, Special Issue on Artificial Immune Systems 6(3), 239–251 (2002)
5. de Castro, L.N., Timmis, J.: Artificial Immune Systems as a Novel Soft Computing Paradigm. Soft Computing 7(8), 526–544 (2003)
6. Kuś, W., Burczyński, T.: Grid-based evolutionary optimization of structures. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, Springer, Heidelberg (2006)
7. Kuś, W.: Evolutionary optimization of forging anvils using grid based on Alchemi framework. In: Proc. 2nd IEEE International Conference on e-Science and Grid Computing eScience 2006, Amsterdam (2006)
8. Michalewicz, Z.: Genetic algorithms + data structures = evolutionary algorithms. Springer, Berlin (1996)
9. MSC.Marc, Users Guide (2002)
10. Watkins, A., Bi, X., Phadke, A.: Parallelizing an Immune-Inspired Algorithm for Efficient Pattern Recognition. Intelligent Engineering Systems through Artificial Neural Networks: Smart Engineering System Design 13, 225–230
11. Wierzchoń, S.T.: Artificial Immune Systems, theory and applications, EXIT, Warsaw (in Polish) (2001)
12. Zienkiewicz, O.C., Taylor, R.L.: The Finite Element Method. In: The Basis, Butterworth, Oxford, vol. 1-2 (2000)
3D Global Flow Stability Analysis on Unstructured Grids
Marek Morzyński¹ and Frank Thiele²
¹ Poznan University of Technology, Piotrowo 3, 60-965 Poznan, Poland [email protected] http://stanton.ice.put.poznan.pl/morzynski
² Institute of Fluid Dynamics and Technical Acoustics, Berlin University of Technology, Straße des 17. Juni, D-10623 Berlin, Germany [email protected] http://www.cfd.tu-berlin.de
Abstract. Three-dimensional global flow stability analysis generates a very large, complex, generalized eigenvalue problem. The solution of the global flow stability problem delivers not only information about the growth rates of disturbances and the respective frequencies. The eigenvectors of this system also constitute the space of physical modes necessary in Low Dimensional Modeling of the flow and in flow control design. Difficulties in the solution of the eigenvalue problem have so far limited global stability analysis to two dimensions, sometimes referred to as bi-global. 3D global stability solutions (tri-global) are very rare and limited to structured meshes. In the present study a solution procedure for the unstructured 3D global flow stability problem with subspace iteration and domain decomposition is presented. The incompressible Navier-Stokes equations are discretized with the Finite Element Method in penalty formulation. The solution is demonstrated on the well-documented flow around a circular cylinder. With modest RAM and CPU requirements and a decomposition of the computational domain into 8 subdomains (METIS software), a large eigenproblem having about 750 000 DOFs has been solved on 2 nodes (4 CPUs each) of a PC cluster.
Keywords: 3D Global flow stability, tri-global method, large eigenvalue problem, subspace iteration.
1 Introduction
Large scale eigenvalue problems emerge from different areas of applied physics. Traditionally, structural mechanics deals with large eigenvalue problems. An eigenproblem results from considerations of the stability of a structure as well as from vibration and normal mode analysis. Stability and vibration problems are also encountered in several other disciplines of science. Plasma physics and tokamak fusion research, crucial for providing new sources of energy, rely on the stability of rotating plasma. Multiphysics phenomena of MHD flows take place on the surface of the sun and are highly unstable; stability analysis and eigenanalysis are among the means of investigating them. Finally, fluid dynamics is mainly a way of describing instabilities, bifurcations and modal interaction in fluids, and for more than a hundred years it has generated the most difficult and the largest eigenproblems to be solved. Even the future of mass production of fast computers depends not only on progress in computer science but also on the technology of crystal growth. The instability of this process causes substantial losses in production, and it is not a coincidence that eigenproblems modeling the 3D instability are the testbed for the most effective and most intensively developed methods [1]. The eigenproblems resulting from the analysis of the physical phenomena described here are mostly large and sparse. While for small problems the application of the QR algorithm for the standard eigenvalue problem and of QZ for the generalized one is routine, large and sparse eigenvalue problems are still a matter of intensive investigation. The solution of the hermitian eigenvalue problems arising mostly in structural mechanics, with domain decomposition tools and parallel processing, is a challenge for modern computational methods and is the subject of numerous papers. The fact that the solution is limited to a few dominant (leftmost) eigenvalues makes the task only partially easier. Eigenproblems in fluid dynamics, as in early papers [2,3] or recent ones [4], are generalized, complex and non-hermitian. For this reason solutions of 3D global flow stability problems are rare [5,6,7]. An extensive survey of the subject for 2D and 3D cases can be found in [8]. Parallel algorithms employing domain decomposition are complicated enough for CFD problems. The solution becomes even more difficult in the case of an eigenvalue problem. The parallel solution of the complex, generalized eigenvalue problem resulting from global flow stability analysis, with domain decomposition and discretization of the domain on an unstructured grid, as described in this paper, has not been attempted until now.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1293–1302, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 FEM Formulation of the 3D Eigenvalue Problem
In global, linear stability analysis of a non-parallel flow the critical values of the Reynolds number are determined by analysis of a basic steady solution of the Navier–Stokes equations. The incompressible fluid motion is described by the unsteady Navier-Stokes equation in the form:
$$ \dot{V}_i + V_{i,j} V_j + P_{,i} - \frac{1}{Re} V_{i,jj} = 0 \tag{1} $$
In the following, all variables are assumed to be non-dimensionalised with the velocity $U$ and the length $D$. The flow is characterised by the Reynolds number $Re = UD/\nu$, $\nu$ being the kinematic viscosity of the fluid. We assume that the unsteady solution of the Navier-Stokes equation (1) can be expressed as the sum of its steady solution and a disturbance:
$$ V_i = \bar{V}_i + \acute{V}_i, \qquad P = \bar{P} + \acute{P} \tag{2} $$
This assumption leads to the disturbance equation in the form:
$$ \dot{\acute{V}}_i + \acute{V}_j \bar{V}_{i,j} + \bar{V}_j \acute{V}_{i,j} + \acute{V}_j \acute{V}_{i,j} + \acute{P}_{,i} - \frac{1}{Re} \acute{V}_{i,jj} = 0 \tag{3} $$
and the continuity equation for the disturbance:
$$ \acute{V}_{i,i} = 0 \tag{4} $$
The assumption of a small disturbance allows equation (3) to be linearized. The time and space dependence is separated:
$$ \acute{V}_i(x, y, z, t) = \tilde{V}_i(x, y, z)\, e^{\lambda t}, \qquad \acute{P}(x, y, z, t) = \tilde{P}(x, y, z)\, e^{\lambda t} \tag{5} $$
Introducing Eq. (5) into the linearized form of Eq. (3) results in the linear partial differential equations:
$$ \lambda \tilde{V}_i + \tilde{V}_j \bar{V}_{i,j} + \bar{V}_j \tilde{V}_{i,j} + \tilde{P}_{,i} - \frac{1}{Re} \tilde{V}_{i,jj} = 0, \qquad \tilde{V}_{i,i} = 0 \tag{6} $$
Equations (6) constitute a generalized differential eigenvalue problem. Application of the standard Galerkin FEM formulation leads to the discrete equations (7). The penalty formulation allows us to eliminate the pressure from the governing equations.
$$ \lambda \tilde{v}_{im} \int_\Omega \Phi_k \Phi_m \, d\Omega + (\tilde{v}_{ik} \bar{v}_{jo} + \bar{v}_{ik} \tilde{v}_{jo}) \int_\Omega \Phi_{k,j} \Phi_o \Phi_m \, d\Omega - \tilde{v}_{jk} \int_\Omega \Phi_{k,j} \Phi_{m,i} \, d\Omega - \int_\Gamma \Phi_m \left( -\tilde{P} \delta_{ij} + \frac{1}{Re} \tilde{V}_{i,k} \right) n_j \, d\Gamma + \frac{1}{Re} \tilde{v}_{ik} \int_\Omega \Phi_{k,j} \Phi_{m,j} \, d\Omega = 0 \tag{7} $$
We use quadratic tetrahedral elements here (Fig. 1). Eq. (7) is a generalized algebraic eigenvalue problem of the form:
$$ A x - \lambda B x = 0 \tag{8} $$
In our case the eigenvalue problem (8) is characterized by a very large dimension of the poorly conditioned, unsymmetric matrices. The formulation of the differential eigenvalue problem (6) and the discrete equations (7) is similar in the 2D and 3D cases (the 2D solution procedure is presented in [9]). The solution of the 3D problem described here is, however, a much more challenging task.
3 Flow Domain, Computational Meshes and Partitioning
In the current study we consider the eigenvalue problem resulting from the global stability of three-dimensional incompressible flow around a stationary cylinder of size $D$ placed in a uniform stream with velocity $U$. Thus, the flow can be defined in a stationary domain $\Omega$ with time-independent boundary conditions. The velocity and pressure fields depend on the location $x \in \Omega$ and the time $t$. The 3D region is described with Cartesian coordinates $x = (x, y, z)$, such that the origin is centred in the obstacle and the x-axis points in the direction of the flow. For the eigenproblem computations homogeneous boundary conditions are imposed on all domain boundaries except the outflow. At the outflow the natural, stress-free boundary condition resulting from the application of the Gauss-Green theorem to the pressure and viscous terms
$$ \int_\Gamma \Phi_m \tilde{\sigma}_{ij} n_j \, d\Gamma = 0 \tag{9} $$
assures a non-reflecting boundary. The unstructured grid in Fig. 1 is partitioned into subdomains with the METIS partitioner. In the present study eight subdomains were created. The size of a subdomain is adjusted to fit a single node of a parallel cluster.
Fig. 1. Example of computational grid (47 000 nodes) (left) and 10-node quadratic tetrahedral element used in the computations (right)
Fig. 2. Partitioning of the computational domain
4 Solution Procedure
The eigenvalue problem (7) is solved with a subspace iteration procedure. Details of the algorithm are described in [9]. The main difference of the procedure presented here is the domain decomposition and the formulation of the solver in a parallel environment. The concept of the Subspace Iteration Method is to reduce the system to a small-dimensional subspace for which the eigensolution can be found much more easily. In the first step of the method an initial set of $m$ linearly independent vectors is generated:
$$ R^{(0)} = \left[ R_1^{(0)}, R_2^{(0)}, R_3^{(0)}, \ldots, R_m^{(0)} \right] \tag{10} $$
The base Ritz matrix is calculated from:
$$ A \, \Theta^{(n)} = R^{(n-1)} \tag{11} $$
where $n$ is the number of the iteration. With a given initial set of vectors $R^{(0)}$ the first set of $\Theta$ vectors is found. To increase the numerical stability, the $\Theta^{(i)}$ vectors are normalised and orthogonalised with the Gram-Schmidt procedure at each step. The matrices for the reduced problem are calculated according to:
$$ \hat{A}^{(n)} = \Theta^{(n)T} A \, \Theta^{(n)} = \Theta^{(n)T} R^{(n-1)}, \qquad \hat{B}^{(n)} = \Theta^{(n)T} B \, \Theta^{(n)} \tag{12} $$
Multiplication of the matrices $A$, $B$ and the right-hand side by the matrix $\Theta^{(n)}$ reduces the eigenvalue problem to a smaller one. The matrices $\hat{A}^{(n)}$ and $\hat{B}^{(n)}$ have the assumed dimension $m$. Usually in flow stability problems only a few "leftmost" eigenvalues and the respective eigenvectors are relevant. The reduced generalised eigenvalue problem for the $m$-dimensional subspace can be written as:
$$ \left( \hat{A}^{(n)} - \hat{\Omega}^{(n)} \hat{B}^{(n)} \right) \hat{\varphi}^{(n)} = 0 \tag{13} $$
This eigenvalue problem can be easily solved using any existing library algorithm. $\hat{\Omega}$ is a diagonal matrix containing the eigenvalues of the reduced problem. The eigenvectors for the $n$-th iteration can be recalculated from the equation:
$$ \varphi^{(n)} = \Theta^{(n)T} \hat{\varphi}^{(n)} \tag{14} $$
If convergence is not obtained, the new set of vectors $R^{(n)}$ is calculated from
$$ R^{(n)} = B \, \varphi^{(n)} \tag{15} $$
and the iteration has to be repeated.
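A dense, single-process sketch of the loop in Eqs. (10)-(15) is given below using NumPy. It is an illustration only: a direct dense solve stands in for the domain-decomposed parallel solver, a QR factorisation replaces the Gram-Schmidt step, the basis vectors are stored as the columns of `Theta` (so the back-transformation of Eq. (14) is written `Theta @ phi_hat`), and a fixed iteration count replaces a convergence test. The function name `subspace_iteration` and the test matrices are hypothetical.

```python
import numpy as np


def subspace_iteration(A, B, m, iterations=60, seed=0):
    """Approximate the m smallest-magnitude eigenpairs of A x = lambda B x."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    R = rng.standard_normal((n, m))               # Eq. (10): initial vectors
    for _ in range(iterations):
        Theta = np.linalg.solve(A, R)             # Eq. (11): A Theta = R
        Theta, _ = np.linalg.qr(Theta)            # orthonormalise the basis
        A_hat = Theta.conj().T @ A @ Theta        # Eq. (12): reduced matrices
        B_hat = Theta.conj().T @ B @ Theta
        # Eq. (13): small m x m generalised problem, solved densely.
        omega, phi_hat = np.linalg.eig(np.linalg.solve(B_hat, A_hat))
        phi = Theta @ phi_hat                     # Eq. (14): back to full space
        R = B @ phi                               # Eq. (15): next right-hand side
    order = np.argsort(np.abs(omega))
    return omega[order], phi[:, order]


# Small symmetric test problem with known, well separated eigenvalues.
rng = np.random.default_rng(1)
S = rng.standard_normal((30, 30))
A = np.diag(np.arange(1.0, 31.0)) + 0.01 * (S + S.T)
B = np.eye(30)
omega, phi = subspace_iteration(A, B, m=6)
print(np.sort(omega.real)[:3])
```

Since every step applies the inverse of A, the iteration converges towards the eigenvalues of smallest magnitude. Selecting a different part of the spectrum requires a spectral transformation such as the one discussed in the next section.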
5 Preconditioning
In this work the shift-invert Cayley transformation with a real shift was applied. The transformation of the problem (8) with the Cayley method results in:
$$ \left( (1 - \alpha_3) A - (\alpha_1 - \alpha_2 \alpha_3) B \right) x - \mu \, (A - \alpha_2 B) \, x = 0 \tag{16} $$
The eigenvalues of the base problem (8) are recalculated with the following formula:
$$ \lambda = \frac{\alpha_2 (\mu + \alpha_3) - \alpha_1}{\mu + \alpha_3 - 1} \tag{17} $$
Details of the shift-invert Cayley transformation [10] and its influence on the solution are analysed in [9].
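The back-transformation (17) follows from (16) in a few lines. The derivation below is a worked check and only assumes that $x$ is an eigenvector of the original problem (8), i.e. $Ax = \lambda Bx$, with $Bx \neq 0$:

```latex
\begin{align*}
\bigl[(1-\alpha_3)\lambda - (\alpha_1 - \alpha_2\alpha_3)\bigr]\, B x
  &= \mu\,(\lambda - \alpha_2)\, B x
  && \text{(insert } Ax = \lambda Bx \text{ into (16))} \\
\mu &= \frac{(1-\alpha_3)\lambda - \alpha_1 + \alpha_2\alpha_3}{\lambda - \alpha_2}
     = (1-\alpha_3) + \frac{\alpha_2 - \alpha_1}{\lambda - \alpha_2} \\
\lambda\,(\mu + \alpha_3 - 1) &= \alpha_2(\mu + \alpha_3) - \alpha_1 \\
\lambda &= \frac{\alpha_2(\mu + \alpha_3) - \alpha_1}{\mu + \alpha_3 - 1}
\end{align*}
```

The second line also shows the purpose of the transformation: eigenvalues $\lambda$ close to the shift $\alpha_2$ are mapped to values of $\mu$ with large magnitude, so the shift controls which part of the spectrum is well separated in the transformed problem.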
6 Numerical Results
The first step in global non-parallel flow stability analysis is to determine the base solution, i.e. the steady flow field. Fig. 3 depicts the base flow computed for 3D flow around a circular cylinder. The upper part of the picture shows the $V_y$ component of the flow, the lower part the $V_x$ component. The left part of the picture was obtained with the 3D parallel solver developed here. In the right part of the figure the 2D solution [9] used in previous computations of global flow stability is shown for comparison.
Fig. 3. Vy (top) and Vx (bottom) components of the base flow velocity computed with 3D (left) and 2D (right) solver Re = 5
The base solution is introduced into the eigenvalue problem (6), which is solved with the procedure presented in Sect. 4. For low Reynolds numbers the Stokes modes depicted in Fig. 5 are obtained. To test the program, the sequential version was first run on a single CPU for Re = 50 and 330 000 DOFs. In the present study the ten leftmost eigenvalues were sought. The conjugate pair of eigenvectors depicted in Fig. 4 is related to the most unstable von Kármán mode and compares fairly well with 2D computations [9]. The zero-frequency mode for Re = 50, present in 2D computations, also appears in the results (Fig. 7). The complex eigenvalues obtained in the 330 000 DOFs computation are listed below as (real part, imaginary part).

No.  Eigenvalue
 1   (8.936395563007604E-002, 2.869685592454356E-003)
 2   (-4.628141508904429E-002, 0.439923603272213)
 3   (-4.724607994108846E-002, -0.439312875778736)
 4   (1.56767086752809, 0.136992062184474)
 5   (45.7435183775290, -9.36542529309317)
 6   (232.585267636963, -38.9097330550378)
 7   (644.078487175255, -43.9579494666679)
 8   (747.832818036339, -43.1939271858741)
 9   (1252.84287987261, -113.216038683970)
10   (1507.47671548085, 10.7906922535626)

Fig. 4. Real (left) and imaginary (right) part of the eigenvector. The Vy component is depicted for Re = 50. The eigenvalue problem size is 330 000 DOFs and is computed with the sequential version of the eigensolver.
Fig. 5. Stokes modes for flow around a circular cylinder. Real part of Vy component of the eigenvector for Re = 1.
Fig. 6. Real (left) and imaginary (right) part of the eigenvector. The Vy component is depicted for Re = 50. The eigenvalue problem size is 750 000 DOFs and is computed with the 8 CPU cluster.
Fig. 7. Real part of the Vy component of the zero frequency mode for Re = 50
With the procedure verified against the 2D eigensolution and the sequential, single-CPU algorithm, the parallel version was tested. The computation used 2 nodes (8 CPUs) of a 24-node Linux cluster based on Dual AMD Dual Core (4 CPU) Opteron 270, 2.0 GHz, with 8 GB PC 3200 ECC REG SDRAM DDR on each node and an InfiniBand network. The output of the "top" command for the 330 000 DOFs case is shown below. The snapshot was taken during the solution of the linear equation system, for which GMRES with incomplete LU (ILU) preconditioning is employed.

Tasks: 118 total, 6 running, 112 sleeping, 0 stopped, 0 zombie
Cpu(s): 95.3% us, 0.5% sy, 0.0% ni, 4.2% id, 0.0% wa, 0.0% hi
Mem: 8177068k total, 6860272k used, 1316796k free, 207324k buffers
Swap: 8385920k total, 9176k used, 8376744k free, 4999732k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
18068 mmor0931  23   0  452m 352m  25m R 98.4  4.4  2:10.94  eigv
18038 mmor0931  17   0  420m 329m  21m R 96.8  4.1  2:10.71  eigv
18015 mmor0931  17   0  415m 330m  25m R 94.4  4.1  2:11.02  eigv
18070 mmor0931  17   0  435m 338m  20m R 93.8  4.2  2:10.77  eigv
18002 mmor0931  15   0 46404 1852 1096 S  0.3  0.0  0:00.13  sshd
    1 root      16   0   744   80   52 S  0.0  0.0  0:12.05  init
...

Results of further computations with the 750 000 DOFs eigenproblem are depicted in Fig. 6. In both cases nearly linear speedup is observed. Slight losses of performance are attributed to the more oscillatory character of the internal iteration steps. The parameters of these iterations are the subject of ongoing investigation.
7 Conclusions
The problem of 3D global flow stability is formulated with the Finite Element Method. The unstructured grids used in this approach enable the analysis of a large group of configurations of practical importance. The benchmark problem of the stability of the flow around a circular cylinder is successfully solved and compares fairly well with 2D eigensolutions. To solve the large-dimensional, complex, generalized eigenvalue problem, an in-house method based on domain decomposition and subspace iteration with the shift-invert Cayley transformation has been developed and tested. The developed system solves eigenproblems of the order of 10^6 DOFs on an 8-10 CPU cluster and scales well with the increasing size of the problem. Further investigations target complex geometries, the application of 3D modes in flow control design and improvement of the computational efficiency of the presented algorithm.
Acknowledgments. The presented study was performed in close collaboration with Bernd R. Noack, Witek Stankiewicz, Gilead Tadmor, and the Collaborative Research Center (Sfb 557) "Control of Complex Turbulent Shear Flows" at TU Berlin. We appreciate recent stimulating discussions with Michal Nowak, Joanna Morzynska and the ISTA CFD team.
References
1. Lehoucq, R., Salinger, A.: Large-scale eigenvalue calculations for stability analysis of steady flows on massively parallel computers (1999)
2. Zebib, A.: Stability of viscous flow past a circular cylinder. J. Engr. Math. 21, 155–165 (1987)
3. Morzyński, M., Thiele, F.: Numerical stability analysis of a flow about a cylinder. Z. Angew. Math. Mech. 71, T424–T428 (1991)
4. Abdessemed, N., Sherwin, S., Theofilis, V.: Linear stability of the flow past a low pressure turbine blade. In: 36th AIAA Fluid Dynamics Conference and Exhibit (2006)
5. Tezuka, A., Suzuki, K.: Three-Dimensional Global Linear Stability Analysis of Flow Around a Spheroid (2006)
6. Burroughs, E., Romero, L., Lehoucq, R., Salinger, A.: Large scale eigenvalue calculations for computing the stability of buoyancy driven flows (2001)
7. Pawlowski, R., Salinger, A., Shadid, J., Mountziaris, T.: Bifurcation and stability analysis of laminar isothermal counterflowing jets. Journal of Fluid Mechanics 551, 117–139 (2006)
8. Theofilis, V.: Advances in global linear instability analysis of nonparallel and three-dimensional flows. Progress in Aerospace Sciences 39(4), 249–315 (2003)
9. Morzyński, M., Afanasiev, K., Thiele, F.: Solution of the eigenvalue problems resulting from global non-parallel flow stability analysis. Comput. Meth. Appl. Mech. Engrg. 169, 161–176 (1999)
10. Meerbergen, K., Spence, A., Roose, D.: Shift-invert and Cayley transforms for detection of rightmost eigenvalues of nonsymmetric matrices. BIT Numerical Mathematics 34(3), 409–423 (1994)
Performance of Multi Level Parallel Direct Solver for hp Finite Element Method

Maciej Paszyński
Department of Computer Science, AGH University of Science and Technology, Al. Mickiewicza 30, 30-059 Cracow, Poland
[email protected]
http://home.agh.edu.pl/~paszynsk
Abstract. The paper presents a theoretical evaluation and numerical measurements of the performance of a new parallel direct solver implemented for the hp Finite Element Method (FEM). The solver utilizes the substructuring method over non-overlapping sub-domains, which consists in eliminating the internal d.o.f. of each sub-domain with respect to the interface d.o.f., then solving the interface problem, and finally solving the internal problems by backward substitution on each sub-domain. The interface problem is solved by recursive execution of the direct substructuring method on the tree of separators associated with the sub-domains on which the Schur complement approach was applied. We show that the efficiency of the solver grows when the accuracy of the FEM solution is increased by performing hp refinements on the computational mesh. The h refinements consist in breaking some finite elements into smaller son elements; the p refinements consist in increasing the polynomial order of approximation on some finite element edges, faces and interiors.

Keywords: Parallel direct solvers, Substructuring method, Finite Element Method, hp adaptivity.
1 Introduction
The paper presents a theoretical evaluation of the performance of a new parallel direct solver for the hp Finite Element Method. The solver utilizes the substructuring method: the computational domain is partitioned into sub-domains, the internal degrees of freedom (d.o.f.) are eliminated with respect to the interface d.o.f. on each sub-domain (by computing the Schur complements), and then a recursive multi-level scheme is applied to solve the interface problem. Sub-domains are joined into pairs, and the common fully assembled d.o.f. are eliminated. Then sub-domains are joined into sets of four, and again the common fully assembled d.o.f. are eliminated. This process is repeated until all sub-domains are joined and the common d.o.f. that had not yet been fully assembled are eliminated. Finally, recursive backward substitution is executed on the created assembly tree.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1303–1312, 2008. © Springer-Verlag Berlin Heidelberg 2008

The solver utilizes multiple instances of the single-processor MUMPS solver [4] for computing the Schur complements on sub-domains as well as for the elimination of d.o.f. on each level of the interface problem solution. The theoretical efficiency evaluations are compared with numerical measurements performed on three computational meshes with high polynomial orders of approximation, generated by the fully automatic hp adaptive FEM codes [5]. From the evaluation it follows that the efficiency of the solver increases when the computational mesh is h refined (some finite elements are broken into smaller son elements) or p refined (polynomial orders of approximation are increased on some finite element edges, faces or interiors).
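The recursive joining of sub-domains into pairs, sets of four, and so on can be pictured as a binary assembly tree. The following is only a minimal sketch of that tree structure, assuming sub-domains are identified by consecutive integers; the function name `build_assembly_levels` is hypothetical and no elimination is performed.

```python
def build_assembly_levels(num_subdomains):
    """Return, level by level, the groups of sub-domains that are joined.

    Level 0 holds the individual sub-domains; each next level joins
    neighbouring groups into pairs until one group remains.
    (Illustrative only: the real solver also eliminates the fully
    assembled interface d.o.f. at every level.)
    """
    levels = [[(i,) for i in range(num_subdomains)]]
    while len(levels[-1]) > 1:
        previous = levels[-1]
        merged = []
        for j in range(0, len(previous), 2):
            group = previous[j]
            if j + 1 < len(previous):      # an odd group out is carried up unchanged
                group = group + previous[j + 1]
            merged.append(group)
        levels.append(merged)
    return levels


for depth, groups in enumerate(build_assembly_levels(8)):
    print(depth, groups)
```

For a power-of-two number of sub-domains P, the number of joining levels is log2 P, which is the depth of the assembly tree used in the complexity estimates.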
2 Hexahedral 3D hp Finite Element
The computational meshes presented in this paper consist of hexahedral 3D hp finite elements. The hp finite elements allow polynomials of various orders of approximation to be used on element edges, faces and interiors. The reference 3D hexahedral hp finite element is defined as a triple

\[ (K, X(K), \Pi_K) \tag{1} \]

where $K = (0, 1)^3$ is the cube-shaped geometry of the reference element, $X(K)$ is a space of shape functions, and $\Pi_K$ is the projection operator from an infinite dimensional functional space $X$ into $X(K)$. The detailed definition of the hexahedral 3D hp finite element can be found in [3]. Tensor products of 1D hierarchical shape functions are used to define 3D shape functions on the 8 vertices, 12 edges, 6 faces and in the interior of a hexahedral element. The polynomial order of approximation is equal to 1 at all vertices. The polynomial orders of approximation on the 12 edges are denoted by $p_i$, $i = 1, ..., 12$. There is one horizontal and one vertical polynomial order of approximation for each of the 6 faces. They are denoted by $(p_{ih}, p_{iv})$, $i = 13, ..., 18$. The interior of an element has 3 polynomial orders of approximation, in the x, y and z directions, denoted by $(p_x, p_y, p_z)$. We introduce the projection-based interpolant $u_p = \Pi_K u$ [2], defined in the following steps.

– Interpolation at vertices. The interpolant must match the function $u$ at the element vertices: $u_p(a) = u(a)$ for each vertex $a$. In the first step, we construct the trilinear lift $u_1$ of the vertex values.
– Projection over edges. In the second step, we project the difference $u - u_1$ onto a space of edge polynomials. Each edge projection is then lifted to the element using the corresponding edge shape function. The sum of the lifts of all edge projections is denoted by $u_2$.
– Projection over faces. In the third step, we project the difference $u - u_1 - u_2$ onto a space of face polynomials. Each face projection is then lifted to the element using the corresponding face shape function. The sum of the lifts of all face projections is denoted by $u_3$.
– Projection over interior. Finally, we project the resulting difference $u - u_1 - u_2 - u_3$ onto a space of element interior polynomials.
The introduced projections utilize suitable edge, face, and interior seminorms [2]. The mesh regularity rules, described in detail in [3], require orders of approximation on faces to be equal to the minimum of corresponding orders of two adjacent interiors. The rules also require orders of approximation on edges to be equal to the minimum of corresponding orders of adjacent faces.
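The minimum rule stated above can be illustrated with a short sketch. It assumes that interior orders are stored as tuples $(p_x, p_y, p_z)$ and that the two directions tangent to a shared face are given explicitly; the function names and the data layout are hypothetical and are not taken from the hp code described in [3].

```python
def face_order(interior_a, interior_b, tangent_axes):
    """Orders on a face shared by two interiors (minimum rule).

    interior_a, interior_b: (px, py, pz) of the two adjacent interiors.
    tangent_axes: indices of the two directions tangent to the face.
    """
    return tuple(min(interior_a[ax], interior_b[ax]) for ax in tangent_axes)


def edge_order(adjacent_face_orders):
    """Order on an edge: minimum of the corresponding orders of adjacent faces."""
    return min(adjacent_face_orders)


# Two interiors sharing a face normal to x, so the face is tangent to y and z.
print(face_order((3, 4, 2), (5, 2, 2), tangent_axes=(1, 2)))
# An edge touched by four faces whose orders along the edge are 2, 3, 4 and 2.
print(edge_order([2, 3, 4, 2]))
```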
3 Estimation of the Computational Complexity

3.1 Computational Complexity of the Sequential Solver Executed over a Single hp Finite Element
Let us first estimate the number of operations performed by the sequential frontal solver during forward elimination over one hp finite element. The order of approximation in the interior of the element is assumed to be equal to $(p_1, p_2, p_3)$. The orders of approximation on element faces and edges are assumed to be equal to the corresponding orders in the interior. From this assumption it follows that there are 2 faces with orders $(p_1, p_2)$, 2 faces with orders $(p_1, p_3)$ and 2 faces with orders $(p_2, p_3)$, as well as 4 edges with order $p_1$, 4 edges with order $p_2$ and 4 edges with order $p_3$. Such a finite element fulfills the mesh regularity rules. The total number of d.o.f. in such an element is equal to

\[ \mathrm{nrdof} = (p_1+1)(p_2+1)(p_3+1) + 2\{(p_1+1)(p_2+1) + (p_1+1)(p_3+1) + (p_2+1)(p_3+1)\} + 4\{(p_1+1) + (p_2+1) + (p_3+1)\} \tag{2} \]

where the three terms count the interior, face and edge d.o.f., respectively. To estimate the efficiency of the sequential solver, we assume that $p_1 = p_2 = p_3 = p$, e.g. by taking $p = \max\{p_1, p_2, p_3\}$. Thus, the number of d.o.f. is

\[ \mathrm{nrdof} = (p+1)^3 + 6(p+1)^2 + 12(p+1) = O(p^3) \tag{3} \]

the number of interior d.o.f. is

\[ \mathrm{interior\_nrdof} = (p+1)^3 = O(p^3) \tag{4} \]

and the number of interface d.o.f. is

\[ \mathrm{interface\_nrdof} = 6(p+1)^2 + 12(p+1) = O(p^2) \tag{5} \]

The computational complexity of the frontal solver over one element is (number of d.o.f.)$^3$, since all d.o.f. within a single element are connected. Thus, the computational complexity of the sequential frontal solver executed on a single hp element is

\[ \text{computational complexity over 1 element} = \mathrm{nrdof}^3 = \{(p+1)^3 + 6(p+1)^2 + 12(p+1)\}^3 = O(p^9). \]

3.2 Computational Complexity of the Sequential Solver Executed over a Cube of $N^3$ Identical hp Finite Elements
The computational domain is now assumed to be a cube with $N \times N \times N$ identical hp finite elements. Orders of approximation are assumed to be equal to $p$ on all element edges, on all faces in both directions, and in the interior in all directions. The frontal solver eliminates the elements slice by slice. The computational complexity of the sequential frontal solver is

\[ \text{computational complexity over } N^3 \text{ elements} = (N^2 \times \mathrm{nrdof})^3 \times N = O(N^7 p^9) \]

Here $N^2 \times \mathrm{nrdof}$ is the number of d.o.f. over one slice of the cube, and $N$ stands for the number of slices. The frontal solver is able to eliminate all d.o.f. from one slice just after visiting all elements of this slice.

3.3 Computational Complexity of the Parallel Solver Executed over a Cube of $N^3$ Identical hp Finite Elements
Let us finally consider the cubic computational mesh of $N^3$ identical hp finite elements distributed over $P$ processors. Let us also assume that the computational mesh is divided into $N$ slice-shaped sub-domains, and $P = N$. The complexity of the parallel solver has 3 contributions: complexity of parallel solver = computational complexity of the Schur complement + computational complexity of the interface problem solution + communication complexity of the interface problem solution.

– In the first step, the solver computes the Schur complement on each sub-domain. The computational complexity of this operation is equal to the cost of eliminating the interior d.o.f. with respect to the interface d.o.f. located on both sides of the slice. This operation can be roughly estimated as computational complexity of the Schur complement $= (N^2\,\mathrm{nrdof})^3 = O(N^6 p^9)$.
– Then, the solver solves the interface problem recursively, by joining interface contributions into pairs, then into sets of 4 contributions, then into sets of 8, up to the whole interface problem, each time eliminating the fully assembled d.o.f. In the case of the cubic geometry, computational complexity of the interface problem solution $= (N^2 \times \mathrm{interface\_nrdof})^3 \times \log_2 N = O(N^6 p^6 \log_2 N)$.
– The communication cost of the parallel solver can be estimated by computing the size of the Schur complement contributions exchanged during the interface problem solution iterations: communication complexity of parallel solver $= (\mathrm{interface\_nrdof} \times N^2)^2 \times \log_2 P = (p^2 N^2)^2 \times \log_2 P = p^4 N^4 \log_2 P$. The communication cost of building the global numbering of interface nodes can be neglected in comparison to the communication cost of sending the Schur complement contributions.
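The three contributions listed above, together with the sequential cost, can be evaluated numerically for given N and p. The sketch below simply transcribes the estimates with the d.o.f. counts of equations (3) and (5), without dropping lower-order terms; the function names are illustrative and the returned values are operation-count estimates, not measured times.

```python
from math import log2


def nrdof(p):
    """Total d.o.f. of one element with uniform order p, equation (3)."""
    return (p + 1) ** 3 + 6 * (p + 1) ** 2 + 12 * (p + 1)


def interface_nrdof(p):
    """Interface (face and edge) d.o.f. of one element, equation (5)."""
    return 6 * (p + 1) ** 2 + 12 * (p + 1)


def sequential_cost(N, p):
    """Frontal solver over N slices of N^2 elements."""
    return (N ** 2 * nrdof(p)) ** 3 * N


def schur_cost(N, p):
    """Schur complement over one slice-shaped sub-domain."""
    return (N ** 2 * nrdof(p)) ** 3


def interface_cost(N, p):
    """Recursive interface solve over a tree of depth log2 N."""
    return (N ** 2 * interface_nrdof(p)) ** 3 * log2(N)


def communication_cost(N, p, P):
    """Size of the Schur complement contributions exchanged over log2 P levels."""
    return (interface_nrdof(p) * N ** 2) ** 2 * log2(P)


for p in (1, 2, 4, 8):
    ratio = interface_cost(4, p) / schur_cost(4, p)
    print(f"p={p}: interface/Schur cost ratio = {ratio:.3f}")
```

In this sketch the interface cost shrinks relative to the Schur complement cost as p grows, which is the mechanism behind the efficiency estimate derived next.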
The efficiency of the parallel solver on $P$ processors is equal to

\[ \text{efficiency} = \frac{T_1}{P\,T_P} = \frac{\text{computational cost over } N^3 \text{ elements} \times t_{comp}}{P \times (\text{computational cost of parallel solver} \times t_{comp} + \text{communication cost of parallel solver} \times t_{comm})} \]

where $T_1$ is the execution time of the sequential frontal solver over the cube-shaped domain of $N^3$ finite elements, $P$ is the number of processors, $T_P$ is the execution time of the $P = N$ processor parallel solver over $N$ slices of $N^2$ finite elements, $t_{comp}$ is the execution time of a single operation and $t_{comm}$ is the time of a single data communication. Thus

\begin{align*}
\text{efficiency} &= \frac{(N^2\,\mathrm{nrdof})^3 N\, t_{comp}}{P\left(\left((N^2\,\mathrm{nrdof})^3 + (N^2\,\mathrm{interface\_nrdof})^3 \log_2 N\right) t_{comp} + (\mathrm{interface\_nrdof}\, N^2)^2 \log_2 P\; t_{comm}\right)} \\
&= \frac{N^7 p^9\, t_{comp}}{P\left((N^6 p^9 + N^6 p^6 \log_2 N)\, t_{comp} + p^4 N^4 \log_2 P\; t_{comm}\right)} \\
&= \frac{N^7 p^9\, t_{comp}}{(N^7 p^9 + N^7 p^6 \log_2 N)\, t_{comp} + p^4 N^5 \log_2 N\; t_{comm}}
\end{align*}

since the number of processors is selected to be equal to the number of slices, $P = N$. The efficiency of the parallel solver increases when the computational mesh is uniformly p refined:

\[ \lim_{p \to \infty} \text{efficiency} = 1. \tag{6} \]

In other words, under such a definition of hp finite elements, when the computational domain is partitioned into sub-domains on the level of faces of hp finite elements, and the most overloaded interior nodes are hidden inside hp finite elements, the efficiency of the solver grows as p increases.

3.4 Efficiency Estimations for h Refinements Mixed with p Refinements
The p refinements performed together with h refinements also result in an increase of the efficiency. This is because the interiors of h refined elements still have one order of magnitude more d.o.f. than their faces, as shown in the remainder of this subsection. Let us estimate the number of internal and interface d.o.f. over multiple h refined elements. We assume that elements are h refined to resolve single singularities located in some vertices of the initial mesh. We also assume that the polynomial orders of approximation of the son elements created during h refinement are inherited from the father element and, for simplicity, that the order of approximation of the father element is equal to $p$ in all directions. The first h refinement breaks an element into 8 smaller son elements. The total number of internal d.o.f. is equal to $8(p+1)^3$ for the 8 new interiors, $12(p+1)^2$ for the 12 new faces created in the interior of the father element, and $6(p+1)$ for the 6 new edges created in the interior of the father element. The total number of interface d.o.f. is equal to $24(p+1)^2$ for the new faces located on the exterior of the element, $24(p+1)$ for the new edges created on the faces of the father element, and $24(p+1)$ for the new edges created by breaking the 12 edges of the father element. Every next h refinement in the direction of the vertex singularity adds 8 new interior nodes in place of 1 father's interior node, 12 new interior faces, 6 new interior edges, as well as 12 new interface faces in place of 3 father's interface faces and 18 new interface edges in place of 3 father's interface edges. After h refinements performed k times in the direction of the vertex singularity, illustrated in Fig. 1, the total number of internal d.o.f. consists of $(7k+1) \times (p+1)^3$ interior d.o.f., $12k \times (p+1)^2$ internal face d.o.f. and $6k \times (p+1)$ internal edge d.o.f. The total number of interface d.o.f. consists of $(9k+15) \times (p+1)^2$ interface face d.o.f. and $(15k+9) \times (p+1)$ interface edge d.o.f.

Fig. 1. Three h refinements of a finite element in direction of vertex singularity

Thus, for a k times h refined element, the number of interior d.o.f. is

\[ \mathrm{interior\_nrdof} = (7k+1)(p+1)^3 + 12k(p+1)^2 + 6k(p+1) = O(kp^3) \tag{7} \]

the number of interface d.o.f. is

\[ \mathrm{interface\_nrdof} = (9k+15)(p+1)^2 + (15k+9)(p+1) = O(kp^2) \tag{8} \]

and the total number of d.o.f. is

\[ \mathrm{nrdof} = O(kp^3) \tag{9} \]
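As a quick numerical illustration of estimates (7) and (8), the following sketch evaluates both counts and their ratio for a few values of k and p. The function names are hypothetical; the formulas are transcribed directly from (7) and (8).

```python
def refined_interior_nrdof(k, p):
    """Internal d.o.f. of an element h refined k times towards a vertex, eq. (7)."""
    return (7 * k + 1) * (p + 1) ** 3 + 12 * k * (p + 1) ** 2 + 6 * k * (p + 1)


def refined_interface_nrdof(k, p):
    """Interface d.o.f. of the same k times refined element, eq. (8)."""
    return (9 * k + 15) * (p + 1) ** 2 + (15 * k + 9) * (p + 1)


for k in (1, 3):
    for p in (1, 2, 4, 8):
        inner = refined_interior_nrdof(k, p)
        outer = refined_interface_nrdof(k, p)
        print(f"k={k} p={p}: internal={inner} interface={outer} ratio={inner / outer:.2f}")
```

For fixed k the ratio of internal to interface d.o.f. grows roughly linearly with p, which is the "one order of magnitude" argument used in this subsection.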
The efficiency [1] of the parallel solver over a k times h refined mesh is

\begin{align*}
\text{efficiency} &= \frac{(\text{computational cost over } N^3 \text{ elements}) \times t_{comp}}{P \times \{(\text{computational cost of parallel solver}) \times t_{comp} + (\text{communication cost of parallel solver}) \times t_{comm}\}} \\
&= \frac{(N^2\,\mathrm{nrdof})^3 N\, t_{comp}}{P\left(\left((N^2\,\mathrm{nrdof})^3 + (N^2\,\mathrm{interface\_nrdof})^3 \log_2 N\right) t_{comp} + (\mathrm{interface\_nrdof}\, N^2)^2 \log_2 P\; t_{comm}\right)} \\
&= \frac{k^3 N^7 p^9\, t_{comp}}{P\left((k^3 N^6 p^9 + N^6 k^3 p^6 \log_2 N)\, t_{comp} + k^2 p^4 N^4 \log_2 P\; t_{comm}\right)} \\
&= \frac{k^3 P^7 p^9\, t_{comp}}{(k^3 P^7 p^9 + P^7 k^3 p^6 \log_2 P)\, t_{comp} + k^2 p^4 P^5 \log_2 P\; t_{comm}}.
\end{align*}

Thus

\[ \lim_{k \to \infty} \text{efficiency} = \frac{1}{1 + \frac{\log_2 P}{p^3}}, \qquad \lim_{p \to \infty} \text{efficiency} = 1, \qquad \lim_{k,p \to \infty} \text{efficiency} = 1. \]

A sequence of p refinements on a k times h refined mesh increases the efficiency of the parallel solver. The efficiency of the solver on a sequence of k times h refined meshes depends on the ratio $\log_2 P / p^3$. A sequence of h refinements mixed with a sequence of p refinements increases the efficiency of the parallel solver. In practical computations, the h refinements are mixed with p refinements.
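The limits stated above can be checked numerically. The sketch below evaluates the last form of the efficiency expression for a k times h refined mesh and compares it with the limit 1/(1 + log2 P / p^3); the default values of `t_comp` and `t_comm` (both 1.0) are assumptions made for illustration only.

```python
from math import log2


def efficiency(k, p, P, t_comp=1.0, t_comm=1.0):
    """Estimated efficiency on a k times h refined mesh with P = N processors."""
    sequential = k ** 3 * P ** 7 * p ** 9 * t_comp
    parallel = (
        (k ** 3 * P ** 7 * p ** 9 + P ** 7 * k ** 3 * p ** 6 * log2(P)) * t_comp
        + k ** 2 * p ** 4 * P ** 5 * log2(P) * t_comm
    )
    return sequential / parallel


def limit_for_large_k(p, P):
    """Limit of the efficiency as k tends to infinity."""
    return 1.0 / (1.0 + log2(P) / p ** 3)


for k in (1, 10, 1000):
    print(k, efficiency(k, p=3, P=8), limit_for_large_k(3, 8))
```

With p fixed, increasing k only removes the communication term, so the efficiency approaches the limit from below; increasing p drives both the estimate and the limit towards 1.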
3.5 Relation between the Efficiency and Number of Processors
In this section the efficiency as a function of number of processors and order of approximation, assumed to be uniform over the entire mesh, will be derived.
Fig. 2. The efficiency as a function of number of processors and order of approximation, $E(P, p) = \dfrac{1}{1 + \frac{1}{p^3}\log_2 P + \frac{1}{n^2 p^5}\log_2 P\,\frac{t_{comm}}{t_{comp}}}$, for $n = 16$ (256 finite elements on each slice-shaped sub-domain) and $\frac{t_{comm}}{t_{comp}} = 0.00001$
We assume that the cube-shaped domain with $n^2 \times P$ hp finite elements is partitioned into $P$ slices of $n^2$ elements. Each slice is then assigned to a single processor, so the total number of processors is $P$. We also assume that the polynomial order of approximation is uniformly set to $p$ over all edges, over all faces in both directions, and in all interiors in three directions. Thus, the computational complexity of the sequential solver is given by

\[ \text{computational complexity over } n^2 \times P \text{ elements} = (n^2\,\mathrm{nrdof})^3 P = O(n^6 p^9 P). \]

The computational complexity of computing the Schur complement by the parallel solver over an interior sub-domain is

\[ \text{computational complexity of the Schur complement over sub-domain} = (n^2\,\mathrm{nrdof})^3 = O(n^6 p^9). \]

The computational complexity of the interface solve performed by the parallel solver is

\[ \text{computational complexity of the interface solution} = (n^2\,\mathrm{interface\_nrdof})^3 \log_2 P = (n^2 p^2)^3 \log_2 P = O(n^6 p^6 \log_2 P) \]

since the number of degrees of freedom on a single interface slice is $n^2\,\mathrm{interface\_nrdof}$, and the depth of the assembly tree is $\log_2 P$. The communication complexity of the interface problem solution is

\[ \text{communication complexity of the interface solution} = (n^2\,\mathrm{interface\_nrdof})^2 \log_2 P = O(n^4 p^4 \log_2 P) \]

since there is a need to send the interface problem submatrix related to one slice to the other slice of a pair. The matrix size is $(n^2 p^2)^2$ since the number of elements is $n^2$ and the number of degrees of freedom on each face is $p^2$.

Fig. 3. Computational meshes generated by self-adaptive hp FEM code

The efficiency is then given by

\begin{align*}
\text{efficiency} &= \frac{(\text{computational cost over } n^2 P \text{ elements}) \times t_{comp}}{P \times \{(\text{computational cost of the Schur complement over sub-domain} + \text{computational cost of the interface solution}) \times t_{comp} + \text{communication cost of the interface solution} \times t_{comm}\}} \\
&= \frac{n^6 p^9 P\, t_{comp}}{P\left(n^6 p^9\, t_{comp} + n^6 p^6 \log_2 P\; t_{comp} + n^4 p^4 \log_2 P\; t_{comm}\right)}.
\end{align*}

Thus

\[ \text{efficiency} = \frac{1}{1 + \frac{1}{p^3}\log_2 P + \frac{1}{n^2 p^5}\log_2 P\,\frac{t_{comm}}{t_{comp}}}. \]

We conclude that

\[ \lim_{P \to \infty} \text{efficiency} = \frac{1}{1 + \frac{1}{p^3}\log_2 P}, \qquad \lim_{P,p \to \infty} \text{efficiency} = 1. \]

This result does not depend on $n^2$, the number of elements over each sub-domain. In other words, performing global hp refinements and assigning each slice to a single processor will increase the efficiency of the solver. This is also illustrated in Fig. 2.
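The closed-form efficiency E(P, p) derived in this subsection is easy to tabulate. The sketch below uses the parameter values quoted in the caption of Fig. 2 (n = 16 and a communication-to-computation time ratio of 0.00001) as defaults; the function and parameter names are illustrative.

```python
from math import log2


def efficiency(P, p, n=16, comm_to_comp=1e-5):
    """E(P, p) for P slices of n^2 elements with uniform order p."""
    schur_term = log2(P) / p ** 3
    comm_term = log2(P) * comm_to_comp / (n ** 2 * p ** 5)
    return 1.0 / (1.0 + schur_term + comm_term)


# Rows: polynomial order p, columns: number of processors P.
processors = (2, 4, 8, 16, 32)
print("p \\ P " + " ".join(f"{P:>6d}" for P in processors))
for p in (1, 2, 3, 4, 8):
    print(f"{p:>5d} " + " ".join(f"{efficiency(P, p):6.3f}" for P in processors))
```

For these parameter values the communication term is negligible, so the table is governed almost entirely by the ratio log2 P / p^3.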
4 Efficiency of the Parallel Solver Measured on hp Finite Element Meshes

In this section we report efficiency measurements of the parallel solver on a sequence of hp finite element meshes, generated by the parallel 3D fully automatic hp-adaptive FEM code [5] utilized to solve the 3D DC borehole resistivity measurements simulations modeled by the Poisson equation [6].

4.1 Efficiency Measurements
The detailed efficiency measurements of the parallel solver were performed on the three hp refined meshes presented in Fig. 3, generated for the 3D resistivity logging problem [6]. The first mesh was a cylindrical mesh with about 2500 finite elements and about 20 000 d.o.f., with a globally uniform order of approximation p = 2. The second mesh was a cylindrical mesh with 25 000 finite elements and about 700 000 d.o.f., with globally uniform p = 3. The third mesh was a cylindrical mesh with 250 000 d.o.f., with finite elements of various sizes and with polynomial orders of approximation varying from p = 1 to p = 8 on some edges, faces and interiors. The efficiency measurements for the meshes are reported in Fig. 4.

Fig. 4. Left picture: Efficiency of the parallel solver on the 20 000 d.o.f. uniform p = 2 mesh. Middle picture: Efficiency of the parallel solver on the 700 000 d.o.f. uniform p = 3 mesh. Right picture: Efficiency of the parallel solver on the 250 000 d.o.f. hp refined mesh.

According to the theoretical analysis presented in the previous section, the efficiency of the solver increases when we consider more h and p refined meshes. The second mesh was created by global h and p refinement of the first mesh, whilst the third mesh was created by a sequence of many h and p refinements performed on the first mesh. The efficiency of the solver on the first mesh matches the efficiency predicted by the theoretical study presented in the sub-section Relation between the Efficiency and Number of Processors. The solver attains 50 % efficiency on the second mesh. The measured efficiency does not match the predicted efficiency for 2 and 4 processors, because each sub-domain contains many slices of the computational mesh, whereas the derived formula for the efficiency assumes that each processor is assigned a single slice. The efficiency for 2 and 4 processors is smaller than in Fig. 2. This is related to d.o.f. ordering issues. There are many more internal d.o.f. over a sub-domain with many mesh slices. The solver computing the Schur complement with respect to the interface nodes cannot modify the ordering of the interface d.o.f., whilst the solver performing forward elimination of the entire matrix can adjust the ordering of all d.o.f. In other words, when the sub-domain is large, the Schur complement computations take more time than just forward elimination over the sub-domain.
On the third mesh, the solver attains 60 % efficiency; however, the maximum efficiency is attained for 8 processors, and the efficiency goes down to 30 % for 16 processors. The solver loses efficiency for a large number of sub-domains for the following reasons. The third mesh is not uniformly refined; thus, it is not possible to compare the measurements with the predicted efficiency. Some sub-domains have an interface with high p, and the solver has less freedom to produce an optimal ordering of the internal nodes to be eliminated, so the front size is large.
5 Conclusions
– The efficiency of the new parallel direct solver grows with the polynomial order of approximation over the computational meshes.
– A sequence of hp refinements performed on the mesh also increases the efficiency of the solver.
– The solver can be utilized as an efficient solver for the fully automatic hp adaptive FEM, which generates a sequence of hp refined meshes delivering exponential convergence of the numerical error with respect to the number of d.o.f. [5], since the consecutive meshes are obtained by a sequence of h and p refinements performed on the initial mesh.

Acknowledgments. The work reported in this paper was supported by the Foundation for Polish Science under the program Homing and by the Polish Ministry of Science and Higher Education grant no. 3 T08B 055 29.
References
1. Foster, I.: Designing and Building Parallel Programs, http://www-unix.mcs.aml.gov/dbpp
2. Demkowicz, L.: Projection based interpolation, ICES Report 04-03 (2004)
3. Demkowicz, L., Pardo, D., Rachowicz, W.: 3D hp-Adaptive Finite Element Package (3Dhp90) Version 2.0, The Ultimate Data Structure for Three-Dimensional, Anisotropic hp Refinements, TICAM Report 02-24 (2002)
4. A MUltifrontal Massively Parallel sparse direct Solver, http://www.enseeiht.fr/lima/apo/MUMPS/
5. Paszyński, M., Demkowicz, L.: Parallel Fully Automatic hp-Adaptive 3D Finite Element Package. Engineering with Computers 22(3-4), 255–276 (2006)
6. Paszyński, M., Pardo, D., Torres-Verdin, C., Demkowicz, L.: Fast Numerical Simulations of 3D DC Borehole Resistivity Measurements with a Parallel Self-Adaptive Goal-Oriented Finite Element Formulation, Sixth Annual Report of Joint Industry Research Consortium on Formation Evaluation, The University of Texas at Austin, August 16-18 (2006)
Graph Transformations for Modeling Parallel hp-Adaptive Finite Element Method

Maciej Paszyński¹ and Anna Paszyńska²
1 Department of Computer Science, AGH University of Science and Technology, Al. Mickiewicza 30, 30-059 Cracow, Poland
2 International School of Business and Technology, Bielsko-Biala, Poland
[email protected]
http://home.agh.edu.pl/~paszynsk
Abstract. The paper presents a composition graph (CP-graph) grammar, which consists of a set of CP-graph transformations, suitable for modeling all aspects of parallel hp adaptive Finite Element Method (FEM) computations. The parallel hp adaptive FEM makes it possible to utilize distributed computational meshes, with finite elements of various sizes (thus h stands for element diameter) and polynomial orders of approximation varying locally on finite element edges and interiors (thus p stands for polynomial order of approximation). The computational mesh is represented by an attributed CP-graph. The proposed graph transformations model the initial mesh generation, the procedure of h refinement (breaking selected finite elements into son elements), and p refinement (adjusting polynomial orders of approximation on selected element edges and interiors), as well as the partitioning of the computational mesh into sub-domains and the enforcement of mesh regularity rules over the distributed data structure.

Keywords: Parallel computations, Finite Element Method, hp adaptivity, CP-graph grammar, graph transformations.
1 Introduction
The paper presents the application of the CP-graph grammar (Composite Programmable graph grammar) defined in [1], [2], [3] for modeling all aspects of parallel hp adaptive computations. The graph grammar consists of a set of graph transformations, called "productions". Each production replaces a sub-graph defined on its left-hand side with a new sub-graph defined on its right-hand side. The left-hand-side and right-hand-side sub-graphs have the same number of free bounds (bounds connected to external vertices). Thus, the embedding transformation is coded in the production by assuming the same number of free bounds on both sides of the production. The execution of graph transformations is controlled by diagrams prescribing the required order of productions.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1313–1322, 2008. © Springer-Verlag Berlin Heidelberg 2008

The 2D and 3D parallel hp adaptive Finite Element Method (FEM) codes were developed [6], [7], based on the sequential implementations [4], [5]. The codes generate a sequence of optimal hp meshes, with finite elements of various sizes and polynomial orders of approximation varying locally on finite element edges, faces and interiors. The generated sequence of meshes delivers exponential convergence of the numerical error with respect to the mesh size (number of degrees of freedom or CPU time). Each hp mesh is obtained by performing a sequence of h and p refinements on the predefined initial mesh. The h refinement consists in breaking a selected finite element into son elements, and the p refinement consists in increasing the polynomial order of approximation on selected edges, faces and/or interiors. The process of generating an hp mesh is formalized by utilizing the CP-graph grammar. This makes it possible to express sequential and parallel algorithms by means of graph transformations managed by control diagrams. This involves the generation of the initial mesh, the domain decomposition process, as well as sequential and parallel h and p refinements. The formalization of the process of mesh transformation makes it possible to code the mesh regularity rules on the level of the graph grammar syntax, which simplifies the algorithms and prevents the computational mesh from becoming inconsistent. The formalism can be used for further analysis of the complexity and properties of the proposed algorithms.
2 Graph Transformations for Modeling Sequential Mesh Operations
In this section, the subset of graph transformations modeling the generation of an arbitrary hp refined mesh, based on an initial mesh with a horizontal sequence of elements, is presented. The presented graph transformations can be generalized to the case of an arbitrary two dimensional initial mesh. The process of the initial mesh generation is expressed by the graph transformations presented in Fig. 1. Once the sequence of initial mesh elements is generated, the structure of each element is produced, and the common edges of adjacent elements are identified, as presented in Fig. 2. Each 2D finite element consists of four vertices, four edges and the interior, represented as graph vertices. The h refinement is expressed by breaking element edges and interiors. To break an element interior means to generate one new son vertex, four new edges and four new interiors, and to connect them to the broken interior. To break an element edge means to generate one new vertex and two new edges, and to connect them to the broken edge. These procedures are expressed by the (Pbreak int) and (Pbreak edge) productions in Fig. 3. The newly created finite elements are never stored in the data structure. They are dynamically localized at the bottom level of the generated refinement trees.

Fig. 1. Left picture: Subset of graph transformation allowing for generation of an exemplary initial mesh with a horizontal sequence of elements. Right picture: Graph transformation generating the structure of initial mesh.

Fig. 2. Graph transformation for identification of common edges (the same productions are defined for other directions)

Fig. 3. Graph transformations breaking element interior (Pbreak int), breaking element edge (Pbreak edge) and allowing for breaking an edge that is either adjacent to two broken interiors or to one broken interior and boundary (PFE1-4)

The following mesh regularity rules are enforced during the process of mesh transformation, see [4]. An element edge can be broken only if the two adjacent interiors have already been broken, or the edge is adjacent to the boundary. This is expressed by the (PFE1-4) productions in Fig. 3. An element interior can be broken only if all adjacent edges are of the same size as the interior. This is expressed by the (PJI) production in Fig. 4. The goal of the mesh regularity rules is to avoid multiply constrained edges, which lead to problems with approximation over such an edge. The mesh regularity rules enforce breaking of large adjacent unbroken elements before breaking a small element for the second time, which is illustrated in Fig. 5. The mesh regularity rules are enforced on the level of the graph grammar syntax. An example of the sequence of graph transformations corresponding to the refinements in Fig. 5 is presented in Fig. 6. Additional productions setting adjacency relations between edges and interiors, necessary for unblocking element interiors surrounded by broken edges, are presented in Fig. 7.

Fig. 4. Graph transformation allowing for breaking an element interior (PJI)
Fig. 5. First picture: One small element is set to be broken. Second picture: Mesh regularity rules enforce breaking of interior of large adjacent element before breaking required small element. Third picture: Then, the edge located between two broken interiors is broken. Fourth picture: Finally, required refinement is performed.
The p refinements consist in setting polynomial orders of approximation on element edges and interiors. This is expressed simply by attributing graph vertices with the required order (single order for a face, two orders, in the horizontal and vertical directions, for an edge).
3 Graph Transformations and Control Diagrams for Modeling Parallel Mesh Operations
The parallel hp adaptive computations [6], [7] are based on the domain decomposition paradigm. The data structure, represented here by the CP-graph, is partitioned into multiple sub-domains, see Fig. 8. Each time a sequence of h and/or p refinements is performed on the data structure, a new partition is required to maintain uniform load balancing between the sub-domains. The decisions about optimal mesh partitions are based on the computational cost estimate assigned to each initial mesh element. It is defined as $\sum (p_x + 1)^3 (p_y + 1)^3$ for the 2D case and $\sum (p_x + 1)^3 (p_y + 1)^3 (p_z + 1)^3$ for the 3D case. Here $(p_x, p_y, p_z)$ stand for the polynomial orders of approximation in the x, y and z (in the 3D case) directions, respectively, in the interior of a finite element. The sum is taken over all active finite elements generated within the initial mesh element being estimated. The load balancing on the level of the initial mesh elements is enough for
efficient parallelization of mesh transformation algorithms, but for integration and solver components, hierarchical parallelization can be applied, with multiple threads assigned to each sub-domain [8]. The process of domain decomposition can be modeled by two sets of graph transformations, presented in Figures 9 and 10. The first set recursively splits the CP-graph into two sub-graphs, on the level of edges of initial mesh elements. The second set recursively joins two adjacent sub-graphs into a single one. The h and p refinements performed over the distributed mesh require the mesh regularity rules to be fulfilled over the entire domain. Thus, the following algorithms for parallel h and p refinements were implemented in the parallel codes [6], [7]. The concept of ghost elements and virtual refinements is proposed [5] to reduce the complexity of the algorithms. Before executing h or p refinements, both algorithms are preceded by the exchange of the ghost elements. The ghost elements are local copies of neighboring elements from adjacent sub-domains.
The algorithm for parallel h refinements. The algorithm is also illustrated in Fig. 11.
1. Make decisions about required optimal h refinements.
2. Perform virtual h refinements (attributing of graph vertices).
3. do
(a) Send refinement flags from physical to corresponding ghost elements.
(b) Propagate virtual h refinement flags (attributing of graph vertices)
until virtual h refinement flags of physical elements are not changed (conflicts are resolved by performing logical OR on virtual refinement flags).
4. Global synchronization (mpi barrier).
5. Perform h refinements on physical elements and ghost elements (sequence of graph transformations).
The procedure must be repeated, since for 2D or 3D meshes, the h refinement flags may propagate through several sub-domains (sub-graphs).
The algorithm for parallel p refinements. The decisions about required optimal p refinements are made for element interiors. Then, the minimum rule is enforced, setting the order of approximation on element edges as the minimum of the orders of adjacent interiors, see Fig. 12. Thus, the parallel execution of p refinements requires exchanging polynomial orders of approximation with adjacent sub-domains before performing the minimum rule.
1. Make decisions about required optimal p refinements of element interiors.
2. Perform p refinements on physical elements (attributing of graph vertices).
3. Send p refinement attributes from physical to corresponding ghost elements (message passing followed by attributing of graph vertices).
4. Global synchronization (mpi barrier).
5. Enforcement of the minimum rule (modeled by graph transformations).
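The minimum rule used in the last step can be illustrated by a minimal Python sketch. It is not the graph-grammar implementation of the paper: it assumes, for simplicity, a single scalar order per interior, and the names `enforce_minimum_rule`, `interior_order` and `edge_to_interiors` are hypothetical. In a parallel run, the orders of ghost interiors would have to be received from adjacent sub-domains before this function is called.

```python
def enforce_minimum_rule(interior_order, edge_to_interiors):
    """Set the order of each edge to the minimum order of its adjacent interiors.

    interior_order:    dict mapping interior id -> polynomial order (assumed scalar)
    edge_to_interiors: dict mapping edge id -> list of adjacent interior ids
    """
    return {
        edge: min(interior_order[i] for i in interiors)
        for edge, interiors in edge_to_interiors.items()
    }


# Two elements sharing one edge; all identifiers are illustrative only.
interior_order = {"I1": 3, "I2": 5}
edge_to_interiors = {
    "e_left": ["I1"],          # boundary edge of the first element
    "e_shared": ["I1", "I2"],  # common edge of both elements
    "e_right": ["I2"],         # boundary edge of the second element
}
edge_order = enforce_minimum_rule(interior_order, edge_to_interiors)
```

The shared edge inherits the lower of the two interior orders, while boundary edges simply take the order of their single adjacent interior.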
Fig. 6. Graph representation of computational mesh related to Fig. 5 obtained by the following sequence of productions: (PI1)-(PI2)-(PI4)-(PII)^2-(PIC)-(Pbreak int)^2-(PFE1)-(Pbreak edge)^7-(Peast)^4-(Pwest)^4-(Pnorth)^4-(PJI)^8-(Pbreak int)-(Pbreak edge). Here, only single father-son links are denoted.
[Figure 7: productions (Peast) and (Pnorth); the same productions are defined for e replaced by e2 or e3.]
Fig. 7. Graph transformation (Peast) setting adjacency relation between interiors and edges located in the “east” direction. Corresponding (Pwest) production is defined with attributes NW and SW. The J vertices are translated into J3. Graph transformation (Pnorth) setting adjacency relation between interiors and edges located in the “north” direction. Corresponding (Psouth) production is defined with edge attributes SE and SW. The J3 vertices are translated into J4.
Fig. 8. Exemplary division of computational mesh
Fig. 9. Graph transformation for splitting a single domain into two sub-domains
Fig. 10. Graph transformation for joining two sub-domains into a single domain
The algorithms can be modeled in our formalism by control diagrams enforcing proper order of execution of graph transformations. The control diagram responsible for exchanging of ghost elements is presented in Fig. 13. The diagrams responsible for execution of h and p refinements are presented in Fig. 14.
Fig. 11. Parallel h refinements: First picture: A local decision about h refinement of a single element is made. Second picture: Ghost elements are exchanged. Third picture: The h refinement flags are sent from physical to ghost elements. Fourth picture: Propagation of h refinement flags. Fifth picture: Breaking of large element adjacent to small element requested to be broken. Bottom picture: Finally, required h refinement can be performed.
Fig. 12. Graph transformations responsible for attributing graph vertices by polynomial orders of approximation (Patt), enforcing the minimum rule on element edges, in a case of an edge adjacent to two interiors (Pmin edge) and three interiors (Pmin interior)
[Figures 13 and 14: control diagrams running from Start to Stop through productions DD1 to DD8, with steps labelled "receive sub-graph", "attribute as ghost elements", "make a copy of extracted sub-graph" and "send to adjacent sub-domains".]
Fig. 13. Scheme of the control diagram for ghost elements exchanging
Fig. 14. Left picture: Scheme of the control diagram for h refinements. Right picture: Scheme of the control diagram for p refinements.
4 Conclusions and Future Work
The CP-graph grammar is a tool for the formal description of sequential and parallel hp adaptive FEM. It models all aspects of the adaptive computations, including mesh generation, domain decomposition, h and p refinements, as well as mesh regularity rules, including elimination of multiple constrained nodes and the enforcement of the minimum rule. The mesh transformations are controlled by diagrams defining the proper order of execution of graph transformations. The graph grammar can easily be extended to support anisotropic mesh refinements and three-dimensional computations.
Acknowledgments. The work reported in this paper was supported by Polish Ministry of Science and Higher Education grant no. 3 TO 8B 055 29 and by The Foundation for Polish Science under Homing Programme.
References
1. Grabska, E.: Theoretical Concepts of Graphical Modeling. Part One: Realization of CP-Graphs. Machine Graphics and Vision 2(1), 3–38 (1993)
2. Grabska, E.: Theoretical Concepts of Graphical Modeling. Part Two: CP-Graph Grammars and Languages. Machine Graphics and Vision 2(2), 149–178 (1993)
3. Grabska, E., Hliniak, G.: Structural Aspects of CP-Graph Languages. Schedae Informaticae 5, 81–100 (1993)
4. Demkowicz, L.: Computing with hp-Adaptive Finite Elements, vol. I. Chapman & Hall/CRC Applied Mathematics & Nonlinear Science (2006)
5. Demkowicz, L., Kurtz, J., Pardo, D., Paszyński, M., Rachowicz, W., Zdunek, A.: Computing with hp-Adaptive Finite Elements, vol. II. Chapman & Hall/CRC Applied Mathematics & Nonlinear Science (in press, 2007)
6. Paszyński, M., Kurtz, J., Demkowicz, L.: Parallel Fully Automatic hp-Adaptive 2D Finite Element Package. Computer Methods in Applied Mechanics and Engineering 195(7-8,25), 711–741 (2006)
7. Paszyński, M., Demkowicz, L.: Parallel Fully Automatic hp-Adaptive 3D Finite Element Package. Engineering with Computers 22(3-4), 255–276 (2006)
8. Paszyński, Agent based hierarchical parallelization of complex algorithms on the example of hp adaptive Finite Element Method. LNCS (in press, 2007)
Acceleration of Preconditioned Krylov Solvers for Bubbly Flow Problems
Jok Man Tang and Kees Vuik
Delft University of Technology, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft Institute of Applied Mathematics, Mekelweg 4, 2628 CD Delft, The Netherlands
j.m.tang, [email protected]
Abstract. We consider the linear system arising from discretization of the pressure Poisson equation with Neumann boundary conditions, derived from bubbly flow problems. In the literature, preconditioned Krylov iterative solvers are proposed, but they often suffer from slow convergence for relatively large and complex problems. We extend these traditional solvers with the so-called deflation technique, which accelerates the convergence substantially and has favorable parallel properties. Several numerical aspects are considered, such as the singularity of the coefficient matrix and the varying density field at each time step. We demonstrate theoretically that the resulting deflation method accelerates the convergence of the iterative process. Thereafter, this is also demonstrated numerically for 3-D bubbly flow applications, both with respect to the number of iterations and the computing time.
Keywords: deflation, conjugate gradient method, preconditioning, symmetric positive semi-definite matrices, bubbly flow problems.
1 Introduction
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1323–1332, 2008. © Springer-Verlag Berlin Heidelberg 2008
Recently, moving boundary problems have received much attention in the literature, due to their relevance in many physical processes. One of the most popular moving boundary problems is the modelling of bubbly flows, see e.g. [12]. These bubbly flows can be simulated by solving the well-known Navier-Stokes equations for incompressible flow:
$$\begin{cases} \dfrac{\partial u}{\partial t} + u \cdot \nabla u + \dfrac{1}{\rho} \nabla p = \dfrac{1}{\rho} \nabla \cdot \mu \left( \nabla u + \nabla u^T \right) + g; \\ \nabla \cdot u = 0, \end{cases} \tag{1}$$
where g represents the gravity and surface tension force, and ρ, p, μ are the density, pressure and viscosity, respectively. Eqs. (1) can be solved using, for instance, the pressure correction method [7]. The most time-consuming part of this method is solving the symmetric and positive semi-definite (SPSD) linear system on each time step, which comes from a second-order finite-difference
discretization of the Poisson equation with possibly discontinuous coefficients and Neumann boundary conditions:
$$\begin{cases} \nabla \cdot \left( \dfrac{1}{\rho} \nabla p \right) = f_1, & x \in \Omega, \\ \dfrac{\partial}{\partial n} p = f_2, & x \in \partial\Omega, \end{cases} \tag{2}$$
where x and n denote the spatial coordinates and the unit normal vector to the boundary $\partial\Omega$, respectively. In the 3-D case, domain Ω is chosen to be a unit cube. Furthermore, we consider two-phase bubbly flows, so that ρ is piecewise constant with a relatively high contrast:
$$\rho = \begin{cases} \rho_0 = 1, & x \in \Lambda_0, \\ \rho_1 = 10^{-3}, & x \in \Lambda_1, \end{cases} \tag{3}$$
where $\Lambda_0$ is water, the main fluid of the flow around the air bubbles, and $\Lambda_1$ is the region inside the bubbles. The resulting linear system which has to be solved is
$$Ax = b, \qquad A \in \mathbb{R}^{n \times n}, \tag{4}$$
where the singular coefficient matrix A is SPSD and $b \in \operatorname{range}(A)$. In practice, the preconditioned Conjugate Gradient (CG) method [4] is widely used to solve (4), see also References [1, 2, 3, 5]. In this paper, we will restrict ourselves to the Incomplete Cholesky (IC) decomposition [8] as preconditioner, and the resulting method will be denoted as ICCG. In this method,
$$M^{-1} A x = M^{-1} b, \qquad M \text{ is the IC preconditioner},$$
is solved using CG. ICCG shows good performance for relatively small and easy problems. For complex bubbly flows or for problems with large jumps in the density, this method shows slow convergence, due to the presence of small eigenvalues in the spectrum of $M^{-1}A$, see also [13]. This phenomenon also holds if we use other preconditioners instead of the IC preconditioner. To remedy the bad convergence of ICCG, the deflation technique has been proposed, originally by Nicolaides [11]. The idea of deflation is to project the extremely small eigenvalues of $M^{-1}A$ to zero. This leads to faster convergence of the iterative process, because CG can handle matrices with zero eigenvalues [6] and the effective condition number becomes more favorable. The resulting method is called Deflated ICCG, or DICCG for short, following [19], and it will be further explained in the next section. DICCG is a typical two-level Krylov projection method, where a combination of traditional and projection-type preconditioners is used to get rid of the effect of both small and large eigenvalues of the coefficient matrix. In the literature, more related projection methods are known, coming from the fields of domain decomposition (such as balancing Neumann-Neumann) and multigrid (such as additive coarse-grid correction). At first glance, these methods seem to be different. However, from an abstract point of view, it can be shown that they are
closely related to each other and some of them are even equivalent. We refer to [18] for a theoretical and numerical comparison of these methods.
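The preconditioned CG iteration referred to in the Introduction can be sketched as follows. This is a minimal illustration and not the authors' code: it uses a simple Jacobi (diagonal) preconditioner in place of Incomplete Cholesky, a small nonsingular 1-D Poisson matrix in place of the singular 3-D pressure matrix, and a plain relative-residual stopping test. The names `pcg` and `apply_Minv` are hypothetical.

```python
import numpy as np


def pcg(A, b, apply_Minv, tol=1e-8, max_iter=1000):
    """Preconditioned CG for a symmetric positive definite matrix A.

    apply_Minv(r) must return M^{-1} r for the chosen preconditioner M.
    Returns the approximate solution and the number of iterations used.
    """
    x = np.zeros_like(b)
    r = b - A @ x
    z = apply_Minv(r)
    p = z.copy()
    rz = r @ z
    b_norm = np.linalg.norm(b)
    for k in range(max_iter):
        if np.linalg.norm(r) <= tol * b_norm:
            return x, k
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        z = apply_Minv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter


# Assumed test problem: 1-D Poisson matrix with Dirichlet boundaries.
n = 50
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
diag = np.diag(A).copy()
x, iters = pcg(A, b, lambda r: r / diag)  # Jacobi stands in for IC here
```

Swapping in a different preconditioner only requires a different `apply_Minv`; the CG recurrence itself is unchanged.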
2 Deflation Method
In DICCG, we solve
$$M^{-1} P A \tilde{x} = M^{-1} P b, \qquad P \text{ is the deflation matrix},$$
using CG, where
$$P := I - A Z E^{-1} Z^T, \qquad E := Z^T A Z, \qquad Z \in \mathbb{R}^{n \times r}, \qquad r \ll n. \tag{5}$$
Piecewise-constant deflation vectors are used to approximate the eigenmodes corresponding to the components that cause the slow convergence of ICCG. More technically, the deflation subspace matrix $Z = [z_1\ z_2\ \cdots\ z_r]$ consists of deflation vectors $z_j$, with
$$z_j(x) = \begin{cases} 0, & x \in \Omega \setminus \bar{\Omega}_j; \\ 1, & x \in \Omega_j, \end{cases}$$
where the domain Ω is divided into nonoverlapping subdomains $\Omega_j$, which are chosen to be cubes, assuming that the number of grid points in each spatial direction is the same. This approach is strongly related to methods known in domain decomposition. Note that, due to the construction of the sparse matrix Z, matrices AZ and E are sparse as well, so that the extra computations with the deflation matrix, P, are relatively cheap. Moreover, since the piecewise-constant deflation vectors correspond to nonoverlapping subdomains, the deflation technique has excellent parallel properties. This is in contrast to the IC preconditioner. Therefore, for parallel computations, one should combine the deflation technique with, for instance, the block-IC preconditioner in the CG method. For more details, one can consult [20].
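The construction in Eq. (5) can be made concrete with a small NumPy sketch. It is an illustration under simplifying assumptions, not the authors' implementation: the grid is one-dimensional, the subdomains are equal blocks of grid points, A is a nonsingular 1-D Poisson matrix so that $E^{-1}$ exists, and dense matrices are used although a practical code would keep Z, AZ and E sparse and never form P explicitly. The function names are hypothetical.

```python
import numpy as np


def subdomain_deflation_vectors(n, r):
    """Z[i, j] = 1 if grid point i lies in subdomain j, else 0.

    Assumes a 1-D grid with n points split into r equal blocks (n divisible by r).
    """
    Z = np.zeros((n, r))
    block = n // r
    for j in range(r):
        Z[j * block:(j + 1) * block, j] = 1.0
    return Z


def deflation_matrix(A, Z):
    """P = I - A Z E^{-1} Z^T with E = Z^T A Z, formed densely for illustration."""
    AZ = A @ Z
    E = Z.T @ AZ
    return np.eye(A.shape[0]) - AZ @ np.linalg.solve(E, Z.T)


n, r = 12, 3
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # assumed SPD test matrix
Z = subdomain_deflation_vectors(n, r)
P = deflation_matrix(A, Z)
```

Three properties follow directly from the definition and can be checked numerically: $PAZ = 0$ (the deflation subspace is projected out), $P^2 = P$ (P is a projector), and $PA$ is symmetric, which is what allows CG to be applied to the deflated system.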
3 Application to Bubbly Flows
The deflation technique works well for invertible systems and when the deflation vectors are based on the geometry of the problem, see also References [9, 10]. However, in our bubbly flow applications, we have systems with singular matrices and deflation vectors that are chosen independently of the geometry of the density field. Hence, the main questions in this paper are:
– is the deflation method also applicable to linear systems with singular matrices?
– is the deflation method with fixed deflation vectors also applicable to problems where the position and radius of the bubbles change in every time step?
The answers will be given in this section.
3.1 Deflation and Singular Matrices
First, we show that DICCG can be used for singular matrices. Due to the construction of matrix Z and the singularity of A, the coarse matrix $E := Z^T A Z$ is also singular. In this case, $E^{-1}$ does not exist. We propose several new variants of deflation matrices P:
(i) invertibility of A is forced, resulting in a deflation matrix $P_1$, i.e., we adapt the last element of A such that the new matrix, denoted as $\tilde{A}$, is invertible;
(ii) a column of Z is deleted, resulting in a deflation matrix $P_2$, i.e., instead of Z we take $[z_1\ z_2\ \cdots\ z_{r-1}]$ as the deflation subspace matrix;
(iii) systems with a singular E are solved iteratively, resulting in a deflation matrix $P_3$, i.e., matrix $E^{-1}$ as given in Eq. (5) is considered to be a pseudoinverse.
As a result, Variants (i) and (ii) give a nonsingular matrix E, whereas the real inverse of E is not required anymore in Variant (iii). Moreover, note that Variant (iii) is basically identical to the original DICCG for invertible systems, see e.g. [9, 19], since the original coefficient matrix and all r deflation vectors are used in this variant. Subsequently, we can prove that the three DICCG variants are identical in exact arithmetic, see Theorem 1.
Theorem 1. $P_1 \tilde{A} = P_2 A = P_3 A$.
Proof. The proof can be found in [15, 16].
We observe that the deflated systems of all variants are identical. From this result, it is easy to show that the preconditioned deflated systems are also the same, that is
$$M^{-1} P_1 \tilde{A} = M^{-1} P_2 A = M^{-1} P_3 A.$$
Since the variants are equal, any of them can be chosen in the numerical experiments. In the next section, we will apply the first variant for convenience, and the results and efficiency of this variant will be demonstrated numerically.
3.2 Deflation and Varying Density Fields
The number of smallest eigenvalues of $M^{-1}A$ that are of order $10^{-3}$ is related to the number of bubbles in Ω, see [17, Sect. 3]. To show that DICCG works in cases with varying density fields, we have to show that the deflation vectors approximate the eigenvectors corresponding to the smallest eigenvalues. Only in this case does the deflation technique eliminate those eigenvalues that cause the slow convergence of ICCG. Proposition 1 can be found in [17]:
Proposition 1. Eigenvectors $v_i$ of $M^{-1}A$ corresponding to small eigenvalues $\lambda_i$ associated with bubbles remain good approximations if
– one or more elements of $v_i$ corresponding to $\Lambda_0$ are perturbed arbitrarily;
– elements of $v_i$ corresponding to a whole bubble of $\Lambda_1$ are perturbed with a constant.
Proposition 1 implies that the column space of Z can indeed approximate $v_i$ as long as each subdomain contains at most (a part of) one bubble. Therefore, as long as r is chosen sufficiently large, the deflation method works appropriately, since the spectrum of $M^{-1}PA$ does not contain eigenvalues of $O(10^{-3})$. In addition, it appears that the subdomain deflation vectors even approximate other small eigenvalues of O(1), because the associated eigenvectors are slowly varying modes that allow small perturbations. For more details, we refer to [17]. We conclude that DICCG can be effective using subdomain deflation vectors, where Z can be constructed independently of the geometry of the density field. In the next section, numerical experiments will be presented to show the success of the method for time-dependent bubbly flow problems.
4 Numerical Experiments
We test the efficiency of the DICCG method for two kinds of test problems.
4.1 Test Case 1: Stationary Problem
First, we take a 3-D bubbly flow application with eight air bubbles in a domain of water, see Figure 1 for the geometry. We apply finite differences on a uniform Cartesian grid with $n = 100^3$, resulting in a very large but sparse linear system Ax = b with SPSD matrix A. The results of ICCG and DICCG can be found in Table 1, where φ denotes the final relative exact residual and DICCG−r denotes DICCG with r deflation vectors. Moreover, we terminate the iterative process when the relative update residuals are smaller than the stopping tolerance of $10^{-8}$.
Fig. 1. An example of a bubbly flow problem: eight air-bubbles in a unit domain filled with water
Table 1. Convergence results of ICCG and DICCG−r solving Ax = b with n = 100^3, for the test problem as given in Figure 1
Method        # Iterations   CPU Time (s)   φ (×10^−9)
ICCG          291            43.0           1.1
DICCG−2^3     160            29.1           1.1
DICCG−5^3     72             14.2           1.2
DICCG−10^3    36             8.2            0.7
DICCG−20^3    22             27.2           0.9
From Table 1, one observes that the larger the number of deflation vectors, the fewer iterations DICCG requires. With respect to the CPU time, there is an optimum, namely for r = 10^3. Hence, in the optimal case, DICCG is more than five times faster than the original ICCG method, while the accuracy of both methods is comparable. Similar results also hold for other related test cases. Results of ICCG and DICCG for the problem with 27 bubbles can be found in Table 2. In addition, it appears that the benefit of the deflation method is larger when we increase the number of grid points, n, in the test cases, see also [16].
Table 2. Convergence results of ICCG and DICCG−r solving Ax = b with n = 100^3, for the test case with 27 bubbles
Method        # Iterations   CPU Time (s)   φ (×10^−9)
ICCG          310            46.0           1.3
DICCG−2^3     275            50.4           1.3
DICCG−5^3     97             19.0           1.2
DICCG−10^3    60             13.0           1.2
DICCG−20^3    31             29.3           1.2
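The speed-up factors quoted in the text follow directly from the CPU times in Table 1. The short sketch below only redoes that arithmetic; the dictionary keys are labels chosen here for readability.

```python
# CPU times in seconds, copied from Table 1 (eight bubbles, n = 100^3).
cpu_table1 = {
    "ICCG": 43.0,
    "DICCG-2^3": 29.1,
    "DICCG-5^3": 14.2,
    "DICCG-10^3": 8.2,
    "DICCG-20^3": 27.2,
}

# Speed-up of each method relative to ICCG (larger is better).
speedup = {method: cpu_table1["ICCG"] / t for method, t in cpu_table1.items()}
best = max(speedup, key=speedup.get)
```

The largest ratio is obtained for r = 10^3, where 43.0 / 8.2 is a little above 5, in line with the statement that DICCG is more than five times faster in the optimal case. For r = 20^3 the iteration count is lowest but the CPU time rises again, so the ratio drops.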
Finally, for the test case with 27 bubbles, the plots of the residuals during the iterative process of both ICCG and DICCG can be found in Figure 2. Notice that the behavior of the residuals of ICCG is somewhat irregular due to the presence of the bubbles. For DICCG, we conclude that the larger r, the more linear the residual plot is, and so the faster the convergence of the iterative process. Apparently, the eigenvectors associated with the small eigenvalues of $M^{-1}A$ have been well approximated by the deflation vectors and $M^{-1}PA$ is better conditioned, if r is sufficiently large.
4.2 Test Case 2: Time-Dependent Problem
Next, we present some results for the 3-D simulation of a rising air bubble in water, in order to show that the deflation method is also applicable to real-life problems with varying density fields. We adopt the pressure-correction method
[Figure 2: semilogarithmic plot of the norm of the residuals (from 10^0 down to 10^−8) versus iteration (up to 300) for ICCG, DICCG−2^3, DICCG−5^3, DICCG−10^3 and DICCG−20^3.]
Fig. 2. Residual plots of ICCG and DICCG−r, for the test problem with 27 bubbles and various number of deflation vectors r
for the simulations, but it could be replaced by any operator-splitting method, in general. For more details, see [13]. At each time step, a pressure Poisson equation has to be solved, which is the most time-consuming part of the whole simulation. Therefore, in this section we concentrate only on this part at each time step. We investigate whether DICCG is efficient for all those time steps. We consider a test problem with a rising air bubble in water without surface tension. The exact material constants and other relevant information can be found in [13, Sect. 8.3.2]. The starting position of the bubble in the domain and the evolution of the movement during the 250 time steps are given in Figure 3. In [13], the Poisson solver is based on ICCG. Here, we will compare this method to DICCG with $r = 10^3$ deflation vectors, in the case of $n = 100^3$. The results are presented in Figure 4. From Subfigure 4(a), we observe that the number of iterations is strongly reduced by the deflation method. DICCG requires approximately 60 iterations, while ICCG converges in between 200 and 300 iterations at most time steps. Moreover, we observe the erratic behavior of ICCG, whereas DICCG seems to be less sensitive to the geometries during the evolution of the simulation. Also with respect to the CPU time, DICCG shows very good performance, see Subfigure 4(b). At most time steps, ICCG requires 25–45 seconds to converge, whereas DICCG only needs around 11–14 seconds. Moreover, in Figure 4(c), one can find the gain factors, considering both the ratios of the iterations and the CPU time between ICCG and DICCG. From this figure, it can be seen that DICCG needs approximately 4–8 times fewer iterations, depending on the time step. More importantly, DICCG converges approximately 2–4 times faster to the solution compared to ICCG, at all time steps.
In general, we see that, compared to ICCG, DICCG decreases significantly the number of iterations and the computational time as well, which are required for solving the pressure Poisson equation with discontinuous coefficients, in applications of 3-D bubbly flows.
Fig. 3. Evolution of the rising bubble in water without surface tension in the first 250 time steps: (a) t = 0, (b) t = 50, (c) t = 100, (d) t = 150, (e) t = 200, (f) t = 250.
Fig. 4. Results of ICCG and DICCG with r = 10^3, for the simulation with a rising air bubble in water: (a) number of iterations versus time step; (b) CPU time (sec) versus time step; (c) gain factors with respect to ICCG and DICCG (iterations ICCG / iterations DICCG−10^3 and CPU time ICCG / CPU time DICCG−10^3) versus time step.
5 Conclusions
A deflation technique has been proposed to accelerate the convergence of standard preconditioned Krylov methods for solving bubbly flow problems. In the literature, this deflation method has already been proven to be efficient for linear systems with an invertible coefficient matrix and density fields that do not vary in time. However, in our bubbly flow applications, we deal with linear systems with a singular matrix and varying density fields. In this paper, we have shown, both theoretically and numerically, that the deflation method with fixed subdomain deflation vectors can also be applied to this kind of problem. The method appears to be robust and very efficient in various numerical experiments, with respect to both the number of iterations and the computational time.
References
1. Benzi, M.: Preconditioning techniques for large linear systems: A survey. J. Comp. Phys. 182, 418–477 (2002)
2. Gravvanis, G.A.: Explicit Approximate Inverse Preconditioning Techniques. Arch. Comput. Meth. Eng. 9, 371–402 (2002)
3. Grote, M.J., Huckle, T.: Parallel preconditioning with sparse approximate inverses. SIAM J. Sci. Comput. 18, 838–853 (1997)
4. Hestenes, M.R., Stiefel, E.: Methods of Conjugate Gradients for Solving Linear Systems. J. Res. Nat. Bur. Stand. 49, 409–436 (1952)
5. Huckle, T.: Approximate sparsity patterns for the inverse of a matrix and preconditioning. Appl. Num. Math. 30, 291–303 (1999)
6. Kaasschieter, E.F.: Preconditioned Conjugate Gradients for solving singular systems. J. Comp. Appl. Math. 24, 265–275 (1988)
7. van Kan, J.J.I.M.: A second-order accurate pressure correction method for viscous incompressible flow. SIAM J. Sci. Stat. Comp. 7, 870–891 (1986)
8. Meijerink, J.A., van der Vorst, H.A.: An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix. Math. Comp. 31, 148–162 (1977)
9. Nabben, R., Vuik, C.: A comparison of Deflation and Coarse Grid Correction applied to porous media flow. SIAM J. Numer. Anal. 42, 1631–1647 (2004)
10. Nabben, R., Vuik, C.: A Comparison of Deflation and the Balancing Preconditioner. SIAM J. Sci. Comput. 27, 1742–1759 (2006)
11. Nicolaides, R.A.: Deflation of Conjugate Gradients with applications to boundary value problems. SIAM J. Matrix Anal. Appl. 24, 355–365 (1987)
12. Van der Pijl, S.P., Segal, A., Vuik, C., Wesseling, P.: A mass-conserving Level-Set method for modelling of multi-phase flows. Int. J. Num. Meth. in Fluids 47(4), 339–361 (2005)
13. Van der Pijl, S.P.: Computation of bubbly flows with a mass-conserving level-set method, PhD thesis, Delft University of Technology, Delft (2005)
14. Sousa, F.S., Mangiavacchi, N., Nonato, L.G., Castelo, A., Tome, M.F., Ferreira, V.G., Cuminato, J.A., McKee, S.: A Front-Tracking / Front-Capturing Method for the Simulation of 3D Multi-Fluid Flows with Free Surfaces. J. Comp. Physics 198, 469–499 (2004)
15. Tang, J.M., Vuik, C.: On Deflation and Singular Symmetric Positive Semi-Definite Matrices. J. Comp. Appl. Math. 206, 603–614 (2007)
16. Tang, J.M., Vuik, C.: Efficient Deflation Methods applied to 3-D Bubbly Flow Problems. Elec. Trans. Numer. Anal. 26, 330–349 (2007)
17. Tang, J.M., Vuik, C.: New Variants of Deflation Techniques for Bubbly Flow Problems. J. Numer. Anal. Indust. Appl. Math. 2, 227–249 (2007)
18. Tang, J.M., Nabben, R., Vuik, C., Erlangga, Y.A.: Theoretical and numerical comparison of various projection methods derived from deflation, domain decomposition and multigrid methods, Delft University of Technology, Department of Applied Mathematical Analysis, Report 07-04, ISSN 1389-6520 (2007)
19. Vuik, C., Segal, A., Meijerink, J.A.: An efficient preconditioned CG method for the solution of a class of layered problems with extreme contrasts in the coefficients. J. Comp. Phys. 152, 385–403 (1999)
20. Frank, J., Vuik, C.: On the construction of deflation-based preconditioners. SIAM J. Sci. Comp. 23, 442–462 (2001)
Persistent Data Structures for Fast Point Location
Michal Wichulski and Jacek Rokicki
Warsaw University of Technology, Institute of Aeronautics and Applied Mechanics, Nowowiejska 24, 00-665 Warsaw, Poland
{wichulski, jack}@meil.pw.edu.pl
Abstract. The paper presents a practical implementation of an algorithm for fast point location on triangular 2D meshes. The algorithm cost is proportional to log 2 N while storage is a linear function of the number of mesh vertices N. The algorithm is based on persistent data structures, which store only the changes between consecutive mesh levels. The performance of the presented approach was positively verified for a sequence of meshes ranging from 1239 up to 1846197 cells.
1 Introduction
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1333–1340, 2008. © Springer-Verlag Berlin Heidelberg 2008
In computational physics, partial differential equations are solved numerically on discrete meshes covering the region of interest. These meshes consist of a large number of disjoint cells completely filling the domain Ω:
$$\bar{\Omega} = \bigcup_{i=1}^{N_\Omega} \bar{\Omega}_i, \qquad \Omega_i \cap \Omega_j = \emptyset,\ i \neq j, \qquad \Omega_i = \operatorname{int} \Omega_i.$$
Here we consider the simplest triangular cells and a 2D domain, expecting that the approach can be extended to tetrahedral cells in 3D. Edges of these cells constitute a collection $E = \{e_1, e_2, ..., e_{N_E}\}$. We further assume that no hanging nodes exist in the mesh. Therefore each edge belongs either to two cells (internal edge) or to a single cell (boundary edge). Similarly, the collection of nodes is given as $P = \{p_1, p_2, ..., p_N\}$. Consider now an arbitrary (random) point q. The localisation problem consists in:
1. deciding whether $q \in \bar{\Omega}$ (or not),
2. if $q \in \bar{\Omega}$, finding k ($1 \le k \le N_\Omega$) such that $q \in \bar{\Omega}_k$.
If by chance q belongs to an edge $e \in E$ or coincides with $p \in P$, the problem may not have a unique solution. At the moment, however, we are satisfied with finding an arbitrary k such that $q \in \bar{\Omega}_k$. This problem appears in many algorithms of computational geometry. Two examples will be given:
1334
M. Wichulski and J. Rokicki
– in the Delaunay grid generation algorithm, a new point inserted into the grid has to be localised before it is accepted,
– in the Chimera overlapping mesh algorithm [4][5][6], boundary points of one mesh have to be localised on the other mesh prior to all computations (here the efficiency of localisation is especially crucial if the meshes move with respect to one another).
The localisation problem can be solved easily by inspecting all cells $\Omega_k$ and checking each time the relation $q \in \bar{\Omega}_k$. Such an algorithm has an average cost $\sim N_\Omega$, which is usually not acceptable if $N_\Omega$ and the number of points to be localised are large (e.g., of order $10^5$ to $10^6$). It is also relatively straightforward to present an algorithm based on the quad-tree (octree) approach that finds $k^*$ such that
$$\|q - p_{k^*}\| \le \|q - p_i\|, \qquad i = 1, 2, \ldots, N,$$
with a cost proportional to $\log_2 N$. Such an algorithm requires, however, a second step in which the neighbourhood of $p_{k^*}$ is searched to find the $\Omega_k$ in question (the number of cells to be traversed can be significant).
The purpose of the present paper is to present, based on the ideas of Preparata and Tamassia [2], a practical one-step algorithm that solves the localisation problem with a cost proportional to $\log_2^{\kappa} N$ ($\kappa$ being a small number). It will be shown further that the algorithm can be straightforward if the available memory is proportional to $N^2$. However, this is again not a practical requirement, and the algorithm we seek has to have:
• a cost proportional to $\log_2^{\kappa} N$ ($\kappa$ being a small number),
• a storage proportional to $N$.
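As a point of reference, the exhaustive search mentioned above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names, the two-cell mesh and the representation of the mesh as a list of points plus a list of index triples are assumptions made here.

```python
def orient(a, b, c):
    """Twice the signed area of triangle (a, b, c); positive if counter-clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def in_closed_triangle(q, a, b, c):
    """True if q lies inside or on the boundary of the triangle (a, b, c)."""
    d = (orient(a, b, q), orient(b, c, q), orient(c, a, q))
    # q is outside exactly when it is strictly on the left of one edge
    # and strictly on the right of another.
    return not (min(d) < 0 and max(d) > 0)


def locate_brute_force(points, triangles, q):
    """Return the index of some cell containing q, or None; cost ~ number of cells."""
    for k, (i, j, l) in enumerate(triangles):
        if in_closed_triangle(q, points[i], points[j], points[l]):
            return k
    return None


# Hypothetical two-cell mesh used only for illustration.
points = [(0.0, 0.0), (2.0, 0.5), (0.5, 2.0), (2.5, 2.5)]
triangles = [(0, 1, 2), (1, 3, 2)]
print(locate_brute_force(points, triangles, (0.8, 0.8)))
print(locate_brute_force(points, triangles, (3.0, 3.0)))
```

Every query visits the cells one by one, which is the linear cost that the algorithm sought in this paper is meant to avoid.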
2
Preliminaries
We assume that in the 2D space $p_j = (x_j, y_j)$. Suppose that the vertices (nodes) are ordered in such a way that $y_1 < y_2 < \ldots < y_N$. Suppose further that the graph consisting of all edges (Fig. 1) is intersected by a horizontal line $\lambda(y) = \{(x, y) : y = \mathrm{const}\}$. First, one should notice that the result of the intersection forms a totally ordered set of intervals $R(y)$. A graph $G(y)$ can be associated with $R(y)$: its vertices are the ordered intersection points of the line $\lambda(y)$ with the edges, and its edges are the corresponding intervals of $R(y)$. Suppose now that an arbitrary point $q = (x_*, y_*)$ is given. The point location problem is then reduced to a one-dimensional search within the set $R(y_*)$ (with $O(\log N)$ query cost). If the horizontal line is shifted up or down, the graph $G(y')$ has the same topology as $G(y_*)$ as long as the line $\lambda(y')$ does not cross a vertex.
Persistent Data Structures for Fast Point Location
Lemma 1. For all $y'$, $y''$ such that $y_i < y', y'' < y_{i+1}$, the graphs $G(y')$ and $G(y'')$ are isomorphic.
To achieve $O(\log N)$ query cost, a usual binary-tree data structure $S(y)$ has to be built (Fig. 2 and Fig. 3). Other data structures are possible, but in any case $S(y)$ depends only on the ordering of the intervals and therefore remains constant on each interval $(y_i, y_{i+1})$. A particular interval from the ordered set $R(y)$ forms the key used to construct and access $S(y)$. The key depends on the geometric boundaries of the interval, but as long as the elements of $G(y)$ do not change (intervals are neither inserted nor deleted) the data structure itself has the same topology. This leads to the following lemma [2].
Fig. 1. Intersections corresponding to the levels in the data structure
Lemma 2. The same data structure is sufficient for searching in subdivisions represented by isomorphic graphs.
The keys of the data structure $S(y)$ are parameterised by the height of the horizontal line $\lambda(y)$, because that line intersects edges of the mesh and the intersection points give the geometric coordinates of the intervals. If the line $\lambda(y)$ is such that $y_i < y < y_{i+1}$, then it fulfils the conditions of Lemma 1, and additionally the same edges bound the intervals. It follows that the definition of the keys (that is, the way of parameterising the intervals) remains unchanged. The interval between two consecutive vertices of the mesh is now called a level. By Lemma 2, the data structure needs no changes within a given level. If there are $N$ vertices in the mesh, $N - 1$ levels exist. In particular, for $y \in \sigma_i = (y_i, y_{i+1})$ the ordered set $R(y)$, the graph $G(y)$ and the data structure $S(y)$ can be parameterised by the level number and are denoted respectively by $R_i$, $G_i$ and $S_i$.
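The notion of a level can be illustrated with a deliberately naive Python sketch in which every level keeps its own ordered list of cell slices, so the storage is not yet linear. The names `x_at`, `build_levels` and `locate`, the mesh layout and the use of a sorted list instead of a balanced tree are assumptions of this sketch and do not come from the paper.

```python
from bisect import bisect_right


def x_at(edge, y):
    """x-coordinate at which a non-horizontal edge meets the line lambda(y)."""
    (x0, y0), (x1, y1) = edge
    return x0 + (x1 - x0) * (y - y0) / (y1 - y0)


def build_levels(points, triangles):
    """For every level store the slices (left edge, right edge, cell), left to right."""
    ys = sorted({p[1] for p in points})
    levels = []
    for y_lo, y_hi in zip(ys, ys[1:]):
        y_mid = 0.5 * (y_lo + y_hi)
        slices = []
        for k, tri in enumerate(triangles):
            a, b, c = (points[v] for v in tri)
            crossing = [e for e in ((a, b), (b, c), (c, a))
                        if min(e[0][1], e[1][1]) < y_mid < max(e[0][1], e[1][1])]
            if len(crossing) == 2:
                left, right = sorted(crossing, key=lambda e: x_at(e, y_mid))
                slices.append((left, right, k))
        # By Lemma 1 the order found at y_mid is valid on the whole level.
        slices.sort(key=lambda s: x_at(s[0], y_mid))
        levels.append(slices)
    return ys, levels


def locate(ys, levels, q):
    """Return the index of a cell containing q, or None if q is outside the mesh."""
    x, y = q
    if y < ys[0] or y > ys[-1]:
        return None
    i = min(bisect_right(ys, y) - 1, len(levels) - 1)   # find the level
    slices = levels[i]
    lo, hi = 0, len(slices)
    while lo < hi:                                       # find the slice
        mid = (lo + hi) // 2
        left, right, k = slices[mid]
        if x < x_at(left, y):
            hi = mid
        elif x > x_at(right, y):
            lo = mid + 1
        else:
            return k
    return None


# Hypothetical two-cell mesh used only for illustration.
points = [(0.0, 0.0), (2.0, 0.5), (0.5, 2.0), (2.5, 2.5)]
triangles = [(0, 1, 2), (1, 3, 2)]
ys, levels = build_levels(points, triangles)
print(locate(ys, levels, (0.8, 0.8)), locate(ys, levels, (1.8, 1.8)))
```

The keys are the bounding edges rather than fixed numbers: the comparison in `locate` evaluates each edge at the query height, which is the parameterisation of the keys described above.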
One can observe in particular that the differences between $G_i$ and $G_{i+1}$ (as well as between $S_i$ and $S_{i+1}$) are quite limited. The important remark is that the changes reduce to a few basic possibilities (see Fig. 1):
a) deleting the cells for which $p_i$ is the top vertex,
b) inserting the cells for which $p_i$ is the bottom vertex,
c) modifying the keys in the cells for which $p_i$ is the middle vertex.
This results in the following lemma:
Lemma 3. The number of elementary changes (see above) necessary to transform $S_i$ into $S_{i+1}$ is equal to $m_i$ (where $m_i$ is the number of cells containing the vertex $p_i$).
Consider an example. The two horizontal lines in Fig. 1 (dark and light gray) correspond to the levels $\sigma_i$ and $\sigma_{i+1}$. If the line $\lambda$ leaves the lower level $\sigma_i$ and enters $\sigma_{i+1}$, the following changes are necessary:
• cells g and h are deleted,
• cells j and k are inserted,
• the keys in f and i are modified.
This modification consists in replacing the formulas describing the edges (i.e. edge j replaces g and edge k replaces h). Suppose that we already have the data structure $S_i$. The new structure $S_{i+1}$ can be obtained from $S_i$ by applying the changes, i.e. by adding the difference between these levels. In fact, only the differences have to be stored. This observation allows the use of so-called persistent data structures.
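The three kinds of elementary change can be enumerated directly from the mesh connectivity. The following Python sketch does this for one vertex; the function name `changes_at_vertex` and the small mesh are hypothetical, and distinct $y$-coordinates are assumed, as in the ordering $y_1 < y_2 < \ldots < y_N$.

```python
def changes_at_vertex(points, triangles, i):
    """Classify the cells containing vertex i by the role the vertex plays in them."""
    deleted, inserted, modified = [], [], []
    y_i = points[i][1]
    for k, tri in enumerate(triangles):
        if i not in tri:
            continue
        ys = [points[v][1] for v in tri]
        if y_i == max(ys):
            deleted.append(k)      # vertex is the top of the cell: the slice ends
        elif y_i == min(ys):
            inserted.append(k)     # vertex is the bottom of the cell: the slice starts
        else:
            modified.append(k)     # middle vertex: one bounding edge is replaced
    return deleted, inserted, modified


# Hypothetical two-cell mesh used only for illustration.
points = [(0.0, 0.0), (2.0, 0.5), (0.5, 2.0), (2.5, 2.5)]
triangles = [(0, 1, 2), (1, 3, 2)]
print(changes_at_vertex(points, triangles, 1))
print(changes_at_vertex(points, triangles, 2))
```

The three lists together always contain as many entries as there are cells around the vertex, which is the count $m_i$ of Lemma 3.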
3
The Algorithm
It is now possible to present a point location algorithm with query cost $O(\log N)$. The data structure $S$ supporting this algorithm contains $N$ substructures $S_i$. Each substructure $S_i$ is (for example) a binary tree consisting of ordered non-intersecting segments (as described in the previous section). The point location algorithm consists of two queries: in the first one the vertical position $y_*$ is located on the appropriate level, in the second one the horizontal position $x_*$ is located within the appropriate segment. The query cost is therefore at most $2 \log N$. The number of segments in each substructure is at most $N + 1$. Therefore the size of the whole data structure $S$ is proportional to $N^2$. For large meshes ($N > 10^6$) this is entirely unacceptable. The reader may notice that the assumption $M \sim N$ (with $M$ denoting the number of segments in a substructure) is very pessimistic, since $M \sim \sqrt{N}$ on more regular meshes. Yet even $N\sqrt{N} = N^{3/2}$ is too large for practical purposes. In order to overcome this problem we can still use the fact that consecutive substructures, say $S_i$ and $S_{i+1}$, differ only by a few elementary operations. Thus we will attempt to add persistence, storing the changes to the structure rather than the structures themselves. One must observe, however, that persistence is difficult to implement if $S_i$ is an ordinary binary
tree. Every time a node is deleted or inserted, the tree needs re-balancing in order to keep its height optimal. This means that the change between $S_i$ and $S_{i+1}$ may concern almost all nodes. Therefore another data structure is necessary to alleviate this problem. A possible choice is a red-black binary search tree (see Fig. 2 and Fig. 3). As in every binary search tree, a node contains three fields: the key, the left pointer and the right pointer. The search through the tree, from node to node, starts at the topmost node, called the root. In every node the wanted key is compared (in the sense of the total order) with the current key and, depending on the result, the proper branch is chosen. Without going into details, one should note that in a red-black tree node another field is added, called colour, which is necessary to preserve balancing. For all details of the algorithms for inserting, finding and deleting an item from the tree, see [1]. It is enough to say that, as a result of re-balancing, some changes may occur on the path from the root to an inserted node. The re-balancing itself is performed using rotations of the tree branches.
Fig. 2. The tree corresponding to the level σ9
Fig. 3. The tree corresponding to the level σ10
Fig. 4. The persistent binary search tree for two σ9 and σ10 levels in the mesh
To add persistence to the data structure we have chosen the fat node method [3]. It is based on recording all changes made to a node in the node itself. As a consequence, each node contains a collection of pointers, each corresponding to a different level. Let $\eta$ be the number of changes during a single operation on the tree, and let $\zeta$ be the number of nodes in which these changes occur. The rotation of a single branch consists of three changes in the pointer fields. To insert a node no more than two rotations are necessary, while three rotations are sufficient to delete a node [3]. Thus, in the most pessimistic case, there are five changes of pointer fields for inserting a node and seven changes for deleting a node, with four and six nodes involved respectively (all from one sub-tree). The important thing is that $\eta$ and $\zeta$ are both $O(1)$, and not $O(N)$. Similarly, the scope of changes within the tree is limited and relates to the neighbouring nodes in a sub-tree. Thus, since the red-black binary search tree is balanced, the changes concern nodes located in the immediate neighbourhood of either the deleted or the inserted node.
As mentioned earlier, the full data structure $S$ can be seen as a sequence of red-black binary search trees $S_i$. Each node represents a cell (or, strictly speaking, the slice of the cell at the present level). This node has a left and a right child, each corresponding to a neighbouring slice. Consider now an equivalent structure consisting of all cells. Each cell has its own two collections of left and right children (each child corresponding to a different level $\sigma_i$), see Fig. 4. These collections need to contain only those levels at which the cell changes its location in the red-black binary search tree. The numbers $s_i^L$ and $s_i^R$ of entries in these collections (of left and right children respectively) are, as mentioned earlier, expected to be $O(1)$. Additionally, to speed up the search, the collections themselves can be ordered.
The total size of the full data structure is now estimated as $\beta \cdot N$ (where $\beta \sim s^* = \max_j s_j$), while the point location query cost remains $\alpha \cdot \log N$ (where $\alpha$ does not exceed $1 + \log s^*$).
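The per-node collections of children can be sketched as follows. This is a minimal Python illustration of the fat node idea only: red-black colouring, rotations and the per-level roots are left out, and the class `FatNode`, its methods and the cell labels are assumptions of this sketch rather than the authors' implementation.

```python
from bisect import bisect_right


class FatNode:
    """Tree node that keeps every version of its child pointers."""

    def __init__(self, cell):
        self.cell = cell
        self._levels = {"L": [], "R": []}     # levels at which a pointer changed
        self._children = {"L": [], "R": []}   # child valid from that level onwards

    def set_child(self, side, level, child):
        # Changes are assumed to be recorded in order of increasing level.
        self._levels[side].append(level)
        self._children[side].append(child)

    def child(self, side, level):
        # Binary search among the recorded versions of this pointer.
        idx = bisect_right(self._levels[side], level) - 1
        return self._children[side][idx] if idx >= 0 else None


def in_order(node, level):
    """Cells of the tree as it looks at the given level, from left to right."""
    if node is None:
        return []
    return (in_order(node.child("L", level), level) + [node.cell]
            + in_order(node.child("R", level), level))


f, g, j = FatNode("f"), FatNode("g"), FatNode("j")
f.set_child("R", 9, g)     # level 9: g is the right child of f
f.set_child("R", 10, j)    # level 10: g is deleted and j takes its place
print(in_order(f, 9), in_order(f, 10))
```

Each step down the tree costs one ordinary comparison plus a binary search over the versions stored in the node, which is where a factor of the form $1 + \log s^*$ comes from.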
4
Numerical Experiment
In order to investigate the properties of the presented algorithm, the following numerical experiment was performed. A sequence of meshes was generated in a pentagonal domain with $N$ ranging from 1239 to 1846197 (see Fig. 9). Each experiment consisted in localising one million random points. Every elementary step performed by the algorithm was counted to measure the search cost. The total cost, defined as the largest cost value over all queries, was identified for each mesh. This value is presented in Fig. 5, indicating logarithmic behaviour. The coefficient $\alpha$ (see the previous section) is additionally shown in Fig. 6. Similarly, the total memory storage was measured in each experiment using the Linux ps command. The result is shown in Fig. 7. One can see that this value grows linearly with the number of cells. The estimated coefficient $\beta$ is shown in Fig. 8. It must be noted that this result represents the size of the memory used by the process, and not necessarily the size of the data structure (which we believe is smaller). It is suspected that the possible discrepancy may be caused by memory leakage, typical when space is allocated and deallocated in a repetitive manner.
Fig. 5. The cost of the point location (measured by counting elementary operations); searching cost versus number of cells
Fig. 6. The point location query cost coefficient α versus number of cells
Fig. 7. The memory size of the data structure versus number of cells
Fig. 8. The memory size coefficient β versus number of cells
Fig. 9. Meshes used in numerical experiment
5
Concluding Remarks
The paper presented a practical algorithm for the point location problem with $\alpha \cdot \log N$ query cost and $\beta \cdot N$ memory storage. This behaviour was confirmed in a numerical experiment in which both $\alpha$ and $\beta$ were estimated. Further research is under way to extend this algorithm to 3D tetrahedral meshes.
References
1. Cormen, T.H., Leiserson, C.E., Rivest, R.L.: Introduction to Algorithms. MIT Press, Cambridge (1990)
2. Preparata, F.P., Tamassia, R.: Efficient point location in a convex spatial cell-complex. SIAM J. Comput. 21(2), 267–280 (1992)
3. Driscoll, J.R., Sarnak, N., Sleator, D.D., Tarjan, R.E.: Making data structures persistent. J. Comput. Syst. Sci. 38(1), 86–124 (1989)
4. Benek, J.A., Buning, P.G., Steger, J.L.: A 3-D Chimera Grid Embedding Technique. AIAA Paper 85-1523 (1985)
5. Rokicki, J., Floryan, J.M.: Unstructured Domain Decomposition Method for the Navier-Stokes Equations. Computers & Fluids 28, 87–120 (1999)
6. Drikakis, D., Majewski, J., Rokicki, J., Żółtak, J.: Investigation of Blending-Function-Based Overlapping-Grid Technique for Compressible Flows. Computer Methods in Applied Mechanics and Engineering 190(39), 5173–5195 (2001)
A Reliable Extended Octree Representation of CSG Objects with an Adaptive Subdivision Depth
Eva Dyllong and Cornelius Grimm
Institute of Computer Science, University of Duisburg-Essen, Duisburg, Germany
{dyllong,grimm}@inf.uni-due.de
Abstract. Octrees are among the most widely used representations in geometric modeling systems, apart from Constructive Solid Geometry and Boundary Representations. An octree model is based on recursive cell decompositions of the space and does not depend greatly on the nature of the object but much more on the chosen maximum subdivision level. Unfortunately, an octree may require a large amount of memory when it uses a set of very small cubic nodes to approximate an object. This paper is concerned with a novel generalization of the octree model that uses interval arithmetic and allows us to extend the tests for classifying points in space as inside, on or outside a CSG object to whole sections of the space at once. Tree nodes with additional information about relevant parts of the CSG object are introduced in order to reduce the depth of the required subdivision. The proposed extended octrees are compared with the common octree representation.
1
Introduction
Most solid modeling systems use one or more of the following representation schemes: Constructive Solid Geometry (CSG), the Boundary Representation model (B-Rep) or representation schemes based on recursive cell decompositions of the space [9]. An octree model is an example of the latter [17]. A B-Rep model defines the geometry of an object according to its surface boundary, which consists of vertices, edges and faces. In the CSG approach, an object is represented as a CSG tree with simple shapes as its leaf nodes that are combined using set-theoretic operations, such as union, intersection and difference, at its interior nodes. However, all the schemes have their pros and cons. The polyhedral representations are suitable for visualization, but the computation of Boolean operations or the examination of topological consistency are complex. In contrast, for CSG models both Boolean operations and consistency checking are simple. However, the visualization or analysis of the data may require a transformation into boundary representation.
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1341–1350, 2008. © Springer-Verlag Berlin Heidelberg 2008
An octree is an efficient data structure used to represent spatial data hierarchically in a tree-based structure. The nodes are cuboids and are recursively
subdivided into eight mutually disjoint child nodes until a node contains no parts of the surface of the modeled object or the desired closeness to the object is reached [17]. Each node is checked to see whether it is full of solid material (black), partially empty (gray) or empty (white). If the nodes are empty or full, they do not need to be subdivided any further. If they are partially empty, the nodes need to be subdivided to improve the approximation of the object. The subdivision process is repeated until all nodes are either full or empty or until the maximum resolution level has been reached. To obtain an outer approximation of the object, partially occupied leaves (gray leaves) are considered full. An object representation using the octree structure is illustrated in Fig. 1 (b). A cell representation does not depend on the nature of a real solid but only on the chosen maximum level of the hierarchical structure. Unfortunately, a common octree based solely on a recursive cell decomposition of the space may require a large amount of memory when it approximates a CSG object using a set of very small cubic boxes. Despite the choice of higher precision, an inverse transformation from cell to CSG representation is invariably impossible. In recent years, several hybrid octree models have been proposed. Brunet et al [2] suggested an efficient way to handle a boundary representation hierarchically. In that paper, the space is divided recursively, as for an octree, but the leaf nodes are also allowed to represent cubes that contain a plane, edge, or vertex of the boundary of the object. The extended octrees are more compact than octrees and may prove very suitable for some applications. Another schema that includes geometry from a boundary representation is independently proposed by Carlbom [3]. A modification of the structure is presented in [6]. 
The main difference between polytrees [3] and extended octrees [2] is the way they represent face, edge, and vertex in the terminal nodes. In the context of ray tracing, Wyvill et al [19] suggested continuing the subdivision process of an octree either until each cell contains fewer than two primitive objects of a suitably reduced version of the original CSG structure or until the cell is small enough to be ignored. The authors called the process adaptive space division. In this paper, we discuss a new approach to constructing a hierarchical representation of a CSG object. We focus on a reliable enclosure of the object using an extended interval-based octree data structure. Since the construction of the octree representation is based on test functions for classifying points in space as inside, on or outside the object, recursive subdivision using interval arithmetic is provided, which yields reliable results and allows us to extend those tests to whole sections of the space at the same time. Furthermore, a new type of tree node, which includes additional information about relevant parts of the CSG object, has been introduced. With the new type of node, the octree obtained is adjustable in the sense that its maximum level can be refined at any subsequent step of processing. Moreover, an exact reconstruction of the CSG object from the cell model is possible. Unlike in [7], we use the local CSG information in the octree nodes as an additional criterion to reduce the maximum subdivision level and to perform the subdivision process adaptively. Compared to classical octrees, the reduction introduces octrees that need fewer nodes for an equal
approximation level and therefore have lower storage requirements and allow faster processing. In many applications, including robotic simulations, the modification may yield an acceleration of geometrical computations, like collision detection or distance calculation.
2
CSG Models
2.1
Basics
Constructive Solid Geometry is a powerful technique for object modeling. In CSG an object is represented as a binary tree with leaf nodes that correspond to geometrical primitives (cube, sphere, cylinder, etc.). The tree is called a CSG tree. The interior nodes of the CSG tree hold set operations, such as union, difference or intersection, that describe how the objects represented by the two child nodes are combined. For some approaches, for instance rendering, CSG primitives are represented by polygonal approximations, but mostly they are processed as implicit functions. Figure 1 (a) depicts an example of a simple CSG object consisting of two primitives.
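A CSG tree over implicit functions can be sketched in Python as shown below. The sketch assumes the sign convention that a primitive is the set where its function is non-positive, and it combines child values with min and max, a common textbook choice that is not claimed to be the authors' implementation. The class names, the sphere of radius 0.5 and the cube of side 0.4 (written with a max-norm for brevity, not as a quadric) are illustrative assumptions.

```python
class Primitive:
    """Leaf node: an implicit function f with f <= 0 inside the primitive."""

    def __init__(self, f):
        self.f = f

    def value(self, p):
        return self.f(*p)


class Operation:
    """Interior node: a set operation applied to two child nodes."""

    def __init__(self, kind, left, right):
        self.kind, self.left, self.right = kind, left, right

    def value(self, p):
        a, b = self.left.value(p), self.right.value(p)
        if self.kind == "union":
            return min(a, b)
        if self.kind == "intersection":
            return max(a, b)
        if self.kind == "difference":
            return max(a, -b)
        raise ValueError(self.kind)


def classify(node, p):
    """Classify a point as inside, on or outside the CSG object."""
    v = node.value(p)
    return "inside" if v < 0 else "outside" if v > 0 else "on"


sphere = Primitive(lambda x, y, z: x * x + y * y + z * z - 0.25)
cube = Primitive(lambda x, y, z: max(abs(x), abs(y), abs(z)) - 0.2)
both = Operation("intersection", sphere, cube)
shell = Operation("difference", sphere, cube)
print(classify(both, (0.0, 0.0, 0.0)), classify(both, (0.3, 0.0, 0.0)))
print(classify(shell, (0.3, 0.0, 0.0)))
```

The point (0.3, 0, 0) lies inside the sphere but outside the cube, so it is outside the intersection and inside the difference.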
Fig. 1. (a) Example of a CSG object, and (b) the corresponding octree at level 4
2.2
Quadrics
A quadric surface (also called a quadric) is the set of all points in 3D space that satisfy a quadratic polynomial equation
$$f(x, y, z) = 0. \tag{1}$$
The most general quadratic function of the variables $x, y, z$ takes the form
$$f(x, y, z) = ax^2 + by^2 + cz^2 + 2dxy + 2eyz + 2fxz + 2gx + 2hy + 2iz + k. \tag{2}$$
For an object $F$ defined by a quadric, $F = \{(x, y, z)^t \in \mathbb{R}^3 \mid f(x, y, z) \le 0\}$ holds. In homogeneous coordinates $u = (x, y, z, 1)^t$ we can write equation (1) as follows:
$$F(Q) = u^t Q u = 0, \tag{3}$$
where
$$Q = \begin{pmatrix} a & d & f & g \\ d & b & e & h \\ f & e & c & i \\ g & h & i & k \end{pmatrix}. \tag{4}$$
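The agreement between the polynomial form (2) and the matrix form (3)-(4) can be checked numerically. The Python sketch below uses plain nested lists; the function names and the sample coefficients are arbitrary choices made here for illustration.

```python
def quadric_matrix(a, b, c, d, e, f, g, h, i, k):
    """Symmetric 4x4 matrix Q of equation (4)."""
    return [[a, d, f, g],
            [d, b, e, h],
            [f, e, c, i],
            [g, h, i, k]]


def eval_homogeneous(Q, x, y, z):
    """Evaluate u^t Q u for u = (x, y, z, 1)^t as in equation (3)."""
    u = (x, y, z, 1.0)
    return sum(u[r] * Q[r][s] * u[s] for r in range(4) for s in range(4))


def eval_polynomial(a, b, c, d, e, f, g, h, i, k, x, y, z):
    """Evaluate the quadratic function of equation (2)."""
    return (a * x * x + b * y * y + c * z * z
            + 2 * d * x * y + 2 * e * y * z + 2 * f * x * z
            + 2 * g * x + 2 * h * y + 2 * i * z + k)


coeffs = (1.0, 2.0, 3.0, 0.5, -0.25, 0.75, 1.5, -2.0, 0.125, -4.0)
point = (0.5, -1.0, 2.0)
Q = quadric_matrix(*coeffs)
print(eval_homogeneous(Q, *point), eval_polynomial(*coeffs, *point))

# Sphere of radius 0.5 centred at the origin: x^2 + y^2 + z^2 - 0.25.
sphere = quadric_matrix(1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.25)
print(eval_homogeneous(sphere, 0.5, 0.0, 0.0))
```

The off-diagonal entries appear twice in the double sum, which accounts for the factors of 2 in equation (2).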
Quadrics represent surfaces that include planes, spheres, ellipsoids, cylinders, cones, paraboloids and hyperboloids. The plane, sphere, cylinder and cone are called the natural quadrics. We use the natural quadrics as primitives for the construction of a CSG model to achieve a generic formulation of the different primitives. Primitives of a CSG model are generally defined in an orthogonal local coordinate system. An additional matrix specifies the locally defined primitive in the world coordinate system. If the surface of a quadric $F(Q)$ is transformed by a matrix $T$, the transformed surface $T \cdot F(Q)$ is again a quadric:
$$T \cdot F(Q) = F\big((T^{-1})^t \cdot Q \cdot T^{-1}\big).$$
The equation above can be used to derive the transformed quadric between the coordinate systems. For the inverse transformations of the translation $T(v)$ by a vector $v$, the scaling $S_w(\lambda)$ by a factor $\lambda$ and the rotation $R_w(\varphi)$ by an angle $\varphi$ in the $w$-direction, $w \in \{x, y, z\}$, respectively, the following equations hold:
$$T(v)^{-1} = T(-v), \qquad S_w(\lambda)^{-1} = S_w(1/\lambda), \qquad R_w(\varphi)^{-1} = R_w(-\varphi).$$
2.3
Proximity Queries between Natural Quadrics
Computing the minimal Euclidean distance between two primitives of a CSG object is an important step in many applications. The proposed extended octree structure provides means for efficient distance calculations between complex (in the number of primitives) CSG objects because the structure enables us to quickly prune parts of the CSG object that do not minimize the distance between two objects, especially when the octree representation is reused for several consecutive coherent distance calculations. For common octrees we need proximity queries only between rotated and/or axis-aligned cuboids. Since we introduce a new node type for the extended octree, we also need to implement proximity queries between cuboids and quadric surfaces and between two quadric surfaces. At the beginning of the distance computation, it is usually necessary to determine whether an intersection between two primitives occurs. The quadratic nature of the equations defining primitives described by quadrics makes it possible to compute an explicit representation of the intersection curves. A general approach to computing an explicit parametric representation of the intersection curves can be found in [11]. Levin's method analyzes the pencil generated by the two quadrics, that is, the set of linear combinations of the two quadrics, and it is based on solutions to fourth-degree polynomial equations. Unfortunately, when floating point numbers
are used, the method may yield results that are geometrically or topologically wrong and, in degenerate cases, may even fail. There are several ways to improve the method for implicit quadrics with rational or integer coefficients [5,10]. Miller [14] chose another path and developed solutions for each pair of natural quadrics. The geometric approach does not require solutions to polynomials of a degree higher than two. The distance query for quadrics processed uniformly without regard to their specific shape is tackled by Lennerz [13,12]. He reduces the distance calculation for natural quadrics to the problem of solving polynomial equations of a degree of at most eight by applying the Lagrange formalism to a constrained distance optimization problem. The degree bound also holds in degenerate cases.
3
Interval-Based Extended Octrees with an Adaptive Subdivision Depth
3.1
Interval Arithmetic
Interval arithmetic reliably bounds the exact result of an arithmetic operation in an interval $X = [\underline{X}, \overline{X}] := \{x \in \mathbb{R} \mid \underline{X} \le x \le \overline{X}\}$ with the lower bound $\underline{X}$ and the upper bound $\overline{X}$. Most operators and functions on real numbers can be extended to interval arithmetic. An interval extension of a function can be obtained by simply substituting real number variables with interval variables and arithmetic operators and functions with their interval arithmetic equivalents in the arithmetic expression of the function. Unfortunately, this natural interval extension usually overestimates the range of the function. The overestimation does not depend greatly on the function but much more on its defining expression and is a fundamental problem of interval arithmetic. In general, it is difficult to find the best possible interval extension. Overestimation can be reduced through such methods as using the monotonicity properties of the corresponding real functions. In this context, to get a tighter enclosure of an arithmetic expression including the real power function $x^n$, which occurs in some implicit functions of the CSG primitives, the use of the interval power function instead of multiplications is highly recommended. Another problem of interval arithmetic is the wrapping effect, which is one of the main reasons why it is difficult to apply interval arithmetic to the enclosure of moving objects. The wrapping effect occurs due to rotated interval boxes, which have to be covered by axis-aligned ones after this geometrical transformation to restore the nodes as intervals. To prevent accumulation errors and to reduce the wrapping effect, consecutive transformations should be combined in a single transformation and applied to the input data. An implementation of interval arithmetic based on floating-point numbers is required to round interval bounds outwards; that is, the lower bounds have to be rounded to minus infinity, and the upper bounds to plus infinity.
Thus, the resulting interval includes the real one, in case the interval bounds cannot be represented exactly by floating-point numbers.
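The overestimation of the natural interval extension and the benefit of an interval power function can be seen on a toy example. The Python class below is a minimal sketch and not a substitute for a verified library such as C-XSC: outward rounding is imitated by widening each bound by one floating-point step with `math.nextafter` (available from Python 3.9), which is safe but slightly pessimistic, and the class and method names are assumptions of this sketch.

```python
import math


class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __mul__(self, other):
        p = (self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi)
        # Widen by one floating-point step in each direction (outward rounding).
        return Interval(math.nextafter(min(p), -math.inf),
                        math.nextafter(max(p), math.inf))

    def sqr(self):
        """Interval power function for n = 2, using the monotonicity of x**2."""
        lo2, hi2 = self.lo * self.lo, self.hi * self.hi
        upper = math.nextafter(max(lo2, hi2), math.inf)
        if self.lo <= 0.0 <= self.hi:
            return Interval(0.0, upper)
        return Interval(max(0.0, math.nextafter(min(lo2, hi2), -math.inf)), upper)

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"


X = Interval(-1.0, 2.0)
print(X * X)      # natural extension of x*x: roughly [-2, 4]
print(X.sqr())    # interval power function: roughly [0, 4]
```

Both results enclose the true range $[0, 4]$ of $x^2$ on $[-1, 2]$, but the product treats its two factors as independent and therefore also admits negative values.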
For a more detailed treatment of interval arithmetic see the monographs on the subject [15,1]. For its applications in computer graphics and geometric computations see [4,18,16].
3.2
Interval-Based Octree Data Structure
An octree node belonging to an arbitrary hierarchy level geometrically defines an axis-aligned box. In the case of an axis-aligned octree, the coordinates of the vertices of all boxes are multiples of powers of two and can be represented exactly as floating-point numbers. But this needn’t be true after a rotation. Due to rounding errors, the vertices must be replaced by small intervals with machine numbers. Axis-aligned nodes are represented by a single three-dimensional interval vector, and each component of the vector defines the extension of the box in the particular dimension. We use interval-based spatial subdivision, which leads to a reliable hierarchical approximation of a CSG tree. To construct an interval-based octree enclosure of a CSG object, we use an interval extension of the characteristic function as described in [7] for classifying points in space as inside, on or outside the object. If the octree has sufficient subdivision depth, the CSG tree can be reduced to only one primitive for most octree nodes. Following Duff [4], we simplify the CSG object for each octree node to a CSG object that is equivalent to the original object within that octree node. The implementation of the motion transformations also uses interval arithmetic to obtain guaranteed results in a dynamically changing scene. If we apply several arbitrary motion transformations to an interval vector X, we acquire wrapping effects: The intermediate results of the transformations (such as a slightly rotated box) are wrapped into their interval representations. The wrapping effect cumulates as several transformations are performed consecutively. To reduce this effect during the construction of a CSG object, we multiply the transformation matrices in the path to each leaf node of a CSG object and store the resulting matrix in the leaf nodes of the CSG tree. 
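The benefit of combining transformations before applying them can be illustrated in two dimensions. In the Python sketch below a box is rotated and then wrapped into an axis-aligned box again; floating-point rounding is ignored for brevity, so this is not a verified enclosure, and the function names are hypothetical.

```python
import math


def rotate_and_wrap(box, angle):
    """Rotate a 2D box about the origin and enclose the result in an axis-aligned box."""
    (x0, x1), (y0, y1) = box
    c, s = math.cos(angle), math.sin(angle)
    corners = [(c * x - s * y, s * x + c * y) for x in (x0, x1) for y in (y0, y1)]
    xs, ys = [p[0] for p in corners], [p[1] for p in corners]
    return (min(xs), max(xs)), (min(ys), max(ys))


def width(box):
    """Extension of the box in the x-direction."""
    return box[0][1] - box[0][0]


unit = ((-1.0, 1.0), (-1.0, 1.0))
step = math.pi / 4
two_steps = rotate_and_wrap(rotate_and_wrap(unit, step), step)   # wrapped twice
combined = rotate_and_wrap(unit, 2 * step)                       # one combined rotation
print(width(two_steps), width(combined))   # about 4 versus about 2
```

Two consecutive rotations by 45 degrees, each followed by wrapping, double the width of the box, whereas the single combined rotation by 90 degrees leaves it unchanged.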
To determine the new position of an object after a movement, we also apply the matrix with all its previous transformations to the input object points. Furthermore, the complete scene is not considered as a whole, but all objects of the scene are decomposed into their local frames. This allows us to apply arbitrary rigid motion transformations to each object of the scene separately. The transformations are successfully implemented using interval arithmetic to get guaranteed results in a dynamically changing environment.
3.3
CSG Octrees
The idea of including additional information to a recursive cell decomposition is not new. In recent years several generalizations (polytrees, integrated polytrees, extended octrees) have been proposed that improve classical octrees and allow us to obtain an exact representation of an object, for example, a polyhedron, by including new types of terminal nodes containing part of the surface of the object [3,6,2]. Moreover, to render a CSG scene by ray tracing an adaptive space
division method based on an octree and reduced CSG structure that contains information from the original CSG object is presented in [19]. Our idea was to store additional information about the corresponding primitives of the CSG object in all gray leaf nodes of the tree and thus to create its hierarchical approximation adaptively. We construct an interval-based octree in which each node corresponds to a cubical part of the space. If the node is a leaf of the tree, then it represents a cube that
1. is full or empty,
2. contains parts of the surface that result from only one primitive, or
3. is part of the maximum subdivision level.
Since the corresponding CSG information is included in the gray nodes, the maximum level of the tree can be increased adaptively at each subsequent step of processing. More precisely, in our approach the CSG object is stored as a tree. The inner nodes are set operation nodes, and the leaf nodes are quadrics. Each octree node consists of an interval vector that describes the location and extension of the node, a color attribute and a reference to the relevant part of the original CSG object. Valid node colors are taken from the common octree: white nodes are empty nodes, black nodes are nodes that are completely inside the object, and gray nodes may contain the surface of the object. But we differentiate between two types of gray nodes: Terminal gray nodes are nodes that contain only parts of the surface that result from exactly one primitive of the CSG object and thus are not further divided. All other gray nodes are non-terminal gray and can be sub-divided if a higher octree level is required. Each CSG node can be evaluated for the space an octree node occupies. The evaluation yields the range of the implicit function of the CSG object within the given octree node. With the aid of interval arithmetic, the implementation of the CSG tree evaluation is not complicated and provides verified results.
Furthermore, after the evaluation, the CSG tree is reduced to a CSG tree that is equivalent to the original CSG tree within the octree node [7]. To reduce the memory requirements, we perform only reductions that can be done without memory allocations for new CSG nodes. The CSG tree is only reduced if the resulting tree is a subtree of the previous CSG tree or the complement of one of the leaf nodes. The complement of a leaf node is created and double linked when the leaf node is created (see Fig. 2). This guarantees that the additional memory cost for the CSG information in an octree node is only that of a pointer into the CSG tree, which is important when dealing with octrees with hundreds of thousands of nodes. The subdivision process is enhanced by the new kind of leaf nodes. The number of gray nodes at a higher level of the new structure is, in general, much smaller than in the case of the classical octree structure. This is due to the fact that the number of CSG primitives per tree node approaches zero or one asymptotically as the resolution increases. From a certain subdivision level all gray nodes are located along the intersection points of the CSG primitives as Fig. 3 (b) illustrates.
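The effect of terminal gray nodes on the size of the tree can be explored with a small Python experiment. It is a simplified sketch under several assumptions made here: the object is the intersection of a sphere of radius 0.5 with the half-space $x \le 0.2$, ranges are computed without outward rounding, the root cell is $[-1, 1]^3$, and only nodes are counted instead of building the tree. None of the names come from the authors' C++ implementation.

```python
def sqr_range(lo, hi):
    """Range of x**2 for x in [lo, hi]."""
    if lo <= 0.0 <= hi:
        return 0.0, max(lo * lo, hi * hi)
    return min(lo * lo, hi * hi), max(lo * lo, hi * hi)


def sphere(box, r=0.5):
    """Range of x^2 + y^2 + z^2 - r^2 over the box."""
    parts = [sqr_range(lo, hi) for lo, hi in box]
    return sum(p[0] for p in parts) - r * r, sum(p[1] for p in parts) - r * r


def half_space(box, x_max=0.2):
    """Range of x - x_max over the box."""
    return box[0][0] - x_max, box[0][1] - x_max


PRIMITIVES = [sphere, half_space]   # the object is the intersection of all primitives


def classify(box):
    """Return the node color and the number of primitives still relevant in the box."""
    ranges = [f(box) for f in PRIMITIVES]
    if any(lo > 0.0 for lo, _ in ranges):
        return "white", 0                        # outside at least one primitive
    active = sum(1 for _, hi in ranges if hi >= 0.0)
    return ("black", 0) if active == 0 else ("gray", active)


def children(box):
    halves = [((lo, 0.5 * (lo + hi)), (0.5 * (lo + hi), hi)) for lo, hi in box]
    return [(hx, hy, hz) for hx in halves[0] for hy in halves[1] for hz in halves[2]]


def count_nodes(box, depth, max_depth, extended):
    color, active = classify(box)
    if color != "gray" or depth == max_depth:
        return 1
    if extended and active == 1:
        return 1                                 # terminal gray: one primitive remains
    return 1 + sum(count_nodes(c, depth + 1, max_depth, extended) for c in children(box))


root = ((-1.0, 1.0), (-1.0, 1.0), (-1.0, 1.0))
n_common = count_nodes(root, 0, 4, extended=False)
n_extended = count_nodes(root, 0, 4, extended=True)
print(n_common, n_extended)
```

Dropping a primitive whose range is entirely negative corresponds to the reduction of the CSG tree within a node: for an intersection, a child that fills the whole box does not influence the result there.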
E. Dyllong and C. Grimm
Fig. 2. A CSG object (a) and its CSG tree representation (b). For the box X, the CSG tree can be reduced to −B. For the box Y, the CSG tree cannot be reduced.
4 Implementation and Results
We have implemented the proposed data structure in an object-oriented style in the C++ programming language. We used C-XSC [8], a C++ library for scientific computing, for interval representations and calculations, and the OpenGL library for visualization. Figure 3 shows a common octree representation (a) and an extended octree representation (b) and (c) of the same CSG object. The object is an intersection of a sphere with radius 0.5 and a cube with side length 0.4. The common octree in (a) has 9,304 nodes; the octree in (b) and (c) has only 5,160 nodes. In (a) and (b), we have visualized only the non-terminal gray nodes of the highest octree depth, which need to be subdivided for the next approximation level. In (c), we also visualized those parts of the surface that are covered by terminal gray nodes. The octrees are of depth 5. If we consider a CSG object whose surface forms a two-manifold, the gray nodes of the common octree representation are an approximation of the two-dimensional surface of the object. This suggests that the number of octree nodes approximately quadruples with each subdivision step of the octree. This assumption is supported by experimental results. Table 1 shows the total number of nodes and the number of gray nodes at each level up to depth 9 of the common octree depicted in Fig. 3 (a).
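The quadrupling can be checked directly from the gray-node counts reported in Table 1. The short script below is only a sanity check on those published figures; the variable names are illustrative.

```python
# Gray-node counts for levels 0..9 of the common octree, as reported in Table 1.
gray = [1, 8, 56, 272, 992, 3680, 15200, 61688, 244856, 975920]

# Growth factor between consecutive levels.
ratios = [b / a for a, b in zip(gray, gray[1:])]
for level, ratio in enumerate(ratios, start=1):
    print(f"level {level}: x{ratio:.2f}")
```

After the first few coarse levels the factor settles close to 4, as expected for a structure that approximates a two-dimensional surface.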
Fig. 3. The gray nodes of an octree (a) and an extended octree (b) and (c)
A Reliable Extended Octree Representation of CSG Objects
In our extended octree, nodes that contain parts of the surface of only one primitive are not further subdivided since they are represented exactly as quadrics. Thus, the gray nodes approximate only the intersection area between the primitive surfaces. In the chosen examples, these intersections are one-dimensional objects. This suggests that the number of nodes in the extended octree approximately doubles with each subdivision step of the octree. This assumption is also supported by experimental results. Table 2 shows the number of different node types at different levels up to depth 11 of the extended octree depicted in Fig. 3 (b)-(c). The fourth column shows the number of gray nodes that intersect the surface of only one primitive and are not further divided. The fifth column shows the number of gray nodes that need to be further divided for the next octree level.

Table 1. The number of all nodes and gray nodes at different levels of the octree in Fig. 3 (a)

level   nodes       gray
0       1           1
1       8           8
2       64          56
3       456         272
4       2,360       992
5       9,304       3,680
6       35,064      15,200
7       141,464     61,688
8       573,280     244,856
9       2,287,272   975,920

Table 2. The number of different node types at different levels of the extended octree in Fig. 3 (b)-(c)

level   nodes     gray      terminal gray   non-terminal gray
0       1         1         0               1
1       8         8         0               8
2       64        56        0               56
3       456       272       80              192
4       1,800     752       272             480
5       5,160     1,856     944             912
6       11,544    5,072     3,248           1,824
7       24,312    11,600    7,760           3,840
8       51,192    20,984    13,304          7,680
9       104,952   40,136    25,064          15,072
10      210,456   93,272    63,560          29,712
11      418,440   198,032   137,648         60,384

5 Conclusions
In this paper we have shown a new technique for constructing a reliable octree model of a CSG object. The use of interval arithmetic guarantees that no part of the CSG object is ever missed in its hierarchical enclosure. The maximum recursion depth can be increased simply by adding information about the CSG object in the gray nodes of the corresponding octree. This concept allows us to start the update with the current CSG octree representation instead of transforming the object again from scratch. In comparison to the classical octree model, it significantly reduces the number of gray nodes, which determine the cost of the subdivision process. We are currently working on adapting our distance routines for the classical octrees to the CSG octrees. Furthermore, we are going to investigate the running times required for distance calculation or collision detection in scenes containing octree enclosures of CSG objects and their corresponding CSG octrees.
References
1. Alefeld, G., Herzberger, J.: Introduction to Interval Computations. Academic Press, New York (1983)
2. Brunet, P., Navazo, I.: Solid representation and operation using extended octrees. ACM Transactions on Graphics 9(2), 170–197 (1990)
3. Carlbom, I., Chakravarty, I., Vanderschel, D.: A Hierarchical Data Structure for Representing the Spatial Decomposition of 3D Objects. IEEE Computer Graphics and Applications 5(4), 24–31 (1985)
4. Duff, T.: Interval arithmetic and recursive subdivision for implicit functions and constructive solid geometry. Computer Graphics 26(2), 131–138 (1992)
5. Dupont, L., Lazard, S., Petitjean, S.: Towards the robust intersection of implicit quadrics. In: Proc. of Workshop on Uncertainty in Geometric Computations, Sheffield, UK (2001)
6. Dürst, M.J., Kunii, T.L.: Integrated polytrees: a generalized model for integrating spatial decomposition and boundary representation. Technical Report 88-002, Department of Information Science, Faculty of Science, University of Tokyo (1988)
7. Dyllong, E., Grimm, C.: Verified Adaptive Octree Representations of Constructive Solid Geometry Objects. In: Simulation und Visualisierung 2007, pp. 223–235. SCS Publishing House e.V., San Diego, Erlangen (2007)
8. Hammer, R., Hocks, M., Kulisch, U., Ratz, D.: C++ Toolbox for Verified Computing. Basic Numerical Problems. Springer, Berlin (1995)
9. Hoffmann, C.M.: Geometric and Solid Modeling. Morgan Kaufmann, San Francisco (1989)
10. Lazard, S., Peñaranda, L.M., Petitjean, S.: Intersecting quadrics: an efficient and exact implementation. Computational Geometry 35(1-2), 74–99 (2006)
11. Levin, J.: Mathematical models for determining the intersections of quadric surfaces. Computer Graphics and Image Processing 11(1), 73–87 (1979)
12. Lennerz, C.: Distance Computation for Extended Quadratic Complexes. PhD Thesis, Faculty of Computer Science, University of Saarland (2005)
13. Lennerz, C., Schömer, E.: Efficient distance computation for quadratic curves and surfaces. In: Proc. of IEEE Geometric Modeling and Processing, Los Alamitos, USA, pp. 60–69 (2002)
14. Miller, J.R.: Geometric Approaches to Nonplanar Quadric Surface Intersection Curves. ACM Transactions on Graphics 6(4), 274–307 (1987)
15. Moore, R.: Interval Analysis. Prentice Hall, Englewood Cliffs (1966)
16. Ratschek, H., Rokne, J.: Geometric computations with interval and new robust methods: applications in computer graphics, GIS and computational geometry. Horwood Publishing, Chichester (2003)
17. Samet, H.: The Design and Analysis of Spatial Data Structures. Addison-Wesley Publishing Company, Reading (1990)
18. Snyder, J.M.: Generative Modeling for Computer Graphics and CAD: Symbolic Shape Design Using Interval Analysis. Academic Press, San Diego (1992)
19. Wyvill, G., Kunii, T.L., Shirai, Y.: Space division for ray tracing in CSG. IEEE Computer Graphics and Applications 6(4), 28–34 (1986)
Efficient Ray Tracing Using Interval Analysis
Jorge Flórez, Mateu Sbert, Miguel A. Sainz, and Josep Vehí
Institut d'Electrònica, Informàtica i Automàtica, Universitat de Girona, Girona 17071, Spain
{jeflorez,vehi}@eia.udg.edu, {mateu,sainz}@ima.udg.edu
Abstract. Ray tracing of implicit surfaces suffers from reliability problems: when an implicit surface is ray traced without guaranteed methods, thin details of the surface are not rendered correctly. The current guaranteed methods, although simple to implement, are slower than non-guaranteed methods. This paper presents a guaranteed method that uses interval analysis to obtain extra information about the surfaces. This information makes it possible to identify complicated regions in which more work has to be done to obtain antialiased images, while avoiding unnecessary work in less problematic areas. The method reduces the time required to render implicit surfaces and produces images of the same quality as methods in which a fixed number of rays is traced per pixel.
1 Introduction
In a ray tracing process (see figure 1), rays are cast from a view point (point s) toward a surface, each crossing a point (point p) in an image plane. If the ray intersects the surface, the scene illumination color is calculated and assigned to that point in the plane. This technique produces effects that cannot easily be obtained by raster methods, such as shadows, reflections and refractions. It has been applied to render implicit surfaces, but thin features can disappear in some special surfaces when the floating-point arithmetic of the computer is used.

Fig. 1. Ray tracing process

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1351–1360, 2008. © Springer-Verlag Berlin Heidelberg 2008

There are two main techniques used to reduce those problems in ray tracing. The first comprises the methods based on Lipschitz constants, which include the techniques of LG surfaces [1] and sphere tracing [2]. The drawback of these methods is that they require prior calculations to derive the Lipschitz constants for every surface to be rendered. The second technique is interval arithmetic [3,4,5,6], in which intervals are considered instead of point values to find the roots during the intersection test. Those methods can be applied to a large variety of implicit surfaces, but their main drawback is that they require too much computation time to obtain the visualization. The problem arises when antialiasing techniques are required to improve the quality of the visualization, because more rays have to be traced to determine the correct color of the pixel. In computer graphics, aliasing refers to jagged edges that appear in the visualization of surfaces. Its principal manifestation is the high contrast of the border color over the background color. Aliasing also occurs when small objects are not rendered because the rays miss them. Figure 2 summarizes this problem. The surface generated without guaranteed methods shows an unconnected part (figure 2a). Using a guaranteed algorithm, the surface is rendered correctly (figure 2b). However, the surface in figure 2a is rendered in 91 seconds while figure 2b takes 137 seconds.
Fig. 2. Comparison between (a) ray tracing without guaranteed methods (generated in 91 seconds), and (b) interval ray tracing (generated in 137 seconds). The surface was generated with 100x100 pixels and four rays per pixel.
Another problem is that acceleration techniques such as bounding boxes cannot be used in all cases to improve the computation times, because most of the guaranteed methods require calculation of the derivatives, and not all implicit surfaces are easily differentiable [7]. In this paper, we propose a faster and reliable approach to ray tracing implicit surfaces, based on interval arithmetic. The method has the following characteristics:
– The proposed approach uses interval arithmetic to identify areas in which more rays have to be traced to reduce aliasing. Other areas require fewer traced rays, and therefore the calculation time is reduced.
– The method reduces the number of primary rays by studying the variation of the surface in a region (section 3), and the number of secondary rays by determining whether a region is in a shadow (section 4).
– Other algorithms based on interval arithmetic require derivatives; the one presented in this paper does not. For that reason, it can be applied to any kind of implicit surface.
– The method produces antialiased images of the same quality as images obtained with other antialiasing methods, but in less computation time (section 5).
2 Mathematical Preliminaries
According to figure 1, let $r(t) = s + t(p - s)$, $t > 0$, be the parametric representation of the ray, where $s$ is the origin point (also called view point or eye position), $p - s$ is the direction of the ray and $t$ represents the distance from the origin in the direction of the ray. The intersection between a ray and an implicit surface defined by $f(x, y, z) = 0$ is:

$$F(t) = f\bigl(s_x + t(x_p - s_x),\ s_y + t(y_p - s_y),\ s_z + t(z_p - s_z)\bigr) = 0 \tag{1}$$

where $(s_x, s_y, s_z)$ are the coordinates of the view point and $(x_p, y_p, z_p)$ are the coordinates of the point in the plane. In equation 1, the parameter $t$ can be replaced by an interval parameter $T$ to perform evaluations over interval sections of the ray instead of points. This was first used in [6] to isolate intervals with roots during the intersection test. In [4] two new interval parameters are added to the inclusion function. In that approach, intervals for the $x_p$ and $y_p$ values of the screen are taken into account, so instead of evaluating a point in the screen, an evaluation of an area is performed. The inclusion function including the new interval values is defined as follows:

$$F(X_p, Y_p, T) = f\bigl(s_x + T(X_p - s_x),\ s_y + T(Y_p - s_y),\ s_z + T(z_p - s_z)\bigr), \quad T = [0, \infty] \tag{2}$$

Figure 3 shows the shape formed by the new inclusion function. In practice, the interval $T$ is taken between zero and a value big enough for the ray to reach all the objects in the scene. If the evaluation of the inclusion function defined above with an interval value of $T$ contains zero, it is possible that the interval $T$ contains a root. If the result does not contain zero, it is guaranteed that the interval value of $T$ does not contain roots. This function is used to evaluate many rays simultaneously, but only works as a rejection test, that is, to detect regions without roots [4].

Fig. 3. Shape defined by equation 2
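The rejection test of equation 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [4]: the `Interval` class, the helper names `inclusion_function` and `may_hit`, the test scene (a unit sphere, an eye at $(0, 0, -5)$ and an image plane at $z_p = -4$) and the finite bound standing in for $[0, \infty]$ are all hypothetical, and ordinary floating-point arithmetic is used instead of outward rounding. Because a single evaluation over a wide $T$ overestimates the range, the sketch bisects $T$ until the test rejects or a minimum width is reached.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, o):
        return Interval(self.lo + o.lo, self.hi + o.hi)

    def __sub__(self, o):
        return Interval(self.lo - o.hi, self.hi - o.lo)

    def __mul__(self, o):
        p = (self.lo * o.lo, self.lo * o.hi, self.hi * o.lo, self.hi * o.hi)
        return Interval(min(p), max(p))

    def sqr(self):
        a, b = abs(self.lo), abs(self.hi)
        lo = 0.0 if self.lo <= 0.0 <= self.hi else min(a, b) ** 2
        return Interval(lo, max(a, b) ** 2)

    def contains_zero(self):
        return self.lo <= 0.0 <= self.hi


def pt(v):
    return Interval(v, v)


def inclusion_function(f, s, zp):
    """Build F(Xp, Yp, T) of equation 2 for eye s and image-plane depth zp."""
    sx, sy, sz = (pt(v) for v in s)

    def F(Xp, Yp, T):
        return f(sx + T * (Xp - sx), sy + T * (Yp - sy), sz + T * (pt(zp) - sz))
    return F


def may_hit(F, Xp, Yp, T, min_width=1e-3):
    """False means: no ray through the screen area Xp x Yp has a root in T."""
    if not F(Xp, Yp, T).contains_zero():
        return False                      # rejection test succeeded
    if T.hi - T.lo < min_width:
        return True                       # a root cannot be excluded
    mid = 0.5 * (T.lo + T.hi)
    return (may_hit(F, Xp, Yp, Interval(T.lo, mid), min_width)
            or may_hit(F, Xp, Yp, Interval(mid, T.hi), min_width))


def unit_sphere(X, Y, Z):
    return X.sqr() + Y.sqr() + Z.sqr() - pt(1.0)


F = inclusion_function(unit_sphere, s=(0.0, 0.0, -5.0), zp=-4.0)
T0 = Interval(0.0, 10.0)                  # finite stand-in for [0, infinity]
pixel = Interval(-0.05, 0.05)
print(may_hit(F, pixel, pixel, T0))                 # True
print(may_hit(F, Interval(2.0, 2.1), pixel, T0))    # False
```

A `True` result is only a "possible root"; a `False` result is the guaranteed statement, which matches the role of equation 2 as a rejection test.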
3 Identification of Aliasing Regions
Figure 4 shows an example of aliasing problems. Figure 4a shows an image in which only one ray is traced per pixel. Figure 4b shows the same image with nine rays traced per pixel. The difference between the two images (the aliasing regions) appears at the borders, as figure 4c shows.
Fig. 4. Differences between a) an image rendered with one ray per pixel and b) one rendered with 9 rays per pixel. In c) the difference between a) and b) is shown.
This section introduces an algorithm to detect those aliasing areas using interval arithmetic. It is based on the creation of an adaptive grid with information about regions in which no ray intersects the surface (outside regions), regions in which all the rays intersect the surface (inside regions), and finally, regions with both intersecting and non-intersecting rays (undefined regions). The process starts by subdividing the screen space over the x and y values to generate boxes. The values of X and Y (which are intervals) are substituted into equation 2, which represents a set of rays instead of an individual ray. With the values of X and Y, a branch-and-bound over the parameter T (which is also an interval) is performed. If the evaluation of the function F(Xp, Yp, T) does not contain zero for any of the values of T, then no rays intersect the implicit surface, and the whole evaluated area of the image plane is shaded with the background color. To detect inside regions, the lowest and highest values of the interval T (infimum and supremum) at every subdivision step are used to evaluate equation 2. After that, the intersection of those two intervals is calculated. The intersection of two intervals is a new interval whose infimum is the maximum of the infimums of the two intervals and whose supremum is the minimum of their supremums. This operation is also performed with the intersection obtained in the previous subdivision steps. If the resulting intersection contains zero, the box currently evaluated represents an inside region.
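The interval intersection used for the inside test can be written in a few lines. The sketch below represents intervals as `(inf, sup)` pairs; the values of `Ra` and `Rb` are made up for illustration, and returning `None` for an empty intersection is an assumption of this sketch, since the text does not say how that case is handled.

```python
def intersect(a, b):
    """Intersection of two intervals given as (inf, sup) pairs; None if empty."""
    if a is None or b is None:
        return None
    lo = max(a[0], b[0])   # maximum of the infimums
    hi = min(a[1], b[1])   # minimum of the supremums
    return (lo, hi) if lo <= hi else None


def contains_zero(iv):
    return iv is not None and iv[0] <= 0.0 <= iv[1]


# Hypothetical enclosures of F at the infimum and supremum of the current T.
Ra = (-0.3, 0.2)
Rb = (-0.1, 0.5)
running = (float("-inf"), float("inf"))   # intersection carried between steps

running = intersect(intersect(Ra, Rb), running)
print(running)                 # (-0.1, 0.2)
print(contains_zero(running))  # True
```

The `running` value plays the role of the intersection accumulated over previous subdivision steps.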
Only the regions with too much variation in the surface have aliasing. The estimator used to measure this variation is the distance between the first and the last intersection point of the set of rays (figure 5a). If that distance is bigger than a threshold, the region must be visualized using an antialiasing process (in this paper, the threshold used is 0.5). Other inside regions can be traced using only one ray per pixel. The variation in inside regions depends on the direction of the view ray. In figure 5b, the view ray from s1 has a bigger variation than the one from s2, although both have the same intersection points. The first and the last intersection points are calculated when an inside region is detected. The current interval value of the parameter T is sampled at many points (in the current implementation, 50 points). Every point is evaluated and, if the result is greater than 0, the infimum of the result is saved in a vector. If the result is less than 0, the supremum is saved in another vector. After that, the nearest values of the two vectors are taken to calculate the distance, and also to create the final interval value of T. The distance is estimated along the central ray of the set (given by the midpoints of X and Y). The value of T, which represents the set of values of t for all the rays in the traced set, is used to determine the regions covered by shadows in the next section.
Fig. 5. Variation of the surface intersected by the set of rays. a) Distance between first and last intersection. b) Intersection points from two different view points.
If the subdivision process over the parameter T reaches machine precision, the box is subdivided and the process is started again over the two new boxes. The subdivision of boxes continues until a threshold for the size of the box is reached. In that case, the box is classified as an undefined region. Undefined regions are also traced using an antialiasing process. The algorithm is presented in figure 6.
4 Shadow Regions
The algorithm presented in section 3 works with primary rays (which are cast from the view point). However, a ray tracing process also involves several kinds of secondary rays (shadow rays, reflections or refractions). For every view ray, secondary rays are traced, and they have the same computational cost as primary rays.
This section shows how the algorithm can be improved to work with shadow rays. The algorithm of section 3 does not detect when a shadow crosses an inside region. In that case, the region is ray traced using only one ray, but the shadow could produce aliasing, especially in highly illuminated regions. To detect shadows, a new ray is traced from the surface to the light source. In this case, the ray represents a set of shadow rays, as shown in figure 7.
Subdivide screen in boxes, add them to ListBox
For every Box in ListBox
    If size(Box) = minimum size ⇒ Undefined Box
    intersection = [−∞, ∞]
    add [0, ∞] to ListT
    For every T in ListT
        If width(T) reaches machine precision
            Subdivide Box ⇒ add to ListBox
            Take next Box
        End If
        If 0 ∈ F(Xp, Yp, T)
            Ra = F(Xp, Yp, T.Inf)
            Rb = F(Xp, Yp, T.Sup)
            max = Max(Ra.Inf, Rb.Inf, intersection.Inf)
            min = Min(Ra.Sup, Rb.Sup, intersection.Sup)
            intersection = [max, min]
            If 0 ∈ intersection
                Subdivide T in 50 Points
                For every point in Points
                    If F(Xp, Yp, point) > 0 add T.Inf to ListA
                    If F(Xp, Yp, point) < 0 add T.Sup to ListB
                End For
                Find nearest values between ListA and ListB
                Calculate distance
                If distance <= threshold
                    Box is Inside ⇒ Take next Box
                Else
                    Subdivide Box ⇒ add to ListBox
                    Take next Box
                End If
            End If
            Bisect T ⇒ add to ListT
        Else
            Remove T from ListT
        End If
    End For
    Box is Outside
End For

Fig. 6. Algorithm to identify regions with different levels of aliasing
Fig. 7. Shadow ray from the intersecting points of a set of view rays
The new ray is traced once the algorithm detects an inside region. Given the final value of the interval $T$ calculated for the inside region, the intersection points are enclosed by $P = s + T(p - s)$, with interval components $(X_p, Y_p, Z_p)$. The light coordinates $(l_x, l_y, l_z)$ represent the new origin point. Using the previous values, the shadow ray is defined as $r(T) = l + T(P - l)$. The inclusion function for the shadow ray is:

$$F(X_p, Y_p, T) = f\bigl(l_x + T(X_p - l_x),\ l_y + T(Y_p - l_y),\ l_z + T(Z_p - l_z)\bigr), \quad T = [0, \infty] \tag{3}$$

The shadow ray is tested against all the surfaces in the model. If an intersection with any surface is detected, the inside region may be in a shadow. In that case, the region is subdivided and the new boxes are evaluated using the algorithm described in section 3. A box must be ray traced with an antialiasing method when it has the minimum size and may also be in a shadow. Other inside regions are traced using only one ray per pixel. In a traditional ray tracing method, for every primary ray, a secondary one is traced to detect shadows. In the algorithm described here, only some regions have to be tested for shadows. Figure 8a shows the results of the process described in this paper. The red boxes are inside regions that do not need antialiasing. The blue boxes are outside regions, which are shaded with the background color. The green boxes are undefined regions, which need an antialiasing process. The yellow boxes may be covered by shadows; secondary rays are traced only in these regions. Figure 8b shows the surfaces after the ray tracing process.

Fig. 8. An example of the results of the algorithm described in this paper. a) Identification of regions with different levels of aliasing. b) Ray traced surface with shadows.
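The shadow test of equation 3 can be sketched in the same way as the primary-ray test. The code below is an illustration under several assumptions that are not in the paper: the `Interval` class and helper names are hypothetical, the occluder is a sphere of radius 0.5 at the origin, the light is at $(0, 5, 0)$, the parameter is restricted to $T = [0, 1]$ so that only the segment between the light and the surface points is examined, and no outward rounding is applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, o):
        return Interval(self.lo + o.lo, self.hi + o.hi)

    def __sub__(self, o):
        return Interval(self.lo - o.hi, self.hi - o.lo)

    def __mul__(self, o):
        p = (self.lo * o.lo, self.lo * o.hi, self.hi * o.lo, self.hi * o.hi)
        return Interval(min(p), max(p))

    def sqr(self):
        a, b = abs(self.lo), abs(self.hi)
        lo = 0.0 if self.lo <= 0.0 <= self.hi else min(a, b) ** 2
        return Interval(lo, max(a, b) ** 2)

    def contains_zero(self):
        return self.lo <= 0.0 <= self.hi


def pt(v):
    return Interval(v, v)


def shadow_inclusion(f, light):
    """Build the shadow-ray inclusion function of equation 3 for a light l."""
    lx, ly, lz = (pt(v) for v in light)

    def F(Xp, Yp, Zp, T):
        return f(lx + T * (Xp - lx), ly + T * (Yp - ly), lz + T * (Zp - lz))
    return F


def may_be_shadowed(F, Xp, Yp, Zp, T, min_width=1e-3):
    """False means: no shadow ray towards the box Xp x Yp x Zp hits f in T."""
    if not F(Xp, Yp, Zp, T).contains_zero():
        return False
    if T.hi - T.lo < min_width:
        return True
    mid = 0.5 * (T.lo + T.hi)
    return (may_be_shadowed(F, Xp, Yp, Zp, Interval(T.lo, mid), min_width)
            or may_be_shadowed(F, Xp, Yp, Zp, Interval(mid, T.hi), min_width))


def occluder(X, Y, Z):
    # Sphere of radius 0.5 centered at the origin (hypothetical test surface).
    return X.sqr() + Y.sqr() + Z.sqr() - pt(0.25)


F = shadow_inclusion(occluder, light=(0.0, 5.0, 0.0))
T0 = Interval(0.0, 1.0)
small = Interval(-0.1, 0.1)
floor = pt(-1.0)
print(may_be_shadowed(F, small, floor, small, T0))               # True
print(may_be_shadowed(F, Interval(3.0, 3.1), floor, small, T0))  # False
```

Only regions for which the test returns `True` would need further subdivision and secondary rays; for the others, shadows are excluded for the whole set of rays at once.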
5 Experimentation and Results
The method introduced in this paper is based on a prior identification of areas with different levels of aliasing (section 3). This pre-processing is similar to a uniform spatial subdivision process, which is used to accelerate the entire rendering process. In this kind of process, every box has pointers to the surfaces that actually cross it. In that manner, when a ray is tested over one of the pixels in that box, the intersection test is performed with a limited set of surfaces instead of all the surfaces defined for the model. In the case described in this paper, the identification of surfaces crossing regions is performed in addition to the process used to locate aliasing regions, but of course it only helps when many surfaces are rendered. In the case of individual surfaces, this process costs extra time, but the computing cost is not significant in comparison with the total rendering time. Figure 9 shows 31 spheres rendered in the same scene (9 rays per pixel in aliasing regions). The rendering time is 16 minutes, but the preprocessing time (figure 9a) is only 16 seconds, which is less than 2% of the total time. In order to test the efficiency of our algorithm, the scene of figure 9b was also rendered using a branch-and-bound interval ray tracing process with 9 rays per pixel. The image obtained was identical to the image rendered with our method. A regular grid was also used to accelerate the process. In this case, the rendering time was 38 minutes (the uniform grid took 14 seconds). This means that our method is 2.3 times faster in this case. In this example, our method needs 188641 rays (primary and secondary) to render the image, while traditional interval ray tracing uses 394045.

Fig. 9. A set of spheres rendered with the algorithm described in this paper

The method was also tested using the surfaces of figure 10 (300x300 pixels). A desktop computer (Pentium 4, 2.4GHz) was used in the experimentation. The ray tracing method used is an interval ray tracing algorithm based on a branch-and-bound strategy, which can be both reliable and fast [7,4]. However, once the algorithm has detected the different regions, any ray tracing method can be used, although other methods can cause a loss of reliability. The antialiasing algorithm selected was supersampling, with 9 rays per pixel in the aliasing regions. There are other antialiasing methods, such as adaptive sampling [8], cone tracing [9] and beam tracing [10], that could be used. The main disadvantage of those proposals is that they require computationally complex intersection tests, which can complicate the calculations when interval ray tracing algorithms are used. The antialiasing strategy is applied to the whole image for the interval ray tracing algorithm, and only in undefined and shadow regions in our algorithm. The quality of the obtained images was the same in all the cases.
Fig. 10. Tested Surfaces
Table 1 shows the results of the test. Columns 2 and 3 show the time and number of rays traced by our algorithm. Columns 4 and 5 show the time and rays traced by the interval ray tracing method. Our method is more than twice as fast in all the cases.

Table 1. Results of the experimentation (times in minutes)

              Our method             Previously proposed techniques
Surface       Time   Rays traced     Time   Rays traced
Blob          11.6   58273           35.6   414406
Orthocircle   8.8    68631           21.1   375003
Drop          35     460495          106    1620000
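The improvement factors follow directly from the figures in Table 1. The short script below only recomputes ratios from those reported values; the dictionary layout and names are illustrative.

```python
# (time in minutes, rays traced) for each surface, as reported in Table 1.
results = {
    "Blob":        {"ours": (11.6, 58273),  "interval": (35.6, 414406)},
    "Orthocircle": {"ours": (8.8, 68631),   "interval": (21.1, 375003)},
    "Drop":        {"ours": (35.0, 460495), "interval": (106.0, 1620000)},
}

speedups = {}
for name, r in results.items():
    time_ratio = r["interval"][0] / r["ours"][0]
    ray_ratio = r["interval"][1] / r["ours"][1]
    speedups[name] = time_ratio
    print(f"{name}: {time_ratio:.1f}x faster, {ray_ratio:.1f}x fewer rays")
```

The time ratios lie between roughly 2.4 and 3.1, while the reduction in the number of rays is larger than the reduction in time.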
6 Conclusions
In this paper, we introduced a method based on interval analysis to improve the efficiency of interval ray tracing of implicit surfaces; low efficiency is the principal weakness of guaranteed ray tracing algorithms. The improvement is based on reducing the number of intersection tests by detecting the regions with aliasing, which saves unnecessary calculations in other regions. The experimentation shows that the improvement is more than twofold for the tested surfaces. Also, the presented method can be applied to any kind of implicit surface, or even to scenes composed of combinations of them. As future work, we plan to extend the study performed on shadow rays to include reflection and refraction rays. Also, the efficiency of the algorithm can be improved by means of a GPU implementation and parallel processing.
Acknowledgements
This work has been partially funded by the European Regional Development Fund and the Spanish Government (Ministerio de Ciencia y Tecnología) through the co-ordinated research projects DPI2004-07167-C02-02, DPI2005-08668-C03-02, TIN2004-07451-C03-01, and TIN-2007-68066-C04-01 and by the government of Catalonia through SGR00296.
References
1. Kalra, D., Barr, A.: Guaranteed ray intersection with implicit surfaces. Computer Graphics (Siggraph proceedings) 23, 206–297 (1989)
2. Hart, J.C.: Sphere tracing: A geometric method for the antialiased ray tracing of implicit surfaces. Siggraph 93 Course Notes 12(10), 527–545 (1997)
3. Capriani, O., Hvidegaard, L., Mortensen, M., Schneider, T.: Robust and efficient ray intersection of implicit surfaces. Reliable Computing 1(6), 9–21 (2000)
4. Flórez, J., Sbert, M., Sainz, M.A., Vehí, J.: Improving the interval ray tracing of implicit surfaces. In: Nishita, T., Peng, Q., Seidel, H.-P. (eds.) CGI 2006. LNCS, vol. 4035, pp. 655–664. Springer, Heidelberg (2006)
5. de Cusatis, A., de Figueredo, L., Gatas, M.: Interval methods for ray casting implicit surfaces with affine arithmetic. In: Proceedings of SIBGRAPH (1999)
6. Mitchell, D.: Robust ray intersection with interval arithmetic. In: Proceedings on Graphics interface 1990, pp. 68–74 (1990)
7. Sanjuan-Estrada, J., Casado, L., García, I.: Reliable algorithms for ray intersection in computer graphics based on interval arithmetic. In: XVI Brazilian Symposium on Computer Graphics and Image Processing, pp. 35–44 (2003)
8. Whitted, T.: An improved illumination model for shaded display. Communications of the ACM 23(6), 343–349 (1980)
9. Amanatides, J.: Ray tracing with cones. Computer Graphics 18(3), 129–135 (1984)
10. Heckbert, P., Hanrahan, P.: Beam tracing polygonal objects. Computer Graphics 18(3), 119–127 (1984)
A Survey of Interval Runge–Kutta and Multistep Methods for Solving the Initial Value Problem
Karol Gajda¹, Małgorzata Jankowska², Andrzej Marciniak³,⁴, and Barbara Szyszka¹
1 Poznan University of Technology, Institute of Mathematics, Piotrowo 3a, 60-965 Poznań, Poland [email protected], [email protected]
2 Poznan University of Technology, Institute of Applied Mechanics, Piotrowo 3, 60-965 Poznań, Poland [email protected]
3 Poznan University of Technology, Institute of Computing Science, Piotrowo 2, 60-965 Poznań, Poland
4 Adam Mickiewicz University, Faculty of Mathematics and Computer Science, Umultowska 87, 61-614 Poznań, Poland [email protected]
Abstract. The paper deals with a number of one-step and multistep interval methods developed by our team during the last decade. We present implicit interval methods of Runge–Kutta type, interval versions of symplectic Runge–Kutta methods and interval multistep methods of Adams–Bashforth, Adams–Moulton, Nyström and Milne–Simpson types.
Keywords: the initial value problem, one-step interval methods, symplectic interval methods, multistep interval methods, floating-point interval arithmetic.
1 Introduction
Many physical problems are described in the form of the initial value problem (IVP):

$$y' = f(t, y), \quad y(0) = y_0, \tag{1}$$

where $y = y(t) \in \mathbb{R}^N$. From the theory of differential equations it is known that under some simple assumptions the solution of the problem (1) exists and is unique. If we cannot find the solution of the problem (1) analytically, we use numerical methods. Unfortunately, conventional (traditional) numerical methods give solutions which contain the error of the method. Moreover, applying numerical methods on computers we deal with data representation errors and errors caused by floating-point arithmetic. Interval methods for solving the initial value problem in floating-point interval arithmetic give solutions in the form of intervals which contain all possible numerical errors, i.e. representation errors, rounding errors, and errors of the methods.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1361–1371, 2008. © Springer-Verlag Berlin Heidelberg 2008

In a number of our previous papers we have presented implicit interval methods of Runge–Kutta type [6,23,27,28], interval methods of symplectic Runge–Kutta type [5], explicit interval multistep methods of Adams–Bashforth and Nyström types [16,23,24,26], implicit interval multistep methods of Adams–Moulton and Milne–Simpson types [14,15,23,24,26], interval predictor–corrector methods based on Adams formulas [17], and interval versions of the backward difference methods [12,13]. For all kinds of interval methods considered we have proved theorems stating that the exact solution of the initial value problem belongs to the intervals obtained. We have also estimated the widths of the interval solutions obtained by all of the interval methods constructed. The methods listed above belong to a large group of numerical algorithms for solving the IVP in the area of verified computing. The methods based on Taylor series were first proposed by Moore [29] and continued by Krückeberg [20], Eijgenraam [3], Lohner [21], Corliss and Rihm [31,32]. Explicit interval methods of Runge–Kutta type and explicit interval multistep methods of Adams–Bashforth type have been introduced by Kalmykov, Shokin and Juldashev in [19,34]. Another approach is represented by a method based on high-order Taylor series proposed by Berz and Makino (see e.g. [1,10]). Finally, there is an implicit interval Hermite–Obreschkoff (IHO) method for the IVP with predictor and corrector phases developed by Jackson and Nedialkov [30].

In this paper we summarize our previous considerations. In Sec. 6 we present a number of conclusions on our work so far. In the next sections we use the following notations and assumptions:
– $\Delta_t$ and $\Delta_y$ – sets in which the function $f(t, y)$ is determined, i.e. $\Delta_t = \{t \in \mathbb{R} : 0 \le t \le a\}$, $\Delta_y = \{y = (y_1, y_2, \ldots, y_N)^T \in \mathbb{R}^N : \underline{b}_i \le y_i \le \overline{b}_i,\ i = 1, 2, \ldots, N\}$,
– $F(T, Y)$, $\Psi(T, Y)$ and $\Psi(T, Y)$ – interval extensions of $f(t, y)$, $\psi(t, y) = f^{(k)}(t, y) \equiv y^{(k+1)}(t)$ and $\psi(t, y) = f^{(k+1)}(t, y) \equiv y^{(k+2)}(t)$, respectively,
– $\Phi(T, Y)$ – an interval extension of $\varphi(t, y)$, where $\varphi(t, y)$ is a function occurring in the local truncation error of step $k + 1$ for a Runge–Kutta method (explicit and implicit) of order $p$, i.e.

$$r_{k+1}(h) = \varphi(t_k, y(t_k))\, h^{p+1} + O(h^{p+2}) = r_{k+1}^{(p+1)}(0)\, \frac{h^{p+1}}{(p+1)!} + r_{k+1}^{(p+2)}(\Theta h)\, \frac{h^{p+2}}{(p+2)!},$$

where

$$0 < \Theta < 1, \quad \left| \frac{r_{k+1}^{(p+2)}(\Theta h)}{(p+2)!} \right| \le M, \tag{2}$$

(the function $\varphi(t, y)$ depends on the coefficients of the Runge–Kutta method and on partial derivatives of the function $f(t, y)$; its form is very complicated and cannot be written in general form for an arbitrary $p$ – see e.g. [2]),
– $F(T, Y)$ is determined and continuous for all $T \subset \Delta_t$ and $Y \subset \Delta_y$,
– $F(T, Y)$ is monotonic with respect to inclusion, i.e. $T_1 \subset T_2 \wedge Y_1 \subset Y_2 \Rightarrow F(T_1, Y_1) \subset F(T_2, Y_2)$,
– for each $T \subset \Delta_t$ and $Y \subset \Delta_y$ there exists a constant $L > 0$ such that $d(F(T, Y)) \le L\,(d(T) + d(Y))$, where $d(A)$ denotes the diameter of the interval $A$ (if $A = (A_1, A_2, \ldots, A_N)^T$, then $d(A)$ is defined as the maximum of $d(A_i)$, $i = 1, 2, \ldots, N$),
– $\Psi(T, Y)$, $\overline{\Psi}(T, Y)$ and $\Phi(T, Y)$ are determined for all $T \subset \Delta_t$ and $Y \subset \Delta_y$,
– $\Psi(T, Y)$, $\overline{\Psi}(T, Y)$ and $\Phi(T, Y)$ are monotonic with respect to inclusion.
2 Implicit Interval Methods of Runge–Kutta Type

For $t_0 = 0$ and $y_0 \in Y_0$, $m$-stage interval methods of Runge–Kutta type have been defined by the following formulas:

$$Y_n(t_0) = Y_n(0) = Y_0,$$
$$Y_n(t_{k+1}) = Y_n(t_k) + h \sum_{i=1}^{m} w_i K_{i,k}(h) + \left(\Phi(T_k, Y_n(t_k)) + [-\alpha, \alpha]\right) h^{p+1}, \quad k = 0, 1, \ldots, n - 1, \tag{3}$$

where for implicit methods we define

$$K_{i,k}(h) = F\Big(T_k + c_i h,\ Y_n(t_k) + h \sum_{j=1}^{m} a_{ij} K_{j,k}(h)\Big), \tag{4}$$

$\alpha = M h_0$, and where $h_0$ is a given number, $0 < h \le h_0$, and $a_{ij}$, $c_i$ and $w_i$ denote the coefficients of the method. Explicit interval methods have been analyzed by Shokin (see [34]) and presented in the form (3) with the following formulas:

$$K_{1,k}(h) = F(T_k, Y_n(t_k)), \qquad K_{i,k}(h) = F\Big(T_k + c_i h,\ Y_n(t_k) + h \sum_{j=1}^{i-1} a_{ij} K_{j,k}(h)\Big). \tag{5}$$

In this section we focus on implicit interval methods up to fourth order. If $h_0$ denotes a given number, the step size $h$ of the method (3)–(4), where $0 < h \le h_0$, is calculated from the formula

$$h = \frac{\eta_m^*}{n},$$
where

$$\eta_m^* = \min\{\eta_0, \eta_1, \ldots, \eta_m\}.$$

For $Y_0 \subset \Delta_y$ and $y_0 \in Y_0$ the number $\eta_0$ should fulfill the condition

$$Y_0 + \eta_0 \sum_{i=1}^{m} w_i F(\Delta_t, \Delta_y) + \left(\Phi(\Delta_t, \Delta_y) + [-\alpha, \alpha]\right) h_0^{p} \subset \Delta_y,$$

and the numbers $\eta_1 > 0, \eta_2 > 0, \ldots, \eta_m > 0$ should be chosen in such a way that

$$Y_0 + \eta_i c_i F(\Delta_t, \Delta_y) \subset \Delta_y, \quad i = 1, 2, \ldots, m.$$

The interval $[0, \eta_m^*]$ is divided into $n$ parts by the points $t_k = kh$ ($k = 0, 1, \ldots, n$), and the intervals $T_k$ in the formulas (3)–(4) should satisfy the condition

$$t_k = kh \in T_k \subset [0, \eta_m^*].$$

From (4) it follows that in each step of the method we have to solve a (vector) interval equation of the form

$$Y = G(T, Y),$$

where

$$T \in I(\Delta_t) \subset I(\mathbb{R}), \quad Y = (Y_1, Y_2, \ldots, Y_N)^T \in I(\Delta_y) \subset I(\mathbb{R}^N), \quad G : I(\Delta_t) \times I(\Delta_y) \to I(\mathbb{R}^N).$$

If we assume that the function $G$ is a contraction mapping, then the corresponding iteration process follows from the fixed-point theorem.
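The fixed-point iteration mentioned above can be illustrated with a minimal scalar sketch. Everything specific here is an assumption made for illustration: the right-hand side $f(t, y) = -y$, the map $G(Y) = y_0 + h F(Y)$ (an implicit-Euler-like step, not one of the methods of this paper), the step size, and the representation of an interval as a `(lo, hi)` pair of floats without the outward rounding that a verified implementation would need.

```python
def g(y, y0=1.0, h=0.1):
    """Interval map G(Y) = y0 + h * F(Y) for the hypothetical f(t, y) = -y.

    Intervals are (lo, hi) pairs; multiplying by -h swaps the endpoints.
    """
    lo, hi = y
    return (y0 - h * hi, y0 - h * lo)


def fixed_point(g, start, tol=1e-14, max_iter=100):
    """Iterate Y <- G(Y) until both endpoints move by less than tol."""
    y = start
    for _ in range(max_iter):
        y_next = g(y)
        if abs(y_next[0] - y[0]) < tol and abs(y_next[1] - y[1]) < tol:
            return y_next
        y = y_next
    raise RuntimeError("no convergence; G may not be a contraction here")


# Start from a wide interval that contains the fixed point 1 / (1 + h).
enclosure = fixed_point(g, start=(0.0, 2.0))
print(enclosure)
```

Each iteration shrinks the width of the interval by the factor $h$, which is what the contraction assumption provides in this toy case.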
3 Interval Methods of Symplectic Runge–Kutta Type

Symplectic interval methods of Runge–Kutta type are implicit interval methods of Runge–Kutta type whose coefficients satisfy the relations (see e.g. [33]):

$$w_i a_{ij} + w_j a_{ji} - w_i w_j = 0 \quad \text{for } i, j = 1, 2, \ldots, m. \tag{6}$$

They are interesting because they preserve the symplecticness of Hamiltonian systems, and in interval versions they produce interval solutions which are guaranteed to contain the (unknown) exact solutions. In particular, we have the Hammer–Hollingsworth method:

$$Y_n(t_0) = Y_n(0) = Y_0,$$
$$Y_n(t_{k+1}) = Y_n(t_k) + \frac{h}{2}\left(K_{1,k}(h) + K_{2,k}(h)\right) + \left(\Phi(T_k, Y_n(t_k)) + [-\alpha, \alpha]\right) h^5, \quad k = 0, 1, \ldots, n - 1, \tag{7}$$

where

$$K_{1,k}(h) = F\left(T_k + h\left(\frac{1}{2} - \frac{\sqrt{3}}{6}\right),\ Y_n(t_k) + h\left(\frac{1}{4} K_{1,k}(h) + \left(\frac{1}{4} - \frac{\sqrt{3}}{6}\right) K_{2,k}(h)\right)\right),$$
$$K_{2,k}(h) = F\left(T_k + h\left(\frac{1}{2} + \frac{\sqrt{3}}{6}\right),\ Y_n(t_k) + h\left(\left(\frac{1}{4} + \frac{\sqrt{3}}{6}\right) K_{1,k}(h) + \frac{1}{4} K_{2,k}(h)\right)\right).$$
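As a quick numerical check, the coefficients of the Hammer–Hollingsworth method can be substituted into the symplecticness condition (6). The sketch below is only an illustration in ordinary floating-point arithmetic; the function name `symplectic_defect` is hypothetical, and the explicit Euler method is included as an assumed counterexample.

```python
from math import sqrt

# Coefficients read off from (7) and the formulas for K_{1,k}, K_{2,k}.
s3 = sqrt(3.0)
w = [0.5, 0.5]
a = [[0.25, 0.25 - s3 / 6.0],
     [0.25 + s3 / 6.0, 0.25]]


def symplectic_defect(w, a):
    """Largest |w_i a_ij + w_j a_ji - w_i w_j| over all i, j (condition (6))."""
    m = len(w)
    return max(abs(w[i] * a[i][j] + w[j] * a[j][i] - w[i] * w[j])
               for i in range(m) for j in range(m))


defect = symplectic_defect(w, a)
print(defect)  # zero up to rounding errors

# For comparison: the explicit Euler method (w = [1], a = [[0]]) violates (6).
euler_defect = symplectic_defect([1.0], [[0.0]])
```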
4 Explicit Interval Multistep Methods
Let us assume that $y(0) \in Y_0$, $Y_0 \subset \Delta_y$, and that the intervals $Y_i \subset \Delta_y$ such that $y(t_i) \in Y_i$, $i = 1, 2, \ldots, k - 1$, are known. The integer $k = 1, 2, \ldots$ denotes the number of method steps. We can obtain such $Y_i$, $i = 1, 2, \ldots, k - 1$, by applying an interval one-step method, for example one of the implicit interval methods of Runge–Kutta type (see Sec. 2 and Sec. 3). The explicit interval methods of Adams–Bashforth type (see [11,16,17,23]) are given by the formula

$$Y_n = Y_{n-1} + h \sum_{j=0}^{k-1} \gamma_j \nabla^j F_{n-1} + h^{k+1} \gamma_k \Psi_k, \quad n = k, k + 1, \ldots, m, \tag{8}$$

where $h = \xi/m$, $t_i = ih \in T_i$, $i = 0, 1, \ldots, m$, $F_{n-1} = F(T_{n-1}, Y_{n-1})$,

$$\nabla^j F_{n-1} = \sum_{m=0}^{j} (-1)^m \binom{j}{m} F_{n-1-m}, \quad j = 0, 1, \ldots, k - 1,$$
$$\gamma_0 = 1, \qquad \gamma_j = \frac{1}{j!} \int_0^1 s(s + 1) \cdots (s + j - 1)\, ds, \quad j = 1, 2, \ldots, k,$$

and

$$\Psi_k = \Psi\left(T_{n-1} + [-(k - 1)h, h],\ Y_{n-1} + [-(k - 1)h, h]\, F(\Delta_t, \Delta_y)\right).$$

In interval arithmetic, for $\beta_{kj}$ of the form

$$\beta_{kj} = (-1)^{j-1} \sum_{m=j-1}^{k-1} \binom{m}{j-1} \gamma_m, \quad j = 1, 2, \ldots, k,$$

we have

$$\sum_{j=0}^{k-1} \gamma_j \nabla^j F_{n-1} = \sum_{j=1}^{k} \beta_{kj} F_{n-j}.$$

Hence, we can give another formula of the explicit interval methods of Adams–Bashforth type which is equivalent to (8), namely:

$$Y_n = Y_{n-1} + h \sum_{j=1}^{k} \beta_{kj} F_{n-j} + h^{k+1} \gamma_k \Psi_k, \quad n = k, k + 1, \ldots, m. \tag{9}$$

In particular, for $k = 4$ from (8) or (9) we obtain the following method:

$$Y_n = Y_{n-1} + \frac{h}{24}\left(55 F_{n-1} - 59 F_{n-2} + 37 F_{n-3} - 9 F_{n-4}\right) + \frac{251}{720} h^5 \Psi_4, \tag{10}$$

where $\Psi_4 = \Psi\left(T_{n-1} + [-3h, h],\ Y_{n-1} + [-3h, h]\, F(\Delta_t, \Delta_y)\right)$.
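The coefficients in (10) can be reproduced from the definitions of $\gamma_j$ and $\beta_{kj}$ above. The following sketch uses exact rational arithmetic; the helper names (`rising_poly`, `integrate`, `gamma`, `beta`) are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction
from math import comb, factorial


def rising_poly(j):
    """Coefficients (lowest degree first) of s(s+1)...(s+j-1)."""
    coeffs = [Fraction(1)]
    for r in range(j):
        new = [Fraction(0)] * (len(coeffs) + 1)
        for d, c in enumerate(coeffs):
            new[d] += c * r      # contribution of the constant term r
            new[d + 1] += c      # contribution of the factor s
        coeffs = new
    return coeffs


def integrate(coeffs, lo, hi):
    """Exact integral of the polynomial over [lo, hi]."""
    lo, hi = Fraction(lo), Fraction(hi)
    return sum(c * (hi ** (d + 1) - lo ** (d + 1)) / (d + 1)
               for d, c in enumerate(coeffs))


def gamma(j):
    """gamma_j = (1/j!) * integral over [0, 1] of s(s+1)...(s+j-1)."""
    return integrate(rising_poly(j), 0, 1) / factorial(j)


def beta(k, j):
    """beta_kj = (-1)^(j-1) * sum_{m=j-1}^{k-1} C(m, j-1) * gamma_m."""
    return (-1) ** (j - 1) * sum(comb(m, j - 1) * gamma(m)
                                 for m in range(j - 1, k))


betas = [beta(4, j) for j in range(1, 5)]
print(betas, gamma(4))
```

For $k = 4$ this gives $55/24$, $-59/24$, $37/24$, $-9/24$ and $\gamma_4 = 251/720$, in agreement with (10).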
The explicit interval methods of Nyström type are defined as follows:

$$Y_n = Y_{n-2} + h \sum_{j=0}^{k-1} \nu_j \nabla^j F_{n-1} + h^{k+1}\left(\nu_k^* \Psi_k + \nu_k^{**} \Psi_k\right), \quad n = k, k + 1, \ldots, m, \tag{11}$$

where $F_{n-1} = F(T_{n-1}, Y_{n-1})$,

$$\nu_0 = 2, \qquad \nu_j = \frac{1}{j!} \int_{-1}^{1} t(t + 1) \cdots (t + j - 1)\, dt, \quad j = 1, 2, \ldots, k - 1,$$
$$\nu_k^* = \frac{1}{k!} \int_{-1}^{0} t(t + 1) \cdots (t + k - 1)\, dt, \qquad \nu_k^{**} = \frac{1}{k!} \int_{0}^{1} t(t + 1) \cdots (t + k - 1)\, dt,$$

and

$$\Psi_k = \Psi\left(T_{n-1} + [-(k - 1)h, h],\ Y_{n-1} + [-(k - 1)h, h]\, F(\Delta_t, \Delta_y)\right).$$

In (11) it is assumed that on the integration interval $[0, \xi]$ the intervals $Y_i$ such that $y(t_i) \in Y_i$ for $i = 0, 1, \ldots, k - 1$ are known, and that

$$h = \frac{\xi}{m}, \qquad t_i = ih \in T_i, \quad i = 0, 1, \ldots, m.$$

Let us note that we cannot write $(\nu_k^* + \nu_k^{**}) \Psi_k$ instead of $\nu_k^* \Psi_k + \nu_k^{**} \Psi_k$, because in general $|\nu_k^* + \nu_k^{**}|$ may be different from $|\nu_k^*| + |\nu_k^{**}|$. Since

$$\sum_{j=0}^{k-1} \nu_j \nabla^j F_{n-1} = \sum_{j=1}^{k} \delta_{kj} F_{n-j},$$

where

$$\delta_{kj} = (-1)^{j-1} \sum_{l=j-1}^{k-1} \binom{l}{j-1} \nu_l, \quad j = 1, 2, \ldots, k,$$

the formula (11) can be written in a more convenient form:

$$Y_n = Y_{n-2} + h \sum_{j=1}^{k} \delta_{kj} F_{n-j} + h^{k+1}\left(\nu_k^* \Psi_k + \nu_k^{**} \Psi_k\right), \quad n = k, k + 1, \ldots, m. \tag{12}$$

In particular, for $k = 4$ from (11) and (12) we have the following method:

$$Y_n = Y_{n-2} + \frac{h}{3}\left(8 F_{n-1} - 5 F_{n-2} + 4 F_{n-3} - F_{n-4}\right) + \frac{h^5}{720}\left(251 \Psi_4 - 19 \Psi_4\right), \tag{13}$$

where $\Psi_4 = \Psi\left(T_{n-1} + [-3h, h],\ Y_{n-1} + [-3h, h]\, F(\Delta_t, \Delta_y)\right)$.
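The coefficients of (13), and the remark that $\nu_k^* \Psi_k + \nu_k^{**} \Psi_k$ must not be merged into $(\nu_k^* + \nu_k^{**}) \Psi_k$, can be checked with a short exact computation. This is a sketch under stated assumptions: the helper names are hypothetical, intervals are plain `(lo, hi)` pairs of rationals, and the sample interval $[-1, 1]$ for $\Psi_4$ is made up for the illustration.

```python
from fractions import Fraction
from math import comb, factorial


def rising_poly(j):
    """Coefficients (lowest degree first) of t(t+1)...(t+j-1)."""
    coeffs = [Fraction(1)]
    for r in range(j):
        new = [Fraction(0)] * (len(coeffs) + 1)
        for d, c in enumerate(coeffs):
            new[d] += c * r
            new[d + 1] += c
        coeffs = new
    return coeffs


def scaled_integral(j, lo, hi):
    """(1/j!) * integral over [lo, hi] of t(t+1)...(t+j-1)."""
    lo, hi = Fraction(lo), Fraction(hi)
    total = sum(c * (hi ** (d + 1) - lo ** (d + 1)) / (d + 1)
                for d, c in enumerate(rising_poly(j)))
    return total / factorial(j)


def nu(j):
    return scaled_integral(j, -1, 1)


def delta(k, j):
    """delta_kj = (-1)^(j-1) * sum_{l=j-1}^{k-1} C(l, j-1) * nu_l."""
    return (-1) ** (j - 1) * sum(comb(l, j - 1) * nu(l)
                                 for l in range(j - 1, k))


def scale(c, x):
    """Product of the number c and the interval x = (lo, hi)."""
    lo, hi = c * x[0], c * x[1]
    return (min(lo, hi), max(lo, hi))


k = 4
deltas = [delta(k, j) for j in range(1, k + 1)]
nu_star = scaled_integral(k, -1, 0)
nu_2star = scaled_integral(k, 0, 1)

psi = (Fraction(-1), Fraction(1))          # made-up value of Psi_4
a, b = scale(nu_star, psi), scale(nu_2star, psi)
separate = (a[0] + b[0], a[1] + b[1])      # nu* Psi + nu** Psi
merged = scale(nu_star + nu_2star, psi)    # (nu* + nu**) Psi, too narrow
print(deltas, nu_star, nu_2star, separate, merged)
```

With these values the separate form has half-width $270/720$, while the merged form has only $232/720$, so merging the two terms would lose part of the enclosure.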
5 Implicit Interval Multistep Methods
In the area of conventional multistep methods for solving the IVP, the construction of the implicit Adams–Moulton methods leads to two equivalent formulas that give the exact solution at some point. We have

$$y(t_n) = y(t_{n-1}) + h \sum_{j=0}^{k} \overline{\gamma}_j \nabla^j f_n + h^{k+2}\, \overline{\gamma}_{k+1}\, \overline{\psi}(\eta, y(\eta)), \tag{14}$$

and

$$y(t_n) = y(t_{n-1}) + h \sum_{j=0}^{k} \overline{\beta}_{kj} f_{n-j} + h^{k+2}\, \overline{\gamma}_{k+1}\, \overline{\psi}(\eta, y(\eta)), \tag{15}$$

where $\eta \in [t_{n-k}, t_n]$, $f_n = f(t_n, y_n)$, and

$$\nabla^j f_n = \sum_{m=0}^{j} (-1)^m \binom{j}{m} f_{n-m},$$
$$\overline{\gamma}_0 = 1, \qquad \overline{\gamma}_j = \frac{1}{j!} \int_{-1}^{0} s(s + 1) \cdots (s + j - 1)\, ds, \quad j = 1, 2, \ldots, k + 1,$$
$$\overline{\beta}_{kj} = (-1)^{j} \sum_{m=j}^{k} \binom{m}{j} \overline{\gamma}_m, \quad j = 0, 1, \ldots, k.$$

Let the assumptions about $Y_i$, $i = 0, 1, \ldots, k - 1$, be the same as in Sec. 4. Then, from the formulas (14) and (15) we get the algorithms of the implicit interval methods of Adams–Moulton type (see [11,14,15,23,26]) as follows:

$$Y_n = Y_{n-1} + h \sum_{j=0}^{k} \overline{\gamma}_j \nabla^j F_n + h^{k+2}\, \overline{\gamma}_{k+1} \overline{\Psi}_k, \quad n = k, k + 1, \ldots, m, \tag{16}$$

and

$$Y_n = Y_{n-1} + h \sum_{j=0}^{k} \overline{\beta}_{kj} F_{n-j} + h^{k+2}\, \overline{\gamma}_{k+1} \overline{\Psi}_k, \quad n = k, k + 1, \ldots, m, \tag{17}$$

where

$$\overline{\Psi}_k = \overline{\Psi}\left(T_n + [-kh, 0],\ Y_n + [-kh, 0]\, F(\Delta_t, \Delta_y)\right).$$

Let us note that the coefficients $\overline{\gamma}_j$ in (16) are not of the same sign, i.e. $\overline{\gamma}_0 = 1$ but $\overline{\gamma}_j$ for $j \ge 1$ are all negative. Since the distributive law is not generally satisfied in interval arithmetic, we cannot subtract values of interval functions with the same indices. Consequently, in interval arithmetic we have in general

$$\sum_{j=0}^{k} \overline{\gamma}_j \nabla^j F_n \ne \sum_{j=0}^{k} \overline{\beta}_{kj} F_{n-j},$$
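and hence the formulas (16) and (17) are not equivalent. Furthermore, it is easy to prove that the interval methods of Adams–Moulton type (17) give interval solutions with smaller widths than the methods (16). In particular, for $k = 3$ from (16) and (17) we get the methods of the first and the second type as follows:

$$Y_n = Y_{n-1} + \frac{h}{24}\left(24 F_n - 15 F_n + 19 F_{n-1} - 5 F_{n-2} + F_{n-3}\right) - \frac{19}{720} h^5\, \overline{\Psi}_3, \tag{18}$$

$$Y_n = Y_{n-1} + \frac{h}{24}\left(9 F_n + 19 F_{n-1} - 5 F_{n-2} + F_{n-3}\right) - \frac{19}{720} h^5\, \overline{\Psi}_3, \tag{19}$$

where $\overline{\Psi}_3 = \overline{\Psi}\left(T_n + [-3h, 0],\ Y_n + [-3h, 0]\, F(\Delta_t, \Delta_y)\right)$.

In order to obtain the Milne–Simpson implicit methods (see e.g. [24]) we can start from the exact relation containing either backward differences, i.e.

$$y(t_n) = y(t_{n-2}) + h \sum_{j=0}^{k} \overline{\nu}_j \nabla^j f_n + h^{k+2}\left(\overline{\nu}_{k+1}^{*}\, y^{(k+2)}(\theta_n^{*}) + \overline{\nu}_{k+1}^{**}\, y^{(k+2)}(\theta_n^{**})\right), \tag{20}$$

or only the values of the function, i.e.

$$y(t_n) = y(t_{n-2}) + h \sum_{j=0}^{k} \overline{\delta}_{kj} f_{n-j} + h^{k+2}\left(\overline{\nu}_{k+1}^{*}\, y^{(k+2)}(\theta_n^{*}) + \overline{\nu}_{k+1}^{**}\, y^{(k+2)}(\theta_n^{**})\right), \tag{21}$$

where $\theta_n^{*}$ and $\theta_n^{**}$ are some points in $(t_0, t_n)$,

$$\overline{\nu}_0 = 2, \qquad \overline{\nu}_j = \frac{1}{j!} \int_{-2}^{0} t(t + 1) \cdots (t + j - 1)\, dt, \quad j = 1, 2, \ldots, k,$$
$$\overline{\nu}_{k+1}^{*} = \frac{1}{(k+1)!} \int_{-2}^{-1} t(t + 1) \cdots (t + k)\, dt, \qquad \overline{\nu}_{k+1}^{**} = \frac{1}{(k+1)!} \int_{-1}^{0} t(t + 1) \cdots (t + k)\, dt,$$
$$\overline{\delta}_{kj} = (-1)^{j} \sum_{l=j}^{k} \binom{l}{j} \overline{\nu}_l, \quad j = 0, 1, \ldots, k.$$

Using (20) we get the following implicit interval methods:

$$Y_n = Y_{n-2} + h \sum_{j=0}^{k} \overline{\nu}_j \nabla^j F_n + h^{k+2}\left(\overline{\nu}_{k+1}^{*} \overline{\Psi}_k + \overline{\nu}_{k+1}^{**} \overline{\Psi}_k\right), \quad n = k, k + 1, \ldots, m, \tag{22}$$

where $F_n = F(T_n, Y_n)$, and where $\overline{\Psi}_k = \overline{\Psi}\left(T_n + [-kh, 0],\ Y_n + [-kh, 0]\, F(\Delta_t, \Delta_y)\right)$.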
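Since in interval arithmetic we have in general

$$\sum_{j=0}^{k} \overline{\nu}_j \nabla^j F_n \ne \sum_{j=0}^{k} \overline{\delta}_{kj} F_{n-j},$$

from (21) we can obtain the second kind of interval methods of Milne–Simpson type:

$$Y_n = Y_{n-2} + h \sum_{j=0}^{k} \overline{\delta}_{kj} F_{n-j} + h^{k+2}\left(\overline{\nu}_{k+1}^{*} \overline{\Psi}_k + \overline{\nu}_{k+1}^{**} \overline{\Psi}_k\right), \quad n = k, k + 1, \ldots, m. \tag{23}$$

In particular, for $k = 3$ from (22) we get the following method of the first kind (in the conventional case we have the same method as for $k = 2$):

$$Y_n = Y_{n-2} + \frac{h}{3}\left(7 F_n - 6 F_n + 6 F_{n-1} - 2 F_{n-1} + F_{n-2}\right) + \frac{h^5}{720}\left(11\, \overline{\Psi}_3 - 19\, \overline{\Psi}_3\right), \tag{24}$$

where $\overline{\Psi}_3 = \overline{\Psi}\left(T_n + [-3h, 0],\ Y_n + [-3h, 0]\, F(\Delta_t, \Delta_y)\right)$. Below is the three-step method of the second kind (obtained from (23)):

$$Y_n = Y_{n-2} + \frac{h}{3}\left(F_n + 4 F_{n-1} + F_{n-2}\right) + \frac{h^5}{720}\left(11\, \overline{\Psi}_3 - 19\, \overline{\Psi}_3\right). \tag{25}$$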
If we denote by $Y_n^1$ the interval solution obtained from (22), i.e. from the formula with backward interval differences, and by $Y_n^2$ the interval solution obtained from (23), i.e. from the formula without backward interval differences, then we can prove that $Y_n^2 \subseteq Y_n^1$, which means that the second kind of formulas gives the interval solution with a smaller diameter (width).
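The inclusion $Y_n^2 \subseteq Y_n^1$ comes from the fact that interval arithmetic is only subdistributive: $7F_n - 6F_n$ is in general wider than $F_n$. A minimal sketch comparing the bracketed sums of (24) and (25) is given below; the interval values of $F_n$, $F_{n-1}$, $F_{n-2}$ are made up, intervals are plain `(lo, hi)` pairs, and no outward rounding is performed.

```python
def add(x, y):
    """Sum of two intervals."""
    return (x[0] + y[0], x[1] + y[1])


def scale(c, x):
    """Product of the number c and the interval x = (lo, hi)."""
    lo, hi = c * x[0], c * x[1]
    return (min(lo, hi), max(lo, hi))


def width(x):
    return x[1] - x[0]


def contains(outer, inner):
    return outer[0] <= inner[0] and inner[1] <= outer[1]


# Hypothetical interval values of F_n, F_{n-1}, F_{n-2}.
f_n, f_n1, f_n2 = (1, 2), (0, 1), (-1, 1)

# Bracket of (24): 7F_n - 6F_n + 6F_{n-1} - 2F_{n-1} + F_{n-2}
first_kind = add(add(add(add(scale(7, f_n), scale(-6, f_n)),
                         scale(6, f_n1)), scale(-2, f_n1)), f_n2)

# Bracket of (25): F_n + 4F_{n-1} + F_{n-2}
second_kind = add(add(f_n, scale(4, f_n1)), f_n2)

print(first_kind, second_kind)
```

For these sample values the first form gives $[-8, 15]$ and the second gives $[0, 7]$, which lies inside the first.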
6 Conclusions
Interval methods for solving the IVP in floating-point interval arithmetic give solutions in the form of intervals which contain all possible numerical errors, i.e. representation errors, rounding errors, and errors of methods. Many numerical experiments carried out by the authors [4,35] show that sometimes implicit interval methods of Runge–Kutta type are better than the explicit ones of the same order, i.e. give the interval solution with a smaller width, and sometimes explicit methods are better than the implicit ones. Because of the page limit for this paper we have to omit a presentation of many interesting experiments. The choice of an appropriate method depends on the problem considered. But in the case of Hamiltonian systems we recommend using interval methods of symplectic Runge–Kutta type, which are better in comparison to other implicit interval methods of this type. The main advantage of such methods consists in reducing the number of calculations involved.

For the multistep interval methods presented in this paper it follows that for the same number of steps explicit interval methods of Nyström type are somewhat better than the methods of Adams–Bashforth type, and implicit interval methods of Milne–Simpson type give somewhat better results than the methods of Adams–Moulton type. Another conclusion concerning the implicit interval methods is that the methods based on backward differences give somewhat worse results than the methods based only on combinations of interval function values at different points. Unfortunately, interval methods of BDF type are not as good as other multistep interval methods.

The main conclusion can be expressed as follows: use implicit interval methods of Runge–Kutta type (or of symplectic Runge–Kutta type in the case of Hamiltonian systems) to obtain additional starting points for multistep interval methods (prefer the multistep methods of Nyström or Milne–Simpson types). For each particular problem one should choose a multistep method with the appropriate step size and number of method steps to obtain the interval solution with the smallest width (for a given step size there exists an optimal number of method steps, and for a given number of method steps there exists a best step size).
References

1. Berz, M., Makino, K.: Verified integration of ODEs and flows with differential algebraic methods on Taylor models. Reliable Computing 4 (4), 361–369 (1998)
2. Butcher, J.C.: The Numerical Analysis of Ordinary Differential Equations. Runge–Kutta and General Linear Methods. J. Wiley & Sons, Chichester (1987)
3. Eijgenraam, P.: The Solution of Initial Value Problems Using Interval Arithmetic. Mathematical Centre Tracts 144 (1981)
4. Gajda, K.: Interval Methods of a Symplectic Runge-Kutta Type [in Polish], Ph.D. Thesis, Poznan University of Technology, Poznań (2004)
5. Gajda, K., Marciniak, A.: Symplectic Interval Methods for Solving the Hamiltonian Problem. Pro. Dialog. 22, 27–38 (2007)
6. Gajda, K., Marciniak, A., Szyszka, B.: Three- and Four-Stage Implicit Interval Methods of Runge-Kutta Type. Computational Methods in Science and Technology 6, 41–59 (2000)
7. Hairer, E., Norsett, S.P., Wanner, G.: Solving Ordinary Differential Equations I. Nonstiff Problems. Springer, Heidelberg (1987)
8. Hammer, R., Hocks, M., Kulisch, U., Ratz, D.: Numerical Toolbox for Verified Computing I. Basic Numerical Problems. Springer, Berlin (1993)
9. Hill, G.W.: Researches in the Lunar Theory. Am. J. Math. I (1878)
10. Hoefkens, J., Berz, M., Makino, K.: Controlling the Wrapping Effect in the Solution of ODEs for Asteroids. Reliable Computing 8, 21–41 (2003)
11. Jankowska, M.: Interval Multistep Methods of Adams Type and their Implementation in the C++ Language, Ph.D. Thesis, Poznan University of Technology, Poznań (2006)
12. Jankowska, M., Marciniak, A.: An Interval Version of the Backward Differentiation (BDF) Method. In: SCAN 2006 Conference Post-Proceedings, IEEE-CPS Product No. E2821 (2007)
13. Jankowska, M., Marciniak, A.: On the Interval Methods of the BDF Type for Solving the Initial Value Problem. Pro. Dialog. 22, 39–59 (2007)
14. Jankowska, M., Marciniak, A.: On Two Families of Implicit Interval Methods of Adams-Moulton Type. Computational Methods in Science and Technology 12 (2), 109–113 (2006)
15. Jankowska, M., Marciniak, A.: Implicit Interval Multistep Methods for Solving the Initial Value Problem. Computational Methods in Science and Technology 8 (1), 17–30 (2002)
16. Jankowska, M., Marciniak, A.: On Explicit Interval Methods of Adams-Bashforth Type. Computational Methods in Science and Technology 8 (2), 46–57 (2002)
17. Jankowska, M., Marciniak, A.: Preliminaries of the IMM System for Solving the Initial Value Problem by Interval Multistep Methods [in Polish]. Pro. Dialog. 10, 117–134 (2005)
18. Jaulin, L., Kieffer, M., Didrit, O., Walter, É.: Applied Interval Analysis. Springer, London (2001)
19. Kalmykov, S.A., Shokin, J.I., Juldashev, E.C.: Solving Ordinary Differential Equations by Interval Methods [in Russian]. Doklady AN SSSR 230 (6) (1976)
20. Krückeberg, F.: Ordinary differential equations. In: Hansen, E. (ed.) Topics in Interval Analysis, pp. 91–97. Clarendon Press, Oxford (1969)
21. Lohner, R.J.: Enclosing the solutions of ordinary initial and boundary value problems. In: Kaucher, E.W., Kulisch, U.W., Ullrich, C. (eds.) Computer Arithmetic: Scientific Computation and Programming Languages. Wiley-Teubner Series in Computer Science, Stuttgart (1987)
22. Marciniak, A.: Finding the Integration Interval for Interval Methods of Runge–Kutta Type in Floating-Point Interval Arithmetic. Pro. Dialog. 10, 35–46
23. Marciniak, A.: Implicit Interval Methods for Solving the Initial Value Problem. Numerical Algorithms 37, 241–251 (2004)
24. Marciniak, A.: Multistep Interval Methods of Nyström and Milne-Simpson Types. Computational Methods in Science and Technology 13 (1), 23–39 (2007)
25. Marciniak, A.: Numerical Solutions of the N-body Problem. Reidel, Dordrecht (1985)
26. Marciniak, A.: On Multistep Interval Methods for Solving the Initial Value Problem. Journal of Computational and Applied Mathematics 199, 229–237 (2007)
27. Marciniak, A., Szyszka, B.: One- and Two-Stage Implicit Interval Methods of Runge-Kutta Type. Computational Methods in Science and Technology 5, 53–65 (1999)
28. Marciniak, A., Szyszka, B.: On Representation of Coefficients in Implicit Interval Methods of Runge-Kutta Type. Computational Methods in Science and Technology 10 (1), 57–71 (2004)
29. Moore, R.E.: Interval Analysis. Prentice-Hall, Englewood Cliffs (1966)
30. Nedialkov, N.S., Jackson, K.R.: An interval Hermite-Obreschkoff method for computing rigorous bounds on the solution of an initial value problem for an ordinary differential equation. Reliable Computing 5 (3), 289–310 (1999)
31. Rihm, R.: Interval methods for initial value problems in ODE's. In: Herzberger, J. (ed.) Topics in Validated Computations, Proceedings of the IMACS-GAMM International Workshop on Validated Computations, Oldenburg, Germany (1994)
32. Rihm, R.: On a class of enclosure methods for initial value problems. Computing 53, 369–377 (1994)
33. Sanz-Serna, J.M., Calvo, M.P.: Numerical Hamiltonian Problems. Chapman & Hall, London (1994)
34. Shokin, J.I.: Interval Analysis [in Russian]. Nauka, Novosibirsk (1981)
35. Szyszka, B.: Implicit Interval Methods of Runge-Kutta Type [in Polish], Ph.D. Thesis, Poznan University of Technology, Poznań (2003)
Towards Efficient Prediction of Decisions under Interval Uncertainty

Van Nam Huynh¹, Vladik Kreinovich², Yoshiteru Nakamori¹, and Hung T. Nguyen³

¹ Japan Advanced Institute of Science and Technology (JAIST), Tatsunokuchi, Ishikawa 923-1292, Japan, [email protected]
² University of Texas at El Paso, El Paso, TX 79968, USA, [email protected]
³ New Mexico State University, Las Cruces, NM 88003, USA, [email protected]
Abstract. In many practical situations, users select between $n$ alternatives $a_1, \ldots, a_n$, and the only information that we have about the utilities $v_i$ of these alternatives are bounds $\underline{v}_i \le v_i \le \overline{v}_i$. In such situations, it is reasonable to assume that the values $v_i$ are independent and uniformly distributed on the corresponding intervals $[\underline{v}_i, \overline{v}_i]$. Under this assumption, we would like to estimate, for each $i$, the probability $p_i$ that the alternative $a_i$ will be selected. In this paper, we provide efficient algorithms for computing these probabilities.

Keywords: decision making, interval uncertainty.
1 Decisions under Interval Uncertainty: Formulation of the Problem

Making a decision when we know the exact values of the maximized quantity. Let us assume that we want to select an alternative with the largest possible value of a certain quantity. If for $n$ alternatives $a_1, \ldots, a_n$ we know the exact values $v_1, \ldots, v_n$ of the corresponding quantity, then the decision maker will select the alternative $a_i$ for which the corresponding value $v_i$ is the largest.

How to predict this decision. When we know the values $v_1, \ldots, v_n$, then predicting a decision means computing the index $i_n$ of the largest value $v_i$. This can be done in time $O(n)$ by the following iterative process. At each iteration $k$ ($k = 1, \ldots, n$), $i_k$ is the index of the largest of the first $k$ values $v_1, \ldots, v_k$. In the first iteration $k = 1$, we naturally take $i_1 = 1$. Once we have $i_k$, on the next, $(k + 1)$-st, iteration we compare the largest-so-far value $v_{i_k}$ with the new value $v_{k+1}$. If $v_{k+1} > v_{i_k}$, then we take $i_{k+1} = k + 1$ as the new index; otherwise we keep the old index, i.e., take $i_{k+1} = i_k$.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1372–1381, 2008. © Springer-Verlag Berlin Heidelberg 2008
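The linear-time scan for the index of the largest value can be sketched as follows. The function name is hypothetical, and the sketch uses 0-based indices, whereas the text above counts from 1.

```python
def index_of_largest(values):
    """Return the 0-based index of the largest value (first one on ties)."""
    best = 0                          # i_1: after one value, the best is the first
    for k in range(1, len(values)):
        if values[k] > values[best]:  # compare the new value with the largest so far
            best = k
    return best


chosen = index_of_largest([3.0, 7.5, 7.5, 2.0])
```

Because the comparison is strict, ties are resolved in favour of the earlier alternative, exactly as in the rule "otherwise we keep the old index".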
Predicting decisions under interval uncertainty: a problem. In many practical situations, we do not know the exact values of the desired quantity. In many such situations, we only know the bounds $\underline{v}_i$ and $\overline{v}_i$ for the (unknown) actual value $v_i$, i.e., our only information about $v_i$ is that $v_i$ belongs to the interval $[\underline{v}_i, \overline{v}_i]$. If we only know the intervals $[\underline{v}_i, \overline{v}_i]$ of possible values of $v_i$, and these intervals share several common points, then it may be that, e.g., $v_1$ is the largest, and it may be that $v_2$ is the largest. Thus, some decision makers will prefer $a_1$, some may prefer $a_2$, etc. In this case, we cannot exactly predict which selection will be made, but we can hopefully predict the probability $p_i$ of selecting $a_i$.

2 Decision Making under Interval Uncertainty: Main Idea and Related Computational Problem

Idea. A natural idea for computing the probability $p_i$ is as follows. For each $i$, we assume that the (unknown) actual value $v_i$ is uniformly distributed in the corresponding interval $[\underline{v}_i, \overline{v}_i]$, and that different values $v_i$ are independent random variables. Then, the desired probability $p_i$ is the probability that, under this distribution, $v_i$ is the largest of the $n$ values $v_1, \ldots, v_n$.

Comment. The above assumptions about the probability distributions correspond, e.g., to the Maximum Entropy (MaxEnt) approach (see, e.g., [4]), in which among all possible distributions $\rho(v_1, \ldots, v_n)$ on the given box $[\underline{v}_1, \overline{v}_1] \times \ldots \times [\underline{v}_n, \overline{v}_n]$, we select the one with the largest value of the entropy

$$-\int \rho(v_1, \ldots, v_n) \cdot \log(\rho(v_1, \ldots, v_n))\, dv_1 \ldots dv_n.$$

This MaxEnt distribution is uniform on the box, which is equivalent to assuming that all values $v_i$ are independent and uniformly distributed.

For $n = 2$, there are explicit formulas for computing $p_i$. For the case $n = 2$ of two alternatives, $p_1$ is the probability that $v_1 > v_2$. There exist explicit formulas for this probability; see, e.g., [3,6,8,9,10,11,12]. So, for $n = 2$, we have an efficient algorithm for computing the desired probabilities $p_1$ and $p_2$.

Problem: how to compute $p_i$ for large $n$? The case of $n = 2$ is a toy example. In most practical decision problems, we have a large number of alternatives, sometimes so large that we need high performance parallel computers to handle these problems. How can we then compute the corresponding probabilities $p_i$?

Since the distribution is uniform, the desired probability $p_i$ is equal to the ratio $V_i / V$, where $V = (\overline{v}_1 - \underline{v}_1) \cdot \ldots \cdot (\overline{v}_n - \underline{v}_n)$ is the ($n$-dimensional) volume of the box, and $V_i$ is the volume of the part of the box for which $v_i$ is larger than all other values $v_j$.

In principle, we can compute the volume $V_i$ by computing the corresponding $n$-dimensional integral. However, computing $n$-dimensional integrals with a given accuracy $\varepsilon > 0$ means that we have to consider a grid of size $\sim \varepsilon$ along each axis, i.e., consider $\sim \dfrac{1}{\varepsilon}$ points along each axis and $\sim \dfrac{1}{\varepsilon^n}$ points overall. For large $n$, this computation time is too high to be practically useful. It is therefore desirable to come up with more efficient algorithms for computing $p_i$.
3 Monte-Carlo Simulations as a Way to Approximate the Desired Decision Probabilities

Idea. A natural idea is to use Monte-Carlo simulations; see, e.g., [7]. Specifically, we select a number $N$, and then $N$ times, we simulate each $v_i$ as a uniformly distributed random variable. After that, we take $N_i / N$ as an estimate for $p_i$, where $N_i$ is the number of simulations in which $v_i$ was the largest value.

It is known that the accuracy of the Monte-Carlo simulation is $\sim 1/\sqrt{N}$. So, to get 10% accuracy in computing $p_i$, it is sufficient to take $N \approx 100$ simulations.

Limitations. The main limitation of this approach is that if we want accurate estimates, with accuracy $\varepsilon \ll 1$, we need a large number of simulations $N \approx \dfrac{1}{\varepsilon^2}$. This number is not prohibitive (as it is for direct integration) but still large. It is therefore desirable to design an algorithm for computing $p_i$ exactly.
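The Monte-Carlo estimate described above can be sketched as follows. The function name, the fixed seed, and the representation of each interval as a `(lower, upper)` pair are assumptions made for this illustration.

```python
import random


def monte_carlo_probabilities(intervals, num_samples, seed=0):
    """Estimate p_i as N_i / N for independent uniform v_i on the given intervals."""
    rng = random.Random(seed)
    counts = [0] * len(intervals)
    for _ in range(num_samples):
        sample = [rng.uniform(lo, hi) for lo, hi in intervals]
        winner = max(range(len(sample)), key=sample.__getitem__)
        counts[winner] += 1
    return [c / num_samples for c in counts]


# Two identical intervals: each alternative should win about half of the time.
estimate = monte_carlo_probabilities([(0.0, 1.0), (0.0, 1.0)], num_samples=10_000)
print(estimate)
```

With $N = 10\,000$ samples the expected error is of order $1/\sqrt{N} = 0.01$, in line with the accuracy estimate quoted above.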
4 Efficient Algorithm for Exact Computation of Decision Probabilities

Let us describe an efficient ($O(n^2)$) algorithm for computing $p_i$. Without losing generality, we can assume that $i = 1$, i.e., that we need to compute the probability $p_1$ that $v_1$ is the largest of the $n$ values $v_i$. The outline of this section is as follows:

– First, we will describe the main idea behind this algorithm.
– Then, we will show how this idea translates into an actual $O(n^2)$ algorithm.
– Finally, we will explicitly describe the resulting algorithm.

Main idea. Our idea is to first describe, for each given $v_1$, the conditional probability $p_1(v_1)$ that this $v_1$ is the largest, under the condition that $v_1$ is the actual value. Then, due to the Bayes formula, the overall probability $p_1$ that $v_1$ is the largest can be obtained by integrating this conditional probability $p_1(v_1)$ times the probability density of $v_1$:

$$\mathrm{Prob}(v_1 \text{ is the largest}) = \int \mathrm{Prob}(v_1 \text{ is the largest} \mid v_1 \text{ is actual}) \cdot \rho_1(v_1)\, dv_1.$$

The distribution of $v_1$ is uniform on the interval $[\underline{v}_1, \overline{v}_1]$, hence

$$p_1 = \frac{1}{\overline{v}_1 - \underline{v}_1} \cdot \int p_1(v_1)\, dv_1.$$

How can we describe the expression for $p_1(v_1)$? Once $v_1$ is fixed, the fact that $v_1$ is the largest means that $v_2 \le v_1$, $v_3 \le v_1$, etc. Since all the variables $v_i$ are independent, this probability is equal to the product of $n - 1$ probabilities: the probability that $v_2 \le v_1$, the probability that $v_3 \le v_1$, etc. For each $i$, the probability that $v_i \le v_1$ can be determined as follows:

– If $\overline{v}_i \le v_1$, then $v_i \le v_1$ with probability 1. This probability does not change the product and can thus simply be omitted.
– If $v_1 < \underline{v}_i$, this means that $v_i \le v_1$ cannot happen at all. The resulting probability is 0, so such values $v_1$ can be completely ignored.
– Finally, if $\underline{v}_i \le v_1 < \overline{v}_i$, then, since the distribution of $v_i$ is uniform on the interval $[\underline{v}_i, \overline{v}_i]$, the probability that $v_i \le v_1$ is equal to $\dfrac{v_1 - \underline{v}_i}{\overline{v}_i - \underline{v}_i}$.

Thus, the conditional probability $p_1(v_1)$ is equal to

$$p_1(v_1) = \prod_{i \ge 2:\ v_1 \le \overline{v}_i} \frac{v_1 - \underline{v}_i}{\overline{v}_i - \underline{v}_i}$$

if $v_1 \ge \underline{v}_i$ for all $i$, and to 0 otherwise.

Transforming this idea into the actual algorithm. As we see, the expression for $p_1(v_1)$ depends on the relation between $v_1$ and the endpoints $\underline{v}_i$ and $\overline{v}_i$ of the intervals $[\underline{v}_i, \overline{v}_i]$. So, if we sort these endpoints into an increasing sequence $v_{(1)} \le v_{(2)} \le \ldots \le v_{(2n)}$, then, in each of the resulting $2n + 1$ zones $z_0 = (-\infty, v_{(1)})$, $z_1 = [v_{(1)}, v_{(2)})$, ..., $z_j = [v_{(j)}, v_{(j+1)})$, ..., $z_{2n} = [v_{(2n)}, \infty)$, we will have the same analytical expression for $p_1(v_1)$.

For each zone, the corresponding expression is a product of $\le n$ linear terms. Multiplying these terms one by one, we get a polynomial of degree $\le n$ in $\le n$ computational steps. The integral $\int p_1(v_1)\, dv_1$ can be computed as the sum of the integrals $p_{1j}$ over all the zones $z_j$, $j = 0, \ldots, 2n$. An integral of a polynomial $a_0 + a_1 \cdot v_1 + \ldots + a_k \cdot v_1^k$ is equal to $a_0 \cdot v_1 + \dfrac{a_1}{2} \cdot v_1^2 + \ldots + \dfrac{a_k}{k+1} \cdot v_1^{k+1}$, i.e., it can also be computed coefficient-by-coefficient in linear time. Since we have $2n + 1$ zones, we thus need $(2n + 1) \cdot O(n) = O(n^2)$ time to compute all $2n + 1$ sub-integrals, and then $2n = O(n)$ operations to add them and get $\int p_1(v_1)\, dv_1$. Dividing this integral by $\overline{v}_1 - \underline{v}_1$, we get $p_1$. Thus, overall, we indeed need quadratic time.

Resulting algorithm. At the first step of this algorithm, we order all $2n$ endpoints $\underline{v}_i$ and $\overline{v}_i$ into an increasing sequence $v_{(1)} \le v_{(2)} \le \ldots \le v_{(2n)}$. As a result, we divide the real line into $2n + 1$ zones $z_0 = (-\infty, v_{(1)})$, $z_1 = [v_{(1)}, v_{(2)})$, ..., $z_j = [v_{(j)}, v_{(j+1)})$, ..., $z_{2n} = [v_{(2n)}, \infty)$. For the zones $z_j$ for which $v_{(j)} < \underline{v}_1$, $v_{(j+1)} > \overline{v}_1$, or $v_{(j+1)} \le \underline{v}_i$ for some $i$, the integral $p_{1j}$ is equal to 0. For every other zone, we form the expression

$$p_1(v_1) = \prod_{i \ge 2:\ v_{(j+1)} \le \overline{v}_i} \frac{v_1 - \underline{v}_i}{\overline{v}_i - \underline{v}_i}.$$
This expression is a product of $\le n$ linear functions of the unknown $v_1$. Multiplying these functions one by one, we get an explicit expression for a polynomial in $v_1$. By processing the coefficients of this polynomial one by one, we can provide the explicit analytical expression for the (indefinite) integral $P_{1j}(v_1)$ of this polynomial. The desired integral $p_{1j}$ can then be computed as the difference $P_{1j}(v_{(j+1)}) - P_{1j}(v_{(j)})$. Finally, the desired probability $p_1$ is computed as

$$p_1 = \frac{1}{\overline{v}_1 - \underline{v}_1} \cdot \sum_{j=0}^{2n} p_{1j}.$$
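The zone-by-zone computation can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, exact rational arithmetic is used for clarity, duplicate endpoints are merged, the interval of the alternative of interest is assumed to be non-degenerate, and the polynomial is rebuilt from scratch in every zone, so the sketch is not tuned to meet the quadratic time bound.

```python
from fractions import Fraction


def poly_mul_linear(p, c0, c1):
    """Multiply polynomial p (coefficients, lowest degree first) by c0 + c1 * x."""
    out = [Fraction(0)] * (len(p) + 1)
    for d, a in enumerate(p):
        out[d] += a * c0
        out[d + 1] += a * c1
    return out


def poly_integral(p, lo, hi):
    """Definite integral of p over [lo, hi], computed coefficient by coefficient."""
    return sum(a * (hi ** (d + 1) - lo ** (d + 1)) / (d + 1)
               for d, a in enumerate(p))


def selection_probability(intervals, index=0):
    """Probability that alternative `index` has the largest value."""
    iv = [(Fraction(lo), Fraction(hi)) for lo, hi in intervals]
    lo1, hi1 = iv[index]
    others = [iv[i] for i in range(len(iv)) if i != index]
    points = sorted({p for pair in iv for p in pair})   # sorted distinct endpoints
    total = Fraction(0)
    for left, right in zip(points, points[1:]):         # bounded zones only
        if right <= lo1 or left >= hi1:
            continue                                    # zone outside [lo1, hi1]
        if any(right <= lo for lo, _ in others):
            continue                                    # some v_i is surely larger
        poly = [Fraction(1)]
        for lo, hi in others:
            if hi >= right:                             # lo <= v_1 < hi in this zone
                poly = poly_mul_linear(poly, -lo / (hi - lo), 1 / (hi - lo))
        total += poly_integral(poly, left, right)
    return total / (hi1 - lo1)


p_first = selection_probability([(0, 2), (1, 3)], index=0)
p_second = selection_probability([(0, 2), (1, 3)], index=1)
print(p_first, p_second)
```

For the intervals $[0, 2]$ and $[1, 3]$ the sketch gives $1/8$ and $7/8$; for three identical intervals each alternative gets $1/3$.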
Comments

– The idea of dividing the real line into zones corresponding to sorted endpoints of the given intervals comes from another situation where we need to combine probabilities and intervals: namely, from the algorithms for computing population variance under interval uncertainty [2].
– The above algorithm is based on the assumptions that we have a finite set of alternatives, that decision makers know the exact values of $v_i$, and that the distributions are uniform. In the following sections, we discuss what will happen if we do not make these assumptions.
5 First Observation: What If We Have Infinitely Many Alternatives

Formulation of the problem. In many practical problems, we have infinitely many alternatives. For example, an alternative is often characterized by a continuous real-valued parameter $a$ on a range $[\underline{a}, \overline{a}]$ (or by several such parameters). In such situations, for every $a$, we have an interval $[\underline{v}(a), \overline{v}(a)]$ of possible values of $v(a)$. For example, we may know the approximate values $\tilde{v}(a)$, and we know the bound $\Delta(a) > 0$ on the approximation error; in this case, the (unknown) actual value $v(a)$ belongs to the interval $[\tilde{v}(a) - \Delta(a), \tilde{v}(a) + \Delta(a)]$. It is usually reasonable to assume that both $\underline{v}(a)$ and $\overline{v}(a)$ are continuous functions of $a$. Again, we assume that the values $v(a)$ corresponding to different $a$ are independent random variables uniformly distributed on the corresponding intervals $[\underline{v}(a), \overline{v}(a)]$. If a decision maker selects the action with the largest possible value of $v(a)$, what is the probability of selecting different values of $a$?

A minor complication here is that since there are infinitely many possible alternatives $a$, the maximum may not necessarily be attained. In this case, it is reasonable to fix some small value $\varepsilon$ and select an alternative $a(\varepsilon)$ for which $v(a(\varepsilon)) \ge \max\limits_a v(a) - \varepsilon$. We will call such an alternative $\varepsilon$-optimal.

A somewhat unexpected solution. Our result is that for every $\varepsilon > 0$, an $\varepsilon$-optimal alternative corresponding to the random values $v(a)$ is $\varepsilon$-optimal for the function $\overline{v}(a)$. In other words, with probability 1, the decision maker will select the solution that maximizes the "optimistic" value $\overline{v}(a)$.
Proof. Before we start discussing this result, let us first prove it. It is sufficient to prove that $\max_a v(a) = \max_a \overline{v}(a)$. Indeed, from the fact that $v(a) \le \overline{v}(a)$, we conclude that $\max_a v(a) \le \max_a \overline{v}(a)$. Let us now pick any number $\varepsilon' > 0$ and show that $\max_a v(a) \ge \max_a \overline{v}(a) - \varepsilon'$; then in the limit $\varepsilon' \to 0$ we will get $\max_a \overline{v}(a) \le \max_a v(a)$ and hence $\max_a v(a) = \max_a \overline{v}(a)$.

Indeed, let $a_m$ be a value at which the continuous function $\overline{v}(a)$ attains its maximum. Since $\overline{v}(a)$ is continuous, there exists a value $\delta$ such that $|a_m - a'| \le \delta$ implies that $|\overline{v}(a') - \overline{v}(a_m)| \le \varepsilon'/2$, i.e., that $\overline{v}(a') \ge \overline{v}(a_m) - \varepsilon'/2 = \max_a \overline{v}(a) - \varepsilon'/2$.

Let us prove that we cannot have $\max_a v(a) < \max_a \overline{v}(a) - \varepsilon'$. Indeed, that would imply that $v(a') < \overline{v}(a_m) - \varepsilon'$ for all (infinitely many) values $a'$ for which $|a' - a_m| \le \delta$. This means that for all such $a'$, we have $v(a') \notin [\overline{v}(a') - \varepsilon'/2, \overline{v}(a')]$, because for values from that subinterval, we have $v(a') \ge \overline{v}(a') - \varepsilon'/2 \ge (\overline{v}(a_m) - \varepsilon'/2) - \varepsilon'/2 = \overline{v}(a_m) - \varepsilon'$. The probability of not being in this interval is $1 - (\varepsilon'/2)/(\overline{v}(a') - \underline{v}(a'))$ and is hence $\le 1 - (\varepsilon'/2)/W$, where $W \stackrel{\text{def}}{=} \max_a (\overline{v}(a) - \underline{v}(a))$. There are infinitely many such values $a'$, and all variables $v(a')$ are independent; thus, the probability that $v(a') < \overline{v}(a_m) - \varepsilon'$ for all $a'$ does not exceed $(1 - (\varepsilon'/2)/W)^n$ for every $n$. When $n \to \infty$, we conclude that this probability is 0. Thus, with probability 1, we have some value $a'$ for which $v(a') \ge \overline{v}(a_m) - \varepsilon'$. The statement is proven.

Discussion. The above counter-intuitive result follows from the assumption that the values $v_i$ are independent and uniformly distributed. So, to avoid this conclusion, we must relax this assumption; in the last section of this paper, we will start analyzing what happens if we relax this assumption.
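The effect described in the proof can be observed numerically. The following is a minimal sketch, not part of the original paper: it discretizes the range of $a$ into an $n$-point grid, draws each $v(a)$ independently and uniformly from its interval, and reports the gap between the largest upper endpoint and the largest sampled value. The functions `v_lo` and `v_hi` (a parabola with error bound 0.5) and all names are hypothetical choices made only for illustration.

```python
import random

def sampled_gap(v_lo, v_hi, a_lo, a_hi, n, rng):
    """Gap max_a v_hi(a) - max_a v(a) on an n-point grid, where every v(a)
    is drawn independently and uniformly from [v_lo(a), v_hi(a)]."""
    grid = [a_lo + (a_hi - a_lo) * k / (n - 1) for k in range(n)]
    best_upper = max(v_hi(a) for a in grid)
    best_sample = max(rng.uniform(v_lo(a), v_hi(a)) for a in grid)
    return best_upper - best_sample

# Hypothetical example: approximate value 1 - a^2 with error bound 0.5.
def v_lo(a):
    return 1.0 - a * a - 0.5

def v_hi(a):
    return 1.0 - a * a + 0.5

rng = random.Random(0)  # fixed seed so that the run is reproducible
gaps = {n: sampled_gap(v_lo, v_hi, -1.0, 1.0, n, rng) for n in (10, 1_000, 100_000)}
for n, gap in gaps.items():
    print(n, gap)
```

As the grid is refined, more and more independent samples fall near the maximizer of the upper endpoint, so the gap tends to shrink, which is the finite-sample counterpart of the limit argument above.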
6 Second Observation: What If Decision Makers Also Only Know the Values of the Desired Quantity with Interval Uncertainty
Formulation of the problem. In the previous text, we assumed that the decision makers know the exact values $v_i$ of the desired quantity, and make their decisions based on these exact values. Based on this assumption, we considered the situation when we only know the intervals $[\underline{v}_i, \overline{v}_i]$ for $v_i$, and we estimated the probability $p_i$ that for randomly selected values $v_1 \in [\underline{v}_1, \overline{v}_1], \ldots, v_n \in [\underline{v}_n, \overline{v}_n]$, a decision maker will select the alternative $a_i$. In practice, decision makers may also know the values $v_i$ only approximately. How does this approximate character affect the decisions?

Previous work. For the case of $n = 2$ alternatives, the case when decision makers know $v_i$ with accuracy $\delta > 0$ was considered in [11]; a case of general interval bounds was analyzed in [8,9,10].

What we plan to do. In this section, we consider the simplest case of accuracy $\delta$, and we show how to modify the above algorithms to account for this uncertainty.
What happens when decision makers only know the values $v_i$ with accuracy $\delta$: our assumption. When the decision maker knows the exact values of $v_1$ and $v_2$, then the decision is straightforward:
– if $v_1 = v_2$, then both alternatives are equally attractive, so any of them can be selected;
– if $v_1 > v_2$, then the first alternative $a_1$ is better, so it will be selected;
– if $v_1 < v_2$, then the second alternative is better, so $a_2$ will be selected.

If we only know the approximate values $v_1$ and $v_2$, values which are only correct within an accuracy $\delta$, then we also have three options:
– It is possible that $v_1 - \delta > v_2 + \delta$ (i.e., equivalently, $v_1 - v_2 > \varepsilon$, where $\varepsilon \stackrel{\text{def}}{=} 2\delta$). In this case, every value from the interval $[v_1 - \delta, v_1 + \delta]$ is larger than every value from the interval $[v_2 - \delta, v_2 + \delta]$. Thus, we are sure that the alternative $a_1$ is better, and we select it.
– It is also possible that $v_1 + \delta < v_2 - \delta$ (i.e., equivalently, $v_1 - v_2 < -\varepsilon$). In this case, every value from the interval $[v_1 - \delta, v_1 + \delta]$ is smaller than every value from the interval $[v_2 - \delta, v_2 + \delta]$. Thus, we are sure that the alternative $a_2$ is better, and we select it.
– It is also possible that the values $v_1$ and $v_2$ are so close that we cannot tell whether $a_1$ or $a_2$ is better; this case corresponds to $|v_1 - v_2| \le \varepsilon$.

Following [11], we assume that in the third case, both alternatives $a_1$ and $a_2$ are equally attractive, so any of them can be selected.

What we would like to estimate. Under the above assumption, if the values $v_1$ and $v_2$ are close, then both $a_1$ and $a_2$ may be selected as the best, and we cannot predict which of them will be selected. So, for every $i$, instead of a single probability $p_i$ that the alternative $a_i$ will be selected, we have two different probabilities:
– the probability $p_i^+$ that $a_i$ may be selected, and
– the probability $p_i^-$ that $a_i$ will necessarily be selected.

Depending on the decision makers’ choice, the actual selection probability $p_i$ can take any value from the interval $[p_i^-, p_i^+]$.

How to estimate $p_i^-$ and $p_i^+$. According to the above description:
– $p_i^-$ is the probability that $v_j < v_i - \varepsilon$ for all $j \ne i$, and
– $p_i^+$ is the probability that $v_j \le v_i + \varepsilon$ for all $j \ne i$.

The Monte-Carlo algorithm can be easily modified to compute $p_i^-$ and $p_i^+$: namely, after we perform $N$ simulations, we can estimate $p_i^-$ as $N_i^-/N$ and $p_i^+$ as $N_i^+/N$, where
– $N_i^-$ is the number of simulations in which $v_j < v_i - \varepsilon$ for all $j \ne i$, and
– $N_i^+$ is the number of simulations in which $v_j \le v_i + \varepsilon$ for all $j \ne i$.
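The modified Monte-Carlo estimate can be sketched as follows. This is an illustrative sketch under the stated uniform-independence assumption, not the authors' implementation; the function name `selection_bounds`, its parameters, and the default number of simulations are hypothetical.

```python
import random

def selection_bounds(intervals, eps, n_sim=100_000, seed=0):
    """Estimate (p_minus, p_plus) for every alternative.

    intervals: list of (lower, upper) bounds for v_1, ..., v_n
    eps:       indistinguishability threshold (2 * delta in the text)
    """
    rng = random.Random(seed)
    n = len(intervals)
    n_minus = [0] * n  # simulations in which a_i is necessarily selected
    n_plus = [0] * n   # simulations in which a_i may be selected
    for _ in range(n_sim):
        v = [rng.uniform(lo, hi) for lo, hi in intervals]
        for i in range(n):
            others = [v[j] for j in range(n) if j != i]
            if all(vj < v[i] - eps for vj in others):
                n_minus[i] += 1
            if all(vj <= v[i] + eps for vj in others):
                n_plus[i] += 1
    return [c / n_sim for c in n_minus], [c / n_sim for c in n_plus]

# Two alternatives, both known to lie in [0, 1], with eps = 0.1.
p_minus, p_plus = selection_bounds([(0.0, 1.0), (0.0, 1.0)], eps=0.1)
print(p_minus, p_plus)
```

For this symmetric example the exact values are $p_1^- = 0.9^2/2 = 0.405$ and $p_1^+ = 1 - 0.405 = 0.595$, so the estimates should land close to these numbers.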
The exact algorithm can be modified as follows.

Towards an algorithm for computing $p_i^-$. For each $i$, the probability that $v_i + \varepsilon < v_1$ can be determined as follows:
– If $\overline{v}_i + \varepsilon < v_1$, then $v_i + \varepsilon < v_1$ with probability 1.
– If $v_1 \le \underline{v}_i + \varepsilon$, this means that $v_i + \varepsilon < v_1$ cannot happen at all; the resulting probability is 0.
– Finally, if $\underline{v}_i + \varepsilon \le v_1 \le \overline{v}_i + \varepsilon$, then, since $v_i + \varepsilon$ is uniformly distributed on the interval $[\underline{v}_i + \varepsilon, \overline{v}_i + \varepsilon]$, the probability that $v_i + \varepsilon < v_1$ is equal to
$$\frac{v_1 - (\underline{v}_i + \varepsilon)}{\overline{v}_i - \underline{v}_i}.$$

Thus, we arrive at the following algorithm.

Algorithm for the exact computation of $p_1^-$. At the first step of this algorithm, we order those values $\underline{v}_i + \varepsilon$ and $\overline{v}_i + \varepsilon$ ($i \ne 1$) which are inside the interval $[\underline{v}_1, \overline{v}_1]$ into an increasing sequence $v_{(1)} \le v_{(2)} \le \ldots \le v_{(k)}$ ($k \le 2n - 2$). As a result, we divide the interval $[\underline{v}_1, \overline{v}_1]$ into $k + 1$ zones $z_0 = [\underline{v}_1, v_{(1)})$, $z_1 = [v_{(1)}, v_{(2)})$, ..., $z_j = [v_{(j)}, v_{(j+1)})$, ..., $z_k = [v_{(k)}, \overline{v}_1]$. For the zones $z_j$ for which $v_{(j+1)} \le \underline{v}_i + \varepsilon$ for some $i$, we set $p_{1j}^- = 0$. For every other zone, we form the expression
$$p_1^-(v_1) = \prod_{i:\, v_{(j+1)} \le \overline{v}_i + \varepsilon} \frac{v_1 - \varepsilon - \underline{v}_i}{\overline{v}_i - \underline{v}_i}.$$
This expression is a product of $\le n$ linear functions of the unknown $v_1$. By multiplying by these functions one by one, we get an explicit expression for a polynomial in $v_1$. By processing the coefficients of this polynomial one by one, we can provide the explicit analytical expression for the (indefinite) integral $P_{1j}^-(v_1)$ of this polynomial. The desired integral $p_{1j}^-$ can then be computed as the difference $P_{1j}^-(v_{(j+1)}) - P_{1j}^-(v_{(j)})$. Finally, the desired probability $p_1^-$ is computed as
$$p_1^- = \frac{1}{\overline{v}_1 - \underline{v}_1} \cdot \sum_{j=0}^{k} p_{1j}^-.$$
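The zone-by-zone computation of $p_1^-$ can be sketched in a few lines. This is a minimal illustration, assuming non-degenerate intervals and taking $v_{(0)} = \underline{v}_1$ and $v_{(k+1)} = \overline{v}_1$ as the outer zone endpoints; the helper names `poly_mul_linear`, `poly_integral` and `p_minus_first` are hypothetical, and polynomials are stored as coefficient lists in ascending order.

```python
def poly_mul_linear(coeffs, c0, c1):
    """Multiply a polynomial (ascending coefficients) by c0 + c1 * x."""
    out = [0.0] * (len(coeffs) + 1)
    for k, c in enumerate(coeffs):
        out[k] += c * c0
        out[k + 1] += c * c1
    return out

def poly_integral(coeffs, a, b):
    """Definite integral of the polynomial over [a, b]."""
    def antiderivative(x):
        return sum(c * x ** (k + 1) / (k + 1) for k, c in enumerate(coeffs))
    return antiderivative(b) - antiderivative(a)

def p_minus_first(intervals, eps):
    """Probability that v_j < v_1 - eps for all j != 1 (first interval = v_1)."""
    (lo1, hi1), others = intervals[0], intervals[1:]
    # Zone endpoints: shifted bounds of the other intervals that fall inside [lo1, hi1].
    cuts = sorted({lo1, hi1} | {b + eps for lo, hi in others
                                for b in (lo, hi) if lo1 < b + eps < hi1})
    total = 0.0
    for left, right in zip(cuts, cuts[1:]):
        poly = [1.0]
        for lo, hi in others:
            if right <= lo + eps:   # v_i + eps < v_1 is impossible on this zone
                poly = None
                break
            if left >= hi + eps:    # v_i + eps < v_1 is certain: factor 1
                continue
            width = hi - lo
            # factor (v_1 - eps - lo) / (hi - lo), linear in v_1
            poly = poly_mul_linear(poly, -(eps + lo) / width, 1.0 / width)
        if poly is not None:
            total += poly_integral(poly, left, right)
    return total / (hi1 - lo1)

print(p_minus_first([(0.0, 1.0), (0.0, 1.0)], 0.1))
```

For two alternatives on $[0, 1]$ with $\varepsilon = 0.1$ the only contributing zone is $[0.1, 1]$, and the integral of $v_1 - 0.1$ over it gives $0.405$, which matches the closed-form value $0.9^2/2$.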
Algorithm for the exact computation of $p_1^+$. At the first step of this algorithm, we order those values $\underline{v}_i - \varepsilon$ and $\overline{v}_i - \varepsilon$ ($i \ne 1$) which are inside the interval $[\underline{v}_1, \overline{v}_1]$ into an increasing sequence $v_{(1)} \le v_{(2)} \le \ldots \le v_{(k)}$ ($k \le 2n - 2$). As a result, we divide the interval $[\underline{v}_1, \overline{v}_1]$ into $k + 1$ zones $z_0 = [\underline{v}_1, v_{(1)})$, $z_1 = [v_{(1)}, v_{(2)})$, ..., $z_j = [v_{(j)}, v_{(j+1)})$, ..., $z_k = [v_{(k)}, \overline{v}_1]$. For the zones $z_j$ for which $v_{(j+1)} < \underline{v}_i - \varepsilon$ for some $i$, we set $p_{1j}^+ = 0$. For every other zone, we form the expression
$$p_1^+(v_1) = \prod_{i:\, v_{(j+1)} \le \overline{v}_i - \varepsilon} \frac{v_1 + \varepsilon - \underline{v}_i}{\overline{v}_i - \underline{v}_i}.$$
This expression is a product of $\le n$ linear functions of the unknown $v_1$. By multiplying by these functions one by one, we get an explicit expression for a polynomial in $v_1$. By processing the coefficients of this polynomial one by one, we can provide the explicit analytical expression for the (indefinite) integral $P_{1j}^+(v_1)$ of this polynomial. The desired integral $p_{1j}^+$ can then be computed as the difference $P_{1j}^+(v_{(j+1)}) - P_{1j}^+(v_{(j)})$. Finally, the desired probability $p_1^+$ is computed as
$$p_1^+ = \frac{1}{\overline{v}_1 - \underline{v}_1} \cdot \sum_{j=0}^{k} p_{1j}^+.$$

7 Third Observation: What If the Distributions Are Not Uniform
Formulation of the problem. For the case of two alternatives, the uniform distribution can be justified by the requirement that the distribution be invariant relative to arbitrary shifts $v_1 \to v_1 + a_1$, $v_2 \to v_2 + a_2$ and conditionally invariant with respect to re-scalings $v_1 \to \lambda_1 \cdot v_1$, $v_2 \to \lambda_2 \cdot v_2$; see, e.g., [6]. To be more precise, the corresponding (generalized) probability density function $\rho(v_1, v_2)$ is invariant relative to shifts, $\rho(v_1 + a_1, v_2 + a_2) = \rho(v_1, v_2)$, and conditionally invariant with respect to re-scalings: $\rho(\lambda_1 \cdot v_1, \lambda_2 \cdot v_2) = a(\lambda_1, \lambda_2) \cdot \rho(v_1, v_2)$ for some function $a(\lambda_1, \lambda_2)$. From the measurement viewpoint, a shift means changing the starting point for measuring a quantity, and a scaling means changing the unit in which we measure this quantity.

These assumptions work well if the $v_i$ are different quantities which can be independently shifted or scaled. In some practical situations, however, the values $v_1$ and $v_2$ represent the same quantity. We can then only shift both values by the same quantity $a$ or scale both by the same quantity $\lambda$. It is therefore desirable to describe probability distributions which are invariant relative to such shifts and scalings.

Formulation of the problem in precise terms. We want to find all symmetric functions $\rho(v_1, v_2) = \rho(v_2, v_1)$ for which $\rho(v_1 + a, v_2 + a) = \rho(v_1, v_2)$ for all $a$, and for some function $a(\lambda)$, $\rho(\lambda \cdot v_1, \lambda \cdot v_2) = a(\lambda) \cdot \rho(v_1, v_2)$ for all $\lambda$.

Towards a solution. Shift-invariance with $a = -v_1$ implies that $\rho(v_1, v_2) = \rho(0, v_2 - v_1)$, i.e., that $\rho(v_1, v_2) = \rho_0(v_2 - v_1)$ for an appropriate function $\rho_0(v)$. Since we want a symmetric distribution $\rho(v_1, v_2)$, we must have $\rho_0(-v) = \rho_0(v)$, i.e., $\rho_0(v) = \rho_0(|v|)$. In terms of this function $\rho_0(v)$, scale-invariance means that for all $\lambda$, we have $\rho_0(\lambda \cdot v) = a(\lambda) \cdot \rho_0(v)$. It is known (see, e.g., [1,5]) that all measurable solutions of this functional equation have the form $\rho_0(v) = A \cdot v^{-\alpha}$.

Since we allow generalized functions, we can also have terms proportional to the δ-function, hence $\rho_0(v) = \varepsilon \cdot \delta(v) + A \cdot v^{-\alpha}$, and $\rho(v_1, v_2) = \varepsilon \cdot \delta(v_1 - v_2) + A \cdot |v_1 - v_2|^{-\alpha}$.
Comment. When both intervals $[\underline{v}_i, \overline{v}_i]$ are non-degenerate, for the uniform distribution, the probability that $v_1 = v_2$ is 0. In contrast, for $\varepsilon > 0$, this probability is positive. This makes sense since degenerate situations (like $v_1 = v_2$) do occur in practice.

Algorithm for computing $p(v_1 > v_2)$. For the case of two alternatives with values $v_1 \in [\underline{v}_1, \overline{v}_1]$ and $v_2 \in [\underline{v}_2, \overline{v}_2]$, we can use Monte-Carlo simulations to find $p(v_1 > v_2)$, $p(v_1 < v_2)$, and $p(v_1 = v_2)$.

Open question. How can we generalize these formulas to the general case of $n$ alternatives?

Acknowledgments. This work was partially supported by NSF Grants CCF0202042 and EAR-0225670, by Texas Department of Transportation grant No. 0-5453, and by the Japan Advanced Institute of Science and Technology (JAIST) International Joint Research Grant 2006-08. The authors are thankful to anonymous referees for valuable suggestions.

References
1. Aczel, J.: Lectures on Functional Equations and Their Applications. Dover, New York (2006)
2. Ferson, S., Ginzburg, L., Kreinovich, V., Longpré, L., Aviles, M.: Exact bounds on finite populations of interval data. Reliable Computing 11(3), 207–233 (2005)
3. Huynh, V.N., Nakamori, Y., Lawry, J.: A probability-based approach to comparison of fuzzy numbers and applications to target-oriented decision making. IEEE Transactions on Fuzzy Systems (to appear)
4. Jaynes, E.T., Bretthorst, G.L.: Probability Theory: The Logic of Science. Cambridge University Press, Cambridge (2003)
5. Nguyen, H.T., Kreinovich, V.: Applications of continuous mathematics in computer science. Kluwer, Dordrecht (1997)
6. Nguyen, H.T., Kreinovich, V., Longpré, L.: Dirty pages of logarithm tables, lifetime of the universe, and (subjective) probabilities on finite and infinite intervals. Reliable Computing 10(2), 83–106 (2004)
7. Robert, C.P., Casella, G.: Monte Carlo Statistical Methods. Springer, New York (2004)
8. Sevastjanov, P., Venberg, A.: Modelling and simulation of power units work under interval uncertainty. Energy 3, 66–70 (1998) (in Russian)
9. Sevastjanov, P.V., Róg, P.: Two-objective method for crisp and fuzzy interval comparison in optimization. Computers & Operations Research 33, 115–131 (2006)
10. Sevastianov, P., Róg, P., Venberg, A.: The constructive numerical method of interval comparison. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds.) PPAM 2001. LNCS, vol. 2328, pp. 756–761. Springer, Heidelberg (2001)
11. Wagman, D., Schneider, M., Schnaider, E.: On the use of interval mathematics in fuzzy expert systems. International Journal of Intelligent Systems 9, 241–259 (1994)
12. Yager, R.R., Detyniecki, M., Bouchon-Meunier, B.: A context-dependent method for ordering fuzzy numbers using probabilities. Information Sciences 138, 237–255 (2001)
Interval Methods for Computing the Pareto-front of a Multicriterial Problem

Bartłomiej Jacek Kubica and Adam Woźniak

Institute of Control and Computation Engineering, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warsaw, Poland

Abstract. Interval methods are known to be a precise and robust tool of global optimization. Several interval algorithms have been developed to deal with various kinds of this problem. Far less has been written about the use of interval methods in multicriterial optimization. The paper surveys two methods presented by other researchers and proposes a modified approach, combining the PICPA algorithm with the use of derivative information. Preliminary numerical results are presented.
1 Introduction
It is well known that interval methods can be used as a precise and robust tool of global optimization (e.g. [2], [6]). Far less has been written about their use in multicriterial optimization, i.e. in solving problems of the following form:
$$
\begin{aligned}
\min_{x} \quad & q_k(x), && k = 1, \ldots, N, \\
\text{s.t.} \quad & g_j(x) \le 0, && j = 1, \ldots, m, \\
& x_i \in [\underline{x}_i, \overline{x}_i], && i = 1, \ldots, n,
\end{aligned}
\tag{1}
$$
where the decision variable $x = (x_1, \ldots, x_n)^T \in \mathbb{R}^n$. In the sequel we shall denote the set of points satisfying the above conditions as $X$ (the set of feasible points). This paper briefly surveys two algorithms presented in the literature and proposes a new algorithm inspired by them.
2 Pareto-optimality and Pareto-front
Let us formulate precisely what we understand by a solution to the multicriterial optimization problem. We are interested in seeking two sets: the Pareto-set in the decision space and the Pareto-front in the criteria space. Their definitions follow.

Definition 1. A feasible point $x$ is Pareto-optimal (nondominated), if there exists no other feasible point $y$ such that:
$$(\forall k) \; q_k(y) \le q_k(x) \quad \text{and} \quad (\exists i) \; q_i(y) < q_i(x).$$
R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1382–1391, 2008. © Springer-Verlag Berlin Heidelberg 2008
The set $P \subset \mathbb{R}^n$ of all Pareto-points is called the Pareto-set.

Definition 2. The set
$$P_f = (q_1 \times \cdots \times q_N)(P) = \big\{ (y_1, \ldots, y_N) \in (q_1 \times \cdots \times q_N)(X) \;\big|\; D(y_1, \ldots, y_N) \cap (q_1 \times \cdots \times q_N)(X) = \emptyset \big\} \subset \mathbb{R}^N,$$
where
$$D(y_1, \ldots, y_N) = \big\{ (y'_1, \ldots, y'_N) \;\big|\; (\forall k) \; y'_k \le y_k \text{ and } (\exists i) \; y'_i < y_i \big\},$$
is called the Pareto-front.

The interpretation of the definitions is straightforward. A feasible point is Pareto-optimal if there is no other feasible point that would reduce some criterion without causing a simultaneous increase in at least one other criterion. The Pareto-front is the image of the Pareto-set in the criteria space and $D$ is the cone of domination in this space. In the sequel one more definition will be needed.

Definition 3. A point $y$ dominates a set $B$, iff $D(y) \cap B = \emptyset$, and similarly a set $B'$ dominates a set $B$, iff $(\forall y \in B') \; D(y) \cap B = \emptyset$.
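For a finite set of criteria vectors, Definitions 1 and 2 translate directly into a pairwise dominance check. The following is a small illustrative sketch (minimization of all criteria, as in problem (1)); the function names `dominates` and `pareto_front` are hypothetical and the sample points are made up.

```python
def dominates(y_a, y_b):
    """True if y_a lies in the domination cone D(y_b): y_a is no worse than
    y_b in every criterion and strictly better in at least one."""
    no_worse = all(a <= b for a, b in zip(y_a, y_b))
    strictly_better = any(a < b for a, b in zip(y_a, y_b))
    return no_worse and strictly_better

def pareto_front(points):
    """Keep the points whose domination cone contains no other given point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

candidates = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0), (3.0, 3.0)]
print(pareto_front(candidates))
```

In this example the vector (3, 3) is dominated by (2, 2), while the remaining three vectors are mutually nondominated. The quadratic cost of the pairwise check is acceptable for a sketch; the interval algorithms discussed below work with boxes rather than with finite point sets.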
3 Basics of Interval Computations
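Now, we shall define some basic notions of intervals and their arithmetic. We follow widely acknowledged standards (cf. e.g. [2], [4], [6]). We define the (closed) interval $[\underline{x}, \overline{x}]$ as the set $\{x \in \mathbb{R} \mid \underline{x} \le x \le \overline{x}\}$. Following [7], we use boldface lowercase letters to denote interval variables, e.g. $\mathbf{x}$, $\mathbf{y}$, $\mathbf{z}$, and $\mathbb{IR}$ denotes the set of all real intervals.

We design arithmetic operations on intervals so that the following condition is fulfilled: if we have $\odot \in \{+, -, \cdot, /\}$, $a \in \mathbf{a}$, $b \in \mathbf{b}$, then $a \odot b \in \mathbf{a} \odot \mathbf{b}$. The actual formulae for arithmetic operations (see e.g. [2], [4], [6]) are as follows:
$$
\begin{aligned}
[\underline{a}, \overline{a}] + [\underline{b}, \overline{b}] &= [\underline{a} + \underline{b}, \; \overline{a} + \overline{b}], \\
[\underline{a}, \overline{a}] - [\underline{b}, \overline{b}] &= [\underline{a} - \overline{b}, \; \overline{a} - \underline{b}], \\
[\underline{a}, \overline{a}] \cdot [\underline{b}, \overline{b}] &= [\min(\underline{a}\underline{b}, \underline{a}\overline{b}, \overline{a}\underline{b}, \overline{a}\overline{b}), \; \max(\underline{a}\underline{b}, \underline{a}\overline{b}, \overline{a}\underline{b}, \overline{a}\overline{b})], \\
[\underline{a}, \overline{a}] \,/\, [\underline{b}, \overline{b}] &= [\underline{a}, \overline{a}] \cdot [1/\overline{b}, \; 1/\underline{b}], \qquad 0 \notin [\underline{b}, \overline{b}].
\end{aligned}
$$
Links between real and interval functions are set by the notion of an inclusion function, see e.g. [4]; also called an interval extension, e.g. [6].

Definition 4. A function $\mathbf{f} : \mathbb{IR} \to \mathbb{IR}$ is an inclusion function of $f : \mathbb{R} \to \mathbb{R}$, if for every interval $\mathbf{x}$ within the domain of $\mathbf{f}$ the following condition is satisfied:
$$\{f(x) \mid x \in \mathbf{x}\} \subseteq \mathbf{f}(\mathbf{x}). \tag{2}$$
The definition is analogous for functions $f : \mathbb{R}^n \to \mathbb{R}^m$.

When computing interval operations, we can round the lower bound downward and the upper bound upward. This will result in an interval that will be a bit overestimated, but will be guaranteed to contain the true result of the real-number operation.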
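The four formulae and the outward rounding can be illustrated with a small class. This is a sketch only: the class name `Interval` and its layout are hypothetical, and widening each bound by one unit in the last place with `math.nextafter` (Python 3.9 or later) is a simple stand-in for true directed rounding, which real interval libraries obtain by switching the floating-point rounding mode.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    @staticmethod
    def _outward(lo, hi):
        # Widen by one ulp on each side so that the exact result stays enclosed.
        return Interval(math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf))

    def __add__(self, other):
        return Interval._outward(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval._outward(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval._outward(min(products), max(products))

    def __truediv__(self, other):
        if other.lo <= 0.0 <= other.hi:
            raise ZeroDivisionError("divisor interval contains 0")
        return self * Interval._outward(1.0 / other.hi, 1.0 / other.lo)

    def __contains__(self, x):
        return self.lo <= x <= self.hi

a = Interval(1.0, 2.0)
b = Interval(-1.0, 3.0)
print(a + b, a - b, a * b, a / Interval(2.0, 4.0))
print(a - a)  # roughly [-1, 1], not [0, 0]
```

The last line shows the dependency effect: both occurrences of `a` are treated as independent, so `a - a` encloses 0 but is not the degenerate interval. This is one source of the overestimation that inclusion functions exhibit.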
4 Previous Interval Algorithms for Approximating the Pareto-front
There are very few papers about applying interval methods to multicriterial optimization problems. Two of them, the one of Barichard and Hao ([1]) and the one of Ruetsch ([11]), have inspired our research. We now briefly describe the methods presented in these two articles. The paper of Ruetsch is the later one, but it represents an approach closer to unicriterial optimization, so we shall present it first.

4.1 Ruetsch Differential Formulation
The approach used in paper [11] is very similar to any branch-and-bound interval algorithm used to solve optimization problems. The initial box containing the feasible set is subdivided, enclosures of criteria for subboxes are computed and dominated boxes are discarded. Obviously, the intervals of criteria values for different boxes often overlap and several bisections may be needed before it can be determined if some sets of solutions are dominated or not. To accelerate the process an additional test is used. Based on the gradient information, the test tries to determine if there is a feasible direction in which all criteria are improved. It may be roughly understood as an analog of monotonicity tests from unicriterial optimization (e.g. [2], [6], [10]). The idea is as follows: at each point $x$ of the feasible set the gradient of a criterion function $q_k(\cdot)$ defines the region where the function decreases when moving away from $x$ (see Fig. 1). This region is called the negative gradient region ([11]) and denoted by $N_k$. A point is not Pareto-optimal if the negative gradient regions for all criteria have a nonempty intersection (with each other and with the feasible set): $\bigcap_{k=1}^{N} N_k \cap X \ne \emptyset$.
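[Figure: two panels in the $(x_1, x_2)$ plane showing the gradient $g$ at a point $x$ with the region $N_k$ (classical case) and the region $C_k = \underline{N}_k \cap \overline{N}_k$ (interval case)]

Fig. 1. The negative gradient region – classical and interval approaches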
In the interval formulation, for each box we can compute the lower and upper bounds on the gradient, $[\underline{g}, \overline{g}]$. So the certainly negative gradient region has the form presented in Fig. 1, i.e. $C_k = \underline{N}_k \cap \overline{N}_k$. If all sets $C_k$ have a nonempty intersection with each other and with $X$, we can conclude that no point in the considered box is Pareto-optimal. The operation described above is similar to the monotonicity test not only in the use of gradients. The results of its work are also analogous: it can be used to discard boxes that do not contain solutions, but not to verify the existence of solutions in boxes. Unfortunately, details of approximating the sets $\underline{N}_k$, $\overline{N}_k$ and $C_k$, or any other details of the implementation, are not given in [11], which makes it difficult to implement the method and investigate it. Actually, there is very little information about the implementation and performance of the algorithm.

4.2 Barichard and Hao’s PICPA
PICPA (Population and Interval Constraint Propagation Algorithm), presented in [1], is a completely different approach from the one above. Now, the bisections are done in the criteria space, not in the decision space. The associated intervals of argument values are then contracted by a constraint propagation process to achieve local consistency. Hence the boxes do not overlap in the criteria space (and deletion of dominated boxes is simpler than in the previous approach), but may overlap in the decision space. The algorithm may be described by the following pseudocode.

PICPA (q(·), x^(0), MaxSize)
  // q(·) is the interval extension of the function q(·) = (q_1, ..., q_N)(·)
  y^(0) = q(x^(0));
  P = {(x^(0), y^(0))};  // initialize the population with a single element
  while (0 < |P| < MaxSize) do
    while (0 < |P| < MaxSize) do
      select (x, y) from P and bisect it according to one of the criteria y_i;
      contract the children;
      remove the father and add children to P;
    end while;
    instantiate individuals if possible;
    delete dominated individuals if possible;
  end while;
end PICPA

The instantiation is a procedure used to obtain a feasible point contained in a given box. The authors of [1] do not describe it in detail, but it seems to be complicated and costly. They use a specific parameter, called the “search effort”, to limit its computational cost; again, details are not given. If a feasible point is found, all boxes dominated by it may be discarded.
5 Set Inversion
Before we describe our method to approximate the Pareto-front, let us describe one more algorithm. It is a very general technique, called Set Inversion ([4], [5]), to approximate an inverse image of a set $Y$, i.e. the set $f^{-1}(Y) = \{x \in X \mid f(x) \in Y\}$. The algorithm is called SIVIA (Set Inversion Via Interval Analysis).

SIVIA (f(·), x^(0), Y, ε)
  // f(·) is the interval extension of the function
  // x^(0) is the initial box
  // Y is the inverted set – usually an interval, but not necessarily in general
  // ε is the accuracy of approximation
  L_in = ∅;     // the list of boxes contained in the set f^(-1)(Y)
  L_bound = ∅;  // the list of boundary boxes
  stack (x^(0));
  while (stack nonempty) do
    unstack x;
    y = f(x);
    if (y ⊆ Y) L_in = L_in ∪ x;
    else if (y ∩ Y = ∅) discard x;
    else if (wid (x) ≤ ε) L_bound = L_bound ∪ x;
    else bisect x; stack subboxes;
    end if
    // here we can place also additional processing of the box, e.g. rejection/reduction tests
  end while
end SIVIA
The above code mentions some “additional processing” of the box – it is not used in classical SIVIA, but will be used in our algorithm. Details will be described later, in Subsect. 6.2.
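The classical SIVIA loop can be written out as follows. This is a minimal sketch without the additional box processing and without outward rounding, so it is not a rigorous implementation; boxes are represented as tuples of `(lo, hi)` pairs, the inverted set is a single interval, and the names `sivia`, `sq_norm_incl` and `area` are hypothetical. The example inverts the interval $[1, 4]$ under $f(x_1, x_2) = x_1^2 + x_2^2$, whose inverse image is an annulus.

```python
import math

def sivia(f_incl, box, y_set, eps):
    """Return (inside, boundary) lists of boxes approximating f^{-1}(y_set).

    f_incl(box) must return an enclosure (lo, hi) of f over the box.
    """
    inside, boundary = [], []
    stack = [box]
    while stack:
        x = stack.pop()
        y_lo, y_hi = f_incl(x)
        if y_set[0] <= y_lo and y_hi <= y_set[1]:
            inside.append(x)                       # f(x) is contained in Y
        elif y_hi < y_set[0] or y_lo > y_set[1]:
            continue                               # f(x) and Y are disjoint
        elif max(hi - lo for lo, hi in x) <= eps:
            boundary.append(x)                     # too small to bisect further
        else:
            k = max(range(len(x)), key=lambda i: x[i][1] - x[i][0])
            lo, hi = x[k]
            mid = 0.5 * (lo + hi)
            stack.append(x[:k] + ((lo, mid),) + x[k + 1:])
            stack.append(x[:k] + ((mid, hi),) + x[k + 1:])
    return inside, boundary

def sq_norm_incl(box):
    """Inclusion function of f(x1, x2) = x1^2 + x2^2 (illustrative example)."""
    lo = hi = 0.0
    for a, b in box:
        hi += max(a * a, b * b)
        lo += 0.0 if a <= 0.0 <= b else min(a * a, b * b)
    return lo, hi

def area(boxes):
    return sum(math.prod(hi - lo for lo, hi in b) for b in boxes)

inside, boundary = sivia(sq_norm_incl, ((-3.0, 3.0), (-3.0, 3.0)), (1.0, 4.0), 0.05)
print(area(inside), 3 * math.pi, area(inside) + area(boundary))
```

The printed numbers bracket the exact area $3\pi$ of the annulus from below and from above, and the bracket tightens as `eps` decreases. The stack gives a depth-first traversal; a queue would work equally well for this purpose.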
6 A New Algorithm to Approximate the Pareto-front
An interesting feature of PICPA is that it gives only one box in the decision space associated with a box in the criteria space. This may have some benefits, such as a smaller number of boxes to store, but it makes it difficult to locate the Pareto-front precisely in the decision space. It also does not help to use any reduction tests like the interval Newton operators, etc.

6.1 Pseudocode of the Algorithm
Our idea is to replace the “population” (as we have it in PICPA) with a set of lists. The lists are generated by a variant of SIVIA applied to invert each of the boxes in the criteria space.
Also the componentwise Newton method ([3], [9]) is used to narrow the boxes, in a similar manner as in [12]. The algorithm is expressed by the following pseudocode.

compute_Pareto-front (q(·), x^(0), ε_y, ε_x)
  // q(·) is the interval extension of the function q(·) = (q_1, ..., q_N)(·)
  // L is the list of quadruples (y, L_in, L_bound, L_unchecked)
  y^(0) = q(x^(0));
  L = { (y^(0), {}, {}, {x^(0)}) };
  while (there is a quadruple in L, for which wid y ≥ ε_y)
    take this quadruple (y, L_in, L_bound, L_unchecked) from L;
    if (L_in ≠ ∅) delete from L elements dominated by y;
    if (wid y > ε_y) then
      bisect y to y^(1) and y^(2);
      for i = 1, 2
        apply SIVIA with accuracy ε_x to quadruple (y^(i), L_in, L_bound, L_unchecked);
        if (the resulting quadruple has a nonempty interior, i.e. L_in ≠ ∅) then
          delete quadruples that are dominated by y^(i);
        end if
        insert the quadruple to L;
      end for
    end if
  end while
end compute_Pareto-front
Please note that:
– we do not need any instantiation procedure as in PICPA: if any subbox is added to L_in then there are feasible points related to the criteria values and all boxes dominated by them may be discarded,
– we can break the SIVIA procedure after finding such an interior subbox; further bisections are not necessary.

The above code mentions the SIVIA procedure. Obviously some changes have to be made to the traditional set inversion algorithm to fit our needs:
– the input consists of three lists of boxes instead of a single box,
– while boxes from L_in and L_unchecked are processed as in ordinary SIVIA, boxes from L_bound are not bisected; the only thing to do is to check if they belong to the boundary of the new set or are to be discarded,
– other tests, based on gradient information, are used; they are described below.

6.2 Rejection/Reduction Tests
Rejection/reduction tests are commonly used in interval branch-and-bound methods. In our case two tests are used; they are placed in SIVIA in the while loop (as indicated in the pseudocode). One of them is a multicriterial relative of the monotonicity test and the other one is the componentwise Newton operator, applied to narrow boxes that do not belong to the currently inverted set.

Monotonicity test. One of the simplest and most commonly used accelerators for the interval branch-and-bound method is the deletion of boxes on which the criterion function is monotone: such boxes cannot contain optima unless, obviously, some constraints are active on them. Details may be found e.g. in [2], [6], [10]. In the multicriterial case we have to check if all criteria are monotone on a box at the same time. Moreover they have to be either all increasing or all decreasing: if all criteria are monotone, but some are increasing and some decreasing, then there can be Pareto-optimal points in such a region. It is worth noting that after applying the monotonicity test for our multicriterial case in SIVIA, the algorithm does not return the inverted set any more; it returns only those boxes of the inverted set which can contain points from the Pareto-front.

Componentwise Newton operator ($N_{cmp}$). As mentioned before, the componentwise Newton method is used in SIVIA to narrow or discard boxes that do not belong to the inverted set. The componentwise operator seems a good choice here because it is simpler than traditional Newton-like operators; it does not allow us to verify uniqueness, but we do not need that here. So, the operator is used to seek solutions of the equations $q_k(x) - \mathbf{y}_k = 0$, where $k = 1, \ldots, N$. The values $\mathbf{y}_k$ (components of the inverted vector interval) are parameters for these equations.
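The multicriterial monotonicity test can be sketched as a check on interval enclosures of the partial derivatives. The sketch below is an illustration, not the authors' code: it assumes that no constraint is active on the box, and the names `can_discard_by_monotonicity` and `grad_enclosures_example` are hypothetical. The example criteria are $q_1 = (x_1 + 0.5)^2 + x_2^2$ and $q_2 = (x_1 - 0.5)^2 + x_2^2$, whose gradients are linear and therefore easy to enclose exactly.

```python
def can_discard_by_monotonicity(grad_enclosures):
    """grad_enclosures[k][i] = (lo, hi) encloses dq_k/dx_i over the box.

    The box may be discarded if, for some variable x_i, all criteria are
    strictly increasing in x_i or all are strictly decreasing in x_i.
    Assumes the box lies in the interior of the feasible set.
    """
    n_vars = len(grad_enclosures[0])
    for i in range(n_vars):
        if all(g[i][0] > 0.0 for g in grad_enclosures):
            return True   # moving down in x_i improves every criterion
        if all(g[i][1] < 0.0 for g in grad_enclosures):
            return True   # moving up in x_i improves every criterion
    return False

def grad_enclosures_example(box):
    """Gradient enclosures of q1 = (x1+0.5)^2 + x2^2 and q2 = (x1-0.5)^2 + x2^2."""
    (a1, b1), (a2, b2) = box
    dq1 = [(2 * (a1 + 0.5), 2 * (b1 + 0.5)), (2 * a2, 2 * b2)]
    dq2 = [(2 * (a1 - 0.5), 2 * (b1 - 0.5)), (2 * a2, 2 * b2)]
    return [dq1, dq2]

far_box = ((0.6, 0.8), (0.1, 0.3))     # both criteria increase in x1 here
near_box = ((-0.2, 0.2), (-0.1, 0.1))  # q1 increases, q2 decreases in x1
print(can_discard_by_monotonicity(grad_enclosures_example(far_box)))
print(can_discard_by_monotonicity(grad_enclosures_example(near_box)))
```

The second box contains part of the segment $x_2 = 0$, $x_1 \in [-0.5, 0.5]$ on which the two criteria conflict, so it must be kept, in line with the remark that criteria monotone in opposite directions do not justify rejection.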
7 Numerical Experiments
We shall present results for three test problems typically used in multicriterial optimization. The first one comes from [11].
$$
\min_{x_1, x_2} \; q_1(x_1, x_2) = (x_1 + 0.5)^2 + x_2^2, \quad q_2(x_1, x_2) = (x_1 - 0.5)^2 + x_2^2, \qquad x_1 \in [-1, 1], \; x_2 \in [-0.5, 0.5]. \tag{3}
$$
The second problem, taken from [8], has a nonconnected Pareto-front.
$$
\begin{aligned}
\min_{x_1, x_2} \; q_1(x_1, x_2) = {} & -3 \cdot (1 - x_1)^2 \cdot \exp(-x_1^2 - (x_2 + 1)^2) - 10 \cdot \Big( \frac{x_1}{5} - x_1^3 - x_2^5 \Big) \cdot \exp(-x_1^2 - x_2^2) \\
& - 3 \exp(-(x_1 + 2)^2 - x_2^2) + 0.5 \cdot (2 x_1 + x_2), \\
q_2(x_1, x_2) = {} & -3 \cdot (1 + x_2)^2 \cdot \exp(-x_2^2 - (1 - x_1)^2) - 10 \cdot \Big( -\frac{x_2}{5} + x_2^3 + x_1^5 \Big) \cdot \exp(-x_2^2 - x_1^2) \\
& - 3 \exp(-(2 - x_2)^2 - x_1^2), \\
& x_1, x_2 \in [-3, 3].
\end{aligned}
\tag{4}
$$
And the third one is from [13]; we may have more than two criteria there.
$$
\min_{x_1, \ldots, x_n} \; q_k(x_1, \ldots, x_n) = \sum_{j = 1,\, j \ne k}^{N} x_j^2 + (x_k - 1)^2, \quad k = 1, \ldots, N, \qquad x_i \in [-1000, 1000], \; i = 1, \ldots, n. \tag{5}
$$
In numerical tests we used $n = 3$ and $N = 3$. The obtained results are given in Tables 1–3 and, for the second example, in Fig. 2.

Table 1. Numerical results for the first test problem (3), $\varepsilon_y = 0.1$, $\varepsilon_x = 0.01$

| | breaking SIVIA | non-breaking |
|---|---|---|
| computational time | 0.210s | 1.440s |
| criteria evals. | 6208 | 45926 |
| criteria Jacobian evals. | 3931 | 27309 |
| bisections in criteria space | 125 | 126 |
| bisections in decision space | 1417 | 6686 |
| boxes deleted by monot. test | 378 | 626 |
| boxes deleted by $N_{cmp}$ | 19 | 0 |
| resulting quadruples | 41 | 41 |
Table 2. Numerical results for the second test problem (4), $\varepsilon_y = 0.1$, $\varepsilon_x = 0.01$

| | breaking SIVIA | non-breaking |
|---|---|---|
| computational time | 1m19.260s | 33m41.300s |
| criteria evals. | 690082 | 15313711 |
| criteria Jacobian evals. | 328091 | 8611949 |
| bisections in criteria space | 449 | 443 |
| bisections in decision space | 154415 | 2233297 |
| boxes deleted by monot. test | 9917 | 50968 |
| boxes deleted by $N_{cmp}$ | 17727 | 131988 |
| resulting quadruples | 179 | 176 |
Table 3. Numerical results for the third test problem (5), $\varepsilon_y = 50$, $\varepsilon_x = 10$

| | breaking SIVIA | non-breaking |
|---|---|---|
| computational time | 19m17.400s | 38m2.190s |
| criteria evals. | 33257194 | 52246852 |
| criteria Jacobian evals. | 5358663 | 10621037 |
| bisections in criteria space | 1861 | 2694 |
| bisections in decision space | 2678595 | 3354091 |
| boxes deleted by monot. test | 12106 | 11722 |
| boxes deleted by $N_{cmp}$ | 3436 | 0 |
| resulting quadruples | 1129 | 1650 |
[Figure: two panels in the criteria space with axes $y_1$ and $y_2$, marking the image $(q_1 \times q_2)(X)$ and the Pareto-front $P_f$]

Fig. 2. Boxes enclosing the Pareto-front for the second test problem (4) – the whole set and zoom of its part
The result shown in Fig. 2 demonstrates that our algorithm can compute an approximation of the whole Pareto-front in the nonconnected case, compared to the pointwise approximation with only 15 points obtained in [8]. This suggests that the proposed algorithm has interesting potential, and we hope that it can be used to solve practical problems.
8 Conclusions
Interval methods seem to be well suited to approximating the Pareto-front of a multicriterial optimization problem, but their use in this field has not been investigated much up to now. Many improvements to existing algorithms can still be proposed. The proposed algorithm, applying SIVIA and the componentwise Newton method, seems promising, and the “breaking SIVIA” variant seems to outperform its more traditional relatives. Obviously, several more tests should be carried out to obtain good heuristics for the choice of accuracy parameters, the componentwise Newton method and other accelerating techniques.
References
1. Barichard, V., Hao, J.K.: Population and Interval Constraint Propagation Algorithm. In: Fonseca, C.M., Fleming, P.J., Zitzler, E., Deb, K., Thiele, L. (eds.) EMO 2003. LNCS, vol. 2632, pp. 88–101. Springer, Heidelberg (2003)
2. Hansen, E.: Global Optimization Using Interval Analysis. Marcel Dekker, New York (1992)
3. Herbort, S., Ratz, D.: Improving the Efficiency of a Nonlinear–System–Solver Using the Componentwise Newton Method, available on the web at: http://www.ubka.uni-karlsruhe.de/vvv/1997/mathematik/5/5.pdf.gz
4. Jaulin, L., Kieffer, M., Didrit, O., Walter, E.: Applied Interval Analysis. Springer, London (2001)
5. Jaulin, L., Walter, E.: Set Inversion Via Interval Analysis for nonlinear bounded-error estimation. Automatica 29, 1053–1064 (1993)
6. Kearfott, R.B.: Rigorous Global Search: Continuous Problems. Kluwer, Dordrecht (1996)
7. Kearfott, R.B., Nakao, M.T., Neumaier, A., Rump, S.M., Shary, S.P., van Hentenryck, P.: Standardized notation in interval analysis, available on the web at: http://www.mat.univie.ac.at/~neum/software/int/notation.ps.gz
8. Kim, I.Y., de Weck, O.L.: Adaptive weighted-sum method for bi-objective optimization: Pareto front generation. Structural and Multidisciplinary Optimization 29, 149–158 (2005)
9. Kubica, B.J., Malinowski, K.: An Interval Global Optimization Algorithm Combining Symbolic Rewriting and Componentwise Newton Method Applied to Control a Class of Queueing Systems. Reliable Computing 11, 393–411 (2005)
10. Kubica, B.J., Niewiadomska-Szynkiewicz, E.: An Improved Interval Global Optimization Method and its Application to Price Management Problem. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 1055–1064. Springer, Heidelberg (2007)
11. Ruetsch, G.R.: An interval algorithm for multi-objective optimization. Structural and Multidisciplinary Optimization 30, 27–37 (2005)
12. Shary, S.P.: A Surprising Approach in Interval Global Optimization. Reliable Computing 7, 497–505 (2001)
13. Zitzler, E., Laumanns, M., Thiele, M.: SPEA2: Improving the Strength Pareto Evolutionary Algorithm for Multiobjective Optimization. In: Giannakoglou, K., Tsahalis, D., Periaux, J., Papailiou, K., Fogarty, T. (eds.) Evolutionary Methods for Design Optimization and Control, CIMNE, Barcelona, Spain (2002)
Fuzzy Solution of Interval Linear Equations
Pavel Sevastjanov and Ludmila Dymova
Institute of Computer and Information Sciences, Technical University of Czestochowa, Dabrowskiego 73, 42-200 Czestochowa, Poland
[email protected]
Abstract. A new approach to solving interval and fuzzy equations is proposed, based on a generalized procedure of interval extension called the "interval extended zero" method. Central to this approach is the treatment of "interval zero" as an interval centered around 0. It is shown that this treatment is not heuristic in nature, but is a direct consequence of the interval subtraction operation. It is also shown that the solution of interval linear equations obtained with the method may be naturally treated as a fuzzy number. An important advantage of the new method is that it substantially decreases the excess width effect.
Keywords: interval linear equation, fuzzy solution, interval zero.
1 Introduction
Solving interval or fuzzy equations is not trivial even for linear equations such as

$$AX = B, \tag{1}$$

where A and B are intervals or fuzzy values. As stated in [3], "...for certain values of A and B, Eq.(1) has no solution for X. That is, for some triangular fuzzy numbers A and B there is no fuzzy set X so that, using regular fuzzy arithmetic, AX is exactly equal to B". Although many numerical methods have been proposed for solving interval and fuzzy equations, including ones as complicated as neural net solutions [4],[5] and a fuzzy extension of Newton's method [1],[2], only particular solutions valid under specific conditions have been obtained. It is known that the equations $F(X) - B = 0$ and $F(X) = B$, where B is an interval or fuzzy value and F(X) is some interval or fuzzy function, are not equivalent. Moreover, the main problem is that the conventional interval (or fuzzy) extension of a usual equation, which leads to an interval or fuzzy equation of the form $F(X) - B = 0$, is not a correct procedure. Fewer problems arise with interval or fuzzy equations of the form $F(X) = B$, but in many cases their roots are inverted intervals, i.e., such that $\overline{x} < \underline{x}$. To alleviate these problems, in [7] we proposed a new "interval extended zero" method. There it was presented rather as a useful heuristic, which makes it possible to solve the system of linear interval equations representing Leontief's input-output model with a considerable reduction of the width of the resulting output intervals. Therefore, in the current report we focus only on the methodological aspects of the proposed method.

The rest of the paper is set out as follows. In Section 2, some problems of solving interval equations are discussed to clarify the origins of the proposed approach. Section 3 presents the "interval extended zero" method for solving linear interval equations. Section 4 concludes with some remarks.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1392–1399, 2008. © Springer-Verlag Berlin Heidelberg 2008
2 The "Right Hand Side" Problem
An important methodological problem in solving interval equations is what we call the "interval equation's right hand side problem". Suppose there exists some basic, non-interval algebraic equation $f(x) = 0$. Its natural interval extension is obtained by replacing its variables with interval ones and all arithmetic operations with the relevant interval operations. As a result we get the interval equation $[f]([x]) = 0$. Observe that this equation is senseless because its left part is an interval value, whereas the right part is the non-interval degenerated zero. Obviously, if $[f]([x]) = [\underline{f}, \overline{f}]$, then the equation $[f]([x]) = 0$ is true only when $\underline{f} = \overline{f}$. It is easy to show that the equation $[f]([x]) = 0$ can in general be verified only for an inverted interval $[x]$, i.e., when $\overline{x} < \underline{x}$. Inverted intervals are analyzed in the framework of Modal Interval Arithmetic [8], but it is very hard, and perhaps even impossible, to encounter a real-life situation in which the notation $\overline{x} < \underline{x}$ is meaningful.

It is known that if a mathematical expression can be presented in different but algebraically equivalent forms, these forms usually provide different interval results after interval extension. The same is true for equations. Let us consider interval extensions of the simplest linear equation

$$ax = b, \tag{2}$$

and its algebraically equivalent forms

$$x = \frac{b}{a}, \tag{3}$$

$$ax - b = 0 \tag{4}$$
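for a, b being crisp intervals ($0 \notin a$). Since there are no strict rules in interval analysis for choosing which of the algebraically equivalent representations of an equation should be extended, it is natural to compare the results we get from interval extensions of Eq. (2)-Eq. (4). Let $[a] = [\underline{a}, \overline{a}]$ and $[b] = [\underline{b}, \overline{b}]$ be intervals. For the sake of simplicity, let us first consider the case of $[a] > 0$, $[b] > 0$. Then the interval extension of Eq. (2) is $[\underline{a}, \overline{a}][\underline{x}, \overline{x}] = [\underline{b}, \overline{b}]$. Using the conventional interval arithmetic rule, from this equation we obtain $[\underline{a}\,\underline{x}, \overline{a}\,\overline{x}] = [\underline{b}, \overline{b}]$ and finally

$$\underline{x} = \frac{\underline{b}}{\underline{a}}, \quad \overline{x} = \frac{\overline{b}}{\overline{a}}. \tag{5}$$

Interval extension of Eq. (3) results in the expressions

$$\underline{x} = \frac{\underline{b}}{\overline{a}}, \quad \overline{x} = \frac{\overline{b}}{\underline{a}}. \tag{6}$$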
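As a quick numerical check of Eqs. (5) and (6), the following minimal sketch evaluates both extensions for positive intervals, using the interval pairs of Examples 1-3. The helper names `solve_eq5`, `solve_eq6` and `is_inverted` are hypothetical and introduced only for illustration; intervals are represented as plain `(lower, upper)` tuples.

```python
def solve_eq5(a, b):
    """Endpoint-wise solution of [a][x] = [b] for positive intervals, Eq. (5).

    Returns (x_lower, x_upper); the pair may be inverted (x_upper < x_lower).
    """
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return (b_lo / a_lo, b_hi / a_hi)


def solve_eq6(a, b):
    """Interval division [b]/[a] for positive intervals, Eq. (6)."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return (b_lo / a_hi, b_hi / a_lo)


def is_inverted(x):
    """True if the upper bound lies below the lower bound."""
    return x[1] < x[0]


if __name__ == "__main__":
    pairs = [((3, 4), (1, 2)), ((1, 2), (3, 4)), ((0.1, 0.3), (1, 1))]
    for a, b in pairs:
        x5, x6 = solve_eq5(a, b), solve_eq6(a, b)
        print(a, b, "Eq.(5):", x5, is_inverted(x5), "Eq.(6):", x6, is_inverted(x6))
```

Under these assumptions, Eq. (5) yields an inverted pair whenever the relative width of [a] exceeds that of [b], while Eq. (6) never does for positive intervals.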
Consider some examples.

Example 1. Let $[a] = [3, 4]$, $[b] = [1, 2]$. Then from Eq. (5) we get $\underline{x} = 0.333$, $\overline{x} = 0.5$ and from Eq. (6) $\underline{x} = 0.25$, $\overline{x} = 0.666$.

Example 2. Let $[a] = [1, 2]$, $[b] = [3, 4]$. Then from Eq. (5) we get $\underline{x} = 3$, $\overline{x} = 2$ and from Eq. (6) $\underline{x} = 1.5$, $\overline{x} = 4$.

Example 3. Let $[a] = [0.1, 0.3]$, $[b] = [1, 1]$ (i.e., b is a real number). Then from Eq. (5) we obtain $\underline{x} = 10$, $\overline{x} = 3.333$ and from Eq. (6) $\underline{x} = 3.333$, $\overline{x} = 10$.

We can see that the interval extension of Eq. (2) may result in inverted intervals $[x]$, i.e., such that $\overline{x} < \underline{x}$ (see Examples 2 and 3), whereas the extension of Eq. (3) provides correct intervals. It is worth noting that the interval extension of Eq. (3) will always provide correct resulting intervals, because Eqs. (6) are inferred directly from the basic definition of interval division [10],[11]. For our purposes it is quite enough to state that the interval extension of Eq. (3) guarantees correct resulting intervals in all cases, whereas the interval extension of Eq. (2) can result in practically senseless inverted intervals. It is also worth noting that in Example 3 the formal interval extension of Eq. (2) leads to a contradictory interval equation, since the right hand side of the extended Eq. (2) is the degenerated interval $[b]$ (a real value), whereas the left hand side is an interval. In all such cases the solution of the interval extension of Eq. (2) will be an inverted interval. This result, strange at first glance, is easy to explain from general methodological positions. Indeed, the rules of interval mathematics are constructed in such a way that any arithmetical operation on intervals results in an interval as well. Therefore, placing a degenerated interval on the right hand side of Eq. (2) is equivalent to requiring that the uncertainty of the left hand side be reduced to zero, which is possible only if the interval $[x]$ is inverted. The standard interval extension of Eq. (4) is $[\underline{a}\,\underline{x}, \overline{a}\,\overline{x}] - [\underline{b}, \overline{b}] = 0$.
It leads to the solution $\underline{x} = \overline{b}/\underline{a}$, $\overline{x} = \underline{b}/\overline{a}$. It is easy to see that in any case $\overline{x} < \underline{x}$, i.e., we obtain an inverted interval. Obviously, such a solution can only be considered absurd. This fact is a direct consequence of the conventional interval extension of Eq. (4) being in contradiction with the basic assumptions of interval analysis, since the right hand side of this equation is always the degenerated zero, whereas the left hand side is represented by an interval. Summarizing, we can say that only Eq. (3) can be considered a reasonable base for interval extension. On the other hand, from this base we obtain Eq. (6), which often results in a drastic widening of the output interval in comparison with the input intervals (see Example 3). It is worth noting that Eq. (3) can be used only in the simplest cases of linear equations, whereas the key methodological problem we deal with is to find an adequate interval extension of linear and nonlinear equations in their most general form $f(x) = 0$. It is easy to see that there is no way to improve the interval solutions obtained from the interval extensions of Eq. (2) and Eq. (3). In the general case, the excess width effect cannot be eliminated completely, since it reflects reality and is consistent with the principle of increasing indeterminacy (entropy). Nevertheless, this does not mean that we cannot attempt to improve the interval arithmetic rules so as to reduce the width of the resulting intervals as far as possible. That is why we now turn to Eq. (4) and look at it from another point of view.

Formally, when extending Eq. (4) one should obtain not only an interval on its left hand side, but also an interval zero on the right hand side, and in the general case this interval zero cannot be the degenerated interval [0, 0]. Strictly speaking, in the framework of conventional interval analysis, any interval extension of Eq. (4) is not a correct operation, since we obtain an interval expression only on the left hand side of the equation, whereas on its right hand side the usual zero is left unchanged. In our opinion, the root of the problem is that the conventional approach to interval extension does not involve an operation we call "interval zero extension". In other words, we propose an operation of "interval zero extension" to obtain an "interval zero" on the right side of the extended Eq. (4). Since "interval zero" is not a degenerated interval, this approach makes it possible to solve the problem of a correct interval extension of Eq. (4).

First of all, what is "interval zero"? In conventional interval analysis, it is usually assumed that any interval containing zero may be considered an "interval zero". This definition is satisfactory for preventing division by zero in conventional interval arithmetic, but for our purposes a more restrictive definition is needed. Without loss of generality, we can define the degenerated (usual) zero as the result of the operation $a - a$, where a is any real-valued number or variable. In a similar way we can define an "interval zero" as the result of the operation $[a] - [a]$, where $[a]$ is an interval. It is easy to see that for any interval $[a]$ we get $[\underline{a}, \overline{a}] - [\underline{a}, \overline{a}] = [\underline{a} - \overline{a}, \overline{a} - \underline{a}]$. Therefore, in any case the result of the interval subtraction $[a] - [a]$ is an interval centered around 0. Another approach to the interval subtraction $[a] - [a]$ and division $[a]/[a]$ is the so-called "dependence" hypothesis. It is based on the assumption that any $x \in [a]$ depends on the corresponding x that belongs to the other sample of $[a]$. Hence, $[a] - [a] = 0$ and $[a]/[a] = 1$.
The "dependence" hypothesis is well known and in some particular cases provides good results [9]. Nevertheless, it is relevant only on the algebraic level of consideration. In interval arithmetic, the intervals $[a] = [0, 1]$ and $[b] = [0, 1]$ are identical, as they have the same boundaries. Hence, formally replacing the letter b by a brings no new information into consideration. Moreover, if such a formal redefinition, $[b] \Rightarrow [a]$, automatically establishes a strong dependence between all $x \in [a]$, we face an additional problem: the lack of continuity. Indeed, if $[a] = [0, 1000]$ then according to the "dependence" hypothesis we have $[a] - [a] = 0$, whereas if $[a'] = [a] + 10^{-20}$ we have $[a] - [a'] \cong [-1000, 1000]$. Summarizing, we can say that introducing the "dependence" hypothesis, which is rather an external assumption not inherent in interval arithmetic, creates more problems than benefits. That is why, in spite of the pluralism characterizing the fuzzy and interval scientific community, the "dependence" hypothesis is rarely used or not mentioned at all [10]. Thus, if we want to treat the result of subtracting two identical intervals as "interval zero", then the most general definition of such a "zero" is: "interval zero is an interval symmetrical with respect to 0". It must be emphasized that this definition says nothing about the width of "interval zero". Indeed, when extending an equation such as Eq. (4) with previously unknown values of the variables on the left hand side, all we can say about the right hand side is that it should be an interval symmetrical with respect to 0 with undefined width. Hence, as the result of the interval extension of Eq. (4), in the general case we get

$$[\underline{a}, \overline{a}][\underline{x}, \overline{x}] - [\underline{b}, \overline{b}] = [-y, y]. \tag{7}$$
In fact, the right hand side of Eq. (7) is an interval centered around zero, which can be treated as the interval extension of the right hand side of Eq. (4), in other words, as the interval extension of 0. This is why we call our approach the "interval extended zero" method. Of course, the value of y in Eq. (7) is not yet defined, which seems quite natural since the values of $\underline{x}$, $\overline{x}$ are not defined either.
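The two facts used above, that $[a] - [a]$ is always symmetric about 0 and that the left hand side of Eq. (7) can be made symmetric by a suitable choice of $[x]$, can be illustrated with a minimal sketch. The function names are hypothetical, intervals are `(lower, upper)` tuples, and the multiplication helper assumes positive intervals only.

```python
def interval_sub(u, v):
    """Conventional interval subtraction [u] - [v]."""
    return (u[0] - v[1], u[1] - v[0])


def interval_mul_pos(u, v):
    """Product of two positive intervals (both lower bounds > 0)."""
    return (u[0] * v[0], u[1] * v[1])


def residual(a, x, b):
    """Left hand side of Eq. (7), [a][x] - [b], for positive [a] and [x]."""
    return interval_sub(interval_mul_pos(a, x), b)


def is_symmetric(z, tol=1e-9):
    """True if the interval z is centered at 0 up to a tolerance."""
    return abs(z[0] + z[1]) <= tol


if __name__ == "__main__":
    a = (0.0, 1000.0)
    print(interval_sub(a, a))                      # interval zero from [a] - [a]
    shifted = (a[0] + 1e-20, a[1] + 1e-20)
    print(interval_sub(a, shifted))                # practically the same interval
    x = (25 / 60, 124 / 60 - 14 * 25 / 60**2)      # an illustrative choice of [x]
    print(residual((14, 60), x, (25, 99)))         # symmetric about 0
```

The last line uses an illustrative $[x]$ that satisfies $\underline{a}\,\underline{x} + \overline{a}\,\overline{x} = \underline{b} + \overline{b}$, which is exactly the condition for the residual to be centered at zero.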
3 Solution of Linear Interval Equations Using the "Interval Extended Zero" Method
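First, consider the case of positive interval numbers $[a]$ and $[b]$, i.e., $\underline{a}, \overline{a}, \underline{b}, \overline{b} > 0$. Then from Eq. (7) we get

$$\underline{a}\,\underline{x} - \overline{b} = -y, \quad \overline{a}\,\overline{x} - \underline{b} = y. \tag{8}$$

Summing the two equations of (8), we obtain a single linear equation with two unknown variables $\underline{x}$ and $\overline{x}$:

$$\underline{a}\,\underline{x} + \overline{a}\,\overline{x} - \underline{b} - \overline{b} = 0. \tag{9}$$

If there are some restrictions on the values of the unknown variables $\underline{x}$ and $\overline{x}$, then Eq. (9) with these restrictions may be considered a so-called Constraint Satisfaction Problem (CSP) [6] and an interval solution may be obtained. The first restriction on the variables $\underline{x}$ and $\overline{x}$ is the solution of Eq. (9) assuming $\underline{x} = \overline{x}$. In this degenerated case we get the solution of Eq. (9) as $x_m = \frac{\underline{b} + \overline{b}}{\underline{a} + \overline{a}}$. It is easy to see that $x_m$ is an upper bound for $\underline{x}$ and a lower bound for $\overline{x}$ (if $\underline{x} > x_m$ or $\overline{x} < x_m$ we get an inverted solution of Eq. (9)). The natural lower bound for $\underline{x}$ and upper bound for $\overline{x}$ may be defined using the basic definitions of interval arithmetic [11] as $\underline{x} = \underline{b}/\overline{a}$, $\overline{x} = \overline{b}/\underline{a}$. Thus, we have $[\underline{x}] = [\underline{b}/\overline{a}, x_m]$ and $[\overline{x}] = [x_m, \overline{b}/\underline{a}]$. These intervals can be narrowed by taking into account Eq. (9), which in the spirit of CSP is treated as a restriction. It is clear that the right bound of $[\underline{x}]$ and the left bound of $[\overline{x}]$, i.e., $x_m$, cannot be changed, as they represent the degenerated (crisp) solution of (9). So let us focus on the left bound of $[\underline{x}]$ and the right bound of $[\overline{x}]$. From (9) we have

$$\underline{x} = \frac{\underline{b} + \overline{b} - \overline{a}\,\overline{x}}{\underline{a}}, \quad \overline{x} \in \left[x_m, \frac{\overline{b}}{\underline{a}}\right], \qquad \overline{x} = \frac{\underline{b} + \overline{b} - \underline{a}\,\underline{x}}{\overline{a}}, \quad \underline{x} \in \left[\frac{\underline{b}}{\overline{a}}, x_m\right]. \tag{10}$$

Obviously, when $\overline{x}$ is maximal, i.e., $\overline{x} = \overline{b}/\underline{a}$, we get the minimal value of $\underline{x}$, i.e., $\underline{x}_{\min} = \frac{\underline{b} + \overline{b}}{\underline{a}} - \frac{\overline{a}\,\overline{b}}{\underline{a}^2}$. Similarly, from (10) we get the maximal value of $\overline{x}$, i.e., $\overline{x}_{\max} = \frac{\underline{b} + \overline{b}}{\overline{a}} - \frac{\underline{a}\,\underline{b}}{\overline{a}^2}$. Since it is possible that $\underline{x}_{\min} < \underline{b}/\overline{a}$ and $\overline{x}_{\max} > \overline{b}/\underline{a}$, we get the following interval solution: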
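$$[\underline{x}] = \left[\underline{x}_{\max}, \frac{\underline{b} + \overline{b}}{\underline{a} + \overline{a}}\right], \quad [\overline{x}] = \left[\frac{\underline{b} + \overline{b}}{\underline{a} + \overline{a}}, \overline{x}_{\min}\right], \tag{11}$$

where $\underline{x}_{\max} = \max\left(\frac{\underline{b}}{\overline{a}}, \frac{\underline{b} + \overline{b}}{\underline{a}} - \frac{\overline{a}\,\overline{b}}{\underline{a}^2}\right)$ and $\overline{x}_{\min} = \min\left(\frac{\overline{b}}{\underline{a}}, \frac{\underline{b} + \overline{b}}{\overline{a}} - \frac{\underline{a}\,\underline{b}}{\overline{a}^2}\right)$. It is important that, in the framework of the Constraint Satisfaction Problem, the following relations between $\underline{x}_{\max}$ and $\overline{x}_{\min}$ be fulfilled in calculations: if $\underline{x}_{\max} = \frac{\underline{b}}{\overline{a}}$, then $\overline{x}_{\min} = \frac{\underline{b} + \overline{b}}{\overline{a}} - \frac{\underline{a}\,\underline{b}}{\overline{a}^2}$; if $\overline{x}_{\min} = \frac{\overline{b}}{\underline{a}}$, then $\underline{x}_{\max} = \frac{\underline{b} + \overline{b}}{\underline{a}} - \frac{\overline{a}\,\overline{b}}{\underline{a}^2}$. Expressions (11) define all possible solutions of Eq. (7). The values $\underline{x}_{\max}$, $\overline{x}_{\min}$ constitute the interval which produces the widest interval zero after its substitution into Eq. (7). In other words, the maximum width of the interval solution, $w_{\max} = \overline{x}_{\min} - \underline{x}_{\max}$, corresponds to the maximum value of y: $y_{\max} = \frac{\overline{a}\,\overline{b}}{\underline{a}} - \underline{b}$ (this expression corresponds to $\overline{x}_{\min} = \overline{b}/\underline{a}$; in general, by Eq. (8), $y_{\max} = \overline{a}\,\overline{x}_{\min} - \underline{b}$). Substitution of the degenerated solution $\underline{x} = \overline{x} = x_m$ into Eq. (7) produces the minimum value of y: $y_{\min} = \frac{\overline{a}\,\overline{b} - \underline{a}\,\underline{b}}{\underline{a} + \overline{a}}$. It is clear that for any permissible solution $\underline{x}' > \underline{x}_{\max}$ we have a corresponding $\overline{x}' < \overline{x}_{\min}$, and for each $\underline{x}'' > \underline{x}'$ the inequalities $\overline{x}'' < \overline{x}'$ and $y'' < y'$ hold. Thus, the formal interval solution (11) in fact represents a continuous set of nested interval solutions of Eq. (9). Below we show that this set of interval solutions can be naturally interpreted as a fuzzy number. We can see that the values of y characterize the closeness of the right hand side of Eq. (7) to the degenerated zero, and that the minimum value $y_{\min}$ is defined exclusively by the interval parameters $[a]$ and $[b]$. Hence, the values of y may be considered, in a certain sense, a measure of the uncertainty of the interval solution caused by the initial uncertainty of Eq. (7). Therefore we introduce

$$\alpha = 1 - \frac{y - y_{\min}}{y_{\max} - y_{\min}}, \tag{12}$$

which may be treated as a certainty degree of the interval solution of Eq. (7). We can see that α rises from 0 to 1 as the interval's width decreases from its maximum value to 0, i.e., as the certainty of the solution increases. Consequently, the values of α may be treated as labels of α-cuts representing some fuzzy solution of Eq. (7). Finally, we obtain a solution in the form of a triangular fuzzy number

$$\tilde{x} = \left\{\underline{x}_{\max}, \frac{\underline{b} + \overline{b}}{\underline{a} + \overline{a}}, \overline{x}_{\min}\right\}. \tag{13}$$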
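The construction of Eqs. (8)-(13) for positive intervals can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, intervals are `(lower, upper)` tuples, and y at the support is taken from Eq. (8) as the upper bound of [a] times the upper support endpoint, minus the lower bound of [b].

```python
def fuzzy_solution_positive(a, b):
    """Triangular fuzzy solution of [a][x] = [b] for [a] > 0, [b] > 0.

    Returns (left support end, peak x_m, right support end), Eqs. (11), (13).
    """
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    s = b_lo + b_hi
    x_m = s / (a_lo + a_hi)
    left = max(b_lo / a_hi, s / a_lo - a_hi * b_hi / a_lo**2)
    right = min(b_hi / a_lo, s / a_hi - a_lo * b_lo / a_hi**2)
    return left, x_m, right


def alpha_cut(a, b, alpha):
    """Interval solution at certainty level alpha in [0, 1], via Eqs. (8), (12)."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    _, x_m, right = fuzzy_solution_positive(a, b)
    y_max = a_hi * right - b_lo                    # Eq. (8) at the support
    y_min = a_hi * x_m - b_lo                      # Eq. (8) at the crisp solution
    y = y_min + (1.0 - alpha) * (y_max - y_min)    # Eq. (12) solved for y
    return ((b_hi - y) / a_lo, (b_lo + y) / a_hi)  # Eq. (8) solved for the bounds


if __name__ == "__main__":
    a, b = (14, 60), (25, 99)
    print(fuzzy_solution_positive(a, b))
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, alpha_cut(a, b, alpha))
```

At α = 0 the cut coincides with the support of the fuzzy number, and at α = 1 it collapses to the crisp solution $x_m$, so the cuts form the nested family described in the text.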
This result needs some comments. Using the proposed approach to the solution of an interval linear equation based on Eq. (7), we obtain a triangular fuzzy number whose support is in all cases included in the crisp interval obtained from the conventional interval extension of Eq. (3). At first glance, it seems somewhat surprising to have a fuzzy solution of a crisp interval equation. The explanation is that the proposed "interval extended zero" method is based on some restricting assumptions. The most important of them is the introduction of "interval zero" as an interval centered around 0. As a consequence we obtain a solution in the form of a triangular fuzzy number, which is a more certain result than the crisp interval representing its support, since such a fuzzy number carries more information about possible real-valued solutions.

In a similar way, fuzzy solutions of Eq. (7) were obtained for other placements of the intervals $[a]$ and $[b]$. In these cases, interval extensions of Eq. (4) usually result in expressions different from Eq. (8), and Eq. (9) takes another form too.

In the case of $[a] < 0$, $[b] > 0$, i.e., $\underline{a}, \overline{a} < 0$, $\underline{b}, \overline{b} > 0$, we get $\underline{a}\,\underline{x} - \underline{b} + \overline{a}\,\overline{x} - \overline{b} = 0$, $\tilde{x} = \left\{\underline{x}_{\max}, \frac{\underline{b} + \overline{b}}{\underline{a} + \overline{a}}, \overline{x}_{\min}\right\}$, $\underline{x}_{\max} = \max\left(\frac{\overline{b}}{\overline{a}}, \frac{\underline{b} + \overline{b}}{\underline{a}} - \frac{\overline{a}\,\underline{b}}{\underline{a}^2}\right)$, $\overline{x}_{\min} = \min\left(\frac{\underline{b}}{\underline{a}}, \frac{\underline{b} + \overline{b}}{\overline{a}} - \frac{\underline{a}\,\overline{b}}{\overline{a}^2}\right)$.
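In the case of $[a] > 0$, $[b] < 0$, i.e., $\underline{a}, \overline{a} > 0$, $\underline{b}, \overline{b} < 0$, we get $\overline{a}\,\underline{x} - \overline{b} + \underline{a}\,\overline{x} - \underline{b} = 0$, $\tilde{x} = \left\{\underline{x}_{\max}, \frac{\underline{b} + \overline{b}}{\underline{a} + \overline{a}}, \overline{x}_{\min}\right\}$, $\underline{x}_{\max} = \max\left(\frac{\underline{b}}{\underline{a}}, \frac{\underline{b} + \overline{b}}{\overline{a}} - \frac{\underline{a}\,\overline{b}}{\overline{a}^2}\right)$, $\overline{x}_{\min} = \min\left(\frac{\overline{b}}{\overline{a}}, \frac{\underline{b} + \overline{b}}{\underline{a}} - \frac{\overline{a}\,\underline{b}}{\underline{a}^2}\right)$.

In the case of $[a] < 0$, $[b] < 0$, i.e., $\underline{a}, \overline{a} < 0$, $\underline{b}, \overline{b} < 0$, we get $\overline{a}\,\underline{x} - \underline{b} + \underline{a}\,\overline{x} - \overline{b} = 0$, $\tilde{x} = \left\{\underline{x}_{\max}, \frac{\underline{b} + \overline{b}}{\underline{a} + \overline{a}}, \overline{x}_{\min}\right\}$, $\underline{x}_{\max} = \max\left(\frac{\overline{b}}{\underline{a}}, \frac{\underline{b} + \overline{b}}{\overline{a}} - \frac{\underline{a}\,\underline{b}}{\overline{a}^2}\right)$, $\overline{x}_{\min} = \min\left(\frac{\underline{b}}{\overline{a}}, \frac{\underline{b} + \overline{b}}{\underline{a}} - \frac{\overline{a}\,\overline{b}}{\underline{a}^2}\right)$.

In the case of $[a] > 0$, $0 \in [b]$, we get $\overline{a}\,\underline{x} - \overline{b} + \overline{a}\,\overline{x} - \underline{b} = 0$, $\tilde{x} = \left\{\underline{x}_{\max}, \frac{\underline{b} + \overline{b}}{2\overline{a}}, \overline{x}_{\min}\right\}$, $\underline{x}_{\max} = \max\left(\frac{\underline{b}}{\underline{a}}, \frac{\underline{b} + \overline{b}}{\overline{a}} - \frac{\overline{b}}{\underline{a}}\right)$, $\overline{x}_{\min} = \min\left(\frac{\overline{b}}{\underline{a}}, \frac{\underline{b} + \overline{b}}{\overline{a}} - \frac{\underline{b}}{\underline{a}}\right)$.

In the case of $[a] < 0$, $0 \in [b]$, we get $\underline{a}\,\overline{x} - \overline{b} + \underline{a}\,\underline{x} - \underline{b} = 0$, $\tilde{x} = \left\{\underline{x}_{\max}, \frac{\underline{b} + \overline{b}}{2\underline{a}}, \overline{x}_{\min}\right\}$, $\underline{x}_{\max} = \max\left(\frac{\overline{b}}{\overline{a}}, \frac{\underline{b} + \overline{b}}{\underline{a}} - \frac{\underline{b}}{\overline{a}}\right)$, $\overline{x}_{\min} = \min\left(\frac{\underline{b}}{\overline{a}}, \frac{\underline{b} + \overline{b}}{\underline{a}} - \frac{\overline{b}}{\overline{a}}\right)$.

Obviously, we can take the support of the obtained fuzzy number as a solution of the analyzed problem. Such a solution may be treated as the "pessimistic" one, since it corresponds to the lowest α-level of the resulting fuzzy value. We use the word "pessimistic" to emphasize that this solution carries the largest imprecision, as it is obtained under the most uncertain conditions possible on the set of considered α-levels. On the other hand, it seems natural to utilize all the additional information available in the fuzzy solution. We can reduce the resulting fuzzy solution to an interval solution using well-known defuzzification procedures. In our case, the defuzzified left and right boundaries of the solution can be represented as

$$\underline{x}_{def} = \frac{\int_0^1 \underline{x}(\alpha)\,d\alpha}{\int_0^1 d\alpha}, \quad \overline{x}_{def} = \frac{\int_0^1 \overline{x}(\alpha)\,d\alpha}{\int_0^1 d\alpha}. \tag{14}$$

For example, in the case of $[a], [b] > 0$, from (7) and (12) we get the expressions for $\underline{x}(\alpha)$ and $\overline{x}(\alpha)$. Substituting them into Eqs. (14) we have $\underline{x}_{def} = \frac{\overline{b}}{\underline{a}} - \frac{y_{\max} + y_{\min}}{2\underline{a}}$, $\overline{x}_{def} = \frac{\underline{b}}{\overline{a}} + \frac{y_{\max} + y_{\min}}{2\overline{a}}$. It is easy to prove that the obtained interval $[\underline{x}_{def}, \overline{x}_{def}]$ is included in the support interval of the initial fuzzy solution, i.e., $[\underline{x}_{def}, \overline{x}_{def}] \subset [\underline{x}_{\max}, \overline{x}_{\min}]$.

To illustrate, let us consider a simple example: $[a] = [14, 60]$, $[b] = [25, 99]$. In conventional interval arithmetic, Eq. (6) gives $[\underline{x}, \overline{x}] = [0.42, 7.07]$. Using the new method we get $[\underline{x}_{\max}, \overline{x}_{\min}] = [0.42, 1.97]$ and $[\underline{x}_{def}, \overline{x}_{def}] = [1.04, 1.82]$. It is easy to see that $[\underline{x}_{def}, \overline{x}_{def}] \subset [\underline{x}_{\max}, \overline{x}_{\min}] \subset [\underline{x}, \overline{x}]$. Moreover, the length of $[\underline{x}, \overline{x}]$ is 4.3 times that of $[\underline{x}_{\max}, \overline{x}_{\min}]$ and 8.5 times that of $[\underline{x}_{def}, \overline{x}_{def}]$. Thus, the new method considerably reduces the length of the resulting interval in comparison with that obtained using conventional interval arithmetic rules.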
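The numerical example above can be reproduced with a short sketch of the defuzzification in Eq. (14). It assumes that the α-cut endpoints of a triangular fuzzy number vary linearly with α, which follows from Eqs. (8) and (12); the function name `defuzzify_triangular` and the midpoint-rule integration are illustrative choices, not taken from the paper.

```python
def defuzzify_triangular(left, peak, right, steps=1000):
    """Defuzzified interval of Eq. (14) for a triangular fuzzy number.

    The alpha-cut is [left + alpha*(peak-left), right - alpha*(right-peak)];
    both integrals over alpha in [0, 1] are approximated by the midpoint rule.
    """
    lo_sum = hi_sum = 0.0
    for k in range(steps):
        alpha = (k + 0.5) / steps
        lo_sum += left + alpha * (peak - left)
        hi_sum += right - alpha * (right - peak)
    return lo_sum / steps, hi_sum / steps


if __name__ == "__main__":
    a_lo, a_hi, b_lo, b_hi = 14.0, 60.0, 25.0, 99.0
    conventional = (b_lo / a_hi, b_hi / a_lo)                      # Eq. (6)
    left = b_lo / a_hi                                             # support, Eq. (11)
    right = (b_lo + b_hi) / a_hi - a_lo * b_lo / a_hi**2
    peak = (b_lo + b_hi) / (a_lo + a_hi)
    x_def = defuzzify_triangular(left, peak, right)
    print("conventional:", conventional)
    print("support:     ", (left, right))
    print("defuzzified: ", x_def)
```

Because the endpoints are linear in α, each defuzzified bound is simply the average of the corresponding support endpoint and the peak, which gives a quick sanity check on the integration.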
4 Conclusion
The aim of the paper is to present a new approach to solving interval and fuzzy equations, based on a generalized procedure of interval and fuzzy extension called the "interval extended zero" method. The key idea is the treatment of "interval zero" as an interval symmetrical with respect to 0. It is shown that this approach is a direct consequence of the interval subtraction operation. The method makes it possible to solve some formerly unresolved methodological problems in applied interval analysis and fuzzy arithmetic. An advantage of the new method that is important in practice is that it provides substantially narrower solutions than conventional methods.
References
1. Abbasbandy, S., Asady, B.: Newton's method for solving fuzzy nonlinear equations. Applied Mathematics and Computation 159, 349–356 (2004)
2. Abbasbandy, S.: Extended Newton's method for a system of nonlinear equations by modified Adomian decomposition method. Applied Mathematics and Computation 170, 648–656 (2005)
3. Buckley, J.J., Qu, Y.: Solving linear and quadratic fuzzy equations. Fuzzy Sets and Systems 38, 43–59 (1990)
4. Buckley, J.J., Eslami, E.: Neural net solutions to fuzzy problems: The quadratic equation. Fuzzy Sets and Systems 86, 289–298 (1997)
5. Buckley, J.J., Eslami, E., Hayashi, Y.: Solving fuzzy equations using neural nets. Fuzzy Sets and Systems 86, 271–278 (1997)
6. Cleary, J.C.: Logical Arithmetic. Future Computing Systems 2, 125–149 (1987)
7. Dymova, L., Gonera, M., Sevastianov, P., Wyrzykowski, R.: New method for interval extension of Leontief's input-output model with use of parallel programming. In: Proceedings of the International Conf. on Fuzzy Sets and Soft Computing in Economics and Finance (FSSCEF), St. Petersburg, Russia, pp. 549–556 (2004)
8. Gardeñes, E., Mielgo, H., Trepat, A.: Modal intervals: Reasons and ground semantics. In: Nickel, K. (ed.) Interval Mathematics. LNCS, vol. 212, pp. 27–35. Springer, Berlin (1985)
9. Hanss, M., Klimke, A.: On the reliability of the influence measure in the transformation method of fuzzy arithmetic. Fuzzy Sets and Systems 143, 371–390 (2004)
10. Jaulin, L., Kieffer, M., Didrit, O., Walter, E.: Applied Interval Analysis. Springer, London (2001)
11. Moore, R.E.: Interval Analysis. Prentice-Hall, Englewood Cliffs (1966)
On Checking the Monotonicity of Parametric Interval Solution of Linear Structural Systems
Iwona Skalna
AGH University of Science & Technology, Department of Applied Computer Science, ul. Gramatyka 10, 30-067 Cracow, Poland
[email protected]
Abstract. If the solution of a parametric linear system is monotone as a function of interval parameters, then an interval hull of parametric solution can be computed easily. Some attempts to solve the problem of checking the monotonicity of parametric solution have been made in the literature. However, no complete algorithm has been given; only some directions for further research were presented. In this paper some investigations on checking monotonicity of parametric interval solution have been made. A method based on author’s earlier research is presented. Some illustrative examples of structural mechanical systems are included to check the performance of the method. Keywords: parametric linear systems, monotonicity, hull solution.
1 Introduction
In structural analysis many factors give rise to uncertainty or imprecision. They are connected either with external factors, such as boundary conditions or applied loads, or with internal factors, such as mechanical or geometric characteristics [1,6,7,8,9,21]. Imprecise values can be modelled using probability distributions, compact intervals or fuzzy numbers; these are the most popular approaches. Problems involving uncertain parameters should be solved properly, so as to bound all possible responses of the mechanical system. This paper focuses on linear systems of structural mechanics with interval parameters. Methods for solving parametric linear systems have been developed in recent years [4,3,12,11,19,18]. To solve a system means to enclose the parametric solution by an interval vector. The quality of the enclosure depends on the number and the width of the interval parameters. The narrowest possible enclosure is called the interval hull solution. The combinatorial approach and the monotonicity approach have been favoured by many authors in solving linear problems [14,15]. The combinatorial solution is computed as a convex hull of the solutions to $2^k$ (k is the number of interval parameters) real linear systems corresponding to all combinations of the endpoints of the parameter intervals. Many numerical examples show that this method yields very good results when the parameter intervals are relatively narrow.

R. Wyrzykowski et al. (Eds.): PPAM 2007, LNCS 4967, pp. 1400–1409, 2008. © Springer-Verlag Berlin Heidelberg 2008
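The combinatorial approach described above can be sketched in a few lines. The example system, two springs in series with stiffnesses k1 and k2 and a unit load on the free node, is a hypothetical illustration and not one of the paper's test structures; `solve2`, `stiffness` and `combinatorial_hull` are assumed names, and the 2x2 solver uses Cramer's rule to keep the sketch free of external libraries.

```python
from itertools import product


def solve2(A, b):
    """Solve a 2x2 real linear system A x = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - b[0] * A[1][0]) / det]


def stiffness(p):
    """Stiffness matrix of two springs in series (hypothetical example)."""
    k1, k2 = p
    return [[k1 + k2, -k2], [-k2, k2]]


def combinatorial_hull(build_A, b, p_box):
    """Bounds over the solutions at all 2**k endpoint combinations of p_box."""
    lows = highs = None
    for p in product(*p_box):
        x = solve2(build_A(p), b)
        lows = x if lows is None else [min(l, v) for l, v in zip(lows, x)]
        highs = x if highs is None else [max(h, v) for h, v in zip(highs, x)]
    return list(zip(lows, highs))


if __name__ == "__main__":
    p_box = [(0.95, 1.05), (1.9, 2.1)]        # +/-5% around k1 = 1, k2 = 2
    for bounds in combinatorial_hull(stiffness, [0.0, 1.0], p_box):
        print(bounds)
```

For this small system the displacements are 1/k1 and 1/k1 + 1/k2, both monotone in the parameters, so the endpoint combinations give the exact hull; in general the combinatorial result is only an inner estimate, and its cost grows as $2^k$.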
When the parametric solution is monotone with respect to all the parameters, the hull of the solution set can be computed by solving at most 2n real systems. In general, however, it is very difficult to check the monotonicity of the solution. Some attempts to solve this problem by checking the sign of the derivatives of the parametric solution as a function of the parameters have been described in [5,13,16]. However, no complete algorithm has been given; only some directions for further research were presented. In this paper a new method for calculating enclosures for the derivatives of the parametric solution is presented. The new method, called here the Method for Checking the Monotonicity (MCM for short), is based on the parametric Direct Method developed by I. Skalna [18]. The paper is organized as follows. The second section contains preliminaries on solving parametric interval linear systems with two disjoint sets of parameters. In the third section, the monotonicity approach for computing the interval hull solution is outlined. This is followed by a description of the MCM method. Next, some illustrative examples of truss structures and the results of computational experiments are presented. The paper ends with concluding remarks.
2 Preliminaries
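Italic faces will be used for real quantities, while bold italic faces will denote their interval counterparts. Let $\mathbb{IR}$ denote the set of real compact intervals $\mathbf{x} = [\underline{x}, \overline{x}] = \{x \in \mathbb{R} \mid \underline{x} \le x \le \overline{x}\}$. For two intervals $\mathbf{a}, \mathbf{b} \in \mathbb{IR}$, $\mathbf{a} \ge \mathbf{b}$, $\mathbf{a} \le \mathbf{b}$ and $\mathbf{a} = \mathbf{b}$ will mean that, resp., $\underline{a} \ge \overline{b}$, $\overline{a} \le \underline{b}$, and $\underline{a} = \underline{b} \wedge \overline{a} = \overline{b}$. $\mathbb{IR}^n$ will denote interval vectors and $\mathbb{IR}^{n \times n}$ square interval matrices [10]. The midpoint $\check{x} = (\underline{x} + \overline{x})/2$ and the radius $r(\mathbf{x}) = (\overline{x} - \underline{x})/2$ are applied to interval vectors and matrices componentwise.

Consider the linear algebraic system

$$A(p)x(p) = b(p) \tag{1}$$

with coefficients being affine-linear functions of a vector $p \in \mathbb{R}^k$:

$$a_{ij}(p) = \alpha_{ij0} + \alpha_{ij}^T \cdot p, \quad b_j(p) = \beta_{j0} + \beta_j^T \cdot p, \tag{2}$$

where $\alpha_{ij0}, \beta_{j0} \in \mathbb{R}$, $\alpha_{ij} = \{\alpha_{ij\nu}\} \in \mathbb{R}^k$, $\beta_j = \{\beta_{j\nu}\} \in \mathbb{R}^k$, $i, j = 1, \dots, n$. In practice, the right-hand vector and the matrix elements depend on two disjoint sets of parameters $p \in \mathbb{R}^k$, $q \in \mathbb{R}^l$. Hence, the system (1) can be replaced by the system

$$A(p)x(p, q) = b(q), \tag{3}$$

and the linear dependencies (2) by

$$a_{ij}(p) = \alpha_{ij0} + \alpha_{ij}^T \cdot p, \quad b_j(q) = \beta_{j0} + \beta_j^T \cdot q, \tag{4}$$

where $\alpha_{ij0}, \beta_{j0} \in \mathbb{R}$, $\alpha_{ij} = \{\alpha_{ij\nu}\} \in \mathbb{R}^k$, $\beta_j = \{\beta_{j\nu}\} \in \mathbb{R}^l$, $i, j = 1, \dots, n$.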
Now assume that some model parameters are unknown. The real vectors p and q are replaced by interval vectors $\mathbf{p}$ and $\mathbf{q}$ (the real elements are represented by point intervals). This gives a family of systems

$$A(p)x(p, q) = b(q), \quad p \in \mathbf{p},\ q \in \mathbf{q}, \tag{5}$$

which is usually written in the symbolic compact form

$$A(\mathbf{p})x(\mathbf{p}, \mathbf{q}) = b(\mathbf{q}), \tag{6}$$

and is called the parametric interval linear system. The parametric (united) solution set of the system (6) is defined [2,3,12,17] as

$$S(\mathbf{p}, \mathbf{q}) = \{x \mid \exists p \in \mathbf{p},\ \exists q \in \mathbf{q},\ A(p)x(p, q) = b(q)\}. \tag{7}$$

If the solution set $S = S(\mathbf{p}, \mathbf{q})$ is bounded, then its interval hull exists and is defined as

$$\square S = [\inf S, \sup S] = \bigcap \{\mathbf{y} \in \mathbb{IR}^n \mid S \subseteq \mathbf{y}\}.$$

$\square S$ is called the interval hull solution. In order to guarantee that the solution set is bounded, the matrix $A(\mathbf{p})$ must be regular, i.e., $A(p)$ must be non-singular for all parameters $p \in \mathbf{p}$. Two other solutions are defined for parametric linear systems [5]. Any vector $\mathbf{x} = [\underline{x}, \overline{x}] \in \mathbb{IR}^n$ such that

$$\inf_{s \in S} s_i \le \underline{x}_i \quad \text{and} \quad \sup_{s \in S} s_i \ge \overline{x}_i, \quad i = 1, \dots, n,$$

is referred to as the inner solution (approximation [12]) for (6). Respectively, any vector $\mathbf{x} = [\underline{x}, \overline{x}] \in \mathbb{IR}^n$ such that

$$\inf_{s \in S} s_i \ge \underline{x}_i \quad \text{and} \quad \sup_{s \in S} s_i \le \overline{x}_i, \quad i = 1, \dots, n,$$

is referred to as the outer solution (approximation) for (6). The quality of the outer approximation can be estimated by means of the inner approximation.
3 Monotonicity Approach
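When the parametric solution is monotone with respect to all parameters $p_i$ ($i = 1, \dots, k$) and $q_j$ ($j = 1, \dots, l$), the interval hull solution can be computed by solving at most 2n real systems. Let $E^s = \{e \in \mathbb{R}^s \mid e_i \in \{-1, 0, 1\},\ i = 1, \dots, s\}$. For $\mathbf{a} \in \mathbb{IR}^n$ and $e \in E^n$, let $a_i^{e_i} = \underline{a}_i$ if $e_i = -1$, $a_i^{e_i} = \check{a}_i$ if $e_i = 0$, and $a_i^{e_i} = \overline{a}_i$ if $e_i = 1$.

Theorem 1. Let $A(\mathbf{p})$ be regular and let the functions $x_i(p, q) = \{A^{-1}(p) \cdot b(q)\}_i$ be monotone on the interval boxes $\mathbf{p} \in \mathbb{IR}^k$, $\mathbf{q} \in \mathbb{IR}^l$ with respect to each parameter $p_m$ ($m = 1, \dots, k$), $q_r$ ($r = 1, \dots, l$). Then

$$\square S(\mathbf{p}, \mathbf{q})_i = \left[\{A(p^{-e})^{-1} b(q^{-f})\}_i,\ \{A(p^{e})^{-1} b(q^{f})\}_i\right], \tag{8}$$

where $e_\nu = \operatorname{sign} \frac{\partial x_i}{\partial p_\nu}(p, q)$, $f_\mu = \operatorname{sign} \frac{\partial x_i}{\partial q_\mu}(p, q)$.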
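Theorem 1 can be illustrated with a minimal sketch that, given the derivative signs e and f for one solution component, builds the two real systems of Eq. (8) and solves them. The two-spring system with an uncertain load is a hypothetical example, all function names are assumptions made for illustration, and the signs are supplied by hand rather than verified.

```python
def solve2(A, b):
    """Solve a 2x2 real linear system A x = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - b[0] * A[1][0]) / det]


def pick(box, signs):
    """Parameter vector selected by a sign vector: lower endpoint for -1,
    midpoint for 0, upper endpoint for +1."""
    return [lo if s < 0 else hi if s > 0 else 0.5 * (lo + hi)
            for (lo, hi), s in zip(box, signs)]


def hull_component(i, build_A, build_b, p_box, q_box, e, f):
    """i-th component of the interval hull according to Eq. (8)."""
    neg_e, neg_f = [-s for s in e], [-s for s in f]
    lower = solve2(build_A(pick(p_box, neg_e)), build_b(pick(q_box, neg_f)))[i]
    upper = solve2(build_A(pick(p_box, e)), build_b(pick(q_box, f)))[i]
    return lower, upper


def stiffness(p):
    """Two springs in series (hypothetical example system)."""
    k1, k2 = p
    return [[k1 + k2, -k2], [-k2, k2]]


def load(q):
    """Load vector: force q[0] applied to the free node."""
    return [0.0, q[0]]


if __name__ == "__main__":
    p_box, q_box = [(0.95, 1.05), (1.9, 2.1)], [(0.9, 1.1)]
    # Tip displacement decreases with both stiffnesses and increases with load.
    print(hull_component(1, stiffness, load, p_box, q_box, e=[-1, -1], f=[1]))
```

Only two real systems per component are solved, compared with $2^{k+l}$ in the combinatorial approach; the hard part, which the sketch does not address, is proving the signs.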
Now consider the family of parametric linear equations (5) and assume that $a_{ij}(p)$ and $b_i(q)$ ($i, j = 1, \dots, n$) are continuously differentiable functions of, resp., p and q. Global monotonicity properties of the solution with respect to each parameter $p_m$, $q_r$ can be verified by checking the sign of the derivatives $\frac{\partial x}{\partial p_m}(p, q)$, $\frac{\partial x}{\partial q_r}(p, q)$ on the domains $\mathbf{p}$ and $\mathbf{q}$. Differentiation of (5) with respect to $p_m$ ($m = 1, \dots, k$) and $q_r$ ($r = 1, \dots, l$) results in

$$A(p)\frac{\partial x}{\partial p_m}(p, q) = -\frac{\partial A(p)}{\partial p_m}\,x(p, q), \tag{9}$$

$$A(p)\frac{\partial x}{\partial q_r}(p, q) = \frac{\partial b}{\partial q_r}(q). \tag{10}$$

The following estimation of the derivatives $\frac{\partial x}{\partial p_m}(p, q)$ ($m = 1, \dots, k$) and $\frac{\partial x}{\partial q_r}(p, q)$ ($r = 1, \dots, l$) has been proposed in [16]:

$$\frac{\partial x}{\partial p_m}(p, q) \subseteq \mathbf{B}\,\Delta_m\,\mathbf{x}^*, \quad p \in \mathbf{p},\ q \in \mathbf{q}, \tag{11}$$

$$\frac{\partial x}{\partial q_r}(p, q) \subseteq \mathbf{B}\,\delta_r, \quad p \in \mathbf{p},\ q \in \mathbf{q}, \tag{12}$$

where $\mathbf{B}$ approximates $\square\{A^{-1}(p) \mid p \in \mathbf{p}\}$, $\Delta_m = -\frac{\partial A}{\partial p_m}(p)$, $\delta_r = \frac{\partial b}{\partial q_r}(q)$ and $\mathbf{x}^*$ is the initial enclosure for the parametric solution set (7). For the family of systems originating from equation (1), the derivatives can be estimated, as suggested by Kolev [5], by means of a formula similar to (12):

$$\frac{\partial x}{\partial p_m}(p) \subseteq \mathbf{B}(\delta_m - \Delta_m\,\mathbf{x}^*), \quad p \in \mathbf{p}. \tag{13}$$

The main drawback of the approach involving approximation of the inverse matrix is that, when transforming equations (9) and (10) into (11) and (12), information about the system dependencies is lost. In [13] it is stated that global monotonicity of the parametric solution of the system $A(p)x(p) = b(p)$ can be verified by solving k parametric linear systems in the global domain $\mathbf{p} \in \mathbb{IR}^k$:

$$A(p)\frac{\partial x(p)}{\partial p_m} = \frac{\partial b(p)}{\partial p_m} - \frac{\partial A(p)}{\partial p_m}\,\mathbf{x}^*.$$

However, it is not clearly explained how to handle $\mathbf{x}^*$ in the context of solving parametric linear systems restricted to the domain $\mathbf{p}$.
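Eq. (9) can be made concrete with a pointwise sketch: for one real parameter vector p, the derivative of the solution with respect to $p_m$ is obtained by solving a linear system with the same matrix A(p). This evaluates the derivative only at a single point and is not the interval enclosure discussed above; the two-spring system and all names are hypothetical.

```python
def solve2(A, b):
    """Solve a 2x2 real linear system A x = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - b[0] * A[1][0]) / det]


def dx_dp(A, dA_dpm, x):
    """Derivative of the solution w.r.t. p_m at one point, from Eq. (9):
    A * dx/dp_m = -(dA/dp_m) * x."""
    rhs = [-sum(dA_dpm[i][j] * x[j] for j in range(2)) for i in range(2)]
    return solve2(A, rhs)


if __name__ == "__main__":
    k1, k2 = 1.0, 2.0
    A = [[k1 + k2, -k2], [-k2, k2]]          # two springs in series
    x = solve2(A, [0.0, 1.0])                # displacements under a unit load
    dA_dk1 = [[1.0, 0.0], [0.0, 0.0]]        # constant, since A is affine in p
    dA_dk2 = [[1.0, -1.0], [-1.0, 1.0]]
    print("x      =", x)
    print("dx/dk1 =", dx_dp(A, dA_dk1, x))
    print("dx/dk2 =", dx_dp(A, dA_dk2, x))
```

For this system the displacements are 1/k1 and 1/k1 + 1/k2, so the computed values can be compared with the analytic derivatives. A sign that is constant at sampled points suggests, but does not prove, monotonicity over the whole box.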
4 Description of the MCM Method
In this paper an approach based on a Direct Method [18] for solving parametric linear systems is proposed. Bear in mind that $a_{ij}(p)$, $b_j(q)$ are affine linear functions. This implies that $\frac{\partial a_{ij}}{\partial p_m}$, $\frac{\partial b_i}{\partial q_r}$ are constant on, resp., $\mathbf{p}$ and $\mathbf{q}$. Hence, the approximations of $\frac{\partial x}{\partial p_m}(p, q)$, $\frac{\partial x}{\partial q_r}(p, q)$ can be obtained by solving the following $k$ parametric linear systems

$$A(p)\,\frac{\partial x}{\partial p_m} = b^m(x^*), \tag{14}$$

and $l$ parametric linear systems

$$A(p)\,\frac{\partial x}{\partial q_r} = b^r, \tag{15}$$

where $b^m_j(x^*) = -\alpha_{ijm} x^*_j$, $b^r_j = \beta_{jr}$, $j = 1, \ldots, n$, $x^* \in \mathbf{x}^*$.

For a fixed $i$, $1 \leq i \leq n$, let $D^p_{im}$ denote the estimate for $\frac{\partial x_i}{\partial p_m}(p, q)$, and $D^q_{ir}$ the estimate for $\frac{\partial x_i}{\partial q_r}(p, q)$ obtained by solving equations (14) and (15) using the MCM method. Assume that each estimate $D^p_{im}$, $D^q_{ir}$ ($m = 1, \ldots, k$; $r = 1, \ldots, l$; $i = 1, \ldots, n$) meets one of the following conditions:

$$D^{(\cdot)}_{i(\cdot)} \geq 0 \tag{16a}$$

$$D^{(\cdot)}_{i(\cdot)} \leq 0 \tag{16b}$$

$$D^{(\cdot)}_{i(\cdot)} = 0 \tag{16c}$$
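Based on equations (16a), (16b) and (16c), vectors $\underline{p}^i$, $\overline{p}^i$, $\underline{q}^i$, $\overline{q}^i$ are defined:

$$\underline{p}^i_m = \begin{cases} \underline{p}_m & \text{if } D^p_{im} \geq 0 \\ \overline{p}_m & \text{if } D^p_{im} \leq 0 \\ \check{p}_m & \text{if } D^p_{im} = 0 \end{cases}, \qquad \overline{p}^i_m = \begin{cases} \overline{p}_m & \text{if } D^p_{im} \geq 0 \\ \underline{p}_m & \text{if } D^p_{im} \leq 0 \\ \check{p}_m & \text{if } D^p_{im} = 0 \end{cases}, \qquad m = 1, \ldots, k, \tag{17}$$

$$\underline{q}^i_r = \begin{cases} \underline{q}_r & \text{if } D^q_{ir} \geq 0 \\ \overline{q}_r & \text{if } D^q_{ir} \leq 0 \\ \check{q}_r & \text{if } D^q_{ir} = 0 \end{cases}, \qquad \overline{q}^i_r = \begin{cases} \overline{q}_r & \text{if } D^q_{ir} \geq 0 \\ \underline{q}_r & \text{if } D^q_{ir} \leq 0 \\ \check{q}_r & \text{if } D^q_{ir} = 0 \end{cases}, \qquad r = 1, \ldots, l. \tag{18}$$

Then by Theorem 1 the $i$-th component of the interval hull solution

$$\square S(\mathbf{p}, \mathbf{q})_i = \left[ \left( A(\underline{p}^i)^{-1} b(\underline{q}^i) \right)_i ,\ \left( A(\overline{p}^i)^{-1} b(\overline{q}^i) \right)_i \right].$$

Now suppose that the sign of some derivatives $\frac{\partial x_i}{\partial p_m}(p, q)$, $\frac{\partial x_i}{\partial q_r}(p, q)$ hasn't been determined and let $I = \{1, \ldots, k\}$, $J = \{1, \ldots, l\}$,

$$I^i_+ = \{m \in I \mid D^p_{im} \geq 0\}, \qquad J^i_+ = \{r \in J \mid D^q_{ir} \geq 0\},$$

$$I^i_- = \{m \in I \mid D^p_{im} \leq 0\}, \qquad J^i_- = \{r \in J \mid D^q_{ir} \leq 0\},$$

$$I^i_0 = \{m \in I \mid D^p_{im} = 0\}, \qquad J^i_0 = \{r \in J \mid D^q_{ir} = 0\}.$$

Then for each $i$, $1 \leq i \leq n$, new vectors of parameters

$$p^i = (p_j)_{j \in I \setminus (I^i_+ \cup I^i_- \cup I^i_0)}, \qquad q^i = (q_j)_{j \in J \setminus (J^i_+ \cup J^i_- \cup J^i_0)}$$

are composed and new linear systems $A(p^i)x(p^i, q^i) = b(q^i)$ are considered. The process of determining the sign of derivatives restarts and continues separately for each new system, until no further improvement is obtained. The number of repetitions depends on the number and width of the interval parameters.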
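The endpoint selection of (17) and (18), followed by two real solves per solution component, can be sketched as follows for the case where every derivative sign is already known. The two-spring model, the parameter box, and the function names are hypothetical choices made for this sketch; the value used when a derivative is identically zero is taken to be the interval midpoint, which is an assumption about the meaning of $\check{p}_m$. The sign estimation step itself is not reproduced here.

```python
from fractions import Fraction as F

def solve2(A, b):
    """Solve a 2x2 real system A y = b by Cramer's rule."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    return [(b[0] * a22 - a12 * b[1]) / det,
            (a11 * b[1] - a21 * b[0]) / det]

def stiffness(k1, k2):
    """Hypothetical A(p): two springs in series, one end fixed."""
    return [[k1 + k2, -k2], [-k2, k2]]

def hull_component(i, signs, box, load):
    """Bounds of x_i from two real solves.

    signs[m] is '+', '-' or '0' for the derivative of x_i with respect to
    parameter m; box[m] is the pair (lower, upper) of that parameter.
    """
    lower_pt, upper_pt = [], []
    for s, (lo, hi) in zip(signs, box):
        mid = (lo + hi) / 2              # assumed value for a zero derivative
        lower_pt.append({"+": lo, "-": hi, "0": mid}[s])
        upper_pt.append({"+": hi, "-": lo, "0": mid}[s])
    x_lo = solve2(stiffness(*lower_pt), load)[i]
    x_hi = solve2(stiffness(*upper_pt), load)[i]
    return x_lo, x_hi

box = [(F(19, 10), F(21, 10)), (F(38, 10), F(42, 10))]   # k1, k2 intervals
load = [F(0), F(8)]

# For this model x_1 decreases in k1 and does not depend on k2,
# while x_2 decreases in both parameters.
hull_x1 = hull_component(0, ["-", "0"], box, load)
hull_x2 = hull_component(1, ["-", "-"], box, load)
print(hull_x1, hull_x2)
```

For this model the closed-form solution is x_1 = P/k1 and x_2 = P/k1 + P/k2, so the computed bounds can be checked by hand.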
5 Numerical Examples
To check the performance of the Method for Checking the Monotonicity (MCM), some illustrative examples of structural mechanical systems are provided. The results of the MCM method are compared with the results of the Evolutionary Optimization Method [20] (EOM for short), which calculates an inner approximation of the hull solution, and with the results of the Based on Inverse Matrix method (BIM for short), which uses the inverse of the parametric interval matrix to enclose the derivatives.

Example 1. (21-bar plane truss structure) For the plane truss structure shown in Fig. 1 the displacements of the nodes are computed. The truss is subjected to downward forces $P_1 = P_2 = P_3 = 30$ [kN] as depicted in the figure; Young's modulus $Y = 7.0 \times 10^{10}$ [Pa], cross-section area $C = 0.003$ [m$^2$], and length $L = 2$ [m]. Assume the stiffness of all bars is uncertain by ±5%. This gives 21 interval parameters.
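To make the ±5% uncertainty concrete, the sketch below turns a nominal bar stiffness into an interval parameter. It assumes that the stiffness of a bar is the axial stiffness Y·C/length, which is a standard formula but is not spelled out in the text, and it treats only a bar of length L; the function name `interval_parameter` is hypothetical.

```python
def interval_parameter(nominal, rel_uncertainty):
    """Return (lower, upper) bounds for a value uncertain by +/- rel_uncertainty."""
    return (nominal * (1.0 - rel_uncertainty), nominal * (1.0 + rel_uncertainty))

# Data of Example 1.
Y = 7.0e10   # Young's modulus [Pa]
C = 0.003    # cross-section area [m^2]
L = 2.0      # bar length [m]

# Assumed axial stiffness of one bar of length L [N/m].
nominal = Y * C / L
lower, upper = interval_parameter(nominal, 0.05)
print(f"nominal = {nominal:.4e}, interval = [{lower:.4e}, {upper:.4e}]")
```

Each of the 21 bars would contribute one such interval, computed with its own length.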
Fig. 1. Example 1: 21-bar plane truss structure [figure labels: P1, P2, P3, L, 0.5L]
The results obtained are presented in Table 1. The number of derivatives with a definite sign is denoted by ndsm for the MCM method and by ndsb for the BIM method. Columns 3, 4, 5 contain the displacements of the truss nodes. Both the MCM method and the EOM method yield the same result, namely the interval hull solution. The results of the MCM method and the BIM method differ for 9 elements. The number of derivatives with a definite sign ndsm is 2–5 times greater than ndsb. The MCM method generated the result four times faster than the BIM method.

Table 1. Comparison of the solutions for Example 1

ndsm  ndsb  MCM [×10^−4]          BIM [×10^−4]          EOM [×10^−4]
18    8     [−275.20, −247.45]    [−280.67, −241.10]    [−275.20, −247.45]
21    4     [−12.03, −10.88]      [−12.03, −10.88]      [−12.03, −10.88]
21    9     [−226.23, −204.68]    [−227.01, −203.59]    [−226.23, −204.68]
21    5     [−18.05, −16.33]      [−18.05, −16.33]      [−18.05, −16.33]
21    12    [−135.69, −122.77]    [−135.69, −122.77]    [−135.69, −122.77]
21    6     [−12.60, −10.31]      [−12.60, −10.31]      [−12.60, −10.31]
21    11    [−32.74, −26.97]      [−32.75, −26.95]      [−32.74, −26.97]
21    7     [−7.16, −4.30]        [−7.16, −4.30]        [−7.16, −4.30]
21    7     [1.22, 1.65]          [1.22, 1.65]          [1.22, 1.65]
20    7     [20.62, 25.25]        [19.93, 25.81]        [20.62, 25.25]
20    9     [−226.23, −204.68]    [−227.44, −203.15]    [−226.23, −204.68]
21    7     [20.62, 25.25]        [20.12, 25.62]        [20.62, 25.25]
21    12    [−137.20, −124.13]    [−137.21, −124.10]    [−137.20, −124.13]
21    7     [15.18, 19.24]        [14.84, 19.47]        [15.18, 19.24]
21    11    [−31.24, −25.25]      [−31.24, −25.60]      [−31.24, −25.25]
20    7     [15.18, 19.24]        [14.70, 19.61]        [15.18, 19.24]
21    6     [−0.14, 0.14]         [−0.14, 0.14]         [−0.14, 0.14]
21    9     [12.80, 16.58]        [12.73, 16.66]        [12.80, 16.58]
21    5     [−1.50, −1.36]        [−1.50, −1.36]        [−1.50, −1.36]
21    8     [1.53, 4.93]          [1.40, 5.09]          [1.53, 4.93]
21    11    [−28.42, −22.69]      [−28.43, −22.7]       69 [−2.84, −2.27]

Example 2. (Baltimore bridge built in 1870) Consider the plane truss structure shown in Figure 2 subjected to downward forces of $P_1 = 80$ [kN] at node 11, $P_2 = 120$ [kN] at node 12 and $P_1$ at node 15; Young's modulus $Y = 2.1 \times 10^{11}$ [Pa], cross-section area $C = 0.004$ [m$^2$], and length $L = 1$ [m]. Assume that the stiffness of 23 bars is uncertain by ±5%. This gives 23 interval parameters.
Fig. 2. Example 2: Baltimore bridge (built in 1870) [figure labels: P1, P2, P1, L]
Table 2. Comparison of the number of derivatives of definite sign for Example 2

MCM                      BIM
Num. of elems   ndsm     Num. of elems   ndsb
33              23       6               5
3               22       17              4
5               11       14              3
3               10       5               2
1               8        1               1
                         1               0
Table 3. Comparison of the solutions for Example 2

MCM ndsm  MCM [×10^−4]        BIM ndsb  BIM [×10^−4]        EOM [×10^−4]
10        [−24.63, −22.90]    2         [−25.10, −22.38]    [−24.45, −23.08]
10        [18.37, 19.71]      2         [17.96, 20.10]      [18.55, 19.53]
10        [−24.63, −22.90]    2         [−25.10, −22.38]    [−24.45, −23.08]
11        [−40.81, −38.12]    3         [−41.33, −37.55]    [−40.52, −38.42]
11        [18.40, 19.68]      3         [18.15, 19.91]      [18.55, 19.53]
11        [−56.97, −53.36]    3         [−57.61, −52.66]    [−56.59, −53.76]
11        [−56.97, −53.36]    3         [−57.61, −52.66]    [−56.59, −53.76]
8         [16.23, 18.20]      2         [16.01, 18.41]      [16.65, 17.79]
11        [10.40, 12.38]      4         [10.20, 12.58]      [10.62, 12.15]
22        [−56.37, −53.96]    5         [−57.28, −53.00]    [−56.37, −53.96]
22        [−24.19, −23.31]    5         [−24.53, −22.96]    [−24.19, −23.31]
22        [−24.19, −23.31]    5         [−24.60, −22.89]    [−24.19, −23.31]
The results of the MCM and the BIM methods are summarized in Table 2 and Table 3. Table 2 lists, for each method, the number of solution elements (Num. of elems) for which the given number of derivatives (ndsm or ndsb) have a definite sign. Table 3 contains those elements of the solution for which the number of derivatives with a definite sign ndsm is less than 23 (the remaining elements are equal to the elements of the interval hull solution). The results of the MCM method, the BIM method, and the EOM method are presented in subsequent columns. It can be seen from Table 3 that the MCM method yields a narrower enclosure than the BIM method for every listed element. The number of derivatives with a definite sign ndsm is 3–5 times greater than ndsb. The MCM method generated the result four times faster than the BIM method.
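The comparison of enclosure widths can be quantified with a few lines of code. The sketch below copies three rows of Table 3 (rows 1, 2 and 10) and measures how much wider each outer enclosure is than the EOM inner estimate. The measure `overestimation` is a hypothetical choice made for this sketch, not one defined in the paper.

```python
# (MCM, BIM, EOM) intervals in units of 1e-4, copied from rows 1, 2 and 10 of Table 3.
rows = [
    ((-24.63, -22.90), (-25.10, -22.38), (-24.45, -23.08)),
    ((18.37, 19.71), (17.96, 20.10), (18.55, 19.53)),
    ((-56.37, -53.96), (-57.28, -53.00), (-56.37, -53.96)),
]

def width(interval):
    """Width of an interval given as a (lower, upper) pair."""
    return interval[1] - interval[0]

def overestimation(outer, inner):
    """Excess width of an outer enclosure over an inner estimate, in percent."""
    return 100.0 * (width(outer) - width(inner)) / width(inner)

for mcm, bim, eom in rows:
    print(f"MCM: {overestimation(mcm, eom):6.1f} %   BIM: {overestimation(bim, eom):6.1f} %")
```

Because the EOM result is an inner approximation of the hull, these percentages are upper bounds on the overestimation with respect to the true hull.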
6 Conclusions
Checking the sign of the derivatives is the key to testing the global monotonicity of the solution of parametric linear systems. Global monotonicity makes it possible to calculate the interval hull solution easily by solving at most 2n real systems.
A new method for estimating the sign of the derivatives of the parametric interval solution is presented. The quality of the results depends on the size of the problem and on the number and width of the interval parameters. In the general case the MCM method produces a very tight enclosure of the interval hull solution. The greater the number of derivatives with a definite sign, the narrower the bounds of the approximate solution. If the number of derivatives with a definite sign differs significantly from the number of interval parameters, then the EOM method can be used to check the overestimation.

To show the superiority of the MCM method over the methods based on calculating the inverse of the interval matrix, some illustrative examples of structural mechanical systems are provided. It can be seen from the results that the MCM method produces narrower bounds on the hull solution and, in both examples, was about four times faster.

The presented methodology can be applied to any problem which requires solving linear systems with input data dependent on uncertain parameters. It can be used by, or combined with, other methods for solving parametric linear systems (e.g. Evolutionary Optimization Method, Global Optimization).
References

1. Aughenbaugh, J., Paredis, C.: Why are intervals and imprecisions important in engineering design? In: Muhanna, R.L. (ed.) Proceedings of the NSF Workshop on Reliable Engineering Computing (REC), Savannah, Georgia USA, February 22–24, 2006, pp. 319–340 (2006)
2. Jansson, C.: Interval linear systems with symmetric matrices, skew-symmetric matrices and dependencies in the right hand side. Computing 46(3), 265–274 (1991)
3. Kolev, L.: A method for outer interval solution of linear parametric systems. Reliable Computing 10, 227–239 (2004)
4. Kolev, L.: Outer solution of linear systems whose elements are affine functions of interval parameters. Reliable Computing 6, 493–501 (2002)
5. Kolev, L.: Solving linear systems whose elements are non-linear functions of intervals. Numerical Algorithms 37, 213–224 (2004)
6. Lallemand, B., Plessis, G., Tison, T., Level, P.: Modal Behaviour of Structures Defined by Imprecise Geometric Parameters #125. In: Proc. SPIE, Proceedings of IMAC-XVIII: A Conference on Structural Dynamics. Society of Photo-Optical Instrumentation Engineers (SPIE) Conference, January 2000, vol. 4062, p. 1422 (2000)
7. Muhanna, R.L., Erdolen, A.: Geometric uncertainty in truss systems: an interval approach. In: Muhanna, R.L. (ed.) Proceedings of the NSF Workshop on Reliable Engineering Computing: Modeling Errors and Uncertainty in Engineering Computations, Savannah, Georgia USA, February 22–24, 2006, pp. 239–247 (2006)
8. Muhanna, R., Kreinovich, V., Solin, P., Cheesa, J., Araiza, R., Xiang, G.: Interval finite element method: New directions. In: Muhanna, R.L. (ed.) Proceedings of the NSF Workshop on Reliable Engineering Computing (REC), Savannah, Georgia USA, February 22–24, 2006 (2006)
9. Neumaier, A.: Worst case bounds in the presence of correlated uncertainty. In: Muhanna, R.L. (ed.) Proceedings of the NSF Workshop on Reliable Engineering Computing (REC), February 22–24, 2006, pp. 113–114 (2006)
10. Neumaier, A.: Interval Methods for Systems of Equations. Encyclopedia of Mathematics and its Applications. Cambridge University Press, Cambridge (1990)
11. Popova, E.: Quality of the solution of parameter-dependent interval linear systems. ZAMM 82(10), 723–727 (2002)
12. Popova, E.D.: On the solution of parametrised linear systems. In: Kraemer, W., J.W.v.G. (eds.) Scientific Computing, Validated Numerics, Interval Methods, pp. 127–138. Kluwer Acad. Publishers, Dordrecht (2001)
13. Popova, E., Iankov, R., Bonev, Z.: Bounding the response of mechanical structures with uncertainties in all the parameters. In: Muhanna, R.L. (ed.) Proceedings of the NSF Workshop on Reliable Engineering Computing (REC), Savannah, Georgia USA, February 22–24, 2006, pp. 245–265 (2006)
14. Pownuk, A.: Calculations of displacement in elastic and elastic-plastic structures with interval parameters. In: 33rd Solid Mechanics Conference, Zakopane, Poland, September 2000, pp. 160–161 (2000)
15. Rao, S., Berke, L.: Analysis of uncertain structural systems using interval analysis. AIAA J. 35(4), 727–735 (1997)
16. Rohn, J.: A method for handling dependent data in interval linear systems. Technical report 911, Academy of Sciences of the Czech Republic, Czech Republic (2004)
17. Rump, S.: Verification methods for dense and sparse systems of equations. In: Herzberger, J. (ed.) Topics in validated computations: proceedings of IMACS-GAMM International Workshop on Validated Computation, Oldenburg, Germany, August 30–September 3, 1993. Studies in Computational Mathematics, vol. 5, pp. 63–135. Elsevier, Amsterdam, The Netherlands (1994)
18. Skalna, I.: A method for outer interval solution of parametrized systems of linear interval equations. Reliable Computing 12(2), 107–120 (2006)
19. Skalna, I.: Methods for solving systems of linear equations of structure mechanics with interval parameters. Computer Assisted Mechanics and Engineering Sciences 10(3), 281–293 (2003)
20. Skalna, I.: Evolutionary optimization method for approximating the solution set hull of parametric linear systems. In: Boyanov, T., Dimova, S., Georgiev, K., Nikolov, G. (eds.) NMA 2006. LNCS, vol. 4310, pp. 361–368. Springer, Heidelberg (2007)
21. Zalewski, B., Mullen, R., Muhanna, R.: Bounding the response of mechanical structures with uncertainties in all the parameters. In: Muhanna, R.L. (ed.) Proceedings of the NSF Workshop on Reliable Engineering Computing (REC), Savannah, Georgia USA, February 22–24, 2006, pp. 439–456 (2006)
Author Index

Abdallah, Deema A. 170
Abu-Khzam, Faisal N. 170
Acebrón, Juan A. 1257
Aizhulov, Daniar Y. 888
Akzhalova, Assel Zh. 888
Allen, Gabrielle 1170
Almeida, Francisco 788, 1104
Aloisio, Giovanni 894
Alonso, Pedro 89, 419
Arrarás, Andrés 371
Babik, Marian 738
Bader, David A. 166
Bader, Michael 628
Bala, Piotr 439, 762
Baliś, Bartosz 381
Banaś, Krzysztof 1265
Baranowski, Przemyslaw 469
Barker, Adam 746
Bartoli, Lisa 1200
Bartyński, Tomasz 1068
Benheddi, Radia 1078
Bernabeu, Miguel O. 89
Bichot, Charles-Edmond 698
Bielecka, Marzena 579
Bielecki, Andrzej 579
Bisseling, Rob H. 708
Blanco, Vincente 788
Blaszczyk, Jacek 649
Borkowski, Janusz 807, 971
Borowiecki, Piotr 981
Bouvry, Pascal 68, 549
Bożejko, Wojciech 180
Briquet, Cyril 1275
Brudaru, Octav 479
Brunner, Rene 961
Brzeziński, Jerzy 1
Bubak, Marian 381, 780, 1068
Burczyński, Tadeusz 1285
Burtseva, Larisa 608
Buttari, Alfredo 639
Buzatu, Octavian 479
Bylina, Beata 99
Bylina, Jaroslaw 99
Cafaro, Massimo 894
Carota, Luciana 1200
Carracciuolo, Luisa 902
Carretero, Jesús 870
Cârstea, Alexandru 843
Casadio, Rita 1200
Ciglan, Marek 302
Cole, Murray 1019
Cuciniello, Salvatore 1210
Cuenca, Javier 1150
Czarnul, Pawel 271
Czech, Zbigniew J. 189
Dalecki, Dariusz 391
Danilecki, Arkadiusz 11
Dębski, Lech 429
Debudaj-Grabysz, Agnieszka 189
De Falco, Ivanoe 991
Della Cioppa, Antonio 991
de Marneffe, P.A. 1275
Desell, Travis 457
Dethier, Gérard 1275
Díaz, Daniel 852
Dongarra, Jack 639
Drozdowski, Maciej 1009
Dusza, Konrad 281
Dutka, Lukasz 59, 835
Dyllong, Eva 1341
Dymova, Ludmila 1392
Dziubecki, Piotr 331
Elmroth, Erik 259, 754
Fariselli, Piero 1200
Feminiano, Davide 1210
Fiolet, Valerie 912
Fiore, Sandro 894
Flórez, Jorge 1351
Fontán, Javier 817
Franz, Robert 628
Fraser, David L. 20
Freitag, Felix 961
Frîncu, Marc 843
Funika, Wlodzimierz 1140
Furmańczyk, Hanna 1001
Fúster-Sabater, Amparo 499
Gajc, Krzysztof 489
Gajda, Karol 1361
Ganzha, Maria 400, 1220
García, Félix 870
García, Luis-Pedro 1150
Gawinecki, Maciej 400
Gawkowski, Piotr 361
Georges, Gilles 1240
Gepner, Pawel 20
Giménez, Domingo 127, 1150
Giraud, Mathieu 1240
Giunta, Giulio 942, 951
Godowski, Piotr 1140
Goląb, Krzysztof 429
González, Patricia 852
Goodale, Tom 1170
Gorawski, Marcin 199, 209
Gorawski, Michal 199
Gorski, Filip 29
Grabowski, Piotr 331
Grimm, Cornelius 1341
Guarracino, Mario R. 1210
Gubala, Tomasz 1068
Günther, Stephan 628
Gustavson, Fred G. 618, 622
Hafid, Abdelhakim 1230
Hasegawa, Hidehiko 1086
Heinecke, Alexander 628
Hernández, Francisco 259, 754
Hernández-Goya, Candelaria 499
Hernandez, Israel 1019
Herrero, José R. 659
Hluchý, Ladislav 302, 738, 880
Hoenen, Olivier 108
Huedo, Eduardo 817
Huynh, Van Nam 1372
Il'yashenko, Matviy 922
Jacob, Arpith 1220
Jacques, Julien 1240
Jakl, Ondřej 409
Jakob, Wilfried 509, 589
Jankowska, Malgorzata 1361
Janusz, Maciej 798
Jeziorek-Kniola, Dorota 429
John, Sunil 860
Jorge, Juan Carlos 371
Kadiyev, Alexey 118
Kajiyama, Tamito 1086
Kalewski, Michal 1
Keong, Kwoh Chee 1249
Kitowski, Jacek 59, 321, 798, 835
Klopotek, Mieczyslaw A. 68
Klusáček, Dalibor 1029
Kluszczyński, Rafal 762
Kobusińska, Anna 11
Kobusiński, Jacek 29
Kobzdej, Pawel 400
Kohut, Roman 409
Kokosiński, Zbigniew 219, 229
Konieczny, Dariusz 1160
Konovalov, Alexander 843
Kopanski, Damian 807
Korošec, Peter 520
Korytkowski, Marcin 540, 570
Kosel, Franc 520
Kosowski, Adrian 1001, 1039
Kowalik, Michal F. 20
Krawczyk, Henryk 281
Kreinovich, Vladik 1372
Krysiński, Michal 331
Kryza, Bartosz 835
Kubica, Bartlomiej Jacek 1382
Kucherov, Gregory 1240
Kuczyński, Tomasz 331
Kuczynski, Lukasz 825
Kurzak, Jakub 639
Kurzyniec, Dawid 291
Kuś, Waclaw 1285
Kwiatkowski, Jan 1160
Laccetti, Giuliano 902, 932, 951
Laclavík, Michal 302
Langou, Julien 639
Lapegna, Marco 902
Laskowski, Eryk 39, 912
Lavenier, Dominique 1240
Lawenda, Marcin 1009
Lebiedź, Jacek 391
Lepekha, Volodymyr 137
Lezzi, Daniele 894
Li, Gen 78
Libuda, Marek 11
Llorente, Ignacio M. 817
López-Espín, Jose-Juan 127
Loulergue, Frédéric 1078, 1122
Łukawski, Grzegorz 312
Łukasik, Szymon 229
Maani, Rouzbeh 49
Macariu, Georgiana 843
Maggi, Giorgio 1200
Maksimov, Vyacheslav 118
Malawski, Maciej 1068
Malczok, Rafal 209
Manne, Fredrik 708
Marchot, Pierre 1275
Marciniak, Andrzej 1361
Marks, Michal 649
Martelli, Pier L. 1200
Martín, María J. 852
Martínez-Zaldívar, Francisco-Jose 419
Masko, Łukasz 912
Maslennikow, Oleg 137
Matyska, Luděk 1029
Mederski, Jaroslaw 439
Meissner, Adam 1096
Mieloszyk, Krzysztof 391
Miguel, Alejandro 870
Mikulski, Łukasz 439
Milos, Tomasz 59
Monnet, Anthony 249
Montanucci, Ludovica 1200
Montella, Raffaele 942, 951
Montero, Rubén S. 817
Morrison, John P. 860
Morzyński, Marek 1293
Musial, Grzegorz 429
Nabrzyski, Jaroslaw 331
Nakamori, Yoshiteru 1372
Navarra, Alfredo 1039
Navarro, Leandro 961
Nguyen, Hung T. 1372
Niewiadomska-Szynkiewicz, Ewa 649
Nikolow, Darin 321
Nishida, Akira 1086
Niwińska, Magdalena 1096
Noé, Laurent 1240
Nowicki, Robert 540, 570
Nukada, Akira 1086
Numrich, Robert W. 148
Oblak, Klemen 520
Obuchowicz, Andrzej 600
Ogoubi, Étienne 1230
Olejnik, Richard 912
Östberg, Per-Olov 259
Palacz, Barbara 59
Paprzycki, Marcin 400, 1220
Pardo, Xoán C. 852
Parsa, Saeed 49
Paszyńska, Anna 1313
Paszyński, Maciej 1303, 1313
Pawlik, Marcin 1160
Pęgiel, Piotr 1140
Peláez, Ignacio 1104
Petcu, Dana 843
Peterlongo, Pierre 1240
Portero, Laura 371
Pouliot, David 1230
Qu, Changtao 770
Quintana-Ortí, Enrique S. 668, 678
Quintana-Ortí, Gregorio 668, 678
Quinte, Alexander 589
Radke, Thomas 1170
Remón, Alfredo 668, 678
Riccio, Angelo 942
Rizk, Mohamad A. 170
Rodríguez, Gabriel 852
Rodríguez-Haro, Fernando 961
Rokicki, Jacek 1333
Rozenberg, Valeriy 118
Rudová, Hana 1029
Russell, Michael 331
Rutkowski, Leszek 540
Rycerz, Katarzyna 780
Rzadca, Krzysztof 1048
Sainz, Miguel A. 1351
Samatova, Nagiza F. 170
Santos, Adrian 788
Sanyal, Sugata 1220
Sapiecha, Krzysztof 312
Sawerwain, Marek 530
Sbert, Mateu 1351
Scafuri, Umberto 991
Scherer, Rafal 540, 570
Schmid, Giovanni 932
Schmidt, Bertil 1249
Schnetter, Erik 1170
Schwarz, Ulrich M. 1059
Sedukhin, Stanislav G. 1190
Šeleng, Martin 302
Seredynski, Franciszek 489, 549
Seredynski, Marcin 68
Sergiyenko, Anatoli 137
Serzysko, Tomasz 400
Sevastjanov, Pavel 1392
Šilc, Jurij 520
Šimeček, Ivan 156
Singer, Daniel 249
Singh, David E. 870
Skalna, Iwona 1400
Skaruz, Jaroslaw 549
Skital, Łukasz 798
Slawińska, Magdalena 291, 341
Slawiński, Jaroslaw 341
Slížik, Peter 880
Sloot, Peter M.A. 780
Slota, Renata 321, 798, 835
Smolka, Maciej 351
Smyk, Adam 559
Sobaniec, Cezary 1
Sobral, Joao L. 1114
Song, Junqiang 447
Sosnowski, Janusz 361
Spigler, Renato 1257
Starczewski, Janusz 570
Stark, Dylan 1170
Starý, Jiří 409
Stempin, Stanislaw 29
Stpiczyński, Przemyslaw 688
Strug, Barbara 579
Stucky, Karl-Uwe 589
Suárez, Fernando 1104
Süß, Wolfgang 589
Suda, Reiji 1086
Sunderam, Vaidy 291, 341
Świętoń, Grzegorz 229
Szjenfeld, Dawid 331
Szymanski, Boleslaw K. 457
Szyszka, Barbara 1361
Taghinezhad Omran, Masoud 239
Tang, Jok Man 1323
Tang, Tao 78
Tarantino, Ernesto 991
Tarnawczyk, Dominik 331
Tchernykh, Andrei 608
Tesson, Julien 1122
Thiele, Frank 1293
Toledo, Sivan 728
Tomas, Adam 137
Tordsson, Johan 259, 754
Toursel, Bernard 912
Tudruj, Marek 39, 559, 807, 912, 971
Turcotte, Marcel 1230
Tvrdík, Pavel 156
Tymoczko, Andrzej 361
Uçar, Bora 718
Uchitel, Anatoli 728
Uciński, Dariusz 469
van Engelen, Robert 894
van Hemert, Jano 746
Varela, Carlos 457
Vázquez, Constantino 817
Vehí, Josep 1351
Vidal-Maciá, Antonio-Manuel 89, 419
Violard, Eric 108
Vuik, Kees 1323
Wach, Jakub 381
Waśniewski, Jerzy 622
Wawrzyniak, Dariusz 600
Wichulski, Michal 1333
Wirawan, Adrianto 1249
Wiszniewski, Bogdan 391
Woźniak, Adam 1382
Wodecki, Mieczyslaw 180
Wolniewicz, Gosia 331
Wyrzykowski, Roman 137, 825
Wysota, Witold 1180
Wytrebowicz, Jacek 1180
Yang, Xuejun 78
Yaurima, Victor 608
Zekri, Ahmed S. 1190
Zhang, Weimin 447
Zhang, Ying 78
Zhu, Xiaoqian 447
Zieba, Joanna 835
Zumbusch, Gerhard 1130
Zwierzyński, Krzysztof 1096
Żyliński, Pawel 1001
Wach, Jakub 381 Wa´sniewski, Jerzy 622 Wawrzyniak, Dariusz 600 Wichulski, Michal 1333 Wirawan, Adrianto 1249 Wiszniewski, Bogdan 391 Wo´zniak, Adam 1382 Wodecki, Mieczyslaw 180 Wolniewicz, Gosia 331 Wyrzykowski, Roman 137, 825 Wysota, Witold 1180 Wytrebowicz, Jacek 1180 Yang, Xuejun 78 Yaurima, Victor 608 Zekri, Ahmed S. 1190 Zhang, Weimin 447 Zhang, Ying 78 Zhu, Xiaoqian 447 Zieba, Joanna 835 Zumbusch, Gerhard 1130 Zwierzy´ nski, Krzysztof 1096 ˙ nski, Pawel 1001 Zyli´