Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
6067
Roman Wyrzykowski Jack Dongarra Konrad Karczewski Jerzy Wasniewski (Eds.)
Parallel Processing and Applied Mathematics 8th International Conference, PPAM 2009 Wroclaw, Poland, September 13-16, 2009 Revised Selected Papers, Part I
Volume Editors Roman Wyrzykowski Konrad Karczewski Czestochowa University of Technology Institute of Computational and Information Sciences, Poland E-mail:{roman, xeno}@icis.pcz.pl Jack Dongarra University of Tennessee, Department of Electrical Engineering and Computer Science, Knoxville, TN 37996-3450, USA E-mail:
[email protected] Jerzy Wasniewski Technical University of Denmark, Department of Informatics and Mathematical Modeling, 2800 Kongens Lyngby, Denmark E-mail:
[email protected]
Library of Congress Control Number: 2010930174
CR Subject Classification (1998): D.2, H.4, D.4, C.2.4, D.1.3, F.2
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN 0302-9743
ISBN-10 3-642-14389-X Springer Berlin Heidelberg New York
ISBN-13 978-3-642-14389-2 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2010 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180
Preface
We are pleased to present the proceedings of the 8th International Conference on Parallel Processing and Applied Mathematics – PPAM 2009, which was held in Wroclaw, Poland, September 13–16, 2009. It was organized by the Department of Computer and Information Sciences of the Czestochowa University of Technology, with the help of the Wroclaw University of Technology, Faculty of Computer Science and Management. The main organizer was Roman Wyrzykowski.
PPAM is a biennial conference. Seven previous events have been held in different places in Poland since 1994. The proceedings of the last four conferences have been published by Springer in the Lecture Notes in Computer Science series (Naleczów, 2001, vol. 2328; Czestochowa, 2003, vol. 3019; Poznań, 2005, vol. 3911; Gdańsk, 2007, vol. 4967).
The PPAM conferences have become an international forum for exchanging ideas between researchers involved in parallel and distributed computing, including theory and applications, as well as applied and computational mathematics. The focus of PPAM 2009 was on models, algorithms, and software tools which facilitate efficient and convenient utilization of modern parallel and distributed computing architectures, as well as on large-scale applications. This meeting gathered more than 210 participants from 32 countries. A strict refereeing process resulted in the acceptance of 129 contributed presentations, while approximately 46% of the submissions were rejected. Regular tracks of the conference covered such important fields of parallel/distributed/grid computing and applied mathematics as:
– Parallel/distributed architectures and mobile computing
– Numerical algorithms and parallel numerics
– Parallel and distributed non-numerical algorithms
– Tools and environments for parallel/distributed/grid computing
– Applications of parallel/distributed computing
– Applied mathematics and neural networks
Plenary and Invited Speakers

The plenary and invited talks were presented by:
– Srinivas Aluru from the Iowa State University (USA)
– Dominik Behr from AMD (USA)
– Ewa Deelman from the University of Southern California (USA)
– Jack Dongarra from the University of Tennessee and Oak Ridge National Laboratory (USA)
– Iain Duff from the Rutherford Appleton Laboratory (UK)
– Anne C. Elster from NTNU, Trondheim (Norway)
– Wolfgang Gentzsch from the DEISA Project
– Michael Gschwind from the IBM T.J. Watson Research Center (USA)
– Fred Gustavson from the IBM T.J. Watson Research Center (USA)
– Simon Holland from Intel (UK)
– Vladik Kreinovich from the University of Texas at El Paso (USA)
– Magnus Peterson from Synective Labs (Sweden)
– Armin Seyfried from the Juelich Supercomputing Centre (Germany)
– Boleslaw Szymański from the Rensselaer Polytechnic Institute (USA)
– Jerzy Waśniewski from the Technical University of Denmark (Denmark)
Workshops and Minisymposia

Important and integral parts of the PPAM 2009 conference were the workshops:
– Minisymposium on GPU Computing organized by José R. Herrero from the Universitat Politecnica de Catalunya (Spain), Enrique S. Quintana-Ortí from the Universitat Jaime I (Spain), and Robert Strzodka from the Max-Planck-Institut für Informatik (Germany)
– The Second Minisymposium on Cell/B.E. Technologies organized by Roman Wyrzykowski from the Czestochowa University of Technology (Poland), and David A. Bader from the Georgia Institute of Technology (USA)
– Workshop on Memory Issues on Multi- and Manycore Platforms organized by Michael Bader and Carsten Trinitis from the TU München (Germany)
– Workshop on Novel Data Formats and Algorithms for High-Performance Computing organized by Fred Gustavson from the IBM T.J. Watson Research Center (USA), and Jerzy Waśniewski from the Technical University of Denmark (Denmark)
– Workshop on Scheduling for Parallel Computing - SPC 2009 organized by Maciej Drozdowski from the Poznań University of Technology (Poland)
– The Third Workshop on Language-Based Parallel Programming Models - WLPP 2009 organized by Ami Marowka from the Shenkar College of Engineering and Design in Ramat-Gan (Israel)
– The Second Workshop on Performance Evaluation of Parallel Applications on Large-Scale Systems organized by Jan Kwiatkowski, Dariusz Konieczny and Marcin Pawlik from the Wroclaw University of Technology (Poland)
– The 4th Grid Application and Middleware Workshop - GAMW 2009 organized by Ewa Deelman from the University of Southern California (USA), and Norbert Meyer from the Poznań Supercomputing and Networking Center (Poland)
– The 4th Workshop on Large Scale Computations on Grids - LaSCoG 2009 organized by Marcin Paprzycki from IBS PAN and SWPS in Warsaw (Poland), and Dana Petcu from the Western University of Timisoara (Romania)
– Workshop on Parallel Computational Biology - PBC 2009 organized by David A. Bader from the Georgia Institute of Technology in Atlanta (USA), Denis Trystram from ID-IMAG in Grenoble (France), Alexandros Stamatakis from the TU München (Germany), and Jaroslaw Zola from the Iowa State University (USA)
– Minisymposium on Applications of Parallel Computations in Industry and Engineering organized by Raimondas Čiegis from the Vilnius Gediminas Technical University (Lithuania), and Julius Žilinskas from the Institute of Mathematics and Informatics in Vilnius (Lithuania)
– The Second Minisymposium on Interval Analysis organized by Vladik Kreinovich from the University of Texas at El Paso (USA), Pawel Sewastjanow from the Czestochowa University of Technology (Poland), Bartlomiej J. Kubica from the Warsaw University of Technology (Poland), and Jerzy Waśniewski from the Technical University of Denmark (Denmark)
– Workshop on Complex Collective Systems organized by Pawel Topa and Jaroslaw Was from the AGH University of Science and Technology in Cracow (Poland)
Tutorials

The PPAM 2009 meeting began with four tutorials:
– GPUs, OpenCL and Scientific Computing, by Robert Strzodka from the Max-Planck-Institut für Informatik (Germany), Dominik Behr from AMD (USA), and Dominik Göddeke from the University of Dortmund (Germany)
– FPGA Programming for Scientific Computing, by Magnus Peterson from Synective Labs (Sweden)
– Programming the Cell Broadband Engine, by Maciej Remiszewski from IBM (Poland), and Maciej Cytowski from the University of Warsaw (Poland)
– New Data Structures Are Necessary and Sufficient for Dense Linear Algebra Factorization Algorithms, by Fred Gustavson from the IBM T.J. Watson Research Center (USA), and Jerzy Waśniewski from the Technical University of Denmark (Denmark)
Best Poster Award

The PPAM Best Poster Award is given to the best poster on display at the PPAM conferences, and was first awarded at PPAM 2009. This award is bestowed by the Program Committee members to the presenting author(s) of the best poster. The selection criteria are based on the scientific content and on the quality of the poster presentation. The PPAM 2009 winner was Tomasz Olas from the Czestochowa University of Technology, who presented the poster "Parallel Adaptive Finite Element Package with Dynamic Load Balancing for 3D Thermomechanical Problems."
New Topics at PPAM 2009

GPU Computing: The recent advances in the hardware, functionality, and programmability of graphics processors (GPUs) have greatly increased their appeal
as add-on co-processors for general-purpose computing. With the involvement of the largest processor manufacturers and the strong interest from researchers of various disciplines, this approach has moved from a research niche to a forward-looking technique for heterogeneous parallel computing. Scientific and industry researchers are constantly finding new applications for GPUs in a wide variety of areas, including image and video processing, molecular dynamics, seismic simulation, computational biology and chemistry, fluid dynamics, weather forecasting, computational finance, and many others. GPU hardware has evolved over many years from graphics pipelines with many heterogeneous fixed-function components, through partially programmable architectures, towards a more and more homogeneous general-purpose design, although some fixed-function hardware has remained because of its efficiency. The general-purpose computing on GPU (GPGPU) revolution started with programmable shaders; later, NVIDIA Compute Unified Device Architecture (CUDA) and, to a smaller extent, AMD Brook+ brought GPUs into the mainstream of parallel computing. The great advantage of CUDA is that it defines an abstraction which presents the underlying hardware architecture as a sea of hundreds of fine-grained computational units with synchronization primitives on multiple levels. With OpenCL there is now also a vendor-independent high-level parallel programming language and an API that offers the same type of hardware abstraction. GPUs are very versatile accelerators because, besides the high hardware parallelism, they also feature a high bandwidth connection to dedicated device memory. The latency problem of DRAM is tackled via a sophisticated thread scheduling and switching mechanism on-chip that continues the processing of the next thread as soon as the previous one stalls on a data read. These characteristics make GPUs suitable for both compute- and data-intensive parallel processing. The PPAM 2009 conference recognized the great impact of GPUs by including in its scientific program two major related events: a minisymposium on GPU Computing, and a full-day tutorial on "GPUs, OpenCL and Scientific Computing." The minisymposium received 18 submissions, of which 10 were accepted (55%). The contributions were organized in three sessions. The first group was related to Numerics, and comprised the following papers: "Finite Element Numerical Integration on GPUs," "Reduction to Condensed Forms for Symmetric Eigenvalue Problems on Multi-core Architectures," "On Parallelizing the MRRR Algorithm for Data-Parallel Coprocessors," and "A Fast GPU Implementation for Solving Sparse Ill-Posed Linear Equation Systems." The second session dealt with Applications. The papers presented were: "Simulations of the Electrical Activity in the Heart with Graphic Processing Units," "Stream Processing on GPUs Using Distributed Multimedia Middleware," and "A GPU Approach to the Simulation of Spatio–temporal Dynamics in Ultrasonic Resonators." Finally, a third session about General GPU Computing included presentations of three papers: "Fast In-Place Sorting with CUDA Based on Bitonic Sort," "Parallel Minimax
Tree Searching on GPU," and "Modeling and Optimizing the Power Performance of Large Matrices Multiplication on Multi-core and GPU Platform with CUDA." The tutorial covered a wide variety of GPU topics and also offered hands-on examples of OpenCL programming that any participant could experiment with on their laptop. The morning sessions discussed the basics of GPU architecture, ready-to-use libraries and OpenCL. The afternoon session went in depth on OpenCL and scientific computing on GPUs. All slides are available at http://gpgpu.org/ppam2009.

Complex Collective Systems: Collective aspects of complex systems are attracting an increasing community of researchers working in different fields and dealing with theoretical aspects as well as practical applications. In particular, analyzing local interactions and simple rules makes it possible to model complex phenomena efficiently. Collective systems approaches show great promise in establishing scientific methods that could successfully be applied across a variety of application fields. Many studies in complex collective systems science follow either a cellular automata (CA) method or an agent-based approach. Hybridization between these two complementary approaches gives a promising perspective. The majority of work presented during the workshop on complex collective systems represents the hybrid approach. We can distinguish four groups of subjects presented during the workshop. The first group was modeling of pedestrian dynamics: Armin Seyfried from the Juelich Supercomputing Centre presented actual challenges in pedestrian dynamics modeling. Other important issues of crowd modeling were also taken into account during the workshop: modeling of stop-and-go waves (Andrea Portz and Armin Seyfried), calibration of pedestrian stream models (Wolfram Klein, Gerta Köster and Andreas Meister), parallel design patterns in a pedestrian simulation (Sarah Clayton), floor field models based on CA (Ekaterina Kirik, Tat'yana Yurgel'yan and Dmitriy Krouglov), and discrete potential field construction (Konrad Kulakowski and Jaroslaw Was). The second group dealt with models of car traffic: a fuzzy cellular model of traffic (Bartlomiej Placzek), and an adaptive time gap car-following model (Antoine Tordeux and Pascal Bouvry). The third group included work connected with cryptography based on cellular automata: weakness analysis of a key stream generator (Frederic Pinel and Pascal Bouvry), and properties of safe CA-based S-Boxes (Miroslaw Szaban and Franciszek Seredyński). The fourth group dealt with various applications in the field of complex collective systems: frustration and collectivity in spatial networks (Anna Mańka-Krasoń and Krzysztof Kulakowski), lava flow hazard modeling (Maria Vittoria Avolio, Donato D'Ambrosio, Valeria Lupiano, Rocco Rongo and William Spataro), and FPGA realization of a CA-based epidemic processor (Pavlos Progias, Emmanouela Vardaki and Georgios Sirakoulis).
Acknowledgements

The organizers are indebted to the PPAM 2009 sponsors, whose support was vital to the success of the conference. The main sponsor was the Intel Corporation. The other sponsors were: Hewlett-Packard Company, Microsoft Corporation, IBM Corporation, Action S.A., and AMD. We thank all members of the International Program Committee and the additional reviewers for their diligent work in refereeing the submitted papers. Finally, we thank all of the local organizers from the Czestochowa University of Technology and the Wroclaw University of Technology who helped us to run the event very smoothly. We are especially indebted to Grażyna Kolakowska, Urszula Kroczewska, Łukasz Kuczyński, and Marcin Woźniak from the Czestochowa University of Technology; and to Jerzy Świątek and Jan Kwiatkowski from the Wroclaw University of Technology.
PPAM 2011

We hope that this volume will be useful to you. We would like everyone who reads it to feel invited to the next conference, PPAM 2011, which will be held September 11–14, 2011, in Toruń, a city in northern Poland where the great astronomer Nicolaus Copernicus was born.

February 2010
Roman Wyrzykowski
Jack Dongarra
Konrad Karczewski
Jerzy Waśniewski
Organization
Program Committee
Jan Weglarz – Poznań University of Technology, Poland (Honorary Chair)
Roman Wyrzykowski – Czestochowa University of Technology, Poland (Chair)
Boleslaw Szymański – Rensselaer Polytechnic Institute, USA (Vice-Chair)
Peter Arbenz – ETH, Zurich, Switzerland
Piotr Bala – N. Copernicus University, Poland
David A. Bader – Georgia Institute of Technology, USA
Michael Bader – TU München, Germany
Mark Baker – University of Reading, UK
Radim Blaheta – Institute of Geonics, Czech Academy of Sciences
Jacek Błażewicz – Poznań University of Technology, Poland
Leszek Borzemski – Wroclaw University of Technology, Poland
Pascal Bouvry – University of Luxembourg
Tadeusz Burczyński – Silesia University of Technology, Poland
Jerzy Brzeziński – Poznań University of Technology, Poland
Marian Bubak – Institute of Computer Science, AGH, Poland
Raimondas Čiegis – Vilnius Gediminas Tech. University, Lithuania
Andrea Clematis – IMATI-CNR, Italy
Zbigniew Czech – Silesia University of Technology, Poland
Jack Dongarra – University of Tennessee and ORNL, USA
Maciej Drozdowski – Poznań University of Technology, Poland
Erik Elmroth – Umea University, Sweden
Anne C. Elster – NTNU, Trondheim, Norway
Mariusz Flasiński – Jagiellonian University, Poland
Maria Ganzha – IBS PAN, Warsaw, Poland
Jacek Gondzio – University of Edinburgh, Scotland, UK
Andrzej Gościński – Deakin University, Australia
Laura Grigori – INRIA, France
Frederic Guinand – Université du Havre, France
José R. Herrero – Universitat Politecnica de Catalunya, Barcelona, Spain
Ladislav Hluchy – Slovak Academy of Sciences, Bratislava
Ondrej Jakl – Institute of Geonics, Czech Academy of Sciences
Emmanuel Jeannot – INRIA, France
Grzegorz Kamieniarz – A. Mickiewicz University, Poznań, Poland
Alexey Kalinov – Cadence Design System, Russia
Ayse Kiper – Middle East Technical University, Turkey
Jacek Kitowski – Institute of Computer Science, AGH, Poland
Jozef Korbicz – University of Zielona Góra, Poland
Stanislaw Kozielski – Silesia University of Technology, Poland
Dieter Kranzlmueller – Ludwig Maximillian University, Munich, and Leibniz Supercomputing Centre, Germany
Henryk Krawczyk – Gdańsk University of Technology, Poland
Piotr Krzyżanowski – University of Warsaw, Poland
Jan Kwiatkowski – Wroclaw University of Technology, Poland
Giulliano Laccetti – University of Naples, Italy
Marco Lapegna – University of Naples, Italy
Alexey Lastovetsky – University College Dublin, Ireland
Vyacheslav I. Maksimov – Ural Branch, Russian Academy of Sciences
Victor E. Malyshkin – Siberian Branch, Russian Academy of Sciences
Tomas Margalef – Universitat Autonoma de Barcelona, Spain
Ami Marowka – Shenkar College of Engineering and Design, Israel
Norbert Meyer – PSNC, Poznań, Poland
Jarek Nabrzyski – University of Notre Dame, USA
Marcin Paprzycki – IBS PAN and SWPS, Warsaw, Poland
Dana Petcu – Western University of Timisoara, Romania
Enrique S. Quintana-Ortí – Universitat Jaime I, Spain
Yves Robert – Ecole Normale Superieure de Lyon, France
Jacek Rokicki – Warsaw University of Technology, Poland
Leszek Rutkowski – Czestochowa University of Technology, Poland
Franciszek Seredyński – Polish Academy of Sciences and Polish-Japanese Institute of Information Technology, Warsaw, Poland
Robert Schaefer – Institute of Computer Science, AGH, Poland
Jurij Silc – Jozef Stefan Institute, Slovenia
Peter M.A. Sloot – University of Amsterdam, The Netherlands
Masha Sosonkina – Ames Laboratory and Iowa State University, USA
Leonel Sousa – Technical University Lisbon, Portugal
Maciej Stroiński – PSNC, Poznań, Poland
Domenico Talia – University of Calabria, Italy
Andrei Tchernykh – CICESE, Ensenada, Mexico
Carsten Trinitis – TU München, Germany
Roman Trobec – Jozef Stefan Institute, Slovenia
Denis Trystram – ID-IMAG, Grenoble, France
Marek Tudruj – Polish Academy of Sciences and Polish-Japanese Institute of Information Technology, Warsaw, Poland
Pavel Tvrdik – Czech Technical University, Prague
Jens Volkert – Johannes Kepler University, Linz, Austria
Jerzy Waśniewski – Technical University of Denmark
Bogdan Wiszniewski – Gdańsk University of Technology, Poland
Ramin Yahyapour – University of Dortmund, Germany
Jianping Zhu – University of Texas at Arlington, USA
Table of Contents – Part I
Parallel/Distributed Architectures and Mobile Computing
Evaluating Performance of New Quad-Core Intel Xeon 5500 Family Processors for HPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pawel Gepner, David L. Fraser, and Michal F. Kowalik
Interval Wavelength Assignment in All-Optical Star Networks . . . . . . . . . . Robert Janczewski, Anna Malafiejska, and Michal Malafiejski Graphs Partitioning: An Optimal MIMD Queueless Routing for BPC-Permutations on Hypercubes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jean-Pierre Jung and Ibrahima Sakho Probabilistic Packet Relaying in Wireless Mobile Ad Hoc Networks . . . . . Marcin Seredynski, Tomasz Ignac, and Pascal Bouvry
1 11
21 31
Numerical Algorithms and Parallel Numerics On the Performance of a New Parallel Algorithm for Large-Scale Simulations of Nonlinear Partial Differential Equations . . . . . . . . . . . . . . . ´ Juan A. Acebr´ on, Angel Rodr´ıguez-Rozas, and Renato Spigler Partial Data Replication as a Strategy for Parallel Computing of the Multilevel Discrete Wavelet Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Liesner Acevedo, Victor M. Garcia, Antonio M. Vidal, and Pedro Alonso Dynamic Load Balancing for Adaptive Parallel Flow Problems . . . . . . . . . Stanislaw Gepner, Jerzy Majewski, and Jacek Rokicki A Balancing Domain Decomposition Method for a Discretization of a Plate Problem on Nonmatching Grids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Leszek Marcinkowski Application Specific Processors for the Autoregressive Signal Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anatolij Sergiyenko, Oleg Maslennikow, Piotr Ratuszniak, Natalia Maslennikowa, and Adam Tomas A Parallel Non-square Tiled Algorithm for Solving a Kind of BVP for Second-Order ODEs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Przemyslaw Stpiczy´ nski
41
51
61
70
80
87
Graph Grammar Based Petri Nets Model of Concurrency for Self-adaptive hp-Finite Element Method with Rectangular Elements . . . . Arkadiusz Szymczak and Maciej Paszy´ nski
95
Numerical Solution of the Time and Rigidity Dependent Three Dimensional Second Order Partial Differential Equation . . . . . . . . . . . . . . . Anna Wawrzynczak and Michael V. Alania
105
Hardware Implementation of the Exponent Based Computational Core for an Exchange-Correlation Potential Matrix Generation . . . . . . . . . . . . . Maciej Wielgosz, Ernest Jamro, and Kazimierz Wiatr
115
Parallel Implementation of Conjugate Gradient Method on Graphics Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Marcin Wozniak, Tomasz Olas, and Roman Wyrzykowski
125
Iterative Solution of Linear and Nonlinear Boundary Problems Using PIES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Eugeniusz Zieniuk and Agnieszka Boltuc
136
Paralel and Distributed Non-numerical Algorithms Implementing a Parallel Simulated Annealing Algorithm . . . . . . . . . . . . . . Zbigniew J. Czech, Wojciech Mikanik, and Rafal Skinderowicz
146
Parallel Computing Scheme for Graph Grammar-Based Syntactic Pattern Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mariusz Flasi´ nski, Janusz Jurek, and Szymon My´sli´ nski
156
Extended Cascaded Star Schema for Distributed Spatial Data Warehouse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Marcin Gorawski
166
Parallel Longest Increasing Subsequences in Scalable Time and Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Peter Krusche and Alexander Tiskin
176
A Scalable Parallel Union-Find Algorithm for Distributed Memory Computers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fredrik Manne and Md. Mostofa Ali Patwary
186
Tools and Environments for Parallel/Distributed/Grid Computing Extracting Both Affine and Non-linear Synchronization-Free Slices in Program Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wlodzimierz Bielecki and Marek Palkowski
196
A Flexible Checkpoint/Restart Model in Distributed Systems . . . . . . . . . . Mohamed-Slim Bouguerra, Thierry Gautier, Denis Trystram, and Jean-Marc Vincent
206
A Formal Approach to Replica Consistency in Directory Service . . . . . . . Jerzy Brzezi´ nski, Cezary Sobaniec, and Dariusz Wawrzyniak
216
Software Security in the Model for Service Oriented Architecture Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Grzegorz Kolaczek and Adam Wasilewski
226
Automatic Program Parallelization for Multicore Processors . . . . . . . . . . . Jan Kwiatkowski and Radoslaw Iwaszyn
236
Request Distribution in Hybrid Processing Environments . . . . . . . . . . . . . Jan Kwiatkowski, Mariusz Fras, Marcin Pawlik, and Dariusz Konieczny
246
Vine Toolkit - Grid-Enabled Portal Solution for Community Driven Computing Workflows with Meta-Scheduling Capabilities . . . . . . . . . . . . . Dawid Szejnfeld, Piotr Domagalski, Piotr Dziubecki, Piotr Kopta, Michal Krysinski, Tomasz Kuczynski, Krzysztof Kurowski, Bogdan Ludwiczak, Jaroslaw Nabrzyski, Tomasz Piontek, Dominik Tarnawczyk, Krzysztof Witkowski, and Malgorzata Wolniewicz
256
Applications of Parallel/Distributed Computing GEM – A Platform for Advanced Mathematical Geosimulations . . . . . . . . Radim Blaheta, Ondˇrej Jakl, Roman Kohut, and Jiˇr´ı Star´y Accelerating the MilkyWay@Home Volunteer Computing Project with GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Travis Desell, Anthony Waters, Malik Magdon-Ismail, Boleslaw K. Szymanski, Carlos A. Varela, Matthew Newby, Heidi Newberg, Andreas Przystawik, and David Anderson Vascular Network Modeling - Improved Parallel Implementation on Computing Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Krzysztof Jurczuk, Marek Kr¸etowski, and Johanne B´ezy-Wendling Parallel Adaptive Finite Element Package with Dynamic Load Balancing for 3D Thermo-Mechanical Problems . . . . . . . . . . . . . . . . . . . . . . Tomasz Olas, Robert Le´sniak, Roman Wyrzykowski, and Pawel Gepner Parallel Implementation of Multidimensional Scaling Algorithm Based on Particle Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Piotr Pawliczek and Witold Dzwinel
266
276
289
299
312
Particle Model of Tumor Growth and Its Parallel Implementation . . . . . . Rafal Wcislo and Witold Dzwinel
322
Applied Mathematics and Neural Networks Modular Neuro-Fuzzy Systems Based on Generalized Parametric Triangular Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Marcin Korytkowski and Rafal Scherer Application of Stacked Methods to Part-of-Speech Tagging of Polish . . . . Marcin Kuta, Wojciech W´ ojcik, Michal Wrzeszcz, and Jacek Kitowski Computationally Efficient Nonlinear Predictive Control Based on State-Space Neural Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Maciej L awry´ nczuk Relational Type-2 Interval Fuzzy Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . Rafal Scherer and Janusz T. Starczewski Properties of Polynomial Bases Used in a Line-Surface Intersection Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gun Srijuntongsiri and Stephen A. Vavasis
332 340
350 360
369
Minisymposium on GPU Computing A GPU Approach to the Simulation of Spatio–temporal Dynamics in Ultrasonic Resonators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pedro Alonso–Jord´ a, Isabel P´erez–Arjona, and Victor J. S´ anchez–Morcillo Reduction to Condensed Forms for Symmetric Eigenvalue Problems on Multi-core Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Paolo Bientinesi, Francisco D. Igual, Daniel Kressner, and Enrique S. Quintana-Ort´ı On Parallelizing the MRRR Algorithm for Data-Parallel Coprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christian Lessig and Paolo Bientinesi
379
387
396
Fast In-Place Sorting with CUDA Based on Bitonic Sort . . . . . . . . . . . . . . Hagen Peters, Ole Schulz-Hildebrandt, and Norbert Luttenberger
403
Finite Element Numerical Integration on GPUs . . . . . . . . . . . . . . . . . . . . . . Przemyslaw Plaszewski, Pawel Maciol, and Krzysztof Bana´s
411
Modeling and Optimizing the Power Performance of Large Matrices Multiplication on Multi-core and GPU Platform with CUDA . . . . . . . . . . Da Qi Ren and Reiji Suda
421
Stream Processing on GPUs Using Distributed Multimedia Middleware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Michael Repplinger and Philipp Slusallek Simulations of the Electrical Activity in the Heart with Graphic Processing Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bernardo M. Rocha, Fernando O. Campos, Gernot Plank, Rodrigo W. dos Santos, Manfred Liebmann, and Gundolf Haase Parallel Minimax Tree Searching on GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . Kamil Rocki and Reiji Suda A Fast GPU Implementation for Solving Sparse Ill-Posed Linear Equation Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Florian Stock and Andreas Koch
429
439
449
457
The Second Minisymposium on Cell/B.E. Technologies Monte Carlo Simulations of Spin Glass Systems on the Cell Broadband Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Francesco Belletti, Marco Guidetti, Andrea Maiorano, Filippo Mantovani, Sebastiano Fabio Schifano, and Raffaele Tripiccione
467
Montgomery Multiplication on the Cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Joppe W. Bos and Marcelo E. Kaihara
477
An Exploration of CUDA and CBEA for Einstein@Home . . . . . . . . . . . . . Jens Breitbart and Gaurav Khanna
486
Introducing the Semi-stencil Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ra´ ul de la Cruz, Mauricio Araya-Polo, and Jos´e Mar´ıa Cela
496
Astronomical Period Searching on the Cell Broadband Engine . . . . . . . . . Maciej Cytowski, Maciej Remiszewski, and Igor Soszy´ nski
507
Finite Element Numerical Integration on PowerXCell Processors . . . . . . . Filip Kru˙zel and Krzysztof Bana´s
517
The Implementation of Regional Atmospheric Model Numerical Algorithms for CBEA-Based Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dmitry Mikushin and Victor Stepanenko
525
Adaptation of Double-Precision Matrix Multiplication to the Cell Broadband Engine Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Krzysztof Rojek and L ukasz Szustak
535
Optimization of FDTD Computations in a Streaming Model Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Adam Smyk and Marek Tudruj
547
Workshop on Memory Issues on Multi- and Manycore Platforms An Orthogonal Matching Pursuit Algorithm for Image Denoising on the Cell Broadband Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dominik Bartuschat, Markus St¨ urmer, and Harald K¨ ostler
557
A Blocking Strategy on Multicore Architectures for Dynamically Adaptive PDE Solvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wolfgang Eckhardt and Tobias Weinzierl
567
Affinity-On-Next-Touch: An Extension to the Linux Kernel for NUMA Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stefan Lankes, Boris Bierbaum, and Thomas Bemmerl
576
Multi–CMP Module System Based on a Look-Ahead Configured Global Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Eryk Laskowski, L ukasz Ma´sko, and Marek Tudruj
586
Empirical Analysis of Parallelism Overheads on CMPs . . . . . . . . . . . . . . . . Ami Marowka
596
An Implementation of Parallel 3-D FFT with 2-D Decomposition on a Massively Parallel Cluster of Multi-Core Processors . . . . . . . . . . . . . . . . . . Daisuke Takahashi
606
Introducing a Performance Model for Bandwidth-Limited Loop Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jan Treibig and Georg Hager
615
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
625
Table of Contents – Part II
Workshop on Scheduling for Parallel Computing (SPC 2009) Fully Polynomial Time Approximation Schemes for Scheduling Divisible Loads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Joanna Berli´ nska Semi-online Preemptive Scheduling: Study of Special Cases . . . . . . . . . . . . Tom´ aˇs Ebenlendr Fast Multi-objective Reschulding of Grid Jobs by Heuristics and Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wilfried Jakob, Alexander Quinte, Karl-Uwe Stucky, and Wolfgang S¨ uß
1
11
21
Comparison of Program Task Scheduling Algorithms for Dynamic SMP Clusters with Communication on the Fly . . . . . . . . . . . . . . . . . . . . . . . . . . . . L ukasz Ma´sko, Marek Tudruj, Gregory Mounie, and Denis Trystram
31
Study on GEO Metaheuristic for Solving Multiprocessor Scheduling Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Piotr Switalski and Franciszek Seredynski
42
Online Scheduling of Parallel Jobs on Hypercubes: Maximizing the Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ondˇrej Zaj´ıˇcek, Jiˇr´ı Sgall, and Tom´ aˇs Ebenlendr
52
The Third Workshop on Language-Based Parallel Programming Models (WLPP 2009) Verification of Causality Requirements in Java Memory Model Is Undecidable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Matko Botinˇcan, Paola Glavan, and Davor Runje
62
A Team Object for CoArray Fortran . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Robert W. Numrich
68
On the Definition of Service Abstractions for Parallel Computing . . . . . . Herv´e Paulino
74
The Second Workshop on Performance Evaluation of Parallel Applications on Large-Scale Systems Performance Debugging of Parallel Compression on Multicore Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Janusz Borkowski Energy Considerations for Divisible Load Processing . . . . . . . . . . . . . . . . . . Maciej Drozdowski Deskilling HPL: Using an Evolutionary Algorithm to Automate Cluster Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dominic Dunlop, S´ebastien Varrette, and Pascal Bouvry Monitoring of SLA Parameters within VO for the SOA Paradigm . . . . . . Wlodzimierz Funika, Bartosz Kryza, Renata Slota, Jacek Kitowski, Kornel Skalkowski, Jakub Sendor, and Dariusz Krol
82 92
102 115
A Role-Based Approach to Self-healing in Autonomous Monitoring Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wlodzimierz Funika and Piotr P¸egiel
125
Parallel Performance Evaluation of MIC(0) Preconditioning Algorithm for Voxel μFE Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ivan Lirkov, Yavor Vutov, Marcin Paprzycki, and Maria Ganzha
135
Parallel HAVEGE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alin Suciu, Tudor Carean, Andre Seznec, and Kinga Marton
145
The Fourth Grid Applications and Middleware Workshop (GAMW 2009) UNICORE Virtual Organizations System . . . . . . . . . . . . . . . . . . . . . . . . . . . Krzysztof Benedyczak, Marcin Lewandowski, Aleksander Nowi´ nski, and Piotr Bala Application of ADMIRE Data Mining and Integration Technologies in Environmental Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Marek Ciglan, Ondrej Habala, Viet Tran, Ladislav Hluchy, Martin Kremler, and Martin Gera
155
165
Performance Based Matchmaking on Grid . . . . . . . . . . . . . . . . . . . . . . . . . . . Andrea Clematis, Angelo Corana, Daniele D’Agostino, Antonella Galizia, and Alfonso Quarati
174
Replica Management for National Data Storage . . . . . . . . . . . . . . . . . . . . . . Renata Slota, Darin Nikolow, Marcin Kuta, Mariusz Kapanowski, Kornel Skalkowski, Marek Pogoda, and Jacek Kitowski
184
Churn Tolerant Virtual Organization File System for Grids . . . . . . . . . . . . Leif Lindb¨ ack, Vladimir Vlassov, Shahab Mokarizadeh, and Gabriele Violino
194
The Fourth Workshop on Large Scale Computations on Grids (LaSCoG 2009) Quasi-random Approach in the Grid Application SALUTE . . . . . . . . . . . . Emanouil Atanassov, Aneta Karaivanova, and Todor Gurov
204
Mobile Agents for Management of Native Applications in GRID . . . . . . . Rocco Aversa, Beniamino Di Martino, Renato Donini, and Salvatore Venticinque
214
Leveraging Complex Event Processing for Grid Monitoring . . . . . . . . . . . . Bartosz Balis, Bartosz Kowalewski, and Marian Bubak
224
Designing Execution Control in Programs with Global Application States Monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Janusz Borkowski and Marek Tudruj
234
Distributed MIND - A New Processing Model Based on Mobile Interactive Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Magdalena Godlewska and Bogdan Wiszniewski
244
A Framework for Observing Dynamics of Agent-Based Computations . . . Jaroslaw Kawecki and Maciej Smolka HyCube: A DHT Routing System Based on a Hierarchical Hypercube Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Artur Olszak
250
260
Workshop on Parallel Computational Biology (PBC 2009) Accuracy and Performance of Single versus Double Precision Arithmetics for Maximum Likelihood Phylogeny Reconstruction . . . . . . . Simon A. Berger and Alexandros Stamatakis
270
Automated Design of Assemblable, Modular, Synthetic Chromosomes . . . Sarah M. Richardson, Brian S. Olson, Jessica S. Dymond, Randal Burns, Srinivasan Chandrasegaran, Jef D. Boeke, Amarda Shehu, and Joel S. Bader
280
GPU Parallelization of Algebraic Dynamic Programming . . . . . . . . . . . . . . Peter Steffen, Robert Giegerich, and Mathieu Giraud
290
Parallel Extreme Ray and Pathway Computation . . . . . . . . . . . . . . . . . . . . Marco Terzer and J¨ org Stelling
300
Minisymposium on Applications of Parallel Computation in Industry and Engineering Parallelized Transient Elastic Wave Propagation in Orthotropic Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Peter Arbenz, J¨ urg Bryner, and Christine Tobler
310
Parallel Numerical Solver for Modelling of Electromagnetic Properties of Thin Conductive Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ˇ ˇ ˇ Raimondas Ciegis, Zilvinas Kancleris, and Gediminas Slekas
320
Numerical Health Check of Industrial Simulation Codes from HPC Environments to New Hardware Technologies . . . . . . . . . . . . . . . . . . . . . . . . Christophe Denis
330
Application of Parallel Technologies to Modeling Lithosphere Dynamics and Seismicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Boris Digas, Lidiya Melnikova, and Valerii Rozenberg
340
AMG for Linear Systems in Engine Flow Simulations . . . . . . . . . . . . . . . . . Maximilian Emans Parallel Implementation of a Steady State Thermal and Hydraulic Analysis of Pipe Networks in OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mykhaylo Fedorov High-Performance Ocean Color Monte Carlo Simulation in the Geo-info Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tamito Kajiyama, Davide D’Alimonte, Jos´e C. Cunha, and Giuseppe Zibordi EULAG Model for Multiscale Flows – Towards the Petascale Generation of Mesoscale Numerical Weather Prediction . . . . . . . . . . . . . . . Zbigniew P. Piotrowski, Marcin J. Kurowski, Bogdan Rosa, and Michal Z. Ziemianski
350
360
370
380
Parallel Implementation of Particle Tracking and Collision in a Turbulent Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bogdan Rosa and Lian-Ping Wang
388
A Distributed Multilevel Ant-Colony Approach for Finite Element Mesh Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ˇ Katerina Taˇskova, Peter Koroˇsec, and Jurij Silc
398
Minisymposium on Interval Analysis Toward Definition of Systematic Criteria for the Comparison of Verified Solvers for Initial Value Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ekaterina Auer and Andreas Rauh
408
Fuzzy Solution of Interval Nonlinear Equations . . . . . . . . . . . . . . . . . . . . . . Ludmila Dymova
418
Solving Systems of Interval Linear Equations with Use of Modified Interval Division Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ludmila Dymova, Mariusz Pilarek, and Roman Wyrzykowski
427
Remarks on Algorithms Implemented in Some C++ Libraries for Floating-Point Conversions and Interval Arithmetic . . . . . . . . . . . . . . . . . . Malgorzata A. Jankowska
436
An Interval Method for Seeking the Nash Equilibria of Non-Cooperative Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bartlomiej Jacek Kubica and Adam Wo´zniak
446
From Gauging Accuracy of Quantity Estimates to Gauging Accuracy and Resolution of Measuring Physical Fields . . . . . . . . . . . . . . . . . . . . . . . . . Vladik Kreinovich and Irina Perfilieva
456
A New Method for Normalization of Interval Weights . . . . . . . . . . . . . . . . . Pavel Sevastjanov, Pavel Bartosiewicz, and Kamil Tkacz
466
A Global Optimization Method for Solving Parametric Linear Systems Whose Input Data Are Rational Functions of Interval Parameters . . . . . . Iwona Skalna
475
Direct Method for Solving Parametric Interval Linear Systems with Non-affine Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Iwona Skalna
485
Workshop on Complex Collective Systems Evaluating Lava Flow Hazard at Mount Etna (Italy) by a Cellular Automata Based Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Maria Vittoria Avolio, Donato D’Ambrosio, Valeria Lupiano, Rocco Rongo, and William Spataro
495
Application of CoSMoS Parallel Design Patterns to a Pedestrian Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sarah Clayton, Neil Urquhard, and Jon Kerridge
505
Artificial Intelligence of Virtual People in CA FF Pedestrian Dynamics Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ekaterina Kirik, Tat’yana Yurgel’yan, and Dmitriy Krouglov
513
Towards the Calibration of Pedestrian Stream Models . . . . . . . . . . . . . . . . Wolfram Klein, Gerta K¨ oster, and Andreas Meister
521
Two Concurrent Algorithms of Discrete Potential Field Construction . . . Konrad Kulakowski and Jaroslaw Was
529
Frustration and Collectivity in Spatial Networks . . . . . . . . . . . . . . . . . . . . . Anna Ma´ nka-Kraso´ n and Krzysztof Kulakowski
539
Weakness Analysis of a Key Stream Generator Based on Cellular Automata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fr´ed´eric Pinel and Pascal Bouvry
547
Fuzzy Cellular Model for On-line Traffic Simulation . . . . . . . . . . . . . . . . . . Bartlomiej Placzek
553
Modeling Stop-and-Go Waves in Pedestrian Dynamics . . . . . . . . . . . . . . . . Andrea Portz and Armin Seyfried
561
FPGA Realization of a Cellular Automata Based Epidemic Processor . . . Pavlos Progias, Emmanouela Vardaki, and Georgios Ch. Sirakoulis
569
Empirical Results for Pedestrian Dynamics at Bottlenecks . . . . . . . . . . . . . Armin Seyfried and Andreas Schadschneider
575
Properties of Safe Cellular Automata-Based S-Boxes . . . . . . . . . . . . . . . . . . Miroslaw Szaban and Franciszek Seredynski
585
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
593
Evaluating Performance of New Quad-Core Intel Xeon 5500 Family Processors for HPC
Pawel Gepner, David L. Fraser, and Michal F. Kowalik
Intel Corporation
{pawel.gepner,david.l.fraser,michal.f.kowalik}@intel.com
Abstract. In this paper we take a look at what the new Quad-Core Intel Xeon processor, code named Nehalem, brings to high performance computing. We compare an Intel Xeon 5400 series based system with a server utilizing its successor, the new Intel Xeon X5560. We compare both CPU generations on dual-socket platforms using a number of HPC benchmarks. The results clearly show that the new Intel Xeon processor 5500 family provides a significant performance advantage on typical HPC workloads and proves to be a right choice for many HPC installations. Keywords: HPC, multi-core processors, quad-core processors, parallel processing, benchmarks.
1 Introduction
The new 5500 family Quad-Core Intel Xeon processor is the first Intel server CPU with an integrated memory controller (IMC) as well as the first Xeon processor with a QuickPath Interconnect (QPI) interface. Nehalem based products are different from the previous generation products, and the design was clearly intended not only to scale across all the different product lines, but to be optimized for all the different product segments and market needs, from mobile via desktop to server. This new modular design strategy is called at Intel the core and uncore approach. Basically, Nehalem products are defined along two vectors: one specifies the explicit core functionality universal for all the Nehalem family members; the second is focused on uncore elements and is highly optimized to best fit the specific market needs and segment specifics. This includes the number of cores, the number of QPI links or memory interfaces, and many other uncore components which might be completely different from one family member to another. Core and uncore components can be designed and validated separately. The Quad-Core Intel Xeon processor 5500 family redefines not only every feature of the microprocessor micro-architecture but also requires fundamental modification of the system architecture. Systems based on the Xeon 5500 family introduce a completely new model of memory hierarchy as well as a new interconnect for connecting processors and other components. In addition to all of the micro-architecture and system innovations, Intel Xeon 5500 processors have been designed with all the
advantages of 45-nanometer (nm) Hi-k metal gate silicon technology. This process technology uses a new combination of Hi-k gate dielectrics and conductive materials to improve transistor properties, reducing electrical leakage, chip size, power consumption, and manufacturing costs [1].
2 Processor Microarchitecture and System Changes
The microarchitecture (or computer organization) is mainly a lower level structure that governs a large number of details hidden in the programming model. It describes the constituent parts of the processor and how these interconnect and interoperate to implement the architectural specification. This means that the new microarchitecture enhances the end-user benefits when compared to the previous generations, achieving a new level of performance, and changes the implementation of the Instruction Set Architecture in silicon, including cache memory design, execution units, and pipelining [2]. Modifications made inside the Nehalem core increase compute resources and deliver higher performance, enhanced parallelism, an improved macrofusion mechanism, a modified Loop Stream Detector (LSD), a new branch prediction mechanism, a new set of SSE 4.2 instructions, and a new cache hierarchy. In addition to all of the above, Nehalem re-introduces the Simultaneous Multi-Threading (SMT) technique pioneered on previous generations of the NetBurst microarchitecture based Xeon processor. Increasing parallelism at the core level and extending micro-operation concurrency requires a change in the size and structure of the out-of-order window. Nehalem's out-of-order buffer (ROB) grew from 96 entries in the previous generation of the Xeon family to 128 entries. The Reservation Station, which is used to dispatch operations to the execution units, grew from 32 to 36 entries. To track all allocations of loads and stores, the load buffer was extended from 32 to 48 entries and the store buffer from 20 to 32. A new macrofusion mechanism identifies more macrofusion opportunities to further increase performance and power efficiency. Nehalem decodes CMP or TEST followed by a conditional branch Jcc as a single micro-operation (CMP+Jcc), just as its predecessor did, and extends this to the following branch conditions:
– JL/JNGE
– JGE/JNL
– JLE/JNG
– JG/JNLE
This increases the decode bandwidth and reduces the micro-operation count, effectively making the machine wider. The Intel Xeon 5400 family supports macrofusion only in 32-bit mode; the Intel Xeon 5500 family supports macrofusion in both 32-bit and 64-bit modes. The Loop Stream Detector identifies software loops and takes advantage of this behavior by removing instruction fetch limitations and disabling unneeded blocks of logic. During a loop the processor decodes the same instructions over
and over and makes the same branch predictions over and over, so these units can be disabled to increase performance and save power. The Loop Stream Detector was first introduced in the Intel Xeon 5300 family, where it also operates as a cache for the instruction fetch unit. The Nehalem architecture improves on this concept by moving the Loop Stream Detector closer to the decode stage and keeping macro-operations instead of instructions; theoretically this is a very similar approach to the trace cache used on the Pentium 4 family. Nehalem has two levels of branch target buffer (BTB); these are designed to increase performance and power efficiency. The first and second level BTBs actually use different prediction algorithms and also different history files. These new and improved branch predictors are optimized not only for the best possible branch prediction accuracy but also to work with SMT. Simultaneous Multi-Threading returns to the Intel Xeon family after first being implemented in the NetBurst microarchitecture based Xeon. Because an SMT-enabled CPU requires substantial memory bandwidth, it is likely that SMT will bring even more end-user benefit than Hyper-Threading did. However, for some workloads, including HPC applications, SMT may need to be disabled to get the best possible performance [3]. Nehalem supports the same SSE instructions as the Xeon 5400 family, but in addition to the 47 instructions previously supported, 7 new instructions have been added. The SSE 4.2 instruction set includes instructions for the Vectorizing Compiler and Media Accelerator. The new instructions should improve the performance of audio, video, and image editing applications, video encoders, 3-D applications, and games. The last major change to the core microarchitecture for Nehalem is its new cache hierarchy. Nehalem's cache hierarchy has been extended to three levels: L1 and L2 stay fairly small and private to each core, while the L3 cache is much larger and shared between all cores. Each core in Nehalem has a private 32KB first level L1 instruction cache and a 32KB data cache. In addition, the unified 256KB L2 cache is 8-way associative and provides extremely fast access to data and instructions; the latency is typically smaller than 11 clock cycles. The first members of the Intel Xeon 5500 family have an 8MB L3 cache; however, additional products may be introduced with different cache sizes depending on the market segment requirements and platform specifics. The L3 cache is organised as a 16-way set associative, inclusive and shared cache. The cache size can also be different, as a result of the uncore design philosophy. The L3 cache clock is decoupled from the cores' frequency, so theoretically the clock of the cache can be different from the frequency of the cores. Nehalem's L3 load-to-use latency is less than 35 cycles, and the cache is implemented as an inclusive model; this approach guarantees coherency in the most efficient way. Any access that misses in the L3 has a guarantee that the data is not present in any of the L2 or L1 caches of the cores. The core valid bits mechanism limits unnecessary snoops of the cores on a hit: only the identified core that may hold a private copy of the cache line, possibly modified in its L1 or L2, is checked.
This method allows the L3 cache to protect each of the cores from expensive coherency traffic. The Nehalem microarchitecture changes all these pieces and also redefines the system architecture. The most ground-breaking changes are the integrated memory controller and the QuickPath Interconnect. The memory controller is optimized per market segment, by the same analogy as the L3 cache. The initial Intel Xeon 5500 family products support 3 memory channels per socket of DDR3-1333MHz, both registered and un-registered DIMMs. Each channel of memory can operate independently and handles requests in an out-of-order fashion. The local memory latency for Nehalem is 60 ns. The memory controller has been designed for low latency and optimized for both local and remote memory latency. Nehalem delivers a huge reduction in local memory latency compared to previous generation Front Side Bus based Xeon processors; even remote memory latency is a fast 93 ns. Of course, effective memory latency is dependent on the application and the Operating System implementation of the Non Uniform Memory Architecture. Linux is a NUMA-ready operating system, and so too is Windows Vista. The biggest improvement compared to previous generation based systems is memory bandwidth. Total peak bandwidth for a single socket is 32GB/s, and a typical HPC two-socket configuration gives 64GB/s. The QuickPath Interconnect (QPI) is a packet-based, high bandwidth, low latency point-to-point interconnect that operates at up to 6.4GT/s. This new point-to-point interconnect provides socket to socket connections and socket to chipset connections, thus enabling system architects to build scalable solutions. Each link is a 20-bit wide interface using differential signaling. The QPI packet is 80 bits long, but only 64 bits are dedicated to data; the rest is used for flow control. This means that each link provides a transfer rate of 12.8GB/s in each direction and 25.6GB/s in total, as links are bi-directional. The QPI links are uncore components and, depending on the CPU implementation, there might be a different number of links. The first Nehalem implementation dedicated to DP servers (the Intel Xeon 5500 family) has two links, but the desktop version (e.g. Intel Core i7) has only one. All the above microarchitecture innovations as well as system enhancements increase the system performance and should benefit high performance computing. The exact details of those innovations are described further in this paper. During all tests we have measured single system performance on standard HPC workloads. For all of the tests we have been comparing two systems: an Intel Xeon processor 5400 (Harpertown) family based platform versus an Intel Xeon processor 5500 family (Nehalem-EP) based system.
Intel Xeon processor 5400 family based platform: Platform configured with two Quad-Core Intel Xeon processors E5472 3.00GHz, 2x6MB L2 cache, 1600MHz FSB, 16GB memory (8x2GB FBDIMM 800MHz), RedHat Enterprise Linux Server 5 on x86_64.
Intel Xeon processor 5500 family based platform: Pre-production platform Green City (Tylersburg B0 stepping) with two Quad-Core Intel Xeon processors X5560 (Nehalem-EP) 2.80GHz with 8M L3 cache, QPI 6.4 GT/s, 24
GB memory (6x4GB DDR3, 1066 MHz), RedHat Enterprise Linux Server 5 on x86_64. These configurations are the ones typically suggested by vendors as high-end HPC platforms. The difference in memory size is mainly driven by the dissimilar memory types (FBDIMM vs. DDR3) and the different memory subsystem architectures (FSB vs. Integrated Memory Controller).
3 Processor Performance
The main focus of this section is to present a comparison of two generations of the Quad-Core Intel Xeon processors. A popular benchmark well-suited for parallel, core-limited workloads is the Linpack HPL benchmark. Linpack is a floating-point benchmark that solves a dense system of linear equations in parallel. The metric produced is Giga-FLOPS, or billions of floating point operations per second. Linpack performs an operation called LU factorization. It is highly parallel and stores most of its working data set in the processor cache. It makes relatively few references to memory for the amount of computation it performs. The processor operations it does perform are predominantly 64-bit floating-point vector operations and these use SSE instructions [4]. This benchmark is used to determine the world’s fastest computers, published at the website http://www.top500.org/.
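To make the Giga-FLOPS metric concrete, the short sketch below (ours, not part of the original study) converts a measured HPL solve time into GFLOPS using the usual 2/3*n^3 + 2*n^2 operation count for an n x n dense LU solve; the problem size and run time in the example are hypothetical.

# Illustrative sketch (not from the paper): the standard HPL convention for
# turning a measured solve time into the GFLOPS metric described above.
def hpl_gflops(n, seconds):
    """GFLOPS for an n x n dense LU solve, using the usual 2/3*n^3 + 2*n^2 count."""
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return flops / seconds / 1e9

# Example: a 40,000-unknown run (hypothetical figures, for illustration only)
print(round(hpl_gflops(40_000, 600.0), 1), "GFLOPS")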
Fig. 1. LINPACK: Dense Floating-Point Operations
In both cases each processor core has 3 functional units that are capable of generating 128-bit results per clock. In this case we may assume that a single processor core performs two 64-bit floating-point ADD operations and two 64-bit floating-point MUL operations per clock. The theoretical performance is calculated as the number of MUL and ADD operations executed in each clock, multiplied by the frequency, giving the following results. For the Quad-Core Intel Xeon processor E5472 this gives 3GHz x 4 operations per clock x 4 cores = 48 GFLOPS. For
the Intel Xeon X5560 the theoretical performance is 2.8GHz x 4 operations per clock x 4 cores = 44.8 GFLOPS. This is theoretical performance only, and does not fully reflect the real life scenario of running Linpack; in fact this level of theoretical performance disadvantage has also been observed in the Linpack benchmarking scenario. All the innovations of Nehalem do not compensate for its slower clock at these initial launch frequencies when purely considering theoretical performance or CPU-bound workloads. The Intel Xeon processor E5472 benefits from a higher CPU clock on the Linpack application. Using LINPACK HPL we see a 6% performance disadvantage of the system based on the new Quad-Core Intel Xeon processor X5560 relative to the Quad-Core Intel Xeon processor E5472, as Fig. 1 shows. It indicates that, as we expected based on the theoretical performance, the Intel Xeon 5500 family will not realize a performance improvement on tasks that are heavily dependent on CPU clock frequency.
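The peak-performance arithmetic above can be reproduced with a few lines of Python; this is only a restatement of the calculation in the text (4 double-precision operations per clock per core via 128-bit SSE units), not additional measured data.

# Sketch of the theoretical-peak arithmetic used above.
def peak_gflops(ghz, cores, flops_per_cycle=4):
    return ghz * flops_per_cycle * cores

e5472 = peak_gflops(3.0, 4)    # 48.0 GFLOPS per socket (Harpertown)
x5560 = peak_gflops(2.8, 4)    # 44.8 GFLOPS per socket (Nehalem-EP)
print(e5472, x5560, f"{(1 - x5560 / e5472):.1%} lower peak for the X5560")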
4 Memory Performance
Theory describes memory performance as a combination of the two following elements: latency and throughput. These components have an impact on different workloads, and they are applicable to different usage models. Latency describes how long it takes to chase a chain of pointers through memory. Only a single chain is tracked at a time. Each chain stores only one pointer in a cache line and each cache line is randomly selected from a pool of memory. The pool of memory simulates the working environment of an application. When the memory pool is small enough to be placed inside cache, the benchmark measures the latency required to fetch data from cache. By changing the size of the memory pool we can measure the latency of any specific level of cache, or of main memory, by simply making the pool bigger than all levels of cache [5]. We measured latency using a 3.0 GHz Quad-Core Intel Xeon processor E5472 and a 2.8 GHz Quad-Core Intel Xeon processor X5560. The results of this experiment are shown in Figure 2. The Intel Xeon processor X5560, with its new cache structure and integrated memory controller, significantly improves latency and data movement. The L1 cache stays on the same level, even taking slightly longer to fetch data because of the lower clock frequency of the Intel Xeon X5560 vs. the Intel Xeon E5472. Also the new cache hierarchy, with a smaller dedicated 256KB L2 and a shared 8MB L3, makes the picture less clear. Based on the preliminary results we have achieved, it seems that LMBENCH does not count the L3 at all or hides it in the memory latency. Such a huge reduction in memory latency, of 48%, has been achieved simply by local memory allocation, as LMBENCH does not measure the NUMA aspects of the processor. The remote memory latency is not improved so greatly; in fact it improves by only 5% versus the Xeon E5472. The improvement made on latency reduction for local memory access significantly helps random memory access, where the NUMA element is partly considered, and the overall reduction is 22%.
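For illustration, the sketch below mimics the pointer-chasing idea behind this latency benchmark: one pointer per cache line, lines linked into a random cycle and then chased serially so that every load depends on the previous one. LMBENCH does this in C; a Python version such as this one only demonstrates the access pattern, since the reported times also include interpreter overhead.

import random, time

def pointer_chase(pool_bytes, line_bytes=64, hops=1_000_000):
    """One pointer per cache line, lines linked in a random cycle, then chased
    serially so every access depends on the previous one (illustration only)."""
    n = max(2, pool_bytes // line_bytes)
    order = list(range(n))
    random.shuffle(order)
    nxt = [0] * n
    for a, b in zip(order, order[1:] + order[:1]):   # random Hamiltonian cycle
        nxt[a] = b
    i, t0 = 0, time.perf_counter()
    for _ in range(hops):
        i = nxt[i]                                   # serial dependent "loads"
    return (time.perf_counter() - t0) / hops * 1e9   # ns per hop (incl. interpreter cost)

for size in (16 * 1024, 256 * 1024, 8 * 1024 * 1024, 64 * 1024 * 1024):
    print(size // 1024, "KB pool:", round(pointer_chase(size), 1), "ns/hop")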
Fig. 2. Memory Latency
Fig. 3. Stream benchmark - memory throughput improvement
The second component of memory performance is the throughput for sequential memory accesses. The benchmark we have used to measure the throughput is the Stream benchmark. The Stream benchmark is a synthetic benchmark program, written in standard Fortran 77, developed by John McCalpin. It estimates both memory reads and memory writes (in contrast to the standard usage for bcopy). It measures the performance of four long vector operations. These operations are:

COPY: a(i) = b(i)
SCALE: a(i) = q*b(i)
SUM: a(i) = b(i) + c(i)
TRIAD: a(i) = b(i) + q*c(i)
These operations are representative of long vector operations, and the array sizes are defined so that each array is larger than the cache of the processors to be tested. This gives us an indication of how effective the memory subsystem is, excluding caches. As Fig. 3 shows, we see a significant memory bandwidth improvement with the Quad-Core Intel Xeon processor X5560, and this is mainly due to the integrated memory controller, the different memory type and the memory arrangement. This enormous throughput improvement versus the previous generation Quad-Core Intel Xeon based system will be reflected in all memory intensive applications, not only Stream but also other HPC workloads.
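A rough, NumPy-based imitation of the four kernels listed above is sketched below; it is not the official Fortran Stream code, but it shows how the bandwidth figure is derived from the bytes moved per kernel and the elapsed time. Array sizes are chosen to exceed the caches, and the naive TRIAD shown here creates a temporary that adds extra traffic.

import time
import numpy as np

N = 10_000_000                      # ~80 MB per array, well beyond any cache level
a = np.zeros(N)
b = np.random.rand(N)
c = np.random.rand(N)
q = 3.0

def gbps(kernel, bytes_moved):
    t0 = time.perf_counter()
    kernel()
    return bytes_moved / (time.perf_counter() - t0) / 1e9   # GB/s

kernels = {
    "COPY":  (lambda: np.copyto(a, b),          2 * 8 * N),  # read b, write a
    "SCALE": (lambda: np.multiply(b, q, out=a), 2 * 8 * N),
    "SUM":   (lambda: np.add(b, c, out=a),      3 * 8 * N),  # read b and c, write a
    "TRIAD": (lambda: np.add(b, q * c, out=a),  3 * 8 * N),  # q*c allocates a temporary
}
for name, (kern, nbytes) in kernels.items():
    print(f"{name:5s} {gbps(kern, nbytes):6.1f} GB/s")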
5 Application Performance
Linpack and Stream, the synthetic benchmarks used during the processor and memory performance tests, measure the performance of specific subsystems and do not explain the full spectrum of system capability. Typical HPC applications by nature use much more than a single subsystem and their behavior is much more sophisticated. So, to get a better understanding of how the Quad-Core Intel Xeon processor X5560 based platform benefits real HPC applications, we have selected several real examples. These applications and benchmarks represent a broad spectrum of HPC workloads and seem to be a typical representation of a testing suite for this class of calculation. Similar benchmarks were used in publication [1], therefore the benchmark descriptions are similar. Abaqus from SIMULIA is a commercial software package for finite element analysis. It is a general-purpose solver using a traditional implicit integration scheme to solve finite element analyses. The product is popular with academic and research institutions due to its wide material modeling capability and the ability to be easily customized. Amber is a package of molecular simulation programs. The workload measures the number of problems solved per day (PS) using eight standard molecular dynamics simulations. See http://amber.ch.ic.ac.uk/amber8.bench1.html for more information. Fluent is a commercial engineering application used to model computational fluid dynamics. The benchmark consists of 9 standard workloads organized into small, medium and large models. These comparisons use all but the largest of the models, which do not fit into the 8GB of memory available on the platforms. The Rating, the default Fluent metric, was used in calculating the ratio of the platforms by taking a geometric mean of the 8 workload ratings measured. GAMESS from Iowa State University - a quantum chemistry program widely used to calculate energies, geometries, frequencies and properties of molecular systems, http://www.msg.ameslab.gov/GAMESS/GAMESS.html. Gaussian from Gaussian, Inc. - a quantum chemistry program widely used to calculate energies, geometries, frequencies and properties of molecular systems, http://www.gaussian.com.
GROMACS (Groningen Machine for Chemical Simulations) from Groningen University is a molecular dynamics program used to calculate energies, geometries, frequencies and properties of molecular systems, http://www.gromacs.org. LS-DYNA is a commercial engineering application used in finite element analysis such as a car collision. The workload used in these comparisons is called 3 Vehicle Collision and is publicly available from http://www.topcrunch.org/. The metric for the benchmark is elapsed time in seconds. Monte Carlo from the Oxford Center for Computational Finance (OCCF) - a financial simulation engine using the Monte Carlo technique, http://www.occf.ox.ac.uk/. PamCrash from ESI Group - an explicit finite-element program well suited for crash simulations. For more information go to http://www.esi-group.com/SimulationSoftware/NumericalSimulation/index.html. Star-CD is a suite of test cases, selected to demonstrate the versatility and robustness of STAR-CD in computational fluid dynamics solutions. The metric produced is elapsed seconds converted to jobs per day. For more information go to http://www.cd-adapco.com/products/STAR-CD.
Fig. 4. Platform comparison across HPC selected workloads
All these selected workloads have been tested on two dual-socket HPC optimized platforms. The Quad-Core Intel Xeon processor E5472 based platform has been used as the baseline to illustrate the improvement that the new platform is going to bring in different HPC workloads. As we can see, the Quad-Core Intel Xeon processor X5560 based platform shows a significant performance improvement of up to 133%. The new microarchitecture helps especially in the data intensive applications and workloads where data movement plays an important role. If the task is more CPU intensive the difference is around 5-15% for most of these workloads. For reference we have also included the best published results of an AMD two-socket platform utilizing two Quad-Core AMD Opteron 2354 (Barcelona 2.2GHz) processors with 16 GB RAM (8x2 GB, 667MHz DDR2). When looking at a broad range of highly applicable workloads, the Quad-Core Intel Xeon processor X5560 based platform demonstrates clear leadership.
6 Conclusion
The new Quad-Core Intel Xeon X5560 processor brings a new level of performance to HPC, improving significantly on applications where its predecessors were not the most powerful. The new microarchitecture enhancements and system level modifications benefit data intensive types of HPC applications by up to as much as 130%. Processor intensive tasks will not achieve the same performance improvement as memory bound workloads, but they continue to stay at the same high level. Both generations of tested CPUs deliver the same theoretical performance when running at the same clock and keep within the same thermal envelope, but as shown the new generation delivers an improvement in performance on HPC workloads. Also the platform level modifications like QPI make the new platform more balanced from an I/O perspective and open new possibilities for superior interconnect solutions. The Quad-Core Intel Xeon 5500 series is the first instance of the Nehalem generation; new additions to this family will bring even more functionality, scalability and performance and will become a compelling choice for many of the new HPC installations.
References
1. Gepner, P., Fraser, D.L., Kowalik, M.F.: Multi-Core Processors: Second generation Quad-Core Intel Xeon processors bring 45nm technology and a new level of performance to HPC applications. In: ICCS 2008, pp. 417–426 (2008)
2. Smith, J.E., Sohi, G.S.: The Microarchitecture of superscalar processors. Proc. IEEE 83, 1609–1624 (1995)
3. Eggers, S.J., Emer, J.S., Levy, H.M., Lo, J.L., Stamm, R.L., Tullsen, D.M.: Simultaneous multithreading: A platform for next generation processors. Proc. IEEE 17, 12–19 (1997)
4. Dongarra, J., Luszczek, P., Petitet, A.: Linpack Benchmark: Past, Present, and Future, http://www.cs.utk.edu/luszczek/pubs/hplpaper.pdf
5. Gepner, P., Kowalik, M.F.: Multi-Core Processors: New Way to Achieve High System Performance. In: PARELEC 2006, pp. 9–13 (2006)
Interval Wavelength Assignment in All-Optical Star Networks

Robert Janczewski, Anna Małafiejska, and Michał Małafiejski

Gdańsk University of Technology, Poland
Algorithms and System Modeling Department
{robert,aniam,mima}@kaims.pl
Abstract. In the paper we consider a new problem of wavelength assignment for multicasts in optical star networks. We are given a star network in which nodes from a set V are connected to the central node with optical fibres. The central node redirects the incoming signal from a single node on a particular wavelength from a given set of wavelengths to some of the other nodes. The aim is to minimize the total number of used wavelengths, which means that the overall cost of the transmission is minimized (i.e. wavelength conversion or optoelectronic conversion is minimized). This problem can be modelled by a p-fiber coloring of some labelled digraph, where colors assigned to arcs of the digraph correspond to the wavelengths. In the paper we assume that the set of all wavelengths of the incoming signals to a particular node forms an interval, i.e. a consecutive set of numbers. We analyse the problem of one-multicast transmission (per node). We construct polynomial time algorithms for some special classes of graphs: complete k-partite graphs, trees and subcubic bipartite graphs. Keywords: WDM optical networks, multicasting, interval coloring, star arboricity.
1 Introduction
The considerations in this paper are motivated by multicasting communication in multifiber WDM (wavelength-division multiplexing) all-optical networks, which was recently considered in papers [3,6,7,22]. By a multifiber WDM network we mean a network where nodes are connected with parallel fibers. The problem is formulated as follows: we are given an all-optical star network with a set of nodes V , which are connected to the central node by optical fibers. Each node v ∈ V sends a set of at most q multicasts to some other nodes S1 (v), . . . , Sq (v), where Si (v) ⊂ V . The transmission through the central node uses WDM, i.e. different signals may be sent at the same time through the same fiber but on different wavelengths. The first step of every multicast transmission from a node v is sending a message to the central node on a particular wavelength from a given set of wavelengths. In the next step, the central node redirects the message to each node of Si (v) using one of these wavelengths. The
goal is to minimize the total number of wavelengths during the simultaneous transmission of all multicasts in the network. This problem can be modelled by an arc coloring of a labelled (multi)digraph with some special requirements on the colors [3,7]. If we assume there is only one fiber between the central node and each node from V , and there is only one multicast per node (we define such a network as a one-multicast one-fiber network), the problem reduces to the (directed) star arboricity (coloring) problem [1,2,15] and the incidence coloring problem [8]. In this paper we assume the set of wavelengths of incoming signals forms an interval, i.e. a consecutive set of numbers, which may mean multiples of some base wavelength. Model and problem definition. In the general case, every node from a set V of n nodes is connected to the central node with p optical fibres; such a network is sometimes called a star network. The multicast transmission in this network can be modelled by a (multi)digraph D with vertex set V and with labelled arc sets leaving v corresponding to the multicasts S1 (v), . . . , Sq (v), i.e. every set Ai (v) = {vw : w ∈ Si (v)} of arcs leaving v is labelled with i (Fig. 1).
Fig. 1. S1 (v) = {a, b}, S2 (v) = {a, c}
We define a p-fiber coloring as a function assigning positive integers (colors, wavelengths) to the arcs such that at each vertex v and for every color c, inarc(v, c) + outlab(v, c) ≤ p, where inarc(v, c) is the number of arcs entering v and colored with c, and outlab(v, c) = |{i : ∃e ∈ Ai (v) and e is colored with c}|. In other words, outlab(v, c) is the number of labels of arcs leaving v and colored with c. For a given p-fiber coloring of a digraph D we can easily construct a proper assignment of fibers in a network corresponding to D as follows: we assign a different fiber corresponding to every arc entering v and colored with c, and to each set of arcs leaving v, colored with c and of the same label. The correspondence between colors and wavelengths is shown in Fig. 1. Hence, the problem of simultaneous transmission of all multicasts in the p-fiber network with a minimum number of wavelengths is equivalent to the p-fiber coloring of arcs of the digraph D with a minimum number of colors. Now, we formulate a decision version of the problem of p-fiber wavelength assignment for q-multicasts in optical star networks:
The (p, q)-WAM problem
Instance: A digraph D with at most q labels on arcs, a positive integer k.
Question: Is there a p-fiber coloring of D with at most k colors?
A comprehensive survey of the (p, q)-WAM problem can be found in [3]. The (p, 1)-WAM problem was considered in [16] (i.e. for multitrees). In the paper we consider the (1, 1)-WAM problem, i.e. the fundamental case, where the fiber is unique and each node sends only one multicast. In this case, the p-fiber coloring of a multidigraph reduces to the coloring of arcs of a digraph satisfying only two conditions: (c1) any two arcs entering the same vertex have different colors, (c2) any arc entering and any arc leaving the same vertex have different colors. This coloring is well-studied [3,7] and arises from the problem of partitioning a set of arcs into the smallest number of forests of directed stars; this problem is called the directed star arboricity problem [1,2,15]. Moreover, we focus our attention on symmetric communication, i.e. every transmission from v to w implies the communication between w and v; in this case the digraph modeling the communication between nodes is a simple graph. This problem can be described as the (1, 1)-WAM problem with symmetric multicasts and can be reduced to the incidence coloring of graphs [8,10,11,20,23] (for details see Section 2). Previous results and our contribution. In this paper we study the WAM problem under the assumption that the set of colors assigned to arcs entering a vertex forms an interval, i.e. a consecutive set of numbers. This corresponds to having consecutive wavelengths on the link between the central node and the destination node. This problem is important for traffic grooming in WDM networks, where wavelengths could be groomed into wavebands [22]. The interval (p, q)-WAM problem is a new concept arising from the well-studied model of interval edge coloring [4,12,13,14]. In [17] the authors introduced the problem of interval incidence coloring, modeling message passing in networks. The authors proposed some bounds on the interval incidence coloring number (IIC) and studied this problem for some classes of graphs (paths, cycles, stars, fans, wheels and complete graphs). In this paper we construct polynomial time algorithms for trees, subcubic bipartite graphs, i.e. bipartite graphs with vertex degree at most three, and complete k-partite graphs. Outline of the paper. In Section 2 we introduce definitions, prove some useful bounds for the interval incidence chromatic number, and construct a polynomial time algorithm for complete k-partite graphs. In Section 3 we construct polynomial time algorithms for bipartite graphs with maximum degree at most 3, i.e.
subcubic bipartite graphs. In Section 4 we construct a polynomial time algorithm for trees using a dynamic programming approach.
2 Definitions and Simple Properties
In the following we consider connected simple graphs only, and use standard notation of graph theory. Incidence coloring. For a given simple graph G = (V, E), we define an incidence as a pair (v, e), where vertex v ∈ V (G) is one of the ends of edge e ∈ E(G) (we say v is incident with e). Let us define the set of incidences I(G) = {(v, e) : v ∈ V (G) ∧ e ∈ E(G) ∧ v ∈ e}. We say that two incidences (v, e) and (w, f ) are adjacent if one of the following holds: (i) v = w, e ≠ f, (ii) e = f, v ≠ w, (iii) e = {v, w}, f = {w, u} and v ≠ u. By an incidence coloring of G we mean a function c : I(G) → N such that c((v, e)) ≠ c((w, f )) for any adjacent incidences (v, e) and (w, f ). In the following we use the simplified notation c(v, e) instead of the formal c((v, e)). Interval incidence coloring. A finite subset A of N is an interval if and only if it contains all numbers between min A and max A. For a given incidence coloring c of a graph G let Ac (v) = {c(v, e) : v ∈ e ∧ e ∈ E(G)}. By an interval incidence coloring of a graph G we mean an incidence coloring c of G such that for each vertex v ∈ V (G) the set Ac (v) is an interval. The interval incidence coloring number of G, denoted by IIC(G), is the smallest number of colors in an interval incidence coloring. An interval incidence coloring of G using IIC(G) colors is said to be minimal. Symmetric (1, 1)-WAM problem. For a given symmetric digraph D consider the (1, 1)-WAM problem and observe that we can omit the unique label on arcs. Every two symmetric arcs uv and vu between vertices u and v can be replaced by one edge e = {u, v} with two incidences (v, e) and (u, e), which correspond to the arcs uv and vu, respectively. Hence, the coloring of arcs of digraphs satisfying conditions (c1) and (c2) is equivalent to the incidence coloring of graphs.
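The definitions above translate directly into a small checker; the sketch below (our own helper, with hypothetical function names) tests whether a given assignment of colors to incidences is a proper incidence coloring whose color sets Ac(v) are intervals.

from itertools import combinations

def incidences(edges):
    """I(G): all pairs (v, e) with v an endpoint of e; edges are frozensets of two vertices."""
    return [(v, e) for e in edges for v in e]

def adjacent(i1, i2):
    """Adjacency of incidences per conditions (i)-(iii) above (checked in both orders)."""
    def one_way(x, y):
        (v, e), (w, f) = x, y
        if v == w and e != f:
            return True                        # (i) same vertex, different edges
        if e == f and v != w:
            return True                        # (ii) same edge, different endpoints
        return v != w and w in e and v not in f   # (iii) e = {v, w}, f = {w, u}, v != u
    return one_way(i1, i2) or one_way(i2, i1)

def is_interval_incidence_coloring(edges, c):
    """c maps every incidence (v, e) to a positive integer colour."""
    I = incidences(edges)
    if any(adjacent(p, r) and c[p] == c[r] for p, r in combinations(I, 2)):
        return False                           # not a proper incidence coloring
    for v in {x for e in edges for x in e}:    # every A_c(v) must be an interval
        A = sorted({c[(v, e)] for e in edges if v in e})
        if A != list(range(A[0], A[-1] + 1)):
            return False
    return True

# Tiny example: the path 1-2-3 admits an interval incidence coloring with 3 colours.
E = [frozenset({1, 2}), frozenset({2, 3})]
c = {(1, E[0]): 3, (2, E[0]): 1, (2, E[1]): 2, (3, E[1]): 3}
assert is_interval_incidence_coloring(E, c)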
2.1 Bounds for the Interval Incidence Coloring Number
The IIC number can be bounded by maximum degree Δ(G) and its chromatic number χ(G) as follows. Theorem 1. For a given non-empty graph G we have Δ(G) + 1 ≤ IIC(G) ≤ χ(G) · Δ(G).
Proof. The lefthand inequality holds because for any interval incidence coloring c, each vertex v ∈ V (G) and its neighbour u, we have c(u, {u, v}) ≠ c(v, {u, v}). To prove the righthand inequality, we divide the vertex set into χ(G) independent sets denoted by I1 , I2 , . . . , Iχ(G) . For any v ∈ Ii we can assign {c(v, e) : (v, e) ∈ I(G)} ⊆ {1 + (i − 1) · Δ, . . . , i · Δ} in such a way that Ac (v) is an interval. It is easy to see that any two adjacent incidences (v, {v, u}) and (u, e) will receive different colors. Theorem 2. For every k ≥ 2 and for every complete k-partite graph G = Kp1 ,p2 ,...,pk the following lower bound holds: IIC(G) ≥ 2n − max{pi + pj : i, j = 1, . . . , k ∧ i ≠ j}. Proof. Let c be an interval incidence coloring of G that uses IIC(G) colors. Let v be a vertex of G such that min Ac (v) = min c(I(G)). Let u be a neighbour of v such that c(v, {u, v}) = max Ac (v). Then Ac (u) ∩ Ac (v) = ∅ and IIC(G) ≥ |Ac (u)| + |Ac (v)| = deg(u) + deg(v) ≥ 2n − max{pi + pj : i, j = 1, . . . , k ∧ i ≠ j}.
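The upper-bound construction used in this proof can be sketched as follows, assuming a proper vertex coloring with classes I1, ..., Iχ(G) is already given; the function and variable names are ours, and the result can be verified with the checker sketched earlier.

def coloring_from_vertex_classes(edges, vertex_class, delta):
    """Upper-bound construction from the proof of Theorem 1: every vertex v in the
    independent set I_i receives the deg(v) smallest colours of the block
    {1 + (i-1)*delta, ..., i*delta}, so A_c(v) is an interval and the incidences
    of adjacent vertices take colours from disjoint blocks."""
    c = {}
    for v in {x for e in edges for x in e}:
        base = 1 + (vertex_class[v] - 1) * delta
        for offset, e in enumerate(sorted((e for e in edges if v in e), key=sorted)):
            c[(v, e)] = base + offset
    return c

# Example: the 4-cycle with the 2-colouring {1, 3} -> class 1, {2, 4} -> class 2 and Delta = 2;
# the result uses chi * Delta = 4 colours.
E4 = [frozenset(p) for p in [(1, 2), (2, 3), (3, 4), (4, 1)]]
c4 = coloring_from_vertex_classes(E4, {1: 1, 3: 1, 2: 2, 4: 2}, delta=2)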
2.2 IIC Number for Some Classes of Graphs
In Table 1 we give exact values of the interval incidence coloring number for some known classes of graphs.

Table 1. The value of IIC for some classes of graphs

Graph’s family                        IIC(G)                               Ref.
paths Pn, n ≥ 5                       Δ + 2                                [17]
cycles Cn                             Δ + 2                                [17]
stars Sn                              Δ + 1                                [17]
2-stars Sn^2                          Δ + 1                                [17]
k-stars Sn^k, k ≥ 3                   2Δ                                   [21]
wheels and fans                       Δ + 3                                [17]
complete graphs                       2Δ                                   [17]
complete bipartite Kp,q               p + q                                [21]
complete k-partite Kp1,p2,...,pk      2n − max{pq + pr : 1 ≤ q < r ≤ k}    [21]
bipartite torus                       2Δ                                   [18]
regular bipartite                     2Δ                                   [18]

2.3 IIC Number for Complete k-Partite Graphs
Theorem 3. A complete k-partite graph Kp1 ,p2 ,...,pk can be colored optimally with 2n − max{pi + pj : i, j = 1, . . . , k ∧ i ≠ j} colors in O(m + n) time. Proof. By Theorem 2, it suffices to construct an interval incidence coloring of G = Kp1 ,p2 ,...,pk that uses exactly 2n − max{pi + pj : i, j = 1, . . . , k ∧ i ≠ j} colors.
Let I1 , I2 , . . . , Ik be the partitions of G. Without loss of generality we can assume that pi = |Ii | (1 ≤ i ≤ k), p1 ≤ p2 ≤ . . . ≤ pk and Ii = {v_1^i , v_2^i , . . . , v_{pi}^i }. We construct the desired interval incidence coloring of G as follows:

c(v_j^i , {v_j^i , v_l^k }) = p1 + p2 + . . . + p_{k−1} + l        if k > i,
c(v_j^i , {v_j^i , v_l^k }) = n + p1 + p2 + . . . + p_{k−1} + l    otherwise.

We leave it to the reader to verify that the above formula defines an interval incidence coloring that uses 2n − max{pi + pj : i, j = 1, . . . , k ∧ i ≠ j} colors.
3 Polynomial Time Algorithms for Subcubic Bipartite Graphs
Cycles and paths were considered in Section 2.2, and there are only a few connected graphs with IIC ≤ 3. Let G be a subcubic bipartite graph with Δ(G) = 3. By Theorem 1, IIC(G) is between 4 and 6. In this section we construct algorithms for interval incidence coloring of subcubic bipartite graphs using 4, 5 and 6 colors.
3.1 Interval Incidence Coloring with 4 Colors
Suppose IIC(G) = 4. We prove that the graph G satisfies the following properties:
(i) each vertex v of degree 3 has at most one adjacent vertex of degree 3,
(ii) each vertex v of degree 3 has at least one adjacent vertex of degree 1,
(iii) on every path between two non-adjacent vertices of degree 3 there are at least two vertices of degree 2.
Property (i) follows from the fact that in {1, 2, 3, 4} one can find only two intervals of length 3: {1, 2, 3} and {2, 3, 4}, and in each of them there is only one element not belonging to the other one. To prove property (ii) observe that if Ac (v) = {1, 2, 3} (Ac (v) = {2, 3, 4}), then the vertex u for which the incidence (v, {v, u}) is colored by 3 (by 2, respectively) must be a leaf. Suppose, contrary to property (iii), that on the path between 3-degree vertices v1 , v2 there is only one vertex u with degree 2. Assume that c(u, {u, v1 }) = 1, c(u, {u, v2 }) = 2. This clearly implies 2 ∉ Ac (v2 ) and max Ac (v2 ) ≥ 5, a contradiction. The same reasoning applies to the cases where Ac (u) = {2, 3} or Ac (u) = {3, 4}, so property (iii) holds. We propose the following algorithm using 4 colors for interval incidence coloring of a graph G which satisfies properties (i)−(iii). First, color the vertices of G using two colors a and b, which represent intervals containing 1 and 4, respectively. Next, for every 3-degree vertex v colored with a (or b, respectively), let
• c(v, {v, u}) = 1 (or 4 respectively), if deg(u) = 3,
• c(v, {v, u}) = 2 (or 3 respectively), if deg(u) = 2, or if deg(u) = 1 and there is no u′ ∈ N (v) such that deg(u′) = 2,
• c(v, {v, u}) = 3 (or 2 respectively), if deg(u) = 1.
For every 2-degree vertex v colored by a, which represents an interval {1, 2} (or b, respectively, which represents an interval {3, 4}), we have to assign c(v, {v, u}) = 1 (respectively 4) if deg(u) = 3. Finally, we color all pendant incidences (i.e. at leaves). By the above, the problem of interval incidence coloring of a graph using 4 colors is equivalent to verifying properties (i)−(iii), hence we get
Theorem 4. The problem of interval incidence coloring with 4 colors can be solved for subcubic bipartite graphs in linear time.
3.2 Interval Incidence Coloring with 5 Colors
For a given subcubic bipartite graph G, if IIC(G) ≤ 5 then the following algorithm solves the interval incidence coloring problem using 5 colors, otherwise it returns FALSE:
(i) Color the vertices of G using a and b.
(ii) For every v ∈ V (G) color the incidences (v, {v, u}), for all u ∈ N (v), as follows:
• if deg(v) = 3 and v is colored with a (b, respectively), then let c(v, {v, u}) ∈ {1, 2, 3} (c(v, {v, u}) ∈ {3, 4, 5}, respectively) in such a way that if deg(u) = 3 then c(v, {v, u}) ≠ 3. If for every u ∈ N (v) deg(u) = 3, then return FALSE,
• if deg(v) = 2 and v is colored by a (b, respectively), then let c(v, {v, u}) ∈ {1, 2} (c(v, {v, u}) ∈ {4, 5}, respectively) in any way,
• if deg(v) = 1 and v is colored by a (b, respectively), then let c(v, {v, u}) = 1 (c(v, {v, u}) = 5, respectively).
Fig. 2. Interval incidence coloring of subcubic bipartite graphs with 5 colors
Consider any v ∈ V (G): if deg(v) ≤ 2, then because G is bipartite, we have Ac (v) ∩ Ac (u) = ∅ for any u ∈ N (v). The same observation holds if deg(v) = 3 and there exists u ∈ N (v) such that deg(u) ≠ 3, as shown in Figure 2. If for all u ∈ N (v), deg(u) = 3, then there is no interval incidence coloring of G with 5
colors and the algorithm returns FALSE. In the other case we get a proper incidence coloring c of G with 5 colors. It is easy to see that c is an interval coloring, hence the following holds.
Theorem 5. The problem of interval incidence coloring with 5 colors can be solved for subcubic bipartite graphs in linear time.
3.3 Interval Incidence Coloring with 6 Colors
Suppose G is a subcubic bipartite graph such that IIC(G) > 5. Then we use the same algorithm as in the proof of Theorem 1 for the upper bound, which is exactly 6 in this case. Hence, combining all the algorithms, we have
Theorem 6. Finding a minimal interval incidence coloring of a subcubic bipartite graph can be done in linear time.
4 Polynomial Time Dynamic Algorithm for Trees
In this section one can find a polynomial-time algorithm for interval incidence coloring of trees. Given a tree T , we take any leaf r as the root of T and direct the tree T from the leaves different from r to the root r. This means that we direct every edge e ∈ E(T ) in such a way that for every v ∈ V (T ), v ≠ r, there is exactly one edge leaving v. We color the incidences of T using the bottom-up technique in accordance with the defined direction. For every edge e = (v, u) ∈ E(T ) we build a matrix A of size 2Δ × 2Δ in such a way that A(i, j) is the maximum of the colors already used on incidences in some minimal coloring of the subtree Tv with root v, assuming that the incidence (v, e) is colored by i and the incidence (u, e) is colored by j. The main idea of the algorithm for finding IIC(T ) is as follows:
(i) For each edge e ∈ E(T ), beginning from the bottom, find its matrix A.
(ii) Return IIC(T ) = min_{i,j=1,2,...,2Δ} A(i, j), where A is the matrix for r.
Constructing the corresponding interval incidence coloring of T with IIC(T ) colors is possible by using additional structures for remembering the chosen pair of colors for adjacent incidences while building each matrix A.
4.1 Constructing a Matrix A
We will assume that A(i, i) = ∞ for every i = 1, 2, . . . , 2Δ. Let us consider an edge e = (v, u) directed from v to u. If v is a leaf, then A(i, j) = max{i, j} for i ≠ j. In the other case, let v1 , v2 , . . . , vp be the neighbours of v different from u, p = deg(v) − 1. According to our assumption, there is a matrix Ai already built for every edge (vi , v), i = 1, . . . , p. We use the following algorithm for constructing the matrix A for e = (v, u):
Algorithm MATRIX(e = (v, u))
  for each pair of colors of incidences (v, e), (u, e), denoted by a, b, a ≠ b
    for each k = 0, 1, . . . , deg(v) − 1
      let mk = ∞
      let Ik = {a − deg(v) + 1 + k, . . . , a + k}
      if b ∉ Ik then
        construct a bipartite graph G′ with partitions V1 , V2 and with edge weights as follows:
          V1 = {v, v1 , v2 , . . . , vp }, V2 = Ik
          E(G′) = {{vi , j} : i = 1, . . . , p ∧ j ∈ Ik } ∪ {{v, a}}
          w : E(G′) → N
          w({vi , j}) = min_{l ∈ {1,...,2Δ}\Ik} Ai (l, j)
          w({v, a}) = max{a, b}
        remember mk = MATCHING(G′)
    end for
    A(a, b) = min_k mk
  end for

The procedure MATCHING is used for finding a perfect matching M of G′ that minimizes the maximum of w(e), e ∈ E(M ).

Algorithm MATCHING(G)
  for each i = 1, 2, . . . , 2Δ
    G′ = G \ {e ∈ E(G) : w(e) > i}
    if there is a perfect matching of G′ then return i
  end for

To speed up the algorithm MATCHING, we can use the bisection method instead of the loop over i = 1, 2, . . . , 2Δ. Using the Hopcroft–Karp algorithm for finding a perfect matching in a graph G we obtain a time complexity of MATCHING estimated by O(log Δ · Δ^{5/2}). The time complexity of the MATRIX algorithm for a given edge e ∈ E(T ) is estimated by O(log Δ · √Δ · Δ^5 ). Using the MATRIX algorithm O(n) times leads to the resulting algorithm for finding IIC(T ) with time complexity estimated by O(nΔ^{11/2} log Δ).
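The MATCHING procedure amounts to a bottleneck perfect matching: raise the threshold, keep only the edges whose weight does not exceed it, and test for a perfect matching. The sketch below illustrates this with a plain augmenting-path matcher instead of Hopcroft–Karp, so it is slower than the bound quoted above but follows the same scheme; the names and the example data are ours.

def has_perfect_matching(left, right, edges):
    """Kuhn's augmenting-path test: True iff every vertex of `left` can be matched
    (with |left| == |right| this is a perfect matching)."""
    adj = {u: [v for (a, v) in edges if a == u] for u in left}
    match = {}                                    # right vertex -> matched left vertex

    def try_augment(u, seen):
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                if v not in match or try_augment(match[v], seen):
                    match[v] = u
                    return True
        return False

    return all(try_augment(u, set()) for u in left)

def matching_threshold(left, right, weighted_edges):
    """MATCHING(G) as described above: the smallest threshold i such that the
    subgraph of edges with weight <= i has a perfect matching (None if no
    threshold works). Bisection over the sorted weights, or Hopcroft-Karp in
    place of Kuhn's algorithm, would speed this up as noted in the text."""
    for i in sorted({w for (_, _, w) in weighted_edges}):
        kept = [(u, v) for (u, v, w) in weighted_edges if w <= i]
        if has_perfect_matching(left, right, kept):
            return i
    return None

# Hypothetical usage: match vertices {'v', 'v1'} to colours {1, 2}
print(matching_threshold({'v', 'v1'}, {1, 2},
                         [('v', 1, 4), ('v', 2, 7), ('v1', 1, 3), ('v1', 2, 5)]))  # -> 5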
5 Final Remarks
Theorem 7. Interval incidence coloring of planar graphs of degree 4 is NP-complete. The proof of this theorem can be found in [19]. An interesting question is the complexity of interval incidence coloring of bipartite graphs.
References
1. Algor, I., Alon, N.: The star arboricity of graph. Discrete Mathematics 75, 11–22 (1989)
2. Alon, N., McDiarmid, C., Reed, B.: Star arboricity. Combinatorica 12, 375–380 (1992)
3. Amini, O., Havet, F., Huc, F., Thomasse, S.: WDM and directed star arboricity. In: INRIA, France, pp. 1–20 (2007)
4. Asratian, A., Kamalian, R.: Investigation on interval edge-colorings of graphs. J. Combin. Theory. Ser. B 62, 34–43 (1994)
5. Beaquier, B., Bermond, J.C., Gargano, L., Hell, P., Perennes, S., Vacarro, U.: Graph problems arising from wavelength-routing in all-optical networks. In: Proc. of WOCS 1997. IEEE, Los Alamitos (1997)
6. Brandt, R., Gonzalez, T.F.: Multicasting using WDM in Multifiber Optical Star Networks. In: Proc. of the 15th IASTED International Conference on Parallel and Distributed Computing and Systems PDCS 2003, Canada, pp. 56–61 (2003)
7. Brandt, R., Gonzalez, T.F.: Wavelength assignment in multifiber optical star networks under the multicasting communication mode. Journal of Interconnection Networks 6, 383–405 (2005)
8. Brualdi, R.A., Massey, J.Q.: Incidence and strong edge colorings of graphs. Discrete Mathematics 122, 51–58 (1993)
9. Chen, D.L., Liu, X.K., Wang, S.D.: The incidence chromatic number and the incidence coloring conjecture of graphs. Math. Econom. 15(3), 47–51 (1998)
10. Chen, D.L., Pang, S.C., Wang, S.D.: The incidence coloring number of Halin graphs and outerplanar graphs. Discrete Mathematics 256, 397–405 (2002)
11. Dolama, M.H., Sopena, E., Zhu, X.: Incidence coloring of k-degenerated graphs. Discrete Mathematics 283, 121–128 (2004)
12. Giaro, K.: Interval edge-coloring. In: Kubale, M. (ed.) Graph Colorings, Contemporary Mathematics, pp. 105–121. AMS (2004)
13. Giaro, K., Kubale, M., Małafiejski, M.: Compact scheduling in open shop with zero-one time operations. INFOR 37, 37–47 (1999)
14. Giaro, K., Kubale, M., Małafiejski, M.: Consecutive colorings of the edges of general graphs. Discrete Mathematics 236, 131–143 (2001)
15. Guiduli, B.: On incidence coloring and star arboricity of graphs. Discrete Mathematics 163, 275–278 (1997)
16. Janczewski, R., Małafiejski, M.: Incidence coloring of multitrees. Zesz. Nauk. Pol. Gd. 13, 465–472 (2007) (in Polish)
17. Janczewski, R., Małafiejska, A., Małafiejski, M.: Interval incidence coloring of graphs. Zesz. Nauk. Pol. Gd. 13, 481–488 (2007) (in Polish)
18. Janczewski, R., Małafiejska, A., Małafiejski, M.: Interval incidence coloring of graphs. Rap. Tech. WETI 12, 1–16 (2008) (in Polish)
19. Janczewski, R., Małafiejska, A., Małafiejski, M.: The complexity of interval incidence coloring of graphs. Rap. Tech. WETI 16, 1–17 (2008) (in Polish)
20. Li, X., Tu, J.: NP-completeness of 4-incidence colorability of semi-cubic graphs. Discrete Mathematics 308, 1334–1340 (2008)
21. Małafiejska, A.: Interval coloring of graphs. MSc Thesis, University of Gdańsk, pp. 1–70 (2008)
22. Modiano, E., Lin, P.J.: Traffic grooming in WDM networks. IEEE Communications Magazine 39(7), 124–129 (2001)
23. Shiu, W.C., Lam, P.C.B., Chen, D.-L.: Note on incidence coloring for some cubic graphs. Discrete Mathematics 252, 259–266 (2002)
Graphs Partitioning: An Optimal MIMD Queueless Routing for BPC-Permutations on Hypercubes

Jean-Pierre Jung and Ibrahima Sakho

Université de Metz - UFR MIM - Dpt. d’Informatique
Ile du Saulcy BP 80794 - 57012 Metz Cedex 01, France
{jpjung,sakho}@univ-metz.fr
Abstract. Bit-Permute-Complement permutations (BPC) constitute the subclass of particular permutations which have gained the most attention in the search for optimal routing of permutations on hypercubes. The reason for this attention comes from the fact that they cover permutations used in general-purpose computing like matrix transposing, vector reversal, bit shuffling and perfect shuffling. In this paper we revisit the problem of optimal routing of BPC on hypercubes under the MIMD queueless communication model through a new paradigm which takes advantage of their topology: the so-called graphs partitioning. We prove that BPC are partitionable in any dimension of the hypercube and that the resulting permutations are also BPC. It follows that any BPC on an n-dimensional hypercube is routable in at most n steps of data exchanges, each one realizing the partition of the hypercube. Keywords: Interconnection network, hypercube, BPC permutations, MIMD queueless routing, perfect matching of bipartite graph, graph partitioning.
1 Introduction
The processor interconnection network (IN) is the heart of message-passing parallel computers. Indeed, the performance of such computers depends greatly on the performance of their IN, whose essential criteria are scalability for massive parallelism, the capability of deadlock-free routing on shortest paths, the capability of simulating other IN and the ease of management. In the search for IN which fulfil these criteria, hypercubes constitute a very attractive alternative. The incremental construction of hypercubes confers on them interesting mathematical properties which allow meeting most of the performance criteria. Contrary to most IN, they allow an exponential increase of the number of nodes while the diameter increases only linearly. They admit implicit routing functions which alleviate the management effort by avoiding the computation of huge routing tables and which route on shortest paths. Finally, most of the popular IN (rings, 2D wrap-around grids, trees) can be almost isometrically embedded in hypercubes, so guaranteeing their simulation capability.
For all these reasons, several commercial parallel machines have been built over the years and several theoretical and practical research works have also been done on different aspects of their use as IN. Among the theoretical research, one of the most challenging problems is the optimal rearrangeability of hypercubes, that is, their capability for optimally routing any permutation such that the routes of each routing step constitute a one-to-one correspondence of adjacent nodes. BPC constitute the subclass of particular permutations which have gained the most attention in the search for optimal permutation routing in hypercubes because they cover permutations used in general-purpose computing like matrix transposing, vector reversal, bit shuffling and perfect shuffling. In this paper we address the problem of the optimal routing of BPC. We revisit the problem through a new paradigm, the so-called graphs partitioning, to take advantage of the recursive structure of the hypercube topology. The remainder of the paper is organised in five sections. Section 2 gives the problem formulation and some basic definitions relative to hypercubes, permutations and routing. Section 3 presents the related works. Section 4 introduces the mathematical foundation used to develop the proposed routing algorithm. Section 5 characterizes partitionable BPC and proposes a routing algorithm. Section 6 concludes the paper and presents some perspectives.
2 Problem Formulation
2.1 Definitions
Definition 1. An n-dimensional hypercube, nD-hypercube for short, is a graph H(n) = (V, E) where V is a set of 2^n nodes u = (un−1 , un−2 , . . . , u0 ) such that ui ∈ {0, 1} and E is the set of arcs (u, v) ∈ V × V such that there is exactly one dimension i for which ui ≠ vi . It is well known that for n > 0 an nD-hypercube is obtained by interconnecting two (n−1)D-hypercubes in any dimension 0 ≤ i ≤ n−1. So any hypercube H(n) can be viewed as any one of the n couples (H(n)_{0,i} , H(n)_{1,i} ) of (n−1)D-hypercubes obtained by restricting H(n) to its nodes u such that ui = 0 (resp. 1). Figure 1 illustrates such a view.
Fig. 1. 4D-hypercube viewed as the interconnection of two 3D-hypercubes in the dimension 3
Definition 2. A permutation on an nD-hypercube H(n) is an automorphism on V ; that is, a one-to-one application π which associates to each node u = (un−1 , un−2 , . . ., u0 ) of H(n) one and only one node π(u) = (πn−1 (u), πn−2 (u), . . ., π0 (u)). It is represented by the sequence π = (π(u); u = 0, 1, . . ., 2^n − 1). Definition 3. A permutation π on an nD-hypercube is a BPC if for any i, there is j such that for any u, πi (u) = uj or πi (u) = ūj , the complement of uj .
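Definition 3 can be made concrete with a small sketch: a BPC is described by a bit permutation σ together with a complement mask, and applying it to a node address permutes (and possibly complements) the address bits. The helper below and the bit-reversal example are ours, used only to illustrate the definition.

def make_bpc(sigma, complement):
    """Definition 3 as code (sketch): bit i of pi(u) is bit sigma[i] of u,
    possibly complemented. sigma is a permutation of {0,...,n-1} and
    complement[i] says whether that bit is negated."""
    n = len(sigma)
    def pi(u):
        out = 0
        for i in range(n):
            bit = (u >> sigma[i]) & 1
            if complement[i]:
                bit ^= 1
            out |= bit << i
        return out
    return pi

# Example on the 4D-hypercube: bit reversal (a BPC with no complemented bits)
n = 4
rev = make_bpc(sigma=[3, 2, 1, 0], complement=[False] * 4)
perm = [rev(u) for u in range(2 ** n)]
assert sorted(perm) == list(range(2 ** n))       # it is indeed a permutation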
MIMD Queueless Routing
Let π be a permutation on an nD-hypercube network with bidirectional links. Let’s consider a set of 2n messages of the same size each one located on one node u and destined for the node π(u). Routing the permutation π on the hypercube under MIMD queueless communication model consists in conveying all the messages to their destination such that each node holds one and only one message. Clearly, such a routing consists in a sequence of global and synchronous exchanges of messages between neighbourhood nodes such that no more than one message is located at each node after each exchange step.
3
Related Works
Optimal routing of permutations on nD-hypercubes is one of the most challenging open problems in the theory of IN. For an arbitrary permutation, it is well known, from e-cube routing [1], that such a routing requires at least n exchange steps. To specify its complexity, it has been extensively studied under several communication models and routing paradigms. Szimansky in [2] considers the offline routing in circuit-switched and packet switched commutation models under all-port MIMD communication model. Under the circuit-switched hypothesis he proves that, for n ≤ 3, any hypercube is rearrangeable. He also conjectured that routing can be made on the shortest paths, conjecture for which a counterexample has been given by Lubiw in [3]. Under packet-switched hypothesis he also shows that routing can be made in 2n − 1 steps, result which has been then improved to 2n − 3 by Shen et al in [4] in assuming that each link is used at most twice. Under the single port MIMD communication model, Zhang in [5] proposes a routing in O(n) steps on a spanning tree of the hypercube. While the above results considered offline routing and clearly established communication models, others works like the ones by Hwang et al [6,7] considered online oblivious routing under buffered all port MIMD communication models. They prove that in using local information, n steps routing is possible for n ≤ 12. The better routings under the previous models are due to V¨ockling. In [8], he √ proves on one side that deterministic offline routing can be done in n + O( n log n) steps on the other side that online oblivious randomized routing can be made in n + O(n/ log n) steps but in buffered all port MIMD model.
24
J.-P. Jung and I. Sakho
For the more restrictive single-port, MIMD queueless communication model the works can be classified in routing of arbitrary and particular permutations among which the BPC. For arbitrary permutations, the personal communication of Coperman to Ramras and the works of Ramras [9] constitute certainly the leading ones. Indeed, while Coperman gives the computational proof that any arbitrary permutation in the 3D-hypercube can be routed in 3 steps. Ramras proves that if a permutation can be routed in r steps in rD-hypercubes, then for n ≥ r arbitrary permutations can be routed in 2n − r steps. Thus, it follows that arbitrary permutations can be routed in 2n − 3 steps improving so the 2n − 1 routing steps of Gu and Tamaki [10]. Recently, Laing and Krumme in [11] have introduced an approach based on the concept of k-separability that is the possibility to partition a permutation after k routing steps into 2k permutations on disjoints (n − k)D-hypercubes. For BPC, Nassimi and Sahni in [12] prove that under SIMD communication models BPC can be routed in n steps. Johnson and Ho present algorithms for matrix transposition in all-port MIMD [13] and all-to-all personalized [14] communication models. Recently, M´elatiaga and Tchuent´e have developed an n-steps parallel routing algorithm.
4
Mathematical Foundation
Hypercubes partitioning is the foundation of our routing algorithm. It is based on the computation of bipartite graphs perfect matching. 4.1
Definitions
Definition 4. A bipartite graph is a graph whose the nodes set V = V1 ∪ V2 and the edges set E is constituted of edges (u, v) such that u ∈ V1 and v ∈ V2 . Definition 5. The adjacency matrix of a bipartite graph is the |V1 |×|V2 | matrix M whose each component M [u, v] = 1 (resp. 0) if (u, v) ∈ (resp. ∈) / E. Let’s observe that, as an nD-hypercube can be viewed as the interconnection of two (n−1)D-hypercubes in anyone of its n dimensions, its adjacency matrix M is anyone of the n 2 × 2 block matrix M (i) whose block M (i) [x, x] (resp. M (i) [x, x]) (n) is the adjacency matrix of the (n − 1)D-hypercube Hi,x (resp. interconnection (n)
(n)
graph of the nodes of Hi,x to the ones of Hi,x ). Table 1 illustrates such a matrix. Definition 6. A matching of a bipartite graph G is an one-to-one correspondence Γ which associates to each node u of a subset of V1 a node Γ (u) of V2 such that the edges (u, Γ (u)) belong to E and are two-by-two non adjacent. A matching Γ is said to be maximum (resp. perfect, maximal) if its cardinal |Γ | is maximum (resp. = |V1 | = |V2 |, maximal). The main result about the computation of maximum and perfect matching is due to C. Berge [15].
Optimal Queueless Routing for BPC on Hypercubes
25
Table 1. 4D-hypercube adjacency matrix view according to dimension i = 2 2 0 1 2 3 8 9 10 11 4 5 6 7 12 13 14 15
0 1 1 1
1 1 1 1
2 1 1 1
3
8 1
1 1 1
1 1
10 11
4 1
1
5
6
1
1
12 13 14 15
1 1
1 1
7
1 1
1 1 1
1
9
1
1 1 1
1 1 1 1
1
1 1 1 1 1 1
1 1 1
1 1 1
1
1 1 1
1 1 1 1
1 1
1 1
1 1 1 1 1
1 1
1
1
1 1 1
1 1 1
1 1 1
Theorem 1 (of C. Berge). A matching Γ of a bipartite graph G is maximum if and only if there is no path with extremities nodes not saturated by Γ which alternates edges belonging and edges not belonging to Γ . The implementations of this theorem have lead to several algorithms. We will use the Neiman algorithm [16] which proceeds by distinguishing one and only one “1” by row and by column in the adjacency matrix of the bipartite graph. In the sequel, the non distinguished “1” will be double crossed. 4.2
Characterization of Partitionable Permutations
Given a permutation π on H (n) , for x = 0, 1 and 0 ≤ i ≤ n − 1, let n min be its minimal number of routing steps, Gx,i the bipartite graph (Sx,i , Dx,i , E), where Sx,i = {u ∈ H (n) : πi (u) = x} and Dx,i = {u ∈ H (n) : ui = x} and Γ x,i a maximum matching of Gx,i . Definition 7. A permutation π on H (n) is partitionable in a dimension i if there is a permutation Γ = (Γ 0,i , Γ 1,i ) on H (n) routable in one step and which leads to the distinct permutation α (resp. β) on a distinct (n − 1)D-hypercube, routable in n min − 1 steps and which associates π(u) to each Γ (u) such that Γi (u) = 0 (resp. 1). Actually, the partionability of a permutation on H (n) is a guaranty that it can be decomposed, after at most one exchange step, in two independent permutations on two disjoint (n − 1)D hypercubes. Now let’s characterize the partitionable permutations. Lemma 1. A necessary and sufficient condition for a permutation on an nDhypercube to be partitionable is that there is a dimension i such that for any x, the bipartite graph Gx,i = (Sx,i , Dx,i , E) admits a perfect matching.
26
J.-P. Jung and I. Sakho
Proof. It is straightforward. It comes from Definition 4.
Lemma 2. A necessary and sufficient condition for the bipartite graph Gx,i = (Sx,i , Dx,i , E) to admit no perfect matching is that there is v ∈ Dx,i which cannot match with any node u ∈ Sx,i . Proof. The condition is obviously sufficient. To prove that it is also necessary it suffices to prove that the adjacency matrix, say N (i,x) , of Gx,i has at least one null column. To do this let’s first observe that this adjacency matrix is a mix of rows of M (i) [x, x] and M (i) [x, x]. Then by suited rows and columns permutations it can be rewritten in the form of table 2a where m = 2n−1 , A and B, I and O are respectively, non null, identity and null matrices. Secondly let’s observe that, by construction: (a) there is a perfect matching between the rows of I and the columns of B, (b) there is a matching of the rows of A with the columns of A and B which saturates the rows of A, (c) any perfect matching of Gx,i contains the perfect matching between the rows of I and the columns of B. Now let’s suppose that Gx,i does admit no perfect matching and consider any of its maximum matching. From the above properties, necessarily the rows and the columns of A are not perfectly matched, but they are maximally matched. Then from the Neiman algorithm, by permuting the rows and the columns of A, it can be decomposed as shown in table 2b where B’ results from a permutation of the rows of B, A1 contains in its diagonal a maximal matching of the rows and the columns of A, while A2 and A3 are non null matrices which do contain no matching. From the theorem of Berge, there is no alternated path whose initial extremity, a row of A3 , and the terminal extremity, a column of A2 are not saturated. In others words any attempt of building an alternated path from a row of A3 to a column of A2 whose extremities are not saturated will fail while, according to (b) it should be a success. Then necessarily A2 is null and N (i,x) contains at least one null column. Table 2. Block decomposition of N (i,x) according to a maximum matching a u
v1
…
vk
vk
vm
…
1
…
A
B
u uk+1 um
y1…
…yk
yk+1
…
ym
1
k
…
b
O
I
x … … xk xk+1 …
A1
A2
A3
O
B’
O
I
xm
Proposition 1. A necessary and sufficient condition for a permutation on a nD-hypercube to be partitionable is that there is a dimension i such that for any x, the adjacency matrix N (i,x) of the bipartite graph Gx,i does contain no null column. Proof. It is straightforward. It comes from the two previous lemmas.
Optimal Queueless Routing for BPC on Hypercubes
5
27
Optimal Routing of BPC in nD-Hypercubes
In this section we deal with the declination of the above characterization of partitionable permutations for BPC. 5.1
Characterization of Partitionable BPC
Lemma 3. If π is a BPC on H (n) then for any dimension i, Sx,i = Dx,j (resp. Dx,j ) if πi (u) = uj (resp. uj ) for any u ∈ H (n) . Proof. By definition, for any i, there is a dimension j such that πi (u) = uj or πi (u) = uj for any u. As Sx,i = {u ∈ H (n) : πi (u) = x}, if πi (u) = uj then Sx,i = {u ∈ H (n) : uj = x} = Dx,j and if πi (u) = uj then Sx,i = {u ∈ H (n) : uj = x} = {u ∈ H (n) : uj = x} = Dx,j . Proposition 2. Any BPC on H (n) is partitionable in any dimension. Proof. According to proposition 1, we have to prove that for any dimension i, the adjacency matrices N (i,x) contains no null column. Let v be a component of Dx,i . By definition, v = (vn−1 , . . . , vi+1 , x, vi−1 , . . ., vj+1 , vj , vj−1 , . . ., v0 ). According to Lemma 3, we distinguish two cases. Case 1. Sx,i = Dx,j . As Dx,j = {u ∈ H (n) : u = (un−1 , . . . , ui+1 , ui , ui−1 , . . ., uj+1 , x, uj−1 , . . ., u0 )}, then u = (vn−1 , . . . , vi+1 , x, vi−1 , . . ., vj+1 , x, vj−1 , . . ., v0 ) ∈ Dx,i . If vj = x then u = v and N (i,x) [u, v] = 1. If vj = x then u and v differ on only one bit, so they are adjacent nodes and N (i,x) [u, v] = 1. Case 2. Sx,i = Dx,j . In this case too, by a similar reasoning we prove that N (i,x) [u, v] = 1. 5.2
Partitioning Strategy of BPC
In order to determine a partitioning strategy, let’s examine the structure of N (i,x) . Before, let’s remark that for given i and x, N (i,x) is constituted of the adjacency matrices of the bipartite graphs ((Sx,i ∩ Dy,i , Dx,i , E), y = x, x). Let π be a BPC on H (n) , i and j such that πi (u) = uj or πi (u) = uj for any u and k = j (resp. n − 1, j − 1) if j < (resp. =, >)i. We have the following results. Lemma 4. Sx,i ∩ Dy,i is the sequence of the odd (resp. even) rank subsequences if πi (u) = uj (resp. uj ) of the longest sequence of disjoint 2k -terms subsequences of Dy,i in ascending (resp. descending) order if x = 0 (resp. 1) of their addresses. Proof. It comes from the fact that in the binary code table of the first 2n natural integers taken in the ascending order, the ith bit alternates 2i -length sequences of 0 and 1. Indeed, as Sx,i ∩Dy,i = Dz,j ∩Dy,i = {u ∈ H (n) : uj = z and ui = y}, from Lemma 3, the result follows in comparing the ith and j th bits of the binary code table.
28
J.-P. Jung and I. Sakho
Lemma 5. N (i,x) is the 2n−k × 2n−k block matrix of 2k × 2k blocks whose each block [I, J], for 0 ≤ I, J ≤ 2n−(k+1) − 1 is such that: N (i,x) [I, J] = M (i) [(2n−(k+1) −1)x+(−1)x 2I, (2n−(k+1) −1)x+(−1)x J] (resp. M (i) [(2n−(k+1) − 1)x + (−1)x (2I + 1), (2n−(k+1) − 1)x + (−1)xJ]) if πi (u) = uj (resp. πi (u) = uj ). Proof. It is straightforward. Indeed it is a corollary of Lemma 4. Table 3 illustrates from table 1 such adjacency matrices. Table 3. N (2,x) matrices for x = 0 and 1 of a BPC π on a 4D-hypercube such that π2 (u) = u1
0 1 8 9 4 5 12 13
0 1 1 1
1 1 1 1
2 1
3
8 1
1 1 1
9 1 1 1
10 11
1 1
1 1 1 1
Proposition 3. Let Γ be a partition ⎧ ⎨u Γ (u) = u ⊕ 2j ⎩ u ⊕ 2i
15 14 7 6 11 10 3 2
15 14 13 12 1 1 1 1 1 1 1 1 1 1
7 1 1 1
6
5
1 1 1
1
4
1
1 1
of π in dimension i. if (πi (u) = ui , u ∈ H (n) ) else if πi (u) = uj otherwise.
Proof. Let Γ be a perfect matching of Gx,i . We distinguish the following three cases. Case 1. πi (u) = ui , for any u ∈ H (n) . From Lemma 5, N (i,x) coincides with M (i) [x, x] and then admits the one-to-one correspondence (Γ (u) = u, u ∈ Dx,i ) as a perfect matching. Case 2. πi (u) = ui , for any u ∈ H (n) . From Lemma 5, N (i,x) coincides with M (i) [x, x] which is an identity matrix and then admits only the one-to-one correspondence (Γ (u) = u ⊕ 2i , u ∈ Dx,i ) as a perfect matching. Case 3. πi (u) = uj or πi (u) = uj , for any u ∈ H (n) . From Lemma 5, any Γ consists in a perfect matching of N (i,x) blocks belonging to M (i) [x, x] and M (i) [x, x]. Then from Case 2, (Γ (u) = u ⊕ 2i , u ∈ Dx,i ). According to Neiman algorithm, the only “1” residual components of N (i,x) are then the ones of the 2k × 2k identity matrix of M (i) [x, x]. As each of these residual “1” corresponds to the interconnection between the nodes whose addresses differ only on their j th bit, it follows that (Γ (u) = u ⊕ 2j , u ∈ Dx,i ). Γ is illustrated in Table 3 by the non-double-crossed “1”, the double-crossed ones being forbidden according to the Neiman algorithm.
Optimal Queueless Routing for BPC on Hypercubes
5.3
29
Optimal Routing of BPC
The previous subsection specified a way for the first step routing of a BPC on an nD-hypercube: partition the BPC in two distinct permutations α and β on two disjoint (n − 1)D-hypercubes. Now, we have to decide how to route each one of these permutations. In this order, let’s recall that α (resp. β) is the permutation which associates π(u) to node Γ (u) such that Γi (u) = 0 (resp. 1) the node. Proposition 4. Any permutation which results from a partition of a BPC on an nD-hypercube is also a BPC on an (n − 1)D-hypercube. Proof. We have to prove that for any λ ∈ {α, β}, ∀r, ∃s such that λr (Γ (u)) = Γs (u) or Γs (u) for any Γ (u). This is the case for r = i, the partition dimension. Thus let’s consider the permutation α and a dimension r = i. From the definition of α, ∀Γ (u) such that Γi (u) = 0, αr (Γ (u)) = πr (u). As π is a BPC, ∃s such that πr (u) = us or πr (u) = us . In others words, αr (Γ (u)) = us or αr (Γ (u)) = us . We then distinguish the following three cases. Case 1. πi (u) = ui . From Proposition 3, Γ (u) = u for any u. Then αr (Γ (u)) = us = Γs (u) or αr (Γ (u)) = us = Γs (u). Case 2. πi (u) = uj . From Proposition 3, Γ (u) = u ⊕ 2k , where k = i or j for any u. Then αr (Γ (u)) = (Γ (u) ⊕ 2k )s = Γs (u) (resp. Γs (u)) if s = i and s =j (resp. s = i or s = j) or αr (Γ (u)) = (Γ (u) ⊕ 2k )s = Γs (u) (resp. Γs (u)) if s = i and s = j (resp. s = i or s = j). Case 3. πi (u) = uj . The reasoning is similar to the Case 2 one. From the above study, routing a BPC π on an nD-hypercube under queueless communication model consists in recursive partitioning in two different BPC on two disjoint (n − 1)D-hypercubes . At each partition step, each node has either to conserve or to move its message in one of the dimensions i or j for which πi (u) = uj or πi (u) = uj .
6 Conclusion and Perspectives
This paper has addressed the problem of optimal MIMD queueless routing of BPC permutations on nD-hypercubes. The partitioning paradigm is the framework of the proposed routing algorithm: it recursively decomposes a permutation into independent permutations on disjoint hypercubes. We have proved that any BPC can be partitioned in any dimension of the hypercube and that the resulting permutations are also BPC. Hence any BPC can be routed in at most n data exchange steps. The resulting routing algorithm consists, for each message, of a simple comparison of two bits of its source and destination addresses. Clearly, the proposed routing algorithm looks like a self-routing one. Unfortunately, these bits must have the same weight for all the nodes u and therefore require a consensual choice. Thus one of our future works will concern BPC self-routing. Beyond BPC there are other subclasses of permutations which also receive great attention, for instance the Linearly Transformed and Complemented, Omega, and Inverse Omega permutations. We therefore also plan to test the applicability of the partitioning paradigm to these classes of permutations.
Acknowledgement. The conversion of this work to LATEX according to the LNCS format has been kindly realized with help of Frank Holzwarth, Springer DE.
References
1. Draper, J.T., Ghosh, J.: Multipath e-cube algorithms (MECA) for adaptive wormhole routing and broadcasting in k-ary n-cubes. In: Proceedings of the International Parallel Processing Symposium, pp. 407-410 (1992)
2. Szymanski, T.: On the permutation capability of a circuit switched hypercube. In: Proceedings of the 1989 International Conference on Parallel Processing, pp. I-103-I-110 (1989)
3. Lubiw, A.: Counterexample to a conjecture of Szymanski on hypercube routing. Information Processing Letters 35, 57-61 (1990)
4. Shen, X., Hu, Q., Liang, W.: Realization of arbitrary permutations on a hypercube. Information Processing Letters 51(5), 237-243 (1994)
5. Zhang, L.: Optimal bounds for matching routing on trees. In: Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, pp. 445-453 (1997)
6. Hwang, F., Yao, Y., Grammatikakis, M.: A d-move local permutation routing for d-cube. Discrete Applied Mathematics 72, 199-207 (1997)
7. Hwang, F., Yao, Y., Dasgupta, B.: Some permutation routing algorithms for low dimensional hypercubes. Theoretical Computer Science 270, 111-124 (2002)
8. Vöcking, B.: Almost optimal permutation routing on hypercubes. In: Proceedings of the 33rd Annual ACM Symposium on Theory of Computing, pp. 530-539. ACM Press, New York (2001)
9. Ramras, M.: Routing permutations on a graph. Networks 23, 391-398 (1993)
10. Gu, Q.-P., Tamaki, H.: Routing a permutation in the hypercube by two sets of edge disjoint paths. J. of Parallel and Distr. Computing 44, 147-152 (1997)
11. Laing, A.K., Krumme, D.W.: Optimal Permutation Routing for Low-dimensional Hypercubes. EECS, Tufts University (July 11, 2003)
12. Nassimi, D., Sahni, S.: Optimal BPC permutations on a cube connected SIMD computer. IEEE Trans. Comput. C-31(4), 338-341 (1982)
13. Johnsson, S.L., Ho, C.T.: Algorithms for matrix transposition for boolean n-cube configured ensemble architectures. SIAM J. Matrix Appl. 9(3), 419-454 (1988)
14. Johnsson, S.L., Ho, C.T.: Optimal communication channel utilization for matrix transpose and related permutations on boolean cubes. Disc. Appl. Math. 53(1-3), 251-274 (1994)
15. Berge, C.: Graphes, 3ème édition. Dunod, Paris (1983)
16. Neiman, V.I.: Structures et commandes des réseaux sans blocage. Annales des Télécom (1969)
Probabilistic Packet Relaying in Wireless Mobile Ad Hoc Networks

Marcin Seredynski¹, Tomasz Ignac², and Pascal Bouvry²

¹ University of Luxembourg, Interdisciplinary Centre for Security, Reliability and Trust, 6, rue Coudenhove Kalergi, L-1359, Luxembourg, Luxembourg
[email protected]
² University of Luxembourg, Faculty of Sciences, Technology and Communication
[email protected]
Abstract. A wireless mobile ad hoc network consists of a number of devices that form a temporary network operating without the support of any fixed infrastructure. Its distributed nature, the lack of a single authority, and limited battery resources may make participating devices reluctant to forward packets, which is a key assumption of such a network. This paper demonstrates how cooperation can be reached by means of a new trust-based probabilistic packet relaying scheme. Its most significant properties are the relativity of the trust evaluation and the use of activity as one of the trust measures. Computational experiments demonstrated that the more selfish nodes were present in the network, the more reasonable it was for the remaining nodes to use strict strategies, which in turn led to a good protection of the network from free-riders. Keywords: Wireless ad hoc networks, trust-based cooperation.
1 Introduction
A correct operation of a wireless mobile ad hoc network (MANET) requires its participants to cooperate on packet relaying. Nodes reluctant to this duty are referred to as selfish nodes or free-riders. Most devices in a MANET rely on batteries, therefore the risk of selfish behaviour driven by the need to conserve battery is very high. As shown in the literature, a MANET composed of nodes that belong to different authorities can suffer from free-riding [1,2,3]. The problem of selfish behaviour can be solved if nodes relay packets only on behalf of cooperative nodes. This requires network participants to use a local trust or reputation system in order to collect information regarding the behaviour of other nodes. The main difference between the two systems is that in the former a node evaluates a subjective view of an entity's trustworthiness, while in the latter the view of the whole community is included [4]. Several packet relaying approaches have been proposed in the literature [5,6,7,8,9]. However, in these works it was typically assumed that all nodes follow the same deterministic relaying approach. Since in
civilian MANETs nodes belong to different authorities, such an assumption might be difficult to attain. In this paper a different approach is introduced: first a new probabilistic relaying scheme is defined, and next it is examined in various scenarios in which nodes choose relaying strategies that differ in their level of cooperation. Similarly to the approach introduced in [10], an activity trust measure is used as a key component to discourage nodes from strategically re-joining the network with a new identity (a whitewashing action [11]). A new approach to the trust system is introduced: network participants are not classified as selfish or cooperative by some fixed standards, but rather they are compared against each other, so that one can view the system as an evaluation of a subjective ranking of nodes. The next section introduces the new model of probabilistic packet relaying. This is followed by Section 3, which contains simulation results. The last section concludes the paper.
2 Probabilistic Relaying Model
Each time an intermediate node receives a routing request (RREQ), it checks the trustworthiness of the original sender of the message (the source node). The decision whether to relay or discard a packet originated by a node with a given trust is specified by a probabilistic relaying strategy.

2.1 Local Trust Evaluation System
In this work it is assumed that nodes use only a trust system. A source routing protocol is assumed, which means that the list of intermediate nodes is included in the packet's header. The information regarding the behaviour of other nodes (referred to later as trust data) is gathered only by nodes directly participating in the communication session. The communication session involves a sender, several forwarders (nodes that relay packets) and a destination node. The trust data collection is performed in the following way: nodes are equipped with a watchdog mechanism (WD) [2] that enables them to check whether a packet was successfully delivered to the destination. A node that asks another node to relay a packet verifies by means of a passive acknowledgment [12] whether that node has actually accepted the RREQ. As an example, let us assume that node n1 sends a message to node n4 via intermediate nodes n2 and n3, and eventually the message is discarded by node n3. This event is recorded by the WD mechanism of node n2, which next informs node n1 about the selfish behaviour of n3. As a result, the trust system of node n1 is updated with the two events "packet relayed by n2" and "packet discarded by n3", while the trust system of n2 is updated with the event "packet discarded by n3". A detailed analysis of the performance of the WD can be found in [2]. The trust data are classified into two types, referred to as personal and general trust data. The former considers the status of packets originated by the node itself, while in the latter the status of packets originated by other nodes is taken into account. In consequence, a node in a sender role collects personal trust data, while forwarders collect general trust data. The evaluation
of trust of node j (the source of the packet) by node i, asked to relay the packet, is based on two characteristics, namely the relative forwarding rate (denoted by fr_{j|i}) and the relative activity (denoted by ac_{j|i}). Later in the text, the relative forwarding rate and the relative activity will be referred to simply as forwarding rate and activity. The notation j|i means that node j is under the evaluation of node i. Both values result from WD-based observations made by node i during its participation in the network. Only personal trust data is considered; however, if such data are unavailable, then general trust data is taken into account. The forwarding rate is defined in terms of the ratio of packets relayed to packets discarded by a node, while the activity reflects the node's degree of participation in forwarding duties. It is assumed that ac_{j|i}, fr_{j|i} ∈ [0, 1]. In the next step node i evaluates the trust of node j, which is a number from the interval [0, 1]. Formally, the trust ev_{j|i} is defined as a function of forwarding rate and activity: ev_{j|i} = f_i(fr_{j|i}, ac_{j|i}), where f_i : [0, 1] × [0, 1] → [0, 1]. Nodes might have different criteria of trust evaluation, hence every function f is indexed with the index of the evaluating node. Node i observes two characteristics of j, namely req_acc_{j|i} and req_dsc_{j|i}, which are respectively the numbers of packets relayed and discarded by j. Then, the node computes the RREQ acceptance ratio rar_{j|i} of node j:

    rar_{j|i} = \frac{req_acc_{j|i}}{req_acc_{j|i} + req_dsc_{j|i}}.

The values fr_{j|i} and ac_{j|i} are functions of rar_{j|i} and req_acc_{j|i}, respectively. Obviously, such functions can be defined in many different ways. However, in this paper the following definitions are proposed:

    fr_{j|i} = \frac{1}{|O_i| - 1} \sum_{k \in O_i} [rar_{k|i} < rar_{j|i}],    (1)

    ac_{j|i} = \frac{1}{|O_i| - 1} \sum_{k \in O_i} [req_acc_{k|i} < req_acc_{j|i}],    (2)

where [Statement] = 1 if Statement is true and 0 if Statement is false. O_i denotes the set of all nodes observed by node i and |O_i| stands for its size. The forwarding rate defined by equation (1) is the fraction of nodes that have a lower forwarding rate than node j. A similar interpretation holds for the activity. Finally, ev_{j|i} is computed using the following formula:

    ev_{j|i} = w_1 · fr_{j|i} + w_2 · ac_{j|i},    (3)
where w_1, w_2 are weights summing to one. In this paper the assignment w_1 = w_2 = 1/2 was made. However, depending on certain factors one may want to promote nodes with a higher forwarding rate or with a higher activity by changing the values of w_1 and w_2. Based on histograms of rar_{j|i} and req_acc_{j|i}, it is possible to approximate the density functions of the distributions of the RREQ acceptance ratio and of the number of relayed packets (denoted by rar_i(x) and req_acc_i(x), respectively). Note that the function rar_i(x) is defined on the interval [0, 1], while req_acc_i(x) may be considered on the semi-axis [0, +∞). Now, equations (1) and (2) can be replaced by:

    fr_{j|i} = \int_0^{rar_{j|i}} rar_i(x) dx,    (4)

    ac_{j|i} = \int_0^{req_acc_{j|i}} req_acc_i(x) dx.    (5)

In large networks, it may be impossible to collect information about all nodes. In such a case, one can take advantage of a statistical approach: even if there is no information about node j, it is still possible to evaluate it in a reasonable way. A natural choice in this case is to compute the expected values of the forwarding rate and the activity:

    fr_{j|i} = \int_0^1 x · rar_i(x) dx,

    ac_{j|i} = \int_0^∞ x · req_acc_i(x) dx.
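The following sketch (ours, not the authors' code) evaluates Eqs. (1)-(3) on a toy set of watchdog observations; the representation of each observation as a (relayed, discarded) pair is an assumption made only for illustration.

```python
# Illustrative sketch (ours): relative forwarding rate, relative activity and
# trust of node j as seen by node i, following Eqs. (1)-(3).

def rar(acc, dsc):
    """RREQ acceptance ratio: packets relayed / (relayed + discarded)."""
    return acc / (acc + dsc) if acc + dsc else 0.0

def trust(obs, j, w1=0.5, w2=0.5):
    """ev_{j|i} of Eq. (3); obs[k] = (req_acc_{k|i}, req_dsc_{k|i}) for k in O_i."""
    rar_j, acc_j = rar(*obs[j]), obs[j][0]
    denom = len(obs) - 1
    fr = sum(rar(*v) < rar_j for v in obs.values()) / denom        # Eq. (1)
    ac = sum(acc < acc_j for acc, _ in obs.values()) / denom       # Eq. (2)
    return w1 * fr + w2 * ac                                       # Eq. (3)

# Example: node i has observed three nodes; node 'a' relays the most.
observations = {"a": (40, 5), "b": (10, 30), "c": (20, 20)}
print({k: round(trust(observations, k), 2) for k in observations})
```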
2.2 Decision-Making: Probabilistic Relaying Strategy
The next step, for a node that received a packet for relaying, is to decide whether to accept or reject the RREQ originated by a node with a given trust. The decision is based on the relaying strategy, which is a function of ev_{j|i} (the trust) and is denoted by s_i(ev_{j|i}). A probabilistic strategy is used, i.e., s_i(ev_{j|i}) is the probability that i will relay a packet originated by j. This implies that s_i : [0, 1] → [0, 1]. In the experiments presented in the next section simple strategies were used. The aim of these experiments was to demonstrate that the proposed approach efficiently enforces cooperation between nodes in a network. However, if one wants to build a stable, large network, there will be a need to use more complex strategies. The reason is that the choice of a strategy is motivated by many factors, such as the battery level, the number of RREQs, or the computational resources of the network devices. Moreover, a high variability of conditions in a MANET is assumed. Hence, nodes will be forced to adapt their strategies in a dynamic way
due to changes of the environment. Experimental results suggested that when choosing a strategy one must take into account the level of cooperation of other nodes. This is another parameter that should be incorporated into the process of strategy selection. The scope of our future work is to construct an AI module which will be responsible for strategy management. The dynamic environment of a MANET implies that one must be able to describe the evolution of a node's strategy as a stochastic process. Hence, it would be effective if all strategies of a node were members of the same family of probability distributions. A natural candidate is the beta distribution, which is defined on the interval [0, 1] and is characterized by two parameters, a and b. For such a distribution the density function is defined by the formula:

    d_{a,b}(x) = \frac{x^{a-1}(1 - x)^{b-1}}{\int_0^1 u^{a-1}(1 - u)^{b-1} du},   for x ∈ [0, 1].    (6)

The strategy is defined as the cumulative distribution function related to d_{a,b}(x), namely:

    s_i(ev_{j|i}) = \int_0^{ev_{j|i}} d_{a,b}(x) dx.    (7)

The parameters a and b determine the shape of s_i(x). The advantage of the beta distribution is that changing two parameters makes it possible to cover a large set of strategies differing in their level of cooperativeness. Such a definition needs the additional assumption that a strategy is a nondecreasing function. Let us present an example of a parameter that will strongly affect the choice of a strategy: the distribution of the RREQs sent to node i. We introduce an auxiliary function r_i(x) : [0, 1] → [0, +∞). The value r_i(x) is the average number of RREQs per time unit received from nodes with trust equal to x. The function r_i(x) is computed by node i from the past RREQs received by the node. It may be one of the key parameters used for selecting a strategy in a situation of limited power resources or channel capacity. For example, one can compute the expected number of packets relayed by i per unit of time (with a given strategy s_i(x)):

    K_i = \int_0^1 r_i(x) s_i(x) dx.    (8)

However, a choice based only on K_i may not be the optimal one. There may be a need to compute other parameters of r_i(x), for instance the standard deviation, before choosing a strategy. This problem will be considered in a forthcoming paper.
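The sketch below (ours) encodes a strategy as the beta CDF of Eqs. (6)-(7) and evaluates the expected relay load K_i of Eq. (8) by simple numerical quadrature; the particular (a, b) pairs and the request profile r(x) are assumptions chosen only for illustration.

```python
# Illustrative sketch (ours): beta-CDF strategies and the expected relay load.
import numpy as np
from scipy.special import betainc          # regularized incomplete beta = beta CDF

def strategy(ev, a, b):
    """s_i(ev) of Eq. (7): the Beta(a, b) CDF evaluated at the trust value ev."""
    return betainc(a, b, ev)

def expected_relayed(r, a, b, n=1000):
    """K_i of Eq. (8): integral of r(x) * s_i(x) over [0, 1] (midpoint rule)."""
    x = (np.arange(n) + 0.5) / n
    return float(np.mean(r(x) * strategy(x, a, b)))

r = lambda x: 10.0 * (1.0 - x)             # assumed RREQ-rate profile r_i(x)
for a, b, label in [(2.0, 5.0, "tolerant-ish"), (5.0, 2.0, "strict-ish")]:
    print(label, round(expected_relayed(r, a, b), 2))
```

A larger b relative to a pushes the CDF up quickly, so even low-trust senders are served (a more tolerant strategy); the opposite choice concentrates the relaying probability on high-trust senders.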
3 Computational Experiments

3.1 Model of the Network
The network was composed of a finite population of N nodes, while time was divided into R rounds. In each round every node initiated a single communication
36
M. Seredynski, T. Ignac, and P. Bouvry
session exactly once (acted as a sender), thus each round was composed of N communication sessions. Since the dynamics of a typical MANET, expressed in terms of mobility and connectivity of the nodes, are unpredictable [13], intermediate and destination nodes were chosen randomly. The trust data were also used by nodes for originating their own packets. Each network participant used a path rating mechanism in order to avoid distrusted nodes. If a source node had more than one path available to the destination, it would choose the one with the best rating. The rating was calculated as the arithmetic mean of the trust of all nodes belonging to the route (the trust of an unknown node was set to 0.3). The simulation of the network was defined by the following algorithm:
1. Specify i (source node) as i := 1, N as the number of nodes participating in the network and R as the number of rounds;
2. Randomly select node j (destination of the packet) and intermediate nodes, forming several possible paths from node i to j;
3. For each available path calculate the arithmetic mean of the trust of all nodes belonging to the path and choose the path with the best rating;
4. Let node i originate a packet (initiate a communication session), which is next processed by intermediate nodes according to their strategies;
5. As soon as the communication session is completed, update the trust data as described in Section 2.1;
6. If i < N, then choose the next node (i := i + 1) and go to step 2. Else go to step 7;
7. If r < R, then r := r + 1 and go to step 1 (next round). Else stop the simulation.
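As a compact illustration (ours, not the authors' simulator), one round of the loop above can be sketched as follows; relay_prob and trust_of are hypothetical stand-ins for the strategies of Sect. 2.2 and the trust evaluation of Sect. 2.1.

```python
# Illustrative sketch (ours) of one simulation round (steps 1-7 above).
import random

def run_round(nodes, obs, relay_prob, trust_of, rng=random.Random(0)):
    """obs[i][k] = [packets relayed, packets discarded] as observed by node i."""
    delivered = 0
    for src in nodes:                                   # step 4, every node sends once
        others = [v for v in nodes if v != src]
        dst = rng.choice(others)                        # step 2, random destination
        relays = [v for v in others if v != dst]
        paths = [rng.sample(relays, rng.randint(1, min(5, len(relays))))
                 for _ in range(rng.randint(1, 4))]
        path = max(paths, key=lambda p: sum(trust_of(src, v) for v in p) / len(p))  # step 3
        watchers = [src]                                # nodes that observe the next hop
        for hop in path:
            ok = rng.random() < relay_prob(hop, src)
            for w in watchers:                          # step 5, watchdog-based updates
                obs.setdefault(w, {}).setdefault(hop, [0, 0])[0 if ok else 1] += 1
            if not ok:
                break
            watchers.append(hop)
        else:
            delivered += 1
    return delivered
```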
3.2 Description of Experiments
A general form of the relaying strategy evaluated in the experiments was as follows:

    s(ev_j) = p1   for ev_j ∈ [0, 0.4],
              p2   for ev_j ∈ (0.4, 0.7],
              p3   for ev_j ∈ (0.7, 1],
              p4   for an unknown node in rounds 1-50,
              p5   for an unknown node in rounds 51-600.

Three types of strategies were defined: a tolerant strategy (T), a strict strategy (S), and a non-cooperative strategy (N). Their relaying probabilities are shown in Table 1. The non-cooperative strategy forwarded a small amount of packets in order to make it more difficult to discover. The T and S-type strategies were more cooperative towards unknown nodes in the initial period (probability p4 set to 1 between rounds 1 and 50), while after that period they became non-cooperative towards unknown nodes (probability p5 set to 0.1). The reason for the non-cooperative behaviour after a certain period was that it discouraged nodes from changing identity in order to take advantage of the cooperative approach towards unknown nodes. Four sets of scenarios differing in the number of nodes using selfish strategies (represented by the N strategy) were analyzed.
Table 1. Forwarding probabilities of the T, S and N-type strategies

          Tolerant strategy (T)   Strict strategy (S)   Non-cooperative strategy (N)
    p1    0.4                     0                     0
    p2    0.9                     0.7                   0.1
    p3    1                       1                     0.15
    p4    1                       1                     0.1
    p5    0.3                     0.1                   0
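For illustration (ours), the strategies of Table 1 combined with the piecewise form of s(ev_j) given above can be encoded as a simple lookup; the probabilities below follow our reading of the flattened table.

```python
# Illustrative sketch (ours): Table 1 probabilities as a lookup for s(ev_j).
PROBS = {                      #  p1    p2    p3    p4    p5
    "T": (0.4, 0.9, 1.0, 1.0, 0.3),
    "S": (0.0, 0.7, 1.0, 1.0, 0.1),
    "N": (0.0, 0.1, 0.15, 0.1, 0.0),
}

def relay_probability(kind, ev, known, round_no):
    """Relaying probability of a T-, S- or N-type node for a source with trust ev."""
    p1, p2, p3, p4, p5 = PROBS[kind]
    if not known:                       # source never observed before
        return p4 if round_no <= 50 else p5
    if ev <= 0.4:
        return p1
    if ev <= 0.7:
        return p2
    return p3

print(relay_probability("S", 0.85, known=True, round_no=120))   # -> 1.0
```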
In the first set (s1.1-s1.3) selfish strategies were not present at all, in the second set (s2.1-s2.3) 20% of the nodes used the N strategy, in the third (s3.1-s3.3) 40%, while in the last one (s4.1-s4.3) 60%. In each set three different distributions of the remaining T and S-type strategies were defined (s*.1 with a dominance of T strategies, s*.2 with an equal number of both strategies and s*.3 with a dominance of S strategies). Dominance was defined as a situation in which the dominating strategy was used by around 90% of the remaining nodes. The exact settings of the scenarios are shown in Table 2.

Table 2. Composition of strategies used by nodes in tested scenarios

          s1.1  s1.2  s1.3  s2.1  s2.2  s2.3  s3.1  s3.2  s3.3  s4.1  s4.2  s4.3
    #N    0     0     0     12    12    12    24    24    24    36    36    36
    #T    54    30    6     43    24    5     32    18    4     22    12    2
    #S    6     30    54    5     24    43    4     18    32    2     12    22
The following performance characteristics of nodes using a given type of strategy were used: a throughput (tp) defined as a ratio of successful message delivery, a number of packets relayed (npr) by a node and a relaying rate (rr) specifying a ratio of packets relayed to packets discarded. Each scenario was repeated 50 times. The values of performance characteristics of each type of strategy were calculated as an arithmetic mean over the values of all runs. The evaluation of trust (ev) was performed according to equations (1)-(3). The total number of all nodes (N) was equal to 60, while the number of rounds in the NTS (R) was set to 600. The path length was set from 1 up to 5 hops with the following probabilities: one hop - 0.1, two hops - 0.3 and three to five hops - 0.2. The number of available paths from a source to a given destination ranged from 1 up to 4 (each value with equal probability).

3.3 Results
The results of all four sets of scenarios are shown in Table 3. The average throughput ranged from 0.19 (s4.3) to 0.67 (s1.1), meaning that 19 to 67% of messages were successfully delivered without being re-sent. As one might expect, these figures were mainly related to the number of N strategies present in the network.
Table 3. Performance of T, S and N-type strategies in the four sets of scenarios: throughputs (tp), numbers of packets relayed (npr) and relaying rates (rr) of nodes using one type of strategy vs. nodes using other type of strategy

          overall  tp/npr    tp/npr    tp/npr   rr: T vs. N,  rr: T vs. T,  rr: S vs. T,
          tp       by T      by S      by N     S vs. N       T vs. S       S vs. S
    s1.1  0.67     0.69/873  0.54/623  -        -             0.84, 0.73    0.68, 0.51
    s1.2  0.56     0.64/821  0.48/580  -        -             0.86, 0.75    0.72, 0.54
    s1.3  0.45     0.62/758  0.43/516  -        -             0.88, 0.77    0.76, 0.58
    s2.1  0.5      0.58/779  0.47/550  0.25/35  0.42, 0.15    0.86, 0.75    0.71, 0.55
    s2.2  0.43     0.55/740  0.43/511  0.2/28   0.43, 0.16    0.89, 0.78    0.75, 0.57
    s2.3  0.35     0.52/691  0.38/459  0.14/16  0.43, 0.16    0.9, 0.81     0.79, 0.61
    s3.1  0.36     0.47/679  0.4/462   0.21/28  0.43, 0.15    0.9, 0.8      0.76, 0.59
    s3.2  0.31     0.44/650  0.36/425  0.18/24  0.43, 0.16    0.91, 0.82    0.78, 0.61
    s3.3  0.26     0.41/621  0.32/385  0.15/20  0.44, 0.16    0.91, 0.83    0.8, 0.63
    s4.1  0.25     0.35/581  0.31/366  0.19/21  0.45, 0.16    0.93, 0.86    0.81, 0.67
    s4.2  0.22     0.34/569  0.29/339  0.16/18  0.45, 0.16    0.93, 0.86    0.82, 0.67
    s4.3  0.19     0.33/558  0.27/318  0.13/15  0.45, 0.17    0.95, 0.87    0.84, 0.7
In the first set of experiments, where nodes did not use non-cooperative strategies, the throughput was between 0.45 and 0.67. The worst performance was observed in set 4 (0.19 to 0.25), which was due to the dominance of N strategies (used by 60% of the nodes). Differences between the three variants of scenarios within a given set were related to the proportion of T to S strategies. As T-type strategies were generous in their contribution to packet forwarding, their presence improved the performance of nodes using all other types of strategies (including the selfish ones). Nodes using selfish strategies (N strategies) managed to send from 13% (s4.3) up to 25% (s2.1) of packets, which was far below the values obtained by nodes using other types of strategies. The T strategy scored the best performance (throughput from 0.33 to 0.69), while S came second with throughput ranging from 0.27 up to 0.54. These values were positively correlated with participation in packet relaying. Nodes that used the T strategy forwarded on average from 558 (s4.3) to 873 (s1.1) packets each, while nodes following the S strategy relayed 318 (s4.3) up to 623 (s1.1) packets. On the other hand, selfish users almost did not participate in forwarding duties at all (forwarding 15 to 35 packets). Since all strategies were trust-dependent, the relaying rates depended on the type of node requesting the forwarding service. Nodes using S strategies were the most demanding ones in terms of trust requirements regarding source nodes. Nodes using these strategies prevented selfish nodes from obtaining good throughput, as relaying rates concerning nodes using N strategies were around 0.15. On the other hand, these nodes were by far more cooperative towards nodes that used T strategies (relaying rates ranging from 0.68 to 0.84) and S strategies (0.45-0.63). Nodes that used T strategies were even more cooperative, with relaying rates in the range 0.84-0.95 towards nodes using T strategies and 0.73-0.87 towards S strategies. These nodes were also quite cooperative towards selfish nodes (relaying rates around 0.43), but one has to keep in mind that as a path is typically
composed of more than one intermediate hop, such a rate is not enough to obtain a good throughput. This was demonstrated by scenarios s2.1, s3.1 and s4.1, where selfish nodes were accompanied mostly by nodes using T strategies. In these scenarios, nodes using N strategies managed to successfully send at most 25% of packets. As in a MANET the strategy selection process is done independently by each node, the question is what strategies would most probably be chosen by network participants: the more or the less cooperative ones? Such a decision would most probably be driven by two factors: the state of the battery resources and the level of performance that a node would like to obtain from the network. The better the battery and the higher the expectations from the network, the more likely a node is to use a more cooperative strategy. As demonstrated by the four sets of experiments, being more cooperative results in a better level of service received from the network (defined here as throughput). But as more nodes become cooperative, better performance is also obtained by selfish nodes, so one might expect that nodes could be tempted to switch to selfish strategies. However, as demonstrated by the experiments, this is not very likely to happen. For instance, in s1.1 (no selfish nodes present) the S nodes forwarded 623 packets and obtained a throughput equal to 0.54, while T nodes forwarded 40% more packets and obtained a throughput equal to 0.69. In such a case the decision whether to use a more or a less cooperative strategy was a matter of preferences and battery resources. However, in a large presence of selfish nodes (s4.1-s4.3) T nodes forwarded 59-75% more packets than S nodes, and obtained only a slightly higher throughput. This means that in such a situation (a large presence of selfish nodes) it is more reasonable to use a less cooperative strategy. As a result, it is very unlikely that selfish nodes will have an occasion to free-ride.
4 Conclusions
In this paper a new trust-based probabilistic scheme for packet relaying was introduced. One of its key features is the use of the notion of activity as a basis for the relaying strategy. It promotes identity persistence, as nodes are discouraged from a whitewashing action because a new identity means that the activity status is reset to zero. Additionally, the relativity of the trust evaluation enables the trust system to adapt the ratings to the actual networking conditions. Thus, one can view the output of the system as an evaluation of a subjective ranking of nodes. Such an approach also helps to deal with the uncertainty related to the watchdog-based data collection mechanism. False or inaccurate data occasionally provided by the mechanism affect all nodes in the same way, thus these errors will not affect the ranking of nodes. In consequence, this uncertainty, as long as it remains reasonable, will not create any significant threat to the trust system. The computational experiments demonstrated a substantial degree of resistance to free-riding, as in the case of a large number of selfish nodes the stricter strategies proved to be a better choice. However, the experiments were performed under simplified assumptions and the selected strategies were far from
the ideal. The scope of our future work is to construct a model of MANET in which additional complex factors will be taken into account.
Acknowledgments. This work has been partially funded by the C08/IS/21 TITAN Project (CORE programme) financed by the National Research Fund of Luxembourg.
References
1. Buttyan, L., Hubaux, J.P.: Nuglets: a virtual currency to stimulate cooperation in self-organized mobile ad hoc networks. Technical Report DSC/2001/001, Swiss Federal Institute of Technology (2001)
2. Marti, S., Giuli, T., Lai, K., Baker, M.: Mitigating routing misbehavior in mobile ad hoc networks. In: Proc. ACM/IEEE 6th International Conference on Mobile Computing and Networking (MobiCom 2000), pp. 255-265 (2000)
3. Michiardi, P., Molva, R.: Simulation-based analysis of security exposures in mobile ad hoc networks. In: Proc. European Wireless Conference (2002)
4. Josang, A., Ismail, R., Boyd, C.: A survey of trust and reputation systems for online service provision. Decision Support Systems 43(2), 618-644 (2007)
5. Buchegger, S., Boudec, J.Y.L.: Performance analysis of the confidant protocol. In: Proc. 3rd International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc 2002), pp. 226-236 (2002)
6. Buchegger, S., Boudec, J.Y.L.: The effect of rumor spreading in reputation systems for mobile ad-hoc networks. In: Proc. Workshop on Modeling and Optimization in Mobile, Ad Hoc and Wireless Networks (WiOpt 2003), pp. 131-140 (2003)
7. Buchegger, S., Boudec, J.Y.L.: Self-policing mobile ad hoc networks by reputation systems. IEEE Communications Magazine, Special Topic on Advances in Self-Organizing Networks 43(7), 101-107 (2005)
8. Giordano, S., Urpi, A.: Self-organized and cooperative ad hoc networking. In: Basagni, S., Conti, M., Giordano, S., Stojmenovic, I. (eds.) Mobile Ad Hoc Networking, pp. 355-371. Wiley, IEEE Press (2004)
9. Michiardi, P., Molva, R.: Core: A collaborative reputation mechanism to enforce node cooperation in mobile ad hoc networks. In: Proc. 6th Conference on Security Communications, and Multimedia (CMS 2002), pp. 107-121 (2002)
10. Seredynski, M., Bouvry, P., Klopotek, M.A.: Analysis of distributed packet forwarding strategies in ad hoc networks. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Wasniewski, J. (eds.) PPAM 2007. LNCS, vol. 4967, pp. 68-77. Springer, Heidelberg (2008)
11. Feldman, M., Papadimitriou, C., Chuang, J., Stoica, I.: Free-riding and whitewashing in peer-to-peer systems. IEEE Journal on Selected Areas in Communications 24(5), 1010-1019 (2006)
12. Jubin, J., Turnow, J.D.: The darpa packet radio network protocols. Proceedings of the IEEE 75(1), 21-32 (1987)
13. Fischer, D., Basin, D., Engel, T.: Topology dynamics and routing for predictable mobile networks. In: Proc. The 16th IEEE International Conference on Network Protocols (ICNP 2008). IEEE, Los Alamitos (2008)
On the Performance of a New Parallel Algorithm for Large-Scale Simulations of Nonlinear Partial Differential Equations

Juan A. Acebrón¹, Ángel Rodríguez-Rozas¹, and Renato Spigler²

¹ Center for Mathematics and its Applications, Department of Mathematics, Instituto Superior Técnico, Av. Rovisco Pais 1049-001 Lisboa, Portugal
[email protected], [email protected]
² Dipartimento di Matematica, Università "Roma Tre", Largo S.L. Murialdo 1, 00146 Rome, Italy
[email protected]
Abstract. A new parallel numerical algorithm based on generating suitable random trees has been developed for solving nonlinear parabolic partial differential equations. This algorithm is suited for current high performance supercomputers, showing a remarkable performance and arbitrary scalability. While classical techniques based on a deterministic domain decomposition exhibit strong limitations when increasing the size of the problem (mainly due to the intercommunication overhead), probabilistic methods allow us to exploit massively parallel architectures since the problem can be fully decoupled. Some examples have been run on a high performance computer, and their scalability and performance carefully analyzed. Large-scale simulations confirmed that computational time decreases proportionally to the cube of the number of processors, whereas memory reduces quadratically.
1 Introduction
An efficient design of numerical parallel algorithms for large-scale simulations becomes crucial when solving realistic problems arising in Science and Engineering. Most of them are based on domain decomposition (DD) techniques [10], since these have been shown to be especially well suited for parallel computers. Nevertheless, classical approaches based on domain decomposition usually suffer from process intercommunication and synchronization, and consequently the scalability of the algorithm turns out to be seriously degraded. Moreover, an additional excess of computation is often introduced when designing the algorithms in order to diminish the effects of the two previous issues. From the parallelism point of view, probabilistic methods based on Monte Carlo techniques offer a promising alternative to overcome these problems. The possibility of using parallel computers to solve efficiently certain partial differential equations (PDEs) based on their probabilistic representation was
recently explored. In fact, in [1,2], an algorithm for numerically solving linear two-dimensional boundary-value problems for elliptic PDEs, exploiting the probabilistic representation of solutions, was introduced for the first time. This is a hybrid algorithm, which combines the classical domain decomposition method [12] and the probabilistic method, and was called the "probabilistic domain decomposition method" (PDD for short). The probabilistic method was used merely to obtain only very few values of the solution at some points internal to the domain; interpolating on such points then yields continuous approximations of the sought solution on suitable interfaces. Once the solution at the interfaces is known, a full decoupling into arbitrarily many independent subdomains can be accomplished, and a classical local solver, arbitrarily chosen, used. This fact represents a definitely more advantageous circumstance, compared to what happens in any other existing deterministic domain decomposition method. However, in contrast with linear PDEs, a probabilistic representation for nonlinear problems exists only in very particular cases. In [5], we extended the PDD method to treat nonlinear parabolic one-dimensional problems, while in [6] it was conveniently generalized to deal with arbitrary space dimensions. In this paper, for the first time, the performance of the PDD algorithm in terms of computational cost and memory consumption is carefully analyzed. To this purpose, large-scale simulations on a high performance supercomputer were run, and the advantageous scalability properties of the algorithm verified. Since the local solver for the independent subproblems can be arbitrarily chosen, for this paper we chose a direct method based on LAPACK for solving the ensuing linear algebra problem. This is because ScaLAPACK was selected to compare the performance of our PDD method against, ScaLAPACK being a competitive, freely available and widely used parallel numerical library. Other alternatives based on iterative methods are worth investigating, and will be considered in future work. In the following, we briefly review the mathematical fundamentals underlying the probabilistic representation of the solution of nonlinear parabolic PDEs. More precisely, for simplicity let us consider the following initial-boundary value problem for a nonlinear two-dimensional parabolic PDE,

    \frac{\partial u}{\partial t} = D_x(x, y) \frac{\partial^2 u}{\partial x^2} + D_y(x, y) \frac{\partial^2 u}{\partial y^2} + G_x(x, y) \frac{\partial u}{\partial x} + G_y(x, y) \frac{\partial u}{\partial y} - c u + \sum_{i=2}^{m} \alpha_i u^i,

    in Ω = [-L, L] × [-L, L], t > 0,
    u(x, y, 0) = f(x, y),   u(x, y, t)|_{x ∈ ∂Ω} = g(x, y, t),    (1)

where x = (x, y) ∈ R^2, α_i ∈ [0, 1], \sum_{i=2}^{m} α_i = 1, f(x, y) and g(x, y, t) being the initial condition and the boundary data, respectively. The probabilistic representation for the solution of (1) was explicitly derived in [6] for the n-dimensional case, and requires generating suitable random trees with as many branches as the power of the nonlinearity in (1). Such random trees play the role that a random path plays in the linear case (cf. the Feynman-Kac formula for linear parabolic PDEs). Mathematically, the solution of (1) for a purely initial value
problem at (x, t) can be represented (see [11]) as follows:

    u(x, t) = E_x [ \prod_{i=1}^{k(ω)} f(x_i(t, ω)) ].    (2)
Here k(ω) is the number of branches of the random tree starting at x and reaching the final time t, and x_i(t, ω) is the position of the i-th branch of the random tree, corresponding to the stochastic process solution of the stochastic differential equation

    dβ = b(x, y, t) dt + σ(x, y, t) dW(t),    (3)

where W(t) represents the standard Brownian motion, and b(x, y, t) and σ(x, y, t) are the corresponding drift and diffusion related to the coefficients of the elliptic operator in (1). The different possible realizations of the random trees are labeled by ω, and a suitable average E_x should be taken over all trees. In the case of an initial-boundary value problem, with the boundary data u(x, t)|_{x ∈ ∂Ω} = g(x, t), a similar representation holds, i.e.,

    u(x, t) = E_x [ \prod_{i=1}^{k(ω)} ( f(x_i(t, ω)) 1_{[S_i > τ_{∂Ω_i}]} + g(β_i(τ_{∂Ω_i}), τ_{∂Ω_i}) 1_{[S_i < τ_{∂Ω_i}]} ) ],    (4)
where τ_{∂Ω_i} is the first exit time out of the domain of the stochastic process β(·); S_i is a random time governed by the exponential distribution p(S_i) = c exp(-c S_i), and 1_{[S_i < τ_{∂Ω_i}]} is the indicator function, being 1 or 0 depending on whether τ_{∂Ω_i} is or is not greater than S_i. For more mathematical details, we refer the reader to [5] and [6]. In practice, the probabilistic representation in (2) works as follows. A stochastic process initially located at (x, 0) (the root of the tree) is generated and evolves in time. Simultaneously, a random time S_1, picked from the exponential probability density p(S) = c exp(-c S), is generated. If the random time S_1 is less than t, a number of branches corresponding to the power of the nonlinearity are created. This procedure is repeated successively until \sum_{i=1}^{n} S_i > t holds. Whenever one of these branches reaches the final time t, the initial value function f is evaluated at the position where the stochastic process associated with that branch is located. The solution is finally reconstructed by multiplying all contributions coming from each branch. In practice, to achieve a reasonably small error in (4), a large number of random trees should be generated. More details on the numerical errors related to the probabilistic representation are found in [5] and [6]. For the purpose of illustration, a sketchy picture is shown in Fig. 1. It corresponds to the case of a tree with two branches, one of them reaching the final time t, while the other hits the boundary. The paper is organized as follows. In Section 2 the PDD algorithm is described, showing its different components, and stressing how they can be parallelized in practice. Here we also discuss the important feature of being a naturally fault-tolerant algorithm. In Section 3 some examples of nonlinear two-dimensional PDEs illustrate the PDD method, and a deep analysis of the computational cost and memory consumption is also undertaken. Finally, some conclusions are given.
Fig. 1. A picture illustrating a typical random tree in 2D
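The following Python sketch (ours, not the authors' code) scores a single random tree for the purely initial-value representation (2), using Euler-Maruyama for the diffusion (3). The drift b, diffusion sigma, initial datum f, branching rate c and branching factor are problem data supplied by the caller, and boundary exits are not handled in this simplified version.

```python
# Illustrative sketch (ours): one random-tree realization for Eq. (2).
import math, random

def tree_score(x, y, t, b, sigma, f, c, n_branches, dt=1e-3, rng=random):
    """Product of f over all branches of a random tree started at (x, y)."""
    score, stack = 1.0, [(x, y, t)]          # each entry: branch position, remaining time
    while stack:
        px, py, rem = stack.pop()
        s = rng.expovariate(c)               # exponential branching time, density c*exp(-c*s)
        horizon = min(s, rem)
        steps = max(1, int(horizon / dt))
        h = horizon / steps
        for _ in range(steps):               # Euler-Maruyama for d(beta) = b dt + sigma dW
            bx, by = b(px, py)
            sx, sy = sigma(px, py)
            px += bx * h + sx * math.sqrt(h) * rng.gauss(0.0, 1.0)
            py += by * h + sy * math.sqrt(h) * rng.gauss(0.0, 1.0)
        if s >= rem:                         # branch reached the final time t
            score *= f(px, py)
        else:                                # branching event: create new branches
            stack.extend((px, py, rem - s) for _ in range(n_branches))
    return score

# u(x, y, t) is then estimated by averaging tree_score over many independent trees.
```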
2 The PDD Method
As was mentioned above, the core of the algorithm requires evaluating the solution of (1) at some conveniently chosen points using a probabilistic representation. The following parallelization strategies can be adopted:
• Parallelizing by splitting into independent sets of points. Since the number of points where the solution is computed is larger than the number of processors p, computing such a solution can be assigned as a task to different processors. This can be seen as a coarse-grain parallelization and is the most convenient strategy.
• Parallelizing for each point. Recall that for each point, the probabilistic representation requires generating a rather large number of random trees to obtain a sufficiently small statistical error. Since the number of random trees often turns out to be much larger than p, the task of generating only a few of them can be accomplished by different processors. This corresponds to a mid-grain parallelization.
• Parallelizing for a given realization. Since every branch pertaining to a given tree is independent of the others, the task of computing only a few branches can be assigned to different processors. This can be seen as a fine-grain parallelization, being by far the most involved, since strong intercommunication and synchronization among the processors is required. However, this level of parallelization offers an important advantage, since it makes it easy to balance the overall computational load.
We now briefly describe the three parts of the algorithm.
Probabilistic part. This is the first step to be carried out, and consists of computing the solution of the PDE at a few suitable points by Monte Carlo, which is essentially a mesh-free method. In Fig. 2 a sketchy diagram is plotted, illustrating how the algorithm works in practice for a two-dimensional case. Here the solution is obtained probabilistically at a few points pertaining to some "interfaces" conveniently chosen inside the space-time domain D := Ω × [0, T], with Ω ⊂ R^n. Such interfaces divide the domain into p subdomains Ω_i, i = 1, ..., p, assigned to different processors p_i, i = 1, ..., p. For instance, in R^2 this is done
by splitting the domain into slices whose interfaces are parallel to the y-axis, as shown in Fig. 2. Since evaluating by Monte Carlo is computationally costly, only a few points are chosen in each plane (y, t), and every processor computes the interfacial values at the corresponding interface located to its right (see Fig. 2). Since each processor shares its interface with its immediate neighbor, the interfacial values must be transferred from processor p_i to processor p_{i+1}, establishing a next-neighbor intercommunication among processors.
Fig. 2. A sketchy diagram illustrating the main steps of the algorithm in 2D: The figure on the left shows how the domain decomposition is done in practice. The figure on the right shows the points where the solution is computed probabilistically; these are used afterwards as nodal points for interpolation.
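A minimal mpi4py sketch (ours) of this first step is given below: every processor except the last one estimates the solution at a few nodal points of the interface to its right and sends those values to its right-hand neighbour. The nodal-point layout and the dummy estimator are assumptions made only for illustration.

```python
# Illustrative sketch (ours): computing and exchanging interfacial values.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

L, n_nodes, T = 1.0, 5, 0.5                          # assumed problem data
y_nodes = np.linspace(-L, L, n_nodes)
t_nodes = np.linspace(T / n_nodes, T, n_nodes)

def estimate_point(x, y, t):
    """Stand-in for the Monte Carlo estimate of u(x, y, t) described in Sect. 1."""
    return 0.0

vals = None
if rank < size - 1:                                  # rightmost interface is the boundary
    x_iface = -L + 2.0 * L * (rank + 1) / size       # interface to the right of this rank
    vals = np.array([[estimate_point(x_iface, y, t) for t in t_nodes] for y in y_nodes])
    comm.send(vals, dest=rank + 1, tag=0)            # pass it to the right-hand neighbour
from_left = comm.recv(source=rank - 1, tag=0) if rank > 0 else None
# each subdomain now holds its left (received) and right (computed) interfacial values
```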
Interpolation. Once the solution has been computed at a few points on each interface, the second step consists of interpolating on such points, used as nodal points, thus obtaining continuous approximations of the interfacial values of the solution. For that purpose, a tensor product interpolation based on cubic splines [7] was used. The computational cost in the two-dimensional case is negligible compared with the time spent in the other parts of the algorithm. The nodal points were uniformly distributed on each plane, and a not-a-knot condition was imposed.
Local solver. The third and final step consists of computing the solution inside each subdomain, this task being assigned to different processors. This can be accomplished by resorting to local solvers, which may use classical numerical schemes, such as finite difference or finite element methods. Since the comparison with another freely available method was made against ScaLAPACK, LAPACK has been chosen for this purpose.

Fault Tolerance-Related Issues

We stress that the algorithm is naturally fault-tolerant, and it is so in each of the parts it is composed of. Here, we describe the peculiarities of such features:
• Probabilistic part. If a given processor p_i fails when computing the solution probabilistically at a given point, either any other processor may continue the pending task or, in case no processors are free at that time, the task may be postponed and executed later.
• Interpolation. As in the previous case, when a processor fails, either the subdomain corresponding to that processor may be reassigned to a neighboring processor or the task may be postponed and executed afterwards. Note that the failure merely affects this processor.
• Local solver. Let us suppose a failure occurs at a certain time t = T_f, when solving the PDE locally by a classical method. Then, the solution obtained at T_f may be used as new initial data, solving the problem starting from T_f rather than from t = 0.
3 Results and Performance
We now proceed with some numerical examples for two-dimensional problems aiming to illustrate the performance of the PDD algorithm. All simulations were carried out on the MareNostrum Supercomputer located at the Barcelona Supercomputing Center, using up to 1,024 processors.

3.1 Computational Cost and Memory Consumption
The space-time domain inside each subdomain corresponding to the PDD algorithm, as well as the full domain used to compare with the PDD, were discretized using the implicit Crank-Nicolson finite difference method. On the various uncoupled subdomains obtained by the PDD algorithm we used LAPACK as a local solver. In [6], a comparison with the results obtained with ScaLAPACK was already carried out, showing serious limitations in terms of memory consumption. A striking difference in the performance of both methods was shown, the performance of the PDD method being notably superior to that of ScaLAPACK. Here we focus mainly on the strong scalability and performance of the PDD method. With N_x and N_y the number of nodes in each dimension, the matrix A ∈ R^{N_x N_y × N_x N_y} of the linear algebra problem associated with the discretization of the domain is known to be banded, with a bandwidth BW = N_x. Since we parallelized the PDD algorithm by assigning the task of solving the local linear system pertaining to each subdomain to different processors, the bandwidth of the associated matrix turns out to be smaller. The local solver for the PDD method, being based on LAPACK for solving banded linear systems (routines dgbtrf/dgbtrs), consists of an LU factorization followed by a forward/backward substitution. The memory consumption per processor, including an extra fill-in space, was also estimated in [6], and is given by

    M_p^{PDD} = 3 \frac{N_x^2 N_y}{p^2}.    (5)
Recall that the computational cost for the LU factorization is known to be of the order of N_x N_y · BW^2, while it is N_x N_y · BW for the forward/backward substitution, N_x N_y being the size of the square matrix and BW = N_x/p. Hence, the computational cost of the local solver for the PDD based on LAPACK is

    C_{PDD} ≈ α_{opt} \frac{N_x N_y}{p} \frac{N_x}{p} \left( 4 \frac{N_x}{p} + 1 \right) + 3 \frac{N_x^2 N_y}{p^2} + \frac{N_x N_y}{p}
            = 4 α_{opt} \frac{N_x^3 N_y}{p^3} + (α_{opt} + 3) \frac{N_x^2 N_y}{p^2} + \frac{N_x N_y}{p},    (6)
see [8,9,6]. The last two terms are related to the computational time spent computing the matrix coefficients and the right-hand side, respectively. Here the parameter α_opt < 1 was introduced to account for the computational advantages gained when using an optimized LAPACK library, after conveniently tuning the BLAS library for the specific hardware platform used. Typically, the BLAS library can easily be tuned through the ATLAS package. To estimate the value of α_opt experimentally when running the simulations on the MareNostrum Supercomputer, we compiled BLAS with and without ATLAS, and it turns out to be approximately α_opt = 0.84. We stress that, when N_x >> p and N_y >> p, the computational cost of the PDD method decreases as p^{-3} for large p, while for ScaLAPACK it has been reported in [8,9] to decrease only as p^{-1}.
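For illustration (ours), the estimates (5) and (6) can be tabulated as functions of p; the combined form of (6) is our algebraic rearrangement, and the problem size used below is an assumed one.

```python
# Illustrative sketch (ours): predicted per-processor memory (5) and cost (6).
def pdd_memory(nx, ny, p):
    return 3.0 * nx**2 * ny / p**2                     # Eq. (5)

def pdd_cost(nx, ny, p, alpha=0.84):                   # alpha_opt estimated above
    return (4.0 * alpha * nx**3 * ny / p**3            # LU factorization term
            + (alpha + 3.0) * nx**2 * ny / p**2        # substitution + matrix coefficients
            + nx * ny / p)                             # right-hand side

for p in (128, 256, 512, 1024):                        # assumed problem size Nx = Ny = 8000
    print(p, f"{pdd_cost(8000, 8000, p):.3e}", f"{pdd_memory(8000, 8000, p):.3e}")
```

Once the first term dominates, the predicted cost drops by roughly a factor of eight per doubling of p, which is the theoretical scaling factor discussed for Example A below.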
3.2 Numerical Results
In this subsection we consider two numerical test problems.
Example A. The problem is given by

    \frac{\partial u}{\partial t} = (1 + x^4) \frac{\partial^2 u}{\partial x^2} + (1 + y^2 \sin^2(y)) \frac{\partial^2 u}{\partial y^2} + (2 + \sin(x) e^y) \frac{\partial u}{\partial x} + (2 + x^2 \cos(y)) \frac{\partial u}{\partial y} - u + \frac{1}{2} u^2 + \frac{1}{2} u^3,

    in Ω = [-L, L] × [-L, L], 0 < t < T,   u(x, y, t)|_{∂Ω} = 0,    (7)

with initial condition u(x, y, 0) = \cos^2(πx/2L) \cos^2(πy/2L). No analytical solution is known for this problem, hence the numerical error was controlled by comparing the solution obtained with the PDD method with that given by a finite difference method with a very fine space-time mesh.

Table 1. Example A: T_TOTAL denotes the computational time in seconds spent by PDD; T_MC and T_INTERP correspond to the time spent by the Monte Carlo and the interpolation part, respectively; Memory denotes the total memory consumption.

    No. Processors   T_MC   T_INTERP   Memory     T_TOTAL
    128              445"   <1"        0.68 GBs   12976"
    256              459"   <1"        0.17 GBs   3420"
    512              463"   <1"        0.04 GBs   1166"
    1024             461"   <1"        0.01 GBs   595"
In Fig. 3, the pointwise numerical error made when solving the problem of Example A by PDD is shown. Here the values of the parameters were kept fixed at Δx = Δy = 10^{-2}, Δt = 10^{-3} and L = 1.
Fig. 3. Pointwise numerical error for the solution of problem ”Example A”, at t = 0.5, using two processors. Note that there is only one interface located at x = 0.
In Table 1, results for Example A are shown, using up to 1,024 processors. Now the parameters chosen for the local solver were Δx = Δy = 2.5 × 10^{-4}, T = 0.5, L = 1 and Δt = 10^{-3}. Note that the algorithm initially scales with the number of processors by a factor of approximately 4 when doubling the number of processors, starting from p = 128. This is so when the number of processors is not too large (p = 128), the computational workload per processor being sufficiently large in this case. However, a factor of 8 is theoretically expected from (6), since N_x/p >> 1 and N_y/p >> 1. An explanation of this fact could be found in the size of the problem, which is not sufficiently large, and in the fact that the optimized LAPACK is speeding up the LU factorization considerably. Therefore, the dominant asymptotic behavior is governed, rather, by the second term in (6). To confirm that this is indeed the case, new simulations with a larger computational load are needed, such that N_x/p >> 1/α_opt and N_y/p >> 1/α_opt. The required computational resources in terms of memory, however, largely exceed the available memory (the limits have already been reached). A simple way to circumvent this problem consists of increasing the bandwidth of the associated matrix while keeping the computational load almost constant. This can be done merely by considering a rectangular domain instead, Ω = [-L_x, L_x] × [-L_y, L_y], with L_x > L_y. Other alternatives, based on adopting a heterogeneous computational grid or even using higher order numerical schemes, are equally possible, and will be analyzed elsewhere. From the results above, it is observed that increasing the size of the problem by raising the number of nodes N_x along the x-dimension results in a computational cost growing cubically with N_x, whereas the memory consumption grows quadratically. Nevertheless, the effect of the y-dimension is more contained, since both the computational cost and the memory consumption increase linearly with the number of nodes N_y. Therefore, to increase the computational workload appreciably it is more convenient to increase the size of the problem along the x-dimension. In fact, this is successfully illustrated in the next example.
Example B. Consider the following problem

    \frac{\partial u}{\partial t} = (1 + x^2) \frac{\partial^2 u}{\partial x^2} + (1 + y^3) \frac{\partial^2 u}{\partial y^2} + (2 + \sin(x) e^y) \frac{\partial u}{\partial x} + (2 + \sin(x) \cos(y)) \frac{\partial u}{\partial y} - u + \frac{1}{2} u^2 + \frac{1}{2} u^3,

    in Ω = [-L_x, L_x] × [-L_y, L_y], 0 < t < T,   u(x, y, t)|_{∂Ω} = 0,    (8)

with initial condition u(x, y, 0) = \cos^2(πx/2L_x) \cos^2(πy/2L_y). In Table 2, the results corresponding to such a rectangular domain are shown. We used the parameters L_x = 200, L_y = 0.125, Δx = Δy = 5 × 10^{-3}, T = 0.5, and Δt = 10^{-3}. Note that the scaling factor has changed significantly, reaching even the value 5 when passing from 128 to 256 processors. Increasing the number of processors further implies reducing the scaling factor. This is in good agreement with the theoretical estimates in (6), since, asymptotically, the dominant term changes according to the number of processors involved as well.

Table 2. Example B: T_TOTAL denotes the computational time in seconds spent by PDD; T_MC and T_INTERP correspond to the time spent by the Monte Carlo and the interpolation part, respectively; Memory denotes the total memory consumption.

    No. Processors   T_MC   T_INTERP   Memory     T_TOTAL
    128              67"    <1"        1.79 GBs   54235"
    256              69"    <1"        0.45 GBs   10666"
    512              71"    <1"        0.11 GBs   2601"
    1024             72"    <1"        0.03 GBs   748"
4 Conclusions and Future Work
The performance of a new parallel numerical algorithm for solving nonlinear parabolic partial differential equations based on a probabilistic method has been investigated. Such a method allows decoupling the full problem into several independent subproblems, dramatically alleviating the computational cost. Many different sources of parallelism for this algorithm can be exploited efficiently, since the intercommunication overhead among processors is almost nonexistent. Moreover, the algorithm is fault-tolerant in a natural way. While the computational cost for traditional schemes scales linearly with the number of processors, the PDD algorithm scales quadratically or even cubically, provided that the workload per processor is sufficiently large. Some numerical examples are given, showing the excellent scalability properties of the PDD algorithm in large-scale simulations, using up to 1,024 processors on a high performance supercomputer. Summarizing, the proposed algorithm appears to be a promising alternative to the traditional parallel schemes for solving partial differential equations. Such an algorithm is characterized by desirable features, such as low intercommunication overhead and fault-tolerance, making it possible to exploit massively parallel supercomputers as well as Grid computing at their best.
Acknowledgments. This work was supported by the Portuguese FCT. The authors thankfully acknowledge computer resources, technical expertise, and assistance provided by the Barcelona Supercomputing Center - Centro Nacional de Supercomputación.
References
1. Acebrón, J.A., Busico, M.P., Lanucara, P., Spigler, R.: Domain decomposition solution of elliptic boundary-value problems. SIAM J. Sci. Comput. 27(2), 440-457 (2005)
2. Acebrón, J.A., Busico, M.P., Lanucara, P., Spigler, R.: Probabilistically induced domain decomposition methods for elliptic boundary-value problems. J. Comput. Phys. 210(2), 421-438 (2005)
3. Acebrón, J.A., Spigler, R.: Supercomputing applications to the numerical modeling of industrial and applied mathematics problems. J. Supercomputing 40, 67-80 (2007)
4. Acebrón, J.A., Spigler, R.: A fully scalable parallel algorithm for solving elliptic partial differential equations. In: Kermarrec, A.-M., Bougé, L., Priol, T. (eds.) Euro-Par 2007. LNCS, vol. 4641, pp. 727-736. Springer, Heidelberg (2007)
5. Acebrón, J.A., Rodríguez-Rozas, A., Spigler, R.: Efficient parallel solution of nonlinear parabolic partial differential equations by a probabilistic domain decomposition (2009) (submitted)
6. Acebrón, J.A., Rodríguez-Rozas, A., Spigler, R.: Domain decomposition solution of nonlinear two-dimensional parabolic problems by random trees. J. Comput. Phys. 228(15), 5574-5591 (2009)
7. Antia, H.M.: Numerical methods for scientists and engineers. Tata McGraw-Hill, New Delhi (1995)
8. Arbenz, P., Cleary, A., Dongarra, J., Hegland, M.: A comparison of parallel solvers for diagonally dominant and general narrow-banded linear systems. Parallel and Distributed Computing Practices 2(4), 385-400 (1999)
9. Arbenz, P., Cleary, A., Dongarra, J., Hegland, M.: A comparison of parallel solvers for diagonally dominant and general narrow-banded linear systems II. In: Euro-Par 1999 Parallel Processing, pp. 1078-1087. Springer, Berlin (1999)
10. Keyes, D.E.: Domain Decomposition Methods in the Mainstream of Computational Science. In: Proceedings of the 14th International Conference on Domain Decomposition Methods, pp. 79-93. UNAM Press, Mexico City (2003)
11. McKean, H.P.: Application of Brownian motion to the equation of Kolmogorov-Petrovskii-Piskunov. Comm. on Pure and Appl. Math. 28, 323-331 (1975)
12. Quarteroni, A., Valli, A.: Domain Decomposition Methods for Partial Differential Equations. Oxford Science Publications, Clarendon Press, Oxford (1999)
Partial Data Replication as a Strategy for Parallel Computing of the Multilevel Discrete Wavelet Transform Liesner Acevedo1,2 , Victor M. Garcia1 , Antonio M. Vidal1 , and Pedro Alonso1 1
Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Camino de Vera s/n, 46022 Valencia, España {lacevedo,vmgarcia,avidal,palonso}@dsic.upv.es 2 Departamento de Técnicas de Programación, Universidad de las Ciencias Informáticas, Carretera a San Antonio de los Baños Km 2 s/n, La Habana, Cuba
Abstract. In this paper we propose a strategy of partial data replication for efficient parallel computation of the Discrete Wavelet Transform in a distributed memory environment. The key is to avoid the communications needed between the computation of different wavelet levels by replicating part of the data and part of the computations, so that communications are avoided completely (except perhaps at the setup phase). A similar idea was proposed in a paper by Chaver et al.; however, they proposed to replicate the data completely, which can require too much memory in each processor. In this work we determine exactly how many data items each processor needs in order to compute the DWT without extra communications, using only the memory strictly necessary.
1 Introduction
The Discrete Wavelet Transform (DWT) [1,2,3] is still a very recent field, and it is the subject of active research. The number of applications which can benefit from this tool increases nearly every day, such as signal analysis [4], image compression [5], electromagnetism [6] and many others. There is also a growing interest in the use of the DWT in numerical linear algebra [7,8]. The DWT is a relatively simple operation that is applied to signals and data of any kind, and amounts to performing convolutions of certain digital filters with the data. It can be used with different goals: to obtain compressed representations of the data, to determine the frequency contents of the data, etc. Usually, the DWT is applied repeatedly over the data set; each new application of the DWT is called a level. Very often, several levels of the DWT are applied over very large data sets. This requires heavy computations and, in such cases, parallel computing may be needed to obtain reasonable computing times. There are many papers about the parallel computation of the DWT. Among these, we find especially relevant the paper by Nielsen and Hegland [9]. There,
they made a detailed study of the parallel computation of the DWT for vectors (DWT-1D) and for matrices (DWT-2D) on distributed memory computers. Their study focused mainly on the 1-level case, establishing clearly the communications needed in this case. To compute the 1-level DWT, each processor should receive a portion of the data (which, for efficiency, should be composed of contiguous entries of the vector/matrix) and should compute its part of the DWT; to complete the calculation, each processor needs to receive some data from the "neighbor" processors. They also studied the 1-level 2D DWT, which involves as well a matrix (image) transposition. They proposed a strategy (the Communication-Efficient Distribution) to avoid the transposition step. Still, this is accomplished at the cost of some communications among processors. The k-level case can be obtained by applying the 1-level DWT repeatedly, but, as will be shown, new communications must be carried out prior to the computation of each level. In [10] a full replication strategy was proposed for the k-level DWT of a matrix. This amounts to copying the full matrix in every processor, splitting the computations among the processors. In this case, no communication would be needed (although some computations would be replicated as well). The main drawback of this strategy is that it requires a full copy of the matrix in each processor, so that there is a severe limit on the size of the matrices that can be processed in this way. In this paper, we refine further the idea proposed in [10], determining exactly how many elements each processor needs to hold so that the parallel DWT can be computed without extra communications. We have obtained an exact result for this problem, proven in Proposition 2. The result is proven for the 1D case, but, since any higher dimension DWT can be computed as a sequence of 1D DWTs, our result can be used for DWTs of any dimension. The paper is structured as follows: Section 2 describes the DWT. The next section describes the parallel DWT, mainly for the 1D case, and includes our proposal, whose main result is Proposition 2. Then, we discuss several considerations about the possible practical use of our proposal, and finally we give our conclusions.
2 Discrete Wavelet Transform
The theoretical background of the DWT is based on the concept of Multiresolution Analysis and is already quite well known. However, these concepts are not useful when considering the implementation details and the parallelization, so that we refer the interested reader to [2,11]. The DWT applied over a signal or vector is obtained through digital filters, a lowpass filter G determined by the coefficients {g_i}_{i=1}^{L} and a highpass filter H determined by {h_i}_{i=1}^{L}. Not every pair of highpass and lowpass filters is valid for a DWT; however, there are many such pairs. This means that the DWT is not a single, unique
operation, since it depends on the filters. The length L of the filter is called the "order" of the wavelet. To obtain the DWT of a vector, both filters shall be applied (convolved) over the data vector; so, a single stage of the DWT (or a one-level DWT) over a vector v of length N = 2^j is performed, first, convolving both filters with the vector v, obtaining two sequences of length N, and next, discarding every even element of the transformed sequences, and putting together the resulting sequences. These two steps are equivalent to the multiplication of the matrix

W_N = ⎛ H ⎞
      ⎝ G ⎠

by the data vector, where H (of dimension N/2 × N) has the following structure:

    ⎛ h_1 h_2 ... h_L  0   ...  ...  ...   0  ⎞
H = ⎜  0   0  h_1 h_2  ... h_L  0   ...   0  ⎟        (1)
    ⎜  ⋮            ⋱            ⋱         ⋮  ⎟
    ⎝ h_3 h_4 ... h_L  0   ...  0  h_1  h_2  ⎠

The matrix G has the same structure and dimension. The matrix-vector product would result in

W_N · v = ⎛ H · v ⎞ = (d_1, ..., d_{N/2}, s_1, ..., s_{N/2})^T ,        (2)
          ⎝ G · v ⎠

where the d_i are the high frequency (or "detail") components and the s_i are the low frequency (or "smooth") components. The k-level DWT (or, simply, the DWT) is obtained applying k times the one-level DWT over the low frequency part of the data vector. The two-dimensional wavelet transform is applied to a matrix by applying 1-D transforms along each dimension. So, a one-level 2-D DWT applied over a matrix T_0 ∈ ℝ^{n×n} can be written as T̃_0 = W_N T_0 W_N^T, with W_N defined as above. In this case, we obtain the following partition:

T̃_0 = W_N T_0 W_N^T = ⎛ A_1  B_1 ⎞ ,        (3)
                       ⎝ C_1  T_1 ⎠

where T_1 = G T_0 G^T, A_1 = H T_0 H^T, B_1 = H T_0 G^T and C_1 = G T_0 H^T. If, as usually happens, we want to apply more than one DWT level, at least two different possibilities appear. The so-called "Standard" DWT form of a matrix is obtained applying the k-level 1D-DWT over the columns, and then applying the k-level 1D DWT over the rows. The Non-Standard form [3] is obtained applying a one-level DWT over the rows and one over the columns (obtaining a partition as in (3)). Then, another one-level DWT by rows and another by columns (i.e., multiplication by W_{N/2} and its transpose) are applied over the low-frequency submatrix T_1, and this is repeated k times, always over the low frequency submatrix.
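For illustration, the following minimal NumPy sketch assembles the matrix W_N of Eqs. (1)-(2) for given filter coefficients, using the periodic wrap-around visible in the last row of (1); the arrays h, g and the even length N are assumed inputs.

```python
import numpy as np

def wavelet_matrix(h, g, N):
    """Build W_N = (H; G) of Eqs. (1)-(2): row i of H and G holds the filter
    shifted by 2*i positions, wrapping around periodically for the last rows."""
    L = len(h)
    H = np.zeros((N // 2, N))
    G = np.zeros((N // 2, N))
    for row in range(N // 2):
        cols = (2 * row + np.arange(L)) % N   # columns occupied by the shifted filter
        H[row, cols] = h
        G[row, cols] = g
    return np.vstack([H, G])

# One DWT level of a vector v of length N = 2^j, as in Eq. (2):
# d, s = np.split(wavelet_matrix(h, g, len(v)) @ v, 2)
```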
3 Data Distributions for the Parallel DWT
The parallel DWT is a relatively simple operation to implement in a distributed memory model (it becomes trivial in a shared memory model) [9]. Examining the 1D case, outlined above, it is clear that the DWT is just a matrix-vector product (although in a practical implementation the matrix W_N would never be built). The straightforward parallelization is just like in the matrix-vector product, distributing the data vector among the available processors in "chunks" of length N/P, or as close as possible. Then, each processor performs its part of the DWT, although it will probably need data from other processors. Since the DWT is applied to contiguous components of vectors/matrices, a parallel DWT will be efficient only if the data assigned to any processor is made up of contiguous components of the vectors/matrices to be DWT-transformed. So, we will consider only this kind of "contiguity" data distributions. The parallel 2D DWT has been studied in different papers [9,12], but mainly for the single level case. In all cases, the distributions proposed were by blocks of columns, and the main efforts were oriented towards avoiding the transposition step. However, all the strategies assumed that there should be communications before the computation of each new level. In the next section we will discuss the procedure for parallel computing of a single DWT level.

3.1 Parallel Computation of the 1D-DWT, One Level
To understand how the computation of a single 1D-DWT level proceeds, we can examine Fig. 1. This figure shows the data dependences in the computation of a vector of length 16 distributed across 2 processors using wavelet filters of length L = 4. Let us consider only the computation of the low frequency components (the high frequency components are computed exactly in the same way). To compute the low frequency components, the processor P0 computes the coefficients s^1_1, s^1_2, s^1_3, s^1_4, while the processor P1 computes the coefficients s^1_5, s^1_6, s^1_7, s^1_8. To carry out their tasks, the processor P0 needs to receive the values v_9 and v_10 to compute s^1_4, while P1 needs to receive the values v_1 and v_2. This idea was generalized by Nielsen and Hegland [9], who obtained the following basic result:

Proposition 1. Consider the parallel computing of the 1D-DWT, applied over a vector of length N and distributed over P processors (0, ..., P−1) in subvectors of length N/P. If the wavelet order (length of the filters) is L, to complete the computation of the 1D-DWT over the vector v, each processor must receive exactly L − 2 elements from the "neighbor" processor on the right (except for P_{P−1}, which must receive L − 2 elements from P_0).

Now the k-level DWT shall be discussed, including our proposal.
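For illustration, a small sketch of what a single processor computes in this one-level setting: it works on its chunk of length N/P plus the L − 2 wrapped elements that Proposition 1 attributes to the right-hand neighbour. This is a serial simulation with assumed inputs, not an actual message-passing implementation.

```python
import numpy as np

def local_low_pass(v, g, P, rank):
    """Low-frequency ("smooth") outputs owned by `rank` for one DWT level.
    Uses the own chunk of length N/P plus the L-2 elements coming from the
    right-hand neighbour (wrapping around for the last processor)."""
    N, L = len(v), len(g)
    n = N // P
    start = rank * n
    idx = (start + np.arange(n + L - 2)) % N   # own elements + the L-2 received ones
    chunk = v[idx]
    return np.array([np.dot(g, chunk[2 * i:2 * i + L]) for i in range(n // 2)])
```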
Fig. 1. A 1D-DWT level over a vector v distributed over 2 processors, wavelet filters with L = 4 (4 coefficients)
3.2 Parallel Computation of the 1D-DWT, Several Levels
The computation of several 1D-DWT levels can be done by applying repeatedly the results of the previous section; that is, prior to the computation of each level, each processor sends and receives (L − 2) elements. Then, each processor computes its part of a new DWT level.
Fig. 2. Two 1D-DWT levels over a vector v distributed over 2 processors, wavelet filters with L = 4 (4 coefficients)
Figure 2 shows the data dependencies for the same case of Fig. 1. For two levels, the dependencies show that each processor (in this case) should receive 6 extra elements of the original vector to complete its task. It is clear that after each level there must be new communications, and the size of these communications grows with each new level. To avoid all these communications, in [10] it was proposed to replicate all the data in all the processors, so that instead of using communications each processor can compute the new elements that it needs. (This idea was proposed for the 2D-DWT of a matrix, but the same idea can be used for any dimension).
In this way, all the communications (after the first distribution of the data) can be avoided. The extra computations are much less time consuming than the communications that they replace. We call this a "full replication" technique. However, this means that all the processors must hold the full data. For very large sets of data, this may become a serious problem. Our proposal in this paper is to refine this idea, determining exactly how many elements each processor will need to complete its part of a k-level DWT without further communications, so that (hopefully) the memory needed by each processor is much less than with the "full replication" technique; we call our idea the "partial replication" technique. Considering a vector distributed over a set of processors, we want to determine the number N_{L,k} of elements of the original vector (apart from those originally in the processor) that each processor will need to complete the k-level DWT using wavelet filters of length L. To prove our main result, we need first the following lemma:

Lemma 1. Consider the parallel computing of k levels of the 1D-DWT, applied over a vector of length N and distributed over P processors (0, ..., P−1) in subvectors of length N/P. Let us assume that each processor has already computed the k-level DWT of its subvector (and, to do that, had already received all the N_{L,k} elements needed from the original vector). For each processor to compute M additional elements of the k-level DWT without further communications, it must receive 2^k M additional elements from the original vector.

Proof. Let us prove the lemma for k = 1 (one level). Let v be the original vector, and v_dwt the vector transformed by a single DWT level. Let us consider a single processor P, which initially holds the subvector of v contained between the indexes [i_ini, i_end], and that has already computed its part of the v_dwt vector, between the indexes [i_wini, i_wend]. For this, processor P has received from another processor L − 2 extra elements, from v[i_end + 1] to v[i_end + L − 2]. It is desired that this processor computes as well some of the elements of v_dwt "to the right" of those already computed: v_dwt[i_wend + 1], v_dwt[i_wend + 2], ... . We need to know how many extra elements of v the processor P must hold to compute these extra elements of the transformed vector v_dwt. Starting with the element v_dwt[i_wend + 1], to compute this element the processor P would need the elements v[i_end + 1], v[i_end + 2], up to v[i_end + L] (recall that the length of the filter is L). However, the first L − 2 of these elements had already been sent to P, so that the only new elements needed are v[i_end + L − 1] and v[i_end + L]. For the next element to be computed, v_dwt[i_wend + 2], and taking the downsampling into account, the processor P needs v[i_end + 3], v[i_end + 4], up to v[i_end + L + 2]. Again, two more elements from v are needed to compute a new one of v_dwt. It should be clear now that: for the processor P to compute M additional elements of the transformed vector v_dwt it needs 2M additional elements from the original vector v.
This has been proved for a single DWT level. However, the reasoning is exactly the same to go from level k − 1 to level k. Let us consider the case put forward in Lemma 1, when each processor has computed its part of the k-level DWT, and it is needed to compute M extra elements of the k-level DWT, located, as before, "to the right" of the already computed elements. Using the result just obtained, to compute these new M elements the processor needs 2M elements of the (k−1)-level DWT vector; to compute these elements, 4M elements of the (k−2)-level DWT will be needed. Repeating the process, we obtain that to compute the M elements of the k-level DWT, 2^k M elements of the original vector (level 0) will be needed, which proves the lemma.

We can examine again Fig. 1 to check the lemma. We see that, to compute its part of the first level, the processor P0 had received the elements v_9 and v_10. If we want P0 to compute as well s^1_5, it needs to have access to v_9, v_10, v_11 and v_12; but since v_9 and v_10 were already available, it only needs v_11 and v_12. To compute also s^1_6, it would need another two elements (v_13 and v_14), and so on. Now, the number of elements that must be sent to each processor is easily obtained through the following proposition:

Proposition 2. Consider the parallel computing of k levels of the 1D-DWT, using wavelet filters of length L, applied over a vector of length N, and distributed over P processors (0, ..., P−1) in subvectors of length N/P. To complete the task without further communications, each processor should receive
N_{L,k} = (2^k − 1) · (L − 2)        (4)

additional elements from the original vector v, located "to the right" of those already available to processor P.

Proof. We will prove this proposition using induction over the number of levels k. Let us consider first the base case, with k = 1. The base case is already proven using Proposition 1, which states that the number of additional elements that each processor needs is

L − 2 = N_{L,1} = (2^1 − 1) · (L − 2).        (5)

Let us suppose that (4) is true for k levels:

N_{L,k} = (2^k − 1) · (L − 2);        (6)

we will prove that the result is true for k + 1 levels:

N_{L,k+1} = (2^{k+1} − 1) · (L − 2).        (7)

To compute the (k+1)-th DWT level, each processor must have computed previously the k-th level. Therefore, by the induction hypothesis, each processor must have received N_{L,k} = (2^k − 1) · (L − 2) additional elements from the original vector.
Apart from these N_{L,k} elements from the original vector, each processor must receive some extra elements to complete the computation of the (k+1)-th DWT level. These elements are easily determined, since the computation of the (k+1)-th DWT level amounts to applying a 1-level DWT over the k-level DWT vector. So, using the base case, each processor must receive L − 2 additional elements from the k-level DWT vector. Therefore, to complete the (k+1)-level DWT, each processor needs to have available N_{L,k} = (2^k − 1) · (L − 2) elements of the original vector, plus L − 2 additional elements from the neighbor processor corresponding to the k-level DWT vector. However, we want each processor to complete its task receiving only elements of the original vector. Using the lemma, we obtain that to compute these L − 2 elements from the k-level DWT vector, 2^k · (L − 2) elements of the original vector v are needed. Then, we obtain easily that the number of elements of the original vector needed is:

(2^k − 1) · (L − 2) + 2^k · (L − 2) = (2^{k+1} − 1) · (L − 2).        (8)

3.3 Practical Use of Proposition 2
The practical interest of Proposition 2 resides in the elimination of communications when computing the parallel DWT in a distributed memory environment. Consider the following example; let us suppose that we want to compute the 8-level DWT of a vector of length, say, 10^7 elements, distributed among 10 processors (P0, ..., P9), using wavelet filters of length 4. Then, each processor must compute the 8-level DWT of a subvector of size 10^6, but it needs some elements of the processor located "to its right" (except the processor P9, which will receive them from processor P0). There are three different procedures:

1. Repeated use of the 1-level DWT and of Proposition 1. Then, in the initial distribution each processor should receive 10^6 elements, plus another 2 from the processor holding the elements to the right. Then, after the computation of each level, each processor must receive another 2 elements (from the former level), so that there must be 8 collective communications, one per level.
2. Use of a full replication strategy. Then, no extra communication would be needed, but every processor needs to hold the full vector of 10^7 elements. This would be, somehow, equivalent to computing the 8-level DWT in a shared memory multiprocessor.
3. Partial replication strategy, using Proposition 2. In this case, each processor will need its subvector of 10^6 elements, plus (applying Proposition 2) (2^7 − 1) · 2 = 254 extra elements in the initial distribution; with these extra elements (and performing the extra computations associated with these points) all the processors will be able to complete the 8-level DWT without extra communications.

If the filter is longer (say, L = 30), then (2^7 − 1) · (30 − 2) = 3556; clearly, the number of extra elements to be initially sent to each processor is substantially
higher (and so is the number of computations for these extra points). However, since all the communications after the first one are no longer needed, it is still probably a faster strategy.

Other considerations about the practical use of Proposition 2:

– This proposition has been proved for the 1D parallel DWT, but it can be used similarly for the DWT-2D or for higher dimension DWTs. For example, Fig. 3 displays a matrix distributed by blocks, where a certain processor holds the block A[r_1:r_2, c_1:c_2]. To compute two DWT levels with L = 4 of that block, for each row the processor needs (2^2 − 1) · (4 − 2) = 6 extra elements on the right, and the same with the columns.

Fig. 3. Two 2D-DWT levels over a matrix, wavelet filters with L = 4 (4 coefficients)
Some care must be taken to include the "diagonal" blocks, since these will be necessary too.
– Even in the case when the number of elements given by equation (4) is very high, it would still be possible to use (4) with a different approach. In such a case, we might use (4) to perform communications only at selected levels. As an example, let us suppose that we want to compute 10 DWT levels of a large vector using a filter with L = 24. We might choose between sending (2^10 − 1) · (L − 2) = 22506 elements right from the beginning (after which no more communications should take place), or sending enough elements for 5 levels, (2^5 − 1) · (L − 2) = 682, computing the first 5 levels, sending again enough elements (again 682) for the final 5 levels, and finally computing the last 5 levels. Depending on the communication parameters of the system, we might determine an optimal strategy for each case.
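A small sanity check of formula (4) against the figures quoted for the 10-level, L = 24 example; the helper name is illustrative.

```python
def extra_elements(L, k):
    """N_{L,k} of Proposition 2: elements each processor must receive in the
    initial distribution to compute k DWT levels with filters of length L
    without further communications."""
    return (2 ** k - 1) * (L - 2)

assert extra_elements(24, 10) == 22506   # all 10 levels at once
assert extra_elements(24, 5) == 682      # enough for 5 levels, sent twice
```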
3.4 Conclusions
The work presented allows the parallel computation of several DWT levels without any communication (except, maybe, in the setup phase). This idea was already presented in [10], but there it was proposed to replicate all the data. In this paper we have proved that the exact number of extra elements needed can be easily computed. Therefore, these elements can be sent to each processor at the beginning of the computations, avoiding any further communication.
The practical impact of this proposal will depend on the application and on properties of the parallel computer where it is implemented. However, it is usually acknowledged that avoiding communications at the expense of a few extra computations will, in most cases, improve the execution times and the speed-up of the parallel DWT algorithms.
Acknowledgements Partially supported by the Vice-rectorate for Research, Development and Innovation of UPV (Project 20080009), by the Generalitat Valenciana (Project 20080811), and by Spanish MICINN (Project TIN 2008-06570-C04-02).
References 1. Mallat, S.G.: A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. Pattern Anal. Machine Intell. 2, 674–693 (1989) 2. Daubechies, I.: Ten Lectures on Wavelets. In: CBMS-NSF Regional Conference Series In Applied Mathematics, vol. 61. Society for Industrial and Applied Mathematics, Philadelphia (1992) 3. Beylkin, G., Coifman, R.R., Rokhlin, V.: Fast wavelet transforms and numerical algorithms I. Commun. Pure Appl. Math. XLIV, 141–183 (1991) 4. Chui, C.K.: Wavelets: A Mathematical Tool for Signal Analysis. In: SIAM Monographs on Mathematical Modeling and Computation. SIAM, Philadelphia (1997) 5. Starck, J.L., Murtagh, F., Bijaoui, A.: Image Processing and Data Analysis. In: The Multiscale Approach. Cambridge University Press, Cambridge (1998) 6. Sarkar, T.K., Salazar-Palma, M., Wicks, M.C.: Wavelet Applications in Engineering Electromagnetics. Artech House, Norwood (2002) 7. Chan, T.F., Chen, K.: On two variants of an algebraic wavelet preconditioner. SIAM J. Sci. Comput. 24, 260–283 (2002) 8. Chen, K.: Matrix Preconditioning Techniques and Applications (2005) 9. Nielsen, O.M., Hegland, M.A.: A Scalable Parallel 2D Wavelet Transform Algorithm. REPORT TR-CS-97-21, Australian National University (1997), http://cs.anu.edu.au/techreports 10. Chaver, D., Prieto, M., Pi˜ nuel, L., Tirado, F.: Parallel Wavelet for Large Scale Image Processing. In: Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IEEE IPDPS 2002), Florida, USA (April 2002) 11. Wickerhauser, M.V.: Adapted Wavelet Analysis from Theory to software. A.K. Peters, IEEE Press (1994) 12. Marino, F., Piuri, V., Swartzlander Jr., E.E.: A parallel implementation of the 2-D discrete wavelet transform without interprocessor communications. IEEE Transactions on Signal Processing 47(11), 3179–3184 (1999)
Dynamic Load Balancing for Adaptive Parallel Flow Problems Stanislaw Gepner, Jerzy Majewski, and Jacek Rokicki Warsaw University of Technology The Institute of Aeronautics and Applied Mechanics Nowowiejska 24, 00-665 Warsaw, Poland
[email protected],
[email protected],
[email protected] http://c-cfd.meil.pw.edu.pl
Abstract. Large scale computing requires parallelization in order to arrive at a solution in reasonable time. Today parallelization is a standard in the simulation of fluid flow problems. On the other hand, adaptation is a technique that allows for dynamic modification of the mesh as the need for locally higher resolution arises. Adaptation used during a parallel simulation leads to an unbalanced numerical load. This in turn decreases the efficiency of parallelization. Dynamic load balancing strategies should be applied in order to ensure proper parallelization efficiency. The paper presents the potential benefits of applying dynamic load balancing to adaptive flow problems simulated in parallel environments. Keywords: dynamic load balancing, parallel cfd, adaptation, mesh refinement.
1 Introduction
Parallelization is the only effective approach to deal with modern scientific or engineering flow simulations. A common approach to parallelization is the one based on domain decomposition, as it is a natural and relatively easy to implement approach (see Fig. 1 for an example of domain decomposition applied to a complex geometry). For domain decomposition based parallelization to be effective, the partitioning into subdomains must satisfy certain conditions. Firstly, the computational load assigned to the subdomains must be equally distributed. Usually the computational cost is proportional to the number of mesh entities assigned to partitions, so to ensure that the numerical load is balanced, it is enough to divide mesh elements equally among partitions. Good quality partitioning also requires the volume of communication during calculation to be kept at its minimum. This can only be achieved by examining the topology of the mesh being partitioned, and minimizing the resulting edge cuts. For static applications, where the numerical load distribution and topology do not change during computation, it is enough to perform partitioning only once, prior to the computation. Such use is often referred to as static load balancing.
Fig. 1. Mesh of a fighter plane partitioned prior to computation. CoolFluid project test-case [4].
When computing certain flow features, such as shocks, the technique of adaptation is an effective tool to limit the number of necessary cells [2]. It allows for high grid resolution in areas of importance, while applying a relatively coarse mesh in other regions, thus lowering the overall numerical cost while keeping the solution properly resolved. Figure 2 presents an example of adaptation used to capture a complex shock wave structure for the Onera M6 wing geometry, together with the solution-adapted mesh.
Fig. 2. Adapted surface mesh for Onera M6 wing geometry (left), and the resulting density field (right)
The use of adaptation during a parallel simulation leads to the numerical load change throughout the domain, making the initial load balancing ineffective. If
one wants to maintain a proper parallel efficiency, a dynamic load balancing scheme has to be introduced. In addition to the conditions fulfilled by a static mesh partitioner, a dynamic one has to satisfy some additional criteria. It should work well in parallel, be efficient in memory consumption and have a short runtime, so as not to negatively influence the overall performance. An important feature of a partitioner is its incrementality, which means that the data migration resulting from repartitioning will be kept to a necessary minimum. The problem of dynamic balancing of numerical load was considered in [5,6,7,8,9,10,11]; the present approach concentrates on the evaluation of benefits of DLB in aerodynamically motivated applications. This paper presents results from the implementation of the DLB algorithm into the in-house FLOW code [2], a parallel residual distribution flow solver.
2 The Measure of Partition Quality
To estimate the quality of load balance, let W be the total numerical load, equal to the total number of mesh entities. If p stands for the number of processor cores used, then the load considered optimal and assigned to the i-th processor would be w_i^opt = W/p. The actual load assigned is w_i, so a load balance indicator is defined in the following way:

β = max_{0 ≤ i < p} ( w_i / w_i^opt )        (1)

If defined in this way, the partitioning is of acceptable quality when the load balance indicator is close to one. This means that the numerical load assigned to the most heavily loaded partition is close to the optimal one.
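For concreteness, a one-line evaluation of the indicator (1) from an assumed array of per-partition loads:

```python
def load_balance_indicator(loads):
    """beta of Eq. (1): heaviest partition load w_i divided by the optimal
    load w_opt = W / p."""
    W, p = sum(loads), len(loads)
    return max(loads) / (W / p)

# e.g. load_balance_indicator([120, 100, 95, 85]) evaluates to 1.2
```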
3 Parallel Mesh Refinement
In general, adaptation for three-dimensional complex geometries is a non-trivial problem [2,3]. It is even more so if it is to be performed in parallel. The following isotropic mesh refinement algorithm has been implemented into the FLOW solver (a schematic sketch of the driver loop is given after the list):

1. An error indicator estimates the error connected to each of the mesh cells (at the particular moment of the simulation iterative process).
2. All cells exceeding a chosen error threshold are marked for refinement.
3. Nodes are introduced on the edges of the marked cells.
4. New cells are created, and unnecessary cells are removed from the data structure.
5. A new solution process is restarted on the adapted mesh.
6. The simulation is returned to point one, if the appropriate convergence criterion is not met.
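A schematic sketch of a driver for steps 1-6 follows; all names (solver, error indicator, mesh methods) are hypothetical and do not reflect the actual interface of the FLOW solver.

```python
def adaptive_simulation(mesh, solver, error_indicator, threshold, max_cycles):
    """Hypothetical driver for the isotropic refinement procedure (steps 1-6)."""
    solution = None
    for _ in range(max_cycles):
        solution = solver.solve(mesh, initial=solution)            # (re)start the solution (step 5)
        errors = error_indicator(mesh, solution)                   # step 1: per-cell error estimate
        marked = [c for c in mesh.cells if errors[c] > threshold]  # step 2: mark cells
        if not marked or solver.converged(solution):               # step 6: convergence check
            break
        mesh.insert_nodes_on_marked_edges(marked)                  # step 3: new nodes on edges
        mesh.rebuild_cells(marked)                                 # step 4: create/remove cells
    return solution
```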
Fig. 3. Mesh refinement. Cells 1 and 2 have been marked to be refined, new nodes have been added to them.
The adaptation process is schematically illustrated in Fig. 3. When working in parallel, an additional step has to be taken to ensure coherence in the regions of overlap. Fig. 4 presents the mesh configuration resulting from a series of consecutive refinements.
Fig. 4. Computational mesh after fourth refinement steps
4 Repartitioning Phase
During the repartitioning phase a new, better balanced mesh decomposition is created. It is accomplished using one of many available partitioning tools. Subsequently the mesh entities are redistributed between processors according to the obtained coloring.

4.1 Mesh Partitioning Tool: Graph Partitioner
To calculate the mesh decomposition an outside tool is used. In this work the ParMETIS [1] graph partitioner was used. It is a parallel version of the well-known METIS package. It is based on the multilevel approach, and is capable of computing graph partitions of good quality. The main factors in its selection were its parallel efficiency and ease of use.
Fig. 5. Partitions are being modified according to the previously calculated coloring. On the left is the original partitioning and coloring, the middle picture shows original partitioning with new coloring. On the right is the partitioning after elements have been sent / deleted.
4.2 Mesh Partitioning
To perform the dynamic load balancing procedure, the following steps have to be undertaken:

1. Elements of the mesh being repartitioned have to be given a unique numbering throughout the whole partitioned domain. This makes mesh elements easily distinguishable throughout the parallel environment.
2. The grid topology has to be described as a connectivity graph. When working with ParMETIS the so-called distributed CSR format is used (a sketch of this format is given below).
3. With the graph connectivity ready it is possible for ParMETIS to calculate the new mesh partitioning.
4. According to the calculated partitioning, some data movement has to be performed. During this phase mesh elements are selected for communication or deletion appropriately. This step is strongly dependent on the data structure used by the simulation software.

Applying the new partitioning involves massive changes to the simulation data structure. Some mesh entities have to be sent to other computing nodes while

Table 1. Comparison of repartition time, graph partitioning time, average iteration time; case of 5078927 nodes

No of cores              32          72          80
Simulation time          311500 s    157500 s    129500 s
Total repartition time   663 s       165 s       128 s
Partitioner run time     5 s         2 s         3 s
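As a sketch of the distributed CSR description mentioned in step 2, the helper below assembles the arrays vtxdist, xadj and adjncy from an assumed mapping of locally owned elements to their face neighbours; the names are illustrative only.

```python
def build_distributed_csr(owned_neighbors, elems_per_rank):
    """owned_neighbors: {global element id -> global ids of face neighbours}
    for the elements owned by this rank; elems_per_rank: element counts on all ranks.
    Returns vtxdist (global element ranges per rank) and the local CSR graph."""
    vtxdist = [0]
    for n in elems_per_rank:
        vtxdist.append(vtxdist[-1] + n)

    xadj, adjncy = [0], []
    for elem in sorted(owned_neighbors):        # local vertices in global-number order
        adjncy.extend(owned_neighbors[elem])    # one graph edge per shared face
        xadj.append(len(adjncy))
    return vtxdist, xadj, adjncy
```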
Fig. 6. Impact of numerical imbalance on computation time. Meshes used for comparison, the convergence history for both cases. In the balanced case convergence time is reduced.
some must be removed. Also the received data have to be added to the data structure accessible to a given CPU. It is a process strongly dependent on the data structure used within the solver. Once the information is exchanged the solver is ready to restart calculations.
Fig. 7. Mesh after consecutive refinement. Dynamic load balancing has been applied on the left. The number next to the mesh denotes its load balance indicator (LBI).
Fig. 8. Convergence as a function of computational time, with and without load balancing used
This process is illustrated on Fig. 5 where a three dimensional domain is being repartitioned. On the left shown is the initial partitioning. In the middle presented is the new coloring calculated during the graph partitioning phase plotted onto original partitioning. The rightmost picture shows the mesh after appropriate data migration has been carried out.
5 Impact of Dynamic Load Balancing on Runtime
The aim of using dynamic load balancing during a parallel simulation is to ensure that proper parallel efficiency is maintained during runtime. Figure 6 shows how the computational time is influenced by an unbalanced numerical load. The case for which the load balance indicator is close to 2 (Fig. 6a) needs more time to converge than the properly balanced one. Figure 7 presents the history of the computational mesh after consecutive refinement steps, used during a parallel flow simulation. On the right hand side is the case where load balancing has not been used; on the left side the case with load balancing active is presented. Next to each of the presented meshes its load balance indicator is noted. The graph in Fig. 8 shows the convergence history for both of the cases respectively. The case for which load balancing is used has a load balance indicator close to one throughout the simulation process, while for the situation without load balancing the LBI measure increases with the adaptation steps taken. This has an influence on convergence time. The numerical cost of rebalancing is relatively small in comparison to the total simulation time. Table 1 shows a comparison of the CPU time required to perform the simulation. Compared are the time of a single iteration, the time required to perform the graph partitioning (ParMETIS run time) and the total time required to perform a full rebalancing step. It is worth noticing that the total rebalancing time decreases as the number of CPUs grows. The case under investigation is a three-dimensional simulation of the flow around the Onera M6 wing. A mesh of 5078927 nodes (32622210 cells) is used. Results were obtained by running the FLOW solver on a cluster containing 20 Quad-Core AMD Opteron(tm) processors (4 cores per processor) with 2 GB of memory per core, connected by a Fast Ethernet connection.
6 Conclusions and Further Work
The paper presented a dynamic load balancing algorithm which is an effective tool for adaptive engineering applications run in parallel. It was shown that it has a significant impact on the runtime and overall efficiency of a simulation. The rebalancing step itself is relatively cheap numerically, even when using a larger number of computing nodes.
Acknowledgments This work was supported by the FP6 EU ADIGMA project (Adaptive HigherOrder Variational Methods for Aerodynamic Applications in Industry), and by MNiSW, grant No: N N516 4280 33, Dynamic load balancing for adaptive engineering applications in parallel environments, 2007-2009.
References 1. Karypis, G., Schloegel, K., Kumar, V.: PARMETIS* Parallel Graph Partitioning and Sparse Matrix Ordering Library 2. Majewski, J.: Anisotropic solution-adaptive technique applied to simulation of steady and unsteady compressible flows. In: Proceedings of the International Conference on Computational Fluid Dynamics, Ghent, Belgium (2006) 3. Majewski, J.: Ph. D. Thesis Grid generation for flow problems in complex geometries (Rozprawa Doktorska, Generacja siatek dla symulacji przepyww w obszarach zoonej geometrii.) Warsaw University of Technology (2001) (in Polish) 4. COOLFluid project home page, http://coolfluidsrv.vki.ac.be/trac/coolfluid 5. Cybenko, G.: Dynamic load balancing for distributed memory multiprocessors. J. Parallel Distrib. Comput. 7, 279–301 (1989) 6. Devine, K., Boman, E., Heaphy, R., Hendrickson, B., Teresco, J., et al.: New challenges in dynamic load balancing. Applied Numerical Mathematics 52(2-3), 133– 152 (2005) 7. Touheed, N., Selwood, P., Jimack, P.K., Berzins, M.: A comparison of some dynamic load-balancing algorithms for a parallel adaptive flow solver. Parallel Computing 26(12), 1535–1554 (2000) 8. Karypis, G., Schloegel, K., Kumar, V.: Dynamic Repartitioning of Adaptively Refined Meshes. Department of Computer Science and Engineering (1998) 9. Schloegel, K., Karypis, G., Kumar, V.: Multilevel Diffusion Schemes for Repartitioning of Adaptive Meshes. Journal of Parallel and Distributed Computing 47(2), 109–124 (1997) 10. Walshaw, M., Cross, M.G.: Everett: Parallel dynamic graph partitioning for adaptive unstructured meshes. J. Parallel Distrib. Comput. 47, 102–108 (1997) 11. Hendrickson, B., Devine, K.: Dynamic load balancing in computational mechanics. Comp. Meth. AppL Mech. Engrg. 184, 485–500 (2000)
A Balancing Domain Decomposition Method for a Discretization of a Plate Problem on Nonmatching Grids Leszek Marcinkowski Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Banacha 2, 02-097 Warszawa, Poland
[email protected]
Abstract. In this paper we present a balancing domain decomposition method for solving a discretization of a plate problem on nonmatching grids in 2D. The local discretizations are Hsieh-Clough-Tocher macro finite elements. On the interfaces between adjacent subdomains two mortar conditions are imposed. The condition number of the preconditioned problem is almost optimal, i.e. it is bounded poly-logarithmically with respect to the mesh parameters.
1 Introduction
Recently, there has been a large number of investigations devoted to domain decomposition methods for discretizations on nonmatching grids. Most of the methods for discretizations of partial differential equations are built on one global grid; however, there is a growing interest in discretizations with nonmatching local triangulations. These types of approaches are important, among others, for mesh generation, problems with local singularities, contact problems, multiphysics simulations, and hp-adaptive methods. In our paper we consider a mortar element method, which is based on coupling nonmatching meshes of adjacent substructures by imposing integral conditions on the traces of finite element functions from these subdomains on the subdomain interface. For more details we refer to [1]. In each subdomain we use Hsieh-Clough-Tocher (HCT) macro elements. There are many effective iterative solvers for the mortar discretization of second order problems, cf. e.g. [2], [3], [4], [5], [6] and references therein, but there is a more limited number of investigations devoted to nonconforming discretizations of plate problems, see [7], [8], [9], [10]. Balancing domain decomposition methods form a class of fast scalable parallel solvers for systems of equations arising from different discretizations of many types of problems, cf. e.g. [11], [12], [13], [14] and references therein. However, to our knowledge there are no results for a BDD type of method for a discretization of a fourth order problem built on nonmatching grids.
This work was partially supported by Polish Scientific Grant: N N201 0069 33.
In our paper we first introduce a mortar discretization of a model clamped plate problem, cf. [15], which in the subdomains utilizes HCT macro finite elements. Then, to construct our method, which is of a substructuring type, the local variables interior to subdomains have to be eliminated and a new problem is formulated. We construct our balancing domain decomposition preconditioner using an abstract Additive Schwarz Method (ASM) scheme, cf. [16], [12]. The resulting preconditioner is almost fully parallel, as is usual in balancing domain decomposition methods. Finally, we present almost optimal convergence estimates and discuss a few implementation issues. The remainder of the paper is organized as follows: in Section 2 we present a discrete problem, in Section 3 we construct our BDD method and give convergence estimates, and Section 4 is devoted to implementation.
2 Discrete Problem
Let Ω be a polygonal domain in the plane. Our model clamped plate problem is of the form: Find u* ∈ H^2_0(Ω) such that

a(u*, v) = ∫_Ω f v dx   ∀v ∈ H^2_0(Ω),        (1)

where

a(u, v) = ∫_Ω [Δu Δv + β(2 u_{x1x2} v_{x1x2} − u_{x1x1} v_{x1x1} − u_{x2x2} v_{x2x2})] dx,

u* is the displacement, f ∈ L^2(Ω) is the body force, and the constant β ∈ (1/2, 1). Here we use the notation: H^2_0(Ω) = {v ∈ H^2(Ω) : v = ∂_n v = 0 on ∂Ω}, ∂_n u is the unit outer normal derivative on ∂Ω, and u_{x_i x_j} = ∂²u/(∂x_i ∂x_j) for i, j = 1, 2. This problem has a unique solution, which follows from the Lax-Milgram theorem.

We assume that our domain Ω is decomposed into non-overlapping triangular subdomains that form a shape regular coarse triangulation, i.e.

Ω̄ = ∪_{k=1}^{N} Ω̄_k

and Ω̄_k ∩ Ω̄_l is a common vertex, an edge or the empty set, cf. e.g. [17]. Let us introduce H_k = diam(Ω_k) and let H = max_k H_k be a parameter of this coarse triangulation. We introduce in each substructure Ω_k a triangulation T_h(Ω_k) made of triangles, which is quasi-uniform with a parameter h_k = max_{τ ∈ T_h(Ω_k)} diam(τ), cf. [17].
Let the local finite element space Xh (Ωk ) be defined as, cf. Figure 1, a subspace of C 1 (Ω k )∩HC2 (Ωk ) formed by all functions which are in P3 (τi ) for each of three triangles τi ⊂ τ, i = 1, 2, 3 formed by connecting the vertices of τ ∈ Th (Ωk ) to its centroid. Here HC2 (Ωk ) = {v ∈ H 2 (Ωk ) : v = ∂n v = 0 on ∂Ωk ∩ ∂Ω}.
Fig. 1. HCT macro element
The degrees of freedom of the HCT element are the following: {v(p), v_{x1}(p), v_{x2}(p), ∂_n v(m)}, where p is a vertex of an element and m the midpoint of an edge of an element, cf. Figure 1. We introduce an auxiliary global space X_h(Ω) as

X_h(Ω) = ∏_{k=1}^{N} X_h(Ω_k).

An important role is played by the global interface

Γ = ∪_k ∂Ω_k \ ∂Ω.
Note that each interface which is a common edge to two adjacent subdomains: Γkl = ∂Ωk ∩ ∂Ωl ⊂ Γ inherits two 1D triangulations: one from Ωk - Ti,h (Γij ), and one from Ωl - Tj,h (Γij ), cf. Figure 2. We choose one side as a master (mortar) denoted by γkl ⊂ ∂Ωk while the second one as a slave (nonmortar) δlk ⊂ ∂Ωl . Here, the choice of the master side is arbitrary.
Let M^t_{h_l}(δ_{lk}) be an auxiliary space defined on δ_{lk} as the space of C^1-smooth functions which are piecewise cubic over each element of the h_l triangulation of δ_{lk}, except for the two elements which touch the ends of δ_{lk}, where the functions are piecewise linear. We also define a space M^n_{h_l}(δ_{lk}) as the space of continuous functions which are piecewise quadratic over each element of the h_l triangulation of δ_{lk} and are piecewise linear on the two elements which touch the ends of δ_{lk}.
Fig. 2. Independent meshes on an interface
We can now define a discrete space V^h as the subspace of X_h(Ω) of all functions which are continuous at the cross-points (the common vertices of substructures), and satisfy the following two mortar conditions on each interface Γ_{ij} = δ_{ji} = γ_{ij} ⊂ Γ:

∫_{δ_{ji}} (u_i − u_j) φ ds = 0   ∀φ ∈ M^t_h(δ_{ji}),        (2)

∫_{δ_{ji}} (∂_n u_i − ∂_n u_j) ψ ds = 0   ∀ψ ∈ M^n_h(δ_{ji}).

Note that a function u ∈ V^h is continuous at a cross-point but in general ∇u is discontinuous there. Then we can introduce our discrete problem: Find u*_h ∈ V^h such that

a_h(u*_h, v) = ∫_Ω f v dx   ∀v ∈ V^h,        (3)
where a_h(u, v) = Σ_k a_k(u, v) for

a_k(u, v) = ∫_{Ω_k} [Δu Δv + β(2 u_{x1x2} v_{x1x2} − u_{x1x1} v_{x1x1} − u_{x2x2} v_{x2x2})] dx,

and β, f are as in the differential problem (1). There is a unique solution to (3) and the error estimates are established, see [15].
3 Balancing Domain Decomposition
In this section we present a balancing domain decomposition method following the abstract scheme of [12], cf. also Chapters 4-6 in [16]. Our method is a substructuring one, so first we decompose each function u ∈ X_h(Ω_k) into two parts: u = H_k u + P_k u, where P_k u is the a_k-orthogonal projection of u onto X_{h,0}(Ω_k) = {u ∈ X_h(Ω_k) : u = ∂_n u = 0 on ∂Ω_k}, i.e. P_k u ∈ X_{h,0}(Ω_k) and it satisfies

a_k(P_k u, v) = a_k(u, v)   ∀v ∈ X_{h,0}(Ω_k),

and H_k u = u − P_k u is the discrete biharmonic part of u, uniquely defined by

a_k(H_k u, v) = 0   ∀v ∈ X_{h,0}(Ω_k),
Tr|_{∂Ω_k} H_k u = Tr|_{∂Ω_k} u   on ∂Ω_k,        (4)

where Tr|_{∂Ω_k} u = (u|_{∂Ω_k}, ∇u|_{∂Ω_k}) are the traces of u onto ∂Ω_k in the H^2 sense. We also introduce the global discrete biharmonic part of any function u = (u_1, ..., u_N) ∈ X_h(Ω) as follows: Hu = (H_1 u_1, ..., H_N u_N), the local space of discrete biharmonic functions: W_k = H_k X_h(Ω_k), and the space of all discrete biharmonic functions satisfying the mortar conditions: W = {u ∈ V^h : u = Hu} ⊂ W_1 × ... × W_N. Note that the discrete biharmonic function has a so-called minimal energy property:

a_k(H_k u, H_k u) = min{a_k(w, w) : Tr|_{∂Ω_k} w = Tr|_{∂Ω_k} H_k u = Tr|_{∂Ω_k} u},

and that a discrete biharmonic function in Ω_k is uniquely defined by its trace Tr|_{∂Ω_k}, i.e. the values of the degrees of freedom interior to this subdomain can be computed by solving (4). Thus a function in W is uniquely defined by the values of all degrees of freedom at nodes which are on masters (mortars) or which are vertices of subdomains. The degrees of freedom at nodes (vertices and midpoints) which are on slaves are determined by (2) and the ones at nodal points interior to subdomains by (4).
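For illustration, once the local stiffness matrix is split into interior (I) and boundary (B) blocks, the interior part of the discrete biharmonic extension (4) reduces to one local solve; the dense sketch below assumes this block partitioning.

```python
import numpy as np

def discrete_biharmonic_interior(A_II, A_IB, u_B):
    """Interior degrees of freedom of H_k u from (4): requiring
    a_k(H_k u, v) = 0 for all interior test functions v gives the linear system
    A_II u_I = -A_IB u_B, where u_B collects the prescribed trace values."""
    return np.linalg.solve(A_II, -A_IB @ u_B)
```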
Next we can decompose the solution of the discrete problem (3) into two parts: u*_h = P u*_h + ũ*_h, where P u*_h := (P_1 u*_h, ..., P_N u*_h) and ũ*_h = H u*_h is the discrete biharmonic part of u*_h. Note that P u*_h can be computed by solving N independent local problems, which can be done in parallel, and ũ*_h ∈ W is a unique solution of the following problem:

a_h(ũ*_h, v) = ∫_Ω f v dx   ∀v ∈ W.        (5)
We now introduce our balancing method using the abstract scheme of the additive Schwarz method (ASM), cf. [16]. We have to define a decomposition of W into subspaces and introduce local projection-like operators onto these subspaces. We first introduce extension operators. For each subdomain Ω_i, an extension operator E_i : C^1(Ω̄_i) → W is defined as follows:

– E_i u(c_r) = u(c_r)/N(c_r) for any cross-point c_r which is a vertex of ∂Ω_i \ ∂Ω, where N(c_r) is the number of domains which have c_r as a vertex,
– ∇E_i u(v) = ∇u(v) for any vertex v of the substructure Ω_i which is not on ∂Ω,
– E_i u(x) = u(x), ∇E_i u(x) = ∇u(x), and ∂_n E_i u(m) = ∂_n u(m) for any nodal point (vertex) x and any midpoint m of an element of the h_i triangulation of any mortar γ_{ij} ⊂ ∂Ω_i,
– Tr E_i u = 0 on the remaining masters and at the cross-points which are not on ∂Ω_i.
The values of the degrees of freedom of E_i u on a slave are defined by the mortar conditions (2), and interior to a subdomain Ω_j, j ≠ i, by (4), as E_i u ∈ W, i.e. it is discrete biharmonic. Note that for u = (u_1, ..., u_N) ∈ W we have

Σ_{k=1}^{N} E_k u_k = u,

so we can say that the set of operators {E_k}_{k=1}^{N} is a partition of unity. Next we introduce a coarse space:

V_0 = E_1 P_1(Ω_1) + E_2 P_1(Ω_2) + ... + E_N P_1(Ω_N) ⊂ W,

where P_1(Ω_k) ⊂ W_k is the space of linear polynomials defined over Ω̄_k. The dimension of V_0 is equal to 3N minus the number of vertices of substructures which are on ∂Ω. Let P_0 : W → V_0 ⊂ W be the orthogonal projection onto V_0 in terms of a_h(u, v), i.e. P_0 u ∈ V_0 and

a_h(P_0 u, v) = a_h(u, v)   ∀v ∈ V_0.

Next we introduce local spaces as follows:

V_k = (I − P_0) E_k W_{k,0},
where W_{k,0} is formed by all discrete biharmonic functions which have zero values at the three vertices of ∂Ω_k. Next we introduce operators T_k : W → V_k defined as T_k u = (I − P_0) E_k T̂_k u for T̂_k u ∈ W_{k,0}, which satisfies:

a_k(T̂_k u, v) = a_h(u, (I − P_0) E_k v)   ∀v ∈ W_{k,0}.        (6)

Note that the operator T_k is properly defined, as the problem (6) has a unique solution because the form a_k(u, v) is W_{k,0}-elliptic. We also see that T_k is a_h(·,·)-symmetric and nonnegative definite, as a_h((I − P_0) T_k u, v) = a_h((I − P_0) E_k T̂_k u, v) = a_k(T̂_k u, T̂_k v). Finally, we introduce the operator T : W → W as

T = P_0 + Σ_{k=1}^{N} T_k        (7)

and we replace (5) by a new problem

T ũ*_h = g,        (8)

where g = g_0 + Σ_{k=1}^{N} g_k with g_0 = P_0 ũ*_h and g_k = T_k ũ*_h. It follows from the next theorem that the problems (8) and (5) are equivalent. It is important to note that g, g_0, g_k can be computed without knowing ũ*_h, see e.g. [16], cf. also Section 4.

Theorem 1. For any u ∈ W, we have

c a_h(u, u) ≤ a_h(T u, u) ≤ C (1 + log(H/h))^2 a_h(u, u),

where H = max_k H_k with H_k = diam(Ω_k), h = min_k h_k, and c, C are positive constants independent of all mesh parameters h_k, H_k.

The proof of this theorem follows the lines of proofs in [18]. From this theorem it follows that if we apply a conjugate gradient iterative method for solving (8), then the number of iterations is proportional to (1 + log(H/h)), i.e. it grows logarithmically.
4 Implementation
Here we briefly describe an implementation of the method, cf. also [12], see Sections 3.2 and 3.3 there.
We first represent T_k and T in an operator form. Let ⟨·,·⟩ and ⟨·,·⟩_k be any inner products in W and W_{k,0}, respectively; e.g. they may be the respective L^2 products in these spaces. We introduce the following operators S : W → W, S_{k,0} : W_{k,0} → W_{k,0}, E_k^T : W → W_{k,0} by

⟨S u, v⟩ = a_h(u, v)   ∀v ∈ W,
⟨S_{k,0} u, v⟩_k = a_k(u, v)   ∀v ∈ W_{k,0},
⟨E_k^T u, v⟩_k = ⟨u, E_k v⟩   ∀v ∈ W_{k,0}.

Note that the operator S_{k,0} is nonsingular. Then, from the definition of T̂_k, see (6), it follows that for any v ∈ W_{k,0} we have

⟨S_{k,0} T̂_k u, v⟩_k = a_k(T̂_k u, v) = a_h(u, (I − P_0) E_k v)
  = a_h((I − P_0) u, E_k v) = ⟨S (I − P_0) u, E_k v⟩
  = ⟨E_k^T S (I − P_0) u, v⟩_k.

Thus we get

T̂_k = S_{k,0}^{-1} E_k^T S (I − P_0),   k = 1, ..., N,

which yields that

T_k = (I − P_0) E_k T̂_k = (I − P_0) E_k S_{k,0}^{-1} E_k^T S (I − P_0),   k = 1, ..., N.

Finally, the operator T can be represented as:

T = P_0 + (I − P_0) ( Σ_{k=1}^{N} E_k S_{k,0}^{-1} E_k^T ) S (I − P_0).        (9)

Remark 1. If we introduce nodal bases in W, W_k and W_{k,0} in a standard way, and take the l^2 inner product in these spaces, then (9) gives a matrix representation of our balancing method.

If we want to solve (8) by a CG or Richardson iteration method, then in each iteration we have to compute an action of T on a residual vector. According to the above operator formula for T, this requires computing an action of P_0 twice, but we now show how, using a simple trick, we can reduce this to one action of P_0. In real applications, to solve the problem (8), the CG method is utilized, but for simplicity of presentation we restrict ourselves to the Richardson iteration. Let u^0 be an arbitrary vector. Then in the first step, we compute

u^1 = u^0 − P_0(u^0 − ũ*_h)

that satisfies

P_0(u^1 − ũ*_h) = 0,        (10)
as P_0 is an orthogonal projection, i.e. we get u^1 − ũ*_h orthogonal to V_0 with respect to the bilinear form a_h(u, v). We next show that all successive Richardson iterates u^n have this property, i.e. u^n − ũ*_h is orthogonal to V_0. The Richardson iteration is defined as follows:

u^{n+1} = u^n − τ T(u^n − ũ*_h),   n = 1, 2, ...

where τ is a parameter chosen according to the spectral properties of T. Then, by a simple induction argument, we get

P_0(u^{n+1} − ũ*_h) = P_0(u^n − ũ*_h) − τ P_0 T(u^n − ũ*_h) = −τ P_0 T(u^n − ũ*_h),

but we have P_0 T = P_0 P_0 = P_0, which results in

P_0 T(u^n − ũ*_h) = P_0(u^n − ũ*_h) = 0,

because P_0 is an orthogonal projection operator. Thus we get, for the Richardson iteration method, that if u^1 − ũ*_h is orthogonal to V_0 then P_0(u^n − ũ*_h) = 0 as well. Hence our Richardson iterative method can be represented by

u^{n+1} = u^n − τ (I − P_0) Σ_{k=1}^{N} E_k S_{k,0}^{-1} E_k^T S (u^n − u*_h),   n = 1, 2, ...        (11)

Thus we have to apply P_0 only once per iteration. Taking u^1 such that P_0(u^1 − u*_h) = 0 yields that in each iteration of the Richardson method we have to:

– solve one global coarse problem to compute an action of P_0,
– compute actions of E_k, E_k^T,
– solve local independent problems with S_{k,0}.

Note that, except for computing the action of P_0, all the steps can be done in parallel. The same observation is true in the case of the CG method, see e.g. Sections 3.2 and 3.3 in [12].

Remark 2. Introducing the preconditioner B^{-1} = (I − P_0) Σ_{i=1}^{N} E_i S_{i,0}^{-1} E_i^T, the iteration method (11) can be interpreted as a Richardson iteration with this preconditioner applied to solve the problem

S u*_h = f̃,

which is an operator form of (5). Here f̃ ∈ W satisfies ⟨f̃, v⟩ = ∫_Ω f v dx for all v ∈ W.
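To make the operator counts of (9)-(11) concrete, a dense sketch of one preconditioned Richardson step follows; S, Z (columns assumed to span V_0), E_k and S_{k,0} are taken to be given as matrices.

```python
import numpy as np

def apply_P0(S, Z, r):
    """a_h-orthogonal projection onto V0: P0 r = Z (Z^T S Z)^{-1} Z^T S r."""
    return Z @ np.linalg.solve(Z.T @ S @ Z, Z.T @ (S @ r))

def apply_B_inv(S, Z, E, S0, r):
    """B^{-1} r = (I - P0) sum_k E_k S_{k,0}^{-1} E_k^T r, as in Remark 2."""
    z = sum(Ek @ np.linalg.solve(S0k, Ek.T @ r) for Ek, S0k in zip(E, S0))
    return z - apply_P0(S, Z, z)

def richardson_step(S, Z, E, S0, f, u, tau):
    """One step of (11); u is assumed to already satisfy P0(u - u*) = 0."""
    return u - tau * apply_B_inv(S, Z, E, S0, S @ u - f)
```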
References 1. Bernardi, C., Maday, Y., Patera, A.T.: A new nonconforming approach to domain decomposition: the mortar element method. In: Nonlinear partial differential equations and their applications. Coll`ege de France Seminar, vol. XI (Paris, 1989– 1991), Paris. Pitman Res. Notes Math. Ser., vol. 299, pp. 13–51. Longman Sci. Tech., Harlow (1994) 2. Bjørstad, P.E., Dryja, M., Rahman, T.: Additive Schwarz methods for elliptic mortar finite element problems. Numer. Math. 95(3), 427–457 (2003) 3. Braess, D., Dahmen, W., Wieners, C.: A multigrid algorithm for the mortar finite element method. SIAM J. Numer. Anal. 37(1), 48–69 (1999) 4. Dryja, M., Widlund, O.B.: A generalized FETI-DP method for a mortar discretization of elliptic problems. In: Domain decomposition methods in science and engineering, pp. 27–38. Natl. Auton. Univ. Mex., M´exico (2003) 5. Dryja, M.: A Neumann-Neumann algorithm for a mortar discetization of elliptic problems with discontinuous coefficients. Numer. Math. 99, 645–656 (2005) 6. Kim, H.H., Widlund, O.B.: Two-level Schwarz algorithms with overlapping subregions for mortar finite elements. SIAM J. Numer. Anal. 44(4), 1514–1534 (2006) 7. Xu, X., Li, L., Chen, W.: A multigrid method for the mortar-type Morley element approximation of a plate bending problem. SIAM J. Numer. Anal. 39(5), 1712–1731 (2001/2002) 8. Marcinkowski, L.: Domain decomposition methods for mortar finite element discretizations of plate problems. SIAM J. Numer. Anal. 39(4), 1097–1114 (2001) 9. Marcinkowski, L.: An Additive Schwarz Method for mortar Morley finite element discretizations of 4th order elliptic problem in 2d. Electron. Trans. Numer. Anal. 26, 34–54 (2007) 10. Marcinkowski, L., Dokeva, N.: A FETI-DP method for mortar finite element discretization of a fourth order problem. In: Langer, U., Discacciati, M., Keyes, D.E., Widlund, O.B., Zulehner, W. (eds.) Domain decomposition methods in science and engineering XVII. Lect. Notes Comput. Sci. Eng., vol. 60, pp. 583–590. Springer, Berlin (2008) 11. Mandel, J., Brezina, M.: Balancing domain decomposition for problems with large jumps in coefficients. Math. Comp. 65(216), 1387–1401 (1996) 12. Le Tallec, P.: Neumann-Neumann domain decomposition algorithms for solving 2D elliptic problems with nonmatching grids. East-West J. Numer. Math. 1(2), 129–146 (1993) 13. Pencheva, G., Yotov, I.: Balancing domain decomposition for mortar mixed finite element methods. Numer. Linear Algebra Appl. 10(1-2), 159–180 (2003) 14. Tu, X., Li, J.: A balancing domain decomposition method by constraints for advection-diffusion problems. Commun. Appl. Math. Comput. Sci. 3, 25–60 (2008) 15. Marcinkowski, L.: A mortar element method for some discretizations of a plate problem. Numer. Math. 93(2), 361–386 (2002) 16. Toselli, A., Widlund, O.: Domain decomposition methods—algorithms and theory. Springer Series in Computational Mathematics, vol. 34. Springer, Berlin (2005) 17. Brenner, S.C., Scott, L.R.: The mathematical theory of finite element methods, 3rd edn. Texts in Applied Mathematics, vol. 15. Springer, New York (2008) 18. Marcinkowski, L.: A balancing Neumann - Neumann method for a mortar finite element discretization of fourth order elliptic problems (in preparation)
Application Specific Processors for the Autoregressive Signal Analysis
Anatolij Sergiyenko 1, Oleg Maslennikow 2, Piotr Ratuszniak 2, Natalia Maslennikowa 2, and Adam Tomas 3
1 National Technical University of Ukraine, pr. Peremogy 37, 03056 Kiev, Ukraine, [email protected]
2 Koszalin University of Technology, ul. Partyzanow 17, 75-411 Koszalin, Poland, [email protected]
3 Częstochowa University of Technology, ul. Armii Krajowej 21, 42-200 Częstochowa, Poland
Abstract. Two structures of processors for autoregressive (AR) analysis are considered. The first implements the Durbin algorithm using rational fraction calculations. Such calculations provide higher precision than integer calculations and are simpler than floating point calculations. The second implements an adaptive lattice filter. These processors are configured in an FPGA and allow signal analysis at sampling frequencies up to 300 MHz. They provide new opportunities for real-time signal analysis and adaptive filtering in radio receivers, ultrasonic devices, wireless communications, etc.
1 Introduction
Autoregressive (AR) analysis is an effective and relatively simple method for estimating the parameters of a physical object. Additionally, it provides a way of modeling these objects. Therefore, AR analysis is widely used in economic process prediction, seismic exploration, medical diagnostics, mobile phones, musical instrument modeling, etc. The use of field programmable gate arrays (FPGAs) makes digital signal processing possible in the frequency range up to hundreds of megahertz. The FPGA architecture is adapted to fixed point arithmetic, but the AR analysis algorithms require increased precision, and therefore in most cases they use floating point calculations. In this paper, structures of AR analysis processors configured in an FPGA are proposed. They provide AR analysis of signals in a wide frequency range using fixed point calculation units. AR analysis is based on the hypothesis that the physical object is represented by an AR filter model. This model implements the filtering of some excitation signal and generates the samples of the output signal x(n). White noise is usually selected as the excitation signal; however, it can have a special form, as in vocoders. The AR filter coefficients, named the prediction coefficients
$a_i$ $(i = 0,\ldots,p)$, are derived in two steps. First, the autocorrelation function samples $r_{xx}(i)$ of the signal $x(n)$ are estimated. Then the normal Yule-Walker equation system is solved:

$R_{xx}\,(a_1,\ldots,a_p)^T = -(r_{xx}(1),\ldots,r_{xx}(p))^T, \qquad (1)$

where $R_{xx}$ is the symmetric Toeplitz matrix whose first row is equal to $r_{xx}(0),\ldots,r_{xx}(p-1)$, and $a_0 = 1$ [1]. The equation system (1) is usually solved using the Durbin algorithm, which can be represented by the following loop nest:

  for i = 1 to p
    a0 = 1;
    ki = -(a0*r(i-1) + ... + a(i-1)*r(0)) / E(i-1);
    ai = ki;
    for j = 1 to i-1
      aj = aj + ki*a(i-j);
    end
    Ei = (1 - ki^2)*E(i-1);
  end;                                                      (2)
where the initial value of the prediction error is $E_0 = r_{xx}(0) = r_0$. Both the prediction coefficients $a_i$ and the reflection coefficients $k_i$ are used as AR analysis results. The coefficients $a_i$ are often used in methods of high resolution spectrum analysis. The distance between coefficients $k_i$ whose values approach ±1 is interpreted as the distance between medium layers; the excitation signal is reflected from these layers. The analysis of algorithm (2) shows that the number of operations is of the order of $p^2$, and the algorithm is difficult to parallelize. Therefore, more complex but well parallelizable algorithms, such as QR-factorization algorithms, are often recommended instead of the Durbin algorithm [2,3]; in particular, the Givens algorithm and the Toeplitz matrix inversion algorithm have been implemented successfully in FPGAs [4]. In algorithm (2) a division by the prediction error coefficient $E_i$ occurs; when $E_i$ approaches zero, this causes a large result error. Therefore, solving this equation system requires calculations with increased precision. Similar algorithms for solving Toeplitz equation systems have analogous properties [2]. For these reasons, the AR analysis problem is usually solved using floating point microprocessors. In this paper two AR processor structures are proposed which find effective implementations in modern FPGAs.
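For reference, a plain software model of the Durbin recursion (2) is sketched below in C. This is our illustrative sketch, not part of the proposed processor; it uses double precision instead of the rational fractions of Section 2, and it copies the coefficients before the inner update so that the in-place recursion remains correct:

  #include <stddef.h>

  /* Hypothetical reference model of the Durbin recursion (2):
   * r[0..p] - autocorrelation samples r_xx(0..p)
   * a[0..p] - output prediction coefficients (a[0] = 1)
   * k[1..p] - output reflection coefficients
   * Returns the final prediction error E_p.                          */
  double durbin(const double *r, double *a, double *k, int p)
  {
      double E = r[0];                      /* E_0 = r_xx(0)          */
      a[0] = 1.0;
      for (int i = 1; i <= p; i++) {
          double acc = 0.0;                 /* a_0 r_{i-1}+...+a_{i-1} r_0 */
          for (int j = 0; j < i; j++)
              acc += a[j] * r[i - j];
          k[i] = -acc / E;                  /* division on the critical path */
          double tmp[i];                    /* old a_1..a_{i-1} (C99 VLA)    */
          for (int j = 1; j < i; j++)
              tmp[j] = a[j];
          for (int j = 1; j < i; j++)       /* a_j = a_j + k_i a_{i-j}       */
              a[j] = tmp[j] + k[i] * tmp[i - j];
          a[i] = k[i];
          E = (1.0 - k[i] * k[i]) * E;      /* E_i = (1 - k_i^2) E_{i-1}     */
      }
      return E;
  }

The cyclic dependence of k_i on E_{i-1} visible in this sketch is exactly the critical path discussed in the next section.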
2 AR Processor Based on the Durbin Algorithm
To derive stable estimations of $a_i$ in (1), the values $r_{xx}(i)$ must be accumulated for at least $qp = 10p$ samples of data [1]. When $p$ is limited to a few tens, then
the complexity of computing the correlation function $R_{xx}$ is approximately $q$ times higher than the complexity of solving the equation system (1). Usually the input data are sampled sequentially, so the correlation accumulation dominates the time balance of the whole AR analysis. As a result, solving the equation system on a parallel processor array would not give a valuable throughput gain. For this situation a processor structure for AR analysis is proposed which contains a correlation processor and a Durbin algorithm processor. The correlation processor is a linear array of processing units operating on fixed point data. The Durbin algorithm processor implements the calculations with increased precision. This structure is illustrated in Fig. 1. Both processors can perform their tasks in approximately equal periods of time in a pipelined manner. The critical path of algorithm (2) goes through the cyclic dependence in the calculation of the coefficient $k_i$. With a pipelined implementation of the arithmetic operations, the iteration period is no less than the number of stages of the adder and multiplier pipelines. If a floating point arithmetic unit is used, the iteration period is no less than 7 clock cycles, and this unit is underloaded [6].
Fig. 1. Structure of the AR analysis processor based on the Durbin processor
In [5] rational fraction arithmetic operations were proposed for solving linear algebra problems, which provide advantages in FPGA implementations. A rational fraction number x is represented by two integers, a numerator $n_x$ and a denominator $d_x$, i.e. $x = n_x/d_x$. All the data are represented by such fractions. Multiplication, division, and addition are then calculated by the formulas

$x \cdot y = n_x n_y/(d_x d_y)$;  $x/y = n_x d_y/(d_x n_y)$;  $x + y = (n_x d_y + n_y d_x)/(d_x d_y)$.
Such calculations are supported in modern FPGAs by sets of hundreds of multiplication and accumulation units such as the DSP48 units in Xilinx Virtex FPGAs. The use of rational fraction calculations provides both a small error level and an expanded dynamic range, as well as high speed and a simple implementation in FPGA. Besides, the division operation, which lies on the algorithm's critical path, is very short and adds minimal error to the calculations.
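A minimal software model of this arithmetic is sketched below (our illustration; the hardware unit of [5] uses 18-bit pipelined integer multipliers and renormalizes the growing bit widths, which the sketch omits):

  #include <stdint.h>

  /* A rational fraction x = n/d modelled with 64-bit integers;
   * the FPGA unit of [5] uses 18-bit numerators and denominators. */
  typedef struct { int64_t n, d; } frac_t;

  static frac_t frac_mul(frac_t x, frac_t y) {   /* x*y = (nx*ny)/(dx*dy)        */
      frac_t r = { x.n * y.n, x.d * y.d };
      return r;
  }
  static frac_t frac_div(frac_t x, frac_t y) {   /* x/y = (nx*dy)/(dx*ny)        */
      frac_t r = { x.n * y.d, x.d * y.n };
      return r;
  }
  static frac_t frac_add(frac_t x, frac_t y) {   /* x+y = (nx*dy+ny*dx)/(dx*dy)  */
      frac_t r = { x.n * y.d + y.n * x.d, x.d * y.d };
      return r;
  }
  static double frac_to_double(frac_t x) {       /* final conversion of a result */
      return (double)x.n / (double)x.d;
  }

Note how cheap the division is: it is a swap of the divisor's numerator and denominator followed by two multiplications, which is why it contributes so little to the critical path.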
If the natural representation of the results is needed, the computations (2) are finished by dividing the numerators of the results $a_i$ or $k_i$ by their denominators using the usual division operation. Such a division only slightly increases the Durbin processor operation time. The structure of the rational fraction arithmetic unit is shown in [5]. The structure of the correlation processor follows the FIR filter structure.
3 AR Processor Based on the Lattice Filter
An FIR filter with the coefficients $a_i$ is the inverse of the AR filter. Therefore, an adaptive FIR filter is often used to derive the autoregressive coefficients [1,7]. But the slow adaptation of such a filter prevents its use for the analysis of rapidly changing physical processes. The lattice form of the adaptive FIR filter is well known; it implements the calculations [1]

$E^f_j(n) = E^f_{j-1}(n) + k_j E^b_{j-1}(n-1)$;   $E^b_j(n) = E^b_{j-1}(n-1) + k_j E^f_{j-1}(n)$,

where $E^f_j(n)$, $E^b_j(n)$ are the forward and backward prediction errors of the $j$-th algorithm stage, respectively, $k_j$ is the reflection coefficient, $E^f_0(n) = E^b_0(n) = x(n)$, and $j = 1,\ldots,p$. In FIR filters with a sequential adaptation process the coefficients $k_j$ are calculated using some recursive gradient algorithm, which converges in several hundreds of steps. In the lattice filter which processes $N = qp$ data samples these coefficients are calculated over $N$ steps by the following formula [1]:

$k_j = -\dfrac{\sum_n E^f_{j-1}(n)\,E^b_{j-1}(n)}{\sqrt{\sum_n \big(E^f_{j-1}(n)\big)^2 \,\sum_n \big(E^b_{j-1}(n)\big)^2}}.$

Here the sums in the numerator and the denominator are the partial correlation coefficients. In this computational scheme the coefficients $k_j$ are found step by step, beginning with $k_1$. In this process one input data array is fed $p$ times into the lattice filter. If the signal is stationary within a time window of $pN$ samples, the data stream can be fed into the filter continuously without buffering. The processor for AR analysis consists of the lattice filter and the coefficient $k_j$ calculation unit (see Fig. 2). At the $j$-th step of adaptation the coefficient calculation unit is attached to the inputs of the $j$-th filter stage. It accumulates the partial correlation coefficients for $N$ input data samples and calculates the coefficient $k_j$. This coefficient is then loaded into the multiplication units of the $j$-th filter stage. As we see, the step of the correlation function estimation and the step of finding the reflection coefficient are combined in this computational scheme. Compared with the usual adaptive filters, robust convergence to the result is achieved here in only $pN$ clock cycles. Compared with the Yule-Walker equation solver considered above, the accumulation of computational errors in the lattice filter is minimized, and stable results always occur.
Fig. 2. Structure of the AR analysis processor based on the lattice filter
Note that stable results mean that $|k_j| < 1$. Therefore, usual integer calculations are sufficient to implement the coefficient calculation unit. Such an AR analysis processor can be configured in an FPGA, but the integers have to be of at least doubled bit width. The convergence speed can be increased by attaching coefficient calculation units to all the filter stages, but then the hardware volume increases dramatically.
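A hedged software model of this stage-by-stage adaptation is sketched below in C (our illustration; function and array names are ours, and the reflection coefficient is computed according to the partial correlation formula reconstructed above):

  #include <math.h>
  #include <stdlib.h>
  #include <string.h>

  /* Illustrative model: for stage j the input block x[0..N-1] is passed
   * through the already adapted stages 1..j-1, the partial correlations at
   * the inputs of stage j are accumulated, and k_j is computed.
   * The caller provides k[0..p]; entries k[1..p] are filled in.           */
  void lattice_adapt(const double *x, int N, int p, double *k)
  {
      double *ef = malloc(N * sizeof *ef);    /* forward errors  E^f_{j-1}(n) */
      double *eb = malloc(N * sizeof *eb);    /* backward errors E^b_{j-1}(n) */
      for (int j = 1; j <= p; j++) {
          memcpy(ef, x, N * sizeof *ef);      /* E^f_0(n) = E^b_0(n) = x(n)   */
          memcpy(eb, x, N * sizeof *eb);
          for (int m = 1; m < j; m++) {       /* run through adapted stages    */
              double prev_b = 0.0;            /* E^b_{m-1}(n-1), zero for n=0  */
              for (int n = 0; n < N; n++) {
                  double f = ef[n] + k[m] * prev_b;
                  double b = prev_b + k[m] * ef[n];
                  prev_b = eb[n];             /* keep old E^b_{m-1}(n)         */
                  ef[n] = f;
                  eb[n] = b;
              }
          }
          double num = 0.0, den_f = 0.0, den_b = 0.0;
          for (int n = 0; n < N; n++) {       /* partial correlation sums      */
              num   += ef[n] * eb[n];
              den_f += ef[n] * ef[n];
              den_b += eb[n] * eb[n];
          }
          k[j] = -num / sqrt(den_f * den_b);
      }
      free(ef); free(eb);
  }

In the hardware the p passes over the data are not needed when the data stream is fed continuously; the sketch only mirrors the logical order in which the coefficients become available.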
4 AR Processors Implemented in FPGA
The two schemes for AR analysis described above were implemented as application specific processors configured in Xilinx FPGAs. The Durbin algorithm processor is based on an arithmetic unit which implements multiplication, division and accumulation of rational fractions as well as their transformation into integers. The fraction contains an 18-bit numerator and an 18-bit denominator. Multiplication and division operations are implemented in pipelined mode with a period of a single clock cycle; the latent delay is 7 clock cycles. The accumulation operation period lasts 4 clock cycles because the result of this operation is fed directly back to its input. The results of the processor implementation for 16-bit data, N = 10 and different values of p are presented in Table 1. Their comparison shows that the processors based on the Durbin algorithm have a higher clock frequency, smaller hardware volume and shorter adaptation period. The analysis of the processor shows that the input data sampling frequency can be increased more than twofold if the correlation processor clock frequency is increased. The diagram in Fig. 3 shows the prediction error of the lattice filter for the analysis of white noise filtered through a 6-th order IIR band pass filter. The square impulses show the periods of the coefficient $k_j$ calculation. One can see the high speed of convergence of the adaptation. The processor based on the lattice filter has the advantage for small parameters p and for N < 10, as well as for processing continuous data flows. Besides, it provides a robust estimation of $k_j$ for any input data.
Table 1. Processors of the order p implemented in Xilinx XC4VDX35-12 FPGA

Processor based on   Hardware volume,             Max. clock         Period of deriving,
                     CLB slices+DSP48             frequency, MHz     clock cycles
                     p=10     p=30     p=90       p=10  p=30  p=90   p=10   p=30    p=90
Durbin algorithm     1330+16  1850+36  4753+96    154   159   150    610    1830    5490
lattice filter       1446+24  2998+64  6822+184   151   147   145    1650   10950   86850
Fig. 3. Error prediction signal E f (n) at the lattice filter output
Compared with similar FPGA-based processors for AR analysis [9,10], the proposed processors have a substantially smaller hardware volume. The filter described in [10] has 20 times more logic cells than the processor based on the Durbin algorithm with an equal number of multiplier units.
5 Conclusions
In this paper two structures of processors for AR analysis are proposed which make signal analysis possible at sampling frequencies up to 300 MHz. The processors are configured in Xilinx FPGAs. They make it possible to expand substantially the application area of AR analysis, for example in ultrasonic devices and wireless communications. The proposed rational fraction number system provides higher precision than integers and is simpler to implement than the floating point number system.
References
1. Candy, F.V.: Model-Based Signal Processing, p. 677. Wiley, Hoboken (2006)
2. Brent, R.P.: Old and new algorithms for Toeplitz systems. In: Proc. SPIE. Advanced Algorithms and Architectures for Signal Processing III, vol. 975. SPIE, Bellingham (1989)
3. Bojanchuk, A.W., Brent, R.P., de Hoog, F.R.: Linearly Connected Arrays for Toeplitz Least-Squares Problems. J. of Parallel and Distributed Computing 9, 261–270 (1990)
4. Sergyienko, A., Maslennikow, O.: Implementation of Givens QR Decomposition in FPGA. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds.) PPAM 2001. LNCS, vol. 2328, pp. 453–459. Springer, Heidelberg (2002)
5. Sergyienko, A., Maslennikow, O., Lepekha, V.: FPGA Implementation of the Conjugate Gradient Method. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 526–533. Springer, Heidelberg (2006)
6. Karlström, P., Ehliar, A., Liu, D.: High performance, low latency FPGA based floating point adder and multiplier units in a Virtex4. In: 24th IEEE Norchip Conf., Linköping, Sweden, November 20-21, pp. 31–34 (2006)
7. Widrow, B., Stearns, S.D.: Adaptive Signal Processing. Prentice-Hall, Englewood Cliffs (1985)
8. Pohl, Z., Matouek, R., Kadlec, J., Tichý, M., Lícko, M.: Lattice adaptive filter implementation for FPGA. In: Proc. 2003 ACM/SIGDA 11th Int. Symp., Monterey, California, USA, pp. 246–250 (2003)
9. Hwang, Y.T., Han, J.C.: A novel FPGA design of a high throughput rate adaptive prediction error filter. In: 1st IEEE Asia Pacific Conf. on ASICs, AP-ASIC 1999, pp. 202–205 (1999)
10. Lin, A.Y., Gugel, K.S.: Feasibility of fixed-point transversal adaptive filters in FPGA devices with embedded DSP blocks. In: 3rd IEEE IWSOC 2003, pp. 157–160 (2003)
A Parallel Non-square Tiled Algorithm for Solving a Kind of BVP for Second-Order ODEs
Przemyslaw Stpiczyński
Department of Computer Science, Maria Curie-Sklodowska University, Pl. M. Curie-Sklodowskiej 1, 20-031 Lublin, Poland
[email protected]
Abstract. The aim of this paper is to show that a kind of boundary value problem for second-order ordinary differential equations, which reduces to the problem of solving a tridiagonal system of linear equations with an almost Toeplitz structure, can be efficiently solved on modern multicore architectures using a parallel tiled algorithm based on the divide and conquer approach for solving linear recurrence systems with constant coefficients and novel data formats for dense matrices. Keywords: BVP for ODEs, parallel non-square tiled algorithm, multicore, novel data formats for dense matrices.
1 Introduction
Let us consider the following boundary value problem which arises in many practical applications [4]:

$-\dfrac{d^2 u}{dx^2} = f(x) \quad \forall x \in [0,1], \qquad (1)$

where $u'(0) = 0$ and $u(1) = 0$. The numerical solution of problem (1) reduces to the problem of solving tridiagonal systems of linear equations. Simple algorithms based on Gaussian elimination achieve poor performance, since they do not fully utilize the underlying hardware, i.e. memory hierarchies, vector extensions and multiple processors. The matrix of such systems has a special (almost Toeplitz) form and it is clear that this should also be exploited. The aim of this paper is to present a new implementation of the algorithm presented in [6]. The new implementation is based on the idea of developing parallel tiled algorithms which are suitable for multicore architectures. The performance of the algorithm can be improved by using non-square tiles which can better fit into the L1 cache [1].
2 Method
We want to find an approximation of the solution to the problem (1) in the grid points $0 = x_1 < x_2 < \ldots < x_{n+1} = 1$,
where $x_i = (i-1)h$, $h = 1/n$, $i = 1,\ldots,n+1$. Let $f_i = f(x_i)$ and $u_i = u(x_i)$. Using the approximation for the second derivative

$u''(x_i) \approx \dfrac{u_{i+1} - 2u_i + u_{i-1}}{h^2}$

and the boundary conditions

$u''(0) \approx \dfrac{u(0-h) - 2u_1 + u_2}{h^2}, \qquad u'(0) \approx \dfrac{u_2 - u(0-h)}{2h}$

we get the following equations

$2(u_1 - u_2) = h^2 f_1,$
$-u_{i-1} + 2u_i - u_{i+1} = h^2 f_i, \quad i = 2,\ldots,n-1, \qquad (2)$
$-u_{n-1} + 2u_n = h^2 f_n.$
Thus, we can rewrite (2) as the problem of solving the system of linear equations

$Au = d, \qquad (3)$

where

$A = \begin{pmatrix} 1 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{pmatrix} \in \mathbb{R}^{n\times n}$

and the vectors satisfy $u = (u_1,\ldots,u_n)^T$, $d = (d_1,\ldots,d_n)^T$, with $d_1 = \frac{1}{2}h^2 f_1$ and $d_i = h^2 f_i$ for $i = 2,\ldots,n$. The matrix A can be factorized as $A = LR$, where L and R are bidiagonal Toeplitz matrices

$L = \begin{pmatrix} 1 & & & \\ -1 & 1 & & \\ & \ddots & \ddots & \\ & & -1 & 1 \end{pmatrix}, \qquad R = \begin{pmatrix} 1 & -1 & & \\ & 1 & -1 & \\ & & \ddots & \ddots \\ & & & 1 \end{pmatrix}. \qquad (4)$

The solution to the system (3) can be found using Gaussian elimination without pivoting. A simple algorithm for solving (3) based on such a factorization comprises two sequential stages, namely solving the systems $Ly = d$ (forward reduction) and $Ru = y$ (back substitution). The first stage can be done using

$y_1 = d_1, \qquad y_i = d_i + y_{i-1} \ \ \text{for } i = 2,\ldots,n. \qquad (5)$

Then (second stage) we find the final solution to the system (3) using

$u_n = y_n, \qquad u_i = y_i + u_{i+1} \ \ \text{for } i = n-1, n-2,\ldots,1. \qquad (6)$
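A direct software realization of the two stages (5)-(6) is very short; the C sketch below (ours, with illustrative names) overwrites the right-hand side d first with y and then with the solution u:

  /* Simple scalar algorithm (5)-(6): on input u[0..n-1] holds d,
   * on output it holds the solution of Au = d.                     */
  void solve_sequential(double *u, int n)
  {
      for (int i = 1; i < n; i++)        /* forward reduction: y_i = d_i + y_{i-1}  */
          u[i] += u[i - 1];
      for (int i = n - 2; i >= 0; i--)   /* back substitution: u_i = y_i + u_{i+1}  */
          u[i] += u[i + 1];
  }

The two loops are purely sequential recurrences, which is exactly why this version cannot exploit vector units or multiple cores.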
A more sophisticated approach can be based on the divide-and-conquer algorithm for solving linear recurrence systems [5]. The main idea of the algorithm is to rewrite the considered systems as block-bidiagonal systems of linear equations [5,7]. Without loss of generality, let us assume that there exist two positive integers r and s such that $rs \le n$ and $s > 1$. The method can be used for finding $y_1,\ldots,y_{rs}$. To find $y_{rs+1},\ldots,y_n$, we use (5) directly. Then for $j = 1,\ldots,r$, we define vectors

$d_j = \big(d_{(j-1)s+1},\ldots,d_{js}\big)^T, \quad y_j = \big(y_{(j-1)s+1},\ldots,y_{js}\big)^T \in \mathbb{R}^s$

and find all $y_j$ using the following formula

$y_1 = L_s^{-1} d_1, \qquad y_j = L_s^{-1} d_j + y_{(j-1)s}\, e \ \ \text{for } j = 2,\ldots,r, \qquad (7)$

where $e = (1,\ldots,1)^T$ and the matrix $L_s \in \mathbb{R}^{s\times s}$ is of the same form as L given by (4). Analogously, we can perform the second stage using

$u_r = R_s^{-1} y_r, \qquad u_j = R_s^{-1} y_j + u_{js+1}\, e \ \ \text{for } j = r-1,\ldots,1, \qquad (8)$

where $u_j = \big(u_{(j-1)s+1},\ldots,u_{js}\big)^T \in \mathbb{R}^s$ and $R_s \in \mathbb{R}^{s\times s}$ is of the same form as R given by (4). However, it is better to find the matrix

$U = (u_1,\ldots,u_r) \in \mathbb{R}^{s\times r} \qquad (9)$

instead of individual vectors $u_j$. Also, there is no need to use temporary vectors $y_j$. It is clear that we have to allocate only one $s \times r$ array U and initially store the vectors $d_j$ in its columns, namely U(*,j) = d_j. Then (Step 1A) we perform the following loop:

  do k=2,s
    U(k,1:r)=U(k,1:r)+U(k-1,1:r)
  end do

In general, during this step we use a previously updated row to update the next row of the table. During Step 1B we compute the bottom row of the table using the following simple loop:

  do k=2,r
    U(s,k)=U(s,k)+U(s,k-1)
  end do

Finally (Step 1C) we update the first s-1 entries of each column (except for the first one) using the entries of the bottom row:

  do k=2,r
    U(1:s-1,k)=U(1:s-1,k)+U(s,k-1)
  end do
Fig. 1. Divide and conquer algorithm using a 2-D array
Similarly we proceed during the second stage of the algorithm (Figure 1). During Step 2A we update the rows of the table bottom-up. Then (Step 2B) we update the first row of the table from right to left. Finally (Step 2C), using elements from the first row, we update the remaining entries of each column (except for the last one):

  do k=s-1,1,-1
    U(k,1:r)=U(k,1:r)+U(k+1,1:r)
  end do
  !
  do k=r-1,1,-1
    U(1,k)=U(1,k)+U(1,k+1)
  end do
  !
  do k=1,r-1
    U(2:s,k)=U(2:s,k)+U(1,k+1)
  end do

Both stages can be easily vectorized and parallelized. The array can be divided into blocks of columns and each processor can be responsible for computing one block. Steps A and C can be performed in parallel, but steps B (in both stages) are sequential. Unfortunately, the performance of this algorithm can be poor because of cache misses which may occur during its execution.
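For clarity, the whole computation can be summarized in a single routine; the C sketch below is ours (0-based indexing, the s x r array stored column-wise) and follows steps 1A-1C and 2A-2C exactly as described above:

  /* Divide-and-conquer solver: U[j*s + k] is row k, column j of the s x r
   * array.  On input the columns hold the vectors d_j, on output u_j.     */
  void solve_divide_and_conquer(double *U, int s, int r)
  {
      for (int j = 0; j < r; j++)              /* Step 1A: prefix sums per column */
          for (int k = 1; k < s; k++)
              U[j*s + k] += U[j*s + k - 1];
      for (int j = 1; j < r; j++)              /* Step 1B: bottom row, left to right */
          U[j*s + (s-1)] += U[(j-1)*s + (s-1)];
      for (int j = 1; j < r; j++)              /* Step 1C: rows 0..s-2 of column j */
          for (int k = 0; k < s - 1; k++)
              U[j*s + k] += U[(j-1)*s + (s-1)];

      for (int j = 0; j < r; j++)              /* Step 2A: suffix sums per column  */
          for (int k = s - 2; k >= 0; k--)
              U[j*s + k] += U[j*s + k + 1];
      for (int j = r - 2; j >= 0; j--)         /* Step 2B: first row, right to left */
          U[j*s + 0] += U[(j+1)*s + 0];
      for (int j = 0; j < r - 1; j++)          /* Step 2C: rows 1..s-1 of column j */
          for (int k = 1; k < s; k++)
              U[j*s + k] += U[(j+1)*s + 0];
  }

The loops over j in steps A and C are independent and can be distributed over threads; only the short steps B are sequential.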
Fig. 2. Divide and conquer algorithm using non-square tiles
Note that it cannot be improved using simple standard techniques like loop interchange, because entries of the table are accessed both row-wise (steps A in both stages) and column-wise (steps C in both stages).
3 Non-square Tiled Algorithm
In [6] we showed that the performance of the algorithm can be significantly improved by using novel data formats for dense matrices with the square blocked full column major order [3]. The performance of the algorithm can still be improved by the use of non-square tiles which can better fit into the L1 cache than the square blocks used in the square blocked full column major order format. Let us observe that the matrix U defined by the equation (9) can be partitioned as follows:

$U = \begin{pmatrix} U_{11} & \ldots & U_{1 n_g} \\ \vdots & & \vdots \\ U_{m_g 1} & \ldots & U_{m_g n_g} \end{pmatrix} \qquad (10)$
Each submatrix $U_{ij}$ is stored as a rectangular (generally non-square) $m_b \times n_b$ tile which occupies a contiguous block of memory. The matrix U can be stored in a four dimensional array declared in Fortran 95 as follows:

  real (kind=8) :: U(mb,nb,mg,ng)

where mb = $m_b$, nb = $n_b$, mg = $s/m_b$ and ng = $r/n_b$. For simplicity we assume that $m_b m_g = s$ and $n_b n_g = r$. The new tiled algorithm also comprises two stages, each with three steps A, B, C, but it operates on tiles (Figure 2). During Step 1A, we update columns of tiles (i.e. panels) top-down using vector addition. It is clear that when we operate within a single tile which is in the L1 cache, a cache miss will not occur. However, on the border between two tiles in a panel, a cache miss may occur. During Step 1B, we operate on the bottom row of tiles and we have such a border between each couple of adjacent tiles. Finally, during Step 1C, we use the bottom row of each tile in the bottom row of tiles to update columns in each panel. Similarly we can implement the second stage of the algorithm. During Step 2A, we update tiles within panels bottom-up and during Step 2B we update tiles from right to left. Note that in Step 2C it is better to update panels from left to right, and tiles within each panel should be updated top-down; then we can reduce the number of cache misses. It is clear that in both stages the steps A and C can be done in parallel. Each processor can be responsible for computing a group of panels. The algorithm can be expressed as a number of small computational tasks operating on tiles. Such tasks can be scheduled for parallel execution using OpenMP [2] and the well-known parallel construct:

  !$omp parallel private(...) shared(...)
  ! Stage 1
  !$omp do private(...) default(shared)      <---- Step 1A
  do j=1,ng
    ...
  end do
  !$omp single                               <---- Step 1B
    ...
  !$omp end single
  !$omp do private(...) default(shared)      <---- Step 1C
  do j=1,ng
    ...
  end do
  ! Stage 2
  .....
  !$omp end parallel

Within the parallel region we have three constructs: a parallelized do-loop for Step 1A, a single construct for Step 1B, and again a parallelized do-loop for Step 1C. Similarly we can implement the second stage of the algorithm.
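As a hedged illustration of how one such task-parallel step can look outside of Fortran, the following C/OpenMP sketch (ours; the tile layout and all names are assumptions analogous to U(mb,nb,mg,ng)) performs Step 1A with one thread responsible for whole panels:

  #include <stddef.h>
  #include <omp.h>

  /* Tile (ig,jg) of size mb x nb occupies a contiguous block; tiles are
   * ordered column-wise (first down a panel, then panel by panel).       */
  static inline double *tile(double *U, int mb, int nb, int mg, int ig, int jg)
  {
      return U + ((size_t)jg * mg + ig) * (size_t)mb * nb;
  }

  void step_1A(double *U, int mb, int nb, int mg, int ng)
  {
      #pragma omp parallel for schedule(static)
      for (int jg = 0; jg < ng; jg++) {            /* each thread owns whole panels */
          for (int ig = 0; ig < mg; ig++) {        /* tiles of a panel, top-down    */
              double *T = tile(U, mb, nb, mg, ig, jg);
              double *P = ig > 0 ? tile(U, mb, nb, mg, ig - 1, jg) : NULL;
              for (int jb = 0; jb < nb; jb++) {
                  double *col = T + (size_t)jb * mb;
                  /* carry the last row of the tile above across the tile border */
                  if (P) col[0] += P[(size_t)jb * mb + (mb - 1)];
                  for (int ib = 1; ib < mb; ib++)  /* previously updated row updates the next */
                      col[ib] += col[ib - 1];
              }
          }
      }
  }

The point of the tiled layout is visible in the inner loops: all accesses within a tile are contiguous, so at most one cache miss per column occurs at the tile border.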
4 Experimental Results
The algorithms for solving the problem (1) have been implemented in Fortran 95 with OpenMP directives [2]. The experiments carried out on a dual processor Quad-Core Xeon (3.2 GHz, 12 MB L2 cache, 32 GB RAM running under Linux) workstation using Intel Fortran Compiler show that the new algorithm achieves much better performance than the simple scalar algorithm based on (5)–(6).
Table 1. Performance of the simple scalar algorithm

n         time (s)   Mflops
2097152   0.173E-1   362.92
1048576   0.701E-2   448.69

Table 2. Exemplary results of the performance of the parallel tiled algorithm

n         s      r      mb   nb   time (s)   speedup   Mflops
2097152   2048   1024   64   16   0.312E-2   5.46      1984.68
2097152   2048   1024   32   32   0.416E-2   4.17      1511.96
1048576   1024   1024   64   16   0.194E-2   3.69      1618.86
1048576   1024   1024   32   32   0.217E-2   3.29      1444.53
Table 2 shows exemplary results of the performance of the algorithm for various problem sizes (values of n, s, r) and block sizes (values of mb and nb). The last three columns show the execution time in seconds, the speedup measured against the simple scalar algorithm, and Mflops. One can observe that the performance of the simple algorithm is very poor (Table 1) – only a very small fraction of the peak performance is achieved (about 400 Mflops using double precision). The performance of the new parallel non-square tiled algorithm is much better. The performance is better for the bigger problem size (n = 2097152) and the algorithm achieves a reasonable speedup. One can also observe that the algorithm with non-square tiles achieves better performance than with square blocks.
5 Conclusions
We have shown that the performance of the simple algorithm for solving a kind of boundary value problem for second-order ordinary differential equations which reduces to the problem of solving tridiagonal systems of linear equations can be highly improved by the use of the divide and conquer vectorized algorithm using novel data formats for dense matrices. Further improvements can be obtained by the use of non-square tiles. It should be noticed that the method can be applied to solve such problems on many-core graphics processing units with CUDA.
References
1. Buttari, A., Langou, J., Kurzak, J., Dongarra, J.: A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel Computing 35, 38–53 (2009)
2. Chandra, R., Dagum, L., Kohr, D., Maydan, D., McDonald, J., Menon, R.: Parallel Programming in OpenMP. Morgan Kaufmann Publishers, San Francisco (2001)
3. Gustavson, F.G.: New generalized data structures for matrices lead to a variety of high performance algorithms. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds.) PPAM 2001. LNCS, vol. 2328, pp. 418–436. Springer, Heidelberg (2002)
4. Scott, L.R., Clark, T., Bagheri, B.: Scientific Parallel Computing. Princeton University Press, Princeton (2005)
5. Stpiczyński, P.: Solving linear recurrence systems using level 2 and 3 BLAS routines. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds.) PPAM 2004. LNCS, vol. 3019, pp. 1059–1066. Springer, Heidelberg (2004)
6. Stpiczyński, P.: Solving a kind of boundary value problem for ODEs using novel data formats for dense matrices. In: Ganzha, M., Paprzycki, M., Pelech-Pilichowski, T. (eds.) Proceedings of the International Multiconference on Computer Science and Information Technology, vol. 3, pp. 293–296. IEEE Computer Society Press, Los Alamitos (2008)
7. Stpiczyński, P., Paprzycki, M.: Fully vectorized solver for linear recurrence systems with constant coefficients. In: Proceedings of VECPAR 2000 - 4th International Meeting on Vector and Parallel Processing, Porto, pp. 541–551. Facultade de Engerharia do Universidade do Porto (June 2000)
Graph Grammar Based Petri Nets Model of Concurrency for Self-adaptive hp-Finite Element Method with Rectangular Elements
Arkadiusz Szymczak and Maciej Paszyński
Department of Computer Science, AGH University of Science and Technology, Al. Mickiewicza 30, 30-059 Cracow, Poland
[email protected], [email protected]
http://home.agh.edu.pl/~paszynsk
Abstract. The paper presents a Petri nets model describing the process of rectangular finite element mesh generation and h adaptation. This is the first step towards the formal analysis of all the parts of the existing graph grammar based parallel adaptive hp-Finite Element Method algorithms expressed by graph grammar productions. The mesh transformations are expressed as composite programmable graph grammar productions. The transitions of the Petri nets correspond to the execution of graph grammar productions. The graph grammar based algorithms modeled as a Petri net can be analyzed for deadlocks, starvation or infinite execution. The parallel adaptive algorithms have many applications in the area of numerical simulations of different engineering problems, including material science, heat transfer, wave propagation and others. Keywords: Finite Element Method, hp adaptivity, Petri nets.
1 Introduction
The paper presents the Petri nets model dedicated to the process of two dimensional rectangular finite element mesh generation and adaptation. Besides the parallel solver, mesh generation and adaptation are the most essential parts of the adaptive hp-Finite Element Method (hp-FEM) algorithm [1,2,11]. The algorithm has been used to solve many engineering problems [3,4,1,2]. The adaptive parallel hp-FEM algorithm has been expressed by a CP-graph grammar for triangular [5] as well as rectangular finite elements [9,10]. The models utilize the CP graph grammar introduced in [6,7,8]. A Petri nets model has already been created for triangular finite elements [12]. The extension of the model to rectangular finite elements is critical, since the graph grammar based model of parallelization of the self-adaptive hp-FEM algorithms has been fully developed for rectangular finite elements [10]. The Petri nets model will allow the algorithms to be analyzed for deadlocks, starvation or infinite execution. Actually, we have encountered several deadlocks with anisotropic
mesh refinements, related to the enforcement of the 1-irregularity mesh rule, in the case when the refinements have been executed over a computational mesh distributed into several sub-domains, and the mesh has later been repartitioned in order to maintain a quasi-uniform load balance. The Petri nets model extended to anisotropic refinements will allow us to make the necessary corrections to the mesh refinement algorithm, to avoid the deadlock.
2 2D Mesh Generation
In this section we present a set of graph grammar productions for the generation of a two-dimensional mesh with rectangular finite elements. This extends the graph grammar introduced in [9] generating a row of rectangular elements. The new productions PI1 - PI5, presented in Figure 1, realize the topology generation of an arbitrarily shaped 2D rectangular mesh. The meaning of the node symbols is as follows:
– Iel - unstructured mesh element node
– NIL - fake node denoting no adjacent node
– ANY - fake node meaning that the given node is irrelevant to the given production
The new productions PI6 - PI9, presented in Figure 2, are responsible for the identification of adjacent elements to prevent element overlapping. Since a new element is always derived from one existing element and each element can have up to 4 adjacent elements, up to 3 missing bonds must be detected. Therefore, as soon as any of productions PI6 - PI9 matches, it must be executed before subsequent execution of productions PI1 - PI5. The generation of the topology of an initial mesh is followed by the generation of the structure of finite elements, which in turn is followed by the execution of mesh refinements. The exemplary generation of the topology of a simple square finite element mesh is presented in the first row of Figure 3. This corresponds to the execution of the (PI1)-(PI5)-(PI4)^2(PI7) graph grammar productions. The generation of the structure of four finite elements followed by the h refinement of the left-bottom element is presented in the second row of Figure 3. This corresponds to the execution of the (PE)^4-(PCEH)^2(PCEV)^2-(PJI)^4-(PBI)-(PFE)^2-(PBE)^2 graph grammar productions.
3 Graph Grammar Based Model of Concurrency
In this section we present the control diagram based model of concurrency for the process of initial mesh generation as well as mesh h adaptation. The model is presented in Figure 4. The following list summarizes the symbols used in the control diagrams. Some graph grammar productions referred to in the list were defined in [9], while the productions PI1 - PI9 were described in the previous section.
Fig. 1. Set of graph grammar productions PI1 - PI5 for the generation of a rectangular finite element two dimensional mesh
Fig. 2. Graph grammar productions PI6 - PI9 connecting layers of generated elements
– PE - production for generation of element structure
– PCEH - production for identification of common edge of horizontally adjacent elements
Fig. 3. Exemplary generation of the topology of four finite element mesh, followed by the generation of the structure of finite elements, followed by a single h refinement
– PCEV - production for identification of common edge of vertically adjacent elements
– PJI - production allowing for breaking element interior
– PFE - production allowing for breaking edge between two broken interiors
– PBI - production for breaking element interior
– PBE - production for breaking element edge
– PWest - production for propagation of adjacency data for west edge of element
– PEast - production for propagation of adjacency data for east edge of element
Fig. 4. Left panel: Control diagram based model of concurrency for the process of the generations of rectangular finite elements two dimensional mesh. Right panel: Control diagram based model of concurrency for the process of h adaptation.
– PSouth - production for propagation of adjacency data for south edge of element
– PNorth - production for propagation of adjacency data for north edge of element
4 Petri Nets Model of Mesh Transformations
In this section we introduce the Petri net based model of concurrency for initial mesh generation as well as for h adaptation. In addition to the control diagram based model it provides analysis tools typical for Petri nets, like reachability graphs, place/transition invariants, etc. A system modeled as a Petri net can be analyzed for deadlocks, starvation or infinite execution. This model also allows one to express some quantitative relationships between graph grammar productions. Figure 5 presents a priority Petri net with guards for initial mesh generation. Transitions are named after the productions they execute. Priorities are assigned to transitions (in subscript) to enforce a certain sequence of execution. Transitions PI6 - PI9 (priority 1) should be executed as soon as they get activated, before any other transition active at the same time. The prioritization has been introduced to ensure correct mesh construction (with no overlapping elements). Transitions PI6 - PI9 have an additional activation guard - they can be activated only if there is an actual matching for these productions in the mesh being generated. They also have place p3 as input, since whenever a new adjacency is identified, the number of free bonds decreases and the potential for new element generation is reduced. The condition expressed with the guard is not enforced by the net itself. Edge weights denote the number by which the marking of the destination place is increased and the marking of the origin place is decreased. The meaning of some of the places is as follows:
Fig. 5. Petri nets model of concurrency for the process of rectangular finite element two dimensional mesh generation
– p0 - number of elements left to be produced
– p3 - number of elements that can be produced from the current mesh shape
– p12 - number of potential candidates for adjacency identification
The initial marking is: p1 - 1, p0 - N, where N is a parameter to the system telling how many elements the initial mesh will consist of. The rest of the places are empty. The only transitions that increase the number of tokens in the net are PE and PI2 - PI5. The transition PE input to place p3 may be fired only once, since the initial marking assigns only one token to place p1. Transitions PI2 - PI5 have place p0 as input, which limits the number of their activations since the initial marking of that place only decreases during the system execution. Therefore the Petri net is bounded and the initial mesh generation will eventually stop. The reachability graph is finite and suitable for analysis. Place p0 is the determinant of the net's liveness. When that place gets empty, the net quickly reaches a dead marking - as soon as places p4 - p11, p13 and p14 also get empty. Such a dead marking means the end of initial mesh generation; no deadlock is possible while place p0 contains tokens. The Petri net structure is fixed; different shapes of generated meshes will result in choosing different paths in the reachability graph. Figure 6 presents the Petri net for h adaptation. The initial marking of place p0 is the number of the mesh elements chosen for adaptation by an external driver. This includes elements that have to be broken as prerequisites to the adaptation of the chosen elements (according to the mesh regularity rule). The marking of place p1 determines the adaptation "potential" of the mesh and initially it will be: the number of elements with 4 neighbors multiplied by 4 + the number of elements with 3 neighbors times 3 + the number of elements with 2 neighbors times 2 + the number of elements with 1 neighbor. The marking of places p1 and p6 should be preserved between subsequent iterations of h adaptation.
Fig. 6. Petri nets with guards model of concurrency for the process of h adaptation
Fig. 7. Exemplary reachability graph for the two finite element mesh generation
The branch with place p2 models interior breaking of elements with 1 neighbor; branch with place p3 - elements with 2 neighbors; branch with place p4 elements with 3 neighbors; branch with place p5 elements with 4 neighbors. Two transitions linked with a dashed line denote actually all activated transitions of a given type fired concurrently. Although their numbers depend on the actual problem being solved, the reachability graph is still finite and can be analyzed since the net is bounded thanks to the place p0. Mesh regularity rules are enforced with the transition guards. Therefore the net can reach a dead state even if places p1 and/or p6 still contain tokens.
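To illustrate how such an analysis can be automated, the sketch below (ours; the net dimensions, the weight matrices and the omission of priorities and guards are simplifying assumptions, not the actual model) generates the reachability graph of a place/transition net by a breadth-first search over markings:

  #include <stdio.h>
  #include <string.h>

  #define P 16            /* number of places, as in the net of Fig. 5        */
  #define T  9            /* number of transitions (hypothetical sketch size) */
  #define MAXSTATES 4096

  int in_w[T][P], out_w[T][P];        /* edge weights of the net              */
  int states[MAXSTATES][P], nstates;  /* markings discovered so far           */

  static int find_state(const int *m)
  {
      for (int s = 0; s < nstates; s++)
          if (memcmp(states[s], m, sizeof(int) * P) == 0) return s;
      return -1;
  }

  void reachability(const int *m0)
  {
      memcpy(states[0], m0, sizeof(int) * P);
      nstates = 1;
      for (int s = 0; s < nstates; s++) {              /* BFS over markings */
          for (int t = 0; t < T; t++) {
              int enabled = 1, m[P];
              for (int p = 0; p < P; p++)
                  if (states[s][p] < in_w[t][p]) { enabled = 0; break; }
              if (!enabled) continue;
              for (int p = 0; p < P; p++)              /* fire transition t */
                  m[p] = states[s][p] - in_w[t][p] + out_w[t][p];
              int dst = find_state(m);
              if (dst < 0 && nstates < MAXSTATES) {
                  dst = nstates++;
                  memcpy(states[dst], m, sizeof(int) * P);
              }
              if (dst >= 0) printf("S%d --T%d--> S%d\n", s, t, dst);
          }
      }
  }

Dead markings are the discovered states from which no transition is enabled; detecting them during such a search is exactly the deadlock and stop-condition analysis discussed above.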
5 Conclusions
This paper presented the Petri nets model for the process of generation of an initial two dimensional rectangular finite element mesh, as well as for the process of mesh h adaptation. The mesh transformations were expressed by graph grammar productions. The order of execution of graph grammar productions was defined by the control diagram. The Petri nets were developed based on the presented control diagrams. The transitions of the Petri nets correspond to the execution of graph grammar productions. The developed model allows for the formal analysis of the process of mesh generation and adaptation, modeled by the graph grammar. For example, reachability graphs can be automatically generated from the Petri nets for a given initial marking, denoting the number of finite elements to be either generated or refined. The example presented in Figure 7 shows an automatically generated reachability graph for the case of the Petri net from Figure 5 with 2 initial markers. The graph nodes denote the Petri net states (numbers denote the marking of places from P0 to P15):
S0 = {2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
S1 = {1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
S2 = {1, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
S3 = {0, 0, 0, 6, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0}
S4 = {0, 0, 0, 6, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0}
S5 = {0, 0, 0, 6, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
S6 = {0, 0, 0, 6, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
S7 = {0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0}
S8 = {0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0}
S9 = {0, 0, 0, 6, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0}
S10 = {0, 0, 0, 6, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0}
S11 = {0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1}
The graph is a DAG in which all paths start in the initial state S0 and end in S11, denoting the successful finish of the mesh generation. All possible paths contain transition PI1 followed by PE and one of the transitions PI2 - PI5 followed by PE, which means that exactly two mesh elements will be generated and structured. Also, in any case one of the transitions PCEH or PCEV will be fired, so the common edge between the two elements will be identified. Such a reachability graph is a convenient tool for deadlock, infinite loop and stop condition analysis. However, it quite quickly becomes very complex with the number of mesh elements to generate, but then a coverability graph may be used instead. The future work will involve the development of the Petri nets model for the execution of the parallel direct solver over the generated hp refined mesh. The graph grammar model for these parts of the algorithm has already been developed [13]. The presented model involves only two dimensional rectangular finite elements. The future work will also involve the extension of the graph grammar model to anisotropic mesh refinements and to three dimensional finite element meshes, followed by the development of the Petri nets model. We also intend to model all aspects of the self-adaptive hp-FEM algorithm by trace theory [14], with a comparison of the descriptive power of the developed models. Acknowledgments. The work reported in this paper was supported by the Foundation for Polish Science under the Homing Programme.
References
1. Demkowicz, L.: Computing with hp-adaptive Finite Elements, vol. I. Chapman & Hall/CRC Applied Mathematics & Nonlinear Science, Taylor & Francis Group, Boca Raton (2006)
2. Demkowicz, L., Pardo, D., Paszyński, M., Rachowicz, W., Zdunek, A.: Computing with hp-adaptive Finite Elements, vol. II. Chapman & Hall/CRC Applied Mathematics & Nonlinear Science, Taylor & Francis Group, Boca Raton (2007)
3. Gawad, J., Paszyński, M., Matuszyk, P., Madej, L.: Cellular automata coupled with hp-adaptive Finite Element Method applied to simulation of austenite-ferrite phase transformation with a moving interface. In: Steel Research International, pp. 579–586 (2008)
4. Pardo, D., Torres-Verdin, C., Paszyński, M.: Simulations of 3D DC borehole resistivity measurements with a goal-oriented hp finite-element method. Pt. 2, Through-casing resistivity instruments. Computational Geosciences 12, 83–89 (2008)
5. Paszyńska, A., Paszyński, M., Grabska, E.: Graph transformations for modeling hp-adaptive Finite Element Method with triangular elements. In: Bubak, M., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2008, Part III. LNCS, vol. 5103, pp. 604–613. Springer, Heidelberg (2008)
6. Grabska, E.: Theoretical concepts of graphical modeling. Part One: Realization of CP-graphs. Machine Graphics and Vision 2(1), 3–38 (1993)
7. Grabska, E.: Theoretical concepts of graphical modeling. Part Two: CP-graph grammars and languages. Machine Graphics and Vision 2(2), 149–178 (1993)
8. Grabska, E., Hliniak, G.: Structural aspects of CP-graph languages. Schedae Informaticae 5, 81–100 (1993)
9. Paszyński, M., Paszyńska, A.: Graph transformations for modeling parallel hp-adaptive Finite Element Method. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Wasniewski, J. (eds.) PPAM 2007. LNCS, vol. 4967, pp. 1313–1322. Springer, Heidelberg (2008)
10. Paszyński, M.: On the parallelization of self-adaptive hp-Finite Element Method. Part I. Composite Programmable Graph Grammar Model. Fundamenta Informaticae 93, 411–434 (2009)
11. Plażek, J., Banaś, K., Kitowski, J., Boryczko, K.: Exploiting two-level parallelism in FEM applications. In: Hertzberger, B., Sloot, P.M.A. (eds.) HPCN-Europe 1997. LNCS, vol. 1225, pp. 878–880. Springer, Heidelberg (1997)
12. Szymczak, A., Paszyński, M.: Graph grammar based Petri nets model of concurrency for self-adaptive hp-Finite Element Method with triangular elements. LNCS, vol. 5145, pp. 845–854. Springer, Heidelberg (2009)
13. Paszyński, M.: Parallel direct solver for hp Finite Element Method controlled by Composite Programmable Graph Grammar. In: Proceedings of the 2008 International Conference on Artificial Intelligence and Pattern Recognition, Orlando, Florida, USA, July 7-10 (2008)
14. Diekert, V., Rozenberg, G.: The book of traces. World Scientific Publishing, Singapore (1995)
Numerical Solution of the Time and Rigidity Dependent Three Dimensional Second Order Partial Differential Equation
Anna Wawrzynczak 1 and Michael V. Alania 2
1 Institute of Computer Science, University of Podlasie, 3 Maja 54, Siedlce, Poland, [email protected]
2 Institute of Math. and Physics, University of Podlasie, 3 Maja 54, Siedlce, Poland, [email protected]
Abstract. A method for the numerical solution of the time and rigidity dependent three dimensional (3-D) second order partial differential equation describing the anisotropic diffusion of galactic cosmic ray particles in the heliosphere is proposed. A good agreement between the results of the numerical solution and experimental data is demonstrated. The proposed numerical method offers broad possibilities for the implementation of parallel programming techniques and can be successfully used to solve second order partial differential equations, among them equations describing different classes of galactic cosmic ray intensity variations observed due to the anisotropic diffusion of galactic cosmic ray particles in the heliosphere. Keywords: partial differential equations, numerical method, modelling.
1 Introduction
Solar activity is characterized by the number of sunspots on the surface of the Sun; the increase and decrease of the number of sunspots in time is called a solar cycle. The most dominant and very well established one is the 11-year solar cycle. Each 11-year solar activity cycle lasts from one minimum (when few sunspots appear, or they are not visible at all) to another minimum epoch. An exceptional phenomenon of solar activity is the existence of the solar wind - an extension of the outer atmosphere of the Sun (the corona) into interplanetary space [1]. The solar wind is a stream of charged particles - a plasma - coming out from the Sun in all directions at very high speeds, on average about 400 km/s. It consists mostly of equal numbers of electrons and protons with energies of about 1 keV (1000 electron volts (eV)). The electric currents in the Sun generate a complex magnetic field which extends out into interplanetary space to form the interplanetary magnetic field. The Sun's magnetic field is carried out through the solar system by the solar wind. The rotation of the Sun winds up the magnetic field into a large rotating spiral, known as the Parker spiral. The magnetic field is primarily directed outward from the Sun in one of its hemispheres, and inward in
the other. This causes opposite magnetic field directions in the Parker spiral. The thin layer between the different field directions is described as the neutral current sheet. Since this dividing line between the outward and inward field directions is not exactly on the solar equator, the rotation of the Sun causes the current sheet to become "wavy", and this waviness is carried out into interplanetary space by the solar wind [2]. At the maximum epoch of each 11-year solar cycle (a peak in the sunspot numbers), the polarity of the Sun's global magnetic field reverses, so that the North magnetic pole becomes the South and vice versa. Every 22 years the magnetic polarity of the Sun returns to its earlier state; thus there exists the 22-year solar magnetic cycle. Each 11-year period of the 22-year solar magnetic cycle lasts from one maximum to another maximum epoch. When the global magnetic field lines are directed outward from the northern hemisphere of the Sun and backward into the southern hemisphere, this 11-year period is conventionally called the positive (A > 0) magnetic polarity period, while in the opposite case it is called the negative (A < 0) polarity period. The solar wind creates the heliosphere, a vast bubble in the interstellar medium surrounding the solar system. The heliosphere extends up to 100 AU [3] (AU - astronomical unit, the Sun-Earth distance of about 150 million km). So the whole heliosphere is saturated by the solar wind, a supersonic and super-Alfvenic plasma flow with a regular and turbulent magnetic field, continuously expanding into interplanetary space. Galactic cosmic ray (GCR) protons (we consider GCR protons with energies greater than 10^9 eV) propagating through the interplanetary medium undergo the influence (modulation) of the divergent solar wind (a source of convection and adiabatic energy change of the GCR particles) and of the combination of the regular (averaged) and turbulent (stochastic with zero mean) interplanetary magnetic field, which are the sources of various kinds of drifts and of diffusion of the GCR particles. The modulation (changes) of the GCR intensity is generally typified by different long and short scale quasi-periodic variations (22-year, 11-year and 27-day), and short time (a few days) changes related mainly to similar changes of the solar activity and solar wind. A short time decrease and recovery of the GCR intensity during 8-10 days is called a Forbush decrease (Fd) [4]. The time dependent modulation of the GCR intensity, besides the above mentioned processes, depends on the energy or on the rigidity of the GCR particles, so to describe the modulation (changes) of the GCR intensity the time and rigidity dependent model of the transport equation for the three dimensional (3-D) heliosphere should be considered. However, in particular cases, e.g. for long period changes of the GCR intensity, it is possible to consider stationary (time independent) modulation, diminishing in this way the dimension of the transport equation and simplifying the problem of its solution [5]. In this paper our aim is to propose a method of numerical solution of the transport equation in the general case, i.e. to consider the time and rigidity dependent 3-D model of GCR particle propagation in the heliosphere (in the interplanetary space).
2 GCR Particles Transport Equation
GCR particles propagate through the causal and stochastic interplanetary magnetic field and their motion is described by the anisotropic diffusion equation in the heliosphere [6,7,8,9,10,11,12]

$\dfrac{\partial N}{\partial t} = \nabla_i (K_{ij}\nabla_j N) - \nabla_i(U_i N) + \dfrac{1}{3}\dfrac{\partial}{\partial R}(N R)(\nabla_i U_i), \qquad (1)$

where N and R are the density and rigidity of cosmic ray particles, respectively; $U_i$ is the solar wind velocity, $K_{ij}$ is the anisotropic diffusion tensor of GCR for the three dimensional IMF with components $B_r$, $B_\theta$ and $B_\varphi$, and t is time. The term $\nabla_i (K_{ij}\nabla_j N)$ describes the diffusion of the GCR particles, the term $\nabla_i(U_i N)$ the convection, and $\frac{1}{3}\frac{\partial}{\partial R}(N R)(\nabla_i U_i)$ the change of the particle energy due to the interaction of the particles with the divergent solar wind and the IMF turbulence. In the heliocentric spherical coordinate system $(r,\theta,\varphi)$, for the anisotropic diffusion tensor $K_{ij}$ and for the radial solar wind velocity $U \equiv U(U_r,0,0)$, equation (1) has the form

$\dfrac{\partial N}{\partial t} = \dfrac{1}{r^2}\dfrac{\partial}{\partial r}\Big(r^2 K_{rr}\dfrac{\partial N}{\partial r} + r K_{r\theta}\dfrac{\partial N}{\partial \theta} + \dfrac{r}{\sin\theta} K_{r\varphi}\dfrac{\partial N}{\partial \varphi}\Big)$
$\quad + \dfrac{1}{r\sin\theta}\dfrac{\partial}{\partial \theta}\Big(\sin\theta\, K_{\theta r}\dfrac{\partial N}{\partial r} + \dfrac{\sin\theta}{r} K_{\theta\theta}\dfrac{\partial N}{\partial \theta} + \dfrac{1}{r} K_{\theta\varphi}\dfrac{\partial N}{\partial \varphi}\Big)$
$\quad + \dfrac{1}{r\sin\theta}\dfrac{\partial}{\partial \varphi}\Big(K_{\varphi r}\dfrac{\partial N}{\partial r} + \dfrac{1}{r} K_{\varphi\theta}\dfrac{\partial N}{\partial \theta} + \dfrac{1}{r\sin\theta} K_{\varphi\varphi}\dfrac{\partial N}{\partial \varphi}\Big)$
$\quad - \dfrac{1}{r^2}\dfrac{\partial}{\partial r}(r^2 N U_r) + \Big(\dfrac{N}{3} + \dfrac{R}{3}\dfrac{\partial N}{\partial R}\Big)\dfrac{1}{r^2}\dfrac{\partial}{\partial r}\big(r^2 U_r\big). \qquad (2)$
We set the dimensionless density $f = N/N_0$, time $t^* = t/t_0$, distance $r = r/r_0$, and rigidity $R = R/1\,\mathrm{GV}$, where N and $N_0$ are the densities in the interplanetary space and in the local interstellar medium (LISM), respectively; $N_0 = 4\pi I_0$, where the intensity $I_0$ in the LISM [13,14] has the form $I_0 = \dfrac{21.1\,T^{-2.8}}{1 + 5.85\,T^{-1.22} + 1.18\,T^{-2.54}}$, and T is the kinetic energy in GeV ($T = \sqrt{R^2 + 0.938^2} - 0.938$); $t_0$ is the characteristic time of GCR modulation corresponding to the change of the electromagnetic conditions in the modulation region for the given class of GCR variation; r is the radial distance and $r_0$ is the size of the modulation region. For the case of the Fd we assume that $t_0$ is equal to the Sun's complete rotation period (27 days, about $2.3328\cdot 10^6$ s) and the size of the modulation region $r_0$ equals 30 AU; $K_{ij}$ is the anisotropic diffusion tensor of GCR for the three dimensional IMF taken as in [15,16]. Eq. (1) for the dimensionless variables f, $t^*$ and r in the spherical coordinate system $(r,\theta,\varphi)$ can be written as
$A_0\,\dfrac{\partial f}{\partial t^*} = M f + A_{11}\,\dfrac{\partial f}{\partial R}, \qquad (3)$

where

$M = A_1\dfrac{\partial^2}{\partial r^2} + A_2\dfrac{\partial^2}{\partial \theta^2} + A_3\dfrac{\partial^2}{\partial \varphi^2} + A_4\dfrac{\partial^2}{\partial r\,\partial\theta} + A_5\dfrac{\partial^2}{\partial \theta\,\partial\varphi} + A_6\dfrac{\partial^2}{\partial r\,\partial\varphi} + A_7\dfrac{\partial}{\partial r} + A_8\dfrac{\partial}{\partial \theta} + A_9\dfrac{\partial}{\partial \varphi} + A_{10}.$

The coefficient $A_0 = r_0^2/(t_0 K_{0\parallel})$ and the coefficients $A_1, A_2, \ldots, A_{11}$ are functions of the spherical coordinates $(r,\theta,\varphi)$, the rigidity R of the GCR particles and the time $t^*$.
3 Method of the Numerical Solution of the Transport Equation
Solving the full transport equation (3) is very complicated, even after some simplifications, e.g.:
∂f ∂R
= 0, Eq. (3) is a parabolic time-dependent 3-D equation ∂f M = f, ∗ ∂t A0
(4)
∂f 2. if ∂t ∗ = 0, Eq. (3) is a stationary, but it remains as a 3-D parabolic problem with respect to the rigidity R
∂f M =− f. ∂R A11
(5)
The full model describing short period changes of the GCR intensity (e.g. Forbush decrease) requires the non stationary Parker’s transport equation (3) containing the simultaneous dependence of the GCR intensity on the particles rigidity and time. The equation (1) is not possible to solve analytically in general form without the extensive simplifications which makes the model to poor to describe the processes in the interplanetary space during the Fd. So, there is necessary to solve numerically the appropriate model based on the Eq. (3). Unfortunately, up to present there is no standard numerical method of solution of the partial differential equation of the form (3) which could take into consideration simultaneous dependence on time and on the GCR particles rigidity. We propose an approach to the numerical solution of this type of partial differential equations, as follows: First, we consider the equation (5) and transform it by the implicit finite difference scheme method [17] to the system of the algebraic equations: 2A1 + Δr1 A7 2A2 + Δθ1 A8 n+1 + fi,j+1,k Δr1 (Δr2 + Δr1 ) Δθ1 (Δθ2 + Δθ1 ) 2A3 + Δϕ1 A9 n+1 + fi,j,k+1 (6) Δϕ1 (Δϕ2 + Δϕ1 ) 2A1 2A2 2A3 A11 n+1 + fi,j,k ) (− − − + A10 − Δr1 Δr2 Δθ1 Δθ2 Δϕ1 Δϕ2 ΔR A4 A5 n+1 n+1 + fi+1,j+1,k + fi+1,j,k+1 (Δr1 + Δr2 )(Δθ1 + Δθ2 ) (Δr1 + Δr2 )(Δϕ1 + Δϕ2 ) A A4 6 n+1 n+1 + fi,j+1,k+1 + fi+1,j−1,k (Δθ1 + Δθ2 )(Δϕ1 + Δϕ2 ) (Δr1 + Δr2 )(Δθ1 + Δθ2 ) n+1 fi+1,j,k
Numerical Solution of the Second Order Partial Differential Equation
109
−A5 −A6 n+1 + fi,j+1,k−1 (Δr1 + Δr2 )(Δϕ1 + Δϕ2 ) (Δθ1 + Δθ2 )(Δϕ1 + Δϕ2 ) −A4 −A5 n+1 n+1 + fi−1,j+1,k + fi−1,j,k+1 (Δr1 + Δr2 )(Δθ1 + Δθ2 ) (Δr1 + Δr2 )(Δϕ1 + Δϕ2 ) −A6 2A1 − Δϕ2 A7 n+1 n+1 + fi,j−1,k+1 + fi−1,j,k (Δθ1 + Δθ2 )(Δϕ1 + Δϕ2 ) Δϕ2 (Δϕ2 + Δϕ1 ) 2A − Δθ A 2A − Δϕ 2 2 8 3 2 A9 n+1 n+1 + fi,j,k−1 + fi,j−1,k Δθ2 (Δθ2 + Δθ1 ) Δϕ2 (Δϕ2 + Δϕ1 ) A4 A5 n+1 n+1 + fi−1,j−1,k + fi−1,j,k−1 (Δr1 + Δr2 )(Δθ1 + Δθ2 ) (Δr1 + Δr2 )(Δϕ1 + Δϕ2 ) A6 A11 n+1 n + fi,j−1,k−1 + fi,j,k = 0. (Δϕ1 + Δϕ2 )(Δθ1 + Δθ2 ) ΔR n+1 + fi+1,j,k−1
The bottom index (i, j, k) in f^n_{i,j,k} denotes the point of the grid in the space (r, θ, ϕ) (the grid is 91x91x72x25x81: 91 steps in distance, 91 in heliolatitude, 72 in heliolongitude and 25 in rigidity), and the upper index n denotes the rigidity layer. As mentioned above, Eq. (5) is a parabolic-type problem with respect to the rigidity R. So, to solve Eq. (5), the following initial and boundary conditions [18,19,20,21] were used:
– initial condition: f|_{R0=150 GV} = 1, which means that R0 = 150 GV is the upper limit of unmodulated rigidity of the GCR particles;
– boundary conditions: with respect to the distance r, f|_{r=30 AU} = 1 and ∂f/∂r|_{r=0} = 0; with respect to the heliolatitude θ, ∂f/∂θ|_{θ=0°} = ∂f/∂θ|_{θ=180°} = 0; and with respect to the heliolongitude ϕ, f|_{ϕ=ϕ1} = f|_{ϕ=ϕL+1} and f|_{ϕ=ϕ−1} = f|_{ϕ=ϕL−1}.

Then the system of equations (6) was solved by the Gauss–Seidel iteration method [17]. This is the first stage of the solution of the full Eq. (3): we find solutions for each rigidity layer R (R1 = 130 GV, R2 = 110 GV, ..., RL = 10 GV) for the stationary case. These solutions for the individual rigidity layers are then taken as the initial conditions for the non-stationary case (Eq. (3)) for the given rigidity R at the time t* = 0:

f(r, θ, ϕ, R, t*)|_{t*=0} = f(r, θ, ϕ, R).   (7)
Then equation (3) was transformed (by the same method as Eq. (5)) into the system of algebraic equations:

f^{t+1}_{i+1,j,k} (2A1 + Δr1 A7)/(Δr1(Δr2 + Δr1)) + f^{t+1}_{i,j+1,k} (2A2 + Δθ1 A8)/(Δθ1(Δθ2 + Δθ1))
+ f^{t+1}_{i,j,k+1} (2A3 + Δϕ1 A9)/(Δϕ1(Δϕ2 + Δϕ1))
+ f^{t+1}_{i,j,k} (−2A1/(Δr1 Δr2) − 2A2/(Δθ1 Δθ2) − 2A3/(Δϕ1 Δϕ2) + A10 − A0/Δt*)
+ f^{t+1}_{i+1,j+1,k} A4/((Δr1 + Δr2)(Δθ1 + Δθ2)) + f^{t+1}_{i+1,j,k+1} A5/((Δr1 + Δr2)(Δϕ1 + Δϕ2))
+ f^{t+1}_{i,j+1,k+1} A6/((Δθ1 + Δθ2)(Δϕ1 + Δϕ2)) + f^{t+1}_{i+1,j−1,k} A4/((Δr1 + Δr2)(Δθ1 + Δθ2))
+ f^{t+1}_{i+1,j,k−1} (−A5)/((Δr1 + Δr2)(Δϕ1 + Δϕ2)) + f^{t+1}_{i,j+1,k−1} (−A6)/((Δθ1 + Δθ2)(Δϕ1 + Δϕ2))
+ f^{t+1}_{i−1,j+1,k} (−A4)/((Δr1 + Δr2)(Δθ1 + Δθ2)) + f^{t+1}_{i−1,j,k+1} (−A5)/((Δr1 + Δr2)(Δϕ1 + Δϕ2))
+ f^{t+1}_{i,j−1,k+1} (−A6)/((Δθ1 + Δθ2)(Δϕ1 + Δϕ2)) + f^{t+1}_{i−1,j,k} (2A1 − Δr2 A7)/(Δr2(Δr2 + Δr1))
+ f^{t+1}_{i,j−1,k} (2A2 − Δθ2 A8)/(Δθ2(Δθ2 + Δθ1)) + f^{t+1}_{i,j,k−1} (2A3 − Δϕ2 A9)/(Δϕ2(Δϕ2 + Δϕ1))
+ f^{t+1}_{i−1,j−1,k} A4/((Δr1 + Δr2)(Δθ1 + Δθ2)) + f^{t+1}_{i−1,j,k−1} A5/((Δr1 + Δr2)(Δϕ1 + Δϕ2))
+ f^{t+1}_{i,j−1,k−1} A6/((Δϕ1 + Δϕ2)(Δθ1 + Δθ2)) + f^{t}_{i,j,k} A0/Δt* + A11 (f^{n}_{i,j,k} − f^{n+1}_{i,j,k})/ΔR = 0,   (8)
where the bottom index (i, j, k) in f^n_{i,j,k} denotes the point of the grid in the space (r, θ, ϕ), the upper index n denotes the rigidity layer and the upper index t (f^t_{i,j,k}) denotes the time layer. The system of equations (8), corresponding to the non-stationary case (Eq. (3)), was solved by the Gauss–Seidel iteration method for each rigidity layer (Fig. 1).
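For orientation, a minimal C sketch of a single Gauss–Seidel sweep over a stencil system of this type is given below. It is not the authors' code: the coefficient arrays, the restriction to the seven "axis" terms of the stencil (the cross-derivative terms of (6)/(8) would be added to the same sum), and the periodic handling of the heliolongitude index are assumptions made only to keep the sketch short and self-contained; the grid sizes follow the values quoted above.

```c
#include <math.h>

#define NR 91   /* distance steps       */
#define NT 91   /* heliolatitude steps  */
#define NP 72   /* heliolongitude steps */

typedef double grid_t[NR][NT][NP];

/* One in-place Gauss-Seidel sweep: each point is updated immediately using
 * the newest neighbour values; returns the largest change of the sweep. */
double gauss_seidel_sweep(grid_t f, grid_t rhs, grid_t a0,
                          grid_t ar, grid_t al, grid_t au,
                          grid_t ad, grid_t ae, grid_t aw)
{
    double max_change = 0.0;
    for (int i = 1; i < NR - 1; ++i)        /* interior points only; boundary */
        for (int j = 1; j < NT - 1; ++j)    /* conditions handled separately  */
            for (int k = 0; k < NP; ++k) {
                int kp = (k + 1) % NP, km = (k + NP - 1) % NP;  /* periodic in phi */
                double sum = ar[i][j][k] * f[i + 1][j][k] + al[i][j][k] * f[i - 1][j][k]
                           + au[i][j][k] * f[i][j + 1][k] + ad[i][j][k] * f[i][j - 1][k]
                           + ae[i][j][k] * f[i][j][kp]   + aw[i][j][k] * f[i][j][km];
                double fnew = (rhs[i][j][k] - sum) / a0[i][j][k];
                double change = fabs(fnew - f[i][j][k]);
                if (change > max_change) max_change = change;
                f[i][j][k] = fnew;          /* in-place update = Gauss-Seidel */
            }
    return max_change;
}
```

Sweeps of this kind would be repeated until the returned maximum change drops below a prescribed tolerance, which is exactly the convergence test of the Gauss–Seidel iteration.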
Fig. 1. The first numerical scheme applied in the solution of the three dimensional non stationary transport equation (3)
The proposed model of the short-time changes (Fd) of the GCR intensity is driven by the changes of the diffusion coefficient K taking place in the disturbed region of interplanetary space [18,19,20,21], corresponding to the experimental data observed on 9-20 September 2005 by neutron monitors on
Fig. 2. Temporal changes of the GCR intensity (smoothed over 3 days) registered by Calgary, Halekala and Thule neutron monitors and Nagoya N0VV muon telescope in the period of 9-20 September 2005 corresponding to the effective rigidities: Calgary(Canada), Thule(Greenland) ∼ 10 GV, Halekala (USA)∼ 15 ÷ 20 GV, and Nagoya (Japan)∼ 25 ÷ 35GV
the Earth's surface (Fig. 2). The diffusion coefficient included in the model has the form K = K0 K(r) K(R, ν(t*)), where K0 = 4.5 × 10^21 cm²/s is the normalization value of the diffusion coefficient for a rigidity of 10 GV, K(r) = 1 + 0.5r describes the dependence of the diffusion coefficient on the distance r, and K(R, ν(t*)) = R^(1.2 − 35((5t* − 0.5)/3)^1.5 exp(−8(t* − 0.1))) describes its dependence on rigidity and time. Solutions of the transport Eq. (3) for different GCR particle rigidities are presented in Fig. 3. We see from Fig. 3 that the expected temporal changes of the GCR intensity show behaviour similar to the experimental data (Fig. 2).
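A small C helper illustrating this factorization is sketched below. It is only an illustration, not the authors' code: the exponent ν(t*) is passed in as a parameter, because its exact analytic form above is reconstructed from the text and should be treated as an assumption.

```c
#include <math.h>

/* K = K0 * K(r) * K(R, nu(t*)); r in AU, R in GV, nu supplied by the caller. */
double diffusion_coefficient(double r, double R, double nu)
{
    const double K0 = 4.5e21;      /* cm^2/s, normalization at 10 GV (from the text) */
    double Kr = 1.0 + 0.5 * r;     /* radial dependence K(r)                          */
    double KR = pow(R, nu);        /* rigidity dependence with time-varying exponent  */
    return K0 * Kr * KR;
}
```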
Fig. 3. Changes of the expected GCR intensity for different GCR particles rigidities obtained as the solutions of the transport Eq. (3) based on the I numerical scheme (Fig. 1)
Using the above-presented transformations (6) and (8), the solution of equation (3) can also be found in another way. Namely, we find solutions for each rigidity layer R (R1 = 130 GV, R2 = 110 GV, ..., RL = 10 GV) for the stationary case (Fig. 4). Then we find the solution of the non-stationary case for the layer R1 = 130 GV (as before), and then the solutions corresponding to the times Δt, 2Δt, 3Δt, ..., t, which are used as the initial conditions for all the other rigidity layers, as shown by the arrows in Fig. 4. Results of the solution of Eq. (3) based
Fig. 4. The second numerical scheme applied in the solution of the three dimensional non stationary transport equation (3)
Fig. 5. Changes of the expected amplitudes of the Fd of the GCR intensity for different GCR particles rigidities obtained as the solutions of the transport Eq. (3) for the Fd based on the I and II numerical scheme (Fig. 1 and Fig. 4)
on the II numerical scheme (Fig. 4) are presented in Fig. 5. Fig. 5 shows that the results obtained with the II numerical scheme coincide with the results obtained with the I numerical scheme within the accuracy of the numerical solutions. This comparison, to some extent, makes it possible to judge the validity of our solutions. However, in spite of the good coincidence of the numerical solutions obtained with the two different schemes, there remains a natural question about the validity of the solutions. For such a complicated equation as Eq. (3) (a time- and rigidity-dependent 3-D second order partial differential equation) we do not have any theoretical method allowing us to judge the validity of the solutions found by numerical methods. So, in cases of this kind only a comparison of the solutions with the experimental data (Fig. 6) can play a decisive role. Fig. 6 presents the results of the solutions of Eq. (3) corresponding to the GCR particle rigidity of 10 GV based on the I (Fig. 1)
Fig. 6. Changes of the expected GCR intensity for rigidity 10 GV obtained as the solutions of the transport Eq. (3) and the GCR intensity registered by Calgary neutron monitor in 9-20 September 2005 (Fig. 2)
and II (Fig. 4) numerical schemes. Also shown are the experimental data on the changes of the GCR intensity during the Fd observed by the Calgary neutron monitor (corresponding to a rigidity of 10 GV as well) on 9-20 September 2005 (Fig. 2). There is good agreement between the profiles and levels of the expected intensity obtained by the I and II schemes on the one hand (correlation coefficient equal to 0.99 ± 0.05), and between both solutions and the experimental data (correlation coefficient equal to 0.99 ± 0.05 for the first and the second scheme) on the other. Thus, we state that the numerical approach presented in this paper can successfully be applied to the solution of the time- and rigidity-dependent 3-D second order partial differential equation describing the anisotropic diffusion of the GCR particles in the heliosphere.
4 Conclusions
A numerical scheme based on the implicit finite difference method is proposed for the solution of the time- and rigidity-dependent 3-D second order partial differential equation describing the anisotropic diffusion of the GCR particles in the heliosphere. We show a satisfactory correspondence between the results of the numerical solution and the experimental data. We conclude that the proposed numerical method can successfully be used for the solution of the time- and rigidity-dependent 3-D second order partial differential equation describing different classes of GCR intensity variations caused by the anisotropic diffusion of the GCR particles in the heliosphere.
References 1. Parker, E.N.: Dynamics of the Interplanetary Gas and Magnetic Fields. Astrophys. J. 128, 664 (1958) 2. Jokipii, J.R., Thomas, B.: Effects of drift on the transport of cosmic rays. IV Modulation by a wavy interplanetary current sheet. Astrophysical Journal, Part 1 243, 1115–1122 (1981)
3. Stone, E.C., Cummings, A.C., McDonald, F.B., et al.: Voyager 1 Explores the Termination Shock Region and the Heliosheath Beyond. Science 309, 2017 (2005) 4. Forbush, S.: World-Wide Cosmic-Ray Variations 1937-1952. Journal of Geophysical Research 59, 525 (1954) 5. Alania, M.V., Wawrzynczak, A., Bishara, A.A., et al.: Three Dimensional Modeling of the Recurrent Forbush Effect of Galactic Cosmic Rays. In: Proceedings 29th International Cosmic Ray Conference, Pune, India SH 2.6, vol. 1, pp. 367–370 (2005) 6. Krymski, G.F.: Diffusion mechanism of daily variation of galactic cosmic ray. Geomagnetism and Aeronomy 4(6), 977–986 (1964) 7. Parker, E.N.: The passage of energetic charged particles through interplanetary space. Planetary and Space Science 13, 9 (1965) 8. Axford, W.I.: Anisotropic diffusion of solar cosmic rays. Planet. Space Sci. 13, 115 (1965) 9. Gleeson, L.J., Axford, W.I.: Solar Modulation of Galactic Cosmic Rays. Astrophys. J. 149, 115–118 (1968) 10. Dolginov, A.Z., Toptygin, I.N.: Cosmic rays in the interplanetary magnetic fields. Ikarus 8(1-3), 54–60 (1968) 11. Jokipii, J.R.: Cosmic-Ray Propagation. I. Charged Particles in a Random Magnetic Field. Astrophysical Journal 146, 480 (1966) 12. Zhang, M.: A Markov Stochastic Process Theory of Cosmic-Ray Modulation. Astrophys. J. 513(1), 409–420 (1999) 13. Webber, W.R., Lockwood, J.A.: Voyager and Pioneer spacecraft measurements of cosmic ray intensities in the outer heliosphere: Toward a new paradigm for understanding the global modulation process: 2. Maximum solar modulation (19901991). J. Geophys. Res. 106(9), 323 (2001) 14. Caballero-Lopez, R.A., Moraal, H.: Limitations of the force field equation to describe cosmic ray modulation. J. Geophys. Res. 109, A01101 (2004) 15. Alania, M.V., Dzhapiashvili, T.V.: The Expected Features of Cosmic Ray Anisotropy due to Hall-Type Diffusion and the Comparison with the Experiment. In: 16th International Cosmic Ray Conference, vol. 3(80), p. 19 (1979) 16. Alania, M.V.: Stochastic Variations of Galactic Cosmic Rays. Acta Physica Polonica B 33(4), 1149–1165 (2002) 17. Kincaid, D., Cheney, W.: Numerical Analysis. WNT, Warsaw (2006) 18. Alania, M.V., Wawrzynczak, A.: On Peculiarities of The Anisotropic Diffusion During Forbush Effects of Galactic Cosmic Rays. Acta Phys. Polonica B 35(4), 1551–1563 (2004) 19. Wawrzynczak, A., Alania, M.V.: Peculiarities of galactic cosmic ray Forbush effects during October-November 2003. Adv. in Space Res. 35, 682 (2005) 20. Wawrzynczak, A., Alania, M.V., Modzelewska, R.: Features of the Rigidity Spectrum of Galactic Cosmic Ray Intensity during the Recurrent Forbush Effect. Acta Phys. Polonica B 37(5), 1667–1676 (2006) 21. Wawrzynczak, A., Alania, M.V.: Modeling of the recurrent Forbush effect of the galactic cosmic ray intensity and comparison with the experimental data. Adv. in Space Res. 41, 325–334 (2008)
Hardware Implementation of the Exponent Based Computational Core for an Exchange-Correlation Potential Matrix Generation
Maciej Wielgosz, Ernest Jamro, and Kazimierz Wiatr
AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Krakow; ACK Cyfronet AGH, ul. Nawojki 11, 30-950 Krakow
{wielgosz,jamro,wiatr}@agh.edu.pl
Abstract. This paper presents an FPGA implementation of a calculation module for a finite sum of exponential products (the orbital function). The module is composed of several specially designed floating-point modules which are fully pipelined and optimized for high-speed performance. The hardware implementation revealed a significant speed-up of the finite sum of exponential products calculation, ranging from 2.5x to 20x in comparison to the CPU. The orbital function is a computationally critical part of the Hartree-Fock algorithm. The presented approach aims to increase the performance of a part of a quantum chemistry computational system by employing an FPGA-based accelerator. Several issues are addressed, such as the identification of proper code fragments, porting a part of the Hartree-Fock algorithm to the FPGA, data precision adjustment and data transfer overheads. The authors' intention was also to make the hardware implementation of the orbital function universal and easily attachable to different systems.
Keywords: HPRC (High Performance Reconfigurable Computing), FPGA, quantum chemistry, floating-point.
1 Introduction
High performance computing has been recognized for many years as an area dominated only by microprocessors. The rapid increase of the logic resources of modern FPGAs (Field Programmable Gate Arrays) gives a huge opportunity to adopt hardware accelerators for calculations which involve processing a large volume of data. Such a vital rise in the computational capabilities of FPGAs has stimulated investigations to construct a reconfigurable hardware accelerator [1]. The proposed hardware module aims to speed up HPC chemistry and physics calculations and provide compatibility with the standard floating-point data formats. The RASC [2] platform has been chosen as a hardware accelerator because of the computational capabilities it offers for SMP (Symmetric Multiprocessing) supercomputers. Since the achievable computation speed-up strictly depends on the
selected part of the algorithm, its choice becomes a critical issue. The finite sum of exp() functions was therefore found to be a proper candidate for hardware acceleration. This operation is predominant in scientific computations, therefore the described solution may be employed in many kinds of applications (e.g. Quantum Monte Carlo) [3,4]. However, the implementation described here should be considered as a first step in the hardware implementation of the exchange-correlation potential calculation. The ultimate goal is to design a hardware unit capable of generating an exchange-correlation potential matrix, which is one of the most computationally intensive routines within the Hartree-Fock calculations.
2 Algorithm
The time-independent Schroedinger equation for a system of N particles interacting via the Coulomb interaction is:

Ĥψ = Eψ   (1)

Ĥ = Σ_{i=1}^{N} (−(ħ²/2m) ∇ᵢ²) + (1/2) Σ_{i=1}^{N} Σ_{j≠i}^{N} Zᵢ Zⱼ / (4πε₀ |rᵢ − rⱼ|)   (2)
where: ψ is an N-body wavefunction, r denotes spatial positions and Z represents the charges of the individual particles. E denotes the energy of either the ground or an excited state of the system. Errors of approximations made in obtaining the wavefunction will be manifested in any property derived from the wave function. Where high accuracy is required, considerable attention must be paid to the derivation of the wavefunction and any approximations made. Therefore there has been careful investigation into what level of precision should be adopted in calculations to ensure satisfactory approximation. The general procedure of solving the Hartree-Fock equations is to make the orbitals self-consistent with the potential field they generate. It is achieved through an iterative trial-and-error computational process, thus the entire procedure is called the self-consistent field (SCF) method. The SCF procedure of solving the Hartree-Fock equation leads to the following set of eigenvalue equations in matrix formulation: F C − SCE = 0
(3)
F is the Fock operator, C the matrix of the unknown coefficients, S the overlap matrix and E the matrix of energy eigenvalues. All matrices are of the same size. Solving the Hartree-Fock equation is an iterative process. The coefficients (corresponding to an electric field) are used to build the Fock operator F, with which the system of linear equations is solved again to get a new solution (a new electric field). The procedure is repeated until the solution no longer changes.
The Fock operator depends on orbitals, which in turn are its eigenfunctions. Therefore orbitals have to be calculated in each iteration of the whole Hartree-Fock algorithm. That is why the finite sum of exponential products contributes significantly to the overall performance of the application. The orbital function is expressed by equation (4) [5]:

χ_klm(r) = r_x^k r_y^l r_z^m Σ_i C_i N_i e^(−α_i r²)   (4)
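A direct, non-pipelined evaluation of (4) can be written in a few lines of C. The sketch below is given only to make explicit what the hardware EP module computes; the array names C, N, alpha and the primitive count n_prim are hypothetical and do not come from the paper.

```c
#include <math.h>

/* Direct evaluation of the orbital value (4) at a point r = (rx, ry, rz). */
double orbital(int k, int l, int m, double rx, double ry, double rz,
               const double *C, const double *N, const double *alpha, int n_prim)
{
    double r2 = rx * rx + ry * ry + rz * rz;
    double sum = 0.0;
    for (int i = 0; i < n_prim; ++i)            /* finite sum of exponential products */
        sum += C[i] * N[i] * exp(-alpha[i] * r2);
    return pow(rx, k) * pow(ry, l) * pow(rz, m) * sum;
}
```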
The algorithm (equ. 4) was split into several sections [8] to allow adoption of a modular design approach. Consequently, a fully pipelined structure was obtained.
Fig. 1. EP module block diagram
Three different units have been employed within the EP module (Fig. 1). All of them are specially designed, fully pipelined floating-point modules [6,7] optimized for high-speed performance. Their input widths as well as their internal data paths scale from single to double precision. The input and output data format complies with the IEEE-754 floating-point standard. Nevertheless, intermediate data representations employ a different, non-standard floating-point format in order to reduce hardware resources.
1. Multiplier. For double-precision inputs the mantissa is 53 bits (54 bits including the leading one) wide, therefore the product is 108 bits wide. Such a bit-width is far beyond the required precision, therefore the LSBs of the product are usually disregarded [10]. Consequently the LSB part of the multiplier performs operations which are not used in the subsequent arithmetic operations. As a result, in the proposed architecture, some of the LSB logic is not implemented at all. This approach allows for a significant area reduction. A more detailed description of the multiplier will be available in a separate paper.
2. Exponential function. This is a mixed table-polynomial implementation of the exp function [6,7], based on the commonly known mathematical identities

   e^x = 2^(x·log2(e)) = 2^(xi) · e^(xf),   (5)

   where xi is the integer part of x·log2(e) and xf = x − xi/log2(e), and

   e^(x1+x2) = e^(x1) · e^(x2),   (6)

   where x1, x2 are different bit-fields of xf (a software sketch of identity (5) is given after this list).
3. Accumulator. This is a fully pipelined, mixed-precision floating-point unit. It can process one datum per clock cycle due to its advanced and optimized structure. A more detailed description of the accumulator will be available in a separate paper.
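As a software illustration of identity (5), the C fragment below splits the argument in the same way. It sketches only the mathematical identity, not the hardware table-polynomial unit (which further decomposes e^(xf) using (6)); the function name is hypothetical.

```c
#include <math.h>

/* e^x = 2^xi * e^xf, with xi the integer part of x*log2(e) and xf = x - xi/log2(e). */
double exp_split(double x)
{
    const double log2e = 1.4426950408889634;
    double xi = floor(x * log2e);       /* integer part of x*log2(e)        */
    double xf = x - xi / log2e;         /* remainder, 0 <= xf < ln 2        */
    return ldexp(exp(xf), (int)xi);     /* 2^xi * e^xf, equals exp(x)       */
}
```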
3 Platform
The Altix 4700 series [11] is a family of multiprocessor distributed shared memory (DSM) computer systems that currently ranges from 8 to 512 CPU sockets (up to 1,024 processor cores) and can accommodate up to 6 TB of globally shared memory in a single system while delivering a teraflop of performance in a small-footprint rack. The RASC module communicates with the Altix system in the same manner as any CPU does, i.e. employing the NUMAlink interconnection. Data transfer between the FPGA and the NUMAlink bus is realized by an application-specific integrated circuit (ASIC) denoted as TIO. The SGI RASC RC100 Blade consists of two Virtex-4 LX 200 [9] FPGAs, with 40 MB of external SRAM logically organized as two 16 MB blocks (as shown in Fig. 2) and an 8 MB block. Each QDR (Quad Data Rate) SRAM block is capable of transferring 128-bit data every clock cycle (at 200 MHz) both for reads and writes. The RC100 Blade is connected using the low-latency NUMAlink interconnect to the SGI Altix 4700 host system, for a rated peak bandwidth of 3.2 GB per second in each direction. The RASC algorithm execution procedure, from the processor's perspective, is composed of several functions which reserve resources, queue commands and perform other preparation steps. The resource reservation procedure,
Fig. 2. RASC architecture and the host-fpga data link
once conducted, allows many runs of the algorithm, which amounts to huge time savings, since the procedure takes approximately 7.5 ms. Therefore the optimal structure of a hardware-accelerated application should contain as few initializations of the FPGA chips as possible, so that the FPGA configuration time is only a fraction of the algorithm execution time.
4 Implementation and Acceleration
The EP module implementation results on the RASC [2] platform are presented below. Core services, provided by SGI together with the RASC module, are extra logic incorporated in the FPGA which provides communication with the on-board hub. The units which the EP module is built of are strongly parametrized, therefore the resource occupation varies depending on the data width (calculation precision) of the module.

Table 1. Implementation results of the EP module - single precision

Implementation results | # 4-input LUT | # flip-flops | # 18-Kb BRAMs
EP module alone | 2229 (1%) | 1975 (1%) | 2 (0.006%)
EP module with the core services | 11560 (7%) | 15922 (9%) | 25 (7%)
The exp() function and the accumulator module consume the highest amount of resources, due to their complexity. The overall pipeline latency of the EP module is 33 clock cycles for the single precision data format. The latency strongly depends on the width of the input data.
Table 2. Implementation results of the EP module - double precision

Implementation results | # 4-input LUT | # flip-flops | # 18-Kb BRAMs
EP module alone | 8684 (4.8%) | 7891 (4%) | 6 (0.02%)
EP module with the core services | 18015 (10%) | 21838 (12%) | 29 (8%)

Table 3. Implementation results of the elementary modules - single precision

Module | # 4-input LUT | # flip-flops | # 18-Kb BRAMs
Exp() | 820 (0.5%) | 920 (0.5%) | 2 (0.006%)
Multiplier | 549 (0.3%) | 444 (0.25%) | 0
Accumulator | 860 (0.5%) | 605 (0.4%) | 0

Table 4. Implementation results of the elementary modules - double precision

Module | # 4-input LUT | # flip-flops | # 18-Kb BRAMs
Exp() | 5025 (3%) | 5223 (3%) | 6 (1.8%)
Multiplier | 2316 (1%) | 1850 (1%) | 0
Accumulator | 1343 (0.5%) | 818 (0.4%) | 0

Table 5. Delays introduced by the EP module's components - single precision

Module | Multiplier | Exp() | Accumulator | EP module
Pipeline delay [clk] | 4 | 21 | 8 | 33

Table 6. Delays introduced by the EP module's components - double precision

Module | Multiplier | Exp() | Accumulator | EP module
Pipeline delay [clk] | 5 | 30 | 10 | 45
It is worth underlining that all the modules have adjustable widths of data paths within the range from single to double precision. The Hartree-Fock algorithm is executed differently depending on the set of molecules it is employed for. This, in turn, impacts the precision requirements at different stages of the computations. It may happen that for a certain chemical substance some sections of the algorithm process much more data than others. Besides, in the Hartree-Fock algorithm the result precision increases (the error decreases) with every algorithm iteration, therefore the calculation precision may be scaled with the iteration number. Utilizing FPGAs in Hartree-Fock computations allows for the adjustment of data bus widths within the system so that the proper precision is maintained across all the stages of the algorithm execution. The FPGA configuration time (7.5 ms) is significantly shorter than the overhead introduced by an oversized data format. It may appear that some sections of calculations can be successfully computed at single precision (or any other precision, e.g. 48 bits),
saving a huge amount of resources that would otherwise be wasted if double precision data format was adopted. A hardware approach allows gradual switching from single to double data format together with the algorithm’s rising demand for precision.
Fig. 3. Example of the precision adjustment within the EP module
All the data connections within the EP module are adjustable, so it is possible to change them in order to meet precision requirements. Different approaches can be taken to parameterize the EP module, but the most common is a top-down method, which involves gradual modification of the inter-module interface widths. Eventually the number of guard bits of the basic components is established. For instance (Fig. 3), the optimal data width with regard to precision for the interface between the first multiplier and the exp() unit is 37 bits. The EP module is intended to be a part of a larger system performing the exchange-correlation potential calculation, so some preliminary tests were carried out in order to determine the top achievable performance of the module on the RASC platform. Each of the EP modules performs calculations in single precision, so it is possible to aggregate two of them to increase the overall throughput. Additionally, due to the algorithm's structure, some of the data (atom base coefficients) is used multiple times, so only a single 32-bit data chunk is streamed in and out of the FPGA. This in turn allows up to four EP modules (single precision) to be aggregated on a
Fig. 4. Configuration of EP modules on the RASC platform
single FPGA. This holds because the parallelization degree is limited by the external memory data transfer rate rather than by the FPGA resources. A single exponential operation together with the two-directional data transfer takes roughly 8 ns on the RASC platform, whereas the same operation executed on the Itanium takes roughly 20 ns for highly optimized code. Both values were obtained from the conducted tests.

Table 7. Acceleration results

Number of EP modules | [RASC/Itanium] speed-up
1 | 2.5
2 | 5
4 | 10
Employing both Xilinx Virtex-4 chips available on the RASC platform leads to a 20x speed-up compared to the Itanium 2 1.6 GHz processor. There is a huge difference in execution time between optimized and simple code. Tables 8 and 9 present different degrees of optimization applied to the processor source code. This shows how such an acceleration evaluation can sometimes be misleading, especially when hardware performance is not compared with optimized software performance.
Table 8. Acceleration results as the result of processor source code optimization

Itanium algorithm execution | Itanium single operation execution time [us] | [RASC/Itanium] speed-up
Not optimized C code | 0.2 | 25
Not optimized C code with compiler highest effort (-O3 compiler flag set) | 0.16 | 14.6
Fully optimized C code | 0.02 | 2.5
Table 9. C implementation of the orbital calculation routine

Not optimized C code:

    for(t=0;t<size;t++)
        result += a[t]*exp(b[t]*c[t]);

Optimized C code:

    for(t=0;t<size;t++)
        b_c_table[t] = b[t]*c[t];
    for(t=0;t<size;t++)
        exp_table[t] = exp(b_c_table[t]);
    for(t=0;t<size;t++)
        result = result + a[t]*exp_table[t];
5 Summary
There have been other papers published regarding Hartree-Fock (HF) calculations in FPGAs [12,13]. However, the solution presented in [12] aims to accelerate ERI (electron repulsion integral) computations, and additionally that paper deals with various issues regarding coarse-grain parallelism aspects of the implementation. Therefore it does not directly correspond to the orbital function calculation for exchange-correlation potential generation. It is worth noting that the finite sum of exponential products (eq. 4) is, as a computational routine, ubiquitous in scientific calculations because of its universal character. Many different processes in the real world can be described by an exponential function, and a combination of exp() terms leads to an even more general formula. Performance tests revealed that the FPGA is much faster than a processor - even with data stored in the processor's memory and full optimization applied. The module can be scaled from single to double precision to adjust to the calculation profile. It is worth taking into consideration that the Hartree-Fock algorithm, due to its iterative execution, allows the adoption of gradually adjustable data precision, which will give a speed boost to FPGAs in the final application. So there is still a huge potential in the hardware implementation of quantum chemistry computational routines.
References 1. Underwood, K.D., Hemmert, K.S., Ulmer, C.: Architectures and APIs: assessing requirements for delivering FPGA performance to applications. In: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC 2006, Tampa, Florida, November 11-17, p. 111. ACM, New York (2006) 2. Silicon Graphics, Inc. Reconfigurable Application-Specific Computing User’s Guide, Ver. 005, SGI (January 2007) 3. Gothandaraman, A., Peterson, G., Warren, G., Hinde, R., Harrison, R.: FPGA acceleration of a quantum Monte Carlo application. Parallel Computing 34(4-5), 278–291 (2008) 4. Gothandaraman, A., Warren, G.L., Peterson, G.D., Harrison, R.J.: Reconfigurable accelerator for quantum Monte Carlo simulations in N-body systems. In: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC 2006, Tampa, Florida, November 11-17, p. 177. ACM, New York (2006) 5. Mazur, G., Makowski, M.: Development and Optimization of Computational Chemistry Algorithms. In: KDM 2008, Zakopane, Poland (March 2008) 6. Wielgosz, M., Jamro, E., Wiatr, K.: Highly Efficient Structure of 64-Bit Exponential Function Implemented in FPGAs. In: Woods, R., Compton, K., Bouganis, C., Diniz, P.C. (eds.) ARC 2008. LNCS, vol. 4943, pp. 274–279. Springer, Heidelberg (2008) 7. Jamro, E., Wielgosz, M., Wiatr, K.: FPGA implementation of 64-bit exponential function for HPC. In: FPL Proceedings, FPL Netherlands, August 27-29 (2007) 8. Omondi, A.R.: Computer Arithmetic Systems. Prentice Hall, Cambridge (1994) 9. Xilinx Virtex-4 Family Overview, http://www.xilinx.com/support/documentation/data_sheets/ds112.pdf 10. Parhi, K.K., Chung, J.G., Lee, K.C., Cho, K.J.: Low-error fixed-width modified booth multiplier, US Patent: 957244 11. SGI Altix 4700, http://www.sgi.com/products/servers/altix/4000/ 12. Ramdas, T., Egan, G.K., Abramson, D., Baldridge, K.: Converting massive TLP to DLP: a special-purpose processor for molecular orbital computations. In: Proceedings of the 4th International Conference on Computing Frontiers, CF 2007, Ischia, Italy, May 7-9, pp. 267–276. ACM, New York (2007) 13. Kandaswamy, M., Kandemir, M., Choudhary, A., Bernholdt, D.: Optimization and Evaluation of Hartree-Fock Application’s I/O with PASSION. In: Proceedings of the ACM/IEEE Conference on Supercomputing (SC), November 1997, pp. 31–50 (1997)
Parallel Implementation of Conjugate Gradient Method on Graphics Processors
Marcin Wozniak, Tomasz Olas, and Roman Wyrzykowski
Institute of Computational and Information Sciences, Czestochowa University of Technology, Dabrowskiego 74, 42-201 Czestochowa, Poland
{marcell,olas,roman}@icis.pcz.pl
Abstract. Nowadays GPUs become extremely promising multi/manycore architectures for a wide range of demanding applications. Basic features of these architectures include utilization of a large number of relatively simple processing units which operate in the SIMD fashion, as well as hardware supported, advanced multithreading. However, the utilization of GPUs in an every-day practice is still limited, mainly because of necessity of deep adaptation of implemented algorithms to a target architecture. In this work, we propose how to perform such an adaptation to achieve an efficient parallel implementation of the conjugate gradient (CG) algorithm, which is widely used for solving large sparse linear systems of equations, arising e.g. in FEM problems. Aiming at efficient implementation of the main operation of the CG algorithm, which is sparse matrix-vector multiplication (SpMV ), different techniques of optimizing access to the hierarchical memory of GPUs are proposed and studied. The experimental investigation of a proposed CUDA-based implementation of the CG algorithm is carried out on two GPU architectures: GeForce 8800 and Tesla C1060. It has been shown that optimization of access to GPU memory allows us to reduce considerably the execution time of the SpMV operation, and consequently to achieve a significant speedup over CPUs when implementing the whole CG algorithm.
1 Introduction
Nowadays GPUs are becoming extremely promising multi/many-core architectures for a wide range of general-purpose applications demanding highly intensive numerical computations [5,7]. Basic features of these architectures include utilization of a large number of relatively simple processing units which operate in the SIMD fashion, as well as hardware-supported, advanced multithreading. For example, NVIDIA Tesla C1060 is equipped with 240 cores, delivering a peak performance of 0.93 TFLOPS. The tremendous step towards a wider acceptance of GPUs in general-purpose computations was the development of software environments which made it possible to program GPUs in high-level languages [15]. The new software developments, such as NVIDIA CUDA [7,8] and OpenCL [9], allow programmers to implement
algorithms on existing and future GPUs much easier. For example, the CUDA environment makes it possible to develop applications for both the Windows and Linux operating systems, giving access to a well-designed programming interface in the C language. On a single GPU, it is possible to run several CUDA and graphics applications concurrently. The CUDA architecture includes two modules dedicated respectively to a general-purpose CPU and graphics processor. This allows for utilization of GPU as an application accelerator, when a part of the application is executed on a standard processor, while another part is assigned to GPU. However, the utilization of GPUs in an everyday practice is still limited. The main reason is the necessity of deep adaptation of implemented applications and algorithms to a target architecture [21], in order to match its internal characteristics. This paper deals with the problem how to perform such an adaptation efficiently for the conjugate gradient (CG) method [1,10], which is widely applied for solving iteratively large sparse linear systems of equations. Such systems arise e.g. when modeling numerically different problems described by partial differential equations, with the use of the Finite Element Method (FEM) [21]. The material of this paper is organized in the following way. In Section 2, the CG method is introduced, as well as key elements of the NVIDIA GPUs architecture which are important for efficiency of CUDA applications [6,8]. Among them are: organization of thread execution, and memory architecture. Section 3 is devoted to methods of implementing the sparse matrix-vector multiplication, which determines the efficiency of the whole CG algorithm. To speed up the multiplication, we propose a way how to optimize access of threads to memory. Some performance results of numerical experiments carried out on NVIDIA GeForce 8800 GTX and NVIDIA Tesla C1060 are presented in Section 4.
2 Conjugate Gradient Algorithm and GPU Architecture

2.1 Conjugate Gradient Method
The CG method is one of the most widely used iterative methods dedicated to the solution of linear systems of equations with a sparse, symmetric and positive-definite matrix of coefficients [1,10,21]. A version of the CG algorithm implemented in this work is shown in Fig. 1. This version was derived based on the assumption that subsequent residue vectors ri+1 are mutually orthogonal, while subsequent search directions pi are mutually conjugate. The most time-consuming part of this algorithm is the multiplication of the sparse matrix A by a dense vector pi. That is why we focus on how to optimize the execution of this operation; a plain-C reference version of the loop is sketched below.
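The sketch below is only a CPU reference version of the loop from Fig. 1, not the GPU implementation discussed later; the matrix is kept dense purely to stay self-contained, whereas the paper stores A in the CSR format and replaces the matrix-vector product with a GPU kernel.

```c
#include <math.h>
#include <stdlib.h>

/* Conjugate gradient for a dense n x n SPD matrix A (row-major). */
void cg_dense(int n, const double *A, const double *b, double *x,
              double tol, int max_iter)
{
    double *r = malloc(n * sizeof *r), *p = malloc(n * sizeof *p);
    double *Ap = malloc(n * sizeof *Ap);
    if (!r || !p || !Ap) { free(r); free(p); free(Ap); return; }
    for (int i = 0; i < n; ++i) {               /* r0 = b - A*x0, p0 = r0 */
        double s = 0.0;
        for (int j = 0; j < n; ++j) s += A[i * n + j] * x[j];
        r[i] = b[i] - s;
        p[i] = r[i];
    }
    double rr = 0.0;
    for (int i = 0; i < n; ++i) rr += r[i] * r[i];
    for (int it = 0; it < max_iter && sqrt(rr) > tol; ++it) {
        double pAp = 0.0;
        for (int i = 0; i < n; ++i) {           /* Ap = A*p (the SpMV step) */
            double s = 0.0;
            for (int j = 0; j < n; ++j) s += A[i * n + j] * p[j];
            Ap[i] = s;
            pAp += p[i] * s;
        }
        double alpha = rr / pAp;                /* alpha_j = (r_j,r_j)/(A p_j, p_j) */
        double rr_new = 0.0;
        for (int i = 0; i < n; ++i) {
            x[i] += alpha * p[i];               /* x_{j+1} = x_j + alpha_j p_j      */
            r[i] -= alpha * Ap[i];              /* r_{j+1} = r_j - alpha_j A p_j    */
            rr_new += r[i] * r[i];
        }
        double beta = rr_new / rr;              /* beta_j = (r_{j+1},r_{j+1})/(r_j,r_j) */
        for (int i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    free(r); free(p); free(Ap);
}
```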
2.2 GPU Architecture
Graphics processors operate in accordance with the SIMT (Single Instruction Multiple Threads) model [5,15]. In this model, all the GPU threads execute the
r0 = b − A x0
p0 = r0
j = 0
repeat until stop criterion is satisfied {
    αj = (rj, rj)/(A pj, pj)
    xj+1 = xj + αj pj
    rj+1 = rj − αj A pj
    βj = (rj+1, rj+1)/(rj, rj)
    pj+1 = rj+1 + βj pj
    j = j + 1
}

Fig. 1. CG algorithm implemented in this work
same code, called kernel, on different data. The SIMT model is especially suitable for algorithms operating on large data arrays, such as operations performed on vectors and matrices in the CG algorithm. Fig. 2 illustrates the architecture of the NVIDIA GT200 graphics processor. In this case, the smallest execution unit (or core) is a Thread Processor (TP). TP units are connected in groups of 8 cores. Each group creates a Streaming Multiprocessor (SM), while three SMs together with 16 KB of L1 cache create a Thread Processing Cluster (TPC). The resulting architecture of GT200 consists of 240 TP cores, partitioned into 10 TPCs.
Fig. 2. Architecture of GT200 (source NVIDIA)
2.3 Organization of Thread Execution in GPU
An important feature of GPU architectures is the organization of thread execution. Because of hardware support, creating, deleting and switching threads is extremely fast. Threads are grouped into 1D, 2D or 3D blocks (maximum 512 threads in a single block). Each block is assigned to a particular multiprocessor. Threads within a block are able to communicate and synchronize, but synchronization across different blocks is not allowed. For a particular multiprocessor, the number of active threads is much greater than the number of TPs. That is why blocks are divided into portions, called warps, each consisting of 32 threads. At any given time, each multiprocessor executes a certain warp; the remaining warps are suspended. Such an organization of thread execution allows for tolerating latencies caused by access to the GPU global memory (300 cycles). When a certain warp waits for data, it is suspended, and the multiprocessor switches to the execution of another warp. Since synchronization of threads is possible only within blocks, the only way to synchronize all the threads is to let the execution of a kernel finish.

2.4 Memory Architecture in GPU
The memory organization in graphics processors is rather complex (Fig. 3). After being initialized, each thread obtains access to a pool of registers and Local Memory. Moreover, threads within a particular block share 16 KB of the buffered, fast Shared Memory, which is used for their communication. For each thread assigned to a certain multiprocessor, the amount of available Local Memory and registers depends on the number of threads in a block - the more threads share the resources of the multiprocessor, the fewer resources correspond to each thread. GPUs are also equipped with a Global Memory accessible by all the threads (read and write). However, access to this memory is rather expensive. Other types of GPU memory are the Constant Memory and the Texture Memory. Their access time is rather short, but threads are only allowed to read from these memories. For all these three types of memory, their content is preserved between subsequent invocations of kernels; this content is accessible for all the threads running on a graphics card. In order to optimize the memory access in GPUs, it is advisable to use the Constant Memory and Texture Memory whenever possible. Also, the access to the Global Memory may be optimized by a suitable allocation of arrays in a way which allows threads to read data from contiguous memory areas.
3 Sparse Matrix-Vector Multiplication on GPUs
Because of irregular memory access, the efficient implementation of the multiplication of sparse matrix A by dense vector x (SpMV operation shortly) is a difficult problem for both traditional processors, and modern multi-/manycore
Fig. 3. Memory organization of GPU (source NVIDIA)
architectures [11,19,20]. In order to optimize the efficiency of SpMV, a lot of storage formats were proposed for representing sparse matrices [13,16,14,17,19,22]. Some of them are suitable only for a certain type of matrices (e.g., the Jagged Diagonal format for band matrices), or are heavily based on specific features of the processor architecture (e.g., block formats allowing for efficient utilization of caches). In this work, we focus on the CSR format for its versatility and popularity; this storage is then combined with the ELLPACK format.

3.1 CSR Format
The CSR format (Fig. 4) uses three arrays: r, c, and v. The arrays v and c contain, respectively, all non-zero elements of A-matrix and the column indexes of these non-zero elements, while elements of the array r point to the first elements of subsequent rows of A-matrix in the previous arrays. Note that the number of non-zero elements in a particular row is determined by the difference between two consecutive elements of r. Having a matrix in this format, the result of SpMV is computed using a rather simple algorithm shown in Fig. 5; a plain-C sketch of this loop is given below.
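The following CPU sketch assumes 0-based indices and an (n+1)-element array r; the exact index convention of Fig. 4 may differ, so treat the layout details as assumptions.

```c
/* SpMV b = A*x for a matrix stored in CSR arrays r, c, v:
 * the non-zeros of row i occupy positions r[i] .. r[i+1]-1 of v and c. */
void spmv_csr(int n, const int *r, const int *c, const double *v,
              const double *x, double *b)
{
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int k = r[i]; k < r[i + 1]; ++k)
            sum += v[k] * x[c[k]];
        b[i] = sum;
    }
}
```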
3.2 ELLPACK Format
The row version of the ELLPACK format (Fig. 6) uses two 2D arrays: As and IA, each containing n × cmax elements, where cmax denotes the maximum number of non-zero elements in rows of A-matrix. Each row of the array As stores non-zero elements of the corresponding row of A-matrix, while the array IA contains column indexes for these non-zero elements. Note that in the array IA, indexes with values of 0 will appear only in the first row. All other values
Fig. 4. An example 6 × 6 matrix stored in CSR format
Fig. 5. Computing result b of SpMV operation when matrix is stored in CSR format
of 0, which may appear in any other row of the array IA, are not treated as column indexes, but they deliver information that the list of non-zero elements in the corresponding row of A-matrix has been already exhausted.
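Written as a plain CPU loop ("1 thread per row" serialized), SpMV over this layout might look as follows. The sketch assumes that rows shorter than cmax are padded with zero values in As, so multiplying through the padding is harmless; the zero-index convention described above is then only an optimization for stopping early.

```c
/* SpMV for the row-wise ELLPACK layout: As and IA are n x cmax, row-major. */
void spmv_ellpack(int n, int cmax, const double *As, const int *IA,
                  const double *x, double *b)
{
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int k = 0; k < cmax; ++k)
            sum += As[i * cmax + k] * x[IA[i * cmax + k]];   /* padded entries add 0 */
        b[i] = sum;
    }
}
```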
Fig. 6. An example 6 × 6 matrix stored in ELLPACK format
3.3 Implementation of SpMV on GPU
For the CSR format, the simplest method of implementing the SpMV operation on GPUs is parallelization of the main loop of the algorithm (Fig. 7), where each thread is responsible for multiplying the non-zero elements of a particular row of A-matrix by the corresponding elements of the input vector. However, due to the irregular access to elements of both the matrix and the vector, this scheme, known as "1 thread per row", slows down the resulting implementation [6].
Fig. 7. GPU implementation of SPMV for CSR format (“1 thread per row” scheme)
Other methods of implementing the SpMV operation on GPUs for the CSR format rely on assigning each row of A to more than one thread. The final result for a particular row is then computed in parallel by summing partial results within the Shared Memory, using the parallel reduction operation. Two implementations based on this scheme, known as "1 warp per row" and "half warp per row", were proposed in [3] and [2], where each row is distributed across a warp containing 32 and 16 threads, respectively. In these implementations, in order to optimize access to the GPU memory, rows are filled up with zeros, up to sizes equal to multiples of the warp size. In this way, it is possible to achieve access to contiguous areas of memory, which speeds up fetching elements of the matrix. The implementation available in the CUDA Data Parallel Primitives (CUDPP) Library [4] makes each thread responsible for computing the product of a single element of A-matrix by a corresponding element of the vector. For a certain row of A, the obtained results are then summed up in parallel, using the Segmented Scan operation [12]. In this way, access to elements of A-matrix is optimized, while access to elements of the vector is still irregular. For the ELLPACK format (row version), the GPU implementation of SpMV was proposed in [18]. It was shown that by filling up rows of A with zeros it is possible to optimize the GPU memory access for the "1 thread per row" scheme.
In this work, we focus on the problem of how to optimize the GPU implementation of the "1 thread per row" scheme, using the CSR format. As was shown before, the main problem here is the irregular memory access. To deal with this problem and optimize access to elements of A-matrix, we propose to reorganize the structure of this matrix represented in the CSR format by combining it with the ELLPACK storage. The proposed implementation is then compared with the solution available in the CUDPP library [4], as well as with solutions using the Global Memory and Texture Memory to implement the "1 thread per row" scheme. When multiplying the sparse matrix represented in the standard CSR format, a particular thread fetches consecutive non-zero elements of the array v and their column indexes from the array c (see Fig. 4 and Fig. 8a). In the proposed, modified version of the CSR format, the sequence of non-zero elements of the matrix and their column indexes is changed in a way which allows all the threads to read the required data from contiguous and suitably allocated areas of memory. As a result, at the beginning of the array v we put all the first non-zero elements of all the rows of A; then this pattern is used for the second elements of all the rows, etc. This is illustrated in Fig. 8b. The identical modification is applied to the array c, while the array r is left without changes. After such a reorganization, each thread within a certain warp accesses data in an optimized way, fetching data from contiguous memory areas. Moreover, there is no need for communication or synchronization among threads, which also promotes increasing the computation speed. A host-side sketch of one possible realization of this reordering is given below.
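The sketch below is only one possible realization, not necessarily the authors' exact layout: it pads every row to the maximal length cmax (in the spirit of the ELLPACK combination mentioned above) and stores the k-th non-zero of all rows contiguously, so that thread i of a GPU kernel would read element k*n + i with coalesced accesses while r keeps the true row lengths.

```c
#include <stdlib.h>

/* Convert standard CSR (r, c, v) into a "column-major", zero-padded layout
 * (v2, c2) of size n*cmax; returns cmax, or -1 on allocation failure. */
int csr_to_interleaved(int n, const int *r, const int *c, const double *v,
                       double **v2_out, int **c2_out)
{
    int cmax = 0;
    for (int i = 0; i < n; ++i) {
        int len = r[i + 1] - r[i];
        if (len > cmax) cmax = len;
    }
    double *v2 = calloc((size_t)n * cmax, sizeof *v2);   /* zero padding */
    int *c2 = calloc((size_t)n * cmax, sizeof *c2);
    if (!v2 || !c2) { free(v2); free(c2); return -1; }
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < r[i + 1] - r[i]; ++k) {
            v2[k * n + i] = v[r[i] + k];                 /* k-th non-zeros contiguous */
            c2[k * n + i] = c[r[i] + k];
        }
    *v2_out = v2;
    *c2_out = c2;
    return cmax;
}
```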
Fig. 8. Patterns of GPU memory access for standard (a) and modified (b) CSR format
Fig. 9. Speedups of different NVIDIA GF8800 implementations against CPU implementation
4 Performance Results
The performance experiments were carried out for the whole CG algorithm implemented in single precision on the following GPU platforms: (i) CUDA on NVIDIA GeForce 8800 GTX, (ii) CUDA on NVIDIA Tesla C1060. The speedups shown in Fig. 9 and Fig. 10 were determined for FEM matrices of different sizes, against the CPU implementation running on an Intel Core2 Duo 2.33 GHz, 4 MB Cache. The CPU implementation used the standard CSR format, while in the case of GPUs the following variants were investigated:
1. using the CUDPP library (marked as CUDPP),
2. standard CSR format and "1 thread per row" scheme (marked as CSR),
3. variant 2) modified by utilization of the Texture Memory (CSR TEX),
Fig. 10. Speedups of different NVIDIA TESLA C1060 implementation against CPU implementation
4. using the modification of the CSR format proposed in this paper (CSR OPT),
5. variant 4) modified by utilization of the Texture Memory (CSR OPT TEX).
Based on these experiments, we conclude that the proposed method of optimizing access to the Global Memory when implementing the SpMV operation on NVIDIA GPUs allows us to speed up the CG iterative algorithm as a consequence of reducing the execution time of the sparse matrix-vector multiplication.
Acknowledgment. This work was supported by the Polish Ministry of Science and Higher Education under grant no. N516 4280 33.
References 1. Basermann, A., Reichel, B., Schelthoff, C.: Preconditioned CG methods for sparse matrices on massively parallel machines. Parallel Computing Journal 23(3), 381– 393 (1997) 2. Baskaran, M.M., Bordawekar, R.: Optimizing Sparse Matrix-Vector Multiplication on GPUs. IBM Research Report No. RC24704, W0812-047 (2009) 3. Bell, N., Garland, M.: Efficient Sparse Matrix-Vector Multiplication on CUDA. NVIDIA Tech. Report No. NVR-2008-004 (2008) 4. CUDPP: CUDA Data Parallel Primitives Library, http://gpgpu.org/developer/cudpp 5. Dokken, T., Hagen, T.R., Hjelmervik, J.M.: An Introduction to General-Purpose Computing on Programmable Graphics Hardware. In: Geometric Modelling, Numerical Simulation, and Optimization: Applied Mathematics at SINTEF, pp. 123– 161. Springer, Heidelberg (2007) 6. Fujimoto, N.: Faster matrix-vector multiplication on GeForce 8800GTX. In: Proc. IEEE Int. Symp. on Parallel and Distributed Processing - IPDPS 2008, pp. 1–8 (2008) 7. Nickolls, J., Buck, I., Garland, M., Skadron, K.: Scalable Parallel Programming with CUDA. Queue 6(2), 40–53 (2008) 8. NVIDIA CUDA Programming Guide 2.2, http://developer.download.nvidia.com/compute/cuda/2 2/toolkit/docs/ NVIDIA CUDA Programming Guide 2.2.pdf 9. OpenCL - The open standard for parallel programming of heterogeneous systems, http://www.khronos.org/opencl 10. Olas, T., Karczewski, K., Tomas, A., Wyrzykowski, R.: FEM Computations on Clusters Using Different Models of Parallel Programming. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Wa´sniewski, J. (eds.) PPAM 2001. LNCS, vol. 2328, pp. 170–184. Springer, Heidelberg (2002) 11. Saad, Y.: SPARSKIT: A basic toolkit for sparse matrix computations, http://www-users.cs.umn.edu/~ saad/software/SPARSKIT/paper.ps 12. Sengupta, S., Harris, M., Zhang, Y., Owens, J.D.: Scan Primitives for GPU Computing. In: Proc. 22nd ACM SIGGRAPH/EUROGRAPHICS Symp. on Graphics Hardware, pp. 97–106 (2007)
Parallel Implementation of Conjugate Gradient Method on GPU
135
13. Smailbegovic, F.S., Gaydadjiev, G.N., Vassiliadis, S.: Sparse Matrix Storage Format. In: Proc. 16th Ann. Workshop on Circuits, Systems and Signal Processing, ProRisc 2005, pp. 445–448 (2005) 14. Stathis, P.T.: Sparse Matrix Vector Processing Formats. PhD Thesis, Delft University of Technology (2004), http://ce.et.tudelft.nl/publicationfiles/955_1_Thesis_P_T_Stathis.pdf 15. Strzodka, R., Doggett, M., Kolb, A.: Scientific computation for simulations on programmable graphics hardware. Simulation Modelling Practice and Theory 13(8), 667–860 (2005) 16. Tvrdik, P., Simecek, I.: A New Diagonal Blocking Format and Model of Cache Behavior for Sparse Matrices. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Wa´sniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 164–171. Springer, Heidelberg (2006) 17. Tvrdik, P., Simecek, I.: A New Approach for Accelerating the Sparse Matrix-Vector Multiplication. In: Proc. 8th Int. Symp. on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC 2006, pp. 156–163. IEEE Computer Society, Los Alamitos (2006) 18. V´ azquez, F., Garz´ on, E.M., Mart´ınez, J.A., Fern´ andez, J.J.: Accelerating sparse matrix vector product with GPUs. In: Proc. 9th Int. Conf. Computational and Mathematical Methods in Science and Engineering, CMMSE, pp. 1081–1092 (2009) 19. Vuduc, R.W.: Automatic Performance Tuning of Sparse Matrix Kernels. PhD Thesis, University of California, Berkeley (2003), http://bebop.cs.berkeley.edu/pubs/vuduc2003-dissertation.pdf 20. Williams, S., Oliker, l., Vuduc, R., Shalf, J., Yelick, K., Demmel, J.: Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Computing 35(3), 178–194 (2009) 21. Wyrzykowski, R.: PC-clusters and multicore architectures. EXIT, Warsaw (2009) (in Polish) 22. Wyrzykowski, R., Kanevski, J.: A Technique for Mapping Sparse Matrix Computations into Regular Processor Arrays. In: Lengauer, C., Griebl, M., Gorlatch, S. (eds.) Euro-Par 1997. LNCS, vol. 1300, pp. 310–317. Springer, Heidelberg (1997)
Iterative Solution of Linear and Nonlinear Boundary Problems Using PIES
Eugeniusz Zieniuk and Agnieszka Boltuc
Institute of Computer Science, University of Bialystok, Sosnowa 64, 15-887 Bialystok, Poland
{ezieniuk,aboltuc}@ii.uwb.edu.pl
Abstract. The paper presents an effective approach to solving boundary problems modeled by a nonlinear partial differential equation. The applied algorithm is based on an iterative method, used in order to find successive approximations to the solution of a boundary problem, and on the PIES method as a tool for solving the boundary problem in each iteration step. The paper also contains some suggestions concerning an effective way of modeling domains and calculating integrals over those domains. The authors have also performed an initial verification of the method on the basis of a linear equation and nonlinear equations with various intensities of nonlinearity.
1 Introduction
A variety of technical problems have been modeled by means of partial differential equations such as the Laplace, Navier-Lame or Helmholtz equations. Frequently such practical problems, despite their complexity, are modeled by linear differential equations. This simplifies the solved problem and gives a guarantee of obtaining results. Problems defined in that way can be solved by means of various numerical methods: FEM [10], BEM [1,2] or mesh-free methods [6]. Each of them has some advantages and drawbacks, which become apparent for problems from different fields and with different degrees of difficulty. It should be mentioned that a beneficial feature of the first two methods is the existence of well-developed and easily available computer software. It is known, however, that most realistic problems involve nonlinearity. The question arises: can we use the above-mentioned methods for solving such problems? Obviously it is possible, but it is connected with various difficulties which complicate the task. Thus, when the classic BEM, which was popularized as a method without domain discretization, is used for solving nonlinear problems (and not only those), one has to calculate domain integrals. This entails the division of the integration domain into smaller sub-regions called cells. As a result we have, from a technical point of view, almost the same process as the FEM discretization of a domain by means of finite elements. In the research [7,8,9] carried out by the authors a different approach to boundary problems (until now only linear ones) has been proposed. The basis of that approach is a method called the parametric integral equation system (PIES), obtained on the
basis of the classic boundary integral equation by its analytical modification [7]. That modification is concerned with the elimination of discretization of both the domain (as in BEM) and the boundary, by including the boundary geometry (defined in a parametric way) directly in the PIES kernels. The described approach was successfully applied to solving boundary problems modeled by various differential equations [7,8,9] and has given satisfactory and promising results. The effectiveness relates especially to the number of data used for modeling the boundary geometry, the number of solved algebraic equations and the accuracy of the obtained results. The purpose of this paper is to propose the PIES method for solving 2D nonlinear problems, to analyze its advantages and drawbacks and to verify its reliability and efficiency, considering also linear problems.
2 PIES in Linear and Nonlinear Boundary Problems
The following partial differential equation [6] is solved in the present paper:

∇²u(x) + εu(x)^n = p(x),   x ∈ Ω,   (1)
where ∇² = ∂²/∂x₁² + ∂²/∂x₂² is the Laplace operator, p(x) is a given source function, ε is a constant, and the domain Ω is enclosed by Γ = Γu ∪ Γp with boundary conditions u = ū on Γu and ∂u/∂n ≡ p = p̄ on Γp. Taking into account the various values of n, the partial differential equation (1) solved in the paper is linear (n = 1) or nonlinear (n ≠ 1). Considering the linear equation there are two different possibilities. Both of them require the existence and knowledge of a fundamental solution of differential equation (1). In both situations we can use the known solution of the linear Laplace equation [1]; however, in the first that solution has to be modified to take into account the term εu(x), which is additional in relation to Laplace's equation. The other approach is based on an iterative process with the help of which we can approximate the solution u(x) in a finite number of steps. Taking into account the universality of the proposed approach (covering both linear and nonlinear problems), it is reasonable to apply the second approach. In connection with that approach, a new form of PIES for the equation considered in the paper is the following (a modification of the form of PIES for the linear Poisson equation known from [9]):

0.5 u_l(s1) = Σ_{j=1}^{n} ∫_{s_{j−1}}^{s_j} { U*_{lj}(s1, s) p_j(s) − P*_{lj}(s1, s) u_j(s) } J_j(s) ds + ∫_Ω U*_l(s1, x) [ε u(x)^n − p(x)] dΩ(x),   (2)

where s_{l−1} ≤ s1 ≤ s_l, s_{j−1} ≤ s ≤ s_j, and J_j(s) = [ (∂Γ_j^(1)(s)/∂s)² + (∂Γ_j^(2)(s)/∂s)² ]^0.5.
The first and second integrands in (2) are respectively the fundamental and singular solutions for the Laplace equation (they are presented in an explicit form in [7]). The function U*_l in the domain integral of (2) takes the following form

$$U^{*}_{l}(s_1,x) = \frac{1}{2\pi}\,\ln\frac{1}{\left(\eta_1^2 + \eta_2^2\right)^{0.5}}, \qquad (3)$$

where η_1 = Γ_l^{(1)}(s_1) − x_1, η_2 = Γ_l^{(2)}(s_1) − x_2, whilst Γ(s) are parametric curves which describe the boundary (in the paper Bezier curves of the first and third degree are used [3]). In order to obtain the sought values of u(x) in the domain Ω one should know the form of an integral identity, which can be expressed as follows

$$u(y) = \sum_{j=1}^{n} \int_{s_{j-1}}^{s_j} \left\{ U^{*}_{lj}(y,s)\,p_j(s) - P^{*}_{lj}(y,s)\,u_j(s) \right\} J_j(s)\,ds + \int_{\Omega} \hat{U}^{*}_{l}(y,x)\left[\varepsilon u(x)^n - p(x)\right] d\Omega(x), \qquad (4)$$

where the function Û*_l in the domain integral takes the form Û*_l(y,x) = (1/2π) ln 1/(η̂_1^2 + η̂_2^2)^{0.5}, with η̂_1 = y_1 − x_1, η̂_2 = y_2 − x_2, x_1, x_2 ∈ Ω.
Analyzing the above expressions (2) and (4) it can be seen that domain integrals have to be calculated. In connection with that fact, one of the most important stages of the research performed in the paper is testing of the presented approach (on linear problems as an initial verification of the algorithm, and on nonlinear problems as the final target) as well as elaborating and verifying an effective way of calculating the mentioned integrals.
3 Numerical Solution of PIES Using an Iterative Method
Finally, the solution of PIES (2) leads to finding the unknown functions u_j(s) or p_j(s) on each boundary segment of the considered problem. The unknown functions are approximated in each step of the iterative process by the following expressions

$$p_j(s) = \sum_{k=0}^{N} p_j^{(k)} f_j^{(k)}(s), \qquad u_j(s) = \sum_{k=0}^{N} u_j^{(k)} f_j^{(k)}(s), \qquad (5)$$

where u_j^{(k)}, p_j^{(k)} are unknown coefficients on segment j, N is the number of coefficients on the segment, and f_j^{(k)}(s) are global base functions on individual segments (in the paper Chebyshev polynomials of the first kind are used).
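As a simple illustration of (5), the sketch below evaluates such a series on one segment with Chebyshev polynomials of the first kind computed by the standard three-term recurrence; the mapping of the segment parameter to [-1, 1] and the sample coefficients are assumptions, not values taken from the paper.

```cpp
#include <cstdio>
#include <vector>

// Evaluate p_j(s) = sum_k p_j^(k) T_k(t(s)) on one boundary segment,
// with T_0 = 1, T_1 = t, T_{k+1} = 2 t T_k - T_{k-1} (Chebyshev recurrence).
// The affine map of s in [s_lo, s_hi] to t in [-1, 1] is an assumption.
double boundary_series(const std::vector<double>& coeff, double s,
                       double s_lo, double s_hi) {
    double t = 2.0 * (s - s_lo) / (s_hi - s_lo) - 1.0;   // map to [-1, 1]
    double tkm1 = 1.0, tk = t;
    double value = coeff[0] * tkm1;
    for (std::size_t k = 1; k < coeff.size(); ++k) {
        value += coeff[k] * tk;
        double tkp1 = 2.0 * t * tk - tkm1;                // next polynomial
        tkm1 = tk;
        tk = tkp1;
    }
    return value;
}

int main() {
    std::vector<double> c = {1.0, 0.5, -0.25};            // sample coefficients
    std::printf("p_j(0.3) = %f\n", boundary_series(c, 0.3, 0.0, 1.0));
}
```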
After the substitution of (5) into (2) and application of the collocation method [4] we obtain an expression of the following general form

$$0.5\,u_l(s_1) = \sum_{j=1}^{n}\sum_{k=0}^{N} \left\{ p_j^{(k)} \int_{s_{j-1}}^{s_j} U^{*}_{lj}(s_1,s) f_j^{(k)}(s) J_j(s)\,ds - u_j^{(k)} \int_{s_{j-1}}^{s_j} P^{*}_{lj}(s_1,s) f_j^{(k)}(s) J_j(s)\,ds \right\} + \int_{\Omega} U^{*}_{l}(s_1,x)\left[\varepsilon u(x)^n - p(x)\right] d\Omega(x). \qquad (6)$$
However, the final solution of the above equation requires special treatment, because the domain integral contains the function u(x)^n, which is both nonlinear and unknown. Equation (6), written down in the form of a system of algebraic equations, can be presented as follows

$$Hu = Gp + W, \qquad (7)$$

where H = [H]_{M×M}, G = [G]_{M×M} (M = n × N) are square matrices of elements expressed by the integrals from equation (6), u and p are column matrices which contain the coefficients of the approximation series (5), whilst W is a column matrix of elements expressed by means of the domain integral from equation (6). After applying the boundary conditions and some transformations (moving known values to one side and unknown ones to the other), equation (7) takes the form of the algebraic system

$$AX = F + W, \qquad (8)$$

where the vector X contains the unknown coefficients of the sought boundary approximating functions (5), whilst the vector F is known and depends on the given boundary conditions. The only problem is that in the case of the considered problems (linear and nonlinear) the right-hand side W of equation (8) is unknown. It depends on the current value of the solution u_i(x) at chosen points of the domain Ω. For that reason it is necessary to apply an iterative method and assume an initial guess u_0(x) for the sought solution. Taking into account the convergence of the method, it is most effective to assume for iteration i = 0 the real value of the unknown function. It is also acceptable (in some cases) to choose constant or zero values. After calculating u_1(x) on the basis of (8), the solution is successively improved in the following iteration steps until the given stop criterion is fulfilled. The iterative process is recognized as finished if the difference between the two last obtained values at all considered points of the domain (or the boundary) is very small (|u_{i+1}(x) − u_i(x)| ≤ ε).
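A minimal sketch of this iteration loop is given below; the linear solve of (8) and the evaluation of the integral identity (4) are replaced by a simple stand-in fixed-point map, so only the control flow (initial guess, repeated update, stop criterion) corresponds to the scheme described above. The sample values of p(x) are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Picard-type iteration skeleton: the right-hand side W depends on the current
// domain solution u_i, so the system A X = F + W(u_i) would be re-solved and
// u_{i+1}(x) re-evaluated until |u_{i+1}(x) - u_i(x)| <= tol at every point.
// Here the solve/evaluate pair is replaced by a toy update for illustration.
int main() {
    const double eps = 1.0, tol = 1e-8;
    const int n = 2;                        // exponent n in eq. (1)
    std::vector<double> u = {0.0, 0.0};     // initial guess u_0 at two domain points
    std::vector<double> p = {4.25, 4.5};    // sample source values (assumed)

    for (int it = 0; it < 100; ++it) {
        std::vector<double> u_next(u.size());
        double diff = 0.0;
        for (std::size_t k = 0; k < u.size(); ++k) {
            // stand-in for: solve A X = F + W(u) and evaluate u_{i+1}(x_k) via (4)
            u_next[k] = 0.25 * (p[k] - eps * std::pow(u[k], n));
            diff = std::max(diff, std::fabs(u_next[k] - u[k]));
        }
        u = u_next;
        std::printf("iteration %d, max change %.3e\n", it + 1, diff);
        if (diff <= tol) break;             // stop criterion |u_{i+1} - u_i| <= tol
    }
}
```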
4 Technique of Domain Integrals Computation
A very important stage of solving nonlinear (and not only nonlinear) problems is the calculation of domain integrals. In classic BEM [1,2] this inconvenient task (especially for complex domains) was reduced to the summation of integrals over smaller domains with simple shapes (a triangle, a rectangle). However, this sacrifices the main advantage of BEM, namely the absence of domain discretization. An exemplary division of a domain into integration cells is presented in Fig. 1a.
Fig. 1. A definition of a domain in a) BEM, b) PIES
When developing the idea of PIES the authors were guided by the intention of completely dispensing with discretization, which was achieved and successfully tested on many examples [7,8]. For that reason, our concept was to obtain an algorithm of domain integral calculation which completely eliminates the necessity of dividing the domain into cells. The first problem was an effective definition of the domain of integration. The authors decided to use surfaces known from computer graphics [3]. In the paper rectangular Bezier surfaces of the first (Fig. 2a) and third degree (Fig. 2b) were used. In the first case (used for polygon definition) only corner points are needed to define the domain, whilst in the second case (used for modeling domains with curvilinear boundaries) 16 control points are needed altogether. It is worth mentioning that in the considered problems the domains are flat, therefore only 12 control points (which define the boundary) have to be posed.
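A minimal sketch of a first-degree (bilinear) rectangular patch evaluated from its four corner points is given below; the corner coordinates used in the example are illustrative only.

```cpp
#include <cstdio>

struct Point2 { double x, y; };

// A rectangular Bezier surface of the first degree is a bilinear patch: the
// flat domain is described by its four corner points, and a point inside it is
// obtained for parameters (u, v) in [0,1] x [0,1].
Point2 bilinear_patch(const Point2& p00, const Point2& p10,
                      const Point2& p01, const Point2& p11,
                      double u, double v) {
    Point2 q;
    q.x = (1 - u) * (1 - v) * p00.x + u * (1 - v) * p10.x
        + (1 - u) * v * p01.x + u * v * p11.x;
    q.y = (1 - u) * (1 - v) * p00.y + u * (1 - v) * p10.y
        + (1 - u) * v * p01.y + u * v * p11.y;
    return q;
}

int main() {
    Point2 a{0, 0}, b{1, 0}, c{0, 1}, d{1, 1};   // corners of a unit square (example)
    Point2 mid = bilinear_patch(a, b, c, d, 0.5, 0.5);
    std::printf("centre of the patch: (%f, %f)\n", mid.x, mid.y);
}
```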
Fig. 2. A definition of a Bezier surface of a) the first degree, b) the third degree
The applied surfaces are an effective modeling tool, taking into account the ease of their definition and modification [3]. Moreover, it is very important that a geometry described by a surface (that is, by a mathematical formula) can be directly included in the integrand of the domain integral. The last step is to apply a high-order quadrature for numerical integration (in the paper a rectangular Gaussian quadrature was used) [5]. In this way the authors have obtained a procedure [9] for calculating domain integrals which does not require domain discretization. Finally, only one question remains: whether the applied quadrature will be adequate for integrals calculated over strongly modified domains (in relation to the base, rectangular shape of the surface). In response to that question the further part of the paper contains a numerical analysis of two different geometries.
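A minimal sketch of such a quadrature in the parametric square of the surface is given below: a 4 x 4 tensor-product Gauss-Legendre rule on the unit square with a sample integrand. For a deformed patch the Jacobian of the surface mapping would additionally multiply the weights; the integrand used here is only a stand-in for εu(x)^n − p(x).

```cpp
#include <cstdio>

int main() {
    // 4-point Gauss-Legendre nodes and weights on [-1, 1]
    const double xg[4] = {-0.8611363116, -0.3399810436, 0.3399810436, 0.8611363116};
    const double wg[4] = { 0.3478548451,  0.6521451549, 0.6521451549, 0.3478548451};
    double integral = 0.0;
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            double u = 0.5 * (xg[i] + 1.0);        // map node to [0, 1]
            double v = 0.5 * (xg[j] + 1.0);
            double f = u * u + v * v;              // sample integrand (stand-in)
            integral += 0.25 * wg[i] * wg[j] * f;  // 0.25 = Jacobian of [-1,1]^2 -> [0,1]^2
        }
    }
    std::printf("integral of x1^2 + x2^2 over the unit square = %f (exact 2/3)\n", integral);
}
```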
5 An Application to Practical Examples
In order to verify the idea described in the paper the proposed iterative algorithm with PIES was applied to solving the linear equation obtained after substituting n = 1 into (1). The analytical solution of that equation is known,

$$u(x) = x_1^2 + x_2^2, \qquad (9)$$

for p(x) = ε{x_1^2 + x_2^2}^n + 4. The Dirichlet boundary condition is imposed on all sides according to (9). Such a formulated problem, as was mentioned, can be solved without applying iterative methods. However, in the present paper the iterative approach is used in order to confirm the correctness of the proposed method, which is ultimately destined for solving nonlinear problems. The above-mentioned linear equation was defined on the two different domains presented in Fig. 3. In both cases rectangular Bezier surfaces were used for domain modeling, but the domain from Fig. 3a was defined by means of a surface of the first degree, whilst that from Fig. 3b by a surface of the third degree. With examples selected in this way (one of them is a modification of the other) we can verify the generality of the proposed modeling approach.
Fig. 3. Considered domains
The remaining parameters assumed for the experiments are: ε = 1 and three different initial values of the solution per domain (the same at all considered domain points), chosen from u_0 = 0, 18, 20, −30. Solutions obtained in vertical cross-sections of both domains are presented in Tables 1 and 2. The authors have concentrated not only on the analysis of the accuracy of the obtained results, but also on the effectiveness of the iterative process (Fig. 4).
Table 1. Solutions from a chosen cross-section of the square domain (Fig. 3a)

  x     y     analytical   U (u0 = 0)   U (u0 = 18)   U (u0 = -30)
 0.5   0.05     0.2525      0.252329      0.252329      0.252329
 0.5   0.15     0.2725      0.27306       0.27306       0.27306
 0.5   0.25     0.3125      0.313309      0.313309      0.313309
 0.5   0.35     0.3725      0.37345       0.37345       0.37345
 0.5   0.45     0.4525      0.453532      0.453532      0.453532
 0.5   0.55     0.5525      0.55355       0.55355       0.55355
 0.5   0.65     0.6725      0.673504      0.673504      0.673504
 0.5   0.75     0.8125      0.813403      0.813403      0.813403
 0.5   0.85     0.9725      0.973202      0.973202      0.973202
 0.5   0.95     1.1525      1.15024       1.15024       1.15024
 number of iterations           7             9             9
Table 2. Solutions from a chosen cross-section of the modified domain (Fig. 3b)

  x     y     analytical   U (u0 = 0)   U (u0 = 20)   U (u0 = -30)
 0.6   0.15     0.3825      0.384326      0.384326      0.384326
 0.6   0.25     0.4225      0.422461      0.422461      0.422461
 0.6   0.35     0.4825      0.482743      0.482743      0.482743
 0.6   0.45     0.5625      0.562522      0.562522      0.562522
 0.6   0.55     0.6625      0.662908      0.662908      0.662908
 0.6   0.65     0.7825      0.782717      0.782717      0.782717
 0.6   0.75     0.9225      0.922777      0.922777      0.922777
 0.6   0.85     1.0825      1.08264       1.08264       1.08264
 0.6   0.95     1.2625      1.26263       1.26263       1.26263
 number of iterations           6             7             7
As can be seen in Tables 1 and 2, the proposed algorithm gives very satisfactory results for both tested linear problems. Starting from various initial values, the same accurate results were obtained. The average relative error of solutions in the considered cross-section is equal to 0.175% for the square domain and 0.073% for the quarter of a ring. The iterative process can be described as effective, because it converges to the proper solution very fast. As a proof, two charts are presented in Fig. 4. As can be noticed, already in the third iteration the solutions converge to the final (proper) value. Taking into account the results presented above and the conclusions drawn from them, it can be claimed that the proposed algorithm is promising, because it gives accurate results in a short time. As was mentioned, it should finally be applied to solving nonlinear problems. Therefore, the next step of the research is to apply the proposed technique to solving equation (1) with different values of n ≠ 1, which influence the scale of nonlinearity.
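The accuracy figure quoted for the square domain can be checked directly from the data of Table 1; the short program below computes the average relative error of the tabulated cross-section and gives a value close to the reported 0.175% (the small difference comes only from rounding of the printed digits).

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // analytical and PIES values from Table 1 (square domain, x = 0.5)
    const double exact[10] = {0.2525, 0.2725, 0.3125, 0.3725, 0.4525,
                              0.5525, 0.6725, 0.8125, 0.9725, 1.1525};
    const double pies[10]  = {0.252329, 0.27306, 0.313309, 0.37345, 0.453532,
                              0.55355, 0.673504, 0.813403, 0.973202, 1.15024};
    double sum = 0.0;
    for (int i = 0; i < 10; ++i)
        sum += std::fabs(pies[i] - exact[i]) / exact[i];
    std::printf("average relative error: %.3f%%\n", 100.0 * sum / 10.0);
}
```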
Fig. 4. Convergence of an iterative process for a) the square domain, b) the quarter of a circular ring
The first nonlinear problem corresponds to the parameter n = 2. The authors have taken into account two domains (the same as in the first example) over which the solved equation was defined. The following values of the parameters were considered: ε = 1 and u_0 ∈ ⟨−30, 20⟩.
Fig. 5. Solutions obtained by PIES for: a) the square domain, b) the quarter of a circular ring
The solutions obtained by means of the proposed algorithm for chosen initial values u_0 (Fig. 5) are characterized by high accuracy in comparison with the analytical results. However, it should be mentioned that: a) the number of iterations is higher than in the case of the linear problem, and b) for some initial values (e.g. u_0 = −20, −30, 20 for the square domain) the iterative process is divergent. The best results are obtained for u_0 = 0, because then the final solution is obtained most quickly. Starting from different initial values (e.g. u_0 = 15, −10), after the first iteration the results have a high relative error (even 4100%), but finally the process converges to the same results as in the case with u_0 = 0. The results described above are presented in Fig. 6. Analyzing them it can be stated that, for the solved problems, the approach based on an iterative process and PIES gives satisfactory results
Fig. 6. Convergence of the iteration process for: a) the square domain, b) the quarter of a circular ring
Fig. 7. Numerical and exact solutions for different values of ε for: a) the square domain, b) the quarter of a circular ring
also in the case of nonlinear problems. The question remains about the quality of the results in the case of stronger nonlinearity. In order to answer it, the authors have solved the problem modeled by equation (1) with n = 3, defined over the two domains considered before (Fig. 3). In the computation, the constant ε is taken to be 1 and 0.1, whilst the initial guess is a value from u_0 ∈ ⟨−30, 20⟩. After solving the described problem with different values of the initial solution and of the constant ε, the authors drew the following conclusions: a) the number of iterations grows with increasing ε, b) there are some values of u_0 for which the iterative process is divergent, e.g. for the square domain the range of variability for which solutions were obtained is ⟨−3, 3⟩ (the solution for u_0 = 0 was obtained in the shortest time), and c) the obtained final results are less accurate than in the previous cases (the average relative error for the square domain is equal to 0.81%, whilst for n = 2 it was 0.26% and for n = 1 only 0.17%). The situation improves when the influence of the nonlinear term εu(x)^n is weakened by imposing the smaller value ε = 0.1. With this value of ε the obtained solutions are more accurate, the iterative process is faster, and the admissible range of variability of u_0 is wider. The solutions obtained for both values of ε and both domains, compared with the analytical results, are presented in Fig. 7.
6 Conclusions
The algorithm proposed in the paper for solving boundary problems modeled by a nonlinear differential equation has given satisfactory results. Both the initial verification of the approach on linear problems and its final application to solving problems with nonlinearity have been completed successfully (accurate numerical solutions were obtained) and the approach seems to be effective (the iterative process converges to the final solution in a small number of steps). There are also some cases (some values of the initial guess) for which the iterative process is divergent or the accuracy of the obtained results is not acceptable. However, the presented initial results are promising and encourage further research which should eliminate all faults observed at this stage.
References

1. Banerjee, P.K.: The Boundary Element Methods in Engineering. McGraw-Hill, London (1994)
2. Becker, A.A.: The Boundary Element Method in Engineering: A Complete Course. McGraw-Hill, London (1992)
3. Foley, J.D.: Introduction to Computer Graphics. WNT, Warszawa (2001) (in Polish)
4. Gottlieb, D., Orszag, S.A.: Numerical Analysis of Spectral Methods. SIAM, Philadelphia (1977)
5. Stroud, A.H., Secrest, D.: Gaussian Quadrature Formulas. Prentice-Hall, London (1966)
6. Zhu, T., Zhang, J., Atluri, N.: A meshless local boundary integral equation (LBIE) method for solving nonlinear problems. Computational Mechanics 22, 174–186 (1998)
7. Zieniuk, E.: Bezier curves in the modification of boundary integral equations (BIE) for potential boundary-values problems. International Journal of Solids and Structures 40(9), 2301–2320 (2003)
8. Zieniuk, E., Boltuc, A.: Non-element method of solving 2D boundary problems defined on polygonal domains modeled by Navier equation. International Journal of Solids and Structures 43, 7939–7958 (2006)
9. Zieniuk, E., Szerszen, K., Boltuc, A.: Global calculation of domain integrals in PIES for two-dimensional problems modeled by Navier-Lame and Poisson equations. Modelowanie inzynierskie 2(33), 181–186 (2007) (in Polish)
10. Zienkiewicz, O.C., Taylor, R.L.: The Finite Element Method. McGraw-Hill, London (1977)
Implementing a Parallel Simulated Annealing Algorithm

Zbigniew J. Czech¹,², Wojciech Mikanik¹, and Rafal Skinderowicz²

¹ Silesia University of Technology, Gliwice, Poland
{Zbigniew.Czech,Wojciech.Mikanik}@polsl.pl
² University of Silesia, Sosnowiec, Poland
[email protected]
Abstract. The MPI and OpenMP implementations of a parallel simulated annealing algorithm solving the vehicle routing problem with time windows (VRPTW) are presented. The algorithm consists of a number of components which co-operate periodically by exchanging their best solutions found to date. The objective of the work is to explore the speedups and scalability of the two implementations. For comparisons the selected VRPTW benchmarking tests are used. Keywords: parallel simulated annealing, MPI library, OpenMP interface, vehicle routing problem with time windows.
1 Introduction
This work presents two implementations of a parallel simulated annealing algorithm. The algorithm, consisting of a number of components which co-operate periodically by exchanging their best solutions found to date, is used to solve the vehicle routing problem with time windows (VRPTW). A solution to the VRPTW is a set of routes beginning and ending at the depot which serves a set of customers. For the purpose of goods delivery there is a fleet of vehicles. The set of routes should visit each customer exactly once, ensure that the service at any customer begins within the time window, and preserve the vehicle capacity constraints. The VRPTW belongs to a large family of vehicle routing problems (VRP) which have numerous practical applications. The MPI and OpenMP implementations of the parallel simulated annealing algorithm solving the VRPTW are presented. The objective of the work is to explore the speedups and scalability of these two implementations. The results of the work extend our previous efforts concerning the application of parallel simulated annealing to solve the VRPTW. However, in the earlier projects [1,2,3,4,5,6] the parallel execution of processes was simulated on a single processor, and the main goal of investigations was to achieve a high accuracy of solutions through co-operation of processes. In section 2 the parallel simulated annealing algorithm is presented. Sections 3 and 4 describe the MPI and OpenMP implementations of the algorithm,
respectively. Section 5 contains the discussion of the experimental results. Section 6 concludes the work.
2 Parallel Simulated Annealing Algorithm
The parallel simulated annealing algorithm comprises p components which are executed as processes P0, P1, . . . , Pp−1 (Fig. 1). A process generates its own annealing chain divided into two phases (lines 6–18). A phase consists of a number of cooling stages, and a cooling stage consists of a number of annealing steps. The goal of phase 1 is to minimize the number of routes of the VRPTW solution, whereas phase 2 minimizes the total length of the routes.

 1  parfor Pj, j = 0, 1, . . . , p − 1 do
 2    Set co-operation mode to regular or rare depending on a test set;
 3    L := (5E)/p;  {establish the length of a cooling stage}
 4    Create the initial solution using some heuristics;
 5    current_solution_j := initial_solution;  best_solution_j := initial_solution;
 6    for f := 1 to 2 do  {execute phase 1 and 2}  {beginning of phase f}
 7      T := T_{0,f};  {initial temperature of annealing}
 8      repeat  {a cooling stage}
 9        for i := 1 to L do
10          annealing_step_f(current_solution_j, best_solution_j);
11        end for;
12        if (f = 1) or (co-operation mode is regular) then  {ω = L}
13          co_operation;
14        else  {rare co-operation: ω = E}
15          Call co_operation procedure every E annealing steps
16            counting from the beginning of the phase;  end if;
17        T := β_f T;  {temperature reduction}
18      until a_f cooling stages are executed;  {end of phase f}
19    end for;
20  end parfor;
21  Produce best_solution_{p−1} as the solution to the VRPTW;

Fig. 1. Parallel simulated annealing algorithm; constant E = 10^5 annealing steps
At every annealing step a neighbor solution is computed by migrating customers among the routes (line 2 in Fig. 2). In simulated annealing the neighbor solutions of lower costs are always accepted, where by accepting we mean that the neighbor solution becomes the current one in the annealing chain. The higher cost solutions are accepted with probability e^{−δ/T_k}, where T_k, k = 0, 1, . . . , a_f, is a parameter called the temperature of annealing, and a_f is the number of cooling stages in phase f, f = 1 or 2. The temperature drops from an initial value
 1  procedure annealing_step_f(current_solution, best_solution);
 2    Create new_solution as a neighbor to current_solution (the way this step is executed depends on f);
 3    δ := cost_f(new_solution) − cost_f(current_solution);
 4    Generate random x uniformly in the range (0, 1);
 5    if (δ < 0) or (x < e^{−δ/T}) then
 6      current_solution := new_solution;
 7      if cost_f(new_solution) < cost_f(best_solution) then
 8        best_solution := new_solution;
 9      end if;
10    end if;
11  end annealing_step_f;

Fig. 2. Annealing step procedure
 1  procedure co_operation;
 2    if j = 0 then
 3      Send best_solution_0 to process P1;
 4    else  {j > 0}  Receive best_solution_{j−1} from process P_{j−1};
 5      if cost_f(best_solution_{j−1}) < cost_f(best_solution_j) then
 6        best_solution_j := best_solution_{j−1};
 7        current_solution_j := best_solution_{j−1};
 8      end if;
 9      if j < p − 1 then Send best_solution_j to process P_{j+1}; end if;
10    end if;
11  end co_operation;

Fig. 3. Procedure of co-operation of processes
T_{0,f} according to the equation T_{k+1} = β_f T_k, where β_f (β_f < 1) and δ denote a constant and an increase of the solution cost, respectively (line 17 in Fig. 1). A sequence of steps for which the temperature of annealing stays constant constitutes a cooling stage. The cost of a solution s to the VRPTW is defined in phase 1 of the parallel algorithm as: cost_1(s) = c_1 N + c_2 D + c_3 (r_1 − r̄), and in phase 2 as: cost_2(s) = c_1 N + c_2 D, where N is the number of routes in solution s (equal to the number of vehicles carrying out the service), D is the total travel distance of the routes, r_1 is the number of customers in a route which one tries to remove from the current solution, r̄ is the average number of customers in all routes, and c_1, c_2, c_3 are constants. The basic criterion of optimization is the number of routes, therefore c_1 ≫ c_2 is fixed. The processes of the parallel algorithm co-operate with each other every ω annealing steps, passing their best solutions found to date (lines 12–16 in Fig. 1 and Fig. 3). The chain of annealing steps of process P0 is entirely independent (Fig. 4). The chain of process P1 is updated at steps uω, u = 1, 2, . . . , u_m, to the better solution between the best solutions found by processes P0 and P1 to date. Similarly, process P2 chooses as the next point in its chain the better solution between its own best and the one obtained from process P1. Thus the
best solution found by process P_l is piped down for further enhancement to processes P_{l+1} . . . P_{p−1}. Clearly, after step u_m ω process P_{p−1} holds the best solution, X_b, found by all the processes.
Fig. 4. Scheme of co-operation of processes (X0 – initial solution; Xb – best solution among the processes)
As mentioned before, the temperature of annealing decreases according to the equation T_{k+1} = β_f T_k for k = 0, 1, 2, . . . , a_f, where a_f is the number of cooling stages. In this work we investigate two cases of establishing the points of process co-operation with respect to temperature drops. In the first case, of regular co-operation, processes interact at the end of each cooling stage (ω = L) (lines 15–16 in Fig. 1). The number of annealing steps executed within a cooling stage is set to L = (5E)/p, where E = 10^5 is a constant and p, p = 5, 10, 15 and 20, is the number of processes (line 3 in Fig. 1). Such an arrangement keeps the parallel cost of the algorithm constant when different numbers of processes are used, provided the co-operation costs are neglected. Therefore in this case, as the number of processes becomes larger the length of the cooling stages goes down, which means that the frequency of co-operation increases. In the second case, of rare co-operation, the frequency is constant and the processes exchange their solutions every ω = E annealing steps (lines 17–18 in Fig. 1). For the number of processes p = 10, 15 and 20, the co-operation takes place after 2, 3 and 4 temperature drops, respectively.
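The relation between the two modes can be checked with a short computation (E = 10^5 annealing steps, as in the paper); the program below only reproduces this arithmetic.

```cpp
#include <cstdio>

int main() {
    const long E = 100000;                     // E = 10^5 annealing steps
    for (int p : {5, 10, 15, 20}) {
        long L = 5 * E / p;                    // cooling-stage length (line 3 of Fig. 1)
        long stages_between = E / L;           // rare mode: co-operation every E steps
        std::printf("p = %2d: L = %6ld, rare co-operation every %ld cooling stage(s)\n",
                    p, L, stages_between);
    }
}
```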
3 The MPI Implementation
All processes Pj , j = 0, 1, . . . , p − 1, in the MPI implementation of the parallel simulated annealing algorithm, execute the code shown in Fig. 5. Like in the general description of the algorithm (Fig. 1) the computation of processes is divided into phases 1 and 2. Depending on the co-operation mode, regular or rare, the processes interact every L or E annealing step, respectively. The MPI implementation of the co-operation between processes is presented in Fig. 6.
 1  Processes Pj, j = 0, 1, . . . , p − 1;
 2  Set co-operation mode to regular or rare depending on a test set;
 3  L := (5E)/p;  {establish the length of a cooling stage}
 4  Create the initial solution using some heuristics;
 5  current_solution := initial_solution;  best_solution := initial_solution;
 6  for f := 1 to 2 do  {execute phase 1 and 2}  {beginning of phase f}
 7    T := T_{0,f};  {initial temperature of annealing}
 8    repeat  {a cooling stage}
 9      for i := 1 to L do
10        annealing_step_f(current_solution, best_solution);
11      end for;
12      if (f = 1) or (co-operation mode is regular) then  {ω = L}
13        MPI_co_operation;
14      else  {rare co-operation: ω = E}
15        Call MPI_co_operation procedure every E annealing steps
16          counting from the beginning of the phase;  end if;
17      T := β_f T;  {temperature reduction}
18    until a_f cooling stages are executed;  {end of phase f}
19  end for;  Produce best_solution of process P_{p−1} as the solution to the VRPTW;

Fig. 5. MPI implementation
 1  procedure MPI_co_operation;
 2    if j = 0 then
 3      Send best_solution to process P1 using the MPI_Isend function;
 4    else  {j > 0}
 5      Receive best_solution_r from process P_{j−1} using the MPI_Recv function;
 6      if cost_f(best_solution_r) < cost_f(best_solution) then
 7        best_solution := best_solution_r;  current_solution := best_solution_r;
 8      end if;
 9      if j < p − 1 then
10        Send best_solution to process P_{j+1} using the MPI_Isend function;
11      end if;  end if;
12  end MPI_co_operation;

Fig. 6. Procedure of co-operation of MPI processes
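A minimal sketch of how such a pipelined exchange can be coded with MPI_Isend/MPI_Recv is shown below. It assumes the best solution is packed into a buffer of doubles whose first element carries its cost; the buffer layout, the tag value and the toy costs are illustrative only and do not reproduce the authors' data structures (in the real implementation the current solution would also be updated when a better one arrives, as in Fig. 6).

```cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

// Pipelined exchange: rank 0 only sends, the last rank only receives,
// every other rank first receives from its left neighbour, adopts the
// incoming solution if it is better, then forwards its best to the right.
void mpi_co_operation(std::vector<double>& best, double& best_cost,
                      int rank, int size, MPI_Comm comm) {
    const int tag = 7;                      // assumed message tag
    if (rank > 0) {
        std::vector<double> incoming(best.size());
        MPI_Recv(incoming.data(), (int)incoming.size(), MPI_DOUBLE,
                 rank - 1, tag, comm, MPI_STATUS_IGNORE);
        if (incoming[0] < best_cost) {      // element 0 carries cost_f
            best = incoming;
            best_cost = incoming[0];
        }
    }
    if (rank < size - 1) {
        best[0] = best_cost;
        MPI_Request req;
        MPI_Isend(best.data(), (int)best.size(), MPI_DOUBLE,
                  rank + 1, tag, comm, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);  // keep the buffer valid until sent
    }
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    std::vector<double> best(4, 0.0);
    double best_cost = 1000.0 + rank;       // toy costs: rank 0 starts with the best
    best[0] = best_cost;
    mpi_co_operation(best, best_cost, rank, size, MPI_COMM_WORLD);
    if (rank == size - 1)
        std::printf("best cost after the pipeline: %.1f\n", best_cost);
    MPI_Finalize();
}
```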
4 The OpenMP Implementation
In the OpenMP implementation of the parallel simulated annealing algorithm, the current and best solutions found to date are kept in the shared memory in the arrays defined as follows current solution, best solution: array[0 .. p − 1] of solution;
where p is the number of threads, and solution is a data structure representing a VRPTW solution. The arrays are initialized in lines 5–9 in Fig. 7. The most time-consuming part of the parallel simulated annealing algorithm in Fig. 1 is the loop in lines 9–11, where each process computes its own annealing chain. In the OpenMP implementation these chains are built by the threads within the parallel regions (lines 13–19 in Fig. 7). The co-operation of the threads is shown in Fig. 8. Summing up, only lines 5–9 and 13–19 of the OpenMP implementation shown in Fig. 7 are executed in parallel by the team of threads. The remaining parts of the implementation are executed sequentially by a single thread.
 1  omp_set_num_threads(p);  {set number of threads to p}
 2  Set co-operation mode to regular or rare depending on a test set;
 3  L := (5E)/p;  {establish the length of a cooling stage}
 4  Create the initial solution using some heuristics;
 5  #pragma omp parallel for
 6  for j := 0 to p − 1 do
 7    current_solution[j] := initial_solution;
 8    best_solution[j] := initial_solution;
 9  end for;
10  for f := 1 to 2 do  {execute phase 1 and 2}  {beginning of phase f}
11    T := T_{0,f};  {initial temperature of annealing}
12    repeat  {a cooling stage}
13      #pragma omp parallel
14      {
15        j := omp_get_thread_num();
16        for i := 1 to L do
17          annealing_step_f(current_solution[j], best_solution[j]);
18        end for;
19      }
20      if (f = 1) or (co-operation mode is regular) then  {ω = L}
21        OpenMP_co_operation;
22      else  {rare co-operation: ω = E}
23        Call OpenMP_co_operation procedure every E annealing steps counting from the beginning of the phase;
24      end if;
25      T := β_f T;  {temperature reduction}
26    until a_f cooling stages are executed;  {end of phase f}
27  end for;
28  Produce best_solution[p − 1] as the solution to the VRPTW;

Fig. 7. OpenMP implementation
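A minimal OpenMP sketch of the same structure is given below, assuming a stand-in annealing move (the real neighbor generation and cost functions are not reproduced): each thread owns slot j of the shared current/best arrays, so no synchronization is needed inside the annealing loop, and the sequential co-operation sweep of Fig. 8 follows outside the parallel region.

```cpp
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const int p = 4;                 // number of threads
    const long L = 100000;           // length of a cooling stage ((5E)/p in the paper)
    omp_set_num_threads(p);
    std::vector<double> current(p, 0.0), best(p, 1e9);

    #pragma omp parallel
    {
        int j = omp_get_thread_num();             // each thread works on slot j
        unsigned int seed = 1234u + (unsigned)j;  // per-thread RNG state (simple LCG)
        for (long i = 0; i < L; ++i) {
            seed = seed * 1103515245u + 12345u;
            double candidate = (seed % 10000) / 10000.0;  // stand-in for an annealing move
            current[j] = candidate;
            if (candidate < best[j]) best[j] = candidate;
        }
    }

    // sequential co-operation sweep (as in Fig. 8): pipe the best solution down
    for (int j = 1; j < p; ++j)
        if (best[j - 1] < best[j]) { best[j] = best[j - 1]; current[j] = best[j - 1]; }

    std::printf("best value held by the last thread: %f\n", best[p - 1]);
}
```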
 1  procedure OpenMP_co_operation;
 2    for j := 1 to p − 1 do
 3      if cost_f(best_solution[j − 1]) < cost_f(best_solution[j]) then
 4        best_solution[j] := best_solution[j − 1];
 5        current_solution[j] := best_solution[j − 1];
 6      end if;
 7    end for;
 8  end OpenMP_co_operation;

Fig. 8. Procedure of co-operation of OpenMP threads
Table 1. Results for tests R107 and R207; regular co-operation (p – number of processes or threads, τ¯ – average execution time, s – standard deviation of execution times, S – speedup, exp. – number of experiments)
Test    p    MPI implement.                    OpenMP implement.
             τ̄        s      S      exp.      τ̄        s      S      exp.
R107    1    730.8    9.9    1.00   100       1047.0   10.1   1.00   100
        5    149.4    1.7    4.89   100       263.5    3.9    3.97   100
        10   76.3     0.9    9.58   100       135.8    2.5    7.71   100
        15   50.8     0.6    14.38  100       91.0     0.9    11.51  100
        20   38.2     0.5    19.16  100       69.0     0.7    15.18  100
R207    1    1113.7   36.9   1.00   100       1794.1   64.6   1.00   100
        5    231.5    8.6    4.81   100       434.0    21.6   4.13   100
        10   115.2    4.6    9.67   100       223.5    16.2   8.03   100
        15   77.5     2.4    14.38  100       144.5    3.6    12.41  100
        20   58.6     2.3    19.01  100       110.0    2.7    16.32  100

5 Experimental Results
The MPI implementation was run on a cluster of 288 nodes which consisted of 1 or 2 Intel Xeon 2.33 GHz processors, each with 4 MB level 3 cache. Nodes were connected with the Infiniband DDR fat-tree full-cbb network (throughput 20 Gbps, delay 5 μs). The computer was executing a Scientific Linux. The source code was compiled using Intel 10.1 compiler and MVAPICH1 v. 0.9.9 MPI library. The OpenMP computations were performed on a ccNUMA parallel computer SGI Altix 3700 equipped with 128 Intel Itanium 2 processors clocked at 1.5 GHz, each with 6 MB level 3 cache. The computer was running Suse Linux Enterprise Server 10 operating system. The source code was compiled using Intel C compiler 11.0 with the following compilation flags: -openmp -openmp-report2 -wd1572. The MPI and OpenMP implementations were compared on the selected VRPTW benchmarking tests proposed by Solomon [7]. The investigations reported in [6] indicated that Solomon’s tests can be divided into 3 groups: I – tests which can be solved quickly (e.g. using p = 20 processes) to good accuracy
Table 2. Results for tests R104 and RC106; rare co-operation (meanings of columns as before)

Test    p    MPI implement.                    OpenMP implement.
             τ̄        s      S      exp.      τ̄        s      S      exp.
R104    1    755.6    7.8    1.00   1000      1028.2   10.3   1.00   234
        5    154.7    2.4    4.88   1000      262.6    8.0    3.92   200
        10   79.9     1.1    9.46   1000      134.8    5.3    7.63   200
        15   52.1     1.0    14.51  1000      90.5     3.2    11.36  200
        20   39.4     0.5    19.18  1000      68.7     1.7    14.97  200
RC106   1    705.5    6.9    1.00   200       1072.0   20.5   1.00   100
        5    143.6    1.5    4.91   200       278.1    4.9    3.85   100
        10   73.1     0.7    9.65   200       139.9    29.8   7.66   100
        15   49.2     0.6    14.35  200       95.4     1.6    11.24  100
        20   35.5     0.6    19.33  200       73.8     1.0    14.54  100
Fig. 9. Speedup S vs. number of processes (threads) p for tests R107 and R207 (broken line shows the ideal speedup)
Fig. 10. Speedup S vs. number of processes (threads) p for tests R104 and RC106
with rare co-operation (ω = E); II – tests which can be solved quickly (p = 20) but the frequency of co-operation should be relatively high (in this work we call it regular) (e.g. ω = E/4 for p = 20) to achieve good accuracy of solutions; and III – tests whose solving cannot be accelerated as much as for the tests from
Fig. 11. Speedup S vs. number of processes (threads) p for tests RC104 and RC202
groups I and II, i.e. the number of processes should be less than 20 to obtain good accuracy of solutions. For comparisons of the implementations, two tests from each of these groups were chosen: tests R104, RC106 from group I, tests R107, R207 from group II, and tests RC104, RC202 from group III. The results of the experiments are presented¹ in Tab. 1–2 and illustrated in Fig. 9–11. It can be seen that the OpenMP implementation achieves slightly worse speedups than the MPI implementation. The experiments are still in progress and they concentrate now on the accuracy of solutions to the VRPTW. According to the results presented in [6], the RC104 and RC202 tests are difficult to solve, i.e. for a larger number of processes a loss of accuracy of solutions to these tests is observed. The goal of the ongoing experiments is to clarify this issue using the MPI and OpenMP implementations.
6 Conclusions
The MPI and OpenMP implementations of the parallel simulated annealing algorithm to solve the VRPTW were presented. The experimental results obtained for the selected tests by Solomon indicated that the MPI implementation scales better than the OpenMP implementation. The issue of the accuracy of the VRPTW solutions is still to be cleared up.
Acknowledgments. We thank the following computing centers where the computations of our project were carried out: Academic Computer Centre in Gdańsk TASK, Academic Computer Centre CYFRONET AGH, Kraków (computing grants 027/2004 and 069/2004), Poznań Supercomputing and Networking Center, Interdisciplinary Centre for Mathematical and Computational Modelling, Warsaw University (computing grant G27-9), Wroclaw Centre for Networking and Supercomputing (computing grant 04/97). The research of this project was supported by the Minister of Science and Higher Education grant No 3177/B/T02/2008/35.

¹ Because of lack of space we omit the results for tests RC104 and RC202.
References

1. Czarnas, P., Czech, Z.J., Gocyla, P.: Parallel simulated annealing for bicriterion optimization problems. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds.) PPAM 2004. LNCS, vol. 3019, pp. 233–240. Springer, Heidelberg (2004)
2. Czech, Z.J., Czarnas, P.: A parallel simulated annealing for the vehicle routing problem with time windows. In: Proc. 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing, Canary Islands, Spain, January 2002, pp. 376–383 (2002)
3. Czech, Z.J., Wieczorek, B.: Solving bicriterion optimization problems by parallel simulated annealing. In: Proc. of the 14th Euromicro Conference on Parallel, Distributed and Network-based Processing, Montbéliard-Sochaux, France, February 15-17, pp. 7–14 (2006) (IEEE Conference Publishing Services)
4. Czech, Z.J., Wieczorek, B.: Frequency of co-operation of parallel simulated annealing processes. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 43–50. Springer, Heidelberg (2006)
5. Czech, Z.J.: Speeding up sequential annealing by parallelization. In: Proc. of the International Conference on Parallel Computing in Electrical Engineering, PARELEC 2006, Bialystok, Poland, September 13-17, pp. 349–354 (2006) (IEEE Conference Publishing Services)
6. Czech, Z.J.: Co-operation of processes in parallel simulated annealing. In: Proc. of the 18th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS 2007), Dallas, Texas, USA (November 13-15, 2006), pp. 401–406 (2007)
7. Solomon, M.M.: Algorithms for the vehicle routing and scheduling problems with time window constraints. Operations Research 35, 254–265 (1987), http://w.cba.neu.edu/~msolomon/problems.htm
Parallel Computing Scheme for Graph Grammar-Based Syntactic Pattern Recognition

Mariusz Flasiński, Janusz Jurek, and Szymon Myśliński

IT Systems Department, Jagiellonian University
Straszewskiego 27, 31-110 Cracow, Poland
http://www.wzks.uj.edu.pl/ksi
Abstract. Real-time syntactic pattern recognition imposes strict computing time constraints on newly developed techniques. Recently, a method for the analysis of hand postures of the Polish Sign Language based on the ETPL(k) graph grammars (Flasiński: Patt. Recogn. 26 (1993), 1-16; Theor. Comp. Sci. 201 (1998), 189-231) has been constructed. In order to make the implemented system more feasible for the users, research into parallelization of the pattern recognition process has been carried out. Possible techniques of task distribution have been tested. This has allowed us to define an optimum strategy of parallelization. The results are presented in the paper. Keywords: parallel computing, syntactic pattern recognition, graph grammar.
1 Introduction
In the field of pattern recognition and image processing, both parallel architectures [11,18,22,26] and parallel processing techniques [2,14,15,19,23,25] have been widely used since the early eighties. Optical pattern (image) recognition is one of the main areas of use of these techniques, since it usually imposes real-time constraints on the computing process. Pattern recognition methods can be divided into two main categories: the statistical one and the structural/syntactic one. Whereas a lot of research has been done on parallelization techniques for the first approach (see e.g. the citations listed above), only a few results concerning the second area have been published till now. The main idea of syntactic pattern recognition consists in representing a pattern as a structure in the form of a string, tree or graph. Then, a set of structures is treated as a formal language (string, tree or graph language) that can be analyzed with a (string, tree or graph) automaton (syntax analyzer, parser). A syntactic pattern recognition computing process consists of three main phases:
- image processing (filtering, enhancement, etc.),
- generation of a structural representation, and
- parsing of the structural representation (syntactic analysis).
In the area of syntactic pattern recognition, research into parallel computing techniques has concerned mainly the first two phases (see the citations listed above).
For the third phase, which usually demands a very large computational effort [12], only a few results are known. A parallel version of the Earley parsing algorithm for syntactic pattern recognition was defined by Chiang and Fu in 1984 [1]. Then, results of research into parallelization of parsing for attributed grammars were published in the nineties: [13,21] by Klaiber, Gokhale and independently by Tokuda and Watanabe (the tasks of parsing and attribute evaluation in sequence), and by Sideri, Efremidis and Papakonstantinou [20] (both parsing and attribute evaluation in parallel). A parallelization technique for structural pattern recognition was proposed by Wu and Rosenfeld in 1988 [24]. A method of pipelined processing of attributed grammar parsing for syntactic pattern recognition was presented by Pavlatos, Panagopoulos and Papakonstantinou in 2004 [17]. Research into parsing for graph grammars has resulted in defining a very efficient, O(n²), parser for ETPL(k) graph grammars [3,5,7,9]. It has been used as a syntactic pattern recognition model for a variety of applications, such as: pattern recognition in an industrial robot control system [5], a software allocation system for a parallel/distributed environment [4], solid modelling and analysis in a CAD/CAM integration system [6], and a real-time expert control system [8]. Recently, it has also been applied to the recognition of hand postures of the Polish Sign Language [10,16]. Hard real-time requirements of this last application have motivated us to search for a parallel scheme of the recognition process that could allow us to make the system more feasible for the users. The results of this research are presented in the paper. A syntactic pattern recognition model of hand posture analysis is shortly presented in the next chapter. The architecture of the hand posture recognition system is described in the third chapter. Chapter 4 includes considerations concerning two possible models of task distribution for parallel processing. It also contains the results of experiments made to compare these models. Concluding remarks are presented in the last chapter.
2 Syntactic Pattern Recognition Model of Hand Posture Analysis
A syntactic pattern recognition of hand postures of the Polish Sign Language consists of three main phases shown in Fig. 1.
Fig. 1. Main phases of a syntactic pattern recognition of hand postures
The first phase of (initial) image processing includes: skin region detection, noise removal, contour extraction, and a polygonal approximation of the contour (see Figures 2(a) and 2(b)). The set of characteristic points of an image (that will be used as a set of graph nodes) consists of the region centroid and polygon vertices, as depicted in Fig. 2(b) [16]. The second phase is critical, because, after spanning a graph structure on the set of characteristic points (see Fig. 2(c)), we have to define a graph representation (see Fig. 2(d)) fulfilling the formal definition of IE-graphs [5,7]. Only such graphs belong to a parsable class of ETPL(k) graph languages [3,7], and they can be parsed with an efficient, O(n²), parser [5].
Fig. 2. Results of the phases of image processing and graph representation generation
A formal notion of IE-graph is defined in the following way [5,7].

Definition. An indexed edge-unambiguous graph, IE graph, over Σ and Γ is a quintuple g = (V, E, Σ, Γ, φ), where
- V is a finite, non-empty set of nodes, to which indices have been ascribed in an unambiguous way,
- Σ is a finite, non-empty set of node labels,
- Γ is a finite, non-empty set of edge labels,
- E is a set of edges of the form (v, λ, w), where v, w ∈ V, λ ∈ Γ, such that the index of v is less than the index of w,
- φ : V −→ Σ is a node-labeling function,
and g contains a BFS spanning tree whose nodes are indexed in the same way as the nodes of g and whose edges are directed in the same way as the edges of g.

The final, third stage of syntactic analysis consists in parsing an IE-graph representation of a hand posture image. In this way the parser identifies, with ca 95% accuracy, a hand posture of the Polish Sign Language.
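A minimal container reflecting this definition might look as follows; the structure is a sketch based only on the definition above (labels as single characters, node index equal to the insertion order) and is not the representation used in the actual system.

```cpp
#include <vector>

// IE-graph sketch: nodes carry an unambiguous index and a label from Sigma,
// and every edge (v, lambda, w) is stored so that index(v) < index(w); the
// check in add_edge enforces that ordering when an edge is added.
struct IEGraph {
    struct Edge { int from, to; char label; };       // edge labels from Gamma
    std::vector<char> node_label;                    // node i -> label from Sigma
    std::vector<Edge> edges;

    int add_node(char sigma_label) {                 // index = position in vector
        node_label.push_back(sigma_label);
        return (int)node_label.size() - 1;
    }
    bool add_edge(int v, int w, char gamma_label) {
        if (v < 0 || v >= w || w >= (int)node_label.size()) return false;
        edges.push_back({v, w, gamma_label});
        return true;
    }
};

int main() {
    IEGraph g;
    int v0 = g.add_node('s');                        // illustrative node labels
    int v1 = g.add_node('b');
    g.add_edge(v0, v1, 'a');                         // illustrative edge label
}
```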
3 System Architecture
A computer-aided recognition of any sign language imposes hard real-time constraints on a system. In fact, a fast identification of consecutive hand postures is a necessary condition for the system to be useful in practice. Therefore, to increase the efficiency of the hand posture identification process, the HandPostPolSign 2.0 system, constructed with the use of the model presented in the previous chapter, has been designed and implemented on the basis of a parallel computing paradigm. In a syntactic pattern recognition-based model, the result of each phase is an input for the next one. Treating each phase as an independent task enables their parallel execution on a multiprocessor (multicore) system in the form of different threads. A software architecture of the HandPostPolSign 2.0 system is shown in Fig. 3. Each of the processing modules (corresponding to a phase of the syntactic pattern recognition model) is designed as an independent sub-system. The system has been implemented in C++. For image processing, the Intel Integrated Performance Primitives library has been used. The system consists of
Fig. 3. A software architecture of the HandPostPolSign 2.0 system
seven main classes. Three of them, namely CPreProcessor, CGraphGenerator and CParser, correspond to the main syntactic pattern recognition phases. The remaining classes, CDispatcher, CCSystem, CImageSrc and CProcessingMutex, are responsible for: task allocation and co-operation with multiple image sources, preserving the correct sequence of recognition results, co-operation with input image devices, and ensuring data consistency during multi-threaded processing, respectively. In HandPostPolSign 2.0 the number of threads corresponding to syntactic pattern recognition tasks can be changed dynamically (many hand posture images can be recognized simultaneously). However, a proper model of task distribution had to be defined in order to optimize the utilization of a multiprocessor system. Results of the research into such a model are presented in the following chapter.
4 Task Distribution Model
In this chapter two possible models of distribution of syntactic pattern recognition tasks are introduced. The results of time efficiency experiments for these models are presented and the corresponding conclusions are discussed.

4.1 Vertical Model of Tasks Distribution (VMTD)
In the vertical model of recognition task distribution (VMTD model), each of the sub-systems (i.e. initial processing, graph generation, syntactic analysis) is initialized in a separate thread that is executed on a separate CPU (or core) of a multiprocessor (multicore) system. When the recognition process starts, the first image is input to the initial processing sub-system. Then, the set of characteristic points identified in the hand image is handed over to the graph generation sub-system. At the same time, the second image is input to the initial processing sub-system, which has already finished processing the first image. In the meantime, a graph representation of the first image is passed to the syntactic analysis sub-system. Then, the set of characteristic points identified for the second image
Fig. 4. Vertical model of recognition tasks distribution. IP – Initial Processing, GG – Graph Generation, SA – Syntactic Analysis (Parsing). Various colours represent consecutive images.
Fig. 5. A VMTD model processing times of sequences of 10 images in relation to a number of vertices of their graph-based image representation
is handed over to the graph generation sub-system, while the third image is fed into the initial processing sub-system, etc. Successive images are put into the system as shown in Fig. 4. Such a task distribution model applied in the pattern recognition field seems natural, because recognition results are obtained in the same order as the images input to the system. However, experiments concerning computing time have revealed its low efficiency. In Fig. 5, processing times of hand image recognition in relation to the number of vertices of the graph representations are shown. For each test, sequences of 10 images have been used to acquire measurement results. One can easily see that increasing the number of threads hardly influences the overall performance of the system. A VMTD model can be efficient for pattern recognition applications, however, only if the computing times of the processing phases are comparable. As we have discussed in chapter 2, the algorithms used in the phases of graph representation generation and syntactic analysis are based on the efficient model of ETPL(k)
Fig. 6. Processing times of a hand image for three main phases
graph grammars [5,7]. In Fig. 6, the processing times of a single hand image for the three main phases of our method are shown. It can be easily noticed that the initial processing is the most time-consuming phase of the whole process. As a result, the graph generation sub-system and the parsing sub-system are idle for most of the time. The disproportion of processing times is so big that even performing the initial processing phase sequentially in one thread, and both graph generation and parsing in another thread, does not solve the problem.

4.2 Horizontal Model of Tasks Distribution (HMTD)
In the horizontal model of recognition task distribution (HMTD model), all three processing tasks related to an input image are initialized in one thread, as shown in Fig. 7. Each such thread is executed on a separate CPU (or core) of a multiprocessor (multicore) system.
Fig. 7. Horizontal model of recognition tasks distribution. IP – Initial Processing, GG – Graph Generation, SA – Syntactic Analysis (Parsing). Various colours represent consecutive images.
Processing times of the hand image recognition process (in relation to the number of vertices of the graph representations) performed on 1, 2 and 4 processing units simultaneously for this model are presented in Fig. 8. As one can easily notice, a remarkable performance boost of the system has been achieved for the HMTD model. However, in this case some problems had to be solved to assure the correctness of the recognition process. Since IE-graph representations of hand posture images differ in the number of vertices, their recognition times vary as well. Of course, in the case of our application (an analysis of the Polish Sign Language), the sequence of recognition results at the output of the system has to be the same as the sequence of images at the input. Therefore, a dispatching method preserving the image sequence had to be constructed for the synchronization of processing units. It has been implemented as the dispatcher sub-system mentioned in chapter 3. A sequence of images is directed to the dispatcher sub-system from an image source (i.e. a video camera). The dispatcher sends subsequent images to available processing units running in separate threads. The recognition results from the
Fig. 8. A HMTD model processing times of sequences of 10 images in relation to a number of vertices of their graph-based image representation
processing units are stored in a buffer. Mechanisms preserving the right order of the recognition results are implemented for the buffer. This allows the system to preserve the correct sequence of the generated recognition results.
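A minimal sketch of such an order-preserving buffer is shown below, assuming C++11 threads; the class and names are hypothetical and only illustrate the dispatching idea, not the actual HandPostPolSign 2.0 code. Worker threads finish images out of order, but results are emitted strictly in the input sequence.

```cpp
#include <chrono>
#include <cstdio>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Results wait in a sorted map until all earlier sequence numbers have been
// emitted, so the output order always matches the input order of images.
class OrderedBuffer {
    std::mutex m;
    std::map<int, std::string> done;   // results waiting for their turn
    int next = 0;                      // sequence number to emit next
public:
    void put(int seq, const std::string& result) {
        std::lock_guard<std::mutex> lock(m);
        done[seq] = result;
        while (!done.empty() && done.begin()->first == next) {
            std::printf("image %d -> %s\n", next, done.begin()->second.c_str());
            done.erase(done.begin());
            ++next;
        }
    }
};

int main() {
    OrderedBuffer out;
    std::vector<std::thread> workers;
    for (int seq = 0; seq < 8; ++seq)
        workers.emplace_back([&out, seq] {
            // stand-in for initial processing + graph generation + parsing
            std::this_thread::sleep_for(std::chrono::milliseconds((seq * 37) % 50));
            out.put(seq, "posture_" + std::to_string(seq));
        });
    for (auto& w : workers) w.join();
}
```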
5 Concluding Remarks
The research into a system recognizing hand postures of the Polish Sign Language has revealed the need to make the developed method more efficient computationally, in order to be more feasible for the users. In the field of pattern recognition, such an increase of time efficiency is usually achieved by applying parallel computing techniques. Therefore, the hand posture recognition system HandPostPolSign 2.0 has been designed and implemented on the basis of a parallel computing paradigm. Two models of distribution of the recognition tasks (phases) of the system have been considered to increase the system throughput. In the vertical model VMTD, subsequent recognition tasks are initialized in separate threads. Although such a model seems more natural for image analysis applications, since recognition results are obtained in the same order as the images input to the system, it has turned out to be inefficient for our method. This results from the fact that (contrary to standard syntactic pattern recognition methods, in which the syntactic analysis phase may yield a combinatorial explosion [12]) we have used a very efficient, O(n²), parsing scheme based on ETPL(k) graph grammars [3,5,7,9]. The experiments made for each phase of the syntactic pattern recognition scheme separately allow us to draw a general methodological conclusion in the field of syntactic pattern recognition theory. The VMTD model can be efficient only if the computing times of the phases of image preprocessing, structural representation generation, and parsing are comparable. In the horizontal model HMTD, all three recognition tasks (phases) related to an input image are performed in one thread. In this case, a remarkable performance boost of the recognition process has been obtained. The increase in
performance is ca. 41% and ca. 59% for 2 and 4 processing units, respectively (Intel Core 2 Quad, 2.33 GHz, 4 GB RAM), compared to the system performance with no parallel processing. However, for the HMTD model, mechanisms allowing the system to preserve the same sequence of recognition results at the output as the sequence of input images have to be implemented.
References

1. Chiang, Y., Fu, K.S.: Parallel parsing algorithms and VLSI implementation for syntactic pattern recognition. IEEE Trans. Pattern Anal. Machine Intell. PAMI-6, 302–313 (1984)
2. Davis, L.S., Rosenfeld, A., Inoue, K., Nivat, M., Wang, P.S.P. (eds.): Parallel Image Analysis: Theory and Applications. World Scientific, Singapore (1995)
3. Flasiński, M.: Parsing of edNLC-graph grammars for scene analysis. Pattern Recognition 21, 623–629 (1988)
4. Flasiński, M., Kotulski, L.: On the use of graph grammars for the control of a distributed software allocation. The Computer Journal 35, A165–A175 (1992)
5. Flasiński, M.: On the parsing of deterministic graph languages for syntactic pattern recognition. Pattern Recognition 26, 1–16 (1993)
6. Flasiński, M.: Use of graph grammars for the description of mechanical parts. Computer Aided Design 27, 403–433 (1995)
7. Flasiński, M.: Power properties of NLC graph grammars with a polynomial membership problem. Theoretical Computer Science 201, 189–231 (1998)
8. Flasiński, M.: Automata-based multi-agent model as a tool for constructing real-time intelligent control systems. In: Dunin-Keplicz, B., Nawarecki, E. (eds.) CEEMAS 2001. LNCS (LNAI), vol. 2296, pp. 103–110. Springer, Heidelberg (2002)
9. Flasiński, M.: Inference of parsable graph grammars for syntactic pattern recognition. Fundamenta Informaticae 80, 379–413 (2007)
10. Flasiński, M., Jurek, J., Myśliński, S.: Multi-agent system for recognition of hand postures. In: Allen, G., et al. (eds.) ICCS 2009, Part II. LNCS, vol. 5545, pp. 815–824. Springer, Heidelberg (2009)
11. Guerra, C.: 2D object recognition on a reconfigurable mesh. Pattern Recognition 31, 83–88 (1998)
12. Jain, A.K., Duin, R.P.W., Mao, J.: Statistical pattern recognition: a review. IEEE Trans. Pattern Anal. Machine Intell. PAMI-22, 4–37 (2000)
13. Klaiber, A., Gokhale, M.: Parallel evaluation of attributed grammars. IEEE Trans. Parall. Distr. Syst. 3, 206–220 (1992)
14. Lee, S.S., Tanaka, H.T.: Parallel image segmentation with adaptive mesh. In: Proc. 15th Int. Conf. Patt. Recogn., Barcelona, Spain, September 2000, vol. 1, pp. 635–639 (2000)
15. Miguet, S., Montanvert, A., Wang, P.S.P. (eds.): Parallel Image Analysis. World Scientific, Singapore (1998)
16. Myśliński, S.: On the generation of graph representation of hand postures for syntactic pattern recognition. Journal of Applied Computer Science 17 (2009)
17. Pavlatos, C., Panagopoulos, I., Papakonstantinou, G.: A programmable pipelined coprocessor for parsing applications. In: Proc. Workshop on Application Specific Processors, Stockholm, September 2004, pp. 84–91 (2004)
18. Rabah, H., Mathias, H., Weber, S., Mozef, E., Tanougast, C.: Linear array processors with multiple access modes memory for real-time image processing. Real Time Imaging 9, 205–213 (2003)
19. Ranganathan, N. (ed.): VLSI and Parallel Computing for Pattern Recognition and Artificial Intelligence. World Scientific, Singapore (1995)
20. Sideri, M., Efremidis, S., Papakonstantinou, G.: Semantically driven parsing of CFG. The Computer Journal 32, 91–98 (1994)
21. Tokuda, T., Watanabe, Y.: An attribute evaluation of context-free languages. Information Processing Letters 57, 91–98 (1994)
22. Weschler, H.: An overview of parallel hardware architectures for computer vision. Int. J. Patt. Recogn. Artif. Intell. 6, 629–649 (1992)
23. Wu, A.Y., Bhaskar, S.K., Rosenfeld, A.: Parallel processing of region boundaries. Pattern Recognition 22, 165–172 (1989)
24. Wu, A.Y., Rosenfeld, A.: Parallel processing of encoded bit strings. Pattern Recognition 21, 559–565 (1988)
25. Yalamanchili, S., Aggarwal, J.K.: Analysis of a model for parallel image processing. Pattern Recognition 18, 1–16 (1985)
26. Zhang, D., Pal, S.K. (eds.): Neural Networks and Systolic Array Design. Series in Machine Perception and Artificial Intelligence, vol. 49. World Scientific, Singapore (2002)
Extended Cascaded Star Schema for Distributed Spatial Data Warehouse Marcin Gorawski Institute of Computer Science, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
[email protected]
Abstract. In this paper several new aspects of spatial data warehouse modeling are presented. The extended cascaded star schema for a distributed spatial data warehouse (DSDW) is defined. Research has shown that there is a strong need to build many extended cascaded star schemas of SDWs as an outcome of separate spatio-temporal telemetric conceptual models. For one of these new data schemas, definitions of cascaded ECOLAP operations are presented. These operations are based on relational algebra and make ad-hoc query execution possible.
1
Introduction
The design of every data warehouse begins with the creation of a logical model that presents the described phenomenon in an effective way. The rapid development in the field of data warehousing means that ever more sophisticated concepts have to be modeled. In particular, a Spatial Data Warehouse (SDW) stores multidimensional (hyper-dimensional) data and describes processes that consist of multiple themes [1,11]. Hence there is a need to define new logical schemas of SDWs and models of on-line analytical processing (OLAP), both for spatial and flat data [10]. Significant research has been carried out in the area of SDWs and Distributed Spatial Data Warehouses (DSDW) [2,5,6,7,8,9]. Previous works in the field of data modeling have contributed to the creation of a few commonly used schemas. In Section 2, the cascaded star schema is discussed and an example is presented that shows the disadvantages of this schema. Section 3 contains the formal definition of a new data model, the extended cascaded star, and in Section 4 the definition of ECOLAP operations for the new schema is presented. Section 5 summarizes the paper. Regulation of the energy sector in the EU has created a new market - the media (electricity, gas, heat, and water) recipient market. The main problem is the automated reading of hundreds of thousands of recipients' meters and the critically fast analysis of terabyte-sized sets of meter data. Comprehensive spatial telemetric data analysis can be performed by using the technology called Automatic (Integrated) Meter Reading (A(I)MR) on the one hand, and data warehouse based Decision Support Systems (DSS) on the other [4,5]. Our solution consists of two layers (Fig. 1): the first is a Telemetric System of Integrated Meter Readings and the second is a Distributed Spatial Telemetric Data Warehouse.
Fig. 1. Two layers of a telemetric infrastructure (diagram not reproduced: meters communicate over local radio links and the GSM/GPRS network with the AIUT GSM telemetric server, which feeds the DSTDW servers via TCP/IP during the ETL data distribution and loading process; DSTDW clients communicate with the servers over RMI)
The Telemetric System of Integrated Meter Readings (TSIMR) is based on A(I)MR and GSM/GPRS technology. The system sends the data from meters located in a huge geographical region to the telemetric server over the GSM mobile phone network using GPRS. The AIUT GSM system possesses all of these features. The Distributed Spatial Telemetric Data Warehouse system (DSDW(t)) is a decision support system. Decisions concerning the production of a given medium are made on the basis of short-term prognoses of the medium's consumption. The prognoses are calculated by analyzing the data gathered in DSDW(t), which in turn is supplied with data by the telemetric server. In order to meet the need for improved performance created by growing data sizes in the Spatial Telemetric Data Warehouse (SDW(t)), parallel processing and efficient indexing become increasingly important [5]. DSDW(t) has a distributed shared-nothing (SN) architecture. Communication between nodes takes place only through network connections (message passing) using RMI (Remote Method Invocation). Relational on-line analytical processing (ROLAP) for spatial data is performed in the DSDW(t) system based on the cascaded star model [3] indexed by an aR-tree [4].
2
Data Models
The star model (schema) is a logical structure that consists of one fact table, containing a huge amount of data, and a few smaller dimension tables. Referential integrity is defined between the fact table and all dimension tables. Dimension tables are denormalized, while the fact table is normalized. The advantages of this schema are a simple structure and high efficiency in query execution, which results from the low number of defined connections. One of the disadvantages of this schema is the long data loading time that results from the denormalization of the dimension tables. Another problem is that the star schema can be used only for data related to a single theme. In the snowflake schema the dimension tables are not denormalized. As a result of the separation of dimension tables, the hierarchy is explicitly highlighted. Thanks to table normalization, in this model there is no data redundancy, and
Fig. 2. Cascaded star schema S^C with central fact table T (diagram not reproduced; nested dimension stars a-g surround the central fact table T)
the schema is easy to modify. An additional virtue of normalization is faster data loading, but it influences the answer time, which is longer than in a star schema. In some cases a few fact tables need to share the same dimension; that is where the fact constellation is used. This model is used very rarely, since there are not many concepts which can be modeled with that schema. The logical structures of the OLAP models described above are not able to model spatial data optimally with regard to their hyper-dimensionality and multiple themes. The mentioned schemas are suitable for representing data which are single-process oriented, and the associated OLAP operations are processed along the dimension hierarchy. Consequently, using a star or a snowflake schema, which presents a single process, means that a complex query in a DSDW relating to many parallel processes described at different dimension levels must be defined as several separate queries, whose results are combined by the user. In the paper [11] a new formalized logical ROLAP model called the Cascaded Star was presented, which allows the modeling of hyper-dimensional data and multi-thematic decision support processes. The Cascaded Star S^C is characterized by the fact that some of its dimensions form separate nested single star schemas or cascaded stars. The structure of a cascaded star S^C is presented in Fig. 2.
Definition 1. Cascaded Star [11]. The cascaded star is defined as S^C = (SS, D^C, T^C), where:
• SS is a set of stars, such that S_i ∈ SS is a single star schema S^1 or a cascaded star schema S^C.
• D^C is a set of dimension tables of S^C, where every dimension table D_i ∈ D^C contains [A_i, PK_i]. A_i is the set of attributes of D_i, and PK_i is the primary key of dimension D_i.
• T^C = [V, SK_D, PK_D] is the fact table of the cascaded star. V is a set of central measurements. SK_D = {SK_i} is the set of keys of S_i | S_i ∈ SS, and PK_D = {PK_i} is the set of primary keys of D_i | D_i ∈ D^C.
Even though the cascaded star schema has extended the capabilities of modeling logical DSDW schemas, there are still situations in which this model is not sufficient for describing spatial, multi-thematic phenomena. An example of such a
Fig. 3. A part of the DSDW(t) schema - cascaded star and hierarchical schemas (diagram not reproduced; fact tables NODE, METER and CLIENT with dimensions such as NODE_MODEL, METER_MODEL, LOCATION, STATUS and PAYMENT, and the hierarchical dimensions DATE (DAY-MONTH-QUARTER-YEAR) and ADDRESS (STREET-CITY-COUNTRY))
system is DSDW(t), which the APAS Group is currently working on [3,6]. Fig. 3 presents an example of a part of the DSDW(t) system. It contains a cascaded star schema with three fact tables (NODE, METER and CLIENT) and two hierarchical dimensions, ADDRESS and DATE. According to the definition of the cascaded star presented above, the dimensions of a cascaded star can be either single star or cascaded star schemas. Therefore, the hierarchical schemas DATE and ADDRESS are separate elements, and no connections between them and the cascaded star are defined. Query: find the addresses of those clients whose meters were installed in the year 2005. The query relates to the cascaded star as well as to the hierarchical dimensions ADDRESS and DATE. It has to be divided into several separate queries, whose results have to be joined by the user. First, all dates (DAY.day_id) in the dimension DATE for which YEAR.year = '2005' have to be found. The results of this query are then used in a second query to find the meters and address identifiers of clients whose meters were installed in the given year (DAY.day_id = METER.install_date AND METER.client = CLIENT.id). Given the address identifiers, full information about a client's address in the hierarchical dimension ADDRESS can be found in a separate query (CLIENT.address = STREET.street_id). As can be seen, three separate queries have to be defined to obtain the expected result. Only one example of a DSDW(t) schema for which the cascaded star schema is insufficient has been shown here. Paper [6] collects more complex examples of the logical schemas of DSDW(t) systems, with detailed descriptions, that prove the need to work out a new schema for this system. Interested readers are referred to that paper. Even though the cascaded star schema has extended the capabilities of modeling logical schemas for a DSDW, there are still situations in which this model is not sufficient for describing spatial, multi-thematic phenomena.
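A hedged sketch of the three-query decomposition described above, written against a generic Python DB-API connection (qmark parameter style). The table and column names are taken from the text and Fig. 3, but the join path inside the DATE hierarchy (DAY-MONTH-QUARTER-YEAR) and the exact key names are assumptions made for illustration.

```python
def addresses_of_clients_with_meters_installed_in(conn, year="2005"):
    cur = conn.cursor()

    # Query 1: all DAY.day_id values that roll up to the given YEAR.year.
    cur.execute(
        "SELECT d.day_id FROM DAY d "
        "JOIN MONTH m ON d.month_id = m.month_id "
        "JOIN QUARTER q ON m.quarter_id = q.quarter_id "
        "JOIN YEAR y ON q.year_id = y.year_id "
        "WHERE y.year = ?", (year,))
    day_ids = [r[0] for r in cur.fetchall()]
    if not day_ids:
        return []

    # Query 2: address identifiers of clients whose meters were installed on those days
    # (DAY.day_id = METER.install_date AND METER.client = CLIENT.id).
    marks = ",".join("?" * len(day_ids))
    cur.execute(
        f"SELECT c.address FROM METER mt JOIN CLIENT c ON mt.client = c.id "
        f"WHERE mt.install_date IN ({marks})", day_ids)
    address_ids = [r[0] for r in cur.fetchall()]
    if not address_ids:
        return []

    # Query 3: resolve the ADDRESS hierarchy (CLIENT.address = STREET.street_id).
    marks = ",".join("?" * len(address_ids))
    cur.execute(
        f"SELECT s.street_id, s.name, s.city_id FROM STREET s "
        f"WHERE s.street_id IN ({marks})", address_ids)
    return cur.fetchall()
```

The three result sets are combined on the client side, which is exactly the manual joining that motivates the extended schema of Section 3.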
3
Formalization of the Extended Cascaded Star Schema
Definition 1 is based on the concept of a single star cube S^1. In the paper [11] this concept was defined imprecisely because of (a) ambiguity of the terminology (cube) and of the definitions used (dimension tables, fact tables), and (b) lack of conceptual completeness of the schema - the influence of data granulation is not taken into consideration. The star schema projects the objects of the conceptual model onto a logical model in the following way: (a) objects that keep information about facts are stored in the form of fact tables (base tables), (b) dimensions and their attributes are stored in the form of denormalized dimension tables, (c) the schema contains the minimal number of joins between the tables. In the new Definition 2 of the single star schema the formal deficiencies mentioned above are eliminated.
Definition 2. Single star. The single star schema is defined as S^S = (D, F), where:
1. D is a set of flat dimension tables D_i. Every dimension table D_i ∈ D contains [A_i, IK_i, G_i], where:
• A_i is the set of attributes of D_i,
• IK_i is a set of surrogate, numerical keys identifying dimension D_i,
• G_i is the set of data granulation levels of dimension D_i.
2. F = [V_S, PK_S, G_S] is the fact table, where:
• PK_S is the primary key of S^S, PK_S = {IK_i}, where IK_i is a key that identifies D_i | D_i ∈ D,
• V_S is a set of facts, V_S = f(G_S), where f is a function that, for a given granulation level G_S | G_S ∈ G_i, generates the values of the numerical facts V_S.
The disadvantages of the star schema mentioned above can be eliminated by using the snowflake schema. The main changes introduced in the snowflake schema are: (a) the departure from full denormalization of the dimension tables towards their normalization - an explicit presentation of the real, hierarchical structure of the dimensions, (b) the fact tables store only one level of aggregation, (c) the maintenance of relations between attributes of the same dimension with multiplicities 1:1, 1:N, N:1 and N:M. Depending on the dimension tables' degree of normalization, four types of the snowflake schema can be distinguished (Def. 3).
Definition 3. Snowflake. The snowflake schema is defined as S^F = (H, R, F), where:
1. H is a set of hierarchical dimensions, where for a relation R of type:
• {1:1, 1:N, N:1}, every hierarchical dimension H_i ∈ H has a determined degree L_i, defined as the number of levels of the hierarchy. H_i is a set of tables creating a normalized dimension with minimal cardinality L_i. Table h_j ∈ H_i contains [A_hj, PK_hj, IK_hj], where A_hj is the set of
attributes of h_j, PK_hj is the primary key of h_j, and IK_hj is a set of foreign keys migrating from tables h_{j+1}, ..., h_{L_i}. The set IK_hj is empty on the highest level of the hierarchy and for a one-level hierarchy (flat dimension),
• {N:M}, the set R is a set of relation tables r_ik-H_i/ik+r between attributes of one dimension H_i | H_i ∈ H, such that r_ik-H_i/ik+r contains combinations of identifiers of the attributes of tables h_j, h_k that are in relation with multiplicity N:M within the dimension H_i.
2. F = [V_SF, PK_H] is the fact table, where V_SF is a set of measurements and PK_H is the set of primary keys PK_i of the dimensions H_i.
The dimension hierarchy in the snowflake schema is explicitly represented by the normalized dimension tables (tables H_1, H_3 in Fig. 4), in which data and relations are divided into additional tables (tables h_10, h_11, h_12 ∈ H_1, and {h_30, h_31, h_32, relation table r_30-H_3/31} ∈ H_3). Between the attributes included in the same dimension H_i of the snowflake schema there are relations which determine the logical connections of attributes within the confines of one dimension. These relations are transitive within the confines of one hierarchy and have their own multiplicity. As mentioned before, the current analysis of the research and the results of experiments on the cascaded star schema for the DSDW(t) justify the creation of a new DSDW schema called the Extended Cascaded Star ES^C_{P-t}, where P is the schema's degree, described by the type and kind of the substars and by their multi-version, and t is the number of themes simultaneously described. The most recent experimental logical model of the SDW(t) system is represented by the extended cascaded star schema of II-nd degree ES^C_{II-1}SDW(t), oriented on measurements (Fig. 5). The presented schema ES^C_{II-1}SDW(t) is the S^C(SDW(t)) schema extended with two components. The first expansion of the S^C(SDW(t)) schema concerns the use of hierarchical dimensions. According to the definition in [11], the S^C schema allows only the existence of flat dimensions and cascaded dimensions. Hierarchical dimensions, described in Def. 3, are used in the snowflake schema. In the case of the ES^C_{II-1}SDW(t) schema the hierarchical dimensions are: the time dimension (e.g. measurement time, meter installation time, map expiry date, etc.) and the dimension defining the client's address. The second expansion of the S^C(SDW(t)) schema concerns the way fact tables are linked with cascaded dimensions. In the cascaded star, the key of the star and the function g connected with this key are used, according to the definition presented by Yu et al. [11].
Fig. 4. Logical model of the snowflake schema (diagram not reproduced; fact table F with dimensions H_1-H_4, where H_1 = {h_10, h_11, h_12} and H_3 = {h_30, h_31, h_32} with relation table r_30-H_3/31)
Fig. 5. ES^C_{II-1}SDW(t) schema oriented on measurement (diagram not reproduced; it links the MEASURE, METER, CLIENT and NODE fact tables with dimensions such as METER_MODEL, NODE_MODEL, meter/node locations, maps and measurement zones, and the hierarchical dimensions H_TIME (DAY-MONTH-QUARTER-YEAR) and H_ADDRESS (STREET-QUARTER-CITY-PROVINCE-COUNTRY))
Yet, in order to avoid complications connected with the definition of the g function, surrogate keys in the fact tables are generated: for every record in the fact table a consecutive number is generated. The disadvantage of such an approach is the necessity of defining an additional column for storing the key in the fact table (this disadvantage also concerns the approach that uses the star's key). However, using surrogate primary keys avoids the problems connected with defining the g function, and most recent relational database systems allow sequences to be defined. In Definition 4, the schema of the extended cascaded star of II-nd degree, single-theme (t = 1), is described.
Definition 4. Extended Cascaded Star of II-nd degree for t = 1. The schema of the extended cascaded star of II-nd degree for t = 1 is defined as ES^C_{II-1} = (SS, H^C, T^C, L^C), where:
1. SS is a set of cascaded dimensions - data schemas, such that S_i ∈ SS is (a) a single star schema S^S, or (b) a snowflake schema S^F, or (c) an ES^C_{II-1} schema.
2. H^C is a set of hierarchical dimensions of ES^C. Every hierarchical dimension H_i ∈ H^C has a given degree L_i, defined as the number of hierarchy levels in the S^F schema.
3. T^C = [V^C, PK_SR, PK_H] is the cascaded fact table, where V^C is a set of central fact measures, V^C = V_SF ∪ V_S. PK_SR = {SR_i} is a set of surrogate keys for the cascaded dimensions S_i | S_i ∈ SS, and PK_H = {PK_i} is a set of primary keys of the hierarchical dimensions H_i | H_i ∈ H^C. The primary key PK_i of dimension H_i is the primary key of table h_0.
The ES^C_{II-1} schema can contain nested star schemas as well as nested snowflake schemas. This expansion allows the normalization of dimension tables to be carried out.
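The sketch below shows one possible in-memory representation of Definition 4. The class and field names are my own illustrative choices, and the L^C component, which is not elaborated in the text above, is omitted.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class FlatDimension:                 # D_i = [A_i, IK_i, G_i] from Definition 2
    attributes: List[str]
    surrogate_keys: List[str]
    granulation_levels: List[str]

@dataclass
class HierarchicalDimension:         # H_i from Definition 3 with degree L_i
    level_tables: List[str]          # h_0 (lowest level) ... h_{L_i-1} (highest level)

@dataclass
class SingleStar:                    # S^S = (D, F)
    dimensions: List[FlatDimension]
    fact_table: str

@dataclass
class Snowflake:                     # S^F = (H, R, F)
    hierarchies: List[HierarchicalDimension]
    nm_relation_tables: List[str]    # relation tables for N:M multiplicities
    fact_table: str

@dataclass
class ExtendedCascadedStar:          # ES^C_{II-1} = (SS, H^C, T^C, ...)
    cascaded_dimensions: List[Union[SingleStar, Snowflake, "ExtendedCascadedStar"]]
    hierarchical_dimensions: List[HierarchicalDimension]
    cascaded_fact_table: str         # T^C, keyed by surrogate keys SR_i and primary keys PK_i
```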
4
ECOLAP Operations for Extended Cascaded Star Schema
Using the new structure ES^C_{P-t} in OLAP analysis requires defining appropriate operations to aid this analysis. Below, the Extended Cascaded OLAP (ECOLAP) operations are defined for the extended cascaded star with the ES^C_{II-1} schema.
Hierarchical Join is an operation that performs joins for a given hierarchy. The join is performed from the lowest level in the hierarchy up to the level predetermined by the second parameter of the operation. This operation is defined in the following way: Hierarchical_Join(H_i, h_i) = π_RH(σ(h = C)(h_1 ⋈ ... ⋈ h_i)), where the parameter H_i defines the hierarchy from the set of hierarchies of the given snowflake schema and the parameter h_i is the table up to which the join should be performed.
Schema Hierarchical Join (SHJ) is an operation which receives as parameters a set of hierarchical dimensions and a set of tables up to which the join should be performed. SHJ : H × I → N; SHJ(H_i, h_i) = Hierarchical_Join(H_i, h_i), where H = {H_i} is the set of hierarchical dimensions, I = {h_i} is the set of tables which, for a given hierarchy H_i, define the border level h_i of the join within the confines of this hierarchy, and N is the set of hierarchical dimensions after the joins are performed.
eTraverse (extended Traverse) allows the basic OLAP operations to be performed within the confines of a single dimension S^S or S^F for a certain star level M = k, k > 0. It consists of relational operators (SPJ) on simple attributes and SHJ operations for hierarchical dimensions. Using relational algebra this operation can be written as: eTraverse(S, I) = π_RS(σ(d = C)(S ⋈ D ⋈ SHJ(H, I))), where S is an S^S or S^F schema, D is its set of flat dimensions, H is its set of hierarchical dimensions, I is a set of tables which, for a given hierarchy H_i ∈ H, define the border level h_i of the join within the confines of this hierarchy, d is a set of attributes of D and H, C is a set of selection conditions, and R_S is the set of projection attributes of S.
eDecompose (extended Decompose) can be expressed, using relational algebra, as: eDecompose(Q, P, I) = π_RQ(σ(d = C)(Q ⋈ eTraverse(P, I))), where P is an S^S or S^F schema containing a set of hierarchical dimensions H and is a single dimension of Q, which is an extended cascaded star schema ES^C_{II-1}; I is a set of tables which, for a given hierarchy H_i ∈ H, define the border level h_i of the join within the confines of this hierarchy; d is a set of attributes of P; C is a set of selection conditions; and R_Q is the set of projection attributes within Q.
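As a toy illustration of the Hierarchical Join operation defined above, the sketch below joins the tables of one hierarchy, represented as lists of dicts, from the lowest level up to a chosen border level. The convention that each level links to its parent through a single shared key column is an assumption made for the example.

```python
def join_on(left_rows, right_rows, key):
    # Equi-join of two tables (lists of dicts) on a shared key column.
    index = {}
    for r in right_rows:
        index.setdefault(r[key], []).append(r)
    return [{**r, **l} for l in left_rows for r in index.get(l[key], [])]

def hierarchical_join(levels, border, link_keys):
    # levels: tables h_0 (lowest) ... h_{L-1} (highest) of one hierarchy H_i;
    # link_keys[j] is the foreign-key column joining level j to level j+1;
    # 'border' is how many levels above h_0 the join should reach.
    result = levels[0]
    for j in range(border):
        result = join_on(result, levels[j + 1], link_keys[j])
    return result

# Example with the DATE hierarchy of Fig. 3 (column names assumed):
# hierarchical_join([DAY, MONTH, QUARTER, YEAR], 3, ["month_id", "quarter_id", "year_id"])
```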
eJump (extended Jump). The operation of joining two stars between which there is no direct relation results in a set which is the Cartesian product
of both stars. Such a set would contain a huge number of records. For this reason the definition of this operation should be understood as a join of two stars that uses additional fact tables containing the appropriate relations. The eJump operation can be expressed as: eJump(S_i, S_j, I_i, I_j) = π_Ri,j(eTraverse(S_i, I_i) ⋈ eTraverse(S_j, I_j)), where S_i and S_j indicate the source and destination S^S or S^F schemas, containing the sets of hierarchical dimensions H_Si and H_Sj, R_i,j is the set of projection attributes from S_i and S_j, and I_i and I_j are sets of tables which, for given hierarchies H_i ∈ H_Si and H_j ∈ H_Sj, define the border levels h_i and h_j of the join.
eCascaded-roll-up. The roll-up operation decreases the number of presented data details; as a result some dimensions will be omitted. The starting point is a single star S^S or a snowflake S^F. Performing the eTraverse operation, a set of attributes of the single schema is obtained. Then there is a move to a lower level of detail, to the parent star, which is an ES^C_{II-1} schema. Using relational algebra this process can be written as: eCascaded_roll_up(Q, P, I_q, I_p) = eDecompose(Q, eTraverse(P, I_p), I_q), where Q is the ES^C_{II-1} schema, P is a single star schema S^S or a snowflake schema S^F containing a set of hierarchical dimensions H, and I_p and I_q are sets of tables which, for given hierarchies H_p ∈ H and H_q ∈ H, define the border levels h_p and h_q of the join within the confines of these hierarchies.
eCascaded-drill-down. This operation is the opposite of eCascaded-roll-up. It increases the number of details of the presented data, and in some particular cases the eCascaded-drill-down operation causes new dimensions to appear. Starting from the central star and performing consecutive eDecompose operations, a single star level is reached, where the eTraverse operation is performed. The notation in relational algebra is as follows: eCascaded_drill_down(Q, P, I_pq, I_q) = eTraverse(eDecompose(Q, P, I_pq), I_q), where Q is the ES^C_{II-1} schema, P is a single star schema S^S or a snowflake schema S^F containing a set of hierarchical dimensions H, and I_pq and I_q are sets of tables which, for given hierarchies H_pq ∈ H and H_q ∈ H, define the border levels h_pq and h_q of the join within the confines of these hierarchies.
eCascaded-slice and eCascaded-dice. These operations reduce the number of presented elements of a multidimensional cube by setting additional selection conditions on given attributes of the substars (for the eCascaded-slice operation these are attributes of a single dimension - a star or a snowflake) [5]. The multiple cubes operation (MCUBE) performs recursive calculations of type CUBE on relations such as base tables or materialized views [6].
5
Conclusions
In this paper a complex data modeling schema was presented: the extended cascaded star for distributed spatial data warehouses, defined with reference to the single star schema and the snowflake schema. An example of such a system is the spatial
telemetric data warehouse DSDW(t). The basic cascaded ECOLAP operations (drill-down, roll-up, slice & dice and MCUBE) were also defined for the extended cascaded star schema; they allow data to be analyzed using the new data model ES^C_{II-1}. Future work will focus on defining ECOLAP operations for other schemas of the extended cascaded star and on expanding the set of operations with further elements, taking data mining into consideration [6].
References
1. Atluri, V., Adam, N., Yu, S., Yesha, Y.: Efficient storage and management of environmental information. In: 19th IEEE Symposium on Mass Storage Systems, Maryland, USA (2002)
2. Bédard, Y., Merret, T., Han, J.: Fundamentals of Spatial Data Warehousing for Geographic Knowledge Discovery. In: Geographic Data Mining and Knowledge Discovery, Research Monographs in GIS, ch. 3, pp. 53–73. Taylor & Francis, London (2001)
3. Gorawski, M., Gorawski, M.J.: Modified R-MVB Tree and BTV Algorithm Used in a Distributed Spatio-temporal Data Warehouse. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Wasniewski, J. (eds.) PPAM 2007. LNCS, vol. 4967, pp. 199–208. Springer, Heidelberg (2008)
4. Gorawski, M., Malczok, R.: Materialized aR-tree in Distributed Spatial Data Warehouse. International Journal Intelligent Data Analysis 10(4), 361–377 (2006)
5. Gorawski, M.: Extended Cascaded Star Schema and ECOLAP Operations for Spatial Data Warehouse. In: Corchado, E., Yin, H. (eds.) IDEAL 2009. LNCS, vol. 5788, pp. 251–259. Springer, Heidelberg (2009)
6. Gorawski, M.: Advanced Data Warehouse. Dissertation, Studia Informatica, vol. 30(3B), Pub. House of Silesian Univ. of Technology, p. 338 (2009)
7. Han, J., Stefanovic, N., Kopersky, K.: Selective Materialization: An Efficient Method for Spatial Data Cube Construction. In: Wu, X., Kotagiri, R., Korb, K.B. (eds.) PAKDD 1998. LNCS, vol. 1394. Springer, Heidelberg (1998)
8. Jensen, C., Kligys, A., Pedersen, T., Timko, I.: Multidimensional Data Modeling for Location-based Services. VLDB Journal 13, 1–21 (2004)
9. Stefanovic, N., Han, J., Koperski, K.: Object-based selective materialization for efficient implementation of spatial data cubes. In: IEEE TKDE, pp. 938–958 (2000)
10. Timko, I., Pedersen, T.: Capturing Complex Multidimensional Data in Location-based Warehouses. In: Proc. of CAN GIS, LNCS. Springer, Heidelberg (2004)
11. Yu, S., Atluri, V., Adam, N.: Cascaded star: A hyper-dimensional model for a data warehouse. In: Bressan, S., Küng, J., Wagner, R. (eds.) DEXA 2006. LNCS, vol. 4080, pp. 439–448. Springer, Heidelberg (2006)
Parallel Longest Increasing Subsequences in Scalable Time and Memory Peter Krusche and Alexander Tiskin DIMAP and Department of Computer Science, The University of Warwick, Coventry, CV4 7AL, UK
Abstract. The longest increasing subsequence (LIS) problem is a classical problem in theoretical computer science and mathematics. Most existing parallel algorithms for this problem have very restrictive slackness conditions which prevent scalability to large numbers of processors. Other algorithms are scalable, but not work-optimal w.r.t. the fastest sequential algorithm for the LIS problem, which runs in time O(n log n) for n numbers in the comparison-based model. In this paper, we propose a new parallel algorithm for the LIS problem. Our algorithm solves the more general problem of semi-local comparison of permutation strings of length n in time O(n^1.5/p) on p processors, has scalable communication cost of O(n/√p) and is synchronisation-efficient. Furthermore, we achieve scalable memory cost, requiring O(n/√p) of storage on each processor. When applied to LIS computation, this algorithm is superior to previous approaches since computation, communication, and memory costs are all scalable.
1
Introduction
The longest increasing subsequence (LIS) problem is a classical problem in theoretical computer science and mathematics. We are given a sequence of n numbers, and we are asked to find the LIS. The problem is closely related to patience sorting, and has been studied extensively, beginning as early as [1] (see also e.g. [2] and references therein). In the patience sorting problem, we are looking for a cover of a given sequence of n integers which consists of a minimal number of decreasing subsequences. The number of decreasing subsequences of such a minimal cover is equal to the length of the LIS. For sequences of length n, the fastest sequential algorithm in the comparison-based model (which only allows less-than or equal comparison on the sequence elements) for patience sorting/LIS computation runs in time O(n log n) (see footnote 1 below). A few parallel algorithms have been proposed for the LIS problem; however, these algorithms either do not achieve work-optimality w.r.t. the fastest sequential running time, or have rather restrictive slackness conditions. We would first
(Acknowledgement footnote) Research supported by the Centre for Discrete Mathematics and Its Applications (DIMAP), University of Warwick, EPSRC award EP/D063191/1.
(Footnote 1) It is possible to achieve O(n log log n) in the integer arithmetic model. In this paper, we work in the comparison-based model.
like to point out that the problem is trivially in the complexity class NC, as it is an instance of the shortest path problem in a grid dag, which is in NC by reduction to (min,+) matrix multiplication [3]. However, the parallel algorithm resulting from this reduction is far from being work-optimal. A better generic approach is to reduce the problem to computing the longest common subsequence (LCS) of two strings of length n (see e.g. [4]). However, this is clearly not work-optimal as it gives cost O(n^2/p) on p processors. On the EREW PRAM model with p processors, Nakashima and Fujiwara [5] showed that the problem can be solved using computation O((n log n)/p), but only if p < n/k^2 where k is the length of the LIS. This algorithm becomes asymptotically sequential once k > √n. Any sequence of n numbers must have either a monotonically increasing or a monotonically decreasing subsequence of minimum length √n [1]. Therefore, for any sequence of numbers, the condition p < n/k^2 definitely inhibits parallelism either for running the algorithm on the sequence itself, or on its reversal. Semé [6] gave a CGM algorithm that runs in O(n log(n/p)). Asymptotically, this is not faster than the sequential algorithm because the number of processors only appears in a lower-order polynomial term. We consider the LIS problem as a special case of computing the LCS of two permutation strings, which can be solved sequentially in time O(n log n) in the comparison-based model [7]. While multiple work-optimal and scalable algorithms for the general LCS problem exist (see footnote 2 below), it remains open to obtain a similar result for the LIS problem. It is remarkable that no such result is known, considering how well-studied the LIS problem is in most other aspects. The contribution of this paper is a parallel algorithm which extends the algorithm from [10] to be applicable to semi-local comparison of permutation strings, which as a special case also solves the LIS problem. For permutation strings of length n, the fastest sequential algorithm for semi-local comparison runs in O(n^1.5) [11]. For comparing two arbitrary permutations of size n semi-locally on p processors, we obtain computation cost O(n^1.5/p), scalable communication cost O(n/√p), and scalable memory cost O(n/√p). Although this is not work-optimal for the LIS problem, since the fastest sequential algorithm for this problem runs in O(n log n) = o(n^1.5), using our algorithm can be advantageous since it is the asymptotically best parallel algorithm which actually achieves some scalability on this problem.
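For reference, the classical O(n log n) comparison-based sequential computation of the LIS length via patience sorting can be sketched as follows; since the paper works with permutations, all elements are distinct and the strictly-increasing variant below applies. This is textbook code illustrating the bound quoted above, not an algorithm from the paper.

```python
from bisect import bisect_left

def lis_length(sequence):
    # tops[i] is the top card of pile i in patience sorting; the tops stay sorted.
    tops = []
    for x in sequence:
        i = bisect_left(tops, x)   # leftmost pile whose top is >= x
        if i == len(tops):
            tops.append(x)         # x starts a new pile
        else:
            tops[i] = x            # x is placed on pile i
    # The number of piles equals the length of a longest increasing subsequence.
    return len(tops)
```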
2
Definitions and Problem Analysis
Let x = x_1 x_2 ... x_n and y = y_1 y_2 ... y_n be two strings of length n over an alphabet Σ. In permutation string comparison, we have |x| = |y| = |Σ| = n, and x and y each contain exactly one instance of every character. A substring of any string x can be obtained by removing zero or more characters from the beginning and/or the end of x. A subsequence of string x is any string that can be obtained by deleting zero or more characters, i.e. a string with characters x_{j_k},
(Footnote 2) The problem can be solved by combining the dynamic programming approach from [8] with the BSP algorithm for grid dag computation from [9].
where j_k < j_{k+1} and 1 ≤ j_k ≤ m for all k. The longest common subsequence (LCS) of two strings is the longest string that is a subsequence of both input strings. The LIS problem for a string x can be solved by observing that any LCS of x and the sequence of all characters from Σ in ascending order is a LIS of x. We note that we are only interested in the length of the LCS/LIS here, but that the techniques used also allow retrieval of the actual sequence. Throughout this paper we will denote the set of integers {i, i+1, ..., j} by [i : j], and the set {i + 1/2, i + 3/2, ..., j − 1/2} of odd half-integers by <i : j>. Odd half-integer variables are marked using a ^ symbol. When indexing a matrix M by odd half-integer values î and ĵ, we define M(î, ĵ) = M(i, j) with i = î − 1/2 and j = ĵ − 1/2. Therefore, if a matrix has integer indices [1 : m] × [1 : n], it has odd half-integer indices <0 : m> × <0 : n>. Further, we define the distribution matrix D^Σ of matrix D as D^Σ(i, j) = Σ_{(î,ĵ) ∈ <i:m> × <0:j>} D(î, ĵ). The standard dynamic programming LCS algorithm compares all prefixes of one string to all prefixes of the other string [8]. Semi-local string comparison is an alternative to this standard method. Solutions to the semi-local LCS problem are given by a highest-score matrix A (see footnote 3 below), which we define as A(i, j) = |LCS(x, y_i ... y_j)|. Since the values A(i, j) for different i and j are strongly correlated, it is possible to derive an implicit, space-efficient representation of matrix A(i, j):
Lemma 1 ([11]). Let A be the highest-score matrix for comparing two permutation strings x and y of length n. The corresponding implicit highest-score matrix D_A is a permutation matrix of which only a core of 2n nonzeros needs to be stored, and which can be computed in O(n^1.5) time. We have: A(i, j) = j − i − D^Σ_A(i, j).
Tiskin [13] further described a procedure which, given two implicit highest-score matrices which correspond to comparing two strings x and y to the same string z, computes the implicit highest-score matrix for comparing their concatenation xy and z. This procedure can be reduced to the multiplication of two integer matrices in the (min,+) semiring. We will describe this method and its parallelisation briefly in the next section. For analysing parallel algorithms, we use the BSP model introduced by Valiant [14]. A BSP computation runs on p identical processors, and its execution is divided into S supersteps that each consist of local computation and a communication phase. At the end of each superstep, the processes perform a barrier-style synchronization. A BSP computation's efficiency on p processors is characterized by its computation cost W, which gives the total time spent on computation, and its communication cost H, which gives the maximum total number of data elements that need to be transferred between any two processors throughout the computation. We will further consider memory cost M, which is the maximum amount of storage necessary on a processor across all supersteps. Consider a problem of size n which has input/output size I(n). We assume that a given sequential algorithm for this problem runs in time W(n) and memory
(Footnote 3) These highest-score matrices are related to the DIST matrices used e.g. in [12].
M(n). A BSP algorithm solving the problem is work-optimal relative to the given sequential algorithm if W(n, p) = O(W(n)/p). Let c > 0 be a constant. The algorithm achieves scalable communication if H(n, p) = O(I(n)/p^c). The algorithm achieves scalable memory if M(n, p) = O(M(n)/p^c).
3
Highest-Score Matrix Multiplication
We will now outline the matrix multiplication algorithm from [13] and the parallel algorithm from [10] as far as is necessary to understand the new method in the next section. As input, we have the nonzeros of two N × N permutation matrices D_A and D_B. The procedure computes an N × N permutation matrix D_C such that the permutation distribution matrix D^Σ_C is the (min,+) product of the permutation distribution matrices D^Σ_A and D^Σ_B: D^Σ_C(i, k) = min_j (D^Σ_A(i, j) + D^Σ_B(j, k)) with i, j, k ∈ [1 : N]. Matrices D_A, D_B and D_C have half-integer indices ranging over <0 : N>. The values D^Σ_A(i, j) + D^Σ_B(j, k) are called elementary (min,+) products. The matrix multiplication method uses a divide-and-conquer approach that recursively partitions the output matrix D_C into smaller blocks in order to locate its nonzeros. We compute the number of nonzeros contained in each block. The base of the recursion has two cases: either the block does not contain any nonzeros, or it is of size 1 × 1 and thus specifies the location of a nonzero. A rectangular submatrix of a permutation matrix D described by an interval I of indices in D will be called a D-block, and we denote the number of nonzeros contained in such a block by D(I) = Σ_{(î,ĵ) ∈ I} D(î, ĵ). Consider the square D_C-block with index interval I = <i_0 − h : i_0> × <k_0 : k_0 + h>. To compute the number of nonzeros contained within this block, we need the values D^Σ_C(i_0, k_0), D^Σ_C(i_0 − h, k_0), D^Σ_C(i_0, k_0 + h), and D^Σ_C(i_0 − h, k_0 + h) of the distribution matrix D^Σ_C at the four corners of the block, since D(I) = D^Σ(i_0 − h, k_0 + h) − D^Σ(i_0, k_0 + h) − D^Σ(i_0 − h, k_0) + D^Σ(i_0, k_0). Furthermore, the differences of the D^Σ_C values within our D_C-block from D^Σ_C(i_0, k_0) are determined only by sets of relevant nonzeros in D_A and D_B. The relevant nonzeros for a D_C-block corresponding to indices in <i_0 − h : i_0> × <k_0 : k_0 + h> are located in the D_A-block with indices R_A = <i_0 − h : i_0> × <0 : N>, and in the D_B-block with indices R_B = <0 : N> × <k_0 : k_0 + h>. For carrying out our matrix product using only the density matrices D_A and D_B, we split these blocks at position j to obtain only the nonzeros relevant for the (min,+) products at the lower left corner of the block: R_A(j) = <i_0 − h : i_0> × <0 : j> and R_B(j) = <j : N> × <k_0 : k_0 + h>. As the block size decreases, fewer nonzeros are relevant for a block. We can therefore represent all data necessary to independently locate the nonzeros in our D_C-block in O(h) space. For each D_C-block, we need to know the relevant nonzeros and the sequence of minima M_I(d) = min_{j ∈ J(d)} (D^Σ_A(i_0, j) + D^Σ_B(j, k_0)), where J(d) = {j | D(R_A(j)) − D(R_B(j)) = d}. The sequence M stores the locally minimal elementary (min,+) products in the intervals of consecutive j-values where the number of relevant nonzeros does not change. For our D_C-block, M_I
can be stored in O(h) space. Further, for a D_C-subblock with index interval I' ⊂ I, we can compute M_I' and the relevant nonzeros in time O(h). We use this method to set up the input data for the recursive calls when partitioning the D_C-blocks. This recursive method can be parallelised as follows (see [10] for more details). The sequential recursive divide-and-conquer computation has at most p independent problems at level (1/2) log_2 p, since it uses a quadtree partitioning. In the parallel version of the algorithm, we start the recursion directly at this level. We assign a different D_C-block of size N/√p × N/√p with index interval I_q = <i_q − N/√p : i_q> × <k_q : k_q + N/√p> to each processor q, and compute the relevant nonzeros and sequences M_Iq from scratch. To compute the values of sequence M_Iq for every processor q, we must compute the elementary (min,+) products D^Σ_A(i_q, j) + D^Σ_B(j, k_q) with j ∈ [0 : N]. We assume w.l.o.g. that √p is an integer, and that every processor has a unique identifier q with 0 ≤ q < p. Furthermore, we assume that all parameters match, such that e.g. N/p is an integer. The initial distribution of the nonzeros of the input matrices is assumed to be even among all processors, so that every processor holds N/p nonzeros of D_A and D_B.
Lemma 2 ([10]). Given the distributed implicit highest-score matrices D_A and D_B, all the p · N elementary (min,+) products can be computed on a BSP computer in W = O(N/√p), H = O(p + N/p) [= O(N/p) if N > p^2], and S = O(1).
Lemma 3 ([10]). If every processor holds N/p elementary (min,+) products, as well as at most N/p nonzeros in D_A and the same number of nonzeros in D_B, we can redistribute data such that each processor q holds the sequence M_Iq and the relevant nonzeros for block I_q in one superstep using communication H = O(N/√p).
These two lemmas can be combined to obtain a first result for the BSP complexity of parallel highest-score matrix multiplication [10]. We can compute the elementary (min,+) products using Lemma 2, and redistribute them efficiently using Lemma 3. The resulting parallel highest-score matrix multiplication procedure has BSP cost W = O((N/√p)^1.5) = O(N^1.5/p^0.75), H = O(N/√p), and S = O(1).
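The four-corner counting that drives the recursion can be illustrated with ordinary 0-based integer indices instead of the odd half-integer convention. In this hedged sketch the permutation matrix is given as an array perm with perm[r] = c meaning a nonzero in row r, column c, and dist plays the role of the distribution matrix D^Σ.

```python
def dist(perm, i, j):
    # Distribution value: number of nonzeros (r, c) with row r >= i and column c < j.
    return sum(1 for r, c in enumerate(perm) if r >= i and c < j)

def block_nonzeros(perm, i0, k0, h):
    # Nonzeros in the h x h block covering rows [i0, i0+h) and columns [k0, k0+h),
    # obtained by inclusion-exclusion from four corner values of the distribution matrix.
    return (dist(perm, i0, k0 + h) - dist(perm, i0 + h, k0 + h)
            - dist(perm, i0, k0) + dist(perm, i0 + h, k0))
```

In the real algorithm these counts are of course not recomputed from scratch; they are maintained incrementally from the relevant nonzeros, which is what keeps each block's data within O(h) space.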
4
Parallel Semi-local Comparison of Permutation Strings
The highest-score matrix multiplication procedure shown in the previous section is not work-optimal: in the denominator of the work cost we have p^0.75 instead of p, which is required for work-optimality. Because not every block at level (1/2) log_2 p may contain the same number of nonzeros, a load imbalance can occur. In the worst case, only √p blocks contain nonzeros, and therefore only √p blocks of size N/√p are processed in parallel. We will now show a load-balancing strategy which achieves work-optimality while retaining the property of scalable communication, and therefore allows the procedure from the previous section to be used for permutation string comparison. We first prove a small extension to the theorems by Tiskin [13].
Lemma 4. Consider a D_C-block of size h × h in which we would like to locate k nonzeros. Assume we are given the sequence M_I for this block and the relevant nonzeros in D_A and D_B. On a single processor, the nonzeros in the D_C-block can be located in O(h√k) time.
Proof. When partitioning the D_C-block recursively, the number of subblocks can increase geometrically only up to level (1/2) log k. At this level, only k blocks can contain nonzeros from the subset we are looking for. Therefore, k independent subproblems exist, each of which requires time O(h/√k) in order to be partitioned further. Since D_C is a permutation matrix, the size of the k subproblems must decrease geometrically from this level on. Therefore, the running time is dominated by level (1/2) log k, and the cost in the worst case is bounded by O(h/√k · k) = O(h√k).
We apply Lemma 4 to have each processor locate nonzeros contained within a different set of leaves of the partitioning quadtree. One way to implement this is to pick subsets of nonzeros based on the order in which the sequential algorithm discovers the nonzeros. When partitioning, we then know for each subblock how many nonzeros the sequential algorithm would have discovered in this block, and we can therefore partition only those blocks which contain nonzeros in the subset we are looking for. After the pre-processing phase, in which we obtain the relevant nonzeros and minima M_I in each of the p D_C-blocks, we introduce another step to balance the work of locating nonzeros.
Lemma 5. Highest-score matrix multiplication of two N × N implicit highest-score matrices on p processors has BSP cost W = O(N^1.5/p), H = O(N/√p), and S = O(1).
Proof. Each processor holds data of size O(N/√p) to process a D_C-block of size N/√p × N/√p. If this block contains more than N/p nonzeros, it cannot be processed work-optimally by a single processor. Therefore, we partition all blocks into zero or more complete groups with exactly N/p nonzeros, and possibly one incomplete group which contains fewer than N/p nonzeros. If the block contains at most N/p nonzeros, then we have a single group in this block. We broadcast the data for the block to enough processors such that each processor has to locate exactly one complete group of nonzeros. This is possible with BSP cost H = O(N/√p) and S = O(1) using a two-phase broadcast as shown in [15] (pp. 66–67). Each processor then only locates the N/p nonzeros in its respective group, and possibly the nonzeros in one incomplete group if one existed for its block. The work for locating N/p nonzeros in a block of size N/√p is O(N^1.5/p) due to Lemma 4. Moreover, the maximum number of complete groups is p, since there are only N nonzeros in total. Furthermore, at most one incomplete group per processor can exist, which limits the number of such groups to p. The maximum total number of separate groups of nonzeros is therefore 2p. Each of these groups can be processed in time O(N^1.5/p) on a single processor, which gives the claimed work-optimal bound.
Tiskin [11] showed that we can solve the semi-local permutation string comparison problem sequentially in O(n^1.5) time by iterated application of the semi-local matrix multiplication procedure shown in the previous sections (see [16] for the details). We will now show a parallel algorithm for semi-local comparison of permutation strings.
Theorem 1. The semi-local LCS problem for permutation strings of length n can be solved on a BSP computer using local computation W = O(n^1.5/p), communication H = O(n/√p) and S = O(log p) supersteps.
Proof. We partition one of the input strings into substrings of length n/p, and compute the highest-score matrix for each of these substrings compared to the other input string in time O((n/p)^1.5) = O(n^1.5/p^1.5). This is possible because the inputs are permutation strings, and therefore only n/p character matches exist for each substring. Furthermore, if we only store non-trivial nonzeros, i.e. those outside the main diagonal, the space necessary for representing the resulting implicit highest-score matrix reduces to O(n/p) (see [16]). We then start merging the highest-score matrices in parallel using the work-optimal parallel highest-score matrix multiplication algorithm. Consider level l ∈ [0 : log p] of the merging process, and let r = 2^l. We again use the fact that the resulting highest-score matrix for comparing a permutation string and a permutation substring of length 2rn/p can be stored in O(2rn/p) space. Furthermore, each merge is performed by a group of 2r processors. By taking the sum over all log p levels of the merging process, we get computation cost W = O(Σ_r (2rn/p)^1.5/(2r)) = O(n^1.5/p) and communication cost H = O(Σ_r (2rn/p)/√(2r)) = O(n/√p). This analysis includes the top level of the merging tree, where r = p. The number of supersteps is S = O(log p), as the merging tree contains log p levels.
Theorem 2. The semi-local LCS problem for permutation strings of length n can be solved using scalable memory cost M = O(n/√p) on each processor.
Proof. Because H = O(n/√p), the memory cost is lower-bounded by O(n/√p). The recursive partitioning procedure only requires memory O(n/√p) for blocks of size n/√p × n/√p because each new level of the recursion can re-use the memory used by the data for the previous level. It remains to show that we only need to store a fraction of size n/p of both input strings on each processor. This is obviously the case for input string x, which is partitioned into disjoint substrings of length n/p. However, we must still show that it is not necessary for each processor to store the entire other input string y. Initially, each processor q ∈ [1 : p] holds substrings x̄_q = x_{(q−1)·n/p+1} ... x_{q·n/p} and ȳ_q = y_{(q−1)·n/p+1} ... y_{q·n/p}. In order to determine the position of each character from x̄_q in y, we need to find the sorting permutation of y, after which each processor can retrieve the positions of all characters in its fraction x̄_q of x. In the comparison-based model, this is possible in BSP cost W = O((n log n)/p), H = O(n/p) and S = O(1) supersteps, e.g. using parallel sorting by regular sampling (see [17,18]). The computation
and communication costs for this sorting step are dominated by the costs of the iterated highest-score matrix multiplication procedure shown in Theorem 1. After sorting, we can redistribute the data using communication O(n/p) such that each processor holds all matches for matching its substring of x against y. At no point in this initial sorting and redistribution procedure do we need to store more than O(n/p) elements of data on any processor. Therefore, the memory cost is dominated by the cost of carrying out the recursion on the n/√p × n/√p-sized blocks of the output highest-score matrix, which is O(n/√p).
We have now shown that our algorithm is scalable in computation, communication and memory for semi-local permutation string comparison. However, the question remains what speedup we can achieve for LIS computation. For estimating this speedup, consider the following simple performance model. We compute the LIS of a sequence with n elements using p processors. We fix a constant g which specifies how long data transmission takes in comparison to computation (see footnote 4 below). We consider two different input data distributions. First, we assume that each processor initially holds an equal fraction of the sequence, which could e.g. be the case if the sequence is the result of a previous parallel computation. In the other distribution, the entire sequence is known to every processor before starting the computation. Let r be the record size which specifies the size of each element in the input sequence. If the input sequence is distributed in equal fractions, we have to account for b_seq = rn data elements which need to be distributed to all processors first. In the parallel algorithm, we only need to compensate for having a distributed input by adding a term b_par = r · n/p, as each processor only needs an n/p-sized fraction of the input. If the input sequence is known to all processors in advance, we have b_seq = b_par = 0. The running time for the sequential computation is then given by w_seq = n log n + g·b_seq (neglecting constant factors, as we are only looking at the performance ratio compared to the parallel algorithm). The running time of the parallel computation using p processors is w_par(p) = 2^1.5 · n^1.5/p + g · (p + n/√p + b_par) (see footnote 5 below). In our estimation, we assume g = 10, such that transmitting one element of data takes ten times as long as one step of the LIS computation (i.e. we assume that the network connecting the processors is very fast), and further that r = 2 (i.e. each sequence element requires the same amount of memory as two integers). Figures 1 and 2 show the speedup w_seq/w_par(p) for various problem sizes obtained using this estimation. We see that speedup can be achieved under these conditions. This speedup is below the ideal linear speedup since our algorithm is restricted by the fact that the communication cost does not scale optimally, but only by a factor of 1/√p. Also, the number of processors must be bounded from below by Ω(√n/log n) in order to achieve speedup over the fast sequential algorithm. In order to achieve speedup higher than √p, a fast communication network is required. If the communication network is too slow, the communication cost of
(Footnote 4) In the standard BSP model [14], this parameter is called the inverse bandwidth or communication gap.
(Footnote 5) The factor of 2^1.5 comes from the fact that the largest highest-score matrices multiplied in our computation are of size 2n × 2n.
Fig. 1. Speedup estimation if the sequence is distributed in equal fractions (plot not reproduced: estimated speedup over the sequential algorithm vs. sequence length from 1000 to 1,000,000 elements, for p = 4, 16, 32 and 64 processors)
Fig. 2. Speedup estimation if the sequence is known to all processors initially (plot not reproduced: estimated speedup over the sequential algorithm vs. sequence length from 1000 to 1,000,000 elements, for p = 4, 16, 32 and 64 processors)
the highest-score matrix multiplication algorithm becomes dominant, particularly since it is necessary to use many processors to achieve speedup over sequential computation for long sequences. Improving the method to achieve perfectly scalable communication as well as computation would therefore greatly improve its overall scalability. We would like to point out that these speedup estimations only serve the purpose of evaluating the practical potential of a stronger theoretical result. Our main aim is to show a first approach to parallel LIS computation which is scalable, and to point out that, despite the LIS problem being well-studied, no algorithms had been proposed so far which are scalable and can compete with the fastest sequential method.
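The performance model of the previous paragraphs translates directly into a few lines of code; this sketch only reproduces the estimation behind Figs. 1 and 2 (with g = 10 and r = 2 as in the text, and base-2 logarithms assumed), it is not a measurement.

```python
from math import log2, sqrt

def estimated_speedup(n, p, g=10.0, r=2.0, input_replicated=False):
    # b_seq / b_par model the cost of getting the input where it is needed:
    # if the sequence is spread in equal fractions, the sequential solver must first
    # collect all r*n data elements, while each parallel processor needs only r*n/p.
    b_seq = 0.0 if input_replicated else r * n
    b_par = 0.0 if input_replicated else r * n / p
    w_seq = n * log2(n) + g * b_seq
    w_par = 2 ** 1.5 * n ** 1.5 / p + g * (p + n / sqrt(p) + b_par)
    return w_seq / w_par
```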
5
Conclusions and Outlook
In this paper, we have presented a new parallel algorithm for semi-local permutation string comparison. An important application for this algorithm is LIS computation. Although the computation cost of our algorithm is not asymptotically work-optimal in comparison to the best sequential algorithm for the LIS problem, it is in fact the first coarse-grained parallel algorithm which achieves any scalability for the problem. All previous approaches either have restrictive slackness conditions for work-optimality, or only have very limited theoretical speedup over just running sequentially. For LIS computation, using our algorithm can be advantageous for computations in which many processors exist and are connected by a very fast interconnection network (e.g. to achieve at least some speedup on a shared memory machine), or if either the input data or the resulting LIS needs to be distributed to all processors. It remains an open question whether it is possible to obtain a generally work-optimal parallel algorithm for the LIS problem running in time O((n log n)/p) in the comparison-based model. We believe that our method can be improved in two different ways. If the sequential algorithm for highest-score matrix multiplication can be improved to run in o(n^1.5), speedup would be improved for sequences long enough such that the computational work becomes dominant.
Furthermore, improving the communication cost of the parallel algorithm would allow more processors to be used to speed up the computation. Therefore, we believe that our approach has the potential to eventually yield a first truly work-optimal algorithm for LIS computation.
References
1. Erdős, P., Szekeres, G.: A combinatorial problem in geometry. Compositio Math. 2, 463–470 (1935)
2. Bespamyatnikh, S., Segal, M.: Enumerating longest increasing subsequences and patience sorting. Information Processing Letters 76, 7–11 (2000)
3. McColl, W.F.: General purpose parallel computing. In: Lectures on Parallel Computation, pp. 337–391. Cambridge University Press, New York (1993)
4. Garcia, T., Myoupo, J., Semé, D.: A work-optimal CGM algorithm for the longest increasing subsequence problem. In: Proc. PDPTA (2001)
5. Nakashima, T., Fujiwara, A.: A cost optimal parallel algorithm for patience sorting. Parallel Processing Letters 16(1), 39–52 (2006)
6. Semé, D.: A CGM algorithm solving the longest increasing subsequence problem. In: Gavrilova, M.L., Gervasi, O., Kumar, V., Tan, C.J.K., Taniar, D., Laganà, A., Mun, Y., Choo, H. (eds.) ICCSA 2006. LNCS, vol. 3984, pp. 10–21. Springer, Heidelberg (2006)
7. Hunt, J.W., Szymanski, T.G.: A fast algorithm for computing longest common subsequences. Communications of the ACM 20(5), 350–353 (1977)
8. Wagner, R.A., Fischer, M.J.: The string-to-string correction problem. J. ACM 21(1), 168–173 (1974)
9. McColl, W.F.: Scalable Computing. In: van Leeuwen, J. (ed.) Computer Science Today: Recent Trends and Developments. LNCS, vol. 1000, pp. 46–61. Springer, Heidelberg (1995)
10. Krusche, P., Tiskin, A.: Efficient parallel string comparison. In: ParCo. NIC Series, vol. 38, pp. 193–200. John von Neumann Institute for Computing (2007)
11. Tiskin, A.: Longest common subsequences in permutations and maximum cliques in circle graphs. In: Lewenstein, M., Valiente, G. (eds.) CPM 2006. LNCS, vol. 4009, pp. 270–281. Springer, Heidelberg (2006)
12. Apostolico, A., Atallah, M.J., Larmore, L.L., McFaddin, S.: Efficient parallel algorithms for string editing and related problems. SIAM J. Comp. 19(5), 968–988 (1990)
13. Tiskin, A.: Semi-local longest common subsequences in subquadratic time. J. Disc. Alg. 6(4), 570–581 (2008)
14. Valiant, L.G.: A bridging model for parallel computation. Communications of the ACM 33(8), 103–111 (1990)
15. Bisseling, R.: Parallel Scientific Computation: A Structured Approach Using BSP and MPI. Oxford University Press, Oxford (2004)
16. Tiskin, A.: Semi-local string comparison: Algorithmic techniques and applications. Extended version of [19], arXiv:0707.3619v5
17. Shi, H., Schaeffer, J.: Parallel sorting by regular sampling. J. Parallel and Distributed Computing 14(4), 361–372 (1992)
18. Tiskin, A.: The bulk-synchronous parallel random access machine. Theoretical Computer Science 196(1-2), 109–130 (1998)
19. Tiskin, A.: Semi-local string comparison: Algorithmic techniques and applications. Mathematics in Computer Science 1(4), 571–603 (2008)
A Scalable Parallel Union-Find Algorithm for Distributed Memory Computers Fredrik Manne and Md. Mostofa Ali Patwary Department of Informatics, University of Bergen, N-5020 Bergen, Norway {Fredrik.Manne,Mostofa.Patwary}@ii.uib.no
Abstract. The Union-Find algorithm is used for maintaining a number of non-overlapping sets from a finite universe of elements. The algorithm has applications in a number of areas including the computation of spanning trees, sparse linear algebra, and image processing. Although the algorithm is inherently sequential, there have been some previous efforts at constructing parallel implementations. These have mainly focused on shared memory computers. In this paper we present the first scalable parallel implementation of the Union-Find algorithm suitable for distributed memory computers. Our new parallel algorithm is based on an observation of how the Find part of the sequential algorithm can be executed more efficiently. We show the efficiency of our implementation through a series of tests to compute spanning forests of very large graphs. Keywords: disjoint sets, parallel Union-Find, distributed memory.
1
Introduction
The disjoint-set data structure is used for maintaining a number of non-overlapping sets consisting of elements from a finite universe. Its uses include, among others, image decomposition and the computation of connected components and minimum spanning trees in graphs, and it is also taught in most algorithm courses. The algorithm for implementing this data structure is often referred to as the Union-Find algorithm. More formally, let U be a collection of n distinct elements and let Si denote a set of elements from U. Two sets {S1, S2} are disjoint if S1 ∩ S2 = ∅. A disjoint set data structure maintains a collection {S1, S2, . . . , Sk} of disjoint dynamic sets selected from U. Each set is identified by a representative x, which is usually some member of the set. The two main operations are then to find which set a given element belongs to by locating its representative element and also to create a new set from the union of two existing sets. The underlying data structure of each set is typically a rooted tree where the element in the root vertex is the representative of the set. Using the two techniques Union-by-rank and path compression the running time of any combination of m Union and Find operations is O(nα(m, n)) where α is the very slowly growing inverse Ackermann function [4].
From a theoretical point of view using the Union-Find algorithm is close to optimal. However, for very large problem instances such as those that appear in scientific computing this might still be too slow. This is especially true if the Union-Find computation is applied multiple times and takes up a significant amount of the total time. It might also be that the underlying problem is too large to fit in the memory of one processor. One recent application that makes use of repeated applications of the Union-Find algorithm is a new algorithm for computing Hessian matrices using substitution methods [7]. Hence, designing parallel algorithms is necessary to keep up with the very large problem instances that appear in scientific computing. The first such effort was by Cybenko et al. [5] who presented an algorithm using the Union-Find algorithm for computing the connected components of a graph and gave implementations both for shared memory and distributed memory computers. The distributed memory algorithm duplicates the vertex set and then partitions the edge set among the processors. Each processor then computes a spanning forest using its local edges. In log p steps, where p is the number of processors, these forests are then merged until one processor has the complete solution. However, the experimental results from this algorithm were not promising and showed that for a fixed size problem the running time increased with the number of processors used. Anderson and Woll also presented a parallel Union-Find algorithm using wait-free objects suitable for shared memory computers [1]. In doing so they showed that parallel algorithms using concurrent Union operations risk creating unbalanced trees. However, they did not produce any experimental results for their algorithm. We note that there exists an extensive literature on designing parallel algorithms for computing a spanning forest or the connected components of a graph. However, up until the recent paper by Bader and Cong [3] such efforts had failed to give speedup on arbitrary graphs. In [3] the authors present a novel scalable parallel algorithm for computing spanning forests on a shared memory computer. Focusing on distributed memory computers is of importance since these have better scalability than shared memory computers and thus the largest systems tend to be of this type. However, their higher latency makes distributed memory computers more dependent on aggregating sequential work through the exploitation of locality. The current work presents a new parallel Union-Find algorithm for distributed memory computers. The algorithm operates in two stages. In the first stage each processor performs local computations in order to reduce the number of edges that need to be considered for inclusion in the final spanning tree. This is similar to the approach used in [5], however, we use a sequential Union-Find algorithm for this stage instead of BFS. Thus when we start the second parallel stage each processor has a Union-Find type forest structure that spans each local component. In the second stage we merge these structures across processors to obtain a global solution. In both the sequential and the parallel stage we make use of a
novel observation on how the Union-Find algorithm can be implemented. This allows both for a faster sequential algorithm and for a reduction in the amount of communication in the second stage. To show the feasibility and efficiency of our algorithm we have implemented several variations of it on a parallel computer using C++ and MPI and performed tests to compute spanning trees of very large graphs using up to 40 processors. Our results show that the algorithm scales well both for real world graphs and also for small-world graphs. The paper is organized as follows. In Section 2 we briefly explain the sequential algorithm and also how it can be optimized. In Section 3 we describe our new parallel algorithm, before giving experimental results in Section 4.
2
The Sequential Algorithm
In the following we first outline the standard sequential Union-Find algorithm. We then point out how it is possible to speed up the algorithm by paying attention to the rank values. This is something that we will make use of when designing our parallel algorithm. The standard data structure for implementing the Union-Find algorithm is a forest where each tree represents a connected set. To implement the forest each element x has a pointer p(x) initially set to x. Thus each x starts as a set by itself. The two operations used on the sets are then Find(x) and Union(x, y) where x and y are distinct elements. Find(x) returns the root of the tree that x belongs to. This is done by following pointers starting from x. Union(x, y) merges the two trees that x and y belong to. This is achieved by making one of the roots of x and y point to the other. With these operations the connected components of a graph G(V, E) can be computed as shown in Algorithm 1.

 1. S = ∅
 2. for each x ∈ V do
 3.   p(x) = x
 4. end for
 5. for each (x, y) ∈ E do
 6.   if Find(x) ≠ Find(y) then
 7.     Union(x, y)
 8.     S = S ∪ {(x, y)}
 9.   end if
10. end for

Algorithm 1. The sequential Union-Find algorithm

When the algorithm terminates the vertices of each tree will consist of a connected component and the set of edges in S defines a spanning forest on G. There are two standard techniques for speeding up the Union-Find algorithm. The first is Union-by-rank. Here each vertex is initially given a rank of 0. If two sets are to be merged where the root elements are of equal rank then the rank of
the root element of the combined set will be increased by one. In all other Union operations the root with the lowest rank will be set to point to the root with the higher rank while all ranks remain unchanged. Note that this ensures that the parent of a vertex x will always have higher rank than the vertex x itself. The second technique is path compression. In its simplest form, following any Find operation, all traversed vertices will be set to point to the root. This has the effect of compressing the path and making subsequent Find operations using any of these vertices faster. Note that even when using path compression the rank values will still be strictly increasing when moving upwards in a tree. Using the techniques of Union-by-rank and path compression the running time of any combination of m Union and Find operations is O(nα(m, n)) where α is the very slowly growing inverse Ackermann function [4]. We now consider how it is possible to implement the Union-Find algorithm in a more efficient way. It is straightforward to see that one can speed up Algorithm 1 by storing the results of the two Find operations and using these as input to the ensuing Union operation, which then only has to determine which of the two root vertices should point to the other. In the following we describe how it is possible in certain cases to terminate the Find operation before reaching the root. Let the rank of a vertex z be denoted by rank(z). Consider two vertices x and y belonging to different sets with roots rx and ry respectively where rank(rx) < rank(ry). If we find rx before ry then it is possible to terminate the search for ry as soon as we reach an ancestor z of y where rank(z) = rank(rx). This follows since the rank function is strictly increasing and we must therefore have rank(ry) > rank(rx), implying that ry ≠ rx. At this point it is possible to join the two sets by setting p(rx) = p(z). Note that this will neither violate the rank property nor will it increase the asymptotic time bound of the algorithm. However, if we perform Find(y) before Find(x) we will not be able to terminate early. To avoid this we perform the two Find operations in an interleaved fashion by always continuing the search from the vertex with the lowest current rank. In this way the Find operation can terminate as soon as one reaches the root with the smallest rank. We label this as the zigzag Find operation as opposed to the classical Find operation. The zigzag Find operation can also be used to terminate the search early when the vertices x and y belong to the same set. Let z be their lowest common ancestor. Then at some stage of the zigzag Find operation the current ancestors of x and y will both be equal to z. At this point it is clear that x and y belong to the same set and the search can stop. We note that the zigzag Find operation is similar to the contingently unite algorithm as presented in [9], only that we have extracted out the specifics of the path compression technique.
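To make the interleaved search concrete, the following sequential sketch combines union-by-rank, classical path compression, and a zigzag find-union in the spirit described above. It is our own illustration under the stated rank rules, not the authors' code; the class and function names are ours, and the zigzag walk itself omits path compression.

    #include <vector>
    #include <numeric>
    #include <utility>

    class DisjointSets {
    public:
        explicit DisjointSets(std::size_t n) : parent(n), rank_(n, 0) {
            std::iota(parent.begin(), parent.end(), std::size_t{0});  // initially p(x) = x
        }

        // Classical Find with full path compression.
        std::size_t find(std::size_t x) {
            std::size_t root = x;
            while (parent[root] != root) root = parent[root];
            while (parent[x] != root) {                     // compress the traversed path
                std::size_t next = parent[x];
                parent[x] = root;
                x = next;
            }
            return root;
        }

        // Zigzag find-union: climb from x and y in an interleaved fashion, always
        // advancing the side with the lower current rank. Returns true if the two
        // vertices were in different sets (and links the trees), false otherwise.
        bool zigzagUnion(std::size_t x, std::size_t y) {
            while (true) {
                if (x == y) return false;                   // common ancestor reached: same set
                if (rank_[x] > rank_[y]) std::swap(x, y);   // keep x as the lower-rank side
                if (parent[x] != x) { x = parent[x]; continue; }  // climb one step on x's side
                // x is a root with rank(x) <= rank(y); the root of y cannot be x,
                // so the sets differ and the search can terminate early.
                if (parent[y] == y && rank_[x] == rank_[y]) {
                    parent[x] = y;                          // equal-rank roots: link and
                    ++rank_[y];                             // bump the rank of y
                } else if (rank_[x] < rank_[y]) {
                    parent[x] = y;                          // y already has strictly larger rank
                } else {
                    parent[x] = parent[y];                  // equal rank, y not a root:
                }                                           // link under p(y) to keep ranks strict
                return true;
            }
        }

    private:
        std::vector<std::size_t> parent;
        std::vector<std::size_t> rank_;
    };

For example, the edge loop of Algorithm 1 can simply call zigzagUnion(x, y) and add (x, y) to S whenever it returns true.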
3
The Parallel Algorithm
In the following we outline our new parallel Union-Find algorithm. We assume a partitioning of both the vertices and the edges of G into p sets each,
V = {V0, V1, . . . , Vp−1} and E = {E0, E1, . . . , Ep−1} with the pair (Vi, Ei) being allocated to processor i, 0 ≤ i < p. If v ∈ Vi (or e ∈ Ei) processor i owns v (or e) and v (or e) is local to processor i. Any processor i that has a local edge (v, w) ∈ Ei such that it does not own vertex v will create a ghost vertex v′ as a substitution for v. We denote the set of ghost vertices of processor i by Vi′. Thus an edge allocated to processor i can either be between two vertices in Vi, between a vertex in Vi and a vertex in Vi′, or between two vertices in Vi′. We denote the set of edges adjacent to at least one ghost vertex by Ei′. The algorithm operates in two stages. In the first stage each processor performs local computations without any communication in order to reduce the number of edges that need to be considered for the second, final parallel stage. Due to space considerations we only outline the steps of the algorithm and neither give pseudo-code nor a formal proof that the algorithm is correct.

Stage 1. Reducing the input size

Initially in Stage 1 each processor i computes a spanning forest Ti for its local vertices Vi using the local edges Ei − Ei′. This is done using a sequential Union-Find algorithm. It is then clear that Ti can be extended to a global spanning forest for G. Next, we compute a subset Ti′ of Ei′ such that Ti ∪ Ti′ forms a spanning forest for Vi ∪ Vi′. We omit the details of how Ti′ can be computed efficiently but note that this can be done without destroying the structure of Ti. The remaining problem is now to select a subset of the edges in Ti′ so as to compute a global spanning forest for G.

Stage 2. Calculating the final spanning forest

The underlying data structure for this part of the algorithm is the same as for the sequential Union-Find algorithm, only that we now allow trees to span across several processors. Thus a vertex v can set p(v) to point to a vertex on a different processor than its own. The pointer p(v) will in this case contain information about which processor owns the vertex being pointed to, its local index on that processor, and also have a lower bound on its rank. Each ghost vertex v′ will initially set rank(v′) = 0 and p(v′) = v. Thus the connectivity of v′ is initially handled through the processor that owns v. For the local vertices the initial p() values are as given from the computation of Ti. We define the local root l(v) as the last vertex on the Find-path of v that is stored on the same processor as v. If in addition l(v) has p(l(v)) = l(v) then l(v) is also a global root. In the second stage of the algorithm processor i iterates through each edge (v, w) ∈ Ti′ to determine if this edge should be part of the final spanning forest or not. This is done by issuing a Find-Union query (FU) for each edge. An FU-query can either be resolved internally by the processor or it might have to be sent to other processors before an answer is returned. To avoid a large number of small messages a processor will process several of its edges before sending and receiving queries. A computation phase will then consist of first generating new
FU-queries for a predefined number of edges in Ti′ and then handling incoming queries. Any new messages to be sent will be put in a queue and transmitted in the ensuing communication phase. Note that a processor might have to continue processing incoming queries after it has finished processing all edges in Ti′. In the following we describe how the FU-queries are handled. An FU-query contains information about the edge (v, w) in question and also to which processor it belongs. In addition the FU-query contains two vertices a and b such that a and b are on the Find-paths of v and w respectively. The query also contains information about the rank of a and b and whether either a or b is a global root. Initially a = v and b = w. When a processor receives (or initiates) an FU-query it is always the case that it owns at least one of a and b. Assume that this is a; we then label a as the current vertex. Then a is first replaced by p(l(a)). There are now three different ways to determine if (v, w) should be part of the spanning forest or not: i) If a = b then v and w have a common ancestor and the edge should be discarded. ii) If a ≠ b, p(a) = a, and rank(a) < rank(b) then p(a) can be set to b, thus including (v, w) in the spanning forest. iii) If a ≠ b, rank(a) = rank(b), p(a) = a, while b is marked as also being a global root, then p(a) can be set to b while a message is sent to b to increase its rank by one. To avoid that a and b concurrently set each other as parents in Case iii) we associate a unique random number r() with each vertex. Thus we must also have r(a) < r(b) before we set p(a) = b. If a processor i reaches a decision on the current edge (v, w), it will send a message to the owner of the edge about the outcome. Otherwise processor i will forward the updated FU-query to a processor j (where j ≠ i) such that j owns at least one of a and b. In the following we outline two different ways in which the FU-queries can be handled. The difference lies mainly in the associated communication pattern and reflects the classical as opposed to the zigzag Union-Find operation as outlined in Section 2. In the classical parallel Union-Find algorithm a is initially set as the current vertex. Then while a ≠ p(a) the query is forwarded to p(a). When the query reaches a global root, in this case a, then if b is marked as also being a global root, rules i) through iii) are applied. If these result in a decision such that the edge is either discarded or p(a) is set to b then the query is terminated and a message is sent back to the processor owning the edge in question. Otherwise, the query is forwarded to b where the process is repeated (but now with b as the current vertex). In the parallel zigzag algorithm a processor that initiates or receives an FU-query will always check all three cases after first updating the current vertex z with l(z). If none of these apply the query is forwarded to the processor j which owns the one of a and b marked with the lowest rank and, if rank(a) = rank(b), the one with the lowest r value. Note that if v and w are initially in the same set then a query will always be answered as soon as it reaches the processor that owns the lowest common ancestor of v and w. Similarly, if v and w are in different
sets the query will be answered as soon as the query reaches the global root with lowest rank. Since FU-queries are handled concurrently it is conceivable that a vertex z ∈ {a, b} has ceased to be a global root when it receives a message to increase its rank (if Case iii) has been applied). To ensure the monotonicity of ranks z then checks, starting with w = p(z), that rank(w) is strictly greater than the updated rank of z. If not we increase rank(w) by one and repeat this for p(w). Note that this process can lead to extra communication. As for the algorithm in [1] it is possible that unbalanced trees are created with both parallel communication schemes. This can happen if more than two trees with the same rank are merged concurrently such that one hangs off the other. When a processor i receives a message that one of its edges (v, w) is to be part of the spanning forest it is possible to initiate a path compression operation between processors. On processor i this would entail setting l(v) (and l(w)) to point to the new root, which would then also have to be included in the return message. Since there could be several such incoming messages for l(v) and these could arrive in an arbitrary order we must first check that the rank of the new root is larger than the rank that i has stored for p(l(v)) before performing the compression. If this is the case then it is possible to continue the compression by sending a message to p(l(v)) about the new root. We label these schemes as either 1-level or full path compression.
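As a concrete illustration of the query record and of rules i) through iii), the sketch below shows one possible way to encode an FU-query and the purely local part of the decision. This is our own reading of the description above, not the authors' implementation; the field names, the Decision type, and the tie-breaking details are assumptions, and the forwarding, batching, and MPI exchange of queries are omitted.

    #include <cstdint>

    struct Vertex {                 // global handle for a vertex
        int owner;                  // processor that owns the vertex
        std::int64_t localId;       // local index on that processor
        bool operator==(const Vertex& o) const {
            return owner == o.owner && localId == o.localId;
        }
    };

    struct FUQuery {
        Vertex v, w;                // the edge (v, w) being decided
        int edgeOwner;              // processor that owns the edge
        Vertex a, b;                // current ancestors on the Find-paths of v and w
        int rankA, rankB;           // (lower bounds on) their ranks
        bool aIsGlobalRoot, bIsGlobalRoot;
        std::uint64_t rA, rB;       // unique random tie-breakers r(a), r(b)
    };

    enum class Decision { DiscardEdge, LinkAUnderB, Forward };

    // Local decision with a taken as the current vertex (after a has been
    // replaced by p(l(a)) on the processor that owns it).
    Decision decide(const FUQuery& q) {
        if (q.a == q.b)                                   // rule i): common ancestor
            return Decision::DiscardEdge;
        if (q.aIsGlobalRoot && q.rankA < q.rankB)         // rule ii): a is a root of lower rank
            return Decision::LinkAUnderB;
        if (q.aIsGlobalRoot && q.bIsGlobalRoot &&
            q.rankA == q.rankB && q.rA < q.rB)            // rule iii): equal-rank roots, broken by
            return Decision::LinkAUnderB;                 // the random numbers; b's rank is then
                                                          // increased by a separate message
        return Decision::Forward;                         // no decision yet: forward the query
    }

In the zigzag scheme an undecided query is forwarded to the processor owning whichever of a and b currently has the lower rank (lower r value on ties), while in the classical scheme it first follows a's path all the way to its global root.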
4
Experiments
For our experiments we have used a Cray XT4 distributed memory parallel machine with AMD Opteron quad-core 2.3 GHz processors where each group of four cores share 4 GB of memory. The algorithms have been implemented in C++ using the MPI message-passing library. We have performed experiments using both graphs taken from real applications as well as different types of synthetic graphs. In particular we have used application graphs from areas such as linear programming, medical science, structural engineering, civil engineering, and the automotive industry [6,8]. We have also used small-world graphs as well as random graphs generated by the GTGraph package [2]. Table 1 gives properties of the graphs. The first nine rows contain information about the application graphs while the final two rows give information about the small-world graphs. The first 5 columns give structural properties of the graphs while the last two columns show the time in seconds for computing a spanning forest using Depth First Search (DFS) and the sequential zigzag algorithm (ZZ). We have also used two random graphs both containing one million vertices and, respectively, 50 and 100 million edges. Note that all of these graphs only contain one component. Thus the spanning forest will always be a tree. Our first results concern the different sequential algorithms for computing a spanning forest. As is evident from Table 1 the zigzag algorithm outperformed
Table 1. Properties of the graphs

Name         |V|        |E|         Max Deg   Avg Deg   DFS    ZZ
m_t1         97578      4827996     236       98.95     0.12   0.06
cranksg2     63838      7042510     3422      220.64    0.15   0.03
inline_1     503712     18156315    842       72.09     0.57   0.26
ldoor        952203     22785136    76        47.86     0.71   0.47
af_shell10   1508065    25582130    34        33.93     1.04   0.37
boneS10      914898     27276762    80        59.63     0.86   0.38
bone010      986703     35339811    80        71.63     1.05   0.47
audi         943695     38354076    344       81.28     1.20   0.33
spal_004     321696     45429789    6140      282.44    1.33   0.66
rmat1        377823     30696982    8109      162.49    2.07   1.34
rmat2        504817     40870608    10468     161.92    2.71   1.81
the DFS algorithm. A comparison of the different sequential Union-Find algorithms on the real world graphs is shown in the upper left quadrant of Figure 1. All timings have been normalized relative to the slowest algorithm, the classical algorithm (CL) using path compression (W). As can be seen, removing the path compression (O) decreases the running time. Also, switching to the zigzag algorithm (ZZ) improves the running time further, giving approximately a 50% decrease in the running time compared to the classical algorithm with path compression. To help explain these results we have tabulated the number of “parent chasing” operations of the form z = p(z). These show that the zigzag algorithm only executes about 10% as many such operations as the classical algorithm. However, this does not translate to an equivalent speedup due to the added complexity of the zigzag algorithm. The performance results for the synthetic graphs give an even more pronounced improvement when using the zigzag algorithms. For these graphs both zigzag algorithms outperform both classical algorithms, and the zigzag algorithm without path compression gives an improvement in running time of close to 60% compared to the classical algorithm with path compression. Next, we present the results for the parallel algorithms. For these experiments we have used the Mondriaan hypergraph partitioning tool [10] for partitioning vertices and edges to processors. For most graphs this has the effect of increasing locality and thus enabling a reduction in the size of Ti′ in Stage 1. In our experiments T′ = ∪i Ti′ contained between 0.1% and 0.5% of the total number of edges for the application graphs, between 1% and 6% for the small-world graphs, and between 2% and 36% for the random graphs. As one would expect these numbers increase with the number of processors. In our experiments we have compared using either the classical or the zigzag algorithm, both for the sequential computation in Stage 1 and also for the parallel computation in Stage 2. We note that in all experiments we have only used 1-level path compression in the parallel algorithms as using full compression, without exception, slowed down the algorithms.
[Figure 1 consists of four panels: "Sequential spanning tree algorithms", "Parallel spanning tree algorithms using 4 processors", and "Parallel spanning tree algorithms using 8 processors" (y-axis: Time in %, x-axis: Graphs), and "Speedup of parallel spanning tree algorithms" (y-axis: Speedup, x-axis: Processors).]

Fig. 1. Performance results: S - Sequential algorithm, P - Parallel algorithm, CL - Classical Union-Find, ZZ - zigzag Union-Find, W - With path compression, O - Without path compression
How the improvements from the sequential zigzag algorithm are carried into the parallel algorithm can be seen in the upper right and lower left quadrant of Figure 1. Here we show the result of combining different parallel algorithms with different sequential ones when using 4 and 8 processors. All timings have again been normalized to the slowest algorithm, the parallel classical algorithm (P-CL) with the sequential classical algorithm (S-CL), and using path compression (W). Replacing the parallel classical algorithm with the parallel zigzag algorithm while keeping the sequential algorithm fixed gives an improvement of about 5% when using 4 processors. This increases to 14% when using 8 processors, and to about 30% when using 40 processors. This reflects how the running time of Stage 2 of the algorithms becomes more important for the total running time as the number of processors is increased. The total number of sent and forwarded FU-queries is reduced by between 50% and 60% when switching from the parallel classical to the parallel zigzag algorithm. Thus this gives an upper limit on the possible gain that one can obtain from the parallel zigzag algorithm over the parallel classical algorithm. When keeping the parallel zigzag algorithm fixed and replacing the sequential algorithm in Stage 1 we get a similar effect as we did when comparing the
sequential algorithms, although this effect is dampened as the number of processors is increased and Step 1 takes less of the overall running time. The figure in the lower right corner shows the speedup on three large matrices when using the best combination of algorithms, the sequential and parallel zigzag algorithm. As can be seen the algorithm scales well up to 32 processors at which point the communication in Stage 2 dominates the algorithm and causes a slowdown. Similar experiments for the small-world graphs showed a more moderate speedup peaking at about a factor of four when using 16 processors. The random graphs did not obtain speedup beyond 8 processors and even for this configuration the running time was still slightly slower than for the best sequential algorithm. We expect that the speedup would continue beyond the current numbers of processors for sufficiently large data sets. To conclude we note that the zigzag Union-Find algorithm achieves considerable savings compared to the classical algorithm both for the sequential and the parallel case. However, our parallel implementation did not achieve speedup for the random graphs, as was the case for the shared memory implementation in [3]. This is mainly due to the poor locality of such graphs.
References
1. Anderson, R.J., Woll, H.: Wait-free parallel algorithms for the union-find problem. In: Proceedings of The Twenty-Third Annual ACM Symposium on Theory of Computing (STOC 1991), pp. 370–380 (1991)
2. Bader, D.A., Madduri, K.: GTGraph: A synthetic graph generator suite (2006), http://www.cc.gatech.edu/~kamesh/GTgraph
3. Bader, D.J., Cong, G.: A fast, parallel spanning tree algorithm for symmetric multiprocessors (SMPs). Journal of Parallel and Distributed Computing 65, 994–1006 (2005)
4. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn. The MIT Press, Cambridge (2001)
5. Cybenko, G., Allen, T.G., Polito, J.E.: Practical parallel algorithms for transitive closure and clustering. International Journal of Parallel Computing 17, 403–423 (1988)
6. Davis, T.A.: University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (submitted)
7. Gebremedhin, A.H., Tarafdar, A., Manne, F., Pothen, A.: New acyclic and star coloring algorithms with applications to computing Hessians. SIAM Journal on Scientific Computing 29, 515–535 (2007)
8. Koster, J.: Parasol matrices, http://www.parallab.uib.no/projects/parasol/data
9. Tarjan, R.E., van Leeuwen, J.: Worst-case analysis of set union algorithms. J. ACM 31, 245–281 (1984)
10. Vastenhouw, B., Bisseling, R.H.: A two-dimensional data distribution method for parallel sparse matrix-vector multiplication. SIAM Review 47, 67–95 (2005)
Extracting Both Affine and Non-linear Synchronization-Free Slices in Program Loops Wlodzimierz Bielecki and Marek Palkowski Faculty of Computer Science, Technical University of Szczecin, 70210, Zolnierska 49, Szczecin, Poland {bielecki,mpalkowski}@wi.ps.pl http://kio.wi.zut.edu.pl/
Abstract. An approach is presented permitting the extraction of both affine and non-linear synchronization-free slices in program loops. It requires an exact dependence analysis. To describe and implement the approach, the dependence analysis by Pugh and Wonnacott was chosen, where dependences are found in the form of tuple relations. The approach is based on operations on integer tuple relations and sets and it has been implemented and verified by means of the Omega project software. Results of experiments with the UTDSP benchmark suite are discussed. Speed-up and efficiency of parallel code produced by means of the approach are studied. Keywords: synchronization-free slices, free-scheduling, parallel shared memory programs, dependence graph.
1
Introduction
Microprocessors with multiple execution cores on a single chip are the radical transformation taking place in the way that modern computing platforms are being designed. The hardware industry is moving in the direction of multi-core programming and parallel computing. This opens new opportunities for software developers [1]. Multithreaded coarse-grained applications allow the developer to exploit the capabilities provided by the underlying parallel hardware platform, including multi-core processors. Coarse-grained parallelism is obtained by creating a thread of computations on each execution core (or processor) to be executed independently or with occasional synchronization. Different techniques have been developed to extract synchronization-free parallelism available in loops, for example, those presented in papers [2,3,4,5,6]. Unfortunately, none of those techniques extracts synchronization-free parallelism when the constraints of sets representing slices are non-linear. In our recent work [7,8], we have proposed algorithms to extract coarse-grained parallelism represented with synchronization-free slices described by non-linear constraints. But they are restricted to extracting slices represented by a graph of the chain or tree topology only.
The main purpose of this paper is to present an approach that permits us to extract synchronization-free slices when they are represented i) by both affine and non-linear constraints, ii) not only by the chain or tree topology but also by a graph of an arbitrary topology. The presented algorithm is applicable to parameterized, both uniform and non-uniform, arbitrarily nested loops.
2
Background
In this paper, we deal with affine loop nests where, for given loop indices, lower and upper bounds as well as array subscripts and conditionals are affine functions of surrounding loop indices and possibly of structure parameters, and the loop steps are known constants. Two statement instances I and J are dependent if both access the same memory location and if at least one access is a write. I and J are called the source and destination of a dependence, respectively, provided that I is lexicographically smaller than J (I ≺ J, i.e., I is always executed before J). Our approach requires an exact representation of loop-carried dependences and consequently an exact dependence analysis which detects a dependence if and only if it actually exists. To describe and implement our algorithm, we chose the dependence analysis proposed by Pugh and Wonnacott [11] where dependences are represented by dependence relations. A dependence relation is a tuple relation of the form [input list]→[output list]: formula, where input list and output list are the lists of variables and/or expressions used to describe input and output tuples, and formula describes the constraints imposed upon input list and output list; it is a Presburger formula built of constraints represented with algebraic expressions and using logical and existential operators. We use standard operations on relations and sets, such as intersection (∩), union (∪), difference (-), domain (dom R), range (ran R), relation application (S′ = R(S): e′ ∈ S′ iff there exists e s.t. e→e′ ∈ R, e ∈ S), and transitive closure of relation R, R*. In detail, the description of these operations is presented in [11,12,13].

Definition 1. An iteration space is a set of all statement instances that are executed by a loop.

Definition 2. An ultimate dependence source is a source that is not the destination of another dependence.

Definition 3. Given a dependence graph, D, defined by a set of dependence relations, S, a slice is a weakly connected component of graph D, i.e., a maximal subgraph of D such that for each pair of vertices in the subgraph there exists a directed or undirected path.

Definition 4. Given a relation R, the set of common dependence sources, CDS, is defined as follows
CDS := {[e] : Exists(e′, e′′ s.t. e = R−1(e′) = R−1(e′′) && e′, e′′ ∈ range(R) && e′ ≠ e′′)}.

Definition 5. Given a relation R, the set of common dependence destinations, CDD, is defined as follows

CDD := {[e] : Exists(e′, e′′ s.t. e = R(e′) = R(e′′) && e′, e′′ ∈ domain(R) && e′ ≠ e′′)}.

We distinguish the following three kinds of dependence graph topology: i) chain when CDS = φ and CDD = φ, ii) tree when CDS ≠ φ and CDD = φ, iii) multiple incoming edges graph - MIE graph when CDD ≠ φ.

Definition 6 [9]. A schedule is a mapping that assigns a time of execution to each statement instance of the loop in such a way that all dependences are preserved, that is, a mapping s:I→Z such that for any dependent statement instances si1 and si2 ∈ I, s(si1) < s(si2) if si1 ≺ si2.

Definition 7 [9]. A free schedule assigns statement instances as soon as their operands are available, that is, a mapping σ:I→Z such that

σ(p) = 0, if there is no p′ ∈ I s.t. p′ ≺ p,
σ(p) = 1 + max(σ(p′)), p′ ∈ I, p′ ≺ p, otherwise.

The free schedule is the "fastest" schedule possible. Its total execution time is T_free = 1 + max(σ_free(p), p ∈ I).
3
Motivating Example
To explain the difficulties we face with non-linear slices, let us consider the following example

for i = 1 to n
  for j = 1 to n
    a(i,j) = a(2*i,j) : a(i,j-1)
  endfor
endfor

Dependences in this loop are represented by the following relations

R1 := {[i,j] → [2i,j] : 1 ≤ i && 2i ≤ n && 1 ≤ j < n},
R2 := {[i,j] → [i,j+1] : 1 ≤ i ≤ n && 1 ≤ j < n}.

From Figure 1, illustrating dependences in the loop for n=6, we can see that there exist 3 synchronization-free slices represented by the following sets of statement instances

sk := {[i,j] : 1 ≤ i ≤ n && 1 ≤ j ≤ n && Exists α : i = 2^α · zk},

where k=1,2,3 and z1=1, z2=3, z3=5. The constraints of the sets above are non-linear. We are unaware of any technique permitting the generation of code for
Fig. 1. Iteration space and dependences for the motivating example, n=6
such sets. Our previous algorithms [7,8] allow one to parallelize only loops whose dependence graphs are chains or trees. The dependence graph for the motivating example is a multiple incoming edges graph. Below we propose an algorithm for generating code that enumerates the statement instances of both affine and non-linear slices.
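Before presenting the algorithm, the small program below (our own illustration, not part of the proposed method) makes the slices of the motivating example concrete: two iterations [i,j] belong to the same synchronization-free slice exactly when their i-coordinates have the same odd factor z, i.e., i = 2^α · z, which matches the sets sk above.

    #include <cstdio>

    int oddPart(int i) {                    // strip all factors of 2 from i
        while (i % 2 == 0) i /= 2;
        return i;
    }

    int main() {
        const int n = 6;                    // same n as in Figure 1
        for (int z = 1; z <= n; z += 2) {   // one slice per odd z not exceeding n
            std::printf("slice z=%d:", z);
            for (int i = 1; i <= n; ++i)
                if (oddPart(i) == z)
                    for (int j = 1; j <= n; ++j)
                        std::printf(" [%d,%d]", i, j);
            std::printf("\n");
        }
        return 0;
    }

For n = 6 this prints the three slices with z = 1, 3, 5 (the columns i = 1, 2, 4; the columns i = 3, 6; and the column i = 5, respectively), each of which can be executed without synchronization with the others.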
4
Algorithm
The idea of the algorithm presented below is the following. First, using a set of dependence relations, we check whether the topology of the dependence graph is a MIE graph. If so, relations are modified so that they do not represent any common dependence destination. Then we check whether such modifications are permitted. If so, using the modified relations, we generate code that preserves all dependences represented by the original relations.

Algorithm. Extracting both affine and non-linear synchronization-free slices.
Input: Set S of dependence relations Ri, 1 ≤ i ≤ q; set of slice sources, Sour, calculated by means of the technique presented in [7].
Output: code scanning synchronization-free slices when possible.
1. Using set S, calculate the union of relations R := ∪_{1≤i≤q} Ri.
2. Using R, calculate the set of common dependence destinations, CDD. If CDD = φ then S′ = S, go to step 5. /* satisfying this condition means that the dependence graph described by R is not a MIE graph */
3. For each relation Ri, 1 ≤ i ≤ q, from set S do
3.1 Form relation R′i := Ri = {[e]→[e′] : constraints} and insert in its constraints the following constraint: && e′ ∈ range Ri && e′ ∉ CDD. /* this constraint causes R′i not to describe any common dependence destination */
3.2 If the following condition Ri − R′i = φ AND R′i − Ri = φ /* satisfying this condition means that Ri and R′i are the same */ is not satisfied, then do
3.2.1 Create a new set of relations S′, where Ri is replaced with R′i.
3.2.2 Calculate R′ as the union of all relations contained in set S′.
3.2.3 If R+(Sour) − R′+(Sour) = φ and R′+(Sour) − R+(Sour) = φ, /* satisfying this condition means that the transitive closure of R and that of R′ are the same */ where R+ is the transitive closure of R, then replace Ri with R′i in set S.
4. Using the modified set S′, calculate the union of relations R := ∪_{1≤i≤q} Ri;
using R, calculate the set of common dependence destinations, CDD, and check whether CDD = φ; if not, the algorithm ends: it fails to extract slices.
5. Generate parallel code for the set of dependence relations S′ using the algorithm presented in [8].

Proof. To prove the correctness of the algorithm above, we should demonstrate that the resulting code produced by the algorithm preserves all the dependences in the original loop. With this purpose, let us recall that code produced by the technique presented in paper [8] assigns statement instances to be executed according to free-scheduling. To show that the free scheduling is the same for both a set of modified relations and a set of original relations, we observe that the time k at which a given statement instance, I, is executed according to free-scheduling is defined by the number of vertices on the maximal path from the source of a slice, S, to I (a path in a graph is a sequence of vertices such that from each of its vertices there is an edge to the next vertex in the sequence). Because satisfying the condition R+(Sour) − R′+(Sour) = φ and R′+(Sour) − R+(Sour) = φ guarantees that for any vertex I, the maximal path from S to I is the same for both the dependence graph formed by the set of original dependence relations and that formed by the set of modified relations produced by the algorithm, we can conclude that the free scheduling is the same for both the original dependence graph and the dependence graph produced by the algorithm, i.e., the resulting code produced by the algorithm preserves all the dependences in the original loop. From the proof above, it is clear when the algorithm fails to produce parallel code. This takes place when the algorithm is not able to produce a non-MIE dependence graph or the free scheduling for a produced non-MIE graph is different from that produced for the original dependence graph. In the motivating example, each ultimate dependence source is the source of a slice (the dependence graph does not contain any pair of ultimate dependence sources that is connected by an undirected path in the dependence graph). So, sources of slices can be calculated as Domain(R) − Range(R). In papers [7,8], we proposed algorithms to generate while-loop based code scanning non-linear slices of the chain and/or tree topology. Each iteration of a while loop executes all the statement instances whose operands have already been calculated and prepares a set of statement instances to be executed at the next iteration. Statement
instances are executed as soon as their operands are available, preserving in such a way all dependences exposed for the original loop. Such a schedule (assigning a time of execution to each operation of the loop) is known to be the free schedule [9]. Below we demonstrate applying the algorithm to the motivating example. Set S of dependence relations is the following

R1 := {[i,j] → [2i,j] : 1 ≤ i && 2i ≤ n && 1 ≤ j < n},
R2 := {[i,j] → [i,j+1] : 1 ≤ i ≤ n && 1 ≤ j < n}.

A set of sources, Sour, is calculated by the algorithm presented in [7]:

Sour := {[i,1] : Exists(alpha : 2alpha = 1+i) && 1 ≤ i ≤ n, 2n−3}.

1. R := R1 ∪ R2 = {[i,j] → [2i,j] : 1 ≤ i && 2i ≤ n && 1 ≤ j < n} ∪ {[i,j] → [i,j+1] : 1 ≤ i ≤ n && 1 ≤ j < n}.
R+(Sour) − R′+(Sour) = {[i,j] : 2 < j ≤ n && 1 ≤ i ≤ n && Exists(alpha : 2alpha = i)} ≠ φ /* the transitive closure of R and that of R′ are not the same */
S′ = {R′1, R2}
4. CDD = φ. The dependence graph is represented by the tree (Fig. 2).
5. Using the technique presented in [8], we generate the following parallel code.

parfor(t=1; t<=min(n,2*n-3); t+=2) {
  I = [t,1];
  Add I to S';                 /* I is the source of a tree */
  while(S' != null) {
    S_tmp = null;
    (par)foreach(vector I=[i,j] in S') {
      s1(I);                   /* the original loop statement to be executed at iteration I */
      if(j==1 && 1<=i && 2*i<=n) {
        ip = 2*i; jp = 1;
        add J=[ip,jp] to S_tmp;
      }
      if(1<=i && i<=n && 1<=j && j<n) {
        ip = i; jp = j+1;
        add J=[ip,jp] to S_tmp;
      }
    }
    S' = S_tmp;
  }
}
Fig. 2. Iteration space and dependences for the motivating example after applying the algorithm
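The experiments reported next use the OpenMP API [15]. As an illustration only (our own sketch, not the output of the authors' tool), the parfor/while structure above can be mapped onto OpenMP roughly as follows; the worklist containers and function names are assumptions.

    #include <algorithm>
    #include <utility>
    #include <vector>

    void runSlices(int n) {
        // One independent slice per odd t; slices are distributed over the threads.
        #pragma omp parallel for schedule(dynamic)
        for (int t = 1; t <= std::min(n, 2 * n - 3); t += 2) {
            std::vector<std::pair<int,int>> S{{t, 1}}, S_tmp;   // S plays the role of S'
            while (!S.empty()) {
                S_tmp.clear();
                for (auto [i, j] : S) {
                    // s1(i, j);  // the original loop statement executed at iteration [i,j]
                    if (j == 1 && 2 * i <= n) S_tmp.push_back({2 * i, 1});
                    if (j < n)                S_tmp.push_back({i, j + 1});
                }
                S.swap(S_tmp);
            }
        }
    }

Here the inner (par)foreach is kept sequential within a slice; the parallelism comes from running different slices on different cores, which is the coarse-grained, synchronization-free execution the approach aims at.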
5
Experiments
We have studied the parallel code above to examine its speed-up and efficiency. We have carried out experiments on a multiprocessor machine (2x Intel
Quad Core, 1.6 GHz, 2 GB RAM, Ubuntu Linux) using the OpenMP API [15]. The time of the execution of the original loop and that of the code produced by our algorithm on one processor, as well as speed-up, S, and efficiency, E, for 2, 4, and 8 processors are presented in Table 1. The following conclusions can be made analyzing the data in Table 1. There is no considerable difference between the time of the execution of the original sequential for loop and that of the while loop produced by the presented algorithm. The produced parallel code is characterized by positive speed-up (S > 1). The efficiency of the parallel code increases as the volume of computations executed by the produced while loop increases.

Table 1. Results for the motivating example

        1 CPU               2 CPU            4 CPU            8 CPU
N       for[s]   while[s]   S       E        S       E        S       E
250     0.014    0.016      1.273   0.636    1.167   0.292    1.077   0.135
500     0.056    0.064      1.436   0.718    2.074   0.519    2.667   0.333
1000    0.224    0.257      1.659   0.830    3.027   0.757    4.148   0.519
1500    0.504    0.573      1.732   0.866    3.170   0.792    5.538   0.692
2000    0.922    1.006      1.787   0.893    3.453   0.863    6.188   0.773
We have also studied the UTDSP benchmarks [14] to recognize the dependence graph topology of their loops and the effectiveness of the presented algorithm. The UTDSP Benchmark Suite was created to evaluate the quality of code generated by a high-level language (such as C) compiler targeting a programmable digital signal processor (DSP). We have considered only such loops for which Petit [12] was able to carry out a dependence analysis.

Table 2. Results for UTDSP loops

                              Topology of dependence graph             All loops
                              MIE graphs      tree          chain
Before using the algorithm    14   78%        1    5%       3    17%   18
After using the algorithm     5    27%        2   11%       11   62%   18
Table 2 presents the number and percentage of loops with different dependence graph topologies. MIE graphs occur in 14 of the 18 graphs. After using the presented approach, the number of loops with that topology was reduced to 5; that is, of the 14 MIE graphs, the approach is able to transform 9 of them to the tree or chain topology. We examined parallel codes produced on the basis of the presented approach for the UTDSP loops. Experimental results were similar for the loops under examination. Due to the lack of space, we attach experimental results only for the edge detect application from the UTDSP suite. Dependences in this loop are represented by a MIE graph. After applying the proposed algorithm, dependences
Table 3. Experimental results for the UTDSP edge detect application

        1 CPU               2 CPU            4 CPU            8 CPU
N       for[s]   while[s]   S       E        S       E        S       E
1000    0.047    0.061      1.309   0.654    1.521   0.380    1.842   0.230
1500    0.107    0.137      1.446   0.723    2.099   0.525    2.301   0.288
2000    0.188    0.246      1.440   0.720    2.806   0.701    3.388   0.423
2500    0.294    0.382      1.431   0.716    2.786   0.696    3.406   0.426
3000    0.428    0.550      1.507   0.754    2.828   0.707    4.389   0.549
are represented by chains. The time, speed-up and efficiency of a parallel program (generated by the approach of [7]) are presented in Table 3. Analysing the data in this table, we can conclude that the presented approach can be applied to produce efficient parallel code for real-life loops.
6
Related Work
The affine transformation framework, considered in papers [4,5,6], unifies a large number of previously proposed loop transformations. However, it fails to extract synchronization-free parallelism when slices are described by non-linear constraints. Our previous papers [7,8] are devoted to the extraction of non-linear synchronization-free slices represented by graphs of the chain or tree topology only. The techniques presented in those papers fail to extract slices represented by multiple incoming edges (MIE) graphs. The main contribution of this paper is the demonstration of how to extract synchronization-free slices described by both affine and non-linear constraints and represented by MIE graphs. The algorithm presented in this paper was implemented by means of the Omega library. Experiments carried out by means of the tool we developed [16] demonstrate the correctness and satisfactory efficiency of parallel code produced on the basis of the algorithm presented in this paper.
7
Conclusion and Future Work
We presented a way to extract synchronization-free slices represented by both affine and non-linear constraints. Code generated by the presented algorithm assigns loop statement instances to be executed according to the free scheduling policy. The algorithm succeeds when eliminating common dependence destinations does not violate the free scheduling valid for the original dependence graph representing exact dependences in a program loop. The algorithm can be applied to produce parallel coarse-grained programs for parallel shared memory computers. Our future research direction is to derive techniques for extracting both synchronization-free slices and slices requiring synchronization and represented by graphs of an arbitrary topology.
References
1. Akhter, S., Roberts, J.: Multi-core Programming. Intel Books by Engineers, for Engineers (2006)
2. Allen, R., Kennedy, K.: Optimizing Compilers for Modern Architectures, p. 790. Morgan Kaufmann, San Francisco (2001)
3. Banerjee, U.: Unimodular transformations of double loops. In: Proceedings of the Third Workshop on Languages and Compilers for Parallel Computing, pp. 192–219 (1990)
4. Feautrier, P.: Toward automatic distribution. Journal of Parallel Processing Letters 4, 233–244 (1994)
5. Lim, W., Cheong, G.I., Lam, M.S.: An affine partitioning algorithm to maximize parallelism and minimize communication. In: Proceedings of the 13th ACM SIGARCH International Conference on Supercomputing (1999)
6. Darte, A., Robert, Y., Vivien, F.: Scheduling and Automatic Parallelization. Birkhäuser, Boston (2000)
7. Bielecki, W., Beletska, A., Palkowski, M., San Pietro, P.: Extracting synchronization-free chains of dependent iterations in non-uniform loops. In: 14th International Multi-Conference on Advanced Computer Systems, Polish Journal of Environmental Studies, Miedzyzdroje, vol. 16(5B), pp. 165–171 (2007)
8. Bielecki, W., Beletska, A., Palkowski, M., San Pietro, P.: Extracting synchronization-free trees composed of non-uniform loop operations. In: Bourgeois, A.G., Zheng, S.Q. (eds.) ICA3PP 2008. LNCS, vol. 5022, pp. 185–195. Springer, Heidelberg (2008)
9. Boulet, P., Darte, A., Silber, G., Vivien, F.: Loop parallelization algorithms: from parallelism extraction to code generation. Technical Report 97-17, LIP, ENS-Lyon, France, p. 30 (June 1997)
10. The Omega project, http://www.cs.umd.edu/projects/omega
11. Pugh, W., Wonnacott, D.: An exact method for analysis of value-based array data dependences. In: The Sixth Annual Workshop on Programming Languages and Compilers for Parallel Computing. Springer, Heidelberg (1993)
12. Kelly, W., Maslov, V., Pugh, W., Rosser, E., Shpeisman, T., Wonnacott, D.: The Omega library interface guide. Technical report, College Park, MD, USA (1995)
13. Pugh, W., Rosser, E.: Iteration space slicing and its application to communication optimization. In: Proceedings of the International Conference on Supercomputing, pp. 221–228 (1997)
14. UTDSP benchmark suite, http://www.eecg.toronto.edu/~corinna/DSP/infrastructure/UTDSP.html
15. OpenMP API, http://www.openmp.org
16. Implementation of the presented approach, source code, and documentation, http://detox.wi.ps.pl/SFS_Project
A Flexible Checkpoint/Restart Model in Distributed Systems Mohamed-Slim Bouguerra1,2, Thierry Gautier2 , Denis Trystram1 , and Jean-Marc Vincent1 1
Grenoble University, ZIRST 51, avenue Jean Kuntzmann 38330 MONTBONNOT SAINT MARTIN, France 2 INRIA Rhone-Alpes, 655 avenue de l’Europe Montbonnot-Saint-Martin 38334 SAINT ISMIER, France {mohamed-slim.bouguerra,thierry.gautier,denis.trystram, jean-marc.vincent}@imag.fr
Abstract. Large scale applications running on new computing platforms with thousands of processors have to face reliability problems. The failure of a single processor will cause the entire execution to fail. Most existing approaches to guarantee reliable executions are based on fault tolerance mechanisms. Coordinated checkpointing is one of the most popular techniques to deal with failures in such platforms. This work presents a new model of the coordinated Checkpoint/Restart mechanism for several types of computing platforms. The model is parametrized by the process failure distribution, the cost to save a globally consistent state of the processes, and the number of computational resources. Through a mathematical analysis of reliability, we apply this new model to compute the optimal interval between checkpoint times in order to minimize the average completion time. The model's independence from the type of failure law makes it completely flexible. We show that such a model may be used to reduce the checkpoint rate by up to 20% in some cases, and the total overhead by up to a factor of 4 in some cases. Finally, we report some experiments based on simulations for random failure distributions corresponding to the two most popular laws, namely, the Poisson process and the Weibull law. Keywords: Fault tolerance - Reliability modeling - Checkpointing.
1
Introduction
Today the most powerful computing systems involve more and more processors. Reliability is a crucial issue to address while running applications on such large systems because the failure rate grows with the system size. According to IBM, the BlueGene/L system with 65,536 nodes is expected to have a mean time between failures of less than 20 hours [1][2][3]. The users expect several processors to fail during the execution of their applications. Hence, it is necessary to develop strategies for providing a reliable completion of large applications. Fault-tolerant systems have been extensively studied on various computing or
real-time systems and many approaches have been proposed for dealing with failures. We focus in this work on the coordinated Checkpoint/Restart approach, which is one of the most popular mechanisms used in practice [4][5][6][7][8]. In checkpointing-based approaches, the state of the computation is saved periodically on reliable storage [9]. Informally, checkpointing is a technique that enables reducing the completion time of the application by saving intermediate states in a stable storage, and then restoring from the last stored state when a failure occurs. In the case of distributed computations, checkpointing methods differ from each other in the way the processes are coordinated in order to capture the global state in a consistent manner or not. Coordinated checkpointing requires that all processes coordinate the construction of a consistent global state before they write the individual checkpoints in the stable storage. After one or many failures, the restart mechanism sets up each processor from the last checkpoint and then schedules the tasks of the crashed processors again on new ones. In this paper, we focus on the minimization of both the completion time of an application and the Checkpoint/Restart overhead by determining the optimal interval between two consecutive checkpoints. The main contribution of this work is to derive a new probabilistic model of the execution time of parallel applications with stochastic processes. This explicit analytical cost model enables us to address and solve several interesting problems. First, like most other models, it can be used to determine the optimal interval length between two successive checkpoints that minimizes the checkpoint overhead. Moreover, it is possible to take into account a variable checkpoint cost that depends on the residual workload. This issue is not detailed here; however, the details can be found in the technical report [10]. Finally, the proposed model can be adapted to several types of computations since it is independent of the failure law type, which makes it very flexible. The rest of this paper is organized as follows: Section 2 briefly recalls and discusses the other existing models for coordinated checkpointing on parallel platforms. Section 3 describes the proposed analytical model that expresses the expected completion time. In Section 4, we detail how to apply this model to two case studies for the distribution of failure occurrences with the Poisson process and the Weibull law. Before concluding, we report in Section 5 some experiments based on simulations for assessing the model in several scenarios.
2
Review of Related Works
Let us first recall the principle of the coordinated checkpoint/restart protocol proposed by Chandy et al. [9]. It served as a basis of many implementations of fault tolerant systems for high performance computing. Coordinated checkpointing requires that all processes periodically coordinate the construction of a consistent global state before writing the state of individual checkpoints in a stable storage. Based on simulation results, Elnozahy et al. [5] showed that this approach is the most effective fault tolerance mechanism in large-scale parallel
platforms. The aim of fault tolerance models is to find strategies that minimize the completion time of the execution. Young [8] proposed a checkpoint/restart model where the failures follow an exponential law. Moreover, checkpointing is assumed to be fault-free and to have a constant execution time. The basic periodic version of this model has been recently extended by Oliner et al. [7]. Daly also proposed in [4] an extension of Young's model where failures can occur during checkpointing, and he derived a higher order of approximation. Both Young's and Daly's models are able to compute an optimal checkpoint interval that minimizes the completion time considering constant checkpoint times. Other works from Geist et al. [11] and Plank et al. [12] studied stochastic models and determined an optimal checkpoint date that minimizes a cost function which corresponds to maximizing the availability of the system. Yudan et al. [6] proposed a stochastic model and an optimal checkpoint interval which does not depend on the specific failure law, such as the Poisson process used in [11,12]. Moreover, in this model, it is possible to compute the optimal checkpoint interval that minimizes the expected wasted time in the protocol itself (checkpoint overhead, restart time and rollback time). Another variant of this model is proposed in [13], which models incremental checkpointing under a Poisson process. We present an analytical model that expresses the expected completion time with respect to the distribution law of failures, taking into account the checkpoint overheads. The most interesting difference in our model is that it does not depend on a particular failure law and it reduces both the checkpoint rate and the total waste time. Finally, it takes into account a variable checkpoint size as an input of the problem; this issue is addressed in [10].
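For reference, a standard first-order result recalled here (not taken from this paper): for a constant checkpoint cost C and exponentially distributed failures with mean time between failures M, Young's estimate of the interval between checkpoints that minimizes the expected overhead is

    % Young's first-order approximation of the optimal checkpoint interval;
    % Daly's work cited above refines it with higher-order terms and a restart cost.
    \tau_{\mathrm{opt}} \approx \sqrt{2\,C\,M}

The models discussed in this paper generalize this kind of result to failure laws other than the exponential one.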
3
Probabilistic Execution Model
For modeling the execution of a parallel application in the presence of failures and with a Checkpoint/Restart mechanism, the modeling scheme is organized as follows. First, a basic model that describes the parallel application in the presence of failures is presented. This first model allows us to compute the expected completion time of the execution with a formal method. Then, using this abstraction without checkpointing, we derive a global model that includes the Checkpoint/Restart mechanism and allows us to establish a global formula that expresses the expected completion time of the execution.
3.1 Parallel Application Model
A parallel application is modeled by a residual computation work function denoted by ωt, which is the amount of work remaining at time index t. If we draw this variation, the curve would be divided into two parts. The first part is characterized by a linear speed-up with a slope which depends on the number of computing processors m and an overhead factor α (due to the parallelism). In this phase, all the processors are busy. The second part presents a less steep curve. This means that there is not enough work for all the processors and some idle
Fig. 1. Execution model without checkpoint mechanism
Fig. 2. Model of execution with the Checkpoint/Restart mechanism
time appears. These two parts are determined by the amount of parallelism in the target application. In this work we consider only the first part, with a linear speed-up until the completion of the application. This is not really a limitation for large-scale applications. Under this hypothesis we have ωt = ω0 − αmt.
3.2 Execution Model without Checkpoint Mechanism
To establish the first fundamental theorem, we start by studying the execution without a checkpoint mechanism, where the execution is restarted from the beginning after a failure (see Figure 1). This figure models the following renewal process. In the first step, there is a start-up phase in which time elapses but the residual workload does not decrease. Then, the computation starts and the residual workload decreases with a speed-up depending on the number of computing nodes. Finally, when a failure occurs during one of these phases, all the computed work is lost. Thus, the application is restarted from the beginning after a duration R, which denotes the restart cost. For modeling the failure process, we consider mainly failures that affect the hardware or the software system. We also assume that the failure detection tool is reliable and that the failure detection time is negligible. Then, let us suppose that failures do not propagate; hence, the times between failures on different processors in the system are independent and identically distributed. From the above assumptions, we deduce that the process {Xi} (where Xi denotes the i-th inter-arrival time of failures) is a renewal process, and the {Xi} are positive independent and identically distributed random variables with distribution function F(t) and probability density function f(t). We recall that in this case the recovery process is considered a deterministic process with duration R. From this abstraction, we deduce an analytical model that expresses the expected completion time Tend. Let us consider the number of failures N during the execution, such that $\{N = n\} = \{X_n \geq \frac{\omega_0}{\alpha m} + R\}$. Consequently, N is a stopping time associated with the process {Xi}. Using Wald's equation [14] we can establish the following theorem.
Theorem 1. Let $T_{end}$ be the completion time of the execution and p the probability of failure during the execution, with $p = F\left(\frac{\omega_0}{\alpha m} + R\right)$. Then:
$$E(T_{end}) = \frac{1}{1-p}\,E(X_1) + \frac{\omega_0}{\alpha m} + R - E(X_N) = \frac{1}{1 - F\left(\frac{\omega_0}{\alpha m}+R\right)} \int_{0}^{\frac{\omega_0}{\alpha m}+R} \bigl(1 - F(x)\bigr)\,dx.$$
Proof in [10].
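As an illustration, the integral form of Theorem 1 is straightforward to evaluate numerically for any failure distribution F. The short sketch below does this with SciPy; the exponential CDF used in the example and all function and variable names are illustrative assumptions, not part of the original model or of [10].

```python
# Sketch: numerical evaluation of Theorem 1 for a user-supplied failure CDF F.
import math
from scipy.integrate import quad

def expected_completion_no_ckpt(F, w0, alpha, m, R):
    """E(T_end) = 1/(1-F(T)) * integral_0^T (1 - F(x)) dx, with T = w0/(alpha*m) + R."""
    T = w0 / (alpha * m) + R
    integral, _ = quad(lambda x: 1.0 - F(x), 0.0, T)
    return integral / (1.0 - F(T))

# Example: exponential inter-failure times (rate lam), i.e. a Poisson failure process.
lam = 1.0 / 24.0                              # one failure per day on average (hours^-1)
F_exp = lambda x: 1.0 - math.exp(-lam * x)
print(expected_completion_no_ckpt(F_exp, w0=100.0, alpha=1.0, m=10, R=0.2))
```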
3.3 Execution Model with Checkpoint Mechanism
The next step consists in inserting the Checkpoint/Restart mechanism into the basic model. Let us divide the initial amount of work into k intervals denoted $I_1, I_2, \cdots, I_k$, such that each $T_{end}^j$ denotes the completion time of the amount of work $I_j$. The global execution model is drawn in Figure 2. In the first step, there is a start-up or restart phase in which time elapses but the residual workload does not decrease. Then, the computation starts and the residual workload decreases with a speed-up depending on the number of computing nodes. After an interval of work denoted by $I_j$, a checkpoint is triggered. Finally, when a failure occurs during one of these phases, all the work computed after the last checkpoint is lost. Thus, the application is restarted from the last checkpoint. In this model we consider that the checkpoint process is a deterministic process and that its cost depends only on the amount of work already done. We briefly recall that, based on the design of the coordinated checkpoint mechanism, the data to save are the current state of each process and the computed data associated with it. More precisely, the amount of this data depends on the work already done. Therefore, let $s_j$ be the amount of work already done, where $s_j = \sum_{i=1}^{j} I_i$; the checkpoint cost function is denoted by $C(s_j)$. Likewise, the restart process is considered a deterministic process and its cost depends on the amount of data saved in the last checkpoint. It is denoted by $R(r_j)$, where $r_j = \sum_{i=1}^{j-1} I_i$ is the work already done before the last checkpoint. Finally, we consider that the checkpoint and recovery phases may also suffer from failures. From the above assumptions, we deduce that the process $\{X_i^j\}$ (which denotes the i-th inter-arrival time of failures in the j-th checkpoint interval) is a renewal process, and thus the $\{X_i^j\}$ are positive independent and identically distributed random variables with distribution function F(t) and probability density function f(t). Then, the main idea is to use the basic model (without checkpoint mechanism) for each $I_j$, as $I_j$ is the amount of initial workload that should be done between checkpoints number j − 1 and j. Using this abstraction and under these hypotheses we establish the following theorem:
Theorem 2. Let k be the number of checkpoints and $\eta_j$ the amount of work between each checkpoint, where $\eta_j = I_j + C(s_j) + R(r_j)$ and $\sum_{j=1}^{k} I_j = \frac{\omega_0}{\alpha m}$. Then:
$$E(T_{end}) = E(X_1)\sum_{j=1}^{k}\frac{1}{1-p_j} + \sum_{j=1}^{k}\left(\eta_j - E(X_N^j)\right), \quad \text{where } p_j = F(\eta_j).$$
Proof in [10]. Hence, Theorem 2 expresses the expected value of the completion time. It is worth noting that it does not depend on the type of the failure law, which confirms the generality and flexibility of the proposed model. Therefore, in order to find the optimal amount of work between each checkpoint $(\hat{I}_1, \hat{I}_2, \cdots, \hat{I}_k)$, we have to minimize the expression of Theorem 2.
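For concreteness, the sketch below evaluates the expected completion time for a given vector of intervals and searches over equal-length splits. It uses the per-interval integral form obtained by applying the algebra of Theorem 1 to each amount of work η_j, and assumes constant checkpoint and restart costs; the helper names and the crude grid search are illustrative assumptions rather than the optimization procedure of [10].

```python
# Sketch: evaluating Theorem 2 and scanning for a good number of checkpoints.
from scipy.integrate import quad

def expected_completion_with_ckpt(F, intervals, C, R):
    """Sum over intervals of the per-interval expected time, with eta_j = I_j + C + R.
    Each term applies the closed form of Theorem 1 to a single interval."""
    total = 0.0
    for I in intervals:
        eta = I + C + R
        integral, _ = quad(lambda x: 1.0 - F(x), 0.0, eta)
        total += integral / (1.0 - F(eta))
    return total

def best_equal_split(F, W, C, R, k_max=200):
    """Try k equal intervals (I_j = W/k, W = w0/(alpha*m)) and keep the best k."""
    best = min(range(1, k_max + 1),
               key=lambda k: expected_completion_with_ckpt(F, [W / k] * k, C, R))
    return best, expected_completion_with_ckpt(F, [W / best] * best, C, R)
```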
4
Case Studies
In this section, we propose two case studies. In the first one, we express the optimal checkpoint interval when the failure process is modeled by a Poisson process. In the second one, we express the suboptimal solution when the failures follow a Weibull law.
4.1 Poisson's Process Failures
One of the most common methods to model failures of computer devices is the Poisson process with a constant rate denoted by λ. In this case the expected completion time is given by:
$$E(T_{end}) = \frac{1}{\lambda}\sum_{j=1}^{k}\left[e^{\lambda\eta_j} - 1\right] \quad \text{where } \eta_j = I_j + C(s_j) + R(r_j). \qquad (1)$$
Then, considering the cost of the checkpoint barrier as a constant cost such that C(s) = C and R(r) = R, the following lemma is established.
Lemma 1. When the cost of checkpoint and restart is constant, the optimal interval between each checkpoint is $\frac{\omega_0}{\alpha m k}$, where k is the number of checkpoints.
Proof in [10]. Then, after finding the optimal interval between each checkpoint using Lemma 1, Equation (1) becomes:
$$E(T_{end}) = \frac{k}{\lambda}\left(e^{\lambda\left(\frac{\omega_0}{\alpha m k}+C+R\right)} - 1\right). \qquad (2)$$
Finally, to minimize this equation we use Lambert's function, denoted by W. The optimal interval between checkpoints then satisfies:
$$\lambda\,\frac{\omega_0}{\alpha m k} = 1 + W\left(-e^{-1-\lambda (C+R)}\right). \qquad (3)$$
Notice that Lambert's function can be computed numerically using Taylor series (more information about the proof can be found in [10]).
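In practice, one can use an existing Lambert-W implementation instead of a Taylor expansion. The sketch below evaluates the optimal interval derived from Equation (3) with SciPy's lambertw (principal branch) and rounds the implied number of checkpoints; the function and variable names are illustrative assumptions.

```python
# Sketch: optimal checkpoint interval for the Poisson case via Lambert's W (Eq. (3)).
import math
from scipy.special import lambertw

def optimal_interval_poisson(lam, C, R):
    """Optimal amount of work between checkpoints, i.e. w0/(alpha*m*k)."""
    w = lambertw(-math.exp(-1.0 - lam * (C + R))).real  # principal branch, arg in (-1/e, 0)
    return (1.0 + w) / lam

def optimal_k(lam, C, R, W):
    """W = w0/(alpha*m) is the total parallel work; round to a whole number of intervals."""
    return max(1, round(W / optimal_interval_poisson(lam, C, R)))
```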
4.2 Using Weibull's Law
In this second case study, a Weibull law with shape parameter β and scale parameter λ is used to model the distribution of failure occurrences. Then, the expected completion time is given by:
$$E(T_{end}) = \sum_{j=1}^{k} e^{(\lambda\eta_j)^{\beta}} \int_{0}^{\eta_j} e^{-(\lambda x)^{\beta}}\,dx. \qquad (4)$$
Consider the case where a constant cost is spent to checkpoint the global application. The following lemma can then be established:
Lemma 2. If the failure distribution is a Weibull law and the checkpoint/restart cost is constant, the suboptimal interval between each checkpoint is $\frac{\omega_0}{\alpha m k}$, where k is the number of checkpoints.
Proof details in [10]. Then, using Lemma 2, the new expression of the expected completion time becomes:
$$E(T_{end}) = k\, e^{\left(\lambda\left(\frac{\omega_0}{\alpha m k}+C+R\right)\right)^{\beta}} \int_{0}^{\frac{\omega_0}{\alpha m k}+C+R} e^{-(\lambda x)^{\beta}}\,dx. \qquad (5)$$
Hence, to minimize Equation (5) in order to find $\hat{k}$, Brent's numerical method from the GNU Scientific Library is used to find the minimum. More information about the numerical method source code is in [10].
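An equivalent experiment can be sketched in Python, evaluating Equation (5) for each candidate k and keeping the best integer value; this is an illustrative SciPy-based stand-in for the authors' GSL/Brent code, not the implementation from [10], and all names and the example parameters are assumptions.

```python
# Sketch: minimizing Eq. (5) over the number of checkpoints k (Weibull failures).
import math
from scipy.integrate import quad

def expected_completion_weibull(k, lam, beta, W, C, R):
    """Eq. (5): k * exp((lam*eta)^beta) * integral_0^eta exp(-(lam*x)^beta) dx,
    with eta = W/k + C + R and W = w0/(alpha*m)."""
    eta = W / k + C + R
    integral, _ = quad(lambda x: math.exp(-((lam * x) ** beta)), 0.0, eta)
    return k * math.exp((lam * eta) ** beta) * integral

def best_k_weibull(lam, beta, W, C, R, k_max=500):
    return min(range(1, k_max + 1),
               key=lambda k: expected_completion_weibull(k, lam, beta, W, C, R))

# Example with the parameters fitted in [6]: beta = 0.5094, lam = 1/20.584 (per hour).
print(best_k_weibull(lam=1 / 20.584, beta=0.5094, W=1000.0, C=10 / 60, R=10 / 60))
```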
5 Simulations
5.1 Simulations Scenario
In these simulations we propose three different scenarios to validate and compare the proposed model. In the first and second scenarios, a Poisson process is used to generate random failure times with respect to different failure rates. In the last scenario, we use a Weibull law to generate the failure times, using the same values of the parameters λ and β as described in [6]. These Weibull parameters are the result of a statistical analysis in [6] of failure log files from the LLNL ASC White computer system during the one-year period June 2003 to May 2004. The best-fitting distribution for the data is the Weibull distribution with shape parameter β = 0.5094 and scale parameter λ = 1/20.584. To compare the models, we consider Daly's model [4] in the first and second scenarios; in the third scenario we consider the model proposed in [6] to compare our model when failures follow a Weibull distribution. The main objective of these simulations is to study the behavior of the model with respect to the variation of one of the parameters, i.e., failure rate, checkpoint cost, or initial amount of work. Therefore, in each scenario one parameter varies in a given interval while the others are constant. The different scenario configurations are given in Table 1. For instance, in the first scenario the failure rate is increased in [1/2, 96] in steps of 5 while the checkpoint cost and the initial amount of work are constant. For each setup we perform 10^4 simulations of the execution with random failures. Note that, for Scenario 3 only, the same simulation is performed for two constant checkpoint costs of 15 and 10 minutes. Finally, it should be noted that the confidence intervals reported in all simulations are at the 95% level.
Table 1. Scenario configurations
Scenario | Failure Law | Failure Rate (per day) | Checkpoint Cost (mins) | Amount of Work
1 | Poisson's process | λ ∈ [1/2, 96] | C = 10 | ω0/αm = 10 days
2 | Poisson's process | λ = 1/2 | C ∈ [1/2, 96] | ω0/αm = 7 days
3 | Weibull | λ = 1/20.584, β = 0.5094 | C = 10, C = 15 | ω0/αm ∈ [100, 2500] hrs
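The sketch below shows the kind of Monte-Carlo loop that produces such averages and confidence intervals: an execution split into k equal checkpoint intervals is replayed many times with i.i.d. exponential inter-failure times, in the spirit of Scenario 1. The constant checkpoint/restart costs, the way the restart cost is charged inside each attempt, and all names are illustrative assumptions, not the paper's exact simulator.

```python
# Sketch: Monte-Carlo simulation of a checkpointed execution (Scenario 1 style).
import random

def simulate_once(W, k, C, R, lam):
    """Wall-clock time to complete W hours of work split into k intervals, with
    exponential(lam) inter-failure times; work since the last checkpoint is lost."""
    t, interval = 0.0, W / k
    for _ in range(k):
        remaining = interval + C          # interval of work plus its checkpoint
        while True:
            x = random.expovariate(lam)   # time until the next failure
            if x >= remaining + R:        # restart cost charged inside the attempt
                t += remaining + R
                break
            t += x                        # failure: the whole interval is redone
    return t

def average_completion(W, k, C, R, lam, runs=10_000):
    return sum(simulate_once(W, k, C, R, lam) for _ in range(runs)) / runs
```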
5.2 Results and Discussions
Figure 3 reports the variation of the average completion time versus the failure rate, with respect to the configuration of Scenario 1. As can be seen in this figure, both models yield the same average completion time and the same confidence interval. Consequently, our model has the same performance in terms of completion time as Daly's model, even when λ increases strongly. The second result with configuration 1 is presented in Figure 4, which shows the variation of the checkpoint rate versus the failure rate. This figure reveals that our model outperforms Daly's model in terms of checkpoint rate when λ becomes very large. Indeed, the proposed model has the same average completion time as Daly's model while reducing the checkpoint rate by up to 20%. It should be noted that decreasing the checkpoint rate leads to fewer I/O requests on the system, in terms of both network communication and stable storage writes. Secondly, to explain further why both models have the same completion time while FCM outperforms Daly's model in terms of checkpoint rate, the average overhead due to failures and the average overhead due to checkpoints are depicted versus the checkpoint cost in Figure 5, for FCM in contrast to Daly's model. As can be seen in this figure, the proposed model performs fewer checkpoints (dashed curve with square markers) than Daly's model (dashed curve without markers), but FCM loses more work due to failures than Daly's model. This means that both models have the same wasted time (checkpoint overhead plus re-executed work due to failures), and consequently they have the same average completion time. These outcomes show that the proposed model (FCM) outperforms Daly's model in terms of system I/O overhead while the completion time does not change, which is a very interesting feature. In fact, this is due to the assumption in Daly's model that the checkpoint cost is low. As a conclusion of the performance analysis, let us emphasize that both models have the same average completion time and confidence interval, which indicates the stability of the model. FCM reduces the checkpoint rate by up to 20% and is independent of the type of failure law, which makes it flexible, while Daly's model considers only the Poisson process. Finally, in the third scenario, we compare the average wasted time versus the initial amount of work with the model proposed in [6]. The result of this simulation is depicted in Figure 6, in which the same scenario is repeated for two values of the checkpoint cost. This figure reveals that, for both values of the checkpoint cost, FCM reduces the total wasted time by a factor of 3 and reduces the confidence intervals as well, especially when the initial amount of work is relatively large. This is due to the fixed-point method
Fig. 3. Variation of the average completion times (Scenario 1): average completion time (days) vs. failure rate per day, for FCM and Daly's model
Fig. 4. Variation of the optimal checkpoint number (Scenario 1): checkpoint number vs. failure rate per day, for FCM and Daly's model
Fig. 5. Average Checkpoint / Lost work overhead (Scenario 2): mean lost time (days) vs. cost of checkpoint (minutes), for FCM and Daly's model
Fig. 6. Variation of the average wasted time (Scenario 3): average wasted time (hours) vs. initial amount of work (hours), for FCM and Yudan et al. (10 and 15 minutes)
used by the authors of [6] to predict the time between the last checkpoint and the next failure. Consequently, FCM reduces the average completion time by reducing the total wasted time, and it outperforms the model in [6] in terms of stability by reducing the confidence intervals.
6
Conclusions
We presented in this paper a new flexible mathematical model for checkpointing applications in large-scale computing platforms. The main result was to establish a general analytical model which expresses the expected completion time of the application in distributed systems. We showed that it is possible to use this model with various probability laws that model the failure process, e.g., the Poisson process or the Weibull law. Moreover, this model can be used to determine the optimal interval between checkpoints considering the checkpoint cost and
the wasted time. A performance analysis against other existing models revealed that our model reduces the checkpoint rate, the strain on I/O bandwidth, and the total wasted time in terms of both expectation and variance. We are currently working on implementing this mechanism in a real system. For this purpose, it would be interesting to take into account a variable number of processors (instead of an unbounded number of processors), since it is not reasonable to assume that enough spare processors are always available.
References
1. Adiga, N., et al.: An Overview of the BlueGene/L Supercomputer. In: ACM/IEEE 2002 Conference on Supercomputing, p. 60 (2002)
2. Schroeder, B., Gibson, G.A.: A large-scale study of failures in high-performance computing systems. In: DSN 2006: Proceedings of the International Conference on Dependable Systems and Networks, Washington, DC, USA, pp. 249–258 (2006)
3. Hacker, T.J., Romero, F., Carothers, C.D.: An analysis of clustered failures on large supercomputing systems. J. Parallel Distrib. Comput. 69(7), 652–665 (2009)
4. Daly, J.T.: A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems 22(3), 303–312 (2006)
5. Elnozahy, E.N., Plank, J.S.: Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery. IEEE Trans. Dependable Secur. Comput. 1(2), 97–108 (2004)
6. Liu, Y., Nassar, R., Leangsuksun, C., Naksinehaboon, N., Paun, M., Scott, S.: An optimal checkpoint/restart model for a large scale high performance computing system. In: IEEE International Symposium on Parallel and Distributed Processing, pp. 1–9 (2008)
7. Oliner, A.J., Rudolph, L., Sahoo, R.K.: Cooperative checkpointing: a robust approach to large-scale systems reliability. In: Proceedings of The 20th Annual International Conference on Supercomputing, pp. 14–23. ACM, New York (2006)
8. Young, J.W.: A first order approximation to the optimum checkpoint interval. ACM Commun. 17(9), 530–531 (1974)
9. Chandy, K.M., Lamport, L.: Distributed snapshots: determining global states of distributed systems. ACM Trans. Comput. Syst. 3(1), 63–75 (1985)
10. Bouguerra, M.S., Gautier, T., Trystram, D., Vincent, J.M.: A new flexible checkpoint/restart model. Technical report, RR-6751, INRIA (2008)
11. Geist, R., Reynolds, R., Westall, J.: Selection of a checkpoint interval in a critical-task environment. IEEE Transactions on Reliability 37, 395–400 (1988)
12. Plank, J.S., Thomason, M.G.: The average availability of parallel checkpointing systems and its importance in selecting runtime parameters. In: 29th International Symposium on Fault-Tolerant Computing, pp. 250–259 (1999)
13. Naksinehaboon, N., Liu, Y., Leangsuksun, C., Nassar, R., Paun, M., Scott, S.: Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments. In: IEEE International Symposium on Cluster Computing and the Grid, pp. 783–788 (2008)
14. Tijms, H.C.: A First Course in Stochastic Models. John Wiley, Chichester (2003)
A Formal Approach to Replica Consistency in Directory Service Jerzy Brzeziński, Cezary Sobaniec, and Dariusz Wawrzyniak Poznan University of Technology, Institute of Computing Science Piotrowo 2, 60-965 Poznan, Poland {Jerzy.Brzezinski,Cezary.Sobaniec,Dariusz.Wawrzyniak}@put.edu.pl
Abstract. Replication is commonly applied in the implementation of directory service as a way to improve availability and efficiency. However, it raises the problem of consistency maintenance, especially important in the case of slow propagation of changes. The delay in propagation may lead to different anomalies in directory service usage. This paper presents an analysis of consistency conditions that are satisfied by the replicated directory entries, and specifies requirements resulting from client’s history of execution (session guarantees) that are necessary in order to meet the same conditions in the view on the client side. Keywords: directory service, replication, consistency.
1
Introduction
Directory service in distributed systems extensively exploits replication as a way to improve availability and efficiency. Availability improvement is the natural consequence of data (directory entries) redundancy. Efficiency, in the sense of increased throughput, may result from load distribution over multiple servers capable of providing the same data to requesting clients. Besides, in the case of proper replica allocation, the distance between a client and the server that handles it can be shortened, thereby reducing the response time. Replication raises the problem of consistency maintenance. However, this is cost-effective in the case of directory service because most access requests are queries, which do not change the state of a replica. As for the consistency itself, the design of the coherence protocol is usually aimed at eventual consistency (e.g. [1]) formulated in [2], and generalised in [3], which states that: if the system stopped issuing new modifications, copies of a replicated data would eventually have the same value. Restrictions on update acceptance and the topology of site interconnections for the propagation of changes (e.g. master-slave), however, may result in stronger consistency. To identify the consistency conditions, different strategies of update propagation are analysed.
Acknowledgment. The research presented in this paper was partially supported by the European Union in the scope of the European Regional Development Fund program no. POIG.01.03.01-00-008/08.
Another aspect of consistency concerns client-server interaction. Whilst servers are responsible for data exchange to keep a consistent state of replicas, a client can switch from one server to another that is not up-to-date yet. Consequently, the view of directory data on the client side may differ from the view on the server side. Some properties in this aspect — session guarantees (also called client-centric models) — have been formulated in [4]. Here, the influence of session guarantees on the consistency met on the client side is analysed.
2
System Model
The system considered here consists of two kinds of sites (sequential processes): clients requesting the directory service and servers storing the replicated directory entries. Clients use the service by updating the entries at servers (writing) or querying about them (reading). A client interacts with one server at a time, but can switch from one server to another during the execution. It is assumed that each server keeps a replica of all entries. In order to exchange the information about updates, servers are interconnected by means of a fixed network infrastructure. Eventually, every update must appear in the history of operations performed by each server. To this end, the server that accepts an update request from a client (i.e. handles the interaction) distributes the changes to other servers. For the sake of formal definitions let us assume the following notation: X — the set of directory entries; $w_i(x_j)v$ — an event subsequent to the update of entry x ∈ X at a server (replica) $S_j$ issued by a client $C_i$, defining a new value v for this entry (write and update are used interchangeably); $r_i(x_j)v$ — an event subsequent to the query on the replica of entry x ∈ X kept by a server $S_j$, returning a value v to a client $C_i$ (read and query are used interchangeably). If some elements of the notation are not important or are clear in the context in which they are used, they can be omitted. However, to emphasise that any replica of a given entry, say x, is addressed, the notation $x_*$ is used. Every operation issued in the system raises a few events in the client-server interaction (at the protocol level of abstraction): sending a request at a client process and its subsequent acceptance at a server process, and possibly sending a reply at a server process and its subsequent acceptance at a client process. Both clients and servers are sequential processes, so all the events appear in some total order; moreover, if a reply is expected, no request is sent (by a client) or accepted (by a server) in the interval between the request and the corresponding reply. Thus, the order of operations follows the order of events concerning requests, which is formalised by the definitions below.
Definition 1 (issue order). An operation o1 precedes an operation o2 in issue order, iff they are requested by the same client, say $C_i$, and the request to execute o1 is sent before the request to execute o2 (which is denoted $o1 \xrightarrow{C_i} o2$).
Issue order is the relation between operations on the client side, and is the consequence of the program performed by a client. In other words, the set of operations is defined by the program of the client process.
Definition 2 (execution order). An operation o1 precedes an operation o2 in execution order, iff the requests for these operations are accepted by the same server, say $S_j$, and o1 is executed by this server before o2 (denoted $o1 \xrightarrow{S_j} o2$).
Execution order is the relation between operations on the server side. The relation involves the operations that are executed by a server as a direct response to a client request, i.e. the operations that come from the interaction with clients handled by the server. However, potentially all updates are observed by each server in the system, because they change the entries, thereby directly or indirectly influencing the results of queries. Thus, each process has its own view of the operations in the system, according to the definition below.
Definition 3 (view). Let O be a set of operations, and → a total order relation on this set. A view, (O, →), is the set O ordered by the relation →, provided that the following condition is satisfied: $\forall_{x\in X}\ \forall_{w(x)v, r(x)v \in O}\ \; w(x)v \rightarrow r(x)v$.
Moreover, a legal view can be distinguished. Generally, a view is legal if each read operation returns the value defined by the immediately preceding update of the same entry ([5]). Every server operates on its own replica of the set of entries, therefore write operations that appear in its view concern this replica. The operations result either from the direct interaction with clients or from the exchange of information with other servers. In either case, a write operation appears in the view of a given server exactly once. Thus the definition for the server side is formalised as follows.
Definition 4 (s-legal view). A view (O, →) on the server side is legal (s-legal for clarity) if it satisfies the following condition:¹
$$\forall_{x\in X}\ \forall_{w(x)v, r(x)v \in O}\ \nexists_{w(x)u \in O}\ \bigl(u \neq v \,\wedge\, w(x)v \rightarrow w(x)u \rightarrow r(x)v\bigr) \qquad (1)$$
The view on the client side is more complicated because a client can switch from one server to another. Since a write operation subsequent to one issue is in the set of operations of each server, it can appear many times in this view. Therefore, the view need not be s-legal. Moreover, the next server can have executed fewer operations than the previous one, or the same operations in a different order. To simplify the identification of operations, it is assumed that a client uses a given server only once. Therefore, the view on the client side includes all write operations on each replica. The understanding of legality in this case is more general.
Definition 5 (c-legal view). A view (O, →) on the client side is legal (c-legal for clarity) if it satisfies the following two conditions:
$$\forall_{x\in X}\ \forall_{r(x_j)v \in O}\ \exists_{w(x_j)v \in O}\ \Bigl( w(x_j)v \rightarrow r(x_j)v \;\wedge\; \nexists_{w(x)u \in O}\,\bigl(u \neq v \,\wedge\, w(x_j)v \rightarrow w(x)u \rightarrow r(x_j)v\bigr)\Bigr) \qquad (2)$$
¹ For the sake of definition simplicity it is assumed that each write issued by a client defines a new (unique) value of a given object.
The set of operations influencing the execution of a process includes (potentially) all write operations and local read operations. In order to define a view for a given process, let us decompose the set of operations into the subset of reads and the subset of writes, i.e. $O = O^R \cup O^W$. Thereby, let us assume the following notation: $O_{C_i} = O^R_{C_i} \cup O^W_{C_i}$ — the set of operations issued by a client process $C_i$; $O_{S_j} = O^R_{S_j} \cup O^W_{S_j}$ — the set of operations executed by a server $S_j$, in which $O^R_{S_j}$ is the set of all read operations executed by a server $S_j$ (as a result of a client request), $O^R_{S_j/C_i} \subseteq O^R_{S_j}$ is the set of all read operations executed by a server $S_j$ as a result of a request by $C_i$, and $O^W_{S_j}$ is the set of all write operations executed by a server $S_j$ as the result of both direct requests from clients and cooperation with other servers, including a write defining an initial value (not issued by any client). It is clear that there is a correspondence between the sets of operations issued by the clients and the sets of operations executed by the servers. Excluding the initial value writing, there is a one-to-one mapping between $O^W_{S_j}$ and $\bigcup_i O^W_{C_i}$ for each $S_j$, between $O^R_{C_i}$ and $\bigcup_j O^R_{S_j/C_i}$ for each $C_i$, as well as between $\bigcup_j O^R_{S_j}$ and $\bigcup_i O^R_{C_i}$.
Definition 6 (server's view). A view on the server side for a server $S_j$ is an s-legal view of the set $O_{S_j}$ (denoted $(O_{S_j}, \xrightarrow{S_j})$).
Definition 7 (client's view). A view on the client side for a client $C_i$ is a c-legal view of the set $O_{C_i}$ (denoted $(O_{C_i}, \xrightarrow{C_i})$).
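To make the legality definitions operational, the sketch below represents operations as small records and checks the s-legality condition (1) for a totally ordered view given as a list; the data representation and helper names are illustrative assumptions, not part of the formal model.

```python
# Sketch: checking s-legality of a view (Definition 4); the list order is the view order.
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    kind: str      # "r" (read) or "w" (write)
    entry: str     # logical entry x
    value: int     # value written or returned

def is_s_legal(view):
    last_write = {}                      # entry -> value of the most recent write in the view
    for op in view:
        if op.kind == "w":
            last_write[op.entry] = op.value
        elif op.kind == "r":
            # a read must return the value of the immediately preceding write of the entry
            if last_write.get(op.entry) != op.value:
                return False
    return True

# Example: w(x)1, w(x)2, r(x)1 is not s-legal because r(x)1 misses the later write w(x)2.
print(is_s_legal([Op("w", "x", 1), Op("w", "x", 2), Op("r", "x", 1)]))   # False
```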
3
Consistency Models
Consistency models are defined by means of a number of consistency conditions to be satisfied by views of operations. In fact, the definitions are formulated in terms of equivalence between views, i.e. the conditions concern an equivalent view rather than the view itself. Each process has its own view, so in some cases the conditions concern views (or their equivalents) of all processes. Let us start from the definitions of equivalence.
Definition 8 (equivalence of write operations). Two write operations — $w(x_*)v$, $w(y_*)u$ — are equivalent ($w(x_*)v \doteq w(y_*)u$) iff x = y and u = v.
The above definition says that write operations are equivalent if (and only if) they write the same value to the same logical object (no matter which replica), under the previously made assumption about the uniqueness of the written value. The equivalence relation partitions the set of write operations into equivalence classes that form a quotient set. On the server side the set of write operations and its quotient set are in fact the same², because each write is executed by a server only once.
² Formally, there is a one-to-one mapping — projection — between $O^W_{S_j}$ and $O^W_{S_j}/\!\doteq$.
Definition 9 (equivalence of views). Two views of a set of operations, say $(O_i, \xrightarrow{i})$ and $(O_j, \xrightarrow{j})$, are equivalent if all the following conditions hold:
(i) $O^R_i = O^R_j$
(ii) $O^W_i/\!\doteq\; = O^W_j/\!\doteq$
(iii) $\forall_{o1,o2 \in O^R_i}\ \bigl(o1 \xrightarrow{i} o2 \;\Leftrightarrow\; o1 \xrightarrow{j} o2\bigr)$
3.1
Data-Centric Models
The definition of data-centric models may concern both clients and servers. In order to generalise the definitions, it is assumed that a number of processes, say P1, P2, . . . , Pn, either clients or servers, cooperate by accessing shared entries, as described in the previous section. In the unified approach to consistency of shared memory presented in [6], some consistency properties are distinguished that are mostly based on process order and data order of operations. Process order is the order of issue (by an individual process) resulting from program execution, and data order corresponds to the order of operations observed on an individual entry. Because of the distinction between clients and servers, the term local order is used instead of process order. The distinction results from the fact that two operations can be executed by a server in a different order than issued by a client. However, the same notation is used for local order on the client side and on the server side (i.e. $\xrightarrow{lo}$), assuming that the suitable definition is clear in the context.
Definition 10 (local order on the client side). Operations o1 and o2 are in local order on the client side iff they are in issue order, i.e.: $\exists_{C_i}\ o1 \xrightarrow{C_i} o2$.
Definition 11 (local order on the server side). Operations o1 and o2 are in local order on the server side iff they are in execution order, i.e.: $\exists_{S_j}\ o1 \xrightarrow{S_j} o2$.
Definition 12 (data order). Operations on an entry x, o(x)v and o(x)u, are in data order ($o(x)v \xrightarrow{do} o(x)u$) iff at least one of the following conditions holds:
(i) $o(x)v \xrightarrow{lo} o(x)u$   (3)
(ii) $v = u \,\wedge\, o(x)v \equiv w(x)v \,\wedge\, o(x)u \equiv r(x)v$   (4)
(iii) $\exists_{r(x)u \in O}\ \bigl(u \neq v \,\wedge\, o(x)v \xrightarrow{po} r(x)u \,\wedge\, o(x)u \equiv w(x)u\bigr)$   (5)
(iv) $\exists_{o \in O}\ \bigl(o(x)v \xrightarrow{do} o \,\wedge\, o \xrightarrow{do} o(x)u\bigr)$   (6)
Data-centric models are defined by a number of restrictions (consistency conditions) on the order of operations in the view for each process. The general assumption is that the view for a given process (or its equivalent) is s-legal and preserves its own local order, i.e.: $\forall_{P_i}\ \forall_{o1,o2 \in O_i}\ \bigl(o1 \xrightarrow{lo} o2 \Rightarrow o1 \xrightarrow{P_i} o2\bigr)$.
Combining local order and data order relations, the following consistency models (formalised below) can be defined [6]: PRAM consistency — preserves local order [7]; cache consistency — preserves data order [8]; processor consistency — preserves both local order and data order [8,9]; slow consistency — preserves the intersection of local order and data order [10].
PRAM: $\forall_{P_i}\ \forall_{o1,o2 \in O_{P_i} \cup O^W}\ \bigl(o1 \xrightarrow{lo} o2 \Rightarrow o1 \xrightarrow{P_i} o2\bigr)$   (7)
cache: $\forall_{P_i}\ \forall_{o1,o2 \in O_{P_i} \cup O^W}\ \bigl(o1 \xrightarrow{do} o2 \Rightarrow o1 \xrightarrow{P_i} o2\bigr)$   (8)
processor: $\forall_{P_i}\ \forall_{o1,o2 \in O_{P_i} \cup O^W}\ \bigl(o1 \xrightarrow{lo} o2 \,\vee\, o1 \xrightarrow{do} o2 \Rightarrow o1 \xrightarrow{P_i} o2\bigr)$   (9)
slow: $\forall_{P_i}\ \forall_{o1,o2 \in O_{P_i} \cup O^W}\ \bigl(o1 \xrightarrow{lo} o2 \,\wedge\, o1 \xrightarrow{do} o2 \Rightarrow o1 \xrightarrow{P_i} o2\bigr)$   (10)
One of the most important models is sequential consistency. The equivalence to a sequentially consistent execution is assumed as a general correctness criterion for concurrent systems sharing data [11,12]. In addition to local order it preserves a total order of all write operations, i.e.:
$\forall_{w1,w2 \in O^W}\ \bigl(\forall_{P_i}\ w1 \xrightarrow{P_i} w2 \;\vee\; \forall_{P_i}\ w2 \xrightarrow{P_i} w1\bigr)$   (11)
3.2 Client-Centric Models
Formal definitions of client-centric models cover the dependencies between the issue order of operations on the client side and the view of their counterparts on the server side, according to the following formulations:
read your writes (RYW): $\forall_{C_i}\ \forall_{S_j}\ \bigl(w(x)v \xrightarrow{C_i} r(y_j)u \Rightarrow w(x)v \xrightarrow{S_j} r_i(y)u\bigr)$   (12)
monotonic writes (MW): $\exists_{C_i}\ w(x)v \xrightarrow{C_i} w(y)u \Rightarrow \forall_{S_j}\ w_i(x)v \xrightarrow{S_j} w_i(y)u$   (13)
writes follow reads (WFR): $\exists_{C_i}\ r(x_*)v \xrightarrow{C_i} w(y)u \Rightarrow \forall_{S_j}\ w(x)v \xrightarrow{S_j} w_i(y)u$   (14)
monotonic reads (MR): $\forall_{C_i}\ \forall_{S_j}\ \bigl(r(x_*)v \xrightarrow{C_i} r(y_j)u \Rightarrow w(x)v \xrightarrow{S_j} r_i(y)u\bigr)$   (15)
Besides, the definitions of session guarantees, presented in [4], are formulated under the assumption that non-commutative modifications of shared data are performed in the same order on each replica. In our system model this means that write operations of one shared object are observed in the same order on each replica, which implies cache consistency on the server side.
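These formulations translate directly into trace checks. The sketch below verifies the monotonic-reads guarantee for one client against the sets of writes its servers had applied when answering each read; it is a simplified, conservative operational check (it demands all previously observed writes rather than only the relevant ones), and the trace format and names are illustrative assumptions.

```python
# Sketch: checking the Monotonic Reads (MR) session guarantee for a single client.
def monotonic_reads_ok(reads):
    """reads: list of (server_id, applied_writes) pairs in the client's issue order,
    where applied_writes is the set of write identifiers the server had executed
    when it answered the read."""
    seen = set()
    for _server, applied in reads:
        if not seen <= applied:          # a previously observed write is missing
            return False
        seen |= applied
    return True

# Example: the second server lags behind the first one, so MR is violated.
trace = [("S1", {"w(x)1", "w(x)2"}), ("S2", {"w(x)1"})]
print(monotonic_reads_ok(trace))         # False
```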
4
Consistency of Directory Contents
Two general approaches to consistency maintenance of directory contents are considered: master-slave (single master) and multimaster [13,1]. In the master-slave approach, slave replicas are available for reading only; thus update requests are accepted by the master server and then passed to slave servers. This approach simplifies the implementation of the coherence protocol because it facilitates control over the order of update propagation. In particular, the order in which modifications are performed at the master server can easily be preserved at slave servers. The order can be applied to all updates, which implies PRAM consistency (7), or to updates of individual entries, which implies slow consistency (10). However, the master server controls all operations, especially the operations on a given entry, and thereby ensures data order (Def. 12). This in turn implies processor consistency (9) or cache consistency (8), respectively. Moreover, if there is only one master server for the whole domain (no data decomposition), the local order at this server concerns all modifications. Consequently, there is the same order of write operations, which implies sequential consistency, because condition (11) is also preserved. This is not the case if directory data are decomposed and separate parts of the entries are distributed over a number of master servers. The servers control the order of operations independently of one another, which leads to processor consistency or cache consistency. Data decomposition is aimed at the exploitation of multiple master servers while avoiding conflicts. A general multimaster approach, however, allows for conflicting updates at different servers. In this case, the least requirement for directory contents to avoid persistent divergence is eventual consistency [14]. To achieve eventual consistency in the case of unrelated entries, it is sufficient to apply conflicting updates in the same order on each replica, or, in practice, to apply the Last Writer Wins algorithm [3]. Assuming that two updates are in conflict if they modify the same entry, the order of updates interleaved with possible queries defines data order (Def. 12). Consequently, eventual consistency in this context implies at least cache consistency.
5
Data-Centric Model on the Client Side
Sec. 4 concerns the view on the server side, and omits the view on the client side. If each client uses the same server all the time, it has the same view as its server. As clients can switch from one server to another, a given data-centric model on the server side does not guarantee the same consistency on the client side. The point at issue is which session guarantees are necessary to meet on the client side the data-centric model that is preserved on the server side. Fig. 1 shows an execution in which all session guarantees are preserved. However, the write operations are observed by the servers in different orders, which results in the violation of s-legality on the client side. Thus, cache consistency on the server side is the necessary consistency condition to ensure any data-centric model on the client side, which is the assumption for further analysis.
Lemma 1. If all session guarantees (i.e. RYW, MW, MR, WFR) are satisfied in the client-server interaction and the views on the server side are s-legal, the view on the client side is equivalent to an s-legal one.
Proof. The proof is only sketched. Let us assume for contradiction that a view on the client side is not s-legal, i.e. for a client, say $C_i$: $\exists_{w(x)u, w(x)v \in O}\ w(x)v \xrightarrow{C_i} w(x)u \xrightarrow{C_i} r(x)v$. This means that the client $C_i$ gets v from a server which either has not executed w(x)u yet, or has executed w(x)u before w(x)v. The former case means the violation of RYW or MR, and the latter of MW or WFR.
Fig. 1. Violation of s-legality
Fig. 2. Violation of cache consistency
Most data-centric models preserve local order, but its definition differs on the client side and on the server side. The following lemma addresses this problem.
Lemma 2. If RYW and MW are satisfied in the client-server interaction, execution order preserves issue order, i.e. if a client $C_i$ issues o1 and o2 during the interaction with $S_j$, the following condition holds: $o1 \xrightarrow{C_i} o2 \Rightarrow o1 \xrightarrow{S_j} o2$.
Proof. MW guarantees that two writes appear in issue order, and RYW guarantees that all previously issued writes appear before the subsequent read.
Corollary 1. MW and RYW ensure that issue order is preserved in the server view.
Execution order has no influence on issue order; however, it influences the view on the server side, and thereby on the client side.
Lemma 3. If execution order is preserved in the server view, and the MR guarantee is satisfied in the client-server interaction, a view on the client side (or its equivalent) preserves execution order as well.
Proof. MR ensures that each read operation by a client is performed at a server which has executed all write operations appearing in this client's view. The operations appear in execution order, so this order is preserved in the client view.
Lemma 4. If all session guarantees are satisfied in the client-server interaction, any data-centric model held on the server side is also met on the client side.
Proof. The views on the client side are legal (Lemma 1). Let us assume for contradiction that two writes appear in some order in the server view, but there is no equivalent view on the client side that preserves this order. The view on the client side must then include some read that does not allow showing the equivalence to the server view, which leads to the violation of some session guarantee. The question is which session guarantees are important for a given data-centric model. The answer is formulated by the following theorems.
Theorem 1. If sequential consistency is preserved on the server side, and the RYW, MW and MR guarantees are met in the client-server interaction, sequential consistency is fulfilled on the client side.
Proof. Lemma 4 states that all session guarantees applied together are sufficient for any data-centric model. Sequential consistency preserves local order, so — following Lemma 2, Corollary 1, and Lemma 3 — RYW, MW, and MR are the necessary session guarantees. As for WFR, it is preserved as a result of sequential consistency on the server side, and need not be checked as a property of the session.
Theorem 2. If PRAM consistency is preserved on the server side, and the RYW, MW and MR guarantees are met in the client-server interaction, PRAM consistency is fulfilled on the client side.
Proof. PRAM consistency requires only local order to be preserved, so — following Lemma 2, Corollary 1, and Lemma 3 — RYW, MW, and MR are the necessary session guarantees. WFR concerns in fact the relationship between operations issued by different processes, so it is not necessary.
Corollary 2. PRAM consistency on the server side implies in fact processor consistency. If the RYW, MW and MR guarantees are met in the client-server interaction, processor consistency is fulfilled on the client side.
Proof. Cache consistency is assumed on the server side; in combination with local order (PRAM), processor consistency is preserved. Following the reasoning in the proof of Theorem 1, processor consistency is also met on the client side.
Theorem 3. If cache consistency is preserved on the server side, and the RYW and MR guarantees are met in the client-server interaction, cache consistency is fulfilled on the client side.
Proof. Cache consistency does not require local order. RYW is necessary so as not to miss own writes. MR is necessary to avoid reading a previous value from another server (see Fig. 2).
6
Conclusions
Consistency of data provided by a directory service satisfies sequential consistency, processor consistency, or cache consistency depending on the implementation, configuration, and topology. However, because of the delays in
synchronisation, the data accessed by a client on different servers can seem not to satisfy the model preserved on the server side. To meet the consistency model on the client side, a conjunction of session guarantees is necessary. For sequential consistency or processor consistency, the RYW, MW and MR guarantees are sufficient, and for cache consistency, RYW and MR. The formal, declarative specification of consistency allows the users and the administrators to understand, predict, and possibly avoid different anomalies in a directory service.
References
1. Stokes, E., Weiser, R., Moats, R., Huber, R.: Lightweight Directory Access Protocol (version 3) Replication Requirements. RFC 3384 (Informational) (October 2002)
2. Birrell, A.D., Levin, R., Schroeder, M.D., Needham, R.M.: Grapevine: an exercise in distributed computing. ACM Commun. 25(4), 260–274 (1982)
3. Shapiro, M., Bhargavan, K.: The Actions-Constraints approach to replication: Definitions and proofs. Technical Report MSR-TR-2004-14, Microsoft Research (March 2004)
4. Terry, D.B., Demers, A.J., Petersen, K., Spreitzer, M., Theimer, M., Welch, B.W.: Session guarantees for weakly consistent replicated data. In: Proc. of the Third Int. Conf. on Parallel and Distributed Information Systems (PDIS 1994), Austin, USA, September 1994, pp. 140–149. IEEE Computer Society, Los Alamitos (1994)
5. Mizuno, M., Raynal, M., Zhou, J.Z.: Sequential consistency in distributed systems. In: Birman, K.P., Mattern, F., Schiper, A. (eds.) Dagstuhl Seminar 1994. LNCS, vol. 938, pp. 224–241. Springer, Heidelberg (1995)
6. Steinke, R.C., Nutt, G.J.: A unified theory of shared memory consistency. Journal of ACM 51(5), 800–849 (2004)
7. Lipton, R.J., Sandberg, J.S.: PRAM: A scalable shared memory. Technical Report CS-TR-180-88, Dept. of Computer Science, Princeton University (September 1988)
8. Goodman, J.R.: Cache consistency and sequential consistency. Technical Report 61, IEEE Scalable Coherence Interface Working Group (March 1989)
9. Ahamad, M., Bazzi, R.A., John, R., Kohli, P., Neiger, G.: The power of processor consistency (extended abstract). In: Proc. of the 5th ACM Annual Symp. on Parallel Algorithms and Architectures (SPAA 1993), June 1993, pp. 251–260 (1993)
10. Hutto, P.W., Ahamad, M.: Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In: Proc. of the 10th Int. Conf. on Distributed Computing Systems (ICDCS-10), May 1990, pp. 302–311 (1990)
11. Shasha, D., Snir, M.: Efficient and correct execution of parallel programs that share memory. ACM Trans. Program. Lang. Syst. 10(2), 282–312 (1988)
12. Ahamad, M., Neiger, G., Kohli, P., Burns, J.E., Hutto, P.W.: Causal memory: Definitions, implementation and programming. Distributed Computing 9, 37–49 (1995)
13. Cheung, R.Y.M.: From Grapevine to Trader: the evolution of distributed directory technology. In: CASCON 1992: Proceedings of the 1992 Conference of the Centre for Advanced Studies on Collaborative Research, pp. 375–389. IBM Press (1992)
14. Saito, Y., Shapiro, M.: Optimistic replication. ACM Computing Surveys 37(1), 42–81 (2005)
Software Security in the Model for Service Oriented Architecture Quality Grzegorz Kolaczek and Adam Wasilewski Institute of Informatics, Wroclaw University of Technology, Wybrzeze Wyspianskiego 27, 50-370 Wroclaw, Poland {grzegorz.kolaczek,adam.wasilewski}@pwr.wroc.pl
Abstract. The paper presents an original approach to locating security aspects in the Service Lifecycle and the Service Oriented Architecture (SOA) quality model. The first part of the paper focuses on the quality of SOA and security measures, and investigates some functional and non-functional requirements for security measurement. The general discussion about SOA quality and security measures is summarized by the proposition of a novel multi-agent architecture for evaluating the security level of SOA systems in the second part of the paper. Keywords: Service Oriented Architecture, Software Quality, Software Security.
1
Introduction
Quality of SOA (Service Oriented Architecture) usually means more than only reducing defects. It has to be connected with the requirements of its users, not only for today but for the future as well. When business and IT expectations are mixed, a defect-free product is necessary, but not sufficient. The real challenge with SOA software is guaranteeing that the application meets all the (business and technological) requirements set out for it. One such important aspect is security assurance. SOA is an approach to designing, implementing, and deploying a product (service) as a puzzle: it is created from a set of components implementing discrete business functions. These components may be distributed across the world, but for the user they have to be secure. The problem is how to evaluate and confirm the security level of a SOA product. The paper introduces an approach to locating security aspects in the Service Lifecycle and the SOA quality model. Then some functional and non-functional requirements for security measurement are presented and connected with security assurance in SOA. Finally, conclusions and future work are stated.
2
Quality in Service Lifecycle Management
The ability to effectively use available methods of QA (Quality Assurance) in the lifecycle of services is fundamental to achieving success within SOA. The
Fig. 1. Typical service lifecycle
simplest model of the Service Lifecycle may include: governance, delivery, execution and measure (Fig. 1). Regardless of the main aim of each stage, it is important to think about quality analysis and improvement as well (Tab. 1).
Table 1. Quality assurance in the service lifecycle
phase
key achievement [16] – committing to a strategy for SOA within the overall IT strategy Governance – explicitly determining the level of IT and SOA capabilities – articulating and refining the vision and strategy for SOA – reviewing current governance capabilities and arrangements – developing a governance plan – establishing or refining a SOA Center of Excellence (COE) Delivery – defining additional capabilities required, such as upgrades to the IT infrastructure – agreeing on policies for service reuse across lines of business – putting funding mechanisms in place to encourage this reuse – establishing mechanisms to guarantee service levels – deploying new and enhanced governance arrangements Execution – deploying technology to discover and manage assets
quality assurance – reviewing of quality aspects – setting the quality expectations (levels) – developing a quality assurance plan
– defining quality characteristics – developing quality metrics
– deploying rules of result’s interpretation – founding procedures for testing – establishing system of work’s documentation – deploying quality assurance model – using external tools and application to service quality analysis
Table 1. (Continued)
phase
Measure
3
key achievement [16] – communicating and educating expected behaviors and practices within both the business and IT decisionmaking communities – enabling the policy infrastructure – executing the service – monitoring compliance with policies and governance arrangements, such as service level agreements (SLAs), reuse levels, and change policies – analyzing IT effectiveness metrics
quality assurance
– monitoring and analyzing values of quality metrics
– communicating team to improve quality
within service
SOA Quality Model and Security Metrics
The SOA quality model should address multiple aspects of service quality across SOA service implementations. In fact, two aspects seem to be the most important: software product quality and business process quality (Fig. 2). Both of them determine the final quality of the service, which depends on the final user's expectations and feelings. To ensure top SOA quality of service, companies have to meet customers' business requirements, improve their satisfaction and profitability, as well as ensure the highest level of software reliability.
Fig. 2. Typical service lifecycle
3.1
Quality of Business Process
One of the models for business process quality analysis was defined by A. Selcuk Guceglioglu and Onur Demirirs (Fig. 3). It presents a complementary process-based approach focusing on the quality aspect of the process [12]. The structure of the model is based on ISO/IEC 9126, so it includes:
– categories (aspects) of quality;
– characteristics (functionality, reliability, usability and maintainability);
– subcharacteristics;
– metrics (to analyze quality attributes).
Fig. 3. Model for business process quality [4]
In this model, security (of the business process) is a part of its functionality and may be measured using access auditability metrics.
3.2 Quality of Software Product
One of the well-known standards for describing software product quality is ISO/IEC 9126 Software engineering – Product quality. It defines six quality characteristics (subdivided into subcharacteristics) for internal and external quality and four characteristics for quality-in-use (Fig. 4). ISO/IEC 9126 may be used to specify and evaluate software product quality from different perspectives. It is dedicated to users involved in the analysis of requirements, development, evaluation, maintenance, use or audit of software, and typical examples of its use are to:
– validate the completeness of a requirements definition;
– identify software requirements;
– identify software design objectives;
– identify software testing objectives;
– identify quality assurance criteria;
– identify acceptance criteria for a completed software product. [5]
Fig. 4. ISO 9126 model for software product quality [5]
Security according to ISO/IEC 9126 means the capability of the software product to protect information and data so that unauthorised persons or systems cannot read or modify them and authorised persons or systems are not denied access to them [5]. It has several metrics, e.g. access auditability, access controlability, data corruption prevention, data encryption.
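To illustrate how such metrics are turned into numbers, the sketch below computes a simple coverage-style ratio in the spirit of access auditability (logged access types over access types that require logging). The formula and the function name are illustrative assumptions, not the normative metric definitions of the standard.

```python
# Sketch: a coverage-style security indicator in the spirit of access auditability.
def access_auditability(logged_access_types, required_access_types):
    """Fraction of access types required to be logged that are actually logged (0..1)."""
    required = set(required_access_types)
    if not required:
        return 1.0
    return len(set(logged_access_types) & required) / len(required)

print(access_auditability({"read", "write"}, {"read", "write", "delete"}))  # ~0.667
```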
4
Security Level Evaluation for Software Quality Measurement
The IEEE Standard for a Software Quality Metrics Methodology describes software quality as the degree to which software possesses a desired combination of quality attributes [2]. In this context, the crucial elements for quality measurement are quality attributes, which are also called quality characteristics. Quality attributes can be classified into two main categories: execution qualities, such as security and usability, which are observable at run time, and evolution qualities, such as testability, maintainability, extensibility and scalability, which are embodied in the static structure of the software system [8]. According to the ISO software quality model [3], security is a component of the functionality category, which is one of the six categories of quality characteristics defined within this model. The definition of security proposed in this document is the capability of the software product to protect information and data so that unauthorized persons or systems cannot read or modify them and authorized persons or systems are not denied access to them. These two documents devoted to software quality measurement are the main set of guidelines for elaborating the security level evaluation framework for SOA systems.
4.1 Security Requirements and Security Measures for Software Development
There are several well-known problems and controversies in defining the exact meaning of security metrics [1,6,10]. There is no single metric that is
acceptable and applicable in the context of all possible systems and situations. Security metrics are highly context dependent, so the final shape of the metrics is related to the situation and target, and depends on security goals; technical, organizational, and operational needs; available resources; etc. On the other hand, metrics are essential in measuring the goodness of the target system, and this is also true in the context of security quality evaluation, so there is a continuous need for security metric definitions. As a system activity cannot be managed well if it cannot be measured, metrics provide the manager with instruments which make it possible to characterize, evaluate, predict and improve process execution. Security metrics and measurements can be used for decision support, especially in assessment and prediction. Exemplary uses of security metrics for security assessment include [11]:
– Risk management activities in order to mitigate security risks,
– Comparison of different security controls or solutions,
– Obtaining information about the security posture of an organization, a process or a product,
– Security assurance of a product, an organization, or a process,
– Security testing (functional, red team and penetration testing) of a system,
– Certification and evaluation (e.g. based on Common Criteria) of a product or an organization,
– Intrusion detection in a system, and
– Other reactive security solutions such as antivirus software.
For example, to predict the future security behavior of an organization, a process or a product, metrics based on mathematical models and algorithms can be applied to collect and analyze measured data (e.g. regression analysis). A security metric can be qualified as objective or subjective, quantitative or qualitative, dynamic or static, relative or absolute, and direct or indirect [13]. ISO/IEC 9126 proposes external, internal and indirect metric categories. The internal metrics are understood as static measures of products; these metrics measure quality only indirectly. It has also been assumed that at the early stages of development only resources and processes can be measured [9]. The external metrics of this standard measure the code during execution. The indirect metrics are metrics of quality in use. According to the characteristics of SOA systems, there are some specific requirements that should be taken into account when considering security measurement.
4.2 Security Assessment in SOA
The security evaluation process should be based on some formal prerequisites. This means that the security evaluation must be objective, to guarantee the repeatability and universality of the evaluation results. Thus the notion of a security measure has to be defined, and there is some confusion about this notion. The first problem is that a security measure does not have any specific unit. The other difficulties are that the security level has no objective grounding but only in
Fig. 5. The layered model of Service Oriented Architecture
some way reflects the degree to which our expectations about security agree with reality, that security level evaluation is not a fully empirical process, etc. [6,7]. As a SOA system can be defined by its five functional layers (Fig. 5), the corresponding definition of SOA security requirements for the security evaluation process should address the specific security problems within each layer. Selected elements of the set defining security requirements for the SOA layers, describing functional and non-functional security evaluation requirements for each of the SOA functional layers, are presented in the table in Fig. 6; the complete list can be found in [7]. The most important functionality related to the SOA security level evaluation architecture is the description of all components, mechanisms and relations that are necessary to precisely evaluate the security level
Fig. 6. Functional and non-functional security evaluation requirements for the SOA functional layers (selection)
Fig. 7. The architecture of the multi-agent system for SOA security level evaluation
of the particular SOA system. As described in this section, the problem of security evaluation is very complex and more than one solution could be acceptable within the context of a particular system and its environment. We propose a general approach to SOA security level evaluation in relation to the requirements listed in the table presented in Fig. 6. The central part of the proposition is the multi-agent architecture presented in Fig. 7. The multi-agent architecture is composed of three types of agents: monitoring agents that test the various security parameters related to a particular SOA layer, superior agents that manage the activity of the monitoring agents, and managing agents that are responsible for all superior agents and for communication with service consumer agents. This type of architecture was selected because it corresponds well to the SOA characteristics. Within the architecture the monitoring agents are responsible for performing the tasks related to security assessment using the selected metrics, while the managing agent provides the final results. The next step of the research will be devoted to the problem of selecting or elaborating security measures appropriate for SOA systems for the monitoring agents, and data fusion methods for the managing agent. In the figure:
• AMOL – SOA functional layer monitoring agents
• ASL – SOA functional layer superior agents
• AM – SOA managing agents
• AC – the agents of consumers of SOA system services.

5
Conclusions
Improved interoperability is one of the most prominent benefits of SOA. This type of system allows service users to transparently call services implemented on
disparate platforms using different languages. However, one of the challenges of eliciting quality requirements for a system is that it may not be possible to know all the collaborating parts. This is especially true in SOA-based systems that provide public services and/or search for services at runtime. Quality measures make it possible to judge the quality of such systems. Quality requirements, such as those for performance, security, modifiability, reliability, and usability, have a significant influence on the software architecture of a system. The use of a service-oriented approach positively impacts some quality attributes, while introducing challenges for others. This paper presented the impact of SOA characteristics on different quality measures. The first part of the paper discussed the quality of SOA, security measures and some functional and non-functional requirements for security measurement; the second part presented the proposition of a novel multi-agent architecture for evaluating the security level of SOA systems. Future work will concentrate on the evaluation of the proposed architecture for security assessment and on refinement of the security measures. Acknowledgements. The research presented in this paper has been partially supported by the European Union within the European Regional Development Fund program no. POIG.01.03.01-00-008/08.
References
1. Hecker, A.: On Security Metrics. ENST, DESEREC project document, http://www.deserec.eu/
2. IEEE 1061, IEEE Standard for a Software Quality Metrics Methodology. The Institute of Electrical and Electronics Engineers, New York, USA (1998)
3. ISO/IEC 9126-1, Software engineering — Product quality — Part 1: Quality model (2001)
4. ISO/IEC 9126-2, Software engineering — Product quality — Part 2: External metrics (2003)
5. ISO/IEC 9126-3, Software engineering — Product quality — Part 3: Internal metrics (2003)
6. Kaner, C., Bond, W.P.: Software Engineering Metrics: What Do They Measure and How Do We Know? In: 10th International Software Metrics Symposium, METRICS 2004 (2004)
7. Kolaczek, G.: Multi-agent Security Evaluation Framework for Service Oriented Architecture Systems. LNCS (LNAI). Springer, Heidelberg (2009) (to appear)
8. Matinlassi, M., Niemelä, E.: The impact of maintainability on component-based software systems. In: Proceedings of the 29th Euromicro Conference: New Waves in System Architecture. IEEE Computer Society, Belek-Antalya (2003)
9. Swanson, M., Nadya, B., Sabato, J., Hash, J., Graffo, L.: Security Metrics Guide for Information Technology Systems, NIST 800-55, National Institute of Standards and Technology Special Publication #800-26 (2003)
10. Savola, R.: Towards a Security Metrics Taxonomy for the Information and Communication Technology Industry. In: Proceedings of the International Conference on Software Engineering Advances, ICSEA, August 25-31. IEEE Computer Society, Washington (2007)
11. Savolainen, P., Niemelä, E., Savola, R.: A Taxonomy of Information Security for Service-Centric Systems. In: Proc. of the 33rd EUROMICRO Conference on Software Engineering and Advanced Applications, EUROMICRO, pp. 5–12. IEEE Computer Society, Washington (2007)
12. Vaughn, R.B.J., Henning, R., Siraj, A.: Information assurance measures and metrics - state of practice and proposed taxonomy. In: Proceedings of the 36th Hawaii International Conference on System Sciences, HICSS 2003 (2003)
13. Zeiß, B., Vega, D., Schieferdecker, I., Neukirchen, H., Grabowski, J.: Applying the ISO 9126 Quality Model to Test Specifications — Exemplified for TTCN-3 Test Specifications. Software Engineering (2007)
14. ISO/IEC 9126 Software engineering – Product quality, First edition 2001-06-15
15. Guceglioglu, A.S., Demirors, O.: Using Software Quality Characteristics to Measure Business Process Quality. Middle East Technical University, Turkey
16. http://www-01.ibm.com/software/solutions/soa/gov/lifecycle
Automatic Program Parallelization for Multicore Processors

Jan Kwiatkowski and Radoslaw Iwaszyn

Wroclaw University of Technology, Institute of Informatics,
Wybrzeze Wyspianskiego 27, 50-370 Wroclaw, Poland
{jan.kwiatkowski,radoslaw.iwaszyn}@pwr.wroc.pl
Abstract. With the advent of multi-core processors, the problem of designing applications that can efficiently utilize their performance becomes more and more important. Moreover, developing programs for these processors requires some additional, specific knowledge about the processor architecture from the programmers. In multi-core systems efficient program execution is the main issue; it can even happen that switching from sequential to parallel computation leads to decreased performance. The paper gives a short description of SliCer, a hardware-independent tool that parallelizes serial programs automatically, depending on the number of available processing units, by creating the proper number of threads that can later be executed in parallel. Keywords: parallel processing, automatic program transformation.
1
Introduction
Nowadays it becomes harder and harder to enhance single-processor speed, therefore multiplying processing units seems to be the best way to achieve better performance. Because of that, multi-core processors become more and more popular for both commercial and home usage. Developing programs for these processors requires from the programmers some additional, specific knowledge about the processor architecture and parallel programming. Problems such as memory requirements, synchronization and so on have to be taken into consideration. However, the most important problem that should be solved is efficient program execution. For example, when the number of program "blocks" that can be executed in parallel is larger than the number of available processing units, it can even lead to decreased performance due to frequent context switching of the working processes. Therefore the most suitable situation is when the number of program "blocks" that can be executed in parallel is equal to the number of available processing units. Unfortunately, this can make the program non-portable, in the sense that it slows down when executed on a processor with a different number of processing units. Different operating systems provide mechanisms which help programmers and allow more than one processing unit to be used efficiently, but knowledge related to the processor architecture and parallel programming is still needed. On
the other hand, there are different tools available which help programmers in program development, for example SWARM [2], a portable open source parallel programming framework for designing algorithms for multicore systems, and RapidMind [9], an execution environment which takes a serial program as input and then executes it in parallel using a virtual machine. Using both of the above tools still requires the programmers to have some knowledge about the processor architecture and parallel programming. As an alternative, automatic program parallelization can be used. The paper gives a short description of SliCer, a hardware-independent tool that automatically parallelizes serial programs written in C by creating the proper number of threads that can later be executed in parallel. The proposed solution is similar to Octopiler [5], which enables automatic SIMDization of code and automatic user-directed exploitation of shared memory parallelism; however, it needs some transformations that are manually managed by the programmer. Currently SliCer offers only some of the functionalities normally supported by parallelizing compilers. It parallelizes loops [3,4] and can determine parallelism between program blocks. It can be used under Windows or the Unix POSIX standard, but at the moment only Windows is fully supported. The paper is organized as follows. Section 2 briefly presents the design overview. The internal SliCer structure is presented in Section 3. Section 4 describes the parallelization process implemented in SliCer and presents some implementation details. Section 5 illustrates the results of experiments obtained during SliCer evaluation. Finally, Section 6 summarizes the work and discusses further works.
2
Design Overview
Multicore processors give the opportunity of parallel program execution using a number of available processing units. Commonly, when programming multicore processors the shared memory paradigm is used. To take advantage of this, it is necessary to develop programs in such a way that there are exactly as many tasks (e.g., threads) as available processing units. A typical serial program can utilize only one processing unit, so during its execution the other units are idle. From a technical point of view it is easy to change such program behaviour; however, when we take the "program architecture" into consideration it becomes very tough. Moreover, each situation in which threads are created and synchronized can lead to many problems. On the other hand, to take advantage of multicore processors it is necessary to modify the program code in such a way that the most time-consuming program parts, for example loops, will be executed in parallel as separate threads. It is worth emphasizing that dividing a program into threads is not the only possibility of parallelization. It is also possible to execute a process for each task, but in such a case it is also necessary to ensure communication between the processes. The way of performing the parallelization process depends on the available hardware platform. Therefore, regardless of the chosen method, it is necessary to divide the calculation with respect to each thread. For this purpose, algorithms that determine
data dependencies and build a dependency graph on this basis have to be used. This leads to choosing the proper order of instruction execution, to ensure that the result will be the same as that of the corresponding serial program. In general, automatic parallelization is made in the following steps [7]: performing a dependence test to detect potential parallelism, restructuring the program into blocks which can be executed in parallel, generating parallel code for a particular architecture by scheduling the selected program blocks on the available processing units, and then synthesizing a convenient mechanism for achieving parallelism depending on the used system. It is clear that the last step of the above procedure depends on the used system architecture, while the first two are hardware independent with respect to the performed program transformations, which can be different for different platforms. For the parallelization process two relations in the program code can be determined: the first expresses the dataflow (data dependence) and the second expresses the control flow (control dependence). Therefore, during the program parallelization process, in the first step all control dependences have to be identified and handled. In the second step data dependencies are used for determining which program instructions can be executed in parallel. To avoid fine-grained parallelism, which is not suitable for multicore processors with a limited number of processing units, initially the program is divided into blocks. Program blocks are disjoint and have exactly one entry and one or more exit points; of course they may have many predecessors and successors and may even be their own successors [6]. For example, such program blocks as a "loop" block, an "if-then" block, etc. can be distinguished. It is obvious that the program blocks constitute a new program, and determining parallelism between program blocks is much more suitable than between instructions. On the basis of the data and control dependences SliCer creates the dependency graph. Then the possibility of loop parallelization is checked, by determining whether flow, output and antidependences exist and whether they cause a loop-carried dependency. If loop parallelization is possible, the tool calculates the dependency distance to determine how many threads can be created. The number of threads used during program execution is determined depending on the number of available processing units as well. To optimize the loop parallelization process, different loop transformations such as loop splitting, loop distribution and others are considered. Then the parallelism between program blocks is determined. It is achieved on the basis of the dependency graph for both data and control. Depending on the control relation, different program classes have to be distinguished and different ways of handling control dependencies have to be utilized. For example, when the control relation creates a linear sequence of program blocks and all blocks have only one exit point, there are no control dependencies and the data dependencies specified in the dependence graph are used for determining which blocks can be executed in parallel. In the case when not all blocks have only one exit point, due to the possible branches in the program, additional conditions should be fulfilled during dependence analysis. In the most general case, when the control relation is only transitive and reflexive, the possibility of a local symmetry of the control relation should be taken into consideration [6].
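To make the notions of loop-carried dependency and dependency distance concrete, the fragment below is a hand-written illustration (not produced by SliCer): the first loop carries a flow dependence of distance 4, so at most four consecutive iterations can run concurrently, while the second loop carries no dependence and its index range can be split freely among threads.

```c
#include <stdio.h>

#define N 16

int main(void) {
    int a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 1; }

    /* Loop-carried flow dependence: a[i] reads a[i - 4], written four
     * iterations earlier, so the dependency distance is 4.  Iterations
     * i, i+1, i+2, i+3 are mutually independent and could be assigned
     * to at most four threads; iteration i+4 must wait for iteration i. */
    for (int i = 4; i < N; i++)
        a[i] = a[i - 4] + b[i];

    /* No loop-carried dependence: every iteration writes only its own
     * element, so the loop can be split across any number of threads. */
    for (int i = 0; i < N; i++)
        b[i] = 2 * b[i];

    printf("%d %d\n", a[N - 1], b[N - 1]);
    return 0;
}
```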
3
The Internal Structure of SliCer
SliCer makes it possible to execute serial programs in parallel on multicore processors. The tool is implemented in C; as input it takes source code written in C and as output it generates the source code of a parallel program in C/C++. The new source is then compiled by any available C/C++ compiler. This means that, comparing the presented approach with the standard program development process, only one additional step, parallelization, is needed. It is not necessary to introduce any other modification in the program source code. Even if some instructions used in the program cannot be parallelized, they will be left without changes. If the hardware platform used for program execution has to be changed, it is not necessary to modify the source code; the only thing to do is to parallelize the program and compile it. Even when the operating system is changed it is not necessary to modify the source code (assuming there are no direct system commands in the code). Parallelization is done by executing a proper number of threads, which depends on the available processing units. Threads share the memory available to the process, so no extra communication between them is needed. Thread synchronization is done in two different ways: by defining barrier synchronization or, when only two threads should be synchronized, simply by waiting for some event.
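As an illustration of these two synchronization styles (a sketch using POSIX threads, not code emitted by SliCer), the fragment below combines a barrier, at which a group of threads meets, with a condition variable on which one thread simply waits for an event signalled by another.

```c
#include <pthread.h>
#include <stdio.h>

static pthread_barrier_t barrier;          /* group synchronization      */
static pthread_mutex_t   lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t    event = PTHREAD_COND_INITIALIZER;
static int               ready = 0;        /* the "event" flag           */

static void *worker(void *arg) {
    long id = (long)arg;
    /* ... first phase of the computation ... */
    pthread_barrier_wait(&barrier);        /* all threads meet here      */

    if (id == 0) {                         /* thread 0 signals the event */
        pthread_mutex_lock(&lock);
        ready = 1;
        pthread_cond_signal(&event);
        pthread_mutex_unlock(&lock);
    } else if (id == 1) {                  /* thread 1 waits for it      */
        pthread_mutex_lock(&lock);
        while (!ready)
            pthread_cond_wait(&event, &lock);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[2];
    pthread_barrier_init(&barrier, NULL, 2);
    for (long i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 2; i++)  pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    puts("done");
    return 0;
}
```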
Fig. 1. The internal structure of SliCer
The internal structure of SliCer is presented in Fig. 1. The first SliCer unit is the Code Analyzer, which generates the program parse tree. The second one is the Optimizer, which provides two main SliCer functionalities: it performs loop analysis [3,4] and looks for internal parallelism within the program instructions (blocks) [6,7]. It checks the number of available processing units and, depending on this number, decides how many program blocks can be generated and executed in parallel. These two units are hardware independent. The last unit is the Code Generator, which is equipped with a built-in library used during code generation. This solution gives the opportunity to use the tool on different hardware as well as software platforms by simply changing the built-in library, which consists of functions that depend on the hardware and software. Figures 2 and 3 present the two
Fig. 2. The main window of SliCer
main SliCer windows that are used for choosing different configuration settings. Obviously, the number of available processors can be defined; on that basis the tool knows at most how many threads can be used, which is important when transformations are applied to the code. Another section allows the path to the compiler to be set and, when using Windows, to editbin.exe. This is necessary if not only parallel code has to be generated, but a compiled version of the serial program as well. The last section gives the possibility to set the time for each elementary instruction. These settings are used during code execution time estimation. It is also possible to choose different loop transformations. It is important that the time of each simple operation can be set in the configuration menu, because when on the target platform some operation takes more time than on another, this can simply be expressed by changing the time for that operation. All configuration settings are very important during the parallelization process. The configuration depends on the target platform for which the code is compiled and on the current system (e.g., the compiler path). All configuration data are stored in an XML file, so it is possible to change only this file to ensure compilation for a different platform.
4
Program Parallelization with SliCer
SliCer generates as its output code written in C that can, after compilation, be executed in parallel on a multi-core processor. The first SliCer unit is the Code Analyzer (Fig. 1), which accepts as its input any correct serial program written in C and generates the program parse tree as its output. The parser used was
Fig. 3. SliCer configuration window
generated with ANTLR [1], which transforms each grammar rule into a function. Program blocks are the basic elements of the generated parse tree. All language constructs such as loops, conditional statements and others are derived from blocks ("for" loop block, "while" loop block, "if" block, etc.). Each block contains variable declarations, assignments and function calls, and can have subordinate blocks. An assignment can be placed in any type of block and can have an unlimited number of arguments and operators, which are stored in a tree structure as well. This representation is very useful for code generation, because operator priority is handled in the most natural way by simple tree traversal. An argument can be of a simple or complex (e.g., function call) type. There are two types of assignments: variable initializations and non-initial assignments to already declared variables. A function call represents calling any function within the program code. It can have an unlimited number of arguments of any type. Arguments can be as complex as the right side of assignments (other function calls are acceptable as well). Variable declarations are divided into three groups: simple variables (e.g., int x), arrays (e.g., float[] x) and pointers (e.g., char * x). Such a representation is useful for code generation; it is not necessary to have any field in the class describing its type, as all variables can be derived from the same class. When the program parse tree is created, the Optimizer begins the optimization process. It traverses the tree and tries to parallelize each block. First of all it creates the data dependency graph for the block. Then it checks the possibility of loop parallelization (currently only for the "for" loop). Based on the graph, the Optimizer checks which of the available transformations are possible. While transforming a loop into its parallel version, it is often possible to use more than one transformation, and
even if one is chosen it does not mean that another cannot be used. That is why, during the parallelization process, one of the most important decisions is to choose the most suitable transformation set, which allows maximum efficiency to be achieved. In order to make such a choice possible, SliCer applies a few transformation sets to each parallelized block and estimates the execution time. For every set the estimated execution time is stored. This process is performed virtually, which in this case means that the parallelized block can be restored to the original version and the transformations do not have to be applied to the output code. In real programs it is impossible to check every possible transformation set for each parallelized block, because of the large number of possibilities. To simplify this process, SliCer assumes that the best way to choose an appropriate transformation set is to check the modifications which are "the best" from the statistical point of view. When we consider non-nested "for" loops, the most efficient transformations include each type of loop partitioning, currently "loop splitting" and "loop distribution" [8]. The first one breaks the loop into parts which have the same body but iterate over different index ranges. When applying this transformation it is necessary to modify the starting index of the loop and its step; the loop body remains unchanged. After that, the loop parts can be executed simultaneously on multiple processing units (cores). The second transformation, "loop distribution", works in a different way: the loop is divided into multiple loops which iterate over the same index range, but each of them takes only a part of the original loop body. This transformation is not as convenient as the first one. The reason is that the execution times of instructions are different, and it is hard to divide a loop equally into parts with the same execution time. When using the same body it can be assumed that each part is executed in the same time (in most cases), therefore the execution times of the threads should be similar; it is harder to achieve this using "loop distribution". But obviously using this modification is beneficial, and SliCer tries to apply it if it is impossible to use "loop splitting". SliCer assumes that if it is possible to partition a loop it is always the best choice (excluding of course some special cases, e.g., very short loops). This leads to the assumption that sets which include loop partitioning have to be considered first. If none of them is possible, then the tool tries different possibilities. Although it is not always possible to use loop partitioning as the first transformation, it can become possible later (e.g., after the anti-dependency elimination process). Because of that, at each parallelization step SliCer checks the possibility of applying loop partitioning. The algorithm for choosing the most suitable transformation ends when there is no more possible transformation to apply. The only exception is when the tool detects that it has generated so many threads that it is impossible to run them in parallel and one or more processing units would be idle. In such a case it is necessary to divide the extra work among the units to achieve their maximal utilization; this issue will be considered in detail in our further work. Currently five different loop transformations are implemented: loop splitting, loop distribution, loop interchange, loop unrolling and loop reversal. In the next SliCer version more transformations will be taken into account and parallelization of other loops will be included.
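The difference between the two partitioning transformations can be sketched on a simple loop; the code below is a hand-written illustration of the kind of threaded code such transformations lead to, not SliCer's actual output. Loop splitting gives both threads the same body over disjoint index ranges; loop distribution gives each thread the full index range but only one of the two (here independent) statements.

```c
#include <pthread.h>

#define N 1000000
static double a[N], b[N], c[N], d[N];

/* Original serial loop:
 *   for (i = 0; i < N; i++) { a[i] = b[i] + 1.0; c[i] = 2.0 * d[i]; }
 */

/* Loop splitting: same body, disjoint index ranges. */
static void *split_part(void *arg) {
    long t  = (long)arg;                       /* thread id: 0 or 1 */
    long lo = t * (N / 2), hi = (t == 0) ? N / 2 : N;
    for (long i = lo; i < hi; i++) { a[i] = b[i] + 1.0; c[i] = 2.0 * d[i]; }
    return NULL;
}

/* Loop distribution: full index range, body divided between threads
 * (legal here because the two statements touch disjoint arrays). */
static void *dist_part(void *arg) {
    long t = (long)arg;
    if (t == 0) for (long i = 0; i < N; i++) a[i] = b[i] + 1.0;
    else        for (long i = 0; i < N; i++) c[i] = 2.0 * d[i];
    return NULL;
}

int main(void) {
    pthread_t th[2];
    for (long t = 0; t < 2; t++) pthread_create(&th[t], NULL, split_part, (void *)t);
    for (int  t = 0; t < 2; t++) pthread_join(th[t], NULL);
    for (long t = 0; t < 2; t++) pthread_create(&th[t], NULL, dist_part, (void *)t);
    for (int  t = 0; t < 2; t++) pthread_join(th[t], NULL);
    return 0;
}
```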
In the second step the parallelism between program blocks is
determined. Then for each possible transformation the Optimizer estimates its execution time and chooses the fastest one. Firstly it estimates the serial execution time, then the parallel execution time for the different transformations, and finally it calculates the speedup that can be obtained for each of them. In that way SliCer knows whether it is beneficial to make the transformation, and the best transformation can be chosen. To determine the estimated execution time of a block, SliCer adds up the running times of each elementary instruction. To achieve this, each operation (e.g., assignment, function call, mathematical operator) has its execution time assigned. Obviously the calculated time can be used only to determine the speedup, and the calculations are made in virtual units; the real execution time is different from the time estimated by SliCer. That is why, before program modifications are saved, SliCer knows whether they bring benefits or not. If so, the transformations are saved; otherwise they can be withdrawn. To determine the number of generated threads a very simple algorithm is used. After the transformation process the number of threads which can be executed in parallel is known; similarly, the number of processing units is known too (specified in the configuration). If the number of threads that can be executed in parallel is lower than the number of processing units, the number of generated threads is equal to the number of threads defined by the parallelization process. Otherwise the number of generated threads is equal to the number of available processing units; for this purpose, using the estimated thread execution times, different threads are joined together. In general, in this way the work during program execution is equally distributed among the available processing units. When the code is optimized, the modified program parse tree is passed to the Code Generator. It traverses the tree and puts the code for every instruction, block, etc. into the output stream; while generating the code it is formatted. The Generator uses a special module for system calls, because they are different for each operating system (especially the thread creation functions). It is possible to change this module, by implementing a special interface, to generate code for different platforms. Below, the skeleton of the implemented algorithm for loop parallelization is presented. For each "for" loop block apply the following procedure. Firstly, create the data dependency graph and estimate the sequential program
Fig. 4. Speedup when using loop splitting transformation
Fig. 5. The cost of parallelization process
execution time, then check the possibility of applying the "loop splitting" transformation. If it is possible, apply it; otherwise check the possibility of applying the "loop distribution" transformation. The "loop splitting" transformation can be applied when the dependency distance for the loop is lower than the loop step; otherwise it is possible as well, but only in more specific cases, which strictly depend on the data dependency graph. The possibility of using this transformation is checked first because it gives the best speedup in program execution. The modification is made in such a manner that the maximum number of existing cores is used; it is not always possible to use all cores, because of the different loop index ranges. Then apply the "loop unrolling" transformation. The "loop distribution" transformation can be applied when the data dependency graph is divided into independent parts (subgraphs). Each of them can be represented as a new loop with the same index range as the original one but with a different loop body. The number of used processing units depends on the number of existing units and independent parts of the data dependence graph. Then apply the "loop unrolling" transformation to all loops created by the previous steps if there are only a few iterations to be made. Finally, estimate the parallel execution time and calculate the possible speedup. If there is no speedup, roll back the performed transformations; otherwise apply the modifications.
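The skeleton above, together with the thread-count and rollback rules from the previous paragraphs, can be condensed into the following runnable sketch; the capability flags, the fixed thread-handling cost and the numeric estimates are illustrative stand-ins for the dependence-graph tests and the virtual-unit estimator, not SliCer's actual code.

```c
#include <stdio.h>

/* Illustrative selection step for one "for" loop block. */
struct loop_info {
    double t_serial;        /* estimated sequential time (virtual units)     */
    int    can_split;       /* dependency distance permits loop splitting    */
    int    can_distribute;  /* dependency graph has independent subgraphs    */
    int    parallel_parts;  /* parts produced by the chosen transformation   */
};

static double parallel_estimate(const struct loop_info *l, int cores) {
    /* generated threads = min(parallel parts, available processing units) */
    int parts = l->parallel_parts < cores ? l->parallel_parts : cores;
    return l->t_serial / parts + 2.0;      /* plus a fixed thread overhead  */
}

int main(void) {
    struct loop_info loop = { 100.0, 1, 0, 8 };
    int cores = 4;

    const char *choice = loop.can_split      ? "loop splitting"
                       : loop.can_distribute ? "loop distribution"
                                             : "none";
    double t_par = parallel_estimate(&loop, cores);

    if (t_par < loop.t_serial)
        printf("%s accepted, estimated speedup %.2f\n",
               choice, loop.t_serial / t_par);
    else
        printf("%s rejected, rolling back\n", choice);
    return 0;
}
```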
5
Results of Experiments
Different experiments have been performed to confirm the usefulness of the designed and implemented tool. During the experiments two different hardware platforms have been used: Intel Xeon X3040 1.86 GHz and Intel Core 2 Quad Q6600 2.4 GHz. The presented results are the averages from series of 100 experiments performed under various conditions. The obtained speedup has been measured for different loop transformations as well as for some example programs. The analysed loops consist of different numbers of iterations, from 10 to 100. The average execution time of one loop run has been about 1 second. The real measured speedup has been compared with the estimated speedup calculated during the parallelization process. For all
performed experiments the results have been very similar: for the two most important transformations, loop splitting and loop distribution, and for example programs with loops, the speedup has been close to the number of used processing units. Figure 4 presents the results obtained for the loop splitting transformation. The left diagram shows the results when the program execution time has been correctly estimated by SliCer, and the right diagram when the estimation has been incorrect. What is very important, even when the estimation process has not been correct it does not slow down the program execution. Figure 5 shows the cost of the parallelization process. It can be determined that the overhead related to the parallelization process is not very large; the parallelization time is close to the compilation time.
6
Conclusions and Future Work
The project is an ongoing study, therefore presently the tool covers only a part of the functionalities that are normally supported by parallelizing compilers. The prototype still misses a lot of features and not all types of instructions can be parallelized; they will be implemented in further versions of the tool. However, the experiments performed using the first prototype indicate that the presented tool will be very useful for standard multi-core processors. The speedup obtained for programs parallelized by the tool has been close to the number of available processing units. Additionally, when using the tool, no specific knowledge about the used processors or experience in parallel programming is required to build the parallel application. Acknowledgements. The research presented in this paper has been partially supported by the European Union within the European Regional Development Fund program no. POIG.01.03.01-00-008/08.
References
1. ANTLR v.3 Documentation (2007), http://www.antlr.org/wiki/display/ANTLR3/ANTLR+v3+documentation
2. Bader, D.: SWARM Framework, http://www.cc.gatech.edu/~bader/papers/SWARM.html
3. Banerjee, U.: Loop Transformations for Restructuring Compilers: The Foundation. Kluwer Academic Publishers, Dordrecht (1993)
4. Banerjee, U.: Loop Parallelization. Kluwer Academic Publishers, Dordrecht (1994)
5. Eichenberger, A.J., et al.: Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture. IBM Sys. J. 45(1), 59–84 (2006)
6. Kwiatkowski, J.: Automatic program Restructuring. In: Melecon 1991 International Conference, IEEE Catalog No 91CH2964-5, pp. 1041–1044 (1991)
7. Polychronopoulos, C.D.: Parallel Programming and Compilers. Kluwer Academic Publishers, Dordrecht (1988)
8. Allen, R., Kennedy, K.: Optimizing Compilers for Modern Architectures: A Dependence-Based Approach. Morgan Kaufmann Pub., Academic Press (2002)
9. RapidMind Development Platform, http://www.rapidmind.com
Request Distribution in Hybrid Processing Environments

Jan Kwiatkowski, Mariusz Fras, Marcin Pawlik, and Dariusz Konieczny

Institute of Informatics, Wroclaw University of Technology,
Wybrzeze Wyspianskiego 27, 50-370 Wroclaw, Poland
{jan.kwiatkowski,mariusz.fras,marcin.pawlik,dariusz.konieczny}@pwr.wroc.pl
Abstract. In recent years the evolution of software architectures led to the rising prominence of the Service Oriented Architecture (SOA) concept. This architecture paradigm facilitates building flexible service systems. The services can be deployed in distributed environments, executed on different hardware and software platforms, reused and composed into complex services. In the paper a SOA request analysis and distribution architecture for hybrid environments is proposed. The requests are examined in accordance with the SOA request description model. The functional and non-functional request requirements, in conjunction with monitoring of execution and communication link performance data, are used to distribute requests and allocate the services and resources. Keywords: SOA architecture, SOA request distribution.
1
Introduction
The increasing significance of information technology in contemporary business is determined by two factors: new concepts of modelling processes and software architectures, and the development of technical means, especially in the area of computing resources. New hybrid environments are designed to utilize the power of classical processing architectures together with new architectural developments. Classical processors, used individually or grouped in clusters and grids, cooperate with novel architectures: general-purpose graphics processing units (GPGPU), field-programmable gate arrays (FPGA) or general-purpose multiprocessor hybrid architectures [4,3]. The versatility of classical processors is combined with the computational power offered by novel architectures to meet the requirements of contemporary scientific and business applications. In recent years the evolution of software architectures led to the rising prominence of the Service Oriented Architecture (SOA) concept. This architecture paradigm facilitates building flexible service systems. The services can be deployed in distributed environments, executed on different hardware and software platforms, reused and composed into complex services. In [6] the following SOA definition can be found: "SOA is an architecture approach
for defining, linking, and integrating reusable business services that have clear boundaries and are self-contained with their own functionalities. Adopting the concept of services SOA takes IT to another level, one that's more suited for interoperability and heterogeneous environments. A service is a function that is well-defined, self-contained, and does not depend on the context or state of other services". Taking the above requirements into account, a new software architecture has to be developed. In the paper a SOA request analysis and distribution architecture for hybrid environments is proposed. The requests are examined in accordance with the SOA request description model. To ensure efficient resource utilization, incoming SOA requests are attributed to execution classes. The functional and non-functional request requirements, in conjunction with monitoring of execution and communication link performance data, are used to distribute requests and allocate the services and resources. The paper is organized as follows. Section 2 briefly presents the design overview. The general architecture of the SOA request distribution system is presented in Section 3. Sections 4 and 5 describe two parts of the proposed architecture: the execution analyser and the matching module. Finally, Section 6 summarizes the work and discusses further works.
2
Design Overview
SOA services are characterized by mobility, reusability and a large degree of execution independence from the underlying hardware infrastructure. The analysis of historical and ongoing SOA service execution data, together with the system monitoring results and the information about the service requirements, can be utilized to distribute the incoming SOA service execution requests. The goal of the project is to design and implement the SOA system infrastructure architecture. In the proposed architecture (Fig. 1) the service requirements combined with the system monitoring data are utilized to ensure efficient SOA service execution, performed in accordance with the requirements imposed. To achieve this goal, methods of semantic description of services, resources and requirements were implemented. Ontology-based dependence models of the infrastructure, services and execution requirements were designed. Algorithms were developed to transform execution and requirements information into execution flow knowledge. A recommendation and description method was proposed for selected SOA/SOKU services utilization in SOA applications. Different aspects related to systems quality assurance, reliability, availability, safety and security were taken into consideration during the analysis process. In the general system architecture three main modules can be distinguished: the Execution Virtualization module, the Request and Execution Knowledge Aggregator module and the Request Distribution module. The Execution Virtualization module implements an SOA service execution interface, used to hide the underlying hardware-specific details. The virtualization makes it possible to utilize a variety of hardware resources dynamically selected to ensure efficient SOA service execution performed in accordance with the requirements. The Request and Execution
Fig. 1. The general view of the system
Knowledge Aggregator module aggregates, interprets and utilizes the monitoring data to define the SOA service execution location and execution details using a built-in prediction system. The Request Distribution module analyzes the incoming SOA requests, compares them to the historical data acquired from the Request and Execution Knowledge Aggregator module and decides which resources will be used for their execution by the Execution Virtualization module.
3
SOA Request Distribution System
At the highest level of abstraction any service can be seen as a complex service. A complex service has to be decomposed into atomic services; basically, a complex service can be considered as a graph of atomic services. An atomic service cannot be decomposed into smaller parts. During decomposition only the functional parameters of services are taken into account. Because of that, at some moment it is necessary to choose versions of the atomic services which preserve
the expected non-functional properties. For every version of an atomic service it is necessary to choose an instance of that version from a set of instances which can be registered in different localizations. However, the registration only informs that a chosen system can deliver an atomic service with specified functional and non-functional properties; the system itself can run the service on different realizations. This means that during the servicing of a request the system has to choose a realization of an instance of the service. Our system starts its work from receiving an atomic service request. The information systems designed with the SOA paradigm work both in local and wide area networks, particularly built with Internet and grid resources. For the SOA service client, there is no need to care about the localization and execution means of the service it wants to use. However, the services are well described in the system and their localizations are known. These can change dynamically; however, this process is not frequent, and usually the set of localizations of a given service is constant for some period [6]. The SOA paradigm says that a given kind of service can be obtained from different service providers. Because the effectiveness of different SOA servers, as well as the distance to them (and thus the communication costs), can be different, the quality of service delivery (especially the time of delivery) can be significantly different too [9]. To deliver SOA services more effectively we propose the architecture of a SOA request distribution system. The assumptions for the proposed distribution system for SOA services are the following:
– the SOA services are considered the same if they offer the same functionality,
– a given SOA service is available in different localizations (execution systems) distributed in the network,
– the set of execution systems for a given SOA service is known and constant,
– the set of SOA services (that are to be distributed) is limited and known.
The general model of the SOA request distribution system assumes that there is a special node at the edge of the client area, the SOA request broker (Fig. 2), which manages the distribution of requests for SOA services. For a given service there is a known constant set of $M$ execution servers $I = \{I_1, I_2, \ldots, I_M\}$, where $I_m$ is the identifier of server $m$, and $M$ communication links $s_m$. The client can be an end user or a SOA service that composes its services with the use of other services, e.g., building complex services from atomic ones. We assume that the distance between the client and the broker is negligible. The service delivery time consists of two parts: the data transfer time $t_T$ of the client request and server reply, and the service request execution time $t_O$ (the preparation of the reply data at the execution server, i.e., all back-end processing of the request). In general, even for a given class of request, these two values are not constant. They depend on dynamic changes of the communication link parameters as well as the current state of the execution server (e.g., its load). Our aim is to distribute a client's request to the server for which the service delivery time will be the smallest. Hence, the broker has to support the following functionality:
Fig. 2. General distribution system model
– monitoring of the servicing of client requests,
– prediction of selected communication link parameters,
– prediction or registration (in the case of the cooperation model) of selected remote server parameters,
– estimation of the request transfer times and request execution times for all known servers, on the basis of the adopted model of communication links and execution servers,
– if necessary, adaptation of the adopted model.
In Fig. 3 we propose the general architecture of the broker that fulfils all the functionality necessary to determine, for every request $x_i$ arriving at moment $n$, the destination server which should serve this request in order to satisfy the time-based distribution criteria. The decision $u_n$ of the choice of one of the $M$ servers is performed on the basis of the vector $\hat{T}^i_{T,n}$, i.e., the vector of estimated transfer times $\hat{t}^{m,i}_{T,n}$ of request $x_i$ to each server $m$, and the vector $\hat{T}^i_{O,n}$, i.e., the vector of estimated execution times $\hat{t}^{m,i}_{O,n}$ of request $x_i$ at each server $m$, $m \in \{1, \ldots, M\}$. These times are estimated with the aid of models of the execution servers and communication links, built for every possible class $c_i$ of requests. If the client of the broker is a SOA service, the class of the request may be provided by the client; in the other case some classification procedure ought to be applied.
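Assuming the two estimate vectors are already available, the resulting decision rule reduces to picking the server with the smallest estimated delivery time, i.e., the smallest sum of the estimated transfer and execution times. A minimal sketch (not the broker's actual implementation):

```c
#include <stddef.h>
#include <stdio.h>

/* Choose the destination server for one request: the index m that
 * minimizes the estimated delivery time  t_hat_T[m] + t_hat_O[m]. */
static size_t choose_server(const double *t_hat_T, const double *t_hat_O,
                            size_t M) {
    size_t best = 0;
    for (size_t m = 1; m < M; m++)
        if (t_hat_T[m] + t_hat_O[m] < t_hat_T[best] + t_hat_O[best])
            best = m;
    return best;
}

int main(void) {
    /* example estimates for M = 3 servers (milliseconds) */
    double t_hat_T[] = { 12.0, 40.0,   8.0 };
    double t_hat_O[] = { 90.0, 35.0, 120.0 };
    printf("request routed to server %zu\n",
           choose_server(t_hat_T, t_hat_O, 3));
    return 0;
}
```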
Fig. 3. General architecture of the SOA request broker
The key parts of the broker are the estimation modules of the communication links and execution servers. They estimate the corresponding times according to fuzzy-neural models of the links and servers, similar to those in [1]. Each module is built of $M$ models of servers (or links, each for one destination server or communication link) that work as controllers. The input of the model (Fig. 4) is the class $c_i$ of the request $x_i$ that arrives at moment $n$, the server state parameter vector $L^m_n$ of server $m$ known at moment $n$, and the completion times of request executions up to moment $n$. The vector $L^m_n$ may consist of different parameters characterizing the current state of the server; we take two into account: the load of the server and its efficiency. The output of each model block of the module is the estimated execution time of the request at the given server.
Fig. 4. SOA Server Estimation Module
The communication link estimation is very similar. The difference is the input $V^m_n$ (instead of $L^m_n$), which is the vector of the appropriate link parameters, and the measured real transfer times $\hat{t}^{m,i}_{T,n}$ of prior requests (before moment $n$). The vector $V^m_n$ consists of two parameters that are taken into consideration: the forecasted link latency (namely the TCP connect time) and the forecasted link throughput. These two parameters are derived from periodic measurements of the latency and link throughput, using prediction algorithms based on time series analysis. Each model (server and link) is built for each destination and for each class of request. It is designed as a fuzzy-neural controller (Fig. 5) accomplished with the use of a 3-layered neural network [1,7,2]. The first layer of the network constitutes the fuzzification block. For the link model, the inputs are the forecasted link latency $\hat{t}_{TCPC,n}$ at moment $n$ and the forecasted link throughput $\hat{t}_{Th,n}$. The outputs of the fuzzification block are the grades of membership of the latency, $\mu_{t_{TCPC}}$, in the defined fuzzy sets for latency, and the grades of membership of the throughput, $\mu_{t_{Th}}$, in
the defined fuzzy sets for throughput. The logic block inferences the output. It computes the grades of membership $\mu_R$ of the rules, which are of the format proposed in [1]:

IF $t_{Th} = Z_{Th,a}$ AND $t_{TCPC} = Z_{TCPC,b}$ THEN $y = t_{T,c}$

where $Z$ denotes the fuzzy sets for throughput ($Th$) and latency ($TCPC$), $t_T$ is the estimated transfer time (output $y$), and $a$, $b$, $c$ are the indexes of the appropriate fuzzy sets and output.
Fig. 5. Communication link model as a fuzzy-neural controller
After defuzzification (using the weight method) the estimated transfer times of the request are derived and, together with the measured actual values of the transfer, the network parameters are tuned using the back-propagation method [5].
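The three blocks of such a controller can be illustrated with a deliberately small numeric sketch: two triangular fuzzy sets per input, four rules of the above format, and weighted-average defuzzification. The membership shapes and rule outputs are invented for the example and the back-propagation tuning is omitted; the actual controller is realized as a tunable three-layer neural network as described in [1].

```c
#include <stdio.h>

/* Triangular membership function with peak at b on [a, c]. */
static double tri(double x, double a, double b, double c) {
    if (x <= a || x >= c) return 0.0;
    return (x < b) ? (x - a) / (b - a) : (c - x) / (c - b);
}
static double fmin2(double x, double y) { return x < y ? x : y; }

/* Estimate the transfer time [ms] from forecasted latency [ms] and
 * throughput [Mbit/s] using rules of the form
 *   IF throughput = Z_Th,a AND latency = Z_TCPC,b THEN y = t_T,c     */
static double estimate_transfer(double latency, double throughput) {
    /* fuzzification */
    double lat_low  = tri(latency,   -50, 0, 100), lat_high = tri(latency,    0, 100, 250);
    double th_low   = tri(throughput, -5, 0,  50), th_high  = tri(throughput, 0,  50, 120);
    /* rule firing strengths (AND realized as min) and rule outputs t_T,c */
    double w[4] = { fmin2(th_high, lat_low), fmin2(th_high, lat_high),
                    fmin2(th_low,  lat_low), fmin2(th_low,  lat_high) };
    double y[4] = { 20.0, 120.0, 200.0, 600.0 };
    /* defuzzification: weighted average */
    double num = 0.0, den = 0.0;
    for (int r = 0; r < 4; r++) { num += w[r] * y[r]; den += w[r]; }
    return den > 0.0 ? num / den : 0.0;
}

int main(void) {
    printf("estimated transfer time: %.1f ms\n", estimate_transfer(60.0, 30.0));
    return 0;
}
```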
4
Parallel Execution Performance Measurements
The request distribution algorithm described in the previous sections requires information about the resource load and utilization. A monitoring system was designed to provide this information [8]. The system, called rsus, is a distributed environment dedicated to performing low-overhead, external measurements of the necessary data. The measurements are performed in a non-invasive manner, not requiring any modifications to the service being measured. The rsus environment is dedicated to the Linux operating system, the most popular OS in HPC applications. The modularity of the rsus architecture makes it relatively easy to port it to other operating systems. The environment offers two execution modes. In the basic mode it utilizes the data gathered by the operating system in the /proc virtual directory hierarchy to ensure a low-overhead measuring method. In the extended mode rsus utilizes the communication performance data gathered by the mpiP profiler [10]. For every execution environment, the wall-clock time is measured together with the CPU time, the communication time and the execution environment load. The rsus environment is built from a set of applications. The main application, rsus, is used to perform the measurements and has to be present in every monitored execution environment. It can be used to start and monitor a SOA service or, alternatively, to attach to a working service and perform the measurements. The rsus application can send the monitoring results in a continuous, on-line fashion or in an off-line manner after the SOA request is processed by the service. The monitoring results are kept in the RAM memory of the environment being monitored; to avoid memory exhaustion, the data can be periodically offloaded to the hard drive.
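The kind of low-overhead /proc sampling used in the basic mode can be illustrated as follows (a simplified sketch, not rsus source code): the one-minute load average is read from /proc/loadavg, and the CPU time of a monitored process is taken from the utime and stime fields of /proc/<pid>/stat.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* One-minute load average from /proc/loadavg. */
static double read_loadavg(void) {
    double load = 0.0;
    FILE *f = fopen("/proc/loadavg", "r");
    if (f) { fscanf(f, "%lf", &load); fclose(f); }
    return load;
}

/* CPU time (user + system, in seconds) of a process from /proc/<pid>/stat. */
static double read_cpu_seconds(pid_t pid) {
    char path[64], line[1024];
    snprintf(path, sizeof path, "/proc/%d/stat", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f || !fgets(line, sizeof line, f)) { if (f) fclose(f); return -1.0; }
    fclose(f);
    char *p = strrchr(line, ')');          /* skip the pid and (comm) fields */
    if (!p) return -1.0;
    unsigned long utime = 0, stime = 0;
    /* state ppid pgrp session tty tpgid flags minflt cminflt majflt cmajflt utime stime */
    sscanf(p + 2, "%*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
           &utime, &stime);
    return (double)(utime + stime) / sysconf(_SC_CLK_TCK);
}

int main(void) {
    printf("load: %.2f  cpu time of self: %.2f s\n",
           read_loadavg(), read_cpu_seconds(getpid()));
    return 0;
}
```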
Fig. 6. Rsus system architecture
The rsus application can be utilized in an autonomous fashion, but in the proposed system it constitutes a part of the fully assembled monitoring installation. The data gathered is sent to the control application rsus-mom. The data transfer path is established by the rsus application sending a UDP broadcast request consisting of its execution environment address and receiving in response the rsus-mom address. Rsus-mom controls all the rsus applications in the system; it is thus possible to analyze the information coming from all the services that participate in the request execution. To distinguish the information coming from different requests, each service execution data set sent from the rsus application has a unique identifier attached. The identifier can be determined automatically or set manually as an rsus application execution parameter. The data acquired by the rsus-mom application is saved on the hard drive for further analysis. It can later be processed by a dedicated analyzer application, rsusrp. To integrate the rsus monitoring system with an existing execution environment, a set of execution scripts can be used. Currently the execution scripts are available for PBS-compatible parallel execution environments. The rsus monitoring system architecture is depicted in Fig. 6.
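The discovery handshake can be sketched as follows; the port number and message format are invented for the example, and only the broadcast-then-reply pattern reflects the description above.

```c
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Discover the controller: broadcast our own address, read the reply.
 * Port 9999 and the text messages are made up for this illustration. */
int main(void) {
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    int yes = 1;
    setsockopt(s, SOL_SOCKET, SO_BROADCAST, &yes, sizeof yes);

    struct sockaddr_in dst = { 0 };
    dst.sin_family      = AF_INET;
    dst.sin_port        = htons(9999);
    dst.sin_addr.s_addr = htonl(INADDR_BROADCAST);

    const char *hello = "rsus-here 192.168.1.17";   /* our address         */
    sendto(s, hello, strlen(hello), 0, (struct sockaddr *)&dst, sizeof dst);

    char reply[128];                                 /* blocks until reply */
    ssize_t n = recvfrom(s, reply, sizeof reply - 1, 0, NULL, NULL);
    if (n > 0) {
        reply[n] = '\0';
        printf("controller answered: %s\n", reply);  /* rsus-mom address   */
    }
    close(s);
    return 0;
}
```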
5
Static Matching Module
The aim of the static matching module is to find the best execution environment for a request. Because we have heterogeneous hybrid hardware with a varying load, we need to check on which units we can serve the request, and decide on which one we will finally execute the code. We assume that we can use virtualizers with which we can simulate hardware or operating systems. This means that to actually run a service we need: hardware, a virtualizer, an operating system (with libraries) and the service (executable code). To prepare all these elements we have to find all possible such couples and then decide which one to use. To generate the couples, the module first checks what hardware we have. Then every unit of hardware has to be connected with the proper virtualizers which
service
Fig. 7. Matching module architecture
can be executed on that hardware. In a description of registered virtualizers there are some rules which define the requirements of it (among of memory, kind of processor, etc.) Such a pair constitutes a virtual hardware (VH). On any of such virtual hardware it can be run an operating systems (it is called a virtual operating systems - VOS). In description of a VOS there are rules about requirement of virtual hardware needed for running the VOS. A pair of VOS and VH constitutes possible execution environment (EE). Then we have to connect service realization with desirable non-functional properties with adequate execution environments. In description of service realization there are rules which define the non-functional properties depending on properties of virtual operating system. At this moment we have a set of possible concrete realization (hardware, virtualizer, virtual operating system, service realization). Depending on load balance and on execution analysis of previous execution of the service realization the module choose one of the concrete realization and propagate it to execution virtualization module with all information needed to execute. Because not all hardware have their virtualizers, there will be also a dummy virtualizer. The virtualizer will vitrualize nothing and it will be in special way interpreted by execution module. This means that the rules for virtualization will change nothing, but this virtualizers are also special marked to inform the execution module to not run a virtualizer. It can also mean that the operating system is still run on such processor and it is only needed to execute a service realization. The system of information about hardware, virtualizers, virtual systems and service realizations is very flexible. Depending on registering the information of aforementioned elements the whole system can be viewed as a one instance of service or not. The information can be registered in such way that an executor
can distinguish between different types of hardware but without distinguishing computation units, or in another way, such that a service will be uniquely assigned to exactly one unit of hardware. This allows different policies of using the hardware facilities for the execution of service requests.
6
Conclusions and Future Work
In the paper the architecture of a newly designed system for request distribution in SOA environments is described. A prototype implementation of the system was created, comprising the Request Distribution module, the Static Matching module, the Dynamic Execution Monitoring module and a preliminary implementation of the Execution Virtualization module. Initial tests were performed, confirming the correct functioning of the system. Currently a set of performance tests is being conducted to ensure efficient, low-overhead operation of the final system implementation. Acknowledgements. The research presented in this paper has been partially supported by the European Union within the European Regional Development Fund program no. POIG.01.03.01-00-008/08.
References
1. Borzemski, L., Zatwarnicka, A., Zatwarnicki, K.: Global distribution of HTTP requests using the fuzzy-neural decision-making mechanism. In: Nguyen, N.T., Kowalczyk, R., Chen, S.-M. (eds.) ICCCI 2009. LNCS, vol. 5796, pp. 752–763. Springer, Heidelberg (2009)
2. Jain, L.C., Martin, N.M.: Fusion of neural networks, fuzzy sets, and genetic algorithms industrial applications. CRC Press LLC, London (1999)
3. Halfhill, T.: Parallel Processing With CUDA, Microprocessor Report (2008)
4. Kahle, J., Day, M., Hofstee, J.C., Maeurer, T., Shippy, D.: Introduction to the cell multiprocessor. IBM Journal of Research and Development (2005)
5. Keong-Myung, L., Dong-Hoon, K., Hyung, L.: Tuning of fuzzy models by fuzzy neural networks. Fuzzy Sets and Systems 76(1), 47–61 (1995)
6. Mabrouk, M.: SOA fundamentals in a nutshell, IBM Corp. (2008)
7. Mamdani, E.H.: Application of Fuzzy Logic to Approximate Reasoning Using Linguistic Synthesis. IEEE Trans. on Comp. C-26(12), 1182–1191 (1977)
8. Pawlik, M.: Execution Time Decomposition Method in Performance Evaluation, Science and supercomputing in Europe - report 2007. In: CINECA (2007)
9. Williams, W., Herman, R., Lopez, L.A., Ebbers, M.: Implementing CICS Web Services, IBM Redbook (2007)
10. Vetter, J., McCracken, M.: Statistical Scalability Analysis of Communication Operations in Distributed Applications. In: Proc. ACM SIGPLAN Symp. (2001)
Vine Toolkit - Grid-Enabled Portal Solution for Community Driven Computing Workflows with Meta-scheduling Capabilities
Dawid Szejnfeld1, Piotr Domagalski2, Piotr Dziubecki1, Piotr Kopta1, Michal Krysinski1, Tomasz Kuczynski1, Krzysztof Kurowski1, Bogdan Ludwiczak1, Jaroslaw Nabrzyski2, Tomasz Piontek1, Dominik Tarnawczyk1, Krzysztof Witkowski2, and Malgorzata Wolniewicz1
1 Poznan Supercomputer and Networking Center
Tel.: (+48 61) 8582174, Fax: (+48 61) 8582151
{dejw,piotr.dziubecki,pkopta,mich,tomasz.kuczynski,krzysztof.kurowski,bogdanl,piontek,dominikt,gosiaw}@man.poznan.pl
2 FedStage Systems
Tel.: (+48 61) 6427197, Fax: (+48 61) 6427104
{piotr.domagalski,krzysztof.witkowski,jarek.nabrzyski}@fedstage.com
Abstract. In large-scale production environments, the information sets on which calculations are performed come from various sources. In particular, some computations may require information obtained as a result of previous computations. Workflow description offers an attractive approach to formally deal with such complex processes. The Vine Toolkit [1] addresses some major challenges here, such as the synchronization of distributed workflows and establishing a community-driven Grid environment for seamless result sharing and collaboration. To accomplish these goals, Vine Toolkit offers integration on different layers, starting from rich user-interface web components, through integration with a workflow engine and Grid security, and ending with built-in meta-scheduling mechanisms that allow IT administrators to perform load balancing automatically among computing clusters and data centers to meet peak demands. As a result of this particular project a complete solution has been developed and delivered.
Keywords: Vine Toolkit, Vine, GRMS, Flowify, wow2green, GRIA, middleware.
1
Introduction
Vine Toolkit (in short, Vine) is a modular, extensible Java library that offers developers an easy-to-use, high-level Application Programmer Interface (API) for Grid-enabling applications, and even more, as will be described in this article. It was developed within EU funded projects like OMII-Europe [2] or
BEinGRID [3] and the Polish infrastructural project PL-Grid [4]. It was used to build an advanced portal solution integrating many different grid technologies within the BEinGRID business experiment called wow2green (Workflows On Web2.0 for Grid Enabled infrastructures in complex Enterprises) [5]. This solution will be described to illustrate the broad scope of Vine usage and its advanced features.
1.1 Problem Description
The general problem solved within the wow2green business experiment is presented in this paragraph. This use case is just an illustration of how and where the Vine Toolkit can be used. The article focuses mainly on Vine itself and only on the business logic layer, without going into the end-user interface and functionality description.

Big companies consist of several units generating a variety of information sets and performing advanced computations. Workflow description offers an attractive approach to formally deal with such complex processes. Unfortunately, existing workflow solutions are mostly legacy systems which are difficult to integrate in dynamically changing business and IT environments. Grid computing offers flexible mechanisms to provide resources in an on-demand fashion. The wow2green system - officially called Flowify [6] and based on the Vine Toolkit - is a Grid solution managing computationally intensive workflows used in advanced simulations. Each department within a big enterprise, or a small company in a supply chain:
– performs unsynchronized computations among departments,
– exchanges computing results independently with other project participants, so input data is re-entered in different formats, output data is lost or overwritten,
– overloads computing resources during deadlines.

Flowify helps users synchronize data processing for execution and collection of results in an automated fashion. Easy-to-use Flowify workflow interfaces are provided as intuitive Web applications or integrated with commonly used tools. Thus many users can simultaneously and dynamically form workspaces, share data for their computing experiments and also manage the whole process of advanced calculations. Built-in GRMS meta-scheduling mechanisms provided by Flowify allow IT administrators to perform load balancing automatically among computing clusters and data centers to meet peak demands.
1.2 Main Requirements and Demands
The following main requirements have been identified within the wow2green experiment:
– enable workflow executions via the wow2green portal (the portal-based solution should provide a suitable, user-friendly interface for workflow executions using Web 2.0 style, easy-to-use web interfaces; users should have fully transparent access to underlying computing resources so they can take advantage of high-level workflow management)
– provide detailed information and feedback for users about their computing workflow simulations (every workflow is a kind of template for creating a computing simulation which can later be executed on different computing resources; an external workflow engine, Kepler [7], was selected because it supports the provenance mechanisms required to search and discover templates in created workflows; these features are also available via the wow2green portal)
– enable more efficient platform usage by brokering the jobs defined in workflow nodes (the GRMS [8] broker is used as a meta-scheduling and brokering component for load balancing among computing resources; thus, end users do not have to select computing resources manually and the overall utilization of computing resources is improved)
– easy-to-use stage-in and stage-out of simulation data (users are able to upload and download simulation data and workflows to and from the portal in an intuitive way)
– easy-to-use, Kepler-like graphical user interfaces for workflow management and control (to provide user-friendly web interfaces we selected a new web technology called Adobe Flex, which enables us to implement various web applications for easy management and control of workflows via the portal)
– support for new workflow definitions (additional web tools will be developed to enable users to create, share and modify workflows; users (workflow creators) will be able to share created workflows with other users interested only in performing simulations of existing workflows)
– provide a repository of workflow definitions.
2
Flowify Architecture
The overall Flowify architecture is presented below; all components are described to show their function and to reveal the complexity of the prepared environment (Fig. 1). The identified components are:
– Gridsphere portal framework - a portal framework that provides an open-source, portlet-based Web portal. GridSphere enables developers to quickly develop and package third-party portlet web applications that can be run and administered within the GridSphere portlet container. The wow2green portal is based on this framework.
– Vine Toolkit - used as the business model layer within the portal; applications created for the scenario use the Vine mechanisms to interact with grid services and the workflow engine, and the integrated Adobe Flex with BlazeDS technology allows advanced web applications to be created. Vine Toolkit components used in the portal:
o Workflow Manager - a general component used to interact with different workflow management solutions through a common API; in the case of the wow2green solution a Kepler plugin was introduced
Fig. 1. Wow2green system (Flowify) overall architecture
o Role Manager - a role management component which allows role-based access within the portal to be managed with respect to GUI elements and accessible actions, and also allows role-based access policies to be defined for data managed through the logical file manager component
o File Manager - a file management component which allows interaction with different data management systems by means of a common API
o Portal Security Component - used for managing portal user certificates and proxies; together with the certificate generation module it allows automatic certificate generation for registered portal users
o Portal User Management - used for portal user management and user registration in the portal; together with the MS Active Directory plugin it allows existing accounts in the given organisation to be used, making integration with the existing IT environment easier (it is also possible to use an existing LDAP service if required)
– Vine web service - a web service which exposes a subset of the functionality of the Vine Toolkit component to external systems/services, regarding the Job and File Manager; the web service is used by Kepler actors to interact with grid-enabled resources
– Kepler workflow engine - the core part of the Kepler/Ptolemy standalone application/framework for workflow management; it is used in the wow2green solution as the primary workflow management engine, and the Kepler workflow description called MoML is used within the system
– Kepler engine web service - a web service which exposes the workflow engine functionality to external systems/services; the portal workflow manager component uses it as the workflow management service
– Set of Kepler's Vine actors - the base components of the workflow definitions, used to submit jobs and manage data through the Vine Toolkit service layer
– GRMS broker with a web service interface - a job broker component with advanced scheduling technology; scheduling policies can easily be exchanged depending on the requirements
– GRIA middleware [9] - grid middleware: a service-oriented infrastructure (SOI) designed to support B2B collaborations through service provision across organizational boundaries. The following services are used in the wow2green system: Job Service - job management, Data Service - data management, Membership Service - to manage groups of users relevant to the roles defined in the portal
– Griddle Information Service - a web service which exposes information about available GRIA resources and batch queue statistics.
3
Vine Toolkit Portal Based Solution
Vine Toolkit (Fig. 2) is a modular, extensible Java library that offers developers an easy-to-use, high-level Application Programmer Interface (API) for Grid-enabling applications. Vine can be deployed for use in desktop, Java Web Start, Java Servlet 2.3 and Java Portlet 1.0 environments with ease. Additionally, Vine Toolkit supports a wide array of middleware and third-party services. Using the Vine Toolkit, one composes applications as collections of resources and services for utilizing those resources (the basic idea in Vine is a generic resource model: any service and data source can be modeled as an abstract entity called a resource and can be integrated with web applications using high-level APIs). The Vine Toolkit makes it possible to organize resources into a hierarchy of domains to represent one or more virtual organizations (VOs). Vine offers security mechanisms for authenticating end-users and authorizing their use of resources within a given domain. Other core features include an extensible model for executing tasks (every action is persisted as a Task) and transparent support
Fig. 2. Vine Toolkit high level architecture in portal deploy scenario
for persisting information about resources and tasks with in-memory or external relational databases. The Adobe Flex [10] and BlazeDS [11] technologies allow advanced and sophisticated web applications to be created, similar to many stand-alone GUIs.
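The following minimal sketch illustrates the generic resource model described above: resources grouped into a hierarchy of domains representing VOs, and every action persisted as a Task. The class shapes below are assumptions made only for illustration; the actual Vine classes are richer and may be named differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shapes: the concepts (abstract resources, a domain hierarchy, persisted Tasks)
// come from the text, the concrete classes do not.
abstract class Resource {
    final String name;
    Resource(String name) { this.name = name; }
}
class ComputeResource extends Resource { ComputeResource(String n) { super(n); } }
class StorageResource extends Resource { StorageResource(String n) { super(n); } }

class Domain {                                   // a node in the virtual-organization hierarchy
    final String name;
    final List<Domain> subDomains = new ArrayList<>();
    final List<Resource> resources = new ArrayList<>();
    Domain(String name) { this.name = name; }
}

class Task {                                     // every action is persisted as a Task
    enum State { PENDING, RUNNING, DONE, FAILED }
    final String description;
    State state = State.PENDING;
    Task(String description) { this.description = description; }
}

public class ResourceModelSketch {
    public static void main(String[] args) {
        Domain vo = new Domain("wow2green");
        Domain site = new Domain("poznan-site");
        vo.subDomains.add(site);
        site.resources.add(new ComputeResource("grms-broker"));
        site.resources.add(new StorageResource("dms-storage"));
        Task submit = new Task("submit workflow to " + site.resources.get(0).name);
        submit.state = Task.State.RUNNING;       // in Vine such state changes would be persisted
        System.out.println(submit.description + " [" + submit.state + "]");
    }
}
```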
3.1 Workflow Engine Support
The Vine Toolkit API has been enriched with a generic Workflow Engine API. It was created as a separate subproject which can be added as a feature to the portal installation. The API reflects the identified requirements regarding the functionality used within end-user web applications to manage users' workflows through an engine-independent API. It makes it quite simple to add support for different workflow engines if desired or required by some application constraints. For the wow2green experiment the Kepler engine was used as the primary workflow engine. The base API allows a so-called WorkflowDefinition to be defined, which represents the workflow definition prepared as an XML string. In the case of the Kepler engine the MoML [12] based workflow description is used. Through the use of generic WorkflowDefinitionParameters, a set of input parameters can be passed along with the workflow definition to the target workflow engine plugin. The WorkflowManager component plays the role of the generic workflow manager, while the concrete instance is created during workflow submission to the service deployed on the target host. Resources are defined in the Vine domain description (which is a resource registry for the given domain). Vine contains a Kepler plugin project which encapsulates the Kepler engine client to implement the generic Vine Workflow API. Figure 1 clearly shows the connection from the portal to the Kepler Web Service; it is another component which exposes the functionality of the Kepler workflow engine to external clients, acting as a workflow management service. The key point here are the so-called Vine actors: small Java classes which are in fact Vine Web Service clients. These "bricks" are used by the workflow designer to prepare a workflow description that executes applications using grid resources and acquires the results of the computations. The user is not aware where the computations take place: the Vine Web Service, by using grid services like GRMS, which is a meta-scheduler and job broker, dispatches the jobs on the accessible grid environment. Vine is used here in a different, separate role, in servlet mode, as the backend of the Vine Web Service, providing easy and unified access to different grid services, in this case GRMS and GRIA. Such a construction allows an easy switch to a different grid technology and reconfiguration of the whole environment to a different middleware stack, like Unicore 6 [13] for example.
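A hedged sketch of how a portal component might drive the generic Workflow Engine API is given below. WorkflowDefinition, WorkflowDefinitionParameter and WorkflowManager are the names used in the text; the method signatures, the registry lookup and the dummy in-memory implementation are assumptions introduced only so the example is self-contained.

```java
// All types below are hypothetical stand-ins sketched from the description above; the real
// Vine Workflow API may use different names and signatures.
class WorkflowDefinition {                                   // wraps an XML (MoML) workflow description
    final String xml;
    WorkflowDefinition(String xml) { this.xml = xml; }
}
class WorkflowDefinitionParameter {                          // a named input passed with the definition
    final String name, value;
    WorkflowDefinitionParameter(String name, String value) { this.name = name; this.value = value; }
}
interface WorkflowManager {                                  // generic manager; the concrete engine
    String submit(WorkflowDefinition def, WorkflowDefinitionParameter... params); // plugin (e.g. Kepler)
    String getStatus(String workflowId);                     // is bound at submission time
}

public class WorkflowSubmissionSketch {
    public static void main(String[] args) {
        WorkflowManager kepler = lookupManager("kepler-engine-service");   // assumed registry lookup
        WorkflowDefinition def = new WorkflowDefinition("<entity name=\"simulation\">...</entity>");
        String id = kepler.submit(def,
            new WorkflowDefinitionParameter("inputData", "input.dat"),
            new WorkflowDefinitionParameter("iterations", "100"));
        System.out.println("workflow " + id + " is " + kepler.getStatus(id));
    }
    static WorkflowManager lookupManager(String resourceName) {
        // Dummy in-memory implementation so the sketch runs stand-alone.
        return new WorkflowManager() {
            public String submit(WorkflowDefinition d, WorkflowDefinitionParameter... p) { return "wf-1"; }
            public String getStatus(String workflowId) { return "RUNNING"; }
        };
    }
}
```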
3.2 Uniform Interface to HPC Infrastructures
In order to provide end-users with transparent access to resources, we developed a mechanism responsible for the management of uniform interfaces to diverse Grid middleware. An advanced classloader management mechanism enables dynamic loading of components that provide access to the functionality of specific Grid middleware through a single uniform interface. These components are
called plug-ins in this context (and are realized as Vine subprojects with separate sets of client jars). These uniform interfaces are provided for functionality such as job submission, control, monitoring, data management, role management, user registration and authentication. They are based on standards where possible (e.g. JSDL for job submission) and take into account both the gathered requirements and the functionality provided by the Grid middleware deployed in HPC centers.

Job Submission, Control and Monitoring Interface. One of the interfaces which makes use of the mechanism described above is the uniform interface for job submission functionality. It was based, first, on an analysis of the functionality of middleware systems and, second, on the Job Specification Description Language (JSDL) [14] created by the JSDL working group at the Open Grid Forum [15]. The following main functionality was included in the job submission interface: startJob, stopJob, getJobOutputs, getJobStatusById, createJobSpec, and besides this a set of support methods like getJobByTaskId to retrieve a job instance from Vine using its task id. Currently Vine supports middleware such as gLite 3.x [16] WMProxy and CREAM, Unicore 6 UnicoreX job submission, Globus GT4 [17] WS-GRAM, the GRIA Job Service and recently QosCosGrid [18] SMOA Computing and GRMS. Vine's plugin management allows simultaneous usage of different middleware technologies, which enables uniform access to the different HPC resources. In the wow2green use case the GRIA and GRMS plugins were used to access grid services. Another motivation, beside seamless usage of different grid technologies, was ease of configuration and the possibility to switch to a different middleware stack without modifications to the developed web applications (Vine plays the role of an abstract access layer with a well-defined interface).

Data Management Interface. Another interface is the uniform interface for data management functionality. It was based on an analysis of the functionality of the storage services of middleware systems, trying to find a common API. The following main functionality was included in the data management interface: getHomeDirectory, createInputStream, createOutputStream, info, exists, list, goUpDirectory, changeDirectory, makeDirectory, copy, rename, move, delete, upload, download, createNativeURLString, and besides this a set of support methods like getDefaultFileManager to retrieve the default data manager instance from Vine. Currently Vine supports middleware such as gLite 3.x SRM and LFC, Globus GridFTP, the Unicore 6 Storage Management System (SMS), WebDAV, the Storage Resource Broker (SRB), the GRIA Data Service and HTTP. Vine's plugin management allows simultaneous usage of different storage technologies and copying data between them. In the wow2green use case the GRIA Data Service and HTTP Data Service plugins were used to access grid services. By using the Vine abstract layer, web applications can be written without taking different technological nuances into account. Vine also allows different storage servers to be used as uniform services, making data management quite easy for the application developer.
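The sketch below illustrates the intent of these uniform interfaces: client code written against them stays unchanged regardless of which middleware plugin (GRMS, GRIA, gLite, Unicore 6, Globus GT4) is configured behind them. The operation names (createJobSpec, startJob, getJobStatusById, getHomeDirectory, upload, list) come from the text; the interface shapes and the dummy implementations are assumptions for illustration only.

```java
import java.util.List;

interface JobManager {
    String createJobSpec(String executable, List<String> arguments); // JSDL-based job description
    String startJob(String jobSpec);                                 // returns a job id
    String getJobStatusById(String jobId);
    void stopJob(String jobId);
}
interface FileManager {
    String getHomeDirectory();
    List<String> list(String directory);
    void upload(String localPath, String remotePath);
    void download(String remotePath, String localPath);
}

public class UniformAccessSketch {
    // The same client code works whichever middleware plugin sits behind the interfaces.
    static void runSimulation(JobManager jobs, FileManager files) {
        files.upload("params.in", files.getHomeDirectory() + "/params.in");
        String spec = jobs.createJobSpec("/usr/local/bin/simulate", List.of("params.in"));
        String id = jobs.startJob(spec);
        System.out.println("job " + id + ": " + jobs.getJobStatusById(id));
    }
    public static void main(String[] args) {
        // Dummy implementations so the sketch is self-contained.
        JobManager jobs = new JobManager() {
            public String createJobSpec(String exe, List<String> argv) { return "<jsdl:JobDefinition/>"; }
            public String startJob(String spec) { return "job-42"; }
            public String getJobStatusById(String id) { return "QUEUED"; }
            public void stopJob(String id) {}
        };
        FileManager files = new FileManager() {
            public String getHomeDirectory() { return "/home/user"; }
            public List<String> list(String dir) { return List.of(); }
            public void upload(String l, String r) {}
            public void download(String r, String l) {}
        };
        runSimulation(jobs, files);
    }
}
```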
3.3 Security Issues
Vine can solve various security issues that arise while accessing different grid services. A significant interoperability problem results from the differences in the security models of systems such as UNICORE or GRIA (where authentication is based on standard X.509 certificates) and Globus-based Grids (authentication based on GSI [19]). The main difference between these systems is that the basic versions of UNICORE and GRIA do not support dynamic delegation, while GSI-compliant systems do. Currently, to overcome this problem, we allow a portal to authenticate on behalf of end-users. To do this, the portal signs a job with its own certificate and adds the distinguished name of the end-user before submitting the job to, for example, a GRIA site. In the case of Unicore, Vine supports a GSI-enabled extension for the Unicore gateway, which enables users to be authenticated using proxy certificates. In the case of the gLite 3 middleware, VOMS proxy certificates are created and used; Vine can be configured to support different VOMS servers and to allow user interactions from different VOs without any problem. Vine supports a MyProxy [20] server to provide true single sign-on: when the user logs in to the portal, a proxy certificate is retrieved and then used while accessing different middleware services. In the wow2green solution the GRIA middleware was used. So, as mentioned before, a common portal certificate was used to submit jobs (GSI is not supported). To distinguish users, so-called GRIA policy rules are used to set job and data file owners by using users' public certificates and their distinguished names. One of the requirements of the experiment was to simplify security management. Therefore user certificates are maintained by the Vine-based portal integrated with a Certificate Authority: when a user is registered for the first time, their certificate is generated and held in a safe place on the server. Later, proxy certificates are used to interact with the grid middleware. The common portal certificate is used to submit jobs, enabling the GRMS broker to manage them on behalf of users without the need to pass the user's private key.
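The two sign-on paths described above can be summarized in the following sketch. Every type and method here is a hypothetical placeholder (the actual Vine security API is not shown in the paper); the sketch only contrasts proxy-certificate delegation for GSI-style middleware with portal-certificate signing plus the user's distinguished name for GRIA.

```java
// Hypothetical placeholders only; real proxy retrieval would talk to a MyProxy server.
class ProxyCertificate {
    final String userDn;
    ProxyCertificate(String dn) { userDn = dn; }
}

public class SecurityFlowSketch {
    // On portal login, a short-lived proxy certificate is retrieved for the user.
    static ProxyCertificate retrieveProxy(String username, String password) {
        return new ProxyCertificate("CN=" + username + ",O=wow2green");   // placeholder value
    }
    static void submitToGsiMiddleware(ProxyCertificate proxy, String jobSpec) {
        System.out.println("submitting with delegated proxy of " + proxy.userDn);
    }
    static void submitToGria(String portalCertificatePath, String userDn, String jobSpec) {
        // GRIA path: the portal signs the request with its own certificate and attaches the
        // end-user's distinguished name, which GRIA policy rules use to set ownership.
        System.out.println("portal-signed request on behalf of " + userDn);
    }
    public static void main(String[] args) {
        ProxyCertificate proxy = retrieveProxy("alice", "secret");
        submitToGsiMiddleware(proxy, "<jsdl/>");
        submitToGria("portal.pem", proxy.userDn, "<jsdl/>");
    }
}
```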
3.4 User Management
A common problem to solve while using different middleware infrastructures is user management and account provisioning. Vine is able to register users in different middleware infrastructures such as gLite 3, Unicore 6 or Globus GT4. Vine contains registration modules which allow users to be registered on different resources. Currently the following are provided: GridsphereRegistrationResource - allows adding new users to the Gridsphere portal; GssCertificateRegistrationResource - creates an X.509 certificate and key pair for a new user; Gt4RegistrationResource - creates a user account on a Globus Toolkit 4 host machine; GriaRegistrationResource - creates a user account in the GRIA 5.3 Trade Account Service; Unicore6RegistrationResource - creates a user account in the Unicore 6 XUUDB; DmsRegistrationResource - registers a user in the Data Management Service; VomsRegistrationResource - registers a user in the Virtual Organization Membership Service. The modular architecture and plugin management system make it easy to plug in other registration modules if
desired and to extend Vine with new functionality. In the wow2green use case, GssCertificateRegistrationResource, GridsphereRegistrationResource and DmsRegistrationResource were used to generate users' certificates, create portal accounts and register users in the DMS logical file system [21]. Vine also comes with bundled administrative web applications for user management. One of them is UserRegistrationApp, which allows requests for new accounts in Vine to be created. Another is UserManagementApp, where the administrator has the option to deactivate a user account and edit the user's profile information. In the near future the Liferay [22] portal framework will also be supported with a suitable registration module.
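As an illustration, the sketch below chains three of the registration modules named above, as in the wow2green configuration. The RegistrationResource interface and its register method are assumptions; only the module names and their responsibilities come from the text.

```java
import java.util.List;

interface RegistrationResource {
    void register(String username) throws Exception;
}

public class RegistrationChainSketch {
    public static void main(String[] args) throws Exception {
        // Each module provisions one kind of account; real modules would call the target services.
        List<RegistrationResource> chain = List.of(
            user -> System.out.println("GssCertificateRegistrationResource: X.509 key pair for " + user),
            user -> System.out.println("GridsphereRegistrationResource: portal account for " + user),
            user -> System.out.println("DmsRegistrationResource: DMS logical file space for " + user));
        for (RegistrationResource module : chain) {
            module.register("alice");
        }
    }
}
```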
3.5 Role Management
Vine also introduces a role management mechanism to authorize users while they take different actions in the web applications. Roles are used in two scenarios: as roles at the portal level and at the middleware stack level. Roles at the portal level allow permissions to be set for access to different functionality in the web applications based on Vine. The administrator is able to set permissions for roles for different parts of an application and to distinguish users regarding their possible scope of actions. Roles at the middleware stack level are used to set access permissions for jobs and data files at the storage services. For the wow2green use case, GriaRoleResource and DmsRoleResource were implemented. Thus the roles created in the portal are propagated to the middleware stack, in this case GRIA. Using the end-user interface, a user is able to share data with selected roles. Such an action is propagated to the middleware resources and can later be used while accessing remote storage under the given role name. Such an approach makes permission management much simpler and does not multiply remote policies for every user of the system: it is enough to assign a policy to some role and assign a user to the given role in the target system. Of course this is applicable only to middleware systems which support role management, like GRIA for example or the DMS logical file system.
4
Conclusions
In this paper we presented the Vine Toolkit as a universal framework for creating software solutions that interact with different services and technologies (among others, different grid middleware stacks) by means of a uniform, easy API. The generic architecture of the Vine Toolkit was described. We also included some requirements and initial ideas of the wow2green business experiment as an illustration of Vine's application in a real, practical use case. Vine allows jobs to be submitted to different middleware stacks through a uniform user interface. In a similar way, uniform APIs for data management, user authentication, role and user management are provided. The resource-based model allows straightforward service support and configuration in a uniform way regardless of the target technology. Security issues which always occur while accessing different middleware stacks are solved by using Vine registration modules and proxy certificate support. The role management
mechanism allows administrators and end users to manage permissions in the applications easily and to grant permissions to data files on the storage server, where applicable. Besides simple jobs, whole workflows can be managed via the uniform workflow interface. Currently only the Kepler engine is supported, but this can easily be extended by adding another workflow engine plugin. The Vine Toolkit can be used as a framework for building advanced web applications, for example with the use of Adobe Flex technology, as well as a business logic provider for advanced web service solutions.
Acknowledgments. We are pleased to acknowledge support from the EU funded OMII-Europe and BEinGRID projects and the Polish infrastructural project PL-Grid.
References
1. Vine Toolkit web site, http://vinetoolkit.org/
2. OMII-Europe project, http://www.omii-europe.com/
3. BEinGRID project, http://www.beingrid.eu/
4. PL-Grid project, http://www.plgrid.pl/en
5. Wow2green business experiment, http://www.beingrid.eu/wow2green.html, http://www.it-tude.com/wow2green_sol.html
6. Flowify, http://www.flowify.org/
7. Kepler, https://kepler-project.org/
8. GRMS, http://www.gridge.org/content/view/18/99/
9. GRIA, http://www.gria.org/about-gria/overview
10. Adobe Flex, http://www.adobe.com/products/flex/
11. Adobe BlazeDS, http://opensource.adobe.com/wiki/display/blazeds/BlazeDS/
12. MoML language specification, http://ptolemy.eecs.berkeley.edu/publications/papers/00/moml/moml_erl_memo.pdf
13. Unicore 6, http://www.unicore.eu/
14. JSDL, http://www.ogf.org/documents/GFD.56.pdf
15. Open Grid Forum (OGF), http://www.gridforum.org/
16. gLite 3.x, http://glite.web.cern.ch/glite/default.asp
17. Globus GT4, http://www.globus.org/toolkit/
18. QosCosGrid, http://node2.qoscosgrid.man.poznan.pl/gridsphere/gridsphere
19. Grid Security Infrastructure (GSI), http://www.globus.org/security/overview.html
20. MyProxy, http://grid.ncsa.illinois.edu/myproxy/
21. Data Management System (DMS), http://dms.progress.psnc.pl/
22. Liferay, http://www.liferay.com/
GEM – A Platform for Advanced Mathematical Geosimulations
Radim Blaheta, Ondřej Jakl, Roman Kohut, and Jiří Starý
Institute of Geonics AS CR, Studentská 1768, 708 00 Ostrava-Poruba, Czech Republic
{blaheta,jakl,kohut,stary}@ugn.cas.cz
http://www.ugn.cas.cz
Abstract. The contribution deals with the development of a 3-D finite-element package called GEM and its aspirations in demanding mathematical modelling and simulations arising in geosciences. On the background of two complex applications from presently running projects, formulated as linear elasticity and thermo-elasticity problems, the most important characteristics, especially those of the solvers, are presented. Features related to high performance computing, including parallel processing, are focused on.
Keywords: Finite element modelling, parallel solvers, applications in geosciences.
1
Introduction
GEM can be characterized as an in-house, non-commercial, 3-dimensional finite-element package oriented towards the solution of problems arising in geosciences. It serves both research purposes and practical modelling, and its development is mainly problem-driven, reflecting the needs of current research and applications. This contribution presents GEM in its development towards a code for advanced and demanding modelling. In this sense, it reviews its interesting features and points to relevant references for more detailed treatment. GEM was originally designed for the solution of linear elasticity models, which were appropriate for most problems arising from the local mining industry to facilitate coal extraction from deep underground deposits. As a rule, the goal of the modelling procedure was to assist the expert evaluation of the stress-strain fields induced by mining activities in complex geomechanical conditions, to increase the safety of the extraction process both in the project and production phases.
2
Elasticity Problems
From the mathematical point of view, the considered elasticity problems are formulated in terms of the displacement u(x) of points x of the body Ω ⊂ R3. In the weak formulation, we want to find u : Ω → R3, u − u0 ∈ V, such that

$$a(u, v) = b(v) \quad \forall v \in V, \qquad (1)$$

where

$$V = \{ v = (v_1, v_2, v_3) : v_i \in H^1(\Omega),\; v_i = 0 \text{ on } \Gamma_D^{(i)} \subset \partial\Omega \text{ for } i = 1, 2, 3 \},$$

$$a(u, v) = \int_\Omega c_{ijkl}\, \varepsilon_{ij}(u)\, \varepsilon_{kl}(v)\, dx,$$

$$\varepsilon_{ij}(u) = \tfrac{1}{2}\left( \frac{\partial u_i}{\partial x_j} + \frac{\partial u_j}{\partial x_i} \right), \qquad \varepsilon_{kl}(v) = \tfrac{1}{2}\left( \frac{\partial v_k}{\partial x_l} + \frac{\partial v_l}{\partial x_k} \right).$$

Above, H1(Ω) is the Sobolev space, u0 ∈ [H1(Ω)]3 is a function which can prescribe nonzero Dirichlet boundary conditions, c_ijkl are the elastic moduli, Γ_D^(i) are parts of the boundary ∂Ω, and b is a linear functional determined by the applied forces. For more details see e.g. [1].
GEM’s Building Blocks
The core of every modelling software is the solver for the numerical solution of the differential equations describing the model. In GEM's case, the solver can be highlighted as follows:

Finite elements. For the solution of the problem of elastic deformation (1), the finite element (FE) method is employed. The FE discretization of this boundary value problem, based in GEM on linear tetrahedral finite elements, leads to a linear system of the type

$$A u = b, \qquad u, b \in \mathbb{R}^n, \qquad (2)$$
where A is a symmetric positive definite n × n stiffness matrix, b ∈ R^n is the right-hand side given by the loading and u ∈ R^n is the n-dimensional vector of unknown nodal displacements. The dimension n of the system can be very high, typically in the range from 10^5 to 10^7 for the 3D analysis of elasticity problems of stress changes induced by mining.

Structured meshes. GEM makes use of structured meshes for the discretization of the modelled domain Ω, which can be viewed as an adaptation (deformation) of a regular cubical (reference) grid of nodes — see e.g. [11]. The cubical cells are divided into six tetrahedral elements in a way known as Kuhn's triangulation. We have developed procedures that allow structured meshes to fit the geometry of geomechanical models quite easily. Moreover, the resulting stiffness matrix allows efficient storage and fast matrix handling routines, essential for demanding modelling. Note that the use of structured grids is quite common in geomechanical applications, although in some other application areas this approach may not be flexible enough.

Conjugate gradients and preconditioning. The symmetry and positive definiteness of the stiffness matrix A permit the linear system (2) to be solved by the (iterative) preconditioned conjugate gradient (PCG) method. In the PCG algorithm,
the preconditioning operation is very important for its efficiency — roughly speaking, it improves the search directions for the solution. GEM employs preconditioning based on incomplete factorization and the so-called displacement decomposition, see e.g. [2]. For parallelization, we combine displacement decomposition with an overlapping domain decomposition approach. We shall tackle this topic in the next section.
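For readers unfamiliar with the method, the following self-contained sketch shows a PCG iteration with a block-diagonal preconditioner applied independently per displacement block, which is the property that makes the DiD preconditioning step communication-free. It is only a toy illustration: the per-block solve is plain diagonal scaling rather than GEM's incomplete factorization, and the matrix is a small artificial SPD example, not a real stiffness matrix.

```java
import java.util.Arrays;

public class DisplacementDecompositionPcg {
    static double[] multiply(double[][] a, double[] x) {
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++)
            for (int j = 0; j < x.length; j++) y[i] += a[i][j] * x[j];
        return y;
    }
    static double dot(double[] a, double[] b) {
        double s = 0; for (int i = 0; i < a.length; i++) s += a[i] * b[i]; return s;
    }
    // Block-diagonal preconditioner: indices are split into three blocks (x-, y-, z-displacements)
    // and each block is treated independently, so the applications could run concurrently.
    static double[] applyPreconditioner(double[][] a, double[] r) {
        int n = r.length, block = n / 3;
        double[] z = new double[n];
        for (int b = 0; b < 3; b++)
            for (int i = b * block; i < (b + 1) * block; i++)
                z[i] = r[i] / a[i][i];                       // toy per-block solve (Jacobi scaling)
        return z;
    }
    static double[] solve(double[][] a, double[] rhs, double tol, int maxIter) {
        int n = rhs.length;
        double[] x = new double[n];
        double[] r = rhs.clone();                            // r = b - A*0
        double[] z = applyPreconditioner(a, r);
        double[] p = z.clone();
        double rz = dot(r, z);
        for (int k = 0; k < maxIter && Math.sqrt(dot(r, r)) > tol; k++) {
            double[] q = multiply(a, p);
            double alpha = rz / dot(p, q);
            for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
            z = applyPreconditioner(a, r);
            double rzNew = dot(r, z);
            double beta = rzNew / rz;
            for (int i = 0; i < n; i++) p[i] = z[i] + beta * p[i];
            rz = rzNew;
        }
        return x;
    }
    public static void main(String[] args) {
        double[][] a = {                                     // small SPD test matrix (3 blocks of size 2)
            {4, 1, 0, 0, 0, 0}, {1, 4, 0, 0, 0, 0},
            {0, 0, 5, 1, 0, 0}, {0, 0, 1, 5, 0, 0},
            {0, 0, 0, 0, 6, 1}, {0, 0, 0, 0, 1, 6}};
        double[] b = {1, 2, 3, 4, 5, 6};
        System.out.println(Arrays.toString(solve(a, b, 1e-10, 100)));
    }
}
```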
2.2 Parallelization
Displacement decomposition (DiD). DiD separates the three components of the nodal displacement vector by rearranging and decomposing the vector u (and correspondingly the matrix A of the system (2)) to u = (u1, u2, u3), where the blocks u1, u2, u3 respectively include all nodal displacements in the x, y, z directions. With this partitioning, parallelization is relatively straightforward, since the displacement directions can be processed by three concurrent processes. Thanks to the block diagonal structure of the preconditioner in the DiD case, the preconditioning step does not involve any communication. However, beside a few small-sized communications in each iteration of the parallel PCG algorithm, the most demanding data exchange is involved in the repeated matrix-by-vector multiplication using the same displacement decomposition. Aiming at the reduction of iterations, we realized an alternative to this fixed preconditioning, namely so-called variable preconditioning, in which the pseudoresiduals are computed by an inner CG method, or more precisely, by three independent (communication-free) PCG procedures. (Each inner CG iteration can again be preconditioned by incomplete factorization; moreover, it is sufficient to compute the pseudoresiduals with a low inner accuracy.) In this way we can reduce the number of outer iterations (and their overhead) at the price of increasing the computational work involved in the preconditioning. This is appropriate for parallel systems with a high computation/communication performance ratio, e.g. with slow interprocess communication. Note that in general the variable preconditioner is not linear, symmetric and positive definite, and the standard CG method can fail or converge too slowly. Therefore, we also consider the generalized preconditioned CG method (GPCG) [7,5], with an additional orthogonalization of the search directions. See [9] for a more detailed treatment of DiD and parallel PCG based on it.

The parallelization based on DiD made possible the "comfortable" solution of demanding models with several million degrees of freedom (DOF) more than ten years ago already. It perfectly fits small parallel environments, at that time consisting of 3 – 4 workstations, usually with a low-speed interconnecting network. But also nowadays parallel computations based on DiD can match e.g. personal computers with four-core processors.

Domain decomposition (DD). For HPC, however, more scalability of the parallel computations is definitely required. That is why another space decomposition method [7] is implemented in parallel GEM. With DD, GEM partitions the domain Ω "horizontally" in the Z dimension into m subdomains with a two-layer
minimal overlap, performs the corresponding decomposition of vectors and matrices, and assigns them to m concurrent processes, which pursue the CG algorithm on the corresponding blocks of data. During the matrix-by-vector multiplication just local communication with neighbours is needed, and the amount of communicated data is fairly small and proportional to the overlapped region. Moreover, efficient Schwarz-type preconditioners can be constructed, see [8] as a classical reference. Our version of these preconditioners can be characterized by the following features:
– use of subproblems corresponding to the slightly overlapping subdomains,
– inexact solution of the subproblems by incomplete factorization,
– algebraic creation of an auxiliary global approximation by aggregation of DOF and its use in a two-level version of the preconditioner,
– inexact solution of the global subproblem by inner CG iterations,
– use of both additive and multiplicative (hybrid) combination of corrections from the local and global subproblems.
These topics, too extensive to be covered in detail here, are discussed in [6]. Note that the algebraic construction of both the local subproblems and the global approximation provides a black-box approach.

Realization aspects. From the technical point of view, the developed parallel realizations of the PCG algorithm in GEM follow the message passing paradigm and can be implemented by means of any message passing system. We have made use of both PVM and MPI message passing, called from Fortran codes. The design gives rise to one master process and a number of slave processes, one per partition. As a computationally inexpensive task with just controlling functions, the master can share its processor with one of the workers. See [7] for more realization details.
2.3 Applications – Analysis of Geocomposites
The number of practical elasticity computations related to mining has been very large during GEM's existence. Maybe the most challenging mathematical model of this kind, with a great impact on GEM's development, was the simulation of the extraction of a uranium ore deposit in the Bohemian-Moravian Highlands, see [11] for details. It aimed at the assessment of the geomechanical effects of mining, e.g. the comparison of different mining methods from the point of view of stress changes and the possibility of dangerous rockbursts. The current elasticity computations performed in GEM concern another research theme: microstructure 3D finite element modelling. In Fig. 1 we can see a FE model of a geocomposite specimen from the coal environment reinforced by injection of polyurethane resin. Such technology can be used e.g. to strengthen coal pillars during mining. The mechanical properties as well as the permeability depend on the degree of filling of the fractures and on the properties of the used resin. The numerical upscaling and evaluation of the properties of the homogenized geocomposite material make it possible to assess the effect of grouting and can also be used for optimization of the grouting process.
Fig. 1. CT scan of a geocomposite specimen (left) and timings of its GEM processing using various solvers (right – see text for a legend)
For the numerical upscaling, the geocomposite samples (cubes with a 75 mm edge) are scanned by X-ray computer tomography, which provides their microstructure and the properties of different voxels. In the case above, the sample was discretized with a uniform grid of 231 × 231 × 37 voxels, giving 2 045 312 nodes and more than 6 million DOF for the investigation of elastic properties. The homogenized (upscaled) properties are obtained by strain- and stress-driven tests implemented numerically by means of the FE analysis. To give an idea of the performance of GEM's solvers, Fig. 1 (right) provides wall-clock solution times for the corresponding FE system on a NUMA system with 8 AMD Opteron quad-core processors and 128 GB of RAM. The following parallel solution methods were applied: displacement decomposition with fixed (DiD fix.) and variable (DiD var.) preconditioning, and two-level Schwarz domain decomposition without/with an algebraic coarse space created by aggregation of DOF (DD/DD+CG). The sequential computing time (not shown) was 304 seconds. Note that the grid size, small in the Z dimension, limited the scalability to 16 subdomains.
3
Thermo-Mechanical Modelling
The investigation of the performance of deep geological repositories (DGRs) for spent nuclear fuel, a pressing issue associated with nuclear power utilization, motivates the advancement of GEM's modelling repertoire towards multiphysics. Modelling DGRs to their full extent is a great challenge since it has to take into account many processes like heat transfer, water flow, mechanical behaviour and also the transport of chemical agents and chemical reactions. Moreover, mutual interactions of the processes and the very long DGR operation make the modelling even more complex. This research orientation underlines the necessity of efficient parallel solution methods.
In an initial step we focused on the thermal and mechanical (T-M) phenomena with the fairly natural assumption that temperature changes produce additional deformations but the mechanical deformations are only slow and do not influence the temperature fields. More precisely, we consider only thermo-elasticity modelling with a one-directional coupling via a thermal expansion term in the constitutive relations. In this case, the model can be mathematically formulated as follows: find the temperature τ : Ω × (0, T) → R and the displacement u : Ω × (0, T) → R3 that fulfill the equations

$$\kappa\rho\,\frac{\partial \tau}{\partial t} = k \sum_i \frac{\partial^2 \tau}{\partial x_i^2} + Q(t) \qquad \text{in } \Omega \times (0,T), \qquad (3)$$

$$-\sum_j \frac{\partial \sigma_{ij}}{\partial x_j} = f_i \quad (i = 1, \ldots, 3) \qquad \text{in } \Omega \times (0,T), \qquad (4)$$

$$\sigma_{ij} = \sum_{kl} c_{ijkl}\,\bigl[\varepsilon_{kl}(u) - \alpha_{kl}(\tau - \tau_0)\bigr] \qquad \text{in } \Omega \times (0,T), \qquad (5)$$

$$\varepsilon_{kl}(u) = \tfrac{1}{2}\left(\frac{\partial u_k}{\partial x_l} + \frac{\partial u_l}{\partial x_k}\right) \qquad \text{in } \Omega \times (0,T), \qquad (6)$$
together with the corresponding boundary and initial conditions. The four expressions represent the heat conduction equation, the equations of equilibrium, Hooke's law and the tensor of small deformations, respectively, with the symbols having the following meaning: κ is the specific heat, ρ is the density of the material, k are the coefficients of heat conductivity, Q is the density of the heat source, σij is the Cauchy stress tensor, εkl is the tensor of small deformations, f is the density of the volume (gravitational) forces, cijkl are the elastic moduli, αkl are the coefficients of heat expansion and τ0 is the reference (initial) temperature. The considered time interval is denoted (0, T). For the heat conduction, the boundary conditions prescribe the temperature, the heat flux through the surface and the heat transfer to the surrounding medium. For the elasticity, the boundary conditions set the displacement, stresses and surface loading. The initial conditions are represented only by the initial temperature. See e.g. [3] for this topic.
3.1 Numerical Processing
GEM solves the initial-boundary value problem of thermo-elasticity (3)–(6) in two steps: First, the temperature distribution is determined by the solution of the nonstationary heat conduction equation for a set of time points. Second, the linear elasticity problem for a chosen subset of those time points is solved in a post-processing procedure.

Time stepping. After the variational formulation, the problem is discretized by finite elements in space and finite differences in time. GEM, employing linear
tetrahedral finite elements and the simplest finite differences, computes the vectors τ^j and u^j of the nodal temperatures and displacements, respectively, at the time points t_j (j = 1, ..., N), defining the time steps Δt_j = t_j − t_{j−1}. GEM follows a time-stepping algorithm which involves the repeated solution of linear systems with millions of DOF: for each time point, we solve the thermal linear system

$$(M_h + \vartheta\,\Delta t_j K_h)\,\tau^j = (M_h - (1-\vartheta)\,\Delta t_j K_h)\,\tau^{j-1} + \vartheta\,\Delta t_j\, q_h^j + (1-\vartheta)\,\Delta t_j\, q_h^{j-1}$$

and at chosen "interesting" time levels also the elasticity linear system

$$A_h u^j = b^j.$$

In the expressions above, M_h is the capacitance matrix, K_h is the conductivity matrix, A_h is the stiffness matrix, q_h represents the heat sources and b = b_h(τ) comes from the volume and surface forces including a thermal expansion term. The parameter ϑ ∈ [0, 1] sets the so-called time scheme of the computation: the explicit Euler scheme for ϑ = 0, the Crank-Nicolson (CN) scheme for ϑ = 1/2 and the backward Euler (BE) scheme for ϑ = 1, which is used by default.

Adaptive time stepping. To ensure accuracy and economize the computational effort, it is worth adapting the time steps according to the behaviour of the solution. For this, we use a procedure based on a local comparison of the BE and CN time steps. Evaluating an easily computable ratio, we can control the subsequent time step Δt_j and adjust its size if the variation is too small or too large. For more details on this point see [10] and the references therein.

Parallel solution. The parallel solution of the heat conduction problem is conceptually analogous to the DD elasticity solution with one exception: for reasonable sizes of time steps, the one-level Schwarz method is efficient enough and the construction and use of a global auxiliary subproblem is not necessary, see e.g. [12].
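The adaptive control can be sketched as follows. The exact ratio used in GEM is not given here, so the relative BE/CN difference below, the acceptance thresholds and the halving/doubling factors are assumed, standard choices, and a scalar test equation stands in for the discretized heat conduction system so that the sketch runs on its own.

```java
// Adaptive time stepping by local comparison of backward Euler (BE) and Crank-Nicolson (CN)
// steps, illustrated on the scalar test equation d(tau)/dt = -lambda*tau + q.
public class AdaptiveTimeStepping {
    static final double LAMBDA = 2.0, Q = 1.0;
    static double stepBE(double tau, double dt) {            // (1 + dt*lambda) tau_new = tau + dt*q
        return (tau + dt * Q) / (1 + dt * LAMBDA);
    }
    static double stepCN(double tau, double dt) {            // theta = 1/2
        return ((1 - 0.5 * dt * LAMBDA) * tau + dt * Q) / (1 + 0.5 * dt * LAMBDA);
    }
    public static void main(String[] args) {
        double tau = 0.0, t = 0.0, dt = 0.1, tEnd = 10.0;
        double target = 1e-3;                                 // desired local BE/CN difference (assumed)
        int accepted = 0;
        while (t < tEnd) {
            double be = stepBE(tau, dt), cn = stepCN(tau, dt);
            double eta = Math.abs(cn - be) / Math.max(Math.abs(cn), 1e-30);
            if (eta > 4 * target) { dt *= 0.5; continue; }    // variation too large: retry with smaller step
            tau = be; t += dt; accepted++;                    // accept the (default) backward Euler step
            if (eta < 0.25 * target) dt *= 2.0;               // variation very small: enlarge the next step
        }
        System.out.printf("reached t=%.2f with %d accepted steps, final dt=%.4f, tau=%.5f%n",
                          t, accepted, dt, tau);
    }
}
```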
3.2 The APSE Problem
As an example of a thermo-mechanical model solved in GEM, let us introduce the APSE problem, which was formulated in the framework of the DECOVALEX 2011 project (DECOVALEX 2011 is the present phase of the long-term international DECOVALEX project, DEvelopment of COupled models and their VALidation against EXperiments, focused on DGR modelling). APSE stands for the Äspö Pillar Stability Experiment, in which the stability of a 1 meter wide pillar between two deposition holes in a tunnel located at a depth of about 450 meters in granite rock is investigated, cf. Fig. 2 left. The loading is given by the initial stress, the excavation of the tunnel and deposition holes and finally also by heating. The aim of the experiment was to obtain data describing the rock mass response and mainly to examine the failure process in a heterogeneous and slightly fractured rock mass when subjected to coupled
Fig. 2. Detail of the APSE FE model (left). Stress development during excavation steps (right; plot of Sigma 1 (MPa) versus Sigma 3 (MPa)).
excavation-induced and thermal-induced stresses. This failure and the possibility of creating the so-called Excavation Damage Zone (EDZ) is important for the assessment of the repository performance. This assessment has to take into account the fact that the EDZ can create a possible path with greater hydraulic conductivity, and to predict the possible consequences. A more detailed description of the APSE experiment can be found in [13]. The thermo-elastic computation provides first-stage information about the acting stresses. Further, in a second stage, the local response of the creation of the EDZ, including such effects as spalling of the rock walls, has to be investigated. Measurements of convergence in the operating tunnel, of temperatures in the pillar and of displacements on the deposition hole walls can be used for the calibration and validation of models, codes and approaches. For the first-stage modelling by GEM, the APSE problem was represented in a domain of 105×126×118 meters and discretized by a regular structured grid of 99×105×59 nodes, providing a FE system of 613 305 DOF. GEM supplied stress paths (principal stresses σ1 and σ3 at a point located at 2 m depth and 0.032 m into the pillar wall of the second hole) during the excavation and the subsequent heating, see Fig. 2 right: (1) initial stress state, (2) after excavation of the tunnel, (3) first hole 2 m deep, (4) first hole 4 m deep, (5) first hole completed, (6) second hole 1 m deep, (7) second hole 2 m deep, (8) second hole 2.5 m deep, (9) second hole 5 m deep, (10) second hole completed. These results and the accuracy obtained were consistent with the values provided by the other research teams. A more detailed comparison will be presented in the final report of the project. The effort to achieve very high coincidence between the model and the measured data led to the problem of finding, in this sense, optimal values of the thermal parameters by solving an identification problem. More precisely, we had temperature measurements at 8 locations and 12 time points. Assuming the rock to be homogeneous, we wanted to find optimal values of two thermal parameters, the heat conductivity λ and the volume heat capacity cv. For each couple of parameters we can determine the deviation of the computed temperatures from the measured ones. Considering this deviation as the value of the functional F(λ, cv), our
aim was to find the couple providing the minimum of this functional. To reduce the number of repeated solutions of the FE system, originally counted in hundreds, by an order of magnitude, we took advantage of the Nelder-Mead (simplex) algorithm (see e.g. [17]) for finding this minimum; a sketch of this search is given below.

OpenMP parallelization. The development of the thermo-elasticity solver was a good opportunity to follow current trends in parallel programming and to gain experience with the nowadays increasingly popular shared-variables paradigm, represented by the OpenMP standard. Thus, beside the MPI version, an OpenMP version of the thermal solver was prepared. Both variants of the solver have the same logic, i.e. they follow the same algorithm and apply the same data decomposition, described above, taking into account the different character and formalism of the MPI and OpenMP standards. Without going into details, which can be found in [14], our case study supports the prevalent idea that for larger numbers of processors, thanks to the better data locality, MPI surpasses OpenMP in efficiency.
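Referring back to the identification problem above, the following sketch shows a compact Nelder-Mead search over the two parameters (λ, cv). In GEM every evaluation of F means a full FE thermal simulation compared against the measured temperatures; here F is replaced by an artificial stand-in with a known minimum, and the start point, simplex perturbation, coefficients and iteration count are arbitrary assumptions.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.function.BiFunction;

public class NelderMeadIdentification {
    static double[] minimize(BiFunction<Double, Double, Double> f, double[] start, int iters) {
        double[][] s = { start.clone(),
                         { start[0] * 1.1, start[1] },          // initial simplex: perturb each parameter
                         { start[0], start[1] * 1.1 } };
        for (int it = 0; it < iters; it++) {
            Arrays.sort(s, Comparator.comparingDouble((double[] p) -> f.apply(p[0], p[1])));
            double[] best = s[0], second = s[1], worst = s[2];
            double[] c = { (best[0] + second[0]) / 2, (best[1] + second[1]) / 2 };       // centroid w/o worst
            double[] xr = { c[0] + (c[0] - worst[0]), c[1] + (c[1] - worst[1]) };        // reflection
            double fr = f.apply(xr[0], xr[1]);
            if (fr < f.apply(best[0], best[1])) {
                double[] xe = { c[0] + 2 * (xr[0] - c[0]), c[1] + 2 * (xr[1] - c[1]) };  // expansion
                s[2] = f.apply(xe[0], xe[1]) < fr ? xe : xr;
            } else if (fr < f.apply(second[0], second[1])) {
                s[2] = xr;
            } else {
                double[] xc = (fr < f.apply(worst[0], worst[1]))
                    ? new double[]{ c[0] + 0.5 * (xr[0] - c[0]), c[1] + 0.5 * (xr[1] - c[1]) }       // outside
                    : new double[]{ c[0] + 0.5 * (worst[0] - c[0]), c[1] + 0.5 * (worst[1] - c[1]) }; // inside
                if (f.apply(xc[0], xc[1]) < Math.min(fr, f.apply(worst[0], worst[1]))) {
                    s[2] = xc;                                   // accept contraction
                } else {                                         // otherwise shrink toward the best point
                    for (int i = 1; i < 3; i++)
                        s[i] = new double[]{ best[0] + 0.5 * (s[i][0] - best[0]),
                                             best[1] + 0.5 * (s[i][1] - best[1]) };
                }
            }
        }
        Arrays.sort(s, Comparator.comparingDouble((double[] p) -> f.apply(p[0], p[1])));
        return s[0];
    }
    public static void main(String[] args) {
        // Stand-in deviation functional F(lambda, c_v); the "measured" optimum (3.2, 2.1e6) is invented.
        BiFunction<Double, Double, Double> F =
            (lambda, cv) -> Math.pow(lambda - 3.2, 2) + Math.pow(cv / 1e6 - 2.1, 2);
        double[] opt = minimize(F, new double[]{ 2.0, 1.5e6 }, 200);
        System.out.printf("lambda = %.3f, c_v = %.3e%n", opt[0], opt[1]);
    }
}
```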
4
Conclusion
The article focused on the basic features of the GEM software, and of its solvers in particular, that contribute to its competence for the solution of some complex problems arising in geosciences. The limited extent did not allow us to mention some other interesting facets, e.g. the composite grid technique used for local mesh refinement or elasto-plastic computations, and, for the same reason, those that were mentioned had to refer to other publications for a fuller treatment. As far as our future plans related to GEM are concerned, we are considering hydrogeological phenomena to be included in the DGR models, either in separate or coupled form. The idea is to start with saturated Darcy flow as the hydraulic problem. We intend to use a less standard mixed variational formulation and a mixed FE method with the lowest-order Thomas-Raviart finite elements. The resulting linear system can be solved e.g. by the MINRES method. More about this topic can be found in [15]. Some of the GEM solvers are freely available under the GPL licence, see [16].

Acknowledgement. The work was supported by projects No. AV0Z30860518 of the Academy of Sciences of the Czech Republic and No. GACR 105/09/1830 of the Grant Agency of the Czech Republic. Together with the authors of this paper, Petr Byczanski, Zdeněk Dostál, Alexej Kolcun and Josef Malík are the main contributors to the development of the GEM software.
References
1. Nečas, J., Hlaváček, I.: Mathematical Theory of Elastic and Elasto-Plastic Bodies: An Introduction. Elsevier, Amsterdam (1981)
2. Blaheta, R.: Displacement decomposition – incomplete factorization preconditioning techniques for linear elasticity problems. Numerical Linear Algebra with Applications 1 (1994)
3. Quarteroni, A., Valli, A.: Numerical Approximation of Partial Differential Equations. Springer, Berlin (1994)
4. Jakl, O.: Database methods in finite element postprocessing. In: Vondrák, I., Štefan, J. (eds.) Proc. Computer Science, VŠB-TU Ostrava (1995)
5. Blaheta, R.: GPCG – generalized preconditioned CG method and its use with nonlinear and nonsymmetric displacement decomposition preconditioners. Appl. Math. 9, 527–550 (2002)
6. Blaheta, R.: Space Decomposition Preconditioners and Parallel Solvers. In: Proc. of the 5th European Conference on Numerical Mathematics and Advanced Applications ENUMATH 2003, pp. 20–38. Springer, Berlin (2003)
7. Blaheta, R., Jakl, O., Starý, J.: Linear system solvers based on space decompositions and parallel computations. Engineering Mechanics 10(6), 439–454 (2003)
8. Toselli, A., Widlund, O.: Domain Decomposition Methods – Algorithms and Theory. Springer, Heidelberg (2005)
9. Blaheta, R., Jakl, O., Starý, J.: Displacement decomposition and parallelization of the PCG method for elasticity problems. Int. J. of Computational Science and Engineering (IJCSE) 1, 183–191 (2005)
10. Kohut, R., Starý, J., Blaheta, R., Krečmer, K.: Parallel Computing of Thermoelasticity Problems. In: Lirkov, I., Margenov, S., Waśniewski, J. (eds.) LSSC 2005. LNCS, vol. 3743, pp. 671–678. Springer, Heidelberg (2006)
11. Blaheta, R., Byczanski, P., Jakl, O., Kohut, R., Kolcun, A., Krečmer, K., Starý, J.: Large scale parallel FEM computations of far/near stress field changes in rocks. Future Generation Computer Systems 22(4), 449–459 (2006)
12. Blaheta, R., Kohut, R., Neytcheva, M., Starý, J.: Schwarz Methods for Discrete Elliptic and Parabolic Problems with an Application to Nuclear Waste Repository Modelling. Mathematics and Computers in Simulation 76, 18–27 (2007)
13. Andersson, C.: Äspö Hard Rock Laboratory. Äspö Pillar Stability Experiment, final report. SKB TR-07-01, Stockholm (2007)
14. Jakl, O., Kohut, R., Starý, J.: MPI and OpenMP Computations for Nuclear Waste Deposition Models. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Wasniewski, J. (eds.) PPAM 2007. LNCS, vol. 4967, pp. 409–418. Springer, Heidelberg (2008)
15. Blaheta, R., Byczanski, P., Kohut, R., Starý, J.: Modelling THM Processes in Rocks with the Aid of Parallel Computing. In: Proc. of the 3rd Int. Conference on Coupled T-H-M-C Processes in Geosystems GeoProc 2008. ISTE and John Wiley, London (2008)
16. GEM download page, http://www.ugn.cas.cz/link/gem
17. Singer, S., Nelder, J.: Nelder-Mead algorithm. Scholarpedia 4(7), 2928 (2009), http://www.scholarpedia.org/article/Nelder-Mead_algorithm
Accelerating the MilkyWay@Home Volunteer Computing Project with GPUs
Travis Desell1, Anthony Waters1, Malik Magdon-Ismail1, Boleslaw K. Szymanski1, Carlos A. Varela1, Matthew Newby2, Heidi Newberg2, Andreas Przystawik3, and David Anderson4
1 Department of Computer Science, Rensselaer Polytechnic Institute, Troy NY 12180, USA
2 Department of Physics, Applied Physics and Astronomy, Rensselaer Polytechnic Institute, Troy NY 12180, USA
3 Institut für Physik, Universität Rostock, 18051 Rostock, Germany
4 U.C. Berkeley Space Sciences Laboratory, University of California, Berkeley, Berkeley CA 94720, USA
Abstract. General-Purpose computing on Graphics Processing Units (GPGPU) is an emerging field of research which allows software developers to utilize the significant amount of computing resources GPUs provide for a wider range of applications. While traditional high performance computing environments such as clusters, grids and supercomputers require significant architectural modifications to incorporate GPUs, volunteer computing grids already have these resources available, as most personal computers have GPUs available for recreational use. Additionally, volunteer computing grids are gradually upgraded by the volunteers as they upgrade their hardware, whereas clusters, grids and supercomputers are typically upgraded only when replaced by newer hardware. As such, MilkyWay@Home's volunteer computing system is an excellent testbed for measuring the potential of large scale distributed GPGPU computing across a large number of heterogeneous GPUs. This work discusses the implementation and optimization of the MilkyWay@Home client application for both Nvidia and ATI GPUs. A 17 times speedup was achieved for double-precision calculations on a Nvidia GeForce GTX 285 card, and a 109 times speedup for double-precision calculations on an ATI HD5870 card, compared to the CPU version running on one core of a 3.0 GHz AMD Phenom(tm) II X4 940. The use of single-precision calculations was also evaluated; this further increased performance 6.2 times for the ATI card and 7.8 times for the Nvidia card, but with some loss of accuracy. Modifications to the BOINC infrastructure which enable GPU discovery and utilization are also discussed. The resulting software enabled MilkyWay@Home to use GPU applications for a significant increase in computing power, at the time of this publication approximately 216 teraflops, which would place the combined power of these GPUs between the 11th and 12th fastest supercomputers in the world.1
http://www.top500.org/list/2009/06/100
1
Introduction
General-Purpose computing on Graphics Processing Units (GPGPU) is an emerging field of research which allows software developers to utilize the significant amount of computing resources GPUs provide for a wide range of applications. While traditional high performance computing environments such as clusters, grids and supercomputers require significant architectural modifications to incorporate GPUs, volunteer computing grids already have these resources available, as most personal computers have GPUs available for recreational use. Additionally, volunteer computing grids are gradually upgraded by the volunteers as they upgrade their hardware, whereas clusters, grids and supercomputers are typically just replaced by newer hardware. As such, MilkyWay@Home's volunteer computing system is an excellent testbed for testing and measuring the potential of large scale distributed GPGPU computing across a large number of heterogeneous GPUs.

There are two main challenges in effectively utilizing GPUs in a volunteer computing setting. The first is developing efficient applications that appropriately utilize the GPU hardware. Additionally, the MilkyWay@Home application requires a high degree of floating point accuracy to compute correct results. Only recent generations of GPUs provide the required double precision support, but with varying degrees of adherence to floating point standards. Second, MilkyWay@Home uses the Berkeley Open Infrastructure for Network Computing (BOINC), which provides an easily extensible framework for developing volunteer computing projects [1]. However, new support needed to be added to track not only what GPUs are available on volunteered hosts, but also what driver versions they are using, so that appropriately compiled applications can be sent to those clients.

This paper discusses the implementation and optimization of the MilkyWay@Home client application for both Nvidia and ATI GPUs. A 17 times speedup was achieved for double precision calculations on a Nvidia GeForce GTX 285 card, and a 109 times speedup for double precision calculations on an ATI HD5870 card, compared to the CPU version running on one core of a 3.0 GHz AMD Phenom(tm) II X4 940. Performing single precision calculations was also evaluated, and the methods presented improved accuracy from 5 to 8 significant digits for the final results. This compares to 16 significant digits with double precision; on the same hardware, using single precision further increased performance 6.2 times for the ATI card and 7.8 times for the Nvidia card. Utilizing these GPU applications on MilkyWay@Home has provided an immense amount of computing power, at the time of this publication approximately 216 teraflops.

The paper is organized as follows. Related volunteer computing projects and their progress in using GPGPU are discussed in Section 2. ATI and CUDA optimizations, along with strategies for improving their accuracy, are presented in Section 3. Section 4 describes how BOINC has been extended to support GPU applications and how to use these extensions. The paper concludes with conclusions and future work in Section 5.
2 Related Work
Recently there has been an increase in the number of volunteer computing projects porting their applications for use on GPUs. Some of these projects include Folding@Home [2], GPUGRID [3], Einstein@Home [4], and Aqua@Home [5]. The GPUs are chosen because of their availability and large increase in computational power over conventional CPUs [6]. However, the increase in computational power is not guaranteed. For example, Aqua@Home's GPU application implemented on Nvidia hardware does not perform as well as the CPU version [7]. Also, in order to reach peak performance levels it is necessary to design the algorithms to access data in specialized ways and limit the usage of branch-type instructions [8]. Folding@home pioneered the use of GPUs in a distributed computing project to accelerate the speed of their calculations. The application was first programmed for ATI hardware using the Microsoft DirectX 9.0c API and the Pixel Shader 3.0 specification [6]. The Folding@Home GPU client was over 25 times faster than an optimized CPU version, with hardware that is outdated by today's standards. To achieve these performance levels they made use of the large number of parallel execution units present on the GPU through the use of several optimizations [6]. To get a better idea of the impact the GPUs had on performance, they stated that in 2006 there were 150,000 Windows computers performing 145 TFlops of work, while there were 550 GPUs producing 34 TFlops of work. Based on their numbers, one GPU produced around 56 times more TFlops than one CPU. Since then ATI and Nvidia started to offer tools for programming GPUs without being restricted by a graphics API. Both now use a subset of the C language augmented with extensions designed to ease the handling of data parallel tasks. Nvidia calls this programming environment CUDA (Compute Unified Device Architecture) [9], while ATI's version is named Brook+ [10]. This had a significant impact on both the adoption rate and the performance of Folding@home on GPUs. In 2009 they reported only a moderate increase in the performance delivered by 280,000 Windows computers, to approximately 275 TFlops. But now 27,000 GPUs from ATI as well as Nvidia provide more than 3 PFlops of computing power [11], almost a hundred-fold increase compared to only 2.5 years before. Like Folding@Home, GPUGRID recently started to utilize GPUs for performing molecular dynamics calculations [8,12]. Both projects experienced significant speedups by performing the calculation on the GPU; in particular, Folding@Home stated a performance increase of between 100x and 700x when comparing a Nvidia GTX 280 to a CPU with single-precision calculations.
3 Optimizing Milkyway@Home for GPUs
The MilkyWay@Home application attempts to discover various structures existing in the Milky Way galaxy and their spatial distribution [13]. This requires a
probabilistic algorithm for locating geometric objects in spatial databases [14]. The data being analyzed is from the Sloan Digital Sky Survey [15], which has recently released 15.7 terabytes of images (FITS), 26.8 terabytes of data products in FITS format, 18 terabytes of catalogs in the SQL database and 3.3 terabytes of spectra (FITS). The observed density of stars is drawn from a mixture distribution parameterized by the fraction of local sub-structure stars in the data compared to a smooth global background population, the parameters that specify the position and distribution of the sub-structure, and the parameters of the smooth background. Specifically this is a probability density function (PDF) that calculates the chance of obtaining the observed star distribution after repeated independent sampling from the total set of stars:

L(Q) = ∏_{i=1}^{N} PDF(l_i, b_i, g_i | Q)    (1)
where i denotes the ith of N stars (l and b are galactic coordinates and g is the magnitude), and Q represents the parameters in the model. Such a probabilistic framework gives a natural sampling algorithm for separating sub-structure from background [13]. By identifying and quantifying the stellar substructure and the smooth portion of the Milky Way's spheroid, it will be possible to test models for the formation of our galaxy, and by example the process of galaxy formation in general. In particular, we would like to know how many merger events contributed to the build up of the spheroid, what the sizes of the merged galaxies were, and at what time in the history of the Milky Way the merger events occurred. Models for tidal disruption of merger events that build up the spheroid of the Milky Way can be matched with individual, quantified spatial substructures to constrain the Galaxy's gravitational potential. Since the gravitational potential is dominated by dark matter, this technique will also teach us about the spatial distribution of dark matter in the Milky Way. This section continues by first describing how precision for single-precision GPU calculation could be improved in Section 3.1 and then with specific optimizations used to greatly improve the performance of the GPU applications in Section 3.2.
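In practice, a likelihood of the form (1) is usually evaluated as a sum of log-probabilities over all stars, which turns the evaluation into the kind of large reduction discussed in Section 3.1 below. The following sketch illustrates this idea only; the pdf callback is a hypothetical placeholder for the model's density function, not the project's actual code.

```c
#include <math.h>

/* Illustrative only: evaluate log L(Q) = sum_i log PDF(l_i, b_i, g_i | Q).
 * The pdf() callback is a hypothetical stand-in for the MilkyWay@Home
 * probability density function. */
double log_likelihood(const double *l, const double *b, const double *g,
                      int n_stars, const double *Q,
                      double (*pdf)(double, double, double, const double *))
{
    double sum = 0.0;
    for (int i = 0; i < n_stars; i++)
        sum += log(pdf(l[i], b[i], g[i], Q));
    return sum;   /* exponentiate if L(Q) itself is required */
}
```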
3.1 Conserving Precision on Different Platforms
A general concern using heterogeneous systems for computation is to ensure the consistency of the results. This is even more important when completely different architectures like GPUs from different vendors as well as CPUs are utilized. While most floating point operations on current GPUs closely follow the precision requirements of the IEEE 754 standard [16], there are some notable exceptions. Furthermore, because of the parallel nature of the execution on GPUs it is necessary to pay attention to the order in which the individual work items are calculated. Generally, no specific order is guaranteed without serializing. In our case we can easily work around this issue as the results of
the individual work items of the time-consuming parts of the calculation are basically just added up. The commonly employed solution is the well known Kahan method [17], which reduces the rounding errors and effectively guards the result against different summation orders. We have done extensive testing to determine the reliability and reproducibility of the results between different architectures using single-precision as well as double-precision. Using single-precision on GPUs appears especially tempting because of the wider range of supported products and the vastly higher performance. While the peak performance difference between single-precision and double-precision on most CPUs is virtually nonexistent for legacy code and reaches only a factor of two when using Single Instruction Multiple Data (SIMD) extensions, it ranges from a factor of five (ATI) to a factor of eight (Nvidia) for GPUs. Furthermore, GPUs support only basic double-precision operations in hardware; divisions or square roots, for instance, are computed iteratively with sequences of instructions. It was not until the latest GPU generations of both manufacturers that support for the instruction to perform double-precision fused multiply-add was included. With this instruction it is possible to obtain correctly rounded results for the already mentioned division and square root operations. Generally one has to evaluate the stability of the algorithm against small deviations in the handling of certain operations on different platforms. In our case the non-adherence to the IEEE standard for division and square root operations does not affect the final results. On the contrary, it enabled the use of low level optimizations which are described in the following section. Table 1 shows the obtained precision and performance of the different implementations of our algorithm. The CPU and GPU implementations using double-precision operations all deliver the exact same results for typical parameter ranges.
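For reference, the compensated summation referred to above can be sketched in a few lines; this is the textbook Kahan algorithm [17], not a listing taken from the MilkyWay@Home sources.

```c
/* Textbook Kahan (compensated) summation: the running compensation c
 * captures the low-order bits lost when adding each small term to the
 * large running sum, which keeps the result nearly independent of the
 * order in which the partial results arrive. Must be compiled without
 * unsafe floating-point optimizations that reassociate the arithmetic. */
double kahan_sum(const double *x, int n)
{
    double sum = 0.0, c = 0.0;
    for (int i = 0; i < n; i++) {
        double y = x[i] - c;     /* corrected next term */
        double t = sum + y;      /* low-order digits of y are lost here */
        c = (t - sum) - y;       /* recover what was lost */
        sum = t;
    }
    return sum;
}
```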
Table 1. Comparison of the performance and precision of different platforms, SP and DP indicate single-precision and double-precision calculations respectively. The low level optimizations used in some DP versions are described in the text. The speedup is given relative to the double-precision CPU implementation running on a single core of a 3.0 GHz AMD PhenomII X4 940 using already vectorized code. The entry in the precision column quantifies the typical relative deviation to the results of the CPU implementation.

Platform                     | precision    | measured speedup | theoretical peak
-----------------------------|--------------|------------------|-------------------
3.0 GHz CPU core, DP, SSE3   | 0            | 1                | 1 (12 GFlop/s)
Nvidia GTX285, DP opt.       | < 1 · 10^-16 | 17               | 7.4 (89 GFlop/s)
ATI HD4870, DP               | < 1 · 10^-16 | 34               | 20 (240 GFlop/s)
ATI HD4870, DP opt.          | < 1 · 10^-16 | 48               | 20 (240 GFlop/s)
ATI HD5870, DP opt.          | < 1 · 10^-16 | 109              | 45 (544 GFlop/s)
NV GTX285, SP, naïve sum     | ~ 2 · 10^-6  | 130              | 89 (1063 GFlop/s)
Nvidia GTX285, SP            | ~ 1 · 10^-8  | 130              | 89 (1063 GFlop/s)
ATI HD4870, SP               | ~ 1 · 10^-8  | 299              | 100 (1200 GFlop/s)
The single-precision variants give a sizable speedup while sacrificing some accuracy. But this loss can be minimized by taking care of the rounding errors occurring when combining the results of the individual work items. The resulting precision is in the expected range and actually slightly better than the resolution of the single-precision format (2^-24 ≈ 6 · 10^-8). The reason is that the Kahan summation method effectively increases the precision of the stored intermediate values and the deviations between single and double-precision partly average out. Furthermore, this allows CPUs as well as GPUs to be used for the same work units, as all platforms calculate to the same precision.

3.2 Performance Optimizations
The time-consuming part of our algorithm is the evaluation of the PDF for a three-dimensional wedge of the sky. The convolution with the probability distribution of the distance of a star for a given observed magnitude makes this effectively a four-dimensional problem, which can be conveniently mapped to parallel architectures like GPUs. To use the hardware as efficiently as possible we employed a range of optimizations. Changes on the algorithmic level were applied to all platforms, i.e. CPU as well as GPU applications use the same algorithm to calculate the results. Here we concentrate on the specific optimizations done for GPUs. As the architectures of ATI as well as Nvidia GPUs share common design principles, most of the optimizations apply to both platforms. Because of the distinct features of the programming environments of both vendors, the same optimization often has to be implemented in a different way. Figure 1 depicts the various optimizations that were done along with their effect on the performance for the case of the CUDA implementation. The first set of optimizations aims at the efficient use of the memory controller and the caches. As recurring expensive expressions are precalculated and stored in lookup tables, the access to these data structures can be accelerated by the use of the texture caches in GPUs, increasing the available bandwidth as well as reducing the latency in comparison to accesses to the global memory. This is particularly easy for ATI GPUs as all data structures are allocated as textures by default in Brook+. The hardware tiles and interleaves the data in a transparent way to make use of the spatial locality by coalescing memory accesses and improving the cache hit rate. For Nvidia GPUs the coalescence of memory accesses requires some manual intervention. The data inside the different data structures has to be reordered to be accessed consecutively by the global thread id. Initially the implementation made use of the GPU's constant memory instead of textures in order to store the precalculated values associated with the algorithm. While constant memory is actually stored in the GPU's global memory space, it is cached by a dedicated constant cache after the first access [9]. Nevertheless the use of texture memory offers some advantages exploiting the spatial locality in texture references [9]. The first reason is that the texture accesses are cached by a two-level cache system which simply offers more space and therefore higher hit rates. The other reason is the organization in cache lines, i.e. when one value is read from the
memory, the successive values which are most likely needed by a neighboring thread or the next iteration in the same thread are also fetched into the cache. This lowers the average latency for the accesses. For our algorithm the ratio of calculations to the number of memory accesses is high. Depending on the problem one has between 5 and 12 floating point operations per byte fetched from the lookup tables. The calculations can be arranged to take advantage of spatial locality within the caches even when the full lookup tables are larger than the first level caches on CPUs or the texture caches on GPUs. Therefore memory fetches are certainly not a serious bottleneck by themselves. Nevertheless one has a limited number of threads available to hide the access latencies. Therefore it is necessary both to improve the efficiency of memory accesses, minimizing the latency, and to maximize the number of threads present on the GPU to attain a high occupancy of the execution units. The latter usually requires the reduction of the used hardware registers, as all threads on a SIMD engine in a GPU share the same register file. In the case of the CUDA version, the amount of shared memory usage had to be reduced. This reduction is possible by changing the number of active threads per block, which had a negligible effect on performance. The baseline CUDA version minimized the register usage, but as our algorithm is clearly compute-bound no excessive number of threads is necessary to hide the latencies. Furthermore, newer GPUs raised the number of available registers and lessened the importance of this optimization strategy. It was therefore beneficial to employ the same strategy as for CPUs and ATI GPUs and combine calculations wherever possible in order to minimize the necessary operations. This results in using more registers but allowed for a slight performance increase and is labeled as Combine Terms in Figure 1. As the division and square root are particularly costly on GPUs in double-precision, we looked for possible shortcuts compared to the compiler-generated instruction sequences for these operations. Generally the Newton iteration method is used for both. After the exponent extraction, which reduces the argument range, the single-precision reciprocal or square root instruction is used to get the first estimate. In our case we have some prior knowledge about the range of values we operate with. As we need only the longer mantissa but not the extended exponent range of the double-precision format, one can simply skip the exponent extraction. To get correctly rounded results, one needs one final iteration step using a fused multiply-add operation [18]. Without this step the last bit of the mantissa may deviate. But as this step makes no difference on GPUs that do not support a fused multiply-add instruction without intermediate rounding, and as it is not the limiting factor in our calculations, we omitted it as well. Together this reduced the number of instructions in the innermost loops by up to 20 percent. While one cannot guarantee that the results after these changes exactly match the original versions in all cases, we found no deviations in our tests (see Table 1). A plot of how the changes affected the performance of the Nvidia implementation can be seen in Figure 1, labeled as Sqrt and Div.
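The shortcut just described can be illustrated with a small sketch. This is a generic formulation under the stated assumptions (strictly positive, well-scaled inputs, so exponent extraction and the final correctly-rounding fused multiply-add step are skipped); it is not the actual kernel code shipped to volunteers.

```c
#include <math.h>

/* Double-precision reciprocal refined from a single-precision seed by two
 * Newton steps (each step roughly doubles the number of correct bits:
 * ~24 -> ~48 -> full double precision, up to the last bit). */
static double fast_reciprocal(double a)
{
    double x = (double)(1.0f / (float)a);   /* single-precision estimate */
    x = x * (2.0 - a * x);                  /* first Newton step */
    x = x * (2.0 - a * x);                  /* second Newton step */
    return x;
}

/* Same idea for the reciprocal square root; sqrt(a) can be recovered
 * afterwards as a * fast_rsqrt(a). */
static double fast_rsqrt(double a)
{
    double y = (double)(1.0f / sqrtf((float)a));  /* single-precision estimate */
    y = y * (1.5 - 0.5 * a * y * y);              /* Newton step for 1/sqrt(a) */
    y = y * (1.5 - 0.5 * a * y * y);
    return y;
}
```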
[Figure 1: bar chart of the calculation time in seconds (approximately 140-165 s) for the successive CUDA optimization stages Baseline, Textures, Combine Terms, Sqrt and Div.]
Fig. 1. Optimizations for the double-precision CUDA version and the effect on the calculation time for one evaluation on a Nvidia GTX285 GPU. See the text for a description of the changes.
We also explored the feasibility of some further low-level optimizations for ATI. As the execution units of current ATI GPUs are organized in groups which are fed with bundles of independent instructions (resembling VLIW architectures), it is sometimes difficult for the compiler to maximize the utilization of the units. The Brook+ compiler generates code in a pseudo-assembly intermediate language (IL), which is just-in-time compiled and optimized for the specific GPU in the system at runtime [10]. The IL code is easily accessible. The innermost loop of the time-consuming kernels was tuned for an optimal instruction pairing to increase the utilization, resulting in an additional 20 percent speedup compared to the Brook+-generated IL. As a result we have implemented the algorithm with a comparable level of optimization on different platforms. The general strategies are very similar. Only the VLIW-like organization of the execution units of ATI GPUs requires some additional attention to obtain optimal performance. This is even more important for single-precision applications. In our case it was straightforward to combine four threads into a single one by using the appropriate vector data types. On the other hand the memory layout and handling required more effort for Nvidia GPUs. Generally, the performance of the different implementations roughly follows the theoretical peak performances of the corresponding platforms. As our algorithm is significantly compute-bound it scales very well on massively parallel architectures. In fact, even when neglecting the low-level optimized version, the resource utilization on GPUs is better than on CPUs. This can be explained by the latency-hiding features of GPUs which are designed to mask the execution latencies of the individual instructions as well as memory latencies. This may be even more significant for single-precision when the GPUs can use their fast hardware instructions for division and square root operations which have a high latency on CPUs. In contrast GPUs are able to continuously operate close
to their maximum instruction throughput on our algorithm. Comparing ATI and Nvidia GPUs, both arrive at about the same performance relative to their theoretical peak throughput.
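The thread-combining idea mentioned above can be pictured with a small, purely illustrative C fragment; the real code uses Brook+ vector types such as float4, whereas the struct and function below are assumptions made only for this example. Four independent work items are packed into one 4-wide value so that each slot of a VLIW bundle has useful work:

```c
/* Illustrative only: processing four independent work items per "thread"
 * with a 4-wide vector type, so all slots of a VLIW execution bundle are
 * filled with independent operations. */
typedef struct { float x, y, z, w; } float4_t;

static inline float4_t madd4(float4_t a, float4_t b, float4_t c)
{
    float4_t r = { a.x * b.x + c.x, a.y * b.y + c.y,
                   a.z * b.z + c.z, a.w * b.w + c.w };
    return r;   /* one "vectorized" multiply-add covering four items */
}
```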
4 Using GPUs on BOINC
BOINC is a platform for volunteer computing. It is used by about 50 projects, including Milkyway@home, SETI@home, IBM World Community Grid, Einstein@home, Rosetta@home, and Climateprediction.net. BOINC allows volunteers to participate in multiple projects. To do so, they download and run a client program (available for all common operating systems) and attach the client to the desired projects. Each attachment has a volunteer-specified resource share indicating the fraction of the computer’s available resources that should be allocated to the project.
Fig. 2. The basic operation of BOINC
The basic operation of BOINC is shown in Figure 2, in which a volunteer host is attached to projects A and B. The BOINC client maintains a queue of jobs and manages their execution; depending on user preferences, it may run jobs when the computer is in use, not in use, or both. When the length of the queue (in terms of estimated runtime) falls below a lower limit, the client selects a project and issues a scheduler RPC to it, requesting enough jobs to reach the upper limit. The lower and upper limits on queue length are chosen to avoid running out of work during periods of disconnection, and to minimize the frequency of scheduler RPCs. The choice of project is based on resource share: the client maintains a per-project work debt that increases in proportion to resource share, and decreases as work is done. Jobs are requested from the project whose debt is greatest.
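The debt-based project selection described above can be sketched as follows; the structures and function names are illustrative assumptions and do not correspond to the actual BOINC client source.

```c
#include <stddef.h>

/* Hypothetical per-project bookkeeping for the work-fetch policy. */
struct project {
    double resource_share;   /* volunteer-specified share */
    double debt;             /* accumulated scheduling debt (seconds) */
};

/* Account for dt seconds of wall time; done[i] holds the processing time
 * actually spent on project i in that interval. Debt grows in proportion
 * to the resource share and shrinks as work gets done. */
void update_debts(struct project *p, const double *done, size_t n, double dt)
{
    double total_share = 0.0;
    for (size_t i = 0; i < n; i++) total_share += p[i].resource_share;
    for (size_t i = 0; i < n; i++) {
        double entitled = dt * p[i].resource_share / total_share;
        p[i].debt += entitled - done[i];   /* owed minus received */
    }
}

/* When the queue runs low, request work from the project with the
 * greatest debt. */
size_t choose_project(const struct project *p, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (p[i].debt > p[best].debt) best = i;
    return best;
}
```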
Each project has its own server and database. The database stores a set of jobs, each of which is associated with an application. An application can have several versions, one per platform (Windows 32-bit, Windows 64-bit, Mac OS X, Linux, etc.). The attributes of a job include its memory and disk requirements, an estimate of its FLOP count, and its deadline. The scheduler RPC's request message describes the host's operating system and version, processor type and benchmark scores, memory size, and available disk space. It requests an amount of work, as measured in expected runtime. Based on this information, the scheduler attempts to find a set of jobs that satisfy the request and that can be executed on the host (i.e., the memory and disk requirements are met, the deadline will probably be met, and a version is available for the host's platform). This basic framework was extended to handle applications that use GPUs as well as CPUs. The resulting system handles clients that have arbitrarily many GPUs, possibly of different types (currently Nvidia and ATI are supported). GPUs of a given type are assumed to be identical. The volunteer host population may have a wide range of GPU models, driver versions, and memory sizes. Referring to Figure 2, we now allow an application to have multiple versions per platform. For a given platform (say, Win32) an application might have versions for one CPU, for an Nvidia GPU, for an ATI GPU, and for multiple CPUs. The resource requirements of a particular version may not be integral: for example, a job might need 0.5 GPUs and 0.1 CPUs, and these fractions may depend on the host. The notion of resource share is defined as applying to a host's aggregate processing resources, not to the resource types separately. For example, suppose a host has a CPU and a GPU, and the GPU is twice as fast. Suppose that the host is attached to projects A and B with equal resource shares, and that project A has both GPU and CPU applications, while project B has only CPU applications. In this case project A should be allocated 100% of the GPU and 25% of the CPU, while project B should be allocated 75% of the CPU (see Figure 3). As shown by this example, the client must be able to ask the server for jobs of a particular resource type (CPU or GPU). Thus, we extended the scheduler RPC protocol to include a separate work request (in seconds) for each resource
Fig. 3. A project’s resource share applies to all processing resources. In this example, projects A and B each get 15 GFLOPS.
type. In the scheduler RPC reply, jobs are associated with particular versions, hence with particular resource types and usage levels. Two major parts of the BOINC client have been extended to support GPU applications. First, the job scheduler is now GPU-aware; the client allocates GPUs, and passes command-line parameters to GPU jobs telling them which GPU instance(s) to use. Because GPU memory is physical rather than virtual, preempted GPU jobs must always be removed from memory. The client maintains separate job queues for each processing resource, and maintains a separate per-project debt for each resource as well. It maintains a dynamic estimate of which projects have jobs for which resources. A project's overall debt is the sum of its debts over all resources, weighted by the average speeds of the resources. The client's work-fetch policy can be summarized as follows: when the queue for some resource falls below its lower limit, the client chooses, from among the projects likely to have jobs for that resource, the one whose overall debt is greatest. It then issues a scheduler RPC to that project, requesting work for the given resource and possibly for others as well. This policy enforces resource shares as described above. Moving now to the scheduler, deciding whether a GPU job can be handled by a particular client may involve many details of the GPU hardware and driver. This decision is made by an application planning function that is supplied by the project for each of its versions. This function takes as input a description of the host and its GPUs. It returns a) a flag indicating whether the version can run on the host, and if so b) an estimate of the resource usage (number of CPUs, number of GPUs) and c) an estimate of the FLOPs/second the version will achieve on the host. In sending a job to a client, the scheduler must now decide which of possibly several versions to use. This is done as follows. The scheduler considers all the application's versions for the client's platforms, and calls the respective application planning functions. It skips those that use resources for which no work is being requested, and from the others it selects the one whose FLOPs/second estimate is greatest. This estimate is then used to estimate the job's runtime on the host, and the work request for the resource is decremented by that amount.
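A schematic version of this selection logic is shown below; the data structures and the planning-function signature are hypothetical and only illustrate the flow described in the text, not the real BOINC scheduler interface.

```c
#include <stddef.h>

/* Hypothetical structures used only for illustration. */
struct host_desc   { int dummy; /* OS, benchmarks, GPU models, drivers, ... */ };
struct app_version {
    int resource_type;          /* e.g. 0 = CPU, 1 = GPU */
    /* project-supplied planning function for this version: returns nonzero
     * if the host can run it, and fills in the usage and speed estimates */
    int (*plan)(const struct host_desc *h, double *ncpus, double *ngpus,
                double *flops_per_sec);
};

/* Pick the usable version with the highest estimated FLOPs/second, skipping
 * versions whose resource type was not requested. Returns -1 if none fit. */
int choose_version(const struct app_version *v, size_t n,
                   const struct host_desc *host, const int *requested)
{
    int best = -1;
    double best_flops = 0.0;
    for (size_t i = 0; i < n; i++) {
        double ncpus, ngpus, flops;
        if (!requested[v[i].resource_type]) continue;       /* no work wanted */
        if (!v[i].plan(host, &ncpus, &ngpus, &flops)) continue; /* cannot run */
        if (flops > best_flops) { best_flops = flops; best = (int)i; }
    }
    return best;
}
```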
5 Summary
This work discusses the implementation and optimization of the MilkyWay@Home client application for both Nvidia and ATI GPUs. A 17 times speedup was achieved for double-precision calculations on a Nvidia GeForce GTX 285 card, and a 109 times speedup for double-precision calculations on an ATI HD5870 card, compared to a vectorized CPU version running on one core of a 3.0 GHz AMD Phenom(tm)II X4 940. Performing single-precision calculations was also evaluated, and the methods presented improved accuracy from 5 to 8 significant digits for the final results. This compares to 16 significant digits with double-precision; on the same hardware, using single-precision further increased performance 6.2 times for the ATI card and 7.8 times for the Nvidia
card. Utilizing these GPU applications on MilkyWay@Home has provided an immense amount of computing power, at the time of this publication approximately 216 teraflops. While developing GPGPU applications still requires significant technical knowledge, the process is becoming easier. Additionally, the large amount of computing resources this type of hardware provides makes utilizing GPUs a highly desirable prospect, especially in the area of volunteer computing. As the hardware and software matures, we expect GPGPU applications to become more mainstream. The techniques discussed in this paper can aid in the development of other GPGPU applications and describe how they can be effectively used in a volunteer computing environment.
Acknowledgements We would like to thank our many volunteers for taking part in the MilkyWay@HOME BOINC computing project as this research would not be possible without them, as well as Nvidia and ATI for their generous support and hardware donations. This work has been partially supported by the following grants: NSF AST No. 0607618, NSF IIS No. 0612213, NSF MRI No. 0420703 and NSF CAREER CNS Award No. 0448407. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
References

1. Anderson, D., Korpela, E., Walton, R.: High-performance task distribution for volunteer computing. In: e-Science, pp. 196–203. IEEE Computer Society, Los Alamitos (2005)
2. Pande, V.S.: http://folding.stanford.edu
3. De Fabritiis, G.: http://gpugrid.net
4. Allen, B.: http://einstein.phys.uwm.edu
5. D-Wave Systems Inc.: http://aqua.dwavesys.com
6. Elsen, E., Houston, M., Vishal, V., Darve, E., Hanrahan, P., Pande, V.S.: N-Body simulation on GPUs. In: SC 2006: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, p. 188. ACM, New York (2006)
7. D-Wave Systems Inc.: http://aqua.dwavesys.com/faq.html
8. Friedrichs, M., Eastman, P., Vaidyanathan, V., Houston, M., Legrand, S., Beberg, A., Ensign, D.L., Bruns, C.M., Pande, V.S.: Accelerating molecular dynamic simulation on graphics processing units. Journal of Computational Chemistry 30, 864–872 (2009)
9. NVIDIA Corporation: NVIDIA CUDA Programming Guide Version 2.3.1
10. AMD Corporation: ATI Stream Computing User Guide Version 1.4.0
11. Beberg, A., Ensign, D., Jayachandran, G., Khaliq, S., Pande, V.: Folding@home: Lessons from eight years of volunteer distributed computing. In: IEEE International Symposium on Parallel & Distributed Processing, pp. 1–8 (2009)
12. Harvey, M., Giupponi, G., De Fabritiis, G.: ACEMD: Accelerating Biomolecular Dynamics in the Microsecond Time Scale. Journal of Chemical Theory and Computation 5 (2009)
13. Purnell, J., Magdon-Ismail, M., Newberg, H.: A probabilistic approach to finding geometric objects in spatial datasets of the Milky Way. In: Hacid, M.-S., Murray, N.V., Raś, Z.W., Tsumoto, S. (eds.) ISMIS 2005. LNCS (LNAI), vol. 3488, pp. 485–493. Springer, Heidelberg (2005)
14. Reina, C., Bradley, P., Fayyad, U.: Clustering very large databases using mixture models. In: Proc. 15th International Conference on Pattern Recognition (2000)
15. Adelman-McCarthy, J., et al.: The 6th Sloan Digital Sky Survey Data Release. ApJS (July 2007) (in press), arXiv/0707.3413, http://www.sdss.org/dr6/
16. IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Std. 754-1985 (1985)
17. Kahan, W.: Pracniques: further remarks on reducing truncation errors. Commun. ACM 8(1), 40 (1965)
18. Cornea-Hasegan, M., Golliver, R., Markstein, P.: Correctness Proofs Outline for Newton-Raphson Based Floating-Point Divide and Square Root Algorithms. In: Proceedings of the 14th IEEE Symposium on Computer Arithmetic, pp. 96–105 (1999)
Vascular Network Modeling - Improved Parallel Implementation on Computing Cluster

Krzysztof Jurczuk1,2,3, Marek Krętowski1, and Johanne Bézy-Wendling2,3

1 Faculty of Computer Science, Bialystok Technical University, Wiejska 45a, 15-351 Bialystok, Poland
2 INSERM, U642, Rennes, F-35000, France
3 University of Rennes 1, LTSI, Rennes, F-35000, France
{k.jurczuk,m.kretowski}@pb.edu.pl
Abstract. In this paper, an improved parallel algorithm of vascular network modeling is presented. The new solution is based on a more decentralized approach. Moreover, in order to accelerate the simulation of the vascular growth process, both dynamic load balancing and periodic rebuilding of the vascular trees were introduced. The presented method was implemented on a computing cluster with the use of the MPI standard. The experimental results show that the improved algorithm results in a better speedup, thus making it possible to introduce more physiological details and also perform simulations with a greater number of vessels and cells. Furthermore, the presented approach can bring the model closer to reality, where the analogous vascular processes can also be parallel.
1 Introduction
Vascular system modeling plays an important role in the field of computational biology and medicine [2]. It can help to understand complex vascular processes (e.g. blood transport, angiogenesis) and also their impact on other living elements (e.g. cells), tissues and whole organs. In medicine, computer models of blood vessel trees may be considered as working tools that can help to understand pathological processes. Moreover, virtual organs with vascular systems can be used to perform dynamic medical imaging simulations, with no need for patients to participate, and are hence useful to test several acquisition protocols. In order to build a representative model which reflects reality, we ought to take into account the most essential properties of the system and disregard all those elements whose role is insignificant. However, this selection process constitutes one of the main difficulties in model designing, particularly in modeling of living organisms and consequently vascular systems. A high quality vascular model, apart from representing an appropriate number of physiological and anatomical details, should also be effective in practical cases, i.e. the computational simulations should be performed in a reasonable time. Moreover, a lot of internal organic processes are parallel or distributed (e.g. new blood vessels growing from pre-existing ones). Therefore, the creation of distributed vascular models seems to be useful. Firstly, they can decrease the time of simulation and
simultaneously allow us to introduce more sophisticated details. Secondly, they make it possible to reflect reality more accurately. In our previous research [1] and [6], we used the modeling of vascular systems to search for pathology markers and improve the interpretation of dynamic medical images. We focus on vessel networks because many diseases are directly related to changes in vessel structures. Moreover, when a contrast product is administered, these anatomical or functional vessel modifications can appear in medical images. Initially, we made use of a sequential algorithm to generate a virtual liver [7], a CT simulator and an MRI virtual scanner. Subsequently, we introduced a parallel solution of vascular network growth [3], which significantly reduced the computation time. In this paper, we wish to present an improved parallel algorithm of physiological vascular modeling whose aim is to further accelerate the simulation process. Moreover, the new solution is based on a more decentralized approach that can reflect the real process of vessel growth more accurately. Both parallel algorithms were implemented on a computing cluster with the use of the MPI standard [10]. Many other vascular models have been proposed, e.g. the Constrained Constructive Optimization method for an arterial tree generation inside a 3D volume [5] or a fractal model [12]. However, as far as we know, all the previous solutions, published by other authors, have been using sequential algorithms to develop vascular systems. The remaining part of the paper is organized as follows. In the next section the vascular model is briefly described and both the sequential and preliminary parallel algorithms of vascular development are recalled. In Sect. 3 the improved parallel solution of the vascular growth process is explained. An experimental validation of the proposed algorithm is performed in Sect. 4. The last section contains the conclusion and some plans for future research.
2 Vascular Modeling for Image Understanding
The discussed model consists of two main components: the tissue and the vascular network [7]. Although most of its features are not linked with any particular organ, the model expresses the specificity of the liver, which is one of the most important organs. The liver plays a major role in the metabolism and has a number of functions in the body, including hormone production, decomposition of red blood cells, detoxification, protein synthesis, etc. [11]. Moreover, it stands out from other vital organs by its unique organization of the vascular system with three types of trees: hepatic artery, portal vein and hepatic vein. The model is designed to simulate the growth of organs and certain pathological processes. However, it should be emphasized that it is oriented towards image generation. Therefore, it concentrates on elements that are directly visible in images or have a significant influence on image formation. In our case, we focus on vessels which play a key role in the propagation of contrast material and are very visible in dynamic CT and MRI images. In the model, vessels form the vascular system and supply the tissue.
2.1 Tissue Modeling
The tissue is represented by a set of Macroscopic Functional Units (MFU) which are located inside the specified shape. An MFU is a small, fixed size part of tissue and has been assigned a class which determines most of its structural/functional properties (e.g. probability of mitosis and necrosis, size, density) and also physiological features (e.g. blood flow rate, blood pressure). Several classes of MFUs can be defined, which makes it possible to represent pathological regions of the tissue. Moreover, in order to introduce more natural variability, certain parameters (like blood flow rate) are described by defined distributions.

2.2 Vessel Network Modeling
The vascular network of the liver consists of three vessel trees. Hepatic arteries and portal veins deliver blood to individual MFUs, whereas the hepatic venous tree is responsible for blood transport back to the heart. In the model, the vascular trees are composed of vessels that can divide creating bifurcations. Each vessel is represented by an ideal, rigid tube with a fixed radius, wall thickness, length and spatial position. The model distinguishes vessels larger than capillaries, while the capillaries themselves are hidden in the MFUs. As a result, anastomoses (e.g. mutual vessel intersections), which occur particularly among vessels with very small radii or in pathological situations, are not taken into account. Hence, each vascular tree assumes the form of a binary tree. In order to design realistic vascular structures, the vessel parameters (i.e. radius, blood flow, etc.) are calculated according to two basic physical laws. At each bifurcation the law of matter conservation must be observed:

Q = Q_r + Q_l    (1)
where Q is the blood flow in a parent vessel, Q_r and Q_l are the blood flows in the right and left daughter branches respectively. This relation ensures that the quantity of blood entering and leaving a bifurcation is equal. Moreover, the morphological law describing the dependency of the mother vessel radius (r) and the radii of its two daughters (right r_r and left r_l) is used:

r^γ = r_r^γ + r_l^γ    (2)
where the γ factor varies between 2 and 3 [4]. Furthermore, one more important assumption is also taken into consideration. In the model, the blood is treated as a Newtonian fluid, with constant viscosity (μ). Therefore, its flow can be modeled as a non-turbulent streamline flow in parallel layers (laminar flow) induced by the pressure difference (ΔP) between two extremities, and consequently Poiseuille's law can be used to calculate the remaining properties of vessels:

ΔP = Q · 8μl / (πr^4)    (3)

where l is the length, r is the radius and Q is the blood flow of the vessel.
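As a small illustration of how these three laws are applied when vessel parameters are recomputed at a bifurcation, consider the following helper functions. They are a generic sketch only; the constants (the morphological exponent and the viscosity value) are assumptions for the example and are not taken from the model's implementation.

```c
#include <math.h>

#define GAMMA 3.0      /* assumed morphological exponent, 2 <= gamma <= 3 */
#define MU    0.0036   /* assumed constant blood viscosity [Pa*s] */
static const double PI = 3.14159265358979323846;

/* Law (1): matter conservation, parent flow equals the sum of daughter flows. */
double parent_flow(double q_right, double q_left)
{
    return q_right + q_left;
}

/* Law (2): parent radius from the daughters' radii, r^g = r_r^g + r_l^g. */
double parent_radius(double r_right, double r_left)
{
    return pow(pow(r_right, GAMMA) + pow(r_left, GAMMA), 1.0 / GAMMA);
}

/* Law (3): Poiseuille pressure drop over a rigid tube of length l and
 * radius r carrying a laminar flow q. */
double pressure_drop(double q, double l, double r)
{
    return q * 8.0 * MU * l / (PI * r * r * r * r);
}
```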
2.3 Sequential Vascular System Growth Algorithm
The growth process starts with an organ whose size is a fraction of the adult one. In discrete time moments (called cycles) the organ enlarges its size (growth phase) until it reaches its full, mature form. Additionally, each cycle consists of subcycles during which each MFU can divide and give birth to a new MFU of the same class (mitosis process) or die (necrosis process). During the growth phase the proliferation process is dominant, due to the supremacy of mitosis over necrosis (analogy to immature and not fully developed organs) [8]. Therefore, the increasing needs of the growing tissue induce the development of a vascular network which is responsible for the blood delivery. In each subcycle, the processes of mitosis and necrosis are repeated, whereas a new cycle starts when the specified organ shape is filled by MFUs. New MFUs, which appear during the mitosis process, are initially ischemic (no connection with blood vessels) as they are not perfused by the existing vascular system. As a result, they produce angiogenic factors to stimulate the closest vessels to sprout new branches that will be able to provide blood supply. In the model, for each new MFU a fixed number of the nearest vessels (candidate vessels) is found. If more than one tree is taken into account, the algorithm creates all possible combinations of candidate vessels (a single combination consists of one vessel from each tree). Subsequently, temporary bifurcations are created, which means that all combinations of candidate vessels sprout new branches that temporarily perfuse the assigned MFU. The spatial position of the bifurcation point is controlled by the Downhill Simplex procedure (minimization of the additional blood volume needed to perfuse and supply the new MFU) [7]. The above process can be regarded as a kind of competition since only one vessel in each tree can perfuse the new MFU (exactly one combination of candidate vessels) and the remaining ones retract and disappear. Moreover, in the presented approach only non-crossing configurations are taken into account; therefore, the algorithm detects intersections between perfusing vessels both from the same and different trees and rejects related configurations. Finally, from among the remaining combinations, the configuration with the lowest sum of volumes is chosen to permanently perfuse the MFU. After the mitosis and perfusion (i.e. the reproduction process) comes a degeneration phase. During this part of the algorithm, a few MFUs can die (necrosis process) and then all the vessels supplying these MFUs retract and disappear (retraction process). Next, the algorithm goes back to the reproduction phase again.

2.4 Previous Parallel Growth Algorithm
In the sequential algorithm of organ growth, all MFUs are perfused by the vascular system one by one. For each new MFU a lot of temporary bifurcations have to be created and tested, which requires a great number of calculations to satisfy the constraints imposed by realistic vascular design. Therefore, in order to accelerate this most time-consuming part of vascular development,
an algorithm to spread the computations of the perfusion process over processors was introduced (see Fig. 1) [3]. Moreover, our intention was to bring the solution closer to reality in which analogous perfusion phenomena can also occur in a parallel way. The mitosis process is performed at the managing/principal node that, however, makes no attempt to find candidate vessels. Instead, it sends new MFUs to computational nodes. When a computational node receives the message, it then tries to find the closest vessels and an optimal point to connect the received tissue element. If the search ends with a success, it does not perfuse a new MFU permanently but sends the parameters of the bifurcation to the managing node. Next, when the managing node receives a message with an optimal configuration, it makes a decision about permanent perfusion. If, however, at least one of the candidate vessels sent by the computational node was used earlier to perfuse another tissue element, then the MFU is rejected. It means that such a vessel no longer exists at the managing node, which is related to the tree nonuniformity between the nodes. But in other cases, the new MFU is permanently connected to the vascular system. Subsequently, if there are more MFUs to be checked, the managing node sends both all the new changes in its vascular trees and the next MFU to the sender of the latest considered configuration. The remaining processes (i.e. necrosis, retraction and shape growth) are performed only at the managing node that possesses the most current vascular system and tissue. Thus, at the beginning of each subcycle, the principal node broadcasts the latest vascular trees and MFUs. In order to minimize the message size, we choose only the parameters of vessels that cannot be reconstructed by computational nodes. Furthermore, a lot of vascular structure characteristics are read from input files, which enables us to send quite a small package in comparison to the size of all modeled organ elements.
Fig. 1. The outline of the preliminary parallel algorithm of vascular growth
3 Improved Algorithm of Parallel Vascular Development
The previously presented parallel algorithm of vascular network growth can be treated as the first step of bringing the discussed model closer to reality [3]. Moreover, in spite of the fact that in this solution the efficiency of computational nodes was strongly dependent on the operation of the managing node that split the tasks step by step, it allowed us to gain a significant speedup (about 8 for 16 processors). This strong dependency ensured high uniformity of the vascular trees between the nodes. Furthermore, the computational nodes
were evenly loaded during the perfusion process. But, on the other hand, such an approach could result in delays connected with the additional time needed to wait for the following candidate MFUs. Therefore, a decision was made to build a more decentralized algorithm that would provide a better speedup and, at the same time, reflect reality more accurately. Based on the profiling results (execution times of specific methods), it was shown that the time needed to send a message with vascular structures and tissue within the framework of the message passing interface is insignificant, but the tree reconstruction process at the computational nodes is a time-consuming operation, especially for large configurations with thousands of vessels or more. Therefore, we decided to broadcast the whole organ only once at the beginning of the simulation to ensure that all the nodes possess identical vascular trees and MFUs. Now, each new subcycle starts with the mitosis during which a list of new MFUs is created (see Fig. 2). Next, the perfusion process is performed. As it is the most time-consuming operation of the vascular network growth, we also decided to improve it. After the reproduction process, the degeneration phase follows. Since the whole organ is no longer broadcast in each subcycle, all the computational nodes must be informed about possible necrosis changes. Therefore, the managing node transmits essential information to all other nodes about the MFUs that have to be removed. The entire algorithm of retraction must be performed at each node; fortunately, it is not as time-consuming as the perfusion process. There is a possibility to model the process of MFU selection in a parallel environment, but the performance analysis showed that the time needed for that part of the algorithm can be neglected. After the degeneration process, if the organ has reached its adult size, the algorithm ends; when another subcycle is needed, the process starts again with the mitosis. Otherwise, the organ enlarges its size. At each node the same operations are performed in a parallel way and only after they are completed, the solution returns to the mitosis process.
Fig. 2. The general diagram of the improved parallel vascular growth algorithm
3.1 Modified Parallel Perfusion Algorithm
After the mitosis process, when the list of new macroscopic functional units is created, the managing node spreads all candidate MFUs over computational nodes (see Fig. 3). Each computational node, after receiving the message with the group of different MFUs, attempts to find the closest vessels and optimal points to perfuse new tissue elements. Each time the search ends with success, the parameters of the optimal bifurcation are sent to the managing node. Next, the computational node checks if there are any messages with vascular tree changes to be introduced in its vascular system. If there are, the changes are applied and the node continues to work with the next assigned candidate MFU. Special attention was paid to ensuring an optimal load balance among computational nodes. When all candidate MFUs are spread immediately after the mitosis process, some of the processors can be idle after performing all assigned tasks while others still have jobs queuing for execution. Therefore, the option to hold back some candidate MFUs, which are then sent on demand, was introduced (dynamic load balancing). However, before the work distribution in each following subcycle, it is impossible to approximate the time needed to perform each task. Therefore, the presented solution makes it possible to hold back a fixed fraction of the new MFUs. When a computational node finishes all assigned tasks, it sends a message to the managing node requesting more work. Such an algorithm allows us to detect and handle load imbalances dynamically. Moreover, we introduced the possibility to spread some MFUs before the end of the mitosis process. The managing node is responsible for gathering messages coming from the computational nodes. It can also perform calculations connected with finding optimal perfusion parameters. Owing to non-blocking communication between the managing and computational nodes, a unique number is assigned to each permanent perfusion in the vascular network. The managing node also collects tree updates in special queues for each individual node as well as related permanent perfusion numbers. Using this information, it estimates the tree uniformity and makes a decision about the permanent perfusion. The principal node can either send all collected vascular tree updates only to the sender of the
Fig. 3. The detailed diagram of the improved parallel perfusion process
last package or broadcast the permanent changes among all the computational nodes immediately as these changes appear.
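The on-demand part of this scheme can be pictured with a schematic MPI fragment; the message tags, buffer layout and function names below are illustrative assumptions, not the project's actual implementation.

```c
#include <mpi.h>

#define TAG_REQUEST_WORK 1
#define TAG_MFU          2
#define TAG_NO_MORE_WORK 3

/* Managing-node side: hand out the reserved candidate MFUs one at a time,
 * each in response to an explicit request from a computational node. */
void serve_reserved_mfus(double *reserved, int n_reserved, int mfu_size,
                         int n_workers)
{
    int handed_out = 0, finished_workers = 0, dummy = 0;
    MPI_Status st;

    while (finished_workers < n_workers) {
        /* Wait for any worker that has emptied its initial batch. */
        MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST_WORK,
                 MPI_COMM_WORLD, &st);

        if (handed_out < n_reserved) {
            /* Send one reserved MFU (packed as mfu_size doubles) on demand. */
            MPI_Send(&reserved[handed_out * mfu_size], mfu_size, MPI_DOUBLE,
                     st.MPI_SOURCE, TAG_MFU, MPI_COMM_WORLD);
            handed_out++;
        } else {
            /* No reserved MFUs left: this worker is done for the subcycle. */
            MPI_Send(&dummy, 1, MPI_INT, st.MPI_SOURCE, TAG_NO_MORE_WORK,
                     MPI_COMM_WORLD);
            finished_workers++;
        }
    }
}
```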
4 Experimental Results
In this section the improved solution is experimentally verified. The default settings for the sequential version (about 12000 MFUs) were used and we also checked the behavior of the algorithm for large configurations with about 48000 MFUs and consequently about 300000 vessel segments. Figure 4 presents a visualization of one of the obtained vascular networks. In the experiments a cluster of sixteen SMP servers running Linux 2.6 and connected by an Infiniband network was used. Each server was equipped with two 64-bit Xeon 3.2GHz CPUs with 2MB L2 cache, 2GB of RAM and an Infiniband 10Gb/s HCA connected to a PCI-Express port. We used the MVAPICH version 0.9.5 [9] as the MPI standard implementation [10]. Moreover, randomly selected experiments were carried out on a similar cluster of sixteen SMP servers except that each server was equipped with eight CPUs and the MVAPICH version 1.0.1. Furthermore, in order to execute the performance analysis we made use of the Multi-Processing Environment (MPE) library and the graphical visualization tool Jumpshot-4 [10]. Figure 5a presents the obtained mean speedup. It is clearly visible that the improved algorithm is about 25% faster than the previous one. Currently, the time needed to obtain an organ with about 300000 vessels equals approximately 2 hours (with 16 processors), instead of 24 hours in a single processor machine (3.2GHz CPU and 2GB of RAM). Moreover, it is worth noting that in the presented solution the acceleration is not sensitive to the increasing size of vascular structures. Furthermore, in order to assure that the vascular trees are represented in the operating memory by continuous memory segments, periodical reallocations were introduced. Namely, at the beginning of each subcycle computational nodes rebuild their vascular structures. Obviously this operation takes some time
Fig. 4. Visualization of adult liver (about 48000 MFUs and 300000 vessels): a) hepatic veins with a tumor shape, b) main hepatic arteries, portal veins and hepatic veins with liver and tumor shapes
[Figure 5 consists of two plots: a) absolute speedup (theoretical, improved solution, previous solution) versus the number of processors (up to 16); b) gain in time [%] versus the percentage of MFUs left for dynamic load balancing (0-100%), for 4, 8 and 16 processors.]
Fig. 5. Efficiency of the parallel implementation: a) mean speedup for several configurations of MFUs, b) influence of dynamic load balancing
but the profiling results showed that exactly at the same time the mitosis process is being performed at the managing node. Owing to this vascular tree rebuilding, the time needed to find optimal bifurcations, connected with frequently executed tree search algorithms, was decreased. Prior to the introduction of this mechanism, the obtained speedup used to be even worse than in the previous solution, especially for huge vascular networks. The detailed results concerning the optimal load balancing among computational nodes are presented in Fig. 5b. It shows how the number of MFUs left for dynamic load balancing (sent on demand) influences the simulation time. We can see that it is impossible to choose one common value that gives the best gain in time. But on the other hand, it is visible that each value below 90% makes it possible to achieve a better speedup. Finally, we suggest that any value within the range from 30% to 60% is acceptable. The introduced possibility of spreading new MFUs before the end of mitosis has a positive influence on the simulation time when the number of vascular structures is small enough (i.e. at most a few thousand vessels). Moreover, when the managing node also performs calculations connected with finding optimal bifurcations, the speedup can be decreased because of delays in tree change broadcasting (especially in experiments with more than 4 processors).
5 Conclusion
In this paper, an improved solution of vascular network modeling in a parallel environment is investigated. It was shown that the new algorithm is about 25% faster than the one used previously. As a result, it is not only possible to increase the number of MFUs and vessels but also introduce more details into the model.
Moreover, the presented algorithm, based on a more decentralized approach, can bring the discussed model closer to reality in which analogous phenomena of vascular development can occur both in a parallel and distributed way. The presented model is still in its development stage. Continuing our research, we intend to introduce an algorithm aimed at restoring the tree uniformity among nodes during the perfusion process. On the other hand, we plan to implement the vascular growth process in the framework of multi-platform shared-memory parallel programming (OpenMP), which could eliminate the necessity of maintaining tree uniformity between nodes. Acknowledgments. This work was supported by the grant W/WI/5/08 from Bialystok Technical University.
References

1. Bézy-Wendling, J., Krętowski, M.: Physiological modeling of tumor-affected renal circulation. Computer Methods and Programs in Biomedicine 91(1), 1–12 (2008)
2. Hoppensteadt, F.C., Peskin, C.S.: Modeling and Simulation in Medicine and the Life Sciences. Springer, Heidelberg (2004)
3. Jurczuk, K., Krętowski, M.: Parallel implementation of vascular network growth. In: Bubak, M., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2008, Part I. LNCS, vol. 5101, pp. 679–688. Springer, Heidelberg (2008)
4. Kamiya, A., Togawa, T.: Optimal branching structure of the vascular trees. Bulletin of Mathematical Biophysics 34, 431–438 (1972)
5. Karch, R., Neumann, F., Neumann, M., Schreiner, W.: Staged growth of optimized arterial model trees. Annals of Biomedical Engineering 28, 495–511 (2000)
6. Krętowski, M., Bézy-Wendling, J., Coupe, P.: Simulation of biphasic CT findings in hepatic cellular carcinoma by a two-level physiological model. IEEE Trans. on Biomedical Engineering 54(3), 538–542 (2007)
7. Krętowski, M., Rolland, Y., Bézy-Wendling, J., Coatrieux, J.-L.: Physiologically based modeling for medical image analysis: application to 3D vascular networks and CT scan angiography. IEEE Trans. on Medical Imaging 22(2), 248–257 (2003)
8. Morgan, D.O.: Cell Cycle: Principles of Control. Oxford University Press, Oxford (2006)
9. MVAPICH: MPI over InfiniBand and iWARP, http://mvapich.cse.ohio-state.edu/
10. Pacheco, P.: Parallel Programming with MPI. Morgan Kaufmann Publishers, San Francisco (1997)
11. Sherlock, S., Dooley, J.: Diseases of the liver and biliary system. Blackwell Science, Malden (2002)
12. Zamir, M.: Arterial branching within the confines of fractal L-system formalism. Journal of General Physiology 118, 267–275 (2001)
Parallel Adaptive Finite Element Package with Dynamic Load Balancing for 3D Thermo-Mechanical Problems

Tomasz Olas1, Robert Leśniak1, Roman Wyrzykowski1, and Pawel Gepner2

1 Czestochowa University of Technology, Dabrowskiego 73, 42-201 Czestochowa, Poland
{olas,roman,lesniak}@icis.pcz.pl
2 Intel Corporation
[email protected]
Abstract. Numerical modeling of 3D thermomechanical problems is a complex and time-consuming issue. Adaptive techniques are powerful tools to perform such modeling efficiently using FEM analysis. During the adaptation, computational workloads change unpredictably at runtime; therefore, dynamic load balancing is required. This paper presents new developments in the parallel FEM package NuscaS; they allow for extending its functionality and increasing performance. In particular, by including dynamic load balancing capabilities, this package allows us to efficiently solve adaptive FEM problems with 3D unstructured meshes on distributed-memory parallel computers such as PC-clusters. For solving sparse systems of equations, NuscaS uses the message-passing paradigm to implement the PCG iterative method with geometric multigrid as a preconditioner. The implementation of load balancing is based on the proposed performance model.
1 Introduction
The finite element method (FEM) is a powerful tool for studying different phenomena in various areas. However, many applications of this method have too large computational or memory costs for a sequential implementation to be useful in practice. Parallel computing allows FEM users to overcome this bottleneck [6]. For this aim, an object-oriented environment for parallel FEM modeling, called NuscaS, was developed at the Czestochowa University of Technology [14]. This environment is dedicated to modeling of such thermomechanical phenomena [13], described by time-dependent PDEs, as heat transfer, kinetics of solidification in castings, stress in solidifying castings, damage of materials, etc. Being primarily intended for PC-clusters, NuscaS was also ported to other parallel and distributed architectures, like the CLUSTERIX grid [8,15]. Numerical modeling of three-dimensional (3D) thermomechanical problems is a complex and time-consuming issue. Adaptive techniques are powerful tools to perform such modeling efficiently using the FEM analysis [9]. They allow the solution error to be kept under control; therefore computation costs can
be minimized. However, during adaptation computational workloads change unpredictably at runtime; therefore, dynamic load balancing is required. Multigrid (MG) methods are among the fastest numerical algorithms for solving large sparse systems of linear equations [3]. They are motivated by the observation that simple iterative methods are efficient in reducing high-frequency error components, but they cannot efficiently reduce the slowly oscillating content of the error. These methods were initially designed for the solution of elliptic partial differential equations (PDEs). MG algorithms were later adapted to solve other PDEs, as well as problems not described by PDEs. This paper presents new developments in the NuscaS package; they allow for extending its functionality and increasing performance. In particular, by including dynamic load balancing capabilities, the package allows us to efficiently solve adaptive FEM problems with 3D unstructured meshes on distributed-memory parallel computers. For solving sparse systems of equations, NuscaS uses the message-passing paradigm to implement the PCG iterative method with geometric multigrid as a preconditioner. The implementation of load balancing is based on a performance model proposed in the paper. The material of this paper is organized as follows. In Section 2, the basic concepts and architecture of the NuscaS package are introduced shortly, including the parallel computing capabilities of the package and details of the parallel implementation of the geometric multigrid preconditioner. Section 3 describes how the mesh adaptation and dynamic load balancing steps are implemented in NuscaS. The performance model for parallel FEM computations with dynamic load balancing is proposed in Section 4, while Section 5 contains preliminary performance results of numerical experiments. Section 6 presents conclusions and further works.
2 NuscaS Package with Parallel Computing Capabilities
2.1 Basics of NuscaS
The basic part of the NuscaS package is the kernel - a class library which provides an object-oriented framework for developing finite element applications. This library consists of basic classes necessary to implement FEM modeling, irrespective of the type of problem being solved. Inheritance is used to develop new application classes for a particular modeling problem. The previous class library was not perfect and was useful only for 2D problems. In 2004, based on the previous experience, we started to implement a completely new library with 3D finite element solvers. It was called the Numerical Simulations and Computations (NSC) library. The first version of the NSC library appeared in 2006. This library has a rich functionality, which is not limited just to mesh data structures, data input/output, element shape functions, and linear algebra operations. Abstractions are also available for higher-level concepts, other than just a point, a node, or an element. Although the internals of the library are unavoidably complex, the interface is as easy to use as possible. The library is maintainable and extensible using design patterns. The NSC library is usable for large-scale problems, where the number of nodes in a mesh can reach millions. For
this aim, support for managing the life cycle of FEM classes was provided, taking into account the anticipated large number of objects to be created.
2.2 Parallel Computing in NuscaS
In 2009, a new version of the NSC library was enhanced with parallel processing capabilities for 3D problems. The parallel version of the library is based on geometric decomposition applied for uniprocessor nodes of a parallel system; the distributed-memory architecture and message-passing model of parallel programming are assumed. In this case, a FEM mesh is decomposed [6,12] into p submeshes (subdomains), which are then assigned to separate processors. As a result, for each submesh we can distinguish internal, boundary, and external nodes. Internal and boundary nodes are called local ones. When solving a system of equations in parallel, values of unknowns computed in boundary nodes of submeshes are exchanged between neighboring subdomains. Hence, at the preprocessing stage it is necessary to generate data structures for each subdomain, to provide an efficient implementation of communication. For a given subdomain, these data determine both to which processors local values computed in the boundary nodes of the subdomain will be sent, and from which processors values calculated in the external nodes of the subdomain will be received. In the NSC library, object-oriented techniques are used to hide details of parallel processing. To solve linear systems of equations with sparse matrices, which result from the FEM discretization, we use iterative methods. The computational kernel of these methods is the matrix-vector multiplication with sparse matrices. When implementing this operation in parallel, the overlapping of computation and communication is exploited to reduce the execution time of the algorithm.
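The overlap of the sparse matrix-vector product with the exchange of boundary values can be pictured with the following sketch. It is only an illustration of the technique under simple assumptions (CSR storage, internal rows numbered first, invented argument names); it is not the NuscaS interface.

// Distributed y = A*x with communication hidden behind the internal rows.
#include <mpi.h>
#include <vector>

void spmv_overlapped(const std::vector<int>& rowPtr,
                     const std::vector<int>& colInd,
                     const std::vector<double>& val,
                     std::vector<double>& x,            // local values followed by external ones
                     std::vector<double>& y,
                     int nInternal, int nLocal,          // internal rows first, then boundary rows
                     const std::vector<int>& neighbours,
                     const std::vector<std::vector<int>>& sendIdx,
                     const std::vector<int>& recvOffset,
                     const std::vector<int>& recvCount)
{
  std::vector<MPI_Request> req;
  req.reserve(2 * neighbours.size());
  std::vector<std::vector<double>> sendBuf(neighbours.size());

  // 1. Post receives for external node values and send own boundary values.
  for (std::size_t n = 0; n < neighbours.size(); ++n) {
    req.emplace_back();
    MPI_Irecv(&x[recvOffset[n]], recvCount[n], MPI_DOUBLE,
              neighbours[n], 0, MPI_COMM_WORLD, &req.back());
    for (int i : sendIdx[n]) sendBuf[n].push_back(x[i]);
    req.emplace_back();
    MPI_Isend(sendBuf[n].data(), (int)sendBuf[n].size(), MPI_DOUBLE,
              neighbours[n], 0, MPI_COMM_WORLD, &req.back());
  }

  auto multiplyRow = [&](int i) {
    double s = 0.0;
    for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k) s += val[k] * x[colInd[k]];
    y[i] = s;
  };

  // 2. Rows of internal nodes need no external data - compute them while messages are in flight.
  for (int i = 0; i < nInternal; ++i) multiplyRow(i);

  // 3. Wait for the external values, then finish the boundary rows.
  MPI_Waitall((int)req.size(), req.data(), MPI_STATUSES_IGNORE);
  for (int i = nInternal; i < nLocal; ++i) multiplyRow(i);
}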
2.3 Parallel Multigrid Preconditioner in NuscaS
An efficient class of iterative methods for solving large linear systems with sparse matrices is Krylov subspace methods [12], for example, the method of conjugate gradients (CG), suitable only for symmetric positive definite matrices, and the GMRES and Bi-CG-STAB algorithms used in non-symmetric cases. To accelerate a Krylov subspace method and ensure its success, a suitable preconditioner has to be used. Furthermore, the performance of a Krylov subspace method strongly depends on the type of preconditioner employed. Geometric multigrid [3] is a good preconditioning algorithm for Krylov iterative solvers. The preconditioned CG method (PCG) with geometric multigrid as a preconditioner features good convergence even when the multigrid solver itself is not efficient. In comparison with other preconditioners (ICC, ILU, ...), the main benefit of using MG as a preconditioner is that the convergence rate of the resulting iterative algorithm is independent of the problem size. In NuscaS, we use geometric multigrid as a preconditioner for unstructured 3D meshes (Fig. 1). Both the V-cycle and full multigrid algorithms were implemented, with the weighted Jacobi smoother. As a solver for the coarsest grid, we adopted the PCG algorithm with the Jacobi preconditioner.
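The weighted (damped) Jacobi smoother used for pre- and post-smoothing within the V-cycle can be sketched as follows. The CSR storage and the default damping factor are assumptions of this example, not the NuscaS internals.

// One sweep of weighted Jacobi: x_new = x + omega * D^{-1} (b - A x).
#include <vector>

void weightedJacobiSweep(const std::vector<int>& rowPtr,
                         const std::vector<int>& colInd,
                         const std::vector<double>& val,
                         const std::vector<double>& b,
                         std::vector<double>& x,
                         double omega = 2.0 / 3.0)   // typical damping for Laplace-like operators
{
  const int n = static_cast<int>(b.size());
  std::vector<double> xNew(n);
  for (int i = 0; i < n; ++i) {
    double diag = 0.0, offdiag = 0.0;
    for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k) {
      if (colInd[k] == i) diag = val[k];
      else                offdiag += val[k] * x[colInd[k]];
    }
    // the update uses only values from the previous iterate x
    xNew[i] = (1.0 - omega) * x[i] + omega * (b[i] - offdiag) / diag;
  }
  x.swap(xNew);
}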
Fig. 1. Geometric multigrid: an example of grid hierarchy and its partitioning
To include the geometric multigrid into the NuscaS package, the method of FEM mesh representation had to be changed in the NSC library. For each level of the MG hierarchy, nodes and elements are placed in the same data structures as in the standard case (vectors nodes_, elements_, and connectivity_ in the Mesh class). Information describing which elements belong to a particular level is stored in a 2D array of size (L, Nlevel), where L stands for the number of levels, and Nlevel is the number of elements for a certain level. When implementing the computation in parallel, FEM meshes are partitioned and assigned (Fig. 1) to corresponding processors of a parallel architecture, for each level of multigrid. Every processor (or process) keeps the assigned part of a mesh on each level of multigrid. A mesh is divided into p submeshes, which are assigned to separate processors. As a result, on each level the j-th subdomain will own a set of Nj nodes selected from all the N nodes of the corresponding mesh. For an arbitrary subdomain with index j, we have three types of nodes:
– N_j^i internal nodes - they are coupled only with nodes belonging to this subdomain;
– N_j^b boundary nodes - they correspond to those nodes in the j-th subdomain that are coupled with nodes of other subdomains; the number of local nodes is N_j^l = N_j^i + N_j^b;
– N_j^e external nodes - they correspond to those nodes in other subdomains that are coupled with boundary nodes of the given subdomain.
For each level, every process determines and stores information concerning the assigned nodes, including the splitting into internal, boundary and external nodes. In addition, the process stores information about its coupling with neighboring processes; this is necessary to organize the data exchange between processes. The main issue with such a parallel scheme of multigrid computations is the necessity to decompose a FEM mesh on a certain level of the multigrid hierarchy in
a way which allows for reducing the data exchange for subsequent levels of the hierarchy. To minimize parallel overheads, parts of the submeshes are duplicated, and calculations are performed locally as much as possible, instead of exchanging data between processors.
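The per-level bookkeeping described above can be pictured with a small data structure; the names are illustrative only and do not reproduce the NSC classes.

// One such record is kept by every process for every level of the MG hierarchy.
#include <map>
#include <vector>

struct LevelPartition {
  std::vector<int> internalNodes;   // coupled only with nodes of this subdomain
  std::vector<int> boundaryNodes;   // coupled with nodes of other subdomains
  std::vector<int> externalNodes;   // owned by neighbours, but needed locally
  // per neighbouring process: which local boundary values to send,
  // and how many external values to receive from it
  std::map<int, std::vector<int>> sendToNeighbour;
  std::map<int, int>              recvFromNeighbour;
};

// e.g. std::vector<LevelPartition> levels(L);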
3 Mesh Adaptation and Dynamic Load Balancing in NuscaS
3.1 Mesh Adaptation
In general, mesh adaptation methods can be grouped into three different classes, named r-, p-, and h-methods, depending on the level of mesh modifications. In this paper, we focus only on adaptation by h-methods [10]. The first step in such a mesh adaptation procedure is the selection of elements which should be partitioned. We measure the discretization error using the following approach. For each face of an element, the mean value of a simulated variable is calculated. Then, if the difference between this mean value and the value obtained in the simulation, taken at each node of the face, is greater than assumed by the user, the element is marked for refinement. It is obvious that a strong variation of the calculated variable field inside some element leads to marking it for refinement. This is a consequence of the basic assumptions of this error estimator, which is intended to detect regions containing elements with higher variations of the simulated variable field. The next step divides elements using the longest-edge bisection method (Fig. 2). The presented solution can be easily adapted to thermomechanical problems of different types. At present, the mesh adaptation step is implemented sequentially in NuscaS. The parallel realization of adaptation is under development, based on the 8-Tetrahedra Longest-Edge (8T-LE) technique [10,1].
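The marking criterion can be illustrated with a short routine. The mesh representation, access functions and the tolerance are assumptions made for the sake of the example and do not reproduce the NuscaS API.

// Mark tetrahedral elements whose face-mean error indicator exceeds the user tolerance.
#include <cmath>
#include <vector>

struct Tetrahedron { int node[4]; };
static const int kFace[4][3] = { {0,1,2}, {0,1,3}, {0,2,3}, {1,2,3} };

std::vector<bool> markForRefinement(const std::vector<Tetrahedron>& elements,
                                    const std::vector<double>& nodalValue,
                                    double userTolerance)
{
  std::vector<bool> marked(elements.size(), false);
  for (std::size_t e = 0; e < elements.size(); ++e) {
    for (const auto& f : kFace) {
      const double v0 = nodalValue[elements[e].node[f[0]]];
      const double v1 = nodalValue[elements[e].node[f[1]]];
      const double v2 = nodalValue[elements[e].node[f[2]]];
      const double mean = (v0 + v1 + v2) / 3.0;
      // mark when any nodal value of the face deviates too much from the face mean
      if (std::fabs(v0 - mean) > userTolerance ||
          std::fabs(v1 - mean) > userTolerance ||
          std::fabs(v2 - mean) > userTolerance) {
        marked[e] = true;
        break;
      }
    }
  }
  return marked;
}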
3.2 Dynamic Load Balancing
The METIS package [5] is adopted as a mesh partitioner for the initial distribution of a FEM mesh to processors of a parallel architecture. During the adaptation, computational workloads change unpredictably at runtime, and the distribution of computations among processors becomes imbalanced.
Fig. 2. Using bisection for mesh adaptation
Therefore, dynamic load balancing steps are required. The problem of repartitioning of unstructured meshes is a multi-objective optimization problem, since the advantage of reducing the computation time after restoring the load balance should not be lost because of the additional costs of data redistribution. Currently the ParMETIS tool [4] is used for the repartitioning in NuscaS. ParMETIS provides the ParMETIS_V3_AdaptiveRepart routine for repartitioning such adaptively refined meshes. This routine assumes that the existing decomposition is well distributed among the processors, but it is imbalanced. Fig. 3 presents the steps of the proposed load balancing algorithm. At first, the global ordering of mesh nodes is derived. At the beginning of this step, processes exchange information about the numbers of local nodes assigned to them, using the MPI_Allgather routine. For process j, the global index g_i^j of a local node is computed as the sum of the local index l_i^j of this node and the numbers N_k^l of local nodes of processes 0 to j − 1:

g_i^j = l_i^j + Σ_{k=0}^{j−1} N_k^l .    (1)
To derive the global index of an external node, the neighboring processes exchange information about the indexes of their boundary nodes. In the second step, a distributed graph describing the adaptively refined mesh is built, based upon the global ordering. This step does not require interprocess communication. In the third step, this graph is repartitioned among processes, using the ParMETIS_V3_AdaptiveRepart routine from ParMETIS, but without physically performing the migration. After executing this step, a particular process will know to which processes its local nodes have to be assigned. Also, the neighboring processes exchange information about their boundary nodes, in order to build data structures describing the external nodes of processes. The fourth step is responsible for selecting which parts of the FEM mesh corresponding to a particular process should migrate to other processes. For this aim, the nodes which correspond to boundaries of subdomains after the repartitioning have to be selected (Fig. 4). Then, based on this information, all data structures necessary to implement interprocess communication are created, in order to perform the migration between processes in the next step.
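A minimal sketch of the first step (global ordering according to Eq. (1)) is given below; the function and variable names are illustrative.

// Each process learns the local-node counts of all processes with MPI_Allgather
// and turns its local indices into global ones according to Eq. (1).
#include <mpi.h>
#include <vector>

std::vector<int> globalNodeIndices(int numLocalNodes, MPI_Comm comm)
{
  int rank = 0, size = 0;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  std::vector<int> counts(size);
  MPI_Allgather(&numLocalNodes, 1, MPI_INT, counts.data(), 1, MPI_INT, comm);

  int offset = 0;                          // sum of N_k^l for k = 0 .. rank-1
  for (int k = 0; k < rank; ++k) offset += counts[k];

  std::vector<int> global(numLocalNodes);
  for (int l = 0; l < numLocalNodes; ++l)
    global[l] = offset + l;                // g_i^j = l_i^j + sum_k N_k^l
  return global;
}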
1. create the new global ordering of nodes
2. create the distributed graph that corresponds to an adaptively refined mesh
3. repartition the graph
4. select the data (nodes, elements, boundary conditions, process relationship) to migrate
5. send and receive the selected data
6. map the changes onto local data structures
Fig. 3. Steps of the load balancing algorithm
for all elements
  for all edges of the element
    if the nodes of the edge are assigned to different subdomains {
      1) the nodes are marked as boundary in both subdomains
      2) the nodes are included into data structures responsible for interprocessor communication
      3) the element is marked as assigned to both subdomains
    }
Fig. 4. Algorithm of selecting boundary nodes
The fifth step starts with initializing the messages to send. This requires computing the size of a message by applying the get_data_size() method, which returns the number of bytes necessary to store a certain object (node, element, boundary condition). Using the MPISerializer method, the objects are then serialized to a buffer whose size has been previously initialized. Once the buffer is filled up, its content is sent to a target process using the MPI_Isend routine. Load balancing finishes with processes receiving data and building their resulting FEM meshes. At first, new data structures are created to store the mesh assigned to a process; these structures correspond to those vertices, nodes and boundary conditions that stay in the process after repartitioning. Concurrently, messages from other processes are received. After receiving a message, the data structures describing the resulting FEM mesh for the process are completed using the deserialize method, which is responsible for performing deserialization of objects from the buffer.
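The send side of the migration can be sketched as follows. The record layout and helper names are invented for this example and do not reproduce the NuscaS get_data_size()/MPISerializer interface.

// Serialize node records into a contiguous byte buffer and ship them with a
// non-blocking send; the caller later waits on the returned request.
#include <mpi.h>
#include <cstring>
#include <vector>

struct NodeRecord { int globalId; double coord[3]; };

MPI_Request sendNodes(const std::vector<NodeRecord>& nodes, int target, MPI_Comm comm,
                      std::vector<char>& buffer /* must live until the send completes */)
{
  // 1. compute the message size and serialize the objects into the buffer
  buffer.resize(nodes.size() * sizeof(NodeRecord));
  std::memcpy(buffer.data(), nodes.data(), buffer.size());

  // 2. non-blocking send of the packed data
  MPI_Request req;
  MPI_Isend(buffer.data(), static_cast<int>(buffer.size()), MPI_BYTE,
            target, /*tag=*/42, comm, &req);
  return req;
}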
4 Performance Model for Load Balancing
After the mesh adaptation, the distribution of computational workloads among processes becomes imbalanced, which results in an increase of the computation time for the subsequent steps of a parallel numerical simulation. The load balancing step allows for restoring the balance of computational workloads. However, the cost of this step can be relatively large. Therefore, it is important to be able to determine the influence of the computation imbalance on the total cost of a numerical simulation, in order to decide whether the load balancing step should be performed or not. In this section, we propose a performance model which allows us to estimate the cost of the load balancing step, as well as the execution time of a computation step performed with either even or uneven load balance. The proposed model is a result of combining analytical methods and measurements. The expression approximating the algorithm execution time is based on theoretical dependencies between its components, which are derived using the knowledge of parameters describing a parallel architecture. To determine the values of these parameters, some measurements should be performed during the numerical simulation. The values of parameters obtained in this way are then used to estimate the execution time for each stage of subsequent computation steps.
4.1 Cost Function for Load Balancing
The load imbalance factor is defined as:

η = max_{0≤j<p} N_j^l / ( (1/p) Σ_{k=0}^{p−1} N_k^l ) .    (2)
The cost of load balancing is derived as a sum of the times required for creating a distributed graph, repartitioning the graph, and remapping the data. Numerical experiments show that each of these components depends linearly on the load imbalance factor, for a constant number of processors. Therefore, the load balancing time λ can be expressed using a linear approximation. To determine the coefficients a and b in the linear function λ(η) = aη + b, the least-squares method is used, giving the following system of equations:

a Σ_i η_i + b n = Σ_i λ_i
a Σ_i η_i^2 + b Σ_i η_i = Σ_i η_i λ_i    (3)

Therefore, to determine the cost of the load balancing step, this step should be performed at least twice.
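Fitting the coefficients amounts to solving the 2x2 normal equations (3); a small illustrative routine is shown below (names are ours, not from the package).

// Fit lambda(eta) = a*eta + b to the measured (eta_i, lambda_i) pairs.
#include <utility>
#include <vector>

std::pair<double, double> fitLoadBalancingCost(const std::vector<double>& eta,
                                               const std::vector<double>& lambda)
{
  const double n = static_cast<double>(eta.size());   // at least two measurements
  double se = 0.0, se2 = 0.0, sl = 0.0, sel = 0.0;
  for (std::size_t i = 0; i < eta.size(); ++i) {
    se  += eta[i];
    se2 += eta[i] * eta[i];
    sl  += lambda[i];
    sel += eta[i] * lambda[i];
  }
  // Solve:  a*se  + b*n  = sl
  //         a*se2 + b*se = sel
  const double det = se * se - n * se2;
  const double a = (sl * se - n * sel) / det;
  const double b = (se * sel - se2 * sl) / det;
  return {a, b};
}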
4.2 Cost Function for Building System of Equations
This stage of the FEM computations does not require communication between processes. As a result, the time necessary to build the system of equations can be estimated as the product of the number n_e of FEM elements assigned to a processor and the average time t_e of building the system of equations for a single element:

t_build = n_e t_e .    (4)

4.3 Cost Function for Solving System of Equations
Unlike building the system of equations, solving it requires interprocess communication. Therefore, we have to estimate the time of communication.

Communication time. The standard Hockney model is used to estimate the time T_comm necessary to transfer a message containing w elements [7]:

T_comm = α + βw ,    (5)

where α and β are respectively the start-up time and the per-word transfer time, while w is the number of transferred words. For collective operations, we assume that:

t_comm = log_2(p) (α + βw) .    (6)
Computation time. The computation time is estimated [7] using the following expression:

t_comp = n_flops t_f ,    (7)

where n_flops and t_f are respectively the number of floating point operations and the average time of one floating point operation.

Final expression. If the communication network in a parallel system has a sufficiently high bandwidth, for large systems of equations communication can be overlapped with computation during the matrix-vector multiplication [7]. It has been verified experimentally that even for a Fast Ethernet network this assumption is satisfied when processing more than several thousand mesh nodes on a processor. As a result, assuming a uniform assignment of mesh nodes to processors, we obtain the final expression for solving the system of equations:

t = (11 N^l + 4N + 10) t_f + 2 n_z t_f + 2 log_2(p) (α + 3β) ,    (8)

where N = N^i + N^b + N^e, and n_z denotes the number of nonzero elements in the matrix.
5 Results of Experiments
To verify and investigate the proposed performance model, we performed numerical experiments for a test FEM problem. All experiments were carried out on a Linux cluster containing seven 2-way nodes, each equipped with two dual-core Intel Itanium2 1.4 GHz processors with 12 MB L3 cache; the nodes were connected using Gigabit Ethernet. Except for Subsection 5.1, all results presented in this section were obtained for a FEM test problem with 19253 nodes, executed on four nodes of the cluster.
5.1 Scalability of Load Balancing Algorithm
Fig. 5 shows the execution time of the load balancing step as a function of the number of processor cores. The repartitioning (steps 1–3 in Fig. 3) and remapping (steps 4–6 in Fig. 3) are considered separately. The results achieved confirm a satisfactory scalability of the proposed solution and its usefulness for parallel FEM simulations.
5.2 Cost of Load Balancing Step
Fig. 6 presents graphs of the load balancing cost derived as a function of η in two different ways: (i) experimentally, by measurements, and (ii) using the proposed approximation with λ(η) = 0.406976η − 0.0668437. The experiments confirm an acceptable accuracy of the proposed approximation.
[Plot: execution time (s) versus number of processor cores (2–20); curves: only repartitioning, repartitioning + remapping]
Fig. 5. Execution time as a function of the number of processor cores for a FEM mesh containing 100584 nodes before adaptation and 152177 nodes after adaptation
5.3 Model-Guided Load Balancing
The procedure for determining the cost of load balancing and estimating the influence of the load imbalance on the execution time of the next step of a FEM simulation includes the following activities. Before performing the simulation, the values of the parameters α and β are estimated. The first two steps of the simulation in which adaptation takes place are responsible for gathering the performance model parameters, which are necessary to estimate the computation time for subsequent steps of the FEM simulation. Also, the times of executing the load balancing steps are measured; these times are used for deriving the approximation of λ(η) according to Eqns. (3). Other parameters which are measured include: N, n_z, and n_e; their values depend on the value of the parameter N^l (for FEM meshes generated with different mesh generators the values of these parameters will be different). Additionally, after building and solving the system of equations, the values of the parameters t_e and t_f are determined using expressions (4) and (8), respectively. Tab. 1 shows the estimated values of the model parameters. To estimate the time of solving the system of equations, it is assumed that the number of iterations in the next step is the same as in the current step. Fig. 7 presents a comparison of estimations of the execution time of solving the system of equations in two cases:
– the system is solved without load balancing (t*);
– solving with load balancing - in this case the execution time t** is estimated as the sum of the time necessary for load balancing and the time of solving the system of equations after restoring the load balance.
In Fig. 7, the right-hand graph shows the overhead ratio r defined as follows:

r = t** / t* .    (9)
[Plot: load balancing time (s) versus imbalance factor (1–4); curves: measured time, approximated time]
Fig. 6. Comparison of the load balancing cost (N = 19253, p = 4) derived as a function of the load imbalance factor either by measurements or using the approximation

Table 1. Estimated values of model parameters for the test FEM problem

parameter:  α       β          N         n_e      n_z       t_e        t_f
value:      61 μs   2.9·10^-8  1.16 N^l  5.6 N^l  13.5 N^l  4.6·10^-6  1.2·10^-8
[Plot, two panels: a) execution time (s) versus imbalance factor (1–4), curves: load balancing + solving, solving without load balancing; b) overhead ratio r versus imbalance factor (1–4)]
Fig. 7. Comparison of estimated execution time of solving system of equations in two cases: (i) system is solved without load balancing, (ii) solving with load balancing
It can be concluded that the difference between t** and t* decreases as the load imbalance factor increases. As a result, at this stage of developing the performance model we could recommend that it is advisable to perform the load balancing step if the overhead ratio r does not exceed 1.25. However, such a recommendation alone is not sufficient. It should be complemented with additional criteria which allow for performing the load balancing step, for example, in the
case when the adaptation is no longer carried out, but the distribution of computations among processes is imbalanced.
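A possible way to turn the model into a decision rule is sketched below. The threshold of 1.25 follows the recommendation above; the estimate of the imbalanced solve time as η times the balanced one is a deliberate simplification of Eq. (8), and all names are ours.

// Decide whether to run the load balancing step before the next solve.
struct ModelParameters { double a, b; };   // lambda(eta) = a*eta + b, fitted from Eq. (3)

// crude proxy: the slowest process carries roughly eta times the average load
double predictedSolveTimeImbalanced(double eta, double balancedSolveTime)
{
  return eta * balancedSolveTime;
}

bool shouldRebalance(double eta, double balancedSolveTime,
                     const ModelParameters& m, double threshold = 1.25)
{
  const double tStar     = predictedSolveTimeImbalanced(eta, balancedSolveTime);
  const double lambdaEta = m.a * eta + m.b;               // cost of load balancing
  const double tStarStar = lambdaEta + balancedSolveTime;  // balance first, then solve
  const double r = tStarStar / tStar;                      // overhead ratio, Eq. (9)
  return r <= threshold;
}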
6 Conclusions
The primary goal of the performance model proposed in this paper for estimating the load balancing cost is to make it possible to determine the influence of this cost on the execution time of the next step of a FEM simulation performed on a distributed-memory parallel computer. Potentially, this initial model will allow us to build a more thorough model capable of deciding at which moments of the whole FEM simulation it is advisable to perform the load balancing step in order to minimize the total execution time of the simulation. Achieving this goal requires further research, which in the first place should address the problem of predicting what the mesh adaptation will look like in consecutive steps of the FEM simulation. Acknowledgment. This work was supported by the Polish Ministry of Science and Higher Education under grant no. N516 4280 33.
References 1. Balman, M.: Tetrahedral Mesh Refinement in Distributed Environments. In: Proc. IEEE Int. Conf. on Parallel Processing Workshops - ICPP 2006 (The 8th Workshop on High Performance Scientific and Engineering Computing - HPSEC 2006), pp. 497–504. IEEE Computer Society, Los Alamitos (2006) 2. Bana´s, K.: Scalability Analysis for a Multigrid Linear Equations Solver. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Wasniewski, J. (eds.) PPAM 2007. LNCS, vol. 4967, pp. 1265–1274. Springer, Heidelberg (2008) 3. Hulsemann, F., Kowarschik, M., Mohr, M., Rude, U.: Parallel Geometric Multigrid. Lecture Notes in Computational Science and Engineering 51, 165–208 (2006) 4. Karypis, G., Schloegel, K., Kumar, V.: PARMETIS Parallel Graph Partitioning and Sparse Matrix Ordering Library Version 3.1. Univ. Minnesota, Army HPC Research Center (2003), http://glaros.dtc.umn.edu/gkhome/fetch/sw/parmetis/manual.pdf 5. METIS - Family of Multilevel Partitioning Algorithms, http://glaros.dtc.umn.edu/gkhome/views/metis 6. Olas, T., Karczewski, K., Tomas, A., Wyrzykowski, R.: FEM computations on clusters using different models of parallel programming. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Wa´sniewski, J. (eds.) PPAM 2001. LNCS, vol. 2328, pp. 170–182. Springer, Heidelberg (2002) 7. Olas, T., Wyrzykowski, R., Karczewski, K., Tomas, A.: Performance Modeling of Parallel FEM Computations on Clusters. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Wa´sniewski, J. (eds.) PPAM 2004. LNCS, vol. 3019, pp. 189–200. Springer, Heidelberg (2004) 8. Olas, T., Wyrzykowski, R.: Porting Thermomechanical Applications to Grid Environments. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Wa´sniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 364–372. Springer, Heidelberg (2006)
9. Patzak, B., Rypl, D.: A Framework for Parallel Adaptive Finite Element Computations with Dynamic Load Balancing. In: Proc. First Int. Conf. Parallel, Distributed and Grid Computing for Engineering, Paper 31. Civil-Comp Press (2009) 10. Plaza, A., Rivara, M.: Mesh Refinement Based on the 8-Tetrahedra Longest-Edge Partition. In: Proc. 12th Int. Meshing Roundtable, Sandia National Laboratories, pp. 67–78 (2003) 11. Romanazzi, G., Jimack, P.K.: Performance Prediction for Multigrid Codes Implemented with Different Parallel Strategies. In: Proc. First Int. Conf. Parallel, Distributed and Grid Computing for Engineering, Paper 43. Civil-Comp Press (2009) 12. Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. SIAM, Philadelphia (2003) 13. Sczygiol, N.: Object-oriented analysis of the numerical modelling of castings solidification. Computer Assisted Mechanics and Engineering Sciences 8(2), 79–98 (2001) 14. Wyrzykowski, R., Olas, T., Sczygion, N.: Object-oriented approach to finite element modeling on clusters. In: Sørevik, T., Manne, F., Moe, R., Gebremedhin, A.H. (eds.) PARA 2000. LNCS, vol. 1947, pp. 250–257. Springer, Heidelberg (2001) 15. Wyrzykowski, R., Meyer, N., Olas, T., Kuczynski, L., Ludwiczak, B., Czaplewski, C., Oldziej, S.: Meta-computations on the CLUSTERIX Grid. In: K˚ agstr¨ om, B., Elmroth, E., Dongarra, J., Wa´sniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 489–500. Springer, Heidelberg (2007)
Parallel Implementation of Multidimensional Scaling Algorithm Based on Particle Dynamics
Piotr Pawliczek1 and Witold Dzwinel1,2
1 AGH University of Science and Technology, Institute of Computer Science, Krakow, Poland
2 WSEiP, College of Economy and Law, Department of Computer Science, Kielce, Poland
Abstract. We propose here a parallel implementation of the multidimensional scaling (MDS) method which can be used for visualization of large datasets of multidimensional data. Unlike traditional approaches, which employ classical minimization methods for finding the global optimum of the “stress function”, we use a heuristic based on particle dynamics. This method allows avoiding local minima and is convergent to the global one. However, due to its O(N^2) complexity, the application of this method in data mining problems involving large datasets requires efficient parallel codes. We show that, employing both the optimized Taylor’s algorithm and a hybridized model of parallel computations, our solver is efficient enough to visualize multidimensional data sets consisting of 10^4 feature vectors in a matter of minutes. Keywords: data mining, multidimensional scaling, method of particles, parallel algorithm.
1 Introduction
The most common approach to the analysis of objects is treating them as vectors of descriptors (features) in a D-dimensional feature space, where D is the number of features, i.e., Xi = (Xi1, Xi2, . . . , XiD), where Xij ∈ R. In this paper the term “input data” refers to N vectors X1, X2, . . . , XN, where Xi ∈ R^D. Clustering tools can then be used for finding data dependences and for building data models. It is well known, however, that due to the so-called “curse of dimensionality”, the results of clustering of noisy data in D-dimensional space are highly unreliable. The results of clustering are very sensitive to the algorithm parameters. For huge data sets and problems from data mining, the choice of a proper data analysis method becomes extremely difficult. This is caused by our limited ability to imagine what data look like in multidimensional spaces. The transformation of a multidimensional dataset into 2-D or 3-D space, which conserves most of the dataset properties, could give us valuable information about the data structure. Graphical presentation enables observation of clusters and other more complicated correlations between data, which are hidden behind raw numbers.
Following [1,2], four general approaches to visualize high dimensional data can be identified:
1. Use multiple views of the data, each covering a subset of dimensions.
2. Embedding dimensions within other dimensions (dimensional stacking).
3. Use non-orthogonal axes and display data along all axes simultaneously (Chernoff faces).
4. Reduce the dimensionality by a mapping transformation to two or three dimensions and then visualize the data. This kind of approach is called dimensionality reduction.
A variety of dimensionality reduction methods have been elaborated already (e.g., [2,3,4,5,6]). The majority of them are based on linear or nonlinear transformations of multidimensional data onto two- or three-dimensional space. Multidimensional scaling (MDS) belongs to the nonlinear class of data extraction procedures. Multidimensional scaling maps the objects X = X1, X2, . . . , XN from the D-dimensional feature space onto their images x = x1, x2, . . . , xN in a d-dimensional target space (d = 2, 3 and d ≪ D) such that the distances between the D-dimensional points (data objects) are preserved as accurately as possible, i.e.:

F : X ∈ R^D → x ∈ R^d ;  D ≫ d  ∧  min_x E(x, X) = (1/C_N) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} w_ij · (δ_ij − p_ij)^2    (1)

where the input data are defined by a square, symmetric and positive distance matrix M = [δij]N×N and δij is the dissimilarity measure between vectors i and j, while the distance matrix P = [pij]N×N collects the Euclidean distances pij between points in the d-dimensional target space. The MDS mapping minimizes the quadratic loss function E(x, X), called “the stress function”, where CN and wij are free parameters, which depend on the MDS goals [4,5]. The main obstacle in employing MDS for mapping large datasets is its time complexity. It depends linearly on the number of distances between data objects and exponentially on the number of local minima of the “stress function”. The second problem can be partially overcome by using a very efficient heuristic based on the N-body (virtual particle) solver presented in [7,8]. However, its time complexity is bounded from below by the quadratic term O(N^2), depending on the number of distances fit by the “stress function”. To speed up the computations, in [9] we proposed an implementation of the mapping algorithm based on the virtual particle method [7] operating on an incomplete distance matrix M. Here, we present an efficient parallel algorithm allowing for an additional reduction of the execution time. First, we briefly present the virtual particle method. Then we describe the parallel algorithm and the results of tests. Finally, we discuss the conclusions.
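For reference, the stress function of Eq. (1) can be evaluated as follows for the simple case w_ij = 1 with C_N equal to the number of distance pairs; the flat row-major layout of the target-space coordinates is an assumption of this example.

// Evaluate E(x, X) from the dissimilarity matrix and the current embedding.
#include <cmath>
#include <vector>

double stress(const std::vector<std::vector<double>>& delta, // dissimilarities, N x N
              const std::vector<double>& x,                  // target coordinates, N*d
              int d)
{
  const std::size_t N = delta.size();
  double e = 0.0;
  std::size_t pairs = 0;
  for (std::size_t i = 0; i + 1 < N; ++i) {
    for (std::size_t j = i + 1; j < N; ++j) {
      double p2 = 0.0;                       // squared Euclidean distance in target space
      for (int k = 0; k < d; ++k) {
        const double diff = x[i * d + k] - x[j * d + k];
        p2 += diff * diff;
      }
      const double diff = delta[i][j] - std::sqrt(p2);
      e += diff * diff;
      ++pairs;
    }
  }
  return pairs ? e / static_cast<double>(pairs) : 0.0;   // the 1/C_N factor
}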
2 MDS and Particle Dynamics
The idea of the heuristics based on virtual particles was described in [7,8]. The principal idea of the method consists in minimization of the “stress function”
(see Eq. 1), being the sum of potentials of interacting particles. The particles, which correspond to the respective objects (points) from the target space, interact with each other and change their positions and velocities according to the Newtonian rules of motion. Each pair of particles interacts via a semi-harmonic pair potential, which is the source of attraction and repulsion forces. Both the directions and strength of the forces acting between particles i and j are determined by the difference between the corresponding vectors’ distance δij in the source space and the current particles’ distance pij. Initially, the positions of the particles are chosen randomly and their velocities are set to zero. When the simulation starts, particles move due to the semi-harmonic forces acting among them. Energy is dissipated by a friction force which is proportional to the particle velocity. Therefore, after a number of iterations the equilibrium is reached. Then, the locations of the motionless particles xi in the low-dimensional target space represent the locations of the corresponding vectors Xi in the D-dimensional feature space. The algorithm described above requires the calculation of N(N−1)/2 distances in every timestep of the integration of the Newtonian equations of motion, whereas the number of particle coordinates in equilibrium, which represents the global minimum of E, increases linearly with N (as d·N, where N is the number of feature vectors). Thus, the system of nonlinear equations which has to be solved to minimize “the stress function” (Eq. 1) is overdetermined. Therefore, the simplest method for reducing the time complexity of the MDS algorithm is to calculate only a part of all distances. The repercussions of this procedure were discussed in [9]. Parallelization is the other option for reducing the computational load. The formerly used parallel algorithms [8] involve replication of the whole distance arrays M = [δij]N×N and P = [pij]N×N in the memory of each computational node. In the following section we describe an efficient parallel algorithm which is free of this disadvantage. A specialized hybrid algorithm is composed in order to adapt the virtual particle method to parallel computations in a cluster environment.
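A minimal serial sketch of one timestep of this heuristic is given below: each pair exerts a spring-like force proportional to (δ_ij − p_ij), velocities are damped by friction, and positions are advanced with an explicit Euler step. The parameter values and the exact force law are assumptions made for illustration, not the authors’ implementation.

#include <cmath>
#include <vector>

void mdsParticleStep(const std::vector<std::vector<double>>& delta, // dissimilarities, N x N
                     std::vector<double>& pos,  // N*d coordinates
                     std::vector<double>& vel,  // N*d velocities
                     int d, double dt = 0.05, double k = 1.0, double friction = 0.1)
{
  const std::size_t N = delta.size();
  std::vector<double> force(pos.size(), 0.0);

  for (std::size_t i = 0; i + 1 < N; ++i)
    for (std::size_t j = i + 1; j < N; ++j) {
      double p2 = 0.0;
      for (int c = 0; c < d; ++c) {
        const double diff = pos[i * d + c] - pos[j * d + c];
        p2 += diff * diff;
      }
      const double p = std::sqrt(p2) + 1e-12;
      const double f = k * (delta[i][j] - p);      // > 0 pushes apart, < 0 pulls together
      for (int c = 0; c < d; ++c) {
        const double dir = (pos[i * d + c] - pos[j * d + c]) / p;
        force[i * d + c] += f * dir;
        force[j * d + c] -= f * dir;
      }
    }

  for (std::size_t i = 0; i < pos.size(); ++i) {
    vel[i] = (1.0 - friction) * vel[i] + dt * force[i];   // friction dissipates energy
    pos[i] += dt * vel[i];
  }
}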
3 Data Structure and Algorithm
Our solution is tailored to a cluster environment and uses the MPI standard. Each MPI process is run on a separate cluster node. The dissimilarity matrix and the calculations are spread out among the nodes with the use of the UTFBD algorithm [10], which is an optimized version of the well-known Taylor force-decomposition algorithm [11]. Calculations within an MPI process are additionally parallelized with the use of OpenMP or POSIX threads (both versions are implemented). The cluster nodes used for the calculations are modeled as half of a matrix Un,n = [urc]n,n. Elements urc lying on or under the main diagonal (r ≥ c) represent MPI processes (cluster nodes). The process which lies at the top left corner of the matrix has rank zero. The rest of the processes are ordered from left to right and from top to bottom, as shown in Fig. 1. As a result, for every element
urc of the matrix Un,n for which r ≥ c, exactly one process is assigned, with MPI rank equal to:

(r − 1) · r / 2 + c − 1 .    (2)

This data structure constrains the number of cluster nodes that are used during computations to n·(n + 1)/2, where n is the size of the matrix U (given in advance).
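A small helper illustrating this triangular layout (mapping a matrix position (r, c), r ≥ c, to an MPI rank according to Eq. (2), and back) is shown below; indices are 1-based as in the text, and this is an illustration rather than the authors’ code.

#include <utility>

int rankOf(int r, int c)                  // r >= c >= 1
{
  return (r - 1) * r / 2 + c - 1;
}

std::pair<int, int> positionOf(int rank)  // inverse mapping
{
  int r = 1;
  while (rankOf(r + 1, 1) <= rank) ++r;   // largest r with (r-1)r/2 <= rank
  const int c = rank - rankOf(r, 1) + 1;
  return {r, c};
}
// e.g. rankOf(1,1) == 0, rankOf(2,1) == 1, rankOf(2,2) == 2, rankOf(3,1) == 3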
Fig. 1. The ordering of MPI processes in Un,n matrix and schema of communication between nodes during simulation
The particles are indexed from 1 to N. Their positions and velocities are denoted as sequences of vectors x1, x2, . . . , xN and v1, v2, . . . , vN, respectively. Both of these sequences are divided into n separable subsequences indexed by consecutive integers starting from one. They are denoted by Sx(1), Sx(2), . . . , Sx(n) and Sv(1), Sv(2), . . . , Sv(n). To describe the relationship between particle indices and the numbers of subsequences, two functions named size(g) and first(g) are introduced:

size(g) = ⌈N/n⌉ for g ≤ N mod n,  ⌊N/n⌋ for g > N mod n,    g = 1, 2, . . . , n
first(g) = 1 for g = 1,  1 + Σ_{i=1}^{g−1} size(i) for g > 1    (3)

The value of size(g) determines the size of subsequence g, and the value of first(g) determines the index of its first particle (0 < g ≤ n). The subsequence of particle positions with index g is defined as:

Sx(g) = ( x_first(g), x_first(g)+1, x_first(g)+2, . . . , x_first(g)+size(g)−1 )    (4)

Analogously, n subsequences with velocities are defined:

Sv(g) = ( v_first(g), v_first(g)+1, v_first(g)+2, . . . , v_first(g)+size(g)−1 )    (5)
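A sketch of this block distribution, under the assumption that the first (N mod n) subsequences receive one extra particle, is:

// Indices are 1-based as in the text.
int seqSize(int g, int N, int n)
{
  return N / n + (g <= N % n ? 1 : 0);
}

int seqFirst(int g, int N, int n)
{
  int first = 1;
  for (int i = 1; i < g; ++i) first += seqSize(i, N, n);
  return first;
}
// e.g. for N = 10, n = 3: sizes 4, 3, 3 and first indices 1, 5, 8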
The positions’ sequences Sx (g) and velocities’ sequences Sv (g) are assigned to MPI processes according to their positions in matrix U. Sequences Sx (g) are duplicated along rows and columns, while sequences Sv (g) are assigned only to processes lying on the main diagonal. Thus, processes urc lying on main diagonal (r = c) get positions’ sequence Sx (r) and velocities’ sequence Sv (r). The rest of processes (r > c) get two positions’ sequences: Sx (r) and Sx (c). During simulation every process urc from diagonal (r = c) keeps tracking of all distances (both in target and source spaces) from the set: { xi xj : xi , xj ∈ Sx (r)}
(6)
The rest of processes, urc (r > c), keep tracking of all distances from the sets: { xi xj : xi ∈ Sx (r), xj ∈ Sx (c)}
(7)
In each timestep, current distances between particles P are compared with values from the input matrix of dissimilarities. Consequently, total forces acting on each particle are computed and particles are moved according to the Newtonian dynamics. Because particles’ positions are scattered between nodes, each node may compute only partial forces acting on their particles. The sequences of partial forces are denoted as sF . They are defined analogically as sequences Sx in Eq.4: (8) sF (g, h) = fh,f irst(g) , fh,f irst(g)+1 , fh,f irst(g)+2 , . . . , fh,f irst(g)+size(g)−1 where fh,i is a part of the total force acting on particle i, exerted by particles from Sx (h) sequence. Every sequence sF (g, h) may be computed by the process ugh for g < h or by the process uhg when g ≥ h. At this stage no communication between processes is required. To compute sequence of total forces SF (g) acting on particles from Sx (g), corresponding elements from sF (g, h) sequences must be added. They are gathered by the processes ugg from the main diagonal of Un,n matrix. This procedure can be expressed as follows: n n n fh,f irst(g) , fh,f irst(g)+1 , . . . , fh,f irst(g)+size(g)−1 SF (g) = (9) h=1
h=1
h=1
The schematic of one iteration is presented in Tab. 1. To exchange data among processes only two MPI functions are needed: MPI Bcast and MPI Reduce. Unfortunately, both of them are blocking functions. That causes a serious problem with efficiency drop. During each timestep, the processes lying out of diagonal have to execute each function twice: 1. for synchronization of the first dataset along row 2. for synchronization of the second dataset along column. It is easy to notice that these two operations may be executed in parallel, because they work on separable buffers. To bypass the lack of nonblocking versions
Parallel Implementation of Multidimensional Scaling Algorithm
317
Table 1. The functions of processes u from matrix Un,n during a single timestep processes from diagonal (r = c) processes lying out of diagonal (r > c) update velocities vectors from Sv (r) and do nothing positions vectors from Sx (r) broadcast particles positions vectors (se- receive two sequences with positions vecquence Sx (r)) along row and column tors: Sx (r) from process urr and Sx (c) from process ucc compute sF (r, r) on the ground of differ- compute sF (r, c) and sF (c, r) on the ence between current particles distances ground of difference between current parand original distances ticles distances and original distances compute SF (r) by gather and addition send sF (r, c) to process urr and sF (c, r) to proper sF sequences from processes lying process ucc in the same row and column
of these functions in MPI standard, POSIX threads are employed. Every process lying out of diagonal uses two threads during data synchronization. One thread synchronizes data along row and another synchronizes data along column. The application described below was implemented in C++ with the use of MPICH2 implementation of MPI and POSIX threads (known as pthreads). Complete program was tuned up for using on cluster named “mars” from ACC CYFRONET AGH (Krakow, Poland). This cluster consists of 56 IBM Blade Center HS21 servers with two Dual-core 2.66 GHz Intel Xeon processors each. Each MPI process was started on separate node and reserved all four slots. To achieve better performance, computations on every MPI process were divided between four threads.
4
Results and Timings
The implementation of multidimensional scaling described in this paper was tested on two datasets. The first one was generated artificially and contains
Fig. 2. The result of mapping of artificial created dataset with two well separated clusters into 3-D target space. The input dataset consists of 10000 vectors from 60dimensional feature space.
318
P. Pawliczek and W. Dzwinel
Fig. 3. The timings, speedup and efficiency for various nodes’ number obtained during mapping first dataset (consists of 10000 vectors). Four cores were used on each node. The speedup and efficiency were calculated versus serial program’s version (employing one core only) as well as parallel version running on one node and employing four cores.
10000 objects from 60-dimensional feature space. It consists of two well separated clusters of 5000 points each. Coordinates of each object were generated by the following algorithm. A half of them were randomly selected from interval [−1, 1]. Remaining coordinates are randomly selected from interval [−1, 0] for points belonging to the first class and [0, 1] for objects from the second class. In the
Parallel Implementation of Multidimensional Scaling Algorithm
319
Fig. 4. The results of multidimensional scaling of MNIST database into three dimensions. In target space this dataset forms sphere with some points scattered around it. Four views of output dataset were presented in the figure. Only a few chosen classes are visible on each view. Table 2. Time cost of one iteration for various nodes’ number. The timings were collected during mapping onto 3-D space MNIST dataset with 60 thousands vectors. Number of Number of Average time of one nodes threads iteration in seconds 45 180 2,19 91 364 1,17 136 544 0,96
target space, the points formed two separated hemispheres. A screenshot of the data visualization in the 3-D target space is shown in Fig. 2. The first input dataset was processed several times for various sizes of the U matrix. Four cores on each node were utilized by dividing the calculations among four
POSIX threads. The timings represent the average execution time of one iteration. Additionally, for comparison purposes and to obtain speed-ups and efficiency, an analogous test was performed on the serial version of the program, which uses only one process with one thread. The timings are collected in Fig. 3. Speed-ups and efficiency were calculated in relation to the serial version of the program (utilizing only one core) as well as to the parallel implementation running on a single node and employing four cores. The first shows the total speed-up achieved by our solution. The second represents the efficiency of the algorithm of distributing computations between nodes. Dividing the calculations within one node is very efficient and yields a 3.9 speed-up for four threads (efficiency equals 97%). Parallelization across nodes gives a much worse effect and attains efficiency ranging from 51% on 10 nodes to 29% on 36 nodes. This is caused by the overhead of communication among nodes. The second dataset is from the MNIST database [12]. It consists of 60000 normalized images of handwritten digits. Each image is represented by a single vector with 784 coordinates, whose values correspond to the image’s pixels. The results of the mapping are presented in Fig. 4. As we can see, this simple approach is sufficient to separate some pairs of digits, e.g., the pairs 0-1, 0-9 or 6-7. The most problematic are digits 5 and 8; their clusters interfere with all the remaining digits. The average time of one iteration was measured during processing the MNIST dataset on 45, 91 and 136 cluster nodes. The calculations within one MPI process were distributed among four threads, so four CPU cores were employed on each node. The timings are presented in Table 2. After analyzing the time costs collected in that table we can conclude that the algorithm scales well up to at least 90 nodes. Due to the size of the dataset we were not able to determine the values of speedup and efficiency: the distance matrix is too large to process this data on a single node.
5 Conclusions
In this paper we presented a parallel version of the multidimensional scaling algorithm. It enables the mapping of large multidimensional datasets into 3-D space and can efficiently utilize a cluster of computers. The presented algorithm was implemented and tested with the use of the MPI environment and OpenMP or POSIX threads. It attained an efficiency of up to 50% on 10 cluster nodes for a dataset with ten thousand vectors. On 36 quad-core nodes it allows obtaining a speedup of over 40 in comparison with the serial implementation employing a single core. The matrix of inter-point distances is scattered between nodes during the calculations. This enables the mapping of datasets which cannot be processed in one piece on a single computer due to memory limitations. In future work we will merge this algorithm with methods of reducing the number of distances employed during mapping. This would enable applying multidimensional scaling to much larger datasets. Acknowledgments. This research is financed by the Polish Ministry of Education and Science, Project No. 3 T11F 010 30 and internal AGH Institute of
Computer Science grant No. 11.11.120.777. The majority of the calculations were performed on the cluster “mars” from the Academic Computer Centre CYFRONET AGH in Krakow (grant No. MNiSW/IBM BC HS21/AGH/144/2007).
References 1. Torgerson, W.S.: Multidimensional scaling. 1. Theory and method. Psychometrika 17, 401–419 (1952) 2. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290, 2319–2323 (2000) 3. Roweis, S.T., Saul, L.K.: Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 290, 2323–2326 (2000) 4. Kruskal, J.: Multidimensional scaling by optimizing goodness-of-fit to a nonmetric hypothesis. Psychometrika 29, 1–27 (1964) 5. Sammon, J.W.: A nonlinear mapping for data structure analysis. IEEE Trans. Comput. C-18(5), 401–409 (1969) 6. Dzwinel, W.: How to Make Sammon’s Mapping Useful for Multidimensional Data Structures Analysis? Pattern Recognition 27(7), 949–959 (1994) 7. Dzwinel, W.: Virtual particles and search for global minimum. Future Generation Computer Systems 12, 371–389 (1997) 8. Dzwinel, W., Basiak, J.: Method of particles in visual clustering of multidimensional and large data sets. Future Generation Computer Systems 15, 365–379 (1999) 9. Pawliczek, P., Dzwinel, W.: Visual Analysis of Multidimensional Data Using Fast MDS algorithm. In: Proceedings of 20th IEEE-SPIE Symposium Photonics, Electronics and Web Engineering, Signal Processing Syposium, Jachranka, Poland, May 24-26 (2007) 10. Shu, J., Wang, B., Chen, M., Wang, J., Zheng, W.: Optimization techniques for parallel force-decomposition algorithm in molecular dynamic simulation. Computer Physics Communications 154, 121–130 (2003) 11. Taylor, V.E., Stevens, R.L., Arnold, K.E.: Parallel molecular dynamics: Communication requirements for massively parallel machines. In: Proceedings of Frontiers 1995, Fifth Symposium on the Frontiers of Massively Parallel Computation, p. 156. IEEE Comput. Society Press, Los Alamitos (1994) 12. LeCun, Y., Cortes, C.: The MNIST database of handwritten digits, http://yann.lecun.com/exdb/mnist/
Particle Model of Tumor Growth and Its Parallel Implementation
Rafal Wcislo and Witold Dzwinel
AGH University of Science and Technology, Institute of Computer Science, Cracow, Poland
{wcislo,dzwinel}@agh.edu.pl
Abstract. We present a concept of a parallel implementation of a novel 3-D model of tumor growth. The model is based on the dynamics of particles, which are the building blocks of normal, cancerous and vascular tissues. The dynamics of the system is also driven by processes in microscopic scales (e.g. the cell life-cycle), diffusive substances – nutrients and TAF (tumor angiogenic factors) – and blood flow. We show that the cell life-cycle (particle production and annihilation), the existence of elongated particles, and the influence of continuum fields and blood flow in capillaries make the model very tough to parallelize in comparison to standard MD codes. We present preliminary timings of our parallel implementation and we discuss the perspectives of our approach.
1 Introduction
Typically, tumor growth is divided into three stages: avascular growth, angiogenesis, and vascular growth [1]. In the earliest stage, the tumor develops in the absence of blood supply and grows up to a maximum size. Its growth is limited by the amount of nutrients the tumor can obtain through its surface. In the angiogenic phase [1], some of the cells of the avascular tumor mass produce and release substances called tumor angiogenic factors (TAFs). TAF diffuses through the surrounding tissue and, upon arrival at the vasculature, it triggers a cascade of events which stimulates the growth of the vasculature towards the tumor. In the vascular stage the tumor has access to virtually unlimited resources, so it can grow beyond its limited avascular size. Moreover, it can use the blood network for spreading cancerogenic material to form metastases in any part of the host organism. Thus, whereas in the avascular phase tumors are basically harmless, once they become vascular they are potentially fatal. A better understanding of the dynamics of tumor formation and growth can be expected from mathematical modeling and computer simulation. A rich bibliography about mathematical models of tumor growth driven by the process of angiogenesis is overviewed in [2,3,4,5]. The models fall into four categories: (a) continuum models that treat the endothelial cell (EC) and chemical species densities as continuous variables that evolve according to a reaction-diffusion system, (b) mechano-chemical models that incorporate some of the mechanical effects of EC-ECM (extracellular matrix) interactions, (c) discrete,
cellular automata or agent-based models in which cells are treated as units which grow and divide according to prescribed rules, (d) hybrid multiscale models involving processes from micro- to macroscale. Multiscale and multiphysics models represent the most advanced simulation methodologies. Disregarding all microscopic and mesoscopic biological and biophysical processes, in macroscopic scales tumor growth looks like a purely mechanical phenomenon. It involves dissipative interactions between the main actors: normal and cancerous tissues and the vascular network. None of the existing computational paradigms is able to reproduce this basic process, which most strongly influences tumor and vasculature remodeling, tumor shape and its directional progression. In [6,7] we present the concept of a tumor growth model which is driven by particle dynamics [8]. In fine-grained models a particle mimics a single cell or components of the ECM. Nevertheless, this assumption becomes very computationally demanding for modeling tumors of realistic sizes. The largest MD (molecular dynamics) simulations involve 2×10^10 particles [9] for a rather unimpressive (only 50-100) number of timesteps. The computational restrictions impose current limits on the spatiotemporal scales simulated by a particle model of tumor growth. A tumor of 1 mm in diameter consists of at least a million cells [2]. To simulate in 3-D a tumor of this size and its closest environment, one needs at least 1.6×10^7 particles simulated in 10^5 timesteps. This requires approximately the same computational resources as the largest MD simulations. However, the particle method can also be used in its coarse-grained form. Unlike in a purely fine-grained particle model, where all cells and tissue ingredients are represented by particles, in our model a particle can mimic a cluster of cells in an ECM envelope. The particle system representing a growing tumor is very different from standard particle codes such as molecular dynamics (MD) (e.g. in [10,11,12]). The number of cells increases and/or fluctuates because they can replicate or disappear. The particles have attributes which evolve according to the rules of cellular automata. Moreover, their values depend on concentration fields which are coupled with the particle dynamics. Thus, the process of parallelization is expected to be more complicated than for standard particle codes. Below, we give a brief description of our model and we discuss its parallel implementation. We also present examples of trial simulations and preliminary computational timings. Finally, we discuss the conclusions.
2 Model Description
In this section we give a brief description of particle model of tumor growth. More detailed discussion can be found in [7]. We assume that the fragment of tissue is made of a set ΛN = {Oi : O(r i , V i , ai ), i = 1, . . . , N } of particles where: i - the particle index; N - the number of particles, r i , V i , ai - particle position, velocity and attributes, respectively. The attributes are defined by particle type {tumor cell (TC), normal cell (NC), endothelial cell (EC)}, cell life cycle state {newly formed, mature, in hypoxia, after hypoxia, apoptosis, necrosis}, cell size, cell
age, hypoxia time, concentrations of TAF, O2 and other factors, and the total pressure exerted on particle i from its closest neighbors and the walls of the computational box. The particles are confined in a cubical computational box of volume V. For the sake of simplicity we assume here that the vessel is constructed of tube-like “particles” – EC-tubes – each made of many endothelial cells. Summarizing, we define three types of interactions: sphere-sphere, sphere-tube, and tube-tube. The forces between cells should mimic both mechanical repulsion from squashed cells and attraction due to cell adhesiveness and depletion interactions. The particle dynamics is governed by the Newtonian laws. During the life-cycle, the normal and tumor cells change their states from new to apoptotic (or necrotic) according to their individual clock (and/or oxygen concentration). Cells of a certain age and size, being well oxygenated, undergo the process of mitosis. They form two daughter cells. Finally, after a given time threshold, the particles die due to apoptosis, i.e., programmed cell death. Dead cells are removed from the system. For an oxygen concentration smaller than a given threshold, the living cell changes its state to hypoxia. Such cells become the source of TAF. The cells which are in hypoxia for a period of time longer than a given threshold die and become necrotic. We assume that, in the beginning, the diameter of a necrotic cell decreases twofold and, after some time, the cell is removed. This is contrary to apoptotic cells, which are rapidly digested by their neighbors or by macrophages. The duration of the phases of the cell cycle for normal and tumor cells differs considerably. Moreover, the proliferation rate of tumor cells which were in the hypoxia state can change [7]. The cell cycle for EC-tubes is different from that for spherical cells. In fact, EC-tubes are clusters of endothelial cells. The tubes grow both in length and in diameter. They can collapse due to a combination of reduced blood flow, the lack of VEGF, dilation, perfusion and solid stress exerted by the tumor. Because the EC-tube is a cluster of EC cells, its division into two adjoined tubes does not represent the process of mitosis but is a computational metaphor of vessel growth. Unlike normal and tumor cells, the tubes can appear as tips of newly created capillaries sprouting from existing vessels. A new sprout is formed when the TAF concentration exceeds a given threshold, and its growth is directed along the local gradient of TAF concentration. The transport of blood-borne oxygen and TAF into the tissue is modeled by means of reaction-diffusion equations. The distribution of hematocrit is the source of oxygen, while the distribution of tumor cells in hypoxia is the source of TAF. We assume that cells of any type consume oxygen and that the rate of oxygen consumption depends on both the cell type and its current state. We assume additionally that only EC-tubes absorb TAF. TAF is washed out from the system due to blood flow. Diffusion of oxygen and TAF is many orders of magnitude faster than the process of tumor growth. On the other hand, the blood circulation is slower than diffusion but still faster than the mitosis cycle. Therefore, we can assume that both the concentrations and hydrodynamic quantities are in a steady state on the time scale defined by the timestep of the numerical integration of the equations of motion.
This allows for employing fast approximation procedures both for the calculation of blood flow rates in capillaries and for solving the reaction-diffusion equations (see [7]).
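To make the life-cycle rules above concrete, the following minimal sketch shows how the state of a single cell could be updated in one time step; the attribute names and the threshold values are illustrative assumptions and do not come from [7].

```python
# Illustrative sketch of the per-cell state update described above.
# All threshold names and numeric values are assumptions for demonstration only.
O2_HYPOXIA_THRESHOLD = 0.1     # oxygen level below which a cell enters hypoxia
MAX_HYPOXIA_TIME     = 50      # time steps a cell survives in hypoxia
APOPTOSIS_AGE        = 500     # programmed-death age

def update_cell_state(cell, dt):
    cell.age += dt
    if cell.state in ("apoptosis", "necrosis"):
        return  # dead cells are removed from the system elsewhere
    if cell.o2 < O2_HYPOXIA_THRESHOLD:
        cell.state = "in hypoxia"
        cell.hypoxia_time += dt
        cell.secretes_taf = True          # hypoxic cells become TAF sources
        if cell.hypoxia_time > MAX_HYPOXIA_TIME:
            cell.state = "necrosis"       # necrotic cells shrink, then are removed
            cell.size *= 0.5
    elif cell.age > APOPTOSIS_AGE:
        cell.state = "apoptosis"          # removed from the system shortly after
    elif cell.is_mature():
        cell.ready_for_mitosis = True     # well-oxygenated mature cells divide
```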
3
Data Structure and Algorithm
The simulation of tumor growth using particles involves data structures allowing for efficient execution of multiple, multiscale concurrent processes such as the cell cycle, lumen formation, development and maturation of the vascular network, blood flow and others. As was shown before, our system consists of two types of particles: spherical and “tube-like” particles, representing tissue and vasculature, respectively (see Fig. 1). The state of each particle (cell, fragment of capillary) is defined by several attributes and tens of minor parameters connected with their physical and biochemical properties. These data are stored in a shared array P of records. Each record contains the information corresponding to a single particle. As shown in Fig. 1, during the simulation the number of spherical and “tube-like” cells can both increase due to mitosis and decrease as the result of apoptosis and necrosis. Information on newly formed cells is added at the end of the particle array. If there are no free slots left, the array is enlarged dynamically. The information on a dead cell is removed from P and the empty space is filled by the record from the end of the array. During the simulation the particle indices in the array P become mixed. Consequently, the positions of the particles in the array do not reflect their positions in space. This fact has an adverse effect on the efficiency of the parallel implementation. The main loop of the simulation, representing the time evolution of the particle system, is shown in Fig. 2.
Fig. 1. The mapping of particles onto the P array records containing the current state of cells during the changes caused by the cell cycle
After initialization phase, in subsequent timesteps we calculate new position of particles, the diffusion of active substances (nutrients, TAFs, pericytes), the intensity of blood flow in the vessels and the states of individual cells triggered by previous three factors and constrained by individual cell time clocks. These modifications of cell state can result in cell proliferation, its death and changes in its functions (hypoxic cells), size and properties.
From the point of view of efficient computations, the linear computational complexity O(N) of the algorithm is the most important requirement. However, both the calculation of forces and the approximate procedure used for solving the diffusion equation involve pairs of particles. Fortunately, both the particle interactions and the approximation kernel [7] are short-ranged. By using well-known procedures such as Hockney boxes (as in [10,11,12]), approximately linear computational complexity is guaranteed. The computational box is divided into cubic sub-boxes with edges equal to the interaction (kernel) range. As demonstrated in Fig. 3, a particle located in a given sub-box interacts only with particles located in this sub-box and in the adjacent sub-boxes.
initialization
for each time step
    for every pair of interacting particles        <- over 90% of simulation time
        compute forces                                 (in the sequential version)
        compute gradients of densities
    for every particle
        calculate new velocity, size and position
        calculate new densities
        change state   { may lead to creation of a new cell or vessel,
                         or death (removal) of a cell }
    for every vessel particle
        calculate blood flow and pressure
Fig. 2. The main loop of the simulation code
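A minimal sketch of the Hockney sub-box pair loop used in the force computation is given below; the particle representation and the pair callback are assumptions made for illustration, not the actual data structures of the code.

```python
# Sketch of the Hockney sub-box (linked-cell) scheme described above; particles are
# represented here as dictionaries with a "position" field (an assumption).
import itertools

def build_sub_boxes(particles, r_cut):
    """Assign every particle to the cubic sub-box of edge r_cut that contains it."""
    boxes = {}
    for p in particles:
        key = tuple(int(c // r_cut) for c in p["position"])
        boxes.setdefault(key, []).append(p)
    return boxes

def for_each_interacting_pair(boxes, interact):
    """Visit only pairs from the same or adjacent sub-boxes (O(N) for short-range kernels).
    Each unordered pair is visited twice here; a production code would use a half-shell."""
    offsets = list(itertools.product((-1, 0, 1), repeat=3))
    for key, cell in boxes.items():
        for off in offsets:
            neighbour = tuple(k + o for k, o in zip(key, off))
            for a in cell:
                for b in boxes.get(neighbour, []):
                    if a is not b:
                        interact(a, b)   # repulsion/adhesion forces, kernel terms, ...
```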
Despite its apparent efficiency, this classic algorithm has an annoying deficiency. As was shown in [11,12], the execution time of sequential fluid particle codes increases substantially during a simulation. This happens when particles cross the sub-box boundaries and the attributes of particles residing in the same sub-box end up at distant memory addresses. Therefore, to obtain efficient code, neighboring particles should be stored close to one another in memory. This effect becomes more evident for multithreaded implementations on shared memory computers [12]. To avoid frequent cache misses and the delays caused by maintaining cache coherency, the particles should be renumbered periodically, and processors with large cache memories are preferred. Consequently, the particles in the same sub-box are then given consecutive numbers. However, a gap between particle numbers still exists for particles from neighboring sub-boxes used for computing the forces. This is due to the incoherence between the linear addressing of memory and the indexing of a 3-D particle system. This cache effect is even stronger in our model, where the incoherency between particle indices and their spatial positions is larger than for systems with a conserved number of particles. The problem becomes serious even at the level of a single sub-box, where the addresses of particles residing in the same sub-box can be
distant in the P array. To avoid this problem, we use a different way of storing data in sub-boxes. The P array is divided into blocks of equal size. The block size is equal to the maximum number of particles which can be located in a single sub-box. The blocks correspond to individual sub-boxes. At any given moment, each block has as many filled records as there are particles (cells) in the corresponding sub-box (see Fig. 4). The remaining records are empty. Newly formed cells are placed directly in the corresponding blocks, not at the end of the P array.
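The blocked layout can be sketched as follows; the block capacity and the record representation are assumptions used only to illustrate the idea of one fixed-size block per sub-box.

```python
# A sketch of the blocked P array described above: one fixed-size block of records
# per sub-box, so particles that interact are also contiguous in memory.
# BLOCK_SIZE is an assumed capacity, not a value from the paper.
BLOCK_SIZE = 64

class SparseCellArray:
    def __init__(self, n_boxes):
        self.records = [None] * (n_boxes * BLOCK_SIZE)   # the P array
        self.count = [0] * n_boxes                       # particles stored per sub-box

    def add(self, box_id, particle):
        if self.count[box_id] >= BLOCK_SIZE:
            raise MemoryError("sub-box block full; enlarge BLOCK_SIZE")
        self.records[box_id * BLOCK_SIZE + self.count[box_id]] = particle
        self.count[box_id] += 1                          # new cells go straight into their block

    def remove(self, box_id, slot):
        last = self.count[box_id] - 1                    # fill the hole with the block's last record
        base = box_id * BLOCK_SIZE
        self.records[base + slot] = self.records[base + last]
        self.records[base + last] = None
        self.count[box_id] = last
```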
Fig. 3. The array of sub-boxes (with edge length r_cut) and its many-to-many mapping onto the array of cells
As we demonstrate in the following section, the preliminary timings made for the improved code are encouraging. On the other hand, we realize that this procedure only avoids the cache-space incoherency at the level of individual sub-boxes, while the problem with neighboring sub-boxes still exists. As we showed in the previous section, the blood vessels are simulated using “tube-like” particles. Moreover, these tubes can grow. It is the tubes that seriously complicate a direct application of the Hockney algorithm. A tube can span as many as five sub-boxes. Thus, in the force calculation procedure, one has to use additional tokens which prevent redundant calculations of interactions in which at least one of the two particles is “tube-like”.
4
Parallelization of the Simulation Program
In order to speed up the computations, the sequential code has been rewritten as a parallel program which can be executed on SMP machines with shared memory access. We mainly focused on the efficient parallelization of the computations of particle-particle interactions – as annotated in Fig. 2, those computations were the most time-consuming part of a simulation. In the parallel version of the program, computation threads are created, one for every available processor core. During the simulation, every thread performs the
Fig. 4. The sparse array of cells: one block of records per sub-box (box 1, box 2, . . . , box n), with the unused records left empty at the end of each block
computations for the assigned subset of sub-boxes only. The threads, however, cannot run independently: the computations involving the boundary sub-boxes require access (both read and write) to data assigned to neighboring threads. A proper ordering of the sub-box computations makes it possible to avoid the locking problems. Currently, we are porting the simulation code to nVidia GPU cards to test the possibility of achieving better speed-ups than those shown in Fig. 7.
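One possible way of ordering the sub-box computations so that neighbouring data are never written concurrently is sketched below; the three-phase slab schedule is an assumption used for illustration (the paper only states that a proper order is used), and Python is used merely to show the scheduling logic.

```python
# Sketch of a conflict-free schedule: slabs of sub-boxes along one axis are processed
# in three phases, so two concurrently processed slabs are never adjacent.
from concurrent.futures import ThreadPoolExecutor

def run_time_step(slabs, compute_forces_in, n_threads=4):
    """slabs: list of lists of sub-boxes; compute_forces_in(sub_box) may read and
    write the sub-box and its direct neighbours (an assumption of this sketch)."""
    def process_slab(slab):
        for sub_box in slab:
            compute_forces_in(sub_box)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for phase in range(3):
            # Slabs within one phase are at least three slabs apart, so no two threads
            # touch the same neighbouring sub-box at the same time.
            list(pool.map(process_slab, slabs[phase::3]))
```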
5
Results and Timings
In Figs. 5 and 6 we display the effects of the mechanical interaction between the vessels and the growing tumor, simulated using our particle-based model. The figures clearly show how important a role this process plays in tumor progression. In Fig. 5A, B two vessels are pushed away by the growing tumor, opening a gap filled with cells in hypoxia (indicated by the white arrow in Fig. 5C). The cells which stay in hypoxia for a period of time longer than a given threshold die and are removed from the system. This process produces a necrotic region in the middle of the growing tumor, which has principal consequences for the tumor dynamics in both its avascular and angiogenic phases. In the avascular phase, the existence of the necrotic region decreases the internal tumor pressure, slowing down its growth. As shown in Fig. 5C, D, where tumor progression is accompanied by vascular network evolution (the angiogenic phase), the evolving network is continuously remodeled by the mechanical forces exerted by the growing tumor. On the other hand, the vessels reshape the tumor body and influence its compartmentalization. Separate regions of low oxygen concentration become sources of TAF. Due to chemotaxis, these regions attract the growing vessels, which can be sources of oxygen. The supply of oxygen down-regulates TAF production and can rebuild tumor tissue in the necrotic regions. This process occurs on various spatial scales, resulting in the transformation of the regular vasculature of normal tissue into the highly inhomogeneous vasculature of the tumor.
Fig. 5. The process of tumor growth with the angiogenesis process switched on. Due to better oxygenation of the cells, the tumor grows faster. The necrotic region (C) is reborn (D) because of the oxygen supply by the newly formed vessels.
Fig. 6. The snapshots from the 3-D simulation of angiogenesis and tumor progression. We display only the evolution of the vascular network. The angiogenesis starts from a single vessel (A). The sprouts of blood vessels, stimulated by TAF secreted by tumor cells, emerge (B). They are represented by black threads. Only those threads which can conduct blood – i.e., the closed loops (anastomoses) – can survive. They are marked in red. The unproductive vessels are removed from the system after some time.
(Figure 7: left panel – execution time [seconds] of the interaction computations and of the overall simulation for the old and the new array organization; right panel – speedup as a function of the number of threads (1–8), ranging from 1.0 to 5.2.)
Fig. 7. Timings and speed-ups obtained as the result of reorganization of particle ordering (left) and influence of a greater number of threads (right)
The progression of the tumor vasculature is shown in Fig. 6. The source vessel, initially straight, is remodeled due to the tumor growth dynamics. Because of the increasing TAF concentration we can observe newborn capillaries sprouting from the source vessel (black threads). The sprouts can bifurcate and merge, creating
anastomoses. The blood flow is stimulated by the pressure difference in the anastomosing vessels. Only productive vessels – marked in red – have a chance to survive if the TAF concentration is sufficiently high. Unproductive vessels undergo regression and disappear after some time. The preliminary timings of the computer implementation of the model are presented in Fig. 7. For the test simulations, employing 2×10^5 particles in a 3-D cubic box divided into 52×52×52 cells, the procedure calculating the interactions between the particles consumes about 90% of the total computational time. As shown in Fig. 7 (left), the application of the proposed data structure (see Fig. 4) allows for a 10% decrease of the total computational time. A more meaningful acceleration of the code was obtained by increasing the number of threads on an SMP server equipped with two dual-core AMD Opteron processors. The speedups are shown in the right panel of Fig. 7.
6
Conclusions
The particle model developed in this study can be used as a robust modeling framework for developing more advanced tumor growth models. Consequently, these models can be used to evaluate the effects of individual factors on the process of tumor growth. The particle-based framework makes it easy both to model the principal mechanisms of tumor dynamics, such as cell proliferation, vascularization and vessel remodeling, and to incorporate sub-models of other important processes, such as microscopic intracellular phenomena, blood circulation and the extracellular matrix. The particle model can easily be supplemented with more precise models of the reproduction mechanism, the cell life cycle, lumen growth dynamics, the haptotaxis mechanism and others. Some models of intracellular phenomena can be adopted directly from the existing bibliography (e.g. from [2,3,4,5,6]). The quantitative verification of our model against experiment requires performing large simulations involving millions of particles to obtain realistic tumor sizes. In this paper, we present some ideas for better parallelization of the code and we demonstrate preliminary timings. In the near future we aim at obtaining efficient parallel codes for the newest multicore and GPU processors, to provide a flexible simulation tool for oncologists which can be used on small but powerful stand-alone workstations. Acknowledgments. Thanks are due to Professor Dr A. Dudek from the Hematology, Oncology and Transplantation (HOT) Division of the University of Minnesota Medical School for his contribution to this paper and priceless expertise. This research is financed by the Polish Ministry of Education and Science, Project No. 3 T11F 010 30, and by the internal AGH Institute of Computer Science grant No. 11.11.120.865. The exemplar movies from simulations can be downloaded from http://www.icsr.agh.edu.pl/∼wcislo/Angiogeneza/index.html
References 1. Folkman, J.: Tumor angiogenesis: Therapeutic implications. N. Engl. J. Med. 285, 1182–1186 (1971) 2. Bellomo, N., de Angelis, E., Preziosi, L.: Multiscale Modeling and Mathematical Problems Related to Tumor Evolution and Medical Therapy. J. Theor. Med. 5(2), 111–136 (2003) 3. Preziozi, L. (ed.): Cancer modelling and simulation, p. 426. Chapman & Hall/CRC Mathematical Biology & Medicine (2003) 4. Mantzaris, N., Webb, S., Othmer, H.G.: Mathematical Modeling of Tumor-induced Angiogenesis. J. Math. Biol. 49(2), 1416–1432 (2004) 5. Chaplain, M.A.J.: Mathematical modelling of angiogenesis. J. Neuro-Oncol 50, 37– 51 (2000) 6. Wcislo, R., Dzwinel, W.: Particle based model of tumor progression stimulated by the process of angiogenesis. In: Bubak, M., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2008, Part II. LNCS, vol. 5102, pp. 177–186. Springer, Heidelberg (2008) 7. Wcislo, R., Dzwinel, W., Yuen, D.A., Dudek, A.Z.: A new model of tumor progression based on the concept of complex automata driven by particle dynamics. Journal of Molecular Modeling (2009) (in Press) 8. Dzwinel, W., Alda, W., Kitowski, J., Yuen, D.A.: Using discrete particles as a natural solver in simulating multiple-scale phenomena. Molecular Simulation 20(6), 361–384 (2000) 9. Kadau, K., Germann, T.C., Lomdahl, P.S.: Large-scale molecular-dynamics simulation of 19 billion particles. International Journal of Modern Physics C 15(1), 193–201 (2004) 10. Dzwinel, W., Alda, W., Pogoda, M., Yuen, D.A.: Turbulent mixing in the microscale. Physica D 137, 157–171 (2000) 11. Boryczko, K., Dzwinel, W., Yuen, D.A.: Parallel Implementation of the Fluid Particle Model for Simulating Complex Fluids in the Mesoscale. Concurrency and Computation: Practice and Experience 14, 1–25 (2002) 12. Boryczko, K., Dzwinel, W., Yuen, D.A.: Modeling Heterogeneous Mesoscopic Fluids in Irregular Geometries using Shared Memory Systems. Molecular Simulation 31(1), 45–56 (2005)
Modular Neuro-Fuzzy Systems Based on Generalized Parametric Triangular Norms Marcin Korytkowski1,2 and Rafal Scherer1 1
Department of Computer Engineering, Częstochowa University of Technology al. Armii Krajowej 36, 42-200 Częstochowa, Poland http://kik.pcz.pl 2 Olsztyn Academy of Computer Science and Management ul. Artyleryjska 3c, 10-165 Olsztyn, Poland
[email protected],
[email protected] http://www.owsiiz.edu.pl/
Abstract. Neuro-fuzzy systems are used for modeling and classification thanks to their advantages, such as interpretable knowledge and the ability to learn from data. They are similar to neural networks, but their design is structured according to knowledge in the form of fuzzy rules. The final decision is inferred by fuzzy inference using triangular norms. The paper presents modular neuro-fuzzy systems based on parametric classes of generalized conjunction and disjunction operations. The proposed systems adjust better to data during learning. Part of this flexibility is achieved by the new operators, which do not alter the original knowledge of the fuzzy rules. The method described in the paper is tested on several known benchmarks.
1
Introduction
Fuzzy logic, since its introduction in 1965, has been used in various areas of science, economics, manufacturing, medicine etc. Along with other methods such as neural networks or evolutionary programming, it constitutes the field of soft computing. Neuro-fuzzy systems (NFS) [5][8][9] are a synergistic fusion of fuzzy logic and neural networks, hence NFS can learn from data and retain the inherent interpretability of fuzzy logic. Thanks to this, two contradictory requirements are fulfilled: the knowledge in the form of fuzzy rules can be transparent and intelligible and, at the same time, high accuracy can be achieved. The intelligibility of the rules depends on the conditional part of the fuzzy rules, and the rule-creating algorithm should be designed properly to retain it. Generally, fuzzy systems can be divided, depending on the connective between the antecedent and the consequent in the fuzzy rules, as follows: i. Takagi-Sugeno method - consequents are functions of inputs, ii. Mamdani-type reasoning method - consequents and antecedents are related by the min operator or, generally, by a t-norm, iii. Logical-type reasoning method - consequents and antecedents are related by fuzzy implications, e.g. binary, Łukasiewicz, Zadeh etc.
In the paper we use Mamdani-type systems with a structure modified by using generalised parametric triangular norms. The proposed systems are learned by gradient learning and fuzzy c-means clustering. The systems are combined into ensembles. Combining single classifiers into larger ensembles is an established method for improving accuracy. Of course, the improvement is bound up with increased storage space (memory) and computational burden; however, this trade-off is easy to accept with modern computers. The most important problem when creating ensembles from fuzzy systems as base hypotheses is that each rule base has a different overall activation level. Thus we cannot treat them as one large fuzzy rule base. This possible “inequality” comes from the different activation levels during training. To overcome this problem, this paper proposes a method of normalising all rule bases during learning. The normalisation is achieved by adding a second output to the system and keeping all rule bases at the same level. The paper is organised as follows. Neuro-fuzzy systems with generalised parametric norms are described in Section 2. The new method that allows merging individual rule bases within the ensemble is introduced in Section 3. The last section provides a description of the ensemble meta-learning, the learning method used for the members of the ensemble, and testing on a known benchmark.
2
Mamdani Systems with Generalized Triangular Norms
We will use the neuro-fuzzy system in an ensemble, thus we add the index t denoting the classifier number. The output of the single, most common, Mamdani neuro-fuzzy system is defined as
h_t = \frac{\sum_{r=1}^{N_t} \bar{y}_t^r \, \tau_t^r}{\sum_{r=1}^{N_t} \tau_t^r} ,   (1)
where \tau_t^r = \mathrm{T}_{i=1}^{n} \, \mu_{A_i^r}(\bar{x}_i) is the activity level of the rule r = 1, ..., N_t of the classifier t = 1, ..., T. The output of the system is computed on the basis of the feature vector x = [x_1 ... x_n] using fuzzy sets \mu_{A_i^r}. The standard Mamdani neuro-fuzzy system uses the minimum or product T-norm for the antecedent conjunction operation. In the paper we propose to use parameterized triangular norms [2], [3], [6]. By using these T-norms we can model the operation by tuning the T-norm parameters. In the paper we use
T(a_1, a_2, ..., a_n) = \min_{i=1..n} (a_i)^{p} ,   (2)
where p is a parameter responsible for the shape of the T-norm function. The rule activation level becomes
\tau_t^r = \min_{i=1..n} \left( \mu_{A_i^r}(\bar{x}_i) \right)^{p} .   (3)
The structure of the committee member is presented in Fig. 1. Index t is omitted for clarity.
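A minimal sketch of formulas (1)-(3) is given below; Gaussian membership functions and the rule representation are illustrative assumptions.

```python
# Sketch of the single Mamdani classifier output (1)-(3) with the parametric T-norm.
import numpy as np

def gauss(x, centres, sigmas):
    return np.exp(-((x - centres) ** 2) / (2 * sigmas ** 2))

def rule_activity(x, centres, sigmas, p):
    """tau^r = min_i ( mu_{A_i^r}(x_i) )^p  -- parametric minimum T-norm, formulas (2)-(3)."""
    memberships = gauss(x, centres, sigmas)        # mu_{A_i^r}(x_i) for i = 1..n
    return float(np.min(memberships ** p))

def mamdani_output(x, rules, p):
    """h_t = sum_r y^r tau^r / sum_r tau^r  -- formula (1); rules = [(centres, sigmas, y), ...]."""
    taus = np.array([rule_activity(x, c, s, p) for (c, s, _) in rules])
    ys   = np.array([y for (_, _, y) in rules])
    return float(np.dot(ys, taus) / np.sum(taus))
```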
Fig. 1. Neuro-fuzzy classifier architecture with generalized triangular norm
3
Merging Mamdani Neuro-Fuzzy Classifiers
Systems from the previous section are used to build an ensemble learned with bagging metalearning algorithm described in the next section. The final response of the bagging ensemble is
f(x) = \sum_{t=1}^{T} \frac{\sum_{r=1}^{N_t} \bar{y}_t^r \left( \min_{i=1..n} \mu_{A_i^r}(\bar{x}_i) \right)^{p}}{\sum_{r=1}^{N_t} \left( \min_{i=1..n} \mu_{A_i^r}(\bar{x}_i) \right)^{p}} .   (4)
Formula (4) is a sum of all hypothesis outcomes. It is not possible to merge all the rule bases coming from the members of the ensemble, because (4) cannot be reduced to the lowest common denominator. Let us observe that in the denominator of (4) there is the sum of the activity levels of the rules of a single neuro-fuzzy system. Thus, if we want to treat formula (4) as one neuro-fuzzy system, it should be transformed so that it contains the sum of the activity levels of all rules in the ensemble. The solution to the problem is to assume the following for every module constituting the ensemble:
\sum_{r=1}^{N_t} \tau_t^r = 1 , \quad \forall t = 1, ..., T .   (5)
Then we obtain the following formula for the output of the whole ensemble
f(x) = \sum_{t=1}^{T} \sum_{r=1}^{N_t} \bar{y}_t^r \, \tau_t^r .   (6)
Assumption (5) guarantees that, when an ensemble of several such modified neuro-fuzzy systems is created, the summary activity level of one system does not dominate the other subsystems. Unfortunately, there exists no method of designing neuro-fuzzy systems such that the sum of the activity levels \sum_{r=1}^{N} \tau^r of a single system equals one. One possibility is to select the fuzzy system parameters during learning so that assumption (5) is satisfied. The considered neuro-fuzzy systems should be trained to satisfy two conditions:
1) h_t(x_q) = d_q , \quad \forall q = 1, ..., M ,
2) \sum_{r=1}^{N_t} \tau_t^r = 1 , \quad \forall t = 1, ..., T .   (7)
To satisfy (7) we modify the neuro-fuzzy structure of Fig. 2 so that it takes the form depicted in Fig. 3. We remove the last layer performing the division, thus the system has two outputs. The error on the first output is computed taking into account the desired output from the learning data. The desired signal on the
Fig. 2. Neuro-fuzzy classifier with modified structure for learning
Fig. 3. Neuro-fuzzy classifier with modified structure for learning
second output is constant and equals 1. The sum of activity levels can equal 1 in the case of triangular membership functions, but in the case of Gaussian functions it is not possible to fulfill the second condition in (7) exactly. This is caused by the range of the Gaussian function, which never takes the exact zero value. It is not a serious problem in real-life applications, as we only need to fulfill the second condition in (7) approximately. As we assume the fulfillment of the second condition in (7), after learning we can remove from the structure the elements responsible for the denominator in (1), as the denominator equals 1. In such a system the rules can be arranged in arbitrary order; we do not have to remember which sub-module a rule comes from. The obtained ensemble of neuro-fuzzy systems can be interpreted as a regular single neuro-fuzzy system.
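A compact sketch of evaluating the merged rule base once (7) holds is shown below; the rule representation is an illustrative assumption, not the authors' implementation.

```python
# With sum_r tau_t^r = 1 for every sub-system, the denominators in (1) and (4) vanish
# and all rules can be pooled into one rule base, evaluated as in (6).
def merged_ensemble_output(x, pooled_rules, activity):
    """pooled_rules: rules of all T sub-systems, each as (params, y);
    activity(x, params) returns the rule activity tau, normalised per sub-system."""
    return sum(y * activity(x, params) for (params, y) in pooled_rules)

# Before merging, the bagging ensemble would instead average the T sub-system outputs,
# H(x) = (1/T) * sum_t h_t(x), as in formula (10).
```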
4
Bagging Ensemble Numerical Simulation
Classifiers can be combined at the level of features, data subsets, using different classifiers or different combiners, see Figure 4. Popular methods are bagging and boosting which are meta-algorithms for learning different classifiers. The ensemble from the previous section is learned by bagging metalearning and the backpropagation algorithm.
4.1
Learning
The ensemble is learned with the bagging algorithm [4]; the name comes from Bootstrap AGGregatING. The ensemble members use data sets created from the original data set by drawing at random with replacement. There are more sophisticated meta-learning procedures for ensembles, such as AdaBoost [10]. They allow better accuracy to be achieved, but bagging can be used without any modifications in parallel computing, as all classifiers work independently and only during the classification stage is the overall response (decision) computed. Moreover, the algorithm is suitable for multiclass problems. During learning, the right number of rules, the fuzzy set parameters and the weights have to be computed. Alternatively, they can be determined by an expert in the field. In the paper, we compute all the parameters by machine learning from numerical data. At the beginning, the input fuzzy sets are determined by the fuzzy c-means clustering algorithm. Then, all parameters, i.e. input fuzzy sets, relation elements and output elements, are tuned by the backpropagation algorithm [11]. Given a learning data set of pairs (\bar{x}, d), where d is the desired response of the system, we can use the following error measure
Q(\bar{x}, d) = \frac{1}{2} \left[ \bar{y}(\bar{x}) - d \right]^2 .   (8)
Every relational neuro-fuzzy system parameter, denoted for simplicity as p, can be determined by minimizing the error measure in an iterative procedure. For every iteration t, the parameter value is computed by
p(t+1) = p(t) - \eta \, \frac{\partial Q(\bar{x}, d; t)}{\partial p(t)} ,   (9)
where η is a learning coefficient, set in the simulations to 0.03. The data sets used in the experiments were divided randomly into a learning set and a testing set. The accuracy reported in the paper was obtained using the full training and full testing sets. The learning method is not crucial here, as the aim of the paper is to show the possibility of merging all systems of the ensemble into one rule base. The overall decision is obtained by averaging the classifiers' responses
H(x) = \frac{1}{T} \sum_{t=1}^{T} h_t(x) ,   (10)
where T is the number of hypotheses in the bagging ensemble.
4.2
Numerical Simulations on the PID Problem
The Pima Indians Diabetes (PID) data [1] contains two classes, eight attributes (number of times pregnant, plasma glucose concentration in an oral glucose tolerance test, diastolic blood pressure (mm Hg), triceps skin fold thickness (mm), 2-hour serum insulin (mu U/ml), body mass index (weight in kg/(height in m)^2), diabetes pedigree function, age (years)). We consider 768 instances, 500 (65.1%) healthy and 268 (34.9%) diabetes cases. All patients were females at least 21
Fig. 4. General diagram of different classifiers ensembling methods [7]
years old of Pima Indian heritage, living near Phoenix, Arizona. It should be noted that about 33% of this population suffers from diabetes. In our experiments, all sets are divided into a learning sequence (576 sets) and a testing sequence (192 sets). Simulations were carried out using bagging algorithm on three neuro-fuzzy classifiers: i. Classifier 1 - 10 rules, 69.0% accuracy, ii. Classifier 2 - 8 rules, 66.2% accuracy, iii. Classifier 3 - 8 rules, 68.0% accuracy. Classification accuracy of the ensemble was 70%. This result is comparable with the best ones from the literature (e.g. in [9]). The main purpose of the proposed method was creating an ensemble of neuro-fuzzy systems in which rules can be treated as the one rulebase.
5
Conclusion
The paper shows the possibility of building an ensemble of classifiers consisting of Mamdani neuro-fuzzy systems with generalized parametric T-norms. Ensembling classifiers is a common method for improving classification accuracy. We use bagging meta-learning and the backpropagation algorithm for the members of the ensemble. Our goal was to design a flexible and accurate classification system with a set of intelligible fuzzy rules. The fuzzy rule base can be obtained thanks to a new modification of neuro-fuzzy systems bringing all the members to a common denominator. The simulation results show the correctness of the proposed approach.
Acknowledgements This work was partly supported by the Polish Ministry of Science and Higher Education (Habilitation Project 2007-2010 Nr N N516 1155 33 and Polish-Singapore Research Project 2008-2010).
References 1. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository. University of California, School of Information and Computer Science, Irvine (2007), http://www.ics.uci.edu/~ mlearn/MLRepository.html 2. Batyrshin, I., Kaynak, O., Rudas, I.: Fuzzy Modeling Based on Generalized Conjunction Operations. IEEE Transactions on Fuzzy Systems 10(5), 678–683 (2002) 3. Batyrshin, I., Kaynak, O.: Parametric Classes of Generalized Conjunction and Disjunction Operations for Fuzzy Modeling. IEEE Transactions on Fuzzy Systems 7(5), 586–596 (1999) 4. Breiman, L.: Bagging predictors. Machine Learning 26(2), 123–140 (1996) 5. Jang, R.J.-S., Sun, C.-T., Mizutani, E.: Neuro-Fuzzy and Soft Computing, A Computational Approach to Learning and Machine Intelligence. Prentice Hall, Upper Saddle River (1997) 6. Klement, E.P., Mesiar, R., Pap, E.: Triangular norms. Position paper II: general constructions and parameterized families. Fuzzy Sets and Systems 145, 411–438 (2004) 7. Kuncheva, L.I.: Combining Pattern Classifiers, Methods and Algorithms. John Wiley & Sons, Chichester (2004) 8. Nauck, D., Klawon, F., Kruse, R.: Foundations of Neuro - Fuzzy Systems. John Wiley, Chichester (1997) 9. Rutkowski, L.: Flexible Neuro Fuzzy Systems. Kluwer Academic Publishers, Boston (2004) 10. Schapire, R.E.: A brief introduction to boosting. In: Sixteenth International Joint Conference on Artificial Intelligence, pp. 1401–1406 (1999) 11. Wang, L.-X.: Adaptive Fuzzy Systems And Control. PTR Prentice Hall, Englewood Cliffs (1994)
Application of Stacked Methods to Part-of-Speech Tagging of Polish Marcin Kuta, Wojciech Wójcik, Michał Wrzeszcz, and Jacek Kitowski Institute of Computer Science, AGH-UST, al. Mickiewicza 30, Kraków, Poland {mkuta,kito}@agh.edu.pl
Abstract. We compare the accuracy of several single and combination part-of-speech tagging methods applied to Polish and evaluated on the modified corpus of the Frequency Dictionary of Contemporary Polish (m-FDCP). Three well-known types of combination methods (weighted voting, distributed voting, and stacked) are analyzed, and two new weighted voting methods, MorphCatPrecision and AmbClassPrecision, are proposed. The MorphCatPrecision method achieves the highest accuracy among all considered weighted voting methods. The best combination method achieves an 11.9% error reduction with respect to the best baseline tagger. We also report the statistical significance of the differences in accuracy between the various methods, measured by means of the McNemar test. Selection of the best algorithms was conducted on a multiprocessor supercomputer due to the high time and memory requirements of most of these algorithms. Keywords: part-of-speech tagging, weighted voting taggers, distributed voting taggers, stacked taggers, natural language processing.
1
Introduction
Part-of-speech (POS) tagging, being a cornerstone of a wide range of tasks, influences the quality of syntactic and semantic parsing, machine translation, text understanding, knowledge extraction and ontology construction, speech recognition, and many others. The wide literature concerning POS tagging proposes numerous tools, but most of the research focuses on English. A relevant issue is to evaluate such methods on different languages, in particular highly inflected languages like Polish. For such languages POS tagging is a harder task than for analytic languages, as the former need annotations with large, complex tagsets representing many morphological categories. We examine baseline and combination tagging methods applied to Polish. The paper focuses on combination methods of three types: weighted voting, distributed voting, and stacked. We made an effort not to omit any important baseline or combination method. We propose two new weighted voting methods, MorphCatPrecision and AmbClassPrecision, of which MorphCatPrecision
achieves the highest accuracy among the weighted voting methods. Finally, we compare the achieved error rate reductions with results for English and examine the statistical significance of the differences in accuracy between the investigated methods. The taggers are evaluated on two versions of the modified corpus of the Frequency Dictionary of Contemporary Polish (m-FDCP)1 [1]: one annotated with the full version of the tagset (referred to as the complex tagset) and one with its version reduced to basic part-of-speech information (the simple tagset). Finding the best tagging algorithms involves a high computational effort due to time requirements (e.g. the maximum entropy or conditional random fields algorithms), memory requirements, and even exponential time complexity (as a function of the number of baseline taggers) for some combination methods.
2
Evaluated Baseline Tagging Algorithms
To investigate combination methods we first evaluated a few baseline algorithms. Seven algorithms have been chosen. The Hidden Markov Model (HMM) algorithm belongs to the statistical methods and exploits the Viterbi algorithm. The maximum entropy method (a statistical approach) [2] maximizes the entropy function by selecting binary features reflecting dependencies in a training corpus. The memory-based learning algorithm [3] exploits a rule-based approach. The tagger acquires examples from a training corpus, which are later used in the tagging process. The output tags are determined by taking into account a context and a distance to examples. The transformation-based error-driven learning method [4] assigns an initial sequence of tags to a given tokenised text. This sequence is improved by applying a series of replacements. The advantage of the method is the possibility of supplying and modifying rules by linguists. The support vector machines (SVM) algorithm classifies input data by constructing a linear hyperplane which separates positive examples from negative ones with the maximal margin. The Tree tagger is based on Markov models and additionally exploits decision trees to improve its work [5]. Conditional random fields (CRFs) are a probabilistic framework for labelling sequence data that avoids the label bias problem.
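As an illustration of the statistical baseline approach, the sketch below shows the core of Viterbi decoding for an HMM tagger; the probability tables are assumptions for demonstration and do not come from any of the evaluated tools.

```python
# Minimal Viterbi decoding for an HMM POS tagger (toy probabilities, log-space omitted).
def viterbi(words, tags, trans, emit, start):
    """trans[(t_prev, t)] and emit[(t, word)] are probabilities estimated from a corpus."""
    v = [{t: start.get(t, 1e-12) * emit.get((t, words[0]), 1e-12) for t in tags}]
    back = []
    for w in words[1:]:
        scores, ptr = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: v[-1][p] * trans.get((p, t), 1e-12))
            scores[t] = v[-1][best_prev] * trans.get((best_prev, t), 1e-12) * emit.get((t, w), 1e-12)
            ptr[t] = best_prev
        v.append(scores)
        back.append(ptr)
    last = max(tags, key=lambda t: v[-1][t])
    path = [last]
    for ptr in reversed(back):          # follow back-pointers to recover the tag sequence
        path.append(ptr[path[-1]])
    return list(reversed(path))
```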
Table 1. Taggers used in experiments with their abbreviations employed
Tagger   Algorithm                   Tagger name      Abbreviation
T1       Hidden Markov Model         TnT [6]          TnT
T2       Maximum entropy             MXPost [2]       MXP
T3       Transformation based        fnTBL [7]        fnTBL
T4       Memory based                MBT [3]          MBT
T5       Support Vector Machines     SVMTool [8]      SVM
T6       Decision trees              TreeTagger [5]   TrTg
T7       Conditional Random Fields   CRF++ [9]        CRF

1 The m-FDCP corpus is available at http://nlp.icsr.agh.edu.pl
Choosing algorithms for evaluation we took into account representativeness of different approaches and open accessibility of tagging tools (Table 1).
3
Combination Methods
The idea of improving the performance of a task by combining several methods is well recognised in the machine learning community and known under many names, such as committees of learners, mixtures of experts, classifier ensembles, or multiple classifier systems. The main contributions in applying and developing combination methods in the NLP area are due to [10] and [11]. Each combination method works on the basis of a set of taggers (components); an exemplary set of baseline taggers is provided in Sect. 2. Three main types of combination methods can be distinguished, which are referred to in the paper as weighted voting methods, distributed voting methods, and stacked methods.
3.1
Weighted Voting Methods
Different baseline taggers (preferably completely independent) propose a tag simultaneously [10] according to their algorithms, as described in Sect. 2. In accordance with the name, the final tagging of these methods is determined with the help of an election procedure. The vote strength of a member of such a voting system can be assigned in virtually any way. The already investigated methods [12] are shown in Table 2. We also propose two new voting methods: the AmbClassPrecision and the MorphCatPrecision method. According to [13]: 'An ambiguity class is a set of values (such as genitive and accusative) of a single category (such as CASE) which arises for some word forms as a result of morphological analysis.' In our experiments we did not use a morphological analyser, adopting a corpus-based approach instead, hence a slightly modified definition applies: 'An ambiguity class is a set of values of a single morphological category which arises for a word form as a result of its occurrences in a training set.' If a word form occurs (in different places of a training set), for instance, in the two mentioned cases, its AC for case is { genitive, accusative }. Given an ambiguity class AC, the precision of a tagger on AC is defined as
ambClassPrec_{AC} \overset{\mathrm{df}}{=} \frac{\# \text{ correctly tagged tokens} \in AC}{\# \text{ tokens} \in AC} .   (1)
The MorphCatPrecision method is applicable exclusively to corpora tagged with tagsets describing more than one morphological category. When applied to the simple tagset, the method reduces to the TotalPrecision method. The value of a morphological category Cat is selected during voting independently of the other morphological categories, with the vote strength of a component tagger defined as
morphCatPrec_{Cat} \overset{\mathrm{df}}{=} \frac{\# \text{ tokens correctly tagged with tags containing } Cat}{\# \text{ tokens tagged with tags containing } Cat} .   (2)
Then the independently chosen values of the morphological categories are joined to form one tag. Within the weighted voting methods, ties and data sparseness problems were resolved by fallbacks to simpler (less data-demanding) methods. If this was not enough, a tag was chosen randomly. We discuss these issues for Polish extensively in [14]. Table 2 summarises the vote strength of the particular weighted voting methods. In the implementation, the presented parameters were fixed on a tuning set, separate from the test set, to prevent an optimistic bias of the weighted voting methods.
Table 2. Impact of a component tagger in various weighted voting methods; W stands for a currently tagged token, X stands for a tag proposed by a component tagger, Y (Y ≠ X) represents each tag proposed by the opposition
Voting method        Tag X               Tag Y
Majority             1                   0
TotalPrecision       accuracy            0
TagPrecision         prec_X              0
PrecisionRecall      prec_X              1 − recall_Y
LexAccuracy          lexAcc_W            0
LexPrecision         lexPrec_{W,X}       0
LexPrecisionRecall   lexPrec_{W,X}       1 − lexRecall_{W,Y}
MorphCatPrecision    morphCatPrecision   0
AmbClassPrecision    ambClassPrecision   0
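A minimal sketch of the weighted voting scheme summarised in Table 2 is given below; the weight values and tag names are made-up assumptions.

```python
# Sketch of weighted voting over component taggers; TagPrecision-style weights assumed.
from collections import defaultdict

def weighted_vote(proposals, weight):
    """proposals: list of (tagger_id, tag); weight(tagger_id, tag) -> vote strength."""
    score = defaultdict(float)
    for tagger_id, tag in proposals:
        score[tag] += weight(tagger_id, tag)
    return max(score, key=score.get)        # ties would fall back to simpler methods

# Example: TagPrecision weights estimated on a tuning set (values are made up).
prec = {("TnT", "subst"): 0.97, ("MXP", "subst"): 0.96, ("fnTBL", "adj"): 0.93}
winner = weighted_vote([("TnT", "subst"), ("MXP", "subst"), ("fnTBL", "adj")],
                       lambda t, g: prec.get((t, g), 0.5))
```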
3.2
Distributed Voting Methods
The TagPair and Weighted Probability Distribution Voting methods have been proposed by [10], while LexPair is our modification of TagPair [12]. TagPair and LexPair. These methods depart from baseline taggers as voting components (Sect. 3.1) in favour of voting components being pairs of taggers [10]. A pair of taggers proposes a tag tag_T with strength P(tag_T | tag_i, tag_j) (the TagPair approach) or P(tag_T | tag_i, tag_j, tok) (the LexPair approach), where the tags tag_i, tag_j are proposed by the baseline taggers and tok is the currently considered token. Weighted Probability Distribution Voting (WPDV). WPDV perceives each case of assigning a tag to a token as a feature-value pair set, F_{case} = { (f_1 = v_1), . . . , (f_n = v_n) }. The tag with the maximal probability
P(tag_T) = \sum_{F_{sub} \subset F_{case}} W_{F_{sub}} \, P(tag_T \mid F_{sub})   (3)
is selected, where W_{F_{sub}} denotes the weight of F_{sub}. Translating to a lower abstraction level, we see that a component tagger proposes tag_T with impact w_N · P(tag_T | f_1 = v_1, . . . , f_N = v_N), with w_N denoting a weight factor. In the basic WPDV method only features of the form 'a tag assigned by a baseline tagger to the current token' are considered.
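The WPDV scoring of formula (3) can be sketched as follows; the conditional probability estimates are assumed to come from the tuning data, and the w_N = N! weighting follows the variant mentioned later in the paper.

```python
# Sketch of WPDV-style scoring (3): every feature subset votes for a tag with a
# weighted conditional probability; counts/probabilities are assumed to be supplied.
from itertools import combinations
from math import factorial
from collections import defaultdict

def wpdv_score(case_features, cond_prob):
    """case_features: dict {feature: value}; cond_prob(subset) -> {tag: P(tag | subset)}."""
    score = defaultdict(float)
    items = sorted(case_features.items())
    for n in range(1, len(items) + 1):
        for subset in combinations(items, n):
            for tag, p in cond_prob(frozenset(subset)).items():
                score[tag] += factorial(n) * p          # weight w_N = N!
    return max(score, key=score.get) if score else None
```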
WPDV+Contexts (WPDVC) variant extends further WPDV by including into the method two features: tags assigned by baseline taggers to the previous token and tags assigned by baseline taggers to the next token. It may seem that 2 × number of baseline taggers new features should be introduced, but compression reduced the number of features to two. 3.3
Stacked Methods
Stacked methods have been studied in [10] and [15]. C4.5 algorithm. The C4.5 algorithm developed by [16] is a general purpose statistical classifier, which builds a decision tree and uses concept of information entropy. The implementation involved was the release 8 of the algorithm, J48 from the Weka package [17]. Memory-based learning (MBL). Memory-based learning is a universal approach, already proved useful in baseline tagging. We based both baseline and stacked tagging on Daelmans TiMBL package [18].
4
Evaluated Data
We conducted experiments on the modified corpus of the Frequency Dictionary of Contemporary Polish [1], annotated with a slightly abridged version of the IPI PAN tagset [19]. The m-FDCP corpus is the authors' partially corrected and disambiguated version of the FDCP corpus [20], checked manually and further verified in the automatic manner described in [1]. The m-FDCP corpus consists of 658,656 tokens tagged with 30 different tags (the simple tagset case) and with 1243 different tags (the complex tagset case), respectively. The baseline algorithms have been evaluated with a split of the m-FDCP corpus into a training and a test set, consisting of 592,729 and 65,927 tokens and standing for 90% and 10% of the corpus, respectively. The tuning set, created from the training set by 9-fold cross validation in a way that avoids spill-over from test data to training data, served for training the combination taggers. The experiments were performed using one processor of the SGI Altix 3700 SMP supercomputer (256 GB RAM, 128 1.5 GHz Intel Itanium 2 processors) located at the ACC Cyfronet AGH-UST. The training phase required up to 98 · 10^3 sec (the MXP case) and the tagging speed varied from 1.2 to 220 · 10^2 tokens/sec.
5
Experiments: Setup and Results
5.1
Baseline Methods
We investigated 7 taggers (Table 1), the accuracy of which is provided in Table 3. The results for the CRF tagger are new, compared to previous works [21,12,14]. The nominal accuracy presents percentage of correctly tagged tokens among all tokens in a test set. This measure may be not representative in full extent as
Table 3. Accuracy measures of baseline taggers trained on 90% of the m-FDCP corpus [%]

                     Simple tagset                               Complex tagset
Tagger   Nominal   Absolute   Predictive   Sentence   Nominal   Absolute   Predictive   Sentence
         accuracy  accuracy   accuracy     accuracy   accuracy  accuracy   accuracy     accuracy
TnT      96.20     89.50      88.65        61.48      86.33     78.66      60.86        28.95
MXP      96.30     91.09      89.43        62.15      85.00     78.71      60.55        26.88
fnTBL    96.51     91.36      86.89        63.71      86.79     80.34      58.09        29.87
MBT      95.74     89.94      82.60        58.54      82.31     72.48      49.06        22.51
SVM      96.74     91.36      89.14        65.16      73.54     63.44       0.23        12.04
TrTg     95.73     88.94      85.19        58.35      82.16     72.89      45.18        21.25
CRF      89.85     90.59      55.23        34.96      –         –          –            –
only ambiguous tokens are a source of difficulty of tagging task, and the ratio of unambiguous tokens to ambiguous ones may differ significantly depending on the evaluated corpus. Accordingly, the absolute accuracy, a measure computed only on ambiguous tokens, is more plausible. The predictive accuracy is defined as the percentage of correctly tagged unknown tokens, i.e., tokens from a test set, which did not appear in a training set. The sentence accuracy shows in turn a ratio of entirely correctly tagged sentences to the total number of sentences in the test set. The results for the CRF++ tagger for the complex tagset could not be obtained due to time limitations of queueing system, under which the tagger was run (out of time fault). Having partial results only, we chose 6 of 7 taggers as baseline components, and excluded CRF as a component of combination methods. 5.2
Combination Methods
Each combination method has been investigated on training and test sets in three setups: 1. composition from 6 best base taggers, 2. composition from 5 best baseline taggers (all components except CRF and MBT taggers for the simple tagset and all components except SVM tagger for the complex tagset), 3. composition from 4 best baseline taggers (all components except CRF, MBT and TrTg taggers for the simple tagset and all components except SVM and TrTg taggers for the complex tagset). To determine the set of the best taggers we investigated performance on the tuning set (rather than on the test set, as results in Table 3). As application of stacked method may lead to involuntary replication of the majority voting pattern we undertook additional experiments with training on disagreement subsets only (created by removing cases for which all baseline taggers agree) [15]. In effects each stacked method has been investigated with further distinctions between two arrangements: (1) computations on the full
346
M. Kuta et al.
Table 4. Accuracy with standard deviation and error rate reduction, ΔErr , (in relation to the best base tagger) of voting taggers and stacked taggers, [ % ] Simple tagset Accuracy ΔErr
Method Theoretical
98.58
Complex tagset Accuracy ΔErr 94.00
Weighted voting taggers 96.86 ± 0.01 96.87 ± 0.01 96.89 ± 0.02 96.96 96.91 96.95 96.97 96.82 96.82 96.97 –
Majority 5-Best Majority 4-Best Majority TotalPrecision TagPrecision PrecisionRecall LexAccuracy LexPrecision LexPrecisionRecall AmbiguityClassPrecision MorphCatPrecision
3.7 4.0 4.6 6.8 5.2 6.4 7.1 2.5 2.5 7.1 –
86.14 ± 0.04 – 86.87 ± 0.03 0.6 87.19 ± 0.04 3.0 88.03 9.4 87.41 4.7 87.21 3.2 87.89 8.3 87.15 2.7 86.99 1.5 87.37 4.4 88.14 10.2
Distributed voting taggers TagPair LexPair WPDV, wN = N ! WPDV, wN = 1 WPDV+Contexts, wN = N ! WPDV+Contexts, wN = 1
97.02 96.84 ± 0.01 97.02 97.01 97.05 96.99
8.6 3.1 8.6 8.3 9.5 7.7
88.22 10.8 86.99 ± 0.03 1.5 88.20 10.7 88.29 11.4 88.33 11.7 88.36 11.9
Stacked taggers C4.5 (J48) 4T MBL 4T MBL 4T+TB MBL 4T+TB+TA MBL 4T+TB+W MBL 4T+TB+TA+W
97.01 97.03 97.02 97.06 97.09 97.11
8.3 8.9 8.6 9.8 10.7 11.3
– 87.86 88.34 88.36 88.23 88.28
– 8.1 11.7 11.9 10.9 11.3
training and test set, (2) computations on disagreement subsets of the training and test set only. Taking into account that each experiment was performed for both simple and complex tagset, we got total 2 × 3 = 6 versions of an experiment for each weighted voting and distributed voting method, and 2 × 3 × 2 = 12 versions of an experiment for a single stacked method. WPDV methods have been tested with wN = N ! and wN = 1. Tags proposed by best baseline taggers for a current token formed input for stacked methods (4 best baseline taggers contributed together 4 features, denoted 4T in Table 4). The MBL was tested with more features in various combinations.
Table 5. Statistical significance of the difference in accuracy between baseline taggers trained on 90% of the m-FDCP corpus according to the McNemar test, () a tagger in a row is better (worse) than a tagger in a column with statistical significance, > (<) tagger in a row is better (worse) than tagger in a column without statistical significance (a) simple tagset
(b) complex tagset
TrTg SVM MBT fnTBL MXP TnT MXP fnTBL MBT SVM
>
<
TrTg SVM MBT fnTBL MXP TnT MXP fnTBL MBT SVM
>
Table 6. Statistical significance of the difference in accuracy between best combination taggers trained for the simple tagset according to the McNemar test, notation is the same as in Table 5 A B CD WPDVC, wN = N ! MBL 4T MBL 4T+TB+TA MBL 4T+TB+W
< < << < < <
A – MBL 4T+TB+TA+W B – MBL 4T+TB+W C – MBL 4T+TB+TA D – MBL 4T
These remaining features, depicted in Table 4, have the following meaning: TB – tag assigned by the best baseline tagger to the token before the current token, TA – tag assigned by the best baseline tagger to the token after the current token, W – current token itself, i.e., its word form. Table 4 presents accuracy values of the all described combination methods considered in composition from 4 best baseline taggers and the MBL methods considered as trained on disagreement subsets. The error rate reduction, ΔErr , of a considered combination tagger in relation to the best baseline tagger is defined as: #errors of the combination tagger df ΔErr = (1 − ) · 100% (4) #errors of the best baseline tagger For methods containing random resolution of ties the standard deviation of accuracy over 50 runs of a tagger was provided. Positions marked as ’–’ in Table 4 mean that some results were not provided due to different reasons. The MorphCatPrecision method applied to simple tagsets collapses to the TotalPrecision method. The C4.5 (J48) tagger was not able to cope with the complex tagset due to its memory requirements beyond available limit. We reported ΔErr only if positive values were attained. We studied also whether observed differences between accuracy of taggers are statistically significant, applying the standard McNemar test at level of significance 0.05. First we examined nominal accuracy values from Table 3 and
each pair of baseline taggers. Subsequently we investigated differences between the best combination methods, separately for the simple and complex tagset. We selected from Table 4 six best methods for the complex tagset and five best methods for the simple tagset. Results of these tests are shown in Table 5 (baseline taggers) and Table 6 (combination taggers for the simple tagset). Examination of 6 best combination taggers trained for the complex tagset revealed no statistical significance of the difference in accuracy for any considered pair of taggers.
6
Conclusions
New methods called MorphCatPrecision and AmbClassPrecision have been proposed. The MorphCatPrecision achieved the highest accuracy among weighted voting methods, which are relatively simple to set up. All important baseline taggers have been investigated on Polish. On the basis of the baseline taggers the combination methods (weighted voting, distributed voting, stacked) have been evaluated. The memory based combination method with the full set of features turned out to be the best for the simple tagset, although its superiority over MBL with 4T+TB+W feature set, and WPDVC with weight wN = N ! was not statistically significant. In the case of the complex tagset 6 top combination methods, belonging to WPDV and MBL families, showed no statistical significance of the difference in accuracy. Augmenting the number of features increased accuracy of the MBL method until sparse data problem came into play. When choosing an appropriate tagger, accuracy, speed and availability factors have to be taken into account. Representatives of both MBL and WPDV families achieved top accuracy values but big differences in tagging speed were observed: MBL taggers were very fast while WPDV taggers (in our Java implementation) turned out to be extremely slow (esp. tagging phase), what excludes them from practical applications. Acknowledgments. This research is supported by the AGH University of Science and Technology grant no. 11.11.120.865. ACC CYFRONET AGH is acknowledged for the computing time.
References 1. Kuta, M., Chrzaszcz, P., Kitowski, J.: Increasing quality of the Corpus of Frequency Dictionary of Contemporary Polish for morphosyntactic tagging of the Polish language. Computing and Informatics 28(3), 2009 2. Ratnaparkhi, A.: A maximum entropy model for part-of-speech tagging. In: Proc. of the 1st Conf. on Empirical Methods in Natural Language Processing, pp. 133– 142 (1996) 3. Daelemans, W., Zavrel, J., Berck, P., Gillis, S.: MBT: A memory-based part of speech tagger-generator. In: Proc. of the 4th Workshop on Very Large Corpora, pp. 14–27 (1996)
4. Brill, E.: Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics 21(4), 543–565 (1995) 5. Schmid, H.: Probabilistic part-of-speech tagging using decision trees. In: Proc. of the Int. Conf. on New Methods in Language Processing, pp. 44–49 (1994) 6. Brants, T.: TnT - a statistical part-of-speech tagger. In: Proc. of the 6th Applied Natural Language Processing Conf., pp. 224–231 (2000) 7. Florian, R., Ngai, G.: Fast Transformation-Based Learning Toolkit manual. John Hopkins Univ., USA (2001), http://nlp.cs.jhu.edu/~ rflorian/fntbl 8. Gim´enez, J., M` arquez, L.: SVMTool: A general POS tagger generator based on Support Vector Machines. In: Proc. of the 4th Int. Conf. on Language Resources and Evaluation, pp. 43–46 (2004) 9. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proc. of the 18th Int. Conf. on Machine Learning (ICML 2001), pp. 282–289 (2001) 10. van Halteren, H., Zavrel, J., Daelemans, W.: Improving accuracy in word class tagging through the combination of machine learning systems. Computational Linguistics 27(2), 199–229 (2001) 11. Brill, E., Wu, J.: Classifier combination for improved lexical disambiguation. In: Proc. of the 7th Int. Conf. on Computational Linguistics, pp. 191–195 (1998) 12. Kuta, M., Wrzeszcz, M., Chrzaszcz, P., Kitowski, J.: Accuracy of baseline and complex methods applied to morphosyntactic tagging of Polish. In: Bubak, M., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2008, Part I. LNCS, vol. 5101, pp. 903–912. Springer, Heidelberg (2008) 13. Hajiˇc, J.: Morphological tagging: Data vs. Dictionaries. In: Proc. of the First Conf. North American Chapter of the Association for Computational Linguistics, pp. 94–101 (2000) 14. Kuta, M., W´ ojcik, W., Wrzeszcz, M., Kitowski, J.: Application of weighted voting taggers to languages described with large tagsets. Computing and Informatics (2009) (submitted) 15. Mihalcea, R.: Performance analysis of a part of speech tagging task. In: Proc. of the 4th Int. Conf. on Computational Linguistics and Intelligent Text Processing, pp. 158–166 (2003) 16. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Francisco (1993) 17. Witten, I., Frank, E.: Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann Publishers, San Francisco (2005) 18. Daelemans, W., Zavrel, J., van der Sloot, K., van den Bosch, A.: TiMBL: Tilburg Memory Based Learner. Technical report ILK 07-07, Tilburg University (2007) 19. Przepi´ orkowski, A., Woli´ nski, M.: A flexemic tagset for Polish. In: Proc. of the Workshop on Morphological Processing of Slavic Languages (EACL 2003), pp. 33–40 (2003) 20. Bie´ n, J., Woli´ nski, M.: Enriched Corpus of Frequency Dictionary of Contemporary Polish. Warsaw University, Poland (2001) (in Polish), http://www.mimuw.edu.pl/polszczyzna 21. Kuta, M., Chrzaszcz, P., Kitowski, J.: A case study of algorithms for morphosyntactic tagging of Polish language. Computing and Informatics 26(6), 627–647 (2007)
Computationally Efficient Nonlinear Predictive Control Based on State-Space Neural Models Maciej L awry´ nczuk Institute of Control and Computation Engineering, Warsaw University of Technology ul. Nowowiejska 15/19, 00-665 Warsaw, Poland Tel.: +48 22 234-76-73
[email protected] Abstract. This paper describes a computationally efficient nonlinear Model Predictive Control (MPC) algorithm in which a state-space neural model of the process is used on-line. The model consists of two Multi Layer Perceptron (MLP) neural networks. It is successively linearised on-line around the current operating point, as a result the future control policy is calculated by means of a quadratic programming problem. The algorithm gives control performance similar to that obtained in nonlinear MPC, which hinges on non-convex optimisation. Keywords: Process control, Model Predictive Control, state-space models, neural networks, linearisation, optimisation, quadratic programming.
1
Introduction
Model Predictive Control (MPC) algorithms [6,12] have been successfully used for years in numerous fields, not only in chemical, food and motor industries, but also in medicine and aerospace [11]. It is because they can take into account constraints imposed on process inputs (manipulated variables) and outputs (controlled variables), which usually decide on quality, economic efficiency and safety. Moreover, MPC techniques are very efficient in multivariable process control. Because properties of many processes are nonlinear, different nonlinear MPC approaches have been developed [3,8,12]. In particular, MPC algorithms based on neural models [2] are recommended [1,4,5,9,10,12,13]. Neural models can be efficiently used on-line in MPC because they have excellent approximation abilities, a small number of parameters and a simple structure. Furthermore, neural models directly describe relations of process variables, complicated systems of algebraic and differential equations do not have to be solved on-line as it is necessary when fundamental models are used in MPC. Usually, for prediction and calculation of the future control policy, discretetime input-output models are used in which the output signal (y) is a function of input (u) and output signals at previous sampling instants (iterations) y(k) = f (u(k − τ ), . . . , u(k − nB ), y(k − 1), . . . , y(k − nA ))
(1)
where integers τ, n_A, n_B define the order of dynamics. The nonlinear function f : R^{n_A+n_B−τ+1} → R can be realised by a neural network, for example of Multi Layer Perceptron (MLP) or Radial Basis Function (RBF) type.
This paper is concerned with the state-space neural model introduced in [14] the properties and applications of which are also discussed in [13]. Such a model has two important advantages. First of all, the structure of the model exactly mirrors the nonlinear state-space representation in discrete-time domain. The model is able to represent, at least theoretically, any nonlinear physical system (it is not true for the input-output representation). Secondly, the state-space representation is a straightforward choice for stability analysis (stability of the model itself and stability of the whole control algorithm). This paper describes a nonlinear MPC algorithm based on the state-space neural model of the process. Unlike the algorithm presented in [13] (nonlinear optimisation is used on-line), the discussed approach is computationally efficient. The model is successively linearised on-line around the current operating point, as a result the future control policy is calculated by means of a quadratic programming problem. The algorithm gives control performance similar to that obtained in nonlinear MPC, which hinges on non-convex optimisation.
2 Model Predictive Control Algorithms
In MPC algorithms [6,12] at each consecutive sampling instant k, k = 0, 1, 2, . . ., a set of future control increments is calculated

Δu(k) = [Δu(k|k)  Δu(k+1|k)  . . .  Δu(k+N_u−1|k)]^T        (2)
It is assumed that Δu(k+p|k) = 0 for p ≥ N_u, where N_u is the control horizon. The objective is to minimise differences between the reference trajectory y^ref(k+p|k) and predicted output values ŷ(k+p|k) over the prediction horizon N ≥ N_u, i.e. for p = 1, . . . , N. For optimisation the following quadratic cost function is usually used

J(k) = Σ_{p=1}^{N} (y^ref(k+p|k) − ŷ(k+p|k))² + Σ_{p=0}^{N_u−1} λ_p (Δu(k+p|k))²        (3)
where λ_p > 0 are weighting coefficients. Only the first element of the determined sequence (2) is applied to the process, i.e. u(k) = Δu(k|k) + u(k−1). At the next sampling instant, k+1, the prediction is shifted one step forward and the whole procedure is repeated. For simplicity of presentation a Single Input Single Output (SISO) process is assumed, i.e. u(k), y(k) ∈ R. Future control increments are found on-line from the following optimisation problem (for simplicity of presentation hard output constraints [6,12] are used)

min_{Δu(k|k), . . . , Δu(k+N_u−1|k)}  J(k)
subject to
  u^min ≤ u(k+p|k) ≤ u^max,            p = 0, . . . , N_u−1
  −Δu^max ≤ Δu(k+p|k) ≤ Δu^max,        p = 0, . . . , N_u−1        (4)
  y^min ≤ ŷ(k+p|k) ≤ y^max,            p = 1, . . . , N

where u^min, u^max, Δu^max, y^min, y^max define constraints.
Fig. 1. The structure of the state-space neural model, z −1 – the backward shift operator
3 MPC Based on State-Space Neural Models
3.1 State-Space Neural Models
The process is described by the state-space discrete-time model

x(k) = f(x(k−1), u(k−1))
y(k) = g(x(k))        (5)
where x(k) = [x_1(k) . . . x_{n_x}(k)]^T is the state vector. It is assumed that two Multi Layer Perceptron (MLP) neural networks [2] are used in the model. Both networks have one hidden layer and a linear output. The first network realises the function f : R^{n_x+1} → R^{n_x}, while the second network realises the function g : R^{n_x} → R. The structure of the state-space neural model is depicted in Fig. 1. Outputs of the first neural network are

x_i(k) = w^{I,2}_{i,0} + Σ_{j=1}^{K^I} w^{I,2}_{i,j} φ(z^I_j(k))        (6)
where i = 1, . . . , n_x, φ : R → R is the nonlinear transfer function (e.g. hyperbolic tangent), K^I is the number of hidden nodes, and z^I_j(k) is a sum of inputs of the j-th hidden node (j = 1, . . . , K^I)

z^I_j(k) = w^{I,1}_{j,0} + Σ_{s=1}^{n_x} w^{I,1}_{j,s} x_s(k−1) + w^{I,1}_{j,n_x+1} u(k−1)        (7)
Weights of the first network are denoted by w^{I,1}_{i,j}, i = 1, . . . , K^I, j = 0, . . . , n_x+1, and w^{I,2}_{i,j}, i = 1, . . . , n_x, j = 0, . . . , K^I, for the first and the second layer, respectively. Combining (6) and (7), outputs of the first network (states) are
x_i(k) = w^{I,2}_{i,0} + Σ_{j=1}^{K^I} w^{I,2}_{i,j} φ( w^{I,1}_{j,0} + Σ_{s=1}^{n_x} w^{I,1}_{j,s} x_s(k−1) + w^{I,1}_{j,n_x+1} u(k−1) )        (8)
The output of the second neural network is

y(k) = w^{II,2}_0 + Σ_{l=1}^{K^{II}} w^{II,2}_l φ(z^{II}_l(k))        (9)
where K^{II} is the number of hidden nodes of the second network, and z^{II}_l(k) is a sum of inputs of the l-th hidden node (l = 1, . . . , K^{II})

z^{II}_l(k) = w^{II,1}_{l,0} + Σ_{i=1}^{n_x} w^{II,1}_{l,i} x_i(k)        (10)
Weights of the second network are denoted by w^{II,1}_{i,j}, i = 1, . . . , K^{II}, j = 0, . . . , n_x, and w^{II,2}_i, i = 0, . . . , K^{II}, for the first and the second layer, respectively. Combining (9) and (10), the output signal of the model is

y(k) = w^{II,2}_0 + Σ_{l=1}^{K^{II}} w^{II,2}_l φ( w^{II,1}_{l,0} + Σ_{i=1}^{n_x} w^{II,1}_{l,i} x_i(k) )        (11)
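As an illustration of how such a trained model is evaluated, the following minimal sketch (Python/NumPy) computes the state from (8) and the output from (11). The weight-array names and layouts (w1_I, w2_I, w1_II, w2_II) are illustrative assumptions, not taken from the paper; tanh is used as the transfer function φ.

```python
import numpy as np

def state_net(x_prev, u_prev, w1_I, w2_I):
    """First MLP, eq. (8): maps [x(k-1), u(k-1)] to the state x(k).
    w1_I has shape (K_I, n_x + 2): column 0 holds the biases w_{j,0}^{I,1},
    columns 1..n_x hold w_{j,s}^{I,1}, the last column holds w_{j,n_x+1}^{I,1}.
    w2_I has shape (n_x, K_I + 1): column 0 holds the biases w_{i,0}^{I,2}."""
    z = w1_I[:, 0] + w1_I[:, 1:-1] @ x_prev + w1_I[:, -1] * u_prev   # z_j^I(k), eq. (7)
    return w2_I[:, 0] + w2_I[:, 1:] @ np.tanh(z)                     # x_i(k), eq. (8)

def output_net(x, w1_II, w2_II):
    """Second MLP, eq. (11): maps the state x(k) to the output y(k).
    w1_II has shape (K_II, n_x + 1), w2_II has shape (K_II + 1,)."""
    z = w1_II[:, 0] + w1_II[:, 1:] @ x                               # z_l^II(k), eq. (10)
    return w2_II[0] + w2_II[1:] @ np.tanh(z)                         # y(k), eq. (11)
```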
3.2 Suboptimal MPC Quadratic Programming Problem
If for prediction in MPC a nonlinear state-space model (5) is used without any simplifications, predictions yˆ(k+p|k) depend in a nonlinear way on the optimised future control policy u(k). In consequence, the MPC optimisation problem (4) becomes a nonlinear task which has to be solved on-line at each sampling instant [13]. Such an approach has limited applicability because the computational burden of nonlinear MPC is enormous and it may terminate in local minima. In order to reduce the computational complexity of nonlinear MPC, the statespace neural model is successively linearised on-line around the current operating point. As a result, predictions yˆ(k + p|k) depend in a linear way on future control increments. Thanks to it, the future control policy can be calculated by means of a quadratic programming problem, the necessity of nonlinear optimisation is avoided. The linear approximation of the nonlinear state-space neural model (5) is x(k) = A(k)x(k − 1) + B(k)u(k − 1)
(12)
y(k) = C(k)x(k)
(13)
where matrices A(k), B(k) and C(k) are of dimensionality nx × nx , nx × 1 and 1 × nx , respectively. Using the linearised state equation (12) recurrently, one obtains state predictions
x̂(k+1|k) = A(k)x(k) + B(k)u(k|k) + v(k)
x̂(k+2|k) = A(k)x̂(k+1|k) + B(k)u(k+1|k) + v(k)
x̂(k+3|k) = A(k)x̂(k+2|k) + B(k)u(k+2|k) + v(k)
. . .        (14)
where v(k) is the state disturbance (this issue is described in the next section). State predictions can be expressed as functions of future control increments

x̂(k+1|k) = B(k)Δu(k|k) + . . .
x̂(k+2|k) = (A(k)+I)B(k)Δu(k|k) + B(k)Δu(k+1|k) + . . .
x̂(k+3|k) = (A²(k)+A(k)+I)B(k)Δu(k|k) + (A(k)+I)B(k)Δu(k+1|k) + B(k)Δu(k+2|k) + . . .
. . .        (15)

Using the linearised output equation (13), it is possible to express the output prediction vector ŷ(k) = [ŷ(k+1|k) . . . ŷ(k+N|k)]^T as

ŷ(k) = C(k)P(k)Δu(k) + y⁰(k)        (16)
The first part of the above sum depends only on the future (on future control increments), the second part depends only on the past; it is called the free trajectory y⁰(k) = [y⁰(k+1|k) . . . y⁰(k+N|k)]^T. Matrices

P(k) = [ B(k)                                   . . .   0
         (A(k)+I)B(k)                           . . .   0
         ⋮                                       ⋱      ⋮
         (Σ_{i=1}^{N_u−1} A^i(k) + I)B(k)       . . .   B(k)
         (Σ_{i=1}^{N_u} A^i(k) + I)B(k)         . . .   (A(k)+I)B(k)
         ⋮                                       ⋱      ⋮
         (Σ_{i=1}^{N−1} A^i(k) + I)B(k)         . . .   (Σ_{i=1}^{N−N_u} A^i(k) + I)B(k) ]        (17)

and C(k) = diag(C(k), . . . , C(k)) are of dimensionality N·n_x × N_u and N × N·n_x, respectively. They are calculated on-line from the linearised model (12), (13). Thanks to using the suboptimal prediction (16), the optimisation problem (4) becomes the following quadratic programming task

min_{Δu(k)}  ‖y^ref(k) − C(k)P(k)Δu(k) − y⁰(k)‖² + ‖Δu(k)‖²_Λ
subject to
  u^min ≤ J Δu(k) + u^{k−1}(k) ≤ u^max
  −Δu^max ≤ Δu(k) ≤ Δu^max        (18)
  y^min ≤ C(k)P(k)Δu(k) + y⁰(k) ≤ y^max
Fig. 2. The structure of the MPC-NPL algorithm
where y^ref(k) = [y^ref(k+1|k) . . . y^ref(k+N|k)]^T, y^min = [y^min . . . y^min]^T, y^max = [y^max . . . y^max]^T are vectors of length N; u^min = [u^min . . . u^min]^T, u^max = [u^max . . . u^max]^T, Δu^max = [Δu^max . . . Δu^max]^T, u^{k−1}(k) = [u(k−1) . . . u(k−1)]^T are vectors of length N_u; Λ = diag(λ_0, . . . , λ_{N_u−1}); and J is the all-ones lower triangular matrix of dimensionality N_u × N_u. To cope with infeasibility problems, output constraints can be softened [6,12]. The discussed suboptimal MPC algorithm belongs to the general class known as MPC with Nonlinear Prediction and Linearisation (MPC-NPL) [5,12]. The structure of the algorithm is shown in Fig. 2. At each sampling instant k of the algorithm the following steps are repeated (a numerical sketch of one such iteration, under simplifying assumptions, is given after the list):
1. Linearisation of the state-space neural model: obtain matrices A(k), B(k), C(k) and the resulting matrices C(k), P(k).
2. Find the nonlinear free trajectory y⁰(k) using the state-space neural model.
3. Solve the quadratic programming task (18) to find the control policy Δu(k).
4. Implement the first element of the obtained policy u(k) = Δu(k|k) + u(k−1).
5. Set k := k+1, go to step 1.
If the current state x(k) cannot be measured, a state observer must be used [12]. It estimates the state using real (measured) values of input and output signals.
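The sketch below (Python/NumPy) illustrates one MPC-NPL iteration. It is a simplified sketch, not the implementation used in the paper: the routines jacobians and free_trajectory (Subsection 3.3) and the model callables are assumed to be available, the constraints are omitted, and the quadratic programme (18) is replaced by its unconstrained least-squares solution; in a full implementation the same matrices would be passed to a QP solver.

```python
import numpy as np

def npl_step(x, u_prev, y_ref, N, Nu, lam, jacobians, free_trajectory):
    """One MPC-NPL iteration (unconstrained illustration of task (18))."""
    A, B, C = jacobians(x, u_prev)              # linearisation, eqs. (19)-(20)
    n_x = A.shape[0]
    # S[m] = sum_{i=1}^{m} A^i + I, the coefficients appearing in P(k), eq. (17)
    S = [np.eye(n_x)]
    for _ in range(N - 1):
        S.append(A @ S[-1] + np.eye(n_x))
    P = np.zeros((N * n_x, Nu))
    for p in range(N):
        for r in range(min(p + 1, Nu)):
            P[p * n_x:(p + 1) * n_x, r] = (S[p - r] @ B).ravel()
    Cb = np.kron(np.eye(N), C.reshape(1, n_x))  # block-diagonal C(k), N x N*n_x
    y0 = free_trajectory(x, u_prev, N)          # eqs. (21)-(23)
    G = Cb @ P                                  # prediction: y_hat = G*du + y0
    H = G.T @ G + np.diag(lam)                  # unconstrained minimiser of (18)
    du = np.linalg.solve(H, G.T @ (y_ref - y0))
    return u_prev + du[0]                       # apply the first increment only
```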
3.3 Implementation Details
Using Taylor series expansion, the linear approximation (12), (13) of the nonlinear model (5), obtained at the sampling instant k, is

A(k) = df(x(k−1), u(k−1))/dx(k−1),   B(k) = df(x(k−1), u(k−1))/du(k−1),   C(k) = dg(x(k))/dx(k)        (19)
Matrices A(k), B(k), C(k) are calculated on-line analytically, taking into account the structure of the state-space neural model given by (8) and (11):

[A(k)]_{i,s} = Σ_{j=1}^{K^I} w^{I,2}_{i,j} (dφ(z^I_j(k))/dz^I_j(k)) w^{I,1}_{j,s},          i, s = 1, . . . , n_x
[B(k)]_i = Σ_{j=1}^{K^I} w^{I,2}_{i,j} (dφ(z^I_j(k))/dz^I_j(k)) w^{I,1}_{j,n_x+1},          i = 1, . . . , n_x        (20)
[C(k)]_i = Σ_{l=1}^{K^{II}} w^{II,2}_l (dφ(z^{II}_l(k))/dz^{II}_l(k)) w^{II,1}_{l,i},        i = 1, . . . , n_x

If hyperbolic tangent is used as the nonlinear transfer function in the hidden layers of the neural networks, dφ(z^I_j(k))/dz^I_j(k) = 1 − tanh²(z^I_j(k)) and dφ(z^{II}_l(k))/dz^{II}_l(k) = 1 − tanh²(z^{II}_l(k)).
The nonlinear free trajectory y⁰(k+p|k) over the prediction horizon (p = 1, . . . , N) is calculated on-line recurrently using the nonlinear state-space neural model (5) and assuming only the influence of the past. For p = 1 one has

y⁰(k+1|k) = g(x(k)) + d(k)        (21)

The unmeasured output disturbance is estimated from

d(k) = y(k) − y(k|k−1) = y(k) − g(x(k))        (22)

where y(k) is measured and y(k|k−1) is calculated from the model. For p = 2, . . . , N one has

x⁰(k+p|k) = f(x⁰(k+p−1|k), u(k−1)) + v(k)
y⁰(k+p|k) = g(x⁰(k+p|k)) + d(k)        (23)

The unmeasured state disturbance is estimated from

v(k) = x(k) − x(k|k−1) = x(k) − f(x(k−1), u(k−1))        (24)

where x(k) is measured (or observed) and x(k|k−1) is calculated from the model.
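These analytic formulas translate directly into code. A minimal sketch is given below (Python/NumPy), using the same illustrative weight layout as the earlier model sketch; the array names are assumptions for illustration only.

```python
import numpy as np

def linearise(x_prev, u_prev, w1_I, w2_I, w1_II, w2_II):
    """Matrices A(k), B(k), C(k) of eq. (20) for tanh hidden layers."""
    zI = w1_I[:, 0] + w1_I[:, 1:-1] @ x_prev + w1_I[:, -1] * u_prev
    dI = 1.0 - np.tanh(zI) ** 2                      # dphi/dz, first network
    A = w2_I[:, 1:] @ (dI[:, None] * w1_I[:, 1:-1])  # n_x x n_x
    B = w2_I[:, 1:] @ (dI * w1_I[:, -1])             # n_x x 1
    x = w2_I[:, 0] + w2_I[:, 1:] @ np.tanh(zI)       # current state x(k), eq. (8)
    zII = w1_II[:, 0] + w1_II[:, 1:] @ x
    dII = 1.0 - np.tanh(zII) ** 2                    # dphi/dz, second network
    C = w2_II[1:] @ (dII[:, None] * w1_II[:, 1:])    # 1 x n_x
    return A, B, C

def disturbances(x_meas, y_meas, x_model, y_model):
    """Estimates d(k) and v(k) of eqs. (22) and (24)."""
    return y_meas - y_model, x_meas - x_model
```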
4 Simulation Results
The process under consideration is a polymerisation reaction taking place in a jacketed continuous stirred tank reactor [7] depicted in Fig. 3. The reaction is the
Fig. 3. The polymerisation reactor control system structure
Fig. 4. The input FI and the output N AM W : the MPC-NO algorithm (dashed ) and the MPC-NPL algorithm (solid ) based on the same state-space neural model
free-radical polymerisation of methyl methacrylate with azo-bis-isobutyronitrile as initiator and toluene as solvent. The output N AM W (Number Average Molecular Weight) is controlled by manipulating the inlet initiator flow rate FI , the monomer flow rate F is constant. Because the reactor is significantly nonlinear, MPC algorithms based on linear models are unstable [5,12]. The fundamental model is used as the real process during simulations. The model is simulated open-loop in order to obtain two sets of data, namely training and test data sets. These data sets are next used for training and validation of a set of state-space neural models (of a different topology determined by K I and K II ). Finally, the model in which the first neural network has K I = 5 hidden nodes and the second network has K II = 3 nodes is chosen as a good compromise between accuracy and complexity, the model has nx = 4 states. Two nonlinear MPC schemes are compared: the MPC algorithm with Nonlinear Optimisation (MPC-NO) and the discussed MPC-NPL algorithm with quadratic programming. For optimisation in the MPC-NO approach Sequential Quadratic Programming (SQP) is used on-line. Both algorithms use the same state-space neural model. Parameters of both algorithms are the same: N = 10, Nu = 3, λp = 0.2. The manipulated variable is constrained: FImin = 0.003, FImax = 0.06. The sampling time of MPC is 1.8 min. As the reference trajectory (N AM W ref ), five set-point changes are considered.
Fig. 5. State trajectories x1 , x2 , x3 , x4 : the MPC-NO algorithm (dashed ) and the MPC-NPL algorithm (solid ) based on the same state-space neural model
Simulation results are depicted in Fig. 4 (input and output signals) and Fig. 5 (states). Closed-loop performance obtained in the MPC-NPL algorithm with quadratic programming is practically the same as in the computationally prohibitive MPC-NO approach in which at each sampling instant a nonlinear optimisation problem has to be solved on-line. At the same time, the difference in computational burden is big. In case of MPC-NPL the computational cost is 0.6512 MFLOPS while in case of MPC-NO it soars to 7.3424 MFLOPS.
5 Conclusions
This paper details a computationally efficient MPC algorithm in which a state-space neural model is used. The model is successively linearised on-line around the current operating point; as a result, the future control policy is calculated from an easy-to-solve quadratic programming problem and the necessity of on-line nonlinear optimisation is avoided. The algorithm gives control performance similar to that obtained in nonlinear MPC with nonlinear optimisation. Thanks to the fact that neural networks are used to realise nonlinear state and output
functions (i.e. functions f and g in (5)), the model has a relatively small number of parameters and its structure is simple. Hence, the implementation of the discussed algorithm is simple as detailed in Subsection 3.3. The algorithm presented in this paper can be easily extended to deal with Multiple Input Multiple Output (MIMO) processes. Moreover, using prediction equations given in Subsection 3.2, the cost function (3) can also take into account differences between the state reference trajectory and predicted states. When necessary, constraints can be imposed on predicted states. Acknowledgement. The work presented in this paper was supported by Polish national budget funds for science for years 2009-2011 in the framework of a research project.
References 1. El Ghoumari, M.Y., Tantau, H.J.: Non-linear constrained MPC: real-time implementation of greenhouse air temperature control. Computers and Electronics in Agriculture 49, 345–356 (2005) 2. Haykin, S.: Neural networks – a comprehensive foundation. Prentice Hall, Englewood Cliffs (1999) 3. Henson, M.A.: Nonlinear model predictive control: current status and future directions. Computers and Chemical Engineering 23, 187–202 (1998) 4. Lazar, M., Pastravanu, O.: A neural predictive controller for non-linear systems. Mathematics and Computers in Simulation 60, 315–324 (2002) 5. L awry´ nczuk, M.: A family of model predictive control algorithms with artificial neural networks. International Journal of Applied Mathematics and Computer Science 17, 217–232 (2007) 6. Maciejowski, J.M.: Predictive control with constraints. Prentice Hall, Englewood Cliffs (2002) 7. Maner, B.R., Doyle, F.J., Ogunnaike, B.A., Pearson, R.K.: Nonlinear model predictive control of a simulated multivariable polymerization reactor using second-order Volterra models. Automatica 32, 1285–1301 (1996) 8. Morari, M., Lee, J.H.: Model predictive control: past, present and future. Computers and Chemical Engineering 23, 667–682 (1999) 9. Nørgaard, M., Ravn, O., Poulsen, N.K., Hansen, L.K.: Neural networks for modelling and control of dynamic systems. Springer, London (2000) 10. Peng, H., Yang, Z.J., Gui, W., Wu, M., Shioya, H., Nakano, K.: Nonlinear system modeling and robust predictive control based on RBF-ARX model. Engineering Applications of Artificial Intelligence 20, 1–9 (2007) 11. Qin, S.J., Badgwell, T.A.: A survey of industrial model predictive control technology. Control Engineering Practice 11, 733–764 (2003) 12. Tatjewski, P.: Advanced control of industrial processes, structures and algorithms. Springer, London (2007) 13. Zamarre˜ no, J.M., Vega, P., Garcia, L.D., Francisco, M.: State-space neural network for modelling, prediction and control. Control Engineering Practice 8, 1063–1075 (2000) 14. Zamarre˜ no, J.M., Vega, P.: State-space neural network. Properties and application. Neural Networks 11, 1099–1112 (1998)
Relational Type-2 Interval Fuzzy Systems
Rafal Scherer and Janusz T. Starczewski
Department of Computer Engineering, Częstochowa University of Technology
al. Armii Krajowej 36, 42-200 Częstochowa, Poland
[email protected],
[email protected] http://kik.pcz.pl
Abstract. In this paper we combine type-2 fuzzy logic with a relational fuzzy system paradigm. Incorporating type-2 fuzzy sets brings new possibilities for performance improvement. The relational fuzzy system scheme reduces significantly the number of system parameters to be trained. Numerical simulations of the proposed combination of systems are presented.
1 Introduction
Recently many authors have employed interval type-2 fuzzy logic in various application tasks [1,2,3,4,5,6]. Type-2 fuzzy sets are fuzzy sets characterized by a fuzzy membership function, which has its own memberships, called secondary membership grades. Although type-2 fuzzy sets are theoretically formulated in this general context of a secondary membership function, they are not frequently used for construction of fuzzy logic systems due to their computational complexity. Probably a unique attempt to develop a system working on general type-2 fuzzy sets is an approximate formulation of a triangular type-2 fuzzy logic system [7]. Interval type-2 fuzzy sets have their secondary membership grades either equal to zero or equal to one within intervals of a secondary domain. These intervals can be treated as an interval of uncertainty, in which all memberships are possible. Hence an interval type-2 fuzzy set can be represented using only the boundaries of its intervals as functions in a primary domain of elements, called upper and lower membership functions. As a source of membership uncertainty in type-2 fuzzy logic systems, quite often noisy training data are considered. Our proposition is to handle training data which are not perfectly covered by prototype type-1 fuzzy sets generated by classical learning methods. We can expand an upper membership function over a prototype type-1 membership function, and a lower membership function below it, to contain the training data of a fuzzy cluster within the interval of uncertainty. The spirit of type-2 fuzzy sets is very close to the application of rough-fuzzy sets handling missing features rather than their uncertainty [8]. In this letter we combine type-2 fuzzy logic with a relational fuzzy logic system scheme, see e.g. [9][10][11]. In relational fuzzy systems, each antecedent fuzzy set is related to each consequent fuzzy set with an activation to some degree, called a certainty factor. This approach allows us to reduce the number of antecedent
fuzzy sets related to consequent fuzzy sets. We believe that the combination of the relational system with interval type-2 fuzzy sets will reduce the number of system parameters to be trained.

1.1 Type-2 Fuzzy Logic System
Let us start from basic preliminaries concerning type-2 sets. The type-2 fuzzy set in the real line R, denoted by Ã, is a set of pairs {x, μ_Ã(x)}, denoted in the fuzzy union notation à = ∫_{x∈R} μ_Ã(x)/x, where x is an element of the fuzzy set associated with the fuzzy membership function μ_Ã : R → F([0,1]), and F([0,1]) is the set of all fuzzy subsets of the unit interval [0,1]. The values of μ_Ã are fuzzy membership grades μ_Ã(x),

μ_Ã(x) = ∫_{u∈[0,1]} f_x(u)/u,        (1)
where f_x : [0,1] → [0,1]. The membership function f_x is called a secondary membership function. Almost all applications mentioned in the literature employ interval type-2 fuzzy sets [1,3,12,6]. Our system is based on the typical type-2 fuzzy logic system architecture [13] that consists of the type-2 fuzzy rule base, the type-2 fuzzifier, the inference engine modified to deal with type-2 fuzzy sets, and the defuzzifier split into the type reducer and type-1 defuzzifier. Fuzzy rules are expressed here by so called extended t-norms. For interval memberships, any type-2 fuzzy set à is described by a fuzzy membership grade of the following form μ_Ã(x) = 1/[μ̲_A(x), μ̄_A(x)], bounded by an upper membership grade and a lower membership grade. Therefore the calculations for an extended t-norm can be reduced to the following:

T̃(μ_Ã, μ_B̃) = 1/[ T(μ̲_A, μ̲_B), T(μ̄_A, μ̄_B) ].        (2)

The inference engine produces the type-2 fuzzy consequence by means of the extended sup-star composition [13] of the premise and the type-2 fuzzy relation, i.e.,

μ_B̃^k(y) = sup_{x∈X} T̃(μ_Ã(x), μ_R̃^k(x, y)).        (3)
If fuzzy premises Ã_n are singletons, Ã_n = (1/1)/x_n, the inference can be realized by the following formula:

h_k(y) = μ_B̃^k(y) = μ_{Ã∘(Ã^k∩B̃^k)}(y) = T̃(μ_Ã^k(x), μ_B̃^k(y))        (4)
       = T̃( T̃_{n=1}^{N} μ_Ã^k_n(x_n), μ_B̃^k(y) ).        (5)

If the consequents are also singletons y_k, the k-th conclusion may be characterized by

h_k(y_k) = μ_B̃^k(y_k) = T̃_{n=1}^{N} μ_Ã^k_n(x_n).        (6)
In the case of interval type-2 fuzzy sets, a standard approach to transforming type-2 fuzzy sets into a type-1 fuzzy set is the KM type-reduction [14]. This method allows us to obtain the bounds of the type-reduced fuzzy set as y_min and y_max. Hence, the overall output of the fuzzy logic system may be expressed as follows

y_T2 = (y_min + y_max)/2.        (7)
2 Relational Fuzzy Systems
Relational fuzzy systems [9][10][11] are generalizations of traditional fuzzy systems, where each antecedent fuzzy set activates each consequent fuzzy set to some degree, called a certainty factor. This cross join of antecedents and consequents is characterized by a relation matrix consisting of all certainty factors, i.e.

R = [ c_11  c_12  · · ·  c_1M
      c_21  c_22  · · ·  c_2M
      ⋮     ⋮     c_km   ⋮
      c_K1  c_K2  · · ·  c_KM ]

Each relation represents a fuzzy rule, thus the total number of rules is equal to KM. Output variable y has a set B of M linguistic values B^m with membership functions μ_B^m(y), for m = 1, ..., M

B = {B^1, B^2, ..., B^M}.        (8)

For certain MIMO and MISO systems the dimensionality of R becomes quite high and it is very hard to estimate the elements of R. Sometimes the training data are not large enough to perform the estimation. For this reason we consider a fuzzy system with multidimensional input linguistic values. Then, we have only one set A of fuzzy linguistic values

A = {A^1, A^2, ..., A^K},        (9)

thus the relational matrix is only two-dimensional in the multi-input single-output (MISO) case. Having given the vector Ā of K membership values μ_A^k(x̄) for a crisp observed input value x̄, the vector B̄ of M crisp memberships μ_m is obtained through a fuzzy relational composition

B̄ = Ā ∘ R,        (10)

implemented element-wise by a generalized form of S-norm and T-norm composition, i.e. s-t composition

μ_m = S_{k=1}^{K} [ T(μ_A^k(x̄), r_km) ].        (11)
Fig. 1. Neuro-fuzzy relational system
The crisp output of the relational system is computed by the weighted mean

ȳ = ( Σ_{m=1}^{M} ȳ^m S_{k=1}^{K} [ T(μ_A^k(x̄), r_km) ] ) / ( Σ_{m=1}^{M} S_{k=1}^{K} [ T(μ_A^k(x̄), r_km) ] ),        (12)
where y¯m is a centre of gravity (centroid) of the fuzzy set B m . The exemplary neuro-fuzzy structure of the relational system is depicted in Fig. 1. The first layer of the system consists of K multidimensional fuzzy membership functions. The second layer is responsible for s-t composition of membership degrees from previous layer and KM crisp numbers from the fuzzy relation. Finally, the third layer realizes center average defuzzification. Depicting the system as a net structure allows learning or fine-tuning system parameters through the backpropagation algorithm.
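For illustration, a minimal sketch of equations (11)–(12) is given below (Python/NumPy). It assumes Gaussian multidimensional antecedents, the algebraic product as T-norm and the maximum as S-norm; the parameter names are placeholders introduced here, not part of the cited systems.

```python
import numpy as np

def relational_output(x, centers, spreads, R, y_c):
    """Crisp output (12) of a MISO relational fuzzy system.
    centers, spreads: (K, N) parameters of the K multidimensional Gaussians,
    R: (K, M) matrix of certainty factors, y_c: (M,) consequent centroids."""
    # Antecedent activations: product of per-dimension Gaussian memberships.
    mu_A = np.prod(np.exp(-((x - centers) / spreads) ** 2), axis=1)   # (K,)
    # s-t composition (11): T-norm = product, S-norm = maximum over k.
    mu = np.max(mu_A[:, None] * R, axis=0)                            # (M,)
    return np.sum(y_c * mu) / np.sum(mu)                              # eq. (12)
```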
3 Main Problem
How can interval uncertainty of type-2 fuzzy sets be incorporated into a relational fuzzy system framework? Traditional gradient techniques cannot manage parameter identification of interval fuzzy sets, because of the well-known iterative KM type reduction method (used for defuzzification). A natural way is to make use of a gradient method for creating an initial relational type-1 fuzzy system
and then to expand uncertainty over the membership functions according to the spread of the training data in the context of the activated membership functions.
4 A Method for Knowledge Acquisition by Relational Type-2 Interval Fuzzy Systems
4.1 Type-1 Learning Phase
The k-th antecedent type-1 fuzzy set is characterized by a membership function realized by the T-norm Cartesian product, i.e.

μ_A^k = T_{n=1}^{N} μ_A^k_n.        (13)

In this paper we make use of Gaussian membership functions such that each μ_A^k_n = gauss(x, x̄^k_n, σ^k_n) has its center x̄^k_n and its spread σ^k_n, where

gauss(x, x̄^k_n, σ^k_n) = exp( −((x − x̄^k_n)/σ^k_n)² ).        (14)

Tuning the antecedent parameters x̄^k_n, σ^k_n as well as the consequent parameters ȳ^m, σ^m can be realized by one of the gradient methods (e.g. backpropagation, recursive least squares or the Kalman filter).

4.2 Type-2 Interval Uncertainty Fitting
The number of cluster centers is equal to K·M, since each k-th antecedent is combined with each m-th consequent. Thus, the cluster center is a vector consisting of both the centers of the antecedents and a consequent center, i.e., v_km = [x̄^k_1, x̄^k_2, . . . , x̄^k_N, ȳ^m]. This vector is actually fixed by the type-1 learning phase, e.g. the BP algorithm. Correspondingly to the cluster center, a t-th pattern is represented by the extended vector x_t = [x_1(t), x_2(t), . . . , x_N(t), y(t)]. Therefore, we can use the fuzzy memberships defined by the fuzzy C-means (FCM) method, i.e.,

u_kmt = 1 / Σ_{j=1}^{c} (d_kmt / d_jt)^{2/(p−1)},        (15)

where d_kmt = ‖x_t − v_km‖ > 0, ∀k, m and t; if d_kmt = 0 then u_kmt = 1 and u_ijt = 0 for i ≠ k and j ≠ m. We assume that the measurement error of the training data is of no importance. With this assumption, the difference between the FCM memberships and the membership functions achieved from the type-1 learning phase is caused by an
inner uncertainty of a modeled process. Therefore, we expand upper and lower membership functions over the training data. The upper membership function is a normal Gaussian function,

μ̄^km_n(x_n) = gauss(x_n, x̄^k_n, σ̄^k_n)        (16)
μ̄^km(y) = gauss(y, ȳ^m, σ̄^m)        (17)

being a superior limit of the function family drawn through the points (x_n(t), u_kmt), i.e.,

σ̄^k_n = max_t { σ_t : gauss(x_n(t), x̄^k_n, σ_t) = u_kmt }        (18)
σ̄^k_n = max_t |x_n(t) − x̄^k_n| / √(−log u_kmt)        (19)

or through the points (y(t), u_kmt), i.e.,

σ̄^m = max_t |y(t) − ȳ^m| / √(−log u_kmt).        (20)
Consequently, the upper memberships of the antecedents and consequents of the relational system can be thought of as an aggregation weighted by the certainty factors, i.e.,

μ̄_{A^k_n}(x_n) = Agg_m [ c_km μ̄^km_n(x_n) + (1 − c_km) μ_{A^k_n}(x_n) ],        (21)
μ̄_{B^m}(y) = Agg_k [ c_km μ̄^km(y) + (1 − c_km) μ_{B^m}(y) ],        (22)

where Agg_m and Agg_k denote the aggregation over the index m and k, respectively.
The lower membership function is a scaled Gaussian function,

μ̲^km_n(x_n) = h^k_n gauss(x_n, x̄^k_n, σ^k_n)        (23)
μ̲^km(y) = h^m gauss(y, ȳ^m, σ^m)        (24)

being an inferior limit of the function family drawn through the points (x_n(t), u_kmt), i.e.,

h^k_n = min_t { h_t : h_t gauss(x_n(t), x̄^k_n, σ^k_n) = u_kmt }        (25)
h^k_n = min_t u_kmt / gauss(x_n(t), x̄^k_n, σ^k_n)        (26)

or in the consequent part

h^m = min_t u_kmt / gauss(y(t), ȳ^m, σ^m).        (27)
Fig. 2. Type-2 interval uncertainty fitting: initial type-1 membership function (dashed line), membership of data (asterisks), uncertainty bounds (solid lines)
Figure 2 presents the idea of the interval uncertainty fitting. Similarly, the lower memberships of the antecedents and consequents of the relational system are again a weighted aggregation
μ̲_{A^k_n}(x_n) = Agg_m [ c_km μ̲^km_n(x_n) + (1 − c_km) μ_{A^k_n}(x_n) ],        (28)
μ̲_{B^m}(y) = Agg_k [ c_km μ̲^km(y) + (1 − c_km) μ_{B^m}(y) ].        (29)
The upper and lower membership functions constitute the interval type-2 fuzzy antecedents and consequents of the relational type-2 interval fuzzy system.
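The fitting step can be summarised in a few lines of code. The sketch below (Python/NumPy) treats one antecedent dimension of one (k, m) cluster; x_n holds the values of that dimension over the T training patterns and u their FCM memberships from eq. (15). The function names and the clipping safeguard are illustrative assumptions, not part of the original method.

```python
import numpy as np

def fit_interval_bounds(x_n, u, center, spread):
    """Upper spread of eq. (19) and lower scaling factor of eq. (26)."""
    u = np.clip(u, 1e-12, 1.0 - 1e-12)     # safeguard against log(0) and log(1)
    sigma_up = np.max(np.abs(x_n - center) / np.sqrt(-np.log(u)))    # eq. (19)
    h_low = np.min(u / np.exp(-((x_n - center) / spread) ** 2))      # eq. (26)
    return sigma_up, h_low
```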
5 Experimental Results
We have tested our relational type-2 interval fuzzy logic system on two standard benchmarks: Iris and Wine classification. The first, Iris flowers classification benchmark contained 150 patterns uniformly distributed among three classes. The Wine dataset contained 59 patterns for class 1, 71 patterns for class 2 and 48 for class 3. The modelling scheme was ”One-against-All”, in which each of the pattern classes was trained against all other classes in independent fuzzy systems. This means that the number of the classifying MISO systems was the same as the number of classes. All classifiers had k = 2 and m = 2. Uncertainty parameter for the second step of training by the FCM was set as p = 2. The decision function
was chosen in the form of maximum. We chose a Gaussian shape of membership function. The Cartesian product was realised by the algebraic product t-norm, and the relational aggregation was expressed in terms of summation. 75 training patterns were chosen randomly for the Iris problem and 89 patterns were chosen randomly for Wine classification. The rest of the patterns were used to test the two systems. Table 1 presents the best, the worst as well as the average classification results of the type-1 and interval type-2 fuzzy relational systems over 50 independent runs of the training algorithm. Comparing the classification rates of these relational systems, the interval systems never fall to the worst classification rate observed for the type-1 systems over the 50 tests, while the best rate is achieved by both systems. Nevertheless, interval type-2 systems perform better in average rates.

Table 1. Classification accuracy of relational systems: type-1 and interval type-2, over 50 independent runs of the training algorithm

problem  result   relational type-1 FLS  relational type-2 interval FLS
Iris     best     97.3%                  97.3%
         worst    68.0%                  73.3%
         average  84.0%                  85.2%
Wine     best     92.1%                  92.1%
         worst    57.3%                  59.6%
         average  76.1%                  78.3%
6 Conclusion
In this paper we have merged interval type-2 fuzzy logic with a relational system framework. For such a system we have proposed a new two-step learning method. The first step of this method uses ordinary gradient techniques in order to tune prototype membership functions of type-1 fuzzy sets. The second step relies on the application of the fuzzy c-means partition and expands upper and lower membership functions outside of the prototype functions. As a consequence, the boundary membership functions of the interval type-2 fuzzy sets better fit the data and its FCM membership partitions. The use of the relational fuzzy system framework has allowed us to reduce the number of antecedent and consequent fuzzy sets without decreasing the number of rules sufficient for solving a problem.
Acknowledgements This work was partly supported by the Polish Ministry of Science and Higher Education (Habilitation Project 2007-2010 Nr N N516 1155 33 and N N516 3722 34 2008–2011, Polish-Singapore Research Project 2008-2010).
References 1. Castillo, O., Aguilar, L., Cazarez-Castro, N., Boucherit, M.: Application of type-2 fuzzy logic controller to an induction motor drive with seven-level diode-clamped inverter and controlled infeed. Electrical Engineering 90(5), 347–359 (2008) 2. Castillo, O., Melin, P.: Intelligent systems with interval type-2 fuzzy logic. International Journal of Innovative Computing, Information and Control 4(4), 771–783 (2008) 3. Hagras, H.: A hierarchical type-2 fuzzy logic control architecture for autonomous robots. IEEE Transactions on Fuzzy Systems 12(4), 524–539 (2004) 4. Mendez, G.: Interval type-2 anfis. In: Advances in Soft Computing – Innovations in Hybrid Intelligent Systems, pp. 64–71 (2007) 5. Torres, P., Sez, D.: Type-2 fuzzy logic identification applied to the modeling of a robot hand. In: Proc. FUZZ-IEEE 2008, Hong Kong (June 2008) 6. Uncu, O., Turksen, I.: Discrete interval type 2 fuzzy system models using uncertainty in learning parameters. IEEE Transactions on Fuzzy Systems 15(1), 90–106 (2007) 7. Starczewski, J.: Efficient triangular type-2 fuzzy logic systems. International Journal of Approximate Reasoning 50, 799–811 (2009) 8. Nowicki, R.: Rough-neuro-fuzzy structures for classification with missing data. IEEE Transactions on Systems, Man, and Cybernetics — Part B: Cybernetics 39(6), 1334–1347 (2009) 9. Scherer, R., Rutkowski, L.: Neuro-fuzzy relational systems. In: Proc. 2002 International Conference on Fuzzy Systems and Knowledge Discovery, Singapore, pp. 18–22 (November 2002) 10. Scherer, R., Rutkowski, L.: Neuro-fuzzy relational classifiers. In: Rutkowski, L., Siekmann, J.H., Tadeusiewicz, R., Zadeh, L.A. (eds.) ICAISC 2004. LNCS (LNAI), vol. 3070, pp. 376–380. Springer, Heidelberg (2004) 11. Setness, M., Babuska, R.: Fuzzy relational classifier trained by fuzzy clustering. IEEE Transactions on Systems, Man and Cybernetics - Part B: Cybernetics 29(5), 619–625 (1999) 12. Liang, Q., Mendel, J.: Interval type-2 fuzzy logic systems: Theory and design. IEEE Transactions on Fuzzy Systems 8, 535–550 (2000) 13. Karnik, N., Mendel, J., Liang, Q.: Type-2 fuzzy logic systems. IEEE Transactions on Fuzzy Systems 7(6), 643–658 (1999) 14. Karnik, N., Mendel, J.: Centroid of a type-2 fuzzy set. Information Sciences 132, 195–220 (2001)
Properties of Polynomial Bases Used in a Line-Surface Intersection Algorithm
Gun Srijuntongsiri¹ and Stephen A. Vavasis²
¹ Sirindhorn International Institute of Technology, 131 Moo 5, Tiwanont Road, Bangkadi, Muang, Pathumthani, 12000, Thailand
[email protected]
² University of Waterloo, 200 University Avenue W., Waterloo, ON N2L 3G1, Canada
[email protected]
Abstract. In [1], Srijuntongsiri and Vavasis propose the Kantorovich-Test Subdivision algorithm, or KTS, which is an algorithm for finding all zeros of a polynomial system in a bounded region of the plane. This algorithm can be used to find the intersections between a line and a surface. The main features of KTS are that it can operate on polynomials represented in any basis that satisfies certain conditions and that its efficiency has an upper bound that depends only on the conditioning of the problem and the choice of the basis representing the polynomial system. This article explores in detail the dependence of the efficiency of the KTS algorithm on the choice of basis. Three bases are considered: the power, the Bernstein, and the Chebyshev bases. These three bases satisfy the basis properties required by KTS. Theoretically, the Chebyshev case has the smallest upper bound on its running time. The computational results, however, do not show that the Chebyshev case performs better than the other two. Keywords: Line/surface intersection, polynomial basis.
1 The Line-Surface Intersection Problem and the Required Basis Properties
Let φ_0, . . . , φ_n denote a basis for the set of univariate polynomials of degree at most n. For example, the power basis is defined by φ_i(t) = t^i. The line-surface intersection problem can be reduced to the problem of finding all zeros of

f(u, v) ≡ Σ_{i=0}^{m} Σ_{j=0}^{n} c_ij φ_i(u) φ_j(v),   0 ≤ u, v ≤ 1,        (1)

where c_ij ∈ R² (i = 0, 1, . . . , m; j = 0, 1, . . . , n) denote the coefficients [1]. For this article, let the notation ‖·‖ refer specifically to the infinity norm. Other norms are explicitly notated so. The Kantorovich-Test Subdivision algorithm
Supported in part by NSF DMS 0434338 and NSF CCF 0085969.
(KTS in short), proposed by Srijuntongsiri and Vavasis [1], can be used to solve (1). KTS works with any polynomial basis φ_i(u)φ_j(v) provided that the following properties hold:
1. There is a natural interval [l, h] that is the domain for the polynomial. In the case of Bernstein polynomials, this is [0, 1], and in the case of power and Chebyshev polynomials, this is [−1, 1].
2. It is possible to compute a bounding polytope P of S = {f(u, v) : l ≤ u, v ≤ h}, where f(u, v) = Σ_{i=0}^{m} Σ_{j=0}^{n} c_ij φ_i(u)φ_j(v) and c_ij ∈ R^d for any d ≥ 1, that satisfies the following properties:
   (a) Determining whether 0 ∈ P can be done efficiently (ideally in O(mn) operations).
   (b) The polytope P is affinely invariant. In other words, the bounding polytope of {Af(u, v) + b : l ≤ u, v ≤ h} is {Ax + b : x ∈ P} for any nonsingular matrix A ∈ R^{d×d} and any vector b ∈ R^d.
   (c) For any y ∈ P,
       ‖y‖ ≤ θ max_{l≤u,v≤h} ‖f(u, v)‖,        (2)
       where θ is a function of m and n.
   (d) If d = 1, then the endpoints of P can be computed efficiently (ideally in O(mn) time).
3. It is possible to reparametrize with [l, h]² the surface S_1 = {f(x) : x ∈ B̄(x⁰, r)}, where x⁰ ∈ R² and r ∈ R > 0. In other words, it is possible (and efficient) to compute the polynomial f̂ represented in the same basis such that S_1 = {f̂(x̂) : x̂ ∈ [l, h]²}.
4. Constant polynomials are easy to represent.
5. Derivatives of polynomials are easy to determine in the same basis (preferably in O(mn) operations).
We are generally interested in the case where d = 2. In this case, we call P a bounding polygon. Recall that P is a bounding polygon of S if and only if x ∈ S implies x ∈ P.
2
The Kantorovich-Test Subdivision Algorithm
The description of KTS, as well as the definitions of the quantities mentioned in the description, are given below. More details can be found in [1]. For a given zero x∗ of the polynomial f, let ω∗(x∗) and ρ∗(x∗) be quantities satisfying the conditions that, first, ω∗(x∗) is the smallest Lipschitz constant for f′(x∗)⁻¹f′, i.e.,

‖f′(x∗)⁻¹(f′(x) − f′(y))‖ ≤ ω∗(x∗) · ‖x − y‖   for all x, y ∈ B̄(x∗, ρ∗(x∗))        (3)

and, second,

ρ∗(x∗) = 2/ω∗(x∗).        (4)
Define

γ(θ) = 1/( 4√(θ(4θ+1)) − 8θ ),

where θ is as in (2). Define ω_D to be the smallest nonnegative constant ω satisfying

‖f′(x∗)⁻¹(f′(y) − f′(z))‖ ≤ ω · ‖y − z‖,   y, z ∈ D,   x∗ ∈ [0, 1]²        (5)

satisfying f(x∗) = 0, where

D = [−γ(θ), 1 + γ(θ)]².        (6)
Denote ω_f as the maximum of ω_D and all ω∗(x∗):

ω_f = max{ ω_D,  max_{x∗∈C²: f(x∗)=0} ω∗(x∗) }.

Finally, define the condition number of f to be

cond(f) = max{ ω_f,  max_{x∗∈C²: f(x∗)=0, y∈[0,1]²} ‖f′(x∗)⁻¹ f(y)‖ }.        (7)
¯ 0 , r) as the application of We define the Kantorovich test on a region X = B(x 0 0 ¯ Kantorovich’s Theorem on the point x using B(x , 2γ(θ)r) as the domain (refer to [2,3] for the statement of Kantorovich’s Theorem). The region X passes the ¯ 0 , ρ− ) ⊆ D . Kantorovich test if ηω ≤ 1/4 and B(x The other test KTS uses is the exclusion test. For a given region X, let fˆX be the polynomial in the basis φi (u)φj (v) that reparametrizes with [l, h]2 the surface defined by f over X. The region X passes the exclusion test if the bounding polygon of {fˆX (u, v) : l ≤ u, v ≤ h} excludes the origin. Having defined the above prerequisites, the description of KTS can now be given. Algorithm KTS: – Let Q be a queue with [0, 1]2 as its only entry. Set S = ∅. – Repeat until Q = ∅ 1. Let X be the patch at the front of Q. Remove X from Q. 2. If X ⊆ XS for all XS ∈ S, ¯ 0 , r) • Perform the exclusion test on X = B(x • If X fails the exclusion test, (a) Perform the Kantorovich test on X (b) If X passes the Kantorovich test, i. Perform Newton’s method starting from x0 to find a zero x∗ . ii. If x∗ ∈ XS for any XS ∈ S (i.e., x∗ has not been found previously), ∗ its associated ω∗ (x∗ ) by binary search. ∗ Compute ρ∗ (x ) and ∗ ∗ ¯ ∗ Set S = S ∪ B(x , ρ∗ (x )) . (c) Subdivide X along both u and v-axes into four equal subregions. Add these subregions to the end of Q.
372
G. Srijuntongsiri and S.A. Vavasis
The following theorem shows that the efficiency of KTS has an upper bound that depends only on the conditioning of the problem and the choice of the basis. Theorem 1. Let f (x) = f (u, v) be a polynomial system in basis φi (u)φj (v) in two dimensions with generic coefficients whose zeros are sought. Let X = ¯ 0 , r) be a patch under consideration during the course of the KTS algorithm. B(x The algorithm does not need to subdivide X if r≤
1 · min 2
1 − 1/γ(θ) 1 , ωD 2θ cond(f )2
.
(8)
Proof. See [1]. Remark: Both terms in the bound on the right-hand side of (8) are increasing as a function of 1/θ. Therefore, our theorem predicts that the KTS algorithm will be more efficient for θ as small as possible (close to 1).
3
Properties of the Power, Bernstein, and Chebyshev Bases
As mentioned above, the basis used to represent the polynomial system must satisfy the properties listed in Section 1 for KTS to work efficiently. Three bases, the power, Bernstein, and Chebyshev bases are examined in detail. The power basis for polynomials of n is φk (t) = tk (0 ≤ k ≤ n). The Bernstein basis degree
n (1 − t)n−k tk (0 ≤ k ≤ n). The Chebyshev basis is is φk (t) = Zk,n (t) = k φk (t) = Tk (t) (0 ≤ k ≤ n), where Tk (t) is the Chebyshev polynomial of the first kind generated by the recurrence relation T0 (t) = 1, T1 (t) = t, Tk+1 (t) = 2tTk (t) − Tk−1 (t) for k ≥ 1.
(9)
Another way to define the Chebyshev polynomials of the first kind is through the identity Tk (cos α) = cos kα. (10) This second definition shows, in particular, that all zeros of Tk (t) lies in [−1, 1]. It also shows that −1 ≤ Tk (t) ≤ 1 for any −1 ≤ t ≤ 1. The rest of this article shows that the power, Bernstein, and Chebyshev bases all satisfy these basis properties. The values θ’s of the three bases and their corresponding bounding polygons are also derived as these values dictate the efficiency of KTS operating on such bases. The upper bound of the efficiency of KTS is lowest when it operates on the basis with the smallest θ.
Properties of Polynomial Bases Used in Line-Surface Intersection Algorithm
3.1
373
Bounding Polygons
The choices of l and h and the definitions of bounding polygons of the surface S = {f (u, v) : l ≤ u, v ≤ h}, where f (u, v) is represented by one of the three bases, that satisfy the required properties are as follows: For Bernstein basis, the convex hull of the coefficients (control points), call it P1 , satisfies the requirements for l = 0 and h = 1. The convex hull P1 can be described as ⎧ ⎫ ⎨ ⎬ P1 = cij sij : sij = 1, 0 ≤ sij ≤ 1 . (11) ⎩ ⎭ i,j
i,j
For power and Chebyshev bases, the bounding polygon ⎧ ⎫ ⎨ ⎬ cij sij : −1 ≤ sij ≤ 1 P2 = c00 + ⎩ ⎭
(12)
i+j>0
satisfies the requirements for l = −1 and h = 1. Note that P2 is a bounding polygon of S in the Chebyshev case since |Tk (t)| ≤ 1 for any k ≥ 0 and any t ∈ [−1, 1]. Determining whether 0 ∈ P2 is done by solving a small linear programming problem. To determine if 0 ∈ P1 , the convex hull is constructed by conventional method and is tested if it contains the origin. The affine and translational invariance of P1 and P2 for their respective bases can be verified as follows: Let g(u, v) = Af (u, v) + b =
n m
cij φi (u)φj (v).
i=0 j=0
For the Bernstein basis, by using the property that nk=0 Zk,n (t) = 1, it is seen that cij = Acij + b for all cij ’s. Therefore, the bounding polygon of {g(u, v) : 0 ≤ u, v ≤ 1} is ⎧ ⎫ ⎨ ⎬ P1 = cij sij : sij = 1, 0 ≤ sij ≤ 1 ⎩ ⎭ i,j i,j ⎫ ⎧ ⎬ ⎨ (Acij + b)sij : sij = 1, 0 ≤ sij ≤ 1 = ⎭ ⎩ i,j i,j ⎧ ⎫ ⎨ ⎬ = A cij sij + b sij : sij = 1, 0 ≤ sij ≤ 1 ⎩ ⎭ i,j i,j i,j ⎫ ⎧ ⎬ ⎨ cij sij + b : sij = 1, 0 ≤ sij ≤ 1 = A ⎭ ⎩ i,j
= {Ax + b : x ∈ P1 } .
i,j
374
G. Srijuntongsiri and S.A. Vavasis
For the power and the Chebyshev bases, note that φ0 (u)φ0 (v) = 1 for both bases. Hence, c00 = Ac00 + b and cij = Acij for i + j > 0. The bounding polygon of {g(u, v) : 0 ≤ u, v ≤ 1} for this case is ⎧ ⎫ ⎨ ⎬ P2 = c00 + cij sij : −1 ≤ sij ≤ 1 ⎩ ⎭ i+j>0 ⎧ ⎫ ⎨ ⎬ = Ac00 + b + Acij sij : −1 ≤ sij ≤ 1 ⎩ ⎭ i+j>0 ⎧ ⎛ ⎫ ⎞ ⎨ ⎬ = A ⎝c00 + cij sij ⎠ + b : −1 ≤ sij ≤ 1 ⎩ ⎭ i+j>0
= {Ax + b : x ∈ P2 } . 3.2
The Size of the Bounding Polygons Compared to the Size of the Bounded Surface
This section describes the main technical results of this article. Due to space limitation, readers are referred to [4] for the proofs of the theorems in this section. Item 2c of the basis properties in effect ensures that the bounding polygons are not unboundedly larger than the actual surface itself lest the bounding polygons lose their usefulness. The value θ also can be used as a measure of the tightness of the bounding polygon. Recall from Theorem 1 that the efficiency of KTS depends on θ. Since the bounding polygons P1 and P2 are defined by the coefficients of f , our approach to derive θ is to first derive ξ, a function of m and n, satisfying cij ≤ ξ max f (u, v) , l≤u,v≤h
for any coefficient cij of f . But the following lemma shows that one needs only derive the equivalent of ξ for univariate polynomial to derive ξ itself. Lemma 1. Assume there exists a function h(n) such that bi ≤ h(n) max g(t)
(13)
l≤t≤h
for any bi (i = 0, 1, . . . , n), and any univariate polynomial g(t) = Then cij ≤ h(m)h(n) max f (u, v) , l≤u,v≤h
n
i=0 bi φi (t).
(14)
for any c = 0, 1, . . . , m; j = 0, 1, . . . , n), and any bivariate polynomial ij (i m n f (u, v) = i=0 j=0 cij φi (u)φj (v). Proof. See [4]. The following theorems state ξ of the three bases.
Properties of Polynomial Bases Used in Line-Surface Intersection Algorithm
375
Theorem 2. Let f (t) be a polynomial system n f (t) = i=0 ci Zi,n (t), 0 ≤ t ≤ 1, where ci ∈ Rd . The norm of the coefficients can be bounded by ci ≤ ξB (n) max f (t) , t:0≤t≤1
(15)
where ξB (n) =
n
max{|n − j|, |j|} = O(nn+1 ). |i − j| i=0 j=0,1,...,i−1,i+1,...,n
Remark. An inequality in the other direction, namely, that max f (t) ≤ max ci ,
t:0≤t≤1
is a well-known consequence of the convex hull property of Bernstein polynomials [5]. Proof. See [4]. Theorem 3. Let f (t) be a polynomial system f (t) =
n
ci Ti (t),
i=0
where ci ∈ Rd . The norm of the coefficients can be bounded by √ ci ≤ 2 max f (t) . t:−1≤t≤1
(16)
Proof. See [4]. Theorem 4. Let f (t) be a polynomial system f (t) =
n
c i ti ,
i=0
where ci ∈ Rd . The norm of the coefficients can be bounded by ci ≤
3n+1 − 1 √ max f (t) . t:−1≤t≤1 2
(17)
Proof. See [4]. Having ξ for each of the three bases, the values of θ for the three bases can now be derived.
376
G. Srijuntongsiri and S.A. Vavasis
Corollary 1. Let f (u, v) =
m n
cij Zi,m (u)Zj,n (v),
i=0 j=0
where cij ∈ R2 (i = 0, 1, . . . , m; j = 0, 1, . . . , n). Let P1 be the convex hull of {cij }. Then, for any y ∈ P1 , y ≤ ξB (m)ξB (n) max f (u, v) . 0≤u,v≤1
Proof. By the convex hull property of Bernstein polynomials, y ≤ maxi,j cij for any y ∈ P1 . The corollary then follows from Theorem 2 and Lemma 1. Corollary 2. Let f (u, v) =
m n
cij ui v j ,
i=0 j=0
where cij ∈ R2 (i = 0, 1, . . . , m; j = 0, 1, . . . , n). Let P2 be defined as in (12). Then, for any y ∈ P2 , y ≤
(m + 1)(n + 1)(3m+1 − 1)(3n+1 − 1) max f (u, v) . −1≤u,v≤1 2
Proof. For any y ∈ P2 , y ≤
m n
cij .
i=0 j=0
The corollary then follows from Theorem 4 and Lemma 1. Corollary 3. Let f (u, v) =
m n
cij Ti (u)Tj (v),
i=0 j=0
where cij ∈ R2 (i = 0, 1, . . . , m; j = 0, 1, . . . , n). Let P2 be defined as in (12). Then, for any y ∈ P2 , y ≤ 2(m + 1)(n + 1) Proof. For any y ∈ P2 , y ≤
max
−1≤u,v≤1
m n
f (u, v) .
cij .
i=0 j=0
The corollary then follows from Theorem 3 and Lemma 1.
Properties of Polynomial Bases Used in Line-Surface Intersection Algorithm
4
377
Relationship between the Bounding Polygon of the Power Basis and That of Chebyshev Basis
Let P2p denote the bounding polygon P2 computed from the power basis representation of a polynomial and P2c denote P2 computed from the Chebyshev basis representation of it. The results from previous section show that the value θ of P2c is smaller than θ of P2p . This only implies that the worst case of P2c is better than the worst case of P2p . Comparing the values of θ’s of the two does not indicate that P2c is always a better choice than P2p for every polynomial. The following theorem shows, however, that P2c is, in fact, always a better choice than P2p . Theorem 5. Let f : R2 → R2 be a bivariate polynomial. Its bounding polygon P2c is a subset of its bounding polygon P2p . Proof. See [4].
5
Computational Results
Three versions of KTS algorithms are implemented in Matlab; one operating on the polynomials in power basis, one on Bernstein basis, and one on Chebyshev basis. They are tested against a number of problem instances with varying condition numbers. Most of the test problems are created by using the normally distributed random numbers as the coefficients cij ’s of f in Chebyshev basis. For some of the test problems especially those with high condition number, some coefficients are manually entered. The resulting Chebyshev polynomial system is then transformed to the equivalent system in the power and the Bernstein bases. Hence the three versions of KTS solve the same polynomial system and the efficiency of the three are compared. The degrees of the test polynomials are between biquadratic and biquartic.
Table 1. Comparison of the efficiency of KTS algorithm operating on the power, the Bernstein, and the Chebyshev bases. The number of patches examined during the course of the algorithm and the width of the smallest patch examined are shown for each version of KTS.

              Power basis         Bernstein basis     Chebyshev basis
cond(f)       Num. of  Smallest   Num. of  Smallest   Num. of  Smallest
              patches  width      patches  width      patches  width
3.8 × 10³     29       .125       17       .0625      21       .125
1.3 × 10⁴     13       .125       17       .0625      13       .125
2.5 × 10⁵     49       .0625      21       .0625      45       .0625
1.1 × 10⁶     97       .0313      65       .0313      85       .0313
3.9 × 10⁷     89       .0313      81       .0313      89       .0313
378
G. Srijuntongsiri and S.A. Vavasis
For the experiment, we use the algorithm by J´ onsson and Vavasis [6] to compute the complex zeros required to estimate the condition number. Table 1 compares the efficiency of the three versions of KTS for the test problems with differing condition numbers. The total number of subpatches examined by KTS during the entire computation and the width of the smallest patch among those examined are reported. The results do not show any one version to be particularly more efficient than the others although the Chebyshev basis has better theoretical bound than the other two.
6
Conclusion
Three common bases, the power, the Bernstein, and the Chebyshev bases, are shown to satisfy the required properties for KTS to perform efficiently. In particular, the values of θ for the three bases are derived. These values are used to calculate the time complexity of KTS when that basis is used to represent the polynomial system. The Chebyshev basis has the smallest θ among the three, which shows that using KTS with the Chebyshev basis has the smallest worstcase time complexity. The computational results, however, show no significant differences between the performances of the three versions of KTS operating on the three bases. It appears that, in average case, choosing any of the three bases do not greatly affect the efficiency of KTS.
References 1. Srijuntongsiri, G., Vavasis, S.A.: A condition number analysis of a line-surface intersection algorithm. SIAM Journal on Scientific Computing 30(2), 1064–1081 (2008) 2. Deuflhard, P., Heindl, G.: Affine invariant convergence theorems for Newton’s method and extensions to related methods. SIAM J. Numer. Anal. 16, 1–10 (1980) 3. Kantorovich, L.: On Newton’s method for functional equations (Russian). Dokl. Akad. Nauk SSSR 59, 1237–1240 (1948) 4. Srijuntongsiri, G., Vavasis, S.A.: Properties of polynomial bases used in a line-surface intersection algorithm (February 2009), http://arxiv.org/abs/0707.1515 5. Farin, G.: Curves and Surfaces for CAGD: A Practical Guide, 5th edn. Academic Press, London (2002) 6. J´ onsson, G., Vavasis, S.: Accurate solution of polynomial equations using Macaulay resultant matrices. Mathematics of Computation 74, 221–262 (2005)
A GPU Approach to the Simulation of Spatio–temporal Dynamics in Ultrasonic Resonators
Pedro Alonso-Jordá¹, Isabel Pérez-Arjona², and Victor J. Sánchez-Morcillo²
¹ Instituto Universitario de Aplicaciones de las Tecnologías de la Información, Universidad Politécnica de Valencia, Cno. Vera s/n, 46022 Valencia, Spain
[email protected]
² Instituto de Investigación para la Gestión Integrada de Zonas Costeras, Universitat Politècnica de València, Crta. Nazaret-Oliva s/n, 46730 Grau de Gandia, Spain
{victorsm,iparjona}@fis.upv.es
Abstract. In this work we present the implementation of an application to simulate the evolution of pressure and temperature inside a cavity when acoustic energy is injected, a physical system currently under intensive research. The particular features of the equations of the model makes the simulation problem very stiff and time consuming. However, intrinsic parallelism makes the application suitable for implementation in GPUs providing the researchers with a very useful tool to study the problem at a very reasonable price. In our experiments the problem was solved in less than half the time required by CPUs.
1 Introduction
As it is well known among researchers in high performance computing, Graphics Processor Units (GPUs), beyond their specialized capability to process graphics, have strongly consolidated as a serious alternative to general purpose CPUs to speed up scientific computing applications (GPGPU). Implementing general purpose intensive computation applications in graphics hardware is of interest for several reasons including the high computational power of GPUs at a very good cost–performance ratio together with the availability of CUDA [1]. However, programming GPUs for general purpose applications has clear limitations. Their highly specialized architecture based on breadth parallelism using single instruction multiple data (SIMD) processing units is not suitable for all applications. Although in our experiments we used a 64-bit capable hardware (Nvidia Quadro FX 5800), it does not yet perform as well as single precision. Furthermore, the development software kit still does not fully exploit this new hardware. But despite these and other well-known limitations that can be added
This work was financially supported by Spanish Ministerio de Ciencia e Innovaci´ on, European Union FEDER (FIS2008-06024-C03-02), Universidad Polit´ecnica de Valencia (20080009) and Generalitat Valenciana (20080811).
to the list, the potential performance benefits sometimes makes attempting to implement the application for GPUs worthwhile as is the case here. In this paper we present a study of the spatio–temporal dynamics of an acoustic field. An intensive study of the behaviour of such a system must be done in the numerical field to validate the mathematical model, contrasting it with the physical experiment. The physical experiment is characterized by very different time scales that force an extremely small simulation time step to be taken. Thus, the problem becomes extremely time consuming, and often such numerical problems cannot be undertaken in PC systems with standard processing capabilities. Basically, the computational cost of the simulation of our model relies on intensive use of two dimensional FFTs (Fast Fourier Transforms) on small matrices (64 × 64) plus the computation of linear and non–linear terms involved in the model. We resolved the problem in half the time obtained with one of the most powerful CPUs (Intel Itanium II processor). This is achieved not only because of the computation power of the GPUs but because of the intrinsic parallelism of some computations derived from the discretization method used to solve the evolution equations of the physical system. A survey of discretization of ODEs and PDEs on GPUs can be found in [2]. A similar work to ours discretizing PDEs on GPUs can be found in [3]. The paper is organized as follows. Section 2 summarizes the physical problem. Next we describe the simulation algorithm. In Section 4 we describe the implementation details of the GPU algorithm. The experimental results are analyzed in Section 5. We close with a conclusions section.
2 The Physical Model
Many systems in nature, when driven far from equilibrium, can self-organize, giving rise to a large variety of patterns or structures. Although studied intensively for most of the last century, it has only been during the past thirty years that pattern formation has emerged as a branch of science in its own right [4]. A model to study the spatio–temporal dynamics of an acoustic field has been proposed in a recent work [5]. The physical system consists of two solid walls, containing a viscous fluid, where acoustic energy is injected by vibrating one of the walls. In a viscous medium sound propagates with a speed c that depends significantly on temperature. The propagation of sound in such a medium can be described by two coupled equations for pressure, p(r, t) = P(x, y, t) cos(kz), and temperature, T(r, t) = T0(x, y, t) + T1(x, y, t) cos(2kz). In the case of a fluid confined in a driven resonator, it was shown in [5] that the pressure and temperature fields inside the cavity obey the evolution equations

  τp ∂τ P  = −P + Pin I + i∇²P + i(T0 + T1 − Δ)P,
     ∂τ T0 = −T0 + D∇²T0 + 2|P|²,                         (1)
     ∂τ T1 = −τg⁻¹ T1 + D∇²T1 + |P|²,
where P, T0 and T1 are proportional to the amplitudes of the pressure field and of the homogeneous and grating (modulated) temperature components, respectively. The constants
τp and τg are the normalized relaxation times of the pressure field and the temperature grating component, respectively. Other parameters are the detuning Δ between the driving frequency ω and the closest cavity mode, Pin the injected pressure and D the diffusion coefficient. Finally, ∇² = ∂²/∂x² + ∂²/∂y² is the transverse Laplacian operator, necessary for pattern formation. The model parameters can be estimated for a typical experimental situation [5]. It follows that the normalized decay times are τp ∼ 10⁻⁶ and τg ∼ 10⁻² under usual conditions. So the problem is typically very stiff, since 0 < τp ≪ τg ≪ 1. Stiff systems of differential equations, where the different variables evolve along very different time scales, are usually problematic for numerical solving, and certain numerical methods are unstable unless the step size is taken to be extremely small, which makes the simulation very time consuming.
3 The Simulation Algorithm
The essential part of the simulation algorithm can be summarized as:

Algorithm termo2D:
  ...initializations...
  for iteration = 1:2000000
    (P, T0, T1) ← terms(P, T0, T1, n, τp, τg, Δ, Pin, δ)
    P  ← CP ∗ P
    T0 ← CT0 ∗ T0
    T1 ← CT1 ∗ T1
  end for
  ...results storage and analysis...
where the symbol ∗ denotes the convolution operation. P, T0, T1, CP, CT0 and CT1 are matrices of size n × n, where n = 64, and δ = 10⁻⁶ is the time step size. The integration method is divided into two steps: 1) the spatial part, which corresponds to the Laplacian operators in equations (1), and 2) the linear and non-linear blocks. The spatial part is integrated using an exact split–step algorithm, which consists of transforming the spatial part of the equation into the spatial Fourier space and obtaining a differential equation with an exact solution:

  ∂t f = ∇²f   —FFT→   ∂t f̃ = −k² f̃,
where f is a function, f̃ is the corresponding spatial Fourier transform and k is the spatial frequency. The equation is transformed into Fourier space, propagated by multiplying by the temporal step and transformed back into real space (convolution). The linear block is exactly integrated from the linear matrix system in equations (1) when the spatial and nonlinear parts are not considered. The nonlinear part is integrated via a second order Runge–Kutta scheme. The computation of these blocks is carried out in function terms (Fig. 1). The chosen method, dividing the integration into two separate steps and using the split-step method for the spatial integration, provides some advantages over other usual methods like the finite elements technique. The linear systems obtained when the finite elements method is used can often be solved by fast
function (P, T0, T1) = terms(P, T0, T1, n, τp, τg, Δ, Pin, δ) is
  P  = P  + (0.5δ/τp) (−(1 + iΔ)P + Pin I + i(T0 + T1) · P)
  T0 = T0 + 0.5δ (−T0 + 2 P · P∗)
  T1 = T1 + 0.5δ (−(1/τg)T1 + P · P∗)
  P  ← P  + (δ/τp) (−(1 + iΔ)P + Pin I + i(T0 + T1) · P)
  T0 ← T0 + δ (−T0 + 2 P · P∗)
  T1 ← T1 + δ (−(1/τg)T1 + P · P∗)

Fig. 1. Function terms (∗ denotes the complex conjugate, · the point-wise product of matrices, I is the identity matrix and i is the imaginary unit, i = √−1)
algorithms. Nevertheless, to achieve high accuracy a large number of points in the transverse dimension or complex schemes to reduce the sparsity of the matrix are necessary. The advantage of the split–step method (and of spectral and pseudospectral methods, in general) is that it provides high accuracy in comparison with the finite elements technique for a similar number of discretization points [6].
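To illustrate the split-step idea, the following host-side sketch precomputes the diagonal spectral propagator exp(−k²δ) for the pure diffusion part ∂t f = ∇²f on an n × n grid. It is only a simplified example under our own assumptions (function name, domain size L, plain diffusion propagator); the actual matrices CP, CT0 and CT1 used by termo2D are built from the full linear part of equations (1) during the initialization phase.

#include <math.h>

/* Simplified sketch: fill prop[ky*n + kx] with exp(-k^2 * delta), the exact
 * one-step propagator of d_t f~ = -k^2 f~ in Fourier space.  Wavenumbers
 * follow the usual FFT ordering; L is the physical size of the domain. */
void build_propagator(float *prop, int n, float delta, float L)
{
    const float dk = 2.0f * 3.14159265f / L;
    for (int ky = 0; ky < n; ++ky) {
        for (int kx = 0; kx < n; ++kx) {
            /* map FFT index to signed wavenumber */
            float fx = dk * (kx <= n / 2 ? kx : kx - n);
            float fy = dk * (ky <= n / 2 ? ky : ky - n);
            float k2 = fx * fx + fy * fy;
            prop[ky * n + kx] = expf(-k2 * delta);
        }
    }
}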
4 The GPU Algorithm
Algorithm termo2D is implemented in CUDA as follows:

  ...initializations...
  for (iteration = 1; iteration <= 2000000; iteration++) {
      terms( dP, dT0, dT1, p1, p2, p3, p4, p5, p6, p7, N );
      convolution( dP, dCP );
      convolution( dT0, dCT0 );
      convolution( dT1, dCT1 );
  }
  ...results storage and analysis...
Initialization operations are carried out on the CPU in a very short time, so we will focus only on the main loop. Function terms simply sets up the thread block and grid sizes and calls the kernel kterms (Fig. 2). The kernel receives arguments p1 to p7, which are pre-computations carried out by the CPU in order to avoid repeated computations inside the GPU. We used auxiliary device functions (complex_mult, conjugate) to perform the denoted operations. We exploited the fact that all the operations can be carried out independently on each matrix element, yielding a simple kernel implementation with a very high degree of fine-grained concurrency. Matrices P, T0 and T1 of the termo2D algorithm are represented in the kernel through the local complex (float2) variables Paux, T0aux and T1aux, respectively. Function convolution performs the convolution of two matrices:

  CP ∗ P ≡ P ← ifft(fft(P) · fft(CP)),                    (2)
where fft and ifft are the forward and backward FFT operations, respectively.
__global__ void kterms( float2 *P, float2 *T0, float2 *T1,
                        float2 p1, float p2, float p3, float p4,
                        float p5, float p6, float P7, unsigned int N )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = i + j*N;
    float2 Paux, T0aux, T1aux, a, b;
    if ( i < N && j < N ) {
        a = complex_mult( p1, P[k] );
        a.x += P7;
        b = complex_mult( make_float2( -(T0[k].y + T1[k].y), T0[k].x + T1[k].x ), P[k] );
        Paux.x = P[k].x + ( a.x + b.x ) * p5;
        Paux.y = P[k].y + ( a.y + b.y ) * p5;
        a = complex_mult( P[k], conjugate( P[k] ) );
        T0aux.x = T0[k].x + ( -T0[k].x + 2.0*a.x ) * p4;
        T0aux.y = T0[k].y + ( -T0[k].y + 2.0*a.y ) * p4;
        T1aux.x = T1[k].x + ( p6 * T1[k].x + a.x ) * p4;
        T1aux.y = T1[k].y + ( p6 * T1[k].y + a.y ) * p4;
        a = complex_mult( p1, Paux );
        a.x += P7;
        b = complex_mult( make_float2( -(T0aux.y + T1aux.y), T0aux.x + T1aux.x ), Paux );
        P[k].x = P[k].x + ( a.x + b.x ) * p3;
        P[k].y = P[k].y + ( a.y + b.y ) * p3;
        a = complex_mult( Paux, conjugate( Paux ) );
        T0[k].x = T0[k].x + ( -T0aux.x + 2.0*a.x ) * p2;
        T0[k].y = T0[k].y + ( -T0aux.y + 2.0*a.y ) * p2;
        T1[k].x = T1[k].x + ( p6 * T1aux.x + a.x ) * p2;
        T1[k].y = T1[k].y + ( p6 * T1aux.y + a.y ) * p2;
    }
}
Fig. 2. Kernel for the linear and nonlinear terms calculation
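The auxiliary device functions complex_mult and conjugate are not listed in the paper; a straightforward definition consistent with their use in kterms (our sketch, assuming the usual float2 encoding with x the real part and y the imaginary part) would be:

__device__ inline float2 complex_mult( float2 a, float2 b )
{
    /* (a.x + i a.y)(b.x + i b.y) */
    return make_float2( a.x * b.x - a.y * b.y,
                        a.x * b.y + a.y * b.x );
}

__device__ inline float2 conjugate( float2 a )
{
    /* complex conjugate */
    return make_float2( a.x, -a.y );
}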
void convolution( float2 *A, float2 *B )
{
    cufftExecC2C( fftPlan, (cufftComplex *) A, (cufftComplex *) A, CUFFT_FORWARD );
    cudaMatrixMatrixWiseProduct( A, B, SCALAR, N );
    cufftExecC2C( fftPlan, (cufftComplex *) A, (cufftComplex *) A, CUFFT_INVERSE );
}
Function convolution is composed of the three operations shown in equation (2). The first step is a call to the CUDA routine cufftExecC2C to perform the FFT on the first matrix argument A in place. The second step is a call to a kernel (through the wrapper function cudaMatrixMatrixWiseProduct) that performs the matrix–matrix point-wise product of matrix A by B, where SCALAR is a real number used to normalize the FFT. The resulting matrix is stored in A. The third step is the computation of the inverse FFT on A, in place as well. It is assumed that matrices CP, CT0 and CT1 already contain their own FFTs when they are passed to the function through the formal argument B.
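The body of the kernel behind cudaMatrixMatrixWiseProduct is not shown in the paper; a minimal sketch of a scaled complex point-wise product matching the description above (the kernel name kpointwise and its exact signature are our assumptions) is:

__global__ void kpointwise( float2 *A, const float2 *B, float scalar, unsigned int N )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = i + j * N;
    if ( i < N && j < N ) {
        float2 a = A[k];
        float2 b = B[k];
        /* complex product of A[k] and B[k], scaled to normalize the FFT */
        A[k] = make_float2( scalar * ( a.x * b.x - a.y * b.y ),
                            scalar * ( a.x * b.y + a.y * b.x ) );
    }
}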
5 Experimental Results
We used different boards to run the tests. One was an Intel Itanium II based board with 4 dual-core processors at 1.4GHz. We used optimized code to compute the FFT (Intel MKL 10.1) and the Intel Fortran Compiler 11.0. The GPU boards were an NVidia Tesla C870 with 128 parallel processor cores and 1.5GB global memory, an NVidia Quadro FX 5800 and an NVidia Tesla C1060, these last two with 4GB global memory and 240 parallel processor cores.
Table 1. Time results of the termo2D algorithm on four different machines

               terms function         Convolution           Total time   Gflops
 Itanium       3.56 min. (16.1%)      6.07 min. (27.4%)     22.13 min.     1.8
 Tesla C870    41.9 sec.  (6.8%)      3.13 min. (30.8%)     10.16 min.     3.9
 Tesla C1060   18.4 sec.  (3.1%)      3.13 min. (32.3%)      9.72 min.     4.0
 FX 5800       17.7 sec.  (3.1%)      3.05 min. (32.3%)      9.45 min.     4.2
Table 2. Time results of Algorithm termo2D in parallel
               Computations          Communications        Total time
 Tesla s870    3.83 min. (34.2%)     7.36 min. (65.8%)     10.83 min.
 Tesla s1070   3.88 min. (43.5%)     5.03 min. (56.5%)      8.75 min.
The total time in Table 1 shows a performance improvement using GPUs over one of the currently most powerful CPUs. We used the single precision version of the application on the CPU to make the comparison. This precision is enough for researchers since they only need to observe the shape of the resulting temperature image in order to compare it with the shape obtained in the physical experiments. The improvement is due to the GPUs' greater computational capacity to execute operations of the type used in this application. The weight of the computation of the terms function has been reduced by a factor of 5.2. This is due to the way in which GPUs process data, which makes this application well suited to GPUs. The performance in Gflops is also shown. We used the same approximation as in [7] to count the flops of the FFTs, based on the theoretical cost shown in [8]. The FX 5800 computes the terms function 2.37 times faster than the Tesla C870. In our analysis we saw that the performance of this computation was 38.9 Gflops/s on the FX 5800 and 16.8 Gflops/s on the Tesla C870. Table 1 also shows the time to perform one convolution operation, in which we observe a quite different behaviour. The FX 5800 was only 1.02 times faster than the Tesla C870. This is because the FFTs are computed on both GPUs at ≈ 3.0 Gflops/s, that is, they are computed in a very similar time. This probably means that cuFFT is not yet sufficiently well tuned for boards with 240 processors. In our research we tried to improve performance by accessing matrices CP, CT0 and CT1 through the texture memory, since they are not changed in the loop. However, no improvements were achieved. We also attempted to improve the algorithm by exploiting parallelism at the iteration level. We implemented a parallel version based on the fact that the convolutions can be computed independently. In a first step one GPU computes the terms function, while in a second step the three convolutions are computed in parallel by three GPUs. Obviously, some data has to be transferred among
the GPUs and this time must be taken into account. Table 2 shows the time obtained with a Tesla s870 (which hosts 4 Tesla C870) and a Tesla s1070 (which hosts 4 Tesla C1060). We ran two different applications. One performed the same amount of computations as the original and no communications, to obtain the Computations time. We proceeded accordingly with the other application for the Communications time. The percentages are relative to the sum of these independent quantities. The Total time of the real application is also shown. The theoretical maximum speedup can be expressed as (t0 + 3t1)/(t0 + t1), where t0 denotes the time to compute the terms function and t1 the time to compute one convolution. The smaller t0, the closer the speedup is to 3. Using the times in Table 1, the expected speedup cannot be greater than 2.64 and 2.82 for the Tesla s870 and Tesla s1070, respectively. The speedup obtained is not far away if we only take into account the computation time, i.e. for the Tesla s870 and using the times in Table 1 and Table 2 we see that 41.9 sec. + 3.13 min. = 3.83 min. The drawback of the parallel version is obviously the large amount of data to be transferred between the host and the GPUs during the computations. We carried out the bandwidthTest of the CUDA SDK with 32KB, since this is the data size of the transfer units used in our application. We obtained a Host-to-Device bandwidth of 1260.1MB/s and a Device-to-Host bandwidth of 1531.9MB/s for the Tesla s870. The corresponding bandwidths for the Tesla s1070 were 1644.7MB/s and 1680.1MB/s. The different performance is due to the fact that the Tesla s1070 uses a PCIe Gen2 interface. Thus, the total parallel time for the Tesla s870 is even worse than the sequential time, while thanks to the faster PCIe link the total time is reduced by one minute in the case of the Tesla s1070. Regarding the CPU, we explored different ways of improving performance by using several cores. Firstly, we used the threaded version of the 2D FFT. Secondly, we used the same parallelizing technique as on the GPUs with OpenMP sections. None of these options returned less than 22.13 min.
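A possible host-side skeleton of this iteration-level parallelization, with one OpenMP section per GPU and the data transfers that dominate the parallel time, could look as follows. This is a simplified sketch under our own assumptions: per-device buffers (dP, dT0, dT1), the propagator matrices and the cuFFT plans are assumed to have been created on each device beforehand, and convolution stands for a per-device variant of the routine shown in Section 4.

#include <omp.h>
#include <cuda_runtime.h>

void convolution( float2 *A, float2 *B );  /* per-device variant of Sect. 4 routine */

/* After terms() has been computed on GPU 0 and copied back to the host,
 * the three convolutions are dispatched to three GPUs.
 * sz is the size in bytes of one n x n complex matrix. */
void parallel_convolutions( float2 *hP, float2 *hT0, float2 *hT1,
                            float2 **dP, float2 **dT0, float2 **dT1,
                            float2 **dCP, float2 **dCT0, float2 **dCT1,
                            size_t sz )
{
    #pragma omp parallel sections
    {
        #pragma omp section
        {   /* GPU 0: P <- CP * P */
            cudaSetDevice( 0 );
            cudaMemcpy( dP[0], hP, sz, cudaMemcpyHostToDevice );
            convolution( dP[0], dCP[0] );
            cudaMemcpy( hP, dP[0], sz, cudaMemcpyDeviceToHost );
        }
        #pragma omp section
        {   /* GPU 1: T0 <- CT0 * T0 */
            cudaSetDevice( 1 );
            cudaMemcpy( dT0[1], hT0, sz, cudaMemcpyHostToDevice );
            convolution( dT0[1], dCT0[1] );
            cudaMemcpy( hT0, dT0[1], sz, cudaMemcpyDeviceToHost );
        }
        #pragma omp section
        {   /* GPU 2: T1 <- CT1 * T1 */
            cudaSetDevice( 2 );
            cudaMemcpy( dT1[2], hT1, sz, cudaMemcpyHostToDevice );
            convolution( dT1[2], dCT1[2] );
            cudaMemcpy( hT1, dT1[2], sz, cudaMemcpyDeviceToHost );
        }
    }
}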
Fig. 3. Sound pressure in a 2D space for two different injections of sound
6 Conclusions
In this work we have shown an application consisting of the simulation of a physical system with considerable research interest. The application is well suited to being run on GPUs, yielding a significant reduction in execution time. This enables results to be obtained in a short time at the competitive price of a GPU. We also exploited parallelism at the iteration level by using several GPUs. Although the large amount of data transferred between CPU and GPU does not allow this parallelism to be fully exploited, we have managed to reduce the simulation time with a Tesla s1070. The parallel version could offer better speedups in the case of larger matrices. One of the major problems we faced in this work is the poor performance of the cuFFT routines. We also missed having a batch mode for the application of 2D FFTs like the one available in the 1D case. With these two features, the results of the simulation would clearly improve. Finally, we also provided a tool to visualize the resulting surface of the sound pressure (matrix P), which is very useful for researchers when analyzing data returned by the simulation. We modified the OpenGL application meshview of Nate Robins [9] to show the surface. With this tool, the sound pressure over a surface can be visualized in 3D space; the surface can be moved and its color changed. An example surface is shown in Fig. 3.
References
1. NVidia Corp.: CUDA Version 2.2 (2009)
2. Owens, J.D., Luebke, D., Govindaraju, N., Harris, M., Krüger, J., Lefohn, A.E., Purcell, T.J.: A survey of general-purpose computation on graphics hardware. Computer Graphics Forum 26(1), 80–113 (2007)
3. Januszewski, M., Kostur, M.: Accelerating numerical solution of Stochastic Differential Equations with CUDA (2009), http://arxiv.org/abs/0903.3852
4. Cross, M.C., Hohenberg, P.C.: Pattern formation out of equilibrium. Rev. Mod. Phys. 65, 851 (1993)
5. Pérez-Arjona, I., Sánchez-Morcillo, V.J., de Valcárcel, G.J.: Ultrasonic Cavity Solitons. Europhys. Lett. 82, 10002 (2008)
6. Trefethen, L.N.: Spectral Methods in MATLAB. SIAM, Philadelphia (2000)
7. Volkov, V., Kazian, B.: Fitting the FFT onto the G80 architecture (2008), http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project-6_report.pdf
8. Van Loan, C.: Computational Frameworks for the Fast Fourier Transform. SIAM Press, Philadelphia (1992)
9. Nate Robins: OpenGL meshview application, http://www.pobox.com/~ndr
Reduction to Condensed Forms for Symmetric Eigenvalue Problems on Multi-core Architectures

Paolo Bientinesi¹, Francisco D. Igual², Daniel Kressner³, and Enrique S. Quintana-Ortí²

¹ AICES, RWTH Aachen University, 52074 Aachen, Germany
[email protected]
² Depto. de Ingeniería y Ciencia de Computadores, Universidad Jaume I, 12.071 Castellón, Spain
{figual,quintana}@icc.uji.es
³ Seminar für angewandte Mathematik, ETH Zürich, Switzerland
[email protected]
Abstract. We investigate the performance of the routines in LAPACK and the Successive Band Reduction (SBR) toolbox for the reduction of a dense matrix to tridiagonal form, a crucial preprocessing stage in the solution of the symmetric eigenvalue problem, on general-purpose multicore processors. In response to the advances of hardware accelerators, we also modify the code in SBR to accelerate the computation by off-loading a significant part of the operations to a graphics processor (GPU). Performance results illustrate the parallelism and scalability of these algorithms on current high-performance multi-core architectures.
1 Introduction
We consider the solution of the symmetric eigenvalue problem AX = XΛ, where A ∈ R^{n×n} is a symmetric matrix, Λ = diag(λ1, λ2, ..., λn) ∈ R^{n×n} is a diagonal matrix containing the eigenvalues of A, and the jth column of the orthogonal matrix X ∈ R^{n×n} is the eigenvector associated with λj [1]. Given the matrix A, the objective is to compute its eigenvalues or a subset thereof and, if requested, the associated eigenvectors as well. Applications leading to large eigenvalue problems appear, e.g., in computational quantum chemistry, finite element modeling, multivariate statistics, and density functional theory, to name only a few. Efficient algorithms for the solution of symmetric eigenproblems usually consist of the following three stages: matrix A is first reduced to a (symmetric) tridiagonal matrix T ∈ R^{n×n} by a sequence of orthogonal similarity transforms: Q^T A Q = T, where Q ∈ R^{n×n} is the matrix representing the accumulation of these orthogonal transforms. The MR³ algorithm [2] or the (parallel) PMR³ algorithm [3] is then applied to the tridiagonal matrix T to accurately compute its eigenvalues and, optionally, the associated eigenvectors. Finally, when the eigenvectors of A are desired, we need to apply a back-transform to the eigenvectors of T. In particular, if T X_T = X_T Λ, with X_T ∈ R^{n×n} representing the eigenvectors
of T , then X = QXT . Both the first and last stage cost O(n3 ) floating-point arithmetic operations (flops) while the second stage based on the MR3 algorithm only requires O(n2 ) flops. (Other algorithms for solving tridiagonal eigenvalue problems, such as the QR algorithm, the Divide & Conquer method, etc. [1] are not competitive as they require O(n3 ) flops for the second stage.) In this paper we re-evaluate the performance of the codes in LAPACK and the Successive Band Reduction (SBR) toolbox [4] for the reduction of a symmetric matrix A to tridiagonal form. The LAPACK routine sytrd employs a simple algorithm based on Householder reflectors [1], enhanced with WY representations [5], to reduce A directly to tridiagonal form. Only half of its operations can be performed in terms of calls to BLAS-3 kernels, resulting in a poor use of the memory hierarchy. To overcome this drawback, the SBR toolbox first reduces A to an intermediate banded matrix B, and subsequently transforms B to tridiagonal form. The advantage of this two-step procedure is that the first step can be carried out using BLAS-3 kernels, while the cost of the second step is negligible provided a moderate band width is chosen for B. Our interest in this study is motivated by the increase in the number of cores in general-purpose processors and by the recent advances in more specific hardware accelerators such as graphics processors (GPUs). In particular, we aim at evaluating how the presence of multiple cores in these new architectures affects the performance of the codes in LAPACK and SBR for tridiagonal reduction. Note that, because of the efficient formulation and practical implementation of the MR3 algorithm, the reduction to tridiagonal form and the back-transform are currently the most time-consuming stages in the solution of the symmetric eigenvalue problem. The rest of the paper is organized as follows. In Section 2 we review the routines in SBR for the reduction of a dense matrix to tridiagonal form. There, we also propose a modification of the code in SBR to accelerate the initial reduction to banded form using a GPU. Details on the implementation of routine sytrd from LAPACK can be found in [6]. Section 3 offers experimental results of the LAPACK and SBR codes on a workstation with two Intel Xeon QuadCore processors (8 cores) and an NVIDIA Tesla C1060 GPU. Finally, Section 4 summarizes the conclusions of our study.
2 The SBR Toolbox
The SBR toolbox is a software package for symmetric band reduction via orthogonal transforms. SBR includes routines for the reduction of dense symmetric matrices to banded form (syrdb), and the reduction of banded matrices to narrower banded (sbrdb) or tridiagonal form (sbrdt). Accumulation of the orthogonal transforms and repacking routines for storage rearrangement are also provided in the toolbox. In this section we describe the routines syrdb and sbrdt which, invoked in that order, produce the same result as the reduction of a dense matrix to
Fig. 1. Partitioning of the matrix during one iteration of routine syrdb for the reduction to banded form: blocks A0, A1 and A2, with block size b, bandwidth w, and k = n − (j + w) + 1
tridiagonal form using the LAPACK routine sytrd. For the SBR routine syrdb, we also describe how to off-load the bulk of the computations to the GPU.

2.1 Reduction to Banded Form
Suppose that the first j − 1 columns of the matrix A have already been reduced to banded form with bandwidth w. Let b denote the algorithmic block size, and assume for simplicity that j + w + b − 1 ≤ n and b ≤ w; see Figure 1. Then, during the current iteration of routine syrdb, the next b columns of the banded matrix are obtained as follows:

1. Compute the QR factorization of A0 ∈ R^{k×b}, k = n − (j + w) + 1:

      A0 = Q0 R0,                                          (1)
where R0 ∈ R^{b×b} is upper triangular and the orthogonal factor Q0 is implicitly stored as a sequence of b Householder vectors using the annihilated entries of A0 plus b entries of a vector of length n. The cost of this first step is 2b²(k − b/3) flops.
2. Construct the factors of the compact WY representation [1] of the orthogonal matrix Q0 = Ik + W T W^T, with W ∈ R^{k×b} and T ∈ R^{b×b} upper triangular. The cost of this step is about kb² flops.
3. Apply the orthogonal matrix to A1 ∈ R^{k×(w−b)} from the left (a BLAS-level sketch of this update is given at the end of this subsection):

      A1 := Q0^T A1 = (Ik + W T W^T)^T A1 = A1 + W (T^T (W^T A1)).      (2)
By performing the operations in the order specified in the rightmost expression of (2), the cost of this step becomes 4kb(w − b) flops. In case the bandwidth equals the block size (w = b), A1 comprises no columns and, therefore, no operation is performed in this step.
4. Apply the orthogonal matrix to A2 ∈ R^{k×k} from both the left and the right:

      A2 := Q0^T A2 Q0 = (Ik + W Y^T)^T A2 (Ik + W Y^T)                  (3)
          = A2 + Y W^T A2 + A2 W Y^T + Y W^T A2 W Y^T,                   (4)

with Y = W T. In particular, during this step only the lower (or the upper) triangular part of A2 is updated. In order to do so, (4) is computed as the following sequence of (BLAS) operations:

      (symm)    X1 := A2 W,                         (5)
      (gemm)    X2 := (1/2) X1^T W,                 (6)
      (gemm)    X3 := X1 + Y X2,                    (7)
      (syr2k)   A2 := A2 + X3 Y^T + Y X3^T.         (8)
The major cost of this step is in the computation of the symmetric matrix product (5) and the symmetric rank-2b update (8), each with a cost of 2k²b flops. On the other hand, the matrix products (6) and (7) only require 2kb² flops each. Therefore, the overall cost of Step 4 is approximately 4k²b + 4kb² flops. This is higher than the cost of the preceding Steps 1, 2, and 3, which require O(kb²), O(kb²), and O(max(kb², kbw)) flops, respectively. In summary, provided that b and w are both small compared to n, the global cost of the reduction of a full matrix to banded form is 4n³/3 flops. Furthermore, the bulk of the computation is performed in terms of the BLAS-3 operations symm and syr2k in (5) and (8), so that high performance can be expected from routine syrdb in case a tuned implementation of BLAS is used. The orthogonal matrix QB ∈ R^{n×n} for the reduction QB^T A QB = B, where B ∈ R^{n×n} is the (symmetric) banded matrix, can be explicitly constructed by accumulating the involved Householder reflectors at a cost of 4n³/3 flops. Once again, compact WY representations help in casting this computation almost entirely in terms of calls to BLAS-3 kernels.
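As an illustration of how Step 3 maps onto BLAS-3 calls, the update A1 := A1 + W(T^T(W^T A1)) can be realized with three gemm calls. The sketch below is our own (CBLAS, column-major storage, caller-provided b × (w − b) workspaces X and Y) and is not part of the SBR code; since T is triangular, the middle product could also use trmm instead of gemm.

#include <cblas.h>

/* A1 := A1 + W * (T^T * (W^T * A1))
 * W : k x b,  T : b x b (upper triangular),  A1 : k x (w-b)
 * X, Y : b x (w-b) workspaces */
void apply_wy_left(int k, int b, int wb, const float *W, int ldw,
                   const float *T, int ldt, float *A1, int lda1,
                   float *X, float *Y)
{
    /* X := W^T * A1 */
    cblas_sgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                b, wb, k, 1.0f, W, ldw, A1, lda1, 0.0f, X, b);
    /* Y := T^T * X */
    cblas_sgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                b, wb, b, 1.0f, T, ldt, X, b, 0.0f, Y, b);
    /* A1 := A1 + W * Y */
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                k, wb, b, 1.0f, W, ldw, Y, b, 1.0f, A1, lda1);
}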
2.2 Reduction to Banded Form on the GPU
Recent work on the implementation of BLAS and the major factorization routines for the solution of linear systems [7,8,9] has demonstrated the potential of GPUs to yield high performance on dense linear algebra operations which can be cast in terms of matrix-matrix products. In this subsection we describe how to exploit the GPU in the reduction of a matrix to banded form, orchestrating the computations carefully to reduce the number of data transfers between the host and the GPU. During the reduction to banded form, the operations in Step 4 are natural candidates for being computed on the GPU, while, due to the kernels involved in Steps 1 and 2 (mainly narrow matrix-vector products), these operations are better suited for the CPU. The operations in Step 3 can be performed either on
the CPU or the GPU but, in general, w − b will be small, so that this computation is likely better suited for the CPU. Now, assume that the entire matrix resides in the GPU memory initially. We can then proceed to compute the reduced form by repeating the following three steps for each column block:

1. Transfer A0 and A1 back from GPU memory to main memory. Compute Steps 1, 2, and 3 on the CPU.
2. Transfer W and Y from main memory to the GPU.
3. Compute Step 4 on the GPU.

Proceeding in this manner, at the completion of the algorithm most of the banded matrix and the Householder reflectors are available in the main memory. Specifically, only the diagonal b × b blocks of A remain to be transferred to the main memory.
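A minimal sketch of how Step 4 could be expressed on the GPU with the CUBLAS kernels evaluated later in Section 3 is given below. This is our own illustration, not the actual syrdb code; the function name, the workspace layout and the use of the (legacy) CUBLAS interface are assumptions.

#include <cublas.h>

/* One block-column update (Step 4) on the GPU, after W and Y = W*T have been
 * copied to device memory.  Column-major storage; only the lower triangle
 * of A2 is referenced and updated. */
void step4_on_gpu(int k, int b,
                  float *dA2,          /* k x k, device              */
                  const float *dW,     /* k x b, device              */
                  const float *dY,     /* k x b, device              */
                  float *dX1,          /* k x b, device workspace    */
                  float *dX2)          /* b x b, device workspace    */
{
    /* X1 := A2 * W                                    (symm)  */
    cublasSsymm('L', 'L', k, b, 1.0f, dA2, k, dW, k, 0.0f, dX1, k);
    /* X2 := 1/2 * X1^T * W                            (gemm)  */
    cublasSgemm('T', 'N', b, b, k, 0.5f, dX1, k, dW, k, 0.0f, dX2, b);
    /* X3 := X1 + Y * X2   (accumulated in X1)         (gemm)  */
    cublasSgemm('N', 'N', k, b, b, 1.0f, dY, k, dX2, b, 1.0f, dX1, k);
    /* A2 := A2 + X3 * Y^T + Y * X3^T                  (syr2k) */
    cublasSsyr2k('L', 'N', k, b, 1.0f, dX1, k, dY, k, 1.0f, dA2, k);
}

As discussed in Section 3, on the GPU the symm and syr2k calls above may be replaced by gemm calls on the full matrix when that turns out to be faster.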
2.3 Reduction to Tridiagonal Form
The routine sbrdt in SBR is responsible for reducing the banded matrix B to tridiagonal form by means of Householder reflectors. Let QT denote the orthogonal transforms which produce this reduction, that is QTT BQT = T . On exit, the routine returns the tridiagonal matrix T and, upon request, accumulates these transforms, forming the matrix Q = QB QT ∈ Rn×n so that QT AQ = QTT (QTB AQB )QT = QTT BQT = T . The matrix T is constructed in routine sbrdt one column at the time: at each iteration the elements below the first subdiagonal of the current column are annihilated using a Householder reflector; the reflector is then applied to both sides of the matrix, and the resulting bulge is chased down along the band. The computation is cast in terms of BLAS-2 operations at best (symv and syr2 for two-sided updates, and gemv and ger for one-sided updates) and the total cost is 6n2 w + 8nw2 flops. If the eigenvectors are desired, then the orthogonal matrix QB produced in the first stage (reduction from full to banded form) needs to be updated by the orthogonal transforms computed during the reduction from banded to tridiagonal form (i.e., QT ). This update requires O(n3 ) flops and can be reformulated almost entirely in terms of calls to level-3 BLAS kernels, even though this reformulation is less trivial than for the first stage [4]. However, the matrix Q = QB QT still needs to be applied as part of the back-transform step, adding 2n3 flops to the cost of building the matrix containing the eigenvectors. We do not propose to off-load the reduction of the banded matrix to tridiagonal form on the GPU as this is a fine-grained computation which does not lend itself to an easy implementation on this architecture.
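For reference, the BLAS-2 pattern mentioned above — the two-sided application of a single Householder reflector to a symmetric matrix — can be sketched as follows. This is a generic, unbanded illustration using CBLAS (our own, not the sbrdt code); sbrdt applies such updates only within the band and then chases the resulting bulge.

#include <cblas.h>

/* Two-sided application of H = I - tau*v*v^T to a symmetric matrix A
 * (lower triangle stored): A := H*A*H.
 * x is a workspace vector of length n. */
void apply_reflector_sym(int n, float *A, int lda, const float *v, float tau,
                         float *x)
{
    /* x := tau * A * v */
    cblas_ssymv(CblasColMajor, CblasLower, n, tau, A, lda, v, 1, 0.0f, x, 1);
    /* x := x - (tau/2) * (x^T v) * v */
    float alpha = -0.5f * tau * cblas_sdot(n, x, 1, v, 1);
    cblas_saxpy(n, alpha, v, 1, x, 1);
    /* A := A - v*x^T - x*v^T */
    cblas_ssyr2(CblasColMajor, CblasLower, n, -1.0f, v, 1, x, 1, A, lda);
}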
3 Experimental Results
The target platform used in the experiments is a workstation with two Intel Xeon QuadCore CPUs (8 cores) at 2.33 GHz, with 8 GB DDR2 RAM, and a theoretical peak performance of 149.1/74.6 GFLOPS in single/double precision (1 GFLOPS = 10⁹ flops/second).
Table 1. Performance (in GFLOPS) of the BLAS kernel symm involved in the reduction to band form and the corresponding matrix product (for reference)

                     symm. C := AB + C                            gemm. C := AB + C
                     A ∈ R^{m×m} symmetric, B, C ∈ R^{m×n}        A ∈ R^{m×k}, B ∈ R^{k×n}, C ∈ R^{m×n}
     m     n     1 Core  4 Cores  8 Cores  CUBLAS  Our BLAS       1 Core  4 Cores  8 Cores  CUBLAS
  2048    24       6.6      6.5      8.2    68.6      67.1          5.6     18.1     25.1    101.0
  2048    32       8.0      8.8      6.9    89.7     106.5          6.8     23.5     32.7    177.5
  2048    64      10.9     16.0     13.8    97.1     183.4          9.8     34.2     51.8    279.0
  6144    24       6.6      6.0      4.3    71.6      73.1          5.9     21.1     30.9    134.9
  6144    32       7.8      7.7      6.0    94.1     129.6          7.0     26.0     38.7    327.5
  6144    64      10.7     14.1     11.6    99.1     188.4          9.9     37.4     60.1    339.3
 10240    24       6.6      5.9      3.7    57.4      68.1          5.9     21.6     30.7    139.7
 10240    32       7.8      7.8      6.1    76.0     113.5          7.0     26.3     34.9    321.9
 10240    64      10.7     14.2     11.7    76.5     175.8          9.9     38.1     56.1    346.9
The workstation is also equipped with an NVIDIA Tesla C1060 board with 240 single-precision and 30 double-precision streaming processor cores at 1.3 GHz, 4 GB DDR3 RAM, and a theoretical peak performance of 933/78 GFLOPS in single/double precision. The Intel chipset E5410 and the Tesla board are connected via a PCI-Express Gen2 interface with a peak bandwidth of 48 Gbits/second. MKL 10.1 was employed for all computations performed on the Intel cores. NVIDIA CUBLAS (version 2.0) built on top of the CUDA application programming interface (version 2.0), together with NVIDIA driver 177.73, were used in our tests. Single precision was employed in all experiments. When reporting the rate of computation, we consider the cost of the reduction to tridiagonal form (either using LAPACK sytrd or SBR syrdb+sbrdt) to be 4n³/3 flops for square matrices of order n. Note that, depending on w, this count may be considerably smaller than the actual number of flops performed by syrdb+sbrdt. We do not build the orthogonal factors/compute the eigenvectors in our experiments. The matrices are stored using canonical column-wise storage. The GFLOPS rate is computed as (4n³/3) × 10⁻⁹ divided by t, where t equals the elapsed time in seconds. The cost of all data transfers between main memory and GPU memory is included in the timings. Our first experiment evaluates the performance of the major BLAS-3 kernels involved in the reduction to tridiagonal form using the LAPACK and SBR routines: symm and syr2k. For reference, we also evaluate the performance of the general matrix-product kernel, gemm. Tables 1 and 2 report results on 1, 4 and 8 cores of the Intel Xeon processors and 1 GPU. For the latter architecture, we employ the kernels in CUBLAS and also our own implementations (column labeled "Our BLAS"). The matrix dimensions of symm and syr2k are chosen so that they match the structure of the blocks encountered during the reduction. The matrix dimensions of gemm mimic the sizes of the operands in symm or syr2k. The results show the higher performance yielded by the GPU for most matrix operations and the benefit of using block sizes that are integer multiples of 32 on this hardware.
Table 2. Performance (in GFLOPS) of the BLAS kernel syr2k involved in the reduction to band form and the corresponding matrix product (for reference)

                     syr2k. C := AB^T + BA^T + C                  gemm. C := AB^T + C
                     A, B ∈ R^{n×k}, C ∈ R^{n×n} symmetric        A ∈ R^{m×k}, B ∈ R^{n×k}, C ∈ R^{m×n}
     n     k     1 Core  4 Cores  8 Cores  CUBLAS  Our BLAS       1 Core  4 Cores  8 Cores  CUBLAS
  2048    24      10.7     24.2     36.0    36.1      36.8         14.7     44.0     86.5     70.7
  2048    32      11.7     29.9     42.9    53.2      53.2         15.6     50.4     95.3    157.2
  2048    64      13.6     40.6     57.8    74.4     159.2         16.8     59.4    112.9    185.5
  6144    24      10.6     24.2     34.9    40.3      40.3         15.0     27.7     27.9     71.4
  6144    32      11.9     29.9     43.2    55.9      56.0         15.9     36.6     36.9    161.0
  6144    64      13.6     43.8     64.0    78.4     124.2         17.0     59.3     73.1    185.0
 10240    24      10.0     20.6     24.5    41.2      41.4         15.0     27.3     27.7     71.9
 10240    32      10.8     26.6     31.5    56.4      56.4         15.8     36.0     36.8    159.3
 10240    64      13.2     43.0     56.5    79.2     114.2         16.9     58.2     73.3    182.2
Although our own implementations of the symmetric kernels deliver a higher GFLOPS rate than those in NVIDIA's CUBLAS, they are still well below the performance of the matrix-matrix product kernel. In particular, the results in Table 2 illustrate that, depending on the value of k, on the GPU it may be more efficient to call the gemm kernel twice (updating the whole matrix in Step 4 of routine syrdb) instead of syr2k once (updating only one of the triangles), even though this implies doubling the number of flops. Besides, in case both the upper and the lower triangular parts of A2 in (8) are updated, one can replace the call to symm by a call to gemm. These are the strategies used in our implementation of syrdb for the GPU. The second experiment compares the LAPACK and SBR codes for the reduction to tridiagonal form using 1, 4 and 8 cores of the CPU, or the GPU plus one of the cores of the CPU.

Fig. 2. Performance of the reduction of a dense matrix to tridiagonal form using the routines in LAPACK and the SBR toolbox (GFLOPS vs. matrix dimension m = n; curves for SBR on 1 GPU, SBR on 8, 4 and 1 Intel cores, and LAPACK on 8 Intel cores)
Table 3. Execution time (in seconds) for the two-stage SBR routines

                       1st stage: Full → Band                             2nd stage: Band → Tridiagonal
     n     w      1 Core   4 Cores   8 Cores   CUBLAS   Transf. time         8 cores
  2048    32         1.1       0.8       0.8      0.2        0.026              0.4
  2048    96         0.9       0.5       0.5      0.2        0.026              0.8
  6144    32        33.5      23.8      28.5      2.5        0.222              3.7
  6144    96        25.3      11.6      11.7      2.7        0.222              7.5
 10240    32       155.8     110.4     129.5     10.1        0.609             10.3
 10240    96       116.6      51.2      51.6     10.6        0.609             25.6
 22528    32      2193.2    1542.1    1867.0     91.5        2.928             59.5
 22528    96      1632.2     617.7     586.8     92.7        2.928            212.5
Figure 2 reports the GFLOPS for these alternatives. Only the results corresponding to the best block size (b) and bandwidth (w) are reported in the figure. The performance behaviour of sytrd in LAPACK is typical for a routine based on BLAS-2: when the matrix is too large to fit into the cache of the processor, the performance rapidly drops. The GFLOPS rate attained by SBR on the CPU is more consistent and shows good scalability up to four cores; the poor scalability of the symm routine explains the small performance gain when the number of cores is increased to 8. Finally, the SBR code modified to off-load the bulk of the computation to the GPU clearly outperforms all the executions on the general-purpose CPU. Comparing the SBR routines using 4 cores and the GPU, the speed-up observed for the latter when solving the largest problem size is 4.9x. This in practice reduces the cost of the first stage in SBR to that of the second stage, as shown in Table 3; in addition, the table shows the time devoted to data transfers for the GPU implementation. The results in that table also show that the acceleration attained by off-loading the computation of the first stage to the GPU is a factor of 17x for the largest problem size and w = 32.
4 Concluding Remarks
We have evaluated the performance of existing codes for the reduction of a dense matrix to tridiagonal form. Our experimental results confirm that the two-stage approach proposed in the SBR toolbox (reduction from full to banded form in the first stage followed by a reduction from banded to tridiagonal form in a second stage) delivers a higher parallel scalability than the LAPACK-based alternative on general-purpose SMP architectures. We have also modified the SBR codes to off-load the most expensive operations in the first stage to the GPU. There we employ the highly tuned implementation of the matrix-matrix product kernel in CUDA to compute symmetric matrix products and symmetric rank-2b updates. Although this strategy increases the cost of this initial stage by 33%, we found in our experiments that this is clearly compensated by the higher performance
of that particular kernel. In summary, the GPU variant reduces significantly the cost of the first stage, making it comparable to that of the second stage.
Acknowledgments The first author gratefully acknowledges the support received from the Deutsche Forschungsgemeinschaft through grant GSC 111. The second and fourth authors were supported by projects CICYT TIN2005-09037-C02-02, TIN2008-06570C04-01 and FEDER, and P1B-2007-19 of the Fundaci´ on Caixa-Castell´ on/ Bancaixa and UJI.
References
1. Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. The Johns Hopkins University Press, Baltimore (1996)
2. Dhillon, I.S., Parlett, B.N., Vömel, C.: The design and implementation of the MRRR algorithm. ACM Trans. Math. Soft. 32(4), 533–560 (2006)
3. Bientinesi, P., Dhillon, I.S., van de Geijn, R.: A parallel eigensolver for dense symmetric matrices based on multiple relatively robust representations. SIAM J. Sci. Comput. 27(1), 43–66 (2005)
4. Bischof, C.H., Lang, B., Sun, X.: Algorithm 807: The SBR Toolbox—software for successive band reduction. ACM Trans. Math. Soft. 26(4), 602–616 (2000)
5. Dongarra, J., Hammarling, S.J., Sorensen, D.C.: Block reduction of matrices to condensed forms for eigenvalue computations. LAPACK Working Note 2, Technical Report MCS-TM-99, Argonne National Laboratory (September 1987)
6. Bientinesi, P., Igual, F.D., Kressner, D., Quintana-Ortí, E.S.: Reduction to condensed forms for symmetric eigenvalue problems on multi-core architectures. Technical Report 2009-13, Seminar for Applied Mathematics, ETH Zurich, Switzerland (2009), http://www.sam.math.ethz.ch/reports/2009
7. Barrachina, S., Castillo, M., Igual, F.D., Mayo, R., Quintana-Ortí, E.S.: Evaluation and tuning of the level 3 CUBLAS for graphics processors. In: Proceedings of the 10th IEEE Workshop on Parallel and Distributed Scientific and Engineering Computing, PDSEC 2008 (2008), CD-ROM
8. Barrachina, S., Castillo, M., Igual, F.D., Mayo, R., Quintana-Ortí, E.S.: Solving dense linear systems on graphics processors. In: Luque, E., Margalef, T., Benítez, D. (eds.) Euro-Par 2008. LNCS, vol. 5168, pp. 739–748. Springer, Heidelberg (2008)
9. Volkov, V., Demmel, J.W.: Benchmarking GPUs to tune dense linear algebra. In: SC 2008: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, Piscataway, NJ, USA, pp. 1–11. IEEE Press, Los Alamitos (2008)
On Parallelizing the MRRR Algorithm for Data-Parallel Coprocessors

Christian Lessig¹ and Paolo Bientinesi²

¹ DGP, University of Toronto, Canada
[email protected]
² AICES, RWTH Aachen, Germany
[email protected]
Abstract. The eigenvalues and eigenvectors of a symmetric matrix are of interest in a myriad of applications. One of the fastest and most accurate numerical techniques for the eigendecomposition is the Algorithm of Multiple Relatively Robust Representations (MRRR), the first stable algorithm that computes the eigenvalues and eigenvectors of a tridiagonal symmetric matrix in O(n2 ) arithmetic operations. In this paper we present a parallelization of the MRRR algorithm for data parallel coprocessors using the CUDA programming environment. The results demonstrate the potential of data-parallel coprocessors for scientific computations: compared to routine sstemr, LAPACK’s implementation of MRRR, our parallel algorithm provides 10-fold speedups.
1 Introduction
The symmetric eigenproblem is of great importance in science and engineering. The problem consists in decomposing a symmetric matrix A into the product A = VΛV^T, where the columns of V and the elements of the diagonal matrix Λ contain the eigenvectors and the eigenvalues of A, respectively.¹ A variety of numerical solvers have been developed and thoroughly studied in the past decades [1,2]. The efficiency of such eigensolvers depends on the available computational resources and on the required accuracy of the results. With the advent of mainstream parallel architectures such as multi-core CPUs and data-parallel coprocessors, the amenability to parallelization is also becoming important. The computation of a symmetric eigendecomposition involves the following three stages: (1) reduction of the original matrix A to tridiagonal form T [3], (2) computation of the eigenpairs of T, (3) mapping the eigenvectors of T into those of A (backtransformation). Both the first and last stage cost O(n³) floating point operations (flops). The second stage also required O(n³) flops until the introduction of the Algorithm of Multiple Relatively Robust Representations (MRRR) [4], which computes the result with only O(n²) operations. In contrast to most other eigensolvers, for which space requirements and/or lack of accuracy become limiting factors on parallel architectures, the MRRR algorithm is also amenable to parallelization.
In this report we will employ Householder’s notation. Matrices will be denoted by bold capital letters such as A and T; vectors by bold small letters such as a and b; and scalars by non-bold letters such as a, b, and α.
Algorithm 1. Sketch of the MRRR algorithm for the tridiagonal symmetric eigenproblem.
Input: A tridiagonal symmetric matrix T ∈ R^{n×n}.
Output: The eigenvalues Λ and the eigenvectors Z of T.
 1  Initialize the index set K to [1..n].
 2  Find an RRR LDL^T for a shift of T with respect to K.
 3  Compute or refine the eigenvalues λi of LDL^T, with i ∈ K.
 4  for each i ∈ K do
 5      Classify λi as a singleton or as part of a cluster.
 6  for each cluster λ_{k1:km} do
 7      Set K̂ to [k1..km].
 8      Compute a new RRR L̂D̂L̂^T for LDL^T − μI with respect to λ_{k1}, ..., λ_{km}, where μ is a shift close to the cluster λ_{k1:km}.
 9      Let LDL^T := L̂D̂L̂^T, K := K̂, and go to Line 3.
10  for each singleton λi do
11      Compute the eigenvalue and eigenvector to high relative accuracy.
An efficient variant of MRRR for distributed systems was recently developed [5], but despite significant renewed interest in data-parallel architectures such as graphics processing units or Larrabee, no data-parallel version has been proposed in the literature. In this article we present a parallelization of the MRRR algorithm for data-parallel coprocessors. By using the Compute Unified Device Architecture (CUDA) programming environment [6], we efficiently map MRRR onto data-parallel architectures. When compared to sstemr, LAPACK's [7] implementation of the MRRR algorithm, we obtain up to 10-fold speedups. The rest of the paper is organized as follows. In the next section we provide a brief introduction to the MRRR algorithm. In Sec. 3 we discuss our implementation and design choices, and Sec. 4 contains experimental results. Conclusions and directions of future work are given in Sec. 5.
2 The MRRR Algorithm
MRRR is the first algorithm that computes the eigenvalues Λ and the eigenvectors Z of a symmetric tridiagonal matrix in O(n²) operations without explicit orthogonalization. Indicating with Z̃ and Λ̃ the computed eigenvectors and eigenvalues, respectively, MRRR guarantees small residuals and orthogonality:

  ‖T Z̃ − Z̃ Λ̃‖ = O(ε n ‖T‖)   and   ‖Z̃^T Z̃ − I‖ = O(ε n),      (1)

where I is the identity matrix and ε is smaller than u, the machine precision. In the following we present an overview of MRRR as summarized in Alg. 1. A thorough discussion can be found in [8].
The MRRR algorithm is centered around the computation of the eigenvalues to high relative accuracy. The eigenpair² (zi, λi) is only obtained when the eigenvalue λi is a singleton, i.e., it is well separated with respect to all other eigenvalues. An eigenvalue is said to be well separated if the relative distance reldist(λi, λj) ≡ |λi − λj| / |λi| of λi exceeds a predefined threshold tc for all λj, j ≠ i. Dhillon [4] showed that no explicit orthogonalization is needed for eigenvectors associated with singletons and that these can be computed via twisted factorizations. Eigenvalues that are not well separated belong to eigenvalue clusters and the corresponding eigenpairs cannot be computed directly without loss of accuracy. Matrix shifts of the form T̂k = T − σk I, with σk close to one end of the k-th cluster, are employed to increase the relative distance of the eigenvalues. For the shifted matrix T̂k, it is guaranteed that some of the eigenvalues in the k-th cluster are now well separated. The eigenpairs corresponding to such eigenvalues can then be computed to high accuracy and related back to the input matrix: eigenvectors are invariant under matrix shifts and the eigenvalues of T̂k are those of T shifted by σk. For eigenvalues that are still clustered after the matrix shift, the algorithm proceeds recursively by computing new shifts until all eigenvalues are well separated. To avoid accuracy loss through the matrix shifts, all computations are performed using relatively robust representations (RRR) of the form LDL^T, where L is a lower bidiagonal matrix with unit diagonal and D is a diagonal matrix. A representation tree can be employed to describe the sequence of computations in the MRRR algorithm [9]. The root node of the representation tree is initialized to be the set of desired eigenvalues. The other nodes in the tree correspond to clusters of eigenvalues and to singletons. The edges correspond to matrix shifts.
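As a simple illustration of this classification step, the following host-side sketch marks sorted eigenvalue approximations as singletons or cluster members using the relative-gap threshold tc. The function name and the restriction to adjacent gaps of a sorted list are our own simplifications, not part of the algorithm's specification.

#include <math.h>

/* Classify sorted eigenvalue approximations lambda[0..n-1]: cluster_id[i]
 * receives -1 for singletons and a non-negative cluster index otherwise. */
void classify(const float *lambda, int n, float tc, int *cluster_id)
{
    int current = -1;            /* id of the cluster being grown, -1 if none */
    int next_id = 0;
    for (int i = 0; i < n; ++i) cluster_id[i] = -1;
    for (int i = 0; i + 1 < n; ++i) {
        float reldist = fabsf(lambda[i] - lambda[i + 1]) / fabsf(lambda[i]);
        if (reldist < tc) {      /* lambda[i], lambda[i+1] not well separated */
            if (current < 0) current = next_id++;
            cluster_id[i]     = current;
            cluster_id[i + 1] = current;
        } else {
            current = -1;        /* close the current cluster, if any */
        }
    }
}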
3 Parallelization and Implementation
In this section we outline how to efficiently map the MRRR algorithm onto a data-parallel architecture using the CUDA programming environment. The computational engine of CUDA-capable devices is a task-parallel array of data-parallel multiprocessors where each multiprocessor has an independent SPMD (Single Program, Multiple Data) execution unit and can run up to 512 or 1024 threads simultaneously. In order to exploit these two levels of parallelism, we employ a two-stage approach: in the first stage the eigenproblem is decomposed into a set of smaller problems, called task batches, that can be solved independently on different multiprocessors. In the second stage, the eigenpairs are computed to high relative accuracy by processing different batches on different multiprocessors. For the first stage, both the host and the device processor are employed. The second stage is performed entirely on the data-parallel coprocessor.
zi and λi are the i-th column and the i-th entry of matrices Z and Λ, respectively.
Input: A symmetric tridiagonal matrix T and a threshold tc to classify the eigenvalues’ relative gaps. Output: The eigenvalues and eigenvectors of T computed to high relative accuracy. Assumptions: T is of size less than or equal to 4096, is unreduced and does not contain clusters of size greater than 512. Creation of Task Batches. The MRRR algorithm, as described in Alg. 1, begins with the computation of a relatively robust representation of the input matrix T, i.e., a LDLT factorization for a suitable shift of T (Line 2 of Alg. 1). Since this is an inherently sequential operation, we assign it to the CPU. The resulting RRR, identified by the vectors d and l (the non-zero elements of the matrices D and L) is then transferred onto the device. There, the bisection algorithm (see below) is employed to obtain initial estimates of the eigenvalues (Line 3) and these are classified as singletons or clusters, according to the threshold tc (Lines 4 and 5). The approximate eigenvalues are subsequently stored on the device and read back to the host. The CPU bundles together eigenpairs into task batches so that the essential information about singletons and clusters in a batch, their location, the contained eigenvalues, as well as all applied shifts, can be stored in the fast, but very limited, shared memory of a multiprocessor. The batch construction additionally guarantees that eigenvalues belonging to one cluster are bundled together. This ensures that different batches can be mapped onto different multiprocessors and that the computation can proceed independently without any communication between multiprocessors. Processing of Task Batches. To start the second phase of our data-parallel MRRR algorithm, the list of task batches generated on the host is uploaded onto the device and the data parallel coprocessor is initialized by setting the number of (virtual) multiprocessors equal to the number of batches. Each multiprocessor mpm loads from global memory into shared memory the data relative to the singletons and clusters that have been assigned to it. The computation begins by processing the nodes at level one of the representation tree, i.e., the local singletons and clusters. The eigenpairs for singletons are computed to high relative accuracy via twisted factorizations (Lines 10 and 11). For each cluster a new shift and RRR is determined (Lines 6 to 9), and a refinement of the eigenvalues is obtained using bisection (Line 3). This process is iterated until all clusters have been resolved or a maximum representation tree depth has been reached. qds transform. The qds transform is at the heart of the MRRR algorithm; it is used to determine the RRR for shifted matrices, to compute the eigenvectors for isolated eigenvalues, and it also underlies the Sturm count computation, an integral part of bisection. Unfortunately, the transform consists of a loop with dependent iterations and is not task-parallelizable. With a data-parallel programming model, the qds transforms can however be performed in parallel for different input data. For example, different matrix shifts or Sturm counts for different intervals can be determined in parallel. Parallelism is only limited by
dependencies in the input data. For the MRRR algorithm these dependencies correspond to different levels of the representation tree. We therefore traverse the representation tree in a breadth first fashion, and exploit data parallelism only when singletons and clusters on the same tree level and in the same task batch are processed. Bisection. The bisection algorithm is a remarkably simple and effective method for computing eigenvalues of a symmetric tridiagonal matrix. The input to the algorithm is a list of intervals containing the desired eigenvalues. The computation begins by subdividing the intervals and by determining the number of eigenvalues contained in each subinterval; this is done by computing Sturm counts at the interval bounds. All non-empty intervals then form the input to the next iteration of the algorithm. The final approximations are tight intervals around the true eigenvalues. We map bisection onto a data-parallel programming model by using one thread to perform the calculations for one interval at each iteration of the algorithm [10]. Sturm counts are computed with the data-parallel qds transform discussed above, and a version of the scan algorithm [11] is used to create a compact list of all non empty intervals after each iteration. Convergence is determined by a mixed relative and absolute criterion, as in LAPACK’s bisection routine stebx. See the technical report by Lessig for a more detailed discussion of our bisection implementation [10].
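For illustration, a Sturm count — the number of eigenvalues smaller than a shift x, which each bisection thread evaluates at its interval bounds — can be sketched for a symmetric tridiagonal matrix as below. This is our own simplified device function; the actual implementation works on the qds transform of an LDL^T representation rather than directly on the tridiagonal entries, and uses more careful pivot safeguards.

#include <float.h>

__device__ int sturm_count( const float *a,   /* diagonal, length n       */
                            const float *b,   /* off-diagonal, length n-1 */
                            int n, float x )
{
    int count = 0;
    float q = a[0] - x;
    if ( q < 0.0f ) ++count;
    for ( int i = 1; i < n; ++i ) {
        /* crude guard against division by zero (a sketch-level safeguard) */
        float d = ( q != 0.0f ) ? q : FLT_EPSILON;
        q = a[i] - x - b[i-1] * b[i-1] / d;
        if ( q < 0.0f ) ++count;
    }
    return count;   /* number of eigenvalues smaller than x */
}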
4 Experimental Evaluation
To assess the efficacy of our parallelization of the MRRR algorithm, we compared our CUDA implementation, mrrr dp (CUDA ver. 2.1), against CLAPACK's sstemr routine (ver. 3.1.1). Our test system was an Intel Pentium D CPU (3.0 GHz) with 1 GB RAM, and an NVIDIA GeForce 8800 Ultra (driver version 180.22). The operating system employed for the experiments was Fedora Core 10 and all programs were compiled using gcc 4.3.2. The testbed consisted of random tridiagonal matrices of size ≤ 1024 with entries uniformly distributed on [0, 1] and [−1, 1]. In a successive study, we will test mrrr dp on a much larger variety of test matrices, including tridiagonal matrices arising in scientific applications.

Table 1. Timings and accuracy comparison between mrrr dp and sstemr for random matrices with entries uniformly distributed on [0, 1] and [−1, 1]

                               rand(0,1)                               rand(-1,1)
    n    Alg.        Time     Res.     Orth.    Eigv.       Time     Res.     Orth.    Eigv.
  128    mrrr dp     6.26   8.6e-7   6.6e-6   9.6e-7        3.84   5.0e-7   1.8e-5   3.0e-7
  128    sstemr      6.98   1.4e-6   1.0e-5   1.3e-6        6.79   1.4e-6   1.0e-5   1.3e-6
  256    mrrr dp    13.0    8.9e-7   8.0e-6   1.0e-6        8.34   4.9e-7   2.6e-5   2.8e-7
  256    sstemr     32.1    1.5e-6   1.0e-5   1.4e-6       31.8    1.5e-6   1.2e-5   1.5e-6
  512    mrrr dp    28.7    1.2e-6   9.8e-6   1.0e-6       19.2    8.5e-7   3.3e-5   3.0e-7
  512    sstemr    154.9    1.5e-6   9.4e-6   1.4e-6      152.7    1.6e-6   9.1e-6   1.5e-6
 1024    mrrr dp    60.2    3.0e-6   3.6e-5   1.2e-6       54.0    3.5e-6   8.1e-5   8.0e-7
 1024    sstemr    656.1    2.8e-6   1.0e-5   1.6e-6      647.6    2.9e-6   1.4e-5   1.6e-6
Fig. 1. Average execution time for 30 random matrices of matrix size n ∈ [32, 1024]
In addition to timings, we also report three accuracy measures: the residual and the orthogonality of the eigendecomposition (1), and the distance of the computed eigenvalues from LAPACK's double-precision function dsterf, which implements the highly accurate QR algorithm. Our experimental data demonstrate the potential of parallelizing the MRRR algorithm for data-parallel architectures. As shown in Fig. 1 and Table 1, mrrr dp provides significant speedups over sstemr for matrices of size n ≥ 128. The performance improvement increases linearly with the matrix size, reaching more than 10-fold speedups for n = 1024, which is still a modest problem size. Table 1 also includes data on the accuracy attained by sstemr and mrrr dp. The experiments provide evidence that our implementation is equivalent to sstemr in terms of residual and orthogonality. The orthogonality of the eigenvectors computed by mrrr dp degrades slightly as the matrix size increases, but remains of the same order of magnitude as that attained by sstemr. We believe that this problem can be alleviated by a more careful strategy for choosing the shift in the computation of the RRR for clustered eigenvalues.
5 Conclusion
We introduced mrrr dp , a data-parallel version of the Algorithm of Multiple Relatively Robust Representations (MRRR) for the symmetric tridiagonal
eigenvalue problem. Our experimental results demonstrate that the MRRR algorithm can be mapped efficiently onto data-parallel coprocessors: mrrr dp retains the same accuracy of sstemr, LAPACK MRRR’s implementation, while attaining significant performance speedups. Our results are still preliminary: in the near future we will investigate a number of optimizations for both accuracy and performance. Also, we will release the constraints on the input matrix and further explore the tradeoff between speed and accuracy.
Acknowledgements The first author thanks NVIDIA’s London office for many helpful discussions and acknowledges support by the Natural Sciences and Engineering Research Council of Canada (NSERC). The second author gratefully acknowledges the support received from the Deutsche Forschungsgemeinschaft (German Research Association) through grant GSC 111.
References
1. Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore (1996)
2. Parlett, B.N.: The Symmetric Eigenvalue Problem. Prentice-Hall, Inc., Upper Saddle River (1998)
3. Martin, R.S., Reinsch, C., Wilkinson, J.H.: Householder's Tridiagonalization of a Symmetric Matrix. Numer. Math. 11, 181–195 (1968)
4. Dhillon, I.S.: A New O(n²) Algorithm for the Symmetric Tridiagonal Eigenvalue/Eigenvector Problem. PhD thesis, EECS Department, University of California, Berkeley (1997)
5. Bientinesi, P., Dhillon, I.S., van de Geijn, R.A.: A Parallel Eigensolver for Dense Symmetric Matrices Based on Multiple Relatively Robust Representations. SIAM J. Sci. Comput. 27(1), 43–66 (2005)
6. NVIDIA Corporation: CUDA Programming Guide, 1st edn. NVIDIA Corporation, 2701 San Tomas Expressway, Santa Clara, CA 95050, USA (2007)
7. Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users' Guide, 3rd edn. Society for Industrial and Applied Mathematics, Philadelphia (1999)
8. Dhillon, I.S., Parlett, B.N., Vömel, C.: The Design and Implementation of the MRRR Algorithm. ACM Trans. Math. Softw. 32(4), 533–560 (2006)
9. Dhillon, I.S., Parlett, B.N.: Multiple Representations to Compute Orthogonal Eigenvectors of Symmetric Tridiagonal Matrices. Linear Algebra and its Applications 387(1), 1–28 (2004)
10. Lessig, C.: Eigenvalue Computation with CUDA. Technical report, NVIDIA Corporation (August 2007)
11. Harris, M., Sengupta, S., Owens, J.: Parallel Prefix Sum (Scan) with CUDA. In: Nguyen, H. (ed.) GPU Gems 3. Addison Wesley, Reading (August 2007)
Fast In-Place Sorting with CUDA Based on Bitonic Sort
Hagen Peters, Ole Schulz-Hildebrandt, and Norbert Luttenberger
Research Group for Communication Systems, Department of Computer Science, Christian-Albrechts-University Kiel, Germany
Abstract. State-of-the-art graphics processors provide high processing power and, furthermore, the high programmability of GPUs offered by frameworks like CUDA increases their usability as high-performance coprocessors for general-purpose computing. Sorting is well investigated in computer science in general, but this new field of application for GPUs creates a demand for high-performance parallel sorting algorithms that fit the characteristics of modern GPU architecture. We present a high-performance in-place implementation of Batcher's bitonic sorting networks for CUDA-enabled GPUs. We adapted bitonic sort for arbitrary input length and assigned compare/exchange operations to threads in a way that decreases low-performance global-memory access and thereby greatly increases the performance of the implementation. Keywords: GPU, GPGPU, CUDA, Parallel, Sorting, Multicore.
1 Introduction
The processing power of modern desktop GPUs has greatly increased during the last years. In fact, it exceeds the processing power (measured in FLOP/s) of desktop CPUs by far. Since the release of frameworks like NVIDIA's CUDA, which provide free programmability of GPUs, the processing power of GPUs is easily available for general-purpose computing on GPUs (GPGPU). Many general-purpose applications require high-performance sorting algorithms, and therefore many sorting algorithms have been explicitly developed for GPUs. Early implementations ([1], [2], [3], [4], [5]) were often based on bitonic sort [6]. Although bitonic sort is an in-place sorting algorithm, early implementations (due to the limitations of early programming frameworks and the underlying hardware) were not in-place. Meanwhile many sorting algorithms with a wide range of designs have been developed for GPUs, many of them using CUDA. Harris et al. [7], Le Grand [8] and He et al. [9] implemented radix sort for GPUs. Sengupta et al. [10] were the first to implement quicksort; later Cederman et al. [11] developed a more efficient implementation of quicksort. Sintorn et al. [12] presented a hybrid algorithm that combines merge sort and bucket sort. Recently, Satish et al. [13] developed new implementations of radix sort and merge sort for CUDA. None of these sorting algorithms is both in-place and comparison-based.
In this contribution we present an implementation of Batcher's bitonic sort for NVIDIA's CUDA. In contrast to other implementations of bitonic sort, this implementation is highly competitive with other sorting algorithms on GPUs. Furthermore, it keeps the benefits of using sorting networks, i.e. it works comparison-based and in-place. We give a short introduction to bitonic sort in section 2 and present a trivial implementation of bitonic sort for CUDA in section 3.1. Our optimization of bitonic sort for NVIDIA's GPUs by reducing access to global memory is shown in section 3.2. Section 3.3 introduces a modification of bitonic sort for sequences of arbitrary length. In section 4 we show performance measurements and comparisons to other GPU sorting algorithms. We summarize the results of this contribution in section 5.
2 Introduction to Bitonic Sort
Bitonic sort is based on bitonic sequences, i.e. concatenations of two subsequences sorted in opposite directions. Given a bitonic sequence B of length m = 2^r, B can be sorted ascending (analogously descending) in r steps. In the first step the pairs of elements (B[0], B[m/2]), (B[1], B[m/2 + 1]), ..., (B[m/2 − 1], B[m − 1]) are compared and exchanged if the first element is larger than the second element. This results (as shown by Batcher [6]) in two bitonic subsequences, (B[0], ..., B[m/2 − 1]) and (B[m/2], ..., B[m − 1]), where all elements in the first subsequence are smaller than any element in the second subsequence. In the second step the same procedure is applied to both subsequences, resulting in four bitonic subsequences. All elements in the first subsequence are smaller than any element in the second, all elements in the second subsequence are smaller than any element in the third, and all elements in the third subsequence are smaller than any element in the fourth subsequence. The third, fourth, ..., r-th steps are analogous. Processing the r-th step results in 2^r subsequences of length 1, thus the sequence B is sorted. Let A be the input array to sort and let n = 2^k be the length of A. The process of sorting A consists of k phases. The subsequences of length 2, (A[0], A[1]), (A[2], A[3]), ..., (A[n−2], A[n−1]), are bitonic sequences by definition. In the first phase these subsequences are sorted (as shown above) alternating descending and ascending (or ascending/descending; the resulting subsequences are bitonic in both cases), which makes the subsequences of length 4, (A[0], A[1], A[2], A[3]), ..., (A[n − 4], A[n − 3], A[n − 2], A[n − 1]), bitonic sequences. In the second phase these subsequences of length 4 are sorted alternating descending and ascending, resulting in subsequences of length 8 being bitonic sequences. In the i-th phase of bitonic sort the total number of subsequences being sorted is 2^{k−i} and the length of each of these subsequences is 2^i, thus the i-th phase consists of i steps. After the (k − 1)-th phase the array A is a bitonic sequence. A is sorted in the last phase k. In every step n/2 compare/exchange operations are processed, thus the number of compare/exchange operations in bitonic sort, and by this reason the time complexity, is O(n · log² n). This complexity makes bitonic sort less advantageous for sequential use.
Fig. 1. Sorting network
Fig. 2. Partition example
Fig. 3. Absolute runtime (time in ms over input size)
The strength of bitonic sort is that it can be easily implemented by comparator networks. Comparator networks can be implemented directly in hardware, but they are also adequate for generic parallel architectures since they work in-place, require little inter-process communication and, furthermore, can be naturally implemented on SIMD architectures. Figure 1 shows the bitonic sorting network for an arbitrary input sequence of size n = 8. The sorting network consists of 3 = log 8 phases, phase p having p steps. Every step consists of 4 = n/2 compare/exchange operations.
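As a point of reference for the discussion of phases and steps above, the whole procedure can be written as a short sequential routine. The sketch below is only an illustration for n being a power of two (ascending final order); it is not the GPU code discussed in the next section.

#include <algorithm>
#include <cstddef>

// Illustrative sequential bitonic sort for n = 2^k.  In phase p the array is
// viewed as blocks of length 2^p sorted in alternating directions; step s of
// that phase compares elements at distance 2^(s-1).
void bitonicSortSequential(float *a, std::size_t n)
{
    for (std::size_t blockLen = 2; blockLen <= n; blockLen <<= 1)        // phases
        for (std::size_t dist = blockLen >> 1; dist >= 1; dist >>= 1)    // steps
            for (std::size_t i = 0; i < n; ++i) {
                std::size_t j = i ^ dist;            // partner of i in this step
                if (j <= i) continue;                // handle every pair once
                bool ascending = ((i & blockLen) == 0);
                if ((a[i] > a[j]) == ascending)
                    std::swap(a[i], a[j]);           // compare/exchange
            }
}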
3 Implementation
3.1 Basic Implementation of Bitonic Sort with CUDA
CUDA is a parallel programming model and software environment for NVIDIA's GPUs. CUDA allows the programmer to define functions, so-called kernels, which are executed on the GPU. One kernel is executed in parallel by a number of CUDA threads. Threads are organized in blocks. The number of blocks and the number of threads per block (and therefore the total number of threads) can be specified for each individual launch of a kernel. On up-to-date GPUs the maximum number of blocks is 2^32 and the maximum number of threads per block is 512. A thread block has a shared memory visible to all threads of the block. Every thread has its own set of registers and all threads have random access to the same global memory. In general, access to global memory is significantly slower than access to shared memory. CUDA provides a barrier mechanism for synchronisation of threads of the same block, but does not provide any synchronisation mechanisms for threads of different blocks. In order to achieve a good utilisation of the GPU one has to take care of the number of launched threads. On up-to-date GPUs the number of threads per block should be at least 64 and the number of blocks should be at least 100. For a trivial implementation of bitonic sort with CUDA one could use an approach that step-wise follows the implementation of bitonic sort by sorting
networks: if the array to sort, of length n = 2^k, already resides in GPU memory, bitonic sort could be implemented by a sequence of kernel launches, each launch processing all compare/exchange operations of one step of one phase. Every kernel launch would consist of 2^{k−1} threads, each thread processing one compare/exchange operation. This implementation is optimal in the sense that it provides a good utilisation of the GPU for larger input sizes and results in a uniform distribution of memory access and workload to threads. The main disadvantage is the excessive use of global memory, which significantly increases the runtime of the algorithm. Since all operations of one step are processed in an individual kernel launch, each compare/exchange operation causes at least 2 reading or writing accesses to global memory. Sorting a sequence of length 2^k requires k phases. The j-th phase consists of j steps. In the trivial solution each step causes 2 · 2^{k−1} = 2^k read accesses to global memory. Thus the total number of read accesses in this case is:

  Σ_{j=1}^{k} j · 2^k = (k · (1 + k) / 2) · 2^k = ((log n) · (1 + log n) / 2) · n    (1)
The number of write accesses is at most the same.
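A minimal sketch of this trivial variant is given below: one kernel launch per step and one compare/exchange per thread. All names and the launch configuration are our own illustration, not the authors' code, and n is assumed to be a power of two.

#include <cuda_runtime.h>

// One compare/exchange per thread; the whole step of one phase is one launch.
// blockLen = 2^p, dist = 2^(s-1).
__global__ void bitonicStepKernel(float *a, unsigned int n,
                                  unsigned int blockLen, unsigned int dist)
{
    unsigned int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n / 2) return;                                   // n/2 operations per step
    unsigned int i = (t / dist) * (2 * dist) + (t % dist);    // lower index of the pair
    unsigned int j = i + dist;
    bool ascending = ((i & blockLen) == 0);                   // direction of the subsequence
    float x = a[i], y = a[j];
    if ((x > y) == ascending) { a[i] = y; a[j] = x; }         // exchange
}

// Host side: k phases, phase p consisting of p steps, one launch per step.
void bitonicSortTrivial(float *d_a, unsigned int n)           // d_a resides in GPU memory
{
    const unsigned int threads = 256;
    const unsigned int blocks  = (n / 2 + threads - 1) / threads;
    for (unsigned int blockLen = 2; blockLen <= n; blockLen <<= 1)
        for (unsigned int dist = blockLen >> 1; dist >= 1; dist >>= 1)
            bitonicStepKernel<<<blocks, threads>>>(d_a, n, blockLen, dist);
    cudaDeviceSynchronize();
}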
3.2 Minimizing Access to Global Memory
Our approach to significantly reduce the runtime of the algorithm is to reduce the number of accesses to global memory and the number of kernel launches. Therefore we carefully partition the set of operations into subsets that can be processed by single threads or thread blocks, such that operations of multiple consecutive steps can be processed by one thread in one kernel launch and each element has to be read from or written to global memory at most once. Figure 2 shows an example of such a partition for phase 3. The partition consists of two subsets, one marked by red dotted lines and arrows, the other marked by green dashed lines and arrows. All operations of steps 3 and 2 can be processed in one kernel launch by two threads, because operations in one subset are not affected by operations in the other subset. Note that a thread that processes one of these partitions can keep elements in registers, thus one element is read and written only once (and not twice as in the trivial implementation). In the following we give a formal definition of the partitions used in our algorithm. Consider an array A of size 2^k to be the input array to sort. The set of indices in A is I = {0, . . . , 2^k − 1}. We define COEX_{p,s} to be the set of compare/exchange operations that belong to step s of phase p of bitonic sort. The commutative representation of one compare/exchange operation in COEX_{p,s} is coex_{p,s}(a, b) = coex_{p,s}(b, a). The whole set of compare/exchange operations processed by bitonic sort for sorting A is:

  COEX = ∪_{p=1}^{k} ∪_{s=1}^{p} COEX_{p,s}
We define an operator K that maps sets of compare/exchange operations to the sets of array positions that are affected by these operations. Furthermore, we define a class of operators N_{p,s} that map a set of array positions J to the set of compare/exchange operations that affect array positions in J in step s of phase p:

  K : P(COEX) → P(I);  M ↦ {i | coex_{•,•}(A[i], •) ∈ M}
  N_{p,s} : P(I) → P(COEX_{p,s});  J ↦ {coex_{p,s}(A[i], •) ∈ COEX_{p,s} | i ∈ J}

In a single step s of a phase p any element of A is involved in exactly one compare/exchange operation (∀i ∈ I : |N_{p,s}({i})| = 1). Thus there are no dependencies between operations in the same step of the same phase; all operations of COEX_{p,s} can be done (asynchronously) in parallel, as is done in the trivial implementation. To ensure correctness, bitonic sort requires that before an operation c ∈ COEX_{p,s} is processed, both corresponding operations in the preceding step, more precisely, the operations in N_{p,s+1}(K({c})) or N_{p−1,1}(K({c})) (for p = s), must already have been processed; of course the same requirement holds for the elements in those sets, i.e. the four corresponding operations in step s + 2 must already have been processed, and so on. Again have a look at figure 2. In step 2 of phase 3 one red dotted operation corresponds to both red dotted operations in step 3. A general definition of a partition of I (P = {I_1, . . . , I_m} is a partition of I if and only if ∪_{i=1}^{m} I_i = I and I_i ∩ I_j = ∅ for all i, j ∈ {1, . . . , m}, i ≠ j) that ensures a uniform distribution of memory access and workload to threads is

  ∀J ∈ P : |J| = n/m,  |N_{p,s}(J)| = (1/2) · |J|    (2)
In fact, if P satisfies (2) then P is a partition into equally sized sets of indices that induce sets (of half the size) of compare/exchange operations, thus

  {N_{p,s}(J) | J ∈ P} is a partition of COEX_{p,s}    (3)
In simple words, all operations in COEX_{p,s} that affect elements in J ∈ P do not affect elements that are not in J. Since the size of the induced sets of operations is half the size of the sets of indices, all operations in COEX_{p,s} are processed. We define the degree of P in step s of phase p as the maximum number of successive steps for which (2) and (3) hold. More precisely, the degree D_{p,s}(P) is the maximum number z for which the following holds:

  ∀a ∈ {1, ..., z} : {N_{p,s−a+1}(J) | J ∈ P} is a partition of COEX_{p,s−a+1}    (4)
In the example in figure 2 the used partition in step 3 of phase 3 is {{0, 2, 4, 6}, {1, 3, 5, 7}}. This partition induces a partition of operations in step 3 of phase 3 and in step 2 of phase 3 (red dotted and green dashed arrows) but not in step 1, thus the degree of this partition is two.
The trivial implementation of bitonic sort uses partitions of degree 1. Basically, if we could find a partition of degree d ≥ 2 in step s of phase p, we could process compare/exchange operations of d steps in a single kernel launch. More precisely, using a partition P that satisfies (2), all threads have to access n/|P| elements in memory and can then process compare/exchange operations of D_{p,s}(P) steps nonstop. All operations of D_{p,s}(P) consecutive steps can be done in a single kernel launch without increasing the number of memory accesses per kernel launch. Therefore the overall number of reading memory accesses is divided by the degree of P compared to that in (1). This significantly reduces the runtime of the algorithm. Actually, in step s of phase p there exists a partition with degree d for every d ≤ s. This partition can be defined as follows:

  P^d_{p,s} := ∪_{i∈I} K(N_{p,s}(K(N_{p,s−1}(. . . K(N_{p,s−(d−1)}({i})) . . .))))    (5)
For example, P^1_{p,s} = ∪_{i∈I} K(N_{p,s−(1−1)}({i})) is the partition of the trivial implementation and P^2_{3,3} is the partition used in the example in figure 2. In our implementation of bitonic sort we use partitions of at most degree 4, since the number of registers (or the size of shared memory, respectively) is bounded. Furthermore, an increasing degree results in an increasing workload per thread, which decreases the number of threads and blocks. Thus a higher degree would result in under-utilisation of the GPU.
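To make the idea concrete, the following kernel is a hedged sketch of the degree-2 case: every thread loads the four elements of one set of P^2_{p,s} into registers, performs the compare/exchange operations of two consecutive steps (distances 2h and h), and writes the elements back, so each element touches global memory only once for the two steps. The authors' implementation additionally uses shared memory and degrees up to 4, and the host code must fall back to a single-step kernel when an odd number of steps remains in a phase.

// Hedged sketch of a degree-2 kernel (our own illustration).  blockLen = 2^p,
// and the two steps processed have distances 2*h and h (so the step index s
// satisfies s >= 2).  n must be a power of two.
__global__ void bitonicTwoStepsKernel(float *a, unsigned int n,
                                      unsigned int blockLen, unsigned int h)
{
    unsigned int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n / 4) return;
    unsigned int base = (t / h) * (4 * h) + (t % h);   // bits h and 2h of base are zero
    bool asc = ((base & blockLen) == 0);                // same direction for all 4 elements

    float v0 = a[base],         v1 = a[base + h];
    float v2 = a[base + 2 * h], v3 = a[base + 3 * h];
    // step with distance 2*h: pairs (v0, v2) and (v1, v3)
    if ((v0 > v2) == asc) { float tmp = v0; v0 = v2; v2 = tmp; }
    if ((v1 > v3) == asc) { float tmp = v1; v1 = v3; v3 = tmp; }
    // following step with distance h: pairs (v0, v1) and (v2, v3)
    if ((v0 > v1) == asc) { float tmp = v0; v0 = v1; v1 = tmp; }
    if ((v2 > v3) == asc) { float tmp = v2; v2 = v3; v3 = tmp; }
    a[base] = v0;         a[base + h] = v1;
    a[base + 2 * h] = v2; a[base + 3 * h] = v3;
}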
3.3 Bitonic Sort for Arrays of Arbitrary Size: Virtual Padding
Bitonic sort as introduced by Batcher can only sort sequences whose length is a power of 2. A trivial way to sort sequences of arbitrary length ascending (analogously for descending) is to pad the sequence with max-values. Since Batcher's bitonic sort sorts subsequences in alternating directions, these elements would be moved while sorting the sequence and therefore have to exist physically. Thus an input sequence of length 2^k + 1 would result in sorting a sequence of length 2^{k+1}. Every phase of bitonic sort sorts bitonic sequences; as mentioned in section 2 it makes no difference if the subsequences are sorted ascending/descending or vice versa. Thus bitonic sort can be modified in such a way that in each phase the sorting direction of every subsequence containing normal values and max-values (actually there is only one such subsequence in each phase) is ascending. In this modified variant of bitonic sort max-values are never moved and thus do not have to exist physically. Using this modification no additional memory is needed. Figure 3 shows absolute runtimes of our bitonic sort. Of course bitonic sort performs best for input sizes being a power of 2. Nevertheless there is only a small negative impact when sorting sequences whose length is not a power of 2.
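One possible realization of this virtual padding at the level of a single compare/exchange is sketched below; the guard against indices beyond n and the forced ascending direction are our reading of the description above, not the authors' code.

// Hedged sketch: compare/exchange with virtual max-value padding.  Elements
// with index >= n are treated as +infinity and are never stored.  Because the
// modified algorithm sorts every subsequence that contains virtual elements in
// ascending order, a virtual partner never has to move and the operation can
// simply be skipped (the real element already lies in the correct half).
__device__ void coexVirtual(float *a, unsigned int i, unsigned int j,
                            bool ascending, unsigned int n)
{
    if (j >= n) return;          // partner is virtual: nothing to exchange
    float x = a[i], y = a[j];    // both indices real: ordinary compare/exchange
    if ((x > y) == ascending) { a[i] = y; a[j] = x; }
}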
4 Evaluation
We compared our algorithm to the bitonic-sort-based GPUSort [4], the radix sort introduced in the particle demo in the CUDA SDK [8], the quicksort implementation from Cederman et al. [11], both radix sort algorithms in the CUDPP
Fig. 4. Sorting rate (y-axis) for sorting arrays of size n (x-axis): (a) sorting integers (legend: CUDPP Radixsort (Merge), BitonicSort (optimized), RadixSort (Satish et al.), GPU Quicksort, CUDPP Radixsort, BitonicSort); (b) sorting (int,int) pairs (legend: BitonicSort (optimized), GPU Gems Radixsort, RadixSort (Satish et al.), BitonicSort); (c) sorting (float,int) pairs (legend: BitonicSort (optimized), RadixSort (Satish et al.), GPUSort, BitonicSort)
library [7], the recently released radix sort algorithm from Satish et al. [13] and to our trivial implementation of bitonic sort. All tests were done on an Intel Core2 Duo 1.86 GHz machine with NVIDIA's GTX280 GPU. We performed 3 different tests: sorting sequences of (a) integers, (b) (int, int) key-value pairs and (c) (int, float) key-value pairs (figure 4). The sequences were randomly generated with lengths ranging from 2^10 to 2^24 elements. Not all of the algorithms were able to sort negative floats and integers, so we used non-negative floats and integers. We consider sorting to be just one part of a larger computation on the GPU, thus we measured the GPU runtime only, not including the time for memory transfer between CPU and GPU. In figure 4 we show the sorting rate, i.e. the number of elements divided by the runtime. The sorting rate of our bitonic sort is low for small sequences due to a constant start-up delay and under-utilization of the GPU. But it rises with a growing number of elements, reaching its peak performance for sequences of length 2^19. The sorting rate declines for larger sequences due to the O(n · log² n) complexity of bitonic sort. While the trivial implementation of bitonic sort performs approximately like GPUSort, the results show the much better performance of the optimized implementation. Our optimized bitonic sort is faster than all other sorting algorithms, except for the radix sort from Satish et al. [13], which is the fastest algorithm for sequences larger than 2^16, but neither comparison-based nor in-place. In particular, bitonic sort is faster than the comparison-based GPUSort and GPU Quicksort. There is no source code available for the merge sort from Satish et al. [13], but comparing the sorting rate given in [13] to the sorting rate of bitonic sort, bitonic sort seems to perform slightly better.
5 Conclusion
We have presented an efficient implementation of bitonic sort for NVIDIA's GPUs. To achieve this we carefully optimized our implementation with respect to
the number of accesses to global memory. It is the first efficient comparison-based in-place implementation of a sorting algorithm for CUDA that we found in the literature. Although bitonic sort suffers from a complexity of O(n · log² n), the results of our tests show that it is highly competitive with all other comparison-based GPU sorting algorithms, and also with most non-comparison-based algorithms. In particular, the performance of our bitonic sort seems to be better than that of the out-of-place merge sort implemented by Satish et al., which is the fastest comparison-based sorting algorithm for GPUs reported in the literature so far.
References
1. Kapasi, U.J., Dally, W.J., Rixner, S., Mattson, P.R., Owens, J.D., Khailany, B.: Efficient conditional operations for data-parallel architectures. In: MICRO 33: Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, pp. 159–170. ACM, New York (2000)
2. Purcell, T.J., Donner, C., Cammarano, M., Jensen, H.W., Hanrahan, P.: Photon mapping on programmable graphics hardware. In: HWWS 2003: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, Aire-la-Ville, Switzerland, pp. 41–50. Eurographics Association (2003)
3. Kipfer, P., Segal, M., Westermann, R.: UberFlow: a GPU-based particle engine. In: HWWS 2004: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pp. 115–122. ACM, New York (2004)
4. Govindaraju, N., Raghuvanshi, N., Henson, M., Manocha, D.: A cache-efficient sorting algorithm for database and data mining computations using graphics processors. Technical report, University of North Carolina-Chapel Hill (2005)
5. Greß, A., Zachmann, G.: GPU-ABiSort: optimal parallel sorting on stream architectures. In: 20th International Parallel and Distributed Processing Symposium, IPDPS 2006 (2006)
6. Batcher, K.: Sorting networks and their applications. In: AFIPS Spring Joint Comput. Conf. (1967)
7. Harris, M., Sengupta, S., Owens, J.D.: Parallel prefix sum (scan) with CUDA. In: GPU Gems 3. Addison-Wesley, Reading (2007)
8. Le Grand, S.: Broad-phase collision detection with CUDA. In: GPU Gems 3. Addison-Wesley, Reading (2007)
9. He, B., Govindaraju, N.K., Luo, Q., Smith, B.: Efficient gather and scatter operations on graphics processors. In: SC 2007: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pp. 1–12. ACM, New York (2007)
10. Sengupta, S., Harris, M., Zhang, Y., Owens, J.D.: Scan primitives for GPU computing. In: GH 2007: Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, Aire-la-Ville, Switzerland, pp. 97–106. Eurographics Association (2007)
11. Cederman, D., Tsigas, P.: A practical quicksort algorithm for graphics processors. In: Halperin, D., Mehlhorn, K. (eds.) ESA 2008. LNCS, vol. 5193, pp. 246–258. Springer, Heidelberg (2008)
12. Sintorn, E., Assarsson, U.: Fast parallel GPU-sorting using a hybrid algorithm. J. Parallel Distrib. Comput. 68, 1381–1388. Academic Press, Inc., Orlando, FL, USA (2008)
13. Satish, N., Harris, M., Garland, M.: Designing efficient sorting algorithms for manycore GPUs. In: Proceedings 23rd IEEE International Parallel and Distributed Processing Symposium (2009)
Finite Element Numerical Integration on GPUs
Przemyslaw Plaszewski1, Pawel Maciol2, and Krzysztof Banaś1,2
1 Department of Applied Computer Science and Modelling, AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków, Poland
2 Institute of Computer Modelling, Cracow University of Technology, Warszawska 24, 31-155 Kraków, Poland
Abstract. We investigate the opportunities for using GPUs to perform numerical integration for finite element simulations. The degree of concurrency and memory usage patterns are analyzed for different types of finite element approximations. The results of numerical experiments designed to test execution efficiency on GPUs are presented. We draw some conclusions concerning advantages and disadvantages of off-loading numerical integration to GPUs for finite element calculations.
1 Numerical Integration in Finite Element Simulations
Numerical integration is one of the necessary steps in finite element calculations. The terms from a weak statement of a problem are integrated to form entries in the global stiffness matrix. Then the system of linear equations, with the stiffness matrix as the system matrix, is usually solved, leading to the approximate solution. On the one hand, numerical integration is an embarrassingly parallel algorithm: integrals over the computational domain are sums of integrals over individual elements. Calculations for different elements are independent and can be done in parallel. Moreover, for many problems, e.g. linear problems with constant coefficients and linear approximation, numerical integration can be simplified and some precomputed quantities used in codes. There are, however, problems and approximations (e.g. non-linear problems, problems in domains with curved boundaries, higher order or hp-adaptive approximations) where numerical integration has to be performed step by step. These are usually also the cases in which numerical integration forms a substantial part of finite element calculations, both in terms of required CPU time and RAM. We concentrate on such cases, using a model problem of linear elasticity with higher order approximation (for applying GPUs to finite element calculations in different contexts see e.g. [1]).
2 Model Problem
The differential formulation of our model problem of linear elasticity for the vector of displacements u = (u_1, ..., u_d), d = 2, is:

  ∂/∂x_j ( E_{ijkl} ∂u_k/∂x_l ) = 0,   i, j, k, l = 1, .., d    (1)
where E_{ijkl} are elasticity coefficients and the summation convention for repeated indices is used. Particular forms of the matrices E depend upon the materials considered. In our model problem the number of equations in the system and the number of components of the unknown function u, denoted in general by N_eq, is equal to the number of space dimensions d. Standard finite element procedures consist of obtaining a weak finite element statement, discretizing the computational domain Ω into finite elements and using proper basis functions φ_r, r = 1, .., N (the same for all components of u), constructed from element shape functions, to express the unknown approximating functions u_h as linear combinations:

  u_k^h = U_k^r φ_r    (2)
The resulting final statement that forms a basis for computational procedures is: Find a set of coefficients U_k^r such that the following set of equations is satisfied for all s = 1, .., N and i = 1, .., d:

  ∫_Ω U_k^r E_{ijkl} (∂φ_r/∂x_l)(∂φ_s/∂x_j) dΩ + BT = 0    (3)

where N · d is the total number of unknowns in the system and BT denotes terms corresponding to the boundary conditions. Numerical integration of these terms is much less demanding than integration of the domain terms, hence we do not discuss them further.
3 Numerical Integration Algorithm
Calculation of integrals from a weak statement in finite element codes is usually performed using reference elements and some form of Gauss quadratures. For domain integrals, the whole computational domain, as the domain of integration, is divided into finite elements of one or several types: triangles, quadrilaterals, tetrahedra, prisms, hexahedra, etc. The entries in the global stiffness matrix are usually computed in a loop over all elements. For each element an element stiffness matrix is created and its entries are added to the corresponding entries in the global stiffness matrix in a separate step of aggregation that we do not discuss here. Reference elements are used to transform integrals over all elements of a particular type to integrals over a single element. For our model problem we will consider, as a reference element, a 2D rectangle [−1, 1]×[−1, 1], with coordinates ξ_1 and ξ_2, which is one of the typical reference elements used in FEM calculations. Any integral over a real, physical element Ω_e is transformed to the corresponding integral over the reference element Ω̂_e using the change of variables. For our model problem this leads to:

  ∫_{Ω_e} E_{ijkl} (∂φ_r/∂x_l)(∂φ_s/∂x_j) dΩ = ∫_{Ω̂_e} E_{ijkl} (∂φ̂_r/∂ξ · ∂ξ/∂x_l)(∂φ̂_s/∂ξ · ∂ξ/∂x_j) det J_{T_e} dΩ̂    (4)
where J_{T_e} denotes the Jacobian matrix of the transformation T_e from the reference element Ω̂_e to the real element Ω_e, and ∂ξ/∂x_i forms a column of the Jacobian matrix J_{T_e^{-1}} = ∂ξ/∂x, which corresponds to the reverse transformation T_e^{-1}. Functions φ̂_r, the element shape functions, are defined over a reference element and are used to create the basis functions φ_r. Although the indices r and s in (4) correspond to all basis functions in the whole domain Ω, r, s = 1, .., N, the value of (4) equals zero for almost all pairs r-s. The only remaining non-zero integrals are the integrals for pairs of basis functions that both do not vanish within the element Ω_e. These correspond to pairs of shape functions defined for the reference element Ω̂_e. Integrals computed for such pairs form entries of a local, element stiffness matrix. Its size is determined by the number of shape functions, N_sh, for the reference element with a specified order of approximation p. Application of a numerical quadrature transforms the right hand side of (4) into a summation over integration points ξ_I, I = 1, .., N_I:

  Σ_{I=1}^{N_I} E_{ijkl} (∂φ̂_r/∂ξ · ∂ξ/∂x_l)(∂φ̂_s/∂ξ · ∂ξ/∂x_j) det J_{T_e} w_I    (5)
All values in (5) are computed at integration points ξ_I (or, e.g. in the case of coefficients E_{ijkl}, at points x(ξ_I) obtained in real elements through T_e); w_I denotes the weight associated with a given point of a particular quadrature.
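The structure of (5) translates into a simple loop nest per element. The sketch below is only an illustration: for brevity it integrates a scalar Laplace-type weak form (gradients of shape functions) instead of the full elasticity tensor E_{ijkl}, and shapeFunGrad() is a hypothetical helper returning the reference-element derivatives of shape function r at a given point. The loop over Gauss points, the use of the inverse Jacobian and the accumulation into the local stiffness matrix follow the algorithm described above.

const int DIM = 2;

// assumed provided: derivatives of reference shape function r at point xi
void shapeFunGrad(int r, const double xi[DIM], double dphi_dxi[DIM]);

// Hedged sketch of the per-element integration loop of Eq. (5), scalar case.
void integrateElement(int nShape, int nGauss,
                      const double xiq[][DIM], const double *wq,
                      const double Jinv[][DIM][DIM],  // dxi/dx at each Gauss point
                      const double *detJ,             // det J_Te at each Gauss point
                      double *Ke)                     // nShape x nShape, zero on entry
{
    for (int q = 0; q < nGauss; ++q) {
        double gradx[64][DIM];                        // derivatives wrt real coordinates
        for (int r = 0; r < nShape; ++r) {            // (assumes nShape <= 64)
            double dxi[DIM];
            shapeFunGrad(r, xiq[q], dxi);
            for (int i = 0; i < DIM; ++i) {           // dphi/dx_i = sum_k dphi/dxi_k * dxi_k/dx_i
                gradx[r][i] = 0.0;
                for (int k = 0; k < DIM; ++k)
                    gradx[r][i] += dxi[k] * Jinv[q][k][i];
            }
        }
        for (int r = 0; r < nShape; ++r)              // accumulate local stiffness entries
            for (int s = 0; s < nShape; ++s) {
                double v = 0.0;
                for (int i = 0; i < DIM; ++i)
                    v += gradx[r][i] * gradx[s][i];
                Ke[r * nShape + s] += v * detJ[q] * wq[q];
            }
    }
}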
4 Computational and Memory Complexity of Numerical Integration
From the computational point of view, finite element numerical integration consists of independent computations of element integrals. Hence, the computational complexity grows linearly with the number of elements. The natural way for parallel execution is to perform calculations for a single element by a single processor core. The number of elements in finite element meshes is almost always much larger than the number of cores on which calculations are done. As long as we can supply a particular core with integration data at sufficient speed, we get a theoretically perfectly scalable algorithm. To assess this we calculate the computation to communication ratio of the algorithm.
4.1 Data Input and Output
To perform integration for a single element we need the following data:
– geometrical data necessary to compute the Jacobian matrices J_{T_e} and J_{T_e^{-1}} (for linear elements these are usually the coordinates of element vertices)
– PDE coefficients (in the case of our model problem, the parameters necessary to compute the non-zero entries E_{ijkl})
– Gauss quadrature data (coordinates of points and the associated weights)
The size of the input data depends on two parameters: the degree of the polynomials used to describe the geometry of elements and the degree of the polynomials used for approximation. For geometrically linear elements, to compute J_{T_e} and J_{T_e^{-1}} we need only the coordinates of element vertices. Moreover, the matrices are constant over the whole element and can be computed once for each element. For elements with curved boundaries the number of parameters grows depending on the degree of the polynomial describing the geometry, the way the geometry is specified and the space dimension. For our model case we chose a relatively simple case, but one requiring calculation of Jacobian matrices at each integration point. This case is a geometrically bi-quadratic rectangular element with 16 parameters. The reason for this choice is that computing the matrices J_{T_e} and J_{T_e^{-1}} is practically never done in parallel (inverting a small 2 × 2 or 3 × 3 matrix is inherently sequential). We want to estimate whether, given this constraint, massively parallel GPU calculations can still be competitive with standard CPU execution. The number of PDE coefficients depends on the problem solved. For non-linear problems it is possible that the PDE coefficients are not constant and have to be computed at each integration point using some other data (like e.g. the values of the solution at previous iterations). We assume for the model problem that the PDE coefficients, being material data, can be different for each element, but are constant over a single element. We used only three parameters: Young modulus, Poisson ratio and thickness. Hence, our assumption is that the two kinds of data described above, geometry and PDE parameters, can be different for each element. However, for most practical calculations the number of these parameters is not large – usually several dozen. For our model case this number was equal to 23. The last type of input data, quadrature coefficients, is the same for all elements of the same type and the same degree of approximation. If the calculations are organized in such a way that elements are grouped according to type and degree of approximation, we can assume that quadrature data are loaded once for the whole group of elements and, hence, for a given element are available in some fast, local memory. The number of quadrature points, N_I, depends on the degree of approximation, p, and the type of elements. For typical finite element formulations, with products of shape functions and their derivatives in integrals, to integrate with the accuracy sufficient for the convergence of the approximate solution to the exact solution, it is usually required that the number of integration points is of order p^d. In our model case N_I was equal to (p + 2)² in order to perform exact integration for bi-quadratic elements (and for each point three numbers, two coordinates and one weight, were used). The output from numerical integration has the form of an element stiffness matrix of size N_loc = N_sh · N_eq. The number of shape functions depends on the degree of approximation and the type of approximation. For quadrilateral elements it can be equal to (p + 1)² (tensor product shape functions) or 4 + 4(p − 1)_+ + (p − 2)_+(p − 3)_+/2 (reduced space of shape functions, where the + sign denotes that
the term is taken into account only if all factors are positive). We used the latter option. We consider two cases: symmetric stiffness matrices (for standard linear materials) and non-symmetric matrices (e.g. for anisotropic materials). In the first case the number of computed entries of the local stiffness matrix is equal to N_loc(N_loc + 1)/2 and in the latter to N_loc². The amount of data produced as the output from numerical integration is usually much larger than the input data. It is of order N_eq²(p + 1)^{2d}. The second line in Table 1 shows the size of the output data (the number of entries in the local element stiffness matrix) for different orders of approximation in our model test case.
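These counts can be reproduced with a few lines of integer arithmetic; the helpers below are only our own illustration of the formulas quoted above (with the positive-part convention) and give, e.g., N_sh = 12, 23 and 38 for p = 3, 5 and 7, and 36 symmetric entries for p = 1 with N_eq = 2, matching Table 1.

int positivePart(int x) { return x > 0 ? x : 0; }

// Reduced space of shape functions on a quadrilateral:
//   N_sh(p) = 4 + 4(p-1)_+ + (p-2)_+ (p-3)_+ / 2
int numShapeFunctionsReduced(int p)
{
    return 4 + 4 * positivePart(p - 1)
             + positivePart(p - 2) * positivePart(p - 3) / 2;
}

int numLocalDofs(int p, int neq) { return numShapeFunctionsReduced(p) * neq; }  // N_loc
int numGaussPoints(int p)        { return (p + 2) * (p + 2); }                   // N_I in the model case

int numSymmetricEntries(int p, int neq)                                          // N_loc (N_loc + 1) / 2
{
    int nloc = numLocalDofs(p, neq);
    return nloc * (nloc + 1) / 2;
}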
4.2 Computational Complexity
Taken in its most generic form, numerical integration for a single element is usually performed in a loop over integration points, with calculations for an integration point done in three steps:
– first, all shape functions and their derivatives necessary to compute the integrals are calculated
– then, the Jacobian matrices J_{T_e} and J_{T_e^{-1}} are calculated
– finally, entries in the element stiffness matrix are obtained, in the non-symmetric case using a double loop over shape functions with a double loop over solution components for each combination of shape functions.
The two first steps are computationally much less demanding than the last one, hence the computational complexity depends on the last step. In general, for non-symmetric matrices the complexity is of order N_I N_eq² N_sh², which for tensor product shape functions gives N_eq²(p + 1)^{3d}. Therefore the ratio of arithmetic to memory operations grows at least with the order (p + 1)^d. The third line in Table 1 presents the approximate number of operations in numerical integration for our model test case (with symmetric matrices) as a function of the order of approximation p. The fourth line shows the ratio of arithmetic operations to memory references.
5 Standard Implementation of Numerical Integration
Standard implementation of the algorithm described above for classical processors with powerful cores and large L2 caches is relatively straightforward. One can be sure that the input and output data for a single element fit into the cache, hence computations are fast. Parallelization based on domain decomposition, with a single core performing calculations for a group of elements, is the simplest and also the most efficient way of doing integration for the whole computational domain. Hence the algorithm is scalable both with respect to the number of elements and the number of cores. Optimizations usually concern single-thread calculations.
For the non-symmetric case, the operations consist of the proper loops over integration points, shape functions, space dimensions (to take into account the different derivatives of shape functions) and solution components. Possible optimizations include L1 cache blocking in loops over shape functions and register blocking combined with loop unrolling. For the symmetric case with a standard linear Hooke's material, instead of a double loop over solution components, approximately 40 operations are performed for each combination of shape functions (depending upon the optimizations utilized). This implementation was the basis for counting the numbers of arithmetic operations for Table 1.
6 Mapping of the Algorithm to Tesla GPU Resources
We consider numerical integration for the model test case (2D elasticity, N_eq = 2, with geometrically second order quadrilateral – 8 node – elements and shape functions being tensor products of 1D hierarchical Lobatto polynomials [2]). As the architecture for implementing the algorithm of finite element numerical integration we consider the Tesla architecture from NVIDIA [3] and as the programming environment the CUDA model [4]. Once again domain decomposition is employed as the highest level of parallelization. The memory of a Tesla device is used for storing input and output data for numerical integration. If the data concerning all elements from the computational domain are too large to fit into the memory of the device, the domain is split into large subdomains and the subdomains are processed independently (possibly in parallel if there are more Tesla devices in the system). Then, for a single device, elements are assigned to different multiprocessors. Hence a single multiprocessor performs numerical integration of several elements. The memory resources available for a single multiprocessor are sufficient to compute the stiffness matrix entries of a single element for orders of approximation up to 7 (the limit used in our study). Parallelization of computations in the CUDA model is based on splitting operations into threads, grouped into blocks. Further, blocks are grouped into a grid. After launching the execution of the GPU code all threads in a grid of blocks of threads are scheduled for running. Each thread executes the same GPU code in a SIMD fashion. Based on the block number and the thread number, different paths of execution can be selected for different threads, allowing for fully parallel execution, as well as execution with certain threads idle or serialized execution with only one thread performing calculations. In our parallelization strategy we assume that integration of a single element is performed by a single block of threads. We use a one-dimensional grid of blocks, with the dimension equal to the number of elements assigned to the computing device. Of the loops performed in each element, the loop over integration points poses important barriers to efficient parallelization. Values computed at different integration points contribute to the same entries of the local, element stiffness matrix, thus forming a source of possible data races.
In contrast, the double loop over shape functions does not lead to similar difficulties. For each pair of shape functions the result is placed in a different position of the final element matrix. We parallelize these loops and, hence, assume that a single thread computes several entries in the stiffness matrix, each entry computed in the loop over integration points for a specific pair of shape functions. According to the CUDA model the number of threads in a block should be, for performance reasons, a multiple of 32. If the number of entries in the stiffness matrix is not a multiple of 32, we create more threads, some of which are idle during the execution of the algorithm. This unfortunately leads to drops in performance. Still, the number of threads in the block can vary. We can create more threads with a small number of computed stiffness matrix entries per thread, or fewer threads with as many entries per thread as possible. In choosing the number of threads we take into account the specific character of execution in the CUDA model. The number of threads in a block is a multiple of 32 (a group of 32 threads forms a warp). Warps are executed concurrently. A single multiprocessor can execute concurrently up to 24 or 32 (depending on the device) warps belonging to up to 8 blocks. The number of concurrently executed blocks, active blocks, should be maximized for better performance. Hence the number of warps in a block should be minimized. On the other hand, the number of threads in a block should, again for performance reasons, equal at least 64 [4], and in particular for our algorithm, too small a number of warps in a block can lead to many threads staying idle. Hence, for each order of approximation p, we performed several computational experiments to specify the optimal number of threads in a block. Apart from performance considerations we had to take into account the following constraints:
– the number of threads in a single block cannot exceed 512
– the number of utilized registers for active threads cannot exceed 8192 (we used CUDA devices with 1.0 compute capability)
Another aspect of computations on CUDA devices that is important from the performance point of view is the usage of different kinds of memory. The DRAM memory of the device is the slowest available memory. Hence the amount of data sent to and from the device memory is limited to a necessary minimum. As was already mentioned, for a single element the input data consist of 23 numbers and the output data consist of all computed local stiffness matrix entries. Moreover, in order not to serialize the accesses to the memory, reads and writes have to be performed in a specific, coalesced way [4], with the numbers of array entries being multiples of 32. Hence we pad all the input and output arrays and perform only coalesced memory accesses. During actual calculations each thread uses only fast shared memory, once again accessed in an optimal way. Since we want to maximize the number of active blocks we try to minimize the memory occupancy – the size of shared memory used and the number of utilized registers. As a further optimization we
use the constant memory of the CUDA device for storing the data concerning Gaussian integration points and weights. This read-only memory is cached on the GPU and serves well the purpose of supplying the multiprocessors with the data that is the same for all elements of a given order p. The actual computations are performed as follows:
1. Gauss points' data are computed on the CPU and copied to constant memory (for the tests it is assumed that all elements are of the same type and the same order of approximation, so the set of Gauss points is the same for all elements)
2. Storage space for element input data and element stiffness matrices as output data is allocated
3. Input data are copied to the device memory
4. The kernel program is launched
5. Output data are copied from the device memory.
To maximize the speed of data transfer between the host (CPU) memory and the device memory, all transferred data are placed in contiguous memory locations and copied using a single call to suitable procedures. Kernel calculations consist of:
– copying element data from device memory to shared memory
– a loop over all integration points, where for each point:
• the values of all shape functions, their derivatives with respect to the local coordinates, the Jacobian matrices and the values of the PDE coefficients are computed sequentially (by a single thread)
• all the entries of the local stiffness matrix are calculated by all working(!) threads in parallel
– copying the set of all element stiffness matrix entries from the shared memory to the device memory.
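This description can be condensed into a kernel skeleton like the one below. All names are our own and the problem-specific parts (shape function evaluation, Jacobian computation, the actual stiffness entry formula) are represented by placeholder device functions, so this is a hedged sketch of the structure, not the authors' kernel.

#define MAX_GAUSS      81     // (p + 2)^2 for p <= 7
#define ELEM_DATA_SIZE 23     // geometry + material parameters per element (padded input)

__constant__ float c_gauss[MAX_GAUSS * 3];                    // xi1, xi2, weight per point

// assumed provided (problem-specific): Jacobian data at one Gauss point and the
// contribution of that point to one stiffness-matrix entry
__device__ void evalGeometryAtPoint(const float *elem, const float *gp, float *jac);
__device__ float entryContribution(int entry, const float *jac, const float *gp);

__global__ void integrateElementsKernel(const float *elemData, float *stiffness,
                                        int nGauss, int nEntries, int nEntriesPadded)
{
    extern __shared__ float s_mem[];                 // element data | Jacobian data | SM entries
    float *s_elem  = s_mem;
    float *s_jac   = s_elem + ELEM_DATA_SIZE;
    float *s_stiff = s_jac + 16;

    const float *myElem = elemData + blockIdx.x * ELEM_DATA_SIZE;   // one block per element
    if (threadIdx.x < ELEM_DATA_SIZE) s_elem[threadIdx.x] = myElem[threadIdx.x];
    for (int e = threadIdx.x; e < nEntriesPadded; e += blockDim.x) s_stiff[e] = 0.0f;
    __syncthreads();

    for (int q = 0; q < nGauss; ++q) {
        const float *gp = &c_gauss[3 * q];
        if (threadIdx.x == 0)                        // sequential part: Jacobian at this point
            evalGeometryAtPoint(s_elem, gp, s_jac);
        __syncthreads();
        for (int e = threadIdx.x; e < nEntries; e += blockDim.x)    // parallel part
            s_stiff[e] += entryContribution(e, s_jac, gp);
        __syncthreads();
    }
    float *out = stiffness + blockIdx.x * nEntriesPadded;           // coalesced write-back
    for (int e = threadIdx.x; e < nEntriesPadded; e += blockDim.x)
        out[e] = s_stiff[e];
}

// launch: one block per element, shared memory sized for the padded stiffness matrix
// integrateElementsKernel<<<nElements, threadsPerBlock,
//                           (ELEM_DATA_SIZE + 16 + nEntriesPadded) * sizeof(float)>>>(...);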
7 Numerical Tests
Computational tests of the numerical integration algorithm for our model test case were performed on two platforms: an NVIDIA GeForce 8800GTX graphics card (with CUDA SDK 1.1) and a single core of an AMD X2 processor (with a 2.6 GHz clock and 1 MB L2 cache). Both the GPU and the CPU performed single precision computations. The results are presented for a single core of the standard processor and a single multiprocessor of the graphics card. Since the execution was (in practice and according to the theory) linearly scalable with respect to the number of elements (for a sufficiently large number of elements, i.e. several thousand) and the number of cores or multiprocessors, the data from the presented tables can easily be extrapolated to an arbitrary number of elements and processor cores or graphics card multiprocessors. The characteristics of GPU execution for different orders of approximation are summarized in Table 1. The most important parameters determining the
Table 1. The characteristics of GPU calculations for the numerical integration of element stiffness matrices (SM) for the model problems with different orders of approximation p
                                          p=1     p=2     p=3      p=4      p=5      p=6      p=7
The number of shape functions               4       8      12       17       23       30       38
The number of SM entries                   36     136     300      595     1081     1830     2926
The number of arithmetic operations      8936   71496  2.7·10^5  8.5·10^5  2.2·10^6  5.1·10^6  1.1·10^7
The ratio of operations to references    ≈248    ≈526    ≈923    ≈1434    ≈2060    ≈2800    ≈3654
The number of threads in a block           64     192      64      128      192      384      448
The number of working threads              36     136      60      119      181      366      418
The number of SM entries per thread         1       1       5        5        6        5        7
The size of padded stiffness matrix        64     192     320      640     1152     1920     3136
The number of registers per thread         12      13      18       19       20       19       18
Shared memory per block (in kB)          0.56    1.21    1.85     3.29     5.53     8.82    13.94
The number of active blocks                 8       3       7        3        2        1        1
splitting of computations among threads (the size of the local stiffness matrix, the number of its entries per thread), as well as mapping the computations to the GPU resources (the number of threads in a block, including idle threads; the size of the table for stiffness matrix entries, with padding taken into account; the number of registers and the size of shared memory per thread; the number of active blocks) are presented. Table 2 presents performance metrics for both types of processors. The first lines compare CPU and GPU time for numerical integration of a single element, using one core of the CPU and one multiprocessor of the GPU (the times for the GPU include the times for transferring the input data from the computer main memory to the device memory and the output data in the reverse direction, amortized over a large number of elements). For a large number of elements the execution time will be equal to the number of elements times the time presented in the table divided by the actual number of cores or multiprocessors. Based on such estimates, the speed-up for using a graphics card with 16 multiprocessors versus a standard processor with 2 cores is presented in the table.
Table 2. Execution times in milliseconds, performance in GFlops and speed-up of GPU versus CPU for the numerical integration algorithm applied to the model problem with different orders of approximation p, for a GeForce 8800GTX GPU and an AMD X2 CPU
                                          p=1     p=2     p=3     p=4     p=5     p=6     p=7
CPU core time per element               0.003   0.014   0.045   0.120   0.288   0.625   1.238
GPU multiprocessor time per element     0.016   0.055   0.150   0.389   0.996   2.576   6.091
CPU core GFlops                          2.71    4.93    5.88    7.03    7.63    8.15    8.88
GPU multiprocessor GFlops                0.54    1.28    1.80    2.18    2.21    1.98    1.81
Speed-up (2 cores, 16 multiprocessors)   1.61    2.07    2.44    2.48    2.31    1.94    1.62
The algorithm turned out to be difficult to optimize. The calculations for symmetric matrices are not regular: in order to group threads into warps with optimal memory accesses we had to store the resulting element stiffness matrices in one-dimensional arrays. Each thread had to compute, using floor and square root operations, the positions of the entries it computes. For low order polynomials as shape functions, the degree of parallelism for a single element was low. For higher orders of approximation, despite the potential for concurrency due to the large number of local shape functions, the amount of required memory resources was so high as to limit the number of active blocks to one, and thus to reduce the overall performance of the algorithm (for newer devices the resources available are larger, so we hope to get better performance for them). Moreover, the scarcity of resources (especially registers) did not allow for single thread optimizations.
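One common mapping of this kind is shown below: the lower triangle of the symmetric N_loc × N_loc matrix is packed row by row, so entry k corresponds to the pair (r, s) with k = r(r+1)/2 + s and s ≤ r, and the inverse mapping needs one square root and one floor per thread. The exact packing order of the authors' code is not stated, so this is an assumption for illustration.

// Hedged sketch: recover (r, s) from the linear index k of a row-wise packed
// lower triangle, k = r(r+1)/2 + s with s <= r.
__device__ void unpackSymmetricIndex(int k, int *r, int *s)
{
    // 8k+1 is a perfect square exactly when s == 0, and for the matrix sizes
    // considered here (N_loc <= 76) single precision is accurate enough.
    int rr = (int)floorf((sqrtf(8.0f * (float)k + 1.0f) - 1.0f) * 0.5f);
    int ss = k - rr * (rr + 1) / 2;
    if (ss > rr) { ++rr; ss = k - rr * (rr + 1) / 2; }   // guard against rounding down
    *r = rr;
    *s = ss;
}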
8 Conclusion
The results of the performed experiments show that GPUs can perform numerical integration at least as fast as CPUs (the exact comparison depends on the number of cores in the CPUs and the number of multiprocessors in an accelerator with GPUs). When considering the application of GPUs for numerical integration one should also take into account that while the GPU is calculating domain integrals, the CPU can do some other useful work, e.g. computing boundary integrals or domain integrals for elements other than those assigned to the GPUs. Hence, using GPUs for numerical integration in finite element codes should always bring some benefits, which for some ranges of approximation orders p may be substantial. Acknowledgements. The support of this work by the Polish Ministry of Science and Higher Education under grant no. N N501 120836 is gratefully acknowledged.
References
1. Göddeke, D., Wobker, H., Strzodka, R., Mohd-Yusof, J., McCormick, P., Turek, S.: Co-processor acceleration of an unmodified parallel solid mechanics code with FEASTGPU. International Journal of Computational Science and Engineering, IJCSE (2009) (to appear)
2. Solin, P., Segeth, K., Dolezel, I.: Higher-Order Finite Element Methods. Chapman & Hall/CRC (2003)
3. Lindholm, E., Nickolls, J., Oberman, S., Montrym, J.: NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro 28, 39–55 (2008)
4. NVIDIA CUDA Programming Guide 2.2.1. NVIDIA (2009)
Modeling and Optimizing the Power Performance of Large Matrices Multiplication on Multi-core and GPU Platform with CUDA
Da Qi Ren and Reiji Suda
Department of Computer Science, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
JST, CREST, Japan
Abstract. The power efficiency of large-scale computing on multiprocessing systems is an important issue that is interrelated with both the hardware architecture and the software methodology. Aiming to design power-efficient high performance programs, we have measured the power consumption of large matrices multiplication on a multi-core and GPU platform. Based on the obtained power characteristic values of each computing component, we abstract the energy estimations by incorporating physical power constraints of the hardware devices and an analysis of the program execution behavior. We optimize the matrices multiplication algorithm in order to improve its power performance, and the efficiency improvement has finally been validated by measuring the program execution.
1 Introduction
Power dissipation is one of the most prominent factors limiting the development of HPC, because a large scale computation may need hundreds of hours of continuous execution, which results in an enormous energy demand. The Graphics Processing Unit (GPU) is now considered a serious challenger for HPC solutions because of its suitability for massively parallel processing and vector computation; its power consumption is also significant (it has reached 300 W) and will still increase in the future. Toward power-efficient computing on CPU-GPU computers, we investigate software methodologies for optimizing power utilization through algorithm design and programming techniques. In this paper, we measure and model the power consumption of large matrices multiplication on a multi-core and GPU platform. We also adopt some popular CUDA design considerations, such as saving CPU and GPU communication bandwidth, for the purpose of power efficient program design.
2 Power Breakdown
An ATX [1] power supply unit (PSU) converts general purpose AC power to low voltage DC power and outputs different power supplies that are interchangeable
with different components inside the multi-core GPU computer. As specified in [1], a PSU needs to be at least 70% efficient under full load, which means that the remaining roughly 30% of the input power will be dissipated as heat.
2.1 Power Measurement
CPU power measurement. The Intel Core i7 CPU [2] sits in an LGA1366 socket [3]; measuring its power consumption precisely is difficult, because the power consumed by the voltage regulation circuitry that converts +12V from the PSU into the CPU working voltage is unknown. We were unable to isolate that power; such an unidentified portion is therefore included in the measured power utilization of the CPU. Therefore, as an approximation, we measure the CPU input current and voltage at the 8-pin power connector [3]; on most modern boards the CPU is powered only by this connector, as is the case for the board we use in this work. GPU power measurement. On a GPU card, the video memory, fan and other circuit components draw power at the same time during the CUDA kernel computation. We were unable to isolate their power measurements; such an unidentified portion is therefore included in the measured power utilization of the GPU. Therefore the measurement result for the GPU power includes the power dissipation of all circuit components on the video card. A GPU card is plugged into a PCI-Express slot on the main board; it is mainly powered by the +12V and +3.3V supplies from the PCI-Express pins, and by an additional +12V supply directly from the PSU (because sometimes the PCI-E power may not be enough to support the GPU's high performance computation). In this paper we refer to the former as PCI-Express power and to the latter as auxiliary power. The auxiliary power is easy to measure: we measure the current through the auxiliary power line with a clamp probe, strip little pieces of the coating of the +12V and GND wires, and attach voltage probes to them to obtain the voltage reading. To measure the power supplied by PCI-Express, we introduced a riser card connected between the PCI-Express slot and the GPU plug contact. We identify and separate the +12V and +3.3V power wires from the bus of the riser card; using current clamp probes we measure the PCI-E currents in these wires. We fuse 2 longer
Fig. 1. Power measurement: (left) connections; (right) real picture
Table 1. Measurement results from each computing component
wires to the +12V and +3.3V connection points on the bus holder of the riser card, one to each, in order to measure the voltage by extending the wires. Removing part of the coating on each of the extension wires, we can measure the PCI-Express voltages with voltage probes. Fig. 1 summarizes the connections of the measurement. In this work we use an older card, a GeForce 8800 GTS with 640 MB memory, because the riser card we currently use does not work with PCI-Express 2.0. Memory and main board power estimation. The power of the main memory chips can be calculated based on their DRAM IDD values; however, the values differ between vendors. The power consumption also depends on the memory usage in a real application. Giving a precise measurement of memory power consumption is hard; one would have to isolate the traces on the main board that provide power to the memory and measure the current values on each one. This would involve modding the board. Since the memory only draws power from the main board, we can approximate its power by measuring the power change on the main board. In detail, if a program on the multi-core GPU computer runs only between the cores, the GPUs and the memory, we can estimate the power spent on memory access and data transmission by subtracting the power consumed by the CPU and GPU. Instruments and benchmark. We use a National Instruments USB-6216 BNC data acquisition unit, Fluke i30s / i310s current probes, and a Yokogawa 700925 voltage probe. The room was air-conditioned at 23°C. We use LabView 8.5 as oscilloscope and analyzer for result data analysis. We benchmark the power consumption of a multi-core CPU-GPU architecture with a simple matrices multiplication (similar to the sample code in the CUDA SDK) as shown in Fig. 2. We record real time voltage and current from the measurement readings. The product of voltage and current is the instantaneous power at each sampling point during the measurement. All power results together can be plotted into a power chart of the program execution that shows the power usage of each component when the execution code runs on it, as shown in Fig. 2. By testing the power responses of each component involved in this computation, the power breakdown has been analyzed in Table 1.
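The post-processing just described (instantaneous power from each simultaneous voltage/current sample, and energy by accumulating power over the run) takes only a few lines; the sampling interval and array layout below are assumptions for illustration.

#include <cstddef>

// Hedged sketch: instantaneous power and total energy from sampled readings of
// one supply rail.  volts[i] and amps[i] are simultaneous samples taken every
// dtSeconds; summing P * dt approximates the consumed energy in joules.
void powerTrace(const double *volts, const double *amps, std::size_t nSamples,
                double dtSeconds, double *powerOut, double *energyOut)
{
    double energy = 0.0;
    for (std::size_t i = 0; i < nSamples; ++i) {
        powerOut[i] = volts[i] * amps[i];     // instantaneous power (W) at sample i
        energy += powerOut[i] * dtSeconds;    // rectangle-rule integration
    }
    *energyOut = energy;                      // total energy (J) over the run
}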
Fig. 2. The power of each major component in the multi-core and GPU platform: (left) benchmark algorithm; (right) power chart
Results. In Table 1, the Intel Core i7 is power efficient, as it incorporates several advances in power fine-tuning. Its standby power is 33.03 W and its load power is 102.2 W (for one core) in this measurement. The Nvidia 8800 GTS in a hot standby state, i.e., running at a high frequency while waiting for a workload, consumes 210 W before we launch the kernel, and 310 W while the kernel is running. The main board, after subtracting the power delivered to PCI-Express and to the CPU fan, dissipates 169 W under standby and 238 W under load. The 69 W difference is mainly caused by the memory and the buses on the main board, i.e., by memory operations and data transmission between main memory, buses, PCI-Express slots, etc.
2.2 Power Efficiency
Power factor of a component. The power of a single component depends on many factors. Theoretically, the power of a CMOS chip can be estimated by the equation

Power = DynamicPower + ShortCircuitPower + LeakagePower,

where the short-circuit power is the power expended by the current that momentarily flows between the supply voltage and ground while a gate switches, and the leakage power is the power due to leakage currents. The dynamic power dominates among the three terms; it is consumed by charging and discharging the capacitance at each gate output [4]:

P = A C V² f
(1)
AC is the capacitance being switched per clock cycle: the active fraction of the gates switching in each clock is represented by A, and the total capacitance driven by all gate outputs is represented by C; V is the operating voltage; f is the clock frequency. Coming back to the measurement results in Table 1: the power of the CPU is roughly proportional to its frequency, e.g., the Intel i7 draws about 33 W at 1596 MHz and 102 W at 2793 MHz. The frequency scaling is managed by system policies: when new tasks arrive, the frequency is raised from standby to load and stays at the load frequency until the work completes, as long as the computation task needs consistent responses from the CPU. The GPU 8800GTS/640M has a 500 MHz
Fig. 3. The algorithm level power model for CUDA on Multi-core and GPU platform
nominal core frequency and an 800 MHz nominal memory frequency. Using the system command nvidia-settings to read the real-time frequencies, we observe a 513 MHz core frequency and a 792 MHz memory frequency. For the memory and the system bus on the main board, the power is determined by the number and intensity of data operations; in this benchmark an average power of 69 W is observed. The frequencies of the memory and of the PCI-Express slot do not change during program execution. Algorithm-level power model. The power consumption of software originates from the hardware components on which the execution code runs. For the CPU-GPU platform considered in this paper, it comes basically from each processing element (PE), i.e., the CPU cores and the GPU, and from the main board, on which the main-memory data operations and the data transfers between components take place. Therefore, in our power model, the components used for a CUDA computation are separated into three parts, namely the power of the CPU, of the GPU, and of the memory/main-board operations, as shown in Fig. 3.
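This component separation can be turned into a simple energy estimate. The sketch below only illustrates the structure of the model; the field values are placeholders meant to be filled with measured powers and times such as those in Table 1, and the struct and function names are chosen for this example.

/* Algorithm-level energy model: a CUDA computation is split into a CPU
 * part, a GPU part and a main-board (memory/bus) part.  Each component
 * contributes its measured load power multiplied by the time during
 * which the code keeps it busy. */
typedef struct {
    double p_cpu_load, p_gpu_load, p_board_load; /* measured load powers [W] */
    double t_cpu, t_gpu, t_board;                /* busy time per part [s]   */
} PowerModel;

static double estimated_energy_wh(const PowerModel *m)
{
    double joules = m->p_cpu_load   * m->t_cpu
                  + m->p_gpu_load   * m->t_gpu
                  + m->p_board_load * m->t_board;
    return joules / 3600.0;   /* convert J to Wh */
}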
3
CPU and GPU Power Efficiency
To evaluate the computation performance of a CPU-GPU platform, we need to consider data transfer in the vertical direction between different levels of the cache and memory hierarchy, from CPU main memory to GPU memory and the GPU caches, and vice versa.
3.1 Power Factors
CPU efficiency. We observed during the measurements that a CUDA kernel keeps the busy GPU cores at a high working frequency for 100% of the time. Certain CUDA operations, such as cudaMemcpy, require the CPU to respond. Thus the CPU blocks in a busy loop, polling the GPU card until the
kernel is done, before the CPU can respond to any other calls. This minimizes kernel latency, but fully occupies one CPU core. GPU efficiency. We use a GeForce 8800 GTS/640M [5] in this work. It has 12 MPs (multiprocessors) as computing cores, and each MP has 8 SPs (streaming processors) to execute threads. The CUDA manual introduces the warp, a group of 32 parallel scalar threads; a warp executes one common instruction at a time, and this collection of functional units is recognized as a core. The Nvidia 8800GTS/640M card is efficient in that it has a memory bandwidth of 64 GB/s: its memory frequency is 800 MHz, its effective frequency is 1600 MHz, and its bus width is 320 bits (40 bytes). These figures are related by the equation

BusWidth × EffectiveMemoryFrequency = CommunicationBandwidth.

Bus efficiency. The connectivity between the Nvidia 8800GTS/640M GPU and the CPU is provided by a PCI-Express 1.0 bus with a theoretical bandwidth of 2 Gbps (250 MB/s) per lane. User data transfer rates depend on the higher-layer applications and on the way the data is moved.
3.2 Improving Power Efficiency
We introduce some basic program design strategies for power-saving HPC software on multi-core and GPU platforms (a code sketch of strategy (1) follows this list):

(1) Minimize the number of employed power-hungry components if this does not affect the computation time, because in general more components dissipate more steady power and thus lower the overall power efficiency. In this work, we fix the matrix computation on one CPU core and one GPU.

(2) Enhance the usage of each component in order to obtain the best computation performance for a fixed amount of power. For the GPU, one popular consideration is its memory hierarchy, i.e., optimizing memory utilization by making good use of the shared memory of each block, because shared memory is much faster than global memory. For the CPU, we introduce a multithreading approach that limits the resources spent on polling the GPU to a single thread; the remaining threads are used to share some of the workload that originally belongs to the GPU.

(3) Speed up the execution time when the component power is fixed, so that the energy expense is reduced. In this work, the leading performance bottleneck is clearly the limited CPU-GPU bus communication bandwidth; therefore, the main opportunity to improve the power performance of the multi-core GPU platform is to optimize the utilization of this fixed bandwidth.
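As an illustration of strategy (1), the fragment below fixes the computation to one CPU core and one GPU. It is a sketch under Linux-specific assumptions (sched_setaffinity) and is not taken from the benchmark code used in this paper.

#define _GNU_SOURCE
#include <sched.h>
#include <cuda_runtime.h>

/* Bind the calling host thread to CPU core 0 and select GPU device 0,
 * so that only one power-hungry core and one GPU are kept busy. */
static int bind_one_core_one_gpu(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                      /* use CPU core 0 only */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return -1;
    if (cudaSetDevice(0) != cudaSuccess)   /* use GPU 0 only */
        return -1;
    return 0;
}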
4
Power Saving in Large Matrices Multiplication
Based on the multi-core and GPU power model and on the program analysis above, we try to improve the power efficiency of large matrix multiplication.
For matrices A ∈ R^{n×n} and B ∈ R^{n×n} we implement C = AB with one CPU core and one GPU device. A simple implementation (the simple kernel, for short) repeatedly loads one row of matrix A and one column of matrix B, computes their product with a CUDA kernel on the GPU, and saves the result back to main memory. When n = 2000, the total number of memory accesses is 2n³. Since each floating-point value is 4 bytes long, 64 GB of data have to be transferred over the PCI bus, and this communication latency is considerable. An enhanced approach (the enhanced kernel, for short) uses block matrices: the original matrices are partitioned into smaller rectangular matrix blocks of size 16 × 16, so that a block exactly fits the shared memory of one GPU block. Each GPU block can then multiply two matrix blocks in its shared memory at a time. This reduces the data transmission between the GPU and main memory to 2n³/16, i.e., a total data transfer of 4 GB. This approach significantly enhances the GPU performance and the utilization of the PCI bus. We implemented both kernels on a real system and measured their power performance. The charts of power and energy for the CPU, GPU and main board are shown in Fig. 4 and Fig. 5. When the CPU runs the enhanced CUDA kernel, as shown in Fig. 4, its average power is 96.72 W, slightly higher than the power at the beginning of the run with the simple kernel, because the enhanced kernel needs more intensive responses from the CPU. The enhanced kernel takes only 0.7 s of CPU time (measured by timing the code), whereas the simple kernel needs 7.57 s. 9.5 seconds after the program starts, the Core i7 raises its clock, its power rises to about 130 W, and it remains there until the computation completes. The total energy consumed by the CPU is 0.019 Wh with the enhanced kernel and 0.24 Wh with the simple kernel. In Fig. 5 (left), the GPU load power when running the enhanced kernel is slightly lower than when running the simple kernel; we attribute this to shared-memory operations being less expensive than global-memory operations in terms of GPU circuitry and energy. The GPU energy consumed by the enhanced kernel and the simple kernel is 0.055 Wh and 0.654 Wh, respectively. In Fig. 5 (right), the main board power when running the
Fig. 4. (left) CUDA programming structure on Nvidia 8800GTS/640; (right) CPU power efficiency comparison between simple and enhanced CUDA kernel
Fig. 5. (left) GPU and (right) Main board power efficiency comparison between simple and enhanced CUDA kernel
enhanced kernel is 233 W, about 20 W higher than when running the simple kernel (217 W). The enhanced kernel makes the main-board data transfers and memory operations more intensive than the simple kernel, which results in higher power consumption. In total, the main board consumes 0.0402 Wh with the enhanced kernel and 0.47 Wh with the simple kernel. Overall, the enhanced kernel speeds up the execution time of the simple kernel by a factor of 10.81, and in terms of total energy the enhanced kernel saves 91% of the energy used by the original kernel.
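For reference, a minimal sketch of the kind of 16 × 16 shared-memory (enhanced) kernel discussed above is shown below. It assumes square row-major matrices whose dimension n is a multiple of the tile size, and it is not the exact benchmark kernel used for the measurements.

#define TILE 16

/* C = A * B for n x n row-major matrices, n assumed to be a multiple of
 * TILE.  Each thread block multiplies 16 x 16 sub-blocks staged in shared
 * memory, which reduces global-memory (and hence host/PCI-Express)
 * traffic compared with the simple row-by-column kernel. */
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}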
5
Conclusion
A method to measure the power consumed on a multi-core and GPU platform has been introduced, a power model has been created for a large matrix multiplication algorithm on this platform, and power parameters have been introduced into the CUDA algorithm model for power estimation. Host and device efficiency strategies are employed in the new parallel CUDA algorithm design, which both speeds up the program and saves computation power. The power consumption of this design is finally validated by measuring the implemented program running on real systems. Promising directions are to continue this work with finer workload adjustment and global scheduling, and to apply the algorithmic power-saving strategies to larger-scale and more complicated problems.
References
1. ATX12V Power Supply Design Guide. Version 2.2 (2005)
2. Intel Core i7 Processor Technical Specification. Intel.com, Accessed (2009)
3. Socket 1366 (LGA1366 / Socket B). CPU-world.com, Accessed (2009)
4. Rabaey, J.M.: Digital Integrated Circuits. Prentice-Hall, Englewood Cliffs (1996)
5. NVIDIA GeForce 8800 GTX/GTS Tech Report. TechARP.com, Accessed (2009)
Stream Processing on GPUs Using Distributed Multimedia Middleware

Michael Repplinger1,2 and Philipp Slusallek1,2

1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany
2 German Research Center for Artificial Intelligence (DFKI), Agents & Simulated Reality, Saarbrücken, Germany
[email protected], [email protected]
Abstract. Available GPUs provide increasingly more processing power, especially for multimedia and digital signal processing. Despite the tremendous progress in hardware and thus in processing power, there are and always will be applications that, due to computationally intensive processing algorithms, require using multiple GPUs, either running inside the same machine or distributed in the network. Existing solutions for developing applications for GPUs still require a lot of hand-optimization when using multiple GPUs inside the same machine and in general provide no support for using remote GPUs distributed in the network. In this paper we address this problem and show that an open distributed multimedia middleware, such as the Network-Integrated Multimedia Middleware (NMM), is able (1) to seamlessly integrate processing components using GPUs while completely hiding GPU-specific issues from the application developer, (2) to transparently combine processing components using GPUs or CPUs, and (3) to transparently use local and remote GPUs for distributed processing.
1
Introduction
Since GPUs are especially designed for stream processing and are freely programmable, they are well suited for multimedia or digital signal processing. Available many-core technologies like Nvidia's Compute Unified Device Architecture (CUDA) [1] on top of GPUs simplify the development of highly parallel algorithms running on a single GPU. However, there are still many obstacles and problems when programming applications for GPUs. In general, a GPU can only execute the algorithms that process the data, which we call kernels, while the corresponding control logic still has to be executed on the CPU. The main problem for a software developer is that a kernel runs in a different address space than the application itself. To exchange data between the application and the kernel, or between different GPUs within the same machine, specialized communication mechanisms (e.g., DMA data transfer), memory areas, and special scheduling strategies have to be used. This seriously complicates integrating GPUs into applications, as well as combining multimedia or digital signal processing algorithms that run on CPUs and GPUs or that have to be distributed in the network.
Available open distributed multimedia middleware solutions such as the Network-Integrated Multimedia Middleware (NMM) [2] provide a general abstraction for stream processing using a flow-graph-based approach. This allows an application to specify stream processing by combining different processing elements into a flow graph, each representing a single algorithm or processing step. A unified messaging system allows sending large data blocks, called buffers, as well as events. This enables the use of these middleware solutions for data-driven as well as event-driven stream processing. Furthermore, they consider the network an integral part of their architecture and allow transparent use and control of local and remote components. The important aspect in this context is that an open distributed multimedia middleware like NMM strongly supports the specialization of all its components to different technologies while still providing a unified architecture for application development. Thus, the general idea presented in this paper is to treat GPUs and CPUs within a single machine in the same way as a distributed system. We will show that using an open distributed multimedia middleware for stream processing allows (1) to seamlessly integrate processing components using GPUs while completely hiding GPU-specific issues from the application developer, (2) to transparently combine processing components using GPUs or CPUs, and (3) to transparently use local and remote GPUs for distributed processing. This paper is structured as follows: in Section 2 we describe the essential components of NMM for stream processing on GPUs. Section 3 discusses related work and Section 4 presents the integration of CUDA into NMM. Section 5 shows how to use multiple local or distributed GPUs for processing. Section 6 presents performance measurements showing that multimedia applications using CUDA can even improve their performance when using our approach. Section 7 concludes this paper and highlights future work.
2
Open Distributed Multimedia Middleware
Distributed Flow Graph: To hide technology-specific issues as well as the networking protocols used from the application, NMM uses the concept of a distributed flow graph, providing a strict separation of media processing and media transmission, as can be seen in Figure 1. The nodes of a distributed flow graph represent
Fig. 1. Technology specific aspects are either hidden within nodes or edges of a distributed flow graph
processing units and hide all aspects of the underlying technology used for data processing. Edges represent connections between two nodes and hide all specific aspects of data transmission within a transport strategy (e.g., pointer forwarding for local connections and TCP for network connections). Thus, data streams can flow from distributed source to sink nodes, being processed by each node in between. The concept of the distributed flow graph is essential to (1) seamlessly integrate processing components using GPUs and hide GPU-specific issues from the application developer. The corresponding CUDA kernels can then be integrated into nodes. In the following, nodes using the CPU for media processing are called CPU nodes, while nodes using the GPU are called GPU nodes. A GPU node runs in the address space of the CPU but configures and controls kernel functions running on a GPU. Since the kernels of a GPU node run in a different address space, data received from main memory has to be copied to the GPU using DMA transfer before it can be processed. An open distributed multimedia middleware hides all aspects regarding data transmission in the edges of a flow graph and allows integrating these technology-specific aspects of data transmission as a transport strategy. In the case of NMM, a connection service automatically chooses suitable transport strategies and thus allows (2) transparent combination of processing components using GPUs or CPUs in the same machine. Unified Messaging System: In general, an open distributed multimedia middleware is able to support both data-driven and event-driven stream processing. NMM supports both kinds of messages through its unified messaging system. Buffers represent large data blocks, e.g., a video frame, that are efficiently managed by a buffer manager. A buffer manager is responsible for allocating specific memory blocks and can be shared between multiple components of NMM. For example, CUDA needs memory allocated as page-locked memory on the CPU side to use DMA communication for transporting media data to the GPU. Realizing a corresponding buffer manager allows using this kind of memory within NMM. Since page-locked memory is also a limited resource, this buffer manager provides an
Fig. 2. A data stream can consist of buffers B and events E. A parallel binding allows to use different transport strategies to send these messages while the strict message order is preserved.
interface to the application for specifying a maximum amount of page-locked memory to be used. Moreover, sharing buffer managers, e.g., for page-locked memory, between CPU and GPU nodes enables efficient communication between these nodes, because a CPU node then automatically uses page-locked memory that can be directly copied to the GPU. Events carry control information, may include arbitrarily typed data, and are directly mapped to a method invocation of a node, as long as the node supports the event. NMM allows the combination of events and buffers within the same data stream while a strict message order is preserved. Here, the important aspect is that NMM itself is completely independent of the information sent between nodes but provides an extensible framework for sending messages. Open Communication Framework: An open communication framework is required to transport messages between the nodes of a flow graph correctly. Since a GPU can only process buffers, events have to be sent to and executed within the corresponding GPU node that controls the algorithm running on the GPU. This means that different transport strategies have to be used to send the messages of a data stream to a GPU node, i.e., DMA transfer for media data and pointer forwarding for control events. For this purpose, NMM uses the concept of parallel binding [3], as shown in Figure 2. This allows different transport strategies to be used for buffers and events, while the original message order is preserved. Moreover, NMM supports a pipeline of transport strategies, which is required to use remote GPU nodes. Since a GPU only has access to the main memory of the collocated CPU and is not able to send buffers directly to a network interface, a buffer has to be copied to main memory first; in a second step, the buffer can be sent to a remote system using standard network protocols like TCP. NMM supports setting up a pipeline of different transport strategies within a parallel binding in order to reuse existing transport strategies. The connection service of NMM automatically chooses these transport strategies for transmitting buffers between GPU and CPU nodes and enables (3) transparent use of local and remote GPUs for distributed processing.
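A buffer manager for page-locked memory of the kind described above can be reduced to the following sketch. The cap on pinned memory is an illustrative policy, and the function names are hypothetical; they are not part of the NMM API.

#include <cuda_runtime.h>
#include <stddef.h>

/* Hypothetical page-locked buffer manager: allocates pinned host memory
 * (required for asynchronous DMA transfers to the GPU) while enforcing an
 * application-defined upper bound, since pinned memory is a limited
 * resource. */
static size_t pinned_in_use = 0;
static size_t pinned_limit  = (size_t)256 << 20;   /* e.g. 256 MB cap */

void *pinned_alloc(size_t bytes)
{
    void *p = NULL;
    if (pinned_in_use + bytes > pinned_limit)
        return NULL;                    /* over budget: caller falls back */
    if (cudaMallocHost(&p, bytes) != cudaSuccess)
        return NULL;
    pinned_in_use += bytes;
    return p;
}

void pinned_free(void *p, size_t bytes)
{
    cudaFreeHost(p);
    pinned_in_use -= bytes;
}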
3
Related Work
Due to the wide acceptance of CUDA, there already exist some CUDA-specific extensions. GpuCV [4] and OpenVidia [5] are computer vision libraries that completely hide the underlying GPU architecture. Since they act as black-box solutions, it is difficult to combine them with existing multimedia applications that use the CPU. However, the CUDA kernels of such libraries can be reused in the presented approach. In [6] CUDA was integrated into a grid computing framework built on top of the DataCutter middleware. Since the DataCutter middleware focuses on processing a large number of completely independent tasks, it is not suitable for multimedia processing. Most existing frameworks for distributed stream processing that also support GPUs have a special focus on computer graphics. For example WireGL [7],
Chromium [8], and Equalizer [9] support the distribution of workload to GPUs distributed in the network. However, all these frameworks are limited to OpenGL-based applications and cannot be used for general processing on GPUs. Available distributed multimedia middleware solutions like NMM [2], NIST II [10] and Infopipe [11] are especially designed for distributed media processing, but only NMM and NIST II support the concept of a distributed flow graph. However, the concepts of parallel binding and pipelined parallel binding are only supported by NMM and not by NIST II.
4
Integration of CUDA
We use a three-layer approach for integrating CUDA into NMM, as can be seen in Figure 3. All three layers can be accessed from the application, but only the first layer, which includes the distributed flow graph, is visible by default. Here, all CUDA kernels are encapsulated into specific GPU nodes, so that they can be used within a distributed flow graph for distributed stream processing. Since the processing of a specific buffer requires all following operations to be executed on the same GPU, our integration of CUDA ensures that all following GPU nodes use the same GPU. Therefore, GPU nodes are interconnected using the LocalStrategy, which simply forwards pointers to buffers. However, the concept of parallel binding is required for connecting CPU and GPU nodes. Here, incoming events are still forwarded to a LocalStrategy, because a GPU node processes events in the same address space as a CPU node. Incoming buffers are sent to a CPUToGPUStrategy or GPUToCPUStrategy to copy media data from main memory to GPU memory, or vice versa, using CUDA's asynchronous DMA transfer. The connection service provided by NMM is extended to automatically choose these transport strategies for transmitting buffers between GPU and CPU nodes. This shows that the approach of a distributed flow graph (1) hides all specific aspects of CUDA and GPUs from the application.
Fig. 3. CUDA is integrated into NMM using a three layer approach. All layers can be accessed by the application, but only the distributed flow graph is seen by default.
The second layer enables efficient memory management. Page-locked memory that can be directly copied to a GPU is allocated and managed by a CPUBufferManager, while a GPUBufferManager allocates and manages memory on a GPU. Since a CPUToGPUStrategy requires page-locked memory to avoid unnecessary copy operations, it forwards a CPUBufferManager to all predecessor nodes. This is enabled by the concept of shareable buffer managers described in Section 2. As can be seen in Figure 3, this is done as soon as a connection between a CPU and a GPU node is established. Again, this shows the benefit of using a distributed multimedia middleware for multimedia processing on a GPU: GPU nodes can be combined with existing CPU nodes in an efficient way, i.e., without unnecessary memory copies, and without changing the implementation of existing nodes.
Fig. 4. This figure shows buffer processing using a combination of CPU and GPU nodes
The lowest layer is responsible for managing and scheduling the available GPUs within a system. Since we directly use the driver API, different GPUs can in general be accessed from the same application thread by pushing the corresponding CUDA context. However, the current implementation of CUDA's context management does not support asynchronous copy operations [12]. Thus, the CUDAManager maintains an individual thread for each GPU, called a GPU thread, for accessing that GPU. If a component executes a CUDA operation within one of its methods (e.g., executes a kernel), it requests the CUDAManager to invoke this method using the corresponding GPU thread. Executing a method through the CUDAManager blocks until the CUDAManager has executed the method. This completely hides the use of multiple GPU threads, as well as the use of different application threads for accessing a GPU, from the application logic. Since page-locked memory is already bound to a specific GPU, the CUDAManager instantiates a CPUBufferManager and a GPUBufferManager for each GPU. To propagate scheduling information between different nodes, each buffer stores information about the GPU thread that has to be used for its processing. Moreover, CUDA operations are executed asynchronously in so-called CUDA streams. Therefore, each buffer provides its own CUDA stream in which all components along the flow graph queue their CUDA-specific operations asynchronously.
Since all buffers used by asynchronous operations of a single CUDA stream can only be released once the CUDA stream is synchronized, each CUDA stream is encapsulated into an NMM-CUDA stream that also stores the involved resources and releases them when the CUDA stream is synchronized. Figure 4 shows all steps for processing a buffer using CPU and GPU nodes (a condensed CUDA sketch of these steps follows the list):

1. When the CPUToGPUStrategy receives a buffer BCPU, it requests a suitable buffer BGPU for the same GPU and initiates an asynchronous copy operation. Before forwarding BGPU to GPU node A, it adds BCPU to the NMM-CUDA stream, because this buffer can only be released once the CUDA stream is synchronized.
2. GPU node A initiates the asynchronous execution of its kernel and forwards BGPU to GPU node B. Since both GPU nodes execute their kernels on the same GPU, the connecting transport strategy uses simple pointer forwarding to transmit BGPU to GPU node B.
3. GPU node B also initiates the asynchronous execution of its kernel and forwards BGPU.
4. When the GPUToCPUStrategy receives BGPU, it requests a new BCPU and initiates an asynchronous memory copy from GPU to CPU.
5. Finally, the transport strategy synchronizes the CUDA stream to ensure that all operations on the GPU have finished, forwards BCPU to CPU node B, and releases all resources stored in the NMM-CUDA stream.
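The buffer life cycle of steps 1-5 can be summarized with plain CUDA runtime calls as below. This is a condensed, hypothetical sketch of the same pattern (pinned buffer, asynchronous host-to-device copy, two kernels, asynchronous device-to-host copy, one synchronization); it omits the NMM buffer-manager and transport-strategy classes, and kernelA/kernelB stand in for the node kernels.

__global__ void kernelA(float *buf, int n);   /* placeholder for GPU node A */
__global__ void kernelB(float *buf, int n);   /* placeholder for GPU node B */

/* One buffer passed through two GPU "nodes" on a single CUDA stream;
 * host_in/host_out are assumed to be page-locked so the copies are truly
 * asynchronous.  The stream is synchronized once at the end (step 5). */
void process_buffer(const float *host_in, float *host_out, int n)
{
    float *d_buf;
    cudaStream_t s;
    cudaMalloc((void **)&d_buf, n * sizeof(float));
    cudaStreamCreate(&s);

    /* step 1: asynchronous copy from pinned host memory to the GPU */
    cudaMemcpyAsync(d_buf, host_in, n * sizeof(float),
                    cudaMemcpyHostToDevice, s);
    /* steps 2-3: kernels of GPU node A and GPU node B on the same stream */
    kernelA<<<(n + 255) / 256, 256, 0, s>>>(d_buf, n);
    kernelB<<<(n + 255) / 256, 256, 0, s>>>(d_buf, n);
    /* step 4: asynchronous copy of the result back to the host */
    cudaMemcpyAsync(host_out, d_buf, n * sizeof(float),
                    cudaMemcpyDeviceToHost, s);
    /* step 5: synchronize before the buffer is handed to a CPU node */
    cudaStreamSynchronize(s);

    cudaStreamDestroy(s);
    cudaFree(d_buf);
}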
5 Parallel Processing on Multiple GPUs
5.1 Automatic Parallelization
Automatic parallelization is only provided for GPUs inside a single machine. Here, the most important influence on scheduling is that page-locked memory is bound to a specific GPU. This means that all subsequent processing steps are bound to that GPU, but GPU nodes are not explicitly bound to one. So if multiple GPUs are available within a single system, the processing of the next media buffers can automatically be initiated on different GPUs. However, this is only possible if a kernel is stateless and does not depend on information about already processed buffers, e.g. a kernel that changes the resolution of each incoming video buffer. In contrast, a stateful kernel, e.g. for encoding or decoding video, stores state information on a GPU and cannot automatically be distributed to multiple GPUs. When a stateful kernel is used inside a GPU node, the corresponding transport strategy of type CPUToGPUStrategy uses the CPUBufferManager of a specific GPU to limit media processing to a single GPU. But if a stateless kernel is used inside a GPU node, the corresponding CPUToGPUStrategy forwards a CompositeBufferManager to all preceding nodes. The CompositeBufferManager includes the CPUBufferManagers of all GPUs, and when a buffer is requested it asks the CUDAManager which GPU should be used and returns a page-locked buffer for the corresponding GPU. So far we have implemented a simple round-robin mechanism inside the CUDAManager to distribute buffers to one GPU after the other.
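The round-robin policy mentioned above amounts to little more than the following; the function name is illustrative and not part of the CUDAManager interface.

/* Round-robin selection of the GPU (and hence of the per-GPU buffer
 * manager) for the next media buffer of a stateless kernel. */
static int next_gpu(int num_gpus)
{
    static int counter = 0;
    int gpu = counter % num_gpus;
    counter = (counter + 1) % num_gpus;
    return gpu;
}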
5.2
Explicit Parallelization
To explicitly distribute workload to multiple local or distributed GPUs for stateless kernels, we provide a set of nodes that support explicit parallelization for the application developer. The general idea can be seen in Figure 5. A manager node is used to distribute workload to a set of local or distributed GPU nodes. Since the kind of data that has to be distributed to the succeeding GPU nodes strongly depends on the application, the manager node provides only a scheduling algorithm and distributes the incoming data from its predecessor node.
Fig. 5. The manager node distributes workload to connected GPU nodes. After data have been processed by all successive GPU nodes, they are sent back to an assembly node that recreates correct order of processed data.
First, the manager node sends data to the connected GPU nodes one after the other. As soon as a GPU node connected to the manager node has finished processing its data, it informs the manager node by sending a control event to request new data for processing. The manager node in turn sends the next available data to this GPU node. This simple scheduling approach leads to efficient dynamic load balancing between the GPU nodes, because GPU nodes that finish processing their data earlier also receive new processing tasks earlier. This approach automatically accounts for differences in processing time that can be caused by using different graphics boards.
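The demand-driven scheduling of the manager node can be sketched as a simple dispatch loop. The send_to() and wait_for_done() calls are hypothetical placeholders for the buffer and event transport of the middleware; they are not NMM functions.

extern void send_to(int worker, int item);  /* hypothetical buffer send   */
extern int  wait_for_done(void);            /* hypothetical: id of an idle
                                               GPU node that sent a "done"
                                               control event              */

/* Manager-node scheduling sketch: prime every worker with one item, then
 * hand out the remaining items to whichever GPU node finishes first. */
void manager_loop(int num_items, int num_workers)
{
    int next = 0;
    for (int w = 0; w < num_workers && next < num_items; ++w)
        send_to(w, next++);
    while (next < num_items) {
        int w = wait_for_done();
        send_to(w, next++);
    }
}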
6
Performance Measurements
For all performance measurements we use two PCs, PC1 and PC2, connected through a 1 Gigabit/sec full-duplex Ethernet connection, each with an Intel Core2 Duo 3.33 GHz E8600 processor and 4 GB RAM (DDR3 1300 MHz), running

Table 1. Performance of the NMM-CUDA integration versus a single-threaded reference implementation. Throughput is measured in buffers per second [Buf/s] for a stateless kernel using 1, 2 and 3 GPUs.

Buffer size  Reference [Buf/s]  NMM: 1 GPU [Buf/s]  NMM: 2 GPUs [Buf/s]  NMM: 3 GPUs [Buf/s]
             PC1                PC1                 PC1                  PC1 + PC2
50 KB        1314               1354 (103 %)        2152 (163.8 %)       2790 (212.32 %)
500 KB       200                235 (117 %)         463 (231.5 %)        680 (340.0 %)
1000 KB      103                117 (113.6 %)       234 (227.2 %)        342 (332.7 %)
2000 KB      52                 59 (113.4 %)        118 (226.9 %)        168 (323.1 %)
3000 KB      34                 39 (114.7 %)        79 (232.3 %)         105 (308.8 %)
64-bit Linux (kernel 2.6.28) and CUDA Toolkit 2.0. PC1 includes two and PC2 one Nvidia GeForce 9600 GT graphics board, each with 512 MB RAM. In order to measure the overhead of the presented approach, we compare a reference program that copies data from CPU to GPU, executes a kernel and finally copies the data back to main memory using a single application thread, against a corresponding flow graph that consists of two CPU nodes and one GPU node in between using the same kernel. Based on the throughput of buffers per second that can be passed, we compare the reference implementation and the corresponding NMM flow graph. Since NMM inherently uses multiple application threads, which is not possible in the reference application without using a framework like NMM, these measurements also include the influence of using multiple application threads. For all measurements, we used a stateless kernel that adjusts the brightness of incoming video frames. The resulting throughput for different buffer sizes can be seen in Table 1. The achieved throughput of our integration is up to 16.7% higher compared to the reference application. These measurements show that there is no overhead when using NMM as a distributed multimedia middleware together with our CUDA integration, even for purely locally running applications. Moreover, the presented approach inherently uses multiple application threads for accessing the GPU, which leads to a better exploitation of the GPU. Adding a second GPU can double the buffer throughput for larger buffer sizes if a stateless kernel is used. Moreover, adding a single remote GPU shows that the overall performance can be increased up to a factor of three. When using both PCs for processing, we use the manager node described in Section 5.2 to distribute the workload to GPU nodes running on PC1 and PC2. Since PC1 provides two GPUs and we use a stateless kernel, both graphics boards of PC1 are used. The assembly node runs on PC1 and receives the results from PC2. However, for large amounts of data the network turns out to be the bottleneck: in our benchmark we already send up to 800 MBit/sec in both directions, so that a fourth GPU in a remote PC cannot be fully exploited. In this case faster network technologies, e.g. InfiniBand, which provides up to 20 GBit/sec, have to be used.
7
Conclusion and Future Work
In this paper we demonstrated that a distributed multimedia middleware like NMM is able (1) to seamlessly integrate processing components using GPUs while completely hiding GPU specific issues from the application developer, (2) to transparently combine processing components using GPUs or CPUs, and (3) to transparently use local and remote GPUs for distributed processing. From our point of view, a distributed multimedia middleware such as NMM is essential to fully exploit the processing power of today’s GPUs, while still offering a suitable abstraction for developers. Thus, future work will mainly focus on integrating emerging many-core technologies to conclude on which functionality should additionally be provided by a distributed multimedia middleware.
Acknowledgements We would like to thank Martin Beyer for his valuable work on supporting the integration of CUDA into NMM.
References 1. NVIDIA: CUDA Programming Guide 2.0 (2008) 2. Lohse, M., Winter, F., Repplinger, M., Slusallek, P.: Network-Integrated Multimedia Middleware (NMM). In: MM 2008: Proceedings of the 16th ACM International Conference on Multimedia, pp. 1081–1084 (2008) 3. Repplinger, M., Winter, F., Lohse, M., Slusallek, P.: Parallel Bindings in Distributed Multimedia Systems. In: Proceedings of the 25th IEEE International Conference on Distributed Computing Systems Workshops (ICDCS 2005), pp. 714–720. IEEE Computer Society, Los Alamitos (2005) 4. Allusse, Y., Horain, P., Agarwal, A., Saipriyadarshan, C.: GpuCV: An Open-Source GPU-accelerated Framework for Image Processing and Computer Vision. In: MM 2008: Proceeding of the 16th ACM International Conference on Multimedia, pp. 1089–1092. ACM, New York (2008) 5. Fung, J., Mann, S.: OpenVIDIA: parallel GPU computer vision. In: MULTIMEDIA 2005: Proceedings of the 13th Annual ACM International Conference on Multimedia, pp. 849–852. ACM, New York (2005) 6. Hartley, D.R., et al.: Biomedical Image Analysis on a Cooperative Cluster of GPUs and Multicores. In: ICS 2008: Proceedings of the 22nd Annual International Conference on Supercomputing, pp. 15–25. ACM, New York (2008) 7. Humphreys, G., et al.: WireGL: a scalable graphics system for clusters. In: SIGGRAPH 2001: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pp. 129–140 (2001) 8. Humphreys, G., et al.: Chromium: a stream-processing framework for interactive rendering on clusters. In: SIGGRAPH 2002: Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques, pp. 693–702 (2002) 9. Eilemann, S., Pajarola, R.: The Equalizer parallel rendering framework. Technical Report IFI 2007.06, Department of Informatics, University of Z¨ urich (2007) 10. Fillinger, A., et al.: The NIST Data Flow System II: A Standardized Interface for Distributed Multimedia Applications. In: IEEE International Symposium on a World of Wireless; Mobile and MultiMedia Networks (WoWMoM). IEEE, Los Alamitos (2008) 11. Black, A.P., et al.: Infopipes: An Abstraction for Multimedia Streaming. Multimedia Syst. 8(5), 406–419 (2002) 12. NVIDIA: CUDA Programming and Development. NVidia Forum (2009), http://www.forums.nvidia.com/index.php?showtopic=81300 &hl=cuMemcpyHtoDAsync
Simulations of the Electrical Activity in the Heart with Graphic Processing Units

Bernardo M. Rocha1, Fernando O. Campos1, Gernot Plank1,2, Rodrigo W. dos Santos3, Manfred Liebmann4, and Gundolf Haase4

1 Medical University of Graz, Institute of Biophysics, Harrachgasse 21, 8010 Graz, Austria
2 Medical University of Graz, Institute of Physiology, Harrachgasse 21, 8010 Graz, Austria
3 Federal University of Juiz de Fora, Department of Computer Science and Master Program in Computational Modeling, Campus Universitário de Martelos, 36036-330 Juiz de Fora, Brazil
4 Karl-Franzens-University Graz, Institute for Mathematics and Scientific Computing, Heinrichstrasse 36, 8010 Graz, Austria
Abstract. The modeling of the electrical activity of the heart is of great medical and scientific interest, because it provides a way to get a better understanding of the related biophysical phenomena, allows the development of new techniques for diagnosis and serves as a platform for drug tests. Cardiac electrophysiology may be simulated by solving a partial differential equation (PDE) coupled to a system of ordinary differential equations (ODEs) describing the electrical behavior of the cell membrane. The numerical solution is, however, computationally demanding because of the fine temporal and spatial sampling required. The demand for real-time high-definition 3D graphics has turned the new graphics processing units (GPUs) into highly parallel, multithreaded, many-core processors with tremendous computational horsepower, which makes the use of GPUs a promising alternative for simulating the electrical activity in the heart. The aim of this work is to study the performance of GPUs in solving the equations underlying the electrical activity in a simple cardiac tissue. Keywords: Cardiac Electrophysiology, Computer Simulations, High Performance Computing, Graphic Processing Units.
1
Introduction
The phenomenon of electrical propagation in the heart comprises a set of complex nonlinear biophysical processes. Its multi-scale nature spans from nanometer processes such as ionic movements and protein dynamic conformation, to centimeter phenomena such as whole heart contraction [1]. Computer models have become valuable tools for the study and comprehension of such complex phenomena, as they allow different information acquired from different physical
scales and experiments to be combined in order to generate a better picture of the whole system functionality. There are two components that contribute to the modeling of cardiac electrical propagation [2]. The first is a model of cellular membrane dynamics, describing the flow of ions across the cell membrane. Such models are usually formulated as a system of nonlinear ODEs describing processes occurring on a wide range of time scales. These models are being continually developed to give an increasingly detailed and accurate description of cellular physiology. However, this development also tends to increase the complexity of the models, making them substantially more costly to solve numerically [3]. The second is an electrical model of the tissue that describes how currents from one region of a cell membrane interact with other regions and with neighboring cells. When these two components are put together, the ODEs are coupled to a set of PDEs, giving rise to the bidomain model [4]. Despite being currently the most complete description of cardiac electrical activity, the numerical solution of the bidomain equations is very computationally demanding [5,6,7,8]. A simpler approximation can be obtained from the model formulation in [4] by considering the extracellular potential to be constant; or the tissue conductivity to be isotropic; or the intracellular and extracellular conductivities to have equal anisotropy ratios [9]. If one of these assumptions is made, the equations reduce to the monodomain model [10], which has recently been demonstrated to yield essentially equivalent results to the bidomain model [11]. Nevertheless, the numerical solution of these tissue models involves the discretization in space and time of PDEs as well as the integration of nonlinear systems of ODEs. Using an appropriate decomposition for the time discretization, the nonlinearity of the system may be isolated, i.e. for each node of the spatial mesh a system of ODEs that comes from the cellular model may be solved independently. To perform realistic simulations of cardiac tissue, a large number of ODE systems must be solved at each time step, which contributes substantially to the total computational work. Therefore, it is necessary to pursue new efficient ways of solving the large linear algebraic system that arises from the discretization of the PDEs as well as the systems of ODEs associated with the cell models. The advent of multi-core CPUs and many-core GPUs means that mainstream processor chips are now parallel systems. Many efforts have been made to solve the equations modeling the cardiac electrical activity efficiently on a single CPU as well as on clusters of computers [3,5,6,8,12]. However, the newer programmable GPU has become a highly parallel, multithreaded, many-core processor with tremendous computational horsepower and very high memory bandwidth. The challenge is to develop application software that transparently scales its parallelism to leverage the increasing number of processor cores. CUDA is a parallel programming model and software environment designed to overcome this challenge while maintaining a low learning curve for programmers familiar with standard programming languages such as C.
The aim of this work is to evaluate the performance of CPU and GPU solvers for the monodomain model developed in standard C and extended to the NVIDIA CUDA parallel environment [13].
2
Mathematical Formulation
The monodomain model may be written as:

βCm ∂Vm/∂t = ∇ · (σ∇Vm) − βIion(Vm, ηi)    (1)
∂ηi/∂t = f(Vm, ηi)    (2)

where β is the surface-to-volume ratio of the cardiac cells, Cm is the membrane capacitance, Vm is the transmembrane voltage, σ is the conductivity tensor and Iion is the ionic current across the membrane, which depends on Vm as well as on many other state variables represented by ηi. In this work, the Luo-Rudy model (LR-I model) [14], a mathematical description of the action potential (AP) in ventricular cells, was considered to simulate the kinetics of Iion. In this mammalian ventricular model, Iion is defined as:

Iion = INa + Isi + IK + IK1 + IKp + Ib    (3)
where INa is the fast sodium current, Isi is the slow inward current, IK is the time-dependent potassium current, IK1 is the time-independent potassium current, IKp is the plateau potassium current and Ib is the time-independent background current. Four of the 6 ionic currents in Eq. 3 are controlled by the action of ionic gates described by ODEs of the form:

dn/dt = (n∞(Vm) − n) / τn(Vm)    (4)

where

n∞(Vm) = α(Vm) / (α(Vm) + β(Vm))    (5)

and

τn(Vm) = 1 / (α(Vm) + β(Vm))    (6)

where n represents any gating variable, n∞(Vm) is the steady-state value of n, and τn(Vm) is its time constant. In addition to the ODEs for the gating parameters, the model includes an ODE for the intracellular calcium concentration [Ca2+]i:

d[Ca2+]i/dt = −10⁻⁴ Isi + 0.07 (10⁻⁴ − [Ca2+]i)    (7)

In summary, the model is based on a set of 8 ODEs describing ionic currents and intracellular calcium concentration. A more complete description can be found in [14].
3
Numerical Schemes
In order to obtain a numerical solution for Eqs. (1)-(2), the Godunov operator splitting [15] was employed. The coupled system is then reduced to a two-step numerical scheme that involves the solution of a parabolic PDE and of a nonlinear system of ODEs at each time step. This splitting leads to the following two-step algorithm:

1. Solve the systems of ODEs

   ∂Vm/∂t = −βIion(Vm, ηi)    (8)
   ∂ηi/∂t = f(Vm, ηi)    (9)

2. Solve the parabolic problem

   β ∂Vm/∂t = ∇ · (σ∇Vm)    (10)
The Finite Element Method (FEM) was used for the spatial discretization of Eq. 10 and the implicit Euler method to advance the solution in time. Applying the FEM for the spatial discretization, a system of linear equations must be solved at each time step:

(M + CK) υ^(k+1) = M υ^k
(11)
where υ discretizes Vm at time kΔt; M and K are the mass and stiffness matrices from the FEM discretization, respectively; and the constant C is Δt/(βCm). The non-preconditioned conjugate gradient (CG) method was used to solve this system. Due to the nature of the FEM discretization, K and M are essentially sparse, and thus specific data structures were used to store them in an efficient manner. In the present work we used the well-known compressed sparse row (CSR) format [16] for the matrix representation. This format stores a sparse matrix A in three arrays Ax, Aj and Ap that hold, respectively, the nonzero entries, the columns of these entries, and the pointers to the beginning of each row in the arrays Ax and Aj, as illustrated in Figure 1. The system of ODEs in Eqs. (8)-(9) was integrated using the explicit Euler method. Although it is well known that explicit numerical methods have strong limitations because of stability restrictions [3], they are widely used due to their simplicity of implementation [5,8].
Fig. 1. Example of storage of a sparse matrix using the CSR format
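For a single gating variable, the explicit Euler step combining Eqs. (4)-(6) reduces to the following C sketch; alpha() and beta() stand for the voltage-dependent rate functions of the Luo-Rudy model and are assumed to be given.

/* One explicit Euler step for a gating variable n (Eqs. (4)-(6)).
 * alpha(v) and beta(v) are the model-specific rate functions. */
double euler_gate_step(double n, double v, double dt,
                       double (*alpha)(double), double (*beta)(double))
{
    double a = alpha(v), b = beta(v);
    double n_inf = a / (a + b);            /* steady-state value, Eq. (5) */
    double tau_n = 1.0 / (a + b);          /* time constant, Eq. (6)      */
    return n + dt * (n_inf - n) / tau_n;   /* forward Euler for Eq. (4)   */
}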
4
Methods
We have developed three different in-silico tissue preparations for our investigations. Each tissue is a two-dimensional model, which was discretized through the FEM with bilinear elements and with zero-flux boundary conditions imposed. The spatial discretization was set to Δx = 12.5 μm. Numerical tests performed using the forward Euler method to solve the system of ODEs revealed that Δt = 0.01 ms is the largest time step allowed by the stability constraints. Table 1 presents the dimensions of the simulated tissues as well as the properties of the FEM meshes. The absolute tolerance of the CG method was set to 10⁻⁶. The parameters of the monodomain model were taken from the literature [1,17]: Cm = 1 μF/cm²; β = 0.14 μm⁻¹; and the tissue conductivity was considered homogeneous and isotropic with σ = 0.0001 mS/μm.

Table 1. Dimensions of the in-silico tissue preparations and the properties of their associated FEM meshes

Tissue      Nodes    Elements
2 × 2 mm    25921    25600
4 × 4 mm    103041   102400
8 × 8 mm    410881   409600
We simulated 100 ms of electrical propagation in these cardiac tissues after applying a stimulus in a predefined area. The stimulus was applied by setting Vm at the left edge of the meshes to a value above the threshold required to generate an AP. Figure 2 (a) illustrates the electrical propagation in a simulated tissue of dimensions 4x4 mm, and (b) the time course of the AP of the cell highlighted in (a).
Fig. 2. Simulation results. (a) Spatial distribution of Vm in a 4x4 mm tissue simulation 3 ms after the stimulus onset. (b) the AP related to the cell highlighted in (a).
In order to obtain high performance with GPU programming, two requisites should be kept in mind. The first is the challenge inherent to developing applications for parallel systems, which must balance the workload and scale it over a number of processors. The second is that, differently from the CPU, on the GPU the global memory space is not cached, so it is also very important to follow the right access pattern to obtain the maximum memory bandwidth [13]. We implemented the numerical solution of Eqs. (1)-(2), in which Iion is given by Eq. 3, using standard C (CPU) and then extended it to the NVIDIA CUDA parallel environment (GPU). Specific CUDA kernels were developed to solve the system of ODEs as well as the parabolic problem. All kernels were launched with 256 threads per block, and the number of blocks per grid was set according to the size of the problem, i.e. according to the number of nodes in the FEM mesh. The ODE Systems. We designed two kernels to solve the systems of ODEs related to Eq. 2. The first one sets the initial conditions of the cells, i.e. of the systems of ODEs. The state variables of m cells are stored in a one-dimensional vector of size m ∗ n, where n is the number of differential equations of the ionic model (n = 8 for the Luo and Rudy (1991) model [14]). This approach was chosen in order to simplify the structure of the CUDA code as well as the memory accesses by the threads. Since each cell model has the same number of state variables n, m ∗ n threads are required to set all initial conditions of all cells. The strategy for setting up the initial conditions is illustrated in Figure 3 (a). Note that the threads access the memory in a coalesced way, which (according to [13]) should improve the performance of the code. The second kernel integrates the ODEs at each time step; differently from the first one, each thread is used to solve one ODE system. Therefore, each thread has to load and operate on all state variables of one cell model. Figure 3 (b) illustrates this strategy. The Parabolic Problem. The solution of the parabolic problem essentially consists of basic linear algebra operations such as the vector scalar product and the matrix-vector multiplication, which in our case is sparse. The implementation of basic linear algebra kernels in CUDA such as vector addition, vector scaling and the SAXPY operation is a straightforward task, since each thread is assigned the computation of one position of the array. On the other hand, the implementation of the scalar product is not straightforward and requires a reduction operation. In this work this kernel was implemented by computing a partial sum of the vectors on each multiprocessor and then performing a reduction in shared memory. The CSR sparse matrix-vector multiplication (SpMV) can be easily parallelized, since the product of one row of the matrix with the vector is independent of the other rows. We implemented it by assigning one thread to compute the dot product of one matrix row with the vector. Although there are more efficient variants of this algorithm, such as the Block Compressed Row Storage (BCRS) [18], we have only tested the CSR kernel as described before.
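A minimal version of the CSR matrix-vector kernel described above (one thread per matrix row) is sketched below; it follows the Ax/Aj/Ap naming of Section 3 but is not the exact kernel used for the measurements.

/* y = A*x for a sparse matrix stored in CSR format (arrays Ax, Aj, Ap).
 * Each thread computes the dot product of one matrix row with x. */
__global__ void csr_spmv(int num_rows, const float *Ax, const int *Aj,
                         const int *Ap, const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float sum = 0.0f;
        for (int j = Ap[row]; j < Ap[row + 1]; ++j)
            sum += Ax[j] * x[Aj[j]];
        y[row] = sum;
    }
}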
Fig. 3. Strategy used to set up the initial conditions (a) and to solve the ODE systems (b) in a cardiac tissue containing m cells modeled by n ODEs
In order to check for possible deviations in the numerical solutions due to the use of single precision in the arithmetic operations, the results related to Vm were stored and used for comparisons. The numerical solutions obtained using single precision were compared against one computed using double precision through the relative root-mean-square (RRMS) norm:

error = 100 · sqrt( Σ_{t,i,j} (υ^t_{i,j} − ν^t_{i,j})² / Σ_{t,i,j} (υ^t_{i,j})² )    (12)
where υ is the solution computed using double precision, ν is the solution computed using single precision and t and i, j are indexes in time and space, respectively.
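Equation (12) translates directly into the following check; the double- and single-precision traces are assumed here to be flattened into arrays over all time steps and mesh nodes, an assumption about storage layout made only for this sketch.

#include <math.h>
#include <stddef.h>

/* Relative root-mean-square error (Eq. (12)), in percent, between the
 * double-precision solution u and the single-precision solution v,
 * both flattened over time and space into arrays of length len. */
double rrms_percent(const double *u, const float *v, size_t len)
{
    double num = 0.0, den = 0.0;
    for (size_t k = 0; k < len; ++k) {
        double d = u[k] - (double)v[k];
        num += d * d;
        den += u[k] * u[k];
    }
    return 100.0 * sqrt(num / den);
}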
5
Results
In this section we present the numerical results obtained for the monodomain simulations using a ventricular cell model on graphics processing units through
Table 2. Comparison between CPU and GPU solvers for the ODE systems. Execution times are given in seconds.

Mesh        CPU Time   GPU Time   Speedup
2 × 2 mm    379.67     7.78       48.80
4 × 4 mm    1509.59    29.70      50.82
8 × 8 mm    6005.06    117.57     51.07
Table 3. Comparison between CPU and GPU solvers for the parabolic problem. Execution times are given in seconds.

Mesh        CPU Time   GPU Time   Speedup
2 × 2 mm    323.08     158.31     2.04
4 × 4 mm    1820.72    493.74     3.68
8 × 8 mm    8442.98    1888.38    4.68
Table 4. Error values for simulations using single precision on the CPU and GPU solvers.

Mesh        CPU Float   GPU Float
2 × 2 mm    0.00607%    0.00458%
4 × 4 mm    0.02813%    0.02204%
8 × 8 mm    0.00485%    0.00394%
the NVIDIA CUDA environment. Simulations were performed on a single processor core of a quad-core Intel Xeon E5420 2.50 GHz with 16 GB of shared memory, equipped with an NVIDIA GeForce GTX 295, a dual-GPU graphics card with two GT200s emulating a pair of GTX 260, with a total of 480 processor cores and 1.8 GB of GDDR3 memory. We compared the CPU and GPU implementations by means of the time required to integrate the ODE systems (Table 2) as well as to solve the parabolic system (Table 3). According to Table 2, the GPU achieved much better performance than the CPU in all tissue simulations: speedup factors of about 48-fold, 50-fold and 51-fold were achieved in the 2x2 mm, 4x4 mm and 8x8 mm tissue preparations, respectively. Regarding the parabolic system, the GPU was 2 times faster than the CPU for the smallest tissue, 3.6 times faster for the intermediate one, and finally 4.5 times faster for the largest preparation. Table 4 gives the RRMS norm relative to a solution calculated using double precision. It can be noticed that the precision used did not influence the numerical solutions.
6
Discussion
We have evaluated the performance of CPU and GPU solvers for the monodomain model developed in standard C and extended to the NVIDIA CUDA parallel environment. Three different in-silico tissue preparations were used in this work for the performance tests. In all cases, the GPU performed much better than the CPU during the integration of the ODE systems, where a speed increase of about 50 times was obtained. On the other hand, a more modest speedup of about 3.5 was achieved for the solution of the parabolic problem. The performance of the parabolic solver, which consists essentially of a sparse matrix-vector multiplication and the subsequent solution of a linear system, is strongly dependent on the storage format of the sparse matrix associated with the linear system. We have employed the well-known CSR format, which is used to store general sparse matrices. However, from the point of view of parallelization using CUDA, the CSR format has a significant drawback: the threads do not access the CSR Aj and Ax arrays contiguously. A better implementation could make use of the fact that the matrices are diagonally structured and thus improve the memory access pattern of the threads. In this work, all results were obtained using single-precision operations. To make sure that there were no deviations in the numerical solutions, we compared the results against those computed using double precision. Table 4 reveals that the precision of the arithmetic operations does not affect the numerical solutions. Many efforts have been made to solve the equations modeling the cardiac electrical activity efficiently on a single CPU as well as on clusters of computers [3,5,6,8,12]. However, the CUDA environment represents a new way to develop parallel applications. The results presented in this work show that GPUs are a promising alternative for simulating the electrical activity in the heart. The GPU allows increasing the complexity of the problems that can be modeled, and also represents a high-performance parallel system at a relatively low cost compared to a cluster of CPUs. Some important limitations of this work are: (1) only one cell model was used to draw the conclusions, whereas many other models with different numerical characteristics are available in the literature; (2) there are more suitable solvers for the system of ODEs.
Acknowledgments This work was supported by the grant P19993-N15 and SFB grant F3210-N18 from the Austrian Science Fund (FWF). Dr. Weber dos Santos thanks the support provided by FAPEMIG, CNPq and CAPES.
References 1. Plank, G., Zhou, L., Greenstein, J.L., Cortassa, S., Winslow, R.L., O’Rourke, B., Trayanova, N.A.: From mitochondrial ion channels to arrhythmias in the heart: computational techniques to bridge the spatio-temporal scales. Philos. Transact. A Math. Phys. Eng. Sci. 366, 3381–3409 (2008)
2. Plonsey, R., Barr, R.C.: Mathematical modeling of electrical activity of the heart. Electrocardiol. 20(3), 219–226 (1987) 3. Maclachlan, M.C., Sundnes, J., Spiteri, R.J.: A comparison of non-standard solvers for odes describing cellular reactions in the heart. Comput. Methods Biomech. Biomed. Engin. 10, 317–326 (2007) 4. Geselowitz, D.B., Miller, W.T.: A bidomain model for anisotropic cardiac muscle. Ann. Biomed. Eng. 11, 191–206 (1983) 5. Vigmond, E.J., Aguel, F., Trayanova, N.: Computational techniques for solving the bidomain equations in three dimensions. IEEE Trans. Biomed. Eng. 49(11), 1260–1269 (2002) 6. Santos, R.W., Plank, G., Bauer, S., Vigmond, E.J.: Parallel multigrid preconditioner for the cardiac bidomain model. IEEE Trans. Biomed. Eng. 51(11), 1960– 1968 (2004) 7. Vigmond, E.J., Santos, R.W., Prassl, A.J., Deo, M., Plank, G.: Solvers for the cardiac bidomain equations. Prog. Biophys. Mol. Biol. 96(1-3), 3–18 (2008) 8. Plank, G., Liebmann, M., Santos, R.W., Vigmond, E.J., Haase, G.: Algebraic multigrid preconditioner for the cardiac bidomain model. IEEE Trans. Biomed. Eng. 54(4), 585–596 (2007) 9. Sundnes, J., Nielsen, B.F., Mardal, K.A., Cai, X., Lines, G.T., Tveito, A.: On the computational complexity of the bidomain and the monodomain models of electrophysiology. Ann. of Bio. Eng. 34(7), 1088–1097 (2006) 10. Clark, J., Plonsey, R.: A mathematical evaluation of the core conductor model. Biophys J. 6(1), 95–112 (1966) 11. Potse, M., Dub´e, B., Vinet, A., Cardinal, R.: A comparison of monodomain and bidomain propagation models for the human heart. IEEE Trans. Biomed. Eng. 53, 2425–2435 (2006) 12. Sundnes, J., Lines, G., Tveito, A.: Efficient solution of ordinary differential equations modeling electrical activity in cardiac cells. Mathematical Biosciences 172, 55–72 (2001) 13. NVIDIA Corporation: NVIDIA CUDA Programming Guide (2009) 14. Luo, C., Rudy, Y.: A Dynamic Model of the Cardiac Ventricular Action Potential. Circulation Research 74(6), 1071–1096 (1994) 15. Sundnes, J., Lines, G.T., Tveito, A.: An operator splitting method for solving the bidomain equations coupled to a volume conductor model for the torso. Mathematical Biosciences 194, 233–248 (2005) 16. Saad, Y.: Iterative Methods for Sparse Linear Systems. PWS Publishing Company (1996) 17. Roth, B.J.: Electrical conductivity values used with the bidomain model of cardiac tissue. IEEE Trans. Biomed. Eng. 44(4), 326–328 (1997) 18. Buatois, L., Caumon, G., L´evy, B.: Concurrent number cruncher: An efficient sparse linear solver on the GPU. In: Perrott, R., Chapman, B.M., Subhlok, J., de Mello, R.F., Yang, L.T. (eds.) HPCC 2007. LNCS, vol. 4782, pp. 358–371. Springer, Heidelberg (2007)
Parallel Minimax Tree Searching on GPU

Kamil Rocki and Reiji Suda

JST CREST, Department of Computer Science, Graduate School of Information Science and Technology, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
{kamil.rocki,reiji}@is.s.u-tokyo.ac.jp

Abstract. The paper describes the results of a minimax tree searching algorithm implemented on the CUDA platform. The problem regards the move choice strategy in the game of Reversi. The parallelization scheme and performance aspects are discussed, focusing mainly on the warp divergence problem and the data transfer size. Moreover, a method of minimizing warp divergence and the resulting performance degradation is described. The paper contains the results of tests performed on multiple CPUs and GPUs. Additionally, it discusses a parallel αβ pruning implementation.
1
Introduction
This paper presents a parallelization scheme for the minimax algorithm [10], which is very widely used in searching game trees. The described implementations are based on the Reversi (Othello) game rules. Reversi is a two-person zero-sum game. Each node of the tree represents a game state and its children are the succeeding states. The best path is the path providing the best final score for a particular player. The desired search depth is set beforehand; each node at that depth is a terminal node. Additionally, each node which does not have any children is a terminal node too.
2
Implementation
2.1
Basic Minimax Algorithm
The basic algorithm used as a reference is based on the recursive minimax (negamax) tree search:

    function minimax(node, depth)
        if node is a terminal node or depth = 0
            return the heuristic value of node
        else
            α ← −∞
            foreach child of node
                α ← max(α, −minimax(child, depth − 1))
            return α
2.2
Iterative Minimax Algorithm
CUDA does not allow recursion; therefore, the GPU minimax implementation uses a modified iterative version of the algorithm [1].
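A minimal sketch of one possible iterative formulation (an illustration only, not necessarily the variant of [1]; the Node type and the is_terminal(), heuristic(), num_children() and get_child() helpers are assumptions):

    #include <math.h>
    #define MAX_DEPTH 16

    typedef struct { Node node; int depth; int next_child; float best; } Frame;

    float minimax_iterative(Node root, int depth)
    {
        Frame stack[MAX_DEPTH + 1];          /* explicit stack replaces recursion */
        int top = 0;
        stack[0].node = root; stack[0].depth = depth;
        stack[0].next_child = 0; stack[0].best = -INFINITY;

        for (;;) {
            Frame *f = &stack[top];
            if (f->depth > 0 && !is_terminal(f->node) &&
                f->next_child < num_children(f->node)) {
                /* traverse down: push the next unexplored child */
                Node c = get_child(f->node, f->next_child++);
                top++;
                stack[top].node = c; stack[top].depth = f->depth - 1;
                stack[top].next_child = 0; stack[top].best = -INFINITY;
                continue;
            }
            /* this frame is finished: a leaf, or all children explored */
            float val = (f->depth == 0 || is_terminal(f->node))
                            ? heuristic(f->node) : f->best;
            if (top == 0)
                return val;                  /* value of the root node */
            top--;                           /* traverse up */
            if (-val > stack[top].best)
                stack[top].best = -val;      /* alpha = max(alpha, -value(child)) */
        }
    }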
2.3
Parallelization
The main problem which arises regarding parallelization is the batch manner of the GPU's processing. The jobs cannot be distributed on the fly and, additionally, the communication cost is high. The number of CPU-to-GPU transfers needs to be minimized, so a large amount of data has to be processed at once. The idea of distributing the work over thousands of GPU threads is based on splitting the main tree into two parts: the upper tree of depth s, processed in a sequential manner, and the lower part of depth p, processed in parallel. The lower part is sliced into k subtrees, so that each subtree can be searched separately. Thus, the tree search process is divided into 3 steps (a host-side sketch of this scheme is given at the end of this subsection):
1. Sequential tree search starting with the root node, down to depth s. Store the leaves to a defined array/list of tasks.
2. For each of the stored tasks execute a parallel search of depth p. After the search is finished, store the search result (score) to a defined array/list of results. If k > number of GPU threads, several sequential steps are needed; this is solved by creating a queue of tasks (nodes) from which the GPU keeps loading nodes as long as tasks are present.
3. Execute the sequential tree search once more (as in step 1) and, after reaching a leaf node, read the results stored in step 2 from the array/list and calculate the main node's score.
Parallelization brings an overhead of executing the sequential part twice. Therefore, a proper choice of the values p and s is very important. Depth s should be minimized to decrease that cost, while still producing enough leaves (value k) to keep every GPU thread busy.
Fig. 1. Parallelization: the root node is expanded sequentially to depth s, producing K leaves; these become the K root nodes of subtrees searched in parallel to depth p, producing L leaves
Hence, the following condition should be met: k ≥ number of threads. The value k cannot be determined beforehand due to the variable branching factor of each tree. One way of solving this is estimation; the other is unrolling the root node until the condition is met, which means searching the tree gradually deeper until k is large enough.
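A minimal host-side sketch of the three-step scheme (an illustration only, not the authors' code; the Board type, the kernel and helper names, and the launch configuration are assumptions):

    /* Step 1: expand the root sequentially to depth s, storing the K leaves. */
    int K = expand_sequential(root, s, task_queue);

    /* Step 2: search the subtrees on the GPU, p levels each, in batches of at
     * most n_threads tasks taken from the queue; scores return in results[]. */
    for (int first = 0; first < K; first += n_threads) {
        int batch = (K - first < n_threads) ? (K - first) : n_threads;
        cudaMemcpy(d_tasks, task_queue + first, batch * sizeof(Board),
                   cudaMemcpyHostToDevice);
        search_kernel<<<n_blocks, threads_per_block>>>(d_tasks, d_scores, batch, p);
        cudaMemcpy(results + first, d_scores, batch * sizeof(int),
                   cudaMemcpyDeviceToHost);
    }

    /* Step 3: repeat the sequential search; at each leaf of the upper tree the
     * precomputed score is read from results[] instead of searching deeper.   */
    int root_score = resolve_sequential(root, s, results);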
2.4
Modifications
A straightforward port of minimax to CUDA code and its parallel execution did not show a significant speedup compared to the sequential CPU execution time. One of the reasons for the weak performance is the SIMD way of data processing within each warp. The main feature of the considered group of trees is the variable branching factor, which depends on the actual game state. The approximate average branching factor of a single tree node was calculated to be 8.7.
2.5
Warp Divergence Problem
Every thread of each warp performs the same operation at a given moment. Whenever a change of control flow occurs within one warp (such as if-else instructions), 'warp divergence' occurs: all the threads have to wait until the if-else conditional operations are finished. Each root node processed in a parallel manner represents a different board state, therefore every searched tree may have a different structure. The iterative tree searching algorithm is based on one main outer loop and 2 inner loops (traversing up and down). Depending on the number of children of each node, the moment at which a loop break occurs varies, and this causes divergence within each warp. The ideal case, when no divergence is present, was estimated by calculating the average time of searching the same root node by each thread. The result proved that reducing divergence could significantly improve the search performance (reduce the time needed).
2.6
Improving the Performance
A method of reducing warp divergence has been implemented. Decreasing the number of different root nodes in each warp minimizes the divergence, but also results in longer processing time. A way of achieving parallel performance while keeping warp divergence low is to modify the algorithm so that the parallel processing parts are placed where no conditional code exists and warps are allowed to process shared data (each warp should process one root node). In the analyzed case each node represents a game board. Each game board consists of 64 fields. Processing a node proceeds as follows:
1. Load the board. If the node is not terminal, then for each field:
   a. If a move is possible, generate a child node
   b. Save the list of children (after processing each field)
2. Else calculate the score; then for each field:
   a. Get the value (score)
   b. Add it to or subtract it from the global score depending on the value, and save the board score
Operations performed for each field can be done in parallel. Because each warp consists of 32 threads, it is not difficult to distribute the work. Each thread in a warp searches the same tree, while analyzing different parts of a node. In order to implement the described method, both shared memory and atomic operations have to be used. Each warp has a common (shared) temporary memory for storing the children generated by its threads and for storing the score. CUDA atomic operations are needed to synchronize the data accesses of the threads; e.g., adding a value to the shared score involves reading the memory, adding the value and then storing the updated score, and without such synchronization one thread could easily overwrite a value previously stored by another one. Warps can also be divided into smaller subgroups to dedicate fewer threads to one node; i.e., if each group of 16 threads processes one node, then one warp searches 2 boards in parallel. Further on, the results of dividing the warps into groups of 8 and 16 are also presented. Summarizing, for each node, the modified algorithm works as follows:
1. Read the node
2. If the node is a leaf, calculate the score in parallel – function calculateScore(); join (synchronize) the threads and return the result
3. Else generate the children in parallel – function getChildren(); join (synchronize) the threads and save the children list
4. Process each child of that node
The described procedure decreases thread divergence but does not eliminate it; some flow-control instructions are still present while analyzing the possible move for each board field. Many papers describe the use of load-balancing algorithms in detail in order to increase performance; here the GPU's Thread Execution Manager performs that task automatically. Provided that a job is divided into sufficiently many parts, an idle processor will be instantly fed with waiting jobs.
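A minimal sketch of what the parallel score calculation might look like (an illustration, not the authors' code; the Board type, the fieldScore() helper and the thread-to-field mapping are assumptions; atomicAdd on shared memory requires compute capability 1.2 or higher, which the GTX 280 provides):

    __device__ void calculateScore(const Board *board, int *warpScore /* in shared memory */)
    {
        int lane = threadIdx.x & 31;        /* index of this thread within its warp */
        /* 64 board fields, 32 threads per warp: each thread handles two fields */
        for (int f = lane; f < 64; f += 32) {
            int v = fieldScore(board, f);   /* signed contribution of field f */
            atomicAdd(warpScore, v);        /* synchronized update of the shared score */
        }
        /* the caller joins (synchronizes) the threads before reading the score */
    }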
Fig. 2. Score calculation: the parts (fields) of a node are processed by separate threads, whose contributions are accumulated into the score with an atomic add
3
Results
Each value obtained is an average over searches of 1000 nodes, generated by continuously playing Reversi with random moves; thus, the result may be regarded as an approximation of the average case.
3.1
CPU
First, the basic parallelized algorithm was tested on an 8-way Xeon E5540 machine to check the correctness and to obtain a reference for further GPU results. Figure 3 shows the advantage of using several CPU cores to search a tree of depth 11, where 7 levels are solved sequentially and 4 levels in parallel. In this case, each CPU thread processed one node at a time from a given node queue.
Fig. 3. CPU results: speedup of the parallelized algorithm on the multicore Xeon machine, compared to a single core, as a function of the number of cores (actual result vs. ideal speedup)
3.2
GPU
Fig. 4. GPU results – no divergence: speedup over a single CPU for depth 12 (8 levels sequential, 4 parallel), depth 11 (8 sequential, 3 parallel) and depth 10 (8 sequential, 2 parallel), on 1 and 2 GPUs
Fig. 5. GPU results: speedup over a single CPU for the no-divergence and reduced-divergence variants at depth 12 (8 levels sequential, 4 parallel) and depth 11 (8 sequential, 3 parallel), on 1 and 2 GPUs
Fig. 6. GPU results: performance comparison of the implemented algorithms when searching a tree of depth 11 (no divergence; 32, 16 and 8 threads per board; 1 thread per board with 9 sequential / 2 parallel and with 8 sequential / 3 parallel levels), on 1 and 2 GPUs
GPU tests were performed on a machine equipped with a 2.83 GHz Q9550 Intel CPU and 2 GeForce 280 GTX GPUs. To ensure that the GPU is kept busy, the configuration was set to 128 threads per block (4 warps) and 240 blocks in total, which equals 240 / 30 = 8 blocks and 128 * 8 = 1024 threads per physical multiprocessor (MP) – the maximum that can run concurrently on a GTX 280. For reference, the single-core CPU time is 1070 s for depth 11 and 115 s for depth 10.
4
Analysis
The CPU results show that the overhead of the parallel algorithm compared to the standard one is minimal. The GPU results present a significant improvement in the case when no divergence is present (average speed of searching the same node by each thread). For instance, when the sequential depth is set to 8 and the parallel depth to 2 we observe a 20x (20.2) speedup; for a parallel depth of 3, nearly a 27x (26.7) speedup for a single GTX 280 GPU; and over 30x (32.6) for a parallel depth of 4. This can be explained by the communication cost overhead: solving more on the GPU at once saves time spent on transferring the data. The algorithm also scales well to 2 GPUs, giving a 55x (55.2) speedup in the best case of parallel depth 4. When a different node/board is analyzed by each thread, the results are much worse, the basic case showing only a 1.4x speedup for a single GPU
and 2.7x for 2 GTXs. These are the results for a sequential/parallel depth of 8/3. The performance increases slightly when the parallel depth is decreased to 2 and the sequential depth increased to 9: the observed speedup was then 3x for a single GPU and 5.8x for 2 GPUs. Nevertheless, such a way of improving the effectiveness is not applicable when the sequential depth is greater (too many nodes to store), and the increase in performance is still insignificant. Analyzing the reduced-thread-divergence version of the algorithm, we can observe that the performance increases as the number of threads dedicated to solving a node grows. For a single GPU, speedups of 5.3x/7.1x/7.7x and for 2 GPUs 10.5x/13.6x/14.8x are observed, respectively. This is still only approximately 1/3 of the ideal case, but the performance is much better than the basic algorithm's. Moreover, increasing the parallel depth from 3 to 4 also resulted in a performance increase, as in the no-divergence example. The second important factor observed was the data turnaround size. Tree searching is a problem where most of the time is spent on load/store operations. To estimate the data size for a tree of depth 11, if the average branching factor equals 8.7, the average total number of nodes in a tree is Σ_{k=0}^{10} (8.7)^k ≈ 2.8 × 10^9. In the presented implementation mostly bitwise operations were used to save space, resulting in 16 B of memory for a single node/board representation. Therefore at least 16 B × 2.8 × 10^9 = 44.8 GB of data had to be loaded/stored (not including the other data loaded while analyzing the board). While implementing the algorithm, decreasing the size of a single node from a naive 64 B board to a bit-represented one [3] increased the overall speed ∼10 times.
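As an illustration, a 16 B bit-packed board of the kind referred to above could be represented by two 64-bit occupancy masks (this particular layout is an assumption, not necessarily the representation used here):

    #include <stdint.h>

    /* one bit per field and per color: 2 * 8 bytes = 16 B per node/board */
    typedef struct {
        uint64_t black;   /* bit i set if field i holds a black disc */
        uint64_t white;   /* bit i set if field i holds a white disc */
    } Board;

    /* typical bitwise queries on such a board */
    static inline uint64_t occupied(const Board *b) { return b->black | b->white; }
    static inline uint64_t empty(const Board *b)    { return ~occupied(b); }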
5
Possible Improvements
A further decrease in warp divergence could result in an additional speedup. A general solution for every non-uniform tree could be branch reordering, so that the search process starts in the branch with the largest number of children and ends in the one with the smallest. If a task cannot be parallelized in the way presented in this paper, methods of improving SIMD performance with MIMD algorithms are described in [8][9]. A parallel node pruning (i.e., αβ) algorithm might be implemented [2][4][5][6]; however, the batch way of GPU processing implies serious difficulties in synchronizing the jobs processed in parallel. Moreover, the inter-thread communication is very limited (only within the same block). In some papers, SIMD αβ implementations are presented, but they differ from this case. E.g., in [6] no particular problem is being solved; moreover, the tree is synthetic, with a constant branching factor.
6
Conclusions
Considering the results obtained, the GPU, as a group of SIMD processors, performs well in the task of tree searching where the branching factor is constant. Otherwise, warp divergence becomes a serious problem and should be minimized, either by parallelizing the parts of the algorithm that are processed in a SIMD way or by
dividing the main task into several smaller subtasks [8][9]. A direct implementation of the algorithm did not produce any significant improvement over the sequential CPU execution, or even resulted in worse performance. The GPU outperforms a single CPU if a high level of parallelism is achieved (here, many nodes have to be searched in parallel). For 4 levels searched in parallel, a single GTX 280 GPU performs the task approximately 32 times faster than a single CPU core if no divergence is present; 2 GPUs give a 55x speedup factor (that is, 72% faster than a single GPU). Regarding the modified algorithm, the values are a 7.7x speedup for a single GPU and 14.8x for 2 GTXs (92% faster). This shows that the algorithm could easily be scaled to an even larger number of GPUs. One of the main disadvantages of the GPU implementation is that modifications like αβ pruning are hard to implement; the limited communication and the necessity to launch all the threads at once cause the difficulty. Another aspect of the analyzed approach is the possibility of concurrent GPU and CPU operation: while the GPU processes bigger chunks of data (thousands of nodes) at once, each CPU searches other nodes that are waiting in the queue.
References [1] Knuth, D.E., Moore, R.W.: An analysis of alpha-beta pruning. Artificial Intelligence 6, 293–326 (1975) [2] Manohararajah, V.: Parallel Alpha-Beta Search on Shared Memory Multiprocessors, Master Thesis, Graduate Department of Electrical and Computer Engineering, University of Toronto (2001) [3] Warren, H.S.: Hacker’s Delight. Addison-Wesley, Reading (2002) [4] Schaeffer, J.: Improved Parallel Alpha Beta Search. In: Proceedings of 1986 ACM Fall Joint Computer Conference (1986) [5] Borovska, P., Lazarova, M.: Efficiency of Parallel Minimax Algorithm for Game Tree Search. In: Proceedings of the International Conference on Computer Systems and Technologies (2007) [6] Schaeffer, J., Brockington, M.G.: The APHID Parallel αβ algorithm. In: Proceedings of the 8th IEEE Symposium on Parallel and Distributed Processing, p. 428 (1996) [7] Hewett, R., Ganesan, K.: Consistent Linear speedup in Parallel Alpha-beta Search. In: Proceedings of the ICCI 1992, Fourth International Conference on Computing and Information, pp. 237–240 (1992) [8] Hopp, H., Sanders, P.: Parallel Game Tree Search on SIMD Machines. In: Ferreira, A., Rolim, J.D.P. (eds.) IRREGULAR 1995. LNCS, vol. 980, pp. 349–361. Springer, Heidelberg (1995) [9] Sanders, P.: Efficient Emulation of MIMD behavior on SIMD Machines. In: Proceedings of the International Conference on Massively Parallel Processing Applications and Development, pp. 313–321 (1995) [10] Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach, pp. 163–171. Prentice Hall, Englewood Cliffs (2003)
A Fast GPU Implementation for Solving Sparse Ill-Posed Linear Equation Systems

Florian Stock and Andreas Koch

Embedded Systems and Applications Group, Technische Universität Darmstadt
{stock,koch}@eis.cs.tu-darmstadt.de
Abstract. Image reconstruction, a very compute-intense process in general, can often be reduced to large linear equation systems represented as sparse under-determined matrices. Solvers for these equation systems (not restricted to image reconstruction) spend most of their time in sparse matrix-vector multiplications (SpMV). In this paper we will present a GPU-accelerated scheme for a Conjugate Gradient (CG) solver, with focus on the SpMV. We will discuss and quantify the optimizations employed to achieve a soft-real time constraint as well as alternative solutions relying on FPGAs, the Cell Broadband Engine, a highly optimized SSE-based software implementation, and other GPU SpMV implementations.
1
Introduction and Problem Description
In this work, we describe the solution of a practical reconstruction problem under soft-real time constraints as required by an industrial embedded computing use-case. For greater generality, we have abstracted away the individual problem details and concentrate on the solution itself, namely the reconstruction of voxel information from the sensor data. For the specific use-case, this process requires the solution of a linear equation system with up to 250 × 10^6 elements (depending on the image size). However, due to practical limitations of the sensor system (e.g., due to unavoidable mechanical inaccuracies), the gathered data does not allow perfect reconstruction of the original image and leads to a strongly ill-posed linear equation system. To achieve acceptable image quality, a domain-specific regularization, which narrows the solution space, has to be employed. It expresses additional requirements on the solution (such as ∀i : xi ≤ 0) and is applied during each step of the conjugate gradient (CG) method, allowing the reconstruction of suitable-quality images in ≈ 400 CG iterations. It is the need for this regularization vector F as a correction term that makes other, generally faster equation solving methods (see Section 2) unsuitable for this specific problem. The matrix A, which represents the linear equation system, has up to 855000 nonzero elements. These nonzero elements comprise 0.3 − 0.4% of all elements. If stored as single-precision numbers in a non-sparse manner, it would require more than 2 GB of memory.
Table 1. Loop body of the CG algorithm. F is the regularizing correction term.
e_k = A^T A d_k + F(x_k, d_k)
f_k = d_k^T e_k
α_k = −(g_k^T e_k) / f_k
x_{k+1} = x_k + α_k d_k
g_{k+1} = A^T (A x_{k+1} − b) + F(x_k, x_{k+1})
β_k = (g_{k+1}^T e_k) / f_k
d_{k+1} = −g_{k+1} + β_k d_k
Fig. 1. CUDA architecture
A number of storage formats are available for expressing sparse matrices more compactly (e.g., ELLPACK, CSR, CSC, jagged diagonal format [1]). For our case, as the matrix is generated row-by-row by the sensors, the most suitable approach is the compressed sparse row format (CSR, sometimes also referred to as compressed row storage, CRS). For our application, this reduces the required storage for A to just 7 MB. As the CG solver demands a positive semi-definite matrix, we use the CGNR (CG Normal Residual) approach [1] and left-multiply the equation system with A^T. Hence, we do not solve Ax = b, but A^T A x = A^T b. Due to the higher condition number κ of the matrix A^T A, with κ(A^T A) = κ(A)^2, this will however result in a slower convergence of the iterative solver. Table 1 shows the pseudo-code for the modified CG algorithm. Computing 400 iterations of this CG (size of matrix A 320,000 × 3,000 with 1,800,000 nonzero entries) requires 15 GFLOPS and a bandwidth of 105 GB/s. The soft-real time constraint of the industrial application requires the reconstruction of four different images (taken by different sensors) in 0.44 seconds. However, since these images are independent, they can be computed in parallel, allowing 0.44 s for the reconstruction of a single image. The core of this work is the implementation of the modified CG algorithm on a GPU, with a focus on the matrix multiplication. The techniques shown will be applicable to the efficient handling of sparse matrix problems in general.
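For concreteness, the CSR data of A might be laid out as follows (a sketch with illustrative names; with roughly 855,000 nonzeros, the value and column-index arrays alone occupy about 855,000 × (4 B + 4 B) ≈ 6.8 MB, consistent with the 7 MB quoted above):

    typedef struct {
        int    n_rows;
        int   *Ap;   /* n_rows + 1 row pointers: row i spans Ax[Ap[i] .. Ap[i+1]-1] */
        int   *Aj;   /* column index of each nonzero element                        */
        float *Ax;   /* value of each nonzero element (single precision)            */
    } CsrMatrix;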
1.1
GPGPU Computing and the CUDA System
With continued refinement and growing flexibility, graphics processing units (GPUs) can now be applied to general-purpose computing scenarios (GPGPU), often outperforming conventional processors. Nowadays, the architecture of modern GPUs consists of arrays of hundreds of flexible general-purpose processing units supporting threaded computation models and random-access memories. Dedicated languages and programming tools to exploit this many-core paradigm include the manufacturer-specific flows CUDA (described below, by NVIDIA) and Firestream [2] by ATI/AMD, as well as hardware-independent ones like the new OpenCL [3]. For our work, we target NVIDIA GPUs, specifically the GTS 8800 models and GTX 280. They are programmed using the Compute Unified Device
Architecture (CUDA) development flow. Figure 1 shows the block diagram of such a CUDA-supported GPU. The computation is done by multiprocessors, which consist of eight scalar processors operating in SIMD (Single Instruction Multiple Data) mode. Each of the scalar processors can execute multiple threads, which it schedules without overhead. This thread scheduling is used to hide memory latencies in each thread. Each multiprocessor furthermore has a small shared memory (16 KB), which, like the device global memory, is randomly accessible. Accesses to the global memory, however, require for good overall throughput that consecutive threads access consecutive memory addresses (such accesses are called coalesced); other access patterns carry a significant performance penalty. An entire computation, a so-called kernel, is instantiated on the GPU device as multiple blocks. These are then executed in arbitrary order, ideally in parallel, but may be serialized, e.g., when the number of blocks exceeds the number of multiprocessors available on the device. The block structure must be chosen carefully by the programmer, since no fast inter-block communication is possible. A block is assigned atomically for execution on a multiprocessor, which then processes the contained threads in parallel in a SIMD manner. Note that the threads within a block may communicate quickly via the shared memory and can be efficiently synchronized. At invocation time, a kernel is parametrized with the number of blocks it should execute in as well as the number of threads within each block (see [4] for more details). For high performance computing (HPC) applications the number of threads is usually much larger than the number of multiprocessors.
2
Related Work
A large number of methods can be used for solving linear equations. They are often classified into different types of algorithms, such as direct methods (e.g., LU decomposition), multigrid methods (e.g., AMG) and iterative methods (e.g., Krylov subspace; see [1]). Due to the need to compensate for sensor limitations by the correction term F (see Section 1), an appropriately modified iterative CG method fits our requirements best. The computation of the sparse matrix-vector product (SpMV) has long been known to be the critical operation of the CG method. Given the importance of matrix multiplication in general, both to this and many other important numerical problems, it is worthwhile to put effort towards fast implementations. The inherent high degree of parallelism makes it an attractive candidate for acceleration on the highly parallel GPU architecture. For the multiplication of dense matrices, a number of high-performance GPU implementations have already been developed [5,6]. A very good overview and analysis of sparse multiplications is given in [7]. Different storage formats and methods for CUDA-implemented SpMV are evaluated and compared there, but the focus is on much larger matrices and the different storage formats. Furthermore,
NVIDIA supplies the segmented scan library CUDPP with a segmented scan-based SpMV implementation (described in [8]). Another similar work, which focuses on larger matrices, is presented in [9], where two storage formats on different platforms (different GPGPUs and CPU) are compared. On a more abstract level, the CG algorithm in its entirety is best accelerated by reducing the number of iterations. This is an effective and popular method, but often requires the exploitation of application-specific properties (e.g., [10]). Beyond our application-specific improvement using the F correction term, we have also attempted to speed up convergence by using one of the few domain-independent methods, namely an incomplete LU pre-conditioner [1]. However, with the need to apply F between iterations, this did not improve convergence.
3
Implementation
Due to the memory bottleneck between the GPU and the host, and the, in our case significant, fixed-time overhead of starting a kernel on the GPU [11], we have to design our implementation carefully to actually achieve wall-clock speed-ups over conventional CPUs.
3.1
Sparse Matrix Vector Multiplications (SpMV)
As we focus in this paper on the SpMV, we implemented and evaluated a number of different ways to perform SpMVs for our application on the GPU.

Variant Simple. This baseline implementation uses a separate thread to compute the scalar product of one row of the matrix and the vector. As described before, the matrix is stored in CSR format and the vector as an array. This arrangement, while obvious, has the disadvantage that for consecutive thread IDs neither the accesses to the matrix nor those to the vector are coalesced, which leads to a dramatically reduced data throughput.

Variant MemoryLayout. This first GPU-specific optimization uses an altered, ELLPACK-like memory layout for the matrix (see e.g. [7] for more details on the format). By using a transposed structure, we can arrange the matrix data so that consecutive threads find their data at consecutive addresses. This is achieved by having k separate index and value arrays, with k being the maximum number of nonzero elements in a row; index_i[j] and value_i[j] hold the i-th nonzero index and value of row j. Since in our case A has a uniform distribution of nonzero elements, this optimization has little impact on the required memory.

Variant LocalMemory. The next variant attempts to achieve better coalescence of memory accesses by employing the shared memory. Since shared memory supports arbitrary accesses without a performance penalty, we transfer chunks of data from/to global memory to/from shared memory using coalesced accesses and perform the non-coalesced accesses penalty-free within shared memory. Due to
the strongly under-determined nature of our equation system, the result vector of A · x is much smaller than x and fits completely into the local memory for the multiplication with A^T. However, the complete vector x does not fit into shared memory for the actual multiplication with A. Thus, the operation has to be split into sub-steps. From the size of the local memory (16 KB), the maximum size m_v of a vector that fits into local memory can be computed. Mathematically, a thread j computes Σ_{i=1}^{n} a_{ji} x_i, which is equivalent to Σ_{i=1}^{m_v} a_{ji} x_i + Σ_{i=m_v+1}^{2m_v} a_{ji} x_i + ... . These sums can now be treated as separate SpMV problems, in which the sub-vectors fit completely into the local memory.

Variant OnTheFly. This variant trades a higher number of operations for reduced memory bandwidth. The correction term represented by the matrix F used for the image reconstruction is the result of a parametrized computation that allows the separate computation of each of the rows of F. Thus, each thread is able to compute the indices and values it needs and only has to fetch from memory the corresponding vector component. As this on-the-fly generation of the matrix F is only possible row by row, but not column by column, this variant can only be used for the multiplication with A, but not for the multiplication with A^T. This variant can be combined with the previous one, generating only segments of the matrix row and multiplying them with a partial vector which is buffered in the local memory.

Variant Backprojection. The last of our implementation variants tries to reverse the matrix multiplication. Typically, all threads read each component of the x vector multiple times, but write each component of the result vector just once. The idea is to reverse this procedure and read each component of x just once (i.e., thread i would read component x_i, which could then be read coalesced) and write multiple times to the result components (non-coalesced). While the same result is computed, the altered flow of the computation would allow a very efficient memory layout: recall from the LocalMemory variant that the result y of A · x would entirely fit into shared memory. Thus, the non-coalesced writes to the result components y_j would carry no performance penalty. Instead, the expensive non-coalesced reads of the x_i from main memory could be turned into coalesced ones. However, this promising approach had to be discarded due to limitations of the current CUDA hardware: parallel threads would have to correctly update y_j by accumulating their local results. To ensure correctness, this would require an atomic update operation that is simply not supported for floating point numbers at this time.

Variant Prior Work. As mentioned in Section 2, only limited work on GPU-accelerated SpMV has been published. For comparison with our kernels, we evaluated all available implementations (i.e. CUDPP [8] and the kernels from Bell and Garland [7]) with our matrices.
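As an illustration of the MemoryLayout variant described above, such a kernel might look as follows (a sketch, not the authors' code; the k index/value arrays are flattened here into single arrays of length k·n with stride n, and padded entries are assumed to carry a zero value and a valid column index):

    /* One thread per row: for slot i, entry i*n + row is read, so threads with
     * consecutive row indices access consecutive addresses (coalesced).        */
    __global__ void spmv_ell_transposed(int n, int k, const int *index,
                                        const float *value, const float *x,
                                        float *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n) {
            float sum = 0.0f;
            for (int i = 0; i < k; i++) {
                int col = index[i * n + row];
                sum += value[i * n + row] * x[col];
            }
            y[row] = sum;
        }
    }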
We follow the scheme of Bell and Garland, which classifies the algorithms according to matrix storage layout and multiplication method. In this scheme, CUDPP would be a so-called COO method. As the details of the different groups and kernels are beyond the scope of this paper, we can only refer to the original works. The group of DIA kernels was left out, as they operate only on diagonally structured matrices.

Variant Xeon CPU using SSE Vector Instructions. To evaluate the performance of our GPU acceleration against a modern CPU, we also evaluated a very carefully tuned software SpMV implementation running on a 3.2 GHz Xeon CPU with 6 MB cache. The software version was written to fully exploit the SSE vector instructions and compiled with the highly optimizing Intel C compiler.

Kernel Invocation Overhead. As already described in Section 2, we expected to deal with a long kernel invocation delay. We measured the overhead of starting a kernel on the GPU as 20 μs for an NVIDIA GTS 8800 512 and 40 μs for an NVIDIA GTX 280. One possible explanation for the increased overhead on the larger GTX card could be the doubled number of multiprocessors (compared to the GTS card) and a correspondingly longer time to initialize all of them. When composing an iteration of the CG algorithm (see Table 1) from kernels for the individual operations, there will be 14 kernel invocations per iteration, translating to 5600 kernel starts over the 400 iterations required to converge. This would require a total of 0.11 s on the GTS-series GPU (one fourth of our entire timing budget). To lessen this invocation overhead, we manually performed loop fusion to merge calls to the same kernels (but operating on different data) into a single kernel (e.g., performing two consecutive scalar products not with two kernel invocations, but with one invocation of a special kernel doing both products; see the sketch at the end of this subsection). In addition to reducing the call overhead, we gain additional performance by loading data that is used in both fused computations only once from memory. In this manner, the number of kernel invocations is reduced from 14 to just six per iteration, now taking a total of 0.048 s.

Resources. All kernels were invoked with at least as many blocks as multiprocessors were available on each GPU, and on each multiprocessor with as many threads as possible to hide memory latencies. Only the matrix-on-the-fly and the correction term kernels could not execute the maximum of 512 threads per block due to excessive register requirements. Nevertheless, even these kernels could still run 256 threads per block.
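A minimal sketch of such a fused kernel, computing the two scalar products d·e and g·e of one CG iteration in a single launch (an illustration under the assumption of a single block of 256 threads, not the authors' code):

    __global__ void fused_dot_products(int n, const float *d, const float *g,
                                       const float *e, float *f, float *ge)
    {
        __shared__ float s1[256], s2[256];
        int tid = threadIdx.x;
        float a = 0.0f, b = 0.0f;
        for (int i = tid; i < n; i += blockDim.x) {
            float ei = e[i];               /* e is loaded only once for both products */
            a += d[i] * ei;
            b += g[i] * ei;
        }
        s1[tid] = a; s2[tid] = b;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {   /* tree reduction */
            if (tid < s) { s1[tid] += s1[tid + s]; s2[tid] += s2[tid + s]; }
            __syncthreads();
        }
        if (tid == 0) { *f = s1[0]; *ge = s2[0]; }       /* both results, one launch */
    }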
3.2
Alternate Implementations - Different Target Technologies
Beyond the GPU, other platforms, namely FPGA and Cell, were considered, but dismissed in an early stage.
The required compute performance of 15 GFLOPS could easily be achieved using an FPGA-based accelerator, which would most likely also be more energy efficient than the GPGPU solution. However, the required memory bandwidth becomes a problem here. The Virtex 5 series of FPGAs by Xilinx [12] is typical of modern devices. Even the largest packages currently available have an insufficient number of I/O pins to connect the number of memory banks required to achieve the memory bandwidth demanded by the application: A single 64b wide DDR3 DRAM bank, clocked at 400 MHz, could deliver a peak transfer rate of 6.4 GB/s and requires 112 pins on the FPGA. A large package XC5VLX110 device can connect at most to six such banks with a total throughput of < 40 GB/s, which is insufficient. Multi-chip solutions would of course be possible, but quickly become uneconomical. Similar bandwidth problems exist when considering IBM’s Cell Broadband Engine (Cell BE). The Cell BE is a streaming architecture, where eight streaming Synergistic Processing Elements are controlled by a Power Processor. Memory IO is here also limited by a theoretical maximum bandwidth of 25.6 GB/s ([13], which also gives some performance numbers on SpMV). Again, one could of course use a set of tightly-coupled Cell BEs. But since even a single Cell blade is considerably more expensive than multiple graphics cards, such a solution would also be uneconomical in a commercial setting.
4
Experimental Evaluation
We used the following hardware to experimentally evaluate our approach:
– Host CPU, also used for evaluating the software version: Intel Xeon with 6 MB cache, clocked at 3.2 GHz
– NVIDIA GTX 280 GPU (16 KB shared memory, 30 multiprocessors, 1300 MHz) with GT200 chip
– NVIDIA 8800 GTS 512 (16 KB shared memory, 16 multiprocessors, 1620 MHz) with G92 chip; due to power constraints in the embedded system this was the target platform
Table 2 shows the run times of the different variants. The measurements were taken using the NVIDIA CUDA profiler, so the data is only valid for comparison with other profiled runs. Depending on the SpMV (Ax or A^T y), the variants perform differently: for the transposed matrix multiplication, the best variant is the method LocalMemory combined with MemoryLayout (not using the OnTheFly technique). The measurements include the time to transfer the vector into the shared memory prior to the operation, where it fits completely; thus, the random accesses into the vector are not penalized (in comparison to those into global device memory). For the multiplication with the non-transposed matrix, the variant OnTheFly computing the whole row (but not using the LocalMemory technique) is most efficient. The effect of non-coalesced random accesses could be reduced even further by utilizing the texture cache (increasing performance by 15%).
Table 2. Performance of the different SpMV variants, measured for Ax and A^T y (A is 81,545 × 3,072, containing 854,129 nonzero elements)

optimization                 | Ax time [µs] | A^T y time [µs]
Simple                       | 13,356       | 1,744
MemoryLayout                 | 1,726        | n/a
MemoryLayout & LocalMemory   | n/a          | 160
OnTheFly                     | 400          | 11,292,080
OnTheFly & LocalMemory       | 807          | n/a
Prior Work DIA               | n/a          | n/a
Prior Work ELL               | 606          | 198
Prior Work CSR               | 1,183        | 1,439
Prior Work COO/CUDPP         | 750          | 549
Prior Work HYB               | 793          | 289
The variant LocalMemory in itself performs poorly for our application, due to the very specific structure of the matrix A: although the number of nonzero elements is uniform, i.e. approximately the same in each row, their distribution differs. If A is subdivided into s sub-matrices with at most as many columns as there are elements of the partial vector in the shared memory, the number of nonzero elements in the rows of the sub-matrices is no longer uniform. The longest path for our SIMD computation on a multiprocessor is determined by the largest number of nonzero values in a row processed by one of the threads. Since this may be a different row in each of the sub-matrices, the worst-case execution time of this variant may be s times as long as without the LocalMemory modification. The last block of results in Table 2 shows the different runtimes of the CUDPP kernel and the kernels from [7]. For each group, we show the best time from all kernels of this group. As the numbers indicate, our best implementations are faster than these other kernels. This is due to our algorithm being specialized for the given matrix structure. In addition to the profiler-based measurements reported above, we also evaluated the performance of our GPU implementation of the complete CG at the system level, comparing to the SSE-optimized software version. As a further contender, the GTX 280 GPU was used. These tests also encompass the time to transfer the data/matrix from host memory to the GPU device memory. Since only limited amounts of data have to be exchanged during an iteration, these transfers are relatively short. Figure 2 shows the system-level computation time for different image sizes (expressed as number of nonzero elements). For very small matrices, the SSE CPU does actually outperform the GPU (due to data transfer and kernel invocation overheads). However, as soon as the data size exceeds the CPU caches (the knee in the SSE line at ca. 450K nonzero elements), the performance of the CPU deteriorates significantly. Remarkably, the G92-class GPU actually outperforms its more modern GT200-class sibling. Only when the matrices get much larger (and exceed the size of the
Fig. 2. Execution time of different implementations as a function of the number of nonzero elements
images required by our application) does the GT200 solve the problem faster. We trace this to two causes: first, the G92 is simply clocked higher than the GT200, and the larger number of multiprocessors and increased memory bandwidth of the GT200 come into play only for much larger images; second, the G92 spends just 10% of its total execution time in kernel invocation overhead, while the GT200 requires 20%. Going back to the requirements of our industrial application: our aim of reconstructing images of the given size in just 0.44 s was not fully achievable on current-generation GPUs. However, our implementation managed to restore images at 70% of the initially specified resolution in time, which proved sufficient for practical use. In this setting, the GPU achieves a peak performance of 1 GFLOPS and 43 GB/s memory bandwidth. The GTS used has a maximum memory bandwidth of 48 GB/s (measured with the bandwidth test included in the CUDA SDK), so we reach 87% of the maximum.
5
Conclusion
We have shown that a GPGPU approach can be viable not only for huge high-performance computing problems, but also for practical embedded applications handling much smaller data sets. The advantage of the GPU even over fast CPUs continues to grow with increasing data set size. Thus, with the trend towards higher resolutions, GPU use will become more widespread in practical embedded applications. Apart from our concrete application, we implemented a very efficient sparse matrix-vector multiplication for both the non-transposed and transposed forms of a matrix, outperforming reference implementations for the GPU as well as a highly optimized SSE software version running on a state-of-the-art processor.
It is our hope that future GPUs will reduce the kernel invocation overhead, which really dominates execution time for smaller data sets, and also introduce atomic update operations for floating point numbers. The latter would allow new data and thread organization strategies to further reduce memory latencies.
Acknowledgements Thanks to our industrial partner for the fruitful cooperation.
References 1. Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. Society for Industrial and Applied Mathematics, Philadelphia (2003) 2. ATI: AMD Stream Computing - Technical Overview. ATI (2008) 3. Khronos Group: OpenCL Specification 1.0 (June 2008) 4. NVIDIA Corp.: NVIDIA CUDA Compute Unified Device Architecture – Programming Guide (June 2007) 5. Kr¨ uger, J., Westermann, R.: Linear algebra operators for gpu implementation of numerical algorithms. In: SIGGRAPH 2003: ACM SIGGRAPH 2003 Papers, pp. 908–916. ACM, New York (2003) 6. Larsen, E.S., McAllister, D.: Fast matrix multiplies using graphics hardware. In: Supercomputing 2001: Proceedings of the 2001 ACM/IEEE Conference on Supercomputing (CDROM), p. 55. ACM, New York (2001) 7. Bell, N., Garland, M.: Efficient sparse matrix-vector multiplication on CUDA. NVIDIA Technical Report NVR-2008-004, NVIDIA Corporation (December 2008) 8. Sengupta, S., Harris, M., Zhang, Y., Owens, J.D.: Scan primitives for gpu computing. In: GH 2007: Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, Aire-la-Ville, pp. 97–106. Eurographics Association, Switzerland (2007) 9. Buatois, L., Caumon, G., Levy, B.: Concurrent number cruncher: a gpu implementation of a general sparse linear solver. Int. J. Parallel Emerg. Distrib. Syst. 24(3), 205–223 (2009) 10. Roux, F.X.: Acceleration of the outer conjugate gradient by reorthogonalization for a domain decomposition method for structural analysis problems. In: ICS 1989: Proceedings of the 3rd International Conference on Supercomputing, pp. 471–476. ACM, New York (1989) 11. Bolz, J., Farmer, I., Grinspun, E., Schr¨ ooder, P.: Sparse matrix solvers on the gpu: conjugate gradients and multigrid. In: SIGGRAPH 2003: ACM SIGGRAPH 2003 Papers, pp. 917–924. ACM, New York (2003) 12. Xilinx: Virtex 5 Family Overview. Xilinx (2008) 13. Williams, S., Shalf, J., Oliker, L., Kamil, S., Husbands, P., Yelick, K.: The potential of the cell processor for scientific computing. In: CF 2006: Proceedings of the 3rd Conference on Computing Frontiers, pp. 9–20. ACM Press, New York (2006)
Monte Carlo Simulations of Spin Glass Systems on the Cell Broadband Engine

Francesco Belletti1, Marco Guidetti1, Andrea Maiorano2, Filippo Mantovani1, Sebastiano Fabio Schifano3, and Raffaele Tripiccione1

1 Dipartimento di Fisica, Università di Ferrara and INFN Sezione di Ferrara, I-44100 Ferrara, Italy
2 Dipartimento di Fisica, Università di Roma "La Sapienza", I-00100 Roma, Italy
3 Dipartimento di Matematica, Università di Ferrara and INFN Sezione di Ferrara, I-44100 Ferrara, Italy
Abstract. We implement a Monte Carlo algorithm for spin glass systems and optimize it for the Cell-BE processor, assessing the effectiveness of this architecture for state-of-the-art simulations. Recent developments in many-core processor architectures, like the IBM Cell BE, seem to open new opportunities in this field, where computational requirements are so demanding that state-of-the-art simulations often use dedicated computing systems. We present our implementation, analyze the results and compare performances with those of both commodity processors and dedicated systems.
1
Introduction
Several large-scale computational scientific problems still defy even state-of-the-art computing resources. One such problem is the simulation of spin models. Spin models (among them, spin glasses) are relevant in many areas of condensed-matter physics. They describe systems characterized by phase transitions (such as the para-ferro transition in magnets) or by "frustrated" dynamics, appearing when the complex structure of the energy landscape of the system makes the approach to equilibrium very slow. These systems, extensively studied in the last two decades as paradigmatic examples of complexity, have been successfully applied to several scientific areas outside physics, such as quantitative biology, optimization, economics modeling and social studies [1,2]. A spin glass is a disordered magnetic system in which the interactions between the magnetic moments associated with the atoms of the system are mutually conflicting, due to some frozen-in structural disorder [3]. The macroscopic consequence is that it is difficult for the system to relax to equilibrium: glassy materials never truly reach equilibrium on laboratory-accessible times as relaxation times may be, for macroscopic samples, of the order of centuries. As a counterpart of this sluggish physical behavior, finding the state of the system of lowest energy is an NP-hard problem [4] and accurate numerical treatment is a true challenge.
Short-ranged spin glass models are defined on discrete, finite-connectivity regular grids (e.g., 3-D lattices, when modeling experiments on real magnetic samples) and are usually studied via Monte Carlo simulation techniques. The dynamical variables are the spins, sitting at the sites of the lattice and having a discrete and finite (usually small) set of possible values. State-of-the-art simulations may stretch for well over 10^10 Monte Carlo updates of the full lattice, and have to be repeated on hundreds or thousands of different instantiations of the system (samples, see later for a detailed definition). In addition, each sample must be simulated more than once (replicas), as many properties of the model are encoded in the correlation between independent histories of the same system. For a 3-D lattice of linear size ∼ 80 (a typical value for a state-of-the-art investigation) these requirements translate into 10^18–10^19 Monte Carlo-driven spin updates, a major computational challenge. It is trivial to implement sample and replica parallelism, since there is no information exchange between samples (or replicas) during the simulation. There is (see next section for details) a further very large amount of parallelism available in the algorithms associated with the simulation of each sample, that is not efficiently supported by commodity computer architectures but whose exploitation is nevertheless badly needed: indeed, simulating just one system of 80^3 sites for 10^10 Monte Carlo steps implies 5 × 10^15 updates, that is, almost ten years on a commodity processor able to update one spin in 50 ns (a typical figure). The computational challenge in this area is twofold: i) exploit the parallelism available in the simulation of each sample in order to reduce processing time to figures acceptable on human timescales, and ii) replicate (by the hundreds or thousands) whatever solution has been found for point i), to accumulate results on a large set of samples and replicas. In this paper, we call i) and ii) internal and external parallelism, respectively (we do not endorse in this paper the possibly misleading terminology used in spin glass jargon of synchronous and asynchronous parallelism, see ref. [5]). Correspondingly, we will use two performance metrics: system spin update time (SUT) and global spin update time (GUT). SUT is the average time needed by the Monte Carlo procedure to update one spin of one system. GUT is the average time needed to process one spin, taking into account that many systems may be concurrently handled by the simulation. SUT is a measure of how well we are able to exploit internal parallelism, while GUT measures the effects of compounding internal and external parallelism. In the last decade, application-driven machines, whose architectures are strongly focused and optimized for spin glass simulations, have provided a very effective answer to this challenge [6,7]. Recent projects are arrays of FPGAs [8,9,10]; each FPGA implements a very large number (∼ 1000) of concurrent spin-update engines. Performance on these systems is impressive: simulation time for each sample reduces from several years to just a few months. Application-driven systems imply a large development effort (in terms of costs and manpower), so any new opportunity is welcome. Recent developments in multi-core (and many-core) architectures promise to offer such an opportunity. Indeed, the availability of SIMD data-paths
in each core helps efficiently exploit internal parallelism. On-chip memory systems (caches or otherwise), as well as efficient core-to-core communication mechanisms, reduce data-access latencies making it also possible to further expose parallelism among a large and increasing set of cores. In this paper, we address the problem of efficiently implementing Monte Carlo simulation for spin glasses on one specific many-core processor, the Cell-BE processor, assessing its performance and effectiveness for state-of-the-art simulations. The remainder of this paper is organized as follows: section 2 defines the Monte Carlo algorithm for spin glass simulations. Section 3 describes our implementation on the Cell BE. Section 4 presents our results and compares with production codes available for commodity x86 architectures, and with results on state-of-the-art dedicated architectures.
2
Monte Carlo Algorithms for Spin Glass Systems
In this paper we consider in detail the Edwards-Anderson (EA) model, a widely studied theoretical model of a spin glass. Other models, equally popular in the field, share the same computational structure. The EA model lives on a D-dimensional (usually D = 3) square lattice of linear size L. Atomic magnetic moments are modeled by spins S, discrete variables that take just two values, S = ±1 (in other models, like the Potts model [11], spins are q-valued, q = 3 . . . 6). For each pair (i, j) of neighboring spins in the lattice the model defines an interaction term (coupling) J_ij, needed to compute the energy. The latter is a functional of the configuration, i.e. of the set {S} of all spin values:
E[{S}] = − Σ_⟨ij⟩ J_ij S_i S_j,
where ⟨ij⟩ denotes the sum over all nearest-neighbor pairs. The J_ij are randomly assigned, mimicking the structural disorder of the real system, and kept fixed during the Monte Carlo simulation. A given assignment of the full set of J_ij is a sample of the system. Physically relevant results are averages over a large number of independent samples. The probability distribution of the J_ij reflects the properties of the system: in glassy systems one usually takes J_ij = ±1 with 50% probability (binary EA model) or samples a Gaussian distribution with zero mean and unit variance (Gaussian EA model). As already remarked, for each sample we need to consider several replicas (i.e. several copies of the system sharing the same set of couplings and evolving independently during the Monte Carlo process). For each sample, physically relevant observables are the canonical (i.e., fixed-temperature) averages of quantities that are themselves functionals of the configuration – for instance, the magnetization is defined as M[{S}] = Σ_i S_i. The canonical average (written as ⟨M⟩) at temperature T is
⟨M⟩ = Σ_{{S}} M[{S}] e^{−E[{S}]/T},  (1)
the sum stretching over all configurations; in words, the canonical average takes into account all configurations, weighted with a factor exponentially depending on their energies.
Require: set of S and J
1. loop {loop on Monte Carlo steps}
2.   for all Si do {for each spin Si}
3.     S′i = (Si == 1) ? −1 : 1 {flip value of Si}
4.     ΔE = Σ⟨ij⟩ (Jij · Si · Sj) − (Jij · S′i · Sj) {compute change of energy}
5.     if ΔE ≤ 0 then
6.       Si = S′i {accept new value of Si}
7.     else
8.       ρ = rnd() {compute a random number 0 ≤ ρ ≤ 1 with ρ ∈ Q}
9.       if ρ < e^{−βΔE} then
10.        Si = S′i {accept new value of Si}
11.      end if
12.    end if
13.  end for
14. end loop
Algorithm 1. Metropolis Algorithm

Performing the full sum of eq. (1) is an obviously impossible task for typical lattice sizes, so an approximation is needed. Monte Carlo (MC) methods are one such option; a Monte Carlo process produces a sequence of configurations sampled with probability proportional to the exponential term in eq. (1): an unbiased sum over all configurations generated by the process converges to (1) after an infinite – in practice, a large enough – number of steps. Several Monte Carlo algorithms are known. The Metropolis procedure [12] – usually adopted in spin glasses – generates new configurations by tentatively changing spins, as described by Algorithm 1. The visit order of the algorithm is not important, as long as all sites are visited an equal number of times on average (the process of visiting all sites is called an MC sweep). All steps of the algorithm can be applied in parallel to any subset of spins that do not share a coupling term in the energy function (so that we can correctly compute ΔE). We divide the lattice in a checkerboard scheme and apply the algorithm first to all black sites and then to all white ones, trivially identifying an available parallelism of degree up to L^3/2, growing linearly with the size of the system: in principle, we may schedule one full MC sweep for any lattice size in just two computational steps. This is still not possible today: what we have to do is to find ways to implement as large a subset of the available parallelism as possible.
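A minimal scalar sketch of one such checkerboard sweep (an illustration, not the authors' optimized code; the idx() site-indexing helper, the neigh_sum() function returning Σ_j J_ij S_j over the six neighbors, and rand01() are assumptions):

    #include <math.h>

    /* spins S[] take values ±1, couplings J[] are ±1 */
    void mc_sweep(int L, int *S, const int *J, double beta)
    {
        for (int color = 0; color < 2; color++)          /* black sites, then white */
            for (int z = 0; z < L; z++)
                for (int y = 0; y < L; y++)
                    for (int x = 0; x < L; x++) {
                        if (((x + y + z) & 1) != color) continue;
                        int i  = idx(L, x, y, z);
                        int dE = 2 * S[i] * neigh_sum(L, S, J, x, y, z);
                        if (dE <= 0 || rand01() < exp(-beta * dE))
                            S[i] = -S[i];                /* accept the flip */
                    }
    }

Within one color class no two sites share a coupling, so all L^3/2 sites of that class could be updated concurrently.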
3
Algorithm Implementation
In this section we discuss ways in which the available parallelism described above can be implemented on a multi-core architecture like the Cell-BE [13]. Our main priority is exposing internal parallelism, since external parallelism can be trivially farmed out; we are obviously allowed to mix strategies, if useful for performance. A first approach to internal parallelism uses SIMD instructions within each core, and processes several (white or black) spins in parallel. We later consider a further level of internal parallelism, stretching across cores.
Fig. 1. A data layout with external parallelism w and internal parallelism V allocates spins onto 128-bit SIMD vectors. Each vector has V slots, and each slot allocates w spins, one for each lattice.
External parallelism can be exploited on top of SIMD-based internal parallelism by mapping onto the same machine variable the spins belonging to the same site of different samples. This technique is usually referred to as multi-spin coding [14,15]. Mixing internal and external parallelism is necessary because the Cell-BE, like all commodity architectures, does not efficiently support the one-bit-valued data types that represent spins and couplings. Data allocation and coding have to be carefully tuned to obtain an efficient implementation. We first consider data layout. We map V (V = 2, . . . , 64) binary-valued spins of w = 128/V lattice samples onto a 128-bit word as shown in Fig. 1. This data layout allows us to use SIMD instructions that update in parallel V spins for each of w independent lattices, compounding an internal parallelism of degree V and an external parallelism of degree w. The key idea is that all steps of Algorithm 1 (except for random numbers, see later) can be coded for each spin as bit-valued logical operations, so many spins can be treated at the same time by SIMD logical instructions. We only need to represent each bit of the values of the energy (and other relevant variables) in a different 128-bit SIMD vector. It will turn out that a rather small set of such vectors is really needed. The energy Ei of a spin at site i is Ei = − Si Σ_j Jij Sj. This sum includes all sites j that are nearest neighbors of site i; in 3-D, it takes all even integer values in the range [−6, 6]. We replace S and J by the binary variables σ and λ (σ, λ ∈ {0, 1}) and compute Xi = Σ_j (σi xor λij xor σj). The energy of the spin at site i is then Ei = 6 − 2Xi, with Xi taking all integer values in [0, 6]. Flipping a spin changes the sign of its local energy, so the energy cost of a flip is simply ΔEi = −2Ei = 4Xi − 12, Xi ∈ {0, 1, 2, 3, 4, 5, 6}. The Metropolis algorithm accepts the new state if one of the following conditions is true: i) ΔEi ≤ 0, ii) ρ ≤ e^(−βΔEi) (notice that condition i) implies condition
Require: ρ pseudo-random number
Require: ψ = int(−(1/4β) log ρ) {encoded on two bits}
Require: η = (not Xi) {encoded on two bits}
1. c1 = ψ[0] and η[0]
2. c2 = (ψ[1] and η[1]) or ((ψ[1] or η[1]) and c1)
3. σi = σi xor (c2 or not Xi[2])
Algorithm 2. Bit-wise algorithm to update one spin

1. WHEEL[k] = WHEEL[k−24] + WHEEL[k−55]
2. ρ = WHEEL[k] ⊕ WHEEL[k−61]
Algorithm 3. Parisi-Rapuano Algorithm. WHEEL is a circular array initialized with random 32-bit unsigned integers, and ρ is the generated pseudo-random number.
ii)). All Xi values in [0, 3] correspond to negative or null energy cost, so the spin flip is always accepted in these cases. In a three-bit representation of Xi, this corresponds to having the most significant bit of Xi set to 0. If the most significant bit of Xi is 1, we have to check for condition ii), which we rewrite as
−(1/4β) log ρ + (7 − Xi) ≥ 4.
In a three-bit representation, 7 − Xi = not Xi. Since in this case condition i) is not verified, we only need two bits to represent the relevant not Xi values in {1, 2, 3}. Any value of −(1/4β) log ρ ≥ 3 verifies the condition, independently of Xi. In addition, since −(1/4β) log ρ is the only non-integer quantity involved, we only need its integer part. The inequality then reads:
int(−(1/4β) log ρ) + (not Xi) ≥ 4.
We call ψ = int(−(1/4β) log ρ) and η = (not Xi), both quantities being truncated to their two least significant bits, so the condition is verified if (ψ + η) generates a carry out of the second bit. Summing up, we compute the new spin value σi as the bit-wise expression defined by Algorithm 2. We further optimize the evaluation of ψ by replacing the computation of the logarithm with a look-up table. A more detailed description of this implementation is given in [16]. The advantages of this scheme are twofold: i) it exploits the SIMD capabilities of the architecture, leveraging both internal and external parallelism, and ii) it does not use conditional statements, which badly impact performance. At each step of the algorithm we need V (pseudo-)random numbers, since the same random value can be shared among the w samples, but independent random numbers are mandatory to update different spins of the same system. We use the 32-bit Parisi-Rapuano generator [17], defined by Algorithm 3, a popular choice for spin glass simulations. We use SIMD instructions to generate 4 random numbers in parallel. For values of V > 4, we iterate the code. So far, we have exploited SIMD parallelism within each core. We now consider multi-core parallelism. We split the lattice into C sub-lattices of contiguous planes
(C is the number of cores). We map each sub-lattice (of L × L × L/C sites) onto a different core. We consider two cases:
1. each sub-lattice is small enough to fit in the private memory of its core;
2. it does not fit fully within the private memory of its core. In this case, each core iteratively loads, updates and stores chunks of its sub-lattice till the whole sub-lattice is updated.
The first case implies data exchange across the cores (and no data traffic to main memory), while in the second case data moves between main memory and the cores. The algorithm describing our first approach is defined by a loop in which each core first updates all its white spins and then updates all the black ones (as required by the Metropolis algorithm). White and black spins are stored in data structures called half-planes, each housing L²/2 spins. Each core updates the half-planes of one color performing the steps described by Algorithm 4. Note that the for all statement can be overlapped with the send and receive of the boundary half-planes. The terms previous core and next core refer to the cores housing the two adjacent sub-lattices. In the second approach, data is allocated in main memory and split into blocks, small enough to fit inside the core memories. Our implementation for this approach is a loop where each core first updates all its white half-planes in the range [0; (L/C) − 1], synchronizes with the other cores and then repeats these steps for the black half-planes. The update of each half-plane is performed using a double-buffer scheme and executing the steps of Algorithm 5 for each half-plane i ∈ [0; (L/C) − 1]. The approaches described above are adopted also for the Gaussian model. In this case the only internal-parallelism degree available is V = 4, because the couplings are represented by floating-point numbers, and the update computation cannot be translated into a bit-wise expression as in the EA model.
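Before turning to performance, we give a minimal scalar C sketch of the Parisi-Rapuano wheel update of Algorithm 3 (our own, non-SIMD rendering; the circular-buffer size is our choice, and the wheel must be seeded with random 32-bit values before use).

#include <stdint.h>

#define PR_LAG 62                     /* >= largest lag used (61) + 1 */
static uint32_t wheel[PR_LAG];        /* must be seeded before first call */
static int k = 0;

static uint32_t parisi_rapuano(void)
{
    /* WHEEL[k] = WHEEL[k-24] + WHEEL[k-55]; output = WHEEL[k] ^ WHEEL[k-61] */
    wheel[k] = wheel[(k - 24 + PR_LAG) % PR_LAG]
             + wheel[(k - 55 + PR_LAG) % PR_LAG];
    uint32_t rho = wheel[k] ^ wheel[(k - 61 + PR_LAG) % PR_LAG];
    k = (k + 1) % PR_LAG;
    return rho;
}

The 32-bit additions wrap around modulo 2^32, which is the intended behavior of the generator; the SIMD version used in our code produces four such values per step.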
4
Performance Results and Conclusions
For our tests we used an IBM QS22 system, with two tightly connected IBM PowerXCell 8i processors. In this setup, it is possible to use up to 16 cores, but each SPE can exchange data at full speed only with the SPEs and the main memory of the same processor.

1. update the boundary half-planes (indexes (0) and ((L³/C) − 1)).
2. for all i ∈ [1..((L³/C) − 2)] do
3.   update half-plane (i)
4. end for
5. send/receive half-plane (0) to the previous core.
6. send/receive half-plane ((L³/C) − 1) to the next core.
Algorithm 4. Private Memory Implementation
 1. for all black, white do
 2.   for all half-planes (i) ∈ [0..(L/C) − 1] do
 3.     if i > 0 then
 4.       store half-plane (i − 1) to main memory
 5.     end if
 6.     if i < (L/C) − 1 then
 7.       load half-plane (i + 2) from main to local memory
 8.     end if
 9.     update half-plane (i) and proceed to half-plane (i) = (i + 1)
10.   end for
11.   synchronize all cores
12. end for
Algorithm 5. Global Memory Implementation
We have made extensive performance tests of most allowed combinations of (L, V, C). We have found large variations in overall efficiency as we move in this parameter space; indeed, efficiency depends on a delicate balance of sustained memory bandwidth and processor performance (the latter affected by scheduling details and by the cost of generating random numbers). Our measured data is consistent with a simple theoretical model that will be presented elsewhere (see however [16] for some details). We summarize our main results for the binary and Gaussian model in Table 1, where the values of the parameters relevant for typical lattice sizes are given. For a given value of the lattice size, we list the values of the parameters that yield the best performance, measured as system update time (SUT) and global update time (GUT). This table can be used, when planning a simulation campaign, to identify the program version that provides the best performance (the best mix of external and internal parallelism) for a given size of the problem. We now compare our performance results with those of a previous implementation for an x86 architecture and of a state-of-the-art custom HPC system optimized for spin glass applications. Let us start with the binary model. We report here the performance of a program running on a 2.4 GHz Intel Core2 Duo. The program implements the multi-spin coding technique, and is configured for running two replicas of 128 samples, that is, each core updates 128 spins in parallel (V = 1 and w = 128 in our previous notation). The program is not coded to use SSE instructions and registers explicitly. However, since the vector type provided by the gcc compiler is used, the compiler is able to exploit some SSE instructions. This program has a SUT of 92.16 ns/spin and a GUT of 0.72 ns/spin. Lattice size does not impact performance as long as it fits the cache memory. On the same processor, a routine partially exploiting internal parallelism updates 64 spins of a single system of size L = 64 (energies are still computed in parallel using the multi-spin coding trick, but the Metropolis condition check is inherently sequential, as different random numbers are needed for every spin in this case). In this case, SUT is 4.8 ns. We now compare our results with the performance of a recent dedicated machine, the Janus processor. Janus is the latest example of an application-driven
machine strongly optimized for spin glass simulations [8]. Each Janus processor (called an SP) handles one lattice at a time, updating ∼1000 spins in parallel (in our notation, V = 1000, w = 1 for Janus), corresponding to a SUT of 0.016 ns/spin. Different copies of the system are handled by different SPs on the machine. The smallest Janus system has 16 SPs, delivering a GUT of 0.001 ns/spin. Summing up our comparison for the binary model, we can say that, before the introduction of the CBE architecture, there was a factor of 300X (SUT) or 75X (GUT) between commodity systems and application-driven machines. The Cell strongly reduces this gap to values in the range 10X . . . 60X but does not close it yet: a cluster of ∼20–60 Cells approaches the performance of one Janus core if we consider the global spin update time, but the SUT figures still mean that a CBE-based simulation may take several years. This comparison is particularly appropriate, since the cost of commercial Cell-based systems is not appreciably lower than that of one Janus processor. In any case, many-cores are an extremely effective path to performance here, and in the not-too-far future application-driven machines may become an obsolete approach to the problem. The scenario is somewhat different for the Gaussian model. In this case the couplings are floating-point numbers, and we cannot exploit multi-spin coding techniques. The Metropolis code available for x86 PCs has a GUT of 23 ns/spin when sharing the same random number among 8 samples. The shortest SUT is 65 ns when only one sample is simulated. On the other side,

Table 1. System and global update time (SUT and GUT) for the binary (left) and Gaussian model (right). We list several allowed combinations of (L, V, C). Some values of L (like L = 80 for the binary model) imply data misalignment, and a corresponding performance loss. For the Gaussian model, lattice sizes larger than L = 80, which require changes in data organization to fit the CBE local store, are not yet implemented.
            |              Binary model                     |            Gaussian model
  L    C    |  V    w   Memory   SUT        GUT             |  V   w   Memory   SUT        GUT
            |                    (ns/spin)  (ns/spin)       |                   (ns/spin)  (ns/spin)
  16    8   |  8   16   local    0.83       0.052           |  4   1   local    1.00       0.250
  16   16   |  8   16   local    1.17       0.073           |  4   1   local    1.34       0.335
  32    8   | 16    8   local    0.40       0.050           |  4   1   local    0.65       0.162
  32   16   | 16    8   local    0.26       0.032           |  4   1   local    0.42       0.105
  48    8   |  8   16   local    0.48       0.030           |  4   1   global   1.65       0.412
  48   16   |  8   16   local    0.25       0.016           |  4   1   local    0.42       0.105
  64    8   | 32    4   local    0.29       0.072           |  4   1   global   1.71       0.427
  64   16   | 32    4   local    0.15       0.037           |  4   1   global   2.23       0.557
  80    8   |  8   16   local    0.82       0.051           |  4   1   global   1.59       0.397
  80   16   |  8   16   local    1.03       0.064           |  4   1   global   2.10       0.525
  96    8   | 16    8   global   0.42       0.052           |  -   -   -        -          -
  96   12   | 16    8   global   0.41       0.051           |  -   -   -        -          -
 128    8   | 64    2   global   0.24       0.120           |  -   -   -        -          -
 128   16   | 64    2   local    0.12       0.060           |  -   -   -        -          -
Janus, not well-suited for this case, has no published results for this model. Also for the Gaussian model our implementation is around 10X . . . 20X faster. In conclusion, our implementation on the Cell-BE is roughly two orders of magnitude faster than on an x86 PC, but still one order of magnitude away from the performance of a dedicated system. We might expect that forthcoming multi-core architectures will be able to support large-scale simulation campaigns. Also, graphics processors, with a much more fine-grained processing structure, may offer an alternate approach to the problem: work is in progress to assess the efficiency of both architectures. Acknowledgements. We thank the Jülich Supercomputing Centre for access to QS22 systems, and members of the Janus collaboration for useful discussions.
References
1. Mézard, M., Parisi, G., Virasoro, M.: Spin Glass Theory and Beyond. World Scientific, Singapore (1987)
2. Schulz, M.: Statistical Physics and Economics: Concepts, Tools, and Applications. Springer, Heidelberg (2003)
3. Binder, K., Young, A.P.: Spin Glasses: Experimental Facts, Theoretical Concepts and Open Questions. Rev. Mod. Phys. 58, 801–976 (1986)
4. Barahona, J.: On the computational complexity of Ising spin glass models. J. Phys. A: Math. Gen. 15, 3241–3253 (1982)
5. Newman, M.E.J., Barkema, G.T.: Monte Carlo Methods in Statistical Physics. Oxford University Press, USA (1999)
6. Condon, J.H., Ogielski, A.T.: Fast special purpose computer for Monte Carlo simulations in statistical physics. Rev. Sci. Instruments 56, 1691–1696 (1985)
7. Cruz, A., et al.: SUE: A Special Purpose Computer for Spin Glass Models. Comp. Phys. Comm. 133, 165–176 (2001)
8. Belletti, F., et al.: Simulating Spin Systems on Ianus, an FPGA-Based Computer. Comp. Phys. Comm. 178, 208–216 (2008)
9. Belletti, F., et al.: Ianus: an Adaptive FPGA Computer. Computing in Science and Engineering 8, 41–49 (2006)
10. Belletti, F., et al.: JANUS: an FPGA-based System for High Performance Scientific Computing. Computing in Science and Engineering 11, 48–58 (2009)
11. Wu, F.Y.: The Potts Model. Rev. Mod. Phys. 54, 235–268 (1982)
12. Metropolis, N., et al.: Equation of State Calculation by Fast Computing Machine. J. Chemical Physics 21, 1087–1092 (1953)
13. IBM Cell Broadband Engine Architecture, http://www-128.ibm.com/developerworks/power/cell/documents.html
14. Bhanot, G., Duke, D., Salvador, R.: Finite-size scaling and three dimensional Ising model. Phys. Rev. B 33, 7841–7844 (1986)
15. Michael, C.: Fast heat-bath algorithm for the Ising model. Phys. Rev. B 33, 7861–7862 (1986)
16. Belletti, F.: Monte Carlo Simulations of Spin Glasses on Cell Broadband Engine. PhD Thesis, Università di Ferrara (2009)
17. Parisi, G., Rapuano, F.: Effects of the random number generator on computer simulations. Phys. Lett. B 157, 301–302 (1985)
Montgomery Multiplication on the Cell

Joppe W. Bos and Marcelo E. Kaihara

Laboratory for Cryptologic Algorithms, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland
{joppe.bos,marcelo.kaihara}@epfl.ch
Abstract. A technique to speed up Montgomery multiplication targeted at the Synergistic Processor Elements (SPE) of the Cell Broadband Engine is proposed. The technique consists of splitting a number into four consecutive parts. These parts are placed one by one in each of the four element positions of a vector, representing columns in a 4-SIMD organization. This representation enables arithmetic to be performed in a 4-SIMD fashion. An implementation of the Montgomery multiplication using this technique is up to 2.47 times faster compared to an unrolled implementation of Montgomery multiplication, which is part of the IBM multi-precision math library, for odd moduli of length 160 to 2048 bits. The presented technique can also be applied to speed up Montgomery multiplication on other SIMD-architectures. Keywords: Cell Broadband Engine, Cryptology, Computer Arithmetic, Montgomery Multiplication, Single Instruction Multiple Data (SIMD).
1
Introduction
Modular multiplication is one of the basic operations in almost all modern public-key cryptographic applications. For example, cryptographic operations in RSA [1], using practical security parameters, require a sequence of modular multiplications using a composite modulus ranging from 1024 to 2048 bits. In elliptic curve cryptography (ECC) [2,3], the efficiency of elliptic curve arithmetic over large prime fields relies on the performance of modular multiplication. In ECC, the length of the most commonly used (prime) moduli ranges from 160 to 512 bits. Among the several approaches to speed up modular multiplication that have been proposed in the literature, a widely used choice is the Montgomery modular multiplication algorithm [4]. In the current work, we study Montgomery modular multiplication on the Cell Broadband Engine (Cell). The Cell is a heterogeneous processor and has been used as a cryptographic accelerator [5,6,7,8] as well as for cryptanalysis [9,10]. In this article, a technique to speed up Montgomery multiplication that exploits the capabilities of the SPE architecture is presented. This technique consists of splitting a number into four consecutive parts which are placed one by one in each of the four element positions of a vector. These parts can be seen as four columns in a 4-SIMD organization. This representation benefits from
the features of the Cell, e.g. it enables the use of the 4-SIMD multiply-and-add instruction and makes use of the large register file. The division by a power of 2, required in one iteration of Montgomery reduction, can be performed by a vector shift and an inexpensive circular change of the indices of the vectors that accumulate the partial products. Our experimental results show that an implementation of Montgomery multiplication based on our newly proposed representation is, for moduli of sizes between 160 and 2048 bits, up to 2.47 times faster than an unrolled implementation of Montgomery multiplication in IBM's Multi-Precision Math (MPM) Library [11]. The article is organized as follows. In Section 2, a brief explanation of the Cell broadband engine architecture is given. In Section 3, we describe the Montgomery multiplication algorithm. In Section 4, our new technique for Montgomery multiplication is presented. In Section 5, performance results of different implementations are given. Section 6 concludes this paper.
2
Cell Broadband Engine Architecture
The Cell architecture [12], developed by Sony, Toshiba and IBM, has a dual-threaded 64-bit Power Processing Element (PPE) as its main processing unit and eight Synergistic Processing Elements (SPEs). The SPEs are the workhorses of the Cell and the main interest in this article. Each SPE consists of a Synergistic Processor Unit (SPU) and a Memory Flow Controller (MFC). Every SPU has access to a register file of 128 entries, called vectors or quad-words of 128-bit length, and a 256-kilobyte Local Store (LS) with room for instructions and data. The main memory can be accessed through explicit direct memory access requests to the MFC. The SPUs have a 128-bit SIMD organization allowing sixteen 8-bit, eight 16-bit or four 32-bit integer computations in parallel. The SPUs are asymmetric processors that have two pipelines, denoted as the even and odd pipelines. This means that two instructions can be dispatched every clock cycle. Most of the arithmetic instructions are executed on the even pipeline and most of the memory instructions are executed on the odd pipeline. It is a challenge to fully utilize both pipelines at all times. The SPEs have no hardware branch prediction. Instead, hints can be provided by the programmer (or the compiler) to the instruction fetch unit, specifying where a branch instruction will jump to. An additional advantage of the SPE architecture is the availability of a rich instruction set. With a single instruction, four 16-bit integer multiplications can be executed in parallel. An additional performance improvement may be achieved with the multiply-and-add instruction, which performs a 16 × 16-bit unsigned multiplication and an addition of a 32-bit unsigned operand to the 32-bit multiplication result. This instruction has the same latency as a single 16 × 16-bit multiplication and requires the 16-bit operands to be placed in the higher positions of the 32-bit sections (carries are not generated for this instruction). One of the first applications of the Cell processor was to serve as the heart of Sony's PS3 video game console. The Cell contains eight SPEs, and in the
Input: M: r^(n−1) ≤ M < r^n, 2 ∤ M; X, Y: 0 ≤ X, Y < M
Output: Z = X · Y · R^(−1) mod M
1: Z = 0
2: for i = 0 to n − 1 do
3:   Z = Z + yi · X
4:   q = (−Z · M^(−1)) mod r
5:   Z = (Z + q · M)/r
6: end for
7: if Z ≥ M then
8:   Z = Z − M
9: end if
Algorithm 1. Radix-r Montgomery Multiplication [4]
PS3 one of them is disabled. Another SPE is reserved by Sony's hypervisor (a software layer which is used to virtualize devices and other resources in order to provide a virtual machine environment to, e.g., Linux OS). In the end, six SPEs can be accessed when running Linux OS on the PS3.
3
Montgomery Modular Multiplication
The Montgomery modular multiplication method introduced in [4] consists of transforming each of the operands into a Montgomery representation and carrying out the computation by replacing the conventional modular multiplications by Montgomery multiplications. This is suitable to speed up, for example, modular exponentiations, which can be decomposed into a sequence of several modular multiplications. One of the advantages of this method is that its computational complexity is usually better than that of the classical method by a constant factor. Given an n-word odd modulus M, such that r^(n−1) ≤ M < r^n, where r = 2^w is the radix of the system and w is the bit length of a word, and an integer X = Σ_{i=0}^{n−1} xi · 2^(w·i), the Montgomery residue of this integer is defined as X̃ = X · R mod M. The Montgomery radix R is a constant such that gcd(R, M) = 1 and R > M. For efficiency reasons, this is usually adjusted to R = r^n. The Montgomery product of two integers is defined as M(X̃, Ỹ) = X̃ · Ỹ · R^(−1) mod M. If X̃ = X · R mod M and Ỹ = Y · R mod M are Montgomery residues of X and Y, then Z̃ = M(X̃, Ỹ) = X · Y · R mod M is a Montgomery residue of X · Y mod M. Algorithm 1 describes the radix-r interleaved Montgomery algorithm. The conversion from the ordinary representation of an integer X to the Montgomery representation X̃ can be performed using the Montgomery algorithm by computing X̃ = M(X, R²), provided that the constant R² mod M is pre-computed. The conversion back from the Montgomery representation to the ordinary representation can be done by applying the Montgomery algorithm to the result and the number 1, i.e. Z = M(Z̃, 1).
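As a concrete illustration of Algorithm 1, the following is a minimal scalar C sketch of radix-r interleaved Montgomery multiplication with r = 2^32 (our own reference rendering; names and word size are illustrative and unrelated to the SIMD layout developed in Section 4; m0inv = −M^(−1) mod 2^32, which depends only on the least significant word of M, is assumed to be precomputed).

#include <stdint.h>

/* Z = X * Y * R^{-1} mod M, R = 2^(32*n), M odd, 0 <= X, Y < M.
   X, Y, M, Z are little-endian arrays of n 32-bit words. */
static void mont_mul(uint32_t *Z, const uint32_t *X, const uint32_t *Y,
                     const uint32_t *M, uint32_t m0inv, int n)
{
    uint32_t T[n + 2];                       /* accumulator (C99 VLA) */
    for (int j = 0; j < n + 2; j++) T[j] = 0;

    for (int i = 0; i < n; i++) {
        /* T = T + y_i * X */
        uint64_t carry = 0;
        for (int j = 0; j < n; j++) {
            uint64_t t = (uint64_t)Y[i] * X[j] + T[j] + carry;
            T[j] = (uint32_t)t;
            carry = t >> 32;
        }
        uint64_t t = (uint64_t)T[n] + carry;
        T[n]     = (uint32_t)t;
        T[n + 1] = (uint32_t)(t >> 32);

        /* q = (-T * M^{-1}) mod r, then T = T + q * M (lowest word becomes 0) */
        uint32_t q = T[0] * m0inv;
        carry = 0;
        for (int j = 0; j < n; j++) {
            uint64_t u = (uint64_t)q * M[j] + T[j] + carry;
            T[j] = (uint32_t)u;
            carry = u >> 32;
        }
        t = (uint64_t)T[n] + carry;
        T[n]      = (uint32_t)t;
        T[n + 1] += (uint32_t)(t >> 32);

        /* T = T / r: drop the (now zero) least significant word */
        for (int j = 0; j <= n; j++) T[j] = T[j + 1];
        T[n + 1] = 0;
    }

    /* final conditional subtraction: if T >= M then T = T - M */
    uint32_t D[n];
    int64_t borrow = 0;
    for (int j = 0; j < n; j++) {
        int64_t d = (int64_t)T[j] - M[j] - borrow;
        D[j] = (uint32_t)d;
        borrow = d < 0;
    }
    int64_t d = (int64_t)T[n] - borrow;
    borrow = d < 0;
    for (int j = 0; j < n; j++)
        Z[j] = borrow ? T[j] : D[j];
}

Each iteration adds yi · X, cancels the lowest word with a multiple of M, and divides by r, exactly as in lines 3–5 of Algorithm 1.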
In cryptologic applications, where modular products are usually performed in succession, the final conditional subtraction, which is costly, is not needed until the end of a series of modular multiplications [13]. In order to avoid the last conditional subtraction (lines 7 to 9 of Algorithm 1), R is chosen such that 4M < R, and the inputs and output are represented as elements of Z/2MZ instead of Z/MZ, that is, operations are carried out in a redundant representation. It can be shown that throughout the series of modular multiplications, outputs from multiplications can be reused as inputs and these values remain bounded. This technique not only speeds up modular multiplications but also prevents the success of timing attacks [14], as operations are data independent [13].
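The bound behind this can be sketched in one line (our own paraphrase of the argument in [13]): with R > 4M, inputs X, Y < 2M, and an accumulated quotient q < R, the output of Algorithm 1 without the final subtraction satisfies

\[
  Z \;=\; \frac{XY + qM}{R}
    \;<\; \frac{(2M)(2M) + RM}{R}
    \;=\; \frac{4M^{2}}{R} + M
    \;<\; M + M \;=\; 2M ,
\]

so results stay in [0, 2M) and can be fed back as inputs to further multiplications.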
4
Montgomery Multiplication on the Cell
In this section, we present an implementation of the Montgomery multiplication algorithm that takes advantage of the features of the SPE, e.g. the SIMD architecture, the large register file and the rich instruction set. The multiplication algorithm implemented in the MPM library uses a radix r = 2^128 to represent large numbers. Each of these 128-bit words is in turn represented using a vector of four consecutive 32-bit words in a SIMD fashion. One drawback of this representation is that operands whose sizes are slightly larger than a power of 2^128 require an entire extra 128-bit word to be processed, wasting computational resources. Another drawback is that the addition of the resulting 32-bit product to a 32-bit value might produce a carry which is not detected. In contrast to the addition instruction, there is no carry-generating counterpart of the multiply-and-add operation, so it is not used in the Montgomery multiplication of the MPM library. The technique we present uses a radix r = 2^16, which enables a better division of large numbers into words that match the input sizes of the 4-SIMD multipliers of the Cell. This choice enables the use of the following property, and hence of the multiply-and-add instruction: if a, b, c, d ∈ Z and 0 ≤ a, b, c, d < r, then a · b + c + d < r². Specifically, when r = 2^16, this property enables the addition of a 16-bit word to the result of a 16 × 16-bit product (used for the multi-precision multiplication and accumulation) and an extra addition of a 16-bit word (used for 16-carry propagation), so that the result is smaller than r² = 2^32 and no overflow can occur (indeed, a · b + c + d ≤ (r − 1)² + 2(r − 1) = r² − 1). We will assume hereafter that the radix is r = 2^16. Given an odd 16n-bit modulus M, i.e. r^(n−1) ≤ M < r^n, a Montgomery residue X, such that 0 ≤ X < 2M < r^(n+1), is represented using s = ⌈(n+1)/4⌉ vectors of 128 bits. The extra 16-bit word is considered in the implementation because the intermediate accumulating result of the Montgomery multiplication can be up to 2M. The Montgomery residue X is represented using a radix-r system, i.e. X = Σ_{i=0}^{n} xi · r^i. At the implementation level the 16-bit words xi are stored column-wise in the s 128-bit vectors Xj, where j ∈ [0, s − 1]. The four 32-bit parts of such a vector are denoted by Xj = {Xj[0], Xj[1], Xj[2], Xj[3]}, where Xj[0] and Xj[3] contain the least and most significant 32 bits of Xj, respectively. Each of the (n + 1) 16-bit words xi of X is stored in the most significant 16 bits
Fig. 1. The 16-bit words xi of a 16(n + 1)-bit positive integer X = Σ_{i=0}^{n} xi · r^i < 2M are stored column-wise using s = ⌈(n + 1)/4⌉ 128-bit vectors Xj on the SPE architecture
of X_(i mod s)[⌊i/s⌋]. A graphical representation of this arrangement is provided in Figure 1. We follow hereafter the same notation to represent the long integer numbers used. The Montgomery multiplication algorithm using this representation is described in Algorithm 2. The algorithm takes as inputs the modulus M of n words and the operands X and Y of (n + 1) words. The division by r, which is a shift by 16 bits of the accumulating partial product U, is implemented as a logical right shift by 32 bits of the vector that contains the least significant position of U and a change of the indices. That is, during each iteration, the indices of the vectors that contain the accumulating partial product U change circularly among the s registers without physical movement of data. In Algorithm 2, each 16-bit word of the inputs X, Y and M and the output Z is stored in the upper part (the most significant 16 bits) of each of the four 32-bit words of a 128-bit vector. The vector µ stores the replicated values of (−M)^(−1) mod r in the lower 16-bit positions of the words. The temporary vector K stores in the most significant 16-bit positions the replicated values of yi, i.e. each of the parsed coefficients of the multiplier Y corresponding to the i-th iteration of the main loop. The operation A ← muladd(B, c, D), which is a single instruction on the SPE, represents the operation of multiplying the vector B (where data are stored in the higher 16-bit positions of the 32-bit words) by a vector with the replicated 16-bit value of c across all higher positions of the 32-bit words. This product is added to D and the overall result is placed into A. The temporary vector V stores the replicated values of u0 in the least significant 16-bit words. This u0 refers to the least significant 16-bit word of the updated value of U, i.e. U = Σ_{j=0}^{n} uj · r^j, represented by the s 128-bit vectors U_(i mod s), U_(i+1 mod s), . . . , U_(i+n mod s) following the notation explained above (i refers to the index of the main loop). The vector Q is computed as an element-wise logical left shift by 16 bits of the 4-SIMD product of vectors V and µ. The propagation of the higher 16-bit carries of U_((i+j) mod s) described in lines 9 and 15 consists of extracting the higher 16-bit words of these vectors and placing
Input:
  M represented by s 128-bit vectors M_(s−1), . . . , M_0, such that r^(n−1) ≤ M < r^n, 2 ∤ M, r = 2^16;
  X represented by s 128-bit vectors X_(s−1), . . . , X_0;
  Y represented by s 128-bit vectors Y_(s−1), . . . , Y_0, such that 0 ≤ X, Y < 2M;
  µ: a 128-bit vector containing (−M)^(−1) mod r replicated in all 4 elements.
Output:
  Z represented by s 128-bit vectors Z_(s−1), . . . , Z_0, such that Z ≡ X · Y · r^(−(n+1)) mod M, 0 ≤ Z < 2M.
 1: for j = 0 to s − 1 do
 2:   U_j = 0
 3: end for
 4: for i = 0 to n do
 5:   K = {yi, yi, yi, yi}
 6:   for j = 0 to s − 1 do
 7:     U_((i+j) mod s) = muladd(X_j, K, U_((i+j) mod s))
 8:   end for
 9:   Perform 16-carry propagation on U_((i+j) mod s) for j = 0, . . . , s − 1
10:   V = {u0, u0, u0, u0}
11:   Q = shiftleft(mul(V, µ), 16)   /* Q = V · µ mod r */
12:   for j = 0 to s − 1 do
13:     U_((i+j) mod s) = muladd(M_j, Q, U_((i+j) mod s))
14:   end for
15:   Perform 16-carry propagation on U_((i+j) mod s) for j = 0, . . . , s − 1
16:   U_(i mod s) = vshiftright(U_(i mod s), 32)   /* vector logical right shift by 32 bits */
17: end for
18: Perform carry propagation on U_(i mod s) for i = n + 1, . . . , 2n + 1
19: for j = 0 to s − 1 do
20:   Z_j = U_((n+j+1) mod s)   /* place results in higher 16-bit positions */
21: end for
Algorithm 2. Montgomery Multiplication Algorithm for the Cell

Table 1. Performance results, expressed in nanoseconds, for the computation of one Montgomery multiplication for different bit-sizes on a single SPE on a PlayStation 3

Bit-size of moduli   This work (ns)   MPM (ns)   Ratio   Unrolled MPM (ns)   Ratio
 192                    174              369      2.12         277            1.59
 224                    200              369      1.85         277            1.39
 256                    228              369      1.62         277            1.21
 384                    339              652      1.92         534            1.58
 512                    495            1,023      2.07         872            1.76
1024                  1,798            3,385      1.88       3,040            1.69
2048                  7,158           12,317      1.72      11,286            1.58
them into the lower 16-bit positions of temporary vectors. These vectors are then added to U(i+j+1) mod s correspondingly. The operation is carried out for the vectors with indices j ∈ [0, s − 2]. For j = s − 1, the temporary vector that contains
[Figure 2 plots the speed-up factor ("Times faster", vertical axis from 1.0 to 2.6) against the number of bits in the modulus (horizontal axis, 256 to 2048).]
Fig. 2. The speed-up of the new Montgomery multiplication compared to the unrolled Montgomery multiplication from MPM
the words is logically shifted 32 bits to the left and added to the vector Ui mod s . Similarly, the carry propagation of the higher words of U(i+j) mod s described in line 18 is performed with 16-bit word extraction and addition, but requires a sequential parsing over the (n + 1) 16-bit words. Note that U is represented with vectors whose values are placed in the lower 16-bit word positions.
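As a standalone sanity check of the overflow-freedom property that underpins the muladd-based accumulation of Algorithm 2 (our own snippet, not part of the library code):

#include <assert.h>
#include <stdint.h>

/* For r = 2^16 and 0 <= a, b, c, d < r we have a*b + c + d <= (r-1)^2 + 2(r-1)
   = r^2 - 1, so a 16 x 16-bit multiply-and-add plus one extra 16-bit addition
   (as used for the 16-carry propagation) can never overflow a 32-bit slot. */
int main(void)
{
    const uint32_t r = 1u << 16;
    uint32_t a = r - 1, b = r - 1, c = r - 1, d = r - 1;   /* worst case */
    uint64_t t = (uint64_t)a * b + c + d;                  /* exact in 64 bits */
    assert(t == (uint64_t)r * r - 1);                      /* still fits in 32 bits */
    return 0;
}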
5
Experimental Results
In this section, a performance comparison of different software implementations of the Montgomery multiplication algorithm running on a single SPE of a PS3, using moduli of varying bit lengths, is presented. The approach described in Section 4 is compared to the implementation in the Multi-Precision Math (MPM) Library [11]. The MPM library is developed by IBM and is part of the software development kit [15] for the Cell. MPM consists of a set of routines that perform arithmetic on unsigned integers of a large number of bits. According to [11]: "All multiprecision numbers are expressed as an array of unsigned integer vectors (vector unsigned int) of user specified length (in quadwords). The numbers are assumed to be big endian ordered". To enhance the performance and avoid expensive conditional branches, a code generator has been designed. This generator takes as input the bit-size
of the modulus and outputs the Montgomery multiplication code in the C programming language, using the SPU-intrinsics language extension. Such a generator has been made both for our approach and for the implementation of Montgomery multiplication in the MPM library. We refer to this faster version of MPM as unrolled MPM. The computational times, in nanoseconds, of a single Montgomery multiplication for cryptographically interesting bit sizes are given in Table 1. Our Montgomery implementation is the subtraction-less variant (suitable for cryptographic applications), while the MPM version is the original Montgomery multiplication including the final subtraction. A subtraction-less variant of the MPM implementation is significantly slower, since it requires the processing of an entire extra 128-bit vector for the one extra bit needed to represent the operands in the interval [0, 2M). Thus, it is not considered in our comparison. The performance results include the overhead of benchmarking, function calls, and loading and storing the inputs and output back and forth between the register file and the local store. The stated ratios are the (unrolled) MPM results versus the new results. Figure 2 presents the overall speed-up ratios compared to the unrolled implementation of MPM. Every 128 bits a performance peak can be observed in the figure, for moduli of 128i + 16 bits. This is because the (unrolled) MPM implementations work in radix r = 2^128 and require an entire extra 128-bit vector to be processed for these extra 16 bits. This peak is more pronounced for smaller bit-lengths because of the significant relative extra work compared to the number of extra bits. This effect becomes less noticeable for larger moduli. The drop in the performance ratio of our algorithm that occurs at every multiple of 64 bits in Figure 2 can be explained by the fact that moduli of these bit sizes require an extra vector to hold the number in redundant representation. Despite this fact, our implementation outperforms the unrolled MPM (which is faster than generic MPM) because it only iterates over the 16-bit words that contain data. The only bit sizes where the unrolled MPM version is faster than our approach are in the range [112, 128]. This is due to a small constant overhead built into our approach, which becomes negligible for larger bit-lengths. The maximal speed-up obtained with our approach is 2.47 times, compared to the unrolled Montgomery multiplication of the MPM library, and occurs for moduli of size 528 bits.
6
Conclusions
We have presented a technique to speed up Montgomery multiplication on the SPEs of the Cell processor. The technique consists of representing the integers by grouping consecutive parts. These parts are placed one by one in each of the four element positions of a vector, representing columns in a 4-SIMD organization. In practice, this leads to a performance speed-up of up to 2.47 compared to an unrolled Montgomery multiplication implementation as in the MPM library, for moduli of sizes 160 to 2048 bits, which are of particular interest for public-key cryptography. Although presented for the Cell architecture, the proposed techniques can also be applied to other SIMD architectures.
References
1. Rivest, R.L., Shamir, A., Adleman, L.: A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM 21, 120–126 (1978)
2. Koblitz, N.: Elliptic curve cryptosystems. Mathematics of Computation 48, 203–209 (1987)
3. Miller, V.S.: Use of elliptic curves in cryptography. In: Williams, H.C. (ed.) CRYPTO 1985. LNCS, vol. 218, pp. 417–426. Springer, Heidelberg (1986)
4. Montgomery, P.L.: Modular multiplication without trial division. Mathematics of Computation 44(170), 519–521 (1985)
5. Costigan, N., Scott, M.: Accelerating SSL using the vector processors in IBM's Cell broadband engine for Sony's PlayStation 3. Cryptology ePrint Archive, Report 2007/061 (2007), http://eprint.iacr.org/
6. Bos, J.W., Casati, N., Osvik, D.A.: Multi-stream hashing on the PlayStation 3. In: PARA 2008 (2008) (to appear)
7. Bos, J.W., Osvik, D.A., Stefan, D.: Fast implementations of AES on various platforms. Cryptology ePrint Archive, Report 2009/501 (2009), http://eprint.iacr.org/
8. Costigan, N., Schwabe, P.: Fast elliptic-curve cryptography on the Cell broadband engine. In: Preneel, B. (ed.) AFRICACRYPT 2009. LNCS, vol. 5580, pp. 368–385. Springer, Heidelberg (2009)
9. Bos, J.W., Kaihara, M.E., Montgomery, P.L.: Pollard rho on the PlayStation 3. In: SHARCS 2009, pp. 35–50 (2009)
10. Stevens, M., Sotirov, A., Appelbaum, J., Lenstra, A., Molnar, D., Osvik, D.A., de Weger, B.: Short chosen-prefix collisions for MD5 and the creation of a rogue CA certificate. In: Halevi, S. (ed.) Advances in Cryptology – CRYPTO 2009. LNCS, vol. 5677, pp. 55–69. Springer, Heidelberg (2009)
11. IBM: Multi-precision math library. Example Library API Reference, https://www.ibm.com/developerworks/power/cell/documents.html
12. Hofstee, H.P.: Power efficient processor architecture and the Cell processor. In: HPCA 2005. IEEE Computer Society, Los Alamitos (2005)
13. Walter, C.D.: Montgomery exponentiation needs no final subtractions. Electronics Letters 35(21), 1831–1832 (1999)
14. Kocher, P.C.: Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems. In: Koblitz, N. (ed.) CRYPTO 1996. LNCS, vol. 1109, pp. 104–113. Springer, Heidelberg (1996)
15. IBM: Software Development Kit (SDK) 3.1 (2007), http://www.ibm.com/developerworks/power/cell/documents.html
An Exploration of CUDA and CBEA for Einstein@Home

Jens Breitbart¹ and Gaurav Khanna²

¹ Research Group Programming Languages / Methodologies, Universität Kassel, Kassel, Germany
[email protected]
² Physics Department, University of Massachusetts at Dartmouth, North Dartmouth, MA, USA
[email protected]
Abstract. We present a detailed approach for making use of two new computer hardware architectures–CBEA and CUDA–for accelerating a scientific data-analysis application (Einstein@Home). Our results suggest that both the architectures suit the application quite well and the achievable performance in the same software developmental time-frame is nearly identical.
1
Introduction
The performance of computing technologies like graphics cards or gaming consoles is increasing at a rapid rate, thus making general-purpose computing on such devices a tantalizing possibility. Both the Compute Unified Device Architecture (CUDA) and the Cell Broadband Engine Architecture (CBEA) are new hardware architectures designed to provide high performance and scaling for multiple hardware generations. CUDA is NVIDIA's general-purpose software development system for graphics processing units (GPUs) and offers the programmability of their GPUs in the ubiquitous C programming language. The Cell Broadband Engine (CBE), which is the first incarnation of the CBEA, was designed by a collaboration between Sony, Toshiba and IBM (so-called STI). The CBE was originally intended to be used in gaming consoles (namely Sony's PlayStation 3), but the CBEA itself was not solely designed for this purpose and has been used in areas such as high-performance computing as well. In this article, we will compare the current state of both CUDA and the CBEA by modifying the Einstein@Home client application to enable it to take advantage of these different hardware architectures. Einstein@Home is a distributed computing project that uses the computing power volunteered by end users running its client application. The computation performed by this client application can be executed in a data-parallel fashion, which suits both CUDA and the CBE well. We took approximately the same time for developing the client on both architectures and achieved similar performance, but faced very
different challenges. Developing the CUDA client required us to redesign the existing data structure, which we could then use for the CBE client as well. However, on the CBEA we could not use all hardware features, due to the low amount of on-chip memory, and thus likely lost some performance. This article is organized as follows. First, Sect. 2 gives an overview of the Einstein@Home client application. The next two sections describe the CBEA architecture (Sect. 4) and our newly developed CBEA application (Sect. 5). The following two sections give an overview of CUDA (Sect. 6) and our experiences with that architecture (Sect. 7). Finally, in Sect. 8 we compare both implementations and outline the benefits and drawbacks of CUDA and the CBE. Section 9 describes related work, whereas Sect. 10 summarizes this work.
2
Einstein@Home
Einstein@Home is a BOINC [1] based, public distributed computing project that offloads the data analysis associated with gravitational wave observatories to volunteers worldwide. The goal of Einstein@Home is to search for gravitational waves emitted by neutron stars (pulsars), by running a brute-force search over different waveforms in a large data set. We consider the Einstein@Home client a meaningful test application for our comparison of CUDA and the CBEA since it is very compute intensive and its parallelization is quite straightforward. The computation of the application can be roughly divided into two parts – the so-called F-Statistics computation, and a Hough transformation. We will concentrate on the F-Statistics computation in this work, because the Hough code will soon be replaced by an alternative, extremely efficient algorithm. We provide below a brief overview of F-Statistics and its data dependencies; a detailed discussion can be found in [2]. Listing 1.1 provides an overview of the F-Statistics code. The code uses for each whenever a loop can be carried out in parallel with no or minimal changes to the loop body; += is used for all commutative calculations. The parameters of a function denote all the values the function depends on; however, we simply use "..." as a reference to all parameters passed to the calling function. The F-Statistics code consists of three nested loops, each looping through a different part of the input data set, namely: a frequency band, all used detectors, and the SFTs of each signal. Listing 1.1. F-Statistics pseudo code
T[] ComputeFStatFreqBand(...) {
  T result[];
  int i = 0;
  for each (frequency in frequency_band(...))
    result[i++] = ComputeFStat(..., frequency);
  return result;
}
T ComputeFStat(...) {
  T result;
  for each (detector in detectors)
    result += ComputeFaFb(..., detector);
  return result;
}

T ComputeFaFb(...) {
  T result;
  for each (SFT in SFTs(frequency))
    result += some_calculations(..., SFT);
  return normalized(result);
}

In this article, we will indicate the current state of the calculation as the presently computed-upon SFT, detector and frequency. More formally speaking, the current state is the tuple (frequency, detector, SFT), using the variable names of Listing 1.1.
3
Test Cases
The various measurements presented in this article were done with two different kinds of data-sets that not only differ in the overall runtime, but also in their memory requirements and performance characteristics. The small test case is based on a data-set used to check if the Einstein@Home client application is producing correct results. It uses a small number of frequencies in contrast to a full Einstein@Home work unit. In the small data set case, the F-Statistics takes nearly 90% of the overall runtime, while in the full work unit case its share is nearly 50%. A full work unit consists of multiple, so-called sky points. The calculation done for one sky point consists of both the F-Statistics and the Hough transformation and the runtime required to compute one sky point is nearly identical to the one required for the other sky points. For this work, we measure the time required to calculate one sky point for the full work unit.
4
Cell Broadband Engine
The CBE has a general-purpose CPU core, called the PPE, which can run 2 software threads simultaneously, and 8 specialized SIMD compute engines, called SPEs, available for numerical computation. All these compute elements are connected to each other through a high-speed interconnect bus (EIB). We will not attempt to go into much detail concerning the CBE's design here; rather, we will simply point out one feature that addresses the issue of the memory wall. The memory wall refers to the large (and increasing) gap between processor and memory performance, causing the slower memory speeds to become a significant bottleneck. The CBE allows the programmer to overlap
memory access and the actual computation ("double buffering"), in order to hide the time it takes to access memory, thus providing a workaround for the memory wall issue. In addition, the parallel programming model on the CBE allows for the use of the SPEs for performing different tasks in a workflow ("task parallel" model) or performing the same task on different data ("data parallel" model). We use the data parallel model in our implementations. One challenge introduced by this design is that the programmer typically has to explicitly manage the memory transfers between the PPE and the SPEs. The PPE and SPEs are equipped with a DMA engine – a mechanism that enables data transfer to and from main memory and each other. The PPE can access main memory directly, but the SPEs can only directly access their own, rather limited (256KB) local store. This poses a challenge for some applications, including the Einstein@Home client application. However, compilers (e.g. IBM XLC/C++) are now available that provide a software caching mechanism allowing the SPE local store to be used as a conventional cache, thus negating the need to transfer data manually to and from main memory. Another important mechanism that allows communication between the different elements (PPE, SPEs) of the CBE is the use of mailboxes. These are special-purpose registers that can be used for uni-directional communication. They are typically used for synchronizing the computation across the SPEs and the PPE, and that is primarily how we made use of these registers as well. Details on our specific use of these various aspects of the CBEA for the Einstein@Home client application appear in the next section of this article.
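As a flavor of what such SPE-side code can look like, here is a minimal, hypothetical sketch (our own illustration, not the project's code) of the pattern used in Section 5: the SPE reads a main-memory address from its mailbox, accesses the data through an __ea-qualified pointer so that the compiler-provided software cache handles the transfers, and reports completion through the outbound mailbox. It assumes the spu_mfcio.h mailbox intrinsics and the IBM XL C/C++ (or GCC) __ea extension; the structure name and the message value are made up.

#include <spu_mfcio.h>
#include <stdint.h>

struct fstat_input;                     /* hypothetical shared data structure */

int main(void)
{
    /* Two blocking 32-bit mailbox reads deliver a 64-bit effective address. */
    uint64_t hi = spu_read_in_mbox();
    uint64_t lo = spu_read_in_mbox();

    /* The __ea qualifier tells the compiler the pointer targets main memory;
       accesses go through the software cache instead of explicit DMA. */
    __ea struct fstat_input *input =
        (__ea struct fstat_input *)((hi << 32) | lo);

    /* ... compute on *input as if it resided in the local store ... */
    (void)input;

    spu_write_out_mbox(1u);             /* made-up "work finished" message */
    return 0;
}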
5
Implementation on the Cell Broadband Engine
The F-Statistics calculation consists of multiple nested loops that can be carried out in parallel, as seen in Listing 1.1. The outer loop iterations are independent from one another, and each iteration calculates only one result, which is written to a memory location that is different for every loop iteration. The results written in an outer loop iteration are a reduction of all inner loop iterations. We parallelized the F-Statistics calculation by splitting the outer loop into equal-sized parts that can be executed in parallel. We chose this approach since it does not require any communication and still provides almost perfect work distribution, as the number of inner loop iterations varies only by a small margin. Each such part is assigned to an SPE, and each SPE calculates the results for the assigned part independently. The PPE is only used to start the SPEs and waits for them to complete the F-Statistics calculation before continuing with the Einstein@Home application. The code executed by the SPEs is identical to the original code, except for the modifications described below. We developed two F-Statistics implementations for the CBE. In our first implementation, we manually manage transfers to and from the local store of the SPEs, whereas our second implementation relies on a software cache, which must be provided by a compiler. In this section, we first describe the two implementations and
discuss the benefits of each later. We refer to the first implementation as DMA-Fstat, whereas the second implementation is called EA-Fstat. In DMA-Fstat the PPE creates multiple threads, each of which is used to control a single SPE. After the threads are created, the PPE writes the data structure addresses used by F-Statistics into the mailboxes of each SPE. This communication is also used to notify the SPEs to begin work. After the SPEs have received the addresses, they use DMA transfers to fetch all data required for the complete computation. We cannot use double buffering for all data structures, because for some of them the data needed for the calculation is computed on-the-fly. We could have implemented double buffering for some data structures, but we did not do so. It turns out that DMA-Fstat cannot be used for a full work unit, because the data needed does not fit into the SPE local store. Therefore, we did not optimize DMA-Fstat any further. Not using double buffering diminishes the possible performance gain we could achieve with this implementation. After the data is processed, the SPEs write their results back to main memory using DMA transfers and place a "work finished" message in the mailbox. The PPE waits until all SPEs have placed this message in their mailbox before the Einstein@Home client is executed any further. DMA-Fstat produces correct results for the small test case. We developed EA-Fstat so as to no longer be limited by the amount of data that can be processed. EA-Fstat relies on an SPE software cache implementation provided by, e.g., IBM's XLC/C++ compiler, which frees us from manually transferring data to the local store. We only needed to guarantee that in the SPE code all pointers pointing to main memory are qualified with the __ea qualifier. Since the data structures used by the Einstein@Home client are deeply pointer based, this modification took some time, but by itself was not very challenging. The initial communication is done identically as in DMA-Fstat, meaning that the addresses of the data structures in main memory are sent to the SPE mailboxes. These addresses are assigned to __ea qualified pointers and then used as if they pointed to locations in the SPE's local store. The synchronization of the SPEs and the PPE is done identically to that of DMA-Fstat. However, before the SPE sends out the "work finished" message in the mailbox, it writes back all data stored in the software cache. The cache write-back is done by manually calling a function. The benefit of EA-Fstat compared to DMA-Fstat is that the developer no longer has to worry about the size of the local store and can automatically benefit from a possibly larger local store in future hardware generations. Furthermore, relying on the software cache reduces the complexity of the SPE program code: DMA-Fstat requires 122 lines of code (not counting comments) consisting of memory allocation and pointer arithmetic, whereas the EA-Fstat implementation only consists of 9 library function calls that read the data structure addresses out of the mailboxes. All performance comparisons of our CBE clients were done by running the codes on a Sony PlayStation 3, which allows the use of a maximum of 6 SPEs running at a clock frequency of 3.2 GHz. The best performance in the small
Table 1. Time per sky point for the CBE client for a full work unit case

Processing elements   PPE      1 SPE    2 SPEs   3 SPEs     4 SPEs      5 SPEs     6 SPEs
Time per sky point    22 min   20 min   16 min   14.5 min   13.75 min   13.5 min   13 min
test case is achieved by using 3 SPEs, since the amount of work in this test case is rather small. We gain a factor of 3.6 in performance over the PPE-only version. To finish the small test case, DMA-Fstat requires 2:30 minutes, whereas EA-Fstat needs 2:34 minutes. The comparison of DMA-Fstat and EA-Fstat shows that the performance hit of using the software cache mechanism is only about 2.5%. The low performance loss is partly due to the fact that we did not use double buffering for DMA-Fstat, which would likely have increased its performance, and that all the data required fits into the local store of an SPE. We show the performance of our EA-Fstat solution in Table 1. When running a full work unit, the EA-Fstat client cannot outperform the PPE version as well as it did in the small data-set test case, since 50% of the runtime is used up by the Hough transformation. When running the client with 6 SPEs, the client finishes in about 59% of the runtime of the PPE-only client – we gain a factor of 1.7 in overall application performance by making use of the CBE architecture. The performance hardly seems to be limited by the size of the software cache – halving the cache size to 64KB reduces the performance by 2%. When considering the F-Statistics computation alone, the performance improved by a factor of about 5.5 upon using 6 SPEs – F-Statistics then requires less than two minutes of the overall runtime. The best overall performance of the Einstein@Home client on the CBE platform is probably achieved by running multiple clients on one system, so that the PPE's ability to run 2 software threads can be used and all SPEs are kept busy even when one client is currently executing the Hough transformation. Our experimentation suggests that one can gain an additional 30% in overall application performance in this manner. Recall that the Hough code is soon due to be replaced by an extremely efficient algorithm – when that happens, we expect to gain over a factor of 5 in the overall application performance.
6
CUDA Architecture
CUDA is a general-purpose programming system currently only available for NVIDIA GPUs and was first publicly released in late 2007. Through CUDA, the GPU (called the device) is exposed to the CPU (called the host) as a co-processor with its own memory. The device executes a function (called a kernel) in the SPMD model, which means that a user-configured number of threads runs the same program on different data. From the host viewpoint, a kernel call is identical to an asynchronous function call. Threads executing a kernel must be organized within so-called thread blocks, which may consist of up to 512 threads; multiple thread
blocks are organized into a grid, which may consist of up to 2^17 thread blocks. Thread blocks are important for algorithm design, since only threads within a thread block can be synchronized. NVIDIA suggests having at least 64 threads in one thread block and up to multiple thousands of thread blocks to achieve high performance. Threads within thread blocks can be addressed with one-, two- or three-dimensional indexes; thread blocks within a grid can be addressed with one- or two-dimensional indexes. We call the thread index threadIdx and the dimensions of the thread block x, y and z (therefore threadIdx.x is the first dimension of the thread index). We refer to the thread block index as blockIdx and use the same names for the dimensions. The high number of threads is used by the thread scheduler to hide memory access latency (see memory wall in Sect. 4) for accesses to the so-called global memory. Global memory is located on the device and can be accessed by both host and device. In contrast to host memory, global memory is not cached, so accesses to this memory cost an order of magnitude more than most other operations on the device. For example, a global memory access can cost up to 600 clock cycles, whereas an addition or the synchronization of a thread block is performed in about 4 clock cycles. Another way to hide global memory access latency is by using so-called shared memory as a cache. Shared memory is fast on-chip memory that is shared by all threads of a thread block. However, using global memory cannot be avoided, because it is the only kind of memory that can be accessed by both the host and the device. Data stored in main memory must be copied to global memory if it is needed by the device. Results of a kernel that need to be used by the CPU must be stored in global memory, and the CPU must copy them to main memory to use them. All transfers done to or from global memory are DMA transfers and have a high cost of initialization and a rather low cost for the actual data transfer itself.
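The host/device interaction just described can be sketched with a minimal, hypothetical example; the array contents, sizes and the scale() kernel are placeholders and not part of the Einstein@Home code. It shows the copy from main memory to global memory, an asynchronous kernel launch over a grid of thread blocks, and the copy back.

#include <stdlib.h>
#include <cuda_runtime.h>

/* placeholder kernel: one thread per element, 1D block and grid indexing */
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);               /* host (main) memory   */
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, bytes);                           /* device global memory */
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);     /* asynchronous kernel call */

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost); /* waits for the kernel */

    cudaFree(d);
    free(h);
    return 0;
}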
7
Implementation Using CUDA
The development of the CUDA based F-Statistics was an evolutionary process. We developed three different versions, each solving the problems that emerged in the previous version. Version 1 executes the innermost loop on the device by using one thread per loop iteration, i.e. we calculate the states (frequency, detector, threadIdx.x) with one kernel call. We can parallelize the loop this way because there are no dependencies between the loop iterations, except a reduction over all the results of the loop iterations. The reduction is done by storing all results in shared memory and using one thread to sum up the results and store them back into global memory. Our implementation works because the number of loop iterations of the inner loop is always smaller than the maximum number of threads allowed within a thread block. We could easily implement a parallel reduction instead, e.g. NVIDIA provides a parallel reduction example [3] that could be used. However, these calculations are not the performance bottleneck for version 1.
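A hypothetical sketch of this version-1 style per-block reduction is given below; the per-iteration computation and all names are placeholders, and the serial summation by thread 0 mirrors the simple scheme described above (not NVIDIA's parallel reduction example).

/* Sketch: each thread computes one inner-loop iteration, partial results
 * go to shared memory, and thread 0 accumulates the block's sum. */
__global__ void inner_loop_v1(const float *in, float *result, int n)
{
    __shared__ float partial[512];          /* one slot per thread (<= 512) */
    int tid = threadIdx.x;

    partial[tid] = (tid < n) ? in[tid] * in[tid] : 0.0f;  /* placeholder work */
    __syncthreads();                        /* make all partials visible */

    if (tid == 0) {                         /* serial reduction by one thread */
        float sum = 0.0f;
        for (int i = 0; i < blockDim.x; ++i)
            sum += partial[i];
        *result = sum;                      /* store back into global memory */
    }
}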
Version 1 requires about 70 times the runtime of the original Einstein@Home client, and 95% of this time is used for memory management. This results from the fact that the data structures used by Einstein@Home consist of a large number of small memory blocks, and version 1 copies all these memory blocks to global memory one after another. Since all these memory transfers have a rather high cost of initialization, we cannot achieve high performance with this solution. Our second implementation continues to calculate the innermost loop of the F-Statistics calculation in parallel, but uses a new data structure, which solves the performance problems when copying the data. Our new data structure is an aggregation of a small number of arrays. We group the arrays into two types. One type is called data arrays – these store the data from the original Einstein@Home data structures. The Einstein@Home data is copied into the data arrays one after another. The second array type is called offset arrays; these are used to identify the original memory blocks inside the data arrays, by storing the starting point of each previously independent memory block inside the data array. By using this data structure, the performance drastically improved to about the same level as that of the CPU based application. Since version 2 still only calculates (frequency, detector, threadIdx.x) with one kernel call and thereby only uses one thread block, whereas the device is designed to run multiple thread blocks at once, most of the processing power of the device is unused. Version 3 uses multiple thread blocks to calculate all three loops of F-Statistics in parallel. A single loop iteration of the outer loop is calculated by one thread block, whereas each iteration of the inner loops is calculated by one thread. More formally speaking, version 3 calculates (blockIdx.x, threadIdx.y, threadIdx.x) with one kernel call. We chose this approach because the number of loop iterations of the inner loops is always less than the maximum number of threads allowed inside one thread block. This approach allows us to easily implement the reduction that is executed on all results of the innermost loops, since we can synchronize all threads within the thread block. We continue to use the reduction introduced in version 1 in this version. The number of threads within one thread block is identical for all thread blocks of a kernel call, however the number of loop iterations done by the inner loops depends on the index of the outer loop. Therefore, we have to identify the maximum number of iterations of the inner loops on the host and use this maximum number for the kernel call. This approach results in idle threads for all thread blocks that have more threads than loop iterations that must be calculated. Our evaluation shows that there is only a small fraction of idle threads – typically, 6 threads per thread block, with the maximum being 10. This is our final implementation of the Einstein@Home CUDA application. The development of all three versions was done on a device that did not support double-precision floating point operations, which are required by Einstein@Home to produce correct results. This client, therefore, does not use double-precision and the results are of no scientific value.
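The data-array/offset-array restructuring introduced for version 2 above can be sketched as follows; field and function names are illustrative placeholders, and the caller is assumed to have allocated the destination arrays.

/* One contiguous data array holds all original small memory blocks back to
 * back, and the offset array records where each original block starts, so
 * the whole structure can be moved to global memory in a few large copies. */
typedef struct {
    float *data;     /* all original blocks, packed contiguously     */
    int   *offset;   /* offset[k] = start index of original block k  */
    int    n_blocks;
    int    n_total;  /* total number of floats stored in 'data'      */
} packed_blocks_t;

static void pack_blocks(packed_blocks_t *p, float **blocks,
                        const int *lengths, int n_blocks)
{
    int pos = 0;
    for (int k = 0; k < n_blocks; ++k) {
        p->offset[k] = pos;                        /* remember block start */
        for (int i = 0; i < lengths[k]; ++i)
            p->data[pos++] = blocks[k][i];         /* copy block contents  */
    }
    p->n_blocks = n_blocks;
    p->n_total  = pos;
}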
Table 2. Time measurements of the CUDA based client

test case              CPU         GPU
small                  2:19 min    1:22 min
full (per sky point)   11:00 min   7:00 min
Upon running our implementation on a test system with a GeForce GTX 280 and two AMD Opteron 270 processors (2 GHz), the CUDA version performs about 1.7 times as fast as the CPU version for the small test case. When run on a full work unit, the CUDA client performs about 1.6 times as fast as the CPU version. Note that both the CPU and the GPU version only use one core when running the Einstein@Home client. Considering only F-Statistics, the performance was improved by a factor of about 3.5 – the F-Statistics calculation only takes about 1.5 minutes.
8
Comparison of CUDA and the CBE
One of the most important factors when developing for the CBEA is the size of the local store, especially when the developer manually manages the data in the local store. The data required by our application does not fit in the local store. An easy way to overcome this is to use a software cache – however, by using this technique, the developer loses the chance to explicitly utilize double buffering, which is one of the most important benefits of the CBEA. Using CUDA required us to change the underlying data structure of our application. In our case the data structure redesign was rather easy, however in other cases this may be more problematic. The main benefit of using CUDA is the hardware, which increases performance at a much higher rate than multi-core CPUs or the CBEA. Furthermore, software written for CUDA can easily scale with new hardware. However, writing software that achieves close-to-the-maximum performance with CUDA is very different when compared to most other programming systems, since thread synchronization or calculations are typically not the time-consuming parts, but memory accesses to global memory are. We did not optimize any of our implementations to the maximum possible performance; instead we invested a similar amount of development time into both the CBE and the CUDA client. Therefore, our performance measurements should not be considered as what is possible with the hardware, but rather what performance can be achieved within a reasonable development time-frame.
9
Related Work
Scherl et al. [4] compare CUDA and the CBEA for the so-called FDK method, for which CUDA seems to be the better option. Christen et al. [5] explore the usage of CUDA and the CBEA for stencil-based computation, which however does not give a clear winner. In contrast, our work does not strive for the maximum possible performance and relies on very low cost hardware.
10
Conclusion
The main outcome of our work presented in this article is that new upcoming architectures such as the CBEA and CUDA have strong potential for significant performance gains in scientific computing. In our work, we focused on a specific data-analysis application from the gravitational physics community, called Einstein@Home. Using these architectures, we successfully accelerated several-fold one of the two computationally intensive routines of the Einstein@Home client application. Our final CBEA and CUDA implementations yield comparable performance and both architectures appear to be a good match for the Einstein@Home application.
Acknowledgment The authors would like to thank NVIDIA and Sony for providing the hardware that was used for development, testing and benchmarking our codes. Gaurav Khanna would also like to acknowledge support from the National Science Foundation (grant number: PHY-0831631).
References 1. Anderson, D.P.: Boinc: A system for public-resource computing and storage. In: GRID 2004: Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing, Washington, DC, USA, pp. 4–10. IEEE Computer Society, Los Alamitos (2004) 2. Breitbart, J.: Case studies on gpu usage and data structure design. Master’s thesis, University of Kassel (2008) 3. NVIDIA Corporation: CUDA parallel reduction (2007), http://developer.download.nvidia.com/compute/cuda/1 1/Website/ Data-Parallel Algorithms.html#reduction 4. Scherl, H., Keck, B., Kowarschik, M., Hornegger, J.: Fast GPU-Based CT Reconstruction using the Common Unified Device Architecture (CUDA). In: Frey, E.C. (ed.) Nuclear Science Symposium, Medical Imaging Conference 2007, NSS 2007. Nuclear Science Symposium Conference Record, vol. 6, pp. 4464–4466. IEEE, Los Alamitos (2007) 5. Christen, M., Schenk, O., Messmer, P., Neufeld, E., Burkhart, H.: Accelerating Stencil-Based Computations by Increased Temporal Locality on Modern Multi- and Many-Core Architectures (2008)
Introducing the Semi-stencil Algorithm

Raúl de la Cruz, Mauricio Araya-Polo, and José María Cela

Barcelona Supercomputing Center, 29 Jordi Girona, 08034 Barcelona, Spain
{raul.delacruz,mauricio.araya}@bsc.es
Abstract. Finite Difference (FD) is a widely used method to solve Partial Differential Equations (PDE). PDEs are the core of many simulations in different scientific fields, e.g. geophysics, astrophysics, etc. The typical FD solver performs stencil computations for the entire 3D domain, thus solving the differential operator. This computation consists of accumulating the contribution of the neighbor points along the cartesian axes. It is performance-bound by two main problems: the memory access pattern and the inefficient re-utilization of the data. We propose a novel algorithm, named "semi-stencil", that tackles those two problems. Our first target architecture for testing is Cell/B.E., where the implementation reaches 12.4 GFlops (49% of peak performance) per SPE, while the classical stencil computation only reaches 34%. Further, we successfully apply this code optimization to an industrial-strength application (Reverse-Time Migration). These results show that semi-stencil is a useful stencil computation optimization. Keywords: Stencil computation, Reverse-Time Migration, High Performance Computing, Cell/B.E., heterogeneous multi-core.
1
Introduction
Astrophysics [1], Geophysics [2], Quantum Chemistry [3] and Oceanography [4] are examples of scientific fields where large computer simulations are frequently deployed. These simulations have something in common: the mathematical models are represented by Partial Differential Equations (PDE), which are mainly solved by the Finite Difference (FD) method. Large simulations may consume days of supercomputer time; if they are PDE+FD based, most of this execution time is spent in stencil computation. For instance, Reverse-Time Migration (RTM) is a seismic imaging technique in geophysics, where up to 80% of the RTM kernel execution time [5] is spent in the stencil computation. In this paragraph, we first review the stencil computation characteristics, then we identify its main problems. Basically, the stencil central point accumulates the contribution of neighbor points along every axis of the cartesian system. This operation is repeated for every point in the computational domain, thus solving the spatial differential operator, which is the most computationally expensive segment of a PDE+FD solver. We identify two main problems:
– First, the non-contiguous memory access pattern. In order to compute the central point of the stencil, a set of neighbors has to be accessed; some of these neighbor points are far in the memory hierarchy, thus paying many cycles in latencies. Furthermore, depending on the stencil order (associated with the number of neighbors that contribute to the spatial differential operator) this problem becomes even more important, because the required data points are even more expensive to access.
– Second, the low computation/access and re-utilization ratios. After gathering the set of data points, just one central point is computed, and only some of those accessed data points will be useful for the computation of the next central point.
We introduce the semi-stencil algorithm, which improves the stencil computation by tackling the above mentioned problems. We remark that this algorithm changes the structure of the stencil computation, but it can nevertheless be generally applied to most stencil based problems. The semi-stencil algorithm computes half the contributions required by a central point, but for two central points at the same time: with one half a stencil computation is completed, and the other half pre-computes the stencil computation of another point. This algorithm reduces the number of neighboring points loaded per computation; this is achieved by accessing just the points required to compute half the stencil. This helps with the first problem of the stencil computation. At every step of the algorithm half of the stencil for the next step is pre-computed. The number of floating point operations remains the same, but because the number of loads is reduced the computation-access ratio is higher, thus tackling the second problem of the stencil computation. Currently, the GHz race for higher frequencies has slowed down due to technological issues. Therefore, the main source of performance comes from the exploitation of multi-core architectures. Among such architectures, the Cell/B.E. (our target architecture) exhibits appealing features such as energy efficiency and remarkable computing power, which is demanded by the applications which rely on stencil computations. The remainder of this paper is organized as follows: Section 2 introduces the classical stencil algorithm and its problems. In Section 3 we review the principal techniques that help to cope with the stencil problem. Section 4 introduces the novel semi-stencil algorithm, its internals, features and considerations. In Section 5, we evaluate the performance. Finally, Section 6 presents our conclusions.
2
The Stencil Problem
The stencil computes the spatial differential operator, which is required to solve a PDE+FD. A multidimensional grid (often a huge 3D data structure) is traversed, where the elements are updated with weighted contributions. Figure 1(b) depicts the classical generic stencil structure. The parameter ℓ represents the number of neighbors to be used in each direction of the cartesian axes.
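For reference, a plain C rendering of this classical stencil structure could look as follows; it is an illustrative sketch (the array layout, the symmetric coefficients shared across axes and all names are assumptions made for brevity), not the code evaluated later in this paper.

/* Classical ℓ-neighbor 3D stencil: every point accumulates weighted
 * contributions from l neighbors along each cartesian axis.  Z (index k)
 * is the unit-stride dimension, as in Figure 1(a). */
void classical_stencil(float *Xt, const float *Xt1, const float *C,
                       int nx, int ny, int nz, int l)
{
    #define IDX(i, j, k) (((i) * ny + (j)) * nz + (k))
    for (int i = l; i < nx - l; ++i)
      for (int j = l; j < ny - l; ++j)
        for (int k = l; k < nz - l; ++k) {
            float acc = C[0] * Xt1[IDX(i, j, k)];              /* self-contribution */
            for (int s = 1; s <= l; ++s)
                acc += C[s] * (Xt1[IDX(i - s, j, k)] + Xt1[IDX(i + s, j, k)]   /* X */
                             + Xt1[IDX(i, j - s, k)] + Xt1[IDX(i, j + s, k)]   /* Y */
                             + Xt1[IDX(i, j, k - s)] + Xt1[IDX(i, j, k + s)]); /* Z */
            Xt[IDX(i, j, k)] = acc;
        }
    #undef IDX
}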
Two major problems can be inferred from this computation. The first one is the sparse memory access pattern [6,7]. Data is stored in Z-major form, therefore accesses across the other two axes (X and Y) may be significantly more expensive (see Figure 1(a)) latency-wise. The parameter ℓ has a direct impact on this problem: the larger the ℓ value, the more neighbors of each axis have to be loaded to compute X^t_{i,j,k}. The points of the Z axis are stored sequentially in memory, and the cost to fetch them is low.
Fig. 1. (a) The memory access pattern for a 7-point stencil. The sparsity of the required data to compute the stencil is higher in the last axis. (b) The generic stencil structure.
The second problem has two faces: the low floating-point instruction to memory access ratio, and the poor reuse of the accessed data [8]. We state a simple metric to expose these two faces: FP/Mem. This ratio is the result of dividing the number of floating-point instructions (Multiply-Adds) by the loads and stores during one X^t computation step.

FP/Mem_{Classical} = FloatingPoint Instructions / Memory Accesses = MultiplyAdd Instructions / (X^{t-1} Loads + X^t Stores) = (2 ∗ dim ∗ ℓ + 1) / ((2 ∗ (dim − 1) ∗ ℓ + 1) + 1) = (2 ∗ dim ∗ ℓ + 1) / (2 ∗ dim ∗ ℓ − 2 ∗ ℓ + 2)    (1)
Equation 1 states this metric for the classical stencil. dim is the number of dimensions of our PDE, where for each dimension 2 ∗ ℓ Multiply-Add instructions are required. Also, one extra Multiply-Add instruction must be considered for
the self-contribution (X^{t-1}_{i,j,k}). The number of loads needed to compute the stencil changes depending on the axis. Those dimensions that are not the stride direction (X and Y) require 2 ∗ ℓ loads each step iteration. On the other hand, the stride direction requires only 1 load, because the remaining loads can be reused from previous iterations. Finally, 1 store for saving the result (X^t_{i,j,k}) is needed. The FP/Mem ratio depends on the variables dim and ℓ; taking into account that ℓ is the only variable that may grow, the ratio tends to 1.5, which is very poor. This result, along with previous research [9,10,8], shows that stencil computation is usually memory-bound. In other words, it is not possible to feed the computation with enough data to keep a high arithmetic throughput, thus the performance of this kind of algorithm is limited. These concerns force us to pay special attention to how data is accessed. It is crucial to improve the memory access pattern (reducing the overall amount of data transfers), and to exploit the memory hierarchy as much as possible (reducing the overall transfer latency). The next section reviews the main approaches to the problem found in the literature.
3
State of the Art
Most of the contributions in the stencil computation field can be divided into three dissimilar groups: space blocking, time blocking and pipeline optimizations. The first two are related to blocking strategies widely used on cache architectures, while the last one is used to improve the performance at the pipeline level.
Space Blocking
Space blocking algorithms try to minimize the access pattern problem of stencil computations. Space blocking is especially useful when the data structure does not fit in the memory hierarchy. The most representative algorithms of this kind are: tiling or blocking [9] and circular queue [11].
Time Blocking
Time blocking algorithms perform loop unrolling over the time steps to exploit as much as possible the data already read, therefore increasing the data reuse. Time blocking may be used where there is no other computation between stencil sweeps (such as boundary conditions, communication or I/O) for which a time dependency exists. Time blocking optimizations can be divided into explicit and implicit algorithms: time-skewing [10] and cache-oblivious [8].
Pipeline Optimizations
Low level optimizations include well-known techniques suitable for loops, like unrolling, fission, fusion, prefetch and software pipelining [12,13]. All those techniques have been successfully used for a long time in many other computational fields to improve the processor throughput and decrease the CPI (Cycles per Instruction).
4
The Semi-stencil Algorithm
The semi-stencil algorithm changes in a noticeable way the structure of the stencil computation as well as its memory access pattern. This new computation structure (depicted in Figure 2) consists of two phases, forward and backward, which are described in detail in Section 4.1. Besides, head and tail computations are described in the last part of this section. The semi-stencil tackles the problems described in Section 2 in the following ways:
– It improves data locality, since less data is required on each axis per iteration. This may have an important benefit for cache hierarchy architectures, where the non-contiguous axes of the 3D domain are more expensive latency-wise.
– The second effect of this new memory access pattern is the reduction of the total amount of loads per inner loop iteration, while keeping the same number of floating-point operations. Thereby it increases the data re-utilization and the FP/Mem ratio. We calculate this ratio for the semi-stencil as follows:
FP/Mem_{Semi} = MultiplyAdd Instructions / (X^{t-1} Loads + X^t Stores) = (2 ∗ dim ∗ ℓ + 1) / (dim ∗ ℓ − ℓ + dim + 2)    (2)
In Equation 2, and regarding the data reuse in the stride dimension, the number of loads for X^{t-1} has decreased substantially, almost by a factor of 2. Due to the reduced number of loads, fewer cycles are needed to compute the internal loop; it also has the benefit of using fewer registers per iteration. This gives a chance to perform more aggressive low level optimizations, and to avoid instruction pipeline stalls. As shown in Figure 2.Left, the semi-stencil algorithm updates two points per step by reusing X^{t-1} loads, but at the same time doubles the number of X^t stores needed per step. Depending on the architecture, this could be a problem and a source of performance loss, for instance on cache hierarchy architectures with a write-allocate policy. Nowadays, some cache hierarchy architectures implement cache-bypass techniques as a workaround for this problem. On the other hand, scratchpad memory based (SPM) architectures, like Cell/B.E., are immune to this problem. Therefore, mapping the semi-stencil on the Cell/B.E. architecture should not be affected by this issue. It is worth pointing out that the semi-stencil algorithm can be applied to any axis of a 3D stencil computation. This means that it can be combined with the classical stencil algorithm. Moreover, this novel algorithm is independent of any other optimization technique, like blocking or pipeline optimizations. In fact, we can stack the semi-stencil algorithm on top of any other known technique. In the following sections we will elaborate on the semi-stencil's two phases.
4.1
Forward and Backward Update
Forward update is the first contribution that a point receives at time-step t. In this phase, when step i of the sweep-direction axis is being computed, the point X^t_{i+ℓ} is updated (producing X'^t_{i+ℓ}) with the t − 1 rear contributions (Figure 2.Left.a and Equation 3 depict this operation). In the following equations, the prime character denotes a point that has been partially computed.

X'^t_{i+ℓ} = C_1 ∗ X^{t-1}_{i+ℓ-1} + C_2 ∗ X^{t-1}_{i+ℓ-2} + ··· + C_{ℓ-1} ∗ X^{t-1}_{i+1} + C_ℓ ∗ X^{t-1}_i    (3)

X^t_i = X'^t_i + C_0 ∗ X^{t-1}_i + C_1 ∗ X^{t-1}_{i+1} + C_2 ∗ X^{t-1}_{i+2} + ··· + C_{ℓ-1} ∗ X^{t-1}_{i+ℓ-1} + C_ℓ ∗ X^{t-1}_{i+ℓ}    (4)

Fig. 2. Left: Detail of the two phases of the semi-stencil algorithm at step i: a) forward update on the X^t_{i+ℓ} point and b) backward update on the X^t_i point. Notice that the X^{t-1}_{i+1} to X^{t-1}_{i+ℓ-1} loads can be reused for both phases. Right: Execution example of the semi-stencil algorithm in a 1D problem, where ℓ = 4. F stands for a forward update and B stands for a backward update.
In the backward update, the X'^t_i point, computed in a previous forward step, is completed with the front contributions of the axis (points X^{t-1}_{i+1} to X^{t-1}_{i+ℓ}), see Figure 2.Left.b and Equation 4. Remark that only 1 load is required, since almost all data points of time-step t − 1 were loaded in the forward update. Therefore, this phase needs 1 load for the X^{t-1}_{i+ℓ} point, 1 store for the X^t_i value and finally ℓ + 1 Multiply-Add instructions (ℓ neighbors + self-contribution).
4.2
Head, Body and Tail Computations
The FD methods require interior points (inside the solution domain) and ghost points (outside the solution domain). To obtain the correct results on border interior points, the algorithm must be split into three different parts: head, body
and tail. The head segment updates the first ℓ interior points with the rear contributions (forward phase). In the body segment, the interior points are updated with neighbor interior elements (forward and backward phases). Finally, in the tail segment, the last ℓ interior points of the axis are updated with the front contributions (backward phase). Figure 2.Right shows an example execution of the algorithm with the three segments.
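A minimal 1D sketch of the body segment, written directly from Equations 3 and 4, is shown below; head and tail handling are omitted and all names are illustrative, so this is not the authors' optimized Cell/B.E. code.

/* Semi-stencil body phase in 1D: at step i the forward phase pre-computes
 * the rear half of point i+l (Eq. 3), and the backward phase completes
 * point i with its self and front contributions (Eq. 4). */
void semi_stencil_1d_body(float *Xt, const float *Xt1, const float *C,
                          int n, int l)
{
    for (int i = l; i < n - l; ++i) {
        /* forward phase: partial X'^t[i+l] from rear neighbors */
        float fwd = 0.0f;
        for (int s = 1; s <= l; ++s)
            fwd += C[s] * Xt1[i + l - s];
        Xt[i + l] = fwd;                 /* completed l iterations later */

        /* backward phase: finish X^t[i] with self and front contributions */
        float bwd = C[0] * Xt1[i];
        for (int s = 1; s <= l; ++s)
            bwd += C[s] * Xt1[i + s];
        Xt[i] += bwd;
    }
}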
5
Performance Evaluation and Analysis
We evaluate the performance of the semi-stencil algorithm in two steps: as a single-SPE implementation, and then by integrating the implementation into a numerical kernel of a real-life scientific application. The purpose of the first implementation is to show the local behavior of the algorithm; the purpose of the second implementation is twofold: to measure the expected positive impact of the algorithm on a real application, and to expose the impact of the architecture on the algorithm performance. After the evaluation, we analyze the results taking the formulas from Sections 2 and 4 into account. As we have seen in Section 3, there exist other techniques to deal with stencil computations. Time blocking methods, due to their inherent time dependency, may pose integration problems (boundary conditions, communication, I/O) with scientific numerical kernels. Also, space blocking is not useful on SPM architectures like the Cell/B.E., due to the constant memory access cost. Before presenting the results, we briefly review the Cell/B.E. architecture. Each Cell/B.E. processor on a QS22 blade contains a general-purpose 64-bit PowerPC-type PPE (3.2 GHz) with 8 GB of RAM. The SPEs have a 128-bit wide SIMD instruction set, which allows processing 4 single-precision floating-point operands simultaneously. Also, the SPEs have a software-based SPM called the Local Store (LS) of 256 KB. The 8 SPEs are connected with the PPE by the Element Interconnect Bus with 200 GB/s of peak bandwidth, but the bandwidth to main memory is only 25.6 GB/s.
Single-SPE Implementation
In the first porting of our semi-stencil algorithm to the Cell/B.E. we focus only on the SPE's code; the PPE's code and the techniques required to make the main memory transfers efficient are taken into account in our second semi-stencil implementation. Our implementation is fully hand-coded and SIMDized (as is the classical stencil code). The classical stencil implementation takes advantage of optimization techniques such as software prefetching, software pipelining and loop unrolling. The following results are expressed in terms of elapsed time and floating-point operations. Notice that the results were computed in single precision arithmetic, and the stencil size (ℓ) is 4. Our purpose with these results is to compare both implementations. Also, the experiments aim to situate the proposed algorithm with respect to the peak performance of the Cell/B.E. architecture.
Table 1. Performance results for the single SPE implementations. These results were obtained with IBM XL C/C++ for Multi-core Acceleration for Linux, V9.0 and GCC 4.1.1. No auto-vectorization flags were used; all codes were vectorized by hand. Stencil size ℓ = 4.

Compiler (Optimization)   Classical Stencil             Semi-stencil
                          Time [ms]   Perf. [GFlops]    Time [ms]   Perf. [GFlops]
XLC -O3                   6.84        4.32              3.35        8.83
XLC -O5                   3.44        8.61              2.38        12.44
GCC -O3                   7.83        3.78              4.57        6.47
The peak performance achieved by the semi-stencil is 12.44 GFlops (Table 1), which corresponds to 49% of the SPE peak performance. Under the same experimental setup the classical stencil reaches 8.61 GFlops (34% of the SPE peak performance). This means that the semi-stencil algorithm is 44% faster than the classical stencil. The projected aggregated performance of this algorithm is 99.52 GFlops for one Cell/B.E., which is, to the best of our knowledge [11], the fastest stencil computation on this architecture.

Table 2. Pipeline statistics for the single SPE implementations. Obtained with the IBM Full-System Simulator.

                              Classical Stencil   Semi-stencil   Semi-stencil gain
Total cycle count [cycles]    11460147            7592643        33.7%
CPI [cycles/instruction]      0.69                0.75           -9%
Load/Store instr. [instr.]    4236607             2618079        38.2%
Floating point instr.         5800000             4652400        19.8%
Ratio FP/Mem                  1.36                1.78           24.4%
In Table 2 every metric is in the semi-stencil's favor, except the CPI measure, which is 9% better for the classical stencil algorithm. In any case, the most important metric, and the one that summarizes the performance, is the FP/Mem ratio. This ratio is 24% better for the semi-stencil than for the classical stencil computation.
Real-life Scientific Application Implementation
We integrate the semi-stencil implementation with a Reverse-Time Migration (RTM) implementation [5]. RTM is the tool of choice when complex subsurface areas are intended to be clearly defined. RTM has proven to be very useful for the subsalt oil discoveries of the Gulf of Mexico. The core of the RTM is an acoustic wave propagation (PDE+FD) solver, which suits as a test case for our proposed algorithm. The RTM+semi-stencil is 16% faster than RTM+classical, which partially (44% in the ideal case, see Table 1) keeps the performance distance presented
Table 3. Performance results for the RTM implementations. These results were obtained with GCC 4.1.1, using only one Cell/B.E. of a QS22 blade. Each experiment carried 500 steps for a 512x512x512 data set. Total time experiments cover both computation and communication times. Stencil size ℓ = 4.

Algorithm          Total                 Only Computation      Only Communication   Unbalance
                   Time [s]   [GFlops]   Time [s]   [GFlops]   Time [s]             %
RTM+classical      74.24      39.15      71.92      41.06      70.33                2.3
RTM+semi-stencil   63.33      46.62      48.13      61.35      64.62                25.6
in the previous tests. This is because RTM and the data transfers introduce extra complexity to the implementation; RTM has several segments that demand computational resources, e.g. boundary conditions, I/O, etc. As can be seen in Table 3, the unbalance between computation and communication (data transfers from/to main memory to/from the LS) thwarts the performance. This is especially hard on the RTM+semi-stencil implementation, which loses up to 25.6% of performance, even after using the multi-buffering technique. If we add this 25.6% to the already gained 16%, we approximately recover the 44% advantage of the semi-stencil over the classical stencil algorithm.
Analysis
In this section, given that the experimental results show a clear lead for the semi-stencil algorithm, we confront these results with the theoretical estimation based on the FP/Mem ratio formulas of Sections 2 and 4.
Fig. 3. Projected FP/Mem ratio for different stencil sizes ℓ. Notice that the stencil sizes in the central segment are the most commonly used.
From Figure 3, the theoretical FP/Mem ratio for the classical stencil algorithm is 1.39, while in Table 2 it is 1.36. In the semi-stencil case, the theoretical FP/Mem
ratio is 1.92, while in Table 2 it is 1.78. These figures show that our model of FP/Mem is robust and reliable. Also, both implementations have room for improvement, although for the classical stencil it may only represent 2.1%. The implementation of the semi-stencil may improve by up to 7.3%, but due to its logic complexity this is hard to achieve.
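As a quick check, the quoted theoretical values follow directly from Equations 1 and 2 when evaluated for the 3D, ℓ = 4 configuration used in these experiments:

FP/Mem_{Classical} = (2 ∗ 3 ∗ 4 + 1) / (2 ∗ 3 ∗ 4 − 2 ∗ 4 + 2) = 25 / 18 ≈ 1.39
FP/Mem_{Semi} = (2 ∗ 3 ∗ 4 + 1) / (3 ∗ 4 − 4 + 3 + 2) = 25 / 13 ≈ 1.92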
6
Conclusions and Future Work
In this paper we have presented a new generic strategy for finite difference stencil computations. This new algorithmic approach, called semi-stencil, is especially well suited for scientific applications on heterogeneous architectures, like Cell/B.E., where a low latency SPM exists. Semi-stencil has shown two benefits compared to the classical stencil algorithm:
– It reduces the minimum data working set, reducing the required space in the SPM for each processing unit. This effect becomes more critical for high order stencils.
– It increases data locality, by reducing the number of loads but increasing the number of stores. In a heterogeneous system, these additional stores are executed in the processing unit's SPM, removing the negative impact on the global performance.
For a PDE+FD scheme, the best implementations of the classical stencil computation are typically able to reach 30% of the processor peak performance. Under the same conditions, the semi-stencil algorithm achieves up to 49% of the peak performance. Also, this improvement revamps the performance of already developed code, such as our RTM application. Future work will focus on researching this novel algorithm on cache-based architectures, where the write policy of the cache hierarchy may have a negative impact on the performance due to the increased number of stores.
References 1. Brandenburg, A.: Computational aspects of astrophysical MHD and turbulence, vol. 9. CRC, Boca Raton (April 2003) 2. Operto, S., Virieux, J., Amestoy, P., Giraud, L., L’Excellent, J.Y.: 3D frequencydomain finite-difference modeling of acoustic wave propagation using a massively parallel direct solver: a feasibility study. In: SEG Technical Program Expanded Abstracts, pp. 2265–2269 (2006) 3. Alonso, J.L., Andrade, X., Echenique, P., Falceto, F., Prada-Gracia, D., Rubio, A.: Efficient formalism for large-scale ab initio molecular dynamics based on timedependent density functional theory. Physical Review Letters 101 (August 2008) 4. Groot-Hedlin, C.D.: A finite difference solution to the helmholtz equation in a radially symmetric waveguide: Application to near-source scattering in ocean acoustics. Journal of Computational Acoustics 16, 447–464 (2008)
5. Araya-Polo, M., Rubio, F., Hanzich, M., de la Cruz, R., Cela, J.M., Scarpazza, D.P.: 3D seismic imaging through reverse-time migration on homogeneous and heterogeneous multi-core processors. Scientific Programming: Special Issue on the Cell Processor 16 (December 2008) 6. Kamil, S., Husbands, P., Oliker, L., Shalf, J., Yelick, K.: Impact of modern memory subsystems on cache optimizations for stencil computations. In: MSP 2005: Proceedings of the 2005 Workshop on Memory System Performance, pp. 36–43. ACM Press, New York (2005) 7. Kamil, S., Datta, K., Williams, S., Oliker, L., Shalf, J., Yelick, K.: Implicit and explicit optimizations for stencil computations. In: MSPC 2006: Proceedings of the 2006 Workshop on Memory System Performance and Correctness, pp. 51–60. ACM, New York (2006) 8. Frigo, M., Strumpen, V.: Cache oblivious stencil computations. In: 19th ACM International Conference on Supercomputing, pp. 361–366 (June 2005) 9. Rivera, G., Tseng, C.W.: Tiling optimizations for 3D scientific computations. In: Proc. ACM/IEEE Supercomputing Conference (SC 2000), p. 32 (November 2000) 10. Wonnacott, D.: Time skewing for parallel computers. In: Carter, L., Ferrante, J. (eds.) LCPC 1999. LNCS, vol. 1863, pp. 477–480. Springer, Heidelberg (2000) 11. Datta, K., Murphy, M., Volkov, V., Williams, S., Carter, J., Oliker, L., Patterson, D., Shalf, J., Yelick, K.: Stencil computation optimization and auto-tuning on stateof-the-art multicore architectures. In: SC 2008: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, Piscataway, NJ, USA, pp. 1–12. IEEE Press, Los Alamitos (2008) 12. Allan, V.H., Jones, R.B., Lee, R.M., Allan, S.J.: Software pipelining. ACM Comput. Surv. 27(3), 367–432 (1995) 13. Callahan, D., Kennedy, K., Porterfield, A.: Software prefetching. In: ASPLOSIV: Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 40–52. ACM, New York (1991)
Astronomical Period Searching on the Cell Broadband Engine

Maciej Cytowski1, Maciej Remiszewski2, and Igor Soszyński3

1 Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw
2 IBM Deep Computing, Central and Eastern Europe
3 Department of Observational Astrophysics, Faculty of Physics, University of Warsaw
Abstract. Period searching is widely used in astronomy for analyzing observatory data. Computer programs used for these tasks can be optimized to work on novel computer architectures, such as the Cell Broadband Engine. In this article we present performance results of one of these programs on the Cell processor. Important programming steps and performance comparisons are shown and discussed. Keywords: Period searching, Cell architecture, astrophysics.
1
Introduction
The Cell Broadband Engine is a novel and pioneering processor architecture. It allows for a very interesting study of inside-chip parallelization over short vector SIMD cores. There are many codes whose performance can benefit from the Cell BE specific architecture. In this paper we describe the Cell BE implementation of an astronomical period-searching program used in operational computations in the Optical Gravitational Lensing Experiment [1]. The scientific results obtained with the use of the Cell-accelerated computer program were presented in [2], [3] and [4]. The original code was ported using the Cell SDK and its SPE library in a relatively short time. We achieved an overall speedup of 9.68 over current workstations. In the course of our work we encountered many important aspects of Cell programming and learned how to identify codes whose performance could benefit from this architecture. We are currently working on porting two other astrophysical applications to Cell. The presented work can serve as a foundation for creating a general framework for period searching on the Cell processor. The complexity of such tasks is often induced by the vast amount of data to be processed. Parallelization can therefore easily be achieved by partitioning the data across available compute cores. Astronomical period searching is a good example here: computations can be realized on a number of cores independently for each individual star under consideration.
Moreover the possibility of vector processing could easily be exploited for these computations. Time comparisons presented here show that vector implementations on short vector SIMD cores in current processor architectures are approximately 2.4 times faster than their scalar counterparts.
2
Optical Gravitational Lensing Experiment
The Optical Gravitational Lensing Experiment (OGLE) project is a continuous effort devoted mainly to the search for dark matter using microlensing phenomena, as originally proposed by Paczyński [5,6]. As described in great detail in [1], long term observations of the Magellanic Clouds (MC) and the Galactic Bulge (GB) are conducted from the observatory in Las Campanas, Chile. Photometric data is collected via a CCD camera over extensive periods to allow for both detection and collection of statistically significant samples of microlensing events. The result is a vast amount of experimental data, reaching 4 TB per year with OGLE-III. An annual total of 10 billion photometric measurements record light curves for 200 million individual stars from the Large and Small Magellanic Clouds (LMC and SMC respectively) as well as the GB. In further data processing each light curve is analysed for distinctive patterns which can indicate a microlensing event taking place. Additional improvement to the search method used by OGLE came from incorporating the Transit Method, as proposed in [7]. Since 1992, throughout the three phases of the OGLE project (OGLE, OGLE-II and OGLE-III), 13 planets outside our solar system have been discovered, six of which were found using the microlensing and seven the transit method. While the search for planets and other forms of dark matter is the leading target of the OGLE project, the data collected in the process is very well suited for identification of different kinds of variable stars. The light curve analysis for periodicity is performed by the Fnpeaks code written by W. Hebisch, Z. Kolaczkowski and G. Kopacki, and is a process performed for each star individually. The work described in this paper relates to this very procedure for the OGLE-III experimental data. Because of the huge amount of processing to be performed, the OGLE team requested the support of ICM in conducting the calculations at their facility. At the same time, due to an ongoing collaboration between ICM and IBM, it was decided to port the Fnpeaks code to the Cell Broadband Engine Architecture and use Cell BE based machines in the data processing as well. The scientific results were presented in separate publications in [2], [3] and [4]. The purpose of this paper is to describe how the Fnpeaks code has been ported and tuned on the Cell BE and how processing performance on this architecture compares to x86. It is worth noticing that while a large amount of data has already been processed, more input is still to follow, enabling the Cell Broadband Engine Architecture to continue to accelerate processing of experimental data for OGLE.
3
Porting Process
The parallelism on the Cell BE processor can be exploited at multiple levels. Each Cell BE chip has eight SPEs with two-way instruction-level parallelism. Moreover, each SPE supports single instruction multiple data (SIMD) operations [10]. The Power Processing Unit of the Cell BE processor is usually used as a management thread. It can read and preprocess data, create SPE contexts and then run an appropriate SPE program. The SPE program can copy data from main memory to the Local Store through the Element Interconnect Bus. After computation it can store results back into main memory. Porting an arbitrary code to the SPE may be challenging, since the local memory space (Local Store) is only 256 KB. This rather small space was designed to accommodate both instructions and data. The Local Store can be regarded as a software-controlled cache that is filled and emptied by DMA transfers. The programmer has to decide how to partition the data and how to organize those transfers during execution. We have ported the Fnpeaks period-searching code (Z. Kolaczkowski, 2003) used in the OGLE project [9] to the Cell BE architecture. Fnpeaks implements the Discrete Fourier Transform algorithm. It is executed in an iterative process. After finding the highest peak in the power spectrum, a third order Fourier series is fitted to the folded light curve (Harmonphfit program) and the corresponding function is subtracted from the data. Afterwards the Fnpeaks program is once again executed on the modified data set. The entire process is repeated twice. The single threaded Fnpeaks program was already vectorized and achieved good performance when using SSE instructions on the AMD Opteron processor. Hence the porting process was first focused on compiling the code to work on a single SPE and using the appropriate vector instructions of the SPE API. To achieve better performance we have also analyzed the code with available Cell tools like Oprofile, Visual Performance Analyzer and spu_timing. The parallel scheme based on the pthreads library was designed and implemented subsequently. The PPE works as a management and working thread, as we make use of its 2-way multithreaded core architecture. This section describes in detail all consecutive steps. The Fnpeaks code is not memory consuming, since the observation data for one star (a single computational element) consists of a maximum of 6000 single-precision floating-point numbers. Thanks to this property we did not need to empty or refill the Local Store with incoming partitioned data sets. One SPE thread (context) performs a DMA load command at the beginning and stores results back into main memory at the end.
3.1
SIMD Optimization
The Fnpeaks program was originally written in a vector fashion using Streaming SIMD Extensions (SSE). It achieved good performance when compared to a simple scalar version. This could automatically be adapted to fit the Cell BE vector processing capabilities. The notation used for implementing the SSE version
had to be simply rewritten into the Cell BE SDK language. The main vector instructions used in the computational SPU kernel were spu_mul, spu_madd and spu_msub. All of these instructions operate on float vectors of size 4. The computational SPU kernel was optimized with the use of the spu_timing tool to achieve the best possible performance with the use of dual-pipe vector processing. Below we present an excerpt of the computational SPU kernel of the application.
Fig. 1. SPE intrinsics implementation: excerpt of the do_row computational kernel. The Cell intrinsics spu_msub, spu_madd and spu_mul are used here.
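Since the original listing survives here only in fragments, the following is a hypothetical sketch of what such an SPU inner loop can look like; the signature, data layout and the sine/cosine accumulation are illustrative assumptions, not the original Fnpeaks do_row code.

#include <spu_intrinsics.h>

/* Accumulate Fourier-type sums over vectorized observation data with the
 * SPU SIMD intrinsics named above (spu_madd is a fused multiply-add on
 * 4 floats at once). */
void do_row_sketch(const vector float *data, const vector float *cosv,
                   const vector float *sinv, int n,
                   float *sum_re, float *sum_im)
{
    vector float acc_re = spu_splats(0.0f);
    vector float acc_im = spu_splats(0.0f);

    for (int i = 0; i < n; i++) {
        acc_re = spu_madd(data[i], cosv[i], acc_re);   /* acc += data * cos */
        acc_im = spu_madd(data[i], sinv[i], acc_im);   /* acc += data * sin */
    }

    /* horizontal reduction of the 4 vector lanes */
    *sum_re = spu_extract(acc_re, 0) + spu_extract(acc_re, 1)
            + spu_extract(acc_re, 2) + spu_extract(acc_re, 3);
    *sum_im = spu_extract(acc_im, 0) + spu_extract(acc_im, 1)
            + spu_extract(acc_im, 2) + spu_extract(acc_im, 3);
}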
3.2
Parallel Scheme
The parallelization of the Fnpeaks program was achieved by simple partitioning of the input data. Computations for each star data set could be performed independently, with no communication occurring between working threads. For cluster systems we have written specific shell scripts that managed data distribution across available nodes. In the case of the Cell BE processor, additional work had to be done in order to implement this simple parallel scheme inside the chip. We have designed a two-level pthreads parallel scheme on the PPE side of the application. The main management thread on the PPE implements 8 to 16 parallel threads in the case of QS2x servers and 6 parallel threads in the case of the PlayStation3 (first-level parallelism). Moreover, all of these threads (which we call Control threads) implement and run two kinds of threads (second-level parallelism): an I/O thread and a Run SPE job thread. The I/O thread reads the data from file
and stores the results back. The Run SPE job thread creates an SPE context and eventually executes this context when the previous work has finished. All of this is implemented in a double-buffer fashion: I/O threads read the next star's observation data while the current one is being processed on the SPE. Thanks to this simple idea we are able to hide all the I/O operations behind real computations. The described parallel scheme is presented in Figure 2.
Fig. 2. Parallel scheme for the Fnpeaks application. Each new arrow symbolizes the creation of a new PPU thread (with the pthreads library).
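A hypothetical sketch of one such control thread is given below, using pthreads and libspe2; the star_t type, the work-list query and the I/O helper are placeholders, and the real client additionally runs PPU-side compute threads and error handling omitted here.

#include <pthread.h>
#include <libspe2.h>

typedef struct { void *data; } star_t;             /* placeholder star record  */

extern spe_program_handle_t fnpeaks_spu;           /* embedded SPE program     */
extern int  next_star_available(void);             /* placeholder work query   */
extern void read_star(star_t *s);                  /* placeholder input reader */

static void *io_thread(void *arg)                  /* prefetch one star's data */
{
    read_star((star_t *)arg);
    return NULL;
}

static void *spe_job_thread(void *arg)             /* run one star on an SPE   */
{
    star_t *star = (star_t *)arg;
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    spe_program_load(ctx, &fnpeaks_spu);
    spe_context_run(ctx, &entry, 0, star->data, NULL, NULL);   /* blocks   */
    spe_context_destroy(ctx);
    return NULL;
}

static void *control_thread(void *arg)
{
    star_t buf[2];                                 /* double buffer            */
    int cur = 0;
    read_star(&buf[cur]);                          /* prime the first buffer   */
    while (next_star_available()) {
        pthread_t io, job;
        pthread_create(&io,  NULL, io_thread,      &buf[cur ^ 1]); /* next star    */
        pthread_create(&job, NULL, spe_job_thread, &buf[cur]);     /* current star */
        pthread_join(io, NULL);
        pthread_join(job, NULL);
        cur ^= 1;
    }
    return NULL;
}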
Additionally, we have decided to make use of the two-way multithreaded core architecture of the PPE by implementing additional Control threads fully operating and computing on the PPE side. Some modifications to the vector implementation of the Fnpeaks core were needed, since there are differences between the SPE vector intrinsics and the VMX standard of the PowerPC architecture. Thanks to all of this effort we were able to run up to 18 parallel instances of the Fnpeaks computational kernel within the QS2x servers (2 Cell chips). We have tested and compared the performance of the resulting application in many different configurations. The results are presented in Tab. 1. First of all we can see that the application scales linearly with an increasing number of SPUs used for computations. Secondly, the usage of additional PPU computational kernels increases the performance significantly when using 1 to 8 SPUs. This result is consistent with the performance we achieved on PlayStation3 consoles, where we were actually using 6 SPUs and one additional PPU kernel. In the case of 16 SPUs, using additional PPU computational kernels does not affect the performance.
Table 1. Scaling of the Fnpeaks application with variable number of working threads. Measurements made on an IBM QS22 system with a dataset consisting of 6096 star lightcurves.

No. working threads   Walltime   Speedup
1PPU                  01:50:47   0.53
1SPU                  00:59:24   1.00
1SPU + 1PPU           00:38:47   1.53
2SPU                  00:29:43   1.99
2SPU + 1PPU           00:23:30   2.52
2SPU + 2PPU           00:19:40   3.02
4SPU                  00:14:52   3.99
4SPU + 1PPU           00:13:09   4.51
4SPU + 2PPU           00:11:54   4.99
8SPU                  00:07:26   7.99
8SPU + 1PPU           00:07:00   8.48
8SPU + 2PPU           00:06:39   8.93
16SPU                 00:03:43   15.98
16SPU + 1PPU          00:03:38   16.34
16SPU + 2PPU          00:03:36   16.50
Table 2. Oprofile result analyzed by Visual Performance Analyzer

CYCLES    %       Symbol/Functions
1843674   88.09   do_row
207733    9.93    run_fnpeaks
18136     0.87    store_best
13116     0.63    getln
2940      0.14    xl_sin
2871      0.14    xl_cos
1480      0.07    fill_cossintab_pthread
1306      0.06    read_data
663       0.03    output_data
This is due to the fact that the PPU has to perform a lot of additional management work: 16 Control threads, 16 I/O threads and 16 Run SPE job threads. We have tested that the best possible configuration for performing Fnpeaks computations on a QS2x system is actually to run two separate instances of the application, each using 8 SPU and 1 PPU computational kernels.
3.3
Performance Analysis and Tuning
We have used Oprofile [11] and the Visual Performance Analyzer [12] to identify the computational kernels of the Fnpeaks Cell implementation. The results presented in Tab. 2 show that all the optimization effort could be applied to the do_row function (88.09%).
The do_row function was optimized with the use of instruction substitution and the XLC compiler optimization abilities. All of this was analyzed with the use of the spu_timing tool. The resulting machine code performs dual-pipe and fused operations and is well suited for the Cell architecture.
4
Performance and Computations
We have designed several performance tests for the Fnpeaks program. First of all, we have measured the execution time of our Cell BE implementation. Thanks to the good load balancing scheme managed by the PPE thread, we have obtained perfect scaling over an increasing number of working SPE threads. Moreover, we have discovered that the usage of an additional PPE working thread for computations increased performance. Secondly, we have prepared a small data set (6096 stars) and compared execution times on an AMD Opteron based server and on the PlayStation 3 using the GCC and XLC compilers. Results (presented in Tab. 3) show not only the architectural advantage of the Cell BE processor but also the importance of vector processing on both systems.

Table 3. Comparison of architectures, compilers and vector processing

Architecture used               Compiler used   Version    Time
AMD Opteron 875 (single core)   gcc             no SIMD    2:50:22
AMD Opteron 875 (single core)   gcc             SIMD       1:03:16
Cell BE (PS3), 6SPE             gcc             SIMD       00:27:24
Cell BE (PS3), 6SPE             xlc             SIMD       00:09:21
The most important test was a comparison of operational execution performance on a large data set of almost 150 000 stars. The Fnpeaks program was executed on a single AMD Opteron 875 system, a PlayStation3 console and an IBM QS21 server with two Cell BE processors. The results are presented in Tab. 4.

Table 4. Performance results on three systems: AMD Opteron 875, PlayStation3 and IBM QS21. The speedup was measured against the vector AMD Opteron single core implementation. The non-vector version is approximately 2.4 times slower.

Architecture used               Compiler used   CPU time   Speedup
AMD Opteron 875 (single core)   gcc, no SIMD    81:50:59   0.41
AMD Opteron 875 (single core)   gcc, SIMD       34:09:00   1
Cell BE, PS3 (6SPE)             xlc, SIMD       04:49:20   7.08
Cell BE, QS21 (8SPE)            xlc, SIMD       03:31:40   9.68
Cell BE, QS21 (16SPE)           xlc, SIMD       01:47:30   19.09
The real data set consisted of folded light curves of more than 100 million stars from the Large Magellanic Cloud (LMC). The computations were partitioned and performed on four kinds of compute platforms. The first one was a Linux cluster based on AMD Opteron 875 processors located at the Interdisciplinary Centre for Mathematical and Computational Modelling (ICM), University of Warsaw. The second platform was three PlayStation3 consoles running Fedora Core 6 (ICM). The two remaining servers were IBM Cell blades: a QS20 and a QS21 located at the IBM Development Laboratory in Böblingen, Germany. From the overall number of 800 data sets of similar size, 270 were computed on 30 AMD Opteron cores. Approximately 170 sets were computed on the PlayStation3 consoles. The rest (about 360 sets) were sent to and computed on the two IBM Cell blades in Böblingen. All computations were finished within one month. Thanks to the possibility of using Cell BE based systems as an additional computational resource for the 30 AMD Opteron cores, we were able to perform all calculations more than 3 times faster than expected.
5
Conclusions
The process of porting applications to the Cell Broadband Engine Architecture is not necessarily complicated. We have shown that it can easily be done in the case of period searching on huge astronomical data sets. The parallel scheme is achieved by distributing independent computations (one per star data set) over the SPEs. The PPE can be used not only as a management thread but as a working thread as well. The vector processing units should be used in order to achieve good performance. We have ported the Fnpeaks program used in the OGLE project and achieved an impressive speedup of 9.68 over the AMD Opteron implementation. Our work was applied in real computations in the field of period searching on an observational data set of more than 100 million stars. Period searching is used in a wide range of applications. We have shown that by using the Cell Broadband Engine the time to solution can be significantly reduced. In our case, using 7 Cell BE processors as additional accelerators to 30 AMD Opteron cores decreased the computation time more than 3 times.
6
Future Plans
As described in previous sections, more observation data already collected in the OGLE-III project is still to be processed. While OGLE-III continues to deliver new data, the follow-on phase IV is already under preparation. The first image of the sky with the use of the 32-chip mosaic camera designed for OGLE-IV was obtained on 11 September 2009. The upcoming phase of the project will deliver approximately 10 times more data, so the use of highly optimized computational kernels becomes even more important. We will continue to use the CBEA for acceleration of data processing in the field of period searching for this very project. Future runs will be performed on the Nautilus system installed at ICM [13].
The OGLE project itself features a variety of algorithms widely used in astronomical computations. As many of these could benefit from implementations on leading multicore architectures, we plan to extend the applicability of the CBEA by porting and tuning further codes used by astronomers in HPC. While period searching is widely used in the field of astronomy, it is applicable to other scientific areas as well. Our work can serve as a foundation for creating a general framework for period searching on the Cell Broadband Engine Architecture. Recognising the potential speedup of computations, this should be of interest to other projects using similar methods. This work has been conducted within the ICM-IBM Joint Cell Competence Center at the University of Warsaw. More information on its activity and other projects can be found at [8].
Acknowledgement We would like to thank Joachim Jordan and Michael Engler from the IBM Development Laboratory in Böblingen for making the QS20 and QS21 blades available remotely for tests and computation, as well as for looking after the data transfer procedure. Additional thanks go to Duc Vianney from the IBM Austin Research Lab for continuous support of our joint IBM-ICM team in the field of application development on the CBEA. We are grateful to the OGLE project team for approaching us, thereby initiating this case study. Finally, we would like to thank Prof. Marek Niezgódka and Krzysztof Bulaszewski for fostering the collaboration between ICM and IBM.
References
1. Udalski, A., Kubiak, M., Szymanski, M.: Optical Gravitational Lensing Experiment. OGLE-2 – the Second Phase of the OGLE Project. Acta Astronomica 47 (1997)
2. Soszynski, I., Udalski, A., Szymanski, M.K., Kubiak, M., Pietrzynski, G., Wyrzykowski, L., Szewczyk, O., Ulaczyk, K., Poleski, R.: The Optical Gravitational Lensing Experiment. The OGLE-III Catalog of Variable Stars. I. Classical Cepheids in the Large Magellanic Cloud. Acta Astronomica 58, 163–185 (2008)
3. Soszynski, I., Udalski, A., Szymanski, M.K., Kubiak, M., Pietrzynski, G., Wyrzykowski, L., Szewczyk, O., Ulaczyk, K., Poleski, R.: The Optical Gravitational Lensing Experiment. The OGLE-III Catalog of Variable Stars. II. Type II Cepheids and Anomalous Cepheids in the Large Magellanic Cloud. Acta Astronomica 58, 293–312 (2008)
4. Soszynski, I., Udalski, A., Szymanski, M.K., Kubiak, M., Pietrzynski, G., Wyrzykowski, L., Szewczyk, O., Ulaczyk, K., Poleski, R.: The Optical Gravitational Lensing Experiment. The OGLE-III Catalog of Variable Stars. III. RR Lyrae Stars in the Large Magellanic Cloud. Acta Astronomica 59, 1–18 (2009)
5. Paczynski, B.: Gravitational Microlensing at Large Optical Depth. Astrophysical Journal 301 (1986)
6. Paczynski, B.: Gravitational microlensing by the galactic halo. Astrophysical Journal 304 (1986)
7. Rosenblatt, F.: A Two-Color Photometric Method for Detection of Extra-Solar Planetary Systems. Icarus 14, 71 (1971)
8. IBM, ICM (University of Warsaw): Joint Cell Competence Center, http://cell.icm.edu.pl
9. Soszynski, I., Udalski, A., Kubiak, M., Szymanski, M.K., Pietrzynski, G., Zebrun, K., Szewczyk, O., Wyrzykowski, L., Ulaczyk, K.: The Optical Gravitational Lensing Experiment. Miras and Semiregular Variables in the Large Magellanic Cloud. Acta Astronomica 55 (2005)
10. Jacobi, C., Oh, H.-J., Tran, K.D., Cottier, S.R., Michael, B.W., Nishikawa, H., Totsuka, Y., Namatame, T., Yano, N.: The vector floating-point unit in a synergistic processor element of a Cell processor. In: Proceedings of the 17th IEEE Symposium on Computer Arithmetic, ARITH 2005, Washington, DC, USA, pp. 56–67. IEEE Computer Society, Los Alamitos (2005)
11. Oprofile. A system profiler for Linux, http://oprofile.sourceforge.net/news/
12. Visual Performance Analyzer. An Eclipse-based visual performance toolkit, http://www.alphaworks.ibm.com/tech/vpa/
13. Nautilus system, http://cell.icm.edu.pl/index.php/Nautilus
Finite Element Numerical Integration on PowerXCell Processors
Filip Krużel 1 and Krzysztof Banaś 1,2
1 Institute of Computer Modelling, Cracow University of Technology, Warszawska 24, 31-155 Kraków, Poland
[email protected]
2 Department of Applied Computer Science and Modelling, AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków, Poland
[email protected]
Abstract. We investigate opportunities for parallel execution on PowerXCell 8i processors of the numerical integration algorithm as it is used in finite element codes. Different kinds of problems and different types of finite element approximations are taken into account. We design and implement several parallelization strategies and test their performance for different situations. The results of analyses and tests can be used as a guideline for choosing the proper parallelization strategy for different problems and finite element discretizations.
1 Finite Element Numerical Integration
Numerical integration in finite element codes is used to create entries in the global stiffness matrix A and the global vector of the right hand side b, i.e. the matrix and the vector of coefficients of the finite element system of linear equations
Au = b   (1)
where u is the vector of unknowns (degrees of freedom). A and b are determined by the problem solved (through a weak finite element statement consisting of domain and boundary integrals) and u is used to create the approximate solution (as a linear combination of finite element basis functions with components ui as coefficients). The integration, performed for terms from the weak statement, concerns the whole computational domain that is divided into elements (hence domain integrals are computed as a sum of element integrals). We consider only domain integrals as more time consuming than surface (boundary) integrals. Different integrals contain coefficients from the approximated set of partial differential equations and transformed derivatives or values of test and trial functions. Before forming the system (1), test and trial functions from the weak statement are approximated as linear combinations of global basis functions φi. For an example term (that might represent the convection term in a scalar convection-diffusion problem) the formula for obtaining an entry Aij can look as follows:
C = \sum_e \int_{\Omega_e} \sum_l b_l \frac{\partial \phi_i}{\partial x_l}\, \phi_j \, d\Omega \qquad (2)
Here, Ωe is the volume or area of element e, bl is a coefficient of convection (i.e. the l-th component of the velocity vector), with l = 1, 2 or l = 1, 2, 3 depending upon the number of space dimensions of the problem, φi and φj are global basis functions. Theoretically, entries Aij exist for all pairs of global basis functions (each degree of freedom is associated with one basis function). However, from the nature of finite element approximation, it is known in advance that most of the entries Aij are zeros, with non-zero terms usually only for those basis functions that do not vanish together in at least one element. Global basis functions are created from shape functions, functions defined separately for single elements. In practice calculation of entries Aij is performed in a loop over elements. In each element, its contributions to the corresponding non-zero entries from A are calculated (the entries form a square matrix Ae with dimension equal to the number of shape functions in the element, the so-called element stiffness matrix, that is then assembled into A in a separate step, which we do not discuss here). To perform integration for a single element (defined in the global coordinate system x) the change of variables is performed that transforms the integral (2) into an integral over a reference element (defined in the local coordinate system ξ). Practically used shape functions are almost exclusively in the form of polynomials and quadratures are different forms of Gaussian quadrature. Given a set of NI quadrature points with local coordinates ξI and weights wI, the final formula for numerical integration approximating the integral (2) has the form
C \approx \sum_e \sum_{I=1}^{N_I} b_l \, \frac{\partial \hat{\phi}_i}{\partial \xi}(\xi^I) \, \frac{\partial \xi}{\partial x_l}(\xi^I) \, \hat{\phi}_j(\xi^I) \, \det J_{T_e}(\xi^I) \, w_I \qquad (3)
where \hat{\phi}_i and \hat{\phi}_j are shape functions and J_{T_e} is the Jacobian matrix of transformation x(ξ) from the reference element to the real element (the inverse transformation ξ(x) is used to compute the derivatives of shape functions with respect to coordinates of the real element).
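To make the structure of formula (3) explicit, the sketch below shows how the element stiffness matrix could be accumulated in an outer loop over integration points; the helper routines and array names (compute_shape_fun, det_J, Ae, b) are illustrative assumptions and not part of the original code.

```c
/* Sketch of the integration loop behind formula (3); helper names are
   hypothetical and the PDE coefficients b[l] are assumed constant. */
for (int I = 0; I < NI; I++) {                    /* loop over integration points */
  compute_shape_fun(I, phi, dphi);                /* phi[j], dphi[i][l] = d(phi_i)/dx_l at point I */
  double fac = det_J(I) * w[I];                   /* det J_T_e times the quadrature weight */
  for (int i = 0; i < ndof; i++)                  /* double loop over shape functions */
    for (int j = 0; j < ndof; j++)
      for (int l = 0; l < ndim; l++)
        Ae[i][j] += b[l] * dphi[i][l] * phi[j] * fac;
}
```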
2 Computational Complexity of Numerical Integration
From the description of stiffness matrix creation and formula (3) it follows that the calculations within a single element consist of three loops: a double loop over shape functions (for all entries of the element stiffness matrix) and a loop over integration points. The number of shape functions, for a given element and a given simulation, depends on the order of approximation that is expressed using the degree of approximating polynomials. For elements with uniform approximation properties in each space dimension this number is of order p^d, where p is the degree of polynomials used as shape functions and d is the number of space dimensions [1].
The number of integration points results from requirements of accuracy for numerical integration, which in turn is determined by the analysis of accuracy of the finite element method [2]. Usually this number is also of order p^d. The actual calculations, in the implementation that we consider as a basis for our parallelization efforts, are performed using an outer loop over integration points with the following steps at each integration point:
– calculation of values of all shape functions \hat{\phi}_i (and their derivatives with respect to physical coordinates ∂\hat{\phi}_i/∂x) – with complexity O(p^{d+1})
– calculation of PDE coefficients and geometrical coefficients J_{T_e}, J_{T_e}^{-1} and det J_{T_e} – often with constant cost
– a double loop over shape functions with summing up entries of the element stiffness matrix Ae – with complexity O(p^{2d})
The following data are necessary within an element to perform the algorithm:
– values of coefficients of PDEs, which can be:
  • constant over the whole domain
  • constant over the element
  • functions of physical coordinates
  • functions of the current solution (e.g. for iterative non-linear solvers or for time dependent problems)
– values of shape functions and their derivatives at all integration points
– Jacobian matrices J_{T_e} and J_{T_e}^{-1} (i.e. ∂ξ/∂x)
– coordinates ξ^I and weights w_I for all integration points
Our aim is to investigate high performance algorithms for numerical integration. The factor that limits the most the performance of contemporary processors is the speed of memory transfers. Whenever the number of computations per memory transfer is large enough, the processors perform well. Hence we will consider the case of constant coefficients of PDEs (from now on assumed to be equal to one). If coefficients are in the form of complex functions, the situation is beneficial for the processors – they will attain higher GFlops rates. We also consider only the case of geometrically linear (or multilinear) elements, i.e. elements without curved boundaries. For such elements, given the coordinates of their vertices, both J_{T_e} and J_{T_e}^{-1} can be computed with a constant cost at each integration point or even once for the whole element (in the case of triangles and tetrahedra). To make the considerations of this section more concrete, we present in Tables 1, 2 and 3 the parameters of numerical integration for 3D geometrically linear prismatic elements with different orders of approximation p. For each value of p the corresponding shape functions were obtained as tensor products of 1D shape functions in the vertical direction with 2D shape functions for triangles in the horizontal direction.
Table 1. Number of operations for different steps (computing different quantities) performed at each integration point in the numerical integration algorithm for 3D prismatic elements with different degrees of approximation

Computed quantities        p=1    p=2    p=3     p=4     p=5      p=6      p=7
φ̂_i, ∂φ̂_i/∂x              186    558    1240    2325    3906     6076     8928
J_{T_e}, J_{T_e}^{-1}      310    310    310     310     310      310      310
A_e                        252    2268   11200   39375   111132   268912   580608

Table 2. The total number of operations for different steps (computing different quantities) in the numerical integration algorithm for 3D prismatic elements with different degrees of approximation

Computed quantities        p=1    p=2    p=3     p=4       p=5        p=6        p=7
φ̂_i, ∂φ̂_i/∂x              1116   10044  59520   186·10³   585·10³    1403·10³   2999·10³
J_{T_e}, J_{T_e}^{-1}      1860   5580   14880   24·10³    46·10³     71·10³     104·10³
A_e                        1512   40824  537600  3150·10³  16669·10³  62118·10³  195084·10³

Table 3. The size of data structures – in the number of double precision scalars – for storing values used in numerical integration for 3D prismatic elements with different degrees of approximation

Stored quantities          p=1    p=2    p=3     p=4     p=5      p=6      p=7
φ̂_i, ∂φ̂_i/∂x              24     72     160     300     504      784      1152
ξ^I, w_I                   24     72     192     320     600      924      1344
A_e                        36     324    1600    5625    15876    38416    82944

3 Parallelization of Numerical Integration
We consider three strategies of parallelization for the presented algorithm of numerical integration. In the first strategy the loop over elements is parallelized. This is the preferred strategy for standard computations, on ordinary processors with large cores and large caches. Calculations for different elements are practically independent and the algorithm is easy, embarrassingly easy, to parallelize. The only problems are the necessity to retrieve input data (at least geometrical properties of an element and coefficients of PDEs) from finite element data structures and the assembly procedure after computing each element stiffness matrix. The first problem means that for computations on accelerator cores one needs to send input data directly for each element. This can pose some problems since we often do not want to start new threads for each new element, rather
prefer to start numerical integration threads at the beginning of global stiffness matrix creation and use each thread for many elements. The second problem, the assembly procedure, results in the necessity to take care to avoid possible data races during updating of the global stiffness matrix entries. Both problems can be overcome, however, there is one more limiting factor – the size of data structures. For higher order approximation the resources of processors with small cores may be insufficient. In such a case different parallelization strategies have to be designed. The second strategy that we consider consists in parallelizing the loop over integration points. Apart from possibly using the same values of J_{T_e}, J_{T_e}^{-1} and coefficients of PDEs, calculations involving shape functions and their derivatives are practically independent for each integration point. However, the results obtained at different integration points are written to the same places within local, element stiffness matrices Ae. For calculations using accelerator cores, this means that another step, a synchronized reduction after local calculations, has to be performed by the thread that controls the whole integration procedure. On the other hand, calculations in the double loop over shape functions store, for each iteration, values at different places of local stiffness matrices Ae. Parallelization of these loops can be interpreted as data parallelism, where the resulting element stiffness matrix is decomposed into parts that are assigned to different processing elements. The only drawback is that now each core has to know the values of all shape functions and their derivatives at each integration point, and some calculations and storage are redundant.
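As an illustration of the reduction step required by the second strategy, the following sketch shows how partial element stiffness matrices produced by the worker cores could be summed by the controlling thread; the names NWORKERS, NDOF and Ae_partial are assumptions made for this example.

```c
/* Sketch: each of NWORKERS cores has integrated its own subset of integration
   points into a private copy Ae_partial[w]; the control thread reduces them. */
double Ae[NDOF][NDOF] = {{0.0}};
for (int w = 0; w < NWORKERS; w++)
  for (int i = 0; i < NDOF; i++)
    for (int j = 0; j < NDOF; j++)
      Ae[i][j] += Ae_partial[w][i][j];
```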
4 PowerXCell Processor
The PowerXCell processor architecture [4] is a variation, designed for scientific computing, of the Cell Broadband Engine (CBE) architecture. It shares with the CBE almost all features, adding mainly the improved double precision performance. The CBE architecture sits in between classical post-RISC processors and GPUs. In contrast to classical processors it offers explicit software management of fast local (cache) memory. As opposed to GPUs it has less massive parallel resources (8 vector cores), making it easier than for GPUs to obtain high performance for a broad spectrum of algorithms. Moreover, processing cores are not synchronised on the hardware level and are equipped with larger resources (registers and a local store), allowing for a more flexible program execution model. We used a programming model in which a general purpose PowerPC core of the PowerXCell processor (PPE – PowerPC Processor Element) executes a large complex code and several vector cores (SPEs – Synergistic Processor Elements) perform some specifically designed procedures. The management of memory accesses for parallel parts of the code executed on a single SPE is based on several principles:
– the necessary data is explicitly brought from the main memory to the SPE's local store – possibly in advance and concurrently with other computations
– data from the local store are packed into 128-bit registers
– the results of calculations are explicitly brought back from the local store to the main memory
In practice, the synchronisation of actions of different SPEs, especially their communication with the PPE and their accesses to global memory, poses many problems. In our implementations we limited our efforts to resolving these problems, without further optimizations concerning efficient use of pipelines and local memory resources.
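A minimal sketch of the first and third principles, using the standard SPU DMA intrinsics from spu_mfcio.h, is given below; the buffer size, effective addresses and the compute_chunk routine are assumptions made for illustration only.

```c
#include <spu_mfcio.h>

#define CHUNK 1024
#define TAG   0

void compute_chunk(volatile double *in, volatile double *out);  /* hypothetical kernel */

/* Sketch of an SPE-side transfer cycle: fetch a chunk of input from main
   memory, compute on it in the local store, write the results back. */
void process_chunk(unsigned long long ea_in, unsigned long long ea_out)
{
  static volatile double in_buf[CHUNK]  __attribute__((aligned(128)));
  static volatile double out_buf[CHUNK] __attribute__((aligned(128)));

  mfc_get(in_buf, ea_in, sizeof(in_buf), TAG, 0, 0);     /* main memory -> local store */
  mfc_write_tag_mask(1 << TAG);
  mfc_read_tag_status_all();                             /* wait for the DMA to finish */

  compute_chunk(in_buf, out_buf);

  mfc_put(out_buf, ea_out, sizeof(out_buf), TAG, 0, 0);  /* local store -> main memory */
  mfc_write_tag_mask(1 << TAG);
  mfc_read_tag_status_all();
}
```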
5 Parallel Implementation of Numerical Integration for PowerXCell Processors
We parallelized the procedure of numerical integration in a large finite element code for DG (discontinuous Galerkin) approximations [3]. The whole code is run using a single thread on a single PPE. During calculations designated parts of the code are run using SPE threads that are created at the beginning of the finite element simulations. SPE threads receive signals to perform their actions from the PPE thread, together with some input data (using control blocks) that specifies e.g. the locations for additional input and output data.
5.1 Strategy I – Parallelization of the Loop over Elements
The size of data structures allowed us to perform integration for degrees of approximation p up to 5. However, due to the specific organization of the whole finite element code, we could not perform calculations with many SPE threads per single PPE thread.
5.2 Strategy II – Parallelization of the Loop over Integration Points
The calculations are performed in the following order:
– created SPE threads receive data on integration points and wait for the signal to start calculations
– the PPE thread sends a signal to start for each SPE thread using mailboxes and a part of input data using a control block
– SPE threads read geometrical data from the memory (from the stack of the PPE thread)
– SPE threads compute contributions to the values of all entries of the element stiffness matrix coming from the integration points assigned to each thread – at each integration point they compute the values of shape functions and Jacobian matrices
– SPE threads send all computed stiffness matrix entries and notify the PPE thread using mailboxes
– the PPE thread receives notifications from SPE threads and performs a reduction of contributions from SPE threads to produce the final element stiffness matrix
Since each SPE thread computes contributions to all entries of the element stiffness matrix, the size of data structures stored by SPEs is large and the degree of approximation for which computations can be performed is limited. Because of that we considered another version of the procedure in which the code for computing the values of shape functions and Jacobian matrices is not sent to SPEs. Instead the values are computed by the PPE thread and sent to SPE threads. This version gave results similar to the previous one, with limited gains in terms of local memory usage, due to the dominant role of the storage for the element stiffness matrices.
5.3 Strategy III – Parallelization of the Outer Loop over Basis Functions
In this strategy SPE threads perform calculations for entries of the local stiffness matrix assigned to them in the loop over integration points. The detailed algorithm looks similar to the one for the parallelization of the loop over integration points, with mailboxes used for notification messages.
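The mailbox-based notification used in strategies II and III can be pictured by the SPE-side sketch below; the message codes and the process_assigned_entries routine are hypothetical placeholders, only the mailbox intrinsics from spu_mfcio.h come from the SDK.

```c
#include <spu_mfcio.h>

#define MSG_QUIT 0   /* assumed control codes, not part of the original code */
#define MSG_DONE 1

void process_assigned_entries(void);   /* hypothetical integration routine */

/* SPE-side sketch: wait for a start signal from the PPE, integrate the
   assigned part of the element stiffness matrix, then notify the PPE. */
void spe_worker(void)
{
  for (;;) {
    unsigned int msg = spu_read_in_mbox();   /* blocks until the PPE writes a mailbox entry */
    if (msg == MSG_QUIT)
      break;
    process_assigned_entries();
    spu_write_out_mbox(MSG_DONE);            /* notification read by the PPE thread */
  }
}
```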
6 Numerical Experiments
We present results of numerical experiments performed for the previously mentioned 3D geometrically linear prismatic elements with degrees of approximation 1, 3, 5 and 7. The results in Tables 4 and 5 show the values and the ratio of the execution times obtained for the different strategies described above and different configurations of cores of a single PowerXCell processor within a QS22 server to the times for a single standard processor core (from an Intel Core2 Quad Q6600 CPU, 2.40 GHz). The tables indicate which strategies can be used for different cases of numerical integration and which represent the best opportunities for obtaining high speed-ups when executing on processors with many, possibly heterogeneous, cores, like the PowerXCell processor.

Table 4. Execution times obtained during numerical integration over a single element for different orders of approximation p and different parallelization strategies (description in the text)

Number and type of cores   Strategy   p=1         p=3         p=5         p=7
Core2                      –          0.0000059   0.0010499   0.0330820   0.3868980
PPU+1SPU                   I          0.0000247   0.0015399   0.0411880   –
PPU+8SPU                   II         0.0001149   0.0004639   –           –
PPU+8SPU                   III        –           0.0004220   0.0063760   0.0630519
Table 5. Speed-ups obtained during numerical integration over a single element for different orders of approximation p and different parallelization strategies (description in the text)

Number and type of cores   Strategy   p=1    p=3    p=5    p=7
Core2                      –          1.00   1.00   1.00   1.00
PPU+1SPU                   I          0.24   0.68   0.80   –
PPU+8SPU                   II         0.05   2.26   –      –
PPU+8SPU                   III        –      2.49   5.19   6.14
7 Conclusion and Future Work
The results of calculations show a potential for efficient execution of finite element numerical integration algorithms on heterogeneous multi-core architectures. There are several problems that have to be resolved before optimal implementations can be obtained. The main problem that we encountered using the PowerXCell processor was the synchronization of input-output operations, together with the synchronization of PPE and SPE threads. In all our implementations the most time consuming operation was the reading by SPEs of mailboxes with data reception confirmations sent by the PPE. The reason for this may be too slow PPE operation or some bus or memory delays. We plan to investigate this further. One of the advantages of Cell programming is the fact that it is close to the hardware and that the hardware model is relatively simple. This should allow for the best utilization of available resources. However, non-standard programming tools and the asynchronous memory transfers make it difficult to obtain proper and efficient codes. Acknowledgements. The support of this work by the Polish Ministry of Science and Higher Education under grant no. N N501 120836 is gratefully acknowledged.
References
1. Solin, P., Segeth, K., Dolezel, I.: Higher-Order Finite Element Methods. Chapman & Hall/CRC (2003)
2. Ciarlet, P.: The Finite Element Method for Elliptic Problems. North-Holland, Amsterdam (1978)
3. Banaś, K.: A model for parallel adaptive finite element software. In: Kornhuber, R., Hoppe, R., Périaux, J., Pironneau, O., Widlund, O., Xu, J. (eds.) Domain Decomposition Methods in Science and Engineering. Lecture Notes in Computational Science and Engineering, vol. 40, pp. 159–166. Springer, Heidelberg (2004)
4. Cell Broadband Engine Programming Handbook Including the PowerXCell 8i Processor. IBM (2008)
The Implementation of Regional Atmospheric Model Numerical Algorithms for CBEA-Based Clusters
Dmitry Mikushin and Victor Stepanenko
Research Computing Center, Moscow State University, Vorobievy Gory GSP-2, 119991 Moscow, Russia
{mikushin,stepanen}@srcc.msu.ru
http://geophyslab.srcc.msu.ru
Abstract. Regional atmospheric models are important tools for short-range weather prediction and future climate change assessment. The further enhancement of spatial resolution and the development of physical parameterizations in these models need an effective implementation of the program code on multiprocessor systems. However, nowadays typical cluster systems tend to grow into huge machines with over-petaflop performance, while individual computing node design stays almost unchanged, and growth is achieved simply by using more and more nodes, rather than by increasing individual node performance while keeping adequate power consumption. This leads to worse scalability of data-intensive applications due to increasing time consumption for data passing via the cluster interconnect. In particular, some numerical algorithms (e.g. those solving the Poisson equation) that scaled satisfactorily on previous-generation cluster systems do not utilize the computational resources of clusters with thousands of cores effectively. This prompts us to study the performance of numerical schemes of regional atmospheric models on processor architectures significantly different from those used in conventional clusters. Our approach focuses on improving the performance of time-explicit numerical schemes for Reynolds-averaged equations of atmospheric hydrodynamics and thermodynamics by parallelization on Cell BE processors. The optimization of loops for numerical schemes with a local data dependence pattern and with independent iterations is presented. Cell-specific workload managers are built on top of existing numerical scheme implementations, conserving the original source code layout and bringing high speed-ups over the serial version on a QS22 blade server. An intercomparison between Cell and other multicore architectures is also provided. Targeting the next generation of MPI-CellBE hybrid cluster architectures, our method aims to provide additional scalability to MPI-based codes of atmospheric models and related applications.
Keywords: cell broadband engine, MPI clusters, atmospheric modeling.
1 Introduction
Regional atmospheric models are an effective instrument in many studies of the atmosphere physics, especially short-range weather forecast and the future
climate change assessment. They are based on solution of the system of atmospheric hydrodynamics equations including heat transfer processes, the equations of ocean dynamics, and land surface processes (including soil, vegetation, water bodies, ice sheets and urbanized territories). These equations are solved by numerical methods, introducing the numerical grid into the problem. The processes with spatial scales less than the grid cell size are taken into account implicitly by the so-called parameterizations. The major way to improve the quality of weather forecasts and climate prediction skills involves the increasing of grid resolution, the use of more accurate numerical schemes and embedding new physical parameterizations. All of these lead to more computations being implemented by the model. Hence the standard of regional atmospheric models is governed to a large extent by available parallel supercomputing systems and the performance of these models on these systems. Almost all codes of state-of-the-art regional atmospheric models are adapted for distributed memory clusters, using the message passing interface (MPI). Some of them also make use of OpenMP directives for running on MPI clusters based on multicore processors [1]. However, problems with the scalability of regional model codes have been reported if the number of processor units exceeds 60-70 [2]. This implies that in order to effectively run regional models on petascale computing systems the traditional approaches to parallel implementation of these models have to be reconsidered. The saturation of speedup of the parallel atmospheric model with respect to the sequential one on conventional cluster systems is caused by increasing amounts of data transfers between computational nodes. The intensity of data transfers is defined by the numerical scheme used. If the scheme with local data dependence (i.e. when the calculation of the variable in the given grid point uses the values in several neighboring grid points, defined by the scheme pattern) is iterated (e.g. to integrate in time discrete analogues of hydrodynamics equations), the domain decomposition strategy turns out to be quite effective, with the parallel program speedup being close to the number of computational cores (with efficiency of 70-80%). However, there are numerical algorithms that include global data dependencies, which in the framework of domain decomposition strategy require global data exchanges. In hydrodynamic (including atmospheric) applications the example of such an algorithm is the solution of the Poisson equation based on Fast Fourier Transform (FFT). The increase of numerical hydrodynamic solvers scalability may be achieved by either minimizing the data dependence in numerical algorithms or by using novel high computing density architectures, such as Cell Broadband Engine (CBEA) [3] or General Purpose Graphics Processing Unit (GPGPU) [4]. This paper focuses on the implementation of numerical algorithms of a particular regional atmospheric model on CBEA processors as a promising way to increase the scalability of the model version adapted for MPI clusters. It includes 4 sections. The 2-nd section provides general notes on the mesoscale atmospheric model being developed and its implementation for conventional MPI clusters. The 3-rd section describes CBEA key features and an approach for
acceleration of numerical scheme solvers. Section 4 summarizes CBEA benefits and introduces the concept of a hybrid MPI-Cell implementation for atmospheric modeling.
2 The Regional Atmospheric Model and Its MPI Implementation
The mesoscale model considered here is being developed in the Research Computing Center (Moscow State University). It is based on the original sequential code of the NH3D model [5]. This model solves the three-dimensional non-hydrostatic hydrodynamic equations set, expressed in the σ coordinate system. It uses radiative boundary conditions on the lateral boundaries. The shortwave and longwave radiation transfer are parameterized by the Clirad-SW and Clirad-LW codes, developed for the NASA GEOS climate model [6],[7]. Land surface processes are represented by either the ISBA model [8] or the soil-vegetation parameterization from the climate model of the Institute for Numerical Mathematics, Moscow [9]. The state of water bodies and heat fluxes at the air-water interface are simulated by the LAKE model [10]. The passive tracer transport equation is solved by the Smolarkiewicz monotone scheme and used to study the tracer dispersion caused by thermally induced mesoscale circulations [11].
Fig. 1. The time consumption by major numerical schemes of the atmospheric model:
Diagnostic Poisson equation for geopotential – 24%
Momentum equations – 18%
Updating reference state variables – 11%
Vertical diffusion – 14%
Clouds and rainwater microphysics – 4%
Land surface parameterization – 1%
Atmosphere thermodynamics – 12%
Temperature equation and radiation parameterization – 12%
Water bodies parameterization – 4%
The relative time consumption of major model numerical blocks is depicted in Fig. 1. The numerical scheme for evolution equations is the time explicit leapfrog
Fig. 2. Mesoscale atmospheric model performance on the MPI cluster SKIF-MSU ”Chebyshev”
scheme, in which the spatial derivatives are approximated by the second order central differences implying local and compact data dependence pattern. We demonstrate this scheme for the passive tracer transport equation:
-\frac{1}{2\Delta t}\left([qp]^{n+1}_{i,j,k} - [qp]^{n-1}_{i,j,k}\right) = L_x([uqp]^{n}_{i,j,k}) + L_y([vqp]^{n}_{i,j,k}) + L_\sigma([\sigma qp]^{n}_{i,j,k}),
L_x(f_{i,j,k}) = \frac{f_{i+1,j,k} - f_{i-1,j,k}}{2\Delta x}, \quad
L_y(f_{i,j,k}) = \frac{f_{i,j+1,k} - f_{i,j-1,k}}{2\Delta y}, \quad
L_\sigma(f_{i,j,k}) = \frac{f_{i,j,k+1} - f_{i,j,k-1}}{\sigma_{k+1} - \sigma_{k-1}}. \qquad (1)
Here q is the concentration of the tracer, p is the difference between the atmospheric pressure values at the earth surface and at the top of the model domain, w is the vertical velocity, u and v are the horizontal velocity components, Δt is the timestep, Δx and Δy are the numerical grid steps in horizontal directions; n, i, j, k are the timestep number and spatial grid indexes. In addition to solving the evolution equations for each prognostic variable, filtering procedures are applied. The Asselin filter is used to damp the computational mode of the leapfrog scheme. The filtration of spurious waves, arising from the nonlinear numerical instability, is realized by a three-dimensional diffusion-like filter. The described numerical solution of the evolution equations allows effective use of domain decomposition. For computation of prognostic variables in grid points of a subdomain, only the data exchanges along the boundaries with neighboring
subdomains are required. The thickness of the boundary slice of grid points being exchanged is defined by the data dependency pattern, and equals 1 for central spatial differences. The finite-difference form of the elliptic equation is solved using a combination of fast Fourier transforms and factorization for solving the tridiagonal sets of linear algebraic equations. First the 2D FFT in horizontal coordinates of the elliptic equation right hand side is performed, then the Fourier coefficients are found by the factorization algorithm in the vertical, and finally the inverse 2D FFT is applied to obtain the solution. The sequence of global data exchanges in solving the elliptic equation is quite similar to that used in the 3D FFT in [12]. In this case it allows achieving up to a 26-times performance speedup on the SKIF-MSU cluster Chebyshev [13], reducing further scalability of the model. Explicit data distribution in the MPI case requires parallel implementation of all model algorithms, even those having minor impact on the performance of the whole model, and even those that are of low scalability (cyclic reduction, fast Fourier transform, etc.). Due to the hierarchical cluster network structure and unmanaged allocation of the working nodes, the larger the number of processing units used, the larger the performance variability (Fig. 2).
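For illustration, a minimal sketch of the leapfrog update (1) at interior grid points is given below; the array and variable names (uqp, vqp, sqp, qp_old, qp_new, dsig) are assumptions made for this example and do not come from the model code.

```c
/* Sketch of the leapfrog step (1) for the field qp at interior points;
   sqp denotes the sigma-direction flux term and dsig[k] = sigma[k+1] - sigma[k-1]. */
for (int i = 1; i < nx - 1; i++)
  for (int j = 1; j < ny - 1; j++)
    for (int k = 1; k < nz - 1; k++) {
      double Lx = (uqp[i+1][j][k] - uqp[i-1][j][k]) / (2.0 * dx);
      double Ly = (vqp[i][j+1][k] - vqp[i][j-1][k]) / (2.0 * dy);
      double Ls = (sqp[i][j][k+1] - sqp[i][j][k-1]) / dsig[k];
      qp_new[i][j][k] = qp_old[i][j][k] - 2.0 * dt * (Lx + Ly + Ls);
    }
```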
3 CBEA Approach
The general reason for the low efficiency of the MPI cluster solution highlighted in the previous section lies primarily in insufficient data transfer bandwidth. The performance of data-intensive problems (such as numerical hydrodynamic modeling) on clusters is limited by the so-called memory wall - RAM I/O, with a typical access rate of 7-10 Gb/sec, and much more by the node interconnect, with a typical data transfer rate of 1.5-4 Mb/sec. In comparison to clusters, the Cell architecture typically provides an 11-13 Gb/sec access rate for the local XDR (eXtreme Data Rate) memory, and 6-8 Gb/sec over the board's BIF (Broadband Interface Protocol) bus interconnect [14]. On the other hand, the data to be distributed have to fit the very small SPU local storage of 256 KB for data and instructions. That is why MPI-like full data distribution is not possible in this case for the large grid domains of realistic applications. The domain has to be divided into numerous small subdomains, each one fitting the SPU local storage. Each SPU processes a sequence of assigned subdomains one by one, fetching required data to the local storage and putting results back to main memory (Fig. 3). The term subdomain stands here for a set of dependent input and output data array fragments (data chunks), typically 10-100 data chunks of 16-256 8-byte elements. Aiming to reduce synchronization overhead, our implementation contains two workload managers: one on the PPU side and another on the SPU side. The PPU side performs grid division into continuous sequences of grid points, one sequence per SPU. Static grid division is typically enough to obtain high speed-up, since the data processing times of the SPUs are too similar to gain any significant benefit from load balancing. Continuity of grid point sequences is important to simplify the algorithm: a data array with any number of dimensions can be represented
Fig. 3. CBEA numerical scheme general processing algorithm
as a vector, keeping the SPU implementation more efficient and free of specific details. However, this approach introduces issues with special grid point locations, like boundary points if they are kept together with the domain interior. One should avoid this to eliminate branching and masking. It is possible to split the computational grid into interior domain points and boundaries to be handled separately, or to simply renew boundary values after the interior point data have been processed; in the latter case the SPU may fill them with some garbage to be corrected afterwards. The SPU workload manager is intended to subdivide SPU-assigned data vectors into smaller chunks to fit into the local storage. On every step of the SPU internal loop only one chunk of data is fetched to the local storage and only one chunk of results is transferred back to the main memory. To implement data double buffering, additional local space is reserved to fetch the next data chunk with the odd pipeline, while the even pipeline performs computations on the current data chunk. In our approach we attempted to avoid a significant part of manual Cell-specific code porting. Hand-written CBEA implementations of every heavy routine would be elusive, especially when the program consists of numerous separate kernels with 1-10% of the total implementation time of the application. For the computational loop patterns of serial or MPI-parallelized source code frequently found in the mesoscale model it is possible to assist manual development with automatic tools. Efficient and reliable architecture-specific porting would likely grow from a set of specially designed instrumentation tools for analyzing and preprocessing application source code, as proposed, for example, in [4]. The key concept of our toolset is runtime dependency probing. Once an application
Fig. 4. The example of computational loop probing to determine the data dependencies pattern
component is scheduled for SPU execution during the runtime, actual computations are preceded by an SPU analysis to automatically determine the data dependence pattern. Probing records the order of the used data arrays and the offsets of elements being referenced relative to the loop index (Fig. 4). The collected sequence is then transformed into the list of DMA (Direct Memory Access) requests that must be performed before execution of the corresponding loop iterations. In the same way the resulting data are scheduled to be returned back after processing. The entire SPU implementation is mostly targeted at explicit finite-difference numerical schemes, with each grid point value dependent only on a local set of grid points from the previous time step. Grid point value calculation typically makes use of double precision arithmetic and intrinsics. Minimizing, maximizing, and physical effects like water phase transitions may result in the presence of embedded branching conditions. For each grid point numerous data arrays are defined to keep track of every prognostic variable. Arrays for scalar variables are usually independent, while vector variables and complex numbers can be stored packed as single arrays. Tens of numerical schemes operate on different subsets of array data, therefore further grouping of variables is of little use. SPU vectorization of numerical scheme computations is possible; however, since vector operations require 16-byte address alignment, relative offsets of dependent array elements introduce some limitations. Consider the 3-dimensional second order central difference scheme for the spatial derivatives (Eq. 1). All six elements of the 3D pattern have identical volumetric offsets. Multidimensional arrays are kept as vectors, and the memory offsets of the elements would be -1, 1, −nx, nx, −np, np, where nx is the row dimension, and np is the plane size. Assume the data type is 8-byte floating point and the central pattern element has an aligned address. In this case at least two element addresses, -1 and 1, are always unaligned, while it is possible to align the four others by providing an even nx dimension. So, a pair of unaligned elements consumes several shuffling instructions in this case.
Another general issue is floating point division: due to a lack of hardware support, the SPU cannot divide fast. The best solution we have found is the composite intrinsic divd2.
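As a sketch only: assuming the divd2 routine from the SDK's SIMD math library (declared in simdmath.h), a vectorized double-precision division could be wrapped as follows.

```c
#include <simdmath.h>

/* Sketch: two double-precision quotients per call instead of scalar division;
   assumes the SIMD math library's divd2 is available. */
vector double vdiv(vector double num, vector double den)
{
  return divd2(num, den);
}
```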
4 Benchmarks
Performance intercomparison of a similar algorithm on different architectures is a general benchmarking approach allowing more reliable performance measurements and analysis. In our tests we involved the following systems: AMD Athlon 64 X2 4200+, 2.1 GHz (2x128 KB of L1 cache, 2x512 KB of L2 cache) - a desktop PC; Intel Xeon E5472, 3.0 GHz (4x64 KB L1 cache, 2x6 MB of L2 cache) - an 8-core dual node of the SKIF-MSU Chebyshev cluster; IBM PowerXCell 8i, 3.2 GHz - a blade server installed in the T-Platforms company data center; and Sony PlayStation 3 (PS3), model CECHK08. The first two are equipped with the Intel C++ Compiler 11.0; the latter have Cell SDK 3.0, the GNU compiler set 4.1.1 and IBM XL C++ 10.1. To benchmark the performance of the code for CBEA two atmospheric model components have been chosen. The first one - ElliptFR - computes the right hand side of the elliptic equation from 28 input data arrays; the other one - Leapfrog - is the 10-array scheme for the tracer advection equation, including leap-frog time integration and the second order central differences for spatial derivatives. All computations are done with double floating point precision. Time measurements are obtained for different problem domain sizes; performance for the AMD and Xeon systems is measured both for single core and multicore builds, using OpenMP directives.
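As a sketch only, a multicore build of such a grid-point loop could be annotated with an OpenMP directive as below; the array and function names are hypothetical and not taken from the model code.

```c
#include <omp.h>

double compute_point(int j, int i);   /* hypothetical per-point kernel */

/* Sketch of an OpenMP-parallelized grid-point loop used for the multicore
   x86 reference builds; each grid point is independent. */
void update_field(int nx, int ny, double rhs[ny][nx])
{
  #pragma omp parallel for
  for (int j = 0; j < ny; j++)
    for (int i = 0; i < nx; i++)
      rhs[j][i] = compute_point(j, i);
}
```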
Fig. 5. Performance of numerical schemes on PowerXCell 8i SPUs in comparison to other test systems, left - Leapfrog, right - ElliptFR
Despite the fact that SPU implementations of both algorithms achieve impressive speed-ups over serial PPU versions, in comparison to other architectures
the results are not so good (Fig. 5). We used SPU decrementer routines to measure the contributions of the computational and data I/O parts of the algorithm to the total time. While the computational part scales well, communication overheads grow, likely indicating insufficient DDR2 memory bandwidth. For all general-purpose processors scalability rapidly decreases due to poor caching. Different kinds of cache-targeted magic could be applied to the data layout; however, in this case we would have to redesign all schemes to make use of matrix-vector operations, or at least convert all arrays to block structures, which is likely not possible for huge existing models.
5 Conclusions
We have seen that the Cell Broadband Engine architecture addresses data-intensive problems, such as atmosphere modeling, better than more recent general-purpose multicore processors. However, XDR or DDR2-based Cell solutions are still limited by memory bandwidth. To improve the speedup, we have to study approaches for recombining the model computational loops into groups with common data patterns. This will help to reuse transferred data and reduce the memory bandwidth dependency. But even the current results prompt us to consider the idea of combining Cell boards into a distributed memory MPI-capable system and trying to achieve extra speedups for problems saturated on traditional architectures. Acknowledgments. We are pleased to acknowledge T-Platforms for the free of charge lease of Cell servers, especially Dmitry F. Tkachev for technical guidance and support of our research. We are also grateful to Jonathan Adamczewski, a PhD student at the University of Tasmania, for productive discussions and useful advice. We thank the staff of the SKIF-MSU "Chebyshev" cluster, especially Sergei A. Zhumaty, for assistance in various technical problems. We also acknowledge the three anonymous reviewers, whose comments helped us to improve the quality of our paper.
References
1. Michalakes, J., Dudhia, J., Gill, D., Henderson, T., Klemp, J., Skamarock, W., Wang, W.: The Weather Research and Forecasting Model: software architecture and performance. In: Mozdzynski, G. (ed.) 11th ECMWF Workshop on the Use of High Performance Computing in Meteorology, Reading, UK (2004)
2. Dubtsov, R., Semenov, A., Shkurko, D.: WRF performance on Intel platforms. In: 8th WRF Users Workshop, p. 6.4 (2007)
3. Zhou, S., Duffy, D., Clune, T., Williams, S., Suarez, M., Halem, M.: Accelerate Climate Models with the IBM Cell Processor. American Geophysical Union, Fall Meeting, abstract #IN21C-02 (2008)
4. Michalakes, J., Vachharajani, M.: GPU acceleration of numerical weather prediction. In: IEEE International Symposium on Parallel and Distributed Processing, April 14-18, pp. 1–7 (2008)
5. Miranda, P.M.A., James, I.N.: Non-linear three-dimensional effects on gravity wave drag: splitting flow and breaking waves. Quart. J. R. Met. Soc. 118, 1057–1082 (1992)
6. Chou, M.-D., Suarez, M.J., Liang, X.Z., Yan, M.M.-H.: A thermal infrared radiation parameterization for atmospheric studies: Technical Report Series on Global Modeling and Data Assimilation, NASA/TM-2001-104606, vol. 19, 55 p. (2003)
7. Chou, M.-D., Suarez, M.J.: A solar radiation parameterization for atmospheric studies: Technical Report Series on Global Modeling and Data Assimilation, NASA/TM-1999-10460, vol. 15, 42 p. (2002)
8. Mahfouf, J.F., Manzi, A.O., Noilhan, J., Giordani, H., Deque, M.: The land surface scheme ISBA within the Meteo-France Climate Model ARPEGE. P.1: Implementation and preliminary results. J. of Climate 8, 2039–2057 (1995)
9. Volodin, E.M., Lykosov, V.N.: Parameterization of heat and moisture transfer processes in the soil-vegetation system for the general atmospheric circulation modeling. 1. Model description and simulations using local observation data. Izvestiya RAS, Atm. Ocean Phys. 34, 453–465 (1998)
10. Stepanenko, V.M., Lykosov, V.N.: Numerical modeling of heat and moisture transfer processes in the soil-lake system. Russian Journal of Meteorology and Hydrology 3, 95–104 (2005)
11. Stepanenko, V.M., Mikushin, D.N.: Numerical modeling of mezoscale dynamics in the atmosphere and tracer transport above hydrologically inhomogeneous land. Computational Technologies 13(Special issue 3), 104–110 (2008)
12. Takahashi, D.: An Implementation of Parallel 3-D FFT with 2-D Decomposition on a Massively Parallel Cluster of Multi-Core Processors. In: Wyrzykowski, R., et al. (eds.) PPAM 2009, Part I. LNCS, vol. 6067, pp. 606–614. Springer, Heidelberg (2010)
13. Supercomputer SKIF-MSU Chebyshev, http://parallel.ru/cluster/skif_msu.html
14. Altevogt, P., Boettiger, H., Kiss, T., Krnjajic, Z.: Evaluating IBM BladeCenter QS21 hardware performance. IBM Multicore Acceleration Technical Library (2008), http://www.ibm.com/developerworks/library/pa-qs21perf/index.html
Adaptation of Double-Precision Matrix Multiplication to the Cell Broadband Engine Architecture
Krzysztof Rojek and Łukasz Szustak
Czestochowa University of Technology, Institute of Computer and Information Sciences, Dabrowskiego 73, 42-200 Czestochowa, Poland
{krojek,lszustak}@icis.pcz.pl
http://icis.pcz.pl
Abstract. This paper presents an approach to adaptation of the double-precision matrix multiplication to the architecture of Cell processors. The algorithm used for the adaptation on a single SPE is based on the C = C + A ∗ B operation performed for matrices of size 64×64; these matrices are further divided into smaller submatrices which correspond to micro-kernel operations. Our approach is based on a performance model which is constructed as a function of submatrix size. The model accounts for such factors as the size of local storage, the number of registers, properties of double-precision operations, balance between pipelines, etc. This approach allows us to take into consideration properties of the first generation of Cell processors and its successor - PowerXCell 8i. This adaptation is followed by an optimization phase which includes loop transformations, kernel implementation with SIMD instructions, and other transformations necessary to achieve balance between even and odd pipelines. Finally we present hand-tunings performed with the IBM Assembly Visualizer tool. The proposed adaptation and optimizations allow us to achieve about 96% of the peak performance.
1 Introduction
The Cell Broadband Engine architecture (CBEA) takes a radical departure from conventional multi-core architectures. The Cell Broadband Engine (Cell/B.E. for short) processor is the first implementation of CBEA, developed jointly by IBM, Sony, and Toshiba [1,2]. It was initially used in game consoles, but soon utilized for High Performance Computing as well. This technology can be found in the Sony PlayStation 3 and the IBM BladeCenter QS21. Each QS21 contains two Cell/B.E. processors. The PowerXCell 8i processor is a new implementation of CBEA, with improved double-precision floating-point performance and the ability to increase the main memory capacity up to 16 GB per processor. Each IBM BladeCenter QS22 contains two such processors.
Both generations of Cell processors support a single-precision peak performance of 204.8 GFLOPS per single chip, which gives 25.6 GFLOPS per each of eight SPE (Synergistic Processing Element) cores. At the same time, in case of double-precision arithmetic the peak performance for the first generation of Cell processors is only 1.825 GFLOPS per single SPE, and 14.6 GFLOPS per all the eight SPEs available in the chip. The PowerXCell 8i offers seven times the double-precision performance of the previous Cell/B.E. processor, providing the peak performance of 12.8 GFLOPS per single SPE, and 102.4 GFLOPS per eight SPE cores. The Cell family of processors provides outstanding opportunities for parallel computing. This allows developers to create high performance applications with the sustained performance close to the peak performance. However, utilizing the full potential of Cell processors is a big challenge for programmers [7,8]. For the matrix multiplication, which is a basic operation in many numerical algorithms, this challenge was investigated in [2,4,5,7,3]. For example, the kernel of the single-precision matrix multiplication algorithm performed on a single SPE was studied in [5]. The proposed algorithm and code optimizations allow the authors to achieve almost the peak performance. The results of implementing the matrix multiplication
C_{M×N} = A_{M×K} ∗ B_{K×N}   (1)
using 4-way SIMD multiply-add operations with 32-bit data were reported in [3], where the performance of 379 GFLOPS was achieved for 16 SPEs and the largest matrix size, as 92.5% of the peak performance delivered by QS21. The performance of 170.7 GFLOPS for the double-precision matrix multiplication running on the IBM QS22 blade system with 32 GB of DDR2 SDRAM was achieved in [4]. In this paper, we propose an approach to adaptation of the double-precision matrix multiplication to the architecture of Cell processors. This approach is based on a performance model which is constructed as a function of submatrix size. The performance model allows for selecting "the best" size of a micro-kernel, which is used for the adaptation. This adaptation is followed by an optimization phase which includes loop transformations, kernel implementation with SIMD instructions, and other transformations necessary to achieve balance between even and odd pipelines. The proposed adaptation and optimizations allow us to achieve about 96% of the peak performance. The paper is organized as follows. Section 2 introduces performance properties of the double-precision arithmetic in both generations of Cell processors. The idea of adaptation of the double-precision matrix multiplication to the Cell architecture is presented in Section 3, while Section 4 describes the micro-kernel used for this aim. The performance model for the double-precision matrix multiplication on Cell processors is derived in Section 5. Section 6 describes systematic implementation and optimization steps carried out for the micro-kernel, while hand-tunings performed with the IBM Assembly Visualizer tool are illustrated in Section 7. Performance results are presented in Section 8, while Section 9 gives conclusions and future work.
2 Architecture of Cell Processors and Performance of Double-Precision Arithmetic
2.1 First Generation of Cell Processors
The Cell/B.E. processor consists of nine cores on a single chip, all connected to each other and to external devices by a high-bandwidth, memory-coherent bus. This processor has one PowerPC Processor Element (PPE) and eight SPE cores. The SPEs are SIMD processors optimized for data-rich operations allocated to them by the PPE. Each of these identical SPE cores contains a Synergistic Processing Unit (SPU) with a RISC core, a 256-KB, software-controlled local store for instructions and data, and a large register file with 128 registers, 128-bit each. For the first generation of Cell processors, double-precision instructions are performed as two double-precision operations in 2-way SIMD fashion, but the SPE is capable of performing only one double-precision operation per cycle. Double-precision instructions have 13 clock cycle latencies, but only the final seven cycles are pipelined. No other instructions are dual-issued with double-precision instructions, and no instructions of any kind are issued for six cycles after a double-precision instruction is issued. As a result, two consecutive instructions must be independent to eliminate dependency stalls (Fig. 1). The SPU has two pipelines, named even (pipeline 0) and odd (pipeline 1). Into these pipelines, the SPU can issue and complete up to two instructions per cycle, one in each of the pipelines. Whether an instruction goes to the even or odd pipeline depends on its instruction type. The instructions responsible for data transfer (lqd, stqd) and for data organization (shufb) are assigned to the odd pipeline. Double-precision floating-point operations (spu_madd, spu_mul) are assigned to the even pipeline.
2.2 Second Generation of Cell Processors - PowerXCell 8i
The IBM PowerXCell 8i differs from the Cell/B.E. in that each SPE core includes an enhanced, double-precision unit. The second generation of Cell processors support the fully pipelined execution of double-precision instructions, with latency of 9 clock cycles. As a result, it is required to execute at least 9 consecutive, independent instructions to eliminate dependency stalls (Fig. 2).
000800 0 0123456789012        dfma $92,$4,$17
000807 0 ------7890123456789  dfma $94,$4,$18
000814 0 ------4567890123456  dfma $92,$4,$19
000821 0 ------1234567890123  dfma $94,$4,$20
000828 0 ------8901234567890  dfma $98,$4,$21
Fig. 1. Latencies of double-precision instructions in the first generation of Cell processors
000170 0 012345678  dfma $75, $16,$47
000171 0 123456789  dfma $78, $16,$49
000172 0 234567890  dfma $92, $23,$33
000173 0 345678901  dfma $94, $23,$36
000174 0 456789012  dfma $96, $23,$39
000175 0 567890123  dfma $98, $23,$43
000176 0 678901234  dfma $99, $23,$47
000177 0 789012345  dfma $102,$23,$49
000178 0 890123456  dfma $109,$16,$28
000179 0 901234567  dfma $75, $20,$33
Fig. 2. Latencies of double-precision instructions in the second generation of Cell processors
3 Idea of Adaptation
Due to the limited resources of an SPE core, we rely on a block version of the matrix multiplication, where the original operation (1) is performed as a set of block matrix multiplications:
C_{MB×NB} = A1_{MB×KB} ∗ B1_{KB×NB},  C_{MB×NB} += A2_{MB×KB} ∗ B2_{KB×NB},  …   (2)
The block size of 64 × 64 was chosen because:
– it is a multiple of the maximum transfer size between local storage and main memory (16 KB);
– the amount of memory necessary to store five matrices (A1, B1, A2, B2, and C) must be less than the size of local storage (256 KB);
– it is the largest size that satisfies the above assumptions.
Our model assumes implementation of matrix multiplication using a single SPE core. Computations on huge matrices with high performance are strongly limited by the size of the register file. Storing 64 ∗ 64 double-precision elements requires 64 ∗ 64/2 = 2048 registers. For that reason, the performance model assumes partition of each matrix into smaller submatrices (Fig. 3). The proposed adaptation assumes implementation of submatrix multiplications based on a micro-kernel (Fig. 3 and Fig. 4). But, to achieve a maximum
Fig. 3. Illustration of matrix multiplication micro-kernel
vector unsigned char pattern[2] = {
  { 0, 1, 2, 3, 4, 5, 6, 7,  0, 1, 2, 3, 4, 5, 6, 7},
  { 8, 9,10,11,12,13,14,15,  8, 9,10,11,12,13,14,15}};
(...)
for (i = 0; i < n; i++) {
  /* duplicate the i-th element of the A-column into both slots of a register;
     the loop body is a sketch reconstructed from the description in the text */
  vector double a = spu_shuffle(A[offsetA + i/2], A[offsetA + i/2], pattern[i % 2]);
  for (j = 0; j < m/2; j++)
    /* multiply by the B-row and accumulate into the C-submatrix registers */
    C[i*(m/2) + j] = spu_madd(a, B[offsetB + j], C[i*(m/2) + j]);
}
performance we need to find "the best" values of n and m. To find the optimal sizes of submatrices, a performance model is derived.
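As a minimal sketch of the block scheme (2) on the host side, successive 64×64 blocks of A and B could be streamed and accumulated into a block of C as below; the fetch_block helper and array names are assumptions made for illustration only.

```c
#define MB 64

void fetch_block(double blk[MB][MB], char which, int kb);  /* hypothetical DMA helper */

/* Sketch of the block scheme (2): Cblk is accumulated over successive
   64x64 block products streamed through the local store. */
void multiply_block_column(int K, double Cblk[MB][MB])
{
  double Ablk[MB][MB], Bblk[MB][MB];
  for (int kb = 0; kb < K; kb += MB) {
    fetch_block(Ablk, 'A', kb);              /* bring the next A block into the local store */
    fetch_block(Bblk, 'B', kb);              /* bring the next B block into the local store */
    for (int i = 0; i < MB; i++)
      for (int k = 0; k < MB; k++)
        for (int j = 0; j < MB; j++)
          Cblk[i][j] += Ablk[i][k] * Bblk[k][j];
  }
}
```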
4 Micro-Kernel of Matrix Multiplication
Fig. 4 shows the basic SIMD implementation of the micro-kernel matrix multiplication [5]. The main task of the micro-kernel is to copy each element of a column of the A-submatrix into both slots of a register, which requires spu_shuffle instructions, multiply it by a row of the B-submatrix, and add it to the output C-submatrix. But, to compute all elements of the C-submatrix we need to repeat the kernel for all columns of the A-submatrix and rows of the B-submatrix. We assume that each element of the C-submatrix, each column of the A-submatrix, and each row of the B-submatrix are stored in the register file. The offsets offsetA and offsetB point to the base addresses of the A- and B-submatrices, respectively.
5 Performance Model for Double-Precision Matrix Multiplication on Cell Processors

5.1 Model Assumptions
Our performance model is designed for a single SPE core. It is formulated both for the first generation of Cell processors and for the PowerXCell 8i architecture. The main challenge of the model is to find "the best" values for the sizes n and m of the submatrices involved in the micro-kernel (Fig. 3 and Fig. 4), where n ≤ 64 and m ≤ 64. Since these values must be divisors of 64, they belong to the following set of numbers: 1, 2, 4, 8, 16, 32, 64. The input parameters for this model are as follows:
1. the number of available SPE registers;
2. the number of independent double-precision operations.
Also, let us enumerate the assumptions of the model:
1. the number of used registers cannot exceed 128;
2. the number of even-pipeline operations cannot be less than the number of odd-pipeline operations;
3. the number of odd-pipeline operations should be minimized;
4. for the first generation of Cell processors, the ratio between even- and odd-pipeline operations should be maximized.
This model does not take into consideration store operations performed from the register file to the local storage. The number of these operations is constant and independent of the sizes of the submatrices.
5.2 Performance Model
In accordance with our assumptions, each element of the A-submatrix is stored in one register, which contains two copies of one double-precision element. In our algorithm, a single column of the A-submatrix is kept in registers, so at least n registers are required to store the A-submatrix. Two more registers are needed for the spu_shuffle instructions, which are responsible for distributing (copying) one element across a vector. In addition, the algorithm holds one row of the B-submatrix in m/2 further registers, and all elements of the C-submatrix are stored in n * m/2 registers. So at least n + m/2 + n * m/2 + 2 registers are required to perform the micro-kernel. Our algorithm executes n * m/2 even-pipeline operations for each C-submatrix. The whole algorithm requires 64 * 64 * 64/2 = 131072 vectorized operations on the even pipeline, so each submatrix multiplication must be repeated 131072/(n * m/2) times. Odd-pipeline operations are responsible for data transfers between the local storage and the register file, as well as for the spu_shuffle instructions. A single submatrix of size n × m requires loading m elements of the B-matrix (m/2 instructions) and n elements of the A-matrix (n instructions). In addition, our algorithm executes n spu_shuffle operations (one for each element of the A-submatrix). So the total number of odd-pipeline operations is at least m/2 + 2 * n. For the first generation of Cell processors, our model assumes m/2 consecutive, independent operations. To eliminate stalls, m/2 ≥ 2 independent operations should be performed, which gives m ≥ 4. As a result, we have m ∈ {4, 8, 16, 32, 64}. Table 1 presents: (i) the number of required registers; (ii) the numbers of even- and odd-pipeline operations for a single submatrix multiplication and for the whole matrix multiplication of size 64 × 64; (iii) the ratio between the number of even- and odd-pipeline instructions. All these parameters are given as functions of the submatrix sizes m and n. Because of space constraints, the rows corresponding to m = 4 and m = 64 are omitted in Table 1. For the whole algorithm, the number of operations executed on the even pipeline is constant, while the number of operations executed on the odd pipeline depends on the selected sizes of the submatrices.
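These quantities can be evaluated mechanically. The small program below is our own sketch (not part of the paper); it enumerates candidate submatrix sizes and prints the register and pipeline counts according to the expressions given above.

#include <stdio.h>

int main(void)
{
    const int sizes[] = {2, 4, 8, 16, 32, 64};
    for (int i = 0; i < 6; i++)
        for (int j = 0; j < 6; j++) {
            int n = sizes[i], m = sizes[j];
            int regs = n + m / 2 + n * m / 2 + 2;  /* A, B, C and shuffle regs */
            int even = n * m / 2;                  /* dfma ops per submatrix   */
            int odd  = m / 2 + 2 * n;              /* loads and shuffles       */
            int reps = 131072 / even;              /* submatrix repetitions    */
            printf("n=%2d m=%2d regs=%4d even=%4d odd=%3d all_odd=%6d ratio=%.2f %s\n",
                   n, m, regs, even, odd, reps * odd, (double)even / odd,
                   regs < 128 ? "fits" : "too many registers");
        }
    return 0;
}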
Table 1. Performance model for the first generation of Cell processors

 n   m  registers  even   odd  all odd  all even  ratio  reg. < 128
 2   8      16        8     8  131072   131072    1.00    TRUE
 4   8      26       16    12   98304   131072    1.33    TRUE
 8   8      46       32    20   81920   131072    1.60    TRUE
16   8      86       64    36   73728   131072    1.78    TRUE
32   8     166      128    68   69632   131072    1.88    FALSE
64   8     326      256   132   67584   131072    1.94    FALSE
 2  16      28       16    12   98304   131072    1.33    TRUE
 4  16      46       32    16   65536   131072    2.00    TRUE
 8  16      82       64    24   49152   131072    2.67    TRUE
16  16     154      128    40   40960   131072    3.20    FALSE
32  16     298      256    72   36864   131072    3.56    FALSE
64  16     586      512   136   34816   131072    3.76    FALSE
 2  32      52       32    20   83968   131072    1.56    TRUE
 4  32      86       64    24   49152   131072    2.67    TRUE
 8  32     154      128    32   32768   131072    4.00    FALSE
16  32     290      256    48   24576   131072    5.33    FALSE
32  32     562      512    80   20480   131072    6.40    FALSE
64  32    1106     1024   144   18432   131072    7.11    FALSE
For the first generation of Cell processors, the smaller the number of odd-pipeline operations, the better the performance that can be achieved. In our performance model, the register usage depends on the chosen values of m and n; moreover, we can accept only those values of m and n for which the number of required registers is less than 128. From the set of possible solutions, we select the one with the largest ratio between even- and odd-pipeline operations. So we finally choose n = 4 and m = 32. To eliminate stalls for the second generation of Cell processors, represented by the PowerXCell 8i, it is required to execute m/2 ≥ 9 independent, consecutive operations, which gives m ≥ 18. As a result, we have m ∈ {32, 64}. Table 2 presents the results given by the performance model for the selected values of m and n in the case of the PowerXCell 8i. This table includes: (i) the number of required registers; (ii) the number of even- and odd-pipeline operations for a single submatrix multiplication and for the whole matrix multiplication of size 64 × 64. To eliminate additional stalls, the number of odd-pipeline operations must not exceed the number of even-pipeline operations. Taking this condition into account, our model indicates three potential solutions for the values of n and m. Among them, we choose the solution with the least number of operations on the odd pipeline. As for the first generation of Cell processors, we again obtain n = 4 and m = 32.
Table 2. Performance model for PowerXCell 8i

 n   m  reg.  even   odd  all odd  all even  Reg. < 128  even >= odd
 2  32    52    32    20   83968   131072      TRUE        TRUE
 4  32    86    64    24   49152   131072      TRUE        TRUE
 8  32   154   128    32   32768   131072      FALSE       TRUE
16  32   290   256    48   24576   131072      FALSE       TRUE
32  32   562   512    80   20480   131072      FALSE       TRUE
64  32  1106  1024   144   18432   131072      FALSE       TRUE
 2  64   100    64    36   73728   131072      TRUE        TRUE
 4  64   166   128    40   40960   131072      FALSE       TRUE
 8  64   298   256    48   24576   131072      FALSE       TRUE
16  64   562   512    64   16384   131072      FALSE       TRUE
32  64  1090  1024    96   12288   131072      FALSE       TRUE
64  64  2146  2048   160   10240   131072      FALSE       TRUE
6 Systematic Implementation and Optimization Steps
The canonical algorithm for computing the matrix multiplication (1) is based on three nested loops [5]. Each loop iterates over a certain dimension: m, n, or k. When adapting this algorithm to the Cell/B.E. architecture, the main challenge is a suitable partitioning of the matrices into smaller submatrices. The technique used for the matrix partitioning is known as loop tiling. It creates three outer loops, which calculate the addresses of the submatrices, and three inner loops, which calculate the product of a submatrix of A and a submatrix of B and update a submatrix of C with the partial result. In this way, register reuse is maximized and the number of loads and stores is minimized. In our approach, the sizes of the submatrices are chosen based on the performance model. The next technique applied is loop unrolling, a loop transformation that attempts to optimize the execution speed of a program at the expense of its size. Loops can be rewritten as a sequence of independent statements, which eliminates the loop-control overhead. In our implementation, this is applied to the three innermost loops. Loop unrolling allows dependency stalls to be eliminated and decreases the number of branch and increment instructions. On SPE cores, this technique allows us to achieve a better balance between the pipelines and to maximize the dual-issue rate. The next challenge is to eliminate stalls caused by the latencies of access to the local storage. For each iteration, the algorithm executes the following sequence of operations:
– loading data into the register file;
– submatrix multiplication (spu_shuffle and spu_madd operations);
– storing data to the local storage.
The optimization technique used to solve this problem is double buffering, applied at the register level and involving two loop iterations [5]. The existing loop
body is duplicated, and two separate C-submatrices take care of the even and odd iterations, respectively. The algorithm presented in Fig. 4 requires some additional operations to compute the addresses of the elements of the submatrices. The time required to compute these addresses has a significant influence on the performance of the whole algorithm. Computing the index of each element requires even-pipeline instructions, which cannot be hidden behind spu_madd instructions. In fact, a part of the operations on addresses, such as multiplication by a power of 2, is computed using shift operations such as shli, shown in Fig. 5a. However, by replacing the shli operations with shlqbyi ones (Fig. 5b), which are executed on the odd pipeline, a part of the address operations can be hidden behind even-pipeline instructions.

(a) Before optimization:
000021 0  12    ori     $94,$5,0
000022 0  2345  shli    $86,$58,0
000023 0  34    ai      $sp,$sp,-304
000024 0  4567  shli    $44,$58,0
000025 0  56    ori     $85,$58,0
000026 0  6789  shli    $43,$58,0

(b) After optimization:
000021 0D 12    ori     $94,$5,0
000021 1D 1234  shlqbyi $86,$58,0
000022 0D 23    ai      $sp,$sp,-304
000022 1D 2345  shlqbyi $44,$58,0
000023 0D 34    ori     $85,$58,0
000023 1D 3456  shlqbyi $43,$58,0
Fig. 5. Results of spu timing analysis before (a) and after (b) optimization
7 Hand-Tuning with the IBM Assembly Visualizer Tool
The IBM Assembly Visualizer tool [9] annotates an assembly source file with a static analysis of instruction timing, assuming linear (branchless) execution. Such an analysis does not account for taken branches caused, for example, by loops. The Assembly Visualizer allows the user to safely and interactively edit the assembly in a manner that does not violate the correctness of the original program. The compilation to assembly code can be done by the spuxlc or spu-gcc compiler with the -S flag. The user can then hand-tune the code and fix pipeline stalls that the compiler has missed. This powerful functionality enables a level of performance optimization that is not possible at the C/C++ level. Summarizing, the static timing analysis of assembly codes:
– is based on the dual-issue rules and static dependency stalls of instructions;
– assumes that branches are not taken;
– does not account for local store transfers and branching.
Fig. 6. Timing analysis before hand-tuning
Fig. 7. Timing analysis after code hand-tuning with the IBM Assembly Visualizer
Fig. 6 shows the static timing analysis of the SPE code before its hand-tuning, while Fig. 7 demonstrates the result of the code hand-tuning with the IBM Assembly Visualizer tool.
8 Performance Results
For the first generation of Cell processors, the basic implementation of our matrix multiplication algorithm achieves 35.53% of the peak performance. After including optimizations such as loop reorganizations (loop tiling and loop unrolling), but without double buffering, the performance increases to 94.63%. For the PowerXCell 8i architecture, the basic implementation achieves 39.2% of the peak performance. The loop reorganizations increase the performance to 86.41%. The next optimization, the double-buffering technique used at the register level, improves the performance up to 95.14%. Finally, the optimizations applied at the assembly level with the IBM Assembly Visualizer tool allow us to achieve 96.05% of the peak performance.
9 Conclusions and Future Works
This paper presents an approach to the adaptation of double-precision matrix multiplication to the architecture of both generations of Cell processors. The algorithm used for the adaptation on a single SPE core is based on the operation C = C + A * B performed for matrices of size 64 × 64; these matrices are further divided into smaller submatrices which correspond to micro-kernel operations. Our approach is based on a performance model which accounts for such factors as the size of the local storage, the number of registers, the properties of double-precision operations, the balance between pipelines, etc. The proposed approach allows us to take into consideration the properties of the first generation of Cell processors and of its successor, the PowerXCell 8i. This adaptation is followed by an optimization phase which includes loop transformations, kernel implementation with SIMD instructions, and other transformations necessary to achieve a balance between the even and odd pipelines. Finally, we present hand-tuning performed with the IBM Assembly Visualizer tool. The proposed adaptation and optimizations allow us to achieve 94.63% and 96.05% of the peak performance for the first and second generations of Cell processors, respectively. The performance model derived in this work allows for selecting "the best" size of the micro-kernel, which is used for the adaptation. We have investigated other solutions extracted from the performance model; however, the performance results for these solutions are worse by about 10%. This paper focuses on multiplying matrices of size 64 × 64 on a single SPE. In future work, the proposed approach will be extended to multiplying large matrices on all 16 SPEs available in the QS21 and QS22 blade centers.
References

1. Buttari, A., Dongarra, J., Kurzak, J.: Limitations of the PlayStation 3 for High Performance Cluster Computing, http://www.netlib.org/lapack/lawnspdf/lawn185.pdf
2. Chen, T., Raghavan, R., Dale, J.N., Iwata, E.: Cell Broadband Engine Architecture and its first implementation - A performance view. IBM Journal of Research and Development 51(5), 559–572 (2007)
3. Dolfen, A., Gutheil, I., Homberg, W., Koch, E.: Applications on Juelich's Cell-based Cluster JUICE, http://www.fz-juelich.de/jsc/datapool/cell/Para08_apps_on_juice.pdf
4. Kistler, M., Gunnels, J., Brokenshire, D., Benton, B.: Programming the Linpack Benchmark for the IBM PowerXCell 8i Processor. Scientific Programming 17(1-2), 43–57 (2009)
5. Kurzak, J., Alvaro, W., Dongarra, J.: Optimizing matrix multiplication for a short-vector SIMD architecture - CELL processor. Parallel Computing 35(3), 138–150 (2009)
6. Wang, H., Takizawa, H., Kobayashi, H.: A Performance Study of Secure Data Mining on the Cell Processor. Int. Journal of Grid and High Performance Computing 1(2), 30–44 (2009)
7. Williams, S., Shalf, J., Oliker, L., Kamil, S., Husbands, P., Yelick, K.: The Potential of the Cell Processor for Scientific Computing. In: Proc. 3rd Conf. on Computing Frontiers, Ischia, Italy, pp. 9–20 (2006)
8. Woodward, P.R., Jayaraj, J., Lin, P., Yew, P.: Moving Scientific Codes to Multicore Microprocessor CPUs. Computing in Science and Engineering 10(6), 16–25 (2008)
9. IBM Assembly Visualizer for Cell Broadband Engine, http://www.alphaworks.ibm.com/tech/asmvis
Optimization of FDTD Computations in a Streaming Model Architecture

Adam Smyk1 and Marek Tudruj1,2

1 Polish-Japanese Institute of Information Technology, 86 Koszykowa Str., 02-008 Warsaw, Poland
2 Institute of Computer Science, Polish Academy of Sciences, 21 Ordona Str., 01-237 Warsaw, Poland
{asmyk,tudruj}@pjwstk.edu.pl
Abstract. This paper presents a methodology which enables an optimized execution of FDTD (Finite Difference Time Domain) computations in a streaming model architecture (Cell BE processors). A data flow graph that represents the FDTD computations in an irregular wave propagation area is transformed into a set of tiles. Each tile represents regular computations for a small part of a given computational area. Tiles are injected into computational nodes and processed in a pipe-like manner. It will be shown that such an approach enables solving the FDTD problem with a speedup almost equal to the ideal one. Several computation optimization methods are presented. The efficiency of streaming computations for various simulation parameters is discussed. Experimental results obtained for the streaming model on a physical PS3 machine are presented as well.
1 Introduction
The recent development of fast multicore processor architectures has had a significant impact on the methodology of parallel computations. The main reasons for the strong interest in such architectures are, among others, low power consumption, a high level of parallelism, easy data sharing and very fast data transmission. Multicore processors are currently used in almost all fields of computation. They are widely used in business applications for web and database servers [9], for the execution of Java enterprise applications [6] and also for entertainment [12]. The number of cores in one processor varies from 2 to 8 (e.g. for Niagara, Cell BE [4]) through 54 (for AZUL [6]) and even to 80 in some Intel processors [8]. The computational power of the most modern computer installations exceeds 1 PFlops (e.g. for the Roadrunner cluster system [10]) and can be efficiently used for most power-consuming computations like numerical processing. Numerical applications are a very good target for parallel implementation using different parallelization models in multicore architectures. In this paper, we have used the multicore architecture of the Cell BE processor. There are many kinds of programming models that can be considered for the Cell BE architecture, e.g.: the function offload model, the device extension model, computational acceleration,
the shared memory model, the asymmetric thread runtime model, the streaming model and the CellSs model [2,3,5,13]. Depending on the application, the models mentioned above can be used separately or mixed together to fully exploit the computational power of a given system architecture. For the FDTD computations, we have used the streaming computational model available on the Cell BE processor. The main idea of streaming computations is to provide a continuous stream of data to all computational nodes involved in solving a given problem [4]. This model has been merged with another optimization technique widely used for the parallel implementation of numerical applications, called loop tiling [14,15,17]. The rest of the paper is composed of three sections. In the first section, an outline of the FDTD problem is given. In the next section, the idea of the FDTD implementation on the Cell BE architecture is presented. In the last section, experimental results and some discussion are given.
2 FDTD Algorithm

The Finite Difference Time Domain (FDTD) method is used for the simulation of high-frequency electromagnetic wave propagation. It is done by solving the two Maxwell equations (see Fig. 1).
Finite Difference Time Domain method (FDTD) is used for simulation of high frequency electromagnetic wave propagation. It is done by solving two Maxwell equations, (see Fig.1). Discrete simulation points Mesh of physical points
Hx
Hx
Ez Hy
∇ × H = γE + ε
Ez Hy Hx
Hy Hx Ez
Ez Hy
Hy Hx
Hy
H (i, j ) = H
∂E , ∂t
n −1 y
∇ × E = −μ n −0.5 z
∂H ∂t
(Maxwell equations )
(1)
n −0.5
(i, j ) + RC ⋅ [ E (i, j − 1) − E z (i, j + 1)], n n −1 n −0.5 n −0.5 H x (i, j ) = H x (i, j ) + RC ⋅ [ E z (i − 1, j ) − E z (i + 1, j )], ( 2) n n −1 n −0.5 E z (i, j ) = CAz (i, j ) ⋅ E z (i, j ) + CBz (i, j ) ⋅ [ H y (i + 1, j ) − n −0.5 n −0.5 n −0.5 − H y (i − 1, j ) + H x (i, j − 1) − H x (i, j + 1)] n y
Hx
Computational mesh
Fig. 1. Computational mesh of the 2D FDTD problem
In our two-dimensional FDTD problem, the simulation is performed in a two-dimensional computational area determined by an irregular shape. At first, the whole computational area is transformed into a physical mesh (see Fig. 1). The mesh consists of a specified number of points, each of which contains either the electric component Ez of the electromagnetic field or one of the two magnetic components Hx or Hy (depending on its coordinates). Before the numerical simulation, the two Maxwell equations (1) must be transformed into their differential forms (2), which show the dependencies between the Ez, Hx and Hy components. The value of an Ez component depends on the values of the four nearest magnetic field components (Hx and Hy), whereas to calculate the value of a magnetic component we need the two nearest electric field components. The whole FDTD computation is done iteratively. Each iteration is divided into two parts: first the value of
all Ez points is computed, and next the values of the Hx and Hy components are computed. If the shape of the computational area is regular (e.g. rectangular), the computation can be easily parallelized (e.g. by stripe or block partitioning of the computational area). In the case of an irregular shape of the computational area, such a decomposition is more complicated. In the next section, we describe the main idea of streaming computations in the multicore architecture.
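To make the update structure concrete, a minimal scalar sketch of one FDTD time step following the discretized equations (2) is shown below. The grid size, the coefficient arrays CAz and CBz, the constant RC and the boundary treatment are our own placeholders, not taken from the paper.

#define NX 64
#define NY 64

static float Ez[NX][NY], Hx[NX][NY], Hy[NX][NY];
static float CAz[NX][NY], CBz[NX][NY], RC;

static void fdtd_step(void)
{
    /* first part of the iteration: update the electric component Ez */
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)
            Ez[i][j] = CAz[i][j] * Ez[i][j]
                     + CBz[i][j] * (Hy[i + 1][j] - Hy[i - 1][j]
                                  + Hx[i][j - 1] - Hx[i][j + 1]);
    /* second part: update the magnetic components Hx and Hy */
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++) {
            Hy[i][j] += RC * (Ez[i][j - 1] - Ez[i][j + 1]);
            Hx[i][j] += RC * (Ez[i - 1][j] - Ez[i + 1][j]);
        }
}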
3 Streaming FDTD Computations
The FDTD problem has been computed on the Cell BE architecture with the use of the streaming programming model [1]. In order to design any computational process for the Cell BE, a good understanding of its architecture is needed [4]. Each Cell BE processor contains a general-purpose processing element (PPE, with its global memory) and a number of parallel computational elements (e.g. 6 on the PS3 machine) called synergistic processing elements (SPEs), each with 256 KB of local memory and no cache. All elements are interconnected by a very efficient element interconnect bus (EIB). The EIB can be used for communication between the PPE and the SPE nodes and also for communication between SPE nodes (DMA transfers and mailbox communication are available). Data sending is mainly done by DMA transfers. Several problems appear in DMA communication. First, it needs to be synchronized by the programmer. The second problem is that if we want to transfer any data from/to global to/from local memory, we need to know both the global/local (remote) address and the local address. The local address on each SPE is known, but the global one needs to be sent by a separate transmission. It can be done by mailbox communication. This kind of communication does not require knowing the global address space; it is done by directly addressing the remote context (program) of each SPE node. There are blocking and non-blocking versions of the mailbox communication, and both of them can be used in the FDTD computational algorithm. The general, simplified scheme of the computation (without optimizations) on a Cell BE processor is presented in Fig. 2. First of all, the discrete mesh for the given computational shape is created. Next, the mesh is converted into a set of tiles. A single tile represents a small part of the FDTD computation. The tiles are created by loop tiling algorithms, described extensively in [17]. Each tile is processed separately, on the first idle SPE node found. The efficiency of the computations strongly depends on the size and shape of a single tile. There are several methods that enable finding and adjusting the shape of the tiles to the given system architecture (to reduce load imbalance and communication volume). Such an adjustment is strongly enforced by the relations existing in the computational mesh (see Fig. 1). For the FDTD problem, two shapes of tiles can be applied: rectangular and diamond-shaped tiles [17]. In our case, we have used the rectangular tile shape. To create the set of tiles, the shuffle generic intrinsic of the Cell compiler has been used; it is defined by exactly one assembly instruction. All tiles are located in the global memory of the PPE node. To access them from an SPE node, we need to transfer the global address of a given tile to the chosen SPE. It is done by the
mailbox communication. When the SPE receives the global address, it initiates a DMA transfer from the global PPE memory to the local SPE memory. The SPE starts its computations when the DMA transfer is finished. After the computations, the tile is transferred back to the global memory, also by a DMA transfer. In each even iteration, an electric component is computed, while in odd iterations the magnetic components are computed.
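As an illustration of this scheme, a hedged sketch of the SPE-side processing loop follows. It uses the standard SPU mailbox and MFC calls, but the tile size, the termination convention (a zero address) and all other details are our own assumptions, not the authors' code; effective addresses are assumed to fit into a 32-bit mailbox word.

#include <spu_mfcio.h>

#define TILE_BYTES 16384                       /* assumed tile size (16 KB) */
static char tile[TILE_BYTES] __attribute__((aligned(128)));

int main(unsigned long long speid, unsigned long long argp,
         unsigned long long envp)
{
    for (;;) {
        unsigned int ea = spu_read_in_mbox();    /* blocking mailbox read   */
        if (ea == 0)                             /* assumed "last message"  */
            break;
        mfc_get(tile, ea, TILE_BYTES, 0, 0, 0);  /* DMA: main memory -> LS  */
        mfc_write_tag_mask(1 << 0);
        mfc_read_tag_status_all();               /* wait for the data       */
        /* ... FDTD computation on the tile (omitted) ... */
        mfc_put(tile, ea, TILE_BYTES, 0, 0, 0);  /* DMA: LS -> main memory  */
        mfc_write_tag_mask(1 << 0);
        mfc_read_tag_status_all();
        spu_write_out_mbox(1);                   /* report this SPE as idle */
    }
    return 0;
}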
Fig. 2. The FDTD computation scheme
Without optimization, each tile is transferred to/from the SPE's memory from/to the PPE's memory. The DMA transfer can be easily optimized by the double-buffering technique [4], which consists in alternately using two buffers: one for sending already produced results and another for collecting new data. In order to use this technique, we need free space in the local (SPE) memory for one additional buffer. In our case, we have extended double buffering into a rotating-buffers technique [16]. The main idea of this modification, for both the PPE and the SPE side, is presented in Fig. 3. The total number of rotating buffers strongly depends on the size of a single tile: if the size of the tile is small, the total number of tiles in a buffer can grow. Such an approach enables interlacing the FDTD computations with DMA data transfers, which can make the total communication time shorter. On an SPE, the modification is more complicated. The computations on a given SPE start as soon as the required tile is received; all subsequent tiles are received during the processing of the tiles already received. Testing whether the needed tile is already present in the local memory is always done just before the computations, in order to reduce the total simulation time. If the data are in
the local memory, the time overhead related to controlling the reception of data is minimal. A similar optimization is done when the computations for a given tile are finished: as soon as possible, the DMA back transfer SPE->PPE is initialized. Such an approach creates a pipeline of DMA read, DMA write and computational instructions. This program pipeline constantly supplies new tiles to the computational nodes (SPEs), so that the streaming computations can be performed very efficiently with the rotating-buffers infrastructure. Additionally, the rotating-buffers technique can be very helpful in finding the saturation threshold of the DMA facility in the multicore architecture.
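A minimal sketch of this rotating-buffers pipeline on the SPE side is given below (our own illustration under assumed sizes and names, not the authors' code). Each buffer slot uses its own DMA tag group, so that the transfers of upcoming tiles overlap with the computation on the current one.

#include <spu_mfcio.h>

#define NBUF       4                     /* assumed number of rotating buffers */
#define TILE_BYTES 512                   /* assumed tile size in bytes         */
static char buf[NBUF][TILE_BYTES] __attribute__((aligned(128)));

static void compute_tile(char *t) { (void)t; /* FDTD update of one tile, omitted */ }

/* tiles[i] holds the main-memory (effective) address of the i-th tile. */
static void process_tiles(const unsigned int *tiles, int ntiles)
{
    int i;
    for (i = 0; i < NBUF && i < ntiles; i++)          /* pre-fill all buffers  */
        mfc_get(buf[i], tiles[i], TILE_BYTES, i, 0, 0);
    for (i = 0; i < ntiles; i++) {
        int slot = i % NBUF;
        mfc_write_tag_mask(1 << slot);
        mfc_read_tag_status_all();                    /* wait until data arrive */
        compute_tile(buf[slot]);
        mfc_put(buf[slot], tiles[i], TILE_BYTES, slot, 0, 0);
        if (i + NBUF < ntiles)
            /* fenced get: starts only after the put issued above on this tag */
            mfc_getf(buf[slot], tiles[i + NBUF], TILE_BYTES, slot, 0, 0);
    }
    mfc_write_tag_mask((1u << NBUF) - 1);             /* drain outstanding puts */
    mfc_read_tag_status_all();
}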
Fig. 3. Rotating-buffers modification in the PPE and the SPE
4 Experimental Results
The algorithm described above has been tested on the Cell PS3 architecture for various simulation parameters. In the first part of the tests, all experiments were done for the same convex irregular wave propagation area. The transformation of the data flow graph into a set of tiles has not been done during the computations; it has been prepared in advance, before the simulation, because the complexity of this step can have a very significant influence on the total execution time, especially for very short simulations. In the experiments, the shape of a single tile has to match the data dependencies resulting from the differential equations (2). As mentioned before, the FDTD algorithm admits two shapes of tiles: rectangular and diamond-shaped tiles. Both of them can be used
in streaming computations. We have examined the rectangular loop tiling of the FDTD algorithm. All experiments have been done on a PS3 system with the Yellow Dog Linux 6.1 distribution [11] and IBM SDK 3.1 [7]. In the first experiments (see Fig. 4, left), we have tested the behavior of the streaming computations for various tile sizes. All experiments described below have been done for tile data sizes equal to 16, 64, 128, 1024, 2048 and 16384 floating-point numbers. For the tile data size 16384, not all experiments could be done, because the local memory on the SPE nodes was insufficient. The total number of tiles in each experiment was set to 8000.
Fig. 4. (Left) Speedup for FDTD computations for various tile sizes. (Right) Comparison of efficiency of the FDTD computations with non-blocking and blocking mailbox communication.
Fig. 5. (Left) Execution time of the FDTD computations with rotating-buffers optimization (tile size = 128). (Right) Speedup of computations for various numbers of SPEs (tile size = 128).
As we can see, the speedup strongly depends on the tile size. The speedup is quite close to the ideal one for tile sizes greater than 128. For small tile sizes (i.e. very fine-grain computations), the speedup is very unsatisfactory. It can be explained by the requirement of the Cell BE architecture that, for the most efficient computations, all data in all transfers between the PPE and SPE nodes must be aligned to 128 bytes. For small tiles, this condition cannot be met. In
the FDTD problem, when we perform computations for strongly irregular computational areas, we cannot use large tiles. Because we have assumed regularity of the tile shapes, some tiles cannot be properly adjusted to the irregular computational areas. If the boundary of the computational area crosses a tile, there can be many unused computational cells in the tile, and thereby the efficiency of the computations can decrease. To avoid this, several smaller tiles can be used instead. This is very important, especially for strongly irregular shapes of the electromagnetic wave propagation area. We have used two kinds of communication: DMA transfers to send data between the PPE and SPE nodes, and mailbox transfers to send control messages and to synchronize. Sending control messages can have a significant influence on the total communication time, especially for fine-grain computations. We have found out (see Fig. 4, right) that if we use non-blocking mailbox transfers for small tiles, the execution time is on average about 20% shorter in comparison to blocking transmissions (MailboxSpeedup = TimeNonBlocking/TimeBlocking, where TimeNonBlocking is the execution time for non-blocking mailbox communication and TimeBlocking is the execution time for blocking mailbox communication). Such behavior is observed only for configurations with small tiles; for tiles larger than 128 no significant difference has been noticed. At the same time, the non-blocking message communication does not produce any speedup for larger numbers of SPE nodes used in the computations. As we can see, the message transfer optimization cannot be applied efficiently to all computational configurations. To increase the speedup of fine-grain computations, we have applied another kind of communication optimization: the rotating-buffers method described above has been applied to the FDTD computations. It enables fully exploiting the DMA transfer facility of the Cell BE processor. We have performed several experiments and obtained very promising results. We have tested the FDTD computation efficiency for various buffer sizes: 1, 2, 4, 8 and 16 tiles in a buffer. The limitation on the buffer size is the size of the local memory on one SPE node; we could not test some configurations, e.g. the tile size equal to 16384 with 16 tiles in a buffer. The results for the tile size equal to 128 are presented in Fig. 5, left. We can see that for all configurations with more than 1 tile in a buffer we obtained some speedup, from 1.2 to 1.5 times faster than for the configuration with 1 tile (Fig. 5, right). An interesting observation is that for the 2-tile configuration (double buffering) the speedup was better than for the 3-tile configuration. When the number of tiles increased, the speedup increased as well, but its maximal value was not so gratifying (only about 1.5). We have tested the rotating buffers for various sizes of buffers and for various sizes of tiles (see Fig. 6). All further experiments have been done for various total numbers of tiles, computed according to the rule (see Table 1): total_number_of_tiles = total_number_of_computational_cells/tile_size. As we can see, the most significant difference (about 10 times) in the execution time appears for the configuration with 1 tile in a buffer and the tile size equal to 16. What is very interesting in this case: the execution time has not
changed even when the number of SPE nodes increased (from 2 to 6). The same can be seen for the tile size 64. This indicates that for these configurations the DMA communication infrastructure has been fully saturated, and an increase of the computational power has not brought any increase in performance.

Table 1. Values of tile sizes and total numbers of tiles used in the last test

Tile size   Total number of tiles
    16            100000
    64             25000
   128             12500
  1024              1562
  2048               781
 16384                98
Fig. 6. Execution time for FDTD computation for various “tile” configurations
We can also see that when the number of tiles in a buffer increases, the total execution time decreases for all configurations with small tile sizes. A small difference is still visible for the configuration with 8 tiles in a buffer, but for 16 tiles in a buffer the execution time is almost equal for all configurations. This indicates that the intensity of the DMA transfers on the Cell BE processor has reached the peak value possible in this case. If we focus on the configuration with 16 tiles in a buffer, we see that the best execution time has been obtained exactly for the simulation done with the smallest tile size (16). The difference is not very big, and it can be explained by the multiple overlapping of small parts of computations and communication during the FDTD computation: the non-blocking DMA write transfer is done simultaneously with the FDTD computations. In this optimization, we do not consider
when the DMA write transfer should be started: the transfer of a given tile from the local to the global memory starts as soon as the processing related to the tile is completed. This optimization, together with the rotating-buffers optimization, has produced a slight super-linear speedup, see Fig. 7. The experiments have been done for the tile size equal to 128. The speedup for almost all configurations is very close to the ideal one, but for the configuration with 16 tiles in a buffer it is always slightly better than linear. The super-linear speedup has appeared when the DMA write optimization has been used. We suppose that the overlapping of the computation and communication operations performed simultaneously can reduce the overhead related to data transmissions between the PPE and SPE nodes. Only a slight super-linear speedup has appeared, because the communication has been performed only between the PPE node and the SPE nodes (probably, the DMA channels of the PPE node have been overloaded). We can expect that with more diversity in communication (not only between the PPE and SPE nodes, but also between SPE nodes) the speedup can increase much more. This is a basis for future work.
Fig. 7. (Left) Speedup obtained with the rotating-buffers optimization. (Right) Speedup obtained with the rotating-buffers and the DMA write optimizations.
5 Conclusions
The method presented in the paper enables carrying out FDTD simulations in streaming model architectures such as the one embedded in the Cell BE processors. The tiles of the FDTD data are spread among all available SPE computational nodes by means of the DMA communication infrastructure. Several optimization methods for the FDTD parallel computations have been introduced, such as blocking/non-blocking mailbox communication, rotating buffers and DMA write transfer optimization. These optimizations gave promising results for fine-grain computations by reaching the maximal possible performance on the Cell BE processors. The rotating-buffers technique has turned out to be the most efficient optimization; as has been shown, its usage has resulted in the biggest increase of the computational efficiency for configurations with small tiles. We have shown that the FDTD computations for strongly irregular shapes of the electromagnetic wave propagation area can be efficiently implemented with the streaming model available in the Cell BE processors.
References

[1] Adams, S., Payne, J., Boppana, R.: Finite Difference Time Domain (FDTD) Simulations Using Graphics Processors. In: HPCMP Users Group Conference 2007 (2007)
[2] Akhter, S., Roberts, J.: Multi-Core Programming - Increasing Performance through Software Multithreading. Intel Press (2006)
[3] Bellens, P., Perez, J.M., Badia, R.M., Labarta, J.: CellSs: a Programming Model for the Cell BE Architecture. In: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (2006)
[4] Buttari, A., Dongarra, J., Kurzak, J.: Limitations of the PlayStation 3 for High Performance Cluster Computing. Technical Report UT-CS-07-597, Department of Computer Science, University of Tennessee, Knoxville, TN, USA, also as LAPACK Working Note 185 (May 2007)
[5] Gonzàlez, M., Vujic, N., Martorell, X., Ayguadé, E., Eichenberger, A.E., Chen, T., Sura, Z., Zhang, T., O'Brien, K., O'Brien, K.: Hybrid access-specific software cache techniques for the Cell BE architecture. In: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, Toronto, Ontario, Canada (2008)
[6] http://www.azulsystems.com/
[7] http://www.ibm.com/developerworks/power/cell/
[8] http://www.intel.com/
[9] http://www.sun.com/processors/niagara/
[10] http://www.top500.org/
[11] http://www.yellowdoglinux.com/
[12] http://www.us.playstation.com/PS3/
[13] Kruijf, M., Sankaralingam, K.: MapReduce for the Cell B.E. Architecture. Technical Report #1625 (October 2007)
[14] Manjikian, N., Abdelrahman, T.S.: Exploiting Wavefront Parallelism on Large-Scale Shared-Memory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems 12(3) (March 2001)
[15] Orozco, D., Gao, G.: Mapping the FDTD Application to Many-Core Chip Architectures. CAPSL Technical Memo 087 (March 3, 2009)
[16] Smyk, A., Tudruj, M.: RDMA Control Support for Fine-Grain Parallel Computations. In: PDP 2004, La Coruna, Spain (2004)
[17] Xue, J.: Loop Tiling for Parallelism. Kluwer Academic Publishers, Dordrecht (2000)
An Orthogonal Matching Pursuit Algorithm for Image Denoising on the Cell Broadband Engine

Dominik Bartuschat, Markus Stürmer, and Harald Köstler

University of Erlangen-Nuremberg, 91058 Erlangen, Germany
Abstract. Patch-based approaches in imaging require heavy computations on many small sub-blocks of images but are easily parallelizable since usually different sub-blocks can be treated independently. In order to make these approaches useful in practical applications efficient algorithms and implementations are required. Newer architectures like the Cell Broadband Engine Architecture (CBEA) make it even possible to come close to real-time performance for moderate image sizes. In this article we present performance results for image denoising on the CBEA. The image denoising is done by finding sparse representations of signals from a given overcomplete dictionary and assuming that noise cannot be represented sparsely. We compare our results with a standard multicore implementation and show the gain of the CBEA.
1 Introduction
Sparse representations of signals or images, based on a frame (overcomplete basis), the so-called dictionary, are widely used in many imaging applications like image denoising, super-resolution, or image restoration [1,2,3]. In previous work we have shown that sparse representations are suitable for CT image denoising and compared the results to other denoising approaches [4,5,6,7,8]. In this article we accelerate the image denoising on the Cell Broadband Engine Architecture (CBEA) in order to achieve close to real-time performance. One major and very time-consuming task is to find the coefficients for the sparse representation in the given basis. This is in general an NP-hard problem and can be solved approximately by the orthogonal matching pursuit (OMP) algorithm [9]. We implemented a variant of it, the Batch-OMP algorithm [10]. In Sect. 2 we introduce the idea of sparse representation and show how it can be used for image denoising. In Sect. 3 we discuss the Cell implementation and in Sect. 4 we present performance and scaling results for the CBEA. We apply the denoising method to a noisy real-world image. Future work is outlined in Sect. 5.
2 Methods
Sparse representation: By extending a set of elementary vectors beyond a basis of the signal vector space, signals can be compactly represented or approximated
by a linear combination of only a few of these vectors. This is called a sparse representation. The problem of finding the sparsest representation a ∈ R^K of a signal X ∈ R^n, up to a given error tolerance ε > 0, can be formulated as:

    min_a ||a||_0  subject to  ||Da − X||_2 ≤ ε,    (1)

where ||·||_0 denotes the ℓ0-seminorm that counts the nonzero entries of a vector, ||a||_0 = Σ_{j=0}^{K} |a_j|^0. The column vectors of the dictionary, the full-rank matrix D, are called atoms. They form an overcomplete basis of the signal space (i.e. K > n). Exactly determining the sparsest representation of signals, called sparse coding, is an NP-hard combinatorial problem [9,11]. To reduce the computational effort, one divides the whole signal or image X into small, typically overlapping patches x and solves (1) for each patch separately. At the end, the patches are assembled to retrieve X.

Orthogonal Matching Pursuit: Orthogonal matching pursuit (OMP), a simple greedy algorithm, solves (1) approximately by sequentially selecting dictionary atoms. OMP is guaranteed to converge in finite-dimensional spaces in a finite number of iterations [9]. The complexity of this algorithm for finding n atoms is O(n^3) [10]. In each iteration of the OMP algorithm:
– First, the projection of the residual r on the dictionary, p = D^T r, is computed, and the atom k̂ = argmax_k |p_k| with maximal correlation to the residual is
selected.
– Then, the current patch x is orthogonally projected onto the span of the selected atoms by computing a = (D_I)^+ x. This orthogonalization step ensures that all selected atoms are linearly independent [10].
– Finally, the new residual is computed by r = x − D_I a, which is orthogonal to all previously selected atoms.
Here, I denotes a set containing the indices of the selected atoms, D_I are the corresponding columns of D, and (D_I)^+ represents the pseudoinverse of D_I. The Batch-OMP algorithm [10], summarized in Algorithm 1, accelerates the OMP algorithm for larger patch sizes. It pre-computes the Gram matrix G = D^T D and the initial projection p^0 = D^T x in order to require only the projection of the residual on the dictionary instead of explicitly computing the residual. In addition, we replace the computation of the pseudoinverse in the orthogonalization step, which is done in OMP by a singular value decomposition, with a progressive Cholesky update performed in lines 5-8 of Algorithm 1 by means of forward substitution. The two subscripts at the Gram matrix in line 6 indicate that the entries of the Gram matrix in rows corresponding to previously chosen atoms and in the column corresponding to the latest chosen atom are considered. The orthogonalization and residual update step in the OMP algorithm can be written as
(2)
Algorithm 1. a = Batch-OMP(p^0 = D^T x, ε^0 = x^T x, G = D^T D)
 1: Init: Set I = ∅, L_1 = [1], a = 0, δ_0 = 0, p = p^0, i = 1
 2: while ε^{i−1} > ε do
 3:   k̂ = argmax_k |p_k|
 4:   I_i = I_{i−1} ∪ {k̂}
 5:   if i > 1 then
 6:     w = Solve: L_{i−1} w = G_{I_{i−1}, k̂}
 7:     L_i = [ L_{i−1}  0 ; w^T  √(1 − w^T w) ]
 8:   end if
 9:   a_{I_i} = Solve: L_i (L_i)^T a_{I_i} = p^0_{I_i}
10:   β = G_{I_i} a_{I_i}
11:   p = p^0 − β
12:   δ_i = a_{I_i}^T β_{I_i}
13:   ε^i = ε^{i−1} − δ_i + δ_{i−1}
14:   i = i + 1
15: end while
Due to the orthogonalization, the matrix D_I^T D_I is symmetric positive definite, which allows the Cholesky decomposition. In each iteration, the triangular matrix L is enlarged by another row. The non-zero coefficient vector a_{I_i} is computed in line 9 by means of a forward and a backward substitution. In line 11 we update the projection
(3)
When an error-constrained sparse approximation problem is to be solved, the residual is required to check the termination criterion. The ℓ2 norm of the residual, ε^i, is computed in line 13.

Image denoising: Image denoising tries to remove noise from a given image X and recover the original noise-free image X0 that is corrupted by additive, zero-mean, white and homogeneous Gaussian noise g with standard deviation σ [12]:

    X = X0 + g .    (4)

Sparse representations can be used for image denoising if a dictionary containing noise-free image components is given. The image X is decomposed into small overlapping patches x. After computing the noise-free sparse representation of each patch by means of Batch-OMP, the denoised image is given by a linear combination of the noisy image and an average of the denoised patches. A suitable dictionary, which we use throughout this paper, is a frame derived from cosine functions. An alternative would be dictionary training with the K-SVD algorithm [13].
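To make the patch-based procedure concrete, a minimal sketch of the extract-denoise-average loop over all overlapping 8 × 8 patches is given below. It is our own illustration: sparse_denoise_patch() stands for Batch-OMP followed by the reconstruction Da (replaced here by an identity stub), and the final blending with the noisy image is omitted.

#include <string.h>

#define W 512
#define H 512
#define P 8

/* Stub: in the real code this sparse-codes the patch and returns D*a. */
static void sparse_denoise_patch(const float in[P * P], float out[P * P])
{
    memcpy(out, in, P * P * sizeof(float));
}

static void denoise(const float *noisy, float *denoised)
{
    static float acc[W * H], cnt[W * H];
    float patch[P * P], rec[P * P];
    memset(acc, 0, sizeof acc);
    memset(cnt, 0, sizeof cnt);
    for (int y = 0; y + P <= H; y++)
        for (int x = 0; x + P <= W; x++) {
            for (int i = 0; i < P; i++)          /* extract one 8x8 patch */
                memcpy(&patch[i * P], &noisy[(y + i) * W + x],
                       P * sizeof(float));
            sparse_denoise_patch(patch, rec);
            for (int i = 0; i < P; i++)          /* accumulate reconstruction */
                for (int j = 0; j < P; j++) {
                    acc[(y + i) * W + x + j] += rec[i * P + j];
                    cnt[(y + i) * W + x + j] += 1.0f;
                }
        }
    for (int k = 0; k < W * H; k++)              /* per-pixel average */
        denoised[k] = acc[k] / cnt[k];
}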
3 Implementation on the Cell Processor
The Cell Broadband Engine consists of a general-purpose PowerPC Processor Element (PPE) for control-intensive tasks and several accelerator cores called Synergistic Processor Elements (SPEs), short-vector RISC processors specialized for data-rich and compute-intensive SIMD applications. The SPEs cannot directly work on main memory, but contain a small, private memory instead of a cache. This so-called Local Storage (LS) holds both instructions and data, which must be copied from and to main memory by means of asynchronous DMA transfers. This design explicitly parallelizes computation and the transfer of data and instructions [14]. In order to achieve maximum performance, the programmer has to exploit this kind of parallelism, together with data-level parallelism through SIMD vector processing and thread-level parallelism [15]. For denoising, we decompose the image into fully overlapping patches of 8 × 8 pixels. Due to their superior performance, all work is done by the SPEs, which must extract and copy the patches from main memory to their LS, sparse-code, reconstruct, and finally combine them into the denoised image. Depending on the local image content, the effort for sparse-coding a patch varies. To address this possible imbalance, patches are distributed dynamically to the SPEs with a granularity of four whole lines of patches, requiring data from 11 subsequent image rows. As only an even smaller window of the image can be held in the LS at a time, the rows need to be traversed in several steps. On the other hand, this allows for optimal overlapping of data transfer and computation by means of multibuffering. Separate target buffers are provided for each SPE, because otherwise additional synchronization of the SPEs' write accesses would be required. The most time-consuming part of the denoising is the Batch-OMP algorithm. Its data structures and kernels must be tuned to meet the requirements of the SPE as well as possible. Using SIMD is especially important, as this is the default case on an SPE and scalar operations are usually much more expensive. Further aspects are the in-order execution and the relatively high branch-miss penalty, which both make loops with short bodies and few iterations very inefficient. Hence, the sizes of the atoms and of the dictionary were restricted to multiples of eight and fixed at compile time, as well as the maximum number of atoms. The static data structures and known loop ranges generally simplify SIMD vectorization and address calculations, and enable better unrolling of loops. Typically, for a sufficient representation only a small but varying number of atoms needs to be chosen. As they possess different complexity, the importance of the several kernels can change a lot in this range, and it is advisable to put at least some effort into optimizing all parts. The following types of kernels are required:

Operations on dense vectors: These are the computation of the dot product (lines 1 and 12), the subtraction of a vector from another (line 11), and determining the index of the element with the maximal absolute value (line 3). For all of them, loop unrolling and SIMD vectorization are applied, which is straightforward for the subtraction. As the SPEs' instruction set architecture
offers solely vertical SIMD operations for floating-point values, only the four partial dot products over every fourth element can be computed efficiently. The final sum of the four partial results requires multiple instructions that can reorder SIMD vectors. A similar approach is used for the index search, where again four maximal values and the corresponding indices are determined first. Only vector lengths that are a multiple of four, i.e. full SIMD vectors, need to be considered, as the vector lengths are either restricted to a multiple of four or padded by zeros.

Gather operations: Creating vectors by gathering elements according to an index list is required in several steps of the algorithm (lines 6, 9, and 12). As the SPEs only support naturally aligned SIMD load and store operations, it is advisable to combine four elements of a target vector in a SIMD register before storing them to the LS. A trick is required to handle vectors with a size that is not a multiple of four: all source vectors are padded with an additional zero, and the index arrays are initialized with that offset to their maximal length beforehand. When rounding the number of indices up to a multiple of four in the gather kernels, no special handling of other sizes is required, and target vectors are automatically zero-padded to whole SIMD vectors. To extract values from the symmetric Gram matrix, an additional row of all zeros is added, and addressing is done row- instead of column-wise.

Matrix-vector multiplications: A multiplication of a dense matrix with a dense vector during initialization (line 1) and with a sparse vector for the projection of the Gram matrix onto the residual (line 10) need to be performed. In addition to loop unrolling and SIMD vectorization, register blocking is required to reduce the number of load and store operations. The purely dense multiplication is comparable to the SPE function of IBM's Cell-enhanced BLAS library, but has fewer restrictions on the matrix and vector size. For the multiplication with the sparse vector, one to four lines of the matrix are scaled and added at once.

Cholesky substitution: The lower triangular matrix and its upper triangular transpose are both stored in column-major order, as this is required to enable some SIMD processing in the forward and backward substitutions (lines 5 and 9). The reciprocal of the diagonal is stored separately; for each element a whole SIMD vector is used, with the i-th coefficient in the (i mod 4)-th slot and all other slots being zero. Although this makes it more complex to append a row or column to the triangular matrices (line 7), the possibility to use SIMD operations throughout the whole substitutions more than compensates. A corresponding formulation of the forward substitution Ly = b can be seen in Algorithm 2. Note that for the given storage scheme all operations, especially in lines 3 and 5, can be done in SIMD; the ranges of the inner loop at line 4 need to be rounded to multiples of four in this case. The backward substitution is done analogously. To speed up the computation further, separate kernels for sizes of 1, 2 to 4, 5 to 8, 9 to 12, 13 to 16, and 17 to 20 have been created, which allows splitting the outer loop and fully unrolling the inner loop and thus holding the whole temporary vector t in registers.
Algorithm 2. Forward substitution y = L^{-1} b
1: Init: y = 0, t = b
2: for j = 1 to n do
3:   y_j = y_j + t_j · (1/L_{j,j})
4:   for i = j + 1 to n do
5:     t_i = t_i − y_j · L_{i,j}
6:   end for
7: end for
Otherwise, this loop would require loading and storing all operands, as registers cannot be addressed according to the loop variable.
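As an illustration of the dense-vector kernels described at the beginning of this section, a hedged sketch of a SIMD dot product with four partial sums follows; the vector length is assumed to be a multiple of 16 floats, and the code is our own, not the authors'.

#include <spu_intrinsics.h>

/* Dot product of two dense float vectors given as nvec SIMD vectors
   (nvec is assumed to be a multiple of four). */
static float dot(const vector float *a, const vector float *b, int nvec)
{
    vector float s0 = spu_splats(0.0f), s1 = s0, s2 = s0, s3 = s0;
    for (int i = 0; i < nvec; i += 4) {        /* unrolled: four partial sums */
        s0 = spu_madd(a[i],     b[i],     s0);
        s1 = spu_madd(a[i + 1], b[i + 1], s1);
        s2 = spu_madd(a[i + 2], b[i + 2], s2);
        s3 = spu_madd(a[i + 3], b[i + 3], s3);
    }
    vector float s = spu_add(spu_add(s0, s1), spu_add(s2, s3));
    /* horizontal reduction of the four slots by quadword rotations */
    s = spu_add(s, spu_rlqwbyte(s, 8));
    s = spu_add(s, spu_rlqwbyte(s, 4));
    return spu_extract(s, 0);
}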
4 Results
Batch-OMP kernel: We use the IBM systemsim Cell simulator to investigate the performance of the Batch-OMP kernel itself. As for all other experiments presented, a dictionary containing K = 128 atoms of 64 elements each is used. The time required for denoising a single patch is shown in Fig. 1. Initialization (constant run time) and reconstruction of the patch (run time depending on the number of atoms chosen) are executed only once. The other kernels need to be executed whenever a new atom is chosen. As they possess different complexity, their contribution is analyzed in more detail in Fig. 2.
Fig. 1. Time required for sparse coding and reconstruction of a single 64-element patch with a dictionary of 128 atoms for a fixed number of atoms to be chosen
About 94% of the initialization is spent on multiplying the dictionary matrix D with the current patch vector x. If only a small number of atoms is chosen, it dominates the overall runtime. The index search (line 3 of Algorithm 1) and the vector subtraction (line 11) depend only on the dictionary size. They play some role as long as the triangular
matrix L is very small, but become less and less important if more than four atoms are chosen. One would expect the substitution kernels to become dominant soon: while all kernels not mentioned yet have linear complexity, these are the only ones with complexity O(i^2). When the whole Batch-OMP is considered, the effort should even increase cubically with the number of atoms chosen. For the set sizes evaluated, however, this cannot be observed. This is related to the unrolling of the inner loop of the substitution kernel of Algorithm 2 starting at line 4. As the temporary vector t corresponds to very few SIMD vectors, all succeeding computations fall into the latency of the first update, which is on the critical path. The substitution kernels therefore exhibit a linear increase of runtime for small triangular matrices, and a quadratic one only for larger ones.
Fig. 2. Contribution of the various kernel types to the overall runtime
In general, the multiplications of a matrix with dense (during initialization) or sparse vectors (projection of the Gram matrix onto the residual and reconstruction of the denoised patch), together with the substitutions, need at least 3/4 of the time. While the former generally exhibit a high computational density, the substitutions also suffer from latencies as long as only very few atoms have been chosen.

Denoising performance: We run our Cell implementation to denoise an image of size 512 × 512 on a Playstation 3 (PS3), an IBM QS22 blade and a current multicore CPU. Each of the 255025 overlapping patches corresponds to 8 × 8 pixels. As the rounding mode of the SPEs yields results slightly different from the CPU's, for each patch a sparse representation with 16 atoms is computed to ensure that the same amount of work is done. The measured runtimes, excluding reading the noisy image and writing the denoised image, are shown for the blade and for the PS3 in Fig. 3, together with the parallel efficiency for different numbers of used SPEs on the blade. Here the
parallel efficiency is considered with respect to the runtime on one SPE. The results from the PS3 resemble the results from the blade up to the six available SPEs, scaled by the slightly lower CPU frequency of 3.188 instead of 3.2 GHz. Based on the analysis of the Batch-OMP kernel, the overhead can be estimated. In the serial case, the Batch-OMP kernel is executed for 99.1 % of the time, and even when using 16 SPEs for 96.6 % of the time on each of them. Orchestrating the DMA transfers and combining the denoised images takes only a small amount of time. The parallel efficiency fluctuates mainly due to the granularity of four lines that are assigned to an SPE at a time. Less apparent, a continuous decrease of efficiency is observed. The main reason here is the increasing work to combine the partial results each SPE creates separately. Both effects could partly be counteracted, but the effort does not seem worthwhile with 92.7 % parallel efficiency in the worst case of 15 SPEs.
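Here, with T(1) denoting the runtime on one SPE and T(p) the runtime on p SPEs, the parallel efficiency is the standard ratio

    E(p) = T(1) / (p · T(p)),

so the quoted 92.7 % for p = 15 corresponds to a speedup of roughly 0.927 · 15 ≈ 13.9 over a single SPE.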
Fig. 3. Runtimes and parallel efficiency of image denoising on Cell Broadband Engine
In Tab. 1 the runtimes of the algorithm on the blade from Fig. 3 are compared to the runtimes of a (not hand-optimized) parallel OpenMP implementation for multicore CPUs. The runtimes were measured on an Intel Core i7 940 with 2.93 GHz (Nehalem). It can be seen that the per-core performance of the CBEA implementation is about 5 times higher.

Table 1. Comparison of runtimes of the image denoising algorithm on a current multicore CPU and Cell for different numbers of cores or SPEs

Cores / SPEs             1      2      4
Runtime [s] on QS22      3.51   1.76   0.89
Runtime [s] on Nehalem   17.9   9.52   4.79

Denoising results: A comparison of a noisy image and the denoised image resulting from the algorithm described in this paper is presented in Fig. 4. Fig. 4(a)
depicts the original image and in Fig. 4(b) the same image corrupted by Gaussian noise with σ = 20 is shown. As can be seen from Fig. 4(c), the noise has been reduced significantly, while the underlying structure and edges have been preserved. Thus, the difference image in Fig. 4(d) contains much noise, but only little structure. The denoising of the shown image with 16 SPEs on a blade takes 59 ms and on average 4 atoms per patch are used.
Fig. 4. Denoising results for Lena image of size 512 × 512 pixels: (a) original image, (b) noisy image (I_noise), (c) denoised image (I_den), (d) difference image (I_noise − I_den)
5
Future Work
Currently we work on a comparison of the CBEA and modern graphics processing units with respect to the performance of the denoising algorithm. In addition, new architectures like the Intel Larrabee processor are worth considering.
A Blocking Strategy on Multicore Architectures for Dynamically Adaptive PDE Solvers Wolfgang Eckhardt and Tobias Weinzierl Technische Universität München, 85748 Garching, Germany {eckhardw,weinzier}@in.tum.de http://www5.in.tum.de
Abstract. This paper analyses a PDE solver working on adaptive Cartesian grids. While a rigorous element-wise formulation of this solver offers great flexibility concerning dynamic adaptivity, and while it comes along with very low memory requirements, the realisation's speed cannot cope with codes working on patches of regular grids—in particular, if the latter deploy patches to several cores. Instead of composing a grid of regular patches, we suggest identifying regular patches throughout the recursive, element-wise grid traversal. Our code then unrolls the recursion for these regular grid blocks automatically, and it deploys their computations to several cores. It hence benefits from multicores on regular subdomains, but preserves its simple, element-wise character and its ability to handle arbitrary dynamic refinement and domain topology changes.
1
Introduction
Besides experiments and theoretical models, computational science and engineering is a driving force for new scientific insights, and, thus, there is an everlasting need for more detailed and accurate numerical results. While scientific codes and their results have been benefiting from increasing clock speeds for decades, hardware evolution nowadays comprises an increase in the number of cores. Software has to exploit multicores [9]. Tree data structures and element-wise traversals are of great value for the management of adaptive Cartesian grids for partial differential equations (PDEs). The combination facilitates dynamic h-adaptivity together with geometric multigrid algorithms, it yields storage schemes with low memory requirements, and the tree topology incorporates a domain decomposition [2,5,10], i.e. our PDE solvers based on this combination all run on distributed memory machines. In this paper, we concentrate on the shared memory parallelisation for multicore systems where running multiple domain decomposition codes on one multicore machine is not an option—here, a single element-wise traversal has to exploit the cores. On a distributed memory cluster with multicore nodes, the techniques then are one building block of a heterogeneous parallelisation strategy. Dynamic adaptivity—neither structure, location nor moment of refinement are known a priori—is a key ingredient of sophisticated PDE solvers. For such
codes, it is convenient to elaborate an algorithm on the smallest work unit available. The solver then arises from the combination of individual work units and the actual grid. An element-wise grid traversal mirrors this fact. A rigorous sequential, element-wise formulation often introduces runtime penalties, as the number of computations per element is typically small. With a small set of operations, a code neither exploits the vast number of registers available (due to VLIW or SSE), nor can it use multicore architectures economically. Many PDE solvers hence do not support arbitrary dynamic grids, but restrict themselves to adaptive grids consisting of regularly refined grid patches ([1,3], e.g.). On these patches, they switch from an element-wise formulation to a holistic point of view, optimise the calculations ([1,4,7], e.g.), and outperform element-wise approaches. This paper combines advantages of both worlds, as it allows arbitrary, adaptive and changing grids but nevertheless benefits from regularly refined patches. A synthesised attribute [6] acts as a marker on the tree representing the adaptive Cartesian grid. This marker is updated on-the-fly and identifies invariant, i.e. unchanging and regularly refined subdomains. Throughout the traversal, the algorithm analyses the marker and either processes the grid element-wise or switches to a block-wise processing: The latter holds complete regular subgrids in one array. The corresponding computations fit a shared-memory paradigm and, thus, fit a multicore parallelisation. The results exhibit a high level of concurrency although the original algorithm is a pure element-wise formulation and although the grid supports arbitrarily dynamic refinement. Since the tree traversal's formulation is recursive, such a blocking of computations equals a recursion unrolling [8]: Our approach tessellates a domain with an arbitrarily refined grid, but then identifies and optimises regular subregions throughout the computation. The remainder of the paper is organised as follows: First, we introduce our dynamic grids and the corresponding element-wise traversal. These grids require a very small amount of memory to store them, they support dynamic h-refinement, and they resolve all grid resolutions simultaneously, i.e. they yield a sequence of regular Cartesian grids. The latter is an important ingredient for geometric multigrid algorithms. Second, we introduce our recursion unrolling and blocking approach, and we give the tree grammar identifying regularly refined subgrids. The numerical results, third, illustrate the runtime improvement due to this blocking, i.e. they illustrate the advantages for multicore architectures. Although the experiments concentrate on a matrix-free solver of the Poisson equation, the insights hold for many PDEs, and a short conclusion picks up this fact and closes the discussion.
2
k-Spacetrees
Let d denote the spatial dimension of the PDE. The adaptive Cartesian grids in this paper result from k-spacetrees: First, we embed the computational domain into a hypercube. Second, we cut this hypercube into k parts, k ≥ 2, along each coordinate axis, and, thus, end up with k^d small hypercubes. Third,
Fig. 1. k-spacetree construction process for d = 2 and k = 3 for a square (left). The square’s boundary is resolved with a smaller resolution than the square’s area. Cut through adaptive spacetree for a sphere (right).
we individually decide for each small hypercube whether to apply the refinement scheme recursively. The construction scheme equals, on the one hand, an adaptive Cartesian grid (Figure 1): Each tree level yields one regular Cartesian grid, while the individual regular grids are embedded into each other. The tree's leaves (unrefined cells) in turn give an adaptive Cartesian grid. On the other hand, the construction imposes a tree topology on the grid's elements. As our adaptive grids and k-spacetrees are equivalent, we hold the k-spacetree instead of the grid itself. This tree equals a quadtree for k = 2, d = 2. For k = 2, d = 3, we end up with an octree. The number of recursive steps depends on the boundary approximation, on the computational domain's shape, and on the numerical accuracy to be obtained. Having a tree structure holding all the levels of the grid construction process is of great value for geometric multigrid algorithms [10]. Furthermore, dynamic h-adaptivity and dynamic coarsening fit perfectly to this approach, since they just add elements to or remove elements from the k-spacetree. Besides the flexibility, k-spacetrees also facilitate an efficient encoding: Throughout a depth-first traversal of the tree, one bit per cell is sufficient to decide whether the cell is refined or not, i.e. if the spacetree is encoded along a depth-first traversal, a bit stream of length n is sufficient; n being the number of spacetree elements. A fixed number of additional bits per vertex or cell, respectively, then allows encoding the geometry via a marker-and-cell approach, i.e. we denote for each cell whether it is inside the computational domain or not, and it allows encoding the boundary conditions. We end up with a storage scheme with very modest memory requirements [2,5], although it supports dynamic, arbitrary adaptivity. Hence, it is an obvious idea to use a depth-first traversal to run through the spacetree. Such a traversal equals an element-wise traversal of the corresponding adaptive Cartesian grid levels where each element has 2d adjacent elements. The vertices have either 2^d adjacent cells if they are part of a regular Cartesian subgrid, or less than 2^d cells. The latter vertices are hanging vertices; they border
a regular Cartesian grid of one level, and they require special mathematical treatment. A PDE solver finally boils down to an element-wise assembly or, if we implement a matrix-free method, an element-wise matrix-vector evaluation. It is typically realised recursively.
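To make the encoding concrete, the following sketch (an illustration with assumed data structures, not the code of the framework) streams one refinement bit per cell along a depth-first traversal; decoding re-creates the k^d children whenever a set bit is read:

    #include <stdlib.h>

    typedef struct Cell {
        int refined;                 /* 1 if the cell has k^d children       */
        int num_children;            /* k^d for refined cells, 0 for leaves  */
        struct Cell **children;
    } Cell;

    typedef struct {
        unsigned char *bits;         /* pre-allocated, zero-initialised buffer */
        size_t length;               /* number of bits written so far          */
    } BitStream;

    static void push_bit(BitStream *s, int bit)
    {
        if (bit)
            s->bits[s->length / 8] |= (unsigned char)(1u << (s->length % 8));
        s->length++;
    }

    /* Depth-first encoding: one bit per spacetree cell. */
    void encode(const Cell *cell, BitStream *stream)
    {
        push_bit(stream, cell->refined);
        for (int i = 0; i < cell->num_children; i++)
            encode(cell->children[i], stream);
    }

Additional per-vertex or per-cell marker bits for the geometry and the boundary conditions can be appended to the same stream in exactly the same traversal order.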
3
Blocking
A recursive, element-wise traversal of the spatial discretisation given by a k-spacetree is flexible with respect to changing grids, and it comes along with low memory requirements. Yet, the work per element typically comprises only a small number of floating point operations. This reveals two shortcomings of a strict element-wise approach: First of all, it is difficult to exploit huge numbers of registers and instruction-level parallelism (VLIW or SSE) with few instructions—with only a small number of operations at hand, one has neither a broad range of opportunities to tune the data layout and access ordering manually, nor can the compiler optimise the instruction scheduling [4]. Furthermore, the traversal comprises integer arithmetics. In the worst case, these integer arithmetics dominate the execution time, i.e. the traversal's integer evaluations slow down the floating point computations. Solvers working on regular Cartesian grids do not have these problems: Integer arithmetics for regular data structures are uncomplicated, and the regularity of the data structures facilitates transformations such as loop unrolling, loop merging, and blocking [4,7]. They make the code utilize advanced hardware features. Regular grids also fit a shared memory parallelisation, where the set of computations is split among several cores sharing memory or caches. They fit multicores. Our tree algorithm is very flexible and allows the treatment of adaptive grids. Whenever it resolves regular grids, however, the recursive formulation cannot cope with codes optimised for and benefiting from regular data structures. In particular, it is difficult to port it to a multicore architecture. As we do not want to restrict any flexibility, we make our adaptive algorithm identify regular subgrids throughout the traversal. For each regular subgrid, we replace the recursive element-wise formulation with tailored patch-based algorithmics. 3.1
Patch Identification
The traversal splits up the adaptive refinement into a two-step process: Whenever the algorithm wants to refine a spacetree leaf, it sets a refinement flag for the leaf. We then add the k^d additional spacetree nodes throughout the subsequent traversal. Let p : E → N_0^+ ∪ {⊥} assign each cell of the k-spacetree a number. This number is a pattern identifier. E denotes the nodes of the k-spacetree. For all leaves e ∈ E,

    p(e) = 0  if no refinement flag is set, no adjacent vertex is hanging, and all adjacent vertices are inside the domain,
    p(e) = ⊥  else                                                     (1)
holds. Let ⊑ denote the k-spacetree's topology, i.e. for a, b ∈ E with a ⊑ b, a is a child of b. With (1), the traversal assigns each refined node e a pattern number

    p(e) = i + 1  if ∃ i ≥ 0 : ∀ c ⊑ e : p(c) = i,
    p(e) = ⊥      else                                                 (2)

throughout the steps-up of the traversal, i.e. whenever it backtracks the depth-first traversal and unrolls the call stack, p is set. This p is an analysed tree attribute [6]. The coarsening is treated analogously and resets pattern identifiers. Due to the attribute, we need one additional integer per spacetree node. Each refined node and its descendants compose a (sub-)k-spacetree. Each time the algorithm encounters a pattern number greater than or equal to one, we know that this subtree corresponds to a cascade of regular grids due to the universal quantifier in (2). We furthermore know that it will not change throughout the traversal and comprises exclusively non-hanging inner nodes. p(e) gives us the height of this subtree. The corresponding levels of the subtree yield regular Cartesian grids, and the identifier's value describes the size of these grids. For the vertices within this grid, we can omit any case distinctions handling hanging nodes, boundary nodes, and so forth. 3.2
Recursion Unrolling
The operations executed by the recursive traversal per element split up into three phases: First, the traversal loads the data belonging to a spacetree node (vertex and cell data). Second, it triggers the element-wise operator evaluations or the assembly. Afterwards, it steps down recursively. Third, it writes back the node's data. Our multicore strategy merges these three steps for regular subtrees. When the algorithm encounters an element e with p(e) = i ≥ 1, the algorithm allocates memory for i + 1 regular Cartesian grids with 1, k^d, k^{2d}, ... cells each. Afterwards, it reproduces the three processing phases: First, it invokes the load operations for all the elements. The data is stored in the regular Cartesian grids. Second, it applies the operator on all the regular Cartesian grids. Third, it invokes the store operations. The modified traversal transforms the recursive formulation into a sequential algorithm. It is a recursion unrolling where the three different processing steps are merged. While the traversal still works on an arbitrary adaptive grid, and while the grid may change between two traversals, the algorithm identifies regular patches on-the-fly and processes them with the modified scheme. For the remaining (irregular) grid parts, it preserves its recursive nature. 3.3
Blocking and Multicore
The recursion unrolling first eliminates, e.g., several loads and stores for elements sharing vertices within a patch and, thus, it improves the performance. Second, having regular Cartesian grids at hand allows realising sophisticated solvers processing whole blocks of unknowns or discretisations or shape functions, respectively. In this paper, we concentrate on the performance aspect and neglect
solver modifications. However, all solvers, third, benefit from the application of source code optimisations such as loop unrolling and loop merging. Finally, the blocks provide big, homogeneous data structures. Multicores benefit from the latter. Our concurrency strategy is twofold: The original, recursive, element-wise spacetree traversal runs within one single thread. This thread also computes the pattern identifiers. As soon as the single thread encounters a refined element with a positive identifier, it switches to the block algorithm. This block algorithm in turn distributes its computing phase among several threads. As splitting up uniform work packages on a regular grid is straightforward, the thread work assignment and scheduling is straightforward, too. Grids with many huge patches of regular Cartesian grids benefit from this multicore strategy. They profit in particular if the regular subgrids remain unchanged throughout the simulation. Nevertheless, the flexibility of the overall algorithm is preserved, i.e. it still handles arbitrarily refined and changing grids.
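The strategy can be pictured with the following sketch (simplified, with assumed types and helper functions such as process_element or apply_operator_on_cell; it is not the actual traversal code): the single traversal thread descends recursively, and as soon as it encounters a refined element with a positive pattern identifier it switches to an array-based treatment of the regular subgrid whose cell loops are distributed over the cores with OpenMP:

    #include <omp.h>

    typedef struct Element Element;
    typedef struct RegularGrid RegularGrid;

    struct Element {
        int pattern_identifier;      /* analysed attribute p(e)           */
        int num_children;
        Element **children;
    };
    struct RegularGrid { long num_cells; /* ... vertex and cell data ... */ };

    /* Hypothetical helpers standing in for the real load/compute/store kernels. */
    RegularGrid *allocate_regular_grids(int levels);
    void load_subtree_into_grids(Element *e, RegularGrid *levels);
    void apply_operator_on_cell(RegularGrid *g, long cell);
    void store_grids_into_subtree(RegularGrid *levels, Element *e);
    void free_regular_grids(RegularGrid *levels, int n);
    void process_element(Element *e);

    void traverse(Element *e)
    {
        if (e->pattern_identifier > 0) {
            int height = e->pattern_identifier;
            RegularGrid *levels = allocate_regular_grids(height + 1);
            load_subtree_into_grids(e, levels);            /* phase 1: load    */
            for (int l = 0; l <= height; l++) {
                RegularGrid *g = &levels[l];
                #pragma omp parallel for                    /* phase 2: compute */
                for (long c = 0; c < g->num_cells; c++)
                    apply_operator_on_cell(g, c);
            }
            store_grids_into_subtree(levels, e);            /* phase 3: store   */
            free_regular_grids(levels, height + 1);
        } else {
            process_element(e);                             /* element-wise work */
            for (int i = 0; i < e->num_children; i++)
                traverse(e->children[i]);
        }
    }

In practice, the minimal pattern identifier for which the block algorithm is entered would be chosen depending on the core count and the memory of the target machine.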
4
Numerical Results
We tested the blocking and multicore strategy on AMD Opteron and Intel Itanium2 processors at the Leibniz Supercomputing Centre of the Bavarian Academy of Sciences and Humanities. The latter were part of SGI's Altix 4700 platform with up to four dual cores per blade, while the Opteron cluster provided up to four processors per node. A matrix-free solver for a simple Poisson equation discretised with linear finite elements on the unit square/cube (d ∈ {2, 3}) acted as test PDE solved by a Jacobi scheme. We started with a regular grid and, afterwards, applied an error estimator refining the grid where the second derivative became big.

The figures reveal a significant speedup for regular grids (Table 1). With smaller mesh widths, the k-spacetree's depth grows (second column), and the number of regular subtrees without boundary vertices increases (third column, with the second number giving the recursion unrolling depth p(e))—the patch numbers follow a simple analytical formula. If cells adjacent to the boundary were not excluded from the recursion unrolling, the whole grid would fit into one patch. Multicore systems benefit from big patches, as they facilitate a high concurrency level and big grain sizes with low synchronisation and balancing overhead. Small patches cannot benefit from high core numbers according to Amdahl's law, i.e. the parallel efficiency declines with an increasing number of cores, and it increases with an increasing tree depth. Furthermore, the main memory restricts the maximum recursion unrolling. The parallel speedup hence suffers from a big dimension d: the patch size of a fixed recursion unrolling level grows exponentially, and the ratio of big to smaller patches drops, i.e. few relatively small patches already occupy the complete memory. Nevertheless, the results reflect that for d = 3 a pure recursion unrolling is not sufficient to obtain a better performance—here, the unrolled implementation has to be tuned, too.

Table 1. Speedup on a regular grid with mesh width h. The upper part of the table gives results for d = 2, the lower part for d = 3.

h             Depth  Patches                               Vertices      Opteron           Itanium2
                                                                         2 Cores  4 Cores  2 Cores  4 Cores  8 Cores
6.6 · 10^-3   5      1 × 4, 40 × 3, 184 × 2, ...           5.86 · 10^4   1.18     1.18     1.23     1.14     1.14
3.3 · 10^-3   6      1 × 5, 40 × 4, 184 × 3, 616 × 2, ...  5.30 · 10^5   1.42     1.52     1.41     1.47     1.57
6.6 · 10^-4   7      1 × 6, 40 × ...                       4.78 · 10^6   1.68     2.28     1.66     2.29
3.3 · 10^-4   8      ...                                   4.30 · 10^7   1.83     2.99     1.84     2.93
3.3 · 10^-2   4      1 × 3, 316 × 2                        5.12 · 10^5   1.07     1.06     1.09     1.10
6.6 · 10^-3   5      1 × 4, 316 × 3, 6364 × 2, ...         1.42 · 10^7   1.36     1.47     1.41     1.62

In Table 2, we study a dynamically changing grid. Here, the runtime profits from the recursion unrolling for the initial regular grid. Afterwards, the speedup breaks down due to the adaptive refinement. If the grid becomes quite regular again after a couple of iterations—the figures track the grid construction phase, while an almost stationary grid again yields figures similar to those in Table 1—the speedup rises again. While the parallelisation cannot exploit several cores throughout the grid refinement phase, it is robust for all experiments, i.e. it never introduces a runtime penalty.

Table 2. Speedup for d = 2 on the Opteron. The error estimator updates the grid several times (left). Adaptive grid after three iterations (right).

it   #vertices      Patches                                      2 Cores  4 Cores
1    5.414 · 10^6   1 × 7, 40 × 6, 184 × 5, 616 × 4, 1912 × 3    1.71     2.28
2    1.971 · 10^7   447 × 4, 15519 × 3                           1.02     1.00
3    2.490 · 10^7   176 × 4, 7122 × 3                            1.00     1.02
4    1.318 · 10^8   2 × 5, 443 × 4, 6582 × 3                     1.02     1.02
5    3.208 · 10^8   1 × 5, 363 × 4, 19005 × 3                    1.18     1.16
6    3.251 · 10^8   275 × 4, 18893 × 3                           1.23     1.23
...  ...            ...                                          ...      ...
5
Conclusion
This paper starts from the observation that regular data structures in particular profit from multicores. It suggests a paradigm shift for adaptive mesh refinement:
Instead of composing a grid of small regular grids, it identifies regular subgrids on-the-fly, and switches to an optimised, multithreaded traversal code for these subgrids. Hence, it can tackle grids changing their structure and topology permanently. As our approach is restricted to an algorithmic optimisation of the tree traversal, matrix-free PDE solvers such as the solver studied here benefit immediately from the traversal tuning. If the system matrices are set up explicitly, this approach speeds up solely the assembly (and, perhaps, some pre- and postprocessing)—a phase that typically does not dominate the overall solver runtime. However, the assembly workload becomes the more critical the more often the mesh changes, or, in turn, our optimisation enables the code to remesh more frequently. The effect of the blocking on the memory overhead of an explicit matrix setup—matrices corresponding to regular Cartesian grids can be stored more efficiently than flexible sparse matrices—is another issue yet to be studied. The numerical results restrict themselves to the Poisson equation. Nevertheless, the approach works for every PDE solver realised on k-spacetrees, and every solver holding big, regularly refined, stationary grid patches benefits from the recursion unrolling. The primary application scope of the code here is computational fluid dynamics and fluid-interaction problems [2], where the computational domain typically consists of huge areas covered by regular grids. On the permanently changing boundary areas, a fine, adaptive grid resolution is of great value. The paper shows that a rigorous recursive, element-wise formulation for such a problem still benefits from multicore architectures—especially if we adapt the refinement criteria and the maximum unrolling depth to the actual multicore architecture. Nevertheless, it is still a fact that a problem's structure has to fit to a multicore architecture, i.e. if the grid consists exclusively of irregular subdomains, the approach here does not result in a multicore speedup.
Acknowledgments Thanks are due to Hans-Joachim Bungartz and Miriam Mehl for supervising the underlying diploma and Ph.D. thesis, as well as the valuable contributions and encouragement.
References
1. Bergen, B., Wellein, G., Hülsemann, F., Rüde, U.: Hierarchical hybrid grids: achieving TERAFLOP performance on large scale finite element simulations. International Journal of Parallel, Emergent and Distributed Systems 4(22), 311–329 (2007)
2. Brenk, M., Bungartz, H.-J., Mehl, M., Muntean, I.L., Neckel, T., Weinzierl, T.: Numerical Simulation of Particle Transport in a Drift Ratchet. SIAM Journal of Scientific Computing 30(6), 2777–2798 (2008)
3. de St. Germain, J.D., McCorquodale, J., Parker, S.G., Johnson, C.R.: Uintah: a massively parallel problem solving environment. In: The Ninth International Symposium on High-Performance Distributed Computing, pp. 33–41 (2000)
4. Gropp, W.D., Kaushik, D.K., Keyes, D.E., Smith, B.F.: High-performance parallel implicit CFD. Parallel Computing 27(4), 337–362 (2001)
5. Günther, F., Mehl, M., Pögl, M., Zenger, C.: A cache-aware algorithm for PDEs on hierarchical data structures based on space-filling curves. SIAM Journal on Scientific Computing 28(5), 1634–1650 (2006)
6. Knuth, D.E.: The genesis of attribute grammars. In: WAGA: Proceedings of the International Conference on Attribute Grammars and Their Applications, pp. 1–12 (1998)
7. Kowarschik, M., Weiß, C.: An Overview of Cache Optimization Techniques and Cache-Aware Numerical Algorithms. In: Algorithms for Memory Hierarchies 2002, pp. 213–232 (2003)
8. Rugina, R., Rinard, M.C.: Recursion Unrolling for Divide and Conquer Programs. In: Midkiff, S.P., Moreira, J.E., Gupta, M., Chatterjee, S., Ferrante, J., Prins, J.F., Pugh, B., Tseng, C.-W. (eds.) LCPC 2000. LNCS, vol. 2017, pp. 34–48. Springer, Heidelberg (2001)
9. Sutter, H.: The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software. Dr. Dobb's Journal 3(30), 202–210 (2005)
10. Weinzierl, T.: A Framework for Parallel PDE Solvers on Multiscale Adaptive Cartesian Grids. Verlag Dr. Hut (2009)
Affinity-On-Next-Touch: An Extension to the Linux Kernel for NUMA Architectures Stefan Lankes, Boris Bierbaum, and Thomas Bemmerl Chair for Operating Systems RWTH Aachen University 52056 Aachen, Germany {lankes,bierbaum,bemmerl}@lfbs.rwth-aachen.de
Abstract. For many years now, NUMA architectures have been used in the design of large shared memory computers, and they are gaining importance even for smaller-scale systems. On a NUMA machine, the distribution of data has a significant impact on the performance and scalability of data-intensive programs, because of the difference in access speed between local and remote parts of the memory system. Unfortunately, memory access patterns are often very complex and difficult to predict. Affinity-on-next-touch may be a useful page placement strategy to distribute the data in a suitable manner, but support for it is missing from the current Linux kernel. In this paper, we present an extension to the Linux kernel which implements this strategy and compare it with alternative approaches.
1
Introduction
Today, shared memory multiprocessor computers increasingly often employ a NUMA (Non-Uniform Memory Access) design, even ones with only a single mainboard. In such systems, the memory is directly attached to the processors via integrated memory controllers, each processor/memory pair forming a NUMA node. Node-local memory access is faster than access to remote memory, which is reached via some system interconnect, leading to a non-uniform memory access performance characteristic in the system as a whole. Also, the performance may be constrained by congested data paths to and from NUMA nodes, if too many remote accesses occur. Hence, on NUMA systems, the distribution of data typically has a significant impact on the performance of applications which process large amounts of data. To parallelize data-intensive algorithms, the shared memory programming model OpenMP [1] is often preferred over alternatives like MPI or PGAS because it allows an incremental approach to parallelism and obviates the need to specify the distribution of data over the NUMA nodes. However, the flat memory model of OpenMP needs to be properly mapped onto the hierarchical memory system of a NUMA architecture to avoid the performance issues mentioned above. A feasible approach may be to analyze the application's data access patterns and try to minimize the number of remote memory accesses by a proper data
distribution [2,3]. But this may be a tedious task if the memory access patterns are complex and dynamically changing during runtime. In such situations, an adaptive data distribution mechanism, supported by the operating system or runtime system, may be more beneficial, if the overhead to determine the actual access pattern and migrate the data can be kept low. Affinity-on-next-touch is a promising adaptive data distribution strategy. The basic idea is this: Via some runtime mechanism, a user-level process activates affinity-on-next-touch for a certain region of its virtual memory space. Afterwards, each page in this region will be migrated to that node which next tries to access it. Noordergraaf and van der Pas [4] have proposed to extend the OpenMP standard to support this strategy. That proposal was first implemented in Compaq's OpenMP compiler by Bircsak et al. [5]. Since version 9, the affinity-on-next-touch mechanism is available in the Solaris operating system and can be triggered via the madvise system call. Löf and Holmgren [6] and Terboven et al. [7] have described their encouraging experiences with this implementation. Linux, one of the most important operating systems in HPC, does not yet support affinity-on-next-touch. Terboven et al. [7] have presented a user-level implementation of this strategy for Linux. Unfortunately, its overhead is very high, as is shown in Sect. 2. In Sect. 3, we present a new, kernel-based solution, and in Sect. 4, we evaluate its performance.
2
Analysis of the User-Level Implementation
To realize affinity-on-next-touch in user space, Terboven et al. [7] protect a specific memory area from read and write accesses and install a signal handler to catch access violations. If a thread accesses a page in the protected memory area, the signal handler migrates the page to the node which handled the access violation. An access violation raises a synchronous signal which is handled on the node of the interrupted thread. Therefore, the page migrates to the node of the thread that tries to access the page. Afterwards, the signal handler clears the page protection and the interrupted thread is resumed. To evaluate the overhead of this user-level solution, we use a small OpenMP program, in which a parallel for-loop increments each element of an array. We measure the runtime of this loop during consecutive iterations. We expect that during the first iteration the pages are migrated and therefore, the difference between the time for the first iteration and the subsequent iterations shows the overhead of the page migration mechanism. We ran this benchmark on devon, a dual-socket, quad-core Opteron 2376 (2.3 GHz) system with 512 KB L2 cache per core, 6 MB shared L3 cache per processor and 32 GB of main memory running Fedora 10, which is based on kernel version 2.6.27. The benchmark was compiled with gcc 4.3.2 and uses an array size of 512 MB and 8 threads in the parallel region. Each thread is bound to one core of the NUMA system. We compare the results with those using Linux’s default first touch page placement strategy, which puts each page next
to the processor first accessing it. Because the benchmark initializes the array in a sequential loop, it is entirely placed on one node in this case. As the 3rd column of Tab. 1 shows, the user-level solution suffers from a high migration overhead. After the pages have been migrated, accesses to the array are significantly faster than with first touch, because remote accesses are completely avoided. The time for the first iteration shows the overhead which is too high to make this implementation really beneficial.

Table 1. Measured time per loop iteration over the array (8 threads)

Iteration   First touch   Next Touch (User-Level)   Next Touch (Kernel-Level)
1           127.23 ms     5179.09 ms                418.75 ms
2           128.40 ms     66.82 ms                  66.45 ms
3           127.99 ms     66.96 ms                  66.72 ms
4           128.80 ms     66.74 ms                  66.83 ms
5           128.45 ms     66.64 ms                  66.94 ms
6           128.41 ms     66.60 ms                  66.74 ms
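The measurement loop just described might look like the following sketch (array size and timing simplified; thread pinning and the actual next-touch trigger are omitted):

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N (512L * 1024 * 1024 / sizeof(double))   /* 512 MB array */

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        for (long i = 0; i < (long)N; i++)   /* sequential init: first touch  */
            a[i] = 0.0;                      /* places all pages on one node  */

        /* here the benchmark would activate affinity-on-next-touch for a[]   */

        for (int it = 1; it <= 6; it++) {
            double t0 = omp_get_wtime();
            #pragma omp parallel for
            for (long i = 0; i < (long)N; i++)
                a[i] += 1.0;                 /* 1st iteration pays the migration */
            printf("iteration %d: %.2f ms\n", it, (omp_get_wtime() - t0) * 1e3);
        }
        free(a);
        return 0;
    }

The difference between the first and the following iterations then isolates the cost of the page migration mechanism, as in Tab. 1.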
This overhead may be caused by switching the thread context to the signal handler or by running the signal handler itself. Coarsely speaking, the signal handler consists of three system calls: getcpu is used to get the number of the thread's NUMA node, move_pages to migrate the page and mprotect to change the access permissions. To understand what causes the overhead, we measured the system call time per thread and the total time for the first loop iteration with an increasing thread count. The difference between the loop runtime and the sum of the system call times is the time needed to create the parallel for-loop, to increment the elements, to handle the access violation and to switch the thread context to the signal handler. Fig. 1 shows that the loop runtime stays almost constant while the thread count increases, a sign of quite bad scalability, because the more threads there are, the less each single thread needs to do. As long as the system's data paths are not congested by the migration traffic, the loop should run faster with more threads. The main reason for the observed behavior is the increasing time for mprotect and the bad scalability of move_pages. The time for getcpu is negligible. Besides this, the difference between the loop runtime and the time needed for the system calls is pretty large and increases when starting more threads. That time is primarily spent for access violation handling and context switching. Creating the parallel region and incrementing the array elements does not take very long in comparison as can be seen in Tab. 1, because this is done in the subsequent loop iterations as well.
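To make the three system calls concrete, a minimal sketch of such a user-level handler is given below (this is not the code from [7]; error handling, the bookkeeping of the protected region, and the treatment of pages that are already local are omitted, and one page is migrated per fault):

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <numaif.h>          /* move_pages(), link with -lnuma */

    static long page_size;

    static void on_access_violation(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        void *page = (void *)((uintptr_t)si->si_addr & ~((uintptr_t)page_size - 1));

        unsigned int cpu, node;
        syscall(SYS_getcpu, &cpu, &node, NULL);       /* NUMA node of this thread */

        int target = (int)node, status;
        move_pages(0, 1, &page, &target, &status, MPOL_MF_MOVE);   /* migrate */

        mprotect(page, page_size, PROT_READ | PROT_WRITE);   /* re-enable access */
    }

    void install_next_touch_handler(void)
    {
        struct sigaction sa;
        page_size = sysconf(_SC_PAGESIZE);
        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = on_access_violation;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);
    }

Each fault thus enters the kernel three more times on top of the page fault itself, which is why the following paragraphs look at mprotect and move_pages in more detail.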
Fig. 1. Time per thread for 1st loop iteration (Next Touch, User-Level): time [ms] spent in getcpu, mprotect, and move_pages, and the difference to the loop runtime, for 1, 2, 4, and 8 threads
To explain the behavior of mprotect, we need to take a closer look at the memory management of the Linux kernel [8]. A process's virtual memory is sliced into Virtual Memory Areas (VMA), each of which covers a contiguous address range made up of pages with identical properties, e.g. access permissions. VMAs are represented by a record of type vm_area_struct and stored in a single list per process. In our case, the user-level solution creates a VMA with read and write permissions cleared and sets the same permissions in the page table entries. On access to one page in this VMA, the signal handler migrates the page to the current node and allows read and write access to it, splitting the VMA up into two new ones for the pages left and right of the affected one (if any) and one for the page itself. To reduce the number of VMAs, the Linux kernel tries to merge the new memory areas with their predecessors and successors to create one new area with a contiguous address range and uniform access permissions. During all of this, the VMA list needs to be updated, generating a lot of write accesses to the list. To avoid race conditions, the list is protected by a lock, which essentially serializes our code and slows down memory access. Consequently, we developed a new kernel extension to realize affinity-on-next-touch, which does not need write access to the list of virtual memory areas. Compared to the user-level implementation, this approach promises better performance. In addition to mprotect, move_pages does not scale well either. To explain this phenomenon, we need to look at how the Linux kernel handles demand paging. For each page frame, there exists a record page that contains information about the pages mapped onto this page frame. Every such record is added to an LRU page cache for fast lookup. This cache is made up of two lists: The first one is called active list and contains recently referenced pages, while the second list contains a set of inactive pages. If Linux needs to swap a page out from main memory, the kernel uses one from the inactive list. To accelerate
list operations and avoid locking, each list operation is delayed and temporarily buffered. Each core has got its own such buffer and exclusive access to it. To migrate a page, move_pages needs to remove the page from the page cache. Afterwards, the system call allocates a new page frame on a specific node, copies the content of the page to the new page frame and puts a new page record into the page cache. When doing so, it assumes that the LRU lists are up-to-date. Therefore, move_pages executes the buffered list operations and sends a message to all other cores to make them drain their buffers as well. This interrupts the computation on all cores and decreases the scalability. Disabling the demand paging algorithm does not solve this problem because the page cache is also used in other parts of the kernel. Therefore, we developed a kernel extension, which minimizes the need to drain the buffers on all cores.
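The splitting of VMAs that makes mprotect expensive can be observed from user space; the following toy program (illustrative only) protects one page in the middle of a large anonymous mapping and counts the entries of /proc/self/maps before and after, which typically grow by two because one VMA is split into three:

    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static int count_vmas(void)
    {
        FILE *f = fopen("/proc/self/maps", "r");
        int lines = 0, c;
        while ((c = fgetc(f)) != EOF)
            if (c == '\n')
                lines++;
        fclose(f);
        return lines;
    }

    int main(void)
    {
        long ps = sysconf(_SC_PAGESIZE);
        char *area = mmap(NULL, 1024 * ps, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        printf("VMAs before: %d\n", count_vmas());
        mprotect(area + 512 * ps, ps, PROT_NONE);   /* splits one VMA into three */
        printf("VMAs after:  %d\n", count_vmas());
        return 0;
    }

In the user-level next-touch implementation every migrated page causes such a split (and a later merge), each of which takes the per-process lock on the VMA list.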
3
An Extension to the Linux Kernel
Like in Solaris, our kernel extension can be triggered via the system call madvise with the new parameter MADV_ACCESS_LWP. Similar to the user-level solution, madvise protects the memory area against read and write accesses, but it changes the permissions only in the page table entries and does not update the corresponding vm_area_struct. Therefore, madvise does not need write access to the list of memory areas. Now, the permissions stored in vm_area_struct differ from the permissions set on a per-page basis. An attempt to access a page inside this VMA triggers an interrupt, which is handled by the Linux kernel. If it detects that the page, which the thread tried to access, uses affinity-on-next-touch, the Linux kernel reads the original permissions from vm_area_struct, restores them in the page tables and migrates the page to the current node. To store the information that a page is using affinity-on-next-touch, madvise sets one bit inside the page record, which can be found without traversing a linear list or a similar data structure. Depending on the memory model of the Linux kernel, only some simple shift operations are necessary. The page fault handler clears the affinity-on-next-touch bit with an atomic test-and-clear operation and checks if the bit was set. If it was, the page fault handler restores the original permissions and migrates the page to the current node. To accelerate the page migration process, we make draining the list operation buffers on all cores necessary less often. Before returning from madvise, the kernel drains the buffer on the current core. If another core has buffered list operations for a page which is to be migrated, the migration fails. If this happens, the buffers on all cores are drained and the page migration is restarted. The probability that such a second attempt on migration is necessary is quite small. Our benchmark results, shown in the 4th column of Tab. 1, prove our assumptions and show that the kernel-level solution is more than twelve-fold faster than the user-level solution. By using the kernel-level solution, the average time for migrating one page is 2.8 µs, while the user-level solution needs 39.4 µs.
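From the application's point of view, triggering the extension could look like the following sketch (MADV_ACCESS_LWP is not part of the mainline kernel headers, so a value is assumed here purely for illustration; in practice it comes from the patched kernel's headers):

    #include <stddef.h>
    #include <sys/mman.h>
    #include <omp.h>

    #ifndef MADV_ACCESS_LWP
    #define MADV_ACCESS_LWP 7    /* assumed value, following the Solaris-style advice */
    #endif

    void redistribute_on_next_touch(double *data, size_t bytes)
    {
        /* mark the region: each page will migrate to the node that touches it next */
        madvise(data, bytes, MADV_ACCESS_LWP);

        #pragma omp parallel for
        for (long i = 0; i < (long)(bytes / sizeof(double)); i++)
            data[i] += 0.0;      /* the first touch after madvise pulls the page local */
    }

A real application would of course combine the madvise call with the computation loop itself instead of a dummy touch loop.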
4
Performance Evaluation
4.1
The stream Benchmark
To evaluate the achievable memory bandwidth of our kernel solution, we ran the OpenMP version of McCalpin's popular stream benchmark (http://www.cs.virginia.edu/stream/) [9], modified to trigger page migration via next-touch, on the devon system described in Sect. 2. Fig. 2 shows the measured results for different page placement strategies. If the arrays are initialized by the master thread and first touch is used, all data are located on the master's node. Access to this node's memory becomes a bottleneck, constraining the achievable bandwidth (a). If each thread initializes the chunks of the arrays which it uses later during the computation, the benchmark achieves peak bandwidth (b). (c) - (f) show the memory bandwidth when using next-touch. The first iteration over the array includes the cost of page migration, if necessary, and of restoring the access permissions. Therefore, the memory bandwidth is much lower for (c) and (e). Further iterations achieve the peak bandwidth because the pages are ideally located then.
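The difference between the placements (a) and (b) lies only in which thread initializes the arrays; a simplified sketch of the two variants (not the benchmark code itself) is:

    #include <omp.h>

    #define STREAM_ARRAY_SIZE (1L << 26)           /* 64 M doubles = 512 MB */
    static double a[STREAM_ARRAY_SIZE];

    /* (a) first touch by master: all pages end up on the master's node */
    void init_by_master(void)
    {
        for (long i = 0; i < STREAM_ARRAY_SIZE; i++)
            a[i] = 1.0;
    }

    /* (b) first touch by all: each thread touches, and thereby places,
       the chunk of the array it will later compute on */
    void init_by_all(void)
    {
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < STREAM_ARRAY_SIZE; i++)
            a[i] = 1.0;
    }

For (b) the same static loop schedule has to be used in the compute kernels, so that each thread keeps working on the pages it placed during initialization.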
Fig. 2. Memory bandwidth measured with a modified stream benchmark (memory bandwidth [GB/s] for 1, 2, 4, and 8 threads under the placements (a) first touch by master, (b) first touch by all, (c)/(d) next touch user-level in the first/best iteration, (e)/(f) next touch kernel-level in the first/best iteration)
A closer look at the results of the first iteration shows that the kernel-level solution reaches its peak bandwidth of 2.19 GB/s by using just one thread, because the pages are already located on the correct (the only) node. Therefore, no page
migration is necessary and the kernel just needs to restore the original access permissions of the pages. When using more threads, the kernel-level approach reaches a memory bandwidth of up to 1.59 GB/s. This is more than seventeen-fold faster than the peak memory bandwidth of the user-level solution (91 MB/s). 4.2
The Jacobi Solver
We studied the overhead of adaptive data distribution by running a simple Jacobi solver, parallelized with OpenMP, on devon. The Euclidean norm is used to detect that the solver has found a suitable solution vector, which happens after 492 iterations using a 10000x10000 matrix. Fig. 3 shows the time to solution for different page placement strategies. In (a)–(c) the data is initialized by the master thread. As for stream, this constrains the scalability when using first touch (a); affinity-on-next-touch provides a significant improvement here. The difference between the user-level and kernel-level solution is explicable by the additional time which the user-level solution needs to update the list of virtual memory areas. For the ideal solution (d) the memory access pattern of the solver has been analyzed and the matrix has been ideally distributed during the initialization phase. The difference between the ideal solution and the kernel-level implementation of affinity-on-next-touch is quite small, because the time for doing the calculations dominates the additional overhead of page migration.

Fig. 3. Time to solution with the Jacobi solver (time [s] over the number of threads for (a) first touch by master, (b) next touch user-level, (c) next touch kernel-level, (d) ideal data distribution)

4.3
An Adaptive PDE Solver
The stream benchmark and the Jacobi solver have static memory access patterns and are easy to optimize for a NUMA system. Programs like these are not the typical domain for an adaptive page placement strategy like affinity-on-next-touch; that domain is rather the large number of applications with a dynamic memory access pattern. Examples of this class of applications are PDE solvers using adaptive mesh refinement (AMR). In this case the work and data need to be dynamically repartitioned at runtime to get good parallel performance. Nordén et al. [10] have presented a parallel structured adaptive mesh refinement (SAMR) PDE solver which has been parallelized with OpenMP. A specific aspect of this particular program is that it is already optimized for NUMA architectures and dynamically triggers the affinity-on-next-touch mechanism. In addition to devon, we also ran the SAMR PDE solver on a quad-socket, dual-core Opteron 875 (2.2 GHz) system with 1 MB L2 cache per core and 16 GB of RAM running CentOS 5.2 (kernel version 2.6.18). The code was compiled with the Intel Fortran compiler 11.0. Tab. 2 shows the time to solution with various experimental setups. Data redistribution happens more often with a higher number of NUMA nodes.

Table 2. Time to solution with the SAMR PDE solver (20000 iterations, 8 threads)

                               First Touch by all   Next Touch (user-level)   Next Touch (kernel-level)
4-socket Opteron 875 system    5318.646 s           4514.408 s                4489.375 s
2-socket Opteron 2376 system   3840.465 s           3904.923 s                3777.936 s

5
Related Work
Currently, the Linux scheduler does not take NUMA locality into account when migrating tasks between NUMA nodes. In [11] L. T. Schermerhorn presents an extension to the Linux kernel which permits a task to pull pages close to itself after having been migrated to a new node. To realize this, he uses a mechanism similar to affinity-on-next-touch called migration on fault. His kernel extension unmaps the eligible pages, making the virtual-to-physical address translation fail on access by a thread. After having entered the fault path, the kernel migrates the page to the current node and remaps it. The ideas behind affinity-on-next-touch and migration on fault are nearly the same. Schermerhorn's extension uses the fault path to migrate the pages, while the kernel extension proposed in this paper uses an access violation to detect the necessity for a page migration. The performance differences between both solutions are negligible on small NUMA systems. On larger systems, we have to evaluate the scalability of both extensions. This needs to be examined further. Shortly after this paper had been accepted for publication, B. Goglin and N. Furmento published their work about an extension to the Linux kernel which implements affinity-on-next-touch, too [12]. A quick analysis of their patch showed our optimization of move_pages reaching a higher bandwidth. In the future, we need to further compare both kernel extensions.
6
Outlook and Conclusions
Affinity-on-next-touch is a promising page placement strategy for NUMA machines. However, this paper shows that without changes to the kernel it is not possible to implement it on Linux systems with satisfactory performance. Additionally, the bad scalability of mprotect is a problem for any distributed shared memory system using this system call to intercept memory accesses. Our kernel extension provides significantly better performance than the user-level implementation, as shown by benchmark results. The mprotect bottleneck was removed and page migration accelerated. In the near future, we want to evaluate the behavior of our kernel extension on systems with a higher number of cores and compare the results with alternative extensions [11,12]. We'd like to thank the authors of [7,10] for their support.
References
1. Dagum, L., Menon, R.: OpenMP: An Industry-Standard API for Shared-Memory Programming. IEEE Computational Science & Engineering 5(1), 46–55 (1998)
2. Dormanns, M., Lankes, S., Bemmerl, T., Bolz, G., Pfeiffle, E.: Parallelization of an Airline Flight-Scheduling Module on a SCI-Coupled NUMA Shared Memory Cluster. In: High Performance Computing Systems and Applications (HPCS), Kingston, Canada (June 1999)
3. Dormanns, M.: Shared-Memory Parallelization of the GROMOS 1996 Molecular Dynamics Code. In: Hellwagner, H., Reinefeld, A. (eds.) SCI: Scalable Coherent Interface. Springer, Heidelberg (1999)
4. Noordergraaf, L., van der Pas, R.: Performance Experiences on Sun's WildFire Prototype. In: Proceedings of the 1999 ACM/IEEE Conference on Supercomputing, Portland, Oregon, USA (November 1999)
5. Bircsak, J., Craig, P., Crowell, R., Cvetanovic, Z., Harris, J., Nelson, C.A., Offner, C.D.: Extending OpenMP for NUMA machines. In: Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, Dallas, Texas, USA (November 2000)
6. Löf, H., Holmgren, S.: Affinity-on-next-touch: Increasing the Performance of an Industrial PDE Solver on a cc-NUMA System. In: Proceedings of the 19th Annual International Conference on Supercomputing, Cambridge, Massachusetts, USA, pp. 387–392 (June 2005)
7. Terboven, C., an Mey, D., Schmidl, D., Jin, H., Reichstein, T.: Data and Thread Affinity in OpenMP Programs. In: Proceedings of the 2008 Workshop on Memory Access on Future Processors: A Solved Problem?, ACM International Conference on Computing Frontiers, Ischia, Italy, pp. 377–384 (May 2008)
8. Love, R.: Linux Kernel Development, 2nd edn. Novell Press (2005)
9. McCalpin, J.D.: Memory Bandwidth and Machine Balance in Current High Performance Computers. In: IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, pp. 19–25 (December 1995)
10. Nordén, M., Löf, H., Rantakokko, J., Holmgren, S.: Geographical Locality and Dynamic Data Migration for OpenMP Implementations of Adaptive PDE Solvers. In: Proceedings of the 2nd International Workshop on OpenMP (IWOMP), Reims, France, pp. 382–393 (June 2006)
11. Schermerhorn, L.T.: Automatic page migration for Linux (a matter of hygiene). In: linux.conf.au 2007 (2007)
12. Goglin, B., Furmento, N.: Enabling High-Performance Memory Migration for Multithreaded Applications on Linux. In: Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2009), Workshop on Multithreaded Architectures and Applications (MTAAP 2009), Rome, Italy (2009)
Multi–CMP Module System Based on a Look-Ahead Configured Global Network
Eryk Laskowski1, Łukasz Maśko1, and Marek Tudruj1,2
1 Institute of Computer Science PAS, 01-237 Warsaw, Ordona 21, Poland
2 Polish-Japanese Institute of Information Technology, ul. Koszykowa 86, 02-008 Warsaw, Poland
{laskowsk,masko,tudruj}@ipipan.waw.pl
Abstract. The impressive progress of Systems on Chip (SoC) design enables a revival of efficient massively parallel systems based on many Chip Multiprocessor (CMP) modules interconnected by global networks. The paper presents methods for the optimized program execution control for such modular CMP systems. At the CMP module level, communication through shared memory is applied, improved by a novel efficient group communication mechanism (reads on the fly). The inter–module global communication is implemented as a message passing between module memories placed in a shared address space. A two–phase structuring algorithm is described for programs represented as macro data-flow graphs. In the first phase, program tasks inside the CMP modules are scheduled, using an algorithm based on the notion of moldable tasks. In the next phase, the moldable task graph is structured for optimized communication execution in the global interconnection network according to the look-ahead link connection setting paradigm. Simulation experiments evaluate the efficiency and other properties of the proposed architectural solutions.
1
Introduction
Systems on Chip (SoC) and Network on Chip (NoC) technologies [6], which are different forms of Chip Multiprocessors (CMP), are based on many processor cores implemented together on single chips. In recent years, they have given strong arguments for the fulfillment of the designers' and researchers' dream of massive parallelism. Although the most mature CMP systems such as Cell BE and Niagara are still based on a relatively small number of cores (8), new, much more developed ones appear, such as the 128–core Nvidia GeForce 8800GTX [2] and the 188–core Cisco Silicon Packet Processor [1]. Challenges of computing power maximization and technology problems of heat dissipation and communication in multicore systems currently stimulate research in the domain of efficient architectural paradigms and new parallel programming styles. A processor-centric design style, commonly used at an early stage of the CMP technique, has now been transformed into an interconnect-centric design. System design experience accumulated so far demonstrates that cluster-based systems, supported by adequate intra–cluster communication solutions,
can strongly increase execution efficiency of parallel programs and scalability. The highly modular properties of CMPs direct structural research towards hierarchical network structures. Instead of a large connection switching network connecting all processor cores, local core sub-networks can be arranged, which are then connected by a central global network. Following this new idea, efficient massively parallel systems can be built today. An early implementation of such a concept is the RoadRunner system [3]. This paper concerns optimization of global communication in a system with multiple CMP modules interconnected by a global network. We consider a global network based on circuit switching and message passing of cacheable blocks of data. We assume that both intra– and inter–CMP module communication is located at the application program level. We assume redundant communication resources in the global communication network (e.g. multiple crossbar switches or multiple communication links), so that inter–CMP module connections can be look-ahead configured to provide time transparency of connection setting [8]. This factor is important for communication at the application program level with very small control overheads. The paper presents algorithms for global communication structuring based on heuristic principles, which reduce application program execution time. The algorithms are verified experimentally using a simulator, which executes application program graphs with structured global communication and evaluates their parallel efficiency.
2 The System Architecture
In the paper, we consider parallel multi-CMP systems, whose general structure is presented in Fig. 1. The basic system elements are CMP modules interconnected by a global network. A single CMP module, Fig. 2, consists of a number of processors and shared cache memory modules connected via a local network. A CMP module is implemented as a single integrated circuit. The module contains a number of processor cores (each with its local L1 data cache) and a number of shared L2 data cache banks interconnected through a local network. Additional instruction caches are provided, but they are outside the scope of this paper, so all discussed caches are data caches. Dynamic core clusters can be created around L2 banks using local data exchange networks (L2 buses). All L2 cache
Fig. 1. General system structure
Fig. 2. The structure of a CMP module
banks of the CMP module are connected to the local fragment of the distributed memory shared by all CMP modules in the system. Tasks in programs are built according to a cache-controlled macro data-flow paradigm: all data have to be pre-fetched into the processor's L1 data cache before a task begins, and L1 cache reloading is disabled during task execution. Operations on data include: data pre-fetch from memory (L2 cache) to a core's L1, write from L1 to memory or L2, read on the fly (similar to cache injection) from an L2 bank bus to many L1s in parallel, and core switching between L2 buses (clusters). Current task results are sent to the L2 cache memory module only after the task completes. This program execution paradigm completely prevents L1 data cache thrashing. A further new feature of the data cache organization is the multi-ported structure of the L1 data caches (multiple banks are used, which can be connected to many L2 buses). It enables parallel loading of arguments of subsequent numerical operations, and many communications (or reads) on the fly can be performed at a time for a processor (core). The system provides a new mechanism for L1 and L2 data cache synchronization in the context of processor switching and reads on the fly. If a processor is switched from one L2 module to another and is to write some of its data from L1 to this new L2 module (for instance, to enable other processors to read these data through reads on the fly, or simply to perform a cache block flush), a respective new line in the target L2 data cache must be provided. This operation is not performed in the standard manner (by transmitting the proper data from system memory); instead, just before the L2 write operation, a dummy empty line in L2 is generated together with a mask which controls, in terms of L1 line block validity, which data will be written by the considered data transfer. This line is then filled with new data transferred via an L1-L2 bus. When desired, the operating memory will be updated only with the newly validated parts of the L2 lines.
Fig. 3. The system based on multiple: a) crossbar switches, b) module communication links
Similar actions are performed in the L1 caches when data are read into L1 under new addresses to comply with the assumed single-assignment principle, which eliminates consistency problems for multiple copies of modified shared data (data read for subsequent modification are stored under new addresses). This implies that new dummy lines in the L1 data caches are provided with similar control fields. Only on L1-to-L2 flushing, or on reads on the fly on the L2 buses, will the corresponding L2 dummy lines be generated (lazy synchronization is used). More details on the proposed architecture of the CMP module and the system execution control, but with single-level data caches, can be found in [7]. The global interconnection network allows every processor to read or write data in any shared memory module present in the system. This network provides standard data exchange between CMP modules, but at the cost of higher data access latency. The global network can be implemented as an on-board network placed on a backplane with sockets for CMP modules. Due to technology constraints, the latency of the global interconnection network will be at least tenfold (or even several dozen times) higher than the latency of the local networks. Thus, proper optimization of global communication can play a significant role in overall parallel execution efficiency. In the paper, we consider a global network based on circuit switching and message passing of cacheable blocks of data. In such a network, before a message can be sent, a connection between the sender and the receiver has to be created. We assume that both intra- and inter-CMP-module communication is located at
the application program level. It means that data transfers are processed neither by the operating system nor by any communication library, so the transfers have very small start-up times and control overheads. In the simplest case, we can assume dynamic on-request connection reconfiguration, where each communication in an application program generates a connection request directed to the global network controller. However, we can provide some redundant communication resources in the global network (see Fig. 3) in order to apply the look-ahead dynamic connection reconfiguration paradigm [8]. It is based on anticipated connection setting in the redundant network resources to eliminate the connection creation time overhead. The redundant resources considered in this paper are multiple global link connection switches (crossbar switches, multistage connection networks) and CMP module communication link sets. With look-ahead dynamic reconfigurability, an application program is divided into communication-disjoint sections. Sections are executed using connections which were created in advance in some connection switching devices. The connections do not change during execution of the current section. Special inter-module synchronization hardware has to be included in the system to enable execution of many program sections in parallel with the link connection reconfiguration. The application program has to be divided into sections at compile time, as the result of an analysis of the program communication.
3 The Algorithm
In the paper we propose a program structuring algorithm covering global communication in the multiple-CMP system. The communication structuring assumes look-ahead dynamic inter-module connection reconfiguration. The proposed communication structuring method is best suited for circuit-switched interconnection networks with application-program-level communication, where the link creation overhead cannot be neglected. Such interconnects include crossbar switches, multistage networks and multibus structures. A two-phase approach is used, which comprises, on one side, program graph scheduling onto CMP modules and CMP external communication links and, on the other side, program graph partitioning into sections based on the look-ahead created connections between CMP modules. The outcome of the scheduling phase of the algorithm is the scheduled program graph, assigned to computational and communication resources in CMP modules. Global communication is scheduled to the communication link(s) of the CMP modules, but the partitioning into sections and the mapping of global communication to redundant communication resources are not yet defined. The scheduling phase applies the moldable task concept [5]. Moldable tasks (MT) are parallel tasks for which the number of assigned processors can be variable, but is determined before task execution and then does not change. The scheduling phase consists of three steps:
Fig. 4. The graph partitioning heuristics
1) building an MT data-flow graph, based on the program macro data-flow graph,
2) defining the best internal structure of each MT node for each number of processors (a schedule of component nodes to logical processors inside CMP modules of different sizes),
3) defining an assignment of resources to MTs (allotment) and scheduling the MT graph in the architecture with simplified inter-CMP connections (a fully connected network).
In the partitioning phase, the scheduled program graph is divided into sections for the look-ahead execution of global communication in the assumed environment. Program graph partitioning into sections is defined using the Communication Activation Graph (CAG). The CAG is extracted from the scheduled program graph and constitutes the basic graph structure for the partitioning heuristics. The program graph partitioning heuristics used to find program sections is outlined in Fig. 4. Besides the partition of the graph, the heuristics assigns a communication resource (e.g. a crossbar switch) to each section. The algorithm starts with an initial partition, which consists of sections built of a single communication each and all assigned to the same crossbar switch. In each step, a vertex of the CAG is selected, and then the algorithm tries to include this vertex in a union of existing sections determined by the edges of the current vertex. The heuristics tries to find such a union of sections which does not break the rules of graph partitioning. The union which gives the shortest program execution time is selected. See [4] for a detailed description of the partitioning heuristics. The program execution time is obtained by simulated execution of the partitioned graph in the presented system. The functioning of the global network, the Network Interface Controller and the Global Network Controller is modeled as subgraphs executed on virtual additional processors added to the initial application graph. Weights in the graph nodes correspond to latencies of the respective control actions, such as crossbar switch reconfiguration, bus latency, connection setting and activation of program sections. These weights are expressed as control parameters of the simulation.
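The greedy structure of the partitioning step described above can be illustrated with a minimal sketch. All data types, names and the cost function below are hypothetical stand-ins for the simulator-driven heuristics of [4], not the authors' implementation; in particular, simulatedTime() is a trivial placeholder for the simulated execution of the partitioned graph, and validUnion() stands for the partitioning rules.

    #include <cstdio>
    #include <set>
    #include <vector>

    struct Section {
        std::set<int> vertices;   // CAG vertices (global communications) in this section
        int resource;             // assigned communication resource (e.g. crossbar switch)
    };

    // Placeholder cost: counts non-empty sections instead of simulating execution.
    static double simulatedTime(const std::vector<Section>& partition) {
        int active = 0;
        for (const Section& s : partition)
            if (!s.vertices.empty()) ++active;
        return static_cast<double>(active);
    }

    // Placeholder for the rules of graph partitioning (always accepts here).
    static bool validUnion(const Section&) { return true; }

    // Greedy heuristic: start from single-communication sections, then for each CAG
    // vertex try to merge its section with the sections of its neighbours and keep
    // the union that yields the shortest (simulated) execution time.
    std::vector<Section> partitionCAG(const std::vector<std::vector<int> >& neighbours) {
        const int n = static_cast<int>(neighbours.size());
        std::vector<Section> sections(n);
        std::vector<int> sectionOf(n);
        for (int v = 0; v < n; ++v) {
            sections[v].vertices.insert(v);
            sections[v].resource = 0;          // initially all on the same crossbar
            sectionOf[v] = v;
        }
        for (int v = 0; v < n; ++v) {
            double best = simulatedTime(sections);
            int bestSection = -1;
            for (int u : neighbours[v]) {
                if (sectionOf[u] == sectionOf[v]) continue;
                std::vector<Section> trial = sections;
                Section& target = trial[sectionOf[u]];
                Section& source = trial[sectionOf[v]];
                target.vertices.insert(source.vertices.begin(), source.vertices.end());
                source.vertices.clear();
                if (!validUnion(target)) continue;
                const double t = simulatedTime(trial);
                if (t < best) { best = t; bestSection = sectionOf[u]; }
            }
            if (bestSection >= 0) {            // accept the best improving union
                Section& target = sections[bestSection];
                Section& source = sections[sectionOf[v]];
                target.vertices.insert(source.vertices.begin(), source.vertices.end());
                source.vertices.clear();
                for (int w : target.vertices) sectionOf[w] = bestSection;
            }
        }
        return sections;
    }

    int main() {
        // Four global communications forming a chain in the CAG (hypothetical input).
        std::vector<std::vector<int> > cag = { {1}, {0, 2}, {1, 3}, {2} };
        std::vector<Section> parts = partitionCAG(cag);
        int count = 0;
        for (const Section& s : parts) if (!s.vertices.empty()) ++count;
        std::printf("sections after partitioning: %d\n", count);
        return 0;
    }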
4 Experimental Results
The presented program optimization algorithms have been verified experimentally and used for the evaluation of the execution efficiency of programs under different program execution control strategies and system parameters. The results presented in the paper were obtained for four families of synthetic sample program
graphs (G-A2, G-A5, G-B2, G-B5), which were randomly generated for the same set of parameters. These graphs have a similar granularity to fine-grain numerical application programs (e.g. Strassen matrix multiplication with two recursion levels). The structure of the experimental graphs is shown in Fig. 5. The assumed parameters for the program graph generator are the following: the number of levels – 10, the number of subgraphs – 8, the width of a level in a subgraph – 4, the total number of nodes in a graph – 320, the node weight – 100–200, the communication edge weight – 20–30 (G-A2, G-B2) or 50–75 (G-A5, G-B5), the node input degree – 1–2 (G-A2, G-A5) or 3–4 (G-B2, G-B5). Up to 100 additional inter-subgraph data transfers have been added to each graph. The initial program graphs were first processed by the task scheduling algorithm based on the moldable task approach for execution in the assumed modular system with 32 processors, organized to contain 1, 2, 4 or 8 CMP modules, or 32 single-processor nodes, interconnected by a global network. The programs were executed using the look-ahead paradigm in the following system architectural models: a system with redundant connection switches (R2 – 2 crossbar switches), a system with partitioned module communication link sets (L2 – 2 links in each CMP module/node), and a system with a single connection switch with on-request reconfiguration (as a reference system). Hardware synchronization was used in the system with multiple crossbar switches. No synchronization was required for model L2. The system parameters used during the experiments were: tR – reconfiguration time of a single connection, range 1–400; tV – section activation time overhead, range 2–256. The latency of the global network is assumed to be 2, 8 or 16 times higher than that of the intra-module bus (the X2, X8, X16 global network speed coefficients, respectively). Program execution speedup (against sequential execution) for different numbers of CMP modules and different global network latencies, for the L2 architecture, is shown in Fig. 6a – Fig. 6b. The speedup degrades with an increasing number of modules or latency of the global network. This is due to the increasing volume of global transfers, which are substantially slower than intra-module communication. Thus, in most cases, for optimal program execution the number of CMP modules used for the tested graphs should be set as low as possible (the lower limit depends on the efficiency of the intra-module communication subsystem). An exception is the G-B5 graph, with the biggest volume of global communication,
Fig. 5. Experimental graph structure
Fig. 6. Speedup for L2 architecture as a function of the number of CMP modules: a) G-B5 graph, b) G-B2 graph
for which execution on 8 CMP modules with a slow global network gives a better speedup than on 2 or 4 modules (Fig. 6a). This is because the denser global communication in the G-B5 graph is distributed among a bigger number of CMP module external links, thus reducing the possible communication contention on the links of 8 modules compared with the system with 2 or 4 modules. The number of processor cores in a CMP module is bounded by the current level of VLSI technology and by the kind of the assumed CMP module internal data network (a shared memory bus in the assumed environment). The number of modules used for execution of a parallel program will be governed by a tradeoff between the parallel speedup and the limitations of the CMP module technology. The evaluation of different architectures of the communication and reconfiguration control subsystem is shown in Fig. 7 – Fig. 8. The system based on multiple crossbar switches (R2) is able to reduce the reconfiguration time overhead better than the L2 architecture. This is confirmed in Fig. 8a by the best reduction
Fig. 7. Average reduction of reconfiguration overhead for L2 architecture as a function of the number of CMP modules: a) tR maximal, tV minimal, b) both tR and tV maximal
Fig. 8. Average reduction of reconfiguration overhead for R2 architecture as a function of the number of CMP modules: a) tR maximal, tV minimal, b) both tR and tV maximal
of the reconfiguration overhead for the X2 global network latency, when the amount of global communication is the largest. On the other hand, the worse speedup of the R2 architecture for large values of the section activation time overhead (the tV parameter) indicates that a system with multiple connection switches employs much more control communication (e.g. barrier synchronization is applied at the end of a program section before switching processor links to another crossbar) than the L2 architecture. Thus, an efficient synchronization and section activation subsystem is essential for good performance of the R2 architectural model. The overall assessment of the profits coming from the look-ahead reconfiguration in global interconnection networks is shown in Fig. 9. The use of the look-ahead reconfiguration provides a 50% increase of program execution speedup, compared with execution of the same program using an on-request reconfigurable global network. This big increase is obtained when the reconfiguration efficiency is low, that is, when the classical approach does not give good results.
Fig. 9. Speedup increase against on-request reconfiguration as a function of the number of CMP modules: a) L2, both tR and tV maximal, b) R2, tR maximal, tV minimal
5 Conclusions
In the paper, we have discussed the look-ahead dynamic connection reconfiguration in a parallel multi-processor system implemented as many CMP modules interconnected by a global network. The presented communication and reconfiguration control is supported by a program structuring algorithm based on program graph partitioning. Dynamic connection reconfiguration enables optimization of the communication infrastructure according to application program requirements. Experimental results have shown that systems based on multiple communication resources (link sets or crossbar switches) and optimized reconfiguration control through program graph partitioning into sections give better speedups than the classic on-request reconfiguration control, especially when the reconfiguration subsystem is not efficient enough or is not adjusted to the application communication requirements. To summarize, the lower the reconfiguration efficiency of the system, the better the results obtained from the application of the look-ahead dynamic connection reconfiguration. Thus, the look-ahead reconfiguration is especially useful for the execution of fine-grain parallel programs, which have time-critical reconfiguration requirements.
References
1. Asanovic, K., et al.: A View of the Parallel Computing Landscape. Communications of the ACM 52(10), 56–87 (2009)
2. Che, S., et al.: A Performance Study of General Purpose Applications on Graphics Processors. In: 1st Workshop on General Purpose Processing on Graphics Processing Units, Boston (October 2007)
3. Koch, K.: Roadrunner System Overview, Los Alamos National Laboratory, http://www.lanl.gov/orgs/hpc/roadrunner/rrinfo/RR%20webPDFs/RRSystemOversm.pdf
4. Laskowski, E., Tudruj, M.: Efficient Parallel Embedded Computing Through Look-Ahead Configured Dynamic Inter-Processor Connections. In: 5th Int. Symp. on Parallel Computing in Electrical Engineering, PARELEC 2006, Bialystok, Poland, September 2006, pp. 115–122. IEEE CS, Los Alamitos (2006)
5. Maśko, Ł., Dutot, P.-F., Mounié, G., Trystram, D., Tudruj, M.: Scheduling Moldable Tasks for Dynamic SMP Clusters in SoC Technology. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 879–887. Springer, Heidelberg (2006)
6. Owens, J.D., et al.: Research Challenges for On-Chip Interconnection Networks. In: IEEE MICRO, pp. 96–108 (September–October 2007)
7. Tudruj, M., Maśko, Ł.: Towards Massively Parallel Computations Based on Dynamic SMP Clusters with Communication on the Fly. In: 4th Int. Symp. on Parallel and Distrib. Computing, ISPDC 2005, Lille, France, July 2005, pp. 155–162. IEEE CS, Los Alamitos (2005)
8. Tudruj, M.: Look-Ahead Dynamic Reconfiguration of Link Connections in Multi-Processor Architectures. In: Parallel Computing 1995, Gent, pp. 539–546 (September 1995)
Empirical Analysis of Parallelism Overheads on CMPs
Ami Marowka
Department of Computer Science, Bar-Ilan University, Israel
[email protected]
Abstract. OpenMP and Intel Threading Building Blocks (TBB) are two parallel programming paradigms for multicore processors. They have a lot in common but were designed with different parallel execution models in mind. A comparison of the performance gains of these two paradigms depends to a great extent on the parallelization overheads of their parallel mechanisms. Parallel overheads are inevitable, and therefore understanding their potential costs can help developers to design more scalable applications. This paper presents a comparative study of OpenMP and TBB parallelization overheads. The study was conducted on a dual-core machine with two different compilers, the Intel compiler and Microsoft Visual Studio C++ 2008, and shows that the Intel compiler outperforms the Microsoft compiler. Nevertheless, the relative performance of TBB versus OpenMP depends mainly on the implementation of the parallel constructs in a specific compiler. Keywords: OpenMP, TBB, Benchmarks, Multicore, Parallel Overhead.
1 Introduction
Multi-core processing has become ubiquitous, from laptops to supercomputers, and everywhere in between. The appearance of multicore processors brings high-performance computing to the desktop and opens the door of mainstream computing to parallel computing. Unfortunately, writing parallel code is more complex than writing serial code. Parallel programming is cumbersome, error prone, and difficult [1,2,3,4,5,6,7,8]. OpenMP [9,10] and Intel Threading Building Blocks (TBB) [11,12] are two parallel programming paradigms for multicore processors. OpenMP and TBB have a lot in common but were designed for different parallel execution models. Both are shared-memory data-parallel programming models and are based on multithreaded programming to maximize the utilization of multicore processors. However, OpenMP does not free the programmer from most of the tedious issues of parallel programming. The programmer has much to understand, including: the relationship between the logical threads and the underlying physical processors and cores; how threads communicate and synchronize; how to
measure performance in a parallel environment; and the sources of load imbalance. The programmer must check for dependencies, deadlocks, conflicts, race conditions, and other issues related to parallel programming. On the other hand, the Intel Threading Building Blocks (TBB) is a new high-level library that hides some of the issues mentioned above from the programmer and automates the data decomposition and task scheduling in an efficient manner. As multi-core architectures continue to evolve, however, they will require developers to refine their threading techniques as a core aspect of their solutions rather than as merely a desirable feature. Overheads associated with operations like thread creation, synchronization and locking, threading granularity, scheduling, and process management will become more pronounced as time goes on, and the necessity of planning for parallel scalability will become more and more important. The contribution of this paper is twofold. First, we present a TBB micro-benchmarks suite called TBBench that was developed for benchmarking TBB parallel construct overheads. The design of the TBB micro-benchmarks follows the design methodology of the OpenMP EPCC micro-benchmarks [13,14], enabling an accurate comparison between the measured overheads of the two programming models. Second, we use the TBB and OpenMP micro-benchmarks to study the parallelization overheads on a dual-core machine with two different compilers, the Intel compiler and Microsoft Visual Studio C++ 2008. We report in detail the various results of this study. The rest of this paper is organized as follows. In Section 2, the design methodology of TBBench and its contents are presented. Section 3 presents an in-depth analysis of the running results of the OpenMP and TBB benchmarks, and Section 4 concludes the paper.
2 TBBench: A TBB Micro-Benchmarks Suite
TBBench is a suite of core benchmarks for measuring the overheads associated with the execution of TBB parallel constructs for synchronization, scheduling, and work-sharing. The design of TBBench follows the design concepts of the OpenMP EPCC micro-benchmarks suite. In this way, a more accurate comparison between the parallel constructs of the two programming models can be achieved. The approach used by the EPCC micro-benchmarks to measure the overhead associated with a parallel construct is to compare the running time of a region of code running in parallel on P cores (Tp) with the running time of the same region of code running sequentially (Ts). The calculated overhead is given by the difference Tp - Ts/p. For example, to measure the overhead associated with the OpenMP parallel for directive, the EPCC micro-benchmarks measure the time to run the following region of code:
    for (j = 0; j < innerreps; j++) {
    #pragma omp parallel for
        for (i = 0; i < nthreads; i++) {
            delay(delaylength);
        }
    }
and then subtract the reference time, i.e., the time it takes to run the same region of code on a single thread:

    for (j = 0; j < innerreps; j++) {
        for (i = 0; i < nthreads; i++) {
            delay(delaylength);
        }
    }
And finally, divide the measured time by the number of iterations innerreps. Measuring the equivalent TBB parallel_for is done by running the following region of code:

    for (j = 0; j < innerreps; j++) {
        parallel_for(blocked_range<int>(0, nthreads, 1), TBB_parallelforBody());
    }
where TBB_parallelforBody() is the following class:

    class TBB_parallelforBody {
    public:
        TBB_parallelforBody() {}
        //! main loop
        void operator()(const blocked_range<int>& range) const {
            for (int i = range.begin(); i != range.end(); ++i) {
                delay(delaylength);
            }
        }
    };
The TBB micro-benchmarks follow the EPCC micro-benchmarks in ensuring statistically stable and reproducible measurements: the delaylength of the reference routine is chosen to be of the same order of magnitude as the parallel construct under evaluation; the number of loop iterations innerreps is chosen to be larger than the clock resolution; and each running result reported here is an average of 20 different runs of 30 measurements each. The next example demonstrates how TBBench measures the overhead incurred by the TBB spin mutex. The main routine invokes the call

    TBB_lock<spin_mutex>();
where TBB_lock is the following template, which calls TBB parallel_for with the class parameter TBB_lockBody:

    template <class M> void TBB_lock() {
        M mutex;
        for (int k = 0; k <= OUTERREPS; k++) {
            parallel_for(blocked_range<int>(0, nthreads, 1), TBB_lockBody<M>(mutex));
        }
    }
The definition of TBB_lockBody is shown below; it contains repeated calls to the TBBlock.acquire and TBBlock.release pair:
    template <class M> class TBB_lockBody {
        M &mutex;
    public:
        TBB_lockBody(M &m) : mutex(m) {}
        //! main loop
        void operator()(const blocked_range<int>& range) const {
            typename M::scoped_lock TBBlock;
            for (int i = range.begin(); i != range.end(); ++i) {
                for (int j = 0; j < innerreps; j++) {
                    TBBlock.acquire(mutex);
                    delay(delaylength);
                    TBBlock.release();
                }
            }
        }
    };
The TBBench package contains routines for evaluating the overheads associated with the parallel_for and parallel reduction constructs, the synchronization mechanisms of the various kinds of mutexes supported by TBB, and core benchmarks for measuring the overheads incurred by different scheduling approaches and different chunk sizes.
3 Experimental Results
This section describes and analyzes the benchmarks performed for comparing the parallelization overheads of the main constructs of TBB and OpenMP by using EPCC and TBBench. EPCC and TBBench cannot be considered "black box" benchmarking tools. In other words, these tools depend on the setting and tuning of a few parameters that affect the accuracy and the reproducibility of the measurements. There are external parameters, such as the number of threads and the compiler used and its options (optimization level, runtime library type, etc.). There are internal parameters such as delaylength, innerreps and itersperthr (see Section 2). Moreover, the fluctuations of the measurements call for paying attention to achieving statistically stable and reproducible results. There are many reasons for getting different measurements from run to run, such as non-deterministic behavior of the scheduler; different allocations of variables in the memory space; the accuracy of the time measurements of the clock routines; unpredictable context switches; initialization overheads; and inconsistent cache memory behavior. Similar problems were reported in [13,14]. Analysis of the measurement results usually does not find unexpected behavior, but sometimes careful inspection of the data discovers unusual results. Some of these discoveries can be explained by finding the problems using profiling and performance tools that have access to the hardware counters. Unfortunately, there are cases that cannot be explained without knowing the specific implementation of the tested software, and only educated guesses can be given to explain the unexpected results. We report here the actual measurement results of the benchmarks, while presenting relative performance when comparing OpenMP and TBB.
Table 1. OpenMP and TBB synchronization and work-sharing overheads (in kilo-clock cycles (KCC)) with 2 threads on a Core 2 Duo T8100, 2.1 GHz machine, for the Intel and Microsoft compilers

Operation             MS Compiler   Intel Compiler
Parallel-for   OMP       28.9           1.62
               TBB        5.28          4.67
Reduction      OMP       28.1           1.42
               TBB        4.64          4.58
Atomic         OMP        0.13          0.11
               TBB        0.045         0.048
Lock-Unlock    OMP        1.47          0.25
Mutex          TBB       12.6           9.14
Queuing Mutex  TBB       12.8           4.84
Spin Mutex     TBB        0.24          0.105
Again, the actual results presented here can be reproduced only by setting up the system as explained below. Reasonable fluctuations can be expected because of the reasons mentioned above. The relative performance method used here for comparing TBB and OpenMP is considered very immune to measurement fluctuations. We tested the overheads of TBB and OpenMP using the EPCC micro-benchmarks 2.0 and TBBench on a dual-core machine. The testbed platform was based on an Intel Core 2 Duo processor (T8100) with a clock speed of 2.1 GHz; 2 x 32 KB L1 data cache memory and 2 x 32 KB L1 instruction cache memory; 1 x 3 MB L2 unified shared cache memory; and 1 GB DDR2 main memory. On the software side we used OpenMP 2.0 and TBB 2.0 under the Intel C++ compiler 11.0 and Microsoft Visual Studio C++ 2008 on top of the XP operating system. The compiler options used were the /O2 level of optimization and /MDd (Multithreaded Debug DLL). The internal parameters of EPCC and TBBench were set to delaylength = 500 (the granularity of the loop body reference), innerreps = 10000 (the number of loop iterations) and itersperthr = 128 (the number of iterations per thread for the scheduling measurements). The benchmarking results reported here are averages of 20 runs of 50 measurements each for statistical stability. A measurement that exceeded a threshold of three times the standard deviation was ignored.
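As an illustration of this filtering step, the following self-contained sketch computes the mean of a series of cycle measurements after discarding values that deviate from the mean by more than three standard deviations. The function name and the sample data are hypothetical; this is not code from EPCC or TBBench.

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Discard measurements farther than three standard deviations from the mean,
    // then return the mean of the remaining values (illustrative sketch only).
    double filteredMean(const std::vector<double>& samples) {
        double sum = 0.0, sq = 0.0;
        for (double x : samples) { sum += x; sq += x * x; }
        const double n = static_cast<double>(samples.size());
        const double mean = sum / n;
        const double stddev = std::sqrt(sq / n - mean * mean);

        double keptSum = 0.0;
        int kept = 0;
        for (double x : samples) {
            if (std::fabs(x - mean) <= 3.0 * stddev) { keptSum += x; ++kept; }
        }
        return kept > 0 ? keptSum / kept : mean;
    }

    int main() {
        std::vector<double> cycles(30, 100.0);  // 30 "normal" measurements
        cycles.push_back(520.0);                // one outlier, e.g. a context switch
        std::printf("filtered mean = %.2f cycles\n", filteredMean(cycles));
        return 0;
    }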
3.1 Synchronization and Work-Sharing Benchmarks
Table 1 presents the overhead measurements of OpenMP and TBB synchronization mechanisms (Lock-Unlock for OpenMP, and Mutex, Queuing-Mutex and SpinMutex for TBB) and work-sharing operations (Parallel-For and Reduction). The measurements shown are in Kilo-Clock-Cycles (KCC) and for the case of running the benchmarks with two threads. Figure 1 plots the relative overheads of OpenMP versus TBB for the work-sharing Parallel-For and Reduction and for
Fig. 1. Relative overheads of OpenMP vs. TBB for the Parallel-For, Reduction and ATOMIC operations for 2 threads on a Core 2 Duo, T8100, 2.1 GHz machine, for the Intel and Microsoft compilers
the ATOMIC operation. Figure 2 plots the relative overheads of the OpenMP Lock-Unlock operation versus the TBB mutexes. The ATOMIC operation is presented in Figure 1 because both TBB and OpenMP have ATOMIC operations that can be compared. The other mutual exclusion mechanisms of OpenMP and TBB do not have the same semantics, and therefore we chose to compare the Lock-Unlock function pair of OpenMP with the mutexes of TBB and to present this comparison in a separate figure. First, it can be observed from Table 1 (and Table 2) that the Intel compiler outperforms the Microsoft compiler for all the overheads incurred by OpenMP (up to 20 times better in the case of the Reduction operation). On the other hand, TBB exhibits similar overheads for the two compilers. This observation demonstrates the impact of different factors on the performance, such as the OpenMP translation strategy used by the compiler, the characteristics of the runtime library routines and the way they are used, and the way the compiler optimizes code. The conclusion is that the implementation of OpenMP in the Microsoft compiler can be improved drastically. Moreover, Figure 1 shows that the TBB Parallel-For, Reduction and ATOMIC operations incur up to six times less overhead compared to OpenMP with the Microsoft compiler. On the other hand, the Parallel-For and Reduction operations of OpenMP incur less overhead (up to three times) compared to TBB with the Intel compiler. The reason for this phenomenon lies in the difference between the OpenMP and TBB paradigms for extending an existing sequential language to handle parallelism. OpenMP is implemented by using an internal multithreading runtime library and thus is compiler-implementation dependent, while TBB is an external C++ template library and thus compiler independent. As we saw above, the Intel compiler exhibits a very efficient implementation of OpenMP that even outperforms the efficient design of TBB. However, the TBB ATOMIC operation
Fig. 2. Relative synchronization overheads of OpenMP (Lock-Unlock) vs. TBB (Mutex, Queuing Mutex and Spin Mutex) for 2 threads on a Core 2 Duo, T8100, 2.1 GHz machine, for the Intel and Microsoft compilers
achieves about half the overhead of OpenMP with the Intel compiler, suggesting that the OpenMP implementation does not use the low-level atomic mechanisms supported by the underlying hardware. Simulations show that, as the number of worker threads is increased, atomic operations can become a significant source of performance degradation when a relatively large number of tasks are created [15]. Figure 2 plots the relative overheads of the TBB mutual exclusion operations (mutexes) versus the OpenMP Lock-Unlock synchronization operation pair. A mutex in TBB is a global variable that multiple tasks can access. Protecting a mutex is done by locking mechanisms similar to those of OpenMP. Spin mutex causes a task to spin in user space while it is waiting; Mutex causes a task to sleep. For short waits, spinning in user space is fastest, because putting a task to sleep takes cycles. Queuing mutex, like Spin mutex, causes a task to spin, but each task gets its turn based on a First-In First-Out (FIFO) policy. The principal observation from these results is that Mutex and Queuing mutex incur substantial overheads compared to the Lock-Unlock pair of OpenMP. This appears to be largely due to the cost of managing these mutexes, compared to the Lock-Unlock operations, which use only calls to low-level synchronization mechanisms. The Spin mutex presents the lowest overheads, as expected, compared to all other TBB mutexes and the OpenMP Lock-Unlock functions on both compilers. This is most likely due to its simple implementation, which makes Spin mutex very fast in lightly contended situations such as in our case, where the benchmarks are executed with only two threads. Therefore, we omit the discussion of scalability with respect to the number of cores in this article, because such an analysis is not meaningful when the machine has only two cores.
Table 2. OpenMP and TBB scheduling overheads (in kilo-clock cycles (KCC)) with 2 threads on a Core 2 Duo T8100, 2.1 GHz machine, for the Intel and Microsoft compilers

Intel Compiler (chunk sizes 1, 2, 4, 8, 16, 32, 64, 128; for TBB additionally SIMPLE and AUTO):
  TBB          1.33  1.55  1.11  1.15  1.14  1.15  1.04  1.17  1.03  1.15
  OMP Static   1.79  2.75  3.50  2.18  1.84  1.86  1.86  1.89  1.91
  OMP Dynamic  21.8  13.1  7.22  4.67  3.39  2.75  2.39  2.31
  OMP Guided   6.69  5.86  5.35  4.87  5.56  4.27  3.33  3.11

Microsoft Compiler (chunk sizes 1, 2, 4, 8, 16, 32, 64, 128; for TBB additionally SIMPLE and AUTO):
  TBB          1.40  1.38  1.41  1.37  2.00  2.59  1.44  1.48  1.33  1.45
  OMP Static   8.29  8.72  8.52  8.38  8.35  8.43  8.37  8.78  8.42
  OMP Dynamic  41.4  25.1  17.1  13.2  10.2  9.21  8.74  8.64
  OMP Guided   11.1  10.6  10.4  10.5  8.84  8.64  8.74  8.88
3.2 Scheduling Benchmarks
Overheads for the different loop scheduling policies are given in Table 2 and Figure 3. Table 2 presents the OpenMP and TBB scheduling overheads measured by running the EPCC and TBBench micro-benchmarks with two threads and two compilers (Intel and Microsoft) for chunk sizes of 1, 2, 4, 8, 16, 32, 64 and 128. In the case of TBB, the results for the SIMPLE and AUTO options are also depicted. SIMPLE refers to the TBB simple partitioner option, which recursively splits a range until it is no longer divisible, and AUTO refers to the TBB auto partitioner option, which guides splitting decisions based on the work-stealing behavior of the task scheduler. All the measurements shown are in kilo-clock cycles (KCC). Note the OpenMP performance gain of a Static schedule, and the penalties incurred by the Dynamic and Guided schedules, where loops must pull chunks of work at run time. Therefore, Figure 3 plots the relative performance of the OpenMP Static schedule (the best schedule of OpenMP) versus TBB. First, it can be observed again from Table 2 that TBB obtains roughly the same overheads with both compilers. This observation confirms the commitment of the TBB designers to make it compiler independent. Second, the TBB scheduler outperforms OpenMP for all chunk sizes with both compilers. In the best case (Figure 3), TBB obtains up to two times and six times better performance for the Intel and Microsoft compilers, respectively. In the worst case (Table 2), TBB obtains overheads that are up to four orders of magnitude smaller than OpenMP (in the case of the dynamic schedule with a chunk size of 1 for the Microsoft compiler). Third, the Intel compiler obtains better performance (up to an order of magnitude) compared to Microsoft, suggesting that the recursion-based TBB scheduler is more efficient than the static scheduler of OpenMP. Moreover, the AUTO and SIMPLE options of the TBB scheduler outperform any manual setting of a chunk size for all schedule policies and both compilers.
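For reference, the snippet below shows where a chunk size enters the two programming models: an OpenMP schedule clause versus a TBB blocked_range grainsize combined with a simple_partitioner or auto_partitioner. This is a generic usage sketch (the loop body and the sizes are placeholders), not code from the benchmark suites.

    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>
    #include <tbb/partitioner.h>

    inline void work(int) { /* placeholder loop body */ }

    void openmp_version(int n, int chunk) {
        // OpenMP: the scheduling policy and chunk size are chosen in the directive.
        #pragma omp parallel for schedule(dynamic, chunk)
        for (int i = 0; i < n; ++i)
            work(i);
    }

    // TBB (2.x style): the loop body is a function object applied to sub-ranges.
    struct Body {
        void operator()(const tbb::blocked_range<int>& r) const {
            for (int i = r.begin(); i != r.end(); ++i)
                work(i);
        }
    };

    void tbb_version(int n, int grainsize) {
        // simple_partitioner splits a range until its size drops below the
        // grainsize, which plays the role of the chunk size above.
        tbb::parallel_for(tbb::blocked_range<int>(0, n, grainsize), Body(),
                          tbb::simple_partitioner());
        // auto_partitioner lets the work-stealing scheduler guide the splitting.
        tbb::parallel_for(tbb::blocked_range<int>(0, n), Body(),
                          tbb::auto_partitioner());
    }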
Fig. 3. Relative scheduling overheads of OpenMP (Static) vs. TBB for two threads on a Core 2 Duo, T8100, 2.1 GHz machine, for the Intel and Microsoft compilers
4 Conclusions
The course of change in the hardware industry is clear: the primary means of evolution for processors in the foreseeable future is through increasingly large numbers of execution cores per processor package. Understanding their performance characteristics is essential for designing scalable and efficient applications. In this paper, we reported on the running results of micro-benchmarks with the aim of measuring the overhead costs of OpenMP compiler directives and TBB parallel mechanisms. The benchmarking results show that the Intel compiler outperforms the Microsoft compiler. Moreover, the relative performance of TBB versus OpenMP depends mainly on the implementation of the parallel constructs in a specific compiler. Another conclusion arising from this study is that the TBB task-parallelism model, implemented by an efficient scheduler engine, is a better platform for the development of scalable applications on multicore machines, and therefore the TASK directive extension of OpenMP 3.0 can improve future OpenMP-based applications.
References
1. Sutter, H.: The free lunch is over: A fundamental turn toward concurrency in software. Dr. Dobb's Journal 30(3) (March 2005)
2. Sutter, H., Larus, J.: Software and the concurrency revolution. ACM Queue 3(7), 54–62 (2005)
3. Marowka, A.: Parallel Computing on Any Desktop. Communications of the ACM 50(9), 74–78 (2007)
4. Asanovic, K., et al.: The landscape of parallel computing research: A view from Berkeley. University of California at Berkeley, Technical Report No. UCB/EECS-2006-183 (December 18, 2006), http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
5. Marowka, A.: Execution Model of Three Parallel Languages: OpenMP, UPC and CAF. Scientific Programming 13(2), 127–135 (2005)
6. Nichols, B., et al.: Pthreads Programming, A POSIX Standard for Better Multiprocessing. O'Reilly, Sebastopol (September 1996)
7. High Productivity Computing Systems, http://www.highproductivity.org/
8. UPCRC, http://www.upcrc.illinois.edu/index.html
9. OpenMP Application Program Interface, Version 3.0 (2008), http://www.openmp.org
10. Chapman, B., Jost, G., van der Pas, R.: Using OpenMP, Portable Shared Memory Parallel Programming. MIT Press, Cambridge (2007)
11. Reinders, J.: Intel Threading Building Blocks, Outfitting C++ for Multi-core Processor Parallelism. O'Reilly, Sebastopol (2007)
12. TBB Web Site, http://www.threadingbuildingblocks.org/
13. Bull, M.: Measuring Synchronisation and Scheduling Overheads in OpenMP. In: Proceedings of the First European Workshop on OpenMP (EWOMP 1999), Lund, Sweden (October 1999)
14. Bull, M., O'Neill, D.: Microbenchmark Suite for OpenMP 2.0. In: Proceedings of the Third European Workshop on OpenMP (EWOMP 2001), Barcelona, Spain, pp. 41–48 (September 2001)
15. Contreras, G., Martonosi, M.: Characterizing and Improving the Performance of Intel Threading Building Blocks. In: IEEE Proceedings of the International Symposium on Workload Characterization, pp. 57–66 (2008)
An Implementation of Parallel 3-D FFT with 2-D Decomposition on a Massively Parallel Cluster of Multi-core Processors
Daisuke Takahashi
Graduate School of Systems and Information Engineering, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki 305-8573, Japan
[email protected]
Abstract. In this paper, we propose an implementation of a parallel three-dimensional fast Fourier transform (FFT) with two-dimensional decomposition on a massively parallel cluster of multi-core processors. The proposed parallel three-dimensional FFT algorithm is based on the multicolumn FFT algorithm. We show that a two-dimensional decomposition effectively improves performance by reducing the communication time for larger numbers of MPI processes. We successfully achieved a performance of over 401 GFlops on 256 nodes of Appro Xtreme-X3 (648 nodes, 147.2 GFlops/node, 95.4 TFlops peak performance) for a 256^3-point FFT.
1 Introduction
The fast Fourier transform (FFT) [1] is an algorithm that is widely used today in science and engineering. Parallel three-dimensional FFT algorithms on distributed-memory parallel computers have been investigated extensively [2,3,4,5,6,7,8]. A typical decomposition for performing a parallel three-dimensional FFT is slab-wise. In this case, a three-dimensional array x(N1, N2, N3) is distributed along the third dimension N3. However, N3 must be greater than or equal to the number of MPI processes. This becomes an issue with very large node counts for a massively parallel cluster of multi-core processors. To solve this problem, a scalable framework for three-dimensional FFTs on the Blue Gene/L supercomputer [5] and a three-dimensional FFT on the six-dimensional network torus QCDOC parallel supercomputer [8] have been proposed. These parallel three-dimensional FFT algorithms use a three-dimensional decomposition for the three-dimensional FFT. These schemes require three all-to-all communications. In this paper, we propose a two-dimensional decomposition for the parallel three-dimensional FFT. This requires only two all-to-all communications. We have implemented the parallel three-dimensional FFT algorithm on a massively parallel cluster of multi-core processors, and the obtained performance results are reported herein.
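As a concrete illustration of the slab-decomposition limit mentioned above: for a 256^3-point transform, the largest size considered in this paper, a slab decomposition can use at most N3 = 256 MPI processes, far fewer than the 4,096 cores employed in the experiments of Section 5.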
The remainder of this paper is organized as follows. Section 2 describes the conventional three-dimensional FFT algorithm. In Section 3, we propose a two-dimensional decomposition for the parallel three-dimensional FFT. Section 4 compares the communication times of the conventional one-dimensional decomposition and the two-dimensional decomposition. Section 5 describes the performance results. Finally, in Section 6, we present concluding remarks.
2 Three-Dimensional FFT
The three-dimensional discrete Fourier transform (DFT) is given by

y(k_1, k_2, k_3) = \sum_{j_1=0}^{n_1-1} \sum_{j_2=0}^{n_2-1} \sum_{j_3=0}^{n_3-1} x(j_1, j_2, j_3) \omega_{n_3}^{j_3 k_3} \omega_{n_2}^{j_2 k_2} \omega_{n_1}^{j_1 k_1}    (1)

where \omega_{n_r} = e^{-2\pi i/n_r} (1 \le r \le 3) and i = \sqrt{-1}. The three-dimensional FFT based on the multicolumn FFT algorithm [9] is as follows:

Step 1: n_2 n_3 individual n_1-point multicolumn FFTs:
        x_1(k_1, j_2, j_3) = \sum_{j_1=0}^{n_1-1} x(j_1, j_2, j_3) \omega_{n_1}^{j_1 k_1}
Step 2: Transpose:
        x_2(j_2, j_3, k_1) = x_1(k_1, j_2, j_3)
Step 3: n_3 n_1 individual n_2-point multicolumn FFTs:
        x_3(k_2, j_3, k_1) = \sum_{j_2=0}^{n_2-1} x_2(j_2, j_3, k_1) \omega_{n_2}^{j_2 k_2}
Step 4: Transpose:
        x_4(j_3, k_1, k_2) = x_3(k_2, j_3, k_1)
Step 5: n_1 n_2 individual n_3-point multicolumn FFTs:
        x_5(k_3, k_1, k_2) = \sum_{j_3=0}^{n_3-1} x_4(j_3, k_1, k_2) \omega_{n_3}^{j_3 k_3}
Step 6: Transpose:
        y(k_1, k_2, k_3) = x_5(k_3, k_1, k_2)
The distinctive features of the three-dimensional FFT can be summarized as follows:
– Three multicolumn FFTs are performed in Steps 1, 3 and 5. Each column FFT is small enough to fit into the data cache.
– The three-dimensional FFT has three transpose steps.
Fig. 1. One-dimensional decomposition along the z-axis
Fig. 2. Two-dimensional decomposition along the y- and z-axes
3 Parallel Three-Dimensional FFT Algorithm with Two-Dimensional Decomposition
One-dimensional decomposition along the z-axis for the three-dimensional FFT is shown in Figure 1. We propose a two-dimensional decomposition for the three-dimensional FFT. Let N = N_1 \times N_2 \times N_3. On a distributed-memory parallel computer having P \times Q processors, a three-dimensional array x(N_1, N_2, N_3) is distributed along the second dimension N_2 and the third dimension N_3. If N_2 is divisible by P and N_3 is also divisible by Q, then each processor has distributed data of size N_1 \times (N_2/P) \times (N_3/Q). Next, we introduce the notation \hat{N}_r \equiv N_r/P and \hat{\hat{N}}_r \equiv N_r/Q. Furthermore, we denote the corresponding indices as \hat{J}_r and \hat{\hat{J}}_r, which indicate that the data along J_r are distributed across P processors and Q processors, respectively. Here, we use the subscript r to indicate that this index belongs to dimension r. The distributed array is represented as \hat{x}(N_1, \hat{N}_2, \hat{\hat{N}}_3). At processor l in the y-direction, the local index \hat{J}_r(l) corresponds to the global index according to the block distribution:

J_r = \hat{J}_r(l) \times P + l,   0 \le l \le P - 1,   1 \le r \le 3    (2)
Moreover, at processor m in the z-direction, the local index \hat{\hat{J}}_r(m) also corresponds to the global index according to the block distribution:

J_r = \hat{\hat{J}}_r(m) \times Q + m,   0 \le m \le Q - 1,   1 \le r \le 3    (3)
In order to illustrate the all-to-all communication, it is convenient to decompose N_i into two dimensions:
– \tilde{N}_i and P_i, where \tilde{N}_i \equiv N_i/P_i
– \tilde{\tilde{N}}_i and Q_i, where \tilde{\tilde{N}}_i \equiv N_i/Q_i
Although P_i and Q_i are the same as P and Q, respectively, we use the subscript i to indicate that these indices belong to dimension i. Starting with the initial data \hat{x}(N_1, \hat{N}_2, \hat{\hat{N}}_3), the parallel three-dimensional FFT with two-dimensional decomposition can be performed according to the following steps:

Step 1: (N_2/P) \cdot (N_3/Q) individual N_1-point multicolumn FFTs:
        \hat{x}_1(K_1, \hat{J}_2, \hat{\hat{J}}_3) = \sum_{J_1=0}^{N_1-1} \hat{x}(J_1, \hat{J}_2, \hat{\hat{J}}_3) \omega_{N_1}^{J_1 K_1}
Step 2: Transpose:
        \hat{x}_2(\hat{J}_2, \hat{\hat{J}}_3, \tilde{K}_1, P_1) \equiv \hat{x}_2(\hat{J}_2, \hat{\hat{J}}_3, K_1) = \hat{x}_1(K_1, \hat{J}_2, \hat{\hat{J}}_3)
Step 3: Q individual all-to-all communications across P processors in the y-axis:
        \hat{x}_3(\tilde{J}_2, \hat{\hat{J}}_3, \hat{K}_1, P_2) = \hat{x}_2(\hat{J}_2, \hat{\hat{J}}_3, \tilde{K}_1, P_1)
Step 4: Rearrangement:
        \hat{x}_4(J_2, \hat{\hat{J}}_3, \hat{K}_1) \equiv \hat{x}_4(\tilde{J}_2, P_2, \hat{\hat{J}}_3, \hat{K}_1) = \hat{x}_3(\tilde{J}_2, \hat{\hat{J}}_3, \hat{K}_1, P_2)
Step 5: (N_3/Q) \cdot (N_1/P) individual N_2-point multicolumn FFTs:
        \hat{x}_5(K_2, \hat{\hat{J}}_3, \hat{K}_1) = \sum_{J_2=0}^{N_2-1} \hat{x}_4(J_2, \hat{\hat{J}}_3, \hat{K}_1) \omega_{N_2}^{J_2 K_2}
Step 6: Transpose:
        \hat{x}_6(\hat{\hat{J}}_3, \hat{K}_1, \tilde{\tilde{K}}_2, Q_2) \equiv \hat{x}_6(\hat{\hat{J}}_3, \hat{K}_1, K_2) = \hat{x}_5(K_2, \hat{\hat{J}}_3, \hat{K}_1)
Step 7: P individual all-to-all communications across Q processors in the z-axis:
        \hat{x}_7(\tilde{\tilde{J}}_3, \hat{K}_1, \hat{\hat{K}}_2, Q_3) = \hat{x}_6(\hat{\hat{J}}_3, \hat{K}_1, \tilde{\tilde{K}}_2, Q_2)
Step 8: Rearrangement:
        \hat{x}_8(J_3, \hat{K}_1, \hat{\hat{K}}_2) \equiv \hat{x}_8(\tilde{\tilde{J}}_3, Q_3, \hat{K}_1, \hat{\hat{K}}_2) = \hat{x}_7(\tilde{\tilde{J}}_3, \hat{K}_1, \hat{\hat{K}}_2, Q_3)
Step 9: (N_1/P) \cdot (N_2/Q) individual N_3-point multicolumn FFTs:
        \hat{x}_9(K_3, \hat{K}_1, \hat{\hat{K}}_2) = \sum_{J_3=0}^{N_3-1} \hat{x}_8(J_3, \hat{K}_1, \hat{\hat{K}}_2) \omega_{N_3}^{J_3 K_3}
Step 10: Transpose:
        \hat{y}(\hat{K}_1, \hat{\hat{K}}_2, K_3) = \hat{x}_9(K_3, \hat{K}_1, \hat{\hat{K}}_2)
The proposed two-dimensional decomposition along the y- and z-axes for the three-dimensional FFT is shown in Figure 2. The distinctive features of the parallel three-dimensional FFT algorithm with two-dimensional decomposition can be summarized as follows:
– The parallel three-dimensional FFT is accompanied by a local transpose (data rearrangement).
– (N^{1/3}/P) \cdot (N^{1/3}/Q) individual N^{1/3}-point multicolumn FFTs are performed in Steps 1, 5 and 9 for the case of N_1 = N_2 = N_3 = N^{1/3}.
– All-to-all communications occur twice.
Although the input data \hat{x}(N_1, \hat{N}_2, \hat{\hat{N}}_3) is distributed along the second dimension N_2 and the third dimension N_3, the output data \hat{y}(\hat{N}_1, \hat{\hat{N}}_2, N_3) is distributed along the first dimension N_1 and the second dimension N_2. If we assume that the input data and output data have the same distribution, then additional all-to-all communication steps in the y- and z-axes, respectively, are needed.
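A minimal sketch of the communication skeleton implied by Steps 3 and 7 is given below: the P x Q process grid is split into y- and z-direction communicators, and each of the two all-to-all exchanges is restricted to one of them. This is only an illustrative outline under hypothetical assumptions (grid shape, buffer sizes); the FFT and transpose work is omitted, and it is not the author's implementation.

    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Hypothetical P x Q process grid (P * Q must equal the number of processes).
        const int P = 2, Q = size / 2;
        const int myRow = rank % P;          // position in the y-direction
        const int myCol = rank / P;          // position in the z-direction

        // Step 3 communicates among the P processes of one y-group,
        // Step 7 among the Q processes of one z-group.
        MPI_Comm yComm, zComm;
        MPI_Comm_split(MPI_COMM_WORLD, myCol, myRow, &yComm);
        MPI_Comm_split(MPI_COMM_WORLD, myRow, myCol, &zComm);

        // Dummy payload standing in for the redistributed sub-volume.
        const int blockDoubles = 1024;       // placeholder size per partner
        std::vector<double> sendY(blockDoubles * P), recvY(blockDoubles * P);
        std::vector<double> sendZ(blockDoubles * Q), recvZ(blockDoubles * Q);

        // ... local N1-point FFTs and transpose (Steps 1-2) ...
        MPI_Alltoall(sendY.data(), blockDoubles, MPI_DOUBLE,
                     recvY.data(), blockDoubles, MPI_DOUBLE, yComm);   // Step 3
        // ... local N2-point FFTs and transpose (Steps 4-6) ...
        MPI_Alltoall(sendZ.data(), blockDoubles, MPI_DOUBLE,
                     recvZ.data(), blockDoubles, MPI_DOUBLE, zComm);   // Step 7
        // ... local N3-point FFTs and final transpose (Steps 8-10) ...

        MPI_Comm_free(&yComm);
        MPI_Comm_free(&zComm);
        MPI_Finalize();
        return 0;
    }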
4 Communication Times of One-Dimensional Decomposition and Two-Dimensional Decomposition

Let us assume the following parameters for an N = N_1 \times N_2 \times N_3-point FFT:
– Latency of communication: L (sec)
– Communication bandwidth: W (Byte/s)
– Number of processors: P \times Q
We will compare the communication times of the one-dimensional decomposition and the two-dimensional decomposition.
4.1 Communication Time of One-Dimensional Decomposition
In the one-dimensional decomposition, each processor sends N/(PQ)^2 double-complex data to PQ - 1 processors. Thus, the communication time of the one-dimensional decomposition, T_{1dim}, can be estimated as follows:

T_{1dim} = (PQ - 1) \left( L + \frac{16N}{(PQ)^2 W} \right) \approx PQ \cdot L + \frac{16N}{PQ \cdot W}  (sec)    (4)
Fig. 3. Performance of parallel three-dimensional FFTs with two-dimensional decomposition
4.2 Communication Time of Two-Dimensional Decomposition
In the two-dimensional decomposition, each processor in the y-axis sends N/(P^2 Q) double-complex data to P - 1 processors in the y-axis. Moreover, each processor in the z-axis sends N/(P Q^2) double-complex data to Q - 1 processors in the z-axis. Thus, the communication time of the two-dimensional decomposition, T_{2dim}, can be estimated as follows:

T_{2dim} = (P - 1) \left( L + \frac{16N}{P^2 Q \cdot W} \right) + (Q - 1) \left( L + \frac{16N}{P Q^2 \cdot W} \right) \approx (P + Q) \cdot L + \frac{32N}{PQ \cdot W}  (sec)    (5)
4.3 Comparison of One-Dimensional Decomposition and Two-Dimensional Decomposition
The communication times of the one-dimensional decomposition and the two-dimensional decomposition are estimated using Equations (4) and (5), respectively. By comparing the two equations, the communication time of the two-dimensional decomposition is seen to be less than that of the one-dimensional decomposition for a larger number of processors P \times Q and a larger latency L.
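For illustration, with hypothetical parameter values L = 10 µs, W = 1 GB/s, N = 2^24 and P = Q = 16 (so PQ = 256), Eq. (4) gives T_{1dim} ≈ 256 · 10 µs + 16N/(256 · W) ≈ 2.56 ms + 1.05 ms ≈ 3.6 ms, whereas Eq. (5) gives T_{2dim} ≈ 32 · 10 µs + 32N/(256 · W) ≈ 0.32 ms + 2.10 ms ≈ 2.4 ms. The two-dimensional decomposition doubles the transferred volume but pays the latency only P + Q = 32 times instead of PQ = 256 times, which dominates as the process count grows.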
5 Experimental Results
In order to evaluate the proposed parallel three-dimensional FFT, we compared its performance to that of the conventional one-dimensional decomposition. We averaged the elapsed times obtained for 10 executions of complex forward 32^3,
Fig. 4. Performance of parallel three-dimensional 256^3-point FFTs
64^3, 128^3 and 256^3-point FFTs. The parallel FFTs were performed on double-precision complex data, and a table of twiddle factors was prepared in advance. We used transposed-order output to reduce the number of all-to-all communication steps. An Appro Xtreme-X3 (648 nodes, 32 GB per node, 147.2 GFlops per node, 20 TB total main memory, communication bandwidth 8 GB/sec per node, and 95.4 TFlops peak performance) was used as the distributed-memory parallel computer. Each computation node of the Xtreme-X3 has four sockets of quad-core AMD Opteron (2.3 GHz) processors in a 16-core shared-memory configuration with a performance of 147.2 GFlops. The nodes are interconnected through DDR InfiniBand. MVAPICH 1.2.0 [10] was used as the communication library. In the experiments, we used one to 4,096 cores. In order to evaluate the scalability with the number of MPI processes, the flat MPI programming model was used. The Intel Fortran Compiler (ifort, version 10.1) was used on the Appro Xtreme-X3. The compiler options used were specified as "ifort -O3 -xO". Figure 3 shows the performance of the parallel three-dimensional FFTs (two-dimensional decomposition) in terms of the average execution performance in GFlops. Each GFlops value is based on 5N log_2 N for a transform of size N = 2^m. For the 32^3-point FFT, we can clearly see that communication overhead dominates the execution time. In this case, the total working set size is only 1 MB. On the other hand, the two-dimensional decomposition scales well up to 4,096 cores for the 256^3-point FFT. The performance on 4,096 cores is over 401 GFlops, which is approximately 1.1% of the theoretical peak performance. The performance of all operations other than the all-to-all communications is over 10 TFlops, which is approximately 26.7% of the theoretical peak performance. Figure 4 compares the performance of the one-dimensional decomposition and the two-dimensional decomposition. For P \times Q \le 64, the performance of the one-dimensional decomposition is better than that of the two-dimensional decomposition. This is because the total communication amount of the one-dimensional decomposition is half that of the two-dimensional decomposition.
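As a worked cross-check of the quoted peak figure: the 256^3-point transform has N = 2^24, so 5N log_2 N = 5 · 2^24 · 24 ≈ 2.01 · 10^9 floating-point operations. At the execution time of 0.00502 s measured on 4,096 cores (Table 1 below), this corresponds to roughly 2.01 · 10^9 / 0.00502 s ≈ 401 GFlops, consistent with the reported value.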
Table 1. Communication time of parallel 256^3-point FFTs with two-dimensional decomposition

Cores   Execution time (sec)   Communication time (sec)   % of Communication
1           2.84448                 0.33756                    11.87
2           1.60708                 0.25705                    15.99
4           1.11583                 0.26579                    23.82
8           0.61128                 0.16514                    27.02
16          0.34474                 0.11832                    34.32
32          0.17320                 0.06372                    36.79
64          0.09039                 0.03737                    41.34
128         0.04823                 0.02176                    45.11
256         0.02346                 0.01320                    56.28
512         0.01793                 0.01461                    81.48
1024        0.01199                 0.01079                    89.98
2048        0.00736                 0.00652                    88.64
4096        0.00502                 0.00482                    96.07
However, for P × Q ≥ 128, due to the latency, the performance of the two-dimensional decomposition is better than that of the one-dimensional decomposition. Table 1 shows that the all-to-all communication overhead contributes significantly to the execution time.
6 Conclusion
We have implemented a parallel three-dimensional FFT with two-dimensional decomposition on a massively parallel cluster of multi-core processors. We showed that a two-dimensional decomposition effectively improves performance by reducing the communication time for larger numbers of MPI processes. The proposed parallel three-dimensional FFT algorithm with two-dimensional decomposition is most advantageous on a massively parallel cluster of multi-core processors. A performance of over 401 GFlops was achieved on 256 nodes of Appro Xtreme-X3 (648 nodes, 147.2 GFlops/node, 95.4 TFlops peak performance) for a 256^3-point FFT.
References
1. Cooley, J.W., Tukey, J.W.: An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19, 297–301 (1965)
2. Brass, A., Pawley, G.S.: Two and three dimensional FFTs on highly parallel computers. Parallel Computing 3, 167–184 (1986)
3. Agarwal, R.C., Gustavson, F.G., Zubair, M.: An efficient parallel algorithm for the 3-D FFT NAS parallel benchmark. In: Proceedings of the Scalable High-Performance Computing Conference, pp. 129–133 (1994)
4. Takahashi, D.: Efficient implementation of parallel three-dimensional FFT on clusters of PCs. Computer Physics Communications 152, 144–150 (2003)
5. Eleftheriou, M., Fitch, B.G., Rayshubskiy, A., Ward, T.J.C., Germain, R.S.: Scalable framework for 3D FFTs on the Blue Gene/L supercomputer: Implementation and early performance measurements. IBM J. Res. Dev. 49, 457–464 (2005)
6. Frigo, M., Johnson, S.G.: The design and implementation of FFTW3. Proc. IEEE 93, 216–231 (2005)
7. Takahashi, D.: A hybrid MPI/OpenMP implementation of a parallel 3-D FFT on SMP clusters. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 970–977. Springer, Heidelberg (2006)
8. Fang, B., Deng, Y., Martyna, G.: Performance of the 3D FFT on the 6D network torus QCDOC parallel supercomputer. Computer Physics Communications 176, 531–538 (2007)
9. Van Loan, C.: Computational Frameworks for the Fast Fourier Transform. SIAM Press, Philadelphia (1992)
10. MVAPICH: MPI over InfiniBand and iWARP, http://mvapich.cse.ohio-state.edu/
Introducing a Performance Model for Bandwidth-Limited Loop Kernels

Jan Treibig and Georg Hager

Regionales Rechenzentrum Erlangen, University Erlangen-Nuernberg
{jan.treibig,georg.hager}@rrze.uni-erlangen.de

Abstract. We present a diagnostic performance model for bandwidth-limited loop kernels which is founded on the analysis of modern cache-based microarchitectures. This model allows an accurate performance prediction and evaluation for existing instruction codes. It provides an in-depth understanding of how performance for different memory hierarchy levels is made up. The performance of raw memory load, store and copy operations and a stream vector triad are analyzed and benchmarked on three modern x86-type quad-core architectures in order to demonstrate the capabilities of the model.
1 Introduction
Many algorithms are limited by bandwidth, meaning that the memory subsystem cannot provide the data as fast as the arithmetic core could process it. One solution to this problem is to introduce multi-level memory hierarchies with low-latency and high-bandwidth caches, which exploit temporal and spatial locality in an application's data access pattern. In many scientific algorithms the bandwidth bottleneck is still severe, however. While there exist successful balance models predicting the influence of main memory bandwidth on the performance [4], they implicitly assume that caches are infinitely fast in comparison to main memory. This is why those models fail when applied to in-cache situations. Our proposed model explains what parts contribute to the runtime of bandwidth-limited algorithms on all memory levels. We will show that meaningful predictions can only be drawn if the execution of the instruction code is taken into account. To introduce and evaluate the model, basic building blocks of streaming algorithms (load, store and copy operations) are analyzed and benchmarked on three x86-type test machines. In addition, as a prototype for many streaming algorithms we use the STREAM triad A = B + α ∗ C, which matches the performance characteristics of many real algorithms [5]. The main routine and utility modules are implemented in C while the actual loop code uses assembly language. The runtime is measured in clock cycles using the rdtsc instruction. Section 2 presents the microarchitectures and technical specifications of the test machines. In Section 3 the model approach is briefly described. The application of the model and the corresponding measurements can be found in Sections 4 and 5.
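The triad kernel and the cycle-based timing mentioned above can be pictured with a minimal C sketch. The actual benchmark kernels are written in assembly (see Listing 1.1); the rdtsc wrapper, array size and initialization here are illustrative assumptions only:

#include <stdint.h>
#include <stdlib.h>

/* Read the time stamp counter; the benchmark measures runtime in cycles. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* STREAM triad A = B + alpha * C, the prototype streaming kernel. */
static void triad(double *A, const double *B, const double *C,
                  double alpha, size_t n)
{
    for (size_t i = 0; i < n; i++)
        A[i] = B[i] + alpha * C[i];
}

int main(void)
{
    size_t n = 2048;               /* choose n so the working set fits the
                                      memory level under investigation     */
    double *A = malloc(n * sizeof *A), *B = malloc(n * sizeof *B),
           *C = malloc(n * sizeof *C);
    for (size_t i = 0; i < n; i++) { B[i] = 1.0; C[i] = 2.0; }

    uint64_t start = rdtsc();
    triad(A, B, C, 3.0, n);
    uint64_t cycles = rdtsc() - start;

    /* cycles per cache line update: 8 doubles per line and stream */
    double cyc_per_cl = (double)cycles / (n / 8.0);
    (void)cyc_per_cl;
    free(A); free(B); free(C);
    return 0;
}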
Table 1. Test machine specifications. The cache line size is 64 bytes for all processors and cache levels.

                       Core 2                 Nehalem                Shanghai
                       Intel Core2 Q9550      Intel i7 920           AMD Opteron 2378
Execution Core
  Clock [GHz]          2.83                   2.67                   2.4
  Inst. Throughput     4 ops                  4 ops                  3 ops
  Peak DP FP rate      4 flops/cycle          4 flops/cycle          4 flops/cycle
L1 Cache               32 kB                  32 kB                  64 kB
  Parallelism          4 banks, dual ported   4 banks, dual ported   8 banks, dual ported
L2 Cache               2x6 MB (inclusive)     4x256 KB               4x512 KB (exclusive)
L3 Cache (shared)      -                      8 MB (inclusive)       6 MB (exclusive)
Main Memory            DDR2-800               DDR3-1066              DDR2-800
  Channels             2                      3                      2
  Memory clock [MHz]   800                    1066                   800
  Bytes/clock          16                     24                     16
  Bandwidth [GB/s]     12.8                   25.6                   12.8
2 Experimental Test-Bed
An overview of the test machines can be found in Table 1. As representatives of current x86 architectures we have chosen Intel “Core 2 Quad” and “Core i7” processors, and an AMD “Shanghai” chip. The cache group structure, i.e., which cores share caches of what size, is illustrated in Figure 1. For detailed information about microarchitecture and cache organization, see the Intel [1] and AMD [2] Optimization Handbooks. Although the Shanghai processor used for the tests sits in a dual-socket motherboard, we restrict our analysis to a single socket. An important difference between Intel and AMD cache architecture is that the former uses an inclusive hierarchy whereas the latter employs exclusive caches. With inclusive caches, if a line is allocated in some cache level, this line is also present in all lower levels. In contrast, exclusive caches allow a certain line to be present only in one cache level at any given time. While exclusive caches provide a larger usable cache size, they also involve more inter-level cache line transfers.
Fig. 1. Data cache group structure of the multi-core architectures in the test-bed for Core 2 (left), Core i7 (middle) and Shanghai (right)
3 Diagnostic Performance Model
This model proposes a multi-level approach to analytically predict the performance of bandwidth-limited algorithms in all memory hierarchy levels. The basic building block of a streaming algorithm is its computational kernel in the inner loop body. Since all loads and stores come from and go to L1 cache, the kernel's execution time on a cache line basis is governed by the maximum number of L1 load and store accesses per cycle and the capability of the pipelined, superscalar core to execute instructions. All lower levels of the memory hierarchy are reduced to their bandwidth properties, with data paths and transfer volumes based on the real cache architecture. The minimum transfer size between memory levels is one cache line. Based on the transfer volumes and bandwidths, the total execution time per cache line is obtained by adding all contributions from data transfers between caches and kernel execution times in L1. To lowest order, we assume that there is no access latency (i.e., all latencies can be effectively hidden by software pipelining and prefetching) and that the different components of overall execution time do not overlap. The total runtime of a bandwidth-limited loop kernel is modeled as:

    T_total = T_L1 + Σ_{i=2}^{n} T_{memory level i}                    (1)
All runtime values are valid for one cache line update. T_L1 is singled out as it quantifies the cycles needed to execute the loop body from L1 cache and is determined by instruction code analysis. A good lower bound is the minimum number of cycles taken for the execution of the loads and stores apparent in the instruction code under consideration. All other memory levels are modeled according to their bandwidth capabilities and the data volume transferred to L1 cache. Here the number of cycles per cache line depends on the algorithmic data access patterns and on the memory level the data comes from. This runtime can be computed as:

    T_{memory level i} = (N_CL × bytes per CL) / (bus width [byte])    (2)
T_total is calculated in core cycles. If a memory level runs with a different clock speed, this has to be taken into account by converting the bus cycles into core cycles. The number of cache lines to be transferred must be based on the real cache line transfers, including those triggered by the cache architecture (like, e.g., write-allocate and copy-back transfers). It must be stressed that a correct application of this model requires intimate knowledge of cache architectures and data paths. This information is available from processor manufacturers [1,2], but sometimes the level of detail is insufficient for determining all parameters, and relevant information must be inferred from measurements.
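As a minimal illustration of Eqs. (1) and (2) — an assumed toy implementation, not the authors' benchmark code — the prediction for one cache line update can be written as the L1 execution time plus one bandwidth term per memory level. The numbers in the comments reproduce the copy kernel with data in L2 on the Core 2, discussed in Section 4:

/* Sketch of the model in Eqs. (1) and (2): predicted cycles per cache line
 * update is the L1 execution time plus one bandwidth term per memory level
 * involved in the transfer. */
struct level_contrib {
    int cache_lines;      /* N_CL: cache lines moved across this level    */
    int bus_width_bytes;  /* bytes transferred per core cycle on this bus */
};

static double predicted_cycles(double t_l1,
                               const struct level_contrib *lvl, int nlevels)
{
    double t = t_l1;                                   /* T_L1          */
    for (int i = 0; i < nlevels; i++)                  /* Eq. (2) terms */
        t += (double)(lvl[i].cache_lines * 64) / lvl[i].bus_width_bytes;
    return t;
}

/* Example: copy kernel with data in L2 on the Core 2 (inclusive cache).
 * Per destination line: 1 line loaded, 1 line write-allocated, 1 line
 * evicted = 3 cache lines over the 32-byte/cycle L1<->L2 bus (6 cycles),
 * plus 4 cycles of kernel execution in L1 -> 10 cycles (cf. Table 3). */
static double core2_copy_l2(void)
{
    struct level_contrib l2 = { 3, 32 };
    return predicted_cycles(4.0, &l2, 1);
}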
4 Theoretical Analysis
In this section we substantiate the model outlined above by providing the necessary architectural details for current Intel and AMD processors. Using simple kernel loops, we derive performance predictions which will be compared to actual measurements in the following section. All results are given in CPU cycles per cache line; if n streams are processed in the kernel, the number of cycles denotes the time required to process one cache line per stream. The kernels are implemented in assembly language. In Listing 1.1 the loop kernel of the triad benchmark is shown. Note that the array registers point to the end of the vectors and the index register rax is negative.

Listing 1.1. Assembly code for triad loop kernel

1:  movaps xmm0, [rsi + rax*8]         ; load four 16-byte chunks of one input vector
    movaps xmm1, [rsi + rax*8+16]
    movaps xmm2, [rsi + rax*8+32]
    movaps xmm3, [rsi + rax*8+48]
    mulpd  xmm0, xmm4                  ; multiply by alpha (held in xmm4)
    addpd  xmm0, [rdx + rax*8]         ; add the corresponding chunk of the second input
    mulpd  xmm1, xmm4
    addpd  xmm1, [rdx + rax*8+16]
    mulpd  xmm2, xmm4
    addpd  xmm2, [rdx + rax*8+32]
    mulpd  xmm3, xmm4
    addpd  xmm3, [rdx + rax*8+48]
    movaps [rdi + rax*8], xmm0         ; store four 16-byte chunks of the result vector
    movaps [rdi + rax*8+16], xmm1
    movaps [rdi + rax*8+32], xmm2
    movaps [rdi + rax*8+48], xmm3
    add    rax, 8                      ; advance by 8 doubles (one cache line per stream)
    js     1b                          ; loop while the index is still negative
As mentioned earlier, basic data operations in L1 cache are limited by cache bandwidth, which is determined by the load and store instructions that can execute per cycle. The Intel cores can retire one 128-bit load and one 128-bit store in every cycle. L1 bandwidth is thus limited to 16 bytes per cycle if only loads (or stores) are used, and reaches its peak of 32 bytes per cycle only for a copy operation. The AMD Shanghai core can perform either two 128-bit loads or two 64-bit stores in every cycle. This results in a load performance of 32 bytes per cycle and a store performance of 16 bytes per cycle. For load-only and store-only kernels, there is only one data stream, i.e., exactly one cache line is processed at any time. With copy and stream triad kernels, this number increases to two and three, respectively. Together with the execution limits described above it is possible to predict the number of cycles needed to execute the instructions necessary to process one cache line per stream (see the “L1” columns in Table 2).
Table 2. Theoretical prediction of execution times for eight loop iterations (one cache line per stream) on Core 2 (A), Core i7 (B), and Shanghai (C) processors

            L1              L2               L3          Memory
         A   B   C       A   B   C        B   C        A   B   C
Load     4   4   2       6   6   6        8   8       20  15  18
Store    4   4   4       8   8   8       12  10       36  26  32
Copy     4   4   6      10  10  14       16  18       52  36  50
Triad    8   8   8      16  16  20       24  26       72  51  68
L2 cache bandwidth is influenced by three factors: (i) the finite bus width between L1 and L2 cache for refills and evictions, (ii) the fact that either ALU access or cache refill can occur at any one time, and (iii) the L2 cache access latency. All three architectures have a 256-bit bus connection between L1 and L2 cache and use a write-back and write-allocate strategy for stores. In case of an L1 store miss, the cache line is first moved from L2 to L1 before it can be updated (write allocate). Together with its later eviction to L2, this results in an effective bandwidth requirement of 128 bytes per cache line write-miss update. On the Intel processors, a load miss incurs only a single cache line transfer from L2 to L1 because the cache hierarchy is inclusive. The Core i7 L2 cache is not strictly inclusive, but for the benchmarks covered here (no cache line sharing and no reuse) an inclusive behavior was assumed due to the lack of detailed documentation about the L2 cache. In contrast, the AMD L2 cache is exclusive: it only contains data that was evicted from L1 due to conflict misses. On a load miss the new cache line and the replaced cache line have to be exchanged. This results in a bandwidth requirement of two cache lines for every cache line loaded from L2. The overall execution time of the loop kernel on one cache line per stream is the sum of (i) the time needed to transfer the cache line(s) between L2 and L1 and (ii) the runtime of the loop kernel in L1 cache. Table 3 shows the different contributions for pure load, pure store, copy and triad operations on Intel and AMD processors.

Table 3. Loop kernel runtime for one cache line per stream in L2 cache

                 Intel                       AMD
          Load  Store  Copy  Triad    Load  Store  Copy  Triad
L1 part     4     4      4     8        2     4      6     8
L2 part     2     4      6     8        4     4      8    12
L1+L2       6     8     10    16        6     8     14    20

Looking at, e.g., the copy operation on Intel, the model predicts that only 6 cycles out of 10 can be used to transfer data from L2 to L1 cache. The remaining 4 cycles are spent with the execution of the loop kernel in L1. This explains the well-known performance breakdown for streaming kernels when data does not fit into L1 any more, although the nominal L1 and L2 bandwidths are identical. All results are included in the "L2" columns of Table 2. The large number of cycles for the AMD architecture can be attributed to the exclusive cache structure, which leads to a lot of additional inter-cache traffic.

Not much is known about the L3 cache architecture on Intel Core i7 and AMD Shanghai. It can be assumed that the bus width between the caches is 256 bits, which was confirmed by measurements of raw L3 cache bandwidth (realized with a memory copy only moving one value per cache line). Our model assumes a strictly inclusive cache hierarchy for the Intel designs, in which the L3 cache is "just another level." For the AMD chips it is known that all caches share a single bus. On an L1 miss, data is directly loaded into L1 cache. Only evicted L1 lines will be stored to lower hierarchy levels. While the L3 cache is not strictly exclusive on the AMD Shanghai, exclusive behavior can be assumed for the benchmarks considered here. Under these assumptions, the model can predict the required number of cycles in the same way as for the L2 case above. The "L3" columns in Table 2 show the results.

If data resides in main memory, we again take into account a strictly hierarchical (inclusive) data load on Intel processors, while data is loaded directly into L1 cache on AMD even on store misses. The cycles for main memory transfers are computed using the effective memory clock and bus width and are converted into CPU cycles. For consistency reasons, non-temporal ("streaming") stores were not used for the main memory regime. Data transfer volumes and rates, and the predicted cycles for a cache line update, are illustrated in Figures 2 (Core i7) and 3 (Shanghai). They are also included in the "Memory" columns of Table 2.

Fig. 2. Main memory performance model for Intel Core i7. There are separate buses connecting the different cache levels.

Fig. 3. Main memory performance model for AMD Shanghai. All caches are connected via a shared bus.
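To illustrate how the "Memory" columns of Table 2 come about, consider — as one consistent accounting under the assumptions above, using the specifications from Table 1 and the data path of Figure 2 — a pure load on the Core i7: one 64-byte cache line has to travel from main memory through L3 and L2 into L1, so

    T_total ≈ (64 B / 25.6 GB/s) × 2.67 GHz + 64/32 + 64/32 + 4 ≈ 6.7 + 2 + 2 + 4 ≈ 15 cycles,

which is the value listed in Table 2 for the Core i7 load kernel with data in main memory.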
5 Measurements
Measured cycles for a cache line update, the ratio of predicted versus measured cycles, and the real and effective bandwidths are listed in Table 5. Here, "effective bandwidth" means the bandwidth available to the application, whereas "real bandwidth" refers to the actual data transfer taking place. For every layer in the hierarchy the working set size was chosen to fit into the appropriate level, but not into higher ones. The measurements confirm the predictions of the model well in the L1 regime, with slightly larger deviations for the AMD architecture. In general, we refer to additional (measured) cycles spent compared to the model as "overhead." The L2 results also confirm the predictions. One exception is the store performance of the Intel Core i7, which is significantly better than the prediction. Because the prediction was based on bandwidth capabilities, the only explanation for this can be either a higher bus clock or a lower data volume. The dynamic overclocking feature of the Core i7 was disabled; it is therefore probable that the write-allocate load on a store miss is not triggered in all cases on the Core i7. Still, this issue must be clarified by Intel to enable a sensible application of the model on Nehalem-based processors. The overhead for accessing the L2 cache with a streaming data access pattern scales with the number of involved cache lines, as can be derived from a comparison of the measured cache line update cycles in Table 5 and the predictions in Table 2. The highest cost occurs on the Core 2 with 2 cycles per cache line for the triad, followed by Shanghai with 1.5 cycles per cache line. Core i7 has a very low L2 access overhead of 0.5 cycles per cache line. Still, all Core i7 results must be interpreted with caution until the L2 behavior can be predicted correctly by a revised modeling of the Core i7 cache architecture. All architectures are good at hiding cache latencies for streaming patterns.
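The GB/s and eff. GB/s columns in Table 5 follow directly from the measured cycles per cache line update. A small sketch of the conversion (the clock rates are those of Table 1; the distinction between real and effective traffic is the one described above):

/* Convert a measured cycles-per-cache-line-update value into bandwidth.
 * bytes is the traffic volume per cache line update to account for (real or
 * effective), clock_ghz the core frequency from Table 1. */
static double bandwidth_gbs(double bytes, double clock_ghz, double cycles)
{
    return bytes * clock_ghz / cycles;
}
/* Example, Core 2 store kernel in L2 (Table 5): 8.49 cycles per line update.
 * real: 128 * 2.83 / 8.49 ~ 42.7 GB/s (with write allocate),
 * effective: 64 * 2.83 / 8.49 ~ 21.3 GB/s (visible to the application). */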
Table 4. Threaded stream triad performance, effective bandwidth

                      Threads        L1      L2      L3   Memory
Core 2 [GB/s]         1            66.1    23.7      -      4.9
                      2 (shared)  134.1    46.9      -      5.0
                      4               -       -      -      5.3
Nehalem [GB/s]        1            61.1    29.0    20.5    11.9
                      2           122.1    55.9    39.8    14.8
                      4           247.7   113.3    51.3    16.1
Shanghai [GB/s]       1            49.2    17.7     9.1     5.5
                      2            49.1    35.3    19.5     7.1
                      4           187.0    70.7    36.9     7.9
On the AMD Shanghai there is significant overhead involved in accessing the L3 cache. On Core i7 the behavior is similar to the L2 results: the store result is better than the prediction, which influences all other test cases involving a store. The effective bandwidth available to the application is dramatically higher on the Intel Core i7 owing to the inclusive cache hierarchy, while the AMD Shanghai works efficiently within the limits of its architecture but suffers from a lot of additional cache traffic due to its exclusive caches. As for main memory access, one must distinguish between the classic front-side bus concept as used with all Core 2 designs, and the newer architectures with on-chip memory controller. The former has much larger overhead, which is why Core 2 shows mediocre efficiencies of around 60 %. The AMD Shanghai, on the other hand, reaches around 80 % on all benchmarks. The Core i7 shows results better than the theoretical prediction. This can be caused either by a potential overlap between contributions, or by deficiencies in the application of the model caused by insufficient knowledge about the details of data paths inside the cache hierarchy.

5.1 Multi-threaded Stream Triad Performance
An analytical extension of the model to multi-threading is beyond the scope of this work and would involve additional analysis of the cache and memory subsystems and of threaded execution. However, as bandwidth scalability of shared caches on multi-core processors is extremely important for parallel code optimization, we determine the basic scaling behavior of the cache subsystem using multi-threaded stream triad bandwidth tests. The Core 2 Quad processor used here comprises two dual-core chips in a common package (socket), each with a shared L2 cache. On the other two architectures each core has a private L2 cache and all cores share an L3 cache. For our tests, threading was implemented based on the POSIX threading library, and threads were explicitly pinned to exclusive cores. Pinning was done with a wrapper library overloading the pthread_create() function. The measurements for four threads on the Core 2 architecture did not produce reasonable results due to the large and varying barrier overhead [3].
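The pinning is done in the paper through a wrapper library overloading pthread_create(); a minimal sketch of the underlying mechanism (explicit core pinning with pthread_setaffinity_np, the wrapper-library indirection omitted, function names illustrative) might look as follows:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single core so that bandwidth measurements
 * are not distorted by the scheduler migrating threads between caches. */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Each benchmark thread first pins itself, then runs its share of the
 * triad; arg is assumed to carry the target core id. */
static void *triad_worker(void *arg)
{
    int core = *(int *)arg;
    pin_to_core(core);
    /* ... run the triad kernel on this thread's partition ... */
    return NULL;
}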
Table 5. Benchmark results and comparisons to the predictions of the diagnostic model for Load, Store, Copy and Triad kernels

L1                       Load    Store   Copy    Triad
Core 2     [%]           96.0    93.8    92.7    99.5
           CL update     4.17    4.26    4.31    8.04
           GB/s          43.5    42.5    84.1    67.7
Nehalem    [%]           97.1    95.3    94.1    96.0
           CL update     4.12    4.20    4.26    8.34
           GB/s          41.3    40.5    79.8    61.2
Shanghai   [%]           88.3    95.3    97.1    85.0
           CL update     2.27    4.20    6.18    9.41
           GB/s          67.9    36.7    49.9    49.2

L2                       Load    Store   Copy    Triad
Core 2     [%]           83.1    94.1    74.9    70.4
           CL update     7.21    8.49   13.34   22.72
           GB/s          25.1    42.7    40.7    31.9
           eff. GB/s     -       21.3    27.2    23.9
Nehalem    [%]           83.5   120.9    91.4    91.7
           CL update     7.18    6.61   10.94   17.45
           GB/s          23.7    51.5    46.7    39.0
           eff. GB/s     -       25.7    31.1    29.3
Shanghai   [%]           74.5    55.1    80.6    78.5
           CL update     8.05   13.58   17.36   25.47
           GB/s          19.2    22.7    35.6    36.4
           eff. GB/s     -       11.4    17.8    18.2

L3                       Load    Store   Copy    Triad
Nehalem    [%]           95.3   121.4   103.9    96.3
           CL update     8.39    9.88    15.4   24.91
           GB/s          20.3    34.4    33.2    27.3
           eff. GB/s     -       17.2    22.1    20.5
Shanghai   [%]           48.9    54.9    50.6    49.5
           CL update    16.36   18.20   35.53    50.7
           GB/s           9.4    16.9    17.4    18.1
           eff. GB/s     -        8.5     8.7     9.0

Memory                   Load    Store   Copy    Triad
Core 2     [%]           67.6    49.9    58.7    66.6
           CL update    29.60   72.04   88.61  108.15
           GB/s           6.1     5.0     6.1     6.7
           eff. GB/s     -        2.5     4.1     5.0
Nehalem    [%]          106.8   142.2   123.0   119.4
           CL update    14.02   18.27   29.25   42.72
           GB/s          12.1    18.6    17.4    15.9
           eff. GB/s     -        9.3    11.6    11.9
Shanghai   [%]           75.4    75.6    80.8    80.6
           CL update    23.86   42.32   61.89   84.32
           GB/s           6.5     7.3     7.4     6.9
           eff. GB/s     -        3.6     4.9     5.5
The shared L2 cache of the Core 2 scales to two cores (Table 4). This is possible by interleaving the reload and the execution of the instructions in L1 cache between the two cores, as described earlier. The same is valid for the shared L3 caches on the other two architectures. The L3 cache on the Intel Core i7 scales up to two threads but shows strong saturation with four threads.
On all architectures, a single thread is not able to saturate the memory bus completely. This can be explained by the assumption that, also for main memory access, only part of the runtime is usable for data transfer. Of course, this effect becomes more important if the data transfer time from main memory is short compared to the time it takes to update and transfer the data on the processor chip.
6 Conclusion
The proposed diagnostic model introduces a systematic approach to understand the performance of bandwidth-limited loop kernels, especially for in-cache situations. Using elementary data transfer operations we have demonstrated the basic application of the model on three modern quad-core architectures. The model explains the bandwidth results for different cache levels and shows that performance for bandwidth-limited kernels depends crucially on the runtime behavior of the instruction code in L1 cache. This work does not claim to give a comprehensive explanation for every aspect in the behavior of the three covered architectures. Work in progress involves application of the model to more relevant algorithms like, e.g., the Jacobi or Gauss-Seidel smoothers. Future work will include verification and refinements for the architectures under consideration. An important component is to fully extend it to multi-threaded applications, and to substantiate the results by hardware counter measurements.
References

1. Intel Corporation: Intel 64 and IA-32 Architectures Optimization Reference Manual. Document Number: 248966-17 (2008)
2. AMD, Inc.: Software Optimization Guide for AMD Family 10h Processors. Document Number: 40546 (2008)
3. Treibig, J., Hager, G., Wellein, G.: Multi-core architectures: Complexities of performance prediction and the impact of cache topology (2009), http://arxiv.org/abs/0910.4865
4. Schönauer, W.: Scientific Supercomputing: Architecture and Use of Shared and Distributed Memory Parallel Computers. Self-edition, Karlsruhe (2000)
5. Jalby, W., Lemuet, C., Le Pasteur, X.: WBTK: A New Set of Microbenchmarks to Explore Memory System Performance for Scientific Computing. International Journal of High Performance Computing Applications 18, 211–224 (2004)
Author Index
Acebr´ on, Juan A. I-41 Acevedo, Liesner I-51 Alania, Michael V. I-105 Alonso, Pedro I-51, I-379 Anderson, David I-276 Araya-Polo, Mauricio I-496 Arbenz, Peter II-310 Atanassov, Emanouil II-204 Auer, Ekaterina II-408 Aversa, Rocco II-214 Avolio, Maria Vittoria II-495 Bader, Joel S. II-280 Bala, Piotr II-155 Balis, Bartosz II-224 Bana´s, Krzysztof I-411, I-517 Bartosiewicz, Pavel II-466 Bartuschat, Dominik I-557 Belletti, Francesco I-467 Bemmerl, Thomas I-576 Benedyczak, Krzysztof II-155 Berger, Simon A. II-270 Berli´ nska, Joanna II-1 B´ezy-Wendling, Johanne I-289 Bielecki, Wlodzimierz I-196 Bientinesi, Paolo I-387, I-396 Bierbaum, Boris I-576 Blaheta, Radim I-266 Boeke, Jef D. II-280 Boltuc, Agnieszka I-136 Borkowski, Janusz II-82, II-234 Bos, Joppe W. I-477 Botinˇcan, Matko II-62 Bouguerra, Mohamed-Slim I-206 Bouvry, Pascal I-31, II-102, II-547 Breitbart, Jens I-486 Bryner, J¨ urg II-310 Brzezi´ nski, Jerzy I-216 Bubak, Marian II-224 Burns, Randal II-280 Campos, Fernando O. I-439 Carean, Tudor II-145 Cela, Jos´e Mar´ıa I-496 Chandrasegaran, Srinivasan II-280
ˇ Ciegis, Raimondas II-320 Ciglan, Marek II-165 Clayton, Sarah II-505 Clematis, Andrea II-174 Corana, Angelo II-174 Cunha, Jos´e C. II-370 Cytowski, Maciej I-507 Czech, Zbigniew J. I-146 D’Agostino, Daniele II-174 D’Alimonte, Davide II-370 D’Ambrosio, Donato II-495 de la Cruz, Ra´ ul I-496 Denis, Christophe II-330 Desell, Travis I-276 Digas, Boris II-340 Di Martino, Beniamino II-214 Domagalski, Piotr I-256 Donini, Renato II-214 dos Santos, Rodrigo W. I-439 Drozdowski, Maciej II-92 Dunlop, Dominic II-102 Dymond, Jessica S. II-280 Dymova, Ludmila II-418, II-427 Dziubecki, Piotr I-256 Dzwinel, Witold I-312, I-322 Ebenlendr, Tom´ aˇs II-11, II-52 Eckhardt, Wolfgang I-567 Emans, Maximilian II-350 Fedorov, Mykhaylo II-360 Flasi´ nski, Mariusz I-156 Fraser, David L. I-1 Fras, Mariusz I-246 Funika, Wlodzimierz II-115, II-125 Galizia, Antonella II-174 Ganzha, Maria II-135 Garcia, Victor M. I-51 Gautier, Thierry I-206 Gepner, Pawel I-1, I-299 Gepner, Stanislaw I-61 Gera, Martin II-165 Giegerich, Robert II-290
Giraud, Mathieu II-290 Glavan, Paola II-62 Godlewska, Magdalena II-244 Gorawski, Marcin I-166 Guidetti, Marco I-467 Gurov, Todor II-204 Haase, Gundolf I-439 Habala, Ondrej II-165 Hager, Georg I-615 Hluchy, Ladislav II-165 Ignac, Tomasz I-31 Igual, Francisco D. I-387 Iwaszyn, Radoslaw I-236 Jakl, Ondřej I-266 Jakob, Wilfried II-21 Jamro, Ernest I-115 Janczewski, Robert I-11 Jankowska, Malgorzata A. II-436 Jung, Jean-Pierre I-21 Jurczuk, Krzysztof I-289 Jurek, Janusz I-156
Kaihara, Marcelo E. I-477 Kajiyama, Tamito II-370 ˇ Kancleris, Zilvinas II-320 Kapanowski, Mariusz II-184 Karaivanova, Aneta II-204 Kawecki, Jaroslaw II-250 Kerridge, Jon II-505 Khanna, Gaurav I-486 Kirik, Ekaterina II-513 Kitowski, Jacek I-340, II-115, II-184 Klein, Wolfram II-521 Koch, Andreas I-457 Kohut, Roman I-266 Kolaczek, Grzegorz I-226 Konieczny, Dariusz I-246 Kopta, Piotr I-256 Koroˇsec, Peter II-398 Korytkowski, Marcin I-332 K¨ oster, Gerta II-521 K¨ ostler, Harald I-557 Kowalewski, Bartosz II-224 Kowalik, Michal F. I-1 Kreinovich, Vladik II-456 Kremler, Martin II-165 Kr¸etowski, Marek I-289
Kressner, Daniel I-387 Krol, Dariusz II-115 Krouglov, Dmitriy II-513 Krusche, Peter I-176 Kru˙zel, Filip I-517 Krysinski, Michal I-256 Kryza, Bartosz II-115 Kubica, Bartlomiej Jacek II-446 Kuczynski, Tomasz I-256 Kulakowski, Konrad II-529 Kulakowski, Krzysztof II-539 Kurowski, Krzysztof I-256 Kurowski, Marcin J. II-380 Kuta, Marcin I-340, II-184 Kwiatkowski, Jan I-236, I-246 Lankes, Stefan I-576 Laskowski, Eryk I-586 L awry´ nczuk, Maciej I-350 Le´sniak, Robert I-299 Lessig, Christian I-396 Lewandowski, Marcin II-155 Liebmann, Manfred I-439 Lindb¨ ack, Leif II-194 Lirkov, Ivan II-135 Ludwiczak, Bogdan I-256 Lupiano, Valeria II-495 Luttenberger, Norbert I-403 Maciol, Pawel I-411 Magdon-Ismail, Malik I-276 Maiorano, Andrea I-467 Majewski, Jerzy I-61 Malafiejska, Anna I-11 Malafiejski, Michal I-11 Ma´ nka-Kraso´ n, Anna II-539 Manne, Fredrik I-186 Mantovani, Filippo I-467 Marcinkowski, Leszek I-70 Marowka, Ami I-596 Marton, Kinga II-145 Ma´sko, L ukasz I-586, II-31 Maslennikowa, Natalia I-80 Maslennikow, Oleg I-80 Meister, Andreas II-521 Melnikova, Lidiya II-340 Mikanik, Wojciech I-146 Mikushin, Dmitry I-525 Mokarizadeh, Shahab II-194 Mounie, Gregory II-31 My´sli´ nski, Szymon I-156
Nabrzyski, Jaroslaw I-256 Newberg, Heidi I-276 Newby, Matthew I-276 Nikolow, Darin II-184 Nowiński, Aleksander II-155 Numrich, Robert W. II-68 Olas, Tomasz I-125, I-299 Olson, Brian S. II-280 Olszak, Artur II-260 Palkowski, Marek I-196 Paprzycki, Marcin II-135 Paszyński, Maciej I-95 Patwary, Md. Mostofa Ali I-186 Paulino, Hervé II-74 Pawliczek, Piotr I-312 Pawlik, Marcin I-246 Pęgiel, Piotr II-125 Pérez-Arjona, Isabel I-379 Perfilieva, Irina II-456 Peters, Hagen I-403 Pilarek, Mariusz II-427 Pinel, Frédéric II-547 Piontek, Tomasz I-256 Piotrowski, Zbigniew P. II-380 Placzek, Bartlomiej II-553 Plank, Gernot I-439 Plaszewski, Przemyslaw I-411 Pogoda, Marek II-184 Portz, Andrea II-561 Progias, Pavlos II-569 Przystawik, Andreas I-276 Quarati, Alfonso II-174 Quintana-Ortí, Enrique S. I-387 Quinte, Alexander II-21
I-387
Ratuszniak, Piotr I-80 Rauh, Andreas II-408 Remiszewski, Maciej I-507 Ren, Da Qi I-421 Repplinger, Michael I-429 Richardson, Sarah M. II-280 Rocha, Bernardo M. I-439 Rocki, Kamil I-449 ´ Rodr´ıguez-Rozas, Angel I-41 Rojek, Krzysztof I-535 Rokicki, Jacek I-61 Rongo, Rocco II-495
Rosa, Bogdan II-380, II-388 Rozenberg, Valerii II-340 Runje, Davor II-62 Sakho, Ibrahima I-21 S´ anchez–Morcillo, Victor J. I-379 Schadschneider, Andreas II-575 Scherer, Rafal I-332, I-360 Schifano, Sebastiano Fabio I-467 Schulz-Hildebrandt, Ole I-403 Sendor, Jakub II-115 Seredynski, Franciszek II-42, II-585 Seredynski, Marcin I-31 Sergiyenko, Anatolij I-80 Sevastjanov, Pavel II-466 Seyfried, Armin II-561, II-575 Seznec, Andre II-145 Sgall, Jiˇr´ı II-52 Shehu, Amarda II-280 ˇ Silc, Jurij II-398 Sirakoulis, Georgios Ch. II-569 Skalkowski, Kornel II-115, II-184 Skalna, Iwona II-475, II-485 Skinderowicz, Rafal I-146 ˇ Slekas, Gediminas II-320 Slota, Renata II-115, II-184 Slusallek, Philipp I-429 Smolka, Maciej II-250 Smyk, Adam I-547 Sobaniec, Cezary I-216 Soszy´ nski, Igor I-507 Spataro, William II-495 Spigler, Renato I-41 Srijuntongsiri, Gun I-369 Stamatakis, Alexandros II-270 Starczewski, Janusz T. I-360 Star´ y, Jiˇr´ı I-266 Steffen, Peter II-290 Stelling, J¨ org II-300 Stepanenko, Victor I-525 Stock, Florian I-457 Stpiczy´ nski, Przemyslaw I-87 Stucky, Karl-Uwe II-21 St¨ urmer, Markus I-557 Suciu, Alin II-145 Suda, Reiji I-421, I-449 S¨ uß, Wolfgang II-21 Switalski, Piotr II-42 Szaban, Miroslaw II-585 Szejnfeld, Dawid I-256
627
628
Author Index
Szustak, L ukasz I-535 Szymanski, Boleslaw K. I-276 Szymczak, Arkadiusz I-95 Takahashi, Daisuke I-606 Tarnawczyk, Dominik I-256 Taˇskova, Katerina II-398 Terzer, Marco II-300 Tiskin, Alexander I-176 Tkacz, Kamil II-466 Tobler, Christine II-310 Tomas, Adam I-80 Tran, Viet II-165 Treibig, Jan I-615 Tripiccione, Raffaele I-467 Trystram, Denis I-206, II-31 Tudruj, Marek I-547, I-586, II-31, II-234 Urquhard, Neil
II-505
Vardaki, Emmanouela II-569 Varela, Carlos A. I-276 Varrette, S´ebastien II-102 Vavasis, Stephen A. I-369 Venticinque, Salvatore II-214 Vidal, Antonio M. I-51 Vincent, Jean-Marc I-206 Violino, Gabriele II-194
Vlassov, Vladimir II-194 Vutov, Yavor II-135 Wang, Lian-Ping II-388 Wasilewski, Adam I-226 Waters, Anthony I-276 Wawrzynczak, Anna I-105 Wawrzyniak, Dariusz I-216 Wcislo, Rafal I-322 Weinzierl, Tobias I-567 Wiatr, Kazimierz I-115 Wielgosz, Maciej I-115 Wiszniewski, Bogdan II-244 Witkowski, Krzysztof I-256 II-529 Was, Jaroslaw W´ ojcik, Wojciech I-340 Wolniewicz, Malgorzata I-256 Wo´zniak, Adam II-446 Wozniak, Marcin I-125 Wrzeszcz, Michal I-340 Wyrzykowski, Roman I-125, I-299, II-427 Yurgel’yan, Tat’yana
II-513
Zaj´ıˇcek, Ondˇrej II-52 Zibordi, Giuseppe II-370 Ziemianski, Michal Z. II-380 Zieniuk, Eugeniusz I-136