Proceedings of the International Conference ParCo2001
PARALLEL COMPUTING Advances and Current Issues
Naples, Italy
4-7 September 2001
Editors
G. R. Joubert Clausthal University of Technology, Clausthal, Germany
A. Murli University of Naples "Federico II", Naples, Italy
F. J. Peters Philips Research, Eindhoven, The Netherlands
M. Vanneschi University of Pisa, Italy
Imperial College Press
Published by Imperial College Press 57 Shelton Street Covent Garden London WC2H 9HE Distributed by World Scientific Publishing Co. Pte. Ltd. P O Box 128, Farrer Road, Singapore 912805 USA office: Suite IB, 1060 Main Street, River Edge, NJ 07661 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 1-86094-315-2
Printed in Singapore by Uto-Print
CONFERENCE COMMITTEE
Gerhard R. Joubert (Germany/Netherlands) (Conference Chair)
Almerico Murli (Italy)
Frans J. Peters (Netherlands)
Roberto Vaccaro (Italy)
Marco Vanneschi (Italy)

STEERING COMMITTEE
Frans Peters (Netherlands) (Chair)
Bob Hiromoto (USA)
Masaaki Shimasaki (Japan)
Dennis Trystram (France)
Marian Vajtersic (Slovakia)

FINANCE COMMITTEE
Frans J. Peters (Netherlands) (Chair)

SPONSORS
Center for Research on Parallel Computing & Supercomputers (CPS) - CNR, Naples and University of Naples "Federico II"
INTERNATIONAL PROGRAM COMMITTEE
Marco Vanneschi (Italy) (Program Committee Chair)
Almerico Murli (Italy) (Program Committee Vice-Chair)
Hojjat Adeli (USA)
Lucio Grandinetti (Italy)
Andreas Reuter (Germany)
Giovanni Aloisio (Italy)
Rolf Hempel (Germany)
Dirk Roose (Belgium)
Hamid Arabnia (USA)
Rudolf Kober (Germany)
Domenico Sacca (Italy)
Farhad Arbab (Netherlands)
Norbert Kroll (Germany)
Giuseppe Serazzi (Italy)
Lidia Arcipiani (Italy)
Giuliano Laccetti (Italy)
Henk Sips (Netherlands)
Fabrizio Baiardi (Italy)
Domenico Laforenza (Italy)
Antonio Sgamellotti (Italy)
Arndt Bode (Germany)
Antonio Lagana (Italy)
Giandomenico Spezzano (Italy)
Alan Chalmers (UK)
Keqin Li (USA)
Vaidy Sunderam (USA)
Jacques Chassin de Kergommeaux (France)
Alberto Machi (Italy)
Domenico Talia (Italy)
Andrea Clematis (Italy)
Enzo Marinari (Italy)
Marco Tomassini (Switzerland)
Pasqua D'Ambra (Italy)
Mateo Valero (Spain)
Theo Ungerer (Germany)
Luisa D'Amore (Italy)
John Mellor-Crummey (USA)
Dennis Trystram (France)
Marco Danelutto (Italy)
Paul Messina (USA)
Marian Vajtersic (Slovakia)
Erik D'Hollander (Belgium)
Hermann Mierendorff (Germany)
Henk van der Vorst (Netherlands)
Koen De Bosschere (Belgium)
Giovanni Milillo (Italy)
Arjan van Gemund (Netherlands)
Giuseppe De Pietro (Italy)
Wolfgang Nagel (Germany)
Nicola Veneziani (Italy)
Jack Dongarra (USA)
Salvatore Orlando (Italy)
Heinrich Voss (Germany)
Ian Duff (UK)
Yoshio Oyanagi (Japan)
Helmut Weberpals (Germany)
Salvatore Gaglio (Italy)
Nikolay Petkov (Netherlands)
David Womble (USA)
Wolfgang Gentzsch (USA)
Wilfried Philips (Belgium)
Andrea Zavanella (Italy)
Giulio Giunta (Italy)
Erhard Rahm (Germany)
Hans Zima (Austria)
ADDITIONAL REFEREES
Marco Aldinucci (Italy)
Yves Denneulin (France)
Hans-Peter Kersken (Germany)
Paolo Palmerini (Italy)
Alberto Apostolico (Italy)
Daniela di Serafino (Italy)
Jochen Kreuzinger (Germany)
Enric Pastor (Spain)
Eduard Ayguade (Spain)
J. Diaz
Norbert Kroll (Germany)
Matthias Pfeffer (Germany)
Kristof Beyls (Belgium)
Jose Duato (Spain)
Uwe Lehmann (Germany)
Francesco Scarcello (Italy)
Steeve Champagneux (France)
Bernardo Favini (Italy)
Giuseppe Lo Re (Italy)
R. Sommerhalder (Netherlands)
Guilhem Chevalier (France)
Uwe Fladrich (Germany)
Maria Lucka (Austria)
Gerardo Toraldo (Italy)
Massimo Coppola (Italy)
Ilio Galligani (Italy)
Giuseppe Manco (Italy)
Salvatore Vitabile (Italy)
Benedicte Cuenot (France)
Serge Gratton (France)
Eduard Mehofer (Austria)
Manuela Winkler (Germany)
Marco D'Apuzzo (Italy)
Mario Guarracino (Italy)
Lorenzo Muttoni (Italy)
Yijun Yu (Belgium)
Monica De Martino (Italy)
Daniel Ortega
Juan Guillen Scholten (Netherlands)
Peter Zoeteweij (Netherlands)
PREFACE

The international conference ParCo2001 was held in September 2001 in Naples, Italy. This biennial conference, which is the longest running and most prestigious international conference on all aspects of parallel computing in Europe, again formed a milestone in assessing the status quo and highlighting future trends. Whereas many aspects of parallel computing have already become part of mainstream computing, challenging new application areas are opening up. These aspects were highlighted by the invited speakers and the panel discussion. Together with the contributed papers and the mini-symposia, an overall scenario was presented of, on the one hand, consolidation of parallel computing technologies and, on the other, emerging new areas of research and development. New areas in which parallel computing is fast becoming a strategic technology are image and video processing, multimedia applications, financial modelling, and data warehousing and mining, to name but a few. New definitions of the parallel computing paradigm in the form of cluster and grid computing are gradually reaching the stage where their widespread application to a multitude of problem areas will become a viable option.

With the ParCo conferences the emphasis has always been on quality rather than quantity. This approach resulted in the decision to run these conferences on a biennial basis, such that noticeable strides in technology development can be highlighted. Furthermore, all contributions were in the past reviewed during and after the conference. This latter approach had the disadvantage that proceedings were only available many months after the conference. In an attempt to shorten the time to publication in the case of ParCo2001, authors were requested to submit full versions of papers for the first review and selection process. Electronic versions of revised papers were made accessible to all registered delegates before the start of the conference. A final revision during and after the conference resulted in the papers included in these proceedings.

It should be noted that papers presented as part of the various mini-symposia are not included in the printed proceedings. Their inclusion would have seriously delayed publication. Such papers will be considered for publication in Special Issues of the Parallel Computing journal.
The editors are indebted to the members of the International Program Committee, the Steering Committee and the Organising Committee for the time they spent in making this conference such a successful event. Special thanks are due to the staff of the Center for Research on Parallel Computing & Supercomputers (CPS) - CNR in Naples for their enthusiastic support. In this regard particular mention should be made of the key roles played by Almerico Murli (Organising Committee Chair), Pasqua D'Ambra and Luisa D'Amore in making this event such a great success.

November 2001

Gerhard Joubert, Germany
Almerico Murli, Italy
Frans Peters, Netherlands
Marco Vanneschi, Italy
CONTENTS
Committees
v
Preface
ix
Invited Papers
1
Deploying Parallel Numerical Library Routines to Cluster Computing in a Self Adapting Fashion J. J. Dongarra and K. J. Roche
3
GRID: Earth and Space Science Applications Perspective L. Fusco
31
e-Science, e-Business and the Grid T. Hey
33
Graph Partitioning for Dynamic, Adaptive and Multi-Phase Scientific Simulations V. Kumar, K. Schloegel and G. Karypis
34
Challenges and Opportunities in Data-Intensive Grid Computing P. Messina
40
Applications
41
Giant Eigenproblems from Lattice Gauge Theory on CRAY T3E Systems N. Attig, Th. Lippert, H. Neff, J. Negele and K. Schilling
43
Parallel Consistency Checking of Automotive Product Data W. Blochinger, C. Sinz and W. Küchlin
50
Implementation of an Integrated Efficient Parallel Multiblock Flow Solver T. Bonisch and R. Ruhle
58
Distributed Simulation Environment for Coupled Cellular Automata in Java M. Briesen and J. Weimar
66
Human Exposure in the Near-Field of a Radiobase-Station Antenna: A Numerical Solution Using Massively Parallel Systems L. Catarinucci, P. Palazzari and L. Tarricone
75
Tranquillity Mapping Using a Network of Heterogeneous PC A. Clematis, M. De Martino, G. Alessio, S. Bini, and S. Feltri
83
Parallel Skeletons and Computational Grain in Quantum Reactive Scattering Calculations S. Crocchianti, A. Lagana, L. Pacifici and V. Piermarini
91
Parallel Image Reconstruction Using ENO Interpolation J. Czerwinska and W. E. Nagel
101
Training On-Line Radial Basis Function Networks on a MIMD Parallel Computer A. D'Acierno
109
Parallel Simulation of a Cellular Landslide Model Using Camelot G. Dattilo and G. Spezzano
117
Parallel Numerical Simulation of Pyroclastic Flow Dynamics at Vesuvius T. Esposti Ongaro, C. Cavazzoni, G. Erbacci, A. Neri and G. Macedonio
125
A Fast Domain Decomposition Algorithm for the Simulation of Turbomachinery Flows P. Giangiacomo, V. Michelassi and G. Chiatti
133
Massively Parallel Image Restoration with Spatially Varying Point-Spread-Functions G. Gorman, A. Shearer, N. Wilson, T. O'Doherty and R. Butler
141
Asynchronous Algorithms for Problem of Reconstruction from Total Image N. M. Gubareni
149
Parallel Flood Modeling L. Hluchy, G. T. Nguyen, L. Halada and V. D. Tran
157
The Xyce Parallel Electronic Simulator — An Overview S. Hutchinson, E. Keiter, R. Hoekstra, H. Watts, A. Waters, T. Russo, R. Schells, S. Wix and C. Bogdan
165
A Breakthrough in Parallel Solutions of MSC. Software L. Komzsik, S. Mayer, P. Poschmann, P. Vanderwalt, R. Sadeghi, C. Bruzzo and V. Giorgis
173
Implementation of a Parallel Car-Parrinello Code on High Performance Linux-Based Clusters S. Letardi, M. Celino, F. Cleri, V. Rosato, A. De Vita and M. Stengel
181
PRAN: Special Purpose Parallel Architecture for Protein Analysis A. Marongiu, P. Palazzari and V. Rosato
189
Design of a Parallel and Distributed Web Search Engine S. Orlando, R. Perego and F. Silvestri
197
Active Contour Based Image Segmentation: A Parallel Computing Approach V. Positano, M. F. Santarelli, A. Benassi, C. Pietra and L. Landini
205
Parallelization of an Unstructured Finite Volume Solver for the Maxwell Equations J. Rantakokko and F. Edelvik
213
An Hybrid OpenMP-MPI Parallelization of the Princeton Ocean Model G. Sannino, V. Artale and P. Lanucara
222
A SIMD Solution to Biosequence Database Scanning B. Schmidt, H. Schroder and T. Srikanthan
230
Parallel Program Package for 3D Unsteady Flows Simulation E. Shilnikov and M. A. Shoomkov
238
Parallel Lossless Compression Algorithm for Medical Images by Using Wavefront and Subframe Approaches A. Wakatani
246
Adaptivity and Parallelism in Semiconductor Device Simulation H. Weberpals and S. Thomas
254
Algorithms
263
Analysis of Communication Overhead in Parallel Clustering of Large Data Sets with P-AutoClass S. Basta and D. Talia
265
A Coarse-Grain Parallel Solver for Periodic Riccati Equations P. Benner, R. Mayo, E. S. Quintana-Orti and V. Hernandez
274
Preconditioning of Sequential and Parallel Jacobi-Davidson Method L. Bergamaschi, G. Pini and F. Sartoretto
282
Parallel Checkpointing Facility on a Metasystem Y. Cardinale and E. Hernandez
290
An Efficient Dynamic Programming Parallel Algorithm for the 0-1 Knapsack Problem M. Elkihel and D. El Baz
298
Parallel Algorithms to Obtain the Slotine-Li Adaptive Control Linear Relationship J. C. Fernandez, L. Peñalver and V. Hernandez
306
A High-Performance GEPP-Based Sparse Solver A. Gupta
314
System Model for Image Retrieval on Symmetric Multiprocessors O. Kao
322
Granularity and Programming Paradigms in Parallel MPP Image Coding R. Norcen and A. Uhl
330
Parallel Quasi-Newton Optimization on Distributed Memory Multiprocessors I. Pardines and F. F. Rivera
338
A Parallel Condensation-Based Method for the Structural Dynamic Reanalysis Problem A. Pelzer and H. Voss
346
Combined MPI/OpenMP Implementations for a Stochastic Programming Solver D. Rotiroti, C. Triki and L. Grandinetti
354
The Same PDE Code on Many Different Parallel Computers W. Schonauer, T. Adolph and H. Hafner
362
What Do We Gain from Hyper-Systolic Algorithms on Cluster Computers? W. Schroers, Th. Lippert and K. Schilling
370
Leader Election in Wrapped Butterfly Networks W. Shi, A. Bouabdallah, D. Talia and P. K. Srimani
382
Software Technology and Architectures
391
Assembling Dynamic Components for Metacomputing Using CORBA A. Amar, P. Boulet and J. Dekeyser
393
Simulation-Based Assessment of Parallel Architectures for Image Databases T. Bretschneider, S. Geisler and O. Kao
401
Structured Parallel Programming and Shared Objects: Experiences in Data Mining Classifiers G. Carletti and M. Coppola
409
Access Histories Versus Segment Histories for Datarace Detection M. Christiaens, M. Ronsse and K. De Bosschere
417
On Skeletons & Design Patterns M. Danelutto
425
A Portable Middleware for Building High Performance Metacomputers M. Di Santo, F. Frattolillo, E. Zimeo and W. Russo
433
Using a Parallel Library of Sparse Linear Algebra in a Fluid Dynamics Application Code on Linux Clusters S. Filippone, P. D'Ambra and M. Colajanni
441
Performance Evaluation of a Graphic Accelerators Cluster M. R. Guarracino, G. Laccetti and D. Romano
449
Analysis and Improvement of Data Locality for the Transposition of a Sparse Matrix D. B. Heras, J. C. Cabaleiro and F. F. Rivera
457
Automatic Multithreaded Parallel Program Generation for Message Passing Multiprocessors Using Parameterized Task Graphs E. Jeannot
465
Parallel Launcher for Cluster of PC C. Martin and O. Richard
473
Parallel Program Development Using the MetaPL Notation System N. Mazzocca, M. Rak and U. Villano
481
PAVIS: A Parallel Virtual Environment for Solving Large Mathematical Problems D. Petcu and D. Gheorchiu
490
Development of Parallel Paradigms Templates for Semi-Automatic Digital Film Restoration Algorithms G. Sardisco and A. Machi
498
Exploiting the Data-Level Parallelism in Modern Microprocessors for Neural Network Simulation A. Strey and M. Bange
510
Extending the Applicability of Software DSM by Adding User Redefinable Memory Semantics B. Vinter, O. J. Anshus, T. Larsen and J. M. Bjorndalen
518
Industrial Perspective
527
Porting and Optimizing Cray T3E Applications to Compaq AlphaServer SC Series J. Pareti
529
Author Index
541
INVITED PAPERS
DEPLOYING PARALLEL NUMERICAL LIBRARY ROUTINES TO CLUSTER COMPUTING IN A SELF ADAPTING FASHION

KENNETH J. ROCHE, JACK J. DONGARRA

Department of Computer Science, The University of Tennessee, 203 Claxton Complex, Knoxville, Tennessee 37996-3450
roche, [email protected]
This paper discusses middleware under development which couples cluster system information with the specifics of a user problem to launch cluster based applications on the best set of resources available. The user is responsible for stating his numerical problem and is assumed to be working in a serial environment. He calls the middleware to execute his application. The middleware assesses the possibility of solving the problem faster on some subset of available resources based on information describing the state of the system. If so, the user's data is redistributed over the subset of processors, the problem is executed in parallel, and the solution is returned to the user. If it is no faster, or slower, the user's problem is solved with the best appropriate serial routine. The feasibility of this approach is empirically investigated on a typical target system and results reported which validate the method.
1 Overview
On multi-user computing systems in which the resources are dedicated but the allocation of resources is controlled by a scheduler, one expects a job, once allocated, to finish execution in a predictable amount of time. In fact, the user on such a system usually submits a job through a batch script which requires an upper bound on the predicted runtime of the task being queued. The user is billed for the entire time requested and thus has a responsibility to himself and other users to understand the behavior of his code. Scheduling schemes on such systems attempt to order the computations in a fair manner so that user jobs show progress towards completion in a timely fashion while the overall system throughput is maximized. One problem with this approach is that such a scheduling scheme is easier to talk about than to implement (ref. [1]). It is a fact of life that the multi-processor scheduling problem is NP-complete in the strong sense.(a) Thus, developers must look for algorithms which are efficient. This is a difficult and time consuming task (which can't really be avoided on production scale or commodity clusters such as those at national laboratories and supercomputing centers). The large variance in the average duration and demand of user jobs, for instance, further complicates the task. It's complicated. Even though there's no provably optimal way of addressing this problem, people do it all the time because it has to be done. We don't make further considerations of such systems in this paper.

(a) Theoretically, a decision problem Π is NP-complete in the strong sense if (Π ∈ NP) ∧ (∃ Π_p which is NP-complete). For decision problem Π and polynomial p, defined over the integers, Π_p is the restriction of Π to instances I such that Max[I] ≤ p(Length[I]). If Π is solvable by a pseudo-polynomial time algorithm, then Π_p is solvable in polynomial time. Consider the multi-processor scheduling problem Π_MS: given a finite set J of jobs, a length l(j) ∈ Z⁺ for all j ∈ J, a number m ∈ Z⁺ of processors, and a deadline D ∈ Z⁺, is there a partition J = J_1 ∪ J_2 ∪ ⋯ ∪ J_m of J into m disjoint sets such that max[ Σ_{j ∈ J_i} l(j) : 1 ≤ i ≤ m ] ≤ D? This problem is NP-complete in the strong sense and thus cannot be solved by a pseudo-polynomial time algorithm unless P = NP. (For proof see reference [2], for related information see references [3,4,5,6,7,8].)

In shared, multi-user computing environments, such as clusters of workstations in a local area network, the notion of determinism in computations can be lost due to resource contention. Thus, successive runs of the same linear algebraic kernel with the same problem parameters, for instance, may result in grossly different wall clock times for completion due to the variability of the work load on the CPUs from competing tasks. If users compute responsibly and in coordination with one another such systems can be and are successful. (Administrators intervene otherwise to mediate serious contention problems.) This approach is often more efficient, for instance, in the developmental stage of parallel application codes due to the constant testing and debugging of software, or in groups where the average user job tends not to saturate the system resources and runs to completion on a relatively short time scale (e.g. minutes or even hours). One way for the user to possibly make better use of the available resources in such an environment is to employ the information describing the state of the computational system at runtime to select an appropriate subset of resources for the specific kernel at hand. It is acknowledged that making low risk predictions in such a system when multiple users are sharing the resources cannot be done with certainty. There is nothing to preclude the event that the demand on the resources may change dramatically in the time which transpires between deciding on a set of resources and getting one's problem set up and ready to go on the resources. Nonetheless, it seems negligent not to try to use available system related data at runtime. In the very least a user can identify saturated resources in the system and avoid allocating them for his/her run. In the event that the overall system behavior fluctuates about some time sensitive normal level of activity, then a statistical analysis of the collected data can be used as the basis for a predictive model of the system's behavior at some future time. Software, such as NWS, the
Network Weather Service, operates sensors in a distributed computing environment and periodically (in time) collects measured data from them (ref. [9]). NWS includes sensors for end-to-end TCP/IP performance (bandwidth and latency), available CPU percentage, and available non-paged memory. The collected data is kept and analyzed as a time series which attempts to forecast the future behavior of the system through low order ARMA (autoregressive moving average) methods.

This paper discusses software being developed which couples system information with information specifically related to the numerical kernel of interest. The model being used is that a user is assumed to contact the middleware through a library function call during a serial run. The middleware assesses the possibility of solving the problem faster on some subset of available resources based on information describing the state of the system. If so, the user's data is redistributed over the subset of processors, the problem is executed in parallel, and the solution is returned to the user. If it is no faster, or slower, the user's problem is solved with the best appropriate serial routine. It is conjectured that if the underlying application software is scalable then there will be a problem size which marks a turning point, N_tp, for which the time saved because of the parallel run (as opposed to the best serial runs for the same problem) will be greater than the time lost moving the user's data around. At this value, such software is deemed useful in the sense that it provides an answer to the user's numerical question faster than had the user done as well as an expert working in the same serial environment. That is, it benefits even the expert user working on a single node of the shared system to use the proposed software for problem sizes in which N_user > N_tp.

As a case study we consider the problem of solving a system of dense, linear equations on a shared cluster of workstations using the ScaLAPACK software (ref. [10]). A discussion of some specific implementations tested is made and results of selected experiments are presented. It is observed that even with naive data handling the conjecture is validated in a finite ensemble of test cases. Thus there is motivation for future studies in this area. It is also observed that the expert user in the parallel environment can always complete the dense, algebraic task at hand faster than the proposed software. (There are no clear winners for small problem sizes since each approach solves the problem serially with the best available version of the library routine.) This is no surprise since even in the most ideal cases, the proposed software has to touch the user's data at least enough to impart the relevant data structure expected by the logical process grid. The parallel expert, on the other hand, is assumed to be able to generate relevant data structures in-core, in-parallel at the time of the distributed run. This user also knows how to initialize the
numerical library routine, and make compilation time optimizations. He/she is probably not the typical scientist who has likely already labored just to reduce their problem to linear algebra. There are, in fact, many details to account for by any user before the parallel kernel runs correctly. Reporting the results of this expert user provides a basis for comparison to the other projected users and scenarios.

2 Numerical libraries in shared, homogeneous, distributed environments

2.1 The computing environment
In the development of the current investigation heterogeneous grid computing systems have not been the central focus. (See references [11,12,13,14,15,16,17].) However, it is noteworthy that one of the goals in resource selection when considering a pool of heterogeneous (and potentially geographically distributed) candidate resources is to achieve as much homogeneity in the allocated resources as possible. Here is at least one complication of scheduling in a shared distributed system which is general to both heterogeneous and homogeneous systems: the scheduler of resources for a task in question has to try and allocate resources which not only look homogeneous at the instant of inquiry, but remain as homogeneous as possible for the duration of time that the task is in (parallel) execution. In short, even if we could solve the general multi-processor scheduling problem at some specific instant in time, we can't count on this partitioning to assist us in forecasting the state of the system resources at some time in the future. This is due to the fluctuating properties of system resources which one can observe in a shared environment.

This brief subsection intends to describe the notion of homogeneity in the context of the current study. Some sample results of timing various operations in one of the systems tested demonstrate the notion as it is observed empirically. In complex mechanical systems the notion of homogeneity usually implies that the system behaves in a predictable manner when performing a specific task only in the absence of external influences. If this definition applies to computational systems (see Figure 1 for a sample computing environment), then it cannot be that a shared set of resources alone, such as a cluster of workstations in a local area network, is homogeneous. Usually such a system is only meaningful when responding to a user's requests. Users' requests are developed externally and then serviced by the system at runtime. Since there is no way to know when a user intends to make requests in such an open
Figure 1. The figure is a diagram of part of the local area network in which many of the current investigations were made: user workstations, a Sun NFS file server (RPC/UDP), a remote memory server (IBP, TCP), and the cluster of workstations behind 100 Mbit and Gbit switches through which it is fully connected. It is noteworthy for the purposes of interpreting some of the results presented in this paper that the memory depot and network file server are separate machines in reality sitting on a shared network. The clusters on which we have developed the current study are removed from the shared network through one of two switches through which the cluster of workstations is said to be fully connected. It is a factor for the types of studies we have made that there is only one 100 Mbit line for all of the network flow to and from the network disk or the memory server.
system, any specific task, such as solving a set of linear equations, is likely to exhibit different total wall times for completion on subsequent runs. So what does one mean by a shared homogeneous, distributed computing environment? Naively, it is assumed that the hardware specifications and available software are duplicated between compute nodes on such a system. This is not enough however (and may not be necessary). The notion of homogeneity has meaning only in terms of some specific system state engaged in some specified activity as observed in an ensemble of test cases. Let us elaborate on this thought a little. Physical system parameters, when observed at equidistant time intervals and kept in collections of ordered data, comprise a time series, (see ref-
erences [18,19,20,21]). Because of the inherent fluctuations in the system parameters, CPU loads on a shared cluster for instance, the time series of the state of the shared system is a non-deterministic function. (For instance, the activity level of system resources often reflects the underlying nature of the humans which use the system. At lunch time, dinner time, or bed time one often observes a decrease in overall system activity. In the morning, one often observes some adiabatic growth of activity in the system as users arrive and begin working. This growth continues until some normal level of system activity is achieved. For some duration of time, one may expect the system resource activity levels to fluctuate around this normal. But the point is that the activity norm is time of day dependent more often than not.) Non-deterministic time series can only be described by statistical laws or models. To study such systems formally one assumes that a time series can only be described at a given instant in time, t, by a (discrete) random variable, X_t, and its associated probability distribution, f_{X_t}. Thus, an observed time series of the system parameters can be regarded as one realization of an infinite ensemble of functions which might have been generated by a stochastic process, in this case multi-users sharing a common set of resources as a function of time. Stochastic processes are strictly stationary when the joint probability distribution of any set of observations is invariant in time and are particularly useful for modeling processes whose parameters tend to remain in equilibrium about a stationary mean. Any stationary stochastic process can be described by estimating the statistical mean μ (x̄ = n⁻¹ Σ_{t=1..n} x_t), its variance σ² (s_x² = n⁻¹ Σ_{t=1..n} (x_t − x̄)²), the autocovariance function (c_xx(k) = n⁻¹ Σ_{t=k+1..n} (x_t − x̄)(x_{t−k} − x̄), e.g. the extent to which two random variables are linearly independent), and the sample autocorrelation function, which is a kind of correlation coefficient (r_xx(k) = c_xx(k)(c_xx(0))⁻¹, k = 0, ..., n−1). A discrete random process for which all the observed random variables are independent is the simplest form of a stationary stochastic process. For this process, the autocovariance is zero (for all lags not zero) and thus such a process is referred to as purely random, or white noise. Usually, the observable system parameters such as CPU loads and available memory are not independent. Thus, the notion of homogeneity is manifest only in observing a specific task on the system, such as data I/O or multiplying matrices in-core, repeatedly under normal system activity on each of the computing nodes said to comprise the system. An expectation value (μ) for the specified task will be formed for each unit only in this time tested manner.(b) One can subsequently compare the results from each of the units and determine a level of similarity between them. Ideally, the time to complete any serial task would be the same within some standard deviation (approximated by the square root of the variance) regardless of the compute node executing it. Further, the error bars should ideally tend to zero as the number of observations tends to infinity. This is not achievable in practice, clearly. (For non-stationary processes, one filters the available data sets thus transforming the problem into a stationary form.)

Figures 2, 3, 4, and 5 illustrate the results of portions of an empirical study on one of the systems used to develop the current investigation. In each of the figures, successive runs of a task are time stamped and recorded to file. The results presented were analyzed statistically and only the mean and root of the variance are reported. The runs were conducted over the course of weeks and, as much as possible, during the same hours of the day (10am until 5pm, EST). No attempt was made to push users off the shared system except during development of some test cases.

In Figure 2, an assessment of the CPU of each node when executing (serial runs) a numerical kernel rich in floating point operations as a function of the numerical problem size is made. In this case the time to solution and performance are reported and there is a clear correlation in the expected behavior of each node.

Figure 3 is composed of three plots. The plots look at the read and write I/O times per node as a function of bytes on each node's local disk, on the local network users' disk (in our case operating under Sun's NFS utilizing RPCs, UDP/IP), and on a memory server (running IBP(c) (ref. [23]), TCP/IP) on the same local network but with a different IP address from any node on the cluster or the NFS server itself. In plot one, the local disk accesses, one can only imagine that sporadic user activity is responsible for the larger variances in some of the reported results. To within the error bars
(b) If observational data is to be of use in developing advanced software, a standard metric has to be agreed upon and needs to be reliable across different platforms. PAPI (ref. [22]), which accesses hardware counters as well as the system clock, provides such a tool and has been used throughout this investigation when recording observations.
(c) IBP, the Internet Backplane Protocol, is software for managing and using memory on a distribution of disks across a network. Its design is really intended for large scale, logistical networking. However, because it was designed with a client/server model in mind, it is also useful for our purposes (as will be discussed further in this report). The client accesses remote resources through function calls in C, for instance. The multi-threaded servers have their own IP addresses. The client has to know this address in advance as well as to which port the server listens. Usually one sets up his own IBP depot and can choose the port. The IBP group also manages some public domain (free) depots across the country which one can use.
[Figure 2 shows two panels for the TORC cluster: the time to solve Ax=b with _GESV() (ATLAS) versus problem size N, and the performance of _GESV() in Gflops, with one curve per node torc1 through torc8.]
Figure 2. Performance and time to completion numbers for the serial, dense, linear solve routine _gesv() from ATLAS are reported as an example. CPU homogeneity in the shared cluster is a very important criterion for developing numerical software intended for distributed environments.
the wall times reported are within seconds of one another and thus invoke some sense of homogeneity. In the software designed to date, we don't make explicit use of local disk I/O.

In plot two, the accesses to the local network disk as controlled by NFS, we again see fluctuations, in particular during the UNIX system reads. This data is of particular interest to us, as will become clear in the sections to follow. It is recalled that to access the (NFS controlled) network disk, data is moved over a single shared local communication line. Further, multiple users tax NFS due to the design of the shared local file system. One expects larger variances here. Nonetheless, the results again instill some confidence within the expected error. Further, we have to deal with reality. The systems being tested are supposedly identical. However, in open systems we must work with the expectation that this notion is a fallacy; we can only make sensible predictions within some confidence limits which are set by the actual behavior of the system. It is fun, however, to guess at why the writes appear to have much tighter error bars than the reads.
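To make the estimators quoted earlier concrete, the following is a minimal C sketch, not part of the paper's tooling, that computes the sample mean, variance and lag-k autocorrelation of a series of repeated wall-clock timings read from standard input; the input format and the cap on the number of samples are illustrative assumptions.

```c
/* timing_stats.c: illustrative sketch computing the sample mean, variance
 * and lag-k autocorrelation of repeated wall-clock timings, one value per
 * line on stdin, mirroring the estimators described in the text.          */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define MAXN 100000

static double autocov(const double *x, int n, double mean, int k)
{
    /* c_xx(k) = n^-1 * sum_{t=k+1..n} (x_t - mean)(x_{t-k} - mean) */
    double c = 0.0;
    for (int t = k; t < n; t++)
        c += (x[t] - mean) * (x[t - k] - mean);
    return c / n;
}

int main(void)
{
    static double x[MAXN];
    int n = 0;
    while (n < MAXN && scanf("%lf", &x[n]) == 1)
        n++;
    if (n < 2) { fprintf(stderr, "need at least 2 samples\n"); return 1; }

    double mean = 0.0;
    for (int t = 0; t < n; t++) mean += x[t];
    mean /= n;

    double c0 = autocov(x, n, mean, 0);        /* sample variance s_x^2     */
    printf("n = %d  mean = %.6g s  std = %.6g s\n", n, mean, sqrt(c0));

    for (int k = 1; k <= 5 && k < n; k++)      /* r_xx(k) = c_xx(k)/c_xx(0) */
        printf("r_xx(%d) = %.3f\n", k, autocov(x, n, mean, k) / c0);
    return 0;
}
```

Feeding it a per-node log of repeated run times gives the mean, an error bar, and a rough indication of how correlated successive runs are on that node.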
[Figure 3 shows read and write times versus number of bytes for the local disk, the NFS network disk, and the IBP remote memory server (TCP/IP), with one curve per node torc1 through torc8.]
Figure 3. Empirical study of what is meant by I/O homogeneity on the shared cluster.
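A timed, explicitly synchronized write/read loop of the kind behind measurements like those in Figure 3 can be sketched as follows; this is an assumption-laden stand-in (POSIX I/O and gettimeofday rather than the PAPI timers used in the paper), and the target path and transfer size are placeholders.

```c
/* io_probe.c: illustrative sketch timing a write of nbytes to a file and a
 * read back, with fsync() forcing data out of library/OS buffers so that
 * successive runs are comparable (cf. the buffering caveat in the text).
 * Note the read back here will typically be served from the page cache;
 * a cold read requires reading from another node or a fresh boot.         */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

int main(void)
{
    const size_t nbytes = 64u << 20;             /* 64 MB, placeholder size  */
    const char  *path   = "/nfs/scratch/probe";  /* NFS or local-disk target */
    char *buf = malloc(nbytes);
    if (!buf) return 1;
    memset(buf, 1, nbytes);

    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }
    double t0 = now();
    if (write(fd, buf, nbytes) != (ssize_t)nbytes) { perror("write"); return 1; }
    fsync(fd);                                    /* force data to the server */
    double t_write = now() - t0;
    close(fd);

    fd = open(path, O_RDONLY, 0);
    t0 = now();
    if (read(fd, buf, nbytes) != (ssize_t)nbytes) { perror("read"); return 1; }
    double t_read = now() - t0;
    close(fd);

    printf("write: %.3f s  read: %.3f s for %zu bytes\n",
           t_write, t_read, nbytes);
    free(buf);
    return 0;
}
```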
One can't be certain, of course, but maybe NFS is buffering part of the data to be written. Since NFS utilizes RPCs which use the UDP/IP protocol, the application does not require a response before sending the next message. Thus to the user, it appears as though all is well. This is a reliability issue and not addressed here. For the reads, the user doesn't report a time until he actually has his data. This may add to the fluctuations, but it is more likely due to the fact that writes require that the CPU (or I/O device) send both address and data and require no return of data, whereas usually the CPU must be waiting between sending the address and receiving the data on a read. Often, the CPU will not wait on writes.

In plot three, the results are quite remarkable in some sense. For one thing, the accesses to the memory server, IBP on a LINUX workstation in the local area, do have to utilize the shared communication line as in the NFS case. However, this server is on a different machine in the local area from that which serves as the NFS server. Thus, one suspects that there is much less overall activity on this server as opposed to the local network disk itself.

In Figure 3, there are two important points to keep in mind when considering the plots. One, for all three data sets, care must be taken to collect such numbers since buffering by various levels of software skews the numbers if successive runs aren't replicated scenarios. For instance, in the NFS runs, if one collects such data with naive nested loop approaches, one can observe (not shown) that on the first run the time to report is always larger than subsequent runs. This is because of the buffering of data by either NFS, or possibly locally. If this is not accounted for, very spurious averages are formed which do not reflect the likely reality of servicing a user's request in numerical libraries such as those we are trying to build. The point is, unless the user makes successive requests with the same data set in mind, moving his/her data will be the unbuffered case, which is considerably larger in time. This is accounted for in the presentation here. Next, the reader is advised not to make serious comparisons of plots two and three. The machines servicing these requests employ different hardware and operating systems as well as network protocols. It is noteworthy that the NFS will likely exhibit larger variances since it is subject to a larger average activity in the local area network. One should keep in mind that the requests are all generated from some arbitrary node in our target computing system; this is why we care about these observations.

Figure 4 reports available physical memory in kilobytes per node. The plot is important because it demonstrates the reality that homogeneity can't be taken for granted, despite common resources at the onset. In particular, on the cluster under observation, there are certain nodes which are more
[Figure 4 shows TORC(Homogeneity(Available Memory)): the available unpaged memory per node, in KBytes, versus machine number.]
Figure 4. The available unpaged physical memory is reported in KBytes per node. Available memory is a critical criterion in resource selection since a user's job must fit in physical memory lest it suffer serious time penalties due to page swapping during execution. The variances for these observations are large since during heavy activity levels on a node, physical memory tends to be nearly fully in use -at other times, nearly completely idle save the OS.
commonly used than others. Who knows why this is, it just is. Until now, we have ignored discussing some important observables such as time for broadcasts, sends, and receives as a function of bytes when a collection of machines from the available resources is executing in parallel. In some sense, it is up to the library developer to know his target system. In other words, the parallel application may impose the system properties of interest. In our test case, we study the solution of systems of linear equations. We will discuss this more later, but the parallel kernel is known to be rich in matrix multiplies. We've seen CPU homogeneity already per node and so expect the local performance to be a good fit for such a kernel. In addition, during the factorization of A, the kernel is also rich in broadcasts. For this reason, the behavior of broadcasts from a root node (the BLACS (ref. [24]) communication library implementation) as observed in our shared system is
[Figure 5 shows two panels, TORC(Homogeneity(Communication)): the time to broadcast versus number of bytes, and the time to send/receive versus number of bytes.]
Figure 5. The plot on the left reports the time for root to broadcast N bytes of data during a parallel run in which 8 processors are selected from the cluster under investigation. Each computing node acts as root. Although the behavior is clearly homogeneous looking, there is a spurious node in the cluster as regards parallel communication of data. The fact is exposed through an investigation of the send/receive times (point to point communication) as a function of bytes. Plot two reports results from two different nodes acting as the lead node in different runs. The lead node contacts each other node in turn and sends N bytes of data to be received.
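The published numbers use the BLACS broadcast; as a hedged stand-in, the same style of measurement written with plain MPI looks roughly like the following sketch, where the message size and the barrier-based stopping rule are illustrative choices rather than the paper's exact harness.

```c
/* bcast_probe.c: illustrative MPI stand-in for the BLACS broadcast timing
 * behind Figure 5; every rank takes a turn as root and the time is taken
 * once every process has received the data (barrier after the broadcast). */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 22;                     /* 4M doubles = 32 MB payload */
    double *buf = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) buf[i] = (double)i;

    for (int root = 0; root < size; root++) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        MPI_Bcast(buf, n, MPI_DOUBLE, root, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);           /* wait until everyone has it */
        double t = MPI_Wtime() - t0;
        if (rank == root)
            printf("root %d: %zu bytes broadcast in %.4f s\n",
                   root, n * sizeof(double), t);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```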
also reported. Each node is considered as root and results as a function of bytes broadcast timed. We wait for confirmation that each node has received the data before stopping the clock. The slowest node should always dominate -in this sense the results are meaningless for understanding the behavior of a single node's parallel communication with another particular node. The small errors are expected since the machines are fully connected through a common switch. Just in case there is bad communication between two nodes, e.g. to address an issue which the broadcast cannot, one can investigate the behavior of the point to point communications. Results are presented to this end. Two identical runs are considered (there are simply too many permutations of possible collections of allocated resources to report all such numbers here) in which the lead node is different. In the runs, root conducts BLACS sends
and receives as a function of message size in bytes (always double precision data in our sample implementation) individually to each other process in the allocated resources.

The results in this subsection are intended to help classify our use of the term homogeneous as it applies to individual compute nodes from a pool of candidate resources. The important point is to understand why this is a desirable property for the development of user friendly numerical software. Consider block factorizations from linear algebra which depend upon lower level block operations that have been tuned to be optimal (ref. [25]) on the system for which they are intended to be used. Multiplying matrices is one such operation and it is common knowledge that tuned versions of this kernel exist on most any platform. When factorizations, such as A → P⁻¹LU, are implemented in parallel it remains important for matrix multiplication to be homogeneous between nodes in the parallel run to retain load balance and performance (and in fact scalability in the parallel case). Otherwise, there may be a single node which imparts large delays on the remaining nodes or sits idle waiting for the remaining nodes during execution of the application. Although the numerical answer should be the same regardless, time is the factor which the library routine attempts to minimize. (Naturally, the same could be said of the time to communicate a set number of bytes between processors during a solve. We hope that each node in a set of allocated resources is capable of sending and receiving data at the same rate. If this is not the case, the resources can be stuck waiting. Etc.) Thus, in some sense, to within the errors of each task sampled, there is some expectation as to how the system resources will behave. There is a theoretical isotropy in the compute nodes when homogeneity can be expected. That is, we expect that any particular computing node may be of equal use to us in the selection process subject to interpreting its current state.
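Because the load balance of the parallel factorization hinges on every node delivering comparable matrix-multiply performance, a per-node probe along the following lines is one way to check the point just made; it assumes a CBLAS interface such as the one an ATLAS installation provides, and the problem size is a placeholder.

```c
/* dgemm_probe.c: illustrative per-node check of matrix-multiply
 * homogeneity; times C = A*B with the tuned CBLAS dgemm and reports the
 * achieved Gflop/s so nodes can be compared against one another.          */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <cblas.h>

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

int main(void)
{
    const int n = 1000;                        /* placeholder problem size */
    double *A = malloc(n * n * sizeof(double));
    double *B = malloc(n * n * sizeof(double));
    double *C = malloc(n * n * sizeof(double));
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    double t0 = now();
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    double t = now() - t0;

    printf("n=%d  time=%.3f s  %.2f Gflop/s\n",
           n, t, 2.0 * n * (double)n * n / t / 1e9);
    free(A); free(B); free(C);
    return 0;
}
```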
2.2 The user
Our target user has a numerical application to execute and intends to do so on a single node of a shared, distributed computing environment which exhibits a strong degree of homogeneity (as described in the previous section). It is assumed that the user invokes one of our library routines to this end. We appeal to the user who may benefit from some user friendly software which intervenes on his behalf to determine if the problem can be solved more quickly through a redistribution of the user's data onto a team of computing nodes, solved in parallel, and the solution mapped back. A standard structure of our
target user's code is:

User_Code(){
    Define_Data_Handle(data_handle(out));
    Generate_Data(data(out), data_handle(in));
    Invoke_Numerical_Application_Routine(data_handle(in/out),
        routine_name(in), routine_parameters(in));
    Use_Solution();
    Clean_Up();
    Exit();
}

The user's data will be double precision matrix and vector elements, for instance, in most linear algebra kernels which concern us now. The storage may assume some sparse or dense (row or column major) data structure on input. At this point, we are inclined to assume that the user has correctly formatted his data for a serial execution of a specific routine from some specific numerical library. He has correctly identified the input routine_parameters for the routine_name. The data_handle is what the middleware uses as a key to handling the user's data. For now, suffice it to note that the user can invoke the application routine with his data in-core, in a file on the local network disk (NFS case), or in a file on a memory depot (IBP). The data_handles are a pointer in physical memory, a file name (e.g. a path to the data in the file system) on the local network disk, or an IBP capability respectively. (A capability is the interface to a specific memory location in an IBP memory depot. It is the key to allocating, storing, loading, and managing a logically contiguous block of bytes on the memory server.) One motivation for adding a remote memory depot is to allow a user to generate problems of a magnitude which would otherwise, due to system constraints, not be feasible on a single node. This can be accomplished through buffered stores to a depot which has abundant available memory. We provide an interface for the user to achieve this set-up as well. Further, a user can choose to generate a data set in the memory depot and share the capability with others if so desired. This allows multiple users to collaborate on the generation of a data set and to share in the solution to a potentially common problem with no difficulties.
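For the dense Ax = b test case, the serial baseline the target user starts from amounts to little more than the following sketch, which generates data in-core and calls LAPACK's dgesv directly; the dimensions and random data are illustrative, and this is not the middleware's actual entry point.

```c
/* serial_solve.c: sketch of the serial starting point for the dense test
 * case; a column-major system A x = b is generated in-core and solved with
 * LAPACK's dgesv (link against a tuned LAPACK/ATLAS installation).        */
#include <stdio.h>
#include <stdlib.h>

/* Fortran LAPACK symbol */
extern void dgesv_(const int *n, const int *nrhs, double *a, const int *lda,
                   int *ipiv, double *b, const int *ldb, int *info);

int main(void)
{
    int n = 2000, nrhs = 1, info = 0;          /* illustrative problem size */
    double *A   = malloc((size_t)n * n * sizeof(double));
    double *b   = malloc((size_t)n * sizeof(double));
    int    *ipiv = malloc((size_t)n * sizeof(int));

    srand(1);                                   /* user-generated data       */
    for (long i = 0; i < (long)n * n; i++) A[i] = rand() / (double)RAND_MAX;
    for (int i = 0; i < n; i++)             b[i] = rand() / (double)RAND_MAX;

    dgesv_(&n, &nrhs, A, &n, ipiv, b, &n, &info);   /* b now holds x        */
    printf("dgesv info = %d, x[0] = %g\n", info, b[0]);

    free(A); free(b); free(ipiv);
    return 0;
}
```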
2.3 The middleware model
The middleware provides several services which are discussed in turn briefly. The entry point into the middleware is from the user's serial call Invoke_Numerical_Application_Routine(). The middleware, once invoked,
first assembles a collection of system specific parameters about the available resources which reflect the current state of the shared system. As previously mentioned, NWS, or the like, can be used to achieve this step. Information about the number of candidate computing nodes, the free available memory per node, the CPU load (a number between 0 and 1 for each processor of each node) per node, and the bandwidth and latency for communication between each node in the cluster is gathered. Let us collect this information into the system_parameters. A resource selection process is started which relies upon the evaluation of a time function which depends explicitly upon the computational demands of routine_name as a function of the routine_parameters, and the system_parameters returned from the previous step. Let us call the team of resources which the middleware intends to allocate the selected_resources. Next, if the data_handle specifies that the user's data has been generated in-core, then the middleware writes the user's data to the local network disk. (As will be discussed in the comments on data movement later in this paper, the developer has to decide whether to map the user's data at the time of this write or to simply write the data as is and allow the mapping to be made in the application routine prior to executing the parallel application. Both cases are considered in our experiments and results are presented later.) Otherwise, and after the write for the in-core case, the middleware assembles the command line and the machine file necessary for launching the parallel application. The application routine is forked and waited upon by the middleware. On return, if the user expects an answer in-core, the middleware assembles the solution (from disk) and passes it to the user. Otherwise, the middleware returns the modified data_handle which now contains updated information regarding where to find the expected solution. The process can be described as pseudo-code as follows:
Invoke_Numerical_Application_Routine(data_handle(in/out),
    routine_name(in), routine_parameters(in)){
    Get_State_of_System(system_parameters(out));
    Do_Resource_Selection(selected_resources(out), system_parameters(in),
        routine_name(in), routine_parameters(in));
    If(processes==1){
        Special_Case_Serial_Run(data_handle(in/out), selected_resources(in),
            routine_name(in), routine_parameters(in));
        Return_Control_to_User();
    }
    Create_Machinefile_and_Dress_Command_Line(command_line(out),
        selected_resources(in), routine_name(in), routine_parameters(in));
    If(data_handle=="incore") Write_Data_to_Disk(data_handle(in));
    Fork_Application_Routine_and_Wait(command_line(in));
    If(data_handle=="incore") Read_Data_from_Disk(data_handle(in/out));
    Return_Control_to_User();
}

It is useful to look at Do_Resource_Selection(). A general structure for this stage of the middleware reads:

Do_Resource_Selection(selected_re
imparts its own congestion control (e.g. sliding windows) when the network becomes congested complicates matters. (See references [26,27].) Further, we don't know whether the system handling the disk accesses employs DMA or, if not, how many accesses to disk may be necessary to service a request, the time per disk access, etc. If we did know this, we could write a function to approximate the procedure. For instance, suppose we wanted to load N bytes from a network disk into local physical memory. Suppose we know the bandwidth and latency between the source and destination machines, the number of bytes loadable per disk access as well as the time overhead per disk access. In this case, the total number of disk accesses is (N bytes) / (X bytes loaded per disk access), or N/X disk accesses. The time just to get the data to the network will then be (N/X disk accesses) x (Y seconds per disk access), or NY/X seconds, for instance. Next, to move the data over the network, we could assume the ideal and thus the time to move the data over the network would be simply the latency (converted to seconds) plus N bytes divided by the network bandwidth (converted into bytes per second). Unfortunately, these simplifications ignore the true complexities of the underlying resources and protocols which are used in practice. Such models fail to yield realistic predictions in a shared system. Again, it's complicated and a work in progress.

Naively, and for the sake of commentary, one assumes that the typical form for a time function will be of the sort,

    T_solve_user's_problem = T_handle_user's_data + T_execute_parallel_application.

Here, T_execute_parallel_application and T_handle_user's_data are functions of the routine_parameters and system_parameters. In these terms, our original conjecture for a study may be restated as: if

    T_serial_expert - T_execute_parallel_application > T_handle_user's_data,

then the user benefits from invoking the middleware. Otherwise, the user's problem is simply solved serially without any benefits from having invoked the middleware. Again, if the parallel application routine is scalable, we expect to find the problem size which marks the turning point which validates the method. We return to the issue when considering the test case.
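Restated as code, the go/no-go decision expressed by this inequality might look as follows; the two cost models are deliberately crude placeholders standing in for the real time functions built from the routine_parameters and system_parameters.

```c
/* decide.c: illustrative sketch of the turning-point test; run in parallel
 * only if the predicted parallel time plus the cost of moving the user's
 * data beats the predicted serial time (equivalent to the inequality in
 * the text). The estimator functions are placeholders, not the real
 * middleware models.                                                      */
#include <stdio.h>

static double T_handle_users_data(long n_bytes, double bw_bytes_per_s,
                                  double latency_s)
{
    return 2.0 * (latency_s + n_bytes / bw_bytes_per_s);   /* out and back */
}

static double T_execute(long n, int p, double gflops_per_node)
{
    double flops = 2.0 / 3.0 * (double)n * n * n;           /* dense LU     */
    return flops / (p * gflops_per_node * 1e9);             /* ideal scaling*/
}

int main(void)
{
    long   n  = 8000;            /* matrix order requested by the user      */
    int    p  = 8;               /* candidate processors after selection    */
    double bw = 10e6, lat = 1e-3, gf = 0.4;   /* measured system parameters */

    long   bytes      = (long)n * n * (long)sizeof(double);
    double t_serial   = T_execute(n, 1, gf);
    double t_parallel = T_execute(n, p, gf) +
                        T_handle_users_data(bytes, bw, lat);

    printf("serial %.1f s  vs  parallel+data %.1f s -> %s\n",
           t_serial, t_parallel,
           t_parallel < t_serial ? "redistribute and run in parallel"
                                 : "solve serially");
    return 0;
}
```

With placeholder numbers of the order of the measurements reported in Section 2.1, the parallel path wins only beyond a certain matrix order, which is exactly the turning point N_tp the conjecture refers to.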
2.4 The application layer
The application level of our current effort is built upon pre-existing numerical packages such as PETSc (ref. [28]) or ScaLAPACK. We have relied upon the scalability of such libraries to more than account for the time required for the middleware to handle the user's data. It is the developer's burden to understand the application routines from such libraries at an intricate level so that accurate time functions can be written to be
used by the middleware during the resource selection process. At this point, the typical time function for T_execute_parallel_application has the form

    T_execute_parallel_application = T_communicate_data + T_floating_point_operations + ...

In the example discussed in the next section, details of such a function are placed in context of an actual parallel kernel.

The application routine has to get the user's data in-core before the parallel execution. There have been several approaches tested to this end. Basically, the data may be pre-mapped by the middleware in which case the parallel application starts with a parallel read of the data from the local network filesystem. The user's data may reside on the local network disk or a memory server in an unmapped format. In this case, either each node assembles its own data through a series of random accesses to the file housing the data, or a single, lead, node is responsible for bringing the data in-core and distributing it in a mapped manner (or in bulk, letting each node claim its own data) through a series of send/receives or broadcasts respectively. Let us agree for simplicity to lump all these scenarios into the routine Get_User's_Data_down_Incore(). Each processor in the allocated team of resources will use this. Then,

Parallel_Application_Routine(command_line(in)){
    Parse_Command_Line(environment_information(out), data_handle(out),
        routine_parameters(out), command_line(in));
    Initialize_Parallel_Environment(environment_information(in));
    Get_User's_Data_down_Incore(data(out), data_handle(in));
    Execute_Parallel_Application_Routine(data(in/out), routine_parameters(in));
    Collect_Answer_to_Root(da
GRID: Earth and Space Science Applications Perspective

L. Fusco
European Space Agency
The integration of space and information technologies is an asset for many science and operational applications in various disciplines. For example, limiting to the case of Earth Observation, many environmental satellites have been and are operated to provide large and repetitive volumes of geophysical measurements for natural resource management and environmental monitoring of our planet, which require sophisticated computing resources. The worldwide scientific user community is relatively large, in the order of many thousands, and the international programmes (e.g. IGBP, ICSU, WCRP, IOC) are the key consumers of EO data and its derived information. At the same time operational institutional and commercial users have real time and off-line access to data and services.

At present, the provision of user information services is far from being optimal, due to the complexity of product format, algorithms and processing required to meet the specific user needs (e.g. immediate access to integrated EO and other information, for the specific application, in the end user system environment). To meet the user communities' aims, the effective co-operation of all involved actors and the sharing of experience, methods, resources, etc. is required. Many obstacles have to be removed, for example:
• access to large data archives (which today are mainly in closed operational environments);
• common definition of metadata and data formats (interoperability across various data providers);
• access to relevant ancillary information;
• adequate network capacity across space data handling dedicated facilities, value adding, science and service industries;
• access to adequate computing resources;
• development of real time services;
• development of effective user tools for integrating different data sets;
• agreements on common or compatible data policies for data access.

The GRID technology (in terms of network, distributed and high performing computing power, collaborative environment and large data archive resources) can and should help
the process of supporting the user community in their scientific and operational applications. The European Space Agency has plans for handling the dedicated archives, operational and future missions, and to support their scientific, operational and commercial exploitation. Efforts in demonstrating the potential of the GRID technology elements and services (e.g. middleware for distributed processing and access to distributed archives) are on going. The plans to encourage a strong European presence in this domain are essential actions to provide the necessary support and infrastructure as required by complex environmental applications.
e-Science, e-Business and the Grid
Tony Hey
University of Southampton, UK
The talk begins with a rapid survey of the technological drivers behind the IT revolution. Increasingly, many areas of science involve access to distributed computing and data resources, remote access to specialized and expensive facilities and world-wide collaborations of scientists. There are many examples of such 'e-Science' endeavours ranging from bioinformatics and proteomics to collaborative engineering and earth observation. To enable scientists to maximize the science derived in such a fashion we will see the emergence of a new IT infrastructure called the Grid. The Web gave us easy access to html pages and information: the Grid will give us seamless access to a much wider range of distributed resources. It will enable us to form transient 'Virtual Organisations' without compromising security or privacy. The Grid infrastructure developed to allow interoperability and scalability of such heterogeneous and dynamic resources has obvious interest for industry. The talk concludes with some examples of Grid technology in an industrial context. The UK e-Science Programme is also briefly described.
GRAPH PARTITIONING FOR DYNAMIC, ADAPTIVE AND MULTI-PHASE SCIENTIFIC SIMULATIONS
KIRK SCHLOEGEL, GEORGE KARYPIS, AND VIPIN KUMAR
Dept. of Computer Science and Engineering, University of Minnesota
E-mail: {kirk, karypis, kumar}@cs.umn.edu
The efficient execution of scientific simulations on HPC systems requires a partitioning of the underlying mesh among the processors such that the load is balanced and the inter-processor communication is minimized. Graph partitioning algorithms have been applied with much success for this purpose. However, the parallelization of multi-phase and multi-physics computations poses new challenges that require fundamental advances in graph partitioning technology. In addition, most existing graph partitioning algorithms are not suited for the newer heterogeneous high-performance computing platforms. This talk will describe research efforts in our group that are focused on developing novel multi-constraint and multi-objective graph partitioning algorithms that can support the advancing state-of-the-art in numerical simulation technologies. In addition, we will present our preliminary work on new partitioning algorithms that are well suited for heterogeneous architectures.
1 Introduction
Algorithms that find good partitionings of unstructured and irregular graphs are critical for the efficient execution of scientific simulations on high-performance parallel computers. In these simulations, computation is performed iteratively on each element (and/or node) of a physical two- or three-dimensional mesh. Information is then exchanged between adjacent mesh elements. Efficient execution of such simulations on distributed-memory machines requires a mapping of the computational mesh onto the processors that equalizes the number of mesh elements assigned to each processor and minimizes the interprocessor communication required to perform the information exchange between adjacent elements [10]. Such a mapping is commonly found by solving a graph partitioning problem [3,4]. Simulations performed on shared-memory multiprocessors also benefit from partitioning, as this increases data locality, and so leads to better cache performance. Although the graph partitioning problem is NP-complete, good heuristic solutions for instances arising in scientific simulation can be found using multilevel algorithms. Many of these algorithms are incorporated in software packages such as Chaco [2], METIS [6], and SCOTCH [7]. However, the parallelization of multi-phase and multi-physics
computations poses new challenges that require fundamental advances in graph partitioning technology. In addition, most existing graph partitioning algorithms are not suited for the newer heterogeneous high-performance computing platforms. This talk will describe research efforts in our group that are focused on developing novel multi-constraint and multi-objective graph partitioning algorithms that can support the advancing state-of-the-art in numerical simulation technologies. In addition, we will present our preliminary work on new partitioning algorithms that are well suited for heterogeneous architectures.
2 Multi-Constraint and Multi-Objective Partitioning Algorithms
The traditional graph-partitioning problem is a single-objective optimization problem subject to a single balance constraint. Our research has focused on generalizing this problem to allow for multiple optimization objectives as well as multiple balance constraints by assigning weight vectors to the vertices and edges of the graph [5,8]. The resulting generalized algorithms have enabled effective partitioning for a variety of applications such as weapon-target interaction simulations involving multiple computational meshes and the particle-in-cell computations that underlie diesel combustion engine simulations. As an example, consider a multi-physics simulation in which a variety of materials and/or processes are simulated together. The result is a class of problems in which the computation as well as the memory requirements are not uniform across the mesh. Existing partitioning schemes can be used to divide the mesh among the processors such that either the amount of computation or the amount of memory required is balanced across the processors. However, they cannot be used to compute a partitioning that simultaneously balances both of these quantities. This inability can either lead to significant computational imbalances, limiting efficiency, or to significant memory imbalances, limiting the size of problems that can be solved using parallel computers. Figure 1 illustrates this problem. It shows three possible partitionings of a graph in which the amount of computation and memory associated with a vertex can be different throughout the graph. The partitioning in Figure 1(b) balances the computation among the subdomains, but creates a serious imbalance for memory requirements. The partitioning in Figure 1(c) balances the memory requirement, while leaving the computation unbalanced. The partitioning in Figure 1(d), which balances both of these, is the desired solution. In general, multi-physics simulations require the partitioning to satisfy not just one, but multiple balance constraints. (In this case, the partitioning must balance two constraints, computation and memory.)
Figure 1. An example of a computation with nonuniform memory requirements. Each vertex in the graph is split into two amounts. The size of the lightly-shaded portion represents the amount of computation associated with the vertex, while the size of the dark portion represents the amount of memory associated with the vertex. The bisection in (b) balances the computation. The bisection in (c) balances the memory, but only the bisection in (d) balances both of these.
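To make the two-constraint case of Figure 1 concrete, the short C++ sketch below (not taken from the talk) evaluates a candidate partition: a multi-constraint partitioner has to keep both imbalance ratios close to 1.0 at the same time.

    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <utility>
    #include <vector>

    // Sketch: evaluate a partition of a graph whose vertices carry two weights
    // (computation, memory).  A multi-constraint partitioning must keep the
    // imbalance of *both* weights close to 1.0 simultaneously.
    struct Vertex { double compute; double memory; };

    // Returns {compute imbalance, memory imbalance}: maximum subdomain load
    // divided by the average subdomain load, for each constraint separately.
    std::pair<double, double> imbalance(const std::vector<Vertex>& v,
                                        const std::vector<int>& part, int nparts)
    {
        std::vector<double> comp(nparts, 0.0), mem(nparts, 0.0);
        double tot_c = 0.0, tot_m = 0.0;
        for (std::size_t i = 0; i < v.size(); ++i) {
            comp[part[i]] += v[i].compute;  tot_c += v[i].compute;
            mem [part[i]] += v[i].memory;   tot_m += v[i].memory;
        }
        double max_c = *std::max_element(comp.begin(), comp.end());
        double max_m = *std::max_element(mem.begin(),  mem.end());
        return { max_c / (tot_c / nparts), max_m / (tot_m / nparts) };
    }

    int main()
    {
        // Four vertices with non-uniform compute/memory weights, bisected such
        // that both constraints are balanced (the situation of Figure 1(d)).
        std::vector<Vertex> v = { {4, 1}, {1, 4}, {3, 2}, {2, 3} };
        std::vector<int> part = { 0, 0, 1, 1 };
        auto [c, m] = imbalance(v, part, 2);
        std::printf("compute imbalance %.2f, memory imbalance %.2f\n", c, m);
        return 0;
    }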
Other examples of scientific simulations that require partitionings with multiple constraints and multiple objectives are presented in [8,10]. Serial and parallel algorithms for solving such multi-objective, multi-constraint graph partitioning problems are described in [5,8,9,11].
New Challenges
As numerical simulation technologies continue to become more sophisticated and as the number of processors routinely used increases into the thousands and tens of thousands, partitionings are required to satisfy more and more generalized constraints and optimize many different types of objectives in order to ensure good parallel efficiencies. Many of these objectives cannot be defined on the vertices and the edges of the graph, but are instead defined in terms of the overall structure of the partitioning. As an example, many types of parallel indirect solvers require partitionings that, in addition to minimizing the inter-processor communication, are also composed of well-shaped subdomains (i.e., the subdomains have good aspect ratios) [1]. As another example, domain-decomposition-based numerical simulations, such as those proposed for computational structural mechanics, require that the resulting
partitioning simultaneously balances (i) the amount of time required to factorize the local subproblem using direct factorization, (ii) the size of the interface problem assigned to each processor, and (iii) the number of subdomains to which each subdomain is adjacent. In both of these examples, the various objectives and constraints cannot be modeled by assigning appropriate weights to the vertices and/or edges of the graph, as they depend on the structure of the partitioning. Developing algorithms for solving such problems is particularly challenging, as it requires that the partitioning algorithm balance quantities that can be measured only after a partitioning has been computed. Also, in many time-dependent computations the physics or subdomain meshes change as a function of time. For such computations to remain load balanced, the mesh must be redistributed periodically. This requires an adaptive repartitioning algorithm that has yet an additional optimization objective on top of any others specified by the user (i.e., the minimization of the amount of data that needs to be redistributed during load balancing). In our research, we are continuing to develop a general partitioning framework that allows constraints and objectives to be specified in terms of the structure of the desired partitioning, as well as the development of algorithms suitable for this framework.
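As a small, hypothetical illustration of that extra repartitioning objective (not an algorithm from the talk), the sketch below measures the data volume a proposed new partition would migrate relative to the current one; an adaptive repartitioner tries to keep this quantity small while restoring balance.

    #include <cstddef>
    #include <vector>

    // Illustrative only: the additional objective of adaptive repartitioning is
    // the data volume that must be migrated when moving from the old partition
    // to the new one.
    double migration_volume(const std::vector<int>& old_part,
                            const std::vector<int>& new_part,
                            const std::vector<double>& vertex_size)
    {
        double moved = 0.0;
        for (std::size_t i = 0; i < old_part.size(); ++i)
            if (old_part[i] != new_part[i])      // element changes its processor
                moved += vertex_size[i];         // its data has to be shipped
        return moved;
    }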
3 Partitioning for Heterogeneous Computing Platforms
Most existing scalable parallel multi-disciplinary simulation codes can be easily ported to a wide range of parallel architectures as they employ a standard messaging layer such as MPI. However, their performance on these architectures depends to a large degree on the architecture of the parallel platform. In particular, many core parallel algorithms were designed under the assumption that the target architecture is flat and homogeneous. However, the emergence of parallel computing platforms built using commercial off-the-shelf components has resulted in high-performance machines becoming more and more heterogeneous. This trend is also influenced by the geographically distributed nature of computing grids as well as the effects of increasingly complex memory hierarchies. This heterogeneity presents real challenges to the scalable execution of scientific and engineering simulation codes. A promising approach for addressing this problem is to develop a new class of architecture-aware graph partitioning algorithms that optimally decompose computations given the architecture of the parallel platform. Ideally, such an intelligent partitioning capability could alleviate the need for major restructuring of scientific codes. We are in the process of developing graph-partitioning algorithms that take into account the heterogeneity of the underlying parallel computing
architecture, and hence, compute partitionings that will allow existing scientific codes to achieve the highest levels of performance on a wide range of platforms. The intent is to develop an extensible hierarchical framework for describing the various aspects of the target platform that captures the underlying network topologies, inter-connection network bandwidths and latencies, processor speeds, memory capacities, and the various levels of the memory hierarchy. New graph-partitioning algorithms will then be designed that can use information from this framework to optimize partitionings with respect to the specified architecture.
Acknowledgments
This work was supported by DOE contract number LLNL B347881, by NSF grants CCR-9972519, EIA-9986042, and ACI-9982274, by Army Research Office contract DA/DAAG55-98-1-0014, and by Army High Performance Computing Research Center cooperative agreement number DAAD19-01-2-0014, the content of which does not necessarily reflect the position or the policy of the government, and no official endorsement should be inferred. Additional support was provided by the IBM Partnership Award and by the IBM SUR equipment grant. Access to computing facilities was provided by AHPCRC and the Minnesota Supercomputer Institute.
References
[1] R. Diekmann, R. Preis, F. Schlimbach and C. Walshaw. Shape-Optimized Mesh Partitioning and Load Balancing for Parallel Adaptive FEM. Parallel Computing 26, 12 (2000).
[2] B. Hendrickson and R. Leland. The Chaco User's Guide, Version 2.0. Technical Report Sandia National Laboratories SAND94-2692, 1994.
[3] B. Hendrickson and R. Leland. A Multilevel Algorithm for Partitioning Graphs. Proceedings Supercomputing '95, 1995.
[4] G. Karypis and V. Kumar. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing 20, 1 (1998).
[5] G. Karypis and V. Kumar. Multilevel Algorithms for Multi-constraint Graph Partitioning. Proceedings Supercomputing '98, 1998.
[6] G. Karypis, K. Schloegel and V. Kumar. ParMETIS: Parallel Graph Partitioning and Sparse Matrix Ordering Library, Version 3.0, 2002.
[7] F. Pellegrini and J. Roman. Scotch: A Software Package for Static Mapping by Dual Recursive Bipartitioning of Process and Architecture
Graphs. HPCN-Europe, Springer LNCS 1067, 1996.
[8] K. Schloegel, G. Karypis and V. Kumar. A New Algorithm for Multi-objective Graph Partitioning. Proceedings EuroPar '99, pages 322-331, 1999.
[9] K. Schloegel, G. Karypis and V. Kumar. Parallel Multilevel Algorithms for Multi-constraint Graph Partitioning. Proceedings EuroPar 2000, pages 296-310, 2000.
[10] K. Schloegel, G. Karypis and V. Kumar. Graph Partitioning for High Performance Scientific Simulations. In CRPC Parallel Computing Handbook, Morgan Kaufmann. To appear.
[11] K. Schloegel, G. Karypis and V. Kumar. Parallel Static and Dynamic Multi-constraint Graph Partitioning. Concurrency: Practice and Experience. To appear.
Panel Session on Challenges and Opportunities in Data-Intensive Grid Computing
Paul Messina
Center for Advanced Computing Research, CalTech, [email protected]
The panel will discuss software and hardware technology issues related to very large, geographically distributed scientific data sets: archiving, accessing, and analyzing them. Discussion topics include distributed, on-demand computing for analysis, federation of databases for simultaneous access to various databases, data mining, network middleware, scalability, and distributed visualization, among others.
APPLICATIONS
GIANT EIGENPROBLEMS FROM LATTICE GAUGE THEORY ON CRAY T3E SYSTEMS
N. ATTIG 1, TH. LIPPERT 2, H. NEFF 1, J. NEGELE 3, AND K. SCHILLING 1,2
1 John von Neumann Institute for Computing, Research Center Jülich, 52425 Jülich, Germany
2 Dept. of Physics, University of Wuppertal, 42097 Wuppertal, Germany
3 Center for Theoretical Physics, MIT, Cambridge, USA
The determination of physical properties of flavor singlet objects like the η' meson by computer simulation requires the computation of functionals of the inverse fermionic matrix M^{-1}. So far, only stochastic methods could cope with the enormous size of M. In this paper, we introduce an alternative approach which is based on the computation of a subset of low-lying eigenmodes of the fermionic matrix. The high quality of this 'truncated eigenmode approximation' (TEA) is demonstrated by comparison with the pion correlator, a flavor octet quantity, which is readily computable through a linear system of equations. We show that TEA can successfully approximate the flavor singlet η' correlator. We find that the systematic error of the method is tolerable. As the determination of the chosen subset of 300 eigenmodes requires about 3.5 Tflops-hours of CPU time per canonical ensemble and at least 15 GBytes of memory, the power of high-end supercomputers like the CRAY T3E is indispensable.
1 Introduction
A major goal of non-perturbative lattice quantum chromodynamics (LQCD) is the determination of hadronic mass states which are characterized by non-valence contributions, such as flavor singlet mesons. Their correlation functions, C_η'(t1 − t2), contain so-called 'disconnected diagrams', i.e. correlators between closed virtual fermion loops. The reliable determination of these disconnected diagrams has been a long-standing issue ever since the early days of lattice gauge theory. It can be reduced to the numerical problem of how to obtain information about functionals of the inverse fermionic matrix M^{-1}. The first attempts in this direction were started only a few years ago, using the so-called stochastic estimator method (SE) [1] to compute the trace of M^{-1}. This approach requires solving the linear system Mx = ξ on some hundred source vectors ξ, with ξ being Z2 or Gaussian noise vectors. Meanwhile, substantial progress could be achieved for the determination of the η' by application of refined smearing methods [2], where for the first time a proper signal-to-noise ratio could be established. However, SE introduces stochastic noise, in addition to the stochastics already inherited from the
Monte Carlo process. In the following, we describe the determination of the η' mass based on the computation of a set of low-lying eigenmodes of Q = γ5 M, the hermitian form of M. We use the implicitly restarted Arnoldi method, a generalization of the standard Lanczos procedure. A crucial ingredient is the Chebyshev acceleration technique to achieve a transformation of the spectrum to a form suitable for the Arnoldi eigenvalue determination. Given the low-lying modes, it is possible to estimate the entire matrix Q^{-1} and those matrix functionals or functions of Q and M which are sensitive to long-range physics. In section 2, we introduce the meson correlators and in section 3, we briefly review their computation by conventional means. Section 4 is devoted to TEA and the organization of the computation on the CRAY T3E by use of the parallel Arnoldi package (PARPACK). In section 5, we assess the viability of TEA by comparing the correlator of the π meson as computed from TEA with the result from the conventional approach. As the π is a flavor octet quantity it can easily be computed through the solution of a linear system of equations by iterative Krylov subspace algorithms [3]. Finally, we apply TEA to the computation of the η' meson correlator and compare with results from SE computations.
2 Meson Correlators
In LQCD, hadronic masses are extracted from the large-time behavior of correlation functions. The correlator of the flavor octet π meson is defined as

    C_π(t = t1 − t2) = ⟨ Σ_{n,m} Tr[ Q^{-1}(n,t1; m,t2) Q^{-1}(m,t2; n,t1) ] ⟩_U ,    (1)

while the flavor singlet η' meson correlator is composed of two terms, one being connected and equivalent to the pion correlator, the second being the disconnected contribution from the correlation of virtual quark loops:

    C_η'(t1 − t2) = C_π(t1 − t2) − 2 ⟨ Σ_{n,m} Tr[ Q^{-1}(n,t1; n,t1) ] Tr[ Q^{-1}(m,t2; m,t2) ] ⟩_U .    (2)

⟨...⟩_U indicates the average over a canonical ensemble of gauge field configurations. Q is the hermitian Wilson-Dirac matrix [3], i.e. Q = γ5 M; n and m denote spatial lattice sites, t1 and t2 determine the time separation t. The color and Dirac indices are suppressed. For large times t, the respective correlation functions become proportional to exp(−m0 t), where m0 is the mass
of the particle described by the correlation function. Since our lattice has anti-periodic boundary conditions in the time direction, the correlation functions actually consist of a sum over two contributions, exp(−m0 t) and exp(+m0 t), i.e. they will exhibit a cosh-like behaviour (C ∼ cosh(m0 t)).
3 Conventional Computation of Correlators
In the conventional computation of the π-correlator (1) the source point is held fixed (e.g. at the index tuple (1, 1)), where the first index symbolizes the spatial vector (1,1,1) and the second one denotes the time component:

    C_π(t1) = ⟨ Σ_n Tr[ Q^{-1}(n,t1; 1,1) Q^{-1}(1,1; n,t1) ] ⟩_U .    (4)

Thus it suffices to determine 12 columns (3 × 4 for the color and Dirac indices) of M^{-1} in order to compute (4). The columns c(n,t) of M^{-1} are obtained by solving the linear system

    M(n,t1; m,t2) c(m,t2) = δ(1,1; n,t1) ,    (5)
where δ is the Kronecker delta function. Of course, the statistics could be improved by averaging over many or even all source points. However, this would be prohibitively expensive as the effort increases with the number of sources. For C_η', however, the second term,

    Σ_{n,m} Tr[ Q^{-1}(n,t1; n,t1) ] Tr[ Q^{-1}(m,t2; m,t2) ] ,    (6)

depends on the diagonal elements of Q^{-1}, which cannot be determined from one source point alone. Instead of going through all sites, the method of choice is the stochastic estimator technique (SE). One creates series of complex numbers (c_i)_j such that they converge to the diagonal elements Q^{-1}(i,i). The series are constructed through noise vectors η_k,

    (c_i)_j = (1/j) Σ_{k=1}^{j} η_k^†(i) (Q^{-1} η_k)(i) .    (7)
Q^{-1} η_k is determined by solving the corresponding linear system Q x = η_k. In order to achieve a satisfying approximation, 400 noise vectors η_k are required [2]. Therefore the flavor singlet calculations are about 30 times more expensive than the octet ones.
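The following small C++ sketch (not from the paper) illustrates the stochastic estimator of eq. (7) on a toy problem; Q is taken diagonal so that Q^{-1}η is trivial, whereas in the real computation each system Qx = η_k is solved with a Krylov method, and all numbers are illustrative.

    #include <cstdio>
    #include <random>
    #include <vector>

    int main()
    {
        const int n = 8, n_noise = 400;          // 400 noise vectors as in the text
        std::vector<double> q = {1, 2, 3, 4, 5, 6, 7, 8};   // toy spectrum of Q

        std::mt19937 rng(42);
        std::bernoulli_distribution coin(0.5);

        std::vector<double> estimate(n, 0.0);
        for (int k = 0; k < n_noise; ++k) {
            std::vector<double> eta(n);
            for (int i = 0; i < n; ++i) eta[i] = coin(rng) ? 1.0 : -1.0;  // Z2 noise
            for (int i = 0; i < n; ++i) {
                double x_i = eta[i] / q[i];      // x = Q^{-1} eta (exact: Q diagonal)
                estimate[i] += eta[i] * x_i;     // eta_k(i) * (Q^{-1} eta_k)(i)
            }
        }
        // For a diagonal Q and real Z2 noise the estimate is exact; for a general
        // Q the off-diagonal cross terms only average out over many noise vectors.
        for (int i = 0; i < n; ++i)
            std::printf("Q^-1(%d,%d) ~ %.3f (exact %.3f)\n",
                        i, i, estimate[i] / n_noise, 1.0 / q[i]);
        return 0;
    }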
4 TEA
We start from the following equation:
    Q^{-1}(n,t1; m,t2) = Σ_i ψ_i(n,t1) ψ_i^†(m,t2) / ( λ_i ⟨ψ_i|ψ_i⟩ ) ,    (8)
where λ_i and ψ_i are the eigenvalues and the eigenvectors of Q, respectively. Note that Q is hermitian indefinite (a). We approximate the sum on the right hand side by restricting it to the 300 lowest-lying eigenvalues and their corresponding eigenvectors. Due to the factor 1/λ_i one can hope that the low-lying eigenmodes will dominate the sum (b). We emphasize that we obtain an approximation for the entire matrix Q^{-1}(i,j). Therefore, we can retrieve the diagonal elements of the disconnected diagrams as well as the π correlator on all source points. To compute the eigenvalues and their corresponding eigenvectors we employ the Implicitly Restarted Arnoldi Method (IRAM). The huge size of Q requires parallel supercomputers. We work on two CRAY T3E systems with 512 nodes each, located at the Research Center Jülich in Germany and at NERSC, Berkeley, USA. A comfortable parallel implementation of IRAM is provided by the PARPACK package [5]. In order to overcome the problem of slow convergence for the low-lying eigenvalues we apply the Chebyshev polynomial acceleration technique, where the eigenvalue spectrum is transformed such that the 300 smallest eigenvalues become much larger than the rest of the spectrum, a situation favorable for eigenvalue calculations by IRAM. We work on a 16^3 × 32 lattice, i.e. 16 lattice sites in space and 32 lattice sites in time direction. Taking into account the Dirac and color indices, we see that the Dirac matrix acts on a 12 × 16^3 × 32 = 1,572,864 dimensional vector space. This explains why we cannot invert the entire Dirac matrix, since this would need about 40 TBytes of memory, whereas the determination of 300 low-lying eigenvectors needs only about 15 GBytes of memory.
(a) The hermitian matrix Q leads to an orthogonal eigenbase. If we compute the eigenmodes from the non-normal matrix M instead, the resulting eigenbase is non-orthogonal. In our investigations, we found no convergence of the low modes from the non-orthogonal eigenbase.
(b) In the results section, we will come back to the question of cancellation effects due to positive and negative eigenvalues.
Our computations are based on canonical ensembles of 200 field configurations with n_f = 2 flavors of dynamical sea quarks, generated at 4 different quark masses, in the framework of the SESAM project [2]. Fast and automated access to an archive space of approximately 6 TBytes is required to store all eigenvectors from the SESAM ensembles. Thus, computations of this kind are not feasible without the facilities available at supercomputer centers. It takes about 3.5 Tflops-hours to solve for 300 low-lying modes on each ensemble. In the eigenmode approach the CPU time decreases with lighter quark masses, as in that case the eigenvalues become smaller and one can expect that they will dominate the sum earlier. This is a substantial advantage of TEA compared to the SE approach, where smaller eigenvalues lead to a slower convergence of the linear system solvers. With 400 stochastic estimates it takes about 1.5 Tflops-hours to treat an ensemble of 200 gauge configurations. However, in future simulations (i.e. for lighter quark masses) TEA will soon become superior to SE.
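A minimal C++ sketch of the truncated eigenmode approximation of eq. (8) is given below; the eigenpairs are simply assumed to be given (in the actual computation about 300 of them are obtained with PARPACK for a roughly 1.5-million-dimensional matrix), and all type and function names are illustrative.

    #include <complex>
    #include <cstddef>
    #include <vector>

    using cplx = std::complex<double>;

    struct EigenMode {
        double lambda;            // eigenvalue of Q (real, since Q is hermitian)
        std::vector<cplx> psi;    // eigenvector, normalised so <psi|psi> = 1
    };

    // Approximate Q^{-1}(a, b) from the n_low lowest modes, eq. (8):
    //   Q^{-1}(a,b) ~ sum_i psi_i(a) psi_i(b)^* / lambda_i
    cplx q_inverse_element(const std::vector<EigenMode>& modes,
                           std::size_t a, std::size_t b, std::size_t n_low)
    {
        cplx sum(0.0, 0.0);
        for (std::size_t i = 0; i < n_low && i < modes.size(); ++i)
            sum += modes[i].psi[a] * std::conj(modes[i].psi[b]) / modes[i].lambda;
        return sum;
    }

    // The quark loop needed for the disconnected part of the eta' correlator is
    // then a sum of diagonal elements over the sites of one time slice.
    cplx quark_loop(const std::vector<EigenMode>& modes,
                    const std::vector<std::size_t>& sites_of_timeslice,
                    std::size_t n_low)
    {
        cplx loop(0.0, 0.0);
        for (std::size_t a : sites_of_timeslice)
            loop += q_inverse_element(modes, a, a, n_low);
        return loop;
    }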
5 Results
The correlator C_π can serve as a simple test for the quality of TEA. Fig. 1 compares C_π as determined by eq. (5) with the result from TEA. The low modes are expected to describe long-range physics. Accordingly, TEA underestimates the true C_π for small time separations (t < 8). On the other hand, the large-time behavior is represented quite well. Since the π correlator can be determined extremely accurately by the conventional method, a tiny truncation effect is still visible in the large-time range of the propagator. This deviation decreases for smaller quark masses, as well as for a larger number of eigenmodes. Nevertheless, the deviations for large t are still surprisingly small. Let us sum (1) over t1 and t2,

    Σ_{t1,t2} C_π(t1 − t2) = ⟨ Σ_i 1/λ_i^2 ⟩_U .    (9)
Here all contributions are positive and no cancellation occurs as in the case of C_η'. The situation is even more favorable for the disconnected part of η'. Let us again consider the sum over t1 and t2:

    Σ_{t1,t2} Σ_{n,m} Tr[ Q^{-1}(n,t1; n,t1) ] Tr[ Q^{-1}(m,t2; m,t2) ] = ⟨ ( Σ_i 1/λ_i )^2 ⟩_U .    (10)
Figure 1. Comparison of C_π from TEA and from the conventional computation. (Plot: pion correlation function versus time; conventional method vs. alternative method.)
Obviously, the positive and negative eigenvalues can cancel each other here. We find that convergence is achieved from about 150 eigenvalues on! TEA and SE agree well within the error bars that are due to the gauge field fluctuations.
Figure 2. Comparison of the disconnected correlator from SE and TEA. The large errors in both cases are solely due to gauge field noise. (Plot: two-loop correlation function versus time; conventional method vs. alternative method.)
In conclusion, our results show that TEA is comparable to SE for the computation of disconnected diagrams [4]. The costs are similar to SE computations at present. Going to realistically light quark masses, TEA will become superior to SE
since low modes tend to dominate more and more.
Acknowledgments
We thank the staff of the computer centers at Research Center Jülich and NERSC at Berkeley for their friendly support.
References
1. S. Güsken et al. Phys. Rev., D59:114502, 1999.
2. T. Struckmann et al. hep-lat/0010005.
3. A. Frommer et al. Int. J. Mod. Phys., C5:1073, 1994.
4. See also L. Venkataraman and G. Kilcup, hep-lat/9711006.
5. http://www.caam.rice.edu/~kristyn/parpack_nome.html
PARALLEL CONSISTENCY CHECKING OF AUTOMOTIVE PRODUCT DATA
WOLFGANG BLOCHINGER, CARSTEN SINZ AND WOLFGANG KÜCHLIN
Symbolic Computation Group, WSI for Computer Science, Universität Tübingen, 72076 Tübingen, Germany
http://www-sr.informatik.uni-tuebingen.de
This paper deals with a parallel approach to the verification of consistency aspects of an industrial product configuration data base. The data base we analyze is used by DaimlerChrysler to check the orders for cars and commercial vehicles of their Mercedes lines. By formalizing the ordering process and employing techniques from symbolic computation we could establish a set of tools that allow the automatic execution of huge series of consistency checks, thereby ultimately enhancing the quality of the product data. However, occasional occurrences of computation-intensive checks are a limiting factor for the usability of the tools. Therefore, a prototypical parallel re-implementation using our Distributed Object-Oriented Threads System (DOTS) was carried out. Performance measurements on a heterogeneous cluster of shared-memory multiprocessor Unix workstations and standard Windows PCs revealed considerable speed-ups and substantially reduced the average waiting time for individual checks. We thus arrive at a noticeable improvement in usability of the consistency checking tools.
1 Introduction
Today's automotive industry manages to supply customers with highly individualized products by configuring each vehicle individually from a very large set of possible options. E.g., the Mercedes C-class of passenger cars allows far more than a thousand options, and on average more than 30,000 cars will be manufactured before an order is repeated identically. Heavy commercial trucks are even more individualized, and every truck configuration is built only very few times on average. The space of possible variations is so great that the validity of each order needs to be checked electronically against a product data base which encodes the constraints governing legal combinations of options [1]. But the maintenance of a data base with thousands of logical rules is error-prone, especially since it is under constant change due to the phasing in and out of models. Every fault in the data base may lead to a valid order being rejected, or an invalid (non-constructible) order being accepted, which may ultimately result in the assembly line being stopped. Therefore, reaching correctness of the product data base is a high-priority goal. DaimlerChrysler employs the electronic product data management (EPDM) system DIALOG for the configuration of their Mercedes lines. Within this system, a customer's order consists of a basic model class selection together with a set of further equipment codes describing additional features. Each equipment code is represented by a Boolean variable, and choosing some piece of equipment is reflected by setting
the corresponding variable to true. An order is processed in three major steps, as depicted in Figure 1. All of these steps are controlled by logical rules of the EPDM system:
1. Order completion: Supplement the customer's order by additional (implied) codes.
2. Constructibility check: Are all constraints on constructible models fulfilled by this order?
3. Parts list generation: Transform the (possibly supplemented) order into a list of parts (a bill of materials).
Figure 1. Processing a customer's order: customer's order → supplemented customer's order → checked and supplemented customer's order → order's parts list.
In order to systematically detect defects in the rule system, we developed a formal model of the ordering process. Thus, we are able to apply an automatic theorem prover to check certain consistency criteria of the rule base as a whole:
Necessary and inadmissible codes: Are there codes which must invariably appear in each constructible order? Are there codes which cannot possibly appear in any constructible order?
Superfluous parts: Are there parts which cannot occur in any constructible order?
Both criteria can be formulated as propositional logic satisfiability (SAT) problems [2], and our interactive consistency support tool BIS [3] contains an implementation of a Davis-Putnam-style [4] propositional prover to verify them. To completely check the above mentioned criteria for only one model class, up to 10,000 prover runs have to be performed. Most of these automatic proofs are completed in a few seconds, but there remains a small fraction that requires comparatively high run-times of up to several hours. Unfortunately, there is no known method to estimate the run-times in advance, and in an interactive system like BIS, users hardly accept long and unpredictable waiting times. It may happen that one of the first in a sequence of proofs requires a very long run-time, which causes a delay
in the presentation of the results of all the other proofs: The user does not get any result until the first proof is completed. Our parallelization approach therefore is two-fold: we execute a set of proofs in parallel, and if we hit a long-running proof we additionally process this individual proof in parallel. Thus, the average waiting time for the result of an individual proof can be reduced considerably. Before we describe our parallelization in more detail, we will give an overview of our software infrastructure DOTS used in this approach. We will then explain our method in terms of the DOTS system, before presenting experimental results, related work, and our conclusions.
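For orientation, a minimal Davis-Putnam/DPLL-style check of the kind performed for each consistency criterion is sketched below in C++; it is far simpler than the BIS prover, and the toy rule base in main() is purely illustrative (a code is "necessary" if the rules together with its negation are unsatisfiable).

    #include <cstdio>
    #include <cstdlib>
    #include <initializer_list>
    #include <vector>

    using Clause = std::vector<int>;   // literals: +v (code selected), -v (not selected)
    using CNF    = std::vector<Clause>;

    // assign[v]: 0 = unassigned, +1 = true, -1 = false
    static bool dpll(const CNF& cnf, std::vector<int>& assign)
    {
        // Unit propagation: repeatedly satisfy clauses with one free literal left.
        bool changed = true;
        while (changed) {
            changed = false;
            for (const Clause& c : cnf) {
                int unassigned = 0, last = 0;
                bool satisfied = false;
                for (int lit : c) {
                    int v = std::abs(lit), sign = lit > 0 ? 1 : -1;
                    if (assign[v] == sign) { satisfied = true; break; }
                    if (assign[v] == 0)    { ++unassigned; last = lit; }
                }
                if (satisfied) continue;
                if (unassigned == 0) return false;          // conflict
                if (unassigned == 1) {                      // forced assignment
                    assign[std::abs(last)] = last > 0 ? 1 : -1;
                    changed = true;
                }
            }
        }
        // Branch on the first unassigned variable, backtracking via copies.
        for (std::size_t v = 1; v < assign.size(); ++v) {
            if (assign[v] != 0) continue;
            for (int val : {1, -1}) {
                std::vector<int> copy = assign;
                copy[v] = val;
                if (dpll(cnf, copy)) { assign = copy; return true; }
            }
            return false;
        }
        return true;                   // all variables assigned, all clauses satisfied
    }

    int main()
    {
        // Toy rule base over 3 codes: (1 implies 2), (2 implies 3), and code 1 ordered.
        CNF query = { {-1, 2}, {-2, 3}, {1} };
        query.push_back({-3});         // "Is code 3 necessary?": add NOT 3
        std::vector<int> assign(4, 0);
        bool sat = dpll(query, assign);
        std::printf(sat ? "satisfiable\n" : "unsatisfiable: code 3 is necessary\n");
        return 0;
    }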
2 Parallelization Infrastructure
As parallelization infrastructure, our Distributed Object-Oriented Threads System DOTS [5] is used. DOTS is a C++ parallel programming toolkit that integrates a wide range of different computing platforms into a single system environment for high performance computing.
2.1 The programming paradigm of DOTS
The main idea of DOTS is to make the threads programming paradigm used on shared-memory machines available in a distributed memory environment. With DOTS, a hierarchical multiprocessor, consisting of a (heterogeneous) cluster of shared-memory multiprocessor systems, can be efficiently programmed using a single paradigm. The DOTS API provides primitives for DOTS thread creation (dots_fork), synchronization with the results computed by other DOTS threads (dots_join), and DOTS thread cancellation (dots_cancel). All primitives can also be used in conjunction with so-called thread groups. Thread groups are a means of representing related DOTS threads. When applied with thread groups, the semantics of each primitive is automatically changed to the appropriate group semantics. E.g., when using the join primitive with a thread group, join-any semantics will be applied.
2.2 The Architecture of DOTS
Figure 2 gives an overview of the basic components of the architecture of DOTS. When a DOTS thread is created with the dots_fork primitive, a so-called thread object is instantiated that represents the DOTS thread within the system during the complete execution process. It stores all information that is necessary to execute the DOTS thread.
Figure 2. The DOTS Architecture.
Figure 3. The Execution Unit (thread queue, ready queues).
DOTS threads are executed within the Execution Unit (see Figure 3). It contains a thread queue in which newly created thread objects are enqueued. A pool of (OS-native) worker threads dequeue thread objects from the queue and execute the corresponding DOTS threads. The number of worker threads can be determined by the programmer. Normally, for each node the number of available processors is chosen. After the execution of a DOTS thread is completed, its thread object is placed into a ready queue. To support the execution of DOTS threads in a distributed environment, the DOTS architecture includes additional components. The Thread Transfer Unit transfers (serialized) thread objects between queues of execution units residing on different nodes. The Load Monitoring Framework traces all events concerning the execution of DOTS threads and provides status information like the current load or the current length of the thread queue. Based on the Load Monitoring Framework, different load distribution strategies can be implemented. A load distribution strategy is responsible for triggering the transfer of thread objects and selecting destination nodes according to a particular strategy.
3 Parallelization
The presented parallelization approach pursues two major goals. The first goal is to achieve a total speedup of the computation. A second important goal is to reduce the average waiting time for the result of a proof in order to improve the usability of the application. In the subsequently described procedure, hard proofs are determined by setting a limit for the computation time of a proof. If the time for a proof has expired, it is considered a hard proof and treated separately by executing it in parallel. Consequently, the parallel execution is organized in two phases:
• Phase 1: Concurrent execution of proofs.
A timeout is set to suspend (long-running) hard proofs. Hard proofs are queued along with their current execution state.
• Phase 2: Parallel execution of the queued hard proofs. Each queued proof is treated individually in parallel, starting from its previously saved state computed in phase 1.
The execution of phase 1 is organized in a master-slave approach. For all proofs, the root thread creates corresponding DOTS threads that are executed concurrently. The computed results are joined by the root thread and displayed. In the case of a timed-out hard proof, its current execution state is joined and queued for later execution in phase 2. The realization of phase 2 requires more sophisticated techniques. We adopted the parallelization scheme for the Davis-Putnam algorithm presented by Zhang et al. [6]. Basically, we are dealing with the parallelization of a combinatorial search problem. This implies that the search space has to be divided into mutually disjoint portions to be treated in parallel. However, a (static) generation of balanced subproblems is not feasible, since it is impossible to predict in advance the extent of the problem reduction delivered by the Davis-Putnam procedure. Instead, a dynamic search space splitting approach is carried out. To start the execution of a queued hard proof the root thread forks one DOTS thread that has the entire search space assigned. During the whole computation of phase 2, all DOTS threads periodically monitor the length of the local thread queue (see Section 2.2). If the thread queue is empty, a new DOTS thread is forked. The parent thread splits off a region of its search space and assigns it to the new DOTS thread. Details of the applied search space splitting heuristics can be found in Zhang et al. [6]. To prevent uncontrolled splitting actions, a predefined time interval has to elapse before the next split can be carried out by a DOTS thread. The newly created DOTS thread is queued and can be executed by another local worker thread or can be transferred to other nodes (see Section 3.1). The described splitting procedure generates subproblems on demand. This ensures that new subproblems are generated during the initialization phase of the computation to exploit the available processing capacity, and every time a subproblem has been completely processed without finding a solution. After forking the initial DOTS thread, the root thread calls dots_join to wait for the created DOTS threads. All DOTS threads (except the initial one) are created with the dots_subfork primitive. This means that they can be joined by the root thread (and are not joined by their actual parent threads). The result of a DOTS thread indicates whether a problem solution was found within its assigned search region or not. The processing of a hard proof is completed either if all created DOTS threads have been joined without returning a solution, or when the first DOTS thread that has found a
solution is joined. In the latter case, all remaining DOTS threads are immediately canceled to make all processing capacities available for the next queued hard proof.
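The following simplified C++ sketch (using standard futures rather than DOTS, and with a hypothetical run_proof stub) only illustrates the timeout-based classification of phase 1; unlike DOTS, it cannot suspend and later resume a running proof.

    #include <chrono>
    #include <future>
    #include <vector>

    // Hypothetical stand-in for one Davis-Putnam prover run (proof obligation i).
    bool run_proof(int /*proof_id*/) { return true; }

    // Phase 1: run all proofs concurrently, report easy results immediately and
    // collect the indices of hard proofs for the parallel phase 2.
    // 'result' must have size n_proofs.
    std::vector<int> phase1(int n_proofs, std::vector<bool>& result)
    {
        std::vector<std::future<bool>> futures;
        for (int i = 0; i < n_proofs; ++i)          // a real system would use a
            futures.push_back(std::async(std::launch::async, run_proof, i));

        const auto deadline = std::chrono::steady_clock::now()
                            + std::chrono::seconds(20);   // per-proof time budget
        std::vector<int> hard_proofs;
        for (int i = 0; i < n_proofs; ++i) {
            if (futures[i].wait_until(deadline) == std::future_status::ready)
                result[i] = futures[i].get();       // easy proof: result available now
            else
                hard_proofs.push_back(i);           // hard proof: treat in phase 2
        }
        // Note: unlike DOTS, std::async cannot suspend a running proof; the timed
        // out tasks keep running until their abandoned futures are destroyed here.
        return hard_proofs;
    }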
3.1 Load Distribution
As described in Section 2.2, new load distribution strategies can easily be integrated into DOTS using the Load Monitoring Framework. For the presented application, a customized load distribution strategy was realized that reflects the division of the computation into two phases. In both phases, a work-stealing strategy is applied, i.e. when all worker threads on a node are idle, the distribution strategy tries to transfer thread objects from the thread queues of other nodes. However, the phases are treated differently in the way the victim node is chosen. Whereas in phase 1 always the master node is selected as victim (all DOTS threads are created by the master in phase 1), victims are chosen randomly in phase 2. It has been shown that applying a randomized work-stealing strategy to distribute the load in backtrack search algorithms is likely to yield a speedup within a constant factor from optimal (when all solutions are required) [7]. Since this approach does not involve central components for load distribution, the scalability of the parallel application is improved.
4 Experimental Results
The parallel environment used for the presented performance measurements consisted of a cluster made up of the following components (all nodes were connected with 100 Mbps switched Fast-Ethernet):
• 2 Sun Ultra E450, each with 4 UltraSPARC II processors (@400 MHz) and 1 GB of main memory, running under Solaris 7.
• 4 PCs, each with 1 Pentium II processor (@400 MHz) and 128 MB of main memory, running under Windows NT 4.0.
We have applied the necessary and inadmissible codes checks as well as the superfluous parts checks on configuration data for the C-class and E-class limousines. Tables 1 and 2 show the measured run-times (wall-clock times) in seconds. A timeout of 20 seconds for detecting hard proofs was chosen. The tables give sequential run-times of the prover executed on an E450, on a PC, and the corresponding weighted mean of the sequential run-times. Additionally, run-times of three parallel program runs on all processors are shown. The parallel execution of the prover can exhibit a
non-deterministic behavior; for this reason we do not give averaged times or speedup values. The average waiting times were calculated as

    (1 / #proofs) Σ_i waiting_time(i) ,

where waiting_time(i) is defined as the time from the start of the whole set of tests until the result of proof i is reported. Since some of the detected hard problems in the considered examples turned out to be satisfiable, the parallel execution of these proofs can lead to super-linear speedups and may consequently greatly reduce the total run-time as well as the mean waiting time for a proof.
5 Related Work and Conclusion
In the realm of industrial verification, Spreeuwenberg et al. [8] present a tool to verify knowledge bases built with Computer Associates' Aion [9]. Concerning parallel satisfiability checking, PSATO of Zhang et al. [6] is a distributed prover for propositional
logic on a network of workstations. In contrast to our work, a master-slave model is applied, where a central master is responsible for the division of the search space and for assigning the subtasks to the slaves. We see the main contribution of our paper in presenting an industrial real-world application of symbolic computation where employment of parallelization techniques greatly enhances usability by reducing the user's waiting time to an acceptable amount. Sophisticated parallel execution and scheduling methods almost manage to overcome the unpredictability of proof times.
References
1. E. Freuder. The role of configuration knowledge in the business process. IEEE Intelligent Systems, 13(4):29-31, July/August 1998.
2. W. Küchlin and C. Sinz. Proving consistency assertions for automotive product data management. J. Automated Reasoning, 24(1-2):145-163, February 2000.
3. C. Sinz, A. Kaiser, and W. Küchlin. Detection of inconsistencies in complex product model data using extended propositional SAT-checking. In FLAIRS'01, 2001. To appear.
4. M. Davis and H. Putnam. A computing procedure for quantification theory. Journal of the ACM, 7:201-215, 1960.
5. Wolfgang Blochinger, Wolfgang Küchlin, Christoph Ludwig, and Andreas Weber. An object-oriented platform for distributed high-performance symbolic computation. Mathematics and Computers in Simulation, 49:161-178, 1999.
6. H. Zhang, M. P. Bonacina, and J. Hsiang. PSATO: A distributed propositional prover and its application to quasigroup problems. Journal of Symbolic Computation, 21:543-560, 1996.
7. R. M. Karp and Y. Zhang. Randomized parallel algorithms for backtrack search and branch-and-bound computation. Journal of the ACM, 40(3):765-789, July 1993.
8. S. Spreeuwenberg, R. Gerrits, and M. Boekenoogen. VALENS: A Knowledge Based Tool to Validate and Verify an Aion Knowledge Base. In ECAI 2000, 14th European Conference on Artificial Intelligence, pages 731-735. IOS Press, 2000.
9. S. Garone and N. Buck. Capturing, Reusing, and Applying Knowledge for Competitive Advantage: Computer Associates' Aion. International Data Corporation, 2000. IDC White Paper.
IMPLEMENTATION OF AN INTEGRATED EFFICIENT PARALLEL MULTIBLOCK FLOW SOLVER
THOMAS BÖNISCH, ROLAND RÜHLE
High Performance Computing Center Stuttgart (HLRS), Allmandring 30, 70550 Stuttgart, Germany
Email: boenisch@hlrs.de
This paper describes the effort taken to introduce a parallel multiblock structure into the URANUS simulation code for calculating reentry flows on structured C-meshes. The new data structure, the handling of block sides at physical and inner boundaries and the load balancing approach are presented. The target is the efficient calculation of reentry flows using multiblock meshes on current and future supercomputer platforms.
1 Introduction
Calculating flows around space vehicles during the reentry phase of their mission is a challenging task. Besides the normal flow one has to consider the occurring chemical reactions, as they have a significant influence on the flow. The reason for these chemical reactions is the high temperature of the gas flow during reentry, while the space vehicle is slowed down by the friction of the air. At these temperatures the air's components, mainly nitrogen and oxygen, react with each other. Modern space vehicles have complex geometries. To calculate flows around such a geometry, there exist several approaches: unstructured meshes, block-structured meshes and other technologies like hybrid or overset meshes. Unstructured meshes can be generated automatically; mesh generation is much easier than generating a multiblock mesh. But the calculation of a flow using them costs a lot of computational power as indirect memory access methods have to be used. Therefore, efficient cache usage and vectorization are limited. As the development of memory speed cannot keep pace with the increase in processor performance, this will become worse in the future. The second method uses structured meshes, but allows several blocks of them to be assembled in an unstructured way. So, it is possible to mesh complex geometries. The calculation of such structured topologies profits from cache and vector technology. A good performance is much easier to obtain and it is normally higher than with unstructured meshes. But the mesh generation is a costly task. Sometimes it costs an expert several weeks to generate a good, so-called multiblock mesh. The existing URANUS flow solver uses structured 3D C-meshes for its calculation. Considering this and the presented arguments, we decided on the more natural step of using multiblock meshes for our calculations instead of rewriting the code completely to handle unstructured meshes.
2 The flow solver URANUS
In the URANUS (Upwind Relaxation Algorithm for Nonequilibrium Flows of the University of Stuttgart) [1,2] flow simulation program the unsteady, compressible Navier-Stokes equations in integral form are discretized in space using the cell-centred finite volume approach. The inviscid fluxes are formulated in the physical coordinate system and are calculated with Roe/Abgrall's approximate Riemann solver. Second-order accuracy is achieved by a linear extrapolation of the characteristic variables from the cell centres to the cell faces. TVD limiter functions applied on forward, backward and central differences for non-equidistant meshes are used to determine the corresponding slopes inside the cells, thus characterizing the direction of information propagation and preventing oscillations at discontinuities. The viscous fluxes are discretized in the computational domain using classical central and one-sided difference formulas of second-order accuracy. Time integration is accomplished by the Euler backward scheme. The resulting implicit system of equations is solved iteratively by Newton's method, which theoretically provides the possibility of quadratic convergence for initial guesses close to the final solution. The time step is computed locally in each cell from a given CFL number. To gain full advantage of the behaviour of Newton's method, the exact Jacobians of the flux terms and the source term have to be determined. The resulting linear system of equations is iteratively solved by the Jacobi line relaxation method with subiterations to minimize the inversion error. A simple preconditioning technique is used to improve the condition of the linear system and to simplify the LU-decomposition of the block-tridiagonal matrices to be solved in every line relaxation step. The boundary conditions are formulated in a fully implicit manner to preserve the convergence behaviour of Newton's method [3]. The usage of the sequential URANUS program on high-end workstations and on vector computers shows that the compute time and the memory requirements, especially when changing to real gas models or using fine meshes for real-world problems, are too high to use the program on these platforms without the ability to use processors in parallel. Furthermore, many topologies cannot be meshed with the C-meshes used so far. The geometric singularity of such 3D C-meshes is complicated to handle and it often limits the convergence speed. The solution for these problems is to introduce a multiblock structure into this code, which is also parallelized, because only massively parallel platforms and modern hybrid parallel computers can fulfil the program's needs in memory size and computing power.
3 Requirements for an Efficient Parallel Multiblock Solver
To be able to calculate correct solutions on a multiblock mesh in parallel, a flow solver has to have several additional features. In the serial program using c-meshes
each of the different physical boundaries is fixed to a dedicated plane of the mesh. This means, e.g., that the outflow boundaries are always fixed to the mesh plane where the first index reaches its maximum. In the multiblock case this is no longer true. Theoretically, each boundary type can appear at each of the six planes of the 3D mesh. However, usually the mesh is generated in a way where the most complicated boundary is fixed to one specific plane of a block. So, the boundary with the condition of the body wall only has to be implemented for one index direction. Nevertheless, this limits the usability of the code, because it does not work for some special topologies where at least one of the blocks has to calculate wall boundary conditions at two or more of its boundaries. The ability to turn the mesh blocks such that always the same block surface fits to the body wall requires that each of the blocks can have its own local coordinate system. This is necessary anyway as the blocks themselves are arranged in an unstructured way around the geometry. At particular points where more or less than four blocks in 2D or eight blocks in 3D, respectively, are connected together,
Figure 1. The different local coordinate systems and how they occur.
the local coordinate system of at least one of the block's neighbours has to be different (Figure 1). Furthermore, in a multiblock mesh each block can have multiple neighbours on each of its block sides, not only one as happens when cutting a C-mesh into parts or when using special mesh generators. As we are not using our own meshes within this project, we depend on meshes from our partners. Additionally, we wanted to be able to calculate results for as many cases and with as many project partners as possible. So we were not interested in limiting ourselves too much by choosing one special multiblock mesh type. We wanted to keep a high flexibility of our code. So, using meshes from project partners should be as easy as adapting and extending the I/O routines. The program parts for the parallelization, the multiblock structure and the flow simulation core should be separated as far as possible. This increases maintainability, as the modeling people can still recognize and update their flow patterns while the parallelization people can enhance the parallel performance.
As the flow simulator mentioned is memory and calculation intensive, we aim at using the newest available hardware platforms. This requires a portable code to be able to move easily from one platform to another. In the parallel case this requires the usage of a standard parallel programming model. In order to be able to run on all platform types, like MPPs, SMPs and the new hybrid architectures consisting of distributed SMP nodes, we decided to use MPI as the communication library. Currently, we use a pure distributed memory approach, as this also performs well on today's SMP machines and on hybrid architectures. In a future version it would also be possible to add OpenMP directives to profit in addition from SMP architectures. A pure SMP approach was not considered, because this would limit the platforms to be used. Additionally, the experience of NASA shows that such an approach on today's ccNUMA architectures requires an effort comparable to an implementation with MPI to achieve a good performance [4].
The parallel multiblock approach
The new parallel program needs a newly designed data structure to meet the requirements of the multiblock structure. Using Fortran90 we have the whole ability of dynamic and structured data types. Therefore, it is not longer necessary to implement a data management which stores the 3D arrays of all blocks contiguously into a one dimensional array or to waste memory by allocating the same amount of memory for each block calculated on one processor, like it was usual when using FORTRAN77 [5]. 4.1
The new data structure
In the new multiblock flow simulation program all information regarding one block is combined in a new data type. Each block is a instance of this data type having exactly the size the block needs. The blocks residing onto one process are organized in a pointered list of these instances. So, there is no waste of memory as only the really required memory is allocated. There is no additional programmer effort in manually handling the memory layout as the run time system cares for this. The information is easily accessible and the data structure itself is easy to extend to meet future requirements. The data structure for a block includes the ability to use several meshes with different resolutions for one block. So the data structure is already prepared for additional features like local refinement and multilevel or multigrid technologies. 4.2
Domain Decomposition and data exchange between the blocks
For the parallelization we used the domain decomposition approach. Each block has a two cell overlap region at the inner boundaries to maintain the second order scheme without additional communication. The values of the overlap region
62
are stored according to the local mesh values. This means that they are stored converted to the local coordinate system of the block, even if the coordinate system of the block where they are originally located is different. Consequently, the conversion of the halo cell data to the particular local coordinate system is done during the data transfer between the blocks. This communication between neighbours is necessary to update the results of the cells in the overlap region. During the solving step additional communication between the neighbours is performed to exchange intermediate data. This ensures a more accurate solution and a better convergence. Due to the irregular structure of the mesh blocks within the topology, the identification number of a neighbour block cannot be calculated from the block's own number and the block side. Additionally, a block can have several neighbours at each of its sides, as mentioned above. Accordingly, a data structure is implemented to store all the information about the relationship of the block and its particular neighbours. This information comprises the block number and the block side the local block is connected to, the neighbour block's processor as well as the orientation of the neighbour block and the part of the local block's side which adjoins the neighbour. Each block is treated separately on its processor. There is no difference in the handling of its neighbour blocks, independent of where they are located. This means the communication between two blocks residing on the same processor is also done using the regular communication routines of the program. In this case, from the processor's point of view, the processor is sending a message to itself. A specific data structure also exists for each physical boundary type of a block, where all the data of one physical boundary type is specified. It contains the subtype and the exact position of this physical boundary on the block. Using this data structure, there is no branching necessary for each of the six block sides where a physical boundary can reside. Additionally, all physical boundaries of one type at a block can be handled efficiently in a loop.
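A hypothetical C++/MPI sketch of the halo update pattern described above follows; buffer packing and the coordinate-system conversion are only indicated by comments, and the types reuse nothing from the actual Fortran90 code.

    #include <mpi.h>
    #include <vector>

    struct Halo {
        int neighbour_rank;               // processor holding the neighbour block
        int tag;                          // message tag agreed upon by both blocks
        std::vector<double> send_buf;     // packed and coordinate-converted values
        std::vector<double> recv_buf;     // received values, stored in local layout
    };

    // Update the two-cell overlap regions: post all receives, then all sends
    // (also to ourselves if both blocks of a pair live on the same rank).
    void exchange_halos(std::vector<Halo>& halos, MPI_Comm comm)
    {
        std::vector<MPI_Request> reqs;
        reqs.reserve(2 * halos.size());

        for (Halo& h : halos) {
            MPI_Request r;
            MPI_Irecv(h.recv_buf.data(), (int)h.recv_buf.size(), MPI_DOUBLE,
                      h.neighbour_rank, h.tag, comm, &r);
            reqs.push_back(r);
        }
        for (Halo& h : halos) {
            MPI_Request r;
            MPI_Isend(h.send_buf.data(), (int)h.send_buf.size(), MPI_DOUBLE,
                      h.neighbour_rank, h.tag, comm, &r);
            reqs.push_back(r);
        }
        MPI_Waitall((int)reqs.size(), reqs.data(), MPI_STATUSES_IGNORE);
        // After completion the recv_buf values are copied into the two-cell
        // overlap region of the corresponding block.
    }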
4.3 Load Balancing
To obtain good performance on any kind of parallel computer it is essential to have a good load balance between the processors allocated to the parallel job. Thus we should allocate the same portion of the load to each processor. In our case this means that every processor should perform calculations for more or less the same number of mesh cells. In contrast to an approach where blocks which are too large are cut in the middle as long as there are empty processors [6], we used a different technique: we also have to cut all blocks which are too large to fit onto one processor, but our cutting algorithm first calculates the number of pieces each block has to be cut into. This number must contain only powers of two and powers of three as divisors. If the calculated number does not meet this requirement, the next larger number of pieces which does is used. As the resulting parts may become too small, we allow a slight overload of the processors
Figure 2. Different neighbour configurations
to avoid cutting into too many pieces. Then the number of cuts necessary in each of the three dimensions is calculated in such a way that the resulting blocks are not misshapen. This is the reason for allowing only part numbers containing powers of two and three. Consider the case where we have to cut a block into fourteen pieces. Then we have to cut once in one dimension, six times in a second one, and we cannot divide in the third dimension at all. If we divide the block into 16 parts instead, we can freely choose how to place the cuts within the three dimensions. This strategy has an additional advantage when we cut all the blocks in that way: assume a block is divided two times in one dimension and its neighbour four times in the same dimension. Assuming the same local coordinate system, two of the three block parts have to communicate with two neighbour blocks each and one has to talk to three neighbours (figure 2). Having six
Figure 3. Solution of a nonequilibrium Euler flow around the X-38, angle of attack 40 degrees, Mach number 19.8; the quantity shown is the Mach number.
parts within this dimension on the neighbouring block, each of the three block parts has two neighbours and the dependencies are simpler. It can happen that we have to adjust the cutting line by one cell to ensure this (this is not yet implemented). Now we have to rearrange the resulting blocks in such a way that no processor is overloaded. For this step we use the load balancing tool jostle [7]. Jostle was primarily designed to be used for unstructured meshes, but it does not know anything about meshes itself. Like any other load balancing tool, jostle works on graphs. So we only have to translate our new block structure, obtained after the cutting step, together with its connections to the neighbours, into a weighted graph in which the block size is the weight of the node corresponding to the block. If possible we get a load-balanced block distribution and have to redistribute the blocks in the way proposed by jostle. As each block has its own data structure and is handled separately, it is easy to redistribute the blocks. Nevertheless, the time needed for this data transfer may be significant.
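A minimal sketch of the piece-count calculation described above is given below. It is written in Java for illustration only (the solver itself is Fortran90), and the treatment of the tolerated overload as a simple multiplicative factor is our assumption:

    // Sketch: smallest number of pieces, not below the required count reduced by the
    // allowed overload, whose only prime factors are 2 and 3, so that the cuts can be
    // distributed over the three dimensions without producing misshapen blocks.
    static int piecesToCutInto(long blockCells, long cellsPerProcessor, double allowedOverload) {
        long needed = (long) Math.ceil(blockCells / (cellsPerProcessor * allowedOverload));
        long n = Math.max(needed, 1);
        while (!hasOnlyFactors2And3(n)) n++;
        return (int) n;
    }

    static boolean hasOnlyFactors2And3(long n) {
        while (n % 2 == 0) n /= 2;
        while (n % 3 == 0) n /= 3;
        return n == 1;
    }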
Figure 4. Speedup for 20 Iterations on a 192 000 cell colibri mesh on the Cray T3E
5
Results
The resulting parallel multiblock flow solver has been used to perform reentry calculations for the colibri capsule and the X-38 (Figure 3), the prototype of the crew rescue vehicle for the International Space Station (ISS). Due to the use of Fortran90 and MPI the code is portable; it was tested on NEC SX-5, Hitachi SR8000, Cray T3E, IBM SP3 and a DEC alpha cluster. The convergence for the tested cases is good. The performance on the NEC SX-5e with a well-sized problem
is nearly 1.5 Gflops, which is 37% of the peak performance. The scaled speedup on the Cray T3E is given in Table 1.

Table 1: Scaled speedup on the Cray T3E

  Mesh size   Mesh blocks   Blocks after cutting (= proc. count)   Simulation time   Efficiency
  24 000      5             18                                      255.3             1.0
  192 000     5             144                                     285.7             0.893
A speedup measurement for a case with 192 000 mesh cells (figure 4) was also done on the Cray T3E. To compare the times, 20 iterations were calculated. The smallest processor number possible for this case is 72, because of memory limitations. The given speedup values are relative to the case with 72 processors. The reason for the superlinear speedup shown for 108 and 216 processors is a better block shape in these cases compared to the run on 72 processors.
6 References
1. Scholl, E., Fruhauf, H.-H.: An Accurate and Efficient Implicit Upwind Solver for the Navier-Stokes Equations, in Notes on Numerical Fluid Mechanics, Numerical Methods for the Navier-Stokes Equations, Hebecker F.-K., Ranacher R., Wittum G. (Eds.), Proceedings of the International Workshop on Numerical Methods for the Navier-Stokes Equations, Heidelberg, Germany, October 1993, Vieweg, 1994.
2. Fruhauf, H.-H., Daiß, A., Gerlinger, U., Knab, O., Scholl, E.: Computation of Reentry Nonequilibrium Flows in a Wide Altitude and Velocity Regime, AIAA Paper 94-1961, June 1994.
3. Gerlinger, U., Fruhauf, H.-H., Bonisch, T.: Implicit Upwind Navier-Stokes Solver for Reentry Nonequilibrium Flows, 32nd Thermophysics Conference, AIAA 97-2547, 1997.
4. R.B. Ciotti, J.R. Taft, XPetersohn: Early Experiences with the 512 Processor Single System Image Origin2000, Proceedings of the 42nd CUG Conference (CUG Summit 2000), 2000 (CDROM).
5. F.S. Lien, W.L. Chen and M.A. Leschziner: A Multiblock Implementation of a Non-Orthogonal, Collocated Finite Volume Algorithm for Complex Turbulent Flows, International Journal for Numerical Methods in Fluids, Vol. 23, pp. 567-588, 1996.
6. H.M. Bleecke et al.: 'Benchmarks and Large Scale Examples', in A. Schüller, 'Portable Parallelization of Industrial Aerodynamic Applications (POPINDA)', Notes on Numerical Fluid Mechanics, Volume 71, Vieweg, 1999.
7. C. Walshaw, M. Cross and M. Everett: 'Parallel Dynamic Graph Partitioning for Adaptive Unstructured Meshes', Journal of Parallel and Distributed Computing, 47(2), pp. 102-108, 1997.
DISTRIBUTED SIMULATION ENVIRONMENT FOR COUPLED CELLULAR AUTOMATA IN JAVA
MARCUS BRIESEN* AND JORG R. WEIMAR†
Institute of Scientific Computing, Technical University Braunschweig, D-38092 Braunschweig, Germany
E-mail: J.Weimar@tu-bs.de
We describe a software environment for the coupling of different cellular automata. The system can couple CA for parallelization, can couple CA with different lattice sizes, different spatial or temporal resolutions, and even different state sets. It can also couple CA to other (possibly non-CA) simulations. The complete system, which is integrated with the CA simulation system JCASim, is written in Java and uses remote method invocation as the communication method. The parallel efficiency is found to be satisfactory for large enough simulations.
1
Introduction
The concept of cellular automata is a framework for simulating highly parallel processes. They are very easy to parallelize and are often called "embarrassingly parallel". Usually, a cellular automaton consists of a regular lattice of cells, each of which contains a state from a finite set of states. From one time step to the next, all cells change their state in parallel depending on their current state and the state of a finite set of neighbors. Since the lattice is uniform, cellular automata are well suited only for the simulation of spatio-temporal phenomena in fairly uniform spatial domains. Some non-uniformity can be introduced by the initial conditions, but in other cases it would be better to combine different cellular automata for different regions in space. In this paper we demonstrate several types of combinations of cellular automata in the context of a cellular automata simulation system written in Java, JCASim 1,2. In this system, cellular automata can be described in a compact way using Java, or in a specialized CA language, CDL (which is then translated into Java). The system allows the simulation of cellular automata in one, two, and three dimensions, has several built-in boundary conditions, and allows the display of CA using text, colors, or icons.
* Present address: disy Informationssysteme GmbH, Stephanienstr. 30, D-76133 Karlsruhe, Germany.
† To whom correspondence should be addressed.
We consider five different types of coupling between cellular automata (CA), which can also be combined:
1.1
Coupling identical CA
If one or more identical CA are coupled through interactions at their borders, this can be used as a parallelization strategy if the simulation of the CA runs on different processors or computers. This option can also be used to model simulation domains which can not efficiently be covered by one rectangle, e.g., an L-shaped domain. Of course, the coupling interface must be able to connect the corresponding cells on the borders of the different sub-regions.
1.2
Coupling CA with different state sets
In order to couple two CA with different state sets, the programmer of the CA must specify how the state of a cell in one CA can be converted into the state of the other CA. In some cases this can lead to a loss of information, but this should be under the control of the designer of the CA. All the facilities required for case 1 are also required here.
1.3
Coupling CA with different spatial or temporal resolution
In some simulations, certain regions need to be resolved much better than others. This can be achieved by using a fine grid in some regions and a coarse grid elsewhere. In order to combine CA with different resolutions, one must provide interpolation routines and ensure the correct synchronization of the different CA. The interpolation is a more difficult problem for cellular automata than for finite difference or finite element methods, since CA are discrete in space and state, therefore averaging different cells is not directly possible.
1.4
Composing CA
The preceding cases all describe the coupling of different CA at a common border. A different case is the coupling over the whole space, which can also be viewed as the composition of different CA. In order for this kind of composition to work, the programmer must again provide methods for the translation of one state into the other.
68
1.5
Coupling CA with different (possibly non - CA) simulations
CA can also be coupled to other simulations that need not be cellular automata. An example could be a solver for an ordinary differential equation, which provides the boundary conditions for a CA.
2 Implementation
The JCASim system is completely implemented in Java. Therefore the coupling extension 3 is also implemented in Java. In order to enable parallelization in a distributed environment, each CA can run on a separate virtual machine (VM), and communication between them is handled via remote method invocation (RMI). An alternative would have been to use CORBA. Here we do not need the language independence of CORBA and prefer the RMI approach, as it is integrated into the language standard. Furthermore, later migration is expected to be possible with automated tools. Another possibility would be to restrict parallelism to shared memory machines and simply use multithreading. This approach will be added later for improved performance.
2.1 Structure of JCASim
Figure 1 shows the central classes of the JCASim system. A CellularAutomaton has a Lattice of a certain dimension and type. The Lattice contains Cells, and each Cell contains the State. This State is the class written by the user and contains all the information specific to a particular CA, apart from configuration options specified in the simulation system. The user writes a subclass of State which contains the variables to be held in each cell and contains the method for changing the state in one time step. Access to the neighbors is through a method of the Cell, getNeighbors(). If a cell accesses a neighbor outside the simulation region, a special object of type BoundaryHandler is called to handle this request. There are three basic boundary handlers in JCASim: PeriodicBoundaryHandler, ReflectiveBoundaryHandler, and ConstantBoundaryHandler. They return the appropriate state of the lattice at the periodic or reflected position or a special constant state. For coupled CA, a different BoundaryHandler is introduced, which is able to get a cell from the remote CA. This RemoteBoundaryHandler cooperates with a BoundaryServer on the remote site. The BoundaryHandler and the BoundaryServer also need to ensure the proper synchronization of the CA. Each CA can only start the next iteration if the cell data are not required by
Figure 1. Basic class structure of the JCASim package.
Figure 2. Sequence diagram for the transfer of cells from one CA to another.
another coupled CA. The synchronicity requirement is somewhat relaxed by the fact that each CA keeps a copy of the old state for access to the neighbors during the state transition.
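To make the structure concrete, here is a minimal sketch of a user-defined state class. State, Cell and getNeighbors() are named in the text above, but the exact signatures, the transition-method name and the example rule are our assumptions, not the real JCASim API:

    // Hypothetical user state: one boolean per cell, updated once per time step.
    public class RumorState extends State {
        private boolean informed;

        // assumed transition hook: become informed if any neighbor is informed
        public void transition(Cell cell) {
            Cell[] neighbors = cell.getNeighbors();
            for (int i = 0; i < neighbors.length; i++) {
                RumorState s = (RumorState) neighbors[i].getState();   // getState() is assumed
                if (s.informed) { informed = true; return; }
            }
        }
    }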
2.2 Remote Data Transfer
The transfer of data from one CA to the other goes through a number of stages, as shown in Figure 2. Within one time step, an access to an outside cell of CA 1 leads to a call to getOutsideState of the appropriate RemoteBoundaryHandler. The RemoteBoundaryHandler (RBH) checks whether the requested cell state is available in the cache. If so, it is returned from the cache (and no remote communication is necessary). If not, a remote method invocation to the corresponding BoundaryServer on the remote machine is initiated. The RBH requests not just one cell, but a whole slice, on the assumption that most other boundary cells will also be requested within the same time step. The request is serialized (converted into a byte stream) and sent over the network to the remote BoundaryServer, where the request is unpacked. The BoundaryServer then creates a temporary object of class SimpleLattice which contains just those cells of the lattice requested by the RBH. This object is again serialized and sent back to the requesting RBH, which then unpacks the results, stores them in its cache, and can finally deliver the requested cell state. Most of the time required for the data exchange is spent in the serialization and deserialization of the SimpleLattice transferring the cell states. This could be improved by using the Externalizable interface instead of the built-in Serializable mechanism. The Externalizable interface simply replaces the automatic conversion of objects to byte streams by a programmer-controlled conversion. The disadvantage is that the user must provide code
for this operation, while serialization is automated using the Java reflection mechanism. Even more efficient is the case where all processes run on the same multiprocessor. In this case, a simple shared-memory access is sufficient. The current implementation does not yet detect this situation.
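The Externalizable alternative mentioned above can be sketched as follows; java.io.Externalizable and its two methods are the standard Java interface, while the state class and its single field are invented for the example:

    import java.io.*;

    // Programmer-controlled encoding of a cell state, replacing default serialization.
    public class DiffusionState implements Externalizable {
        private int particleCount;

        public DiffusionState() { }                       // public no-arg constructor is required

        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeInt(particleCount);                  // write exactly the payload needed
        }

        public void readExternal(ObjectInput in) throws IOException {
            particleCount = in.readInt();                 // must mirror writeExternal
        }
    }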
2.3 Split boundaries, interpolation, and state set conversion
The case where two CA meet but do not share a complete border (see Figure 3, left) is handled by another specialized BoundaryHandler, a SplitBoundaryHandler, which in turn uses RemoteBoundaryHandlers to communicate with the remote CA. The case of different spatial resolutions is also handled by a specialized InterpolatingBoundaryHandler. In order to implement coupling case 2 (different state sets), the RemoteBoundaryHandler compares the classes of the local and the remote cell states. If they are not equal, and the local state implements the StateConversion interface, then the remote boundary handler creates new states of the local class and asks them to import the data from the remote cells.
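A rough sketch of that conversion step is shown below; StateConversion is the interface named in the text, but its method name (importFrom) and the surrounding code are assumptions made for illustration:

    // Hypothetical helper inside a remote boundary handler.
    State adaptToLocalStateSet(State localPrototype, State remoteState) throws Exception {
        if (localPrototype.getClass().equals(remoteState.getClass())) {
            return remoteState;                                        // same state set: use directly
        }
        if (localPrototype instanceof StateConversion) {
            State converted = (State) localPrototype.getClass()
                    .getDeclaredConstructor().newInstance();           // new state of the local class
            ((StateConversion) converted).importFrom(remoteState);     // designer-defined mapping
            return converted;
        }
        throw new IllegalStateException("incompatible state sets and no StateConversion");
    }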
2.4 Synchronization
Two coupled CA must be synchronized so that each request for states from the other CA can return states at the correct time step. If one CA advances too fast, the states requested by the other CA are not available any more (states from later times are stored in their place). On the other hand, the number of requests is not predetermined, since a CA might not request any outside neighboring cells at all during some time steps. Therefore separate synchronization calls are inserted, where each CA announces to its RemoteBoundaryHandlers the intention to proceed to the next time step. The RBH can delay this progress until the corresponding RemoteServer agrees that the next time step can be executed.
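One way to picture this handshake (invented code, not the JCASim implementation) is a small gate object on which the requesting side blocks until the serving side has granted the next step:

    // Sketch: the serving side grants steps; the remote handler waits for them.
    class StepGate {
        private int grantedStep = 0;

        synchronized void grantUpTo(int step) {     // called when the local CA may be read at 'step'
            grantedStep = step;
            notifyAll();
        }

        synchronized void awaitStep(int step) throws InterruptedException {
            while (step > grantedStep) wait();      // delay the neighbour's progress until agreed
        }
    }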
3 Initialization
The different CA, boundary handlers, and boundary servers are initially set up by a central configurator. This configurator has a graphical user interface for setting up the simulation. In the configurator it is possible to add a new cellular automaton to the simulation. The boundaries can be connected automatically
Figure 3. Left: Greenberg-Hastings CA testing general communication and boundary splitting. Right: Diffusion with two different approaches
or explicitly. The configuration of a simulation can be saved in XML format, which can then also be edited with a text editor or with automated tools to create larger simulations.
4 Simulation
The simulation of coupled cellular automata is controlled by a user interface which can display all of the CA together. The user specifies a time until which the simulation should be run. All the coupled CA receive this information and execute the necessary transition steps to reach this given time. The synchronization is strictly local, i.e., neighboring regions control each other. When the predetermined target time is reached, the controller interface is notified. The GUI can then copy the cell states for display to the user.
4.1 Test-cases
As a first test (Figure 3, left), we coupled three CA of the same type to verify the correct exchange at the (split) border. We selected a Greenberg-Hastings CA in which any error in the coupling can be easily spotted. As an example for coupling different types of CA, we show two different cellular automata simulating diffusion.
Figure 4. Measurement results: (a) Time in milliseconds per time step to simulate coupled CA with different sizes x and y (with 4 processors). (b) Speedup for different sizes (where the problem size scales with the number of processors p). (c) Configuration used.
One automaton uses the lattice gas approach, the other uses a finite difference averaging with probabilistic rounding 4. Figure 3 (right) shows the simulation after 300 time steps.
4.2 Speed
The speed of the simulation is influenced by a number of factors. To obtain a reasonable simulation speed in this Java-based system, a number of recommendations should be followed 2. Most importantly, no object should be created in the inner simulation loop. For CA requiring random numbers, an efficient random number generator should be used. Besides these recommendations, recent Java virtual machine implementations come with just-in-time compilers which are quite efficient. The coupling interface introduces an additional overhead. This overhead comes from the serialization, which is used to package the border cells, and the remote method invocation, which includes the network communication latency. The current implementation also does not heed the above recommendation to avoid all object creation (Java serialization creates new objects when de-serializing). To measure the overhead, we performed a number of simulations on a Linux cluster (Pentium 266). The simulations were performed using a typical CA (with diffusion and using random numbers extensively) and different sizes of the CA. We report the simulation times for four processes and indicate the size of the data chunk for each process. In this simulation the
processes are arranged in a linear array and the results are shown in Figure 4. The internal calculation of the transition function takes 0.032 ms per cell. The border exchange introduces a delay of 85 ms plus 1.02 ms per border cell. The constant delay is approximately equal to the variable delay for 82 cells, and the total delay for exchanging 82 cells is approximately equal to the time required to update 82 × 64 cells. Therefore this approach is only useful for the parallelization of large grids. The speedup (with growing problem size) is nearly linear from two up to 10 processors, since there is no sequential overhead in the synchronization. The border exchange can be improved using the Externalizable interface of Java, which reduces the communication time per cell by 25%. The constant overhead could probably be improved by careful tuning of the synchronization messages.
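As a rough consistency check of these figures (the arithmetic below is derived from the quoted constants, not an additional measurement):

\[
\frac{85\ \mathrm{ms}}{1.02\ \mathrm{ms/cell}} \approx 83\ \text{cells}, \qquad
85\ \mathrm{ms} + 1.02\ \mathrm{ms/cell}\cdot 82\ \text{cells} \approx 169\ \mathrm{ms}
\approx 0.032\ \mathrm{ms/cell}\cdot (82\cdot 64)\ \text{cells} \approx 168\ \mathrm{ms},
\]

so a border column of about 82 cells is exchanged in roughly the time needed to update a strip of 82 × 64 interior cells.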
5 Conclusion
We have demonstrated an extension to the cellular automata simulation system JCASim which allows the coupling of different CA. The CA can have different state sets and different spatial or temporal resolutions. The coupling uses the Java RMI mechanism and is mainly geared towards flexibility as opposed to speed. Parallelization can be performed using this mechanism, but is efficient only for large enough simulations. The JCASim system is available at http://www.jcasim.de/.

References
1. Uwe Freiwald. Eine Java-Simulationsumgebung für Zellularautomaten. Diplomarbeit, Inst. of Scientific Computing, Techn. University Braunschweig, 1999.
2. Jorg R. Weimar and Uwe Freiwald. JCASim - a Java system for simulating cellular automata. In S. Bandini and T. Worsch, editors, Theoretical and Practical Issues on Cellular Automata (ACRI 2000), pages 47-54, London, 2000. Springer-Verlag.
3. Marcus Briesen. Eine verteilte Simulationsumgebung für gekoppelte Zellularautomaten in Java. Diplomarbeit, Inst. of Scientific Computing, Techn. University Braunschweig, 1999.
4. Jorg R. Weimar. Simulating reaction-diffusion cellular automata with JCASim. In T. Sonar, editor, Discrete Modelling and Discrete Algorithms in Continuum Mechanics, page (to be published). Logos-Verlag, Berlin, 2001.
HUMAN EXPOSURE IN THE NEAR-FIELD OF A RADIOBASE-STATION ANTENNA: A NUMERICAL SOLUTION USING MASSIVELY PARALLEL SYSTEMS
LUCA CATARINUCCI AND PAOLO PALAZZARI
E.N.E.A. - HPCN Group, Casaccia, Roma, Italy
E-mail: palazzari@casaccia.enea.it
LUCIANO TARRICONE
D.I.E.I., Via G. Duranti 93, 06125 Perugia, Italy
E-mail: tarricone@diei.unipg.it

Abstract
The rigorous characterization of the behaviour of a radiobase antenna for wireless communication systems is a hot topic both for antenna and communication system design and for radioprotection-hazard reasons. Such a characterization requires a numerical solution, and the use of a Finite-Difference Time-Domain (FD-TD) approach is one of the most attractive candidates. It has strong memory and CPU-time requirements, and parallel computing is a suitable way to tackle this problem. In this work we discuss a parallel implementation of the FD-TD code, present the findings achieved on the APE/Quadrics SIMD massively parallel systems, and discuss results related to the human exposure to the near-field of real radiobase antennas. Results clearly demonstrate that massively parallel processing is a viable approach to solve electromagnetic problems, allowing the simulation of radiating devices which could not be modeled through conventional computing systems: a detailed numerical human phantom is used to solve the human-antenna interaction problem, and the absorbed fields inside it are estimated. The achieved solution accuracy paves the way to a rigorous evaluation of the radiofrequency safety standards in a relevant class of occupational cases.
1
Introduction
The astonishing boom of wireless communication services has focused the attention of researchers on several different aspects: theoretical radiowave propagation problems, both in open and in indoor environments [1-3], experimental set-ups for electromagnetic (EM) dosimetry [4], and optimum source dislocation [5]. Another high-impact item is the assessment of possible hazards caused by the exposure to EM fields. In this widely spread class of research, an interesting paper in the literature is focused on an occupational problem, i.e. the near-field exposure of workers in the vicinity of radiobase antennas (RBA) [6]. This latter paper is a challenging attempt to experimentally quantify the amount of EM power induced inside a homogeneous human phantom. The approach therein proposed is experimental, and no rigorous theoretical or numerical solutions are
proposed in the literature for the same problem. This is basically due to the huge numerical effort required to rigorously characterize the near-field behavior of standard RBA. In fact, in the near-field case, the plane-wave approximation does not hold, and the electric and magnetic fields must both be evaluated. Moreover, the rigorous definition of the threshold distance for the far-field zone is itself a discussed issue. Taking into account the classical distance giving the lower limit for the far-field zone [7],

\[ d = \frac{2D^2}{\lambda} \qquad (1) \]

where D is the antenna's maximum geometrical dimension and λ is the signal wavelength, we often have to deal with near-field distances in the order of 10-20 meters (we consider a typical case for RBA, such as the array of electric dipoles). The huge dimension of the near-field domain, and the need of accurate full-wave solvers for a rigorous EM characterization, turn the problem into a strong numerical effort. Therefore parallel computing, and in particular Massively Parallel Processing, seems to be the best way to solve the challenging problem of characterizing the near-field behavior of RBA. Moreover, as is immediately apparent when looking at the wide class of applications reported in [1-6], the development of such a parallel solution can be of great usefulness in a large number of different problems. In this work we recall in Section 2 the Finite Difference in the Time Domain (FD-TD) integration scheme used to solve Maxwell's equations [8-10], elected as the reference numerical method because of its versatility. In Section 3 a rationale for the choice of the parallel architecture is given. In Section 4 we describe the FD-TD implementation on the APE/Quadrics platform. A quick discussion of the numerical phantom used to solve the human-antenna interaction problem is then proposed in Section 5, and results are given in Section 6, describing the human-antenna interaction in real cases. Finally some conclusions are drawn.
2 The Finite Difference in the Time Domain (FD-TD) integration scheme
The Finite-Difference Time-Domain (FD-TD) method [8] is one of the most used approaches to solve Maxwell's partial differential equations (PDEs). Yee's FD-TD algorithm transforms the time-dependent Maxwell's curl equations into a set of finite-difference relations which have the following properties [9,10]: a) The space and time derivatives are approximated by using central discretization, thus resulting in second-order accurate expressions. b) The location of the electric (E) and magnetic (H) fields in the Yee grid implicitly enforces Maxwell's divergence equations.
c)
The time-stepping process is fully explicit and the algorithm is non-dissipative (numerical wave modes propagating in the mesh do not decay because of nonphysical effects due to the time-stepping algorithm). The FD-TD algorithm is based on temporal and 3D spatial discretizations. It uses a leap-frog integration scheme to solve Maxwell's equations: at time step t = n+1/2 it computes, at each grid point (i,j,k), the value of the H components at n+1/2 as a function of their previous value at n-1/2 in the same point plus a function of the values of the E components at time t = n in the grid points belonging to the neighborhood of (i,j,k); in a similar way the value of the E components at time step t = n+1 is computed, at each grid point (i,j,k), as a function of the same component at the previous time step (E at n) plus a function of the H components at time t = n+1/2 in the grid points belonging to the neighborhood of (i,j,k). The exact expressions for the computation are similar to the following, which is used to compute the H_x component:
\[
H_x\big|_{i,j,k}^{\,n+1/2} = D_a\big|_{i,j,k}\, H_x\big|_{i,j,k}^{\,n-1/2}
+ D_b\big|_{i,j,k}\left(
\frac{E_y\big|_{i,j,k+1/2}^{\,n} - E_y\big|_{i,j,k-1/2}^{\,n}}{\Delta z}
- \frac{E_z\big|_{i,j+1/2,k}^{\,n} - E_z\big|_{i,j-1/2,k}^{\,n}}{\Delta y}
\right) \qquad (2)
\]
being \(D_a\big|_{i,j,k}\) and \(D_b\big|_{i,j,k}\) constants which take into account the structure of the material in each grid point [10]. In order to deal with the finite dimensions of the domain on which the integration is performed, we used the 2nd order Mur absorbing boundary conditions (ABC) [11]. ABC simulate the propagation of a wave incident on the boundaries of the integration domain and try to generate no reflected waves (i.e. they simulate an infinite domain). We preferred to use the Mur instead of the PML ABC [12,13] because the computational burden is substantially smaller and they guarantee an accuracy appropriate for the addressed problem. In order to ensure numerical stability of the FD-TD method, the temporal increment step Δt must satisfy the relation [10]

\[
\Delta t \le \frac{1}{c\,\sqrt{\dfrac{1}{\Delta_x^{2}}+\dfrac{1}{\Delta_y^{2}}+\dfrac{1}{\Delta_z^{2}}}}\ \mathrm{sec} \qquad (3)
\]

being c the speed of light in the medium and \( \Delta_k = \lambda / N_k \) the spatial discretization step (k = x, y, z; N_k is the number of samples per wavelength in the k direction). In order to avoid modal dispersion, the spatial discretization step must be chosen smaller than λ/10, i.e. N_k > 10 must result [9,10]. FD-TD simulation of a domain with size \( L_x \times L_y \times L_z\ \mathrm{m}^3 \) requires a number of floating point operations
\( N_{\mathrm{Flop}} = 36 \times L_x \times L_y \times L_z \times (N/\lambda)^3 \) per time step, being N the number of samples considered for each wavelength and λ the wavelength of the electromagnetic component with the highest frequency. Such a large number of floating point operations clearly demonstrates that the simulation of EM fields at high frequencies can obtain a significant improvement when implemented on parallel systems.
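To make the cost structure of equation (2) concrete, here is a minimal sketch of the H_x sweep over one subdomain. It is written in Java purely for illustration (it is not the original APE/Quadrics code), the array names and loop bounds are invented, and the half-integer Yee indices are mapped to neighbouring array entries:

    // Sketch of the Hx update of eq. (2): Da and Db hold the material-dependent
    // coefficients; Ey and Ez are the electric-field components at time step n.
    static void updateHx(double[][][] Hx, double[][][] Ey, double[][][] Ez,
                         double[][][] Da, double[][][] Db, double dy, double dz) {
        for (int i = 0; i < Hx.length; i++)
            for (int j = 0; j + 1 < Hx[i].length; j++)
                for (int k = 0; k + 1 < Hx[i][j].length; k++)
                    Hx[i][j][k] = Da[i][j][k] * Hx[i][j][k]
                                + Db[i][j][k] * ((Ey[i][j][k + 1] - Ey[i][j][k]) / dz
                                               - (Ez[i][j + 1][k] - Ez[i][j][k]) / dy);
    }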
3 Choice of the Parallel Architecture
From the considerations presented at the end of the previous section, we decided to use parallel systems to simulate, by means of the FD-TD technique, radiobase antennas and their interaction with human phantoms. Once parallel processing is adopted, the target parallel architecture must be selected among the many available on the market. A complete analysis of this issue is quite time-consuming, and we report here just some hints. If we take into account the electrical power consumption and the HW volume of the systems, it is apparent that special purpose systems, like the ones of the APE/Quadrics series, offer better performance (for the same HW volume) than general purpose systems: in fact nearly all the silicon of the system is used to implement operations which are almost always active, thus allowing the achievement of high efficiency. Moreover, the FD-TD integration scheme belongs to a typical class of problems well supported by the APE/Quadrics massively parallel systems, and it can be efficiently programmed on them, with a consequently high performance-to-price ratio. In fact, the FD-TD algorithm has the following characteristics:
1. it is synchronous;
2. all the processors execute the same instructions on different domains;
3. it needs interprocessor communications executed in a synchronous way;
4. it has regular patterns for memory accesses, so deep memory hierarchies are not required.
Points 1 and 3 show that the time spent in synchronization phases, required by MIMD systems, is an overhead not needed by SIMD machines with synchronous communications. Point 2 shows that all the HW dedicated to managing different program flows in the processors is unnecessary, one centralized controller of program flow being sufficient, as in SIMD systems. Point 4 means that cache memory and the related policies are not needed for an efficient implementation of the FD-TD algorithm. The previous reasoning constitutes the rationale for using the APE/Quadrics SIMD systems to efficiently implement the FD-TD integration scheme.
4 FD-TD Parallel Implementation on the APE/Quadrics
In this section a sketch of the parallel implementation of the FD-TD algorithm on the APE/Quadrics machine is given. On a machine with n processors, the whole computation domain is divided into n sub-domains (with equal volume and shape); each sub-domain is assigned to a processor and adjacent sub-domains are assigned to adjacent processors (both the algorithm and the machine implementing it have a 3D topology). The EM field components are updated simultaneously in each processor through equation (2) and its companions (for the remaining E_{x,y,z} and H_{y,z} scalar components). When the computation updates a field component on the border of the domain, some values belonging to the border of the adjacent domain are required: in order to avoid communications during the computations, each sub-domain is surrounded by the border cells of the adjacent domains. These border values are communicated after the updating phase. The scheme of the parallel algorithm is given in the following:

FD-TD parallel algorithm
begin FD-TD algorithm
  Choose a spatial discretization of the domain (Δx, Δy, Δz); if the domain has dimensions (L_x × L_y × L_z), the grid has N_i × N_j × N_k points, being N_i = L_x/Δx, N_j = L_y/Δy and N_k = L_z/Δz;
  Determine the time step Δt (eq. 3);
  Partition the whole rectangular domain D = [N_i × N_j × N_k] into P = P_i × P_j × P_k rectangular subdomains D' = [N'_i × N'_j × N'_k], being P_i, P_j and P_k the number of processors along dimensions i, j and k, and N'_i = N_i/P_i, N'_j = N_j/P_j and N'_k = N_k/P_k the dimensions (expressed as numbers of grid points) of the generic subdomain D';
  for (t = 0; t < T; t = t + Δt)
    in all the processors do
      compute the new values of H
      communicate the H values on the boundary of each subdomain to its neighbour
      compute the new values of E
    enddo
    in the processors containing the source put the correct value in the feed point;
    in the boundary processors do compute the absorbing boundary conditions;
    in all the processors communicate the E values on the boundary of each subdomain to its neighbour;
  endfor
end FD-TD algorithm.

When implementing the algorithm, particular care has been put into the vectorization of memory accesses (to reduce the severe memory start-up penalty), the unrolling of critical loops (to minimize pipeline stalls and the waiting for the consequent initialization latency), and the use of the fast internal multiport register file.
5
The Numerical Phantom
A relevant role in the accurate solution of the addressed problem is played by the numerical technique utilized to represent the electromagnetic properties of the exposed human subject. The use of an appropriate numerical phantom, in fact, is the attractive issue to overcome the real problem of several proposed experimental techniques, which use homogeneous representations of the human body, with a consequent approximation error not adequate to the goals of a radio-protection analysis. The history of the development of accurate numerical phantoms is rich and long, and we address the interested reader to the specialized literature [14 and references therein]. Here we recall that a generic numerical phantom is an organized archive used to build up a detailed millimeter-resolution human model so that the dielectric and conductive properties are fixed for each working frequency in each part of the sample body. In this work we refer to one of the most appreciated phantoms, the one proposed by the Visible Human Project at Yale University, whose development is supported by [15]. It is also important to refer to [16] for a sensitivity analysis of the predicted values with respect to frequency and voxel size.
6 Results
Fig. 1 reports results for a human phantom exposed to a real radiobase antenna (Kathrein 730678). The distance between the operator and the antenna is around 60 cm. The domain is partitioned into 320×320×256 cubic cells with a 4 mm edge. The operating frequency is 902 MHz, for the well known GSM cellular system. At this frequency, the wavelength in vacuo is 33.2 cm. As the maximum relative dielectric constant in the phantom is around 60, the use of a 4 mm cubic edge is in accordance with a standard convergence criterion to be used with FD-TD codes, requiring a
minimum ratio of 10 between the wavelength and the cell's edge. Nonetheless, it is quite obvious that an adaptive meshing is nearly mandatory, and it is currently under development.
Fig. 1a,b: Human phantom exposure to a real radiobase antenna in two different configurations. Red colour is for high-intensity E fields, blue is for low amplitudes.
As for computing times and performance, for the sake of brevity we omit details here: an efficiency \( \eta = \dfrac{t_{\mathrm{seq}}}{P\, t_{\mathrm{par}}} = 6.68 \times 10^{-1} \) is achieved for a 26 214 400 cell domain, thus proving that the FD-TD integration scheme is very well suited to be implemented on massively parallel systems. The accuracy of the solution allows the correct evaluation of the E and H fields anywhere inside the human phantom, as depicted in Fig. 1a,b. It is just worth noting here that the maximum observed percentage of absorbed E field inside the phantom is around 40%, near the chin.
7 Conclusions
In this paper we have proposed an FD-TD implementation on massively parallel systems for the evaluation of radiofrequency hazards in humans exposed to the near-field of radiobase station antennas. The approach proves to be appropriate, and the FD-TD strategy extremely amenable to such a parallel implementation. E and H fields are accurately evaluated inside a numerical phantom, thus solving in a rigorous and efficient fashion a relevant problem in a wide class of real industrial applications.
References
1. G. Wolfle, R. Hoppe, F. M. Landstorfer: 'Radio network planning with ray-optical propagation models for urban, indoor and hybrid scenarios', Proc. of Wireless 99, 1999.
2. G. Wolfle, F. M. Landstorfer: 'Prediction of the field strength inside buildings with empirical, neural and ray-optical prediction models', COST 259 TD 98-008.
3. R. Hoppe, G. Wolfle, F. M. Landstorfer: 'Fast 3D Ray-tracing for the planning of microcells by intelligent preprocessing of the database', Proc. of COST 259 Workshop, 1998.
4. G. Wolfle, A. J. Rohatscheck, H. Korner, F. M. Landstorfer: 'High resolution measurement equipment for the determination of channel impulse responses for indoor mobile communications', Proc. of PIERS 98, 1998.
5. J. Zimmermann, R. Hons, H. Mulhenbein: 'The antenna placement problem for mobile radio networks: an evolutionary approach', Proc. 8th Conf. Tel. Systems, pp. 358-364, 2000.
6. A. Bahr, D. Manteuffel, D. Heberling: 'Occupational safety in the near field of a GSM base station', Proc. of AP2000, Davos, April 2000, 3A6.7.
7. J. D. Kraus: 'Antennas', McGraw-Hill, 1988.
8. K. S. Yee: 'Numerical solution of initial boundary value problems involving Maxwell's equations in isotropic media', IEEE Transactions on Antennas and Propagation, AP-14, 4, pp. 302-307, 1966.
9. A. Taflove and M. E. Brodwin: 'Numerical solution of steady-state electromagnetic scattering problems using the time-dependent Maxwell's equations', IEEE Transactions on Microwave Theory and Techniques, MTT-23, 8, pp. 623-630, 1975.
10. A. Taflove: 'Computational Electrodynamics: The Finite-Difference Time-Domain Method', Norwood, MA, Artech House, 1995.
11. G. Mur: 'Absorbing boundary conditions for the Finite-Difference approximation of the Time-Domain Electromagnetic-Field equations', IEEE Transactions on Electromagnetic Compatibility, EMC-23, 4, pp. 377-382, 1981.
12. J.-P. Berenger: 'A perfectly matched layer for the absorption of electromagnetic waves', J. Computational Physics, vol. 114, 1994.
13. J.-P. Berenger: 'A perfectly matched layer for the FD-TD solution of wave-structure interaction problems', IEEE Antennas Propag. Symp., vol. 44, n. 1, 1996.
14. C. Gabriel, S. Gabriel, E. Corthout: 'The dielectric properties of biological tissues: I. Literature Survey', Phys. Med. Biol., vol. 41, pp. 2231-2249, 1996.
15. C. Gabriel: 'Compilation of the Dielectric Properties of Body Tissues at RF and Microwave Frequencies', Brooks Air Force Technical Report AL/OE-TR-1996-0037.
16. P. A. Mason et al.: 'Effects of Frequency, Permittivity and Voxel Size on Predicted SAR values in biological tissues during EMF Exposure', IEEE Transactions on Microwave Theory and Techniques, MTT-48, 11, pp. 2050-2061, 2000.
Tranquillity Mapping Using a Network of Heterogeneous PC
Andrea Clematis, Monica De Martino, Giulia Alessio
Istituto per la Matematica Applicata - Consiglio Nazionale delle Ricerche, Via De Marini 6, 16149 Genova, Italy
Sara Bird, Susanna Feltri
Amministrazione Provinciale di Savona, Via Sormano 6, 17100 Savona, Italy
The use of parallel processing to speed up Geographic Information System (GIS) algorithms and applications has been widely considered and documented. In this paper we present our experience and early results on the use of parallel computing to speed up tranquillity mapping, a methodology aimed at supporting the landscape assessment process, which is of interest to public administrations. The parallelisation of the tranquillity mapping algorithm is described. Since the final goal is to make it possible to use the parallel algorithm in a public administration, aspects related to the use of a heterogeneous network of Personal Computers are addressed.
1 Introduction
The usefulness of parallel processing for Geographic Information System (GIS) applications is widely documented in several reports and research papers (Healey98). Experiences are available about the development of parallel algorithms which improve the performance of different costly GIS functions like raster to vector data conversion, Digital Terrain Model (DTM) and Triangulated Irregular Network (TIN) construction, drainage basin identification, morphological analysis and others. In this paper we are interested in the use of GIS for tranquillity mapping (Bell99), a process aimed at producing a thematic map which shows the levels of disturbance or tranquillity of a region. This process may be considered as a part of landscape assessment (Cudlip99), and it is becoming of interest for public administrations as well as for other subjects like tourist operators or estate agents. The reported experience is part of the PLAINS project (Prototype Landscape Assessment Information System), a research project sponsored by the European Union as a Centre for Earth Observation (CEO) activity. The parallelisation of the tranquillity mapping process for a heterogeneous network of PCs is described.
2 Tranquillity mapping
Tranquillity mapping concerns the idea of mapping areas of landscape that could be regarded as tranquil: away from noise and visual intrusion. This concept was introduced and developed by the ASH Consulting Group in England in 1991, first
for a study for the Department of Transport, and then for the Countryside Commission and Council for the Protection of Rural England (Bell99). If a scale or spectrum of tranquillity is drawn up, a "complete" tranquillity would correspond to the absence of unnatural noise and visual intrusion, while at the other end an absolute lack of tranquillity would exist where noise is greatest and/or visual intrusion is significant. A degree of tranquillity depends on the proportion and combination of effects between these two extremes. Moreover, it is necessary to consider that the terrain morphology may affect or improve tranquillity. Here a tranquillity mapping methodology is proposed which takes into account the effect of the presence of disturbance sources. The methodology is characterised by two main activities:
1. Input data identification and acquisition;
2. Spatial and attribute data processing for tranquillity map generation.
Parallel computing will be used to speed up the tranquillity map generation step.
2.1 Input data identification and acquisition
In our approach the basic input data is the vector cartography of the investigated region. The region we have considered is part of Liguria in North-West Italy, and it is characterised by a complex terrain morphology. For this reason we adopted a 1:5000 scale for the digital cartography. For other regions with a more regular terrain morphology a smaller scale could be adopted. The other input data are disturbance values, represented by attributes attached to each geo-referenced object which is a source of disturbance. In the present version of our tranquillity mapping process we have considered factors concerning noise and visual intrusion generated by three categories of unnatural sources, which are road and railroad networks, industrial sites, and urban centres. Each object belonging to one of these categories must be contained in the spatial database as a single feature or as a set of features according to the appropriate geometry: linear features for road and railroad networks and area features for industrial and urban centres. A disturbance level is attached to each single feature. The whole set of features of a category is identified as a theme. We have defined 8 disturbance levels: 1 is for highly negative influence and 8 is for low influence. The definition of a tranquillity spectrum extends the effect of the disturbance level to the influenced area using an appropriate set of distance intervals to identify different areas ranging from very disturbed to very tranquil or remote. See (Alessio01) for a table which defines a tranquillity spectrum schema.
2.2 Spatial and attribute data processing for tranquillity map generation
This activity may be split in two phases:
• Calculation of a tranquillity map for each theme or category of features;
• Generation of a final map as a synthesis of all the generated tranquillity maps.
For each theme, the tranquillity mapping process is characterised by the following steps:
• Group features in distinct classes in accordance with their disturbance level;
• For each class generate a buffer layer using the table of distance ranges;
• Combine buffer layers generated by different classes of disturbance;
• Map optimisation.
Considering that we group features according to the disturbance levels, which are eight, we generate a map with eight buffer layers. The buffer layers are then analysed and synthesised: buffers are combined to generate a unique map by using the overlay analysis operation. The resulting map is characterised by a lot of polygons obtained by combining all buffers in such a way that the minimum tranquillity scale (corresponding to a maximum disturbance) of the combined buffers is associated to the polygons. The tranquillity map is optimised by merging adjacent areas with the same tranquillity scale value, in order to reduce the number of distinct polygons. The tranquillity maps derived from the different themes are finally merged together by overlay and a final optimisation is done.
Figure 1: A tranquillity map
Figure 1 shows the tranquillity map for the Letimbro drainage basin in Savona District (Liguria Region, North-West Italy).
3 Parallelising TM: outline of a strategy for a heterogeneous environment
The tranquillity mapping algorithm exhibits a high computational cost: its execution time on an unloaded workstation equipped with a professional GIS environment is about 12 hours for an area of 400 km² characterised by a complex morphology and by a medium density of disturbance features. Thus, speeding up the mapping algorithm will lead to important improvements, especially for simulation purposes.
3.1 The Parallel Program Outline
The tranquillity mapping algorithm makes it possible to process independently tasks which may be obtained by partitioning the data domain using a logical approach (different themes), a spatial approach (different sub-regions), or a mix of the two. Putting together the results of the processing of two or more independent tasks requires properly overlaying maps, if the tasks correspond to different themes of the same sub-region, or joining maps, if the tasks correspond to neighbouring sub-regions, and finally executing a map optimisation. Note that the overlay operation is the same used in the sequential algorithm, while join does not present particular problems. In the present implementation, if a feature lies over a boundary between two sub-regions it is assigned to one of the two regions. This is sufficient to ensure the correctness of the parallel algorithm. Figure 2 provides an outline of the parallel program for tranquillity mapping. The parallel algorithm is based on the master-worker paradigm and is divided in three distinct phases. In the first phase the master carries out data distribution and schedules a task for each worker. In the second phase each worker processes tasks and notifies the master after each task completion. Results of computation are kept in the local database of each worker during this phase. When all tasks have been completed, the third phase, dedicated to the overlay and join operations, may start. These operations are executed as much as possible in parallel, trying also to minimise data transfer among workers. A tree-like approach is used to overlay themes of each sub-region and to join maps of different sub-regions. The number of independent tasks progressively decreases during this phase of the computation, and workers leave the computation following master instructions.
3.2 Load Balancing and data partitioning
Load balancing is the key point to make the parallel algorithm efficient, and to obtain load balancing we have to consider the computational cost of each task.
Phase 1: data distribution and start up
Master: {
  copy dataset to each worker local disk;
  schedule the first set of tasks;
  send TaskActivation message to each worker;
}

Phase 3: Maps overlay and join
Master: {
  co-ordinate workers to properly overlay thematic maps for each drainage basin;
  co-ordinate workers to properly join drainage basin maps;
  accept and merge final dataset
}
Worker: {
  receive information from master about WhatToDo;
  select WhatToDo:
    Overlay: overlay themes and optimise map following master directives;
    Join: join and optimise regions following master directives;
    AcceptData: merge in your local database data provided by another worker;
    ProvideData: copy your local database to indicated worker location;
    GiveUp: exit computation
}
Figure 2: Parallel tranquillity mapping

This cost depends on the number of features contained in the task and, for tasks with a similar number of features, on their density. In fact a higher density of features will lead to a higher fragmentation of the intermediate maps, thus making the optimisation phase more costly. Domain partitioning is primarily obtained using the different themes. Since the different themes contain a different number of features, we have to find a way to further subdivide more populated themes into appropriate
sub-themes. In the considered case this has been done by splitting the transportation infrastructure theme into four sub-themes, namely railroads, highways, primary roads and secondary roads. In this way we obtained a limited number (six in the considered case) of independent tasks with a similar computational cost. Having a small number of tasks with the same computational cost ensures load balancing using few homogeneous resources. Actually we are interested in the use of heterogeneous resources. In this case we may further divide the sub-domains so as to generate a sufficient number of independent tasks with a finer granularity, which permits exploiting the self-balancing properties of master-worker programs (Bacci99, Schmidt94). The thematic sub-domains have been further divided using a spatial approach, identifying independent sub-regions. The sub-regions have been derived using the natural subdivision represented by drainage basins. In the studied region we have eight main drainage basins, which combined with the six themes lead to a set of 48 tasks of sufficiently fine granularity to ensure an acceptable load balancing for the available set of heterogeneous nodes we have considered (see Section 4). Using a mix of thematic and spatial subdivision we obtain tasks with different computational costs. It is simple to predict the approximate relative computational cost of each task by considering the number of features and their density, the latter obtained using the area of the basin. In the present implementation we based scheduling mainly on a locality principle: the master tries to assign the set of tasks corresponding to a sub-region to a single worker as much as possible, while still keeping load balancing. The master assigns tasks to processors considering the relative cost of sub-regions, the relative speed of processors, and a simple round robin strategy. This scheduling policy reduces the communication of large data sets during the third phase of the algorithm, since the overlay operations may be executed using data available on each computing node.
4 Experimental results and conclusions
We tested the parallel tranquillity mapping algorithm on six personal computers running Windows NT and connected on a local area network based on switched and mixed Ethernet / Fast-Ethernet. In this implementation the data warehouse is held by an office-type database, and the software is implemented using an object-based proprietary library for GIS functions and PVM for communications. The PCs are equipped with Pentium III processors with different clock frequencies (from 350 to 700 MHz) and 128 Mbyte of RAM. The measured network bandwidth is around 1.7 Mbyte/s. The relative speeds of the computing nodes are 1, 1.3, 1.5, 2.1, 2.3, 3.0, and are calculated assuming the GIS workstation used to run the sequential version of the algorithm as the reference node, with speed equal to 1. The relative speed has been measured executing the same task on the unloaded nodes.
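A minimal sketch of a cost-weighted assignment of sub-regions to nodes is given below. It is written in Java for illustration only: all names are invented, the costs and relative speeds are assumed to be estimated as described in Section 3.2, and the greedy rule shown here is a simplification rather than the locality-plus-round-robin policy actually used by the master:

    // Sketch: give the next (costliest-first) sub-region to the worker whose
    // assigned cost divided by its relative speed is currently the smallest.
    static int[] assignSubRegions(double[] regionCost, double[] workerSpeed) {
        int[] owner = new int[regionCost.length];
        double[] assigned = new double[workerSpeed.length];
        Integer[] order = new Integer[regionCost.length];
        for (int r = 0; r < order.length; r++) order[r] = r;
        java.util.Arrays.sort(order, (a, b) -> Double.compare(regionCost[b], regionCost[a]));
        for (int idx : order) {
            int best = 0;
            for (int w = 1; w < workerSpeed.length; w++)
                if (assigned[w] / workerSpeed[w] < assigned[best] / workerSpeed[best]) best = w;
            owner[idx] = best;
            assigned[best] += regionCost[idx];
        }
        return owner;
    }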
The total available relative computing power is 11.2. The program has been tested by running it on the dataset of the region described in Section 2. We recall that running the sequential code for this region on the GIS workstation takes about 12 hours. To discuss the performance of the parallel program it is useful to consider its structure as depicted in Figure 2. Phase 1 represents an overhead with respect to the sequential algorithm. The main cost of this phase is due to the dataset transfer to the local disk of each node. The input dataset for the considered region is around 80 Mbytes. The measured time for this phase is around 5 minutes. Phase 2 is dedicated to task processing, and its execution corresponds with good approximation to theme processing in the sequential algorithm. This part of the sequential algorithm represents about 70% of the total execution time, which is around 8 hours and 30 minutes. The measured time for this phase of the parallel program is about 61 minutes, which leads to a speed-up of 8.4 and a weighted efficiency around 75%, considering the available computational power. During this phase speed-up is mainly limited by load imbalance, since the communication is limited to the exchange of a few short messages between master and workers. Monitoring the load of each worker we note an over-scheduling of the node with relative speed equal to 2.3. Indeed this node computes two complete drainage basins (sub-regions) with a quite large number of features, and its relative load overwhelms its relative speed. Phase 3 is dedicated to map overlay and join. The overlay operation corresponds to a large extent to the map optimisation process in the sequential algorithm. The join operation, the dataset transfer and merging represent overheads of the parallel algorithm. The measured execution time for this phase is around 42 minutes. The whole parallel algorithm takes about 110 minutes, thus leading to a speed-up around 6.5 and a weighted efficiency around 54%, again considering the available computational power, which is quite satisfactory as a first result. Up to now we have not had the opportunity of experimenting with the algorithm on different data sets, because of the lack of suitable input data collections. However, we made some simulations to better assess the performance of the algorithm. The simulation considers as input parameters the number of processing nodes, their relative speeds, the number of tasks and their costs, the communication cost, and the tree structure used to build up the complete solution by properly combining single task results by overlay and join operations. The simulation is based on a simple analytical model of the computation derived from (Pasetto97, Clematis98). The simulation results show that the algorithm should maintain the observed performance in most cases, provided that data partitioning is properly supervised.
In future work we plan to replace the simple round-robin strategy for sub-region assignment to processors with more sophisticated list scheduling heuristics.
The reported experience shows that it is possible to use parallel computing for GIS applications developed using proprietary GIS, and to make it available in the public administration domain. The use of object based technology seems to provide an effective possibility, while waiting for more advanced solutions which will be able to exploit the computing power provided by large, distributed, and truly heterogeneous GIS systems (Dowers99).
Acknowledgements
Part of this work has been supported by the PLAINS project funded by the European Union (Project ENV-4CT98-0753). The cartographic data used to produce the Tranquillity Mapping are property of Regione Liguria. The traffic data and other data used to derive disturbance levels have been provided by Savona Province.
References
(Alessio01) G. Alessio, S. Bini, A. Clematis, M. De Martino, "Landscape characterisation", Tech. Rep. PLAINS-IMA-DE2, March 2001.
(Bacci99) Bacci B., Danelutto M., Pelagatti S., Vanneschi M., "SkIE: a heterogeneous environment for HPC applications", Parallel Computing, 25 (1999) 1827-1852.
(Bell99) Bell S., "Tranquillity Mapping as an Aid to Forest Planning", Information Notes, March 1999, Forestry Commission (www.forestry.gov.uk/publications/index.html).
(Clematis98) A. Clematis, A. Corana, "Performance Analysis of SPMD algorithms on a network of workstations with virtual shared memory", in "Parallel Computing: Fundamentals, Applications and New Directions", E.H. D'Hollander et al. Eds., 1998, pp. 657-664, North Holland.
(Cudlip99) Cudlip W. et al., "A new information system in support of landscape assessment: PLAINS", Computers, Environment and Urban Systems, 23 (1999) 459-467.
(Dowers99) Dowers S., "Towards a framework for high-performance geocomputation: handling vector-topology within a distributed service environment", GeoComputation 99 (www.geovista.psu.edu/geocomp/geocomp99).
(Healey98) Healey R. et al., "Parallel processing algorithms for GIS", Taylor and Francis, 1998.
(Pasetto97) Pasetto D., Vanneschi M., "Machine independent analytical models for cost evaluation of template-based programs", Fifth Euromicro Workshop on Parallel and Distributed Processing, IEEE Computer Society, Jan. 1997, pp. 485-492.
(Schmidt94) Schmidt B.K., Sunderam V.S., "Empirical Analysis of Overheads in Cluster Environments", Concurrency: Practice & Experience, 6 (1994) 1-32.
Parallel skeletons and computational grain in quantum reactive scattering calculations
Stefano Crocchianti, Antonio Lagana, Leonardo Pacifici, Valentina Piermarini
Department of Chemistry, University of Perugia, Via Elce di Sotto, 8, 06123 Perugia, Italy
Abstract
Two quantum reactive scattering computational approaches have been analyzed for parallelization. The parallel structuring of the two codes has been carried out using both the constructs based on the MPI library and the skeletons defined by the SkIE coordination language.
1 Introduction
In order to design efficient parallel computational procedures for dealing with complex chemical applications one has first to map the physical problem into a suitable mathematical model and then to transform the mathematical problem into an appropriate algorithmic structure. This process is not simple, since there is no unique correspondence relating physical, mathematical and algorithmic parameters. A way of learning how these parameters are linked together is to single out of the existing codes the structures suitable for parallelization 1. In this paper we examine two computational procedures taken from the problem solving environment SIMBEX 2 (an a priori simulation of molecular beam experiments). Typically, these procedures consist of a central, computationally highly demanding section. This central section can be schematized as one (or more than one) block(s) 3 having the structure given in Scheme 1:

LOOP on a
  LOOP on b
    ...
      LOOP on z
        G = g(a, b, ..., z; α, β, ..., ω)
      END the z loop
    ...
  END the b loop
END the a loop

Scheme 1: The scheme of the typical central section of a reactive scattering code.

where latin letters are used to indicate parameters for which the calculation of G is separated into uncoupled (albeit, eventually, nested) loops while greek letters are used to indicate parameters for which the calculation of G is not decoupled into independent computations.

2 The computational procedures
The G block of the two quantum reactive scattering programs considered here is devoted to the integration of the Schrödinger equation for reactive processes 4. The first of these programs is ABM. This program constructs the coupling matrix of the reactive scattering differential equations 5 obtained in quantum time independent approaches. The second program is TIDEP. This program integrates in time the quantum time dependent Schrödinger equation after collocating the system wavepacket on a proper multidimensional grid 6. Both programs store calculated quantities that, owing to their large size, cannot be kept in memory. This allows the use of these data by the other programs in which the complex computational procedure is articulated.

2.1 ABM
At a given value of the total angular momentum quantum number J (in some approaches this is performed only at a given reference value of J), the G block of the ABM program 7 calculates, for every point of a grid (the grid along the reaction coordinate, which is divided into sectors), the eigenfunctions (given as a combination of some primitive functions) of a proper portion of the Hamiltonian. The program also calculates the related eigenvalues and the overlap integrals between the eigenfunctions of adjacent sectors. The structure of ABM is given in Scheme 2:

LOOP on the reaction coordinate grid-points
  LOOP on Λ
    Build the local primitive functions
    Evaluate local eigenfunctions and eigenvalues
    Store on disk the eigenvalues
    IF (not first grid-point) THEN
      Calculate overlaps with previous grid-point eigenfunctions
      Store on disk the overlap matrix
    END IF
    Calculate the coupling matrix
  END the Λ loop
  Store on disk the coupling matrix
END the reaction coordinate grid-point loop

Scheme 2: Scheme of the G block of the ABM program.
where the two nested loops run over the grid-points of the reaction coordinate and the 2J + 1 projections Λ of the total angular momentum J on the z axis of a body fixed frame. The central section of the G block calculates the local eigenvalues and surface functions at each allowed value of the projection Λ. In the same section, the overlaps between surface functions calculated at neighbouring grid-points and the various contributions to the coupling matrix are calculated. This tightly couples each cycle of the loop on the grid-points to the previous one. Overlaps and eigenvalues are stored on disk for use by subsequent programs.

2.2 TIDEP
The G block of the time-dependent program TIDEP propagates in time t the real part of the wavepacket describing the system 6. At the beginning (t = 0), for a given pair of values of J and Λ, the system wavefunction, expressed in a proper functional form using a suitable set of coordinates, and provided with a given amount of energy 6, is collocated on a sufficiently fine grid of the spatial coordinates. Then the integration in time is performed by repeatedly applying, for about 10^4 to 10^5 times, the propagator, which involves some time consuming matrix operations. After each time step propagation, the wavepacket is expanded, in terms of product states, at an analysis line placed in the product region. The coefficients of the expansion are stored on disk for use by subsequent programs. The structure of TIDEP is given in Scheme 3:

LOOP on t
  LOOP on Λ
    Perform time step integration
    Perform the asymptotic analysis
    Store C(t) coefficients
  END loop on Λ
END loop on t

Scheme 3: Scheme of the G block of the TIDEP program.
where the two nested loops run over time and Λ. There are, obviously, some outer loops running on J and on other initial conditions, like the vibrational (v) and the rotational (j) quantum numbers, which are fully decoupled.

3 Suitable parallelization schemes
The two programs allow the exploitation of parallelism at levels external to the G block in an SPMD fashion. In the case of TIDEP this can be applied, for example, to the v and j quantum numbers. For both TIDEP and ABM it can also be applied at a lower level to J. This choice is, however, not exempt from problems, since an increase of J makes the associated matrices very large and the computing time very long, leading to memory and load imbalance problems. This is the main reason for pushing the decomposition to a finer level inside the G block, where a natural parallelization is not applicable. To this end different actions can be undertaken depending on the program considered and, sometimes, on the physics of the problem.

3.1 ABM
In ABM, the coarsest grain of G for which parallelization can be enforced is the loop over the reaction coordinate grid-points. Single grid-point eigenfunction calculations and overlap integral (with eigenfunctions of the previous grid-point) evaluations are, in fact, computational tasks small enough to be easily left to a single processor. At the same time, the related memory requirements are limited enough to be handled by the node local memory. This makes it convenient, in certain cases, to group the calculations related to several grid-points and assign them to a single processor in a static way using a data parallel approach (for example, this is what has been found convenient when dealing with reduced dimensionality techniques 8). For procedures based on full dimensional approaches, CPU and memory requirements are so large that each grid-point calculation has to be assigned to an individual processor and the distribution has to be performed by adopting a dynamic task farm model. However, to make this parallelization scheme viable, the evaluation of the overlap integrals considered has to be decoupled by repeating the calculation of the surface functions performed on the preceding node 7.
In the actual implementation of ABM, the master process sends to each worker a grid-point for which the eigenfunctions are to be calculated. Once the calculation is performed, the eigenvectors are stored on disk for use by the processor dealing with the subsequent reaction coordinate grid-point. The next processor retrieves from disk the preceding grid-point eigenvectors and evaluates the eigenfunctions at the quadrature points of the current sector. To prevent attempts to read information that has not yet been stored, nodes are synchronized before reading from disk. This parallelization scheme has been implemented using MPI 9 and the reactive probabilities of the Li + HF reaction 10 have been calculated. Typical processor time measurements for runs performed on the Cray T3E of EPCC (Edinburgh, UK) using 32 (solid line) and 64 (dashed line) processors are shown in Figure 1. The excellent performance of the model is confirmed by the good balance of the load assigned to the various processors, whose values deviate only by a few percent from the average time.
Figure 1: Frequency of percentual deviation from the average processor time (32 and 64 processors).

Calculated speedups S, shown in Figure 2, also indicate that the program scales quite well, since for up to 128 processors the speedup is never smaller than 70% of the ideal value.
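For readers unfamiliar with the dynamic task farm used here, the following is a minimal sketch of the master/worker pattern described above, written with plain MPI point-to-point calls. It is an illustration rather than the authors' code: the function compute_grid_point(), the number of grid-points and the message tags are assumptions, and the disk storage of eigenvectors and the synchronization before reading the previous grid-point data are omitted.

#include <mpi.h>

#define TAG_WORK 1
#define TAG_DONE 2
#define TAG_STOP 3

/* Hypothetical placeholder for the per-grid-point work of the ABM G block
   (eigenfunctions, overlaps with the previous sector, coupling matrix terms). */
static void compute_grid_point(int gp) { (void)gp; }

int main(int argc, char **argv)
{
    int rank, size, n_points = 256;   /* number of reaction coordinate grid-points (assumed) */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                  /* master: deals grid-points out dynamically */
        int next = 0, active = 0, gp, w, stop = -1;
        MPI_Status st;
        for (w = 1; w < size; w++) {
            if (next < n_points) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            } else
                MPI_Send(&stop, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
        }
        while (active > 0) {
            MPI_Recv(&gp, 1, MPI_INT, MPI_ANY_SOURCE, TAG_DONE, MPI_COMM_WORLD, &st);
            active--;
            if (next < n_points) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            } else
                MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
        }
    } else {                          /* worker: computes until told to stop */
        int gp;
        MPI_Status st;
        for (;;) {
            MPI_Recv(&gp, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            compute_grid_point(gp);
            MPI_Send(&gp, 1, MPI_INT, 0, TAG_DONE, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}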
3.2 TIDEP
Direct calculations cannot be used to decouple the t loop of TIDEP inside the G block. The solution is, in fact, recursively accumulated at each iteration. Nonetheless, a partial decoupling can be introduced by assuming that Λ is conserved within each sector during the propagation. This confines the coupling between different Λ blocks of the solution matrix to that exerted by the Coriolis term of the Hamiltonian on adjacent blocks. Test calculations were performed on the ORIGIN 3800 at CINECA (Bologna, I) 11 using a task farm model for the O(1D) + HCl atom-diatom reaction 12 on a grid of dimension 127 x 119 x 8. In this model node zero acts as a master. It performs preliminary calculations and distributes fixed J and Λ propagations to the workers. Average processor computing times for J and Λ pair calculations are given in Figure 3 as a function of J (each J calculation includes all the fixed Λ components of one parity). The excellent performance of the model is confirmed by the fact that the average processor computing time increases only by about 20% in going from J = 0 to J = 13.
Figure 2: Limit (dashed line) and measured speedup (solid line) for the ABM program.

Speedups calculated on the ORIGIN 3800 are shown in Figure 4. They clearly indicate that the scalability of this program is very satisfactory for platforms with up to 16 processors.

4 Towards a SkIE based approach
The detailed analysis of a computational application (such as the one outlined above for ABM and TIDEP) requires a great deal of know-how, not only about the physics background of the code itself but also about the computing platform adopted. An increasingly popular approach is to make use of parallelization environments like SkIE (Skeleton-based Integrated Environment) 13 and to exploit the features of the related coordination language (SkIE-CL). The use of SkIE-CL makes the parallel restructuring of the considered codes and the exploration of alternative parallelization strategies extremely simple, as shown below.

4.1 ABM
The parallelism on the initial conditions of ABM (outer loops) can be dealt with by SkIE as data parallelism using a map structure (in which the same computation is applied to all the elements of a data array).
In this case, each computation is totally independent of the others, and the main body of the program preserves its sequential structure. On the other hand, when the parallelization is pushed to the finer grain of the reaction coordinate grid-points, inside G, a farm construct needs to be considered in order to assign a virtual processor to each grid-point calculation in a dynamic way. Accordingly, a certain number of grid-point calculations are sent in a data stream and then each of them is assigned to a worker of the farm. As soon as a worker completes its task, it receives the next grid-point calculation channeled by the stream (here, it would not be necessary to recalculate the surface functions of the preceding grid-point if a shared memory facility could be used).

Figure 3: Average processor computing time.
4.2 TIDEP
Also TIDEP can use a map construct to parallelize the outer loops on v and j (at a fixed value of J). A certain amount of work-load imbalance can arise when j and v become large and require a denser grid (and, therefore, a larger dimension of the matrices). When J increases, however, because of the large number of Λ projections, fixed J calculations do not fit into the node memory. To push the parallelization to a finer level it is appropriate to nest a farm construct (to handle the correct stream of the total angular momentum quantum number values) with another farm, to distribute the related Λ calculations within each J value, provided that at each step a mechanism is developed to feed the Coriolis terms of the Hamiltonian with the adjacent fixed Λ blocks of matrices to be manipulated.

Figure 4: Limit (dashed line) and measured speedup (solid line) for the TIDEP program.

5 Conclusions
The parallelization of complex computational chemical applications can be pushed to a fine grain level by carefully handling parallelization libraries (such as MPI) and by introducing suitable decoupling schemes. In this paper, the decoupling has been exploited both by introducing (at least partially) direct calculations and by enforcing some dynamical constraints. This offers the advantage of leaving the control of the efficiency of the parallelization with the user while the code is being structured by inserting MPI calls. As an alternative, the use of coordination languages, which allow the user to design a parallel model at an abstract level, has also been experimented with. This has made the investigation of different parallel models straightforward and the introduction of decoupling approximations simple. Our work has, however, singled out some critical features of SkIE, namely the rigidity of the basic skeletons and the lack of a shared memory structure. These problems will be taken care of by ASSIST, a coordination language resulting from an evolution of SkIE 14.

6 Acknowledgements
Thanks are due to CINECA (Bologna, Italy), EPCC (contract ERB FMGE CT95 0051, Edinburgh) and CESCA-CEPBA (contract ERB FMGE CT95 0062, Barcelona) for computer time grants. We also thank CNR, ASI and MURST for funding the present research.

References
1. A. Lagana, S. Crocchianti, A. Bolloni, V. Piermarini, R. Baraglia, R. Ferrini, D. Laforenza, Computational Granularity and Parallel Models to Scale up Reactive Scattering Calculations, Comp. Phys. Comm. 128, 295 (2000).
2. O. Gervasi, D. Cicoria, A. Lagana, R. Baraglia, Animazione e Calcolo Parallelo per lo Studio delle Reazioni Elementari, Pixel 10, 19 (1994); R. Baraglia, D. Laforenza, A. Lagana, A Web Based Metacomputing Problem-Solving Environment for Complex Chemical Applications, Lecture Notes in Computer Science 1971, 111 (2000).
3. A. Lagana, Innovative Computing and Detailed Properties of Elementary Reactions Using Time Independent Approaches, Comp. Phys. Comm. 116, 1 (1999).
4. A. Lagana, A. Riganelli, Computational Reaction and Molecular Dynamics: from Simple Systems and Rigorous Methods to Large Systems and Approximate Methods, Lecture Notes in Chemistry 75, 1 (2000).
5. G.A. Parker, S. Crocchianti, M. Keil, Quantum Reactive Scattering for Three Particle Systems Using Hyperspherical Coordinates, Lecture Notes in Chemistry 75, 88 (2000).
6. G.G. Balint-Kurti, Time Dependent Quantum Approach to Chemical Reactivity, Lecture Notes in Chemistry 75, 74 (2000).
7. A. Lagana, S. Crocchianti, G. Ochoa de Aspuru, R. Gargano, G.A. Parker, Parallel Time-Independent Quantum Calculations of Atom-Diatom Reactivity, Lecture Notes in Computer Science 1041, 361 (1995); A. Bolloni, A. Riganelli, S. Crocchianti, A. Lagana, Parallel Quantum Scattering Calculations Applied to the Dynamics of Elementary Reactions, Lecture Notes in Computer Science 1497, 331 (1998).
8. R. Baraglia, D. Laforenza, A. Lagana, Parallelization Strategies for a Reduced Dimensionality Calculation of Quantum Reactive Scattering Cross Section on a Hypercube Machine, Lecture Notes in Computer Science 919, 554 (1995).
9. Message Passing Interface Forum, Int. J. of Supercomputer Applications 8 (3/4) (1994); M. Snir, S. Otto, S. Huss-Lederman, D. Walker, J. Dongarra, MPI: The Complete Reference (1996), MIT Press.
10. A. Lagana, A. Bolloni, S. Crocchianti, Quantum Isotopic Effects and Reaction Mechanism: the Li + HF Reaction, Phys. Chem. Chem. Phys. 2, 535 (2000).
11. V. Piermarini, L. Pacifici, S. Crocchianti, A. Lagana, G. D'Agosto, S. Tasso, Parallel Methods in Time Dependent Approaches to Reactive Scattering Calculations, Lecture Notes in Computer Science 2073, 567 (2001).
12. V. Piermarini, G.G. Balint-Kurti, S. Gray, F. Gogtas, M.L. Hernandez, A. Lagana, Wavepacket Calculation of Cross Sections, Product State Distributions and Branching Ratios for the O(1D) + HCl Reaction, J. Phys. Chem. A 105 (24), 5743 (2001).
13. B. Bacci, B. Cantalupo, P. Pesciullesi, R. Ravazzolo, A. Riaudo, M. Torquati, SkIE-CL: User's Guide, Version 2.0; M. Vanneschi, Parallel Paradigms for Scientific Computing, Lecture Notes in Chemistry 75, 168 (2000).
14. P. Ciullo, M. Danelutto, L. Vaglini, M. Vanneschi, D. Guerri, M. Lettere, Progetto ASI-PQE2000, Workpackage 1, Ambiente ASSIST: modello di programmazione e linguaggio di coordinamento ASSIST-CL (versione 1.0) (2001).
PARALLEL IMAGE RECONSTRUCTION USING ENO INTERPOLATION
J. CZERWINSKA, W.E. NAGEL
Center for High Performance Computing (ZHR), Dresden University of Technology, D-01062 Dresden, Germany
E-mail: czerwinska,[email protected]

The scope of this paper is the presentation of a method for image reconstruction. Most interpolation methods, such as polynomial interpolation (with the special case of linear interpolation) or splines, are sensitive to sharp changes in the interpolated data. Essentially non-oscillatory reconstruction adapts to the function gradient in a way that deals efficiently with sharp edges, which are quite common in typical images built from discrete data. The drawback of the method is the relatively complex algorithm, which needs a lot of logical operations. Due to the parallelisation, which improved the efficiency of the method, ENO interpolation with its high accuracy shows itself to be a reasonable alternative to other types of interpolation.
1 Introduction
An essentially non-oscillatory (ENO) method was originally proposed by Harten 1 to overcome problems of smoothing across discontinuities in computational fluid dynamics. It was then used for the numerical solution of Hamilton-Jacobi equations by Osher et al. 2, and later it was made more efficient by Shu and Osher 3 4. The accuracy of the method and the way it deals with discontinuities can be used in the calculation of partial differential equations, as well as in a wide range of other problems. The basic concept can be moved to other fields of application, like image reconstruction by interpolation or subpixel interpolation 5. Splines and polynomials are commonly used in computer vision and graphics applications for obtaining interpolations of discrete data. The behaviour of those methods is satisfying when the region of data is relatively smooth, but they are prone to error in the vicinity of discontinuities. Due to this fact ENO interpolation seems to be a good alternative. Essentially non-oscillatory interpolation is based on the choice of the smoothest stencil possible to obtain the interpolation results. The selection is done by means of logical choices. Hence, the method is not as fast and efficient as the other commonly used reconstructions of a function on a discrete set of data. Parallelisation seems to be one of the ways to make it more attractive for general application.
Figure 1. Estimation of the order of the reconstruction (second, third and fourth order neighbourhoods).
2 ENO interpolation
The basic concept of the ENO reconstruction used in this paper can be described as follows. Consider a domain Ω with characteristic length of tessellation Δx. The expansion in the vicinity of node i has to fulfil the following conditions:
• it conserves the means;
• it has a compact support (only points in the neighbourhood are taken into the reconstruction in the vicinity of node i), where the size of the neighbourhood depends on the reconstruction accuracy (see Figure 1);
• it exactly reconstructs polynomials of degree ≤ k, equivalent to an accuracy of order (k+1);
• ENO property: choice of the smoothest stencil used for the interpolation.

For the structured grid which is analysed in this paper, the 2D interpolation can be done by locally splitting into 1D cases, which was proved to be valid for partial differential equations of hyperbolic type by Marquina 6. The ENO property is realised by searching for the smoothest stencil. A polynomial interpolation reconstructs the data with the same kind of polynomial everywhere: for instance, linear interpolation will behave better near a jump of the function, but will not be accurate in the other, smooth regions, while a parabola will interpolate the data better in smooth regions, but discontinuities will cause overshoots in the interpolated data. The ENO property is basically an adaptation of the stencil (the data used for interpolation) to the interpolated data set.

Figure 2. Discrete data of the Abgrall function on the coarse mesh.
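To make the "smoothest stencil" idea concrete, the sketch below shows the classical 1D ENO stencil selection based on Newton divided differences (in the spirit of the Harten and Shu-Osher schemes cited above). It is a generic textbook version under assumed unit spacing and assumed array names, not the code used by the authors.

#include <stdlib.h>
#include <math.h>

/* Return the left index of the k-point ENO stencil around node i:
   the stencil is grown one point at a time towards the side whose
   divided difference is smaller in magnitude (the ENO property). */
static int eno_stencil_left(const double *f, int n, int i, int k)
{
    double *dd = malloc(n * sizeof *dd);   /* current-order divided differences */
    int left = i, j, m;

    for (j = 0; j < n; j++) dd[j] = f[j];
    for (m = 1; m < k; m++) {
        /* raise the divided-difference order by one (unit spacing assumed) */
        for (j = 0; j + m < n; j++)
            dd[j] = (dd[j + 1] - dd[j]) / m;
        /* extend left if forced to, or if the left candidate is smoother */
        if (left > 0 && (left + m >= n || fabs(dd[left - 1]) < fabs(dd[left])))
            left--;
    }
    free(dd);
    return left;   /* interpolate on f[left], ..., f[left + k - 1] */
}

In 2D the same selection is applied along each direction separately, consistently with the dimensional splitting mentioned above.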
3 Results
The method which has been described above can be used in a variety of applications: in mathematics (function interpolation), computer science (image processing), etc. In this section one particular case will be presented - the function proposed by Abgrall 7 8:

f(x, y) = u_φ(x, y),                   if x ≤ (1/2) cos(πy)     (1)
f(x, y) = u_{-φ}(x, y) + cos(2πy),     if x > (1/2) cos(πy)     (2)

where

u_φ(x, y) = -r sin(3πr²/2),            if r < -1/3              (3)
u_φ(x, y) = 2r - 1 + (1/6) sin(3πr),   if r > 1/3               (4)
u_φ(x, y) = |sin(2πr)|,                if |r| ≤ 1/3             (5)
r = x + tan(φ) y                                                (6)

Figure 3. Processed image - ENO reconstruction.
In Figure 2 the graph of the function in the domain [-1,1] is presented. The discrete data are obtained on a mesh of 51x51 points. The image reconstruction was done using the following steps. From a starting level node, three steps of interpolation were carried out to obtain three new forward values of the function in new nodes. The distance between nodes shrinks to half of its original value in every step and the number of nodes is eight times the original. In Figure 3 the reconstruction by ENO interpolation for (408x408) nodes is presented. Figure 4 presents the reconstruction by linear interpolation. It is easy to see the difference in the behaviour of the solution for ENO and simple linear interpolation in the vicinity of the discontinuities. A satisfying sharpness of the discontinuities was not obtained for the latter. On the discontinuities the ENO reconstruction accuracy is of first order (the same as linear interpolation), but in the other regions the accuracy only depends on the choice of the order of the approximation. The case presented in Figure 3 is of second order accuracy. The bad result of Figure 4 is also caused by the fact that the error in the vicinity of the discontinuities has accumulated over the steps of interpolation. For the ENO reconstruction, every step introduced only low accuracy points on the sharp discontinuity, while preserving the accuracy of the other regions.

Figure 4. Processed image - linear interpolation.
4 Aspects of parallelisation
The parallelisation of the method was carried out using domain decomposition. The two-dimensional image was divided into non-overlapping subdomains (parallel stripes). The choice of such a parallelisation method places no restriction on scalability. Non-overlapping subdomains were used to decrease the memory needed to store data from the neighbourhood of the subdomain borders. This means that the values of the points are only computed once, in a specific subdomain. The only values needed to compute interpolated points on the boundary are sent from the neighbouring domains. After every level of reconstruction, data is collected by the main processor and, if a new computation is needed, the data is resent. Due to the way the algorithm is constructed, the parallel efficiency is as presented in Figure 5. The time was estimated for an image of 166464 points, for a low accuracy interpolation (second order method, called the basic case in Figure 5) and a high accuracy interpolation (fourth order method, a larger case needing a computational time which is about 200 times longer). The gain in computational time is very drastic between computers with only one and those with several processors. With an increased number of processors, the advantage is not that large for the basic low accuracy case, because the cost of the computation starts to be comparable with the time necessary to send and collect data. In typical applications, processed images are not any larger than in the considered case. For that reason it seems that the considered parallel ENO interpolation algorithm can be used with quite good efficiency on computers with less than 10 processors. Of course, in some cases the data files are much larger than the case considered above (for instance, when there is a need for interpolation of the results of engineering computations, several millions of points can be reached). In this case an increase in the number of processors will significantly increase the efficiency of the computations. For the pictures considered in this paper, the time of calculation on 16 processors was very short, allowing the enhanced picture to be obtained almost immediately. Hence, there was no need to study the effect of parallelisation on a higher number of processors. The algorithm of the method is very general, so there is no obstacle to calculating more complex cases with a larger number of processors. For the realisation of the parallelisation, MPI was used and all calculations were done on the R10000, Onyx 3800.
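The stripe decomposition described above needs only the exchange of a few boundary rows between neighbouring subdomains before interpolating points near the stripe borders. The sketch below illustrates this with MPI; the image width, the number of ghost rows required by the chosen ENO order and the data layout are assumptions, and the collection of data by the main processor after each level is not shown.

#include <mpi.h>

#define NX    408   /* image width (assumed) */
#define GHOST 2     /* boundary rows needed by the ENO stencil (assumed) */

/* stripe layout (rows): [GHOST ghost][local_rows own][GHOST ghost] */
static void exchange_borders(double *stripe, int local_rows,
                             int rank, int size, MPI_Comm comm)
{
    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
    double *own_top    = stripe + GHOST * NX;                  /* first own rows        */
    double *own_bottom = stripe + local_rows * NX;             /* last own rows         */
    double *ghost_top  = stripe;                               /* from neighbour above  */
    double *ghost_bot  = stripe + (GHOST + local_rows) * NX;   /* from neighbour below  */

    MPI_Sendrecv(own_top,    GHOST * NX, MPI_DOUBLE, up,   0,
                 ghost_bot,  GHOST * NX, MPI_DOUBLE, down, 0, comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(own_bottom, GHOST * NX, MPI_DOUBLE, down, 1,
                 ghost_top,  GHOST * NX, MPI_DOUBLE, up,   1, comm, MPI_STATUS_IGNORE);
}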
Figure 5. Parallel efficiency of the image reconstruction algorithm. The basic case represents the lower accuracy interpolation; the large case is the higher order interpolation.
5 Future work
There are basically two directions in which the present work can be continued. The first one is the adaptation of the existing algorithm to unstructured 2D cases. It may not have a straightforward significance for image processing applications, but for large scale engineering applications it is a very desirable goal. Unstructured meshes are better at adapting to the complex geometries which are common in engineering problems, but it is also known that existing methods of interpolation and differentiation do not reach the required accuracy, which is easily obtained by structured grid calculations. Because of this, unstructured ENO interpolation will be interesting for future work. The second path could be 3D approximation. It is again important for large engineering computations and particularly interesting for computational fluid dynamics. In both cases the parallelisation issue is very important to make the considered methods efficient in use.
6 Conclusions
In this paper an accurate and efficient method for interpolation (ENO interpolation) was presented, which can be used for image reconstruction from a discrete set of data. Due to the parallelisation, the efficiency of the method increased and made it a good alternative to other widely used interpolation methods. The parallel algorithm seems to work especially well for large sets of data, which makes it even more promising for future applications.

References
1. A. Harten. ENO schemes with subcell resolution. J. of Comp. Physics, 83:148-184, 1989.
2. C.-W. Shu and S. Osher. Efficient implementation of essentially non-oscillatory shock-capturing schemes. J. of Comp. Physics, 77:439-471, 1988.
3. S. Osher and C.-W. Shu. High-order essentially non-oscillatory schemes for Hamilton-Jacobi equations. SIAM J. of Numer. Anal., 77:907-922, 1991.
4. K. Siddiqi, B.B. Kimia and C.-W. Shu. Geometric shock-capturing ENO schemes for subpixel interpolation, computation and curve evolution. Tech. Report LEMS, 142, February 1995.
5. P. Haeberli and D. Voorhies. Image processing by linear interpolation and extrapolation. IRIS Universe Magazine, August 1994.
6. A. Marquina and R. Donat. Capturing shock reflections: A nonlinear local characteristic approach. UCLA CAM Report, 31, April 1993.
7. R. Abgrall. On essentially non-oscillatory schemes on unstructured meshes: analysis and implementation. J. of Comp. Physics, 114:45-58, 1994.
8. R. Abgrall and A. Harten. Multiresolution representation in unstructured meshes. SIAM J. of Numer. Anal., 35:2128-2146, 1998.
TRAINING ON-LINE RADIAL BASIS FUNCTION NETWORKS ON A MIMD PARALLEL COMPUTER

A. D'ACIERNO
ISA-CNR, via Roma 52 A/C, 83100 Avellino, Italy
E-mail: dacierno.a@isa.av.cnr.it

This paper describes a parallel mapping scheme for the gradient-descent learning algorithm. The problem we are dealing with is Time-Series forecasting by means of GRBF Networks, so that (i) the network has one neuron in the output layer but, at the same time, (ii) the memory of each processor can (typically) hold the whole training set. A recently proposed mapping scheme seems to be the optimal one (the only one?) for speeding the learning up. The approach used is synchronous, so that both SIMD and MIMD parallel computers can be used for actual implementations. Results on a MIMD parallel computer are shown and commented on.
1 Introduction
The Time-Series forecasting problem (given a sequence x(1), x(2), ..., x(N), find the continuation x(N+1), x(N+2), ...) can be stated in terms of estimating an unknown function f such that:

x(t) ≈ x̂(t) = f(x(t-1), x(t-2), ..., x(t-M))    (1)

where M is the (unknown) memory of the (unknown) system. As is well known, the problem of estimating an unknown function given some input-output pairs (the regression problem) can be parametric or non-parametric. In parametric regression the form of the functional relationship between the dependent and independent variables is known (or guessed) but it contains parameters whose values have to be determined. In non-parametric regression, instead, there is no a priori knowledge about the shape of the function to be estimated. Generalized Radial Basis Function (GRBF) neural networks are non-parametric models (that can be linear as well as non-linear) able to approximate any reasonable continuous function mapping arbitrarily well and with the best approximation property [3]. As for each neural network, the behaviour of a GRBF network depends on free parameters evaluated through a learning procedure; such a procedure tends to be very time-consuming and it is obvious to capitalize on the intrinsic parallelism of neural systems to speed the computation up. Learning algorithms typically involve only local computation but, on the other hand, the output of each unit depends on the output of many other units, so that most of the running time can easily be spent in communication rather than in actual computation.
The motivation for this work was exactly to try to solve the mapping problem, with reference to a well known learning algorithm.
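As a concrete reading of equation (1), the short routine below builds the training pairs from the historical sequence; the names and the row-major layout are assumptions. It also illustrates why, as noted later, the whole training set reduces to a simple array that every processor can hold.

/* Build the training set for equation (1):
   input = (x(t-1), ..., x(t-M)), target = x(t). */
static int build_training_set(const double *x, int N, int M,
                              double *inputs /* (N-M) rows of M values */,
                              double *targets)
{
    int t, j, n = 0;
    for (t = M; t < N; t++) {
        for (j = 1; j <= M; j++)
            inputs[n * M + (j - 1)] = x[t - j];
        targets[n] = x[t];
        n++;
    }
    return n;   /* number of training pairs produced */
}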
2 RBF: A Brief Introduction
The problem of estimating an unknown function f : ℝ^M → ℝ given a set of data {(x_i, y_i) ∈ ℝ^M × ℝ}, i = 1, ..., N, is clearly ill-posed, since it has an infinite number of solutions; to choose a particular solution, we need some a priori knowledge about the unknown function, typically assuming that the function is smooth (i.e. two similar inputs correspond to two similar outputs). Since we look for a function that is, at the same time, close to the data and smooth, it is natural to choose as the solution of our regression problem the function f that minimizes the following functional:

H[f] = Σ_{i=1}^{N} (f(x_i) - y_i)² + λ ||Pf||²    (2)

where λ (> 0) is the so-called regularization factor, P is a constraint operator (usually a differential operator) and ||·|| is a norm (usually the L2 norm) on the function space of f. The operator P clearly embodies the a priori knowledge about the solution and so depends on the problem to be solved. Minimization of the functional H leads to the Euler-Lagrange equation, which can be written as:

P̂P f(x) = (1/λ) Σ_{i=1}^{N} (y_i - f(x_i)) δ(x - x_i)    (3)

where P̂ is the adjoint of the (differential) operator P and the right side comes from the functional derivative (with respect to f) of H. The solution of the partial differential equation 3 is the integral transformation of its right side with a kernel given by the Green's function of the differential operator P̂P, that is the function G satisfying:

P̂P G(x; y) = δ(x - y)    (4)

whose solution, because of the delta functions, can be written as:

f(x) = (1/λ) Σ_{i=1}^{N} (y_i - f(x_i)) G(x; x_i)    (5)

By evaluating equation 5 at the N data points we easily obtain a set of equations for the coefficients c_i = (y_i - f(x_i)) / λ.
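Spelled out (a standard step, added here for completeness), evaluating equation 5 at the data points and using the definition of the c_i gives the linear system usually associated with regularization networks:

\[
y_i = f(x_i) + \lambda c_i = \sum_{j=1}^{N} c_j\, G(x_i; x_j) + \lambda c_i,
\qquad i = 1,\dots,N,
\quad\text{i.e.}\quad (G + \lambda I)\, c = y,\ \ G_{ij} = G(x_i; x_j).
\]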
When the operator P is translationally invariant, G will depend on the difference of its arguments (G(x; x_i) = G(x - x_i)), and if P is also rotationally invariant G will be a radial function, i.e. G(x; x_i) = G(|x - x_i|) (e.g. a Gaussian function). The solution given by standard regularization theory is expensive in computational terms (O(N³)) and, what is of course worse, the probability of ill-conditioning the system is higher for larger and larger matrices. The generalized regularization theory approach is to expand the solution on a smaller basis, so deriving:

f(x) = Σ_{i=1}^{n} c_i G(x; t_i)    (6)

where n << N and the t_i's are called "centres"; equation 6 is topologically described in Figure 1, where each Green's function is realized by means of a neuron; these networks are called Generalized Radial Basis Function (GRBF) networks.

Figure 1. A GRBF Network.

When the G_i's are Gaussian functions, it is of course:

O(x) = Σ_{i=1}^{H} w_i Π_{j=1}^{M} exp(-(x_j - t_ij)² / (2σ_ij²))    (7)

To evaluate the free parameters there are several possibilities. The simplest one selects the centres by means, for example, of a clustering algorithm; then selects the σ's according to some heuristics and, last, to evaluate the w's, solves a least squares problem. Using such an approach a GRBF network behaves almost linearly. A quite different approach (that makes GRBF networks highly non-linear models) assumes that all the parameters can change during the training phase; in our experiments, for example, the centres are initialised using a competitive clustering algorithm and the learning strategy tries to minimize the error E defined as:
E = (1/2) Σ_i (y_i - O(x_i))²    (8)

by using the well-known gradient-descent algorithm. This is an iterative algorithm that at each step (considering the i-th training pattern, and where x_ij represents component j of example i) applies the following formulae:

Δw_l = -η ∂E/∂w_l = η e_i Π_{j=1}^{M} exp(-(x_ij - t_lj)² / (2σ_lj²))    (9)

Δt_lr = -η ∂E/∂t_lr = η e_i w_l [Π_{j=1}^{M} exp(-(x_ij - t_lj)² / (2σ_lj²))] (x_ir - t_lr) / σ_lr²    (10)

Δσ_lr = -η ∂E/∂σ_lr = η e_i w_l [Π_{j=1}^{M} exp(-(x_ij - t_lj)² / (2σ_lj²))] (x_ir - t_lr)² / σ_lr³    (11)

where e_i = y_i - O(x_i) is the error term and η is the learning rate.
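As a concrete reading of equations (7) and (9)-(11), the routine below performs one on-line update for a single training pattern. It is an illustrative sketch (the fixed sizes, the single learning rate eta and the data layout are assumptions, not the author's code).

#include <math.h>

#define M 16   /* input dimension (assumed) */
#define H 32   /* number of hidden Gaussian units (assumed) */

/* One on-line gradient-descent step for the GRBF of equation (7). */
static void grbf_online_step(double w[H], double t[H][M], double sigma[H][M],
                             const double x[M], double y, double eta)
{
    double phi[H], out = 0.0, e;
    int l, j;

    for (l = 0; l < H; l++) {                         /* forward pass, eq. (7) */
        double p = 1.0;
        for (j = 0; j < M; j++) {
            double d = x[j] - t[l][j];
            p *= exp(-d * d / (2.0 * sigma[l][j] * sigma[l][j]));
        }
        phi[l] = p;
        out += w[l] * p;
    }

    e = y - out;                                      /* error term e_i */
    for (l = 0; l < H; l++) {
        double common = eta * e * w[l] * phi[l];
        for (j = 0; j < M; j++) {
            double d  = x[j] - t[l][j];
            double s2 = sigma[l][j] * sigma[l][j];
            t[l][j]     += common * d / s2;                       /* eq. (10) */
            sigma[l][j] += common * d * d / (s2 * sigma[l][j]);   /* eq. (11) */
        }
        w[l] += eta * e * phi[l];                     /* eq. (9) */
    }
}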
This procedure can of course work either on-line or in batch mode; in the on-line case parameter changes are used as they are evaluated while, in batch mode, they are accumulated to evaluate global terms that are summed after all (or some) examples have been processed.

3. The Proposed Mapping Scheme

The design of a neural predictor requires the choice of the parameter M to create the training set from the available historical sequence (this is one of the complications that make Time-Series forecasting a more difficult problem than straight classification). Analyse this fact from the neural designer's point of view: even if tricks and methods to guess the right M exist, it is a matter of fact that (theoretically) an over-sized M could be used, since the not useful values should (or, maybe better, could) be neglected by the neural structure simply by making σ_ij → +∞ for all i and for all j greater than some threshold value. From a practical point of view, instead, the condition σ_ij → +∞ can only be approximated, so that the selection of M is based mostly on trial and error (with some heuristics); moreover, the gradient descent algorithm gets stuck in the first local minimum it meets, and thus it usually has to be re-started many times with different initial conditions. As concerns the right value of H, it is a matter of fact that it should be chosen as low as possible just to save computing time since, as should be clear from Section 2, the performance does not decrease as H increases (provided that it remains << N). Summarizing, the neural designer has to perform a lot of experiments to optimally dimension and train the network, the learning phase thus being a process that has to be performed interactively; given the computational requirements of the problem at hand, the availability of fast implementations on parallel machines is mandatory to speed the prototyping phase up.

When the algorithm we are using to train our parameters is batch (or, in different words, when the batch algorithm can be used) the parallel implementation is straightforward, it being of course possible to have a copy of the network on each processor together with some training patterns. The problem clearly arises when we have to use the on-line version of the algorithm since, to speed the computation up, we have to split the GRBF among the processing nodes. This problem is, at least from an algorithmic point of view, almost equivalent to the parallel implementation of the Back-Propagation of errors learning Algorithm (BPA) for feed-forward neural networks; this is not surprising, since (from a topological point of view) a GRBF can be viewed as a feed-forward neural network.
Thus, even if (as far as the author knows) there are no (significant) papers dealing with the gradient-descent algorithm for GRBF networks, we could try to implement it in parallel by using results suggested for the parallel BPA. Typically, the on-line BPA is seen as a sequence of matrix-vector products that are performed in parallel, usually in a systolic fashion [5]. As is well known, there are two systolic solutions to the matrix-vector product problem; the first one maps, on each processor, a row of the matrix while the second one maps, on each processing element, a column of the matrix. A parallel implementation based on intra-layer parallelism can be obtained by using the by-row solution or, maybe better, by using a mixture of by-row and by-column solutions [6]. It has also been proposed in [4] a parallel implementation that, although still based on intra-layer parallelism, exploits the collective nature of the communication involved (see also [1]). The idea underlying many of these schemes cannot be used for the problem at hand for the following trivial reason: for a 3-layer feed-forward network the maximum number P_max of processors that can be used equals the dimension of the smallest layer; in our problem the dimension of the output layer is 1, which simply means no parallelism at all.

A quite different approach [2] is based on the synergistic use of neuron parallelism and synaptic parallelism to split the neural network. Suppose, for the sake of simplicity, that the number of processors to be used equals H, the generalization being straightforward; in this case we can map, on each processor, a hidden unit and, as concerns the input layer, we assume that each processor knows all the training input/output pairs. (The latter of course holds for Time-Series forecasting, since the training set is a simple array.) If each processor knows each training pattern it can evaluate (without communication) the activation value of the handled hidden neuron. Once the components of the sum have been evaluated, what is needed is a reduction operator that allows each processor to know O(x_i). Last, we have to apply Eqns. 9, 10 and 11; by using our strategy each processor knows all it needs to evaluate the Δ's related to the parameters it handles, without communication. A (maybe negligible) drawback, clearly, is that the error term e_i has to be evaluated in each processor. Using the proposed scheme we clearly have that P_max equals H, and this does not seem a hard constraint when coarse-grained parallel computers are considered. If fine-grained machines are considered, on the other hand, it is highly possible that the number of processors is >> H; if this is the case a lower level of parallelism has to be used to increase P_max (this is the concern of our work in progress).
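A sketch of the proposed mapping is given below, written with MPI_Allreduce playing the role of the reduction (the actual implementation in Section 4 uses the Meiko gsum primitive instead). The partitioning of the H hidden units among the processes and the two helper functions are assumptions introduced only for the illustration.

#include <mpi.h>

/* Each process owns a block of hidden units (with their w, t, sigma)
   but holds the whole training set, so the forward phase needs a single
   scalar reduction per pattern to obtain O(x). */
extern double local_partial_output(const double *x);               /* sum over owned units (hypothetical) */
extern void   local_parameter_update(const double *x, double err); /* eqs (9)-(11) on owned units (hypothetical) */

static void online_epoch(const double *X, const double *Y,
                         int n_patterns, int input_dim, MPI_Comm comm)
{
    int p;
    for (p = 0; p < n_patterns; p++) {
        const double *x = X + (long)p * input_dim;
        double partial = local_partial_output(x);
        double out;

        /* reduction + broadcast of O(x): the only communication per pattern */
        MPI_Allreduce(&partial, &out, 1, MPI_DOUBLE, MPI_SUM, comm);

        /* each process evaluates the error term locally and updates
           only the parameters of the hidden units it owns */
        local_parameter_update(x, Y[p] - out);
    }
}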
4. Implementations and Results

Given a problem and a possible solution, the actual implementation step must be clearly addressed. When the problem is the parallel implementation of an algorithm (the solution being a mapping scheme) the implementation step typically is a porting step; to the writer's mind, the better the solution, the easier the porting will be. For the problem at hand, we will try to show that the porting problem is trivial when the described solution is used.

The Meiko CS-2 is a MIMD parallel computer able to provide both scalar and vector processing nodes; in our configuration it is composed of SPARC multiprocessor nodes (2 processors for each node). The communication network is a logarithmic network (a Benes network folded about its centre line) constructed from 8-way cross-point switches and bi-directional links; each node has its own communication processor, i.e. a dedicated interface to the communication network. A key point is the availability of a library of network functions which allows users to efficiently solve some communication problems without using low-level routines. For example, very simple and efficient routines are provided for message broadcasting, for DMA accessing of any node's memory, etc. Particularly interesting for the problem at hand is the gsum function; this function performs a reduce operation on a vector, and the result is broadcast to all processors.

Using the gsum function and the suggested mapping, the porting problem can be solved very easily; suppose we start from a sequential implementation where it is possible to set at run-time both M and H (and this does not seem too difficult). To obtain the parallel version we have to:
1. add the code to determine the number of processes involved in the current simulation;
2. add the code to determine the number of hidden neurons that each process has to simulate;
3. add a gsum call in the forward phase to complete equation 7;
4. modify the I/O routines.

Since just step 3 notably affects the performance we have that:

T_Par = T_Seq / P + T_gsum    (12)
115 14 •
12
/ /.
^r •^
AJf^
C
4
10
I
8
*^tf^ r^
.
«•
'
4^x£-+ IP -
6 4 2
=yg
0
8
12
1g
8 P
P
Rgure 1. Left: the speed-up having M as parameter (circle is M=8, square is M=16 and triangle is M=32). Right: the speed-up having H as parameter (circle is H=16, square is //=32 and triangle is //=64).
actual implementation, the speed-up clearly depends on
; this effect can be
well appreciated for small networks (see the case H-16). It is worth noting that, in the experiments, processors belongs to different nodes (see section 5). 7*""", clearly, depends just on P, being fixed to 1 the dimension of the vector to be combined, and it is T*sl""=0(lg2P), being the processors arranged in a tree. This, together with equation 12, suggests the following: TgSUm^TPar_i
fSeq
„* i + * 2 [lgp1
(13)
Using the available 48 data points it is possible to give rather accurate estimates of k1 and k2 by solving the least squares linear problem based on equation 13. When the times are expressed in microseconds for 1 iteration on 1 example we have k1 = 5.3 and k2 = 22; 43 out of the 48 data points (90%) are in the range k1 + k2 ⌈log2 P⌉ ± 30% (see Figure 3). The suggested analysis could seem too inaccurate, but it is worth remembering that we are talking of microseconds on a MIMD machine, with each node running its own operating system.

5. Concluding Remarks

In this paper a mapping scheme for the gradient descent algorithm is proposed; we are using this algorithm in the learning phase of GRBF Networks for Time-Series forecasting. The approach is based on a mapping scheme whose main drawback is that each processor has to know the whole training set; such a drawback does not apply to the problem under study, the training set being a simple array. The scheme is synchronous, so that both MIMD and SIMD architectures can be used. Results obtained on the Meiko CS-2, a MIMD coarse-grained parallel computer, have been shown. The results demonstrate the worth of the suggested approach; another useful result is that coding the parallel program is almost as easy as coding the sequential one.
A problem we are dealing with is that the code isn't able to efficiently use both processors of a node, which share the memory; roughly speaking, this means that, for example, the efficiency obtained using 8 processors is higher if the processors belong to different nodes. This surprising and undesired behaviour depends (i) on the fact that the gsum function does not avail itself of shared-memory mechanisms to combine vectors residing in the same memory and (ii) on the fact that the 2 processors share the same communication processor, so that many conflicts on such a critical resource exist. To solve the problem, a new version of the combination function able to overcome such a substantial deficiency is under development.

Figure 3. T_gsum: the model ± 30% vs. the data points.
References
1. M. Besch and H. W. Pohl, "Flexible Data Parallel Training of Neural Networks Using MIMD-Computers", in Third Euromicro Workshop on Parallel and Distributed Processing, Sanremo, Italy, January 1995.
2. A. d'Acierno, "Back-Propagation Learning Algorithm and Parallel Computers: The CLEPSYDRA Mapping Scheme", Neurocomputing, 31 (2000), 67-85.
3. F. Girosi and T. Poggio, "Networks and the best approximation property", Biological Cybernetics, Vol. 63 (1990), 169-176.
4. H. Klapuri et al., "Mapping Artificial Neural Networks To a Tree Shape Neurocomputer", Microprocessors & Microsystems, 20 (5) (1992), 267-276.
5. S.Y. Kung and J.N. Hwang, "Parallel Architectures for Artificial Neural Nets", Proc. ICNN, San Diego, CA, 1988, vol. 2, 165-172.
6. X. Zhang, M. Mckenna, J.P. Mesirov, and D. L. Waltz, "The Back-Propagation Algorithm on Grid and Hypercube Architectures", Parallel Computing, 14 (1990), 317-327.
PARALLEL SIMULATION OF A CELLULAR LANDSLIDE MODEL USING CAMELOT
GIUSEPPE DATTILO AND GIANDOMENICO SPEZZANO
ISI-CNR, c/o DEIS, Università della Calabria, 87036 Rende (CS), ITALY
E-mail: [email protected]
The simulation of landslide hazards is a key point in the prevention of natural disasters, since it enables the computation of risk maps and helps to design protection works. We present a parallel simulator that handles debris/mud flows, developed with a problem-solving environment, called CAMELot, which allows interactive simulation and steering of parallel cellular computations. CAMELot is a system that uses the cellular automata formalism to model and simulate dynamic complex phenomena on parallel machines. It combines simulation, visualisation, control and parallel processing into one tool that allows users to interactively explore a simulation, visualise the state of the computation as it progresses and change parameters, resolution or representation on the fly. In the paper, we give an overview of the CAMELot system and we show its practical use for programming and steering the landslide models developed in the project. Moreover, an evaluation of the performance of the simulator on the MEIKO CS-2 parallel machine is presented.
1 Introduction
The rapid development of computer technology has opened up new perspectives and has increased the importance of computer simulation as an essential tool in research and experimentation. Today, we can build simulation models which allow us to predict the behavior of real physical systems in actual or hypothetical circumstances. Moreover, the advent of parallel computing is making complex simulations in many disciplines tractable, increasing the interest in different application fields. One area of remarkable interest is the simulation of natural disasters such as landslides, earthquakes and floods. In these simulations we must face the difficult scientific problem of predicting how territorial systems respond to global change. For instance, good research models allowing the calculation of the trajectories of debris/mud flows can help many engineers and geologists to establish safety perimeters or design protection works. Many risk situations, determined by the overcoming of the threshold value of a parameter or arising from the values of several parameters (such as, for instance, the contemporary increase of the level of several rivers, the persistence of slight rains for a certain number of days and the morphology of the ground), can be detected by a simulation. In the area of geophysics, only a few simulation codes are available for such work [1,2] and even rarer are those that tackle the three dimensional aspects of simulation. Moreover, the current methodologies for parallel computing
are too complex and too specialised to be accessible to most engineers and scientists. To help overcome these obstacles, we have developed a problem solving environment [3], called CAMELot, that allows interactive simulation and steering of parallel cellular computations. CAMELot is a system that uses the cellular automata model [4] both as a tool to model and simulate dynamic complex phenomena and as a computational model for parallel processing. It combines simulation, visualisation, control and parallel processing into one tool that allows users to interactively explore a simulation, visualise the state of the computation as it progresses and change parameters, resolution or representation on the fly. CAMELot supports exploratory, algorithmic and performance steering of cellular computations [5].

Currently, a research project is under way involving the CNR-IRSIP, the CNR-IRPI, the United States Geological Survey (Landslide Hazard Program), the CNR-LARA, university and CNR researchers in the fields of applied geology, stratigraphy and forest science, as well as our Institute (CNR-ISI). The aim of the common work is to improve the overall ability to predict landslides and mud flows and to develop countermeasures to limit their disastrous consequences. We are building a simulator, based on a cellular automata model, that handles the landslide events that affected the Campania Region in May 1998.

The paper gives an overview of the CAMELot system and shows its practical use for programming and steering the landslide and mud-flow models developed in the project. Moreover, an evaluation of the performance of the simulator on the MEIKO CS-2 parallel machine is presented.

2 Simulation environment CAMELot
The CAMELot prototype is derived from the CAMEL system [6] but offers additional functionality to perform program steering by a language based approach. The CAMELot environment consists of:
• a language, called CARPET, which can be used to define cellular algorithms and to perform steering commands when complex space and time events are detected;
• a graphic user interface (GUI) for editing, compiling, configuring, executing, visualising and steering the computation;
• a load balancing algorithm, similar to the scatter decomposition technique, to evenly distribute the computation among the processors of the parallel machine.

CAMELot implements a cellular automaton as an SPMD (Single Program Multiple Data) program. The CAMELot architecture is composed of a set of macrocell processes, a controller process and a GUI process. Each macrocell process, which contains a portion of the cells of the CA, runs on a single processing element of the parallel machine and executes the updating of the states of the cells belonging to its partition.
The synchronisation of the automaton and the execution of the commands provided by a user through the GUI interface are carried out by the controller process. MPI primitives handle all the communications among the processes, using MPI communicators.
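The text does not show how the communicators are organised; purely as an illustration (not the CAMELot source), a dedicated communicator for the macrocell processes, separate from the controller, could be created along these lines:

#include <mpi.h>

/* Illustrative only: give the macrocell processes their own communicator,
   keeping rank 0 (assumed to host the controller) out of it, so that CA
   synchronisation traffic is separated from control/steering traffic. */
static void make_macrocell_comm(MPI_Comm *macrocells)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* the controller passes MPI_UNDEFINED and receives MPI_COMM_NULL */
    MPI_Comm_split(MPI_COMM_WORLD, rank == 0 ? MPI_UNDEFINED : 1, rank, macrocells);
}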
3 CARPET, a programming and steering language
CAMELot supports the CA programming language CARPET [7]. CARPET is a language to program cellular algorithms and contains constructs to handle region-based programming and computational steering. It is a high-level language based on C, with additional constructs to describe the rule of the state transition function of a single cell of a cellular automaton and to steer the application. A CARPET program is composed of a declaration part, which appears only once in the program and must precede any statement, a program body that implements the transition function, and a steering part that contains a set of commands to extract and analyse system information and to perform steering. The main features of CARPET are the possibility to describe the state of a cell as a record of typed substates (char, shorts, integers, floats, doubles and one-dimensional arrays), the simple definition of complex neighbourhoods (e.g., hexagonal, Margolus) and the specification of non-deterministic, time-dependent and non-uniform transition functions. In standard CA a cell can interact only with the cells defined within its neighbourhood. CARPET extends the range of interaction among the cells by introducing the concept of region [8]. Regions are statically defined spatio-temporal objects which allow a cell to know, by aggregate functions, the behaviour of a substate within a set of cells placed within a defined area.

The predefined variable cell refers to the current cell in the n-dimensional space under consideration. A substate can be referred to by appending the name of the substate to the reserved word cell with the underscore symbol '_' (i.e., cell_substate). Cell substates are updated at each iteration only by the update function, in order to guarantee the semantics of cell updating in cellular automata. After an update statement, the value of the substate in the current iteration is unchanged; the new value becomes effective at the beginning of the following iteration. The neighbourhood of a cell is defined as the maximum number of cells that a cell can access in reading. For example, in a 2-dimensional automaton, defining the radius equal to 1, the number of neighbours can be up to 8. To define time dependent transition functions or neighbourhoods, the predefined variable step is used. Step is automatically updated by the system: initially the value of step is 0 and it is increased by 1 at each iteration. To allow a user to define spatially non-uniform CA, CARPET defines the GetX, GetY and GetZ operations that return the values of the X, Y and Z coordinates of a cell in the automaton. Parameter objects describe some global features of the system. CARPET allows the definition of global parameters and their initialisation to specific values. The value of a global parameter is the same in each cell of the automaton.
In CARPET, region objects are statically defined by the region declaration. A d-dimensional region is defined by a sequence of indexes which represent the geometric coordinates, the time period (starting time and ending time) in which the region is defined, and the interval of monitoring. CARPET also implements the aggregate functions MaxRegion, MinRegion, SumRegion, AndRegion, OrRegion and AvgRegion, which compute, respectively, the maximum, the minimum, the sum, the logical and, the logical or, and the average value of a substate over the cells belonging to a region. Other functions can be added in the future. Furthermore, the InRegion, InTempRegion and Distance functions allow a cell to know whether it belongs to a spatial region, whether the current iteration is within the temporal window and the monitoring step defined for the region, and to calculate the distance between the cell and the region considered. CARPET is also a language that allows the definition of algorithms to perform computational steering. The steering commands are defined in the steering section, which is executed by the runtime system at each iteration, after the transition functions of all cells have been evaluated. CARPET allows significant events to be defined using the substates as variables, as well as the parameters and the region objects defined in the declaration part. It also allows actions to be taken when an event is detected. The basic event-action control structure of CARPET is

if event_expr then <steering_action>

An event_expr is an expression that combines the aggregate functions defined on a region and the steering commands applied to the event variables by basic numerical expressions as well as relational and logical operators. Steering commands are of four types. Those for the control of the simulation are: cpt_abort() for stopping the computation, cpt_set_param for changing the value of a parameter, and cpt_view and cpt_edit to view and change the substate value of a cell at a point of the automaton. Those for changing the visualisation are: cpt_zoom to set a region and zoom into it, cpt_changeplane and cpt_changetype to define the plane and the substate to visualise, cpt_colorange to change the range of the colours associated with the values of a substate, and cpt_openstate and cpt_closestate to open and close a window for the visualisation of a substate. cpt_load and cpt_save are used to load and save a configuration of an automaton from/to a file. For tuning the performance of the application the commands are: cpt_changefold, which sets the active folds, and cpt_getstartfold and cpt_getendfold, which return the initial and final folds among the active ones.
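As a concrete illustration of this event-action pattern (written in plain C rather than CARPET, with purely illustrative names, data and thresholds), a steering step could look like the following sketch.

#include <stdio.h>

#define REGION_CELLS 100

static double region_substate[REGION_CELLS];   /* substate values of one monitored region */
static double relaxation_rate = 2.0;           /* a global parameter of the model         */

static double avg_region(const double *v, int n)   /* AvgRegion-like aggregate */
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += v[i];
    return s / n;
}

static void steering_step(void)
{
    /* "if event_expr then steering_action": when the average value of the
       substate in the monitored region exceeds a threshold, the run-time
       system could change a global parameter or raise an alarm.            */
    if (avg_region(region_substate, REGION_CELLS) > 5.0) {
        relaxation_rate = 1.5;                 /* cpt_set_param-like action */
        printf("event detected: parameter changed to %g\n", relaxation_rate);
    }
}

int main(void)
{
    for (int i = 0; i < REGION_CELLS; i++) region_substate[i] = 6.0;
    steering_step();
    return 0;
}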
4
A preliminary simulation of landslides by CAMELot
Using CAMELot, we showed that the dynamics of debris/mud flows can be simulated and that the shape of the simulated landslide demonstrates a substantial agreement with the real event. The simulation is based on the preliminary model of
debris/mud flows defined by Di Gregorio [9] in the project. This model defines the ground as a two-dimensional plane partitioned into square cells of uniform size. Each cell represents a portion of land, whose altitude and the physical characteristics of the debris column lying on it are described by the cell states. The state evolution depends on a transition function, which simulates the physical processes of the debris flow. Here we show the definition data and some portions of the code of the transition function of the automaton programmed in CARPET. The cell state, the neighbourhood and some global parameters of the two-dimensional automaton are defined in figure 1, which shows the declaration part of the model written in CARPET.

cadef {
  dimension 2;
  radius 1;
  state (int Altitude, Thick, RunUp, Depth, Mobilisation, OutFlow[4], Water, Steps);
  neighbor moore[9] ([0,0]C, [0,-1]N, [1,0]E, [0,1]S, [-1,0]W, [-1,-1]NW, [1,-1]NE, [1,1]SE, [-1,1]SW);
  parameter (pr_starting 2.0, pr_ending 2.0, pr 2.0, disl_min 0.0, disl_max 1.0, pf1 0.0, pf2 0.0);
}

Figure 1. The CA declaration part.
The finite set of substates of an elementary automaton (a cell) has the following meanings:
• Altitude: the altitude of the cell;
• Thick: the debris thickness in the cell;
• RunUp: the "run up" height of the cell debris;
• Depth: the maximum depth of the ground stratum that is transformed by erosion into debris/mud;
• Mobilisation: the "mobilisation" activation of the ground stratum, which becomes debris/mud flow;
• OutFlow: the debris flows toward the eight neighbourhood directions;
• Water: the water content in the debris of the cell;
• Steps: the steps necessary for the "mobilisation" of the ground stratum which becomes debris/mud.
The global parameters represent the relaxation rate and the mobilisation angles. The main features of the transition function concern the computation of internal transformations, which involve solidification, water loss and the mobilisation effect, as well as local interactions, which involve the debris/mud flows from a single cell toward its neighbour cells, the run-up and adherence determination, and the mobilisation propagation. At the beginning of the simulation we specify the states of all the cells, defining the initial configuration of the automaton. The transition function first computes the internal transformations and then the local interactions.
For example, figure 2 describes the altitude and thickness variation by solidification. The cell altitude at step t+1 is increased by the thickness of the debris once the debris water content drops to a minimal value, so that the motion is blocked (solidification). At the same time, the debris thickness of the cell is set to zero. When the debris thickness is larger than a critical value and there is no solidification, the cell altitude is decreased by the spooning effect, proportional (by a constant k) to the difference between the debris thickness and the critical value. Currently, we are defining a more complex model that takes into account many suggestions provided by the geologists and engineers of our team. It is our intention to improve the accuracy of the model as well as to introduce new mechanisms to run simulations where the resolution of the cells can be changed automatically if a significant event is detected. Moreover, these mechanisms should make it possible to define regions (identifying danger zones of the territory) that can be monitored by user-defined triggers, and to perform actions (for instance, signalling the need to evacuate the population) when an event occurs. The constructs for steering and the region-based programming model provided by CARPET represent an efficient and flexible way to design these extensions to the model.
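Since the code of figure 2 is not reproduced here, the solidification rule just described can be summarised by the following hedged C sketch; the structure, the names, and the parameters water_min, thick_crit and k are illustrative, not the actual CARPET transition function.

typedef struct {
    double altitude;   /* cell altitude             */
    double thick;      /* debris thickness          */
    double water;      /* water content of debris   */
} cell_t;

static void solidification(cell_t *c, double water_min, double thick_crit, double k)
{
    if (c->water <= water_min) {           /* motion blocked: solidification        */
        c->altitude += c->thick;           /* debris becomes part of the ground     */
        c->thick = 0.0;
    } else if (c->thick > thick_crit) {    /* no solidification: "spooning" erosion */
        c->altitude -= k * (c->thick - thick_crit);
    }
}

int main(void)
{
    cell_t c = { 100.0, 2.5, 0.01 };
    solidification(&c, 0.05, 1.0, 0.1);    /* water below the minimum: solidifies   */
    return 0;
}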
5
Simulation results
We have developed and tested the model of the landslide simulation on a MEIKO CS-2 parallel machine. We have selected the Chiappe di Sarno-Curti debris flow as a first case study. For our simulation we have used a map of the landslide at 1:5000 scale. The area of interest has been mapped onto a cellular
automaton 772 cells long and 880 cells wide, with each square cell 2.5 metres on a side. Figure 3 shows the real landslide shape and an intermediate snapshot of the shape resulting from the CAMELot simulation. A good agreement between the real phenomenon and the simulation result can be observed.
Figure 3. The real landslide shape (left) and the CAMELot simulation result (right).
Table 1 also shows the execution times and the speedup of the model on a MEIKO CS-2 using 1, 2, 4, 6, 8 and 10 processors, with and without the load balancing strategy. The table clearly shows that the balanced simulation achieves a better speedup than the unbalanced one, and that the difference in performance increases for larger numbers of processors.

Table 1. Execution times and speedup of unbalanced and balanced simulations.

Processors | Unbalanced Time (sec) | Unbalanced Speedup | Balanced Time (sec) | Balanced Speedup
     1     |        964.45         |        1.00        |       987.65        |       1.00
     2     |        556.52         |        1.73        |       534.67        |       1.84
     4     |        277.18         |        3.48        |       254.40        |       3.85
     6     |        241.60         |        3.99        |       212.09        |       4.65
     8     |        184.25         |        5.23        |       183.21        |       5.39
    10     |        141.47         |        6.82        |       128.07        |       7.71

6
Conclusions
This paper has described a problem solving environment for the interactive simulation of natural phenomena. In particular, we have demonstrated that complex problems such as the simulation of landslides can be modelled by the cellular automata formalism and efficiently executed on parallel machines. The steering constructs
and the region-based programming model can be used to implement more complex models and to accelerate the validation phase of the model. The availability of models and simulations of natural phenomena can contribute to a better understanding of their dynamics and lead to more informed choices about the management of the territory and global change.
7
Acknowledgements
This research has been partially funded by the Environmental Policies Councillorship of the Campania Regional Government.
References
1. J. Gascuel, M. Cani-Gascuel, M. Desbrun, E. Leroi, C. Mirgon, Simulating Landslides for Natural Disaster Prevention, Eurographics Workshop on Animation and Simulation, Lisboa, Portugal, September 1998.
2. S. Di Gregorio, F. Nicoletta, R. Rongo, G. Spezzano, D. Talia, M. Sorriso-Valvo, "Landslide simulation by cellular automata in a parallel environment", in: Proc. of the Int. Workshop on Massively Parallel: Hardware, Software and Applications, World Scientific, 1994, 392-407.
3. E. Gallopoulos, E. N. Houstis, J. R. Rice, Computer as Thinker/Doer: Problem-Solving Environments for Computational Science, IEEE Computational Science & Eng., vol. 1, n. 2, Summer 1994, 11-23.
4. P. B. Hansen, Parallel Cellular Automata: a Model for Computational Science, Concurrency: Practice and Experience, vol. 5, 1993, 425-448.
5. S. G. Parker, C. R. Johnson, D. Beazley, Computational Steering Software Systems and Strategies, IEEE Computational Science & Engineering, vol. 4, n. 4, 1997, 50-59.
6. M. Cannataro, S. Di Gregorio, R. Rongo, W. Spataro, G. Spezzano, D. Talia, A Parallel Cellular Automata Environment on Multicomputers for Computational Science, Parallel Computing, North-Holland, vol. 21, n. 5, 1995, 803-824.
7. Spezzano G., Talia D., "Programming Cellular Automata for Computational Science on Parallel Computers", Future Generation Computer Systems, North-Holland, Amsterdam, vol. 16, n. 2-3, pp. 203-216, 1999.
8. G. Folino, G. Spezzano, "CELLAR: A High Level Cellular Programming Language with Regions", Proc. of the 8th Euromicro Workshop on Parallel and Distributed Processing PDP'2000, IEEE Computer Society, pp. 259-266, 2000.
9. S. Di Gregorio, "Cellular Automata Models for the Simulation of the Landslides Occurred in the Sarno Area of Campania Region in May 1998", Technical Report, December 1999.
PARALLEL NUMERICAL SIMULATION OF PYROCLASTIC FLOW DYNAMICS AT VESUVIUS
T. ESPOSTI ONGARO
Dip.to di Scienze della Terra, Università degli Studi di Pisa, Via S. Maria 53, I-56126 Pisa
C. CAVAZZONI, G. ERBACCI
CINECA, via Magnanelli 6/3, I-40033 Casalecchio di Reno (Bologna)
A. NERI
CNR, Centro di Studio per la Geologia Strutturale e Dinamica dell'Appennino, Dip.to di Scienze della Terra, Università degli Studi di Pisa, Via S. Maria 53, I-56126 Pisa
G. MACEDONIO
Osservatorio Vesuviano, Istituto Nazionale di Geofisica e Vulcanologia, Via Diocleziano 328, I-80124 Napoli
The study performed focuses on the development and application of a parallel multiphase fluid dynamic code for the simulation of the transient 2D dynamics of pyroclastic flows, one of the most hazardous phenomena occurring during explosive volcanic eruptions. The parallelization strategy adopted is based on a domain decomposition of the real space grid using a SPMD paradigm within a message passing scheme. The speed-up of the parallel code has been tested against the number of processors, the size of the computational grid, the computational platform, and the domain decomposition criterion. The results obtained show a good scalability and portability of the parallel code and indicate a promising 3D extension of the model.
1
Introduction
Pyroclastic flows are high-velocity and high-temperature ground-hugging flows occurring during explosive volcanic eruptions and commonly produced by the collapse of the volcanic column. In the last ten years, there have been considerable advances in the understanding of the generation and propagation of these flows as well as in their numerical simulation. The modeling work has developed according to two different typologies: 1) models based on a monodimensional, steady-state, homogeneous flow description of the process (Bursik and Woods [1]; Dade and Huppert [2]) and 2) models based on a 2D, transient, and multiphase flow description of the physical process capable of solving the fundamental equations of conservation of mass, momentum, and energy for each phase of the eruptive mixture (Valentine and Wohletz [8]; Dobran et al. [3]).
The present work pertains to the latter approach and, specifically, to the most recent development of this second type of models, describing pyroclastic flows as a mixture of n different particulate solid phases dispersed in a gas continuous phase (Neri et al. [6]). Results obtained by this model have thrown light on the potential of this approach in the study of the thermal and mechanical non-equilibrium processes within the flow and also in the quantification of pyroclast dispersal in the various transport systems. Application of this model to the hazard assessment from pyroclastic flows has also been carried out for Vesuvius (see Figure 1) (Dobran et al. [4]; Neri et al. [7]). It was therefore possible to estimate the arrival times of the flow at different distances from the crater, the maximum distance reached by the flow, the effect of topographic barriers, and the temperature and dynamic pressure of the flow.
Figure 1: Particle distribution and flow field of a pyroclastic flow simulation at Vesuvius. Contour lines indicate the log to the base 10 of the total volumetric fraction of solids.
However, the 2D assumption of the code limits the application of the model to specific and limited sectors of the volcanic cone. In addition, the large computational resources required by the model have so far limited the simulation time to a few tens of minutes of eruption thus preventing the analysis of any long-term process. The aim of this work is the development of a prototype 2D parallel version of the above mentioned physical model and its application to the study of pyroclastic flows at Vesuvius. Such a parallel code is a necessary intermediate step in order to achieve a longer simulation time as well as to move towards a fully 3D description of the phenomenon.
2
Physical model overview
According to the model, each phase is treated as a continuum and balance equations for mass, momentum, and energy are solved accounting for advective transport, viscous dissipation, and interphase momentum and energy transfers. If subscripts k and j indicate the k-th phase and the j-th chemical component in the gas phase, respectively, the fundamental transport equations solved by the model can be written as in Neri et al. [6]; the mass balance of the k-th phase, for instance, reads

\frac{\partial}{\partial t}(\epsilon_k \rho_k) + \nabla \cdot (\epsilon_k \rho_k \mathbf{v}_k) = 0,

and analogous transport equations are solved for the gas chemical components and for the momentum and enthalpy of each phase,
where ε is the volumetric fraction, ρ the density, y the mass fraction of each gas component, v the velocity, T the stress tensor, h the enthalpy, f the body force, p the interphase force, P the pressure, q the heat flux, and u the interphase heat transfer term. Subscripts g, s1, s2, ..., sn denote the gas phase and the n solid phases, respectively. An equation of state and a Newtonian stress tensor are prescribed for each phase in order to close the set of coupled partial differential equations (PDEs). The large-scale dynamics of the dispersal process are described by adopting a conventional Large Eddy Simulation (LES) approach for the gas phase, whereas physical and rheological properties of solid particles, as well as interactions between them, are described by using semi-empirical correlations validated in the laboratory. The transport equations of the physical model are solved in 2D, by applying a numerical procedure based on the Implicit Multi Field (ICE-IMF) scheme (Harlow and Amsden [5]). This consists of a first-order, finite-difference algorithm on a staggered grid with implicit treatment of the pressure and interphase terms in the momentum balance equations. Convective fluxes are calculated explicitly with a donor-cell upwind scheme. Diffusive fluxes are computed explicitly with a centered finite difference scheme. Energy equations are fully explicit. The set of the resulting algebraic equations is solved iteratively with a point-relaxation (SOR) technique. This computational technique is particularly suited for the study of the dynamics of both subsonic and supersonic multiphase flows.
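To illustrate the point-relaxation idea only (not the actual coupled multiphase solver), the following self-contained C sketch applies SOR to a small linear system; the relaxation factor, the tolerance and the test system are arbitrary choices for the example.

#include <math.h>
#include <stdio.h>

#define N 4

static void sor(const double A[N][N], const double b[N], double x[N],
                double omega, double tol, int max_iter)
{
    for (int it = 0; it < max_iter; it++) {
        double max_corr = 0.0;
        for (int i = 0; i < N; i++) {
            double sigma = 0.0;
            for (int j = 0; j < N; j++)
                if (j != i) sigma += A[i][j] * x[j];
            double x_new = (b[i] - sigma) / A[i][i];
            double dx = omega * (x_new - x[i]);   /* over-relaxed point update */
            x[i] += dx;
            if (fabs(dx) > max_corr) max_corr = fabs(dx);
        }
        if (max_corr < tol) break;    /* convergence controlled by a residual */
    }
}

int main(void)
{
    double A[N][N] = {{4,1,0,0},{1,4,1,0},{0,1,4,1},{0,0,1,4}};
    double b[N] = {1,2,2,1}, x[N] = {0,0,0,0};
    sor(A, b, x, 1.3, 1e-10, 1000);
    printf("%f %f %f %f\n", x[0], x[1], x[2], x[3]);
    return 0;
}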
3
Parallelization strategy
The parallelization strategy adopted is based on a domain decomposition of the real space grid using a SPMD (Single Program Multiple Data) paradigm within a message passing scheme. We adopted two different partitioning criteria: the first one subdivides the whole domain into N rectangular sub-domains, or "blocks", similar to a chessboard arrangement, whereas the second one subdivides the domain into N horizontal "layers". In order to avoid an unbalanced load between the processors when a large number of cells is required for the representation of the volcano topography, we assigned, as a first approximation of the relative load, a computational weight to each cell. Specifically, we assigned a weight of one to fluid cells and a weight of zero to boundary and soil cells. We then modified the two above described decompositions so that each processor has the same number of fluid cells and therefore the same total computational weight (see Figure 2).
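As an illustration of this weighting (not the actual code), the following C sketch assigns each grid row to a "layer" so that every processor receives roughly the same number of fluid cells; the sizes, the names and the prefix-sum rule are assumptions made for the example.

#include <stdio.h>

#define NR 12            /* grid rows    */
#define NC 10            /* grid columns */
#define NPROC 4

/* cell_weight[r][c] = 1 for fluid cells, 0 for soil and boundary cells */
static int layer_of_row(const int cell_weight[NR][NC], int row)
{
    long total = 0, prefix = 0;
    for (int r = 0; r < NR; r++)
        for (int c = 0; c < NC; c++) total += cell_weight[r][c];
    for (int r = 0; r <= row; r++)
        for (int c = 0; c < NC; c++) prefix += cell_weight[r][c];

    /* assign the row to the processor owning its share of the total weight */
    long p = (total > 0) ? (prefix - 1) * NPROC / total : 0;
    if (p < 0) p = 0;
    if (p >= NPROC) p = NPROC - 1;
    return (int)p;
}

int main(void)
{
    int w[NR][NC];
    for (int r = 0; r < NR; r++)
        for (int c = 0; c < NC; c++)
            w[r][c] = (r > 2) ? 1 : 0;      /* top rows: "topography", weight 0 */
    for (int r = 0; r < NR; r++)
        printf("row %2d -> layer %d\n", r, layer_of_row(w, r));
    return 0;
}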
Figure 2: Example of domain decompositions with balanced distribution of the fluid cells in four blocks (left) or four layers (right). The domain is axisymmetric so that the figure represents a poloidal half-plane with the symmetry axis on the left hand-side boundary and the volcano topography at the bottom. In these figures the cells are represented with a uniform size so that the areas of the computational sub-domains are equal.
Communication between processors allows the exchange of the flow variables at the boundaries between sub-domains. Boundary values of neighbour processors are allocated in a virtual frame surrounding the processor subdomain (ghost-cells). The data-exchange interface is completely transparent providing a direct way to store and access values allocated in the ghost-cells. It is worth noting that the boundary data exchange is required at each iteration of the point-by-point procedure solving the implicit algebraic system of PDE.
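A minimal MPI sketch of such a ghost-cell exchange for the layers decomposition is shown below; the one-row-wide frame, the buffer names and the neighbour numbering are illustrative, not the actual implementation.

#include <mpi.h>

#define NC 128                       /* cells per row in the local sub-domain */

static void exchange_halo(double *top_ghost, double *top_row,
                          double *bot_ghost, double *bot_row,
                          int rank, int nproc, MPI_Comm comm)
{
    int up   = (rank > 0)         ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < nproc - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send my first row upwards, receive the lower neighbour's boundary row */
    MPI_Sendrecv(top_row,  NC, MPI_DOUBLE, up,   0,
                 bot_ghost, NC, MPI_DOUBLE, down, 0, comm, MPI_STATUS_IGNORE);
    /* send my last row downwards, receive the upper neighbour's boundary row */
    MPI_Sendrecv(bot_row,  NC, MPI_DOUBLE, down, 1,
                 top_ghost, NC, MPI_DOUBLE, up,   1, comm, MPI_STATUS_IGNORE);
}

int main(int argc, char **argv)
{
    int rank, nproc;
    double top_ghost[NC], top_row[NC], bot_ghost[NC], bot_row[NC];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    for (int i = 0; i < NC; i++) { top_row[i] = bot_row[i] = (double)rank; }
    exchange_halo(top_ghost, top_row, bot_ghost, bot_row, rank, nproc, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}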
4
Results
We tested the speed-up of the parallel code against various parameters: 1) the number of processors, 2) the size of the computational mesh, 3) the flow conditions, 4) the computational platform, and 5) the criterion of domain decomposition. The speed-up was evaluated over 100 time-steps, by keeping fixed the maximum residual of the equation of conservation of mass, that is, the parameter controlling the convergence of the iterative procedure. The two non-uniform grids employed cover the same computational domain and are characterized as follows. Coarse grid: 146 x 251 total cells (27031 fluid cells, 9061 soil cells, and 554 boundary cells). Fine grid: 292 x 501 total cells (128418 fluid cells, 16674 soil cells, and 1200 boundary cells). The first set of speed-up curves is reported in Figure 3. They have been obtained on an IBM SP Power3/128 (8 nodes, 128 Nighthawk processors, 375 MHz, 192 Gflop/s peak performance). Figures 3(a) and 3(b) refer to the coarse and fine grids, respectively, with the blocks domain decomposition. In addition, different curves in each figure refer to different flow conditions (corresponding to those at 0, 30, and 300 s from the beginning of the simulation).
Figure 3: Speed-up curves obtained with the coarse (a) and fine (b) grids. Both figures were obtained on an IBM SP Power3/128 computer, assuming a blocks decomposition of the domain.
Both figures show an initial quasi-linear speed-up followed by a saturation of the curves. This is clearly due to the increase of the communication/computation time ratio with the decrease of the sub-domain sizes. From a comparison between the two figures it is also evident that, as expected, the number of processors needed to reach
saturation increases with the size of the problem envisaged. On the coarse grid the number of iterations needed to reach the prescribed residual of the mass-balance equation was not influenced by the domain decomposition. On the fine grid we observed that the point-by-point over-relaxation procedure needs a lower number of iterations to reach convergence when the number of processors increases, especially for flow conditions typical of the beginning of the simulation (t = 0 in the figure). This fact can justify the higher efficiencies (the ratio between the speed-up and the number of processors) obtained with the fine grid (e.g., the speed-up with 12 processors is 13.1). This effect is not easily predictable and further studies are in progress to verify the conditions of its occurrence. We then investigated the speed-up of the parallel code on two other architectures: - Cray T3E 1200/256E (256 processors, 600 MHz, 308 Gflop/s peak performance) - SGI Origin 3800/128 (128 R14000 processors, 500 MHz, 128 Gflop/s peak performance). The comparison between the three speed-up curves is shown in Figure 4(a) up to 32 processors, whereas Figure 4(b) shows the speed-up curve of the Cray T3E up to 128 processors.
Figure 4: Comparison of the speed-up curves of the IBM SP3, SGI O3K and CRAY T3E (a), and of the CRAY T3E alone (b). The curves are obtained by using the fine grid and at t = 0.
The three speed-up curves are very similar up to 16 processors, whereas the CRAY T3E performs significantly better with 32 processors. This is due to the smaller latency time of the CRAY interconnecting network which, together with the lower performance of its processors, delays the speed-up saturation up to 64 processors. On this architecture, the saturation
of the speed-up is mainly caused by the existence of non-parallelizable parts of the code (e.g., the I/O). In Table 1, the computational time needed to advance one time-step using 8, 16 and 32 processors is reported in order to give an estimate of the relative speed of the three architectures.

Platform  | 8 proc. | 16 proc. | 32 proc.
IBM SP3   |   3.22  |   1.66   |   1.26
SGI O3K   |   3.55  |   1.74   |   1.17
CRAY T3E  |   9.62  |   4.57   |   2.41

Table 1: Time per time-step (s) using 8, 16 and 32 processors on the three architectures employed.
Finally, the effect of the criterion of domain decomposition on the speed-up is shown in Figure 5. Figures 5(a) and 5(b) refer to the IBM SP and CRAY T3E, respectively. For the IBM SP, the layers decomposition produces an increase of the speed-up with respect to the blocks decomposition for both mesh sizes. This can be explained considering that, although the layers decomposition has a greater amount of data to exchange with respect to the blocks decomposition (the total length of the inter-processor boundaries increases with N instead of about √N), the number of neighbours of each processor is smaller, thus decreasing the number of barriers and therefore the total time for communication. This trend is, in some cases, inverted for the CRAY T3E due to its much smaller latency time, which makes the amount of data exchanged the controlling factor.
Figure 5: Effect of the domain decomposition on the speed-up on the IBM SP3 (left) and on the CRAY T3E (right)
5
Conclusions
We carried out the parallelization of a transient, 2D, multiphase flow code of pyroclastic flow dynamics by adopting a domain decomposition of the real space grid and using a SPMD paradigm within a message passing scheme. The speed-up curves show a good scalability of the code with the possibility to define an optimal number of processors as a function of the mesh size selected. The employed parallelization strategy allows a good portability of the code on the three platforms tested and suggests the adoption of different partitioning criteria on different architectures. The results obtained are overall promising for a future 3D extension of the model. Acknowledgments Code development and numerical simulations were performed using computer resources of the CINECA Consortium, Casalecchio di Reno (BO), Italy. Partial support from Gruppo Nazionale per la Vulcanologia, Istituto Nazionale di Geofisica e Vulcanologia, Italy, project no. 2000-2/9 is also acknowledged. References 1. Bursik, M.I., and A.W. Woods, The dynamics and thermodynamics of large ash flows, Bull. Volcanol, 58, 175-193, 1996. 2. Dade, W.B., and H.E. Huppert, Emplacement of the Taupo ignimbrite by a dilute turbulent flow, Nature, 381, 509-512, 1996. 3. Dobran, F., A. Neri, and G. Macedonio, Numerical simulation of collapsing volcanic columns, / . Geophys. Res., 98, 4231-4259, 1993. 4. Dobran, F., A. Neri, and M. Todesco, Assessing pyroclastic flow hazard at Vesuvius, Nature, 367, 551-554, 1994. 5. Harlow, F.H., and A. A. Amsden, Numerical calculation of multiphase fluid flow, J. Comput. Phys., 17, 19-52, 1975. 6. Neri, A., G. Macedonio, D. Gidaspow, and T. Esposti Ongaro, Multiparticle simulation of collapsing volcanic columns and pyroclastic flows, sub judice and VSG-Report no. 2001-2, 2001. 7. Neri, A., T. Esposti Ongaro, M. Todesco, G. Macedonio, P.Papale, R. Santacroce, A. Longo, and D. Del Seppia, Numerical simulation of pyroclastic flows at Vesuvius aimed at hazard assessment, Final Report EC Project ENV4-CT98-0699, 2000b. 8. Valentine, G.A., and K.H. Wohletz, Numerical models of Plinian eruption columns and pyroclastic flows, J. Geophys. Res., 94, 1867-1887, 1989.
A FAST DOMAIN DECOMPOSITION ALGORITHM FOR THE SIMULATION OF TURBOMACHINERY FLOWS
P. GIANGIACOMO, V. MICHELASSI AND G. CHIATTI
Università degli Studi Roma Tre, Dipartimento di Ingegneria Meccanica e Industriale, Via della Vasca Navale 79, 00146 Roma, Italy
E-mail: [email protected], [email protected], [email protected]
Parallel computation techniques may bring a considerable saving in time in the computer simulation of turbomachinery flows. The implicit Navier-Stokes solver XFLOS, operating in an MPI environment, takes advantage of the peculiar features of these flows. This is attained by decomposing the fluid domain into up to 16 streamwise subdomains to be solved in parallel. The explicit evaluation of spanwise fluxes at block interfaces slows down the convergence rate with the highest number of processors. The convergence rate of single processor runs may be restored by a sub-iterative solution of the implicit system in the spanwise direction. In this case, the reduction in the number of iterations to reach a fixed residual level was found to more than compensate the additional effort per iteration, and with 16 processors a speed-up of 12 at fixed residual level was achieved. The algorithm was successfully tested for the prediction of a centrifugal compressor impeller performance.
1
Introduction
The design of modern centrifugal compressors takes ever more advantage of CFD techniques. CFD is often an alternative to experimental practice in the first phases of the design process, in which time-consuming trial-and-error procedures may take full advantage of computer simulations [1]. The increased reliability of CFD allows the off-design performance to be accurately predicted. This implies that the characteristic curves, spanning the entire operational range, can be conveniently computed in a relatively short computational time. In this scope, parallel multiprocessor computing techniques, which split the computational task among several processors, may bring a considerable speed-up of the computations [2]. Parallel solvers are often based on domain decomposition techniques that can provoke a considerable increase in the number of iterations to converge. Still, when the solver is dedicated to flows in streamlined channels, as in turbomachinery components, it is possible to split the computational domain and alter the algorithm to take full advantage of this class of problems. In this view, a simple parallel version of the time-marching implicit Navier-Stokes solver XFLOS [3,4] for distributed memory architectures has been developed. The solver is aimed at the simulation of flows in stators and rotors of turbines and compressors. The parallelised code has been used to predict the performance of a centrifugal compressor impeller for industrial applications. The main scope of this version of
the code was to obtain a considerable reduction of the computational time with a limited number of processors and optimized inter-processor data exchange. 2
Algorithm
XFLOS solves the three-dimensional Navier-Stokes partial differential equations written in unsteady conservative form. The equations are made non-dimensional with respect to the stage inlet stagnation temperature, pressure and inlet molecular viscosity. For rotor flows, either absolute or relative flow variables may be adopted, and for the present computations absolute variables were chosen. Turbulence is accounted for by means of a two-equation model, with the addition of a constraint to threshold the overproduction of turbulent kinetic energy near stagnation points and avoid instabilities. The equations are discretised by an implicit time marching scheme, and by centred finite differences in space. The three-dimensional transport equations, T(ξ,η,ζ), are split into the sequence of three one-dimensional problems by the diagonal alternate direction implicit (DADI) algorithm:

RHS = F(ξ) + F(η) + F(ζ),    T(ξ,η,ζ) ≅ T(ξ) × T(η) × T(ζ),

in which F represents the space fluxes, T the implicit operators, and ξ, η, ζ are the streamwise, pitchwise, and spanwise directions respectively. Each one-dimensional operator, T(ξ), T(η), T(ζ), requires the solution of a scalar pentadiagonal system.
3
Domain decomposition strategy and system re-coupling
A unique simply connected structured grid is generated for single and multiprocessor runs. The grid is decomposed in the spanwise direction only, into non-overlapping blocks (see Figure 1a), each of which is assigned to one processor. All points of the overall grid belong to one block only, and no interpolation is required to assemble fluxes at the interfaces. This procedure avoids recomputations at the interfaces, and theoretically ensures the identity of single and multiprocessor solutions at convergence, apart from the different truncation errors. This simple domain splitting was deemed particularly advantageous for turbomachinery flows with small spanwise convection and weak secondary flows, for which F(ζ) ≪ F(ξ) and F(ζ) ≪ F(η) in most of the computational domain. The spanwise fluxes and damping terms at the sub-domain interfaces are evaluated explicitly by extending each block by two node layers in the spanwise direction. At each iteration, the solution within the extensions is simply exchanged between neighbouring blocks without interpolation, and the equations are solved only in the unextended region.
The boundary condition at the impeller exit is imposed in terms of the pressure integral over the whole exit section. In multiprocessor runs, this requires the exit pressure integrals of each block to be exchanged among processors.
Figure 1 - Sample domain decomposition into non-overlapping blocks (ζ sweep is decomposed).
Observe that with respect to single processor computations, the present decomposition does not alter the RHS and the implicit operators, T(ξ) and T(η), in the streamwise and tangential directions. Conversely, the explicit evaluation of spanwise fluxes and artificial damping terms at the interfaces (see Figure 1b) uncouples the implicit operator in the spanwise direction ζ. This is made clear by the generic row of the scalar pentadiagonal system,

E \Delta Q_{i-2} + C \Delta Q_{i-1} + A \Delta Q_{i} + B \Delta Q_{i+1} + D \Delta Q_{i+2} = RHS_i,

whose four rows adjacent to a block interface are solved with the corrected right-hand sides

RHS_{n,K-1} - D \Delta Q^{(m-1)}_{n+1,1},
RHS_{n,K} - B \Delta Q^{(m-1)}_{n+1,1} - D \Delta Q^{(m-1)}_{n+1,2},
RHS_{n+1,1} - C \Delta Q^{(m-1)}_{n,K} - E \Delta Q^{(m-1)}_{n,K-1},
RHS_{n+1,2} - E \Delta Q^{(m-1)}_{n,K},
in which n,K and n+1,1 denote the last node K of the n-th block and the first node of the (n+1)-th block, respectively. At the interfaces the domain decomposition drops the implicit terms coupling the two neighbouring blocks and reduces the overall implicitness. In order to compensate for this effect, the neglected implicit terms are iteratively evaluated and reinserted as a correction to the RHS of the ζ sweep, as in the expressions above, to restore a partial implicit link between the blocks. ΔQ^(m) is solved as a function of ΔQ^(m-1) for a fixed number of sweeps, starting from ΔQ^(1) = 0. This correction is applied only to the five scalar equations of the mean flow.
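The following fragment is a hedged C sketch of this re-coupling step for the last two rows of block n; the data layout and the names are illustrative assumptions, not the XFLOS implementation.

typedef struct {
    double B, C, D, E;        /* off-diagonal coefficients at the interface */
} iface_coeff_t;

/* dq_nb[0], dq_nb[1]: Delta_Q^(m-1) at the first two nodes of the
   neighbouring block, received through the two ghost node layers.           */
static void correct_rhs_last_two_rows(double *rhs_Km1, double *rhs_K,
                                      const iface_coeff_t *c,
                                      const double dq_nb[2])
{
    *rhs_Km1 -= c->D * dq_nb[0];                      /* row n,K-1 */
    *rhs_K   -= c->B * dq_nb[0] + c->D * dq_nb[1];    /* row n,K   */
    /* symmetrically, the first two rows of block n+1 would receive
       RHS_{n+1,1} -= C*dq_n[K] + E*dq_n[K-1] and RHS_{n+1,2} -= E*dq_n[K] */
}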
4
The impeller
The XFLOS code has been used to simulate the flow inside the centrifugal impeller shown in Figure 2. The impeller has 17 shrouded blades with an external diameter of approximately 400 mm and rotates at 4578 RPM. The exit flow field is high subsonic. The periodic nature of the flow field allows only one vane to be solved. The computational mesh, a coarse version of which is shown in Figure 2, has 73x69x50 nodes in the streamwise, tangential and radial directions respectively. Grid points are clustered around the blade and near the endwalls. This node distribution has been verified to ensure grid-independent results.
Figure 2 - The impeller (shown with shroud off)
5
Characteristic curves
Figure 3 shows the computed characteristic curves in terms of polytropic compression efficiency, η, and load coefficient, χ, versus the flow coefficient, φ. The coefficients are made non-dimensional with respect to the design values, d. The computed curves are made of 11 computed steady-state operation points obtained by changing the exit static pressure. The accuracy of the solver, discussed in Michelassi et al. [3], allows capturing the essential features of the impeller with a good degree of confidence, as proved by the good fit with the measured characteristic curve. The most relevant feature of this simulation is the overall CPU time: with 2.52×10^5 grid nodes and 16 processors the computation of each operative point requires approximately 1700 s to reach an overall residual reduction of approximately three orders of magnitude. This means that a full characteristic curve made of 11 characteristic points requires approximately 5 hours if each calculation is restarted from scratch. When initialising each computation with the flow field relative to the previous point the overall CPU time can be reduced by a factor of about 2/3.
Figure 3. Characteristic curves of the centrifugal impeller.
6
Computational details
To exploit the parallel capabilities of the code, the fluid domain has been subdivided into up to 16 blocks. To balance the computational effort among processors, the blocks have been given the same number of nodes in the spanwise direction, except for the first and last block, which have one more node layer over the endwall. With 16 processors, this resulted in as little as 3 node layers per block. All the runs have been performed on a Cray T3E computer with MPI FORTRAN, and the CPU time has been monitored by Cray's MPP Apprentice performance analysis tool, running on Cray MPP systems. The equation unbalances for the five scalar equations of the mean flow are very similar, and their mean value is plotted in Figure 4 versus the number of iterations. With a single sweep of the spanwise system (i.e. no subiteration), the domain decomposition into 2, 4, or 8 processors does not deteriorate the convergence rate. With 16 processors, the residual level after 1000 iterations is higher by less than one order of magnitude, and one subiteration is sufficient to restore the single-processor convergence history, for a total of two sweeps of the spanwise system T(ζ). This proves the validity of the proposed decomposition, which does not require excessive inter-block communication and still maintains a good degree of implicitness.
Figure 4 - Mean residual versus iterations. Table 1 - Speed-up and efficiency after 1000 iterations
Speed-up and efficiency of the parallel computations after a fixed number of iterations are summarised in Table 1. For up to 8 blocks, the code experiences efficiencies larger than one. This is a known result (Michelassi and Giangiacomo [4]), and is caused by the reduction in the execution time of some XFLOS subroutines, probably due to caching and/or striding, being larger than the time consumed by data exchange. For 16 processors the reverse situation occurs, and the
efficiency falls below one. When performing the second spanwise sweep the efficiency decreases by approximately 10%.
Figure 5 - Mean residual versus time normalised with respect to the 1-processor computational time.
However, from the application point of view it is indeed more significant to compare speed-up and efficiency at a fixed residual level (i.e. with the same quality of the solution). This comparison is summarised in Table 2, which reports a maximum speed-up factor of about 12 for a mean residual of 0.25×10^-4. Apart from the difference between 590 and 630 iterations, which is deemed negligible, it is important to note the poor speed-up achieved with 16 processors and one sweep, which is just 10% better than that for 8 processors. The re-coupling given by the extra ζ sweep proves to be efficient, since the increased computational effort per iteration is more than compensated by the reduction in the number of iterations. Figure 5 shows the bi-log convergence history versus the elapsed time, normalised with the elapsed time of the single-processor run. This plot shows that the gain in efficiency brought by the subiterations when using 16 processors is worth the slightly increased complexity of the code.
7
Conclusions
The parallel version of the solver allows a fast computation of characteristic curves of centrifugal impellers. The accuracy and conservation properties of the code are not affected by the domain decomposition. The computations were greatly accelerated by running the code on a parallel computer. The parallel XFLOS code
proved to be both effective and efficient in the simulation of the centrifugal impeller, thanks to the simple domain decomposition and to the lack of recalculations and interpolations. Decomposition into up to 8 processors did not appreciably reduce the convergence rate versus iteration number. With 16 processors the convergence rate decreases with respect to the one-processor case. To recover the single-processor convergence, the sub-domains needed to be re-coupled by the sub-iterative procedure. The re-coupling proved quite successful, since it also managed to improve the efficiency of the 16-processor calculation. In general, the reduction in the processor data exchange in the implicit step is compensated by the re-coupling when using 16 processors. With 16 processors, the computation time could be reduced to 1/12 while retaining the same solution quality as the single-processor runs.
8
Acknowledgements
The technical support and CPU time granted by Cineca are gratefully acknowledged. The authors also wish to express their gratitude to Dr. Marco Giachi of Nuovo Pignone-GE for providing the geometry of the impeller and the experimental data.
References
1. Casey, M. V., Dalbert, P., Roth, P., The use of 3D viscous flow calculations in the design and analysis of industrial centrifugal compressors. Journal of Turbomachinery 114 (1992) pp. 27-37.
2. Schiano, P., Ecer, A., Periaux, J., Satofuka, N. (ed.), Parallel Computational Fluid Dynamics - Algorithms and Results Using Advanced Computers (Elsevier, 1997). Proceedings of the Parallel CFD '96 Conference, Capri, Italy, May 20-23 1996.
3. Michelassi, V., Pazzi, S., Echtner, S., Giangiacomo, P., Martelli, F., Giachi, M., Performances of Centrifugal Compressor Impellers in Steady and Unsteady Flow Regimes Under Inlet Flow Distortions. International Gas Turbines and Aeroengine Congress and Exhibition, 4-7 June 2001, New Orleans, LA, ASME Paper 2001-GT-325.
4. Michelassi, V., Giangiacomo, P., Simulation of Turbomachinery Flows by a Parallel Solver with Sub-iteration Recoupling. 1st International Conference on Computational Fluid Dynamics, July 10-14, 2000, Kyoto, Japan.
MASSIVELY PARALLEL IMAGE RESTORATION WITH SPATIALLY VARYING POINT-SPREAD-FUNCTIONS
GERARD GORMAN
Applied Modelling and Computation Group, T. H. Huxley School, Imperial College, Prince Consort Road, London SW7 2BP, U.K. E-mail: [email protected]
ANDY SHEARER, NIALL WILSON, TRIONA O'DOHERTY, RAYMOND BUTLER
Information Technology Dept., NUI Galway, Galway, Ireland. E-mail: {shearer, niall, triona, ray}@itc.nuigalway.ie
This paper describes a newly developed code, iMPaIR, which performs iterative image deconvolution in parallel. The basic algorithm used is the Richardson-Lucy Maximum-Likelihood iterative procedure. Additionally, iMPaIR can use a spatially-variant point spread function in the deconvolution process. The basic Richardson-Lucy algorithm is described as well as details of the parallel implementation. Applications and results in the areas of astrophysical imaging and medical x-ray imaging are briefly discussed. In the medical field such restoration algorithms are impractical on single processors - computation time should be measured in seconds rather than hours. We show that for this type of application a small number of processors should be able to analyse a full X-ray image (~3730x3062 pixels) in less than a minute.
1
Introduction
The problem of image restoration (deconvolution) can be expressed in the following generic form:
\phi(r) = \int \psi(\xi)\, h(r - \xi)\, d\xi \qquad (1)
where ψ is the function sought, φ is the observed function and h is the kernel of the equation, or in particular, the point-spread-function (hereafter PSF). h represents a local averaging of ψ function values, which is associated with the remoteness of the observer and/or the impulse response of the measuring device. Fundamentally, imaging involves the detection and counting of photons, therefore the functions φ, ψ and h can be considered to be probability density functions. Thus, by definition these functions must be non-negative everywhere. This fact offers a powerful constraint in the determination of ψ. In the case of a digitized image, Equation 1 can be re-expressed as the quadrature

\phi_i = \sum_j h_{ij}\, \psi_j \qquad (2)
In this form, φ_i represents a pixel value. One might naively try to solve for ψ through straightforward matrix inversion. However, the matrix h can contain many values
close to zero, thus errors get amplified. To overcome this problem, iterative inversion techniques have been developed to solve Equation 2. One such approach is the Richardson-Lucy procedure [1,2]. This technique utilizes the non-negativity constraint on ψ_j, and uses the Principle of Maximum Likelihood as a goodness-of-fit criterion. Consider the log-likelihood of some estimate ψ,

H(\psi) = \sum_i \phi_i \ln \hat{\phi}_i \qquad (3)
where φ̂_i is calculated from Equation 2, ψ_j being some estimate of the density source function, and φ_i is the original image. The problem now involves the maximisation of H(ψ) subject to the constraint ψ_j > 0 for all j. This constrained optimization problem can be solved iteratively using the algorithm

\psi_j^{r+1} = \psi_j^{r} \sum_i h_{ij}\, \phi_i / \hat{\phi}_i^{\,r} \qquad (4)

where r denotes the iteration number and ψ_j^1 is initialised to a constant. Since each iteration increases the likelihood, and H(ψ) is a convex function, the maximum of H is approached monotonically. In practice the image restoration algorithm described has a tendency to break ψ into near delta functions after high numbers of iterations. It is this fact which also complicates the automated halting of image reconstruction based on a contrast criterion. The authors examined the variance of Pearson's r and the Chi-square statistic between the original image and the restored image convolved with the PSF. It was found that both statistics indicated an increase in goodness of fit at high iteration numbers, although visual inspection of the image showed a decrease in quality. The problem of giving a quantitative measure of image improvement, for an arbitrary image, is a general one because naturally the actual distribution function is unknown. A statistical analysis of the image reconstruction process, when the actual distribution function was known, has been published elsewhere [3]. It was noted by the authors that the Chi-square difference between two successive images decreased exponentially. We found that the rule of thumb of stopping the procedure after the difference between successive iterations drops to 1/e of the maximum gave a near optimal solution. This is complemented by a hard upper limit on the number of iterations. A further regularisation of the Richardson-Lucy procedure was proposed by Starck et al. [4]. Starck used a regularisation technique based on a wavelet denoising algorithm applied to the residual between iterations, where the residual is defined as

R_i = \phi_i - \hat{\phi}_i \qquad (5)
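A minimal, self-contained sketch of the basic Richardson-Lucy loop of Equation (4) is given below in one dimension (2D images work the same way with double sums). It uses a fixed number of iterations in place of the 1/e stopping rule discussed above, a normalised triangular PSF with edge effects ignored, and illustrative sizes; it is not the iMPaIR implementation.

#include <stdio.h>

#define N 64              /* signal length          */
#define W  5              /* PSF half-width support */

static double psf(int d)  /* normalised triangular PSF; it sums to 1 over d */
{
    int a = d < 0 ? -d : d;
    return (a <= W) ? (W + 1.0 - a) / ((W + 1.0) * (W + 1.0)) : 0.0;
}

static void convolve(const double *x, double *y)   /* y = h * x */
{
    for (int i = 0; i < N; i++) {
        double s = 0.0;
        for (int j = 0; j < N; j++) s += psf(i - j) * x[j];
        y[i] = s;
    }
}

static void rl_iterate(const double *phi_obs, double *psi_est, int iters)
{
    double phi_est[N], corr[N];
    for (int r = 0; r < iters; r++) {
        convolve(psi_est, phi_est);                /* phi^r = h * psi^r     */
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int i = 0; i < N; i++)
                if (phi_est[i] > 0.0)
                    s += psf(i - j) * phi_obs[i] / phi_est[i];
            corr[j] = s;
        }
        for (int j = 0; j < N; j++)
            psi_est[j] *= corr[j];                 /* multiplicative update */
    }
}

int main(void)
{
    double truth[N] = {0}, observed[N], estimate[N];
    truth[20] = 50.0; truth[40] = 30.0;            /* two point sources       */
    convolve(truth, observed);                     /* noiseless "observation" */
    for (int j = 0; j < N; j++) estimate[j] = 1.0; /* psi^1 = constant        */
    rl_iterate(observed, estimate, 50);
    printf("estimate near the sources: %.2f %.2f\n", estimate[20], estimate[40]);
    return 0;
}

The regularised variant discussed next modifies this loop by denoising the residual of Equation (5) at each iteration.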
Specifically the algorithm performs a hard thresholding of the wavelet coefficients of the residual. The threshold level is set by a study of variation of Gaussian noise in
this wavelet space. Although the algorithm proposed by Starck is effective, it is computationally demanding due to the redundant à trous wavelet transformation applied. In addition, the method used to set the denoising threshold is based on the assumption of Gaussian noise. In order to improve upon this we use an algorithm based on a non-redundant transform - the Daubechies wavelet [5] - which is more efficient in terms of storage. We denoise the residual between iterations using soft thresholding (wavelet shrinkage), which is statistically more attractive than hard thresholding [6]. We select the level-dependent soft threshold as a fraction k of the wavelet coefficient standard deviation.
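The shrinkage step itself can be sketched as follows, assuming the wavelet coefficients of one decomposition level are already available in an array; the function names are illustrative and the Daubechies transform itself is not reproduced here.

#include <math.h>

static double soft(double w, double t)      /* soft thresholding of one coefficient */
{
    if (w >  t) return w - t;
    if (w < -t) return w + t;
    return 0.0;
}

static void shrink_level(double *coeff, int n, double k)
{
    double mean = 0.0, var = 0.0;
    for (int i = 0; i < n; i++) mean += coeff[i];
    mean /= n;
    for (int i = 0; i < n; i++) var += (coeff[i] - mean) * (coeff[i] - mean);
    double sigma = sqrt(var / n);
    double t = k * sigma;                   /* level-dependent soft threshold */
    for (int i = 0; i < n; i++) coeff[i] = soft(coeff[i], t);
}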
2
Parallel Approach
Iterative deconvolution is an ideal candidate for parallelisation. The reconstruction process is an expensive computation due to the high number of convolutions required. But once a buffer zone is provided (this is required to absorb errors due to the Fourier transformation at the edges), the image can be processed in small segments. Thus an algorithm for processing the whole image would look like:
• divide the image into N sub-images
• include a halo region for each sub-image
• process each of the sub-images
• put the sub-images back together to form the total deconvolved image
To execute this in parallel, it is natural to assume a task farm model (master-slave model). The master first reads in the image to be deconvolved and then decides how the image is to be subdivided, thus forming a list of jobs. It then distributes these jobs amongst idle slaves for processing, and awaits results. When a slave is finished with a subsection, it returns the result to the master and waits for another job. In practice, for low numbers of processors, the master process is usually idle; thus, in a system which supports multi-tasking, it is possible to run a slave on the same processor as the master without degradation of performance. It is apparent from the above discussion that the more subsections the image is divided into for processing, the more computation must be done, as there is more padded area to be processed. So although there is no reason why the image could not be subdivided into the same number of jobs as pixels, this would in effect require the same amount of computation as an image of extent N'_p = N_p (N_psf + 1), where N'_p is the effective number of pixels processed, N_p is the number of pixels in the original image and N_psf is the number of pixels in the PSF. The extra cost of the buffer region is slightly offset due to the fact that the cost of computing the fast Fourier
transformation (FFT) increases as N log2 N rather than N; thus the time taken to carry out an FFT on all the segments is faster than carrying out an FFT on the whole data set at once. There is another reason why we would want to process an image in segments. In nearly all systems of measurement, the PSF displays a space dependence (indeed it may have a dependence on the amplitude of the signal, but we will not consider that here). The spatial dependence may be associated with remote effects such as a variance in photon scatter in biological media, or with a variance in the response of the detector itself. Often this variance can be ignored, but in astronomical imaging and digital mammography a spatially variant PSF (hereafter SV-PSF) has been shown to give significantly better results. In this work we focused on the spatial variance of the PSF associated with the impulse response of the imaging apparatus. A PSF model is fitted to a set of measured PSFs so that a PSF can then be determined for an arbitrary location on the image. The model assumes that the PSF varies smoothly, either linearly or quadratically, over the domain. This assumption can be easily tested for any specific instrument through an examination of the Chi-squared statistic between measured and generated PSFs. It is assumed that for sufficiently small subsections of the main image, a good approximation to a SV-PSF can be made by generating a PSF for the centre-most pixel of the subsection, and then assuming this PSF to be locally invariant. From an implementation point of view, very little overhead is incurred when a slave (having a PSF model) has to generate a PSF for each job. It also has a negligible impact on communication overheads, as the PSF model only has to be broadcast once at startup and thereafter only two additional integers must be sent for each job (the global coordinates of the centre-most pixel in the sub-image). In practice a balance has to be determined between the cost of dividing the image into subsections to be processed, and the increase in image quality when the PSF is sampled at a higher frequency. From this perspective it is interesting to look at the speedup curve of iMPaIR for a 2048 x 2048 test image which was divided into 64 sub-images of 512 x 512 pixels (Figure 1). The tests were carried out on a Beowulf cluster composed of 16 650 MHz AMD processors in regular PC system boards communicating through a 100 Mbit Ethernet switch, and an SGI Origin 2000 with 16 250 MHz MIPS R10000 processors. What we see in the figure is that the speedup curve for both architectures stays close to the ideal curve, which is what one would expect. The presence of plateaus is due to the fact that in the present implementation the sub-image size is fixed and independent of the number of processors available; hence there is always the same number of jobs regardless of the number of slave processes. So, if we have 64 jobs and 11 slaves and we assume that the slaves are running on the same system so that they will each take approximately the same
time to complete a job, then after 5 cycles we would expect to have 2 retired slaves and 9 working slaves, which indicates that the task will require 6 cycles in total to complete. However, if 12 slaves were used, then after 5 cycles we would have 8 retired slaves and 4 working slaves, which means that it still requires 6 cycles for the overall task to complete. Another interesting observation is that the performance of the Beowulf cluster scales almost as well as the more closely coupled SGI architecture. This illustrates that the communication overhead of the algorithm is minimal, and so inexpensive clusters could be used to perform this task in a minimal amount of time.
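For illustration, a bare-bones MPI task farm of this kind could be organised as in the following C sketch; the message layout, the tags and the placeholder process_subimage are assumptions, and only the job bookkeeping of the master-slave scheme is shown, not the iMPaIR implementation.

#include <mpi.h>

#define NJOBS 64
#define TAG_JOB  1
#define TAG_DONE 2
#define STOP    -1

static double process_subimage(int job)   /* placeholder for the restoration of one sub-image */
{
    return (double)job;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                       /* master: hands out job indices */
        int next = 0, active = 0;
        MPI_Status st;
        for (int w = 1; w < size && next < NJOBS; w++, next++, active++)
            MPI_Send(&next, 1, MPI_INT, w, TAG_JOB, MPI_COMM_WORLD);
        while (active > 0) {
            double result;
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_DONE,
                     MPI_COMM_WORLD, &st);
            active--;
            if (next < NJOBS) {            /* immediately refill the idle slave */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_JOB, MPI_COMM_WORLD);
                next++; active++;
            }
        }
        int stop = STOP;
        for (int w = 1; w < size; w++)     /* tell every slave to terminate */
            MPI_Send(&stop, 1, MPI_INT, w, TAG_JOB, MPI_COMM_WORLD);
    } else {                               /* slave: process jobs until STOP */
        for (;;) {
            int job;
            MPI_Recv(&job, 1, MPI_INT, 0, TAG_JOB, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (job == STOP) break;
            double result = process_subimage(job);
            MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_DONE, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}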
Figure 1. Speedup plot of iMPaIR, using 64 image subsections.
3
Applications
The above treatment is quite general and the code could be applied in many areas. The applications considered here are astronomical images and medical x-ray images, but the application of iMPaIR in the field of remote sensing is also being investigated. One of the first issues which arises in these applications is the determination of the PSF. If we consider Equation 1 we see that if the function ψ is the Dirac delta function δ(ξ - x), then φ(r) = h(r - x). In astronomical imaging, stars can be considered to be unresolved point sources, thus delta functions. Therefore it is reasonably straightforward to determine the PSF at some location x, provided there is a reasonably isolated star located there. In the case of medical x-ray images it is not so straightforward, as there are no natural delta features in a body. However, we were able to directly measure the impulse response of the instrument. This was done by placing a shield, which had holes cut in it via laser, between the x-ray tube and the detector.
Figure 2. On the left is an unprocessed section of the image. On the right is the processed image after 50 iterations
The holes were sufficiently small such that they were beyond the resolution of the detector, thus approximating a delta function. A sheet of perspex, which has a scatter coefficient of the order of that of human tissue, was used to approximate the scattering due to biological tissue. Thus the PSF can be determined in the usual manner. When a sufficient number of PSFs have been measured across the field of view, a PSF model can be generated.
3.1 Astronomical Imaging
In the past, one of the main areas in which iterative deconvolution has been used is astronomical imaging. In these cases the impulse response function has two sources. One is the impulse response of the detector itself, and the other arises from the scattering of light in the atmosphere (the latter effect naturally only being an issue for ground-based telescopes). It is interesting to note that particular interest in iterative deconvolution techniques arose in NASA after it was discovered that the primary mirror of the Hubble Space Telescope was defective. Accurate measurement and interpolation of the PSF allowed them to largely restore the images. Figure 2 illustrates the difference between the before and after images. This image pair is of a globular cluster in the Andromeda Galaxy. Only a small section is shown as the total image is too large for reproduction here (1999 x 1999 pixels). The technique is useful because it allows us to resolve objects which might otherwise be lost in noise or indistinguishable from nearby stars, while simultaneously denoising and improving the contrast.
Figure 3. On the left is the original X-ray image of a ball and socket joint. The right hand side shows the processed image
3.2 Medical Imaging
Digital mammography has the potential to provide radiologists with a tool which can detect tumours earlier and with greater accuracy than film-based systems. It has the additional advantage that lower doses are required in general. Although a digital mammography system can provide much greater contrast when compared with a conventional film system, the ability to detect small artifacts associated with breast cancer is limited by a reduced spatial resolution due to lack of screen sharpness and scatter-induced fog (in this case screen sharpness refers to how accurately the imaging plate detects the x-rays, i.e. the impulse response of the detector). Screen sharpness is reduced due to scattering of x-rays in the imaging plate. We model the radiological image formation process as the convolution of a PSF (which we have measured) with the projected tissue density source function. When applied to a University of Leeds TORMAX breast phantom, our results show as much as a two-fold improvement in resolution at the 50 percent MTF level [3]. Our results show that the regularised deconvolution algorithm significantly improves the signal-to-noise ratio in the restored image. The image sizes involved in this work are quite large (3730 x 3062). Thus the parallel processing of the image can give an important reduction in processing time, which will be particularly important in the field. Looking at the before/after image pair in Figure 3 gives one an idea of how an x-ray may be improved to aid diagnosis.
4
Conclusion
Using a SV-PSF can give much improved resolution when performing iterative deconvolution on images. By subdividing images into many sub-images, each with its own PSF, iterative deconvolution can be carried out efficiently in parallel. iMPaIR is designed to be a general, portable image restoration tool and is implemented using the MPI library. Currently we are adding a Maximum Entropy routine to the code, as it is known to produce better results in some applications. We are also in the process of conducting a survey of different optimization strategies known for these methods. This will include, among other things, investigating different parameters associated with wavelet regularisation. The code has been released under the GNU General Public License [7]. A copy of the code can be downloaded from http://moby.th.ic.ac.uk/.
Acknowledgments
The authors would like to acknowledge the support of the European Commission through grant number HPRI-1999-CT-00026 (the TRACS Programme at EPCC), and Enterprise Ireland for supporting the Irish Computational Grid and their ASTI programme for support of medical imaging.
References
1. W. H. Richardson. Bayesian based iterative method of image restoration. J. Opt. Soc. Am., 62:55, 1972.
2. L. B. Lucy. An iterative technique for the rectification of observed distributions. AJ, 75:745, 1974.
3. T. O'Doherty, P. Abbott, A. Shearer and W. van der Putten. Image deconvolution as an aid to mammographic artefact identification I: basic techniques. Proc. SPIE 3661, pages 698-709, 1999.
4. J. L. Starck, F. Murtagh. Image restoration with noise suppression using the wavelet transform. A&A, 288:342, 1994.
5. I. Daubechies. Ten lectures on wavelets. SIAM, 1992.
6. D. L. Donoho. De-noising by soft thresholding. IEEE Transactions on Information Theory, 41:613, 1995.
7. GNU. http://www.gnu.org.
ASYNCHRONOUS ALGORITHMS FOR PROBLEM OF RECONSTRUCTION FROM TOTAL IMAGE
N. M. GUBARENI
Technical University of Czestochowa, Institute of Econometrics & Computer Science, 42-200 Czestochowa, Poland
E-mail: [email protected]

In this paper we propose new asynchronous and parallel algorithms for the problem of reconstruction from a total image. They may be realized on a massively parallel computing system consisting of independent elementary processors and a central processor. A computer simulation of these algorithms was conducted for the problem of image reconstruction from interferograms. Some experimental results of the computer simulation, comparing the error estimates and the rates of convergence of these algorithms, are presented.
1 Introduction
The problem of reconstructing the inaccessible internal structure of an object by means of data called projections is very important in many practical applications of science and technology. Projection data, or projections for short, for a function f(x,y) that has to be reconstructed are represented mathematically as line integrals of f(x,y) along the lines l = x cos θ + y sin θ with parameters l and θ:

$$p(l,\theta) = \int_L f(x,y)\,dl = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} f(x,y)\,\delta(l - x\cos\theta - y\sin\theta)\,dx\,dy, \qquad (1)$$
where the limits of integration in the first integral depend, in general, on l and θ, and dl means integration along the line L. If the number of obtained projections is large enough (in medicine, for example), then for solving integral equation (1) it is preferable to use analytical methods based on the inverse Radon transform. However, in some scientific fields (such as optical research, geophysics and the investigation of plasma) the number of projection data is, as a rule, very limited. In these cases we usually construct a full discretized model and use iterative algorithms [2]. In many practical applications, instead of the projections p(l,θ), the initial data are a set of other functions depending on p(l,θ). For example, in the problems of optical interferometric tomography, for the reconstruction of the change of the spatial refractive index distribution in a chosen cross section of a phase object from a limited number of projections, the initial data are a set of
interferograms, which represent linear combinations of projections:

$$g_N(x,y) = \sum_{j=1}^{N} p_j(x\cos\theta_j + y\sin\theta_j), \qquad (2)$$
which means that the value at each point (x,y) of the reconstruction is determined by the sum of all projections passing through this point. We shall call this function a total (or summary) image. In the case of a full discretized model of the total image, the value in the j-th pixel represents the sum of all projections passing through this pixel. In this paper our problem is to find the best approximation of the function f(x,y) by means of the given function g_N(x,y), connected by equations (1) and (2), in the case when the number of directions N is very small (N < 6). For solving this problem we construct the full discretized model and use the iterative parallel algorithms PSUM and PMSUM. We also consider an asynchronous implementation of these algorithms on an asynchronous massively parallel computing system (MPCS) and the corresponding asynchronous parallel algorithms for the obtained model. In order to compare these algorithms and evaluate them from the point of view of their rate of convergence and accuracy, we conducted computer simulations on a uniprocessor for a number of model tomograms. Some of these experimental results are presented in this paper. These experiments show that, for the problem of reconstruction from a total image with a very limited number of projections, it is more effective to use the asynchronous algorithm AMSUM.
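As an illustration of the quantities involved, the sketch below builds projection data and the corresponding total image for a discretised phantom on the unit square. It is only a minimal model of equations (1) and (2) under assumptions of our own (parallel-beam geometry, pixel-centre binning); all names are illustrative.

```python
import numpy as np

def projections(f, thetas, n_bins):
    """Parallel-beam projections p_j(l) of a pixel image f for the angles thetas.

    Each pixel value is accumulated into the detector bin whose coordinate
    l = x*cos(theta) + y*sin(theta) contains the pixel centre."""
    n = f.shape[0]
    xs = np.linspace(-1, 1, n)                       # pixel centres on [-1, 1]^2
    X, Y = np.meshgrid(xs, xs, indexing="ij")
    edges = np.linspace(-np.sqrt(2), np.sqrt(2), n_bins + 1)
    p = np.zeros((len(thetas), n_bins))
    for j, t in enumerate(thetas):
        ell = X * np.cos(t) + Y * np.sin(t)
        idx = np.clip(np.digitize(ell.ravel(), edges) - 1, 0, n_bins - 1)
        np.add.at(p[j], idx, f.ravel())              # sum pixel values per bin
    return p, edges

def total_image(p, edges, thetas, n):
    """Total (summary) image g_N(x, y): at every pixel, sum the projections
    whose integration line passes through that pixel (cf. equation (2))."""
    xs = np.linspace(-1, 1, n)
    X, Y = np.meshgrid(xs, xs, indexing="ij")
    g = np.zeros((n, n))
    n_bins = p.shape[1]
    for j, t in enumerate(thetas):
        ell = X * np.cos(t) + Y * np.sin(t)
        idx = np.clip(np.digitize(ell, edges) - 1, 0, n_bins - 1)
        g += p[j][idx]
    return g
```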
2 Parallel algorithms PSUM and PMSUM
One approach to solving our problem is to discretize the functions involved and solve the resulting system of linear algebraic equations. For this purpose we construct the full discretized model. Let a function f(x,y) be defined in some domain D ⊂ R². First we introduce a Cartesian grid of square picture elements, called pixels, covering the domain D. Then we number the pixels in some agreed manner from 1 to n. We assume that the function f(x,y) has a constant value x_j throughout the j-th pixel for j = 1, 2, ..., n. Sources and detectors that transmit and receive energy are assumed to be points, and the rays between them are assumed to be lines. Secondly, in each set of projections obtained at the same angle θ_j we choose a finite number M of registered rays. We then denote by a_ij (i = 1, 2, ..., m; j = 1, 2, ..., n; m = N x M) the length of the intersection of the i-th ray with the j-th pixel. We
discretize the summary image function g(x,y) in the following way:

$$g_N(x,y) = \begin{cases} g_i, & \text{if } (x,y) \in \text{the } i\text{-th pixel} \\ 0, & \text{otherwise.} \end{cases} \qquad (3)$$
We denote by S_i the set of indices of all rays crossing the i-th pixel. Then we have

$$g_i = g_i^N = \sum_{j \in S_i} p_j, \qquad (4)$$
where p_j is the projection along the j-th ray. By S = (s_ij) we denote the n x m matrix whose elements are defined as follows:

$$s_{ij} = \begin{cases} 1, & \text{if } j \in S_i \\ 0, & \text{otherwise,} \end{cases}$$
and B = SA = (b_ij) ∈ R^{n x n}. As a result we obtain the full discretized model of reconstruction from a total image in the following matrix form:

$$Bx = g, \qquad (5)$$
where g = (g_i) ∈ R^n, x = (x_j) ∈ R^n is the image vector and B = (b_ij) ∈ R^{n x n}. In this section we consider parallel iterative algebraic algorithms for solving system (5). Let us introduce the necessary notation:

$$Q_i x = x + \frac{g_i - (\mathbf{b}^i, x)}{(\mathbf{b}^i, \mathbf{b}^i)}\,\mathbf{b}^i \qquad (6)$$

is a projection operator (b^i denotes the i-th row of B), and

$$Q_i^{\omega} = (1-\omega)I + \omega Q_i \qquad (7)$$
is a relaxed projection operator (here (·,·) denotes the inner product in R^n). Then we can consider the following parallel algorithm.

Algorithm 1 (PSUM)
1. x^(0) is arbitrary;
2. Calculate the (k+1)-th iteration by

$$y_i = Q_i^{\omega_k} x^{k}, \quad i = 1, 2, \ldots, n, \qquad (8)$$
$$x^{k+1} = C\,T(y_1, y_2, \ldots, y_n), \qquad (9)$$
where Q_i^{ω_k} are operators defined by (7), and C is a constraining operator of the following form:

$$(Cx)_i = \begin{cases} a, & \text{if } x_i < a \\ x_i, & \text{if } a \le x_i \le b \\ b, & \text{if } x_i > b, \end{cases} \qquad (10)$$
and

$$T(y_1, y_2, \ldots, y_n) = \sum_{i=1}^{n} B_i y_i, \qquad (11)$$
where B_i are matrices of dimension n x n with real nonnegative elements and

$$\sum_{i=1}^{n} B_i = E, \qquad (12)$$
where E is the unit matrix. Depending on the choice of the matrices B_i we obtain different realizations of Algorithm 1. Since for our problem x > 0, g_i > 0 and b^i > 0 for i = 1, 2, ..., n, we can also construct the following parallel multiplicative algorithm.

Algorithm 2 (PMSUM)
1. x^(0) ∈ R^n and x^(0) > 0.
2. Calculate the (k+1)-th iteration by

$$x_j^{(k+1)} = x_j^{(k)} \prod_{i=1}^{n} \left(y_i^{(k)}\right)^{\gamma_{ij}}, \qquad (13)$$

where

$$y_i^{(k)} = \frac{g_i}{(\mathbf{b}^i, x^{(k)})}, \quad i = 1, 2, \ldots, n, \qquad (14)$$

and γ_ij are positive real numbers such that

$$0 < \sum_{i=1}^{n} \gamma_{ij} \le 1 \quad \text{for all } j, k. \qquad (15)$$
These algorithms can be realized on a parallel computing structure consisting of n elementary processors and one central processor. In each (k+1)-th iteration step every i-th elementary processor computes the coordinates of the vector y_i in accordance with formula (8) (or (14)), and then the central processor computes the (k+1)-th iterate of the image vector x in accordance with formula (9) (or (13)).
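A minimal serial sketch of one synchronous PSUM sweep is given below, assuming the hyperplane-projection form of Q_i reconstructed in (6) and the simple choice B_i = E/n allowed by (12); the names and defaults are ours, not the authors'.

```python
import numpy as np

def psum_step(B, g, x, omega=1.0, a=0.0, b=np.inf):
    """One synchronous PSUM iteration, following equations (8)-(9).

    B is the matrix of system (5) (its rows are the vectors b^i) and g the
    right-hand side.  Each "elementary processor" i applies the relaxed
    projection Q_i^omega to the current image vector x; the "central
    processor" then averages the results (B_i = E/n) and applies the
    constraining operator C, i.e. clips the result to [a, b]."""
    n = B.shape[0]
    Y = np.empty((n, n))
    for i in range(n):                        # done in parallel on the MPCS
        bi = B[i]
        residual = g[i] - bi @ x
        Y[i] = x + omega * residual / (bi @ bi) * bi   # y_i = Q_i^omega x
    x_new = Y.mean(axis=0)                    # T(y_1,...,y_n) with B_i = E/n
    return np.clip(x_new, a, b)               # constraining operator C
```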
3 Asynchronous algorithms ASUM and AMSUM
In this section we consider an asynchronous implementation of the parallel algorithms PSUM and PMSUM on a nonsynchronous parallel computing structure and the corresponding asynchronous parallel algorithms ASUM and AMSUM for the obtained model. Let the algorithms PSUM and PMSUM be executed on an MPCS consisting of n independent processors connected with a central processor, where n is equal to the number of unknowns in system (5). In such a computing system each i-th elementary processor executes its calculations independently, at its own pace, in accordance with formula (8) (or (14)). It then sends the obtained results to the central processor and loads from it a new value of the image vector as its initial data. The central processor has its own local memory and executes its calculations in accordance with formula (9) (or (13)) without waiting for the full update of all processors, storing the computed result in its local memory. In our model we allow several processors to end their calculations at the same time, so the same components of the image vector may be updated concurrently. As a result we obtain chaotic algorithms that differ from the parallel algorithms PSUM and PMSUM. In order to describe the algorithms obtained from our model we use the notions of a sequence of chaotic sets and a delay sequence [1,3]. For our model the sequence of chaotic sets has a simple interpretation: it sets the time diagram of work for each independent processor during nonsynchronous operation of the parallel computing system. Thus the subset I_k is the set of the numbers of those processors which access the central processor at the same time. In this notation our model of the asynchronous implementation of the algorithm PSUM may be represented in the following form.

Algorithm 3 (ASUM)
1. x^(0) is arbitrary.
2. Calculate the (k+1)-th iteration by

$$y_{k,i} = \begin{cases} Q_i^{\omega_k}\, x^{(d_i(k))}, & \text{if } i \in I_k \\ y_{k-1,i}, & \text{otherwise,} \end{cases} \qquad (16)$$

$$x^{k+1} = C\Big(\alpha_k x^{k} + (1-\alpha_k)\sum_{i \in I_k} B_i\, y_{k,i}\Big), \qquad (17)$$

where I = {I_k} is a sequence of chaotic sets, I_k ⊂ {1, 2, ..., n}; {d_i(k)} is a delay sequence; α_k are real
numbers, and B_i are matrices of dimension n x n with real nonnegative elements satisfying (12).

The asynchronous implementation of the algorithm PMSUM may be represented in the following form.

Algorithm 4 (AMSUM)
1. x^(0) ∈ R^n and x^(0) > 0.
2. Calculate the (k+1)-th iteration by

$$x_j^{(k+1)} = x_j^{(k)} \prod_{i \in I_k} \left(y_i^{(k)}\right)^{\gamma_{ij}},$$

where

$$y_i^{(k)} = \frac{g_i}{(\mathbf{b}^i, x^{(k)})}, \quad i \in I_k, \qquad (18)$$

γ_ij are positive real numbers such that 0 < Σ_{i∈I_k} γ_ij ≤ 1 for all j, k, and I = {I_k} is a sequence of chaotic sets, I_k ⊂ {1, 2, ..., n}.
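The following sketch imitates the chaotic-set behaviour of AMSUM on a single machine: at each step a random subset I_k of elementary processors reports its ratios (18), and the multiplicative correction is applied only for those rows. The exponent choice and all names are our own illustrative assumptions, not the paper's.

```python
import numpy as np

def amsum_simulate(B, g, n_steps=200, rng=None):
    """Sequential simulation of the asynchronous multiplicative algorithm.

    Assumes every row of B has at least one positive entry.  The exponents
    gamma_ij are chosen proportional to b_ij with unit column sums, so that
    sum_{i in I_k} gamma_ij <= 1 holds for any chaotic set I_k."""
    rng = np.random.default_rng() if rng is None else rng
    n = B.shape[0]
    x = np.ones(n)                                     # x^(0) > 0
    gamma = B / np.maximum(B.sum(axis=0, keepdims=True), 1e-12)
    for _ in range(n_steps):
        I_k = np.flatnonzero(rng.random(n) < 0.3)      # processors finishing now
        if I_k.size == 0:
            continue
        y = g[I_k] / (B[I_k] @ x)                      # ratios of equation (18)
        x = x * np.prod(y[:, None] ** gamma[I_k], axis=0)   # multiplicative update
    return x
```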
4 Computer simulation and experimental results
The asynchronous parallel algorithms ASUM and AMSUM were implemented in a simulated parallel environment on a sequential machine. Calculations of all elementary processors that could run in parallel were timed together and their results were stored in temporary arrays. For the synchronous parallel algorithms the results of all such calculations were written into global arrays when all computations were completed. For the asynchronous parallel algorithms we had to specify the time diagram of work for each independent processor. According to this diagram each processor, in each step of its local iteration, had its own execution time, which was a random integer value and could change between local iterations. Every time a group of processors ended its calculations, the results were written into the global arrays. The iteration counter was increased as soon as all computations relating to all independent processors were completed, in other words when each processor had executed its local iteration at least once. In order to imitate the work of such an asynchronous parallel computing system we built the sequence of chaotic sets and the delay sequences as in [4].

Figure 1: Surfaces of the function f_2(x,y): a) the initial function; b) the AMSUM reconstruction.

For the computing experiments comparing the phantom data with their reconstructions by the algorithms PMSUM and AMSUM, the following functions were chosen:

$$f_1(x,y) = \sum_{i=1}^{k} c_i \exp\{-a_i^2[(x-x_{0i})\cos\varphi_i - (y-y_{0i})\sin\varphi_i]^2 - b_i^2[(x-x_{0i})\sin\varphi_i + (y-y_{0i})\cos\varphi_i]^2\},$$

$$f_2(x,y) = \begin{cases} 1, & \text{if } (x-x_0)^2 + (y-y_0)^2 \le r^2 \\ 0, & \text{otherwise,} \end{cases}$$
for which the projection data are easy to calculate. These models represent the most typical measurement situations in plasma (emission) tomography. In our experiments we used these functions with different parameters. In this paper we present results for f_1(x,y) with parameters k = 1, a_1 = 9.5, b_1 = 4.5, c_1 = 5, φ_1 = 0, x_01 = 0, y_01 = -0.5, and for f_2(x,y) with parameters a_1 = 1, r_1 = 0.08, x_1 = 0.2, y_1 = 0.2, a_2 = 0.5, r_2 = 0.16, x_2 = -0.2, y_2 = -0.2. We also have n = 1024, and N = 4 for f_1(x,y) and N = 3 for f_2(x,y). The convergence characteristics are given as graphs of the following numerical measures: δ_1, the root-mean-square error, and δ_5, the normalized absolute error. The results of the computer simulation are shown in Figure 1 and Figure 2.
Figure 2: Dependence of a) δ_1 and b) δ_5 on the number of iterations for the image reconstruction of the function f_2(x,y) by the algorithms PMSUM (curve 1) and AMSUM (curve 2).
5 Conclusion
The aim of this paper was the elaboration and comparison of new parallel and asynchronous algorithms for problems of reconstruction from a total image. We discussed their potential implementation on an asynchronous parallel computing system. It was shown that the asynchronous algorithms improve performance by an order of magnitude over the parallel versions of these algorithms. The computer experiments showed that for image reconstruction from interferograms with a very limited number of projections it is preferable to use the algorithm AMSUM.

References
1. D. P. Bertsekas, J. N. Tsitsiklis, Some aspects of parallel and distributed algorithms - a survey. Automatica, 27(1) (1991) 3-21.
2. Y. Censor, S. A. Zenios, Parallel Optimization. Oxford University Press, 1997.
3. D. Chazan, W. Miranker, Chaotic relaxation. Linear Algebra Appl. 2 (1969) 199-222.
4. N. Gubareny, A. Katkov, J. Szopa, Parallel Asynchronous Team Algorithm for Image Reconstruction. In Proc. of the 15th IMACS World Congress on Sc. Comp., Model. and App. Math. (Berlin, 1997), 553-558.
PARALLEL FLOOD MODELING*

L. HLUCHY, G. T. NGUYEN, L. HALADA, V. D. TRAN
Institute of Informatics, SAS, Dubravska cesta 9, 84237 Bratislava, Slovakia

In this paper we present the backbone of parallel flood simulation, which often leads to solving large sparse systems of partial differential equations. Parallel simulation is essential for obtaining satisfactory accuracy without the CPU time consumed by sequential methods. Our experimental results for parallel numerical solutions for flood simulation were obtained on a Linux cluster using the cyclic reduction method. We also provide experimental results obtained on a DSM SGI Origin2000 for comparison. Our measurements show that a Linux cluster can provide satisfactory power for parallel numerical solutions, especially for large problems.
1 Introduction
Recently, floods have caused widespread damage around the world. Therefore, modeling and simulation of floods, in order to forecast them and to take the necessary preventive measures, is very important. The backbone of the flood simulation problem is flood numerical modeling, which requires an appropriate physical model and robust numerical schemes for a good representation of reality. Flood simulation systems such as MIKE21 [1] and SMS [2] consist of a graphical user interface (GUI) for pre-processing and post-processing, and computational modules. The GUI is used for reading terrain maps and defining initial conditions, boundary conditions and other parameters needed for the simulation. All the data are passed to the computational modules that perform the simulation based on mathematical models and their numerical solutions. The output data from the computational modules are sent back to the GUI for visualization, animation and further processing. The computational modules are the only computation-expensive part of flood simulation systems. They typically take several days of CPU time for the simulation of large models. For critical situations, e.g. when a coming flood is simulated in order to predict which area will be threatened and to make the necessary prevention, such a long time is unacceptable. Numerical methods for flood simulations are generally based on finite difference or finite element space discretisation, which leads to solving large-size problems. Therefore, using high-performance computing and networking platforms to reduce the computational time [3] of flood simulation is very important and imperative [9]. This can drastically reduce the computational time, which allows simulating larger problems and consequently provides results that are more reliable. * This work is a part of the EU 5FP ANFAS (datA fusioN for Flood Analysis and decision Support) IST-1999-11676 RTD and is supported by the Slovak Scientific Grant Agency within Research Project No. 2/7186/20.
The contribution of this paper is to present parallel numerical methods for flood simulations and to study the behavior of a parallel programming model on a distributed memory cluster [11] and on a distributed shared memory machine [10] with a suitable application. The rest of the paper is organized as follows. Section 2 introduces the mathematical model of flood simulation problems. Section 3 presents the numerical methods to be used. Section 4 is concerned with the possibilities of parallel implementation, a view into our Linux cluster and the SGI Origin2000, and numerical experiments and measurements. Section 5 summarizes with some conclusions.
2 Environmental Motivations and Mathematical Models
A review of the literature shows that many 2D depth-averaged numerical models have been developed and applied to free-surface flow problems. Although a depth-averaged model may suffer from some limitations of physical interpretation, e.g. invalid governing equations near a bore with sharp curvature, the computed results may still be used in a simulation model to predict the evolution of a river flood. These partial differential equations are nonlinear in the variables ζ, p, q, where ζ is the surface elevation and p, q are discharges [1][2]. These variables are the most convenient for defining appropriate boundary conditions. For convection-dominated open channel flows, such as flow affected by tidal flow and dam-break flow, numerical difficulties and inadequately simulated flow may occur due to inadequate treatment of the nonlinear convective terms. However, the usefulness of such models has been demonstrated on a number of examples. One of the 2D depth-averaged flow models is MIKE21 [1], which solves the mass equation and momentum equation by a finite difference method in the space-time domain using the Alternating Direction Implicit technique. The equation matrices that result for each individual grid line are resolved by the Double Sweep algorithm, i.e. the equations are solved in one-dimensional sweeps, alternating between the x and y directions. In the x-sweep, the mass and momentum equations are solved taking ζ from n to (n+1/2) and p from n to (n+1) in the time dimension. For the terms involving q, the two levels of old, known values are used, i.e. (n-1/2) and (n+1/2). In the y-sweep the mass and momentum equations are solved taking ζ from (n+1/2) to (n+1) and q from (n+1/2) to (n+3/2), while the terms in p use the values just calculated in the x-sweep at n and (n+1). The mass and momentum equations expressed in a one-dimensional sweep for a sequence of grid points in a line thus lead to the following coupled linear systems of equations:

$$A_{j,k}\,p_{j-1,k}^{(n+1)} + B_{j,k}\,\zeta_{j,k}^{(n+1/2)} + C_{j,k}\,p_{j,k}^{(n+1)} = D_{j,k}, \quad j = 1, 2, \ldots, J,$$
$$a_{j,k}\,\zeta_{j,k}^{(n+1/2)} + b_{j,k}\,p_{j,k}^{(n+1)} + c_{j,k}\,\zeta_{j+1,k}^{(n+1/2)} = d_{j,k}, \quad j = 1, 2, \ldots, J. \qquad (1)$$

Equations (1) yield a tridiagonal system of linear equations Ax = y. In general, if there are K such lines, we need to solve such a system K times with different coefficients of the matrix A and vector y. Naturally, this part of the computation is time consuming and is suitable for parallelization.
3 Numerical Solutions
We consider a linear system Ax = y of size n, with A ∈ R^{n x n} and a given vector y ∈ R^n. For a tridiagonal matrix A, we denote the nonzero values of the matrix as follows:

$$A = \begin{pmatrix} b_1 & c_1 & & \\ a_2 & b_2 & c_2 & \\ & \ddots & \ddots & \ddots \\ & & a_n & b_n \end{pmatrix}. \qquad (2)$$
The computation of Gaussian elimination (GEM) for a tridiagonal matrix is linear in n, but it does not offer any possibility for parallel computation: Gaussian elimination is a purely sequential algorithm. In the next section, we describe an algorithm which is suitable for parallel computation [7].
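For reference, the sequential baseline looks as follows; this is a standard Thomas-algorithm sketch in the notation of (2), not code from the paper.

```python
import numpy as np

def thomas(a, b, c, y):
    """Gaussian elimination (Thomas algorithm) for a tridiagonal system.

    a[i], b[i], c[i] are the sub-, main- and super-diagonal entries of row i
    (a[0] and c[-1] are unused).  The cost is O(n), but each elimination step
    depends on the previous one, so the method is inherently sequential."""
    n = len(b)
    bp = np.asarray(b, dtype=float).copy()
    yp = np.asarray(y, dtype=float).copy()
    for i in range(1, n):                 # forward elimination
        w = a[i] / bp[i - 1]
        bp[i] -= w * c[i - 1]
        yp[i] -= w * yp[i - 1]
    x = np.empty(n)
    x[-1] = yp[-1] / bp[-1]
    for i in range(n - 2, -1, -1):        # back substitution
        x[i] = (yp[i] - c[i] * x[i + 1]) / bp[i]
    return x
```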
3.1 Recursive Doubling
Recursive doubling [4] is performed as follows: we write the matrix A as A = B + I + C, where the diagonal of A is normalized to I, B is the strictly lower diagonal part and C is the strictly upper diagonal part. After multiplying the matrix A by the matrix (-B + I - C) we obtain a matrix with three nonzero diagonals again, but now the nonzero off-diagonal entries are in positions (i, j) with |i - j| = 2, i.e. their distance to the main diagonal is doubled. We describe the first step in more detail for three neighboring equations i-1, i, i+1:

$$a_{i-1}x_{i-2} + b_{i-1}x_{i-1} + c_{i-1}x_{i} = y_{i-1},$$
$$a_{i}x_{i-1} + b_{i}x_{i} + c_{i}x_{i+1} = y_{i}, \qquad (3)$$
$$a_{i+1}x_{i} + b_{i+1}x_{i+1} + c_{i+1}x_{i+2} = y_{i+1}.$$
Equation (i-1) is used to eliminate x_{i-1} from the i-th equation and equation (i+1) is used to eliminate x_{i+1} from the i-th equation. The new equation is

$$a_i^{(1)}x_{i-2} + b_i^{(1)}x_{i} + c_i^{(1)}x_{i+2} = y_i^{(1)} \qquad (4)$$

with coefficients

$$\alpha_i = -\frac{a_i}{b_{i-1}}, \quad \beta_i = -\frac{c_i}{b_{i+1}}, \quad a_i^{(1)} = \alpha_i a_{i-1}, \quad b_i^{(1)} = b_i + \alpha_i c_{i-1} + \beta_i a_{i+1}, \quad c_i^{(1)} = \beta_i c_{i+1}, \quad y_i^{(1)} = y_i + \alpha_i y_{i-1} + \beta_i y_{i+1}. \qquad (5)$$
After N = ⌊log2(n)⌋ steps, there is only one main diagonal left and we can compute, for i = 1, 2, ..., n,

$$x_i^{(N)} = y_i^{(N)} / b_i^{(N)}. \qquad (6)$$

3.2 Cyclic Reduction
The cyclic reduction [4] algorithm is a modification of the recursive doubling algorithm which avoids redundant computation by computing only the needed values a_i^(k), b_i^(k), c_i^(k), y_i^(k). The entry in position (2i, 2i-1) is eliminated using row (2i-1), and the entry (2i, 2i+1) is eliminated using row (2i+1). This leads to fill-in in positions (2i, 2i-2) and (2i, 2i+2). It also eliminates all odd-numbered unknowns, so that the even-numbered unknowns are coupled only to each other. The algorithm runs in two steps:

Step 1: For k = 1, 2, ..., ⌊log2(n)⌋ compute a_i^(k), b_i^(k), c_i^(k), y_i^(k) with i = 2^k, ..., n and step 2^k by (5). In step k = ⌊log2(n)⌋ there is only one equation left, for i = 2^N with N = ⌊log2(n)⌋.

Step 2: For k = ⌊log2(n)⌋, ..., 1, 0 compute x_i by the formula

$$x_i = \frac{y_i^{(k)} - a_i^{(k)}x_{i-2^k} - c_i^{(k)}x_{i+2^k}}{b_i^{(k)}}. \qquad (7)$$
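A compact serial sketch of the two steps is given below. It assumes n is a power of two purely to keep the index arithmetic short, and the zero-padding trick and all names are our own; the general case with ⌊log2(n)⌋ levels follows the same pattern.

```python
import numpy as np

def cyclic_reduction(a, b, c, y):
    """Sequential cyclic reduction for a tridiagonal system (1-based rows as
    in the text).  a[0] and c[-1] are unused; n must be a power of two here."""
    n = len(b)
    m = n.bit_length() - 1
    # 1-based arrays padded with zeros (ones on the diagonal) so that
    # out-of-range neighbours simply contribute nothing
    A = np.zeros(2 * n + 2); B = np.ones(2 * n + 2)
    C = np.zeros(2 * n + 2); Y = np.zeros(2 * n + 2)
    A[1:n + 1], B[1:n + 1], C[1:n + 1], Y[1:n + 1] = a, b, c, y

    # Step 1: elimination; coupling distance doubles at every level (eq. (5))
    for k in range(1, m + 1):
        h = 2 ** (k - 1)
        for i in range(2 ** k, n + 1, 2 ** k):
            alpha = -A[i] / B[i - h]
            beta = -C[i] / B[i + h]
            A[i] = alpha * A[i - h]
            C[i] = beta * C[i + h]
            Y[i] += alpha * Y[i - h] + beta * Y[i + h]
            B[i] += alpha * C[i - h] + beta * A[i + h]

    # Step 2: back substitution via equation (7)
    x = np.zeros(2 * n + 2)
    for k in range(m, -1, -1):
        step = 2 ** k
        for i in range(step, n + 1, 2 * step):
            x[i] = (Y[i] - A[i] * x[i - step] - C[i] * x[i + step]) / B[i]
    return x[1:n + 1]
```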
3.3 Cyclic Reduction versus Recursive Doubling
Fig. 1. Cyclic reduction versus recursive doubling: computation time as a function of matrix size.
Recursive doubling offers ample opportunity for parallel computation, at the price of a large amount of redundant operations. Cyclic reduction computes only the entries necessary for x^(N) with N = ⌊log2(n)⌋ but still provides potential parallelism. The sequential computational runtime of the recursive doubling algorithm is O(n log2 n), which is larger than that of Gaussian elimination. The sequential computational runtime of the cyclic reduction algorithm is O(n), the same as the runtime of Gaussian elimination (GEM). The advantage of both recursive doubling and
cyclic reduction is their large potential for parallel computation [4][6]. The computation times of cyclic reduction and recursive doubling are shown in Fig. 1. Both recursive doubling and cyclic reduction can be generalized to banded matrices (block size > 1) by applying the described computational steps to blocks instead of single elements and using matrix operations instead of scalar operations.
4 Parallel Implementations
Both recursive doubling and cyclic reduction are classified as fine-grained algorithms [4], which are suitable for shared memory computers [5]. In this section, we describe the results of our experiments on our PC Linux cluster and on an SGI Origin2000. The cyclic reduction algorithm was chosen because of its potential for parallelism and its good computational runtime.
4.1 A Linux Cluster and the SGI Origin2000
Our Linux cluster consists of eight Pentium III nodes under the Linux operating system. All the computers are interconnected by a 100 Mb/s Ethernet network through a 16-port switch. Each computing node has 512 MB of memory, so the whole system has 4 GB of memory. For comparison, we ran our experiments on the distributed shared memory SGI Origin2000 [10], which consists of 4 dual-processor nodes; all 8 processors are R10000, 250 MHz, with two 4 MB cache memories per node. The whole system has 2 GB of memory. For our experiments we use the MPI library MPICH [8], implemented for both Linux and the SGI Origin2000. On the SGI Origin2000, parallel applications run under the LSF batch system.
Fig. 2. Throughput on the Linux cluster and the Origin2000 as a function of package size (2^x bytes).
Fig. 2 shows the results of throughput experiments on both systems with various package sizes. The reason for the leveling off of throughput in the shared memory system is its bus. The bus is limited to a finite bandwidth and number of transactions that it can carry out per second. With high-performance microprocessors, the demands placed on the bus by the processors are large enough that the bus's capacity can be exhausted [10].
4.2 Parallel Cyclic Reduction
Cyclic reduction is a row-oriented algorithm: each processor deals with its own data, i.e. a certain number of rows. Communication is required only for boundary values and only when necessary. We assume the matrix size is n and the number of processors is p. The parallel algorithm is performed as follows:

Step 1: For k = 1, 2, ..., ⌊log2(n)⌋ each processor P_j (1 ≤ j ≤ p) computes a_i^(k), b_i^(k), c_i^(k), y_i^(k) for its ⌊n/p⌋ rows with step size 2^k using (5). If P_j needs values from another processor, it receives four data values from P_prev and/or P_next to finish its computation.

Step 2: For k = ⌊log2(n)⌋, ..., 1, 0 each processor P_j (1 ≤ j ≤ p) computes x_i for its own rows using (7), again exchanging the required boundary values with its neighbours when necessary.
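To make the data dependences explicit, the helper below lists, for a block row distribution with p dividing n, which remote rows (and hence which neighbouring processors) processor j needs at elimination level k. It is an illustrative sketch of the communication pattern only, not the authors' MPI code.

```python
def remote_rows_needed(n, p, j, k):
    """For a block row distribution with p | n, return the (row, owner)
    pairs that processor j (0-based) must receive at elimination level k.

    Rows are numbered 1..n as in the text; processor j owns rows
    j*n/p + 1 .. (j+1)*n/p.  At level k the active rows are the multiples
    of 2**k, and each needs its neighbours at distance 2**(k-1)."""
    block = n // p
    lo, hi = j * block + 1, (j + 1) * block        # inclusive 1-based range
    h = 2 ** (k - 1)
    needed = set()
    for i in range(lo, hi + 1):
        if i % (2 ** k) == 0:                      # row active at this level
            for nb in (i - h, i + h):
                if 1 <= nb <= n and not (lo <= nb <= hi):
                    needed.add(nb)
    return sorted((nb, (nb - 1) // block) for nb in needed)

# example: 16 rows on 4 processors, final level: processor 1 needs rows 4 and 12
print(remote_rows_needed(16, 4, 1, 3))
```

Each remote row contributes its four values a_i^(k-1), b_i^(k-1), c_i^(k-1), y_i^(k-1), matching the "four data values" exchanged with P_prev and/or P_next above.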
4.3 Experiments and Measurements
We have tested our MPI program on two different machines, the Linux cluster and the SGI Origin2000, with different input tridiagonal matrices. The message-passing program shows good results on both machines, in different ways.

Table 1. Execution times on the SGI Origin2000 versus number of processors.
On the shared memory SGI Origin2000, for small systems of fewer than 4000 equations, the execution time per processor is too short in comparison with the communication time (Table 1).

Table 2. Execution times on the Linux cluster versus number of processors.
On the distributed-memory Linux cluster (Table 2), the communication time is longer and a small speedup is reached only for systems with more than 40000 equations, i.e. 10 times larger than on the SGI Origin2000. For systems with 5x10^4 to 5x10^5 equations, the speedups on the SGI Origin2000 with six processors are quite good, approximately 4 to 5. The speedups go down as the size of the systems increases; the result is stable around 3.0 to 3.2 for systems with more than 10^6 equations (Fig. 3). We attribute such results to distributed shared memory effects and cache performance.
Fig. 3. Speedup on the SGI Origin2000 for systems of 50000, 100000 and 1000000 equations.
Fig. 4. Speedup on Linux cluster
On the Linux cluster the speedups grow almost linearly with the number of processors. With six processors and a system size of 10^5, the speedup is around 3.5. For larger systems with more than 10^6 equations, the speedup values are stable around 5.2 (Fig. 4).
Fig. 5 Speedup for banded matrices with block size=10 on Linux cluster
As a further point of interest, Fig. 5 shows the speedups of parallel cyclic reduction applied to banded matrices with block size 10 on the Linux cluster. If n is the matrix size and r is the block size, the sequential runtime of solving one sub-matrix of size r by, e.g., GEM is O(r^3), and O(n/r) such block solves are needed when solving the banded matrix by sequential cyclic reduction, so the whole sequential runtime is O(nr^2). The speedup of parallel cyclic reduction applied to banded matrices is very good and rises nearly linearly with the number of processors (Fig. 5).
5 Conclusion
In this paper, we have shown that the cyclic reduction method can be efficiently implemented on distributed memory systems, especially on Linux clusters. A comparison with a thread-based implementation shows that, for large systems of partial differential equations, the Linux cluster leads to better results with more reliable runtime behavior than the distributed shared memory machine. Conversely, the SGI Origin2000 shows better results for smaller systems of equations, for which the algorithm needs more communication per computed row than for large systems.
6 References
1. MIKE21: 2D engineering modeling tool for rivers, estuaries and coastal waters. http://www.dhisoftware.com/mike21/
2. SMS modeling package. http://www.bossintl.com/html/sms_overview.html
3. Tran V.D., Hluchy L., Nguyen G.T.: Parallel Program Model and Environment. ParCo'99, Imperial College Press, pp. 697-704, 1999, The Netherlands.
4. I. S. Duff, H. A. van der Vorst: Developments and Trends in the Parallel Solution of Linear Systems. Parallel Computing, Vol. 25, pp. 1931-1970, 1999.
5. A. Agarwal, D. A. Kranz, V. Natarajan: Automatic Partitioning of Parallel Loops and Data Arrays for Distributed Shared-Memory Multiprocessors. IEEE Trans. on Parallel and Distributed Systems, Vol. 6, No. 9, September 1995, pp. 943-962.
6. G. M. Megson, X. Chen: Automatic parallelization for a class of regular computations. World Scientific, 1997.
7. T. L. Freeman, C. Phillips: Parallel Numerical Algorithms. Prentice Hall, 1992.
8. MPICH: A Portable Implementation of MPI. http://www-unix.mcs.anl.gov/mpi/mpich/
9. Selim G. Akl: Parallel Computation Models and Methods. Prentice Hall, 1997.
10. Origin2000 Architecture. http://techpubs.sgi.com/library/manuals/3000/007-3511-001/html/O2000Tuning.1.html
11. Pfister G.F.: In Search of Clusters, Second Edition. Prentice Hall PTR, ISBN 0-13-899709-8, 1998.
THE Xyce™ PARALLEL ELECTRONIC SIMULATOR - AN OVERVIEW*

S. HUTCHINSON, E. KEITER, R. HOEKSTRA
Computational Sciences Department

H. WATTS
Microelectronics Validation Department

A. WATERS, T. RUSSO, R. SCHELLS, S. WIX, C. BOGDAN
Component Information & Models Department

Sandia National Laboratories, Albuquerque, NM 87185, USA
URL: http://www.cs.sandia.gov/Xyce/

The Xyce™ Parallel Electronic Simulator has been written to support the simulation needs of the Sandia National Laboratories electrical designers. As such, the development has focused on providing the capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). In addition, we are providing improved performance for numerical kernels using state-of-the-art algorithms, support for modeling circuit phenomena at a variety of abstraction levels, and object-oriented coding practices that ensure maintainability and extensibility far into the future. The code is a parallel code in the most general sense of the phrase - a message-passing parallel implementation - which allows it to run efficiently on the widest possible range of computing platforms. These include serial, shared-memory and distributed-memory parallel as well as heterogeneous platforms [1]. Furthermore, careful attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved even as the number of processors grows. Thus, it is able to solve circuit problems of unprecedented size. This paper summarizes the code architecture, some of the novel heuristics and algorithms used to address these large problems, and the parallel infrastructure.
1 Introduction
The Xyce Parallel Electronic Simulator under development at Sandia National Laboratories is aimed at supporting the laboratory's electrical designers as part of the U.S. Department of Energy's Accelerated Strategic Computing Initiative (ASCI). This initiative uses high performance computational simulation to help offset the lack of underground testing and is pushing the limits of scientific computing with a goal of reaching a 100 TFlops capability in the near future. As part of this initiative, the code is targeted at very large (~1000 processors) distributed-memory parallel computing platforms and will provide improvements over existing technology in several areas. This paper describes the current work underway in the analog simulator portion of the code. * Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under Contract DE-AC04-94AL85000.
In addition to providing support for the simulation of circuits of unprecedented size (up to several million analog devices) via large-scale parallel computing, novel approaches to critical numerical kernels are being implemented, such as improved time-stepping algorithms and controls, better nonlinear convergence, the use of parallel Krylov solvers and preconditioners, and improved device models. These improvements aim to minimize the amount of simulation "tuning" required on the part of the designer and facilitate the usage of the code. Another feature required by designers is the ability to easily add device models to the code, many of which are specific to the needs of Sandia. To this end, the device package in Xyce is designed to support a variety of device model inputs. These input formats include the typical, so-called analytical models, behavioral models, look-up tables, and support for conductance values extracted from device-scale PDE models. Combined with this flexible interface is an architectural design that greatly simplifies the addition of circuit models. This paper discusses these features and continues in the next section with a description of the overall code architecture, including support for the parallel circuit topology and load balance methods. Following this we discuss the parallel solution methods and conclude with some preliminary results and a summary of this work.
2 The Xyce™ Code Architecture
From the beginning, Xyce™ was designed not as simply a "parallelization" of an existing circuit simulation capability. Providing parallel capability to any simulation code involves more than parallel data structures. Overall design and, in particular, algorithms designed for parallel computing must be used if one hopes to achieve high parallel efficiency. To this end, modern code-design and leading-edge parallel algorithms were utilized from the outset. These include UML (Unified Modeling Language) tools for object-oriented design. Figure 1 illustrates the overall code architecture for the analog simulation kernels currently within Xyce™.
2.1 Parallel Circuit Topology
A key to maintainability, extensibility and efficiency in any circuit simulation code is the underlying representation of the problem topology. This is not only true for representing the actual network of the problem but also for associated data structures such as sparse matrix graphs, etc. Topology information is also vital for parallel solution methods wherein the problem is decomposed into subdomains. Here, topology information is used in heuristic approaches (i.e., graph-based decomposition methods such as those in ParMETIS [2] and Zoltan [3]) to obtaining good problem decompositions. With these needs in mind, the Xyce™ Simulator has been designed around a flexible graph subsystem that is used to describe the distributed circuit topology and other graphs used throughout the simulation. This
topology package is built on the GTL (Graph Template Library) [4] that provides ready access to advanced graph methods for ordering and sorting. Accompanying the ability to perform these functions on a distributed graph is the need to move data between processors and access needed data that may be owned by another processor. This is accomplished using the Zoltan library [3] that provides distributed load balancing and data migration utilities.
2.2 Parallel Circuit-Problem Decomposition
One method for performing calculations in parallel is to decompose the problem domain into subdomains and assign these to individual nodes of the parallel computer. A variety of methods may be used to perform this decomposition; they usually ensure that two issues are resolved: 1) each processor has an equal amount of computational work (i.e., is load balanced) and 2) communication costs between processors are minimized. However, for circuit simulation a complicating factor is that both the calculations and the communication patterns (dictated by circuit topology) can be very heterogeneous. That is, at the level at which the problem is discretized, the resulting "elements" (e.g., lumped parameter models) are heterogeneous. For example, in analog circuit simulations, the computational complexity of the circuit devices can vary by over two orders of magnitude. Furthermore, when solutions are desired at differing levels of fidelity (e.g., mixed-signal), the problem is, by definition, heterogeneous and to a much larger degree. This is typically not a problem for conventional serial algorithms but can have disastrous effects on efficiency when computing in parallel. Within the Xyce™ Simulator topology manager, we use weighted graphs and leading-edge graph decomposition heuristics to ensure load balance and minimal communication costs. Another issue that must be addressed for distributed memory parallel implementations is how one deals with the data dependencies between processor nodes. Currently in Xyce™, device "ghosting", similar to node or element "ghosting" used in distributed PDE simulations, is used, since this method decreases communication by allowing all Jacobian and right-hand side vector loads to be computed locally and necessitating only small portions of the solution vector to be communicated. However, this method requires redundant computation of some of the Jacobian elements for devices that are needed by multiple processors to complete their respective Jacobian entries. Thus, a non-ghosting implementation is also underway, and it is likely that the best choice for handling parallel data dependencies will be problem- and/or machine-dependent.
3 Parallel Solvers
As mentioned in Section 2 above, a key to an efficient parallel implementation is providing good parallel numerical algorithms. Within the Xyce™ Simulator are state-of-the-art time integration, nonlinear- and linear-solution algorithms. Also being developed are several heuristic methods that couple problem-specific information into the numerical algorithms for improved convergence. The time/nonlinear/linear nested kernels have been designed with a philosophy that recognizes each of these individual solution methods as part of an integrated whole. Step-size, convergence and accuracy results for one method can impact these issues for the other methods. This is especially true since, as will be discussed below, iterative linear solvers are used to provide parallel scalability for large problems.
3.1 Transient Solution
Currently implemented for the time integration solver are methods based around the standard trapezoidal and Gear formulas, as well as infrastructure support for various decoupling approaches such as waveform relaxation and hierarchical decomposition. These are designed to solve the differential/algebraic system

$$F(t, u(t), u'(t)) = 0$$

for the time history of the solution vector u(t). In addition to these standard algorithms, variable-theta and second-order A-contractive methods are planned, which are methods designed explicitly for stiff circuit simulation problems. These algorithms are used with appropriate time-step selection and error estimation methods, which are tightly coupled with the underlying nonlinear solver to ensure solution accuracy and good overall convergence. For a further description of these algorithms, see [7].
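As a sketch of how one such formula reduces a time step to a nonlinear system, the code below applies the trapezoidal rule to F(t, u, u') = 0; the interfaces (including the `newton_solve` callback) are illustrative assumptions of ours, not Xyce's.

```python
def trapezoidal_step(F, t_n, u_n, du_n, h, newton_solve):
    """Advance the DAE F(t, u, u') = 0 by one trapezoidal step of size h.

    The derivative at the new time level is expressed through the
    trapezoidal relation u'_{n+1} = 2*(u_{n+1} - u_n)/h - u'_n, so that the
    step reduces to a nonlinear system in u_{n+1} alone, which is handed to
    the nonlinear solver (Section 3.2)."""
    t_next = t_n + h

    def residual(u_next):
        du_next = 2.0 * (u_next - u_n) / h - du_n
        return F(t_next, u_next, du_next)

    u_next = newton_solve(residual, u_n)         # previous solution as initial guess
    du_next = 2.0 * (u_next - u_n) / h - du_n    # recover u' for the next step
    return u_next, du_next
```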
3.2 Nonlinear Solution
The nonlinear corrector equations are solved at each time step to obtain u = u_n, using Newton's method (or some variant thereof) to "linearize" the equations. Specifically, an iterative procedure is applied in which

$$u^{(m+1)} = u^{(m)} + \delta^{(m)}$$

is used to correct the previous iterate, and the update δ^(m) is obtained by solving the linear system

$$M\,\delta^{(m)} = p^{(m)},$$

where M is the Jacobian matrix and p^(m) the corresponding residual. Notice that, unlike many SPICE-type solvers, we do not solve at the linear level for the solution itself but rather for a solution update. This is critical to obtaining accurate solutions and helps to improve the convergence behavior of the underlying Krylov-based iterative solvers, as these update-vector magnitudes are much smaller. As is done with some SPICE-type solvers, "optimization" methods can be implemented here, including solving the linear portion of the circuit separately from the nonlinear system, reforming the Jacobian only when its lack of accuracy causes convergence to suffer, etc. This latter approach falls within the broader category of inexact Newton methods. The idea behind these methods is that, since we solve only for the update as mentioned above, as long as the residual is being reduced to an appropriate tolerance, the accuracy of the linear solution does not impact the overall solution accuracy. This is especially true in the early steps of Newton's method, when one is far from the solution and the linear approximation to the solution space is itself very rough. This allows one to trade off the cost of highly accurate linear solutions against the convergence rate of Newton's method. In practice, this can reduce the overall computational cost dramatically. As mentioned above, another way we utilize inexact Newton methods is to link the convergence criteria of the Krylov iterative solvers (for the linear problem) to the convergence state of the Newton solver. That is, early in the Newton solve we relax the convergence tolerance on the iterative linear solvers and gradually tighten it as the Newton solver itself converges. This method has shown promising results in other fields where iterative solvers are used [5]. Lastly, the use of the "update" version of Newton's method has allowed us to implement a variety of global convergence enhancement methods, typically based on a backtracking framework. Here, at every Newton step, a convergence check is performed (e.g., residual norm reduction) to see if the solution has actually "improved". If not, then the update is scaled according to a specified algorithm until an acceptable update is found. This is often referred to as "damping" the Newton steps. Again, see [5] and the references therein for further information.
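The following sketch shows the overall shape of such an update-based, damped, inexact Newton loop. The solver interface, the finite-difference Jacobian and the tolerance heuristic are our own simplifications for illustration, not the Xyce implementation; `solve_linear` stands in for a preconditioned Krylov solve.

```python
import numpy as np

def damped_newton(residual, u0, solve_linear, max_iter=20, tol=1e-8):
    """Newton's method in "update" form with simple backtracking damping.

    solve_linear(J, rhs, rtol) approximately solves J*delta = rhs; its
    relative tolerance is loosened while the outer residual is large (an
    inexact Newton strategy) and tightened as convergence proceeds."""
    u = u0.copy()
    r = residual(u)
    r_norm0 = max(np.linalg.norm(r), 1e-30)
    for _ in range(max_iter):
        r_norm = np.linalg.norm(r)
        if r_norm <= tol * (1.0 + r_norm0):
            break
        J = finite_difference_jacobian(residual, u, r)
        rtol = min(0.1, r_norm / r_norm0)            # looser solves far from the solution
        delta = solve_linear(J, -r, rtol)
        lam = 1.0                                    # backtracking ("damping") loop
        while np.linalg.norm(residual(u + lam * delta)) >= r_norm and lam > 1e-4:
            lam *= 0.5
        u = u + lam * delta
        r = residual(u)
    return u

def finite_difference_jacobian(residual, u, r, eps=1e-7):
    """Dense finite-difference Jacobian; adequate only for small examples."""
    n = u.size
    J = np.empty((n, n))
    for j in range(n):
        du = np.zeros(n); du[j] = eps * (1.0 + abs(u[j]))
        J[:, j] = (residual(u + du) - r) / du[j]
    return J
```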
3.3 Linear Solution
Lastly we come to the solution of the linear subproblems generated by the nonlinear solver. Since Xyce™ is targeted at solving extremely large problems on parallel computers, iterative solvers based on preconditioned Krylov subspace methods are the primary algorithms used. These methods are used for their superior scaling (roughly n√n versus n²) as the problem size increases. Furthermore, they are much easier to implement in a message-passing paradigm than their sparse-direct counterparts. The Xyce™ Simulator uses the Trilinos solver library, an object-oriented suite of parallel linear services. These include support for parallel-distributed objects such as vectors and sparse matrices, associated operators, parallel linear solvers and preconditioners, and eigenvalue calculations. Trilinos is currently under development and is based on the Aztec library [6]. In the past, the use of iterative solvers for circuit simulation has been problematic. This is primarily due to their inability to handle, in a general way, the variety of matrix-condition values and sparsity patterns presented by circuit problems. However, we have developed specialized ordering and preconditioning approaches which help to solve the linear subproblems efficiently. Since there are many ordering, scaling, preconditioning and solution methods available within Trilinos, a discussion of the merits of each possible combination is beyond the scope of this paper. Instead, we give a simple example. The so-called "mem-plus" circuit, comprised of 7454 MOS transistors and 14274 capacitors, results in 32634 equations and 1037951412 nonzeroes using the original (SPICE) skyline storage scheme. When we attempted to solve this problem using the GMRES iterative solver with ILUT (thresholded, incomplete LU factorization) preconditioning, the ILUT factorization took over 8 hours and the iterative method never converged. When we reordered the matrix using a minimum degree strategy, the skyline storage resulted in only 205002 nonzeroes (a reduction by a factor of over 5000). The ILUT factorization then took only 11 seconds and the problem converged in 40 iterations. Thus it is clear that, at least for some classes of circuit problems, advanced solution methods can have a dramatic impact on the linear solution efficiency.
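The effect of reordering plus incomplete factorization can be reproduced in miniature with standard sparse tools. The sketch below uses SciPy's reverse Cuthill-McKee ordering and ILU in place of the minimum-degree ordering and ILUT preconditioner described above; it is purely illustrative and has no connection to the Trilinos/Aztec code used by Xyce.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla
from scipy.sparse.csgraph import reverse_cuthill_mckee

def solve_reordered_ilu_gmres(A, b):
    """Reorder a sparse matrix, build an incomplete-LU preconditioner and
    solve with GMRES (illustrative stand-in for min-degree + ILUT)."""
    A = sp.csr_matrix(A)
    perm = reverse_cuthill_mckee(A, symmetric_mode=False)
    A_p = A[perm, :][:, perm].tocsc()                 # symmetrically permuted matrix
    b_p = b[perm]
    ilu = spla.spilu(A_p, drop_tol=1e-4, fill_factor=10)   # ILUT-like factorization
    M = spla.LinearOperator(A_p.shape, matvec=ilu.solve)   # preconditioner
    x_p, info = spla.gmres(A_p, b_p, M=M)
    x = np.empty_like(x_p)
    x[perm] = x_p                                     # undo the permutation
    return x, info
```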
4 Preliminary Results
As the Xyce™ Simulator is in the very early stages of development, the results presented here are preliminary. Figure 2 illustrates the scaling performance of a transmission line circuit on an SGI Origin 2000. For this circuit (whose topology is linear in nature), achieving an efficient load balance is possible, as is evident in this figure for the scaled efficiency defined as
$$E(p, N) = \frac{\text{time for solution of problem of size } N \text{ on 1 processor}}{\text{time for solution of problem of size } Np \text{ on } p \text{ processors}}.$$
In this graph, where perfect scaling would yield a horizontal line at 100%, the efficiency for both the netlist-ordered and the randomly ordered circuit drops off as the problem is scaled. However, applying the PartKway partitioning method (ParMETIS/Zoltan) together with data migration improves the performance dramatically.
5 Summary
This paper presents an overview of the Xyce™ Parallel Electronic Simulator currently under development at Sandia National Laboratories. The code is targeted at meeting the electronic designer's simulation needs for system-scale simulations that may approach millions of analog devices. The code is written in C++ with an object-oriented design to ensure extensibility and long-term maintainability. It is fully parallel and uses message-passing communication (MPI), which allows it to run on a variety of architectures from single-processor workstations to shared- and distributed-memory machines. Furthermore, it is designed to be scalable up to very large computers having thousands of processors. In addition to its large-scale parallel capabilities, the Xyce™ Simulator has several additional unique attributes designed to improve overall performance, flexibility and usability. These include a parallel topological backbone, a modern and flexible device model interface, advanced load-balancing capabilities, and state-of-the-art numerical algorithms. Initial simulation runs on relatively modest problem sizes indicate good parallel scaling and overall performance. This is in spite of the fact that virtually no optimization has been done on the code and that very rudimentary load balancing methods were used. We expect that concerted efforts in the next year will provide enhanced performance through code optimization and application of the advanced load balancing and problem distribution capabilities available to the simulator.

References
1. Brightwell R., Fisk L., Greenberg D., Hudson T., Levenhagen M., Maccabe A. and Riesen R., Massively Parallel Computing Using Commodity Components. Parallel Computing, 26 (2000), pp. 243-266.
2. Karypis G. and Kumar V., A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20 (1998), pp. 359-392.
3. Devine K., Hendrickson B., St. John M., Boman M. and Vaughan C., Zoltan: A Dynamic Load-Balancing Library for Parallel Applications, User's Guide. Sandia National Labs. Tech. Rep. SAND99-1377, Albuquerque, NM, 1999.
4. Forster M., Pick A. and Raitner M., Graph Template Library (GTL). http://brahms.fmi.uni-passau.de/GTL/
5. Shadid J., Tuminaro R. and Walker H., An inexact Newton method for fully coupled solution of the Navier-Stokes equations with heat and mass transport. Journal of Computational Physics, 137 (1997), pp. 155-185.
6. Tuminaro R., Heroux M., Hutchinson S. and Shadid J., Aztec User's Guide: Version 2.1. Sandia National Labs. Tech. Rep., Albuquerque, NM, 1999.
7. Watts H., Keiter E., Hutchinson S. and Hoekstra R., Time Integration for the Xyce™ Parallel Electronic Simulator. Accepted to ISCAS'01, 2001.
Figure 1. Block diagram illustrating the key modules for the analog portion of Xyce™: the Graph Template Library (GTL), Zoltan partitioning, time integration, nonlinear solvers and linear algebra services.
Figure 2. Scaled parallel efficiency on the SGI Origin 2000 versus number of processors, for netlist, random and repartitioned orderings.
A BREAKTHROUGH IN PARALLEL SOLUTIONS OF MSC.SOFTWARE

LOUIS KOMZSIK (1), STEFAN MAYER (2), PETRA POSCHMANN (3), PAUL VANDERWALT (4), REZA SADEGHI (4), CLAUDIO BRUZZO (5), VALERIO GIORGIS (6)

MSC.SOFTWARE
(1) 2975 Redhill Avenue, Costa Mesa, CA 92626, USA
(2) Am Moosfeld 13, 81829 Munich, GERMANY
(3) 815 Colorado Blvd., Los Angeles, CA 90041-1777, USA
(4) 260 Sheridan Avenue, 309, Palo Alto, CA 94306, USA
(5) Viale B. Bisagno 2/10, 16100 Genova, ITALY
(6) Via Giannone 10, 10121 Torino, ITALY
More than a decade ago MSC offered the first parallel production system of MSC.NASTRAN. During this decade MSC has intensified its effort on parallel processing and is now ready to deliver MSC.NASTRAN V70.7 and MSC.MARC V 9, both of which contain very important new parallel features. This paper describes these exciting features and provides preliminary performance results. We believe that these systems mark the best in parallel performance in commercial finite element analysis ever and present a breakthrough in parallel computing in our market.
1 Introduction
More than a decade ago MSC offered the first parallel production system of MSC.NASTRAN, based on the shared memory paradigm. At that time, and during the following years, the parallelization efforts concentrated on parallelizing the computations in several expensive modules, for example the matrix decomposition. However, during recent years it has become clear that a more extensive parallelization is necessary to satisfy the users' ever-growing demand for higher performance. Moreover, it has turned out that not only computations, but also the I/O traffic, must be parallelized in order to obtain highly efficient parallel analysis solutions. MSC started to work on new parallelization approaches earlier this decade, this time based on the distributed memory paradigm in order to be able to address parallel I/O issues as well. First successes were obtained in the European Europort project, which resulted in the distributed parallel production version MSC.NASTRAN V69.2, available on the IBM SP architecture from 1996. Encouraged by the results of this project, the efforts on distributed parallel MSC.NASTRAN have been intensified during the past four years.
Concurrently, the MARC product has also been parallelized. The first parallel version of MARC based on domain decomposition, Version K6-3, was released in March 1997 on SGI, HP, IBM, DEC and SUN, then in 1999 on NT and finally in 2001 on Linux. MSC is now ready to deliver MSC.NASTRAN 2001 and MSC.MARC 2001, which contain even more exciting new parallel features. It will also be possible to execute these versions on a homogeneous cluster of workstations. The results obtained with these systems mark the best in parallel finite element analysis performance ever and present a breakthrough in distributed parallel computing in our market. Distributed memory computers are essentially tightly coupled networks of workstations. The connection between the processors (nodes) of a distributed memory computer is usually a very fast proprietary network or switch. The communication between these nodes is via standard interface libraries, for example MPI (Message Passing Interface), which is used by MSC in the distributed memory parallel implementations. Each node of a distributed memory computer has its own memory, hence the name. Even more important is the fact that the nodes have their own local I/O (disk) devices, enabling efficient parallel I/O operations.
2 Distributed solution techniques
The distributed solution techniques implemented in Version 2001 of MSC.NASTRAN and MSC.MARC are based on the principle of domain decomposition. We found two possible and convenient ways to decompose finite element problems: the frequency and the geometric domain decomposition. In the following the basic principles of both of these techniques are discussed. It is important to note that the distributed solution techniques provide the same results and output formats as the serial solutions, apart from some additional performance summaries related to the distributed processes. We also made the additional user interface needs very minimal, as shown below.
2.1 Frequency domain decomposition based solutions in MSC.NASTRAN
The frequency domain decomposition technique applies only to certain solution sequences where the notion of frequency is utilized. Such are the normal modes and frequency response analyses. More specifically, the frequency domain decomposition based normal modes analysis is built on the frequency segment approach of the Lanczos method. The frequency domain decomposition based
frequency response analysis utilizes the independence of the discrete frequencies given on the user's list.
2.1.1 Distributed normal modes analysis in MSC.NASTRAN
The segmented version of Lanczos was first introduced in the early 1990s to alleviate the problems encountered on very wide frequency range runs, mostly on Cray computers. In auto-industry tests it was found that orthogonality is lost when very long Lanczos runs are executed while trying to span a wide frequency range of interest. We introduced the segment version to force intermittent shifts at semi-regular intervals, resulting in more frequent restarts of the Lanczos process. While it significantly improved the solution quality, i.e. we avoided aborts due to loss of orthogonality, the number of shifts was usually greater than in the non-segmented run. It is a natural extension of this logic to execute the segments in parallel, which is the basis of the distributed parallel implementation in Version 2001. This process is executed in a master/slave paradigm. One of the processors (the master) executes the coordination and collection of the separately calculated eigenvectors into an appropriately ordered set. This guarantees the continuation and proper exit of the master process. The slave processes contain only the eigenvalues and eigenvectors they found upon exiting the READ (Real Eigenvalue Analysis DMAP) module.
2.1.2 Distributed frequency response analysis in MSC.NASTRAN
The distributed frequency response analysis is based on assigning exclusive subsets of the user given frequency list to each processor. Each processor calculates the responses of the structure at the frequencies given in its subset. The calculation may be executed in a master/slave mode, which was the only method available in Version 69.2 or now in a more efficient symmetric ("all master") operation mode automatically selected by the code. In the master/slave mode the master processor is distributing the structural matrices and the frequency list to the slave processors in the beginning of the frequency response module and upon completion of the response calculation collects the results. This operation mode is used in the distributed modal frequency response analysis solution. In the symmetric mode all processors behave identically. They calculate their own structural matrices and respective frequency subsets and the responses. They each also complete the solution sequence, resulting in higher scalability due to the locally kept output results. This symmetric operation mode is used in the distributed direct frequency response calculations.
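As a minimal illustration of the symmetric mode, the snippet below shows one simple way a user-given frequency list could be split into exclusive subsets, one per processor. The round-robin rule and all names are our own example, not MSC.NASTRAN's internal logic.

```python
def assign_frequencies(freq_list, n_proc, rank):
    """Exclusive subset of the user's frequency list handled by `rank`.

    A round-robin split: each processor takes every n_proc-th frequency,
    which keeps the per-processor work balanced even when the cost grows
    with frequency."""
    return freq_list[rank::n_proc]

# example: 10 excitation frequencies split over 4 processors
freqs = [10.0 * (i + 1) for i in range(10)]
print([assign_frequencies(freqs, 4, r) for r in range(4)])
```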
2.2 Geometric domain decomposition based solutions in MSC.NASTRAN
The principle of geometric domain decomposition is applicable on a much higher level than the frequency domain decomposition. In the frequency case the distributed solution is focused on certain modules (READ and FRRD1) and solution sequences (Sol 103 and 108, as well as 111). The geometric domain decomposition principle transcends many DMAP modules and also solution sequences. In this first production version of this technology we focused on the linear static analysis (Sol 101); however, many aspects of the development work carry over to other solution sequences to be delivered in distributed form in the future. This technology may also be used in connection with the frequency domain decomposition in a hierarchic fashion. Finally, this technology relies in part on the very strong superelement technology of MSC.NASTRAN, which gives its foundation. The cornerstone of the geometric domain decomposition is an automatic domain decomposition tool. This tool works from geometry (connectivity) information and uses heuristic algorithms to create subdomains. The main criteria in creating subdomains are: minimizing the boundary between the domains, achieving load balance and minimizing the cost of the solution of the interior of the domains. In Version 70.7 we use the EXTREME tool for domain decomposition; however, other tools such as METIS may also be used in the near future. After automatically creating the subdomains, MSC.NASTRAN's superelement process is executed. The shortcoming of the boundary solution of the conventional superelement process, the explicit creation of the residual matrix, is however avoided. This requires very advanced and efficient distributed algorithms designed and implemented by MSC. These proprietary algorithms encapsulate most of the interprocessor communication via MPI utilities. The data recovery for each domain is done independently on each processor and does not require any communication. This results in linear or better speed-ups and disk-space savings during that part of the run.
Geometric domain decomposition based distributed solutions in MSCMARC
The application of the domain decomposition technique requires further considerations in MSC.MARC, due to the nonlinear focus. The various sources of nonlinearity, such as material, geometric and kinematic nonlinearity as well as contact need to be taken into consideration.
177
Additional requirements distinct from linear analysis are handling of multiple steps and die step-by-step changing load balance in contact analysis. In general, die stability plays a key role in the distributed implementation of MSC.MARC. Above considerations and requirements induced a different subdivision strategy dian in MSC.NASTRAN. Using domain decomposition in MSC.MARC, die geometric model is subdivided in me pre/post processor MENTAT (generate option). This enables us to take into consideration user given natural or CAD system specified substructures and handle other physical constraints. It also enables an interactive control of the subdivision process. On the post processing side, mis approach facilitates a data recovery process by domains or in the full complexity of die model, as the user wishes. Finally, diis approach results in a domain decomposition methodology trans- parent to MSC.MARC application or solutions. The metiiodology may be applied in general nonlinear analysis applications including contact. 3
Application Examples
3.1
MSC.NASTRAN Version 70.7 results
To demonstrate the performance improvement obtained by die distributed Lanczos technique, we choose a carbody model of approximately 200,000 degrees of freedom consisting of mainly shell (QUAD4) elements (Figure 1).
Figure 1. Carbody model
Figure 2. Exhaust manifold model
Figure 3. Crankshaft model
The normal modes analysis was executed up to 400 Hz, extracting 1876 eigenvectors, using lOOMw of memory. The disk usage (the main processor's scratch database high water mark) and the elapsed time of the total solution are shown in Table 1.
178
Number of CPUs 1 2 4 8
Disk Gbytes 10.3 7.7 7.0 6.7
Elapsed min:sec 449:27 329:23 122:39 87:57
Speedup — 1.4 3.7 5.1
Table 1. Distributed normal modes performance results (fig. 1) The scalability and the speedups are quite remarkable. We obtained a 5 fold elapsed time speedup on 8 processors. It is even more important to point out the significantly lower disk requirements on the individual processors in the case of the parallel runs due to the smaller scratch space needs. The distributed parallel frequency response capability is demonstrated by the analysis of an exhaust manifold model (Figure 2). The model consists of app. 50,000 degrees of freedom and was built of mixed (shell and solid) elements. Direct frequency response analysis (Sol 108) with 100 frequency steps resulted in the performance measurements of Table 2. Number of CPUs 1 2 4 8
Elapsed Seconds 9521 5001 2672 1546
Speedup 1.9 3.6 6.2
Table 2. Distributed frequency response performance results (fig. 2) This analysis represents an even better scalability. This is due to the fact that me complete frequency response solution was executed in parallel and none of the processors collected the response results of others, the results were kept local and the database separate (mergeresults=no). The effect of the geometric domain decomposition based distributed execution of the linear statics analysis is shown with the following example. The model is an automobile crankshaft (Figure 3) with about 150,000 degrees of freedom and built exclusively from solid elements. Table 3 compares the elapsed time (min:sec) and disk space requirements of Solution 101 with one and 8 processors.
179
Disk Elapsed
1 Processor 1.4GB 38:42
8 Processor 0.26GB 8:35
Speedup s/Saving 5.4 4.6
Table 3. Distributed linear statics performance results (fig. 3) The speedup of 4.6 on 8 CPUs is very good and the 5.4 fold saving in maximum local disk space is significant. It is important to note that all 1 CPU runs shown in this section were executed with the fully tuned serial production implementation of MSC.NASTRAN and not from single processor runs of the parallel algorithms. 3.2
MSC.MARC2000 results
The performance improvement obtained in MSC.MARC V 9 is first shown in a metal forming application. This example is the analysis of the thread - rolling process of the manufacturing of a carseat shown on Figure 4. Such an application is widespread in the automobile industry since many components of automobiles are manufactured in that fashion.
Figure 4. Carseat model
Figure 5. Engine
The finite element model consists of 9600 elements, 9881 grids and 44 time steps were executed. The results on an SGI ORIGIN 2000 (250 MHz) shared memory parallel computer are shown in Table 4. Number of CPUs 1 2 4 8
Elapsed Seconds 7,702 4,899 2,025 1,267
Speedup 1.6 3.8 6.1
Table 4. Distributed metal forming application results (fig. 4)
180
This excellent speedup is almost identical to the speedup obtained in the MSC.NASTRAN Version 70.7 parallel frequency response analysis example shown in Table 2. This fact demonstrates the applicability of the domain decomposition principle across products and solution techniques. An additional example of a nonlinear analysis of an engine block is shown in the following Figure 5. The analysis was to determine the transient thermal response to a high temperature heat source applied within the cylinders. The model has about 500,000 elements and approximately 1.8 million degrees of freedom, 10 time steps were executed. The results on a SUN E3500 (333 MHz) computer were as follows: Number of CPUs 1 8
Elapsed Minutes 78 12
Speedup — 6.5
Table 5. Distributed nonlinear analysis results (fig. 5) The efficiency result shown in tiiis application is truly remarkable and demonstrates the strategic importance of the domain decomposition methodology. 5. Conclusions We hope to have demonstrated the significant performance advantages of the new distributed solution techniques. These techniques truly unleash the power of parallel processing and present groundbreaking solutions in our segment of the CAE market. We believe that these capabilities will be very useful for a wide range of our user community. We are continuing the distributed parallel development and are committed to further improvements. We are planning to expand the geometric domain decomposition based technology to other solution sequences of MSC.NASTRAN, for example normal modes and transient analysis in the future. We are also continuing to improve the domain decomposition solutions in MSC.MARC, especially in the area of multi-disciplinary (structure, heat, electromagnetic, fluid) applications.
181 Implementation of a parallel Car-Parrinello code on H i g h Performance Linux-based Clusters S.Letardi 1 , M . C e h W ^ F . C l e r i 2 ' 3 , V.Rosato 1 ' 3 ENEA, Casaccia Research Centre, HF'CN Project, Roma, Italy ENEA, Casaccia Research Centre, Divisione Materiali Avanzati, Roma, Italy 3 INFM, Unita di Ricerca Roma I 1
2
4
A. De Vita 4 ' 6 , M. Stengel 4 Insitute Romand de Recherche Numefique en Physique des Materiaux (IRRMA), INR~Ecublens, 1015 Lausanne, Switzerland 5 INFM and Dipartimento di Ingegneria dei Materiali, Universita di Trieste, via A.Valerio 2, 3^127, Italy A Car-Parrinello Molecular Dynamics code has been implemented on two different Linux-based parallel platforms, based on Alpha processors. The clusters are characterized by state-of-the-art interconnection networks, enabling high throughput and low latency. The performance and the scalability of the code are reported in detail, together with the analysis of the performance of the most intensive computational tasks. Results indicate a good scalability of the code on the platforms and the adequacy of the interconnection networks to sustain a large workload.
1
Introduction
Molecular Dynamics based on Density Functional Theory x (DFT), the socalled Car-Parrinello approach 2 (CP hereafter) has gained a great deal of success in several application domains, from quantum chemistry to materials science, from biology to reaction's chemistry, as its accurate modeling of the bonding properties allows to make qualitative and quantitative predictions of the physical phenomena. The CP approach is particularly intensive from the computational point of view. Its application to complex systems has coincided with the advent and the diffusion of parallel computing. The availability of high performance parallel computers has, in fact, given the possibility of significantly extending the applicability of the CP approach to large systems, to long time scales up to accessing the scales of complex chemical reactions. From the side of the computational platforms, the recent years have been characterized by a change of paradigm which has imposed parallel platforms, assembled with commodity-off-the-shelf (COTS) components, on the proprietary parallel architecture. The new architectural solution, based on COTS processors, interconnected via commercial networks (IN) and managed by Unixlike operating systems, such as Linux, (we will refer to them, hereafter, as Beowulf Clusters, BC), are becoming the most widespread parallel platforms. The
182
first generation of BC's, based on low-cost COTS components, has revealed, however, to be inadequate to cope with highly parallel tasks. Communication networks and the general-purpose architecture of low-cost processors generally fail to sustain large workloads when a relevant number of processes are used. A second generation of BC's has been thus introduced by assembling high-end COTS processors and commercial IN's networks characterized by low latency and high bandwidth. Although the cost of this solution is higher than that of the first generation BC's, it results to be competitive with that of proprietary solution. The cost-effectiveness has triggered the diffusion of these platforms in the last years. In this work we have tested the implementation of a parallel computational code (a CP-based molecular dynamics) on two second-generation cluster-based platforms: the first, located at the ENEA Casaccia Research Center of Rome (Italy), is called Feronia; the second, located at the Centre for Advanced Parallel Applications (CAPA) of the Ecole Polytechnique Federate at Lausanne (Switzerland), is called Swiss TO. They keep the open-source operating system, from the BC experience, but they introduce more sophisticated IN's to increase platform scalability when a large number of processing elements is used. 2
Description of the platforms: the Feronia and the Swiss TO clusters
The processors (PEs) of the Feronia platform are the Compaq-Alpha EV67 (667 MHz) with 4MBytes of L2-cache. They are assembled in dual-processors nodes (API UP2000) in the SMP configuration. Each node is characterized by 1 Gbyte of DRAM and by the presence of the Tsunami chipset for increasing the intra-node communication bandwidth. The peak performance of each dualprocessor node is about 2.6 Gflops. The Feronia configuration has, to date, 16 nodes, interconnected by the QsNet network. QsNet is an IN with a fattree topology and provides a peak bandwidth of 340 Mbytes/second in each direction; the process-to-process latency for remote write operations is 2 us. The design of the data network guarantees that all nodes obtain this level of performance whatever the size of system, ensuring the maximum possible scalability for parallel applications. The network interface is based on QSW's third generation "Elan" ASIC dedicated I/O processor to offload messaging tasks from the main CPU: a 66 Mhz 64-bit PCI interface, a QSW data link (a 400 Mhz byte wide, full duplex link), cache and local memory interface. QsNet offers a remote access latency of 5 fisec and a bandwidth of 210 Mbyte/s on MPI applications.
183
The Swiss TO architecture is based on Compaq Alpha EV56 (500 MHz) processor. There are 8 processors connected via an EasyNet bus. The IN sustained performances are 35 MByte/s, with 5 fisec and 12 fisec latencies for FCI and MPI communications, respectively. The theoretical peak power of the platform is of 8 Gflop/s. The Swiss plan for high performance computing will upgrade the Swiss TO architecture to 504 processors interconnected by TNet network with a performance of about one Tflops. 3
Description of the code
The first-principles Molecular Dynamics based on the Car-Parrinello method allows to simulate the behavior of atomic aggregates where the forces acting on the atoms are evaluated from the knowledge of the electronic ground state. This is calculated in the frame of the DFT theory in the Kohn-Sham (KS) formulation 3 and within the Born-Oppenheimer approximation. In this formulations the energy of a system of electrons in presence of nuclei can be written in terms of the electron density ne(r) and of orthonormal orbitals *&*(?"). Car and Parrinello 2 proposed that the electron wave functions could be treated as classical variables and introduced a classical Langrangian L = Y,»j i
dr|*i|2 + ^ M / i ? f - £ [ t f i , . R / ]
+
^ A i j ( < 9i\9j
> - % ) (1)
ij
I
where the VPj's are the KS orbitals, Mj the ionic masses, /< the fictitious electron mass, Ay are the Lagrangian multipliers needed to impose the orthonormality constraints and E[ipi,Rj] is the energy of an interacting ionic and electronic system. From this Lagrangian the equations of motion are derived. In the CP technique, plane-waves (PW) are chosen as orbital basis set. The number of PW elements to be used to build up atomic orbitals depends on the level of accuracy required by the calculation. It is usually of the order 10 4 -10 5 . The CP code evaluates of different terms of eq.(l) by using the representation of the *j(r) in terms of PW:
{Si Because some terms of the energy E[ipi,Ri] are diagonal in real space and others are diagonal in the reciprocal space, computational resources can be saved computing each quantity in the space where it is diagonal. The use of PW's, in fact, allows to switch from the position to the momentum (k-) space representation by means of Fast Fourier Transform (FFT). This implies that
184
the FFT's represent the most intensive part of the calculation; FFT scales as 0(M log M), where M is the number of PW's. A further relevant task of the CP code is the orbital's orthogonalization: this task scales like N2M, where AT is the number of KS states. Moreover, the DFT algorithm in the usual Local Density Approximation is cubic-scaling with the volume of the physical system. Therefore much care must be devoted to a wise implementation of the computational task. 4
Parallelization s t r a t e g y
The LAUTREC code (The LAUsanne Total REal to Complex energy package) has been used as test code; it implements a first-principles Molecular Dynamics in the CP approach and, since June 1994, it represents the first implementation of a CP-based code on a parallel platform. The parallelization scheme follows the pioneering work of Clarke and co-workers 4 adapting it to the CP method's features. The code is written in Fortran 90, and uses the MPI communications interface. Either on Feronia and on Swiss TO the MPICH primitives have been suitably coded to efficiently run on the available proprietary IN's. The code is communication-intensive and achieves optimal performance and good scalability if optimised distributed linear algebra routines and hardware communications are used. The code essentially performs: 3D FFTs and dense linear algebra operations, like matrix-matrix multiplications and eigenvalue solving. For these reasons, BLACS/SCALAPACK subprograms like e.g. NUMROC, DESCINIT, PSGEMM, PSSYEVX are used. The parallelization strategy is data driven. An equal subset of the planewaves basis set, used to expand the electronic orbitals, is assigned to each processing element (PE). The distribution of the scalar-product workload among the different PEs is a key issue in the parallelization strategy, as the task of performing scalar products between PW, described in eq. (1), is one of the most computationally-cost intense of the code. The PW distribution is made in a way to localize on the PEs as much as possible the scalar product, and to perform some final internodes sum of the partial results. Data (electronic wave-functions) are distributed on a 3D mesh, and the algorithm requires continuous transformations between direct and Fourier space. Therefore, a computationally relevant task is constituted by a 3-D complex FFT to be performed on a grid whose dimensions depend on the system size and on the cut-off energy. For this task, the parallelization strategy takes also into account the symmetry properties of the reciprocal space. Real wavefunctions are thus used when adopting T-point sampling of the reciprocal space.
185
In particular, this halves the basis set effectively used in the calculation by using the symmetry relation Ci(—g) = c*(g) on the expansion coefficients in eq. (2). To avoid conflicts between this (G-space non-local) symmetry and the main parallelization strategy, the distributed data unit effectively consists of couples of conjugate reciprocal space FFT mesh columns, so that the electronic orbitals can be paired in complex vectors constructed from private PE data before the complex FFTs are carried out. Special features of the porting (so called 'T-point symmetry") and of the static load balancing (so called " self-inverse mesh tiling") require the use of an algorithm-specific parallel FFT package. 5
Benchmark details
We performed several test-cases in order to check the performance and the scalability of the LAUTREC code on the two different BC platforms by using the dedicated hardware. The test-cases are related to the simulation of model-systems whose size is typical of those used in scientifically meaningful simulations. On the Feronia cluster we have simulated a carbon system. Two different model-systems have been used: a large one, hereafter referred to as F l , formed by 120 carbon atoms, with a cell dimensions of 13.00 * 12.50 * 6.65 A3 and an FFT grid of 100 * 96 * 56 points; and a small system, F2, composed of 60 carbon atoms, with a cell dimensions of 13.00 * 12.50 * 3.32 A 3 and an FFT grid of 100 * 96 * 32 points. The choice of the cut-off of 40 Ry results in the use of about 15600 plane waves for F l and of about 7800 plane waves for F2. Benchmarks on the Swiss TO cluster have been performed with the following three systems: a system S I , composed of 108 aluminium atoms, with a cell dimension of 12.143 A 3 , a cut-off energy of 10 Ry, an use of about 3200 plane waves and a FFT grid of 48 3 points; system S2, composed of 32 water molecules, with a cell dimension of 9.883 A 3 , a cut-off energy of 30 Ry, the use of about 8900 plane waves, and an FFT grid of 723 points; system S3, composed of 32 water molecules, with a cell dimension of 9.883 A 3 , a cut-off energy of 90 Ry, an use of 46700 plane waves, and an FFT grid of 1203 points. It must be underlined that the number of FFT grid points (nrl *nr2*nr3) must be chosen in such a way to have n r l * nr2, nr2 * nr3, nrZ * nrl multiple of 2 * PEs. 6
R e s u l t s and discussion
We firstly analyse the results obtained on the Feronia cluster on the F l and F2 systems.
186
Figure 1: Speed-up of the LAUTREC code on the Feronia cluster for the F\ and F2 cases, (a): simulations with P / N = 2 ; (b) simulations with P / N = l
For the largest system, F l , we have, at most, a memory occupancy of 450 MBytes on a single node that allows to run the test-cases also on a single node (1 GByte of DRAM). The resulting speed-up data have been reported in Fig.l. It must be noticed that the LAUTREC code compels the use of a number of processes equal to 2P, (with p = l , 2...) for this reason, we have defined the Speed Up as S = TCpu(Np = 1)/TCpu{Np = 2 P ), where Np is the number of processes and TQPU is the wall clock time of the run. The TQPU has been averaged on several runs. We have run the code either by allotting one process or two processes per node. In the first case (hereafter indicated as P / N = l ) only one PE is used on each node, in the second case (hereafter indicated as P / N = 2 ) , two PEs per node are used. Figure 1 shows an overall good scalability of the code on the Feronia platform. It allows a good sustained scalability also for the most time-consuming routines, namely FFT and the orthogonalization one (GRAHAM). The former is also intensive from the point of view of the IN workload. Furthermore we can observe, from Figure 1(b), that there is an improvement of speed-up when passing from 16 PEs to 32 PEs. It could be due to a better distribution of the FFT among the PEs. A further appreciation of the performance of the QsNet can be also extracted from the comparison of data of Table 1 and 2. It comes clear that the most time-consuming routines are mathematical routines, using to solve
187
2
4
6
8
number of PEs
Figure 2: Speed-up of the 5 1 , 52 and 5 3 systems on the Swiss system 5 3 ; diamonds= system 52, triangles= system 5 1 .
TO platform.
Circles=
the parallel FFT and the wave functions orthogonalization. The data domain, in both cases, is completely distributed among the processors. By comparing Table 1 with Table 2 one can observe that the CPU time for the case P / N = l is slightly lower than that for the case P / N = 2 . However this difference becomes progressively smaller by increasing the number of PEs. We guess that the improved performance of the case P / N = l could be due to the lower amount of data in each node that engaged the DRAM and the communication port toward the QSnet. We can observe, from Table 2, that we have a very good speed-up for the GRAHAM routines. When passing from 16 PEs to 32 PEs, however, the communications become too intensive, although this fact does not affect the overall speed-up. In Fig. 2 we show the results obtained on three cases (SI, S2 and S3) on the Swiss TO platform. We can note, observing the graph in Fig.2, that for the smaller systems (diamonds and circles) we have more pressure on communications. They clearly pinpoint where the IN technology starts loosing edge. On the other side, the behaviour of the SI system (triangles) is even slightly superlinear, presumably due to a more efficient use of L2-cache when the system size is spread over a larger number of processors. Second-generation BCs, based on fast interconnecting network (such as QsNet and Easy Bus), are suitable parallel platforms to deal with DFT-based molecular dynamics codes which are intensive applications for both number-
188
P/N 2/2 4/4 8/8 16/16
Total (sec) 924 514 241 194
FFT (sec) 490 251 131 98
GRAHAM (sec) 215 129 44 22
Table 1: Execution times of the different tasks for 20 time-steps of the Fl system on the Feronia platform, for the case P/N = 1. Total is the total execution time, FFT the time spent in the Fourier transform routine, and GRAHAM is the time spent in the orthogonalization procedure.
P/N 2/1 4/2 8/4 16/8 32/16
Total (sec) 1288 627 313 273 143
FFT (sec) 612 294 162 126 51
GRAHAM (sec) 329 155 66 36 61
Table 2: Same as Table 1, for the case of P/N=2
crunching and communication activities. The special care that has been taken in controlling load balancing and the communications costs allowed good speed up till 16 nodes. These results open the way to the possibility of having good scalability figures on larger capability platforms (of the same BC type) which are going to be assembled, both at ENEA and at CAPA-EPFL. References 1. 2. 3. 4.
PRAN: SPECIAL PURPOSE PARALLEL ARCHITECTURE FOR PROTEIN ANALYSIS
A.MARONGIU CASPUR E-mail: marongiu@die. ing. uniromal. it P.PALAZZARI, V.ROSATO ENEA Casaccia Research Center, HPCN Project, Rome E-mail: {palazzari, rosato}@casaccia. enea. it Computational genomics is gaming an increasing interest in the scientific community. Genomic and proteomic analysis are mainly based on character processing, so standard processors are not well suited to support the computations required, being the largest part of their silicon devoted to support memory management and floating/fixed point numerical computations. In this work we present the PRAN (PRotein ANah/ser) architecture purposely designed to recognize the occurrences of a given short sequence of characters (m-peptide) within a large proteomic sequence. PRAN has been designed using the automatic parallel hardware generator developed in our research centre in the framework of the HADES project (HArdware DEsign for Scientific applications). Due to its high degree of parallelism exploited within the architecture, PRAN is able to reach high computational efficiency. PRAN has been implemented on a prototyping board equipped with a Xilinx Virtex XV1000 FPGA. Test comparisons with standard commodity off the shelf processors evidenced that PRAN is nearfy one order of magnitude faster.
1
Introduction
The increasing relevance of computational methods in genome sequencing and analysis is compelling the design of new computing architectures. General purpose computational platforms, in fact, are purposely designed to support floating- (or fixed)-point numerical operations, while most of the computational complexity involved in genomic (or proteomic) analysis is based on "character processing". This implies that general purpose commodity-off-the-shelf (COTS) architectures are often deployed, in those contexts, with a very low efficiency. The partial inadequacy of COTS platforms to sustain high computational efficiency on this class of computations has fostered a renewed interest in the design of specialised hardware devices, purposely realised to cope with the algorithmic complexity inherent to character processing ([l]-more extended in [2]). In this work we present the design of a dedicated hardware device, based on the FPGA technology, purposely designed to be used in a field which can be called "linguistic proteomics", i.e. the search of the frequency of occurrence of basic m-peptides in a given protein (or in a whole proteomic set). The developed chip implementing this computational task has been called PRAN (PRotein ANalyzer).
189
190
Given the m-string (representing the m-peptide) and the M-string (representing the protein or the proteome), the exhaustive search for the number of occurrence of all me possible m-strings (whose number is 20m) into a M-string requires mx(M-m)x20m character comparisons. To give a feeling of the complexity of this computation, nearly 9xl0 16 character comparisons are required to search for the frequency of epta-peptides (m=7) inside a proteome with 10 amino acids; wim a sustained power of 109 character comparisons per second (a reasonable number for the current leading edge processor technology), this computation would require 34 months. Our approach is based on an automatic hardware generator which is able to produce a synthesizable VHDL from high level specifications. 2
Finding m-peptides within a proteome
A proteome P is a sequence of characters defined over the alphabet A = {I,F,V,L,W,M,A,G,C,Y,P,T,S,H,E,D,Q,N,K,R} Each character in A represents an amino acid. The typical length of P is M=106-s-107 characters. Peptides are sequences of characters belonging to A; m-peptides are sequences of m characters. Each character aj within P has an occurrence frequency ffa) _ „ number of occurrences of a,
flfr) =
—
~
(1)
M Given a m-peptide mpe Am, its Theoretical Occurrence Frequency TOF(mp) is the product between the extraction probability of mp, supposed contained in the proteome, and its existence probability, i.e.
TOF(mp)=-r
Il f ( a i)
(2)
M-m a,emp The Measured Occurrence Frequency of mp, MOF(mp), is obtained through an experimental measure. All the m-peptides are generated and are searched within the whole proteome; MOF(mp) is the number of matches of mp divided by the whole number of trials, i.e. w^™ x number of occurrences of mp in P MOF(mp)= — = (3) 20m(M-m) It is interesting to determine the m-peptides having MOF(mp) significantly larger than TOF(mp): they are good candidates to be studied by the biologists searching the 'basic lexicon' of proteins. As there is no 'a priori' criterion to determine which are the m-peptides with large value of the ratio , all the 20m m-peptides must be, in principle, analysed TOF(mp)
191
and their MOF(mp) must be computed counting, for each mp, the number of its occurrences in P. m character comparisons are required to check if the m-peptide mp is present in the j * position of P: thus, the determination of MOF(mp) involves 20m(M-m)m character comparisons (all the m-peptides are searched in the whole proteome). Clearly more sophisticated algorithms could be adopted to determine MOF(mp) for all the m-peptides but, due to the very small granularity of the basic operations involved, the overhead to manage conditional statements is significant, thus vanishing the advantage of these algorithms. We refer to the following m-peptides searching algorithm input proteome P of length M, m (order of the peptides), threshold value T output • , MOF(mpK „, all the peptides witfi *—^ > T TOF(mp) begin mp_result = {} ; for each a^eA compute f(ai); while not all the m-peptides have been generated generate a new m-peptide mp; scan P to determine MOF(mp); mc
MOF(mpK„ 2. 1 then insert mp into mp result;
if
TOF(mp) end
The computational intensive part of previous algorithm is contained in the ' s c a n P t o d e t e r m i n e MOF(mp)' statement; it has been decided, in order to enhance the processor capability to deal with this computational problem, to develop a special purpose parallel architecture, called PRAN (PRotein ANalyser), to efficiently implement such a statement. 3
Design Approach
The design approach is based on the automatic Parallel Hardware Generator (PHG) tool developed in ENEA (Italian Agency for New Technology, Energy and Environment) by two of the authors in the framework of the HADES project (HArdware DEsign for Scientific applications). The PHG theoretical framework is described in [4] while the detailed theory is reported in [3]. PHG produces a synthesizable VHDL [5] from high level specifications given by means of affine recurrence equations [6-8]. In order to achieve the final circuit description, PHG performs the following steps:
192 •
parsing of recurrence equations describing the algorithm to be implemented and generation of the intermediate format; details on the language used to describe recurrence equations can be found in [9]; • automatic extraction of parallelism by allocating and scheduling the computations through a space-time mapping [4]. The mapping is represented by an integer unimodular matrix derived through an optimization process [10]. This step produces the architecture of the system expressed as a set of interconnected functional units (data path) managed by a control Finite State Machine (data path controller) which enforces the scheduling; • generation of the synthesizable VHDL representing the architecture determined in the previous step. The VHDL code is then synthesized through the standard Electronic Design Automation tools. We used Synopsys FPGA compiler II to produce the optimized netlist and Xilinx Foundation Express 3.1 to place and route it into the target FPGA. Using the SIMPLE (Sare IMPLEmentation) language [9], the design behaviour has been specified through the following SIMPLE program: /*Input definition*/ Input str[i] 0 < i < M-l Input substr[i] 0 < i < m-l /*Result definition*/ Result res, restree; /•Initialisation equation*/ res[i,j]=l -m+1 < i < M-l j = -l /*match equation for the first (m-l) characters of the m-peptide*/ Eq.l: res[i,j]=res[i,j-1] and (str[i+j] == substr[j]) 0 < i+j < M-l, 0 < j < m-2 /*match equation for the m-th character of the m-peptide. This equation is defined for each akeA*/ Eq.2: restree [i, j ] = res[i,j-l] and (str [i+j ] ==akeA) 0 < i+j < M-l, j=m-l /•Output definition*/ Write(restree[i,j]) 0 < i < M-m, j=m-l
Input definition specifies the input variables along with their validity domain. Result definition specifies the intermediate/final result of the algorithm. The output definition specifies which final results must be produced by the algoritiim. Equation 1 computes the partial result r e s [ i , j ] : r e s [ i , j ] is set to 1 if, starting from position i, the first j characters of the input string match the first j characters in the m-peptide, otherwise r e s [ i , j ] is set to 0. Equation 2 computes the output result r e s t r e e [ i , j ] . Equation 2 has the same meaning of equation 1. The only difference is that in equation 2 the match is checked against each character ak 6 A. The final result r e s t r e e [ i , m - l ] assumes so either the value 'not matched' if at least one of previous m-l characters does not match, or the value ak, i.e. the last character of the m-peptide. Given the initial m-l characters of a m-peptide,
193
{mim2v,mm_i}, r e s t r e e [ i , m - l ] = a^eA means that the m-peptide {mim2,...,mm_i,aic} is contained, starting from the position i, in the input sequence. 4
PRotein ANalyser Architecture
As COTS processors are powerful enough to analyze m-peptides with m ranging up to 6, we designed PRAN to study the case m=7. Moreover the M value has been set equal to 4096 and 5 bits are used to encode the characters. The architecture obtained applying the PHG to the recurrence equations defining the problem is the pipelined structure sketched in figure 1. mp(0)
mp(\)
mp(2)
mp{3)
mp(4)
mp(5)
r - = - D - = - D - = - Q - = - Q - = - D - = -O-i Strln(f)
P0
StrOul(i)
"11111" Figure 1: Pipelined structure implementing the string matching algorithm
The pipeline structure receives the first 6 characters of the epta-peptide from the input mp(0)...mp(5) and the input sequence from the input port Strln. The output character is produced at the output port StrOut. The pipeline structure contains 6 compare blocks, represented by a box with the '=' label. The logical scheme of a compare block is shown in figure 2. As we see, compare blocks are combinatorial circuits setting to 1 their output when the two characters presented at the A and B input ports are equal and the C input port receives the 1 value. Output StrOut, which corresponds to equation 2, is implemented through a multiplexing operation which produces the current string character, if a 7-peptide matched, otherwise returns the special character 'not matched' ("11111"). A
Out
Figure 2: Logical scheme of the 'compare block'
194
The PRAN architecture has been designed to be hosted by a prototyping board (see figure 3) equipped with PCI interface, 4 independent SRAM banks (512Kx32 each), 1 Xilinx Virtex XV1000 FPGA, 2 I/O ports (8 bit width). SRAM 4->
To host '
m u <
SRAM XV1000 SRAM 4->
z
SRAM 4-+
J
Input & Control Port
Figure 3: Block diagram of the prototype board hosting the PRAN architecture
In order to increase the parallelism, and according to the board I/O constraints, during the analysis the whole proteome P is divided into N not-overlapping strings which feed N replicas of the previous pipelined structure. Due to the I/O constraints, in our implementation of PRAN we chose N=12, i.e. we search for 20 7-peptides simultaneously on 12 strings of length 4096. The architecture of PRAN is sketched in figure 4, being the blocks labelled 'pipe compare' depicted in figure 2. As we see in figure 4, PRAN receives the 12 input strings through 60 input lines, being each character encoded through 5 bit. In our test-bed board, the input strings are stored in 2 SRAM memory banks, each one organized as 32x512Kbit. Two other SRAM banks are used to collect the output of PRAN; the output consists of 12 characters, each one coding if one of the 20 7-peptides matched and, if a match is detected, which was the 7-peptide that matched. PRAN drives the lines, and the addresses, to control the 4 SRAM memory banks. Through a dedicated input port the first 6 characters of the 7-peptide to be checked are loaded into 6 internal registers (R0,...,R5). These registers are connected to the inputs of all the 12 'pipe compare' blocks. Constrained at the speed grade of the FPGA we used (XV1000 6), the synthesized design is clocked at a frequency fck= 40 MHz. The scheme of the algorithm to analyse P is the following begin mp_result = {}; for each a^eA compute f(ai); while not all the (m-1)-peptides have been generated generate a new (m-1)-peptide m_lp; for k=l to M/(4096*12) DMA input data to board; use PRAN to analyse the 12 sequences containing 4096 characters; DMA result data from board; compute MOF(mp);
.f MOF(mp) ^ T then TOF(mp)
insert mp into mp_result;
195 Input & Ctrl Port
-y-
RO
Rl
LI
R2
R3
I J
R4
R5
SRAM addresses &. control signals
Addr(4xl9)
Ctrl
StrInO 5 /
y
OutO
5/
Outl
5
,.
-y-*\ Slrln! 5
¥-
l
-P-
pipe compare 11
•y-
Figure 4: PRAN architecture
5
Results
In order to test the performance of the implemented PRAN prototype, we searched for the 7-peptides on the proteome constituted by all the protein sequences contained in the yeast Saccharomyces cerevisiae. We implemented the searching algorithm on three different systems, based respectively on a Sun UltraSparc 60 (450 MHz), on an Alpha EV6.7 (667 MHz) and on a Pentium II (333 MHz) connected to the development board equipped with the PRAN prototype. The implementation of the 7-peptides search on the UltraSparc and Alpha systems fully exploits the parallelism of the data bus, allowing the simultaneous comparison of a number of sub-strings equal to the number of characters that can be encoded on the data word. To fix the ideas, the search is performed simultaneously on 6 sub-strings on systems equipped with a 32 bit wide data bus, because 6 characters (corresponding to 30 bit) can be encoded on a 32 bit word. Table 1 summarizes the results, reporting for each system a) the time required to check the number of occurrences of one 7-peptide on a proteome with 10 characters, b) the time to search all the 7-peptides on the same string c) the speed-up of the PRAN architecture vs the systems.
196 Sun UltraSparc 60 Time to check one 7-peptide over 106 characters (sec) Time to check all the 7-peptides over 10' characters (days) Speed-UP=t„deni/tp|UN
Alpha EV6.7
Pentium II + PRAN
0.0322 478
0.0119 176
0.0021 32
15
5.6
1
Table 1: test results The exploitation of the parallelism inherent to the problem has allowed to sustain, with the PRAN processor, significantly higher performances with respect to those attainable with conventional high-end general purpose processors. This device is going to be used in the search of genomic structures called "tandem repeat" (repeated sub-sequences containing a few bases). For this problem, being 3 the number of bits needed to encode the basic character, more functional units can be added, thus increasing the overall length of the treated tandem repeat structure. References 1. Proceedings of the SIMAI Symposium on "Formal Methods for HW/SW design for Grand Challenge Scientific Applications" - Ischia (Italy), June 2000 2. Special Issues of the Computer Physics Communication Journal on "Formal Methods for HW/SW design for Grand Challenge Scientific Applications" Guest Editors P.Palazzari, V.Rosato - to appear in 2001 3. A.Marongiu, "Hardware and Software High Level Synthesis of Affine Iterative Algoridims", Ph.D Thesis in Electronic Engineering, "La Sapienza" University of Rome, February 2000. 4. A.Marongiu, P.Palazzari, "Automatic Mapping of System of Affine Recurrence Equations (SARE) onto Distributed Memory Parallel Systems", IEEE Trans, on Soft. Eng., 26, (2000), 262. 5. IEEE standard VHDL language reference manual. IEEE std. 1076-1993 6. C.Mongenet, P.Clauss, G.R.Perrin, "Geometrical Tools to Map System of Affine Recurrence Equations on Regular Arrays", Acta Informatica, 31, (1994), 137 7. K.H.Zimmermann, "Linear Mapping of n-dimensional Uniform Recurrences onto ^-dimensional Systolic Arrays", Journal of VLSI Signal Processing, 12, (1996), 187. 8. A.Darte, "Regular Partitioning for Synthesizing Fixed-Size Systolic Arrays". Integration, The VLSI Journal, 12, (1991), 293. 9. A. Marongiu, P. Palazzari, L. Cinque and F. Mastronardo, "High Level Software Synthesis of Affine Iterative Algorithms onto Parallel Architectures". Proc. of the HPCN Europe 2000. May 2000 Amsterdam, The Netherlands. 10. A. Marongiu, P. Palazzari, "Optimization of Automatically Generated Parallel Programs", Proc of the 3rd IMACS International Multiconference on Circuits, Systems, Communications and Computers (CSCC'99) - July 1999, Athens
DESIGN OF A PARALLEL A N D DISTRIBUTED SEARCH ENGINE
WEB
S. ORLANDO 0 , R. PEREGO*, F. SILVESTRr # °Dipartimento di Information, Universita Ca' Foscari, Venezia, Italy "Istituto CNUCE-CNR, Pisa, Italy ' Dipartimento di Informatica, Universita di Pisa, Italy This paper describes the architecture of MOSE (My Own Search Engine), a scalable parallel and distributed engine for searching the web. MOSE was specifically designed to efficiently exploit affordable parallel architectures, such as clusters of workstations. Its modular and scalable architecture can be easily adjusted to fulfill the bandwidth requirements of the application at hand. Both task-parallel and data-parallel approaches are exploited within MOSE in order to increase the throughput and efficiently use communication, storing and computational resources. We used a collection of html documents as a benchmark and conducted preliminary experiments on a cluster of three SMP Linux PCs.
1
Introduction
Due t o the explosion in the number of documents available online today, Web
Search Engines (WSEs) have become the main means for initiating navigation and interaction with the Internet. Largest WSEs index today hundreds of millions of multi-lingual web pages containing millions of distinct terms. Although bigger is not necessarily better, people looking the web for unusual (and usual) information prefer to use the search engines with the largest web coverage. This forced main commercial WSEs to compete for increasing the indexes. Since the cost of indexing and searching grows with the size of the data, efficient algorithms and scalable architectures have to be exploited in order to manage enormous amount of information with high throughputs. Parallel processing thus become an enabling technology for efficiently searching and retrieving information from the web. In this paper we present MOSE, a parallel and distributed WSE able to achieve high throughput by efficiently exploiting a low cost cluster of Linux SMPs. Its expansible architecture allows the system to be scaled with the size of the data collection and the throughput requirements. Most of our efforts were directed toward increasing query processing throughput. We can think of a WSE as a system with two inputs and one output. One input is the stream of queries submitted by users. The other input is the read-only database, which contains the index of the document collection. The WSE process each query of the stream by retrieving from the index the references to the I most
197
198
relevant documents. Such set of I references is then put on the output stream. The main parallelization strategies for a WSE are thus: Task parallel. Since the various queries can be processed independently, we can consider query processing an embarrassingly parallel problem. We can thus exploit a processor farm structure with a mechanism to balance the load by scheduling the queries among a set of identical workers, each implementing a sequential WSE. Data parallel. The input database is partitioned. Each query is processed in parallel by several data parallel tasks, each accessing a distinct partition of the database. Query processing is in this case slightly heavier than in the previous case. Each data parallel task has in fact to retrieve from its own partition the locally most relevant / references. The final output is obtained by combining these partial outputs, and by choosing the I references which globally result to be the most relevant. Task have Data work
+ Data parallel. A combination of the above two strategies. We a processor farm, whose workers are in turn parallelized using a parallel approach. The farming structure is used to balance the among the parallel workers.
The modular architecture of MOSE allowed us to experiment all the three strategies above. The third parallelization strategy, which combines Task and Data parallelism, achieved the best performances due to a better exploitation of memory hierarchies. The paper is organized as follow. Section 2 introduces WSE and Information Retrieval (IR) principles, and surveys related work. Section 3 describes MOSE components, discusses parallelism exploitation, and shows how MOSE modular and scalable architecture can be adjusted to fulfill bandwidth requirements. The encouraging experimental results obtained on a cluster of three Linux SMPs are shown in Section 4, while Section 5 draws some conclusions. 2
W S E and IR Principles
A typical WSE (see Figure 1) is composed of the spidering system, a set of Internet agents which in parallel visit the web and gather all the documents of interest, and by the IR core constituted by: (1) the Indexer, that builds the Index from the collection of gathered documents, and, (2) the Query Analyzer, that accepts user queries, searches the index for documents matching the query, and return the references to these documents in an understandable
199
Figure 1. Typical organization of a WSE.
form. Query results are returned to users sorted by rank, a kind of relevance judgment that is an abstract concept largely linked to users taste. Ranking is performed on the basis of an IR model that allows to represent documents and queries, and to measure their similarity. In general, as the size of the indexed collection grows, a very high precision (i.e. number of relevant documents retrieved over the total number of documents retrieved) has to be preferred even at the expense of the recall parameter (i.e. number of relevant documents retrieved over the total number of relevant documents in the collection). In other words, since users usually only look at the first few tens of results, the relevance of these top results is more important than the total number of relevant documents retrieved. In order to grant high precision and computational efficiency, WSEs usually adopt a simple Weighted Boolean IR model enriched with highly effective ranking algorithms which consider the hyper-textual structure of web documents 1 ' 2 . Moreover, due to its compactness, most WSEs adopt an Inverted List (IL) organization for the index. An IL stores the relations among a term and the documents that contain it. The two main components of an IL index are: (1) the Lexicon, a lexicographically ordered list of all the interesting terms contained in the collection, and, (2) the Postings lists, lists associated to each term t of the Lexicon containing the references to all the documents that contain t. Many large-scale WSEs such as Google, Inktomi and Fast, exploit clusters of low-cost workstation for running their engines, but, unfortunately, very few papers regard WSE architecture design 1,3 , since most developments were done within competitive companies which do not publish technical details. On the other hand, many researchers investigated parallel and/or distributed IR systems 4 ' 5 ' 6 ' 7 ' 8,9 focused on collections of homogeneous documents. Lin and Zhou7 implemented a distributed IR system on a cluster of workstations, while Lu 8 , simulated an interesting distributed IR system on a Terabyte collection, and investigated various distribution and replication strategies and their impact on retrieval efficiency and effectiveness.
200 SabcollectkHis
Indexing
Figure 2. Indexing phase.
3
M O S E Structure
The IR core of MOSE is composed of the Indexer and the Query Analyzer (QA) modules. In this paper we only briefly surveys indexing issues, and focus our attention on the QA whose functionalities are carried out by two pools of parallel processes: Query Brokers (QBs) and Local Searchers (LSs). MOSE parallel and distributed implementation exploits a data-parallel technique known as document partitioning . The spidering phase returns p subcollections of documents with similar sizes. The subcollections are then indexed independently and concurrently by p parallel Indexers (see Figure 2). The result of the indexing phase is a set of p different indexes containing references to disjoint sets of documents. The p indexes are then taken in charge by a data-parallel QA whose task is to resolve user queries on the whole collection. To this end the QA uses k QBs and p LSs. The k QBs run on a front-end workstation, and fetch user queries from a shared message queue. Every fetched query is then broadcast to the associated p LSs (workers), possibly running on different workstations. The p LSs satisfy the query on the distinct subindexes, and return to the QB that submitted the query the first I references to most relevant documents contained within each subcollection. The QB waits for all the / • p results and chooses among them the / documents with the highest ranks. Finally, such results are returned to the requesting user. Figure 3 shows the logic structure of the MOSE architecture. A QB, along with the p associated LSs, implements a data parallel worker which concurrently serve the user queries. In order to manage concurrently more queries and to better exploit LSs' bandwidth, k QBs are introduced within a QA. System performances can be furthermore increased by replicating the QA in n copies. All the parallelization strategies depicted in Section 1 can be thus realized by choosing appropriate values for n, k, and p. A pure task parallel approach
201
Figure 3. Structure of MOSE Query Analyzer.
corresponds to p = 1, while n > 1 and/or k > 1. By choosing p > 1, n = 1 and fc = 1 we obtain a pure data-parallel implementation. A hybrid task + data parallel strategy is finally obtained for p > 1, while n > 1 and/or k > 1. Indexer. The Indexer has the purpose of building the index from the gathered web documents. The indexing algorithm used is a parallel version of the Sort Based algorithm which is very efficient on large collections due to the good compromise between memory and I/O usage 2 . Moreover, the index built is Full Text and Word Based. The Lexicon is compressed exploiting the common prefixes of lexicographically ordered terms (Shared Prefix Coding), while the Postings lists are compressed by using the Local Bernoulli technique 2 . MOSE parallel Indexer exploits the master/worker paradigm and standard Unix SysV communication mechanisms (i.e. message queues). Since each subcollection of web documents is indexed independently (and concurrently on different workstations), the current Indexer implementation exploits parallelism only within the same SMP architecture. The master process scans the subcollection, and sends the reference to each document (i.e. the file offset) along with a unique document identifier to one of the worker processes on a self-scheduling basis. The workers independently read each assigned document from the disk and indexes it. When all documents have been processed, the workers write their local indexes to the disk, and signal their completion to the master. At this point the master merges the local subindexes in order to create a single index for the whole subcollection. A distributed implementation of the Indexer could be easily derived, but should require all the processing nodes to efficiently access the disk-resident subcollection, and that at least a single node can access all the subindexes during the merging phase. Query Broker. Each QB loops performing the following actions: Receipt and broadcasting of queries. Independently from the mechanism ex-
202
ploited to accept user queries (e.g., CGI, fast CGI, PHP, ASP), user queries are inserted in a SysV message queue shared among all the QBs. Load balancing is accomplished by means of a self scheduling policy: free QBs access the shared queue and get the first available query. Once a query is fetched, the QB broadcasts it to its p LSs by means of an MPI asynchronous communication. Receipt and merge of results. The QB then nondeterministically receives the results from all the LSs (i.e., p lists ordered by rank, of I pairs document identifier, and associated rank value) . The final list of the / results with the highest ranks is than obtained with a simple 0(1) merging algorithm. Answers returning. The list of / results is finally returned to the CGI script originating the query that transforms document identifiers into URLs with a short abstract associated, and builds the dynamic html page returned to the requesting user. Local Searcher. LSs implement the IR engine of MOSE. Once a query is received, the LS parses it, and searches the Lexicon for each terms of the query. Performance of term searching is very important for the whole system and are fully optimized. An efficient binary search algorithm is used at this purpose, and a Shared Prefix Coding technique is used to code the variable length terms of the lexicographically ordered Lexicon without wasting space 2 . Minimizing the size of the Lexicon is very important: a small Lexicon can be maintained in core with obvious repercussions on searching times. LS exploit the Unix mmap function to map the Lexicon into memory. The same function also allows an LS to share the Lexicon with all the other LS that run on the same workstation and process the same subcollection. Once a term of the query is found in the Lexicon, the associated posting list is retrieved from the disk, decompressed, and written onto a stack. The LS then processes bottomup query boolean operators whenever their operands are available onto the top of the stack. When all boolean operators have been processed, the top of the stack stores the final list of results. The / results with the highest ranks are then selected in linear time by exploiting a max-heap data structure 2 . Finally, the I results are communicated to the QB that submitted the query. 4
Experimental Results
We conducted our experiments on a cluster of three SMP Linux PCs interconnected by a switched Fast Ethernet network. Each PC is equipped with two 233MHz Pentiumll processors, 128 MBytes of RAM, and an ULTRA SCSI II disk. We indexed 750.000 multi-lingual html documents contained in the CDs of the web track of the TREC Conference and we built both a monolithic
203 Task Parallel: one va two WSs (2 QBs per L.S, 5000 emeries)
Task Parallel vs. Hybrid (2 QBs per LS, 5000 queries) T P - ^ TP + DP — * - -
\ s 1 E
"\. ,,._ 10000 number of QAs (n)
(a)
number of LSs (p * n)
(b)
Figure 4. Results of the experiments conducted.
index (p = 1) and a partitioned one (p = 2). The monolithic index contains 6.700.000 distinct terms and has a size of 0.96 GBytes (1.7 GBytes without compression), while each one of the two partitions of the partitioned index occupy about 0.55 GBytes. The queries used for testing come from an actual query log file provided by the Italian WEB Search Company IDEARE S.p.A. We experimented Task-Parallel (TP), and hybrid (TP + DP) configurations of MOSE. We mapped all the QBs on a single workstation, while the LSs were placed on one or both the other machines. Independently of the configuration used (one or two index partitions), two QBs were introduced (k = 2). Figure 4.(a) reports the average elapsed times, i.e. the inverse of the throughput, required to process each one of 5000 queries for the TP case (p = 1) as a function of n, i.e. the number of QAs exploited. The two curves plotted refer to the cases where the LSs were mapped on one or two SMP machines. We can see that when two QAs are used they can be almost indifferently placed on one or two SMP machines, thus showing the efficacy of the sharing mechanisms used. On the other hand, as we increase the number of QAs, the difference between exploiting one or two machines increases as well. We can also observe that it is useful to employ more QAs than the available processors. Figure 4.(b) compares the TP solution with the hybrid one (TP + DP). Testing conditions were the same as the experiment above. In the case of the hybrid configuration, all the LSs associated with the same partition of the index were placed on the same workstation in order to allow the LSs to share the lexicon data structure. The better performance of the hybrid approach is evident. Superlinear speedups were obtained in all the TP + DP tests. They
204 derive from a good exploitation of memory hierarchies, in particular of t h e buffer cache which virtualize the accesses t o the disk-resident posting lists. 5
Conclusions
We have presented t h e parallel and distributed architecture of MOSE, and discussed how it was designed in order to efficiently exploit low-cost clusters of workstations. We reported the results of preliminary experiments conducted on three S M P workstations. T h e results highlighted the greater performances resulting from exploiting a hybrid Task + Data parallelizafcion strategy over a pure Task-parallel one. There are a lot of i m p o r t a n t issues we plan t o investigate in the near future. The most i m p o r t a n t is performing an accurate testing of M O S E on larger clusters and document collections in order t o analyze in greater detail the scalability of the different parallelization strategies. Fastest interconnection network such as Myrinet have also t o be tested. Moreover, we are interested t o study query locality and the effectiveness of caching their results within QBs, and "supervised" document partitioning strategies aimed at reducing the number of index partitions needed t o satisfy each query. References 1. S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW7 / Computer Networks, vol. 1-7, pages 107-117, April 1998. 2. I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes - Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers Inc., 1999. 3. Fast Search &: Transfer ASA. The fast search server - white paper. Technical report, Fast Search & Transfer ASA, December 1998. 4. LA. Macleod, T.P. Martin, B. Nordin, and J.R. Phillips. Strategies for building distributed information retrieval systems. Information Processing & Management, 6(23):511-528, 1987. 5. T.P. Martin, I.A. Macleod, J.I. Russell, K. Lesse, and B. Foster. A case study of caching strategies for a distributed full text retrieval system. Information Processing & Management, 2(26), 1990. 6. F.J. Burkowski. Retrieval performace of a distributed text database utilizing a parallel process document server. In Proc. of the Intern. Symp. On Databases in Parallel and Distributed Systems, Doublin, Ireland, July 1990. 7. Z. Lin and S. Zhou. Parallelizing I/O intensive applications for a workstation cluster: a case study. Computer Architecture News, 5(21), December 1993. 8. Zhihong Lu. Scalable Distributed Architectures for Information Retrieval. PhD thesis, University of Massachussets Amherst, 1999. 9. B. Cahoon, K.S. McKinley, and Z. Lu. Evaluating the performance of distributed architectures for information retrieval using a variety of workload. IEEE Transactions on Information Systems, 1999.
ACTIVE CONTOUR BASED IMAGE SEGMENTATION: A PARALLEL COMPUTING APPROACH
V. POSITANO, M.F. SANTARELLI, A. BENASSI, C.N.R. Institute of Clinical Physiology, Via Moruzzi, 1 - 56124, Pisa, Italy [email protected]. it C. PEETRA,
NETikos Web Factory, Via Matteucci 34b, Pisa, Italy L. LANDINI Department of Information Engineering: EIT, Via Diotisalvi, 2 - 56126 Pisa, Italy Segmentation is a fundamental task in image processing, but its hard computational requirements represent a not easily defeated drawback, mainly for tasks in which real time computing is a primary requirement, such as critical medical applications. In the present paper a parallel algorithm for image segmentation will be proposed, which allows fast convergence for active contour algorithm. A parallel computing approach to this problem has led to develop a multithreaded application based upon a parallel algorithm that allows reducing processing time. The algorithm has been implemented on a shared memory multiprocessor machine, using ANSI C++ language and Posix Threads libraries, in order to exploit code portability. Two basic schemes have been proposed for parallelism development and performance issues have been analyzed for both implementations.
1
Introduction
Today available medical imaging systems are able to collect a massive amount of data in short time. As an example, the number of images acquired by a magnetic resonance scanner in a typical cardiovascular examination can vary from 200 to 1000. Due to the large amount of data provided by acquisition devices, the traditional way to perform data analysis (i.e. sequential examinations of images and mental three-dimensional reconstruction) becomes ineffective. So that, software tools for automatic segmentation and quantitative analysis are needed to fully exploit the features of the today available medical devices [1,2]. One of the most important drawbacks in image segmentation is the very high computational complexity of the related algorithms, that require a large amount of computing time, especially for analysis of large medical images or continuous motion tracking applications; the use of a parallel computing approach can answer to this problem [3], as some of the authors showed in a previous works [4,5]. The
205
206
need of fast parallel algorithms for image segmentation emphasizes the meaning of new parallel computing solutions from both software and hardware perspective. In this paper we propose a parallel implementation of an active contour algorithm, also known as "snake" algorithm, for the automatic image segmentation. The snake algorithm was demonstrated effective in medical image segmentation [1,2,6,7], because it preserves the continuity of the segmented anatomical tissues and it can be used to dynamically track the motion of moving structures, as the human hearth. A lot of improvements of the original active contour algorithm [8] were proposed in the literature. With the aim of taking advantage of the processing power offered by a shared memory multiprocessor, we have implemented a multithreaded version of the "Greedy Algorithm" proposed by Williams and Shah [9] rearranging its basic idea to match the logics of a parallel computing approach. We chose to use a shared memory multiprocessor machine due to its programming facilities about load balancing issues and its flexibility about "allpurpose" applications; on the other hand, the need for "ad hoc" algorithms to exploit the inherent parallelism of an application is a central requirement, in order to exploit all the available processing power. The Posix threads [10] freely available libraries allow implementing such application increasing both code portability and the use of all available system resources in a "built-in" optimized way. Moreover, SMP machines are often used as console of medical devices (e.g. NMR, PET), so that the development of algorithms able to exploit SMP platforms it is of great interest. 2
Algorithm description
Active contour algorithms literature starts from the original Kass, Witkin and Terzopoulos [8] solution, which uses the variational calculus technique. The Williams and Shah [9] approach was based on dynamic programming; its goal was to enhance the existing method, preventing some problems including numerical potential instability. The last algorithm we referred to is the most computationally efficient one, so it has been chosen as a starting point to develop our parallel solution. An active contour is defined as a closed curve s(x,y) that moves through the spatial domain of an image to minimise the following energy functional: i
E
= J { £ t a t [*(x,y
)]+£„,
E = 1 - cL (x,vj +fis [x,y\ Hit
[s(x,
y )]}
dxdy
(2.1) (2.2)
/*
where Eint represents the internal energy of the contour, due to bending or discontinuities; it contains a first-order term, which increases where there is a gap in the curve, and a second-order continuity term which increases where the curve is
207 bending rapidly. The "Greedy" algorithm prevents the shrinking of the curve, encourages the even spacing of the points and minimizes the formula complexity that has been used to give a reasonable estimate of curvature. The elastic parameters a and |3 are to be chosen to realize the utmost homogeneous behavior for contour evolution; guidelines for the most effective choice are given in the Williams and Shah's paper [9]. Eal is the image energy, which depends on various features: the ones presented by Kass et al. are lines, edges and terminations. Given a gray-level image I(x,y) viewed as a function of continuous position variables (x, y), typical external energies designed to lead an active contour toward step edges are:
^H
V /
M
2
or Eex,=^(Ga(x,y)*i(x,y)f
where Ga(x,y) is a two-dimensional Gaussian function with standard deviation crand V is the gradient operator. In the "Greedy" algorithm, the snake is treated as a circular array of coordinates, assuming to take the same point as first and last one of the contour. The algorithm is an iterative one; during each iteration, for each snake point Si the energy function is computed for the image pixel corresponding to the current snake point location and for its m neighbors (figure 1, left). The location having the smallest energy value (say m8) is chosen as the new position of Si (figure 1, right). Si-1 has already been moved to its new position during the current iteration. Its location is used with mat of each of the proposed locations for Si to compute me first-order continuity term.
Figure 1: The greedy algorithm
The location of Si+1 has not yet been moved. Its location, along with that of Si-1, is used to compute the second-order constraint for each point in the neighborhood of Si. For this reason SO is processed twice, one as the first point in the list, one as the
208
last point. This helps to make its behavior more like that of the other points. The iterative process continues until the number of points moved at last iteration isn't small enough to discourage further computations. A Pseudo-code for "Greedy" algorithm is given in [9]. 3
A parallel solution
Our basic idea is to divide up the entire snake in some portions that are to be elaborated by different "processes". As described in the previous section, in the "Greedy" algorithm the behavior of a snake point depends only upon the couple of neighboring points. This allows to create portions of snake sharing only the edge points. In figure 2, a 12-points snake (SO, ..., S l l ) is divided in three portions (c0,..,c4 c0,..., c4 c0,...,c4). For each iteration, every snake portion is updated by a different process (A, B, C processes). The edge points are elaborated by two different processes because they belong to two snake portions. As example, in figure 2 the snake point SO is used by both A and C processes (aO and c4 points). At the iteration end, the first point of each snake portion is updated with the value of the last one in the previous portion, in our example the aO point is update by the c4 point. In this way, each edge point is treated as the single (first and last) edge point in the sequential solution, as described in section 2. This allows obtaining a more homogeneous snake behavior. The overlapping points introduce a computational complexity overhead; in fact the original complexity of the algorithm was 0(n*m), where n is the number of snake points and m is the neighborhood area (see figure 1); the parallel algorithm, instead, must elaborate n + k points for the snake, considering that if A: is the number of snake portions that have been thought to evolve each on its own, k is also the number of overlapping points. The introduction of overlapping points responds to the need for dividing the snake and distributing the work between processes with the minimum requirement for communication and synchronization between processes, without loosing the continuity of the contour; in fact, a couple of overlapping points must be exchanged between each pair of neighboring processes at each iteration of the algorithm to let the snake evolving as a whole. The segmentation of the snake and the separated evolution of each snake's part introduce an unavoidable "error", that is a difference between the snake deformed as a whole (by the sequential algorithm) and the one deformed in an autonomous way, portion by portion (by the parallel algorithm). In order to reduce this kind of error, we vary at each new iteration the points where the breaking up of the snake takes place. This avoids the persistence of the error in the same area, "spreading" instead the error all along the entire contour. Some validation tests have proven that the error tends to disappear when the number of iterations is large enough.
209
Figure 2. Snake partitioning.
4
Multithreaded implementation of the parallel algorithm
We chose the multithreaded programming paradigm for the algorithm's implementation in order to exploit the "lightness" of threads in comparison with real processes. The absence of a real context switching during the commutation of the threads that are executing on various processors, and the address space sharing between threads, is a good feature for low-level synchronization machineries effectiveness in a shared memory environment. Furthermore such a choice allows to transparently gaining good scalability features, due to me underlying automated "hardware" load balancing mechanism. We have proposed two alternative schemes for parallelism exploitation: the first one (Al) includes parallel procedures (threads) being called repeatedly with each new iteration of the algorithm. This allows to reduce the need for communication and synchronization between threads, charging the parameter passing mechanism with those tasks; nevertheless the algorithm suffers from the continuous switching between parallel sections of code and short but unavoidable sequential parts that have the task to prepare the parameters for the call of the next iteration's parallel procedures. The second scheme (A2), instead, has parallel procedures that can be execute without being interrupted for all the needed iterations. In A2 scheme, we need to provide a communication and synchronization mechanism between threads, which allows exchanging the edge, points between threads at each new iteration. This scheme allows exploiting the parallelism much better then the first one, but it someway requires a communication overhead. The overhead is less invalidating the
210
algorithm performances if it can be redeemed with longer single iterations (larger snake portions). 5
Results and conclusions
The algorithms have been tested on a Sun Ultra Sparc 4450 (4 UltraSparc II 450 Mhz processors, 4Mb L2 cache, 2Gb Ram.) using Solaris 8 as Operative System. The two algorithms previous described (i.e. Al and A2) have been tested by changing three parameters: the total number of snake points, the number of iterations and the number of parts in which the snake is divided, i.e. the number of concurrent threads. The total number of snake points is related to the distance between two adjacent points and consequently to the size of the smallest image features segmented by the algorithm. The number of iterations is related to the maximum range in which the image features can be detected. As example, in table 1 the processing times relevant to 300 algorithm iterations for Al, A2 and the sequential (S) algorithms are showed. A 1024x1024 image was used. This test was performed splitting the snake in 4 parts. Points Al time A2 time S time Speed-up A1 Speed-up A2
300 5.15 5.72 13.92 2.70 2.43
500 8.24 9.27 22.82 2.77 2.46
700 10.55 11.72 32.17 3.05 2.74
900 12.85 12.83 41.20 3.20 3.21
1200 16.50 16.98 54.86 3.32 3.22
1600 21.22 21.48 73.19 3,35 3.40
2000 36.13 27.66 91.43 3.23 3.30
2600 40.39 34.53 118.8 3.29 3.44
Table 1: processing time (sec.) and speed-up for the two algorithms (4 threads, 300 iterations, m=9)
Table 1 shows that the performances of A2 get better when the snake has many points, because each iteration is longer and the synchronisation time, that is a constant quote, can be better redeemed. In a second test, the algorithm complexity was fixed setting at a constant value the product between the number of iterations and the number of snake points. The performance of the two algorithms was tested changing the thread number. As showed in figure 3, the algorithm performance increases with the number of snake point, because the weight of the computation time related to the overlapping points decreases when the snake portion size increases.
211 H 4 threat •Stfiisa* 3$y~ Q16threads _ a32threads 3-^—
i 266256 512/12B 1024*4 2048/32 4096/16
128/512 256/256 51S1SB 102*64 204932 4096'16
pofcWIterdiGns
poinHAtrndions
Figure 3: Speed-up for Al and A2 algorithms (m=9).
Using a high number of threads with both algorithms can introduce a communication overhead, much more visible if the snake is composed by a small number of points. The same amount of data exchange can be spread upon larger portions, enhancing the performances, if each portion is large enough. On the other hand, a large number of threads minimizes the loss of performance due to the thread synchronization at the end of each iteration. That is, at the end of each iteration, all threads have to wait for the slowest one. Increasing the number of threads, the difference among execution times will be reduced. points/Iterations 128/512 20 15 E 10 » E 5
Figure 4: Speed improvement for A2 algorithm respect to Al (m=9).
In figure 4, the speed improvement in percentage obtained using the A2 algorithm instead of the Al one is showed. The algorithm Al works better for a little number of points, the algorithm A2 allows better performance for a big one. For a number of threads equal to 8, the A2 algorithm get better for any number of points/iterations. From figure 3, it is clear that a thread number equal to 8 is a good
212 compromise in order to obtain good performances for every value of point/iterations index. In conclusion, real time image processing and motion tracking applications can get advantage from the use of parallel algorithm. An algorithm for image segmentation using the active contour technique was proposed, that have proven to be suitable for shared memory multiprocessors, responding to the need of speeding up segmentation. Moreover, the use of multi-purpose machine and portable libraries allows to spread the use of parallel computing technique on me medical community using the resources available in a typical medical environment. References
1. Santarelli M. F, Positano V., Landini L., Benassi A. A New Algorithm for 3D Automatic Detection and Tracking of Cardiac Wall Motion. Computers in Cardiology , IEEE, Los Alamitos, 1999; pp. 133-6. 2. Ayache, I. Cohen, and I. Herlin. Medical image tracking. In A. Blake and A. Yuille, eds., Active Vision, Chapt 17, MIT-Press, 1992. 3. Stacy M., Hanson D., Camp J. and Robb R. A., High Performance Computing in Biomedical Imaging Research. Parallel Computing 24:9(1998) 1287-1321. 4. Positano V., Santarelli M. F., Landini L., Benassi A. Using PVM on computer network to perform fast pre-processing of large medical data set, Par. Comp.Fund. and Appl. Proceeding of ParCo99, ICP.pp. 185-192 5. Positano V., Santarelli M.F., Landini L., Benassi A., Fast and Quantitative Analysis of 4D cardiac images using a SMP architecture. PARA'98, Lect. Notes in Comp. Science, No. 1541, Springer-Verlag, 1998, pp 447-451. 6. Cohen L.D., and Cohen I. Finite element methods for active contour models and balloons for 2D and 3D images. IEEE Trans On Pattern Analysis and Machine Intelligence, 15:11,Nov. 1993.pp. 1131-1147. 7. Avedisijan A. et. alt., CAMRA: Parallel application for segmentation of left ventricle (LV) in short cardiac axis MR images. Med. Imag. Understanding and Analysis, July 1999. 8. Kass M., Witkin A., Terzopoulos D. Active contours models. Int. J. Comp. Vision., 1987, pp. 321-331. 9. Williams D. J., Shah M. Fast Algorithm for Active Contours. Image Understanding, vol. 55, January, pp. 14-26, 1992. 10. Sun Microsystem, "Multithreaded Programming Guide", 2000.
PARALLELIZATION OF A N U N S T R U C T U R E D FINITE V O L U M E SOLVER FOR T H E M A X W E L L EQUATIONS J. RANTAKOKKO AND P. EDELVIK Uppsala University, Information Technology, Department of Scientific Box 120, SE-75104 Uppsala, Sweden E-mail: [email protected]
Computing,
An unstructured finite volume solver in 3D has been parallelized with OpenMP. The code is used by the Swedish industry for electromagnetic computations. Results from a grid around a small aircraft show good scaling on a SUN Enterprise 6000 server with 16 processors. A parallelization strategy for a distributed memory model is also presented and some preliminary results are given.
1
Introduction
The Maxwell equations are a mathematical formulation of propagating electromagnetic waves. The equations describe phenomena such as scattering of radar signals from airplanes or radiation from mobile phones. Modeling these phenomena for high frequencies is computationally very demanding, especially on complex geometries in 3D. The Maxwell equations can be solved very efficiently with a finite difference scheme on a staggered grid using the classical FD-TD method proposed by Yee5. Also, the structured grid makes it easy to parallelize the computations. Very large and realistic problems can be solved with this method. The drawback with structured grids is that it is difficult to model complex geometries accurately. Unstructured grid methods such as finite element and finite volume solvers can better resolve the details in the geometry but are in general less efficient than a structured grid solver. A remedy is then to combine the unstructured and structured grid methods in a hybrid method, i.e. use an unstructured grid near the object and then connect this to a structured grid in the outer region. In this paper we will describe the parallelization of an unstructured finite volume solver. The solver will be used as one part in a hybrid solver as described above but can also be used as a stand alone solver. Due to the complex and unstructured data dependencies we have chosen to use OpenMP for the parallelization but a strategy for a distributed memory model using MPI will also be discussed.
213
214
2
Finite volume solver
The finite volume solver is based on the integral formulations of Faraday's and Ampere's laws. The geometry is discretized with an unstructured staggered grid using tetrahedrons in the primary grid, generated by a Delaunay grid algorithm. A dual grid, the Dirichlet tessellation, is then constructed with nodes in the centers of each tetrahedron. The electric field variables reside normal to the dual faces and in each primary node, while the magnetic field variables are located normal to the triangular faces and in each dual node. Additional variables are used on the edges of the two grids. Furthermore, the time derivatives are approximated with an explicit time-stepping scheme, the third-order Adams-Bashforth method. The calculations are organized such that as much as possible is computed initially. The update of the field variables is reduced to matrix-vector and vector-vector operations. The matrices are stored in compress sparse row format. For further details on the solver see Edelvik2.
3 3.1
Shared memory model Parallel implementation
There are eight matrices with different structure that are used to update the field variables in each time step. The more or less random structure of the matrices makes it very difficult to do an efficient distributed memory parallelization. On the other hand, using a shared memory model is straightforward. Thus, we have chosen to use OpenMP for our parallelization of the solver. The only critical issue is the load balance. The number of nonzero elements in the rows of the matrices vary between three up to about twenty. A static decomposition over the rows would then cause a load imbalance in the matrix-vector multiplications. This was solved by using the dynamic scheduling directives of OpenMP. Now, only the chunk size parameter must be tuned for the application in order to optimize the parallel performance. Minor modifications of the code were sufficient to port it to OpenMP. We had to rewrite some of the Fortran 90 array syntax to be able to parallelize the loops and to avoid temporary array allocations. Temporary arrays can degrade the parallel performance since the memory allocation is handled sequentially. This did speed up the serial performance as well.
215
3.2
Performance results
We have run the solver on three different parallel computers that support OpenMP; SUN Enterprise 6000 (UltraSparc 2), SGI Onyx2, and SGI Origin 2000. The characteristics of the computers are given in Table 1 and the compiler options that we have used are found in Table 2. The application we have run is the generic aircraft RUND, see Figure 1. The grid consists of 131000 tetrahedrons and 32500 primary nodes. A snapshot of the computed solution is shown in Figure 2.
Figure 1. Part of the unstructured grid around the aircraft RUND.
Figure 2. A snapshot of the computed solution showing surface currents.
The serial code runs fastest on SGI Origin but the SUN server gives the best parallel scaling. As we can see in Figure 3, the code runs at the same speed on all three machines using 8 threads. When using more threads SUN will give the best performance. (It was not possible to use more than eight threads on the two SGI machines at the time of the experiments.)
216 15 - « - SGI ONYX - B - SGI ORIGIN 2000 - * - SUN E6000 Ull
10
/
Q.
/
a
/
/ / / f / / //
LLI UJ Q. W
5"
/
/
/ . /
/
tit
//
V
A
10 15 20 NUMBER OF THREADS
25
30
(a) Fixed size speedup
SGI ONYX SGI ORIGIN 2000 SUN E6000 Ull
10 15 20 NUMBER OF THREADS
30
(b) Runtime per iteration
Figure 3. Speedup and runtime per timestep from the OpenMP parallelization. different parallel computers are compared. The application is the generic aircraft modeled with an unstructured grid consisting of 131000 tetrahedrons.
Three RUND
217 Table 1. Hardware configurations of the three computers.
SGI Origin gives a noticeable drop in the parallel efficiency going from two to four threads. This is probably due to its memory hierarchy. The main memory is distributed over the nodes and at each node there are two CPUs. Going to four threads leads to remote node memory accesses which is slower than the on-node memory accesses. We get the same effect on the Sun system going from 14 to 16 threads. The Sun system consists of two servers with 16 processors each. The servers have their own memory but are connected with the Wildfire interconnect. For more than 29 threads the performance will drop as several threads are then scheduled to the same processors. 4 4.1
Distributed memory model Data distribution strategy
We have an unstructured grid consisting of tetrahedrons. Additionally, there is a dual grid with nodes in the centers of the tetrahedrons. There are six solution variables residing at different locations of the grids, as explained in Section 2. This is illustrated in Figure 4 below. The variables depend only on their nearest neighbors, either in the primary grid or in the dual grid. An obvious data distribution strategy would then be to use domain decomposition and partition the grids with a graph partitioning algorithm. This would minimize the data dependencies between the processors, i.e. minimize the communication. The partitioning of the nodes in the tetrahedral grid can be done with standard methods. A layer of ghost points overlapping the neighbor partitions can be added. The edges are then
218
Figure 4. A cell in the primary grid and a dual face.
assigned to the same processors as the nodes. In the ghost point layer the edges are replicated in the two partitions. The partitioning of the dual grid can follow the partitioning of the primary grid in the same way as the edges but then some of the nodes may be replicated in several partitions, at most in four (the four surrounding nodes of a tetrahedron may all be in different partitions). 4-2
Implementation
issues
With a local numbering of the nodes within each partition and a layer of ghost points overlapping neighbor partitions it will be possible to re-use the serial solver for each partition. Moreover, to distinguish between ordinary nodes and ghost points, the ghost points can be gathered at the end of the arrays. The ghost points are updated by the neighbor processors and are then communicated. Thus, the solver needs to be extended with functionality for partitioning the grids and the corresponding field variables, translation from global to local numbering, reassembling or reorganization of the matrices for the local numbering of the field variables, distribution from global to local arrays, communication of ghost points, and gathering from local to global arrays. Also, data redistribution routines may be needed in a parallel hybrid model. The partitioning of the grids can be done with the graph partitioning package Metis, Karypis 4 . To minimize the communication overheads in updating the ghost points we utilize a communication table. The communication table is computed once
219
and for all initially and is then used for packing and unpacking data in and out of the communication buffers. It keeps track of which elements should be communicated and with which processor. The actual communication is implemented with the asynchronous operations MPLIsend and MPLIrecv. The requests are handled with first come service, i.e. with MPLWaitany. Letting the system decide in which order to process the messages minimizes the total wait time for the communication requests.
4-3
Performance results
All the above described functionality has been implemented for parallelization of one specific operation in the solver, the electric field filtering operator at the tetrahedral nodes. The code was organized such that all computations were duplicated on all processors, except for the filtering of electric node values which was parallelized. In practice, we ran the serial code with all data on all processors simultaneously until the filtering operation. At this point, the data was distributed to private local arrays and the filtering operation was done in parallel. Then, the data was gathered globally again continuing in the serial fashion with the other operations. Now, only a small fraction of the solver was parallelized. The other operations are similar matrix-vector updates as the filtering operation. Hence, by timing this part of the code we will get an idea of how the full parallel code will perform. Again, we have run the solver with the generic aircraft RUND. The results are summarized in Table 3. We could not run on more than four processors as all data was replicated on all processors and the parallelization required additional memory. We simply ran out of memory. In the full parallel code the data will not need to be replicated on all processors, it will be distributed. Hence, we will then be able to run larger and larger problems as we increase the number of processors.
Table 3. Performance results from the distributed memory model using the RUND geometry. The timings are from IBM SP2 for one filtering operation of the electric node values.
Proc Time
1 0.383
2 0.199
4 0.114
The speedup is modest, 3.35 on four processors. This is somewhat better than the OpenMP version on the two SGI machines.
220
5
Further work
The performance of the OpenMP version can be improved by reordering the nodes and the edges with, for example, the Reverse Cuthill McKee algorithm. This will give better data locality and reduce the off-node data accesses. Also, the serial performance can be significantly improved, Anderson 1 , due to less cache misses. Furthermore, the MPI version needs to be completed to be able to run the code on distributed memory machines that do not support OpenMP. Also, for very large problems that require many processors we may need MPI to keep up the scalability. It seems that OpenMP may have some limitations in scalability for large processor configurations due to memory hierarchies. In the full hybrid solver, a parallelization strategy could be to block the grids geometrically letting each unstructured part be a complete block. The unstructured parts should anyway be kept small due to efficiency reasons. We can then use a dual paradigm programming model combining MPI and OpenMP. We distribute the blocks to shared memory nodes. Within the nodes we use OpenMP and between the nodes MPI for the communication. Then, it would not be necessary to parallelize the unstructured parts with MPI and we would still get good scaling. 6
Conclusions
We have parallelized an unstructured finite volume solver for the Maxwell equations using OpenMP. We have run the code on three different parallel computers, SUN Enterprise 6000, SGI Onyx2, SGI Origin 2000, that support the shared memory programming model. The results from SUN show very good scaling of the code up to 14 threads. The problem size is small, only 131000 tetrahedrons. Large applications can contain several millions of tetrahedrons, Mavriplis3. For full size applications the code has a good potential to scale very well. From the run-times we can clearly see the impact of the memory hierarchy to OpenMP. Going from local memory accesses to remote node memory accesses gives a significant drop in efficiency and a decrease of the slope in speedup. This indicates a limitation of the scalability of the OpenMP version for large processor configurations with distributed memory. To keep up the efficiency MPI will be needed. We have outlined a strategy to parallelize the code with a distributed memory programming model, i.e. with MPI. Preliminary results show a performance similar to the OpenMP code for a small number of processors. In
221
conclusion, it was far simpler to parallelize the code using OpenMP than using MPI. Finally, we have also discussed the idea of a dual paradigm model combining OpenMP and MPI for a hybrid solver. Acknowledgments The work presented in this paper was performed within the framework of the Parallel and Scientific Computing Institute (PSCI) in cooperation with the Swedish industrial partners Ericsson and SAAB. Computer time was provided by the National Supercomputer centers, PDC at Royal Institute of Technology in Stockholm and by NSC at Linkoping University. References 1. W.K. Andersson, W.D. Gropp, D.K. Kaushik, D.E. Keyes, B.F. Smith, Achieving High Sustained Performance in an Unstructured Mesh CFD Application, ICASE Report No. 2000-2, NASA Langley Research Center, Hampton, Virgina, USA. 2. F. Edelvik, Finite Volume Solvers for the Maxwell Equations in Time Domain, Licentiate thesis 2000-005, Department of Information Technology, Uppsala University, Box 120, S-751 04 Uppsala, Sweden. 3. D. Mavriplis, Parallel Performance Investigations of an Unstructured Mesh Navier-Stokes Solver, ICASE Report No. 2000-13, NASA Langley Research Center, Hampton, Virgina, USA. 4. G. Karypis, V. Kumar, Metis: Unstructured Graph Partitioning and Sparse Matrix Ordering System, Technical Report, University of Minnesota, Department of Computer Science, Minneapolis, 1995. 5. K. S. Yee, Numerical solution of initial boundary value problems involving Maxwell's equations in isotropic media, In IEEE Trans. Antennas Propag., 14(3):302-307, March 1966.
A N H Y B R I D O P E N M P - M P I PARALLELIZATION OF THE P R I N C E T O N O C E A N MODEL G. S A N N I N O , V. A R T A L E ENEA,
This paper deals with the parallelization of one of the most popular threedimensional oceanographic model: the Princeton Oceean Model (POM). The parallelization is achieved using standard tools like MPI and OpenMP, to ensure portability of the code. A robust and efficient domain decomposition method is used to solve, in principle, large scale examples on clusters of shared memory machines. A comparison between the pure MPI and the hybrid M P I + O p e n M P is shown in terms of elapsed time for a standard seamount problem varying the number of grid points and cpus.
1
Introduction
Ocean models represent one of the fundamental tools to investigate the physics of the ocean. In the last twenty years they have been successfully applied to a wide range of oceanic and climate problems. The numerical methods used by ocean models consist in discretizing the Navier-Stokes equations on a three dimensional grid and computing the time evolution of each variable for each grid point. Nowadays, to answer the huge amount of computational demand raised to conduct simulation on high resolution computational grid, a parallelization strategy is needed. In this paper we present a hybrid OpenMP-MPI parallel version of the Princeton Ocean Model (POM), one of the most used ocean model, and evaluate the parallel performance obtained. In particular in section 2 we shortly describe the ocean model features and algorithm, in section 3 we illustrate the technique used to parallelize the serial code and in section 4 we report the performance obtained. Conclusions are summarized in section 5. 2
Princeton Ocean Model (POM) description
The Princeton Ocean Model is a three-dimensional primitive equation model, i.e., solves in finite difference form the Navier-Stokes equations along with a
222
223
nonlinear equation of state which couples the two active tracers (temperature and salinity) to the fluid velocity 7 . It has been extensively applied to a wide range of oceanic problems including estuarine and shelf circulation studies 1, data assimilation in the Gulf Stream 8 and general circulation studies in the Mediterranean Sea 10 . The model algorithm uses an explicit differencing scheme for the two horizontal velocity components, the temperature and the salinity; the other variables, the pressure, the density and the vertical component of velocity are calculated by implicit differencing. The model also solves a set of vertically integrated equations of continuity and momentum, usually called external mode to provide free surface variations. Because of the very fast speed of surface waves in the ocean, the time step used to solve this mode is very short. For computer time economy the 3D-equations (internal mode), are solved using a larger time step, at the end of which the two modes are combined; this procedure is known as time splitting technique. The model specifies the values of all variables at the nodes of a curvilinear orthogonal grid, staggered as an Arakawa-C scheme, conserving linear and quadratic quantities like mass and energy. The model also uses a sigma-coordinate system for which details on the transformed equations and numerical algorithm can be found in 2 ' 6 . 3
Parallelization
POM is a FORTRAN 77 code that was initially designed for serial processors and later converted to vector processors. In the last decade, vector supercomputers are begun obsolete and new hardware based on commodity chip have take place. Nevertheless, the serial code is still widely used and can be downloaded from web 13 . In the last years some attempt towards a parallel implementation of the Princeton Ocean Model was carried out (see for example 9>3>5). The key points for the above parallelization are the use of message passing tools (PVM or MPI) coupled with domain decomposition technique or data parallel language (HPF) and Fortran 90 style. We found some disadvantages in both approaches: PVM is no longer used by the High Performance community while HPF suffers a lack of performance on a large number of hardware platforms; moreover, a vendor implementation of HPF is not installed on all machines (for example on IBM systems). Last but not least, not all the above parallel implementations are available for the POM user and, where available, configuring and optimizing a black box parallel code for a particular architecture or
224
physical test is a very complex task. Thus, we decided to develop our parallel version of the code assuming that the optimal choice has to be based on the message passing library MPI and domain decomposition technique. Here follows a brief description of the MPI code (see 4 for a complete description). The MPI code is structured assuming a two (or one) dimensional geometric decomposition of the ocean computational grid into a set of smaller sub-domains. In particular, the grid is horizontally partitioned across the latitude and/or longitude dimensions into rectangular blocks, leaving the vertical dimension unchanged (see Fig 1)).
Figure 1. 2D domain decomposition
Each sub-domain is then allocated to a single processor, which ran a complete copy of the code, i.e. it is responsible for stepping forward the dynamic ocean variables on the sub-domain under consideration. This technique has been chosen because is both easy to develop from uni-processor code and easy to maintain. In order to guarantee the same results of the serial code at the last bit precision of the machine, and to reduce the frequency of communication, all the sub-domains are overlapped at the inner boundaries on a slice of 2 grid point's thickness. These slices represent grid points (halo points) that contain copies of the boundary values stored on neighboring sub-domains (Kernel points) in the grid topology. At some stages, during the calculation, the v£&ues stored on
225
these slices must be updated receiving values from neighboring kernel points. Two types of inter-machines communications are needed to keep the computation consistent with the sequential code: (1) point-to-point communication to update halo values among neighboring machines and (2) global communication at each external time step to check the stability (CFL) condition, and to gather the output variables. One of the main result of this kind of parallelization is that the memory storage is reduced. This is because each processor only needs to store a portion of the global memory, and is therefore possible to solve, on a cluster, problems larger of the memory size of a single machine. The result is more evident treating large size oceanographic experiments. The balance of the computational load, and the minimization of the interprocessor communications are simultaneously and automatically achieved in our MPI code. A routine, placed at the beginning of POM code, computes an optimal decomposition both in I D slices or 2D rectangular sub-grids, depending on the number of grid points of the global horizontal model domain and on the number of requested machines respectively. It is interesting to note that Fortran 90 features are widely used in the MPI code (dynamic memory management and array syntax instructions) giving us a more readable and efficient program. The results are very promising; nevertheless, the memory size and the demand of CPU for real problems (for example, for a Model of the Mediterranean Sea, with a horizontal resolution of 5 Km, are necessary about 844 x 332 grid points) are so big that naturally lead us to develop a code for a specific target architectures: the cluster of SMP (Symmetric Multiprocessor) machines. The interest for this kind of architecture is recent, but it seems to be the trend for building a supercomputer at a relatively low cost. Compaq clusters based on Linux and OSF operative systems are installed at CASPUR and ENEA respectively. We remark that, on such a cluster, MPI should be used for internode communication, while shared memory should be the best choice for intranode communication. The code presented in this work has been optimized for fully exploiting the parallelism on such architecture. Within each SMP machine OpenMP is used to divide the entire work among different processors using threads. OpenMP is a fully accepted standard for SMP parallelization and is efficiently implemented from all vendors ( n ) . It realizes the so-called multi-threaded paradigm in which each thread access shared and private data items. Parallel regions are used to parallelize do-loops in the code. An example: C ADD VISCOUS FLUXES DO 860 J=2,JMM1 DO 860 1=2,IM
226 860
FLUXVA(I,J)=FLUXVA(I,J) 1 -D(I,J)*2.E0*AAM2D(I,J)*(VAB(I,J+1)-VAB(I,J))/DY(I,J) it becomes: C ADD VISCOUS FLUXES !$OMP PARALLEL DO !$OMP&PRIVATE (I,J) DO 860 J=2,JMM1 DO 860 1=2,IM 860 FLUXVA(I,J)=FLUXVA(I,J) 1 -D(I,J)*2.E0*AAM2D(I,J)*(VAB(I,J+1)-VAB(I,J))/DY(I,J) !$OMP END PARALLEL DO
Distribution of the entire work is automatically done by compiler, but any privatization of data or synchronization points in the code, needed to avoid race conditions, are done by programmer. We have used a very optimized implementation of OpenMP available on a great variety of platforms, Guide from KAI ( 12 ) for portability reasons. Also, a parallel debugger (ASSURE) and a Performance Viewer (Guideview) are available and we used strongly in the parallelization steps. PH-POM uses the same input and output files as the original POM. File input/output is performed only by the master processor. 4
Performance results: the seamount case
To demonstrate the potential performance of PH-POM we have compared the pure MPI version and the hybrid MPI+OpenMP one with the serial code. In particular, an idealized seamount case was defined in two different configurations: in the first configuration 42 sigma levels and a horizontal grid made by 1000 x 300 grid points {Big case), while 32 sigma levels and the same horizontal resolution for the second configuration (Medium case). The bottom topography is completely flat at —4500 m except in the center of the domain where a steep seamount is defined as H (x, y) = H0 [l.O - 0.9e(x2+y^/L2j
(1)
where H0 is 4500 m and L is 25 -103 m. In both cases, the horizontal computational model grid is staggered onto a rectangular grid. In particular it is stretched so that the resolution is highest at the center where is defined the seamount. The resulting bathimetry is illustrated in Fig 2. The external and internal time steps are 6 sec and 180 sec respectively. Performance analysis were done on a cluster of 4 IBM SP3 interconnected via High Performance
227
Figure 2. Seamount geometry. The grid is stretched so that the resolution is highest at the center.
Switch, a fast communication link. Each SP3 node is a 16 processors machine equiped with Power3 processor at 375 Mhz clock rate. We have tested two different kind of communication among different machines: SLOW communication, setting the Internet Protocol (ip) mode both for SP3 switch and MPI calls within the same node (MPLSHARED-MEMORY no); FAST communication, using the switch in user space (us mode) and MPI calls via shared memory (MPLSHARED-MEMORY yes). Table 1 shows that for a cluster in which the interconnection between different machines is relatively slow, the hybrid configuration is faster than the pure MPI one, both for the Medium and the Big test case. On the contrary, the MPI configuration has in general better performance with respect to the hybrid one in the FAST communication case; nevertheless, for a choosen set of MPI process and OpenMP threads (32 x
228
2) both the Big and the Medium case shows a decreasing of the elapsed time with respect to the pure MPI.
MPI Medium HYBRID Medium HYBRID Medium HYBRID Medium HYBRID Medium MPI Big HYBRID Big HYBRID Big HYBRID Big HYBRID Big
N u m . Proc. 64 64 64 64 64 64 64 64 64 64
Task Decomp.
4x16 8x8 16x4 32x2
SLOW 2327 1739 1385 1521 2080
FAST 1022 1375 1115 1012 978
2807 2245 1845 1962 2433
1374 1818 1490 1336 1291
4x16 8x8 16x4 32x2
Table 1. Elapsed time in seconds. 1 DAYS of simulation The final comparison with the serial POM shows very good speed-up results with respect to the hybrid code setted in the best configuration; more precisely, a speed-up of almost 47 for the Big configuration and 51 for the Medium one. 5
Conclusion
The parallelization of the Priceton Ocean Model (POM) has been succesfully realized. The code is in principle able to solve very large problems with great efficiency; portability has been succesfully realized using MPI and OpenMP standard. Moreover, using Fortran 90 features, a clear and modular code has been developed and the program is suitable to run efficiently on generic clusters (using MPI) but also on clusters of shared memory machines using the hybrid approach. The preliminary results are encouraging. The idea is to provide more and more complicated and challenging physical tests using bigger and bigger clusters. Acknowledgments We wish to thanks Giorgio Amati for useful suggestion and discussion.
229 References 1. Blumberg A. F. and G. L. Mellor, 1983: Diagnostic and prognostic numerical circulation studies of the South Atlantic Bight. J. Geophys. Res., 88, 4579-4592. 2. Blumberg A. F. and G. L. Mellor, 1987: A description of a threedimensional coastal ocean circulation model. Three-Dimensional Coastal Ocean Models, Coastal Estuarine Science, N. S. Heaps, Ed., Amer. Geophys. Union, 1-16. 3. Boukas, L. A., N. T. Mimikou, N. M. Missirlis, G. L. Mellor, A. Lascaratos, and G. Korres, The parallelization of the Princeton Ocean Model, in: Lecture Notes in Computer Sci., Amestoy et al. (Eds.), Springer, 1685, 1395-1402, 1999. 4. Sannino G., Artale V., Lanucara P., Parallelization of the Princeton Ocean Model: a Domain Decomposition approach, submitted to Parallel Computing 5. POM Benchmark Results page: http://www.aos.princeton.edu/WWWPUBLIC/htdocs.pom/POMcpu .txt 6. Mellor G. L., 1991: User's guide for a three-dimensional, primitive equation, numerical model. AOS Program Rep., Princeton University, Princeton, NJ, 34 pp. 7. Mellor G. L., 1991: An equation of state for numerical models of oceans and estuaries. J. Atmos. Oceanic Technol., 8, 609-611. 8. Mellor G. L. and T. Ezer, 1991: A Gulf Stream model and an Altimetry Assimilation Scheme.J. Geophys. Res., 96, 8779-8795. 9. Oberpriller, W. D., A. Sawdey, M. T. O'Keefe and S. Gao, 1999: Parallelizing the Princeton Ocean Model using TOPAZ, Parallel Computer Sys. Lab.,Dept. Elec. Comp. Eng., University of Minnesota, Tech. Report, 21pp. 10. Zavatarelli M. and G. L. Mellor, 1994: A numerical Study of the Mediterranean Sea Circulation. J. Phys. Oceanogr.,25, 1384-1414. 11. OpenMP home page http://www.openmp.org/ 12. KAI Software home page http://www.kai.com/ 13. POM home page http://www.aos.princeton.edu/WWWPUBLIC/htdocs.pom/
A SIMD SOLUTION TO BIOSEQUENCE DATABASE SCANNING
BERTH. SCHMIDT, HEIKO SCHRODER AND THAMBIPILLAI SRKANTHAN School of Computer Engineering, Nanyang Technological University, Singapore E-mail: asbschmidt(a),ntu.edu.ss, asheiko&ntu.edu.se, astsrikan&ntu.edu.ss Molecular biologists frequently compare an unknown protein sequence with a set of other known sequences (a database scan) to detect functional similarities. Even though efficient dynamic programming algorithms exist for the problem, the required scanning time is still very high, and because of the exponential database growth finding fast solutions is of highest importance to research in this area. In this paper we present an approach to high-speed biosequence database scanning on the Fuzion ISO, a new parallel computer with a linear SIMD array of 1536 processing elements on a single chip. This results in an implementation with significant runtime savings.
1
Introduction
Scanning protein sequence databases is a common and often repeated task in molecular biology. The need for speeding up this treatment comes from the exponential growth of the biosequence banks: every year their size scaled by a factor 1.5 to 2. The scan operation consists in finding similarities between a particular query sequence and all the sequences of a bank. This operation allows biologists to point out sequences sharing common subsequences. From a biological point of view, it leads to identify similar functionality. Comparison algorithms whose complexities are quadratic with respect to the length of the sequences detect similarities between the query sequence and a subject sequence. One frequently used approach to speed up this time consuming operation is to introduce heuristics in the search algorithms [1]. The main drawback of this solution is that the more time efficient the heuristics, the worse is the quality of the results [9]. Another approach to get high quality results in a short time is to use parallel processing. There are two basic methods of mapping the scanning of protein sequence databases to a parallel processor: one is based on the systolization of the sequence comparison algorithm, the other is based on the distribution of the computation of pairwise comparisons. Systolic arrays have been proven as a good candidate structure for the first approach [5,11], while more expensive supercomputers and networks of workstations are suitable architectures for the second [4,7]. Special-purpose systolic arrays provide me best price/performance ratio by means of running a particular algorithm [6]. Their disadvantage is the lack of flexibility with respect to the implementation of different algorithms. Programmable SIMD architectures strive for algorithmic flexibility and the speed of special-
230
231
purpose systems. In this paper we present a biosequence database scanning implementation on the Fuzion 150, a single chip SIMD array containing 1536 processing elements [10]. We will show that this approach leads to significant runtime savings. This paper is organized as follows. In Section 2, we introduce the basic sequence comparison algorithm for database scanning and highlight previous work in parallel sequence comparison. Section 3 provides a description of the Fuzion 150 architecture. The mapping of database scanning onto the parallel architecture is explained in Section 4. The performance is evaluated and in Section 5. Section 6 concludes the paper with an outlook to further research topics. 2
Parallel Sequence Comparison
Surprising relationships have been discovered between protein sequences that have little overall similarity but in which similar subsequences can be found. In that sense, the identification of similar subsequences is probably the most useful and practical method for comparing two sequences. The Smith-Waterman (SW) algorithm [12] finds the most similar subsequences of two sequences (the local alignment) by dynamic programming. The algorithm compares two sequences by computing a distance that represents the minimal cost of transforming one segment into another. Two elementary operations are used: substitution and insertion/deletion (also called a gap operation). Through series of such elementary operations, any segments can be transformed into any other segment. The smallest number of operations required to change one segment into another can be taken into as the measure of the distance between the segments. Consider two strings 51 and 52 of length /l and 12. To identify common subsequences, the SW algorithm computes the similarity H(ij) of two sequences ending at position i andy of the two sequences 51 and 52. The computation of H(ij) is given by the following recurrences: Wij) = max{0, E(ij), F(iJ), H(i-lj-l)+Sbt(SlhS2j)}, \
232
element (PE) to each character of the query string, and then to shift a subject sequence systolically through the linear chain of PEs (see Fig. 2). If/l is the length of the first sequence and 12 is the length of the second, the comparison is performed in /1+/2-1 steps on /l PEs, instead of 11x12 steps required on a sequential processor. In each step the computation for each dynamic programming cell along a single diagonal in Fig. 1 is performed in parallel. 0
A
T
C
T
c
G
T
A
T
G
A
0 G
0 0
0 0
0 0
0 0
0 0
0 0
0 2
0 1
0 0
0 0
0 2
T C
0 0
0 0
2 1
1 4
1 BKa 3 1
1
4
3 3
3 2 3 6 5 4
5 5 4 6 5 7
2 2 5 6 9:: 8 7 6
1 1 4 5 8
0 1 1 0 3 6 7
T A T C A C
0 0 0 0 0 0
0 2 1 0 2 1
2 2 4 3 2 1
4 4 4 5 5 6
1mm^e^M 3 4 6 5 4 5
A
mmJfi5 5 7 6
.
HIM 9
!
T
G
0 0 3 2 2 5 8 7 9 9
0 2 2 2 1 4 7 7 8 8
Fig. 1: Example of the SW algorithm to compute the local alignment between two DNA sequences ATCTCGTATGATG and GTCTATCAC. The matrix H{ij) is shown for the computation with gap costs a = 1 and P = 1, and a substitution cost of+2 if the characters are identical and - 1 otherwise. From the highest score (+10 in the example), a traceback procedure delivers the corresponding alignment (shaded cells), the two subsequences TCGTATGA and TCTATCA. subject sequence
^
^
query sequence
... G T G A c — M ^ H — M S i — > Fig. 2: Sequence comparison on a linear processor array: the query sequence is loaded into the processor array (one character per PE) and a subject sequence flows from left to right through the array. During each step, one elementary matrix computation is performed in each PE.
In addition to architectures specifically designed for sequence analysis, existing programmable sequential and parallel architectures have been used for solving sequence problems. Special-purpose systolic arrays can provide the fastest means of running a particular algorithm with very high PE density. However, they are limited to one single algorithm, and thus cannot supply the flexibility necessary to run a variety of algorithms required analyzing DNA, RNA, and proteins. P-NAC was the first such machine and computed edit distance over a four-character alphabet [8]. More recent examples, better tuned to the needs of computational biology, include BioScan and SAMBA [5,11]. Reconfigurable systems are based on programmable logic such as field-programmable gate arrays (FPGAs) or custom-designed arrays. They are generally slower and have far lower PE densities than special-purpose architectures. They are flexible, but the configuration must be changed for each algorithm, which is generally more complicated than writing new code for a programmable architecture. Splash-2 and Biocellerator are based on FPGAs, while MGAP and PIM have their own reconfigurable designs [2,6].
233
Our approach is based on the SIMD concept. SIMD architectures achieve a high performance cost ratio and can at the same time be used for a wide range of applications. Cost and ease of programming fall between the other two classes. 3
The Fuzion 150 SIMD Architecture
SIMD architectures typically use a large number of PEs controlled by a single instruction sequencer. This means that only a single instruction stream operates at any given time, but it operates on every data processor at once. A benefit of this architecture is that the instruction sequencer is not replicated and thus can incorporate instruction flows and threat management using a very low proportion of the overall silicon real-estate. The majority of the silicon real-estate is used in the highly regular PEs. This allows for the efficient design of a highly regular small PE.
Pig. 3: System block diagram of the Fuzion 150 architecture and the architecture of an individual PE.
Early SIMD architectures suffered to some extend due to the small amounts of area for each PE, e.g. [2,3]. The increase in integration on ICs now allows a large SIMD array, with local PE memory and controllers on a single die. The Fuzion 150 system architecture shown in Fig. 3 provides a general-purpose processing solution for many application areas including network processing and 3D graphics. It combines control and data processing on the same silicon, whilst taking account of
234
the very different processing requirements of the control and data planes. The control plane and housekeeping operations are performed on the embedded processing unit (EPU), which is a 32-bit ARC™ core. Data plane operations utilize the PEs in the Fuzion core. The processor array is made up of six blocks of PEs. Every block consists of a linear array of 256 8-bit PEs. The PEs at the borders of each block are also connected with the borders of the next blocks. This provides a linear array of 1536 PEs. The PE itself is based around an 8-bit ALU, connected to a 32 Bytes register file. This register file is multi-ported to give neighbor communications, and concurrent access to the PE's own on-chip 2 KBytes DRAM. The PE also has direct access to a linear expression evaluator (LEE) that can perform an operation of the form Ax + By + C on a per cycle basis. Data I/O for the processor array operates on a per block basis. A high performance I/O engine allows data transfer at up to 700 MBytes/s per block via the Fuzion bus. At a clock frequency o f / = 200 MHz and using a word format of w=8 bits each PE can execute flm = 200-106 word operations per second. Thus, the Fuzion 150 parallel computer performs up to 300 GOPS. 4
4 Mapping of Sequence Comparison onto the Fuzion 150
Since the length of the sequences may vary, the computation has to be partitioned on the Fuzion 150. For the sake of clarity we first assume the processor array size to be equal to the query sequence length M, i.e. M = 1536. Fig. 4 shows the data flow for aligning the sequences A = a_0 a_1 ... a_{M-1} and B = b_0 b_1 ... b_{K-1}, where A is the query sequence and B is a subject sequence of the database. As a preprocessing step, symbol a_i, i = 0,...,M-1, is loaded into PE (m,n) (notation for PE m of block n, i.e. m in {0,...,255}, n in {0,...,5}) with n = i div 256 and m = i mod 256 for even n and m = 255 - (i mod 256) for odd n. B is loaded into the local memory. After that the row of the substitution table corresponding to the respective character is loaded into each PE, as well as the constants α and β. B is then completely shifted through the array in M+K-1 steps (Fig. 4). In iteration step k, 1 ≤ k ≤ M+K-1, the values H(i,j), E(i,j), and F(i,j) for all i,j with 1 ≤ i ≤ M, 1 ≤ j ≤ K, and i + j - 1 = k are computed in parallel.
Hence, we compute a sequence alignment in time O(M+K) using O(M) processors. As the best sequential algorithm takes O(MK) steps, our parallel implementation achieves maximal efficiency.
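The serpentine character loading can be made concrete with a small C sketch (our own illustration, not code from the paper): given a query position i, it returns the block n and the PE index m according to the div/mod rule quoted above, with the direction reversed in odd-numbered blocks.

```c
#include <stdio.h>

#define PES_PER_BLOCK 256
#define NUM_BLOCKS    6   /* 6 x 256 = 1536 PEs in total */

/* Map query position i (0 <= i < 1536) to (block n, PE m) so that
   consecutive characters follow the linear array through the
   serpentine block-to-block connections. */
static void map_to_pe(int i, int *n, int *m)
{
    *n = i / PES_PER_BLOCK;                           /* block index             */
    if ((*n % 2) == 0)
        *m = i % PES_PER_BLOCK;                       /* even block: left to right */
    else
        *m = PES_PER_BLOCK - 1 - (i % PES_PER_BLOCK); /* odd block: reversed       */
}

int main(void)
{
    int samples[] = { 0, 255, 256, 511, 512, 1535 };
    for (unsigned k = 0; k < sizeof samples / sizeof samples[0]; k++) {
        int n, m;
        map_to_pe(samples[k], &n, &m);
        printf("a_%-4d -> block %d, PE %d\n", samples[k], n, m);
    }
    return 0;
}
```

Note how positions 255 and 256 land on neighboring border PEs of adjacent blocks, which is what makes the 1536 PEs behave as one linear array.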
Fig. 4: Data flow for aligning two sequences A and B on the Fuzion 150: A is loaded into the processor array one character per PE and B is completely shifted through the array in M+K-1 steps.
Only the highest score of matrix H is computed on the Fuzion for each pairwise comparison (see Fig. 1). The front-end PC carries out the ranking of the compared sequences and the reconstruction of the alignments. So far we have assumed a processor array equal in size to the query sequence length (M = 1536). In practice, this rarely happens. For a query sequence length smaller or larger than the array size, our implementation is modified as follows: 1. k × M = 1536: in this case we can just replicate the implementation for M PEs on each subarray, i.e. k alignments of the same query sequence with different subject sequences are computed in parallel. 2. k × 1536 = M: a possible solution is to split the sequence comparison into k passes. However, this solution requires I/O of intermediate results in each iteration step. This additional data transfer can be avoided by assigning k characters of the sequences to each PE instead of one. On the Fuzion 150 it is possible to assign up to 16 characters per PE. Thus, query lengths of up to 24576 characters can be processed within a single pass, which is sufficient for molecular applications.
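The following C fragment is a minimal sketch of this case distinction (an illustration under the stated assumptions of a 1536-PE array and at most 16 characters per PE, not the authors' code); for query lengths that do not divide the array size evenly it simply rounds, which the paper does not spell out.

```c
#include <stdio.h>

#define ARRAY_SIZE   1536
#define MAX_CHARS_PE 16     /* up to 16 query characters per PE */

/* For a query of length M, determine how many characters each PE holds
   and how many subject sequences can be aligned in parallel. */
static int plan_mapping(int M, int *chars_per_pe, int *parallel_alignments)
{
    if (M <= 0 || M > ARRAY_SIZE * MAX_CHARS_PE)
        return -1;                                 /* would need several passes */
    if (M <= ARRAY_SIZE) {
        *chars_per_pe = 1;
        *parallel_alignments = ARRAY_SIZE / M;     /* replicate the query k times */
    } else {
        *chars_per_pe = (M + ARRAY_SIZE - 1) / ARRAY_SIZE;  /* k chars per PE */
        *parallel_alignments = 1;
    }
    return 0;
}

int main(void)
{
    int lengths[] = { 256, 1536, 4096, 24576 };
    for (unsigned k = 0; k < sizeof lengths / sizeof lengths[0]; k++) {
        int c, p;
        if (plan_mapping(lengths[k], &c, &p) == 0)
            printf("M=%5d: %2d chars/PE, %d alignment(s) in parallel\n",
                   lengths[k], c, p);
    }
    return 0;
}
```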
5 Performance Evaluation
A performance measure commonly used in computational biology is millions of dynamic cell updates per second (MCUPS). A CUPS represents the time for a complete computation of one entry of the matrix H, including all comparisons, additions and maximum computations. To measure the MCUPS performance on the Fuzion 150, we give the cycle count needed to update 16 H-cells per PE in Table 1. Because new H-values are computed for 16 characters within 1934 clock cycles in each PE, the whole array of 1536 PEs can perform 24576 cell updates in the same time. This leads to a performance of (24576/1934) × f CUPS = (24576/1934)
× 200 × 10^6 CUPS = 2541 MCUPS. Because MCUPS does not consider data transfer time and query length, it is often a weak measure that does not reflect the behavior of the complete system. Therefore, we will use execution times of database scans for different query lengths in our evaluation. During the computation of 16 new H-cells in each PE, one new character is input via the Fuzion bus into each subarray that performs a sequence comparison. The required data transfer time is completely dominated by the above computing time of 1934 cycles per iteration step. Table 1: Instruction count to update 16 H-cells in one PE of the Fuzion with the corresponding operations.
Operation in each PE per iteration step                       Cycle Count
Get H(i-1,j), F(i-1,j), b_j, max_{i-1} from neighbor                   22
Compute t = max{0, H(i-1,j-1) + Sbt(a_i, b_j)}                        576
Compute F(i,j) = max{H(i-1,j) - α, F(i-1,j) - β}                      336
Compute E(i,j) = max{H(i,j-1) - α, E(i,j-1) - β}                      448
Compute H(i,j) = max{t, F(i,j), E(i,j)}                               368
Compute max_i = max{H(i,j), max_{i-1}}                                184
Sum                                                                  1934
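The MCUPS figure quoted above can be reproduced from the cycle count in Table 1 with a few lines of C (a back-of-the-envelope check, not part of the original implementation):

```c
#include <stdio.h>

int main(void)
{
    const double f_clock  = 200e6;   /* clock frequency: 200 MHz             */
    const int    cycles   = 1934;    /* cycles per iteration step (Table 1)  */
    const int    cells_pe = 16;      /* H-cells updated per PE per iteration */
    const int    num_pes  = 1536;

    double cups = (double)cells_pe * num_pes / cycles * f_clock;
    printf("peak performance: %.0f MCUPS\n", cups / 1e6);   /* about 2541 MCUPS */
    return 0;
}
```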
Table 2: Scan times (in seconds) of TrEMBL release 14 for various lengths of the query sequence on the Fuzion 150 and on a 933 MHz Pentium III. The speedup compared to the Pentium is also reported.
Query length    Fuzion 150    Speedup    Pentium III 933
256                     12         88               1053
512                     22         97               2132
1024                    42        101               4252
2048                    82        105               8581
4096                   162        106              17164
Table 2 reports the times for scanning the TrEMBL protein databank (release 14, which contains 351,834 sequences and 100,069,442 amino acids) for query sequences of various lengths with the SW algorithm. The table shows the execution time for the Fuzion 150 compared to a sequential C program on a Pentium. As the times indicate, the parallel implementation scales linearly with the sequence length. For the comparison of different parallel machines, we have taken data from [3] for a database search with the SW algorithm for different query lengths. The Fuzion 150 is three to four times faster than the much larger 16K-PE MasPar. The 1-board Kestrel [3] is six times slower than a Fuzion 150 chip. Kestrel's design is also a programmable fine-grained SIMD array; it reaches lower performance because it was built with older CMOS technology. SAMBA [5] is a special-purpose systolic array for sequence comparison implemented on two add-on boards, which is around five times slower than the Fuzion.
6 Conclusions
In this paper we have demonstrated that the new Fuzion 150 SIMD parallel computer is very suitable for scanning biosequence databases. We have presented an
efficient mapping of the SW algorithm onto this particular architecture. This leads to a speedup for database scanning of more than 100 compared to a Pentium III 933. The exponential growth of genomic databases demands even more powerful parallel solutions in the future. Because the comparison and alignment algorithms favored by biologists are not fixed, programmable parallel solutions are required to speed up these tasks. As an alternative to special-purpose systems, hard-to-program reconfigurable systems, and expensive supercomputers, we advocate the use of specialized yet programmable hardware whose development is tuned to system speed. Our future work will include identifying more applications that profit from this type of processing power. Apart from its performance figures, the most promising property of the Fuzion design is its flexibility. It will be interesting to study the performance of this architecture for applications like multimedia video compression and medical imaging.
References
1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool, J. Mol. Biol. 215 (1990) pp. 403-410.
2. Borah, M., Bajwa, R.S., Hannenhalli, S., Irwin, M.J.: A SIMD solution to the sequence comparison problem on the MGAP, in Proc. ASAP'94, IEEE CS (1994) pp. 144-160.
3. Dahle, D., Grate, L., Rice, E., Hughey, R.: The UCSC Kestrel general purpose parallel processor, in Proc. PDPTA'99 (1999), http://www.cse.ucsc.edu/research/kestrel/papers/pdpta99.pdf.
4. Glemet, E., Codani, J.J.: LASSAP, a Large Scale Sequence compArison Package, CABIOS 13 (2) (1997) pp. 145-150.
5. Guerdoux-Jamet, P., Lavenier, D.: SAMBA: hardware accelerator for biological sequence comparison, CABIOS 12 (6) (1997) pp. 609-615.
6. Hughey, R.: Parallel Hardware for Sequence Comparison and Alignment, CABIOS 12 (6) (1996) pp. 473-479.
7. Lavenier, D., Pacherie, J.-L.: Parallel Processing for Scanning Genomic DataBases, Proc. PARCO'97, Elsevier (1998) pp. 81-88.
8. Lopresti, D.P.: P-NAC: A systolic array for comparing nucleic acid sequences, Computer 20 (7) (1987) pp. 98-99.
9. Pearson, W.R.: Comparison of methods for searching protein sequence databases, Protein Science 4 (6) (1995) pp. 1145-1160.
10. Pixelfusion Inc.: http://www.pixelfusion.com (2000).
11. Singh, R.K. et al.: BIOSCAN: a network sharable computational resource for searching biosequence databases, CABIOS 12 (3) (1996) pp. 191-196.
12. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences, J. Mol. Biol. 147 (1981) pp. 195-197.
PARALLEL PROGRAM PACKAGE FOR 3D UNSTEADY FLOWS SIMULATION
E. V. SHILNIKOV AND M. A. SHOOMKOV
Institute for Mathematical Modeling, Russian Academy of Sciences, 4a Miusskaya Sq., Moscow 125047, Russia
E-mail: [email protected]
A program package is presented for the numerical simulation of 3D essentially unsteady flows of viscous compressible heat-conducting gas. The package is based on explicit kinetically consistent finite difference schemes and is designed for parallel computers. It has been tested on different types of multiprocessor computer systems. The results of a test problem solution are described.
1 Introduction
The detailed investigation of oscillating regimes in transonic and supersonic viscous gas flows over various bodies is an extremely topical problem for modern aerospace applications. Under certain freestream conditions such flows may be characterized by regular self-induced pressure oscillations. Their frequency, amplitude and harmonic properties depend upon the body geometry and the external flow conditions. It is possible that these pulsations have a destructive influence upon the mechanical properties of different aircraft parts, especially in the resonant case. From a mathematical point of view such 3D problems are quite difficult for numerical simulation and are a subject of interest for many scientific laboratories. When choosing a numerical method we have to take into account the fact that predicting the detailed structure of an unsteady viscous compressible gas flow requires high performance parallel computer systems. The opinion is widespread that only implicit schemes should be used for viscous gas flow simulation because of their good stability properties. When modeling stationary problems we are usually not interested in the details of the stabilization process, so it is natural to use some implicit scheme which permits the program to run with a large time step. In the case of an essentially unsteady flow, especially for oscillating regimes, we have to obtain detailed information about high frequency pulsations of the gas dynamic parameters. This fact limits the time step acceptable for the difference scheme by the accuracy requirements. For many interesting problems these limitations neutralize the advantages of implicit schemes. For such problems the explicit difference schemes therefore seem preferable because of their simplicity of program realization, especially for
parallel implementation. For this reason one of the explicit versions of the original algorithms named kinetically consistent finite difference (KCFD) schemes 1 was selected for the numerical simulation of essentially unsteady viscous gas flows.
2 Numerical Algorithm
The kinetic schemes differ from other algorithms primarily in that they are founded on a discrete model for the one-particle distribution function. Averaging this model over the molecular velocities with the collision vector components produces the difference schemes for the gas dynamic parameters. These schemes have been successfully used for solving various gas dynamic problems 2-4. In addition, they make it possible to calculate oscillating regimes in super- and transonic gas flows, which are very difficult to model by means of other algorithms. It must also be mentioned that the numerical algorithms for KCFD schemes are very convenient for adaptation to massively parallel computer systems with distributed memory architecture. This gives the opportunity to use very fine meshes, which permit studying the fine structure of the flow. The explicit variant of these schemes 5 with the soft stability condition τ = O(h) was used as the numerical background for our parallel software. The geometrical parallelism principle has been implemented for constructing their parallel realization. This means that each processor carries out the calculation in its own subdomain. The explicit form of the schemes makes it possible to minimize the exchange of information between processors. With an equal number of grid nodes in each subdomain, the homogeneity of the algorithm automatically provides processor load balancing. It may also be mentioned that the choice of numerical method is not critical for the program package presented. The finite difference scheme used here may be interpreted as the difference approximation of the conservation laws for mass, momentum and energy. Any conservative scheme may be rewritten in such a conservation-law form. So in order to change one explicit scheme to another in this package one has to do nothing but rewrite the subroutine which calculates the fluxes between grid cells (see the sketch below). Not only the difference scheme but also the governing equations may be replaced in a similar manner if they allow a conservation-law form.
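The remark about exchanging schemes can be illustrated with a schematic C fragment (a hypothetical sketch, not the authors' code): the conservative update loop is fixed, and only the flux subroutine passed to it has to be rewritten to obtain a different explicit scheme.

```c
#include <stddef.h>

/* Numerical flux through the interface between cells i and i+1.
   Replacing only this routine changes the explicit scheme; the
   conservative update below stays untouched. */
typedef double (*flux_fn)(double u_left, double u_right);

static double upwind_flux(double ul, double ur)
{
    (void)ur;
    return ul;            /* trivial example: advection with unit speed */
}

/* One explicit conservative step:
   u_i^{n+1} = u_i^n - dt/h * (F_{i+1/2} - F_{i-1/2}) */
static void explicit_step(const double *u, double *unew, size_t n,
                          double dt, double h, flux_fn flux)
{
    for (size_t i = 1; i + 1 < n; i++) {
        double f_right = flux(u[i], u[i + 1]);
        double f_left  = flux(u[i - 1], u[i]);
        unew[i] = u[i] - dt / h * (f_right - f_left);
    }
    unew[0] = u[0];        /* boundary cells kept fixed in this sketch */
    unew[n - 1] = u[n - 1];
}
```

Swapping upwind_flux for another flux routine (or for the kinetically consistent fluxes) is the only change needed to obtain a different conservative scheme.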
3 Parallel Implementation
The development of applied software intended for the numerical simulation of 3D flows is a very painstaking job, especially when creating distributed applied
software for MIMD computers. In order to simplify our work we decided to follow some quite obvious but important principles whose violation would probably lead to failure:
• all complicated but not very time-consuming operations must be processed by separate sequential programs;
• every combined operation ought to be subdivided into several independent simple operations;
• the essentially parallel program must be as simple as possible and have an extremely clear logical structure.
Such an approach gives us the opportunity to minimize the probability of bugs in the essentially distributed program. It is well known that mistakes in parallel programs are much more difficult to analyze than in sequential ones. The chosen methodology helped us to succeed in our activity. Installation of our parallel applied software on different types of multiprocessor computer systems from various manufacturers did not cause any serious problems. The basic ideas accepted determined the structure of the whole software bundle. The total data processing procedure consists of three separate stages:
• covering the full mesh region by subdomains, each of which will be processed on one processor;
• converting the content of the files with the task geometry data, boundary conditions and some auxiliary information into data of several formats, each specially suited for specific internal needs, and final preparations for the distributed calculations;
• performing the parallel computations.
The C and Fortran languages were used to develop the software. MPI libraries were used to realize message passing at the distributed stage. The first two stages are carried out by sequential programs. The first of them divides the complete computational volume into the necessary number of 3D rectangular subvolumes. This fragmentation must provide processor load balancing and minimal message interchange among processors in accordance with the geometric parallelism principle. The result of this stage is a text file describing the 3D subvolumes in terms of grid node numbers. The user can edit this file manually if needed. The description of the task geometry, boundary conditions and grid information is kept in another text file. A special simple language is used for this description. A particular compiler translates the content of this file into intermediate arrays in a format convenient for further transformations. These arrays contain
vertex point coordinates, information about the body surface and so on. They are later used for results visualization.* The compiler implements syntax checking during data input. This is necessary in order to avoid different kinds of mistakes, especially in the case of a complex body shape. The data obtained are the input for modules which transform and organize them for the parallel computations in several steps. The final data structure is tuned for minimal interchange between CPU and RAM. The last action of the second stage is to write the needed data to binary files, each of which contains the data portion needed for one processor. The last stage is the start of the distributed computational program. The main criterion for this program is efficiency. To satisfy this requirement its logical structure is made as clear, compact and simple as possible. In addition, this approach essentially simplifies the debugging process.
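Since the package uses MPI and the geometric parallelism principle, the per-step exchange between neighbouring subdomains reduces to trading one plane of boundary values with each neighbour. The following C/MPI fragment is a simplified sketch of such an exchange for a one-dimensional chain of subvolumes (function and buffer names are our own, not taken from the package):

```c
#include <mpi.h>

/* Exchange one layer of boundary values with the left and right
   neighbours in a 1D chain of subdomains; plane_size is the number of
   values in one boundary plane of the local 3D subvolume. */
static void exchange_halos(double *left_send,  double *left_recv,
                           double *right_send, double *right_recv,
                           int plane_size, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send to the right, receive from the left, and vice versa */
    MPI_Sendrecv(right_send, plane_size, MPI_DOUBLE, right, 0,
                 left_recv,  plane_size, MPI_DOUBLE, left,  0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(left_send,  plane_size, MPI_DOUBLE, left,  1,
                 right_recv, plane_size, MPI_DOUBLE, right, 1,
                 comm, MPI_STATUS_IGNORE);
}
```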
4 Results of Test Problem Simulation
The problem of viscous compressible gas flow along a plane surface with a rectangular cavity in it was taken as a test problem. The numerical experiments were made for time-constant freestream parameters, which were taken in accordance with the experimental data 6:
- inflow Mach number M∞ = 1.35;
- Reynolds number Re_h = 33000;
- Prandtl number Pr = 0.72;
- shear layer thickness near separation δ/h = 0.041;
- relative cavity length l/h = 2.1;
- relative cavity width w/h = 2.0.
Both the experiments 6 and previous calculations 7,8 show that intensive pressure pulsations in the cavity take place for such inflow parameters and cavity geometry. The computational region is presented in Figure 1. The inflow is parallel to the XY-plane and makes an angle ψ with the X direction. The geometrical parameters of the cavity and of the computational region are determined by the coordinates of the points A, B, C, D, E, F in Figure 1. Their values are A = (2.1, 0.0, 0.0), B = (2.1, 1.0, 0.0), C = (2.1, 0.0, -1.0), D = (-2.0, 0.0, 0.0), E = (5.5, 0.0, 0.0), F = (0.0, 3.0, 0.0), where all coordinates are related to the cavity depth. The height of the computational region was also taken equal to it. The initial distribution corresponds to a shear layer over the cavity and
* The geometry description language is not specifically gas dynamic and can be used for the description of tasks of another physical nature.
Figure 1. The scheme of computational region
immobile gas with the stagnation parameters inside it. The calculations were carried out on a rectangular grid with a total number of cells over 1,200,000. Detailed information about the 3D gas flow around the open cavity was obtained for different angles of incidence ψ. For ψ = 0 the 3D gas flow structure in the middle part of the cavity was approximately the same as for the 2D problem. The gas behaviour in the other cavity regions was essentially three dimensional. Lengthwise gas movement was combined with transverse movement in these regions, resulting in the appearance of gas vortices and swirls. There is a region of return flow before the separation point, with a small vortex in it. In the course of time this vortex decreases down to its total disappearance. The processes of origination and collapse of the large scale vortex repeat periodically in the presence of feedback between the cavity rear bulkhead and the place of its origination. These results are in agreement with the experimental data 6 obtained for such types of cavities. The analysis of the flow structure for low values of the incidence angle was also carried out. A nonzero incidence angle leads to the appearance of transverse vortical motion over the whole cavity (oscillation of the lengthwise swirls) and of some vortices in the XY-plane inside the cavity. A fact which seems very interesting is the disappearance of the boundary layer separation at the forward cavity edge. This effect may be explained by the weakening of the feedback between the cavity rear and forward bulkheads in the case of nonzero ψ. The properties of the pressure oscillations at critical cavity points were studied and a spectrum analysis of these oscillations was carried out. The pressure oscillation spectra at various cavity points are presented in Figure 2. The analysis showed the presence of intensive high frequency discrete
components. Areas of the most probable structural damage on the cavity surface were revealed.
Figure 2. Pressure oscillations spectra in the middle point (left picture) and in the corner (right picture) of the rear cavity edge.
Thus, the use of a detailed spatial mesh makes it possible to calculate the flow field in the cavity and to visualize middle-scale structures. One can hope that a more detailed grid will make it possible to obtain the whole structure of the flow in the cavity in the transient case. From our point of view there is also the prospect of using some kinetic analogue of the K-ε model of turbulence.
5 Comparison of Different Computer Systems
The program package was tested on MIMD computers with MPP architecture (MCS-1000, MCS-1000M, Parsytec CC), with SMP architecture (HP V2250) and on a Beowulf cluster. The 64-processor MCS-1000 computer system is equipped with 533 MHz Alpha 21164 EV5 chips. The host computer runs under the Digital Unix or Linux operating systems; the slave processors run under the VxWorks operating system. This computer has rather slow interprocessor communication channels. The 640-processor MCS-1000M computer system is equipped with 667 MHz 64-bit Alpha 21164 EV67 processors. A Myrinet network is used for the interprocessor communications. The MPP computer Parsytec CC is equipped with twelve 133 MHz PowerPC 604 chips. Its fast interprocessor communication channels have a bandwidth of up to 40 MBytes/s. All nodes run under the IBM AIX 4.1.4 operating system.
The SMP computer HP V2250 is equipped with 16 superscalar RISC 240 MHz HP PA-8200 chips and 16 GBytes of RAM. It runs under the HP-UX 11.0 operating system. This computer demonstrated the highest reliability. The Beowulf cluster is composed of 16 dual-processor IBM PC nodes. Every computer is equipped with two 550 MHz Pentium III chips and 512 MBytes of RAM. Each node runs under Red Hat Linux 2.2.5. All nodes are connected by a 100 MBit/s Ethernet local area network. The Beowulf cluster had insufficient reliability at testing time; however, it had the best performance/price ratio. The simulation of the above test problem on these computers yielded the following results (with an equal number of processors used):
1) MCS-1000M: 5.10 relative performance units;
2) HP V2250: 2.40 relative performance units;
3) Beowulf cluster: 1.90 relative performance units;
4) MCS-1000: 1.70 relative performance units;
5) Parsytec CC: 0.44 relative performance units.
It is worth mentioning that the equipment + system software pairs were tested. This means that a modification of important components of the system software (i.e. the high level language compiler or the MPI libraries) may substantially (by a factor of 1.5 - 2.0) change the final results. The authors do not have definite information on whether every tested MIMD system was provided with the most appropriate system software. The scaling of the program package was tested on the MCS-1000 computer system. The same problem was solved using different numbers of processors (up to 40). The number of grid cells in this variant was about 400,000 in order to make it possible to solve the problem on one processor. The obtained results are given in Table 1. This computer has a relatively low bandwidth of interprocessor data exchange, which is one of the reasons for the efficiency decrease. Another reason is the relatively small number of computational grid nodes.
Table 1. The performance dependence on the number of processors (MCS-1000).
Number of processors    Speedup    Efficiency
1                          1          100%
2                          1.95       97.5%
10                         9.4        94%
40                        34.2        85.5%
The efficiency for finer grids was measured on the MCS-1000M computer system. Increasing the total number of grid nodes up to 2,000,000 leads to efficiency growth (Table 2). The computational time on the same number
of processors increases practically proportionally to the grid size (scaling with respect to the problem size). This fact is quite natural for an algorithm based on explicit schemes. Table 2. The performance dependence on the number of processors (MCS-1000M).
Number of processors    Speedup    Efficiency
1                          1          100%
2                          1.96       98%
10                         9.6        96%
40                        36.8        92%
Acknowledgments
The investigations were supported by the Russian Foundation for Basic Research (grants No. 99-07-90388 and 99-01-01215).
References
1. Elizarova T.G. and Chetverushkin B.N., in: Mathematical Modelling. Processes in Nonlinear Media (Nauka, Moscow, 1986, in Russian).
2. Chetverushkin B.N., in: Experimentation, Modelling and Computation in Flow, Turbulence and Combustion, Vol. 1, eds. J.A. Desideri, B.N. Chetverushkin, Y.A. Kuznetsov, J. Periaux and B. Stoufflet (Wiley, Chichester, 1996).
3. Abalakin I.V., Antonov M.A., Chetverushkin B.N., Graur I.A., Jokchova A.V., Shilnikov E.V., in: Parallel Computational Fluid Dynamics: Algorithms and Results Using Advanced Computers, eds. P. Schiano et al. (Elsevier, Amsterdam, 1997).
4. Antonov M.A., Chetverushkin B.N., Shilnikov E.V., in: Proceedings of the Fourth European Computational Fluid Dynamics Conference, 7-11 September 1998, Athens, Greece (Wiley, 1998).
5. Chetverushkin B.N., Shilnikov E.V., Shoomkov M.A., in: CD Proceedings of ECCOMAS-2000, September 2000, Barcelona, Spain.
6. Antonov A., Kupzov V., Komarov V.: Pressure oscillation in jets and in separated flows (Moscow, 1990, in Russian).
7. Rizetta D.P., AIAA Journal 26(7), 799 (1988).
8. Duisekulov A.E., Elizarova T.G., Aspnas M., Computing & Control Engineering Journal 4(3), 137 (1993).
PARALLEL LOSSLESS COMPRESSION ALGORITHM FOR MEDICAL IMAGES BY USING WAVEFRONT AND SUBFRAME APPROACHES
AKIYOSHI WAKATANI
Konan University, Kobe, 658-8501, Japan
Email: [email protected]
Although lossless JPEG is inherently a sequential algorithm, it should be parallelized in order to achieve high speed lossless compression of medical images, including motion pictures (ultrasound images) and CT scan images. We propose a scalable CODEC algorithm that combines two conventional approaches to the parallelization of lossless JPEG, the wavefront and the subframe approach. We also show a preliminary performance prediction of the algorithm using an experimental CODEC model and discuss a beneficial side effect caused by the subframe approach.
1 Introduction
Recently a vast volume of information has been electronically archived or communicated in a variety of areas. Medical applications in particular deal with huge data such as medical images. Huge medical images are mainly classified into two classes, motion pictures (ultrasound images) and tomographic images (CT or MRI). Both of them consist of plural still images, but the difference is that a motion picture is a set of temporally contiguous images and a tomographic image is a set of spatially contiguous images. So far there have been many research activities on compression algorithms for motion pictures. MPEG is one of the most prominent algorithms; it consists of a DCT transformation and motion estimation. However, since MPEG loses a portion of the image information in exchange for high compression efficiency, this coding algorithm cannot be used for applications which need to preserve the information completely, such as medical applications. In order to preserve all information of the images, lossless compression algorithms, like lossless JPEG 1, have been studied, mostly for still images. In order to compress huge medical images in real time, we propose a scalable CODEC algorithm that applies two conventional approaches, the "wavefront approach" and the "subframe approach", to the parallelization of the lossless JPEG algorithm for contiguous images, and we show the effectiveness of the algorithm.
2 Lossless JPEG
JPEG is a set of compression algorithms for still pictures. The lossy version of JPEG is frequently used in many applications such as the internet. Lossless JPEG is a reversible algorithm which compresses the original image without any loss of information and expands the compressed image back to the original image completely. Because of this property, lossless JPEG is used for some medical image applications. Lossless JPEG mainly consists of prediction and coding and utilizes the 8 predictors shown in Table 1.

Pattern   Predictor
0         no prediction
1         P_{i,j-1}
2         P_{i-1,j}
3         P_{i-1,j-1}
4         P_{i,j-1} + P_{i-1,j} - P_{i-1,j-1}
5         P_{i,j-1} + (P_{i-1,j} - P_{i-1,j-1})/2
6         P_{i-1,j} + (P_{i,j-1} - P_{i-1,j-1})/2
7         (P_{i,j-1} + P_{i-1,j})/2

Table 1. Predictors for lossless JPEG
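For illustration, the eight predictors of Table 1 can be written as a single C function (a sketch assuming the pixel array P and ignoring the boundary rows and columns, which lossless JPEG treats separately):

```c
/* Prediction for pixel P_{i,j} using one of the eight predictors of Table 1. */
static int predict(int **P, int i, int j, int pattern)
{
    int a = P[i][j - 1];        /* P_{i,j-1}   */
    int b = P[i - 1][j];        /* P_{i-1,j}   */
    int c = P[i - 1][j - 1];    /* P_{i-1,j-1} */

    switch (pattern) {
    case 0:  return 0;                  /* no prediction */
    case 1:  return a;
    case 2:  return b;
    case 3:  return c;
    case 4:  return a + b - c;
    case 5:  return a + (b - c) / 2;
    case 6:  return b + (a - c) / 2;
    case 7:  return (a + b) / 2;
    default: return 0;
    }
}
```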
In order to apply this compression algorithm to motion pictures, the algorithm should consist of parallel computations. However, it is hard to parallelize lossless JPEG straightforwardly because the algorithm is inherently sequential. In particular, the decoder of lossless JPEG is hard to parallelize because the prediction of a pixel depends on the previous pixel, which is itself generated from the prediction of its own previous pixel.
3 Wavefront approach
3.1 Image segmentation
As mentioned earlier, the decoding of lossless JPEG is difficult to parallelize, so we focus on the decoding process here. For patterns 4 to 7 the prediction of P_{i,j} depends on both P_{i-1,j} and P_{i,j-1}, and for patterns 3 to 6 the prediction depends on P_{i-1,j-1}. To cope with these dependences, we divide the image space into diagonal segments, called time segments, shown in Figure 1, where W and H stand for the width and height of the image. We assume that W is larger than H for simplicity; this segmentation is called a "wavefront" 5. If time segment t is less than H, the pixels P_{k,t-k} (k = 0..t) are decoded concurrently; if time segment t is less than W and greater than H-1, the pixels P_{k,t-k} (k = 0..H-1) are decoded concurrently; and the pixels P_{k,t-k} (k = t-W+1..H-1) are decoded otherwise.
Figure 1. Wavefront approach: (a) time segment; (b) VP (virtual processor).
The pixels in the same time segment are independent from each other for all the prediction patterns (0 to 7). All pixels in each time segment can be predicted using pixels in the upper-left time segment, which were generated previously. Here we define the set of time segments with t < H as the first time set, the set of time segments with H ≤ t < W as the second time set, and the set of time segments with W ≤ t as the third time set.
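The three time sets translate directly into the range of pixels that can be decoded concurrently in a given time segment t; the following C helper (our own sketch, not code from the paper) returns that range:

```c
/* For an image of width W and height H (W >= H), return the range of
   indices k such that the pixels P_{k,t-k} belong to time segment t and
   can be decoded concurrently. */
static void segment_range(int t, int W, int H, int *k_first, int *k_last)
{
    if (t < H) {                 /* first time set  */
        *k_first = 0;
        *k_last  = t;
    } else if (t < W) {          /* second time set */
        *k_first = 0;
        *k_last  = H - 1;
    } else {                     /* third time set  */
        *k_first = t - W + 1;
        *k_last  = H - 1;
    }
}
```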
3.2 Processor allocation
A computation object which executes the decoding procedure is called a VP (virtual processor); it has a distributed memory to keep its part of the image and of the compressed data. VP_i (i = 0..H-1) is assigned to the computation of the i-th pixel of each time segment. For example, in time segment 2, VP_0, VP_1 and VP_2 deal with the pixels P_{2,0}, P_{1,1} and P_{0,2} respectively. Namely, the pixel P_{i,j} is decompressed in time segment (i+j), and the decompression is executed on VP_j if the time segment is in the first time set (t < H) or on VP_{H-1-i} otherwise. In each time set the VPs require regular communication with other VPs for most prediction patterns. For example, P_{i,j} is decompressed using P_{i,j-1} and P_{i-1,j} for prediction pattern 7. Since P_{i,j} and P_{i-1,j} are computed on VP_j and P_{i,j-1} on VP_{j-1} in the first time set, VP_{j-1} needs to send the value of P_{i,j-1} to VP_j. This communication is called "forward communication". On the other hand, for the second time set, since P_{i,j} and P_{i,j-1} are computed on VP_{H-1-i} and P_{i-1,j} on VP_{H-i}, the opposite
communication is required, which is called "backward communication". Therefore only a ring communication path is required to implement the parallel version of lossless JPEG. The communication patterns for all the predictors are summarized in Table 2.
predict.   first   second   third
0          none    none     none
1          for     none     none
2          none    back     back
3          for     back     back
4          for     back     back
5          for     back     back
6          for     back     back
7          for     back     back

Table 2. Communication pattern
3.3 Communication issue
Each VP must be allocated to a real processor. When the number of VPs is larger than the number of RPs (real processors), several VPs are allocated to the same RP. That is, RP_i emulates the behavior of VP_i, VP_{i+n}, VP_{i+2n} and so on, where n stands for the number of RPs. As described in the last section, regular communication, which may degrade the system performance, is required for each decoding step. So, in order to reduce the communication (invocation) cost, communications with the same source and the same sink should be vectorized into one message, which is called "message vectorization" 2. Namely, for forward communication, VP_i, VP_{i+n} and VP_{i+2n} need to send data to VP_{i+1}, VP_{i+n+1} and VP_{i+2n+1}. However, since VP_i, VP_{i+n} and VP_{i+2n} are allocated to RP_i and VP_{i+1}, VP_{i+n+1} and VP_{i+2n+1} are allocated to RP_{i+1}, all these communications are vectorized into one message from RP_i to RP_{i+1}. The number of communication invocations is reduced, so the total communication overhead is amortized. Message delivery consists of two phases: sending a message and receiving a message. For forward communication, RP_i sends a message to RP_{i+1} and receives a message from RP_{i-1}. The longer the message, the longer the delay between the source and the sink of the communication. Thus, in order to hide the communication delay behind the computation, the vectorized message should be divided into chunks. We call this technique "message strip-mining" 3. That is, if RP_i emulates N VPs and communicates a message of size N after the computation of the decompression on a time segment, N is divided into G pieces (the size of the divided message, M, is N/G). First RP_i emulates the first M VPs and sends a message of size M. Then RP_i starts the next M VPs and receives the message which was sent just after the first
computation. Therefore the communication delay of the first message can be hidden behind the computation for the second VP set. By iterating the above procedure, the communication delay can be ignored except for the first one.
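A possible MPI realization of message vectorization combined with strip-mining is sketched below (hypothetical code, not the author's implementation): the vectorized message of a real processor is split into G chunks, and the transfer of each chunk is overlapped with the computation of the next one by means of non-blocking sends and receives.

```c
#include <mpi.h>

/* Strip-mined forward communication: RP "rank" emulates nvp VPs, split
   into G chunks of size M = nvp/G.  Each chunk is computed, its boundary
   values are sent to rank+1 with a non-blocking send, and the matching
   values from rank-1 are received while the next chunk is processed.
   compute_chunk() and the buffers stand in for the real decoder. */
static void strip_mined_step(double *sendbuf, double *recvbuf,
                             int nvp, int G, int rank, int size,
                             void (*compute_chunk)(int first, int count))
{
    int M = nvp / G;                        /* assume G divides nvp */
    int right = (rank + 1 < size) ? rank + 1 : MPI_PROC_NULL;
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    MPI_Request req[2] = { MPI_REQUEST_NULL, MPI_REQUEST_NULL };

    for (int g = 0; g < G; g++) {
        compute_chunk(g * M, M);            /* emulate the next M VPs */
        /* finish the transfer started for the previous chunk, if any;
           its delay was hidden behind the computation just done */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        MPI_Isend(sendbuf + g * M, M, MPI_DOUBLE, right, g,
                  MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(recvbuf + g * M, M, MPI_DOUBLE, left, g,
                  MPI_COMM_WORLD, &req[1]);
    }
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);  /* last chunk's transfer */
}
```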
4 Parallelism
We now assess the effectiveness of the algorithm analytically. The total computational complexity of the decoding is roughly the number of pixels multiplied by the computation cost per pixel, because the communication cost can ideally be hidden behind the computation. So if we assume that the computation cost for each pixel is 1, the total is W × H. For each time segment, the RPs emulate VPs repeatedly, but the number of emulations differs between RPs. For example, if the number of VPs is 7 and the number of RPs is 3, two of the RPs emulate VPs twice but one of them emulates VPs three times. Thus the number of VP emulations per segment is the least integral value greater than or equal to nVP/nRP, where nRP and nVP stand for the number of RPs and VPs respectively. The computational cost when nRP processors are used is described below.
cost = Σ_{i=0}^{H-1} ⌈(i+1)/nRP⌉ + Σ_{i=H}^{W-1} ⌈H/nRP⌉ + Σ_{i=W}^{W+H-2} ⌈(W+H-1-i)/nRP⌉
parallelism = total/cost
According to the above equation, the parallelism for a standard size (512 × 512) image is almost linear in the number of processors up to 256.
5 Subframe approach
Since an image is a set of independent pixels, the image can be divided into several subframes and any image processing, including image clustering 6, can be applied to each subframe independently to enhance the parallelism of the algorithm. This scheme is called the "subframe approach". Although the subframe approach can be applied to compression algorithms as well, the compression rate may be affected. The dark side of the effect is that when subframes are compressed independently, header information must be added to each piece of compressed data, so the total compressed data includes plural header information. Namely, when the number of divisions of the image is d and the size of the header information is h, the additional (d - 1) × h bytes are redundant and may spoil the benefit of the subframe approach. On the other hand,
the bright side of the effect is that a different predictor can be used for each subframe, so the optimal predictor can improve the total compression rate. The original compression rate is 0.428 for a CT image. While the rate is 0.406 with 4 by 4 divisions, it is 0.538 with 32 by 32 divisions. Namely, with a moderate number of divisions the compression rate can be improved by 5% because the optimal predictor can be applied to each subframe, but as the number of divisions increases, the cost of the redundant header information increases as well. The compression rate for an ultrasound image is also improved by 6% with 4 by 4 divisions.
6 Experimental CODEC model
We describe an experimental CODEC model with the wavefront and subframe approaches in order to show the feasibility of our algorithm. Let N be the number of processors, and let total_org and total_enc be the total size of the original image and of the encoded data respectively. Note that the compression rate is total_enc/total_org. It is also assumed that only the master processor can access the I/O system, and thus the other processors must communicate with the master processor in order to access the I/O system.
Encoder
Original image should be encoded with different 8 predictors and then the optimal predictor should be chosen among them. Namely each encoding process with different predictor can be carried out independently. Therefore, the original image should be divided into y subframes and 8 processors should encode each subframe with different predictor concurrently. Procedure 1. The master processor iterates the broadcast of each subframe data with the size of total-orgj^ to 8 processors (processor group) y times. 2. Each processor encodes the subframe with one predictor respectively. 3. The optimal predictor should be chosen with each processor group. 4. The encoded data with the size of total-org/'y processor group to the master processor.
should be sent from each
Step 2 is a main part of encoding. When the communication cost of one data per pixel is larger than the computation cost of one data per pixel, the total elapsed time is dominated by the communication time even if the
252
communication can be overlapped with the computation. However, since the communication cost is expected to be very slight on the dedicated system to our algorithm, the elapsed time for the second step dominates the total time and thus the parallelism of the encoder can reach N. 6.2
Decoder
Unlike the encoding, the decoding process uses only one predictor to reconstruct the original data from the encoded data. As mentioned in earlier section, since the decoding process has data-dependency between adjacent pixels, the wavefront approach is adopted for parallel processing. As the image is decomposed into y subframes, each subframe should be decoded by the wavefront approach with 8 processors in order to exploit maximum parallelism. Procedure 1. The master processor iterates the broadcast of each encoded data with the size of about total-enc/ ^ to processor group y times. 2. Each process group should uncompress the encoded data to the original subframe data with the wavefront approach in parallel. 3. The original subframe data should be combined within the process group and should be sent to the master processor later. As described in section 3, enough parallelism can be exploited for step 2 described above. However, on the real system, the cost of the neighboring communication that comes from the wavefront approach may degrade the total performance. We will consider this issue in the following subsection. 6.3
Communication
Experiment
In order to estimate the communication cost for the step 2 of the decoder mentioned above, we make a (limited) experiment. We implement our algorithm on PC cluster (4PC of Celeron 500MHz with Fast Ethernet and MPICH under Linux) and measure the parallelism of the decoding process with varying the vector size of the message to be sent between processors. Figure 2 shows the speedup of the decoding process for predictor 1 with varying the vector size. Although the communication cost degrades the performance in some degree, our experiment shows that enough parallelism can be extracted and our algorithm is feasible even on PC cluster. Therefore our algorithm with two approaches is expected to provide enough performance for realtime applications when the dedicated communication system is available.
253 3b
2CPUs --•— 4CPUs -—o—
speedup
3 2.5 2, 1.5
•
•
1 100 vector size
Figure 2. Performance of decoding with the wavefront approach (CT image [512x512])
7
Conclusion
We have described how lossless JPEG can be parallelized by two different approaches in order to improve the compression speed for contiguous images, particularly for medical applications. The wavefront approach requires communication exchange, but some optimizations can reduce the communication cost dramatically. We also confirm that the subframe approach not only enhances the parallelism of the algorithm but also improves the compression rate for a moderate number of divisions of the image. Therefore our algorithm can provide a solution for a variety of applications that require lossless compression, whether it is built on an on-chip architecture or on large computer systems.
References
1. Kongli Huang, "Experiments with a Lossless JPEG Codec", Thesis, Cornell Univ. (1994).
2. Chau-Wen Tseng, "An Optimizing Fortran D Compiler for MIMD Distributed-Memory Machines", PhD Dissertation, Rice Univ. (1993).
3. Akiyoshi Wakatani and Michael Wolfe, "Effectiveness of Message Strip-Mining for Regular and Irregular Communication", Proc. of 7th Int'l Conf. on Parallel and Distributed Computing Systems (1994).
4. Kongji Huang and Brian Smith, "Lossless JPEG Codec (Version 1.0)", ftp.cs.cornell.edu in pub/multimed/ljpg, Cornell Univ. (1994).
5. Akiyoshi Wakatani, "Scalable Lossless Compression Algorithm for Motion Pictures", IEEE Trans. on Consumer Electronics, Vol. 44, No. 2 (1998).
6. Akiyoshi Wakatani, Yoshiteru Mino and Hiroshi Kadota, "An Evaluation of the Subframe Parallel Approach for Image Clustering", Systems and Computers in Japan, Vol. 27, No. 14 (1996).
ADAPTIVITY AND PARALLELISM IN SEMICONDUCTOR DEVICE SIMULATION
HELMUT WEBERPALS AND STEFAN THOMAS
Technische Informatik, Technische Universitat Hamburg-Harburg, Schwarzenbergstrasse 95, D-21071 Hamburg, Germany
E-mail: [email protected]
Simulating complex systems interactively yields insights which are inaccessible to theory or experiment. With increasing complexity of a system, however, the space as well as the time complexity of a simulation rises, too; therefore, interactive simulation calls for parallel processing. As an application we present the simulation of semiconductor devices, in which geometry, doping, and material parameters interact in a complex manner. A short discussion of the physical effects found in modern semiconductor devices motivates the three-dimensional hydrodynamic model equations. Using a transformation of the natural variables allows us to apply fast elliptic solvers. In order to speed up the solution process, we reduce the number of unknowns by adaptive local grid refinements. Finally, parallel processing leads to an additional speed-up which turns out to be scalable over a wide range of processors.
1
Introduction
In order to simulate the physical effects encountered in submicron semiconductor devices, the full set of the three-dimensional hydrodynamic model equations must be solved with high spatial resolution for the following reasons:
• The geometrical structure of a semiconductor device is truly 3-dimensional; thus the streamlines of the current cannot be modelled adequately by a lower dimensional simulation.
• The effect of hot electrons requires taking the energy balance into account, because the collisions between electrons absorb energy from the electric field.
• The effect of velocity overshoot, which is encountered in a narrow region around the drain, requires a high spatial resolution of the computation.
To run this complex simulation interactively we followed a hop-skip-and-a-jump (1-2-3) approach:
1. Find the best model and solver.
2. Apply adaptive refinement.
3. Exploit parallelism with respect to both the solver and the adaptivity.
As an example to illuminate the concepts of this paper we have chosen an n-channel MOS transistor; for later reference its geometry is sketched in Fig. 1.
Figure 1. The geometry of an n-channel MOS transistor
2
The model equations
The hydrodynamic model of semiconductor device physics is based on the following equations for electrons and holes of charge e and mass m: the Poisson equation 2 for quasi-static fields
∇ · (−ε ∇φ) = e (p − n + N)    (1)
where
The first three integrals of the Boltzmann equation yield the conservation of:
o the charge density,
∂n/∂t − (1/e) ∇ · J_n = −R    (2)
∂p/∂t + (1/e) ∇ · J_p = −R    (3)
where R denotes the net recombination rate and J the current density,
o the current density, which yields an explicit expression for J, and
o the energy density,
∂(n w_n)/∂t + ∇ · S_n = J_n · ∇φ − …    (4)
together with the corresponding equation for the hole energy density    (5)
where nwn and pwp denote the energy density of the electrons and holes, respectively, and S the energy flux. From these equations we obtain the objective functions: • the electric potential
1. regularity of refinement and 2. non-rectangular domains. In order to take advantage of adaptivity we need a criterion where to refine the grid. Since we do not know the exact solution, we cannot control the exact error; but as a substitute we may use the residual, i.e. the deviation that we get when substituting an approximate solution into the equations. If the residual at a grid point exceeds a given accuracy, this grid point is called critical. Since the numerical stability of the solution process has to be guaranteed in the first place, we have to restrict ourselves to regular refinements in which each edge is refined only once. Fig. 2 illustrates a two step regular refinement in one dimension where the critical points are marked by a cross. The adaptive refinement reduces the space complexity significantly as is
h,
-o-
*~^*r
-O- h0/2 h0/4
Figure 2. Regular grid refinement in one dimension
obvious from Fig. 3 which shows the refinement in the back plane of Fig. 1. After three steps of successive refinement the finest grid comprises approximately 166000 points compared with approximately 790000 points which a uniform grid of the same spacing would have; thus we gain a factor of approximately 5. Whereas a uniform discretization of a semiconductor device leads to rectangular domains, the refined grid in Fig. 3 exhibits an irregular boundary. This problem is solved by padding the boundary with a layer of shadow points which guarantee that every point in the computational domain is surrounded by the same number of neighbouring points. For an efficient implementation of the matrix-vector multiplication the points on the refined grid and their shadow points are renumbered columnwise
258
0.2
0.4
0.6
0.8
0.3
0.6 x/|xm
1.2
0.9
Figure 3. A 3-level hierarchy of adaptive grids in the back plane y = 0 of Fig. 1
Figure 4. An adaptive grid and its coefficient matrix
as is sketched in Fig. 4. This results in a coefficient matrix which combines sparsity with a diagonal structure, if we take into account the interaction with shadow
points, which are marked by dots in Fig. 4, by modifying the right hand side. It is this structure that we exploit in the following.
4 Adaptivity and Parallelism
In order to speed up the simulation process even further we apply parallel processing. To this end we subdivide the computational grid into slices, each of which is taken care of by one processor. The boundary layers of each slice are padded with a plane of shadow points which serve as a buffer for holding the boundary values of the neighbouring slice. As the computation refines the computational grid adaptively, the number of grid points per processor becomes imbalanced. Fig. 5 exemplifies the situation in the case of three processors: in the beginning each processor is responsible for a 4 x 5 grid; when the critical points marked by a cross have been refined, processor 1 holds 49, processor 2 holds 12, and processor 3 holds 24 grid points. In order to keep the number of grid points per processor nearly constant, which means approximately 20, we introduce a fourth processor and redistribute the slices as sketched in Fig. 6. Although the redistribution of the computational grid requires communication among processors, we must refrain from administrating the grid in a
Figure 5. The imbalance caused by adaptive grid refinement
(1)
v =
22
P) 21
(3) V =
22
4) - 2i n j< ' = 0
Figure 6. The redistribution of grid points during adaptive refinement
centralised fashion, because this would inevitably prevent the parallel simulation from being scalable. The problem becomes even more complicated if we take into account that the number of active processors may change during the computation. By a global sum operation the processors communicate the total number of active grid points. To exploit the diagonal structure of the coefficient matrix of Sect. 3, only complete planes are assigned to each processor by a rounding procedure. This initial distribution is improved by a few iterations of a diffusion method. Fig. 7 summarizes the result of our decentralised redistribution of grid points. Whereas the imbalance of grid points on 8 processors leads to a 40 percent deviation from the mean computation time, the redistribution achieves a 2 percent deviation! Summing up, our simulation, which was written from scratch, achieves an efficiency of 79 percent using 61 processors of a Cray T3D; the execution time decreases from 2.5 hours on a single processor to 3 minutes. This means that simulating complex systems interactively is feasible!
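A minimal sketch of the decentralised redistribution step in C with MPI is given below (our own illustration; it assumes that the per-plane point counts are known on every processor after refinement, which the paper does not state explicitly): one global sum yields the total number of active grid points, and every processor then derives the same assignment of complete planes by a rounding rule.

```c
#include <mpi.h>

/* Assign whole grid planes to processors: a global sum gives the total
   number of active points, every rank computes the same target share,
   and each plane goes to the rank whose share covers the running sum.
   Returns the inclusive range [first_plane, last_plane] owned by this
   rank (empty if first_plane > last_plane). */
static void assign_planes(const int *plane_points, int nplanes,
                          int my_points, MPI_Comm comm,
                          int *first_plane, int *last_plane)
{
    int rank, size, total = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* global sum of active grid points (each rank contributes its own) */
    MPI_Allreduce(&my_points, &total, 1, MPI_INT, MPI_SUM, comm);
    double target = (double)total / size;

    *first_plane = 0;
    *last_plane  = -1;
    long acc = 0;
    for (int p = 0; p < nplanes; p++) {
        int owner = (int)(acc / target);     /* rounding rule */
        if (owner >= size) owner = size - 1;
        if (owner == rank) {
            if (*last_plane < 0) *first_plane = p;
            *last_plane = p;
        }
        acc += plane_points[p];
    }
}
```

Because every processor evaluates the same deterministic rule on the same global information, no central coordinator is needed; a few diffusion iterations can then smooth the remaining imbalance, as described above.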
Figure 7. Adaptivity and load balancing
5
Conclusions
From this application we may draw the following conclusions:
• Parallel computing opens the door to an interactive simulation of complex systems in science and engineering.
• Adaptive computing paves the way to handling complex systems and has to go hand in hand with parallel processing.
• Adaptive parallel computing requires that the complete solution process be local and decentralised in order to achieve scalability.
References
1. C. Cercignani: The Boltzmann Equation and its Applications. New York, NY: Springer (1988).
2. J. D. Jackson: Classical Electrodynamics. New York, NY: Wiley, 3rd edition (1999).
3. Q. Lin, N. Goldsman, and G. Tai: A globally convergent method for solving the energy balance equation in device simulation. Solid-State Electronics 36 (1993) 411-419.
4. J. Slotboom: Computer-aided two-dimensional analysis of bipolar transistors. IEEE Trans. Electron Devices 20 (1973) 669-679.
ALGORITHMS
ANALYSIS OF COMMUNICATION OVERHEAD IN PARALLEL CLUSTERING OF LARGE DATA SETS WITH P-AUTOCLASS
STEFANO BASTA
ISI-CNR, Via P. Bucci, cubo 41-C, Rende (CS), Italy
E-mail: [email protected]
DOMENICO TALIA
DEIS, Universita della Calabria, Via P. Bucci, cubo 41-C, Rende (CS), Italy
E-mail: [email protected]
This paper focuses on the advantages derived from the use of P-AutoClass, an MIMD parallel version of the AutoClass system, for parallel data clustering. AutoClass is an unsupervised data mining system that finds optimal classes in large data sets. The P-AutoClass implementation divides the clustering task among the processors of a multicomputer so that each processor works on its own partition and exchanges intermediate results with the other processors. In particular, we point out that a main reason for P-AutoClass scalability is that communication costs do not depend on the size of the data to be analyzed. In fact, our analysis shows that the overhead due to data exchange among processors depends on the number of attributes and classes but not on the data set size. We present theoretical considerations that support this P-AutoClass feature and illustrate some experiments showing this behavior.
1 Introduction
Clustering is an unsupervised learning technique that separates data items into a number of groups or clusters such that items in the same cluster are more similar to each other and items in different clusters tend to be dissimilar, according to some measure of similarity or proximity. Differently from supervised learning, named classification, where training examples are associated with a class label that expresses the membership of every example to a class, clustering assumes no information about the distribution of the objects. Therefore the clustering task must both discover the classes present in a data set and assign items to such classes in the best way. Most of the early clustering analysis algorithms come from the area of statistics and were originally designed for relatively small data sets. In recent years, clustering techniques have been efficiently extended to deal with large databases, so that they play a role in the field of Knowledge Discovery [7]. Clustering algorithms are very computing demanding and thus require the use of high-performance machines to get results in a reasonable amount of time. Executions of clustering algorithms taking one week or about 20 days of computation time on sequential machines are not rare [8]. Scalable parallel computers can provide the appropriate setting to efficiently execute clustering algorithms for extracting knowledge from
large-scale databases. Recently, there has been an increasing interest in parallel implementations of data clustering algorithms. Generally, very large data sets are analyzed by data mining algorithms, thus performance scalability is possible only when the computation overhead does not strictly depend on the data size. Parallel approaches to clustering can be found in [6, 11, 12]. In this paper we consider P-AutoClass [5], a distributed-memory parallel implementation of a clustering algorithm based on Bayesian classification. We discuss this parallel clustering algorithm and, in particular, evaluate the effect of the communication overhead on the global computation time. We show some effective advantages that justify the effort of parallelizing AutoClass. The rest of the paper is organized as follows. Section 2 introduces clustering methods and provides an overview of Bayesian classification and sequential AutoClass. Section 3 outlines the main features of P-AutoClass. Section 4 discusses a very important property of P-AutoClass: the communication overhead does not depend on the number of tuples of a data set but on the number of attributes and classes. This makes the parallel algorithm scalable when the size of the analyzed data sets increases. On the basis of this behavior, it seems very appealing to use P-AutoClass for speeding up the clustering of large data sets. Section 5 contains some experiments validating these theoretical results. Finally, Section 6 draws some conclusions.
2 Clustering and AutoClass
Clustering techniques play a major role in the data mining of large data sets and have received increasing interest in various fields such as sociology, biology, statistics, artificial intelligence, and information retrieval. Clustering algorithms can roughly be classified into the following three types [4]: hierarchical, partitional and density or mode-seeking. Hierarchical methods generate a nested sequence of clusters by a hierarchical decomposition of a set of N objects, represented by a dendrogram. Partitional methods produce a partition of a set of objects into K clusters by optimizing a given criterion function; one of the best known criteria is the squared error criterion. K-Means clustering [1] is a well-known and effective method for many practical applications that employs the squared error criterion. Density search techniques [10] consider objects as points in a metric space and suggest that clusters should be those parts of the space characterized by a high density of data. High density regions are called modes. A statistical density-search-based approach considers dense regions of the probability density of data objects as different groups. This approach uses statistical concepts to represent the probability density function through a mixture model. The Bayesian approach to unsupervised classification, used in AutoClass [3], provides a probabilistic method for inductive inference. The Bayesian classification,
rather than just partitioning objects into distinct classes, searches for the best class descriptions that predict the data objects in a model space. In this approach class membership is expressed probabilistically, that is, an instance is not assigned to a unique class; instead it has a probability of belonging to each of the possible classes. The classes provide probabilities for all attribute values of each instance. Class membership probabilities are then determined by combining all these probabilities. The class membership probabilities of each instance must sum to 1, thus there are no precise boundaries for classes: every instance must be a member of some class, even though we do not know which one. Bayesian theory is based on the assumption that the objects under consideration are governed by probability distributions and that predictions on new objects can be made by reasoning about these probabilities and the observed objects. Let D = {X_1,...,X_m} denote the observed data objects, where instances or items X_i are represented as ordered vectors of attribute values X_i = {X_i1,...,X_ik}. Unsupervised classification aims at determining the best class description (hypothesis) h from some space H that predicts the data D. The term best can be interpreted as the most probable hypothesis given the observed data D and some prior knowledge on the hypotheses of H in the absence of D, that is, the prior probabilities of the various hypotheses in H when no data have been observed. Bayes' theorem provides a way to compute the probability of the best hypothesis, given the prior probabilities, the probabilities of observing the data given the various hypotheses, and the observed data.
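In the notation introduced above, the inference rule referred to here is Bayes' theorem:

P(h|D) = P(D|h) P(h) / P(D),   with   P(D) = Σ_{h'∈H} P(D|h') P(h'),

and the best class description is the hypothesis h that maximizes the posterior probability P(h|D).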
P-AutoClass
Parallel execution of data mining algorithms require dividing up the computation, so that each processing element can perform a part of the data mining task in parallel with the other processing elements of a parallel computer. The results are thus achieved as fast as possible. In doing this parallelization it is important to split the computational task in a balanced way among the processing elements, but it is equally important to minimize the data communication among them during the data mining process and separate accesses to the data set. The execution of AutoClass on large data sets requires times that in many cases are very high. For example, in the clustering of a satellite image, AutoClass took more than 130 hours [9] and for the analysis of protein sequences, the discovery process required from 300 to 400 hours [8]. These considerations and experiences suggested the necessity to implement a parallel version of AutoClass, called P-AutoClass, to mine very large data sets in a reasonable time. This has been done by exploiting the inherent parallelism presents in the AutoClass algorithm implementing it in parallel, using the MPI tool kit, on MIMD multicomputers [5]. Among the strategies for parallelization, has been selected the Single Program Multiple Data (SPMD) approach because it does not
268
require to replicate the entire data set on each processor, as other strategies do. That strategy also does not raise load balancing problems because each processor executes the same code on a data set of equal size. Finally, most of the operations are performed locally on each processor. Other parallel prototypes of AutoClass have been designed, but those have mainly been realized on SIMD parallel machines by using parallelizing compilers that automatically generated data-parallel code starting from the sequential code [3, 15] or have been implemented running in parallel several sequential AutoClass programs and combining results [16]. This last case does not guarantee sequential AutoClass semantics and results, whereas SIMD implementations cannot exploit all the inherent parallelism of the algorithm. 3.1
Design of the parallel algorithm
The main part of the algorithm is devoted to classification, generation and evaluation of generated classification. This loop is composed of a set of substeps, among which the new try of classification step is the most computationally intensive. In AutoClass class membership is expressed by weights. This step computes the weights of each item for each class and then it computes the parameters of classification. These operations are executed by the base_cycle function which calls the three functions updatewts, update parameters and update approximations. The time spent in the basecycle function is about the 99,5% of the total time of the algorithm. Therefore, this function has been identified as mat one where parallelism must be exploited to speed up the AutoClass performance. In particular, if we analyze the time spent in each of the three functions called by base_cycle, it appears [5] that the updatewts and updatejparameters functions are the most time consuming functions whereas the time spent in the update approximation is negligible. Therefore, P-AutoClass is based on the parallelization of these two functions using the SPMD approach. To maintain the same semantics of the sequential algorithm of AutoClass, the parallel version is based on partitioning data and local computation on each of P processors of a distributed memory MIMD computer and on exchanging among the processors all the local variables that contribute to form the global values of a classification. The function update wts calculates the weights wy for each item / of the active classes to be the normalized class probabilities with respect to the current parameterizations. The parallel version of this function first calculates on each processing element the weights wtj for each item belonging to the local partition of the data set then it sums the weights w,- of each class j (wj=2^ Wjj) relatively to its own data. Then all the partial w, values are exchanged (using the MPI_Allreduce operation) among all the processors and summed in each of them to get the same value in every processor. The updateparameters function computes for each class a set of class posterior parameter values, which specify how the class is distributed along the various
269 attributes. In the parallel version of this function, the partial computation of parameters is executed in parallel on all the processors, then all the local values are collected on each processor (using also in this case the MPI_Allreduce operation) before to utilize them for computing the global values of the classification parameters. 4
Reasoning about Communication Time
Before analyzing communication time, we discuss how the computation time of the sequential and parallel versions of AutoClass can be modeled by simple functions. The computation time Ts of sequential AutoClass depends on the number of classes, the number of instances, the number of attributes, the number of tries and the number of cycles that must be executed in each try until convergence is reached: Ts(l, Xb Cj) = 0(NTCT X, A Cj). Notation Cj
X, A P
Definition number of classes number of data instances number of attributes number of processors
Notation NT R-TI.T
CT
Definition the number of tries time to reduce n data of type T number of cycles per try
Table 1. Notation used in this section. In the parallel version of AutoClass the compute time TP depends on the size of a single partition of the instances, not on the total data set. Furthermore, two terms that take into account the time to exchange (reduce) data classification (of double type) among the processors must be considered. In particular, the term R)d (Cj +1) is related to the reduce operations in the update wts function, and the term R3d CjA is related to the reduce operations in the updateparameters function. Then the compute time in P-AutoClass is expressed by the following formula: TP (P, Xb Cj) =0(NTCT
f(X,/P) ACj+ R,,d (Cj +1) + R3.d CjAJ)
= 0(Ts/P + NT CT [R,,d (Cj +1)+ R3.d CjAJ). From these formulas, we can observe that the communication time is proportional to the number of classes and attributes and not to the number of data items. This is because, differently from other parallel data mining algorithms, in P-AutoClass processors exchange values related to attributes, classes and classification parameters. Therefore, the amount of data exchanged among the processors does not depend on size of mined data sets but on the number of attributes composing tuples and clusters in which data are grouped. This property assures that the communication overhead of P-AutoClass does not increase when the number of instances in the data set increases, but it is higher when the items consists of many attributes. However, the number of attributes is very low
270
in comparison with the number of attributes. In conclusion, we can deduce that PAutoClass is a scalable algorithm for clustering massive data repositories. In the following section, we show some experiments we performed to verify theoretical hypothesis on communication. 5
Experimentation
We run our experiments on a Meiko CS 2 with up to 10 SPARC processors connected by a fat tree topology with a communication bandwidth of 50 Mbytes/s in both directions. We describe here results on two data sets: the synthetic data set DS1 described in [13] which is composed of 100000 items with 2 real attributes and the IRAS Low resolution Spectral Atlas [2]. This data set consists of 5425 mean spectra of IRAS point sources, where each spectrum is a multi-dimensional data that consists of 100 channels ( so we have 100 attributes per tuple). To perform our experiments on the first data set we used different partitions of data from 5000 tuples to the entire data set, and asked the system to find the best clustering starting from different numbers of clusters. Each classification has been repeated 10 times and times presented here represent the mean values obtained after these classifications. We measured the elapsed time and the absolute speedup (Sp=Ti/Tp) of P-AutoClass in comparison with sequential AutoClass. For the sake of brevity, we only show the speedup (figure 1 and figure 2) in clustering the DS1 and IRAS data sets.
** 1
4
4
4 ! 4 4
a -
s
J^
9
& ,
t ^
r
*
4 \
«
i
2i
S\
4
1
2
3
4
5
6
7
B
B
number of processors
Figure 1. Speedup for DS1.
ID
number of processors
Figure 2. Speedup for IRAS.
From figures 1 and 2, we can notice that for the DS1 data set the speedup confirms the advantages obtained using several processors; differently, in the case of IRAS data set, the speedup is limited (for 7 processors speedup is 3.2 and for 10 processors it is 3.5). Furthermore, to verify the effect of the size of data set on the
271
application speedup, we doubled the 5425 spectra of IRAS data set. On this artificial data set composed of 10850 tuples we got a better speedup (not shown in detail here): using 10 processors the speedup is 4.7. This suggests that as the size of the data set increases the application can get a larger benefit from the use of a large number of processors. However, on these speedup results there is a significant influence of communication overhead that in the first case (DSl) is low and in the second one (IRAS) is significantly higher. To confirm our considerations on communication time, it suffices observing the following two figures. It should be mentioned that the influence of number of classes on the performance of PAutoClass on these two data sets is comparable; in fact for DSl the number of classes discovered is 100 and for IRAS classes obtained are about 80. 400 350
*^dL^_
300
^^
y
/
. 250 - predicted
<"•*
• 200
-measured 1
•
!
Figure 3. Communication time for DSl data set. 2500-, 2000
|
1000
^ ,'
- - j ^
5
—•—measured -—--predcted
5000
Figure 4. Communication time for IRAS data set. Comparing figures 3 and 4, the great difference between DSl and IRAS overhead times (about 300 seconds vs about 2000 seconds on 10 procs) can be explained considering that each tuple of DSl consists of only 2 attributes whereas each tuple in the IRAS data set consists of 100 attributes. For example, on 4 processors, the overhead for DSl is 230 seconds (7% of elapsed time), whereas in the IRAS case,
272
on the same number of processors, the communication overhead is 1020 seconds (17% of elapsed time). These results show how the number of attributes contributes significandy to the inter-process communication overhead in P-AutoClass. Figures 3 and 4 show also a good accordance between measured and theoretical results using the formula in section 4 where the Rid and RSJ terms are modeled Meiko CS-2 machine on the basis of a formula given in [14]: Rxd = 55.94 log p + (0.0167 log p)m where m is the size of the exchanged message andp the number of processors. 6
Conclusions
Clustering is an important data mining task that is very helpful when no prior information is available about the structure of data to be mined. Clustering algorithms are computationally intensive, particularly when large data sets must be classified. Therefore, parallel design and implementation of clustering algorithms on high-performance computers can offer a valid support to solve this problem. In this paper we discussed the communication overhead of P-AutoClass, a parallel implementation of the AutoClass algorithm based upon the Bayesian method for determining optimal classes in the area of knowledge discovery. We evaluated the P-AutoClass algorithm focusing our attention on inter-process communication to underline the advantage of using this parallel data mining algorithm for large data sets, especially when data items are composed of a limited number of attributes. 7
Acknowledgements
We would like to thank G. Raimondo for his work on P-AutoClass evalution. This work has been performed when D. Talia was with ISI-CNR. References 1. K. Alsabti, S. Ranka, V. Singh, An Efficient K-Means Clustering Algorithm. Proceedings of the First Workshop on High Performance Data Mining, Orlando, Florida, 1998. 2. P. Cheeseman, J. Stutz, M. Self, W. Taylor, J. Goebel, K. Volk and H. Walker, Automatic Classification of Spectra from the Infrared Astronomical Satellite (IRAS), NASA Reference Publication 1217, 1989. 3. P. Cheeseman and J. Stutz, Bayesian Classification (AutoClass): Theory and Results, in Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, pp. 61-83, 1996. 4. B. Everitt, Cluster Analysis, Heinemann Educational Books Ltd, London, 1977.
273
5. D. Foti, D. Lipari, C. Pizzuti, D. Talia, "Scalable Parallel Clustering for Data Mining on Multicomputers", Proc. of the 3rd Int. Workshop on High Performance Data Mining HPDMOO-IPDPS, LNCS, Springer-Verlag, Cancun, Mexico, May 2000. 6. A.A. Freitas and S.H. Lavington, Mining Very Large Databases with Parallel Processing, Kluwer Academic Publishers, 1998. 7. U.M. Fayyad, G. Piatesky-Shapiro and P. Smith. From Data Mining to Knowledge Discovery: an Overview. In U.M.Fayyad et al. (Eds) Advances in Knowledge Discovery and Data Mining, pp. 1-34, AAAI/MIT Press, 1996. 8. L. Hunter and D.J. States, Bayesian classification of protein structure, IEEE Expert, 7(4), pp. 67-75, 1992. 9. B. Kanefsky, J. Stutz, P. Cheeseman, and W. Taylor, An improved automatic classification of a Landsat/TM image from Kansas (FIFE), Technical Report FIA-94-01, NASA Ames Research Center, May 1994. 10. A.K. Jain and R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988. 11. D. Judd, P. McKinley and A. Jain, Performance Evaluation on Large-Scale Parallel Clustering in NOW Environments, Proc. of the Eight SIAM Conf. on Parallel Processing for Scientific Computing, Minneapolis, March 1997. 12. C.F. Olson, Parallel Algorithms for Hierarchical Clustering, Parallel Computing, 21:1313-1325,1995. 13. T. Zhang, R. Ramakrishnan and M. Livny, BIRCH: An Efficient Data Clustering Method for Very Large Databases, Proceedings of the ACM SIGMOD, Montreal, Canada, pp. 103-114, June 1996. 14. G. Folino, G. Spezzano, D. Talia, Evaluating and Modeling Communication Overhead of MPI Primitives on the Meiko CS-2, Recent Advances in Parallel Virtual Machine and Message Passing Interface - Proc. of EuroPVM/MPI98, LNCS 1497, Springer-Verlag, pp. 27-35, September 1998. 15. J.T. Potts, Seeking Parallelism in Discovery Programs, Master Thesis, University of Texas at Arlington, 1996. 16. B. Wooley, Y. Dandass, S. Bridges, J. Hodges, and A. Skjellum. 1998. Scalable Knowledge Discovery from Oceanographic Data. In Intelligent engineering systems through artificial neural networks. Volume 8 (ANNIE 98), ASME Press, pp. 413-24.
A C O A R S E - G R A I N PARALLEL SOLVER FOR P E R I O D I C RICCATI EQUATIONS PETER BENNER Zentrum fur Technomathematik, Fachbereich 3/Mathematik und Informatik, Universitat Bremen, D-28334 Bremen, Germany; bennerfflmath.un.i-bremen.de RAFAEL MAYO, ENRIQUE S. QUINTANA-ORTi Departamento de Ingenieria y Ciencia de Computadores, Universidad Jaume I, 12080-Castellon, Spain; {mayo,quintana}
1
Introduction
In this paper, we analyse the parallel solution of the discrete-time periodic Riccati equation (DPRE) Xk=Qk+AjXk+1Ak - AlXk+1Bk(Rk nxn
+ BlXk+,Bk)-^BlXk+lAk,
nxm
rxn
nxn
^> mxm
where Ak G R , Bk G R , Ck G R , Qk G R , Rk G R , and Xk G R n x n is the sought-after solution. The equation is said to be periodic as for some integer p > 0, Ak = Ak+P, Bk = Bk+p, Ck = Ck+P, Qk = Qk+P, and Rk = Rk+P. Under mild conditions, the periodic symmetric positive semidefinite solution Xk is unique 7 . DPREs arise in the analysis and design of periodic linear control systems, e.g., the periodic linear-quadratic optimal control problem, model reduction of periodic linear systems, etc. 7 Periodic linear systems naturally arise when performing multirate sampling of continuous linear systems. Periodic systems with large state-space dimension n and/or large period arise, e.g., in the helicopter ground resonance damping problem (p=41) and the satellite attitude control problem (p=120) 6 ' 1 2 . The need for parallel computing in this area can be seen from the fact that (1) represents a non-linear system with pn2 unknowns. Reliable methods
274
275
for solving these equations have a computational cost of 0(pn3) flops (floatingpoint arithmetic operations). In this paper we analyse the parallelization of a coarse-grain DPRE solver, with a cost of 0(p2n3) flops, introduced by Benner and Byers 3 ' 4 . The resulting parallel algorithm only requires efficient point-to-point communication routines and a few high-performance numerical serial kernels for well-known linear algebra computations. A preliminary analysis of the parallel algorithm and basic experimental results using a native implementation of MPI on a Myrinet™ multistage interconnection network (MIN) were reported by the authors 5 . Here we extend these ideas and show that a careful overlapping of computation and communication allows us to obtain a parallel algorithm that can be ported to other commodity interconnection hardware (e.g., a Fast Ethernet switch) while maintaining similar parallel performance. A second DPRE solver with a lower computational cost (0(p\og2(p)n3) flops) is also known 4 . This algorithm presents a slightly irregular communication pattern, but preliminary research shows that a careful treatment of the communication also allows to overlap it with the computation. We do not analyse this second algorithm further in this paper. In section 2 we briefly review a numerical solver for DPREs based on a reordering of a product of matrices associated with (1). The parallelization of the solver is analysed in section 3. In this section we also describe the use of non-blocking communication routines to overlap communication and computation. In section 4 we report the performance of our approach on a cluster of Intel Pentium-II processors, connected via a Myrinet MIN and a Fast Ethernet switch, using three different communication libraries. Finally, some concluding remarks are given in section 5. 2
Numerical solution of D P R E s
Consider the In x In periodic matrix pairs Lk =
Ak ~Qk
0 In
Aft =
In
BkRk A\
Bl
k =
0,l,...,p
associated with the DPRE (1); here, /„ denotes the identity matrix of order n. If all the Ak are non-singular, the solutions of the DPREs are given by the d-stable invariant subspace of the periodic monodromy matrices 10 n A = M jfe+ i J) _ 1 L fc+p _ 1 M fc+ i p _ 2 L fc+p _ 2 ---M fc i L fc ,
k = Q,l,...,p-l.
(2)
Note that the monodromy relation still holds if (some of) the A^ are singular and the algorithm presented here can still be applied in that case 4 > 7 ' 10 .
276
The periodic QZ algorithm is a numerically sound DPRE solver that relies on an extension of the generalized Schur vectors method 10 . This method suffers from the same parallelization difficulties on a distributed parallel computer as the QR algorithm 9 . Although recent attempts have proposed improvements for the parallel QR algorithm n , the parallelism and scalability of these algorithms are still far from those of traditional matrix factorizations 8 . In this paper we follow a different approach for solving DPREs without explicitly forming the monodromy matrices 3 ' 4 . The approach relies on the following lemma. L e m m a . Consider Z,Y g WLnxn, with Y invertible, and let Q21 Q22
Y -Z
\R 0
(3)
be a QR factorization of [YT,—ZT]T; then Q22Q21 = ZY~l. The basic idea is to apply repeatedly the swapping procedure to the matrix products of the form Li+1M~l in IIfc, k = 0 , 1 , . . . ,p- 1, until we obtain reordered monodromy matrices of the form nfc = M^Lk
= (Mfc • • • Afj+p-iJ-^Lt+p-i • • • Lk),
k =
0,l,...,p-l, (4) without computing any explicit inverse. The solutions of the corresponding DPREs are then computed from the appropriate subspaces of the matrix pairs (Lk,Mk), k = 0,l,...,p-l3>*. Hereafter, we denote by (Y, Z) «- swap(Y, Z) the application of the lemma to the matrix pair (Y,Z), where Y and Z are overwritten by Q22 and Q21, respectively. The algorithm can be stated as follows 3-4: for k = 0,1,... ,p — 1 Set Lk 4r- Lk , M(k+\)
mod p «- Mk
end for t = l,2,...,p-l
for k = 0,1,... ,p— 1 (£(fc+t) mod p, Mk) <- swap(L(k+t) Lk <— L(k+t) M(k+t+l)
m o d p,
Mk)
mod pLk
m o d p *— MkM(k+t+1)
modp
end end for k = 0,1,... ,p — 1 Solve the DPRE using end
(Lk,Mk)
The algorithm is only composed of QR factorizations and matrix products. The computational cost of the reordering procedure is 0(p2n3) flops
277
and 0{pn2) for workspace. The costs of the last loop are 0(pn3) flops and (D(jm2) for workspace. 3
Parallelization issues
Our coarse-grain parallel algorithm employs a logical linear array of processes, Po, Pi,...,Pp-i, where p is the period of the system. In the algorithm the matrix pair (Lk,Mk) is initially stored in process Pk, k = 0 , 1 , . . . , p — 1. During the swapping stage, each swapping of a matrix pair is carried out locally in a different process. By the end of this stage, a reordered pair (Lfc,Mft), as in (4), is stored in process Pk, and the corresponding solution of the DPRE is computed locally. The parallel algorithm can be stated as follows. In process Pk : S e t Lk <— Lk, M( f c + i) m o d p «— Mk Send I/* t o P(fc+ p _i)modp R e c e i v e £ ( f c + i ) m o d p from P(k+i) mod p f o r t - 1,2,. . . , p - 1 iL{k+t)
mod p, Mk)
Send
Mik+t)
< - SWap{L(k+t)
mod p,
Mk)
mod p
tO P ( f c + p _ 1 ) mod p ^(k+t)
mod
R e c e i v e M(k+t+1) S e n d Lr(k+t) mod p mod p
PLk
mod p t 0
from P ( * + i ) m o d p
P(k+p-l) M
(k+t+l)
mod p mod p
Receive L^+t+i) mod? from f(*+i) mo d P end Solve the DPRE using (Lk,Mk) This parallel algorithm requires efficient point-to-point communication routines, high-performance numerical serial kernels for the matrix product and the QR factorization, like those available in BLAS and LAPACK 2 , and a serial spectral decomposition algorithm for obtaining the appropriate deflating subspace based, e.g., on the matrix disk function 3 . The algorithm presents a regular and local communication pattern as the only communications necessary are the left circular shifts of Mk and Lk • Assume our system consists of np physical processors. In case np > p, there will be np — p idle processors in the system and the speed-up of the algorithm will be bounded by p. As the user can select the number of processors to use, in order to avoid idle processors, hereafter we assume that p > np. Thus, we map a cluster of l(p—r—l)/np\ +1 neighbour processes onto processor Vr, r = 0 , 1 , . . . , np — 1.
278
As the communication pattern only involves neighbour processes, the algorithm only requires the communication of np matrix pairs between neighbour processors at each iteration of loop t. The remaining p — np transferences are actually performed inside a processor, and can be implemented as matrix copies. Neglect now for a moment the communication time in the algorithm. The theoretical maximum speed-up of the algorithm is given by Sp = r P -i. In practice, we will only achieve this theoretical maximum in case we can overlap communication and computation in loop t. The degree of overlap depends on the efficiency of the communication subsystem and the computational performance of the processors. To derive a theoretical model for our parallel algorithm we will use a simplified variant of the "logGP" model x; here, we assume that the time required to send a message of length I is given by a+/3/. (Basically, the latency and overhead parameters of the logGP model are combined into a, while no distinction is made between the bandwidth, / 3 _ 1 , for short and long messages.) We also define 7 as the time required to perform a flop. Using a non-blocking (buffered) communication Send routine in the algorithm, the transference of M needs to be overlapped with the computation of the matrix product Lk «— L(k+tj mo&pLk. Analogously, the transference of L needs to be overlapped with the computation of M^+t+i) mod p <~~ MkM(k+t+\) modp- Therefore, in our algorithm, the maximum speed-up is obtained for 2g 3 7 > a + (3q2, where q = 2n is the size of the matrices in our problem. Usually, (3 3> 7 and from a certain "overlapping" threshold q communication will be completely overlapped with computation for all q > q. Notice that the algorithm is scalable in p as we only need to increase np so that p/rip is constant to keep the performance. However, the algorithm is not scalable in n, as a single processor needs storage space for at least a single pair (Lk,Mk). 4
Experimental results
All our experiments were performed on a cluster of Intel™ Pentium-II processors at 300 MHz, with 128 MB of RAM each, using IEEE double-precision floating-point arithmetic (e ta 2.2 x 10~ 16 ). An implementation of BLAS specially tuned for this architecture and Linux OS were employed. Performance experiments with the matrix product routine in BLAS (DGEMM) achieved 180 Mflops (millions of flops per second) on one processor; that roughly provides a parameter 7 « 0.0055 us.
279 Table 1. Performance of the communication libraries and hardware.
Library/Hw. GM/Myrinet MPI/Myrinet MPI/Ethernet
a (/*s.) 25.9 37.4 161.5
/ T 1 (Mbps.) Short messages Long messages 483.2 592.8 213.1 254.4 31.2 38.7
The cluster consists of 32 nodes connected by two different networks: a Myrinet MIN and a Fast Ethernet. Myrinet provides 1.28 Gbps, full-duplex links, and employs cut-through (wormhole) packet switching with sourcebased routing. Our other interconnection network consists of a well-known technology: Fast Ethernet. We employed basic Send and Receive communication routines implemented in three different libraries. GM is a native application programming interface by Myricom for communication over Myrinet. We also used the MPI library developed for this architecture by Myricom on top of the GM API (avoids the overhead of using the TCP/IP protocol stack). The communication over the Fast Ethernet was performed using a standard implementation of MPI on top of TCP/IP. Table 1 reports the communication performance of these routines measured using a simple ping-pong test, both for short and long messages. The sizes of these messages, 20 and 500 KB respectively, were set to approximate the size of the smallest and the largest problem in our tests (q = 2n=50 and 250). The figures in the table for MPI/Ethernet should only be considered as rough approximations. We found a high variability in our experiments in this case due to the operation of the IP and ICMP protocols, contention, etc. We could expect a parallel implementation based on the communication routines in the GM API to achieve a much higher efficiency. However, as long as communication and computation in our coarse-grain algorithm are completely overlapped, the communication performance will not play an important role in the execution time. From the figures in the table, the theoretical overlapping threshold q is at 5, 8, and 45 for GM/Myrinet, MPI/Myrinet, and MPI/Ethernet, respectively. Figure 1 reports the speed-up of our solvers for problems of period p=8 and 16 using np=4, 8, and 16 processors. (Similar results were obtained for 24 and 32 processors.) The figure shows the high parallelism of our approach which obtains speed-ups close to the maximum in almost all cases. The communication time only contributes to the execution time for the
280
smallest problem sizes reported in the experiment for the MPI/Ethernet and the MPI/GM. We point out, among others, two reasons for the slight deviation of the experimental results and the model. In general, a and ft depend on the message length. Furthermore, the computational cost of a flop depends on the problem size and the type of operation; e.g., the so-called Level 3 BLAS operations (basically, matrix product) exploit the hierarchical structure of the memory to achieve a lower 7.
Figure 1. Speed-up of the D P R E solvers for n p = p = 8 (top left), np = p = 16 (top right) np = 4, p = 8 (bottom left), and n p = 8, p = 16 (bottom right).
5
Concluding remarks
We have analysed the parallelization of a numerical solver for DPREs on a cluster of personal computers connected using two different communication networks. The experimental results show that the parallel solution of
281
DPREs is computation-bounded, and speed-ups close to the maximum are easily achieved. Acknowledgements This research was supported by the Conselleria de Cultura y Educacion de la Generalidad Valenciana GV99-59-1-14, and the DAAD Programme Acciones Integradas Hispano-Alemanas. References 1. A. Alexandrov, M. F. Ionescu, K. E. Schauser, and C. Scheiman. LogGP: Incorporating long messages into the LogP model for parallel computation, J. Parallel and Distributed Computing, 44(l):71-79, 1997. 2. E. Anderson et al. LAPACK Users' Guide. SIAM, Phil., PA, 1994. 3. P. Benner. Contributions to the Numerical Solution of Algebraic Riccati Equations and Related Eigenvalue Problems. Dissertation, Fak. f. Mathematik, TU Chemnitz-Zwickau, Chemnitz, FRG, 1997. 4. P. Benner and R. Byers. Evaluating products of matrix pencils and collapsing matrix products. To appear in Numer. Linear Alg. Appl., 2000. 5. P. Benner, R. Mayo, E.S. Quintana-Orti, and V. Hernandez. Solving Discrete- Time Periodic Riccati Equations on a Cluster. Lecture Notes in Computer Sciences 1900, pp. 824-828. Springer-Verlag, 2000. 6. P. Benner, V. Mehrmann, V. Sima, S. Van Huffel, and A. Varga. SLICOT - a subroutine library in systems and control theory. In B.N. Datta, editor, Applied and Computational Control, Signals, and Circuits, volume 1, chapter 10, pages 499-539. Birkhauser, Boston, MA, 1999. 7. S. Bittanti, P. Colaneri, and G. De Nicolao. The periodic Riccati equation. In S. Bittanti, A.J. Laub, and J.C. Willems, editors, The Riccati Equation, pp. 127-162. Springer-Verlag, Berlin, 1991. 8. L.S. Blackford et al. ScaLAPACK Users' Guide. SIAM, Phil., PA, 1997. 9. G.H. Golub and C.F. Van Loan. Matrix Computations. John Hopkins University Press, Baltimore, 1989. 10. J.J. Hench and A.J. Laub. Numerical solution of the discrete-time periodic Riccati equation. IEEE Trans. Aut. Cont, 39:1197-1210, 1994. 11. G. Henry, D. Watkins, and J. Dongarra. A parallel implementation of the non-symmetric QR algorithm for distributed-memory architectures. Technical Report CS-97-352, University of Tenneesee at Knoxville, 1997. 12. A. Varga and S. Pieters. A computational approach for optimal periodic output feedback control. Automatica, 34(4):477-481, 1998.
P R E C O N D I T I O N I N G OF SEQUENTIAL A N D PARALLEL JACOBI-DAVIDSON M E T H O D LUCA BERGAMASCHI AND GIORGIO PINI Dipartimento di Metodi e Modelli Matematici per It Scienze Applicate, Universita di Padova, Via Belzoni, 7, 35131 Padova PD FLAVIO SARTORETTO Dipartimento di Informatica, Universita di Venezia, Via Torino 155, 30171 Mestre VE The Jacobi-Davidson (JD) algorithm was recently proposed for evaluating a number of the eigenvalues, close to a given value, of a matrix. JD goes beyond pure Krylov-space techniques by wisely expanding such space, via the solution of a so called correction system, thus in principle providing a more powerful method. Preconditioning the Jacobi-Davidson correction equation is mandatory when large, sparse matrices are analyzed. We considered several preconditioners: Classical block-Jacobi, and IC(0), together with approximate inverse (AINV) preconditioners. We found that JD is highly sensitive to preconditioning. The rationale for using AINV preconditioners is their high parallelization potential, combined with their efficiency in accelerating JD convergence. We parallelized JD by a datasplitting approach, combined with techniques for reducing the amount of communication data. We show that our code is scalable, and provides appreciable parallel degrees of computation.
1 Introduction. An important task in many scientific applications is the computation of a small number of the leftmost eigenpairs (the smallest eigenvalues and corresponding eigenvectors) of the problem Ax = Ax, where A is a large, sparse, symmetric positive definite (SPD) matrix. In this paper we analyze the performance of Jacobi-Davidson (JD), as presented in the seminal work by Sleijpen and Van der Vorst *, and extensively analyzed in 2 . JD relies upon computing the Ritz values of search spaces which are expanded via a so called correction system. We deal with large, sparse matrices, thus the correction system is solved via preconditioned Krylov iterations. In principle, preconditioning must apply to matrices A — AI, where A is an eigenvalue guess. To avoid numerical instability, we exploit preconditioners for the sparse, SPD matrix A. We exploit three types of preconditioners, Block-Jacobi, IC(0), and AINV 3 . A parallel implementation of the JD algorithm has been coded via a dataparallel approach, allowing preconditioning by any given approximate inverse. We implemented ad-hoc data-distribution techniques, in order to reduce the
282
283
amount of communication among the processors, which could spoil the parallel performance of the ensuing code. An efficient routine for performing parallel matrix-vector products was designed and implemented. Numerical tests on a Cray T3E Supercomputer show the appreciable degree of parallelization attainable by the code. 2 Preconditioned Jacobi-Davidson algorithm Let us consider the eigenproblem A x = Ax, A being a large, sparse N x N matrix. The eigenvalues and corresponding eigenvectors are denoted by Ai < ^2 £ • • • £ -Wi a n ( i u i i u 2 , • • •, I*N, respectively. Assume we need to compute s eigenpairs, (\, ii*), i = l,...,s, which are close to a given value 0. JD algorithm computes each eigenvalue by taking the Ritz values in a search space which is expanded by a search vector which is the solution of a linear system, called the correction system x . It involves the matrices A — AI, where A is an eigenvalue guess. In principle such expansion provides better eigenvector guesses than the classical Krylov space. We implemented JacobiDavidson Algorithm 4.5 sketched in 4 . JD algorithm needs to assess many parameter values. Some of them are needed by every iterative eigensolver: The tolerance on the relative residual of each eigenvalue, r, and the maximum number of (outer) iterations, n j t i m a x . Beside these, to avoid un-affordable memory requirements, JD needs assessing the maximum and minimum dimensions, m m a l and mmin, of the search space, S. Usually A ~ A2-, for some i, thus the correction system is likely to be nearly indefinite. Its solution is computed by an iterative Krylov solver, since we deal with large, sparse matrices. It is not necessary to accurately solve the correction system, as pointed out in 5 . We performed at most n£°^x ( a n o t too large integer number) Krylov iterations, stopping the process as soon as the 2-norm of the relative residual is smaller than a prescribed (not too small) tolerance, r^corrh Being the correction system usually ill-conditioned, nearly degenerate, preconditioning is mandatory to achieve even poor accuracy, by performing a small number of iterations. Note that in the sequel the term inner iterations denotes the iterations performed to iteratively solve by Krylov methods the correction system. 3
Preconditioned
We implemented left preconditioning 4 for the correction system, which in essence involves the matrix C = A—AI. This matrix is usually ill-conditioned, nearly singular, thus computing its preconditioners is not a numerically stable
284
process. We computed preconditioners, M , for the reduced operator A, rather than the whole C. Disregarding AI allows for computing the preconditioner only at process start, irrespective of the value of A, which changes through the iterations. In the sequel, let us assume that A is a SPD, N x N matrix. We exploited the well known incomplete Cholesky IC(0) preconditioner, and block-Jacobi(/r), probing block dimensions k=l, 2, and 3. Our experience suggests that when solving by iterative methods linear systems involving our test matrices (see for example 6 ) , larger blocks do not provide appreciably faster convergence, compared to the larger amount of storage required, In view of a parallelization, note that block-Jacobi does not usually provide a satisfactory convergence acceleration, while the more effective IC(0) preconditioning is not apt to parallelization. We found that AINV preconditioning is quite as effective as IC(0), the former being more easily parallelized. The approximate inverse preconditioner AINV 3 explicitly computes an approximation of A - 1 by annihilating either all the elements outside selected positions, or those whose absolute value is smaller than a drop tolerance e. We preferred the latter strategy, since it is more apt to treat unstructured, sparse matrices. A sparse approximate inverse of A, M = Z Z T is produced, Z being an upper triangular matrix. Preconditioning by AINV requires only matrixvector products, which can be effectively parallelized, by wise data-parallel strategies. 4
Parallelization
To attain an appreciable parallel degree, we assume that JD preconditioning is performed either by block-Jacobi, or AINV techniques. Under this assumption, preconditioned JD differs from un-preconditioned at most due to additional matrix-vector products, F v , F T v , where F is a preconditioning factor. Thus, both the preconditioned and un-preconditioned JD algorithms can be decomposed into a number of scalar products, daxpy-like linear combinations of vectors, QV -I- /?w, and matrix-vector multiplications. We focussed on parallelizing these tasks, assuming that the code is to be run on a machine with p identical, powerful processors, our model machine being a Cray T3E Supercomputer. Scalar products, v-w, are distributed among the p processors by a uniform block mapping. Each N x N matrix is uniformly row-partitioned among the p processors, so that roughly [N/p] rows of A, F and F T , are stored on each processor. Large problems can be solved by using a small amount of memory, provided that a suitably large number of processors is activated. We tailored the implementation of the matrix-vector products, Av, Fv, F T v , for
Table 2. Number of nonzero entries, N„z , in the preconditioning factors. T(M) gives the CPU seconds spent to sequentially compute each preconditioner on a DEC ALPHA 500 MHi machine.
application to sparse matrices, by minimizing data communication between processors 7 . 5
Numerical results
Table 1 shows the main characteristics of our test matrices. They arise from the discretization of parabolic and elliptic equations in two and three space dimensions. They are representative of a larger set of test matrices that we analyzed. We performed sequential JD runs on a DEC ALPHA 500 MHz with 1536 MB RAM. We measured the CPU time spent by JD algorithm, irrespective of the time needed for I/O and for computing the preconditioning factor. Table 2 shows some features of our preconditioning factors, i.e. the number of non-zero elements and the CPU time spent for computing each factor.
286
N 28600 42189 80711 152207 216000 268515
e=0.025 15.62 21.83 4.67 1.01 1.00 •25.97
AINV e=0.05 15.50' 21.83 10.82 1.00 1.00
6=0.10 16.07 24.45 13.07 1.00 1.00 •25.97
IC(0) 13.55 19.48 13.32 3.31 1.00 •25.96
fc=l 1.09 18.90
block-Jacobi fc=2 fc=3 5.49 7.16 12.16 12.80
* *
* *
1.00 25.54
1.00 25.70
* 1.00 25.93
Table 3. Computation of s = 10 eigenpairs using AINV(e), Jacobi(k), and IC(0). Average number of inner iterations performed in one outer iteration. The symbol "*" means no convergence attained within 1000 outer iterations. A dagger, "\", signals that Ai and M were missed.
The time spent is negligible with respect to the overall sequential time spent for computing a number of the smallest eigenvalues (cf. Table 4). Concerning the choice of the Krylov solver, experiments with GMRES, CGS, and Bi-CGSTAB were performed by our group 2 . Analyzing very illconditioned matrices, like our largest test matrix (./V=268,515), we found that when GMRES and CGS are exploited, JD fails to converge more frequently than using Bi-CGSTAB. In the sequel, the correction equation is solved by (preconditioned) Bi-CGSTAB. Concerning the choice of the parameters in JD algorithm, we experimented with several combinations of values. Our results suggest that suitable choices for analyzing our test matrices are r=1.0e-3, m m j„=5, mmax=20,
» K = 2 6 , r(—)=1.0e-l. The starting vector was set to v 0 = ( 1 , . . . , 1) T . Tables 3 and 4 show some values recorded when computing the smallest s — 10 eigenpairs of our test matrices. We studied how the efficiency of AINV preconditioners depend upon the dropping parameter e. Recall that the number of non zero elements in AINV increases for smaller e, thus increasing the cost of applying the preconditioner. Table 3 shows that when N < 152,207, switching from e = 0.1 to e = 0.025, produces a better preconditioner, since a smaller average number of inner iterations, Atn, is produced. Note that when iV=216,000, using whatever preconditioner, and when N=152,207, exploiting
287
N
T2
sf>
D
o(2) 8
o(2) °16
<;(2)
o(2) '-'64
28600 42189 80711 152207 216000
37.96 58.52 216.30 1687.6 402.22
1.83 1.71 1.52 1.89 1.94
2.97 3.06 2.53 3.48 3.53
4.40 4.50 3.37 5.78 6.05
5.53 5.75 4.42 8.31 10.01
5.85 6.55 5.33 10.04 13.23
Table 5. CPU time, T?,, spent for computing by AINV-JD s — 10 eigenpairs on 2 processors. The (halved) speedups, Sj, , are also reported.
AINV, Ain — 1 holds true, i.e. one Bi-CGSTAB iteration is enough to match Table 4 shows that using a smaller e the higher cost of applying the preconditioner increases the CPU time, except when JV = 152,207. In such case, when e increases the efficiency of AINV does not decrease very much, since Ain does not change. However, we found that the residuals were slightly smaller for smaller e values, and the convergence of AINV-JD turn out to be faster, due to better search spaces. Note that when JV=268,515 AINV(0.05) is not effective; IC(0)-JD and the other AINV-JD miss Ai and A2, while Jacobi(A:)-JD do not. Such high dependence on the preconditioner seems to be peculiar of JD. When considering all our test cases, AINV(0.1)-JD needs less than 1.3 times the CPU of IC(0)-JD, except when JV=152,207; in this latter case, the CPU time decreases together with e, suggesting that a dropping value less than 0.025 should be set. We say that AINV(0.1) is competitive with IC(0). Block-Jacobi preconditioners, which require much smaller storage than AINV and IC(0), fail to provide convergence when JV = 80,711 and JV — 152,207. Inspecting Table 4, we see that IC(0)-JD, which never fails, and AINV(0.1)-JD, which fails only when JV = 268,515, e = 0.05, are more robust than block-Jacobi. Let us now consider our parallel JD code. We performed parallel runs on the T3E1200 machine of CINECA Consortium, located in Bologna, Italy. The machine is a stand alone system made by a set of DEC-Alpha 21164 processors. They perform at a peak rate of 1200 MFLOP/s, and they are interconnected by a 3D toroidal network having a 480 MByte/s payload bandwidth along each of 6 torus directions. In the present configuration there are 256 processing elements, half of them have a 256-MByte RAM, the remaining a 128-MByte RAM. Due to memory limitations, it was not possible to analyze some matrices by one-processor runs. We performed p-processor runs, when p=2, 4, 8, 16, 32, 64. Define the "halved" speed-up, S^ = T2/Tp, p - 2,4,8,16,32,64. We T(corr)
288
N Jacobi(l) Jacobi(3) AINV(O.l)
T2 * *
*
N
h
Jacobi(l) Jacobi(3) AINV(O.l)
* * *
T4
T8
Tie
?32
^64
4876.12 3566.94 7237.26
2723.76 2273,81 4868.35
1854.82 1499.78 2483.66
930.34 842.34 1929.31
726.96 800.60 1661.95
h
h
27,118 23,192 25,038
25,584 19,110 28,080
/l6
28,288 20,592 22,386
I32
20,956 23,322 23,972
^64
21,138 22,152 26,520
Table 6. N =268,515. CPU time, Tp, spent for computing s = 3 eigenpairs onp processors, using AINV-JD and Jacobi(k)~JD. The corresponding total number of inner iterations, Ip, is also reported. In each run, \i and A2 were missed. An asterisk means that the code did not run, due to memory limitations.
used AINV(O.l) in parallel computations, hereafter called AINV, being both more robust than block-Jacobi and quite as effective as IC(0), yet more easily parallelizable than the last one. Setting apart for a moment the case N = 268,515, Table 5 shows the CPU seconds and (halved) speed-up recorded when running our AINV-JD code. One can see that the algorithm is scalable. Satisfactory (halved) speed-ups are produced, when the amount of data on each processor is large enough. Indeed, satisfactory speedups running on p=32, and p=64 processors are obtained as an example when iV=216,000, where S^2' = 10.01. The efficiency is approximately 20/32 ~ 63%, confirming the appreciable scalability of our algorithm. Due to CPU time limitations, AINV-JD cannot be run to compute the smallest s = 10 eigenpairs of our largest matrix (AT=268,515). Table 6 shows some results about computing only s = 3 eigenpairs. The number of iterations changes appreciably with p (see Ip values). We do not report any speed-up, since they are affected by this numerical instability. The CPU time spent by AINV-JD is larger than for block-Jacobi-JD. Since the latter has lower storage requirements and CPU cost than the former, this result could suggest that for very large matrices block-Jacobi-JD could be a good choice. However, we found that, irrespective of the preconditioner, every parallel run missed Ai and A2. As a consequence, parallel JD is not reliable for computing the eigenvalues of our largest matrix. Table 3 shows that when ./V=268,515, for each preconditioner Ain ~ n^°mr}x = 26 holds true, which signals their poor efficiency. When dealing with very large, ill-conditioned problems, more efficient preconditioners must be exploited in order to make JD productive.
289 6
Conclusions
The following points are worth emphasizing. AINV is an effective preconditioner for Jacobi-Davidson, quite as effective as IC(0). Both are more robust than block-Jacobi. Our parallel JD algorithm displayed a satisfactory degree of parallelization. However, when very large, sparse, ill-conditioned matrices are considered, some eigenvalues can be missed. Jacobi-Davidson is highly sensitive to preconditioning, which heavily affects the expansion of the search space. Further work is in progress to devise appropriate preconditioners. Acknowledgments This work has been supported in part by INDAM-GNCS and MURST italian project Mathematical Models and Numerical Methods for Environmental Fluid Dynamics. References 1. G. L. G. Sleijpen and H. A. van der Vorst. A Jacobi-Davidson method for linear eigenvalue problems. SI AM J. Matrix Anal. Appl., 17(2):401425, 1996. 2. L. Bergamaschi, G. Gambolati, and M. Putti. Iterative methods for the partial symmetric eigenproblem. In Proc. of the 2000 Copper Mountain Conf. on Iter. Meth., April 3-7, 2000. 3. M. Benzi, C. D. Meyer, and M. Tuma. A sparse approximate inverse preconditioner for the conjugate gradient method. SIAM J. Sci. Cornput., 17(5):1135-1149, 1996. 4. Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst. Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide. SIAM, Philadelphia, PA, 2000. 5. G. L. G. Sleijpen, H. A. Van der Vorst, and E. Meijerink. Efficient expansion of subspaces in the Jacobi-Davidson method for standard and generalized eigenproblems. ETNA, 7:75-89, 1998. 6. L. Bergamaschi, G. Pini, and P. Sartoretto. Approximate inverse preconditioning in the parallel solution of sparse eigenproblems. Numer. Lin. Alg. Appl, 7(3):99-116, 2000. 7. L. Bergamaschi and M. Putti. Efficient parallelization of preconditioned conjugate gradient schemes for matrices arising from discretizations of diffusion equations. In Proc. of the Ninth SIAM Conf. on Parall. Proc. for Sci. Comp., March, 1999. (CD-ROM).
http://suma.ldc.usb.ve* This article describes experiences on incorporating uncoordinated checkpointing and recovery facilities in a Java-based metacomputing system. Our case study is SUMA, a metasystem for execution of Java bytecode, both sequential and parallel. We have incorporated an algorithm that implements a communication-induced checkpointing protocol into SUMA. This algorithm induces processes to take additional local (forced) checkpoints and to log all in-transit messages to ensure consistent global checkpoints. Preliminary results about the performance overhead produced by the implementation of this algorithm are presented.
1
Introduction
The access to remote high performance computing facilities for execution of Java programs is a relatively effortless goal with current distributed system technology1. A checkpointing facility is potentially very useful for high performance parallel computing, especially because programs executed on parallel systems typically take a long time. Checkpointing involves capturing the state of a process in terms of the data necessary to restart the process from that state. For parallel applications, a local checkpoint is the process state in a single node and a global checkpoint is a set of local checkpoints, one per process, that defines a checkpoint for the parallel application. Parallel checkpointing approaches can be classified into two main groups according to the kind of coordination used to take checkpoints. In the coordinated approach 2 ' 3 it is necessary to synchronize all processes to take a global checkpoint, by sending control messages explicitly. In the uncoordinated approach 4 ' 5,6 , the processes take their local checkpoints more or less independently, but must exchange information about these local checkpoints in order to build a global checkpoint. This information is mainly piggy-backed on the messages sent between processes. It is important to produce a consistent global checkpoint, which is a global checkpoint that avoids orphan and keeps all in-transit messages. "This work was partially supported by grants from Conicit (project Sl-2000000623) and from Universidad Simon Bolivar (direct support for research group GID-25)
290
291 Orphan messages are those that are sent but not received during the recovery, e.g. because the local checkpoint used by the receiving process is after the receive. In-transit messages are those that are received but not sent during the recovery, e.g. because the local checkpoint used by the sending process was taken after the send. In this paper we address the implementation of a parallel checkpoint facility on a metacomputing system that executes Java bytecode, called SUMA (Scientific Ubiquitous Metacomputing Architecture) 7 . In order to take local checkpoints, we have selected an approach that consists in extending the Java Virtual Machine (JVM) in such a way that the computation state is accessible from Java threads, which take the checkpoint. This extension provides a facility for extracting the application state and storing it in a serialized Java object 8 . The reasons for choosing such approach in the context of a metasystem are presented in 9 . Following this approach it is possible to take architecture independent checkpoints in each node. The rest of the paper is organized as follows. Section 2 describes the checkpointing algorithm. Section 3 explains its implementation within SUMA, from which the recovery process is made automatically. Section 4 shows some experimental results related to the overhead introduced in application performance. Section 5 presents conclusions and future work. 2
Uncoordinated Checkpointing Protocol with Message Logging
Uncoordinated checkpointing algorithms replace global synchronization by tracking dependences on local checkpoints and logging enough messages in order to determine a consistent global checkpoint. The parallel checkpointing protocol described in this section is a combination of the protocols described in5 and 6 . The protocol described in 5 logs in-transit messages in order to be able to resend them during the recovery process. On the other hand, the protocol described in 6 , focuses on avoiding orphan messages by taking checkpoints that have not been scheduled previously (forced checkpoints). The combined protocol cope with both orphan and in-transit messages, taking advantage of the fact that part of the information that the original protocols exchange is common to both protocols. We modified the mpiJava10 methods for implementing the combined protocol. mpiJava is a wrapper for MPI 1 1 . This implementation uses these variables: integer ckpt[Num-of
A private method called take-checkpoint actually takes the checkpoints, either normal or forced. It first increases the checkpoint number corresponding to the current process (its logical clock). Then, for each message in vJog, it checks whether the message is not an in-transit one, in which case the message is suppressed from v Jog. Finally, the current state is saved using the extended JVM facilities. A copy of the checkpoint number array (ckpt) and a copy of vdog are also saved in stable storage. Method mpiJavasend is the wrapper of the mpisend routine. It registers the event "send" and executes the actual mpisend routine. Current message is logged in v Jog, while method take-checkpoint, as explained above, identifies the messages that may become in-transit messages during a recovery. Method mpi Java-receive is the wrapper of the mpi-receive routine. It executes the actual mpi-receive routine and executes forced checkpoints (if necessary) to avoid orphan messages during a potential recovery. Java Native Interface (JNI) is used to deliver the message to the application. This method is more complex because it actually checks the control information and has to consider several cases. A detailed description of this method follows: procedure mpiJava-receive (m, source_rank) / / Receives from process source-rank message m actual-mpi-receive(source-rank, m, r-greater, r-ckpt, r-taken, r-known-received); lj Evaluating control information to avoid orphan messages if (3 rank : (sent_to[rank] A r_greater(rank]) A (ckpt[sourcej-ank] > ckpt[my-rank]) V (r.ckpt[my_rank] = ckpt[my-rank] A r.taken[my_rank])) then take-checkpoint / / forced checkpoint endJf switch case ckpt[source-rank] > ckpt[my.rank] do ckpt[my-rank] := ckpt[source-rank]; greater[my~rank] := false; forall rank ^ myjrank do greater[rank] := r-greater[rank] end-do; end case case ckpt[source-rank] = ckpt[my-rank] do forall rank do greater[rank] := greater[rank] A r-greater[rank] end_do; end case / / case ckpt[source~rank] < ckpt[my-rank] do nothing end-switch; forall rank J± my.rank do switch
293 case r.ckpt[rank] > ckpt[rank] then ckpt[rank] := r-ckpt[rank];taken[rank] := rJ:aken[rank]\ end case case r-ckpt[rank] = ckpt[rank] then taken[rank] := taken[rank] V rJtaken[rank] end case / / case r~ckpt[rank] < ckpt[rank] do nothing end-switch; end_do; JNI-deliver(m); / / gives message m to the application knownjrectived[myjrank, source-rank] := known-received[my~rank, source-rank] + 1; forall [rank-x, rank.y) do known-received[rank.x,rank-y] := max(known~received[rankjx, rank-y], rJcnown-received[rank~x, rank.y]); end-do end procedure
3
Implementation in SUMA
SUMA is a metacomputing system for execution of Java bytecode. It is implemented as a set of CORBA classes, following a three-tier metacomputing system model 1 , SUMA execution agents (actual platforms where program execution takes place) may offer integrated profiling and checkpointing/recovery services, as well as local access to numerical and communication libraries. Execution agents that offer parallel execution provide mpiJava10. They are actually front-ends of parallel platform, and we call them Parallel Execution Agents. A Parallel Execution Agent may provide checkpointing and recovery services, in which case it launches the necessary threads for executing the application and taking the checkpoints, as explained below. Figure 1 shows the steps during the development and execution of a parallel application on SUMA, when the checkpointing service is requested. Depending on the checkpointing tool used to take local checkpoints, it may be necessary to instrument the source code or bytecode at the client side. The code instrumentation at the client side can be done automatically, by using a pre-processor (step 1). The user requests the checkpointing service explicitly when submitting the application for execution in SUMA (step 2). This can be executed from a sumaClient, which can invoke either sumaExecute or sumaSubmit services. Both services provide the checkpointing option. After the sumaClient invokes the remote execution with the checkpoint option, SUMA builds the Execution Unit object (step 3), finds a parallel platform with checkpointing support and sends the Execution Unit object to the Parallel Execution Agent at the selected platform (step 4). Once this Parallel Execution Agent has received the object representing the application, it starts an Execution Agent Slave in a extended JVM at each node that the application required (step 5). Two threads are started initially within this extended JVM:
294 1 PARALLEL PLATFORM
f 6J Load c l a i m and data dyn.ro Icaity
KfltndriJVMl SUMAClan Loader App. threads
o
0
TTIIICC SUMACkpThreadMoaltor | Instrument code and llave^urr(spp)
SUMAClassLoader and SUMACkptThreadMonitor. SUMAClassLoader is in charge of loading classes and data from the client during the execution (communicates with the client through CORBA callbacks). SUMACkptThreadMonitor periodically takes basic checkpoints, in an asynchronous way. Additionally, this thread is in charge of taking forced checkpoints when it is required by the checkpointing protocol. The main thread is loaded directly from the client by SUMAClassLoader and started (step 6). The local checkpoints will be taken from this moment (step 7). If the execution finishes successfully, each Execution Agent Slave informs the Parallel Execution Agent of it and finishes (step 8). The Parallel Execution Agent passes this information back to the SUMA core (step 9), which returns the results to the client (step 10).
3.1
Recovery
If a failure occurs in the parallel platform while an application is running, an exception occurs in the SUMA core. In this case, previously described
295 steps 9 and 10 are not executed. Instead, SUMA starts a recovery algorithm automatically, similar to the one described in 9 . The last local checkpoints taken on each node before the failure do not necessarily form a consistent global checkpoint. For this reason, it is necessary to execute an algorithm to build the last consistent global checkpoint. The summarized algorithm follows: • Non-faulty processes are requested to take a local checkpoint • The most recent consistent global checkpoint is determined in the following way: — obtain the last logical clock value "a" that is common to all processes. - build a consistent global checkpoint Ca=Ci<xi,C2,X2, • • •> C„tX„ (CitXi is a local checkpoint of Pi) defined as: Vk, Ck,xk is the last checkpoint of Pk such that the value of Ck,Xk logical clock is less or equal than a • All messages in transit with respect to the ordered pair {CiiXi,Cj,Xj) are extracted from the stable storage of Pi and re-sent to Pj. 4
Experiments
We implemented previous algorithm in SUMA and conducted experiments to evaluate it in practice. In this section we present some results related to checkpointing intrusiveness. The platforms used in the experiments are several 143 MHz SUN Ultra 1 workstations that share file systems through NFS. These workstation have Solaris 7, 64 MB Ram and a 32 Kbytes cache memory. The interconnection network is a switched 10Mbps Ethernet. We executed a small parallel Java program that calculates ir number. This program follows a master-worker approach. In each iteration the workers exchange messages with the master, and both send and r e c e i v e primitive calls are non-blocking. All processes execute the same amount of work, regardless of the number of processes used, and keep an array of data obtained during the execution. In order to evaluate checkpointing intrusiveness, we measured the total execution time with and without invoking the checkpointing service. Table 1 shows these measurements, which are independent of SUMA overhead. For all cases five checkpoint were taken. Checkpoint size for each case is shown in Table 1.
296 Table 1. Total execution time. Number of processes 2
6
Checkpoint size 4KB 121KB 1MB 4KB 121KB 1MB
Without checkpoints 6.70 6.74 7.03 7.63 7.68 7.96
min min min min min min
Activating checkpoints 6.74 min 7.10 min 9.92 min 7.67 min 8.46 min 10.26 min
Let us analyze the case in which the checkpoint size is 1MB. The overhead exhibited by the use of checkpoints is 41% for 2 processors and 29% for 6 processors. This overhead is mainly attributable to checkpoint saving calls and the addition of the checkpointing protocol. It clearly depends on the size of the checkpoints taken. The extra time for going from 2 to 6 is 13% without taking checkpoints and 3.4% when taking checkpoints. This preliminary experiment showed that the size of the checkpoints has a higher influence on the execution time than an increment in the number of processors participating in the protocol. Further experiments are necessary to better characterize the combined influence of both factors.
5
Conclusions and Future Work
In this paper we present an implementation of a parallel algorithm based on a communication-induced checkpointing protocol. The algorithm was implemented on a metasystem that executes parallel Java bytecode using mpiJava, a wrapper for MPI. This algorithm induces processes to take additional local (forced) checkpoints and to log all in-transit messages to ensure consistent global checkpoints. Preliminary results about the performance overhead produced by the implementation of this algorithm are presented. These results show that overhead produced by the checkpoint size seems to be more significant that overhead produced by the protocol when the number of processors is increased, at least in the range of a small number of processors. However, experiments with real parallel applications are important to corroborate these results. We expect more parallel mpiJava applications to become available in the public domain for continuing to test the checkpointing facility presented. We are also studying other aspects related to checkpointing on a metasystem, such as the recovery procedure.
297
References 1. G.C. Fox, W. Furmanski, T. Haupt, E. Akarsu, and H. Ozdemir. Hpcc as high performance commodity computing on top of integrated Java, corba, com and web standards. Lecture Notes in Computer Science, 1470:55-74, 1998. 2. J. Plank and K. Li. Low Latency, Concurrent Checkpointing for Parallel Programs. IEEE Transactions on Parallel and Distributed Systems, pages 874-879, 1994. 3. G. Stellner. CoCheck: Checkpointing and process migration for MPI. In 10th International Parallel Processing Symposium, pages 526-531. IEEE Computer Society Press, April 1996. 4. D. L. Russell. State restoration in systems of communicating proccess. IEEE Transactions on Software Engineering, SE6(2)-.183-194, 1980. 5. A. Mostefaoui and M. Raynal. Efficient message logging for uncoordinated checkpointing protocols. Technical Report Publication interne No. 1018, IRISA, France, June 1996. 6. J-M Helary, A. Mostefaoui, R Netzer, and M. Raynal. Communicationbased prevention of useless checkpoints in distributed computations. Technical Report Publication interne No. 1105, IRISA, France, May 1997. 7. E. Hernandez, Y. Cardinale, C. Figueira, and A. Teruel. SUMA a scientific metacomputer. In Proceedings of the International Conference ParCo99, pages 566-573. Imperial College Press, January 2000. 8. S. Bouchenak. Making Java applications mobile or persistent. In Proceedings of 6th USENIX Conference on Object-Oriented Technologies and Systems (COOTS'01), January 2001. 9. Y. Cardinale and E. Hernandez. Checkpointing facility in a metasystem. To be published in Euro-Par 2001 proceedings, January 2001. 10. Mark Baker, Bryan Carpenter, Geoffrey Fox, Sung Hoon Ko, and Sang Lim. MPIJAVA: An object-oriented JAVA interface to MPI. In International Workshop on Java for High Performance Network Computing,IPPS/SPDP, pages 748-762, April 1999. 11. E. Lusk W. Gropp and A. Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, 1994.
A N EFFICIENT D Y N A M I C P R O G R A M M I N G PARALLEL A L G O R I T H M FOR T H E 0-1 K N A P S A C K PROBLEM M. ELKIHEL AND D. EL BAZ LAAS du CNRS, 7, avenue du Colonel Roche, Toulouse 31077, France E-mail: [email protected][email protected] An efficient parallel algorithm for the 0-1 knapsack problem is presented. The algorithm is based on dynamic programming in conjunction with dominance of states techniques. An original load balancing strategy which is designed in order to achieve good performance of the parallel algorithm is proposed. To the best of our knowledge, this is the first time for which a load balancing strategy is considered for this type of algorithm. Finally, computational results obtained with an Origin 2000 supercomputer are displayed and analyzed.
1
Introduction
In this paper, we consider the parallel solution of the 0-1 knapsack problem. The 0-1 knapsack problem is a combinatorial optimization problem which is used to model many industrial situations such as, for example: cargo loading, assignation problems, capital budgeting, or cutting stocks. Moreover, this problem occurs as a subproblem to be solved when more complex optimization problems are solved. This problem is well known to be NP-hard. The 0-1 knapsack problem has been intensively studied in the past (see Horowitz and Sahni 7 , Ahrens and Finke *, Fayard and Plateau 5 , Martello and Toth 9 and Plateau and Elkihel n ) . More recently, several authors have proposed parallel algorithms for the solution of this combinatorial problem. Cosnard and Ferreira 4 have presented a parallel implementation of the two lists dynamic programming algorithm on a MIMD architecture for the solution of the exact subset sum knapsack problem. We note that data exchanges are not needed for this type of application. Other interesting parallel approaches for the solution of dense problems which are based on dynamic programming have been proposed. We can quote for example Chen, Chern and Jang 3 for a method carried out on pipeline architectures and Bouzoufi, Boulenouar and Andonov 2 for an efficient parallel algorithm using tiling techniques on MIMD architectures. The reader is referred to Gerash and Wang 6 for a survey of parallel algorithms for the 0-1 knapsack problem. In this paper, we propose an efficient parallel algorithm for the 0-1 knapsack problem. The algorithm is based on dynamic programming in conjunction with dominance of states techniques. We present also an original load
298
299 balancing strategy which is designed in order to achieve good performance of the parallel algorithm studied. To the best of our knowledge, this is the first time for which a load balancing strategy is proposed for this type of algorithm. Finally, computational results obtained with an Origin 2000 supercomputer are displayed and analyzed. Section 2 deals with the 0-1 knapsack problem and its solution via dynamic programming. An original parallel algorithm and associated load balancing strategy are presented in Section 3. Finally, computational results obtained with an Origin 2000 supercomputer are displayed and analyzed in Section 4. 2
The 0-1 Knapsack Problem
The 0-1 unidimensional knapsack problem is defined as follows:
{
n
>.
n
^,Pjxj\'52wJxj
(1)
where C denotes the capacity of the knapsack, n the total number of items considered, pj and Wj, respectively, the profit and weight, respectively, associated with the j - t h item. Without loss of generality, we assume that all the data are positive integers. In order to avoid trivial solutions we assume moreover that we have: J2]=i WJ > C a n ( ^ wj < C for all j £ {1,..., n}. Several methods have been proposed in order to solve problem (1). We can quote for example methods based on dynamic programming studied by Horowitz and Sahni 7 , Ahrens and Finke * and Toth 12 , branch and bound methods proposed by Fayard and Plateau 5 , Lauriere 8 , Martello and Toth 9 and finally mixed algorithms combining dynamic programming and branch and bound methods studied by Martello and Toth 10 and Plateau and Elkihel n . In this paper, we concentrate on a dynamic programming procedure (see Ahrens and Finke 1) which is based on the concepts of list and dominance. For each positive integer fc, we define the dynamic programming recursive list Lk as the following set of monotone increasing ordered pairs : k
k w x
Lk = {(w,p) | w < C and w = 2_\ j ji
P
=
2_^Pixi}'
^
According to the dominance principle, all states (w,p) such that there exists a state (w' ,p') which satisfies: w' < w and p < p', must be removed from the list Lk- In this case, we usually say that the state (u/,p') dominates the state (w,p). As a consequence, we note that by using the dominance concept, any
300
two pairs (w,p),(w',p') of the list Lk must satisfy: p < p' if w < w'. The following sequential algorithm permits one to build recursively the list Ln from which we can derive the optimal solution of problem (1). Sequential Algorithm L0 = {(0,0)} FOR k = 1 TO n DO Lk = {(w + wkxk,p+pkxk) | (w,p) G Lfc_i and w + wkxk < C, withxfc G {0,1}} ~Dk
, ^ '
where Dk denotes the set of all dominated pairs at stage k. 3
Parallel Algorithm
In this Section, we present an original parallel algorithm based on the dynamic programming method. The main feature of this parallel algorithm is that all processors cooperate via data exchange to the construction of the list. In particular, all dominated pairs are removed from the list at each stage of the parallel dynamic programming method. More precisely, each processor creates its own part of the global list Lk at each stage k; so that the total work is shared by the different processors and data exchange permits each processor to remove all dominated pairs. Several issues must be addressed when considering the design of a parallel algorithm: the initialization of the parallel algorithm, the work decomposition and tasks assignation and the load balancing strategy.
3.1
Initialization,
Work Decomposition and Task Assignation
The initialization of the parallel algorithm is sequential. First of all, a sequential process performs fc(0) stages of the dynamic programming algorithm. This process generates a list which contains at least lq pairs ; where q denotes the total number of processors and / the minimal number of pairs per processor. Let L be the ordered list which results from the sequential initialization. The list L is partitionned as follows: L = L\ U ... U Lq, with | Li |= I, for i = 1,..., q - 1 and | Lq |> I, where | L; | denotes the number of elements of the sublist Lj. For simplicity of presentation, Lj will denote in the sequel the sublist assigned to processor Et. We note that if the different processors Ei work independently on their list Li without sharing any data, then at the global level some dominated
301
pairs may not be removed from the local sublists; which induces an overhead. Thus, it is necessary to synchronize the processors at each iteration and share part of the data produced at the previous iteration in order to eliminate all the dominated pairs from a global point of view. These operations may reduce the size of the sublists L, and the size of the union of the sublists will be the same as if the list was generated by a sequential algorithm. We detail now the parallel algorithm. In the case where the iteration number k is odd, data exchange occurs only between any processor Ei and processors Ei+\, ...,Eq. In the case where k is even data exchange occurs between any processor Ei and processors 22i,.„,.Et-_i, for natural load balancing. We note that for simplicity of presentation, we consider here only the case where the iteration number k is odd. In the following parallel algorithm we present computation and communication at iteration k. We shall denote by Li the sublist associated with processor Ei and the smallest pair of Li will be denoted by (t«i,i,Pi,i). Parallel Algorithm FOR i = 1 TO q DO BEGIN IFt^l THEN Processor Ei copies (witi,piti) in the shared memory; IF i ^ q THEN CASE The pair (t«i+i,i>P»+i,i) is available Processor Ei copies the following pairs in the shared memory:
{(wi,j,Pij) I mj +wk> m+1,1} Processor Ei merges the list Li with all pairs («/,-,. +Wk,Pj,. + Pk) which are such that w^i — w^ < Wjt. < w i+1,1 — wj, and j
302
We note that the copy of (wi,i,Pi,i) stored in the shared memory by processor Ei will be used by processors E\, ...,£^_i, in order to know what part of its sublist must be copied in the memory for the purpose of removing dominated pairs. We have chosen to exchange data between any processor E» and processors Ei+i,...,Eq every odd iteration and between any processor Ei and processors E\,...,Ei-i every even iteration. This strategy is designed in order to realize a kind of simple and natural load balancing. With this type of data exchange, two processors, say Ei and E9, play a particular part since, generally, they tend to accumulate more pairs than other processors. In this sense, processors E\ and Eq contain valuable information that will permit one to design an efficient load balancing strategy as we shall see in the next subsection.
3.2
Load Balancing Strategy
In order to obtain a good performance of the parallel algorithm, it is necessary to design an efficient load balancing strategy. As a matter of fact, if no load balancing technique is implemented, then it results from the merging process described in the Parallel Algorithm of the previous Section that processors E\ and Eq become often overloaded. We propose a load balancing strategy which is designed in order to obtain a good efficiency while presenting a small overhead. As a consequence, the strategy considered does not balance loads at each iteration; it is rather an adaptive strategy which takes into account several measures which are presented in the sequel. The strategy chosen is such that a load balancing is made if a typical processor which is overloaded can take benefit of it. The main test is made systematically on processor E\ since we have chosen to take a decision every two iterations and processor E\ can become overloaded every odd iterations according to the Parallel Algorithm presented in Section 2. The load balancing process will assign an equal load i.e. an equal number of pairs to all processors. In the sequel, Ok, Oek+2, respectively, will denote a measure of the load balancing overhead at iteration k, and an estimation of the load balancing overhead at iteration k + 2, respectively. Ofc, is function of the total number of pairs in the global list and 0%+2 ls estimated in function of the augmentation of pairs ratios per processor. Similarly, Ife, T%+1, T£ +2 , respectively, denote the computation time of processor E\ at iteration k, and estimations of
303
the computation time of processor E\ at iterations k+l and fc+2, respectively. These estimations are computed according to: T£+1=rTfcandTfce+2=r2Tfc,
(4)
where the ratio r satisfies:
*• = •£-•
(5)
J-k-l
More generally, we shall use the following notations: p-i
Tp(h) = ^Tk_t+Ok,
(6)
t=0
P-I
TP,2 (fc) = E
T
k~i + T *+i + Tk+2 + Oek+2.
(7)
i=0
We note that iteration k — p is relative to the last load balancing phase. The condition for load balancing will be: TP(k) TPt2(k) ( p p+2 ' ' If this condition is satisfied, then all the benefit of the load balancing will be for processor Ei, and as a consequence for the parallel machine (load balancing is then made at the end of iteration k); else, there is no load balancing. So, load balancing is made according to the following algorithm: Load Balancing Algorithm p=0 FOR k = jfe(0) + 2 TO n - 2, 2 DO BEGIN compute 2 i M and ^ ^ r I P IiM. < .-»W P
p+2
THEN perform load balancing p= 0 ELSE p= p+2 END
304 Table 1. Computing time in seconds and efficiency for a gap 100 and range 10000. size 200 400 600
1 processor time 4.42 33.36 107.90
2 processors time efficiency 2.58 86% 18.46 90% 56.89 95%
4 processors time efficiency 2.04 64% 11.88 70% 34.19 79%
Table 2. Computing time in seconds and efficiency for a gap 10 and range 10000. size 200 400 600
4
/ processor time 19.92 114.68 300.29
2 processors time efficiency 10.18 98% 94% 60.69 164.54 91%
4 processors time efficiency 72% 6.91 34.84 82% 89.36 84%
8 processors time efficiency 5.11 49% 21.01 68% 64% 58.70
Numerical Results
The numerical experiments presented here are relative to difficult 0-1 knapsack problems. It is well known that the number of variables is not the fundamental criterion for difficulty. We have studied several series of strongly coupled problems which are considered as difficult problems. The numerous instances considered are relative to problem sizes which are equal to 200, 400, 600, respectively, with data range defined as follows. The weights Wj are randomly distributed in the segment [1,10000] and the profits pj in [WJ - g,Wj + g], where the gap, denoted by g, is equal to 100 for the first set of data and 10 for the second set. We have considered problems such that C = 0.5 2 ? = i wjThe parallel algorithm has been implemented on a NUMA shared memory supercomputer Origin 2000 by using the Open MP environment. Parallel numerical experiments have been carried out with up to 8 processors. Numerical results are displayed on Tables 1 and 2, where the average running time in seconds and efficiency are given at each time for over 20 instances randomly generated of knapsack problems. From Tables 1 and 2, we note that the efficiency of parallel algorithms generally increases with the size of the problem. We note also that the efficiency increases with the difficulty of the combinatorial problem: parallel algorithms are generally more efficient for problems with a smaller gap. We see that the efficiency decreases when the number of processors increases. However, the efficiency remains good; which shows that the load balancing strategy is efficient.
305
Acknowledgments Part of this study has been made possible by a support of CALMIP, Toulouse and CINES, Montpellier. References 1. J. H. Ahrens and G. Finke, Journal of the Association for Computing Machinery 23, 1099 (1975). 2. H. Bouzoufi and S. Boulenouar and R. Andonov Actes du Colloque RenPar'10 (1998). 3. G. H. Chen and M. S. Chern and J. H. Jang, Parallel Computing 13, 111 (1990). 4. M. Cosnard and A. G. Ferreira, Parallel Computing 9, 385 (1989). 5. D. Fayard and G. Plateau, Mathematical Programming 8, 272 (1975). GerWan94 6. T. E. Gerash and P. Y. Wang, INFOR 32, 163 (1994 ). 7. E. Horowitz and S. Sahni, Journal of the Association for Computing Machinery 2 1 , 275 (1974). 8. M. Lauriere, Mathematical Programming 14, 1 (1978). 9. S. Martello and P. Toth, European Journal of Operational Research 1, 169 (1977). 10. S. Martello and P. Toth, Computing 2 1 , 81 (1978). 11. G- Plateau and M. Elkihel, Methods of Operations Research 49, 277 (1985). 12. P. Toth, Computing 25, 29 (1980).
PARALLEL ALGORITHMS TO OBTAIN THE SLOTINE-LI A D A P T I V E CONTROL LINEAR RELATIONSHIP JUAN C. FERNANDEZ Dept. de Ingenieria y Ciencia de los Computadores, Universidad Jaume I, 12071-Castellon (Spain), Phone: +34-964-728265; Fax: +34-964-728435, e-mail: [email protected] LOURDES PENALVER Dept. de Informdtica de Sistemas y Computadores, Universidad Politecnica de Valencia, 46071- Valencia (Spain), Phone: +34-96-3877572; Fax: +34-96-3877579, e-mail: [email protected] VICENTE HERNANDEZ Dept. de Sistemas Informdticos y Computacion, Universidad Politecnica de Valencia, 46071-Valencia (Spain), Phone: +34-96-3877356; Fax: +34-96-3877359, e-mail: [email protected] The dynamics equation of robot manipulators is non linear and coupled. One way to solve the control movement is to use an inverse dynamic control algorithm that requires a full knowledge of the dynamics of the system. This knowledge is unusual in industrial robots, so adaptive control is used to identify these unknown parameters. In general their linear regressor is based on the linear relationship between the unknown inertial parameters and their coefficients. A formulation to generalize this relationship is applied to the Slotine-Li adaptive algorithm. The objective of this paper is to reduce computing time using parallel algorithms.
1
Introduction
The dynamic equation of robot manipulator torque in open chain is determined by highly coupled and non linear differential equation systems. Thus, it is necessary to use approximate or cancelling techniques to apply some control algorithms, such as inverse dynamic control, over the full system. To apply these control techniques it is necessary to know the dynamics of the system. This knowledge allows the existing relations among the different links to be established. The links to establish the kinematic relations of the system are defined from Denavit-Hartenberg parameters. The inertial parameters, mass and inertial moments for each arm which are usually unknown, provide full knowledge of the dynamics. These parameters should be estimated using least square or adaptive control techniques. Using adaptive control it is possible to solve both movement control and
306
307
parameter identification problems. Some of these algorithms can be found in 1 ' 4 ' 6 ' 3 . One of the problems in applying this kind of algorithm is that of obtaining the relationship T = YT{q,q,q)9,
(1)
where YT(q,q,q), known as regressor, is a n x r matrix, n being the number of links and r the number of parameters; and 6 is the r x l parameter vector. Considering all the different parameters, r will be lOn. In this paper the linear relationship for the adaptive Slotine-Li algorithm using several algebraic properties and definitions given in 5 is employed. This formulation is a computable and general solution for any robot manipulator in open chain using Denavit-Hartenberg parameters and is an extension of the Lagrange-Euler formulation. The main problem with this formulation is its high computational cost, but there are several studies, 7 ' 2 , where this is reduced. In this paper the parallel algorithm to obtain the linear relationship for the adaptive Slotine-Li algorithm using the Lagrange-Euler formulation is presented. The structure of the paper is the following: In section two the dynamic model is presented. Section three describes the Slotine-Li algorithm. Section four presents the parallel algorithm to obtain the linear relationship for the adaptive Slotine-Li algorithm. The results for a Puma robot are described in section five. And finally the conclusions of this paper are presented in last section. 2
The Dynamic Model
The dynamic equation of rigid manipulators with n arms in matrix form is T = D(q)q + h(q,q) + c(q),
(2)
where r is the n x l vector of nominal driving torques, q is the n x l vector of nominal generalized coordinates, q and q are the n x l vectors of the first and second derivatives of the vector q, respectively, D is the inertia matrix, h(q, q) is the vector of centrifugal and Coriolis forces and c(q) is the vector of gravitational forces. 3
Slotine-Li Algorithm
This approach is not exactly based on dynamic inverse, because the control objective is not feedback linearization but only preservation of the passivity
308
properties of the rigid robot in the closed loop. But this method has been chosen because it is a well known adaptive algorithm. ^From Eq. (2) the control law chosen is u = D (q) v + %, (q, q) v + c(q) - KDr,
(3)
where v = qd — \(q — qd) = qd — Ae and r = q — v = e + Ae. A is a positive force diagonal matrix. Replacing this expression in Eq. (2) then Dq + hq + c = Dv + Tiv + c - KDr.
(4)
Given that q = r +v and q = r + v the previous equation can be written as Dr + hr + KDr = Dv +Hv + c = Ys(q,q, v,v)0,
(5)
where D — D — D, h = h — h,c = c — c and 9=0 — 6. Note that Y$ is not a function of manipulator aceleration. Only vector v depends on the reference aceleration and the velocity error. The adaptive control law resulting is
0=-r-1Yg(q,q,v,v)r,
(6)
T = Ys(q,q,v,v)6 - KDr.
(7)
Moreover, KD and T are diagonal matrices. 3.1
Reformulation
Using several algebraic properties and definitions given in 5 it is possible to obtain a computable version of Yg(q,q,v,v). The expression for the inertial matrix is given by D{q)v = YDi,{q)6D,
(8)
where -rtr(BDU)YDi(q) ) =
•• rtr
0
••
0
•• rtr
= [v{Ji) "W
(BDin)' rtr(BD2n)
(9)
(Bonn).
•••"(Jn)]T,
(10)
®Ujk)vj,
(11)
with k
BDik =
sXUik j=i
309 where Jk is the inertia tensor related t o the link k, Ujk is the effect of the movement of link k on all the points of link j , rtr of a matrix is a row vector whose components are the traces of the columns of this matrix and v is the column vector operator 5 . T h e expression for the centrifugal and Coriolis forces is given by h(q,q)v
=
YHv(q,q)9h,
(12)
where rtr (Bfcn) ••• YHv(q,q)
rtr(Bhln)
0
•••rtr(BM»)
0
rtr(Bhnn).
,8h = OD,
(13)
with i
i
Bhil = Yl "52(Uil ® Ukjl)qjVk, j=\
(14)
*=1
where Ukji is t h e effect of the movement of links j and I on all the points of link k. T h e vector of gravitational forces can be expressed as a linear relationship with the inertial parameters
(15)
Yc(q)0c,
where j ^ t / n gTU12 • • • gTUln gTu2„ 0 gTU22 Yc(q) =
0 0
9C = [ m i n 7712^2 •••
(16)
0 gTUnn mnr„\
(17)
where fi is the position of the centre of mass of link i (m,) with respect t o the origin of coordinates of link i. ^Prom the previous results v
s(q,q,v,v)
= YDi> + YHv + Yc.
(18)
The next section describes the parallel algorithm t o obtain Eq. (6), Eq. (7) and Eq. (18).
310
4
Parallel Algorithms
In order to obtain a low computational cost special attention has been paid to the following aspects: • It is possible optimize the number of operations involving the matrix products. Morever, the number of operations involving the Kronecker product is reduced 2 . These matrices have enough null elements to permit a reduction in the number of arithmetic operations. • It is possible to use different properties of the matrices, such as symmetry, repeat rows, etc., in order to reduce computational cost. • Parallel computing is used to reduce computational cost. The parallel algorithms to obtain Eq. (18) are presented below, where n is the number of links and p is the number of processors. To determine the calculations of each processor, two parameters have been defined, ik, the initial link, and fk the final link of processor Pk. Then Pk computes the operations of the links i*, i*+i ,• ••, fk- In this case: • Processor Pi has the values i\ = 1 and / i = n — p + 1 to obtain the matrices and vectors involved in the Slotine-Li parallel algorithm. • Processors Pk, k = 2 : p, have the values ik = fk = n — p + k to obtain the matrices and vectors involved in the Slotine-Li parallel algorithm. To obtain the matrices of Eq. (18), I/y matrices (U structure Eq. (19)) and Uijk matrices (DU structure Eq. (21)) must be computed. The U structure is given by Uu
Ul2 • • um
U
• •
Uln U2n
•Qi °Ai
Qi °A2
%Q2
X
A2
Qi °An °A1Q2 xAn
••
•A.n—iicjn
Unn
(19)
l
A
where Qi is the constant matrix that allows us to calculate the partial derivative of %Aj. To obtain U, 'Aj matrices (A structure Eq. (20)) are necessary. This structure is given by °Ai °A2 Y A2
sin
A=
(20) l
A sin _
311
where %Aj = tAj_x j 1Aj,i = 0:n-2, and the matrices of the diagonal are obtained from the robot parameters. In the parallel algorithm Pk computes i Aj,i = Q:fk-l,j=i + l:fk. Processor Pk needs °At, i = ik : fk, to obtain matrices Uu, i = ik : fk. To obtain the remaining matrices of U, Uij, j =ik : fk, i = 2 : j , each processor needs ° J 4,_i and *~1Aj,j=ik:fk,i = 2:j. All these matrices have been computed previously. DU has the following structure DU= [DUX DU2 -DUn f .
(21)
As Ukji = Ujki, it is only necessary to compute the following block of DUt, i = 1:n DUi (i : n,i: n) —
Uni UiH-\-i ' ' ' 0 Uii+u+1 ' • '
0
0
•••
Uiin Uii+in
(22)
Uiinn
Processor Pk, k - 1 : p, computes Uuj = QiUij, j = ik : fk, i = 1 : j , because it has the required U matrices. To obtain the remaining matrices of DU, Pk, k = 1 : p, computes Uijk, k = ik : fk, i = 2 : k and j = i : k. There are two situations: • To obtain the matrices of row i, UUJ = V , Q ; t _ 1 Aj, processor Pk needs Vi = ° i4i_iQi, « = 2 : / fc , and *-lAjt j = ik : A , i = 2 : j . These matrices have been computed previously. l 1 l x • To obtain the matrices of row I > i, UUJ = Uu-iQi ~~ Aj, ~ Aj is first computed by Pk. But Uu-i has been computed in another processor. In order to avoid communications Pk replicates the computation of this matrix.
Then, each processor calculates the block of columns of the structures U and DU corresponding to rank [ik : fk]. Pk, k = 1 : p, computes rtr(Boij), rtr(Bhij), Ycij and F s y , j — ik : fk and i = 1 : j . As the processors have all the information to compute BD and B^, no communication among them is necessary. To obtain matrix Yc no communication is necessary because each processor has the required U matrices. With this information Pk computes Eq. (18). To obtain Eq. (6), each processor Pk computes the following expression i
«i = E - r « l y a S » *
312
for i = ik • fk- And finally, the control law r = YsQ — KDT must be computed. Each processor Pk computes Ti,i — 1 : /*, using the matrices Ys%j and 0,- that it has calculated. Each processor sends the computed vector r,- to processor Pp. This processor receives these values and it obtains the final value of r. This is the only communication present in the algorithm. The product KDT is also computed in Pp. 5
Results
The sequential algorithm is evaluated using sequential execution time, Ts. The parallel algorithms are evaluated using parallel execution time Tp (p processors), Speed-up, Sp = Ti/Tp and efficiency, Ep — Sp/p. In the experimental results a Parsytec PowerXplorer multicomputer has been used. This multicomputer has four nodes where each node is composed by a Motorola Power PC 601 and an INMOS Transputer T800 communication processor. Communication routines in PVM and C language are used. The results have been obtained using the parameters of a Puma 600 robot with six links. In the parallel algorithms 2, 3, 4 and 6 processors have been used. In each case the links have been distributed among the processors according to ik and fk parameters. To present the results the following notation is used: pxax • • • x, where px is the number of processors used and ax • • • x is the number of links computed by each processor. For example, in p2a31, the first three links are calculated in Pi and the fourth link is computed in P? • If n = p the notation ol is used. Table (1) shows the results, in miliseconds, to compute the parallel algorithms of the adaptive Slotine-Li control, where Ts = 17,808 miliseconds. If each link is computed by one processor (p6al) a poor efficiency is obtained Table 1. Experimental results with n = 6 links. Algorithm
because the load balance is not good. It is possible to obtain a good load balance using communications among the processor but the execution time will increase. The objective is to reduce execution time even though efficiency is not good.
313
6
Conclusions
A generalized formulation for the linear relationship between variable dynamic terms and inertial ones is presented. The presented formulation provides this relationship knowing only the Denavit-Haxtenberg parameters for any robot manipulator in open chain. It is possible to apply this formulation to the Slotine-Li adaptive algorithm using the Lagrange-Euler formulation. Although the use of the Lagrange-Euler formulation produces a high computational cost, it possible to reduce it in two ways: • Eliminating the high quantity of null terms and exploiting the symmetry properties of the matrices. • Using parallel computing to obtain the different matrices of the dynamic equation and the linear relationship of the Slotine-Li adaptive algorithm. Using these two techniques it is possible to obtain the linear relationship of the Slotine-Li adaptive algorithm and apply it to an on-line identification problem in real-time. The presented formulation can be used for a robot manipulator with more than two links. In this paper this formulation is applied to a six link Puma manipulator. References 1. J. Craig, Adaptive Control of Mechanical Manipulators, Addison-Wesley (1988). 2. J.C. Fernandez, Simulacion Dindmica y Control de Robots Industrials Utilizando Computacion Paralela, Ph.D. Univ. Politecnica de Valencia (1999). 3. R. Johansson, Adaptive Control of robot manipulator motion, IEEE Transactions on Robotics and Mechatronics, 10(6), 475-481 (1999). 4. J.M. Ortega and M. Spong, Adaptive motion control of rigid robots: A tutorial, Automatica 25(6), 877-888 (1989). 5. L. Penalver, Modelado Dindmico e identificacion parametrica para el control de robots manipuladores, Ph.D. Univ. Politecnica de Valencia (1998). 6. J.J. Slotine and W. Li, On adaptive control of robot manipulators, Inter. Journal Robotics Research, 6(3), 49-59 (1987). 7. A. Y. Zomaya, Modelling and simulation of robot manipulators. A parallel processing approach, World Scientific Series in Robotics and Automated Systems, 8, (1992).
A H I G H - P E R F O R M A N C E G E P P - B A S E D SPARSE SOLVER ANSHUL GUPTA IBM
This paper describes the direct solver for general sparse matrices in Watson Sparse Matrix Package (WSMP). In addition to incorporating the best ideas from a number of other solvers into one package, the implementation of Gaussian Elimination with Partial Pivoting (GEPP) in WSMP introduces several new techniques. Our experiments on a suite of 25 large industrial problems show that the G E P P implementation in WSMP is about two and a half times faster than the best currently available similar software.
1
Introduction
The Watson Sparse Matrix Package (WSMP) l is a high-performance, robust, and easy to use software package for solving large sparse systems of linear equations using a direct method on the IBM RS6000 workstations and IBM Scalable Parallel (SP) systems. The WSMP library has separate sets of routines for solving symmetric and general sparse linear systems. This paper describes the techniques and algorithms used in developing the direct solver for general sparse matrices in WSMP. Developing an efficient parallel, or even serial, direct solver for general sparse systems of linear equations is a challenging task that has been the subject of research for the past four decades. Several breakthroughs have been made during this time. As a result, a number of very competent software packages for solving such systems are available i'2.3.4.5.6-7'8. The implementation of Gaussian Elimination with Partial Pivoting in WSMP incorporates the best ideas from a number of other solvers into one package and introduces several new techniques. Our experiments on a suite of 25 industrial problems show that LU factorization with partial pivoting (GEPP) in WSMP is about two and a half times faster than the best among other similar softwares available at the time of WSMP's release 9 . As is typical for such solvers, WSMP's general sparse solver has four distinct phases, namely, Analysis or preprocessing, Numerical Factorization, Forward and Backward Elimination, and Iterative Refinement. In this paper, we focus on the more important and expensive steps of analysis in Sec. 2 and numerical factorization in Sec. 3. Sec. 4 contains experimental results and a comparison with MUMPS 2 .
314
315 2
The Analysis Phase
The analysis phase of WSMP plays an important role in the achievement of its overall goal of a fast solution to the given sparse system. In this section, we describe the various algorithms used in preprocessing the sparse coefficient matrix before its LU factorization. Gupta 10 gives more details on these algorithms. 2.1
Reduction to block-triangular form
The first step in the symbolic preprocessing of a sparse system in WSMP is to find a row permutation to reduce the coefficient matrix into block uppertriangular form u . The task of solving the original system now reduces to a block-triangular solve involving factorization of only the smaller irreducible block-diagonal matrices, and therefore, can be accomplished with much less computation. Additionally, some of the algorithms used in symbolic and numerical factorization in WSMP work correctly only if the coefficient matrix is irreducible to a block-triangular form. All sparse matrices do not yield a block-triangular form and sometimes, this step results in the trivial case of a single diagonal block consisting of the entire matrix. 2.2
Row permutation to shift large entries to the diagonal
Permuting the rows of a sparse matrix to ensure a non-zero diagonal and to maximize the product of the absolute values of the diagonal entries has been used successfully to reduce or eliminate dynamic row-exchanges due to partial pivoting during LU factorization 2>6>12. This is accomplished by finding a maximum weighted matching in the bipartite graph in which there is an edge of weight log(l -I- \a,ij\) if ay ^ 0 in the original coefficient matrix A. We take advantage of this preprocessing step in WSMP and use partitioning and scaling of the bipartite graph, as described by Gupta and Ying 13 , to speed up this step. 2.3
Graph-partitioning based ordering for fill-in reduction
After permuting the rows of the coefficient matrix, we compute a nesteddissection ordering based on the symmetric structure of A + AT. We use a multilevel graph partioning algorithm 14 to compute this permutation and apply it symmetrically to both the rows and the columns of A. As shown in Gupta 14 , an ordering based on graph partioning is, on an average, more effective in reducing fill-in than local heuristics such as MMD used in SuperLU 6
316 and AMD used in MUMPS 2 . 2.4
Symbolic factorization
Gilbert and Liu 15 described elimination structures for unsymmetric sparse LU factors and gave an algorithm for sparse unsymmetric symbolic factorization. Their elimination structures are two directed acyclic graphs G(L°) and G(U°). These are transitive reductions of the graphs G(L) and G(U), respectively, of the factors L and U of A. The union of G(L°) and G(U°) is also the minimal task-dependency graph for the LU factorization of A without pivoting, where a task T* corresponds to the elimination of the i-th row and column during Gaussian elimination. Using a minimal dependency structure is beneficial because it avoids some overheads due to redundancy in symbolic and numerical factorization and exposes maximum parallelism. However, some researchers have argued that computing an exact transitive reduction can be too expensive 5 ' 1 6 and have proposed using sub-minimal DAGs. In this section, we describe an inexpensive way of computing transitive reductions and maintaining a minimal task-dependency DAG during symbolic factorization and a near-minimal data-dependency DAG during numerical factorization. Gilbert and Liu 15 maintain the original matrix A as well as the structures of the factors L and U by their rows. The symbolic factorization then involves performing the following four steps on each index i from 1 to n: (1) Compute the structure Struct(Lj«) of the i-th row of L by traversing the DAG G(f/f_1) consisting of the first i — 1 nodes, (2) Transitively reduce Struct(Lj») to extend the DAG G(L?_i) to G(L?), (3) Compute the structure S t r u c t ^ . ) of the i-th row of U as the union of the structures of all nodes that are connected to node i by an edge in G(L^), and (4) Transitively reduce Struct(C/,,) to extend the DAG GQJU) to G{U9). Before the symbolic factorization step, we split A into a lower triangular part Ai stored column-wise and an upper triangular part Au stored row-wise. During symbolic factorization, we maintain the structures of L and U in column-wise and row-wise formats, respectively. We then need to apply only steps (2) and (3) of Gilbert and Liu's algorithm on two sets of similar data structures for the lower and upper triangular factors and DAGs. An important advantage of maintaining the factor structures in this manner is that supernodes can be identified as symbolic factorization proceeds; i.e., as soon as Struct(L*j) and Struct([/j„) are computed, we can determine if z belongs to the same supernode as any one of nodes with edge connections to i in G{L°) and G(U°). This is possible because we can compare the column structures of L and row structures of U. We use a relaxed supernode criterion and if
317
Struct(L.j) and Struct([/j») are nearly similar to those of one of candidate indices, we absorb i into that supernode. As the supernodes are identified, we add them to two linked-list based data structures that represent the row structures of L and the column structures of U. In other words, Struct(L;») and Struct([/*j) are available in the form of linked lists, but they contain only one entry for each supernode. Similarly, we only maintain supernodal DAGs that we denote by G(L°S) and G(Ug), which have much fewer nodes than G{L°) and G{U°), respectively. Moreover, the number of edges in the structures of L and U that need to be traversed to compute the transitive reductions is even smaller. This is because during the process of absorption of nodes to form supernodes, all cliques and near cliques of L and U are replaced by single supernodes. These cliques and near cliques are responsible for a significant portion of the edges in the structures of L and U and thus account for a majority of the time spent in transitive reduction computation in Gilbert and Liu's 15 algorithm. The method we described above is fast and yields the structures of L and U, as well as a minimal supernodal task-dependency DAG, which is the union of G{L°S) and G{US). Unlike symmetric sparse factorization, in which the task-dependency graph is a tree and is the same as the data-dependency graph, in unsymmetric factorization, the task-dependency DAG is only a subset of the data-dependency DAG. The latter is what we need to proceed with the multifrontal factorization in order to correctly apply all the updates. We now describe the steps that enable us to construct a minimal datadependency DAG D(V, E) whose vertex set V is the set of all supernodes. First, we initialize the edge set E with all the edges of G(L°S) U G(JJg). The edges that are in G(L°S) — G(Ug) are labeled as L-edges, the edges that are in (£/J) - G(L%) are labeled as U-edges, and the edges in G(L%) n G(Ug) are labeled LU-edges. All edges are directed from a supernode with the smaller index to that with the greater index. Let s and t be two disconnected supernodes containing the indices i and j , respectively. Let i < j , from which s < t follows. If j belongs to Struct(£/j»), t is smaller than the LU-parent of s, there exists & k > j such that A; £ Struct(L»j), and none of the indices contained in any U-parent of s belong to Struct(C/«j), then we add a U-edge st to E. L-edges are added similarly. Next, for each supernode s, if t is the supernode with the smallest index with both L- and U-paths from s, then an LU-edge st is added to E and all edges directed out of s to supernodes greater than t are deleted. The result is a near-minimal data-dependency DAG valid for sparse factorization of A without partial pivoting. In the next section, we will describe a simple procedure to enhance this DAG to accommodate partial pivoting.
318
3
Unsymmetric multifrontal factorization with partial pivoting
Due to space constraints, we will not describe the basic unsymmetric-pattern multifrontal algorithm, for which we refer the reader to the paper by Davis and Duff 5 . Hadfield 17 introduced a method to handle partial pivoting within the framework of the unsymmetric-pattern multifrontal algorithm. Our serial sparse LU factorization uses the same basic ideas as Hadfield's. When factoring a supernode, we first try to find a column that has at least one row entry within the supernode that satisfies the pivoting criterion. Thus, partial pivoting within a supernode can involve both a row and a column exchange. This pivoting does not alter the structure predicted by symbolic factorization. If some rows and columns of a supernode fail to satisfy the pivoting criterion, then these indices are transferred to the LU-parent of the supernode. This is equivalent to a symmetric permutation. Such inter-supernode pivoting can add fill-in over and above that predicted by symbolic factorization in the supernodes on the paths from the current supernode to its LU-parent. It also adds some extra data-dependencies in the portion of the DAG between the supernode with failed pivots and its LU-parent. Therefore, the data-dependency DAG that we described in Sec. 2.4 needs to be augmented. Unlike Hadfield 1T, we add these edges to the DAG before proceeding to the actual numerical factorization. Our experience with a large number of real problems shows that the number of extra edges that is required to accommodate all possible pivot failures is fairly small. If t is the LU-parent of a supernode s and there exists an L-path sr such that r < t and the LU-parent of r is greater than t, then a U-edge rt is added to the DAG, if it doesn't already exist. Similarly, if there exists a U-path sr such that r < t and the LU-parent of r is greater than t, then an L-edge rt is added to the DAG. After augmenting the data-dependency DAG with these edges, we can use it for LU factorization of multiple matrices with the same sparsity pattern. The multithreaded sparse factorization in WSMP follows the datadependency DAG described earlier. Two types of parallelism are used. Whenever possible, independent portions of the DAG are processed in parallel. While processing those portions of the DAG where enough independent tasks cannot be found to keep all processors busy, multiple processors are used to perform the level-3 BLAS operations on supernodes in parallel. Before starting parallel factorization, we spawn T threads that wait on a task queue to pick up tasks and execute them. T is set to a value greater
319
than the actual number of processors to facilitate load-balancing. The parallel algorithm starts by calling the root of the DAG with parameter P that indicates the maximum number of parallel subtasks that this task may spawn. For the root task, P is set to T. For a given task, if P is 1, then the task recursively performs all subtasks corresponding to those of its children in the DAG that have already not been finished or under execution by another parent. If P is greater than 1, then the tasks divides P in proportion to the amount of work associated by each of its unprocessed children and places the subtasks corresponding to these children in the task queue. Thus the original P = T gets subdivided into smaller parts and eventually becomes 1 as the computation proceeds down the DAG. After the subtasks corresponding all the children of a task are finished, their contributions are assembled in the frontal matrix corresponding to the current task and a (possibly parallel) partial LU factorization is performed. Since children may be shared, appropriate locks are used to avoid race conditions. To avoid excessive overhead, locks are used only when there is a possibility of a conflict. Through a one-time depthfirst search of an undirected version of the data-dependency graph during the analysis phase, we identify its biconnected components and the corresponding articulation points. During LU factorization, we check the P parameter of the task corresponding to the smallest articulation point between a given task (supernode) and the root of the DAG. Only if this P is greater than 1, the given supernode has the potential for being accessed by multiple threads. 4
Experimental results
Gupta 9 presents a detailed experimental comparison of most of the commonly used software packages for solving general sparse systems of linear equations using a direct method and concludes that MUMPS 2 was overall the fastest direct sparse general solver before WSMP. Therefore, in this paper, we compare WSMP with MUMPS. Table 1 shows the sparse LU factorization times for 25 publicly available sparse matrices on 1 and 4 CPUs. All experiments were conducted on an RS6000 with 375 MHz Power-3 processors with 64 KB LI cache each, a shared 8 MB L2 cache, and a shared 4 GB RAM. Both codes were compiled with - 0 3 optimization in 32-bit mode. Therefore, only 2 GB of memory was available to each process. However, since MUMPS uses MPI, each of the four parallel processes could use up to 1 GB of memory each. As Table 1 shows, WSMP is about two and a half times faster than MUMPS on an average. A significant part of WSMP's advantage over MUMPS comes from the use of a better fill-reducing ordering. For very unsymmetric matrices, the underlying unsymmetric pattern factorization algo-
320 Table 1. Factorization times of MUMPS and WSMP on 1 and 4 processors. Matrix af23560 av41092 bayerOl bbmat comp2c e40r0000 ecl32 epb3 fidapOll fidapmll invextrl lhr34c mil053 mixtank nasasrb onetonel onetone2 pre2 raefsky3 raefsky4 rmalO twotone venkat50 wang3 wang4
rithm of WSMP gives it a considerable advantage over the symmetric pattern factorization of MUMPS.
5
Conclusion
In this paper, we have shown how the use of proper algorithmic techniques for ordering, symbolic factorization, and numerical factorization, has allowed us to develop a solver that significantly improves the state of the art of direct solution of general sparse systems of linear equations.
321
References 1. A. Gupta. WSMP: Watson sparse matrix package. Technical Report RC 21888 (98472), IBM T. J. Watson Research Center, NY, November 20, 2000. http://www.cs.umn.edu/~agupta/wsmp.htTnl 2. P. R. Amestoy, I. S. DufF, and J. Y. L'Execellent. Multifrontal parallel distributed symmetric and unsymmetric solvers. Comp. Meth. in Appl. Mech. Eng., 184:501-520, 2000. 3. C. Ashcraft and R. G. Grimes. SPOOLES: An object-oriented sparse matrix library. In Proc. Ninth SIAM Conf. Parallel Proc. for Sci. Comp., 1999. 4. M. Cosnard and L. Grigori. Using postordering and static symbolic factorization for parallel sparse LU. In Proc. IPDPS 2000. 5. T. A. Davis and I. S. DufF. An unsymmetric-pattern multifrontal method for sparse LU factorization. SIAM J. Mat. Anal. Appl., 18(1):140-158, 1997. 6. X. S. Li and J. W. Demmel. A scalable sparse direct solver using static pivoting. In Proc. Ninth SIAM Conf. Parallel Proc. for Sci. Comp., 1999. 7. K. Shen, T. Yang, and X. Jiao. S+: Efficient 2D sparse LU factorization on parallel machines. SIAM J. Mat. Anal. Appl., To be published. 8. O. Schenk, W. Fichtner, and K. Gartner. Scalable parallel sparse LU factorization with a dynamical supernode pivoting approach in semiconductor device simulation. TR 2000/10, Swiss Federal Institute of Technology, November 2000. 9. Anshul Gupta. Recent advances in direct methods for solving unsymmetric sparse systems of linear euqations. Technical Report RC 22039 (98933), IBM T. J. Watson Research Center, Yorktown Heights, NY, April 20, 2001. 10. Anshul Gupta. Improved symbolic and numerical factorization algorithms for unsymmetric sparse matrices. Technical Report RC 22137 (99131), IBM T. J. Watson Research Center, Yorktown Heights, NY, August 1, 2001. 11. I. S. DufF, A. M. Erisman, and J. K. Reid. Direct Methods for Sparse Matrices. Oxford University Press, Oxford, UK, 1990. 12. I. S. DufF and J. Koster. On algorithms For permuting large entries to the diagonal of a sparse matrix. Technical Report RAL-TR-1999-030, Rutherford Appleton Laboratory, April 19, 1999. 13. A. Gupta and L. Ying. Algorithms for finding maximum matchings in bipartite graphs. Technical Report RC 21576 (97320), IBM T. J. Watson Research Center, Yorktown Heights, NY, October 19, 1999. 14. A. Gupta. Fast and effective algorithms for graph partitioning and sparse matrix ordering. IBM J. Res. Dev., 41(1/2):171-183, 1997. 15. J. R. Gilbert and J. W.-H. Liu. Elimination structures for unsymmetric sparse LU factors. SIAM J. Mat. Anal. Appl., 14(2):334-352, 1993. 16. S. C. Eisenstat and J. W.-H. Liu. Exploiting structural symmetry in a sparse partial pivoting code. SIAM J. Sci. Comp., 14(l):253-257, 1993. 17. S. M. Hadfield. On the LU Factorization of Sequences of Identically Structured Sparse Matrices within a Distributed Memory Environment. PhD thesis, University of Florida, Gainsville, FL, 1994.
SYSTEM MODEL FOR IMAGE RETRIEVAL ON S Y M M E T R I C MULTIPROCESSORS O. KAO Department of Computer Science, Technical University Julius-Albert-Strasse 4, D-38678 Clausthal-Zellerfeld, e-mail: [email protected]
of
Clausthal Germany
Retrieval with a priori extracted features limits the applicability of image databases, as a detail search for important objects cannot be performed. A dynamic retrieval solves this problem, but it also requires immense computational resources and can only be handed by parallel architectures. A simulation and an analysis of the main parallel database architectures have shown, that clusters and symmetric multiprocessors can be - dependent on the scope of the application in mind and-the number of images managed - suitable for different aspects of image databases with dynamic retrieval. In this paper a prototype for the realisation of an image database on a symmetric multiprocessor system is introduced.
1
Introduction
Systems for the archival and retrieval of images can be found in many areas, for example medical applications, remote sensing, news agencies, authorities, and museums. They contain general sets of images and allow a search for a number of images similar to a given sample image or sketch. An image description requires the extraction and analysis of various information regarding the objects, persons, textures, etc. in the scene, which can be classified into following groups: • Raw image data is the matrix with the pixel colour values. • Technical information describes the image resolution, number of used colours, file format, etc. • Results of the image processing and analysis enclose image features derived directly from the raw data, such as colour distribution, texture and shape descriptors, as well as topological data. • Knowledge-based information describes the relationship between the image elements and the real world entities, for example who or what is shown on the image. • World-oriented information concerns usually the acquisition time and date, photographer and a manually compiled set of keywords.
322
323
The raw image data, the technical and world-oriented information can be represented by traditional data structures and stored in existing databases. The user can either browse the database or search it by entering keywords. Examples for such databases with images1 are widely available on the Internet, for example as a part of web presentations of museums and art galleries. These primitive retrieval and annotation methods have many disadvantages in terms of time-intensive and difficult selection of appropriate descriptions. Therefore, an image database should support improved, content-based querying, retrieval, and annotation methods. Interfaces for the creation of a query should include visual methods like query-by-pictorial-example and query-by-sketching2. 2
Image Retrieval
The state-of-the-art approach for the creation and retrieval of image databases is based on the extraction and comparison of a priori defined features. These are directly derivable from the raw data and represent properties related to the dominant colours, important shapes, textures, etc. Recent works consider the image semantics and try to define and integrate different kinds of emotions 3 . The similarity degree of a query image and the target images is determined by calculation of a multidimensional distance between the corresponding features. The result is a sorted list with n distances, where n is the number of stored images. The first k elements correspond to the most similar images in the database and the raw data of those images is presented to the user. Satisfactory retrieval results are reported for a number of existing image databases, when a small number of image classes is handled. Acceptable system response times are also achieved, because no further processing of the image raw data is necessary during the retrieval process. The straightforward integration in traditional database systems is a further advantage of this approach. On the other hand extraction of simple features results in disadvantageous reduction of the image content. Important details like objects and persons shown on the picture are not sufficiently considered in the retrieval process and a precise detail search is not possible. For the realisation of queries like "Show all images containing the marked object X" more powerful methods with dynamic feature extraction are necessary. Thereby, manually selected image elements are extracted, analysed, and described by a number of features. Subsequently, these regions of interest are directly compared to all images in the database in order to find the marked object.
324
An example for this operation is given by template matching. The region of interest is represented by a minimal bounding rectangle and correlated with all images in the database. In contrast to the previous approach image features and attributes cannot be defined and extracted a priori, so all images in the database have to be processed. The dynamic feature extraction increases the computational complexity for the query processing significantly. The matching process is repeated for all possible image sections, sizes, and rotations. The maximal range for these transformations and the stride depend on the concrete application. The wider the range, the more complex and compute intensive is the matching operation. For example, the processing of a 1024 x 768 image with a 100 x 100 template with only two sizes and without rotation requires approximately 12344-106 additions. Summarising 4 , the computational and memory requirements of image databases with dynamic retrieval often exceed the performance capabilities of computer architectures with a single processing element (PE). Therefore, the properties and the suitability of parallel architectures for the realisation of dynamic image retrieval are assessed. 3
Parallel architectures for dynamic image retrieval
Parallel architectures represent key components of modern computer technology with a growing impact on the development of information systems. From the viewpoint of the database community these are classified5 in shared nothing, shared everything, and shared disk architectures. Shared nothing architectures are usually used for the realisation of databases distributed over wide area networks. Each node has an individual database management and file system. All operations are performed with the local data and the communication follows the client/server paradigm. Shared everything architectures are the main platform for parallel databases and most vendors offer appropriate extensions for existing systems. Fast communication and synchronisation as well as easy management are advantages of this architecture class. Disadvantages concern the fault tolerance and limited extensibility. Shared disk systems host distributed databases in local area networks. Each node has a local database management system and is connected over a shared file system. An important issue of these systems is the fault tolerance for critical applications. The important question is, which computer architecture is suitable for an efficient realisation of dynamic image retrieval. In particular, the problem is characterised by the large data blocks of the images and other multimedia objects. Thus, the communication between the I/O subsystem, memory, and
325
PEs has a significant impact on the overall performance. The investigation for the solution of this problem encompasses the development and simulation of system models and validation of the obtained results by executing performance measurements with the implemented prototypes. A simulation-based assessment of parallel architectures for image databases is provided by G E I S L E R 7 . The simulation task consists of the decoding, decompression, and analysis of a subset of a larger database, which is created as a preliminary result by a database retrieval using a priori extracted attributes. As a measure for the performance the number of decompressed and analysed bytes in proportion to the elapsed time is taken (throughput). Of special interest are the speedup and the efficiency of the different architectures with respect to an increasing number of PEs. The results of the performance measurements with the simulated architecture models show, that shared nothing architectures offer the best scalability and speedup increase for large image databases. On the other hand shared everything architectures are well-suited for image databases, if a relatively small set of images has to be managed and no more than eight PEs are used. Moreover, the synchronisation and communication via shared memory blocks enables an efficient execution of queries combining a priori and dynamically extracted features. The elimination of the time-intensive transfers of large data blocks between the nodes during workload balancing offers significant advantages compared with shared nothing architectures. Another reason for the development of such a system model is the necessity for validation of the assumed simulation factors such as transfer latency between I/O subsystem and memory, delays through access conflicts, and management overhead. Investigations for the large multimedia data blocks are still not available, so these factors are difficult to model and to evaluate. The next section describes a prototype for image retrieval with dynamic feature extraction, developed for a shared everything architecture. A system model for cluster architectures and performance measurements for the problem described are provided by K A O and STAPEL 4 . 3.1
Dynamic image retrieval on a shared everything architecture
Widespread examples of shared everything architectures are symmetrical multiprocessors (SMP). The SMP parallelisation of image processing operators is usually based on the subdivision of the image into independent sections and the distribution over the available PEs. This approach is not always well-suited for image retrieval methods, as several operators consider the dependencies between all pixels, thus an image separation could lead to falsi-
326
fication of the results. Therefore, the proposed SMP prototype is based on task parallelisation, where the available PEs compute a number of images simultaneously. As already noted the efficiency and the speedup of the parallel retrieval depend significantly on the characteristics of the I/O subsystem, as timeintensive data transfers cause long idle times of the PEs and reduce the performance gain 6 . Therefore, the main demand on the introduced SMP processing model is the minimisation of memory conflicts and idle times. This is achieved by developing two cooperating layers. The first layer contains only one loading module, which transfers the images from the file system into a particular area of the main memory. A dedicated PE executes this process permanently. The second layer contains processing modules, which are assigned to the other available PEs in the system. Each processing module requests from the loading module a pointer to the next image and information about the size. If there are unprocessed images in the memory, it receives the memory address of the section with the image and starts the content analysis. Otherwise the requests are temporarily queued, until a new image is loaded. Subsequently the memory block is analysed and the results are stored in an index structure or in a file.
Figure 1. System model for a SMP-based image retrieval
327
During the image analysis the loading module fetches new images and stores them in a memory buffer. If the maximal capacity of the buffer is reached, the loading process is suspended and an additional PE supports the image analysis. Once the amount of images in the buffer falls bellow a given threshold value the loading process is re-started. In the last phase, the results of the image analysis are collected and evaluated, thus an image ranking can be generated. A graphic representation of the system model is shown in Figure 1. The implementation uses the Posix thread concept: A main thread controls the entire system, initialises, starts, and synchronises the loading and computing threads and unifies the sub results. An individual thread is also assigned to each processing and loading module. 4
Performance Measurements
The performance of the developed prototype is measured on a parallel SMP server DEC Alpha 4100 with four PEs (600 MHz DEC Alpha 21164), one GByte main memory, and Digital Unix operating system. The reference runtimes for the speedup and efficiency are obtained using one processing thread. The test database contains more than 10000 images in a raw format, which does not provide any data compression. The average size of the images amounts 1.2 MBytes corresponding to a resolution of 768 x 512 pixels. A sequence of standard operators is applied to the images. The results are stored in memory-resident index structures eliminating the necessity for hard disk access. The buffer size is set to ten, i.e. as soon as ten unprocessed images are in the memory, the loading module is temporarily suspended and an additional processing thread is started on the now available PE. The measured runtimes are compared to the straightforward processing model, where four concurrent processes are started. Each of these processes loads an image, analyses it, and updates the index structure. Three different configurations considering operations with a low analysis effort (< Is/image), moderate (1.5-3s/image), and high analysis effort (15s/image) are investigated in order to evaluate the management overhead and the performance gain by minimising the access conflicts. All performance measurements have in common, that the best speedup and efficiency values are obtained with three processing threads (note that there is an additional loading thread). In the case of the low effort configuration a performance gain of approximately 16% is achieved. The probability for an access conflict falls with an increasing processing time per image and thus the loading thread is less important. This can be seen clearly using a
328
high effort configuration, where the modified system model needs nearly 20% longer than the straightforward solution. No significant difference is noted when image analysis with moderate effort is applied.
Spctdup A F J I b w u i
•Efficiency
4 5 8 Processing Threads
Figure 2. Speedup and efficiency values obtained with the implemented prototype (1000 images, moderate analysis effort)
As an example, the speedup and efficiency diagram of the configuration with moderate analysis effort is depicted in Figure 2. As expected three threads deliver the best speedup of 2.87. Four threads increase the management and initialisation effort and lead to a lower speedup of 1.98. An unusual value is noted when using five processing threads, as a significantly better speedup (2.42) is obtained. Retrieval with six threads delivers nearly the same value as the case with four threads, but a higher speedup is reached with seven threads. This unexpected behaviour can still be seen - even more clearly - if an increasing number of images and if a large number of up to 20 processing threads is considered. An analysis of the runtimes shows that the processing with odd numbers of threads leads to the expected behaviour with a global minimum by three threads and increasing runtimes with a larger number of threads. A sinus-like diagram results from the processing with an even number of threads, as the runtimes decrease with a larger number of threads. We suppose, that this behaviour is caused by the Digital UNIX components for thread management, as corresponding tests on a DEC workstation result in a similar diagram, but not the measurements on a Linux SMP workstation.
329 5
Conclusions
In this paper an approach for image retrieval based on dynamic feature extraction and comparison is discussed. The related performance problem can be solved by powerful parallel architectures. Therefore, a prototype for the realisation of an image database on a symmetric multiprocessor is introduced. Performance measurements showed differences to the known simulation results, mainly caused by varying overhead for the thread management and synchronisation. On the other hand the suitability of the proposed SMP system model for image retrieval with low complexity operations can be acknowledged, as reasonable speedup and efficiency values are obtained. Moreover, the straightforward implementation and management offer significant advantages compared to cluster architectures, which are more suitable for a dynamic retrieval of very large image databases. Future work includes development of parallel methods for dynamic feature extraction and comparison. Moreover, a further refinement of the simulation models, additional performance measurements, and comparisons with other SMP system models have to be executed. References 1. Santini, S., Jain, R., Image Databases and not Databases with Images. Proceedings Image Analysis and Processing, ICIAP 97, (1997) pp. 38-48. 2. Ashley, J., et. al., Automatic and semi-automatic methods for image annotation and retrieval in QBIC. Proceedings of Storage and Retrieval for Image and Video Databases III, (1995), pp. 24-35. 3. Del Bimbo, A., Expressive Semantics for Automatic Annotation and Retrieval of Video Streams, Proceedings of the IEEE Conference on Multimedia & Expo, (2000), pp. 671-674. 4. Kao, O., Stapel, S., Case Study: Cairo - A Distributed Image Retrieval System for Cluster Architectures. T. K. Shih (Edt.): Distributed Multimedia Databases: Techniques and Applications, Idea Group Publishing, (2001), to be published, 5. Reuter, A., Methods for parallel execution of complex database queries. Journal of Parallel Computing, 25(13-14), (1999), pp. 2177-2188. 6. Gaus, M., Joubert, G.R., Kao, O., Distributed high-speed computing of multimedia data, Proceedings of Parallel Computing, (1999), pp. 510-517. 7. Bretschneider, T., Geisler, S., Kao, O., Simulation-based Assessment of Parallel Architectures for Image Databases. Proceedings of the Conference on Parallel Computing (ParCo 2001), (2001), to be published.
G R A N U L A R I T Y A N D P R O G R A M M I N G P A R A D I G M S IN PARALLEL M P P IMAGE CODING R. NORCEN'AND A. UHL RIST++ & Dept. of Scientific Computing, University of Salzburg, Austria E-mail:
{rnorcen,uhl}Qcosy.sbg.ac.at
Matching Pursuit Projection (MPP) is an approach to non-orthogonal transformation based compression. Similar to fractal compression, this type of encoding has to be paid with an enormous computational complexity. Despite its computational demand, M P P image and video coding has proven to be competitive in terms of rate-distortion performance. In this work we discuss approaches to parallelize a Matching Pursuit image coder. Different granularity levels and MIMD programming paradigms are compared with respect to their efficiency.
1
Introduction
Matching Pursuit Projection (MPP) or variants of it have been suggested for designing image compression1'2 and video compression 3 ' 4 ' 5 algorithms and have been among the top-performing contributions to MPEG-4. However, the good rate-distortion performance has to be paid with an enormous computational complexity. Parallelism is one possibility to increase the processing speed and to make the MPP approach useful for a wider range of applications. Although the outer structure of MPP seems to inhibit any parallelism at first look, there has been already done some work in this area. Feichtinger et al. 6 propose a hierarchical parallelization by a Gram-Schmidt type orthogonalization of the codebook (=dictionary) whereas Dodero et al. 7 discuss the parallelization of Mallat's MPP package via exploiting fine grained parallelism in the dictionary. In this work we discuss different granularity levels and MIMD programming paradigms for the parallelization of an MPP image coder. In particular, we compare image tiling and dictionary partitioning with respect to granularity. Message passing and a simple thread-based implementation are used in our algorithms. Experimental results are given for an SGI Power Challenge, a Cray T3E, and a Siemens HPC cluster.
• S U P P O R T E D BY T H E AUSTRIAN SCIENCE FUND F W F , P R O J E C T NO. P13903.
330
331
2
A Sequential Approach to M P P Image Coding
Let D = {7}7€r be a dictionary consisting of arbitrary basis functions and / an image block. The Matching Pursuit algorithm 8 performs as follows: 1. ft°/ = / a n d n = 0 2. gJn = s u p 7 € r
(Rnf,gy)
3. Rn+1f
Rnf-{Rnf,g7n)97r.
=
4. If \Rn+1f\ < e or n > Max.Iterations
then STOP
5. n = n + 1 , goto 2 This algorithm selects iteratively the best dictionary function g 7 n , that is, the function with the highest inner product of function and current residual Rnf, and subtracts it from the residual. The process stops when the norm of the residual is below a chosen threshold e or a maximum number of functions ( i t erations) Max.Iterations is reached. The first iteration of a 2-dimensional MPP is illustrated in figure 1.
Figure 1. Illustration of the first iteration: g 70 is the best matching atom (dictionary function) for residual R°f (original image). (R°f,g~,0\ g 7 o is then subtracted from R°f to retrieve residual R1/.
Schneider9 introduced the Korn coder, a variant of MPP using a dictionary based on rotated Gabor functions1 and several speedup techniques to enhance the runtime performance of MPP (different to those proposed in 1 0 ).
332
The image compression quality of the Korn coder is comparable to state of the art wavelet image coders like the EZW or the SPIHT coder and clearly superior to JPEG (see figure 2).
Figure 2. Compression ratio 35 achieved with J P E G (left: PSNR 24.11) and the Korn coder (right: PSNR 27.44): Image Lena 256 2 .
The fundamental building blocks of the Korn coder are a set of 32 x 5 rotated Gabor functions. Every function can be translated in the xy-plane to be centered on an arbitrary position (u,v), all these possible translations form the dictionary D of the Korn coder 9 . All functions gv £ D of equal size build Scale(s), s £ {0,1,2,3,4} and have a support of 2 S + 2 pixels in both dimensions x and y. Several speedup techniques are implemented in this coder to reduce the coding time for a 256 x 256 pixels image from a couple of days to only a few seconds: Scale-by-Scale Speedup: In iteration m, the scale-by-scale-speedup always looks for a best matching atom glm £ Scale(s) instead of investigating all functions gv £ D. Encoding is done progressively from Scale(4) atoms, describing the low-frequencies, to Scale(0) atoms, describing the high-frequency information. Block Decomposition with Overlap: The entire image is decomposed into so called logical blocks of size p x p. To find a matching atom, only dictionary functions centered within one selected block (px, py) are investigated. In successive iterations, image blocks are covered consecutively row by row. Updating Approach: Among consecutive iterations of a MPP, a lot of re-
333
dundancy can be exploited: Inner products which are not influenced by the subtraction of gln can be reused in the next iteration n + 1. Even influenced inner products can be computed by one single product (7i 6 D): = <#7,ffr,> - (Rnf,9^)
{9^,9^)
• 9
Because of the memory demand for a reasonable dictionary , the usage is limited to small dictionary functions of Scale(0) and Scale(l). Image and Basis Function Subsampling: A certain part of the image and dictionary functions can be subsampled by a factor sub. After looking for a best match in this subsampled space applying the same MPP techniques, the best match is subtracted from the original residual. 3
Parallel M P P Image Coding
To further increase the runtime performance of the Korn coder, we propose several approaches for MIMD type parallelizations. 3.1
Coarse-grained Parallelization: Image Tiling
We apply an image tiling where one or more parts of the image are assigned to one processor, which is a common approach in parallel image processing. However, in contrast to classical block-based algorithms like e.g. JPEG, independent parallel processing of the image parts will lead to a different result (generally worse) as a sequential execution since overlapping blocks are used in the Korn coder and the computations in adjacent image parts influence each other. Note, that the image is not partitioned into n equally-sized tiles when employing n processors. In order to match the block structure of the algorithm better, a recursive tiling procedure is applied where always the largest tile is split into two equally-sized tiles if an additional tile is required (see fig. 3 as example for n = 3). Obviously, this leads to differently sized tiles for different processors and the associated load balancing problem. The latter turns out to be of minor relevance since even in case of equally-sized tiles the load will be unbalanced in general due to the data dependent execution behaviour of MPP. Therefore, a load balancing scheme is mandatory anyway: A processor having finished the computations on its image tile is assigned to assist the most loaded remaining processor by taking over the work corresponding to half of the basis functions in the dictionary. This scheme may also be applied recursively. Let P\,...,Pn be the set of available processors. The input image / is split into n tiles where every Pi gets one tile plus additional overlapping data
334
(denoted local image data). According to the image tiling P{ may have one or more neighbour processors. To compute one parallel iteration for Scalers) in dependent-mode, every P, has to perform the following steps: 1. Find the best matching atom gi of Scale(s) applying the techniques of the Korn coder, and - if worth - subtract gt from the local image data. 2. If the local image data of a neighbour processor of P, is influenced by gt, send update information to that processor. 3. Receive match information from neighbours to update the local image data. To compute one parallel iteration in independent-mode, we skip steps 2 and 3 and restrict the subtraction of atom gt to the image tile without overlapping data. Figure 3 illustrates the image data after performing one iteration in
Figure 3. Dependent- versus independent-mode with 3 processors: Residuals after performing the first parallel iteration (Left image: Initial encoding/region data; middle image: Dependent-Mode; right image: Independent-Mode).
independent-mode and one iteration in dependent-mode. Independent-mode, though easy to implement, turns out to reduce the rate-distortion significantly for an increasing number of processors, since image data at the block-borders cannot be encoded efficiently thereby introducing severe blocking-artifacts. However, employing dependent-mode for atoms of all scales does not perform efficiently in parallel due to the high amount of synchronization especially for the small scale (Scale(s), s € {0,1} ) atoms. It turns out that regarding both, rate-distortion performance and parallel efficiency, it is best to encode scale 4,3, and 2 atoms in dependent-mode, while computing the two remaining update scales 0 and 1 (describing the details) in independent-mode (denoted as best-dependent-mode). Note, that even though the rate-distortion performance is equal to the sequential case the bitstreams are not identical. Fig.
335
4.a displays the speedup for this type of parallelization employing an MPI implementation and compares different architectures. The high performance of the Cray's interconnection network is the obvious reason for its superior performance.
(a) Coarse grained MPI on different architectures,
(b) The Power C algorithm cornpared with both MPI granularity levels.
Figure 4. Speedup for parallel M P P image coding: Lena 256 2 .
3.2
Fine-grained Parallelization: Dictionary
Partitioning
Each processor Pj,i = 1, ...,n loads the entire input image into his local memory space and performs encoding like the sequential Korn coder with the following important difference: within each logical image block, the set of admissible basis functions is partitioned into n subsets at each scale. To evaluate one match, each processor then computes the best match for its subset of dictionary functions and transfers the information about its choice to a host process. Subsequently, the overall best match is identified and broadcasted to all processors. Finally, all processors update their image data and proceed to the next logical block. It is obvious that this type of parallelization reproduces exactly the bitstream of the sequential Korn coder. However, the performance for this type of parallelization is restricted (see MPI-fine-grained in fig. 4.b). It seems that this parallelization is too fine-grained at least for the message passing MIMD approach - synchronization and communication demand is too high to result in a scalable algorithm.
336
3.3
Programming Paradigms
So far, we have considered message passing based MPI algorithms only. This programming paradigm requires an explicit programming of each communication event occurring among processors and consequently program development is very time demanding. However, message passing programs written in e.g. MPI or PVM may be used without changes on different architectures (no matter if multiprocessors or multicomputers). Additionally, we use PowerC 11 to exploit the shared memory on the Power Challange, which uses compiler directives like #pragma for generating parallel threads and is restricted to SGI systems. Employing PowerC, a sequential algorithm may be transformed easily into a parallel one by simply identifying areas which are suitable to be run in parallel i.e. in which no data dependencies exist. Subsequently, only local and shared variables need to be declared and parallel compiler directives are inserted. As a consequence, program development can be performed very quickly. When using PowerC, we again face two levels of granularity: image level (coarse-grained) and dictionary level (fine-grained) parallelism. As expected, the more lightweight PowerC approach produces better results for the finegrained case as compared to message passing since synchronization and communication may be implemented much more efficient. However, performance is on an acceptable level only for large scale basis functions. On the other hand, coarse-grained parallelization may be implemented easily only similar to the MPI independent-mode with the same rate-distotion decrease for large scale basis functions. As a consequence, the following mix of fine-grained and coarse-grained work distribution turns out to produce the best speedup values: • Encode the bigger scaled atoms (scales 4,3, and 2) in a fine-grained fashion • Encode the remaining two update scales 0 and 1 in a coarse-grained fashion thereby ignoring data dependencies (where only few exist) Speedup values for this approach are shown in figure 4.b. We can see, that this type of parallelization is nearly able to reach the performance of the coarsegrained MPI implementation and clearly outperforms the MPI fine-grained one. Note, that in contrast to the coarse-grained MPI algorithm the resulting bitstream is very similar to the sequential case.
337
4
Conclusions
Parallel MPP image coding may be performed with different granularity: In the coarse-grained case the image is distributed and we result in a significant performance gain. However, the bitstream is different from sequential execution although the rate-distortion performance is equal. In the fine-grained case we distribute the computations at the dictioanry level, but performance gains are lower and no scalability is achieved in the message passing case. A thread based parallelization using PowerC is able to deliver acceptable performance gains with minimal implementation effort by mixing the granularity according to basis function support and by ignoring possible data dependancies for small scale functions. However, for all algorithms considered scalability is limited. Future work will focus on resolving this restriction to further decrease runtime. References 1. F. Bergeaud and S.G. Mallat. Matching pursuit of images. In H.H. Szu, editor, Wavelet Applications II, volume 2491 of SPIE Proceedings, pages 2-13. SPIE, 1995. 2. S.R. Safavian, H.R. Rabiee, and M. Fardanesh. Projection pursuit image compression with variable block size segmentation. IEEE Signal Processing Letters, 4(5):117-120, 1997. 3. K. Osama, O. Al-Shaykh, E. Miloslavsky, T. Nomura, R. Neff, and A. Zhakhor. Video compression using matching pursuits. IEEE Transactions on Circuits and Systems for Video Technology, 9(1):123-143, 1999. 4. R. Neff and A. Zakhor. Modulus quantization for matching-pursuit video coding. IEEE Transactions on Circuits and Systems for Video Technology, 10(6):895-912, 2000. 5. P. Czerepinski, C. Davies, N. Canagarajah, and D. Bull. Matching pursuits video coding: Dictionaries and fast implementation. IEEE Transactions on Circuits and Systems for Video Technology, 10(7):1103-1114, 2000. 6. H.G. Feichtinger, A. Turk, and T. Strohmer. Hierarchical parallel matching pursuit. In T.J. Schulz and D.L. Snyder, editors, Image Reconstruction and Restoration, volume 2302 of SPIE Proceedings, pages 222-232, 1994. 7. G. Dodero, V. Gianuzzi, M. Moscati, and M. Corvi. Scalable parallel algorithm for matching pursuit signal decomposition. In P. Sloot, M. Bubak, A. Hoekstra, and B. Hertzberger, editors, High Performance Computing and Networking, Proceedings of HPCN'98, volume 1593 of Lecture Notes on Computer Science, pages 458-466. Springer-Verlag, 1998. 8. S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Trans, on Signal Process., 12(41):3397-3415, 1993. 9. P. Schneider. Matching Pursuit image and video compression - speedup techniques. Master's thesis, University of Salzburg, October 1999. 10. A. Pece and N. Petkov. Fast atomic decomposition by the inhibition method. In Proceedings of 15th International Conference on Pattern Recognition (ICPR), Vol.3 (IEEE Press) pp.215-218, 2000. 11. B. E. Bauer. Practical Parallel Programming. Academic Press, NY, USA, 1992.
PARALLEL Q U A S I - N E W T O N OPTIMIZATION ON D I S T R I B U T E D MEMORY MULTIPROCESSORS I. PARDINES Dept. Arquitectura de Computadores y Automdtica, Univ. Complutense, 28040 Madrid, Spain E-mail: [email protected] F. F. RIVERA Dept. of Electronics and Computer Science, Univ. Santiago de Compostela, 15782 Santiago de Compostela, Spain E-mail: [email protected] A parallel implementation of a quasi-Newton algorithm to solve nonlinear optimization problems with linear constraints on a distributed memory multiprocessor is presented in this work. We propose parallel algorithms for the stages which take most of the computational time in the process. Results that show the high efficiency of our proposals are presented.
1
Introduction
In this paper, we focus on nonlinear problems subject to linear constraints, that can be expressed as: minimize subject
F(x) = f(xN)+cTx, to Ax — b and I <x
,....
where x G 3tn, xN G !Rm represents the nonlinear variables, and I and u are boundary vectors. Quasi-Newton methods are used to deal with the nonlinearities of the problem 5 . The optimization process usually takes large computational times, so parallel implementation of quasi-Newton methods have been broadly studied 7 . In many of these proposals, each processor is allowed to perform part of the computations of the next iterations, evaluating speculative gradients and Hessians, and decomposing the iterative equations according to an orthonormal basis. 2
Quasi-Newton methods
Quasi-Newton methods are based on the idea that an approximation to the curvature of a nonlinear function can be computed without explicitly forming
338
339
the Hessian matrix. Therefore, they use an approximation to the Hessian matrix instead of the Hessian itself. Let Xk be the starting point at fc-th iteration. In the first step, a direction of search dk = —H^ 9k is computed, where Hk is the approximation of the Hessian and gk is the gradient. The solution will be moved in the next iteration through dk- After that, an optimal stepsize, A*, along the search direction from Xk is found. Therefore, the next iteration point is given by Xk+i = Xk + ^kdk- Finally, the approximated Hessian is updated by the correction matrix Dk, Hk+i = Hk + Dk, where Dk depends on xk+\ - xk and c^+i - gk. There are many possible choices to select these approximated matrices. One of the most efficient updates is the BFGS proposed by Fletcher 2 . From now on, we will refer to a quasi-Newton algorithm that uses this update method, and our parallel proposals are focused on it. We have done a detailed profiling of this algorithm when it is used to solve many nonlinear problems with linear constraints such as the HUESMOD problem from the CUTE library x . A summary of this study is shown in table 1. Note that the BFGS method spends a large amount of time updating the Hessian approximation. In the following subsections, the three stages of the method, for which we propose parallel implementations, are described. Table 1. Percentage of computational time spent in each stage of the quasi-Newton method Problem name Number of variables Percentage Quasi-Newton update of time for Search direction the different Stepsize stages Others
2.1
HUES-MOD 1430 82.6 9.5 1.7 6.2
Quasi-Newton update
The BFGS approximation of the Hessian matrix is based on the factorization Hk — RTR, where R is a nxn upper triangular matrix. At every iteration of the optimization process, R suffers rank-one modifications (R+vwT), thereby losing its triangular pattern. To restore it, it is necessary to apply two sweeps of plane rotations 3 . The first sweep Si, called backward sweep, is intended to cause limited damage to the triangular structure of R. Si is defined from n — 1 rotations in the (n,n - 1), (n,n — 2), ... and (n, 1) planes to reduce v to a multiple of
340
e„, the n-th unit vector. When Si is applied to R, S\R becomes a triangular matrix with the last row entirely dense. The rotation angle a, corresponding to the plane rotation (n,n — i) is computed from the n — i and n elements of vector v. As the components 1 to n — 1 of v are destroyed, this vector is overwritten with the last row of R' = Si(R + vwT). To complete the update, the upper-triangular pattern of matrix R must be restored. Then, a second sweep S2, called forward sweep, of n — 1 rotations in the (l,n), (2,n), ..., (n — l,n) planes is applied to annihilate elements 1 through n — 1 of the last row of R'. To compute the rotation angle a\ of a plane rotation (i,n) we use the (i,i) and the (n,i) elements of the matrix R'. So, for each rotation, the updated values of the elements of the n-th row that are modified in previous rotations are needed. There are dependences in the computation of the rotation angles due to the update of the n-th row in each rotation as is shown in figure 1. The only source of parallelism is associated with the computation of the first n — 1 rows of R'. Jcompuc r3'. r5'
C
,v,lCompule q | M — »J |u "pPddi -c' vv,3 -Yi"
*q
Compute r>' , Tj . rJ
Cbrtfiut
3 Compute V, , \M
Compute V,
V, Vj V,
Compute f. .T^ ,Tj
7 ^
Computi
=0, Update v^ , v 2 | LTv2=0. Compute V, I 1
J Compute f i , r=
Figure 1. Dependence and data flow graph for the two sweeps of plane rotations
2.2
Computation of the search direction
At every iteration of the BFGS method, to compute a search direction, the following system must be solved: RTRdk = -gk
(2)
This linear system implies two triangular systems: a lower triangular system RTq = -gk, and an upper triangular system Rdk = q-
341
The first system can be solved using a forward substitution scheme 3 : t-i
=
(~(9k)i
l,2,---,n
(3)
j=o
This method presents data dependences since each qi can only be computed after all the qj, j < i have been computed, and the corresponding column of RT has been updated (figure 2). Note that this graph is similar to the one for the forward sweep. The dependences are similar though more relaxed because the updates of the RT elements (rfjqj) are independent, and they only influence the next element of the solution. Moreover, due to the existing analogies in the data flow of the forward sweep and the method for solving the lower triangular system, we have used the same strategy for their parallelization. The solution of an upper triangular system is similar to the lower triangular one.
I — | Updae r ; | _>] Updat r ; " ] ^ Compute q
Figure 2. Dependence and data flow graph for the solution of a lower triangular system
2.3
Computation of the stepsize
The objective of this stage is to find the stepsize A* along the search direction from the current point Xk- It is necessary to solve the following onedimensional optimization problem: f(xk + Xkdk) = min f(xk + 6dk), with 0 < 9 < Xn
(4)
For each Xk, the function and sometimes the gradient at the new trial point Xk +Aftdfc must be evaluated. When a procedure to calculate the gradient is not available, an approximation of the gradient by finite differences can be used Vf(xk)i ^ (gk)i =
f(xk + het) - f(xk) h
(5)
where ft is a small quantity. If m represents the number of nonlinear variables, this approximation requires m parallel evaluations of / in addition to f(xk)-
342
3
Parallel implementations
In this section two parallel strategies are introduced. The first one is applied to the sweeps of plane rotations and the solution of triangular systems. The second one is proposed to perform the gradient computations in parallel. 3.1
Parallel sweeps and triangular solvers
Our objective is to develop an efficient parallel algorithm on a distributed memory multiprocessor. Suppose that the columns of R are distributed in a cyclic scheme among the P processors of the system. In the forward sweep, at every plane rotation the n-th row, which is used to compute the next rotation angle, is modified. Therefore, the processor which applies rotation (i,n) has to send a message, with the updated values of the n-th row, to the one which computes rotation (i + l , n ) . This strategy would imply a great number of messages, so some sequential computation is introduced to avoid it: the rotation angles are computed by every processor. After that, each processor can update in parallel the elements of the columns which have been assigned to it. In the same way, the factors rfjqj, in the triangular system, can be computed in parallel if all the processors compute qj. Based on this idea, our parallel strategy consists of dividing the matrices into triangular and rectangular blocks of size s = aP (a 6 IN). This size is chosen to assure that each processor is working in parallel over one or more columns of the matrix. The computations associated with the triangles are computed by every processor. They include the computation of the angles and the updates of some elements of the matrix in the sweeps of plane rotations; I.»iiimuii -»-nopal
Figure 3. (a) Sequential FLOPS versus Number of Messages in the parallel sweeps algorithm for the HUES-MOD problem, (b) Distribution Scheme of matrix R.
343 and the calculation of the corresponding elements of the solution as well as some products rfjqj for the triangular systems. The operations assigned to the rectangles are performed in parallel. These computations consist of the updates of the columns that have been assigned to each processor in the case of the sweeps, and the computation of the remaining factors in the case of triangular systems. Note that [n/aP\ triangular blocks are defined, and an additional block of size n%{aP) if it is necessary. The algorithm is as follows: the P processors carry out the computations associated with a triangle, and then, they can update in parallel the s columns above it. After that, it is necessary to collect the data that have been modified previously on each processor before the computations associated with the next triangle are performed. Therefore, the amount of sequential computation and the number of messages depends on s (figure 3.1) as well as the size of the messages whose length is equal to s. Our parallel algorithm implies a certain distribution of data. The matrix is divided into two vectors (figure 3.1): the one which has the entries associated with the triangles is copied in every processor, and the other, which is exclusive for each processor, holds the entries of the parallel portion. This consists of the segments of columns distributed on a snake scheme 6 . Using this distribution we obtain a good load balance, better than the standard block and cyclic distributions. 3.2
Parallel gradient evaluations
As we described in section 2.3, using finite differences to calculate the gradient of the objective function in every trial point implies evaluating the function m +1 times. This can be carried out in parallel. Therefore, the computations of the gradient elements are distributed in a wrap-around fashion among the processors. The efficiency of this parallel implementation is limited by the number of nonlinear variables. 4
Results
The parallel algorithm has been implemented on the FYijitsu AP3000 distributed memory multiprocessor, using the MPI 4 message passing library. The problems HUES-MOD (n=1430) and DTOC1L (n=1998) from the CUTE library were used to present these results that are similar in other examples. We show the speedup defined as sp = Ti/Tp, where 7\ is the execution time of the sequential code, and Tp is the parallel execution time using P processors.
344
The results for the three stages of the quasi-Newton algorithm are shown in figures 4 and 5. The speedup increases with the size of the triangular blocks until s = 6 P , where the amount of sequential computation becomes very important. Note that in figures 4(a), 5(a) the speedup is higher than in figures 4(b), 5(b). The reason for this is the greater number of floating point operations in the sweeps of plane rotations than in triangular systems. By increasing the size of the problem an improvement in the scalability is achieved, because of the decrease in the ratio of sequential computation. Since the number of processors is lower than the number of nonlinear variables, we can exploit all the existing parallelism in the computation of the stepsize. Moreover, communications are not necessary in parallel gradient evaluations, so the results of speedup are very close to the optimum (figures 4(c), 5(c)). 5
Conclusions
In this work the BFGS quasi-Newton method is revised, and its parallel implementation is presented. The different stages of the method were analyzed and their computational costs were measured. The similarities between the dependence graphs for the sweeps of plane (c): Stepsize: Parallel gradient evaluations
Figure 4. Speedup for the three main stages applied to the HUES-MOD problem (c): Stepsize: Parallel gradient evaluations
(a) Quasi-Newton upd te: Parallel sweeps
0
p
g
e
=
9,0
IOP-1
'
P=2 P=4
"i'n
E
-rl
7,0
-i-|H 1,0
l f=f
Figure 5. Speedup for the three main stages applied to the DTOC1L problem
345
rotations and the solution of triangular systems are detected. So, we propose the same parallel strategy for both algorithms. We develop a strategy that combines parallel and sequential computation. Moreover, gradient computations are carried out in parallel, distributing the function evaluations in a wrap-around fashion among the processor. As the problem has a large number of nonlinear variables, the obtained speedup is nearly optimal. Acknowledgments This work was supported by the Xunta de Galicia under project 99PXI20602B. References 1. I. Bongartz, A. R. Conn, N. Gould and P. L. Toint, CUTE: Constrained and unconstrained testing environment, ACM Transactions on Mathematical Software, 21 (1),, 1995, pp. 123-160. 2. R. Fletcher, A new approach to variable metric algorithm, Computer Journal, 13, 1970, pp. 392-399. 3. P. E. Gill, W. Murray and M. H. Wright, Numerical Linear Algebra and Optimization, Volume 1, Addison-Wesley Publishing Company, Redwood City, California,1991. 4. William Gropp, Edwing Lusk and Anthony Skjellum, Using MPI Portable Parallel Programming with the Message-Passing Interface, Scientific and Engineering Computation Series, 1994. 5. B. A. Murtagh and M. A. Saunders, Large-Scale Linearly Constrained Optimization, Mathematical Programming, 14, 1978, pp. 41-72. 6. I. Pardines, J. J. Pombo and F. F. Rivera, Parallel algorithm for backward and forward sweeps of plane rotations, Proceedings of the International Conference on Applied Informatics 2001. 7. C. H. Still, The parallel BFGS method for unconstrained optimization, Proceedings of the Sixth Distributed Memory Computing Conference, 1991, pp. 347-354.
A PARALLEL CONDENSATION-BASED M E T H O D FOR THE STRUCTURAL DYNAMIC REANALYSIS PROBLEM
A N D R E A S P E L Z E R AND HEINRICH VOSS Technical
University Hamburg,
of Hamburg - Harburg, Section of Mathematics, D Germany, E-mail: {pelzer, voss} @tu-harburg.de
21071
In the dynamic analysis of a structure quite often a small number of the smallest eigenvalues and eigenmodes of a large and sparse general eigenvalue problem have to be determined assuming that good approximations to the demanded eigenmodes are available from previous computations. In this situation powerful approaches like Lanczos method or Jacobi-Davidson may be inferior to methods which in general are known t o be slower. In this note we demonstrate this effect comparing P-ARPACK, the parallel version of ARPACK, with a parallel condensation method with general masters for a finite element model of a container ship.
1
Introduction
The response of a structure to dynamic excitations depends, to a large extent, on the natural frequencies of the structure. Excessive vibration occurs when the frequency of the excitation is close to one of the natural frequencies of the structure. Hence, in a design process one has to modify a structure several times to shift the natural frequencies of the structure out of the range of excitation frequencies. Typically, in each design step the structure is perturbed only by a small amount, and the modal shapes are not altered very much whereas the natural frequencies can change significantly. Therefore, in the design process reasonable approximations of the desired mode shapes are available from previous computations. A similar situation occurs in the dynamic analysis of a structure where the dynamic behaviour has to be determined for different loadings. Powerful approaches like Lanczos or Jacobi-Davidson methods for sparse general eigenvalue problems Kx = \Mx
(1)
are not able to exploit the knowledge of good initial approximations to the set of desired eigenvectors, but they have to solve each eigenvalue problem from scratch. Hence, with a good approximation already available methods like simultaneous inverse iteration 7 or condensation methods with general masters 4 which are usually inferior to the powerful approaches mentioned above could be faster. In particular this could be true in the dynamic analysis of structures since the eigenvalue problem under consideration is the discrete version
346
347
of a system with an infinite number of degrees of freedom which itself is only a model of the real structure. Hence, it does not make sense to compute the eigensolutions with higher accuracy than the modelling error and the discretization error, and therefore the accuracy requirements for approximations to the eigenvalues and eigenvectors are not very high. Moreover, in a parallel environment a small number of Rayleigh quotient iterations or condensation methods need only little communication whereas in the Lanczos process in every iteration step communication is necessary to perform 2 scalar products, 2 _axpys and to compute K~lMx for some vector x, and similar considerations hold for Jacobi-Davidson's method. In this note we report on numerical experiments computing approximations to some eigenvalues and corresponding eigenvectors of a finite element model of a container ship as they are needed in the analysis of the dynamic response. We considered a parallel implementation of condensation with general masters on two environments: a heterogeneous HP workstation cluster and an HP N-Class parallel computer. We found that it well compares to the Lanczos method implemented in the established package P-ARPACK 6 if only approximations of low accuracy of a few eigenvalues at the lower end of the spectrum are needed and if reasonable approximations to eigenvectors are available. Our paper is organized as follows. In Section 2 we briefly sketch condensation with general masters, Section 3 summarizes our parallelization concept using substructuring, and Section 4 describes details of the finite element model under consideration, the accuracy requirements in the response analysis, and the numerical results. 2
Condensation with general masters
We consider the general eigenvalue problem (1) where K G ]RA"'n' and M e IR.' n ' n ' are symmetric and positive definite which are usually the stiffness and the mass matrix of a finite element model of a structure. To deal with the large number of degrees of freedom static condensation is frequently employed to economize the computation of a selected group of eigenvalues and eigenvectors. To this end the degrees of freedom are decomposed into masters and slaves. After reordering the equations and unknowns problem (1) can be rewritten as V Ksm
KS3 I \xs
I ~
I Msm
Ms
(2)
Neglecting the inertia terms in the second equation, solving for xa, and sub-
348
stituting xs into the first equation one obtains the condensed problem K0xm
= \M0xm
(3)
where Ko M 0 := Mmm - KmsK^Msm
:=
Kmm — KmaKsa
Ksm,
(4)
- MmsK^Ksm
+ KmsKJ,MB8K^Kam.
(5)
Nodal condensation has the disadvantage that it produces accurate results only for a small part of the lower end of the spectrum. The approximation properties can be improved substantially if general masters 4 are considered. Let z i , . . . , zm G HI™ be independent vectors, and define Z := (zi,..., zm) G ffi,(n,m). Then the projected eigenvalue problem K0Xm := PTKPxm
= XPTMPxm
=: XM0xm,
(6)
where P = K-1Z(ZTK-1Z)-1ZTZ
(7)
is called condensed eigenvalue problem with general masters zi,...,zm. It is easily seen that this is exactly the reduced problem of nodal condensation if we choose z\,..., zm as unit vectors corresponding to the master degrees of freedom. Since (ZTK~lZ)~lZTZ is a nonsingular matrix the condensed problem (6) is equivalent to the projection of problem (1) to the space spanned by the columns of K~lZ. Hence, condensation is nothing else but one step of simultaneous inverse iteration with initial guess X = M~lZ £ ]R ( "' m) . Therefore, we can expect good approximation properties of condensation if we include general masters Zj = MXJ into the condensation process where Xj are approximate eigenvectors of problem (1) corresponding to the desired eigenvalues. Hence, choosing approximate eigenvectors from previous design steps as general masters should yield reasonable eigenvalue approximations. In the next section we combine nodal condensation with substructuring to obtain a coarse grained parallelization. To generalize this concept to condensation in the presence of general masters the following result is of great convenience. Theorem 1 Assume that ZTZ = I. Then the projection matrix Z in (7) can be determined from the linear system
349
Moreover, the condensed stiffness matrix is given by PTKP
3
= S.
(9)
Parallel Condensation
For nodal condensation the following strategy yields a coarse grained parallel algorithm 8 based on the master-worker paradigm. Suppose that the structure under consideration has been decomposed into r substructures and let the masters be chosen as interface degrees of freedom. Assume that the substructures connect to each other through the master variables only. If the slave variables are numbered appropriately, then the stiffness matrix is given by
K =
' K-mm "-mal Ksml Kssl Ksm2 0
\**-smr
^ms2 0 K3S2
"
t)
•
"•msr
'
0 0
•
(10)
••• **-ssr ft-ssr J
and the mass matrix M has the same block form. It is easily seen that in this case the reduced matrices in (3) are given by r KQ
L
mmj = Kmm — 2_, ^n
:
— J^mm
i=l
/ „ j=l
KmsjKssjKsmj
(11)
and M 0 = M„
Mmmj
'•— K-msjK-ssjMsmj
+ MmsjKS3jK3mj
(12)
KmsjKgSjMssjKssjKsmj.
(13) Hence they can be computed completely in parallel, and the only communication that is needed is one fan-in process to determine the reduced matrices K0 and M 0 . If general masters added to the interface masters then according to Theorem 1 the block structure of K in (10) has to be augmented by columns and rows containing the general masters. If the support of each of the general masters is contained in exactly one substructure then the linear
350
Figure 1. Container ship
system (8) can be solved substructurewise 5 , and the reduced matrices K0 and Mo again are obtained by a fan-in process. For general masters having global support as they appear in reanalysis problems we developed a coarse grained parallelization concept which is discussed in detail in a report 3 . In this algorithm the communication consists of two fan-in processes and one broadcast to compute the reduced matrices KQ and M 0 . We implemented the algorithm in Fortran90 using LAPACK 3 and BLAS routines for the linear algebra and MPI 1.05 for message passing. We tested the program on a heterogeneous workstation cluster consisting of (subsets of) one HP J5000, one HP J2240 (each with a double processor), one HP C3000 and five HP 9000, 712/100 connected by fast ethernet, and on an HP N-Class parallel computer with 8 PA 8500/440 Mhz processors organized as one cluster. 4
Results and Discussion
To test the performance of the parallel method mentioned in Section 3 we considered the vibrational analysis of a container ship which is shown in Figure 1. Usually in the dynamic analysis of a structure one is interested in the response of the structure at particular points to harmonic excitations of typical forcing frequencies. For instance in the analysis of a ship these are locations
351
in the deckshouse where the perception of the crew is particularly strong. The finite element model of the ship (a complicated 3 dimensional structure) is not determined by a tool like ANSYS or NASTRAN since this would result in a much too large model. Since in-plane displacements of the ship's surface do not influence the displacements in the deckshouse very much it suffices to discretize the surface by linear membrane shell elements with additional bar elements to correct warping, and to model only the main engine and the propeller as three dimensional structures. For the ship under consideration this yields a very coarse model with 19106 elements and 12273 nodes resulting in a discretization with 35262 degrees of freedom. We consider the structural deformation caused by an harmonic excitation at a frequency of 4 Hz which is a typical forcing frequency stemming from the engine and the propeller. Since the deformation is small the assumptions of the linear theory apply, and the structural response can be determined by the mode superposition method taking into account eigenfrequencies in the range between 0 and 7.5 Hz (which corresponds to the 50 smallest eigenvalues for the ship under consideration). The dynamic behaviour of the ship has to be simulated for different service conditions, i.e. for different velocities and different cargo distributions, and each of these positions yields a specific position of the ship in the water. When computing the vibrations of a ship embeded in the water the influence of the surrounding fluid on the structure is accounted for in the form of the so-called hydrodynamic masses which have to be added to the masses in the nodal degrees of freedom on the wet surface2. Physically, these masses represent the amount of fluid that is accelerated by the vibrating solid. To summarize, the dynamic analysis of a ship necessitates to solve a couple of sparse generalized eigenvalue problems the mass matrices of which are small modifications of each other. The finite element model is very coarse and therefore the accuracy requirements are very modest. An error of 10 % for the natural frequencies often suffices1. In our experiments we assume that the eigenmodes of the dry ship (i.e. without hydrodynamic masses) corresponding to eigenfrequencies which are less than 7.5 Hz have been computed in a previous calculation and are known. To determine the modes of a specific wet ship we subdivide the model into 10 substructures (cf. Figure 2 where we attached the number of degrees of freedom to the substructures). Choosing all interface degrees of freedom as masters we obtain a reduced problem of dimension m = 2097 and the slave subproblems are of dimensions between 1134 and 4792. Although the eigenfrequencies of these two models differ quite a bit (the relative differences of the natural frequencies lie between 10 % and 50 %) the
352
Figure 2. Substructuring of a container ship
approximation properties of the condensation method are enhanced considerably if we add 50 dry modes as general masters to the interface masters when solving the wet model. For nodal condensation only the 15 smallest eigenfrequencies are obtained with a relative error less than 10 % whereas with 50 dry modes as additional general masters the relative error of the 50 smallest eigenfrequencies of the wet model is less than 9.83 %. Next we dropped the interface masters, i.e. we considered only 50 dry modes as general masters in the condensation method. The accuracy decreased only slightly to a maximal relative error of 9.89 % of the eigenfrequency approximations. We compared the performance of these condensation methods to that of P-A.RPACK where we provided vectors K~lMx in the reverse communication interface taking advantage of the substructuring above and the Schur complement and where we tuned the parameters such that we obtained approximations to the 50 smallest natural frequencies with a maximal relative error of about 10 % (actually 12.5 % were arrived). The following table contains the runtimes needed for these 3 methods on the HP N-Class parallel computer and on a network of one HP J5000, one HP J2240, one HP C3000 and 3 HP 9000, 712/100. int.face+glob. mast, glob, mast. P-ARPACK HP N-Class 158 90 249 Workstation cluster 925 273 424 Notice that the user can only define processes, and assign these processes
353
to workstations but not to processors. Hence, with a heterogeneous cluster with workstations of different computing speed and different numbers of processors it is not easy to obtain a good load balancing. This is in particular the case since the substructuring can not be changed easily without increasing drastically the number of interface masters and the dimension of the Schur complement, respectively. For the N-Class parallel computer the user defines processes and the local scheduling is organized by the operating system. Acknowledgements Thanks are due to Christian Cabos, Germanischer Lloyd, who provided us with the finite element model of the container ship. The first author gratefully acknowledges financial support of this project by the German Foundation of Research (DFG) within the Graduiertenkolleg "Meerestechnische Konstruktionen". References 1. C. Cabos, Private communication 2001 2. C. Cabos and F. Ihlenburg, Vibrational analysis of ships with coupled finite and boundary elements. Report, Germanischer Lloyd 1999. http://www.germanlloyd.org/mba/research/fi/damp/paper.pdf 3. B. Hofferek, A. Pelzer and H. Voss, Global masters in parallel condensation of eigenvalue problems. Report 31, Arbeitsbereich Mathematik, TU Hamburg-Harburg 1999. http://www.tuharburg.de/mat/SCHRIFTEN/Berichte.html 4. W. Mackens and H. Voss, Nonnodal condensation of eigenvalue problems, ZAMM 79, 243 - 255 (1999) 5. W. Mackens and H. Voss, General masters in parallel condensation of eigenvalue problems, Parallel Computing 25, 893 - 903 (1999) 6. K.J. Maschhoff and D.C. Sorensen, P.ARPACK: An efficient portable large scale package for distributed memory parallel architectures. In J. Wasniewski, J. Dongarra, K. Madsen and D. Olesen (eds.), Applied Parallel Computing in Industrial Problems and Optimization, Volume 1184 of Lecture Notes in Computer Science, Springer Verlag, Berlin 1996 7. B.N. Parlett, The Symmetric Eigenvalue Problem, Classics in Applied Mathematics, SIAM, Philadelphia 1998 8. K. Rothe and H. Voss, A fully parallel condensation method for generalized eigenvalue problems on distributed memory computers. Parallel Computing 21, 907 - 921 (1995)
C O M B I N E D M P I / O P E N M P IMPLEMENTATIONS FOR A STOCHASTIC P R O G R A M M I N G SOLVER D. ROTIROTI, C. TRIKI AND L. GRANDINETTI Department
University of Calabria of Electronics, Informatics and 87030 Rende (CS) - ITALY E-mail: [email protected]
Systems
Stochastic Linear Programming (SLP) is an effective tool to deal with problems for which some of the input data are uncertain. These problems are typically characterized by a very high number of variables and constraints and the use of conventional computational resources is usually inappropriate for their solution. Parallel systems are required in order to achieve high level of efficiency in reasonable time. In this paper we propose several MPI and OpenMP implementations on the basis of which we develop a two-level parallel solver that takes advantage from the features offered by both the standards. Experimental results show that the two-level solver could be faster than any other basic algorithm.
1
Introduction
In several real world applications there is the need for solving problems with input data which are not known with certainty; these data are usually modelled as random variables, with some probability distribution. Stochastic Linear Programming (SLP) is an effective tool to deal with this kind of problems giving the optimal solution across the different events that could be observed. We focus on two-stage SLP problems, with a finite and discrete distribution of the random variables. The realization of the random variables consists in the occurrence of one of the N possible events, known as scenarios. In this formulation, decision variables are divided into two groupes: anticipative and adaptive (or first- and second-stage) variables. The anticipative decisions, denoted with x, are taken before knowing the random values, while the adaptive ones are determined after the realization of the random event. For each scenario I = 1,...,iV a corresponding second-stage vector yi is calculated containing the relative decisions. The two-stage SLP problem with N scenarios can be modelled in the following general form (for detailed description of the model the reader can refer, for example to 1 ) :
354
355 N
min cl +
^picjyl 1=1
s.t.
AQX
= bo
Ttx + Wiyi = hi x,Vi>0
l = l,...,N l = l,...,N.
(1)
As can be easily noted, the number of variables and constraints increases considerably as the number of scenarios N increases. For most of the real-world applications N is very big so the resulting problem can not be solved using conventional sequential systems. Moreover, for many nowadays problems a real-time solution still remains a challenge. The use of parallel machines is necessary to achieve high level of efficiency in reasonable time. Not many works have been published on the parallelization of SLP techniques since this field is still considered in its infancy. The main interest has been concentrated on parallel algorithms deriving from the implementation of primal-dual Path Following (PF) interior point methods. Within the PF algorithm it is easy to split the overall problem in order to handle N independent sub-blocks of the constraint matrix each block corresponding to one scenario. De Silva and Abramson have implemented the PF algorithm on a Fujitsu AP1000 with 128 distributed memory processors 2 . Jessup, Yang and Zenios have parallelized the technique of Birge-Qi3 factorization(BQ) on an Intel iPSC/860 hypercube 4 . Their effort was continued by the implementation done by Yang and Zenios on a Connection Machine CM-5e in order to develop a parallel PF solver5. All these implementations are based on the message passing paradigm. More recently, Beraldi, Grandinetti, Musmanno and Triki have proposed both (an efficient) PVM and (a poorly tuned) OpenMP implementations of the BQ factorization on an Origin2000 machine 1 . This paper will continue these efforts towards developing efficient messagepassing (MPI) and shared-memory (OpenMP) implementations of the BQ factorization. Moreover, we propose novel two-level parallel implementations in which we combine the more attractive features of the two standards in order to reduce computation time. It is worthwhile noting that the benifit of the two-level parallelism has been recently recognized in other applications like in CGWAVE6. But no tentatives have been published in this direction in the stochastic programming field. In this paper we will describe first the basic MPI and OpenMP implementations. The second section will be devoted to the two-level implementations
356
and the discussion of their results with special regard to the issues of scalability and portability. A brief summary will conclude the paper. 2
Basic Parallel Implementations
In this section various parallel implementations of the PF algorithm are presented. All the versions are based on the parallelization scheme of the BQ factorization as described in our previous paper 1 . In this scheme most of the computation tasks, corresponding to the independent scenarios, are carried out in parallel on the available processors. Some communication and synchronization points are necessary in order to form the scenarios-coupled matrices. (Interested readers are invited to refer to the above mentioned paper for details.) 2.1
MPI
Implementations
A first MPI version of the solver has been developed by using a static load balancing policy. The work is divided assigning a block of scenarios to each worker, so that the difference between the size of blocks is as small as possible. Each worker can easily recognize its position within the group and accordingly its slice of work without any communication. Every processor computes its part of scenarios, whereas the first-stage stuff is done redundantly by all the processors. This is more efficient than using a single processor to operate on first-stage computation and then broadcast the results to all the others. A correct implementation of the algorithm requires also a reduction in order to form the scenarios-coupled matrices by using the sum operator. Figure 1 shows the speedups of the solution of test problem scagr7 with 864 scenarios selected from the SLP library of Holmes 7 . Detailed wall-clock times of other test problems are shown in table 1. The failure in the solution of the scagr7.936 problem and the deterioration of performance in the case of scsd8.504 are due to computational difficulties caused by the increase of the problem size after the replication of the first-stage variables. Even though these results were satisfactory, we made the tentative of developing another version of an MPI implementation based on the masterworker model. In this scheme the workload is balanced dynamically through the introduction of an additional task (Boss) that starts by the distribution of the input data on the available workers. When a worker is idle it sends a request to the boss who will assign it the next free block to be computed. In order to maintain a high level of efficiency in this scheme particular attention
357
1 2
3
4
5
6
7
Figure 1. Speedups with MPI (left) and with fine grained OpenMP (right)
Table 1. Computation time (seconds), MPI version
Problem scagr7.864 scagr7.936 scsd8.432 scsd8.504 sctap 1.480
Variables 69180 74940 121170 141330 92304
Constraints 67407 73023 69130 80650 74910
1 cpu 9.64 10.3 22.60 25.40 29.34
2 cpu 4.84 5.16 11.53 13.00 14.92
4 cpu 2.49 2.65 6.03 6.93 7.60
8 cpu 1.26 3.94 45 4.07
should be addressed to the critical zone of synchronization. During the reduction each task caches the corresponding data handled during the first part of the iteration and then continues to operate on the same scenarios after the reduction. With this cache friendly execution communication overhead is kept as low as possible. The dynamic implementation ensures a good load balancing and achieves encouraging results specially on non dedicated and heterogeneous systems. However, using our parallel machine and running in a dedicated environment the static version performs with a slight advantage with respect to the dynamic implementation. 2.2
OpenMP
Implementations
Developing a shared-memory version by simply introducing the parallel directives of OpenMP in the sequential code seems to be an easy and fast task. However, optimising the code and using the incremental level of parallelizing
358 Table 2. Computation time (seconds) with the coarse grained OpenMP implementation
Problem scagr7.864 scagr7.936 scsd8.432 scsd8.504 sctapl.480
Variables 69180 74940 121170 141330 92304
Constraints 67407 73023 69130 80650 74910
1 cpu 9.60 10.18 22.73 25.36 29.33
2 cpu 5.07 5.43 12.33 13.73 15.08
4 cpu 2.53 2.69 6.44 7.11 7.79
8 cpu 1.44 1.48 6.23 6.94 4.19
features offered by OpenMP in order to ensure high performance necessitate a long and painful implementation effort. In a first version parallelization was introduced with respect to the secondstage components by splitting the workload at the inner loops level. At each update step of y in the iterative BQ procedure, a component-wise division of the vector is performed among the available processors. The distribution of input vectors is done statically among the threads in order to maximize the reuse of the data. Intuitively, this fine grained version is expected to allow better integration with the MPI layer in a combined implementation. The experimental results for the test problem scsd8 with 508 scenarios depicted in Figure 1 show that as the number of threads increases the performance deteriorates. The same trend is observed for the other test problems. This is mainly due to the high number of the synchronization points needed at the end of each parallel loop when data dependency between two consecutive loops is met. A possible way to overcome this deficiency is to insert OpenMP directives at the outer iteration level, i.e. scenario-wise rather than component-wise parallelism. This coarse grained version tracks the same parallel scheme as the one described in the MPI implementation and presents the same advantages, but at the same time suffers from the same bottleneck caused by the reduction operation. A collection of computational results of this version is shown in Table 2. The same results corresponding to the problem scagr7 with 936 scenarios are shown in Figure 2 as well. 3
Two-level parallelism implementations
Combining the MPI and OpenMP implementations have the objective of taking a full advantage from parallel systems based on a distributed shared mem-
359 Table 3. Execution time (seconds) with two-level implementations
ory architecture. Two versions have been developed: MPI with the fine grained OpenMP version from one side, and MPI with the coarse grained version from the other side. In the first version the MPI layer implements a scenario-wise parallelism, whereas the OpenMP directives are introduced at a component-wise level parallelism. This version of the solver has been run for each test problem by varying simultaneously the number of MPI tasks and OpenMP threads whereas the number of processors remains unaltered (i.e. 8 processors). As can be shown in Table 3, the execution time gets longer as the number of OpenMP threads increases. It is clear that this two-level version suffers from the same defects of the fine grained OpenMP implementation. More interesting results have been obtained by combining the MPI implementation with the coarse grained OpenMP version. In this case the whole number of scenarios is divided among the MPI tasks and each task splits its slice among the OpenMP threads. The most critical region, i.e. the reduction, is done in two different phases: an inter-threads reduction via OpenMP directives and then an MPI call among the tasks. Table 3 reports the solution time by using this second version of the twolevel implementation. These results show that the two-level version outperforms the pure OpenMP implementation for all the test problems. Moreover, for some test problems the solution time can be even lower than that of the MPI code. This is the case of some well structured problems like scsd8 with 432 scenarios (depicted in Figure 2) in which the best execution time is measured with two MPI tasks and four OpenMP threads each. Another remarkable advantage is that both the two-level versions are able to overcome the failure observed in the solution of the scagr7.936. This
360
XSpMi? i » MMJ \
IS]
Sr
• «-> •a-4
«; z( t,sj
OS' 5-
Figure 2. Speedups with coarse grained OpenMP (left) and execution time in seconds with the two-level parallelism (right)
advantage is expected to persist even in the solution of bigger problems on larger parallel systems. 3.1
Portability
In order to ensure a perfect portability of our implementations only standard features of the MPI and OpenMP libraries have been used. Particular attention was devoted in order to avoid the use of the extensions offered by the SGI implementation of the OpenMP on the Origin2000. These extensions, such as data placement directives, even though can improve the performance of the shared-memory versions have the disadvantage of limiting the portability of the code to Origin-like systems. 3.2
Scalability
As mentioned above, the two-level implementations are more scalable with respect to both the pure MPI and OpenMP codes. While the MPI version suffers from the explosion of the problem size, OpenMP implementations are very sensitive to the number of synchronization points. On the other hand, some interesting remarks on the scalability of each of the two-level versions can be drawn. More specifically, the second version (MPI-coarse grained OpenMP) not only performs better but also it is expected to scale better in the case of problems with high number of scenarios. On the contrary, the MPI-fine grained version is expected to scale better for problems
361
characterized by high number of second-stage variables and few scenarios. Indeed, while the second version can use not more than N computational units the first two-level parallelism code is able to use N • ri2 processors, where n-i is the size of each vector yi, I = 1,...,N. 4
Conclusions
In this paper we propose new implementations using two-level parallelism to solve SLP problems. Several versions have been developed as a result of the possible combinations of MPI static and dynamic implementations from one side and OpenMP fine and coarse grained parallelism from the other side. For the test problems we considered, and using the multiprocessor machine Origin2000, the best results have been obtained combining the static MPI version with the coarse grained OpenMP implementation. By using other systems, such as cluster of multiprocessor machines, different conclusions could be expected. This may be the subject of further investigations in this field. References 1. P. Beraldi, L. Grandinetti, R. Musmanno, and C. Triki. Parallel algorithms to solve stochastic linear programs with robustness constraints. Parallel Computing, 26:1889-1908, 2000. 2. A. De Silva and D. Abramson. Parallel algorithms for solving stochastic linear programs. In A. Zomaya, editor, Handbook of Parallel and Distributed Computing, pages 1097-1115. McGraw Hill, 1996. 3. J. R. Birge and L. Qi. Computing block-angular Karmarkar projections with applications to stochastic programming. Management Science, 34(12), 1990. 4. E. R. Jessup, D. Yang, and S. A. Zenios. Parallel factorization of structured matrices arising in stochastic programming. SIAM Journal on Optimization, 4:833-846, 1994. 5. D. Yang and S. A. Zenios. A scalable parallel interior point algorithm for stochastic linear programming and robust optimization. Computational Optimization and Applications, 7:143-158, 1997. 6. S. W. Bova e altri. Dual level parallel analysis of harbor wave response using mpi and openmp. The International Journal of High Performance Computing, 14(l):384-392, 2000. 7. D. Holmes. A collection of stochastic programming problems. Technical Report 94-11, Department of Industrial and Operations Engineering, University of Michigan, 1994.
THE SAME PDE CODE ON M A N Y DIFFERENT COMPUTERS
PARALLEL
WILLI SCHONAUER, T O R S T E N ADOLPH AND H A R T M U T H A F N E R Rechenzentrum der Universitat Karlsruhe D-76128 Karlsruhe, Germany E-mail: [email protected] FDEM (Finite Difference Element Method) is a black-box solver for nonlinear systems of elliptic and parabolic PDEs. This very complex code is run on many different parallel computers. The number of processors is selected so that nearly equal theoretical peak performance results. The discussion reveals interesting properties of the code and of the computers.
1
Introduction
The FDEM (Finite Difference Element Method) program package (for details see next section) is a black-box solver for the solution of nonlinear systems of elliptic and parabolic PDEs by a FDM on an unstructured FEM grid. The resulting large sparse linear systems of equations are solved iteratively by the integrated LINSOL program package. These software packages have been efficiently parallelized with appropriate data structures [1]. As this software must run on all types of (commercially) available parallel computers, we computed in summer 2000 the same examples on different models of the IBM RS/6000 SP, on the Cray T3E1200 and on the Hitachi SR/8000F1. We reported at the First SIAM Conference on Computational Science and Engineering, Sept.2000 in Washington, DC on these experiments (there are no proceedings). We got valuable hints how to tune the code above all on the Hitachi. In March 2001 we compared our code in a similar way on the IBM SP WinterHawk-2 and NightHawk-2, on the Hitachi SR8000-F1 and Fujitsu VPP5000, see [2]. In this paper we extend these investigations to all types of parallel computers that were attainable for us until June 2001. 2
The FDEM
Here we give a short overview of the FDEM [3,4,5], as far as it is necessary to understand the behavior of the code. We use a FDM of arbitrary consistency order q on an unstructured FEM grid that has been generated by a (commercial) mesh generator. For practical reasons we use only the orders g = 2,4, 6. We use a polynomial approach of order q in 2-D or 3-D to gener-
362
363
Figure 1. Triangular mesh with rings of nodes around the center node of the formula.
ate difference formulae of order q by means of " influence pplynomials" which have the value 1 in one node and zero in the other nodes of the formula. For the determination of the coefficients of the influence polynomials we search for a set of neighboring nodes in rings of nodes around the center node for which we compute the coefficients, see F i g . l . W i t h the formula of order q + 2 we generate an error formula for an estimate of the discretization error. For the P D E s and boundary conditions for the unknown function u we allow an operator of the following from t h a t is an arbitrary nonlinear function of all of its arguments: rU
= r\t-,
X, y, Z, U, U j , Ux, Uy, U 2 , UXX:
Uyy i V>zz , UXy,
u
Xzy
u
yz)
— U.
For a system of m P D E s u and Pu have m components. The solution is by linearization with the Newton-Raphson m e t h o d and discretization using the error estimates. From this results the "error equation" t h a t gives a transparent view of all errors and is the base of all selfadaptation and control procedures. For the parallelization of F D E M the nodes are sorted for the x-coordinate and distributed in p parts onto the p processors (distributed memory). This is a 1-D domain decomposition. Note t h a t we have a black-box. However, processor ip needs for the rings of nodes of F i g . l information of its left and right processor, see Fig.2, or even of more processors. T h u s information of "overlap" nodes is stored twice. Then the generation of the global large and sparse m a t r i x and of the r.h.s can be executed locally on each processor. T h e storing is by blocks of packed rows in p parts over the processors. Then each processor calls the linear solver package LINSOL [6,7,8]. LINSOL presently has implemented 14 iterative methods of generalized CG (conjugate gradient). Storing is possible by 8 elementary d a t a structures. LINSOL is efficiently parallelized. We implemented a LU or ILU preconditioner with optimal d a t a structures for parallel computers. A working array with unpacked rows, single row wrap around over the processors, is travelling over the m a t r i x during the factorization. T h e factors L and U are resorted into columns t h a t are distributed in row-blocks over the processors to have optimal d a t a structures
364 on processor
u-K **
J M
2
J
^ \
ip-1
if> I tp + 1
i over- •• low-' nodes needed for proc. ip
Figure 2. Illustration for parallelization a n d overlap
for the forward elimination a n d backward substitution. 3
The Computers
W h a t is t h e best single value t h a t characterizes a supercomputer t h a t is used for scientific computing? W e consider as a realisitc scale t h e performance for the vector triad with d a t a from memory: di = h + a * di t h a t needs 3 loads a n d 1 store per cycle and arithmetic unit. It is ultimately a measure for t h e m e m o r y bandwidth. In scientific computing the memory bandwidth determines t h e performance. However, now parallel computers have large caches or cache hierarchies so that t h e performance of t h e vector triad for d a t a from cache (measure for cache bandwidth) plays an increasing role. Further, t h e performance of t h e communication network is i m p o r t a n t for t h e execution of a parallel code. In Table 1 we give t h e measured maximal values for t h e vector triad for a single processor a n d for t h e comminication bandwidth, for S M P nodes for intra and in parentheses for inter node communucation. T h e number of processors for the parallel computers has been selected so t h a t t h e total theoretical peak performance is 24000 M F L O P S . Where we do not meet this value we use a time correction factor (ratio of actual peak to 24000).
4 4.1
The Comparison The Model
Problem
We want to solve with F D E M the following system of 3 P D E s which are the Navier-Stokes equations in velocity/vorticity form UXX + Uyy +Uy — fl =0,
Vxx + Vyy ~ LUX ~ f2 = 0 ,
uwx + vuiy - (UJXX + u)yy)/Re
- / 3 = 0,
365
under the boundary conditions « - 9i = 0, v- g2 = 0, u +
uy-vx-g1=0.
We have introduced in the equations artificial forcing functions /,- and gi that - 3 2 x +y are determined that the exact solution is U = V=LJ = U = e ( ) which is a sugar-loaf type function. The solution domain is x = [—2,2], y = [-0.5,0.5]. The grid is 256 x 128 with 64770 linear triangles as space structure, we have 98304 unknowns. The consistency order is q = 4. A large value of the Reynolds number Re makes the resulting linear system more critical. Therefore we solved two cases: Casel : Re = 1, here converges the BICGstab2 method, the dominant operations are the many matrix-vector multiplications with the large sparse matrix. Casel : Re = 104, BICGstab does no longer converge and we use full LU preconditioning with fill-in within the skyline of the band matrix. The dominant operations are the factorization in the working buffer and heavy data transfer operations to rearrange the data into optimal data structures. 4-2
The Measurements
It was a painful but interesting task to install the FDEM amd LINSOL codes via the Internet on all the tested computers. The results are presented in Table 2. Where the theoretical peak of the computer is not 24 GFLOPS the corrected time is given in parentheses and these timings must be used for a comparison. It must be recalled that the Fujitsus are quite different from the other micro-processor based computers, that the Hitachis have a special PVP (pseudo-vector processing) and the Cray has by its stream buffer a similar technique. We usually tried several compiler options, but we did not tune the code to a special computer. The discussion of the timings of Table 2 is not easy in a few words. All computers have (in scaled form) the same theoretical peak performance. Obviously it is for some computers easier for our code to keep the CPUs busy (where we have short timings) than for other ones. The Hitachis are an obvious example to show how the software under which the code runs, affects the performance. The difference between NightHawk and WinterHawk for case 2 demonstrates that using 8 processors of a parallel job on one node of the NightHawk reduces the performance for the same type of processor (bad configuration, but we had no other choice). The older IBM Thinl20SC (Power2SC processor) has less internal parallelism than the Power3-II processor which increases the time. Similar arguments hold for the Compaq (new Alpha 21264 processor) and Cray (old Alpha 21164), although here the
366
architectures are quite different, too. It might be very instructive for the manufacturers to find out for what reason their computer needs a large time, i.e. the CPU is not so busy for our code. Table 1. Configurations, properties and measurements of t h e vector triad and communication for t h e investigated c o m p u t e r s . theor. peak m a x . perf. for theor. per vector triad for peak proc. d a t a from per p no. (time LlL2proc. of f mecorrect . cache cache mory no . c o m p u t e r MHz M F L O P 3 proc. factor' (size) (size) IBM SP 303.0 16 (2 163.9 24000 1 NightHawk-2 375 1500 46.3 nodes) (64KB) (8MB) Power3-II proc. IBM SP 16 (8 294.1 169.5 24000 2 WinterHawk-2 375 1500 48.8 (64KB) ( 8 M B ) nodes) Power3-II proc. 222.2 3 IBM SP T h i n l 2 0 S C 120 480 50 24000 53.1 (128KB 1 " Hitachi SR8000-F1 537.6 224.2 24000 4 375 1500 16 without C O M P A S (128KB 1 " PVPb Hitachi SR8000-F1 383.9 16 (2 182.8 24000 5 375 1500 with C O M P A S nodes) (128KB 1 " PVP6 HP Superdome 26496 370.4 6 552 2208 12 34.8 PA8600 proc. (1.104 ( 1 M B ) Compaq Wildfire 23392 428.6 230.8 7 731 1462 16 13.6 (0.975 (64KB) (4MB) A l p h a 21264 proc. SGI Origin 3000 188,7 78.1 8 24000 400 800 30 33.0 M I P S R12000 proc. (32KB) (8MB) SUN Fire 6800" 371.7 (prefetch disabled) 94.1 24000 750 1500 16 9 16.8 (64KB) (8MB) UltraSPARC-Ill proc. CRAY T3E-1200 345.6 301.2 24000 10 600 1200 20 70.8 A l p h a 21164 proc. (8KB) (96KB) 671.5 22400 11 Fujitsu V P P 3 0 0 140 2240 10 (vec(0.933 tor) 2361..; 19200 12 Fujitsu V P P 5 0 0 0 300 9600 2 (vec(0.800 tor)
" F o r 8 p r o c . p e r n o d e o n l y t h e slow I P - p r o t . is p o s s i b l e ( H W l i m i t a t i o n of a d a p t e r ) . P V P = p s e u d o vector processing. c O n l y p r e l i m i n a r y v a l u e s b e c a u s e t h e p r e f e t c h was d i s a b l e d .
5
Is there a Simple Model?
Let us ask: If we have measured the execution time of FDEM for case 1 and for case 2 on the IBM WinterHawk-2 and we know for another computer its performance for the vector triad (LI, L2 cache, mem.) and its communica-
367
tion performance can we then extrapolate the time for FDEM on the other computer? We made different approaches to describe the computers 1-10 of Table 1 by a single simple model, with and without communication. It was not possible because the architectures are too different. After many fruitless attempts we finally found out that such a simple model was only possible for the 4 similar architectures of the IBM WinterHawk, Compaq, SGI and SUN. Table 2. Results of the measurements for the different computers and extrapolated timings (see text). measured time (corrected time e x t r a p o l a t e d nc . c o m p u t e r from 2, sec t i m e ) , sec case 1 case 2 case 1 case 2 1 IBM SP NightHawk-2 12.12 394.17 (11.64) (221.41) 2 IBM S P WinterHawk-2 11.04 212.41 11.04 212.41 3 IBM SP T h i n l 2 0 S C 15.71 491.17 (3.25) (56.09) Hitachi SR8000-F1 without 29.97 299.44 (2.40) 4 (55.84) COMPAS Hitachi SR8000-F1 with (2.95) 5 119.22 1191.98 (72.67) COMPAS 1509.69 6 H P Superdome (20.64) 29.48 (32.55) (184.98) (1666.7) 52.48" 255.13 7 39.60 241.26 C o m p a q Wildfire (51.17) (248.75) 8 SGI Origin 3000 10.9 220.64 8.71 203.56 32.06 27.88 457.99 460.32 9 SUN Fire 6800 6 10 CRAY T3E-1200 58.87 496.19 (6.09) (104.12) 11 Fujitsu V P P 3 0 0
63.17(58.94)
12 Fujitsu V P P 5 0 0 0
56.08 (44.86)
408.16 (326.53)
"Needed 2 Newton steps (others need one) 6 Only preliminary values because the prefetch was disabled.
We define a characteristic performance factor cpf for the vector triad on a p processor computer i by cpfi = Pi • (a • rcache,i + b • rmemti),
(1)
where r c a c /j e , rmem are the peak cache and memory performances in MFLOPS, then cpfi is a type of total performance. If we have a problem with M "operations" , the execution time of computer i in sec is timet = M/(106-cpfi).
(2)
The same holds for computer j . If we eliminate from these relations M we get time; =
timei. (3) cpfj For M (arbitrary) we choose the number of operations that can be executed with 24000 MFLOPS (theoretical peak of tested computers) in one sec, i.e.
368
we have M/(24000 • 106) = 1 => M = 24 • 109. For the determination of a and b we get from (1) and (2) _ Fcache.i
" a ~T rmern,i
' " — , „R
M ,.
u
•
.
(, V
If we have m computers, this relation should hold in the ideal case for all m computers, with ra > 2. For m > 2 we have an overdetermined m x 2 linear system of equations for a, b which is solved by the least squares method. As mentioned above we tried to get "reasonable" values of a, b for all 10 microprocessor-based computers, but we failed. An acceptable solution was only possible for the above mentioned set of 4 computers, if we took as cache performance for case 1 the LI and for case 2 the L2 cache performance. We got the following values: easel : a = 33.12 • 1 0 - 6 , b = 2.679, case2 : a = 0.02269, b = 0.063018. This means that the memory performance is dominant in both cases. With the above values and relation (3) we extrapolated from the timings of the IBM WinterHawk the expected timings for the Compaq, SGI and SUN. These values are also shown in Table 2 and they are fairly good. Just for curiority we also show the extrapolated values for the remaining computers in parentheses. They show clearly that this extrapolation completely fails. For these computers the maximal performance above all from memory is not significant, there must be used the performance for some smaller value of the vector length n, i.e. another rmem. 6
Concluding remarks
We have executed the complicated PDE solver code FDEM on many different parallel computers. We selected two data sets for which the matrix-vector multiplication (case 1) and the factorization of a band matrix (case 2) are dominant operations. The configuration was selected that the theoretical peak performance for all computers is the same. In Table 1 the properties of the computers were compiled while in Table 2 the measured (and corrected) performances were presented. The attempt to describe all the measurements in a simple approach by the maximal cache and memory performance of the vector triad failed because of the diversity of the architectures. Only for four similar architectures the approach succeeded.
369 A cknowledgement We thank heartily all computer centers that allowed us to use their computers, and we thank their staff that helped us to install and run the code: NIC Jiilich (Cray), LRZ Munich (Hitachi), DKFZ Heidelberg (HP), URZ Magdeburg (Compaq), ZHR Dresden (SGI), RZ-RWTH Aachen (SUN). References 1. W. Schonauer, Scientific Supercomputing: Architecture and Use of Shared and Distributed Memory Parallel Computers, self-edition W. Schonauer, Karlsruhe, 2000, see URL http://www.uni-karlsruhe.de/~rz03/book/. 2. T.Adolph, W.Schonauer, H.Hafner, Comparison of the FDEM (Finite Difference Element Method) code on the IBM WinterHawk-2 and NightHawk-2 and on the Hitachi SR8000-F1, submitted to the proceedings of the Internat. FORTWIHR Conf.2001, High Performance Scientific and Engineering Computing, Erlangen, Germany, March 2001. 3. W. Schonauer, T. Adolph, How WE solve PDEs, J. of Computational and Applied Mathematics, 131 (2001), pp. 473-492. 4. W. Schonauer, T. Adolph, Development of a black-box solver for nonlinear systems of elliptic and parabolic PDEs, in P. Neittanmaki et al. (Eds.), Proc. ZTd European Conf. Numerical Mathematics and Advanced Applications, Worl Scientific Publishing, Singapore, 2000, pp. 698-706. 5. W. Schonauer, T. Adolph, The FDEM (finite difference element method) program package: Fully parallelized solution of elliptic and parabolic PDEs, Proceedings of ECCOMAS 2000, Barcelona, CD-ROM ECCOMAS 2000, Laboratori de Calcul de la Facultat d'Informatica de Barcelona (FIB, UPC), ISBN 84-89925-70-4, 20 pages. 6. H. Hafner, W. Schonauer, Parallelization of the iterative linear solver package LINSOL, in E. H. D'Hollanderet al. (Eds.), Parallel Computing: State of the Art and Perspectives, Elsevier, Amsterdam 1996, pp. 625628. 7. H. Hafner, W. Schonauer, R. Weiss, The program package LINSOL-Basic concepts and realization, Applied Numerical Mathematics, vol 30, no. 2 3, 1999, pp. 213-224. 8. H. Hafner, W. Schonauer, R. Weiss, Parallelization and integration of the LU and ILU algorithm in the LINSOL program package, in V. Malyshkin (Ed.), Parallel Computing Technologies, Springer Lecture Notes in Computer Science 1662, Berlin 1999, pp. 417-427.
W H A T DO W E G A I N FROM HYPER-SYSTOLIC ALGORITHMS O N CLUSTER C O M P U T E R S ? W. S C H R O E R S , TH. LIPPERT, K. SCHILLING Fachbereich
The principle problem of parallel computing, the degradation of performance by internode-communication, is aggravated on modular cluster-architectures with offboard communication cards. In this contribution we consider so-called iV-squared problems, the prototype being the iV-body computation with long range interaction. We demonstrate that hyper-systolic algorithms are able to enhance the performance of such computations on parallel machines by alleviating the communication bottleneck, at the cost of moderately increased memory. We determine the efficiency of implementations on the Alpha-Linux-Cluster-Engine ALiCE at Wuppertal university compared to the SIMD system APE100.
1
Introduction
The exact computation of mutual interactions between n elements leads to n 2 interaction terms. The computational effort growing with the power of two, such "n 2 problems" truly are a high performance computing task sui generis. With the current trend in HPC going from costly proprietary devices towards cost-efficient clusters built from workstations or PC's, it appears very interesting to study such type of computations on these devices. To be more specific, let us consider the computation of an array {y} = yi,..., yn from an array {x} = xi,..., xn by means of a function / (•, •): Vi = ^f{xi,Xj), i
(1 <*',.?'
(1)
The computation of the function /(•, •) is required for all pairs Xj, Xj. Therefore, the computational complexity is of order O (n 2 ). The class of n 2 -problems considered in (1) occurs in a wide variety of fields, e.g. molecular dynamics, where /(•,•) is the (long range) force between the bodies. Molecular dynamics is of importance in the areas of astrophysics, thermodynamics and plasma physics; for general overviews see the e.g. l i 2 . Other important fields of application are polymer chains with long-range interactions 5 , or genome analysis. Also signal processing 3 and statistical analysis of time series 4 falls into this class. Furthermore, one can generalize the form of (1) by choosing two different arrays {x} and {z} as the arguments of
370
371
In order to implement the exact computation of (1) on a parallel machine with p processing elements (PEs), basically three parallelization strategies are known: Replicated data method. The algorithm requires the complete array {x} to be broadcast at the begin of the calculation, such that all the PEs contain the complete arrays {x}. PE k = l,...,p computes f(xi,xj) for j = 1 , . . . , n and i = (k — 1)^ + 1 , . . . , k—. Thus, the communication complexity is of order 0(np). Systolic array computation. Instead of a broadcast, only nearestneighbor communication is required, while the process is split into n time steps. To each PE k — 1 , . . . ,p a sub-array {£*} with n/p elements is assigned. The sub-array is shifted time-step by time-step to all other PEs where the functions /(•,•) are evaluated after each time-step. The communication complexity again is of order 0(np). Hyper-systolic array computation. As an extension of the systolic array computation (see chapter 3), this method allows for the reduction of the communication complexity to order O [n^/p) • The price to pay is an increased memory consumption of order O {^/p) • In general, n'2 problems are dominated by computation for large n (when n/p is still large) and by communication for large p (when n/p becomes small). The meaning of "large" and "small" is of course machine and implementation dependent. In chapter 4 we will present some general guidelines to decide which problem sizes n/p are the appropriate choices for a given machine. 2
Systolic array computation
We map the system of n elements onto a logical 1-dimensional ring of p PEs, each of which can carry out a local computation and a shift to its neighbor PE on the ring, see Ref. 7 ' 8 for a detailed description. The arrays {x} and {y} are partitioned into p sub-arrays {xa} and {y~p} with a = 1 , . . . ,p of n/p elements and are homogeneously distributed to the PEs. Note that by virtue of (1) a systolic system of n processors can be mapped onto a system of p processors. This procedure is called hierarchy mapping. The algorithm consists of the following steps: 1. Compute {ya} = £ f c e { c t } / ({£«}, {£<*,(*)}) locally on each node.
372
2. Shift the sub-arrays along the 1-dimensional ring topology with cyclic boundary conditions {xp} «- c s h i f t ({xa}).
3. Compute the function J2{p\ f ({^a}, {xg}) on each node.
an
d
sum U
P t n e result to {ya}
4. Repeat steps 2-3 (p - 1) times. The result is {ya} = f ({xa}, {x}) on each node. The communication corn-
Table 1. The distribution of the data packets {xa} tation.
Step 0 1 2 3 4 5 6 7
1 8 7 6 5 4 3 2
2 1 8 7 6 5 4 3
3 2 1 8 7 6 5 4
on the PEs during the systolic compu-
{x 7 } on PEs 4 5 3 4 2 3 1 2 8 1 7 8 6 7 5 6
6 5 4 3 2 1 8 7
7 6 5 4 3 2 1 8
8 7 6 5 4 3 2 1
plexity of this algorithm is of order 0(np). Symmetries or anti-symmetries of /(•,•) are not yet exploited. If the evaluation of / (•, •) is very costly, one may thus loose up to a factor of two in computation time. Furthermore, we note that during the systolic process the combination of elements (i,k) could be found n times, if one had stored intermediate arrays. However, this specific combination is only required for the PEs i and k. The load is equally distributed to all PEs and no idle cycles occur. This fact renders the algorithm suitable on both SIMD and MIMD architectures. The method is also known as "Orrery-algorithm". Its extension, the HalfOrrery-algorithm, allows the exploitation of symmetries of / (•, •)• However, the Half-Orrery-algorithm is equivalent to a specific case of the hyper-systolic algorithm to be discussed in the next section.
373
3 3.1
Hyper-systolic array computation General concept
Next, let us try to remove the "redundancy" of combinations which occurs in the previous case. We still want to keep the advantages of the algorithm, namely the symmetric implementation on all PEs. We now keep the intermediate arrays in memory, and during each step we consider all combinations of pairings between the data packets on the current PE. This strategy will of course not reduce the number of operations required, but we can expect that the number of rows required (and thus the number of communication operations) becomes smaller. THEOREM
1 The minimum number of rows required to compute (1) is given
by
if the function f (•, •) is symmetric or antisymmetric in its arguments and by
if no symmetries are exploited. Proof: The minimum number of pairings required is given by p(p — l ) / 2 (or p(p — 1) if there are no symmetries), the number of possible combinations between the elements of a given column is ( £ ). Since there are p such columns, the following inequality holds:
or, respectively P ( P - 1) < f
2
)P-
• Since the problem is homogeneous for all PEs (and thus also for the data packets { i 7 } initially distributed on the PEs), it is sufficient to consider the treatment of data packet {x Q =i}- The problem is solved if we generate the combinations with all other data packets { x ^ / i } . A possible solution for this problem is shown in table 2. The configuration can be reached if the sub-array
374
{£/3=i} is shifted by strides of (1,1,2) (instead of only strides of 1 as in the previous chapter); thus the array is not distributed to all PEs (only to 1, 2, 3 and 5). However, so far all components of {j/7} have been computed, but some Table 2. The minimal number of rows required to have all pairings of d a t a packets { x ^ i } with {xa=i} occur at least once. Some combinations with {x/3^1} may occur more than once. This table has been constructed under the assumption that the function / (•, •) is symmetric.
Line 0 1 2 3 4 0
6 7
1 8 7 fi 5 4 3 2
2 1 8 7 6 5 4 3
3 2 1 8 7 6 y
4
{x 7 } on PEs 4 5 3 4 2 3 1 2 8 1 7 8 6 7 5 6
6 5 4 3 2 1 8 7
7 6 5 4 3 9
1 8
8 7 6 5 4 3 9
1
have been accumulated on the wrong PEs (e.g. the pairing (1,4) on PE 5 must be shifted to the PEs 1 and 4 and added to {j/ 7 =i} and {2/7=4} to complete the calculation). Therefore, we have to store and finally shift the resulting partial arrays {j77,a=i,...,h} in the same way backwards as we previously shifted the partial arrays {xy} forward and sum them up at the correct place. Thus, the total amount of communications required is 6 for this algorithm (in contrast to 7 for the previous case). In the general case for p processing elements we need h shifts of {xa} and h shifts of {ya}- On the other hand the total memory requirement has increased to h — 1 arrays of {ya }! Below we are going to show that always a solution for h of order ^/p exists, thus the communication cost for the new algorithm is of order O (n^/p). 3.2
Hyper-systolic bases
The recipe to find the minimum number of rows required to solve the complete problem can be cast in terms of Additive Number Theory. We define the hyper-systolic base Ah = (ao = 0, ai,..., ah) to be an ordered set of numbers where the a; =o,...,/» are the strides of the shifts introduced in section 3.1. Then we can formulate the problem in the following way: 2 Let I be a set of integers I = { 0 , 1 , . . . ,p - 1}. Find the ordered set Ah of h + 1 integers with h minimal, such that each m £ I can be THEOREM
375
represented at least once as the ordered partial sum m = a,i + Oj+i + • • • + di+j,
m = p - (a,•+ at-+i -\
4- ai+j),
0 < i + j < h,
i, j G N.
The second equation has to be fulfilled to exploit symmetries of f (•, •). This optimization problem is equivalent to the ft-range p(h, Ah) of the extremal basis Ah with only ordered partial sums allowed as formulated in theorem 2. It reminds one of the postage stamp problem where no ordering was required 1 0 , n . This class of problems is NP-complete, however, so finding the exact solutions become increasingly costly for large p. However, we may find a base which is valid for arbitrary p and whose length still only grows as O(nvp). THEOREM
3 The regular base ( \ A2Khy.-l = I 0, 1, . . . , 1, Khyg,.. .,KhyS \ KhyB-l j KhyB
with Khys > y/p (KhyS > \/p/2) solves the (symmetric) problem. The communication complexity still is O (riy/p) • Proof: The base A2Khy,-i contains K\iys — 1 elements a, = 1. Thus, any number 0 < r < -ft'hys — 1 can be represented as an ordered partial sum. Furthermore the partial sums i ^2
Kh
y* = (J ~i
+
l K
) hyS
l=i
are integer multiples of Kilys with 0 < j — i + l < K\lys. Therefore any m
Cost functions
In the previous section the bases found are most effective if all communications have equal costs, regardless of the stride. In the general case, we have to
376
minimize the cost function:
Cs (Ah) = £ {Cli"} ("') + C{ia) ^ ) > i
where C^*"} (I) and C^"} (I) are the costs to communicate {xa} and {j/Q} by a stride of /. Again simulated annealing methods may help to find good solutions. 4
Implementations and results
Table 3 gives a list of the strides for the shortest bases (under the assumption that / ( • , • ) is symmetric). We see that for some values of p the shortest base length obtained from eq. (2) cannot be realized. Furthermore we note that a solution for a given value of p is not necessarily valid for p — i, 1 < i < p 13 . As a test-case for the efficiency of hyper-systolic routing on parallel computers we give results for a Kepler-force computation, implemented on two different machines. Figures 4 and 4 show results for two APE-100 parallel systems. These machines are SIMD machines with a static networking topology, designed as a 3-dimensional torus. They deliver a performance of 50 MFlops/node in single precision. The code has been optimized, and the force computation has been carried out in single precision. The summation, however, is in double precision. The results have been taken from 9 . The APE architecture offers a very low latency. Therefore, small particle numbers per node (less than 100) can be efficiently simulated on APE. Next let's consider results from the Wuppertal Alpha-Linux-ClusterEngine (ALiCE). This system consists of a cluster of Compaq DS10 workstations, equipped with Alpha 21264 EV6 616 MHz processors, 256 MB memory and with a 2nd level cache of 2 MBytes. Each node delivers a peak performance of 1232 Mflops and they are connected by a fat tree Myrinet network. The implementation of the Kepler force is in standard Fortran 90 using MPI without any further optimizations beyond those the compiler offers. The entire calculation is performed in double precision. Figure 4 shows the total times of a single evaluation of the forces for different numbers of particles per node. Figure 4 gives the relative performances achieved for these cases. For small particle numbers the results are contaminated by latency effects (the Parastation software used at this stage of the project showed a latency around 55p,s). On ALiCE, particle numbers of less than hundred particles per node do not make efficient use of the system. Local particle numbers of the order of
h- CO h-' h-^ I—'I—' H
H
K
W
oo^Cj^ji^ji^-
JJ u a
00
H-
Oi ^ ^
oo co >^ i^
B
1
t>.OlWi^'
t-» -
>^ a u to to
H
to —--—-•—•- o oo u
W O l CO CO M
*
N
0
-
-
Oi
Oi
Oi
pi
*
00
H-*
-~s i—»~—' oi
^ — Oi tO O
>£. to ~ "
H-» tO "
H-» - -
Ol"
co-
|_i -
oo t—i to c o - a h-» -—' to to — ' i-»
-
oo w oo a -
-
'
*—^^—
1
0
H
H
H
H
'
-—'
h-'
* '
W
N
H
H'I—'
"—'
i i i i O i t O i ^ " " " "" n H O l N O l t O Ol" ^—' -—^ "—^ -—' tO
C H O l S O l
K3 H» tO CO h -
to oo -q ai 01
>-^ '—'— —' — ' H-^
,
4^ 4^ CO CO CO CO CO
O l l O M l M O W M C O I - ' H M H CO ' H t O H H t O H t o o o c u ^ H i o N t o o i u t o - g i - " M H H S H O i c n o i o i o I—> ffi 0 0 tO tO ~ " " i*^" "" " " "" " " I-" tO M I - tO " _ O " "*_ "" " I—> I—' H^h-'00- 1 i t O l H O ) M 0 0 M " O i •" O l O O l C M C n S O O H S O O " H- " CO *». t O ^ C O O i O i - a - H - * " l_i s. QO- H - f j ^ H - i Oi ~-I- 00 K H H H H p M H oo- -g H en rfi- - -a -a ~ oo- oo o oo H M H -—' - i-» o o,s o
H'*>C»h^(Ot001
l-i h-' ,£• * • CO >£• (O H-1 tO
l
[O h ^ h ^ H ^ t — > tO tO h '
l
E N S O ) O l f f l O ) 0 ) O l O ' O l O l O l O i a i O l O l C n C n * k i ^ i ^ f r i l i i ^ i ^ f r g j i M C n ^ . M I O i - ' O t O O O S f f l b i i t k W t O H ' O O O O N O C n . l i M t O
00 -
Cn i - 4i Ol Oi Oi O Oi itk QO ^ t^'—' >£• O " '--'- t\3^-"—"*—"—•^-'-—^
jt>._coj^.jP4.h-h--
H H K t l l H H h ' U U H ' H H H H K i t i t l l t O M H K a h- ~ - - - - o - - - - - -s
H- h - k M N 5 M K W H H H H h '
l
i ^ W M H O ! 0 0 0 - J 0 1 0 < ^ M N 5 H O « l Q O S a O l ^ W N 3 H O
a
s a
IA
1
4<1> •o
Sw w c • ffi <
a; ST
5' £
its 00 ^1
378 60
50
200
1000 # of particles
Figure 1. Relative performance on the 32-node APE-100 Q4 (peak: 1.6 GFlops/single).
O (103) particles per node are required for effective pipelining of communication and memory-to-cache-data-transfer. We observe a decrease of the hyper-systolic curve at about 8200 particles/node. At this point the total size of data involved is about 1.2 MBytes which is 60% of the 2nd level cache of the machine. As a consequence we observe a performance degradation due to cache effects once the data size reaches about 50 — 60% of the total cache size. However, since we could exploit symmetries, the total number of floating point operations required to compute the resulting force is still smaller than for the systolic algorithm and thus the total timing is still advantageous in figure 4. 5
Conclusions
In this talk we covered different routing strategies for different parallel computing architectures. We have motivated the use of hyper-systolic algorithms and discussed their implementation on two different parallel architectures.
379
0
5000
10000 15000 20000 25000 30000 35000 # of particles
Figure 2. Relative performance on the 512-node APE-100 QH4 (peak: 25.6 GFlops/single).
They allow parallization of n 2 problems at minimal communicational cost,
380 t
1
1
1 — -» —
- 'i
'
Systolic algorithm Regular hypeniyslolic General hypersystoljc
-
'
.
1
5,uxitf l.axio4 Local particle number
Figure 4. Relative performance on the 32-node partition of the ALiCE.
but have increased memory consumption. The low latency on APE machines allows small data sizes per node, while on computer clusters larger numbers of particles per node are more advantageous. Acknowledgments The authors wish to thank N. Eicker for his support on ALiCE and his valuable comments on compiler usage and software optimization. The ALiCE cluster software development is supported under DFG grant LI701/3-1. W.S. is supported by the DFG Graduiertenkolleg "Feldtheoretische und numerische Methoden in der Statistischen- und Elementarteilchenphysik". The authors furthermore wish to thank H. Simma for his important remarks on compiler optimizations on APE100. References 1. W. Smith, Computer Physics Comm. 62, 229 (1991). 2. P. Hut, astro-ph/9704286. 3. L.R. Rabiner and B. Gold, Theory and application of Digital Signal Processing (Englewood Cliffs, N.J.: Prentice Hall, 1975). 4. T.D. Dontje, Th. Lippert, N. Petkovand K. Schilling, Parallel Computing 18, 575 (1992). 5. M. Levitt, Current Biology Ltd. 1, 224 (1991).
381
6. Th. Lippert, N. Petkov, P. Palazzari and K. Schilling, accepted for publication in Parallel Computing. 7. N. Petkov, Systolische Algorithmen und Arrays (Akademie-Verlag, Berlin, 1993). 8. Th. Lippert, PhD thesis, Rijksuniversiteit Groningen, 1998. 9. Th. Lippert, U. Glaessner, H. Hoeber, G. Ritzenhofer, K. Schilling and A. Seyfried, Int. Journal Mod. Phys. C 7, 485 (1996). 10. H. Scheid in Zahlentheorie (Bl-Wissenschaftsverlag, Mannheim, Germany, 1991). 11. M. Djawadi and G. Hofmeister, Mainzer Seminarberichte, Additive Zahlentheorie 3, 187 (1993). 12. P. Palazzari, Th. Lippert and K. Schilling in Proc. NATO Advanced Research Workshop High Performance Computing, Technology and Applications, ed. L. Grandinetti et.al. (Cetraro, Italy, 1996). 13. W. Schroers, Th. Lippert and K. Schilling, in preparation.
L E A D E R ELECTION IN W R A P P E D B U T T E R F L Y NETWORKS WEI SHI Department of Computer Science, Colorado State University, Colorado State University, Ft. Collins, CO 80523 A. BOUABDALLAH Heudiasyc CNRS UMR 6599, Univ. de Technologie de Compiegne,BP 20529, 60205 Compiegne, Prance D. TALI A DEIS, Universita' della Calabria, Via P. Bucci, cubo 41-C, 87036 Rende, Italy PRADIP K SRIMANI Department of Computer Science, Clemson University, Clemson, SC 29634-0974
1
Introduction
Lader election algorithms for various kinds of networks.with or without link orientation have been extensively studied. Existence of orientation of the links have been shown not to improve the message complexity of leader election for either rings or torii 2 , while that helps for cliques; the lower bound on message complexity of leader election in oriented cliques is O(N) 3 and that for unoriented cliques is fi(iVlogA'') 4 . Authors in 3 have considered complete networks with a sense of direction. Recently, Tel 5 , s has developed leader election for oriented hypercubes whose message complexity is linear in the number of nodes. In this paper, our purpose is to propose an election algorithm for the oriented butterfly networks 6 in O(-^JJ)2. As far as we know, no leader election algorithm exists for these butterfly graphs although the straightforward application of existing algorithms would give a leader election algorithm for unoriented butterfly graphs with 0(N2) message complexity. 2
Butterfly Graph Networks &: Leader Election
A wrapped butterfly graph, denoted by Bn, is defined 6 as follows: a vertex is represented as {zn-\ • • • ZQ,£), where z n _i • • • ZQ is a n-bit binary number and I is an integer, 0 < I < n - 1. We will refer to zn-i • • • ZQ as ring index and I as column index. Two vertices (z„_i • • • z0,l) and {z'n_1 • • • z'0,1') are
382
383
connected by a bidirectional edge iff (f = I + l(mod n))A (z[ = zi} Vi V z\ — Zi, Vi, except for i = t). Consider a butterfly graph B n , with n = 3. There are 24 nodes in B3. In order to better illustrate the connections among vertices, the figure shows the image of the left 8 nodes on the right side. Each node has a two part label. The left part is a binary number of length n and the right part is an integer with value from 0 to n — 1. Consider an arbitrary node {rn-x •••ro,c) in Bn. The four edges this node has can also be derived from the following four generators: Gi(r n _i •••r 0 ,c) = r „ _ i - - T o , a ( c ) G2(r„-i •••r0,c) = r„_i •••r 0 ,/?(c) Gsirn-i •••ro,c) = r n _ i • • -rc+ifcr^x •••r0,a(c) Gi(rn-\ •••ro,c) = r„_i • • •r /3(c ) +1 f^( c )r^( c )_ 1 • • -r 0 ,/?(c) where functions a(c) and 13(c) are denned as: a(c) = c+ l(mod n) and 0(c) = c- l(mod n). Recently, the same butterfly topology (with wrap around) Bn is redefined in 7 and the properties have been studied. We highlight the key features below. Remark 1 Bn is a symmetric (undirected) regular graph of degree 4, ^ s n x 2 " nodes andnx2n+1 edges. Bn has a logarithmic diameterV(Bn) — [^pj and Bn has a vertex connectivity 4, i-e., for any pair of nodes there exist 4 node disjoint paths between them. Bn has many other interesting properties; see 7 for details. Lemma 1 In a butterfly graph Bn, consider an arbitrary ring index with value of r n _ i • • • TQ. There are n nodes with this ring index and they form a ring of size n. Proof : There are n nodes in Bn that have the ring index value rn-i • • • ro- The labels of these n nodes are (r n _! • • • r0,£), 0 < I < n — 1. They axe connected by edges derived from generators G\ and G2, and it is easy to see that these n nodes form a ring of size n. • Remark 2 In Bn, we use Rn to denote the subgraph which consists of the nodes with the same ring index r = r n _i • • -ro- From Lemma 1, we know that R^ has n nodes and is a ring of size n. Remark 3 iFrom an arbitrary node (r,c) in Bn, only the nodes in RTn can be reached using direction 1 and 2.
3 Leader Election Algorithm in Butterfly Graph
In this paper, as in 8, we assume that the nodes in the graph do not know their canonical labels. Each node has a unique but uninterpretable label. We will still refer to the nodes by some canonical labels for convenience, but these labels or names have no topological significance 8. We also assume in this paper that the network is oriented in the sense that each node can differentiate the links incident to it by different generators (in contrast, a node in an un-oriented star graph distinguishes its adjacent links by different but uninterpreted names). We define the direction of a link as the index of the generator that generates the link, i.e. the link associated with generator G_i has direction i, 1 ≤ i ≤ 4.
Definition 1 Consider an arbitrary ring index value r = r_{n-1}···r_0 in a butterfly graph B_n. A ring regional leader, RL(r), is a node in R_n^r that is identified as the leader of the subgraph R_n^r.
The election algorithm in the butterfly graph consists of two steps. In the first step, every node executes a uniform algorithm to elect ring regional leaders. After the first step, there will be 2^n ring regional leaders elected, one in each of the 2^n subgraphs R_n^z, z = z_{n-1}···z_0, z_i ∈ {0,1}. In the second step, the ring regional leaders execute an algorithm and compete with each other to elect a final leader of B_n.
Remark 4 In an oriented butterfly graph, an arbitrary node (r, c) can identify the subgraph R_n^r with the knowledge of the direction of its links. The node does not have to know its canonical label in order to identify the region.
3.1 Election in R_n^r
In a butterfly graph B_n, there are 2^n subgraphs R_n^z, 0 ≤ z ≤ 2^n − 1, each of which is a ring with n nodes (Lemma 1). In this section, we present the algorithm to elect ring regional leaders within these subgraphs. The algorithm uses only direction 1 and 2 links for sending and receiving messages, and we have the following remark.
Remark 5 When electing ring regional leaders, an arbitrary node (r, c) in B_n only sends messages to and receives messages from the nodes within R_n^r, because only the links with directions 1 and 2 are used and these links can only lead to the nodes in R_n^r (Remark 3).
Since in each subgraph R_n^z, 0 ≤ z ≤ 2^n − 1, the nodes only interact with the nodes within their own subgraph, each R_n^z can be viewed as a self-contained module running the ring regional leader election algorithm. Therefore, it is sufficient to present the algorithm for an arbitrary subgraph,
R_n^r. The same algorithm can run simultaneously in all 2^n subgraphs R_n^z, 0 ≤ z ≤ 2^n − 1, in B_n. Consider R_n^r, which is a ring of size n. It contains n nodes with the same ring index (r) but different column indices (from 0 to n−1) in their canonical labels. The nodes do not know their canonical labels but each has a unique ID. The node with the largest ID will be elected as the regional leader by this algorithm. The algorithm requires each node to send a message RingTest(id, i, b) around R_n^r. The message carries three parameters. The first parameter is the ID of the node initiating the message. The second parameter i is an integer that counts the number of nodes this message has passed. b is a boolean indicating whether id is large enough to be the regional leader of R_n^r. Every node initially sends this message along direction 1. Any node receiving RingTest(id, i, b) compares its own ID with the id in the message. If its own ID is larger than the one in the message, it sets b to FALSE, indicating that the node originating the message cannot be the regional leader. The node also increments the value of i by 1 in order to record the number of nodes this message has passed. After updating the i and b values, the node transmits the message along direction 1 to the next node in the ring, and the message finally gets back to the node that initiated it when the value of i reaches n. When the node gets the message back, it checks the value of b in the returned message and decides whether or not to become a regional leader in R_n^r accordingly, i.e. b = TRUE means the node is elected as the regional leader and b = FALSE indicates otherwise. Since all n nodes in R_n^r send out a RingTest message, there will be exactly one node that gets its message back with b still set to TRUE. All other n−1 nodes will have b set to FALSE in their returned messages. So only one node will be elected as the regional leader of R_n^r. At each node, there is also a local variable RL which is used to store the id of the elected regional leader. The pseudo-code of procedure RingElect is listed below. It is executed uniformly at all n nodes in R_n^r.

Procedure RingElect(R_n^r)
Initial Conditions:
  1. Every node in R_n^r has a unique ID and its local variable RL set to 0.
Invocation of the Procedure:
  1. Node (r, ℓ), 0 ≤ ℓ ≤ n−1, in R_n^r sends RingTest(ID, 1, True) through direction 1.
At node (r, c) with ID, upon receiving message RingTest(ID', i, b):
  // ID' is the ID of the node that initiated this message.
  if (ID' > RL) RL = ID'
  if (i == n) {
      // The message has come back to the node that originated it.
      if (b == True)
          // This node has the largest ID in R_n^r, so it becomes the leader.
          the current node (r, c) becomes the regional leader of R_n^r
      else
          the current node becomes a non-leader
  } else {
      if (b == False || ID' > ID)
          send message RingTest(ID', i + 1, False) through direction 1
      else
          send message RingTest(ID', i + 1, True) through direction 1
  }

Lemma 2 For an arbitrary ring index r = r_{n-1}···r_0, procedure RingElect(R_n^r) requires n² messages.
Proof: There are n nodes in R_n^r. Each node initiates a message RingTest which travels through n nodes. So the whole procedure requires n² messages. □
3.2 Leader Election in B_n
After the first step, there will be 2^n regional leaders elected, one in each R_n^z, 0 ≤ z ≤ 2^n − 1. The objective of the second step is to elect one leader for the entire butterfly graph B_n from those 2^n regional leaders. In the algorithm, these 2^n regional leaders invoke a uniform procedure to compete for the leadership of B_n, and all other non-leader nodes only respond to the messages they receive, i.e. they do not initiate any message without receiving a message from another node.
Remark 6 We use RL(z) to denote the label of the regional leader in R_n^z, 0 ≤ z ≤ 2^n − 1. RL(z) does not specify the function used to derive the id of the regional leader. It only indicates the dependency of the id on the ring index.
In this step, each regional leader is required to initiate a TreeTest message. The message travels through a tree structure in the butterfly graph B_n.
The algorithm ensures that the message will get to nodes with all different ring indices, so that the id of each regional leader will be compared with the ids of all other regional leaders. As in the first step, only the node with the largest id will become the leader of the entire butterfly graph B_n. The procedure TreeElect is invoked by the regional leader of R_n^r, where r is an arbitrary ring index value. This procedure is executed at every ring regional leader (the detailed pseudo-code is omitted). There are two types of messages used in the procedure. The first type of message, TreeTest, travels downward through a binary tree as each node transmits the message through directions 1 and 3 (see Figure 1). The message carries three parameters. ID and b have the same meaning as in the first step; i is the level of the transmitting node in the tree. The second type of message is TreeReply, which travels back along the reverse paths of TreeTest. The comparison between the value of the local variable RL and the ID from the message only happens at leaf nodes, so the value of b in the message only changes at leaf nodes. Intermediate nodes only pass the TreeTest message to both children and collect the TreeReply messages from them.
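Since the detailed pseudo-code for TreeElect is omitted in the paper, the following is a minimal C++-style sketch of how the two message handlers could look at a single node. Every name here (Node, TreeTest, send, DIR1, DIR3) and the simplification of handling one election at a time are assumptions introduced for illustration, not the authors' actual code.

// Hypothetical sketch of the per-node TreeTest/TreeReply handlers.
// A full implementation would keep the reply bookkeeping per initiating
// leader (keyed by the message id), since 2^n elections run concurrently.
#include <cstdint>

struct TreeTest  { std::uint64_t id; int level; bool b; };
struct TreeReply { std::uint64_t id; int level; bool b; };

class Node {
public:
    void onTreeTest(const TreeTest& m) {
        if (m.level == n_) {                          // leaf of the binary tree
            bool ok = (m.id >= RL_);                  // compare initiator id with local RL;
            replyToParent(TreeReply{m.id, m.level, ok});   // >= lets the initiator's own ring pass
        } else {                                      // interior node: forward downward
            pending_ = 2;                             // expect one reply per child
            subtreeOk_ = true;
            send(DIR1, TreeTest{m.id, m.level + 1, m.b});
            send(DIR3, TreeTest{m.id, m.level + 1, m.b});
        }
    }
    void onTreeReply(const TreeReply& m) {
        subtreeOk_ = subtreeOk_ && m.b;               // AND together the leaves' verdicts
        if (--pending_ == 0) {
            if (isInitiator(m.id))
                elected_ = subtreeOk_;                // leader iff larger than all other RLs
            else
                replyToParent(TreeReply{m.id, m.level - 1, subtreeOk_});
        }
    }
private:
    // Placeholders standing in for the oriented links of the butterfly network.
    void send(int /*direction*/, const TreeTest&) {}
    void replyToParent(const TreeReply&) {}
    bool isInitiator(std::uint64_t id) const { return isRegionalLeader_ && id == RL_; }
    static constexpr int DIR1 = 1, DIR3 = 3;
    std::uint64_t RL_ = 0;          // id of the regional leader elected in step one
    int n_ = 3;                     // number of tree levels (ring size)
    int pending_ = 0;
    bool subtreeOk_ = true;
    bool isRegionalLeader_ = false; // true at the node that started TreeElect
    bool elected_ = false;
};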
[Figure 1: Travel of message TreeTest() in B_3. Legend: true nodes, image nodes, and the path of message TreeTest().]

Figure 1 shows the structure of B_3 and the execution of procedure TreeElect() from node (000,0). In this example, it is assumed that (000,0) is one of the regional leaders elected in the first step, because only regional leaders execute procedure TreeElect(). After node (000,0) invokes procedure
TreeElect(), message TreeTest() travels along a complete binary tree rooted at node (000,0). The message will reach the 8 leaf nodes of the tree. These 8 leaf nodes all have different ring indices and cover all possible ring index values that B_3 can have. After these leaf nodes get message TreeTest(), they compare the variable RL with the ID included in the message and send a TreeReply() message back to their parents with the boolean parameter in the message properly set. Message TreeReply() travels along the reverse paths of the tree structure and finally gets back to the root node (000,0). The root node becomes a leader or a non-leader according to the boolean values returned in the TreeReply() messages from its two children.
Lemma 3 Consider an arbitrary ring index value r and the execution of procedure TreeElect at the regional leader RL(r) of R_n^r. For any integer i, 0 ≤ i ≤ n, there are 2^i nodes that receive both message TreeTest(RL(r), i, b) and TreeReply(RL(r), i, b). These 2^i nodes have different ring indices, but the same column index.
Proof: We prove by induction. Since B_n is vertex symmetric, we can assume that r = 0 and the regional leader is the identity node, i.e. RL(0) = (0···0, 0). We only prove the claim for message TreeTest(id, i, b); the proof for message TreeReply(id, i, b) can be established similarly. Obviously, the observation is true when i = 0 and 2^0 = 1: there is exactly one node that receives message TreeTest(RL(r), 0, b), namely the ring regional leader itself, which has ring index 0 and column index 0. In the algorithm, a node at level i sends message TreeTest(RL(r), i + 1, b) through directions 1 and 3. Consider the nodes that receive message TreeTest(RL(r), i + 1, b). From the definition of generators G_1 and G_3, we know that these nodes will all have column index i + 1. It is also easy to see that these 2 × 2^i = 2^{i+1} nodes have different ring indices, from 0 to 2^{i+1} − 1. Hence, the proof is established.
□
Lemma 4 After every regional leader finishes procedure TreeElect, exactly one node will become the leader of B_n. All other regional leaders will become non-leaders.
Proof: For an arbitrary ring index value r, there are 2^n nodes that receive message TreeTest(RL(r), n, b). From Lemma 3, we know that these 2^n nodes all have different ring indices, from 0 to 2^n − 1. Since these nodes compare their local variables RL with the RL(r) from the message, the id of regional leader RL(r) will be compared with the ids of the regional leaders of all R_n^z, 0 ≤ z ≤ 2^n − 1. In order for a node to become the leader after procedure TreeElect, its id must be larger than those of all other 2^n − 1 regional leaders. We know that there is only one such node, and only this node will become the leader of B_n. All others
will become non-leaders. □
Lemma 5 There are 2^{2n+2} messages generated in this step in the entire B_n.
Proof: There are 2^n regional leaders elected in the first step. Each of them executes procedure TreeElect, which sends a TreeTest message to 2^{n+1} nodes. A TreeReply message is generated afterwards and sent back by the same 2^{n+1} nodes. So the total number of messages used in this step is 2^n × 2 × 2^{n+1} = 2^{2n+2}.
□
Theorem 1 The total number of messages needed by the leader election algorithm in B_n is (n² + 2^{n+2}) × 2^n, which is O((N/n)²), where N = n × 2^n is the number of nodes.
Proof: The result can be easily derived by adding the numbers of messages used in the two steps. □

Acknowledgments
Part of the work of Srimani was supported by NSF award # ANI-0073409.

References
1. N. Santoro. Sense of direction, topological awareness, and communication complexity. ACM SIGACT News, 16:50-56, 1984.
2. H. L. Bodlaender. New lower bound techniques for distributed leader finding and other problems on rings of processors. Theoretical Computer Science, 81:237-256, 1991.
3. M. C. Loui, T. A. Matsushita, and D. B. West. Election in a complete network with a sense of direction. Information Processing Letters, 22:185-187, 1986.
4. E. Korach, S. Moran, and S. Zaks. Tight upper and lower bounds for some distributed algorithms for a complete network of processors. In Symposium on Principles of Distributed Computing, pages 199-207, 1984.
5. G. Tel. Linear election in oriented hypercubes. Technical Report RUU-CS-93-39, Computer Science, Utrecht University, 1993.
6. F. T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees and Hypercubes. Morgan Kaufmann, 1992.
7. P. Vadapalli and P. K. Srimani. A new family of Cayley graph interconnection networks of constant degree four. IEEE Transactions on Parallel and Distributed Systems, 7(1), January 1996.
8. G. Tel. Linear election in hypercubes. Parallel Processing Letters, 5:357-366, 1995.
SOFTWARE TECHNOLOGY & ARCHITECTURES
ASSEMBLING DYNAMIC COMPONENTS FOR METACOMPUTING USING CORBA

ABDELKADER AMAR, PIERRE BOULET AND JEAN-LUC DEKEYSER
Laboratoire d'Informatique Fondamentale de Lille, Universite des Sciences et Technologies de Lille, Cite Scientifique - Bat. M3, 59655 Villeneuve d'Ascq cedex, France
{amar,boulet,dekeyser}@lifl.fr
Modern technology provides the infrastructure required to develop distributed applications able to use the power of multiple supercomputing resources and to exploit their diversity. The performance potential offered by distributed supercomputing is enormous, but it is hard to realize due to the complexity of programming in such environments. In this paper we present an approach designed to overcome this challenge, based on the Common Object Request Broker Architecture (CORBA), a successful industry standard. This approach consists in the development of a dynamic graph of distributed objects. Each of these objects represents a small encapsulated application and can be used as a building block in the construction of powerful distributed meta-applications.
1 Introduction
In this paper we describe a system which employs the Common Object Request Broker Architecture, CORBA, to implement a distributed computing application. This application is based on an assembly of components, each one realizing an elementary task, to achieve a global processing. The components are connected as a general data flow graph. The concept of dynamicity (replacement and migration of components) seems essential to us because of our application domain, namely scientific computing. Indeed, we target long-running parallel and distributed applications. Allowing the code or the hardware to be improved while the application is running is an important goal for us. The CORBA architecture allows the programmer to construct meta-applications without concern for component location, heterogeneity of component resources, or data translation and marshaling for the communications. This paper is organized as follows. Section 2 presents the data flow, the component structure and the communication protocol. Section 3 describes the concept of dynamicity. In Section 4, we report our experiments in building a video filter pipeline. Finally, Section 5 presents our conclusions and plans for future work.
2 Graph Components

2.1 Goals
As mentioned above, our goal is to form a data flow of connected CORBA components. This data flow can represent any acyclic component graph. To explain our architecture model, the illustrations and the test application presented in this paper are a pipeline of components. This data flow has to run on different architectures and the components can be developed in different languages (through unified interfaces). The use of CORBA facilitates this interoperability in a heterogeneous environment.
2.2 Data Flow Model and Component Structure
The communication protocol is based on the exchange of data between the components by sending different requests along the edges of the graph. A communication can be triggered by the sending or the receiving component, as explained below. As the connected components run asynchronously, we need a way to buffer the data between them. As we want to keep the data flow ordered, we use FIFO queues to link the components. A FIFO is a list of elements flowing from one component to another. This FIFO is implemented in two parts: an output FIFO stored in the source component, containing data elements ready to be sent, and an input FIFO stored in the destination component, containing data elements still to be computed. A component is made up of four object types (see figure 1): a computation object to carry out the processing of the input FIFO elements, one or more input and output FIFO objects (figure 1 shows the component structure with one input and one output FIFO), and the management object to supervise the data flow inside the component and with the other connected components. Each of these objects is a thread inside the component process and thus they run concurrently. We have not used the classical approaches to exploit parallelism with CORBA (e.g. thread per object, thread per method) because the notions of client and server 6,7 are not present in our model. The computation object handles the FIFOs by calls to the Get and Put methods, which respectively indicate the consumption and production operations.
2.3 Communication Protocol
[Figure 1: Component structure - a management object, input and output FIFO objects accessed through Get and Put, and a computation object.]

Components are built to carry out a certain process, specified in the computation object. The feeding of the input FIFOs is done by a "Factory" object (file, sensor or computation) for the graph components without input links, and by the specified connections for the other components. To manage the communications, two thresholds on the number of elements in a FIFO have been defined:
• a minimal threshold (for the input FIFOs), which indicates to the component that it is necessary to ask the preceding one to feed its input FIFO,
• a maximal threshold (for the output FIFOs), which indicates to the component that offering a part of its FIFO is necessary to avoid overloading.
When a minimal threshold is detected, the input FIFO alerts the corresponding management object, which sends a "request" to the previous component. The latter reads from its output FIFO (the amount of data read is calculated according to some criteria taking into account the size of the output FIFO) and responds with a "satisfyRequest" method invocation. In the other case, when the length of the output FIFO exceeds the maximal threshold, the management object sends an "Offer" request to the next component, which feeds its input FIFO. The computation in the calculation object and the communication in the management object are executed concurrently, each object being a thread.
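As a rough illustration of this threshold-driven protocol, here is a minimal C++ sketch of a management object and a locked FIFO. It is not the authors' implementation: all class and member names are assumptions, the thresholds are arbitrary, and the calls that would be remote CORBA invocations in the real system are represented by plain pointers.

// Illustrative sketch of the request/satisfyRequest/Offer protocol.
#include <algorithm>
#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

template <typename Element>
class Fifo {
public:
    std::size_t size() const { std::lock_guard<std::mutex> g(m_); return q_.size(); }
    void put(Element e) { std::lock_guard<std::mutex> g(m_); q_.push_back(std::move(e)); }
    std::vector<Element> get(std::size_t n) {            // consume up to n elements
        std::lock_guard<std::mutex> g(m_);
        n = std::min(n, q_.size());
        std::vector<Element> out(q_.begin(), q_.begin() + n);
        q_.erase(q_.begin(), q_.begin() + n);
        return out;
    }
private:
    mutable std::mutex m_;
    std::deque<Element> q_;
};

template <typename Element>
class ManagementObject {
public:
    // Applied when the FIFOs report their fill level to the management object.
    void superviseInput(Fifo<Element>& in) {
        if (in.size() < min_threshold_)
            previous_->request(min_threshold_);          // ask the predecessor to feed us
    }
    void superviseOutput(Fifo<Element>& out) {
        if (out.size() > max_threshold_)
            next_->offer(out.get(out.size() - max_threshold_));   // push the excess
    }
    // Invoked (remotely, in the real system) by the successor's management object:
    void request(std::size_t n) { next_->satisfyRequest(output_->get(n)); }
    // Invoked by the predecessor's management object:
    void satisfyRequest(std::vector<Element> data) { feed(std::move(data)); }
    void offer(std::vector<Element> data)          { feed(std::move(data)); }
private:
    void feed(std::vector<Element> data) { for (auto& e : data) input_->put(std::move(e)); }
    std::size_t min_threshold_ = 4, max_threshold_ = 16;          // arbitrary values
    Fifo<Element>* input_ = nullptr, * output_ = nullptr;
    ManagementObject* previous_ = nullptr, * next_ = nullptr;     // stand-ins for CORBA proxies
};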
However, to preserve the consistency of the data, we use critical sections to lock the concurrent accesses to the FIFO elements. To make the access to the FIFOs more flexible, the critical section concerns the access to a single element.
3 Dynamicity
Dynamicity is seen under three aspects:
• the first is the application's incremental development,
• the second is to be able to replace, during the execution, a component by another one (for example to change the calculation function or to use a more performant or specialized component) (a),
• the third is to migrate a component from one computer to another for various reasons such as hardware upgrading or load balancing.
3.1 Interactive Console
To control the components, a console has been developed. It consists of a program which controls the component connection and execution by the use of a simple language. The presence of a manager program is contrary to the peer-to-peer character of component systems. However, the console is minimal and serves only two roles: collaboration control and component replacement. All the communications between the components are done without involving the console. One of our future objectives is to replace this console by a visual environment 4 with which we can visually create and link components to build a meta-application. The component links that form the data flow (see figure 3) are made interactively by a console to which all the components must be connected just after their launch. The use of a console allows more flexibility in the choice of connections, and a dynamic control of the components.
3.2 Replacement of a Component
The component replacement is done interactively from the console (see figure 4). The replacement scenario is the following:
1. The components connected to the one that must be replaced are prevented from sending or requesting anything. This avoids the loss of data or waiting for a response that will never come.

(a) We focus on long-running applications, so this feature is important.
[Figure 3: Connection of the components - the console sets up direct links between the components.]
2. The console then asks the component that must be replaced to send the contents of its buffers to the new component. When the component to be replaced receives this command, it suspends the computation and executes the transfer of its buffers.
3. The components are connected to the new one, the new component is connected to the others, and the computation is launched.
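The following is a hypothetical sketch of how a console could drive this replacement sequence. The Component interface and all method names are assumptions introduced here for illustration (in the actual system these would be CORBA invocations on remote objects), not the authors' code.

// Illustrative console-side orchestration of the replacement scenario above.
#include <vector>

struct Component {
    // Stand-ins for remote CORBA operations offered by a component.
    virtual void suspendTransfersTo(Component* target) = 0;      // step 1: stop send/request
    virtual void suspendComputation() = 0;                       // step 2: old component pauses
    virtual void transferBuffersTo(Component* target) = 0;       // step 2: FIFO contents move
    virtual void relinkTo(Component* oldC, Component* newC) = 0; // step 3: update the links
    virtual void launch() = 0;                                   // step 3: start computation
    virtual ~Component() = default;
};

void replaceComponent(std::vector<Component*>& neighbours,
                      Component& oldComp, Component& newComp) {
    for (Component* n : neighbours)           // 1. freeze traffic towards the old component
        n->suspendTransfersTo(&oldComp);
    oldComp.suspendComputation();             // 2. the old component suspends and
    oldComp.transferBuffersTo(&newComp);      //    hands over the contents of its buffers
    for (Component* n : neighbours)           // 3. rewire the graph around the new component
        n->relinkTo(&oldComp, &newComp);
    newComp.launch();                         //    and restart the computation
}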
[Figure 4: Replacement of a component - the console suspends the calculation of the component to replace (1), which acknowledges the suspension (2); transfers from the neighbouring components are cancelled (3); the console orders the old component to send (4) the contents of its FIFOs to the new component (5); the links are updated (6) and the new component is launched (7).]
We have thus defined a dynamic graph of components. The following section validates this design by some experiments.
4 Experiments

4.1 Test Application
We have developed a simple graphic filter pipeline, in which each component includes one of the following filters: zoom, rotation or convolution effect. These filters are used to process a list of bitmap pictures. The first pipeline component reads the bitmap files, and the last one saves the processing results into other bitmap files. To reduce communications between components, the FIFO element is constructed as a structure containing a picture, and exchanges are done in packages of one or more pictures. The ORB used to develop this pipeline is ORBacus 1. We have used the JThreads library 2 to develop multithreaded components. The principal intention in developing this graphic application was to validate our model, and to prove that the pipeline is operational and that it really overlaps the communications with computations.
4.2 Performance
In this experiment we have used an SMP machine with four Pentium II Xeon processors at 450 MHz. We have developed a sequential version, written in C. Then we have done the same computation with one component (so without communication) to measure the CORBA overhead. Finally, in the distributed simulation, the code has been split into three components that are pipelined and execute concurrently. The respective running times are given in table 1.

Table 1. First experiment times.
Sequential program without CORBA: 662 s
Sequential program with CORBA: 1035 s
CORBA program with 3 components: 438 s
A first remark is the large overhead due to the use of CORBA. This overhead depends on the implementation we have chosen, and can also be explained by the use of CORBA sequences, which slow down the program execution. One may find more optimized ORBs that can effectively reduce this overhead. Nevertheless, we have a 1.5 speedup from the sequential program to the pipelined one and a 2.5 speedup if we compare the two CORBA programs. This clearly indicates that the three components run concurrently and that the pipeline works well. For the second experiment we use the same programs but we try other target architectures. Indeed, we have used three networked workstations. We
have done several measures: the sequential program running on a 266 MHz Pentium II computer, the sequential program embedded in a component on the same machine, the three-component program on three 266 MHz Pentium II computers linked by a 10 Mb/s Ethernet bus, and finally the same three-component program where one component (the second one) has been migrated onto a 450 MHz Pentium II machine. This last experiment shows the interest of migrating a component when there is a hardware upgrade, for example.

Table 2. Second experiment times.
Sequential program without CORBA: 1212 s
Sequential program with CORBA: 1666 s
3 component program on 3 266 MHz PCs: 665 s
3 component program on 2 266 MHz and 1 450 MHz PCs: 612 s
The results are shown in table 2. For the first three measures, we obtain results similar to those of the first experiment (the speedups are 1.8 and 2.5). The last measure shows an improvement when using a faster machine. As the three computation tasks are quite balanced, the whole pipeline is slowed down by the slowest task, hence the not so spectacular speed-up (1.07).
5 Conclusion
In this paper, we have presented only a first step towards building computation-intensive applications based on the assembly of CORBA components. The system we have presented is a dynamic graph of software components based on the CORBA architecture. In particular, we have shown that it can support dynamicity in the assembly and linking of the components, and that communication and computation within a component are done concurrently (by building multithreaded components). Further work will involve building a component generator that will allow packaging existing routines into interoperable components. When the CORBA Component Model becomes available, we may use it to build our components if it allows enough dynamicity. Another extension we consider is allowing parallel computations inside the components. We note the "Cobra" 3 and "Pardis" 5 approaches to do parallel computing inside CORBA objects. These systems provide special ORBs to support the parallelism within the CORBA objects. They extend
the CORBA object model with the notion of parallel object. This extension allows building interactions involving data-parallel components which exchange distributed data structures. Active research is also being done on optimizing the performance of CORBA itself for high-speed networks 8,9,10. Other efforts follow the component approach, such as the Common Component Architecture (CCA) 11.

References
1. ORBacus For C++ and Java. Object Oriented Concepts Documentation, 2000.
2. JThreads/C++, Java-like Threads for C++. Object Oriented Concepts Documentation, 2000.
3. T. Priol, C. Rene, G. Alleon. SCI-based Cluster Computing, in Programming SCI Clusters using Parallel CORBA Objects, Springer Verlag, 1999.
4. P. Boulet, JL. Dekeyser, JL. Levaire, P. Marquet, J. Soula, and A. Demeure. Visual data-parallel programming for signal processing applications. To appear in PDP 2001, February 2001.
5. Katarzyna Keahey, Dennis Gannon. PARDIS: CORBA-based Architecture for Application-Level Parallel Distributed Computation, Supercomputing'97 (http://www.supercomp.org/sc97/), November 1997.
6. Irfan Pyarali, Carlos O'Ryan, and Douglas C. Schmidt. A Pattern Language for Efficient, Predictable, Scalable, and Flexible Dispatching Mechanisms for Distributed Object Computing Middleware, Proceedings of the IEEE/IFIP International Symposium on Object-Oriented Real-time Distributed Computing, March 15-17, 2000, Newport Beach, California.
7. Douglas C. Schmidt. Evaluating Architectures for Multi-threaded CORBA Object Request Brokers, ACM Special Issue on CORBA, Krishnan Seetharaman, October 1998, Vol. 41, No. 10.
8. High Performance CORBA Working Group. http://www.omg.org/homepages/realtime/working_groups/high_performance_corba.html
9. omniORB: Free High Performance CORBA 2 ORB. http://www.uk.research.att.com/omniORB/
10. Real-time CORBA with TAO (The ACE ORB). http://www.cs.wustl.edu/~schmidt/TAO.html
11. CCAT Project. http://www.extreme.indiana.edu/ccat/
SIMULATION-BASED ASSESSMENT OF PARALLEL ARCHITECTURES FOR IMAGE DATABASES

T. BRETSCHNEIDER, S. GEISLER, O. KAO
Department of Computer Science, Technical University of Clausthal, Julius-Albert-Strasse 4, 38678 Clausthal-Zellerfeld, Germany
E-mail: {bretschneider,geisler,okao}@informatik.tu-clausthal.de

Image databases are characterised by the huge amount of data which can only be handled by parallel architectures. The requirements on the systems depend on the scope and call for a detailed analysis of the application and of the employed architecture, respectively. This paper investigates and compares the usability and efficiency of three different parallel systems with an emphasis on applications in the field of image databases. A simulation model was developed which allows the flexible design and assessment of arbitrary parallel architectures.
1 Introduction
The development and application of digital technologies results in the production of huge amounts of multimedia data. The scope and spread of document management systems, digital libraries, and photo archives used by public authorities, hospitals, corporations etc. grows day by day. Each year, with an increasing tendency, Petabytes worth of multimedia data are produced. This information has to be systematically collected, registered, stored, organised, and classified. Furthermore, adequate search procedures and methods to formulate queries have to be provided to access the information. For this purpose a large number of prototypes and operational multimedia database management systems is available 3. Image databases are an important subsystem of general multimedia databases and digital libraries. Retrieval of images usually starts with the creation of a sample image or sketch, which is subsequently used as a starting point for a similarity search in the available stock. The state-of-the-art approach for the description of image contents is based on the a priori extraction of features related to the colour distribution, the image layout, and the consideration of important segments and textures. These are computed when the image is inserted into the database, resulting in a significant reduction of the computational effort during runtime. Only whole images are compared, thus important image details like persons, objects, and other user-selected regions of interest are not sufficiently considered. However, if a detailed search is required, then all images have to be processed dynamically during the retrieval phase. The user selects certain regions of interest manually and describes these by a number of features. Subsequently
all images are searched for the occurrence of these features and the selected object can be found in different environments. Thereby deviations regarding rotation, size, colour etc. have to be considered. This approach improves the retrieval quality significantly, but requires additional computational resources which exceed the capabilities of computer architectures with a single processing element (PE) 2. Parallel architectures offer a solution for the performance problem and support the usage of dynamically extracted features in existing applications. The simulation-based analysis of the properties and suitability of different parallel architectures for image retrieval is the main objective of this paper. PEs, memory, and storage units as well as network structures are modelled using technical characteristics of current devices and evaluated by considering typical retrieval operations as examples.
2 Parallel architectures
Parallel architectures represent key components of modern computer technology with a growing impact on the development of information systems. From the viewpoint of the database community the parallel architectures are classified into 1:
• Shared nothing architectures,
• Shared everything architectures, and
• Shared disk architectures.
Shared nothing systems are usually used for the realisation of databases distributed over wide area networks. Each node has an individual database management and file system. All operations are performed on the local data and the inter-node communications are usually based on the client/server paradigm. Figure 1(a) shows a schematic of a shared nothing architecture. Shared everything architectures, as depicted in Figure 1(b), are the main platform for parallel database systems and most vendors offer parallel extensions for existing database systems. Independent query parts are distributed over a number of PEs 1. Fast communication and synchronisation as well as easy management are advantages of this architecture class. Disadvantages concern the fault tolerance and the limited extensibility of shared everything systems. Shared disk systems host distributed databases in local area networks. Each node has its own database management system and is connected over
a shared file system. An important issue of these systems is the fault tolerance for critical applications. Figure 1(c) visualises the structure of this architecture class. The important question is which computer architecture is suitable for the realisation of multimedia databases. In particular the problem is characterised by the large data blocks of the images and other multimedia objects. Thus, the communication between the I/O subsystem, the memory, and the PEs has a significant impact on the overall system performance.
3 Simulation
Discrete-event simulation is the commonly used method for the performance evaluation of parallel architectures 4,5. The following sections present the simulated models, the parameters, and the results of the performance measurements.
3.1 Architectures
The shared everything memory architecture consists of 2, 4, 8, 16 or 32 processors each equipped with an individual cache memory. Two hard disks are supplied for the 16 processor SMP, while the SMP with 32 PEs accesses four hard disks. Each processor of the shared disk architecture is equipped with
a private memory. The simulated system has an additional shared memory, since disk activities are time-consuming and can be optimised by the usage of this additional shared memory. The models for the shared disk and shared everything architectures consist of identical basic modules connected via a bus system, e.g. PE, cache memory, main memory, and hard disk. Each PE in the shared everything architecture works independently from the others, as no communication is necessary while analysing a file. In a pipelined processing stream the first file is loaded from the hard disk into the main memory; a second file is requested by the PE and loaded while, in parallel to this job (due to direct memory access), the decompression and analysis of the first file is started. As an additional step in the shared disk architecture, the file is copied into the private memory after loading it from the hard disk into the shared memory. However, during decompression and analysis only data from the private memory is needed. The simulation of the cluster architectures uses the results of the architecture with a single PE, i.e. the modules are whole computers with previously measured response times. The network connection is a bus system. The cluster model considers two different methods of data distribution over the available nodes. In the first case each node stores the whole database and a master node starts the slave nodes with a fixed number of files which have to be analysed. As each node has all files on its hard disk, it is able to process the job without further communication. In the second case each node stores only a quarter of the available data. The master node induces the slave nodes to analyse single, locally stored files ordered by descending size. An acknowledgement is sent to the master when the task is completed. As soon as all local files are processed, additional files are requested from other nodes. These are then re-distributed from overloaded nodes within the cluster 6.
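As a rough illustration of the second data-distribution scheme, the following C++ sketch shows a slave analysing its locally stored files largest-first and then pulling files from nodes that still have work. It is a hypothetical sketch only; the class and function names are assumptions, not the simulator's actual interface, and network transfers are reduced to comments.

// Illustrative sketch of the second cluster scheduling scheme.
#include <algorithm>
#include <optional>
#include <vector>

struct File { double sizeMB; };

struct ClusterNode {
    std::vector<File> local;                         // roughly a quarter of the database

    void sortDescending() {
        std::sort(local.begin(), local.end(),
                  [](const File& a, const File& b) { return a.sizeMB > b.sizeMB; });
    }
    std::optional<File> take() {                     // next locally stored file, largest first
        if (local.empty()) return std::nullopt;
        File f = local.front();
        local.erase(local.begin());
        return f;
    }
};

void analyse(const File&) { /* decompression + analysis, then acknowledge the master */ }

void runSlave(ClusterNode& self, std::vector<ClusterNode>& cluster) {
    self.sortDescending();
    while (auto f = self.take()) analyse(*f);        // phase 1: local files only
    for (ClusterNode& other : cluster) {             // phase 2: re-distribution from
        if (&other == &self) continue;               //          still-loaded nodes
        while (auto f = other.take()) analyse(*f);   // (file transferred over the network)
    }
}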
3.2 Simulation parameters
The technical parameters of the modules are chosen to match the values of their actual common equivalents. The PE is clocked at 1 GHz, the response time of the main memory corresponds to that of a 100 MHz SDRAM, and the hard disk has an average transfer rate of 35 MByte/s, while the cluster network connection has a speed of 10 MByte/s. The files in a multimedia database generally do not have the same size. In the simulated database the file size is modelled by a normal distribution with a mean value of 5 MByte and a standard deviation of 2 MByte. The files are stored with a compression rate of approximately 90%. For the processing it is assumed that the effort corresponds to the file size. The values are chosen in such a way that for a file size of 5 MByte both the decompression and the analysis algorithm take nearly half a second of processing time.

[Figure 2: Comparison of architectures - speedup vs. number of processing elements.]
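The workload model just described can be made concrete with a small sketch. This is not the authors' simulator; the function and type names are assumptions, and the only grounded numbers are the ones stated above (mean 5 MByte, standard deviation 2 MByte, about 0.5 s per phase at 5 MByte, i.e. 0.1 s per MByte).

// Illustrative sampling of the simulated workload.
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

struct FileJob {
    double sizeMB;          // file size drawn from the normal distribution
    double decompressS;     // modelled decompression time in seconds
    double analysisS;       // modelled analysis time in seconds
};

std::vector<FileJob> sampleWorkload(std::size_t nFiles, unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> sizeMB(5.0, 2.0);   // mean 5 MB, stddev 2 MB
    const double secondsPerMB = 0.5 / 5.0;               // 0.5 s at 5 MB => 0.1 s/MB per phase
    std::vector<FileJob> jobs;
    jobs.reserve(nFiles);
    for (std::size_t i = 0; i < nFiles; ++i) {
        double s = std::max(0.5, sizeMB(rng));           // clamp unrealistically small draws
        jobs.push_back({s, s * secondsPerMB, s * secondsPerMB});
    }
    return jobs;
}
// The simulation task of Section 3.3 would then correspond to sampleWorkload(200).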
3.3 Performance measurements
The simulation task consists of the decoding, decompression, and analysis of 200 files. These represent a subset of a larger database, which is created as a preliminary result by a database request using static attributes. As a measure of the performance, the number of bytes that are decompressed and analysed in proportion to the elapsed time is taken (throughput). Of special interest are the speedup and the efficiency of the different architectures with respect to an increasing number of PEs. The reasons for the different execution times of the different architectures can be discovered by measuring the workload of the individual components.
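For reference, the following are the usual definitions of the three quantities plotted in the next section; the paper uses them without stating the formulas explicitly, so they are given here as the standard assumptions:

throughput = (number of bytes decompressed and analysed) / (elapsed time)
S(p) = T(1) / T(p)        (speedup)
E(p) = S(p) / p           (efficiency)

where T(p) is the completion time of the simulation task using p processing elements.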
4 Results
[Figure 3: Comparison of architectures - efficiency vs. number of processing elements.]

The simulation results are given as average values from several independent runs and are shown in Figures 2 and 3. The differences between the architectures using up to eight PEs are insignificant and the speedup is almost equal to the theoretical optimum. Using 16 or more PEs the efficiency of the SMP and shared disk computers decreases, while the efficiency of the cluster is still larger than 90% for 32 nodes and larger than 80% for 64 nodes. However, the difference between the two cluster models is comparatively small. The investigation of the utilisation of the components reveals the bottleneck in the
shared everything and shared disk architectures. The main memory and the hard disk in the shared everything and shared disk architectures have a high degree of utilisation when using more than eight PEs. The utilisation of the bus is even higher. Using 16 or 32 PEs the increase in hard disk usage is smaller since the number of hard disks is doubled each time. The utilisation of the shared main memory in the shared disk architecture is insignificantly smaller (1 - 5%) than in the shared everything system due to data locality and thus a higher cache hit rate. Figure 4 shows the utilisation of the main memory and the hard disk(s), respectively. The high utilisation of the memory, the hard disk, and the bus leads to an increased number of collisions during the communication and therefore to a larger fluctuation of the execution time. As the time for the analysis of a file varies, the collisions are not predictable. A high number of collisions occurs after the first request is transmitted, when every PE starts to load its first file from hard disk to memory. Loading the second or later files produces fewer conflicts since the hard disk accesses do not occur simultaneously, i.e. the different file sizes result in varying processing times and thus the PEs do not require the bus at the same time. Hence, the throughput increases during the analysis process (see Figure 5). As in shared nothing architectures every PE has its own hard disk and bus system, this problem does not arise and the throughput is higher at the beginning. Another important reason for the low throughput at the beginning is conditioned by the chosen technique of performance measurement, since only completely processed files are considered for the measurement while uncompleted files are not assessed.
[Figure 4: Utilisation of (a) the main memory and (b) the hard disk(s) vs. number of processing elements, for the SMP and shared disk architectures.]
Figure 5. Normalised throughput vs. number of analysed files for shared everything architectures and cluster with distributed data storage
The different architectures do not exhibit a great difference in performance for requests which require the analysis of only a small subset of the database. Using 16 or more processors for large-scale databases the speedup of the shared disk and shared everything architectures increases less than the speedup of the shared nothing architectures. However, the cluster with 64 nodes has a high efficiency of over 80%. The advantage of this architecture can even be
increased using high-speed network technologies and sophisticated scheduling and load balancing algorithms.
5 Conclusions
In this paper a simulation-based assessment of parallel architectures for image databases was presented. According to the performance measurements for a multimedia retrieval, clusters are much better scalable than the shared disk or shared everything architectures. The clusters can consist of two or four processor SMP computers, as the SMP architecture has a high efficiency for small numbers of PEs. Future work includes a refinement of the simulated elements, i.e. the decomposition of the main components, e.g. PE, memory, and hard disk, into smaller units to achieve a more detailed model with the ability to assess the influence of the individual parts. A study to investigate the performance improvement due to increasing speeds of PEs, memory, and networks as well as the impact of scheduling and workload balancing strategies is also suggested.

References
1. Reuter A., Methods for parallel execution of complex database queries. Journal of Parallel Computing, 25(13-14) (1999) pp. 2177-2188.
2. Kao O., Towards Cluster Based Image Retrieval. Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), (2000) pp. 1307-1315.
3. Subrahmanian V.S. and Jajodia S. (Eds.), Multimedia Database Systems: Issues and Research Directions. (1996) Springer.
4. Cremonesi P., Rosti E., Serazzi G., and Smirni E., Performance evaluation of parallel systems. Journal of Parallel Computing, 25(13-14) (1999) pp. 1677-1698.
5. Lazarov V. and Iliev R., Discrete-event Simulation of Parallel Machines. IEEE Proceedings of the 2nd Symposium on Parallel Algorithms / Architecture Synthesis, (1997) pp. 300-307.
6. Kao O., Steinert G., and Drews F., Scheduling Aspects for Image Retrieval in Cluster-Based Image Databases. Proceedings of the IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2001), to be published.
STRUCTURED PARALLEL PROGRAMMING AND SHARED OBJECTS: EXPERIENCES IN DATA MINING CLASSIFIERS

GIANLUCA CARLETTI, MASSIMO COPPOLA
Universita di Pisa, Dipartimento di Informatica, Corso Italia 40, 56125 Pisa, Italy
[email protected] - http://www.di.unipi.it/~coppola

We propose the addition of a shared object abstraction to a skeleton-based parallel programming language, to deal with large shared data in dynamically behaving data-driven computations. These are commonly found in the field of Data Mining. We report parallel speed-up results of a task-parallel C4.5 classifier using a Shared Tree object. With respect to this implementation, we analyse the positive impact of the methodology on the fundamental characteristics of the language: expressiveness, programmability, and efficiency of applications.
1 Introduction
The goal of Parallel Programming Environments (PPE) is to allow writing high-level, efficient, portable parallel programs by hiding most of the machine-dependent issues inside the support of the environment. Our group has developed the SklE PPE 1,2, based on the concept of structured parallel programming. The coordination language SklE-CL employs skeletons to structure the parallelism and as a hosting mechanism for modules written in several sequential languages. The skeleton approach eases sequential code reuse and the parallel restructuring of applications. The present work describes experiments in adding a shared object support to SklE-CL, in the form of a Shared Tree (ST) object library. One of the most critical issues we observed in data-intensive applications is that different abstractions are still needed to efficiently express and support operations on distributed, non-local data. When data is out-of-core, different data layouts and access methods are needed (see for instance the survey by Vitter 3). Dealing with the memory hierarchy accounts for a significant portion of the algorithm design and programming efforts, especially for dynamic, irregular-size problems like multi-grid solvers, N-body simulations, and general Divide-and-Conquer (D&C) algorithms. Very often the dynamically evolving data structures which are needed are much larger than the main memory of a single processing node of NUMA and multicomputer architectures. The best implementation of an operation may depend on the size of the data structure involved. We can choose among sequential algorithms working either in-core or out-of-core, and parallel algorithms, depending on the relative sizes of the data, local and global memory. We are studying the feasibility of
the object abstraction for this purpose: objects are stored in a virtually shared memory, their methods hide the details of data management, and the object Run-Time Support (RTS) dynamically selects the appropriate method implementation. There is an increasing trend in applying object oriented methods to simplify the task of parallel programming. Some works 4 recast the concept of skeleton as a set of coordination classes and design patterns; other works use object oriented concepts and tools to ease software component interfacing in High Performance Computing environments 5. We are willing to keep the distinction between the parallel structure of software, which is given by skeleton composition, and the structure of data, which will be given by objects. Our approach aims at using objects to solve the problems of managing the data in a scalable way. The Shared Tree library is a first experiment along this research line. We have chosen Data Mining algorithms, and C4.5 in this work, to be the reference applications 6,7. Typical Data Mining (DM) problems can be formulated as searches, where a model space is explored proceeding through complex queries to the input data. Most search strategies can exploit inter-query and intra-query parallelism. C4.5 is a D&C algorithm where a tree model is built by independent sub-tasks using separate data partitions. It incurs the problems of load balancing and data locality exploitation that are typical of parallel DM algorithms. It also needs data parallelism (which is not yet implemented and hence is not thoroughly discussed here) to achieve good efficiency in its initial phase. We eventually want to encapsulate inside object interfaces the tree structure and the different implementations of the basic operations of the algorithm. This approach to D&C and irregular problems is more geared toward asynchronous parallelism than the pure data-parallel one.
2 Adding a Shared Tree Object to the SklE PPE
The Shared Tree (ST) is a general-purpose data structure, designed to be efficient for dynamic algorithms like C4.5. The ST layer manages dynamically shaped trees, with unbounded node degree and size of the data areas. The nodes are allocated in a virtually shared memory space. The library accesses the tree structure and the data stored in the nodes through the handle mechanisms provided by a lower level of software, the smReference. Any single node of the tree (see fig. 1) is composed of a fixed-size segment holding a label, structure information (up to a constant number of references to son nodes) and two links to dynamic segments, these holding further son handles and the data contained in the node. There are methods to create, visit and modify the tree structure, and to access the objects it holds.

[Figure 1: Structure of a ST node - a fixed-size segment with local references to the node structure and to the data, both allocated in shared memory.]

The language support level contains the templates that support the skeletons, and the experimental object management layer, the smReference. The smReference purpose is to be a common support for libraries supplying various kinds of shared object classes. The smReference interface provides the primitives needed to dynamically allocate, free and access segments of any size inside a shared memory space. An smReference opaque data type is used as a handle for them, with locking and mutual exclusion supported on the memory regions associated with handles. Each memory segment is a private address space of the process(es) sharing the handle, and interference among distinct object references is prevented by the support. System-wide communication support comes from MPI and the virtual distributed shared-memory library DVSA 8. The latter supplies the abstraction of a common address space in multiprocessor architectures. It statically allocates the shared areas at initialisation time. While the lowest layer of this software architecture is tied to specific hardware, we are working on a fully portable implementation. We plan to extend the ST definition with collective operations, i.e. methods which have a parallel implementation, to exploit parallel computation over huge data. Ideally we would have a single abstract interface that operates sequentially or in parallel depending on the size of the data and the kind of operation requested. The implementation of these methods and their integration with the RTS of the PPE are in the first design stage.
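To make the node layout above concrete, here is a hypothetical C++ sketch of an ST node. It is not the actual ST library: the type names, the handle representation and the commented interface signatures are all assumptions, intended only to illustrate the fixed-size segment plus two dynamic-segment handles described in the text.

// Illustrative sketch of a Shared Tree node and of an smReference-like handle.
#include <array>
#include <cstddef>
#include <cstdint>

// Opaque handle into the virtually shared memory, standing in for smReference.
struct SmReference { std::uint64_t segment_id = 0; };

constexpr std::size_t kInlineSons = 4;   // "up to a constant number" of inline son refs

struct StNode {
    std::uint32_t label = 0;                        // node label
    std::uint32_t n_sons = 0;                       // total number of sons
    std::array<SmReference, kInlineSons> sons{};    // first few son handles (fixed size)
    SmReference extra_sons;                         // dynamic segment: remaining son handles
    SmReference data;                               // dynamic segment: data stored in the node
};

// The ST interface would offer methods to create, visit and modify the tree,
// for example (signatures are illustrative only):
//   SmReference st_create_node(std::uint32_t label, std::size_t data_bytes);
//   void        st_add_son(SmReference parent, SmReference son);
//   void*       st_lock_data(SmReference node);   // mutual exclusion on the data segment
//   void        st_unlock_data(SmReference node);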
3 C4.5 as a Divide and Conquer Problem
The core of the C4.5 classifier 9, like most tree-induction algorithms, is the construction of a decision tree. This is a D&C process with a conquer step much easier than the divide and recursive steps. An analysis of a task-parallel implementation in SklE is reported in Becuzzi et al.7 Due to space limitations, here we briefly describe only the time-consuming divide step. The input is a database D in tabular form of N rows of k fields, each row containing attributes (a_1, ..., a_k) from the sets A_1, ..., A_k. There are two kinds of attributes: categorical (discrete, finite domain) ones and continuous, i.e.
real-valued ones. One of the categorical attributes is distinguished as the class of each tuple. The output of C4.5 is a decision tree T. This model of the data describes the behaviour of the class attribute w.r.t. the values of the other attributes. Each interior node (decision node) of T represents a test on the value of one of the attributes. The result of the test defines a partition of the data, with one branch made for each different outcome. The leaves of T are class-homogeneous subsets of the data. The tree is built in a greedy manner starting from the root, at each node choosing a test, partitioning the data and classifying the son nodes independently. The input data are partitioned over the current frontier of the tree during the building phase, and over the set of its leaves at the end of the algorithm. Two aspects of the algorithm mainly impact the parallelisation: the tree expansion strategy and how the split is evaluated and carried out. The divide operation requires two sub-steps: D1, computing a cost function over each column of the data, and D2, splitting the data according to the column that has the least cost. For each continuous attribute, step D1 needs to access the column in sorted order. Step D2 requires either to globally set up index structures of size O(N), or to actually permute the current partition according to one key column. For several nodes near the root of T, the size n of a data partition is larger than the available local memory. While the decision tree T easily fits in memory, it is very irregular. The classification procedure uses the whole current partition, so locality in the data access follows the evolving structure of the knowledge model. This is also why fully static, parallel work decompositions only scale to a limited extent. Many efforts have been put into developing algorithms that do not repeatedly sort the partition data, maintaining index structures across the D2 step. A commonly used variation is to have binary splits only, which impairs the performance w.r.t. categorical attributes, but allows some optimisations for the continuous ones. The sequential SLIQ classifier 10 uses vertical partitioning of the data, together with auxiliary information to maintain all the columns pre-sorted. The SPRINT algorithm 11 lowers the memory requirements and fully distributes the input over the processors, but to split the larger nodes it still requires hash tables of size O(N). These parallel solutions exploit data parallelism at each node evaluation, following a breadth-first strategy for the building phase, instead of the depth-first one of C4.5. Joshi et al. 12 pointed out that SPRINT needs to communicate an O(N) amount of data per processor at each split, so it is inherently unscalable. Their ScalParC uses a breadth-first level-synchronous approach, together with a custom parallel hashing and communications scheme. It is memory-scalable and has a better average split communication cost, with a worst case still O(N) per level.
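To make step D1 concrete, here is a small self-contained C++ sketch of an entropy-based cost computation for one categorical column. Note that C4.5 itself uses the gain-ratio criterion and handles continuous attributes via threshold search; this simplified information-gain version, with names of our own choosing, is only meant to illustrate the kind of per-column work involved.

// Information gain of splitting a node's data on one categorical attribute.
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

static double entropy(const std::map<int, std::size_t>& classCounts, std::size_t total) {
    double h = 0.0;
    for (const auto& kv : classCounts) {
        if (kv.second == 0 || total == 0) continue;
        double p = static_cast<double>(kv.second) / static_cast<double>(total);
        h -= p * std::log2(p);
    }
    return h;
}

// attribute[i] and klass[i] give the attribute value and class label of tuple i.
double informationGain(const std::vector<int>& attribute, const std::vector<int>& klass) {
    const std::size_t N = attribute.size();
    if (N == 0) return 0.0;
    std::map<int, std::size_t> classCounts;                 // class histogram of the node
    std::map<int, std::map<int, std::size_t>> perValue;     // class histogram per attribute value
    std::map<int, std::size_t> valueCounts;
    for (std::size_t i = 0; i < N; ++i) {
        ++classCounts[klass[i]];
        ++perValue[attribute[i]][klass[i]];
        ++valueCounts[attribute[i]];
    }
    double gain = entropy(classCounts, N);                  // entropy before the split
    for (const auto& kv : perValue) {                       // minus weighted entropy after it
        std::size_t nv = valueCounts[kv.first];
        gain -= (static_cast<double>(nv) / static_cast<double>(N)) * entropy(kv.second, nv);
    }
    return gain;    // D1 computes a cost like this for every column; D2 splits on the best one
}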
Instead of using a static data distribution and balancing the load with a level-wise approach, we want to exploit the locality of the algorithm where possible, to reduce the parallel overhead. We pursue a task partitioning approach in which larger tasks are computed in parallel, while sequential independent processes deal with the smaller ones. Load balancing is left to the implementation of the skeletons. The shared object abstraction allows us to express this kind of algorithm in a clean and structured way. Asynchronously exploiting parallel and sequential computations this way can boost efficiency and locality, as Srivastava et al.13 have pointed out for the Hybrid Parallel Formulation of decision-tree building algorithms.
4 Skeleton plus Shared Tree Implementation of C4.5
In Becuzzi et al. 7 we show that the task parallelisation of C4.5 has to address the problems of managing and communicating dynamic data structures. Delaying the threshold calculation for continuous attributes smooths the behaviour of task workload. We began to use the shared memory support to lower the communication costs. While in the "pure-skeleton" version the tree T was held inside the local environment of one of the modules, in the current program we use the ST library. We see the structure of the parallel implementation of C4.5 in fig. 2a. The recursive computation happens in a data-flow loop skeleton, containing a pipeline of two more skeletons, a farm one and a sequential one. The farm construct we placed the Divide within expresses the independent parallelism among separate subtree expansions. The sequential module Conquer sends back nodes requiring further work. The tree T is a Shared Tree object, whose leaves and pending decision nodes contain the associated input data. We initialise the root with the whole training set, then each node expansion physically splits its partition among the son nodes. The Conquer module doesn't contain any sequential code from C4.5, but is the right point where to apply a task selection policy 7 to enhance parallelism by choosing the best expansion order. A simple priority scheme works the best in our case. The farm template uses dynamic load-balancing and buffered communications, so its efficiency raises with the available parallelism, i.e. the amount of waiting tasks, and with the computation/communication ratio. The general trend is that larger nodes are slow to expand and generate many smaller ones, so to boost available parallelism we have to split the larger ones first. The Conquer module receives handles to the nodes and accordingly puts them in a priority queue. To increase the computation/communication ratio the Divide modules apply an expansion policy to choose how many nodes they build before
Figure 2. (a) Modular structure of the task parallelisation. (b) Per-task (subtree) classification time and time for shared memory operations, w.r.t. subtree size (number of nodes).
Figure 3. (a) Task completion time with pre-allocation on fixed and dynamic-size data segments. (b) Overall completion time with and without the ST library, w.r.t. parallelism.
Several parameters control this policy, the task size being the main one. The base heuristic is that large task expansion is needed to gain parallelism at startup, small tasks are best expanded sequentially, and the remaining ones are expanded to incomplete subtrees up to given bounds on the number of nodes and on the computation time. The actual limits were tuned following the experimental approach described in our previous work 7, taking into account the different cost model for communications. The first tests showed clearly that as the parallel work increases, the bandwidth of allocations required by the object layer may become too high for a centralised free-memory repository, due to the software lockout. In fig. 2b we see the effect on the application: task completion time Tt and shared memory overhead are plotted w.r.t. the number of node expansions for that partial subtree. The overhead clearly dominates the result. We solved the problem by introducing a pre-allocation primitive to local repositories, a solution that is best suited to algorithms that can foresee a batch of allocations. When an expanded subtree has to be copied back to the ST, all the ST nodes can be set up at once. Fig. 3a shows the improvement of Tt when using pre-allocation on part of or all the shared segments in each node. In fig. 3b we compare the completion time of our two task-parallel C4.5 implementations, with and without the Shared Tree. The overhead required by the object support code is noticeable for a very low number of processors, but access to remote memory removes the bottleneck in the Conquer module, thus reducing other overheads and improving the scalability. While the application without the ST reaches its limit at p = 6, the new one improves at least until p = 14.
The test file is Adult from the ML repository of the University of California, Irvine, which is made up of about 48 thousand records, with 8 categorical and 6 continuous attributes. For the results shown, large tasks have more than 2000 cases, small ones less than 50, and the sequential computation bounds are 1 second and 70 nodes. The software runs on a Meiko CS-2, a shared-nothing parallel machine with a fat-tree communication network.
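The pre-allocation primitive mentioned above can be pictured with the following minimal Java sketch. It is not the ST library's code; the class names, the fallback batch size and the reserve() interface are assumptions made only to illustrate how a foreseen batch of allocations can be served from a local repository after a single synchronised interaction with the central one.

import java.util.ArrayDeque;

final class PreallocatingRepository {
    private final ArrayDeque<Integer> localSlots = new ArrayDeque<>();
    private final CentralRepository central;

    PreallocatingRepository(CentralRepository central) { this.central = central; }

    // Called once before expanding a subtree, when the number of needed allocations can be foreseen.
    void preallocate(int count) {
        localSlots.addAll(central.reserve(count)); // single synchronised round-trip to the central repository
    }

    // Subsequent allocations are served locally, without software lockout.
    int allocate() {
        if (localSlots.isEmpty()) preallocate(16); // fallback batch size (assumption)
        return localSlots.pop();
    }

    interface CentralRepository {
        java.util.List<Integer> reserve(int count); // synchronised on the central free-memory structure
    }
}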
5 Conclusions
We exploited the D&C nature of tree induction to make use of a shared tree class which does not require synchronisations. The technique generalises to those D&C problems that have a simple conquer step. The Shared Tree library we have realised has proved useful in lowering the communication and synchronisation loads. The speedup results obtained are good when compared to other task-parallel implementations of classifiers. This is evidence that the object abstraction can be effective, within a skeleton structured PPE, in improving the support for dynamic irregular computations. Large distributed data structures enhance the expressiveness of the skeleton programming model. Distributed objects help to avoid some centralisation points in the computation, fully exploiting the nature of the communication and synchronisation patterns of the skeletons. The application programmer can more easily focus on separate issues (algorithm structure, data structure, application tuning) when designing the application. Clearly identified parallelism exploitation policies improve both code and performance portability. Future work will address three key points. The whole software support has to be made portable across different architectures. We need to improve the performance of dynamic allocations to enlarge the class of D&C problems that can adopt the parallelisation scheme we have shown. The main issue is complementing task parallelism with data parallelism, as the state of the art in parallel classifiers and D&C algorithms suggests. The target of our research is to test external-memory and data-parallel techniques with the Shared Tree, to subsequently add them to a general-purpose object layer.
Acknowledgements We wish to thank P. Becuzzi, D. Guerri and L. Vanneschi for their help, and M. Danelutto and M. Vanneschi for many insightful discussions about parallel object support.
References
1. M. Vanneschi. PQE2000: HPC Tools for Industrial Applications. IEEE Concurrency: Parallel, Distributed & Mobile Computing, 6(4):68-73, 1998.
2. B. Bacci, M. Danelutto, S. Pelagatti, and M. Vanneschi. SkIE: A heterogeneous environment for HPC applications. Parallel Computing, 25(13-14):1827-1852, December 1999.
3. J. S. Vitter. External Memory Algorithms and Data Structures: Dealing with MASSIVE DATA. Draft, http://www.cs.duke.edu/~jsv, Jan. 2000.
4. D. Goswami, A. Singh, and B. R. Preiss. Using object-oriented techniques for realizing parallel architectural skeletons. In S. Matsuoka et al. 14.
5. B. Smolinski, S. Kohn, N. Elliott, and N. Dykman. Language Interoperability for High-Performance Parallel Scientific Components. In S. Matsuoka et al. 14.
6. P. Becuzzi, M. Coppola, and M. Vanneschi. Mining of Association Rules in Very Large Databases: a Structured Parallel Approach. In Euro-Par '99 Parallel Processing, volume 1685 of LNCS, pages 1441-1450, 1999.
7. P. Becuzzi, M. Coppola, S. Ruggieri, and M. Vanneschi. Parallelisation of C4.5 as a Particular Divide and Conquer Computation. In J. Rolim et al., editors, Parallel and Distributed Processing, volume 1800 of LNCS, 2000.
8. F. Baiardi, D. Guerri, P. Mori, and L. Ricci. Evaluation of a virtual shared memory by the compilation of data parallel loops. In 8th Euromicro Workshop on Parallel and Distributed Processing. IEEE Press, Jan. 2000.
9. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, 1993.
10. M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In 5th Int. Conf. on Extending Database Technology, 1996.
11. J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A Scalable Parallel Classifier for Data Mining. In Proc. of the 22nd VLDB Conference, 1996.
12. M. V. Joshi, G. Karypis, and V. Kumar. ScalParC: A New Scalable and Efficient Parallel Classification Algorithm for Mining Large Datasets. In Proceedings of the 1998 International Parallel Processing Symposium, 1998.
13. A. Srivastava, E-H. Han, V. Kumar, and V. Singh. Parallel Formulations of Decision-Tree Classification Algorithms. Data Mining and Knowledge Discovery: An International Journal, 3(3):237-261, September 1999.
14. S. Matsuoka, R. R. Oldehoeft, and M. Tholburn, editors. Computing in Object-Oriented Parallel Environments, volume 1732 of LNCS, 1999.
ACCESS HISTORIES VERSUS SEGMENT HISTORIES FOR DATARACE DETECTION
MARK CHRISTIAENS, MICHIEL RONSSE AND KOEN DE BOSSCHERE
Ghent University, ELIS, Sint-Pietersnieuwstraat 41, 9000 Gent, Belgium
E-mail: {mchristi,ronsse,kdb}@elis.rug.ac.be
In this article, we compare two approaches for detecting dataraces in parallel shared memory programs: access histories and segment histories. We show that access histories are preferable in theory since they provide more context about the dataraces. In practice however, testing access histories using the Splash II benchmark, we found that access histories consume too much memory to be applicable for general programs. We also implemented access histories in our race detection system for Java called TRaDe. Using a novel approach that takes into account the type information still present while executing Java programs, we are able to improve the behavior of access histories so as to make them preferable to segment histories.
1 Introduction
A datarace occurs in a multi-threaded application when two or more threads access and modify a common data structure without properly synchronizing their activities. Dataraces are usually considered to be bugs and are generally accepted as being very hard to find. Every help we can get to automatically detect dataraces is welcome. Luckily, there exist several tools that attempt to locate dataraces: RecPlay 1, Eraser 2, JProbe 3, AssureJ 4, TRaDe 5, ... Both RecPlay and Eraser are tools to detect dataraces in general programs, while JProbe, AssureJ and TRaDe focus on Java programs. Benchmarks have shown TRaDe to be the fastest of the Java race detectors. 5 In order to detect dataraces, these tools need two pieces of information: a list describing accesses by the active threads to common data and the ordering (before, after or parallel) between these accesses. The ordering between accesses can be determined by using a construct called a vector clock. 6 The only important property of vector clocks, in the context of this article, is that they have a size proportional to the total number of parallel threads present in the system and can therefore become fairly large. In this article, we will focus on comparing two dual methods for storing accesses to variables: "access histories" 7 and "segment histories" 1. When using access histories, a list is associated with every variable. This list contains information on the most recent read operation per thread and the last write operation. Therefore, if there are #T threads, the number of entries in this list can be at most #T + 1.
The information must contain: a vector clock, indicating how this operation is ordered in relation to other operations, and an instruction pointer and the thread identification number (TID) of the thread that performed the operation, to identify the code that performed the operation. Each time a new operation is performed on a variable, the operations in its access history are checked to verify whether this operation constitutes a race. If so, a race is flagged; if not, the operation is entered into the access history, possibly retiring other operations from the access list. One quickly realizes that access histories may contain a lot of duplicate information. All the operations performed by one thread between two consecutive synchronization operations (a semaphore, a mutex, ...) are ordered in the same manner with respect to operations performed by other threads. We call such a group a segment. If we were to store all these operations into access histories, each operation would have the same vector clock. Segment histories are the dual approach of access histories. Instead of associating information on an operation to a variable, information on the variables that were accessed by a segment (a group of operations) is associated to this segment. A segment history contains: a vector clock, a read bitmap and a write bitmap indicating which variables were read or written, respectively, by the operations in this segment, and the thread identification number of the thread that performed the operations. Each time a read or write operation is performed by a thread, the read and write bitmaps of the current segment history of this thread are updated. Only when the segment is finished by a synchronization operation is the whole segment history checked against previous segment histories in order to find dataraces. Both segment histories and access histories have their advantages:
• A segment can contain thousands of read and write operations in an efficient manner.
• The cost of initializing segment histories is fairly high in comparison to access histories, since two bitmaps must be allocated. The access history only needs to be set up once and gradually grows during the execution of the program.
• The cost of updating segment histories is very low (setting one bit). Updating an access history, on the other hand, is costly.
• Checking for dataraces using segment histories is very efficient. The checking for dataraces is postponed until the segment is ended by a synchronization operation. At that point, all the accesses performed during the segment can be checked simultaneously.
• An advantage of access histories is that every read or write operation is checked immediately, so a datarace can be flagged as it occurs. More elaborate information on the exact circumstances of the datarace can then be given, like the location (stack-trace) in the code where the datarace was detected.
• Segment histories are local to one segment and therefore to one thread. No other threads will need to access the segment history while the segment is running. This means that virtually no synchronization is necessary when using segment histories.
From this, it is clear that when searching for dataraces, we should prefer using access histories since they provide us with immediate and more elaborate information on dataraces. It is however unclear whether it is possible to use them, since they may incur a large penalty.
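As a purely illustrative sketch, the two dual record types described above could look as follows in Java. These are not TRaDe's actual data structures; the field names, the integer vector-clock representation and the BitSet-based bitmaps are assumptions made for the example.

import java.util.BitSet;

// One entry of an access history: the most recent read per thread and the last
// write are stored as entries of this form, attached to a variable.
final class AccessHistoryEntry {
    final int[] vectorClock;       // orders this operation w.r.t. other threads' operations
    final int threadId;            // TID of the thread that performed the operation
    final long instructionPointer; // identifies the code that performed the operation

    AccessHistoryEntry(int[] vectorClock, int threadId, long instructionPointer) {
        this.vectorClock = vectorClock;
        this.threadId = threadId;
        this.instructionPointer = instructionPointer;
    }
}

// A segment history: one record per group of operations performed by a thread
// between two consecutive synchronization operations.
final class SegmentHistory {
    final int[] vectorClock;       // one vector clock for the whole segment
    final int threadId;            // TID of the owning thread
    final BitSet readBitmap;       // which monitored locations were read
    final BitSet writeBitmap;      // which monitored locations were written

    SegmentHistory(int[] vectorClock, int threadId, int monitoredLocations) {
        this.vectorClock = vectorClock;
        this.threadId = threadId;
        this.readBitmap = new BitSet(monitoredLocations);
        this.writeBitmap = new BitSet(monitoredLocations);
    }

    // Updating a segment history costs a single bit per access.
    void recordRead(int location)  { readBitmap.set(location); }
    void recordWrite(int location) { writeBitmap.set(location); }
}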
2 Histories for general programs and Java programs
We have used a number of applications from the Splash II benchmark suite 8 to compare access histories and segment histories. A short description of the applications is provided in Table 1, together with the command line parameters we used for our measurements. In Table 2 the behavior of the Splash benchmarks under normal conditions is characterized. We ran these benchmarks on an SMP Sun with 4 SuperSPARC processors running at 50 MHz with 192 MB shared memory. Each processor has a direct mapped L2 cache of 1024 KB. Both the data and code L1 caches are 4-way set associative and can store 16 KB. The operating system used was Solaris 2.7. We implemented both access histories and segment histories on top of JiTI 9. JiTI is a shared library which allows us to intercept any machine instruction by dynamically instrumenting an application as it is executed. Each time the original application executes an instruction reading from or writing to main memory, JiTI transfers control to our race detection routines, where we log the operations using either access or segment histories. Since JiTI works at the level of machine instructions, the granularity of the race detection is the basic addressable unit of the processor, which is a byte in this case. Hence the read and write bitmaps of the segment histories contain one bit per byte, and consequently we have to create one access history per byte too.
Table 1. Description of the benchmarks used (with the command line parameters where given):
Cholesky: Sparse matrix factorization
FMM: Simulation of a system of bodies (FMM < inputs/input.256)
LU C: Dense matrix factorization with contiguous blocks
LU NC: Dense matrix factorization with non-contiguous blocks
Ocean C: Simulation of large scale ocean movements with contiguous partitions (OCEAN -p4 -n66)
Ocean NC: Simulation of large scale ocean movements with non-contiguous partitions (OCEAN -p4 -n66)
Radix: Integer radix sort
Raytrace: Raytracing (RAYTRACE -p4 inputs/teapot.env)
Water N2: Evaluation of behaviour of water molecules using an O(n^2) algorithm (WATER-NSQUARED < input)
Water S: Evaluation of behaviour of water molecules using an O(n) algorithm (WATER-SPATIAL < input)
Table 2. From left to right: total number of threads created, total number of memory accesses, cumulative execution time of all threads, maximum memory consumption, total number of segments and the average number of memory accesses per segment (one row per benchmark of Table 1, plus their average).
We measured the behavior of the benchmarks under three different conditions. To get an idea of the overhead introduced by using JiTI, we first ran our benchmarks with JiTI present, so that every read and write operation was redirected to a dummy routine.
Table 3. From left to right: the name of the benchmark, and the execution time and memory consumption for an execution doing datarace detection using, respectively, a dummy JiTI instrumentation, segment histories and access histories. A run was stopped if it consumed more than 300 MB of memory; this is indicated by a †. The 'overhead' row of the table indicates the overhead compared to the original runs from Table 2.
Next we ran the benchmarks while doing race detection using segment histories. We concluded by running the benchmarks a third time with access histories. The results are shown in Table 3. It is clear from Table 3 that for general programs, it is not feasible to use access histories due to memory constraints. The main reason for this failure is that every byte of data in the memory must be tracked during execution, and this causes a huge overhead. The reason why segment histories perform so well can be seen clearly in Table 2. The average number of memory accesses per segment is very high, which indicates that one segment history will correspond on average to tens of thousands of access histories, which makes them more suitable for this approach. On average, the memory overhead of using segment histories is limited to 2.69. Associating access histories to every byte is clearly overkill and results in a counter-intuitive definition of a datarace. A programmer reasons not in terms of bytes but in terms of data structures. We must conclude that it is often preferable to associate access histories to data structures and not to individual bytes. In general programs, the information about which bytes constitute a data structure is lost. This is why we have turned to Java. In Java we still have type information at our disposal and we can therefore tune the access histories.
Table 4. Description of the Java benchmarks used:
Tomcat 3.2.1: JSP web server producing the webpage corresponding to num/numguess.jsp
Raja 0.4.0p3: Small ray trace program. Run with -v -p txt -r 128x128 -d20 -o Phong-128x128.png Phong.raj
SwingSet: Demo of the Swing widget set included with Sun's JDK 1.2.1. All tabs in the demo were clicked once.
Java2D: Demo of the Java 2D drawing package included with Sun's JDK 1.2.1. All tabs in the demo were clicked twice.
Colt 1.0.1.56: Matrix manipulation library
If the object is an array, only one access history is associated with the whole array. If the object is a description of a Java class, then an access history is associated with every static field of the class. Finally, if the object is a general object (for example an Integer, a JFrame, ...), then an access history is associated with every field of this object. We have adapted a Sun Java virtual machine (version 1.2.1) to do on-the-fly race detection for Java programs using the above mentioned novel approach 5. We have evaluated the performance of access histories for datarace detection in Java using the benchmark programs described in Table 4. A number of important characteristics of these benchmarks are given in Table 5. The results when doing datarace detection can be seen in Table 6. The measurements were performed using the interpreter version of the Java virtual machine 1.2.1 (-Djava.compiler=NONE) from Sun. The Java benchmarks were run on a Sun Ultra 5 workstation with 512 MB of memory and a 333 MHz UltraSPARC IIi with a 16 KB L1 cache and a 2 MB L2 cache. The memory overhead of using access histories is on average a factor 3.23, which is comparable to the memory overhead of segment histories in Table 3. This makes access histories a feasible and preferable alternative for doing datarace detection for Java programs. We have chosen not to implement segment histories since we believe it is clear that access histories are superior for Java applications. The number of read/write operations per segment, as shown in Table 5 and Table 2, is very high for general programs but much lower for Java programs. The reason for this is that the synchronization granularity of typical Java programs is much finer than for a typical C/FORTRAN program. Many data structures were designed to be thread safe from the start, so synchronizations occur without the programmer requesting it.
Name           #T    #A (10^6)   T (s)   M (MB)   #S (10^6)   #A/S
Tomcat 3.2.1   157   5.79        26      11.59    2.52        2.30
Raja 0.4.0p3   5     105.95      25      11.84    0.30        353.51
SwingSet       19    29.67       43      29.13    1.84        16.12
Java2D         123   171.52      672     38.55    2.26        76.04
Colt 1.0.1.56  15    69.79       80      12.95    0.30        230.55
Average              76.54       170     20.81    1.44        135.70
Table 5. From left to right is given: the name, the number of threads that were created during the run, the number of bytecodes accessing (reading or writing) the heap, the cumulative execution time of all threads, the maximum memory consumption during the run, the number of segments, and the number of bytecodes accessing the heap per segment.
Name       T (s)   Mem (MB)
Tomcat     494     30.77
Raja       219     20.74
SwingSet   162     125.04
Java2D     1075    133.66
Colt       167     26.29
Average    423.4   67.3
Overhead   2.5     3.23
Table 6. Results when doing datarace detection using access histories. From left to right we see the name of the benchmark, the total cumulative execution time of all the threads and the total memory consumption.
This makes an average segment very short. New segment histories would have to be set up continuously, while not providing much benefit in memory overhead since not many read/write operations would be compacted in them.
3 Conclusions
We have argued that a programmer looking for dataraces prefers to use access histories since these can provide him with more elaborate information about the location and circumstances of a datarace. Through measurements of the behavior of the two systems while doing datarace detection on the Splash II benchmark, we have shown that for general programs, the use of access
histories is almost impossible due to the large memory overhead. Segment histories perform well in these circumstances, mainly thanks to the coarse grained synchronization in these types of applications. We have also shown that in program environments where data structures can still be identified at runtime (like Java), the use of access histories is viable. Moreover, we have argued that access histories, using this novel approach, are desirable in cases where the granularity of synchronization is fine.
References
1. Michiel Ronsse and Koen De Bosschere. RecPlay: A fully integrated practical record/replay system. ACM Transactions on Computer Systems, 17(2):133-152, May 1999.
2. Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro, and Thomas Anderson. Eraser: A dynamic data race detector for multithreaded programs. In Operating Systems Review, volume 31, pages 27-37. ACM, October 1997.
3. KL Group, 260 King Street East, Toronto, Ontario, Canada. Getting Started with JProbe.
4. Kuck & Associates, Inc., 1906 Fox Drive, Champaign, IL 61820-7345, USA. Assure User's Manual, 2.0 edition, March 1999.
5. Mark Christiaens and Koen De Bosschere. TRaDe, a topological approach to on-the-fly race detection in Java programs. In Proceedings of the Java Virtual Machine Research and Technology Symposium 2001, pages 105-116, Monterey, California, USA, April 2001. USENIX.
6. Dieter Haban and Wolfgang Weigel. Global events and global breakpoints in distributed systems. In 21st Annual Hawaii International Conference on System Sciences, volume II, pages 166-175. IEEE Computer Society, January 1988.
7. A. Dinning and E. Schonberg. An empirical comparison of monitoring algorithms for access anomaly detection. In Second ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming, pages 1-10, March 1990.
8. S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In 22nd International Symposium on Computer Architecture, pages 24-36, Santa Margherita Ligure, Italy, June 1995.
9. M. Ronsse and K. De Bosschere. JiTI: A robust just in time instrumentation technique. In Proceedings of WBT-2000 (Workshop on Binary Translation), Philadelphia, October 2000.
ON SKELETONS & DESIGN PATTERNS
MARCO DANELUTTO
Dept. Computer Science - University of Pisa
Corso Italia 40 - 56125 PISA - Italy
[email protected] - www.di.unipi.it/~marcod
Algorithmical skeletons and parallel design patterns have been developed in completely disjoint research frameworks, aimed at providing the programmer of parallel applications with an effective programming environment. At the moment different authors recognize the concrete similarities in the two approaches. In this work we try to summarize and assess common points and differences between the two approaches. From the analysis we conclude that by merging the two concepts we can develop very effective parallel programming environments. We outline the features of a new parallel programming environment inheriting from both the skeleton and design pattern related research.
1 Introduction
Skeletons have been introduced to provide the programmer with simpler and effective ways of denoting common parallelism exploitation patterns 1. By recognizing that a limited number of parallelism exploitation patterns appear in the majority of parallel applications, different researchers designed programming environments that provide the programmer with language constructs (skeletons) modeling these patterns. Two major advantages are offered by skeletons: first, most or, in some cases, all of the code needed to derive object code from skeleton source is produced by the skeleton support (compiler + run time support). Second, the programmer is relieved of all the cumbersome and error prone programming details related to parallelism exploitation: concurrent/parallel activity setup, mapping and scheduling, implementation of communications and of accesses to shared data structures, concurrent activity synchronization and termination are all handled by the skeleton support. Design patterns have been introduced as a way of specifying recurrent (object oriented) programming problems along with the solutions that can be used to solve them 2. The key point in design patterns is that a design pattern not only describes a problem and a good solution to that problem; it also provides a good insight into the problem at hand by specifying its motivations, fields of applicability, implementation related details, examples of usage, etc. Therefore, the general idea of design patterns looks very close to the idea of skeletons, although the usage made of the two concepts by the programmer turns out to be significantly different. The idea of design pattern has
evolved since its introduction. At the very beginning, patterns were only considered in the field of sequential programming. Recently, some researchers migrated the concept of design pattern to the parallel programming field 3,4. In so doing, further contact points with skeletons became evident. In this paper, we discuss both the features that are shared among skeletons and parallel design patterns and those that are peculiar to one of the two approaches. Our thesis is that the skeleton and the parallel design pattern concepts could be merged into a new concept, inheriting the positive features of both worlds and providing the programmers of parallel applications with a more effective programming tool.
2 Skeletons
An algorithmical skeleton is a common, reusable, efficient parallelism exploitation pattern. Skeletons have been introduced by Cole in the late '80s 1, and different research groups participated in research activities on skeletons 5,6,7. Skeletons are usually given to the user/programmer as language constructs in a coordination language 5,6 or as libraries 8. In both cases the programmer using skeletons develops parallel applications by instantiating skeleton parameters to specify the domain specific code and parameters. The only knowledge exploited is the skeleton parallel and functional semantics. No knowledge is required concerning the target hardware. The skeleton support takes care of generating/running all the code necessary to implement the user code 9. The key point is the declarative aspect of parallel programming using skeletons and the clear separation between parallelism declaration and parallelism exploitation/implementation. The programmer declares the kind of parallelism he wants to exploit in the application, possibly using proper skeleton nesting. The skeleton support takes care of all of the details we need to cope with in order to implement that parallelism exploitation pattern matching the hardware features. This is both the strength and the weakness of skeletons: on the one hand, writing efficient parallel applications requires a moderate programming effort; on the other hand, it is difficult to implement parallel applications using parallelism exploitation patterns not covered by the available skeletons.
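To make the declarative flavour concrete, a task farm skeleton could be sketched in plain Java as follows. This is a hypothetical, minimal illustration, not the interface of any of the skeleton environments cited above; the class names and the thread-pool based implementation are assumptions made for the example. The programmer only supplies the sequential worker function, while scheduling, mapping and result collection stay inside the skeleton support.

import java.util.*;
import java.util.concurrent.*;
import java.util.function.Function;

final class Farm<I, O> {
    private final Function<I, O> worker;   // the only code the programmer writes
    private final ExecutorService pool;

    Farm(Function<I, O> worker, int parallelism) {
        this.worker = worker;
        this.pool = Executors.newFixedThreadPool(parallelism);
    }

    // The "skeleton support": submits every task to a worker and collects results in order.
    List<O> compute(List<I> tasks) throws Exception {
        List<Future<O>> futures = new ArrayList<>();
        for (I t : tasks) {
            Callable<O> task = () -> worker.apply(t);
            futures.add(pool.submit(task));
        }
        List<O> results = new ArrayList<>();
        for (Future<O> f : futures) results.add(f.get());
        pool.shutdown();
        return results;
    }
}

A user would then write, for instance, new Farm<Integer, Integer>(x -> x * x, 4).compute(tasks) and never deal with threads or communications directly.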
3 Design patterns
A design pattern is a representation of a common programming problem along with a tested, efficient solution for that problem 2 . Design patterns are usually presented in an object oriented programming framework, although the idea
is completely general and can be applied to different programming models. A design pattern includes different elements. The most important are: a name, denoting the pattern, a description of the problem that the pattern aims to solve, and a description of the solution. Other items, such as the typical usage of the pattern, the consequences achieved by using the pattern, and the motivations for the pattern, are also included. Therefore a design pattern, per se, is not a programming construct, as is the case for skeletons. Rather, design patterns can be viewed as "recipes" that can be used to achieve efficient solutions to common programming problems. In the original work discussed in Gamma's book 2, a relatively small number of patterns is discussed. The debate is open on the effectiveness of adding more and more specialized patterns to the design pattern set 10. This situation closely resembles the one found in the skeleton community some years ago. Skeleton designers recognized that a small set of skeletons may be used to model a wide range of parallel applications 6,11, but some authors claimed that specialized skeletons should be used to model application dependent parallelism exploitation patterns 12. Recently, some questions concerning design patterns have been posed that still have no definite answer: should design patterns become programming languages? Should they lead to the implementation of programming tools that provide the programmer with more efficient and "fast" OO programming environments? 13,14 Again, this is close to what happened in the skeleton community. After Cole's work it was not clear whether skeletons should have been made available as some kind of language constructs or whether they should have been used only to enforce a well founded parallel programming methodology. As a partial answer to the pattern/language question, the whole design pattern experience is being reconsidered in the parallel programming framework 3,15,4,16. The idea is to provide the user with a parallel programming environment (or language) based on design patterns. Most of the authors recognize many contact points with the skeleton work. The key point stressed is that parallel design patterns are presented, discussed and implemented (in some cases) as abstractions modeling common parallel programming paradigms, exactly in the same way skeletons have been traditionally presented. Common parallel design patterns have therefore been identified that cover pipeline computations, task farm (master/slave) computations, mesh (iterative, data parallel) computations, and divide and conquer computations, and different implementation methods have been proposed to provide the user with methods that can be used as parallel design patterns within a consistent programming environment 3,15,16.
4 Skeletons vs parallel design patterns
In this section we outline the results of the comparison of parallel programming environments based on skeletons 17,5,6,18 and on design patterns 3,14,15,16,4 with respect to high level features and implementation related features. The features discussed have been selected as they are of interest in the field of parallel programming and because they can be used to formulate an overall judgment on the effectiveness of parallel programming environments based on skeletons and/or design patterns. As an example, expandability guarantees that the choices made by the programming environment designers do not represent an obstacle in case the application framework requires slightly different parallelisation strategies; instead, user architecture targeting, i.e. the ability provided to the programmer to insert architecture specific code in order to enhance application performance, allows more efficient code to be developed in all those cases where the implementation strategies provided by the programming environment turn out to be scarcely effective. Other features, usually discussed in this context, have been left out as they are too "low level" with respect to the abstraction level we deal with in this work. E.g., we do not discuss which computing models are supported by skeletons and design patterns (i.e. SPMD, MIMD, etc.) as both skeleton and design pattern frameworks can perfectly implement any one of the models, possibly embedded in a more general (and higher level) parallelism exploitation skeleton/pattern. The comparison we performed (summarized in Table 1) on skeletons and patterns allowed us to conclude that skeletons and parallel design patterns share a lot of positive features, especially concerning the programming model presented to the user, but still have some points that make the two approaches different. Such points mainly concern the expressive power, ease of implementation and expandability of the two models. Skeletons have a better expressive power due to their fully declarative essence, leading to a better confinement of both parallelism exploitation and sequential code details. Some of the parallel design patterns proposed require programmer intervention which is not completely "high level", instead. As an example, task handling (e.g. retrieval of tasks to be computed) must be explicitly programmed by the programmer when using an embarrassingly parallel pattern 3, whereas it is automatically handled by the farm skeleton, exploiting the same kind of parallelism 17. The OO structure of parallel design pattern implementations represents a major advancement with respect to existing skeleton environment implementations. In particular, porting a parallel design pattern environment to a different machine should only require to properly subclass the existing classes implementing the patterns that do not match the new target hardware features.
Table 1. Skeleton vs. parallel design pattern based languages.
Style. Skeleton: completely declarative programming style. Parallel design pattern: OO style (declarative, but also requires some code to complete automatically generated methods).
What abstracts. Skeleton: common parallelism exploitation patterns. Parallel design pattern: common (OO) parallel/distributed programming patterns.
Level of abstraction. Skeleton: very high (qualitative description of parallelism exploitation patterns). Parallel design pattern: high/medium (qualitative description of parallelism exploitation patterns + high level implementation details/strategy).
Nesting. Skeleton: allowed, encouraged. Parallel design pattern: allowed, encouraged.
Code reuse. Skeleton: allowed, encouraged. Parallel design pattern: allowed, encouraged.
Sequential language support. Skeleton: any sequential language may be used to program sequential portions of the code. Parallel design pattern: OO languages must be used (Java, C++).
Expandability (skeleton, pattern set). Skeleton: poor/none (new skeletons require direct intervention on the implementation of the skeleton support). Parallel design pattern: very high (protected user access to the implementation layer + proper structure of the class hierarchy supporting the design pattern language).
Programming methodology. Skeleton: structure parallelism exploitation with skeletons, then leave the support to produce the "object" code. Parallel design pattern: structure parallelism exploitation with patterns, then leave the support to produce the "object" code, then patch implementation methods (performance fine tuning).
Portability. Skeleton: good (intervention required on the compiling tools, both core and back end). Parallel design pattern: very good (the OO structure of the support makes it easy to derive new classes matching new/different hardware features).
User task to complete code. Skeleton: programmer just supplies sequential portions of code used within skeletons. Parallel design pattern: programmer must complete sequential methods generated by the support.
Layering. Skeleton: skeleton layer declarative (user responsibility); runtime support layer machine dependent, imperative, skeleton support (designer responsibility). Parallel design pattern: present or absent; if present, pattern layer almost declarative, some specialization code needed (user responsibility); runtime support layer machine dependent, OO, pattern support (designer responsibility + partial user responsibility, or non-existing in some cases).
Performance model usage. Skeleton: possible, encouraged. Parallel design pattern: possible (the class hierarchy structure of the language support should enhance the possibility); not demonstrated.
Performance. Parallel design pattern: poor experimental data, promises to be good.
This should be a much easier process than the effort required to port skeleton environments to a new machine (this usually requires designing a new compiler core and back-end 9). Last but not least, parallel design pattern environments present a definitely good expandability (i.e. the possibility to use new parallelism exploitation patterns, not included in the original pattern set) 16.
The protected access to the implementation layer classes given to the programmer guarantees that both slightly different and completely new patterns can be added to the system by expert users. These new/modified patterns can be fully integrated (documented and certified) in the pattern system once proved to be correct and efficient.
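The portability and expandability arguments above come down to ordinary subclassing of implementation classes. The following Java fragment is only an illustration of that idea under assumed class names; it is not the class hierarchy of any of the cited pattern environments.

// Hypothetical implementation-layer classes for a farm pattern.
abstract class FarmImplementation {
    abstract void dispatch(Object task, int worker); // machine-specific task delivery
    abstract Object collect();                       // machine-specific result retrieval
}

// Default implementation, e.g. over plain TCP sockets (assumption).
class TcpFarmImplementation extends FarmImplementation {
    @Override void dispatch(Object task, int worker) { /* send the task over a socket */ }
    @Override Object collect() { /* receive a result */ return null; }
}

// Retargeting to a machine with a faster interconnect only requires a subclass;
// user-level pattern code is untouched.
class FastInterconnectFarmImplementation extends TcpFarmImplementation {
    @Override void dispatch(Object task, int worker) { /* use the native communication library */ }
    @Override Object collect() { /* use the native communication library */ return null; }
}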
5 Skeletons ∪ parallel design patterns
As skeletons and parallel design patterns present features that are both desirable and compatible, we started thinking that a "merge" of the two worlds is worth studying. The goal is the design of a parallel programming environment that allows all the advantages presented by the two worlds to be achieved, while preserving the positive high level features of both systems. Such a programming environment should provide the user with high level parallel skeletons. In order to use skeletons, the programmer must only supply the code denoting the sequential computations modeled by skeletons. The parallel skeleton support should be implemented by using a layered, OO design, derived from the pattern research results and similar to the one described in 16. Parallel skeletons can be declared as members of proper skeleton/pattern classes. Object code generation can be requested, leading to the automatic generation of a set of implementation classes, globally making up an implementation framework. The framework should address hardware targeting in the same way traditional skeleton implementations do 9. Exploiting standard OO visibility mechanisms, part of the implementation framework classes may be made visible to the programmer in such a way that he can perform different tasks: fine performance tuning (by overriding some of the implementation class methods), introduction of new, more efficient implementation schemas (by sub-classing existing implementation classes) or the introduction of new skeletons/patterns (introducing new skeleton classes, and using either existing or new implementation classes for the implementation). We claim that such a parallel programming environment will provide some definitely positive features: high programmability and rapid prototyping features will be inherited mainly from the skeleton context. Code reuse, portability and expandability will all come from features derived from both the skeleton and the design pattern contexts. Performance will mainly come from a clean implementation framework design, but can also take advantage of the adoption of analytical performance models such as the ones used in the skeleton work. Last but not least, documentation of the skeletons provided in the style of design patterns (motivation, applicability, structure, consequences,
sample code, known uses 2) will greatly help the programmers in the parallel code development. The ideas we have just outlined relative to the skeleton/pattern merging will be further studied and exploited by our research group within a couple of National Projects funded by the Italian National Research Council and Space Agency. Both in the design of a new skeleton based coordination language (ASSIST-CL), which will eventually be provided to the user, and in the implementation of the related compiling tools and run time support, we are planning to heavily use features and concepts derived from both the skeleton and the design pattern worlds.
6 Conclusions
We presented an original comparative analysis of the skeleton and parallel design pattern worlds. From the analysis we concluded that merging the two worlds will lead to better parallel programming environments. We eventually outlined the features that a parallel programming environment should make available in order to take advantage of both the skeleton and the parallel design pattern research. The parallel programming environment we envision as the result of the merge between the skeleton and the design pattern worlds looks similar to the one discussed in 4, with two main differences: first, the programmer will not be required to specify any code but that necessary to define the parallel structure of the application at hand and the sequential portions of code used as skeleton/pattern parameters. Second, and more important, the compiling tools transforming skeletons/patterns into object code should also perform hardware targeting without any explicit programmer intervention, exploiting the available knowledge relative to the target hardware.
References
1. M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computations. Pitman, 1989.
2. E. Gamma, R. Helm, R. Johnson, J. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison Wesley, 1994.
3. B. Massingill, T. Mattson, B. Sanders. A Pattern Language for Parallel Application Programs. Euro-Par 2000, LNCS 1900, p. 678-681, 2000.
4. S. McDonald, D. Szafron, J. Schaeffer, S. Bromling. Generating Parallel Program Frameworks from Parallel Design Patterns. Euro-Par 2000, LNCS 1900, p. 95-105, 2000.
5. B. Bacci, M. Danelutto, S. Pelagatti, M. Vanneschi. SkIE: a heterogeneous environment for HPC applications. Parallel Computing, 25:1827-1852, Dec 1999.
6. P. Au, J. Darlington, M. Ghanem, Y. Guo, H. W. To, J. Yang. Coordinating heterogeneous parallel computation. Euro-Par '96, LNCS 1123, p. 601-614, 1996.
7. H. Burkhart, S. Gutzwiller. Steps Towards Reusability and Portability in Parallel Programming. In Programming Environments for Massively Parallel Distributed Systems, p. 147-157. Birkhauser, 1994.
8. M. Danelutto, M. Stigliani. SKElib: parallel programming with skeletons in C. Euro-Par 2000, LNCS 1900, p. 1175-1184, 2000.
9. S. Pelagatti. Structured Development of Parallel Programs. Taylor & Francis, 1998.
10. E. Agerbo, A. Cornils. How to preserve the benefits of Design Patterns. OOPSLA '98, p. 134-143, 1998.
11. M. Danelutto, R. Di Meglio, S. Orlando, S. Pelagatti, M. Vanneschi. A methodology for the development and support of massively parallel programs. Future Generation Computer Systems, 8(1-3):205-220, July 1992.
12. C. W. Kessler. Symbolic Array Data Flow Analysis and Pattern Recognition in Numerical Codes. In K. M. Decker, R. M. Rehmann, editors, Programming Environments for Massively Parallel Distributed Systems, pp. 57-68. Birkhauser, 1994.
13. C. Chambers, B. Harrison, J. Vlissides. A Debate on Language and Tool Support for Design Patterns. POPL 2000, 2000.
14. B. Massingill, T. Mattson, B. Sanders. A Pattern Language for Parallel Application Languages. TR 99-022, Univ. Florida, CISE, 1999.
15. D. Goswami, A. Singh, B. R. Preiss. Using Object-Oriented Techniques for Realizing Parallel Architectural Skeletons. ISCOPE '99, LNCS 1732, p. 130-141, 1999.
16. S. McDonald, D. Szafron, J. Schaeffer, S. Bromling. From Patterns to Frameworks to Parallel Programs. Submitted, Dec. 2000.
17. B. Bacci, M. Danelutto, S. Orlando, S. Pelagatti, M. Vanneschi. P3L: A Structured High level programming language and its structured support. Concurrency Practice and Experience, 7(3):225-255, May 1995.
18. S. Newhouse, A. Mayer, J. Darlington. A Software Architecture for HPC Grid Applications. Euro-Par 2000, LNCS 1900, p. 686-689, 2000.
A PORTABLE MIDDLEWARE FOR BUILDING HIGH PERFORMANCE METACOMPUTERS*
M. DI SANTO, F. FRATTOLILLO, E. ZIMEO
School of Engineering, University of Sannio, Benevento, ITALY
W. RUSSO
DEIS, University of Calabria, Cosenza, ITALY
* RESEARCH PARTIALLY SUPPORTED BY THE REGIONE CAMPANIA UNDER GRANTS "LAW N. 41/94", BURC N. 3, JANUARY 17, 2000.
As a consequence of the growth of the Internet, new software infrastructures for distributed computing have become an effective support to solve large-scale problems by exploiting the available networked computing power. This paper presents such an infrastructure, a first definition of a Java middleware for metacomputing, called HiMM, whose main goal is to support parallel and distributed programming based on objects in heterogeneous environments. HiMM enables dynamic applications to run on collections of nodes virtually arranged according to a hierarchical network topology, mapped on wide-area physical networks of computers or clusters of computers.
1 Introduction
The exponential growth of the Internet makes an enormous amount of computing power available. As a consequence, the idea of harnessing such power in order to solve large-scale problems has become attractive. This idea aims at transforming the Internet into a sort of electricity grid which everyone can draw on for executing power-crunching applications in a collaborative computing environment. In order to distribute the computing power and so satisfy the calculation requirements of any networked user, special software infrastructures, called middleware, are necessary. Such infrastructures extend the concept, introduced by PVM, of a "parallel virtual machine" restricted and controlled by a unique user, making it possible to build computing architectures, called metacomputers, based on network contexts characterized by different administrative domains, physical networks and communication protocols 1,2.
The metacomputing architectures based on the Java-centric approach are nowadays arousing an increasing interest, since Java has been designed for programming in heterogeneous computing environments and provides direct support for multithreading, code mobility, and security. However, the Java distributed object model (RMI) gives adequate support neither to program dynamic parallel applications nor to run them efficiently on wide-area networks. To overcome these drawbacks, different solutions have been proposed 3,4,5. Most of them support parallel programming based on message passing communication models among workstation clusters or single workstations placed at random geographic locations. However, these communication models are considered inadequate to program collections of heterogeneous machines, such as clusters or MPPs, since they do not allow to actually exploit the heterogeneous features of computers and networks. Some other solutions, such as Manta 6, involve custom implementations of metacomputers, all based on the use of native code as well as of modified JVMs and Java compilers, so strongly limiting code portability.
In this paper we present a first definition of a middleware for metacomputing, called HiMM (Hierarchical Metacomputer Middleware), based on a previously implemented software infrastructure, called Moka 7. The goal of HiMM is to support parallel and distributed programming based on objects in heterogeneous environments. To this end, we have isolated a minimal set of basic services for running high performance dynamic applications on collections of parallel systems distributed over a wide area, and we have implemented them in HiMM by using Java in order to guarantee portability of both the system and the application code.
2 The metacomputer architecture
HiMM enables a metacomputer to be built and dynamically reconfigured. Such a metacomputer consists of computing nodes, each physically allocated onto a computer belonging to a collection of workstations and parallel systems interconnected by heterogeneous networks. The computers (hosts) taking part in a computation make their computing power available in a Web-based, collaborative network environment: systems wanting to donate some computing time have to register themselves on a specific server, called Resource Manager (RM), which will publish each registration on the Web. Users that need to run power-crunching applications can build and configure their metacomputers according to the requirements of their applications by simply connecting their computers to an RM: using a common Java-enabled browser, users can retrieve a Java applet, called console, from the RM, by which to grab some published, available computing power. The console, which is also available as a tool of the HiMM library, is executed on the host (root) from which the metacomputer configuration is launched, and carries out the configuration on the basis of information made available by the RM and stored on the root as an XML file. When the configuration has been performed, the console is used to build a dynamically configurable parallel abstract machine consisting of abstract
nodes mapped on the grabbed hosts. In particular, to meet the constraints of the Internet organization or to better exploit the performance of dedicated interconnects, each node is connected to the others by a virtual network characterized by a hierarchical topology, based on a multi-protocol communication layer able to exploit the features of the underlying physical networks in a user-transparent way; so, in the following we will refer to our metacomputers as HiMs (Hierarchical Metacomputers). In addition, each node can dynamically load the application code by exploiting the Distributed Class Storage System (DCSS) provided by an HiM. This way, an HiM can support an MIMD programming model, since each node can execute different program components that may not be present on the node.
2.1 The hierarchical organization of an HiM
Using collections of parallel systems distributed over a wide area means employing computing platforms, communication protocols and interconnection networks with very different characteristics. This is a real problem when achieving high speedup is the main goal of a parallel computation. To exploit the actual features of computers and networks, HiMM allows users to arrange the nodes in a metacomputer according to a hierarchical structure: for instance, nodes hosted by clusters of workstations intra-connected by a high speed network or hidden from the Internet can be managed as macro-nodes. A macro-node groups several nodes, each of which can in turn be a macro-node. This induces a level-based organization in the metacomputer network, in which all the nodes grouped in a macro-node belong to the same level. Each macro-node abstractly appears as a single computing unit and is characterized by two entities: a coordinator and an RM. The former, which is a particular node allocated onto the host by which the macro-node is physically interfaced to the metacomputer network infrastructure, manages all the nodes belonging to the macro-node. The latter publishes information concerning the computing power made available by the nodes belonging to the macro-node. Coordinators and nodes consist of two main components: a Node Manager (NM), which is a part of the core of the HiM, and a Node Engine (NE), which takes charge of receiving application messages and processing them using custom semantics. In particular, the NM of a node at a given level creates the local NE by using the configuration information supplied by the coordinator of that level, monitors the liveness of the HiM, participates in the distributed garbage collection of the HiM nodes, and sets up the macro-node sub-network on the basis of information published by the RM of that level. When the NM creates the NE on a node, it installs three specific components: the Execution Environment (EE), the Level Sender (LS) and the
Message Consumer (MC). They constitute an abstraction able to capture the node behavior during the execution of a distributed application. In particular, a message extracted from the network using a specific protocol is passed to the MC, which represents the interface between an incoming communication channel and the EE. Then, the message is synchronously or asynchronously delivered to the EE, which processes it according to its current semantics. As a result of the message processing, new messages can be created and sent to other nodes by using the LS, which represents the communication interface towards the nodes of a level. Differently from a simple node, a coordinator presents two LSs. Each of them interfaces the coordinator with a different network level (called the up and down levels), so allowing it to act as an intelligent application router between two adjacent levels.
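A minimal Java sketch of this message path is given below. The component names come from the paper, but the interfaces and signatures are assumptions introduced only to make the example self-contained; they are not HiMM's actual API.

// Hypothetical interfaces standing in for the real HiMM types.
interface LevelSender {
    void send(Object msg, int node); // communication interface towards the nodes of a level
}

interface ExecutionEnvironment {
    void process(Object msg, LevelSender out); // applies the node's current semantics
}

final class MessageConsumer {
    private final ExecutionEnvironment ee;
    private final LevelSender ls;

    MessageConsumer(ExecutionEnvironment ee, LevelSender ls) {
        this.ee = ee;
        this.ls = ls;
    }

    // Called when a message has been extracted from the network with some protocol;
    // the EE may in turn emit new messages through the Level Sender.
    void deliver(Object msg) {
        ee.process(msg, ls);
    }
}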
3 The HiMM services
HiMM is a software layer situated between the operating system and the application level, which acts as glue among the different computers, so that they can be used as nodes of a unique coherent HiM. In particular, HiMM hides the low-level details tied to heterogeneous distributed interactions by providing users with powerful control functions. To this end, HiMM, according to a framework approach, groups modules that offer particular sets of services concerning: the heterogeneity of networks and computers, hardware or software failures, the management of multiple administrative domains, the interoperability with other middleware. To allow HiMM to be incrementally implemented, we have divided its services in two main categories: basic and extended services. Basic services, the only ones implemented in this first phase, enable users: (1) to easily configure a metacomputer at startup and effectively manage at run-time both its computing infrastructure and the application code distribution; (2) to efficiently manage communication among nodes arranged according to a hierarchically structured network; (3) to parametrically distribute application workload among nodes on the basis both of their actual computational capacity and the characteristics of the network which connects them. Extended services will enable users: (1) to automatically manage node and link failures and to supervise the dynamic reduction of a virtual machine; (2) to support authentication and authorization on machines belonging to different administrative domains, in order to control both the integrity of data exchanged among nodes and the security on each node; (3) to support interoperability with other middleware. HiMM implements the basic services in several modules whose interfaces
are designed to easily support extended services in the future.
Configuration and management. Configuration and management services are provided on each node of an HiM by the component NodeMgr. In particular, due to the hierarchical structure of an HiM, the NodeMgr allows users to manage each level directly attached to the node by using one of its components: LevelMgr. During the building of an HiM, nodes can be created on each grabbed host (addNode(host)), and custom components can be loaded on each node (setModule(modules, node)) in order to complete the setup. This way, users can specify the semantics of the MC and EE for each node. In order to use the multi-protocol communication layer, the console enables users to specify the communication protocol that has to be used between every pair of nodes, among the macro-nodes themselves (setProtocol(prot, node1, node2)), and among nodes within each macro-node (setProtocol(prot, node)). This feature enables high performance, specific protocols to be exploited among nodes within a macro-node, whereas communications among macro-nodes at different geographic locations can be routed by employing standard protocols. Each node can interact with the other nodes of its own level (NodeMgr.currentLevelMgr()), whereas only a coordinator can interact with nodes belonging to its adjacent levels (NodeMgr.downLevelMgr() and NodeMgr.upLevelMgr()). Moreover, each node in a level is identified by an integer (LevelMgr.thisNode()) in the range from 0 to size-1, with size matching the number of nodes in the level (LevelMgr.size()). In particular, since a coordinator is interfaced to two adjacent levels, it is identified by two integers: the identifier used at the up level (NodeMgr.upLevelMgr().thisNode()), which is dynamically assigned during the configuration phase, and the identifier used at the down level (NodeMgr.downLevelMgr().thisNode()), which is equal to the constant value NodeMgr.COORD.
Communication among nodes. Communication towards the nodes of a level is managed by the component LevelSender. This component supports point-to-point asynchronous communication (send(msg, node)) among nodes of the same level. If a node at a given level wants to communicate with nodes belonging to other levels, it has to send the messages either to a macro-node or to the coordinator of its level, which take charge of carrying out message routing according to the semantics specified by the EE. In addition, HiMM supports two kinds of asynchronous broadcast communication mechanisms: a level broadcast (broadcast(msg)) and a descending broadcast (deepBroadcast(msg)). The former enables a node in a sub-network to communicate with all other nodes in the same sub-network. The latter routes messages sent by a node to the nodes belonging to both the sender level and all the other levels reachable through each macro-node that is recursively met.
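As a usage illustration, a configuration step driven by the console could look roughly as follows. The operation names (addNode, setModule, setProtocol) are taken from the description above, but the Console interface declared here, the integer node identifiers, the module names and the protocol strings are all assumptions made only so that the sketch compiles; they are not HiMM's real signatures.

// Hypothetical console interface, declared only for this example.
interface Console {
    int addNode(String host);                             // returns a node identifier
    void setModule(String[] modules, int node);           // install MC/EE modules on a node
    void setProtocol(String prot, int node);              // protocol used inside a macro-node
    void setProtocol(String prot, int node1, int node2);  // protocol used between two nodes
}

final class HiMSetup {
    static void configure(Console console) {
        int n0 = console.addNode("host0.example.org");
        int n1 = console.addNode("host1.example.org");
        int cluster = console.addNode("cluster-frontend.example.org"); // macro-node coordinator host

        // Load custom Message Consumer / Execution Environment modules on a node.
        console.setModule(new String[] { "MyMessageConsumer", "MyExecutionEnvironment" }, n0);

        // A fast dedicated protocol inside the macro-node, a standard protocol across the Internet.
        console.setProtocol("fast-local", cluster);
        console.setProtocol("tcp", n0, n1);
    }
}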
Application code management. HiMM implements the DCSS for the dynamic loading of application code. Application components are initially stored only on the root host. At run-time they are distributed among macro-node coordinators on the basis of the demands issued by the nodes (DCSS.loadClass()). This way, a coordinator represents a proxy cache for bytecode retrieval at each level.
Support to heterogeneous distributed programming. The support (HPHints) has been designed around the following main abstractions: the computational power of a node (HPHints.power(node)) and the quality of the network connection between two nodes (HPHints.connectionQoS(node)). The former abstraction is based on the idea that it is crucial to estimate the residual computing power of a node at run-time. Therefore, such power is calculated taking into account that: (1) a metacomputer can support a hierarchical organization of the network infrastructure; (2) more nodes may be located on a single host; (3) hosts can run in multiuser mode. Consequently, if a node is not a macro-node, its computational capacity is tied to the available residual computing power of the host; on the contrary, in the case of a macro-node, its power represents the available residual computing power of the whole sub-network identified by the macro-node. As concerns the latter abstraction, if the target node is not a macro-node, the connection quality is evaluated on the basis of the actual bandwidth and latency between the source and target nodes. If the target node is a macro-node, the connection quality is estimated by also evaluating the quality of the network connection toward any node in the sub-network managed by the macro-node.
4 Using the HiMM services to write Java distributed programs
To stress the effectiveness of HiMM, we have implemented a set of modules (AoLevSndr, AoMessageConsumer, AoExecutionEnvironment) that, loaded on each node of the HiM at configuration startup, enable users to program according to the Active Object model 7. This model is based on the following three main concepts: threads, which represent the active entities of a computation; global objects, which are common Java objects globally accessible through the HiM by using a global name space; and active objects, which are auto-executable messages used by threads to modify global objects. Threads are not given an identity, so they can interact only by accessing global objects through the asynchronous sending of active objects to the nodes hosting the implementations of the global objects themselves. At destination, an active object can access the local pool of global object implementations and so select and modify them. In particular, the code tied to an active object is always executed
inside the context of a thread: either the only one activated by the system (single-threaded up-call model) or the one started when an active object is received (pop-up threads model). This way, an active object can in turn become a thread able to communicate with other threads. When the AoLevSndr is installed, specific communication primitives are made available to send active objects across the hierarchical infrastructure of the HiM both asynchronously (request(ao, node), broadcast(ao), deepBroadcast(ao)) and synchronously (Promise call(sao, node), Barrier broadcast(sao), Barrier deepBroadcast(sao)). The use of these primitives and of the services of HiMM is shown by a simple program, which performs the parallel multiplication of two square matrices, A and B. The proposed solution replicates B on each node of the HiM, while A is divided row-wise into smaller matrices whose sizes are calculated on the basis of both the computational power and the connection quality of the nodes to which the sub-matrices will be sent.

public class Main implements ActiveObject {
  public void handler(AoExecutionEnvironment ee) {
    LocalNameSpace ns = ee.getNameSpace();
    < read matrices A and B >;
    GlobalUid gn = new GlobalUid();
    ns.bind(gn, new Matrix(B));
    AoLevSndr lev = (AoLevSndr) ee.getNodeMgr().downLevelSender();
    LevelMgr lmgr = ee.getNodeMgr().downLevelMgr();
    Barrier bar = lev.deepBroadcast(new SynCreate(gn, "Matrix", B), AoLevSndr.POPUP);
    Promise result = lev.call(new Part(A, gn), lmgr.thisNode(), AoLevSndr.POPUP);
    float[][] r = (float[][]) result.getValue();
  }
}
The active object Main replicates the matrix B as a global object created on each node of the HiM by invoking deepBroadcast(), and sends the matrix A to the current node as a field of the active object Part.

class Part implements ActiveObject {
  private float[][] mat;
  private GlobalUid gn;
  HPHints hints;
  private int nodeCapacity(int node) {
    return hints.power(node) * hints.connectionQoS(node);
  }
  public Part(float[][] part, GlobalUid g) { mat = part; gn = g; }
  public Object handler(AoExecutionEnvironment ee) {
    if (!ee.getNodeMgr().isCoordinator())
      return ((Matrix) (ee.getNameSpace()).lookUp(gn)).multiply(mat);
    else {
      LevelMgr lmgr = ee.getNodeMgr().downLevelMgr();
      hints = lmgr.getHPHints();
      Promise[] result = new Promise[lmgr.size()];
      AoLevSndr lev = (AoLevSndr) ee.getNodeMgr().downLevelSender();
      for (int i = 0; (i < lmgr.size()) && (i < mat.length); i++) {
        result[i] = lev.call(new Part(sMat, gn), i, AoLevSndr.POPUP);
      }
      < return the evaluated matrix mat x B >
    }
  }
}
On each node, the active object Part gets the local instance of Matrix and: (1) invokes its method multiply, if the node is not a coordinator; (2) otherwise it divides mat row-wise and sends each of the obtained sub-matrices to a node belonging to the down level.
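The sizing of the sub-matrices (the sMat passed to each Part above) is left elided in the listing; one possible realization, consistent with the use of nodeCapacity(i) = power(i) * connectionQoS(i), is to split the rows in proportion to the per-node capacities. The Python sketch below is purely illustrative and all names in it are assumptions for the sketch, not HiMM code.

# Illustrative sketch: split `rows` matrix rows among nodes in proportion to
# nodeCapacity(i) = power(i) * connectionQoS(i), mirroring the elided sMat sizing.
def partition_rows(rows, power, qos):
    capacity = [p * q for p, q in zip(power, qos)]
    total = sum(capacity)
    # Provisional shares rounded down; the leftover rows go to the nodes
    # with the largest fractional parts.
    shares = [rows * c // total for c in capacity]
    leftovers = sorted(range(len(capacity)),
                       key=lambda i: (rows * capacity[i]) % total,
                       reverse=True)
    for i in leftovers[:rows - sum(shares)]:
        shares[i] += 1
    # Return (first_row, n_rows) for every node.
    bounds, start = [], 0
    for s in shares:
        bounds.append((start, s))
        start += s
    return bounds

if __name__ == "__main__":
    # Three nodes with different residual power and connection quality.
    print(partition_rows(10, power=[4, 2, 1], qos=[1, 1, 2]))
    # e.g. [(0, 5), (5, 3), (8, 2)]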
5
Conclusions
The goal of our work was to define a minimal, extensible middleware to build metacomputers on heterogeneous networks of workstations and MPPs in order to support the efficient execution of parallel and distributed object-based programs. The first implementation in Java provides a minimal set of programming primitives and services for developing and running dynamic applications on collections of nodes virtually arranged according to a hierarchical network topology. Such an organization, together with high-level programming abstractions, enables users to better exploit the different performance levels of the parallel machines and clusters taking part in a computation. Moreover, due to its reflective behavior, HiMM allows users to install specific execution environments on top of a basic distributed architecture during the configuration phase. This feature promotes the use of our middleware to program parallel and distributed applications using custom communication and programming models.
References
1. I. Foster et al., The Anatomy of the Grid: Enabling Scalable Virtual Organizations, Intl Journal of Supercomputer Applications, 15(3), 2001.
2. A. Grimshaw et al., Wide-Area Computing: Resource Sharing on a Large Scale, IEEE Computer, 32(5):29-37, 1999.
3. I. Foster et al., Technologies for Ubiquitous Supercomputing: A Java Interface to the Nexus Communication System, Concurrency: Pract. & Exp., 9(6):465-475, June 1997.
4. M. O. Neary et al., Javelin: Parallel Computing on the Internet, Intl Journal on Future Generation Computer Systems, Elsevier Science, 15(5-6):659-674, 1999.
5. M. Beck et al., Harness: A Next Generation Distributed Virtual Machine, Intl Journal on Future Generation Computer Systems, Elsevier Science, 15(5-6):571-582, 1999.
6. R. V. van Nieuwpoort et al., Wide-Area Parallel Programming using the Remote Method Invocation Model, Concurrency: Pract. & Exp., 12(8):643-666, August 2000.
7. M. Di Santo et al., An Approach to Asynchronous Object-Oriented Parallel and Distributed Computing on Wide-Area Systems, Intl Workshop on Java for Parallel and Distributed Computing, Mexico, May 1-5, 2000.
Using a Parallel Library of Sparse Linear Algebra in a Fluid Dynamics Application Code on Linux Clusters
Salvatore Filippone, Universita di Roma "Tor Vergata"
Pasqua D'Ambra, CPS-CNR
Michele Colajanni, Universita di Modena e Reggio Emilia
January 10, 2002
Abstract
Many computationally intensive problems in engineering and science, such as those driven by Partial Differential Equations (PDEs), give rise to large and sparse linear systems of equations. Fast and efficient methods for their solution are very important because these systems usually occur in the innermost loop of the computational scheme. Parallelization is often necessary to achieve an acceptable level of performance. We report on our experience in applying a library of sparse linear algebra numerical software for parallel distributed memory machines; we experimented with it from inside a well-known fluid dynamics application code for engine simulation on Linux clusters.
1
Introduction
Many scientific applications that motivate the use of parallel computers require the solution of large and sparse linear systems of equations. The possibility of identifying basic kernels common to many parallel computations has motivated several parallel BLAS-related projects. In particular, our group designed and implemented the Parallel Sparse BLAS library [6], which facilitates the implementation of modern solvers in a parallel environment. Poor efficiency and the difficulty of integrating basic routines with real codes are common criticisms against the use of libraries for parallel computing. This paper aims to address the latter issue. To this purpose, we tested the applicability of the PSBLAS library by implementing a parallel version of the KIVA-3 fluid dynamics code. This application code had already been parallelized to a large extent during the PINEAPL (Parallel Industrial NumErical Applications and Portable Library) project [2]. Many interfacing problems were already analyzed during that project; in the present paper we discuss the adaptation of the existing interface code to the PSBLAS library. To achieve maximum portability, we used a cluster of commodity workstations running the Linux operating system. This also allowed us to test its device drivers, thus obtaining some interesting performance results.
2
PSBLAS Library Overview
The PSBLAS library is internally implemented in a mixture of Fortran 77 and C language. The interface, built on top of the Fortran 77 kernels, is based on Fortran 90 [7]. This language choice
for the interface is at the right level of abstraction for the target applications of the PSBLAS library, and allows the use of advanced features, such as operator overloading and derived data type definition. The advanced Fortran 90 interface and the reliance on a proposed standard for serial sparse BLAS computations [5] are the main features that differentiate our project from other ongoing research in the same area. For a more complete comparison see [6]. The PSBLAS library is designed to handle the implementation of iterative solvers for sparse linear systems on distributed memory parallel computers. The system coefficient matrix A is assumed to be square, but is otherwise general. The serial computation parts are based on the serial sparse BLAS [5], so that any extension made to the data structures of the serial kernels is available to the parallel version. The layered structure of the PSBLAS library is shown in Figure 1; we mainly use the Fortran 90 layer because of the facilities it provides for dynamic data management and for the compactness of the calling sequences.
Figure 1: PSBLAS library components hierarchy: the Application sits on top of the PSBLAS Fortran 90 layer, which wraps the PSBLAS Fortran 77 kernels; these in turn rely on the serial Sparse BLAS and on the BLACS over MPI.
2.1
PSBLAS routines
The PSBLAS library consists of two classes of subroutines, that is, the computational and the auxiliary routines. The computational routine set includes:
• Sparse matrix by dense matrix product;
• Sparse triangular systems solution for block diagonal matrices;
• Vector and matrix norms;
• Dense matrix sums;
• Dot products.
The auxiliary routine set includes:
• Communication descriptors allocation;
• Dense and sparse matrix allocation;
• Dense and sparse matrix build and update;
• Sparse matrix and data distribution preprocessing.
The current implementation of PSBLAS addresses a distributed memory execution model operating with message passing.
2.2
PSBLAS data structures
In any distributed memory application, the data structures that represent the partition of the computational problem are essential to the viability and efficiency of the entire parallel implementation. The general criteria guiding the decomposition choices are:
1. maximizing load balancing,
2. minimizing communication costs,
3. optimizing the efficiency of the serial computation parts.
The PSBLAS library addresses parallel sparse iterative solvers typically used in the numerical solution of PDEs. In these instances, it is necessary to pay special attention to the structure of the problem from which the application of the solver originates. The nonzero pattern of a matrix arising from the discretization of a PDE is influenced by various factors, such as the shape of the domain, the discretization strategy, and the equation/unknown ordering. The matrix itself can be interpreted as the adjacency matrix of the graph associated with the discretization mesh. Keeping this in mind, the allocation of the coefficient matrix for the linear system, in the PSBLAS approach, is based on the "owner computes" rule: the variable associated with each mesh point is assigned to a process that will own the corresponding row in the coefficient matrix and will carry out all related computations. This allocation strategy is equivalent to a partition of the discretization mesh into sub-domains. PSBLAS routines support any distribution that keeps together the coefficients of each matrix row; there are no other constraints on the variable assignment. The capability of supporting arbitrary data distribution choices is quite appealing for our intended application, in that it guarantees us the freedom to choose whatever data distribution we deem appropriate for our particular problem. The user defines matrix entries in terms of the global equation numbering; each process in the parallel machine builds the sparse matrix rows that are assigned to it by the user. Once the build step is completed, the local part of the matrix undergoes a preprocessing operation. During this step the global numbering scheme is converted into the local numbering scheme, and the local sparse matrix representation is converted to a format suitable for subsequent computations. The sparsity pattern of the matrix determines the communication requirements among different processors; two matrices of the same size distributed in the same way may have very different communication requirements. This is because the need to exchange a data item between two processes arises whenever the matrix row assigned to a process contains a nonzero entry corresponding to a matrix row (and thus an associated variable) assigned to a different process; the latter process will then have to send the current value of the variable (i.e. vector entry), so that the former may complete its computation (e.g. computing a matrix-vector product). The PSBLAS library uses the sparsity pattern of the coefficient matrix to build a suitable representation of the communication requirements, which is stored in a set of communication descriptors. Although the computation of these descriptors is rather expensive, it needs to be executed only when the mesh topology changes. More details about these items may be found in [6]. Here, we note that the use of the communication descriptors is encapsulated in suitable library routines, so that it is not necessary to delve into the internal storage format.
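As an illustration of how the sparsity pattern alone fixes the communication requirements, the following Python sketch computes, for a block-row distribution of a matrix in CSR form, the off-process variables (the halo) that each process must receive before a matrix-vector product. It is a schematic model only and does not reflect the internal format of the PSBLAS communication descriptors.

# Schematic halo computation from a CSR sparsity pattern (illustrative only).
# Process p owns rows [row_part[p], row_part[p+1]); any column index owned by
# another process is a variable that must be received before y = A*x.
import bisect

def halo_lists(indptr, indices, row_part):
    nproc = len(row_part) - 1
    halos = [set() for _ in range(nproc)]
    for p in range(nproc):
        lo, hi = row_part[p], row_part[p + 1]
        for row in range(lo, hi):
            for col in indices[indptr[row]:indptr[row + 1]]:
                owner = bisect.bisect_right(row_part, col) - 1
                if owner != p:
                    halos[p].add(col)
    return [sorted(h) for h in halos]

if __name__ == "__main__":
    # 4x4 tridiagonal pattern distributed over 2 processes (2 rows each).
    indptr  = [0, 2, 5, 8, 10]
    indices = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
    print(halo_lists(indptr, indices, row_part=[0, 2, 4]))
    # -> [[2], [1]]: each process needs one boundary variable from the other.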
2.3
Iterative method implementation using PSBLAS
To give an example about the simplicity of using the PSBLAS library routines, we describe in Table 1 the template for the Bi-CGSTAB method from [3], with local ILU preconditioning and
normwise backward error stopping criterion [1]. This example demonstrates the high readability and usability of the PSBLAS Fortran 90 interface: the mathematical formulation of the algorithm is directly comparable to the PSBLAS implementation.
Template [3]:
  Compute $r^{(0)} = b - Ax^{(0)}$; choose $q$ (e.g. $q = r^{(0)}$)
  for $i = 1, 2, \ldots$
    $\rho_{i-1} = q^T r^{(i-1)}$
    if $i = 1$ then
      $p^{(1)} = r^{(0)}$
    else
      if $\rho_{i-1} = 0$: failure
      $\beta_{i-1} = (\rho_{i-1}/\rho_{i-2})(\alpha_{i-1}/\omega_{i-1})$
      $p^{(i)} = r^{(i-1)} + \beta_{i-1}(p^{(i-1)} - \omega_{i-1} v^{(i-1)})$
    endif
    solve $M\hat{p} = p^{(i)}$
    $v^{(i)} = A\hat{p}$
    $\alpha_i = \rho_{i-1}/(q^T v^{(i)})$
    $s = r^{(i-1)} - \alpha_i v^{(i)}$
    solve $M\hat{s} = s$
    $t = A\hat{s}$
    $\omega_i = t^T s / t^T t$
    $x^{(i)} = x^{(i-1)} + \alpha_i \hat{p} + \omega_i \hat{s}$
    $r^{(i)} = s - \omega_i t$
    Check convergence: $\|r^{(i)}\|_\infty < \epsilon (\|A\|_\infty \|x^{(i)}\|_\infty + \|b\|_\infty)$
  end

PSBLAS implementation:
  call f90_psaxpby(one,b,zero,r,desc_a)
  call f90_psspmm(-one,A,x,one,r,desc_a)
  call f90_psaxpby(one,r,zero,q,desc_a)
  bni = f90_psamax(b,desc_a)
  ani = f90_psnrmi(A,desc_a)
  do it = 1, itmax
    rho_old = rho
    rho = f90_psdot(q,r,desc_a)
    if (it == 1) then
      call f90_psaxpby(one,r,zero,p,desc_a)
    else
      if (rho == 0) stop
      beta = (rho/rho_old)*(alpha/omega)
      call f90_psaxpby(-omega,v,one,p,desc_a)
      call f90_psaxpby(one,r,beta,p,desc_a)
    endif
    call f90_psspsm(one,L,p,zero,w,desc_a)
    call f90_psspsm(one,U,w,zero,phat,desc_a)
    call f90_psspmm(one,A,phat,zero,v,desc_a)
    alpha = f90_psdot(q,v,desc_a)
    alpha = rho/alpha
    call f90_psaxpby(one,r,zero,s,desc_a)
    call f90_psaxpby(-alpha,v,one,s,desc_a)
    call f90_psspsm(one,L,s,zero,w,desc_a)
    call f90_psspsm(one,U,w,zero,shat,desc_a)
    call f90_psspmm(one,A,shat,zero,t,desc_a)
    omega = f90_psdot(t,s,desc_a)
    temp = f90_psdot(t,t,desc_a)
    omega = omega/temp
    call f90_psaxpby(alpha,phat,one,x,desc_a)
    call f90_psaxpby(omega,shat,one,x,desc_a)
    call f90_psaxpby(one,s,zero,r,desc_a)
    call f90_psaxpby(-omega,t,one,r,desc_a)
    rni = f90_psamax(r,desc_a)
    xni = f90_psamax(x,desc_a)
    err = rni/(ani*xni+bni)
    if (err.le.eps) return
  enddo
Table 1: Sample Bi-CGSTAB implementation
3
KIVA-3 Code Overview
KIVA-3 is a CFD software package, developed at Los Alamos National Laboratories, for numerical simulation of chemically reactive flows with sprays. This software, widely used in engine applications, is also used at the Engine Technical Innovation Department of Piaggio & Co. S.p.A., for simulations of the scavenging process in two-stroke spark ignition engines, in order to predict performance of new engines in terms of consumption and pollution emission in the pre-design phase [2].
The mathematical model considered in KIVA-3 is the complete system of unsteady Navier-Stokes equations for compressible and turbulent flows, coupled with chemical kinetic and spray droplet dynamic models. The numerical method for solving the complete system of equations is based on a variable-step backward Euler temporal finite difference scheme, where the time steps are chosen using accuracy criteria. Each time step defines a cycle divided into three phases, corresponding to a physical splitting technique; the last two phases are devoted to the solution of the fluid motion equations. The spatial discretization of these equations is based on a finite volume method. During the second phase, the diffusion terms and the terms associated with pressure wave propagation are implicitly solved by a modified version of the SIMPLE (Semi-Implicit Method for Pressure-Linked Equations) algorithm [8]. This algorithm, well known in the CFD community, is an iterative procedure to solve the velocity, temperature and pressure equations, where the main computational kernel is the solution of large and sparse linear systems. In the original KIVA-3 code, the sparse linear systems arising from the implicit solution of the diffusion terms inside the SIMPLE algorithm are solved through a version of the diagonally preconditioned Conjugate Residual method in a matrix-free approach.
4
Integration of the Parallel Numerical Library
The KIVA code was originally designed as a purely serial application; therefore, to parallelize it by means of our numerical library we had to modify the code structure. Since our objective was to demonstrate the viability of general purpose library routines, we attempted to maximize the modularization and reusability of the resulting codes; therefore we wrote a number of subroutines to interface the original code and data structures to the library solvers. The spatial discretization of the equations uses a staggered grid, where thermodynamic quantities are referred to the cell centers and velocities are referred to the cell vertices. Since the solution of equations for thermodynamic quantities (such as temperature, pressure and turbulence) requires cell center and cell face values, involving averaging among cells that share the same cell face, the linear systems arising from the temperature, pressure and turbulence equations have coefficient matrices with the same sparsity pattern; they are unsymmetric but with a symmetric sparsity pattern, with no more than 19 non-zero entries per row. In the case of the linear systems arising from the velocity equation, the discretization scheme leads to unsymmetric coefficient matrices with no more than 3 x 27 non-zero entries per row. In Figure 2 we show the typical sparsity patterns of the matrices arising from the temperature and velocity equations for one of our reference test cases. The data distribution used in our experiments is a slightly generalized BLOCK distribution; a graph-partitioning based distribution is also supported by our version of the code, and we plan to experiment with graph partitioning tools in the near future. Once the matrix rows have been assigned, it is necessary to compute and store all the auxiliary data necessary to the implementation of the data exchange among processes. These distribution-related variables depend on the number of processes in the parallel application and on the size and sparsity pattern of the coefficient matrix; therefore they need only be computed when the sparsity pattern changes. During the simulation it is necessary to follow the piston movement in the cylinder; this is mostly done by stretching or shrinking the mesh cells, but when the aspect ratio deteriorates, horizontal layers of the computational mesh are cut away (or added back), thus changing the number of cells and the mesh topology. These events are called "snapper" points; they are relatively infrequent: in our reference test case there are only 33 "snapper" points out of 851 total time steps. The code to build the coefficient matrix and the right-hand side can be derived from the residual calculations in the original implementation.
Figure 2: Sparsity patterns of the matrices arising from the temperature and velocity equations.
The original code typically loops twice on all cell indices, first scattering and then gathering contributions from/to the current cell index residual, based on the value of the field in that cell and on the boundary conditions in force. It is then straightforward, even if tedious, to keep track of these contributions in a sparse matrix representation. Once the sparse matrix for one of the scalar fields has been built, its sparsity pattern can be reused for all the other scalars; thus for every linear system solution between two successive starting points we employ the pattern reuse capabilities of the PSBLAS library, instead of regenerating the matrix from scratch. The linear solvers themselves are those provided in the PSBLAS library; they are very similar to the sample code shown in Table 1. For the momentum equation we used the same template code, but we substituted some specialized routines to perform the matrix-vector product without explicit assembly of the coefficient matrix. In fact, in the momentum equation each entry in the coefficient matrix is a 3 x 3 dense block, since it operates on the three velocity components; the memory requirements would have been very large and would have caused a serious slowdown. To perform the data exchange among processors we build a "ghost" matrix, having the same connectivity as the momentum equation matrix, with simple 1 x 1 entries, and then we apply the resulting data exchange patterns to the momentum vectors. Another feature of the PSBLAS library we used is the ability to operate on dense matrices; because of the above considerations, it is convenient to apply the data exchange operations to a dense matrix with three columns containing, for each row, the three components of the velocity vector corresponding to that index. The linear solvers for the momentum, temperature and pressure equations are tied together in the SIMPLE loop: after each linear solver invocation, the relevant vectors are rebuilt with all-to-all operations. In principle, it would also be possible to exchange only the so-called boundary data values; this is currently being implemented and tested. These data exchange operations are all carried out by using support routines provided by the PSBLAS library, and more specifically by the f90_pshalo and f90_psdgatherm routines, which perform the boundary data exchange and all-to-all communication respectively. We rarely call the BLACS routines directly, since most of the operations we need are nicely encapsulated in library support routines.
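A minimal sketch of this bookkeeping, written in Python rather than in the Fortran of KIVA-3 and with made-up names and values, is the accumulation of per-cell contributions as (row, column, value) triplets followed by a conversion to a row-compressed form; on later time steps only the values need to be refilled as long as the pattern is unchanged.

# Illustrative sketch: turn scatter/gather residual contributions into a
# sparse matrix. Cell couplings and values below are made up for the example.
from collections import defaultdict

def assemble(n, contributions):
    # contributions: iterable of (row, col, value) produced while looping over
    # the cells; duplicate (row, col) pairs are summed, as the scatter/gather
    # residual loops would do implicitly.
    acc = defaultdict(float)
    for i, j, v in contributions:
        acc[(i, j)] += v
    # Convert to CSR-like arrays (indptr, indices, values).
    indptr, indices, values = [0], [], []
    for i in range(n):
        row = sorted((j, v) for (r, j), v in acc.items() if r == i)
        indices.extend(j for j, _ in row)
        values.extend(v for _, v in row)
        indptr.append(len(indices))
    return indptr, indices, values

if __name__ == "__main__":
    # Two cells coupled to themselves and to each other (hypothetical values).
    triplets = [(0, 0, 2.0), (0, 1, -1.0), (1, 1, 2.0), (1, 0, -1.0), (0, 0, 0.5)]
    print(assemble(2, triplets))
    # -> ([0, 2, 4], [0, 1, 0, 1], [2.5, -1.0, -1.0, 2.0])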
5
Results
The main intent of this work was to demonstrate that the PSBLAS library is adequate for integration in complex applications. Indeed, it took very little time to integrate the PSBLAS solvers into the KIVA-3 application, and this clearly demonstrates that the service level provided by the PSBLAS library is already adequate to the application needs. For these reasons, little emphasis has been put so far on raw performance results. We have run a few test cases on an inexpensive cluster of workstations, based on the Linux operating system. From a single processor perspective, there already is an improvement with respect to the original KIVA solvers. This improvement is due to two main factors: the new, modern solvers available, and the change in the solver implementation, which now relies on matrix assembly. The new solvers perform much better than the original ones in terms of number of iterations and overall convergence behaviour. The matrix assembly approach is particularly favourable on machines using microprocessors based on the Intel x86 architecture. This is because the resulting matrix-vector product procedure is more balanced in the execution of integer operations with respect to floating point operations, which is an advantage on this architecture. The test case for which we show our results is derived from the PINEAPL project [2]; it is a two-stroke engine modeled with a mesh having a total of about 20000 cell indices. Indices are reordered at various stages in the simulation; the first batch of cell indices, corresponding to cells at or outside the physical boundaries, changes in size at each snapper point. The scalar linear systems vary accordingly from a minimum size of 9272 equations with 147016 nonzero entries (approximately 0.17% of the total number of entries) to a maximum of 15672 equations with 265452 nonzero entries (or 0.11%); the corresponding average number of nonzero entries per row varies from 15.9 to 16.9. The simulation times in Table 2 are split into total and linear solver times; this is because the application is not yet fully parallelized, so that the remaining serial part becomes a bottleneck. We plan to complete the parallelization in the future; a preliminary analysis has shown that the explicit part of the computation can be parallelized based on the same data exchange routines already developed for the linear solvers used in the implicit phase. In all experiments we used the MPICH implementation of MPI on a standard 100 Mb/s Ethernet with a switch.1 As can be seen from Table 2, the parallel efficiency is not very attractive when running on the Pentium II cluster. The most important factor is that the test case is fairly small compared to the speed of today's processors; nonetheless, we chose to use it in the tests because its numerical behaviour was well understood. The network connection is somewhat slow as well; this is due to the overhead of the TCP/IP protocol, both in the resulting latency and in the asymptotic bandwidth of the connection. The speedups obtained on the linear solvers are more interesting on the Pentium machines: since the solvers are computationally more intensive, the network connection is more adequate, even if the overall performance is not very interesting. Another interesting phenomenon is the serial speedup with respect to the original KIVA code; this is especially apparent on the Pentium Pro cluster.
The explanation for this is that the PSBLAS code is more balanced in the usage of integer and floating-point execution units, and therefore it can be executed effectively on a wider range of CPU architectures. We are now moving the experiments onto the GAMMA implementation of MPI [4]. Preliminary results show an advantage of more than one order of magnitude in the communication latency. This is the most important network parameter for our applications, which make a large use of sparse linear algebra solvers.
1 The Pentium Pro experiments have been carried out on a Beowulf cluster operated by the Center for Research on Parallel Computing and Supercomputers (CPS)-CNR, Naples, Italy; the Pentium and Pentium II experiments have been performed at the University of Rome "Tor Vergata".
Simulation timings in seconds, total (linear solvers)
                 NP   Pentium          Pentium Pro      Pentium II
Original code     1   21259 (13362)    17190 (11226)    10377 (5982)
PSBLAS code       1   19201 (10126)    11697 (5764)     10146 (5225)
                  2   14200 (6325)     10397 (4472)     8431 (3710)
                  3   13649 (5720)     9893 (3970)      8011 (4671)
Table 2: Timings
6
Conclusions
In this paper we verified that the PSBLAS library provides adequate support for the development of real parallel applications. In particular, our experience with a fluid dynamics application demonstrates that PSBLAS provides consistent numerical convergence behavior, quite comparable with that of commercial subroutine libraries. We plan to test the application on other platforms, including Beowulf clusters with optimized network drivers and parallel machines such as the IBM SP2, where the PSBLAS library has already proven its efficiency.
References
[1] M. Arioli, I. Duff, and D. Ruiz. Stopping criteria for iterative solvers. SIAM J. Matrix Anal. Appl., 13:138-144, 1992.
[2] L. Arnone, P. D'Ambra, and S. Filippone. A parallel version of KIVA-3 based on general purpose numerical software and its use in two-stroke engine applications. Int. J. of Comp. Research, 2001. To appear.
[3] R. Barrett, M. Berry, T. Chan, J. Demmel, J. Donat, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst. Templates for the solution of linear systems. SIAM, 1993.
[4] G. Chiola and G. Ciaccio. GAMMA: a low-cost network of workstations based on active messages. In Proceedings of PDP'97 (5th EUROMICRO Workshop on Parallel and Distributed Processing), London, January 1997.
[5] I. S. Duff, M. Marrone, G. Radicati, and C. Vittoli. Level 3 basic linear algebra subprograms for sparse matrices: a user level interface. ACM Trans. Math. Softw., 23(3):379-401, September 1997.
[6] Salvatore Filippone and Michele Colajanni. PSBLAS: A library for parallel linear algebra computation on sparse matrices. ACM Transactions on Mathematical Software, 26(4):527-550, December 2000.
[7] Michael Metcalf and John Reid. Fortran 90 Explained. Oxford University Press, 1990.
[8] S. V. Patankar. Numerical Heat Transfer and Fluid Flow. Hemisphere Publ. Corp., 1985.
Performance evaluation of a graphic accelerators cluster
M.R. GUARRACINO, G. LACCETTI, D. ROMANO
Center for Research on Parallel Computing and Supercomputers CPS-CNR, Via Cintia, 80126 Naples, Italy
E-mail: {mario.guarracino, giuliano.laccetti, diego.romano}@dma.unina.it
Abstract
High performance computer systems based on commercial off-the-shelf components, namely Beowulf class systems, have been widely adopted to develop and test scientific and engineering applications. Each computing node of such a distributed memory multicomputer is a PC or a workstation connected to the others within a system area network. In this paper we show how to harness the computing power of the accelerated graphic cards available in each node, realizing an efficient, portable and scalable software library for parallel hardware rendering. A detailed performance analysis of a graphic accelerators cluster, based on the evaluation of a distributed memory VRML (Virtual Reality Modeling Language) browser, is given in terms of execution times and well-known performance parameters.
1
Introduction
Many scientific and engineering applications produce data that need to be displayed and analysed. When data are processed on distributed memory parallel machines, results are spread among nodes and need to be collected and displayed: visualization introduces a serialization in the computation process, in the sense that results first have to be gathered and only then displayed. In the present work our aim has been to harness the computing power of the accelerated graphic cards available in the nodes of a Beowulf class multicomputer, realizing an efficient, portable and scalable software library for parallel hardware rendering. Such a tool enables the visualization of results while they still reside in the local memory of the computing nodes, eliminating the need to gather data before analysis. Only recently has it become possible to exploit the power of accelerated graphics cards by means of specialized Linux software drivers. Such drivers enable the use of OpenGL5, a collection of some hundreds of software library functions, which allows the specification of 3D objects and of the operations involved in the generation
of high quality graphic images, providing application program interfaces to graphical hardware. Our work has focused on the implementation of a parallel library conforming to the OpenGL specification, tuned for low-cost parallel architectures, such as Beowulf systems. The library is based on TNT_PMESA 4 by Sandia Laboratories, which allows the use of the Mesa rendering library a to obtain a parallel software renderer, and it differs from it in the fact that it allows hardware rendering on each node. Among the libraries implementing parallel OpenGL there are: WireGL3, which differs from ours in the fact that it renders a complete part of the final image on each node to compose it on a tiled display; and PMESA6, which implements a parallel library for shared memory multiprocessors. The work is organized as follows: in the next section an overview of the parallel algorithm is given; section 3 is centered on the implementation issues; in section 4 the test case is described; finally, in section 5, the performance evaluation is discussed.
2 Parallel algorithm
In this section let us consider the objects as surfaces constituted by faces. The object properties are represented through the discretization of the external surface: a rounded shape may be approximated with a number of triangles such that the surface is rendered through an optical effect. Just as the rendering of a triangle does not take more time when the dimensions of the triangle grow, a surface may take a shorter time to be rendered if it is discretized more coarsely, through a smaller number of triangles. If more processes cooperate in the task, the workload may be organized so that they work on the same number of triangles: if there are N objects and p processes, each single object is subdivided into p parts, one for each process. At the end of the distribution each process holds 1/p of each of the N objects. With this strategy the surface polygons of the objects are equally distributed and each process renders them locally; finally the obtained images are composed into a complete image. Such composition is obtained through the binary-swap algorithm, which follows the scheme in Fig. 1 in the case of p = 8 processes. Let p be the number of processors and a the number of pixels; the binary-swap algorithm ends in $\log_2 p$ steps. At the i-th step each processor receives from the processor at distance $2^{i-1}$ an amount of pixels equal to $a/2^i$, to be composed with its own pixels obtained through local rendering. At the end of the composition each processor will have exchanged and composed $\frac{p-1}{p}\,a$ pixels.
a B. Paul et al., "The Mesa 3D Graphics Library", http://mesa3d.sourceforge.net
Figure 1: Binary-swap scheme
Indeed, let $T_a$ be the time needed to compose two pixels and $T_s$ the time to exchange one pixel between two processors; the compositing time $T_c$ will be

$T_c = \sum_{i=1}^{\log_2 p} \frac{a}{2^i}\,(T_a + T_s). \qquad (1)$
We notice that only point-to-point communications are needed to compose the final image, and their number increases logarithmically with the number of processors.
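The exchange pattern can be written down explicitly; the short Python sketch below (illustrative only, not part of the library) lists, for each of the $\log_2 p$ stages, the partner of every process and the shrinking amount of pixels it receives, together with the compositing time predicted by equation (1).

# Illustrative binary-swap schedule for p processes (p a power of two) over an
# image of `a` pixels, plus the cost model of equation (1).
from math import log2

def binary_swap_schedule(p, a):
    stages = []
    for i in range(1, int(log2(p)) + 1):
        dist = 2 ** (i - 1)
        step = []
        for rank in range(p):
            partner = rank ^ dist                       # exchange partner at stage i
            step.append((rank, partner, a // 2 ** i))   # pixels received at stage i
        stages.append(step)
    return stages

def compositing_time(p, a, Ta, Ts):
    # T_c = sum_{i=1..log2 p} (a / 2^i) * (Ta + Ts)  -- equation (1)
    return sum(a / 2 ** i for i in range(1, int(log2(p)) + 1)) * (Ta + Ts)

if __name__ == "__main__":
    for step in binary_swap_schedule(p=4, a=1024):
        print(step)
    print(compositing_time(p=4, a=1024, Ta=1e-8, Ts=1e-7))  # made-up timings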
3 Implementation issues
The Linux drivers for graphic cards are mainly developed within the XFree86 project, whose aim is to provide open source X servers for various platforms and operating systems. This project provides the tools to use 2D and 3D accelerated cards and, due to the tight relationship between X servers and graphic drivers, it is not possible to use the drivers without such servers. That means it is not possible to obtain the results computed by graphic cards for off-screen rendering unless a server is running on each node. On the other hand, since Beowulfs do not have consoles on computing nodes, X servers fail to start due to the lack of a secure console monitor. For this reason, security has to be changed so that it is possible to start a server, and a process running on a node can connect to the server even if it does not have console ownership. Data is transferred to the graphic card through AGP and read and written using OpenGL primitives. The present software architecture is constituted of Linux Red Hat 7.0 with kernel 2.2.16, XFree86 4.0.1, Mesa Library 3.3 and MPICH 1.1.2.
The parallel OpenGL library is organized to be transparent to the calling application. Apart from the standard functions, there is a function to initialize the processes and the environment for the distributed execution and a function to compose and display the final image.
4 Test case: a VRML browser
VRML (Virtual Reality Modeling Language) 6 is the standard language to describe three-dimensional objects and interactive worlds. Let us take the example of a fridge design: it is possible to see if a bottle lying on a shelf prevents the closure of the door, or any other usage constraint related to its design. In this example the project manager evaluates the functionality of his product without building an expensive prototype. We obviously need a visualization software, called a browser, whose aims are the interpretation of the information in the compressed VRML files, the representation of the virtual world using the human perspective, and the interaction with the represented objects. Our performance analysis regards the pLookat1 browser, a software conforming to the ISO/IEC 14772-1:1997 standard (VRML97) and originally based on a parallel library for software rendering. In the same work 1 a performance analysis based on software rendering is given, whereas the present analysis regards hardware rendering. Tests are performed on VRML scenes with a complexity of up to 72 million polygons and various window dimensions.
5 Performance analysis
The browser has been tested on a 28-node Beowulf parallel computer operated by CPS-CNR; each node has a 450 MHz Pentium II processor, 256 MB RAM and a 10 GB hard disk. The 32 processors are connected by a switched 100 Mb/s full duplex fast Ethernet network. The video card in each node is an AGP 2x Matrox Millennium with MGA-G200 chipset and 8 MB memory. To evaluate the performance of the parallel algorithm, several examples with different characteristics and dimensions have been used. Timings for the sequential program are obtained by setting p = 1 in the parallel version; the additional time introduced in the parallel version to transfer data through AGP to local memory does not considerably affect the total time. First of all let us observe the diagram in Fig. 2, which refers to 200 x 160 and 400 x 320 pixels windows. This comparative diagram puts in evidence a dichotomy: increasing the number of processors, the total rendering time decreases in each of the considered cases, approaching two different time values and resulting in a smoother interaction with the virtual world for the smaller window. The rendering on each node of hundreds of thousands of polygons is fast, but total times depend mainly on the image composition, determined by the number of pixels and the network speed. This explains why the curves approach two different time values, related to the two image dimensions, as the number of processors grows. Therefore an important result can be highlighted: since the execution time for the compositing stage has been taken separately, it is clear that, regardless of the problem dimension, i.e. the number of objects represented by the polygons, the compositing time remains the same for a given dimension of the visualization window. Indeed, the compositing stage of the final image begins when the local rendering is completed and it depends only on the number of pixels constituting the window, as can be seen in Fig. 3. This is due to the fact that only one-to-one communications are used to gather the final image, as shown in equation (1).
6 "Information technology - Computer graphics and image processing - The Virtual Reality Modeling Language (VRML) - Part 1: Functional specification and UTF-8 encoding", ISO/IEC 14772-1:1997, http://www.web3d.org/Specifications/VRML97/
Now, to evaluate in detail the effects of the parallelization, two classical parameters, speed-up (Sp) and efficiency (Ep), are used 2:

$S_p = T_1(n)/T_p(n), \qquad E_p = S_p/p$
where $T_1$ is the execution time on 1 processor and $T_p$ is the execution time on p processors, evaluated as a mean of 10 executions; n is the dimension of the problem. Let us consider the diagram of Fig. 4.
Figure 3: Compositing time for different window dimensions
Figure 4: Speed-up for a 300x300 window
Table 1: Speed-up of 2617130 polygons in a 300x300 window
p     Sp
2     1.70133
4     2.94723
8     4.66606
12    5.80772
16    6.51119
20    6.82305
24    7.04523
28    7.23492
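Since $E_p = S_p/p$, the efficiencies corresponding to Table 1 can be recovered directly from the tabulated speed-ups; the few lines of Python below do so (the numbers are those of Table 1, the code itself is only an illustration).

# Efficiency Ep = Sp / p computed from the speed-ups of Table 1.
speedups = {2: 1.70133, 4: 2.94723, 8: 4.66606, 12: 5.80772,
            16: 6.51119, 20: 6.82305, 24: 7.04523, 28: 7.23492}
for p, sp in speedups.items():
    print(p, round(sp / p, 3))
# e.g. p = 28 gives Ep ~ 0.258 for this fixed-size problem, which is why the
# scaled speed-up and efficiency are introduced next.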
Figure 5: Efficiency for a 300x300 window
It may be noted that, as the number of polygons grows, the gain in terms of speed-up is evident. For example, the case of 2617130 polygons in a 300x300 window is given in Table 1. Let us move now to the efficiency analysis for the parallel algorithm. As shown in Fig. 5, the efficiency values grow in the case of problems of larger dimensions (2617130 polygons), decreasing appreciably in the case of just a few hundred thousand polygons and a small number of processors. If the problem is not large enough, the communication time exceeds the time for local rendering. Since we must compose the image in a stage which takes a constant time, given the network speed and the image dimensions, it is useful to introduce the concepts of scaled speed-up and efficiency2:
$S_p = \frac{p\,T_1(n)}{T_p(pn)}, \qquad E_p = \frac{S_p}{p} = \frac{T_1(n)}{T_p(pn)}$
where, as before, $T_i(m)$ indicates the execution time of the algorithm on i processors for a problem of dimension m. Using these parameters we may understand the behaviour of our parallel renderer in the case of very detailed models, composed of a large number of polygons, with more and more processors involved in the computation. Let us analyze Table 2: if we scale the dimension of the problem with an increasing number of processors, we obtain values that show the efficiency of the parallel algorithm in the representation of extremely detailed, rather than extremely large, images. The diagrams show how the algorithm scales for a larger number of processors; the efficiency never drops below 0.9, reaching almost ideal values.
Table 2: Scaled speed-up and efficiency of 2617130 polygons in a 300x300 window
p     Sp        Ep
2     1.95922   0.979609
4     3.89499   0.973748
8     7.70718   0.963398
12    11.5281   0.960672
16    15.2911   0.955692
20    19.0944   0.954718
24    22.8948   0.953952
28    26.6626   0.952235
6
Concluding remarks
The performance analysis proves that the use of graphic accelerator clusters, on a low-cost communication network, can speed up rendering, even if real-time interaction is obtainable only with small windows; on the other hand, it is very efficient for visualizing realistic objects, lowering the rendering time. The scaled analysis shows that the performance of a graphic accelerator cluster scales as the number of processors increases.
References
1. M.R. Guarracino, G. Laccetti, D. Romano, "Browsing Virtual Reality on a PC Cluster", Proc. of IEEE International Conference on Cluster Computing (CLUSTER2000), 2000.
2. J.L. Gustafson, G.R. Montry, R.E. Benner, "Development of Parallel Methods for a 1024-Processor Hypercube", J. Sci. Stat. Comp., SIAM, 9, 4, 1988.
3. G. Humphreys, I. Buck, M. Eldridge, P. Hanrahan, "Distributed Rendering for Scalable Displays", Proc. of Supercomputing 2000 Conference (SC2000), 2000.
4. T. Mitra, T. Chiueh, "Implementation and Evaluation of the Parallel Mesa Library", Proc. of IEEE International Conference on Parallel and Distributed Systems (ICPADS), 1998.
5. M. Segal, K. Akeley, "The OpenGL Graphics System: A Specification (Version 1.2.1)", Silicon Graphics, 1999.
6. A. Vartanian, J-L. Bchennec, N. Drach-Temam, "The Best Distribution for a Parallel OpenGL 3D Engine with Texture Caches", Proc. of The Sixth International Symposium on High-Performance Computer Architecture (IEEE HPCA), 2000.
ANALYSIS AND IMPROVEMENT OF DATA LOCALITY FOR THE TRANSPOSITION OF A SPARSE MATRIX
D. B. HERAS, J. C. CABALEIRO, F. F. RIVERA
Department of Electronics and Computer Science, Univ. Santiago de Compostela, Spain
e-mail: dora,caba,[email protected]
In this work we apply a quantitative data locality model, introduced in a previous paper, to a basic algebra kernel: the transposition of a sparse matrix (SpMT). The locality model is validated on the cache memory, as this is a level where the impact of locality is high and easily measurable. Matrices with a broad range of patterns have been used for the evaluation. Based on this model, and in order to increase the locality, we have modified the layout of data in memory by applying heuristic techniques based on graphs. These techniques have been compared to some standard ordering algorithms.
1
Introduction
A large number of scientific applications work with sparse matrices stored in special compressed formats 1. As a result, these codes present irregular accesses whose pattern is difficult to predict and improve, thus degrading the performance in the use of the memory hierarchy. We have chosen the sparse matrix transposition algorithm as the object of this work because the pattern of the irregular accesses this code presents is very similar to the one obtained in the execution of the product of a sparse matrix by a dense vector (SpM x V) that was studied in 2. However, it presents higher complexity and, consequently, the effect of locality is more important. The vast majority of studies on the memory behaviour of irregular codes focus on modelling the effect of data locality over the cache level, that is to say, cache misses 3. Thus, these models try to obtain a precise model characterizing the hit ratio in a particular cache. We develop a more general model, as we consider locality in the execution of the code, not its effects. Our model takes into account only the characteristics of the code and of the sparse data structures, and it is not linked to a particular memory hierarchy, being thus simpler than the ones cited. A usual application of a locality model is to guide locality improvement. The traditional approaches for increasing locality are code oriented. Examples of these are blocking, loop unrolling, loop interchanging and software pipelining 4. They are especially successful in the case of regular codes. However, as we consider sparse codes, we follow an approach for increasing the data locality for the SpMT operation based on changing the internal layout of the sparse data structures and not on code transformations.
1. DO I = 1, N+1
     PTRT(I) = 0
   END DO
2. DO I = 1, PTR(N+1)-1
     K = IND(I) + 2
     PTRT(K) = PTRT(K) + 1
   END DO
   PTRT(1) = 0
   PTRT(2) = 0
3. DO I = 3, N+1
     PTRT(I) = PTRT(I) + PTRT(I-1)
   END DO
4. DO I = 1, N
     DO J = PTR(I), PTR(I+1)-1
       K = IND(J) + 1
       P = PTRT(K)
       INDT(P) = I
       DAT(P) = DA(J)
       PTRT(K) = P+1
     END DO
   END DO
Figure 1. Pseudocode for the transposition of an N x N sparse matrix
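For readers more comfortable with array notation, the same four nestings can be transcribed almost literally into Python with 0-based indexing; the transcription below is only an illustration and follows the CRS vectors DA, IND and PTR of Figure 1.

# Direct 0-based transcription of the four nestings of Figure 1.
# A is given in CRS form by (da, ind, ptr); A^T is returned as (dat, indt, ptrt).
def sp_transpose(n, da, ind, ptr):
    nz = ptr[n]
    ptrt = [0] * (n + 2)
    for j in range(nz):                    # nesting 2: count entries per column
        ptrt[ind[j] + 2] += 1
    for i in range(3, n + 2):              # nesting 3: prefix sums -> start offsets
        ptrt[i] += ptrt[i - 1]
    dat, indt = [0.0] * nz, [0] * nz
    for i in range(n):                     # nesting 4: scatter row i into A^T
        for j in range(ptr[i], ptr[i + 1]):
            k = ind[j] + 1
            p = ptrt[k]
            indt[p] = i
            dat[p] = da[j]
            ptrt[k] = p + 1
    return dat, indt, ptrt[:n + 1]

if __name__ == "__main__":
    # 2x2 example: A = [[1, 2], [0, 3]] in CRS form.
    print(sp_transpose(2, [1.0, 2.0, 3.0], [0, 1, 1], [0, 2, 3]))
    # -> ([1.0, 2.0, 3.0], [0, 0, 1], [0, 1, 3])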
In our approach the locality model is used as an optimization tool. Other applications that modify the data layout to improve the locality modelled in cache memories can be found in 5.
2 A qualitative study of locality for the SpMT
Let N be the number of rows of a square sparse matrix A and Nz the number of nonzero elements of the matrix (entries). Figure 1 shows an efficient code for the transposition A^T of a square sparse matrix A 1. Each matrix is stored in Compressed Row Storage (CRS) format, DA, IND and PTR being the three vectors (data, column indices and row pointers) used by the CRS storage format to store A, and DAT, INDT and PTRT the corresponding vectors for matrix A^T. Vectors DA and PTR are accessed only in the last nesting of the code and the accesses are sequential, with flawless spatial locality. The accesses to vector IND, in nestings 2 and 4, and PTRT, in nestings 1 and 3, are also sequential. The remaining accesses are irregular and present multiple indirection levels. They can be classified into two groups:
• The irregular accesses to vector PTRT, of size N + 1, in nestings 2 and 4 are of the same type as the accesses to the dense vector in SpM x V. Therefore, the accesses are non-uniform, with a spacing determined by the separation among consecutive entries in the same row of A.
• Vectors DAT and INDT, of size Nz, are accessed in nesting 4 with two indirection levels. The non-uniform spacing between consecutive accesses depends on the number of and the proximity among entries in each row.
We will focus our study on the locality for these irregular accesses. From the analysis above it can be deduced that a closer "grouping" of the nonzero elements in the rows of matrix A will lead to a higher spatial locality in the accesses to vector PTRT in nestings 2 and 4. In the same way, an increase in the grouping of nonzero elements in each column of matrix A will favour temporal locality when accessing PTRT in nesting 4, and spatial locality in the accesses to DAT and INDT. So the closer the grouping of entries over the pattern of the sparse matrix, the higher the data locality in the execution of the transposition.
3 The locality model
The locality properties explained above indicate that an evaluation of the grouping of entries over the pattern of the matrix will provide a good indication of locality for the transposition. This idea is also valid for the SpM x V operation and other sparse algebra codes. In 2 we developed a model for quantifying locality based on the evaluation of the grouping of entries. In that paper we considered and evaluated the model for the case of the SpM x V, but we also introduced the idea that the model could be useful when studying other sparse algebra codes. In this section we briefly outline the characteristics of the locality model that will be applied in this paper to the locality optimization for the transposition. To quantify the locality induced in the execution of irregular codes for which the data locality is increased by a closer grouping of the entries over the pattern of a sparse matrix, we introduced in 2 two parameters: the number of entry matches (a_elems) and the number of block matches (a_blocks). They are defined over pairs of rows or columns of the sparse matrix. The number of entry matches between two rows of a sparse matrix is defined as the number of pairs of non-zero elements in the same column. The concept of entry matches can be extended to block matches by replacing the term "elements" by "blocks with at least one entry". Both concepts can also be defined for columns in a straightforward manner. In Figure 2 we show an example of the measurement of these parameters; Bs denotes the block size.
Figure 2. An example of locality parameters: for the two rows x and y shown there, a_elems(x,y) = 3 and a_blocks(x,y) = 2.
Based on these two parameters, we define a magnitude called distance between rows x and y, denoted by d(x,y). It is defined to indirectly measure the locality displayed by the irregular accesses on these two rows. The definitions of the distance functions fulfill some requirements. The cost of computing the distance functions for a sparse matrix must be small; thus, the functions must be built from simple operations. In addition, if a relative order of the rows in terms of distance is established, the order in terms of locality must be its inverse:

$\forall$ rows $x, y, z$: $d(x,y) < d(x,z) \Leftrightarrow locality(x,y) > locality(x,z)$.
For a given sparse matrix, three different inversely proportional measures of the locality of the SpMT for the whole sparse matrix (D\, Di and D3) can be defined by summing the distances between pairs of consecutive rows/columns {d\, d-2 and d% respectively) in the order they are accessed. N-l
Dj=J2dj(i>i
+ V> i = 1.2,3.
(4)
i=l
4
Formulation of the data locality improvement
Since the data locality associated to the irregular accesses of the SpMT depends on the pattern of the sparse matrix, we propose changing in the appropriate manner the layout of the sparse matrix in memory. This can be
achieved by arranging rows and columns so as to minimice the value of Dj and thus increase the locality. We formulate the locality improvement problem as a graph NP-complete problem. In the graph for our problem, the weight function of an edge reflects the distance between rows/columns. Each node of the graph represents a row/column of the matrix. Our problem consists in seeking the ordering of rows and columns of the sparse matrix that produces a minimum value of Dj. For solving the problem we have opted for a heuristic based on spanning trees. The algorithm is divided into two steps: • Construction of a minimum-spanning tree of the complete graph using the Prim algorithm 6 . The runtime required by this algorithm is 0(N2). • Visiting the different nodes of the tree to establish an order using a depthfirst search7. This algorithm requires a time 0(max(./V, |T|)) being T the set of edges joining any pair of nodes in the minimum-spanning tree and \T\ the number of such edges. To study different solutions we have also solved the problem using the greedy heuristic called nearest-neighbour algorithm 6 . This algorithm requires O(N). 5
5 Evaluation of the improvements
A significant effect of locality is the reuse that determines the number of cache misses in a cache memory. So, we have chosen cache misses as a criterion to validate our model. In this work we present results for six matrices with different non-zero patterns. Four of them come from the Harwell-Boeing sparse matrix library and the other two are synthetic matrices. Both the minimum-spanning tree (P1, P2, P3) and the nearest-neighbour heuristics (P1n, P2n, P3n) were applied using the three distance functions (D1, D2 or D3). The cache misses were measured through simulation. The cache, whose line size is 8 words (64 bytes), is two-way set-associative. The LRU replacement policy and no prefetching are considered. The number of cache misses was measured varying the cache size from 2K to 32K words. These cache sizes could seem too small compared to the caches in modern processors. The choice was determined by the sizes of the sparse matrices in our test set, with the aim of having an appreciable number of interference misses (those that depend on the locality properties of the accesses).
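The kind of miss counting used in the validation can be reproduced with a very small trace-driven simulator; the Python sketch below models a two-way set-associative cache with 8-word lines and LRU replacement and simply counts misses over a stream of word addresses. It is an illustration of the methodology, not the simulator actually used, and the default parameters correspond to the 2K word configuration.

# Minimal trace-driven model of the simulated cache: two-way set-associative,
# 8-word lines, LRU replacement, no prefetching (illustrative sketch only).
def count_misses(addresses, cache_words=2048, line_words=8, ways=2):
    n_sets = cache_words // (line_words * ways)
    sets = [[] for _ in range(n_sets)]          # each set is an LRU list of tags
    misses = 0
    for a in addresses:
        line = a // line_words
        s, tag = line % n_sets, line // n_sets
        cur = sets[s]
        if tag in cur:
            cur.remove(tag)                     # hit: move to most-recent position
        else:
            misses += 1
            if len(cur) == ways:
                cur.pop(0)                      # evict the least recently used line
        cur.append(tag)
    return misses

if __name__ == "__main__":
    print(count_misses(range(0, 8192, 1)))      # sequential words: 1024 misses
    print(count_misses(range(0, 8192, 8)))      # one access per line: 1024 misses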
Figure 3. Data locality optimizations for three different matrices: (a) orderings for matrix CRYSTM01, (b) orderings for matrix BCSPWR10, (c) orderings for matrix RND4000.
Figure 3 displays the results obtained for three of the matrices of the validation set, which present very different sparsity patterns. We show results only for these matrices due to space limitations. The results are represented in terms of the percentage of reduction in the number of cache misses over the original matrix (CRYSTM01, BCSPWR10 or RND4000) for each one of the six resulting reordered matrices. The smallest increases in locality are obtained for the banded matrices (CRYSTM01 in Figure 3). The reason is that in this case the original matrix already presents high, and therefore difficult to improve, locality. Matrix BCSPWR10 presents a more uniform distribution of entries over the matrix pattern, while in matrix RND4000 the entries are uniformly distributed. In a large number of cases, the best results are obtained for P3 or P3n. Note that P3 and P3n correspond to cases for which the distance functions are defined in terms of entry matches, the locality parameter leading to a closer grouping of the matrix entries than the block matches parameter.
5.1 Comparing with standard orderings
There are some standard orderings that are usually applied to modify the structure of a sparse matrix with objectives different from the search for locality improvement. Nevertheless, these orderings have an impact on locality. We have selected three of these methods to compare with our best results: the minimum degree, Cuthill-McKee and one-way dissection algorithms 1. In Figure 4(a) we present results of the application of these algorithms to the matrices of the test set, in terms of the reduction in the number of cache misses. The worst results are obtained applying the minimum degree algorithm. In fact, for this algorithm, and for some cases of the Cuthill-McKee algorithm, no improvement was achieved with respect to the original matrix. If we compare these results to those obtained using the locality improvement orderings, for example the P3 and P3n orderings, the conclusion is that P3 and P3n do not always present the best results, although their results are always competitive. They obtain the best results in 4 of the 6 cases represented. In the other cases, the differences with respect to the standard methods are small. These results present similar trends to those obtained with other cache sizes. The arithmetic means of the improvements for the six different matrices are represented in Figure 4(b) for different cache sizes. It can be observed that the locality improvement orderings present the best results in 3 of the 4 cache sizes considered. These 3 cases correspond to small cache sizes, where locality has a greater relevance in the results. Note also that with the locality improvement methods, the smaller the cache size, the larger the improvements obtained, which indicates a better quality of the improvements. This is not the case with the standard methods.
6 Conclusions
A method to characterize and increase the data locality of irregular algorithms is applied in this paper to the transposition of a sparse matrix. The locality model is based on the computation of three different functions that quantify the locality associated with the pattern of the sparse matrix. Our proposal for improving the locality is based on changing the layout of the sparse matrix in memory, because it defines the irregular accesses in the transposition operation. In our case these results have been obtained for a set made up of six matrices with very different patterns. Although the model is general enough to be applied to any level of the memory hierarchy, the results are validated on the cache. In our experiments the reduction in cache misses reaches up to 66% for small caches (2K words), in which the locality is a determining factor for the number of cache misses.
(a) Results for an 8K-word cache
(b) Mean of improvements
Figure 4. Relative cache miss improvements when ordering according to standard ordering algorithms
Acknowledgements
This work was supported by the Xunta de Galicia under project 99PXI20602B.
References
1. Sergio Pissanetzky. Sparse Matrix Technology. Academic Press, Inc., 1984.
2. D. B. Heras, J. C. Cabaleiro, and F. F. Rivera. Modeling data locality for the sparse matrix-vector product using distance measures. Journal of Parallel Computing, 27:897-912, 2001.
3. O. Temam and W. Jalby. Characterizing the behavior of sparse algorithms on caches. In IEEE Int'l Conf. on Supercomputing (ICS'92), pages 578-587, 1992.
4. M. E. Wolf and M. S. Lam. A data locality optimization algorithm. In Proc. SIGPLAN'91 Conf. on Programming Language Design and Implementation, June 1991.
5. Sivan Toledo. Improving memory system performance of sparse matrix-vector multiplication. In 8th SIAM Conf. on Parallel Processing for Scientific Computing, May 1997.
6. Gerhard Reinelt. The Traveling Salesman: Computational Solutions for TSP Applications, volume 840 of LNCS. Springer-Verlag, 1991.
7. Alan Gibbons. Algorithmic Graph Theory. Cambridge University Press, 1984.
AUTOMATIC MULTITHREADED PARALLEL PROGRAM GENERATION FOR MESSAGE PASSING MULTIPROCESSORS USING PARAMETERIZED TASK GRAPHS
EMMANUEL JEANNOT
LORIA, INRIA Lorraine, France
Emmanuel.[email protected]
In this paper we describe a code generator prototype that uses parameterized task graphs (PTGs) as an intermediate model and generates multithreaded code. A PTG is a compact, problem-size-independent representation of some task graphs found in scientific programs. We show how, with this model, we can generate a parallel program that is scalable (it is able to execute millions of tasks) and generic (it works for all parameter values).
1 Introduction
In the literature a lot of work exists to help programmers write parallel programs. Various approaches have been taken, such as: data-parallel languages with directives (HPF 1); linear algebra libraries (ScaLAPACK 2); communication libraries (MPI 3); shared-memory programming environments (such as Athapascan-1 4); compilers for automatic parallelization (for example SUIF 5); or, lastly, task parallelism languages (for example Pyrros 6). Our work is an intermediate approach between the last two. In this paper we show how to automatically generate a program when it is modeled by a task graph. Parallel compilers 5,7,8 are, in general, based on a fine-grain representation of the program. Most of the time this implies that communications and computations are considered as unitary. Fine-grain analysis also implies that all instructions are modeled. Our work is based on a coarse-grain representation of the program: instructions are grouped into tasks. We do not make any assumptions about the duration of communications nor about the duration of the tasks. Furthermore, since there are always fewer tasks than instructions, the modeling of a program is less complex. A lot of tools have been developed for generating code from a task graph. This is the case for Pyrros 6, Hypertool 9 or CASCH 10. They all work in the same way: (1) they extract a task graph from a program, (2) they schedule the task graph using various scheduling algorithms, (3) they build a program that executes the found schedule. This approach has two major drawbacks. First, the task graph can be built only when the parameters of the input
program are instantiated. Hence, it is required to recompute the schedule and regenerate the program each time the parameter values change. Second, the size of the task graph depends on the parameter values of the program. Hence, this method does not work for large parameter values. Indeed, when the parameter values are too large, the task graph becomes too large to be stored in memory and scheduled (in general these tools are not able to schedule task graphs containing more than a thousand tasks). The main contribution of this paper is that we show how to overcome these two drawbacks for some programs found in scientific applications. We have designed a code generator prototype based on an intermediate model called the parameterized task graph (PTG). This model is a symbolic representation of task graphs. It requires a small amount of memory to be stored because it is independent of the parameter values. Thus, we can generate a parallel program that works for all parameter values and that is able to execute millions of tasks. Our runtime system executes tasks in a multithreaded fashion for computation/communication overlapping and time-sharing execution.
2 Background
2.1 The Parameterized Task Graph
A parameterized task graph (PTG) is a compact representation of some task graphs that can be found in scientific applications. It has been introduced by Cosnard and Loi in 11,12 to build task graphs automatically. It uses parameters, therefore its size is independent of the problem size. It is composed of three parts: a set of communication rules, a set of generic tasks and a cost function for each task. In the code, generic tasks are enclosed by the keywords task and endtask. The cost function gives, for each instance of a task, the number of operations performed by the task. Communication rules symbolically describe communications between tasks. It is out of the scope of this paper to describe the PTG model more precisely. For more details refer to 11,12,13,14.
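Schematically, a PTG can be held in a structure like the one below. The class and field names are illustrative only and do not reproduce the exact syntax or data structures used by PlusPyr or the other tools cited in this paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class CommunicationRule:
    """Symbolic description of a communication between task instances."""
    sender: str      # generic task name, e.g. "factor"
    receiver: str    # generic task name, e.g. "update"
    guard: str       # symbolic condition on the task indices, e.g. "k < j"
    data: str        # symbolic description of the transmitted region

@dataclass
class ParameterizedTaskGraph:
    """Compact, parameter-independent representation of a task graph."""
    generic_tasks: Dict[str, str]                       # task name -> task code
    rules: List[CommunicationRule] = field(default_factory=list)
    cost: Dict[str, Callable[..., int]] = field(default_factory=dict)
```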
2.2 A Line of Semi-Automatic Parallelization 15
We have designed a line of semi-automatic parallelization. It starts from a sequential program written in a FORTRAN-like language annotated with task and endtask delimiters that partition the program into sequential tasks.
The program must have a static control 16. This means that all the array subscripts and loop bounds must be affine functions of the program parameters and enclosing loop indices. For-loops are the only control structure allowed and all the instructions must be assignments. The first step of our line is to derive the parameterized task graph. This part is done by the PlusPyr tool 12. Since PlusPyr is only able to treat static control programs, our approach is suitable for regular and static computations only. The second step of the line consists in finding a symbolic allocation of the tasks. This means that we build a function that tells on which processor each instance of a task will be executed. Our algorithm 17,13, called SLC (Symbolic Linear Clustering), guarantees that, for any parameter values, the allocation is a linear clustering. A linear clustering is a set of disjoint paths of the instantiated task graph. The advantage of building a linear clustering is that it reduces unnecessary communications while preserving parallelism 18. The last step of the line is the code generation and is described in this paper.
3 Overview of the Method
Execution of the tasks follows the macro-dataflow model: each task first receives data, then executes its code and finally sends data. In order to avoid global management of data, the data used by a task are stored within this task. We use a multithreaded runtime system for executing tasks. This means that each task is going to be executed by a thread. We use a thread library called PM2 19. It uses Light Remote Procedure Calls (LRPC) for transmitting data between processes. Sending messages with LRPC works as follows: when process P needs to transmit data to process P', it calls a procedure on process P'. This procedure is executed by a thread and manages the data transmitted as parameters. The multithreaded approach allows communication/computation overlapping because the sender is not blocked during data transmission; moreover, using LRPC allows a fully asynchronous transmission of messages. The receiver does not wait for data: a thread is launched each time a new message arrives.
Message reception: when a task sends data, it calls (with an LRPC) a receiving procedure that stores the data within the corresponding task. It manages two sets: a set of waiting tasks, which are the tasks that have already received messages but are waiting for other one(s), and a set of ready tasks, which are the tasks that have received all their messages and are ready to be executed.
Task execution: the number of threads that execute tasks is fixed at the beginning of the execution. Each thread takes a ready task, executes this task and sends the data using the communication rules. If the ready task set is empty it blocks until a task becomes ready. Concurrent access to the ready task set is managed with a semaphore. For executing a task, the thread calls the procedure that contains the task code. The parameters of this procedure are the data that have been sent to the task.
Sending messages: for sending a message, a thread checks the communication rule set and determines which rule(s) have to be executed. The data are copied to a buffer and sent to the receiving process (we will see in the optimization section how to avoid the copying of data when possible).
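A minimal sketch of this execution scheme is given below. It is an illustration only: PM2 threads and LRPC are replaced by plain Python threading and queue primitives, and the interface of the communication rules (send_rules, rule.send) is assumed.

```python
import threading, queue

ready_tasks = queue.Queue()          # tasks whose messages have all arrived
waiting = {}                         # task id -> messages received so far
lock = threading.Lock()

def on_message(task_id, data, expected):
    """Receiving procedure: store data within the task, promote it when complete."""
    with lock:
        msgs = waiting.setdefault(task_id, [])
        msgs.append(data)
        if len(msgs) == expected:
            ready_tasks.put((task_id, waiting.pop(task_id)))

def worker(task_code, send_rules):
    """Executor thread: take a ready task, run it, send data along the rules."""
    while True:
        task_id, data = ready_tasks.get()   # blocks until a task becomes ready
        if task_id is None:
            break                           # shutdown: a (None, None) pair was queued
        result = task_code[task_id](*data)
        for rule in send_rules(task_id):
            rule.send(result)

# A fixed number of executor threads is started at the beginning of execution, e.g.:
# threads = [threading.Thread(target=worker, args=(codes, rules)) for _ in range(4)]
```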
4 The Parallel Program
4.1 Communication Rule Analysis
First, we determine whether a communication rule describes a point-to-point communication or a broadcast 13. Second, we analyze the communication rules to determine which data are transmitted by a task. Third, we determine, for each transmitted datum, whether it is read and/or written. Once this is done we are able to determine the memory locations required for storing data within tasks.
4.2 Static Part
There are some parts of the program that do not depend on the input sequential program, such as the receiving procedures that store data within tasks, the data structure management procedures, etc.
4.3 Generated Part
The code generation is done after the analysis of the rules and a simple analysis of the source program. We describe here the different parts of the code that depend on the source program.
Task code: first, we "functionalize" the tasks (a small sketch is given at the end of this subsection). Functionalization is the opposite of inlining, a well-known optimization feature of compilers. This means that we decompose the sequential program into functions, each function being a task. The parameters of these functions are the accessed data of the corresponding task.
Allocation and deallocation task functions: a task is created when it receives data for the first time. When the task has finished its execution, all this information is freed.
Packing data: for each rule we generate code that extracts the data described by this rule from the sending task and builds the message.
Unpacking data: when a message arrives, the data is to be stored within the receiving task. For each reception rule we generate code that copies the input buffer into the attached data.
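As an illustration of functionalization and of the generated packing/unpacking code, a task of a hypothetical elimination kernel could look as follows; the names are ours and the sketch is written in Python rather than in the language actually emitted by the generator.

```python
import struct

def task_update(pivot_row, row):
    """Generic task: one row update of an elimination step.
    Its parameters are exactly the data accessed by the task."""
    factor = row[0] / pivot_row[0]
    return [x - factor * p for x, p in zip(row, pivot_row)]

def pack(values):
    """Generated packing code: copy the data described by a rule into a message."""
    return struct.pack(f"{len(values)}d", *values)

def unpack(buffer):
    """Generated unpacking code: copy the input buffer into the task's data."""
    return list(struct.unpack(f"{len(buffer) // 8}d", buffer))
```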
5 Optimizations
In order to generate an efficient parallel program, optimizations are necessary. Our code generator performs two types of optimizations. The first type concerns code speed, the second type concerns memory utilization.
Merging rules: some communication rules describe the transmission of contiguous data. Sometimes it is possible to merge these rules into one rule that sends the union of the two data.
Use of global communications: when a message is a broadcast, we send only one message per node that executes some of the receiving tasks.
Merging messages: when a task generates two different messages for the same receiving task, these messages are merged before sending.
Transmission of data on the same node: we have optimized the packing and unpacking of data in order to avoid the creation of temporary buffers. The data is directly copied from the sending task to the receiving task.
Pointer transmission of data: in order to avoid data copies, we have optimized the reception of broadcasts: when possible, each receiving task on the same node points to the same data. When all tasks have finished using this data, it is freed from memory.
6 Results
We have run the generated programs on a 16-node IBM SP2 and on a cluster of 14 Motorola PowerPCs (the POM: Pile Of Motorola) linked with BIP 20. Since we obtain similar performances, we only show the results for the POM cluster.
6.1 Speedup Results
Fig. 1 shows speedup results on the POM for the Gaussian Elimination and the Jordan Diagonalization. The linear clusters built by SLC are mapped onto the physical processors in a cyclic fashion (a). The matrix sizes are between 1000 and 4000. The baseline of the speedup is a straightforward C translation of the original sequential code. For the Gaussian elimination with matrix size 4000 the program executes more than 8 million tasks and checks more than 16 million edges. We obtain a speedup of more than 12.28 on 16 processors. Results for the Jordan Diagonalization are not as good as for the Gaussian Elimination because the Jordan Diagonalization algorithm involves more communication.
(a) The way clusters are mapped onto the processors is fixed at the beginning of execution. The user has four choices: cyclic, block, block-cyclic or reflect.
Figure 1. Speedup results for some compute-intensive kernels (Gaussian Elimination and Jordan Diagonalization on the POM)
6.2 Timing Different Parts of the Program
Fig. 2 shows the proportion of the duration of each part of the Gaussian Elimination program for N=2000 on the POM. The results show that the time proportion spent executing tasks decreases as the number of processors increases. On the other hand, the time proportion for selecting ready tasks increases as the number of processors increases. This is due to the fact that, for a constant problem size, threads have fewer tasks to execute when the processor number increases: they spend more time blocked, waiting for ready tasks.
Figure 2. Time proportion of the different parts of the program for the G.E. with size 2000 on the POM (parts: task execution, building messages, sending messages, task destruction, selecting ready tasks, misc.)
6.3 Remark
It has not been possible to compare these results with those obtained by static scheduling tools. Indeed, due to memory constraints, static scheduling tools are not able to schedule task graphs larger than a thousand tasks. However, we have shown in our previous work 13 that SLC obtains performance similar to very good static scheduling algorithms for small task graphs.
Conclusion
In this paper we have described the back-end of a complete line of semi-automatic parallelization based on a coarse-grain decomposition of the program. Our contribution is the following. We use the parameterized task graph as an intermediate model, hence the generated code works for all parameter values. The parallel program executes the symbolic allocation found by SLC, which allows very large programs to be executed. Indeed, (1) the allocation does not depend on the parameter values, (2) computing the processor where to execute a task takes constant time and memory. We are able to execute task graphs containing millions of tasks and edges. In order to obtain good performance, we have described various optimization features. We have demonstrated that multithreading is well suited for task computation. Our future work is directed towards the automatic placement of task delimiters. This is the only part that is not automatic in the proposed line. We would also like to extend the input language in order to treat more sophisticated programs and extend the class of programs that can be automatically parallelized.
References
1. C. H. Koelbel, D. B. Loveman, R. S. Schreiber, G. L. Steele Jr., and M. E. Zosel. The High Performance Fortran Handbook. The MIT Press, 1994.
2. L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users' Guide. SIAM, 1997.
3. W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message Passing Interface. The MIT Press, 1994. ISBN 0-262-57104-8.
4. F. Gallilee, J.-L. Roch, G. Cavalheiro, and M. Doreille. Athapascan-1: On-line Building Data Flow Graph in a Parallel Language. In IEEE Intl. Conf. on Parallel Architectures and Compilation Techniques (PACT'98), Paris, October 1998.
5. S. P. Amarasinghe, J. M. Anderson, M. S. Lam, and C. W. Tseng. The SUIF Compiler for Scalable Parallel Machines. In Seventh SIAM Conference on Parallel Processing for Scientific Computing, February 1995.
6. T. Yang and A. Gerasoulis. Pyrros: Static Task Scheduling and Code Generation for Message Passing Multiprocessors. In Supercomputing '92, pages 428-437, Washington D.C., July 1992. ACM.
7. Alain Darte and Frederic Vivien. Parallelizing Nested Loops with Approximation Distance Vectors: a Survey. Parallel Processing Letters, 7(2):133-144, 1997.
8. P. Feautrier. Toward Automatic Distribution. Parallel Processing Letters, 4(3):233-244, 1994.
9. M. Wu and D. Gajski. Hypertool: a programming aid for message-passing systems. IEEE Transactions on Parallel and Distributed Systems, 1(3):330-343, 1990.
10. I. Ahmad, Y.-K. Kwok, M.-Y. Wu, and W. Shu. Automatic Parallelization and Scheduling on Multiprocessors using CASCH. In ICPP'97, August 1997.
11. M. Cosnard and M. Loi. Automatic Task Graph Generation Techniques. Parallel Processing Letters, 5(4):527-538, 1995.
12. M. Loi. Construction et execution de graphes de tâches acycliques a gros grain. PhD thesis, Ecole Normale Superieure de Lyon, France, 1996.
13. M. Cosnard, E. Jeannot, and T. Yang. SLC: Symbolic Scheduling for Executing Parameterized Task Graphs on Multiprocessors. In International Conference on Parallel Processing (ICPP'99), Aizu Wakamatsu, Japan, September 1999.
14. M. Cosnard and E. Jeannot. Compact DAG Representation and Its Dynamic Scheduling. Journal of Parallel and Distributed Computing, 58(3):487-514, September 1999.
15. Emmanuel Jeannot. Allocation de graphes de tâches parametres et generation de code. PhD thesis, Ecole Normale Superieure de Lyon, France, October 1999. ftp://ftp.ens-Lyon.fr/pub/LIP/Rapports/PhD/PhD1999/PhD1999-08.ps.Z.
16. P. Feautrier. Dataflow analysis of array and scalar references. International Journal of Parallel Programming, 20(1):23-53, 1991.
17. M. Cosnard, E. Jeannot, and T. Yang. Symbolic Partitioning and Scheduling of Parameterized Task Graphs. In IEEE International Conference on Parallel and Distributed Systems (ICPADS'98), Tainan, Taiwan, December 1998.
18. A. Gerasoulis and T. Yang. On the Granularity and Clustering of Direct Acyclic Task Graphs. IEEE Transactions on Parallel and Distributed Systems, 4(6):686-701, June 1993.
19. R. Namyst and J.-F. Mehaut. PM2: Parallel Multithreaded Machine. A computing environment for distributed architectures. In Parallel Computing (ParCo'95), pages 279-285. Elsevier Science Publishers, September 1995.
20. Loic Prylli and Bernard Tourancheau. BIP: a New Protocol Designed for High Performance Networking on Myrinet. In Parallel and Distributed Processing, IPPS/SPDP'98, volume 1388 of Lecture Notes in Computer Science, pages 472-485. Springer-Verlag, April 1998.
PARALLEL LAUNCHER FOR CLUSTER OF PC
CYRILLE MARTIN* OLIVIER RICHARD
ID-IMAG - France
E-mail: [email protected], [email protected]
This article describes an experimental parallel program starter for large-sized clusters of personal computers (PC). The starting of programs is parallelized by broadcasting a remote execution request. Broadcasts are performed using a spanning tree. Several experiments were conducted with N-ary and binomial trees, the latter performing better when the size of the cluster increases.
1 Introduction
Large-sized clusters (over 100 processors) are more and more used for high-performance computing. In order to exploit a large cluster, users or system administrators need efficient tools to install, administrate and monitor the cluster. To do that, several types of tools have been developed. To install software on a set of PCs, the tool Ghost 1 uses a server which safely broadcasts binary files. The SCMS 2 tool developed in the SMILE project provides the most basic Unix administration commands for a cluster. It is based on remote command executions performed iteratively (remote shell: rsh, rexec, ssf, ssh). The C-PLANT cluster project 3 uses a set of dedicated daemons to perform administration and monitoring of large clusters. The monitoring tool Bwatch 4 probes memory and CPU load iteratively. This list is not complete because many other tools have been developed to solve specific problems. Those problems can be easily resolved with efficient multicast and reduce (merge) mechanisms 5. These mechanisms have been studied intensively to provide collective communications for parallel computing. Vadhiyar's study 6 gives a specialization algorithm, taking into account message size and network topology, to customize the collective communications of MPI. Kielmann, from the MagPIe project 7, extends the collective communications of MPI to metacomputing. He distinguishes two communication levels: efficient inside a cluster, and slower between two clusters. However, there exists no efficient and generic mechanism to build tools requiring efficient broadcasts and reductions for system administration and monitoring.
* PhD thesis in the LIPS project, Bull / INRIA
In a first step towards building a generic tool, we propose in this article to study an efficient parallel program launcher for large-sized clusters. We assume that the interconnection network of the cluster can perform parallel communications between independent node pairs. This is the case for switched Ethernet networks, multi-stage interconnections such as Clos Myrinet switches 8, and 2D or 3D grids of SCI 9. In the following sections, we explain the design of the parallel program starter prototype. The description of the algorithm used will be followed by a presentation and analysis of the performance results. We will conclude with the problems raised by the design of a generic tool.
2 Case study: launcher program for clusters
In this article we focus on the launching of programs: more precisely, on the starting of a program or a command on a cluster. On small-sized clusters, the classical technique, used for example in mpirun of MPICH 10, is to start one node after another. The cost of this algorithm is linear in the number of processors and therefore not scalable. For larger clusters (C-PLANT 3, Score 11), solutions using dedicated daemons were developed. Building an efficient launcher without dedicated daemons requires parallelizing the remote execution call mechanism ("remote shell Unix": rsh, rexec, ssh, ssf). A simple parallelization technique is based on the recursive starting of programs on nodes. The starting scheme corresponds to a spanning tree composed of the cluster nodes 12. Our prototype has been built in this way.
2.1 Principle of the standard rsh protocol
The principle of the rcmd(A) client command (Fig. 1) consists in requesting a remote rshd server to create a process which executes the A command. A TCP connection is used to redirect the standard input and output of the remote process. This connection is closed when the remote A program is finished. Figure 1 shows the different stages of a remote execution call: setting up of a TCP connection, remote user identification and loading of the program.
2.2 Principle of the launcher
The parallel starter triggers a broadcast to N other nodes, parallelized in the following way:
• The initial node starts k remote processes, and keeps the TCP connections established by the remote execution call (see 2.1).
• Each of those k nodes in turn starts k' nodes, until all the N nodes concerned with the execution of the A program have been reached.
• At the end of the A program, a node terminates when all of its children nodes have closed their TCP connections.
Figure 1. rcmd command execution: setup of a TCP connection with the rshd daemon (via inetd), user verification, A command execution. Steps: (1) setup of a TCP connection with the inetd daemon; (2) starting of the rshd daemon by inetd (fork); (3) remote computer and user authentication (NIS); (4) ACK message to client; (5) launcher program execution; (6) remote execution query on another node; (7) A command execution; (8) waiting for completion.
We construct a diffusion tree which links all of the processes executing the A program. Starting a program rapidly is equivalent to performing a collective communication in which the message is composed of the program name, its arguments and the user ID. Since each step of the algorithm is composed of communications between independent node pairs, it is possible to exploit the parallel communication possibilities of the physical network. Before the program or command execution, we must consider two situations. In the first one, the program or the input data file resides on each node; in this case no additional operation is required. Otherwise, the program or data reside on the server side and must be broadcast to all nodes. We have considered two approaches: the use of distributed file system facilities (NFS) and the exploitation of the launcher diffusion tree. In the following, we will use the term data file to refer indifferently to a program or an input data file.
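A sketch of this recursive construction for a binomial tree is given below. It is illustrative only: the script name launcher.py, the host-list halving and the shell quoting are our own assumptions, and user authentication, error handling and I/O redirection are omitted.

```python
import subprocess, sys

def launch(hosts, command):
    """Start `command` on all `hosts`; hosts[0] is the local node.
    The list is halved recursively, which yields a binomial spanning tree."""
    if not hosts:
        return
    group, children = list(hosts), []
    while len(group) > 1:
        mid = (len(group) + 1) // 2
        group, delegated = group[:mid], group[mid:]
        # Delegate the second half to its first node: run this same script
        # there through rsh, keeping the connection open as rcmd does.
        remote = f"python launcher.py {command!r} {' '.join(delegated)}"
        children.append(subprocess.Popen(["rsh", delegated[0], remote]))
    subprocess.run(command, shell=True)        # execute the command locally
    for child in children:
        child.wait()                           # a node terminates after its children

if __name__ == "__main__":
    launch(sys.argv[2:], sys.argv[1])
```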
3 Evaluation
In the first prototype, we have considered only N-ary and binomial trees. Figure 2 shows the temporal development for the two types of tree used. First, we evaluate the cost of the starting time when the data file resides on each node. Afterwards, we consider the additional time for the diffusion of the data file to all nodes.
Figure 2. Construction of an 8-node binomial and binary (arity = 2) tree (the figures indicate the steps of the algorithm)
3.1 Cost model
Let λ be the cost of one step (an rcmd time, Fig. 1). The cost of starting with an A-ary tree is λ·A·log_A(N), where N is the number of nodes. For a binomial tree this cost is λ·log_2(N). This cost model can be seen as a simplification of the model of Bernaschi and Iannello 5. In the following sections, we describe the experimental conditions and analyze the performance results obtained with our program launcher prototype.
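As a worked illustration of these formulas, with λ set to the roughly 100 ms rcmd time reported in the next subsection, the predicted start-up times can be computed as follows (these numbers only illustrate the model, they are not measurements):

```python
import math

LAMBDA = 0.1                          # cost of one step (an rcmd), in seconds

def nary_cost(n_nodes, arity):
    """Predicted start-up time of an A-ary spanning tree: lambda * A * log_A(N)."""
    return LAMBDA * arity * math.log(n_nodes, arity)

def binomial_cost(n_nodes):
    """Predicted start-up time of a binomial tree: lambda * log2(N)."""
    return LAMBDA * math.log2(n_nodes)

for n in (16, 64, 256):
    print(n, round(nary_cost(n, 2), 2), round(binomial_cost(n), 2))
# e.g. for N = 64: binary tree ~ 1.2 s, binomial tree ~ 0.6 s
```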
3.2 Experiments
The experiments were conducted on the ID-IMAG Laboratory cluster, composed of 100 PCs connected by a 100 Mbps switched Ethernet network including 3 switches connected by two 1 Gbps links (33 nodes per switch). Each workstation is a 733 MHz Pentium III with 256 Mbytes of memory running the Linux operating system. The interconnection network is not completely switched. When the binomial tree algorithm is used, it is possible that each of the 33 nodes connected to one of the switches sends a message (execution request) to 33 nodes on another switch. This situation may overload the link between those two switches. To avoid this situation, it is sufficient that the first node of the spanning tree distributes its children over the different switches. In order to reduce experimental perturbation, we have not used a centralized server for the user login verification mechanism (NIS). We have conducted two sets of experiments. In the first set, the step of the
broadcast of the data file is not considered. In the second set, the NFS file system or the launcher spanning tree is used to broadcast the data file. The results were obtained with 20 tests per measurement, with a confidence interval of 95%. The times measured include the construction of the spanning tree, the broadcast time according to the case, and a null program execution time. These tests were made with different N-ary trees and the binomial tree. The times measured include a termination time which depends on the height of the spanning tree. As the cost of a remote execution request is important (λ ≈ rcmd time ≈ 100 ms), the termination time (≈ h * t, with h the tree height and t the time to close a TCP connection) can be ignored in this first experiment.
Figure 3. Comparison of program starting times without the broadcast of the data file, for several spanning tree topologies (arity = 1, 2, 5, 64 and binomial), as a function of the number of nodes.
3.3 Results analysis without broadcast
Figure 3 compares the launching times obtained with N-ary and binomial trees. On a small number of nodes, the performance of the N-ary trees is nearly identical to that of the binomial tree. A binomial tree is better in theory when the tree is complete (number of nodes = 2^n). When the number of nodes is not a power of 2, N-ary trees can have the same efficiency. When the cluster size increases, the binomial tree becomes the more efficient, but the performance gain is limited to 20% for 64 nodes.
3.4 Results analysis with broadcast
Figure 4. Comparison with data file broadcasting for several spanning tree topologies, and use of the NFS file system and the launcher spanning tree for the broadcast of the program (graph a: arity = 1, arity = 2, NFS and binomial, with a file of 1 MB; graph b: chain (arity = 1) and binary trees, with a file of 10 MB). Graph b focuses on the difference between the two topologies: 1-ary (chain) and binary.
Figure 4 presents the launching time with the broadcast of the data file (NFS and diffusion tree) for several spanning trees. The size of the program's data is 1 Mbyte for graph a, and the binomial tree is used for the launching in the NFS test. Graph a shows that the NFS server cannot scale when the cluster size increases. Broadcasting with the binomial or the binary tree gives nearly identical results. For this size of program data, the time for the establishment of the diffusion trees is predominant. Graph b of Figure 4 compares the launching times of the 1-ary (chain) and binary trees with a 10 Mbyte input data file. The pipeline effect explains why the time for the chain is better than for the binary tree up to 12 nodes. The bandwidth of the binary tree is half that of the chain. Over 12 nodes, the time for the setting-up of the chain is too high compared to the gain in bandwidth.
4 Comparison with Ksix project
Many other launcher projects are included in complete administration and profiling tools for clusters. In the Cplant project (ASCI White) 3, the launcher, called Yod, is based on a set of interconnected daemons. Ksix, the starter of the SCMS project 2, is built in the same way. Requesting a remote execution on the nodes amounts to broadcasting the command to the interconnected daemons. As our approach is different, to compare our results we measured the execution time of a program ("hello world"-like) after the establishment of the spanning tree. The execution request was sent to a shell launched with our starter (all standard I/O are redirected). The time measured is 25 msec on 64 nodes; it includes the broadcast of the command (the program is on all nodes), the program execution time and the time to wait for all the "hello world" outputs. The measurement of the Ksix launcher 13 for the same test is about 100 msec for 64 nodes (Pentium III 500 MHz). These measurements show that the two launchers have similar performance in the same operation mode.
5 Discussion and future work
This article presents an efficient launcher for clusters, which raises several critical points: the remote execution request cost and the broadcast of the program. The experimental results given in section 3 indicate that a binomial tree is more efficient for launching applications on a cluster when the data file is on each node. However, the performance differences between N-ary trees (1 up to 4) and a binomial tree are limited. When the input file must be diffused to all nodes, the binomial and binary trees have nearly the same performance. With a large input file (program and/or data), the chain topology has better performance on a small number of nodes. To increase the performance of the chain, the time of its establishment must be reduced. Future work includes that point and performance measurements for larger clusters.
References
1. Symantec. http://www.symantec.com/ghost.
2. P. Uthayopas, J. Maneesilp, and P. Ingongnam. SCMS: An integrated cluster management tool for Beowulf cluster system. In PDPTA, pages 26-28, 2000.
3. Cplant. Proc. Second Extreme Linux Workshop, Monterey, California, 1999.
4. J. Radajewski. http://www.sci.usq.edu.au/staff/jacek/bWatch/.
5. M. Bernaschi and G. Iannello. Collective communication operations: experimental results vs. theory. Concurrency: Practice and Experience.
6. S. Vadhiyar, G. Fagg, and J. Dongarra. Automatically tuned collective communications. In SC2000: High Performance Networking and Computing. ACM.
7. T. Kielmann, R. Hofman, H. Bal, A. Plaat, and R. Bhoedjang. MagPIe: MPI's collective communication operations for clustered wide area systems. ACM SIGPLAN.
8. K. Verstoep, K. Langendoen, and H. Bal. Efficient reliable multicast on Myrinet. In ICPP, Vol. 3.
9. IEEE standard for scalable coherent interface. Technical report, The Institute of Electrical and Electronics Engineers, 1992.
10. W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22(6), pages 789-828, 1996.
11. MPICH-PM/CLUMP: an MPI library based on MPICH 1.0 implemented on top of SCore. Technical report, Parallel and Distributed System Software Laboratory, Japan, 1998.
12. J. de Rumeur. Communications dans les réseaux de processeurs. Masson, Paris, 1994.
13. T. Angskun, P. Uthayopas, and C. Ratanpocha. Ksix parallel programming environment for Beowulf cluster. In Parallel and Distributed Processing Techniques and Applications 2000, Las Vegas, Nevada, USA.
PARALLEL PROGRAM DEVELOPMENT USING THE METAPL NOTATION SYSTEM
N. MAZZOCCA¹, M. RAK¹, AND U. VILLANO²
¹ DII, Seconda Università di Napoli, via Roma 29, 81031 Aversa (CE), Italy
E-mail: [email protected], [email protected]
² Università del Sannio, Facoltà di Ingegneria, C.so Garibaldi 107, 82100 Benevento, Italy
E-mail: villano@unisannio.it
This paper shows the use of the MetaPL notation system for the development of parallel programs. MetaPL is an XML-based Tag language, and exploits XML extension capabilities to describe programs written in different programming paradigms, interaction models and programming languages. The possibility to include timing information in the program description promotes the use of performance analysis during software development. After a description of the architecture of MetaPL, its use to describe a simple example program and to obtain two particular program descriptions (views) is proposed as a case study.
1 Introduction
There is currently a large number of imperative parallel/concurrent programming languages, based on alternative memory models (shared-memory, message-passing, hybrid solutions), which allow different kinds of interaction models to be exploited (client-server, peer-to-peer, master-slave, ...). All the above can be targeted to radically different hardware/software systems. This complexity is not even mitigated by the use of unifying programming approaches. As a matter of fact, the development of parallel software is still carried out ignoring software engineering principles and methods. Most of the time, the obtained programs are badly structured, difficult to understand, and not easy to maintain or re-use. Sometimes, they even fail to meet the desired performance, which is the primary reason for resorting to parallelism. Software performance engineering (SPE) methods, successfully used for years in the context of sequential software, are the obvious solution for the development of responsive parallel software systems. The SPE process begins early in the software life cycle, and uses quantitative methods to identify the designs that are more likely to be satisfactory in terms of performance, before significant time and effort is invested in detailed design, coding, testing and benchmarking [1,3,10-13,16]. Starting from performance problems and the need to study them at the early stages of development, new software life-cycles, graphical software views and CASE tools were developed, oriented to the development of parallel software
[5,6,9,15,16]. However, the research efforts in this field have produced incompatible tools based on alternative approaches. Tools based on sequential software standards (e.g., UML) are beginning to be used in distributed systems, in particular for distributed object systems such as CORBA, but are not easily usable for general parallel programming. A very interesting possibility offered by software development in simulation environments [2,4] is the iterative refinement and performance evaluation of program prototypes. Prototypes are incomplete program designs, skeletons of code where (some of) the computations interleaved between concurrent process interactions are not fully specified. In a prototype, these "local" computations are represented for simulation purposes by delays equal to the (expected) time that will be spent in the actual code. The use of prototypes has shown that this synthetic way of describing the behaviour of a parallel program is very powerful: it is language and platform independent, shows only the essential features of the software, and can successfully be used for performance analysis at the early development stages [3]. We will present here MetaPL, a notation system designed to be the evolution of the concept of prototypes. MetaPL is an XML-based "Tag" language, and provides predefined elements for the description, at different levels of detail, of generic parallel programs. Using XML extension characteristics [17], the capabilities of the notation system can be expanded whenever necessary. The use of a single, flexible notation system may help the development of CASE tools and data interchange between them. Furthermore, its suitability for simulation promotes performance analysis techniques in the early stages of the software development cycle. This paper is structured as follows. In the next Section the notation rationale and its use in the context of parallel software development are outlined. Then the notation language is introduced, and the derivation of views of the software from its description is dealt with. The paper closes with a case study, which shows two views obtained from a simple program description. Finally, the conclusions are drawn.
2 Parallel Software Development Cycles
The design objective of MetaPL has been to provide a unifying notation able to assist the software developer in:
• the development of high-performance parallel software from scratch, by intensive use of prototypes and by performance prediction techniques integrated with software development tools (direct parallel software engineering, DPSE);
• the examination, comprehension, refinement and performance improvement of fully-developed programs by simplified program views such as diagrams, animations, simulations (reverse parallel software engineering, RPSE).
In DPSE, our development procedure requires the description of the basic structure of the algorithm through prototypes. These are static flow graphs made up
of nodes corresponding to blocks of sequential code and nodes corresponding to a basic set of parallel programming primitives (parallel activation and termination, communication, synchronization, ...). Once the blocks of sequential code have been annotated with execution time estimates, found by direct measurement on available code or by speculative benchmarking, it is possible to evaluate the time required for task interaction by simulative tools [7] or analytic models. The predicted overall performance is validated against problem specifications. If the results are not satisfactory, it is necessary to revise (some of) the choices made in the previous development steps. Otherwise, the prototypes are refined by replacing nodes with sub-graphs or even with real code. Performance is validated once again, the design is further detailed, and so on. When the process stops, the prototypes have been replaced by a fully-developed code compatible with the performance objectives. In RPSE, instead, a fully-developed program is to be represented as a prototype. This requires the construction of a static flow graph made up of nodes corresponding to concurrent programming constructs and nodes corresponding to sections of sequential code involving no interaction with other tasks.
3 The MetaPL Description Language: Core and Language Extensions
The MetaPL architecture (Fig. 1) hinges on a description language able to describe parallel and distributed computations. MetaPL is decidedly not a parallel programming language, but just a simple and concise notation to support forward and reverse development cycles. The basic assumption made is that the program to be described is written in a conventional imperative sequential language (maybe an object-oriented one), extended with commands/functions for performing the basic tasks linked to parallel programming (e.g., activation and termination of tasks, message-passing and/or shared-memory synchronization, ...). As mentioned in the Introduction, the MetaPL description language is made up of a core with very limited description capability, and language extensions, which expand the core language, adding new commands typical of a new programming model or specific to a library or programming language. The objective of the core notation is to describe in a simplified way only the high-level structure of the parallel code (task activation and termination, plus minimal synchronization constructs) and generic sequential code. In the rest of this Section we will briefly and informally present the XML elements making up the MetaPL core and the message-passing language extension. The interested reader is referred to a companion paper [8] for a more thorough description of the meta-language.
A MetaPL description is a hierarchical structure whose building blocks are different types of blocks of code. All blocks encapsulate sections of code, and may have attributes associated with them, such as the actual code contained, the expected execution time or the name of a cost function, which gives the (expected) execution time as a function of program inputs, and a textual description of the actions performed. The basic type of block is the CodeBlock, made up of a sequence of executable statements written in a conventional (sequential) programming language. A parallel program can be described as a set of CodeBlocks, along with commands that describe the high-level structure of the code. These commands can either be concurrent or sequential, depending on whether they involve some form of interaction among concurrent tasks, or not. Sequences of CodeBlocks and sequential commands can be combined into SequentialBlocks. In their turn, SequentialBlocks and concurrent commands compose the Block, which is the basic program unit assigned to a processor, to be executed as a separate task.
Figure 1. General architecture of the MetaPL notation system
A MetaPL description may also include Variable elements. These are identified by the name attribute, and may contain an initial value attribute and a further attribute describing their type. It is worth pointing out that the above described variables have only descriptive validity, and should never be confused with any variables possibly present in the encapsulated code. The possibility to describe programs that have (even at high level) alternative paths, or that perform an activity a number of times that depends on user input, is the primary reason for the inclusion in the MetaPL core of conventional sequential commands such as Loop and Switch. The Loop available in MetaPL is a for cycle executed a known number of times. The Switch command contains a sequence of Case elements, each of which contains the action to be performed if that option of the switch is selected. The attributes of a Case element are prob, the probability that the option of the switch is selected, and condition, which describes the condition leading to the selection of the switch option.
Concurrent commands are required to introduce into MetaPL the concept of a program composed of concurrent interacting tasks. The task code description is given in the Task element, characterized by the name attribute; the code is composed of the MetaPL core statements introduced above and of the following concurrent commands. The Spawn command is characterized by two attributes, the name of the spawned process and its identifier. The Wait statement is used for the description of synchronization on task exit. The Exit statement indicates the end of a task and is used to terminate the corresponding Wait. The core notation can gain more expressive power by the use of language extensions. One of the extensions already developed is the Message Passing Extension (MPE). The MPE extends the core set of commands with the introduction of a non-blocking send (Send) and a blocking receive (Receive). These basic commands are sufficient to describe the majority of message-passing task interactions.
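For illustration, a fragment of such a description could be assembled as follows. The attribute names used here (id, iterations, from, to, time) are our own guesses, since the paper introduces the elements but not their full syntax, and Python's ElementTree merely stands in for hand-written XML.

```python
import xml.etree.ElementTree as ET

# A tiny MetaPL-like description: a master task that spawns a worker,
# loops over received messages, runs an opaque code block and waits.
task = ET.Element("Task", name="master")
ET.SubElement(task, "Spawn", name="worker", id="1")
loop = ET.SubElement(task, "Loop", iterations="N")
ET.SubElement(loop, "Receive", attrib={"from": "worker"})
block = ET.SubElement(loop, "CodeBlock", time="120")   # expected execution time
block.text = "compare the received value with the current minimum"
ET.SubElement(task, "Send", to="worker")
ET.SubElement(task, "Wait", task="worker")

print(ET.tostring(task, encoding="unicode"))
```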
4 View Generation: the Filters
By exploiting suitably-defined extensions, MetaPL can describe most parallel and distributed programs. The views are descriptions less complete and structured than the MetaPL one, but able to highlight specific aspects of the program. For example, they could be UML sequence diagrams, portions of code, automatically-generated documentation of the program, or program traces useful for simulation purposes [4]. The derivation of views is performed by filters, which formally are extensions to the notation system. We will present here as examples filters that can be used to derive two views, useful as HTML documentation and as input for simulation, respectively. The filters are made up of a description of the target document format (this description typically is a DTD), and of a (non-empty) set of additional documents (translators). The target format description may be replaced by a simple declaration of a standard output format, e.g., HTML. The translators formally are XSLT (XSL Transformations) documents; they basically contain conversion rules from MetaPL to a new format and (possibly) vice versa, but are also used to produce additional support documents, such as translation logs. The same technique used to extend the core language of MetaPL is also used for the filters. A core filter is able to handle only the notation defined in the MetaPL core. It may be suitably extended by filter extensions, which allow the conversion of elements not belonging to the core set.
4.1 The MetaPL-HTML Filter
The MetaPL-HTML filter produces a simple hypertextual view that highlights the computational steps made by the program, and enables the developer to "navigate" through his/her code. Since the filter output format is HTML (a well-known format), the filter is made up only of an XSLT document (the translator) that defines the transformation from the MetaPL description to the HTML document. The HTML page generated by the filter is very simple: each task description starts with the task name (retrieved from the name attribute of the Task element) and is followed by an HTML unordered list, which contains the actions performed by the task. The actions performed by the Task could be opaque CodeBlocks, which are substituted in the HTML output by the natural language description in the MetaPL element content, or MetaPL statements (belonging to the core set or to the extensions).
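The effect of this translator can be mimicked in a few lines of Python (an illustrative stand-in for the real XSLT document, reusing the hypothetical description format sketched in Section 3; it only walks the direct children of each Task):

```python
import xml.etree.ElementTree as ET

def to_html(metapl_source):
    """Render each Task as its name followed by an unordered list of actions;
    opaque CodeBlocks are replaced by their natural-language description."""
    root = ET.fromstring(metapl_source)
    html = []
    for task in root.iter("Task"):
        html.append(f"<h2>{task.get('name')}</h2>\n<ul>")
        for action in task:
            text = action.text if action.tag == "CodeBlock" else action.tag
            html.append(f"  <li>{text}</li>")
        html.append("</ul>")
    return "\n".join(html)
```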
4.2 The Simulation Filter
The simulation filter can automatically handle the generation of traces that can be used to drive the HeSSE simulator [7] in order to obtain predictions of program performance on a (real or fictitious) distributed heterogeneous computing platform. The output format of the Simulation View is the trace format accepted as input by the HeSSE simulator. HeSSE traces are sequences of events corresponding to the basic actions that the system can simulate, such as CPU bursts or message-passing calls. The filter contains two translators. The first XSLT, the MetaPL-HeSSE translator, is used to generate the trace input for simulation. The second one, Simulation-checker, is instead used to check for the availability of all the information needed to simulate a program description. A typical example of a missing value is the number of iterations of a loop whose final value of the control variable is supplied only at run-time.
5 A Case Study: Description and Simulation of a Parallel Program
The case study relies on a toy example, a simple data-parallel message-passing program that finds the minimum in a vector. It shows how a simple algorithm description may be written, simulated and automatically documented using the MetaPL notation system and its views. A trivial algorithm, based on a master-slave approach, has been converted into a MetaPL description (not shown here for brevity's sake), using the core language along with the message-passing extension. Subsequently, the MetaPL description can be transformed into an HTML view using the MetaPL-HTML filter. The resulting documentation is shown in Fig. 2(a). The MetaPL program description can immediately be simulated by using the MetaPL-HeSSE translator and producing traces for the HeSSE simulator. The availability of traces, along with a description of the target environment (type of processors and load, interconnection network and network traffic), makes it possible to obtain, even before a single line of code has been written, a reasonable estimate of the response time (typically accurate within 10% of actual response times), and other system statistics. Fig. 2(b) shows the space-time diagram (communication events plotted over time) obtained by simulating the program description under the assumption that the program is executed using four processors on a Fast Ethernet.
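The toy algorithm itself is easily reconstructed; the sequential sketch of the master-slave decomposition below is our own illustration, since neither the MetaPL description nor the actual message-passing code is reproduced in the paper.

```python
def find_minimum(vector, n_slaves=4):
    """Master-slave minimum search: the master scatters sections of the
    vector, each slave returns its local minimum, the master reduces them."""
    chunk = (len(vector) + n_slaves - 1) // n_slaves
    sections = [vector[i:i + chunk] for i in range(0, len(vector), chunk)]

    def slave(section):                 # work done by each slave task
        local_min = section[0]
        for x in section[1:]:
            if x < local_min:
                local_min = x
        return local_min

    return min(slave(s) for s in sections if s)

assert find_minimum([7, 3, 9, 1, 4, 8]) == 1
```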
Figure 2. HTML documentation view (a) and space-time diagram obtained by HeSSE simulation (b)
6 Conclusions
This paper has shown the use of MetaPL, an XML-based notation system, for the development of parallel programs. The main features of MetaPL, namely flexibility and simplicity, have been obtained by heavily exploiting XML extension capabilities. In fact, MetaPL can describe program code based on different memory models, interaction structures and programming languages. The possibility to include timing information in the program description promotes the use of performance analysis during software development. After a brief description of the main features of the notation system, we have shown as an example two program views that can be derived from the program description by suitable filters. The first one is an automatically-generated documentation of the parallel program in HTML, which makes it possible to navigate the code description with a simple web browser. The second view instead enables the performance evaluation of an (incomplete) parallel program starting from the MetaPL description, producing a set of traces that can be used as input for the HeSSE simulation environment.
References
1. Adve V. and Sakellariou R., Compiler Synthesis of Task Graphs for Parallel Program Performance Prediction. Proc. 13th Int. Workshop on Languages and Compilers for High Perf. Comp. (LCPC'00), (Yorktown Heights, USA, 2000).
2. Aversa R., Mazzocca N. and Villano U., Design of a Simulator of Heterogeneous Computing Environments. Simulation - Practice and Theory 4 (1996) pp. 97-117.
3. Aversa R., Mazzeo A., Mazzocca N. and Villano U., Developing Applications for Heterogeneous Computing Environments using Simulation: a Case Study. Parallel Computing 24 (1998) pp. 741-761.
4. Aversa R., Mazzeo A., Mazzocca N. and Villano U., Heterogeneous System Performance Prediction and Analysis using PS. IEEE Concurrency 6 (1998) pp. 20-29.
5. Gorton I., Jelly I. and Gray J., Parallel Software Engineering with PARSE. Proc. IEEE COMPSAC 17th Int. Computer Science and Applications Conference (IEEE Press, Phoenix, USA, 1993) pp. 124-130.
6. Hatley D. J., Parallel System Development: a Reference Model for Case Tools. Proc. 9th Int. Conf. on Comp. and Comm. (Phoenix, USA, 1990) pp. 364-372.
7. Mazzocca N., Rak M. and Villano U., The Transition from a PVM Program Simulator to a Heterogeneous System Simulator: the HeSSE Project. In Recent Advances in PVM and MPI (LNCS 1908), ed. by J. Dongarra et al. (Springer-Verlag, Berlin, 2000) pp. 266-273.
8. Mazzocca N., Rak M. and Villano U., MetaPL: a Notation System for Parallel Program Description and Performance Analysis. To be presented at the PaCT 2001 Conference, 3-7 Sept. 2001, Novosibirsk, Russia.
9. Roman G. C., Language and Visualization Support for Large Scale Concurrency. Proc. 10th Int. Conf. on Softw. Eng. (1988) pp. 296-308.
10. Smith C. U. and Williams L. G., Performance Evaluation of a Distributed Software Architecture. Proc. CMG (CMG, Anaheim, USA, 1998).
11. Smith C. U. and Williams L. G., Software Performance Engineering for Object-Oriented Systems. Proc. CMG (CMG, Orlando, USA, 1997).
12. Smith C. U. and Williams L. G., Performance Engineering Evaluation of Object-Oriented Systems with SPEED. In Computer Performance Evaluation: Modelling Techniques and Tools (LNCS 1245), ed. by R. Marie et al. (Springer-Verlag, Berlin, 1997).
13. Smith C. U., Designing High-Performance Distributed Applications Using Software Performance Engineering: a Tutorial. Proc. CMG (CMG, San Diego, USA, 1996).
14. Villacis J. and Gannon D., A Web Interface to Parallel Program Source Code Archetypes. Proc. 1995 ACM/IEEE Supercomputing Conf. (San Diego, 1995).
15. Winter S. C., Software Engineering for Parallel Processing. IEEE Colloquium on High Perf. Comp. for Advanced Control (IEEE Press, 1994) pp. 8/1-8/7.
16. Woodside C. M., A Three-View Model for Performance Engineering of Concurrent Software. IEEE Trans. on Softw. Eng. 21 (1995) pp. 754-767.
17. Extensible Markup Language (XML) 1.0, 2nd edition. http://www.w3.org/TR/2000/REC-XML-20001006
PAVIS: A PARALLEL VIRTUAL ENVIRONMENT FOR SOLVING LARGE MATHEMATICAL PROBLEMS
DANA PETCU AND DANA GHEORGHIU
Western University of Timişoara, Computer Science Department, B-dul V. Pârvan 4, 1900 Timişoara, Romania, E-mail: [email protected]
The efficient solution of large problems is an ongoing thread of research in scientific computing. An increasingly popular method of solving these types of problems is to harness disparate computational resources and use their aggregate power as if it were contained in a single machine. We describe a prototype system allowing data and command exchange between different mathematical software kernels running on multiple processors of a cluster.
1 Introduction
Many scientific and engineering problems are characterized by considerable run-time expenditure on the one hand and a medium-grained logical problem structure (e.g. complex simulations) on the other. Often such problems cannot be solved in a single SCE (scientific computing environment, referring here to a computer algebra system - CAS - or a specialized problem solving environment - PSE), but they can be parallelized effectively even on networked computers. For such problems we do not necessarily need a parallel SCE. Instead we need methods to integrate the SCE and other autonomous tools into a parallel virtual system (e.g. couplings with external programs). On the other hand, the numerical computing facilities of a CAS can be improved by coupling it with specialized PSEs. Several parallel virtual or distributed systems have been constructed on top of three frequently used SCEs, Maple, Matlab and Mathematica (see Section 2). We propose a prototype system, briefly described in Section 3, namely PaViS (PArallel Virtual mathematical Solver), which interconnects kernels of the SCEs enumerated above running on different processors. Section 4 benchmarks the system in solving different computation-intensive problems.
2 Related work
Various mechanisms have been developed to perform computations across diverse platforms. The most common mechanism involves software libraries (Table 1). Unfortunately, some of these libraries are highly optimized for only certain platforms and do not provide a convenient interface to other computer systems. Other libraries demand considerable programming effort from the user. While several tools have been developed to alleviate these difficulties, such tools are themselves usually available on only a limited number of computer systems and are rarely freely distributed. Maple, Matlab and Mathematica are examples of such tools. Moreover, CAS users are often frustrated by the time needed to rerun a program for different conditions, parameters or initial guesses. Such problems might be solved by a system that makes it convenient to spawn CAS processes on multiple processors of a parallel computer or a cluster. In many cases the communication needs between the processors are rather small compared with the necessary computations.

Table 1. Examples of SCEs (Maple, Matlab and Mathematica) with parallel and distributed facilities

Built upon   Type          Parallel   Cluster
Java         Extended      No         Yes
MPI          Extended      Yes        Yes
Kernel       New           Yes        No
Strand       Extended      Yes        No
Linda        Extended      Yes        No
Mex          New           Yes        No
PICL         Extended      Yes        Yes
PVM          Extended      Yes        Yes
F90          Translation   Yes        Yes
MPI          Extended      Yes        Yes
Java         Extended      No         Yes
RSH          Extended      Yes        Yes

Some of the available parallel SCE implementations involve completely new codes rather than the use of existing systems (with good reason if the aim is high performance). Such a rebuild is impossible without access to the SCE source code. Another disadvantage is that the existing SCEs are at present so widely used, and so extensive in their capabilities, that it is unrealistic and inefficient to try to duplicate them. Another option is to build upon existing sequential SCEs and to produce extensions. Several attempts have been made to combine Maple with parallel or distributed computation features. ||Maple||14 is a portable system for parallel symbolic computations built as an interface between the parallel programming language Strand and the sequential CAS Maple. Sugarbush2 combines the parallelism of C/Linda with Maple. FoxBox3 provides an MPI-compliant distribution mechanism allowing parallel and distributed execution of FoxBox programs; it has a client/server style interface to Maple. Distributed Maple13 is a portable system for writing parallel programs in Maple, which allows the creation of concurrent tasks that are executed by Maple kernels running on
different machines of a network. The system consists of two components: a Java class library which implements a general purpose communication and scheduling mechanism for distributed applications, and a binding that allows the Java scheduler to be accessed from Maple. Parallel Computing Toolkit16 introduces parallel computing support for Mathematica; this commercial tool can take advantage of existing Mathematica kernels on a multiprocessor or a cluster. Distributed Mathematica13 is a public domain system allowing the creation of concurrent tasks that are executed by Mathematica kernels running on different machines of a cluster. The aim of MultiMatlab15 is to provide a tool that facilitates the use of a parallel computer or a cluster to solve coarse-grained large-scale problems using Matlab. The system runs on top of MPI. The user operating within one Matlab session can start Matlab processes on other machines and then pass commands and data between these various processes. Several other projects are taking the route of translating Matlab into a compilable language, either a parallel language or a standard language with the addition of message passing constructs. In Falcon12, for example, the Matlab code is translated into F90 and the parallelization is left to a parallel computer; in ConLab5 the code is translated into C. DPToolbox7 runs on top of PVM and has three components: a Matlab/PVM interface that provides the communication primitives of the PVM system in Matlab, an interface for developing distributed and parallel Matlab applications, and an interface to manage parallel Matlab machines.
3 PaViS overview
PaViS intends to provide a uniform, portable and efficient way to access the computational resources of a cluster or even of a parallel computer. It is built upon PVM, thus enabling users to create low-cost virtual parallel computers for solving mathematical problems. PaViS implements a master-slave paradigm. All message-passing details are hidden as far as possible from the user. The master, controlled by the user, contacts PaViS and sends computation requests (input parameters). PaViS runs appropriate slaves and returns the computation results (outputs or error status) to the master. High-level commands are provided in SCE source form,
so they can serve as templates for building additional parallel programs. The system comprises a set of connected processing elements. Wrapping of the SCEs is done by an external system which takes care of the concurrent execution of tasks: active PVM daemons (PVMDs) ensure the inter-processor communication. Each node connected to a PaViS session comprises three main components: a command messenger, an SCE interface and a PVMD daemon.
Figure 1. Parallel virtual environment created to solve computationally intensive problems using the message-passing strategy (CM denotes the command messenger, a processing element denotes a processor and its local memory, a dashed line indicates an optional system component; an SCE can be Maple, Matlab or Mathematica).

Table 2. SCE interface with PaViS: function package pvm

Function          Meaning
spawn             create local or remote SCE processes
send              send commands to SCE processes
receive           receive results from SCE processes
settime           start a chronometer
exit              kill the slave kernels and command-messengers
time              post-processing time diagram in SCE graphic format
ProcID            process identifier on one machine
MachID, TaskID    station/process identifier in the virtual machine
Tasks             list the process identifiers
The command messenger is a daemon process that awaits and interprets master requests and slave responses (Figure 1). It assists the message exchanges between the SCE processes, coordinates the interaction between SCE kernels via the PVM daemons (it receives SCE commands from other SCE processes, local or remote, and sends its results to other processes), and schedules tasks among processing elements (it activates or destroys other SCE processes). A special file in SCE format, a function package, implements the interface between the SCE kernel and the command messenger. Table 2 describes the currently available functions. PaViS is built on top of three existing binding programs: PVMaple, PVMatlab and PVMathematica. Each one connects kernels of the same type. The recently described PVMaple9 prototype can be freely downloaded from www.info.uvt.ro/~petcu/pvmaple (binaries for Win32 and Unix platforms); PVMatlab and PVMathematica will also be available soon. Parallel Virtual Maple is a prototype system allowing study of the issue of interconnecting PVM and
Maple. Its design principles are very similar to those of Distributed Maple. The interface between Maple and the command messenger is described in a Maple function package pvm.m. The user interacts with the system via the text-oriented Maple front-end. Initialization of a PVMaple session, activation of local or remote Maple processes and message-passing facilities are provided by the function package. The first six functions from Table 2 have equivalents in Distributed Maple. PVMaple does not allow shared objects or on-line visualization of active processes like Distributed Maple, but it allows more than one Maple command to be sent at once to remote processes (in a string). Tests performed on Unix networks where both applications are available have not revealed significant differences in application execution times8. The similar tool PVMathematica is closer to Distributed Mathematica13 than to the Parallel Computing Toolkit16. The PVMatlab project is similar to DPToolbox. The same idea was used in both tools: to allow the user access to PVM functions supporting process management and message exchanges from a frequently used SCE. DPToolbox is more complex than PVMatlab; the post-processing visualization of a session and the extension, PaViS, are the only arguments in favour of our tool.
4 Performance results
We show by experiments that the response time of the SCE solvers can be improved by using cooperation between different kernels of the same type or by using the solving capabilities of other SCEs. We present here the test results obtained using a cluster of 4 dual-processor SGI Octanes linked by three 10-Mbit Ethernet sub-networks, Maple V Release 5 and Matlab 6. Table 3 shows an example of using two components of PaViS (PVMaple and PVMatlab) in order to improve the response time of the basic SCEs. In this case the sequential time necessary to factorize the list of integers increases linearly with the problem dimension (list length). The sequential execution times reported by both basic SCEs are roughly the same (around 140 s for 3000 integers). By splitting the list of integers and the factorization requests between several SCE kernels running on remote processors, we can obtain a shorter response time. Indeed, for the given case, a speed-up factor of 2.3 (2.2) was registered using 3 kernels of Maple (likewise for Matlab), one on the same processor as the user interface and two on remote processors. We expect to obtain higher speed-up values by increasing the problem dimension, since the communication time increases more slowly with the number of integers than the computation time. To allow collaboration between different SCE kernels, the syntax of the spawning, sending and receiving commands has been slightly modified, to include a third parameter, the SCE type ('maple', 'matlab' or 'mathe'). The user must send strings of commands according to the destination SCE type.
Table 3. Code for distributed integer factorization within PVMaple/PVMatlab sessions: 3000 integers are randomly generated (intfac function), distributed in a round-robin fashion to 3 SCE kernels, and factorized (nfactor procedure), the final list of factors being presented in the user's SCE interface.

# file intfac.txt
intfac:=proc(dim,big,p) local n,r,i;
  r:=rand(big); n:=[];
  for i to dim do n:=[op(n),r()]; od;
  nfactor(n,p);
end:
nfactor:=proc(n,p) local tag,mes;
  mes:=cat("readlib(ifactors): n:=", convert(n,string),
    ": s:=[]: for i to nops(n) do ",
    "if (i mod ", convert(p,string), ")=pvm[TaskID]-1 ",
    "then s:=[op(s),ifactors(n[i])]: fi: od: s;");
  tag:=pvm[send]('all',mes);
  pvm[receive](tag,'all')
end:
% file intfac.m
function f=intfac(dim,big,p)
n=ceil(rand(1,dim)*big);
f=nfactor(n,p);

% file nfactor.m
function r=nfactor(n,p)
mes=['s=[]; n=[',num2str(n),']; ', ...
  'for i=1:length(n), if mod(i,',num2str(p),')==pvm[TaskID]-1, ', ...
  's=[s,factor(n(i))]; end; end; s'];
tag=pvm('send','all',mes);
r=pvm('receive',tag,'all');
The second example concerns the graphics facilities of Maple. We consider the problem of plotting a Julia fractal: the complex numbers z_0 for which the sequence z_n := z_{n-1}^2 + c, n >= 1, c in C, does not tend to infinity (Figure 2). Given a rectangle in the complex plane and a large integer N, an approximate Julia set can be plotted using a regular grid inside the rectangle and by drawing those grid points z_0 for which |z_N| < 1. Our tests have shown that Maple is three times slower than Matlab in the computation of the z_N values (in the faster variant with a loop instead of a function composition), and Matlab is twenty times slower than a similar C program. Two improvements are possible: to use more than one Maple kernel to construct parts of the plot, or to let a Matlab kernel compute the z values (a special function is needed to convert the results sent by Matlab into a Maple plot structure). In the case depicted in the left part of Figure 2, we have obtained a speed-up of 1.8 using 2 Maple kernels (with equal loads; code suggested by the right part of Figure 2), 3.1 using 4 Maple kernels (on the 4 regular domains; small load imbalance), 1.9 using a couple [master:Maple, slave:Matlab], 1.3 using 2 Matlab kernels and 1.1 (inefficient) using a couple [master:Matlab, slave:Maple].
# Maple sequential code
> f:=(x,y)->(x^2-y^2+0.32, 2*x*y+0.043):
> g:=(x,y)->x^2+y^2:
> h:=proc(x,y) if g((f@@130)(x,y))<1 then 0.75 else 0 fi end:
> plot3d('h(x,y)', x=-1..1, y=-1.15..1.15, grid=[400,400],
>   view=[-1..1,-1.15..1.15,0..0.75], orientation=[90,0]);
# Maple distributed code
> read 'pvm.m'; f:=... g:=... h:=...; pvm[settime]();
> pvm[spawn](["sgi1",1,"maple"],["sgi2",1,"maple"]);
> m:=pvm[send](["sgi1",1,"maple"],
>   "f:=... g:=... h:=... r1:=plot3d(... x=0..1 ...);")
> r2:=plot3d(... x=-1..0 ...):
> r1:=pvm[receive](["sgi1",1,"maple"],m);
> plots[display3d]([r1,r2]); pvm[exit](); pvm[time]('all');
Figure 2. Maple plot of a Julia fractal in [-1,1] x [-1.15,1.15] with c = 0.32 + 0.043i, a grid of 400 x 400 points and N = 300: the plotting time is O(10^3) seconds in the sequential case and can be reduced to O(10^2) by using other Maple kernels or Matlab kernels.
A more complex test concerns the use of PaViS to solve large initial value problems arising from the semi-discretization of partial differential equations. The test problem is a mathematical model of the movement of a rectangular plate under the load of a car passing across it4. Applying the method of lines, a large ODE system arises. The number of ODE equations depends on the accuracy required in the PDE solution. Usually such a system has hundreds of equations. Maple cannot solve such a big system symbolically, and for numerical computations it is too slow compared with Matlab or other programs written in standard programming languages. A first solution for solving the large ODE system in Maple is to use parallel Runge-Kutta methods for which subsets of stage equations can be solved independently at each time step (a general procedure was briefly sketched in9). For the semi-discretized plate problem with 32 ODEs, PVMaple reported a speed-up of 2.5 using 3 Maple kernels running on the cluster (further tests with PVMaple concerning ODEs are presented in10). A second solution is to use an SCE specialized in numerical computations, like Matlab, to improve the response time of the user's preferred solver. Replacing the Maple processes for solving stage equations with Matlab processes, we have obtained approximately one half of the initial response time with a small degradation in system efficiency (further tests are presented in11). A third solution is to use PaViS as an interface between different SCEs: for example, the user describes the ODE problem in Maple and remotely activates an ODE numerical solver from Matlab; currently the user must translate the data, the commands and the results between the different SCEs, which is acceptable only if they are outputs or inputs of other computationally intensive procedures described in the user's preferred SCE.
5 Future improvements
PaViS is by no means in its final form. Various extensions in functionality are under development. The current system needs improvement in the area of robustness with respect to various kinds of errors, and in its documentation. We intend also to couple the system with some PSEs and to include in PaViS an automatic command and data structure translator between Maple, Matlab and Mathematica. PaViS is potentially useful for education in parallel programming, for prototyping parallel algorithms, and for fast and convenient execution of easily parallelizable computations on multiple processors.

References
1. L. Bernadin, Maple on a Massively Parallel Distributed Memory Machine, in PASCO '97, eds. M. Hitz et al. (ACM Press, New York, 1997).
2. B. W. Char, Progress report on a system for general-purpose parallel symbolic algebraic computation, in ISSAC '90 (ACM Press, New York, 1990).
3. A. Diaz and E. Kaltofen, FoxBox: a system for manipulating symbolic objects in black box representation, in ISSAC '98, ed. O. Gloor (ACM Press, 1998).
4. E. Hairer and G. Wanner, Solving Ordinary Differential Equations II. Stiff and Differential-Algebraic Problems (Springer-Verlag, 1991).
5. P. Jacobson, B. Kagstrom and M. Rannar, Algorithm development for distributed memory computers using ConLab, Sci. Programming 1, 185-203 (1992).
6. J. Kadlec and N. Nakhaee, AlphaBridge: parallel processing with Matlab, in Proceedings of 2nd MathWorks Conference (1995).
7. S. Pawletta, Distributed and parallel application toolbox for use with Matlab, anson.ucdavis.edu/~bsmoyers/parallel.htm (1997).
8. D. Petcu, Working with multiple Maple kernels connected by Distributed Maple or PVMaple, Preprint RISC 18-01 (Linz, 2001).
9. D. Petcu, PVMaple: a distributed approach to cooperative work of Maple processes, in LNCS 1908: PVM-MPI'00, eds. J. Dongarra et al. (Springer, 2000).
10. D. Petcu, A networked environment for solving large mathematical problems, accepted for publication in LNCS: Proc. of EuroPar 2001 (Springer, 2001).
11. D. Petcu, Solving large systems of differential equations with PaViS, accepted for publication in LNCS: Proc. of PPAM 2001 (Springer, 2001).
12. L. de Rose et al., Falcon: a Matlab interactive restructuring compiler, in Languages and Compilers for Parallel Computing (Springer, 1995).
13. W. Schreiner, Developing a distributed system for algebraic geometry, in EuroCM-Par'99, ed. B. Topping (Civil-Comp Press, Edinburgh, 1999).
14. K. Siegl, Parallelizing algorithms for symbolic computation using ||Maple||, in 4th Symp. on Principles and Practice of Parallel Programming (ACM Press, San Diego, 1993).
15. A. Trefethen, MultiMatlab, www.cs.cornell.edu/Info/people/Int/multimatlab.
16. Wolfram Research, Parallel Computing Toolkit, www.wolfram.co.jp/news/pct.
DEVELOPMENT OF PARALLEL PARADIGMS TEMPLATES FOR SEMI-AUTOMATIC DIGITAL FILM RESTORATION ALGORITHMS
G. SARDISCO - A. MACHI
IFCAI-CNR, Via U. La Malfa 153, 90146 Palermo, ITALY, E-mail: sardisco\[email protected]
This paper describes the initial results of applying structured parallel paradigms to optimise algorithms for semi-automatic digital motion picture restoration. We adopt a skeleton-based compile-time approach to optimise the frame delivery time and the sequence completion time. We use two meaningful image-processing algorithms as case studies for selecting appropriate elementary and composed stream-parallel and data-parallel constructs. As a trade-off between parallel efficiency and template complexity we propose to use a small set of basic paradigms (farm, pipe, map/reduce) and to allow just one level of composition (pipe and comp). Finally, we describe the implementation templates of the paradigms on a shared-memory architecture and present the results of some benchmarks on the parallel code obtained using the developed templates. The obtained results are consistent with the expected performance.
1 Introduction
Digital film restoration is an essential tool for saving the large body of recorded events from film spoiling and for ensuring acceptable quality on the new media. Several algorithms have been proposed to identify defects and restore damaged frames [6, 7, 13], but their complexity and the high resolution of frames (3K x 2K) make the job economically affordable only for a few documents of major historical and cultural relevance. Among others, researchers co-operating in the FRAME Project (ESPRIT 24220) have developed a semi-automatic digital film restoration process that demonstrates the advantages of exploiting parallelism to lower the processing time on medium-grain distributed-memory parallel systems [8]. Plaschzug reports a lower limit of about one minute per frame for automatic batch-processing of moderately damaged sequences at HDTV resolution on clusters of 16 Hypersparc or 8 Ultrasparc processors with an Elan2 fat-tree network [11]. In the proposed approach, the operator browses the sequence and bounds the sections (around damaged frames) on which to apply an appropriate series of spatio-temporal filters. These sections are batch-processed and the results are revised to confirm the detection of defects before actual removal. Trial processing of short sequences is suggested for tuning algorithm parameters before batch-processing. In a recent study [10], the authors propose to use video indexing techniques to automatically identify inter-frame continuity violations and to support the operator in his pre-processing work. Simple algorithms operating at medium resolutions have proved effective in identifying major discontinuities in a few seconds; hence, interactive processing looks within reach of medium-grain parallel processing.
Interactive processing can be even more fruitful if distinct steps in the series of filters are performed and tested separately before batch-processing. Implementing such a facility requires building a library of parallel image-processing modules and defining some composition rules for their sequential and pipelined execution. As a methodology able to ensure programmability, portability and performance in the development of parallel applications, some researchers [2, 3] suggest the use of structured parallelism; others [1, 4] also propose to utilize a structured parallel language. In this paper, we describe an attempt to use the structured parallel approach for implementing a minimal subset of constructs that allow easy and effective creation of parallel code, exploiting a library of image-processing and pattern-recognition algorithms. This activity is carried out as part of a European Commission FESR Project aimed at developing tools for the optimisation of semi-automatic digital film restoration.
2 Structured parallelism
Developing parallel programs is still a difficult and expensive task due to the many issues that must be addressed. Tools are needed for helping programmers to easily write portable and efficient parallel applications with high-level abstractions. On the other hand, building such tools requires imposing restrictions on the parallel structure of applications to make the problem tractable [9]. Interesting restricted models are skeleton-based models. These models provide users with a small set of parallel forms, called skeletons, abstracting typical parallel paradigms. The parallel structure of applications can be expressed by instantiating, composing and (possibly) nesting skeletons. This is not a limiting approach, as most applications share the same structure and use the same kinds of parallelism. Initially proposed skeleton-based models [2, 3] provide users with a flat library of skeletons. A skeleton instance is generated by inserting the user code into a predefined process graph that implements the skeleton on the target machine. A major problem of both skeleton libraries is the lack of expressiveness and flexibility, since a parallel application must fit in a library skeleton. More recent works, SCL [4] and P3L [1], propose a structured co-ordination language that allows complex parallel structures to be built as the composition of a small set of simple skeletons (structured parallel programming, by analogy with structured sequential programming). The skeletons included in P3L are suited to describing the inner parallelism of most image-processing algorithms. However, we think that some extensions are needed for efficiently exploiting skeleton-based parallelism in the target application, such as mechanisms to implement pre-compiled libraries, dynamic dimensioning of input and output data, and run-time selection of templates. The development of a parallel compiler (and its related environment) is beyond the scope of this project. We adopt the structured parallel approach to develop simple
tools that can help us build a parallel image-processing library using templates of basic parallel paradigms. The developed templates can eventually be included in the back-end of a skeleton-based compiler in a future project. In this respect, we are presently co-operating with the group of researchers that developed SkIE [5], a parallel programming environment based on P3L. Basic parallel paradigms can be classified into two classes according to whether the parallelism is exploited on different elements of the stream (stream parallelism) or on different parts of the same element (data parallelism) [9]. The stream-parallel paradigms commonly used are: pipeline (different stages of the computation for different elements), farm (computation of different elements) and iterative (pipeline with an unbounded number of stages). The most commonly used data-parallel paradigms are: basic (independent data-parallel activities), composed (data-parallel activities with a fixed number of interactions) and iterative (data-parallel activities with an unbounded number of interactions).
3 Structured description of parallel algorithms
In this section, we analyse the parallel structure of two image restoration algorithms used for indexing of sequences and detection of line scratches. Even if simple, these two algorithms contain enough elements to show both the power of the theoretical approach and the practical usefulness of having access to a number of predefined skeletons to implement parallel algorithms.
3.1 Sequence indexing
The main objective of indexing is to segment a sequence into shots that are homogeneous in scene content. Shot boundaries are defined by sudden changes in the scene due to camera breaks, or by smooth changes due to camera work, editing effects, or just rapid movements of scene objects. The indexing algorithm proposed in [10] calculates two different indexes of distance between each pair of successive frames. It then analyses the difference time series to detect impulsive fluctuations due to scene changes and smooth fluctuations due to dirty film joints or to sudden motion of large scene components. The computational complexity of the frame comparison stage is roughly linear with respect to the number of pixels and is not data dependent. Its detection and precision power remain high even below TV resolution, and the sequential delivery time at half TV linear resolution is compatible with interactive processing. Data movements are limited to the reading of input frames; the intermediate results are three histograms and three index values for each frame. Time-series analysis depends on the frame comparison stage. Its computational complexity is not relevant, being just linear with respect to the number of frames, so a parallel implementation of this stage is not worthwhile.
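As a concrete illustration of the frame-comparison stage, the sketch below computes one plausible distance index between two frames from their grey-level histograms; the actual indexes used in [10] are not specified here, and all names are our own.

#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Grey-level histogram of one frame (pixels stored as 8-bit grey values).
using Histogram = std::array<std::size_t, 256>;

Histogram greyHistogram(const std::vector<std::uint8_t>& frame) {
    Histogram h{};                       // all bins start at zero
    for (std::uint8_t pixel : frame) ++h[pixel];
    return h;
}

// One possible inter-frame distance index: normalised L1 difference of the
// two grey-level histograms (roughly linear in the number of pixels).
double frameDistance(const std::vector<std::uint8_t>& a,
                     const std::vector<std::uint8_t>& b) {
    const Histogram ha = greyHistogram(a);
    const Histogram hb = greyHistogram(b);
    double d = 0.0;
    for (std::size_t bin = 0; bin < ha.size(); ++bin)
        d += std::fabs(double(ha[bin]) - double(hb[bin]));
    return d / double(a.size());
}

An impulsive peak in the resulting difference time series then marks a candidate shot boundary, while a smooth bump suggests camera work or motion of large scene components.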
Both the stream-parallel and the data-parallel structured approach can be used to build a parallel version of the indexing algorithm. In both cases, a composition of elementary stages is required. Using the stream-parallel approach, a stream of frames is built and distributed in chunks to a set of processors to perform comparisons between pairs of frames. Time-series analysis is executed sequentially after the ordered collection of the distance indexes. Hence, a structured stream-parallel implementation is a composition of one farm and one sequential stage. We expect to pay parallelism overhead due to duplicate readings of frames at chunk borders and to re-ordering of data. Parallel efficiency is expected to depend on the ratio between the number of frames in the sequence and the cardinality of the processor set. The data-parallel approach, even if feasible, is expected to perform with lower efficiency because a rearrangement of the serial algorithm is required to cope with the distortion of the comparison topology introduced by whatever data partitioning scheme is used.
             Input   Frame difference   Time-series analysis
180x128      1       18                 10
720x512      7       275                10
2880x2048    140     4400               10

Table 1. Measured times (ms) for sequential sequence-indexing at different resolutions.
3.2 Line scratch detection
Persistent line scratches are commonly caused by film abrasion when it passes over particles caught in the transport mechanism. Often scratches cross the entire frame along the direction of film transport and can occur in nearly the same location in successive frames. At scratch locations, the temporal continuity of image brightness is broken. The detection algorithms proposed in the literature [6, 7, 12] firstly enhance vertical image structures by applying a vertically directed Gaussian or median filter, secondly detect local brightness maxima and minima, thirdly search for the alignment of local extrema using the Hough transform [7] or vertical histograms [10], and finally evaluate a scratch mask. In this paper, we limit ourselves to the treatment of fixed scratches and assume that scratch masks are obtained as vertical strips around peaks of the column histogram of brightness extrema pixels. The computational complexity of the first three steps is roughly linear with respect to the number of pixels, because the filters use convolutions or ordering over a fixed topological neighbourhood of each image pixel. Histogramming is linear with the number of extrema pixels and slightly data dependent. Scratch strip selection on the histogram is lightweight, so a parallel implementation is not worthwhile.
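A minimal sketch of the last two steps just described (column histogram of extrema pixels and strip selection around its peaks), assuming the binary extrema mask has already been produced by the filtering steps; the threshold and strip half-width are illustrative parameters, not values from the paper.

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// 'mask' is a row-major binary image, true where a brightness extremum was
// detected; the result is a list of [first, last] column ranges (scratch strips).
std::vector<std::pair<std::size_t, std::size_t>>
scratchStrips(const std::vector<bool>& mask, std::size_t width, std::size_t height,
              std::size_t minCount, std::size_t halfWidth) {
    std::vector<std::size_t> histogram(width, 0);    // extrema per column
    for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 0; x < width; ++x)
            if (mask[y * width + x]) ++histogram[x];

    std::vector<std::pair<std::size_t, std::size_t>> strips;
    for (std::size_t x = 0; x < width; ++x) {
        const bool peak = histogram[x] >= minCount &&
            (x == 0 || histogram[x] >= histogram[x - 1]) &&
            (x + 1 == width || histogram[x] >= histogram[x + 1]);
        if (peak)
            strips.emplace_back(x >= halfWidth ? x - halfWidth : 0,
                                std::min(width - 1, x + halfWidth));
    }
    return strips;
}

Only the histogramming loop is heavy enough to be worth parallelising; the strip selection is the lightweight final step mentioned above.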
The input data are single frames, the intermediate results are several filter masks, and the output is a scratch mask with a list of scratch descriptors for each frame. Both the stream-parallel and the data-parallel approaches can be used to build a parallel version of the scratch detection algorithm. In both cases, a composition of elementary stages is required. Using the stream-parallel approach, a stream of frames is built and frames are distributed to a set of processors that perform, in cascade, filtering, histogramming and scratch detection on the received frame. Therefore, a structured stream-parallel implementation is a pipeline including one farm. No data dependency exists in the frame analysis and we expect just minimal parallelism overhead for ordering the output data. The time spent on input is two orders of magnitude lower than the time spent on computing, so a linear speed-up is expected on a medium-grain parallel system, limited by saturation of its I/O subsystem. Using the data-parallel approach, each frame is partitioned with overlapping of rows and distributed to the set of processors. Convolutions are applied in the partitions and the overlapping rows are updated between filter steps. Overhead is expected to be paid because of data re-processing in the overlapped areas.
Table 2. Measured times (ms) for sequential scratch-detection at different resolutions.
4 Development of constructs
The analysis of the sample algorithms shows that usually the parallelism of a task cannot be expressed using a single parallel paradigm, but only through the composition of different paradigms. Specifically, the parallel task structure (Figure 1) is a pipeline of sequential and parallel stages working on a stream of frames, where the first and last stages are sequential modules generating the initial data and consuming the final results, and the inner stages are parallel (and sequential) modules processing the stream.
Figure 1. Parallel structure of a task.
To describe such a structure we need constructs that model the most common stream and data parallel paradigms as well as constructs allowing their composition. The set of constructs we consider minimal for our purposes includes:
• SEQ (sequential),
• PIPE and FARM (stream-parallel),
• MAP and COMP (data-parallel).
The SEQ construct models a computation to be executed sequentially; the FARM and the PIPE constructs model the farm and the pipeline paradigms, respectively; the MAP construct models a restricted version of the basic data-parallel paradigm with an optional reduce collection strategy; the COMP construct models the composed data-parallel paradigm by allowing the combination of basic data-parallel (and sequential) modules (Figure 2). In the following, we describe the implementation templates for the MAP, FARM, and COMP constructs on SMP systems. The implementation of the PIPE construct is a current topic of research.

composed  ->  COMP = { SEQ | MAP }+
basic     ->  SEQ, FARM, MAP

Figure 2. Composed and basic constructs.
4.1 Implementation templates
Each construct is implemented by a network of interacting processes defined on top of an abstract parallel machine (implementation template). The abstract machine is a distributed-memory system with a set of fully interconnected processing elements (a node can exchange messages directly with any other node). The set of primitives provided is very small: process activation and point-to-point send/receive operations (similar mechanisms are found in any commercial machine). Each network process is mapped onto a distinct node. An implementation template defines:
• a process graph (parametric as to the number of nodes),
• a set of process templates (parametric as to the user-defined functions).
The implementation template of the FARM construct is shown in Figure 3. Its process graph involves two kinds of processes: an emitter/collector (E/C) and a pool of workers (W). E/C is the process that controls the data distribution and collection; the Ws are the processes which execute the computation. The process E/C executes a cycle where it receives from the in channel and distributes to the workers P inputs (or the end-of-stream), then collects the results from the workers and sends them to the out channel, maintaining the input ordering. The process W executes a cycle where it gets an input (or the end-of-stream) from E/C, performs its computation and gives the result back to it.
The process templates for E/C and W are stored in a process template library. The template of W is parametric as to the user-defined function to be computed. The template of E/C is parametric as to the types of input and output data and as to the number of worker processes. The parallelism management is fully provided by the process templates.
Figure 3. FARM implementation template.
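The following is a schematic sketch (in C++, with threads standing in for the abstract machine's nodes) of the two cycles just described; the Channel type and all names are our own illustration, not the project's actual template code.

#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <utility>
#include <vector>

// Blocking point-to-point channel standing in for the abstract machine's
// send/receive primitives; close() plays the role of the end-of-stream mark.
template <typename T> class Channel {
public:
    void send(T value) {
        std::lock_guard<std::mutex> lock(m_);
        q_.push(std::move(value));
        cv_.notify_one();
    }
    void close() {
        std::lock_guard<std::mutex> lock(m_);
        closed_ = true;
        cv_.notify_all();
    }
    std::optional<T> receive() {                // empty optional == end-of-stream
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        T value = std::move(q_.front());
        q_.pop();
        return value;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
    bool closed_ = false;
};

// Worker cycle: get an input (or end-of-stream), compute, give the result back.
template <typename Task, typename Result, typename F>
void worker(Channel<Task>& fromEC, Channel<Result>& toEC, F compute) {
    while (auto task = fromEC.receive())
        toEC.send(compute(*task));
    toEC.close();
}

// Emitter/collector cycle: distribute up to P inputs per round, then collect
// the results from the same workers in issue order, preserving input ordering.
template <typename Task, typename Result>
void emitterCollector(Channel<Task>& in, Channel<Result>& out,
                      std::vector<Channel<Task>>& toW,
                      std::vector<Channel<Result>>& fromW) {
    const std::size_t P = toW.size();
    bool more = true;
    while (more) {
        std::size_t issued = 0;
        while (issued < P && more) {
            if (auto task = in.receive()) toW[issued++].send(*task);
            else more = false;                  // end-of-stream seen on in
        }
        for (std::size_t w = 0; w < issued; ++w)
            out.send(*fromW[w].receive());
    }
    for (auto& c : toW) c.close();
    out.close();
}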
The implementation template of the MAP construct is quite similar. The main difference is that E/C is also parametric as to two user-defined functions, D and C, which express the operations of data distribution and results collection, respectively. The input data can be distributed using scatter (partitioned among the workers) and broadcast (replicated to the workers) operations. Workers' results can be combined using simple gather operations or more complex reduce (repeated application of a binary operator) strategies. As to the COMP construct, since it models a data-parallel computation that can be divided into a sequence of steps, its implementation template is an extension of the MAP template, where the process E/C operates both as the distributor/collector of every step and as the manager of the sequence, especially handling the passing of parameters from a step to the following ones.
4.2 Task management
The digital restoration application is implemented in the FESR project according to a three-tier client/server model. The application clients submit execution requests entered by users through a graphical interface to an application server that provides task management, graphics and file services. Sequences are provided on demand by a parallel video-server. The task process network, hosted on the nodes of a parallel server, is created at run-time by an application server component called the dispatcher. Each node of the network executes a generic process, that is, it can serve any process of any task.
When invoked to start a task, the dispatcher selects the necessary nodes among the idle ones and sends them appropriate control data (process control block). These data define everything the process needs to know about the kind of computation it is requested to do, such as the operation identifier and parameters, the skeleton type, the process role, the I/O channel interconnections, and a few other control data. The task execution requests coming from users are inserted into a FIFO queue handled by a simple scheduler. When a task is selected, it is assigned a number of nodes based on the available resources and the dispatcher is invoked. The maximum number of nodes allocated to a task is currently indicated by the user (the minimum one is determined by the parallel structure). The scheduler goes on getting and dispatching tasks from the queue as long as there are idle nodes.
4.3 Implementation details
The abstract machine we have used to define the skeleton templates is implemented on a shared-memory multi-processor running a multi-threading operating system. For the sake of efficiency, we have chosen to exploit threads as task processes. Using threads, the send/receive operations can be realised with concurrent accesses to shared buffers, synchronised by means of mutexes and condition variables. With threads, an environment common to processes can also be easily realised. Communication costs are very low because a send or receive operation results in copying pointers or memory areas, depending on the context it is used in. The process template consists of two C++ modules implementing the fixed and the parametric part of the computation. The fixed part (parallelism handling) directs the computation with calls to a set of generic functions (i.e., with generic pointers as parameters) defined in the parametric part. These functions, along with a few data structures, are automatically instantiated, by means of a simple translator, with the sections of code defined by the user in a specific file (construct definition). They handle such jobs as process environment creation/deletion, data distribution/collection, and processing (workers only). Function parameters are pointers to structures containing the input, output and environment variables. They are passed between different functions and successive calls of the same function, thus keeping the state of the computation. This structure also applies to the COMP construct, which adds a special function for linking the input variables of a step to the output variables of the previous ones. This is done using pointers to variables and a link-table automatically built from the COMP definition file which describes the sequence of modules to be composed.
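A schematic illustration of the fixed/parametric split described above; the structure and function names are hypothetical, chosen only to show how a fixed template can drive operation-specific code through generic pointers, and are not the project's actual interface.

#include <cstddef>

// Parametric part of a process template, as seen by the fixed part: every
// operation-specific action is reached through a generic function pointer,
// and all data travel as pointers to user-defined input/output/environment
// structures, so the same fixed code can drive any module.
struct ModuleOps {
    void* (*create_env)(void);                                   // build process environment
    void  (*destroy_env)(void* env);                              // release it
    void  (*distribute)(void* env, const void* in,
                        void** parts, std::size_t n_workers);     // E/C: scatter/broadcast
    void  (*compute)(void* env, const void* part, void* result);  // worker body
    void  (*collect)(void* env, void* const* results,
                     void* out, std::size_t n_workers);           // E/C: gather/reduce
};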
4.4 Environment and tools
The development of a parallel module in our system consists of three steps: module definition, template instantiation, compilation and library inclusion.
In the first step, the module developer is provided with a ready-to-run construct definition file to be filled with operation-specific sections of sequential code. These sections define the module's input and output parameters, the data distribution and collection, and the processing environment. In the second step, a simple tool is called to produce the module source code by instantiating the process templates with the sections of code of the definition file. In the third step, the module source code is compiled and included in a library of parallel modules. The developed module can be used alone to build a parallel task (by adding the producing and consuming stages) or together with other modules, inside a COMP construct, to build a more complex parallel module. As we said before, the COMP definition file describes, in the form of a sequence of calls, the modules to be composed and the relationships among the modules' input and output variables. The main advantage of our approach is that all parallel details are hidden from the module or task developer, who can concentrate on the problem at hand and assemble building blocks almost as in a sequential environment, leaving parallelism issues with the template developer. However, since the support tools are not compilers but simple translators, the module developer has to follow some rules when dealing with the input, output and environment variables, such as using the -> operator for access and using pointers, plus taking care of storage space allocation/release for dynamic parameters.
5 Experimental results
Benchmarks have been executed on a Sun Ultra Enterprise 6000 symmetric multiprocessor system, using up to 18 processors. Maximum performance of the parallel algorithms has been roughly estimated with the following model:
S(N) = 1 / ((sf + (1 - sf)/N) * of)

where:
• S(N) is the maximum expected speed-up with N workers,
• sf is a factor accounting for the fraction of code that is intrinsically serial,
• of is a factor accounting for the re-processing overhead due to data overlap.

sf has been evaluated by running the parallel code with no computation. It includes I/O, intermediate data scatter/gather and synchronisation between the emitter-collector and the workers. It does not account for data dependency. of has been estimated from the number of overlapped rows on each frame in the MAP module or of re-processed frames on chunk boundaries in the FARM module. The maximum expected speed-up for the two algorithms is shown in Table 3.
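A small helper showing one reading of the model (assuming, as in the reconstructed formula above, that the overhead factor scales the whole Amdahl-style term); sf and of are the measured factors and N the number of workers.

// One reading of the model above: Amdahl-style term scaled by the overhead
// factor. sf and of are measured as described in the text.
double expected_speedup(double sf, double of, int N) {
    return 1.0 / ((sf + (1.0 - sf) / static_cast<double>(N)) * of);
}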
The moderate expected speed-up for the indexing is mainly due to its relevant serial fraction. As to the scratch algorithm, the parallel overhead factor dramatically affects performance at low resolution. In fact, using 16 workers at low resolution, partitions are composed of just 8 proper rows and 14 overlap ones.

Table 3. Performance model values for the sequence-indexing and scratch-detection algorithms.

            Small                   TV                      Cine
            sf      of     S(16)    sf      of     S(16)    sf      of     S(16)
Indexing    0.05    1.03   8.30
Scratches   0.016   2.96   3.92     0.004   1.40   9.45     0.004   1.10   12.20
Figure 4. Expected and measured speed-up for the sequence-indexing algorithm (FARM template).
Figure 5. Expected (e.) and measured (m.) speed-up for the scratch-detection algorithm (MAP template) at Small, TV and Cine resolutions.
Figures 4 and 5 show respectively the expected speed-up values for the FARM indexing and for the MAP scratch detection, in comparison with the measured ones. The experimental results are consistent with the model and the differences can be attributed to concurrency and to cache effects on the video-server. Benchmarks have also been performed on the COMP scratch detection realized as a sequence of two pre-compiled MAP modules. Performance results differ from the MAP ones only by a small overhead of 10-20 ms, due to the extra gathering and scattering of intermediate parameters between modules.
6 Conclusions and future work
In this paper, we have presented some initial results of applying structured parallel paradigms to optimise two representative algorithms for semi-automatic digital film restoration. From the analysis of the algorithms' parallelism we propose the seq, farm and map/reduce constructs as primitive elements of a minimal set of skeletons, sufficient to support the implementation of a specialized parallel image-processing library. The pipe and comp skeletons allow the composition of basic constructs for the processing of frame sequences and the reuse of pre-compiled parallel modules. Templates for the farm, map/reduce and comp constructs have been developed on a shared-memory architecture and a translator has been built to help programmers generate structured parallel code. Using these constructs, alternative parallel versions of each algorithm can be realized to optimise the delivery time during interactive sessions or the completion time during batch ones. A simple model, accounting for the intrinsic serial fraction of each algorithm and for the main parallelism overheads, has been used to estimate the maximum performance of the parallel code. Model parameters and actual algorithm performance have been measured on a Sun Ultra 6000 SMP. The obtained results are consistent with the expected performance within 10% when using up to 16 workers. Activity is on-going to implement a pipe skeleton template. This construct will allow data I/O to be overlapped with processing in stream-parallel computations, whose performance is currently limited by concurrency on the video-server. Another field of future activity is the development of a smart scheduler, able to exploit the performance profile of the pre-compiled parallel algorithms to achieve optimal dynamic partitioning of the processor pool.
7 Acknowledgements
We wish to thank F. Collura for participating in several productive discussions and for supporting system infrastructure on the Sun Ultra Enterprise 6000.
References
1. B. Bacci, M. Danelutto, S. Orlando, S. Pelagatti, M. Vanneschi, P3L: a structured high level programming language and its structured support, Concurrency: Practice and Experience, 7(3):225-255, May 1995.
2. M. Cole, Algorithmic Skeletons: Structured Management of Parallel Computation, The MIT Press, Cambridge, Massachusetts, 1989.
3. J. Darlington, A. J. Field, P. G. Harrison, P. H. J. Kelly, R. L. While, Q. Wu, Parallel programming using skeleton functions, in A. Bode, M. Reeve and G. Wolf, editors, Proc. of PARLE'93, volume 694 of LNCS, pages 146-160, Springer-Verlag, 1993.
4. J. Darlington, Y. Guo, H. W. To, J. Yang, Parallel skeletons for structured composition, in Proc. of the 5th ACM/SIGPLAN Symposium on Principles and Practice of Parallel Programming, Santa Barbara, California, July 1995, SIGPLAN Notices 30(8), 19-28.
5. B. Bacci, M. Danelutto, S. Pelagatti, M. Vanneschi, SkIE: a heterogeneous environment for HPC applications, Parallel Computing, 25:1827-1852, December 1999.
6. L. Joyeux, O. Buisson, B. Besserer, S. Boukir, Detection and removal of line scratches in motion picture films, CVPR'99, IEEE, Fort Collins, Colorado, June 1999.
7. A. C. Kokaram, Motion Picture Restoration, Springer-Verlag, London, 1998.
8. LIMELIGHT ESPRIT Project, www.joanneum.ac.at/iis/projects/limelight/.
9. S. Pelagatti, Structured Development of Parallel Programs, Taylor & Francis, 1998.
10. A. Machi, M. Tripiciano, Video shot detection and characterisation in semi-automatic digital video restoration, IAPR Proc. of 15th ICPR, Barcelona 2000, pages 855-959, IEEE Comp. Soc., 2000.
11. W. Plaschzug, G. Hejc, ESPRIT Project FRAME Public Final Report, www.hpcn-ttn.org/newActivityDetail.cfm?Activities_ID=1.
12. T. Saito, T. Komatsu, T. Hoshi and T. Ohuchi, Image processing for restoration of old film sequences, Proc. ICIAP 10, pp. 709-714, Venezia 1999, IEEE Comp. Soc., 1999.
13. P. Shallauer, A. Pinz, W. Haas, Automatic restoration algorithms for 35mm film, Videre: Journal of Computer Vision Research, Vol. 1, n. 3, 1999, The MIT Press.

Work partially sponsored by C.E. 1403/95 FESR/FSE MURST Project Ricerca Sviluppo ed Alta Formazione.
EXPLOITING THE DATA-LEVEL PARALLELISM IN MODERN MICROPROCESSORS FOR NEURAL NETWORK SIMULATION
Fast SIMD-parallel execution units are available in most modern microprocessors. They provide an internal parallelism degree in the range from 2 to 16 and can accelerate many data-parallel algorithms. In this paper the suitability of five different SIMD units (Intel's MMX and SSE, AMD's 3DNow!, Motorola's AltiVec and Sun's VIS) for the simulation of neural networks is compared. The appropriateness of the instruction sets for the fast implementation of typical neural network operations is analyzed and the results of an experimental performance study are presented.
1 Introduction
A few years ago, many researchers studied the parallel implementation of artificial neural networks. Their objective was to accelerate, in particular, the compute-intensive training phase by parallelizing the learning algorithm and by mapping the neural network and the training data set onto the processors of a parallel computer architecture1. Today, modern microprocessors operate at clock frequencies of 1 GHz or even higher and already achieve satisfactory performance for many neural network applications. However, for certain compute-intensive learning tasks or real-time pattern recognition tasks the power of a single processor is often still insufficient. If large parallel systems cannot be used (e.g. on autonomous mobile robots), the instruction level parallelism (ILP) and data level parallelism (DLP) offered by current processors must be exploited to accelerate the neural network simulation on a single-processor system. ILP represents an implicit parallelism which is based on the simultaneous execution of instructions by the superscalar processor architecture. Hammami2 has shown that for four different neural network applications the performance gain due to ILP is usually low. Internal data dependencies prevent the processor from executing instructions simultaneously. DLP was introduced in most microprocessor architectures several years ago to accelerate, in particular, signal and multimedia applications. Here several 8-bit, 16-bit or 32-bit data elements are packed into a 64- or 128-bit register and all arithmetic operations can be executed on corresponding elements of two registers in a SIMD-parallel
mode. The user must explicitly program all data-parallel operations and is responsible for the attainable performance. Neural network algorithms are similar to signal processing algorithms because they are also based on vector/matrix operations. Furthermore, the encoding of all neural network variables in 16-bit words is sufficient for training3. Thus, DLP seems to be suitable for the fast simulation of neural network algorithms. However, no detailed analysis of the achievable performance gain has been published so far. Only Gaborit et al.4 have shown that Intel's MMX can accelerate distance calculations in a neural network application; however, they do not consider training. In this paper, a detailed analysis of the suitability of five different SIMD-parallel execution units for the fast simulation of neural networks is provided. First, the most important neural operations that must be accelerated by exploiting DLP are presented in the next section. Section 3 compares the appropriateness of the instruction sets of all five selected SIMD units for implementing neural operations. The results of an experimental performance study are presented in Section 4. Suggestions for future improvements of SIMD execution units conclude this paper.
2 Data-parallel neural network simulation
For a data-parallel simulation of an artificial neural network, the underlying algorithm must first be formulated in vector and matrix notation. Each vector represents the state or output variables of all neurons in one network layer, and the elements of each matrix represent the weights connecting two neuron layers. If the SIMD-parallel hardware offers a p-way parallelism, the operations on p neighboring vector or matrix elements can be performed simultaneously. The following five vector-matrix operations represent the typical, most compute-intensive operations during neural network simulation:
I. calculation of the squared Euclidean distances d_j between a vector u and all columns j of a weight matrix W: d_j = sum_i (u_i - w_ij)^2
II. multiplication of a vector u with a matrix W: x_j = sum_i u_i w_ij
III. multiplication of a vector delta with the transposed matrix W^T: z_i = sum_j delta_j w_ij
IV. addition of a scaled outer vector product to a weight matrix W: w_ij = w_ij + eta y_i delta_j
V. addition of a scaled distance-vector product to a weight matrix W: w_ij = w_ij + eta (u_i - w_ij) a_j delta_j
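Operations I and II correspond to simple reference loops of the following shape (a plain, non-SIMD sketch; the row-major layout and the names are our own choices, not code from the paper). With p-way SIMD hardware, neighbouring elements can then be processed in groups of p.

#include <cstddef>
#include <vector>

// Operation I: squared Euclidean distances between u and every column j of W.
// W is stored row-major with n rows and m columns; d must have m elements.
void squaredDistances(const std::vector<float>& u, const std::vector<float>& W,
                      std::size_t n, std::size_t m, std::vector<float>& d) {
    for (std::size_t j = 0; j < m; ++j) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < n; ++i) {
            const float diff = u[i] - W[i * m + j];
            sum += diff * diff;                 // d_j = sum_i (u_i - w_ij)^2
        }
        d[j] = sum;
    }
}

// Operation II: multiplication of the vector u with the matrix W.
void vectorMatrix(const std::vector<float>& u, const std::vector<float>& W,
                  std::size_t n, std::size_t m, std::vector<float>& x) {
    for (std::size_t j = 0; j < m; ++j) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < n; ++i)
            sum += u[i] * W[i * m + j];         // x_j = sum_i u_i w_ij
        x[j] = sum;
    }
}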
3 Instruction sets of SIMD extensions
Five SIMD-parallel execution units have been selected for this experimental study. Table 1 illustrates the main differences: Intel's MMX5 (also available in current AMD processors) and Sun's VIS6 allow only operations on integer data, whereas Intel's SSE7 and AMD's 3DNow!8 support only the 32-bit floating point data format. SSE and 3DNow! provide a few additional integer instructions for improving MMX. Motorola's AltiVec9 is the only SIMD extension that is designed as a general-purpose vector unit. It is suited for both integer and floating point calculations and offers the most powerful instruction set. Table 2 lists all SIMD instructions that are required for the data-parallel implementation of the neural operations I to V on all five SIMD units. Whereas a unique instruction is available for each SIMD-parallel operation on 32-bit float data elements, a variety of arithmetic instructions (for different data sizes, for different types, with/without rounding, with/without saturation) exists in the case of integer operands. According to the results of Holt3, a precision of 16 bit should be selected for neural network implementations on integer SIMD units. Unfortunately, the frequently used SIMD-parallel multiplication of 16-bit numbers must be realized in a different way on all five units. Fig. 1a shows the correct fixed point multiplication scheme applicable in all cases. Here certain 16-bit result windows must be selected out of p simultaneously computed 32-bit products. However, such a powerful instruction is not available on any multimedia unit. Instead it must be simulated by sequences of several instructions. MMX computes the 16 upper and the 16 lower bits of four 32-bit products separately with two different instructions. Two merge, two shift and one pack instruction are required for storing the four correct 16-bit results in a single register (see Fig. 1b). AltiVec multiplies corresponding 16-bit elements on even or on odd index positions of two 128-bit registers. By using five additional instructions (2 x shift, 2 x merge and 1 x pack, see Fig. 1c) a 128-bit register containing eight correct 16-bit results is computed.

Table 1. Characteristics of the selected SIMD-parallel execution units

                      MMX (Intel)
available in          Pentium II
register bits         64
data types            8, 16, 32 bit
parallelism           2-8
no. of instr.         57
latency (in cycles)   1 (mult.: 3)
Table 2. Instruction sets of the five SIMD units - MMX (Intel), SSE (Intel), 3DNow! (AMD), VIS (Sun), AltiVec (Motorola). Abbreviations: s = signed, u = unsigned, sat = saturation, m = modulo 2^n, r = rounded, h = only higher bits, l = only lower bits, a = arbitrary bits; further marks indicate results stored in the upper bits or filled up with sign-extension.

Operation       Data types                       MMX (Intel)
add/subtract    16s +- 16s                       16s[m], 16s[sat]
                float +- float                   -
multiply        8s x 16s, 8u x 16s               -
                16s x 16s                        16s[h], 16u[l]
                float x float                    -
multiply&add    16s +- 16s x 16s                 -
                float +- float x float           -
                16s x 16s + 16s x 16s            32s[m]
reduction       by +                             -
shifts          32-bit elements                  +
                64/128-bit register              +
reordering      pack (32 -> 16)                  +[l,sat]
                merge/unpack                     8, 16, 32
                permutation                      -
load/store      16-bit element                   -
                64/128-bit register              +
eight correct 16-bit results is computed. The VIS multiplier shown in Fig. Id offers two 8 x 16 bit -» 24 bit instructions that multiply either the 8 upper or the 8 lower bits of two 16-bit elements with two other 16-bit elements. After adding the partial 24-bit products (extended to 32 bit), two correct 16-bit results are generated by one pack instruction. Thus, up to seven instructions and even more clock cycles are needed for the SIMD-parallel computation of two, four or eight 16 x 16 bit products. MMX, VIS and AltiVec offer also a 16 x 16 bit —• 16 bit multiply instruction that calculates only the upper 16 bits of each product. However such a simplified fixed point multiplication is applicable only in certain cases. For the SIMD-parallel computation of vector-matrix products a special sum of products instruction calculating aj x bi + at+i x 6 i + 1 and a vector reduction (for the final addition of the partial sums) are useful. Unfortunately, they are available only in some instruction sets for a few data types (see Table 2).
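As an illustration of how such a sum-of-products instruction is used, the following C sketch computes a 16-bit dot product (the core of operation II) with the MMX intrinsic _mm_madd_pi16; the function name, the alignment of the arrays and the assumption that n is a multiple of four are ours, and the final addition of the two partial sums must be done with scalar code because MMX has no vector reduction:

    #include <mmintrin.h>

    /* 16-bit dot product using the MMX sum-of-products instruction (pmaddwd);
       n is assumed to be a multiple of 4, u and w are assumed 8-byte aligned. */
    int dot16(const short *u, const short *w, int n)
    {
        __m64 acc = _mm_setzero_si64();
        for (int i = 0; i < n; i += 4) {
            __m64 a = *(const __m64 *)&u[i];
            __m64 b = *(const __m64 *)&w[i];
            acc = _mm_add_pi32(acc, _mm_madd_pi16(a, b));   /* two 32-bit partial sums */
        }
        int lo = _mm_cvtsi64_si32(acc);                     /* lower partial sum */
        int hi = _mm_cvtsi64_si32(_mm_srli_si64(acc, 32));  /* upper partial sum */
        _mm_empty();                                        /* leave MMX state   */
        return lo + hi;                                     /* scalar reduction  */
    }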
Figure 1. Implementation of 16 x 16 bit -> 16 bit fixed point multiplication
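For the MMX scheme of Fig. 1b, one possible realization with C compiler intrinsics is sketched below; the function name fixmul16 and the fraction parameter frac are our own choices, and the sequence (two multiplies, two merges, two shifts, one pack) mirrors the instruction count given above:

    #include <mmintrin.h>

    /* Four parallel 16 x 16 -> 16 bit fixed point products on MMX (cf. Fig. 1b);
       frac selects the 16-bit result window inside each 32-bit product. */
    static __m64 fixmul16(__m64 a, __m64 b, int frac)
    {
        __m64 lo  = _mm_mullo_pi16(a, b);        /* lower 16 bits of the four products  */
        __m64 hi  = _mm_mulhi_pi16(a, b);        /* upper 16 bits of the four products  */
        __m64 p01 = _mm_unpacklo_pi16(lo, hi);   /* merge: full 32-bit products 0 and 1 */
        __m64 p23 = _mm_unpackhi_pi16(lo, hi);   /* merge: full 32-bit products 2 and 3 */
        p01 = _mm_srai_pi32(p01, frac);          /* shift: select the result window     */
        p23 = _mm_srai_pi32(p23, frac);
        return _mm_packs_pi32(p01, p23);         /* pack with signed saturation         */
    }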
A more general multiply&add instruction (calculating $a_i \times b_i + c_i$) can accelerate operations IV and V. Hardware support of saturation is important for implementing correct SIMD-parallel integer additions. Also rounding is advantageous to increase the precision during iterative neural network training, but it is only supported in certain AltiVec and 3DNow! instructions. On all SIMD units, further instructions are available for reordering the elements in a register. The pack instruction is required for selecting 16-bit windows out of one or two registers that contain several 32-bit products or sums (compare the multiplication example in Fig. 1). Only Sun's VIS allows the selection of 16-bit windows at arbitrary positions (however without rounding). All instruction sets offer a merge instruction (also called unpack) that mixes the elements of two registers into a new register according to a fixed scheme. For MMX it is mandatory for concatenating the upper and the lower halves of the products (compare Fig. 1b). 3DNow! and SSE provide generalized shuffle instructions and only AltiVec allows arbitrary permutations. Replication of scalar operands is required in operations III, IV and V and must be implemented either by an appropriate sequence of merge instructions or by a single permutation instruction.
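Such a merge sequence for replicating a scalar operand can, for instance, be expressed on MMX as follows (an illustrative sketch only; the function name is ours and the value is assumed to arrive in an ordinary integer register):

    #include <mmintrin.h>

    /* Broadcast one 16-bit value into all four lanes of an MMX register. */
    static __m64 broadcast16(short s)
    {
        __m64 x = _mm_cvtsi32_si64((int)(unsigned short)s);  /* | 0 | 0 | 0 | s | */
        x = _mm_unpacklo_pi16(x, x);                          /* | 0 | 0 | s | s | */
        return _mm_unpacklo_pi32(x, x);                       /* | s | s | s | s | */
    }

On AltiVec the same effect is obtained with a single permutation (splat) instruction, as noted above.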
4
Performance analysis
All neural operations I to V described in Section 2 were implemented on the five selected architectures. Either 32-bit floating point (SSE, 3DNow!), 16-bit fixed point (mapped to 16-bit integer for MMX, VIS) or both (AltiVec) were used as parallel data types for all neural network variables. For Intel and AMD processors, the routines were encoded in assembly language. A C language interface was available only for the SIMD extensions of Sun and Motorola. For reference, all operations were implemented also in C using the float data type. The Gnu C compiler and the Gnu assembler were used for Intel and AMD processors, the Sun Workshop 5.0 C Compiler for Sun and the Metrowerks CodeWarrior for Motorola's PowerPC. Compiler optimizations have been switched on. Special care was taken for the correct alignment of all data elements because the penalty for misaligned data is high for most SIMD units. As hardware platforms, PCs with either 500 MHz Pentium III or Athlon, Sun workstations with 400 MHz Ultra II and a Macintosh containing a 500 MHz G4 PowerPC processor were used. The execution time of all five neural operations was measured on the SIMD units and on the floating point units of all processors. Then the speedup of the SIMD code compared to the reference float code was calculated for each operation. Fig. 2 illustrates some results and shows additionally the parallelism degree p of all five SIMD units, which represents a kind of theoretical speedup.

Figure 2. Measured speedup of the neural operations I to V on five SIMD units for small matrices (16 x 104, left) and large matrices (128 x 832, right)

It can be seen that for certain operations a high speedup can be obtained that exceeds the theoretical speedup. On MMX the computation of operations I, II and IV is 5 to 10 times faster than the reference float implementation. Also for AMD's 3DNow!, which offers only a parallelism degree of p = 2, the
measured speedup for many operations was surprisingly high (up to 5.6 times faster than the float implementation for small matrices). This anomaly results on the one hand from several powerful SIMD instructions (such as sum of products or vector reduction) that replace more than p scalar instructions. On the other hand some SIMD instructions (especially SIMD multiplications) provide shorter latencies than their corresponding scalar counterparts. For some operations (especially III and V) the SIMD performance remains below the theoretical speedup on certain SIMD units. This effect is due either to the high overhead for the required data reordering (especially replication of scalar operands) or to the lack of appropriate instructions. To study not only the SIMD performance of single operations but also of a complete neural network, a Radial Basis Function (RBF) network was implemented on all SIMD units and processor cores. The k-m-n RBF network represents a typical artificial neural network model that can be used for many approximation and classification tasks. It consists of k input nodes, m RBF neurons with radially symmetric Gaussian output functions in the first layer, n simple linear neurons in the second layer and is trained by gradient descent. All neural operations I to V described in Section 2 are contained in the underlying algorithm. The recognition time and the training time for the presentation of a single input vector to RBF networks of three different sizes were measured and the speedup of the SIMD-parallel code relative to the reference float code was calculated (see Table 3). Recognition with RBF networks is mainly based on operations I and II that can be implemented very efficiently on most SIMD units (see above). So it is no wonder that the speedup for recognition is higher (up to 8.6) than the speedup for the RBF training phase (up to 6.6) that requires all five neural operations. When the network size is varied, two contrary effects can be observed. On the one hand the speedup for the integer implementation (e.g. on MMX or AltiVec) can be increased if the network is enlarged. This effect results from the relation between the size of the weight matrices and the size of the L1 or L2 caches. Whereas larger 16-bit integer matrices still fit in the caches, this is not always the case for larger reference 32-bit float matrices.
Table 3. Measured speedup for recognition/training with RBF networks of different sizes

    size of RBF      MMX        SSE        3DNow!     AltiVec (Motorola)       VIS
    network          (Intel)    (Intel)    (AMD)      (fixed)      (float)     (Sun)
    16-104-16        6.7 / 4.1  4.1 / 3.8  5.7 / 4.1  7.1 / 5.6    3.3 / 3.2   2.0 / 2.0
    64-416-64        6.9 / 3.8  3.8 / 2.6  4.3 / 2.9  7.8 / 5.9    3.3 / 3.1   2.2 / 2.2
    128-832-128      8.1 / 5.1  3.4 / 2.2  2.7 / 2.1  8.6 / 6.6    3.6 / 3.3   2.3 / 2.3
On the other hand the floating point implementations on SSE and 3DNow! show the reverse effect. Here for larger matrices the fast SIMD units can no longer be supplied with 32-bit data elements at sufficiently high speed.

5
Conclusions
The instruction sets of the analyzed SIMD units are not optimal for the implementation of neural networks. Several instructions are missing for an efficient implementation of all important neural operations, e.g.: vector reduction and a general multiply&add for all data types, fast replication of a specific element within a SIMD register or special load/store instructions for the handling of vector and matrix sizes that are not a multiple of the parallelism degree. Furthermore, for integer SIMD units a special 16-bit x 16-bit -> 16-bit parallel multiply instruction that supports an arbitrary selection of saturated and rounded result windows out of 32-bit products would be advantageous. Nevertheless, the overall SIMD performance turned out to be fairly good. A speedup of up to 9.8 for single neural operations and a total speedup of up to 6.6 for neural network training could be achieved. For very large neural networks the simulation can be accelerated further by combining cluster or SMP computing with local SIMD computing in each processor node.

References

1. N.B. Serbedzija. Simulating Artificial Neural Networks on Parallel Architectures. Computer, 29(3):56-63, 1996.
2. O. Hammami. Neural network classifiers execution on superscalar microprocessors. In Proc. of ISHPC '99, pages 41-54. Springer, LNCS 1615, 1999.
3. J. Holt and J. Hwang. Finite precision error analysis of neural networks hardware implementations. IEEE Trans. on Computers, 42:281-290, 1993.
4. L. Gaborit, B. Granado, and P. Garda. Evaluating micro-processors' multimedia extensions for the real time simulation of RBF networks. In Proc. of MicroNeuro '99, pages 217-221. IEEE, 1999.
5. A. Peleg, S. Wilkie, and U. Weiser. Intel MMX for multimedia PCs. Communications of the ACM, 40(1):24-38, 1997.
6. M. Tremblay, J-M. O'Connor, V. Narayanan, and L. He. VIS speeds new media processing. IEEE Micro, 16(4):10-20, 1996.
7. S.T. Thakkar and T. Huff. Internet streaming SIMD extensions. Computer, 32(12):26-34, December 1999.
8. Stuart Oberman, Greg Favor, and Fred Weber. AMD 3DNow! technology: Architecture and implementations. IEEE Micro, 19(2):37-48, 1999.
9. K. Diefendorff, P.K. Dubey, R. Hochsprung, and H. Scales. AltiVec extension to PowerPC accelerates media processing. IEEE Micro, 20(2):85-95, 2000.
EXTENDING THE APPLICABILITY OF SOFTWARE DSM BY ADDING USER REDEFINABLE MEMORY SEMANTICS

B. VINTER, O. J. ANSHUS, T. LARSEN, J. M. BJØRNDALEN

Department of Mathematics and Computer Science, University of Southern Denmark
Department of Computer Science, University of Tromsø

This work seeks to extend the applicability of distributed shared memory (DSM) systems by suggesting and testing an extended functionality, "User Redefinable Memory Semantics" (URMS), that may be added to most DSM systems. We show how URMS is included with one DSM system, PastSet, and sketch the applicability and ease of use of URMS for a set of current applications. Finally, a microbenchmark of URMS on PastSet shows performance gains of 79 times over a naive DSM model, 48 times over a tuple based DSM model and 83% over MPI.
1
Introduction
Software distributed shared memory (SW-DSM) systems have existed for about 15 years, but have yet to achieve widespread application by programmers outside of the DSM research community. One reason contributing to this may be that DSM systems are usually demonstrated and tested only for a limited set of high-performance computing test applications; typically a subset of the SPLASH [1] benchmark suite. These test applications provide a thorough test of specific performance limits of DSM systems, but fail to demonstrate the applicability of DSM outside high-performance computing. This work proposes an extended DSM functionality, User Redefinable Memory Semantics (URMS), to simplify the use of DSM by programmers. We demonstrate how URMS is added to one SW-DSM system, PastSet, as well as the performance impact of URMS on PastSet, and show how URMS may be added to other types of SW-DSM.

2
User Redefinable Memory Semantics
To address some of the performance problems for DSM systems, we propose the concept of "User Redefinable Memory Semantics" (URMS). The principle behind URMS is to offer users the opportunity to redefine the semantics of any or all memory operations for memory areas that are specified by the user. The redefined semantics are specified by providing code that should be executed
instead of the memory operation. The specification of the redefinition may also include initialization code that is applied once to initialize the specified memory area. In effect, the redefined memory operation semantics will be applied for memory operations on the specified areas only. For example, a memory location may be redefined to accumulate a global sum of partial sums that are produced by independent processes. For that location, the STORE operation would be redefined according to code that stores the aggregate sum and keeps track of a completion criterion that may be realized as an access count, process list, or by other means. The LOAD operation is redefined to return the aggregate sum only after the reduction has completed, in effect blocking until the termination criterion used for the STORE operation is satisfied. This approach makes it very simple for programmers to overlap communication and calculation if there is any work that may be done between the time the partial sum is ready and the time the global sum is needed. For the variations of this example that are to follow, we will consider only the limited case where a given number of contributions defines the completion of an operation. Basically URMS may provide a way to introduce well known message-passing techniques to DSM systems, while preserving the illusion that the programmer is using a shared memory computer. We hope that URMS based systems will be simpler to program and at the same time provide superior performance because the URMS memory cells may be handled in an optimized way by the runtime environment.
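To make the global-sum example concrete, the following C sketch shows what such a redefined memory cell could look like inside a runtime system. It is only an illustration of the semantics described above; the names, the pthread-based realization and the fixed contribution count as completion criterion are our own assumptions:

    #include <pthread.h>

    /* A URMS-style "global sum" cell: the redefined STORE adds a partial sum,
       the redefined LOAD blocks until all expected contributions have arrived. */
    typedef struct {
        double          sum;
        int             arrived, expected;
        pthread_mutex_t lock;
        pthread_cond_t  done;
    } urms_sum_cell;

    void urms_init(urms_sum_cell *c, int workers)      /* one-time initialization */
    {
        c->sum = 0.0; c->arrived = 0; c->expected = workers;
        pthread_mutex_init(&c->lock, NULL);
        pthread_cond_init(&c->done, NULL);
    }

    void urms_store(urms_sum_cell *c, double partial)  /* redefined STORE */
    {
        pthread_mutex_lock(&c->lock);
        c->sum += partial;
        if (++c->arrived == c->expected)
            pthread_cond_broadcast(&c->done);          /* completion criterion met */
        pthread_mutex_unlock(&c->lock);
    }

    double urms_load(urms_sum_cell *c)                 /* redefined LOAD */
    {
        pthread_mutex_lock(&c->lock);
        while (c->arrived < c->expected)               /* block until the sum is ready */
            pthread_cond_wait(&c->done, &c->lock);
        double result = c->sum;
        pthread_mutex_unlock(&c->lock);
        return result;
    }

In a real DSM runtime these two functions would be invoked transparently whenever the application stores to or loads from the redefined memory area.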
2.1
Adding URMS functionality to DSM systems
In this section we sketch how URMS functionality may be added to different variations of distributed shared memory: Region Based DSM, Shared Virtual Memory based systems, Object based DSM, and Structured DSM systems.

Region Based DSM systems provide differing memory models, e.g. whether a region constitutes a single variable, a fixed size block, or a variable size block. Further, the transparency of the systems varies from full transparency to no transparency. We will illustrate how URMS may be implemented for a variable region size system with little transparency, such as CRL [2]. C Region Library, CRL, is based on program information to maintain the distributed shared memory. Programmers explicitly associate a region with a block of memory, and state when this memory is used with the instructions rgn_start_read, rgn_end_read, rgn_start_write and rgn_end_write. Before regions can be addressed they must be created, and it is straightforward to add URMS functionality at this stage. Instead of a rgn_create(size) a
programmer could use a rgn_create(size, operation, parameters), e.g. rgn_create(sizeof(double), REDUCE_SUM, WORKERS) would create a new region which is a global sum from WORKERS processes. When a process issues a rgn_start_write on the created region, the value that is written is automatically added to a global sum, and a rgn_start_read will simply block until all WORKERS processes have issued their rgn_end_write on the region. Further, the global sum can be distributed to all participants before they issue rgn_start_read operations, which will remove the overhead of fetching the data.

Shared Virtual Memory (SVM) systems, such as Shrimp-SVM [3], base the DSM system on the page size of the native architecture, and use the paging mechanism to trap memory operations that cannot be serviced, e.g. a read to an address that is not present in local memory, or a write to a memory cell that is either replicated on other nodes or not present. The handling of URMS may be easily added to the exception handling of SVM page faults. For a global reduction operation, a SVM system can incorporate a specific data type for reduction variables. STORE to variables of this type causes exceptions that the runtime system may capture and add to the reduction. Loads from reduction variables are stalled if the reduction is not yet finished; otherwise the result is available and the application may continue un-delayed. If the SVM system supports shared writes by multiple processes, the URMS functionality may easily be added to the write assembly process.

Object Based DSM, like Orca [4], is perhaps the least obvious candidate for URMS, since similar functionality may be achieved at user level, via a carefully considered library. Adding URMS to the runtime system is extremely easy however, and may result in improved performance compared to a user level implementation. Since most object based DSM systems are closely integrated with the programming language, it is straightforward to add keywords to identify URMS data: e.g. double sum = new GLOBAL_SUM(WORKERS) will associate the variable sum with a global sum reduction among WORKERS. Rather than maintaining coherent replicas of sum, writes to sum are identified at compile time and converted into a global reduction participation call. During execution, reads of sum may be blocked until the reduction has completed.

Structured DSM covers a set of slightly similar DSM models that model memory in some structured way, such as Linda [5]. Linda models memory as an associatively addressed tuple space. Processes can add, read or retract process tuples and data tuples from tuple space. Adding URMS to a tuple space model is simple and probably represents the system with least intrusion to the given DSM model. A generic tuple is made up of a flag followed by a
set of data entities, e.g. ("P", double). The model can easily be extended by a set of flags which relate to memory semantics. In this manner the global reduction can be handled by issuing out("GLOBAL_SUM", "group-flag", int WORKERS, double data), which is captured by the runtime system. Only when WORKERS tuples matching the group-flag have been placed in tuple space will the runtime system make available WORKERS tuples of the template ("group-flag", double data), where the data is the resulting sum.

3
Applications
Identifying useful URMS functionalities will be an application driven investigation; we have identified several potentially useful functions. One class can be taken directly from the reductions that are supported by MPI, such as global-sum, -product, -min and -max. The appearance of very high performance WANs has spawned a large interest in meta-computing, i.e. the idea of a large computational grid [6]. A common feature in the grid-middleware projects is the ability to read a file from a remote server. However, scientific datasets are often very large, from multiple terabytes to petabytes of data. Most often the user does not wish, nor possess the computing power, to process all the data. Instead, the application works only on data that fits a predefined criterion, so with the standard grid model the complete dataset is transferred over a WAN only to be filtered locally. If we replace the standard grid-model with DSM, the natural solution is to spawn a thread that runs on the data-server, which then filters the data at the source and only transfers the specific data the user is actually going to use. However, while any meta-computing middleware requires the servers to place some trust in verified clients, the full flexibility that comes with allowing remote clients to start threads is unlikely to be accepted by those that hold the servers. Alternatively a DSM system which supports URMS may provide a solution that is just as simple to use for the programmer while preserving the control at the server side. Data may be mapped in the user's memory as available, and then accessed using the underlying DSM system. The filtering of data may be done by a URMS function that processes a logic-based query, which the user specifies. This way the server maintains full control over both the instructions that are executed and the compute-time a user may consume. In general URMS may prove to be a convenient abstraction to eliminate explicit communication in applications that use streams of data, such as streaming video, net-chat applications etc. URMS can eliminate stream operations by basically replacing sockets with URMS memory cells, thus reading from a stream becomes reading from an address and vice versa for writing. In
many ways this translates into the difference between accessing port-mapped and memory-mapped devices in a computer system; anybody who has written operating-system code knows that memory-mapped devices are far easier to program than port-mapped devices. This approach could ease programming in many distributed fields and, given a sensible naming convention within the DSM system, could hide most of the distribution aspects altogether. Servers for mobile agents can provide a single address to which incoming agents are copied for migration, which reduces the migration problem to a memory move. URMS may also allow large server applications, such as web servers or database servers, to execute on DSM based clusters, by modeling request queues as URMS memory cells, e.g. adding a request to the queue is done by writing the request to an address. The URMS functionality can do simple load-balancing, or it may keep a history of the requests passed to the individual nodes and attempt to achieve a better utilization of cached data.

4
Adding URMS to a DSM system
To demonstrate our idea we have added URMS to our own structured DSM system, PastSet [7]. PastSet is similar to Linda in its perception of memory, but has a distinctly different execution model. Tuples can be stored, read, and manipulated in PastSet memory. Tuples are generated dynamically based on tuple templates that may also be generated dynamically. The collection of all tuples in PastSet based on a unique template is denoted an element of PastSet. Elements are unique, each corresponding to a unique template. A move operator (Mv) writes tuples to PastSet memory and an observe operation (Ob) specifies a template and reads a matching tuple. URMS functionality has been added to PastSet; we call the URMS framework X-functions, which are pairs of functions, one associated with writing and one with reading. In this way tuples may be manipulated as they are stored and read. In the current implementation only one X-function pair can be associated with each element at any time. We have implemented X-functions that do generic pattern matching as well as global operations as used in MPI [8]. A global reduction works as follows: the first time the element is accessed, the desired global reduction, e.g. sum, and the number of participants in the reduction are specified. When the reduction takes place all processes will simply write (Mv) their tuple to the element and read (Ob) the resulting sum; the reads automatically block until all participants have written their contribution to the global sum. We are experimenting with more advanced X-functions, including an X-function that compresses bitmaps on write and decompresses on read.

Figure 1. Global reduction using std. memory model, tuples and URMS.

4.1
Micro benchmark
To illustrate that one may achieve a performance improvement of URMS over conventional DSM solutions we show the time taken for a global reduction of an integer for a set of DSM approaches. To further emphasize scalability we show the progressive performance on one through 32 participants. Our test bench is a cluster of eight four-way Pentium Pro nodes interconnected by a 100 Mb/sec switched FastEther network. We choose a global reduction over a full application, even though the usability of global reductions is limited to dusty-deck type applications, which is not necessarily the most likely use of URMS. However the global reduction is simple, widely known and allows us to perform simple performance comparisons. The global reduction is implemented in three versions: one that treats memory in a conventional way, one that utilizes the tuple nature of PastSet and finally one that uses a URMS implementation of global reductions; all three reduction codes are listed in Figure 1. The first implementation models a standard memory. Because PastSet is built on blocking tuple operations there are no other synchronization mechanisms. Instead memory operations are used as replacements for semaphores and signals. Thus in the code a tuple called semaphore is used for achieving mutual exclusion, and the signal to show that the result is ready is replaced by a blocking read of the result. The second version is the natural way to perform a global reduction with a tuple based memory model. The reduction is performed with a tuple containing the partial sum and a counter. This tuple migrates between all processes; the last process to add its partial sum writes a global result, which all others read. The final version is based on a URMS function, which performs the reduction. The URMS function simply treats writes to this tuple as partial sums and reads as reading the global sum; thus all reads are blocked until the global sum is ready. The three versions have been tested on one through 32 CPUs; measurements are taken as an average over 1000 reductions and each experiment is repeated five times for each number of processes. The resulting numbers are quite stable, with a standard deviation of less than 1% at 32 processors. Figure 2 shows the latency of a global reduction. In addition to the three memory versions we have added the time used by LAM-MPI to perform MPI_Allreduce.

Figure 2. Global reduction performance using the algorithms in Figure 1.

The improvement that is gained with the use of URMS is significant. With 32 processors the URMS solution is 79 times faster than the standard memory model and 48 times faster than the more natural tuple version. The MPI version is 83% slower than the DSM version using URMS; since the global reduction is a generic MPI operation this is a significant proof-of-concept for URMS.
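Expressed as C-like pseudo-code, the URMS version described above has roughly the following shape; the names pastset_element, mv, ob and X_GLOBAL_SUM are our own illustrative bindings for the Mv/Ob operators and X-functions, not the actual PastSet interface:

    /* Illustrative pseudo-code only: all API names below are assumed. */
    double urms_reduce(double partial_sum, int workers)
    {
        element_t *e = pastset_element("gsum", X_GLOBAL_SUM, workers); /* bind X-function pair */
        double global_sum;
        mv(e, &partial_sum);   /* write (Mv): the X-function accumulates the partial sum  */
        ob(e, &global_sum);    /* read (Ob): blocks until all contributions have arrived  */
        return global_sum;
    }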
5
Conclusions
In this work we have introduced the concept of User Redefinable Memory Semantics, URMS. We believe that URMS is easy to use for programmers, and
solves efficiently some operations that have been documented to be costly to perform. We have shown the benefit of adding URMS to a structured DSM system, PastSet, and have argued why other DSM models may have even greater advantages from URMS. We are working to identify which functionalities are useful in URMS systems. We wish to test the URMS concept on other types of DSM, particularly region based and page based systems, in order to test potential performance benefits with other DSM models than the structured DSM. Using a micro-benchmark that performs a global reduction of an integer value, we have shown that there may be as big a performance advantage of URMS over conventional memory as 79 times, and 48 times over the natural PastSet approach. The fact that a DSM system using URMS is 83% faster than LAM-MPI on a global reduction micro-benchmark indicates that future work on URMS could be very rewarding.

References

1. Jaswinder Pal Singh, Wolf-Dietrich Weber, and Anoop Gupta. SPLASH: Stanford parallel applications for shared-memory. Computer Architecture News, 20(1):2-12, March 1992.
2. Kirk L. Johnson, M. Frans Kaashoek, and Deborah A. Wallach. CRL: High-performance all-software distributed shared memory. In Proceedings of the 15th SOSP, pages 213-228, December 1995.
3. R. Samanta, A. Bilas, L. Iftode, and J. P. Singh. Home-based SVM protocols for SMP clusters: Design and performance. In Proc. of the 4th IEEE Symp. on High-Performance Computer Architecture, February 1998.
4. H. E. Bal and A. S. Tanenbaum. Orca: A language for distributed object-based programming. SIGPLAN Notices, 25(5):17-24, May 1990.
5. N. Carriero and D. Gelernter. Linda in context. Commun. ACM, 32(4):444-458, April 1989.
6. Ian Foster and Carl Kesselman. Computational grids. In Ian Foster and Carl Kesselman, editors, The Grid: Blueprint for a New Computing Infrastructure, pages 15-51. Morgan Kaufmann, San Francisco, CA, 1999.
7. Brian Vinter, Otto J. Anshus, and Tore Larsen. PastSet - a distributed structured shared memory system. In Proc. of High Performance Computers and Networking, Amsterdam, April 1999.
8. David W. Walker. The design of a standard message passing interface for distributed memory concurrent computers. Parallel Computing, 20(4):657-673, March 1994.
INDUSTRIAL PERSPECTIVE
PORTING AND OPTIMIZING CRAY T3E APPLICATIONS TO COMPAQ ALPHASERVER SC SERIES

JOSEPH PARETI

1
Abstract
Several scientific applications that were parallelized for the CRAY T3E family of massively parallel computers make best use of the hardware and software architecture and achieve a high degree of parallel efficiency [1]. This results from a well balanced design among components (21164 processor, chip-set including stream buffers, network router, GigaRing Channel) that enables a process-to-process latency of 1 microsecond on shmem [2]. To provide a growth path to T3E users, COMPAQ has introduced the AlphaServer SC Series, and has based its supercomputing offering to the Department of Energy [3] and other scientific accounts on this technology. The SC building-blocks are AlphaServer SMP nodes; the interconnect, provided by Quadrics Supercomputer World, Ltd., is a multistage, fat tree, cross-bar switch with headroom for growth in terms of node counts, Non Uniform Memory Access SMP nodes, microprocessor design, adapter performance and number of rails per SMP node. The current implementation yields a process-to-process Direct Memory Access latency of 3 microseconds and a bidirectional bandwidth of 200 MB/sec, which are the result of integrating the fabric adapter in the kernel, thereby bypassing the operating system in message traffic. These numbers will scale up to 340 MB/sec using the 66MHz PCI version, and to 500 MB/sec using the 133MHz PCI-X version. CRAY applications leverage the following: (i) all variables are 64-bit, (ii) efficient math routines from SCILIB, (iii) stream buffers detect and pre-fetch down small-stride reference streams, (iv) E-registers provide latency hiding and non-unit-stride access capabilities, plus barrier and fetch-and-op synchronization support, (v) well balanced performance across data communication routines. Compaq has created a software development environment including a Porting Assistant tool, symbolic (parallel) debuggers, and profiling tools that complement the base functionality of the GEM optimizing compilers [4] and of the libraries for performing math tasks, message passing, etc. Finally, a T3E application was modified by increasing the amount of (redundant) calculations to reduce communications cost for best use of the current SC architecture implementation.
The T3E nodes contain an Alpha 21164 processor [5], system control chip, local memory, and network router. These are connected in a 3d-torus topology providing up to 480 MB/s application bandwidth. I/O is based on the GigaRing channel with sustainable bandwidth of 267 MB/s for every four processors. Although the 21164 is limited by the 8KB primary data and instruction cache, which is supported by a 96KB 3-way associative secondary cache for data and instructions, local memory bandwidth is enhanced by a set of hardware stream buffers. These replace the on-board cache. In addition, there are 512 user plus 128 system external registers (E-registers) that allow for a large number of outstanding requests for latency hiding and support non-unit stride access by gathering up single words that can be loaded stride-one across the microprocessor system bus.

To make best use of stream buffers, codes should maximize stream lengths and generate six or fewer concurrent reference streams. Various programming techniques can be employed, such as loop splitting to limit the number of streams, maximizing inner loop trip counts, etc. E-registers support get and put operations, and because there is a large number of E-registers the memory interface can be highly pipelined [2,6]. E-registers enhance performance when no locality is available through efficient support of non-unit stride access and on-chip cache bypass. E-register optimizations are activated by the programmer using either compiler directives or library calls. Cache bypass can return huge paybacks in such operations as matrix transpose, where data is copied from memory into E-registers and back to memory. E-registers allow up to several hundred memory references to be outstanding concurrently, thus providing excellent latency-hiding capabilities that obviate the 21164 limitations (only 2 outstanding references are allowed). Memory pipelining improves global memory bandwidth, as can be shown by increasing the number of E-registers that assist memory copy operations. The shared memory access library is a simple one-sided communication library available on the T3E that makes use of E-registers to deliver a network latency of 1 microsecond. Other message passing libraries however introduce additional overhead, e.g. for buffering and deadlock detection, and hence the latency is much larger.
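As a generic illustration of loop splitting (our own example, not taken from any of the codes discussed here), a loop that generates eight concurrent reference streams can be divided into two loops with four streams each:

    /* Loop splitting (fission) to limit the number of concurrent reference streams. */
    void fused(double *a, double *b, double *c, double *d,
               double *e, double *f, double *g, double *h, int n)
    {
        for (int i = 0; i < n; i++) {      /* eight concurrent streams: a..h */
            a[i] = b[i] + c[i] + d[i];
            e[i] = f[i] * g[i] * h[i];
        }
    }

    void split(double *a, double *b, double *c, double *d,
               double *e, double *f, double *g, double *h, int n)
    {
        for (int i = 0; i < n; i++)        /* first loop: four streams  */
            a[i] = b[i] + c[i] + d[i];
        for (int i = 0; i < n; i++)        /* second loop: four streams */
            e[i] = f[i] * g[i] * h[i];
    }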
2.1
T3E I/O Optimization
In cases where the Processor Element 0 (PE0) does the I/O operations, the I/O performance is not impaired because the message passing throughput is much larger than the disk transfer rate. In cases where the I/O PE's write to individual files the transfer rate is increased and disk striping can further improve the performance. For large files, parallel read/write operations by several PE's to the SAME file improve the performance (Global I/O, a layer of "Flexible File I/O").

2.2
UNICOS/mk scalability features
UNICOS/mk supports the scalability of the CRAY T3E system in that it is distributed among the PEs, not replicated on each. This distribution of operating system functions provides a global view of the computing environment - a single-system image - that allows administrators to manage a system-wide suite of resources as a single entity. UNICOS/mk is divided into a micro-kernel and "servers." Each set of kernel and servers is distributed among the processors of a CRAY T3E system. Local servers process operating system requests specific to each user PE. Global servers provide system-wide operating system capabilities, such as process management and file allocation. Scalability is enhanced through 64-bit addressing, which allows the operating system to support large data sets and file systems and thousands of users. UNICOS/mk includes automatic resource management utilities such as the Global Resource Manager (GRM), User Database, and Data Migration Facility. Administrators can monitor system utilization and activity. A visual installation tool helps them customize their systems. Further, the following is supported:
• Distribution of file management services: File server functions are distributed using local file server assistants to provide maximum efficiency and performance.
• Distio (I/O centrifuge): A single distio request from one PE can generate parallel I/O transfers that distribute data among some or all of the PEs used by a given program.
• Multiple global file servers: File system management is distributed among multiple system PEs, which allows full use of the parallel disk paths supported on CRAY T3E systems.
3
SC Overview
A hardware overview of the SC cluster is given in [7]. Below, some key software features for Single System Management (SSM) are reviewed.

3.1
Resource Management System (RMS)
The job management facility, or resource management system (RMS), allows the administrator to treat the total set of CPUs provided by all nodes as a single large pool, which can then later be subdivided for use by different classes of job. When a
user executes a job, RMS is responsible for selecting the set of nodes that best fits the job's requirements and for creating the requisite processes on each node. Partitions can be configured for one of three types of usage: interactive, parallel, and general purpose. Interactive partitions are used for general login and program development. Parallel partitions are exclusively intended for parallel non-interactive jobs. General-purpose partitions can be used for either interactive or parallel program execution. RMS also supports the time-sharing of a partition between different long-running jobs: the system deschedules the set of processes that constitutes one job and then schedules the set of processes for one of the other jobs that is to be run. This is known as time-sliced gang scheduling. RMS can be used for access control. The resources in each partition are managed by a Partition Manager which mediates user requests, checking access permissions and resource limits before scheduling the user's jobs. RMS accounts for resource usage on a per-job basis. Total resource usage is aggregated. RMS stores state information (such as resource usage) in a Structured Query Language (SQL) database. This database can be accessed through a SQL interface which is included as part of the Compaq(r) AlphaServer(r) SC System Software.
3.2
File Systems

3.2.1
Cluster File System
Within each file system domain of the AlphaServer SC system, the Cluster File System (CFS) ensures that all files, including the global configuration and administration files and system binaries contained in the root (/), /usr, and /var file systems, are visible to, and accessible by, all domain members. Each file system is served by no more than one member; other members access that file system as CFS clients. By default, the same pathname refers to the same file on all nodes. CFS preserves full X/Open and POSIX semantics for file system access and maintains cache coherency across AlphaServer SC domain members.

3.2.2
Parallel File System
The AlphaServer SC Parallel File System (PFS) delivers scalable bandwidth capability to a single job. It achieves this by striping the data of a single file system over multiple underlying file systems within a single CFS cluster domain. A single AlphaServer SC system can support multiple Parallel File Systems. For jobs executing on members of a single CFS cluster domain, the PFS allows single-file accesses from the different processes of the parallel job to be served concurrently by the underlying file servers (however, memory-mapped files are not supported). When an administrator creates a PFS, the constituent file systems and the default
stripe size are specified. For the programmer, PFS uses POSIX-compliant syntax and semantics. It supports additional file system commands (implemented as ioctls) that allow a job to interrogate the attributes of the file system (for example, determining which file system components are local to a process), to modify the stripe size and other attributes of new (zero-length) files, and so on. This enables a parallel job to optimize its input/output (I/O) with respect to the underlying PFS structure.

3.3
Device Request Dispatcher
Within each CFS cluster domain, the device request dispatcher (DRD) supports AlphaServer SC access to both character and block disk devices. All local and remote domain disk I/O passes through the DRD, which results in greater flexibility in hardware configurations. A node does not need to be directly attached to the bus on which a disk resides to access storage on that disk.
3.5
Cluster Alias
A cluster alias is an IP address that makes an AlphaServer SC CFS cluster domain look like a single node to other hosts on the network or in other domains of the AlphaServer SC system; it provides a single-system view to network clients. Cluster aliases free clients from having to connect to specific nodes for services. The alias feature permits services to be load-balanced over different nodes. Furthermore, if the member providing the service goes down, a client reconnects with the cluster alias, which elects another member to provide the service.

3.6
Fast File System Recovery
The Advanced File System (AdvFS) is a journaled local file system that provides higher availability and greater flexibility than traditional UNIX file systems. Using transaction journaling, AdvFS recovers file domains in seconds, rather than hours, after an unexpected restart such as a power failure. AdvFS journaling also provides increased file system integrity. In addition, a separately licensed product, the Advanced File System Utilities, can be used to perform management functions on line while file systems are active.
4
Porting
COMPAQ Porting Assistant [8] is a tool for porting to COMPAQ Tru64 UNIX; it flags incorrect code assumptions such as sizeof(int) == sizeof(pointer). It also helps to identify platform-specific assumptions, e.g. in function calls, and to verify makefiles. In addition, the following Fortran90 array syntax features have been investigated.

4.1
Fortran90 array passing, non-blocking MPI calls
Using Fortran90 array syntax, calls may pass a message buffer with an actual argument of deferred shape type, e.g.:

    array( scalar-index1:scalar-index2, scalar-index3, :, ... )

Fortran90 copies only the specified elements into a contiguous temporary array which is passed to the MPI routine. This is because a deferred shape array may be passed to a routine expecting an explicit array. A temporary array of appropriate size is allocated, the entire array is copied from the possibly non-contiguous deferred shape array to the temporary array, and then the address of the beginning of the temporary array is passed. On return from the MPI subroutine, the contents of the temporary array are copied back to the deferred-shape array, and execution continues. The effect of the argument copying for non-blocking MPI calls is that the temporary argument copies are likely to be de-allocated before they can be safely consumed by MPI, so buffer corruption may result when the data eventually becomes available. The work-around is to take the address of the message buffer arguments and pass that address by value to the MPI routine, for example:

    %val( %loc( array( scalar-index1:scalar-index2, scalar-index3, :, ... ) ) )

5
Debugging
On the SC you use TotalView (from Etnus, http://www.etnus.com/flashhome.html) within the RMS prun command interface as follows:

    % totalview prun -a

TotalView runs on the same node as prun; it starts a remote server process called the TotalView Debugger Server, tvdsvr, on each of the nodes used by the parallel program. The -a option is a TotalView option which specifies that the arguments which follow are for the program TotalView is running, e.g. in the example above the command line arguments for prun.
Debugging MPI parallel programs using TotalView can be done using basic, GUI-driven functionalities, e.g. to stop the processes by setting a common breakpoint and let the processes run until the program crashes, at which point TotalView will provide more insight as to where or why it failed. We found the following features particularly useful:
• Attaching to a running process, e.g. to detect deadlocks. A repeated attach hitting the same line of code is a strong indication that there is indeed a deadlock.
• Setting a breakpoint in the source code pane (this is entirely GUI-driven) and examining variable values, including array segments (one can specify a stride in Fortran90 syntax, e.g. FOO(1:500:10)) that are inspected on each process (the data arrays pane carries the same color code as the process).
• Examining the message queue, e.g. to detect the size of messages in transfer.
• Disabling the default behavior of setting the same breakpoint for all processes. This feature may allow a process to continue and, if e.g. a 'send' is posted, one can then go on and inspect the message queue.
• Address inspection, e.g. diving into a variable (use the right mouse button) and editing the variable type, e.g. rvor (real*8) --> integer*8 --> address.
6
Profiling
Detailed profiling is possible using the Compaq Continuous Profiling Infrastructure [9], and the recent enhancement ProfileMe, an instruction-centric profiler suitable for out-of-order processors (21264) [10]. They use on-chip counters (0 and 1 on the 21264) and allow resolution between cycles and cache misses. These tools are designed to profile large production systems with minimal overhead, from coarse grain (the entire running system including kernel and shared libraries) to fine grain (per application, per procedure). In addition, ProfileMe allows precise attribution of events to instructions and has been successfully employed for optimizing industry standard benchmarks. Event-based analysis is a necessity for understanding the (communications) performance of message passing programs. VAMPIR/VAMPIRTRACE (from Pallas GmbH, http://www.pallas.com) allows the programmer to focus on any phase of the program execution by means of a timeline display showing communication characteristics for processes such as message size, bandwidth, etc. The timeline display can provide great insights, as e.g. illustrated by a simple Finite Differences kernel involving domain boundary exchange. This was parallelized using point-to-point blocking MPI communications [11]. As the diagram below shows, the sends do not complete until the matching receives are issued on the destination process. Since one process (the "top" process) does not send to anyone in the first step, it can receive from the process below it, thus allowing that
process to receive from below it, etc. This produces a staircase pattern of sends and receives that strongly influences performance, as in this case the communication is entirely sequential.

      subroutine exchng1( a, nx, s, e, comm1d, nbrbottom, nbrtop )
      include 'mpif.h'
      integer nx, s, e
      double precision a(0:nx+1,s-1:e+1)
      integer comm1d, nbrbottom, nbrtop
      integer status(MPI_STATUS_SIZE), ierr
c
      call MPI_SEND( a(1,e), nx, MPI_DOUBLE_PRECISION, nbrtop, 0,
     $               comm1d, ierr )
      call MPI_RECV( a(1,s-1), nx, MPI_DOUBLE_PRECISION, nbrbottom, 0,
     $               comm1d, status, ierr )
The program lapw1 [1] is a communications intensive code originally written for the T3E using shmem. From profiling the code on an AlphaServer SC it appeared that, due to the different CPU speed ratio compared to the T3E, the code did not fully exploit the AlphaServer performance when using version 1 of the SC software. The focus for code improvement was on the MPI version.

7.1
Problem Identification
The primary bottleneck to scaling was a routine called HNS. This distributes a data set of ~118KB over all tasks, performs calculations, and then sends the resulting data to all tasks via the MPI_ALLGATHERV routine. It is expected that the time for ALLGATHERV should increase as the number of participating tasks increases. But the time increase was much larger than the reduction in computation time due to a smaller workload. Thus the total elapsed time for the routine increased and the code was not scaling. This was supported by running a special ALLGATHERV benchmark, which showed a very large latency for ALLGATHERV at large numbers of CPU's. So the problem was the routine and not the code. Given that the current version of the software could not obviate this problem, the only viable alternative was to reduce the number of tasks calling ALLGATHERV.

7.2
Implementation
The most flexible yet powerful method of addressing the ALLGATHERV bottleneck was simply to reduce the number of CPU's participating. Thus the number of tasks calling ALLGATHERV was reduced by a factor of 4 and the results were then broadcast to the remaining tasks. The argument behind choosing a factor of 4 was that I would communicate between boxes (and there are 4 CPU's in an ES40) and replicate calculations within a box. Ideally this factor would be a free parameter that could be modified instead of hard-wired into the code. A detailed description of the new communicators follows:

new_comm: This is the communicator used in the ALLGATHERV calls. It is defined on all tasks and includes task IDs 0, 4, 8, ..., so only every fourth task participates.

b_comm(4): This is the communicator used in the BCAST calls. A different communicator is defined for each subgroup of tasks. Tasks 0-3 have the same communicator, tasks 4-7 have the same, etc. This way the broadcast will only occur between the four tasks (preferably within a node, but the modified code doesn't check for that; it assumes tasks 0-3 will be in the same node, although they might not be).

So the pseudo-code looks like this:

    if (mod(pid,4).eq.0) then
       CALL MPI_ALLGATHERV( array, ..., new_comm, ... )
    endif
    call MPI_BCAST( array, ..., b_comm, ... )

In order to create the new communicators, new MPI groups had to be defined. But they do not affect the MPI routines; it is simply bookkeeping necessary for the communicators.
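One way to construct such communicators is sketched below in C, using MPI_Comm_split rather than the explicit group construction mentioned above; the function and variable names are our own:

    #include <mpi.h>

    /* Sketch only: build the two communicators described above with MPI_Comm_split.
       b_comm groups four consecutive ranks; new_comm contains every fourth rank. */
    void make_comms(MPI_Comm world, MPI_Comm *new_comm, MPI_Comm *b_comm)
    {
        int rank;
        MPI_Comm_rank(world, &rank);

        /* One broadcast communicator per group of four ranks (0-3, 4-7, ...). */
        MPI_Comm_split(world, rank / 4, rank, b_comm);

        /* Only ranks 0, 4, 8, ... join the ALLGATHERV communicator; the other
           ranks pass MPI_UNDEFINED and receive MPI_COMM_NULL. */
        MPI_Comm_split(world, (rank % 4 == 0) ? 0 : MPI_UNDEFINED, rank, new_comm);
    }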
7.3
Results
Here is a table of HNS times (in seconds) with original and modified source (all other timings are the same):

    #CPU    orig    modified
      64     280      196
     128     359      156
     256     864      127
     512    3180      139
So the modified version shows significant improvement over the original, but is still not scaling well. It is useful to also look at the proportion of computation to communication time within HNS:

    #CPU    comm    comp
      64      54     142
     128      67      89
     256      81      46
     512     111      28
It is promising that the communication time is not doubling with the doubling of CPU count. An important additional data point was provided by timing the code with all parallelization removed and the entire set of calculations replicated. This time was 695 seconds, which would be constant regardless of the number of processors. So it is clear that at all CPU counts parallelization results in improvement. But parallelizing to the fine granularity of all tasks participating in ALLGATHERV is detrimental to performance. The goal is to find the right balance between computation and communication for the available system.
8
References
1. Pichlmeier and R. Dohmen, "Parallelization of the full-potential plane wave code WIEN 97", http://www.cineca.it/mpp-workshop/abstract/rdohmen.htm
2. "Performance of the Cray T3E Multiprocessor", http://www.tera.com/products/systems/cravt3e/paper.html
3. Kaufmann, R., "The Q Supercomputer at Compaq", http://www.compaq.com/hpc/news/news_hpc_60171.html
4. Noyce, W. et al., "The GEM Optimizing Compiler System", Digital Technical Journal, vol. 4, 1992.
5. Edmondson, J. et al., "Internal Organization of the Alpha 21164, a 300MHz 64-bit Quad-issue CMOS RISC Microprocessor", Digital Technical Journal, Vol. 7, No. 1, 1995.
6. Scott, S.L., "Synchronization and Communication in the T3E Multiprocessor", Proc. Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 26-36, October 1996.
7. Pareti, J. et al., "COMPAQ and QSW Scalable Scientific Computing", Proceedings of the International Conference ParCo99, pages 749-762, Imperial College Press.
8. COMPAQ Porting Assistant, http://www.tru64unix.compaq.com/portingassistant/locate.html
9. DCPI, http://www.tru64unix.compaq.com/dcpi/
10. ProfileMe: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors. In Proc. of the 30th Symposium on Microarchitecture (Micro-30), December 1997.
11. Gropp, W., et al., "Using MPI", The MIT Press, 1994, pages 71-80.

9
Acknowledgements
Teresa Kaltz and Chuck W. Schneider, both of the COMPAQ High Performance Computing Expertise Center, Marlboro, MA, have made the most significant contributions to the application porting and optimization described in this paper.
Author Index
Adolph, T. 362 Alessio, G. 83 Amar, A. 393 Anshus, O. J. 518 Artale,V. 222 Attig,N. 43 Bange, M. 510 Basta, S. 265 Benassi, A. 205 Benner, P. 274 Bergamaschi, L. 282 Bini, S. 83 Bj0rndalen, J. M. 518 Blochinger, W. 50 Bogdan, C. 165 Bonisch, T. 58 Bouabdallah, A. 382 Boulet, P. 393 Bretschneider, T. 401 Briesen, M. 66 Bruzzo, C. 173 Butler, R. 141 Cabaleiro, J. C. 457 Cardinale, Y. 290 Carletti, G. 409 Catarinucci, L. 75 Cavazzoni, C. 125 Celino,M. 181 Chiatti, G. 133 Christiaens, M. 417 Clematis, A. 83 Cleri, F. 181 Colajanni, M. 441 Coppola, M. 409 Crocchianti, S. 91 Czerwinska, J. 101
D'Acierno, A. 109 D'Ambra, P. 441 Danelutto, M. 425 Dattilo, G. 117 De Bosschere, K. 417 De Martino, M. 83 DeVita,A. 181 Dekeyser, J. 393 Di Santo, M. 433 Dongarra, J. J. 3 Edelvik,F. 213 ElBaz,D. 298 ElkiheLM. 298 Erbacci, G. 125 Esposti Ongaro, T. 125 Feltri, S. 83 Fernandez, J. C. 306 Filippone, S. 441 Frattolillo, F. 433 Fusco, L. 31 Geisler, S. 401 Gheorchiu, D. 490 Giangiacomo, P. 133 Giorgis, V. 173 Gorman, G. 141 Grandinetti, L. 354 Guarracino, M. R. 449 Gubareni, N. M. 149 Gupta, A. 314 Hafher, H. 362 Halada, L. 157 Heras,D. B. 457 Hernandez, E. 290 Hernandez, V. 274,306 Hey,T. 33 Hluchy, L. 157 541
Hoekstra,R. 165 Hutchinson, S. 165 Jeannot, E. 465 Kao, O. 322,401 Karypis, G. 34 Keiter, E. 165 Komzsik, L. 173 Kuchlin, W. 50 Kumar, V. 34 Laccetti, G. 449 Lagana, A. 91 Landini, L. 205 Lanucara, P. 222 Larsen, T. 518 Letardi, S. 181 Lippert, Th. 43,370 Macedonio, G. 125 Machi,A. 498 Marongiu, A. 189 Martin, C. 473 Mayer, S. 173 Mayo, R. 274 Mazzocca, N. 481 Messina, P. 40 Michelassi, V. 133 Nagel, W. E. 101 Neff,H. 43 Negele, J. 43 Neri,A. 125 Nguyen, G. T. 157 Norcen,R. 330 0'Doherty,T. 141 Orlando, S. 197 Pacifici, L. 91 Palazzari, P. 75, 189 Pardines, I. 338 Pareti, J. 529 Pelzer,A. 346 Penalver, L. 306
Perego, R. 197 Petcu, D. 490 Piermarini, V. 91 Pietra, C. 205 Pini, G. 282 Poschmann, P. 173 Positano, V. 205 Quintana-Ortf, E. S. 274 Rak,M. 481 Rantakokko, J. 213 Richard, O. 473 Rivera, F.F. 338,457 Roche, K.J. 3 Romano, D. 449 Ronsse,M. 417 Rosato, V. 181, 189 Rotiroti, D. 354 Riihle, R. 58 Russo, T. 165 Russo, W. 433 Sadeghi, R. 173 Sannino, G. 222 Santarelli, M. F. 205 Sardisco, G. 498 Sartoretto, F. 282 Schells, R. 165 Schilling, K. 43,370 Schloegel, K. 34 Schmidt, B. 230 Schonauer, W. 362 Schroder, H. 230 Schroers, W. 370 Shearer, A. 141 Shi, W. 382 Shilnikov, E. 238 Shoomkov, M. A. 238 Silvestri, F. 197 Sinz, C. 50 Spezzano, G. 117
Srikanthan, T. 230 Srimani, P. K 382 Stengel, M. 181 Strey.A. 510 Talia,D. 265,382 Tarricone, L. 75 Thomas, S. 254 Tran,V. D. 157 Triki,C. 354 Uhl,A. 330 Vanderwalt, P. 173
Villano, U. 481 Vinter, B. 518 Voss, H. 346 Wakatani, A. 246 Waters, A. 165 Watts, H. 165 Weberpals,H. 254 Weimar, J. 66 Wilson, N. 141 Wix,S. 165 Zimeo, E. 433
Proceedings of the International Conference ParCo2001
PARALLEL COMPUTING
Advances and Current Issues

The near future will see the increased use of parallel computing technologies at all levels of mainstream computing. Computer hardware increasingly employs parallel techniques to improve computing power for the solution of large scale and computer intensive applications. Cluster and grid technologies make possible high speed computing facilities at vastly reduced costs. These developments can be expected to result in the extended use of all types of parallel computers in virtually all areas of human endeavour. Computer intensive problems in emerging areas such as financial modelling, data mining and multimedia systems, in addition to traditional application areas of parallel computing such as scientific computing and simulation, will lead to further progress. Parallel computing as a field of scientific research and development has already become one of the fundamental computing technologies. This book gives an overview of new developments in parallel computing at the start of the 21st century, as well as a perspective on future developments.